linux-kernel.vger.kernel.org archive mirror
* [patch 0/4] sched: Move work out of the scheduler core
@ 2011-06-22 17:52 Thomas Gleixner
  2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

Block-IO and workqueues have hooks in the scheduler core which do not
strictly need to be called from irq disabled, preempt disabled and
rq->lock held regions.

The following series moves them outside.
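
A minimal sketch of the resulting schedule() entry point, assembled from
the patches in this series (sched_submit_work() is added in patch 2/4,
sched_update_worker() in patch 4/4):

	asmlinkage void schedule(void)
	{
		struct task_struct *tsk = current;

		/* runs with interrupts and preemption enabled, no rq->lock held */
		sched_submit_work(tsk);		/* block plug flush + workqueue notifier */
		__schedule();			/* the scheduler core proper */
		sched_update_worker(tsk);	/* workqueue nr_running accounting */
	}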

Thanks,

	tglx
---

 block/blk-core.c         |   20 +++-----
 kernel/sched.c           |  106 +++++++++++++++++------------------------------
 kernel/workqueue.c       |   67 ++++++++++++-----------------
 kernel/workqueue_sched.h |    5 --
 4 files changed, 79 insertions(+), 119 deletions(-)




* [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
@ 2011-06-22 17:52 ` Thomas Gleixner
  2011-06-22 18:43   ` Christoph Hellwig
  2011-08-29 14:55   ` [tip:sched/urgent] " tip-bot for Thomas Gleixner
  2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

[-- Attachment #1: sched-split-out-regular-schedule.patch --]
[-- Type: text/plain, Size: 1929 bytes --]

Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.

To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntarily
goes to sleep.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4210,9 +4210,9 @@ pick_next_task(struct rq *rq)
 }
 
 /*
- * schedule() is the main scheduler function.
+ * __schedule() is the main scheduler function.
  */
-asmlinkage void __sched schedule(void)
+static void __sched __schedule(void)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
@@ -4300,6 +4300,11 @@ need_resched:
 	if (need_resched())
 		goto need_resched;
 }
+
+asmlinkage void schedule(void)
+{
+	__schedule();
+}
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
@@ -4373,7 +4378,7 @@ asmlinkage void __sched notrace preempt_
 
 	do {
 		add_preempt_count_notrace(PREEMPT_ACTIVE);
-		schedule();
+		__schedule();
 		sub_preempt_count_notrace(PREEMPT_ACTIVE);
 
 		/*
@@ -4401,7 +4406,7 @@ asmlinkage void __sched preempt_schedule
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		local_irq_enable();
-		schedule();
+		__schedule();
 		local_irq_disable();
 		sub_preempt_count(PREEMPT_ACTIVE);
 
@@ -5526,7 +5531,7 @@ static inline int should_resched(void)
 static void __cond_resched(void)
 {
 	add_preempt_count(PREEMPT_ACTIVE);
-	schedule();
+	__schedule();
 	sub_preempt_count(PREEMPT_ACTIVE);
 }
 




* [patch 3/4] block: Shorten interrupt disabled regions
  2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
  2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
@ 2011-06-22 17:52 ` Thomas Gleixner
  2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
  3 siblings, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

[-- Attachment #1: block-fixup-irq-disabled-region.patch --]
[-- Type: text/plain, Size: 3121 bytes --]

Moving the blk_schedule_flush_plug() call out of the interrupt/preempt
disabled region in the scheduler allows us to replace
local_irq_save/restore(flags) with local_irq_disable/enable() in
blk_flush_plug().

Instead of doing that, we disable interrupts explicitly when we lock
the request_queue and reenable them when we drop the lock. That
allows interrupts to be handled when the plug list contains requests
for more than one queue.

Aside from that, this change makes the scope of the irq disabled region
more obvious. The current code confused the hell out of me when
looking at:

 local_irq_save(flags);
   spin_lock(q->queue_lock);
   ...
   queue_unplugged(q...);
     scsi_request_fn();
       spin_unlock(q->queue_lock);
       spin_lock(shost->host_lock);
       spin_unlock_irq(shost->host_lock);

-------------------^^^ ????

       spin_lock_irq(q->queue_lock);
       spin_unlock(q->lock);
 local_irq_restore(flags);

Also add a comment to __blk_run_queue() documenting that
q->request_fn() can drop q->queue_lock and reenable interrupts, but
must return with q->queue_lock held and interrupts disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 block/blk-core.c |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -301,7 +301,11 @@ void __blk_run_queue(struct request_queu
 {
 	if (unlikely(blk_queue_stopped(q)))
 		return;
-
+	/*
+	 * q->request_fn() can drop q->queue_lock and reenable
+	 * interrupts, but must return with q->queue_lock held and
+	 * interrupts disabled.
+	 */
 	q->request_fn(q);
 }
 EXPORT_SYMBOL(__blk_run_queue);
@@ -2667,11 +2671,11 @@ static void queue_unplugged(struct reque
 	 * this lock).
 	 */
 	if (from_schedule) {
-		spin_unlock(q->queue_lock);
+		spin_unlock_irq(q->queue_lock);
 		blk_run_queue_async(q);
 	} else {
 		__blk_run_queue(q);
-		spin_unlock(q->queue_lock);
+		spin_unlock_irq(q->queue_lock);
 	}
 
 }
@@ -2697,7 +2701,6 @@ static void flush_plug_callbacks(struct 
 void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct request_queue *q;
-	unsigned long flags;
 	struct request *rq;
 	LIST_HEAD(list);
 	unsigned int depth;
@@ -2718,11 +2721,6 @@ void blk_flush_plug_list(struct blk_plug
 	q = NULL;
 	depth = 0;
 
-	/*
-	 * Save and disable interrupts here, to avoid doing it for every
-	 * queue lock we have to take.
-	 */
-	local_irq_save(flags);
 	while (!list_empty(&list)) {
 		rq = list_entry_rq(list.next);
 		list_del_init(&rq->queuelist);
@@ -2735,7 +2733,7 @@ void blk_flush_plug_list(struct blk_plug
 				queue_unplugged(q, depth, from_schedule);
 			q = rq->q;
 			depth = 0;
-			spin_lock(q->queue_lock);
+			spin_lock_irq(q->queue_lock);
 		}
 		/*
 		 * rq is already accounted, so use raw insert
@@ -2753,8 +2751,6 @@ void blk_flush_plug_list(struct blk_plug
 	 */
 	if (q)
 		queue_unplugged(q, depth, from_schedule);
-
-	local_irq_restore(flags);
 }
 
 void blk_finish_plug(struct blk_plug *plug)




* [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule()
  2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
  2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
  2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner
@ 2011-06-22 17:52 ` Thomas Gleixner
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
  3 siblings, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

[-- Attachment #1: sched-move-block-plug-out-of-scheduler-guts.patch --]
[-- Type: text/plain, Size: 1596 bytes --]

There is no real reason to run blk_schedule_flush_plug() with
interrupts and preemption disabled.

Move it into schedule() and call it when the task is going voluntarily
to sleep. There might be false positives when the task is woken
between that call and actually scheduling, but that's not really
different from being woken immediately after switching away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4253,16 +4253,6 @@ need_resched:
 				if (to_wakeup)
 					try_to_wake_up_local(to_wakeup);
 			}
-
-			/*
-			 * If we are going to sleep and we have plugged IO
-			 * queued, make sure to submit it to avoid deadlocks.
-			 */
-			if (blk_needs_flush_plug(prev)) {
-				raw_spin_unlock(&rq->lock);
-				blk_schedule_flush_plug(prev);
-				raw_spin_lock(&rq->lock);
-			}
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -4301,8 +4291,23 @@ need_resched:
 		goto need_resched;
 }
 
+static inline void sched_submit_work(struct task_struct *tsk)
+{
+	if (!tsk->state)
+		return;
+	/*
+	 * If we are going to sleep and we have plugged IO queued,
+	 * make sure to submit it to avoid deadlocks.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 asmlinkage void schedule(void)
 {
+	struct task_struct *tsk = current;
+
+	sched_submit_work(tsk);
 	__schedule();
 }
 EXPORT_SYMBOL(schedule);




* [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
                   ` (2 preceding siblings ...)
  2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner
@ 2011-06-22 17:52 ` Thomas Gleixner
  2011-06-22 19:30   ` Thomas Gleixner
                     ` (3 more replies)
  3 siblings, 4 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

[-- Attachment #1: sched-wq-sigh.patch --]
[-- Type: text/plain, Size: 7513 bytes --]

The worker accounting for cpu bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.

Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. There is also no harm from
updating nr_running when the task returns from scheduling instead of
accounting it in the wakeup code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched.c           |   66 ++++++++++------------------------------------
 kernel/workqueue.c       |   67 ++++++++++++++++++++---------------------------
 kernel/workqueue_sched.h |    5 +--
 3 files changed, 46 insertions(+), 92 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2477,10 +2477,6 @@ static void ttwu_activate(struct rq *rq,
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
-
-	/* if a worker is waking up, notify workqueue */
-	if (p->flags & PF_WQ_WORKER)
-		wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /*
@@ -2703,40 +2699,6 @@ out:
 }
 
 /**
- * try_to_wake_up_local - try to wake up a local task with rq lock held
- * @p: the thread to be awakened
- *
- * Put @p on the run-queue if it's not already there. The caller must
- * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.
- */
-static void try_to_wake_up_local(struct task_struct *p)
-{
-	struct rq *rq = task_rq(p);
-
-	BUG_ON(rq != this_rq());
-	BUG_ON(p == current);
-	lockdep_assert_held(&rq->lock);
-
-	if (!raw_spin_trylock(&p->pi_lock)) {
-		raw_spin_unlock(&rq->lock);
-		raw_spin_lock(&p->pi_lock);
-		raw_spin_lock(&rq->lock);
-	}
-
-	if (!(p->state & TASK_NORMAL))
-		goto out;
-
-	if (!p->on_rq)
-		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
-
-	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
-out:
-	raw_spin_unlock(&p->pi_lock);
-}
-
-/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -4240,19 +4202,6 @@ need_resched:
 		} else {
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
-
-			/*
-			 * If a worker went to sleep, notify and ask workqueue
-			 * whether it wants to wake up a task to maintain
-			 * concurrency.
-			 */
-			if (prev->flags & PF_WQ_WORKER) {
-				struct task_struct *to_wakeup;
-
-				to_wakeup = wq_worker_sleeping(prev, cpu);
-				if (to_wakeup)
-					try_to_wake_up_local(to_wakeup);
-			}
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -4295,6 +4244,14 @@ static inline void sched_submit_work(str
 {
 	if (!tsk->state)
 		return;
+
+	/*
+	 * If a worker went to sleep, notify and ask workqueue whether
+	 * it wants to wake up a task to maintain concurrency.
+	 */
+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_sleeping(tsk);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
@@ -4303,12 +4260,19 @@ static inline void sched_submit_work(str
 		blk_schedule_flush_plug(tsk);
 }
 
+static inline void sched_update_worker(struct task_struct *tsk)
+{
+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_running(tsk);
+}
+
 asmlinkage void schedule(void)
 {
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
 	__schedule();
+	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
 
Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c
+++ linux-2.6/kernel/workqueue.c
@@ -137,6 +137,7 @@ struct worker {
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
 	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
+	int			sleeping;	/* None */
 };
 
 /*
@@ -657,66 +658,56 @@ static void wake_up_worker(struct global
 }
 
 /**
- * wq_worker_waking_up - a worker is waking up
- * @task: task waking up
- * @cpu: CPU @task is waking up to
+ * wq_worker_running - a worker is running again
+ * @task: task returning from sleep
  *
- * This function is called during try_to_wake_up() when a worker is
- * being awoken.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
+ * This function is called when a worker returns from schedule()
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_running(struct task_struct *task)
 {
 	struct worker *worker = kthread_data(task);
 
+	if (!worker->sleeping)
+		return;
 	if (!(worker->flags & WORKER_NOT_RUNNING))
-		atomic_inc(get_gcwq_nr_running(cpu));
+		atomic_inc(get_gcwq_nr_running(smp_processor_id()));
+	worker->sleeping = 0;
 }
 
 /**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
- * @cpu: CPU in question, must be the current CPU number
- *
- * This function is called during schedule() when a busy worker is
- * going to sleep.  Worker on the same cpu can be woken up by
- * returning pointer to its task.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
  *
- * RETURNS:
- * Worker task on @cpu to wake up, %NULL if none.
+ * This function is called from schedule() when a busy worker is
+ * going to sleep.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu)
+void wq_worker_sleeping(struct task_struct *task)
 {
-	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
-	struct global_cwq *gcwq = get_gcwq(cpu);
-	atomic_t *nr_running = get_gcwq_nr_running(cpu);
+	struct worker *worker = kthread_data(task);
+	struct global_cwq *gcwq;
+	int cpu;
 
 	if (worker->flags & WORKER_NOT_RUNNING)
-		return NULL;
+		return;
 
-	/* this can only happen on the local cpu */
-	BUG_ON(cpu != raw_smp_processor_id());
+	if (WARN_ON_ONCE(worker->sleeping))
+		return;
 
+	cpu = smp_processor_id();
+	gcwq = get_gcwq(cpu);
+	spin_lock_irq(&gcwq->lock);
 	/*
 	 * The counterpart of the following dec_and_test, implied mb,
 	 * worklist not empty test sequence is in insert_work().
 	 * Please read comment there.
-	 *
-	 * NOT_RUNNING is clear.  This means that trustee is not in
-	 * charge and we're running on the local cpu w/ rq lock held
-	 * and preemption disabled, which in turn means that none else
-	 * could be manipulating idle_list, so dereferencing idle_list
-	 * without gcwq lock is safe.
-	 */
-	if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist))
-		to_wakeup = first_worker(gcwq);
-	return to_wakeup ? to_wakeup->task : NULL;
+	 */
+	if (atomic_dec_and_test(get_gcwq_nr_running(cpu)) &&
+	    !list_empty(&gcwq->worklist)) {
+		worker = first_worker(gcwq);
+		if (worker)
+			wake_up_process(worker->task);
+	}
+	spin_unlock_irq(&gcwq->lock);
 }
 
 /**
Index: linux-2.6/kernel/workqueue_sched.h
===================================================================
--- linux-2.6.orig/kernel/workqueue_sched.h
+++ linux-2.6/kernel/workqueue_sched.h
@@ -4,6 +4,5 @@
  * Scheduler hooks for concurrency managed workqueue.  Only to be
  * included from sched.c and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu);
+void wq_worker_running(struct task_struct *task);
+void wq_worker_sleeping(struct task_struct *task);




* Re: [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
@ 2011-06-22 18:43   ` Christoph Hellwig
  2011-06-22 18:52     ` Thomas Gleixner
  2011-06-22 19:42     ` Jens Axboe
  2011-08-29 14:55   ` [tip:sched/urgent] " tip-bot for Thomas Gleixner
  1 sibling, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2011-06-22 18:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote:
> Block-IO and workqueues call into notifier functions from the
> scheduler core code with interrupts and preemption disabled. These
> calls should be made before entering the scheduler core.
> 
> To simplify this, separate the scheduler core code into
> __schedule(). __schedule() is directly called from the places which
> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
> checks into schedule(), so they are only called when a task voluntary
> goes to sleep.

I don't think that works.  We'll need to flush the block requests even
for an involuntary schedule.



* Re: [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 18:43   ` Christoph Hellwig
@ 2011-06-22 18:52     ` Thomas Gleixner
  2011-06-22 19:42     ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 18:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

On Wed, 22 Jun 2011, Christoph Hellwig wrote:

> On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote:
> > Block-IO and workqueues call into notifier functions from the
> > scheduler core code with interrupts and preemption disabled. These
> > calls should be made before entering the scheduler core.
> > 
> > To simplify this, separate the scheduler core code into
> > __schedule(). __schedule() is directly called from the places which
> > set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
> > checks into schedule(), so they are only called when a task voluntary
> > goes to sleep.
> 
> I don't think that works.  We'll need to flush the block requests even
> for an involuntary schedule.

We don't do that right now, as that code is conditional on:

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

Also, blk_flush_plug_list() is preemptible when called from
io_schedule(), so you might be preempted in the middle of the list
operations; calling into it while preempted would result in an utter
disaster.
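
For reference, roughly the current shape of that code (it is removed
again by patch 2/4); the whole block is skipped on the preemption path:

	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		...
		/*
		 * If we are going to sleep and we have plugged IO
		 * queued, make sure to submit it to avoid deadlocks.
		 */
		if (blk_needs_flush_plug(prev)) {
			raw_spin_unlock(&rq->lock);
			blk_schedule_flush_plug(prev);
			raw_spin_lock(&rq->lock);
		}
		...
	}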

Thanks,

	tglx



* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
@ 2011-06-22 19:30   ` Thomas Gleixner
  2011-06-23  8:37   ` Tejun Heo
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 19:30 UTC (permalink / raw)
  To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

On Wed, 22 Jun 2011, Thomas Gleixner wrote:
> +void wq_worker_sleeping(struct task_struct *task)
>  {
> -	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
> -	struct global_cwq *gcwq = get_gcwq(cpu);
> -	atomic_t *nr_running = get_gcwq_nr_running(cpu);
> +	struct worker *worker = kthread_data(task);
> +	struct global_cwq *gcwq;
> +	int cpu;
>  
>  	if (worker->flags & WORKER_NOT_RUNNING)
> -		return NULL;
> +		return;
>  
> -	/* this can only happen on the local cpu */
> -	BUG_ON(cpu != raw_smp_processor_id());
> +	if (WARN_ON_ONCE(worker->sleeping))
> +		return;

Darn, forgot to add 

      worker->sleeping = 1;

back when I added the WARN_ON.
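
I.e., the head of wq_worker_sleeping() then looks roughly like this (the
rest of the function is unchanged):

	void wq_worker_sleeping(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);

		if (worker->flags & WORKER_NOT_RUNNING)
			return;

		if (WARN_ON_ONCE(worker->sleeping))
			return;

		worker->sleeping = 1;	/* the missing line */

		/* ... take gcwq->lock, dec nr_running, wake the first idle worker ... */
	}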
  


* Re: [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 18:43   ` Christoph Hellwig
  2011-06-22 18:52     ` Thomas Gleixner
@ 2011-06-22 19:42     ` Jens Axboe
  2011-06-22 20:15       ` Thomas Gleixner
  1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2011-06-22 19:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	Linus Torvalds

On 2011-06-22 20:43, Christoph Hellwig wrote:
> On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote:
>> Block-IO and workqueues call into notifier functions from the
>> scheduler core code with interrupts and preemption disabled. These
>> calls should be made before entering the scheduler core.
>>
>> To simplify this, separate the scheduler core code into
>> __schedule(). __schedule() is directly called from the places which
>> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
>> checks into schedule(), so they are only called when a task voluntary
>> goes to sleep.
> 
> I don't think that works.  We'll need to flush the block requests even
> for an involuntary schedule.

Yep, doing it just for voluntary schedule() is pointless, since the
caller should just do the flushing on his own. The whole point of the
sched hook was to ensure that involuntary schedules flushed it.

-- 
Jens Axboe



* Re: [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 19:42     ` Jens Axboe
@ 2011-06-22 20:15       ` Thomas Gleixner
  2011-06-23 11:41         ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-22 20:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	Linus Torvalds

On Wed, 22 Jun 2011, Jens Axboe wrote:

> On 2011-06-22 20:43, Christoph Hellwig wrote:
> > On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote:
> >> Block-IO and workqueues call into notifier functions from the
> >> scheduler core code with interrupts and preemption disabled. These
> >> calls should be made before entering the scheduler core.
> >>
> >> To simplify this, separate the scheduler core code into
> >> __schedule(). __schedule() is directly called from the places which
> >> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
> >> checks into schedule(), so they are only called when a task voluntary
> >> goes to sleep.
> > 
> > I don't think that works.  We'll need to flush the block requests even
> > for an involuntary schedule.
> 
> Yep, doing it just for voluntary schedule() is pointless, since the
> caller should just do the flushing on his own. The whole point of the
> sched hook was to ensure that involuntary schedules flushed it.

I guess we are talking about different things here. Involuntary is when
you are preempted, which keeps the task state unchanged, and the current
code already excludes that case.

If you block on a mutex, semaphore, completion or whatever that's a
different thing. That code calls schedule() not __schedule() and that
will flush your stuff as it does now.

Thanks,

	tglx


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
  2011-06-22 19:30   ` Thomas Gleixner
@ 2011-06-23  8:37   ` Tejun Heo
  2011-06-23  9:58     ` Thomas Gleixner
  2011-06-23 15:07   ` Tejun Heo
  2013-04-30 13:37   ` Steven Rostedt
  3 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2011-06-23  8:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds

Hello, Thomas.

The hooking place doesn't have anything to do with rq->lock.  The
problem with Peter's patch was preemption, not the lock.

> There is also no harm from updating nr_running when the task returns
> from scheduling instead of accounting it in the wakeup code.

Well, not exactly.  If the CPU is being thrashed, we don't want to try to
fire up new workers or call in rescuers.  If nr_running is bumped
up from ttwu(), a woken up but not yet running worker already counts
as running.  With the suggested change, when we hit such heavy CPU
thrashing, workqueue code will add things on top of it.

Thanks.

-- 
tejun


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23  8:37   ` Tejun Heo
@ 2011-06-23  9:58     ` Thomas Gleixner
  2011-06-23 10:15       ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-23  9:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds

On Thu, 23 Jun 2011, Tejun Heo wrote:

> Hello, Thomas.
> 
> The hooking place doesn't have anything to do with rq->lock.  The
> problem with Peter's patch was preemption, not the lock.

And we can do it w/o preemption disabled as I have shown.
 
> > There is also no harm from updating nr_running when the task returns
> > from scheduling instead of accounting it in the wakeup code.
> 
> Well, not exactly.  If CPU is being thrashed, we don't want to try to
> fire up new workers or calling in rescuers.  If nr_running is bumped
> up from ttwu(), a woken up but not yet running worker already counts
> as running.  With the suggested change, when we hit such heavy CPU
> thrashing, workqueue code will add things on top of it.

That's the whole problem with that self-forking workqueue stuff, and
I'm not accepting that ttwu() is the only solution to it. Following
that logic would just invite more abusers of callbacks into ttwu(),
and if you think it through, the logical consequence is to have an
upcall hook into userspace so that a threading/forking server knows how
many instances are on the fly to avoid creating new ones under
pressure.

Thanks,

	tglx


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23  9:58     ` Thomas Gleixner
@ 2011-06-23 10:15       ` Tejun Heo
  2011-06-23 10:44         ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2011-06-23 10:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds

Hello,

On Thu, Jun 23, 2011 at 11:58:12AM +0200, Thomas Gleixner wrote:
> > Well, not exactly.  If CPU is being thrashed, we don't want to try to
> > fire up new workers or calling in rescuers.  If nr_running is bumped
> > up from ttwu(), a woken up but not yet running worker already counts
> > as running.  With the suggested change, when we hit such heavy CPU
> > thrashing, workqueue code will add things on top of it.
> 
> That's the whole problem with that self forking workqueue stuff and
> I'm not accepting that ttwu() is the only solution to that. Following
> that logic would just invite more abusers of callbacks into ttwu() and
> if you think it through then the logical consequence is to have an
> upcall hook into userspace so a threading/forking server knows how
> many instances are on the fly to avoid creating new ones under
> pressure.

Extrapolating to extremes doesn't really help anything.  You can make
any argument with logic like that.  The thing isn't being exported to
userland, not even close.

The patch description is simply untrue.  It does affect how wq behaves
under heavy CPU load.  The effect might be perfectly okay but more
likely it will result in subtle suboptimal behaviors under certain
load situations which would be difficult to characterize and track
down.  Again, the trade off (mostly killing of ttwu_local) could be
okay but you can't get away with just claiming "there's no harm".

Thanks.

-- 
tejun


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23 10:15       ` Tejun Heo
@ 2011-06-23 10:44         ` Ingo Molnar
  2011-06-23 11:35           ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2011-06-23 10:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds


* Tejun Heo <tj@kernel.org> wrote:

> The patch description is simply untrue.  It does affect how wq 
> behaves under heavy CPU load.  The effect might be perfectly okay 
> but more likely it will result in subtle suboptimal behaviors under 
> certain load situations which would be difficult to characterize 
> and track down.  Again, the trade off (mostly killing of 
> ttwu_local) could be okay but you can't get away with just claiming 
> "there's no harm".

Well, either it can be measured or not. If you can suggest a specific 
testing method to Thomas, please do.

Thanks,

	Ingo


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23 10:44         ` Ingo Molnar
@ 2011-06-23 11:35           ` Tejun Heo
  2011-06-23 12:51             ` Ingo Molnar
  2011-06-24  9:01             ` Thomas Gleixner
  0 siblings, 2 replies; 27+ messages in thread
From: Tejun Heo @ 2011-06-23 11:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds

Hello, Ingo.

On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Tejun Heo <tj@kernel.org> wrote:
>
>> The patch description is simply untrue.  It does affect how wq
>> behaves under heavy CPU load.  The effect might be perfectly okay
>> but more likely it will result in subtle suboptimal behaviors under
>> certain load situations which would be difficult to characterize
>> and track down.  Again, the trade off (mostly killing of
>> ttwu_local) could be okay but you can't get away with just claiming
>> "there's no harm".
>
> Well, either it can be measured or not. If you can suggest a specific
> testing method to Thomas, please do.

Crafting a test case where the suggested change results in worse
behavior isn't difficult (it ends up waking/creating workers which it
doesn't have to between ttwu and actual execution); however, as with
any microbenchmark, the problem is assessing whether and how
much it would matter in actual workloads (whatever that means), and
workqueue is used throughout the kernel with widely varying patterns,
making drawing conclusions a bit tricky.  Given that, changing the
behavior for the worse just for this cleanup doesn't sound like too
sweet a deal.  Is there any other reason to change the behavior
(latency, whatever) other than the ttwu_local ugliness?

Thanks.

-- 
tejun


* Re: [patch 1/4] sched: Separate the scheduler entry for preemption
  2011-06-22 20:15       ` Thomas Gleixner
@ 2011-06-23 11:41         ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2011-06-23 11:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar,
	Linus Torvalds

On 2011-06-22 22:15, Thomas Gleixner wrote:
> On Wed, 22 Jun 2011, Jens Axboe wrote:
> 
>> On 2011-06-22 20:43, Christoph Hellwig wrote:
>>> On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote:
>>>> Block-IO and workqueues call into notifier functions from the
>>>> scheduler core code with interrupts and preemption disabled. These
>>>> calls should be made before entering the scheduler core.
>>>>
>>>> To simplify this, separate the scheduler core code into
>>>> __schedule(). __schedule() is directly called from the places which
>>>> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
>>>> checks into schedule(), so they are only called when a task voluntary
>>>> goes to sleep.
>>>
>>> I don't think that works.  We'll need to flush the block requests even
>>> for an involuntary schedule.
>>
>> Yep, doing it just for voluntary schedule() is pointless, since the
>> caller should just do the flushing on his own. The whole point of the
>> sched hook was to ensure that involuntary schedules flushed it.
> 
> I guess we talk about different things here. The involuntary is when

Seems to be the trend for this patchset, why stop now? :-)

> you are preempted, which keeps state unchanged and the current code
> already excludes that case.
> 
> If you block on a mutex, semaphore, completion or whatever that's a
> different thing. That code calls schedule() not __schedule() and that
> will flush your stuff as it does now.

OK, thanks for the clarification. You are right, the original kernel did
not flush on e.g. preempt, and that wasn't the intent either. The intent
was to flush on block only. So the behaviour remains identical.

-- 
Jens Axboe



* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23 11:35           ` Tejun Heo
@ 2011-06-23 12:51             ` Ingo Molnar
  2011-06-24  9:01             ` Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2011-06-23 12:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Tejun Heo <tj@kernel.org> wrote:
> >
> >> The patch description is simply untrue.  It does affect how wq
> >> behaves under heavy CPU load.  The effect might be perfectly okay
> >> but more likely it will result in subtle suboptimal behaviors under
> >> certain load situations which would be difficult to characterize
> >> and track down.  Again, the trade off (mostly killing of
> >> ttwu_local) could be okay but you can't get away with just claiming
> >> "there's no harm".
> >
> > Well, either it can be measured or not. If you can suggest a specific
> > testing method to Thomas, please do.
> 
> Crafting a test case where the suggested change results in worse 
> behavior isn't difficult (it ends up waking/creating workers which 
> it doesn't have to between ttwu and actual execution); however, as 
> with any micro benchmark, the problem is with assessing whether and 
> how much it would matter in actual workloads (whatever that means) 
> and workqueue is used throughout the kernel with widely varying 
> patterns making drawing conclusion a bit tricky. [...]

Well, please suggest a workload where it *matters* - as I suspect for any 
workload tglx comes up with there will be another 90% of workloads 
that you could suggest: so it's much better if you suggest a testing 
method.

When someone comes to me with a scheduler change I can give them the 
workloads that they should double-check. See the changelog of this 
recent commit for example:

  c8b281161dfa: sched: Increase SCHED_LOAD_SCALE resolution

So please suggest a testing method.

> [...] Given that, changing the behavior for the worse just for this 
> cleanup doesn't sound like too sweet a deal.  Is there any other 
> reason to change the behavior (latency, whatever) other than the 
> ttuw_local ugliness?

Well, the ugliness is one aspect of it, but my main concern is simply 
testability: any claim of speedup or slowdown ought to be testable, 
right? I mean, we'd like to drive people towards coming up with patches 
and numbers like Nikhil Rao did in c8b281161dfa, right?

Thanks,

	Ingo


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
  2011-06-22 19:30   ` Thomas Gleixner
  2011-06-23  8:37   ` Tejun Heo
@ 2011-06-23 15:07   ` Tejun Heo
  2013-04-30 13:37   ` Steven Rostedt
  3 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2011-06-23 15:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds

Hello, again.

So, let's get it correct first.

On Wed, Jun 22, 2011 at 05:52:15PM -0000, Thomas Gleixner wrote:
> The worker accounting for cpu bound workers is plugged into the core
> scheduler code and the wakeup code. This is not a hard requirement and
> can be avoided by keeping track of the state in the workqueue code
> itself.
> 
> Keep track of the sleeping state in the worker itself and call the
> notifier before entering the core scheduler. There might be false
> positives when the task is woken between that call and actually
> scheduling, but that's not really different from scheduling and being
> woken immediately after switching away. There is also no harm from
> updating nr_running when the task returns from scheduling instead of
> accounting it in the wakeup code.

I think false positives on schedule() should be safe.  As said
earlier, the gap between ttwu and actually running is a bit worrisome,
but it might be nothing; please at least describe the behavior
change.

> Index: linux-2.6/kernel/workqueue.c
> ===================================================================
> --- linux-2.6.orig/kernel/workqueue.c
> +++ linux-2.6/kernel/workqueue.c
> @@ -137,6 +137,7 @@ struct worker {
>  	unsigned int		flags;		/* X: flags */
>  	int			id;		/* I: worker id */
>  	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
> +	int			sleeping;	/* None */

bool?

> -struct task_struct *wq_worker_sleeping(struct task_struct *task,
> -				       unsigned int cpu)
> +void wq_worker_sleeping(struct task_struct *task)
>  {
> -	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
> -	struct global_cwq *gcwq = get_gcwq(cpu);
> -	atomic_t *nr_running = get_gcwq_nr_running(cpu);
> +	struct worker *worker = kthread_data(task);
> +	struct global_cwq *gcwq;
> +	int cpu;
>  
>  	if (worker->flags & WORKER_NOT_RUNNING)
> -		return NULL;
> +		return;

This doesn't look safe.  It can race with trustee_thread() setting
WORKER_ROGUE.  Let's just grab gcwq->lock on entry to
wq_worker_sleeping() for now; then, the schedule() trickery in
trustee_thread() can go away too.  This also means we can remove the
weird sync rules from ->flags and ->idle_list and just use simple
gcwq->lock for those, which is pretty nice.

> -	/* this can only happen on the local cpu */
> -	BUG_ON(cpu != raw_smp_processor_id());
> +	if (WARN_ON_ONCE(worker->sleeping))
> +		return;

Re-entrance is prevented by the scheduler hook being called only for
non-preemption schedule().  Maybe it's better to explain that in the
function comment?

Hmmm... Also, I think worker->sleeping should be cleared by
trustee_thread() when WORKER_ROGUE is set for the worker; otherwise,
it can get out of sync and the above WARN_ON_ONCE() will trigger.
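
Something along these lines in trustee_thread(), at the point where the
workers are marked rogue (hypothetical sketch, untested):

	/* when taking over a worker in trustee_thread(): */
	worker->flags |= WORKER_ROGUE;
	/*
	 * Keep the sleeping state in sync so the WARN_ON_ONCE() in
	 * wq_worker_sleeping() cannot trigger later.
	 */
	worker->sleeping = 0;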

Thanks.

-- 
tejun


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-23 11:35           ` Tejun Heo
  2011-06-23 12:51             ` Ingo Molnar
@ 2011-06-24  9:01             ` Thomas Gleixner
  2011-06-26 10:19               ` Tejun Heo
  1 sibling, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2011-06-24  9:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ingo Molnar, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2906 bytes --]

On Thu, 23 Jun 2011, Tejun Heo wrote:
> Hello, Ingo.
> 
> On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Tejun Heo <tj@kernel.org> wrote:
> >
> >> The patch description is simply untrue.  It does affect how wq
> >> behaves under heavy CPU load.  The effect might be perfectly okay
> >> but more likely it will result in subtle suboptimal behaviors under
> >> certain load situations which would be difficult to characterize
> >> and track down.  Again, the trade off (mostly killing of
> >> ttwu_local) could be okay but you can't get away with just claiming
> >> "there's no harm".

Fair enough. I'm happy to reword this.

> > Well, either it can be measured or not. If you can suggest a specific
> > testing method to Thomas, please do.
> 
> Crafting a test case where the suggested change results in worse
> behavior isn't difficult (it ends up waking/creating workers which it
> doesn't have to between ttwu and actual execution); however, as with
> any micro benchmark, the problem is with assessing whether and how
> much it would matter in actual workloads (whatever that means) and
> workqueue is used throughout the kernel with widely varying patterns
> making drawing conclusion a bit tricky.  Given that, changing the
> behavior for the worse just for this cleanup doesn't sound like too
> sweet a deal.  Is there any other reason to change the behavior
> (latency, whatever) other than the ttuw_local ugliness?

ttwu_local should have never been there in the first place. And yes,
it's a latency and a correctness issue. Workqueues are not part of the
scheduler, period. I'm fine with a call in pre/post schedule, so it's
a separate code path and not convoluted into the guts of the scheduler
code.

The whole design is completely plugged into the scheduler, as you
_abuse_ rq->lock to serialize the accounting and allow the unlocked
access to the idle_list. It does not matter at all whether that code
path is fast or not, it simply does not belong there.

We could do a lot of other fancy accounting tricks with hooks inside
the scheduler. If we tolerate workqueue hackery in there, how do we
rule out other crap which wants to have hooks plugged into that? There
are issues which can't be solved other than from the scheduler core,
e.g. the VCPU switch, but the workqueue overload accounting is
definitely not in that category. 

The only reason why you want this precise accounting thing is a corner
case of system overload, but not the common case. In case of system
overload you can do extra work to figure out how many of these beasts
are on the fly and whether it makes sense to start some more, simply
because it does not matter in an overload situation whether you run
through a list to gather information.

Can we please stop adding stuff to the most crucial code paths in the
kernel just to avoid thinking a bit harder about problems?

Thanks,

	tglx


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-24  9:01             ` Thomas Gleixner
@ 2011-06-26 10:19               ` Tejun Heo
  0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2011-06-26 10:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds

Hello, Thomas.

On Fri, Jun 24, 2011 at 11:01:12AM +0200, Thomas Gleixner wrote:
> The whole design is completely plugged into the scheduler, as you
> _abuse_ rq->lock to serialize the accounting and allow the unlocked
> access to the idle_list. It does not matter at all whether that code
> path is fast or not, it simply does not belong there.

It didn't happen like that.  Hooking into the scheduler was one of the
biggest issues and there were several different attempts at using /
generalizing existing hooks somehow, which took a good six months without
much agreement from the scheduler side.

I don't recall all the details now but IIRC it tried to use
fire_sched_out_preempt_notifiers(), which was called inside rq lock,
and that introduced all the convolution with rq lock.  Later somebody
suggested just open coding it and being done with it, and that's how it
ended up inside rq lock.  Hindsight is 20/20 and we could have pulled
it out of rq lock at that time, but the rest of the code had already
been running with various different hooking schemes for quite some
time by then.

Anyways, so, let's clean it up.

> The only reason why you want this precise accounting thing is a corner
> case of system overload, but not the common case. In case of system
> overload you can do extra work to figure out how many of these beasts
> are on the fly and whether it makes sense to start some more, simply
> because it does not matter in a overload situation whether you run
> through a list to gather information.

Sure, there are multiple ways to resolve the situation.  I was mostly
worried about waking up / attempting to start more workers under CPU
pressure (this is fine under memory pressure as CPU is idle and memory
allocation failure / blocking would quickly recall rescuers).  I've
been looking at the code and am still not sure but it could be that
this isn't really something to worry about.

Workqueue code always wakes up the first idle worker and a worker
leaves idle state not by its waker but of its own accord when it starts
running, which means that, with the suggested change, there can be
multiple spurious wake-ups but they'll all try to wake up the same
idle worker which hasn't started running yet.  We might still
unnecessarily try to summon rescuers though.  It needs more
consideration and experimentation.

Another concern is the additional spin_lock in the sleeping path to
synchronize against the CPU unplug operation I mentioned in the other
reply.  Another way to do it would be to use preempt_disable/enable()
instead and call synchronize_sched() from the trustee.  It would only
add an extra preemption flip compared to the current code.
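
A rough sketch of that alternative (untested, using the names from this
patch):

	void wq_worker_sleeping(struct task_struct *task)
	{
		struct worker *worker = kthread_data(task);

		/*
		 * Preempt-disabled section instead of gcwq->lock; the
		 * trustee waits for all such sections to finish with
		 * synchronize_sched() before it touches the worker state.
		 */
		preempt_disable();
		if (!(worker->flags & WORKER_NOT_RUNNING) && !worker->sleeping) {
			worker->sleeping = 1;
			/* ... dec nr_running, wake the first idle worker ... */
		}
		preempt_enable();
	}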

> Can we please stop adding stuff to the most crucial code pathes in the
> kernel just to avoid thinking a bit harder about problems?

Heh... sure, let's just keep it technical, okay?

Thanks.

-- 
tejun


* [tip:sched/urgent] sched: Separate the scheduler entry for preemption
  2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
  2011-06-22 18:43   ` Christoph Hellwig
@ 2011-08-29 14:55   ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Thomas Gleixner @ 2011-08-29 14:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, axboe, a.p.zijlstra, tj, tglx, mingo

Commit-ID:  c259e01a1ec90063042f758e409cd26b2a0963c8
Gitweb:     http://git.kernel.org/tip/c259e01a1ec90063042f758e409cd26b2a0963c8
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Wed, 22 Jun 2011 19:47:00 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 29 Aug 2011 12:26:57 +0200

sched: Separate the scheduler entry for preemption

Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.

To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntarily
goes to sleep.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..ec15e81 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4279,9 +4279,9 @@ pick_next_task(struct rq *rq)
 }
 
 /*
- * schedule() is the main scheduler function.
+ * __schedule() is the main scheduler function.
  */
-asmlinkage void __sched schedule(void)
+static void __sched __schedule(void)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
@@ -4369,6 +4369,11 @@ need_resched:
 	if (need_resched())
 		goto need_resched;
 }
+
+asmlinkage void schedule(void)
+{
+	__schedule();
+}
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
@@ -4435,7 +4440,7 @@ asmlinkage void __sched notrace preempt_schedule(void)
 
 	do {
 		add_preempt_count_notrace(PREEMPT_ACTIVE);
-		schedule();
+		__schedule();
 		sub_preempt_count_notrace(PREEMPT_ACTIVE);
 
 		/*
@@ -4463,7 +4468,7 @@ asmlinkage void __sched preempt_schedule_irq(void)
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		local_irq_enable();
-		schedule();
+		__schedule();
 		local_irq_disable();
 		sub_preempt_count(PREEMPT_ACTIVE);
 
@@ -5588,7 +5593,7 @@ static inline int should_resched(void)
 static void __cond_resched(void)
 {
 	add_preempt_count(PREEMPT_ACTIVE);
-	schedule();
+	__schedule();
 	sub_preempt_count(PREEMPT_ACTIVE);
 }
 


* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
                     ` (2 preceding siblings ...)
  2011-06-23 15:07   ` Tejun Heo
@ 2013-04-30 13:37   ` Steven Rostedt
  2013-04-30 22:47     ` Steven Rostedt
  3 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2013-04-30 13:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

[ Blast from the past! ]

When merging 3.4.42 into the 3.4-rt branch I hit a conflict with the
try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
patch applied, although it is not applied in any of the other -rt patches.

My question is: what was the final verdict on this patch? The thread
sorta died without a definite answer.

Is this a change we want to add upstream? If not, is this a change I
should remove from the 3.4-rt branch?

Note, -rt did hit a bug with the try_to_wake_up_local() code, so
removing it is a welcome idea. We are still trying to reproduce that
bug.

Thanks,

-- Steve

On Wed, Jun 22, 2011 at 05:52:15PM -0000, Thomas Gleixner wrote:
> The worker accounting for cpu bound workers is plugged into the core
> scheduler code and the wakeup code. This is not a hard requirement and
> can be avoided by keeping track of the state in the workqueue code
> itself.
> 
> Keep track of the sleeping state in the worker itself and call the
> notifier before entering the core scheduler. There might be false
> positives when the task is woken between that call and actually
> scheduling, but that's not really different from scheduling and being
> woken immediately after switching away. There is also no harm from
> updating nr_running when the task returns from scheduling instead of
> accounting it in the wakeup code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/sched.c           |   66 ++++++++++------------------------------------
>  kernel/workqueue.c       |   67 ++++++++++++++++++++---------------------------
>  kernel/workqueue_sched.h |    5 +--
>  3 files changed, 46 insertions(+), 92 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2477,10 +2477,6 @@ static void ttwu_activate(struct rq *rq,
>  {
>  	activate_task(rq, p, en_flags);
>  	p->on_rq = 1;
> -
> -	/* if a worker is waking up, notify workqueue */
> -	if (p->flags & PF_WQ_WORKER)
> -		wq_worker_waking_up(p, cpu_of(rq));
>  }
>  
>  /*
> @@ -2703,40 +2699,6 @@ out:
>  }
>  
>  /**
> - * try_to_wake_up_local - try to wake up a local task with rq lock held
> - * @p: the thread to be awakened
> - *
> - * Put @p on the run-queue if it's not already there. The caller must
> - * ensure that this_rq() is locked, @p is bound to this_rq() and not
> - * the current task.
> - */
> -static void try_to_wake_up_local(struct task_struct *p)
> -{
> -	struct rq *rq = task_rq(p);
> -
> -	BUG_ON(rq != this_rq());
> -	BUG_ON(p == current);
> -	lockdep_assert_held(&rq->lock);
> -
> -	if (!raw_spin_trylock(&p->pi_lock)) {
> -		raw_spin_unlock(&rq->lock);
> -		raw_spin_lock(&p->pi_lock);
> -		raw_spin_lock(&rq->lock);
> -	}
> -
> -	if (!(p->state & TASK_NORMAL))
> -		goto out;
> -
> -	if (!p->on_rq)
> -		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> -
> -	ttwu_do_wakeup(rq, p, 0);
> -	ttwu_stat(p, smp_processor_id(), 0);
> -out:
> -	raw_spin_unlock(&p->pi_lock);
> -}
> -
> -/**
>   * wake_up_process - Wake up a specific process
>   * @p: The process to be woken up.
>   *
> @@ -4240,19 +4202,6 @@ need_resched:
>  		} else {
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
> -
> -			/*
> -			 * If a worker went to sleep, notify and ask workqueue
> -			 * whether it wants to wake up a task to maintain
> -			 * concurrency.
> -			 */
> -			if (prev->flags & PF_WQ_WORKER) {
> -				struct task_struct *to_wakeup;
> -
> -				to_wakeup = wq_worker_sleeping(prev, cpu);
> -				if (to_wakeup)
> -					try_to_wake_up_local(to_wakeup);
> -			}
>  		}
>  		switch_count = &prev->nvcsw;
>  	}
> @@ -4295,6 +4244,14 @@ static inline void sched_submit_work(str
>  {
>  	if (!tsk->state)
>  		return;
> +
> +	/*
> +	 * If a worker went to sleep, notify and ask workqueue whether
> +	 * it wants to wake up a task to maintain concurrency.
> +	 */
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_sleeping(tsk);
> +
>  	/*
>  	 * If we are going to sleep and we have plugged IO queued,
>  	 * make sure to submit it to avoid deadlocks.
> @@ -4303,12 +4260,19 @@ static inline void sched_submit_work(str
>  		blk_schedule_flush_plug(tsk);
>  }
>  
> +static inline void sched_update_worker(struct task_struct *tsk)
> +{
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_running(tsk);
> +}
> +
>  asmlinkage void schedule(void)
>  {
>  	struct task_struct *tsk = current;
>  
>  	sched_submit_work(tsk);
>  	__schedule();
> +	sched_update_worker(tsk);
>  }
>  EXPORT_SYMBOL(schedule);
>  
> Index: linux-2.6/kernel/workqueue.c
> ===================================================================
> --- linux-2.6.orig/kernel/workqueue.c
> +++ linux-2.6/kernel/workqueue.c
> @@ -137,6 +137,7 @@ struct worker {
>  	unsigned int		flags;		/* X: flags */
>  	int			id;		/* I: worker id */
>  	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
> +	int			sleeping;	/* None */
>  };
>  
>  /*
> @@ -657,66 +658,56 @@ static void wake_up_worker(struct global
>  }
>  
>  /**
> - * wq_worker_waking_up - a worker is waking up
> - * @task: task waking up
> - * @cpu: CPU @task is waking up to
> + * wq_worker_running - a worker is running again
> + * @task: task returning from sleep
>   *
> - * This function is called during try_to_wake_up() when a worker is
> - * being awoken.
> - *
> - * CONTEXT:
> - * spin_lock_irq(rq->lock)
> + * This function is called when a worker returns from schedule()
>   */
> -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
> +void wq_worker_running(struct task_struct *task)
>  {
>  	struct worker *worker = kthread_data(task);
>  
> +	if (!worker->sleeping)
> +		return;
>  	if (!(worker->flags & WORKER_NOT_RUNNING))
> -		atomic_inc(get_gcwq_nr_running(cpu));
> +		atomic_inc(get_gcwq_nr_running(smp_processor_id()));
> +	worker->sleeping = 0;
>  }
>  
>  /**
>   * wq_worker_sleeping - a worker is going to sleep
>   * @task: task going to sleep
> - * @cpu: CPU in question, must be the current CPU number
> - *
> - * This function is called during schedule() when a busy worker is
> - * going to sleep.  Worker on the same cpu can be woken up by
> - * returning pointer to its task.
> - *
> - * CONTEXT:
> - * spin_lock_irq(rq->lock)
>   *
> - * RETURNS:
> - * Worker task on @cpu to wake up, %NULL if none.
> + * This function is called from schedule() when a busy worker is
> + * going to sleep.
>   */
> -struct task_struct *wq_worker_sleeping(struct task_struct *task,
> -				       unsigned int cpu)
> +void wq_worker_sleeping(struct task_struct *task)
>  {
> -	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
> -	struct global_cwq *gcwq = get_gcwq(cpu);
> -	atomic_t *nr_running = get_gcwq_nr_running(cpu);
> +	struct worker *worker = kthread_data(task);
> +	struct global_cwq *gcwq;
> +	int cpu;
>  
>  	if (worker->flags & WORKER_NOT_RUNNING)
> -		return NULL;
> +		return;
>  
> -	/* this can only happen on the local cpu */
> -	BUG_ON(cpu != raw_smp_processor_id());
> +	if (WARN_ON_ONCE(worker->sleeping))
> +		return;
>  
> +	cpu = smp_processor_id();
> +	gcwq = get_gcwq(cpu);
> +	spin_lock_irq(&gcwq->lock);
>  	/*
>  	 * The counterpart of the following dec_and_test, implied mb,
>  	 * worklist not empty test sequence is in insert_work().
>  	 * Please read comment there.
> -	 *
> -	 * NOT_RUNNING is clear.  This means that trustee is not in
> -	 * charge and we're running on the local cpu w/ rq lock held
> -	 * and preemption disabled, which in turn means that none else
> -	 * could be manipulating idle_list, so dereferencing idle_list
> -	 * without gcwq lock is safe.
> -	 */
> -	if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist))
> -		to_wakeup = first_worker(gcwq);
> -	return to_wakeup ? to_wakeup->task : NULL;
> +	 */
> +	if (atomic_dec_and_test(get_gcwq_nr_running(cpu)) &&
> +	    !list_empty(&gcwq->worklist)) {
> +		worker = first_worker(gcwq);
> +		if (worker)
> +			wake_up_process(worker->task);
> +	}
> +	spin_unlock_irq(&gcwq->lock);
>  }
>  
>  /**
> Index: linux-2.6/kernel/workqueue_sched.h
> ===================================================================
> --- linux-2.6.orig/kernel/workqueue_sched.h
> +++ linux-2.6/kernel/workqueue_sched.h
> @@ -4,6 +4,5 @@
>   * Scheduler hooks for concurrency managed workqueue.  Only to be
>   * included from sched.c and workqueue.c.
>   */
> -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
> -struct task_struct *wq_worker_sleeping(struct task_struct *task,
> -				       unsigned int cpu);
> +void wq_worker_running(struct task_struct *task);
> +void wq_worker_sleeping(struct task_struct *task);
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-04-30 13:37   ` Steven Rostedt
@ 2013-04-30 22:47     ` Steven Rostedt
  2013-05-03  0:12       ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2013-04-30 22:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds

On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote:
> [ Blast from the past! ]
> 
> When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the
> try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
> patch applied. Although, this is not applied to any of the other -rt patches.
> 

I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt
nor in 3.8-rt.

-- Steve


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-04-30 22:47     ` Steven Rostedt
@ 2013-05-03  0:12       ` Tejun Heo
  2013-05-03  0:57         ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-05-03  0:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar,
	Linus Torvalds

On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote:
> On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote:
> > [ Blast from the past! ]
> > 
> > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the
> > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
> > patch applied. Although, this is not applied to any of the other -rt patches.
> > 
> 
> I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt
> nor in 3.8-rt.

So, it's all good?  Or is there something I need to look into?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-05-03  0:12       ` Tejun Heo
@ 2013-05-03  0:57         ` Steven Rostedt
  2013-07-24 10:04           ` Thomas Gleixner
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2013-05-03  0:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar,
	Linus Torvalds

On Thu, 2013-05-02 at 17:12 -0700, Tejun Heo wrote:
> On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote:
> > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote:
> > > [ Blast from the past! ]
> > > 
> > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the
> > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
> > > patch applied. Although, this is not applied to any of the other -rt patches.
> > > 
> > 
> > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt
> > nor in 3.8-rt.
> 
> So, it's all good?  Or is there something I need to look into?

It looks good to me. I don't know why it's not in 3.6-rt or 3.8-rt. Was
there a reason that Thomas took it out? I don't know. Maybe it's not
needed or he thought it went mainline?

-- Steve



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-05-03  0:57         ` Steven Rostedt
@ 2013-07-24 10:04           ` Thomas Gleixner
  2013-08-06 19:33             ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2013-07-24 10:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tejun Heo, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar,
	Linus Torvalds, Sebastian Sewior

On Thu, 2 May 2013, Steven Rostedt wrote:
> On Thu, 2013-05-02 at 17:12 -0700, Tejun Heo wrote:
> > On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote:
> > > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote:
> > > > [ Blast from the past! ]
> > > > 
> > > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the
> > > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
> > > > patch applied. Although, this is not applied to any of the other -rt patches.
> > > > 
> > > 
> > > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt
> > > nor in 3.8-rt.
> > 
> > So, it's all good?  Or is there something I need to look into?
> 
> It looks good to me. I don't know why it's not in 3.6-rt or 3.8-rt. Was
> there a reason that Thomas took it out? I don't know. Maybe it's not
> needed or he thought it went mainline?

I dropped it on purpose as I was sure that it was safe.

But after you poked me yesterday I spent quite some time staring at
that code and found that I missed the following:

worker A is about to go idle and the pool->idle_list is empty

  calls worker_enter_idle()

  list_add(&worker->entry, &pool->idle_list);

      idle_list.prev = &worker->entry;

Preemption

Worker B runs and blocks.

  wq_worker_sleeping() sees !list_empty(&pool->idle_list)

  because idle_list.prev already points to worker A

  Then first_worker() returns idle_list.next, which still points at the
  idle_list head itself, so we return some random crap to wake up.
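
A minimal user-space sketch of that window (the struct layout, the
half-finished list_add() and the emptiness test below are stand-ins
modelled on the scenario above, not the real list.h/workqueue code):

/*
 * Sketch only: a list_add() preempted after updating just
 * idle_list.prev leaves the list half-linked; a lockless peek then
 * computes a "worker" pointer from the list head itself.
 */
#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct worker {
	struct list_head entry;
	int id;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct list_head idle_list = { &idle_list, &idle_list };
	struct worker a = { .id = 1 };

	/* Worker A: preempted inside list_add(&a.entry, &idle_list)
	 * after the first store, so only idle_list.prev is updated. */
	idle_list.prev = &a.entry;

	/* Worker B blocks; the sleeping hook peeks without the pool lock. */
	if (idle_list.prev != &idle_list) {	/* the "!list_empty" of the scenario */
		struct worker *w = container_of(idle_list.next,
						struct worker, entry);

		/* idle_list.next still points at the head, so w is bogus. */
		printf("to_wakeup=%p  worker A=%p  list head=%p\n",
		       (void *)w, (void *)&a, (void *)&idle_list);
	}
	return 0;
}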

So yes, I've donned a brown paperbag and we need to bring back that
change and think about making it more palatable for mainline.

Find an untested patch against 3.6-rt below.

Thanks,

	tglx
---
Index: linux-stable-rt/kernel/sched/core.c
===================================================================
--- linux-stable-rt.orig/kernel/sched/core.c
+++ linux-stable-rt/kernel/sched/core.c
@@ -1452,10 +1452,6 @@ static void ttwu_activate(struct rq *rq,
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = 1;
-
-	/* if a worker is waking up, notify workqueue */
-	if (p->flags & PF_WQ_WORKER)
-		wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /*
@@ -1714,42 +1710,6 @@ out:
 }
 
 /**
- * try_to_wake_up_local - try to wake up a local task with rq lock held
- * @p: the thread to be awakened
- *
- * Put @p on the run-queue if it's not already there. The caller must
- * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.
- */
-static void try_to_wake_up_local(struct task_struct *p)
-{
-	struct rq *rq = task_rq(p);
-
-	if (WARN_ON_ONCE(rq != this_rq()) ||
-	    WARN_ON_ONCE(p == current))
-		return;
-
-	lockdep_assert_held(&rq->lock);
-
-	if (!raw_spin_trylock(&p->pi_lock)) {
-		raw_spin_unlock(&rq->lock);
-		raw_spin_lock(&p->pi_lock);
-		raw_spin_lock(&rq->lock);
-	}
-
-	if (!(p->state & TASK_NORMAL))
-		goto out;
-
-	if (!p->on_rq)
-		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
-
-	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
-out:
-	raw_spin_unlock(&p->pi_lock);
-}
-
-/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -3627,19 +3587,6 @@ need_resched:
 		} else {
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
-
-			/*
-			 * If a worker went to sleep, notify and ask workqueue
-			 * whether it wants to wake up a task to maintain
-			 * concurrency.
-			 */
-			if (prev->flags & PF_WQ_WORKER) {
-				struct task_struct *to_wakeup;
-
-				to_wakeup = wq_worker_sleeping(prev, cpu);
-				if (to_wakeup)
-					try_to_wake_up_local(to_wakeup);
-			}
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -3683,6 +3630,14 @@ static inline void sched_submit_work(str
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
+
+	/*
+	 * If a worker went to sleep, notify and ask workqueue whether
+	 * it wants to wake up a task to maintain concurrency.
+	 */
+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_sleeping(tsk);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
@@ -3691,12 +3646,19 @@ static inline void sched_submit_work(str
 		blk_schedule_flush_plug(tsk);
 }
 
+static inline void sched_update_worker(struct task_struct *tsk)
+{
+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_running(tsk);
+}
+
 asmlinkage void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
 	__schedule();
+	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
 
Index: linux-stable-rt/kernel/workqueue.c
===================================================================
--- linux-stable-rt.orig/kernel/workqueue.c
+++ linux-stable-rt/kernel/workqueue.c
@@ -152,6 +152,7 @@ struct worker {
 	/* for rebinding worker to CPU */
 	struct idle_rebind	*idle_rebind;	/* L: for idle worker */
 	struct work_struct	rebind_work;	/* L: for busy worker */
+	int			sleeping;	/* None */
 };
 
 struct worker_pool {
@@ -691,66 +692,55 @@ static void wake_up_worker(struct worker
 }
 
 /**
- * wq_worker_waking_up - a worker is waking up
- * @task: task waking up
- * @cpu: CPU @task is waking up to
+ * wq_worker_waking_up - a worker is running again
+ * @task: task returning from sleep
  *
- * This function is called during try_to_wake_up() when a worker is
- * being awoken.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
+ * This function is called when a worker returns from a blocking
+ * schedule.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task)
 {
 	struct worker *worker = kthread_data(task);
 
+	if (!worker->sleeping)
+		return;
+
 	if (!(worker->flags & WORKER_NOT_RUNNING))
 		atomic_inc(get_pool_nr_running(worker->pool));
+	worker->sleeping = 0;
 }
 
 /**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
- * @cpu: CPU in question, must be the current CPU number
  *
  * This function is called during schedule() when a busy worker is
- * going to sleep.  Worker on the same cpu can be woken up by
- * returning pointer to its task.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
- *
- * RETURNS:
- * Worker task on @cpu to wake up, %NULL if none.
+ * going to sleep.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu)
+void wq_worker_sleeping(struct task_struct *task)
 {
-	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
+	struct worker *worker = kthread_data(task);
 	struct worker_pool *pool = worker->pool;
-	atomic_t *nr_running = get_pool_nr_running(pool);
+	struct global_cwq *gcwq = pool->gcwq;
+	atomic_t *nr_running;
 
 	if (worker->flags & WORKER_NOT_RUNNING)
-		return NULL;
+		return;
 
-	/* this can only happen on the local cpu */
-	BUG_ON(cpu != raw_smp_processor_id());
+	if (WARN_ON_ONCE(worker->sleeping))
+		return;
 
+	worker->sleeping = 1;
+	spin_lock_irq(&gcwq->lock);
+	nr_running = get_pool_nr_running(pool);
 	/*
 	 * The counterpart of the following dec_and_test, implied mb,
 	 * worklist not empty test sequence is in insert_work().
 	 * Please read comment there.
-	 *
-	 * NOT_RUNNING is clear.  This means that we're bound to and
-	 * running on the local cpu w/ rq lock held and preemption
-	 * disabled, which in turn means that none else could be
-	 * manipulating idle_list, so dereferencing idle_list without gcwq
-	 * lock is safe.
 	 */
 	if (atomic_dec_and_test(nr_running) && !list_empty(&pool->worklist))
-		to_wakeup = first_worker(pool);
-	return to_wakeup ? to_wakeup->task : NULL;
+		wake_up_process(first_worker(pool)->task);
+	spin_unlock_irq(&gcwq->lock);
 }
 
 /**
Index: linux-stable-rt/kernel/workqueue_sched.h
===================================================================
--- linux-stable-rt.orig/kernel/workqueue_sched.h
+++ linux-stable-rt/kernel/workqueue_sched.h
@@ -4,6 +4,5 @@
  * Scheduler hooks for concurrency managed workqueue.  Only to be
  * included from sched.c and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-				       unsigned int cpu);
+void wq_worker_running(struct task_struct *task);
+void wq_worker_sleeping(struct task_struct *task);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-07-24 10:04           ` Thomas Gleixner
@ 2013-08-06 19:33             ` Steven Rostedt
  0 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2013-08-06 19:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tejun Heo, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar,
	Linus Torvalds, Sebastian Sewior

On Wed, 2013-07-24 at 12:04 +0200, Thomas Gleixner wrote:
> On Thu, 2 May 2013, Steven Rostedt wrote:
> > On Thu, 2013-05-02 at 17:12 -0700, Tejun Heo wrote:
> > > On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote:
> > > > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote:
> > > > > [ Blast from the past! ]
> > > > > 
> > > > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the
> > > > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this
> > > > > patch applied. Although, this is not applied to any of the other -rt patches.
> > > > > 
> > > > 
> > > > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt
> > > > nor in 3.8-rt.
> > > 
> > > So, it's all good?  Or is there something I need to look into?
> > 
> > It looks good to me. I don't know why it's not in 3.6-rt or 3.8-rt. Was
> > there a reason that Thomas took it out? I don't know. Maybe it's not
> > needed or he thought it went mainline?
> 
> I dropped it on purpose as I was sure that it was safe.
> 
> But after you poked me yesterday I spent quite some time staring at
> that code and found that I missed the following:
> 
> worker A is about to go idle and the pool->idle_list is empty
> 
>   calls worker_enter_idle()
> 
>   list_add(&worker->entry, &pool->idle_list);
> 
>       idle_list.prev = &worker->entry;
> 
> Preemption
> 
> Worker B runs and blocks.
> 
>   wq_worker_sleeping() sees !list_empty(&pool->idle_list)
> 
>   because idle_list.prev already points to worker A
> 
>   Then first_worker() returns idle_list.next, which still points at the
>   idle_list head itself, so we return some random crap to wake up.
> 
> So yes, I've donned a brown paperbag and we need to bring back that
> change and think about making it more palatable for mainline.
> 
> Find an untested patch against 3.6-rt below.
> 
> Thanks,
> 
> 	tglx
> ---
> Index: linux-stable-rt/kernel/sched/core.c
> ===================================================================
> --- linux-stable-rt.orig/kernel/sched/core.c
> +++ linux-stable-rt/kernel/sched/core.c
> @@ -1452,10 +1452,6 @@ static void ttwu_activate(struct rq *rq,
>  {
>  	activate_task(rq, p, en_flags);
>  	p->on_rq = 1;
> -
> -	/* if a worker is waking up, notify workqueue */
> -	if (p->flags & PF_WQ_WORKER)
> -		wq_worker_waking_up(p, cpu_of(rq));
>  }
>  
>  /*
> @@ -1714,42 +1710,6 @@ out:
>  }
>  
>  /**
> - * try_to_wake_up_local - try to wake up a local task with rq lock held
> - * @p: the thread to be awakened
> - *
> - * Put @p on the run-queue if it's not already there. The caller must
> - * ensure that this_rq() is locked, @p is bound to this_rq() and not
> - * the current task.
> - */
> -static void try_to_wake_up_local(struct task_struct *p)
> -{
> -	struct rq *rq = task_rq(p);
> -
> -	if (WARN_ON_ONCE(rq != this_rq()) ||
> -	    WARN_ON_ONCE(p == current))
> -		return;
> -
> -	lockdep_assert_held(&rq->lock);
> -
> -	if (!raw_spin_trylock(&p->pi_lock)) {
> -		raw_spin_unlock(&rq->lock);
> -		raw_spin_lock(&p->pi_lock);
> -		raw_spin_lock(&rq->lock);
> -	}
> -
> -	if (!(p->state & TASK_NORMAL))
> -		goto out;
> -
> -	if (!p->on_rq)
> -		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
> -
> -	ttwu_do_wakeup(rq, p, 0);
> -	ttwu_stat(p, smp_processor_id(), 0);
> -out:
> -	raw_spin_unlock(&p->pi_lock);
> -}
> -
> -/**
>   * wake_up_process - Wake up a specific process
>   * @p: The process to be woken up.
>   *
> @@ -3627,19 +3587,6 @@ need_resched:
>  		} else {
>  			deactivate_task(rq, prev, DEQUEUE_SLEEP);
>  			prev->on_rq = 0;
> -
> -			/*
> -			 * If a worker went to sleep, notify and ask workqueue
> -			 * whether it wants to wake up a task to maintain
> -			 * concurrency.
> -			 */
> -			if (prev->flags & PF_WQ_WORKER) {
> -				struct task_struct *to_wakeup;
> -
> -				to_wakeup = wq_worker_sleeping(prev, cpu);
> -				if (to_wakeup)
> -					try_to_wake_up_local(to_wakeup);
> -			}
>  		}
>  		switch_count = &prev->nvcsw;
>  	}
> @@ -3683,6 +3630,14 @@ static inline void sched_submit_work(str
>  {
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
>  		return;
> +
> +	/*
> +	 * If a worker went to sleep, notify and ask workqueue whether
> +	 * it wants to wake up a task to maintain concurrency.
> +	 */
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_sleeping(tsk);
> +
>  	/*
>  	 * If we are going to sleep and we have plugged IO queued,
>  	 * make sure to submit it to avoid deadlocks.
> @@ -3691,12 +3646,19 @@ static inline void sched_submit_work(str
>  		blk_schedule_flush_plug(tsk);
>  }
>  
> +static inline void sched_update_worker(struct task_struct *tsk)
> +{
> +	if (tsk->flags & PF_WQ_WORKER)
> +		wq_worker_running(tsk);

I think you forgot to add the wq_worker_running() function, as I don't
see it in 3.6-rt or 3.8-rt.
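
For reference, a sketch of the missing hook, assuming it is simply the
body this patch keeps under the old wq_worker_waking_up() name, renamed
to match the workqueue_sched.h declaration (an assumption, not something
taken from the posted patch):

void wq_worker_running(struct task_struct *task)
{
	struct worker *worker = kthread_data(task);

	if (!worker->sleeping)
		return;

	if (!(worker->flags & WORKER_NOT_RUNNING))
		atomic_inc(get_pool_nr_running(worker->pool));
	worker->sleeping = 0;
}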

-- Steve

> +}
> +
>  asmlinkage void __sched schedule(void)
>  {
>  	struct task_struct *tsk = current;
>  
>  	sched_submit_work(tsk);
>  	__schedule();
> +	sched_update_worker(tsk);
>  }
>  EXPORT_SYMBOL(schedule);
>  
> Index: linux-stable-rt/kernel/workqueue.c
> ===================================================================
> --- linux-stable-rt.orig/kernel/workqueue.c
> +++ linux-stable-rt/kernel/workqueue.c
> @@ -152,6 +152,7 @@ struct worker {
>  	/* for rebinding worker to CPU */
>  	struct idle_rebind	*idle_rebind;	/* L: for idle worker */
>  	struct work_struct	rebind_work;	/* L: for busy worker */
> +	int			sleeping;	/* None */
>  };
>  
>  struct worker_pool {
> @@ -691,66 +692,55 @@ static void wake_up_worker(struct worker
>  }
>  
>  /**
> - * wq_worker_waking_up - a worker is waking up
> - * @task: task waking up
> - * @cpu: CPU @task is waking up to
> + * wq_worker_waking_up - a worker is running again
> + * @task: task returning from sleep
>   *
> - * This function is called during try_to_wake_up() when a worker is
> - * being awoken.
> - *
> - * CONTEXT:
> - * spin_lock_irq(rq->lock)
> + * This function is called when a worker returns from a blocking
> + * schedule.
>   */
> -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
> +void wq_worker_waking_up(struct task_struct *task)
>  {
>  	struct worker *worker = kthread_data(task);
>  
> +	if (!worker->sleeping)
> +		return;
> +
>  	if (!(worker->flags & WORKER_NOT_RUNNING))
>  		atomic_inc(get_pool_nr_running(worker->pool));
> +	worker->sleeping = 0;
>  }
>  
>  /**
>   * wq_worker_sleeping - a worker is going to sleep
>   * @task: task going to sleep
> - * @cpu: CPU in question, must be the current CPU number
>   *
>   * This function is called during schedule() when a busy worker is
> - * going to sleep.  Worker on the same cpu can be woken up by
> - * returning pointer to its task.
> - *
> - * CONTEXT:
> - * spin_lock_irq(rq->lock)
> - *
> - * RETURNS:
> - * Worker task on @cpu to wake up, %NULL if none.
> + * going to sleep.
>   */
> -struct task_struct *wq_worker_sleeping(struct task_struct *task,
> -				       unsigned int cpu)
> +void wq_worker_sleeping(struct task_struct *task)
>  {
> -	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
> +	struct worker *worker = kthread_data(task);
>  	struct worker_pool *pool = worker->pool;
> -	atomic_t *nr_running = get_pool_nr_running(pool);
> +	struct global_cwq *gcwq = pool->gcwq;
> +	atomic_t *nr_running;
>  
>  	if (worker->flags & WORKER_NOT_RUNNING)
> -		return NULL;
> +		return;
>  
> -	/* this can only happen on the local cpu */
> -	BUG_ON(cpu != raw_smp_processor_id());
> +	if (WARN_ON_ONCE(worker->sleeping))
> +		return;
>  
> +	worker->sleeping = 1;
> +	spin_lock_irq(&gcwq->lock);
> +	nr_running = get_pool_nr_running(pool);
>  	/*
>  	 * The counterpart of the following dec_and_test, implied mb,
>  	 * worklist not empty test sequence is in insert_work().
>  	 * Please read comment there.
> -	 *
> -	 * NOT_RUNNING is clear.  This means that we're bound to and
> -	 * running on the local cpu w/ rq lock held and preemption
> -	 * disabled, which in turn means that none else could be
> -	 * manipulating idle_list, so dereferencing idle_list without gcwq
> -	 * lock is safe.
>  	 */
>  	if (atomic_dec_and_test(nr_running) && !list_empty(&pool->worklist))
> -		to_wakeup = first_worker(pool);
> -	return to_wakeup ? to_wakeup->task : NULL;
> +		wake_up_process(first_worker(pool)->task);
> +	spin_unlock_irq(&gcwq->lock);
>  }
>  
>  /**
> Index: linux-stable-rt/kernel/workqueue_sched.h
> ===================================================================
> --- linux-stable-rt.orig/kernel/workqueue_sched.h
> +++ linux-stable-rt/kernel/workqueue_sched.h
> @@ -4,6 +4,5 @@
>   * Scheduler hooks for concurrency managed workqueue.  Only to be
>   * included from sched.c and workqueue.c.
>   */
> -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
> -struct task_struct *wq_worker_sleeping(struct task_struct *task,
> -				       unsigned int cpu);
> +void wq_worker_running(struct task_struct *task);
> +void wq_worker_sleeping(struct task_struct *task);



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-08-06 19:33 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
2011-06-22 18:43   ` Christoph Hellwig
2011-06-22 18:52     ` Thomas Gleixner
2011-06-22 19:42     ` Jens Axboe
2011-06-22 20:15       ` Thomas Gleixner
2011-06-23 11:41         ` Jens Axboe
2011-08-29 14:55   ` [tip:sched/urgent] " tip-bot for Thomas Gleixner
2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner
2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner
2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
2011-06-22 19:30   ` Thomas Gleixner
2011-06-23  8:37   ` Tejun Heo
2011-06-23  9:58     ` Thomas Gleixner
2011-06-23 10:15       ` Tejun Heo
2011-06-23 10:44         ` Ingo Molnar
2011-06-23 11:35           ` Tejun Heo
2011-06-23 12:51             ` Ingo Molnar
2011-06-24  9:01             ` Thomas Gleixner
2011-06-26 10:19               ` Tejun Heo
2011-06-23 15:07   ` Tejun Heo
2013-04-30 13:37   ` Steven Rostedt
2013-04-30 22:47     ` Steven Rostedt
2013-05-03  0:12       ` Tejun Heo
2013-05-03  0:57         ` Steven Rostedt
2013-07-24 10:04           ` Thomas Gleixner
2013-08-06 19:33             ` Steven Rostedt
