* [patch 0/4] sched: Move work out of the scheduler core @ 2011-06-22 17:52 Thomas Gleixner 2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner ` (3 more replies) 0 siblings, 4 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds Block-IO and workqueues have hooks in the scheduler core which have no strict requirements for being called from irq disabled, preempt disabled and rq->lock held regions. The following series moves them outside. Thanks, tglx --- block/blk-core.c | 20 +++----- kernel/sched.c | 106 +++++++++++++++++------------------------------ kernel/workqueue.c | 67 ++++++++++++----------------- kernel/workqueue_sched.h | 5 -- 4 files changed, 79 insertions(+), 119 deletions(-) ^ permalink raw reply [flat|nested] 27+ messages in thread
* [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner @ 2011-06-22 17:52 ` Thomas Gleixner 2011-06-22 18:43 ` Christoph Hellwig 2011-08-29 14:55 ` [tip:sched/urgent] " tip-bot for Thomas Gleixner 2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner ` (2 subsequent siblings) 3 siblings, 2 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds [-- Attachment #1: sched-split-out-regular-schedule.patch --] [-- Type: text/plain, Size: 1929 bytes --] Block-IO and workqueues call into notifier functions from the scheduler core code with interrupts and preemption disabled. These calls should be made before entering the scheduler core. To simplify this, separate the scheduler core code into __schedule(). __schedule() is directly called from the places which set PREEMPT_ACTIVE and from schedule(). This allows us to add the work checks into schedule(), so they are only called when a task voluntarily goes to sleep. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -4210,9 +4210,9 @@ pick_next_task(struct rq *rq) } /* - * schedule() is the main scheduler function. + * __schedule() is the main scheduler function. 
*/ -asmlinkage void __sched schedule(void) +static void __sched __schedule(void) { struct task_struct *prev, *next; unsigned long *switch_count; @@ -4300,6 +4300,11 @@ need_resched: if (need_resched()) goto need_resched; } + +asmlinkage void schedule(void) +{ + __schedule(); +} EXPORT_SYMBOL(schedule); #ifdef CONFIG_MUTEX_SPIN_ON_OWNER @@ -4373,7 +4378,7 @@ asmlinkage void __sched notrace preempt_ do { add_preempt_count_notrace(PREEMPT_ACTIVE); - schedule(); + __schedule(); sub_preempt_count_notrace(PREEMPT_ACTIVE); /* @@ -4401,7 +4406,7 @@ asmlinkage void __sched preempt_schedule do { add_preempt_count(PREEMPT_ACTIVE); local_irq_enable(); - schedule(); + __schedule(); local_irq_disable(); sub_preempt_count(PREEMPT_ACTIVE); @@ -5526,7 +5531,7 @@ static inline int should_resched(void) static void __cond_resched(void) { add_preempt_count(PREEMPT_ACTIVE); - schedule(); + __schedule(); sub_preempt_count(PREEMPT_ACTIVE); } ^ permalink raw reply [flat|nested] 27+ messages in thread
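The shape of this split is easy to model in plain userspace C. The sketch below is illustrative only — the function names mirror the patch, but the bodies are stand-in counters rather than scheduler code:

```c
#include <assert.h>

static int core_passes;    /* entries into the scheduler core */
static int prework_calls;  /* times the voluntary entry ran its work checks */

/* the scheduler core proper; both entry points funnel into this */
static void __schedule(void)
{
    core_passes++;
}

/* voluntary entry: the only place where pre-scheduler work may run */
static void schedule(void)
{
    prework_calls++;  /* stand-in for the future work checks */
    __schedule();
}

/* preemption entry: sets PREEMPT_ACTIVE (elided here) and bypasses the hooks */
static void preempt_schedule(void)
{
    __schedule();
}
```

Calling schedule() once and preempt_schedule() once enters the core twice but runs the pre-scheduler work only once, which is exactly the separation the patch is after.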
* Re: [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner @ 2011-06-22 18:43 ` Christoph Hellwig 2011-06-22 18:52 ` Thomas Gleixner 2011-06-22 19:42 ` Jens Axboe 2011-08-29 14:55 ` [tip:sched/urgent] " tip-bot for Thomas Gleixner 1 sibling, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2011-06-22 18:43 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote: > Block-IO and workqueues call into notifier functions from the > scheduler core code with interrupts and preemption disabled. These > calls should be made before entering the scheduler core. > > To simplify this, separate the scheduler core code into > __schedule(). __schedule() is directly called from the places which > set PREEMPT_ACTIVE and from schedule(). This allows us to add the work > checks into schedule(), so they are only called when a task voluntary > goes to sleep. I don't think that works. We'll need to flush the block requests even for an involuntary schedule. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 18:43 ` Christoph Hellwig @ 2011-06-22 18:52 ` Thomas Gleixner 2011-06-22 19:42 ` Jens Axboe 1 sibling, 0 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 18:52 UTC (permalink / raw) To: Christoph Hellwig Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds On Wed, 22 Jun 2011, Christoph Hellwig wrote: > On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote: > > Block-IO and workqueues call into notifier functions from the > > scheduler core code with interrupts and preemption disabled. These > > calls should be made before entering the scheduler core. > > > > To simplify this, separate the scheduler core code into > > __schedule(). __schedule() is directly called from the places which > > set PREEMPT_ACTIVE and from schedule(). This allows us to add the work > > checks into schedule(), so they are only called when a task voluntary > > goes to sleep. > > I don't think that works. We'll need to flush the block requests even > for an involuntary schedule. We don't do that right now, as that code is conditional on: if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { Also, blk_flush_plug_list(), when called from io_schedule(), is preemptible, so you might be preempted in the middle of the list operations, so calling into it when preempted would result in an utter disaster. Thanks, tglx ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 18:43 ` Christoph Hellwig 2011-06-22 18:52 ` Thomas Gleixner @ 2011-06-22 19:42 ` Jens Axboe 2011-06-22 20:15 ` Thomas Gleixner 1 sibling, 1 reply; 27+ messages in thread From: Jens Axboe @ 2011-06-22 19:42 UTC (permalink / raw) To: Christoph Hellwig Cc: Thomas Gleixner, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar, Linus Torvalds On 2011-06-22 20:43, Christoph Hellwig wrote: > On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote: >> Block-IO and workqueues call into notifier functions from the >> scheduler core code with interrupts and preemption disabled. These >> calls should be made before entering the scheduler core. >> >> To simplify this, separate the scheduler core code into >> __schedule(). __schedule() is directly called from the places which >> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work >> checks into schedule(), so they are only called when a task voluntary >> goes to sleep. > > I don't think that works. We'll need to flush the block requests even > for an involuntary schedule. Yep, doing it just for voluntary schedule() is pointless, since the caller should just do the flushing on his own. The whole point of the sched hook was to ensure that involuntary schedules flushed it. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 19:42 ` Jens Axboe @ 2011-06-22 20:15 ` Thomas Gleixner 2011-06-23 11:41 ` Jens Axboe 0 siblings, 1 reply; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 20:15 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar, Linus Torvalds On Wed, 22 Jun 2011, Jens Axboe wrote: > On 2011-06-22 20:43, Christoph Hellwig wrote: > > On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote: > >> Block-IO and workqueues call into notifier functions from the > >> scheduler core code with interrupts and preemption disabled. These > >> calls should be made before entering the scheduler core. > >> > >> To simplify this, separate the scheduler core code into > >> __schedule(). __schedule() is directly called from the places which > >> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work > >> checks into schedule(), so they are only called when a task voluntary > >> goes to sleep. > > > > I don't think that works. We'll need to flush the block requests even > > for an involuntary schedule. > > Yep, doing it just for voluntary schedule() is pointless, since the > caller should just do the flushing on his own. The whole point of the > sched hook was to ensure that involuntary schedules flushed it. I guess we talk about different things here. The involuntary case is when you are preempted, which keeps state unchanged, and the current code already excludes that case. If you block on a mutex, semaphore, completion or whatever, that's a different thing. That code calls schedule(), not __schedule(), and that will flush your stuff as it does now. Thanks, tglx ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 1/4] sched: Separate the scheduler entry for preemption 2011-06-22 20:15 ` Thomas Gleixner @ 2011-06-23 11:41 ` Jens Axboe 0 siblings, 0 replies; 27+ messages in thread From: Jens Axboe @ 2011-06-23 11:41 UTC (permalink / raw) To: Thomas Gleixner Cc: Christoph Hellwig, LKML, Peter Zijlstra, Tejun Heo, Ingo Molnar, Linus Torvalds On 2011-06-22 22:15, Thomas Gleixner wrote: > On Wed, 22 Jun 2011, Jens Axboe wrote: > >> On 2011-06-22 20:43, Christoph Hellwig wrote: >>> On Wed, Jun 22, 2011 at 05:52:13PM -0000, Thomas Gleixner wrote: >>>> Block-IO and workqueues call into notifier functions from the >>>> scheduler core code with interrupts and preemption disabled. These >>>> calls should be made before entering the scheduler core. >>>> >>>> To simplify this, separate the scheduler core code into >>>> __schedule(). __schedule() is directly called from the places which >>>> set PREEMPT_ACTIVE and from schedule(). This allows us to add the work >>>> checks into schedule(), so they are only called when a task voluntary >>>> goes to sleep. >>> >>> I don't think that works. We'll need to flush the block requests even >>> for an involuntary schedule. >> >> Yep, doing it just for voluntary schedule() is pointless, since the >> caller should just do the flushing on his own. The whole point of the >> sched hook was to ensure that involuntary schedules flushed it. > > I guess we talk about different things here. The involuntary is when Seems to be the trend for this patchset, why stop now? :-) > you are preempted, which keeps state unchanged and the current code > already excludes that case. > > If you block on a mutex, semaphore, completion or whatever that's a > different thing. That code calls schedule() not __schedule() and that > will flush your stuff as it does now. OK, thanks for the clarification. You are right, the original kernel did not flush on eg preempt and that wasn't the intent either. The intent was to flush on block only. 
So behaviour remains identical. -- Jens Axboe ^ permalink raw reply [flat|nested] 27+ messages in thread
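The condition this subthread keeps circling back to can be captured in a small sketch. The helper name below is made up for illustration, and the PREEMPT_ACTIVE value is illustrative (the real one is architecture-dependent); only a genuine block — task state set and PREEMPT_ACTIVE clear — runs the sleep-time work:

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING   0L
#define PREEMPT_ACTIVE 0x10000000  /* illustrative; the real value is per-arch */

/* Mirrors the condition tglx quotes upthread: sleep-time work runs only
 * when the task is actually blocking (prev->state != TASK_RUNNING) and
 * this is not a preemption pass (PREEMPT_ACTIVE clear). */
static bool runs_sleep_hooks(long state, unsigned int preempt_count)
{
    return state != TASK_RUNNING && !(preempt_count & PREEMPT_ACTIVE);
}
```

A task preempted after setting its state (PREEMPT_ACTIVE set) does not flush, which is the "behaviour remains identical" point Jens concedes.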
* [tip:sched/urgent] sched: Separate the scheduler entry for preemption 2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner 2011-06-22 18:43 ` Christoph Hellwig @ 2011-08-29 14:55 ` tip-bot for Thomas Gleixner 1 sibling, 0 replies; 27+ messages in thread From: tip-bot for Thomas Gleixner @ 2011-08-29 14:55 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, torvalds, axboe, a.p.zijlstra, tj, tglx, mingo Commit-ID: c259e01a1ec90063042f758e409cd26b2a0963c8 Gitweb: http://git.kernel.org/tip/c259e01a1ec90063042f758e409cd26b2a0963c8 Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Wed, 22 Jun 2011 19:47:00 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Mon, 29 Aug 2011 12:26:57 +0200 sched: Separate the scheduler entry for preemption Block-IO and workqueues call into notifier functions from the scheduler core code with interrupts and preemption disabled. These calls should be made before entering the scheduler core. To simplify this, separate the scheduler core code into __schedule(). __schedule() is directly called from the places which set PREEMPT_ACTIVE and from schedule(). This allows us to add the work checks into schedule(), so they are only called when a task voluntarily goes to sleep. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@kernel.org # 2.6.39+ Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 15 ++++++++++----- 1 files changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index ccacdbd..ec15e81 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4279,9 +4279,9 @@ pick_next_task(struct rq *rq) } /* - * schedule() is the main scheduler function. 
+ * __schedule() is the main scheduler function. */ -asmlinkage void __sched schedule(void) +static void __sched __schedule(void) { struct task_struct *prev, *next; unsigned long *switch_count; @@ -4369,6 +4369,11 @@ need_resched: if (need_resched()) goto need_resched; } + +asmlinkage void schedule(void) +{ + __schedule(); +} EXPORT_SYMBOL(schedule); #ifdef CONFIG_MUTEX_SPIN_ON_OWNER @@ -4435,7 +4440,7 @@ asmlinkage void __sched notrace preempt_schedule(void) do { add_preempt_count_notrace(PREEMPT_ACTIVE); - schedule(); + __schedule(); sub_preempt_count_notrace(PREEMPT_ACTIVE); /* @@ -4463,7 +4468,7 @@ asmlinkage void __sched preempt_schedule_irq(void) do { add_preempt_count(PREEMPT_ACTIVE); local_irq_enable(); - schedule(); + __schedule(); local_irq_disable(); sub_preempt_count(PREEMPT_ACTIVE); @@ -5588,7 +5593,7 @@ static inline int should_resched(void) static void __cond_resched(void) { add_preempt_count(PREEMPT_ACTIVE); - schedule(); + __schedule(); sub_preempt_count(PREEMPT_ACTIVE); } ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch 3/4] block: Shorten interrupt disabled regions 2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner 2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner @ 2011-06-22 17:52 ` Thomas Gleixner 2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner 3 siblings, 0 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds [-- Attachment #1: block-fixup-irq-disabled-region.patch --] [-- Type: text/plain, Size: 3121 bytes --] Moving the blk_schedule_flush_plug() call out of the interrupt/preempt disabled region in the scheduler allows us to replace local_irq_save/restore(flags) by local_irq_disable/enable() in blk_flush_plug(). Now, instead of doing this, we disable interrupts explicitly when we lock the request_queue and reenable them when we drop the lock. That allows interrupts to be handled when the plug list contains requests for more than one queue. Aside from that, this change makes the scope of the irq disabled region more obvious. The current code confused the hell out of me when looking at: local_irq_save(flags); spin_lock(q->queue_lock); ... queue_unplugged(q...); scsi_request_fn(); spin_unlock(q->queue_lock); spin_lock(shost->host_lock); spin_unlock_irq(shost->host_lock); -------------------^^^ ???? spin_lock_irq(q->queue_lock); spin_unlock(q->lock); local_irq_restore(flags); Also add a comment to __blk_run_queue() documenting that q->request_fn() can drop q->queue_lock and reenable interrupts, but must return with q->queue_lock held and interrupts disabled. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- block/blk-core.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) Index: linux-2.6/block/blk-core.c =================================================================== --- linux-2.6.orig/block/blk-core.c +++ linux-2.6/block/blk-core.c @@ -301,7 +301,11 @@ void __blk_run_queue(struct request_queu { if (unlikely(blk_queue_stopped(q))) return; - + /* + * q->request_fn() can drop q->queue_lock and reenable + * interrupts, but must return with q->queue_lock held and + * interrupts disabled. + */ q->request_fn(q); } EXPORT_SYMBOL(__blk_run_queue); @@ -2667,11 +2671,11 @@ static void queue_unplugged(struct reque * this lock). */ if (from_schedule) { - spin_unlock(q->queue_lock); + spin_unlock_irq(q->queue_lock); blk_run_queue_async(q); } else { __blk_run_queue(q); - spin_unlock(q->queue_lock); + spin_unlock_irq(q->queue_lock); } } @@ -2697,7 +2701,6 @@ static void flush_plug_callbacks(struct void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) { struct request_queue *q; - unsigned long flags; struct request *rq; LIST_HEAD(list); unsigned int depth; @@ -2718,11 +2721,6 @@ void blk_flush_plug_list(struct blk_plug q = NULL; depth = 0; - /* - * Save and disable interrupts here, to avoid doing it for every - * queue lock we have to take. - */ - local_irq_save(flags); while (!list_empty(&list)) { rq = list_entry_rq(list.next); list_del_init(&rq->queuelist); @@ -2735,7 +2733,7 @@ void blk_flush_plug_list(struct blk_plug queue_unplugged(q, depth, from_schedule); q = rq->q; depth = 0; - spin_lock(q->queue_lock); + spin_lock_irq(q->queue_lock); } /* * rq is already accounted, so use raw insert @@ -2753,8 +2751,6 @@ void blk_flush_plug_list(struct blk_plug */ if (q) queue_unplugged(q, depth, from_schedule); - - local_irq_restore(flags); } void blk_finish_plug(struct blk_plug *plug) ^ permalink raw reply [flat|nested] 27+ messages in thread
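The latency effect of this change can be modeled in userspace. In the sketch below the irq disable/enable calls are fakes that merely count how many windows pending interrupts would get; the locking and queue draining are elided, and all names are stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_off;
static int irq_windows;  /* opportunities for pending interrupts to run */

static void fake_irq_disable(void) { irqs_off = true; }
static void fake_irq_enable(void)  { irqs_off = false; irq_windows++; }

/* old scheme: one local_irq_save/restore around every queue */
static void flush_irqsave(int nqueues)
{
    fake_irq_disable();
    for (int i = 0; i < nqueues; i++)
        ;  /* drain queue i under plain spin_lock(q->queue_lock) */
    fake_irq_enable();
}

/* patched scheme: spin_lock_irq/spin_unlock_irq per queue */
static void flush_per_queue(int nqueues)
{
    for (int i = 0; i < nqueues; i++) {
        fake_irq_disable();  /* spin_lock_irq(q->queue_lock) */
        /* drain queue i */
        fake_irq_enable();   /* spin_unlock_irq(q->queue_lock) */
    }
}
```

With three queues on the plug list, the old scheme gives interrupts one window at the very end; the patched scheme gives one window per queue.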
* [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() 2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner 2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner 2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner @ 2011-06-22 17:52 ` Thomas Gleixner 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner 3 siblings, 0 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds [-- Attachment #1: sched-move-block-plug-out-of-scheduler-guts.patch --] [-- Type: text/plain, Size: 1596 bytes --] There is no real reason to run blk_schedule_flush_plug() with interrupts and preemption disabled. Move it into schedule() and call it when the task is going voluntarily to sleep. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from being woken immediately after switching away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -4253,16 +4253,6 @@ need_resched: if (to_wakeup) try_to_wake_up_local(to_wakeup); } - - /* - * If we are going to sleep and we have plugged IO - * queued, make sure to submit it to avoid deadlocks. 
- */ - if (blk_needs_flush_plug(prev)) { - raw_spin_unlock(&rq->lock); - blk_schedule_flush_plug(prev); - raw_spin_lock(&rq->lock); - } } switch_count = &prev->nvcsw; } @@ -4301,8 +4291,23 @@ need_resched: goto need_resched; } +static inline void sched_submit_work(struct task_struct *tsk) +{ + if (!tsk->state) + return; + /* + * If we are going to sleep and we have plugged IO queued, + * make sure to submit it to avoid deadlocks. + */ + if (blk_needs_flush_plug(tsk)) + blk_schedule_flush_plug(tsk); +} + asmlinkage void schedule(void) { + struct task_struct *tsk = current; + + sched_submit_work(tsk); __schedule(); } EXPORT_SYMBOL(schedule); ^ permalink raw reply [flat|nested] 27+ messages in thread
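A self-contained sketch of the sched_submit_work() logic above — the blk_* calls are replaced by stand-in fields on a toy task struct, so this is a model of the control flow, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING 0L

struct task {
    long state;         /* 0 == TASK_RUNNING, non-zero == going to sleep */
    bool plug_pending;  /* stand-in for blk_needs_flush_plug() */
    int  flushes;       /* counts blk_schedule_flush_plug() calls */
};

/* mirrors sched_submit_work() from the patch: bail out early unless
 * the task is really going to sleep, then submit the plugged I/O */
static void sched_submit_work(struct task *tsk)
{
    if (tsk->state == TASK_RUNNING)  /* "if (!tsk->state) return;" */
        return;
    if (tsk->plug_pending) {
        tsk->plug_pending = false;
        tsk->flushes++;
    }
}
```

A task that is still runnable (e.g. on the preemption path) leaves its plug untouched; a blocking task submits it before entering __schedule(), avoiding the deadlock the removed comment described.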
* [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner ` (2 preceding siblings ...) 2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner @ 2011-06-22 17:52 ` Thomas Gleixner 2011-06-22 19:30 ` Thomas Gleixner ` (3 more replies) 3 siblings, 4 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 17:52 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds [-- Attachment #1: sched-wq-sigh.patch --] [-- Type: text/plain, Size: 7513 bytes --] The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 66 ++++++++++------------------------------------ kernel/workqueue.c | 67 ++++++++++++++++++++--------------------------- kernel/workqueue_sched.h | 5 +-- 3 files changed, 46 insertions(+), 92 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -2477,10 +2477,6 @@ static void ttwu_activate(struct rq *rq, { activate_task(rq, p, en_flags); p->on_rq = 1; - - /* if a worker is waking up, notify workqueue */ - if (p->flags & PF_WQ_WORKER) - wq_worker_waking_up(p, cpu_of(rq)); } /* @@ -2703,40 +2699,6 @@ out: } /** - * try_to_wake_up_local - try to wake up a local task with rq lock held - * @p: the thread to be awakened - * - * Put @p on the run-queue if it's not already there. The caller must - * ensure that this_rq() is locked, @p is bound to this_rq() and not - * the current task. - */ -static void try_to_wake_up_local(struct task_struct *p) -{ - struct rq *rq = task_rq(p); - - BUG_ON(rq != this_rq()); - BUG_ON(p == current); - lockdep_assert_held(&rq->lock); - - if (!raw_spin_trylock(&p->pi_lock)) { - raw_spin_unlock(&rq->lock); - raw_spin_lock(&p->pi_lock); - raw_spin_lock(&rq->lock); - } - - if (!(p->state & TASK_NORMAL)) - goto out; - - if (!p->on_rq) - ttwu_activate(rq, p, ENQUEUE_WAKEUP); - - ttwu_do_wakeup(rq, p, 0); - ttwu_stat(p, smp_processor_id(), 0); -out: - raw_spin_unlock(&p->pi_lock); -} - -/** * wake_up_process - Wake up a specific process * @p: The process to be woken up. * @@ -4240,19 +4202,6 @@ need_resched: } else { deactivate_task(rq, prev, DEQUEUE_SLEEP); prev->on_rq = 0; - - /* - * If a worker went to sleep, notify and ask workqueue - * whether it wants to wake up a task to maintain - * concurrency. 
- */ - if (prev->flags & PF_WQ_WORKER) { - struct task_struct *to_wakeup; - - to_wakeup = wq_worker_sleeping(prev, cpu); - if (to_wakeup) - try_to_wake_up_local(to_wakeup); - } } switch_count = &prev->nvcsw; } @@ -4295,6 +4244,14 @@ static inline void sched_submit_work(str { if (!tsk->state) return; + + /* + * If a worker went to sleep, notify and ask workqueue whether + * it wants to wake up a task to maintain concurrency. + */ + if (tsk->flags & PF_WQ_WORKER) + wq_worker_sleeping(tsk); + /* * If we are going to sleep and we have plugged IO queued, * make sure to submit it to avoid deadlocks. @@ -4303,12 +4260,19 @@ static inline void sched_submit_work(str blk_schedule_flush_plug(tsk); } +static inline void sched_update_worker(struct task_struct *tsk) +{ + if (tsk->flags & PF_WQ_WORKER) + wq_worker_running(tsk); +} + asmlinkage void schedule(void) { struct task_struct *tsk = current; sched_submit_work(tsk); __schedule(); + sched_update_worker(tsk); } EXPORT_SYMBOL(schedule); Index: linux-2.6/kernel/workqueue.c =================================================================== --- linux-2.6.orig/kernel/workqueue.c +++ linux-2.6/kernel/workqueue.c @@ -137,6 +137,7 @@ struct worker { unsigned int flags; /* X: flags */ int id; /* I: worker id */ struct work_struct rebind_work; /* L: rebind worker to cpu */ + int sleeping; /* None */ }; /* @@ -657,66 +658,56 @@ static void wake_up_worker(struct global } /** - * wq_worker_waking_up - a worker is waking up - * @task: task waking up - * @cpu: CPU @task is waking up to + * wq_worker_running - a worker is running again + * @task: task returning from sleep * - * This function is called during try_to_wake_up() when a worker is - * being awoken. 
- * - * CONTEXT: - * spin_lock_irq(rq->lock) + * This function is called when a worker returns from schedule() */ -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu) +void wq_worker_running(struct task_struct *task) { struct worker *worker = kthread_data(task); + if (!worker->sleeping) + return; if (!(worker->flags & WORKER_NOT_RUNNING)) - atomic_inc(get_gcwq_nr_running(cpu)); + atomic_inc(get_gcwq_nr_running(smp_processor_id())); + worker->sleeping = 0; } /** * wq_worker_sleeping - a worker is going to sleep * @task: task going to sleep - * @cpu: CPU in question, must be the current CPU number - * - * This function is called during schedule() when a busy worker is - * going to sleep. Worker on the same cpu can be woken up by - * returning pointer to its task. - * - * CONTEXT: - * spin_lock_irq(rq->lock) * - * RETURNS: - * Worker task on @cpu to wake up, %NULL if none. + * This function is called from schedule() when a busy worker is + * going to sleep. */ -struct task_struct *wq_worker_sleeping(struct task_struct *task, - unsigned int cpu) +void wq_worker_sleeping(struct task_struct *task) { - struct worker *worker = kthread_data(task), *to_wakeup = NULL; - struct global_cwq *gcwq = get_gcwq(cpu); - atomic_t *nr_running = get_gcwq_nr_running(cpu); + struct worker *worker = kthread_data(task); + struct global_cwq *gcwq; + int cpu; if (worker->flags & WORKER_NOT_RUNNING) - return NULL; + return; - /* this can only happen on the local cpu */ - BUG_ON(cpu != raw_smp_processor_id()); + if (WARN_ON_ONCE(worker->sleeping)) + return; + cpu = smp_processor_id(); + gcwq = get_gcwq(cpu); + spin_lock_irq(&gcwq->lock); /* * The counterpart of the following dec_and_test, implied mb, * worklist not empty test sequence is in insert_work(). * Please read comment there. - * - * NOT_RUNNING is clear. 
This means that trustee is not in - * charge and we're running on the local cpu w/ rq lock held - * and preemption disabled, which in turn means that none else - * could be manipulating idle_list, so dereferencing idle_list - * without gcwq lock is safe. - */ - if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist)) - to_wakeup = first_worker(gcwq); - return to_wakeup ? to_wakeup->task : NULL; + */ + if (atomic_dec_and_test(get_gcwq_nr_running(cpu)) && + !list_empty(&gcwq->worklist)) { + worker = first_worker(gcwq); + if (worker) + wake_up_process(worker->task); + } + spin_unlock_irq(&gcwq->lock); } /** Index: linux-2.6/kernel/workqueue_sched.h =================================================================== --- linux-2.6.orig/kernel/workqueue_sched.h +++ linux-2.6/kernel/workqueue_sched.h @@ -4,6 +4,5 @@ * Scheduler hooks for concurrency managed workqueue. Only to be * included from sched.c and workqueue.c. */ -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu); -struct task_struct *wq_worker_sleeping(struct task_struct *task, - unsigned int cpu); +void wq_worker_running(struct task_struct *task); +void wq_worker_sleeping(struct task_struct *task); ^ permalink raw reply [flat|nested] 27+ messages in thread
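The reworked accounting condenses to a small userspace model. This is a sketch, not the patch's code: the gcwq locking is dropped, nr_running is a plain int, and it folds in the `worker->sleeping = 1` assignment tglx notes was missing in his follow-up:

```c
#include <assert.h>
#include <stdbool.h>

static int nr_running;  /* models the per-gcwq running-workers count */

struct worker {
    bool not_running;   /* models WORKER_NOT_RUNNING */
    int  sleeping;
};

/* called from sched_submit_work() before entering the scheduler core */
static void wq_worker_sleeping(struct worker *w)
{
    if (w->not_running)
        return;
    if (w->sleeping)    /* WARN_ON_ONCE() in the patch */
        return;
    w->sleeping = 1;
    nr_running--;
    /* the real code also wakes first_worker() here if work is pending */
}

/* called when the worker returns from schedule() */
static void wq_worker_running(struct worker *w)
{
    if (!w->sleeping)
        return;
    if (!w->not_running)
        nr_running++;
    w->sleeping = 0;
}
```

The per-worker sleeping flag guarantees the decrement and increment always pair up, even if one side is reached twice.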
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner @ 2011-06-22 19:30 ` Thomas Gleixner 2011-06-23 8:37 ` Tejun Heo ` (2 subsequent siblings) 3 siblings, 0 replies; 27+ messages in thread From: Thomas Gleixner @ 2011-06-22 19:30 UTC (permalink / raw) To: LKML; +Cc: Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds On Wed, 22 Jun 2011, Thomas Gleixner wrote: > +void wq_worker_sleeping(struct task_struct *task) > { > - struct worker *worker = kthread_data(task), *to_wakeup = NULL; > - struct global_cwq *gcwq = get_gcwq(cpu); > - atomic_t *nr_running = get_gcwq_nr_running(cpu); > + struct worker *worker = kthread_data(task); > + struct global_cwq *gcwq; > + int cpu; > > if (worker->flags & WORKER_NOT_RUNNING) > - return NULL; > + return; > > - /* this can only happen on the local cpu */ > - BUG_ON(cpu != raw_smp_processor_id()); > + if (WARN_ON_ONCE(worker->sleeping)) > + return; Darn, forgot to add worker->sleeping = 1; back when I added the WARN_ON. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner 2011-06-22 19:30 ` Thomas Gleixner @ 2011-06-23 8:37 ` Tejun Heo 2011-06-23 9:58 ` Thomas Gleixner 2011-06-23 15:07 ` Tejun Heo 2013-04-30 13:37 ` Steven Rostedt 3 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2011-06-23 8:37 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds Hello, Thomas. The hooking place doesn't have anything to do with rq->lock. The problem with Peter's patch was preemption, not the lock. > There is also no harm from updating nr_running when the task returns > from scheduling instead of accounting it in the wakeup code. Well, not exactly. If CPU is being thrashed, we don't want to try to fire up new workers or calling in rescuers. If nr_running is bumped up from ttwu(), a woken up but not yet running worker already counts as running. With the suggested change, when we hit such heavy CPU thrashing, workqueue code will add things on top of it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
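Tejun's timing concern can be made concrete with a toy timeline. The two counters below model where each scheme bumps nr_running; the event names are made up for illustration:

```c
#include <assert.h>

/* With the hook in ttwu() (old), a woken-but-not-yet-running worker
 * already counts in nr_running; with accounting on return from
 * schedule() (new), there is a window where it does not, so the pool
 * can look idle under CPU pressure. */
static int old_nr;  /* nr_running, accounted at wakeup */
static int new_nr;  /* nr_running, accounted on return from schedule() */

static void worker_blocks(void) { old_nr--; new_nr--; }
static void worker_woken(void)  { old_nr++; }  /* ttwu()-time hook */
static void worker_runs(void)   { new_nr++; }  /* schedule()-return hook */
```

Between wakeup and actually running, the two schemes disagree by one; that window is the behavioral difference being debated.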
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 8:37 ` Tejun Heo @ 2011-06-23 9:58 ` Thomas Gleixner 2011-06-23 10:15 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Thomas Gleixner @ 2011-06-23 9:58 UTC (permalink / raw) To: Tejun Heo; +Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds On Thu, 23 Jun 2011, Tejun Heo wrote: > Hello, Thomas. > > The hooking place doesn't have anything to do with rq->lock. The > problem with Peter's patch was preemption, not the lock. And we can do it w/o preemption disabled as I have shown. > > There is also no harm from updating nr_running when the task returns > > from scheduling instead of accounting it in the wakeup code. > > Well, not exactly. If CPU is being thrashed, we don't want to try to > fire up new workers or calling in rescuers. If nr_running is bumped > up from ttwu(), a woken up but not yet running worker already counts > as running. With the suggested change, when we hit such heavy CPU > thrashing, workqueue code will add things on top of it. That's the whole problem with that self forking workqueue stuff and I'm not accepting that ttwu() is the only solution to that. Following that logic would just invite more abusers of callbacks into ttwu() and if you think it through then the logical consequence is to have an upcall hook into userspace so a threading/forking server knows how many instances are on the fly to avoid creating new ones under pressure. Thanks, tglx ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 9:58 ` Thomas Gleixner @ 2011-06-23 10:15 ` Tejun Heo 2011-06-23 10:44 ` Ingo Molnar 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2011-06-23 10:15 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds Hello, On Thu, Jun 23, 2011 at 11:58:12AM +0200, Thomas Gleixner wrote: > > Well, not exactly. If CPU is being thrashed, we don't want to try to > > fire up new workers or calling in rescuers. If nr_running is bumped > > up from ttwu(), a woken up but not yet running worker already counts > > as running. With the suggested change, when we hit such heavy CPU > > thrashing, workqueue code will add things on top of it. > > That's the whole problem with that self forking workqueue stuff and > I'm not accepting that ttwu() is the only solution to that. Following > that logic would just invite more abusers of callbacks into ttwu() and > if you think it through then the logical consequence is to have an > upcall hook into userspace so a threading/forking server knows how > many instances are on the fly to avoid creating new ones under > pressure. Extrapolating to extremes doesn't really help anything. You can make any argument with logic like that. The thing isn't being exported to userland, not even close. The patch description is simply untrue. It does affect how wq behaves under heavy CPU load. The effect might be perfectly okay but more likely it will result in subtle suboptimal behaviors under certain load situations which would be difficult to characterize and track down. Again, the trade off (mostly killing of ttwu_local) could be okay but you can't get away with just claiming "there's no harm". Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 10:15 ` Tejun Heo @ 2011-06-23 10:44 ` Ingo Molnar 2011-06-23 11:35 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Ingo Molnar @ 2011-06-23 10:44 UTC (permalink / raw) To: Tejun Heo Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds * Tejun Heo <tj@kernel.org> wrote: > The patch description is simply untrue. It does affect how wq > behaves under heavy CPU load. The effect might be perfectly okay > but more likely it will result in subtle suboptimal behaviors under > certain load situations which would be difficult to characterize > and track down. Again, the trade off (mostly killing of > ttwu_local) could be okay but you can't get away with just claiming > "there's no harm". Well, either it can be measured or not. If you can suggest a specific testing method to Thomas, please do. Thanks, Ingo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 10:44 ` Ingo Molnar @ 2011-06-23 11:35 ` Tejun Heo 2011-06-23 12:51 ` Ingo Molnar 2011-06-24 9:01 ` Thomas Gleixner 0 siblings, 2 replies; 27+ messages in thread From: Tejun Heo @ 2011-06-23 11:35 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds Hello, Ingo. On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Tejun Heo <tj@kernel.org> wrote: > > > >> The patch description is simply untrue. It does affect how wq > >> behaves under heavy CPU load. The effect might be perfectly okay > >> but more likely it will result in subtle suboptimal behaviors under > >> certain load situations which would be difficult to characterize > >> and track down. Again, the trade off (mostly killing of > >> ttwu_local) could be okay but you can't get away with just claiming > >> "there's no harm". > > > > Well, either it can be measured or not. If you can suggest a specific > > testing method to Thomas, please do. Crafting a test case where the suggested change results in worse behavior isn't difficult (it ends up waking/creating workers which it doesn't have to between ttwu and actual execution); however, as with any micro benchmark, the problem is with assessing whether and how much it would matter in actual workloads (whatever that means) and workqueue is used throughout the kernel with widely varying patterns making drawing conclusions a bit tricky. Given that, changing the behavior for the worse just for this cleanup doesn't sound like too sweet a deal. Is there any other reason to change the behavior (latency, whatever) other than the ttwu_local ugliness? Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 11:35 ` Tejun Heo @ 2011-06-23 12:51 ` Ingo Molnar 2011-06-24 9:01 ` Thomas Gleixner 1 sibling, 0 replies; 27+ messages in thread From: Ingo Molnar @ 2011-06-23 12:51 UTC (permalink / raw) To: Tejun Heo Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds * Tejun Heo <tj@kernel.org> wrote: > Hello, Ingo. > > On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Tejun Heo <tj@kernel.org> wrote: > > > >> The patch description is simply untrue. It does affect how wq > >> behaves under heavy CPU load. The effect might be perfectly okay > >> but more likely it will result in subtle suboptimal behaviors under > >> certain load situations which would be difficult to characterize > >> and track down. Again, the trade off (mostly killing of > >> ttwu_local) could be okay but you can't get away with just claiming > >> "there's no harm". > > > > Well, either it can be measured or not. If you can suggest a specific > > testing method to Thomas, please do. > > Crafting a test case where the suggested change results in worse > behavior isn't difficult (it ends up waking/creating workers which > it doesn't have to between ttwu and actual execution); however, as > with any micro benchmark, the problem is with assessing whether and > how much it would matter in actual workloads (whatever that means) > and workqueue is used throughout the kernel with widely varying > patterns making drawing conclusion a bit tricky. [...] Well, please suggest a workload where it *matters* - as i suspect any workload tglx will come up with will have another 90% of workloads that you could suggest: so it's much better if you suggest a testing method. When someone comes to me with a scheduler change i can give them the workloads that they should double check. 
See the changelog of this recent commit for example: c8b281161dfa: sched: Increase SCHED_LOAD_SCALE resolution So please suggest a testing method. > [...] Given that, changing the behavior for the worse just for this > cleanup doesn't sound like too sweet a deal. Is there any other > reason to change the behavior (latency, whatever) other than the > ttwu_local ugliness? Well, the ugliness is one aspect of it, but my main concern is simply testability: any claim of speedup or slowdown ought to be testable, right? I mean, we'd like to drive people towards coming with patches and numbers like Nikhil Rao did in c8b281161dfa, right? Thanks, Ingo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-23 11:35 ` Tejun Heo 2011-06-23 12:51 ` Ingo Molnar @ 2011-06-24 9:01 ` Thomas Gleixner 2011-06-26 10:19 ` Tejun Heo 1 sibling, 1 reply; 27+ messages in thread From: Thomas Gleixner @ 2011-06-24 9:01 UTC (permalink / raw) To: Tejun Heo; +Cc: Ingo Molnar, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds [-- Attachment #1: Type: TEXT/PLAIN, Size: 2906 bytes --] On Thu, 23 Jun 2011, Tejun Heo wrote: > Hello, Ingo. > > On Thu, Jun 23, 2011 at 12:44 PM, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Tejun Heo <tj@kernel.org> wrote: > > > >> The patch description is simply untrue. It does affect how wq > >> behaves under heavy CPU load. The effect might be perfectly okay > >> but more likely it will result in subtle suboptimal behaviors under > >> certain load situations which would be difficult to characterize > >> and track down. Again, the trade off (mostly killing of > >> ttwu_local) could be okay but you can't get away with just claiming > >> "there's no harm". Fair enough. I'm happy to reword this. > > Well, either it can be measured or not. If you can suggest a specific > > testing method to Thomas, please do. > > Crafting a test case where the suggested change results in worse > behavior isn't difficult (it ends up waking/creating workers which it > doesn't have to between ttwu and actual execution); however, as with > any micro benchmark, the problem is with assessing whether and how > much it would matter in actual workloads (whatever that means) and > workqueue is used throughout the kernel with widely varying patterns > making drawing conclusion a bit tricky. Given that, changing the > behavior for the worse just for this cleanup doesn't sound like too > sweet a deal. Is there any other reason to change the behavior > (latency, whatever) other than the ttuw_local ugliness? ttwu_local should have never been there in the first place. And yes, it's a latency and a correctness issue. 
Workqueues are not part of the scheduler, period. I'm fine with a call in pre/post schedule, so it's a separate code path and not convoluted into the guts of the scheduler code. The whole design is completely plugged into the scheduler, as you _abuse_ rq->lock to serialize the accounting and allow the unlocked access to the idle_list. It does not matter at all whether that code path is fast or not, it simply does not belong there. We could do a lot of other fancy accounting tricks with hooks inside the scheduler. If we tolerate workqueue hackery in there how do we rule out other crap which wants to have hooks plugged into that? There are issues which can't be solved other than from the scheduler core, e.g. the VCPU switch, but the workqueue overload accounting is definitely not in that category. The only reason why you want this precise accounting thing is a corner case of system overload, but not the common case. In case of system overload you can do extra work to figure out how many of these beasts are on the fly and whether it makes sense to start some more, simply because it does not matter in an overload situation whether you run through a list to gather information. Can we please stop adding stuff to the most crucial code paths in the kernel just to avoid thinking a bit harder about problems? Thanks, tglx ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-24 9:01 ` Thomas Gleixner @ 2011-06-26 10:19 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2011-06-26 10:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, LKML, Peter Zijlstra, Jens Axboe, Linus Torvalds Hello, Thomas. On Fri, Jun 24, 2011 at 11:01:12AM +0200, Thomas Gleixner wrote: > The whole design is completely plugged into the scheduler, as you > _abuse_ rq->lock to serialize the accounting and allow the unlocked > access to the idle_list. It does not matter at all whether that code > path is fast or not, it simply does not belong there. It didn't happen like that. Hooking into the scheduler was one of the biggest issues and there were several different attempts at using / generalizing existing hooks somehow which took a good six months without much agreement from the scheduler side. I don't recall all the details now but IIRC it tried to use fire_sched_out_preempt_notifiers() which was called inside rq lock which introduced all the convolution with rq lock. Later somebody suggested just open coding it and being done with it and that's how it ended up inside rq lock. Hindsight is 20/20 and we could have pulled it out of rq lock at that time but the rest of the code has been already running with various different hooking schemes for quite some time by then. Anyways, so, let's clean it up. > The only reason why you want this precise accounting thing is a corner > case of system overload, but not the common case. In case of system > overload you can do extra work to figure out how many of these beasts > are on the fly and whether it makes sense to start some more, simply > because it does not matter in an overload situation whether you run > through a list to gather information. Sure, there are multiple ways to resolve the situation.
I was mostly worried about waking up / attempting to start more workers under CPU pressure (this is fine under memory pressure as CPU is idle and memory allocation failure / blocking would quickly recall rescuers). I've been looking at the code and am still not sure but it could be that this isn't really something to worry about. Workqueue code always wakes up the first idle worker and a worker leaves idle state not by its waker but of its own accord when it starts running, which means that, with the suggested change, there can be multiple spurious wake ups but they'll all try to wake up the same idle worker which hasn't started running yet. We might still unnecessarily try to summon rescuers tho. It needs more consideration and experimentation. Another concern is the additional spin_lock in the sleeping path to synchronize against the CPU unplug operation I mentioned in the other reply. Another way to do it would be using preempt_disable/enable() instead and calling synchronize_sched() from trustee. It would only add an extra preemption flipping compared to the current code. > Can we please stop adding stuff to the most crucial code paths in the > kernel just to avoid thinking a bit harder about problems? Heh... sure, let's just keep it technical, okay? Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner 2011-06-22 19:30 ` Thomas Gleixner 2011-06-23 8:37 ` Tejun Heo @ 2011-06-23 15:07 ` Tejun Heo 2013-04-30 13:37 ` Steven Rostedt 3 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2011-06-23 15:07 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds Hello, again. So, let's get it correct first. On Wed, Jun 22, 2011 at 05:52:15PM -0000, Thomas Gleixner wrote: > The worker accounting for cpu bound workers is plugged into the core > scheduler code and the wakeup code. This is not a hard requirement and > can be avoided by keeping track of the state in the workqueue code > itself. > > Keep track of the sleeping state in the worker itself and call the > notifier before entering the core scheduler. There might be false > positives when the task is woken between that call and actually > scheduling, but that's not really different from scheduling and being > woken immediately after switching away. There is also no harm from > updating nr_running when the task returns from scheduling instead of > accounting it in the wakeup code. I think false positives on schedule() should be safe. As said earlier, the gap between ttwu and actually running is a bit worrisome but it might be nothing, but please at least describe the behavior change. > Index: linux-2.6/kernel/workqueue.c > =================================================================== > --- linux-2.6.orig/kernel/workqueue.c > +++ linux-2.6/kernel/workqueue.c > @@ -137,6 +137,7 @@ struct worker { > unsigned int flags; /* X: flags */ > int id; /* I: worker id */ > struct work_struct rebind_work; /* L: rebind worker to cpu */ > + int sleeping; /* None */ bool? 
> -struct task_struct *wq_worker_sleeping(struct task_struct *task, > - unsigned int cpu) > +void wq_worker_sleeping(struct task_struct *task) > { > - struct worker *worker = kthread_data(task), *to_wakeup = NULL; > - struct global_cwq *gcwq = get_gcwq(cpu); > - atomic_t *nr_running = get_gcwq_nr_running(cpu); > + struct worker *worker = kthread_data(task); > + struct global_cwq *gcwq; > + int cpu; > > if (worker->flags & WORKER_NOT_RUNNING) > - return NULL; > + return; This doesn't look safe. It can race with trustee_thread() setting WORKER_ROGUE. Let's just grab gcwq->lock on entry to wq_worker_sleeping() for now; then, the schedule() trickery in trustee_thread() can go away too. This also means we can remove the weird sync rules from ->flags and ->idle_list and just use simple gcwq->lock for those, which is pretty nice. > - /* this can only happen on the local cpu */ > - BUG_ON(cpu != raw_smp_processor_id()); > + if (WARN_ON_ONCE(worker->sleeping)) > + return; Re-entrance is prevented by the scheduler hook being called only for non-preemption schedule(). Maybe it's better to explain that in the function comment? Hmmm... Also, I think worker->sleeping should be cleared by trustee_thread() when WORKER_ROGUE is set for the worker; otherwise, it can get out of sync and the above WARN_ON_ONCE() will trigger. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
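Tejun's locking suggestion — do the flag test and the running-count adjustment under gcwq->lock so they cannot interleave with trustee_thread() setting WORKER_ROGUE under the same lock — has roughly the following shape. This is a userspace sketch with invented toy_* names and a stand-in spinlock, not the actual kernel code:

```c
#include <stdatomic.h>

#define WORKER_ROGUE		(1 << 0)
/* the real WORKER_NOT_RUNNING covers more flags; one suffices here */
#define WORKER_NOT_RUNNING	WORKER_ROGUE

struct toy_worker {
	unsigned int flags;		/* written by the "trustee" */
	int counts_as_running;		/* contribution to nr_running */
};

/* stand-in for spin_lock_irq(&gcwq->lock) / spin_unlock_irq(...) */
static atomic_flag toy_gcwq_lock = ATOMIC_FLAG_INIT;

void toy_lock(void)
{
	while (atomic_flag_test_and_set(&toy_gcwq_lock))
		;			/* spin */
}

void toy_unlock(void)
{
	atomic_flag_clear(&toy_gcwq_lock);
}

/*
 * Sketch of wq_worker_sleeping() per the suggestion above: test the
 * flags and drop the running contribution inside the lock, so a
 * trustee that sets WORKER_ROGUE under the same lock can never
 * interleave with the test.
 */
void toy_worker_sleeping(struct toy_worker *w)
{
	toy_lock();
	if (!(w->flags & WORKER_NOT_RUNNING))
		w->counts_as_running = 0;
	toy_unlock();
}
```

A rogue worker keeps its (zero) contribution untouched, while a normal worker going to sleep stops counting as running — and both outcomes are decided atomically with respect to the trustee.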
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner ` (2 preceding siblings ...) 2011-06-23 15:07 ` Tejun Heo @ 2013-04-30 13:37 ` Steven Rostedt 2013-04-30 22:47 ` Steven Rostedt 3 siblings, 1 reply; 27+ messages in thread From: Steven Rostedt @ 2013-04-30 13:37 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds [ Blast from the past! ] When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the try_to_wake_up_local() call. It seems that the 3.4-rt patch has this patch applied. Although, this is not applied to any of the other -rt patches. My question is, what was the final verdict of this patch? The thread sorta died without a definite answer. Is this a change we want to add upstream? If not, is this a change I should remove from the 3.4-rt branch? Note, -rt did hit a bug with the try_to_wake_up_local() code, so removing it is a welcome idea. We are still trying to reproduce that bug. Thanks, -- Steve On Wed, Jun 22, 2011 at 05:52:15PM -0000, Thomas Gleixner wrote: > The worker accounting for cpu bound workers is plugged into the core > scheduler code and the wakeup code. This is not a hard requirement and > can be avoided by keeping track of the state in the workqueue code > itself. > > Keep track of the sleeping state in the worker itself and call the > notifier before entering the core scheduler. There might be false > positives when the task is woken between that call and actually > scheduling, but that's not really different from scheduling and being > woken immediately after switching away. There is also no harm from > updating nr_running when the task returns from scheduling instead of > accounting it in the wakeup code. 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > --- > kernel/sched.c | 66 ++++++++++------------------------------------ > kernel/workqueue.c | 67 ++++++++++++++++++++--------------------------- > kernel/workqueue_sched.h | 5 +-- > 3 files changed, 46 insertions(+), 92 deletions(-) > > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -2477,10 +2477,6 @@ static void ttwu_activate(struct rq *rq, > { > activate_task(rq, p, en_flags); > p->on_rq = 1; > - > - /* if a worker is waking up, notify workqueue */ > - if (p->flags & PF_WQ_WORKER) > - wq_worker_waking_up(p, cpu_of(rq)); > } > > /* > @@ -2703,40 +2699,6 @@ out: > } > > /** > - * try_to_wake_up_local - try to wake up a local task with rq lock held > - * @p: the thread to be awakened > - * > - * Put @p on the run-queue if it's not already there. The caller must > - * ensure that this_rq() is locked, @p is bound to this_rq() and not > - * the current task. > - */ > -static void try_to_wake_up_local(struct task_struct *p) > -{ > - struct rq *rq = task_rq(p); > - > - BUG_ON(rq != this_rq()); > - BUG_ON(p == current); > - lockdep_assert_held(&rq->lock); > - > - if (!raw_spin_trylock(&p->pi_lock)) { > - raw_spin_unlock(&rq->lock); > - raw_spin_lock(&p->pi_lock); > - raw_spin_lock(&rq->lock); > - } > - > - if (!(p->state & TASK_NORMAL)) > - goto out; > - > - if (!p->on_rq) > - ttwu_activate(rq, p, ENQUEUE_WAKEUP); > - > - ttwu_do_wakeup(rq, p, 0); > - ttwu_stat(p, smp_processor_id(), 0); > -out: > - raw_spin_unlock(&p->pi_lock); > -} > - > -/** > * wake_up_process - Wake up a specific process > * @p: The process to be woken up. > * > @@ -4240,19 +4202,6 @@ need_resched: > } else { > deactivate_task(rq, prev, DEQUEUE_SLEEP); > prev->on_rq = 0; > - > - /* > - * If a worker went to sleep, notify and ask workqueue > - * whether it wants to wake up a task to maintain > - * concurrency. 
> - */ > - if (prev->flags & PF_WQ_WORKER) { > - struct task_struct *to_wakeup; > - > - to_wakeup = wq_worker_sleeping(prev, cpu); > - if (to_wakeup) > - try_to_wake_up_local(to_wakeup); > - } > } > switch_count = &prev->nvcsw; > } > @@ -4295,6 +4244,14 @@ static inline void sched_submit_work(str > { > if (!tsk->state) > return; > + > + /* > + * If a worker went to sleep, notify and ask workqueue whether > + * it wants to wake up a task to maintain concurrency. > + */ > + if (tsk->flags & PF_WQ_WORKER) > + wq_worker_sleeping(tsk); > + > /* > * If we are going to sleep and we have plugged IO queued, > * make sure to submit it to avoid deadlocks. > @@ -4303,12 +4260,19 @@ static inline void sched_submit_work(str > blk_schedule_flush_plug(tsk); > } > > +static inline void sched_update_worker(struct task_struct *tsk) > +{ > + if (tsk->flags & PF_WQ_WORKER) > + wq_worker_running(tsk); > +} > + > asmlinkage void schedule(void) > { > struct task_struct *tsk = current; > > sched_submit_work(tsk); > __schedule(); > + sched_update_worker(tsk); > } > EXPORT_SYMBOL(schedule); > > Index: linux-2.6/kernel/workqueue.c > =================================================================== > --- linux-2.6.orig/kernel/workqueue.c > +++ linux-2.6/kernel/workqueue.c > @@ -137,6 +137,7 @@ struct worker { > unsigned int flags; /* X: flags */ > int id; /* I: worker id */ > struct work_struct rebind_work; /* L: rebind worker to cpu */ > + int sleeping; /* None */ > }; > > /* > @@ -657,66 +658,56 @@ static void wake_up_worker(struct global > } > > /** > - * wq_worker_waking_up - a worker is waking up > - * @task: task waking up > - * @cpu: CPU @task is waking up to > + * wq_worker_running - a worker is running again > + * @task: task returning from sleep > * > - * This function is called during try_to_wake_up() when a worker is > - * being awoken. 
> - * > - * CONTEXT: > - * spin_lock_irq(rq->lock) > + * This function is called when a worker returns from schedule() > */ > -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu) > +void wq_worker_running(struct task_struct *task) > { > struct worker *worker = kthread_data(task); > > + if (!worker->sleeping) > + return; > if (!(worker->flags & WORKER_NOT_RUNNING)) > - atomic_inc(get_gcwq_nr_running(cpu)); > + atomic_inc(get_gcwq_nr_running(smp_processor_id())); > + worker->sleeping = 0; > } > > /** > * wq_worker_sleeping - a worker is going to sleep > * @task: task going to sleep > - * @cpu: CPU in question, must be the current CPU number > - * > - * This function is called during schedule() when a busy worker is > - * going to sleep. Worker on the same cpu can be woken up by > - * returning pointer to its task. > - * > - * CONTEXT: > - * spin_lock_irq(rq->lock) > * > - * RETURNS: > - * Worker task on @cpu to wake up, %NULL if none. > + * This function is called from schedule() when a busy worker is > + * going to sleep. > */ > -struct task_struct *wq_worker_sleeping(struct task_struct *task, > - unsigned int cpu) > +void wq_worker_sleeping(struct task_struct *task) > { > - struct worker *worker = kthread_data(task), *to_wakeup = NULL; > - struct global_cwq *gcwq = get_gcwq(cpu); > - atomic_t *nr_running = get_gcwq_nr_running(cpu); > + struct worker *worker = kthread_data(task); > + struct global_cwq *gcwq; > + int cpu; > > if (worker->flags & WORKER_NOT_RUNNING) > - return NULL; > + return; > > - /* this can only happen on the local cpu */ > - BUG_ON(cpu != raw_smp_processor_id()); > + if (WARN_ON_ONCE(worker->sleeping)) > + return; > > + cpu = smp_processor_id(); > + gcwq = get_gcwq(cpu); > + spin_lock_irq(&gcwq->lock); > /* > * The counterpart of the following dec_and_test, implied mb, > * worklist not empty test sequence is in insert_work(). > * Please read comment there. > - * > - * NOT_RUNNING is clear. 
This means that trustee is not in > - * charge and we're running on the local cpu w/ rq lock held > - * and preemption disabled, which in turn means that none else > - * could be manipulating idle_list, so dereferencing idle_list > - * without gcwq lock is safe. > - */ > - if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist)) > - to_wakeup = first_worker(gcwq); > - return to_wakeup ? to_wakeup->task : NULL; > + */ > + if (atomic_dec_and_test(get_gcwq_nr_running(cpu)) && > + !list_empty(&gcwq->worklist)) { > + worker = first_worker(gcwq); > + if (worker) > + wake_up_process(worker->task); > + } > + spin_unlock_irq(&gcwq->lock); > } > > /** > Index: linux-2.6/kernel/workqueue_sched.h > =================================================================== > --- linux-2.6.orig/kernel/workqueue_sched.h > +++ linux-2.6/kernel/workqueue_sched.h > @@ -4,6 +4,5 @@ > * Scheduler hooks for concurrency managed workqueue. Only to be > * included from sched.c and workqueue.c. > */ > -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu); > -struct task_struct *wq_worker_sleeping(struct task_struct *task, > - unsigned int cpu); > +void wq_worker_running(struct task_struct *task); > +void wq_worker_sleeping(struct task_struct *task); > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2013-04-30 13:37 ` Steven Rostedt @ 2013-04-30 22:47 ` Steven Rostedt 2013-05-03 0:12 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Steven Rostedt @ 2013-04-30 22:47 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, Peter Zijlstra, Tejun Heo, Jens Axboe, Ingo Molnar, Linus Torvalds On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote: > [ Blast from the past! ] > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this > patch applied. Although, this is not applied to any of the other -rt patches. > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt nor in 3.8-rt. -- Steve ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2013-04-30 22:47 ` Steven Rostedt @ 2013-05-03 0:12 ` Tejun Heo 2013-05-03 0:57 ` Steven Rostedt 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-05-03 0:12 UTC (permalink / raw) To: Steven Rostedt Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote: > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote: > > [ Blast from the past! ] > > > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this > > patch applied. Although, this is not applied to any of the other -rt patches. > > > > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt > nor in 3.8-rt. So, it's all good? Or is there something I need to look into? Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2013-05-03 0:12 ` Tejun Heo @ 2013-05-03 0:57 ` Steven Rostedt 2013-07-24 10:04 ` Thomas Gleixner 0 siblings, 1 reply; 27+ messages in thread From: Steven Rostedt @ 2013-05-03 0:57 UTC (permalink / raw) To: Tejun Heo Cc: Thomas Gleixner, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds On Thu, 2013-05-02 at 17:12 -0700, Tejun Heo wrote: > On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote: > > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote: > > > [ Blast from the past! ] > > > > > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the > > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this > > > patch applied. Although, this is not applied to any of the other -rt patches. > > > > > > > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt > > nor in 3.8-rt. > > So, it's all good? Or is there something I need to look into? It looks good to me. I don't know why it's not in 3.6-rt or 3.8-rt. Was there a reason that Thomas took it out? I don't know. Maybe it's not needed or he thought it went mainline? -- Steve ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock 2013-05-03 0:57 ` Steven Rostedt @ 2013-07-24 10:04 ` Thomas Gleixner 2013-08-06 19:33 ` Steven Rostedt 0 siblings, 1 reply; 27+ messages in thread From: Thomas Gleixner @ 2013-07-24 10:04 UTC (permalink / raw) To: Steven Rostedt Cc: Tejun Heo, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar, Linus Torvalds, Sebastian Sewior On Thu, 2 May 2013, Steven Rostedt wrote: > On Thu, 2013-05-02 at 17:12 -0700, Tejun Heo wrote: > > On Tue, Apr 30, 2013 at 06:47:10PM -0400, Steven Rostedt wrote: > > > On Tue, Apr 30, 2013 at 09:37:22AM -0400, Steven Rostedt wrote: > > > > [ Blast from the past! ] > > > > > > > > When merging in 3.4.42 into the 3.4-rt branch I hit a conflict with the > > > > try_to_wake_up_local() call. It seems that the 3.4-rt patch has this > > > > patch applied. Although, this is not applied to any of the other -rt patches. > > > > > > > > > > I take that back. It's in 3.0-rt, 3.2-rt and 3.4-rt, but it's not in 3.6-rt > > > nor in 3.8-rt. > > > > So, it's all good? Or is there something I need to look into? > > It looks good to me. I don't know why it's not in 3.6-rt or 3.8-rt. Was > there a reason that Thomas took it out? I don't know. Maybe it's not > needed or he thought it went mainline? I dropped it on purpose as I was sure that it's safe. But after you poked me yesterday I spent quite some time staring at that code and found that I missed the following: worker A is about to go idle and the pool->idle_list is empty calls worker_enter_idle() list_add(&worker->entry, &pool->idle_list); idle_list.prev = &worker->entry; Preemption Worker B runs and blocks. wq_worker_sleeping() sees !list_empty(&pool->idle_list) because idle_list.prev points already to worker A Then first_worker returns idle_list.next which points to idle list so we return some random crap to wakeup.
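The half-finished list_add() interleaving described above can be modeled in a few lines of plain C. This is a single-threaded simulation with stand-in types, not kernel code, and the store order shown is only illustrative; the point is that without the lock nothing stops an observer from seeing ->prev updated while ->next still points at the list head, so taking container_of() on ->next yields garbage instead of a worker:

```c
#include <stddef.h>

/* minimal stand-ins, not the kernel's list.h or struct worker */
struct list_head { struct list_head *next, *prev; };

struct worker {
	long other_fields;		/* anything before the linkage */
	struct list_head entry;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

void init_list_head(struct list_head *head)
{
	head->next = head->prev = head;
}

/*
 * Model of worker A preempted in the middle of worker_enter_idle():
 * its entry is already reachable via idle_list.prev, but
 * idle_list.next has not been updated yet.
 */
void torn_list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->prev = new;
	/* preempted here, before: head->next = new; */
}

/*
 * The lockless first-worker lookup from worker B's context: the
 * emptiness check passes via ->prev, then ->next -- which still
 * points at the list head itself -- is treated as a worker.
 */
struct worker *bogus_first_worker(struct list_head *head)
{
	if (head->prev != head)		/* list looks non-empty */
		return container_of(head->next, struct worker, entry);
	return NULL;
}
```

The returned pointer is neither NULL nor worker A: it is the list head reinterpreted as a worker, i.e. exactly the "random crap" the mail describes.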
So yes, I've donned a brown paperbag and we need to bring back that
change and think about making it more palatable for mainline.

Find an untested patch against 3.6-rt below.

Thanks,

        tglx
---
Index: linux-stable-rt/kernel/sched/core.c
===================================================================
--- linux-stable-rt.orig/kernel/sched/core.c
+++ linux-stable-rt/kernel/sched/core.c
@@ -1452,10 +1452,6 @@ static void ttwu_activate(struct rq *rq,
 {
        activate_task(rq, p, en_flags);
        p->on_rq = 1;
-
-       /* if a worker is waking up, notify workqueue */
-       if (p->flags & PF_WQ_WORKER)
-               wq_worker_waking_up(p, cpu_of(rq));
 }

 /*
@@ -1714,42 +1710,6 @@ out:
 }

 /**
- * try_to_wake_up_local - try to wake up a local task with rq lock held
- * @p: the thread to be awakened
- *
- * Put @p on the run-queue if it's not already there. The caller must
- * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.
- */
-static void try_to_wake_up_local(struct task_struct *p)
-{
-       struct rq *rq = task_rq(p);
-
-       if (WARN_ON_ONCE(rq != this_rq()) ||
-           WARN_ON_ONCE(p == current))
-               return;
-
-       lockdep_assert_held(&rq->lock);
-
-       if (!raw_spin_trylock(&p->pi_lock)) {
-               raw_spin_unlock(&rq->lock);
-               raw_spin_lock(&p->pi_lock);
-               raw_spin_lock(&rq->lock);
-       }
-
-       if (!(p->state & TASK_NORMAL))
-               goto out;
-
-       if (!p->on_rq)
-               ttwu_activate(rq, p, ENQUEUE_WAKEUP);
-
-       ttwu_do_wakeup(rq, p, 0);
-       ttwu_stat(p, smp_processor_id(), 0);
-out:
-       raw_spin_unlock(&p->pi_lock);
-}
-
-/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -3627,19 +3587,6 @@ need_resched:
        } else {
                deactivate_task(rq, prev, DEQUEUE_SLEEP);
                prev->on_rq = 0;
-
-               /*
-                * If a worker went to sleep, notify and ask workqueue
-                * whether it wants to wake up a task to maintain
-                * concurrency.
-                */
-               if (prev->flags & PF_WQ_WORKER) {
-                       struct task_struct *to_wakeup;
-
-                       to_wakeup = wq_worker_sleeping(prev, cpu);
-                       if (to_wakeup)
-                               try_to_wake_up_local(to_wakeup);
-               }
        }
        switch_count = &prev->nvcsw;
 }
@@ -3683,6 +3630,14 @@ static inline void sched_submit_work(str
 {
        if (!tsk->state || tsk_is_pi_blocked(tsk))
                return;
+
+       /*
+        * If a worker went to sleep, notify and ask workqueue whether
+        * it wants to wake up a task to maintain concurrency.
+        */
+       if (tsk->flags & PF_WQ_WORKER)
+               wq_worker_sleeping(tsk);
+
        /*
        * If we are going to sleep and we have plugged IO queued,
        * make sure to submit it to avoid deadlocks.
@@ -3691,12 +3646,19 @@ static inline void sched_submit_work(str
        blk_schedule_flush_plug(tsk);
 }

+static inline void sched_update_worker(struct task_struct *tsk)
+{
+       if (tsk->flags & PF_WQ_WORKER)
+               wq_worker_running(tsk);
+}
+
 asmlinkage void __sched schedule(void)
 {
        struct task_struct *tsk = current;

        sched_submit_work(tsk);
        __schedule();
+       sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);

Index: linux-stable-rt/kernel/workqueue.c
===================================================================
--- linux-stable-rt.orig/kernel/workqueue.c
+++ linux-stable-rt/kernel/workqueue.c
@@ -152,6 +152,7 @@ struct worker {
        /* for rebinding worker to CPU */
        struct idle_rebind      *idle_rebind;   /* L: for idle worker */
        struct work_struct      rebind_work;    /* L: for busy worker */
+       int                     sleeping;       /* None */
 };

 struct worker_pool {
@@ -691,66 +692,55 @@ static void wake_up_worker(struct worker
 }

 /**
- * wq_worker_waking_up - a worker is waking up
- * @task: task waking up
- * @cpu: CPU @task is waking up to
+ * wq_worker_waking_up - a worker is running again
+ * @task: task returning from sleep
  *
- * This function is called during try_to_wake_up() when a worker is
- * being awoken.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
+ * This function is called when a worker returns from a blocking
+ * schedule.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task)
 {
        struct worker *worker = kthread_data(task);

+       if (!worker->sleeping)
+               return;
+
        if (!(worker->flags & WORKER_NOT_RUNNING))
                atomic_inc(get_pool_nr_running(worker->pool));
+       worker->sleeping = 0;
 }

 /**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
- * @cpu: CPU in question, must be the current CPU number
  *
  * This function is called during schedule() when a busy worker is
- * going to sleep. Worker on the same cpu can be woken up by
- * returning pointer to its task.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
- *
- * RETURNS:
- * Worker task on @cpu to wake up, %NULL if none.
+ * going to sleep.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-                                      unsigned int cpu)
+void wq_worker_sleeping(struct task_struct *task)
 {
-       struct worker *worker = kthread_data(task), *to_wakeup = NULL;
+       struct worker *worker = kthread_data(task);
        struct worker_pool *pool = worker->pool;
-       atomic_t *nr_running = get_pool_nr_running(pool);
+       struct global_cwq *gcwq = pool->gcwq;
+       atomic_t *nr_running;

        if (worker->flags & WORKER_NOT_RUNNING)
-               return NULL;
+               return;

-       /* this can only happen on the local cpu */
-       BUG_ON(cpu != raw_smp_processor_id());
+       if (WARN_ON_ONCE(worker->sleeping))
+               return;

+       worker->sleeping = 1;
+       spin_lock_irq(&gcwq->lock);
+       nr_running = get_pool_nr_running(pool);
        /*
        * The counterpart of the following dec_and_test, implied mb,
        * worklist not empty test sequence is in insert_work().
        * Please read comment there.
-       *
-       * NOT_RUNNING is clear. This means that we're bound to and
-       * running on the local cpu w/ rq lock held and preemption
-       * disabled, which in turn means that none else could be
-       * manipulating idle_list, so dereferencing idle_list without gcwq
-       * lock is safe.
        */
        if (atomic_dec_and_test(nr_running) && !list_empty(&pool->worklist))
-               to_wakeup = first_worker(pool);
-       return to_wakeup ? to_wakeup->task : NULL;
+               wake_up_process(first_worker(pool)->task);
+       spin_unlock_irq(&gcwq->lock);
 }

 /**
Index: linux-stable-rt/kernel/workqueue_sched.h
===================================================================
--- linux-stable-rt.orig/kernel/workqueue_sched.h
+++ linux-stable-rt/kernel/workqueue_sched.h
@@ -4,6 +4,5 @@
  * Scheduler hooks for concurrency managed workqueue. Only to be
  * included from sched.c and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task,
-                                      unsigned int cpu);
+void wq_worker_running(struct task_struct *task);
+void wq_worker_sleeping(struct task_struct *task);
* Re: [patch 4/4] sched: Distangle worker accounting from rq->lock
  2013-07-24 10:04 ` Thomas Gleixner
@ 2013-08-06 19:33 ` Steven Rostedt
  0 siblings, 0 replies; 27+ messages in thread

From: Steven Rostedt @ 2013-08-06 19:33 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Tejun Heo, LKML, Peter Zijlstra, Jens Axboe, Ingo Molnar,
    Linus Torvalds, Sebastian Sewior

On Wed, 2013-07-24 at 12:04 +0200, Thomas Gleixner wrote:

[ ... ]

> +static inline void sched_update_worker(struct task_struct *tsk)
> +{
> +       if (tsk->flags & PF_WQ_WORKER)
> +               wq_worker_running(tsk);

I think you forgot to add the wq_worker_running() function; I don't see
it in 3.6-rt or 3.8-rt.

-- Steve
end of thread, other threads:[~2013-08-06 19:33 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-22 17:52 [patch 0/4] sched: Move work out of the scheduler core Thomas Gleixner
2011-06-22 17:52 ` [patch 1/4] sched: Separate the scheduler entry for preemption Thomas Gleixner
2011-06-22 18:43 ` Christoph Hellwig
2011-06-22 18:52 ` Thomas Gleixner
2011-06-22 19:42 ` Jens Axboe
2011-06-22 20:15 ` Thomas Gleixner
2011-06-23 11:41 ` Jens Axboe
2011-08-29 14:55 ` [tip:sched/urgent] " tip-bot for Thomas Gleixner
2011-06-22 17:52 ` [patch 3/4] block: Shorten interrupt disabled regions Thomas Gleixner
2011-06-22 17:52 ` [patch 2/4] sched: Move blk_schedule_flush_plug() out of __schedule() Thomas Gleixner
2011-06-22 17:52 ` [patch 4/4] sched: Distangle worker accounting from rq->lock Thomas Gleixner
2011-06-22 19:30 ` Thomas Gleixner
2011-06-23  8:37 ` Tejun Heo
2011-06-23  9:58 ` Thomas Gleixner
2011-06-23 10:15 ` Tejun Heo
2011-06-23 10:44 ` Ingo Molnar
2011-06-23 11:35 ` Tejun Heo
2011-06-23 12:51 ` Ingo Molnar
2011-06-24  9:01 ` Thomas Gleixner
2011-06-26 10:19 ` Tejun Heo
2011-06-23 15:07 ` Tejun Heo
2013-04-30 13:37 ` Steven Rostedt
2013-04-30 22:47 ` Steven Rostedt
2013-05-03  0:12 ` Tejun Heo
2013-05-03  0:57 ` Steven Rostedt
2013-07-24 10:04 ` Thomas Gleixner
2013-08-06 19:33 ` Steven Rostedt