From: Mike Galbraith
To: Peter Zijlstra
Cc: Willy Tarreau, Linus Torvalds, Xavier Bestel, Mark Lord, Al Boldi, Con Kolivas, ck@vds.kolivas.org, Serge Belyshev, Ingo Molnar, linux-kernel@vger.kernel.org, Nicholas Miell, Andrew Morton
Subject: Re: RSDL v0.31
Date: Thu, 22 Mar 2007 08:07:36 +0100
Message-Id: <1174547256.7414.129.camel@Homer.simpson.net>
In-Reply-To: <1174492966.16411.1.camel@twins>

This is a rather long message, and it isn't directed at anyone in
particular.  It's for others who may be digging into their own problems
with RSDL, and for anyone (if anyone other than Con exists) who
understands RSDL well enough to tell me whether I'm missing something.
Anyone who's not interested in RSDL's gizzard, hit 'D' now.

On Wed, 2007-03-21 at 17:02 +0100, Peter Zijlstra wrote:
> On Wed, 2007-03-21 at 15:57 +0100, Mike Galbraith wrote:
>
> > 'f' is a progglet which sleeps a bit and burns a bit, duration depending
> > on argument given.  'sh' is a shell 100% hog.  In this scenario, the
> > argument was set such that 'f' used right at 50% cpu.  All are started
> > at the same time, and I froze top when the first 'f' reached 1:00.
>
> May one enquire how much CPU the mythical 'f' uses when ran alone? Just
> to get a gauge for the numbers?

Actually, the numbers are an interesting curiosity point, but not as
interesting as the fact that the deadline mechanism isn't kicking in.

From task_running_tick():

	/*
	 * Accounting is performed by both the task and the runqueue. This
	 * allows frequently sleeping tasks to get their proper quota of
	 * cpu as the runqueue will have their quota still available at
	 * the appropriate priority level. It also means frequently waking
	 * tasks that might miss the scheduler_tick() will get forced down
	 * priority regardless.
	 */
	if (!--p->time_slice)
		task_expired_entitlement(rq, p);
	/*
	 * We only employ the deadline mechanism if we run over the quota.
	 * It allows aliasing problems around the scheduler_tick to be
	 * less harmful.
	 */
	if (!rt_task(p) && --rq_quota(rq, rq->prio_level) < 0) {
		if (unlikely(p->first_time_slice))
			p->first_time_slice = 0;
		rotate_runqueue_priority(rq);
		set_tsk_need_resched(p);
	}

The reason for ticking both the runqueue and the task is that you can't
sample a, say, 100KHz information stream at 1KHz and reproduce that
information accurately.  IOW, task time slices "blur" at high switch
frequency; you can't always hit the tasks, so you hit what you _can_ hit
every sample, the runqueue, to minimize the theoretical effects of time
slice theft.  (I've instrumented this before, and caught fast movers
stealing 10s of milliseconds in extreme cases.)  Generally speaking,
statistics even things out very much: the fast mover eventually gets hit,
and pays a full tick for his sub-tick dip in the pool, so in practice it's
not a great big hairy deal.
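If the sampling argument seems abstract, here's a toy userspace
illustration.  This is emphatically not RSDL code, the names are all mine,
and the phase-locked 300us burst is deliberately pathological: a "fast
mover" that always goes back to sleep before the tick fires is never
charged by per-task accounting, the hog picks up the whole bill, and a
counter charged to the runqueue on every tick misses nothing.

/* toy_blur.c: not RSDL code, all names made up.  The fast mover burns the
 * first 300us of every millisecond and sleeps through the tick at the end
 * of it; a hog runs the rest of the time. */
#include <stdio.h>

#define TICK_US		1000		/* scheduler_tick() period	*/
#define BURST_US	300		/* fast mover's run burst	*/
#define SIM_US		1000000		/* simulate one second		*/

int main(void)
{
	long fast_cpu = 0, hog_cpu = 0;
	long fast_ticks = 0, hog_ticks = 0, rq_ticks = 0;
	long t;

	for (t = 0; t < SIM_US; t++) {
		/* fast mover owns the first BURST_US of every tick period */
		int fast_running = (t % TICK_US) < BURST_US;

		if (fast_running)
			fast_cpu++;
		else
			hog_cpu++;

		/* the tick fires at the end of the period, long after the
		 * fast mover went back to sleep */
		if ((t + 1) % TICK_US == 0) {
			rq_ticks++;		/* runqueue pays regardless */
			if (fast_running)
				fast_ticks++;
			else
				hog_ticks++;	/* hog picks up the bill */
		}
	}

	printf("fast mover: %ldms cpu, charged %ld ticks\n",
	       fast_cpu / 1000, fast_ticks);
	printf("cpu hog:    %ldms cpu, charged %ld ticks\n",
	       hog_cpu / 1000, hog_ticks);
	printf("runqueue:   charged %ld ticks\n", rq_ticks);
	return 0;
}

Run it and the fast mover shows 300ms of cpu and zero ticks charged.  In
the real scheduler the phases drift and statistics smear this out, but the
dodge itself is real.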
If you can accept that tasks can and do dodge the tick, an imbalance
between runqueue quota and task quota must occur.  It isn't happening
here, and the reason appears to be a bean counting error: tasks migrate,
but their quotas don't follow.  The first time a task is queued at any
priority, quota is allocated; the task goes to sleep, the quota on the
departed runqueue stays behind, the task awakens on a different runqueue,
more quota is allocated, repeat.  For migration there's a twist: expired
tasks don't have a quota yet, so pulling an expired task shouldn't screw
up the bean counting.

From pull_task():

	/*
	 * If this task has already been running on src_rq this priority
	 * cycle, make the new runqueue think it has been on its cycle
	 */
	if (p->rotation == src_rq->prio_rotation)
		p->rotation = this_rq->prio_rotation;

The intent here is clearly that this task continue on the new cpu as if
nothing has happened.  However, when the task was dequeued, p->array was
left as it was; it points to the last place the task was queued.  Stale
data.

From recalc_task_prio(), which is called by enqueue_task():

static void recalc_task_prio(struct task_struct *p, struct rq *rq)
{
	struct prio_array *array = rq->active;
	int queue_prio, search_prio;

	if (p->rotation == rq->prio_rotation) {
		if (p->array == array) {
			if (p->time_slice && rq_quota(rq, p->prio))
				return;
		} else if (p->array == rq->expired) {
			queue_expired(p, rq);
			return;
		} else
			task_new_array(p, rq);
	} else
		task_new_array(p, rq);

	search_prio = p->static_prio;

p->rotation was set to this runqueue's prio_rotation, but p->array is
stale: it still points to the old cpu's runqueue, so...

static inline void task_new_array(struct task_struct *p, struct rq *rq)
{
	bitmap_zero(p->bitmap, PRIO_RANGE);
	p->rotation = rq->prio_rotation;
}

p->bitmap is the history of all priorities where this task has been
allocated a quota.  Here, that history is erased, so the task can't
continue its staircase walk.  It is instead given a new runqueue quota and
time_slice (didn't it just gain ticks?).

Now, what if a cross-cpu wakeup or migrating task _didn't_ have a stale
array pointer, but rather pointed to the current active array, _and_ there
was no quota available at its priority?  It didn't bring any of its old
quota with it, after all.  In that case, it would fall through as well,
and get the full new time_slice and runqueue quota treatment.

The next logical step, if the above is correct, is obvious.  If it's not,
really bad things are going to happen here shortly :)

	-Mike  (broom and dustpan at the ready)
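P.S.  For anyone who wants to stare at the bean counting leak in
userspace, here's a toy sketch of the scenario described above.  It is not
RSDL code: struct toy_rq, struct toy_task and the helpers are made up, it
just mimics "hand out quota at enqueue time, reclaim nothing at dequeue
time" across a migration.

/* toy_leak.c: not RSDL code, all names made up */
#include <stdio.h>
#include <string.h>

#define PRIO_RANGE	40

struct toy_rq {
	int quota[PRIO_RANGE];		/* per-priority runqueue quota	*/
};

struct toy_task {
	int prio;
	int time_slice;
};

/* enqueue: give the task a slice and top up this runqueue's quota */
static void toy_enqueue(struct toy_rq *rq, struct toy_task *p, int slice)
{
	p->time_slice = slice;
	rq->quota[p->prio] += slice;
}

/* dequeue on sleep or migration: nothing is handed back */
static void toy_dequeue(struct toy_rq *rq, struct toy_task *p)
{
	(void)rq;
	(void)p;
}

int main(void)
{
	struct toy_rq rq0, rq1;
	struct toy_task p = { .prio = 20, .time_slice = 0 };

	memset(&rq0, 0, sizeof(rq0));
	memset(&rq1, 0, sizeof(rq1));

	toy_enqueue(&rq0, &p, 6);	/* wakes on cpu0, quota allocated there */
	toy_dequeue(&rq0, &p);		/* sleeps before using a full tick	*/
	toy_enqueue(&rq1, &p, 6);	/* wakes on cpu1, fresh quota again	*/

	printf("rq0 quota at prio %d: %d (abandoned)\n", p.prio, rq0.quota[p.prio]);
	printf("rq1 quota at prio %d: %d\n", p.prio, rq1.quota[p.prio]);
	return 0;
}

If that's what is happening, repeat it across a few thousand cross-cpu
wakeups and the runqueue side of the ledger never runs dry, which would be
exactly the "deadline mechanism isn't kicking in" symptom above.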