Subject: Re: Stolen and degraded time and schedulers
From: john stultz
To: Jeremy Fitzhardinge
Cc: Andi Kleen, Ingo Molnar, Thomas Gleixner, Con Kolivas, Rusty Russell,
    Zachary Amsden, James Morris, Chris Wright, Linux Kernel Mailing List,
    cpufreq@lists.linux.org.uk, Virtualization Mailing List, Daniel Walker
In-Reply-To: <45F6D1D0.6080905@goop.org>
References: <45F6D1D0.6080905@goop.org>
Date: Tue, 13 Mar 2007 13:12:48 -0700
Message-Id: <1173816769.22180.14.camel@localhost>

On Tue, 2007-03-13 at 09:31 -0700, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which particular
> ms of time it gets.
>
> This assumption is wrong now, and will become more wrong as
> virtualization gets more widely used.
>
> It's wrong now, because it fails to take into account several kinds
> of missing time:
>
>    1. interrupts - time spent in an ISR is accounted to the current
>       process, even though it gets no direct benefit
>    2. SMM - time is completely lost from the kernel
>    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU
>
[snip]
> So how to deal with this?
> Basically we need a clock which measures "CPU work units", and have
> the scheduler use this clock.
>
> A "CPU work unit" clock has these properties:
>
>     * inherently per-CPU (from the kernel's perspective, so it would be
>       per-VCPU in a virtual machine)
>     * monotonic - you can't do negative work
>     * measured in "work units"
[snip]
> So, how to implement this?
>
> One quick hack would be to just make a new clocksource entrypoint, which
> returns work units rather than real-time cycles.  That would be fairly
> simple to implement, but it doesn't really take the per-cpu nature of
> the clock into account (since it's possible that different cpus on the
> same machine might need their own methods).
>
> Perhaps a better fit would be an entity which is equivalent to a
> clocksource, but registered per-cpu like (some) clockevents.
>
> I don't have a particular preference, but I wonder what the clock gurus
> think.

My gut reaction would be to avoid using clocksources for now. While
there is some thought going into how to expand clocksources for other
uses (Daniel is working on this, for example), the design for
clocksources has been very focused on its utility to timekeeping, so I'm
hesitant to try to complicate the clocksources in order to multiplex
functionality until what is really needed is well understood.

I suspect the best approach would be to see how the sched_clock interface
can be reworked/used for what you want, as its design goals map closest
to the work-unit properties you list above. Then we can look to see how
clocksources can be best used to implement the sched_clock interface.

-john
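
For illustration only, the three work-unit properties quoted above
(per-CPU, monotonic, measured in work units) might be sketched in plain
user-space C roughly as follows. Nothing here is kernel code or a
proposed sched_clock implementation: struct work_clock, NR_CPUS_SKETCH,
the work_clock_*() helpers, and the choice of "one work unit = 1ns at
maximum frequency" are all assumptions made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4   /* hypothetical fixed CPU count for this sketch */

/* Hypothetical per-CPU state; these names do not exist in the kernel. */
struct work_clock {
	uint64_t work_units;   /* accumulated work; only ever increases */
	uint64_t cur_khz;      /* frequency the CPU is currently running at */
	uint64_t max_khz;      /* frequency at which 1ns counts as 1 work unit */
};

/* One clock per CPU (it would be per-VCPU inside a virtual machine). */
static struct work_clock work_clocks[NR_CPUS_SKETCH];

static void work_clock_init(int cpu, uint64_t cur_khz, uint64_t max_khz)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	assert(max_khz > 0 && cur_khz <= max_khz);
	work_clocks[cpu].work_units = 0;
	work_clocks[cpu].cur_khz = cur_khz;
	work_clocks[cpu].max_khz = max_khz;
}

/*
 * Account one interval: elapsed_ns of real time, of which stolen_ns was
 * unavailable to the task (interrupts, SMM, hypervisor steal).  The
 * remainder is scaled by cur_khz/max_khz so a slow CPU earns less work.
 * The multiplication can overflow for very long intervals; a real
 * implementation would need a mult/shift scheme instead.
 */
static void work_clock_account(int cpu, uint64_t elapsed_ns, uint64_t stolen_ns)
{
	struct work_clock *wc;
	uint64_t usable_ns;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	wc = &work_clocks[cpu];

	/* Clamp at zero: the clock is monotonic, no negative work. */
	usable_ns = elapsed_ns > stolen_ns ? elapsed_ns - stolen_ns : 0;
	wc->work_units += usable_ns * wc->cur_khz / wc->max_khz;
}

static uint64_t work_clock_read(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	return work_clocks[cpu].work_units;
}
```

In this sketch, 1ms on a CPU running at 600MHz out of a possible 2.4GHz
is credited as a quarter of the work of 1ms at full speed, and stolen
time is simply subtracted before scaling. How stolen time and frequency
would actually be obtained per CPU is exactly the open question in the
thread and is not addressed here.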