From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932572AbXCNTCQ (ORCPT ); Wed, 14 Mar 2007 15:02:16 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932687AbXCNTCP (ORCPT ); Wed, 14 Mar 2007 15:02:15 -0400 Received: from smtp-outbound-1.vmware.com ([65.113.40.141]:52802 "EHLO smtp-outbound-1.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932572AbXCNTCO (ORCPT ); Wed, 14 Mar 2007 15:02:14 -0400 Message-ID: <45F846AB.6060200@vmware.com> Date: Wed, 14 Mar 2007 12:02:03 -0700 From: Dan Hecht User-Agent: Thunderbird 1.5.0.2 (X11/20060420) MIME-Version: 1.0 To: Jeremy Fitzhardinge CC: dwalker@mvista.com, cpufreq@lists.linux.org.uk, Linux Kernel Mailing List , Con Kolivas , Chris Wright , Virtualization Mailing List , john stultz , Ingo Molnar , Thomas Gleixner , paulus@au.ibm.com, schwidefsky@de.ibm.com, Dan Hecht , Rik van Riel Subject: Re: Stolen and degraded time and schedulers References: <45F6D1D0.6080905@goop.org> <1173816769.22180.14.camel@localhost> <45F70A71.9090205@goop.org> <1173821224.1416.24.camel@dwalker1> <45F71EA5.2090203@goop.org> <45F74515.7010808@vmware.com> <45F77C27.8090604@goop.org> In-Reply-To: <45F77C27.8090604@goop.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-OriginalArrivalTime: 14 Mar 2007 19:02:01.0510 (UTC) FILETIME=[3F6EAC60:01C7666B] Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On 03/13/2007 09:37 PM, Jeremy Fitzhardinge wrote: > Dan Hecht wrote: >> With your previous definition of work time, would it be that: >> >> monotonic_time == work_time + stolen_time ?? > > (By monotonic time, I presume you mean monotonic real time.) Yes, I was just trying to use some consistent terminology, so I picked linux (hrtimer.c) terms: CLOCK_REALTIME == wallclock, CLOCK_MONOTONIC == "real" time counter. > Yes, I > suppose you could, but I don't think that's terribly useful. I think > work_time is probably most naturally measured in cpu clock cycles rather > than an actual time unit. You could convert it to ns, but I don't see > the point. > Even cpu clock cycles doesn't really tell you how much "work" a cpu was able to get done. Different cpus have different throughputs per cycle per instruction sequence. > I know its a term in general use, but I don't think the term "stolen > time" is all that useful, particularly when we're talking about a more > general notion of cpu work contributing to the progress of process > execution. In the cpufreq case, time isn't "stolen" per se. > Right, and that's why I'm not sure I'm convinced the two should be confused. In the case of cpufreq you are talking about the cpu not doing as much work due to a choice the kernel made. In the case of stolen time, the choice wasn't made by the kernel, but instead the hypervisor. I understand they are somewhat similar from the perspective of the scheduler, but a bit different. Also, I'm not sure this is the right thing for cpufreq because: 1) when the load is high, which is when this all matters, presumably the kernel will ramp up the cpu to full speed. 2) in the case where there are two different machines with two different process speeds (well, really processor throughputs), today the scheduler doesn't care about trying to adjust the timing due to one machine being faster than the other. I think this is might be design; i.e. the scheduler was intended to work in terms of real time units rather than work units. I guess you are arguing that this is incorrect, and the scheduler should be scheduling based on how much work it was able to get done. I'm not sure this makes sense because it was the kernel that decided slow the cpus and cause less work to be done. In the case of stolen time, however, the kernel was even running at all on any pcpu, and it wasn't even up to the kernel to decide that. > (I guess I don't like the term stolen time because you don't refer to > time spent on other processes as being stolen from your process: its > just processor time being distributed.) > Okay. I think maybe it comes from the fact that most moder processes expect to time share the cpu, whereas most kernels do not expect this, and it has already been adopted by the kernel (cpustat->steal). But, I don't really care what we call it. >> i.e. would you be defining stolen_time to include the time lost to >> processes due to the cpu running at a lower frequency? How does this >> play into the other potential users, besides sched_clock(), of stolen >> time? We should make sure that the abstraction introduced here makes >> sense in those places too. > > Be specific. What other uses are there? > I listed them below. To summarize, there are (at least) three: 1) sched_clock, the main topic of this thread. 2) p->time_slice 3) cpustat->steal >> For example, the stuff that happens in update_process_times(). I >> think we'd want to account the stolen time to cpustat->steal. > > I guess we could do something for that. Would we account non-full-speed > cpus to it? Maybe? > > How is cpustat->steal used? How does it get out to usermode? > Via /proc/stat, used by modern 'top', maybe other utilities. It is useful to users who want to see where the time is really going from inside a guest when running on a (para)virtual machine. I believe previous set of xen paravirt-ops patches already handled cases #2 and #3 (but no longer do since switching to clockevents), and the old vmitime code did also. Obviously, we need revamp this stuff to make it fit in with the new clockevents/hrtimer way of doing things. Also, s390 and powerpc arch's already account steal time, as another reference point. (the old xen time code and vmi time code worked much in the same way as those). We should bring the s390 and powerpc folks into the discussion. > >> Also we'd probably want account for stolen time with regards to >> task_running_tick(). (Though, in the latter case, maybe we first have >> to move the scheduler away from assuming HZ rate decrementing of >> p->time_slice to get this right. i.e. remove the tick based assumption >> from the scheduler, and then maybe stolen time falls in more naturally >> when accounting time slices). > > I think the important part is that sched_clock() be used to actually > compute how much time each process gets. The fact that a time quantum > gets stolen is less important. Or do you mean something else? > How is time quantum getting stolen less important? Time quantum getting stolen results directly in more unnecessary context switches since we might steal the entire timeslice before the process even ran. I actually think #2 and #3 might be more important than #1 (at least they are as important). And, the earlier Xen patches seem to agree with this, since they addressed 2 & 3 only. >> I guess taking your cpufreq as an example of work_time progressing >> slower than monotonic_time (and assuming that the remaining time is >> what you would call stolen), then e.g. top would report 50% of your >> cpu stolen when you cpu is running at 1/2 max rate. > > Yes. In the same way that clock modulation gates the cpu clock, the > hypervisor effectively gates the clock by giving time to other vcpus. > Yes, except in one case it was a choice of the kernel, and in the other it was up to the hypervisor. >> And p->time_slice would decrement at 1/2 the rate it normally did when >> running at 1/2 speed. Is this the right thing to do? If so, then I >> agree it makes sense to model hypervisor stolen time in terms of your >> "work time". > > Yes, that's my thought. > Since it was the kernel that decided to slow down the processor, I don't know if this makes sense. Dan