All of lore.kernel.org
 help / color / mirror / Atom feed
* + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-06  6:52 akpm
       [not found] ` <45ED16D2.3000202@vmware.com>
  0 siblings, 1 reply; 169+ messages in thread
From: akpm @ 2007-03-06  6:52 UTC (permalink / raw)
  To: mm-commits; +Cc: akpm, ak, mingo, tglx, zach


The patch titled
     stupid hack to make mainline build
has been added to the -mm tree.  Its filename is
     stupid-hack-to-make-mainline-build.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

------------------------------------------------------
Subject: stupid hack to make mainline build
From: Andrew Morton <akpm@linux-foundation.org>

All I did was type `make allmodconfig' :(

This might break VMI, but that seems desirable from a consistency POV.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-i386/vmi_time.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN include/asm-i386/vmi_time.h~stupid-hack-to-make-mainline-build include/asm-i386/vmi_time.h
--- a/include/asm-i386/vmi_time.h~stupid-hack-to-make-mainline-build
+++ a/include/asm-i386/vmi_time.h
@@ -61,6 +61,14 @@ extern void apic_vmi_timer_interrupt(voi
 #ifdef CONFIG_NO_IDLE_HZ
 extern int vmi_stop_hz_timer(void);
 extern void vmi_account_time_restart_hz_timer(void);
+#else
+static inline int vmi_stop_hz_timer(void)
+{
+	return 0;
+}
+static inline void vmi_account_time_restart_hz_timer(void)
+{
+}
 #endif
 
 /*
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

origin.patch
stupid-hack-to-make-mainline-build.patch
highres-do-not-run-the-timer_softirq-after-switching-to-highres-mode-tweak-fix.patch
slab-introduce-krealloc-fix.patch
make-aout-executables-work-again-fix.patch
sony-laptop-fix-uninitialised-variable.patch
git-alsa-oops-fix.patch
git-drm.patch
git-dvb.patch
ia64-kexec-use-efi_loader_data-for-elf-core-header-tidy.patch
git-input.patch
setstream-param-for-psmouse-tweak.patch
git-md-accel-fixup.patch
git-mmc-fix-99.patch
nommu-present-backing-device-capabilities-for-mtd-fix.patch
git-ubi.patch
git-netdev-all.patch
git-netdev-all-ipw2200-fix.patch
revert-drivers-net-tulip-dmfe-support-basic-carrier-detection.patch
dmfe-add-support-for-suspend-resume-fix.patch
sis900-warning-fixes.patch
div64_64-common-code-fix.patch
bonding-replace-system-timer-with-work-queue-tidy.patch
git-parisc.patch
rm9000-serial-driver-tidy.patch
git-pciseg.patch
git-unionfs.patch
usbatm-create-sysfs-link-device-from-atm-class-device-tidy.patch
git-wireless-fixup.patch
revert-x86_64-mm-change-sysenter_setup-to-__cpuinit-improve-__init-__initdata.patch
after-before-x86_64-mm-mmconfig-share.patch
linux-sysdevh-needs-to-include-linux-moduleh-up-fix.patch
linux-sysdevh-needs-to-include-linux-moduleh-up-fix-2.patch
x86_64-irq-make-affinity-works-for-genapic_flat-mode-tidy.patch
i386-vdso_prelink-warning-fix.patch
smaps-add-clear_refs-file-to-clear-reference-fix.patch
driver_bfin_serial_core-update.patch
reduce-size-of-task_struct-on-64-bit-machines.patch
mm-shrink-parent-dentries-when-shrinking-slab.patch
define-and-use-new-eventscpu_lock_acquire-and-cpu_lock_release.patch
call-cpu_chain-with-cpu_down_failed-if-cpu_down_prepare-failed-vs-reduce-size-of-task_struct-on-64-bit-machines.patch
speedup-divides-by-cpu_power-in-scheduler.patch
lutimesat-compat-syscall-and-wire-up-on-x86_64.patch
utrace-prep.patch
utrace-prep-2.patch
revert-utrace-prep-2.patch
utrace-vs-reduce-size-of-task_struct-on-64-bit-machines.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
local_t-powerpc-extension.patch
fbdev-hecuba-framebuffer-driver.patch
mm-only-free-swap-space-of-reactivated-pages-debug.patch

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
       [not found]     ` <20070306084647.GA16280@elte.hu>
@ 2007-03-06  8:55       ` Zachary Amsden
  2007-03-06 10:59         ` Thomas Gleixner
  0 siblings, 1 reply; 169+ messages in thread
From: Zachary Amsden @ 2007-03-06  8:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Virtualization Mailing List, mm-commits, tglx, akpm

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>   
>> no, that's not the case: next_timer_interrupt() is the NO_IDLE_HZ 
>> method of doing things - while in the NO_HZ case you are supposed to 
>> use clockevent devices to program timer hardware.
>>     

We don't have a clockevent device.  But we need NO_IDLE_HZ support, 
which NO_HZ has now subsumed.

> a proper CE device also has the added bonus of making high-res timers 
> guests work automatically. It should be simple: just pass it through to 
> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has 
> essentially no guest-side complexity.
>   

It is not so simple.  In theory it works great.  In reality, the i386 
implementation is completely hardwired to work the way hardware works, 
and breaking the clockevent code out of the deep ties to the APIC is 
extremely non-trivial.  We tried, and could not accomplish it for 2.6.21 
because the hrtimers integration was complex, and introduced many bugs 
for us.  We worked around this by keeping NO_IDLE_HZ support, which now 
you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
and it is working fine.  We understand the benefits of moving to the CE 
model - but it cannot be done overnight.

Xen has the same requirements for integrating their timer code.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06  8:55       ` Zachary Amsden
@ 2007-03-06 10:59         ` Thomas Gleixner
  2007-03-06 21:07             ` Dan Hecht
  0 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-06 10:59 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, akpm, ak, Daniel Hecht, Virtualization Mailing List,
	Jeremy Fitzhardinge, Rusty Russell, LKML

On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
> > a proper CE device also has the added bonus of making high-res timers 
> > guests work automatically. It should be simple: just pass it through to 
> > your hypervisor, a hyper-CE-device, like a hyper-clocksource device has 
> > essentially no guest-side complexity.
> >   
> 
> It is not so simple.  In theory it works great.  In reality, the i386 
> implementation is completely hardwired to work the way hardware works, 
> and breaking the clockevent code out of the deep ties to the APIC is 
> extremely non-trivial.  We tried, and could not accomplish it for 2.6.21 
> because the hrtimers integration was complex, and introduced many bugs 
> for us.

Why is this so non-trivial ? All you have to do is _NOT_ register
PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
which uses the hypervisor timer emulation instead of real hardware.

clockevents breaks the hardwired assumptions of the old timer code and
allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
stuff like

       /* Disable PIT. */
        outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */

> We worked around this by keeping NO_IDLE_HZ support, which now 
> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
> and it is working fine.  We understand the benefits of moving to the CE 
> model - but it cannot be done overnight.

This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
irq_enter() and irq_exit() so the clockevents code is actually invoked.
I have not looked close enough why this does work at all.

I have the feeling that "working fine" means something like "does not
explode".

We really want to fix this now instead of pushing some not know why it
works hack into the kernel.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 10:59         ` Thomas Gleixner
@ 2007-03-06 21:07             ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-06 21:07 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz, Dan Hecht

On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
>>> a proper CE device also has the added bonus of making high-res timers 
>>> guests work automatically. It should be simple: just pass it through to 
>>> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has 
>>> essentially no guest-side complexity.
>>>   
>> It is not so simple.  In theory it works great.  In reality, the i386 
>> implementation is completely hardwired to work the way hardware works, 
>> and breaking the clockevent code out of the deep ties to the APIC is 
>> extremely non-trivial.  We tried, and could not accomplish it for 2.6.21 
>> because the hrtimers integration was complex, and introduced many bugs 
>> for us.
> 
> Why is this so non-trivial ? All you have to do is _NOT_ register
> PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> which uses the hypervisor timer emulation instead of real hardware.
> 
> clockevents breaks the hardwired assumptions of the old timer code and
> allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> stuff like
> 
>        /* Disable PIT. */
>         outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
> 

Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and 
we still need to stop it.

In any case, clockevents doesn't really make it easier nor harder as far 
as init goes.  In the pre-clockevent days, we replace setup_pit_timer, 
setup_boot_clock, setup_secondary_clock.  With clockevents, I think the 
hook points are the same.  Mostly just need to allow the per-cpu 
lapic_event to be generalized to local_clock_events that can be set to 
whatever device we want.  The other thing on i386 is just some minor 
annoyances due initially setting up only the PIT on cpu0 on irq 0 and 
then later setting up per-cpu timer on lvtt, and making this all place 
nice with paravirt timers.  But these are just details and just require 
some minor changes and will be working, but it just takes some massaging.

So, that is not the real reason to move over the clockevents.  The real 
reason is to use the generic interrupt handlers.  We understand that, 
and will get to that point.  In the mean time, we are harming no one. 
Our code has zero effect when you booting natively or on a non-VMI 
hypervisor.

>> We worked around this by keeping NO_IDLE_HZ support, which now 
>> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
>> and it is working fine.  We understand the benefits of moving to the CE 
>> model - but it cannot be done overnight.
> 
> This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> irq_enter() and irq_exit() so the clockevents code is actually invoked.
> I have not looked close enough why this does work at all.
> 

I believe this was just a quick fix in response to Ingo breaking the VMI 
build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

(The two work okay together because when using NO_IDLE_HZ, the hooks are 
deeper in a custom safe_halt routine which isn't registered when using 
nohz mode at runtime, and conversely, the nohz code is guarded at 
runtime by the ts->nohz_mode.  So, the two really can co-exist at 
compile time).

Again, no one is arguing that we shouldn't move to clockevents, it's 
just a matter of time (sorry, no pun intended).

The vmi-time code was introduced to solve some shortcomings of the old 
(pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was 
especially painful for virtualization.  Certainly, 
clocksource/clockevents/NO_HZ solves many of the problems (basically, 
moving away from counting interrupts to using time sources).  e.g. xtime 
updating is no longer a worry with the new timeofday/clocksource stuff. 
  But there are some that may not quite be solved, listed below.  (I 
know I'm not telling you anything new, but I might as well flesh it out 
for the other paravirt folks while the code is fresh in my mind):

1) Stolen time (virtual cpu is ready to run but not running): this is 
handled inconsistently between the various clockevent handlers / 
CLOCK_EVT_MODE_ONESHOT combinations:

  a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you 
define "periodic" timer in a paravirtual world.  If you do something 
like Xen-style where you send periodic events only to running vcpus, 
then this handler suffers from some of the same problems as the old i386 
timer handler:
   - jiffies updated according to the number of interrupts you get, so 
falls behind monotonic time.  generally, counting timer interrupts is 
bad for paravirt.
   - process time updated according to the number of interrupts, so 
falls behind monotonic time.  This is probably okay though, since it is 
essentially tracking (mono - stolen) time.  I.e. the missing time is stolen.
   - jiffies updated only by boot cpu, which is a problem for paravirt 
since the boot vcpu can be descheduled while the other vcpus are scheduled.
   - can probably just avoid this mode by not advertising PERIODIC 
capability by your clockevent device.

  b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT:
   - jiffies updated according to monotonic time since we'll loop to 
catch up the oneshot timer.
   - process time accounted in monotonic time, for the same reason. 
this is probably not what we'd want since it will charge time to 
whatever process happened to be scheduled in the guest during periods of 
stolen time.
   - Same problem about jiffies only updated by one vcpu.

  c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT):
   - jiffies updated according to monotonic time. this is good.
   - Process time effectively does not count stolen time since 
update_process_times is called once per callback but oneshot expiry is 
advanced into the future.  This is probably the right thing for 
paravirt, but, inconsistent with (b).
   - jiffies updated by all vcpus, which is good.

  d) hrtimer_interrupt (really tick_sched_timer):
   - w.r.t. stolen time, will be similar to (c).  We'll advance the 
sched_timer to the next period in the future, skipping stolen time for 
process accounting.
   - jiffies will be kept up to date with monotonic time.
   - jiffies updated by all vcpus, which is good.

Summary: Cases (c) and (d) should be relatively well behaved for 
paravirt.  So, if we can depend on NO_HZ=y and not being disabled at 
runtime, we should be okay.  We may want to have some knowledge of 
stolen time passed from the hypervisor (if we wanted more accurate 
process time accounting), but this can be put back in later, and isn't 
too important with sample based accounting system like Linux. But, we 
still need to QA all four cases, and all four cases will likely expose 
different bugs due to second order effects.

2) Virtual interrupts have a relatively high overhead as compared with 
native interrupts.  So, in vmitime, we wanted to be able to lower the 
timer interrupt rate at runtime, even if HZ is a compile time constant 
(and set to something high, like 1000hz).  While we could hack this in 
by using evt->min_delta_ns, it wouldn't really work since process time 
accounting would be wrong.  Instead, we should allow the 
tick_sched_timer in cases (c) and (d) to have runtime configurable 
period, and then scale the time value accordingly before passing to 
account_system_time.  This is probably something the Xen folks will want 
also, since I think Xen itself only gets 100hz hard timer, and so it can 
implement at best a oneshot virtual timer with 100hz resolution.  Any 
objections to us doing something like this?

3) clockevent set_next_event interface is suboptimal for paravirt (and 
probably realtime-ish uses).  The problem is that the expiry is passed 
as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
may have passed since the delta was computed and when the timer device 
is programmed, causing that next interrupt to be too far out in the 
future.  It seems a better interface for set_next_event would be to pass 
the current time and the absolute expiry.  Actually, I sent email to 
Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
in July 2006, but never heard back.  Thoughts?

I think all the other important points that the vmitime code addressed 
are also addressed by clocksource/clockevents/NO_HZ.

Finally, while I agree that writing the clockevent callback code is 
trivial, we will hit bugs when moving over.  It is something we need to 
do, but just takes some time for us to test and shake out the bugs.  For 
example, we are seeing this bug.  It seems to me that tick_sched_timer 
should not be running in softirq context, but only from the 
hrtimer_interrupt.  Is that right?  I'm not sure how we get into this case.

Switched to high resolution mode on CPU 0
------------[ cut here ]------------
kernel BUG at 
/dhecht/linux/testing/torvalds/linux-2.6.21/kernel/posix-cpu-timers.c:1295!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    0
EIP:    0062:[<c0137801>]    Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc2-smp #29)
EIP is at run_posix_cpu_timers+0x24/0x6a7
eax: 00000200   ebx: c1405be0   ecx: c1405102   edx: c14051d4
esi: c1405ca0   edi: 77ad5c64   ebp: dfed5a50   esp: dfe81db4
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 006a
Process swapper (pid: 1, ti=dfe80000 task=dfed5a50 task.ti=dfe80000)
Stack: 00000078 c039e760 0000007d c010a885 28f8c4ec 00000000 dfed5a50 
c14051c0
        dfe81df8 c0122ac3 c14051cc 00000003 00000008 c14051d4 dfed5a50 
00000000
        dfe81df4 dfe81df4 c1405be0 c1405ca0 77ad5c64 00000000 c013cd3a 
c1405c18
Call Trace:
  [<c010a885>] sched_clock+0x3d/0x69
  [<c0122ac3>] scheduler_tick+0x52/0xc9
  [<c013cd3a>] tick_sched_timer+0x57/0x9d
  [<c013cce3>] tick_sched_timer+0x0/0x9d
  [<c013936a>] hrtimer_run_queues+0x1c3/0x21d
  [<c012d69b>] run_timer_softirq+0x21/0x177
  [<c012a0bd>] __do_softirq+0x66/0xca
  [<c012a164>] do_softirq+0x43/0x51
  [<c012a26b>] irq_exit+0x38/0x6b
  [<c0107ab5>] do_IRQ+0x82/0x99
  [<c025cf44>] serial8250_console_write+0x129/0x136
  [<c025ec10>] serial8250_console_putchar+0x0/0x78
  [<c0105d13>] common_interrupt+0x23/0x30
  [<c01267f9>] vprintk+0x29b/0x2ca
  [<c02056a8>] idr_get_new_above_int+0x13c/0x216
  [<c0126843>] printk+0x1b/0x1f
  [<c03ebf31>] audit_init+0x26/0x101
  [<c03ebe7e>] ikconfig_init+0x14/0x37
  [<c03da92d>] init+0x145/0x23e
  [<c0104c06>] ret_from_fork+0x6/0x20
  [<c0104d69>] syscall_exit+0x5/0x18
  [<c03da7e8>] init+0x0/0x23e
  [<c03da7e8>] init+0x0/0x23e
  [<c0105fe7>] kernel_thread_helper+0x7/0x10
  =======================
Code: e9 b0 fe ff ff 5b c3 55 57 56 53 83 ec 48 89 c5 8d 44 24 40 89 44 
24 40 89 44 24 44 e8 19 9a ec 3b 90 8d 74 26 00 f6 c4 02 74 04 <0f> 0b 
eb fe 8b 95 24 01 00 00 85 d2 74 10 8b 85 08 01 00 00 03
EIP: [<c0137801>] run_posix_cpu_timers+0x24/0x6a7 SS:ESP 006a:dfe81db4
Kernel panic - not syncing: Fatal exception in interrupt


thanks,
Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-06 21:07             ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-06 21:07 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
>>> a proper CE device also has the added bonus of making high-res timers 
>>> guests work automatically. It should be simple: just pass it through to 
>>> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has 
>>> essentially no guest-side complexity.
>>>   
>> It is not so simple.  In theory it works great.  In reality, the i386 
>> implementation is completely hardwired to work the way hardware works, 
>> and breaking the clockevent code out of the deep ties to the APIC is 
>> extremely non-trivial.  We tried, and could not accomplish it for 2.6.21 
>> because the hrtimers integration was complex, and introduced many bugs 
>> for us.
> 
> Why is this so non-trivial ? All you have to do is _NOT_ register
> PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> which uses the hypervisor timer emulation instead of real hardware.
> 
> clockevents breaks the hardwired assumptions of the old timer code and
> allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> stuff like
> 
>        /* Disable PIT. */
>         outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
> 

Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and 
we still need to stop it.

In any case, clockevents doesn't really make it easier nor harder as far 
as init goes.  In the pre-clockevent days, we replace setup_pit_timer, 
setup_boot_clock, setup_secondary_clock.  With clockevents, I think the 
hook points are the same.  Mostly just need to allow the per-cpu 
lapic_event to be generalized to local_clock_events that can be set to 
whatever device we want.  The other thing on i386 is just some minor 
annoyances due initially setting up only the PIT on cpu0 on irq 0 and 
then later setting up per-cpu timer on lvtt, and making this all place 
nice with paravirt timers.  But these are just details and just require 
some minor changes and will be working, but it just takes some massaging.

So, that is not the real reason to move over the clockevents.  The real 
reason is to use the generic interrupt handlers.  We understand that, 
and will get to that point.  In the mean time, we are harming no one. 
Our code has zero effect when you booting natively or on a non-VMI 
hypervisor.

>> We worked around this by keeping NO_IDLE_HZ support, which now 
>> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
>> and it is working fine.  We understand the benefits of moving to the CE 
>> model - but it cannot be done overnight.
> 
> This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> irq_enter() and irq_exit() so the clockevents code is actually invoked.
> I have not looked close enough why this does work at all.
> 

I believe this was just a quick fix in response to Ingo breaking the VMI 
build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

(The two work okay together because when using NO_IDLE_HZ, the hooks are 
deeper in a custom safe_halt routine which isn't registered when using 
nohz mode at runtime, and conversely, the nohz code is guarded at 
runtime by the ts->nohz_mode.  So, the two really can co-exist at 
compile time).

Again, no one is arguing that we shouldn't move to clockevents, it's 
just a matter of time (sorry, no pun intended).

The vmi-time code was introduced to solve some shortcomings of the old 
(pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was 
especially painful for virtualization.  Certainly, 
clocksource/clockevents/NO_HZ solves many of the problems (basically, 
moving away from counting interrupts to using time sources).  e.g. xtime 
updating is no longer a worry with the new timeofday/clocksource stuff. 
  But there are some that may not quite be solved, listed below.  (I 
know I'm not telling you anything new, but I might as well flesh it out 
for the other paravirt folks while the code is fresh in my mind):

1) Stolen time (virtual cpu is ready to run but not running): this is 
handled inconsistently between the various clockevent handlers / 
CLOCK_EVT_MODE_ONESHOT combinations:

  a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you 
define "periodic" timer in a paravirtual world.  If you do something 
like Xen-style where you send periodic events only to running vcpus, 
then this handler suffers from some of the same problems as the old i386 
timer handler:
   - jiffies updated according to the number of interrupts you get, so 
falls behind monotonic time.  generally, counting timer interrupts is 
bad for paravirt.
   - process time updated according to the number of interrupts, so 
falls behind monotonic time.  This is probably okay though, since it is 
essentially tracking (mono - stolen) time.  I.e. the missing time is stolen.
   - jiffies updated only by boot cpu, which is a problem for paravirt 
since the boot vcpu can be descheduled while the other vcpus are scheduled.
   - can probably just avoid this mode by not advertising PERIODIC 
capability by your clockevent device.

  b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT:
   - jiffies updated according to monotonic time since we'll loop to 
catch up the oneshot timer.
   - process time accounted in monotonic time, for the same reason. 
this is probably not what we'd want since it will charge time to 
whatever process happened to be scheduled in the guest during periods of 
stolen time.
   - Same problem about jiffies only updated by one vcpu.

  c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT):
   - jiffies updated according to monotonic time. this is good.
   - Process time effectively does not count stolen time since 
update_process_times is called once per callback but oneshot expiry is 
advanced into the future.  This is probably the right thing for 
paravirt, but, inconsistent with (b).
   - jiffies updated by all vcpus, which is good.

  d) hrtimer_interrupt (really tick_sched_timer):
   - w.r.t. stolen time, will be similar to (c).  We'll advance the 
sched_timer to the next period in the future, skipping stolen time for 
process accounting.
   - jiffies will be kept up to date with monotonic time.
   - jiffies updated by all vcpus, which is good.

Summary: Cases (c) and (d) should be relatively well behaved for 
paravirt.  So, if we can depend on NO_HZ=y and not being disabled at 
runtime, we should be okay.  We may want to have some knowledge of 
stolen time passed from the hypervisor (if we wanted more accurate 
process time accounting), but this can be put back in later, and isn't 
too important with sample based accounting system like Linux. But, we 
still need to QA all four cases, and all four cases will likely expose 
different bugs due to second order effects.

2) Virtual interrupts have a relatively high overhead as compared with 
native interrupts.  So, in vmitime, we wanted to be able to lower the 
timer interrupt rate at runtime, even if HZ is a compile time constant 
(and set to something high, like 1000hz).  While we could hack this in 
by using evt->min_delta_ns, it wouldn't really work since process time 
accounting would be wrong.  Instead, we should allow the 
tick_sched_timer in cases (c) and (d) to have runtime configurable 
period, and then scale the time value accordingly before passing to 
account_system_time.  This is probably something the Xen folks will want 
also, since I think Xen itself only gets 100hz hard timer, and so it can 
implement at best a oneshot virtual timer with 100hz resolution.  Any 
objections to us doing something like this?

3) clockevent set_next_event interface is suboptimal for paravirt (and 
probably realtime-ish uses).  The problem is that the expiry is passed 
as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
may have passed since the delta was computed and when the timer device 
is programmed, causing that next interrupt to be too far out in the 
future.  It seems a better interface for set_next_event would be to pass 
the current time and the absolute expiry.  Actually, I sent email to 
Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
in July 2006, but never heard back.  Thoughts?

I think all the other important points that the vmitime code addressed 
are also addressed by clocksource/clockevents/NO_HZ.

Finally, while I agree that writing the clockevent callback code is 
trivial, we will hit bugs when moving over.  It is something we need to 
do, but just takes some time for us to test and shake out the bugs.  For 
example, we are seeing this bug.  It seems to me that tick_sched_timer 
should not be running in softirq context, but only from the 
hrtimer_interrupt.  Is that right?  I'm not sure how we get into this case.

Switched to high resolution mode on CPU 0
------------[ cut here ]------------
kernel BUG at 
/dhecht/linux/testing/torvalds/linux-2.6.21/kernel/posix-cpu-timers.c:1295!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    0
EIP:    0062:[<c0137801>]    Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc2-smp #29)
EIP is at run_posix_cpu_timers+0x24/0x6a7
eax: 00000200   ebx: c1405be0   ecx: c1405102   edx: c14051d4
esi: c1405ca0   edi: 77ad5c64   ebp: dfed5a50   esp: dfe81db4
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 006a
Process swapper (pid: 1, ti=dfe80000 task=dfed5a50 task.ti=dfe80000)
Stack: 00000078 c039e760 0000007d c010a885 28f8c4ec 00000000 dfed5a50 
c14051c0
        dfe81df8 c0122ac3 c14051cc 00000003 00000008 c14051d4 dfed5a50 
00000000
        dfe81df4 dfe81df4 c1405be0 c1405ca0 77ad5c64 00000000 c013cd3a 
c1405c18
Call Trace:
  [<c010a885>] sched_clock+0x3d/0x69
  [<c0122ac3>] scheduler_tick+0x52/0xc9
  [<c013cd3a>] tick_sched_timer+0x57/0x9d
  [<c013cce3>] tick_sched_timer+0x0/0x9d
  [<c013936a>] hrtimer_run_queues+0x1c3/0x21d
  [<c012d69b>] run_timer_softirq+0x21/0x177
  [<c012a0bd>] __do_softirq+0x66/0xca
  [<c012a164>] do_softirq+0x43/0x51
  [<c012a26b>] irq_exit+0x38/0x6b
  [<c0107ab5>] do_IRQ+0x82/0x99
  [<c025cf44>] serial8250_console_write+0x129/0x136
  [<c025ec10>] serial8250_console_putchar+0x0/0x78
  [<c0105d13>] common_interrupt+0x23/0x30
  [<c01267f9>] vprintk+0x29b/0x2ca
  [<c02056a8>] idr_get_new_above_int+0x13c/0x216
  [<c0126843>] printk+0x1b/0x1f
  [<c03ebf31>] audit_init+0x26/0x101
  [<c03ebe7e>] ikconfig_init+0x14/0x37
  [<c03da92d>] init+0x145/0x23e
  [<c0104c06>] ret_from_fork+0x6/0x20
  [<c0104d69>] syscall_exit+0x5/0x18
  [<c03da7e8>] init+0x0/0x23e
  [<c03da7e8>] init+0x0/0x23e
  [<c0105fe7>] kernel_thread_helper+0x7/0x10
  =======================
Code: e9 b0 fe ff ff 5b c3 55 57 56 53 83 ec 48 89 c5 8d 44 24 40 89 44 
24 40 89 44 24 44 e8 19 9a ec 3b 90 8d 74 26 00 f6 c4 02 74 04 <0f> 0b 
eb fe 8b 95 24 01 00 00 85 d2 74 10 8b 85 08 01 00 00 03
EIP: [<c0137801>] run_posix_cpu_timers+0x24/0x6a7 SS:ESP 006a:dfe81db4
Kernel panic - not syncing: Fatal exception in interrupt


thanks,
Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 22:21               ` Andi Kleen
  (?)
@ 2007-03-06 21:32               ` Dan Hecht
  -1 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-06 21:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: tglx, Zachary Amsden, Ingo Molnar, akpm,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

On 03/06/2007 02:21 PM, Andi Kleen wrote:
>> I believe this was just a quick fix in response to Ingo breaking the VMI 
>> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
>> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
> 
> Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.

The only thing NO_IDLE_HZ=y "forces" on other users is some extra code 
(which you are going to get no matter what with CONFIG_PARAVIRT).   It 
doesn't force them to use this code.  It just provides a few extra 
routines that a paravirt_ops backend might want to call back into (I 
think both vmi and xen backends use these routines and that is why it 
became associated with CONFIG_PARAVIRT rather than CONFIG_VMI).

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 21:07             ` Dan Hecht
@ 2007-03-06 22:21               ` Andi Kleen
  -1 siblings, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-06 22:21 UTC (permalink / raw)
  To: Dan Hecht
  Cc: tglx, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

> I believe this was just a quick fix in response to Ingo breaking the VMI 
> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.
I think the right solution is to make VMI depend on (not select) NO_IDLE_HZ
until you can fix your code to work with dynticks properly.

-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-06 22:21               ` Andi Kleen
  0 siblings, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-06 22:21 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Virtualization Mailing List, akpm, john stultz, LKML, tglx, Ingo Molnar

> I believe this was just a quick fix in response to Ingo breaking the VMI 
> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.
I think the right solution is to make VMI depend on (not select) NO_IDLE_HZ
until you can fix your code to work with dynticks properly.

-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 21:07             ` Dan Hecht
  (?)
  (?)
@ 2007-03-06 23:53             ` Thomas Gleixner
  2007-03-07  0:24               ` Jeremy Fitzhardinge
  2007-03-07  0:42               ` Dan Hecht
  -1 siblings, 2 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-06 23:53 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

Dan,

On Tue, 2007-03-06 at 13:07 -0800, Dan Hecht wrote:
> > Why is this so non-trivial ? All you have to do is _NOT_ register
> > PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> > which uses the hypervisor timer emulation instead of real hardware.
> > 
> > clockevents breaks the hardwired assumptions of the old timer code and
> > allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> > stuff like
> > 
> >        /* Disable PIT. */
> >         outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
> > 
> 
> Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and 
> we still need to stop it.

I guess you have access to the source code of this virtual BIOS. So this
is a real cute technical solution.

ROTFL. The number of lame excuses in this whole virtualization
discussion is amazing.

> In any case, clockevents doesn't really make it easier nor harder as far 
> as init goes.  In the pre-clockevent days, we replace setup_pit_timer, 
> setup_boot_clock, setup_secondary_clock.  With clockevents, I think the 
> hook points are the same.  Mostly just need to allow the per-cpu 
> lapic_event to be generalized to local_clock_events that can be set to 
> whatever device we want.  The other thing on i386 is just some minor 
> annoyances due initially setting up only the PIT on cpu0 on irq 0 and 
> then later setting up per-cpu timer on lvtt, and making this all place 
> nice with paravirt timers.  But these are just details and just require 
> some minor changes and will be working, but it just takes some massaging.

Nothing forces you to follow that low level hardware scheme. That's
_WHY_ clockevents are there. Create a per cpu clock event source, which
uses whatever interrupt you want (you just need to be able to pin it to
the cpu)

> So, that is not the real reason to move over the clockevents. 

It is partially, because clockevents remove the hardcoded hardware
assumptions.

> The real 
> reason is to use the generic interrupt handlers.  We understand that, 
> and will get to that point.  In the mean time, we are harming no one. 
> Our code has zero effect when you booting natively or on a non-VMI 
> hypervisor.

The "we are harming no one" argument is a great excuse to push random
hackery into the kernel. Once it is there, there is no rush to fix it
because it works (for you).

That's exactly the point which is discussed in the "Xen & VMI" thread.
We open up a can of worms and within no time we have 5 or more different
solutions for the same problem. If we do not look careful at this, we
have no way to do any changes in the core code w/o breaking one of those
hypervisor interfaces. The in tree / FOSS hypervisor interfaces might be
fixable, but those which throw a binary blob to the kernel are not.

I completely agree with Ingo, that this whole paravirt business starts
to crawl across the kernel spreading paralyis all over the place.

We have already enough trouble with real hardware, so we want to
carefully avoid that we get broken virtual hardware as an extra workload
via paravirt ops.

> >> We worked around this by keeping NO_IDLE_HZ support, which now 
> >> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
> >> and it is working fine.  We understand the benefits of moving to the CE 
> >> model - but it cannot be done overnight.
> > 
> > This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> > irq_enter() and irq_exit() so the clockevents code is actually invoked.
> > I have not looked close enough why this does work at all.
> > 
> 
> I believe this was just a quick fix in response to Ingo breaking the VMI 
> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
>
> (The two work okay together because when using NO_IDLE_HZ, the hooks are 
> deeper in a custom safe_halt routine which isn't registered when using 
> nohz mode at runtime, and conversely, the nohz code is guarded at 
> runtime by the ts->nohz_mode.  So, the two really can co-exist at 
> compile time).

It is guarded by the fact, that you are not registering clockevent
devices. It's not guarded by design. It happens to work.

> Again, no one is arguing that we shouldn't move to clockevents, it's 
> just a matter of time (sorry, no pun intended).

clockevents have been around for quite a time - pun intended :). They
did not surface surprisingly with 2.6.21-rc1.

> The vmi-time code was introduced to solve some shortcomings of the old 
> (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was 
> especially painful for virtualization.  Certainly, 
> clocksource/clockevents/NO_HZ solves many of the problems (basically, 
> moving away from counting interrupts to using time sources).  e.g. xtime 
> updating is no longer a worry with the new timeofday/clocksource stuff. 
>   But there are some that may not quite be solved, listed below.  (I 
> know I'm not telling you anything new, but I might as well flesh it out 
> for the other paravirt folks while the code is fresh in my mind):
> 
> 1) Stolen time (virtual cpu is ready to run but not running): this is 
> handled inconsistently between the various clockevent handlers / 
> CLOCK_EVT_MODE_ONESHOT combinations:
> 
>   a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you 
> define "periodic" timer in a paravirtual world.  If you do something 
...
>   b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT:
...
>    - Same problem about jiffies only updated by one vcpu.

The periodic mode is only used during boot time, when you have NO_HZ
and/or HIGH_RES_TIMERS enabled. Once you switch over to NO_HZ and / or
HIGH_RES_TIMERS the periodic mode is gone for ever.

All paravirt users probably want to have NO_HZ, so PARAVIRT might simply
depend on NO_HZ. Of course I might be wrong :)

OTOH the stolen time accounting should be fixed in general and not rely
on it happens to work now assumptions. And it should be done for _ALL_
hypervisors in the same way, i.e. in the generic code.

The vcpu binding of jiffies update is a nobrainer to fix.

>   c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT):
>    - jiffies updated according to monotonic time. this is good.
>    - Process time effectively does not count stolen time since 
> update_process_times is called once per callback but oneshot expiry is 
> advanced into the future.  This is probably the right thing for 
> paravirt, but, inconsistent with (b).

See above.

>    - jiffies updated by all vcpus, which is good.
> 
>   d) hrtimer_interrupt (really tick_sched_timer):
>    - w.r.t. stolen time, will be similar to (c).  We'll advance the 
> sched_timer to the next period in the future, skipping stolen time for 
> process accounting.

See above.

>    - jiffies will be kept up to date with monotonic time.
>    - jiffies updated by all vcpus, which is good.
> 
> Summary: Cases (c) and (d) should be relatively well behaved for 
> paravirt.  So, if we can depend on NO_HZ=y and not being disabled at 
> runtime, we should be okay.  We may want to have some knowledge of 
> stolen time passed from the hypervisor (if we wanted more accurate 
> process time accounting), but this can be put back in later, and isn't 
> too important with sample based accounting system like Linux. But, we 
> still need to QA all four cases, and all four cases will likely expose 
> different bugs due to second order effects.

bugs in the hypervisor ? :)

> 2) Virtual interrupts have a relatively high overhead as compared with 
> native interrupts.  So, in vmitime, we wanted to be able to lower the 
> timer interrupt rate at runtime, even if HZ is a compile time constant 
> (and set to something high, like 1000hz).  While we could hack this in 
> by using evt->min_delta_ns, it wouldn't really work since process time 
> accounting would be wrong.  Instead, we should allow the 
> tick_sched_timer in cases (c) and (d) to have runtime configurable 
> period, and then scale the time value accordingly before passing to 
> account_system_time.  This is probably something the Xen folks will want 
> also, since I think Xen itself only gets 100hz hard timer, and so it can 
> implement at best a oneshot virtual timer with 100hz resolution.  Any 
> objections to us doing something like this?

Yes. It's gross hackery. 

1) We want to have a cleanup of the tick assumptions _all_ over the
place and this is going to be real hard work.

2) As I said above. The time accounting for virtualization needs to be
fixed in a generic way.

I'm not going to accept some weird hackery for virtualization, which is
of exactly ZERO value for the kernel itself. Quite the contrary it will
make the cleanup harder and introduce another hard to remove thing,
which will in the worst case last for ever.

> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
> probably realtime-ish uses).  The problem is that the expiry is passed 
> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
> may have passed since the delta was computed and when the timer device 
> is programmed, causing that next interrupt to be too far out in the 
> future.  It seems a better interface for set_next_event would be to pass 
> the current time and the absolute expiry.  Actually, I sent email to 
> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
> in July 2006, but never heard back.  Thoughts?

There is no problem for realtime uses, as the reprogramming path is
running with local interrupts disabled. I can see the point for paravirt
and I'm not opposed to change / expand the interface for that. It might
be done by an extra clockevents feature flag, which requests absolute
time instead of relative time.

Sorry for not replying back then, your mail got into the huge pile of "I
take care of those, when I have some time" mails.

> I think all the other important points that the vmitime code addressed 
> are also addressed by clocksource/clockevents/NO_HZ.
> 
> Finally, while I agree that writing the clockevent callback code is 
> trivial, we will hit bugs when moving over.  It is something we need to 
> do, but just takes some time for us to test and shake out the bugs.

Dave Miller did a conversion of SPARC64 to clock sources and clock
events within a couple of days. He hit _AND_ fixed a couple of bugs in
the go. But he simply did it.

So your "it takes time" prayer wheel is not impressing me at all. 

clockevents have been there long enough to shake out the problems. It
would have been certainly more helpful for everybody if you had started
to use them early and to provide bugreports and/or patches instead of
insisting on the "our hackery works" approach.

> For 
> example, we are seeing this bug.  It seems to me that tick_sched_timer 
> should not be running in softirq context, but only from the 
> hrtimer_interrupt.  Is that right?  I'm not sure how we get into this case.

Fix is already in Linus tree.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 23:53             ` Thomas Gleixner
@ 2007-03-07  0:24               ` Jeremy Fitzhardinge
  2007-03-07  0:35                 ` Dan Hecht
  2007-03-07  0:40                 ` Thomas Gleixner
  2007-03-07  0:42               ` Dan Hecht
  1 sibling, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07  0:24 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Thomas Gleixner wrote:
> All paravirt users probably want to have NO_HZ, so PARAVIRT might simply
> depend on NO_HZ. Of course I might be wrong :)
>   

Xen can deal either way, but tickless is certainly preferred.

> OTOH the stolen time accounting should be fixed in general and not rely
> on it happens to work now assumptions. And it should be done for _ALL_
> hypervisors in the same way, i.e. in the generic code.
>   

Yep.  We'll need to come up with a common story for that. 

>>  This is probably something the Xen folks will want 
>> also, since I think Xen itself only gets 100hz hard timer, and so it can 
>> implement at best a oneshot virtual timer with 100hz resolution.  Any 
>> objections to us doing something like this?
>>     

Xen has a nanosecond resolution one-shot timer which I'm using for
this.  There's also a 100Hz tick which gets in the way a bit (it will
appear as a stream of spurious timeouts), but we'll turn that off soon.

>> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
>> probably realtime-ish uses).  The problem is that the expiry is passed 
>> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
>> may have passed since the delta was computed and when the timer device 
>> is programmed, causing that next interrupt to be too far out in the 
>> future.  It seems a better interface for set_next_event would be to pass 
>> the current time and the absolute expiry.  Actually, I sent email to 
>> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
>> in July 2006, but never heard back.  Thoughts?
>>     
>
> There is no problem for realtime uses, as the reprogramming path is
> running with local interrupts disabled. I can see the point for paravirt
> and I'm not opposed to change / expand the interface for that. It might
> be done by an extra clockevents feature flag, which requests absolute
> time instead of relative time.
>   

I'm not sure how much different it makes overall.  It's true that
absolute time would be a more useful interface, but because the guest
vcpu can be preempted at any time, we could miss the timeout
regardless.  In Xen if you set a timeout for the past you get an
immediate interrupt; I presume the clockevent code can deal with that?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:24               ` Jeremy Fitzhardinge
@ 2007-03-07  0:35                 ` Dan Hecht
  2007-03-07  0:49                   ` Thomas Gleixner
  2007-03-07  0:40                 ` Thomas Gleixner
  1 sibling, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  0:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On 03/06/2007 04:24 PM, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
>>> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
>>> probably realtime-ish uses).  The problem is that the expiry is passed 
>>> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
>>> may have passed since the delta was computed and when the timer device 
>>> is programmed, causing that next interrupt to be too far out in the 
>>> future.  It seems a better interface for set_next_event would be to pass 
>>> the current time and the absolute expiry.  Actually, I sent email to 
>>> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
>>> in July 2006, but never heard back.  Thoughts?
>>>     
>> There is no problem for realtime uses, as the reprogramming path is
>> running with local interrupts disabled. I can see the point for paravirt
>> and I'm not opposed to change / expand the interface for that. It might
>> be done by an extra clockevents feature flag, which requests absolute
>> time instead of relative time.
>>   
> 
> I'm not sure how much different it makes overall.  It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless.  In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?
> 

That's the problem though, you won't know to set it for the past since 
the expiry is relative.  When the vcpu starts running again, it will set 
the timer to expire X ns from now, not Xns from when the timer was 
requested.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:24               ` Jeremy Fitzhardinge
  2007-03-07  0:35                 ` Dan Hecht
@ 2007-03-07  0:40                 ` Thomas Gleixner
  1 sibling, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  0:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Tue, 2007-03-06 at 16:24 -0800, Jeremy Fitzhardinge wrote:
> >> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
> >> probably realtime-ish uses).  The problem is that the expiry is passed 
> >> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
> >> may have passed since the delta was computed and when the timer device 
> >> is programmed, causing that next interrupt to be too far out in the 
> >> future.  It seems a better interface for set_next_event would be to pass 
> >> the current time and the absolute expiry.  Actually, I sent email to 
> >> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
> >> in July 2006, but never heard back.  Thoughts?
> >>     
> >
> > There is no problem for realtime uses, as the reprogramming path is
> > running with local interrupts disabled. I can see the point for paravirt
> > and I'm not opposed to change / expand the interface for that. It might
> > be done by an extra clockevents feature flag, which requests absolute
> > time instead of relative time.
> >   
> 
> I'm not sure how much different it makes overall.  It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless.  In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?

Yep. You also can return -ETIME so it just works w/o an interrupt.

	tglx





^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-06 23:53             ` Thomas Gleixner
  2007-03-07  0:24               ` Jeremy Fitzhardinge
@ 2007-03-07  0:42               ` Dan Hecht
  2007-03-07  1:22                   ` Thomas Gleixner
  1 sibling, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  0:42 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz, Dan Hecht

On 03/06/2007 03:53 PM, Thomas Gleixner wrote:
>> 2) Virtual interrupts have a relatively high overhead as compared with 
>> native interrupts.  So, in vmitime, we wanted to be able to lower the 
>> timer interrupt rate at runtime, even if HZ is a compile time constant 
>> (and set to something high, like 1000hz).  While we could hack this in 
>> by using evt->min_delta_ns, it wouldn't really work since process time 
>> accounting would be wrong.  Instead, we should allow the 
>> tick_sched_timer in cases (c) and (d) to have runtime configurable 
>> period, and then scale the time value accordingly before passing to 
>> account_system_time.  This is probably something the Xen folks will want 
>> also, since I think Xen itself only gets 100hz hard timer, and so it can 
>> implement at best a oneshot virtual timer with 100hz resolution.  Any 
>> objections to us doing something like this?
> 
> Yes. It's gross hackery. 
> 
> 1) We want to have a cleanup of the tick assumptions _all_ over the
> place and this is going to be real hard work.
> 
> 2) As I said above. The time accounting for virtualization needs to be
> fixed in a generic way.
> 
> I'm not going to accept some weird hackery for virtualization, which is
> of exactly ZERO value for the kernel itself. Quite the contrary it will
> make the cleanup harder and introduce another hard to remove thing,
> which will in the worst case last for ever.
>

Okay, to confirm I'm on the same page as you, you want to move process 
time accounting from being periodic sampled based to being trace based? 
i.e. at the system-call/interrupt boundaries, read clocksource and 
compute directly the amount of system/user/process time?

Do you know if anyone has explored this?  I thought there was a 
discussion about this a while back but it was rejected due to the 
sample-based approach having much lower overheads on high system call 
rate workloads.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:35                 ` Dan Hecht
@ 2007-03-07  0:49                   ` Thomas Gleixner
  2007-03-07  0:53                     ` Dan Hecht
  2007-03-07  5:10                     ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  0:49 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
> >> There is no problem for realtime uses, as the reprogramming path is
> >> running with local interrupts disabled. I can see the point for paravirt
> >> and I'm not opposed to change / expand the interface for that. It might
> >> be done by an extra clockevents feature flag, which requests absolute
> >> time instead of relative time.
> >>   
> > 
> > I'm not sure how much different it makes overall.  It's true that
> > absolute time would be a more useful interface, but because the guest
> > vcpu can be preempted at any time, we could miss the timeout
> > regardless.  In Xen if you set a timeout for the past you get an
> > immediate interrupt; I presume the clockevent code can deal with that?
> > 
> 
> That's the problem though, you won't know to set it for the past since 
> the expiry is relative.  When the vcpu starts running again, it will set 
> the timer to expire X ns from now, not Xns from when the timer was 
> requested.

Ooops. I completely forgot, that you get the absolute expiry time
already in ktime_t format (nanoseconds) when dev->set_next_event() is
called.

	dev->next_event = expires;

is done right before the call. 

So it's already there for free.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:49                   ` Thomas Gleixner
@ 2007-03-07  0:53                     ` Dan Hecht
  2007-03-07  1:18                       ` Thomas Gleixner
  2007-03-07  5:10                     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  0:53 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz,
	Dan Hecht

On 03/06/2007 04:49 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
>>>> There is no problem for realtime uses, as the reprogramming path is
>>>> running with local interrupts disabled. I can see the point for paravirt
>>>> and I'm not opposed to change / expand the interface for that. It might
>>>> be done by an extra clockevents feature flag, which requests absolute
>>>> time instead of relative time.
>>>>   
>>> I'm not sure how much different it makes overall.  It's true that
>>> absolute time would be a more useful interface, but because the guest
>>> vcpu can be preempted at any time, we could miss the timeout
>>> regardless.  In Xen if you set a timeout for the past you get an
>>> immediate interrupt; I presume the clockevent code can deal with that?
>>>
>> That's the problem though, you won't know to set it for the past since 
>> the expiry is relative.  When the vcpu starts running again, it will set 
>> the timer to expire X ns from now, not Xns from when the timer was 
>> requested.
> 
> Ooops. I completely forgot, that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
> 
> 	dev->next_event = expires;
> 
> is done right before the call. 
> 
> So it's already there for free.
> 
>

Okay.  I noticed that but didn't think it was okay to use since it 
didn't seem like it was set up for the clock_event_device code's use, so 
seemed like a conceptual interface violation to go digging around in 
there.

Also, wasn't one of the points of clockevents to prevent the device code 
from doing conversions between nanoseconds and clicks themselves?  Don't 
we really want the clockevents generic layer to do this conversion 
between monotonic nanonseconds to absolute device clicks and then give 
the device code that value, so the device layer doesn't perform any 
conversions?


On an unrelated note, can you explain what the difference between 
CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes are and what the 
legal state transitions are? (or point me to a document describing 
this).  At least on i386, all clock event devices treat them the same; 
do we really need both?


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:53                     ` Dan Hecht
@ 2007-03-07  1:18                       ` Thomas Gleixner
  2007-03-07  2:08                         ` Dan Hecht
  0 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  1:18 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
> > Ooops. I completely forgot, that you get the absolute expiry time
> > already in ktime_t format (nanoseconds) when dev->set_next_event() is
> > called.
> > 
> > 	dev->next_event = expires;
> > 
> > is done right before the call. 
> > 
> > So it's already there for free.
> > 
> >
> 
> Okay.  I noticed that but didn't think it was okay to use since it 
> didn't seem like it was set up for the clock_event_device code's use, so 
> seemed like a conceptual interface violation to go digging around in 
> there.

Yes it is. 

I just wanted to point out that you can use it until I'm awake enough to
implement it proper.

> Also, wasn't one of the points of clockevents to prevent the device code 
> from doing conversions between nanoseconds and clicks themselves?  Don't 
> we really want the clockevents generic layer to do this conversion 
> between monotonic nanonseconds to absolute device clicks and then give 
> the device code that value, so the device layer doesn't perform any 
> conversions?

Right. But this applies only to deltas, as the conversion of absolute
time values gets ugly, i.e. 128bit math

IMO the paravirt interfaces should use nanoseconds anyway for both
readout and next event programming. That way the conversion is done in
the hypervisor once and the clocksources and clockevents are simple and
unified (except for the underlying hypervisor calls).

> On an unrelated note, can you explain what the difference between 
> CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes are and what the 
> legal state transitions are? (or point me to a document describing 
> this).  At least on i386, all clock event devices treat them the same; 
> do we really need both?

UNUSED:
The device is registered, but not used by any clockevents client

SHUTDOWN:
The device is registered, claimed by a clockevents client, but
momentarily not active.

The clock events device can treat UNUSED and SHUTDOWN basically in the
same way.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:42               ` Dan Hecht
@ 2007-03-07  1:22                   ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  1:22 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
> >> accounting would be wrong.  Instead, we should allow the 
> >> tick_sched_timer in cases (c) and (d) to have runtime configurable 
> >> period, and then scale the time value accordingly before passing to 
> >> account_system_time.  This is probably something the Xen folks will want 
> >> also, since I think Xen itself only gets 100hz hard timer, and so it can 
> >> implement at best a oneshot virtual timer with 100hz resolution.  Any 
> >> objections to us doing something like this?
> > 
> > Yes. It's gross hackery. 
> > 
> > 1) We want to have a cleanup of the tick assumptions _all_ over the
> > place and this is going to be real hard work.
> > 
> > 2) As I said above. The time accounting for virtualization needs to be
> > fixed in a generic way.
> > 
> > I'm not going to accept some weird hackery for virtualization, which is
> > of exactly ZERO value for the kernel itself. Quite the contrary it will
> > make the cleanup harder and introduce another hard to remove thing,
> > which will in the worst case last for ever.
> >
> 
> Okay, to confirm I'm on the same page as you, you want to move process 
> time accounting from being periodic sampled based to being trace based? 
> i.e. at the system-call/interrupt boundaries, read clocksource and 
> compute directly the amount of system/user/process time?

At least for the paravirt guests this is the correct approach. Once the
CPU vendors come up with a sane solution for a reliable and fast clock
source we might use that on real hardware as well.

> Do you know if anyone has explored this?  I thought there was a 
> discussion about this a while back but it was rejected due to the 
> sample-based approach having much lower overheads on high system call 
> rate workloads.

Yes, with todays hardware it is simply a PITA. PowerPC has some basic
support for this though, IIRC.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07  1:22                   ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  1:22 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
> >> accounting would be wrong.  Instead, we should allow the 
> >> tick_sched_timer in cases (c) and (d) to have runtime configurable 
> >> period, and then scale the time value accordingly before passing to 
> >> account_system_time.  This is probably something the Xen folks will want 
> >> also, since I think Xen itself only gets 100hz hard timer, and so it can 
> >> implement at best a oneshot virtual timer with 100hz resolution.  Any 
> >> objections to us doing something like this?
> > 
> > Yes. It's gross hackery. 
> > 
> > 1) We want to have a cleanup of the tick assumptions _all_ over the
> > place and this is going to be real hard work.
> > 
> > 2) As I said above. The time accounting for virtualization needs to be
> > fixed in a generic way.
> > 
> > I'm not going to accept some weird hackery for virtualization, which is
> > of exactly ZERO value for the kernel itself. Quite the contrary it will
> > make the cleanup harder and introduce another hard to remove thing,
> > which will in the worst case last for ever.
> >
> 
> Okay, to confirm I'm on the same page as you, you want to move process 
> time accounting from being periodic sampled based to being trace based? 
> i.e. at the system-call/interrupt boundaries, read clocksource and 
> compute directly the amount of system/user/process time?

At least for the paravirt guests this is the correct approach. Once the
CPU vendors come up with a sane solution for a reliable and fast clock
source we might use that on real hardware as well.

> Do you know if anyone has explored this?  I thought there was a 
> discussion about this a while back but it was rejected due to the 
> sample-based approach having much lower overheads on high system call 
> rate workloads.

Yes, with todays hardware it is simply a PITA. PowerPC has some basic
support for this though, IIRC.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  1:22                   ` Thomas Gleixner
@ 2007-03-07  1:44                     ` Dan Hecht
  -1 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  1:44 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

On 03/06/2007 05:22 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
>>>> accounting would be wrong.  Instead, we should allow the 
>>>> tick_sched_timer in cases (c) and (d) to have runtime configurable 
>>>> period, and then scale the time value accordingly before passing to 
>>>> account_system_time.  This is probably something the Xen folks will want 
>>>> also, since I think Xen itself only gets 100hz hard timer, and so it can 
>>>> implement at best a oneshot virtual timer with 100hz resolution.  Any 
>>>> objections to us doing something like this?
>>> Yes. It's gross hackery. 
>>>
>>> 1) We want to have a cleanup of the tick assumptions _all_ over the
>>> place and this is going to be real hard work.
>>>
>>> 2) As I said above. The time accounting for virtualization needs to be
>>> fixed in a generic way.
>>>
>>> I'm not going to accept some weird hackery for virtualization, which is
>>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>>> make the cleanup harder and introduce another hard to remove thing,
>>> which will in the worst case last for ever.
>>>
>> Okay, to confirm I'm on the same page as you, you want to move process 
>> time accounting from being periodic sampled based to being trace based? 
>> i.e. at the system-call/interrupt boundaries, read clocksource and 
>> compute directly the amount of system/user/process time?
> 
> At least for the paravirt guests this is the correct approach. Once the
> CPU vendors come up with a sane solution for a reliable and fast clock
> source we might use that on real hardware as well.
> 

I thought your preference was to not do things differently from real 
hardware?  I guess this case you are okay with since you'd like to see 
the real hardware case follow eventually?

In any case, in paravirt the costs of reading timers and doing system 
call transitions are a bit different than on native, so we'll need to 
figure out what makes sense given those costs.

>> Do you know if anyone has explored this?  I thought there was a 
>> discussion about this a while back but it was rejected due to the 
>> sample-based approach having much lower overheads on high system call 
>> rate workloads.
> 
> Yes, with todays hardware it is simply a PITA. PowerPC has some basic
> support for this though, IIRC.
> 

I think S390 maybe too.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07  1:44                     ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  1:44 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On 03/06/2007 05:22 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
>>>> accounting would be wrong.  Instead, we should allow the 
>>>> tick_sched_timer in cases (c) and (d) to have runtime configurable 
>>>> period, and then scale the time value accordingly before passing to 
>>>> account_system_time.  This is probably something the Xen folks will want 
>>>> also, since I think Xen itself only gets 100hz hard timer, and so it can 
>>>> implement at best a oneshot virtual timer with 100hz resolution.  Any 
>>>> objections to us doing something like this?
>>> Yes. It's gross hackery. 
>>>
>>> 1) We want to have a cleanup of the tick assumptions _all_ over the
>>> place and this is going to be real hard work.
>>>
>>> 2) As I said above. The time accounting for virtualization needs to be
>>> fixed in a generic way.
>>>
>>> I'm not going to accept some weird hackery for virtualization, which is
>>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>>> make the cleanup harder and introduce another hard to remove thing,
>>> which will in the worst case last for ever.
>>>
>> Okay, to confirm I'm on the same page as you, you want to move process 
>> time accounting from being periodic sampled based to being trace based? 
>> i.e. at the system-call/interrupt boundaries, read clocksource and 
>> compute directly the amount of system/user/process time?
> 
> At least for the paravirt guests this is the correct approach. Once the
> CPU vendors come up with a sane solution for a reliable and fast clock
> source we might use that on real hardware as well.
> 

I thought your preference was to not do things differently from real 
hardware?  I guess this case you are okay with since you'd like to see 
the real hardware case follow eventually?

In any case, in paravirt the costs of reading timers and doing system 
call transitions are a bit different than on native, so we'll need to 
figure out what makes sense given those costs.

>> Do you know if anyone has explored this?  I thought there was a 
>> discussion about this a while back but it was rejected due to the 
>> sample-based approach having much lower overheads on high system call 
>> rate workloads.
> 
> Yes, with todays hardware it is simply a PITA. PowerPC has some basic
> support for this though, IIRC.
> 

I think S390 maybe too.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  1:18                       ` Thomas Gleixner
@ 2007-03-07  2:08                         ` Dan Hecht
  2007-03-07  8:37                           ` Thomas Gleixner
  0 siblings, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07  2:08 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz,
	Dan Hecht

On 03/06/2007 05:18 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
>>> Ooops. I completely forgot, that you get the absolute expiry time
>>> already in ktime_t format (nanoseconds) when dev->set_next_event() is
>>> called.
>>>
>>> 	dev->next_event = expires;
>>>
>>> is done right before the call. 
>>>
>>> So it's already there for free.
>>>
>>>
>> Okay.  I noticed that but didn't think it was okay to use since it 
>> didn't seem like it was set up for the clock_event_device code's use, so 
>> seemed like a conceptual interface violation to go digging around in 
>> there.
> 
> Yes it is. 
> 
> I just wanted to point out that you can use it until I'm awake enough to
> implement it proper.
> 

Well, we'll probably just live with using the relative expiry for the 
first pass, and then revisit this later once that is working, rather 
than resort to hacking it out by reading ->next_event.

>> Also, wasn't one of the points of clockevents to prevent the device code 
>> from doing conversions between nanoseconds and clicks themselves?  Don't 
>> we really want the clockevents generic layer to do this conversion 
>> between monotonic nanonseconds to absolute device clicks and then give 
>> the device code that value, so the device layer doesn't perform any 
>> conversions?
> 
> Right. But this applies only to deltas, as the conversion of absolute
> time values gets ugly, i.e. 128bit math
> 

Yeah, hopefully we can come up with a clean way to do this.  But, like I 
said early, until we do, we'll stick with the relative expiry.

> IMO the paravirt interfaces should use nanoseconds anyway for both
> readout and next event programming. That way the conversion is done in
> the hypervisor once and the clocksources and clockevents are simple and
> unified (except for the underlying hypervisor calls).
> 

I disagree.  The clocksource/clockevents layer are always going to have 
to convert nanoseconds to/from hardware units, so why not use it?  And, 
some guests (say, a future version of linux that does trace-based 
process accounting) may want higher resolution than nanoseconds for 
certain uses.  In any case, this is beside the point; I'd prefer to 
stick to using the clockevents interface in the way it was intended 
rather than reaching into ->next_event.

thanks,
Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  0:49                   ` Thomas Gleixner
  2007-03-07  0:53                     ` Dan Hecht
@ 2007-03-07  5:10                     ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07  5:10 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Thomas Gleixner wrote:
> Ooops. I completely forgot, that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
>
> 	dev->next_event = expires;
>
> is done right before the call. 
>
> So it's already there for free.
>   

OK, but a trap for young players (ie, me): the absolute time is in ns
since kernel boot, but the hypervisor wants an absolute time in ns since
system boot.  Everything works reasonably well for the first guest
started early, so be sure to take a snapshot of hypervisor time early in
order to get the correction...

    J


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  1:44                     ` Dan Hecht
  (?)
@ 2007-03-07  7:48                     ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  7:48 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Jeremy Fitzhardinge, Rusty Russell,
	LKML, john stultz

On Tue, 2007-03-06 at 17:44 -0800, Dan Hecht wrote:
> >>> 2) As I said above. The time accounting for virtualization needs to be
> >>> fixed in a generic way.
> >>>
> >>> I'm not going to accept some weird hackery for virtualization, which is
> >>> of exactly ZERO value for the kernel itself. Quite the contrary it will
> >>> make the cleanup harder and introduce another hard to remove thing,
> >>> which will in the worst case last for ever.
> >>>
> >> Okay, to confirm I'm on the same page as you, you want to move process 
> >> time accounting from being periodic sampled based to being trace based? 
> >> i.e. at the system-call/interrupt boundaries, read clocksource and 
> >> compute directly the amount of system/user/process time?
> > 
> > At least for the paravirt guests this is the correct approach. Once the
> > CPU vendors come up with a sane solution for a reliable and fast clock
> > source we might use that on real hardware as well.
> > 
> 
> I thought your preference was to not do things differently from real 
> hardware?  I guess this case you are okay with since you'd like to see 
> the real hardware case follow eventually?

Real hardware _IS_ broken and slow. If we add the facilities for
virtualization we want it in a way, which is usable by real hardware as
well.

> > Yes, with todays hardware it is simply a PITA. PowerPC has some basic
> > support for this though, IIRC.
> > 
> 
> I think S390 maybe too.

One more reason to make it a generic solution rather than some extra
hackery.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  2:08                         ` Dan Hecht
@ 2007-03-07  8:37                           ` Thomas Gleixner
  2007-03-07 17:41                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07  8:37 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Tue, 2007-03-06 at 18:08 -0800, Dan Hecht wrote:
> > IMO the paravirt interfaces should use nanoseconds anyway for both
> > readout and next event programming. That way the conversion is done in
> > the hypervisor once and the clocksources and clockevents are simple and
> > unified (except for the underlying hypervisor calls).
> > 
> 
> I disagree.  The clocksource/clockevents layer are always going to have 
> to convert nanoseconds to/from hardware units, so why not use it?  And, 
> some guests (say, a future version of linux that does trace-based 
> process accounting) may want higher resolution than nanoseconds for 
> certain uses. 

That's a pure academic exercise. When we are at the point where
nanoseconds are to coarse - sometimes after we both retired - the
internal resolution will be femtoseconds or whatever fits.

Again: paravirt should use a common infrastructure for this. Virtual
clocksource and virtual clockevent devices, which operate on ktime_t and
not on some artificial clock chip emulation frequency. The backend
implementation will be still per hypervisor, but we have _ONE_ device
emulation model, which is exposed to the kernel instead of five.

On a Linux based host, you probably end up with a hrtimer on the host
side to schedule the next event on the guest. So why do we need to
convert ktime_t to some virtual frequency in the guest so we can convert
it back into ktime_t on the host ?

Abstractions for the abstractions sake are braindead. There is no real
reason to implement 128 bit math into that path just to make the virtual
clockevent device look like real hardware.

The abstraction of clockevents helps you to get rid of hardwired
hardware assumptions, but you insist on creating them artificially for
reasons which are beyond my grasp.

> In any case, this is beside the point; I'd prefer to 
> stick to using the clockevents interface in the way it was intended 
> rather than reaching into ->next_event.

Sigh. The gain is, that you still have a good reason, why you can't move
to the clockevents interface.

Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
instead of writing up lengthy excuses, why it is soooo hard and takes
sooo much time and the current interface is sooo insufficient.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07  8:37                           ` Thomas Gleixner
@ 2007-03-07 17:41                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 17:41 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Thomas Gleixner wrote:
> That's a pure academic exercise. When we are at the point where
> nanoseconds are to coarse - sometimes after we both retired - the
> internal resolution will be femtoseconds or whatever fits.
>
> Again: paravirt should use a common infrastructure for this. Virtual
> clocksource and virtual clockevent devices, which operate on ktime_t and
> not on some artificial clock chip emulation frequency. The backend
> implementation will be still per hypervisor, but we have _ONE_ device
> emulation model, which is exposed to the kernel instead of five.
>   

Different hypervisors have different time interfaces for good reasons -
mostly because the real hardware is such a mess, and there's no clear
"good" answer.  In other words, for the same reason that the new clock
infrastructure exists.

Xen, for example, uses the tsc as the principle timebase in the
hypervisor interface. A shared memory region is updated from time to
time with the tsc frequency and other parameters, and the guest is
expected to compute the current time in ns by extrapolating using the
current tsc value.  This only works because the hypervisor goes to some
effort to synchronize the tsc between the (real) cpus, but its otherwise
much the same as using the raw tsc.

Other hypervisors may take other approaches, depending on what the real
underlying hardware is and the real requirements.  One could imagine a
hypervisor exposing an hpet mapping, for example, or just having some
kind of completely synthetic time source.

The point is that if we were to build an abstraction layer over all of
these just so that we could have a single clocksource/event
implementation, it would be pretty much equivalent to the existing clock
infrastructure, and would add no value.

I was very pleased when I saw the clocksource/event mechanisms go into
the kernel because it means different hypervisors can have a clock*
implementation to match their own particular time model/interface
without having to clutter up the pv_ops interface, and still have a
well-defined interface to the rest of the kernel's time infrastructure.

I don't think having a clock implementation for each hypervisor is such
a big deal.  The Xen one, for example, is 300 lines of straightforward code.

> Abstractions for the abstractions sake are braindead. There is no real
> reason to implement 128 bit math into that path just to make the virtual
> clockevent device look like real hardware.
>
> The abstraction of clockevents helps you to get rid of hardwired
> hardware assumptions, but you insist on creating them artificially for
> reasons which are beyond my grasp.
>   

The hypervisor may present abstracted time hardware, but there is real
time hardware under there somewhere, and there are benefits to making
the abstraction as thin as possible.  Xen chooses to express its time
interfaces in ns and so is a good direct match for the Linux time
infrastructure, but it still has to the 128-bit cycles<->ns conversion
*somewhere*, because the underlying hardware is still using cycles.  It
sounds like the VMWare folks have chosen to directly use cycles in order
to avoid that conversion altogether.

> Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
> instead of writing up lengthy excuses, why it is soooo hard and takes
> sooo much time and the current interface is sooo insufficient.
>   

Yep, it worked out well.  The only warty thing in there is the asm
128-bit math needed in scale_delta() to convert tsc cycles to ns.  John
Stultz had suggested (on a much earlier incarnation of this code) that
it could be generally useful and could be hoisted to somewhere more
common.  I've included the whole thing below.

    J


--

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/clocksource.h>
#include <linux/clockchips.h>

#include <asm/xen/hypercall.h>

#include <xen/events.h>
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>

#include "xen-ops.h"

#define XEN_SHIFT 22

/* These are perodically updated in shared_info, and then copied here. */
struct shadow_time_info {
	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
	u32 tsc_to_nsec_mul;
	int tsc_shift;
	u32 version;
};

static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);

/* Xen time at startup */
static s64 startup_offset;

unsigned long xen_cpu_khz(void)
{
	u64 cpu_khz = 1000000ULL << 32;
	const struct vcpu_time_info *info =
		&HYPERVISOR_shared_info->vcpu_info[0].time;

	do_div(cpu_khz, info->tsc_to_system_mul);
	if (info->tsc_shift < 0)
		cpu_khz <<= -info->tsc_shift;
	else
		cpu_khz >>= info->tsc_shift;

	return cpu_khz;
}

/*
 * Reads a consistent set of time-base values from Xen, into a shadow data
 * area.
 */
static void get_time_values_from_xen(void)
{
	struct vcpu_time_info   *src;
	struct shadow_time_info *dst;

	src = &read_pda(xen.vcpu)->time;
	dst = &get_cpu_var(shadow_time);

	do {
		dst->version = src->version;
		rmb();
		dst->tsc_timestamp     = src->tsc_timestamp;
		dst->system_timestamp  = src->system_time;
		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
		dst->tsc_shift         = src->tsc_shift;
		rmb();
	} while ((src->version & 1) | (dst->version ^ src->version));

	put_cpu_var(shadow_time);
}

/*
 * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
 * yielding a 64-bit result.
 */
static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
	u64 product;
#ifdef __i386__
	u32 tmp1, tmp2;
#endif

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

#ifdef __i386__
	__asm__ (
		"mul  %5       ; "
		"mov  %4,%%eax ; "
		"mov  %%edx,%4 ; "
		"mul  %5       ; "
		"xor  %5,%5    ; "
		"add  %4,%%eax ; "
		"adc  %5,%%edx ; "
		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
#elif __x86_64__
	__asm__ (
		"mul %%rdx ; shrd $32,%%rdx,%%rax"
		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
#else
#error implement me!
#endif

	return product;
}

static u64 get_nsec_offset(struct shadow_time_info *shadow)
{
	u64 now, delta;
	rdtscll(now);
	delta = now - shadow->tsc_timestamp;
	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

static cycle_t xen_clocksource_read(void)
{
	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
	cycle_t ret;

	get_time_values_from_xen();

	ret = shadow->system_timestamp + get_nsec_offset(shadow);

	put_cpu_var(shadow_time);

	return ret;
}

static void xen_read_wallclock(struct timespec *ts)
{
	const struct shared_info *s = HYPERVISOR_shared_info;
	u32 version;
	u64 delta;
	struct timespec now;

	/* get wallclock at system boot */
	do {
		version = s->wc_version;
		rmb();
		now.tv_sec  = s->wc_sec;
		now.tv_nsec = s->wc_nsec;
		rmb();
	} while ((s->wc_version & 1) | (version ^ s->wc_version));

	delta = xen_clocksource_read();	/* time since system boot */
	delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;

	now.tv_nsec = do_div(delta, NSEC_PER_SEC);
	now.tv_sec = delta;

	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}

unsigned long xen_get_wallclock(void)
{
	struct timespec ts;

	xen_read_wallclock(&ts);

	return ts.tv_sec;
}

int xen_set_wallclock(unsigned long now)
{
	/* do nothing for domU */
	return -1;
}

static struct clocksource xen_clocksource __read_mostly = {
	.name = "xen",
	.rating = 400,
	.read = xen_clocksource_read,
	.mask = ~0,
	.mult = 1<<XEN_SHIFT,		/* time directly in nanoseconds */
	.shift = XEN_SHIFT,
	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

static void xen_set_mode(enum clock_event_mode mode,
			 struct clock_event_device *evt)
{
	switch(mode) {
	case CLOCK_EVT_MODE_PERIODIC:
		/* unsupported */
		WARN_ON(1);
		break;

	case CLOCK_EVT_MODE_ONESHOT:
		break;

	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:
		HYPERVISOR_set_timer_op(0);  /* cancel timeout */
		break;
	}
}

static int xen_set_next_event(unsigned long delta,
			      struct clock_event_device *evt)
{
	s64 event = startup_offset + ktime_to_ns(evt->next_event);

	if (HYPERVISOR_set_timer_op(event) < 0)
		BUG();

	/* We may have missed the deadline, but there's no real way of
	   knowing for sure.  If the event was in the past, then we'll
	   get an immediate interrupt. */

	return 0;
}

static const struct clock_event_device xen_clockevent = {
	.name = "xen",
	.features = CLOCK_EVT_FEAT_ONESHOT,

	.max_delta_ns = 0x7fffffff,
	.min_delta_ns = 100,	/* ? */

	.mult = 1<<XEN_SHIFT,
	.shift = XEN_SHIFT,
	.rating = 500,

	.set_mode = xen_set_mode,
	.set_next_event = xen_set_next_event,
};
static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);

static irqreturn_t xen_timer_interrupt(int irq, void *dev_id)
{
	struct clock_event_device *evt = &__get_cpu_var(xen_clock_events);
	irqreturn_t ret;

	ret = IRQ_NONE;
	if (evt->event_handler) {
		cycle_t now = xen_clocksource_read();
		s64 event = startup_offset + ktime_to_ns(evt->next_event);

		/* filter out spurious tick timer events */
		if (now >= event)
			evt->event_handler(evt);
		ret = IRQ_HANDLED;
	}

	return ret;
}

static void xen_setup_timer(int cpu)
{
	const char *name;
	struct clock_event_device *evt;
	int irq;

	printk(KERN_DEBUG "installing Xen timer for CPU %d\n", cpu);

	name = kasprintf(GFP_KERNEL, "timer%d", cpu);
	if (!name)
		name = "<timer kasprintf failed>";

	irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
				      SA_INTERRUPT, name, NULL);

	evt = &get_cpu_var(xen_clock_events);
	memcpy(evt, &xen_clockevent, sizeof(*evt));

	evt->cpumask = cpumask_of_cpu(cpu);
	evt->irq = irq;
	clockevents_register_device(evt);

	put_cpu_var(xen_clock_events);
}

__init void xen_time_init(void)
{
	get_time_values_from_xen();

	clocksource_register(&xen_clocksource);

	/* get offset between hypervisor and kernel monotonic clocks */
	startup_offset = xen_clocksource_read() - ktime_to_ns(ktime_get());

	/* Set initial system time with full resolution */
	xen_read_wallclock(&xtime);
	set_normalized_timespec(&wall_to_monotonic,
				-xtime.tv_sec, -xtime.tv_nsec);

	tsc_disable = 0;

	xen_setup_timer(0);
}


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 17:41                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 17:41 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

Thomas Gleixner wrote:
> That's a pure academic exercise. When we are at the point where
> nanoseconds are to coarse - sometimes after we both retired - the
> internal resolution will be femtoseconds or whatever fits.
>
> Again: paravirt should use a common infrastructure for this. Virtual
> clocksource and virtual clockevent devices, which operate on ktime_t and
> not on some artificial clock chip emulation frequency. The backend
> implementation will be still per hypervisor, but we have _ONE_ device
> emulation model, which is exposed to the kernel instead of five.
>   

Different hypervisors have different time interfaces for good reasons -
mostly because the real hardware is such a mess, and there's no clear
"good" answer.  In other words, for the same reason that the new clock
infrastructure exists.

Xen, for example, uses the tsc as the principle timebase in the
hypervisor interface. A shared memory region is updated from time to
time with the tsc frequency and other parameters, and the guest is
expected to compute the current time in ns by extrapolating using the
current tsc value.  This only works because the hypervisor goes to some
effort to synchronize the tsc between the (real) cpus, but its otherwise
much the same as using the raw tsc.

Other hypervisors may take other approaches, depending on what the real
underlying hardware is and the real requirements.  One could imagine a
hypervisor exposing an hpet mapping, for example, or just having some
kind of completely synthetic time source.

The point is that if we were to build an abstraction layer over all of
these just so that we could have a single clocksource/event
implementation, it would be pretty much equivalent to the existing clock
infrastructure, and would add no value.

I was very pleased when I saw the clocksource/event mechanisms go into
the kernel because it means different hypervisors can have a clock*
implementation to match their own particular time model/interface
without having to clutter up the pv_ops interface, and still have a
well-defined interface to the rest of the kernel's time infrastructure.

I don't think having a clock implementation for each hypervisor is such
a big deal.  The Xen one, for example, is 300 lines of straightforward code.

> Abstractions for the abstractions sake are braindead. There is no real
> reason to implement 128 bit math into that path just to make the virtual
> clockevent device look like real hardware.
>
> The abstraction of clockevents helps you to get rid of hardwired
> hardware assumptions, but you insist on creating them artificially for
> reasons which are beyond my grasp.
>   

The hypervisor may present abstracted time hardware, but there is real
time hardware under there somewhere, and there are benefits to making
the abstraction as thin as possible.  Xen chooses to express its time
interfaces in ns and so is a good direct match for the Linux time
infrastructure, but it still has to the 128-bit cycles<->ns conversion
*somewhere*, because the underlying hardware is still using cycles.  It
sounds like the VMWare folks have chosen to directly use cycles in order
to avoid that conversion altogether.

> Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
> instead of writing up lengthy excuses, why it is soooo hard and takes
> sooo much time and the current interface is sooo insufficient.
>   

Yep, it worked out well.  The only warty thing in there is the asm
128-bit math needed in scale_delta() to convert tsc cycles to ns.  John
Stultz had suggested (on a much earlier incarnation of this code) that
it could be generally useful and could be hoisted to somewhere more
common.  I've included the whole thing below.

    J


--

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/clocksource.h>
#include <linux/clockchips.h>

#include <asm/xen/hypercall.h>

#include <xen/events.h>
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>

#include "xen-ops.h"

#define XEN_SHIFT 22

/* These are perodically updated in shared_info, and then copied here. */
struct shadow_time_info {
	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
	u32 tsc_to_nsec_mul;
	int tsc_shift;
	u32 version;
};

static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);

/* Xen time at startup */
static s64 startup_offset;

unsigned long xen_cpu_khz(void)
{
	u64 cpu_khz = 1000000ULL << 32;
	const struct vcpu_time_info *info =
		&HYPERVISOR_shared_info->vcpu_info[0].time;

	do_div(cpu_khz, info->tsc_to_system_mul);
	if (info->tsc_shift < 0)
		cpu_khz <<= -info->tsc_shift;
	else
		cpu_khz >>= info->tsc_shift;

	return cpu_khz;
}

/*
 * Reads a consistent set of time-base values from Xen, into a shadow data
 * area.
 */
static void get_time_values_from_xen(void)
{
	struct vcpu_time_info   *src;
	struct shadow_time_info *dst;

	src = &read_pda(xen.vcpu)->time;
	dst = &get_cpu_var(shadow_time);

	do {
		dst->version = src->version;
		rmb();
		dst->tsc_timestamp     = src->tsc_timestamp;
		dst->system_timestamp  = src->system_time;
		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
		dst->tsc_shift         = src->tsc_shift;
		rmb();
	} while ((src->version & 1) | (dst->version ^ src->version));

	put_cpu_var(shadow_time);
}

/*
 * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
 * yielding a 64-bit result.
 */
static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
	u64 product;
#ifdef __i386__
	u32 tmp1, tmp2;
#endif

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

#ifdef __i386__
	__asm__ (
		"mul  %5       ; "
		"mov  %4,%%eax ; "
		"mov  %%edx,%4 ; "
		"mul  %5       ; "
		"xor  %5,%5    ; "
		"add  %4,%%eax ; "
		"adc  %5,%%edx ; "
		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
#elif __x86_64__
	__asm__ (
		"mul %%rdx ; shrd $32,%%rdx,%%rax"
		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
#else
#error implement me!
#endif

	return product;
}

static u64 get_nsec_offset(struct shadow_time_info *shadow)
{
	u64 now, delta;
	rdtscll(now);
	delta = now - shadow->tsc_timestamp;
	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

static cycle_t xen_clocksource_read(void)
{
	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
	cycle_t ret;

	get_time_values_from_xen();

	ret = shadow->system_timestamp + get_nsec_offset(shadow);

	put_cpu_var(shadow_time);

	return ret;
}

static void xen_read_wallclock(struct timespec *ts)
{
	const struct shared_info *s = HYPERVISOR_shared_info;
	u32 version;
	u64 delta;
	struct timespec now;

	/* get wallclock at system boot */
	do {
		version = s->wc_version;
		rmb();
		now.tv_sec  = s->wc_sec;
		now.tv_nsec = s->wc_nsec;
		rmb();
	} while ((s->wc_version & 1) | (version ^ s->wc_version));

	delta = xen_clocksource_read();	/* time since system boot */
	delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;

	now.tv_nsec = do_div(delta, NSEC_PER_SEC);
	now.tv_sec = delta;

	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
}

unsigned long xen_get_wallclock(void)
{
	struct timespec ts;

	xen_read_wallclock(&ts);

	return ts.tv_sec;
}

int xen_set_wallclock(unsigned long now)
{
	/* do nothing for domU */
	return -1;
}

static struct clocksource xen_clocksource __read_mostly = {
	.name = "xen",
	.rating = 400,
	.read = xen_clocksource_read,
	.mask = ~0,
	.mult = 1<<XEN_SHIFT,		/* time directly in nanoseconds */
	.shift = XEN_SHIFT,
	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

static void xen_set_mode(enum clock_event_mode mode,
			 struct clock_event_device *evt)
{
	switch(mode) {
	case CLOCK_EVT_MODE_PERIODIC:
		/* unsupported */
		WARN_ON(1);
		break;

	case CLOCK_EVT_MODE_ONESHOT:
		break;

	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:
		HYPERVISOR_set_timer_op(0);  /* cancel timeout */
		break;
	}
}

static int xen_set_next_event(unsigned long delta,
			      struct clock_event_device *evt)
{
	s64 event = startup_offset + ktime_to_ns(evt->next_event);

	if (HYPERVISOR_set_timer_op(event) < 0)
		BUG();

	/* We may have missed the deadline, but there's no real way of
	   knowing for sure.  If the event was in the past, then we'll
	   get an immediate interrupt. */

	return 0;
}

static const struct clock_event_device xen_clockevent = {
	.name = "xen",
	.features = CLOCK_EVT_FEAT_ONESHOT,

	.max_delta_ns = 0x7fffffff,
	.min_delta_ns = 100,	/* ? */

	.mult = 1<<XEN_SHIFT,
	.shift = XEN_SHIFT,
	.rating = 500,

	.set_mode = xen_set_mode,
	.set_next_event = xen_set_next_event,
};
static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);

static irqreturn_t xen_timer_interrupt(int irq, void *dev_id)
{
	struct clock_event_device *evt = &__get_cpu_var(xen_clock_events);
	irqreturn_t ret;

	ret = IRQ_NONE;
	if (evt->event_handler) {
		cycle_t now = xen_clocksource_read();
		s64 event = startup_offset + ktime_to_ns(evt->next_event);

		/* filter out spurious tick timer events */
		if (now >= event)
			evt->event_handler(evt);
		ret = IRQ_HANDLED;
	}

	return ret;
}

static void xen_setup_timer(int cpu)
{
	const char *name;
	struct clock_event_device *evt;
	int irq;

	printk(KERN_DEBUG "installing Xen timer for CPU %d\n", cpu);

	name = kasprintf(GFP_KERNEL, "timer%d", cpu);
	if (!name)
		name = "<timer kasprintf failed>";

	irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
				      SA_INTERRUPT, name, NULL);

	evt = &get_cpu_var(xen_clock_events);
	memcpy(evt, &xen_clockevent, sizeof(*evt));

	evt->cpumask = cpumask_of_cpu(cpu);
	evt->irq = irq;
	clockevents_register_device(evt);

	put_cpu_var(xen_clock_events);
}

__init void xen_time_init(void)
{
	get_time_values_from_xen();

	clocksource_register(&xen_clocksource);

	/* get offset between hypervisor and kernel monotonic clocks */
	startup_offset = xen_clocksource_read() - ktime_to_ns(ktime_get());

	/* Set initial system time with full resolution */
	xen_read_wallclock(&xtime);
	set_normalized_timespec(&wall_to_monotonic,
				-xtime.tv_sec, -xtime.tv_nsec);

	tsc_disable = 0;

	xen_setup_timer(0);
}

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:41                               ` Jeremy Fitzhardinge
@ 2007-03-07 17:49                                 ` Ingo Molnar
  -1 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Xen, for example, uses the tsc as the principle timebase in the
> hypervisor interface. [...]

ugh. Please take it from me: i've watched the Linux time code walk its 
long, rocky 10+ years road. One of the first mistakes was when we made 
the TSC the center of the i386-time universe. (incidentally, it was me 
who did the first steps of that, as a rookie kernel hacker) We got cured 
out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
beginning of that same road. Meet in another 10 years? ;)

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 17:49                                 ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, tglx, akpm


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Xen, for example, uses the tsc as the principle timebase in the
> hypervisor interface. [...]

ugh. Please take it from me: i've watched the Linux time code walk its 
long, rocky 10+ years road. One of the first mistakes was when we made 
the TSC the center of the i386-time universe. (incidentally, it was me 
who did the first steps of that, as a rookie kernel hacker) We got cured 
out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
beginning of that same road. Meet in another 10 years? ;)

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:41                               ` Jeremy Fitzhardinge
@ 2007-03-07 17:52                                 ` Ingo Molnar
  -1 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> I don't think having a clock implementation for each hypervisor is 
> such a big deal.  The Xen one, for example, is 300 lines of 
> straightforward code.

/For you/ it's certainly no big deal, you dont have to fix it up and you 
dont have to keep it flexible ;)

and really, i'm not expecting miracles, i've never seen any hardware 
vendor argue /against/ support for their own hardware =B-)

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 17:52                                 ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-07 17:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, tglx, akpm


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> I don't think having a clock implementation for each hypervisor is 
> such a big deal.  The Xen one, for example, is 300 lines of 
> straightforward code.

/For you/ it's certainly no big deal, you dont have to fix it up and you 
dont have to keep it flexible ;)

and really, i'm not expecting miracles, i've never seen any hardware 
vendor argue /against/ support for their own hardware =B-)

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:49                                 ` Ingo Molnar
@ 2007-03-07 18:03                                   ` James Morris
  -1 siblings, 0 replies; 169+ messages in thread
From: James Morris @ 2007-03-07 18:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Virtualization Mailing List, john stultz,
	LKML, tglx, akpm

On Wed, 7 Mar 2007, Ingo Molnar wrote:

> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Xen, for example, uses the tsc as the principle timebase in the
> > hypervisor interface. [...]
> 
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

What do you suggest instead ?

(Digging into this for lguest now...)



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 18:03                                   ` James Morris
  0 siblings, 0 replies; 169+ messages in thread
From: James Morris @ 2007-03-07 18:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Virtualization Mailing List, akpm, john stultz, tglx, LKML

On Wed, 7 Mar 2007, Ingo Molnar wrote:

> 
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > Xen, for example, uses the tsc as the principle timebase in the
> > hypervisor interface. [...]
> 
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

What do you suggest instead ?

(Digging into this for lguest now...)



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:41                               ` Jeremy Fitzhardinge
@ 2007-03-07 18:11                                 ` James Morris
  -1 siblings, 0 replies; 169+ messages in thread
From: James Morris @ 2007-03-07 18:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

It seems to me that it could be useful to have a library of common virtual 
time code (entirely separate from pv_ops), to avoid re-implementing some 
apparently common requirements, such as: handling TSC frequency changes, 
stolen time accounting, synthetic programmable clockevent etc.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 18:11                                 ` James Morris
  0 siblings, 0 replies; 169+ messages in thread
From: James Morris @ 2007-03-07 18:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, akpm, john stultz, tglx, Ingo Molnar, LKML

On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

It seems to me that it could be useful to have a library of common virtual 
time code (entirely separate from pv_ops), to avoid re-implementing some 
apparently common requirements, such as: handling TSC frequency changes, 
stolen time accounting, synthetic programmable clockevent etc.


- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:52                                 ` Ingo Molnar
  (?)
@ 2007-03-07 18:28                                 ` Jeremy Fitzhardinge
  2007-03-07 18:53                                     ` Thomas Gleixner
  -1 siblings, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 18:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Ingo Molnar wrote:
> /For you/ it's certainly no big deal, you dont have to fix it up and you 
> dont have to keep it flexible ;)
>   

How flexible does it need to be?  Its a simple time source and event
driver.  How flexible does the pit driver need to be?  It's just a small
leaf node hanging off a large existing piece of kernel infrastructure.

> and really, i'm not expecting miracles, i've never seen any hardware 
> vendor argue /against/ support for their own hardware =B-)
>   

And since when has it been kernel policy to argue against including a
well written, self-contained, vendor-provided driver for a piece of
hardware?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:49                                 ` Ingo Molnar
@ 2007-03-07 18:35                                   ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 18:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Ingo Molnar wrote:
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

Yep, the tsc has myriad problems; for Xen its the best of a bad lot. 
Unfortunately in 10 years no clearly better alternative has appeared;
maybe in 10 years there will be one.  It might even be the tsc.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 18:35                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 18:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Virtualization Mailing List, john stultz, LKML, tglx, akpm

Ingo Molnar wrote:
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

Yep, the tsc has myriad problems; for Xen its the best of a bad lot. 
Unfortunately in 10 years no clearly better alternative has appeared;
maybe in 10 years there will be one.  It might even be the tsc.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 18:28                                 ` Jeremy Fitzhardinge
@ 2007-03-07 18:53                                     ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 18:53 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 10:28 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > /For you/ it's certainly no big deal, you dont have to fix it up and you 
> > dont have to keep it flexible ;)
> >   
> 
> How flexible does it need to be?  Its a simple time source and event
> driver.  How flexible does the pit driver need to be?  It's just a small
> leaf node hanging off a large existing piece of kernel infrastructure.
> 
> > and really, i'm not expecting miracles, i've never seen any hardware 
> > vendor argue /against/ support for their own hardware =B-)
> >   
> 
> And since when has it been kernel policy to argue against including a
> well written, self-contained, vendor-provided driver for a piece of
> hardware?

The difference is that we have not much influence on the design
decisions of silicon vendors. We usually see them when the shit already
has been morphed into solid silicon.

Software emulated silicon _IS_ actually under our control. And we want
to have it as sane as possible.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 18:53                                     ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 18:53 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Wed, 2007-03-07 at 10:28 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > /For you/ it's certainly no big deal, you dont have to fix it up and you 
> > dont have to keep it flexible ;)
> >   
> 
> How flexible does it need to be?  Its a simple time source and event
> driver.  How flexible does the pit driver need to be?  It's just a small
> leaf node hanging off a large existing piece of kernel infrastructure.
> 
> > and really, i'm not expecting miracles, i've never seen any hardware 
> > vendor argue /against/ support for their own hardware =B-)
> >   
> 
> And since when has it been kernel policy to argue against including a
> well written, self-contained, vendor-provided driver for a piece of
> hardware?

The difference is that we have not much influence on the design
decisions of silicon vendors. We usually see them when the shit already
has been morphed into solid silicon.

Software emulated silicon _IS_ actually under our control. And we want
to have it as sane as possible.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 18:11                                 ` James Morris
  (?)
@ 2007-03-07 18:56                                 ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 18:56 UTC (permalink / raw)
  To: James Morris
  Cc: Jeremy Fitzhardinge, Virtualization Mailing List, john stultz,
	LKML, Ingo Molnar, akpm

On Wed, 2007-03-07 at 13:11 -0500, James Morris wrote:
> On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:
> 
> > I was very pleased when I saw the clocksource/event mechanisms go into
> > the kernel because it means different hypervisors can have a clock*
> > implementation to match their own particular time model/interface
> > without having to clutter up the pv_ops interface, and still have a
> > well-defined interface to the rest of the kernel's time infrastructure.
> 
> It seems to me that it could be useful to have a library of common virtual 
> time code (entirely separate from pv_ops), to avoid re-implementing some 
> apparently common requirements, such as: handling TSC frequency changes, 
> stolen time accounting, synthetic programmable clockevent etc.

Yes please. Expose sane emulated silicon to the kernel core and maintain
your hypervisor decisions behind that silicon instead of exposing us to
10 different silicon versions with 20 bugs each.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 18:11                                 ` James Morris
  (?)
  (?)
@ 2007-03-07 19:05                                 ` Jeremy Fitzhardinge
  2007-03-07 19:49                                   ` Dan Hecht
  -1 siblings, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 19:05 UTC (permalink / raw)
  To: James Morris
  Cc: tglx, Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

James Morris wrote:
> It seems to me that it could be useful to have a library of common virtual 
> time code (entirely separate from pv_ops), to avoid re-implementing some 
> apparently common requirements, such as: handling TSC frequency changes, 
> stolen time accounting, synthetic programmable clockevent etc.
>   

Well, lets put our clock* implementations next to each other and see how
much common code there is to be factored out.

The Xen time code is pretty lean.  There's not much difference in
abstraction between the clocksource/event interface and the hypervisor
interface, so there's just not very much code there.

One immediate candidate is the scale_delta() function which does the
necessary cycles->tsc conversion.  I think that will be generally useful
and should be put somewhere common rather than copied.

I think stolen time is a bit more core, and in principle applies to
non-virtualized systems as well (such as time stolen by SMM and
discontinuities caused by suspend/resume).  The key piece is a monotonic
clock which advances while a vcpu is actually running on a real cpu,
since that should be used to determine how much time each process has
been running for.

Maybe it will just fall out if we start moving to a state-transition
process time accounting rather than the current sample-based one.  Is
there an actual plan to do that, or is it at the handwaving stage?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 19:05                                 ` Jeremy Fitzhardinge
@ 2007-03-07 19:49                                   ` Dan Hecht
  2007-03-07 20:11                                     ` Jeremy Fitzhardinge
  2007-03-07 21:21                                     ` Thomas Gleixner
  0 siblings, 2 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 19:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: James Morris, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML, Dan Hecht

On 03/07/2007 11:05 AM, Jeremy Fitzhardinge wrote:
> James Morris wrote:
>> It seems to me that it could be useful to have a library of common virtual 
>> time code (entirely separate from pv_ops), to avoid re-implementing some 
>> apparently common requirements, such as: handling TSC frequency changes, 
>> stolen time accounting, synthetic programmable clockevent etc.
>>   
> 
> Well, lets put our clock* implementations next to each other and see how
> much common code there is to be factored out.
> 
> The Xen time code is pretty lean.  There's not much difference in
> abstraction between the clocksource/event interface and the hypervisor
> interface, so there's just not very much code there.
> 

Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
for reference (please excuse any formating issues); it's also lean. 
We'll send out a proper patch later after some more testing:

---

/*
  * VMI paravirtual timer support routines.
  *
  * Copyright (C) 2007, VMware, Inc.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
  * (at your option) any later version.
  *
  * This program is distributed in the hope that it will be useful, but
  * WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
  * NON INFRINGEMENT.  See the GNU General Public License for more
  * details.
  *
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  *
  */

#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/clocksource.h>
#include <linux/clockchips.h>

#include <asm/vmi.h>
#include <asm/vmi_time.h>
#include <asm/apic.h>
#include <asm/i8253.h>
#include <asm/arch_hooks.h>

#include <irq_vectors.h>

#define VMI_ONESHOT  (VMI_ALARM_IS_ONESHOT  | VMI_CYCLES_REAL)
#define VMI_PERIODIC (VMI_ALARM_IS_PERIODIC | VMI_CYCLES_REAL)

static inline u32 vmi_counter(u32 flags)
{
	/* Given VMI_ONESHOT or VMI_PERIODIC, return the corresponding
	 * cycle counter. */
	return flags & VMI_ALARM_COUNTER_MASK;
}

/* paravirt_ops.get_wallclock = vmi_get_wallclock */
unsigned long vmi_get_wallclock(void)
{
	unsigned long long wallclock;
	wallclock = vmi_timer_ops.get_wallclock(); // nsec
	(void)do_div(wallclock, 1000000000);       // sec

	return wallclock;
}

/* paravirt_ops.set_wallclock = vmi_set_wallclock */
int vmi_set_wallclock(unsigned long now)
{
	return 0;
}

/* paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles */
unsigned long long vmi_get_sched_cycles(void)
{
	return vmi_timer_ops.get_cycle_counter(VMI_CYCLES_AVAILABLE);
}

/* paravirt_ops.get_cpu_khz = vmi_cpu_khz */
unsigned long vmi_cpu_khz(void)
{
	unsigned long long khz;
	khz = vmi_timer_ops.get_cycle_frequency();
	(void)do_div(khz, 1000);
	return khz;
}

/** vmi clockevent */

static struct clock_event_device vmi_global_clockevent;

static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
{
	return (evt == &vmi_global_clockevent) ?
		VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
}

static void vmi_timer_set_mode(enum clock_event_mode mode,
			       struct clock_event_device *evt)
{
	u32 wiring;
	cycle_t now, cycles_per_hz;
	BUG_ON(!irqs_disabled());

	wiring = vmi_alarm_wiring(evt);
	if (wiring == VMI_ALARM_WIRED_LVTT)
		/* Route the interrupt to the correct vector */
		apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);

	switch (mode) {
	case CLOCK_EVT_MODE_ONESHOT:
		break;
	case CLOCK_EVT_MODE_PERIODIC:
		cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
		(void)do_div(cycles_per_hz, HZ);
		now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
		vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
					now, cycles_per_hz);
		break;
	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:
		switch (evt->mode) {
		case CLOCK_EVT_MODE_ONESHOT:
			vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
			break;
		case CLOCK_EVT_MODE_PERIODIC:
			vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
			break;
		default:
			break;
		}
		break;
	default:
		break;
	}
}

static int vmi_timer_next_event(unsigned long delta,
				struct clock_event_device *evt)
{
	/* Unfortunately, set_next_event interface only passes relative
	 * expiry, but we want absolute expiry.  It'd be better if were
	 * were passed an aboslute expiry, since a bunch of time may
	 * have been stolen between the time the delta is computed and
	 * when we set the alarm below. */
	cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));

	BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
				now + delta, 0);
	return 0;
}

static struct clock_event_device vmi_clockevent = {
	.name		= "vmi-timer",
	.features	= CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
	.shift		= 22,
	.set_mode	= vmi_timer_set_mode,
	.set_next_event = vmi_timer_next_event,
	.rating         = 1000,
	.irq		= -1,
};

/* Replacement for PIT/HPET global clock event.
  * paravirt_ops.choose_time_init = vmi_time_init_clockevent
  */
void __init vmi_time_init_clockevent(void)
{
	cycle_t cycles_per_msec;

	/* One time setup: initialize the vmi clockevent parameters.
	 * These will be copied to the global and local clockevents. */

	/* Use cycles_per_msec since div_sc params are 32-bits. */
	cycles_per_msec = vmi_timer_ops.get_cycle_frequency();
	(void)do_div(cycles_per_msec, 1000);

	/* Must pick .shift such that .mult fits in 32-bits.  Choosing
	 * .shift to be 22 allows 2^(32-22) cycles per nano-seconds
	 * before overflow. */
	vmi_clockevent.mult = div_sc(cycles_per_msec, NSEC_PER_MSEC,
				     vmi_clockevent.shift);
	/* Upper bound is clockevent's use of ulong for cycle deltas. */
	vmi_clockevent.max_delta_ns =
		clockevent_delta2ns(ULONG_MAX, &vmi_clockevent);
	vmi_clockevent.min_delta_ns =
		clockevent_delta2ns(1, &vmi_clockevent);

	memcpy(&vmi_global_clockevent, &vmi_clockevent,
	       sizeof(vmi_global_clockevent));
	vmi_global_clockevent.name = "vmi-timer (boot)";
	vmi_global_clockevent.cpumask = cpumask_of_cpu(0);
	vmi_global_clockevent.irq = 0;

	printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu 
shift=%u\n",
	       vmi_global_clockevent.name, vmi_global_clockevent.mult,
	       vmi_global_clockevent.shift);
	clockevents_register_device(&vmi_global_clockevent);
	global_clock_event = &vmi_global_clockevent;

	/* We use normal irq0 handler on cpu0. */
	time_init_hook();
}

#ifdef CONFIG_X86_LOCAL_APIC

/* Replacement for lapic timer local clock event.
  * paravirt_ops.setup_boot_clock      = vmi_nop
  *       (continue using global_clock_event on cpu0)
  * paravirt_ops.setup_secondary_clock = vmi_timer_setup_local_alarm
  */
void __devinit vmi_timer_setup_local_alarm(void)
{
	struct clock_event_device *evt = &__get_cpu_var(local_clock_events);

	/* Then, start it back up as a local clockevent device. */
	memcpy(evt, &vmi_clockevent, sizeof(*evt));
	evt->cpumask = cpumask_of_cpu(smp_processor_id());

	printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu 
shift=%u\n",
	       evt->name, evt->mult, evt->shift);
	clockevents_register_device(evt);
}

#endif

/** vmi clocksource */

static cycle_t read_real_cycles(void)
{
	return vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL);
}

static struct clocksource clocksource_vmi = {
	.name			= "vmi-timer",
	.rating			= 450,
	.read			= read_real_cycles,
	.mask			= CLOCKSOURCE_MASK(64),
	.mult			= 0, /* to be set */
	.shift			= 22,
	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
};

static int __init init_vmi_clocksource(void)
{
	cycle_t cycles_per_msec;

	if (!vmi_timer_ops.get_cycle_frequency)
		return 0;
	/* Use khz2mult rather than hz2mult since hz arg is only 32-bits. */
	cycles_per_msec = vmi_timer_ops.get_cycle_frequency();
	(void)do_div(cycles_per_msec, 1000);
	
	/* Note that clocksource.{mult, shift} converts in the opposite direction
	 * as clockevents.  */
	clocksource_vmi.mult = clocksource_khz2mult(cycles_per_msec,
						    clocksource_vmi.shift);

	printk(KERN_WARNING "vmi: registering clock source khz=%lld\n", 
cycles_per_msec);
	return clocksource_register(&clocksource_vmi);

}
module_init(init_vmi_clocksource);

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 19:49                                   ` Dan Hecht
@ 2007-03-07 20:11                                     ` Jeremy Fitzhardinge
  2007-03-07 20:49                                         ` Dan Hecht
  2007-03-07 20:57                                         ` Thomas Gleixner
  2007-03-07 21:21                                     ` Thomas Gleixner
  1 sibling, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 20:11 UTC (permalink / raw)
  To: Dan Hecht
  Cc: James Morris, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Dan Hecht wrote:
> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> ours for reference (please excuse any formating issues); it's also
> lean. We'll send out a proper patch later after some more testing:

So the interrupt side of the clockevent comes through the virtual apic? 
Where does evt->handle_event get called?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 17:41                               ` Jeremy Fitzhardinge
                                                 ` (3 preceding siblings ...)
  (?)
@ 2007-03-07 20:40                               ` Thomas Gleixner
  2007-03-07 21:07                                   ` Jeremy Fitzhardinge
  2007-03-07 21:42                                   ` Dan Hecht
  -1 siblings, 2 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 20:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 09:41 -0800, Jeremy Fitzhardinge wrote:
> Other hypervisors may take other approaches, depending on what the real
> underlying hardware is and the real requirements.  One could imagine a
> hypervisor exposing an hpet mapping, for example, or just having some
> kind of completely synthetic time source.
> 
> The point is that if we were to build an abstraction layer over all of
> these just so that we could have a single clocksource/event
> implementation, it would be pretty much equivalent to the existing clock
> infrastructure, and would add no value.

I tend to disagree. The clockevents infrastructure was designed to cope
with the existing mess of real hardware. The discussion over the last
days exposed me to even more exotic designs than the hardware vendors
were able to deliver until now.

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

I know exactly where you are heading:

Offload the handling of hypervisor design decisions to the kernel and
let us deal with that. So we need to implement 128 bit math to convert
back and forth and I expect more interesting things to creep up. 

All this is of _NO_ use and benefit for the kernel itself.

Real hardware copes well with relative deltas for the events, even when
it is match register based. I thought long about the support for
absolute expiry values in cycles and decided against them to avoid that
math hackery, which you folks now demand.

> I don't think having a clock implementation for each hypervisor is such
> a big deal.  The Xen one, for example, is 300 lines of straightforward code.
> 
> > Abstractions for the abstractions sake are braindead. There is no real
> > reason to implement 128 bit math into that path just to make the virtual
> > clockevent device look like real hardware.
> >
> > The abstraction of clockevents helps you to get rid of hardwired
> > hardware assumptions, but you insist on creating them artificially for
> > reasons which are beyond my grasp.
> >   
> The hypervisor may present abstracted time hardware, but there is real
> time hardware under there somewhere, and there are benefits to making
> the abstraction as thin as possible.

Yeah, it's much faster to do the conversion in the kernel and not in the
hypervisor thin layer. See also below.

> Xen chooses to express its time
> interfaces in ns and so is a good direct match for the Linux time
> infrastructure, but it still has to the 128-bit cycles<->ns conversion
> *somewhere*, because the underlying hardware is still using cycles.  It
> sounds like the VMWare folks have chosen to directly use cycles in order
> to avoid that conversion altogether.

Neither the host OS nor the hypervisors use cycles as the main unit for
their own time related code. They all have the required conversion code
already available.

The historical design of hypervisors was based on emulating the hardware
1:1. So the TSC needs to be a TSC and the LAPIC a LAPIC. 

Paravitualized guests can use smarter virtual hardware which is exposed
to the kernel. Using paravirtualization only to speed up the emulation
of legacy crap without thinking about the overall possible enhancements
is just backwards. 

Paravirtualization is a technique that presents a software interface to
virtual machines that is similar but not identical to that of the
underlying hardware.

clockevents allow you to do that easy and simple, but you insist on a
1:1 conversion of your current design and offload the legacy burden of
your historical hardware usage to the kernel developers. No thanks.

Also let's compare the code flow for a Linux guest on a Linux host:

cylces based:

program_next_event()
	convert to a virtual cycle value
	call into the emulated clock event device
		call into the hypervisor
			convert to nanoseconds
			arm a hrtimer
			convert to real hardware cycles

nanosecond based:

program_next_event()
	call into the emulated clock event device
		call into the hypervisor
			arm a hrtimer
			convert to real hardware cycles

> > Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
> > instead of writing up lengthy excuses, why it is soooo hard and takes
> > sooo much time and the current interface is sooo insufficient.
> >   
> 
> Yep, it worked out well.  The only warty thing in there is the asm
> 128-bit math needed in scale_delta() to convert tsc cycles to ns.  John
> Stultz had suggested (on a much earlier incarnation of this code) that
> it could be generally useful and could be hoisted to somewhere more
> common.  I've included the whole thing below.
> 
> /*
>  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
>  * yielding a 64-bit result.
>  */
> static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
> {
> 	u64 product;
> #ifdef __i386__
> 	u32 tmp1, tmp2;
> #endif
> 
> 	if (shift < 0)
> 		delta >>= -shift;
> 	else
> 		delta <<= shift;
> 
> #ifdef __i386__
> 	__asm__ (
> 		"mul  %5       ; "
> 		"mov  %4,%%eax ; "
> 		"mov  %%edx,%4 ; "
> 		"mul  %5       ; "
> 		"xor  %5,%5    ; "
> 		"add  %4,%%eax ; "
> 		"adc  %5,%%edx ; "
> 		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
> 		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
> #elif __x86_64__
> 	__asm__ (
> 		"mul %%rdx ; shrd $32,%%rdx,%%rax"
> 		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
> #else
> #error implement me!

Yay. Here we are. Once we move that stuff into the core kernel
infrastructure, we have to maintain that warty thing in the worst case
for 24 archs and educate people _NOT_ to use it on an 32 bit ARM 74Mhz
CPU. Those are the things we care about.

> static int xen_set_next_event(unsigned long delta,
> 			      struct clock_event_device *evt)
> {
> 	s64 event = startup_offset + ktime_to_ns(evt->next_event);
> 
> 	if (HYPERVISOR_set_timer_op(event) < 0)
> 		BUG();
>
> 	/* We may have missed the deadline, but there's no real way of
> 	   knowing for sure.  If the event was in the past, then we'll
> 	   get an immediate interrupt. */
> 
> 	return 0;
> }

Looks nice and should serve the purpose for everyone. Here is the real
point for a paravirt_ops() interface.

	return paravirt_ops->clockevent->set_next_event(vcpu, event);

That way all hypervisors can do with that what they want without
cluttering the kernel with their horrible design decisions.

> static const struct clock_event_device xen_clockevent = {
> 	.name = "xen",
> 	.features = CLOCK_EVT_FEAT_ONESHOT,
> 
> 	.max_delta_ns = 0x7fffffff,
> 	.min_delta_ns = 100,	/* ? */
> 
> 	.mult = 1<<XEN_SHIFT,
> 	.shift = XEN_SHIFT,

We can optimize this by skipping the conversion via a feature flag.

> 	.rating = 500,
> 
> 	.set_mode = xen_set_mode,
> 	.set_next_event = xen_set_next_event,
> };
> static DEFINE_PER_CPU(struct clock_event_device, xen_clock_events);

Your implementation is almost the perfect prototype, if you move the
128 bit hackery into the hypervisor and hide it away from the kernel :)

One of these is perfectly fine for _ALL_ of the hypervisor folks.
Anything else is just a backwards decision for the kernel.

We can guarantee that we can and will fix up the 200 lines of code for a
sane clocksource and a clockevent emulation in case we modify those
interfaces, while keeping keeping the specified and agreed paravirt ops
interface intact, but I have ZERO interest to support and fixup 10
different implementations of glue layers with 20 different ways of
making the core clock code horrible.

Again: Imposing the per hypervisor idea of emulation is just backwards.

Create a shared set of interfaces into the hypervisor and do there
whatever you want and need to do. That's what was discussed at the
Kernel Summit in essence. paravirt ops are there to avoid the burden of
maintainence for the various flavours of hypervisor crack and not to
make an easy backdoor to sneak it in and let us have the brain damage.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:11                                     ` Jeremy Fitzhardinge
@ 2007-03-07 20:49                                         ` Dan Hecht
  2007-03-07 20:57                                         ` Thomas Gleixner
  1 sibling, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 20:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: James Morris, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>> ours for reference (please excuse any formating issues); it's also
>> lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?
> 

Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
interrupt handler code.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 20:49                                         ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 20:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, tglx, john stultz, akpm, Ingo Molnar, LKML

On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>> ours for reference (please excuse any formating issues); it's also
>> lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?
> 

Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
interrupt handler code.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:11                                     ` Jeremy Fitzhardinge
@ 2007-03-07 20:57                                         ` Thomas Gleixner
  2007-03-07 20:57                                         ` Thomas Gleixner
  1 sibling, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 20:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
> > Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> > ours for reference (please excuse any formating issues); it's also
> > lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?


>         /* We use normal irq0 handler on cpu0. */
>         time_init_hook();

That's exactly the thing I ranted about before. We keep the historic
view of emulated hardware and just wrap it into enough glue code instead
of doing an abstract design, which just gets rid of those hardware
assumptions at all. That's the big advantage of paravirtualization, but
the current way on paravirt ops is just ignoring this.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 20:57                                         ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 20:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, akpm, Ingo Molnar, LKML

On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
> > Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> > ours for reference (please excuse any formating issues); it's also
> > lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?


>         /* We use normal irq0 handler on cpu0. */
>         time_init_hook();

That's exactly the thing I ranted about before. We keep the historic
view of emulated hardware and just wrap it into enough glue code instead
of doing an abstract design, which just gets rid of those hardware
assumptions at all. That's the big advantage of paravirtualization, but
the current way on paravirt ops is just ignoring this.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:57                                         ` Thomas Gleixner
  (?)
@ 2007-03-07 21:02                                         ` Dan Hecht
  2007-03-07 21:08                                           ` Jeremy Fitzhardinge
  2007-03-07 21:19                                             ` Thomas Gleixner
  -1 siblings, 2 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:02 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
>> Dan Hecht wrote:
>>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>>> ours for reference (please excuse any formating issues); it's also
>>> lean. We'll send out a proper patch later after some more testing:
>> So the interrupt side of the clockevent comes through the virtual apic? 
>> Where does evt->handle_event get called?
> 
> 
>>         /* We use normal irq0 handler on cpu0. */
>>         time_init_hook();
> 
> That's exactly the thing I ranted about before. We keep the historic
> view of emulated hardware and just wrap it into enough glue code instead
> of doing an abstract design, which just gets rid of those hardware
> assumptions at all. That's the big advantage of paravirtualization, but
> the current way on paravirt ops is just ignoring this.
> 

Are you saying you would prefer we create our own irq handler something 
like this rather than using the standard i386 handlers?

irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
{
    local_event->event_handler(local_event);
    return IRQ_HANDLED;
}

??  That's fine with me.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:40                               ` Thomas Gleixner
@ 2007-03-07 21:07                                   ` Jeremy Fitzhardinge
  2007-03-07 21:42                                   ` Dan Hecht
  1 sibling, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 21:07 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Thomas Gleixner wrote:
> I tend to disagree. The clockevents infrastructure was designed to cope
> with the existing mess of real hardware. The discussion over the last
> days exposed me to even more exotic designs than the hardware vendors
> were able to deliver until now.
>   

It's a different but related problem domain.  It's also an increasingly
common execution environment for a kernel to find itself in.  Dealing
with proper paravirtualized timer devices is a big improvement over
trying to reliably deal with fully virtualized hardware timers, which
simply can't make the same guarantees that real hardware can make - such
as "you will definitely get N ns of CPU time between doing the
delta->absolute computation and programming the match register".

> I know exactly where you are heading:
>
> Offload the handling of hypervisor design decisions to the kernel and
> let us deal with that. So we need to implement 128 bit math to convert
> back and forth and I expect more interesting things to creep up. 
>   

I wouldn't put it that way.  We've been getting a lot of pressure to
keep the pv_ops interface as small as possible.  Reusing existing kernel
interfaces rather than making up new ones is a good way to do that.  The
clock infrastructure certainly cleans things up; earlier Xen patches
made a complete copy of the old kernel/time.c and hacked it around,
which isn't what anyone wants to do.

> All this is of _NO_ use and benefit for the kernel itself.
>   

Lots of people want to run Linux in virtual machines.  If we can make
sane kernel changes to help those users, then that is of use an benefit
to the kernel.

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.
>   

Not really.  Xen and VMI interfaces both use absolute monotonic time for
timeouts, which is certainly a common case for such interfaces
(pthread_cond_timedwait, for example).  Converting delta to absolute is
clearly simple, but it does introduce an added bit of non-determinism if
your CPU can be preempted from outside at any time.  I presume SMM or
similar interrupts can cause the same problem on real hardware.

I guess the worst case for real hardware is an absolute-time match
register which only compares for match==now rather than match<=now,
since you could completely lose the time event if you miss the deadline.

>> static const struct clock_event_device xen_clockevent = {
>> 	.name = "xen",
>> 	.features = CLOCK_EVT_FEAT_ONESHOT,
>>
>> 	.max_delta_ns = 0x7fffffff,
>> 	.min_delta_ns = 100,	/* ? */
>>
>> 	.mult = 1<<XEN_SHIFT,
>> 	.shift = XEN_SHIFT,
>>     
>
> We can optimize this by skipping the conversion via a feature flag.
>   

The clocksource needed the shift for ntp warping.  Does the clockevent
need a shift at all?  Could I just set mult/shift to 1/0?

> Your implementation is almost the perfect prototype, if you move the
> 128 bit hackery into the hypervisor and hide it away from the kernel :)
>   

The point is to use the tsc to avoid making any hypercalls, so dealing
with the tsc->ns conversion has to happen on the guest side somehow.

> One of these is perfectly fine for _ALL_ of the hypervisor folks.
> Anything else is just a backwards decision for the kernel.
>   

That would certainly be ideal.  We'll look at the xen, vmi, lguest and
kvm paravirtualized time models and see how much they really have in
common.  I'm a bit curious about how vmi's time events make their way
back into the system.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 21:07                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 21:07 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

Thomas Gleixner wrote:
> I tend to disagree. The clockevents infrastructure was designed to cope
> with the existing mess of real hardware. The discussion over the last
> days exposed me to even more exotic designs than the hardware vendors
> were able to deliver until now.
>   

It's a different but related problem domain.  It's also an increasingly
common execution environment for a kernel to find itself in.  Dealing
with proper paravirtualized timer devices is a big improvement over
trying to reliably deal with fully virtualized hardware timers, which
simply can't make the same guarantees that real hardware can make - such
as "you will definitely get N ns of CPU time between doing the
delta->absolute computation and programming the match register".

> I know exactly where you are heading:
>
> Offload the handling of hypervisor design decisions to the kernel and
> let us deal with that. So we need to implement 128 bit math to convert
> back and forth and I expect more interesting things to creep up. 
>   

I wouldn't put it that way.  We've been getting a lot of pressure to
keep the pv_ops interface as small as possible.  Reusing existing kernel
interfaces rather than making up new ones is a good way to do that.  The
clock infrastructure certainly cleans things up; earlier Xen patches
made a complete copy of the old kernel/time.c and hacked it around,
which isn't what anyone wants to do.

> All this is of _NO_ use and benefit for the kernel itself.
>   

Lots of people want to run Linux in virtual machines.  If we can make
sane kernel changes to help those users, then that is of use an benefit
to the kernel.

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.
>   

Not really.  Xen and VMI interfaces both use absolute monotonic time for
timeouts, which is certainly a common case for such interfaces
(pthread_cond_timedwait, for example).  Converting delta to absolute is
clearly simple, but it does introduce an added bit of non-determinism if
your CPU can be preempted from outside at any time.  I presume SMM or
similar interrupts can cause the same problem on real hardware.

I guess the worst case for real hardware is an absolute-time match
register which only compares for match==now rather than match<=now,
since you could completely lose the time event if you miss the deadline.

>> static const struct clock_event_device xen_clockevent = {
>> 	.name = "xen",
>> 	.features = CLOCK_EVT_FEAT_ONESHOT,
>>
>> 	.max_delta_ns = 0x7fffffff,
>> 	.min_delta_ns = 100,	/* ? */
>>
>> 	.mult = 1<<XEN_SHIFT,
>> 	.shift = XEN_SHIFT,
>>     
>
> We can optimize this by skipping the conversion via a feature flag.
>   

The clocksource needed the shift for ntp warping.  Does the clockevent
need a shift at all?  Could I just set mult/shift to 1/0?

> Your implementation is almost the perfect prototype, if you move the
> 128 bit hackery into the hypervisor and hide it away from the kernel :)
>   

The point is to use the tsc to avoid making any hypercalls, so dealing
with the tsc->ns conversion has to happen on the guest side somehow.

> One of these is perfectly fine for _ALL_ of the hypervisor folks.
> Anything else is just a backwards decision for the kernel.
>   

That would certainly be ideal.  We'll look at the xen, vmi, lguest and
kvm paravirtualized time models and see how much they really have in
common.  I'm a bit curious about how vmi's time events make their way
back into the system.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:02                                         ` Dan Hecht
@ 2007-03-07 21:08                                           ` Jeremy Fitzhardinge
  2007-03-07 21:19                                             ` Thomas Gleixner
  1 sibling, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 21:08 UTC (permalink / raw)
  To: Dan Hecht
  Cc: tglx, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

Dan Hecht wrote:
> Are you saying you would prefer we create our own irq handler
> something like this rather than using the standard i386 handlers?
>
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
>    local_event->event_handler(local_event);
>    return IRQ_HANDLED;
> }
>
> ??  That's fine with me.

It does make the code self-contained.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:19                                             ` Thomas Gleixner
  (?)
@ 2007-03-07 21:14                                             ` Dan Hecht
  -1 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:14 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On 03/07/2007 01:19 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
>> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
>>> On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
>>>> Dan Hecht wrote:
>>>>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>>>>> ours for reference (please excuse any formating issues); it's also
>>>>> lean. We'll send out a proper patch later after some more testing:
>>>> So the interrupt side of the clockevent comes through the virtual apic? 
>>>> Where does evt->handle_event get called?
>>>
>>>>         /* We use normal irq0 handler on cpu0. */
>>>>         time_init_hook();
>>> That's exactly the thing I ranted about before. We keep the historic
>>> view of emulated hardware and just wrap it into enough glue code instead
>>> of doing an abstract design, which just gets rid of those hardware
>>> assumptions at all. That's the big advantage of paravirtualization, but
>>> the current way on paravirt ops is just ignoring this.
>>>
>> Are you saying you would prefer we create our own irq handler something 
>> like this rather than using the standard i386 handlers?
>>
>> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
>> {
>>     local_event->event_handler(local_event);
>>     return IRQ_HANDLED;
>> }
>>
>> ??  That's fine with me.
> 
> I prefer _ONE_ generic abstract implementation of a clock event, which
> can be used by all hypervisors. Please keep all your wiring and ideas of
> how to best emulate a i386 system away from the kernel as far as you
> can.
> 
> Please sit down with the other hypervisor folks and define the five
> functions you need to interact between clockevents and the particular
> hypervisor and implement it once.
> 
> Then you can change and evolve your idea of how handle them best in your
> hypervisor code, where it belongs.
> 

Okay, I guess we are essentially back to the "XEN & VMI" thread.  Let's 
just keep that discussion in one place.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:49                                         ` Dan Hecht
@ 2007-03-07 21:14                                           ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:14 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 12:49 -0800, Dan Hecht wrote:
> On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> > Dan Hecht wrote:
> >> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >> ours for reference (please excuse any formating issues); it's also
> >> lean. We'll send out a proper patch later after some more testing:
> > 
> > So the interrupt side of the clockevent comes through the virtual apic? 
> > Where does evt->handle_event get called?
> > 
> 
> Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
> and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
> interrupt handler code.

Oh well. Here we are again. 2 hypervisors - 4 different views on how to
inject events into the kernel.

This is the complete wrong approach. Paravirtualization should not abuse
existing hardware drivers. It should just provide their own sane
abstract implementation.

Please stop this _NOW_

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 21:14                                           ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:14 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Virtualization Mailing List, akpm, john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 12:49 -0800, Dan Hecht wrote:
> On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> > Dan Hecht wrote:
> >> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >> ours for reference (please excuse any formating issues); it's also
> >> lean. We'll send out a proper patch later after some more testing:
> > 
> > So the interrupt side of the clockevent comes through the virtual apic? 
> > Where does evt->handle_event get called?
> > 
> 
> Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
> and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
> interrupt handler code.

Oh well. Here we are again. 2 hypervisors - 4 different views on how to
inject events into the kernel.

This is the complete wrong approach. Paravirtualization should not abuse
existing hardware drivers. It should just provide their own sane
abstract implementation.

Please stop this _NOW_

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:02                                         ` Dan Hecht
@ 2007-03-07 21:19                                             ` Thomas Gleixner
  2007-03-07 21:19                                             ` Thomas Gleixner
  1 sibling, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:19 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> >> Dan Hecht wrote:
> >>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >>> ours for reference (please excuse any formating issues); it's also
> >>> lean. We'll send out a proper patch later after some more testing:
> >> So the interrupt side of the clockevent comes through the virtual apic? 
> >> Where does evt->handle_event get called?
> > 
> > 
> >>         /* We use normal irq0 handler on cpu0. */
> >>         time_init_hook();
> > 
> > That's exactly the thing I ranted about before. We keep the historic
> > view of emulated hardware and just wrap it into enough glue code instead
> > of doing an abstract design, which just gets rid of those hardware
> > assumptions at all. That's the big advantage of paravirtualization, but
> > the current way on paravirt ops is just ignoring this.
> > 
> 
> Are you saying you would prefer we create our own irq handler something 
> like this rather than using the standard i386 handlers?
> 
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
>     local_event->event_handler(local_event);
>     return IRQ_HANDLED;
> }
> 
> ??  That's fine with me.

I prefer _ONE_ generic abstract implementation of a clock event, which
can be used by all hypervisors. Please keep all your wiring and ideas of
how to best emulate a i386 system away from the kernel as far as you
can.

Please sit down with the other hypervisor folks and define the five
functions you need to interact between clockevents and the particular
hypervisor and implement it once.

Then you can change and evolve your idea of how handle them best in your
hypervisor code, where it belongs.

	tglx

 


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 21:19                                             ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:19 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Virtualization Mailing List, akpm, john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> >> Dan Hecht wrote:
> >>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >>> ours for reference (please excuse any formating issues); it's also
> >>> lean. We'll send out a proper patch later after some more testing:
> >> So the interrupt side of the clockevent comes through the virtual apic? 
> >> Where does evt->handle_event get called?
> > 
> > 
> >>         /* We use normal irq0 handler on cpu0. */
> >>         time_init_hook();
> > 
> > That's exactly the thing I ranted about before. We keep the historic
> > view of emulated hardware and just wrap it into enough glue code instead
> > of doing an abstract design, which just gets rid of those hardware
> > assumptions at all. That's the big advantage of paravirtualization, but
> > the current way on paravirt ops is just ignoring this.
> > 
> 
> Are you saying you would prefer we create our own irq handler something 
> like this rather than using the standard i386 handlers?
> 
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
>     local_event->event_handler(local_event);
>     return IRQ_HANDLED;
> }
> 
> ??  That's fine with me.

I prefer _ONE_ generic abstract implementation of a clock event, which
can be used by all hypervisors. Please keep all your wiring and ideas of
how to best emulate a i386 system away from the kernel as far as you
can.

Please sit down with the other hypervisor folks and define the five
functions you need to interact between clockevents and the particular
hypervisor and implement it once.

Then you can change and evolve your idea of how handle them best in your
hypervisor code, where it belongs.

	tglx

 

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 19:49                                   ` Dan Hecht
  2007-03-07 20:11                                     ` Jeremy Fitzhardinge
@ 2007-03-07 21:21                                     ` Thomas Gleixner
  2007-03-07 21:33                                       ` Dan Hecht
  2007-03-07 22:05                                       ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:21 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote:
> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
> for reference (please excuse any formating issues); it's also lean. 
> We'll send out a proper patch later after some more testing:

Ah. Bitching loud enough speeds things up. :)

> /** vmi clockevent */
> 
> static struct clock_event_device vmi_global_clockevent;
> 
> static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
> {
> 	return (evt == &vmi_global_clockevent) ?
> 		VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
> }
> 
> static void vmi_timer_set_mode(enum clock_event_mode mode,
> 			       struct clock_event_device *evt)
> {
> 	u32 wiring;
> 	cycle_t now, cycles_per_hz;
> 	BUG_ON(!irqs_disabled());
> 
> 	wiring = vmi_alarm_wiring(evt);
> 	if (wiring == VMI_ALARM_WIRED_LVTT)
> 		/* Route the interrupt to the correct vector */
> 		apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);

Wire that in the hypervisor.

> 	switch (mode) {
> 	case CLOCK_EVT_MODE_ONESHOT:
> 		break;
> 	case CLOCK_EVT_MODE_PERIODIC:
> 		cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
> 		(void)do_div(cycles_per_hz, HZ);
> 		now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
> 		vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
> 					now, cycles_per_hz);

	paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period);

> 		break;
> 	case CLOCK_EVT_MODE_UNUSED:
> 	case CLOCK_EVT_MODE_SHUTDOWN:

	paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode);


> 		switch (evt->mode) {
> 		case CLOCK_EVT_MODE_ONESHOT:
> 			vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
> 			break;
> 		case CLOCK_EVT_MODE_PERIODIC:
> 			vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
> 			break;
> 		default:
> 			break;
> 		}
> 		break;
> 	default:
> 		break;
> 	}
> }

This whole vmi_timer_ops thing is horrible. All hypervisors can share 
paravirt_ops->paravirt_clockevent and retrieve the methods on boot.

> static int vmi_timer_next_event(unsigned long delta,
> 				struct clock_event_device *evt)
> {
> 	/* Unfortunately, set_next_event interface only passes relative
> 	 * expiry, but we want absolute expiry.  It'd be better if were
> 	 * were passed an aboslute expiry, since a bunch of time may
> 	 * have been stolen between the time the delta is computed and
> 	 * when we set the alarm below. */
> 	cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));
> 
> 	BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
> 	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
> 				now + delta, 0);
> 	return 0;
> }

Great. Now we have:

       s64 event = startup_offset + ktime_to_ns(evt->next_event);

       if (HYPERVISOR_set_timer_op(event) < 0)
                BUG();
and

	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,	now + delta, 0);

How will the next implementations look like ?

	lguest_program_timer(delta + lguest_current_time(), LGUEST_TIMER_SHOOT_ONCE);

	virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS);

	.......

This is tinkering of the best. My understanding of the paravirt
discussion at Kernel Summit was, that paravirt ops are exactly there to
prevent the above random hackery in the kernel and to allow _ALL_
hypervisors to interact via a sane interface inside of the kernel.

You are just perverting the whole idea of a standartized
paravirtualization interface.

This things can be done for clocksources, clockevents, interrupts (the
generic irq code allows this) and probaly for a whole bunch of other
stuff.

The current paravirt interface is completely insane and will explode
into an unmaintainable nightmare within no time, if we keep accepting
that crap further.

No thanks.

> #ifdef CONFIG_X86_LOCAL_APIC
> 
> /* Replacement for lapic timer local clock event.
>   * paravirt_ops.setup_boot_clock      = vmi_nop
>   *       (continue using global_clock_event on cpu0)
>   * paravirt_ops.setup_secondary_clock = vmi_timer_setup_local_alarm
>   */
> void __devinit vmi_timer_setup_local_alarm(void)
> {
> 	struct clock_event_device *evt = &__get_cpu_var(local_clock_events);
> 
> 	/* Then, start it back up as a local clockevent device. */
> 	memcpy(evt, &vmi_clockevent, sizeof(*evt));
> 	evt->cpumask = cpumask_of_cpu(smp_processor_id());
> 
> 	printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu 
> shift=%u\n",
> 	       evt->name, evt->mult, evt->shift);
> 	clockevents_register_device(evt);
> }
> 
> #endif

Why the hell do you need an lapic emulator here? This is exactly the
kind of crap, we do not want to have. clockevents do not care which
piece of hardware is calling them and we do not care how a particular
hypervisor is wiring that hardware.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:21                                     ` Thomas Gleixner
@ 2007-03-07 21:33                                       ` Dan Hecht
  2007-03-07 22:05                                       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:33 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, James Morris, Virtualization Mailing List,
	akpm, john stultz, Ingo Molnar, LKML

On 03/07/2007 01:21 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote:
>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
>> for reference (please excuse any formating issues); it's also lean. 
>> We'll send out a proper patch later after some more testing:
> 
> Ah. Bitching loud enough speeds things up. :)
> 

We've always planned to do this.  We just didn't want to create the 
dependency between paravirt_ops and clockevents too early such that they 
would depend on each other to merge to main line.  Now that they are 
both there, we are all for it.

>> /** vmi clockevent */
>>
>> static struct clock_event_device vmi_global_clockevent;
>>
>> static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
>> {
>> 	return (evt == &vmi_global_clockevent) ?
>> 		VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
>> }
>>
>> static void vmi_timer_set_mode(enum clock_event_mode mode,
>> 			       struct clock_event_device *evt)
>> {
>> 	u32 wiring;
>> 	cycle_t now, cycles_per_hz;
>> 	BUG_ON(!irqs_disabled());
>>
>> 	wiring = vmi_alarm_wiring(evt);
>> 	if (wiring == VMI_ALARM_WIRED_LVTT)
>> 		/* Route the interrupt to the correct vector */
>> 		apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);
> 
> Wire that in the hypervisor.
> 
>> 	switch (mode) {
>> 	case CLOCK_EVT_MODE_ONESHOT:
>> 		break;
>> 	case CLOCK_EVT_MODE_PERIODIC:
>> 		cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
>> 		(void)do_div(cycles_per_hz, HZ);
>> 		now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
>> 		vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
>> 					now, cycles_per_hz);
> 
> 	paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period);
> 

Huh?  paravirt_ops isn't a hypervisor interface, it's just a linux code 
abstraction.  The code on both sides of paravirt_ops is *linux* code, 
any way you cut it.  clockevents is already a linux code abstraction. 
why introduce the redundancy?


>> 		break;
>> 	case CLOCK_EVT_MODE_UNUSED:
>> 	case CLOCK_EVT_MODE_SHUTDOWN:
> 
> 	paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode);
> 

You would be introducing the same redundancy.

> 
>> 		switch (evt->mode) {
>> 		case CLOCK_EVT_MODE_ONESHOT:
>> 			vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
>> 			break;
>> 		case CLOCK_EVT_MODE_PERIODIC:
>> 			vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
>> 			break;
>> 		default:
>> 			break;
>> 		}
>> 		break;
>> 	default:
>> 		break;
>> 	}
>> }
> 
> This whole vmi_timer_ops thing is horrible. All hypervisors can share 
> paravirt_ops->paravirt_clockevent and retrieve the methods on boot.
> 

vmi_timer_ops.whatever is where the kernel <-> hypervisor boundary is 
crossed for VMI.

>> static int vmi_timer_next_event(unsigned long delta,
>> 				struct clock_event_device *evt)
>> {
>> 	/* Unfortunately, set_next_event interface only passes relative
>> 	 * expiry, but we want absolute expiry.  It'd be better if were
>> 	 * were passed an aboslute expiry, since a bunch of time may
>> 	 * have been stolen between the time the delta is computed and
>> 	 * when we set the alarm below. */
>> 	cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));
>>
>> 	BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
>> 	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
>> 				now + delta, 0);
>> 	return 0;
>> }
> 
> Great. Now we have:
> 
>        s64 event = startup_offset + ktime_to_ns(evt->next_event);
> 
>        if (HYPERVISOR_set_timer_op(event) < 0)
>                 BUG();
> and
> 
> 	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,	now + delta, 0);
> 
> How will the next implementations look like ?
> 
> 	lguest_program_timer(delta + lguest_current_time(), LGUEST_TIMER_SHOOT_ONCE);
> 
> 	virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS);
> 
> 	.......
> 
> This is tinkering of the best. My understanding of the paravirt
> discussion at Kernel Summit was, that paravirt ops are exactly there to
> prevent the above random hackery in the kernel and to allow _ALL_
> hypervisors to interact via a sane interface inside of the kernel.
> 

No, that was not the point of paravirt_ops.  It is actually the complete 
opposite of the intention of paravirt_ops.  paravirt_ops' intent is 
exactly to allow for *multiple* hypervisor ABIs to exist in the kernel.

At kernel summit, paravirt_ops was proposed to allow for multiple 
hypervisor ABI's to be targeted by the kernel.  The code on both sides 
of paravirt_ops is *linux* code.

> You are just perverting the whole idea of a standartized
> paravirtualization interface.
> 
> This things can be done for clocksources, clockevents, interrupts (the
> generic irq code allows this) and probaly for a whole bunch of other
> stuff.
> 
> The current paravirt interface is completely insane and will explode
> into an unmaintainable nightmare within no time, if we keep accepting
> that crap further.
>
> No thanks.
>

Again, you are misunderstanding the intent of paravirt_ops and history 
behind it's development.


>> #ifdef CONFIG_X86_LOCAL_APIC
>>
>> /* Replacement for lapic timer local clock event.
>>   * paravirt_ops.setup_boot_clock      = vmi_nop
>>   *       (continue using global_clock_event on cpu0)
>>   * paravirt_ops.setup_secondary_clock = vmi_timer_setup_local_alarm
>>   */
>> void __devinit vmi_timer_setup_local_alarm(void)
>> {
>> 	struct clock_event_device *evt = &__get_cpu_var(local_clock_events);
>>
>> 	/* Then, start it back up as a local clockevent device. */
>> 	memcpy(evt, &vmi_clockevent, sizeof(*evt));
>> 	evt->cpumask = cpumask_of_cpu(smp_processor_id());
>>
>> 	printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu 
>> shift=%u\n",
>> 	       evt->name, evt->mult, evt->shift);
>> 	clockevents_register_device(evt);
>> }
>>
>> #endif
> 
> Why the hell do you need an lapic emulator here? This is exactly the
> kind of crap, we do not want to have. clockevents do not care which
> piece of hardware is calling them and we do not care how a particular
> hypervisor is wiring that hardware.
> 

Again, I said in a previous mail that we am fine with introducing our 
own interrupt handler rather than using the lapic one.


Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:40                                     ` Thomas Gleixner
  (?)
@ 2007-03-07 21:34                                     ` Dan Hecht
  2007-03-07 22:14                                       ` Thomas Gleixner
  -1 siblings, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:34 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On 03/07/2007 01:40 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
>> kvm paravirtualized time models and see how much they really have in
>> common.  I'm a bit curious about how vmi's time events make their way
>> back into the system.
> 
> By the crude mechanism I'm fighting.
>

Hmm?  They make there way back via interrupts.  How is that crude?

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:07                                   ` Jeremy Fitzhardinge
@ 2007-03-07 21:40                                     ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > I tend to disagree. The clockevents infrastructure was designed to cope
> > with the existing mess of real hardware. The discussion over the last
> > days exposed me to even more exotic designs than the hardware vendors
> > were able to deliver until now.
> >   
> 
> It's a different but related problem domain.  It's also an increasingly
> common execution environment for a kernel to find itself in.  Dealing
> with proper paravirtualized timer devices is a big improvement over
> trying to reliably deal with fully virtualized hardware timers, which
> simply can't make the same guarantees that real hardware can make - such
> as "you will definitely get N ns of CPU time between doing the
> delta->absolute computation and programming the match register".

That's exactly the reason why we want only _ONE_ proper virtualized
timer device instead of 10 new variants of broken hardware.

> > I know exactly where you are heading:
> >
> > Offload the handling of hypervisor design decisions to the kernel and
> > let us deal with that. So we need to implement 128 bit math to convert
> > back and forth and I expect more interesting things to creep up. 
> >   
> 
> I wouldn't put it that way.  We've been getting a lot of pressure to
> keep the pv_ops interface as small as possible.  Reusing existing kernel
> interfaces rather than making up new ones is a good way to do that.  The
> clock infrastructure certainly cleans things up; earlier Xen patches
> made a complete copy of the old kernel/time.c and hacked it around,
> which isn't what anyone wants to do.

All you need is exactly ONE paravirt clockevent device and ONE paravirt
clocksource for _ALL_ hypervisors. Cast that into stone with a
paravirt_ops->clockwahtever interface and we are all happy.

> > All this is of _NO_ use and benefit for the kernel itself.
> >   
> 
> Lots of people want to run Linux in virtual machines.  If we can make
> sane kernel changes to help those users, then that is of use an benefit
> to the kernel.

The above will give a real benefit as it is a well defined interface,
which can be verified on both ends.

> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> >   
> 
> Not really.  Xen and VMI interfaces both use absolute monotonic time for
> timeouts, which is certainly a common case for such interfaces
> (pthread_cond_timedwait, for example).  Converting delta to absolute is
> clearly simple, but it does introduce an added bit of non-determinism if
> your CPU can be preempted from outside at any time.  I presume SMM or
> similar interrupts can cause the same problem on real hardware.

As I said before: I have no objection against expanding / changing the
clockevents interface to deliver absolute expiry time, which we have
already handy.

I just refuse for a good reason to convert it from ktime_t (nanoseconds)
to an absolute cycle value. This can be done on the hypervisor side of
the paravirt clock event device. Same applies for clocksources. The ones
which need nanosecond from/to whatever conversion can do it _IN_ the
hypervisor and not in 10 different grades of madness in the kernel code.

> > We can optimize this by skipping the conversion via a feature flag.
  
> The clocksource needed the shift for ntp warping.  Does the clockevent
> need a shift at all?  Could I just set mult/shift to 1/0?

Yes.

> > Your implementation is almost the perfect prototype, if you move the
> > 128 bit hackery into the hypervisor and hide it away from the kernel :)
> >   
> The point is to use the tsc to avoid making any hypercalls, so dealing
> with the tsc->ns conversion has to happen on the guest side somehow.

I understand that you want to make this as fast as possible, but TSC is
broken in more than one way and it just makes me barf, when we have yet
another way of dealing with it in the kernel.

Please keep the paravirt interface abstract and treat it in the same way
we treat the kernel - userspace API. The kernel hides all this hardware
crap away from the user space and the same applies for a sane paravirt
interface. This is also a benefit in terms of portability. 

For devices, which already live on top of an abstraction layer in the
kernel, e.g. clocksources, clockevents, interrupts, we can share one
implementation accross multiple platforms.

> > One of these is perfectly fine for _ALL_ of the hypervisor folks.
> > Anything else is just a backwards decision for the kernel.
> >   
> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
> kvm paravirtualized time models and see how much they really have in
> common.  I'm a bit curious about how vmi's time events make their way
> back into the system.

By the crude mechanism I'm fighting.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 21:40                                     ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 21:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > I tend to disagree. The clockevents infrastructure was designed to cope
> > with the existing mess of real hardware. The discussion over the last
> > days exposed me to even more exotic designs than the hardware vendors
> > were able to deliver until now.
> >   
> 
> It's a different but related problem domain.  It's also an increasingly
> common execution environment for a kernel to find itself in.  Dealing
> with proper paravirtualized timer devices is a big improvement over
> trying to reliably deal with fully virtualized hardware timers, which
> simply can't make the same guarantees that real hardware can make - such
> as "you will definitely get N ns of CPU time between doing the
> delta->absolute computation and programming the match register".

That's exactly the reason why we want only _ONE_ proper virtualized
timer device instead of 10 new variants of broken hardware.

> > I know exactly where you are heading:
> >
> > Offload the handling of hypervisor design decisions to the kernel and
> > let us deal with that. So we need to implement 128 bit math to convert
> > back and forth and I expect more interesting things to creep up. 
> >   
> 
> I wouldn't put it that way.  We've been getting a lot of pressure to
> keep the pv_ops interface as small as possible.  Reusing existing kernel
> interfaces rather than making up new ones is a good way to do that.  The
> clock infrastructure certainly cleans things up; earlier Xen patches
> made a complete copy of the old kernel/time.c and hacked it around,
> which isn't what anyone wants to do.

All you need is exactly ONE paravirt clockevent device and ONE paravirt
clocksource for _ALL_ hypervisors. Cast that into stone with a
paravirt_ops->clockwahtever interface and we are all happy.

> > All this is of _NO_ use and benefit for the kernel itself.
> >   
> 
> Lots of people want to run Linux in virtual machines.  If we can make
> sane kernel changes to help those users, then that is of use an benefit
> to the kernel.

The above will give a real benefit as it is a well defined interface,
which can be verified on both ends.

> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> >   
> 
> Not really.  Xen and VMI interfaces both use absolute monotonic time for
> timeouts, which is certainly a common case for such interfaces
> (pthread_cond_timedwait, for example).  Converting delta to absolute is
> clearly simple, but it does introduce an added bit of non-determinism if
> your CPU can be preempted from outside at any time.  I presume SMM or
> similar interrupts can cause the same problem on real hardware.

As I said before: I have no objection against expanding / changing the
clockevents interface to deliver absolute expiry time, which we have
already handy.

I just refuse for a good reason to convert it from ktime_t (nanoseconds)
to an absolute cycle value. This can be done on the hypervisor side of
the paravirt clock event device. Same applies for clocksources. The ones
which need nanosecond from/to whatever conversion can do it _IN_ the
hypervisor and not in 10 different grades of madness in the kernel code.

> > We can optimize this by skipping the conversion via a feature flag.
  
> The clocksource needed the shift for ntp warping.  Does the clockevent
> need a shift at all?  Could I just set mult/shift to 1/0?

Yes.

> > Your implementation is almost the perfect prototype, if you move the
> > 128 bit hackery into the hypervisor and hide it away from the kernel :)
> >   
> The point is to use the tsc to avoid making any hypercalls, so dealing
> with the tsc->ns conversion has to happen on the guest side somehow.

I understand that you want to make this as fast as possible, but TSC is
broken in more than one way and it just makes me barf, when we have yet
another way of dealing with it in the kernel.

Please keep the paravirt interface abstract and treat it in the same way
we treat the kernel - userspace API. The kernel hides all this hardware
crap away from the user space and the same applies for a sane paravirt
interface. This is also a benefit in terms of portability. 

For devices, which already live on top of an abstraction layer in the
kernel, e.g. clocksources, clockevents, interrupts, we can share one
implementation accross multiple platforms.

> > One of these is perfectly fine for _ALL_ of the hypervisor folks.
> > Anything else is just a backwards decision for the kernel.
> >   
> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
> kvm paravirtualized time models and see how much they really have in
> common.  I'm a bit curious about how vmi's time events make their way
> back into the system.

By the crude mechanism I'm fighting.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 20:40                               ` Thomas Gleixner
@ 2007-03-07 21:42                                   ` Dan Hecht
  2007-03-07 21:42                                   ` Dan Hecht
  1 sibling, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:42 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.

First of all, I'm not "demanding" anything. I'm just trying to have a 
technical discussion about the issues.  If it comes out that absolute 
expiry can't be done cleanly, and the cost out weighs the benefit, then 
so be it.  But, what's so wrong about having the discussion?

When you do have match register (or count and compare, whatever you want 
to call it) based timers in real hardware, the relative expiry interface 
in software is a bit suboptimal.  You still have no idea how much time 
has already gone by between the time you calculated the delta and when 
you setup the hardware (you have a pretty good estimate, but can't know 
for sure unless you disable caches and all other sources of 
non-determinate latencies).  So, you will always be a little late in 
your timer firing.  You may argue that no client of clockevents cares 
about this little bit of lateness.  But, it does exist, and can be 
solved with a software interface that talks in terms of absolute expiries.

Perhaps we can't get around the 128-bit math problem, or maybe we can 
think of a clever solution.  If we can't, then maybe fixing the lateness 
is not worth the cost 128-bit math.  But, maybe there is a clean way 
around the 128-bit math and we just need to approach it from another angle.

Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 21:42                                   ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 21:42 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.

First of all, I'm not "demanding" anything. I'm just trying to have a 
technical discussion about the issues.  If it comes out that absolute 
expiry can't be done cleanly, and the cost out weighs the benefit, then 
so be it.  But, what's so wrong about having the discussion?

When you do have match register (or count and compare, whatever you want 
to call it) based timers in real hardware, the relative expiry interface 
in software is a bit suboptimal.  You still have no idea how much time 
has already gone by between the time you calculated the delta and when 
you setup the hardware (you have a pretty good estimate, but can't know 
for sure unless you disable caches and all other sources of 
non-determinate latencies).  So, you will always be a little late in 
your timer firing.  You may argue that no client of clockevents cares 
about this little bit of lateness.  But, it does exist, and can be 
solved with a software interface that talks in terms of absolute expiries.

Perhaps we can't get around the 128-bit math problem, or maybe we can 
think of a clever solution.  If we can't, then maybe fixing the lateness 
is not worth the cost 128-bit math.  But, maybe there is a clean way 
around the 128-bit math and we just need to approach it from another angle.

Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:21                                     ` Thomas Gleixner
  2007-03-07 21:33                                       ` Dan Hecht
@ 2007-03-07 22:05                                       ` Jeremy Fitzhardinge
  2007-03-07 23:05                                           ` Thomas Gleixner
  1 sibling, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 22:05 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

Thomas Gleixner wrote:
> This is tinkering of the best. My understanding of the paravirt
> discussion at Kernel Summit was, that paravirt ops are exactly there to
> prevent the above random hackery in the kernel and to allow _ALL_
> hypervisors to interact via a sane interface inside of the kernel.
>   

No, I don't think that was ever the intent.  The idea was to create a
new interface for things which don't currently have an interface in the
kernel, such as how to run the CPU in ring 1 and manage pagetable
updates.  But an important and explicit intent of the project was to use
existing kernel interfaces where possible, rather than try to make
pv_ops an monster all-encompassing interface.

Using the new time infrastructure was an explicit example of that.  We
anticipated that different hypervisors would have different ways of
doing time, but all would be easily accommodated by the
clocksource/events infrastructure, and so each would have its own
implementation for these interfaces.  From the kernel's perspective,
they're just another time device, and we manage to avoid making any core
kernel changes, or bloating the pv_ops interface.  It seems like a
natural use of the clock subsystem's design.

> You are just perverting the whole idea of a standartized
> paravirtualization interface.
>
> This things can be done for clocksources, clockevents, interrupts (the
> generic irq code allows this) and probaly for a whole bunch of other
> stuff.
>   

Yes, exactly.  The entirety of the Xen support consists of not only an
implementation of the paravirt_ops interface, but also the Xen
clocksource and clockevents and the Xen irqchip.  My hope and intent is
that we can shrink the paravirt_ops interface in favour of using
existing generally useful kernel interfaces.

> The current paravirt interface is completely insane and will explode
> into an unmaintainable nightmare within no time, if we keep accepting
> that crap further.
>   

No, that's exactly what we've been trying to avoid.

If we start patching in new paravirt_ops to deal with time, interrupts,
or whatever piece of functionality which already has a perfectly good
kernel interface, then we're just increasing the size of the pv_ops
interface, its entanglement with the rest of the system and the amount
of potential legacy stuff which gets dragged around as the interface
evolves.

As hardware gets better at supporting virtualization directly, we're
going to see more hybrid para- and fully- virtualized hypervisor
interfaces.  The result will be that more and more of paravirt_ops will
be implemented by the "native" versions of the functions; maybe at some
point the whole thing will evaporate away.

It's not a huge reach to expect the hardware vendors to get a clue about
time hardware (scratch that, of course it is, but we can always hope)
and come up with something that is directly usable from either an OS
running natively or from within a virtual machine.  In that case, I'm
sure you'd agree it would warrant a real clocksource/event
implementation.  In the scheme I'm proposing, that's no big deal; you
just register the hardware driver, and that's that.  But what you're
proposing leaves this vestigial interface sitting in pv_ops, doing
nothing other than being redundant.


My principle goal here is to get the Xen code into the kernel, and I'm
being pragmatic about it.  If you think having a xen_clocksource is an
absolute blocker to merging this stuff, then I'll add the interface to
pv_ops, and we'll work out how to wire all the hypervisors up underneath
that interface.  But I think it's precisely the wrong way to go from an
overall kernel perspective.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:42                                   ` Dan Hecht
@ 2007-03-07 22:07                                     ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 22:07 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 13:42 -0800, Dan Hecht wrote:
> On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> 
> First of all, I'm not "demanding" anything. I'm just trying to have a 
> technical discussion about the issues.  If it comes out that absolute 
> expiry can't be done cleanly, and the cost out weighs the benefit, then 
> so be it.  But, what's so wrong about having the discussion?
> 
> When you do have match register (or count and compare, whatever you want 
> to call it) based timers in real hardware, the relative expiry interface 
> in software is a bit suboptimal.  You still have no idea how much time 
> has already gone by between the time you calculated the delta and when 
> you setup the hardware (you have a pretty good estimate, but can't know 
> for sure unless you disable caches and all other sources of 
> non-determinate latencies).  So, you will always be a little late in 
> your timer firing.  You may argue that no client of clockevents cares 
> about this little bit of lateness.  But, it does exist, and can be 
> solved with a software interface that talks in terms of absolute expiries.

With sane hardware yes. But there is no sane hardware. You need a (<=)
match machinery instead of the available (==) ones, which introduce
extra latencies and incorrectness. See arch/i386/kernel/hpet.c. We can
end up with returning -ETIME and an interrupt, as we have no control
over SMM code and such crap at all. For such devices the delta based
expiry is actually faster, as it avoids the calculation of wraps and the
possible 128 bit math in the reprogramming path.

This correctness discussion is purely hypothetical on current real world
hardware.

> Perhaps we can't get around the 128-bit math problem, or maybe we can 
> think of a clever solution.  If we can't, then maybe fixing the lateness 
> is not worth the cost 128-bit math.  But, maybe there is a clean way 
> around the 128-bit math and we just need to approach it from another angle.

Please put the clever solution inside of the clockevent. I can provide
the absolute time in nanoseconds without making you touch the
clockevent->next_event variable.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 22:07                                     ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 22:07 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Wed, 2007-03-07 at 13:42 -0800, Dan Hecht wrote:
> On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> 
> First of all, I'm not "demanding" anything. I'm just trying to have a 
> technical discussion about the issues.  If it comes out that absolute 
> expiry can't be done cleanly, and the cost out weighs the benefit, then 
> so be it.  But, what's so wrong about having the discussion?
> 
> When you do have match register (or count and compare, whatever you want 
> to call it) based timers in real hardware, the relative expiry interface 
> in software is a bit suboptimal.  You still have no idea how much time 
> has already gone by between the time you calculated the delta and when 
> you setup the hardware (you have a pretty good estimate, but can't know 
> for sure unless you disable caches and all other sources of 
> non-determinate latencies).  So, you will always be a little late in 
> your timer firing.  You may argue that no client of clockevents cares 
> about this little bit of lateness.  But, it does exist, and can be 
> solved with a software interface that talks in terms of absolute expiries.

With sane hardware yes. But there is no sane hardware. You need a (<=)
match machinery instead of the available (==) ones, which introduce
extra latencies and incorrectness. See arch/i386/kernel/hpet.c. We can
end up with returning -ETIME and an interrupt, as we have no control
over SMM code and such crap at all. For such devices the delta based
expiry is actually faster, as it avoids the calculation of wraps and the
possible 128 bit math in the reprogramming path.

This correctness discussion is purely hypothetical on current real world
hardware.

> Perhaps we can't get around the 128-bit math problem, or maybe we can 
> think of a clever solution.  If we can't, then maybe fixing the lateness 
> is not worth the cost 128-bit math.  But, maybe there is a clean way 
> around the 128-bit math and we just need to approach it from another angle.

Please put the clever solution inside of the clockevent. I can provide
the absolute time in nanoseconds without making you touch the
clockevent->next_event variable.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:34                                     ` Dan Hecht
@ 2007-03-07 22:14                                       ` Thomas Gleixner
  2007-03-07 22:17                                           ` Zachary Amsden
  0 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 22:14 UTC (permalink / raw)
  To: Dan Hecht
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 13:34 -0800, Dan Hecht wrote:
> On 03/07/2007 01:40 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> > That would certainly be ideal.  We'll look at the xen, vmi, lguest and
> >> kvm paravirtualized time models and see how much they really have in
> >> common.  I'm a bit curious about how vmi's time events make their way
> >> back into the system.
> > 
> > By the crude mechanism I'm fighting.
> >
> 
> Hmm?  They make there way back via interrupts.  How is that crude?

Simply because you _ABUSE_ timer_init_hook() to set it up. Keep it self
contained and do not impose restrictions on the kernel core code, which
we have to maintain.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 22:14                                       ` Thomas Gleixner
@ 2007-03-07 22:17                                           ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-07 22:17 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, Jeremy Fitzhardinge, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Thomas Gleixner wrote:
> Simply because you _ABUSE_ timer_init_hook() to set it up. Keep it self
> contained and do not impose restrictions on the kernel core code, which
> we have to maintain.
>   

But time_init_hook is supposed to be abused.  That is its purpose - to 
be a hook for different time devices on SGI Visual Workstation and 
Voyager.  And we don't actually abuse it anymore, we just bypass it 
because the default timer init path wants to setup the PIT or the HPET, 
neither of which should be used in paravirt.

Zach



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 22:17                                           ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-07 22:17 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

Thomas Gleixner wrote:
> Simply because you _ABUSE_ timer_init_hook() to set it up. Keep it self
> contained and do not impose restrictions on the kernel core code, which
> we have to maintain.
>   

But time_init_hook is supposed to be abused.  That is its purpose - to 
be a hook for different time devices on SGI Visual Workstation and 
Voyager.  And we don't actually abuse it anymore, we just bypass it 
because the default timer init path wants to setup the PIT or the HPET, 
neither of which should be used in paravirt.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 22:31                                             ` Thomas Gleixner
@ 2007-03-07 22:28                                               ` Dan Hecht
  -1 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 22:28 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Jeremy Fitzhardinge, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On 03/07/2007 02:31 PM, Thomas Gleixner wrote:
> Please make these things self contained and not relying on whatever
> time_init_hook() contains.
> 

Fixing up the code to do this now....

thanks,
Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 22:28                                               ` Dan Hecht
  0 siblings, 0 replies; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 22:28 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On 03/07/2007 02:31 PM, Thomas Gleixner wrote:
> Please make these things self contained and not relying on whatever
> time_init_hook() contains.
> 

Fixing up the code to do this now....

thanks,
Dan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 22:17                                           ` Zachary Amsden
@ 2007-03-07 22:31                                             ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 22:31 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Dan Hecht, Jeremy Fitzhardinge, Ingo Molnar, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

On Wed, 2007-03-07 at 14:17 -0800, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > Simply because you _ABUSE_ timer_init_hook() to set it up. Keep it self
> > contained and do not impose restrictions on the kernel core code, which
> > we have to maintain.
> >   
> 
> But time_init_hook is supposed to be abused.  That is its purpose - to 
> be a hook for different time devices on SGI Visual Workstation and 
> Voyager.  And we don't actually abuse it anymore, we just bypass it 
> because the default timer init path wants to setup the PIT or the HPET, 
> neither of which should be used in paravirt.

It is there for those hardware platforms, but using it inside your clock
event device is _JUST_ wrong.

Please make these things self contained and not relying on whatever
time_init_hook() contains.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 22:31                                             ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 22:31 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Virtualization Mailing List, john stultz, LKML, Ingo Molnar, akpm

On Wed, 2007-03-07 at 14:17 -0800, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > Simply because you _ABUSE_ timer_init_hook() to set it up. Keep it self
> > contained and do not impose restrictions on the kernel core code, which
> > we have to maintain.
> >   
> 
> But time_init_hook is supposed to be abused.  That is its purpose - to 
> be a hook for different time devices on SGI Visual Workstation and 
> Voyager.  And we don't actually abuse it anymore, we just bypass it 
> because the default timer init path wants to setup the PIT or the HPET, 
> neither of which should be used in paravirt.

It is there for those hardware platforms, but using it inside your clock
event device is _JUST_ wrong.

Please make these things self contained and not relying on whatever
time_init_hook() contains.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 22:05                                       ` Jeremy Fitzhardinge
@ 2007-03-07 23:05                                           ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 23:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 14:05 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > This is tinkering of the best. My understanding of the paravirt
> > discussion at Kernel Summit was, that paravirt ops are exactly there to
> > prevent the above random hackery in the kernel and to allow _ALL_
> > hypervisors to interact via a sane interface inside of the kernel.
> >   
> 
> No, I don't think that was ever the intent.  The idea was to create a
> new interface for things which don't currently have an interface in the
> kernel, such as how to run the CPU in ring 1 and manage pagetable
> updates.  But an important and explicit intent of the project was to use
> existing kernel interfaces where possible, rather than try to make
> pv_ops an monster all-encompassing interface.

Maybe I missunderstood. 

Still there is a difference between using existing kernel interfaces and
abusing them in a way which makes modifications to the core kernel code
hard and unmaintainable. See below.

> Using the new time infrastructure was an explicit example of that.  We
> anticipated that different hypervisors would have different ways of
> doing time, but all would be easily accommodated by the
> clocksource/events infrastructure, and so each would have its own
> implementation for these interfaces.  From the kernel's perspective,
> they're just another time device, and we manage to avoid making any core
> kernel changes, or bloating the pv_ops interface.  It seems like a
> natural use of the clock subsystem's design.

On the other hand we yet see things like:

        /* We use normal irq0 handler on cpu0. */
        time_init_hook();

Which is just reaching into the kernel code directly and does not handle
the clock event interrupt self contained. clockevents is not bound to
IRQ0 and this kind of hackery is exactly what we need to avoid in order
to get this maintainable.

Once this is used by paravirt implementations a change to the
mach-default implementation will break stuff left and right.

Also the whole LAPIC business is so horrible, that it hurts. The generic
interrupt layer is there since almost a year and we still see the crude
emulation of hardware and assumptions of irq0 setup all over the place.

We carefully need to define, which existing kernel interfaces are used /
hooked in which way.

If the paravirt implementations actually use the already available
abstractions in the way in which those abstractions are designed, then
we get into a maintainable design. If there are shortcomings on those
abstractions we need to fix them in a sane way or provide a _common_
workaround (e.g. 128 bit math back and forth library) without impacting
the main kernel code.

Looking at vmitimer.c and the number of hardcoded assumptions are
telling me, that we are heading in exactly the opposite direction.

> > You are just perverting the whole idea of a standartized
> > paravirtualization interface.
> >
> > This things can be done for clocksources, clockevents, interrupts (the
> > generic irq code allows this) and probaly for a whole bunch of other
> > stuff.
> >   
> 
> Yes, exactly.  The entirety of the Xen support consists of not only an
> implementation of the paravirt_ops interface, but also the Xen
> clocksource and clockevents and the Xen irqchip.  My hope and intent is
> that we can shrink the paravirt_ops interface in favour of using
> existing generally useful kernel interfaces.

Yes, if they are used in a sane and self contained way without reaching
all over the place and expecting that those functions, which are not
part of the paravirt interfaces will work for ever.

> > The current paravirt interface is completely insane and will explode
> > into an unmaintainable nightmare within no time, if we keep accepting
> > that crap further.
> >   
> 
> No, that's exactly what we've been trying to avoid.
> 
> If we start patching in new paravirt_ops to deal with time, interrupts,
> or whatever piece of functionality which already has a perfectly good
> kernel interface, then we're just increasing the size of the pv_ops
> interface, its entanglement with the rest of the system and the amount
> of potential legacy stuff which gets dragged around as the interface
> evolves.

You are not increasing the entanglement with the rest of the system,
when you use a self contained device on top of an existing core kernel
infrastructure, which has a paravirt backend. Quite the contrary, you
have one piece of virtual hardware which is connected to the kernel and
interacts with the various incarnations on the other side, which can as
well live inside the kernel code. Granted it is another level of
indirection, but I'd be happy to have only to deal with one of those
beasts.

> As hardware gets better at supporting virtualization directly, we're
> going to see more hybrid para- and fully- virtualized hypervisor
> interfaces.  The result will be that more and more of paravirt_ops will
> be implemented by the "native" versions of the functions; maybe at some
> point the whole thing will evaporate away.
> 
> It's not a huge reach to expect the hardware vendors to get a clue about
> time hardware (scratch that, of course it is, but we can always hope)

hehe. There is always hope, but reality is so frustrating :)

> and come up with something that is directly usable from either an OS
> running natively or from within a virtual machine.  In that case, I'm
> sure you'd agree it would warrant a real clocksource/event
> implementation.  

Yes

> In the scheme I'm proposing, that's no big deal; you
> just register the hardware driver, and that's that.  But what you're
> proposing leaves this vestigial interface sitting in pv_ops, doing
> nothing other than being redundant.

Fair enough.

> My principle goal here is to get the Xen code into the kernel, and I'm
> being pragmatic about it.  If you think having a xen_clocksource is an
> absolute blocker to merging this stuff, then I'll add the interface to
> pv_ops, and we'll work out how to wire all the hypervisors up underneath
> that interface.  But I think it's precisely the wrong way to go from an
> overall kernel perspective.

No it's not an absolute blocker, as long as we can take care, that the
number of incarnations is 

- designed to be shareable between hypervisors which have the same time
model
- common code like the 128 bit math is in a shared library
- self contained and not reaching out into core kernel code for no good
reason

Same goes for clock events, interrupts and other core facilities.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 23:05                                           ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-07 23:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, akpm, Ingo Molnar, LKML

On Wed, 2007-03-07 at 14:05 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > This is tinkering of the best. My understanding of the paravirt
> > discussion at Kernel Summit was, that paravirt ops are exactly there to
> > prevent the above random hackery in the kernel and to allow _ALL_
> > hypervisors to interact via a sane interface inside of the kernel.
> >   
> 
> No, I don't think that was ever the intent.  The idea was to create a
> new interface for things which don't currently have an interface in the
> kernel, such as how to run the CPU in ring 1 and manage pagetable
> updates.  But an important and explicit intent of the project was to use
> existing kernel interfaces where possible, rather than try to make
> pv_ops an monster all-encompassing interface.

Maybe I missunderstood. 

Still there is a difference between using existing kernel interfaces and
abusing them in a way which makes modifications to the core kernel code
hard and unmaintainable. See below.

> Using the new time infrastructure was an explicit example of that.  We
> anticipated that different hypervisors would have different ways of
> doing time, but all would be easily accommodated by the
> clocksource/events infrastructure, and so each would have its own
> implementation for these interfaces.  From the kernel's perspective,
> they're just another time device, and we manage to avoid making any core
> kernel changes, or bloating the pv_ops interface.  It seems like a
> natural use of the clock subsystem's design.

On the other hand we yet see things like:

        /* We use normal irq0 handler on cpu0. */
        time_init_hook();

Which is just reaching into the kernel code directly and does not handle
the clock event interrupt self contained. clockevents is not bound to
IRQ0 and this kind of hackery is exactly what we need to avoid in order
to get this maintainable.

Once this is used by paravirt implementations a change to the
mach-default implementation will break stuff left and right.

Also the whole LAPIC business is so horrible, that it hurts. The generic
interrupt layer is there since almost a year and we still see the crude
emulation of hardware and assumptions of irq0 setup all over the place.

We carefully need to define, which existing kernel interfaces are used /
hooked in which way.

If the paravirt implementations actually use the already available
abstractions in the way in which those abstractions are designed, then
we get into a maintainable design. If there are shortcomings on those
abstractions we need to fix them in a sane way or provide a _common_
workaround (e.g. 128 bit math back and forth library) without impacting
the main kernel code.

Looking at vmitimer.c and the number of hardcoded assumptions are
telling me, that we are heading in exactly the opposite direction.

> > You are just perverting the whole idea of a standartized
> > paravirtualization interface.
> >
> > This things can be done for clocksources, clockevents, interrupts (the
> > generic irq code allows this) and probaly for a whole bunch of other
> > stuff.
> >   
> 
> Yes, exactly.  The entirety of the Xen support consists of not only an
> implementation of the paravirt_ops interface, but also the Xen
> clocksource and clockevents and the Xen irqchip.  My hope and intent is
> that we can shrink the paravirt_ops interface in favour of using
> existing generally useful kernel interfaces.

Yes, if they are used in a sane and self contained way without reaching
all over the place and expecting that those functions, which are not
part of the paravirt interfaces will work for ever.

> > The current paravirt interface is completely insane and will explode
> > into an unmaintainable nightmare within no time, if we keep accepting
> > that crap further.
> >   
> 
> No, that's exactly what we've been trying to avoid.
> 
> If we start patching in new paravirt_ops to deal with time, interrupts,
> or whatever piece of functionality which already has a perfectly good
> kernel interface, then we're just increasing the size of the pv_ops
> interface, its entanglement with the rest of the system and the amount
> of potential legacy stuff which gets dragged around as the interface
> evolves.

You are not increasing the entanglement with the rest of the system,
when you use a self contained device on top of an existing core kernel
infrastructure, which has a paravirt backend. Quite the contrary, you
have one piece of virtual hardware which is connected to the kernel and
interacts with the various incarnations on the other side, which can as
well live inside the kernel code. Granted it is another level of
indirection, but I'd be happy to have only to deal with one of those
beasts.

> As hardware gets better at supporting virtualization directly, we're
> going to see more hybrid para- and fully- virtualized hypervisor
> interfaces.  The result will be that more and more of paravirt_ops will
> be implemented by the "native" versions of the functions; maybe at some
> point the whole thing will evaporate away.
> 
> It's not a huge reach to expect the hardware vendors to get a clue about
> time hardware (scratch that, of course it is, but we can always hope)

hehe. There is always hope, but reality is so frustrating :)

> and come up with something that is directly usable from either an OS
> running natively or from within a virtual machine.  In that case, I'm
> sure you'd agree it would warrant a real clocksource/event
> implementation.  

Yes

> In the scheme I'm proposing, that's no big deal; you
> just register the hardware driver, and that's that.  But what you're
> proposing leaves this vestigial interface sitting in pv_ops, doing
> nothing other than being redundant.

Fair enough.

> My principle goal here is to get the Xen code into the kernel, and I'm
> being pragmatic about it.  If you think having a xen_clocksource is an
> absolute blocker to merging this stuff, then I'll add the interface to
> pv_ops, and we'll work out how to wire all the hypervisors up underneath
> that interface.  But I think it's precisely the wrong way to go from an
> overall kernel perspective.

No it's not an absolute blocker, as long as we can take care, that the
number of incarnations is 

- designed to be shareable between hypervisors which have the same time
model
- common code like the 128 bit math is in a shared library
- self contained and not reaching out into core kernel code for no good
reason

Same goes for clock events, interrupts and other core facilities.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:05                                           ` Thomas Gleixner
  (?)
@ 2007-03-07 23:25                                           ` Zachary Amsden
  2007-03-07 23:36                                             ` Jeremy Fitzhardinge
                                                               ` (2 more replies)
  -1 siblings, 3 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-07 23:25 UTC (permalink / raw)
  To: tglx
  Cc: Jeremy Fitzhardinge, Virtualization Mailing List, john stultz,
	akpm, Ingo Molnar, LKML

Thomas Gleixner wrote:
> On the other hand we yet see things like:
>
>         /* We use normal irq0 handler on cpu0. */
>         time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle
> the clock event interrupt self contained. clockevents is not bound to
> IRQ0 and this kind of hackery is exactly what we need to avoid in order
> to get this maintainable.
>
> Once this is used by paravirt implementations a change to the
> mach-default implementation will break stuff left and right.
>   

We've fixed that already.  Thanks for pointing it out.  We were just 
trying to re-use code.

> Also the whole LAPIC business is so horrible, that it hurts. The generic
> interrupt layer is there since almost a year and we still see the crude
> emulation of hardware and assumptions of irq0 setup all over the place.
>
> We carefully need to define, which existing kernel interfaces are used /
> hooked in which way.
>
> If the paravirt implementations actually use the already available
> abstractions in the way in which those abstractions are designed, then
> we get into a maintainable design. If there are shortcomings on those
> abstractions we need to fix them in a sane way or provide a _common_
> workaround (e.g. 128 bit math back and forth library) without impacting
> the main kernel code.
>
> Looking at vmitimer.c and the number of hardcoded assumptions are
> telling me, that we are heading in exactly the opposite direction.
>   

No, VMI timer is unique because for SMP, it is based on the APIC.  On 
i386, SMP is hardwired to depend on the APIC, and so we simply re-use 
the pieces of it which are there, with the same assumptions about irqs, 
and hardware behavior, good or bad.  We just have a different way of 
telling the LAPIC when to deliver interrupts.

The alternative is to pretty much completely copy apic.c into vmi.c or 
vmitimer.c, which seems a rather bad idea, since now two copies of 
nearly identical code need to be maintained.

> Yes, if they are used in a sane and self contained way without reaching
> all over the place and expecting that those functions, which are not
> part of the paravirt interfaces will work for ever.
>   

But we definitely need pieces of the core APIC dependent code.  Xen 
needs pieces of it too, but very select pieces for SMP boot.  The 
ugliness you point out is there, but the reason it is there is not 
because the paravirt code is cluttered, it is because the i386 code is 
so hardwired to use the APIC model that there is pain separating from it.

The correct solution here is to properly separate the APIC, SMP, and 
timer code so the logic of it which we want to reuse is separated from 
the hardware dependence.  Clock events and clocksources take care of 
most of the timer issues, but there is still ugliness from SMP timer 
events depending on having part of the APIC infrastructure for wiring 
the interrupt gates.

> No it's not an absolute blocker, as long as we can take care, that the
> number of incarnations is 
>
> - designed to be shareable between hypervisors which have the same time
> model
> - common code like the 128 bit math is in a shared library
> - self contained and not reaching out into core kernel code for no good
> reason
>   
> Same goes for clock events, interrupts and other core facilities.

I think that is what everyone wants.  This is an iterative process.  We 
certainly don't want to reach out into core kernel code unless there is 
a good reason to do so, and with every development of clock events, 
sources, and interrupts, we have less of a reason to do so, and the code 
gets cleaner and more maintainable.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:05                                           ` Thomas Gleixner
  (?)
  (?)
@ 2007-03-07 23:33                                           ` Jeremy Fitzhardinge
  2007-03-07 23:52                                             ` Dan Hecht
  2007-03-08  0:35                                             ` Thomas Gleixner
  -1 siblings, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 23:33 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

Thomas Gleixner wrote:
> Still there is a difference between using existing kernel interfaces and
> abusing them in a way which makes modifications to the core kernel code
> hard and unmaintainable. See below.
>   

I completely agree.  "Using the kernel interfaces" doesn't mean "this
random hack happens to work", it means "use the interface as intended as
a fully-fledged client".  If the interface doesn't work for our use,
then we can negotiate with the appropriate people on how to extend it
properly.


> On the other hand we yet see things like:
>
>         /* We use normal irq0 handler on cpu0. */
>         time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle
> the clock event interrupt self contained. clockevents is not bound to
> IRQ0 and this kind of hackery is exactly what we need to avoid in order
> to get this maintainable.
>   

Yes, I'm definitely not arguing with you about this.  I think the first
cut vmi time code was pretty questionable, but I have confidence they'll
fix it up before submission.

The point is that when you put the xen and vmi implementations next to
each other you find that 1) in each case there's a pretty small
abstraction distance between the clock interface and the hypercall
interface, and 2) there's very little code which can be shared between
the two.  Which means that adding another layer of abstraction to
protect the clock code from paravirtualized time devices is just going
to add fat without much benefit.

> Yes, if they are used in a sane and self contained way without reaching
> all over the place and expecting that those functions, which are not
> part of the paravirt interfaces will work for ever.
>   

100% agree.  If the interfaces change, then we'll change the code using
them like any other kernel code would.  If the new interfaces are hard
to make work then that's a problem, but one would hope that would get
shaken out as part of the normal kernel development process.

The point is that this code under and around the paravirt_ops interface
is just normal Linux code, and we expect to participate in the normal
kernel development process, with all the usual
discussions/arguments/negotiations over interface changes.  If the code
loses all its maintainers and becomes orphaned, unresponsive to
interface changes, then it's like any other dead driver: mark it
CONFIG_BROKEN and wait for someone to fix it.  But for now and the
foreseeable future these are going to be actively supported and
maintained pieces of code.

> You are not increasing the entanglement with the rest of the system,
> when you use a self contained device on top of an existing core kernel
> infrastructure, which has a paravirt backend. Quite the contrary, you
> have one piece of virtual hardware which is connected to the kernel and
> interacts with the various incarnations on the other side, which can as
> well live inside the kernel code. Granted it is another level of
> indirection, but I'd be happy to have only to deal with one of those
> beasts.
>   

Right.  But at that point the interface doesn't really have much of a
technical basis.  It's really a political border at which you can hand
off responsibility and make it ours.  I quite understand your
motivation, but I think you're solving a problem that hasn't happened
yet, and one that we'd all like to avoid.

I know the vmi time code has coloured your view here, but I surely hope
it can be got into a better state before posting.  I'm biased of course,
but I would rather hope that all these drivers we're talking about will
be as stylistically clean as the Xen time code (which has room for
improvement, of course).

There is, however, a median solution which keeps the number of clock
drivers down but also doesn't involve extending pv_ops.  We can just
create paravirt_clocksource/paravirt_clockevent helper wrappers, with
their own internal interfaces to act as a facade for the
hypervisor-specific code.  I don't think there's much point in doing
this now, but maybe it will become appealing once we start dealing with
things like stolen time.

> No it's not an absolute blocker, as long as we can take care, that the
> number of incarnations is 
>
> - designed to be shareable between hypervisors which have the same time
> model
> - common code like the 128 bit math is in a shared library
> - self contained and not reaching out into core kernel code for no good
> reason

Yep.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:25                                           ` Zachary Amsden
@ 2007-03-07 23:36                                             ` Jeremy Fitzhardinge
  2007-03-07 23:40                                                 ` Zachary Amsden
  2007-03-08  0:22                                             ` Thomas Gleixner
  2007-03-08  9:10                                             ` hardwired VMI crap Ingo Molnar
  2 siblings, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-07 23:36 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Virtualization Mailing List, john stultz, akpm, Ingo Molnar, LKML

Zachary Amsden wrote:
> Xen needs pieces of it too, but very select pieces for SMP boot.

We do?  Send the SMP Xen code over, because I don't have it here.

Thanks,
    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:36                                             ` Jeremy Fitzhardinge
@ 2007-03-07 23:40                                                 ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-07 23:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Virtualization Mailing List, john stultz, akpm, Ingo Molnar, LKML

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> Xen needs pieces of it too, but very select pieces for SMP boot.
>>     
>
> We do?  Send the SMP Xen code over, because I don't have it here.
>   

s/do/will (smpboot.c)


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-07 23:40                                                 ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-07 23:40 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, akpm, john stultz, tglx, Ingo Molnar, LKML

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> Xen needs pieces of it too, but very select pieces for SMP boot.
>>     
>
> We do?  Send the SMP Xen code over, because I don't have it here.
>   

s/do/will (smpboot.c)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:33                                           ` + stupid-hack-to-make-mainline-build.patch added to -mm tree Jeremy Fitzhardinge
@ 2007-03-07 23:52                                             ` Dan Hecht
  2007-03-08  0:19                                                 ` Jeremy Fitzhardinge
  2007-03-08  0:35                                             ` Thomas Gleixner
  1 sibling, 1 reply; 169+ messages in thread
From: Dan Hecht @ 2007-03-07 23:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML, Dan Hecht, Zachary Amsden

On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
> I know the vmi time code has coloured your view here, but I surely hope
> it can be got into a better state before posting.  I'm biased of course,
> but I would rather hope that all these drivers we're talking about will
> be as stylistically clean as the Xen time code (which has room for
> improvement, of course).
> 

Could you send us comments on where you feel the style needs some fixing 
up?

VMI encapsulates all the implementation details away from the kernel, 
whereas the Xen time code puts it all out there in the kernel (see 
snippet below).  What happens when Xen wants to change the way it 
implements "system time"?  It looses compatibility with all existing 
kernels....

In VMI terms, the code to read "system time" from the hypervisor is this 
one-liner (it can be written in any "style" you want; the fact is, it's 
just an interface call to the VMI-layer):

vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL);

In Xen terms, the same code to accomplish that is:

/*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
	u64 product;
#ifdef __i386__
	u32 tmp1, tmp2;
#endif

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

#ifdef __i386__
	__asm__ (
		"mul  %5       ; "
		"mov  %4,%%eax ; "
		"mov  %%edx,%4 ; "
		"mul  %5       ; "
		"xor  %5,%5    ; "
		"add  %4,%%eax ; "
		"adc  %5,%%edx ; "
		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
#elif __x86_64__
	__asm__ (
		"mul %%rdx ; shrd $32,%%rdx,%%rax"
		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
#else
#error implement me!
#endif

	return product;
}

static u64 get_nsec_offset(struct shadow_time_info *shadow)
{
	u64 now, delta;
	rdtscll(now);
	delta = now - shadow->tsc_timestamp;
	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

static cycle_t xen_clocksource_read(void)
{
	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
	cycle_t ret;

	get_time_values_from_xen();

	ret = shadow->system_timestamp + get_nsec_offset(shadow);

	put_cpu_var(shadow_time);

	return ret;
}

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:52                                             ` Dan Hecht
@ 2007-03-08  0:19                                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  0:19 UTC (permalink / raw)
  To: Dan Hecht
  Cc: tglx, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML, Zachary Amsden

Dan Hecht wrote:
> On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
>> I know the vmi time code has coloured your view here, but I surely hope
>> it can be got into a better state before posting.  I'm biased of course,
>> but I would rather hope that all these drivers we're talking about will
>> be as stylistically clean as the Xen time code (which has room for
>> improvement, of course).
>>
>
> Could you send us comments on where you feel the style needs some
> fixing up?

I think Thomas has covered this in quite a bit of detail already.  But
the fact that the code mentions "apic" or "pit" at all seems
unfortunate, but I guess that's what you have to work with.

> VMI encapsulates all the implementation details away from the kernel,
> whereas the Xen time code puts it all out there in the kernel[...]

This is not an exercise in "my hypervisor is better than yours", it's a
matter of getting clean implementations within the constraints of each
hypervisor interface.  The Xen code may be more verbose than the
corresponding VMI code, but it's self-contained and doesn't make any
demands on the rest of the kernel.

The concern is that the vmi code reaches out and does things like set
global_clock_event, calls time_init_hook and so on - basically
complicating the already ugly lapic/pic legacy time mess, and therefore
making yourself part of the tangle if anyone wants to go in there and
change it.

The question is whether you can make the vmi clock implementation
free-standing, in that it has no dependencies other than well defined
interfaces like the clock api itself, the normal (non-legacy) interrupt
api and, of course, the underlying VMI interface.  But no reach-arounds
into the lapic/pit code.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  0:19                                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  0:19 UTC (permalink / raw)
  To: Dan Hecht
  Cc: tglx, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML, Zachary Amsden

Dan Hecht wrote:
> On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
>> I know the vmi time code has coloured your view here, but I surely hope
>> it can be got into a better state before posting.  I'm biased of course,
>> but I would rather hope that all these drivers we're talking about will
>> be as stylistically clean as the Xen time code (which has room for
>> improvement, of course).
>>
>
> Could you send us comments on where you feel the style needs some
> fixing up?

I think Thomas has covered this in quite a bit of detail already.  But
the fact that the code mentions "apic" or "pit" at all seems
unfortunate, but I guess that's what you have to work with.

> VMI encapsulates all the implementation details away from the kernel,
> whereas the Xen time code puts it all out there in the kernel[...]

This is not an exercise in "my hypervisor is better than yours", it's a
matter of getting clean implementations within the constraints of each
hypervisor interface.  The Xen code may be more verbose than the
corresponding VMI code, but it's self-contained and doesn't make any
demands on the rest of the kernel.

The concern is that the vmi code reaches out and does things like set
global_clock_event, calls time_init_hook and so on - basically
complicating the already ugly lapic/pic legacy time mess, and therefore
making yourself part of the tangle if anyone wants to go in there and
change it.

The question is whether you can make the vmi clock implementation
free-standing, in that it has no dependencies other than well defined
interfaces like the clock api itself, the normal (non-legacy) interrupt
api and, of course, the underlying VMI interface.  But no reach-arounds
into the lapic/pit code.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:25                                           ` Zachary Amsden
  2007-03-07 23:36                                             ` Jeremy Fitzhardinge
@ 2007-03-08  0:22                                             ` Thomas Gleixner
  2007-03-08  1:01                                                 ` Daniel Arai
  2007-03-08  9:10                                             ` hardwired VMI crap Ingo Molnar
  2 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-08  0:22 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Virtualization Mailing List, john stultz,
	akpm, Ingo Molnar, LKML

On Wed, 2007-03-07 at 15:25 -0800, Zachary Amsden wrote:
> > Looking at vmitimer.c and the number of hardcoded assumptions are
> > telling me, that we are heading in exactly the opposite direction.
> >   
> 
> No, VMI timer is unique because for SMP, it is based on the APIC.  On 
> i386, SMP is hardwired to depend on the APIC, and so we simply re-use 
> the pieces of it which are there, with the same assumptions about irqs, 
> and hardware behavior, good or bad.  We just have a different way of 
> telling the LAPIC when to deliver interrupts.

This is exactly the point. There is no benfit in reusing 3 lines of
lapic interrupt handler code and therefor reaching into it. clockevents
are not connected to lapic on SMP by any means. They are designed to be
self contained and so please use them as designed.

> The alternative is to pretty much completely copy apic.c into vmi.c or 
> vmitimer.c, which seems a rather bad idea, since now two copies of 
> nearly identical code need to be maintained.

You managed to avoid the usage of other code (i.e. PIT / HPET) already,
so why is it sooo desireable to emulate apics instead of substituting it
by a small and sane replacement ? Just because you happen to have an
LAPIC emulator ? That's no reason to wire yourself into the kernel code
and make it harder to change and maintain.

> > Yes, if they are used in a sane and self contained way without reaching
> > all over the place and expecting that those functions, which are not
> > part of the paravirt interfaces will work for ever.
> >   
> 
> But we definitely need pieces of the core APIC dependent code.  Xen 
> needs pieces of it too, but very select pieces for SMP boot.  The 
> ugliness you point out is there, but the reason it is there is not 
> because the paravirt code is cluttered, it is because the i386 code is 
> so hardwired to use the APIC model that there is pain separating from it.
>
> The correct solution here is to properly separate the APIC, SMP, and 
> timer code so the logic of it which we want to reuse is separated from 
> the hardware dependence.  Clock events and clocksources take care of 
> most of the timer issues, but there is still ugliness from SMP timer 
> events depending on having part of the APIC infrastructure for wiring 
> the interrupt gates.

Again: clockevents do not require APIC and do not depend on any APIC
wiring. Your hypervisor is working that way.

> > No it's not an absolute blocker, as long as we can take care, that the
> > number of incarnations is 
> >
> > - designed to be shareable between hypervisors which have the same time
> > model
> > - common code like the 128 bit math is in a shared library
> > - self contained and not reaching out into core kernel code for no good
> > reason
> >   
> > Same goes for clock events, interrupts and other core facilities.
> 
> I think that is what everyone wants.  This is an iterative process.  We 
> certainly don't want to reach out into core kernel code unless there is 
> a good reason to do so, and with every development of clock events, 
> sources, and interrupts, we have less of a reason to do so, and the code 
> gets cleaner and more maintainable.

We have to avoid this reachout in the first place. It just adds more
hardwires into the hairball and makes it harder to distangle. 

If you want the virtualization support in the kernel, then please
understand that we hardwire now and we'll fix it up once the core kernel
developers serve us the solution on the silver tablet is not going to
work. Please work with us on a proper solution upfront instead of
throwing random hackery with the lame excuse "for a good reason" at us.

You knew exactly, that clockevents & co are on the way to mainline and
there was enough time to work with us on a proper solution. No, you
decided to ignore it, even after people pointed it out to you way before
the 2.6.21 merge window. Now we have the hardwire in place and we can
wait for you to fix it whenever it seems to fit into the vmware business
plan.

I'm not going to accept any further reachout unless there is an urgent
bugfix in the release cycle, which does not allow a proper solution. But
be sure, that the backout patch will hit -mm immidiately.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:33                                           ` + stupid-hack-to-make-mainline-build.patch added to -mm tree Jeremy Fitzhardinge
  2007-03-07 23:52                                             ` Dan Hecht
@ 2007-03-08  0:35                                             ` Thomas Gleixner
  2007-03-08  0:38                                                 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-08  0:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

On Wed, 2007-03-07 at 15:33 -0800, Jeremy Fitzhardinge wrote:
> > On the other hand we yet see things like:
> >
> >         /* We use normal irq0 handler on cpu0. */
> >         time_init_hook();
> >
> > Which is just reaching into the kernel code directly and does not handle
> > the clock event interrupt self contained. clockevents is not bound to
> > IRQ0 and this kind of hackery is exactly what we need to avoid in order
> > to get this maintainable.
> >   
> 
> Yes, I'm definitely not arguing with you about this.  I think the first
> cut vmi time code was pretty questionable, but I have confidence they'll
> fix it up before submission.

Sigh. The cut zero hairball is already in mainline. :(

> The point is that when you put the xen and vmi implementations next to
> each other you find that 1) in each case there's a pretty small
> abstraction distance between the clock interface and the hypercall
> interface, and 2) there's very little code which can be shared between
> the two.  Which means that adding another layer of abstraction to
> protect the clock code from paravirtualized time devices is just going
> to add fat without much benefit.

Fair enough.

> > Yes, if they are used in a sane and self contained way without reaching
> > all over the place and expecting that those functions, which are not
> > part of the paravirt interfaces will work for ever.
> >   
> 
> 100% agree.  If the interfaces change, then we'll change the code using
> them like any other kernel code would.  If the new interfaces are hard
> to make work then that's a problem, but one would hope that would get
> shaken out as part of the normal kernel development process.

Sure. If the clockevent API is changed, then the users get fixed. This
is not my main concern. The "oh we reuse the PIT interrupt" reachout is
what makes life hard. VMI does this already extensive and I'm frightened
by it.

> The point is that this code under and around the paravirt_ops interface
> is just normal Linux code, and we expect to participate in the normal
> kernel development process, with all the usual
> discussions/arguments/negotiations over interface changes.  If the code
> loses all its maintainers and becomes orphaned, unresponsive to
> interface changes, then it's like any other dead driver: mark it
> CONFIG_BROKEN and wait for someone to fix it.  But for now and the
> foreseeable future these are going to be actively supported and
> maintained pieces of code.

Ack.

> > You are not increasing the entanglement with the rest of the system,
> > when you use a self contained device on top of an existing core kernel
> > infrastructure, which has a paravirt backend. Quite the contrary, you
> > have one piece of virtual hardware which is connected to the kernel and
> > interacts with the various incarnations on the other side, which can as
> > well live inside the kernel code. Granted it is another level of
> > indirection, but I'd be happy to have only to deal with one of those
> > beasts.
> >   
> 
> Right.  But at that point the interface doesn't really have much of a
> technical basis.  It's really a political border at which you can hand
> off responsibility and make it ours.  I quite understand your
> motivation, but I think you're solving a problem that hasn't happened
> yet, and one that we'd all like to avoid.

Granted.

> I know the vmi time code has coloured your view here, but I surely hope
> it can be got into a better state before posting.  I'm biased of course,
> but I would rather hope that all these drivers we're talking about will
> be as stylistically clean as the Xen time code (which has room for
> improvement, of course).
> 
> There is, however, a median solution which keeps the number of clock
> drivers down but also doesn't involve extending pv_ops.  We can just
> create paravirt_clocksource/paravirt_clockevent helper wrappers, with
> their own internal interfaces to act as a facade for the
> hypervisor-specific code.  I don't think there's much point in doing
> this now, but maybe it will become appealing once we start dealing with
> things like stolen time.

We'll see.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  0:35                                             ` Thomas Gleixner
@ 2007-03-08  0:38                                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  0:38 UTC (permalink / raw)
  To: tglx
  Cc: Dan Hecht, James Morris, Virtualization Mailing List, akpm,
	john stultz, Ingo Molnar, LKML

Thomas Gleixner wrote:
> Sigh. The cut zero hairball is already in mainline. :(
>   

Yes, there were a couple of unfortunate patches in that series, but they
got fast-tracked in with the promise they would get fixed asap.

> Sure. If the clockevent API is changed, then the users get fixed. This
> is not my main concern. The "oh we reuse the PIT interrupt" reachout is
> what makes life hard. VMI does this already extensive and I'm frightened
> by it.
>   

Well, I think they know what's expected of them now.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  0:38                                                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  0:38 UTC (permalink / raw)
  To: tglx; +Cc: Virtualization Mailing List, john stultz, akpm, Ingo Molnar, LKML

Thomas Gleixner wrote:
> Sigh. The cut zero hairball is already in mainline. :(
>   

Yes, there were a couple of unfortunate patches in that series, but they
got fast-tracked in with the promise they would get fixed asap.

> Sure. If the clockevent API is changed, then the users get fixed. This
> is not my main concern. The "oh we reuse the PIT interrupt" reachout is
> what makes life hard. VMI does this already extensive and I'm frightened
> by it.
>   

Well, I think they know what's expected of them now.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 18:35                                   ` Jeremy Fitzhardinge
@ 2007-03-08  0:45                                     ` Alan Cox
  -1 siblings, 0 replies; 169+ messages in thread
From: Alan Cox @ 2007-03-08  0:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

> Yep, the tsc has myriad problems; for Xen its the best of a bad lot. 
> Unfortunately in 10 years no clearly better alternative has appeared;
> maybe in 10 years there will be one.  It might even be the tsc.

TSC is essentially unusable for any kind of time related work. And I'd
disagree about the alternatives - the HPET and ACPI timers are not bad,
the CMOS timer can be used as an interrupting timer source, and there is
the old PC timer chip. All are superior to the TSC.

Finally for performance management work you've got cycle counters in the
debug side (with interrupt on overflow) which allow you to do management
of resources by cpu ticks or by memory bandwidth utilisation (Sun btw
have a fascinating paper somewhere on the latter)

Alan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  0:45                                     ` Alan Cox
  0 siblings, 0 replies; 169+ messages in thread
From: Alan Cox @ 2007-03-08  0:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, akpm, john stultz, LKML, Ingo Molnar, tglx

> Yep, the tsc has myriad problems; for Xen its the best of a bad lot. 
> Unfortunately in 10 years no clearly better alternative has appeared;
> maybe in 10 years there will be one.  It might even be the tsc.

TSC is essentially unusable for any kind of time related work. And I'd
disagree about the alternatives - the HPET and ACPI timers are not bad,
the CMOS timer can be used as an interrupting timer source, and there is
the old PC timer chip. All are superior to the TSC.

Finally for performance management work you've got cycle counters in the
debug side (with interrupt on overflow) which allow you to do management
of resources by cpu ticks or by memory bandwidth utilisation (Sun btw
have a fascinating paper somewhere on the latter)

Alan

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  0:22                                             ` Thomas Gleixner
@ 2007-03-08  1:01                                                 ` Daniel Arai
  0 siblings, 0 replies; 169+ messages in thread
From: Daniel Arai @ 2007-03-08  1:01 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML

Thomas Gleixner wrote:

> You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> so why is it sooo desireable to emulate apics instead of substituting it
> by a small and sane replacement ? Just because you happen to have an
> LAPIC emulator ? That's no reason to wire yourself into the kernel code
> and make it harder to change and maintain.

There are several reasons why it's desirable to emulate the APIC.  As you 
mentioned, we already have APIC emulation, and APIC emulation isn't a huge 
bottleneck on most workloads.  Our code works, the Linux code works, and 
replacing both pieces of code with something "small and sane" isn't going to 
improve performance very much, so why bother?  Any hypervisor implementation is 
going to be a tradeoff between what's easy to implement in the hypervisor, 
what's easy to implement in the guest operating system, and what's performance 
critical.

Secondly, not all (para-)virtualized operating systems will want to use 
abstracted devices.  Some virtual operating systems will be given direct access 
to hardware devices, and will need to run the actual driver for that device and 
not some abstracted device driver.  So I don't buy your argument that every 
piece of the kernel that interacts with a paravirtualized driver should have a 
"small and sane replacement."

But more importantly, we want a kernel that can run both on native hardware and 
in a paravirtualized environment.  Linux doesn't really provide abstractions for 
  replacing the appropriate code.  We tried to hook into the source code at a 
level that seemed possible.

For example, take smp_call_function().  What this essentially does is call 
send_IPI_allbutself().

void fastcall send_IPI_self(int vector)
{
         __send_IPI_shortcut(APIC_DEST_SELF, vector);
}

void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
         /*
          * Subtle. In the case of the 'never do double writes' workaround
          * we have to lock out interrupts to be safe.  As we don't care
          * of the value read we use an atomic rmw access to avoid costly
          * cli/sti.  Otherwise we use an even cheaper single atomic write
          * to the APIC.
          */
         unsigned int cfg;

         /*
          * Wait for idle.
          */
         apic_wait_icr_idle();

         /*
          * No need to touch the target chip field
          */
         cfg = __prepare_ICR(shortcut, vector);

         /*
          * Send the IPI. The write to APIC_ICR fires this off.
          */
         apic_write_around(APIC_ICR, cfg);
}


There's no good way to override __send_IPI_shortcut.  I suppose we could add 
paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
But there are dozens of functions in apic.c that would need to be included in 
paravirt ops.  And for our implementation, we really just want to override 
apic_read and apic_write, since we can make these faster when done through 
hypercalls than through memory accesses.  If we were to make these paravirt ops, 
their implementations would be the same, except with a different apic_read and 
apic_write.  This is a whole lot of useless code duplication.

Most of the interrupt system is not written in such a way that multiple APICs 
implementations can be selected from at boot time.  This is an absolute 
requirement so that the same kernel can boot on native and in a paravirtualized 
environment.  While this could be implemented, it seems like a waste of time, 
since we can just emulate something similar to a real interrupt system and not 
change things very much.

Dan Arai
VMware, Inc.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  1:01                                                 ` Daniel Arai
  0 siblings, 0 replies; 169+ messages in thread
From: Daniel Arai @ 2007-03-08  1:01 UTC (permalink / raw)
  To: tglx
  Cc: Zachary Amsden, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML

Thomas Gleixner wrote:

> You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> so why is it sooo desireable to emulate apics instead of substituting it
> by a small and sane replacement ? Just because you happen to have an
> LAPIC emulator ? That's no reason to wire yourself into the kernel code
> and make it harder to change and maintain.

There are several reasons why it's desirable to emulate the APIC.  As you 
mentioned, we already have APIC emulation, and APIC emulation isn't a huge 
bottleneck on most workloads.  Our code works, the Linux code works, and 
replacing both pieces of code with something "small and sane" isn't going to 
improve performance very much, so why bother?  Any hypervisor implementation is 
going to be a tradeoff between what's easy to implement in the hypervisor, 
what's easy to implement in the guest operating system, and what's performance 
critical.

Secondly, not all (para-)virtualized operating systems will want to use 
abstracted devices.  Some virtual operating systems will be given direct access 
to hardware devices, and will need to run the actual driver for that device and 
not some abstracted device driver.  So I don't buy your argument that every 
piece of the kernel that interacts with a paravirtualized driver should have a 
"small and sane replacement."

But more importantly, we want a kernel that can run both on native hardware and 
in a paravirtualized environment.  Linux doesn't really provide abstractions for 
  replacing the appropriate code.  We tried to hook into the source code at a 
level that seemed possible.

For example, take smp_call_function().  What this essentially does is call 
send_IPI_allbutself().

void fastcall send_IPI_self(int vector)
{
         __send_IPI_shortcut(APIC_DEST_SELF, vector);
}

void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
         /*
          * Subtle. In the case of the 'never do double writes' workaround
          * we have to lock out interrupts to be safe.  As we don't care
          * of the value read we use an atomic rmw access to avoid costly
          * cli/sti.  Otherwise we use an even cheaper single atomic write
          * to the APIC.
          */
         unsigned int cfg;

         /*
          * Wait for idle.
          */
         apic_wait_icr_idle();

         /*
          * No need to touch the target chip field
          */
         cfg = __prepare_ICR(shortcut, vector);

         /*
          * Send the IPI. The write to APIC_ICR fires this off.
          */
         apic_write_around(APIC_ICR, cfg);
}


There's no good way to override __send_IPI_shortcut.  I suppose we could add 
paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
But there are dozens of functions in apic.c that would need to be included in 
paravirt ops.  And for our implementation, we really just want to override 
apic_read and apic_write, since we can make these faster when done through 
hypercalls than through memory accesses.  If we were to make these paravirt ops, 
their implementations would be the same, except with a different apic_read and 
apic_write.  This is a whole lot of useless code duplication.

Most of the interrupt system is not written in such a way that multiple APICs 
implementations can be selected from at boot time.  This is an absolute 
requirement so that the same kernel can boot on native and in a paravirtualized 
environment.  While this could be implemented, it seems like a waste of time, 
since we can just emulate something similar to a real interrupt system and not 
change things very much.

Dan Arai
VMware, Inc.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  1:01                                                 ` Daniel Arai
@ 2007-03-08  1:23                                                   ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  1:23 UTC (permalink / raw)
  To: Daniel Arai
  Cc: tglx, Virtualization Mailing List, john stultz, Ingo Molnar, akpm, LKML

Daniel Arai wrote:
> But more importantly, we want a kernel that can run both on native hardware and 
> in a paravirtualized environment.  Linux doesn't really provide abstractions for 
>   replacing the appropriate code.  We tried to hook into the source code at a 
> level that seemed possible.
>   

Xen doesn't support any kind of apic emulation, so we'll need to hook
anything which relies on an apic.  The ipi code you quote below will
probably be one of those.

My opinion is that pv_ops shouldn't have raw apic operations, but
instead have appropriate high-level interfaces to achieve the same
ends.  Zach's counter-argument was basically your's: that the VMI code
will use a lot of the native code except for the actual apic operations.

I can live with VMI emulating apics if it wants, so long as it does it
in private and doesn't make a big scene about it.  We'll need the
high-level interfaces regardless.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  1:23                                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  1:23 UTC (permalink / raw)
  To: Daniel Arai
  Cc: Virtualization Mailing List, akpm, john stultz, tglx, Ingo Molnar, LKML

Daniel Arai wrote:
> But more importantly, we want a kernel that can run both on native hardware and 
> in a paravirtualized environment.  Linux doesn't really provide abstractions for 
>   replacing the appropriate code.  We tried to hook into the source code at a 
> level that seemed possible.
>   

Xen doesn't support any kind of apic emulation, so we'll need to hook
anything which relies on an apic.  The ipi code you quote below will
probably be one of those.

My opinion is that pv_ops shouldn't have raw apic operations, but
instead have appropriate high-level interfaces to achieve the same
ends.  Zach's counter-argument was basically your's: that the VMI code
will use a lot of the native code except for the actual apic operations.

I can live with VMI emulating apics if it wants, so long as it does it
in private and doesn't make a big scene about it.  We'll need the
high-level interfaces regardless.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  1:23                                                   ` Jeremy Fitzhardinge
  (?)
@ 2007-03-08  7:02                                                   ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-08  7:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Daniel Arai, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML

On Wed, 2007-03-07 at 17:23 -0800, Jeremy Fitzhardinge wrote:
> Daniel Arai wrote:
> > But more importantly, we want a kernel that can run both on native hardware and 
> > in a paravirtualized environment.  Linux doesn't really provide abstractions for 
> >   replacing the appropriate code.  We tried to hook into the source code at a 
> > level that seemed possible.
> >   
> 
> Xen doesn't support any kind of apic emulation, so we'll need to hook
> anything which relies on an apic.  The ipi code you quote below will
> probably be one of those.
> 
> My opinion is that pv_ops shouldn't have raw apic operations, but
> instead have appropriate high-level interfaces to achieve the same
> ends.  Zach's counter-argument was basically your's: that the VMI code
> will use a lot of the native code except for the actual apic operations.
> 
> I can live with VMI emulating apics if it wants, so long as it does it
> in private and doesn't make a big scene about it.  We'll need the
> high-level interfaces regardless.

I can't because it reaches out into non private parts of the low level
implementation and is not helping to distangle things and making the
overall code better. No it forces its own view of the world on us
without giving us anything back.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  1:01                                                 ` Daniel Arai
  (?)
  (?)
@ 2007-03-08  7:28                                                 ` Thomas Gleixner
  2007-03-08  8:01                                                     ` Zachary Amsden
  -1 siblings, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-08  7:28 UTC (permalink / raw)
  To: Daniel Arai
  Cc: Zachary Amsden, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML

On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
> Thomas Gleixner wrote:
> 
> > You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> > so why is it sooo desireable to emulate apics instead of substituting it
> > by a small and sane replacement ? Just because you happen to have an
> > LAPIC emulator ? That's no reason to wire yourself into the kernel code
> > and make it harder to change and maintain.
> 
> There are several reasons why it's desirable to emulate the APIC.  As you 
> mentioned, we already have APIC emulation, and APIC emulation isn't a huge 
> bottleneck on most workloads.  Our code works, the Linux code works, and 
> replacing both pieces of code with something "small and sane" isn't going to 
> improve performance very much, so why bother?  Any hypervisor implementation is 
> going to be a tradeoff between what's easy to implement in the hypervisor, 
> what's easy to implement in the guest operating system, and what's performance 
> critical.

It is not about performance. It is about maintainability. 

> Secondly, not all (para-)virtualized operating systems will want to use 
> abstracted devices.  Some virtual operating systems will be given direct access 
> to hardware devices, and will need to run the actual driver for that device and 
> not some abstracted device driver.  So I don't buy your argument that every 
> piece of the kernel that interacts with a paravirtualized driver should have a 
> "small and sane replacement."

Err. We talk about paravirtualized Linux and not about what you have to
emulate to get Windows running. I don't care at all. Do you really
expect that we have to accept your design decisions, just because they
allow you to make your life easy ? This is exactly what you are using
paravirt ops for: a backdoor to throw your hackery at the kernel and
leave us with the mess of hardwired crap.

> But more importantly, we want a kernel that can run both on native hardware and 
> in a paravirtualized environment.  Linux doesn't really provide abstractions for 
>   replacing the appropriate code.  We tried to hook into the source code at a 
> level that seemed possible.

Again. You just refuse to change your implementation and you want to
keep it by arguing how hard it is because there are no abstractions.

I went through the business of creating abstractions into hardwired
hairballs twice. I know exactly what I'm talking about. It _IS_ hard
work, but at the end it makes the code better and more maintainable. You
do nothing for that, but expect that we live with your addons to the
hairball.

> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
> But there are dozens of functions in apic.c that would need to be included in 
> paravirt ops.  And for our implementation, we really just want to override 
> apic_read and apic_write, since we can make these faster when done through 
> hypercalls than through memory accesses.  If we were to make these paravirt ops, 
> their implementations would be the same, except with a different apic_read and 
> apic_write.  This is a whole lot of useless code duplication.

No it is not. #include <linux/smp.h> is an abstraction and
__send_IPI ... is the i386 low level implementation.

You insist to hook yourself into the low level code instead of hooking
into the high level code, because it is _YOUR_ implementation and we
have to accept it as is.

This is the completely wrong way. We get the same crap and discussion
for every other architecture we are going to support with paravirt ops.
And probably for every other hypervisor implementation, which has a
different way of doing things.

> Most of the interrupt system is not written in such a way that multiple APICs 
> implementations can be selected from at boot time.  This is an absolute 
> requirement so that the same kernel can boot on native and in a paravirtualized 
> environment.  While this could be implemented, it seems like a waste of time, 
> since we can just emulate something similar to a real interrupt system and not 
> change things very much.

Waste of your precious time. I'm working on low level code and
abstractions and from now on I have also to take care not to break
_YOUR_ implementation. You are going to waste _MY_ time and I'm going to
fight that forever.

Your prayer wheel argument of missing abstractions and easiness of
emulating things is annoying. If you think it is better to emulate APIC,
please emulate it without paravirt ops. If you want the speed
improvement, work with us to create the interfaces and abstractions
which are necessary to have a sane, maintainable and useful for all
hypervisors implementation.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  7:28                                                 ` Thomas Gleixner
@ 2007-03-08  8:01                                                     ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08  8:01 UTC (permalink / raw)
  To: tglx
  Cc: Daniel Arai, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML, Daniel Hecht, Rusty Russell,
	Jeremy Fitzhardinge

Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
>   
>> But more importantly, we want a kernel that can run both on native hardware and 
>> in a paravirtualized environment.  Linux doesn't really provide abstractions for 
>>   replacing the appropriate code.  We tried to hook into the source code at a 
>> level that seemed possible.
>>     
>
> Again. You just refuse to change your implementation and you want to
> keep it by arguing how hard it is because there are no abstractions.
>   

It is no longer possible to change our _hypervisor_ implementation.  The 
Linux side of our code is entirely flexible, and we are trying to change 
it, but it hasn't always been clear what you want us to do.

> Your prayer wheel argument of missing abstractions and easiness of
> emulating things is annoying. If you think it is better to emulate APIC,
> please emulate it without paravirt ops. If you want the speed
> improvement, work with us to create the interfaces and abstractions
> which are necessary to have a sane, maintainable and useful for all
> hypervisors implementation.
>   

That's what we are doing.  Our prayer wheel would be easier appeased if 
you actually told us which parts of the VMI timer you objected to.  As I 
understand it now:

1) We should not call into external functions in other time sources; any 
common code should be merged up
2) We should not be using global_clock_event; it is a horrible hack 
which you want to remove
3) We should not use the smp_apic_timer_interrupt assembly code which 
calls up to the lapic timer handlers
4) We should not add our own assembly code to call out to a local timer 
handler (from Ingo)

These last two points create a conflict which is a little tricky to 
solve.  We can't add our own custom timer handler, and we can't re-use 
the APIC timer handler.  But there is no timer handler available on i386 
that works, since the handlers will fall back to either PIC or IO-APIC 
edge handling.  Using either of those for the local timer interrupt on 
SMP does not work because they assume traditional IRQ semantics - an IRQ 
raised from the bus should be serviced by one processor.  Re-raises of 
the same IRQ on remote processors are locked out by the handler, and 
dropped.  Thus simultaneous local timers firing on multiple CPUs cause 
only one to be serviced.

This does not work for local timer interrupts in NO_HZ mode, because 
they must always be serviced so that they can reschedule the next local 
timer.  I have a proposed solution to this issue, but it fails to work 
when the IO-APIC assumes control of all IRQs based on ACPI results 
(which we control, but can't change because of compatibility issues with 
other operating systems).

My proposal is to keep IRQ-0 as the timer interrupt, on all CPUs, but 
fire it from the LAPIC after local apic timers get initialized.  We 
would do this by converting the irq handler using set_irq_handler(0, 
handle_percpu_irq).  The only problem is the IO-APIC code will want to 
take over IRQ0 and convert it to an edge triggered IO-APIC interrupt.  
But for the local irq handlers to work, we have to keep them using the 
handle_percpu_irq handler, and can't let the IO-APIC steal these 
vectors.  There is no way to do conditionally for just a specific set of 
IRQs in tree today, so we would need to add a special case to io_apic.c 
to allow early boot code to reserve specific vectors so they are not 
subsumed by the IO-APIC.  This seems reasonable, but is a special case.

If, on the other hand, we are allowed to use our own assembly code to 
call out to our local timer handler (dropping constraint #4 above), we 
can simply rewire LOCAL_TIMER_VECTOR to point to this code, but now we 
must emulate the semantics of irq_enter / leave / etc inside our code, 
which is also not the cleanest solution.  We used to do this, and it 
caught flak I believe from Ingo.

The basic problem is that a local IRQ doesn't behave like a global IRQ, 
and the i386 backend is unaware of how to set up any local IRQs except 
in the case of local APIC, but you have told us we should not re-use the 
APIC handlers by overloading global_clock_event.  The patches we sent 
out recently did just this, but seemed to meet even more violence than 
our previous way of doing things.

So the question is, which approach do you prefer?

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  8:01                                                     ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08  8:01 UTC (permalink / raw)
  To: tglx
  Cc: Daniel Arai, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML, Daniel Hecht, Rusty Russell,
	Jeremy Fitzhardinge

Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
>   
>> But more importantly, we want a kernel that can run both on native hardware and 
>> in a paravirtualized environment.  Linux doesn't really provide abstractions for 
>>   replacing the appropriate code.  We tried to hook into the source code at a 
>> level that seemed possible.
>>     
>
> Again. You just refuse to change your implementation and you want to
> keep it by arguing how hard it is because there are no abstractions.
>   

It is no longer possible to change our _hypervisor_ implementation.  The 
Linux side of our code is entirely flexible, and we are trying to change 
it, but it hasn't always been clear what you want us to do.

> Your prayer wheel argument of missing abstractions and easiness of
> emulating things is annoying. If you think it is better to emulate APIC,
> please emulate it without paravirt ops. If you want the speed
> improvement, work with us to create the interfaces and abstractions
> which are necessary to have a sane, maintainable and useful for all
> hypervisors implementation.
>   

That's what we are doing.  Our prayer wheel would be easier appeased if 
you actually told us which parts of the VMI timer you objected to.  As I 
understand it now:

1) We should not call into external functions in other time sources; any 
common code should be merged up
2) We should not be using global_clock_event; it is a horrible hack 
which you want to remove
3) We should not use the smp_apic_timer_interrupt assembly code which 
calls up to the lapic timer handlers
4) We should not add our own assembly code to call out to a local timer 
handler (from Ingo)

These last two points create a conflict which is a little tricky to 
solve.  We can't add our own custom timer handler, and we can't re-use 
the APIC timer handler.  But there is no timer handler available on i386 
that works, since the handlers will fall back to either PIC or IO-APIC 
edge handling.  Using either of those for the local timer interrupt on 
SMP does not work because they assume traditional IRQ semantics - an IRQ 
raised from the bus should be serviced by one processor.  Re-raises of 
the same IRQ on remote processors are locked out by the handler, and 
dropped.  Thus simultaneous local timers firing on multiple CPUs cause 
only one to be serviced.

This does not work for local timer interrupts in NO_HZ mode, because 
they must always be serviced so that they can reschedule the next local 
timer.  I have a proposed solution to this issue, but it fails to work 
when the IO-APIC assumes control of all IRQs based on ACPI results 
(which we control, but can't change because of compatibility issues with 
other operating systems).

My proposal is to keep IRQ-0 as the timer interrupt, on all CPUs, but 
fire it from the LAPIC after local apic timers get initialized.  We 
would do this by converting the irq handler using set_irq_handler(0, 
handle_percpu_irq).  The only problem is the IO-APIC code will want to 
take over IRQ0 and convert it to an edge triggered IO-APIC interrupt.  
But for the local irq handlers to work, we have to keep them using the 
handle_percpu_irq handler, and can't let the IO-APIC steal these 
vectors.  There is no way to do conditionally for just a specific set of 
IRQs in tree today, so we would need to add a special case to io_apic.c 
to allow early boot code to reserve specific vectors so they are not 
subsumed by the IO-APIC.  This seems reasonable, but is a special case.

If, on the other hand, we are allowed to use our own assembly code to 
call out to our local timer handler (dropping constraint #4 above), we 
can simply rewire LOCAL_TIMER_VECTOR to point to this code, but now we 
must emulate the semantics of irq_enter / leave / etc inside our code, 
which is also not the cleanest solution.  We used to do this, and it 
caught flak I believe from Ingo.

The basic problem is that a local IRQ doesn't behave like a global IRQ, 
and the i386 backend is unaware of how to set up any local IRQs except 
in the case of local APIC, but you have told us we should not re-use the 
APIC handlers by overloading global_clock_event.  The patches we sent 
out recently did just this, but seemed to meet even more violence than 
our previous way of doing things.

So the question is, which approach do you prefer?

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 21:07                                   ` Jeremy Fitzhardinge
@ 2007-03-08  8:01                                     ` Ingo Molnar
  -1 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08  8:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > Your implementation is almost the perfect prototype, if you move the 
> > 128 bit hackery into the hypervisor and hide it away from the kernel 
> > :)
> 
> The point is to use the tsc to avoid making any hypercalls, so dealing 
> with the tsc->ns conversion has to happen on the guest side somehow.

you are obsessed with avoiding a hypercall, but why? Granted it's slow 
especially on things like SVN/VMX, but it's not fundamentally slow. We 
definitely do not want to design our whole APIs and abstractions around 
the temporary notion that 'hypercalls are slow'. I'd expect hypercalls 
to be put into silicon just as much as SYSENTER was put into silicon. 
Anyway, in terms of guest time code, a /big/ amount of design junk can 
be avoided by not trying to do sillynesses like 'virtual time'. The TSC 
is awfully unreliable.

really, it's a bit as if Linus looked at his 386DX CPU when he bought it 
16 years ago and decided that: "this CPU executes 16-bit code much 
faster than 32-bit code, so lets base this new toy OS on 16-bit code. 
Sure, it's a bit of a pain to use, compared to 32-bit code, but users 
demand performance!".

/THIS/ is the kind of junk we are trying to protect Linux against. 
Basically hypervisors are a way to prolong hardware legacies, and 
because unlike real hardware software ABIs dont actually burn out with 
time, and people are stubborn about using them, their effects are alot 
worse and alot longer than that of legacy hardware.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  8:01                                     ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08  8:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, tglx, akpm


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > Your implementation is almost the perfect prototype, if you move the 
> > 128 bit hackery into the hypervisor and hide it away from the kernel 
> > :)
> 
> The point is to use the tsc to avoid making any hypercalls, so dealing 
> with the tsc->ns conversion has to happen on the guest side somehow.

you are obsessed with avoiding a hypercall, but why? Granted it's slow 
especially on things like SVN/VMX, but it's not fundamentally slow. We 
definitely do not want to design our whole APIs and abstractions around 
the temporary notion that 'hypercalls are slow'. I'd expect hypercalls 
to be put into silicon just as much as SYSENTER was put into silicon. 
Anyway, in terms of guest time code, a /big/ amount of design junk can 
be avoided by not trying to do sillynesses like 'virtual time'. The TSC 
is awfully unreliable.

really, it's a bit as if Linus looked at his 386DX CPU when he bought it 
16 years ago and decided that: "this CPU executes 16-bit code much 
faster than 32-bit code, so lets base this new toy OS on 16-bit code. 
Sure, it's a bit of a pain to use, compared to 32-bit code, but users 
demand performance!".

/THIS/ is the kind of junk we are trying to protect Linux against. 
Basically hypervisors are a way to prolong hardware legacies, and 
because unlike real hardware software ABIs dont actually burn out with 
time, and people are stubborn about using them, their effects are alot 
worse and alot longer than that of legacy hardware.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  8:01                                     ` Ingo Molnar
@ 2007-03-08  8:15                                       ` Keir Fraser
  -1 siblings, 0 replies; 169+ messages in thread
From: Keir Fraser @ 2007-03-08  8:15 UTC (permalink / raw)
  To: Ingo Molnar, Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, john stultz, LKML, tglx, akpm

On 8/3/07 08:01, "Ingo Molnar" <mingo@elte.hu> wrote:

> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVN/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

If syscalls are already so fast, why does Linux have vgettimeofday()?

 -- Keir



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08  8:15                                       ` Keir Fraser
  0 siblings, 0 replies; 169+ messages in thread
From: Keir Fraser @ 2007-03-08  8:15 UTC (permalink / raw)
  To: Ingo Molnar, Jeremy Fitzhardinge
  Cc: Virtualization Mailing List, tglx, LKML, akpm, john stultz

On 8/3/07 08:01, "Ingo Molnar" <mingo@elte.hu> wrote:

> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVN/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

If syscalls are already so fast, why does Linux have vgettimeofday()?

 -- Keir

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  8:01                                     ` Ingo Molnar
  (?)
  (?)
@ 2007-03-08  8:41                                     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08  8:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, Rusty Russell, LKML, john stultz

Ingo Molnar wrote:
> you are obsessed with avoiding a hypercall, but why? Granted it's slow 
> especially on things like SVN/VMX, but it's not fundamentally slow. We 
> definitely do not want to design our whole APIs and abstractions around 
> the temporary notion that 'hypercalls are slow'.

Sure.  But the specific case we're talking about here is a 300 line
clock driver.  Nothing about its implementation has any effect on the
kernel's APIs or abstractions.

>  I'd expect hypercalls 
> to be put into silicon just as much as SYSENTER was put into silicon. 
>   
Sysenter is marginally faster than int $80, but not massively so.  I
guess Xen could use sysenter now for hypercalls, since its only useful
for getting into ring 0.

> Anyway, in terms of guest time code, a /big/ amount of design junk can 
> be avoided by not trying to do sillynesses like 'virtual time'. 
Well, if you have a hypervisor scheduler multiplexing vcpus onto a real
cpu at 100hz and a kernel scheduler multiplexing processes onto a vcpu
at 100hz, then you're going to get a lot of disappointed processes who
nominally got their 10ms real-time slice, but it was all spent on some
other vcpu.   Its important that the kernel's scheduler know how much
vcpu time each process really got, rather than basing its scheduling on
the amount of real time that passed.

> The TSC 
> is awfully unreliable.
>   
Sure.

> /THIS/ is the kind of junk we are trying to protect Linux against. 
>   

What?  That Xen happens to use the tsc as part of its hypervisor
interface?  A fact that's completely isolated from the rest of the
kernel behind the clock subsystem?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-07 23:25                                           ` Zachary Amsden
  2007-03-07 23:36                                             ` Jeremy Fitzhardinge
  2007-03-08  0:22                                             ` Thomas Gleixner
@ 2007-03-08  9:10                                             ` Ingo Molnar
  2007-03-08 10:06                                               ` Zachary Amsden
  2 siblings, 1 reply; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08  9:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds, LKML


* Zachary Amsden <zach@vmware.com> wrote:

> The correct solution here is to properly separate the APIC, SMP, and 
> timer code so the logic of it which we want to reuse is separated from 
> the hardware dependence.  Clock events and clocksources take care of 
> most of the timer issues, but there is still ugliness from SMP timer 
> events depending on having part of the APIC infrastructure for wiring 
> the interrupt gates.

what are you talking about? A clockevents driver does not need to know 
about lapic details, at all. In terms of interrupt gates for the 
hypervisor to notify about clock events - use a virtual interrupt 
controller via genirq.

if you want to use hardwired hardware details as your API: DO IT WITHOUT 
MODIFYING LINUX. If you want anything more intelligent, something more 
'paravirtual' - WORK WITH US AND WORK WITH THE OTHER HYPERVISORS. So far 
all i've seen from you was excuses and stonewalling on every step! We 
told you about the need to do VMI-timer ontop of clockevents last year 
already! You resisted virtually EVERY SINGLE cleanup suggestion since 
your stuff got upstream and you ONLY acted when a change was force-fed 
to you. Just count the number of emails you wrote, versus the patches 
you did. And your code is barely 2 weeks in! That is unacceptable.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08  9:10                                             ` hardwired VMI crap Ingo Molnar
@ 2007-03-08 10:06                                               ` Zachary Amsden
  2007-03-08 11:09                                                 ` Thomas Gleixner
  2007-03-08 18:35                                                   ` Chris Wright
  0 siblings, 2 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 10:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>   
>> The correct solution here is to properly separate the APIC, SMP, and 
>> timer code so the logic of it which we want to reuse is separated from 
>> the hardware dependence.  Clock events and clocksources take care of 
>> most of the timer issues, but there is still ugliness from SMP timer 
>> events depending on having part of the APIC infrastructure for wiring 
>> the interrupt gates.
>>     
>
> what are you talking about? A clockevents driver does not need to know 
> about lapic details, at all. In terms of interrupt gates for the 
> hypervisor to notify about clock events - use a virtual interrupt 
> controller via genirq.
>   

See my last e-mail.  It is not possible on i386, since local per-cpu 
interrupts are only supported via the APIC.

> if you want to use hardwired hardware details as your API: DO IT WITHOUT 
> MODIFYING LINUX. If you want anything more intelligent, something more 
> 'paravirtual' - WORK WITH US AND WORK WITH THE OTHER HYPERVISORS. So far 
> all i've seen from you was excuses and stonewalling on every step! We 
>   

So far, all you have done is not complain about our code until it was 
merged, the pursue every tactic possible to break it.  It is not us that 
are stonewalling.

> told you about the need to do VMI-timer ontop of clockevents last year 
> already! You resisted virtually EVERY SINGLE cleanup suggestion since 
> your stuff got upstream and you ONLY acted when a change was force-fed 
> to you. Just count the number of emails you wrote, versus the patches 
> you did. And your code is barely 2 weeks in! That is unacceptable.

Which cleanups have we resisted in particular?  I can't recall any.  
Just count the number of emails you wrote versus the patches and helpful 
suggestions you made.  No, instead, you broke our code, in many ways, 
with the untouchable aim of cleaning up the kernel source to do things 
the way you think they should be done in a future release.

Our code is in the tree now, and any attempts to break it using such 
justifications as easing maintenance for kernel developers in future 
releases are flat out false and improper.  We are working to correct 
flaws that we have and properly conform to the changing interfaces such 
as the timer subsystem, and also to interoperate properly with the full 
set of available configurations.

In the meantime, having code that uses slightly older interfaces in the 
kernel tree is not wrong in any way - it is pragmatic, because that code 
is working today, and not only that, the sanest thing to do in a release 
cycle.  And our code in the tree to be released imposes zero burden on 
anyone except for us.  Are we stopping you from rewriting the timer 
subsystem in the -rc tree?  How?  Because this code is supposed to be 
settled.  Your deliberate breaking of our code forces us to come up with 
workarounds that might be considered inappropriate, but nevertheless, 
necessary.  Who has to deal with and adapt to this?  Certainly not you.  
The burden to maintain the correctness of our code is on us.

Working together to make sure that this code completely integrates with 
all this new development is the right thing to do - in the development 
tree.  Why you insist on stopping our code in the tip kernel release 
tree is beyond me, as there is no purpose to it other than to block our 
code.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  8:01                                     ` Ingo Molnar
                                                       ` (2 preceding siblings ...)
  (?)
@ 2007-03-08 10:26                                     ` Rusty Russell
  -1 siblings, 0 replies; 169+ messages in thread
From: Rusty Russell @ 2007-03-08 10:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, tglx, Dan Hecht, Zachary Amsden, akpm, ak,
	Virtualization Mailing List, LKML, john stultz, James Morris

On Thu, 2007-03-08 at 09:01 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> 
> > > Your implementation is almost the perfect prototype, if you move the 
> > > 128 bit hackery into the hypervisor and hide it away from the kernel 
> > > :)
> > 
> > The point is to use the tsc to avoid making any hypercalls, so dealing 
> > with the tsc->ns conversion has to happen on the guest side somehow.
> 
> you are obsessed with avoiding a hypercall, but why? Granted it's slow 
> especially on things like SVN/VMX, but it's not fundamentally slow. We 
> definitely do not want to design our whole APIs and abstractions around 
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls 
> to be put into silicon just as much as SYSENTER was put into silicon. 

Indeed, I expect them to fall somewhere between system calls and context
switches.  Perhaps not slow, but definitely worth minimising.

> Anyway, in terms of guest time code, a /big/ amount of design junk can 
> be avoided by not trying to do sillynesses like 'virtual time'. The TSC 
> is awfully unreliable.

You mean stolen time?

I find this whole discussion really irritating, to be honest.  I just
want Thomas to implement the timer code for lguest, because that code
scares me...

I look forward to your patch 8)
Rusty.



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 10:06                                               ` Zachary Amsden
@ 2007-03-08 11:09                                                 ` Thomas Gleixner
  2007-03-08 20:46                                                     ` Zachary Amsden
  2007-03-08 18:35                                                   ` Chris Wright
  1 sibling, 1 reply; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-08 11:09 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, Jeremy Fitzhardinge, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Andi Kleen, Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

On Thu, 2007-03-08 at 02:06 -0800, Zachary Amsden wrote:
> >> The correct solution here is to properly separate the APIC, SMP, and 
> >> timer code so the logic of it which we want to reuse is separated from 
> >> the hardware dependence.  Clock events and clocksources take care of 
> >> most of the timer issues, but there is still ugliness from SMP timer 
> >> events depending on having part of the APIC infrastructure for wiring 
> >> the interrupt gates.
> >>     
> >
> > what are you talking about? A clockevents driver does not need to know 
> > about lapic details, at all. In terms of interrupt gates for the 
> > hypervisor to notify about clock events - use a virtual interrupt 
> > controller via genirq.
> >   
> 
> See my last e-mail.  It is not possible on i386, since local per-cpu 
> interrupts are only supported via the APIC.

It is not possible from your POV. It is possible, as we have already a
complete irq abstraction layer, which supports _ALL_ of the
requirements.

To make use of it in a maintainable way, it just needs the work of doing
a proper client for the genirq layer, which get's its interrupt injected
by the hypervisor.

genirq() does not care by which mechanism handle_percpu_irq() is called.

We provided the abstractions and you just tell us straight in the face,
that your hypervisor works that way and therefor we have to accept that
you do it that way.

It's not rocket science to implement an abstract interrupt controller,
which lets you inject per cpu or global interrupts into the generic
layer. It needs some preparatory work to distangle the boot code
assumptions from the implicit hardware, but this is a better spent time,
than another set of hackery, which you already advertised for smpboot.c

All we want you and the other hypervisor folks to do is to 

- use existing abstractions in the way they are designed
- create new ones where applicable
- break the hardwired hardware assumptions, so a sane emulation model
can be used.

This is a one time effort and of value for everyone and it is even
portable and reusable outside of i386. 

Nobody wants to prevent you of using existing facilities of the kernel
code. All we ask for is to use them in a sane way without hardwiring the
particular backend oddities all over the place.

This all is rushed in with the great promise, that it will be cleaned up
somewhere in the future, but this is not some random self contained
driver we talk about. It reaches into the guts of the kernel code and
once it is there we have to deal with it until the great promise is
fulfilled.

> > if you want to use hardwired hardware details as your API: DO IT WITHOUT 
> > MODIFYING LINUX. If you want anything more intelligent, something more 
> > 'paravirtual' - WORK WITH US AND WORK WITH THE OTHER HYPERVISORS. So far 
> > all i've seen from you was excuses and stonewalling on every step! We 
> >   
> 
> So far, all you have done is not complain about our code until it was 
> merged, the pursue every tactic possible to break it.  It is not us that 
> are stonewalling.

You have been told before. Andi asked you more than once to move to
clockevents.

I know its my fault that I just did not take enough time to look in
detail into the whole paravirt business before - mostly because I made
wrong assumptions about the design and intent of paravirt ops.

> In the meantime, having code that uses slightly older interfaces in the 
> kernel tree is not wrong in any way - it is pragmatic, because that code 
> is working today, and not only that, the sanest thing to do in a release 
> cycle.  And our code in the tree to be released imposes zero burden on 
> anyone except for us.  Are we stopping you from rewriting the timer 
> subsystem in the -rc tree?  How?  Because this code is supposed to be 
> settled.  Your deliberate breaking of our code forces us to come up with 
> workarounds that might be considered inappropriate, but nevertheless, 
> necessary.  Who has to deal with and adapt to this?  Certainly not you.  
> The burden to maintain the correctness of our code is on us.

That's just aside of reality. The reality is:

Ingo, Andi or I change code and it breaks one of the reachouts. Now this
get detected in -rc5 and gets a blocking issue. Report comes from user
and there is no time to fix it proper. The commit gets identified and it
simply falls back on us to fix it or revert the original patch.

Just look at it how it works. Dynticks break whatever crude assumption
in a driver and I have to spend my time on fixing it. We have enough
shit already there, which we break by such changes. No need to add
another steaming pile of it, which lures there to catch us.

If you can not change your hypervisor model to use a sane abstraction of
interrupts, then please emulate lapic, io_apic and everything else
_OUTSIDE_ of the kernel.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08  1:01                                                 ` Daniel Arai
                                                                   ` (2 preceding siblings ...)
  (?)
@ 2007-03-08 18:24                                                 ` Chris Wright
  2007-03-08 18:44                                                   ` Daniel Arai
  2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
  -1 siblings, 2 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 18:24 UTC (permalink / raw)
  To: Daniel Arai
  Cc: tglx, Virtualization Mailing List, john stultz, Ingo Molnar, akpm, LKML

* Daniel Arai (arai@vmware.com) wrote:
> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 

While that's basically what we did in Xen, it would make more sense to
build it into genapic which would give us one common abstraction to base
from.  We should avoid adding pv_ops when existing infrastructure exists.

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-07 23:40                                                 ` Zachary Amsden
@ 2007-03-08 18:30                                                   ` Chris Wright
  -1 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 18:30 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Virtualization Mailing List, akpm,
	john stultz, tglx, Ingo Molnar, LKML

* Zachary Amsden (zach@vmware.com) wrote:
> s/do/will (smpboot.c)

Well the current Xen mechanism rather dodges all of that (for bits like
IPI apicid).

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 18:30                                                   ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 18:30 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Virtualization Mailing List, akpm, john stultz, Ingo Molnar, tglx, LKML

* Zachary Amsden (zach@vmware.com) wrote:
> s/do/will (smpboot.c)

Well the current Xen mechanism rather dodges all of that (for bits like
IPI apicid).

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 10:06                                               ` Zachary Amsden
@ 2007-03-08 18:35                                                   ` Chris Wright
  2007-03-08 18:35                                                   ` Chris Wright
  1 sibling, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 18:35 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, john stultz, LKML, Chris Wright,
	Virtualization Mailing List, tglx, Linus Torvalds, akpm

* Zachary Amsden (zach@vmware.com) wrote:
> Our code is in the tree now, and any attempts to break it using such 
> justifications as easing maintenance for kernel developers in future 
> releases are flat out false and improper.

That's not quite accurate.  This is what Ingo was complaining about
earlier with VMI being an inhibitor to change.  Core kernel will change
and occassionally break VMI.  It's entirely reasonable, and in fact
normal, to make these changes, esp in the name of easing long term
maintenance.  There's some mutual responsibility to fix things up in
the fallout.  But, I really didn't think you disagreed with that, so
perhaps I've misunderstood the above.

> We are working to correct 
> flaws that we have and properly conform to the changing interfaces such 
> as the timer subsystem, and also to interoperate properly with the full 
> set of available configurations.

Right, so let's move on ;-)

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 18:35                                                   ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 18:35 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Chris Wright, Virtualization Mailing List, akpm, john stultz,
	Ingo Molnar, Linus Torvalds, tglx, LKML

* Zachary Amsden (zach@vmware.com) wrote:
> Our code is in the tree now, and any attempts to break it using such 
> justifications as easing maintenance for kernel developers in future 
> releases are flat out false and improper.

That's not quite accurate.  This is what Ingo was complaining about
earlier with VMI being an inhibitor to change.  Core kernel will change
and occassionally break VMI.  It's entirely reasonable, and in fact
normal, to make these changes, esp in the name of easing long term
maintenance.  There's some mutual responsibility to fix things up in
the fallout.  But, I really didn't think you disagreed with that, so
perhaps I've misunderstood the above.

> We are working to correct 
> flaws that we have and properly conform to the changing interfaces such 
> as the timer subsystem, and also to interoperate properly with the full 
> set of available configurations.

Right, so let's move on ;-)

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 18:24                                                 ` Chris Wright
@ 2007-03-08 18:44                                                   ` Daniel Arai
  2007-03-08 19:14                                                       ` Chris Wright
  2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 169+ messages in thread
From: Daniel Arai @ 2007-03-08 18:44 UTC (permalink / raw)
  To: Chris Wright
  Cc: tglx, Virtualization Mailing List, john stultz, Ingo Molnar, akpm, LKML

Chris Wright wrote:
> * Daniel Arai (arai@vmware.com) wrote:
> 
>>There's no good way to override __send_IPI_shortcut.  I suppose we could add 
>>paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
> 
> 
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from.  We should avoid adding pv_ops when existing infrastructure exists.

I agree with this.

Chris, would you like to work together on this?  I don't know what Xen's 
requirements are for the APIC interface.  Do you think we could come up with 
something that would fit both of our needs, and maybe also be usable for some of 
the subarch-specific code?

Dan.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 18:44                                                   ` Daniel Arai
@ 2007-03-08 19:14                                                       ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 19:14 UTC (permalink / raw)
  To: Daniel Arai
  Cc: Chris Wright, tglx, Virtualization Mailing List, john stultz,
	Ingo Molnar, akpm, LKML

* Daniel Arai (arai@vmware.com) wrote:
> Chris, would you like to work together on this?  I don't know what Xen's 
> requirements are for the APIC interface.  Do you think we could come up 
> with something that would fit both of our needs, and maybe also be usable 
> for some of the subarch-specific code?

Sure, we just have a pretty small genapic_xen, and then enough (hackery,
this should be sorted out) to use that genapic and have an effective
override for __send_IPI_shortcut.

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 19:14                                                       ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 19:14 UTC (permalink / raw)
  To: Daniel Arai
  Cc: Chris Wright, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

* Daniel Arai (arai@vmware.com) wrote:
> Chris, would you like to work together on this?  I don't know what Xen's 
> requirements are for the APIC interface.  Do you think we could come up 
> with something that would fit both of our needs, and maybe also be usable 
> for some of the subarch-specific code?

Sure, we just have a pretty small genapic_xen, and then enough (hackery,
this should be sorted out) to use that genapic and have an effective
override for __send_IPI_shortcut.

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 19:14                                                       ` Chris Wright
@ 2007-03-08 19:17                                                         ` Ingo Molnar
  -1 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 19:17 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, tglx, Virtualization Mailing List, john stultz, akpm, LKML


* Chris Wright <chrisw@sous-sol.org> wrote:

> > Chris, would you like to work together on this?  I don't know what 
> > Xen's requirements are for the APIC interface.  Do you think we 
> > could come up with something that would fit both of our needs, and 
> > maybe also be usable for some of the subarch-specific code?
> 
> Sure, we just have a pretty small genapic_xen, and then enough 
> (hackery, this should be sorted out) to use that genapic and have an 
> effective override for __send_IPI_shortcut.

genapic is still too lowlevel: as Thomas mentioned what we want is a 
virtual interrupt controller used by /all/ hypervisors (and mapped to 
their respective hypervisor ABIs via the backend).

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 19:17                                                         ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 19:17 UTC (permalink / raw)
  To: Chris Wright; +Cc: Virtualization Mailing List, john stultz, tglx, akpm, LKML


* Chris Wright <chrisw@sous-sol.org> wrote:

> > Chris, would you like to work together on this?  I don't know what 
> > Xen's requirements are for the APIC interface.  Do you think we 
> > could come up with something that would fit both of our needs, and 
> > maybe also be usable for some of the subarch-specific code?
> 
> Sure, we just have a pretty small genapic_xen, and then enough 
> (hackery, this should be sorted out) to use that genapic and have an 
> effective override for __send_IPI_shortcut.

genapic is still too lowlevel: as Thomas mentioned what we want is a 
virtual interrupt controller used by /all/ hypervisors (and mapped to 
their respective hypervisor ABIs via the backend).

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 18:24                                                 ` Chris Wright
  2007-03-08 18:44                                                   ` Daniel Arai
@ 2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
  2007-03-08 19:47                                                       ` Chris Wright
  2007-03-08 19:54                                                       ` Ingo Molnar
  1 sibling, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 19:42 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Chris Wright wrote:
> * Daniel Arai (arai@vmware.com) wrote:
>   
>> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
>> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
>>     
>
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from.  We should avoid adding pv_ops when existing infrastructure exists.
>   

I was looking at cutting in at a much higher level.  The interface in
<linux/smp.h> is a good match for Xen, so I was going to investigate
making pv_ops at that level and see how it falls out.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
@ 2007-03-08 19:47                                                       ` Chris Wright
  2007-03-08 19:54                                                       ` Ingo Molnar
  1 sibling, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 19:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Daniel Arai, Virtualization Mailing List, akpm,
	john stultz, tglx, Ingo Molnar, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > * Daniel Arai (arai@vmware.com) wrote:
> >   
> >> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
> >> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
> >>     
> >
> > While that's basically what we did in Xen, it would make more sense to
> > build it into genapic which would give us one common abstraction to base
> > from.  We should avoid adding pv_ops when existing infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in
> <linux/smp.h> is a good match for Xen, so I was going to investigate
> making pv_ops at that level and see how it falls out.

I agree with that, but I think that's esp. for things like create and launch
new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
as well.

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 19:47                                                       ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 19:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Virtualization Mailing List, john stultz, akpm,
	Ingo Molnar, tglx, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > * Daniel Arai (arai@vmware.com) wrote:
> >   
> >> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
> >> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
> >>     
> >
> > While that's basically what we did in Xen, it would make more sense to
> > build it into genapic which would give us one common abstraction to base
> > from.  We should avoid adding pv_ops when existing infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in
> <linux/smp.h> is a good match for Xen, so I was going to investigate
> making pv_ops at that level and see how it falls out.

I agree with that, but I think that's esp. for things like create and launch
new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
as well.

thanks,
-chris

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 19:47                                                       ` Chris Wright
  (?)
@ 2007-03-08 19:52                                                       ` Jeremy Fitzhardinge
  2007-03-08 20:10                                                         ` Chris Wright
  -1 siblings, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 19:52 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Chris Wright wrote:
> I agree with that, but I think that's esp. for things like create and launch
> new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
> as well.
>   

Well, native would fall back to using the existing arch/i386 versions of
those functions, so that's reasonably straightforward.  There'll need to
be a bit of internal rearrangement so that the Xen code can call in to
do things like set up the pda/gdt and other bits of CPU state.

I don't think IPI is especially interesting in itself, is it?   It's a
necessary mechanism to implement smp_call_function(), but Xen can do IPI
without having to invoke any of the existing apic-based IPI code.  The
other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
efficient mechanisms than IPI for that (so we'll need to make the tlb
pv_ops interface a little wider to pass down a cpuset).

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
@ 2007-03-08 19:54                                                       ` Ingo Molnar
  2007-03-08 19:54                                                       ` Ingo Molnar
  1 sibling, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 19:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Daniel Arai, Virtualization Mailing List, akpm,
	john stultz, tglx, LKML


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > While that's basically what we did in Xen, it would make more sense 
> > to build it into genapic which would give us one common abstraction 
> > to base from.  We should avoid adding pv_ops when existing 
> > infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in 
> <linux/smp.h> is a good match for Xen, so I was going to investigate 
> making pv_ops at that level and see how it falls out.

yes, yes, yes. Finally someone with a clue about APIs ;-)

Basically, we want to think about the hypercall API more like a system 
call API, not like a hardware API! There will probably still be lowlevel 
details like ptes for a long time - but even those are not quite 
necessary.

And the reason is really fundamental: those system-call alike APIs are 
going to be the /most stable ones/ over time! 'Send stuff from A to B' 
or 'notify X about event Y' is /ALOT/ more stable across hardware 
variations than 'IDTs, vectors, apics or ptes'. And that is so precisely 
because these are fundamental actions that physical matter can do, and 
those do not get changed when new silicon comes out. In that sense Xen's 
hypervisor API is saner than VMI.

the most highlevel API is what UML uses today (and it clearly overdoes 
abstraction), still i was able to get basic UML performance close to 
native performance, via extending a few Linux system calls to enable the 
management of multiple sets of pagetables (each represented by a 
separate fd) via a single hypervisor-level process, and feeding back raw 
pagefault events to the hypervisor. (that was UML's SKAS concept 
combined with sys_remap_file_pages_prot() and sys_vcpu())

Now the practical problem with UML is that nobody has tried to make an 
UML native+guest 'shared kernel image', and hence it's unusable for 
distros. But there is no conceptual problem with UML's virtualization 
model.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 19:54                                                       ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 19:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Virtualization Mailing List, john stultz, akpm, tglx, LKML


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > While that's basically what we did in Xen, it would make more sense 
> > to build it into genapic which would give us one common abstraction 
> > to base from.  We should avoid adding pv_ops when existing 
> > infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in 
> <linux/smp.h> is a good match for Xen, so I was going to investigate 
> making pv_ops at that level and see how it falls out.

yes, yes, yes. Finally someone with a clue about APIs ;-)

Basically, we want to think about the hypercall API more like a system 
call API, not like a hardware API! There will probably still be lowlevel 
details like ptes for a long time - but even those are not quite 
necessary.

And the reason is really fundamental: those system-call alike APIs are 
going to be the /most stable ones/ over time! 'Send stuff from A to B' 
or 'notify X about event Y' is /ALOT/ more stable across hardware 
variations than 'IDTs, vectors, apics or ptes'. And that is so precisely 
because these are fundamental actions that physical matter can do, and 
those do not get changed when new silicon comes out. In that sense Xen's 
hypervisor API is saner than VMI.

the most highlevel API is what UML uses today (and it clearly overdoes 
abstraction), still i was able to get basic UML performance close to 
native performance, via extending a few Linux system calls to enable the 
management of multiple sets of pagetables (each represented by a 
separate fd) via a single hypervisor-level process, and feeding back raw 
pagefault events to the hypervisor. (that was UML's SKAS concept 
combined with sys_remap_file_pages_prot() and sys_vcpu())

Now the practical problem with UML is that nobody has tried to make an 
UML native+guest 'shared kernel image', and hence it's unusable for 
distros. But there is no conceptual problem with UML's virtualization 
model.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 19:52                                                       ` Jeremy Fitzhardinge
@ 2007-03-08 20:10                                                         ` Chris Wright
  2007-03-08 20:18                                                             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 169+ messages in thread
From: Chris Wright @ 2007-03-08 20:10 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Daniel Arai, Virtualization Mailing List, akpm,
	john stultz, tglx, Ingo Molnar, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Chris Wright wrote:
> > I agree with that, but I think that's esp. for things like create and launch
> > new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
> > as well.
> >   
> 
> Well, native would fall back to using the existing arch/i386 versions of
> those functions, so that's reasonably straightforward.

It's the fact that we need to leave code in the kernel to run on native,
but also do something dynamically with that same code when running
paravirt that I'm referring to.  Xen punts on this right now by
#ifdef'ing away as happy as can be.

> There'll need to
> be a bit of internal rearrangement so that the Xen code can call in to
> do things like set up the pda/gdt and other bits of CPU state.
> 
> I don't think IPI is especially interesting in itself, is it?   It's a
> necessary mechanism to implement smp_call_function(), but Xen can do IPI
> without having to invoke any of the existing apic-based IPI code.  The
> other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
> efficient mechanisms than IPI for that (so we'll need to make the tlb
> pv_ops interface a little wider to pass down a cpuset).

No, it's not the IPI itself, it's the way it's often accessed by the rest of
the kernel (which is intertwined with genapic).  I'm happy to avoid apic
altogether since it's effectively worthless for Xen other than
integrating into the existing infrastructure.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:10                                                         ` Chris Wright
@ 2007-03-08 20:18                                                             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 20:18 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Chris Wright wrote:
> * Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> Chris Wright wrote:
>>     
>>> I agree with that, but I think that's esp. for things like create and launch
>>> new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
>>> as well.
>>>   
>>>       
>> Well, native would fall back to using the existing arch/i386 versions of
>> those functions, so that's reasonably straightforward.
>>     
>
> It's the fact that we need to leave code in the kernel to run on native,
> but also do something dynamically with that same code when running
> paravirt that I'm referring to.

Why would it be any different to all the other code we've got behind
native pvops?

The ideal simplified case is that we rename
smp_send_stop/send_reschedule/prepare_cpus/etc to native_* versions.  In
the !PARAVIRT case we just call the native_* version directly; in
PARAVIRT we call via the native pv_ops structure.  Under Xen, all these
would implemented independently from the native versions.

> No, it's not the IPI itself, it's the way it's often accessed by the rest of
> the kernel (which is intertwined with genapic).  I'm happy to avoid apic
> altogether since it's effectively worthless for Xen other than
> integrating into the existing infrastructure.
>   

I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
that's a concern, but maybe we can tease it apart in a sensible way.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 20:18                                                             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 20:18 UTC (permalink / raw)
  To: Chris Wright
  Cc: Virtualization Mailing List, tglx, john stultz, akpm, Ingo Molnar, LKML

Chris Wright wrote:
> * Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> Chris Wright wrote:
>>     
>>> I agree with that, but I think that's esp. for things like create and launch
>>> new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
>>> as well.
>>>   
>>>       
>> Well, native would fall back to using the existing arch/i386 versions of
>> those functions, so that's reasonably straightforward.
>>     
>
> It's the fact that we need to leave code in the kernel to run on native,
> but also do something dynamically with that same code when running
> paravirt that I'm referring to.

Why would it be any different to all the other code we've got behind
native pvops?

The ideal simplified case is that we rename
smp_send_stop/send_reschedule/prepare_cpus/etc to native_* versions.  In
the !PARAVIRT case we just call the native_* version directly; in
PARAVIRT we call via the native pv_ops structure.  Under Xen, all these
would implemented independently from the native versions.

> No, it's not the IPI itself, it's the way it's often accessed by the rest of
> the kernel (which is intertwined with genapic).  I'm happy to avoid apic
> altogether since it's effectively worthless for Xen other than
> integrating into the existing infrastructure.
>   

I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
that's a concern, but maybe we can tease it apart in a sensible way.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:18                                                             ` Jeremy Fitzhardinge
@ 2007-03-08 20:23                                                               ` Chris Wright
  -1 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 20:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Daniel Arai, Virtualization Mailing List, akpm,
	john stultz, tglx, Ingo Molnar, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
> that's a concern, but maybe we can tease it apart in a sensible way.

Yes, that's exactly what I'm saying.  Same with above (the native stuff), since
we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-)  Of course,
dom0 will be another can of worms, but one at a time.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 20:23                                                               ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 20:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Virtualization Mailing List, john stultz, akpm,
	Ingo Molnar, tglx, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
> that's a concern, but maybe we can tease it apart in a sensible way.

Yes, that's exactly what I'm saying.  Same with above (the native stuff), since
we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-)  Of course,
dom0 will be another can of worms, but one at a time.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:23                                                               ` Chris Wright
  (?)
@ 2007-03-08 20:33                                                               ` Jeremy Fitzhardinge
  2007-03-08 20:42                                                                   ` Chris Wright
  2007-03-08 21:45                                                                   ` Andi Kleen
  -1 siblings, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 20:33 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Chris Wright wrote:
> * Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
>> that's a concern, but maybe we can tease it apart in a sensible way.
>>     
>
> Yes, that's exactly what I'm saying.  Same with above (the native stuff), since
> we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-)  Of course,
> dom0 will be another can of worms, but one at a time.
>   

Yeah, well we're already talking about a two-level model to accomodate
VMI, since it wants the mostly native SMP stuff except for the actual
apic operations.

Maybe hooking into genapic is the right way to mop up all the uses of
send_IPI and its variants.  But from a quick grep it doesn't look like
they get called from too many places...  Most of the callers seem to be
in arch/i386/kernek/smp.c, so they should be pretty easy to isolate.

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:33                                                               ` Jeremy Fitzhardinge
@ 2007-03-08 20:42                                                                   ` Chris Wright
  2007-03-08 21:45                                                                   ` Andi Kleen
  1 sibling, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 20:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Daniel Arai, Virtualization Mailing List, akpm,
	john stultz, tglx, Ingo Molnar, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants.  But from a quick grep it doesn't look like
> they get called from too many places...  Most of the callers seem to be
> in arch/i386/kernek/smp.c, so they should be pretty easy to isolate.

Yeah, we'll see once we are crashing and debugging some code ;-)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 20:42                                                                   ` Chris Wright
  0 siblings, 0 replies; 169+ messages in thread
From: Chris Wright @ 2007-03-08 20:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Virtualization Mailing List, john stultz, akpm,
	Ingo Molnar, tglx, LKML

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants.  But from a quick grep it doesn't look like
> they get called from too many places...  Most of the callers seem to be
> in arch/i386/kernek/smp.c, so they should be pretty easy to isolate.

Yeah, we'll see once we are crashing and debugging some code ;-)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:42                                                                   ` Chris Wright
@ 2007-03-08 20:42                                                                     ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 20:42 UTC (permalink / raw)
  To: Chris Wright
  Cc: Daniel Arai, Virtualization Mailing List, akpm, john stultz,
	tglx, Ingo Molnar, LKML

Chris Wright wrote:
> * Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> Maybe hooking into genapic is the right way to mop up all the uses of
>> send_IPI and its variants.  But from a quick grep it doesn't look like
>> they get called from too many places...  Most of the callers seem to be
>> in arch/i386/kernek/smp.c, so they should be pretty easy to isolate.
>>     
>
> Yeah, we'll see once we are crashing and debugging some code ;-)
>   
It's the Linux way (tm).

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 20:42                                                                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 20:42 UTC (permalink / raw)
  To: Chris Wright
  Cc: Virtualization Mailing List, tglx, john stultz, akpm, Ingo Molnar, LKML

Chris Wright wrote:
> * Jeremy Fitzhardinge (jeremy@goop.org) wrote:
>   
>> Maybe hooking into genapic is the right way to mop up all the uses of
>> send_IPI and its variants.  But from a quick grep it doesn't look like
>> they get called from too many places...  Most of the callers seem to be
>> in arch/i386/kernek/smp.c, so they should be pretty easy to isolate.
>>     
>
> Yeah, we'll see once we are crashing and debugging some code ;-)
>   
It's the Linux way (tm).

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 11:09                                                 ` Thomas Gleixner
@ 2007-03-08 20:46                                                     ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 20:46 UTC (permalink / raw)
  To: tglx
  Cc: Ingo Molnar, Jeremy Fitzhardinge, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Andi Kleen, Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

Thomas Gleixner wrote:
> On Thu, 2007-03-08 at 02:06 -0800, Zachary Amsden wrote:
>   
>>>> The correct solution here is to properly separate the APIC, SMP, and 
>>>> timer code so the logic of it which we want to reuse is separated from 
>>>> the hardware dependence.  Clock events and clocksources take care of 
>>>> most of the timer issues, but there is still ugliness from SMP timer 
>>>> events depending on having part of the APIC infrastructure for wiring 
>>>> the interrupt gates.
>>>>     
>>>>         
>>> what are you talking about? A clockevents driver does not need to know 
>>> about lapic details, at all. In terms of interrupt gates for the 
>>> hypervisor to notify about clock events - use a virtual interrupt 
>>> controller via genirq.
>>>   
>>>       
>> See my last e-mail.  It is not possible on i386, since local per-cpu 
>> interrupts are only supported via the APIC.
>>     
>
> It is not possible from your POV. It is possible, as we have already a
> complete irq abstraction layer, which supports _ALL_ of the
> requirements.
>
> To make use of it in a maintainable way, it just needs the work of doing
> a proper client for the genirq layer, which get's its interrupt injected
> by the hypervisor.
>
> genirq() does not care by which mechanism handle_percpu_irq() is called.
>
> We provided the abstractions and you just tell us straight in the face,
> that your hypervisor works that way and therefor we have to accept that
> you do it that way.
>
> It's not rocket science to implement an abstract interrupt controller,
> which lets you inject per cpu or global interrupts into the generic
> layer. It needs some preparatory work to distangle the boot code
> assumptions from the implicit hardware, but this is a better spent time,
> than another set of hackery, which you already advertised for smpboot.c
>   

When we're about two weeks away from a product release and you are 
threatening to unmerge or block our code because we didn't create an 
abstract interrupt controller, we re-used the APIC and IO-APIC, this is 
uber rocket science.  We've been doing things this way, with public 
patches for over a year, and you've even been CC'd on some of the 
discussions.  So it is a little late to tell us - "redesign your 
hypervisor, or else.."

> All we want you and the other hypervisor folks to do is to 
>
> - use existing abstractions in the way they are designed
> - create new ones where applicable
>   

Great.
> - break the hardwired hardware assumptions, so a sane emulation model
> can be used.
>   

Why?  This is your own invention, as you think it would make life 
easier.  It doesn't - you still have real hardware to deal with, and 
your code will always be designed to operate on silicon with these 
hardwired assumptions.  Breaking away from that can actually make the 
code more complex, both in the hypervisor and in Linux.

>   
>> So far, all you have done is not complain about our code until it was 
>> merged, the pursue every tactic possible to break it.  It is not us that 
>> are stonewalling.
>>     
>
> You have been told before. Andi asked you more than once to move to
> clockevents.
>   

Which we have done.  And now you refuse to give any feedback on 
technical points, but maintain an objection to the way we have done it.

> If you can not change your hypervisor model to use a sane abstraction of
> interrupts, then please emulate lapic, io_apic and everything else
> _OUTSIDE_ of the kernel.
>   

We faithfully emulate lapic, io_apic, the pit, pic, and a normal 
interrupt subsystem.  We can't magically stop using these things because 
we have to support traditional full virtualization.  Which means any 
version of Linux, virtual interrupt controller or not, is going to boot 
up, find these things, and try to use them.  So for a paravirt kernel, 
either we have to disable each of these things in the Linux code or just 
re-use them.

So we re-use them.  We don't even change their semantics.  Where we get 
into trouble is the fact that only the lapic can deliver per-cpu timer 
IRQs, and we need to provide a better time abstraction than TSC.  So we 
need a time device, but there is no way to implement it in the 
traditional hardware model.

And I ask again for your feedback on which approach you think is correct:

1) Rewrite the interrupt subsystem of our hypervisor, making it 
incompatible with full virtualization, so that we can support an 
abstract interrupt controller with a "clean" interface
2) Reuse the same method that HPET, PIT and other time clients in i386 
use - the global_clock_event pointer which allows you to wrest control 
back from the APIC and reuse the lapic_events local clockevents.
3) Create a new low level interrupt handler for the per-cpu VMI timer 
IRQs instead of re-using the APIC handler
4) Use the irq APIs to allocate IRQ-0 as a percpu IRQ, then change the 
IO-APIC code so it can know not to convert this PIC IRQ into a IO-APIC 
edge IRQ.
5) Disable the io-apic code entirely in paravirt mode.  Rather than 
change it, merge a parallel copy of it into the VMI code
 so that we can use the 99% of the code we need, with the one bugfix for 
#4 above
6) Disable the apic code entirely in paravirt mode.  Rather than change 
it, merge a parallel copy of into the VMI code so that we can use the 
90% of the code we need, with changes to the LVT0 timer handling.
7) For SMP only, allocate a non-shared IO-APIC IRQ, then after the 
IO-APIC is initialized, magically switch this to a percpu handler and 
start delivering local timer interrupts via this IRQ.
8) Create a pie-in-the-sky single interrupt source, reserve an IDT 
vector for it (or steal the lapic timer slot), and use the irq apis to 
set it up to be handled as a per-cpu interrupt.  This actually sounds 
pretty good, to me.  The only problem is we will need to switch the 
timer IRQ from IRQ 0 to this vector when the APIC is initialized, but I 
think we already have all the machinery we need to handle that.
9) ???

This is a serious question, I would appreciate a serious response 
instead of snide comments about the crappiness of our interface and our 
code.  Which do help a little, because by process of elimination,  we 
can rule out the approaches you don't like.  But it would be more 
productive if we could carry on a traditional dialogue and I could just 
ask a question and you could answer and vice versa.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 20:46                                                     ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 20:46 UTC (permalink / raw)
  To: tglx
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	Ingo Molnar, Linus Torvalds, akpm

Thomas Gleixner wrote:
> On Thu, 2007-03-08 at 02:06 -0800, Zachary Amsden wrote:
>   
>>>> The correct solution here is to properly separate the APIC, SMP, and 
>>>> timer code so the logic of it which we want to reuse is separated from 
>>>> the hardware dependence.  Clock events and clocksources take care of 
>>>> most of the timer issues, but there is still ugliness from SMP timer 
>>>> events depending on having part of the APIC infrastructure for wiring 
>>>> the interrupt gates.
>>>>     
>>>>         
>>> what are you talking about? A clockevents driver does not need to know 
>>> about lapic details, at all. In terms of interrupt gates for the 
>>> hypervisor to notify about clock events - use a virtual interrupt 
>>> controller via genirq.
>>>   
>>>       
>> See my last e-mail.  It is not possible on i386, since local per-cpu 
>> interrupts are only supported via the APIC.
>>     
>
> It is not possible from your POV. It is possible, as we have already a
> complete irq abstraction layer, which supports _ALL_ of the
> requirements.
>
> To make use of it in a maintainable way, it just needs the work of doing
> a proper client for the genirq layer, which get's its interrupt injected
> by the hypervisor.
>
> genirq() does not care by which mechanism handle_percpu_irq() is called.
>
> We provided the abstractions and you just tell us straight in the face,
> that your hypervisor works that way and therefor we have to accept that
> you do it that way.
>
> It's not rocket science to implement an abstract interrupt controller,
> which lets you inject per cpu or global interrupts into the generic
> layer. It needs some preparatory work to distangle the boot code
> assumptions from the implicit hardware, but this is a better spent time,
> than another set of hackery, which you already advertised for smpboot.c
>   

When we're about two weeks away from a product release and you are 
threatening to unmerge or block our code because we didn't create an 
abstract interrupt controller, we re-used the APIC and IO-APIC, this is 
uber rocket science.  We've been doing things this way, with public 
patches for over a year, and you've even been CC'd on some of the 
discussions.  So it is a little late to tell us - "redesign your 
hypervisor, or else.."

> All we want you and the other hypervisor folks to do is to 
>
> - use existing abstractions in the way they are designed
> - create new ones where applicable
>   

Great.
> - break the hardwired hardware assumptions, so a sane emulation model
> can be used.
>   

Why?  This is your own invention, as you think it would make life 
easier.  It doesn't - you still have real hardware to deal with, and 
your code will always be designed to operate on silicon with these 
hardwired assumptions.  Breaking away from that can actually make the 
code more complex, both in the hypervisor and in Linux.

>   
>> So far, all you have done is not complain about our code until it was 
>> merged, the pursue every tactic possible to break it.  It is not us that 
>> are stonewalling.
>>     
>
> You have been told before. Andi asked you more than once to move to
> clockevents.
>   

Which we have done.  And now you refuse to give any feedback on 
technical points, but maintain an objection to the way we have done it.

> If you can not change your hypervisor model to use a sane abstraction of
> interrupts, then please emulate lapic, io_apic and everything else
> _OUTSIDE_ of the kernel.
>   

We faithfully emulate lapic, io_apic, the pit, pic, and a normal 
interrupt subsystem.  We can't magically stop using these things because 
we have to support traditional full virtualization.  Which means any 
version of Linux, virtual interrupt controller or not, is going to boot 
up, find these things, and try to use them.  So for a paravirt kernel, 
either we have to disable each of these things in the Linux code or just 
re-use them.

So we re-use them.  We don't even change their semantics.  Where we get 
into trouble is the fact that only the lapic can deliver per-cpu timer 
IRQs, and we need to provide a better time abstraction than TSC.  So we 
need a time device, but there is no way to implement it in the 
traditional hardware model.

And I ask again for your feedback on which approach you think is correct:

1) Rewrite the interrupt subsystem of our hypervisor, making it 
incompatible with full virtualization, so that we can support an 
abstract interrupt controller with a "clean" interface
2) Reuse the same method that HPET, PIT and other time clients in i386 
use - the global_clock_event pointer which allows you to wrest control 
back from the APIC and reuse the lapic_events local clockevents.
3) Create a new low level interrupt handler for the per-cpu VMI timer 
IRQs instead of re-using the APIC handler
4) Use the irq APIs to allocate IRQ-0 as a percpu IRQ, then change the 
IO-APIC code so it can know not to convert this PIC IRQ into a IO-APIC 
edge IRQ.
5) Disable the io-apic code entirely in paravirt mode.  Rather than 
change it, merge a parallel copy of it into the VMI code
 so that we can use the 99% of the code we need, with the one bugfix for 
#4 above
6) Disable the apic code entirely in paravirt mode.  Rather than change 
it, merge a parallel copy of into the VMI code so that we can use the 
90% of the code we need, with changes to the LVT0 timer handling.
7) For SMP only, allocate a non-shared IO-APIC IRQ, then after the 
IO-APIC is initialized, magically switch this to a percpu handler and 
start delivering local timer interrupts via this IRQ.
8) Create a pie-in-the-sky single interrupt source, reserve an IDT 
vector for it (or steal the lapic timer slot), and use the irq apis to 
set it up to be handled as a per-cpu interrupt.  This actually sounds 
pretty good, to me.  The only problem is we will need to switch the 
timer IRQ from IRQ 0 to this vector when the APIC is initialized, but I 
think we already have all the machinery we need to handle that.
9) ???

This is a serious question, I would appreciate a serious response 
instead of snide comments about the crappiness of our interface and our 
code.  Which do help a little, because by process of elimination,  we 
can rule out the approaches you don't like.  But it would be more 
productive if we could carry on a traditional dialogue and I could just 
ask a question and you could answer and vice versa.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 20:46                                                     ` Zachary Amsden
  (?)
@ 2007-03-08 21:13                                                     ` Ingo Molnar
  2007-03-08 22:17                                                       ` Zachary Amsden
  -1 siblings, 1 reply; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 21:13 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright


* Zachary Amsden <zach@vmware.com> wrote:

> When we're about two weeks away from a product release and you are 
> threatening to unmerge or block our code because we didn't create an 
> abstract interrupt controller, we re-used the APIC and IO-APIC, this 
> is uber rocket science. [...]

see my mail to you below: you've been told about the clockevents problem 
months ago, that you shouldnt hardwire PIT details and that you should 
be registering a clockevents device. You cannot credibly claim that you 
didnt know about this.

> We've been doing things this way, with public patches for over a year, 
> and you've even been CC'd on some of the discussions. [...]

i've specifically objected, numerous times - the result of which was 
that when you submitted it to lkml you didnt Cc: me ;) The VMI crap went 
in 'under the radar' via the x86_64 tree.

> [...]  So it is a little late to tell us - "redesign your hypervisor, 
> or else.."

Also, it was /you/ who claimed that paravirt_ops can take care of 
whatever design change on the Linux side - that claim is apparently 
history now and you are now claiming "there's a product on the road, we 
cannot change the hypervisor ABI"? Should i cite that email of yours 
too?

	Ingo

----------------->
Date: Fri, 5 Jan 2007 06:45:04 +0100
From: Ingo Molnar <mingo@elte.hu>
To: Zachary Amsden <zach@vmware.com>
Subject: Re: Clockevent changes in -mm tree
Cc: Thomas Gleixner <tglx@linutronix.de>, Andrew Morton <akpm@osdl.org>,
        Rusty Russell <rusty@rustcorp.com.au>

* Zachary Amsden <zach@vmware.com> wrote:

> So I'm running into some issues integrating the VMI timer code with 
> the clockevent code in the -mm tree.  Basically, my question is - are 
> clockevents now required to get the timer infrastructure to work 
> properly, and can I have multiple clockevent sources (to allow 
> overriding the PIT) that are selected at boot time?

(I've Cc:-ed Rusty too, the author of the paravirtualization patches. 
Rusty, what's your take on the VMI timer patchset of Zach?)

in any case, i dont see any fundamental problem here. The right model 
for timer paravirtualization is to notify the guest during early bootup 
that this is a paravirtual bootup. Then the guest doesnt even register 
the PIT clocksource but registers the virtual clock-events driver.

	Ingo


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 20:46                                                     ` Zachary Amsden
  (?)
  (?)
@ 2007-03-08 21:15                                                     ` Jeremy Fitzhardinge
  2007-03-08 21:34                                                         ` Ingo Molnar
  2007-03-08 22:31                                                         ` Zachary Amsden
  -1 siblings, 2 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 21:15 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Ingo Molnar, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List

Zachary Amsden wrote:
> We faithfully emulate lapic, io_apic, the pit, pic, and a normal
> interrupt subsystem.

Can you not just use the apic clock driver directly then?  Do you need
to do anything special?

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:15                                                     ` Jeremy Fitzhardinge
@ 2007-03-08 21:34                                                         ` Ingo Molnar
  2007-03-08 22:31                                                         ` Zachary Amsden
  1 sibling, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 21:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, tglx, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Zachary Amsden wrote:
> > We faithfully emulate lapic, io_apic, the pit, pic, and a normal
> > interrupt subsystem.
> 
> Can you not just use the apic clock driver directly then?  Do you need 
> to do anything special?

exactly. There are only two variants Linux wants to care about:

 - native hardware ABIs. If a hypervisor (like KVM) happens to do its
   stuff based on those, more power to them - we dont (and cannot) care.

 - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
   quite advanced on this front. This would be shared by /all/ 
   hypervisors and could occasionally be reused by other hardware 
   platforms as well. It can be morphed via small wrappers into some
   'hypervisor personality' kind of hypervisor backends, but no
   fundamental transformation happens.

what we do _NOT_ want is some mixture of 'simplified' and 'hardwired' 
native hardware access mixed with hypercalls that somehow ends up 
creating a Frankenstein mixture of 'virtual silicon', which is specified 
nowhere else but in VMWare's proprietary hypervisor source code that we 
have no way to fix and no way to even see!

it's hard enough to get native silicon support right.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 21:34                                                         ` Ingo Molnar
  0 siblings, 0 replies; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 21:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	tglx, Linus Torvalds, akpm


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Zachary Amsden wrote:
> > We faithfully emulate lapic, io_apic, the pit, pic, and a normal
> > interrupt subsystem.
> 
> Can you not just use the apic clock driver directly then?  Do you need 
> to do anything special?

exactly. There are only two variants Linux wants to care about:

 - native hardware ABIs. If a hypervisor (like KVM) happens to do its
   stuff based on those, more power to them - we dont (and cannot) care.

 - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
   quite advanced on this front. This would be shared by /all/ 
   hypervisors and could occasionally be reused by other hardware 
   platforms as well. It can be morphed via small wrappers into some
   'hypervisor personality' kind of hypervisor backends, but no
   fundamental transformation happens.

what we do _NOT_ want is some mixture of 'simplified' and 'hardwired' 
native hardware access mixed with hypercalls that somehow ends up 
creating a Frankenstein mixture of 'virtual silicon', which is specified 
nowhere else but in VMWare's proprietary hypervisor source code that we 
have no way to fix and no way to even see!

it's hard enough to get native silicon support right.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 20:46                                                     ` Zachary Amsden
@ 2007-03-08 21:39                                                       ` Andi Kleen
  -1 siblings, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-08 21:39 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Ingo Molnar, Jeremy Fitzhardinge, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

On Thursday 08 March 2007 21:46, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > On Thu, 2007-03-08 at 02:06 -0800, Zachary Amsden wrote:
> >   
> >>>> The correct solution here is to properly separate the APIC, SMP, and 
> >>>> timer code so the logic of it which we want to reuse is separated from 
> >>>> the hardware dependence.  Clock events and clocksources take care of 
> >>>> most of the timer issues, but there is still ugliness from SMP timer 
> >>>> events depending on having part of the APIC infrastructure for wiring 
> >>>> the interrupt gates.
> >>>>     
> >>>>         
> >>> what are you talking about? A clockevents driver does not need to know 
> >>> about lapic details, at all. In terms of interrupt gates for the 
> >>> hypervisor to notify about clock events - use a virtual interrupt 
> >>> controller via genirq.
> >>>   
> >>>       
> >> See my last e-mail.  It is not possible on i386, since local per-cpu 
> >> interrupts are only supported via the APIC.
> >>     
> >
> > It is not possible from your POV. It is possible, as we have already a
> > complete irq abstraction layer, which supports _ALL_ of the
> > requirements.
> >
> > To make use of it in a maintainable way, it just needs the work of doing
> > a proper client for the genirq layer, which get's its interrupt injected
> > by the hypervisor.
> >
> > genirq() does not care by which mechanism handle_percpu_irq() is called.
> >
> > We provided the abstractions and you just tell us straight in the face,
> > that your hypervisor works that way and therefor we have to accept that
> > you do it that way.
> >
> > It's not rocket science to implement an abstract interrupt controller,
> > which lets you inject per cpu or global interrupts into the generic
> > layer. It needs some preparatory work to distangle the boot code
> > assumptions from the implicit hardware, but this is a better spent time,
> > than another set of hackery, which you already advertised for smpboot.c
> >   
> 
> When we're about two weeks away from a product release and you are 
> threatening to unmerge or block our code because we didn't create an 
> abstract interrupt controller,

At least in Linux we don't really work with deadlines; if there 
are issues they need to be fixed even if it takes longer. I don't 
expect the version in .21 to be really usable anyways; it is clearly
still in development.

> we re-used the APIC and IO-APIC, this is  
> uber rocket science.  We've been doing things this way, with public 
> patches for over a year, and you've even been CC'd on some of the 
> discussions.  So it is a little late to tell us - "redesign your 
> hypervisor, or else.."

It shouldn't touch the hypervisor, just the paravirt VMI backend shouldn't it?
I assume you could do a very minimal APIC layer that is just enough to 
talk to your softapic and a genapic backend for IPIs.

At least I would welcome anything that shrinks the number of 
paravirt hooks.

I'm just not sure it would be less hooks: you would probably need
functions to start other CPUs at least.
 
I must admit I also didn't quite get what was the big problem with
hooking apic_read/apic_write.

For the timer you just need to use a own exclusive 
clocksource that never touches PIT.

> We faithfully emulate lapic, io_apic, the pit, pic, and a normal 
> interrupt subsystem. We can't magically stop using these things because  
> we have to support traditional full virtualization.  Which means any 
> version of Linux, virtual interrupt controller or not, is going to boot 
> up, find these things, and try to use them.  So for a paravirt kernel, 
> either we have to disable each of these things in the Linux code or just 
> re-use them.

If you don't enable them they should be already disabled as default 
state, shouldn't they? 

With an own custom clocksource and possible own APIC layer nobody
would ever enable the APICs.

> 1) Rewrite the interrupt subsystem of our hypervisor, making it 
> incompatible with full virtualization, so that we can support an 
> abstract interrupt controller with a "clean" interface

What do you mean with rewrite? It's quite easy to add a new
backend to the generic IRQ code. They aren't a lot of code.

> 2) Reuse the same method that HPET, PIT and other time clients in i386 
> use - the global_clock_event pointer which allows you to wrest control 
> back from the APIC and reuse the lapic_events local clockevents.

If you used an own APIC layer APIC would never get control at all.

> 3) Create a new low level interrupt handler for the per-cpu VMI timer 
> IRQs instead of re-using the APIC handler
> 4) Use the irq APIs to allocate IRQ-0 as a percpu IRQ, then change the 
> IO-APIC code so it can know not to convert this PIC IRQ into a IO-APIC 
> edge IRQ.
> 5) Disable the io-apic code entirely in paravirt mode.  Rather than 
> change it, merge a parallel copy of it into the VMI code
>  so that we can use the 99% of the code we need, with the one bugfix for 
> #4 above

You could probably do a much simpler version, couldn't you? A lot of 
the stuff in apic.c/io_apic.c shouldn't be needed for a clean virtual
interface. But yes it would probably be still a lot of code.

Still (2) is probably best for now, but the other alternatives
are not as ridiculous as you paint them.


-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 21:39                                                       ` Andi Kleen
  0 siblings, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-08 21:39 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	tglx, akpm, Linus Torvalds, Ingo Molnar

On Thursday 08 March 2007 21:46, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > On Thu, 2007-03-08 at 02:06 -0800, Zachary Amsden wrote:
> >   
> >>>> The correct solution here is to properly separate the APIC, SMP, and 
> >>>> timer code so the logic of it which we want to reuse is separated from 
> >>>> the hardware dependence.  Clock events and clocksources take care of 
> >>>> most of the timer issues, but there is still ugliness from SMP timer 
> >>>> events depending on having part of the APIC infrastructure for wiring 
> >>>> the interrupt gates.
> >>>>     
> >>>>         
> >>> what are you talking about? A clockevents driver does not need to know 
> >>> about lapic details, at all. In terms of interrupt gates for the 
> >>> hypervisor to notify about clock events - use a virtual interrupt 
> >>> controller via genirq.
> >>>   
> >>>       
> >> See my last e-mail.  It is not possible on i386, since local per-cpu 
> >> interrupts are only supported via the APIC.
> >>     
> >
> > It is not possible from your POV. It is possible, as we have already a
> > complete irq abstraction layer, which supports _ALL_ of the
> > requirements.
> >
> > To make use of it in a maintainable way, it just needs the work of doing
> > a proper client for the genirq layer, which get's its interrupt injected
> > by the hypervisor.
> >
> > genirq() does not care by which mechanism handle_percpu_irq() is called.
> >
> > We provided the abstractions and you just tell us straight in the face,
> > that your hypervisor works that way and therefor we have to accept that
> > you do it that way.
> >
> > It's not rocket science to implement an abstract interrupt controller,
> > which lets you inject per cpu or global interrupts into the generic
> > layer. It needs some preparatory work to distangle the boot code
> > assumptions from the implicit hardware, but this is a better spent time,
> > than another set of hackery, which you already advertised for smpboot.c
> >   
> 
> When we're about two weeks away from a product release and you are 
> threatening to unmerge or block our code because we didn't create an 
> abstract interrupt controller,

At least in Linux we don't really work with deadlines; if there 
are issues they need to be fixed even if it takes longer. I don't 
expect the version in .21 to be really usable anyways; it is clearly
still in development.

> we re-used the APIC and IO-APIC, this is  
> uber rocket science.  We've been doing things this way, with public 
> patches for over a year, and you've even been CC'd on some of the 
> discussions.  So it is a little late to tell us - "redesign your 
> hypervisor, or else.."

It shouldn't touch the hypervisor, just the paravirt VMI backend shouldn't it?
I assume you could do a very minimal APIC layer that is just enough to 
talk to your softapic and a genapic backend for IPIs.

At least I would welcome anything that shrinks the number of 
paravirt hooks.

I'm just not sure it would be less hooks: you would probably need
functions to start other CPUs at least.
 
I must admit I also didn't quite get what was the big problem with
hooking apic_read/apic_write.

For the timer you just need to use a own exclusive 
clocksource that never touches PIT.

> We faithfully emulate lapic, io_apic, the pit, pic, and a normal 
> interrupt subsystem. We can't magically stop using these things because  
> we have to support traditional full virtualization.  Which means any 
> version of Linux, virtual interrupt controller or not, is going to boot 
> up, find these things, and try to use them.  So for a paravirt kernel, 
> either we have to disable each of these things in the Linux code or just 
> re-use them.

If you don't enable them they should be already disabled as default 
state, shouldn't they? 

With an own custom clocksource and possible own APIC layer nobody
would ever enable the APICs.

> 1) Rewrite the interrupt subsystem of our hypervisor, making it 
> incompatible with full virtualization, so that we can support an 
> abstract interrupt controller with a "clean" interface

What do you mean with rewrite? It's quite easy to add a new
backend to the generic IRQ code. They aren't a lot of code.

> 2) Reuse the same method that HPET, PIT and other time clients in i386 
> use - the global_clock_event pointer which allows you to wrest control 
> back from the APIC and reuse the lapic_events local clockevents.

If you used an own APIC layer APIC would never get control at all.

> 3) Create a new low level interrupt handler for the per-cpu VMI timer 
> IRQs instead of re-using the APIC handler
> 4) Use the irq APIs to allocate IRQ-0 as a percpu IRQ, then change the 
> IO-APIC code so it can know not to convert this PIC IRQ into a IO-APIC 
> edge IRQ.
> 5) Disable the io-apic code entirely in paravirt mode.  Rather than 
> change it, merge a parallel copy of it into the VMI code
>  so that we can use the 99% of the code we need, with the one bugfix for 
> #4 above

You could probably do a much simpler version, couldn't you? A lot of 
the stuff in apic.c/io_apic.c shouldn't be needed for a clean virtual
interface. But yes it would probably be still a lot of code.

Still (2) is probably best for now, but the other alternatives
are not as ridiculous as you paint them.


-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:34                                                         ` Ingo Molnar
  (?)
@ 2007-03-08 21:43                                                         ` Andi Kleen
  2007-03-08 22:30                                                           ` Ingo Molnar
  -1 siblings, 1 reply; 169+ messages in thread
From: Andi Kleen @ 2007-03-08 21:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, tglx, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List


> what we do _NOT_ want is some mixture of 'simplified' and 'hardwired' 
> native hardware access mixed with hypercalls that somehow ends up 
> creating a Frankenstein mixture of 'virtual silicon', is specified 
> nowhere else but in VMWare's proprietary hypervisor source code that we 
> have no way to fix and no way to even see!

Hmm, but we already drive the "vmware silicon" quite successfully with
fully virtualized kernels. And apparently the VMI version is the same,
just with some short cuts. Are you just worried about the ->apic_write()
hooks or about something else too?

-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
  2007-03-08 20:33                                                               ` Jeremy Fitzhardinge
@ 2007-03-08 21:45                                                                   ` Andi Kleen
  2007-03-08 21:45                                                                   ` Andi Kleen
  1 sibling, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-08 21:45 UTC (permalink / raw)
  To: virtualization
  Cc: Jeremy Fitzhardinge, Chris Wright, tglx, john stultz, akpm,
	Ingo Molnar, LKML


> 
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants. 

It is.  More hooks in this are wouldn't be appreciated.

-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
@ 2007-03-08 21:45                                                                   ` Andi Kleen
  0 siblings, 0 replies; 169+ messages in thread
From: Andi Kleen @ 2007-03-08 21:45 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, akpm, john stultz, Ingo Molnar, tglx, LKML


> 
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants. 

It is.  More hooks in this are wouldn't be appreciated.

-Andi

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:13                                                     ` Ingo Molnar
@ 2007-03-08 22:17                                                       ` Zachary Amsden
  2007-03-08 22:33                                                         ` Ingo Molnar
  0 siblings, 1 reply; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright

Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>   
>> When we're about two weeks away from a product release and you are 
>> threatening to unmerge or block our code because we didn't create an 
>> abstract interrupt controller, we re-used the APIC and IO-APIC, this 
>> is uber rocket science. [...]
>>     
>
> see my mail to you below: you've been told about the clockevents problem 
> months ago, that you shouldnt hardwire PIT details and that you should 
> be registering a clockevents device. You cannot credibly claim that you 
> didnt know about this.
>   

I am claiming no such thing.  My claim is that nobody ever said, well 
unless you you clockevents, we're going to break your code, then nack 
any possible way to fix it, and now for spite, since you are in the 
kernel tree, we're going to nack any attempt to use clockevents.

It was our plan to convert to using clockevents all along.  It was never 
said that this was such a huge, showstopping issue, and so we didn't see 
any reason to change the timer code any further for 2.6.21, specifically 
because the integration with hrtimers caused so much pain and debugging 
for us.  Our code was working fine, then clocksources came along, and we 
had to change.  Then clockevents came along, had bugs of its own to work 
out, and caused a huge amount of grief and debugging for us.  So when we 
had something working, we drew the line and figured we could make the 
leap to CE in the next kernel.

>> We've been doing things this way, with public patches for over a year, 
>> and you've even been CC'd on some of the discussions. [...]
>>     
>
> i've specifically objected, numerous times - the result of which was 
> that when you submitted it to lkml you didnt Cc: me ;) The VMI crap went 
> in 'under the radar' via the x86_64 tree.
>
>   
>> [...]  So it is a little late to tell us - "redesign your hypervisor, 
>> or else.."
>>     
>
> Also, it was /you/ who claimed that paravirt_ops can take care of 
> whatever design change on the Linux side - that claim is apparently 
> history now and you are now claiming "there's a product on the road, we 
> cannot change the hypervisor ABI"? Should i cite that email of yours 
> too?
>   

Ingo, either you or Thomas have vetoed every attempt we have made to 
make our code operate with clockevents.  There are serious platform 
issues here that make this difficult, no matter how many nice, well 
designed, abstract, higher-level kernel interfaces we have to work with, 
we have to work around platform code which makes the wrong assumptions.

Citing already established facts doesn't do anything productive.  Can I 
please get some feedback on the design choices I have proposed for how 
to integrate VMI timer?

Thanks,

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:43                                                         ` Andi Kleen
@ 2007-03-08 22:30                                                           ` Ingo Molnar
  2007-03-08 22:36                                                             ` Zachary Amsden
  0 siblings, 1 reply; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 22:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jeremy Fitzhardinge, Zachary Amsden, tglx, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Daniel Hecht, Daniel Arai, Chris Wright


* Andi Kleen <ak@suse.de> wrote:

> > what we do _NOT_ want is some mixture of 'simplified' and 
> > 'hardwired' native hardware access mixed with hypercalls that 
> > somehow ends up creating a Frankenstein mixture of 'virtual 
> > silicon', is specified nowhere else but in VMWare's proprietary 
> > hypervisor source code that we have no way to fix and no way to even 
> > see!
> 
> Hmm, but we already drive the "vmware silicon" quite successfully with 
> fully virtualized kernels. [...]

that is exactly what i listed as the first variant we want to support:

> >  - native hardware ABIs. If a hypervisor (like KVM) happens to do
> >    its stuff based on those, more power to them - we dont (and 
> >    cannot) care.

the "vmware silicon" you are talking about /is/ the native hardware ABI! 
I never had any problems with /that/.

> [...] And apparently the VMI version is the same, just with some short 
> cuts. Are you just worried about the ->apic_write() hooks or about 
> something else too?

i'm worried about those "shot cuts" (which in essence create software 
variants of silicon), the hooks, the hardwirings combined with the 
hypervisor-side ABIs creating a rigid mess that is harmful to Linux. 
paravirt_ops and the hooks gives a license for all hypervisor 'backends' 
to deviate into random arbitrary directions and to create all their 
separate 'virtual silicon' playgrounds with no regard to Linux 
maintainability. And once this gets released, Linux has no choice but to 
play along.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:15                                                     ` Jeremy Fitzhardinge
@ 2007-03-08 22:31                                                         ` Zachary Amsden
  2007-03-08 22:31                                                         ` Zachary Amsden
  1 sibling, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: tglx, Ingo Molnar, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> We faithfully emulate lapic, io_apic, the pit, pic, and a normal
>> interrupt subsystem.
>>     
>
> Can you not just use the apic clock driver directly then?  Do you need
> to do anything special?
>   

The apic clock driver is going to program the apic, not our virtual 
clock, which tracks real time, available and stolen time.  APIC has no 
idea about any of this.

We want an APIC, just purely as a local interrupt controller, and a 
separate, paravirt timer device.  It is not very complicated.

The problem is that we are not supposed to fiddle with APIC internals, 
so we shouldn't be using their clock events, so we can't use their low 
level local timer interrupt, and we are not supposed to write our own.  
So we need a weird hack somewhere because we are not allowed to use any 
of the paths that currently work for our code, because of the 
maintenance burden it imposes by misusing the interfaces.

The larger problem I am having is that nobody wants to give productive 
feedback on how to fix the problem, just sit around and fling muck.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 22:31                                                         ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	Ingo Molnar, akpm, Linus Torvalds, tglx

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>   
>> We faithfully emulate lapic, io_apic, the pit, pic, and a normal
>> interrupt subsystem.
>>     
>
> Can you not just use the apic clock driver directly then?  Do you need
> to do anything special?
>   

The apic clock driver is going to program the apic, not our virtual 
clock, which tracks real time, available and stolen time.  APIC has no 
idea about any of this.

We want an APIC, just purely as a local interrupt controller, and a 
separate, paravirt timer device.  It is not very complicated.

The problem is that we are not supposed to fiddle with APIC internals, 
so we shouldn't be using their clock events, so we can't use their low 
level local timer interrupt, and we are not supposed to write our own.  
So we need a weird hack somewhere because we are not allowed to use any 
of the paths that currently work for our code, because of the 
maintenance burden it imposes by misusing the interfaces.

The larger problem I am having is that nobody wants to give productive 
feedback on how to fix the problem, just sit around and fling muck.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 22:17                                                       ` Zachary Amsden
@ 2007-03-08 22:33                                                         ` Ingo Molnar
  2007-03-08 22:39                                                           ` Zachary Amsden
  0 siblings, 1 reply; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 22:33 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright


* Zachary Amsden <zach@vmware.com> wrote:

> Ingo, either you or Thomas have vetoed every attempt we have made to 
> make our code operate with clockevents. [...]

this is news to me - do you have any proof of such a veto?

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 22:30                                                           ` Ingo Molnar
@ 2007-03-08 22:36                                                             ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Jeremy Fitzhardinge, tglx, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Daniel Hecht, Daniel Arai, Chris Wright

Ingo Molnar wrote:
>   
>> [...] And apparently the VMI version is the same, just with some short 
>> cuts. Are you just worried about the ->apic_write() hooks or about 
>> something else too?
>>     
>
> i'm worried about those "shot cuts" (which in essence create software 
> variants of silicon), the hooks, the hardwirings combined with the 
> hypervisor-side ABIs creating a rigid mess that is harmful to Linux. 
> paravirt_ops and the hooks gives a license for all hypervisor 'backends' 
> to deviate into random arbitrary directions and to create all their 
> separate 'virtual silicon' playgrounds with no regard to Linux 
> maintainability. And once this gets released, Linux has no choice but to 
> play along.
>   

What?  And creating a high level API which allows you to implement 
totally random silicon with oodles of quirks and obfuscated 
implementation requirements creates more maintainability how?  We tried 
to stay as close as possible to the hardware ABI on purpose, 
specifically so we don't need to introduce 100 new concepts into the kernel.

Zaccch

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 22:33                                                         ` Ingo Molnar
@ 2007-03-08 22:39                                                           ` Zachary Amsden
  2007-03-16 10:12                                                             ` Pavel Machek
  0 siblings, 1 reply; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright

Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>   
>> Ingo, either you or Thomas have vetoed every attempt we have made to 
>> make our code operate with clockevents. [...]
>>     
>
> this is news to me - do you have any proof of such a veto?
>   

Yes, your refusal to discuss any technical details when asked point 
blank which solution you prefer, and your continued whining and 
threatening to unmerge our code.

And I ask again for your feedback on which approach you think is correct:

1) Rewrite the interrupt subsystem of our hypervisor, making it 
incompatible with full virtualization, so that we can support an 
abstract interrupt controller with a "clean" interface
2) Reuse the same method that HPET, PIT and other time clients in i386 
use - the global_clock_event pointer which allows you to wrest control 
back from the APIC and reuse the lapic_events local clockevents.
3) Create a new low level interrupt handler for the per-cpu VMI timer 
IRQs instead of re-using the APIC handler
4) Use the irq APIs to allocate IRQ-0 as a percpu IRQ, then change the 
IO-APIC code so it can know not to convert this PIC IRQ into a IO-APIC 
edge IRQ.
5) Disable the io-apic code entirely in paravirt mode.  Rather than 
change it, merge a parallel copy of it into the VMI code
so that we can use the 99% of the code we need, with the one bugfix for 
#4 above
6) Disable the apic code entirely in paravirt mode.  Rather than change 
it, merge a parallel copy of into the VMI code so that we can use the 
90% of the code we need, with changes to the LVT0 timer handling.
7) For SMP only, allocate a non-shared IO-APIC IRQ, then after the 
IO-APIC is initialized, magically switch this to a percpu handler and 
start delivering local timer interrupts via this IRQ.
8) Create a pie-in-the-sky single interrupt source, reserve an IDT 
vector for it (or steal the lapic timer slot), and use the irq apis to 
set it up to be handled as a per-cpu interrupt.  This actually sounds 
pretty good, to me.  The only problem is we will need to switch the 
timer IRQ from IRQ 0 to this vector when the APIC is initialized, but I 
think we already have all the machinery we need to handle that.
9) ???

Zach


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 20:46                                                     ` Zachary Amsden
                                                                       ` (3 preceding siblings ...)
  (?)
@ 2007-03-08 22:42                                                     ` Ingo Molnar
  2007-03-08 23:39                                                       ` Zachary Amsden
  -1 siblings, 1 reply; 169+ messages in thread
From: Ingo Molnar @ 2007-03-08 22:42 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright


* Zachary Amsden <zach@vmware.com> wrote:

> [...]  So it is a little late to tell us - "redesign your hypervisor, 
> or else.."

is this how long the "paravirt_ops hides all the details and the VMI 
hypervisor ABI will never hinder Linux" sham lasted? Now that your stuff 
is upstream barely 2 weeks you say that it needs a redesign of your 
hypervisor to implement our suggestions? And your argument is: "oops, 
we've got product plans, too late for that, sorry"?

_we_ have to live with that mess for years, and part of our job is to 
say 'NO' when we see mess coming up. And no, it's not our job to solve 
it for you.

	Ingo

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:39                                                       ` Andi Kleen
  (?)
@ 2007-03-08 22:58                                                       ` Zachary Amsden
  -1 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 22:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: tglx, Ingo Molnar, Jeremy Fitzhardinge, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

Andi Kleen wrote:

> At least in Linux we don't really work with deadlines; if there 
> are issues they need to be fixed even if it takes longer. I don't 
> expect the version in .21 to be really usable anyways; it is clearly
> still in development.
>   

It was working, and I expect to have it working again.  It is not in 
development, but we urgently need to find a way to fix the problems 
created when Ingo hobbled it by removing NO_IDLE_HZ code from 2.6.21..

>   
>> we re-used the APIC and IO-APIC, this is  
>> uber rocket science.  We've been doing things this way, with public 
>> patches for over a year, and you've even been CC'd on some of the 
>> discussions.  So it is a little late to tell us - "redesign your 
>> hypervisor, or else.."
>>     
>
> It shouldn't touch the hypervisor, just the paravirt VMI backend shouldn't it?
> I assume you could do a very minimal APIC layer that is just enough to 
> talk to your softapic and a genapic backend for IPIs.
>
> At least I would welcome anything that shrinks the number of 
> paravirt hooks.
>
> I'm just not sure it would be less hooks: you would probably need
> functions to start other CPUs at least.
>   
 
Anything that attempts to create this uber multi-virtual interrupt / 
timer / IPI / clock management beast is going to add a huge number of 
paravirt hooks, because the vendor backends will be different for all of 
these.

>  
> I must admit I also didn't quite get what was the big problem with
> hooking apic_read/apic_write.
>   

You mean why we need them?  They make APIC writes faster, since 
otherwise they would trap and emulate, which is slow, and APIC is on 
critical paths.  Or why people object to them?  I don't get the latter 
either.

> For the timer you just need to use a own exclusive 
> clocksource that never touches PIT.
>   

We have that working fine.  It is getting the clock event to work 
independently from the lapic timer that is difficult because of the i386 
backend.

>   
>> We faithfully emulate lapic, io_apic, the pit, pic, and a normal 
>> interrupt subsystem. We can't magically stop using these things because  
>> we have to support traditional full virtualization.  Which means any 
>> version of Linux, virtual interrupt controller or not, is going to boot 
>> up, find these things, and try to use them.  So for a paravirt kernel, 
>> either we have to disable each of these things in the Linux code or just 
>> re-use them.
>>     
>
> If you don't enable them they should be already disabled as default 
> state, shouldn't they? 
>
> With an own custom clocksource and possible own APIC layer nobody
> would ever enable the APICs.
>   

But we enable and use them, in both full-virt, and paravirt mode.  So we 
really would need to duplicate the code, almost exactly for our "virtual 
interrupt controller", which would really just be a wrapper on top of a 
nearly identical APIC or IO-APIC implementation.

>> 1) Rewrite the interrupt subsystem of our hypervisor, making it 
>> incompatible with full virtualization, so that we can support an 
>> abstract interrupt controller with a "clean" interface
>>     
>
> What do you mean with rewrite? It's quite easy to add a new
> backend to the generic IRQ code. They aren't a lot of code.
>   

Yes, but we would then need to duplicate the APIC or IO-APIC 
implementation, because that is the hardware we emulate and use.  We 
just want a different way to fire local timers, that is all.

> You could probably do a much simpler version, couldn't you? A lot of 
> the stuff in apic.c/io_apic.c shouldn't be needed for a clean virtual
> interface. But yes it would probably be still a lot of code.
>   

Yes, we could do a cleaner simpler version.  But then we need to write 
this new interrupt controller code for both the hypervisor and for 
Linux.  And the fact that it is cleaner doesn't make it any nicer or 
perform any better - it is just another dependency between the kernel 
and hypervisor that then becomes hard to change.  So we would rather 
stay as close to the hardware design as possible.

> Still (2) is probably best for now, but the other alternatives
> are not as ridiculous as you paint them.
>   

We have (2) working.  But Thomas apparently hated it.  The idea I have 
about a single-IRQ source interrupt controller for timers seems pretty 
nice, and does almost exactly encapsulate the one difference we have 
from standard APIC / IO-APIC hardware - a different way to drive local 
timers.

Thanks for your feedback,

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 22:42                                                     ` Ingo Molnar
@ 2007-03-08 23:39                                                       ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 23:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Jeremy Fitzhardinge, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright

Ingo Molnar wrote:
> * Zachary Amsden <zach@vmware.com> wrote:
>
>   
>> [...]  So it is a little late to tell us - "redesign your hypervisor, 
>> or else.."
>>     
>
> is this how long the "paravirt_ops hides all the details and the VMI 
> hypervisor ABI will never hinder Linux" sham lasted? Now that your stuff 
> is upstream barely 2 weeks you say that it needs a redesign of your 
> hypervisor to implement our suggestions? And your argument is: "oops, 
> we've got product plans, too late for that, sorry"?
>   

Clever way to misconstrue my point.  If I took the same tack with your 
espousal of hypervisor API philosophy and cited all the different 
opinions you've been spewing lately, I think we could probably make a 
strong argument that VMI should be the model used by all hypervisors and 
should support any frankenstein combination of virtual and traditional 
hardware, because it should work for all types of emulated silicon.

We don't need to redesign our hypervisor, but that is what a lot of 
people seem to want us to do.  And there is no reason nor is there time 
to do it.

> _we_ have to live with that mess for years, and part of our job is to 
> say 'NO' when we see mess coming up. And no, it's not our job to solve 
> it for you.

You don't   have to live with any mess.  If you change the kernel 
interfaces to clockevents to pass around XML based time encodings, or 
you completely rewrite the way APIC and IO-APIC interact with the 
interrupt subsystem, and this breaks our code, there is a shared 
responsibility to make things work, but we are actively maintaining the 
code and will continue to do so.  If at some point we don't, the code 
gets marked broken, and eventually deprecated.  And there is no mess for 
you to live with.

We just want a way to work with the in-kernel interfaces that is blessed 
and correct, and that is where you could give feedback, but instead you 
refuse to delve into technical details of our proposed solutions, while 
blasting and breaking all of our current code.   We're trying to do the 
right thing and get feedback and design this thing right with the 
community, and all you can offer is violence and non-productive 
criticism - "nack, this is crap" is not a good answer.

So we'll just randomly try all of the proposed solutions until we find 
one that isn't nacked, which is a waste of everybody's time, but 
hopefully finds the correct solution.  Or you could just look at the 
solutions I proposed and tell me which ones you think are best.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 21:34                                                         ` Ingo Molnar
@ 2007-03-08 23:39                                                           ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 23:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, tglx, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List

Ingo Molnar wrote:
>  - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
>    quite advanced on this front.

At last!  Some love!

The Xen approach has always been to prefer high-level interfaces over
lower-level ones, so that guests can meaningfully participate in their
own virtualization.  There are some necessarily low-level things, but
conceptually simple things like "create a new vcpu" should have simple
interfaces.  There's no point in going to the effort of emulating a
whole pile of real hardware if Xen can present an interface which is a
close match to an existing high-level interface within the operating system.

>  This would be shared by /all/ 
>    hypervisors and could occasionally be reused by other hardware 
>    platforms as well. It can be morphed via small wrappers into some
>    'hypervisor personality' kind of hypervisor backends, but no
>    fundamental transformation happens.
>   

Because of both general conceptual cleanliness and Xen's requirements,
the pv_ops interface has tended towards fairly high-level interfaces
where possible and useful.

VMI's design tends towards being a closer match to some approximation of
the real underlying hardware, which I suppose is a reflection of its
origins as an extension of a fully virtualizing hypervisor.  I don't
have any particular problem with that, and I think its a perfectly
reasonable approach if that's the path you want to take.

What this means, however, is that the existing arch/i386 code needs to
be refactored so that VMI can reuse it in a sensible way, so that it can
implement the high-level operations in terms of the existing building
blocks that running on bare hardware has to use anyway.

Things like genirq and genapic should help with that in principle. 
genirq certainly cleaned up Xen's interface to the interrupt subsystem. 
I haven't yet looked at genapic, but from what Zach and Chris Wright
have said, I get the impression that it isn't yet up to meeting our
requirements.

The result of this tension has been a general desire to keep pv_ops a
high-level interface, but for pragmatic reasons it has grown a few
low-level operations (like the apic read and write interfaces) which
better suit VMI's needs.  I would like to see these go away as they get
replaced with high-level interfaces. 

I think this is the natural course we've been following anyway.  Xen has
the advantage of starting from a relatively clean slate here, but
reusing existing entrenched pieces cleanly in new ways always takes
careful thought and hard work.

> what we do _NOT_ want is some mixture of 'simplified' and 'hardwired' 
> native hardware access mixed with hypercalls that somehow ends up 
> creating a Frankenstein mixture of 'virtual silicon', which is specified 
> nowhere else but in VMWare's proprietary hypervisor source code that we 
> have no way to fix and no way to even see!

No, but I'm not prejudiced against virtual hardware.  If we have a piece
of code that thinks its talking to an apic, then I think its OK to use
that code whether its a real apic or a virtual one, _so long as its
being used in a way that's consistent with its intended interface_.  I
have to admit I have not looked at apics - real or virtual - in any
detail, so I won't claim to really understand the details of the
existing arch/i386 code or what VMI's trying to do, but it does seem to
me that it could all be much cleaner.

And clean is good, we all love clean - and so, agreement!

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 23:39                                                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-08 23:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	tglx, Linus Torvalds, akpm

Ingo Molnar wrote:
>  - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
>    quite advanced on this front.

At last!  Some love!

The Xen approach has always been to prefer high-level interfaces over
lower-level ones, so that guests can meaningfully participate in their
own virtualization.  There are some necessarily low-level things, but
conceptually simple things like "create a new vcpu" should have simple
interfaces.  There's no point in going to the effort of emulating a
whole pile of real hardware if Xen can present an interface which is a
close match to an existing high-level interface within the operating system.

>  This would be shared by /all/ 
>    hypervisors and could occasionally be reused by other hardware 
>    platforms as well. It can be morphed via small wrappers into some
>    'hypervisor personality' kind of hypervisor backends, but no
>    fundamental transformation happens.
>   

Because of both general conceptual cleanliness and Xen's requirements,
the pv_ops interface has tended towards fairly high-level interfaces
where possible and useful.

VMI's design tends towards being a closer match to some approximation of
the real underlying hardware, which I suppose is a reflection of its
origins as an extension of a fully virtualizing hypervisor.  I don't
have any particular problem with that, and I think its a perfectly
reasonable approach if that's the path you want to take.

What this means, however, is that the existing arch/i386 code needs to
be refactored so that VMI can reuse it in a sensible way, so that it can
implement the high-level operations in terms of the existing building
blocks that running on bare hardware has to use anyway.

Things like genirq and genapic should help with that in principle. 
genirq certainly cleaned up Xen's interface to the interrupt subsystem. 
I haven't yet looked at genapic, but from what Zach and Chris Wright
have said, I get the impression that it isn't yet up to meeting our
requirements.

The result of this tension has been a general desire to keep pv_ops a
high-level interface, but for pragmatic reasons it has grown a few
low-level operations (like the apic read and write interfaces) which
better suit VMI's needs.  I would like to see these go away as they get
replaced with high-level interfaces. 

I think this is the natural course we've been following anyway.  Xen has
the advantage of starting from a relatively clean slate here, but
reusing existing entrenched pieces cleanly in new ways always takes
careful thought and hard work.

> what we do _NOT_ want is some mixture of 'simplified' and 'hardwired' 
> native hardware access mixed with hypercalls that somehow ends up 
> creating a Frankenstein mixture of 'virtual silicon', which is specified 
> nowhere else but in VMWare's proprietary hypervisor source code that we 
> have no way to fix and no way to even see!

No, but I'm not prejudiced against virtual hardware.  If we have a piece
of code that thinks its talking to an apic, then I think its OK to use
that code whether its a real apic or a virtual one, _so long as its
being used in a way that's consistent with its intended interface_.  I
have to admit I have not looked at apics - real or virtual - in any
detail, so I won't claim to really understand the details of the
existing arch/i386 code or what VMI's trying to do, but it does seem to
me that it could all be much cleaner.

And clean is good, we all love clean - and so, agreement!

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 23:39                                                           ` Jeremy Fitzhardinge
@ 2007-03-08 23:55                                                             ` Zachary Amsden
  -1 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 23:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, tglx, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List

Jeremy Fitzhardinge wrote:
> No, but I'm not prejudiced against virtual hardware.  If we have a piece
> of code that thinks its talking to an apic, then I think its OK to use
> that code whether its a real apic or a virtual one, _so long as its
> being used in a way that's consistent with its intended interface_.  I
> have to admit I have not looked at apics - real or virtual - in any
> detail, so I won't claim to really understand the details of the
> existing arch/i386 code or what VMI's trying to do, but it does seem to
> me that it could all be much cleaner.
>
> And clean is good, we all love clean - and so, agreement!
>   

For APICs, we have two operations - APICRead and APICWrite.  It is nice 
and clean, and plugs in very easily to the APIC accessors available in 
Linux.

Is this not clean?

We just don't drive the local timer interrupts through the APIC, we make 
hypercalls to schedule local timer alarms.  Which is something we must 
do for UP kernels as well, which use the PIT / PIC.  So there is a need 
for having clockevents code which doesn't program timers through the APIC.

So we have one separate time device, independent from the traditional 
hardware timers, and we just program that.  This design is not very 
complex, nor is it unclean, IMHO.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-08 23:55                                                             ` Zachary Amsden
  0 siblings, 0 replies; 169+ messages in thread
From: Zachary Amsden @ 2007-03-08 23:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	tglx, akpm, Linus Torvalds, Ingo Molnar

Jeremy Fitzhardinge wrote:
> No, but I'm not prejudiced against virtual hardware.  If we have a piece
> of code that thinks its talking to an apic, then I think its OK to use
> that code whether its a real apic or a virtual one, _so long as its
> being used in a way that's consistent with its intended interface_.  I
> have to admit I have not looked at apics - real or virtual - in any
> detail, so I won't claim to really understand the details of the
> existing arch/i386 code or what VMI's trying to do, but it does seem to
> me that it could all be much cleaner.
>
> And clean is good, we all love clean - and so, agreement!
>   

For APICs, we have two operations - APICRead and APICWrite.  It is nice 
and clean, and plugs in very easily to the APIC accessors available in 
Linux.

Is this not clean?

We just don't drive the local timer interrupts through the APIC, we make 
hypercalls to schedule local timer alarms.  Which is something we must 
do for UP kernels as well, which use the PIT / PIC.  So there is a need 
for having clockevents code which doesn't program timers through the APIC.

So we have one separate time device, independent from the traditional 
hardware timers, and we just program that.  This design is not very 
complex, nor is it unclean, IMHO.

Zach

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 23:39                                                           ` Jeremy Fitzhardinge
@ 2007-03-09  0:04                                                             ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-09  0:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Zachary Amsden, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

On Thu, 2007-03-08 at 15:39 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >  - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
> >    quite advanced on this front.
> 
> At last!  Some love!
> 
> The Xen approach has always been to prefer high-level interfaces over
> lower-level ones, so that guests can meaningfully participate in their
> own virtualization.  There are some necessarily low-level things, but
> conceptually simple things like "create a new vcpu" should have simple
> interfaces.  There's no point in going to the effort of emulating a
> whole pile of real hardware if Xen can present an interface which is a
> close match to an existing high-level interface within the operating system.

Once you are there, you are near the point where you created a virtual
architecture, which could run on any real architecture which gets
supported by a hypervisor backend.

I'd love that :)

I know it is tricky to combine this with the upcoming hardware
virtualization support. But it's at least a worthwhile thought
experiment.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-09  0:04                                                             ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-09  0:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	Ingo Molnar, Linus Torvalds, akpm

On Thu, 2007-03-08 at 15:39 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >  - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
> >    quite advanced on this front.
> 
> At last!  Some love!
> 
> The Xen approach has always been to prefer high-level interfaces over
> lower-level ones, so that guests can meaningfully participate in their
> own virtualization.  There are some necessarily low-level things, but
> conceptually simple things like "create a new vcpu" should have simple
> interfaces.  There's no point in going to the effort of emulating a
> whole pile of real hardware if Xen can present an interface which is a
> close match to an existing high-level interface within the operating system.

Once you are there, you are near the point where you created a virtual
architecture, which could run on any real architecture which gets
supported by a hypervisor backend.

I'd love that :)

I know it is tricky to combine this with the upcoming hardware
virtualization support. But it's at least a worthwhile thought
experiment.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 23:55                                                             ` Zachary Amsden
  (?)
@ 2007-03-09  0:10                                                             ` Jeremy Fitzhardinge
  2007-03-09  0:29                                                                 ` Linus Torvalds
  -1 siblings, 1 reply; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09  0:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, tglx, john stultz, akpm, Linus Torvalds, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List

Zachary Amsden wrote:
> For APICs, we have two operations - APICRead and APICWrite.  It is
> nice and clean, and plugs in very easily to the APIC accessors
> available in Linux.
>
> Is this not clean?

Sure, that's clean, From that perspective the apic is a bunch of
registers backed by a state machine or something.  It's not particularly
clean from a high-level interface perspective because those calls don't
mean anything, but that just means pv_ops is the wrong interface for
those calls. genapic, from its name alone, sounds like it should be the
right place to hook in at that level; if it isn't, it sounds like the
right starting place.

But...

> We just don't drive the local timer interrupts through the APIC, we
> make hypercalls to schedule local timer alarms.  Which is something we
> must do for UP kernels as well, which use the PIT / PIC.  So there is
> a need for having clockevents code which doesn't program timers
> through the APIC.

Yes, but couldn't you, oh I don't know, have the virtual timer
interrupts come in on irq 97, and just register a handler for that irq
and use that ISR to drive the time stuff?  Then its logically identical
to the Xen code or any other free-standing device driver.

Making your virtual timer device share interrupts with the (emulated)
real-time device seems to be making things messy (is that right, is that
the issue?).  I don't see why UP vs SMP is an issue here at all, or why
the PIT gets involved in any way (and I don't mean that in a "I think
your design is idiotic" way, I mean that in a "I don't really understand
the problem domain, so I'm missing something in your explanations" way).

    J

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 23:55                                                             ` Zachary Amsden
@ 2007-03-09  0:22                                                               ` Daniel Walker
  -1 siblings, 0 replies; 169+ messages in thread
From: Daniel Walker @ 2007-03-09  0:22 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Ingo Molnar, tglx, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Andi Kleen, Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:

> 
> We just don't drive the local timer interrupts through the APIC, we make 
> hypercalls to schedule local timer alarms.  Which is something we must 
> do for UP kernels as well, which use the PIT / PIC.  So there is a need 
> for having clockevents code which doesn't program timers through the APIC.

This getting a bit confusing to me .. When your talking about "local
timer interrupts" are you speaking of a periodic interrupt from the
lapic , or the jiffies increment inside the virtual machine?

Daniel


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-09  0:22                                                               ` Daniel Walker
  0 siblings, 0 replies; 169+ messages in thread
From: Daniel Walker @ 2007-03-09  0:22 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	Ingo Molnar, akpm, Linus Torvalds, tglx

On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:

> 
> We just don't drive the local timer interrupts through the APIC, we make 
> hypercalls to schedule local timer alarms.  Which is something we must 
> do for UP kernels as well, which use the PIT / PIC.  So there is a need 
> for having clockevents code which doesn't program timers through the APIC.

This getting a bit confusing to me .. When your talking about "local
timer interrupts" are you speaking of a periodic interrupt from the
lapic , or the jiffies increment inside the virtual machine?

Daniel

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 23:55                                                             ` Zachary Amsden
@ 2007-03-09  0:28                                                               ` Thomas Gleixner
  -1 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-09  0:28 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Ingo Molnar, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Andi Kleen, Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
> > No, but I'm not prejudiced against virtual hardware.  If we have a piece
> > of code that thinks its talking to an apic, then I think its OK to use
> > that code whether its a real apic or a virtual one, _so long as its
> > being used in a way that's consistent with its intended interface_.  I
> > have to admit I have not looked at apics - real or virtual - in any
> > detail, so I won't claim to really understand the details of the
> > existing arch/i386 code or what VMI's trying to do, but it does seem to
> > me that it could all be much cleaner.
> >
> > And clean is good, we all love clean - and so, agreement!
>
> For APICs, we have two operations - APICRead and APICWrite.  It is nice 
> and clean, and plugs in very easily to the APIC accessors available in 
> Linux.
> 
> Is this not clean?

No, because there is no need to use APIC. You just pave the road for
doing the same thing to IO_APIC and whatever is on your interest next.

> We just don't drive the local timer interrupts through the APIC, we make 
> hypercalls to schedule local timer alarms.  Which is something we must 
> do for UP kernels as well, which use the PIT / PIC.  So there is a need 
> for having clockevents code which doesn't program timers through the APIC.
>
> So we have one separate time device, independent from the traditional 
> hardware timers, and we just program that.  This design is not very 
> complex, nor is it unclean, IMHO.

And why exactly do you need the APIC operations for the complete
abstract and virtual clock event device ? To inject the interrupt, which
you anyway inject artificially into the paravirtualized kernel ? 

This is simply wrong and does not help anything. The 3 lines of code you
share with the apic timer code are not a valid reason to hook yourself
into the apic.

You can use any arbitrary interrupt number to fire your VMI timer and
this works on SMP as well, as we can pin interrupts on CPUs.

	tglx



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-09  0:28                                                               ` Thomas Gleixner
  0 siblings, 0 replies; 169+ messages in thread
From: Thomas Gleixner @ 2007-03-09  0:28 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	Ingo Molnar, Linus Torvalds, akpm

On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
> > No, but I'm not prejudiced against virtual hardware.  If we have a piece
> > of code that thinks its talking to an apic, then I think its OK to use
> > that code whether its a real apic or a virtual one, _so long as its
> > being used in a way that's consistent with its intended interface_.  I
> > have to admit I have not looked at apics - real or virtual - in any
> > detail, so I won't claim to really understand the details of the
> > existing arch/i386 code or what VMI's trying to do, but it does seem to
> > me that it could all be much cleaner.
> >
> > And clean is good, we all love clean - and so, agreement!
>
> For APICs, we have two operations - APICRead and APICWrite.  It is nice 
> and clean, and plugs in very easily to the APIC accessors available in 
> Linux.
> 
> Is this not clean?

No, because there is no need to use APIC. You just pave the road for
doing the same thing to IO_APIC and whatever is on your interest next.

> We just don't drive the local timer interrupts through the APIC, we make 
> hypercalls to schedule local timer alarms.  Which is something we must 
> do for UP kernels as well, which use the PIT / PIC.  So there is a need 
> for having clockevents code which doesn't program timers through the APIC.
>
> So we have one separate time device, independent from the traditional 
> hardware timers, and we just program that.  This design is not very 
> complex, nor is it unclean, IMHO.

And why exactly do you need the APIC operations for the complete
abstract and virtual clock event device ? To inject the interrupt, which
you anyway inject artificially into the paravirtualized kernel ? 

This is simply wrong and does not help anything. The 3 lines of code you
share with the apic timer code are not a valid reason to hook yourself
into the apic.

You can use any arbitrary interrupt number to fire your VMI timer and
this works on SMP as well, as we can pin interrupts on CPUs.

	tglx

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-09  0:10                                                             ` Jeremy Fitzhardinge
@ 2007-03-09  0:29                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 169+ messages in thread
From: Linus Torvalds @ 2007-03-09  0:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Zachary Amsden, Ingo Molnar, tglx, john stultz, akpm, LKML,
	Pratap Subrahmanyam, Rusty Russell, Andi Kleen, Daniel Hecht,
	Daniel Arai, Chris Wright, Virtualization Mailing List


[ I don't really want to be involved too much in this particular 
  discussion, but I'll pipe up quickly anyway.. ]

On Thu, 8 Mar 2007, Jeremy Fitzhardinge wrote:

> Zachary Amsden wrote:
> > For APICs, we have two operations - APICRead and APICWrite.  It is
> > nice and clean, and plugs in very easily to the APIC accessors
> > available in Linux.
> >
> > Is this not clean?
> 
> Sure, that's clean, From that perspective the apic is a bunch of
> registers backed by a state machine or something.

I think you could do much worse than just decide to pick the IO-APIC/lapic 
as your "virtual interrupt controller model". So I do *not* think that 
APICRead/APICWrite are in any way horrible interfaces for a virtual 
interrupt controller. In many ways, you then have a tested and known 
interface to work with.

Of course, there are bound to be better interfaces that map more naturally 
to what people actually want to *do* ("[un]mask interrupt pin X", 
"acknowledge interrupt pin X" etc), but quite often the cost of an 
interface is *designing* it, so I don't think it's wrong per se to just 
avoid that cost entirely, and just say "we make our interface look like a 
known hardware interface". And then IOAPIC/lapic is the obvious choice.

So Ingo, I think you've been a bit unfair. I agree that it would be 
ugly to also try to emulate timers with that "fake local apic" setup, but 
even that ugliness is probably not horrid, especially if you want to have 
some kind of stable interface for an ABI.

Of course the whole "stable ABI" is fairly moot. The kernel clearly won't 
use the ABI, but a source-level API - and part of that is *exactly* that 
an ABI is by design always very inflexible and has to be backwards 
compatible. Make the API more high-level, and then the VMI "paravirt_ops" 
stuff can translate that higher-level API into its own low-level ABI any 
way it wants to.

I do agree that for timers, the lapic model is probably ugly enough that 
an ABI might be better off with somethign cleaner than just seeing timers 
too as part of the interrupt controller, but I doubt it really is  that 
big a deal. And I really think Ingo has made a bigger deal out of this 
than necessary, although clearly CONFIG_NO_HZ will require that the 
paravirt_ops will work on that level and will be able to translate it to 
whatever VMI interfaces there are.

		Linus

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
@ 2007-03-09  0:29                                                                 ` Linus Torvalds
  0 siblings, 0 replies; 169+ messages in thread
From: Linus Torvalds @ 2007-03-09  0:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: john stultz, LKML, Chris Wright, Virtualization Mailing List,
	tglx, akpm, Ingo Molnar


[ I don't really want to be involved too much in this particular 
  discussion, but I'll pipe up quickly anyway.. ]

On Thu, 8 Mar 2007, Jeremy Fitzhardinge wrote:

> Zachary Amsden wrote:
> > For APICs, we have two operations - APICRead and APICWrite.  It is
> > nice and clean, and plugs in very easily to the APIC accessors
> > available in Linux.
> >
> > Is this not clean?
> 
> Sure, that's clean, From that perspective the apic is a bunch of
> registers backed by a state machine or something.

I think you could do much worse than just decide to pick the IO-APIC/lapic 
as your "virtual interrupt controller model". So I do *not* think that 
APICRead/APICWrite are in any way horrible interfaces for a virtual 
interrupt controller. In many ways, you then have a tested and known 
interface to work with.

Of course, there are bound to be better interfaces that map more naturally 
to what people actually want to *do* ("[un]mask interrupt pin X", 
"acknowledge interrupt pin X" etc), but quite often the cost of an 
interface is *designing* it, so I don't think it's wrong per se to just 
avoid that cost entirely, and just say "we make our interface look like a 
known hardware interface". And then IOAPIC/lapic is the obvious choice.

So Ingo, I think you've been a bit unfair. I agree that it would be 
ugly to also try to emulate timers with that "fake local apic" setup, but 
even that ugliness is probably not horrid, especially if you want to have 
some kind of stable interface for an ABI.

Of course the whole "stable ABI" is fairly moot. The kernel clearly won't 
use the ABI, but a source-level API - and part of that is *exactly* that 
an ABI is by design always very inflexible and has to be backwards 
compatible. Make the API more high-level, and then the VMI "paravirt_ops" 
stuff can translate that higher-level API into its own low-level ABI any 
way it wants to.

I do agree that for timers, the lapic model is probably ugly enough that 
an ABI might be better off with somethign cleaner than just seeing timers 
too as part of the interrupt controller, but I doubt it really is  that 
big a deal. And I really think Ingo has made a bigger deal out of this 
than necessary, although clearly CONFIG_NO_HZ will require that the 
paravirt_ops will work on that level and will be able to translate it to 
whatever VMI interfaces there are.

		Linus

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-09  0:04                                                             ` Thomas Gleixner
  (?)
@ 2007-03-09  0:44                                                             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 169+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09  0:44 UTC (permalink / raw)
  To: tglx
  Cc: Ingo Molnar, Zachary Amsden, john stultz, akpm, Linus Torvalds,
	LKML, Pratap Subrahmanyam, Rusty Russell, Andi Kleen,
	Daniel Hecht, Daniel Arai, Chris Wright,
	Virtualization Mailing List

Thomas Gleixner wrote:
> Once you are there, you are near the point where you created a virtual
> architecture, which could run on any real architecture which gets
> supported by a hypervisor backend.
>
> I'd love that :)
>   

Sure.  But not even hypervisors.  Once we sort out pv_ops's SMP support,
it will be this >< close to covering everything in the subarch
interface.  So we can drop all that goo in favour of paravirt_ops, and
make a single kernel that will boot on everything from voyager to
numa-q!  How's that for world peace?

> I know it is tricky to combine this with the upcoming hardware
> virtualization support. But it's at least a worthwhile thought
> experiment.
>   

Well, in many ways that's a step backwards.  The upside is that its
easier to get away with simply emulating the some particular piece
hardware, but it does lose a lot of opportunities for interesting
flexibility and optimisations.

But I anticipate we'll get a xen-hvm pv_ops backend, for running under
Xen with a virtualizing cpu.  It will probably look a lot like kvm's
pv_ops backend.

    J



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: hardwired VMI crap
  2007-03-08 22:39                                                           ` Zachary Amsden
@ 2007-03-16 10:12                                                             ` Pavel Machek
  0 siblings, 0 replies; 169+ messages in thread
From: Pavel Machek @ 2007-03-16 10:12 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Ingo Molnar, tglx, Jeremy Fitzhardinge, john stultz, akpm,
	Linus Torvalds, LKML, Pratap Subrahmanyam, Rusty Russell,
	Andi Kleen, Daniel Hecht, Daniel Arai, Chris Wright

On Thu 2007-03-08 14:39:15, Zachary Amsden wrote:
> Ingo Molnar wrote:
> >* Zachary Amsden <zach@vmware.com> wrote:
> >
> >  
> >>Ingo, either you or Thomas have vetoed every attempt we have made to 
> >>make our code operate with clockevents. [...]
> >>    
> >
> >this is news to me - do you have any proof of such a veto?
> >  
> 
> Yes, your refusal to discuss any technical details when asked point 
> blank which solution you prefer, and your continued whining and 
> threatening to unmerge our code.

Failing to answer a question is hardly a veto.

> And I ask again for your feedback on which approach you think is correct:

> 1) Rewrite the interrupt subsystem of our hypervisor, making it 
> incompatible with full virtualization, so that we can support an 
> abstract interrupt controller with a "clean" interface
> 2) Reuse the same method that HPET, PIT and other time clients in i386 
> use - the global_clock_event pointer which allows you to wrest control 
> back from the APIC and reuse the lapic_events local clockevents.
...

Do all of them then decide which code is nicest. I mean... this looks
like trap question for Ingo. He tells you 2, you'll do crappy
implementation of 2, and then claim Ingo can't object.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 169+ messages in thread

end of thread, other threads:[~2007-03-16 10:12 UTC | newest]

Thread overview: 169+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-06  6:52 + stupid-hack-to-make-mainline-build.patch added to -mm tree akpm
     [not found] ` <45ED16D2.3000202@vmware.com>
     [not found]   ` <20070306084258.GA15745@elte.hu>
     [not found]     ` <20070306084647.GA16280@elte.hu>
2007-03-06  8:55       ` Zachary Amsden
2007-03-06 10:59         ` Thomas Gleixner
2007-03-06 21:07           ` Dan Hecht
2007-03-06 21:07             ` Dan Hecht
2007-03-06 22:21             ` Andi Kleen
2007-03-06 22:21               ` Andi Kleen
2007-03-06 21:32               ` Dan Hecht
2007-03-06 23:53             ` Thomas Gleixner
2007-03-07  0:24               ` Jeremy Fitzhardinge
2007-03-07  0:35                 ` Dan Hecht
2007-03-07  0:49                   ` Thomas Gleixner
2007-03-07  0:53                     ` Dan Hecht
2007-03-07  1:18                       ` Thomas Gleixner
2007-03-07  2:08                         ` Dan Hecht
2007-03-07  8:37                           ` Thomas Gleixner
2007-03-07 17:41                             ` Jeremy Fitzhardinge
2007-03-07 17:41                               ` Jeremy Fitzhardinge
2007-03-07 17:49                               ` Ingo Molnar
2007-03-07 17:49                                 ` Ingo Molnar
2007-03-07 18:03                                 ` James Morris
2007-03-07 18:03                                   ` James Morris
2007-03-07 18:35                                 ` Jeremy Fitzhardinge
2007-03-07 18:35                                   ` Jeremy Fitzhardinge
2007-03-08  0:45                                   ` Alan Cox
2007-03-08  0:45                                     ` Alan Cox
2007-03-07 17:52                               ` Ingo Molnar
2007-03-07 17:52                                 ` Ingo Molnar
2007-03-07 18:28                                 ` Jeremy Fitzhardinge
2007-03-07 18:53                                   ` Thomas Gleixner
2007-03-07 18:53                                     ` Thomas Gleixner
2007-03-07 18:11                               ` James Morris
2007-03-07 18:11                                 ` James Morris
2007-03-07 18:56                                 ` Thomas Gleixner
2007-03-07 19:05                                 ` Jeremy Fitzhardinge
2007-03-07 19:49                                   ` Dan Hecht
2007-03-07 20:11                                     ` Jeremy Fitzhardinge
2007-03-07 20:49                                       ` Dan Hecht
2007-03-07 20:49                                         ` Dan Hecht
2007-03-07 21:14                                         ` Thomas Gleixner
2007-03-07 21:14                                           ` Thomas Gleixner
2007-03-07 20:57                                       ` Thomas Gleixner
2007-03-07 20:57                                         ` Thomas Gleixner
2007-03-07 21:02                                         ` Dan Hecht
2007-03-07 21:08                                           ` Jeremy Fitzhardinge
2007-03-07 21:19                                           ` Thomas Gleixner
2007-03-07 21:19                                             ` Thomas Gleixner
2007-03-07 21:14                                             ` Dan Hecht
2007-03-07 21:21                                     ` Thomas Gleixner
2007-03-07 21:33                                       ` Dan Hecht
2007-03-07 22:05                                       ` Jeremy Fitzhardinge
2007-03-07 23:05                                         ` Thomas Gleixner
2007-03-07 23:05                                           ` Thomas Gleixner
2007-03-07 23:25                                           ` Zachary Amsden
2007-03-07 23:36                                             ` Jeremy Fitzhardinge
2007-03-07 23:40                                               ` Zachary Amsden
2007-03-07 23:40                                                 ` Zachary Amsden
2007-03-08 18:30                                                 ` Chris Wright
2007-03-08 18:30                                                   ` Chris Wright
2007-03-08  0:22                                             ` Thomas Gleixner
2007-03-08  1:01                                               ` Daniel Arai
2007-03-08  1:01                                                 ` Daniel Arai
2007-03-08  1:23                                                 ` Jeremy Fitzhardinge
2007-03-08  1:23                                                   ` Jeremy Fitzhardinge
2007-03-08  7:02                                                   ` Thomas Gleixner
2007-03-08  7:28                                                 ` Thomas Gleixner
2007-03-08  8:01                                                   ` Zachary Amsden
2007-03-08  8:01                                                     ` Zachary Amsden
2007-03-08 18:24                                                 ` Chris Wright
2007-03-08 18:44                                                   ` Daniel Arai
2007-03-08 19:14                                                     ` Chris Wright
2007-03-08 19:14                                                       ` Chris Wright
2007-03-08 19:17                                                       ` Ingo Molnar
2007-03-08 19:17                                                         ` Ingo Molnar
2007-03-08 19:42                                                   ` Jeremy Fitzhardinge
2007-03-08 19:47                                                     ` Chris Wright
2007-03-08 19:47                                                       ` Chris Wright
2007-03-08 19:52                                                       ` Jeremy Fitzhardinge
2007-03-08 20:10                                                         ` Chris Wright
2007-03-08 20:18                                                           ` Jeremy Fitzhardinge
2007-03-08 20:18                                                             ` Jeremy Fitzhardinge
2007-03-08 20:23                                                             ` Chris Wright
2007-03-08 20:23                                                               ` Chris Wright
2007-03-08 20:33                                                               ` Jeremy Fitzhardinge
2007-03-08 20:42                                                                 ` Chris Wright
2007-03-08 20:42                                                                   ` Chris Wright
2007-03-08 20:42                                                                   ` Jeremy Fitzhardinge
2007-03-08 20:42                                                                     ` Jeremy Fitzhardinge
2007-03-08 21:45                                                                 ` Andi Kleen
2007-03-08 21:45                                                                   ` Andi Kleen
2007-03-08 19:54                                                     ` Ingo Molnar
2007-03-08 19:54                                                       ` Ingo Molnar
2007-03-08  9:10                                             ` hardwired VMI crap Ingo Molnar
2007-03-08 10:06                                               ` Zachary Amsden
2007-03-08 11:09                                                 ` Thomas Gleixner
2007-03-08 20:46                                                   ` Zachary Amsden
2007-03-08 20:46                                                     ` Zachary Amsden
2007-03-08 21:13                                                     ` Ingo Molnar
2007-03-08 22:17                                                       ` Zachary Amsden
2007-03-08 22:33                                                         ` Ingo Molnar
2007-03-08 22:39                                                           ` Zachary Amsden
2007-03-16 10:12                                                             ` Pavel Machek
2007-03-08 21:15                                                     ` Jeremy Fitzhardinge
2007-03-08 21:34                                                       ` Ingo Molnar
2007-03-08 21:34                                                         ` Ingo Molnar
2007-03-08 21:43                                                         ` Andi Kleen
2007-03-08 22:30                                                           ` Ingo Molnar
2007-03-08 22:36                                                             ` Zachary Amsden
2007-03-08 23:39                                                         ` Jeremy Fitzhardinge
2007-03-08 23:39                                                           ` Jeremy Fitzhardinge
2007-03-08 23:55                                                           ` Zachary Amsden
2007-03-08 23:55                                                             ` Zachary Amsden
2007-03-09  0:10                                                             ` Jeremy Fitzhardinge
2007-03-09  0:29                                                               ` Linus Torvalds
2007-03-09  0:29                                                                 ` Linus Torvalds
2007-03-09  0:22                                                             ` Daniel Walker
2007-03-09  0:22                                                               ` Daniel Walker
2007-03-09  0:28                                                             ` Thomas Gleixner
2007-03-09  0:28                                                               ` Thomas Gleixner
2007-03-09  0:04                                                           ` Thomas Gleixner
2007-03-09  0:04                                                             ` Thomas Gleixner
2007-03-09  0:44                                                             ` Jeremy Fitzhardinge
2007-03-08 22:31                                                       ` Zachary Amsden
2007-03-08 22:31                                                         ` Zachary Amsden
2007-03-08 21:39                                                     ` Andi Kleen
2007-03-08 21:39                                                       ` Andi Kleen
2007-03-08 22:58                                                       ` Zachary Amsden
2007-03-08 22:42                                                     ` Ingo Molnar
2007-03-08 23:39                                                       ` Zachary Amsden
2007-03-08 18:35                                                 ` Chris Wright
2007-03-08 18:35                                                   ` Chris Wright
2007-03-07 23:33                                           ` + stupid-hack-to-make-mainline-build.patch added to -mm tree Jeremy Fitzhardinge
2007-03-07 23:52                                             ` Dan Hecht
2007-03-08  0:19                                               ` Jeremy Fitzhardinge
2007-03-08  0:19                                                 ` Jeremy Fitzhardinge
2007-03-08  0:35                                             ` Thomas Gleixner
2007-03-08  0:38                                               ` Jeremy Fitzhardinge
2007-03-08  0:38                                                 ` Jeremy Fitzhardinge
2007-03-07 20:40                               ` Thomas Gleixner
2007-03-07 21:07                                 ` Jeremy Fitzhardinge
2007-03-07 21:07                                   ` Jeremy Fitzhardinge
2007-03-07 21:40                                   ` Thomas Gleixner
2007-03-07 21:40                                     ` Thomas Gleixner
2007-03-07 21:34                                     ` Dan Hecht
2007-03-07 22:14                                       ` Thomas Gleixner
2007-03-07 22:17                                         ` Zachary Amsden
2007-03-07 22:17                                           ` Zachary Amsden
2007-03-07 22:31                                           ` Thomas Gleixner
2007-03-07 22:31                                             ` Thomas Gleixner
2007-03-07 22:28                                             ` Dan Hecht
2007-03-07 22:28                                               ` Dan Hecht
2007-03-08  8:01                                   ` Ingo Molnar
2007-03-08  8:01                                     ` Ingo Molnar
2007-03-08  8:15                                     ` Keir Fraser
2007-03-08  8:15                                       ` Keir Fraser
2007-03-08  8:41                                     ` Jeremy Fitzhardinge
2007-03-08 10:26                                     ` Rusty Russell
2007-03-07 21:42                                 ` Dan Hecht
2007-03-07 21:42                                   ` Dan Hecht
2007-03-07 22:07                                   ` Thomas Gleixner
2007-03-07 22:07                                     ` Thomas Gleixner
2007-03-07  5:10                     ` Jeremy Fitzhardinge
2007-03-07  0:40                 ` Thomas Gleixner
2007-03-07  0:42               ` Dan Hecht
2007-03-07  1:22                 ` Thomas Gleixner
2007-03-07  1:22                   ` Thomas Gleixner
2007-03-07  1:44                   ` Dan Hecht
2007-03-07  1:44                     ` Dan Hecht
2007-03-07  7:48                     ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.