* ABI coupling to hypervisors via CONFIG_PARAVIRT
@ 2007-03-09 18:02 Ingo Molnar
  2007-03-09 18:28 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Sure, that's clean. From that perspective the apic is a bunch of
> > registers backed by a state machine or something.
> 
> I think you could do much worse than just decide to pick the 
> IO-APIC/lapic as your "virtual interrupt controller model". So I do 
> *not* think that APICRead/APICWrite are in any way horrible interfaces 
> for a virtual interrupt controller. In many ways, you then have a 
> tested and known interface to work with.

yes - but we already support the raw hardware ABI, in the native kernel.

paravirt_ops is not 'just another PC sub-arch'. It is not 'just another 
hardware driver'. It is not 'just another x86 CPU'. paravirt_ops is much 
wider than that, it hooks everywhere and has effect on everything!

Let's take a look at the raw numbers. Here's a typical distro kernel
vmlinux, with and without CONFIG_PARAVIRT [with no paravirt backend 
enabled]:

     text    data     bss     dec     hex filename
   139863   49010   57672  246545   3c311 x86-kernel-built-in.o.noparavirt
   148865   49310   57672  255847   3e767 x86-kernel-built-in.o.paravirt

     text    data     bss     dec     hex filename
  5154975  586932  221184 5963091  5afd53 vmlinux.noparavirt
  5189197  587504  221184 5997885  5b853d vmlinux.paravirt
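
A quick cross-check of the growth in the text column of the two tables above, as a minimal C sketch. The helper name percent_increase() is hypothetical and is not part of the original measurement; it only redoes the arithmetic on the quoted sizes.

```c
/* Relative growth from old_size to new_size, in percent.
 * old_size must be non-zero. */
double percent_increase(unsigned long old_size, unsigned long new_size)
{
    return 100.0 * ((double)new_size - (double)old_size) / (double)old_size;
}
```

percent_increase(139863, 148865) comes out at about 6.4 for x86-kernel-built-in.o, and percent_increase(5154975, 5189197) at about 0.7 for vmlinux.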

why did code size increase by +6.4% in arch/i386/ (+0.7% in the 
vmlinux)? It is purely because CONFIG_PARAVIRT adds more than _1400_ 
function call hooks to the x86 arch:

 c05c8e60 D paravirt_ops

 c0102602:       ff 15 9c 8e 5c c0       call   *0xc05c8e9c
 c0102d37:       ff 15 94 8e 5c c0       call   *0xc05c8e94
 c0102d45:       ff 15 94 8e 5c c0       call   *0xc05c8e94
 c0102d53:       ff 15 94 8e 5c c0       call   *0xc05c8e94
 c0102d61:       ff 15 94 8e 5c c0       call   *0xc05c8e94
 c0102d6f:       ff 15 94 8e 5c c0       call   *0xc05c8e94
 [...]

 $ objdump -d vmlinux | grep c05c8e | wc -l
 1463

_1463_ hooks, spread out all around the x86 arch.

Are these only trivial hooks a.k.a. alternatives.h? Not at all, these are
full-blown function hooks freely modifiable by a paravirt_ops
implementation, spread throughout the architecture in a fine-grained way.
(see my arguments and specific demonstration about the bad effects of 
this, four paragraphs below.)

As a comparison: people argued about CONFIG_SECURITY hooks and flamed 
about them no end. The reality is, there's only _269_ calls to 
security_ops in this same kernel, and i've got CONFIG_SECURITY + SELINUX 
enabled. And the only functional modification that security_ops does to 
native behavior is "deny the syscall". Not 'full control over 
behavior'... In terms of coupling, CONFIG_SECURITY hooks are a walk in 
the park, relative to CONFIG_PARAVIRT.

we don't even give /real silicon/ that many hooks! If an x86 CPU came
along that required the addition of 1400+ function hooks then we'd say: 
'you must be joking, that's not an x86 CPU! Make it more compatible!'.

please don't get me wrong - 1463 hooks spread out might be fine in the
end, but _if and only if_ there are safeguards in place to make sure
they are just a trivial variation of the hardware ABI - a.k.a.
asm/alternatives.h. But there is _no_ such safeguard in place today and
we are seeing the bad effects of that _already_, with just a _single_
hypervisor and a _single_ abstraction topic (time), so i'm very strongly
convinced that it's a serious issue that cannot just be glossed over
with "relax, it will work out fine". If there's one thing we learned in
the past 15 years, it is that ABI issues will haunt us forever.

Let me demonstrate some of the bad effects, and how far we've _already_ 
deviated from the 'hardware ABI'. An example: one assumes that 
paravirt_ops.safe_halt() is a trivial variation of the 'halt 
instruction', right? But vmi.c and vmitimer.c do much more than that.
Take a look at vmi_safe_halt() which calls vmi_stop_hz_timer(): it hacks 
back a jiffies assumption into its code via paravirt_ops.safe_halt() - 
purely via changes local to vmitimer.c, by using next_timer_interrupt()! 
Thus it has created a _dual layer_ of dynticks that we specifically
objected to. It does so in spite of our warning about why that is
bad, it does so in spite of Xen having implemented a clockevents driver 
in 2 hours, and it does so under the cover of 'oh, this is only a 
vmitimer.c local change'. It circumvents the native dynticks framework 
and in essence brings in the bad NO_IDLE_HZ technique that we worked so 
hard for 2 years not to ever enable for the i386 arch!

so one of my very real problems with paravirt_ops is that due to its 
sheer hook-based impact it allows the modification of the hardware ABI 
on a _very_ wide scale: both unintentionally and intentionally. 

Furthermore, it allows the introduction of hard-to-remove hardwired 
quirks that bind one particular paravirt_ops method to the hypervisor 
ABI - quirks that are not present in any real silicon! Quirks 
_guaranteed by Linux_, by virtue of giving the CONFIG_VMI promise. So we 
effectively expose Linux to the hypervisor ABIs, and give hypervisors 'a 
license to arbitrary ABI coupling'. With no clear mechanism whatsoever 
to remove that coupling. At least crap hardware rots with time and gives
us a chance to remove it - but software ABIs, and hence "virtual
hardware", seldom rot.

it's hard enough to change the APIC code so that it works on both Intel 
and AMD CPUs: and those CPUs /share the ABI/ to a very large degree, by 
executing the same code.

and there are no safeguards in place whatsoever to ensure that the 
'virtual silicon ABI' matches up to the real silicon ABI. Granted, for 
VMI it probably matches up today, most likely because VMI came from full
software virtualization that /had/ to emulate all of real silicon. But 
from now on, paravirt_ops allows shortcuts, allows changes to semantics 
[a shortcut is change of semantics], allows additional ABIs, and we 
already see that happening in vmitimer.c, which is even one of the
/easiest/ paravirtualization topics.

Furthermore, it's only the beginning of the pain. It's the effect of 
_ONE_ hypervisor (VMware/ESX) and _ONE_ abstraction (time). We've got
4-5 hypervisors lined up for the paravirt_ops free-for-all and half a
dozen fundamental abstractions to cover (of which time is the
simplest)! Please do the math. With paravirt_ops, the complexity of this
is per-hypervisor and per-abstraction and there's no safeguard in place
to let them evolve towards a saner, shared model. It won't happen because
doing a sane, shared ABI is _HARD_. Nobody will go to the effort of
implementing the clean solution if they get to play with paravirt_ops, 
and _I_ will have the work dumped on me in the architecture, trying to 
sort out the mess.

i claim that when the 'API cut' is done at the right level then no more 
than say 100 hooks would be needed - with virtually zero kernel size 
increase. We've got all the right highlevel abstractions: genirq, gtod, 
clockevents. Whatever is missing at the moment from the framework (say 
smp_send_reschedule()) we can abstract away. The bonus? It would be 
almost directly applicable to other architectures as well. It would also 
work with /any/ hypervisor.

Unfortunately, with the current paravirt_ops policy we might end up 
seeing none of that unification. 1400+ hooks are just wide enough (by 
the law of large numbers) so that any hypervisor can implement a random
ABI that performs close to the 'real' ABI, and has no incentive 
whatsoever to play along and make life easier for Linux by having a 
sane, unified ABI. Having provided the CONFIG_VMI ABI once, distros will
see themselves forced to add those 1400+ hooks for many, many years to
come, blowing up Linux' instruction cache footprint in a critical area 
of code.

And that is why the "paravirt_ops is just virtual hardware" argument is 
totally wrong. _Nothing_ limits hypervisors from adding arbitrary ABI 
bindings to Linux. For example, VMI does this already and none of the 
following are hardware ABIs:

 #define VMI_CALL_SetAlarm               68
 #define VMI_CALL_CancelAlarm            69
 #define VMI_CALL_GetWallclockTime       70
 #define VMI_CALL_WallclockUpdated       71

and these ABIs have been objected to by Thomas and me because they are 
cycle based - still paravirt_ops allows Linux to become dependent on 
these ABIs.

Finally, what would i like to see?

Firstly, i think this has been over-rushed. After years of being happy 
with forks of the Linux kernel, all the hypervisors woke up at once and 
want to have their stuff upstream /now/. This rush created a hodgepodge 
of APIs/ABIs that we now in the end promise to support /all/. (if we 
take CONFIG_VMI i can see little ethical reason to not take Xen's 
paravirt_ops, lguest's paravirt_ops, KVM's paravirt_ops and i'm sure 
Microsoft/Novell will have something nice and different for us too.)

Secondly, i'd like to see a paravirt approach that has /implicit/ 
safeguards against the following type of crap:

   vmitimer.c code has a hardwired assumption that a PIT exists:

        /* Disable PIT. */
        outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */

   it has a hardwired assumption that 'cycles' makes sense as a way to
   communicate time units:

        vmi_timer_ops.set_alarm(
                      VMI_ALARM_WIRED_LVTT | VMI_ALARM_IS_PERIODIC | VMI_CYCLES_AVAILABLE,
                      per_cpu(process_times_cycles_accounted_cpu, cpu) + cycles_per_alarm,
                      cycles_per_alarm);

   it has a hardwired assumption that Linux keeps time in units of 
   'jiffies':

        if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
            (next = next_timer_interrupt(),
             time_before_eq(next, jiffies + HZ/CONFIG_VMI_ALARM_HZ))) {
                cpu_clear(cpu, nohz_cpu_mask);

etc. etc. paravirt_ops is _NOT_ just a random driver we can fix in the 
future. These are all assumptions that can /easily/ leak out towards the 
hypervisor and thus create ABI coupling between that quirk and a 
specific version of the hypervisor. Fixing such bad coupling needs 
changes on the /hypervisor side/.

Granted, some of these are just harmless quirks that are fixable in 
Linux only, but some of these are stifling because they bind Linux to
the hypervisor ABI.
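
For readers unfamiliar with the jiffies comparison in the last snippet: time_before_eq() is a wrap-safe ordering test on a free-running tick counter. The sketch below shows the idea on a 32-bit counter. tick_before_eq() is a made-up name, and this is an illustration of the technique rather than the kernel's own macro.

```c
#include <stdint.h>

/* Wrap-safe "a is at or before b" for a free-running 32-bit tick counter.
 * Only meaningful while a and b are less than 2^31 ticks apart.
 * Relies on the usual two's-complement conversion to int32_t. */
int tick_before_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}
```

The subtraction is done modulo 2^32, so a counter value taken just before a wrap still compares as earlier than one taken just after it. Any code built on such a test is tied to tick-based timekeeping, which is the assumption being criticised here.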

i might be overreacting, but the effects of 1400+ hooks put into the 
code against my objections, which code i helped write for many years, 
and that i'd like to help write for many years to come, is no small 
issue to me.

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
@ 2007-03-09 18:28 ` Andi Kleen
  2007-03-09 18:30 ` Linus Torvalds
  2007-03-09 19:00 ` Chris Wright
  2 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2007-03-09 18:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Chris Wright, Alan Cox

On Friday 09 March 2007 19:02, Ingo Molnar wrote:

> _1463_ hooks, spread out all around the x86 arch.

They are not all different hooks though, just many call sites of the same.
Also most of them are well defined to just match what the instructions
do.

paravirt_ops has under a hundred entries right now and I intend to not
expand it much further after the Xen bits are in.
 
> Let me demonstrate some of the bad effects, and how far we've _already_ 
> deviated from the 'hardware ABI'. An example: one assumes that 
> paravirt_ops.safe_halt() 

The vmi maintainers already agreed to fix that.

> i claim that when the 'API cut' is done at the right level 

Can you make a proposal?  Would you be willing to write code for that?

> then no more  
> than say 100 hooks would be needed 

Well we have less than 100 hooks right now, just with many call sites @)

> - with virtually zero kernel size  
> increase. 

I'll believe that when I see it.

> We've got all the right highlevel abstractions: genirq, gtod,  
> clockevents. Whatever is missing at the moment from the framework (say 
> smp_send_reschedule()) we can abstract away. 

smp_send_reschedule() is just an IPI instance which is already
abstracted with genapic. Xen has a genapic_xen.

VMI is still where ->apic_read/->apic_write and their relatively
harmless timer interrupt change make sense -- if they needed
more changes they would just need to bite the bullet and 
provide a custom genapic vmware apic driver.

> Unfortunately, with the current paravirt_ops policy we might end up 
> seeing none of that unification. 

I am open to concrete incremental proposals for improvements.

> 
> And that is why the "paravirt_ops is just virtual hardware" argument is 
> totally wrong. _Nothing_ limits hypervisors from adding arbitrary ABI 
> bindings to Linux. For example, VMI does this already and none of the 
> following are hardware ABIs:
> 
>  #define VMI_CALL_SetAlarm               68
>  #define VMI_CALL_CancelAlarm            69
>  #define VMI_CALL_GetWallclockTime       70
>  #define VMI_CALL_WallclockUpdated       71

That's VMI internal, not exposed above paravirt ops.
  
> Firstly, i think this has been over-rushed. After years of being happy 
> with forks of the Linux kernel, 

The code has been posted for a long time, open for review for everybody.

> Secondly, i'd like to see a paravirt approach that has /implicit/ 
> safeguards against the following type of crap:

I don't think you can use an API to force the underlying implementation
in a practical way. If code wants to do something wrong, no API
in the world will stop it. That is why we have code review instead.

>    it has a hardwired assumption that 'cycles' makes sense as a way to
>    communicate time units:
> 
>         vmi_timer_ops.set_alarm(
>                       VMI_ALARM_WIRED_LVTT | VMI_ALARM_IS_PERIODIC | VMI_CYCLES_AVAILABLE,
>                       per_cpu(process_times_cycles_accounted_cpu, cpu) + cycles_per_alarm,
>                       cycles_per_alarm);

That's because VMI is defined this way? If paravirt chose to not pass cycles
anymore they would just add a simple conversion function. SMOP.
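
A conversion of the kind mentioned here might look like the following sketch. The function names and the kHz frequency parameter are assumptions made for illustration; they are not taken from VMI or from paravirt_ops.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a counter running at freq_khz.
 * freq_khz must be non-zero; cycles * 1000000 must fit in 64 bits. */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t freq_khz)
{
    return cycles * 1000000ULL / freq_khz;
}

/* Convert nanoseconds to cycles (rounding down) for the same counter.
 * ns * freq_khz must fit in 64 bits. */
uint64_t ns_to_cycles(uint64_t ns, uint32_t freq_khz)
{
    return ns * freq_khz / 1000000ULL;
}
```

With a 2 GHz counter (freq_khz = 2000000), two million cycles convert to one millisecond, i.e. 1000000 ns, and back again.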

 
>    it has a hardwired assumption that Linux keeps time in units of 
>    'jiffies':

Well a lot of drivers have that, but it can be all fixed.

> Granted, some of these are just harmless quirks that are fixable in 
> Linux only,

All of them.

> but some of these are stifling because they bind Linux to
> the hypervisor ABI.

I haven't seen a concrete example of that yet to be honest and I don't
really believe it.
 
-Andi


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
  2007-03-09 18:28 ` Andi Kleen
@ 2007-03-09 18:30 ` Linus Torvalds
  2007-03-09 19:24   ` Ingo Molnar
  2007-03-09 19:00 ` Chris Wright
  2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 18:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox



On Fri, 9 Mar 2007, Ingo Molnar wrote:
> 
> yes - but we already support the raw hardware ABI, in the native kernel.

Why do you continue to call paravirt an ABI?

We got over that. It's not. It's an API.

VMI is an ABI.

As long as you try to confuse the two, there's no point to the discussion.

Yeah, paravirt is ugly. Yeah, the calls should be moved higher in the 
stack. But you don't help by confusing the issue by mixing the different 
parts up and calling something an ABI that simply *isn't*.

Paravirt already acts on a higher level than the ioapic. It does do the 
"irq_disable()" kind of "highlevel" callbacks. Yeah, the "apic_write()" 
ones should go away, and they're just hacky, but there's nothing there 
that is an ABI.

So just *fix* it or tell others to fix it, instead of just confusing the 
issue.

And trust me, if "apic_write" causes bugs because it interacts with real 
APIC usage, we don't care ONE WHIT. That paravirt_ops entry goes out the 
window so fast you can't say "Whaa?!??". 

		Linus


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
  2007-03-09 18:28 ` Andi Kleen
  2007-03-09 18:30 ` Linus Torvalds
@ 2007-03-09 19:00 ` Chris Wright
  2 siblings, 0 replies; 37+ messages in thread
From: Chris Wright @ 2007-03-09 19:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Chris Wright, Alan Cox

* Ingo Molnar (mingo@elte.hu) wrote:
> i claim that when the 'API cut' is done at the right level then no more 
> than say 100 hooks would be needed - with virtually zero kernel size 
> increase. We've got all the right highlevel abstractions: genirq, gtod, 
> clockevents. Whatever is missing at the moment from the framework (say 
> smp_send_reschedule()) we can abstract away. The bonus? It would be 
> almost directly applicable to other architectures as well. It would also 
> work with /any/ hypervisor.

Oddly enough, that's really what we are trying to achieve.  There is
definitely some tension between the VMI model which is modeled very
directly on hardware and something like the Xen model which prefers
higher level interfaces.

I don't really agree with your metrics w.r.t hooks.  My point is you
take callsites == hooks to arrive at 1463 hooks, but then above say 100
hooks is sufficient.  But we have on the order of 100 hooks (I believe
it's ~75 in Linus' tree).  Put it another way.  Do you believe that
something like irq_{en,dis}able() is appropriate to hook (as that's >
1400 callsites already)?
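
The hook versus call-site distinction can be made concrete with a small sketch. Everything below is hypothetical: demo_ops is a two-entry function-pointer table in the style of paravirt_ops, not its real layout, and the callers are invented for the example.

```c
#include <stddef.h>

/* Hypothetical ops table with two hooks. This is NOT the real
 * paravirt_ops layout, just the same function-pointer pattern. */
struct demo_ops {
    void (*irq_disable)(void);
    void (*irq_enable)(void);
};

int demo_irq_enabled = 1;

static void native_irq_disable(void) { demo_irq_enabled = 0; }
static void native_irq_enable(void)  { demo_irq_enabled = 1; }

/* A hypervisor backend would install its own functions here instead. */
struct demo_ops demo_pv = {
    .irq_disable = native_irq_disable,
    .irq_enable  = native_irq_enable,
};

/* Number of hooks: one per function pointer in the table. */
size_t demo_hook_count(void)
{
    return sizeof(struct demo_ops) / sizeof(void (*)(void));
}

/* Two callers, four call sites, still only two hooks. Each call goes
 * through the table, so it is typically compiled to an indirect call. */
int demo_add_one(int x)
{
    demo_pv.irq_disable();   /* call site 1 */
    x += 1;
    demo_pv.irq_enable();    /* call site 2 */
    return x;
}

int demo_double(int x)
{
    demo_pv.irq_disable();   /* call site 3 */
    x *= 2;
    demo_pv.irq_enable();    /* call site 4 */
    return x;
}
```

Counting indirect calls in the disassembly of such code would report four, while the interface a backend has to implement has two entries. Both numbers are real; they measure code size impact and interface width respectively.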

> Firstly, i think this has been over-rushed. After years of being happy 
> with forks of the Linux kernel, all the hypervisors woke up at once and 
> want to have their stuff upstream /now/. This rush created a hodgepodge 
> of APIs/ABIs that we now in the end promise to support /all/. (if we 
> take CONFIG_VMI i can see little ethical reason to not take Xen's 
> paravirt_ops, lguest's paravirt_ops, KVM's paravirt_ops and i'm sure 
> Microsoft/Novell will have something nice and different for us too.)

It would be eminently helpful if you helped with some specific ideas
on where the paravirt_ops interface needs to be adjusted.

> Secondly, i'd like to see a paravirt approach that has /implicit/ 
> safeguards against the following type of crap:

How would you propose doing that?  Typically that's done with code
review and patches.

thanks,
-chris


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 18:30 ` Linus Torvalds
@ 2007-03-09 19:24   ` Ingo Molnar
  2007-03-09 19:51     ` Linus Torvalds
  2007-03-09 20:50     ` Jan Engelhardt
  0 siblings, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 9 Mar 2007, Ingo Molnar wrote:
> > 
> > yes - but we already support the raw hardware ABI, in the native 
> > kernel.
> 
> Why do you continue to call paravirt an ABI?
> 
> We got over that. It's not. It's an API.
> 
> VMI is an ABI.

Unfortunately i still don't see where i'm wrong, and i'm really trying to
understand your argument. Is your argument that as long as an ABI (VMI) 
is never directly used but only used via wrapper functions 
(paravirt_ops), it has no effects whatsoever on the flexibility of the 
rest of the software and ceases to have any negative ABI effects? In my 
opinion that is an absurd (and incorrect) point so i guess you must mean 
something else, but i really cannot think what that is.

I never said paravirt_ops is an ABI. I say that the ABI(s) _behind_ 
paravirt_ops [in the backend] /does/ limit Linux, even if wrapped, 
inevitably, and that i'm simply worried about having 4-5 independent 
ABIs behind each paravirt_ops variant each creating a web of design 
constraints on the rest of the kernel. To quote a past email of mine:

|| 'paravirt ops can take care of it' - but that is just blatantly 
|| _FALSE_: the ABI 'behind' the paravirt_ops 'shines through' via 
|| functional coupling

it doesn't matter in how big letters the wrapper functions have 'freedom'
written on them, the _real_ constraint is the user's expectation to have 
the hypervisor work with Linux that worked with that particular VMI ABI 
in v2.6.21. So the user wants to have its hypervisor 1.12 work with 
Linux v2.6.22 - without having to update the hypervisor. And Linux 
v2.6.23. Etc. /That/ is the 'ABI effect' i'm worried about. It is a 
"compatibility web" that gets more and more entangled with every new 
paravirt_ops implementation added.
 
In practice, when a problem comes up during code rewrite, 90% of the 
time we can probably find a way around it via paravirt_ops and the 
backend, but i'm simply worried about the remaining 10%. And that 10% is
not hypothetical at all. Should i cite specific examples of problems
that i think cannot be solved via Linux-only modifications?

I'm also worried about the sheer QA inertia of having an additional 4-5 
hypervisor-ABI constraints on the correctness of the kernel, in addition 
to the 2 main CPU variants we have at the moment.

If we said "paravirt_ops must behave like real hardware" then we'd 
probably remove some of that risk (although enforcement is still an 
issue). But we _specifically_ say that no, it doesn't have to behave like
real hardware. We allow shortcuts, we allow modifications of behavior -
and that's good in quite a few cases. But we also allow really weird hacks
like the .safe_halt() thing. Our only present requirement, it appears, is
that "it works with today's hypervisor" - and that requirement
automatically transforms itself into: "all future kernels will work with 
all past versions of the hypervisor".

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 19:24   ` Ingo Molnar
@ 2007-03-09 19:51     ` Linus Torvalds
  2007-03-09 20:12       ` Ingo Molnar
  2007-03-09 21:04       ` Ingo Molnar
  2007-03-09 20:50     ` Jan Engelhardt
  1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 19:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox



On Fri, 9 Mar 2007, Ingo Molnar wrote:
> 
> Unfortunately i still dont see where i'm wrong, and i'm really trying to 
> understand your argument. Is your argument that as long as an ABI (VMI) 
> is never directly used but only used via wrapper functions 
> (paravirt_ops)

No.

My argument is utterly and *purely* that you've been confusing the
discussion by using the wrong terms, and as a result, you've been 
discussing things that aren't *relevant*.

You haven't been saying anything constructive.

For example, here's a *constructive* thing you could have said, and never 
actually did:

 - paravirt_op->write_apic should not exist

   anything that needs to write to the apic should either

    (a) have been caught much earlier in the paravirt stack, ie it's a 
	"disable interrupt" kind of operation, and should never even have 
	gotten to the APIC write in the first place, but been handled by 
	the paravirtualized handler.
    (b) just be emulated as an APIC write (and if the emulation isn't good 
	enough, screw it)

In other words, to be *constructive*, you need to point out particular 
and practical problem spots, instead of just ranting about any 
"paravirtualized ABI".

Of *course* there is an ABI at some point behind any API. Why do you harp 
on that? It's irrelevant. Any API will always end up being instantiated 
into a binary thing at some point, and that binary thing will have to work 
with some particular version of a hardware/infrastructure combination, but 
that has *nothing* to do with anything.

The x86 instruction set is an ABI. Our API's eventually tend to be 
compiled to something like that ABI, and yes, some ABI's may not be able 
to do certain things. For example, on the 32-bit x86 ABI, there are no 
interfaces for address space identifiers, and the wrappers become no-ops
that just don't do anything, and if you want fast context switches between
two contexts, you're screwed.

Similarly, maybe the VMI ABI doesn't allow for something that the kernel 
wants to do efficiently. Big deal. What relevance does that have to do 
with anything, except the fact that if true, the VMWare people are 
screwed? It's *their* problem.

So please

 - point out things that are badly done. I agree that apic_write() simply 
   shouldn't be an ABI point at all. But do so *directly* without some 
   ranting about other things that aren't relevant. Your "1400 hooks" rant 
   was pointless - there aren't 1400 hooks at all. There are 1400 
   call-sites, but that's like saying that the "mov" operation is a bad 
   instruction, because there are 5 million mov instructions in the 
   kernel.

 - Realize that if VMI has problems, it's not *your* problem, or even the
   kernel's problem. It's purely a VMI problem. I don't understand why you
   care, or why you think we should care.

 - and I guess we can also stop cc'ing me in the first place. I don't even 
   think virtualization is very interesting. I'd much rather flame people 
   about bad taste in more important areas ;)

Thanks,

				Linus


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 19:51     ` Linus Torvalds
@ 2007-03-09 20:12       ` Ingo Molnar
  2007-03-09 21:05         ` Jeremy Fitzhardinge
  2007-03-09 21:06         ` Linus Torvalds
  2007-03-09 21:04       ` Ingo Molnar
  1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Similarly, maybe the VMI ABI doesn't allow for something that the 
> kernel wants to do efficiently. Big deal. What relevance does that 
> have to do with anything, except the fact that if true, the VMWare 
> people are screwed? It's *their* problem.

i won't hold you up for long, but i think this is the key difference, and
if i understand your point correctly i think you are really wrong here.

This is 'enterprise Linux compatibility and ABI 101', really - i dare to 
bet blindly that you won't see anyone here from distros arguing against
this simple point.

Once this thing is released upstream, it creates a new compatibility 
rule:

  _new kernel must not break on an older hypervisor_

due to a new paravirt_ops design. Ever. It's really that simple. (I 
think i never said this explicitly because this requirement of backwards 
compatibility was so obvious to me.)

And it doesn't matter whether we think that it was VMware who messed up.
Users/customers _will_ blame us: "v2.6.25 regresses, it won't run under
ESX v1.12 anymore". Distros will yield and will undo whatever change
breaks backwards compatibility with older hypervisors. (most likely it 
will be undone upstream already) Backwards compatibility acts as a very 
heavy barrier against certain types of paravirt_ops design changes.

Once v2.6.21 is released, and a bigger distro releases a kernel with 
CONFIG_PARAVIRT+CONFIG_VMI enabled: backwards compatibility in future 
versions becomes mainly /that/ distro's problem (and upstream's 
problem), _NOT_ VMware's problem.

That's why i mentioned CONFIG_COMPAT_VDSO as an example. One major 
distro (SuSE 9.0) came out with that particular glibc version that had a 
bug that depended on a particular and totally unintentional ABI detail 
in the vDSO. As a result we had to do several iterations of 
CONFIG_COMPAT_VDSO to keep backwards compatibility. And glibc is perhaps 
_the_ most kernel-friendly external software project in existence. 
Still, the ABI dependency was there, and we cannot break users who run 
old userspace. The same rule holds here: we cannot break users who run 
an old hypervisor.

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 19:24   ` Ingo Molnar
  2007-03-09 19:51     ` Linus Torvalds
@ 2007-03-09 20:50     ` Jan Engelhardt
  2007-03-09 22:50       ` Lee Revell
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Engelhardt @ 2007-03-09 20:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Chris Wright, Alan Cox

Hello,


On Mar 9 2007 20:24, Ingo Molnar wrote:
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>>On Fri, 9 Mar 2007, Ingo Molnar wrote:
>>> 
>>> yes - but we already support the raw hardware ABI, in the native 
>>> kernel.
>> 
>> Why do you continue to call paravirt an ABI?
>> We got over that. It's not. It's an API.
>> VMI is an ABI.
>
>Unfortunately i still dont see where i'm wrong, and i'm really trying to 
>understand your argument. Is your argument that as long as an ABI (VMI) 
>is never directly used but only used via wrapper functions 
>(paravirt_ops), it has no effects whatsoever on the flexibility of the 
>rest of the software and ceases to have any negative ABI effects? [...]

As far as I understand and recognize it,


  [have monospaced font]

  paravirt struct      struct file_operations / {ALSA | OSS}
   ^                    ^
   |      <- API ->     |
   v                    v
  VMI                  /dev/dsp
   ^                    ^
   |      <- ABI ->     |
   v                    v
  product              userspace app


I think the sound example to the right really shows it. /dev/dsp has a
consistent ABI on a ton of systems. The API below it varies. Linux got
file_operations and ALSA. Solaris/BSD may have its
vnode-and-so-on-functions and some sort of OSS.

Hope this helps (and more, I hope it's accurate - please correct me if I
am wrong.)



Jan
-- 


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 19:51     ` Linus Torvalds
  2007-03-09 20:12       ` Ingo Molnar
@ 2007-03-09 21:04       ` Ingo Molnar
  2007-03-09 21:27         ` Chris Wright
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So please
> 
>  - point out things that are badly done. [...]

the thing badly done is fundamental and it trumps any other small 
technological detail complaint i have, because it affects the 
development and maintenance model: to promise backwards compatibility 
to 4-5 different hypervisors, using separate ABIs for each. /One/ such 
ABI would be complex enough to maintain IMO.

( if there is no backwards compatibility promise then i have zero
  complaints: then paravirt_ops + the hypercall just becomes another API
  internal to Linux that we can improve at will. But that is not
  realistic: if we provide CONFIG_VMI today, people will expect to have
  CONFIG_VMI in the future too. )

as i said in the very, very first email about this topic: /one/ new ABI 
towards hypervisors should be introduced step by step, and via consensus 
across hypervisors. We should treat the hypercall ABI very similar to 
the system call ABI: the system call ABI is largely based on a consensus 
between applications.

[ i think apic_write() granularity is bad too - but that is a small
  technical issue, dwarfed by the ABI issues that impact the development
  model IMO. ]

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 20:12       ` Ingo Molnar
@ 2007-03-09 21:05         ` Jeremy Fitzhardinge
  2007-03-09 21:06         ` Linus Torvalds
  1 sibling, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zachary Amsden, Thomas Gleixner, john stultz,
	akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright, Alan Cox

Ingo Molnar wrote:
> Once this thing is released upstream, it creates a new compatibility 
> rule:
>
>   _new kernel must not break on an older hypervisor_
>   

Yes, that's important.  (It's perhaps more important that a new
hypervisor not break an old kernel, but that's 100% the hypervisor's
responsibility.)

But yes, the various pv_ops implementations will need to work to make
sure that they can maintain support for old hypervisors as the pv_ops
interface evolves.  That's the basic nature of such API-ABI couplings. 
The only way to deal with it is to make sure the API is high-enough
level to leave abstraction wiggle-room to adapt the pv_ops interface to
the hypervisor interface.

But I'm pretty sure that's your point, and one we've been aware of from
the start.

> due to a new paravirt_ops design. Ever. It's really that simple. (I 
> think i never said this explicitly because this requirement of backwards 
> compatibility was so obvious to me.)
>   

You're right.  Very important.

> And it doesnt matter whether we think that it was VMWare who messed up. 
> Users/customers _will_ blame us: "v2.6.25 regresses, it wont run under 
> ESX v1.12 anymore". Distro will yield and will undo whatever change 
> breaks backwards compatibility with older hypervisors. (most likely it 
> will be undone upstream already) Backwards compatibility acts as a very 
> heavy barrier against certain types of paravirt_ops design changes.
>   

Well, yes.  If there's a huge disconnect between the pv_ops interface
and the hypervisor interface which can't be bridged, then that might be
a problem.  But one would hope that would get sorted out way before
people start complaining about regressions.  If the regression is just a
bug, well, it's just a bug.

pv_ops has to walk a delicate balance.  On the one hand, its entrypoints
can't be 1:1 with any given ABI, because that would make it very
brittle.  On the other hand, it can't be so high-level that any back-end
implementation has to have masses of code mapping the high-level
interface to the ABI.

This is not something you can do with capital-A Architecture up front. 
It's something that you need to work towards incrementally as real design
problems come up.  You know, like the rest of the kernel.

The current pv_ops development has been helped by the fact that Xen and
VMI (and lguest and kvm) have all quite distinct virtualization
approaches, because it prevents the interface from being overly fawning
to one particular design.  But that doesn't mean it should be a grab-bag
union of all the interfaces; we need to find the right level of
abstraction which can deal with everyone's requirements (both the
hypervisors and the kernel overall).  And we'll need to keep rebalancing
those tradeoffs as the requirements change (as new hypervisors get
added, existing ones change, obsolete ones get dropped and as the kernel
itself changes).

> Once v2.6.21 is released, and a bigger distro releases a kernel with 
> CONFIG_PARAVIRT+CONFIG_VMI enabled: backwards compatibility in future 
> versions becomes mainly /that/ distro's problem (and upstream's 
> problem), _NOT_ VMware's problem.
>   

pv_ops went into mainline in 2.6.20, more or less as a placeholder. 
2.6.21 will change it a bit, and the vmi binding will work a bit better,
I guess.  I'm hoping the patches I've been posting will make it into
2.6.22, and they'll make further changes to the interface.  This is a
new, highly dynamic interface, and nobody is claiming or (I hope)
expecting anything else.

If VMWare is indeed actually releasing a product in a couple of weeks
based on 2.6.21, well that's going to be their support problem.  I
really hope no distro is going to decide to ship a release based on a
current or near future kernel with CONFIG_PARAVIRT enabled, at least not
without realizing they're shipping something very new and raw.  pv_ops
is going to change a lot in the next few months.

> That's why i mentioned CONFIG_COMPAT_VDSO as an example. One major 
> distro (SuSE 9.0) came out with that particular glibc version that had a 
> bug that depended on a particular and totally unintentional ABI detail 
> in the vDSO. As a result we had to do several iterations of 
> CONFIG_COMPAT_VDSO to keep backwards compatibility. And glibc is perhaps 
> _the_ most kernel-friendly external software project in existence. 
> Still, the ABI dependency was there, and we cannot break users who run 
> old userspace. The same rule holds here: we cannot break users who run 
> an old hypervisor.
>   

Well, the glibc folks were pretty annoyed about that too, since it
wasn't even a release glibc with the problem; it was just some random
CVS snapshot that got shipped.  The distros screwed up, and we have to
deal with it.  It would be nice to turn off COMPAT_VDSO, but since we
can't, we're going to make it work with CONFIG_PARAVIRT.

    J

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 20:12       ` Ingo Molnar
  2007-03-09 21:05         ` Jeremy Fitzhardinge
@ 2007-03-09 21:06         ` Linus Torvalds
  2007-03-09 21:36           ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 21:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox



On Fri, 9 Mar 2007, Ingo Molnar wrote:
>
>   _new kernel must not break on an older hypervisor_

Total red herring. AGAIN.

The paravirt_ops isn't a 1:1 hypervisor ABI.

If we change paravirt_ops to be higher-level ops (as we should), yes, the 
paravirt->VMI layer needs to be extended to have the "higher->lower" 
translation. But at no point did we break the hypervisor.
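
A minimal sketch of that "higher->lower" translation, under
assumptions: struct pv_irq_ops, send_ipi, hv_apic_write and the
vmi_like_* names are hypothetical and are not the real paravirt_ops or
VMI entry points. The register offsets are only loosely modelled on the
local APIC ICR registers. The point is that the kernel-facing op can be
raised to "send an IPI" while the call that crosses into the hypervisor
stays a plain register write.

```c
#include <stdint.h>

/* Stand-ins for the two ICR register offsets (illustrative). */
#define HV_REG_ICR	0x300u
#define HV_REG_ICR2	0x310u

/* The fixed, low-level hypervisor-facing call. Here it only records
 * what it was asked to do, so the translation can be inspected. */
struct hv_write { uint32_t reg; uint32_t val; };
static struct hv_write hv_log[8];
static int hv_log_len;

static void hv_apic_write(uint32_t reg, uint32_t val)
{
	if (hv_log_len < 8)
		hv_log[hv_log_len++] = (struct hv_write){ reg, val };
}

/* The kernel-internal API, now expressed as a higher-level op. */
struct pv_irq_ops {
	void (*send_ipi)(unsigned int cpu, unsigned int vector);
};

/* Backend glue: one high-level op becomes two unchanged low-level
 * writes, so nothing on the hypervisor side has to change. */
static void vmi_like_send_ipi(unsigned int cpu, unsigned int vector)
{
	hv_apic_write(HV_REG_ICR2, (uint32_t)cpu << 24); /* destination */
	hv_apic_write(HV_REG_ICR, vector);               /* fire */
}

static const struct pv_irq_ops vmi_like_ops = {
	.send_ipi = vmi_like_send_ipi,
};
```

A backend for a hypervisor with a native "send IPI" hypercall would
fill in the same op with a single call instead.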

Anyway, please don't even bother cc'ing or replying to me any more.

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:04       ` Ingo Molnar
@ 2007-03-09 21:27         ` Chris Wright
  2007-03-09 21:47           ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 21:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Chris Wright, Alan Cox

* Ingo Molnar (mingo@elte.hu) wrote:
> ( if there is no backwards compatibility promise then i have zero
>   complaints: then paravirt_ops + the hypercall just becomes another API
>   internal to Linux that we can improve at will. But that is not
>   realistic: if we provide CONFIG_VMI today, people will expect to have
>   CONFIG_VMI in the future too. )

This was the whole reason we didn't adopt VMI directly.  Instead,
preferring a kernel-internal API, pv_ops, that can adapt naturally
as the kernel changes, and it is the pv_ops client code's (or backend
as it is also referred to) responsibility to do whatever is necessary
to map back to the hypervisor's ABI.  The goal was explicitly to keep
things internal fluid as usual.  As I said before, no matter how you
slice it there's glue code somewhere to deal with compatibilities.
And it's always been the virtualization platform's responsibility to
deal with the changes.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:06         ` Linus Torvalds
@ 2007-03-09 21:36           ` Ingo Molnar
  2007-03-09 21:40             ` Jeremy Fitzhardinge
                               ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> If we change paravirt_ops to be higher-level ops (as we should), yes, 
> the paravirt->VMI layer needs to be extended to have the 
> "higher->lower" translation. But at no point did we break the 
> hypervisor.

hm. So your point is that VMI is in essence a Turing machine (a 
near-complete one)? No matter what redesign we do on the Linux side, the 
VMI paravirt_ops will always be able to adapt to it? If that is the case 
then my ABI worries would indeed be wrong and i'd owe Zach a big fat 
apology [and more] for my flames ;-)

but what if the transformation is not just a top-down transformation, 
but something more conceptual, introducing something that VMI does not 
cover today, like: 'elimination of the APIC concept from guest mode 
(replacing it with a virtual interrupt controller)', or 'elimination of 
the concept of IRQ vectors from guest mode (virtual irq controller)'. Or 
'elimination of pagetables stored in the guest (all paravirt hooks would 
be vma-based)'?

but ... maybe because VMI is so lowlevel and covers /all/ of x86 today, 
it will always be able to emulate whatever different concept we can come 
up with? Do we really know this absolutely sure?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:36           ` Ingo Molnar
@ 2007-03-09 21:40             ` Jeremy Fitzhardinge
  2007-03-09 22:27             ` Linus Torvalds
  2007-03-09 23:10             ` Ingo Molnar
  2 siblings, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Chris Wright, Alan Cox

Ingo Molnar wrote:
> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today, 
> it will always be able to emulate whatever different concept we can come 
> up with? Do we really know this absolutely sure?
>   

Why don't you make a specific proposal, and we'll work out the details.

    J

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:27         ` Chris Wright
@ 2007-03-09 21:47           ` Ingo Molnar
  2007-03-09 21:59             ` Jeremy Fitzhardinge
  2007-03-09 22:10             ` Chris Wright
  0 siblings, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:47 UTC (permalink / raw)
  To: Chris Wright
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Alan Cox


* Chris Wright <chrisw@sous-sol.org> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > ( if there is no backwards compatibility promise then i have zero
> >   complaints: then paravirt_ops + the hypercall just becomes another API
> >   internal to Linux that we can improve at will. But that is not
> >   realistic: if we provide CONFIG_VMI today, people will expect to have
> >   CONFIG_VMI in the future too. )
> 
> This was the whole reason we didn't adopt VMI directly.  Instead, 
> preferring a kernel-internal API, pv_ops, that can adapt naturally as 
> the kernel changes, and it is the pv_ops client code's (or backend as 
> it is also referred to) responsibility to do whatever is necessary to 
> map back to the hypervisor's ABI.  The goal was explicitly to keep 
> things internal fluid as usual.  As I said before, no matter how you 
> slice it there's glue code somewhere to deal with compatibilities. And 
> it's always been the virtualization platform's responsibility to deal 
> with the changes.

For example, for the sake of argument, if the VMI ABI consisted only of 
a single call:

  #define VMI_CALL_NOP           1

then obviously it would be very hard for VMI to adapt to changes in the 
kernel - no matter how many smarts you put into paravirt_ops :-)

agreed? That is the center of my argument. Does the VMI ABI limit the 
Linux kernel or not?

As we increase the complexity of a hypercall ABI, more and more things 
can be implemented via it. So _obviously_ there is a 'minimum level of 
capability' for every hypercall ABI that is /required/ to keep the Linux 
kernel 100% flexible. If in some tricky corner the ABI has some stupid 
limit or assumption, it might stifle future changes in Linux.

i am worried whether /any/ future change to the upstream kernel's design 
can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
mean truly any. And whether that can be done is not a function of the 
flexibility of paravirt_ops, it's a function of the flexibility of the 
VMI ABI.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:47           ` Ingo Molnar
@ 2007-03-09 21:59             ` Jeremy Fitzhardinge
  2007-03-09 22:12               ` Ingo Molnar
  2007-03-09 22:10             ` Chris Wright
  1 sibling, 1 reply; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
	LKML, Rusty Russell, Andi Kleen, Alan Cox

Ingo Molnar wrote:
> i am worried whether /any/ future change to the upstream kernel's design 
> can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
> mean truly any. And whether that can be done is not a function of the 
> flexibility of paravirt_ops, it's a function of the flexibility of the 
> VMI ABI.

Come on. You're talking as if these changes can be made in a vacuum,
with no regard to any external constraints.  In reality, any such change
is limited by concerns like:

   1. does it make sense in terms of kernel design?
   2. can all the (real) architectures support it?
   3. does it make sense on real hardware?
   4. does it require massive kernel-wide changes, or does it have
      limited scope?
   5. etc, etc

Obviously one of the things to be considered among the many other issues
is "what effect does this have on the paravirt_ops interface?".

Now it may be that you've got a change that's absolutely great for
everyone, and the only blocker is that the FoobieVisor can't deal with
it.  OK, great, then you'd have a point.

But are you really saying that one of your handwavy proposals is just
fine on a pre-apic i486/mmu-less m68k/whatever, but can't be made to
work on a hypervisor?

Come up with a specific proposal, and we'll work out the details.

    J

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:47           ` Ingo Molnar
  2007-03-09 21:59             ` Jeremy Fitzhardinge
@ 2007-03-09 22:10             ` Chris Wright
  2007-03-09 22:24               ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 22:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
	Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Alan Cox

* Ingo Molnar (mingo@elte.hu) wrote:
> i am worried whether /any/ future change to the upstream kernel's design 
> can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
> mean truly any. And whether that can be done is not a function of the 
> flexibility of paravirt_ops, it's a function of the flexibility of the 
> VMI ABI.

i'm not really one to argue on behalf of VMI, but i don't think it's as
dire as you make it out.  the VMI is client code of pv_ops, and as the kernel
changes that client code will simply have to adapt.  of course there are
theoretical limitations, but let's keep it grounded to practical reality.
the whole premise is evolution.  so throw out specific issues, and let's
adapt rather than fall deep into theoretical rhetoric.

thanks,
-chris

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:59             ` Jeremy Fitzhardinge
@ 2007-03-09 22:12               ` Ingo Molnar
  2007-03-09 22:30                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
	LKML, Rusty Russell, Andi Kleen, Alan Cox


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Now it may be that you've got a change that's absolutely great for 
> everyone, and the only blocker is that the FoobieVisor can't deal with 
> it.  OK, great, then you'd have a point.

yep. That's precisely my worry. And it doesnt have to be a 'great' thing 
- just any random small change in the kernel that makes sense: what is 
the likelihood that it cannot be implemented, no matter what amount of 
insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor, 
because FoobieVisor messed up its ABI.

that likelihood is a pure function of how FoobieVisor's hypercall ABI is 
shaped. Wow! So can you guess where my fixation about not having too 
many ABIs could possibly originate from? ;-)

Until today everyone on the hypervisor side of the argument pretended 
that paravirt_ops solves all problems and acted stupid when i said an 
ABI is an ABI is an ABI, and that "backwards compatibility" does have 
some technological consequences. _Now_ at least i've got this minimal 
admission that FoobieVisor _might_ break. Quite a breakthrough =B-)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:10             ` Chris Wright
@ 2007-03-09 22:24               ` Ingo Molnar
  2007-03-09 22:36                 ` Jeremy Fitzhardinge
                                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:24 UTC (permalink / raw)
  To: Chris Wright
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Alan Cox


* Chris Wright <chrisw@sous-sol.org> wrote:

> * Ingo Molnar (mingo@elte.hu) wrote:
> > i am worried whether /any/ future change to the upstream kernel's design 
> > can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i 
> > mean truly any. And whether that can be done is not a function of the 
> > flexibility of paravirt_ops, it's a function of the flexibility of the 
> > VMI ABI.
> 
> i'm not really one to argue on behalf of VMI, but i don't think it's 
> as dire as you make it out. [...]

hey, that's what i thought when i helped do the vDSO, until i got 
slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more 
careful about ABIs since then =B-)

> [...] the VMI is client code of pv_ops, and as the kernel changes that 
> client code will simply have to adapt.  of course there are 
> theoretical limitations, but let's keep it grounded to practical 
> reality. the whole premise is evolution.  so throw out specific 
> issues, and let's adapt rather than fall deep into theoretical 
> rhetoric.

ok, sure, how about the one i mentioned: long-term i'd like to have a 
paravirt model where the guest does not store /any/ page tables - all 
paging is managed by the hypervisor. The guest has a vma tree, but 
otherwise it does not process pagefaults, has no concept of a pte (if in 
paravirt mode), has no concept of kernel page tables either: there are 
hypercalls to allocate/free guest-kernel memory, etc. This needs some 
(serious) MM surgery but it's doable and it's interesting as well. How 
would you map this to the VMI backend?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:36           ` Ingo Molnar
  2007-03-09 21:40             ` Jeremy Fitzhardinge
@ 2007-03-09 22:27             ` Linus Torvalds
  2007-03-09 22:50               ` Ingo Molnar
  2007-03-09 23:07               ` Zachary Amsden
  2007-03-09 23:10             ` Ingo Molnar
  2 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 22:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox



On Fri, 9 Mar 2007, Ingo Molnar wrote:
> 
> hm. So your point is that VMI is in essence a Turing machine (a 
> near-complete one)? No matter what redesign we do on the Linux side, the 
> VMI paravirt_ops will always be able to adapt to it?

No, I don't think it's turing-complete ;)

But since it tries to basically come fairly close to emulating the 
hardware we already use, any higher-level abstraction we do (which 
obviously has to work on real hardware too!) is likely to be translatable 
into the "pseudo-hardware" thing that is the VMI interfaces.

> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today, 
> it will always be able to emulate whatever different concept we can come 
> up with? Do we really know this absolutely sure?

"For sure"? Absolutely not. But since any new interfaces we come up with 
for doing timers etc had better work perfectly fine on an old hardware 
platform too, we can't exactly require any interfaces that do things that 
a bog-standard old dual-PPro didn't do 10 years ago, can we?

So assuming that the VMI interface is roughly as powerful (by virtue of 
basically emulating it) as the old single-ioapic/single-lapic systems we 
used to use, I don't think it should ever be a real problem. Hmm?

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:12               ` Ingo Molnar
@ 2007-03-09 22:30                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 22:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
	LKML, Rusty Russell, Andi Kleen, Alan Cox

Ingo Molnar wrote:
> yep. That's precisely my worry. And it doesnt have to be a 'great' thing 
> - just any random small change in the kernel that makes sense: what is 
> the likelihood that it cannot be implemented, no matter what amount of 
> insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor, 
> because FoobieVisor messed up its ABI.
>
> that likelihood is a pure function of how FoobieVisor's hypercall ABI is 
> shaped. Wow! So can you guess where my fixation about not having too 
> many ABIs could possibly originate from? ;-)
>   

OK, so it's a problem that's happened before.  "It's a great idea, it's
so nice, but it breaks X."  Your options are:

   1. Well, nobody is really using X.  We can stop supporting it.
   2. X makes up 50% of the users, we'll just have to do without your
      great idea.
   3. Maybe we can get X updated so this idea works.

If X is a piece of hardware, then you're probably stuck with options 1
and 2.  If it's something like firmware or a hypervisor, you might have a
chance with option 3.

The hypervisor interface is not at all special in this regard; you may
as well be arguing "We can't allow a port of Linux to the FoobieTron2000
CPU, because it might constrain some future development"; that's true,
it might.  But I don't think I've ever seen anyone make that argument
for not accepting a new architecture port.

I don't really understand what your overall argument is though.  Sure, I
guess it's that if there's one ABI for all hypervisors, then you've only
got one hypervisor-related constraint to consider when evaluating a new
kernel change.  But that ABI is going to be as constraining as its
most constraining hypervisor, so you're not really in a better position
than if you have N hypervisor ABIs.  In fact you're worse off, because
you have no flexibility to drop/adapt/whatever the real blocker.
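
A toy model of that last point, under loud assumptions: the capability
names and the bitmask view of a hypervisor are invented for
illustration and describe no real hypervisor. If a single shared ABI
can only expose what every hypervisor supports, its feature set is the
intersection of all of them; with separate ABIs, the one real blocker
can be left out.

```c
#include <stddef.h>

/* Hypothetical capabilities a kernel change might depend on. */
enum {
	CAP_BATCHED_MMU = 1u << 0,
	CAP_VIRT_TIMER  = 1u << 1,
	CAP_VIRT_IRQ    = 1u << 2,
};

/* Capabilities shared by all n hypervisors, leaving out the one at
 * index 'skip'. Pass skip == n to leave none out. */
static unsigned int common_caps(const unsigned int *caps, size_t n,
				size_t skip)
{
	unsigned int all = ~0u;

	for (size_t i = 0; i < n; i++)
		if (i != skip)
			all &= caps[i];
	return all;
}
```

With three hypervisors offering {MMU, TIMER, IRQ}, {MMU, TIMER} and
{TIMER}, the shared set is just TIMER; skipping the third one widens it
to {MMU, TIMER}.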

> _Now_ at least i've got this minimal 
> admission that FoobieVisor _might_ break. Quite a breakthrough =B-)

If you went to all that typing to get that much of a concession, then
you have way too much time ;)

    J


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:24               ` Ingo Molnar
@ 2007-03-09 22:36                 ` Jeremy Fitzhardinge
  2007-03-09 23:38                   ` Ingo Molnar
  2007-03-09 22:46                 ` Chris Wright
  2007-03-09 23:13                 ` Rik van Riel
  2 siblings, 1 reply; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 22:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Linus Torvalds, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Alan Cox

Ingo Molnar wrote:
> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well. How 
> would you map this to the VMI backend?

You wouldn't.  Why would you?  It might be a useful interface - and it's
the perfect kind of high-level interface for pv_ops.  It might be worth
adapting a hypervisor to suit it, but you still need to support the
i386's pagetables.  So, you present the pv_ops interface with your
vma-based mappings, and it runs it through the vma->pagetable
translation layer to feed into either the i386 pagetables directly, or
to a hypervisor's page-based interface.

The important part is that there's more to the story than just pv_ops. 
If you wanted to make such a change, then you'd need to refactor the
i386 support code to add a vma->paging helper layer.  That layer would
be available for any pv_ops interface to use if it wishes.
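
Roughly what such a helper layer could look like, as a sketch under
assumptions: struct pv_mm_ops, map_range, range_to_ptes and the demo_*
names are hypothetical, nothing like them exists in the tree today, and
the start addresses are assumed to be page-aligned. The high-level op
takes a whole range; a backend that only understands pages calls the
shared helper to break the range up.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Page-granular primitive: native pagetable code or a page-based
 * hypervisor interface would sit behind this. */
typedef void (*set_pte_fn)(uintptr_t vaddr, uintptr_t paddr,
			   unsigned int prot);

/* Hypothetical high-level, range-based op. */
struct pv_mm_ops {
	void (*map_range)(uintptr_t vstart, uintptr_t pstart, size_t len,
			  unsigned int prot);
};

/* Shared helper: translates one range request into per-page calls.
 * Returns the number of pages touched. */
static size_t range_to_ptes(set_pte_fn set_pte, uintptr_t vstart,
			    uintptr_t pstart, size_t len, unsigned int prot)
{
	size_t pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;

	for (size_t i = 0; i < pages; i++)
		set_pte(vstart + i * PAGE_SIZE, pstart + i * PAGE_SIZE, prot);
	return pages;
}

/* Demo backend: counts the page-level calls it receives. */
static size_t demo_pte_count;
static uintptr_t demo_last_vaddr;

static void demo_set_pte(uintptr_t vaddr, uintptr_t paddr, unsigned int prot)
{
	(void)paddr;
	(void)prot;
	demo_pte_count++;
	demo_last_vaddr = vaddr;
}

static void demo_map_range(uintptr_t vstart, uintptr_t pstart, size_t len,
			   unsigned int prot)
{
	range_to_ptes(demo_set_pte, vstart, pstart, len, prot);
}

static const struct pv_mm_ops demo_ops = { .map_range = demo_map_range };
```

A hypervisor that really does manage all paging itself would implement
map_range directly and never call the helper.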

(Remember, in the pv_ops model, bare hardware is a "hypervisor" too.)

Next problem?

    J

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:24               ` Ingo Molnar
  2007-03-09 22:36                 ` Jeremy Fitzhardinge
@ 2007-03-09 22:46                 ` Chris Wright
  2007-03-09 23:02                   ` Ingo Molnar
  2007-03-09 23:13                 ` Rik van Riel
  2 siblings, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 22:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
	Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Alan Cox

* Ingo Molnar (mingo@elte.hu) wrote:
> * Chris Wright <chrisw@sous-sol.org> wrote:
> > i'm not really one to argue on behalf of VMI, but i don't think it's 
> > as dire as you make it out. [...]
> 
> hey, that's what i thought when i helped do the vDSO, until i got 
> slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more 
> careful about ABIs since then =B-)

heh, once bitten twice shy, or however that goes ;-)

> > [...] the VMI is client code of pv_ops, and as the kernel changes that 
> > client code will simply have to adapt.  of course there are 
> > theoretical limitations, but let's keep it grounded to practical 
> > reality. the whole premise is evolution.  so throw out specific 
> > issues, and let's adapt rather than fall deep into theoretical 
> > rhetoric.
> 
> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well. How 
> would you map this to the VMI backend?

Sounds a lot like a userspace process.  My immediate thought is, why
not use containers, a more natural fit.  But if you have _any_ hope
of booting this kernel on native hardware when it's not running under
a hypervisor then I'd expect the same pv_ops interfaces that allow it
to run on native would allow VMI to build and handle the shadow (since
you'd have taken it out of the kernel).  Heh, so in order to run this on
native we had to add fork/mmap pv ops?  I agree it might be interesting,
but it's still not clear that it's useful w/out some code to back it up,
and see the value.

thanks,
-chris

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:27             ` Linus Torvalds
@ 2007-03-09 22:50               ` Ingo Molnar
  2007-03-09 23:07               ` Zachary Amsden
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > but ... maybe because VMI is so lowlevel and covers /all/ of x86 
> > today, it will always be able to emulate whatever different concept 
> > we can come up with? Do we really know this absolutely sure?
> 
> "For sure"? Absolutely not. But since any new interfaces we come up 
> with for doing timers etc had better work perfectly fine on an old 
> hardware platform too, we can't exactly require any interfaces that do 
> things that a bog-standard old dual-PPro didn't do 10 years ago, can 
> we?

i'm not really thinking in terms of extending the native kernel - 
whatever we do on the native kernel we'll indeed have to be able to do 
on 'real' hardware too - including really old boxes.

i'm thinking more in terms of having some new and more intelligent 
virtualization-only abstractions.

For example i think people are way too obsessed with building virtual 
'machines' that look like PCs, while i've got some truly pie-in-the-sky 
ideas floating to simplify virtual machines, like the one i just 
outlined to Chris:

|| long-term i'd like to have a paravirt model where the guest does not 
|| store /any/ page tables - all paging is managed by the hypervisor. 
|| The guest has a vma tree, but otherwise it does not process 
|| pagefaults, has no concept of a pte (if in paravirt mode), has no 
|| concept of kernel page tables either: there are hypercalls to 
|| allocate/free guest-kernel memory, etc. This needs some (serious) MM 
|| surgery but it's doable and it's interesting as well. How would you 
|| map this to the VMI backend?

Clearly the ugliest and most complex part of hypervisors is MMU support. 
So the above model would avoid all the shadow page table complexities 
and other MMU nasties, and keep that stuff in the hypervisor. [ in 
exchange for a whole set of other complexities and problems ;-) ] This 
would be even more highlevel than UML (UML emulates pagetables in 
user-space), in that regard.
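
To make the model above a bit more concrete, here is a minimal user-space
sketch in C. Everything in it is hypothetical: hv_map_range(),
hv_unmap_range() and hv_resolve_access() stand in for hypercalls that no
existing hypervisor or VMI provides, and the hypervisor side is simulated by
a static array. The only point it illustrates is that the guest records
address ranges (its vmas) and never touches a pte; overlap checks, pageout
and dirty/accessed tracking are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define GUEST_PROT_READ  1u
#define GUEST_PROT_WRITE 2u
#define HV_MAX_REGIONS   16

/* Guest-side bookkeeping: an address range and its protection, no ptes. */
struct guest_vma {
	unsigned long start, end;	/* covers [start, end) */
	unsigned int prot;
};

/* Hypervisor-side state. In the model this lives outside the guest;
 * here it is only simulated by a static array. */
struct hv_region {
	unsigned long start, end;
	unsigned int prot;
	int used;
};
static struct hv_region hv_regions[HV_MAX_REGIONS];

/* Hypothetical hypercall: hand a range to the hypervisor. 0 on success. */
static int hv_map_range(unsigned long start, unsigned long end,
			unsigned int prot)
{
	size_t i;

	if (start >= end)
		return -1;
	for (i = 0; i < HV_MAX_REGIONS; i++) {
		if (!hv_regions[i].used) {
			hv_regions[i].start = start;
			hv_regions[i].end = end;
			hv_regions[i].prot = prot;
			hv_regions[i].used = 1;
			return 0;
		}
	}
	return -1;	/* no free slot */
}

/* Hypothetical hypercall: drop a previously mapped range. 0 on success. */
static int hv_unmap_range(unsigned long start, unsigned long end)
{
	size_t i;

	for (i = 0; i < HV_MAX_REGIONS; i++) {
		if (hv_regions[i].used && hv_regions[i].start == start &&
		    hv_regions[i].end == end) {
			hv_regions[i].used = 0;
			return 0;
		}
	}
	return -1;
}

/* What the hypervisor would decide on a fault, without asking the guest:
 * returns 1 if the access is allowed, 0 otherwise. */
static int hv_resolve_access(unsigned long addr, unsigned int want)
{
	size_t i;

	for (i = 0; i < HV_MAX_REGIONS; i++) {
		if (hv_regions[i].used && addr >= hv_regions[i].start &&
		    addr < hv_regions[i].end)
			return (hv_regions[i].prot & want) == want;
	}
	return 0;
}

/* Guest mmap path: tell the hypervisor, then record the vma. */
static int guest_mmap(struct guest_vma *vma, unsigned long start,
		      unsigned long end, unsigned int prot)
{
	if (hv_map_range(start, end, prot) != 0)
		return -1;
	vma->start = start;
	vma->end = end;
	vma->prot = prot;
	return 0;
}

/* Guest munmap path: the vma is all the guest needs to name the range. */
static int guest_munmap(const struct guest_vma *vma)
{
	assert(vma->start < vma->end);
	return hv_unmap_range(vma->start, vma->end);
}
```

A guest would call guest_mmap() where it creates a vma today and
guest_munmap() where it tears one down; the fault path never enters the
guest at all in this sketch.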

even in such a drastic model we could share like 90% of the x86 binary 
code with the rest of the kernel (all the filesystem support, networking 
stack, etc. would still be reusable by the guest kernel), so a paravirt 
approach, where native and guest is the same image (as opposed to the 
UML model, where they are separate) still makes sense and is preferred 
by distributions.

Mapping this into VMI calls looks ... near impossible to me. VMI really 
assumes that there is no hypervisor state for kernel objects - while the 
above model _shares_ a very substantial kernel object with the 
hypervisor - and in fact 100% delegates its handling to the hypervisor 
altogether. Hm?

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 20:50     ` Jan Engelhardt
@ 2007-03-09 22:50       ` Lee Revell
  2007-03-14  8:41         ` alsa was " Pavel Machek
  0 siblings, 1 reply; 37+ messages in thread
From: Lee Revell @ 2007-03-09 22:50 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Ingo Molnar, Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Chris Wright, Alan Cox

On 3/9/07, Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:
> I think the sound example to the right really shows it. /dev/dsp has a
> consistent ABI on a ton of systems. The API below it, varies. Linux got
> file_operations and ALSA. Solaris/BSD may have its
> vnode-and-so-on-functions and some sort of OSS.

I think this is a poor example as applications lose a lot of
functionality (multiple stream mixing, software volume control, etc)
by going through the legacy /dev/dsp interface vs. using native ALSA.

Lee


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:46                 ` Chris Wright
@ 2007-03-09 23:02                   ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:02 UTC (permalink / raw)
  To: Chris Wright
  Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
	Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
	Andi Kleen, Alan Cox


* Chris Wright <chrisw@sous-sol.org> wrote:

> > ok, sure, how about the one i mentioned: long-term i'd like to have 
> > a paravirt model where the guest does not store /any/ page tables - 
> > all paging is managed by the hypervisor. The guest has a vma tree, 
> > but otherwise it does not process pagefaults, has no concept of a 
> > pte (if in paravirt mode), has no concept of kernel page tables 
> > either: there are hypercalls to allocate/free guest-kernel memory, 
> > etc. This needs some (serious) MM surgery but it's doable and it's 
> > interesting as well. How would you map this to the VMI backend?
> 
> Sounds a lot like a userspace process.  My immediate thought is, why 
> not use containers, a more natural fit.  [...]

easy: in my model the hypervisor is isolated from the guest kernel. In 
the container model it is not. [ This is a basic quality requirement for 
virtualization: a guest kernel does not get to read any hypervisor 
crypto keys to HD-DVD smut! ;-) ]

> [...] But if you have _any_ hope of booting this kernel on native 
> hardware when it's not running under a hypervisor then I'd expect the 
> same pv_ops interfaces that allow it to run on native would allow VMI 
> to build and handle the shadow (since you'd have taken it out of the 
> kernel).  Heh, so in order to run this on native we had to add 
> fork/mmap pv ops?  I agree it might be interesting, but it's still not 
> clear that it's useful w/out some code to back it up, and see the 
> value.

progress ;-) But yes, some /really/ high-level pv_ops would be needed.

[ in the end we might be able to simplify it down to a single hook! That 
  would be: run_native_image / run_guest_image ;-) ]

seriously, most of the body of x86 kernel code is in filesystems, VFS, 
networking, scheduler and the core kernel - much of which can be shared 
between native and guest. The MM is a significant and very central 
chunk, but it is less than 3% of the total codesize.

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:27             ` Linus Torvalds
  2007-03-09 22:50               ` Ingo Molnar
@ 2007-03-09 23:07               ` Zachary Amsden
  1 sibling, 0 replies; 37+ messages in thread
From: Zachary Amsden @ 2007-03-09 23:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jeremy Fitzhardinge, Thomas Gleixner, john stultz,
	akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright, Alan Cox

Linus Torvalds wrote:
>> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today, 
>> it will always be able to emulate whatever different concept we can come 
>> up with? Do we really know this absolutely sure?
>>     
>
> "For sure"? Absolutely not. But since any new interfaces we come up with 
> for doing timers etc had better work perfectly fine on an old hardware 
> platform too, we can't exactly require any interfaces that do things that 
> a bog-standard old dual-PPro didn't do 10 years ago, can we?
>
> So assuming that the VMI interface is roughly as powerful (by virtue of 
> basically emulating it) as the old single-ioapic/single-lapic systems we 
> used to use, I don't think it should ever be a real problem. Hmm?
>   

Sorry to keep you in a thread you want nothing more to do with, Linus, 
but to answer the question completely directly:

There are four design requirements which are inviolate for achieving a 
large measurable performance gain, which is the primary benefit of i386 
paravirt-ops for us.  These changes are required for /ALL/ high 
performance paravirtualized kernels not running in hardware 
virtualization, across a broad spectrum of hypervisors, and have either 
zero negative impact or opportunity for additional gain when hardware 
virtualization is enabled.

1) Ability to run the kernel at non-zero CPL
2) Ability to replace hardware interrupt masking functions with 
virtualizable equivalents
3) Notification when page tables are allocated and released
4) Notification in some form when page table entries are updated (or 
vma's are changed)

Everything beyond these is an incremental gain, some more valuable than 
others, but none as major in significance as the above four (apic_write, 
incidentally, _is_ one of the more substantial gains for us).

These don't seem to be a major burden on the kernel at all.

#1 is already the default case now for even native hardware.
#2 requires a lot of hooks because interrupt masking is a common 
function, and this is where the large numbers of hook points Ingo was 
demonstrating came from, but these icache effects and costs are on 
already expensive instructions.  In fact, it appears on some hardware, 
the nop padding around cli / sti / pushf / popf contributes to a 
mysterious performance gain, perhaps due to some pipelining anomaly.
#3 is not a common enough operation to be of performance concern.  It 
does, however, require pagetables, just as native hardware does.  We can 
implement those perfectly well in our backend anyway, just as the 
native backend would if some reckless madman removed all notions of page 
tables from a paravirt kernel.
#4 involves an extra call in page fault paths and from some points in 
the mm layer.

There are no ABI requirements tied to these, merely the presence of any 
usable API for them in paravirt-ops.
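
As an illustration of what requirement #2 amounts to at the API level, here
is a minimal user-space sketch in C. It is not the kernel's paravirt_ops
and not VMI: struct irq_hooks, the virt_* names and the idea of keeping the
interrupt flag in a variable shared with the hypervisor are all assumptions
made for this example. It only shows the shape of the thing: callers mask
and unmask through function pointers, and a virtualized backend can
implement those with plain stores instead of privileged cli / sti.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down hook table; not the kernel's paravirt_ops. */
struct irq_hooks {
	void (*irq_disable)(void);
	void (*irq_enable)(void);
	bool (*irqs_enabled)(void);
};

/* Assumed guest/hypervisor shared state, simulated by plain variables. */
static bool virt_irqs_on = true;
static int virt_pending;	/* events raised while masked */
static int virt_delivered;	/* events handed to the guest */

static void virt_irq_disable(void)
{
	virt_irqs_on = false;		/* a store, no privileged cli */
}

static void virt_irq_enable(void)
{
	virt_irqs_on = true;		/* a store, no privileged sti */
	virt_delivered += virt_pending;	/* flush what queued up meanwhile */
	virt_pending = 0;
}

static bool virt_irqs_enabled(void)
{
	return virt_irqs_on;
}

static const struct irq_hooks virt_irq_hooks = {
	.irq_disable  = virt_irq_disable,
	.irq_enable   = virt_irq_enable,
	.irqs_enabled = virt_irqs_enabled,
};

/* Simulated hypervisor side: raise one event towards the guest. */
static void virt_raise_event(void)
{
	if (virt_irqs_on)
		virt_delivered++;
	else
		virt_pending++;
}

/* Caller side: run fn() with interrupts masked through the hooks. */
static void run_masked(const struct irq_hooks *ops, void (*fn)(void))
{
	ops->irq_disable();
	assert(!ops->irqs_enabled());
	fn();
	ops->irq_enable();
}
```

A native backend would fill the same table with functions that execute the
real instructions, which is why the hook count is high: every place that
masks interrupts goes through one of these pointers.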

Linus is right - our virtual hardware is an exact replica of real 
hardware.  So no matter how you change paravirtualization in the kernel, 
anything that runs on real hardware will continue to run on VMware.  VMI 
is tied very closely to the hardware, on purpose, and follows the rules 
of native hardware extremely closely.  So you can pretty much twist and 
abuse paravirt-ops in a number of ways, and as long as it continues to 
run on real hardware with the above four requirements, it still runs 
even on VMI.  Violate the above four requirements, and it costs a lot of 
performance, but we still continue to run.

Zach


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 21:36           ` Ingo Molnar
  2007-03-09 21:40             ` Jeremy Fitzhardinge
  2007-03-09 22:27             ` Linus Torvalds
@ 2007-03-09 23:10             ` Ingo Molnar
  2007-03-09 23:38               ` Zachary Amsden
  2 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox


* Ingo Molnar <mingo@elte.hu> wrote:

> [...] If that is the case then my ABI worries would indeed be wrong 
> and i'd owe Zach a big fat apology [and more] for my flames ;-)

that apology i very much owe to Zach no matter what the outcome of the 
discussion. Zach, some of my mindless characterisations of the quality 
of VMI code (and of your intentions) were really out of bounds and were 
unfair, and i'd like to apologize for that :-/

	Ingo


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:24               ` Ingo Molnar
  2007-03-09 22:36                 ` Jeremy Fitzhardinge
  2007-03-09 22:46                 ` Chris Wright
@ 2007-03-09 23:13                 ` Rik van Riel
  2 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2007-03-09 23:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
	Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Alan Cox

Ingo Molnar wrote:

> ok, sure, how about the one i mentioned: long-term i'd like to have a 
> paravirt model where the guest does not store /any/ page tables - all 
> paging is managed by the hypervisor. The guest has a vma tree, but 
> otherwise it does not process pagefaults, has no concept of a pte (if in 
> paravirt mode), has no concept of kernel page tables either: there are 
> hypercalls to allocate/free guest-kernel memory, etc. This needs some 
> (serious) MM surgery but it's doable and it's interesting as well.  How
> would you map this to the VMI backend?

Ugh!

In a situation like that, how does the guest handle pageouts?

What about the dirty and accessed bits?

I'm guessing VMI would deal with this kind of abstraction in
exactly the same way we do the "native hardware" layer behind
paravirt_ops.  Ie. this example is a red herring.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 23:10             ` Ingo Molnar
@ 2007-03-09 23:38               ` Zachary Amsden
  0 siblings, 0 replies; 37+ messages in thread
From: Zachary Amsden @ 2007-03-09 23:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Chris Wright, Alan Cox

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>   
>> [...] If that is the case then my ABI worries would indeed be wrong 
>> and i'd owe Zach a big fat apology [and more] for my flames ;-)
>>     
>
> that apology i very much owe to Zach no matter what the outcome of the 
> discussion. Zach, some of my mindless characterisations of the quality 
> of VMI code (and of your intentions) were really out of bounds and were 
> unfair, and i'd like to apologize for that :-/
>   

That's fine.  Despite not having an open source hypervisor, we really 
don't have evil intentions, and we really don't want to create tension 
or impede the progress of anyone else.  We think this work has positive 
benefits for Linux, our hypervisor, and others as well.

Some of our code did very much need fixing, and the positive thing we 
can take away from such a heated discussion is that we probably got more 
eyes on our code, maybe trying to find the evil, and instead finding 
bugs or things we could have done better.

At least someone did need to play devil's advocate, as this is an 
important thing to get right for the future direction of Linux, and I 
respect you for doing so, and don't take offense.

Zach


* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:36                 ` Jeremy Fitzhardinge
@ 2007-03-09 23:38                   ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:38 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Chris Wright, Linus Torvalds, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Alan Cox


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> The important part is that there's more to the story than just pv_ops. 
> If you wanted to make such a change, then you'd need to refactor the 
> i386 support code to add a vma->paging helper layer.  That layer would 
> be available for any pv_ops interface to use if it wishes.

no such change is needed to native. [ other than the removal of tons of 
lowlevel hooks ;-) ] Think of this in terms of a completely separate MM 
layer for guest kernels, with all memory management details done on the 
hypervisor side, ok? I don't think you can emulate that in an equivalent 
way via VMI, the kernel object in this model is on the hypervisor side - 
while with VMI that does not look possible.

	Ingo


* alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-09 22:50       ` Lee Revell
@ 2007-03-14  8:41         ` Pavel Machek
  2007-03-14 15:59           ` Jaroslav Kysela
  0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-14  8:41 UTC (permalink / raw)
  To: Lee Revell
  Cc: Jan Engelhardt, Ingo Molnar, Linus Torvalds, Jeremy Fitzhardinge,
	Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
	Rusty Russell, Andi Kleen, Chris Wright, Alan Cox

Hi!

> >I think the sound example to the right really shows it. 
> >/dev/dsp has a
> >consistent ABI on a ton of systems. The API below it, 
> >varies. Linux got
> >file_operations and ALSA. Solaris/BSD may have its
> >vnode-and-so-on-functions and some sort of OSS.
> 
> I think this is a poor example as applications lose a 
> lot of
> functionality (multiple stream mixing, software volume 
> control, etc)
> by going through the legacy /dev/dsp interface vs. using 
> native ALSA.

OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
ABI you should not even use unless you are libalsa. ouch.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-14  8:41         ` alsa was " Pavel Machek
@ 2007-03-14 15:59           ` Jaroslav Kysela
  2007-03-15  9:03             ` Pavel Machek
  0 siblings, 1 reply; 37+ messages in thread
From: Jaroslav Kysela @ 2007-03-14 15:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
	Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox

On Wed, 14 Mar 2007, Pavel Machek wrote:

> Hi!
> 
> > >I think the sound example to the right really shows it. 
> > >/dev/dsp has a
> > >consistent ABI on a ton of systems. The API below it, 
> > >varies. Linux got
> > >file_operations and ALSA. Solaris/BSD may have its
> > >vnode-and-so-on-functions and some sort of OSS.
> > 
> > I think this is a poor example as applications lose a 
> > lot of
> > functionality (multiple stream mixing, software volume 
> > control, etc)
> > by going through the legacy /dev/dsp interface vs. using 
> > native ALSA.
> 
> OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
> ABI you should not even use unless you are libalsa. ouch.

Pavel, calm down. World is not perfect and there are always probes to 
optimize things. We use standard file operations - open/close/ioctl/mmap, 
too.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SUSE Labs


* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-14 15:59           ` Jaroslav Kysela
@ 2007-03-15  9:03             ` Pavel Machek
  2007-03-15  9:10               ` Pavel Machek
  0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-15  9:03 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
	Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox

Hi!

> > > >I think the sound example to the right really shows it. 
> > > >/dev/dsp has a
> > > >consistent ABI on a ton of systems. The API below it, 
> > > >varies. Linux got
> > > >file_operations and ALSA. Solaris/BSD may have its
> > > >vnode-and-so-on-functions and some sort of OSS.
> > > 
> > > I think this is a poor example as applications lose a 
> > > lot of
> > > functionality (multiple stream mixing, software volume 
> > > control, etc)
> > > by going through the legacy /dev/dsp interface vs. using 
> > > native ALSA.
> > 
> > OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
> > ABI you should not even use unless you are libalsa. ouch.
> 
> Pavel, calm down. 

I'm pretty calm, thank you.

> World is not perfect and there are always probes to 
> optimize things. We use standard file operations - open/close/ioctl/mmap, 
> too.

Unfortunately ALSA _does not_ provide hardware abstraction. Instead,
it relies on libalsa in userland to do the kernel work. That means
that testing sound is ugly.

Plus, it made some "interesting choices" with naming, basically
inventing parallel system to /dev/. (Why do I have to specify "card 0"
number in xmms, and WTF it means? Why can't alsa use device paths as
rest of sane world?)

So... in dsp, if I wanted to record sound, I did

	cat /dev/dsp > /tmp/foo; cat /tmp/foo > /dev/dsp

If that worked, I had a usable sound system, and if it broke, I knew it
was a kernel fault.
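
For reference, the same test can be written without a shell. The sketch
below is plain POSIX C and assumes nothing about the sound driver beyond
read() and write() working on the device node; copy_fd() and its byte limit
are names invented here (the limit exists because reading a capture device
never hits end-of-file on its own).

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Copy up to 'limit' bytes from in_fd to out_fd, the way cat would.
 * Returns the number of bytes copied, or -1 on a read/write error. */
static long copy_fd(int in_fd, int out_fd, long limit)
{
	char buf[4096];
	long total = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	while (total < limit) {
		size_t want = sizeof(buf);
		ssize_t n, off;

		if ((long)want > limit - total)
			want = (size_t)(limit - total);

		n = read(in_fd, buf, want);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;			/* end of input */

		/* write() may be short; loop until the chunk is out */
		for (off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
	return total;
}
```

Recording is then copy_fd() from an open("/dev/dsp", O_RDONLY) descriptor to
a file, and playback is the same call with the two descriptors swapped and
the device opened O_WRONLY.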

With alsa it is

	download & install alsalib
	download & install alsautils
	create 1007 nodes in /dev
	launch alsamixer, figure out what to do from inadequate descriptions
	launch arecord, try to guess some suitable options
	launch aplay, try to guess some options

...if it does not work, it may be a kernel problem or userspace
problem; I'm left with debugging both. That makes alsa pretty much
untestable.

(For _my_ usage, something like "alsatest" that is self-contained,
preferably statically linked, and just tests microphone and sound
output would be nice. But that does not fix the fact that alsa is
broken by design).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-15  9:03             ` Pavel Machek
@ 2007-03-15  9:10               ` Pavel Machek
  2007-03-15  9:23                 ` Zachary Amsden
  0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-15  9:10 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
	Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox

Hi!

> So... in dsp, if I wanted to record sound, I did
> 
> 	cat /dev/dsp > /tmp/foo; cat /tmp/foo > /dev/dsp
> 
> If that worked, I had usable sound system, and if it broke, I knew it
> is kernel fault. 
> 
> With alsa it is
> 
> 	download & install alsalib
> 	download & install alsautils
> 	create 1007 nodes in /dev
> 	launch alsamixer, figure out what to do from inadequate descriptions
> 	launch arecord, try to guess some suitable options
> 	launch aplay, try to guess some options
> 
> ...if it does not work, it may be a kernel problem or userspace
> problem; I'm left with debugging both. That makes alsa pretty much
> untestable.

(Just for the record, I should note that networking is misdesigned in
similar way; that's why we have eth0 instead of /dev/eth0, and need
special tools to rename network interface. But this mistake dates to
BSD days or something, so we got used to it... and at least you do not
need to keep libnetwork up to date to keep your  net devices working.

So networking provides _ugly_ hardware abstraction, but it provides
it).

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-15  9:10               ` Pavel Machek
@ 2007-03-15  9:23                 ` Zachary Amsden
  2007-03-15  9:32                   ` Pavel Machek
  0 siblings, 1 reply; 37+ messages in thread
From: Zachary Amsden @ 2007-03-15  9:23 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jaroslav Kysela, Lee Revell, Jan Engelhardt, Ingo Molnar,
	Linus Torvalds, Jeremy Fitzhardinge, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox

Pavel Machek wrote:

>> 	download & install alsalib
>> 	download & install alsautils
>> 	create 1007 nodes in /dev

I really hope you meant permission 1007 nodes, not 1007 nodes!  I'm 
checking right now, and if the latter is the case, I'm going to 
uninstall alsa, even if that means my computer will forever be silent, 
it will be silent in protest.

> (Just for the record, I should note that networking is misdesigned in
> similar way; that's why we have eth0 instead of /dev/eth0, and need
> special tools to rename network interface. But this mistake dates to
> BSD days or something, so we got used to it... and at least you do not
> need to keep libnetwork up to date to keep your  net devices working.
>
> So networking provides _ugly_ hardware abstraction, but it provides
> it).
>   

I might add it only got worse with wireless support: now you cannot 
configure a card without both ifconfig and iwconfig, so not only is the 
API diverging, the userspace tools to manage it are doing so as well.

Zach


* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
  2007-03-15  9:23                 ` Zachary Amsden
@ 2007-03-15  9:32                   ` Pavel Machek
  0 siblings, 0 replies; 37+ messages in thread
From: Pavel Machek @ 2007-03-15  9:32 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Jaroslav Kysela, Lee Revell, Jan Engelhardt, Ingo Molnar,
	Linus Torvalds, Jeremy Fitzhardinge, Thomas Gleixner,
	john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
	Alan Cox

Hi!

> >>	download & install alsalib
> >>	download & install alsautils
> >>	create 1007 nodes in /dev
> 
> I really hope you meant permission 1007 nodes, not 1007 nodes!  I'm 
> checking right now, and if the latter is the case, I'm going to 
> uninstall alsa, even if that means my computer will forever be silent, 
> it will be silent in protest.

I meant 1007 nodes... well, not literally. For sound to work on my
system, I seem to need at least

crw-r--r--  1 root root 116,  0 Mar 24  2005 controlC0
crw-r--r--  1 root root 116, 32 Mar 24  2005 controlC1
crw-rw-rw-  1 root root 116,  4 Feb  1 16:33 hwC0D0
crw-r--r--  1 root root 116, 36 Mar 24  2005 hwC1D0
crw-rw-rw-  1 root root 116,  8 Feb  1 16:33 midiC0D0
crw-rw-rw-  1 root root 116, 24 Feb  1 16:33 pcmC0D0c
crw-rw-rw-  1 root root 116, 16 Feb  1 16:33 pcmC0D0p
crw-rw-rw-  1 root root 116, 48 Feb  1 16:34 pcmC1D0p
crw-rw-rw-  1 root root 116,  1 Feb  1 16:33 seq
crw-rw-rw-  1 root root 116, 33 Feb  1 16:33 timer

...the fact that minor numbers are not allocated in Doc*/devices.txt does
not help, either.
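
The two numbers in each line of the listing are the major and minor device
numbers: every node has major 116, and the nodes of card 1 sit 32 minors
above those of card 0 (controlC0 at 0 vs controlC1 at 32, hwC0D0 at 4 vs
hwC1D0 at 36, pcmC0D0p at 16 vs pcmC1D0p at 48). A small sketch for reading
them back from a node, assuming Linux with glibc (major() and minor() come
from <sys/sysmacros.h>); char_dev_numbers() is a helper name made up for
this example:

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor(), makedev(): glibc/Linux */

/* Fill in the major/minor pair of a character device node.
 * Returns 0 on success, -1 if stat() fails or the path is not a
 * character device. */
static int char_dev_numbers(const char *path, unsigned int *maj,
			    unsigned int *min)
{
	struct stat st;

	assert(path && maj && min);
	if (stat(path, &st) != 0 || !S_ISCHR(st.st_mode))
		return -1;
	*maj = major(st.st_rdev);	/* selects the driver */
	*min = minor(st.st_rdev);	/* selects the device within it */
	return 0;
}
```

Called on one of the nodes above, say pcmC0D0p, it would report 116 and 16
if the node was created as shown in the listing.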
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Thread overview: 37+ messages
2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
2007-03-09 18:28 ` Andi Kleen
2007-03-09 18:30 ` Linus Torvalds
2007-03-09 19:24   ` Ingo Molnar
2007-03-09 19:51     ` Linus Torvalds
2007-03-09 20:12       ` Ingo Molnar
2007-03-09 21:05         ` Jeremy Fitzhardinge
2007-03-09 21:06         ` Linus Torvalds
2007-03-09 21:36           ` Ingo Molnar
2007-03-09 21:40             ` Jeremy Fitzhardinge
2007-03-09 22:27             ` Linus Torvalds
2007-03-09 22:50               ` Ingo Molnar
2007-03-09 23:07               ` Zachary Amsden
2007-03-09 23:10             ` Ingo Molnar
2007-03-09 23:38               ` Zachary Amsden
2007-03-09 21:04       ` Ingo Molnar
2007-03-09 21:27         ` Chris Wright
2007-03-09 21:47           ` Ingo Molnar
2007-03-09 21:59             ` Jeremy Fitzhardinge
2007-03-09 22:12               ` Ingo Molnar
2007-03-09 22:30                 ` Jeremy Fitzhardinge
2007-03-09 22:10             ` Chris Wright
2007-03-09 22:24               ` Ingo Molnar
2007-03-09 22:36                 ` Jeremy Fitzhardinge
2007-03-09 23:38                   ` Ingo Molnar
2007-03-09 22:46                 ` Chris Wright
2007-03-09 23:02                   ` Ingo Molnar
2007-03-09 23:13                 ` Rik van Riel
2007-03-09 20:50     ` Jan Engelhardt
2007-03-09 22:50       ` Lee Revell
2007-03-14  8:41         ` alsa was " Pavel Machek
2007-03-14 15:59           ` Jaroslav Kysela
2007-03-15  9:03             ` Pavel Machek
2007-03-15  9:10               ` Pavel Machek
2007-03-15  9:23                 ` Zachary Amsden
2007-03-15  9:32                   ` Pavel Machek
2007-03-09 19:00 ` Chris Wright
