* ABI coupling to hypervisors via CONFIG_PARAVIRT
@ 2007-03-09 18:02 Ingo Molnar
2007-03-09 18:28 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 18:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Sure, that's clean, From that perspective the apic is a bunch of
> > registers backed by a state machine or something.
>
> I think you could do much worse than just decide to pick the
> IO-APIC/lapic as your "virtual interrupt controller model". So I do
> *not* think that APICRead/APICWrite are in any way horrible interfaces
> for a virtual interrupt controller. In many ways, you then have a
> tested and known interface to work with.
yes - but we already support the raw hardware ABI, in the native kernel.
paravirt_ops is not 'just another PC sub-arch'. It is not 'just another
hardware driver'. It is not 'just another x86 CPU'. paravirt_ops is much
wider than that, it hooks everywhere and has effect on everything!
Lets take a look at the raw numbers. Here's a typical distro kernel
vmlinux, with and without CONFIG_PARAVIRT [with no paravirt backend
enabled]:
   text    data     bss     dec     hex filename
 139863   49010   57672  246545   3c311 x86-kernel-built-in.o.noparavirt
 148865   49310   57672  255847   3e767 x86-kernel-built-in.o.paravirt
   text    data     bss     dec     hex filename
5154975  586932  221184 5963091  5afd53 vmlinux.noparavirt
5189197  587504  221184 5997885  5b853d vmlinux.paravirt
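[Editor's note: the relative growth of the text column in the two size listings above can be checked with a few lines of integer arithmetic. This is only a sketch; the helper name growth_bp is made up for illustration and the result is expressed in hundredths of a percent.]

```c
#include <assert.h>

/* Hypothetical helper: growth of `after` over `before`, in hundredths of a
 * percent (basis points), rounded to the nearest integer. */
static long long growth_bp(long long before, long long after)
{
    assert(before > 0 && after >= before);
    return ((after - before) * 10000 + before / 2) / before;
}
```

Fed the text sizes above, growth_bp(139863, 148865) gives 644 (about 6.4%) for the arch object and growth_bp(5154975, 5189197) gives 66 (about 0.7%) for the whole vmlinux.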
why did code size increase by +6.4% in arch/i386/ (+0.7% in the
vmlinux)? It is purely because CONFIG_PARAVIRT adds more than _1400_
function call hooks to the x86 arch:
c05c8e60 D paravirt_ops
c0102602: ff 15 9c 8e 5c c0 call *0xc05c8e9c
c0102d37: ff 15 94 8e 5c c0 call *0xc05c8e94
c0102d45: ff 15 94 8e 5c c0 call *0xc05c8e94
c0102d53: ff 15 94 8e 5c c0 call *0xc05c8e94
c0102d61: ff 15 94 8e 5c c0 call *0xc05c8e94
c0102d6f: ff 15 94 8e 5c c0 call *0xc05c8e94
[...]
$ objdump -d vmlinux | grep c05c8e | wc -l
1463
_1463_ hooks, spread out all around the x86 arch.
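[Editor's note: a minimal user-space sketch of the mechanism behind the `call *0xc05c8e94` lines above. Calls that go through a table of function pointers stay indirect because the table may be overwritten by a backend. The struct and function names here are hypothetical and do not reproduce the kernel's real paravirt_ops layout.]

```c
#include <assert.h>

/* Hypothetical miniature of an ops table. */
struct pv_ops_sketch {
    void (*irq_disable)(void);
    void (*irq_enable)(void);
};

static int irqs_on = 1;   /* stand-in for the CPU interrupt flag */
static int hook_calls;    /* how many times any hook was entered */

static void native_irq_disable(void) { irqs_on = 0; hook_calls++; }
static void native_irq_enable(void)  { irqs_on = 1; hook_calls++; }

/* A backend could overwrite these pointers at boot, which is why the
 * callers below go through the table instead of calling directly. */
static struct pv_ops_sketch pv_ops = {
    .irq_disable = native_irq_disable,
    .irq_enable  = native_irq_enable,
};

/* First call site of the two hooks. */
static int section_a(void)
{
    pv_ops.irq_disable();
    int seen = irqs_on;   /* 0 while "interrupts" are off */
    pv_ops.irq_enable();
    return seen;
}

/* Second call site of the same two hooks. */
static int section_b(void)
{
    assert(pv_ops.irq_disable && pv_ops.irq_enable);
    pv_ops.irq_disable();
    int seen = irqs_on;
    pv_ops.irq_enable();
    return seen;
}
```

Two table entries produce four indirect calls here, so a count taken with objdump and grep, as above, counts call sites rather than table entries.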
Are these only trivial hooks a'ka alternatives.h? Not at all, these are
full-blown function hooks freely modifiable by a paravirt_ops
implementation, spread throughout the architecture in a finegrained way.
(see my arguments and specific demonstration about the bad effects of
this, four paragraphs below.)
As a comparison: people argued about CONFIG_SECURITY hooks and flamed
about them no end. The reality is, there's only _269_ calls to
security_ops in this same kernel, and i've got CONFIG_SECURITY + SELINUX
enabled. And the only functional modification that security_ops does to
native behavior is "deny the syscall". Not 'full control over
behavior'... In terms of coupling, CONFIG_SECURITY hooks are a walk in
the park, relative to CONFIG_PARAVIRT.
we dont even give /real silicon/ that many hooks! If an x86 CPU came
along that required the addition of 1400+ function hooks then we'd say:
'you must be joking, that's not an x86 CPU! Make it more compatible!'.
please dont get me wrong - 1463 hooks spread out might be fine in the
end, but _if and only if_ there are safeguards in place to make sure
they are just a trivial variation of the hardware ABI - a'ka
asm/alternatives.h. But there is _no_ such safeguard in place today and
we are seeing the bad effects of that _already_, with just a _single_
hypervisor and a _single_ abstraction topic (time), so i'm very strongly
convinced that it's a serious issue that cannot just be glossed over
with "relax, it will work out fine". If there's one thing we learned in
the past 15 years, it is that ABI issues will haunt us forever.
Let me demonstrate some of the bad effects, and how far we've _already_
deviated from the 'hardware ABI'. An example: one assumes that
paravirt_ops.safe_halt() is a trivial variation of the 'halt
instruction', right? But vmi.c and vmitimer.c do much more than that.
Take a look at vmi_safe_halt() which calls vmi_stop_hz_timer(): it hacks
back a jiffies assumption into its code via paravirt_ops.safe_halt() -
purely via changes local to vmitimer.c, by using next_timer_interrupt()!
Thus it has created a _dual layer_ of dynticks that we specifically
objected against. It does so in spite of our warning about why that is
bad, it does so in spite of Xen having implemented a clockevents driver
in 2 hours, and it does so under the cover of 'oh, this is only a
vmitimer.c local change'. It circumvents the native dynticks framework
and in essence brings in the bad NO_IDLE_HZ technique that we worked so
hard for 2 years not to ever enable for the i386 arch!
so one of my very real problems with paravirt_ops is that due to its
sheer hook-based impact it allows the modification of the hardware ABI
on a _very_ wide scale: both unintentionally and intentionally.
Furthermore, it allows the introduction of hard-to-remove hardwired
quirks that bind one particular paravirt_ops method to the hypervisor
ABI - quirks that are not present in any real silicon! Quirks
_guaranteed by Linux_, by virtue of giving the CONFIG_VMI promise. So we
effectively expose Linux to the hypervisor ABIs, and give hypervisors 'a
license to arbitrary ABI coupling'. With no clear mechanism whatsoever
to remove that coupling. At least crap hardware rots with time and gives
us a chance to remove - but software ABIs and hence "virtual hardware"
seldom rots.
it's hard enough to change the APIC code so that it works on both Intel
and AMD CPUs: and those CPUs /share the ABI/ to a very large degree, by
executing the same code.
and there are no safeguards in place whatsoever to ensure that the
'virtual silicon ABI' matches up to the real silicon ABI. Granted, for
VMI it probably matches up today most likely because VMI came from full
software virtualization that /had/ to emulate all of real silicon. But
from now on, paravirt_ops allows shortcuts, allows changes to semantics
[a shortcut is change of semantics], allows additional ABIs, and we
already see that happening in vmitimer.c, which is even one of the
/easiest/ paravirtualization topics.
Furthermore, it's only the beginning of the pain. It's the effect of
_ONE_ hypervisor (VMWare/ESX) and _ONE_ abstraction (time). We've got
4-5 hypervisors lined up for the paravirt_ops free-for-all and half a
dozen of fundamental abstractions to cover (of which time is the
simplest)! Please do the math. With paravirt_ops, the complexity of this
is per-hypervisor and per-abstraction and there's no safeguard in place
to let them evolve towards a saner, shared model. It wont happen because
doing a sane, shared ABI is _HARD_. Nobody will go to the effort of
implementing the clean solution if they get to play with paravirt_ops,
and _I_ will have the work dumped on me in the architecture, trying to
sort out the mess.
i claim that when the 'API cut' is done at the right level then no more
than say 100 hooks would be needed - with virtually zero kernel size
increase. We've got all the right highlevel abstractions: genirq, gtod,
clockevents. Whatever is missing at the moment from the framework (say
smp_send_reschedule()) we can abstract away. The bonus? It would be
almost directly applicable to other architectures as well. It would also
work with /any/ hypervisor.
Unfortunately, with the current paravirt_ops policy we might end up
seeing none of that unification. 1400+ hooks are just wide enough (by
the law of large numbers) so that any hypervisor can implement a random
ABI that performs close to the 'real' ABI, and has no incentive
whatsoever to play along and make life easier for Linux by having a
sane, unified ABI. Having provided the CONFIG_VMI ABI once, distros will
see themselves forced to add those 1400+ hooks for many, many years to
come, blowing up Linux' instruction cache footprint in a critical area
of code.
And that is why the "paravirt_ops is just virtual hardware" argument is
totally wrong. _Nothing_ limits hypervisors from adding arbitrary ABI
bindings to Linux. For example, VMI does this already and none of the
following are hardware ABIs:
#define VMI_CALL_SetAlarm 68
#define VMI_CALL_CancelAlarm 69
#define VMI_CALL_GetWallclockTime 70
#define VMI_CALL_WallclockUpdated 71
and these ABIs have been objected to by Thomas and me because they are
cycle based - still paravirt_ops allows Linux to become dependent on
these ABIs.
Finally, what would i like to see?
Firstly, i think this has been over-rushed. After years of being happy
with forks of the Linux kernel, all the hypervisors woke up at once and
want to have their stuff upstream /now/. This rush created a hodgepodge
of APIs/ABIs that we now in the end promise to support /all/. (if we
take CONFIG_VMI i can see little ethical reason to not take Xen's
paravirt_ops, lguest's paravirt_ops, KVM's paravirt_ops and i'm sure
Microsoft/Novell will have something nice and different for us too.)
Secondly, i'd like to see a paravirt approach that has /implicit/
safeguards against the following type of crap:
vmitimer.c code has a hardwired assumption that a PIT exists:
/* Disable PIT. */
outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
it has a hardwired assumption that 'cycles' makes sense as a way to
communicate time units:
vmi_timer_ops.set_alarm(
VMI_ALARM_WIRED_LVTT | VMI_ALARM_IS_PERIODIC | VMI_CYCLES_AVAILABLE,
per_cpu(process_times_cycles_accounted_cpu, cpu) + cycles_per_alarm,
cycles_per_alarm);
it has a hardwired assumption that Linux keeps time in units of
'jiffies':
if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
(next = next_timer_interrupt(),
time_before_eq(next, jiffies + HZ/CONFIG_VMI_ALARM_HZ))) {
cpu_clear(cpu, nohz_cpu_mask);
etc. etc. paravirt_ops is _NOT_ just a random driver we can fix in the
future. These are all assumptions that can /easily/ leak out towards the
hypervisor and thus create ABI coupling between that quirk and a
specific version of the hypervisor. Fixing such bad coupling needs
changes on the /hypervisor side/.
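[Editor's note: the jiffies-based test quoted above relies on a wraparound-tolerant comparison of tick counters. The sketch below is a user-space stand-in for that idea, not the kernel macro itself. The name tick_before_eq is hypothetical, and the cast assumes the usual two's-complement conversion from unsigned to signed.]

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if tick a is at or before tick b, 0 otherwise. The unsigned
 * difference is reinterpreted as signed, so the answer stays correct across
 * counter wraparound as long as a and b are less than half the counter
 * range apart. */
static int tick_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}
```

For instance, tick_before_eq(ULONG_MAX - 1, 3) is true: the counter wrapped between the two samples, yet the earlier tick still compares as earlier. Code written this way is tied to a tick counter as the unit of time, which is the coupling being objected to.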
Granted, some of these are just harmless quirks that are fixable in
Linux only, but some of these are stifling because they bind Linux to
the hypervisor ABI.
i might be overreacting, but the effects of 1400+ hooks put into the
code against my objections, which code i helped write for many years,
and that i'd like to help write for many years to come, is no small
issue to me.
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
@ 2007-03-09 18:28 ` Andi Kleen
2007-03-09 18:30 ` Linus Torvalds
2007-03-09 19:00 ` Chris Wright
2 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2007-03-09 18:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Chris Wright, Alan Cox
On Friday 09 March 2007 19:02, Ingo Molnar wrote:
> _1463_ hooks, spread out all around the x86 arch.
They are not all different hooks though, just many call sites of the same.
Also most of them are well defined to just match what the instructions
do.
paravirt_ops has under a hundred entries right now and i intend to not
expand it much further after the Xen bits are in.
> Let me demonstrate some of the bad effects, and how far we've _already_
> deviated from the 'hardware ABI'. An example: one assumes that
> paravirt_ops.safe_halt()
The vmi maintainers already agreed to fix that.
> i claim that when the 'API cut' is done at the right level
Can you make a proposal? Would you be willing to write code for that?
> then no more
> than say 100 hooks would be needed
Well we have less than 100 hooks right now, just with many call sites @)
> - with virtually zero kernel size
> increase.
I'll believe that when I see it.
> We've got all the right highlevel abstractions: genirq, gtod,
> clockevents. Whatever is missing at the moment from the framework (say
> smp_send_reschedule()) we can abstract away.
smp_send_reschedule() is just an IPI instance which is already
abstracted with genapic. Xen has a genapic_xen.
VMI is still where ->apic_read/->apic_write and their relatively
harmless timer interrupt change make sense -- if they needed
more changes they would just need to bite the bullet and
provide a custom genapic vmware apic driver.
> Unfortunately, with the current paravirt_ops policy we might end up
> seeing none of that unification.
I am open to concrete incremental proposals for improvements.
>
> And that is why the "paravirt_ops is just virtual hardware" argument is
> totally wrong. _Nothing_ limits hypervisors from adding arbitrary ABI
> bindings to Linux. For example, VMI does this already and none of the
> following are hardware ABIs:
>
> #define VMI_CALL_SetAlarm 68
> #define VMI_CALL_CancelAlarm 69
> #define VMI_CALL_GetWallclockTime 70
> #define VMI_CALL_WallclockUpdated 71
That's VMI internal, not exposed above paravirt ops.
> Firstly, i think this has been over-rushed. After years of being happy
> with forks of the Linux kernel,
The code has been posted for a long time, open for review for everybody.
> Secondly, i'd like to see a paravirt approach that has /implicit/
> safeguards against the following type of crap:
I don't think you can use an API to force the underlying implementation
in a practical way. If code wants to do something wrong, no API
in the world will stop it. That is why we have code review instead.
> it has a hardwired assumption that 'cycles' makes a sense as a way to
> communicate time units:
>
> vmi_timer_ops.set_alarm(
> VMI_ALARM_WIRED_LVTT | VMI_ALARM_IS_PERIODIC | VMI_CYCLES_AVAILABLE,
> per_cpu(process_times_cycles_accounted_cpu, cpu) + cycles_per_alarm,
> cycles_per_alarm);
That's because VMI is defined this way? If paravirt chose to not pass cycles
anymore they would just add a simple conversion function. SMOP.
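[Editor's note: a sketch of what such a conversion function could look like, assuming the counter frequency is known in Hz. The name cycles_to_ns and its signature are hypothetical; this is not taken from vmitimer.c or from any kernel helper.]

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a counter running at hz
 * cycles per second. Whole seconds and the remainder are converted
 * separately so the intermediate multiply does not overflow for
 * frequencies up to roughly 18 GHz. The result is truncated, not rounded. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    assert(hz > 0);
    uint64_t secs = cycles / hz;
    uint64_t rem  = cycles % hz;
    return secs * 1000000000ull + (rem * 1000000000ull) / hz;
}
```

With a 2 GHz counter, cycles_to_ns(2000000000, 2000000000) is one second (1000000000 ns), and 3 cycles truncate to 1 ns.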
> it has a hardwired assumption that Linux keeps time in units of
> 'jiffies':
Well a lot of drivers have that, but it can be all fixed.
> Granted, some of these are just harmless quirks that are fixable in
> Linux only,
All of them.
> but some of these are stiffling because they bind Linux to
> the hypervisor ABI.
I haven't seen a concrete example of that yet to be honest and I don't
really believe it.
-Andi
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
2007-03-09 18:28 ` Andi Kleen
@ 2007-03-09 18:30 ` Linus Torvalds
2007-03-09 19:24 ` Ingo Molnar
2007-03-09 19:00 ` Chris Wright
2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 18:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
On Fri, 9 Mar 2007, Ingo Molnar wrote:
>
> yes - but we already support the raw hardware ABI, in the native kernel.
Why do you continue to call paravirt an ABI?
We got over that. It's not. It's an API.
VMI is an ABI.
As long as you try to confuse the two, there's no point to the discussion.
Yeah, paravirt is ugly. Yeah, the calls should be moved higher in the
stack. But you don't help by confusing the issue by mixing the different
parts up and calling something an ABI that simply *isn't*.
Paravirt already acts on a higher level than the ioapic. It does do the
"irq_disable()" kind of "highlevel" callbacks. Yeah, the "apic_write()"
ones should go away, and they're just hacky, but there's nothing there
that is an ABI.
So just *fix* it or tell others to fix it, instead of just confusing the
issue.
And trust me, if "apic_write" causes bugs because it interacts with real
APIC usage, we don't care ONE WHIT. That paravirt_ops entry goes out the
window so fast you can't say "Whaa?!??".
Linus
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
2007-03-09 18:28 ` Andi Kleen
2007-03-09 18:30 ` Linus Torvalds
@ 2007-03-09 19:00 ` Chris Wright
2 siblings, 0 replies; 37+ messages in thread
From: Chris Wright @ 2007-03-09 19:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Chris Wright, Alan Cox
* Ingo Molnar (mingo@elte.hu) wrote:
> i claim that when the 'API cut' is done at the right level then no more
> than say 100 hooks would be needed - with virtually zero kernel size
> increase. We've got all the right highlevel abstractions: genirq, gtod,
> clockevents. Whatever is missing at the moment from the framework (say
> smp_send_reschedule()) we can abstract away. The bonus? It would be
> almost directly applicable to other architectures as well. It would also
> work with /any/ hypervisor.
Oddly enough, that's really what we are trying to achieve. There is
definitely some tension between the VMI model which is modeled very
directly on hardware and something like the Xen model which prefers
higher level interfaces.
I don't really agree with your metrics w.r.t hooks. My point is you
take callsites == hooks to arrive at 1463 hooks, but then above say 100
hooks is sufficient. But we have on the order of 100 hooks (I believe
it's ~75 in Linus' tree). Put it another way. Do you believe that
something like irq_{en,dis}able() is appropriate to hook (as that's >
1400 callsites already)?
> Firstly, i think this has been over-rushed. After years of being happy
> with forks of the Linux kernel, all the hypervisors woke up at once and
> want to have their stuff upstream /now/. This rush created a hodgepodge
> of APIs/ABIs that we now in the end promise to support /all/. (if we
> take CONFIG_VMI i can see little ethical reason to not take Xen's
> paravirt_ops, lguest's paravirt_ops, KVM's paravirt_ops and i'm sure
> Microsoft/Novell will have something nice and different for us too.)
It would be eminently helpful if you helped with some specific ideas
on where the paravirt_ops interface needs to be adjusted.
> Secondly, i'd like to see a paravirt approach that has /implicit/
> safeguards against the following type of crap:
How would you propose doing that? Typically that's done with code
review and patches.
thanks,
-chris
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 18:30 ` Linus Torvalds
@ 2007-03-09 19:24 ` Ingo Molnar
2007-03-09 19:51 ` Linus Torvalds
2007-03-09 20:50 ` Jan Engelhardt
0 siblings, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 19:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 9 Mar 2007, Ingo Molnar wrote:
> >
> > yes - but we already support the raw hardware ABI, in the native
> > kernel.
>
> Why do you continue to call paravirt an ABI?
>
> We got over that. It's not. It's an API.
>
> VMI is an ABI.
Unfortunately i still dont see where i'm wrong, and i'm really trying to
understand your argument. Is your argument that as long as an ABI (VMI)
is never directly used but only used via wrapper functions
(paravirt_ops), it has no effects whatsoever on the flexibility of the
rest of the software and ceases to have any negative ABI effects? In my
opinion that is an absurd (and incorrect) point so i guess you must mean
something else, but i really cannot think what that is.
I never said paravirt_ops is an ABI. I say that the ABI(s) _behind_
paravirt_ops [in the backend] /does/ limit Linux, even if wrapped,
inevitably, and that i'm simply worried about having 4-5 independent
ABIs behind each paravirt_ops variant each creating a web of design
constraints on the rest of the kernel. To quote a past email of mine:
|| 'paravirt ops can take care of it' - but that is just blatantly
|| _FALSE_: the ABI 'behind' the paravirt_ops 'shines through' via
|| functional coupling
it doesnt matter in how big letters the wrapper functions have 'freedom'
written on them, the _real_ constraint is the user's expectation to have
the hypervisor work with Linux that worked with that particular VMI ABI
in v2.6.21. So the user wants to have its hypervisor 1.12 work with
Linux v2.6.22 - without having to update the hypervisor. And Linux
v2.6.23. Etc. /That/ is the 'ABI effect' i'm worried about. It is a
"compatibility web" that gets more and more entangled with every new
paravirt_ops implementation added.
In practice, when a problem comes up during code rewrite, 90% of the
time we can probably find a way around it via paravirt_ops and the
backend, but i'm simply worried about the remaining 10%. And that 10% is
not hypothetical at all, should i cite specific examples of problems
that i think cannot be solved via Linux-only modifications?
I'm also worried about the sheer QA inertia of having an additional 4-5
hypervisor-ABI constraints on the correctness of the kernel, in addition
to the 2 main CPU variants we have at the moment.
If we said "paravirt_ops must behave like real hardware" then we'd
probably remove some of that risk (although enforcement is still an
issue). But we _specifically_ say that no, it doesnt have to behave like
real hardware. We allow shortcuts, we allow modifications of behavior -
and that's good in quite many cases. But we allow really weird hacks
like the .safe_halt() thing. Our only present requirement it appears is
that "it works with today's hypervisor" - and that requirement
automatically transforms itself into: "all future kernels will work with
all past versions of the hypervisor".
Ingo
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 19:24 ` Ingo Molnar
@ 2007-03-09 19:51 ` Linus Torvalds
2007-03-09 20:12 ` Ingo Molnar
2007-03-09 21:04 ` Ingo Molnar
2007-03-09 20:50 ` Jan Engelhardt
1 sibling, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 19:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
On Fri, 9 Mar 2007, Ingo Molnar wrote:
>
> Unfortunately i still dont see where i'm wrong, and i'm really trying to
> understand your argument. Is your argument that as long as an ABI (VMI)
> is never directly used but only used via wrapper functions
> (paravirt_ops)
No.
My argument is utterly and *purely* that you've been confusing the
discussion by using the wrong terms, and as a result, you've been
discussing things that aren't *relevant*.
You haven't been saying anything constructive.
For example, here's a *constructive* thing you could have said, and never
actually did:
- paravirt_op->write_apic should not exist
anything that needs to write to the apic should either
(a) have been caught much earlier in the paravirt stack, ie it's a
"disable interrupt" kind of operation, and should never even have
gotten to the APIC write in the first place, but been handled by
the paravirtualized handler.
(b) just be emulated as an APIC write (and if the emulation isn't good
enough, screw it)
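[Editor's note: options (a) and (b) above can be contrasted in a small user-space sketch. Everything here is hypothetical: the names, the single "mask the timer" operation, and the fake register stand in for the real APIC code only for illustration.]

```c
#include <assert.h>
#include <stdint.h>

static uint32_t fake_apic_reg;   /* stand-in for a memory-mapped APIC register */
static int      apic_writes;     /* counts low-level register writes */
static int      hyper_masked;    /* state kept by a hypothetical backend */

/* Low-level register write, the granularity being criticized as a hook. */
static void native_apic_write(uint32_t v) { fake_apic_reg = v; apic_writes++; }

/* Native high-level operation, built on top of the register write.
 * The bit value is illustrative only. */
static void native_mask_timer(void)
{
    native_apic_write(fake_apic_reg | 0x10000u);
}

/* Option (a): the backend handles the high-level operation itself,
 * so the APIC write is never reached. */
static void hyper_mask_timer(void) { hyper_masked = 1; }

/* The hook sits at the level of the operation, not of the register. */
struct irq_ops_sketch {
    void (*mask_timer)(void);
};

static void mask_timer_via(const struct irq_ops_sketch *ops)
{
    assert(ops && ops->mask_timer);
    ops->mask_timer();
}
```

With the hook placed at the mask_timer level, a backend never sees a register write; under option (b) it would instead leave the native path alone and emulate native_apic_write.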
In other words, to be *constructive*, you need to point out particular
and practical problem spots, instead of just ranting about any
"paravirtualized ABI".
Of *course* there is an ABI at some point behind any API. Why do you harp
on that? It's irrelevant. Any API will always end up being instantiated
into a binary thing at some point, and that binary thing will have to work
with some particular version of a hardware/infrastructure combination, but
that has *nothing* to do with anything.
The x86 instruction set is an ABI. Our API's eventually tend to be
compiled to something like that ABI, and yes, some ABI's may not be able
to do certain things. For example, on the 32-bit x86 ABI, there are no
interfaces for address space identifiers, and the wrappers become a no-op
that just don't do anything, and if you want fast context switches between
two contexts, you're screwed.
Similarly, maybe the VMI ABI doesn't allow for something that the kernel
wants to do efficiently. Big deal. What relevance does that have to do
with anything, except the fact that if true, the VMWare people are
screwed? It's *their* problem.
So please
- point out things that are badly done. I agree that apic_write() simply
shouldn't be an ABI point at all. But do so *directly* without some
ranting about other things that aren't relevant. Your "1400 hooks" rant
was pointless - there aren't 1400 hooks at all. There are 1400
call-sites, but that's like saying that the "mov" operation is a bad
instruction, because there are 5 million mov instructions in the
kernel.
- Realize that if VMI has problems, it's not *your* problem, or even the
kernel's problem. It's purely a VMI problem. I don't understand why you
care, or why you think we should care.
- and I guess we can also stop cc'ing me in the first place. I don't even
think virtualization is very interesting. I'd much rather flame people
about bad taste in more important areas ;)
Thanks,
Linus
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 19:51 ` Linus Torvalds
@ 2007-03-09 20:12 ` Ingo Molnar
2007-03-09 21:05 ` Jeremy Fitzhardinge
2007-03-09 21:06 ` Linus Torvalds
2007-03-09 21:04 ` Ingo Molnar
1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 20:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Similarly, maybe the VMI ABI doesn't allow for something that the
> kernel wants to do efficiently. Big deal. What relevance does that
> have to do with anything, except the fact that if true, the VMWare
> people are screwed? It's *their* problem.
i wont hold you up for long, but i think this is the key difference, and
if i understand your point correctly i think you are really wrong here.
This is 'enterprise Linux compatibility and ABI 101', really - i dare to
bet blindly that you wont see anyone here from distros arguing against
this simple point.
Once this thing is released upstream, it creates a new compatibility
rule:
_new kernel must not break on an older hypervisor_
due to a new paravirt_ops design. Ever. It's really that simple. (I
think i never said this explicitly because this requirement of backwards
compatibility was so obvious to me.)
And it doesnt matter whether we think that it was VMWare who messed up.
Users/customers _will_ blame us: "v2.6.25 regresses, it wont run under
ESX v1.12 anymore". Distro will yield and will undo whatever change
breaks backwards compatibility with older hypervisors. (most likely it
will be undone upstream already) Backwards compatibility acts as a very
heavy barrier against certain types of paravirt_ops design changes.
Once v2.6.21 is released, and a bigger distro releases a kernel with
CONFIG_PARAVIRT+CONFIG_VMI enabled: backwards compatibility in future
versions becomes mainly /that/ distro's problem (and upstream's
problem), _NOT_ VMware's problem.
That's why i mentioned CONFIG_COMPAT_VDSO as an example. One major
distro (SuSE 9.0) came out with that particular glibc version that had a
bug that depended on a particular and totally unintentional ABI detail
in the vDSO. As a result we had to do several iterations of
CONFIG_COMPAT_VDSO to keep backwards compatibility. And glibc is perhaps
_the_ most kernel-friendly external software project in existence.
Still, the ABI dependency was there, and we cannot break users who run
old userspace. The same rule holds here: we cannot break users who run
an old hypervisor.
Ingo
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 19:24 ` Ingo Molnar
2007-03-09 19:51 ` Linus Torvalds
@ 2007-03-09 20:50 ` Jan Engelhardt
2007-03-09 22:50 ` Lee Revell
1 sibling, 1 reply; 37+ messages in thread
From: Jan Engelhardt @ 2007-03-09 20:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Chris Wright, Alan Cox
Hello,
On Mar 9 2007 20:24, Ingo Molnar wrote:
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>>On Fri, 9 Mar 2007, Ingo Molnar wrote:
>>>
>>> yes - but we already support the raw hardware ABI, in the native
>>> kernel.
>>
>> Why do you continue to call paravirt an ABI?
>> We got over that. It's not. It's an API.
>> VMI is an ABI.
>
>Unfortunately i still dont see where i'm wrong, and i'm really trying to
>understand your argument. Is your argument that as long as an ABI (VMI)
>is never directly used but only used via wrapper functions
>(paravirt_ops), it has no effects whatsoever on the flexibility of the
>rest of the software and ceases to have any negative ABI effects? [...]
As far as I understand and recognize it,
[have monospaced font]
paravirt struct      struct file_operations / {ALSA | OSS}
     ^                          ^
     |        <- API ->         |
     v                          v
    VMI                     /dev/dsp
     ^                          ^
     |        <- ABI ->         |
     v                          v
  product                 userspace app
I think the sound example to the right really shows it. /dev/dsp has a
consistent ABI on a ton of systems. The API below it, varies. Linux got
file_operations and ALSA. Solaris/BSD may have its
vnode-and-so-on-functions and some sort of OSS.
Hope this helps (and more, I hope it's accurate - please correct me if I
am wrong.)
Jan
--
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 19:51 ` Linus Torvalds
2007-03-09 20:12 ` Ingo Molnar
@ 2007-03-09 21:04 ` Ingo Molnar
2007-03-09 21:27 ` Chris Wright
1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So please
>
> - point out things that are badly done. [...]
the thing badly done is fundamental and it trumps any other small
technological detail complaint i have, because it affects the
development and maintenance model: to promise backwards compatibility
to 4-5 different hypervisors, using separate ABIs for each. /One/ such
ABI would be complex enough to maintain IMO.
( if there is no backwards compatibility promise then i have zero
complaints: then paravirt_ops + the hypercall just becomes another API
internal to Linux that we can improve at will. But that is not
realistic: if we provide CONFIG_VMI today, people will expect to have
CONFIG_VMI in the future too. )
as i said in the very, very first email about this topic: /one/ new ABI
towards hypervisors should be introduced step by step, and via consensus
across hypervisors. We should treat the hypercall ABI very similar to
the system call ABI: the system call ABI is largely based on a consensus
between applications.
[ i think apic_write() granularity is bad too - but that is a small
technical issue, dwarfed by the ABI issues that impact the development
model IMO. ]
Ingo
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 20:12 ` Ingo Molnar
@ 2007-03-09 21:05 ` Jeremy Fitzhardinge
2007-03-09 21:06 ` Linus Torvalds
1 sibling, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Zachary Amsden, Thomas Gleixner, john stultz,
akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright, Alan Cox
Ingo Molnar wrote:
> Once this thing is released upstream, it creates a new compatibility
> rule:
>
> _new kernel must not break on an older hypervisor_
>
Yes, that's important. (It's perhaps more important that a new
hypervisor not break an old kernel, but that's 100% the hypervisor's
responsibility.)
But yes, the various pv_ops implementations will need to work to make
sure that they can maintain support for old hypervisors as the pv_ops
interface evolves. That's the basic nature of such API-ABI couplings.
The only way to deal with it is to make sure the API is high-enough
level to leave abstraction wiggle-room to adapt the pv_ops interface to
the hypervisor interface.
But I'm pretty sure that's your point, and one we've been aware of from
the start.
> due to a new paravirt_ops design. Ever. It's really that simple. (I
> think i never said this explicitly because this requirement of backwards
> compatibility was so obvious to me.)
>
You're right. Very important.
> And it doesnt matter whether we think that it was VMWare who messed up.
> Users/customers _will_ blame us: "v2.6.25 regresses, it wont run under
> ESX v1.12 anymore". Distro will yield and will undo whatever change
> breaks backwards compatibility with older hypervisors. (most likely it
> will be undone upstream already) Backwards compatibility acts as a very
> heavy barrier against certain types of paravirt_ops design changes.
>
Well, yes. If there's a huge disconnect between the pv_ops interface
and the hypervisor interface which can't be bridged, then that might be
a problem. But one would hope that would get sorted out way before
people start complaining about regressions. If the regression is just a
bug, well, it's just a bug.
pv_ops has to walk a delicate balance. On the one hand, its entrypoints
can't be 1:1 with any given ABI, because that would make it very
brittle. On the other hand, it can't be so high-level that any back-end
implementation has to have masses of code mapping the high-level
interface to the ABI.
This is not something you can do with capital-A Architecture up front.
It's something that you need to work towards incrementally as real design
problems come up. You know, like the rest of the kernel.
The current pv_ops development has been helped by the fact that Xen and
VMI (and lguest and kvm) have all quite distinct virtualization
approaches, because it prevents the interface from being overly fawning
to one particular design. But that doesn't mean it should be a grab-bag
union of all the interfaces; we need to find the right level of
abstraction which can deal with everyone's requirements (both the
hypervisors and the kernel overall). And we'll need to keep rebalancing
those tradeoffs as the requirements change (as new hypervisors get
added, existing ones change, obsolete ones get dropped and as the kernel
itself changes).
> Once v2.6.21 is released, and a bigger distro releases a kernel with
> CONFIG_PARAVIRT+CONFIG_VMI enabled: backwards compatibility in future
> versions becomes mainly /that/ distro's problem (and upstream's
> problem), _NOT_ WMware's problem.
>
pv_ops went into mainline in 2.6.20, more or less as a placeholder.
2.6.21 will change it a bit, and the vmi binding will work a bit better,
I guess. I'm hoping the patches I've been posting will make it into
2.6.22, and they'll make further changes to the interface. This is a
new, highly dynamic interface, and nobody is claiming or (I hope)
expecting anything else.
If VMWare is indeed actually releasing a product in a couple of weeks
based on 2.6.21, well that's going to be their support problem. I
really hope no distro is going to decide to ship a release based on a
current or near future kernel with CONFIG_PARAVIRT enabled, at least not
without realizing they're shipping something very new and raw. pv_ops
is going to change a lot in the next few months.
> That's why i mentioned CONFIG_COMPAT_VDSO as an example. One major
> distro (SuSE 9.0) came out with that particular glibc version that had a
> bug that depended on a particular and totally unintentional ABI detail
> in the vDSO. As a result we had to do several iterations of
> CONFIG_COMPAT_VDSO to keep backwards compatibility. And glibc is perhaps
> _the_ most kernel-friendly external software project in existence.
> Still, the ABI dependency was there, and we cannot break users who run
> old userspace. The same rule holds here: we cannot break users who run
> an old hypervisor.
>
Well, the glibc folks were pretty annoyed about that too, since it
wasn't even a release glibc with the problem; it was just some random
CVS snapshot that got shipped. The distros screwed up, and we have to
deal with it. It would be nice to turn off COMPAT_VDSO, but since we
can't, we're going to make it work with CONFIG_PARAVIRT.
J
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 20:12 ` Ingo Molnar
2007-03-09 21:05 ` Jeremy Fitzhardinge
@ 2007-03-09 21:06 ` Linus Torvalds
2007-03-09 21:36 ` Ingo Molnar
1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 21:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
On Fri, 9 Mar 2007, Ingo Molnar wrote:
>
> _new kernel must not break on an older hypervisor_
Total red herring. AGAIN.
The paravirt_ops isn't a 1:1 hypervisor ABI.
If we change paravirt_ops to be higher-level ops (as we should), yes, the
paravirt->VMI layer needs to be extended to have the "higher->lower"
translation. But at no point did we break the hypervisor.
Anyway, please don't even bother cc'ing or replying to me any more.
Linus
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:04 ` Ingo Molnar
@ 2007-03-09 21:27 ` Chris Wright
2007-03-09 21:47 ` Ingo Molnar
0 siblings, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 21:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Chris Wright, Alan Cox
* Ingo Molnar (mingo@elte.hu) wrote:
> ( if there is no backwards compatibility promise then i have zero
> complaints: then paravirt_ops + the hypercall just becomes another API
> internal to Linux that we can improve at will. But that is not
> realistic: if we provide CONFIG_VMI today, people will expect to have
> CONFIG_VMI in the future too. )
This was the whole reason we didn't adopt VMI directly. Instead, we
preferred a kernel-internal API, pv_ops, that can adapt naturally
as the kernel changes, and it is the pv_ops client code's (or backend
as it is also referred to) responsibility to do whatever is necessary
to map back to the hypervisor's ABI. The goal was explicitly to keep
things internal fluid as usual. As I said before, no matter how you
slice it there's glue code somewhere to deal with compatibilities.
And it's always been the virtualization platform's responsibility to
deal with the changes.
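A minimal sketch of that split, for illustration only (it is not code from this thread or from the kernel tree): `struct pv_ops_sketch`, `kick_cpu()`, the fake ICR variables, and the `FOOBIE_HCALL_SEND_IPI` number and argument layout are all invented names and values. What it tries to show is that core code calls one kernel-internal op, while each backend owns the glue down to its own fixed interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical kernel-internal interface: free to change between releases. */
struct pv_ops_sketch {
    const char *name;
    void (*send_ipi)(unsigned int cpu, uint8_t vector);
};

/* "Native" backend: two writes to a fake local-APIC register pair. */
static uint32_t fake_icr_hi, fake_icr_lo;
static unsigned int native_reg_writes;

static void native_send_ipi(unsigned int cpu, uint8_t vector)
{
    assert(cpu < 256);                  /* destination must fit in 8 bits */
    fake_icr_hi = (uint32_t)cpu << 24;  /* destination field */
    native_reg_writes++;
    fake_icr_lo = vector;               /* the low-word write sends the IPI */
    native_reg_writes++;
}

/* Hypothetical hypervisor ABI: call number and argument layout are frozen. */
#define FOOBIE_HCALL_SEND_IPI 7u

static uint32_t last_hcall_nr, last_hcall_arg;
static unsigned int hcall_count;

/* Stand-in for the trap into the hypervisor; it only records the call. */
static void foobie_hypercall(uint32_t nr, uint32_t arg)
{
    last_hcall_nr = nr;
    last_hcall_arg = arg;
    hcall_count++;
}

/* Backend glue: map the internal op onto the frozen ABI encoding. */
static void foobie_send_ipi(unsigned int cpu, uint8_t vector)
{
    assert(cpu < 256);
    foobie_hypercall(FOOBIE_HCALL_SEND_IPI, ((uint32_t)cpu << 8) | vector);
}

static const struct pv_ops_sketch native_ops = { "native", native_send_ipi };
static const struct pv_ops_sketch foobie_ops = { "foobie", foobie_send_ipi };

/* Core kernel code only ever sees the ops table, never a backend ABI. */
static void kick_cpu(const struct pv_ops_sketch *ops, unsigned int cpu)
{
    ops->send_ipi(cpu, 0xfd);
}
```

If the internal op later changed shape (say, to take a CPU mask), only `struct pv_ops_sketch` and the two backend functions would have to follow; the hypothetical hypercall number and argument encoding could stay as they are.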
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:06 ` Linus Torvalds
@ 2007-03-09 21:36 ` Ingo Molnar
2007-03-09 21:40 ` Jeremy Fitzhardinge
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> If we change paravirt_ops to be higher-level ops (as we should), yes,
> the paravirt->VMI layer needs to be extended to have the
> "higher->lower" translation. But at no point did we break the
> hypervisor.
hm. So your point is that VMI is in essence a Turing machine (a
near-complete one)? No matter what redesign we do on the Linux side, the
VMI paravirt_ops will always be able to adapt to it? If that is the case
then my ABI worries would indeed be wrong and i'd owe Zach a big fat
apology [and more] for my flames ;-)
but what if the transformation is not just a top-down transformation,
but something more conceptual, introducing something that VMI does not
cover today, like: 'elimination of the APIC concept from guest mode
(replacing it with a virtual interrupt controller)', or 'elimination of
the concept of IRQ vectors from guest mode (virtual irq controller)'. Or
'elimination of pagetables stored in the guest (all paravirt hooks would
be vma-based)'?
but ... maybe because VMI is so lowlevel and covers /all/ of x86 today,
it will always be able to emulate whatever different concept we can come
up with? Do we really know this absolutely sure?
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:36 ` Ingo Molnar
@ 2007-03-09 21:40 ` Jeremy Fitzhardinge
2007-03-09 22:27 ` Linus Torvalds
2007-03-09 23:10 ` Ingo Molnar
2 siblings, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Chris Wright, Alan Cox
Ingo Molnar wrote:
> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today,
> it will always be able to emulate whatever different concept we can come
> up with? Do we really know this absolutely sure?
>
Why don't you make a specific proposal, and we'll work out the details.
J
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:27 ` Chris Wright
@ 2007-03-09 21:47 ` Ingo Molnar
2007-03-09 21:59 ` Jeremy Fitzhardinge
2007-03-09 22:10 ` Chris Wright
0 siblings, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 21:47 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Alan Cox
* Chris Wright <chrisw@sous-sol.org> wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > ( if there is no backwards compatibility promise then i have zero
> > complaints: then paravirt_ops + the hypercall just becomes another API
> > internal to Linux that we can improve at will. But that is not
> > realistic: if we provide CONFIG_VMI today, people will expect to have
> > CONFIG_VMI in the future too. )
>
> This was the whole reason we didn't adopt VMI directly. Instead,
> preferring an kernel internal API, pv_ops, that can adopt naturally as
> the kernel changes, and it is the pv_ops client code's (or backend as
> it is also referred to) responsibility to do whatever is necessary to
> map back to the hypervisor's ABI. The goal was explicitly to keep
> things internal fluid as usual. As I said before, no matter how you
> slice it there's glue code somewhere to deal with compatibilities. And
> it's always been the virtualization platform's responsibility to deal
> with the changes.
For example, for the sake of argument, if the VMI ABI consisted only of
a single call:
#define VMI_CALL_NOP 1
then obviously it would be very hard for VMI to adapt to changes in the
kernel - no matter how many smarts you put into paravirt_ops :-)
agreed? That is the center of my argument. Does the VMI ABI limit the
Linux kernel or not?
As we increase the complexity of a hypercall ABI, more and more things
can be implemented via it. So _obviously_ there is a 'minimum level of
capability' for every hypercall ABI that is /required/ to keep the Linux
kernel 100% flexible. If in some tricky corner the ABI has some stupid
limit or assumption, it might stifle future changes in Linux.
i am worried whether /any/ future change to the upstream kernel's design
can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i
mean truly any. And whether that can be done is not a function of the
flexibility of paravirt_ops, it's a function of the flexibility of the
VMI ABI.
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:47 ` Ingo Molnar
@ 2007-03-09 21:59 ` Jeremy Fitzhardinge
2007-03-09 22:12 ` Ingo Molnar
2007-03-09 22:10 ` Chris Wright
1 sibling, 1 reply; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 21:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
LKML, Rusty Russell, Andi Kleen, Alan Cox
Ingo Molnar wrote:
> i am worried whether /any/ future change to the upstream kernel's design
> can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i
> mean truly any. And whether that can be done is not a function of the
> flexibility of paravirt_ops, it's a function of the flexibility of the
> VMI ABI.
Come on. You're talking as if these changes can be made in a vacuum,
with no regard to any external constraints. In reality, any such change
is limited by concerns like:
   1. does it make sense in terms of kernel design?
   2. can all the (real) architectures support it?
   3. does it make sense on real hardware?
   4. does it require massive kernel-wide changes, or does it have
      limited scope?
   5. etc, etc
Obviously one of the things to be considered among the many other issues
is "what effect does this have on the paravirt_ops interface?".
Now it may be that you've got a change that's absolutely great for
everyone, and the only blocker is that the FoobieVisor can't deal with
it. OK, great, then you'd have a point.
But are you really saying that one of your handwavy proposals is just
fine on a pre-apic i486/mmu-less m68k/whatever, but can't be made to
work on a hypervisor?
Come up with a specific proposal, and we'll work out the details.
J
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:47 ` Ingo Molnar
2007-03-09 21:59 ` Jeremy Fitzhardinge
@ 2007-03-09 22:10 ` Chris Wright
2007-03-09 22:24 ` Ingo Molnar
1 sibling, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 22:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Alan Cox
* Ingo Molnar (mingo@elte.hu) wrote:
> i am worried whether /any/ future change to the upstream kernel's design
> can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i
> mean truly any. And whether that can be done is not a function of the
> flexibility of paravirt_ops, it's a function of the flexibility of the
> VMI ABI.
i'm not really one to argue on behalf of VMI, but i don't think it's as
dire as you make it out. the VMI is client code of pv_ops, and as the kernel
changes that client code will simply have to adapt. of course there are
theoretical limitations, but let's keep it grounded to practical reality.
the whole premise is evolution. so throw out specific issues, and let's
adapt rather than fall deep into theoretical rhetoric.
thanks,
-chris
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:59 ` Jeremy Fitzhardinge
@ 2007-03-09 22:12 ` Ingo Molnar
2007-03-09 22:30 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:12 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
LKML, Rusty Russell, Andi Kleen, Alan Cox
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Now it may be that you've got a change that's absolutely great for
> everyone, and the only blocker is that the FoobieVisor can't deal with
> it. OK, great, then you'd have a point.
yep. That's precisely my worry. And it doesnt have to be a 'great' thing
- just any random small change in the kernel that makes sense: what is
the likelihood that it cannot be implemented, no matter what amount of
insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor,
because FoobieVisor messed up its ABI?
that likelihood is a pure function of how FoobieVisor's hypercall ABI is
shaped. Wow! So can you guess where my fixation about not having too
many ABIs could possibly originate from? ;-)
Until today everyone on the hypervisor side of the argument pretended
that paravirt_ops solves all problems and acted stupid when i said an
ABI is an ABI is an ABI, and that "backwards compatibility" does have
some technological consequences. _Now_ at least i've got this minimal
admission that FoobieVisor _might_ break. Quite a breakthrough =B-)
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:10 ` Chris Wright
@ 2007-03-09 22:24 ` Ingo Molnar
2007-03-09 22:36 ` Jeremy Fitzhardinge
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:24 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Alan Cox
* Chris Wright <chrisw@sous-sol.org> wrote:
> * Ingo Molnar (mingo@elte.hu) wrote:
> > i am worried whether /any/ future change to the upstream kernel's design
> > can be adopted via paravirt_ops, via the current VMI ABI. And by /any/ i
> > mean truly any. And whether that can be done is not a function of the
> > flexibility of paravirt_ops, it's a function of the flexibility of the
> > VMI ABI.
>
> i'm not really one to argue on behalf of VMI, but i don't think it's
> as dire make it out. [...]
hey, that's what i thought when i helped do the vDSO, until i got
slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more
careful about ABIs since then =B-)
> [...] the VMI is client code of pv_ops, and as the kernel changes that
> client code will simply have to adapt. of course there are
> theoretical limitations, but let's keep it grounded to practical
> reality. the whole premise is evolution. so throw out specific
> issues, and let's adapt rather than fall deep into theoretical
> rhetoric.
ok, sure, how about the one i mentioned: long-term i'd like to have a
paravirt model where the guest does not store /any/ page tables - all
paging is managed by the hypervisor. The guest has a vma tree, but
otherwise it does not process pagefaults, has no concept of a pte (if in
paravirt mode), has no concept of kernel page tables either: there are
hypercalls to allocate/free guest-kernel memory, etc. This needs some
(serious) MM surgery but it's doable and it's interesting as well. How
would you map this to the VMI backend?
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:36 ` Ingo Molnar
2007-03-09 21:40 ` Jeremy Fitzhardinge
@ 2007-03-09 22:27 ` Linus Torvalds
2007-03-09 22:50 ` Ingo Molnar
2007-03-09 23:07 ` Zachary Amsden
2007-03-09 23:10 ` Ingo Molnar
2 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2007-03-09 22:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
On Fri, 9 Mar 2007, Ingo Molnar wrote:
>
> hm. So your point is that VMI is in essence a Turing machine (a
> near-complete one)? No matter what redesign we do on the Linux side, the
> VMI paravirt_ops will always be able to adopt to it?
No, I don't think it's turing-complete ;)
But since it tries to basically come fairly close to emulating the
hardware we already use, any higher-level abstraction we do (which
obviously has to work on real hardware too!) is likely to be translatable
into the "pseudo-hardware" thing that is the VMI interfaces.
> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today,
> it will always be able to emulate whatever different concept we can come
> up with? Do we really know this absolutely sure?
"For sure"? Absolutely not. But since any new interfaces we come up with
for doing timers etc had better work perfectly fine on an old hardware
platform too, we can't exactly require any interfaces that do things that
a bog-standard old dual-PPro didn't do 10 years ago, can we?
So assuming that the VMI interface is roughly as powerful (by virtue of
basically emulating it) as the old single-ioapic/single-lapic systems we
used to use, I don't think it should ever be a real problem. Hmm?
Linus
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:12 ` Ingo Molnar
@ 2007-03-09 22:30 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 22:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Zachary Amsden, Thomas Gleixner, john stultz, akpm,
LKML, Rusty Russell, Andi Kleen, Alan Cox
Ingo Molnar wrote:
> yep. That's precisely my worry. And it doesnt have to be a 'great' thing
> - just any random small change in the kernel that makes sense: what is
> the likelyhood that it cannot be implemented, no matter what amount of
> insight, paravirt_ops + hyper-ABI emulation hackery, for FoobieVisor,
> because FoobieVisor messed up its ABI.
>
> that likelyhood is a pure function of how FoobieVisor's hypercall ABI is
> shaped. Wow! So can you guess where my fixation about not having too
> many ABIs could possibly originate from? ;-)
>
OK, so it's a problem that's happened before. "It's a great idea, it's
so nice, but it breaks X." Your options are:
   1. Well, nobody is really using X. We can stop supporting it.
   2. X makes up 50% of the users, we'll just have to do without your
      great idea.
   3. Maybe we can get X updated so this idea works.
If X is a piece of hardware, then you're probably stuck with options 1
and 2. If it's something like firmware or a hypervisor, you might have a
chance with option 3.
The hypervisor interface is not at all special in this regard; you may
as well be arguing "We can't allow a port of Linux to the FoobieTron2000
CPU, because it might constrain some future development"; that's true,
it might. But I don't think I've ever seen anyone make that argument
for not accepting a new architecture port.
I don't really understand what your overall argument is though. Sure, I
guess it's that if there's one ABI for all hypervisors, then you've only
got one hypervisor-related constraint to consider when evaluating a new
kernel change. But that ABI is going to be as constraining as its
most constraining hypervisor, so you're not really in a better position
than if you have N hypervisor ABIs. In fact you're worse off, because
you have no flexibility to drop/adapt/whatever the real blocker.
> _Now_ at least i've got this minimal
> admission that FoobieVisor _might_ break. Quite a breakthrough =B-)
If you went to all that typing to get that much of a concession, then
you have way too much time ;)
J
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:24 ` Ingo Molnar
@ 2007-03-09 22:36 ` Jeremy Fitzhardinge
2007-03-09 23:38 ` Ingo Molnar
2007-03-09 22:46 ` Chris Wright
2007-03-09 23:13 ` Rik van Riel
2 siblings, 1 reply; 37+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-09 22:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Linus Torvalds, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Alan Cox
Ingo Molnar wrote:
> ok, sure, how about the one i mentioned: long-term i'd like to have a
> paravirt model where the guest does not store /any/ page tables - all
> paging is managed by the hypervisor. The guest has a vma tree, but
> otherwise it does not process pagefaults, has no concept of a pte (if in
> paravirt mode), has no concept of kernel page tables either: there are
> hypercalls to allocate/free guest-kernel memory, etc. This needs some
> (serious) MM surgery but it's doable and it's interesting as well. How
> would you map this to the VMI backend?
You wouldn't. Why would you? It might be a useful interface - and it's
the perfect kind of high-level interface for pv_ops. It might be worth
adapting a hypervisor to suit it, but you still need to support the
i386's pagetables. So, you present the pv_ops interface with your
vma-based mappings, and it runs it through the vma->pagetable
translation layer to feed into either the i386 pagetables directly, or
to a hypervisor's page-based interface.
The important part is that there's more to the story than just pv_ops.
If you wanted to make such a change, then you'd need to refactor the
i386 support code to add a vma->paging helper layer. That layer would
be available for any pv_ops interface to use if it wishes.
(Remember, in the pv_ops model, bare hardware is a "hypervisor" too.)
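A rough sketch of what such a vma->paging helper layer might look like, purely as an illustration (none of this is from the thread or from the kernel): `struct vma_sketch`, `vma_to_pages()`, the `SKETCH_*` constants and both sinks are hypothetical. The helper takes one vma-level request and fans it out into page-granular operations, which either a bare-hardware page table or a page-based hypervisor interface can consume.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ul << SKETCH_PAGE_SHIFT)
#define SKETCH_NR_PAGES   16

/* Hypothetical vma-level request, as the high-level interface would see it. */
struct vma_sketch {
    unsigned long start;  /* page-aligned, inclusive */
    unsigned long end;    /* page-aligned, exclusive */
    unsigned int prot;    /* opaque protection bits, nonzero when mapped */
};

/* Page-granular sink: bare-hardware page tables or a page-based ABI. */
typedef void (*set_page_fn)(unsigned long vaddr, unsigned int prot);

/*
 * Shared helper layer: translate one vma into per-page operations.
 * Returns the number of pages handed to the sink.
 */
static size_t vma_to_pages(const struct vma_sketch *vma, set_page_fn set_page)
{
    size_t n = 0;

    assert(vma->start % SKETCH_PAGE_SIZE == 0);
    assert(vma->end % SKETCH_PAGE_SIZE == 0);

    for (unsigned long va = vma->start; va < vma->end; va += SKETCH_PAGE_SIZE) {
        set_page(va, vma->prot);
        n++;
    }
    return n;
}

/* "Native" sink: a flat fake page table indexed by virtual page number. */
static unsigned int fake_pte[SKETCH_NR_PAGES];

static void native_set_page(unsigned long vaddr, unsigned int prot)
{
    unsigned long vpn = vaddr >> SKETCH_PAGE_SHIFT;

    assert(vpn < SKETCH_NR_PAGES);
    fake_pte[vpn] = prot;
}

/* Page-based hypervisor sink: only counts the calls it would issue. */
static size_t foobie_page_hcalls;

static void foobie_set_page(unsigned long vaddr, unsigned int prot)
{
    (void)vaddr;
    (void)prot;
    foobie_page_hcalls++;
}
```

A hypervisor that understood vmas natively would simply not call `vma_to_pages()` and would pass the request through in one piece; the helper exists only for backends that still think in pages.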
Next problem?
J
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:24 ` Ingo Molnar
2007-03-09 22:36 ` Jeremy Fitzhardinge
@ 2007-03-09 22:46 ` Chris Wright
2007-03-09 23:02 ` Ingo Molnar
2007-03-09 23:13 ` Rik van Riel
2 siblings, 1 reply; 37+ messages in thread
From: Chris Wright @ 2007-03-09 22:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Alan Cox
* Ingo Molnar (mingo@elte.hu) wrote:
> * Chris Wright <chrisw@sous-sol.org> wrote:
> > i'm not really one to argue on behalf of VMI, but i don't think it's
> > as dire make it out. [...]
>
> hey, that's what i thought when i helped do the vDSO, until i got
> slapped with cold reality called "CONFIG_COMPAT_VDSO". I'm a bit more
> careful about ABIs since then =B-)
heh, once bitten twice shy, or however that goes ;-)
> > [...] the VMI is client code of pv_ops, and as the kernel changes that
> > client code will simply have to adapt. of course there are
> > theoretical limitations, but let's keep it grounded to practical
> > reality. the whole premise is evolution. so throw out specific
> > issues, and let's adapt rather than fall deep into theoretical
> > rhetoric.
>
> ok, sure, how about the one i mentioned: long-term i'd like to have a
> paravirt model where the guest does not store /any/ page tables - all
> paging is managed by the hypervisor. The guest has a vma tree, but
> otherwise it does not process pagefaults, has no concept of a pte (if in
> paravirt mode), has no concept of kernel page tables either: there are
> hypercalls to allocate/free guest-kernel memory, etc. This needs some
> (serious) MM surgery but it's doable and it's interesting as well. How
> would you map this to the VMI backend?
Sounds a lot like a userspace process. My immediate thought is, why
not use containers, a more natural fit. But if you have _any_ hope
of booting this kernel on native hardware when it's not running under
a hypervisor then I'd expect the same pv_ops interfaces that allow it
to run on native would allow VMI to build and handle the shadow (since
you'd have taken it out of the kernel). Heh, so in order to run this on
native we had to add fork/mmap pv ops? I agree it might be interesting,
but it's still not clear that it's useful w/out some code to back it up,
and see the value.
thanks,
-chris
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:27 ` Linus Torvalds
@ 2007-03-09 22:50 ` Ingo Molnar
2007-03-09 23:07 ` Zachary Amsden
1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 22:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > but ... maybe because VMI is so lowlevel and covers /all/ of x86
> > today, it will always be able to emulate whatever different concept
> > we can come up with? Do we really know this absolutely sure?
>
> "For sure"? Absolutely not. But since any new interfaces we come up
> with for doing timers etc had better work perfectly fine on an old
> hardware platform too, we can't exactly require any interfaces that do
> things that a bog-standard old dual-PPro didn't do 10 years ago, can
> we?
i'm not really thinking in terms of extending the native kernel -
whatever we do on the native kernel we'll indeed have to be able to do
on 'real' hardware too - including really old boxes.
i'm thinking more in terms of having some new and more intelligent
virtualization-only abstractions.
For example i think people are way too obsessed with building virtual
'machines' that look like PCs, while i've got some truly pie-in-the-sky
ideas floating to simplify virtual machines, like the one i just
outlined to Chris:
|| long-term i'd like to have a paravirt model where the guest does not
|| store /any/ page tables - all paging is managed by the hypervisor.
|| The guest has a vma tree, but otherwise it does not process
|| pagefaults, has no concept of a pte (if in paravirt mode), has no
|| concept of kernel page tables either: there are hypercalls to
|| allocate/free guest-kernel memory, etc. This needs some (serious) MM
|| surgery but it's doable and it's interesting as well. How would you
|| map this to the VMI backend?
Clearly the ugliest and most complex part of hypervisors is MMU support.
So the above model would avoid all the shadow page table complexities
and other MMU nasties, and keep that stuff in the hypervisor. [ in
exchange for a whole set of other complexities and problems ;-) ] This
would be even more highlevel than UML (UML emulates pagetables in
user-space), in that regard.
even in such a drastic model we could share like 90% of the x86 binary
code with the rest of the kernel (all the filesystem support, networking
stack, etc. would still be reusable by the guest kernel), so a paravirt
approach, where native and guest are the same image (as opposed to the
UML model, where they are separate) still makes sense and is preferred
by distributions.
Mapping this into VMI calls looks ... near impossible to me. VMI really
assumes that there is no hypervisor state for kernel objects - while the
above model _shares_ a very substantial kernel object with the
hypervisor - and in fact 100% delegates its handling to the hypervisor
altogether. Hm?
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 20:50 ` Jan Engelhardt
@ 2007-03-09 22:50 ` Lee Revell
2007-03-14 8:41 ` alsa was " Pavel Machek
0 siblings, 1 reply; 37+ messages in thread
From: Lee Revell @ 2007-03-09 22:50 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Ingo Molnar, Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Chris Wright, Alan Cox
On 3/9/07, Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:
> I think the sound example to the right really shows it. /dev/dsp has a
> consistent ABI on a ton of systems. The API below it, varies. Linux got
> file_operations and ALSA. Solaris/BSD may have its
> vnode-and-so-on-functions and some sort of OSS.
I think this is a poor example as applications lose a lot of
functionality (multiple stream mixing, software volume control, etc)
by going through the legacy /dev/dsp interface vs. using native ALSA.
Lee
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:46 ` Chris Wright
@ 2007-03-09 23:02 ` Ingo Molnar
0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:02 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Jeremy Fitzhardinge, Zachary Amsden,
Thomas Gleixner, john stultz, akpm, LKML, Rusty Russell,
Andi Kleen, Alan Cox
* Chris Wright <chrisw@sous-sol.org> wrote:
> > ok, sure, how about the one i mentioned: long-term i'd like to have
> > a paravirt model where the guest does not store /any/ page tables -
> > all paging is managed by the hypervisor. The guest has a vma tree,
> > but otherwise it does not process pagefaults, has no concept of a
> > pte (if in paravirt mode), has no concept of kernel page tables
> > either: there are hypercalls to allocate/free guest-kernel memory,
> > etc. This needs some (serious) MM surgery but it's doable and it's
> > interesting as well. How would you map this to the VMI backend?
>
> Sounds a lot like a userspace process. My immediate thought is, why
> not use containers, a more natural fit. [...]
easy: in my model the hypervisor is isolated from the guest kernel. In
the container model it is not. [ This is a basic quality requirement for
virtualization: a guest kernel does not get to read any hypervisor
crypto keys to HD-DVD smut! ;-) ]
> [...] But if you have _any_ hope of booting this kernel on native
> hardware when it's not running under a hypervisor then I'd expect the
> same pv_ops interfaces that allow it to run on native would allow VMI
> to build and handle the shadow (since you'd have taken it out of the
> kernel). Heh, so in order to run this on native we had to add
> fork/mmap pv ops? I agree it might be interesting, but it's still not
> clear that it's useful w/out some code to back it up, and see the
> value.
progress ;-) But yes, some /really/ high-level pv_ops would be needed.
[ in the end we might be able to simplify it down to a single hook! That
would be: run_native_image / run_guest_image ;-) ]
seriously, most of the body of x86 kernel code is in filesystems, VFS,
networking, scheduler and the core kernel - much of which can be shared
between native and guest. The MM is a significant and very central
chunk, but it is less than 3% of the total codesize.
Ingo
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:27 ` Linus Torvalds
2007-03-09 22:50 ` Ingo Molnar
@ 2007-03-09 23:07 ` Zachary Amsden
1 sibling, 0 replies; 37+ messages in thread
From: Zachary Amsden @ 2007-03-09 23:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Jeremy Fitzhardinge, Thomas Gleixner, john stultz,
akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright, Alan Cox
Linus Torvalds wrote:
>> but ... maybe because VMI is so lowlevel and covers /all/ of x86 today,
>> it will always be able to emulate whatever different concept we can come
>> up with? Do we really know this absolutely sure?
>>
>
> "For sure"? Absolutely not. But since any new interfaces we come up with
> for doing timers etc had better work perfectly fine on an old hardware
> platform too, we can't exactly require any interfaces that do things that
> a bog-standard old dual-PPro didn't do 10 years ago, can we?
>
> So assuming that the VMI interface is roughly as powerful (by virtue of
> basically emulating it) as the old single-ioapic/single-lapic systems we
> used to use, I don't think it should ever be a real problem. Hmm?
>
Sorry to keep you in a thread you want nothing more to do with, Linus,
but to answer the question completely directly:
There are four design requirements which are inviolate for achieving a
large measurable performance gain, which is the primary benefit of i386
paravirt-ops for us. These changes are required for /ALL/ high
performance paravirtualized kernels not running in hardware
virtualization, across a broad spectrum of hypervisors, and have either
zero negative impact or opportunity for additional gain when hardware
virtualization is enabled.
1) Ability to run the kernel at non-zero CPL
2) Ability to replace hardware interrupt masking functions with
virtualizable equivalents
3) Notification when page tables are allocated and released
4) Notification in some form when page table entries are updated (or
vma's are changed)
Everything after these is an incremental gain, some more valuable than
others, but none as major in significance as the above four (apic_write,
incidentally, _is_ one of the more substantial gains for us).
These don't seem to be a major burden on the kernel at all.
#1 is already the default case now for even native hardware.
#2 requires a lot of hooks because interrupt masking is a common
function, and this is where the large numbers of hook points Ingo was
demonstrating came from, but these icache effects and costs are on
already expensive instructions. In fact, it appears on some hardware,
the nop padding around cli / sti / pushf / popf contributes to a
mysterious performance gain, perhaps due to some pipelining anomaly.
#3 is not a common enough operation to be of performance concern. It
does, however, require page tables, just as native hardware does. We
can implement those perfectly well in our backend anyway, just as the
native backend would if some reckless madman removed all notions of page
tables from a paravirt kernel.
#4 involves an extra call in page fault paths and from some points in
the mm layer.
There are no ABI requirements tied to these, merely the presence of any
usable API for them in paravirt-ops.
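To make the four requirements above concrete, here is a minimal userspace
sketch in C of what such a hook table might look like. Every name in it
(pv_hooks_sketch, sketch_backend, sketch_set_pte_irqsafe) is hypothetical and
chosen for illustration; it is not the layout of the real paravirt_ops
structure, and the toy backend only counts notifications instead of talking
to a hypervisor.

```c
/* Illustrative sketch only; all names are hypothetical, not the real paravirt_ops. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One table of hooks; a backend (native or hypervisor) fills it in. */
struct pv_hooks_sketch {
    unsigned kernel_cpl;                         /* 1) kernel may run at CPL != 0 */
    void (*irq_disable)(void);                   /* 2) stands in for cli */
    void (*irq_enable)(void);                    /* 2) stands in for sti */
    void (*pt_alloc)(unsigned long pfn);         /* 3) page table page allocated */
    void (*pt_release)(unsigned long pfn);       /* 3) page table page released */
    void (*pte_update)(unsigned long *ptep,      /* 4) page table entry changed */
                       unsigned long val);
};

/* Toy backend state: a virtual interrupt flag plus notification counters. */
static bool virt_irqs_enabled = true;
static unsigned long live_page_tables;
static unsigned long pte_updates;

static void sketch_irq_disable(void) { virt_irqs_enabled = false; }
static void sketch_irq_enable(void)  { virt_irqs_enabled = true; }

static void sketch_pt_alloc(unsigned long pfn)
{
    (void)pfn;                  /* a real backend would track the frame */
    live_page_tables++;
}

static void sketch_pt_release(unsigned long pfn)
{
    (void)pfn;
    assert(live_page_tables > 0);
    live_page_tables--;
}

static void sketch_pte_update(unsigned long *ptep, unsigned long val)
{
    *ptep = val;                /* the write itself */
    pte_updates++;              /* plus the notification to the backend */
}

const struct pv_hooks_sketch sketch_backend = {
    .kernel_cpl  = 1,
    .irq_disable = sketch_irq_disable,
    .irq_enable  = sketch_irq_enable,
    .pt_alloc    = sketch_pt_alloc,
    .pt_release  = sketch_pt_release,
    .pte_update  = sketch_pte_update,
};

/* A call site never touches cli/sti or the pte directly; it goes via the table. */
void sketch_set_pte_irqsafe(const struct pv_hooks_sketch *ops,
                            unsigned long *ptep, unsigned long val)
{
    assert(ops != NULL && ptep != NULL);
    ops->irq_disable();
    ops->pte_update(ptep, val);
    ops->irq_enable();
}
```

The point of the sketch is only that callers depend on the presence of the
hooks, not on any particular binary layout behind them; a native backend
could fill the same table with functions that execute the raw instructions.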
Linus is right - our virtual hardware is an exact replica of real
hardware. So no matter how you change paravirtualization in the kernel,
anything that runs on real hardware will continue to run on VMware. VMI
is tied very closely to the hardware, on purpose, and follows the rules
of native hardware extremely closely. So you can pretty much twist and
abuse paravirt-ops in a number of ways, and as long as it continues to
run on real hardware with the above four requirements, it still runs
even on VMI. Violate the above four requirements, and it costs a lot of
performance, but we still continue to run.
Zach
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 21:36 ` Ingo Molnar
2007-03-09 21:40 ` Jeremy Fitzhardinge
2007-03-09 22:27 ` Linus Torvalds
@ 2007-03-09 23:10 ` Ingo Molnar
2007-03-09 23:38 ` Zachary Amsden
2 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:10 UTC (permalink / raw)
To: Zachary Amsden
Cc: Jeremy Fitzhardinge, Linus Torvalds, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
* Ingo Molnar <mingo@elte.hu> wrote:
> [...] If that is the case then my ABI worries would indeed be wrong
> and i'd owe Zach a big fat apology [and more] for my flames ;-)
that apology i very much owe to Zach no matter what the outcome of the
discussion. Zach, some of my mindless characterisations of the quality
of VMI code (and of your intentions) were really out of bounds and were
unfair, and i'd like to apologize for that :-/
Ingo
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:24 ` Ingo Molnar
2007-03-09 22:36 ` Jeremy Fitzhardinge
2007-03-09 22:46 ` Chris Wright
@ 2007-03-09 23:13 ` Rik van Riel
2 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2007-03-09 23:13 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Wright, Linus Torvalds, Jeremy Fitzhardinge,
Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Alan Cox
Ingo Molnar wrote:
> ok, sure, how about the one i mentioned: long-term i'd like to have a
> paravirt model where the guest does not store /any/ page tables - all
> paging is managed by the hypervisor. The guest has a vma tree, but
> otherwise it does not process pagefaults, has no concept of a pte (if in
> paravirt mode), has no concept of kernel page tables either: there are
> hypercalls to allocate/free guest-kernel memory, etc. This needs some
> (serious) MM surgery but it's doable and it's interesting as well. How
> would you map this to the VMI backend?
Ugh!
In a situation like that, how does the guest handle pageouts?
What about the dirty and accessed bits?
I'm guessing VMI would deal with this kind of abstraction in
exactly the same way we do the "native hardware" layer behind
paravirt_ops. Ie. this example is a red herring.
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 23:10 ` Ingo Molnar
@ 2007-03-09 23:38 ` Zachary Amsden
0 siblings, 0 replies; 37+ messages in thread
From: Zachary Amsden @ 2007-03-09 23:38 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jeremy Fitzhardinge, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Chris Wright, Alan Cox
Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>
>> [...] If that is the case then my ABI worries would indeed be wrong
>> and i'd owe Zach a big fat apology [and more] for my flames ;-)
>>
>
> that apology i very much owe to Zach no matter what the outcome of the
> discussion. Zach, some of my mindless characterisations of the quality
> of VMI code (and of your intentions) were really out of bounds and were
> unfair, and i'd like to apologize for that :-/
>
That's fine. Despite not having an open source hypervisor, we really
don't have evil intentions, and we really don't want to create tension
or impede the progress of anyone else. We think this work has positive
benefits for Linux, our hypervisor, and others as well.
Some of our code did very much need fixing, and the positive thing we
can take away from such a heated discussion is that we probably got more
eyes on our code, maybe trying to find the evil, and instead finding
bugs or things we could have done better.
At least someone did need to play devil's advocate, as this is an
important thing to get right for the future direction of Linux, and I
respect you for doing so, and don't take offense.
Zach
* Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:36 ` Jeremy Fitzhardinge
@ 2007-03-09 23:38 ` Ingo Molnar
0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-03-09 23:38 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Chris Wright, Linus Torvalds, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Alan Cox
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> The important part is that there's more to the story than just pv_ops.
> If you wanted to make such a change, then you'd need to refactor the
> i386 support code to add a vma->paging helper layer. That layer would
> be available for any pv_ops interface to use if it wishes.
no such change is needed to native. [ other than the removal of tons of
lowlevel hooks ;-) ] Think of this in terms of a completely separate MM
layer for guest kernels, with all memory management details done on the
hypervisor side, ok? I don't think you can emulate that in an equivalent
way via VMI, the kernel object in this model is on the hypervisor side -
while with VMI that does not look possible.
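As a rough illustration of the contrast drawn above, the following C sketch
puts a native backend and a guest backend behind the same high-level memory
interface. All names (mm_backend_sketch, native_backend_sketch,
guest_backend_sketch) are hypothetical, malloc() stands in both for the
kernel's own allocator and for memory handed over by a hypervisor, and the
counters merely record which side would be doing the page table work. It is
a sketch of the idea under those assumptions, not a proposal for the actual
hooks.

```c
/* Illustrative sketch only; names and the counting model are assumptions. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical high-level memory hooks: callers never see a pte. */
struct mm_backend_sketch {
    const char *name;
    void *(*alloc_kernel_mem)(size_t bytes);
    void (*free_kernel_mem)(void *p, size_t bytes);
};

static unsigned long native_pte_writes;   /* page table work done by the kernel */
static unsigned long hypercalls_issued;   /* requests delegated to the hypervisor */

static size_t sketch_pages(size_t bytes)
{
    assert(bytes > 0);
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Native: the kernel owns the page tables, so every page costs a pte write. */
static void *native_alloc(size_t bytes)
{
    native_pte_writes += sketch_pages(bytes);
    return malloc(bytes);
}

static void native_free(void *p, size_t bytes)
{
    native_pte_writes += sketch_pages(bytes);
    free(p);
}

/* Guest: a single request per operation; the paging details stay on the
 * hypervisor side and are invisible here. */
static void *guest_alloc(size_t bytes)
{
    hypercalls_issued++;
    return malloc(bytes);   /* stand-in for hypervisor-provided memory */
}

static void guest_free(void *p, size_t bytes)
{
    (void)bytes;
    hypercalls_issued++;
    free(p);
}

const struct mm_backend_sketch native_backend_sketch = {
    "native", native_alloc, native_free
};
const struct mm_backend_sketch guest_backend_sketch = {
    "guest", guest_alloc, guest_free
};
```

In this toy model a three-page allocation through the native backend counts
three pte writes, while the same allocation through the guest backend counts
one hypercall and no pte writes at all.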
Ingo
* alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-09 22:50 ` Lee Revell
@ 2007-03-14 8:41 ` Pavel Machek
2007-03-14 15:59 ` Jaroslav Kysela
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-14 8:41 UTC (permalink / raw)
To: Lee Revell
Cc: Jan Engelhardt, Ingo Molnar, Linus Torvalds, Jeremy Fitzhardinge,
Zachary Amsden, Thomas Gleixner, john stultz, akpm, LKML,
Rusty Russell, Andi Kleen, Chris Wright, Alan Cox
Hi!
> >I think the sound example to the right really shows it.
> >/dev/dsp has a
> >consistent ABI on a ton of systems. The API below it,
> >varies. Linux got
> >file_operations and ALSA. Solaris/BSD may have its
> >vnode-and-so-on-functions and some sort of OSS.
>
> I think this is a poor example as applications lose a
> lot of
> functionality (multiple stream mixing, software volume
> control, etc)
> by going through the legacy /dev/dsp interface vs. using
> native ALSA.
OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
ABI you should not even use unless you are libalsa. ouch.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-14 8:41 ` alsa was " Pavel Machek
@ 2007-03-14 15:59 ` Jaroslav Kysela
2007-03-15 9:03 ` Pavel Machek
0 siblings, 1 reply; 37+ messages in thread
From: Jaroslav Kysela @ 2007-03-14 15:59 UTC (permalink / raw)
To: Pavel Machek
Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
On Wed, 14 Mar 2007, Pavel Machek wrote:
> Hi!
>
> > >I think the sound example to the right really shows it.
> > >/dev/dsp has a
> > >consistent ABI on a ton of systems. The API below it,
> > >varies. Linux got
> > >file_operations and ALSA. Solaris/BSD may have its
> > >vnode-and-so-on-functions and some sort of OSS.
> >
> > I think this is a poor example as applications lose a
> > lot of
> > functionality (multiple stream mixing, software volume
> > control, etc)
> > by going through the legacy /dev/dsp interface vs. using
> > native ALSA.
>
> OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
> ABI you should not even use unless you are libalsa. ouch.
Pavel, calm down. World is not perfect and there are always probes to
optimize things. We use standard file operations - open/close/ioctl/mmap,
too.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SUSE Labs
* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-14 15:59 ` Jaroslav Kysela
@ 2007-03-15 9:03 ` Pavel Machek
2007-03-15 9:10 ` Pavel Machek
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-15 9:03 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
Hi!
> > > >I think the sound example to the right really shows it.
> > > >/dev/dsp has a
> > > >consistent ABI on a ton of systems. The API below it,
> > > >varies. Linux got
> > > >file_operations and ALSA. Solaris/BSD may have its
> > > >vnode-and-so-on-functions and some sort of OSS.
> > >
> > > I think this is a poor example as applications lose a
> > > lot of
> > > functionality (multiple stream mixing, software volume
> > > control, etc)
> > > by going through the legacy /dev/dsp interface vs. using
> > > native ALSA.
> >
> > OTOH /dev/dsp is nice, clean, unixy interface, while alsa creates ugly
> > ABI you should not even use unless you are libalsa. ouch.
>
> Pavel, calm down.
I'm pretty calm, thank you.
> World is not perfect and there are always probes to
> optimize things. We use standard file operations - open/close/ioctl/mmap,
> too.
Unfortunately ALSA _does not_ provide hardware abstraction. Instead,
it relies on libalsa in userland to do the kernel work. That means
that testing sound is ugly.
Plus, it made some "interesting choices" with naming, basically
inventing parallel system to /dev/. (Why do I have to specify "card 0"
number in xmms, and WTF it means? Why can't alsa use device paths as
rest of sane world?)
So... in dsp, if I wanted to record sound, I did
cat /dev/dsp > /tmp/foo; cat /tmp/foo > /dev/dsp
If that worked, I had usable sound system, and if it broke, I knew it
is kernel fault.
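The same record-then-play test can be written with nothing but
open/read/write/close, which is what makes the interface easy to check. The
helper below is a minimal sketch in C; the name copy_stream and the max_bytes
limit are assumptions added for illustration (a plain cat has no such limit
and runs until interrupted).

```c
/* Illustrative sketch: copy up to max_bytes from one path to another using
 * only plain file operations. copy_stream is a hypothetical helper name. */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of bytes copied, or -1 on error. */
long copy_stream(const char *src, const char *dst, size_t max_bytes)
{
    char buf[4096];
    long total = 0;
    int in, out;

    assert(src != NULL && dst != NULL);

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while ((size_t)total < max_bytes) {
        size_t want = max_bytes - (size_t)total;
        ssize_t got, off = 0;

        if (want > sizeof(buf))
            want = sizeof(buf);
        got = read(in, buf, want);
        if (got < 0) {
            total = -1;
            break;
        }
        if (got == 0)           /* end of input */
            break;

        /* write() may be partial, so loop until the chunk is flushed */
        while (off < got) {
            ssize_t put = write(out, buf + off, (size_t)(got - off));
            if (put < 0) {
                total = -1;
                goto done;
            }
            off += put;
        }
        total += got;
    }
done:
    close(in);
    close(out);
    return total;
}
```

Recording would then be copy_stream("/dev/dsp", "/tmp/foo", n) and playback
copy_stream("/tmp/foo", "/dev/dsp", n) for some byte count n, with no sound
library involved on either path.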
With alsa it is
download & install alsalib
download & install alsautils
create 1007 nodes in /dev
launch alsamixer, figure out what to do from inadequate descriptions
launch arecord, try to guess some suitable options
launch aplay, try to guess some options
...if it does not work, it may be a kernel problem or userspace
problem; I'm left with debugging both. That makes alsa pretty much
untestable.
(For _my_ usage, something like "alsatest" that is self-contained,
preferably statically linked, and just tests microphone and sound
output would be nice. But that does not fix the fact that alsa is
broken by design).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-15 9:03 ` Pavel Machek
@ 2007-03-15 9:10 ` Pavel Machek
2007-03-15 9:23 ` Zachary Amsden
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2007-03-15 9:10 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Lee Revell, Jan Engelhardt, Ingo Molnar, Linus Torvalds,
Jeremy Fitzhardinge, Zachary Amsden, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
Hi!
> So... in dsp, if I wanted to record sound, I did
>
> cat /dev/dsp > /tmp/foo; cat /tmp/foo > /dev/dsp
>
> If that worked, I had usable sound system, and if it broke, I knew it
> is kernel fault.
>
> With alsa it is
>
> download & install alsalib
> download & install alsautils
> create 1007 nodes in /dev
> launch alsamixer, figure out what to do from inadequate descriptions
> launch arecord, try to guess some suitable options
> launch aplay, try to guess some options
>
> ...if it does not work, it may be a kernel problem or userspace
> problem; I'm left with debugging both. That makes alsa pretty much
> untestable.
(Just for the record, I should note that networking is misdesigned in
similar way; that's why we have eth0 instead of /dev/eth0, and need
special tools to rename network interface. But this mistake dates to
BSD days or something, so we got used to it... and at least you do not
need to keep libnetwork up to date to keep your net devices working.
So networking provides _ugly_ hardware abstraction, but it provides
it).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-15 9:10 ` Pavel Machek
@ 2007-03-15 9:23 ` Zachary Amsden
2007-03-15 9:32 ` Pavel Machek
0 siblings, 1 reply; 37+ messages in thread
From: Zachary Amsden @ 2007-03-15 9:23 UTC (permalink / raw)
To: Pavel Machek
Cc: Jaroslav Kysela, Lee Revell, Jan Engelhardt, Ingo Molnar,
Linus Torvalds, Jeremy Fitzhardinge, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
Pavel Machek wrote:
>> download & install alsalib
>> download & install alsautils
>> create 1007 nodes in /dev
I really hope you meant permission 1007 nodes, not 1007 nodes! I'm
checking right now, and if the latter is the case, I'm going to
uninstall alsa, even if that means my computer will forever be silent,
it will be silent in protest.
> (Just for the record, I should note that networking is misdesigned in
> similar way; that's why we have eth0 instead of /dev/eth0, and need
> special tools to rename network interface. But this mistake dates to
> BSD days or something, so we got used to it... and at least you do not
> need to keep libnetwork up to date to keep your net devices working.
>
> So networking provides _ugly_ hardware abstraction, but it provides
> it).
>
I might add that it only got worse with wireless support: now you cannot
configure a card without both ifconfig and iwconfig, so not only is the
API diverging, the userspace tools to manage it are doing so as well.
Zach
* Re: alsa was Re: ABI coupling to hypervisors via CONFIG_PARAVIRT
2007-03-15 9:23 ` Zachary Amsden
@ 2007-03-15 9:32 ` Pavel Machek
0 siblings, 0 replies; 37+ messages in thread
From: Pavel Machek @ 2007-03-15 9:32 UTC (permalink / raw)
To: Zachary Amsden
Cc: Jaroslav Kysela, Lee Revell, Jan Engelhardt, Ingo Molnar,
Linus Torvalds, Jeremy Fitzhardinge, Thomas Gleixner,
john stultz, akpm, LKML, Rusty Russell, Andi Kleen, Chris Wright,
Alan Cox
Hi!
> >> download & install alsalib
> >> download & install alsautils
> >> create 1007 nodes in /dev
>
> I really hope you meant permission 1007 nodes, not 1007 nodes! I'm
> checking right now, and if the latter is the case, I'm going to
> uninstall alsa, even if that means my computer will forever be silent,
> it will be silent in protest.
I meant 1007 nodes... well, not literally. For sound to work on my
system, I seem to need at least
crw-r--r-- 1 root root 116, 0 Mar 24 2005 controlC0
crw-r--r-- 1 root root 116, 32 Mar 24 2005 controlC1
crw-rw-rw- 1 root root 116, 4 Feb 1 16:33 hwC0D0
crw-r--r-- 1 root root 116, 36 Mar 24 2005 hwC1D0
crw-rw-rw- 1 root root 116, 8 Feb 1 16:33 midiC0D0
crw-rw-rw- 1 root root 116, 24 Feb 1 16:33 pcmC0D0c
crw-rw-rw- 1 root root 116, 16 Feb 1 16:33 pcmC0D0p
crw-rw-rw- 1 root root 116, 48 Feb 1 16:34 pcmC1D0p
crw-rw-rw- 1 root root 116, 1 Feb 1 16:33 seq
crw-rw-rw- 1 root root 116, 33 Feb 1 16:33 timer
...fact that minor numbers are not allocated in Doc*/devices.txt does
not help, either.
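For reference, the major/minor pairs in the listing above can be turned back
into the mknod invocations needed to create the nodes by hand. The C sketch
below does that with the glibc makedev()/major()/minor() macros; the table
simply copies the names and minor numbers from the listing, and the
identifiers (snd_node_sketch, format_mknod) are hypothetical. It only formats
the command text and assumes nothing about which directory the nodes live in.

```c
/* Illustrative sketch: rebuild mknod command lines from the listing above.
 * Struct and function names are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/sysmacros.h>   /* makedev(), major(), minor() on Linux/glibc */
#include <sys/types.h>

#define SND_MAJOR_SKETCH 116u    /* major number shown in the listing */

struct snd_node_sketch {
    const char *name;
    unsigned minor_nr;
};

/* Names and minor numbers exactly as they appear in the listing. */
const struct snd_node_sketch snd_nodes_sketch[] = {
    { "controlC0", 0 },  { "controlC1", 32 },
    { "hwC0D0", 4 },     { "hwC1D0", 36 },
    { "midiC0D0", 8 },   { "pcmC0D0c", 24 },
    { "pcmC0D0p", 16 },  { "pcmC1D0p", 48 },
    { "seq", 1 },        { "timer", 33 },
};

/* Writes "mknod <name> c <major> <minor>" into out; returns snprintf's result. */
int format_mknod(char *out, size_t len, const struct snd_node_sketch *n)
{
    dev_t dev;

    assert(out != NULL && n != NULL);
    dev = makedev(SND_MAJOR_SKETCH, n->minor_nr);   /* pack major and minor */
    return snprintf(out, len, "mknod %s c %u %u",
                    n->name, major(dev), minor(dev));
}
```

Looping over snd_nodes_sketch and printing each formatted line yields the ten
commands that correspond to the ten character devices listed.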
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Thread overview: 37+ messages
2007-03-09 18:02 ABI coupling to hypervisors via CONFIG_PARAVIRT Ingo Molnar
2007-03-09 18:28 ` Andi Kleen
2007-03-09 18:30 ` Linus Torvalds
2007-03-09 19:24 ` Ingo Molnar
2007-03-09 19:51 ` Linus Torvalds
2007-03-09 20:12 ` Ingo Molnar
2007-03-09 21:05 ` Jeremy Fitzhardinge
2007-03-09 21:06 ` Linus Torvalds
2007-03-09 21:36 ` Ingo Molnar
2007-03-09 21:40 ` Jeremy Fitzhardinge
2007-03-09 22:27 ` Linus Torvalds
2007-03-09 22:50 ` Ingo Molnar
2007-03-09 23:07 ` Zachary Amsden
2007-03-09 23:10 ` Ingo Molnar
2007-03-09 23:38 ` Zachary Amsden
2007-03-09 21:04 ` Ingo Molnar
2007-03-09 21:27 ` Chris Wright
2007-03-09 21:47 ` Ingo Molnar
2007-03-09 21:59 ` Jeremy Fitzhardinge
2007-03-09 22:12 ` Ingo Molnar
2007-03-09 22:30 ` Jeremy Fitzhardinge
2007-03-09 22:10 ` Chris Wright
2007-03-09 22:24 ` Ingo Molnar
2007-03-09 22:36 ` Jeremy Fitzhardinge
2007-03-09 23:38 ` Ingo Molnar
2007-03-09 22:46 ` Chris Wright
2007-03-09 23:02 ` Ingo Molnar
2007-03-09 23:13 ` Rik van Riel
2007-03-09 20:50 ` Jan Engelhardt
2007-03-09 22:50 ` Lee Revell
2007-03-14 8:41 ` alsa was " Pavel Machek
2007-03-14 15:59 ` Jaroslav Kysela
2007-03-15 9:03 ` Pavel Machek
2007-03-15 9:10 ` Pavel Machek
2007-03-15 9:23 ` Zachary Amsden
2007-03-15 9:32 ` Pavel Machek
2007-03-09 19:00 ` Chris Wright