kvm.vger.kernel.org archive mirror
* Re: [RFC PATCH 00/17] virtual-bus
       [not found] <49D469D2020000A100045FA1@lucius.provo.novell.com>
@ 2009-04-02 14:14 ` Patrick Mullaney
  2009-04-02 14:27   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Patrick Mullaney @ 2009-04-02 14:14 UTC (permalink / raw)
  To: avi
  Cc: anthony, andi, herbert, Gregory Haskins, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:

> 
> virtio is a stable ABI.
> 
> > However, theres still the possibility we can make this work in an ABI
> > friendly way with cap-bits, or other such features.  For instance, the
> > virtio-net driver could register both with pci and vbus-proxy and
> > instantiate a device with a slightly different ops structure for each or
> > something.  Alternatively we could write a host-side shim to expose vbus
> > devices as pci devices or something like that.
> >   
> 
> Sounds complicated...
> 

IMO, it doesn't sound any more complicated than making virtio support the
concepts already provided by the vbus/venet-tap driver. Isn't there already
precedent for alternative approaches co-existing and having the users
decide which is the most appropriate for their use case? Switching
drivers in order to improve latency for a certain class of applications
seems like something latency-sensitive users would be more than
willing to do. I'd like to point out two things: Greg has offered to help
move virtio into the vbus infrastructure, and that infrastructure
is a large part of what is being proposed here.




* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:14 ` [RFC PATCH 00/17] virtual-bus Patrick Mullaney
@ 2009-04-02 14:27   ` Avi Kivity
  2009-04-02 15:31     ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 14:27 UTC (permalink / raw)
  To: Patrick Mullaney
  Cc: anthony, andi, herbert, Gregory Haskins, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

Patrick Mullaney wrote:
> On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:
>
>   
>> virtio is a stable ABI.
>>
>>     
>>> However, theres still the possibility we can make this work in an ABI
>>> friendly way with cap-bits, or other such features.  For instance, the
>>> virtio-net driver could register both with pci and vbus-proxy and
>>> instantiate a device with a slightly different ops structure for each or
>>> something.  Alternatively we could write a host-side shim to expose vbus
>>> devices as pci devices or something like that.
>>>   
>>>       
>> Sounds complicated...
>>
>>     
>
> IMO, it doesn't sound anymore complicated than making virtio support the
> concepts already provided by vbus/venet-tap driver. Isn't there already
> precedent for alternative approaches co-existing and having the users
> decide which is the most appropriate for their use case? Switching
> drivers in order to improve latency for a certain class of applications
> would seem like something latency sensitive users would be more than
> willing to do. I'd like to point out 2 things. Greg has offered help
> in moving virtio into the vbus infrastructure. The vbus infrastructure
> is a large part of what is being proposed here.
>   

vbus (if I understand it right) is a whole package of things:

- a way to enumerate, discover, and manage devices

That part duplicates PCI and it would be pretty hard to convince me we 
need to move to something new.  virtio-pci (a) works, (b) works on Windows.

- a different way of doing interrupts

Again, the need to paravirtualize kills this on Windows (I think).

- a different ring layout, and splitting notifications from the ring

I don't see the huge win here.

- placing the host part in the host kernel

Nothing vbus-specific here.

Switching drivers is unfortunately not easy on Linux as you need a new 
kernel; it's easier on Windows once you have the drivers written.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:27   ` Avi Kivity
@ 2009-04-02 15:31     ` Gregory Haskins
  2009-04-02 15:49       ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 15:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev


Avi Kivity wrote:
> Patrick Mullaney wrote:
>> On Thu, 2009-04-02 at 16:27 +0300, Avi Kivity wrote:
>>
>>  
>>> virtio is a stable ABI.
>>>
>>>    
>>>> However, theres still the possibility we can make this work in an ABI
>>>> friendly way with cap-bits, or other such features.  For instance, the
>>>> virtio-net driver could register both with pci and vbus-proxy and
>>>> instantiate a device with a slightly different ops structure for
>>>> each or
>>>> something.  Alternatively we could write a host-side shim to expose
>>>> vbus
>>>> devices as pci devices or something like that.
>>>>         
>>> Sounds complicated...
>>>
>>>     
>>
>> IMO, it doesn't sound anymore complicated than making virtio support the
>> concepts already provided by vbus/venet-tap driver. Isn't there already
>> precedent for alternative approaches co-existing and having the users
>> decide which is the most appropriate for their use case? Switching
>> drivers in order to improve latency for a certain class of applications
>> would seem like something latency sensitive users would be more than
>> willing to do. I'd like to point out 2 things. Greg has offered help
>> in moving virtio into the vbus infrastructure. The vbus infrastructure
>> is a large part of what is being proposed here.
>>   
>
> vbus (if I understand it right) is a whole package of things:
>
> - a way to enumerate, discover, and manage devices

Yes
>
> That part duplicates PCI

Yes, but the important thing to point out is that it doesn't *replace*
PCI.  It is simply an alternative.

> and it would be pretty hard to convince me we need to move to
> something new

But that's just it.  You don't *need* to move.  The two can coexist side
by side peacefully.  "vbus" just ends up being another device that may
or may not be present, and that may or may not have devices on it.  In
fact, during all this testing I was booting my guest with "eth0" as
virtio-net, and "eth1" as venet.  They both worked totally fine and
harmoniously.  The guest simply discovers whether vbus is supported via a
cpuid feature bit and dynamically adds it if present.
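
For illustration only, a minimal sketch of what such a guest-side check
might look like; the CPUID leaf and feature bit below are placeholders,
not the values used by the actual patches:

#include <linux/types.h>
#include <asm/processor.h>

/* placeholder leaf/bit; illustrative only, not the real vbus ABI */
#define EXAMPLE_VBUS_CPUID_LEAF   0x40000002
#define EXAMPLE_VBUS_FEATURE_BIT  (1 << 0)

static bool example_vbus_detect(void)
{
	unsigned int eax, ebx, ecx, edx;

	cpuid(EXAMPLE_VBUS_CPUID_LEAF, &eax, &ebx, &ecx, &edx);
	return eax & EXAMPLE_VBUS_FEATURE_BIT;
}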

> .  virtio-pci (a) works,
And it will continue to work

> (b) works on Windows.

virtio will continue to work on Windows, as well.  And if one of my
customers wants vbus support on Windows and is willing to pay us to
develop it, we can support *it* there as well.
>
>
> - a different way of doing interrupts
Yeah, but this is ok.  And I am not against doing that mod we talked
about earlier where I replace dynirq with a pci shim to represent the
vbus.  Question about that: does userspace support emulation of MSI
interrupts?  I would probably prefer it if I could keep the vbus IRQ (or
IRQs when I support MQ) from being shared.  It seems registering the
vbus as an MSI device would be more conducive to avoiding this.

>
> Again, the need to paravirtualize kills this on Windows (I think).
Not really.  It's the same thing conceptually as virtio, except I am not
riding on PCI so I would need to manage this somehow.  Its support would
not be "free", but I don't think the ability to support this new bus type
is ultimately predicated on having PCI support.  But like I said, this
is really vbus's problem.  virtio will continue to work, and customer
funding (or a dev volunteer) will dictate whether Windows can support vbus
as well.  Right now I am perfectly willing to accept that Windows guests
have no ability to access the feature.

>
> - a different ring layout, and splitting notifications from the ring
Again, virtio will continue to work.  And if we cannot find a way to
collapse virtio and ioq together in a way that everyone agrees on, there
is no harm in having two.  I have no problem saying I will maintain
IOQ.  There is plenty of precedent for multiple ways to do the same thing.

>
>
> I don't see the huge win here
>
> - placing the host part in the host kernel
>
> Nothing vbus-specific here.

Well, it depends on what you want.  Do you want an implementation that is
virtio-net, kvm, and pci specific while being hardcoded in?  What
happens when someone wants to access it but doesn't support pci?  What if
something like lguest wants to use it too?  What if you want
virtio-block next?  This is one extreme.

The other extreme is the direction I have gone, which is dynamically
loaded/instantiated generic objects that can work with kvm or whatever
subsystem wants to write a vbus-connector for them.  I realize this is
more complex.  It is also more flexible.  Everything has a cost, though I
will point out that a good portion of the cost has already been paid for
by me and my employer ;)

So yeah, it doesn't *need* vbus to do this.  This is just one of many
things that could be done between the two extremes.  But I didn't
design this thing to be some randomly coded amorphous blob that I am now
trying to miraculously shoehorn into KVM.  I designed it from the start
as what I felt a good virtual IO facility could be when starting with a
clean slate, keeping KVM as a primary target application the whole
time.  It is unfortunate that we, I think, disagree on the value add of
PCI, and that in the end may prevent vbus from ever being a transport for
an ABI-compatible virtio-net implementation.  However, that also doesn't
mean it isn't useful in other contexts outside of this one particular
type of IO.

-Greg




* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:31     ` Gregory Haskins
@ 2009-04-02 15:49       ` Avi Kivity
  2009-04-02 16:06         ` Herbert Xu
  2009-04-02 17:44         ` Gregory Haskins
  0 siblings, 2 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 15:49 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

Gregory Haskins wrote:
>> vbus (if I understand it right) is a whole package of things:
>>
>> - a way to enumerate, discover, and manage devices
>>     
>
> Yes
>   
>> That part duplicates PCI
>>     
>
> Yes, but the important thing to point out is it doesn't *replace* PCI. 
> It simply an alternative.
>   

Does it offer substantial benefits over PCI?  If not, it's just extra code.

Note that virtio is not tied to PCI, so "vbus is generic" doesn't count.

>> and it would be pretty hard to convince me we need to move to
>> something new
>>     
>
> But thats just it.  You don't *need* to move.  The two can coexist side
> by side peacefully.  "vbus" just ends up being another device that may
> or may not be present, and that may or may not have devices on it.  In
> fact, during all this testing I was booting my guest with "eth0" as
> virtio-net, and "eth1" as venet.  The both worked totally fine and
> harmoniously.  The guest simply discovers if vbus is supported via a
> cpuid feature bit and dynamically adds it if present.
>   

I meant, move the development effort, testing, installed base, Windows 
drivers.

>   
>> .  virtio-pci (a) works,
>>     
> And it will continue to work
>   

So why add something new?

>   
>> (b) works on Windows.
>>     
>
> virtio will continue to work on windows, as well.  And if one of my
> customers wants vbus support on windows and is willing to pay us to
> develop it, we can support *it* there as well.
>   

I don't want to develop and support both virtio and vbus.  And I 
certainly don't want to depend on your customers.

>> - a different way of doing interrupts
>>     
> Yeah, but this is ok.  And I am not against doing that mod we talked
> about earlier where I replace dynirq with a pci shim to represent the
> vbus.  Question about that: does userspace support emulation of MSI
> interrupts?  

Yes, this is new.  See the interrupt routing stuff I mentioned.  It's 
probably only in kvm.git, not even in 2.6.30.

> I would probably prefer it if I could keep the vbus IRQ (or
> IRQs when I support MQ) from being shared.  It seems registering the
> vbus as an MSI device would be more conducive to avoiding this.
>   

I still think you want one MSI per device rather than one MSI per vbus,
to avoid scaling problems on large guests.  After Herbert's let loose on
the code, one MSI per queue.
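
To make the "one vector per queue" idea concrete, here is a rough sketch
using the standard PCI MSI-X API of that era (pci_enable_msix() plus
request_irq()); it is a generic illustration, not code from virtio or
vbus:

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/slab.h>

static int alloc_per_queue_vectors(struct pci_dev *pdev, int nr_queues,
				   irq_handler_t handler, void *priv)
{
	struct msix_entry *entries;
	int i, err;

	entries = kcalloc(nr_queues, sizeof(*entries), GFP_KERNEL);
	if (!entries)
		return -ENOMEM;

	for (i = 0; i < nr_queues; i++)
		entries[i].entry = i;

	/* 0 on success; > 0 means fewer vectors are available */
	err = pci_enable_msix(pdev, entries, nr_queues);
	if (err)
		goto out;

	for (i = 0; i < nr_queues; i++) {
		err = request_irq(entries[i].vector, handler, 0,
				  "per-queue-example", priv);
		if (err)
			goto out;  /* real code would unwind prior IRQs */
	}
out:
	kfree(entries);
	return err;
}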


>> - a different ring layout, and splitting notifications from the ring
>>     
> Again, virtio will continue to work.  And if we cannot find a way to
> collapse virtio and ioq together in a way that everyone agrees on, there
> is no harm in having two.  I have no problem saying I will maintain
> IOQ.  There is plenty of precedent for multiple ways to do the same thing.
>   

IMO we should just steal whatever makes ioq better, and credit you in 
some file no one reads.  We get backwards compatibility, Windows 
support, continuity, etc.

>> I don't see the huge win here
>>
>> - placing the host part in the host kernel
>>
>> Nothing vbus-specific here.
>>     
>
> Well, it depends on what you want.  Do you want a implementation that is
> virtio-net, kvm, and pci specific while being hardcoded in?

No.  virtio is already not kvm or pci specific.  Definitely all the pci 
emulation parts will remain in user space.

>   What
> happens when someone wants to access it but doesnt support pci?  What if
> something like lguest wants to use it too?  What if you want
> virtio-block next?  This is one extreme.
>   

It works out well on the guest side, so it can work on the host side.  
We have virtio bindings for pci, s390, and of course lguest.  virtio 
itself is agnostic to all of these.  The main difference from vbus is 
that it's guest-only, but could easily be extended to the host side if 
we break down and do things in the kernel.


-- 
error compiling committee.c: too many arguments to function


* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:49       ` Avi Kivity
@ 2009-04-02 16:06         ` Herbert Xu
  2009-04-02 16:51           ` Avi Kivity
  2009-04-02 17:44         ` Gregory Haskins
  1 sibling, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 16:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Patrick Mullaney, anthony, andi, Peter Morreale,
	rusty, agraf, kvm, linux-kernel, netdev

On Thu, Apr 02, 2009 at 06:49:22PM +0300, Avi Kivity wrote:
>
> I still think you want one MSI per device rather than one MSI per vbus,  
> to avoid scaling problems on large guest.  After Herbert's let loose on  
> the code, one MSI per queue.

Yes, one MSI per TX queue, and one per RX queue :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:06         ` Herbert Xu
@ 2009-04-02 16:51           ` Avi Kivity
  0 siblings, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 16:51 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Patrick Mullaney, anthony, andi, Peter Morreale,
	rusty, agraf, kvm, linux-kernel, netdev

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:49:22PM +0300, Avi Kivity wrote:
>   
>> I still think you want one MSI per device rather than one MSI per vbus,  
>> to avoid scaling problems on large guest.  After Herbert's let loose on  
>> the code, one MSI per queue.
>>     
>
> Yes, one MSI per TX queue, and one per RX queue :)
>
>   

We're currently limited to 1024, so go wild :)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:49       ` Avi Kivity
  2009-04-02 16:06         ` Herbert Xu
@ 2009-04-02 17:44         ` Gregory Haskins
  2009-04-03 11:43           ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 17:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev


Avi Kivity wrote:
> Gregory Haskins wrote:
>>> vbus (if I understand it right) is a whole package of things:
>>>
>>> - a way to enumerate, discover, and manage devices
>>>     
>>
>> Yes
>>  
>>> That part duplicates PCI
>>>     
>>
>> Yes, but the important thing to point out is it doesn't *replace*
>> PCI. It simply an alternative.
>>   
>
> Does it offer substantial benefits over PCI?  If not, it's just extra
> code.

First of all, do you think I would spend time designing it if I didn't
think so? :)

Second of all, I want to use vbus for other things that do not speak PCI
natively (like userspace, for instance...and if I am gleaning this
correctly, lguest doesn't either).

PCI sounds good at first, but I believe it's a false economy.  It was
designed, of course, to be a hardware solution, so it carries all this
baggage derived from hardware constraints that simply do not exist in a
pure software world and that have to be emulated.  Things like the fixed
length and centrally managed PCI-IDs, PIO config cycles, BARs,
pci-irq-routing, etc.  While emulation of PCI is invaluable for
executing unmodified guests, it's not strictly necessary from a
paravirtual software perspective...PV software is inherently already
aware of its context and can therefore use the best mechanism
appropriate from a broader selection of choices.

If we insist that PCI is the only interface we can support and we want
to do something, say, in the kernel for instance, we have to have either
something like the ICH model in the kernel (and really all of the pci
chipset models that qemu supports), or a hacky hybrid userspace/kernel
solution.  I think this is what you are advocating, but I'm sorry: IMO
that's just gross and unnecessary gunk.  Let's stop beating around the
bush and just define the 4-5 hypercall verbs we need and be done with
it.  :)
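
Purely to illustrate the scale being argued for, a hypothetical handful
of verbs might look like the sketch below; the names and numbers are
invented for this example and are not the interface in these patches:

#include <asm/kvm_para.h>

/* invented verb numbers, for illustration only */
enum {
	HC_EXAMPLE_DEVOPEN   = 100,	/* open a device context      */
	HC_EXAMPLE_DEVCALL   = 101,	/* synchronous device call    */
	HC_EXAMPLE_DEVSHM    = 102,	/* register shared memory     */
	HC_EXAMPLE_DEVSIGNAL = 103,	/* kick the host side         */
	HC_EXAMPLE_DEVCLOSE  = 104,	/* tear the context back down */
};

static long example_devsignal(unsigned long devid, unsigned long queue)
{
	/* guest->host notification collapses to a single hypercall */
	return kvm_hypercall2(HC_EXAMPLE_DEVSIGNAL, devid, queue);
}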

FYI: The guest support for this is not really *that* much code IMO.
 
 drivers/vbus/proxy/Makefile      |    2
 drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++

and plus, I'll gladly maintain it :)

I mean, it's not like new buses do not get defined from time to time.
Should the computing industry stop coming up with new bus types because
they are afraid that the Windows ABI only speaks PCI?  No, they just
develop a new driver for whatever the bus is and are done with it.  This
is really no different.

>
> Note that virtio is not tied to PCI, so "vbus is generic" doesn't count.
Well, preserving the existing virtio-net on x86 ABI is tied to PCI,
which is what I was referring to.  Sorry for the confusion.

>
>>> and it would be pretty hard to convince me we need to move to
>>> something new
>>>     
>>
>> But thats just it.  You don't *need* to move.  The two can coexist side
>> by side peacefully.  "vbus" just ends up being another device that may
>> or may not be present, and that may or may not have devices on it.  In
>> fact, during all this testing I was booting my guest with "eth0" as
>> virtio-net, and "eth1" as venet.  The both worked totally fine and
>> harmoniously.  The guest simply discovers if vbus is supported via a
>> cpuid feature bit and dynamically adds it if present.
>>   
>
> I meant, move the development effort, testing, installed base, Windows
> drivers.

Again, I will maintain this feature, and it's completely off to the
side.  Turn it off in the config, or do not enable it in qemu, and it's
like it never existed.  Worst case is it gets reverted if you don't like
it.  Aside from the last few kvm-specific patches, the rest is no
different than the greater Linux environment.  E.g. if I update the
venet driver upstream, it's conceptually no different than someone else
updating e1000, right?

>
>>  
>>> .  virtio-pci (a) works,
>>>     
>> And it will continue to work
>>   
>
> So why add something new?

I was hoping this was becoming clear by now, but apparently I am doing a
poor job of articulating things. :(  I think we got bogged down in the
802.x performance discussion and lost sight of what we are trying to
accomplish with the core infrastructure.

So this core vbus infrastructure is for generic, in-kernel IO models. 
As a first pass, we have implemented a kvm-connector, which lets kvm
guest kernels have access to the bus.  We also have a userspace
connector (which I haven't pushed yet due to remaining issues being
ironed out) which allows userspace applications to interact with the
devices as well.  As a prototype, we built "venet" to show how it all works.

In the future, we want to use this infrastructure to build IO models for
various things like high performance fabrics and guest bypass
technologies, etc.  For instance, guest userspace connections to RDMA
devices in the kernel, etc.

>
>>  
>>> (b) works on Windows.
>>>     
>>
>> virtio will continue to work on windows, as well.  And if one of my
>> customers wants vbus support on windows and is willing to pay us to
>> develop it, we can support *it* there as well.
>>   
>
> I don't want to develop and support both virtio and vbus.  And I
> certainly don't want to depend on your customers.

So don't.  I'll maintain the drivers and the infrastructure.  All we are
talking about here is the possible acceptance of my kvm-connector patches
*after* the broader LKML community accepts the core infrastructure,
assuming that happens.

You can always just state that you do not support enabling the feature. 
Bug reports with it enabled go to me, etc.

If that is still not acceptable and you are ultimately not interested in
any kind of merge/collaboration:  At the very least, I hope we can get
some very trivial patches in for registering things like the
KVM_CAP_VBUS bits for vbus so I can present a stable ABI to anyone
downstream from me.  Those things have been shifting on me a lot lately ;)

>
>
>>> - a different way of doing interrupts
>>>     
>> Yeah, but this is ok.  And I am not against doing that mod we talked
>> about earlier where I replace dynirq with a pci shim to represent the
>> vbus.  Question about that: does userspace support emulation of MSI
>> interrupts?  
>
> Yes, this is new.  See the interrupt routing stuff I mentioned.  It's
> probably only in kvm.git, not even in 2.6.30.
Cool, will check out, thanks.

>
>> I would probably prefer it if I could keep the vbus IRQ (or
>> IRQs when I support MQ) from being shared.  It seems registering the
>> vbus as an MSI device would be more conducive to avoiding this.
>>   
>
> I still think you want one MSI per device rather than one MSI per
> vbus, to avoid scaling problems on large guest.  After Herbert's let
> loose on the code, one MSI per queue.

This is trivial for me to support with just a few tweaks to the kvm
host/guest connector patches.

>
>
>
>>> - a different ring layout, and splitting notifications from the ring
>>>     
>> Again, virtio will continue to work.  And if we cannot find a way to
>> collapse virtio and ioq together in a way that everyone agrees on, there
>> is no harm in having two.  I have no problem saying I will maintain
>> IOQ.  There is plenty of precedent for multiple ways to do the same
>> thing.
>>   
>
> IMO we should just steal whatever makes ioq better, and credit you in
> some file no one reads.  We get backwards compatibility, Windows
> support, continuity, etc.
>
>>> I don't see the huge win here
>>>
>>> - placing the host part in the host kernel
>>>
>>> Nothing vbus-specific here.
>>>     
>>
>> Well, it depends on what you want.  Do you want a implementation that is
>> virtio-net, kvm, and pci specific while being hardcoded in?
>
> No.  virtio is already not kvm or pci specific.  Definitely all the
> pci emulation parts will remain in user space.

blech :)

-Greg




* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 17:44         ` Gregory Haskins
@ 2009-04-03 11:43           ` Avi Kivity
  2009-04-03 14:58             ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 11:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

Gregory Haskins wrote:
>>> Yes, but the important thing to point out is it doesn't *replace*
>>> PCI. It simply an alternative.
>>>   
>>>       
>> Does it offer substantial benefits over PCI?  If not, it's just extra
>> code.
>>     
>
> First of all, do you think I would spend time designing it if I didn't
> think so? :)
>   

I'll rephrase.  What are the substantial benefits that this offers over PCI?

> Second of all, I want to use vbus for other things that do not speak PCI
> natively (like userspace for instance...and if I am gleaning this
> correctly, lguest doesnt either).
>   

And virtio supports lguest and s390.  virtio is not PCI specific.

However, for the PC platform, PCI has distinct advantages.  What 
advantages does vbus have for the PC platform?

> PCI sounds good at first, but I believe its a false economy.  It was
> designed, of course, to be a hardware solution, so it carries all this
> baggage derived from hardware constraints that simply do not exist in a
> pure software world and that have to be emulated.  Things like the fixed
> length and centrally managed PCI-IDs, 

Not a problem in practice.

> PIO config cycles, BARs,
> pci-irq-routing, etc.  

What are the problems with these?

> While emulation of PCI is invaluable for
> executing unmodified guest, its not strictly necessary from a
> paravirtual software perspective...PV software is inherently already
> aware of its context and can therefore use the best mechanism
> appropriate from a broader selection of choices.
>   

It's also not necessary to invent a new bus.  We need a positive
advantage; we don't do things just because we can (and then lose the
real advantages PCI has).

> If we insist that PCI is the only interface we can support and we want
> to do something, say, in the kernel for instance, we have to have either
> something like the ICH model in the kernel (and really all of the pci
> chipset models that qemu supports), or a hacky hybrid userspace/kernel
> solution.  I think this is what you are advocating, but im sorry. IMO
> that's just gross and unecessary gunk.  

If we go for a kernel solution, a hybrid solution is the best IMO.  I 
have no idea what's wrong with it.

The guest would discover and configure the device using normal PCI 
methods.  Qemu emulates the requests, and configures the kernel part 
using normal Linux syscalls.  The nice thing is, kvm and the kernel part 
don't even know about each other, except for a way for hypercalls to 
reach the device and a way for interrupts to reach kvm.
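
One way to picture that split is sketched below; the device node, ioctl
number, and structure are invented for illustration and do not describe
an existing interface:

#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdint.h>

struct example_ring_setup {
	uint64_t ring_gpa;	/* guest-physical address of the ring  */
	uint32_t ring_size;	/* number of descriptors               */
	uint32_t irq;		/* interrupt to raise toward the guest */
};

#define EXAMPLE_ATTACH_RING _IOW('x', 1, struct example_ring_setup)

/* called from qemu: PCI config emulation stays in userspace, and only
 * the data path is handed to a hypothetical in-kernel backend */
static int attach_fast_path(uint64_t ring_gpa, uint32_t size, uint32_t irq)
{
	struct example_ring_setup setup = {
		.ring_gpa = ring_gpa, .ring_size = size, .irq = irq,
	};
	int fd = open("/dev/example-net-backend", O_RDWR);

	if (fd < 0)
		return -1;
	return ioctl(fd, EXAMPLE_ATTACH_RING, &setup);
}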

> Lets stop beating around the
> bush and just define the 4-5 hypercall verbs we need and be done with
> it.  :)
>
> FYI: The guest support for this is not really *that* much code IMO.
>  
>  drivers/vbus/proxy/Makefile      |    2
>  drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
>   

Does it support device hotplug and hotunplug?  Can vbus interrupts be 
load balanced by irqbalance?  Can guest userspace enumerate devices?  
Module autoloading support?  pxe booting?

Plus a port to Windows, enterprise Linux distros based on 2.6.dead, and
possibly less mainstream OSes.

> and plus, I'll gladly maintain it :)
>
> I mean, its not like new buses do not get defined from time to time. 
> Should the computing industry stop coming up with new bus types because
> they are afraid that the windows ABI only speaks PCI?  No, they just
> develop a new driver for whatever the bus is and be done with it.  This
> is really no different.
>   

As a matter of fact, a new bus was developed recently called PCI 
express.  It uses new slots, new electricals, it's not even a bus 
(routers + point-to-point links), new everything except that the 
software model was 1000000000000% compatible with traditional PCI.  
That's how much people are afraid of the Windows ABI.

>> Note that virtio is not tied to PCI, so "vbus is generic" doesn't count.
>>     
> Well, preserving the existing virtio-net on x86 ABI is tied to PCI,
> which is what I was referring to.  Sorry for the confusion.
>   

virtio-net knows nothing about PCI.  If you have a problem with PCI, 
write virtio-blah for a new bus.  Though I still don't understand why.
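
As a sketch of what "knows nothing about PCI" means in practice, a
virtio driver binds to the virtio bus purely by virtio device ID; the ID
and callbacks below are made up, but the structures and
register_virtio_driver() are the real guest-side API of this era:

#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

static struct virtio_device_id example_id_table[] = {
	{ 42, VIRTIO_DEV_ANY_ID },	/* hypothetical device ID */
	{ 0 },
};

static int example_probe(struct virtio_device *vdev)
{
	/* set up virtqueues via vdev->config; no PCI types anywhere */
	return 0;
}

static void example_remove(struct virtio_device *vdev)
{
}

static struct virtio_driver example_driver = {
	.driver.name	= "virtio-example",
	.driver.owner	= THIS_MODULE,
	.id_table	= example_id_table,
	.probe		= example_probe,
	.remove		= example_remove,
};

static int __init example_init(void)
{
	return register_virtio_driver(&example_driver);
}
module_init(example_init);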

  

>> I meant, move the development effort, testing, installed base, Windows
>> drivers.
>>     
>
> Again, I will maintain this feature, and its completely off to the
> side.  Turn it off in the config, or do not enable it in qemu and its
> like it never existed.  Worst case is it gets reverted if you don't like
> it.  Aside from the last few kvm specific patches, the rest is no
> different than the greater linux environment.  E.g. if I update the
> venet driver upstream, its conceptually no different than someone else
> updating e1000, right?
>   

I have no objections to you maintaining vbus, though I'd much prefer if 
we can pool our efforts and cooperate on having one good set of drivers.

I think you're integrating too tightly with kvm, which is likely to 
cause problems when kvm evolves.  The way I'd do it is:

- drop all mmu integration; instead, have your devices maintain their 
own slots layout and use copy_to_user()/copy_from_user() (or 
get_user_pages_fast()).
- never use vmap like structures for more than the length of a request
- for hypercalls, add kvm_register_hypercall_handler()
- for interrupts, see the interrupt routing thingie and have an 
in-kernel version of the KVM_IRQ_LINE ioctl.

This way, the parts that go into kvm know nothing about vbus, you're not 
pinning any memory, and the integration bits can be used for other 
purposes.
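
For concreteness, one possible shape for the hypercall registration
interface suggested above might be the following; nothing by this name
exists in kvm at this point, so treat it purely as a sketch:

#include <linux/list.h>
#include <linux/types.h>

struct kvm;	/* opaque to the module registering the handler */

struct kvm_hypercall_handler {
	struct list_head list;
	unsigned long nr;	/* hypercall number being claimed */
	int (*handler)(struct kvm *kvm, unsigned long a0, unsigned long a1,
		       unsigned long a2, unsigned long a3, long *ret);
	void *priv;
};

/* hypothetical entry points that kvm would export */
int kvm_register_hypercall_handler(struct kvm *kvm,
				   struct kvm_hypercall_handler *h);
void kvm_unregister_hypercall_handler(struct kvm *kvm,
				      struct kvm_hypercall_handler *h);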

  

>> So why add something new?
>>     
>
> I was hoping this was becoming clear by now, but apparently I am doing a
> poor job of articulating things. :(  I think we got bogged down in the
> 802.x performance discussion and lost sight of what we are trying to
> accomplish with the core infrastructure.
>
> So this core vbus infrastructure is for generic, in-kernel IO models. 
> As a first pass, we have implemented a kvm-connector, which lets kvm
> guest kernels have access to the bus.  We also have a userspace
> connector (which I haven't pushed yet due to remaining issues being
> ironed out) which allows userspace applications to interact with the
> devices as well.  As a prototype, we built "venet" to show how it all works.
>
> In the future, we want to use this infrastructure to build IO models for
> various things like high performance fabrics and guest bypass
> technologies, etc.  For instance, guest userspace connections to RDMA
> devices in the kernel, etc.
>   

I think virtio can be used for much of the same things.  There's nothing 
in virtio that implies guest/host, or pci, or anything else.  It's 
similar to your shm/signal and ring abstractions except virtio folds 
them together.  Is this folding the main problem?

As far as I can tell, everything around it just duplicates existing 
infrastructure (which may be old and crusty, but so what) without added 
value.

>>
>> I don't want to develop and support both virtio and vbus.  And I
>> certainly don't want to depend on your customers.
>>     
>
> So don't.  Ill maintain the drivers and the infrastructure.  All we are
> talking here is the possible acceptance of my kvm-connector patches
> *after* the broader LKML community accepts the core infrastructure,
> assuming that happens.
>   

As I mentioned above, I'd much rather we cooperate rather than fragment 
the development effort (and user base).

Regarding kvm-connector, see my more generic suggestion above.  That 
would work for virtio-in-kernel as well.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:43           ` Avi Kivity
@ 2009-04-03 14:58             ` Gregory Haskins
  2009-04-03 15:37               ` Avi Kivity
  2009-04-03 17:09               ` Chris Wright
  0 siblings, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 14:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev


Hi Avi,

 I think we have since covered these topics later in the thread, but in
case you wanted to know my thoughts here:

Avi Kivity wrote:
> Gregory Haskins wrote:
>>>> Yes, but the important thing to point out is it doesn't *replace*
>>>> PCI. It simply an alternative.
>>>>         
>>> Does it offer substantial benefits over PCI?  If not, it's just extra
>>> code.
>>>     
>>
>> First of all, do you think I would spend time designing it if I didn't
>> think so? :)
>>   
>
> I'll rephrase.  What are the substantial benefits that this offers
> over PCI?

Simplicity and optimization.  You don't need most of the junk that comes
with PCI.  It's all overhead and artificial constraints.  You really only
need things like a handful of hypercall verbs and that's it.

>
>> Second of all, I want to use vbus for other things that do not speak PCI
>> natively (like userspace for instance...and if I am gleaning this
>> correctly, lguest doesnt either).
>>   
>
> And virtio supports lguest and s390.  virtio is not PCI specific.
I understand that.  We keep getting wrapped around the axle on this
one.  At some point in the discussion we were talking about supporting
the existing guest ABI without changing the guest at all.  So while I
totally understand that virtio can work over various transports, I am
referring to what would be needed to have existing-ABI guests work with
an in-kernel version.  This may or may not be an actual requirement.

>
> However, for the PC platform, PCI has distinct advantages.  What
> advantages does vbus have for the PC platform?
To reiterate: IMO simplicity and optimization.  It's designed
specifically for PV use, which is software to software.

>
>> PCI sounds good at first, but I believe its a false economy.  It was
>> designed, of course, to be a hardware solution, so it carries all this
>> baggage derived from hardware constraints that simply do not exist in a
>> pure software world and that have to be emulated.  Things like the fixed
>> length and centrally managed PCI-IDs, 
>
> Not a problem in practice.

Perhaps, but it's just one more constraint that isn't actually needed.
It's like the cvs vs. git debate.  Why have it centrally managed when you
don't technically need to?  Sure, centrally managed works, but I'd
rather not deal with it if there were a better option.

>
>> PIO config cycles, BARs,
>> pci-irq-routing, etc.  
>
> What are the problems with these?

1) PIOs are still less efficient to decode than a hypercall vector.  We
don't need to pretend we are hardware...the guest already knows what's
underneath them.  Use the most efficient call method.

2) BARs?  No one in their right mind should use an MMIO BAR for PV. :)
The last thing we want to do is cause page faults here.  Don't use them,
period.  (This is where something like the vbus::shm() interface comes in)

3) pci-irq routing was designed to accommodate etch constraints on a
piece of silicon that doesn't actually exist in kvm.  Why would I want
to pretend I have PCI A,B,C,D lines that route to a pin on an IOAPIC? 
Forget all that stuff and just inject an IRQ directly.  This gets much
better with MSI, I admit, but you hopefully catch my drift now.

One of my primary design objectives with vbus was to a) reduce the
signaling as much as possible, and b) reduce the cost of signaling.  
That is why I do things like use explicit hypercalls, aggregated
interrupts, bidir napi to mitigate signaling, the shm_signal::pending
mitigation, and avoiding going to userspace by running in the kernel. 
All of these things together help to form what I envision would be a
maximum performance transport.  Not all of these tricks are
interdependent (for instance, the bidir + full-duplex threading that I
do can be done in userspace too, as discussed).  They are just the
collective design elements that I think we need to make a guest perform
very close to its peak.  That is what I am after.
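
The shm_signal::pending idea reduces to a fairly generic pattern: only
raise a notification when the peer wants one and none is already
outstanding.  The sketch below shows that general shape only; it is not
the actual shm_signal implementation:

#include <linux/types.h>
#include <asm/system.h>		/* xchg() and barriers on 2.6.2x kernels */

struct example_signal {
	u32 enabled;	/* peer wants to be notified           */
	u32 pending;	/* a notification is already in flight */
};

/* producer side: returns true only when a kick is actually required */
static bool example_signal_needed(struct example_signal *s)
{
	wmb();			/* ring updates visible before the check */
	if (!s->enabled)
		return false;	/* peer is polling; no interrupt needed  */
	if (xchg(&s->pending, 1))
		return false;	/* a kick is already outstanding         */
	return true;
}

/* consumer side: clear "pending" before processing so new work re-kicks */
static void example_signal_ack(struct example_signal *s)
{
	xchg(&s->pending, 0);
	rmb();
}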

>
>> While emulation of PCI is invaluable for
>> executing unmodified guest, its not strictly necessary from a
>> paravirtual software perspective...PV software is inherently already
>> aware of its context and can therefore use the best mechanism
>> appropriate from a broader selection of choices.
>>   
>
> It's also not necessary to invent a new bus.
You are right, it's not strictly necessary for this to work.  It just
presents the opportunity to optimize as much as possible and to move away
from legacy constraints that no longer apply.  And since PV's sole purpose
is optimization, I was not really interested in going "half-way".

>   We need a positive advantage, we don't do things just because we can
> (and then lose the real advantages PCI has).

Agreed, but I assert there are advantages.  You may not think they
outweigh the cost, and that's your prerogative, but I think they are
still there nonetheless.

>
>> If we insist that PCI is the only interface we can support and we want
>> to do something, say, in the kernel for instance, we have to have either
>> something like the ICH model in the kernel (and really all of the pci
>> chipset models that qemu supports), or a hacky hybrid userspace/kernel
>> solution.  I think this is what you are advocating, but im sorry. IMO
>> that's just gross and unecessary gunk.  
>
> If we go for a kernel solution, a hybrid solution is the best IMO.  I
> have no idea what's wrong with it.

It's just that rendering these objects as PCI is overhead that you don't
technically need.  You only want this backwards compatibility because you
don't want to require a new bus driver in the guest, which is a perfectly
reasonable position to take.  But that doesn't mean it isn't a
compromise.  You are trading more complexity and overhead in the host
for simplicity in the guest.  I am trying to clean up this path going
forward.

>
> The guest would discover and configure the device using normal PCI
> methods.  Qemu emulates the requests, and configures the kernel part
> using normal Linux syscalls.  The nice thing is, kvm and the kernel
> part don't even know about each other, except for a way for hypercalls
> to reach the device and a way for interrupts to reach kvm.
>
>> Lets stop beating around the
>> bush and just define the 4-5 hypercall verbs we need and be done with
>> it.  :)
>>
>> FYI: The guest support for this is not really *that* much code IMO.
>>  
>>  drivers/vbus/proxy/Makefile      |    2
>>  drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
>>   
>
> Does it support device hotplug and hotunplug?
Yes, today (use "ln -s" in configfs to map a device to a bus, and the
guest will see the device immediately)

>   Can vbus interrupts be load balanced by irqbalance?

Yes (though support for the .affinity verb on the guest's irq-chip is
currently missing...but the backend support is there)


>   Can guest userspace enumerate devices?

Yes, it presents as a standard LDM device in things like /sys/bus/vbus_proxy

>   Module autoloading support?

Yes

>   pxe booting?
No, but this is something I don't think we need for now.  If it were
really needed it could be added, I suppose.  But there are other
alternatives already, so I am not putting this high on the priority
list.  (For instance, you can choose not to use vbus, or you can use
--kernel, etc.)

>
>
> Plus a port to Windows,

I've already said this is low on my list, but it could always be added if
someone cares that much.

> enerprise Linux distros based on 2.6.dead

That's easy, though there is nothing that says we need to.  This can be a
2.6.31-ish thing that they pick up next time.


> , and possibly less mainstream OSes.
>
>> and plus, I'll gladly maintain it :)
>>
>> I mean, its not like new buses do not get defined from time to time.
>> Should the computing industry stop coming up with new bus types because
>> they are afraid that the windows ABI only speaks PCI?  No, they just
>> develop a new driver for whatever the bus is and be done with it.  This
>> is really no different.
>>   
>
> As a matter of fact, a new bus was developed recently called PCI
> express.  It uses new slots, new electricals, it's not even a bus
> (routers + point-to-point links), new everything except that the
> software model was 1000000000000% compatible with traditional PCI. 
> That's how much people are afraid of the Windows ABI.

Come on, Avi.  Now you are being silly.  So should the USB designers
have tried to make it look like PCI too?  Should the PCI designers have
tried to make it look like ISA?  :)  Yes, there are advantages to making
something backwards compatible.  There are also disadvantages to
maintaining that backwards compatibility.

Let me ask you this:  If you had a clean slate and were designing a
hypervisor and a guest OS from scratch:  What would you make the bus
look like?

>
>>> Note that virtio is not tied to PCI, so "vbus is generic" doesn't
>>> count.
>>>     
>> Well, preserving the existing virtio-net on x86 ABI is tied to PCI,
>> which is what I was referring to.  Sorry for the confusion.
>>   
>
> virtio-net knows nothing about PCI.  If you have a problem with PCI,
> write virtio-blah for a new bus.
Can virtio-net use a backend other than virtio-pci?  Cool!  I
will look into that.  Perhaps that is what I need to make this work
smoothly.


>   Though I still don't understand why.
>
>  
>
>>> I meant, move the development effort, testing, installed base, Windows
>>> drivers.
>>>     
>>
>> Again, I will maintain this feature, and its completely off to the
>> side.  Turn it off in the config, or do not enable it in qemu and its
>> like it never existed.  Worst case is it gets reverted if you don't like
>> it.  Aside from the last few kvm specific patches, the rest is no
>> different than the greater linux environment.  E.g. if I update the
>> venet driver upstream, its conceptually no different than someone else
>> updating e1000, right?
>>   
>
> I have no objections to you maintaining vbus, though I'd much prefer
> if we can pool our efforts and cooperate on having one good set of
> drivers.

I agree, that would be ideal.

>
> I think you're integrating too tightly with kvm, which is likely to
> cause problems when kvm evolves.  The way I'd do it is:
>
> - drop all mmu integration; instead, have your devices maintain their
> own slots layout and use copy_to_user()/copy_from_user() (or
> get_user_pages_fast()).

> - never use vmap like structures for more than the length of a request

So does virtio also do demand loading in the backend?  Hmm.  I suppose
we could do this, but it will definitely affect the performance
somewhat.  I was thinking that the pages needed for the basic shm
components should be minimal, so it is a reasonable tradeoff to vmap
those in and only demand-load the payload.
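
For reference, the "pin only for the length of a request" style being
recommended looks roughly like the sketch below, using
get_user_pages_fast(); it is a generic illustration (and assumes the
copy does not cross a page boundary), not code from vbus or virtio:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

static int copy_in_request(void *dst, unsigned long user_addr, size_t len)
{
	struct page *page;
	void *vaddr;
	int ret;

	/* pin exactly one page, only for the duration of this request */
	ret = get_user_pages_fast(user_addr & PAGE_MASK, 1, 0, &page);
	if (ret != 1)
		return -EFAULT;

	vaddr = kmap(page);
	memcpy(dst, vaddr + (user_addr & ~PAGE_MASK), len);
	kunmap(page);

	put_page(page);		/* nothing stays pinned between requests */
	return 0;
}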

> - for hypercalls, add kvm_register_hypercall_handler()

This is a good idea.  In fact, I had something like this in my series
back when I posted it as "PV-IO" a year and a half ago.  I am not sure
why I took it out, but I will put it back.

> - for interrupts, see the interrupt routing thingie and have an
> in-kernel version of the KVM_IRQ_LINE ioctl.

Will do.

>
> This way, the parts that go into kvm know nothing about vbus, you're
> not pinning any memory, and the integration bits can be used for other
> purposes.
>
>  
>
>>> So why add something new?
>>>     
>>
>> I was hoping this was becoming clear by now, but apparently I am doing a
>> poor job of articulating things. :(  I think we got bogged down in the
>> 802.x performance discussion and lost sight of what we are trying to
>> accomplish with the core infrastructure.
>>
>> So this core vbus infrastructure is for generic, in-kernel IO models.
>> As a first pass, we have implemented a kvm-connector, which lets kvm
>> guest kernels have access to the bus.  We also have a userspace
>> connector (which I haven't pushed yet due to remaining issues being
>> ironed out) which allows userspace applications to interact with the
>> devices as well.  As a prototype, we built "venet" to show how it all
>> works.
>>
>> In the future, we want to use this infrastructure to build IO models for
>> various things like high performance fabrics and guest bypass
>> technologies, etc.  For instance, guest userspace connections to RDMA
>> devices in the kernel, etc.
>>   
>
> I think virtio can be used for much of the same things.  There's
> nothing in virtio that implies guest/host, or pci, or anything else. 
> It's similar to your shm/signal and ring abstractions except virtio
> folds them together.  Is this folding the main problem?
Right.  Virtio and ioq overlap, and they do so primarily because I
needed a ring that was compatible with some of my design ideas, yet I
didn't want to break the virtio ABI without a blessing first.  If the
signaling were not folded into virtio, that would be a great first step.  I
am not sure whether there would be other areas to address as well.
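
To make the "unfolded" layering concrete, the sketch below shows a ring
that merely points at a separate signal object; these structures are
conceptual only and are not the real ioq/shm_signal definitions:

struct example_signal_ops {
	int  (*inject)(void *priv);		/* raise the notification */
	void (*enable)(void *priv, int on);	/* peer toggles interest  */
};

struct example_signal {
	const struct example_signal_ops *ops;
	void *priv;
};

struct example_ring {
	void			*desc;		/* descriptors in shared memory     */
	unsigned int		 num;		/* number of descriptors            */
	struct example_signal	*signal;	/* notification is a separate layer */
};

/* producer path: update the ring, then let the signal layer decide */
static inline void example_ring_kick(struct example_ring *r)
{
	r->signal->ops->inject(r->signal->priv);
}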

>
> As far as I can tell, everything around it just duplicates existing
> infrastructure (which may be old and crusty, but so what) without
> added value.

I am not sure what you refer to with "everything around it".  Are you
talking about the vbus core? 

>
>>>
>>> I don't want to develop and support both virtio and vbus.  And I
>>> certainly don't want to depend on your customers.
>>>     
>>
>> So don't.  Ill maintain the drivers and the infrastructure.  All we are
>> talking here is the possible acceptance of my kvm-connector patches
>> *after* the broader LKML community accepts the core infrastructure,
>> assuming that happens.
>>   
>
> As I mentioned above, I'd much rather we cooperate rather than
> fragment the development effort (and user base).
Me too.  I think we can get to that point if I make some of the changes
you suggested above.

Thanks Avi,
-Greg





* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 14:58             ` Gregory Haskins
@ 2009-04-03 15:37               ` Avi Kivity
  2009-04-03 18:19                 ` Gregory Haskins
  2009-04-03 17:09               ` Chris Wright
  1 sibling, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 15:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

Gregory Haskins wrote:
>> I'll rephrase.  What are the substantial benefits that this offers
>> over PCI?
>>     
>
> Simplicity and optimization.  You don't need most of the junk that comes
> with PCI.  Its all overhead and artificial constraints.  You really only
> need things like a handful of hypercall verbs and thats it.
>
>   

Simplicity:

The guest already supports PCI.  It has to, since it was written to the 
PC platform, and since today it is fashionable to run kernels that 
support both bare metal and a hypervisor.  So you can't remove PCI from 
the guest.

The host also already supports PCI.  It has to, since it must support
guests which do not support vbus.  We can't remove PCI from the host.

You don't gain simplicity by adding things.  Sure, lguest is simple 
because it doesn't support PCI.  But Linux will forever support PCI, and 
Qemu will always support PCI.  You aren't simplifying anything by adding 
vbus.

Optimization:

Most of PCI (in our context) deals with configuration.  So removing it 
doesn't optimize anything, unless you're counting hotplugs-per-second or 
something.


>>> Second of all, I want to use vbus for other things that do not speak PCI
>>> natively (like userspace for instance...and if I am gleaning this
>>> correctly, lguest doesnt either).
>>>   
>>>       
>> And virtio supports lguest and s390.  virtio is not PCI specific.
>>     
> I understand that.  We keep getting wrapped around the axle on this
> one.   At some point in the discussion we were talking about supporting
> the existing guest ABI without changing the guest at all.  So while I
> totally understand the virtio can work over various transports, I am
> referring to what would be needed to have existing ABI guests work with
> an in-kernel version.  This may or may not be an actual requirement.
>   

There would be no problem supporting an in-kernel host virtio endpoint with
the existing guest/host ABI.  Nothing in the ABI assumes the host 
endpoint is in userspace.  Nothing in the implementation requires us to 
move any of the PCI stuff into the kernel.

In fact, we already have in-kernel sources of PCI interrupts: assigned
PCI devices (obviously, these have to use PCI).

>> However, for the PC platform, PCI has distinct advantages.  What
>> advantages does vbus have for the PC platform?
>>     
> To reiterate: IMO simplicity and optimization.  Its designed
> specifically for PV use, which is software to software.
>   

To avoid reiterating, please be specific about these advantages.

>   
>>> PCI sounds good at first, but I believe its a false economy.  It was
>>> designed, of course, to be a hardware solution, so it carries all this
>>> baggage derived from hardware constraints that simply do not exist in a
>>> pure software world and that have to be emulated.  Things like the fixed
>>> length and centrally managed PCI-IDs, 
>>>       
>> Not a problem in practice.
>>     
>
> Perhaps, but its just one more constraint that isn't actually needed. 
> Its like the cvs vs git debate.  Why have it centrally managed when you
> don't technically need it.  Sure, centrally managed works, but I'd
> rather not deal with it if there was a better option.
>   

We've allocated 3 PCI device IDs so far.  It's not a problem.  There are 
enough real problems out there.

>   
>>> PIO config cycles, BARs,
>>> pci-irq-routing, etc.  
>>>       
>> What are the problems with these?
>>     
>
> 1) PIOs are still less efficient to decode than a hypercall vector.  We
> dont need to pretend we are hardware..the guest already knows whats
> underneath them.  Use the most efficient call method.
>   

Last time we measured, hypercall overhead was the same as pio overhead.  
Both vmx and svm decode pio completely (except for string pio ...)

> 2) BARs?  No one in their right mind should use an MMIO BAR for PV. :)
> The last thing we want to do is cause page faults here.  Don't use them,
> period.  (This is where something like the vbus::shm() interface comes in)
>   

So don't use BARs for your fast path.  virtio places the ring in guest 
memory (like most real NICs).

> 3) pci-irq routing was designed to accommodate etch constraints on a
> piece of silicon that doesn't actually exist in kvm.  Why would I want
> to pretend I have PCI A,B,C,D lines that route to a pin on an IOAPIC? 
> Forget all that stuff and just inject an IRQ directly.  This gets much
> better with MSI, I admit, but you hopefully catch my drift now.
>   

True, PCI interrupts suck.  But this was fixed with MSI.  Why fix it again?

> One of my primary design objectives with vbus was to a) reduce the
> signaling as much as possible, and b) reduce the cost of signaling.  
> That is why I do things like use explicit hypercalls, aggregated
> interrupts, bidir napi to mitigate signaling, the shm_signal::pending
> mitigation, and avoiding going to userspace by running in the kernel. 
> All of these things together help to form what I envision would be a
> maximum performance transport.  Not all of these tricks are
> interdependent (for instance, the bidir + full-duplex threading that I
> do can be done in userspace too, as discussed).  They are just the
> collective design elements that I think we need to make a guest perform
> very close to its peak.  That is what I am after.
>
>   

None of these require vbus.  They can all be done with PCI.

> You are right, its not strictly necessary to work.  Its just presents
> the opportunity to optimize as much as possible and to move away from
> legacy constraints that no longer apply.  And since PVs sole purpose is
> about optimization, I was not really interested in going "half-way".
>   

What constraints?  Please be specific.

>>   We need a positive advantage, we don't do things just because we can
>> (and then lose the real advantages PCI has).
>>     
>
> Agreed, but I assert there are advantages.  You may not think they
> outweigh the cost, and thats your prerogative, but I think they are
> still there nonetheless.
>   

I'm not saying anything about what the advantages are worth and how they
compare to the cost.  I'm asking what the advantages are.  Please don't
just assert them into existence.

>>> If we insist that PCI is the only interface we can support and we want
>>> to do something, say, in the kernel for instance, we have to have either
>>> something like the ICH model in the kernel (and really all of the pci
>>> chipset models that qemu supports), or a hacky hybrid userspace/kernel
>>> solution.  I think this is what you are advocating, but im sorry. IMO
>>> that's just gross and unecessary gunk.  
>>>       
>> If we go for a kernel solution, a hybrid solution is the best IMO.  I
>> have no idea what's wrong with it.
>>     
>
> Its just that rendering these objects as PCI is overhead that you don't
> technically need.  You only want this backwards compat because you don't
> want to require a new bus-driver in the guest, which is a perfectly
> reasonable position to take.  But that doesn't mean it isn't a
> compromise.  You are trading more complexity and overhead in the host
> for simplicity in the guest.  I am trying to clean up this path for
> looking forward.
>   

All of this overhead is incurred at configuration time.  All the 
complexity already exists so we gain nothing by adding a competing 
implementation.  And making the guest complex in order to simplify the 
host is a pretty bad tradeoff considering we maintain one host but want 
to support many guests.

It's good to look forward, but in the vbus-dominated universe, what do 
we have that we don't have now?  Besides simplicity.

>> The guest would discover and configure the device using normal PCI
>> methods.  Qemu emulates the requests, and configures the kernel part
>> using normal Linux syscalls.  The nice thing is, kvm and the kernel
>> part don't even know about each other, except for a way for hypercalls
>> to reach the device and a way for interrupts to reach kvm.
>>
>>     
>>> Lets stop beating around the
>>> bush and just define the 4-5 hypercall verbs we need and be done with
>>> it.  :)
>>>
>>> FYI: The guest support for this is not really *that* much code IMO.
>>>  
>>>  drivers/vbus/proxy/Makefile      |    2
>>>  drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
>>>   
>>>       
>> Does it support device hotplug and hotunplug?
>>     
> Yes, today (use "ln -s" in configfs to map a device to a bus, and the
> guest will see the device immediately)
>   

Neat.

>   
>>   Can vbus interrupts be load balanced by irqbalance?
>>     
>
> Yes (tho support for the .affinity verb on the guests irq-chip is
> currently missing...but the backend support is there)
>
>
>   
>>   Can guest userspace enumerate devices?
>>     
>
> Yes, it presents as a standard LDM device in things like /sys/bus/vbus_proxy
>
>   
>>   Module autoloading support?
>>     
>
> Yes
>
>   

Cool, looks like you have a nice part covered.

>>   pxe booting?
>>     
> No, but this is something I don't think we need for now.  If it was
> really needed it could be added, I suppose.  But there are other
> alternatives already, so I am not putting this high on the priority
> list.  (For instance you can choose to not use vbus, or you can use
> --kernel, etc).
>
>   
>> Plus a port to Windows,
>>     
>
> I've already said this is low on my list, but it could always be added if
> someone cares that much
>   

That's unreasonable.  Windows is an important workload.

>   
>> enterprise Linux distros based on 2.6.dead
>>     
>
> That's easy, though there is nothing that says we need to.  This can be a
> 2.6.31ish thing that they pick up next time.
>   

Of course we need to.  RHEL 4/5 and their equivalents will live for a 
long time as guests.  Customers will expect good performance.


>> As a matter of fact, a new bus was developed recently called PCI
>> express.  It uses new slots, new electricals, it's not even a bus
>> (routers + point-to-point links), new everything except that the
>> software model was 1000000000000% compatible with traditional PCI. 
>> That's how much people are afraid of the Windows ABI.
>>     
>
> Come on, Avi.  Now you are being silly.  So should the USB designers
> have tried to make it look like PCI too?  Should the PCI designers have
> tried to make it look like ISA?  :)  Yes, there are advantages to making
> something backwards compatible.  There are also disadvantages to
> maintaining that backwards compatibility.
>   

Most PCI chipsets include an ISA bridge, at least until recently.

> Let me ask you this:  If you had a clean slate and were designing a
> hypervisor and a guest OS from scratch:  What would you make the bus
> look like?
>   

If there were no installed base to cater for, the avi-bus would blow 
anything out of the water.  It would be so shiny and new to make you cry 
in envy.  It would strongly compete with lguest and steal its two users.

Back on earth, there are a hundred gazillion machines with good old x86, 
booting through 1978 era real mode, jumping over the 640K memory barrier 
(est. 1981), running BIOS code which was probably written in the 14th 
century, and sporting a PCI-compatible peripheral bus.

This is not an academic exercise, we're not trying to develop the most 
aesthetically pleasing stack.  We need to be pragmatic so we can provide 
users with real value, not provide ourselves with software design 
entertainment (nominally called wanking on lkml, but kvm@ is a kinder, 
gentler list).

>> virtio-net knows nothing about PCI.  If you have a problem with PCI,
>> write virtio-blah for a new bus.
>>     
> Can virtio-net use a different backend other than virtio-pci?  Cool!  I
> will look into that.  Perhaps that is what I need to make this work
> smoothly.
>   

virtio-net (all virtio devices, actually) supports three platforms 
today.  PCI, lguest, and s390.

>> I think you're integrating too tightly with kvm, which is likely to
>> cause problems when kvm evolves.  The way I'd do it is:
>>
>> - drop all mmu integration; instead, have your devices maintain their
>> own slots layout and use copy_to_user()/copy_from_user() (or
>> get_user_pages_fast()).
>>     
>
>   
>> - never use vmap like structures for more than the length of a request
>>     
>
> So does virtio also do demand loading in the backend?  

Given that it's entirely in userspace, yes.

> Hmm.  I suppose
> we could do this, but it will definitely affect the performance
> somewhat.  I was thinking that the pages needed for the basic shm
> components should be minimal, so this is a good tradeoff to vmap them in
> and only demand load the payload.
>   

This is negotiable :) I won't insist on it, only strongly recommend it.  
copy_to_user() should be pretty fast.
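
(For illustration, a minimal sketch of the "slots + copy_to_user()" approach 
being recommended here.  The structure and helper below are hypothetical, not 
an existing kvm or vbus interface; the point is just that the device keeps 
its own gpa->hva table and copies per request instead of holding vmap()ed 
guest pages.)

#include <linux/uaccess.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical per-device view of guest memory, filled in from userspace
 * configuration rather than by hooking the kvm mmu. */
struct guest_slot {
	u64 gpa;        /* guest-physical base */
	u64 len;        /* length of the region */
	u64 hva;        /* host-virtual base (a userspace address) */
};

struct guest_map {
	int nslots;
	struct guest_slot slot[8];
};

/* Copy into guest memory for a single request; no long-lived kernel
 * mapping of guest pages.  Assumes we run in the context of the process
 * that owns the mapping (see the "current" discussion further down). */
static int guest_write(struct guest_map *map, u64 gpa,
		       const void *buf, size_t len)
{
	int i;

	for (i = 0; i < map->nslots; i++) {
		struct guest_slot *s = &map->slot[i];

		if (gpa >= s->gpa && gpa + len <= s->gpa + s->len) {
			void __user *dst = (void __user *)(unsigned long)
					   (s->hva + (gpa - s->gpa));

			return copy_to_user(dst, buf, len) ? -EFAULT : 0;
		}
	}

	return -EINVAL;
}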


>> I think virtio can be used for much of the same things.  There's
>> nothing in virtio that implies guest/host, or pci, or anything else. 
>> It's similar to your shm/signal and ring abstractions except virtio
>> folds them together.  Is this folding the main problem?
>>     
> Right.  Virtio and ioq overlap, and they do so primarily because I
> needed a ring that was compatible with some of my design ideas, yet I
> didn't want to break the virtio ABI without a blessing first.  If the
> signaling was not folded in virtio, that would be a first great step.  I
> am not sure if there would be other areas to address as well.
>   

It would be good to find out.  virtio has evolved in time, mostly 
keeping backwards compatibility, so if you need a feature, it could be 
added.

>> As far as I can tell, everything around it just duplicates existing
>> infrastructure (which may be old and crusty, but so what) without
>> added value.
>>     
>
> I am not sure what you refer to with "everything around it".  Are you
> talking about the vbus core? 
>   

I'm talking about enumeration, hotplug, interrupt routing, all that PCI 
slowpath stuff.  My feeling is the fast path is mostly virtio except for 
being in kernel, and the slow path is totally redundant.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 14:58             ` Gregory Haskins
  2009-04-03 15:37               ` Avi Kivity
@ 2009-04-03 17:09               ` Chris Wright
  2009-04-03 18:32                 ` Gregory Haskins
  1 sibling, 1 reply; 121+ messages in thread
From: Chris Wright @ 2009-04-03 17:09 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Patrick Mullaney, anthony, andi, herbert,
	Peter Morreale, rusty, agraf, kvm, linux-kernel, netdev

* Gregory Haskins (ghaskins@novell.com) wrote:
> Let me ask you this:  If you had a clean slate and were designing a
> hypervisor and a guest OS from scratch:  What would you make the bus
> look like?

Well, virtio did have a relatively clean slate. And PCI (as _one_
transport option) is what it looks like.  It's not the only transport
(as Avi already mentioned it works for s390, for example).

BTW, from my brief look at vbus, it seems pretty similar to xenbus.

thanks,
-chris

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 15:37               ` Avi Kivity
@ 2009-04-03 18:19                 ` Gregory Haskins
  2009-04-05 10:50                   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 18:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 20099 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> I'll rephrase.  What are the substantial benefits that this offers
>>> over PCI?
>>>     
>>
>> Simplicity and optimization.  You don't need most of the junk that comes
>> with PCI.  It's all overhead and artificial constraints.  You really only
>> need things like a handful of hypercall verbs and thats it.
>>
>>   
>
> Simplicity:
>
> The guest already supports PCI.  It has to, since it was written to
> the PC platform, and since today it is fashionable to run kernels that
> support both bare metal and a hypervisor.  So you can't remove PCI
> from the guest.

Agreed
>
> The host also already supports PCI.  It has to, since it must support
> guests which do not support vbus.  We can't remove PCI from the host.

Agreed
>
> You don't gain simplicity by adding things.

But you are failing to account for the fact that we still have to add
something for PCI if we go with something like the in-kernel model.  It's
nice for the userspace side because a) it was already in qemu, and b) we
need it for proper guest support.  But we don't presumably have it for
this new thing, so something has to be created (unless this support is
somehow already there and I don't know it?)

>   Sure, lguest is simple because it doesn't support PCI.  But Linux
> will forever support PCI, and Qemu will always support PCI.  You
> aren't simplifying anything by adding vbus.
>
> Optimization:
>
> Most of PCI (in our context) deals with configuration.  So removing it
> doesn't optimize anything, unless you're counting hotplugs-per-second
> or something.

Most, but not all ;)  (Sorry, you left the window open on that one).

What about IRQ routing?  What if I want to coalesce interrupts to
minimize injection overhead?  How do I do that in PCI?

How do I route those interrupts in an arbitrarily nested fashion, say,
to a guest userspace?

What about scale?  What if Herbert decides to implement a 2048 ring MQ
device ;)  There's no great way to do that in x86 with PCI, yet I can do
it in vbus.  (And yes, I know, this is ridiculous..just wanting to get
you thinking)

>
>
>>>> Second of all, I want to use vbus for other things that do not
>>>> speak PCI
>>>> natively (like userspace for instance...and if I am gleaning this
>>>> correctly, lguest doesnt either).
>>>>         
>>> And virtio supports lguest and s390.  virtio is not PCI specific.
>>>     
>> I understand that.  We keep getting wrapped around the axle on this
>> one.   At some point in the discussion we were talking about supporting
>> the existing guest ABI without changing the guest at all.  So while I
>> totally understand the virtio can work over various transports, I am
>> referring to what would be needed to have existing ABI guests work with
>> an in-kernel version.  This may or may not be an actual requirement.
>>   
>
> There would be no problem supporting an in-kernel host virtio endpoint
> with the existing guest/host ABI.  Nothing in the ABI assumes the host
> endpoint is in userspace.  Nothing in the implementation requires us
> to move any of the PCI stuff into the kernel.
Well, that's not really true.  If the device is a PCI device, there is
*some* stuff that has to go into the kernel.  Not an ICH model or
anything, but at least an ability to interact with userspace for
config-space changes, etc.

>
> In fact, we already have in-kernel sources of PCI interrupts, these
> are assigned PCI devices (obviously, these have to use PCI).

This will help.

>
>>> However, for the PC platform, PCI has distinct advantages.  What
>>> advantages does vbus have for the PC platform?
>>>     
>> To reiterate: IMO simplicity and optimization.  It's designed
>> specifically for PV use, which is software to software.
>>   
>
> To avoid reiterating, please be specific about these advantages.
We are both reading the same thread, right?

>
>>  
>>>> PCI sounds good at first, but I believe its a false economy.  It was
>>>> designed, of course, to be a hardware solution, so it carries all this
>>>> baggage derived from hardware constraints that simply do not exist
>>>> in a
>>>> pure software world and that have to be emulated.  Things like the
>>>> fixed
>>>> length and centrally managed PCI-IDs,       
>>> Not a problem in practice.
>>>     
>>
>> Perhaps, but it's just one more constraint that isn't actually needed.
>> It's like the cvs vs git debate.  Why have it centrally managed when you
>> don't technically need it?  Sure, centrally managed works, but I'd
>> rather not deal with it if there was a better option.
>>   
>
> We've allocated 3 PCI device IDs so far.  It's not a problem.  There
> are enough real problems out there.
>
>>  
>>>> PIO config cycles, BARs,
>>>> pci-irq-routing, etc.        
>>> What are the problems with these?
>>>     
>>
>> 1) PIOs are still less efficient to decode than a hypercall vector.  We
>> don't need to pretend we are hardware...the guest already knows what's
>> underneath them.  Use the most efficient call method.
>>   
>
> Last time we measured, hypercall overhead was the same as pio
> overhead.  Both vmx and svm decode pio completely (except for string
> pio ...)
Not on my woodcrests last time I looked, but I'll check again.

>
>> 2) BARs?  No one in their right mind should use an MMIO BAR for PV. :)
>> The last thing we want to do is cause page faults here.  Don't use them,
>> period.  (This is where something like the vbus::shm() interface
>> comes in)
>>   
>
> So don't use BARs for your fast path.  virtio places the ring in guest
> memory (like most real NICs).
>
>> 3) pci-irq routing was designed to accommodate etch constraints on a
>> piece of silicon that doesn't actually exist in kvm.  Why would I want
>> to pretend I have PCI A,B,C,D lines that route to a pin on an IOAPIC?
>> Forget all that stuff and just inject an IRQ directly.  This gets much
>> better with MSI, I admit, but you hopefully catch my drift now.
>>   
>
> True, PCI interrupts suck.  But this was fixed with MSI.  Why fix it
> again?

As I stated, I don't like the constraints imposed even by MSI (though
that is definitely a step in the right direction).

 With vbus I can have a device that has an arbitrary number of shm
regions (limited by memory, of course), each with an arbitrarily routed
signal path that is limited by a u64, even on x86.  Each region can be
signaled bidirectionally and masked with a simple local memory write. 
They can be declared on the fly, allowing for the easy expression of
things like nested devices or other dynamic resources.  They can be
routed across various topologies, such as IRQs or posix signals, even
across multiple hops in a single path.

How do I do that in PCI?

What does masking an interrupt look like?  Again, for the nested case?

Interrupt acknowledgment cycles?

>
>> One of my primary design objectives with vbus was to a) reduce the
>> signaling as much as possible, and b) reduce the cost of signaling. 
>> That is why I do things like use explicit hypercalls, aggregated
>> interrupts, bidir napi to mitigate signaling, the shm_signal::pending
>> mitigation, and avoiding going to userspace by running in the kernel.
>> All of these things together help to form what I envision would be a
>> maximum performance transport.  Not all of these tricks are
>> interdependent (for instance, the bidir + full-duplex threading that I
>> do can be done in userspace too, as discussed).  They are just the
>> collective design elements that I think we need to make a guest perform
>> very close to its peak.  That is what I am after.
>>
>>   
>
> None of these require vbus.  They can all be done with PCI.
Well, first of all:  Not really.  Second of all, even if you *could* do
this all with PCI, it's not really PCI anymore.  So the question I have
is: what's the value in still using it?  For the discovery?  It's not very
hard to do discovery.  I wrote that whole part in a few hours and it
worked the first time I ran it.

What about that interrupt model I keep talking about?  How do you work
around that?  How do I nest these to support bypass?

>
>> You are right, it's not strictly necessary to work.  It just presents
>> the opportunity to optimize as much as possible and to move away from
>> legacy constraints that no longer apply.  And since PV's sole purpose is
>> about optimization, I was not really interested in going "half-way".
>>   
>
> What constraints?  Please be specific.

Avi, I have been.  Is this an exercise to see how much you can get me to
type? ;)

>
>>>   We need a positive advantage, we don't do things just because we can
>>> (and then lose the real advantages PCI has).
>>>     
>>
>> Agreed, but I assert there are advantages.  You may not think they
>> outweigh the cost, and thats your prerogative, but I think they are
>> still there nonetheless.
>>   
>
> I'm not saying anything about what the advantages are worth and how
> they compare to the cost.  I'm asking what are the advantages.  Please
> don't just assert them into existence.

That's an unfair statement, Avi.  Now I would say you are playing word-games.

>
>>>> If we insist that PCI is the only interface we can support and we want
>>>> to do something, say, in the kernel for instance, we have to have
>>>> either
>>>> something like the ICH model in the kernel (and really all of the pci
>>>> chipset models that qemu supports), or a hacky hybrid userspace/kernel
>>>> solution.  I think this is what you are advocating, but I'm sorry. IMO
>>>> that's just gross and unnecessary gunk.        
>>> If we go for a kernel solution, a hybrid solution is the best IMO.  I
>>> have no idea what's wrong with it.
>>>     
>>
>> It's just that rendering these objects as PCI is overhead that you don't
>> technically need.  You only want this backwards compat because you don't
>> want to require a new bus-driver in the guest, which is a perfectly
>> reasonable position to take.  But that doesn't mean it isn't a
>> compromise.  You are trading more complexity and overhead in the host
>> for simplicity in the guest.  I am trying to clean up this path going
>> forward.
>>   
>
> All of this overhead is incurred at configuration time.  All the
> complexity already exists

So you already have the ability to represent PCI devices that are in the
kernel?  Is this the device-assignment infrastructure?  Cool!  Wouldn't
this still need to be adapted to work with software devices?  If not,
then I take back the statements that they both add more host code and
agree that vbus is simply the one adding more.


> so we gain nothing by adding a competing implementation.  And making
> the guest complex in order to simplify the host is a pretty bad
> tradeoff considering we maintain one host but want to support many
> guests.
>
> It's good to look forward, but in the vbus-dominated universe, what do
> we have that we don't have now?  Besides simplicity.

A unified framework for declaring virtual resources directly in the
kernel, yet still retaining the natural isolation that we get in
userspace.  The ability to support guests that don't have PCI.  The
ability to support things that are not guests.  The ability to support
things that are not supported by PCI, like less hardware-centric signal
path routing.  The ability to signal across more than just IRQs.  The
ability for nesting (e.g. guest-userspace talking to host-kernel, etc). 

I recognize that this has no bearing on whether you, or anyone else
cares about these features.  But it certainly has features beyond what
we have with PCI, and I hope that is clear now.


>
>>> The guest would discover and configure the device using normal PCI
>>> methods.  Qemu emulates the requests, and configures the kernel part
>>> using normal Linux syscalls.  The nice thing is, kvm and the kernel
>>> part don't even know about each other, except for a way for hypercalls
>>> to reach the device and a way for interrupts to reach kvm.
>>>
>>>    
>>>> Lets stop beating around the
>>>> bush and just define the 4-5 hypercall verbs we need and be done with
>>>> it.  :)
>>>>
>>>> FYI: The guest support for this is not really *that* much code IMO.
>>>>  
>>>>  drivers/vbus/proxy/Makefile      |    2
>>>>  drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
>>>>         
>>> Does it support device hotplug and hotunplug?
>>>     
>> Yes, today (use "ln -s" in configfs to map a device to a bus, and the
>> guest will see the device immediately)
>>   
>
> Neat.
>
>>  
>>>   Can vbus interrupts be load balanced by irqbalance?
>>>     
>>
>> Yes (tho support for the .affinity verb on the guests irq-chip is
>> currently missing...but the backend support is there)
>>
>>
>>  
>>>   Can guest userspace enumerate devices?
>>>     
>>
>> Yes, it presents as a standard LDM device in things like
>> /sys/bus/vbus_proxy
>>
>>  
>>>   Module autoloading support?
>>>     
>>
>> Yes
>>
>>   
>
> Cool, looks like you have a nice part covered.
>
>>>   pxe booting?
>>>     
>> No, but this is something I don't think we need for now.  If it was
>> really needed it could be added, I suppose.  But there are other
>> alternatives already, so I am not putting this high on the priority
>> list.  (For instance you can choose to not use vbus, or you can use
>> --kernel, etc).
>>
>>  
>>> Plus a port to Windows,
>>>     
>>
>> I've already said this is low on my list, but it could always be added if
>> someone cares that much
>>   
>
> That's unreasonable.  Windows is an important workload.

Well, this is all GPL, right?  I mean, was KVM 100% complete when it was
proposed?  Accepted?  I am hoping to get some help building the parts of
this infrastructure from anyone interested in the community.  If Windows
support is truly important and someone cares, it will get built soon enough.

I pushed it out now because I have enough working to be useful in and of
itself and to get a review.  But it's certainly not done.

>
>>  
>>> enterprise Linux distros based on 2.6.dead
>>>     
>>
>> That's easy, though there is nothing that says we need to.  This can be a
>> 2.6.31ish thing that they pick up next time.
>>   
>
> Of course we need to.  RHEL 4/5 and their equivalents will live for a
> long time as guests.  Customers will expect good performance.

Okay, easy enough from my perspective.  However, I didn't realize it was
very common to backport new features to enterprise distros like this.  I
have a sneaking suspicion we wouldn't really need to worry about this as
the project managers for those products would probably never allow it. 
But in the event that it was necessary, I think it wouldn't be horrendous.


>
>
>>> As a matter of fact, a new bus was developed recently called PCI
>>> express.  It uses new slots, new electricals, it's not even a bus
>>> (routers + point-to-point links), new everything except that the
>>> software model was 1000000000000% compatible with traditional PCI.
>>> That's how much people are afraid of the Windows ABI.
>>>     
>>
>> Come on, Avi.  Now you are being silly.  So should the USB designers
>> have tried to make it look like PCI too?  Should the PCI designers have
>> tried to make it look like ISA?  :)  Yes, there are advantages to making
>> something backwards compatible.  There are also disadvantages to
>> maintaining that backwards compatibility.
>>   
>
> Most PCI chipsets include an ISA bridge, at least until recently.

You don't give up, do you? :P


>
>> Let me ask you this:  If you had a clean slate and were designing a
>> hypervisor and a guest OS from scratch:  What would you make the bus
>> look like?
>>   
>
> If there were no installed base to cater for, the avi-bus would blow
> anything out of the water.  It would be so shiny and new to make you
> cry in envy.  It would strongly compete with lguest and steal its two
> users.
>
> Back on earth, there are a hundred gazillion machines with good old
> x86, booting through 1978 era real mode, jumping over the 640K memory
> barrier (est. 1981), running BIOS code which was probably written in
> the 14th century, and sporting a PCI-compatible peripheral bus.

I'm OK with that, as none of them will have VMX :P

>
> This is not an academic exercise, we're not trying to develop the most
> aesthetically pleasing stack.  We need to be pragmatic so we can
> provide users with real value, not provide ourselves with software
> design entertainment (nominally called wanking on lkml, but kvm@ is a
> kinder, gentler list).
>
>>> virtio-net knows nothing about PCI.  If you have a problem with PCI,
>>> write virtio-blah for a new bus.
>>>     
>> Can virtio-net use a different backend other than virtio-pci?  Cool!  I
>> will look into that.  Perhaps that is what I need to make this work
>> smoothly.
>>   
>
> virtio-net (all virtio devices, actually) supports three platforms
> today.  PCI, lguest, and s390.
Cool.  I bet I can just write a virtio-vbus adapter then.  Rusty, any
thoughts?
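
(For reference, a rough sketch of what such an adapter has to provide: a
virtio transport is essentially an implementation of struct virtio_config_ops
plus device registration.  Exact members and signatures vary between kernel
versions, the feature-bit hooks are omitted, and the vbus-side calls below are
only placeholders.)

#include <linux/virtio.h>
#include <linux/virtio_config.h>

static void vv_get(struct virtio_device *vdev, unsigned offset,
		   void *buf, unsigned len)
{
	/* read config space from the underlying vbus device */
}

static void vv_set(struct virtio_device *vdev, unsigned offset,
		   const void *buf, unsigned len)
{
	/* write config space to the underlying vbus device */
}

static u8 vv_get_status(struct virtio_device *vdev)          { return 0; }
static void vv_set_status(struct virtio_device *vdev, u8 s)  { }
static void vv_reset(struct virtio_device *vdev)             { }

static struct virtqueue *vv_find_vq(struct virtio_device *vdev,
				    unsigned index,
				    void (*callback)(struct virtqueue *))
{
	/* back the virtqueue with a vbus shm region + shm_signal */
	return NULL;
}

static struct virtio_config_ops virtio_vbus_config_ops = {
	.get        = vv_get,
	.set        = vv_set,
	.get_status = vv_get_status,
	.set_status = vv_set_status,
	.reset      = vv_reset,
	.find_vq    = vv_find_vq,
};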

>
>>> I think you're integrating too tightly with kvm, which is likely to
>>> cause problems when kvm evolves.  The way I'd do it is:
>>>
>>> - drop all mmu integration; instead, have your devices maintain their
>>> own slots layout and use copy_to_user()/copy_from_user() (or
>>> get_user_pages_fast()).
>>>     
>>
>>  
>>> - never use vmap like structures for more than the length of a request
>>>     
>>
>> So does virtio also do demand loading in the backend?  
>
> Given that it's entirely in userspace, yes.

Ah, right.  How does that work, out of curiosity?  Do you have to do a
syscall for every page you want to read?

>
>> Hmm.  I suppose
>> we could do this, but it will definitely affect the performance
>> somewhat.  I was thinking that the pages needed for the basic shm
>> components should be minimal, so this is a good tradeoff to vmap them in
>> and only demand load the payload.
>>   
>
> This is negotiable :) I won't insist on it, only strongly recommend
> it.  copy_to_user() should be pretty fast.

It probably is, but generally we can't use it since we are not in the
same context when we need to do the copy (copy_to/from_user assume
"current" is proper).  That's OK, there are ways to do what you request
without explicitly using c_t_u().


>
>
>>> I think virtio can be used for much of the same things.  There's
>>> nothing in virtio that implies guest/host, or pci, or anything else.
>>> It's similar to your shm/signal and ring abstractions except virtio
>>> folds them together.  Is this folding the main problem?
>>>     
>> Right.  Virtio and ioq overlap, and they do so primarily because I
>> needed a ring that was compatible with some of my design ideas, yet I
>> didn't want to break the virtio ABI without a blessing first.  If the
>> signaling was not folded in virtio, that would be a first great step.  I
>> am not sure if there would be other areas to address as well.
>>   
>
> It would be good to find out.  virtio has evolved in time, mostly
> keeping backwards compatibility, so if you need a feature, it could be
> added.
>
>>> As far as I can tell, everything around it just duplicates existing
>>> infrastructure (which may be old and crusty, but so what) without
>>> added value.
>>>     
>>
>> I am not sure what you refer to with "everything around it".  Are you
>> talking about the vbus core?   
>
> I'm talking about enumeration, hotplug, interrupt routing, all that
> PCI slowpath stuff.  My feeling is the fast path is mostly virtio
> except for being in kernel, and the slow path is totally redundant.

Ok, but note that I think you are still confusing the front-end and
back-end here.  See my last email for clarification.

-Greg

>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 17:09               ` Chris Wright
@ 2009-04-03 18:32                 ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 18:32 UTC (permalink / raw)
  To: Chris Wright
  Cc: Avi Kivity, Patrick Mullaney, anthony, andi, herbert,
	Peter Morreale, rusty, agraf, kvm, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1234 bytes --]

Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Let me ask you this:  If you had a clean slate and were designing a
>> hypervisor and a guest OS from scratch:  What would you make the bus
>> look like?
>>     
>
> Well, virtio did have a relatively clean slate. And PCI (as _one_
> transport option) is what it looks like.  It's not the only transport
> (as Avi already mentioned it works for s390, for example).
>   

Got it.  Thanks.

> BTW, from my brief look at vbus, it seems pretty similar to xenbus.
>   

If you are referring to the guest side interface, it was actually
inspired by lguest's bus (I forget what Rusty called it now, though).  
I think I actually declared that in the original patch series I put out
1.5 years ago, but I might have inadvertently omitted that on this
go-round.

I think XenBus is more of an event channel infrastructure, isn't it? 
But in any case, I think the nature of getting PV drivers into a guest
is relatively similar, so I wouldn't be surprised if there were
parallels in quite a few of the implementations.  In fact, I chose a
generic name like "vbus" in hopes that it could be used across different
hypervisors. :)

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 18:19                 ` Gregory Haskins
@ 2009-04-05 10:50                   ` Avi Kivity
  0 siblings, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-05 10:50 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Patrick Mullaney, anthony, andi, herbert, Peter Morreale, rusty,
	agraf, kvm, linux-kernel, netdev

Gregory Haskins wrote:
>> You don't gain simplicity by adding things.
>>     
>
> But you are failing to account for the fact that we still have to add
> something for PCI if we go with something like the in-kernel model.  It's
> nice for the userspace side because a) it was already in qemu, and b) we
> need it for proper guest support.  But we don't presumably have it for
> this new thing, so something has to be created (unless this support is
> somehow already there and I don't know it?)
>   

No, a virtio server in the kernel would know nothing about PCI.  
Userspace would handle the PCI interface and configure the kernel.  That 
way we can reuse the kernel part for lguest and s390.

>> Optimization:
>>
>> Most of PCI (in our context) deals with configuration.  So removing it
>> doesn't optimize anything, unless you're counting hotplugs-per-second
>> or something.
>>     
>
> Most, but not all ;)  (Sorry, you left the window open on that one).
>
> What about IRQ routing?  

That's already in the kernel.

> What if I want to coalesce interrupts to
> minimize injection overhead?  How do I do that in PCI?
>   

It has nothing to do with PCI.  It has to do with the device/guest 
protocol.  And virtio already does that (badly, in the case of network tx).

> How do I route those interrupts in an arbitrarily nested fashion, say,
> to a guest userspace?
>   

That's a guest problem.  kvm delivers an interrupt; if the guest knows 
how to service it in userspace, great.

> What about scale?  What if Herbert decides to implement a 2048 ring MQ
> device ;)  There's no great way to do that in x86 with PCI, yet I can do
> it in vbus.  (And yes, I know, this is ridiculous..just wanting to get
> you thinking)
>   

I don't see why you can't do 2048 (or even 2049) rings with PCI.  You'd 
point some config space address at a 'ring descriptor table' and that's it.
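
(Purely illustrative layout, not an existing spec: one config-space or BAR
register points at a table like the one below, so the number of rings is not
bounded by the number of BARs or interrupt vectors.)

#include <stdint.h>

/* Hypothetical "ring descriptor table" a PV PCI device could expose. */
struct ring_desc {
	uint64_t ring_gpa;      /* guest-physical address of the ring */
	uint32_t ring_size;     /* number of entries */
	uint32_t msix_vector;   /* vector to signal, or a shared one */
};

struct ring_desc_table {
	uint32_t num_rings;     /* e.g. 2048 */
	uint32_t reserved;
	struct ring_desc desc[];/* num_rings entries follow */
};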

>> There would be no problem supporting an in-kernel host virtio endpoint
>> with the existing guest/host ABI.  Nothing in the ABI assumes the host
>> endpoint is in userspace.  Nothing in the implementation requires us
>> to move any of the PCI stuff into the kernel.
>>     
> Well, that's not really true.  If the device is a PCI device, there is
> *some* stuff that has to go into the kernel.  Not an ICH model or
> anything, but at least an ability to interact with userspace for
> config-space changes, etc.
>   

Config space changes go to userspace anyway.  You'd need an interface to 
let userspace configure the kernel, but that's true for every device in 
the kernel.  And you don't want to let the guest configure the kernel 
directly, you want userspace to be able to keep control of things.

  

>> To avoid reiterating, please be specific about these advantages.
>>     
> We are both reading the same thread, right?
>   

Using different languages?

  

>> Last time we measured, hypercall overhead was the same as pio
>> overhead.  Both vmx and svm decode pio completely (except for string
>> pio ...)
>>     
> Not on my woodcrests last time I looked, but I'll check again.
>   

On woodcrests too.  See vmx.c:handle_io().

>> True, PCI interrupts suck.  But this was fixed with MSI.  Why fix it
>> again?
>>     
>
> As I stated, I don't like the constraints imposed even by MSI (though
> that is definitely a step in the right direction).
>   

Which constraints?

>  With vbus I can have a device that has an arbitrary number of shm
> regions (limited by memory, of course), 

So you can with PCI.

> each with an arbitrarily routed
> signal path that is limited by a u64, even on x86.  

There are still only 224 vectors per vcpu.

> Each region can be
> signaled bidirectionally and masked with a simple local memory write. 
> They can be declared on the fly, allowing for the easy expression of
> things like nested devices or other dynamic resources.  They can be
> routed across various topologies, such as IRQs or posix signals, even
> across multiple hops in a single path.
>
> How do I do that in PCI?
>   

Not sure what this nesting means.  If I understand the rest, I think you can 
do it.

> What does masking an interrupt look like?  

It's a protocol between the device and the guest.  PCI doesn't specify 
it.  So you can use a bit in shared memory if you like.
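
(A sketch of what that shared-memory bit could look like.  The structure is
hypothetical; virtio's "no interrupt" flag in the ring is an existing example
of the same idea.)

#include <stdint.h>

/* Hypothetical signalling block shared between guest driver and host
 * device model. */
struct shm_irq_ctrl {
	volatile uint32_t enabled;   /* guest: 0 = masked, 1 = unmasked */
	volatile uint32_t pending;   /* host: set when work is available */
};

/* Guest side: masking is a plain memory write, no exit. */
static inline void guest_irq_mask(struct shm_irq_ctrl *c)   { c->enabled = 0; }
static inline void guest_irq_unmask(struct shm_irq_ctrl *c) { c->enabled = 1; }

/* Host side: note the pending work, inject only if unmasked.
 * Real code would need memory barriers around these accesses. */
static inline int host_should_inject(struct shm_irq_ctrl *c)
{
	c->pending = 1;
	return c->enabled;
}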

> Again, for the nested case?
>   

What's that?

> Interrupt acknowledgment cycles?
>   

Standard for the platform.  Again it's outside the scope of PCI.

>>> One of my primary design objectives with vbus was to a) reduce the
>>> signaling as much as possible, and b) reduce the cost of signaling. 
>>> That is why I do things like use explicit hypercalls, aggregated
>>> interrupts, bidir napi to mitigate signaling, the shm_signal::pending
>>> mitigation, and avoiding going to userspace by running in the kernel.
>>> All of these things together help to form what I envision would be a
>>> maximum performance transport.  Not all of these tricks are
>>> interdependent (for instance, the bidir + full-duplex threading that I
>>> do can be done in userspace too, as discussed).  They are just the
>>> collective design elements that I think we need to make a guest perform
>>> very close to its peak.  That is what I am after.
>>>
>>>   
>>>       
>> None of these require vbus.  They can all be done with PCI.
>>     
> Well, first of all:  Not really.  

Really?  I think every network card+driver does this bidir napi thing.  
napi was invented for real network cards, IIUC.

> Second of all, even if you *could* do
> this all with PCI, it's not really PCI anymore.  So the question I have
> is: what's the value in still using it?  For the discovery?  It's not very
> hard to do discovery.  I wrote that whole part in a few hours and it
> worked the first time I ran it.
>   

Yes, for the discovery.  And so it could work on all guests, not just 
Linux 2.6.31+.

> What about that interrupt model I keep talking about?  How do you work
> around that?  How do I nest these to support bypass?
>   

I'm lost, sorry.

>> What constraints?  Please be specific.
>>     
>
> Avi, I have been.  Is this an exercise to see how much you can get me to
> type? ;)
>   

I know I'd lose this, so no.  I'm really puzzled what you think we'd 
gain by departing from PCI (other than having a nice clean code base, 
which I don't think helps because we get to maintain both PCI and the 
new code base).

>> I'm not saying anything about what the advantages are worth and how
>> they compare to the cost.  I'm asking what are the advantages.  Please
>> don't just assert them into existence.
>>     
>
> That's an unfair statement, Avi.  Now I would say you are playing word-games.
>   

I genuinely don't see them.  I'm not being deliberately stupid.

>> All of this overhead is incurred at configuration time.  All the
>> complexity already exists
>>     
>
> So you already have the ability to represent PCI devices that are in the
> kernel?  Is this the device-assignment infrastructure?  Cool!  Wouldn't
> this still need to be adapted to work with software devices?  If not,
> then I take back the statements that they both add more host code and
> agree that vbus is simply the one adding more.
>   

Of course it would need to be adapted, but nothing in the core.  For 
example, virtio-net.c would need to communicate with its kernel 
counterpart to tell it what its configuration is, and to start and stop 
it (so we could do live migration).

We wouldn't need to make any changes to hw/pci.c, for example.

It's similar to how the in-kernel lapic and ioapic are integrated with qemu.

>> so we gain nothing by adding a competing implementation.  And making
>> the guest complex in order to simplify the host is a pretty bad
>> tradeoff considering we maintain one host but want to support many
>> guests.
>>
>> It's good to look forward, but in the vbus-dominated universe, what do
>> we have that we don't have now?  Besides simplicity.
>>     
>
> A unified framework for declaring virtual resources directly in the
> kernel, yet still retaining the natural isolation that we get in
> userspace.

That's not an advantage.  "directly in the kernel" doesn't buy the user 
anything.

>   The ability to support guests that don't have PCI.

Already have that.  See lguest and s390.

>   The
> ability to support things that are not guests.

So would a PCI implementation, as long as PCI is only in userspace.

>   The ability to support
> things that are not supported by PCI, like less hardware-centric signal
> path routing.  

What's that?

> The ability to signal across more than just IRQs.  

You can't, either with or without vbus.  You have to honour guest cli.  
You might do a Xen-like alternative implementation of interrupts, but 
that's bound to be slow since you have to access guest stack directly 
and switch stacks instead of letting the hardware do it for you.  And of 
course forget about Windows.

> The
> ability for nesting (e.g. guest-userspace talking to host-kernel, etc). 
>   

That's a guest problem.  If the guest kernel gives guest userspace 
access, guest userspace can have a go too, PCI or not.

In fact, even today guest userspace controls a PCI device - the X server 
runs in userspace and talks to the cirrus PCI device.

> I recognize that this has no bearing on whether you, or anyone else
> cares about these features.  But it certainly has features beyond what
> we have with PCI, and I hope that is clear now.
>   

With the exception of "less hardware-centric signal path routing", which 
I did not understand, I don't think you demonstrated any advantage.

    

>>> Ive already said this is low on my list, but it could always be added if
>>> someone cares that much
>>>   
>>>       
>> That's unreasonable.  Windows is an important workload.
>>     
>
> Well, this is all GPL, right?  I mean, was KVM 100% complete when it was
> proposed?  Accepted?  I am hoping to get some help building the parts of
> this infrastructure from anyone interested in the community.  If Windows
> support is truly important and someone cares, it will get built soon enough.
>
> I pushed it out now because I have enough working to be useful in and of
> itself and to get a review.  But it's certainly not done.
>   

You are proposing a major break from what we have now.  While you've 
demonstrated very nice performance numbers, it cannot be undertaken lightly.

This is how I see our options:

- continue to develop virtio, taking the performance improvements from venet

IMO this is the best course.  We do what we have to do to get better 
performance, perhaps by implementing a server in the kernel.  The 
Windows drivers continue to work.  Linux 2.6.older+ continue to work.  
Older hosts continue to work (with the userspace virtio 
implementation).  Performance improves.

- drop virtio, switch to vbus

That's probably the worst course.  Windows drivers stop working until 
further notice.  Older hosts stop working.  Older guests stop working.  
The only combination that works is 2.6.31+ on 2.6.31+.

- move virtio to maintenance mode, start developing vbus

Older guests use virtio, older hosts use virtio, if we have a new guest 
on new host we use vbus.  Start porting the Windows drivers to vbus.  
Start porting block drivers and host to vbus.  Same for balloon.

While workable, it increases the maintenance burden significantly as 
well as user confusion.  I don't think we'd be justified in moving in 
this direction unless there was a compelling reason, which I don't see 
right now.
>   
>   
>> Of course we need to.  RHEL 4/5 and their equivalents will live for a
>> long time as guests.  Customers will expect good performance.
>>     
>
> Okay, easy enough from my perspective.  However, I didn't realize it was
> very common to backport new features to enterprise distros like this.  I
> have a sneaking suspicion we wouldn't really need to worry about this as
> the project managers for those products would probably never allow it. 
> But in the event that it was necessary, I think it wouldn't be horrendous.
>   

As it happens, RHEL 5.3 has backported virtio drivers.

    

>>> So does virtio also do demand loading in the backend?  
>>>       
>> Given that it's entirely in userspace, yes.
>>     
>
> Ah, right.  How does that work, out of curiosity?  Do you have to do a
> syscall for every page you want to read?
>   

No, you just read or write it through pointers.  Syscalls that access 
userspace work too (like read() or write()).


>>> Hmm.  I suppose
>>> we could do this, but it will definitely affect the performance
>>> somewhat.  I was thinking that the pages needed for the basic shm
>>> components should be minimal, so this is a good tradeoff to vmap them in
>>> and only demand load the payload.
>>>   
>>>       
>> This is negotiable :) I won't insist on it, only strongly recommend
>> it.  copy_to_user() should be pretty fast.
>>     
>
> It probably is, but generally we can't use it since we are not in the
> same context when we need to do the copy (copy_to/from_user assume
> "current" is proper).  

Right.

> That's OK, there are ways to do what you request
> without explicitly using c_t_u().
>   

How?

If we can't, vmap() is fine.
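
(One way to answer the "how" above, sketched under the assumption that the
backend is a kernel thread holding a reference to the guest process's mm:
temporarily adopt that mm so the ordinary user-copy primitives work.  The
in-kernel aio code does this with use_mm()/unuse_mm(); those helpers are not
exported for general use, so treat this as a sketch rather than a drop-in.)

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/uaccess.h>

/* Sketch: copy into guest memory from a backend kthread that is not
 * running as the guest's userspace process, by borrowing that
 * process's mm for the duration of the copy. */
static int backend_copy_to_guest(struct mm_struct *guest_mm,
				 void __user *dst, const void *src,
				 size_t len)
{
	int ret;

	use_mm(guest_mm);       /* make guest_mm the "current" mm */
	ret = copy_to_user(dst, src, len) ? -EFAULT : 0;
	unuse_mm(guest_mm);

	return ret;
}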

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05 16:10         ` Avi Kivity
@ 2009-04-05 16:45           ` Anthony Liguori
  0 siblings, 0 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-04-05 16:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> What we need is:
>>
>> 1) Lockless MMIO/PIO dispatch (there should be two IO registration 
>> interfaces, a new lockless one and the legacy one)
>
> Not sure exactly how much this is needed, since when there is no 
> contention, locks are almost free (there's the atomic and cacheline 
> bounce, but no syscall).

There should be no contention, but I strongly suspect there is contention 
more often than we think.  The IO thread can potentially hold the lock for a very 
long period of time.  Take into consideration things like qcow2 metadata 
read/write, VNC server updates, etc..

> For any long operations, we should drop the lock (of course we need 
> some kind of read/write lock or rcu to avoid hotunplug or 
> reconfiguration).
>
>> 2) A virtio-net thread that's independent of the IO thread.
>
> Yes -- that saves us all the select() prologue (calculating new 
> timeout) and the select() itself.

In an ideal world, we could do the submission via io_submit in the VCPU 
context, not worry about the copy latency (because we're zero copy).  
Then our packet transmission latency is consistently low because the 
path is consistent and lockless.  This is why dropping the lock is so 
important, it's not enough to usually have low latency.  We need to try 
and have latency as low as possible as often as possible.

Regards,

Anthony Liguori
>
>

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05 14:13       ` Anthony Liguori
@ 2009-04-05 16:10         ` Avi Kivity
  2009-04-05 16:45           ` Anthony Liguori
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-05 16:10 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Rusty Russell, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Anthony Liguori wrote:
>
> What we need is:
>
> 1) Lockless MMIO/PIO dispatch (there should be two IO registration 
> interfaces, a new lockless one and the legacy one)

Not sure exactly how much this is needed, since when there is no 
contention, locks are almost free (there's the atomic and cacheline 
bounce, but no syscall).

For any long operations, we should drop the lock (of course we need some 
kind of read/write lock or rcu to avoid hotunplug or reconfiguration).

> 2) A virtio-net thread that's independent of the IO thread.

Yes -- that saves us all the select() prologue (calculating new timeout) 
and the select() itself.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05  3:44     ` Rusty Russell
  2009-04-05  8:06       ` Avi Kivity
@ 2009-04-05 14:13       ` Anthony Liguori
  2009-04-05 16:10         ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-04-05 14:13 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Rusty Russell wrote:
> On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
>   
>> Rusty Russell wrote:
>>     
>>> As you point out, 350-450 is possible, which is still bad, and it's at least
>>> partially caused by the exit to userspace and two system calls.  If virtio_net
>>> had a backend in the kernel, we'd be able to compare numbers properly.
>>>       
>> I doubt the userspace exit is the problem.  On a modern system, it takes 
>> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
>> exit.  A transition to userspace is only about ~150ns, the bulk of the 
>> additional heavy-weight exit cost is from vcpu_put() within KVM.
>>     
>
> Just to inject some facts, servicing a ping via tap (ie host->guest then
> guest->host response) takes 26 system calls from one qemu thread, 7 from
> another (see strace below). Judging by those futex calls, multiple context
> switches, too.
>   

N.B. we're not optimized for latency today.  With the right 
infrastructure in userspace, I'm confident we could get this down.

What we need is:

1) Lockless MMIO/PIO dispatch (there should be two IO registration 
interfaces, a new lockless one and the legacy one)
2) A virtio-net thread that's independent of the IO thread.

It would be interesting to count the number of syscalls required in the 
lguest path since that should be a lot closer to optimal.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 16:28                                                   ` Gregory Haskins
@ 2009-04-05 10:00                                                     ` Avi Kivity
  0 siblings, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-05 10:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>   
>>> 2) the vbus-proxy and kvm-guest patch go away
>>> 3) the kvm-host patch changes to work with coordination from the
>>> userspace-pci emulation for things like MSI routing
>>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>>> instantiates on the bus (and can communicate changes
>>>   
>>>       
>> Don't userstand.  What's this MSI shim?
>>     
>
> Well, if the device model was an object in vbus down in the kernel, yet
> PCI emulation was up in qemu, presumably we would want something to
> handle things like PCI config-cycles up in userspace.  Like, for
> instance, if the guest re-routes the MSI.  The shim/proxy would handle
> the config-cycle, and then turn around and do an ioctl to the kernel to
> configure the change with the in-kernel device model (or the irq
> infrastructure, as required).
>   

Right, this is how it should work.  All the gunk in userspace.

> But, TBH, I haven't really looked into whats actually required to make
> this work yet.  I am just spitballing to try to find a compromise.
>   

One thing I thought of, trying to keep this generic, is to use file 
descriptors as irq handles.  So:

- userspace exposes a PCI device (same as today)
- guest configures its PCI IRQ (using MSI if it supports it)
- userspace handles this by calling KVM_IRQ_FD which converts the irq to 
a file descriptor
- userspace passes this fd to the kernel, or another userspace process
- end user triggers guest irqs by writing to this fd
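
(A sketch of the userspace side of this, purely to make the flow concrete:
eventfd() is real, but KVM_IRQ_FD, the binding struct and the ioctl number
below are hypothetical names for the proposed interface.)

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical binding of an eventfd to a guest interrupt (gsi). */
struct kvm_irqfd_bind {
	int      fd;    /* eventfd to listen on */
	uint32_t gsi;   /* guest interrupt to raise on each write */
};
#define KVM_IRQ_FD  _IOW('k', 0x70, struct kvm_irqfd_bind)  /* made-up number */

static int bind_irq_to_fd(int vmfd, uint32_t gsi)
{
	struct kvm_irqfd_bind bind;
	int fd = eventfd(0, 0);

	if (fd < 0)
		return -1;
	bind.fd  = fd;
	bind.gsi = gsi;
	if (ioctl(vmfd, KVM_IRQ_FD, &bind) < 0) {
		close(fd);
		return -1;
	}
	return fd;  /* hand this to the in-kernel device or another process */
}

/* The other end triggers a guest irq by writing to the fd. */
static int trigger_irq(int irqfd)
{
	uint64_t one = 1;

	return write(irqfd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}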

We could do the same with hypercalls:

- guest and host userspace negotiate hypercall use through PCI config space
- userspace passes an fd to the kernel
- whenever the guest issues an hypercall, the kernel writes the 
arguments to the fd
- other end (in kernel or userspace) processes the hypercall


> No, you are confusing the front-end and back-end again ;)
>
> The back-end remains, and holds the device models as before.  This is
> the "vbus core".  Today the front-end interacts with the hypervisor to
> render "vbus" specific devices.  The proposal is to eliminate the
> front-end, and have the back end render the objects on the bus as PCI
> devices to the guest.  I am not sure if I can make it work, yet.  It
> needs more thought.
>   

It seems to me this already exists, it's the qemu device model.

The host kernel doesn't need any knowledge of how the devices are 
connected, even if it does implement some of them.

>> .  I don't think you've yet set down what its advantages are.  Being
>> pure and clean doesn't count, unless you rip out PCI from all existing
>> installed hardware and from Windows.
>>     
>
> You are being overly dramatic.  No one has ever said we are talking
> about ripping something out.  In fact, I've explicitly stated that PCI
> can coexist peacefully.    Having more than one bus in a system is
> certainly not without precedent (PCI, scsi, usb, etc).
>
> Rather, PCI is PCI, and will always be.  PCI was designed as a
> software-to-hardware interface.  It works well for its intention.  When
> we do full emulation of guests, we still do PCI so that all that
> software that was designed to work software-to-hardware still continue
> to work, even though technically its now software-to-software.  When we
> do PV, on the other hand, we no longer need to pretend it is
> software-to-hardware.  We can continue to use an interface designed for
> software-to-hardware if we choose, or we can use something else such as
> an interface designed specifically for software-to-software.
>
> As I have stated, PCI was designed with hardware constraints in mind. 
> What if I don't want to be governed by those constraints?  

I'd agree with all this if I actually saw a constraint in PCI.  But I don't.

> What if I
> don't want an interrupt per device (I don't)?   

Don't.  Though I think you do, even multiple interrupts per device.

> What do I need BARs for
> (I don't)?  

Don't use them.

> Is a PCI PIO address relevant to me (no, hypercalls are more
> direct)?  Etc.  It's crap I don't need.
>   

So use hypercalls.

> All I really need is a way to a) discover and enumerate devices,
> preferably dynamically (hotswap), and b) a way to communicate with those
> devices.  I think you are overstating the importance that PCI plays
> in (a), and are overstating the complexity associated with doing an
> alternative.  

Given that we have PCI, why would we do an alternative?

It works, it works with Windows, the nasty stuff is in userspace.  Why 
expend effort on an alternative?  Instead make it go faster.

> I think you are understating the level of hackiness
> required to continue to support PCI as we move to new paradigms, like
> in-kernel models.  

The kernel need know nothing about PCI, so I don't see how you work this 
out.

> And I think I have already stated that I can
> establish a higher degree of flexibility, and arguably, performance for
> (b).  

You've stated it, but failed to provide arguments for it.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05  3:44     ` Rusty Russell
@ 2009-04-05  8:06       ` Avi Kivity
  2009-04-05 14:13       ` Anthony Liguori
  1 sibling, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-05  8:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Anthony Liguori, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Rusty Russell wrote:
> On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
>   
>> Rusty Russell wrote:
>>     
>>> As you point out, 350-450 is possible, which is still bad, and it's at least
>>> partially caused by the exit to userspace and two system calls.  If virtio_net
>>> had a backend in the kernel, we'd be able to compare numbers properly.
>>>       
>> I doubt the userspace exit is the problem.  On a modern system, it takes 
>> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
>> exit.  A transition to userspace is only about ~150ns, the bulk of the 
>> additional heavy-weight exit cost is from vcpu_put() within KVM.
>>     
>
> Just to inject some facts, servicing a ping via tap (ie host->guest then
> guest->host response) takes 26 system calls from one qemu thread, 7 from
> another (see strace below). Judging by those futex calls, multiple context
> switches, too.
>   

Interesting stuff.  Even if amortized over half a ring's worth of 
packets, that's quite a lot.

Two threads are involved (we complete on the iothread, since we don't 
know which vcpu will end up processing the interrupt, if any).

>
> Pid 10260:
> 12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
>   

Should switch to epoll with its lower wait costs.  Unfortunately the 
relative timeout requires reading the clock.

> 12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
> 12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
>   

I hope this is your addition.

> 12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
> 12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
>   

This wouldn't be necessary with io_getevents().

> 12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
>   

...

> 12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
> 12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
>   

Great.

> 12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
> 12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
> 12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
> 12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
> 12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
>   

The write is to wake someone up.  Who?

> 12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
>   

Clearing up signalfd...

> 12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
> 12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
> 12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
>   

Injecting the interrupt.

> 12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.000019>
> 12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
> 12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.000022>
> 12:37:40.255001 read(4, "\0", 512)      = 1 <0.000021>
>   

Great,  the write() was to wake ourselves up.

> 12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
> 12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.000019>
> 12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.000019>
> 12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.000018>
> 12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 14000000}}, NULL) = 0 <0.000021>
> 12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.000019>
> 12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.000018>
> 12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>
>
>   

This is the vcpu thread:

> Pid 10262:
> 12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.000018>
> 12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.000021>
> 12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0 <0.000024>
> 12:37:40.252868 ioctl(11, 0xae80, 0)    = 0 <0.001171>
> 12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
> 12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.000019>
> 12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.000017>
> 12:37:40.254693 ioctl(11, 0xae80 <unfinished ...>
>   

Looks like the interrupt from the iothread was injected and delivered 
before the iothread could give up the mutex, so we needed to wait here.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 16:10   ` Anthony Liguori
@ 2009-04-05  3:44     ` Rusty Russell
  2009-04-05  8:06       ` Avi Kivity
  2009-04-05 14:13       ` Anthony Liguori
  0 siblings, 2 replies; 121+ messages in thread
From: Rusty Russell @ 2009-04-05  3:44 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
> Rusty Russell wrote:
> > As you point out, 350-450 is possible, which is still bad, and it's at least
> > partially caused by the exit to userspace and two system calls.  If virtio_net
> > had a backend in the kernel, we'd be able to compare numbers properly.
> 
> I doubt the userspace exit is the problem.  On a modern system, it takes 
> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
> exit.  A transition to userspace is only about ~150ns, the bulk of the 
> additional heavy-weight exit cost is from vcpu_put() within KVM.

Just to inject some facts, servicing a ping via tap (ie host->guest then
guest->host response) takes 26 system calls from one qemu thread, 7 from
another (see strace below). Judging by those futex calls, multiple context
switches, too.

> If you were to switch to another kernel thread, and I'm pretty sure you 
> have to, you're going to still see about a 2us exit cost.

He switches to another thread, too, but with the right infrastructure (ie.
skb data destructors) we could skip this as well.  (It'd be interesting to
see how virtual-bus performed on a single cpu host).

Cheers,
Rusty.

Pid 10260:
12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.000019>
12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.000022>
12:37:40.255001 read(4, "\0", 512)      = 1 <0.000021>
12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.000019>
12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.000019>
12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.000018>
12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 14000000}}, NULL) = 0 <0.000021>
12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.000019>
12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.000018>
12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>

Pid 10262:
12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.000018>
12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.000021>
12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0 <0.000024>
12:37:40.252868 ioctl(11, 0xae80, 0)    = 0 <0.001171>
12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.000019>
12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.000017>
12:37:40.254693 ioctl(11, 0xae80 <unfinished ...>

fd:
lrwx------ 1 root root 64 2009-04-05 12:31 0 -> /dev/pts/1 
lrwx------ 1 root root 64 2009-04-05 12:31 1 -> /dev/pts/1 
lrwx------ 1 root root 64 2009-04-05 12:35 10 -> /home/rusty/qemu-images/ubuntu-8.10                                                                            
lrwx------ 1 root root 64 2009-04-05 12:35 11 -> anon_inode:kvm-vcpu            
lrwx------ 1 root root 64 2009-04-05 12:35 12 -> socket:[31414]                 
lrwx------ 1 root root 64 2009-04-05 12:35 13 -> socket:[31416]                 
lrwx------ 1 root root 64 2009-04-05 12:35 14 -> anon_inode:[eventfd]           
lrwx------ 1 root root 64 2009-04-05 12:35 15 -> anon_inode:[eventfd]           
lrwx------ 1 root root 64 2009-04-05 12:35 16 -> anon_inode:[signalfd]          
lrwx------ 1 root root 64 2009-04-05 12:31 2 -> /dev/pts/1                      
lr-x------ 1 root root 64 2009-04-05 12:31 3 -> /dev/kvm
lr-x------ 1 root root 64 2009-04-05 12:35 4 -> pipe:[31406]
l-wx------ 1 root root 64 2009-04-05 12:35 5 -> pipe:[31406]
lrwx------ 1 root root 64 2009-04-05 12:35 6 -> /dev/net/tun
lrwx------ 1 root root 64 2009-04-05 12:35 7 -> anon_inode:kvm-vm
lrwx------ 1 root root 64 2009-04-05 12:35 8 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 2009-04-05 12:35 9 -> /tmp/vl.OL1kd9 (deleted)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 13:37                                                 ` Avi Kivity
@ 2009-04-03 16:28                                                   ` Gregory Haskins
  2009-04-05 10:00                                                     ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 16:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 10320 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> So again, I am proposing for consideration of accepting my work
>>>> (either
>>>> in its current form, or something we agree on after the normal review
>>>> process) not only on the basis of the future development of the
>>>> platform, but also to keep current components running to
>>>> their
>>>> full potential.  I will again point out that the code is almost
>>>> completely off to the side, can be completely disabled with config
>>>> options, and I will maintain it.  Therefore the only real impact is to
>>>> people who care to even try it, and to me.
>>>>         
>>> Your work is a whole stack.  Let's look at the constituents.
>>>
>>> - a new virtual bus for enumerating devices.
>>>
>>> Sorry, I still don't see the point.  It will just make writing drivers
>>> more difficult.  The only advantage I've heard from you is that it
>>> gets rid of the gunk.  Well, we still have to support the gunk for
>>> non-pv devices so the gunk is basically free.  The clean version is
>>> expensive since we need to port it to all guests and implement
>>> exciting features like hotplug.
>>>     
>> My real objection to PCI is fast-path related.  I don't object, per se,
>> to using PCI for discovery and hotplug.  If you use PCI just for these
>> types of things, but then allow fastpath to use more hypercall oriented
>> primitives, then I would agree with you.  We can leave PCI emulation in
>> user-space, and we get it for free, and things are relatively tidy.
>>   
>
> PCI has very little to do with the fast path (nothing, if we use MSI).

At the very least, PIOs are slightly slower than hypercalls.  Perhaps
not enough to care, but the last time I measured them they were slower,
and therefore my clean slate design doesn't use them.
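
For reference, the two guest-side kick paths I am comparing look
roughly like this (illustrative x86 only, not the actual venet or
virtio code):

/* Illustrative only: the two guest->host kick paths being compared,
 * not the actual venet or virtio code (x86, KVM hypercall ABI). */
static inline void kick_pio(unsigned short port, unsigned int val)
{
        /* PIO exit: the VMM traps the port write */
        asm volatile("outl %0, %1" : : "a"(val), "Nd"(port));
}

static inline long kick_hypercall(unsigned long nr, unsigned long arg)
{
        long ret;

        /* "vmcall" on Intel; AMD uses "vmmcall" */
        asm volatile("vmcall"
                     : "=a"(ret)
                     : "a"(nr), "b"(arg)
                     : "memory");
        return ret;
}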

But I digress.  I think I was actually kind of agreeing with you that we
could do this. :P

>
>> It's once you start requiring that we stay ABI compatible with something
>> like the existing virtio-net in x86 KVM where I think it starts to get
>> ugly when you try to move it into the kernel.  So that is what I had a
>> real objection to.  I think as long as we are not talking about trying
>> to make something like that work, its a much more viable prospect.
>>   
>
> I don't see why the fast path of virtio-net would be bad.  Can you
> elaborate?

I'm not.  I am saying I think we might be able to do this.

>
> Obviously all the pci glue stays in userspace.
>
>> So what I propose is the following:
>> 1) The core vbus design stays the same (or close to it)
>>   
>
> Sorry, I still don't see what advantage this has over PCI, and how you
> deal with the disadvantages.

I think you are confusing the vbus-proxy (guest side) with the vbus
backend.  (1) is saying "keep the vbus backend" and (2) is saying drop
the guest side stuff.  In this proposal, the guest would speak a PCI ABI
as far as it's concerned.  Devices in the vbus backend would render as
PCI objects in the ICH (or whatever) model in userspace.

>
>> 2) the vbus-proxy and kvm-guest patch go away
>> 3) the kvm-host patch changes to work with coordination from the
>> userspace-pci emulation for things like MSI routing
>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>> instantiates on the bus (and can communicate changes
>>   
>
> Don't understand.  What's this MSI shim?

Well, if the device model was an object in vbus down in the kernel, yet
PCI emulation was up in qemu, presumably we would want something to
handle things like PCI config-cycles up in userspace.  Like, for
instance, if the guest re-routes the MSI.  The shim/proxy would handle
the config-cycle, and then turn around and do an ioctl to the kernel to
configure the change with the in-kernel device model (or the irq
infrastructure, as required).

But, TBH, I haven't really looked into what's actually required to make
this work yet.  I am just spitballing to try to find a compromise.

>
>> 5) any drivers that are written for these new PCI-IDs that might be
>> present are allowed to use a hypercall ABI to talk after they have been
>> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
>> access methods).
>>   
>
> The way we'd do it with virtio is to add a feature bit that says "you
> can hypercall here instead of pio".  This way old drivers continue to
> work.

Yep, agreed.  This is what I was thinking we could do.  But now that I
have the possibility that I just need to write a virtio-vbus module to
co-exist with virtio-pci, perhaps it doesn't even need to be explicit.

>
> Note that nothing prevents us from trapping pio in the kernel (in
> fact, we do) and forwarding it to the device.  It shouldn't be any
> slower than hypercalls.
Sure, it's just slightly slower, so I would prefer pure hypercalls if at
all possible.

>
>> Once I get here, I might have greater clarity to see how hard it would
>> make to emulate fast path components as well.  It might be easier than I
>> think.
>>
>> This is all off the cuff so it might need some fine tuning before it's
>> actually workable.
>>
>> Does that sound reasonable?
>>   
>
> The vbus part (I assume you mean device enumeration) worries me

No, you are confusing the front-end and back-end again ;)

The back-end remains, and holds the device models as before.  This is
the "vbus core".  Today the front-end interacts with the hypervisor to
render "vbus" specific devices.  The proposal is to eliminate the
front-end, and have the back end render the objects on the bus as PCI
devices to the guest.  I am not sure if I can make it work, yet.  It
needs more thought.

> .  I don't think you've yet set down what its advantages are.  Being
> pure and clean doesn't count, unless you rip out PCI from all existing
> installed hardware and from Windows.

You are being overly dramatic.  No one has ever said we are talking
about ripping something out.  In fact, I've explicitly stated that PCI
can coexist peacefully.    Having more than one bus in a system is
certainly not without precedent (PCI, scsi, usb, etc).

Rather, PCI is PCI, and will always be.  PCI was designed as a
software-to-hardware interface.  It works well for its intention.  When
we do full emulation of guests, we still do PCI so that all that
software that was designed to work software-to-hardware still continues
to work, even though technically it's now software-to-software.  When we
do PV, on the other hand, we no longer need to pretend it is
software-to-hardware.  We can continue to use an interface designed for
software-to-hardware if we choose, or we can use something else such as
an interface designed specifically for software-to-software.

As I have stated, PCI was designed with hardware constraints in mind. 
What if I don't want to be governed by those constraints?  What if I
don't want an interrupt per device (I don't)?   What do I need BARs for
(I don't)?  Is a PCI PIO address relevant to me (no, hypercalls are more
direct)?  Etc.  It's crap I don't need.

All I really need is a way to a) discover and enumerate devices,
preferably dynamically (hotswap), and b) a way to communicate with those
devices.  I think you are overstating the importance that PCI plays
in (a), and are overstating the complexity associated with doing an
alternative.  I think you are understating the level of hackiness
required to continue to support PCI as we move to new paradigms, like
in-kernel models.  And I think I have already stated that I can
establish a higher degree of flexibility, and arguably, performance for
(b).  Therefore, I have come to the conclusion that I don't want it and
thus eradicated the dependence on it in my design.  I understand the
design tradeoffs that are associated with that decision.

>
>>> - finer-grained point-to-point communication abstractions
>>>
>>> Where virtio has ring+signalling together, you layer the two.  For
>>> networking, it doesn't matter.  For other applications, it may be
>>> helpful, perhaps you have something in mind.
>>>     
>>
>> Yeah, actually.  Thanks for bringing that up.
>>
>> So the reason why signaling and the ring are distinct constructs in the
>> design is to facilitate constructs other than rings.  For instance,
>> there may be some models where having a flat shared page is better than
>> a ring.  A ring will naturally preserve all values in flight, where as a
>> flat shared page would not (last update is always current).  There are
>> some algorithms where a previously posted value is obsoleted by an
>> update, and therefore rings are inherently bad for this update model.
>> And as we know, there are plenty of algorithms where a ring works
>> perfectly.  So I wanted that flexibility to be able to express both.
>>   
>
> I agree that there is significant potential here.
>
>> One of the things I have in mind for the flat page model is that RT vcpu
>> priority thing.  Another thing I am thinking of is coming up with a PV
>> LAPIC type replacement (where we can avoid doing the EOI trap by having
>> the PICs state shared).
>>   
>
> You keep falling into the paravirtualize the entire universe trap.  If
> you look deep down, you can see Jeremy struggling in there trying to
> bring dom0 support to Linux/Xen.
>
> The lapic is a huge ball of gunk but ripping it out is a monumental
> job with no substantial benefits.  We can at much lower effort avoid
> the EOI trap by paravirtualizing that small bit of ugliness.  Sure the
> result isn't a pure and clean room implementation.  It's a band aid. 
> But I'll take a 50-line band aid over a 3000-line implementation split
> across guest and host, which only works with Linux.
Well, keep in mind that I was really just giving you an example of
something that might want a shared-page instead of a shared-ring model. 
The possibility that such a device may be desirable in the future was
enough for me to decide that I wanted the shm model to be flexible,
instead of, say, designed specifically for virtio.  We may never, in
fact, do anything with the LAPIC idea.

-Greg

>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 13:13                                               ` Gregory Haskins
@ 2009-04-03 13:37                                                 ` Avi Kivity
  2009-04-03 16:28                                                   ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 13:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> So again, I am proposing for consideration of accepting my work (either
>>> in its current form, or something we agree on after the normal review
>>> process) not only on the basis of the future development of the
>>> platform, but also to keep current components in their running to their
>>> full potential.  I will again point out that the code is almost
>>> completely off to the side, can be completely disabled with config
>>> options, and I will maintain it.  Therefore the only real impact is to
>>> people who care to even try it, and to me.
>>>   
>>>       
>> Your work is a whole stack.  Let's look at the constituents.
>>
>> - a new virtual bus for enumerating devices.
>>
>> Sorry, I still don't see the point.  It will just make writing drivers
>> more difficult.  The only advantage I've heard from you is that it
>> gets rid of the gunk.  Well, we still have to support the gunk for
>> non-pv devices so the gunk is basically free.  The clean version is
>> expensive since we need to port it to all guests and implement
>> exciting features like hotplug.
>>     
> My real objection to PCI is fast-path related.  I don't object, per se,
> to using PCI for discovery and hotplug.  If you use PCI just for these
> types of things, but then allow fastpath to use more hypercall oriented
> primitives, then I would agree with you.  We can leave PCI emulation in
> user-space, and we get it for free, and things are relatively tidy.
>   

PCI has very little to do with the fast path (nothing, if we use MSI).

> It's once you start requiring that we stay ABI compatible with something
> like the existing virtio-net in x86 KVM where I think it starts to get
> ugly when you try to move it into the kernel.  So that is what I had a
> real objection to.  I think as long as we are not talking about trying
> to make something like that work, its a much more viable prospect.
>   

I don't see why the fast path of virtio-net would be bad.  Can you 
elaborate?

Obviously all the pci glue stays in userspace.

> So what I propose is the following: 
>
> 1) The core vbus design stays the same (or close to it)
>   

Sorry, I still don't see what advantage this has over PCI, and how you 
deal with the disadvantages.

> 2) the vbus-proxy and kvm-guest patch go away
> 3) the kvm-host patch changes to work with coordination from the
> userspace-pci emulation for things like MSI routing
> 4) qemu will know to create some MSI shim 1:1 with whatever it
> instantiates on the bus (and can communicate changes
>   

Don't understand.  What's this MSI shim?

> 5) any drivers that are written for these new PCI-IDs that might be
> present are allowed to use a hypercall ABI to talk after they have been
> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
> access methods).
>   

The way we'd do it with virtio is to add a feature bit that says "you can 
hypercall here instead of pio".  This way old drivers continue to work.
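
Roughly like this on the guest side (sketch only: the feature bit, the
hypercall number and the two small helpers are made up, the rest is the
usual virtio notify shape):

/* Sketch only: VIRTIO_F_NOTIFY_HYPERCALL, KVM_HC_VIRTIO_NOTIFY and the
 * queue_selector()/notify_ioaddr() helpers are made up for
 * illustration; the feature negotiation and the legacy PIO path follow
 * the existing virtio-pci pattern. */
#define VIRTIO_F_NOTIFY_HYPERCALL  30        /* hypothetical bit */
#define KVM_HC_VIRTIO_NOTIFY       100       /* hypothetical nr  */

static void vp_notify(struct virtqueue *vq)
{
        if (virtio_has_feature(vq->vdev, VIRTIO_F_NOTIFY_HYPERCALL)) {
                /* new driver on a new host: one vmcall, no PIO decode */
                kvm_hypercall1(KVM_HC_VIRTIO_NOTIFY, queue_selector(vq));
        } else {
                /* old driver or old host: the usual PIO kick */
                iowrite16(queue_selector(vq),
                          notify_ioaddr(vq) + VIRTIO_PCI_QUEUE_NOTIFY);
        }
}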

Note that nothing prevents us from trapping pio in the kernel (in fact, 
we do) and forwarding it to the device.  It shouldn't be any slower than 
hypercalls.

> Once I get here, I might have greater clarity to see how hard it would
> make to emulate fast path components as well.  It might be easier than I
> think.
>
> This is all off the cuff so it might need some fine tuning before it's
> actually workable.
>
> Does that sound reasonable?
>   

The vbus part (I assume you mean device enumeration) worries me.  I 
don't think you've yet set down what its advantages are.  Being pure and 
clean doesn't count, unless you rip out PCI from all existing installed 
hardware and from Windows.

>> - finer-grained point-to-point communication abstractions
>>
>> Where virtio has ring+signalling together, you layer the two.  For
>> networking, it doesn't matter.  For other applications, it may be
>> helpful, perhaps you have something in mind.
>>     
>
> Yeah, actually.  Thanks for bringing that up.
>
> So the reason why signaling and the ring are distinct constructs in the
> design is to facilitate constructs other than rings.  For instance,
> there may be some models where having a flat shared page is better than
> a ring.  A ring will naturally preserve all values in flight, where as a
> flat shared page would not (last update is always current).  There are
> some algorithms where a previously posted value is obsoleted by an
> update, and therefore rings are inherently bad for this update model. 
> And as we know, there are plenty of algorithms where a ring works
> perfectly.  So I wanted that flexibility to be able to express both.
>   

I agree that there is significant potential here.

> One of the things I have in mind for the flat page model is that RT vcpu
> priority thing.  Another thing I am thinking of is coming up with a PV
> LAPIC type replacement (where we can avoid doing the EOI trap by having
> the PICs state shared).
>   

You keep falling into the paravirtualize the entire universe trap.  If 
you look deep down, you can see Jeremy struggling in there trying to 
bring dom0 support to Linux/Xen.

The lapic is a huge ball of gunk but ripping it out is a monumental job 
with no substantial benefits.  We can at much lower effort avoid the EOI 
trap by paravirtualizing that small bit of ugliness.  Sure the result 
isn't a pure and clean room implementation.  It's a band aid.  But I'll 
take a 50-line band aid over a 3000-line implementation split across 
guest and host, which only works with Linux.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:15                                             ` Avi Kivity
@ 2009-04-03 13:13                                               ` Gregory Haskins
  2009-04-03 13:37                                                 ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 13:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5298 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> So again, I am proposing for consideration of accepting my work (either
>> in its current form, or something we agree on after the normal review
>> process) not only on the basis of the future development of the
>> platform, but also to keep current components running to their
>> full potential.  I will again point out that the code is almost
>> completely off to the side, can be completely disabled with config
>> options, and I will maintain it.  Therefore the only real impact is to
>> people who care to even try it, and to me.
>>   
>
> Your work is a whole stack.  Let's look at the constituents.
>
> - a new virtual bus for enumerating devices.
>
> Sorry, I still don't see the point.  It will just make writing drivers
> more difficult.  The only advantage I've heard from you is that it
> gets rid of the gunk.  Well, we still have to support the gunk for
> non-pv devices so the gunk is basically free.  The clean version is
> expensive since we need to port it to all guests and implement
> exciting features like hotplug.
My real objection to PCI is fast-path related.  I don't object, per se,
to using PCI for discovery and hotplug.  If you use PCI just for these
types of things, but then allow fastpath to use more hypercall oriented
primitives, then I would agree with you.  We can leave PCI emulation in
user-space, and we get it for free, and things are relatively tidy.

It's once you start requiring that we stay ABI compatible with something
like the existing virtio-net in x86 KVM where I think it starts to get
ugly when you try to move it into the kernel.  So that is what I had a
real objection to.  I think as long as we are not talking about trying
to make something like that work, its a much more viable prospect.

So what I propose is the following: 

1) The core vbus design stays the same (or close to it)
2) the vbus-proxy and kvm-guest patch go away
3) the kvm-host patch changes to work with coordination from the
userspace-pci emulation for things like MSI routing
4) qemu will know to create some MSI shim 1:1 with whatever it
instantiates on the bus (and can communicate changes
5) any drivers that are written for these new PCI-IDs that might be
present are allowed to use a hypercall ABI to talk after they have been
probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
access methods).

Once I get here, I might have greater clarity to see how hard it would
make to emulate fast path components as well.  It might be easier than I
think.

This is all off the cuff so it might need some fine tuning before it's
actually workable.

Does that sound reasonable?

>
> - finer-grained point-to-point communication abstractions
>
> Where virtio has ring+signalling together, you layer the two.  For
> networking, it doesn't matter.  For other applications, it may be
> helpful, perhaps you have something in mind.

Yeah, actually.  Thanks for bringing that up.

So the reason why signaling and the ring are distinct constructs in the
design is to facilitate constructs other than rings.  For instance,
there may be some models where having a flat shared page is better than
a ring.  A ring will naturally preserve all values in flight, where as a
flat shared page would not (last update is always current).  There are
some algorithms where a previously posted value is obsoleted by an
update, and therefore rings are inherently bad for this update model. 
And as we know, there are plenty of algorithms where a ring works
perfectly.  So I wanted that flexibility to be able to express both.
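
To make that concrete, a flat-page model is basically a single
overwrite slot with a generation counter, in contrast to a
produce/consume ring.  A sketch (not the actual shm code):

/* Sketch, not the actual shm code: a "latest value wins" shared page,
 * e.g. publishing a vcpu's current RT priority.  The writer overwrites
 * in place and bumps a sequence counter so the reader can detect a
 * torn read; old values are deliberately lost.  A ring, by contrast,
 * preserves every update in flight. */
struct shm_latest {
        volatile unsigned int seq;      /* odd while an update is in flight */
        unsigned int value;
};

static void shm_publish(struct shm_latest *s, unsigned int v)
{
        s->seq++;                       /* now odd: update in progress */
        __sync_synchronize();
        s->value = v;                   /* obsoletes the previous value */
        __sync_synchronize();
        s->seq++;                       /* even again: update complete */
}

static unsigned int shm_read(struct shm_latest *s)
{
        unsigned int seq, v;

        do {
                seq = s->seq;
                __sync_synchronize();
                v = s->value;
                __sync_synchronize();
        } while ((seq & 1) || seq != s->seq);

        return v;
}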

One of the things I have in mind for the flat page model is that RT vcpu
priority thing.  Another thing I am thinking of is coming up with a PV
LAPIC type replacement (where we can avoid doing the EOI trap by having
the PICs state shared).

>
> - your "bidirectional napi" model for the network device
>
> virtio implements exactly the same thing, except for the case of tx
> mitigation, due to my (perhaps pig-headed) rejection of doing things
> in a separate thread, and due to the total lack of sane APIs for
> packet traffic.

Yeah, and this part is not vbus, nor in-kernel specific.  That was just
a design element of venet-tap.  Though note, I did design the
vbus/shm-signal infrastructure with rich support for such a notion in
mind, so it wasn't accidental or anything like that.

>
> - a kernel implementation of the host networking device
>
> Given the continuous rejection (or rather, their continuous
> non-adoption-and-implementation) of my ideas re zerocopy networking
> aio, that seems like a pragmatic approach.  I wish it were otherwise.

Well, that gives me hope, at least ;)


>
> - a promise of more wonderful things yet to come
>
> Obviously I can't evaluate this.

Right, sorry.  I wish I had more concrete examples to show you, but we
only have the venet-tap working at this time.  I was going for the
"release early/often" approach in getting the core reviewed before we
got too far down a path, but perhaps that was the wrong thing in this
case.  We will certainly be sending updates as we get some of the more
advanced models and concepts working.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:02                                   ` Avi Kivity
@ 2009-04-03 13:05                                     ` Herbert Xu
  0 siblings, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-03 13:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 03:02:22PM +0300, Avi Kivity wrote:
>
> But it flushes the tap device, the packet still has to go through the  
> bridge + real interface?

Which under normal circumstances should occur before netif_rx_ni
returns.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 17:06                                                 ` Herbert Xu
  2009-04-02 17:17                                                   ` Herbert Xu
@ 2009-04-03 12:25                                                   ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 12:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:
>   
>> 3ms latency for ping?
>>
>> (ping will always be scheduled immediately when the reply arrives if I  
>> understand cfs, so guest load won't delay it)
>>     
>
> That only happens if the guest immediately does some CPU-intensive
> computation 3ms and assuming its timeslice lasts that long.
>   

 Note that this happens even if the computation is SCHED_BATCH.

> In any case, the same thing will happen right now if the host or
> some other guest on the same CPU hogs the CPU for 3ms.
>
>   

If the host is overloaded, that's fair.  But millisecond latencies 
without host contention is not a good result.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:03                                           ` Gregory Haskins
@ 2009-04-03 12:15                                             ` Avi Kivity
  2009-04-03 13:13                                               ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 12:15 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> So again, I am proposing for consideration of accepting my work (either
> in its current form, or something we agree on after the normal review
> process) not only on the basis of the future development of the
> platform, but also to keep current components running to their
> full potential.  I will again point out that the code is almost
> completely off to the side, can be completely disabled with config
> options, and I will maintain it.  Therefore the only real impact is to
> people who care to even try it, and to me.
>   

Your work is a whole stack.  Let's look at the constituents.

- a new virtual bus for enumerating devices.

Sorry, I still don't see the point.  It will just make writing drivers 
more difficult.  The only advantage I've heard from you is that it gets 
rid of the gunk.  Well, we still have to support the gunk for non-pv 
devices so the gunk is basically free.  The clean version is expensive 
since we need to port it to all guests and implement exciting features 
like hotplug.

- finer-grained point-to-point communication abstractions

Where virtio has ring+signalling together, you layer the two.  For 
networking, it doesn't matter.  For other applications, it may be 
helpful, perhaps you have something in mind.

- your "bidirectional napi" model for the network device

virtio implements exactly the same thing, except for the case of tx 
mitigation, due to my (perhaps pig-headed) rejection of doing things in 
a separate thread, and due to the total lack of sane APIs for packet 
traffic.

- a kernel implementation of the host networking device

Given the continuous rejection (or rather, their continuous 
non-adoption-and-implementation) of my ideas re zerocopy networking aio, 
that seems like a pragmatic approach.  I wish it were otherwise.

- a promise of more wonderful things yet to come

Obviously I can't evaluate this.

Did I miss anything?

Right now my preferred course of action is to implement a prototype 
userspace notification for networking.  Second choice is to move the 
host virtio implementation into the kernel.  I simply don't see how the 
rest of the stack is cost effective.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                         ` Anthony Liguori
  2009-04-02 16:19                                           ` Avi Kivity
@ 2009-04-03 12:03                                           ` Gregory Haskins
  2009-04-03 12:15                                             ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 12:03 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2438 bytes --]

Anthony Liguori wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Avi Kivity wrote:
>>>>
>>>> The alternative is to get a notification from the stack that the
>>>> packet is done processing.  Either an skb destructor in the kernel,
>>>> or my new API that everyone is not rushing out to implement.
>>>
>>> btw, my new api is
>>>
>>>
>>>   io_submit(..., nr, ...): submit nr packets
>>>   io_getevents(): complete nr packets
>>
>> I don't think we even need that to end this debate.  I'm convinced we
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping
>> latency of around 300ns whereas it's only 50ns on the host.  This
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes
> were the big winner... I hate qemu sometimes.

[ I've already said this privately to Anthony on IRC, but ..]

Hey, congrats!  That's impressive actually.

So I realize that perhaps you guys are not quite seeing my long term
vision here, which I think will offer some new features that we dont
have today.  I hope to change that over the coming weeks.  However, I
should also point out that perhaps even if, as of right now, my one and
only working module (venet-tap) were all I could offer, it does give us
a "rivalry" position between the two, and this historically has been a
good thing on many projects.  This helps foster innovation through
competition that potentially benefits both.  Case in point, a little
competition provoked an investigation that brought virtio-net's latency
down from 3125us to 90us.  I realize its not a production-ready patch
quite yet, but I am confident Anthony will find something that is
suitable to checkin very soon.  That's a huge improvement to a problem
that was just sitting around unnoticed because there was nothing to
compare it with.

So again, I am proposing for consideration of accepting my work (either
in its current form, or something we agree on after the normal review
process) not only on the basis of the future development of the
platform, but also to keep current components running to their
full potential.  I will again point out that the code is almost
completely off to the side, can be completely disabled with config
options, and I will maintain it.  Therefore the only real impact is to
people who care to even try it, and to me.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:55                                 ` Herbert Xu
@ 2009-04-03 12:02                                   ` Avi Kivity
  2009-04-03 13:05                                     ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 12:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:54:02PM +0300, Avi Kivity wrote:
>   
>> It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so  
>> we can tell when we can queue without blocking.
>>     
>
> Well netif_rx queues the packet, but netif_rx_ni is netif_rx plus
> an immediate flush.
>   

But it flushes the tap device, the packet still has to go through the 
bridge + real interface?

Even if it's queued there, I want to know when the packet is on the 
wire, not on some random software or hardware queue in the middle.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:54                               ` Avi Kivity
@ 2009-04-03 11:55                                 ` Herbert Xu
  2009-04-03 12:02                                   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-03 11:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:54:02PM +0300, Avi Kivity wrote:
>
> It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so  
> we can tell when we can queue without blocking.

Well netif_rx queues the packet, but netif_rx_ni is netif_rx plus
an immediate flush.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:48                             ` Herbert Xu
@ 2009-04-03 11:54                               ` Avi Kivity
  2009-04-03 11:55                                 ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 11:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote:
>   
>> The host writes the packet to tap, at which point it is consumed from  
>> its point of view.  The host would like to mention that if there was an  
>> API to notify it when the packet was actually consumed, then it would  
>> gladly use it.  Bonus points if this involves not copying the packet.
>>     
>
> We're using write(2) for this, no? 

Yes.

> That should invoke netif_rx_ni
> which blocks until the packet is "processed", which usually means
> that it's placed on the NIC's hardware queue.
>   

It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so 
we can tell when we can queue without blocking.
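
I.e. roughly this pattern (sketch):

#include <poll.h>
#include <unistd.h>

/* Roughly the pattern in use today (sketch): the tap fd is opened
 * O_NONBLOCK and poll() only tells us when another packet can be
 * queued; it says nothing about when the packet actually left the
 * machine. */
static ssize_t tap_send(int tapfd, const void *pkt, size_t len)
{
        struct pollfd pfd = { .fd = tapfd, .events = POLLOUT };

        poll(&pfd, 1, -1);              /* writable: room to queue more */
        return write(tapfd, pkt, len);  /* returns once tap has the copy */
}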

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:46                           ` Avi Kivity
@ 2009-04-03 11:48                             ` Herbert Xu
  2009-04-03 11:54                               ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-03 11:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote:
>
> The host writes the packet to tap, at which point it is consumed from  
> its point of view.  The host would like to mention that if there was an  
> API to notify it when the packet was actually consumed, then it would  
> gladly use it.  Bonus points if this involves not copying the packet.

We're using write(2) for this, no? That should invoke netif_rx_ni
which blocks until the packet is "processed", which usually means
that it's placed on the NIC's hardware queue.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:34                         ` Herbert Xu
@ 2009-04-03 11:46                         ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 11:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Hoffmann, Herbert Xu, ghaskins, anthony, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

Andi Kleen wrote:
>> Check shared ring status when stuffing a request.  If there are requests
>>     
>
> That means you're bouncing cache lines all the time. Probably not a big
> issue on single socket but could be on larger systems.
>   

That's why I'd like requests to be handled on the vcpu thread rather 
than an auxiliary thread.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:12                         ` Herbert Xu
@ 2009-04-03 11:46                           ` Avi Kivity
  2009-04-03 11:48                             ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 11:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:03:45PM +0300, Avi Kivity wrote:
>   
>> If the host is able to consume a request immediately, and the guest is  
>> not able to batch requests, this breaks down.  And that is the current  
>> situation.
>>     
>
> Hang on, why is the host consuming the request immediately? It
> has to write the packet to tap, which then calls netif_rx_ni so
> it should actually go all the way, no?
>   

The host writes the packet to tap, at which point it is consumed from 
its point of view.  The host would like to mention that if there was an 
API to notify it when the packet was actually consumed, then it would 
gladly use it.  Bonus points if this involves not copying the packet.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:18                       ` Andi Kleen
@ 2009-04-03 11:34                         ` Herbert Xu
  2009-04-03 11:46                         ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-03 11:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Hoffmann, Avi Kivity, ghaskins, anthony, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 01:18:54PM +0200, Andi Kleen wrote:
> > Check shared ring status when stuffing a request.  If there are requests
> 
> That means you're bouncing cache lines all the time. Probably not a big
> issue on single socket but could be on larger systems.

If the backend is running on a core that doesn't share caches
with the guest queue then you've got bigger problems.

Right, this is unavoidable for guests with many CPUs, but that
should go away once we support multiqueue in virtio-net.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
  2009-04-03 11:18                       ` Andi Kleen
@ 2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-03 11:28 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I didn't look at virtio-net very closely yet.  I wonder why the
> notification is that a big issue though.  It is easy to keep the number
> of notifications low without increasing latency:
>
> Check shared ring status when stuffing a request.  If there are requests
> not (yet) consumed by the other end there is no need to send a
> notification.  That scheme can even span multiple rings (nics with rx
> and tx for example).
>   

FWIW: I employ this scheme.  The shm-signal construct has a "dirty" and
"pending" flag (all on the same cacheline, which may or may not address
Andi's later point).  The first time you dirty the shm, it sets both
flags.  The consumer side has to clear "pending" before any subsequent
signals are sent.  Normally the consumer side will also clear "enabled"
(as part of the bidir napi thing) to further disable signals.
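
In rough pseudo-C (a sketch of the idea, not the literal shm-signal
code; raise_signal() stands in for the real hypercall/interrupt
injection):

/* Sketch of the idea, not the literal shm-signal code.  All three
 * flags live in shared memory on the same cacheline. */
extern void raise_signal(void);         /* hypothetical stand-in */

struct shm_signal_flags {
        volatile unsigned int dirty;    /* producer: new data available  */
        volatile unsigned int pending;  /* a signal is already in flight */
        volatile unsigned int enabled;  /* consumer currently wants them */
};

static void shm_signal_inject(struct shm_signal_flags *s)
{
        s->dirty = 1;
        __sync_synchronize();

        /* only raise the expensive signal if the consumer is accepting
         * them and has not yet acked the previous one */
        if (s->enabled && !__sync_lock_test_and_set(&s->pending, 1))
                raise_signal();
}

/* consumer side, before it resumes napi-style polling */
static void shm_signal_ack(struct shm_signal_flags *s)
{
        __sync_lock_release(&s->pending);   /* re-arm: next dirty signals */
}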

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
@ 2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:34                         ` Herbert Xu
  2009-04-03 11:46                         ` Avi Kivity
  2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 2 replies; 121+ messages in thread
From: Andi Kleen @ 2009-04-03 11:18 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Avi Kivity, Herbert Xu, ghaskins, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

> Check shared ring status when stuffing a request.  If there are requests

That means you're bouncing cache lines all the time. Probably not a big
issue on single socket but could be on larger systems.

-Andi


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:03                       ` Avi Kivity
@ 2009-04-03 11:12                         ` Herbert Xu
  2009-04-03 11:46                           ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-03 11:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:03:45PM +0300, Avi Kivity wrote:
>
> If the host is able to consume a request immediately, and the guest is  
> not able to batch requests, this breaks down.  And that is the current  
> situation.

Hang on, why is the host consuming the request immediately? It
has to write the packet to tap, which then calls netif_rx_ni so
it should actually go all the way, no?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
@ 2009-04-03 11:03                       ` Avi Kivity
  2009-04-03 11:12                         ` Herbert Xu
  2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-03 11:03 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Herbert Xu, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I didn't look at virtio-net very closely yet.  I wonder why the
> notification is that a big issue though.  It is easy to keep the number
> of notifications low without increasing latency:
>
> Check shared ring status when stuffing a request.  If there are requests
> not (yet) consumed by the other end there is no need to send a
> notification.  That scheme can even span multiple rings (nics with rx
> and tx for example).
>   

If the host is able to consume a request immediately, and the guest is 
not able to batch requests, this breaks down.  And that is the current 
situation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
  2009-04-02 10:55                     ` Gregory Haskins
@ 2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
                                         ` (2 more replies)
  2 siblings, 3 replies; 121+ messages in thread
From: Gerd Hoffmann @ 2009-04-03 10:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> There is no choice.  Exiting from the guest to the kernel to userspace
> is prohibitively expensive, you can't do that on every packet.

I didn't look at virtio-net very closely yet.  I wonder why the
notification is that a big issue though.  It is easy to keep the number
of notifications low without increasing latency:

Check shared ring status when stuffing a request.  If there are requests
not (yet) consumed by the other end there is no need to send a
notification.  That scheme can even span multiple rings (nics with rx
and tx for example).
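
In code that is just something like (sketch, names invented):

/* Sketch, names invented: skip the notification whenever the peer
 * still has unconsumed requests; it will see the new one on its next
 * pass through the ring anyway. */
struct ring {
        unsigned int prod_idx;          /* written by the producer */
        unsigned int cons_idx;          /* written by the consumer */
        /* ... descriptors ... */
};

extern void publish_request(struct ring *r);    /* hypothetical */
extern void notify_peer(struct ring *r);        /* hypothetical */

static void ring_kick(struct ring *r)
{
        unsigned int unconsumed = r->prod_idx - r->cons_idx;

        publish_request(r);             /* fill descriptor, bump prod_idx */

        if (unconsumed == 0)            /* ring was empty: peer may be idle */
                notify_peer(r);         /* so this one needs a wakeup */
}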

Host backend can put a limit on the number of requests it takes out of
the queue at once.  i.e. block backend can take out some requests, throw
them at the block layer, check whether any request in flight is done,
if so send back replies, start over again.  The guest can put more requests
into the queue meanwhile without having to notify the host.  I've seen
the number of notifications going down to zero when running disk
benchmarks in the guest ;)

Of course that works best with one or more I/O threads, so the vcpu
doesn't have to stop running anyway to get the I/O work done ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:10                                 ` Michael S. Tsirkin
@ 2009-04-03  4:43                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-03  4:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Gregory Haskins, Avi Kivity, Herbert Xu, anthony,
	andi, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Michael S. Tsirkin wrote:
> Rusty, I think this is what you did in your patch from 2008 to add destructor
> for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
> and it seems that it would make zero-copy possible - or was there some problem with
> that approach? Do you happen to remember?
>   

I'm planning on resurrecting it to replace the page destructor used by 
Xen netback.

    J


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 18:18                                             ` Anthony Liguori
@ 2009-04-03  1:11                                               ` Herbert Xu
  0 siblings, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-03  1:11 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: avi, ghaskins, andi, linux-kernel, agraf, pmullaney, pmorreale,
	rusty, netdev, kvm

Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> Anyway, if we're able to send this many packets, I suspect we'll be able 
> to also handle much higher throughputs without TX mitigation so that's 
> what I'm going to look at now.

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:19                                           ` Avi Kivity
@ 2009-04-02 18:18                                             ` Anthony Liguori
  2009-04-03  1:11                                               ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-04-02 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>> I don't think we even need that to end this debate.  I'm convinced 
>>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>>> latency of around 300ns whereas it's only 50ns on the host.  This 
>>> defies logic so I'm now looking to isolate why that is.
>>
>> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
>> were the big winner... I hate qemu sometimes.
>>
>>
>
> What, this:

UDP_RR test was limited by CPU consumption.  QEMU was pegging a CPU with 
only about 4000 packets per second whereas the host could do 14000.  An 
oprofile run showed that phys_page_find/cpu_physical_memory_rw were at 
the top by a wide margin, which makes little sense since virtio is zero 
copy in kvm-userspace today.

That leaves the ring queue accessors that used ld[wlq]_phys and friends 
that happen to make use of the above.  That led me to try this terrible 
hack below and lo and behold, we immediately jumped to 10000 pps.  This 
only works because almost nothing uses ld[wlq]_phys in practice except 
for virtio so breaking it for the non-RAM case didn't matter.

We didn't encounter this before because when I changed this behavior, I 
tested streaming and ping.  Both remained the same.  You can only expose 
this issue if you first disable tx mitigation.

Anyway, if we're able to send this many packets, I suspect we'll be able 
to also handle much higher throughputs without TX mitigation so that's 
what I'm going to look at now.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 17:06                                                 ` Herbert Xu
@ 2009-04-02 17:17                                                   ` Herbert Xu
  2009-04-03 12:25                                                   ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 17:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Fri, Apr 03, 2009 at 01:06:10AM +0800, Herbert Xu wrote:
>
> That only happens if the guest immediately does some CPU-intensive
> computation for 3ms, assuming its timeslice lasts that long.
> 
> In any case, the same thing will happen right now if the host or
> some other guest on the same CPU hogs the CPU for 3ms.

Even better, look at the packet's TOS.  If it's marked for low-
latency then vmexit immediately.  Otherwise continue.

In the backend you'd just set the marker in shared memory.

Of course invert this for the host => guest direction.
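
A rough sketch of that check from the guest transmit side (struct txq,
enqueue(), kick_host() and set_shared_marker() are invented names, purely
illustrative; invert the roles for the host => guest direction as noted):

#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

struct txq;                                      /* hypothetical tx ring           */
extern void enqueue(struct txq *q, struct sk_buff *skb);
extern void kick_host(struct txq *q);            /* immediate exit                 */
extern void set_shared_marker(struct txq *q);    /* lazy, host polls it later      */

static void queue_tx_packet(struct txq *q, struct sk_buff *skb)
{
    bool urgent = false;

    if (skb->protocol == htons(ETH_P_IP)) {
        const struct iphdr *iph = ip_hdr(skb);

        if (iph->tos & IPTOS_LOWDELAY)           /* marked latency-sensitive       */
            urgent = true;
    }

    enqueue(q, skb);

    if (urgent)
        kick_host(q);
    else
        set_shared_marker(q);
}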

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:54                                               ` Avi Kivity
@ 2009-04-02 17:06                                                 ` Herbert Xu
  2009-04-02 17:17                                                   ` Herbert Xu
  2009-04-03 12:25                                                   ` Avi Kivity
  0 siblings, 2 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 17:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:
>
> 3ms latency for ping?
>
> (ping will always be scheduled immediately when the reply arrives if I  
> understand cfs, so guest load won't delay it)

That only happens if the guest immediately does some CPU-intensive
computation for 3ms, assuming its timeslice lasts that long.

In any case, the same thing will happen right now if the host or
some other guest on the same CPU hogs the CPU for 3ms.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                             ` Herbert Xu
@ 2009-04-02 16:54                                               ` Avi Kivity
  2009-04-02 17:06                                                 ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 16:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>   
>> What if the guest sends N packets, then does some expensive computation  
>> (say the guest scheduler switches from the benchmark process to  
>> evolution).  So now we have the marker set at packet N, but the host  
>> will not see it until the guest timeslice is up?
>>     
>
> Well that's fine.  The guest will use up the remainder of its
> timeslice.  After all we only have one core/hyperthread here so
> this is no different than if the packets were held up higher up
> in the guest kernel and the guest decided to do some computation.
>
>   

3ms latency for ping?

(ping will always be scheduled immediately when the reply arrives if I 
understand cfs, so guest load won't delay it)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                         ` Anthony Liguori
@ 2009-04-02 16:19                                           ` Avi Kivity
  2009-04-02 18:18                                             ` Anthony Liguori
  2009-04-03 12:03                                           ` Gregory Haskins
  1 sibling, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 16:19 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Anthony Liguori wrote:
>> I don't think we even need that to end this debate.  I'm convinced we 
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>> latency of around 300ns whereas it's only 50ns on the host.  This 
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
> were the big winner... I hate qemu sometimes.
>
>

What, this:

> diff --git a/qemu/exec.c b/qemu/exec.c
> index 67f3fa3..1331022 100644
> --- a/qemu/exec.c
> +++ b/qemu/exec.c
> @@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldl_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;
> @@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldq_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;

The way I read it, it will only run slowly once per page, then 
settle to a cache miss per page.

Regardless, it makes a memslot model even more attractive.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:57                                           ` Avi Kivity
@ 2009-04-02 16:09                                             ` Herbert Xu
  2009-04-02 16:54                                               ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 16:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>
> What if the guest sends N packets, then does some expensive computation  
> (say the guest scheduler switches from the benchmark process to  
> evolution).  So now we have the marker set at packet N, but the host  
> will not see it until the guest timeslice is up?

Well that's fine.  The guest will use up the remainder of its
timeslice.  After all we only have one core/hyperthread here so
this is no different than if the packets were held up higher up
in the guest kernel and the guest decided to do some computation.

Once its timeslice completes the backend can start plugging away
at the backlog.

Of course it would be better to put the backend on another core
that shares the cache or a hyperthread on the same core.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:49                                       ` Anthony Liguori
@ 2009-04-02 16:09                                         ` Anthony Liguori
  2009-04-02 16:19                                           ` Avi Kivity
  2009-04-03 12:03                                           ` Gregory Haskins
  0 siblings, 2 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-04-02 16:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1274 bytes --]

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> The alternative is to get a notification from the stack that the 
>>> packet is done processing.  Either an skb destructor in the kernel, 
>>> or my new API that everyone is not rushing out to implement.
>>
>> btw, my new api is
>>
>>
>>   io_submit(..., nr, ...): submit nr packets
>>   io_getevents(): complete nr packets
>
> I don't think we even need that to end this debate.  I'm convinced we 
> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
> latency of around 300ns whereas it's only 50ns on the host.  This 
> defies logic so I'm now looking to isolate why that is.

I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes were 
the big winner... I hate qemu sometimes.

I'm pretty confident I can get at least to Greg's numbers with some 
poking.  I think I understand why he's doing better after reading his 
patches carefully but I also don't think it'll scale well with many 
guests...  stay tuned.

But most importantly, we are darn near where vbus is with this patch wrt 
added packet latency and this is totally from userspace with no host 
kernel changes.

So no, userspace is not the issue.

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>


[-- Attachment #2: first-pass.patch --]
[-- Type: text/x-patch, Size: 6596 bytes --]

diff --git a/qemu/exec.c b/qemu/exec.c
index 67f3fa3..1331022 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldl_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
@@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldq_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 9bce3a0..ac77b80 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -36,6 +36,7 @@ typedef struct VirtIONet
     VirtQueue *ctrl_vq;
     VLANClientState *vc;
     QEMUTimer *tx_timer;
+    QEMUBH *bh;
     int tx_timer_active;
     int mergeable_rx_bufs;
     int promisc;
@@ -504,6 +505,10 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
+VirtIODevice *global_vdev = NULL;
+
+extern void tap_try_to_recv(VLANClientState *vc);
+
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -545,42 +550,35 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
+        global_vdev = &n->vdev;
         len += qemu_sendv_packet(n->vc, out_sg, out_num);
+        global_vdev = NULL;
 
         virtqueue_push(vq, &elem, len);
         virtio_notify(&n->vdev, vq);
     }
+
+    tap_try_to_recv(n->vc->vlan->first_client);
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-        virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_timer_active = 0;
-        virtio_net_flush_tx(n, vq);
-    } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-        n->tx_timer_active = 1;
-        virtio_queue_set_notification(vq, 0);
-    }
+#if 0
+    virtio_queue_set_notification(vq, 0);
+    qemu_bh_schedule(n->bh);
+#else
+    virtio_net_flush_tx(n, n->tx_vq);
+#endif
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_handle_tx_bh(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
-
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
-
-    virtio_queue_set_notification(n->tx_vq, 1);
     virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq, 1);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -675,8 +673,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.get_features = virtio_net_get_features;
     n->vdev.set_features = virtio_net_set_features;
     n->vdev.reset = virtio_net_reset;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
     n->ctrl_vq = virtio_add_queue(&n->vdev, 16, virtio_net_handle_ctrl);
     memcpy(n->mac, nd->macaddr, ETH_ALEN);
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -684,10 +682,10 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
                                  virtio_net_receive, virtio_net_can_receive, n);
     n->vc->link_status_changed = virtio_net_set_link_status;
 
+    n->bh = qemu_bh_new(virtio_net_handle_tx_bh, n);
+
     qemu_format_nic_info_str(n->vc, n->mac);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
 
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 577eb5a..1365d11 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -507,6 +507,39 @@ static void virtio_reset(void *opaque)
     }
 }
 
+void virtio_sample_start(VirtIODevice *vdev)
+{
+    vdev->n_samples = 0;
+    virtio_sample(vdev);
+}
+
+void virtio_sample(VirtIODevice *vdev)
+{
+    gettimeofday(&vdev->samples[vdev->n_samples], NULL);
+    vdev->n_samples++;
+}
+
+static unsigned long usec_delta(struct timeval *before, struct timeval *after)
+{
+    return (after->tv_sec - before->tv_sec) * 1000000UL + (after->tv_usec - before->tv_usec);
+}
+
+void virtio_sample_end(VirtIODevice *vdev)
+{
+    int last, i;
+
+    virtio_sample(vdev);
+
+    last = vdev->n_samples - 1;
+
+    printf("Total time = %ldus\n", usec_delta(&vdev->samples[0], &vdev->samples[last]));
+
+    for (i = 1; i < vdev->n_samples; i++)
+        printf("sample[%d .. %d] = %ldus\n", i - 1, i, usec_delta(&vdev->samples[i - 1], &vdev->samples[i]));
+
+    vdev->n_samples = 0;
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
     VirtIODevice *vdev = to_virtio_device(opaque);
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 18c7a1a..a039310 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -17,6 +17,8 @@
 #include "hw.h"
 #include "pci.h"
 
+#include <sys/time.h>
+
 /* from Linux's linux/virtio_config.h */
 
 /* Status byte for guest to report progress, and synchronize features. */
@@ -87,6 +89,8 @@ struct VirtIODevice
     void (*set_config)(VirtIODevice *vdev, const uint8_t *config);
     void (*reset)(VirtIODevice *vdev);
     VirtQueue *vq;
+    int n_samples;
+    struct timeval samples[100];
 };
 
 VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
@@ -122,4 +126,10 @@ int virtio_queue_ready(VirtQueue *vq);
 
 int virtio_queue_empty(VirtQueue *vq);
 
+void virtio_sample_start(VirtIODevice *vdev);
+
+void virtio_sample(VirtIODevice *vdev);
+
+void virtio_sample_end(VirtIODevice *vdev);
+
 #endif
diff --git a/qemu/net.c b/qemu/net.c
index efb64d3..dc872e5 100644
--- a/qemu/net.c
+++ b/qemu/net.c
@@ -733,6 +733,7 @@ typedef struct TAPState {
 } TAPState;
 
 #ifdef HAVE_IOVEC
+
 static ssize_t tap_receive_iov(void *opaque, const struct iovec *iov,
                                int iovcnt)
 {
@@ -853,6 +854,12 @@ static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
+void tap_try_to_recv(VLANClientState *vc)
+{
+    TAPState *s = vc->opaque;
+    tap_send(s);
+}
+
 int tap_has_vnet_hdr(void *opaque)
 {
     VLANClientState *vc = opaque;

^ permalink raw reply related	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:40                                         ` Herbert Xu
@ 2009-04-02 15:57                                           ` Avi Kivity
  2009-04-02 16:09                                             ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 15:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>   
>> Good point - if we rely on having excess cores in the host, large guest  
>> scalability will drop.
>>     
>
> Going back to TX mitigation, I wonder if we could avoid it altogether
> by having a "wakeup" mechanism that does not involve a vmexit.  We
> have two cases:
>
> 1) UP, or rather guest runs on the same core/hyperthread as the
> backend.  This is the easy one, the guest simply sets a marker
> in shared memory and keeps going until its time is up.  Then the
> backend takes over, and uses a marker for notification too.
>
> The markers need to be interpreted by the scheduler so that it
> knows the guest/backend is runnable, respectively.
>   

Let's look at this first.

What if the guest sends N packets, then does some expensive computation 
(say the guest scheduler switches from the benchmark process to 
evolution).  So now we have the marker set at packet N, but the host 
will not see it until the guest timeslice is up?

I think I totally misunderstood you.  Can you repeat in smaller words?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:00                                       ` Avi Kivity
@ 2009-04-02 15:40                                         ` Herbert Xu
  2009-04-02 15:57                                           ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 15:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>
> Good point - if we rely on having excess cores in the host, large guest  
> scalability will drop.

Going back to TX mitigation, I wonder if we could avoid it altogether
by having a "wakeup" mechanism that does not involve a vmexit.  We
have two cases:

1) UP, or rather guest runs on the same core/hyperthread as the
backend.  This is the easy one, the guest simply sets a marker
in shared memory and keeps going until its time is up.  Then the
backend takes over, and uses a marker for notification too.

The markers need to be interpreted by the scheduler so that it
knows the guest/backend is runnable, respectively.

2) The guest and backend run on two cores/hyperthreads.  We'll
assume that they share caches as otherwise mitigation is the last
thing to worry about.  We use the same marker mechanism as above.
The only caveat is that if one core/hyperthread is idle, its
idle thread needs to monitor the marker (this would be a separate
per-core marker) to wake up the scheduler.
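
A bare-bones sketch of what such a marker could look like (layout and
helper names are invented; nothing like this exists in kvm or virtio
today):

/* One marker per direction, living in memory shared between the guest
 * and the host scheduler/idle loop. */
struct wakeup_marker {
    unsigned long guest_has_work;   /* written by guest, polled by host   */
    unsigned long host_has_work;    /* written by host, polled by guest   */
};

/* Guest side: publish work without a vmexit. */
static inline void guest_post_work(volatile struct wakeup_marker *m)
{
    m->guest_has_work = 1;
    /* a write barrier would be needed here on weakly ordered cpus */
}

/* Host side: what the scheduler (or the idle thread of a sibling
 * core/hyperthread) would check to decide the backend is runnable. */
static inline int backend_has_work(volatile struct wakeup_marker *m)
{
    return m->guest_has_work != 0;
}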

CCing Ingo so that he can flame me if I'm totally off the mark.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:13                               ` Rusty Russell
  2009-04-02 12:50                                 ` Gregory Haskins
@ 2009-04-02 15:10                                 ` Michael S. Tsirkin
  2009-04-03  4:43                                   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 121+ messages in thread
From: Michael S. Tsirkin @ 2009-04-02 15:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, Avi Kivity, Herbert Xu, anthony, andi,
	linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

On Thu, Apr 02, 2009 at 10:43:19PM +1030, Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> > You do not need to know when the packet is copied (which I currently
> > do).  You only need it for zero-copy (of which I would like to support,
> > but as I understand it there are problems with the reliability of proper
> > callback (i.e. skb->destructor).
> 
> But if you have a UP guest, there will *never* be another packet in the queue
> at this point, since it wasn't running.
> 
> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
> 
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
> 
> So, let's roll out a kernel virtio_net server.  Anyone?
> Rusty.

BTW, whatever approach is chosen, to enable zero-copy transmits, it seems that
we still must add tracking of when the skb has actually been transmitted, right?

Rusty, I think this is what you did in your patch from 2008 to add destructor
for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
and it seems that it would make zero-copy possible - or was there some problem with
that approach? Do you happen to remember?
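
For context, destructor-based completion tracking has roughly the shape
below (struct tx_slot and complete_tx_slot() are invented names; this
naive version hooks skb->destructor, the very field whose reliability was
questioned earlier in the thread, whereas the 2008 patch above adds, as I
understand it, a destructor tied to the skb data itself):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

struct tx_slot;                                       /* hypothetical per-buffer state */
extern void complete_tx_slot(struct tx_slot *slot);   /* tell guest: buffer reusable   */

static void tx_skb_destructor(struct sk_buff *skb)
{
    struct tx_slot *slot;

    memcpy(&slot, skb->cb, sizeof(slot));     /* recover the stashed pointer     */
    complete_tx_slot(slot);                   /* the stack is done with the data */
}

static int xmit_zero_copy(struct sk_buff *skb, struct tx_slot *slot)
{
    memcpy(skb->cb, &slot, sizeof(slot));     /* stash per-skb state             */
    skb->destructor = tx_skb_destructor;      /* fires when the skb is freed     */
    return dev_queue_xmit(skb);
}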

-- 
MST

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:50                                     ` Herbert Xu
@ 2009-04-02 15:00                                       ` Avi Kivity
  2009-04-02 15:40                                         ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 15:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>   
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>     
>
> Going off on a tangent here, I don't really think it should matter
> whether we're UP or SMP.  The ideal state is where we have the
> same number of (virtual) TX queues as there are cores in the guest.
> On the host side we need the backend to run at least on a core
> that shares cache with the corresponding guest queue/core.  If
> that happens to be the same core as the guest core then it should
> work as well.
>
> IOW we should optimise it as if the host were UP.
>   

Good point - if we rely on having excess cores in the host, large guest 
scalability will drop.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 13:22                                     ` Gregory Haskins
@ 2009-04-02 14:50                                     ` Herbert Xu
  2009-04-02 15:00                                       ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02 14:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm

On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.

Going off on a tangent here, I don't really think it should matter
whether we're UP or SMP.  The ideal state is where we have the
same number of (virtual) TX queues as there are cores in the guest.
On the host side we need the backend to run at least on a core
that shares cache with the corresponding guest queue/core.  If
that happens to be the same core as the guest core then it should
work as well.

IOW we should optimise it as if the host were UP.
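
To make "as many virtual TX queues as guest cores" concrete, queue
selection in the guest driver would be roughly (struct vnet and struct
txq are invented for illustration):

#include <linux/smp.h>

struct txq;                       /* hypothetical per-cpu tx ring            */
struct vnet {
    unsigned int num_txqs;        /* ideally == number of guest cpus         */
    struct txq  *txqs;            /* one entry per queue                     */
};

/* Called from the xmit path (preemption already disabled there), so the
 * submitting cpu maps 1:1 onto "its" queue; the host then runs the
 * matching backend thread on a core sharing a cache with that vcpu. */
static struct txq *select_txq(struct vnet *vn)
{
    return &vn->txqs[smp_processor_id() % vn->num_txqs];
}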

> The problem is that we already have virtio guest drivers going several  
> kernel versions back, as well as Windows drivers.  We can't keep  
> changing the infrastructure under people's feet.

Yes I agree that changing the guest-side driver is a no-no.  However,
we should be able to achieve what's shown here without modifying the
guest-side.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:41                                     ` Avi Kivity
@ 2009-04-02 14:49                                       ` Anthony Liguori
  2009-04-02 16:09                                         ` Anthony Liguori
  0 siblings, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-04-02 14:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> The alternative is to get a notification from the stack that the 
>> packet is done processing.  Either an skb destructor in the kernel, 
>> or my new API that everyone is not rushing out to implement.
>
> btw, my new api is
>
>
>   io_submit(..., nr, ...): submit nr packets
>   io_getevents(): complete nr packets

I don't think we even need that to end this debate.  I'm convinced we 
have a bug somewhere.  Even disabling TX mitigation, I see a ping 
latency of around 300ns whereas it's only 50ns on the host.  This defies 
logic so I'm now looking to isolate why that is.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:32                                   ` Avi Kivity
@ 2009-04-02 14:41                                     ` Avi Kivity
  2009-04-02 14:49                                       ` Anthony Liguori
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 14:41 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
>
> The alternative is to get a notification from the stack that the 
> packet is done processing.  Either an skb destructor in the kernel, or 
> my new API that everyone is not rushing out to implement.

btw, my new api is


   io_submit(..., nr, ...): submit nr packets
   io_getevents(): complete nr packets

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:24                                 ` Gregory Haskins
@ 2009-04-02 14:32                                   ` Avi Kivity
  2009-04-02 14:41                                     ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 14:32 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>> If you have a request-response workload with the wire idle and latency
>> critical, then there's no problem having an exit per packet because
>> (a) there aren't that many packets and (b) the guest isn't doing any
>> batching, so guest overhead will swamp the hypervisor overhead.
>>     
> Right, so the trick is to use an algorithm that adapts here.  Batching
> solves the first case, but not the second.  The bidir napi thing solves
> both, but it does assume you have ample host processing power to run the
> algorithm concurrently.  This may or may not be suitable to all
> applications, I admit.
>   

The alternative is to get a notification from the stack that the packet 
is done processing.  Either an skb destructor in the kernel, or my new 
API that everyone is not rushing out to implement.

>>> Right now its way way way worse than 2us.  In fact, at my last reading
>>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>>> 80us and I will be impressed.
>>>   
>>>       
>> The 3060us thing is a timer, not cpu time.
>>     
> Agreed, but it's still "state of the art" from an observer perspective. 
> The reason "why", though easily explainable, is inconsequential to most
> people.  FWIW, I have seen virtio-net do a much more respectable 350us
> on an older version, so I know there is plenty of room for improvement.
>   

All I want is the notification, and the timer is headed into the nearest 
landfill.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:45                               ` Avi Kivity
@ 2009-04-02 14:24                                 ` Gregory Haskins
  2009-04-02 14:32                                   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 14:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3816 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> Avi Kivity wrote:
>>>>  
>>>>      
>>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>>>
>>>>>
>>>>>             
>>>> Understood, but yet you need to do this if you want something like
>>>> iSCSI
>>>> READ transactions to have as low-latency as possible.
>>>>         
>>> Dunno, two microseconds is too much?  The wire imposes much more.
>>>
>>>     
>>
>> No, but thats not what we are talking about.  You said signaling on
>> every packet is prohibitively expensive.  I am saying signaling on every
>> packet is required for decent latency.  So is it prohibitively expensive
>> or not?
>>   
>
> We're heading dangerously into the word-game area.  Let's not do that.
>
> If you have a high throughput workload with many packets per second
> then an exit per packet (whether to userspace or to the kernel) is
> expensive.  So you do exit mitigation.  Latency is not important since
> the packets are going to sit in the output queue anyway.

Agreed.  virtio-net currently does this with batching.  I do it with the
bidir napi thing (which effectively crosses the producer::consumer > 1
threshold to mitigate the signal path).
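
In sketch form, that mitigation amounts to signalling only on the
empty->non-empty transition (hypothetical ring helpers, for illustration
only):

struct ring;
struct packet;

extern int  ring_empty(struct ring *r);
extern void ring_push(struct ring *r, struct packet *p);
extern void signal_consumer(struct ring *r);      /* exit / interrupt       */

static void producer_post(struct ring *r, struct packet *p)
{
    int was_empty = ring_empty(r);

    ring_push(r, p);

    /* If the consumer already had work queued it is (or soon will be)
     * running and will find the new entry on its own; only the
     * empty->non-empty transition needs a signal.  That is the
     * producer::consumer > 1 case degenerating to one signal total. */
    if (was_empty)
        signal_consumer(r);
}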


>
> If you have a request-response workload with the wire idle and latency
> critical, then there's no problem having an exit per packet because
> (a) there aren't that many packets and (b) the guest isn't doing any
> batching, so guest overhead will swamp the hypervisor overhead.
Right, so the trick is to use an algorithm that adapts here.  Batching
solves the first case, but not the second.  The bidir napi thing solves
both, but it does assume you have ample host processing power to run the
algorithm concurrently.  This may or may not be suitable to all
applications, I admit.

>
> If you have a low latency request-response workload mixed with a high
> throughput workload, then you aren't going to get low latency since
> your low latency packets will sit on the queue behind the high
> throughput packets.  You can fix that with multiqueue and then you're
> back to one of the scenarios above.
Agreed, and that's ok.  Now we are getting more into 802.1p type MQ
issues anyway, if the application cared about it that much.

>
>> I think most would agree that adding 2us is not bad, but so far that is
>> an unproven theory that the IO path in question only adds 2us.   And we
>> are not just looking at the rate at which we can enter and exit the
>> guest...we need the whole path...from the PIO kick to the dev_xmit() on
>> the egress hardware, to the ingress and rx-injection.  This includes any
>> and all penalties associated with the path, even if they are imposed by
>> something like the design of tun-tap.
>>   
>
> Correct, we need to look at the whole path.  That's why the wishing
> well is clogged with my 'give me a better userspace interface' emails.
>
>> Right now its way way way worse than 2us.  In fact, at my last reading
>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>> 80us and I will be impressed.
>>   
>
> The 3060us thing is a timer, not cpu time.
Agreed, but it's still "state of the art" from an observer perspective. 
The reason "why", though easily explainable, is inconsequential to most
people.  FWIW, I have seen virtio-net do a much more respectable 350us
on an older version, so I know there is plenty of room for improvement.

>   We aren't starting a JVM for each packet.
Heh...it kind of feels like that right now, so hopefully some
improvement will at least be one thing that comes out of all this.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:27                                       ` Avi Kivity
@ 2009-04-02 14:05                                         ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 14:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3713 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> Rusty Russell wrote:
>>>>  
>>>>      
>>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>>             
>>>>>> You do not need to know when the packet is copied (which I currently
>>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>>> support,
>>>>>> but as I understand it there are problems with the reliability of
>>>>>> proper
>>>>>> callback (i.e. skb->destructor).
>>>>>>                     
>>>>> But if you have a UP guest,
>>>>>             
>>>> I assume you mean UP host ;)
>>>>
>>>>         
>>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>>     
> That doesn't make sense to me, tho.  All the testing I did was a UP
>> guest, actually.  Why would I be constrained to run without the
>> scheduling unless the host was also UP?
>>   
>
> You aren't constrained.  And your numbers show it works.
>
>>>
>>> The problem is that we already have virtio guest drivers going several
>>> kernel versions back, as well as Windows drivers.  We can't keep
>>> changing the infrastructure under people's feet.
>>>     
>>
>> Well, IIUC the virtio code itself declares the ABI as unstable, so there
>> technically *is* an out if we really wanted one.  But I certainly
>> understand the desire to not change this ABI if at all possible, and
>> thus the resistance here.
>>   
>
> virtio is a stable ABI.

Dang!  Scratch that.
>
>> However, theres still the possibility we can make this work in an ABI
>> friendly way with cap-bits, or other such features.  For instance, the
>> virtio-net driver could register both with pci and vbus-proxy and
>> instantiate a device with a slightly different ops structure for each or
>> something.  Alternatively we could write a host-side shim to expose vbus
>> devices as pci devices or something like that.
>>   
>
> Sounds complicated...

Well, the first solution would be relatively trivial...at least on the
guest side.  All the other infrastructure is done and included in the
series I sent out.  The changes to the virtio-net driver on the guest
itself would be minimal.  The bigger effort would be converting
venet-tap to use virtio-ring instead of IOQ.  But this would arguably be
less work than starting a virtio-net backend module from scratch because
you would have to not only code up the entire virtio-net backend, but
also all the pci emulation and irq routing stuff that is required (and
is already done by the vbus infrastructure).  Here all the major pieces
are in place, just the xmit and rx routines need to be converted to
virtio-isms.

For the second option, I agree.  Its probably too nasty and it would be
better if there was just either a virtio-net to kvm-host hack, or a more
pci oriented version of a vbus-like framework.

That said, there is certainly nothing wrong with having an alternate
option.  There is plenty of precedent for having different drivers for
different subsystems, etc, even if there is overlap.  Heck, even KVM has
realtek, e1000, and virtio-net, etc.  Would our kvm community be willing
to work with me to get these patches merged?  I am perfectly willing to
maintain them.  That said, the general infrastructure should probably
not live in -kvm (perhaps -tip, -mm, or -next, etc is more
appropriate).  So a good plan might be to shoot for the core going into
a more general upstream tree.  When/if that happens, then the kvm
community could consider the kvm specific parts, etc.  I realize this is
all pending review acceptance by everyone involved...

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:36                             ` Gregory Haskins
@ 2009-04-02 13:45                               ` Avi Kivity
  2009-04-02 14:24                                 ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 13:45 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Avi Kivity wrote:
>>>  
>>>       
>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>>
>>>>
>>>>     
>>>>         
>>> Understood, but yet you need to do this if you want something like iSCSI
>>> READ transactions to have as low-latency as possible.
>>>   
>>>       
>> Dunno, two microseconds is too much?  The wire imposes much more.
>>
>>     
>
> No, but thats not what we are talking about.  You said signaling on
> every packet is prohibitively expensive.  I am saying signaling on every
> packet is required for decent latency.  So is it prohibitively expensive
> or not?
>   

We're heading dangerously into the word-game area.  Let's not do that.

If you have a high throughput workload with many packets per second 
then an exit per packet (whether to userspace or to the kernel) is 
expensive.  So you do exit mitigation.  Latency is not important since 
the packets are going to sit in the output queue anyway.

If you have a request-response workload with the wire idle and latency 
critical, then there's no problem having an exit per packet because (a) 
there aren't that many packets and (b) the guest isn't doing any 
batching, so guest overhead will swamp the hypervisor overhead.

If you have a low latency request-response workload mixed with a high 
throughput workload, then you aren't going to get low latency since your 
low latency packets will sit on the queue behind the high throughput 
packets.  You can fix that with multiqueue and then you're back to one 
of the scenarios above.

> I think most would agree that adding 2us is not bad, but so far that is
> an unproven theory that the IO path in question only adds 2us.   And we
> are not just looking at the rate at which we can enter and exit the
> guest...we need the whole path...from the PIO kick to the dev_xmit() on
> the egress hardware, to the ingress and rx-injection.  This includes any
> and all penalties associated with the path, even if they are imposed by
> something like the design of tun-tap.
>   

Correct, we need to look at the whole path.  That's why the wishing well 
is clogged with my 'give me a better userspace interface' emails.

> Right now its way way way worse than 2us.  In fact, at my last reading
> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
> maintaining line-rate) and I will be impressed.  Heck, shorten it to
> 80us and I will be impressed.
>   

The 3060us thing is a timer, not cpu time.  We aren't starting a JVM for 
each packet.  We could remove it given a notification API, or by 
duplicating the sched-and-forget thing, like Rusty did with lguest or 
Mark with qemu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:08                           ` Avi Kivity
@ 2009-04-02 13:36                             ` Gregory Haskins
  2009-04-02 13:45                               ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>
>>>
>>>     
>>
>> Understood, but yet you need to do this if you want something like iSCSI
>> READ transactions to have as low-latency as possible.
>>   
>
> Dunno, two microseconds is too much?  The wire imposes much more.
>

No, but thats not what we are talking about.  You said signaling on
every packet is prohibitively expensive.  I am saying signaling on every
packet is required for decent latency.  So is it prohibitively expensive
or not?

I think most would agree that adding 2us is not bad, but so far that is
an unproven theory that the IO path in question only adds 2us.   And we
are not just looking at the rate at which we can enter and exit the
guest...we need the whole path...from the PIO kick to the dev_xmit() on
the egress hardware, to the ingress and rx-injection.  This includes any
and all penalties associated with the path, even if they are imposed by
something like the design of tun-tap.

Right now its way way way worse than 2us.  In fact, at my last reading
this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
maintaining line-rate) and I will be impressed.  Heck, shorten it to
80us and I will be impressed.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:22                                     ` Gregory Haskins
@ 2009-04-02 13:27                                       ` Avi Kivity
  2009-04-02 14:05                                         ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 13:27 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Rusty Russell wrote:
>>>  
>>>       
>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>      
>>>>         
>>>>> You do not need to know when the packet is copied (which I currently
>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>> support,
>>>>> but as I understand it there are problems with the reliability of
>>>>> proper
>>>>> callback (i.e. skb->destructor).
>>>>>           
>>>>>           
>>>> But if you have a UP guest,
>>>>     
>>>>         
>>> I assume you mean UP host ;)
>>>
>>>   
>>>       
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>     
> That doesn't make sense to me, tho.  All the testing I did was a UP
> guest, actually.  Why would I be constrained to run without the
> scheduling unless the host was also UP?
>   

You aren't constrained.  And your numbers show it works.

>>
>> The problem is that we already have virtio guest drivers going several
>> kernel versions back, as well as Windows drivers.  We can't keep
>> changing the infrastructure under people's feet.
>>     
>
> Well, IIUC the virtio code itself declares the ABI as unstable, so there
> technically *is* an out if we really wanted one.  But I certainly
> understand the desire to not change this ABI if at all possible, and
> thus the resistance here.
>   

virtio is a stable ABI.

> However, theres still the possibility we can make this work in an ABI
> friendly way with cap-bits, or other such features.  For instance, the
> virtio-net driver could register both with pci and vbus-proxy and
> instantiate a device with a slightly different ops structure for each or
> something.  Alternatively we could write a host-side shim to expose vbus
> devices as pci devices or something like that.
>   

Sounds complicated...

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:07                                   ` Avi Kivity
@ 2009-04-02 13:22                                     ` Gregory Haskins
  2009-04-02 13:27                                       ` Avi Kivity
  2009-04-02 14:50                                     ` Herbert Xu
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Rusty Russell wrote:
>>  
>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>      
>>>> You do not need to know when the packet is copied (which I currently
>>>> do).  You only need it for zero-copy (of which I would like to
>>>> support,
>>>> but as I understand it there are problems with the reliability of
>>>> proper
>>>> callback (i.e. skb->destructor).
>>>>           
>>> But if you have a UP guest,
>>>     
>>
>> I assume you mean UP host ;)
>>
>>   
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.
That doesn't make sense to me, tho.  All the testing I did was a UP
guest, actually.  Why would I be constrained to run without the
scheduling unless the host was also UP?

>
>> Hmm..well I was hoping to be able to work with you guys to make my
>> proposal fit this role.  If there is no interest in that, I hope that my
>> infrastructure itself may still be considered for merging (in *some*
>> tree, not -kvm per se) as I would prefer to not maintain it out of tree
>> if it can be avoided.
>
> The problem is that we already have virtio guest drivers going several
> kernel versions back, as well as Windows drivers.  We can't keep
> changing the infrastructure under people's feet.

Well, IIUC the virtio code itself declares the ABI as unstable, so there
technically *is* an out if we really wanted one.  But I certainly
understand the desire to not change this ABI if at all possible, and
thus the resistance here.

However, theres still the possibility we can make this work in an ABI
friendly way with cap-bits, or other such features.  For instance, the
virtio-net driver could register both with pci and vbus-proxy and
instantiate a device with a slightly different ops structure for each or
something.  Alternatively we could write a host-side shim to expose vbus
devices as pci devices or something like that.

-Greg

>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:54                         ` Gregory Haskins
@ 2009-04-02 13:08                           ` Avi Kivity
  2009-04-02 13:36                             ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 13:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> My 'prohibitively expensive' is true only if you exit every packet.
>>
>>
>>     
>
> Understood, but yet you need to do this if you want something like iSCSI
> READ transactions to have as low-latency as possible.
>   

Dunno, two microseconds is too much?  The wire imposes much more.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 12:52                                   ` Gregory Haskins
@ 2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 13:22                                     ` Gregory Haskins
  2009-04-02 14:50                                     ` Herbert Xu
  1 sibling, 2 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 13:07 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

Gregory Haskins wrote:
> Rusty Russell wrote:
>   
>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>   
>>     
>>> You do not need to know when the packet is copied (which I currently
>>> do).  You only need it for zero-copy (of which I would like to support,
>>> but as I understand it there are problems with the reliability of proper
>>> callback (i.e. skb->destructor).
>>>     
>>>       
>> But if you have a UP guest,
>>     
>
> I assume you mean UP host ;)
>
>   

I think Rusty did mean a UP guest, and without schedule-and-forget.

> Hmm..well I was hoping to be able to work with you guys to make my
> proposal fit this role.  If there is no interest in that, I hope that my
> infrastructure itself may still be considered for merging (in *some*
> tree, not -kvm per se) as I would prefer to not maintain it out of tree
> if it can be avoided.

The problem is that we already have virtio guest drivers going several 
kernel versions back, as well as Windows drivers.  We can't keep 
changing the infrastructure under people's feet.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:43                                   ` Avi Kivity
@ 2009-04-02 13:03                                     ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> It's more of a "schedule and forget" which I think brings you the
>>> win.  The host disables notifications and schedules the actual tx work
>>> (rx from the host's perspective).  So now the guest and host continue
>>> producing and consuming packets in parallel.  So long as the guest is
>>> faster (due to the host being throttled?), notifications continue to
>>> be disabled.
>>>     
>> Yep, when the "producer::consumer" ratio is > 1, we mitigate
>> signaling. When its < 1, we signal roughly once per packet.
>>
>>  
>>> If you changed your rx_isr() to process the packets immediately
>>> instead of scheduling, I think throughput would drop dramatically.
>>>     
>> Right, that is the point. :) This is that "soft asic" thing I was
>> talking about yesterday.
>>   
>
> But all that has nothing to do with where the code lives, in the
> kernel or userspace.

Agreed, but note I've already stated that some of my boost is likely from
in-kernel, while others are unrelated design elements such as the
"soft-asic" approach (you guys dont read my 10 page emails, do you? ;). 
I don't deny that some of my ideas could be used in userspace as well
(Credit if used would be appreciated :).

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:42                       ` Avi Kivity
@ 2009-04-02 12:54                         ` Gregory Haskins
  2009-04-02 13:08                           ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 246 bytes --]

Avi Kivity wrote:
>
>
> My 'prohibitively expensive' is true only if you exit every packet.
>
>

Understood, but yet you need to do this if you want something like iSCSI
READ transactions to have as low-latency as possible.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:50                                 ` Gregory Haskins
@ 2009-04-02 12:52                                   ` Gregory Haskins
  2009-04-02 13:07                                   ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

Gregory Haskins wrote:
> Rusty Russell wrote:
>   
>
>>  there will *never* be another packet in the queue
>> at this point, since it wasn't running.
>>   
>>     
> Yep, and I'll be the first to admit that my design only looks forward. 
>   
To clarify, I am referring to the internal design of the venet-tap
only.  The general vbus architecture makes no such policy decisions.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:13                               ` Rusty Russell
@ 2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 12:52                                   ` Gregory Haskins
  2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 15:10                                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:50 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2296 bytes --]

Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>   
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>>     
>
> But if you have a UP guest,

I assume you mean UP host ;)

>  there will *never* be another packet in the queue
> at this point, since it wasn't running.
>   
Yep, and I'll be the first to admit that my design only looks forward. 
It's for high-speed links and multi-core CPUs, etc.  If you have a
uniprocessor host, the throughput would likely start to suffer with my
current strategy.  You could probably reclaim some of that throughput
(but trading latency) by doing as you are suggesting with the deferred
initial signalling.  However, it is still a tradeoff to account for the
lower-end rig.  I could certainly put a heuristic/timer on the
guest->host to mitigate this as well, but this is not my target use case
anyway so I am not sure it is worth it.


> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
>
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
>
> So, let's roll out a kernel virtio_net server.  Anyone?
>   
Hmm... well, I was hoping to be able to work with you guys to make my
proposal fit this role.  If there is no interest in that, I hope that my
infrastructure itself may still be considered for merging (in *some*
tree, not -kvm per se) as I would prefer to not maintain it out of tree
if it can be avoided.  I think people will find that the new logic
touches very few existing kernel lines at all, and can be completely
disabled with config options so it should be relatively inconsequential
to those that do not care.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:30                                 ` Gregory Haskins
@ 2009-04-02 12:43                                   ` Avi Kivity
  2009-04-02 13:03                                     ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 12:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

Gregory Haskins wrote:

  

>> It's more of a "schedule and forget" which I think brings you the
>> win.  The host disables notifications and schedules the actual tx work
>> (rx from the host's perspective).  So now the guest and host continue
>> producing and consuming packets in parallel.  So long as the guest is
>> faster (due to the host being throttled?), notifications continue to
>> be disabled.
>>     
> Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling. 
> When it's < 1, we signal roughly once per packet.
>
>   
>> If you changed your rx_isr() to process the packets immediately
>> instead of scheduling, I think throughput would drop dramatically.
>>     
> Right, that is the point. :) This is that "soft asic" thing I was
> talking about yesterday.
>   

But all that has nothing to do with where the code lives, in the kernel 
or userspace.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:22                     ` Gregory Haskins
@ 2009-04-02 12:42                       ` Avi Kivity
  2009-04-02 12:54                         ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 12:42 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>
>>  
>>
>>     
>>>> virtio is already non-kvm-specific (lguest uses it) and
>>>> non-pci-specific (s390 uses it).
>>>>     
>>>>         
>>> Ok, then to be more specific, I need it to be more generic than it
>>> already is.  For instance, I need it to be able to integrate with
>>> shm_signals.  
>>>       
>> Why?
>>     
> Well, shm_signals is what I designed to be the event mechanism for vbus
> devices.  One of the design criteria of shm_signal is that it should
> support a variety of environments, such as kvm, but also something like
> userspace apps.  So I cannot make assumptions about things like "pci
> interrupts", etc.
>   

virtio doesn't make these assumptions either.  The only difference I see 
is that you separate notification from the ring structure.

> By your own words, the exit to userspace is "prohibitively expensive",
> so that is either true or it's not.  If it's 2 microseconds, show me.

In user/test/x86/vmexit.c, change 'cpuid' to 'out %al, $0'; drop the 
printf() in kvmctl.c's test_outb().

I get something closer to 4 microseconds, but that's on a two-year-old 
machine; it will be around two on Nehalems.
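
(Schematically, the measurement is just a tight guest-side loop around a
forced PIO exit; this is a generic sketch of the idea, not the actual
kvm test code, and it needs I/O privilege, e.g. ioperm():)

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

static uint64_t cycles_per_pio_exit(unsigned long iters)
{
        uint64_t start = rdtsc();
        unsigned long i;

        for (i = 0; i < iters; i++)
                asm volatile("outb %%al, $0" : : "a"(0)); /* forces a vmexit */

        return (rdtsc() - start) / iters; /* divide by TSC MHz for usecs */
}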

My 'prohibitively expensive' is true only if you exit every packet.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:59                               ` Avi Kivity
@ 2009-04-02 12:30                                 ` Gregory Haskins
  2009-04-02 12:43                                   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> Why does a kernel solution not need to know when a packet is
>>> transmitted?
>>>
>>>     
>>
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>>
>> Its "fire and forget" :)
>>   
>
> It's more of a "schedule and forget" which I think brings you the
> win.  The host disables notifications and schedules the actual tx work
> (rx from the host's perspective).  So now the guest and host continue
> producing and consuming packets in parallel.  So long as the guest is
> faster (due to the host being throttled?), notifications continue to
> be disabled.
Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling. 
When it's < 1, we signal roughly once per packet.

>
> If you changed your rx_isr() to process the packets immediately
> instead of scheduling, I think throughput would drop dramatically.
Right, that is the point. :) This is that "soft asic" thing I was
talking about yesterday.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:43                   ` Avi Kivity
@ 2009-04-02 12:22                     ` Gregory Haskins
  2009-04-02 12:42                       ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3660 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>>     
>>
>> Ok, then to be more specific, I need it to be more generic than it
>> already is.  For instance, I need it to be able to integrate with
>> shm_signals.  
>
> Why?
Well, shm_signals is what I designed to be the event mechanism for vbus
devices.  One of the design criteria of shm_signal is that it should
support a variety of environments, such as kvm, but also something like
userspace apps.  So I cannot make assumptions about things like "pci
interrupts", etc.

So if I want to use it in vbus, virtio-ring has to be able to use
shm_signals, as opposed to what it does today.  Part of this would be a
natural fit for the "kick()" callback in virtio, but there are other
problems.  For one, virtio-ring (IIUC) does its own event-masking directly
in the virtio metadata.  However, I really want the higher-layer ring
overlay to do its masking in terms of the lower-layer shm_signal in order to
work the way I envision this stuff.  If you look at the IOQ
implementation, this is exactly what it does.
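
Roughly, the split I am describing looks like this (an illustrative sketch
only; the names and signatures here are made up and are not the actual
shm_signal/IOQ code):

struct shm_signal;

struct shm_signal_ops {
        int  (*inject)(struct shm_signal *s); /* raise the event: hypercall,
                                               * interrupt, eventfd, ... */
        void (*mask)(struct shm_signal *s);   /* suppress notifications */
        void (*unmask)(struct shm_signal *s); /* re-arm notifications */
};

struct shm_signal {
        const struct shm_signal_ops *ops;
};

/* The ring overlay kicks and masks through the signal object instead of
 * through transport-specific flags embedded in its own metadata. */
static inline int ring_kick(struct shm_signal *s)
{
        return s->ops->inject(s);
}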

To be clear, and I've stated this in the past: venet is just an example
of this generic, in-kernel concept.  We plan on doing much, much more
with all this.  One of the things we are working on is letting userspace
clients access this too, with the ultimate goal of supporting things like
guest-userspace doing bypass, rdma, etc.  We are not there yet,
though... only the kvm-host to guest-kernel path is currently functional
and is thus the working example.

I totally "get" the attraction to doing things in userspace.  Its
contained, naturally isolated, easily supports migration, etc.  Its also
a penalty.  Bare-metal userspace apps have a direct path to the kernel
IO.  I want to give guest the same advantage.  Some people will care
more about things like migration than performance, and that is fine. 
But others will certainly care more about performance, and that is what
we are trying to address.

>
>  
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>>     
>>
>> "exit mitigation' schemes are for bandwidth, not latency.  For latency
>> it all comes down to how fast you can signal in both directions.  If
>> someone is going to do a stand-alone request-reply, its generally always
>> going to be at least one hypercall and one rx-interrupt.  So your speed
>> will be governed by your signal path, not your buffer bandwidth.
>>   
>
> The userspace path is longer by 2 microseconds (for two additional
> heavyweight exits) and a few syscalls.  I don't think that's worthy of
> putting all the code in the kernel.

By your own words, the exit to userspace is "prohibitively expensive",
so that is either true or it's not.  If it's 2 microseconds, show me.  We
need the RTT to go from a "kick" PIO all the way to queueing a packet
on the egress hardware and back.  That is going to define your
latency.  If you can do this such that you can do something like an ICMP
ping in 65us (or anything within a few dozen microseconds of that),
I'll shut up about how much I think the current path sucks ;)  Even so,
I still propose the concept of a framework for in-kernel devices for
all the other reasons I mentioned above.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:06                             ` Gregory Haskins
  2009-04-02 11:59                               ` Avi Kivity
@ 2009-04-02 12:13                               ` Rusty Russell
  2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 15:10                                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 121+ messages in thread
From: Rusty Russell @ 2009-04-02 12:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).

But if you have a UP guest, there will *never* be another packet in the queue
at this point, since it wasn't running.

As Avi said, you can do the processing in another thread and go back to the
guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
again before the thread did for exactly this kind of reason.

While Avi's point about a "powerful enough userspace API" is probably valid,
I don't think it's going to happen.  It's almost certainly less code to put a
virtio_net server in the kernel, than it is to create such a powerful
interface (see vringfd & tap).  And that interface would have one user in
practice.

So, let's roll out a kernel virtio_net server.  Anyone?
Rusty.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:06                             ` Gregory Haskins
@ 2009-04-02 11:59                               ` Avi Kivity
  2009-04-02 12:30                                 ` Gregory Haskins
  2009-04-02 12:13                               ` Rusty Russell
  1 sibling, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 11:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

Gregory Haskins wrote:

  

>> Why does a kernel solution not need to know when a packet is transmitted?
>>
>>     
>
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).
>
> Its "fire and forget" :)
>   

It's more of a "schedule and forget" which I think brings you the win.  
The host disables notifications and schedules the actual tx work (rx 
from the host's perspective).  So now the guest and host continue 
producing and consuming packets in parallel.  So long as the guest is 
faster (due to the host being throttled?), notifications continue to be 
disabled.

If you changed your rx_isr() to process the packets immediately instead 
of scheduling, I think throughput would drop dramatically.
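
I.e., the win comes from a pattern roughly like this on the host side
(a hand-wavy sketch; struct ring and the ring_* helpers are made-up
placeholders, not the actual venet-tap code):

static void on_guest_kick(struct ring *r)       /* runs in the exit path */
{
        ring_disable_notify(r);                 /* no more exits while we drain */
        schedule_work(&r->rx_work);             /* defer the real work */
        /* return to the guest immediately */
}

static void rx_work_fn(struct work_struct *work)
{
        struct ring *r = container_of(work, struct ring, rx_work);
        struct sk_buff *skb;

        while ((skb = ring_pop(r)))
                netif_rx(skb);                  /* or copy to tap, etc. */

        ring_enable_notify(r);                  /* guest kicks again next time */
}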

Mark had a similar change for virtio.  Mark?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 10:55                     ` Gregory Haskins
@ 2009-04-02 11:48                       ` Avi Kivity
  0 siblings, 0 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 11:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:

  

>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>
>>     
>
> Now you are making my point ;)  This is part of the cost of your
> signaling path, and it directly adds to your latency time.   

It adds a microsecond.  The kvm overhead of putting things in userspace 
is low enough; I don't know why people keep mentioning it.  The problem 
is the kernel/user networking interfaces.

> You can't
> buffer packets here if the guest is only going to send one and wait for
> a response and expect that to perform well.  And this is precisely what
> drove me to look at avoiding going back to userspace in the first place.
>   

We're not buffering any packets.  What we lack is a way to tell the 
guest that we're done processing all packets in the ring (IOW, re-enable 
notifications).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 10:46                 ` Gregory Haskins
@ 2009-04-02 11:43                   ` Avi Kivity
  2009-04-02 12:22                     ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02 11:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:

  

>> virtio is already non-kvm-specific (lguest uses it) and
>> non-pci-specific (s390 uses it).
>>     
>
> Ok, then to be more specific, I need it to be more generic than it
> already is.  For instance, I need it to be able to integrate with
> shm_signals.  

Why?

  

>> If you have a good exit mitigation scheme you can cut exits by a
>> factor of 100; so the userspace exit costs are cut by the same
>> factor.  If you have good copyless networking APIs you can cut the
>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>> but a kernel solution needs that too).
>>     
>
> "exit mitigation' schemes are for bandwidth, not latency.  For latency
> it all comes down to how fast you can signal in both directions.  If
> someone is going to do a stand-alone request-reply, its generally always
> going to be at least one hypercall and one rx-interrupt.  So your speed
> will be governed by your signal path, not your buffer bandwidth.
>   

The userspace path is longer by 2 microseconds (for two additional 
heavyweight exits) and a few syscalls.  I don't think that's worthy of 
putting all the code in the kernel.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:38                           ` Avi Kivity
  2009-04-02  9:41                             ` Herbert Xu
@ 2009-04-02 11:06                             ` Gregory Haskins
  2009-04-02 11:59                               ` Avi Kivity
  2009-04-02 12:13                               ` Rusty Russell
  1 sibling, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 11:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

Avi Kivity wrote:
> Herbert Xu wrote:
>> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>>  
>>> If tap told us when the packets were actually transmitted, life
>>> would be  wonderful:
>>>     
>>
>> And why do we need this? Because we are in user space!
>>
>>   
>
> Why does a kernel solution not need to know when a packet is transmitted?
>

You do not need to know when the packet is copied (which I currently
do).  You only need it for zero-copy (which I would like to support,
but as I understand it there are problems with the reliability of the
proper callback, i.e. skb->destructor).

Its "fire and forget" :)

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
@ 2009-04-02 10:55                     ` Gregory Haskins
  2009-04-02 11:48                       ` Avi Kivity
  2009-04-03 10:58                     ` Gerd Hoffmann
  2 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 10:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]

Avi Kivity wrote:
> Herbert Xu wrote:
>> Avi Kivity <avi@redhat.com> wrote:
>>  
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>>     
>>
>> I think Greg's work shows that putting the backend in the kernel
>> can dramatically reduce the cost of a single guest->host transaction.
>> I'm sure the same thing would work for virtio too.
>>   
>
> Virtio suffers because we've had no notification of when a packet is
> actually submitted.  With the notification, the only difference should
> be in the cost of a kernel->user switch, which is nowhere nearly as
> dramatic.
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>>     
>>
>> Given the choice of having to mitigate or not having the problem
>> in the first place, guess what I would prefer :)
>>   
>
> There is no choice.  Exiting from the guest to the kernel to userspace
> is prohibitively expensive, you can't do that on every packet.
>

Now you are making my point ;)  This is part of the cost of your
signaling path, and it directly adds to your latency.  You can't
buffer packets here if the guest is only going to send one and wait for
a response and expect that to perform well.  And this is precisely what
drove me to look at avoiding going back to userspace in the first place.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:51               ` Avi Kivity
  2009-04-02  8:52                 ` Herbert Xu
@ 2009-04-02 10:46                 ` Gregory Haskins
  2009-04-02 11:43                   ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>
>>
>> I think there is a slight disconnect here.  This is *exactly* what I am
>> trying to do.  You can of course do this many ways, and I am not denying
>> it could be done a different way than the path I have chosen.  One
>> extreme would be to just slam a virtio-net specific chunk of code
>> directly into kvm on the host.  Another extreme would be to build a
>> generic framework into Linux for declaring arbitrary IO types,
>> integrating it with kvm (as well as other environments such as lguest,
>> userspace, etc), and building a virtio-net model on top of that.
>>
>> So in case it is not obvious at this point, I have gone with the latter
>> approach.  I wanted to make sure it wasn't kvm specific or something
>> like pci specific so it had the broadest applicability to a range of
>> environments.  So that is why the design is the way it is.  I understand
>> that this approach is technically "harder/more-complex" than the "slam
>> virtio-net into kvm" approach, but I've already done that work.  All we
>> need to do now is agree on the details ;)
>>
>>   
>
> virtio is already non-kvm-specific (lguest uses it) and
> non-pci-specific (s390 uses it).

Ok, then to be more specific, I need it to be more generic than it
already is.  For instance, I need it to be able to integrate with
shm_signals.  If we can do that without breaking the existing ABI, that
would be great!  Last I looked, it was somewhat entwined here, so I didn't
try... but I admit that I didn't try that hard since I already had the IOQ
library ready to go.

>
>>> That said, I don't think we're bound today by the fact that we're in
>>> userspace.
>>>     
>> You will *always* be bound by the fact that you are in userspace.  It's
>> purely a question of "how much" and "does anyone care".    Right now,
>> the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
>> do".  I have no doubt that this can and will change/improve in the
>> future.  But it will always be true that no matter how much userspace
>> improves, the kernel-based solution will always be faster.  It's simple
>> physics.  I'm cutting out the middleman to ultimately reach the same
>> destination as the userspace path, so userspace can never be equal.
>>   
>
> If you have a good exit mitigation scheme you can cut exits by a
> factor of 100; so the userspace exit costs are cut by the same
> factor.  If you have good copyless networking APIs you can cut the
> cost of copies to zero (well, to the cost of get_user_pages_fast(),
> but a kernel solution needs that too).

"exit mitigation' schemes are for bandwidth, not latency.  For latency
it all comes down to how fast you can signal in both directions.  If
someone is going to do a stand-alone request-reply, its generally always
going to be at least one hypercall and one rx-interrupt.  So your speed
will be governed by your signal path, not your buffer bandwidth.
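
As a back-of-the-envelope way to see it (the terms below are illustrative,
not measurements):

/*
 * rtt ~= t_kick             (guest->host signal: hypercall/PIO exit)
 *      + t_host_processing  (queue the packet on the egress device)
 *      + t_wire             (device and network round trip)
 *      + t_irq_inject       (host->guest signal: rx interrupt)
 *      + t_guest_processing
 *
 * For a lone request-reply, batching shortens none of these terms; it
 * only amortizes t_kick/t_irq_inject when many packets are in flight.
 */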

What I've done is show that you can use techniques other than buffering
the head of the queue to do exit mitigation for bandwidth, while still
maintaining a very short signaling path for latency.  And I also argue
that the latter will always be optimal in the kernel, though to what
degree is still TBD.  Anthony thinks he can make the difference
negligible, and I would love to see it, but I am skeptical.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:43                               ` Avi Kivity
@ 2009-04-02  9:44                                 ` Herbert Xu
  0 siblings, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:43:54PM +0300, Avi Kivity wrote:
>
> So we're back to "the problem is with the kernel->user interface, not  
> userspace being cursed into slowness".

Well until you have a patch + numbers that's only an allegation :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:41                             ` Herbert Xu
@ 2009-04-02  9:43                               ` Avi Kivity
  2009-04-02  9:44                                 ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  9:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote:
>   
>> Why does a kernel solution not need to know when a packet is transmitted?
>>     
>
> Because you can install your own destructor?
>   

So we're back to "the problem is with the kernel->user interface, not 
userspace being cursed into slowness".


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:38                           ` Avi Kivity
@ 2009-04-02  9:41                             ` Herbert Xu
  2009-04-02  9:43                               ` Avi Kivity
  2009-04-02 11:06                             ` Gregory Haskins
  1 sibling, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote:
>
> Why does a kernel solution not need to know when a packet is transmitted?

Because you can install your own destructor?

I don't know what Greg did, but netback did that nasty page destructor
hack which Jeremy is trying to undo :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:29                         ` Herbert Xu
  2009-04-02  9:33                           ` Herbert Xu
@ 2009-04-02  9:38                           ` Avi Kivity
  2009-04-02  9:41                             ` Herbert Xu
  2009-04-02 11:06                             ` Gregory Haskins
  1 sibling, 2 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  9:38 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>   
>> If tap told us when the packets were actually transmitted, life would be  
>> wonderful:
>>     
>
> And why do we need this? Because we are in user space!
>
>   

Why does a kernel solution not need to know when a packet is transmitted?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:29                         ` Herbert Xu
@ 2009-04-02  9:33                           ` Herbert Xu
  2009-04-02  9:38                           ` Avi Kivity
  1 sibling, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Patrick Ohly, David S. Miller

On Thu, Apr 02, 2009 at 05:29:36PM +0800, Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
> >
> > If tap told us when the packets were actually transmitted, life would be  
> > wonderful:
> 
> And why do we need this? Because we are in user space!
> 
> I'll continue to wait for your patch and numbers :)

And in case you're working on that patch, this might interest
you.  Check out the netdev thread titled "TX time stamping".
Now that we assign the tap skb with its own sk, these two scenarios
are pretty much identical.

I also noticed that, despite davem's threats to revert the patch, it
has now made Linus's tree :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:27                       ` Avi Kivity
@ 2009-04-02  9:29                         ` Herbert Xu
  2009-04-02  9:33                           ` Herbert Xu
  2009-04-02  9:38                           ` Avi Kivity
  0 siblings, 2 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>
> If tap told us when the packets were actually transmitted, life would be  
> wonderful:

And why do we need this? Because we are in user space!

I'll continue to wait for your patch and numbers :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:16                     ` Herbert Xu
@ 2009-04-02  9:27                       ` Avi Kivity
  2009-04-02  9:29                         ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  9:27 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace  
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I was referring to the bit between the kernel and userspace.
>
> In any case, I just looked at the virtio mitigation code again
> and I am completely baffled at why we need it.  Look at Greg's
> code or the netback/netfront notification, why do we need this
> completely artificial mitigation when the ring itself provides
> a natural way of stemming the flow?
>   

If the vcpu thread does the transmit, then it will always complete 
sending immediately:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: copy to tap
  qemu: ??

At this point, qemu must enable notification again, since we have no 
notification from tap that the transmit completed.  The only alternative 
is the timer.

If we do the transmit through an extra thread, then scheduling latency 
buys us some time:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: schedule iothread
  iothread: pop packet
  iothread: copy to tap
  iothread: check for more packets
  iothread: enable notification

If tap told us when the packets were actually transmitted, life would be 
wonderful:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: queue on tap
  qemu: return to guest
  hardware: churn churn churn
  tap: packet is out
  iothread: check for more packets
  iothread: enable notification
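
The middle case is essentially the classic consumer loop (a sketch with
made-up helper names, not qemu's actual code); the re-check after
re-enabling closes the race with a guest that queued a packet in the
meantime:

static void iothread_tx_flush(struct vring_state *vr)
{
        void *pkt;

        for (;;) {
                while ((pkt = vring_pop(vr)))
                        copy_to_tap(pkt);

                vring_enable_notify(vr);
                if (vring_empty(vr))
                        break;                  /* guest will notify us next time */
                vring_disable_notify(vr);       /* raced with the guest; keep draining */
        }
}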

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
@ 2009-04-02  9:16                     ` Herbert Xu
  2009-04-02  9:27                       ` Avi Kivity
  2009-04-02 10:55                     ` Gregory Haskins
  2009-04-03 10:58                     ` Gerd Hoffmann
  2 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote:
>
> There is no choice.  Exiting from the guest to the kernel to userspace  
> is prohibitively expensive, you can't do that on every packet.

I was referring to the bit between the kernel and userspace.

In any case, I just looked at the virtio mitigation code again
and I am completely baffled as to why we need it.  Look at Greg's
code or the netback/netfront notification: why do we need this
completely artificial mitigation when the ring itself provides
a natural way of stemming the flow?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:03                   ` Avi Kivity
@ 2009-04-02  9:05                     ` Herbert Xu
  0 siblings, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  9:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:03:32PM +0300, Avi Kivity wrote:
>
> Like Anthony said, the problem is with the kernel->user interfaces.  We  
> won't have a good user space virtio implementation until that is fixed.

If it's just the interface that's bad, then it should be possible
to do a proof-of-concept patch to show that this is the case.

Even if we have to redesign the interface, at least you can then
say that you guys were right all along :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  8:54                 ` Herbert Xu
@ 2009-04-02  9:03                   ` Avi Kivity
  2009-04-02  9:05                     ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  9:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote:
>   
>> I don't understand this.  If we had good interfaces, all that userspace  
>> would do is translate guest physical addresses to host physical  
>> addresses, and translate the guest->host protocol to host API calls.  I  
>> don't see anything there that benefits from being in the kernel.
>>
>> Can you elaborate?
>>     
>
> I think Greg has expressed it clearly enough.
>
> At the end of the day, the numbers speak for themselves.  So if
> and when there's a user-space version that achieves the same or
> better results, then I will change my mind :)
>   

Like Anthony said, the problem is with the kernel->user interfaces.  We 
won't have a good user space virtio implementation until that is fixed.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  8:52                 ` Herbert Xu
@ 2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
                                       ` (2 more replies)
  0 siblings, 3 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  9:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> Avi Kivity <avi@redhat.com> wrote:
>   
>> virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
>> (s390 uses it).
>>     
>
> I think Greg's work shows that putting the backend in the kernel
> can dramatically reduce the cost of a single guest->host transaction.
> I'm sure the same thing would work for virtio too.
>   

Virtio suffers because we've had no notification of when a packet is 
actually submitted.  With the notification, the only difference should 
be in the cost of a kernel->user switch, which is nowhere nearly as 
dramatic.

>> If you have a good exit mitigation scheme you can cut exits by a factor 
>> of 100; so the userspace exit costs are cut by the same factor.  If you 
>> have good copyless networking APIs you can cut the cost of copies to 
>> zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
>> needs that too).
>>     
>
> Given the choice of having to mitigate or not having the problem
> in the first place, guess what I would prefer :)
>   

There is no choice.  Exiting from the guest to the kernel to userspace 
is prohibitively expensive, you can't do that on every packet.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:46               ` Avi Kivity
@ 2009-04-02  8:54                 ` Herbert Xu
  2009-04-02  9:03                   ` Avi Kivity
  0 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  8:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote:
>
> I don't understand this.  If we had good interfaces, all that userspace  
> would do is translate guest physical addresses to host physical  
> addresses, and translate the guest->host protocol to host API calls.  I  
> don't see anything there that benefits from being in the kernel.
>
> Can you elaborate?

I think Greg has expressed it clearly enough.

At the end of the day, the numbers speak for themselves.  So if
and when there's a user-space version that achieves the same or
better results, then I will change my mind :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:51               ` Avi Kivity
@ 2009-04-02  8:52                 ` Herbert Xu
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02 10:46                 ` Gregory Haskins
  1 sibling, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  8:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity <avi@redhat.com> wrote:
>
> virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
> (s390 uses it).

I think Greg's work shows that putting the backend in the kernel
can dramatically reduce the cost of a single guest->host transaction.
I'm sure the same thing would work for virtio too.

> If you have a good exit mitigation scheme you can cut exits by a factor 
> of 100; so the userspace exit costs are cut by the same factor.  If you 
> have good copyless networking APIs you can cut the cost of copies to 
> zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
> needs that too).

Given the choice of having to mitigate or not having the problem
in the first place, guess what I would prefer :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  0:29               ` Anthony Liguori
@ 2009-04-02  6:51               ` Avi Kivity
  2009-04-02  8:52                 ` Herbert Xu
  2009-04-02 10:46                 ` Gregory Haskins
  1 sibling, 2 replies; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  6:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>
>
> I think there is a slight disconnect here.  This is *exactly* what I am
> trying to do.  You can of course do this many ways, and I am not denying
> it could be done a different way than the path I have chosen.  One
> extreme would be to just slam a virtio-net specific chunk of code
> directly into kvm on the host.  Another extreme would be to build a
> generic framework into Linux for declaring arbitrary IO types,
> integrating it with kvm (as well as other environments such as lguest,
> userspace, etc), and building a virtio-net model on top of that.
>
> So in case it is not obvious at this point, I have gone with the latter
> approach.  I wanted to make sure it wasn't kvm specific or something
> like pci specific so it had the broadest applicability to a range of
> environments.  So that is why the design is the way it is.  I understand
> that this approach is technically "harder/more-complex" than the "slam
> virtio-net into kvm" approach, but I've already done that work.  All we
> need to do now is agree on the details ;)
>
>   

virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
(s390 uses it).

>> That said, I don't think we're bound today by the fact that we're in
>> userspace.
>>     
> You will *always* be bound by the fact that you are in userspace.  It's
> purely a question of "how much" and "does anyone care".    Right now,
> the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
> do".  I have no doubt that this can and will change/improve in the
> future.  But it will always be true that no matter how much userspace
> improves, the kernel-based solution will always be faster.  It's simple
> physics.  I'm cutting out the middleman to ultimately reach the same
> destination as the userspace path, so userspace can never be equal.
>   

If you have a good exit mitigation scheme you can cut exits by a factor 
of 100; so the userspace exit costs are cut by the same factor.  If you 
have good copyless networking APIs you can cut the cost of copies to 
zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
needs that too).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  3:09             ` Herbert Xu
@ 2009-04-02  6:46               ` Avi Kivity
  2009-04-02  8:54                 ` Herbert Xu
  0 siblings, 1 reply; 121+ messages in thread
From: Avi Kivity @ 2009-04-02  6:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> Anthony Liguori <anthony@codemonkey.ws> wrote:
>   
>> That said, I don't think we're bound today by the fact that we're in 
>> userspace.  Rather we're bound by the interfaces we have between the 
>> host kernel and userspace to generate IO.  I'd rather fix those 
>> interfaces than put more stuff in the kernel.
>>     
>
> I'm sorry but I totally disagree with that.  By having our IO
> infrastructure in user-space we've basically given up the main
> advantage of kvm, which is that the physical drivers operate in
> the same environment as the hypervisor.
>   

I don't understand this.  If we had good interfaces, all that userspace 
would do is translate guest physical addresses to host physical 
addresses, and translate the guest->host protocol to host API calls.  I 
don't see anything there that benefits from being in the kernel.

Can you elaborate?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 22:10                   ` Gregory Haskins
@ 2009-04-02  6:00                     ` Chris Wright
  0 siblings, 0 replies; 121+ messages in thread
From: Chris Wright @ 2009-04-02  6:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Anthony Liguori, Andi Kleen, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:

<snip nice list>

> I cant think of more examples right now, but I will update this list
> if/when I come up with more.  I hope that satisfactorily answered your
> question, though!

Yes, that helps, thanks.

There's still the simple issue of guest/host interface widening w/ a kernel-
resident backend, where a plain ol' bug (good that you thought about the
isolation) can take out more than a single guest.  Always the balance... ;-)

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
  2009-04-01 16:10   ` Anthony Liguori
@ 2009-04-02  3:15   ` Herbert Xu
  2 siblings, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  3:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: ghaskins, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	netdev, kvm

Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.

FWIW I don't really care whether we go with this or a kernel
virtio_net backend.  Either way should be good.  However the
status quo where we're stuck with a user-space backend really
sucks!

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:11               ` Gregory Haskins
@ 2009-04-02  3:11               ` Herbert Xu
  1 sibling, 0 replies; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  3:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: anthony, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Chris Wright <chrisw@sous-sol.org> wrote:
>
>> That said, I don't think we're bound today by the fact that we're in  
>> userspace.  Rather we're bound by the interfaces we have between the  
>> host kernel and userspace to generate IO.  I'd rather fix those  
>> interfaces than put more stuff in the kernel.
> 
> And more stuff in the kernel can come at the potential cost of weakening
> protection/isolation.

Protection/isolation always comes at a cost.  Not everyone wants
to pay that, just like health insurance :) We should enable the
users to choose which model they want, based on their needs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  0:29               ` Anthony Liguori
@ 2009-04-02  3:11                 ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02  3:11 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3936 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>   I think there is a slight disconnect here.  This is *exactly* what
>> I am
>> trying to do. 
>
> If it were exactly what you were trying to do, you would have posted a
> virtio-net in-kernel backend implementation instead of a whole new
> paravirtual IO framework ;-)

semantics, semantics ;)

but ok, fair enough.

>
>>> That said, I don't think we're bound today by the fact that we're in
>>> userspace.
>>>     
>> You will *always* be bound by the fact that you are in userspace.
>
> Again, let's talk numbers.  A heavy-weight exit is 1us slower than a
> light weight exit.  Ideally, you're taking < 1 exit per packet because
> you're batching notifications.  If your ping latency on bare metal
> compared to vbus is 39us to 65us, then all other things being equal,
> the cost imposed by doing what your doing in userspace would make the
> latency be 66us taking your latency from 166% of native to 169% of
> native.  That's not a huge difference and I'm sure you'll agree there
> are a lot of opportunities to improve that even further.

Ok, so let's see it happen.  Consider the gauntlet thrown :)  Your
challenge, should you choose to accept it, is to take today's 4000us and
hit a 65us latency target while maintaining 10GE line-rate (at least
1500-MTU line-rate).

I personally don't want to even stop at 65.  I want to hit that 36us!  
In case you think that is crazy, my first prototype of venet was hitting
about 140us, and I shaved 10us here, 10us there, eventually getting down
to the 65us we have today.  The low hanging fruit is all but harvested
at this point, but I am not done searching for additional sources of
latency. I just needed to take a breather to get the code out there for
review. :)

>
> And you didn't mention whether your latency tests are based on ping or
> something more sophisticated

Well, the numbers posted were actually from netperf -t UDP_RR.  This
generates a pps figure from a continuous (but non-bursted) RTT measurement.
So I invert the pps result of this test to get the average RTT.  I
have also confirmed that ping jibes with these results (e.g. virtio-net
results were about 4ms, and venet was about 0.065ms as reported by ping).
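
(The arithmetic is trivial, but for clarity; the rates below are just
back-figured from the latencies quoted in this thread, not separate
measurements:)

/* Average RTT from a netperf UDP_RR transaction rate: */
static double rtt_usec(double transactions_per_sec)
{
        return 1e6 / transactions_per_sec;
}

/* e.g. ~15,400 txn/s -> ~65us (venet), ~250 txn/s -> ~4000us (virtio-net) */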

> as ping will be a pathological case
Ah, but this is not really pathological IMO.  There are plenty of
workloads that exhibit request-reply patterns (e.g. RPC), and this is a
direct measurement of the system's ability to support these
efficiently.  And even unidirectional flows can be hampered by poor
latency (think PTP clock sync, etc).

Massive throughput with poor latency is like Andrew Tanenbaum's
station-wagon full of backup tapes ;)  I think I have proven we can
actually get both with a little creative use of resources.

> that doesn't allow any notification batching.
Well, if we can take anything away from all this: I think I have
demonstrated that you don't need notification batching to get good
throughput.  And batching on the head-end of the queue adds directly to
your latency overhead, so I don't think it's a good technique in general
(though I realize that not everyone cares about latency, per se, so
maybe most are satisfied with the status quo).

>
>> I agree that the "does anyone care" part of the equation will approach
>> zero as the latency difference shrinks across some threshold (probably
>> the single microsecond range), but I will believe that is even possible
>> when I see it ;)
>>   
>
> Note the other hat we have to wear is not just virtualization
> developer but Linux developer.  If there are bad userspace interfaces
> for IO that impose artificial restrictions, then we need to identify
> those and fix them.

Fair enough, and I would love to take that on but alas my
development/debug bandwidth is rather finite these days ;)

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:09             ` Gregory Haskins
@ 2009-04-02  3:09             ` Herbert Xu
  2009-04-02  6:46               ` Avi Kivity
  2 siblings, 1 reply; 121+ messages in thread
From: Herbert Xu @ 2009-04-02  3:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: andi, ghaskins, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> That said, I don't think we're bound today by the fact that we're in 
> userspace.  Rather we're bound by the interfaces we have between the 
> host kernel and userspace to generate IO.  I'd rather fix those 
> interfaces than put more stuff in the kernel.

I'm sorry but I totally disagree with that.  By having our IO
infrastructure in user-space we've basically given up the main
advantage of kvm, which is that the physical drivers operate in
the same environment as the hypervisor.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  1:24     ` Rusty Russell
@ 2009-04-02  2:27       ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-02  2:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3541 bytes --]

Rusty Russell wrote:
> On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
>   
>> Rusty Russell wrote:
>>     
>>> I could dig through the code, but I'll ask directly: what heuristic do
>>> you use for notification prevention in your venet_tap driver?
>>>       
>> I am not 100% sure I know what you mean with "notification prevention",
>> but let me take a stab at it.
>>     
>
> Good stab :)
>
>   
>> I only signal back to the guest to reclaim its skbs every 10
>> packets, or if I drain the queue, whichever comes first (note to self:
>> make this # configurable).
>>     
>
> Good stab, though I was referring to guest->host signals (I'll assume
> you use a similar scheme there).
>   
Oh, actually no.  The guest->host path only uses the "bidir napi" thing
I mentioned.  So first packet hypercalls the host immediately with no
delay, schedules my host-side "rx" thread, disables subsequent
hypercalls, and returns to the guest.  If the guest tries to send
another packet before the host has drained all queued skbs (in this
case, 1), it will simply queue it to the ring with no additional
hypercalls.  Like typical napi ingress processing, the host will leave
hypercalls disabled until it finds the ring empty, so this process can
continue indefinitely until the host catches up.  Once fully drained,
the host will re-enable the hypercall channel and subsequent
transmissions will repeat the original process.

In summary, infrequent transmissions will tend to have one hypercall per
packet.  Bursty transmissions will have one hypercall per burst
(starting immediately with the first packet).  In both cases, we
minimize the latency to get the first packet "out the door".

So really the only place I am using a funky heuristic is the modulus 10
operation for tx-complete going host->guest.  The rest are kind of
standard napi event mitigation techniques.

> You use a number of packets, qemu uses a timer (150usec), lguest uses a
> variable timer (starting at 500usec, dropping by 1 every time but increasing
> by 10 every time we get fewer packets than last time).
>
> So, if the guest sends two packets and stops, you'll hang indefinitely?
>   
Shouldn't, no.  The host will send tx-complete interrupts at *max* every
10 packets, but if it drains the queue before the modulus 10 expires, it
will send a tx-complete immediately, right before it re-enables
hypercalls.  So there is no hang, and there is no delay.

For reference, here is the modulus 10 signaling
(./drivers/vbus/devices/venet-tap.c, line 584):

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l584

Here is the one that happens after the queue is fully drained (line 593)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l593

and finally, here is where I re-enable hypercalls (or system calls if
the driver is in userspace, etc)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l600
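
Boiled down to a sketch, the whole scheme looks roughly like this
(illustrative only -- the helper names below are made up for brevity; the
real logic is in the venet-tap code linked above):

/* guest side: transmit path */
static int venet_xmit(struct sk_buff *skb)
{
        ring_enqueue(txring, skb);         /* always queue to the ring */

        if (hypercall_enabled(txring)) {   /* only the first packet of */
                hypercall_disable(txring); /* a burst signals the host */
                hypercall(VENET_TX_KICK);
        }

        return 0;
}

/* host side: the "rx" thread that drains the guest's tx ring */
static void venettap_rx_thread(void)
{
        int count = 0;

        for (;;) {
                while (!ring_empty(txring)) {
                        consume(ring_dequeue(txring));
                        if (++count % 10 == 0)
                                signal_guest(TX_COMPLETE); /* modulus 10 */
                }

                if (count % 10)
                        signal_guest(TX_COMPLETE); /* drained; signal unless
                                                      we just did */

                hypercall_enable(txring);          /* re-arm guest->host */

                if (!ring_empty(txring))           /* re-check to close the */
                        continue;                  /* race with the guest's */
                                                   /* enqueue               */
                wait_for_kick();
        }
}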

> That's why we use a timer, otherwise any mitigation scheme has this issue.
>   

I'm not sure I follow.  I don't think I need a timer at all using this
scheme, but perhaps I am missing something?

Thanks Rusty!
-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 11:35   ` Gregory Haskins
@ 2009-04-02  1:24     ` Rusty Russell
  2009-04-02  2:27       ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Rusty Russell @ 2009-04-02  1:24 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
> Rusty Russell wrote:
> > I could dig through the code, but I'll ask directly: what heuristic do
> > you use for notification prevention in your venet_tap driver?
> 
> I am not 100% sure I know what you mean with "notification prevention",
> but let me take a stab at it.

Good stab :)

> I only signal back to the guest to reclaim its skbs every 10
> packets, or if I drain the queue, whichever comes first (note to self:
> make this # configurable).

Good stab, though I was referring to guest->host signals (I'll assume
you use a similar scheme there).

You use a number of packets, qemu uses a timer (150usec), lguest uses a
variable timer (starting at 500usec, dropping by 1 every time but increasing
by 10 every time we get fewer packets than last time).

So, if the guest sends two packets and stops, you'll hang indefinitely?
That's why we use a timer, otherwise any mitigation scheme has this issue.

Thanks,
Rusty.


> 
> The nice part about this scheme is it significantly reduces the amount
> of guest/host transitions, while still providing the lowest latency
> response for single packets possible.  e.g. Send one packet, and you get
> one hypercall, and one tx-complete interrupt as soon as it queues on the
> hardware.  Send 100 packets, and you get one hypercall and 10
> tx-complete interrupts as frequently as every tenth packet queues on the
> hardware.  There is no timer governing the flow, etc.
> 
> Is that what you were asking?
> 
> > As you point out, 350-450 is possible, which is still bad, and it's at least
> > partially caused by the exit to userspace and two system calls.  If virtio_net
> > had a backend in the kernel, we'd be able to compare numbers properly.
> >   
> :)
> 
> But that is the whole point, isn't it?  I created vbus specifically as a
> framework for putting things in the kernel, and that *is* one of the
> major reasons it is faster than virtio-net...it's not the difference in,
> say, IOQs vs virtio-ring (though note I also think some of the
> innovations we have added such as bi-dir napi are helping too, but these
> are not "in-kernel" specific kinds of features and could probably help
> the userspace version too).
> 
> I would be entirely happy if you guys accepted the general concept and
> framework of vbus, and then worked with me to actually convert what I
> have as "venet-tap" into essentially an in-kernel virtio-net.  I am not
> specifically interested in creating a competing pv-net driver...I just
> needed something to showcase the concepts and I didn't want to hack the
> virtio-net infrastructure to do it until I had everyone's blessing. 
> Note to maintainers: I *am* perfectly willing to maintain the venet
> drivers if, for some reason, we decide that we want to keep them as
> is.   It's just an ideal for me to collapse virtio-net and venet-tap
> together, and I suspect our community would prefer this as well.
> 
> -Greg
> 
> 

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:09             ` Gregory Haskins
@ 2009-04-02  0:29               ` Anthony Liguori
  2009-04-02  3:11                 ` Gregory Haskins
  2009-04-02  6:51               ` Avi Kivity
  1 sibling, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-04-02  0:29 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

Gregory Haskins wrote:
> Anthony Liguori wrote:
>   
> I think there is a slight disconnect here.  This is *exactly* what I am
> trying to do. 

If it were exactly what you were trying to do, you would have posted a 
virtio-net in-kernel backend implementation instead of a whole new 
paravirtual IO framework ;-)

>> That said, I don't think we're bound today by the fact that we're in
>> userspace.
>>     
> You will *always* be bound by the fact that you are in userspace.

Again, let's talk numbers.  A heavy-weight exit is 1us slower than a 
light-weight exit.  Ideally, you're taking < 1 exit per packet because 
you're batching notifications.  If your ping latency is 39us on bare 
metal compared to 65us with vbus, then all other things being equal, 
the cost imposed by doing what you're doing in userspace would make the 
latency 66us, taking your latency from 166% of native to 169% of 
native.  That's not a huge difference and I'm sure you'll agree there 
are a lot of opportunities to improve that even further.
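
Spelling that arithmetic out with the same figures (nothing new assumed
here):

    65us / 39us  ~= 1.67   (the ~166% of native quoted above)
    66us / 39us  ~= 1.69   (the ~169% of native once the extra ~1us
                            heavy-weight exit cost is added)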

And you didn't mention whether your latency tests are based on ping or 
something more sophisticated as ping will be a pathological case that 
doesn't allow any notification batching.

> I agree that the "does anyone care" part of the equation will approach
> zero as the latency difference shrinks across some threshold (probably
> the single microsecond range), but I will believe that is even possible
> when I see it ;)
>   

Note the other hat we have to wear is not just virtualization developer 
but Linux developer.  If there are bad userspace interfaces for IO that 
impose artificial restrictions, then we need to identify those and fix them.

Regards,

Anthony Liguori

> Regards,
> -Greg
>
>   


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 22:23             ` Andi Kleen
@ 2009-04-01 23:05               ` Gregory Haskins
  0 siblings, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 23:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5003 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
>   
>>> description?
>>>   
>>>       
>> Yes, good point.  I will be sure to be more explicit in the next rev.
>>
>>     
>>>   
>>>       
>>>> So the administrator can then set these attributes as
>>>> desired to manipulate the configuration of the instance of the device,
>>>> on a per device basis.
>>>>     
>>>>         
>>> How would the guest learn of any changes in there?
>>>   
>>>       
>> The only events explicitly supported by the infrastructure of this
>> nature would be device-add and device-remove.  So when an admin adds or
>> removes a device to a bus, the guest would see driver::probe() and
>> driver::remove() callbacks, respectively.  All other events are left (by
>> design) to be handled by the device ABI itself, presumably over the
>> provided shm infrastructure.
>>     
>
> Ok so you rely on a transaction model where everything is set up
> before it is somehow committed to the guest? I hope that is made
> explicit in the interface somehow.
>   
Well, it's not an explicit transaction model, but I guess you could think
of it that way.

Generally you set the device up before you launch the guest.  By the
time the guest loads and tries to scan the bus for the initial
discovery, all the devices would be ready to go.

This does bring up the question of hotswap.  Today we fully support
hotswap in and out, but leaving this "enabled" transaction to the
individual device means that the device-id would be visible in the bus
namespace before the device may want to actually communicate.  Hmmm

Perhaps I need to build this in as a more explicit "enabled"
feature...and the guest will not see the driver::probe() until this happens.

>   
>> This script creates two buses ("client-bus" and "server-bus"),
>> instantiates a single venet-tap on each of them, and then "wires" them
>> together with a private bridge instance called "vbus-br0".  To complete
>> the picture here, you would want to launch two kvms, one of each of the
>> client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.
>>
>> # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
>> # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)
>>
>> (And as noted, someday qemu will be able to do all the setup that the
>> script did, natively.  It would wire whatever tap it created to an
>> existing bridge with qemu-ifup, just like we do for tun-taps today)
>>     
>
> The usual problem with that is permissions. Just making qemu-ifup suid
> is not very nice.  It would be good if any new design addressed this.
>   

Well, it's kind of out of my control.  venet-tap ultimately creates a
simple netif interface which we must do something with.  Once it's
created, "wiring" it up to something like a linux-bridge is no different
than something like a tun-tap, so the qemu-ifup requirement doesn't change.

The one thing I can think of is it would be possible to build a
"venet-switch" module, and this could be done without using brctl or
qemu-ifup...but then I would lose all the benefits of re-using that
infrastructure.  I do not recommend we actually do this, but it would
technically be a way to address your concern.


>   
>> the current code doesn't support rw on the mac attributes yet..I need a
>> parser first).
>>     
>
> parser in kernel space always sounds scary to me.
>   
Heh..why do you think I keep procrastinating ;)

>
>   
>> Yeah, ultimately I would love to be able to support a fairly wide range
>> of the normal userspace/kernel ABI through this mechanism.  In fact, one
>> of my original design goals was to somehow expose the syscall ABI
>> directly via some kind of syscall proxy device on the bus.  I have since
>>     
>
> That sounds really scary for security. 
>
>
>   
>> backed away from that idea once I started thinking about things some
>> more and realized that a significant number of system calls are really
>> inappropriate for a guest type environment due to their ability to
>> block.   We really don't want a vcpu to block.....however, the AIO type
>>     
>
> Not only because of blocking, but also because of security issues.
> After all one of the usual reasons to run a guest is security isolation.
>   
Oh yeah, totally agreed.  Not that I am advocating this, because I have
abandoned the idea.  But back when I was thinking of this, I would have
addressed the security with the vbus and syscall-proxy-device objects
themselves.  E.g. if you don't instantiate a syscall-proxy-device on the
bus, the guest wouldn't have access to syscalls at all.   And you could
put filters into the module to limit what syscalls were allowed, which
UID to make the guest appear as, etc.

> In general the more powerful the guest API the more risky it is, so some
> self moderation is probably a good thing.
>   
:)

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:29           ` Gregory Haskins
@ 2009-04-01 22:23             ` Andi Kleen
  2009-04-01 23:05               ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2009-04-01 22:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
> > description?
> >   
> Yes, good point.  I will be sure to be more explicit in the next rev.
> 
> >   
> >> So the administrator can then set these attributes as
> >> desired to manipulate the configuration of the instance of the device,
> >> on a per device basis.
> >>     
> >
> > How would the guest learn of any changes in there?
> >   
> The only events explicitly supported by the infrastructure of this
> nature would be device-add and device-remove.  So when an admin adds or
> removes a device to a bus, the guest would see driver::probe() and
> driver::remove() callbacks, respectively.  All other events are left (by
> design) to be handled by the device ABI itself, presumably over the
> provided shm infrastructure.

Ok so you rely on a transaction model where everything is set up
before it is somehow committed to the guest? I hope that is made
explicit in the interface somehow.

> This script creates two buses ("client-bus" and "server-bus"),
> instantiates a single venet-tap on each of them, and then "wires" them
> together with a private bridge instance called "vbus-br0".  To complete
> the picture here, you would want to launch two kvms, one of each of the
> client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.
> 
> # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
> # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)
> 
> (And as noted, someday qemu will be able to do all the setup that the
> script did, natively.  It would wire whatever tap it created to an
> existing bridge with qemu-ifup, just like we do for tun-taps today)

The usual problem with that is permissions. Just making qemu-ifup suid
is not very nice.  It would be good if any new design addressed this.

> the current code doesn't support rw on the mac attributes yet..I need a
> parser first).

parser in kernel space always sounds scary to me.


> 
> Yeah, ultimately I would love to be able to support a fairly wide range
> of the normal userspace/kernel ABI through this mechanism.  In fact, one
> of my original design goals was to somehow expose the syscall ABI
> directly via some kind of syscall proxy device on the bus.  I have since

That sounds really scary for security. 


> backed away from that idea once I started thinking about things some
> more and realized that a significant number of system calls are really
> inappropriate for a guest type environment due to their ability to
> block.   We really don't want a vcpu to block.....however, the AIO type

Not only because of blocking, but also because of security issues.
After all one of the usual reasons to run a guest is security isolation.

In general the more powerful the guest API the more risky it is, so some
self moderation is probably a good thing.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:28                 ` Chris Wright
@ 2009-04-01 22:10                   ` Gregory Haskins
  2009-04-02  6:00                     ` Chris Wright
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 22:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Note that the design of vbus should prevent any weakening
>>     
>
> Could you elaborate?
>   

Absolutely.

So you said that something in the kernel could weaken the
protection/isolation.  And I fully agree that whatever we do here has to
be done carefully...more carefully than a userspace derived counterpart,
naturally.

So to address this, I put in various mechanisms to (hopefully? :) ensure
we can still maintain proper isolation, as well as protect the host,
other guests, and other applications from corruption.  Here are some of
the highlights:

*) As I mentioned, a "vbus" is a form of a kernel-resource-container. 
It is designed so that the view of a vbus is a unique namespace of
device-ids.  Each bus has its own individual namespace that consists
solely of the devices that have been placed on that bus.  The only way
to create a bus, and/or create a device on a bus, is via the
administrative interface on the host.

*) A task can only associate with, at most, one vbus at a time.  This
means that a task can only see the device-id namespace of the devices on
its associated bus and that's it.  This is enforced by the host kernel by
placing a reference to the associated vbus on the task-struct itself. 
Again, the only way to modify this association is via a host based
administrative operation.  Note that multiple tasks can associate to the
same vbus, which would commonly be used by all threads in an app, or all
vcpus in a guest, etc.

*) the asynchronous nature of the shm/ring interfaces implies we have
the potential for asynchronous faults.  E.g. "crap" in the ring might
not be discovered at the EIP of the guest vcpu when it actually inserts
the crap, but rather later when the host side tries to update the ring. 
A naive implementation would have the host do a BUG_ON() when it
discovers the discrepancy (note that I still have a few of these to fix
in the venet-tap code).  Instead, what should happen is that we utilize
an asynchronous fault mechanism that allows the guest to always be the
one punished (via something like a machine-check for guests, or SIGABRT
for userspace, etc)

*) "south-to-north path signaling robustness".  Because vbus supports a
variety of different environments, I call guest/userspace "north", and
the host/kernel "south".  When the north wants to communicate with the
kernel, its perfectly ok to stall the north indefinitely if the south is
not ready.  However, it is not really ok to stall the south when
communicating with the north because this is an attack vector.  E.g. a
malicious/broken guest could just stop servicing its ring to cause
threads in the host to jam up.  This is bad. :)  So what we do is we
design all south-to-north signaling paths to be robust against
stalling.  What they do instead is manage backpressure a little bit more
intelligently than simply blocking like they might in the guest.  For
instance, in venet-tap, a "transmit" from netif that arrives while the
south-to-north ring is full will simply result in a netif_stop_queue(),
etc.
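
To make that last example concrete, the transmit hook boils down to
something like this (condensed sketch; apart from the netif_*() calls and
the NETDEV_TX_* return codes, the names are illustrative rather than
lifted from the actual venet-tap code):

static int venettap_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct venettap *priv = netdev_priv(dev);

        if (ring_full(priv->s2n_ring)) {
                /* never stall a host thread waiting on the guest;
                 * push back on the network stack instead */
                netif_stop_queue(dev);
                return NETDEV_TX_BUSY;
        }

        ring_enqueue(priv->s2n_ring, skb);
        signal_north(priv);        /* non-blocking kick toward the guest */

        return NETDEV_TX_OK;
}

(The queue is woken back up with netif_wake_queue() once the guest makes
room in the ring again.)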

I can't think of more examples right now, but I will update this list
if/when I come up with more.  I hope that satisfactorily answered your
question, though!

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:11               ` Gregory Haskins
@ 2009-04-01 21:28                 ` Chris Wright
  2009-04-01 22:10                   ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Chris Wright @ 2009-04-01 21:28 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Anthony Liguori, Andi Kleen, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:
> Note that the design of vbus should prevent any weakening

Could you elaborate?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:40             ` Chris Wright
@ 2009-04-01 21:11               ` Gregory Haskins
  2009-04-01 21:28                 ` Chris Wright
  2009-04-02  3:11               ` Herbert Xu
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 21:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

Chris Wright wrote:
> And more stuff in the kernel can come at the potential cost of weakening
> protection/isolation.
>   
Note that the design of vbus should prevent any weakening...though if
you see a hole, please point it out.

(On that front, note that I still have some hardening to do, such as not
calling BUG_ON() in venet-tap if the ring is in a funk, etc)

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
@ 2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  0:29               ` Anthony Liguori
  2009-04-02  6:51               ` Avi Kivity
  2009-04-02  3:09             ` Herbert Xu
  2 siblings, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 21:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]

Anthony Liguori wrote:
> Andi Kleen wrote:
>> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>>  
>>>>>             
>>>> But surely you must have some specific use case in mind? Something
>>>> that it does better than the various methods that are available
>>>> today. Or rather there must be some problem you're trying
>>>> to solve. I'm just not sure what that problem exactly is.
>>>>         
>>> Performance.  We are trying to create a high performance IO
>>> infrastructure.
>>>     
>>
>> Ok. So the goal is to bypass user space qemu completely for better
>> performance. Can you please put this into the initial patch
>> description?
>>   
>
> FWIW, there's nothing that prevents in-kernel back ends with virtio so
> vbus certainly isn't required for in-kernel backends.

I think there is a slight disconnect here.  This is *exactly* what I am
trying to do.  You can of course do this many ways, and I am not denying
it could be done a different way than the path I have chosen.  One
extreme would be to just slam a virtio-net specific chunk of code
directly into kvm on the host.  Another extreme would be to build a
generic framework into Linux for declaring arbitrary IO types,
integrating it with kvm (as well as other environments such as lguest,
userspace, etc), and building a virtio-net model on top of that.

So in case it is not obvious at this point, I have gone with the latter
approach.  I wanted to make sure it wasn't kvm specific or something
like pci specific so it had the broadest applicability to a range of
environments.  So that is why the design is the way it is.  I understand
that this approach is technically "harder/more-complex" than the "slam
virtio-net into kvm" approach, but I've already done that work.  All we
need to do now is agree on the details ;)

>
>
> That said, I don't think we're bound today by the fact that we're in
> userspace.
You will *always* be bound by the fact that you are in userspace.  It's
purely a question of "how much" and "does anyone care".    Right now,
the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
do".  I have no doubt that this can and will change/improve in the
future.  But it will always be true that no matter how much userspace
improves, the kernel-based solution will always be faster.  It's simple
physics.  I'm cutting out the middleman to ultimately reach the same
destination as the userspace path, so userspace can never be equal.

I agree that the "does anyone care" part of the equation will approach
zero as the latency difference shrinks across some threshold (probably
the single microsecond range), but I will believe that is even possible
when I see it ;)

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
@ 2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:11               ` Gregory Haskins
  2009-04-02  3:11               ` Herbert Xu
  2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  3:09             ` Herbert Xu
  2 siblings, 2 replies; 121+ messages in thread
From: Chris Wright @ 2009-04-01 20:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

* Anthony Liguori (anthony@codemonkey.ws) wrote:
> Andi Kleen wrote:
>> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>>> Performance.  We are trying to create a high performance IO infrastructure.
>>
>> Ok. So the goal is to bypass user space qemu completely for better
>> performance. Can you please put this into the initial patch
>> description?
>
> FWIW, there's nothing that prevents in-kernel back ends with virtio so  
> vbus certainly isn't required for in-kernel backends.

Indeed.

> That said, I don't think we're bound today by the fact that we're in  
> userspace.  Rather we're bound by the interfaces we have between the  
> host kernel and userspace to generate IO.  I'd rather fix those  
> interfaces than put more stuff in the kernel.

And more stuff in the kernel can come at the potential cost of weakening
protection/isolation.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 17:01         ` Andi Kleen
  2009-04-01 18:45           ` Anthony Liguori
@ 2009-04-01 20:29           ` Gregory Haskins
  2009-04-01 22:23             ` Andi Kleen
  1 sibling, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 20:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5815 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>   
>>>>     
>>>>         
>>> But surely you must have some specific use case in mind? Something
>>> that it does better than the various methods that are available
>>> today. Or rather there must be some problem you're trying
>>> to solve. I'm just not sure what that problem exactly is.
>>>   
>>>       
>> Performance.  We are trying to create a high performance IO infrastructure.
>>     
>
> Ok. So the goal is to bypass user space qemu completely for better
> performance. Can you please put this into the initial patch
> description?
>   
Yes, good point.  I will be sure to be more explicit in the next rev.

>   
>> So the administrator can then set these attributes as
>> desired to manipulate the configuration of the instance of the device,
>> on a per device basis.
>>     
>
> How would the guest learn of any changes in there?
>   
The only events explicitly supported by the infrastructure of this
nature would be device-add and device-remove.  So when an admin adds or
removes a device to a bus, the guest would see driver::probe() and
driver::remove() callbacks, respectively.  All other events are left (by
design) to be handled by the device ABI itself, presumably over the
provided shm infrastructure.

So for instance, I have on my todo list to add a third shm-ring for
events in the venet ABI.   One of the event-types I would like to
support is LINK_UP and LINK_DOWN.  These events would be coupled to the
administrative manipulation of the "enabled" attribute in sysfs.  Other
event-types could be added as needed/appropriate.
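
Just to give the idea a concrete shape (purely illustrative -- this ring
does not exist yet and the names are invented on the spot), an entry in
such an event ring might be as simple as:

struct venet_event {
        __u32 type;    /* e.g. VENET_EVENT_LINKUP, VENET_EVENT_LINKDOWN */
        __u32 data;    /* event-specific payload, if any                */
};

...with the ring itself registered through the same shm infrastructure as
the rx/tx queues.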

I decided to do it this way because I felt it didn't make sense for me
to expose the attributes directly, since they are often back-end
specific anyway.   Therefore I leave it to the device-specific ABI which
has all the necessary tools for async events built in.


> I think the interesting part would be how e.g. a vnet device
> would be connected to the outside interfaces.
>   

Ah, good question.  This ties into the statement I made earlier about
how presumably the administrative agent would know what a module is and
how it works.  As part of this, they would also handle any kind of
additional work, such as wiring the backend up.  Here is a script that I
use for testing that demonstrates this:

------------------
#!/bin/bash

set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

createtap()
{
    mkdir /config/vbus/devices/$1-dev
    echo venet-tap > /config/vbus/devices/$1-dev/type
    mkdir /config/vbus/instances/$1-bus
    ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus
    echo 1 > /sys/vbus/devices/$1-dev/enabled

    ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
    ifconfig $ifname up
    brctl addif $bridge $ifname
}

createtap client
createtap server

--------------------

This script creates two buses ("client-bus" and "server-bus"),
instantiates a single venet-tap on each of them, and then "wires" them
together with a private bridge instance called "vbus-br0".  To complete
the picture here, you would want to launch two kvms, one of each of the
client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.

# (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
# (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)

(And as noted, someday qemu will be able to do all the setup that the
script did, natively.  It would wire whatever tap it created to an
existing bridge with qemu-ifup, just like we do for tun-taps today)

One of the key details is where I do "ifname=$(cat
/sys/vbus/devices/$1-dev/ifname)".  The "ifname" attribute of the
venet-tap is a read-only attribute that reports back the netif interface
name that was returned when the device did a register_netdev() (e.g.
"eth3").  This register_netdev() operation occurs as a result of echoing
the "1" into the "enabled" attribute.  Deferring the registration until
the admin explicitly does an "enable" gives the admin a chance to change
the MAC address of the virtual-adapter before it is registered (note:
the current code doesn't support rw on the mac attributes yet..I need a
parser first).


>   
>> So the admin would instantiate this "vdisk" device and do:
>>
>> 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'
>>     
>
> So it would act like a loop device? Would you reuse the loop device
> or write something new?
>   

Well, keeping in mind that I haven't even looked at writing a block
device for this infrastructure yet....my blanket statement would be
"lets reuse as much as possible" ;)  If the existing loop infrastructure
would work here, great!

> How about VFS mount name spaces?
>   

Yeah, ultimately I would love to be able to support a fairly wide range
of the normal userspace/kernel ABI through this mechanism.  In fact, one
of my original design goals was to somehow expose the syscall ABI
directly via some kind of syscall proxy device on the bus.  I have since
backed away from that idea once I started thinking about things some
more and realized that a significant number of system calls are really
inappropriate for a guest type environment due to their ability to
block.   We really don't want a vcpu to block.....however, the AIO type
system calls on the other hand, have much more promise.  ;)  TBD.

For right now I am focused more on the explicit virtual-device type
transport (disk, net, etc).  But in theory we should be able to express
a fairly broad range of services in terms of the call()/shm() interfaces.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 17:01         ` Andi Kleen
@ 2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
                               ` (2 more replies)
  2009-04-01 20:29           ` Gregory Haskins
  1 sibling, 3 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-04-01 18:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale,
	rusty, netdev, kvm

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>   
>>>>     
>>>>         
>>> But surely you must have some specific use case in mind? Something
>>> that it does better than the various methods that are available
>>> today. Or rather there must be some problem you're trying
>>> to solve. I'm just not sure what that problem exactly is.
>>>   
>>>       
>> Performance.  We are trying to create a high performance IO infrastructure.
>>     
>
> Ok. So the goal is to bypass user space qemu completely for better
> performance. Can you please put this into the initial patch
> description?
>   

FWIW, there's nothing that prevents in-kernel back ends with virtio so 
vbus certainly isn't required for in-kernel backends.

That said, I don't think we're bound today by the fact that we're in 
userspace.  Rather we're bound by the interfaces we have between the 
host kernel and userspace to generate IO.  I'd rather fix those 
interfaces than put more stuff in the kernel.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 14:19       ` Gregory Haskins
  2009-04-01 14:42         ` Gregory Haskins
@ 2009-04-01 17:01         ` Andi Kleen
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:29           ` Gregory Haskins
  1 sibling, 2 replies; 121+ messages in thread
From: Andi Kleen @ 2009-04-01 17:01 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
> >>     
> >
> > But surely you must have some specific use case in mind? Something
> > that it does better than the various methods that are available
> > today. Or rather there must be some problem you're trying
> > to solve. I'm just not sure what that problem exactly is.
> >   
> Performance.  We are trying to create a high performance IO infrastructure.

Ok. So the goal is to bypass user space qemu completely for better
performance. Can you please put this into the initial patch
description?

> So the administrator can then set these attributes as
> desired to manipulate the configuration of the instance of the device,
> on a per device basis.

How would the guest learn of any changes in there?

I think the interesting part would be how e.g. a vnet device
would be connected to the outside interfaces.

> So the admin would instantiate this "vdisk" device and do:
> 
> 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

So it would act like a loop device? Would you reuse the loop device
or write something new?

How about VFS mount name spaces?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
@ 2009-04-01 16:10   ` Anthony Liguori
  2009-04-05  3:44     ` Rusty Russell
  2009-04-02  3:15   ` Herbert Xu
  2 siblings, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-04-01 16:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Rusty Russell wrote:
> On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
>   
>> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
>> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
>> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)
>>     
>
> That rtt time is awful.  I know the notification suppression heuristic
> in qemu sucks.
>
> I could dig through the code, but I'll ask directly: what heuristic do
> you use for notification prevention in your venet_tap driver?
>
> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.
>   

I doubt the userspace exit is the problem.  On a modern system, it takes 
about 1us to do a light-weight exit and about 2us to do a heavy-weight 
exit.  A transition to userspace is only about 150ns; the bulk of the 
additional heavy-weight exit cost is from vcpu_put() within KVM.

If you were to switch to another kernel thread, and I'm pretty sure you 
have to, you're going to still see about a 2us exit cost.  Even if you 
factor in the two syscalls, we're still talking about less than .5us 
that you're saving.  Avi mentioned he had some ideas to allow in-kernel 
thread switching without taking a heavy-weight exit but suffice to say, 
we can't do that today.

You have no easy way to generate PCI interrupts in the kernel either.  
You'll most certainly have to drop down to userspace anyway for that.

I believe the real issue is that we cannot get enough information today 
from tun/tap to do proper notification prevention b/c we don't know when 
the packet processing is completed.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 14:19       ` Gregory Haskins
@ 2009-04-01 14:42         ` Gregory Haskins
  2009-04-01 17:01         ` Andi Kleen
  1 sibling, 0 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 14:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty,
	netdev, kvm, linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

Gregory Haskins wrote:
> Andi Kleen wrote:
>   
>> On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
>>   
>>     
>>> Andi Kleen wrote:
>>>     
>>>       
>>>> Gregory Haskins <ghaskins@novell.com> writes:
>>>>
>>>> What might be useful is if you could expand a bit more on what the high level
>>>> use cases for this. 
>>>>
>>>> Questions that come to mind and that would be good to answer:
>>>>
>>>> This seems to be aimed at having multiple VMs talk
>>>> to each other, but not talk to the rest of the world, correct? 
>>>> Is that a common use case? 
>>>>   
>>>>       
>>>>         
>>> Actually we didn't design specifically for either type of environment. 
>>>     
>>>       
>> But surely you must have some specific use case in mind? Something
>> that it does better than the various methods that are available
>> today. Or rather there must be some problem you're trying
>> to solve. I'm just not sure what that problem exactly is.
>>   
>>     
> Performance.  We are trying to create a high performance IO infrastructure.
>   
Actually, I should also state that I am interested in enabling some new
kinds of features based on having in-kernel devices like this.  For
instance (and this is still very theoretical and half-baked), I would
like to try to support RT guests.

[adding linux-rt-users]

I think one of the things that we need in order to do that is being able
to convey vcpu priority state information to the host in an efficient
way.  I was thinking that a shared-page per vcpu could have something
like "current" and "threshold" priorities.  The guest modifies "current"
while the host modifies "threshold".   The guest would be allowed to
increase its "current" priority without a hypercall (after all, if it's
already running it is presumably already of sufficient priority as far
as the scheduler is concerned).  But if the guest wants to drop below
"threshold", it needs to hypercall the host to give it an opportunity to
schedule() a new task (vcpu or not).
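
To give the half-baked idea at least a concrete shape (purely
illustrative; none of this exists and every name below is invented):

struct vcpu_prio_page {
        u32 curr;          /* written by the guest */
        u32 threshold;     /* written by the host  */
};

/* guest side */
static void vcpu_set_prio(struct vcpu_prio_page *p, u32 prio)
{
        u32 old = p->curr;

        p->curr = prio;

        /*
         * Raising our priority is free.  Dropping below the host's
         * threshold means the host may want to schedule() something
         * else, so give it the chance via a hypercall.
         */
        if (prio < old && prio < p->threshold)
                hypercall(VCPU_PRIO_DROPPED);
}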

The host, on the other hand, could apply a mapping so that the guest's
priority of RT1-RT99 might map to RT20-RT30 on the host, or something
like that.  We would have to take other considerations into account as
well, such as
implicit boosting on IRQ injection (e.g. the guest could be in HLT/IDLE
when an interrupt is injected...but by virtue of injecting that
interrupt we may need to boost it to (guest-relative) RT50).

Like I said, this is all half-baked right now.  My primary focus is
improving performance, but I did try to lay the groundwork for taking
things in new directions too..rt being an example.

Hope that helps!
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 13:23     ` Andi Kleen
@ 2009-04-01 14:19       ` Gregory Haskins
  2009-04-01 14:42         ` Gregory Haskins
  2009-04-01 17:01         ` Andi Kleen
  0 siblings, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 14:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5594 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
>   
>> Andi Kleen wrote:
>>     
>>> Gregory Haskins <ghaskins@novell.com> writes:
>>>
>>> What might be useful is if you could expand a bit more on what the high level
>>> use cases for this. 
>>>
>>> Questions that come to mind and that would be good to answer:
>>>
>>> This seems to be aimed at having multiple VMs talk
>>> to each other, but not talk to the rest of the world, correct? 
>>> Is that a common use case? 
>>>   
>>>       
>> Actually we didn't design specifically for either type of environment. 
>>     
>
> But surely you must have some specific use case in mind? Something
> that it does better than the various methods that are available
> today. Or rather there must be some problem you're trying
> to solve. I'm just not sure what that problem exactly is.
>   
Performance.  We are trying to create a high performance IO infrastructure.

Ideally we would like to see things like virtual-machines have
bare-metal performance (or as close as possible) using just pure
software on commodity hardware.   The data I provided shows that
something like KVM with virtio-net does a good job on throughput even on
10GE, but the latency is several orders of magnitude slower than
bare-metal.   We are addressing this issue and others like it that are a
result of the current design of out-of-kernel emulation.
>   
>> What we *are* trying to address is making an easy way to declare virtual
>> resources directly in the kernel so that they can be accessed more
>> efficiently.  Contrast that to the way its done today, where the models
>> live in, say, qemu userspace.
>>
>> So instead of having
>> guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
>> guest->host->[iptables|bridge].  How you make your private network (if
>>     
>
> So is the goal more performance or simplicity or what?
>   

(Answered above)

>   
>>> What would be the use cases for non networking devices?
>>>
>>> How would the interfaces to the user look like?
>>>   
>>>       
>> I am not sure if you are asking about the guest's perspective or the
>> host-administrator's perspective.
>>     
>
> I was wondering about the host-administrators perspective.
>   
Ah, ok.  Sorry about that.  It was probably good to document that other
thing anyway, so no harm.

So about the host-administrator interface.  The whole thing is driven by
configfs, and the basics are already covered in the documentation in
patch 2, so I won't repeat it here.  Here is a reference to the file for
everyone's convenience:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=Documentation/vbus.txt;h=e8a05dafaca2899d37bd4314fb0c7529c167ee0f;hb=f43949f7c340bf667e68af6e6a29552e62f59033

So a sufficiently privileged user can instantiate a new bus (e.g.
container) and devices on that bus via configfs operations.  The types
of devices available to instantiate are dictated by whatever vbus-device
modules you have loaded into your particular kernel.  The loaded modules
available are enumerated under /sys/vbus/deviceclass.

Now presumably the administrator knows what a particular module is and
how to configure it before instantiating it.  Once they instantiate it,
it will present an interface in sysfs with a set of attributes.  For
example, an instantiated venet-tap looks like this:

ghaskins@test:~> tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/venet-tap
    |-- client_mac
    |-- enabled
    |-- host_mac
    |-- ifname
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0


Some of these attributes, like "class" and "interfaces" are default
attributes that are filled in by the infrastructure.  Other attributes,
like "client_mac" and "enabled" are properties defined by the venet-tap
module itself.  So the administrator can then set these attributes as
desired to manipulate the configuration of the instance of the device,
on a per device basis.

So now imagine we have some kind of disk-io vbus device that is designed
to act kind of like a file-loopback device.  It might define an
attribute allowing you to specify the path to the file/block-dev that
you want it to export.

(Warning: completely fictitious "tree" output to follow ;)

ghaskins@test:~> tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/vdisk
    |-- src_path
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0

So the admin would instantiate this "vdisk" device and do:

'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

To point the device to the file on the host that it wants to present as
a vdisk.  Any guest that has access to the particular bus that contains
this device would then see it as a standard "vdisk" ABI device (as if
there where such a thing, yet) and could talk to it using a vdisk
specific driver.

A property of a vbus is that it is inherited by children.  Today, I do
not have direct support in qemu for creating/configuring vbus devices. 
Instead what I do is I set up the vbus and devices from bash, and then
launch qemu-kvm so it inherits the bus.  Someday (soon, unless you guys
start telling me this whole idea is rubbish ;) I will add support so you
could do things like "-net nic,model=venet" and that would trigger qemu
to go out and create the container/device on its own.  TBD.

I hope this helps to clarify!
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 12:03   ` Gregory Haskins
@ 2009-04-01 13:23     ` Andi Kleen
  2009-04-01 14:19       ` Gregory Haskins
  0 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2009-04-01 13:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
> Andi Kleen wrote:
> > Gregory Haskins <ghaskins@novell.com> writes:
> >
> > What might be useful is if you could expand a bit more on what the high level
> > use cases for this. 
> >
> > Questions that come to mind and that would be good to answer:
> >
> > This seems to be aimed at having multiple VMs talk
> > to each other, but not talk to the rest of the world, correct? 
> > Is that a common use case? 
> >   
> 
> Actually we didn't design specifically for either type of environment. 

But surely you must have some specific use case in mind? Something
that it does better than the various methods that are available
today. Or rather there must be some problem you're trying
to solve. I'm just not sure what that problem exactly is.

> What we *are* trying to address is making an easy way to declare virtual
> resources directly in the kernel so that they can be accessed more
> efficiently.  Contrast that to the way its done today, where the models
> live in, say, qemu userspace.
> 
> So instead of having
> guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
> guest->host->[iptables|bridge].  How you make your private network (if

So is the goal more performance or simplicity or what?

> > What would be the use cases for non networking devices?
> >
> > How would the interfaces to the user look like?
> >   
> 
> I am not sure if you are asking about the guest's perspective or the
> host-administrator's perspective.

I was wondering about the host-administrators perspective.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 20:18 ` Andi Kleen
@ 2009-04-01 12:03   ` Gregory Haskins
  2009-04-01 13:23     ` Andi Kleen
  0 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 12:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 4062 bytes --]

Andi Kleen wrote:
> Gregory Haskins <ghaskins@novell.com> writes:
>
> What might be useful is if you could expand a bit more on what the high level
> use cases for this. 
>
> Questions that come to mind and that would be good to answer:
>
> This seems to be aimed at having multiple VMs talk
> to each other, but not talk to the rest of the world, correct? 
> Is that a common use case? 
>   

Actually we didn't design specifically for either type of environment. 
I think it would, in fact, be well suited to either type of
communication model, even concurrently (e.g. an intra-vm ipc channel
resource could live right on the same bus as a virtio-net and a
virtio-disk resource)

> Wouldn't they typically have a default route  anyways and be able to talk to each 
> other this way? 
> And why can't any such isolation be done with standard firewalling? (it's known that 
> current iptables has some scalability issues, but there's work going on right
> now to fix that). 
>   
vbus itself, and even some of the higher level constructs we apply on
top of it (like venet) are at a different scope than I think what you
are getting at above.  Yes, I suppose you could create a private network
using the existing virtio-net + iptables.  But you could also do the
same using virtio-net and a private bridge device as well.  That is not
what we are trying to address.

What we *are* trying to address is making an easy way to declare virtual
resources directly in the kernel so that they can be accessed more
efficiently.  Contrast that to the way its done today, where the models
live in, say, qemu userspace.

So instead of having
guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
guest->host->[iptables|bridge].  How you make your private network (if
that is what you want to do) is orthogonal...its the path to get there
that we changed.

> What would be the use cases for non networking devices?
>
> How would the interfaces to the user look like?
>   

I am not sure if you are asking about the guest's perspective or the
host-administrator's perspective.

First let's look at the low-level device interface from the guest's
perspective.  We can cover the admin perspective in a separate doc, if
need be.

Each device in vbus supports two basic verbs: CALL, and SHM

int (*call)(struct vbus_device_proxy *dev, u32 func,
            void *data, size_t len, int flags);

int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
           void *ptr, size_t len,
           struct shm_signal_desc *sigdesc, struct shm_signal **signal,
           int flags);

CALL provides a synchronous method for invoking some verb on the device
(defined by "func") with some arbitrary data.  The namespace for "func"
is part of the ABI for the device in question.  It is analogous to an
ioctl, with the primary difference being that it's remotable (it invokes
from the guest driver across to the host device).

SHM provides a way to register shared-memory with the device which can
be used for asynchronous communication.  The memory is always owned by
the "north" (the guest), while the "south" (the host) simply maps it
into its address space.  You can optionally establish a shm_signal
object on this memory for signaling in either direction, and I
anticipate most shm regions will use this feature.  Each shm region has
an "id" namespace, which like the "func" namespace from the CALL method
is completely owned by the device ABI.  For example, we have might have
id's of "RX-RING" and "TX-RING", etc.

From there, we can (hopefully) build an arbitrary type of IO service to
map on top.  So for instance, for venet-tap, we have CALL verbs for
things like MACQUERY, and LINKUP, and we have SHM ids for RX-QUEUE and
TX-QUEUE.  We can write a driver that speaks this ABI on the bottom
edge, and presents a normal netif interface on the top edge.  So the
actual consumption of these resources can look just like any other
resource of a similar type.
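
To give a feel for the guest side, a driver probe against this interface
might look roughly like the following (sketch only: the VENET_* constants,
the dev->ops indirection, and the ring allocation helper are illustrative
stand-ins, not the literal venet ABI):

static int venet_probe(struct vbus_device_proxy *dev)
{
        struct shm_signal_desc *desc;
        struct shm_signal *rxsig;
        void *ring;
        size_t len;
        char mac[6];
        int ret;

        /* synchronous verb: ask the backend for our MAC address */
        ret = dev->ops->call(dev, VENET_FUNC_MACQUERY, mac, sizeof(mac), 0);
        if (ret < 0)
                return ret;

        /* hand the backend a guest-owned region to use as the RX-QUEUE,
         * and get back a shm_signal for async notification on it */
        ring = venet_alloc_ring(&len, &desc);   /* illustrative helper */
        ret = dev->ops->shm(dev, VENET_SHM_RXQUEUE, 0, ring, len,
                            desc, &rxsig, 0);
        if (ret < 0)
                return ret;

        /* ...then register_netdev(), wire rxsig into the rx path, etc. */
        return 0;
}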

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
@ 2009-04-01 11:35   ` Gregory Haskins
  2009-04-02  1:24     ` Rusty Russell
  2009-04-01 16:10   ` Anthony Liguori
  2009-04-02  3:15   ` Herbert Xu
  2 siblings, 1 reply; 121+ messages in thread
From: Gregory Haskins @ 2009-04-01 11:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3541 bytes --]

Rusty Russell wrote:
> On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
>   
>> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
>> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
>> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)
>>     
>
> That rtt time is awful.  I know the notification suppression heuristic
> in qemu sucks.
>
> I could dig through the code, but I'll ask directly: what heuristic do
> you use for notification prevention in your venet_tap driver?
>   

I am not 100% sure I know what you mean by "notification prevention",
but let me take a stab at it.

So like most of these kinds of constructs, I have two rings (rx + tx on
the guest is reversed to tx + rx on the host), each of which can signal
in either direction for a total of 4 events, 2 on each side of the
connection.  I utilize what I call "bidirectional napi" so that only the
first packet submitted needs to signal across the guest/host boundary. 
E.g. the first ingress packet injects an interrupt, and then does a
napi_schedule and masks future irqs.  Likewise, the first egress packet
does a hypercall, and then does a "napi_schedule" (I don't actually use
napi in this path, but it's conceptually identical) and masks future
hypercalls.  So that is my first form of what I would call notification
prevention.
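
In rough pseudo-C, the rx side of that pattern is just the standard NAPI
shape (the example_* helper names and the priv struct below are
placeholders for whatever the shm_signal masking primitives end up being):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct example_priv {
        struct napi_struct napi;
        /* ... shm rings, signals, netdev, etc ... */
};

static irqreturn_t example_rx_isr(int irq, void *data)
{
        struct example_priv *priv = data;

        example_mask_rx_signal(priv);   /* no more host->guest signals...     */
        napi_schedule(&priv->napi);     /* ...until the ring has been drained */
        return IRQ_HANDLED;
}

static int example_rx_poll(struct napi_struct *napi, int budget)
{
        struct example_priv *priv =
                container_of(napi, struct example_priv, napi);
        int work = example_drain_rx_ring(priv, budget); /* pull from shm ring */

        if (work < budget) {
                napi_complete(napi);
                example_unmask_rx_signal(priv); /* re-arm: next packet signals */
        }
        return work;
}

The egress path is the mirror image, with the hypercall being masked and
unmasked instead of the interrupt.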

The second form occurs on the "tx-complete" path (that is guest->host
tx).  I only signal back to the guest to reclaim its skbs every 10
packets, or if I drain the queue, whichever comes first (note to self:
make this # configurable).
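
On the host side that boils down to a counter; something like this
sketch (the threshold and helper names are illustrative):

#define EX_TXC_BATCH 10  /* the "every 10 packets" above; should be tunable */

static void example_reclaim_tx(struct example_tap *priv)
{
        unsigned int done = 0;

        /* placeholder ring helpers; walk the completed tx descriptors */
        while (example_tx_ring_has_completed(priv)) {
                example_tx_ring_reclaim_one(priv);
                if ((++done % EX_TXC_BATCH) == 0)
                        example_signal_guest(priv); /* tx-complete interrupt */
        }

        if (done % EX_TXC_BATCH)        /* drained with a partial batch left */
                example_signal_guest(priv);
}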

The nice part about this scheme is it significantly reduces the number
of guest/host transitions, while still providing the lowest latency
response for single packets possible.  e.g. Send one packet, and you get
one hypercall, and one tx-complete interrupt as soon as it queues on the
hardware.  Send 100 packets, and you get one hypercall and 10
tx-complete interrupts as frequently as every tenth packet queues on the
hardware.  There is no timer governing the flow, etc.

Is that what you were asking?

> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.
>   
:)

But that is the whole point, isn't it?  I created vbus specifically as a
framework for putting things in the kernel, and that *is* one of the
major reasons it is faster than virtio-net...it's not the difference in,
say, IOQs vs virtio-ring (though note I also think some of the
innovations we have added such as bi-dir napi are helping too, but these
are not "in-kernel" specific kinds of features and could probably help
the userspace version too).

I would be entirely happy if you guys accepted the general concept and
framework of vbus, and then worked with me to actually convert what I
have as "venet-tap" into essentially an in-kernel virtio-net.  I am not
specifically interested in creating a competing pv-net driver...I just
needed something to showcase the concepts and I didn't want to hack the
virtio-net infrastructure to do it until I had everyone's blessing. 
Note to maintainers: I *am* perfectly willing to maintain the venet
drivers if, for some reason, we decide that we want to keep them as
is.   My ideal would be to collapse virtio-net and venet-tap
together, and I suspect our community would prefer this as well.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 18:42 Gregory Haskins
  2009-03-31 20:18 ` Andi Kleen
@ 2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
                     ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Rusty Russell @ 2009-04-01  6:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)

That rtt time is awful.  I know the notification suppression heuristic
in qemu sucks.

I could dig through the code, but I'll ask directly: what heuristic do
you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least
partially caused by the exit to userspace and two system calls.  If virtio_net
had a backend in the kernel, we'd be able to compare numbers properly.

> Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
> Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
> Venet: tput = 5802Mb/s, round-trip = 15127 (66us rtt)
> 
> Note that even the throughput was slightly better in this test for venet, though
> neither venet nor virtio-net could achieve line-rate.  I suspect some tuning may
> allow these numbers to improve, TBD.

At some point, the copying will hurt you.  This is fairly easy to avoid on
xmit tho.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 18:42 Gregory Haskins
@ 2009-03-31 20:18 ` Andi Kleen
  2009-04-01 12:03   ` Gregory Haskins
  2009-04-01  6:08 ` Rusty Russell
  1 sibling, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2009-03-31 20:18 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins <ghaskins@novell.com> writes:

What might be useful is if you could expand a bit more on what the
high-level use cases for this are.

Questions that come to mind and that would be good to answer:

This seems to be aimed at having multiple VMs talk
to each other, but not talk to the rest of the world, correct? 
Is that a common use case? 

Wouldn't they typically have a default route anyway and be able to talk to each
other this way? 
And why can't any such isolation be done with standard firewalling? (it's known that 
current iptables has some scalability issues, but there's work going on right
now to fix that). 

What would be the use cases for non-networking devices?

What would the interfaces to the user look like?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [RFC PATCH 00/17] virtual-bus
@ 2009-03-31 18:42 Gregory Haskins
  2009-03-31 20:18 ` Andi Kleen
  2009-04-01  6:08 ` Rusty Russell
  0 siblings, 2 replies; 121+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

applies to v2.6.29 (will port to git HEAD soon)

FIRST OFF: Let me state that this is not a KVM or networking specific
technology.  Virtual-Bus is a mechanism for defining and deploying
software “devices” directly in a Linux kernel.  The example use-case we
have provided supports a “virtual-ethernet” device being utilized in a
KVM guest environment, so comparisons to virtio-net will be natural.
However, please note that this is but one use-case, of many we have
planned for the future (such as userspace bypass and RT guest support).
The goal for right now is to describe what a virtual-bus is and why we
believe it is useful.

We intend to get this core technology merged, even if the networking
components are not accepted as is.  It should be noted that, in many ways,
virtio could be considered complementary to the technology.  We could,
in fact, have implemented the virtual-ethernet using a virtio-ring, but
it would have required ABI changes that we didn't yet want to propose
without having the concept in general vetted and accepted by the community.

To cut to the chase, we recently measured our virtual-ethernet on 
v2.6.29 on two 8-core x86_64 boxes with Chelsio T3 10GE connected back
to back via cross over.  We measured bare-metal performance, as well
as a kvm guest (running the same kernel) connected to the T3 via
a linux-bridge+tap configuration with a 1500 MTU.  The results are as
follows:

Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)

As you can see, all three technologies can achieve (MTU-limited) line-rate,
but the virtio-net solution is severely limited on the latency front (by a
factor of 48:1).

Note that the 320pps is artificially low in virtio-net, caused by a
known design limitation of using a timer for tx-mitigation.  However, note that
even when removing the timer from the path, the best we could achieve was
350us-450us of latency, and doing so causes the tput to drop to 1300Mb/s.
So even in this case, I think the in-kernel results present a compelling
argument for the new model.

When we jump to a 9000-byte MTU, the situation looks similar:

Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet: tput = 5802Mb/s, round-trip = 15127 (66us rtt)


Note that even the throughput was slightly better in this test for venet, though
neither venet nor virtio-net could achieve line-rate.  I suspect some tuning may
allow these numbers to improve, TBD.

So with that said, let's jump into the description:

Virtual-Bus: What is it?
--------------------

Virtual-Bus is a kernel-based IO resource container technology.  It is modeled
on a concept similar to the Linux Device-Model (LDM), where we have buses,
devices, and drivers as the primary actors.  However, VBUS has several
distinctions when contrasted with LDM:

  1) "Busses" in LDM are relatively static and global to the kernel (e.g.
     "PCI", "USB", etc).  VBUS buses are arbitrarily created and destroyed
     dynamically, and are not globally visible.  Instead they are defined as
     visible only to a specific subset of the system (the contained context).
  2) "Devices" in LDM are typically tangible physical (or sometimes logical)
     devices.  VBUS devices are purely software abstractions (which may or
     may not have one or more physical devices behind them).  Devices may
     also be arbitrarily created or destroyed by software/administrative action
     as opposed to by a hardware discovery mechanism.
  3) "Drivers" in LDM sit within the same kernel context as the busses and
     devices they interact with.  VBUS drivers live in a foreign
     context (such as userspace, or a virtual-machine guest).

The idea is that a vbus is created to contain access to some IO services.
Virtual devices are then instantiated and linked to a bus to grant access to
drivers actively present on the bus.  Drivers will only have visibility to
devices present on their respective bus, and nothing else.

Virtual devices are defined by modules which register a deviceclass with the
system.  A deviceclass simply represents a type of device that _may_ be
instantiated into a device, should an administrator wish to do so.  Once
this has happened, the device may be associated with one or more buses where
it will become visible to all clients of those respective buses.
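
In rough pseudo-code, a module providing such a device type does
something along these lines (the names below only convey the shape of
the interface, not necessarily the exact API in this series):

#include <linux/module.h>

static struct vbus_device *example_tap_create(struct vbus_devclass *dc)
{
        /*
         * Invoked when an administrator instantiates a device of this
         * class (software/administrative action, not hardware discovery).
         */
        return example_tap_device_alloc();
}

static struct vbus_devclass_ops example_tap_devclass_ops = {
        .create = example_tap_create,
};

static struct vbus_devclass example_tap_devclass = {
        .name = "example-tap",
        .ops  = &example_tap_devclass_ops,
};

static int __init example_tap_init(void)
{
        /* make the type available; instances and bus links come later */
        return vbus_devclass_register(&example_tap_devclass);
}
module_init(example_tap_init);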

Why do we need this?
----------------------

There are various reasons why such a construct may be useful.  One of the
most interesting use cases is for virtualization, such as KVM.  Hypervisors
today provide virtualized IO resources to a guest, but this is often at a cost
in both latency and throughput compared to bare metal performance.  Utilizing
para-virtual resources instead of emulated devices helps to mitigate this
penalty, but even these techniques to date have not fully realized the
potential of the underlying bare-metal hardware.

Some of the performance differential is unavoidable just given the extra
processing that occurs due to the deeper stack (guest+host).  However, some of
this overhead is a direct result of the rather indirect path most hypervisors
use to route IO.  For instance, KVM uses PIO faults from the guest to trigger
a guest->host-kernel->host-userspace->host-kernel sequence of events.
Contrast this to a typical userspace application on the host which must only
traverse app->kernel for most IO.

The fact is that the Linux kernel is already great at managing access to IO
resources.  Therefore, if you have a hypervisor that is based on the Linux
kernel, is there some way that we can allow the hypervisor to manage IO
directly instead of forcing this convoluted path?

The short answer is: "not yet" ;)

In order to use such a concept, we need some new facilities.  For one, we
need to be able to define containers with their corresponding access-control so
that guests do not have unmitigated access to anything they wish.  Second,
we also need to define forms of memory access that are uniform in the face
of various clients (e.g. "copy_to_user()" cannot be assumed to work for, say,
a KVM vcpu context).  Lastly, we need to provide access to these resources in
a way that makes sense for the application, such as asynchronous communication
paths and minimizing context switches.
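
For the second point, picture a small ops-table that abstracts "copy
bytes to/from the client" so that the same device code works whether the
client is a host userspace task or a KVM vcpu.  In pseudo-code (names
are illustrative):

/* an abstraction over "copy to/from the client's memory" */
struct example_memctx;

struct example_memctx_ops {
        unsigned long (*copy_to)(struct example_memctx *ctx, void *dst,
                                 const void *src, unsigned long len);
        unsigned long (*copy_from)(struct example_memctx *ctx, void *dst,
                                   const void *src, unsigned long len);
        void (*release)(struct example_memctx *ctx);
};

struct example_memctx {
        const struct example_memctx_ops *ops;
};

A process-context connection can back copy_to/copy_from with
copy_to_user()/copy_from_user(), while a KVM connection backs them with
a guest-address translation instead.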

So we introduce VBUS as a framework to provide such facilities.  The net
result is a *substantial* reduction in IO overhead, even when compared to
state of the art para-virtualization techniques (such as virtio-net).

For more details, please visit our wiki at:

http://developer.novell.com/wiki/index.php/Virtual-bus

Regards,
-Greg

---

Gregory Haskins (17):
      kvm: Add guest-side support for VBUS
      kvm: Add VBUS support to the host
      kvm: add dynamic IRQ support
      kvm: add a reset capability
      x86: allow the irq->vector translation to be determined outside of ioapic
      venettap: add scatter-gather support
      venet: add scatter-gather support
      venet-tap: Adds a "venet" compatible "tap" device to VBUS
      net: Add vbus_enet driver
      venet: add the ABI definitions for an 802.x packet interface
      ioq: add vbus helpers
      ioq: Add basic definitions for a shared-memory, lockless queue
      vbus: add a "vbus-proxy" bus model for vbus_driver objects
      vbus: add bus-registration notifiers
      vbus: add connection-client helper infrastructure
      vbus: add virtual-bus definitions
      shm-signal: shared-memory signals


 Documentation/vbus.txt           |  386 +++++++++
 arch/x86/Kconfig                 |   16 
 arch/x86/Makefile                |    3 
 arch/x86/include/asm/irq.h       |    6 
 arch/x86/include/asm/kvm_host.h  |    9 
 arch/x86/include/asm/kvm_para.h  |   12 
 arch/x86/kernel/io_apic.c        |   25 +
 arch/x86/kvm/Kconfig             |    9 
 arch/x86/kvm/Makefile            |    6 
 arch/x86/kvm/dynirq.c            |  329 ++++++++
 arch/x86/kvm/guest/Makefile      |    2 
 arch/x86/kvm/guest/dynirq.c      |   95 ++
 arch/x86/kvm/x86.c               |   13 
 arch/x86/kvm/x86.h               |   12 
 drivers/Makefile                 |    2 
 drivers/net/Kconfig              |   13 
 drivers/net/Makefile             |    1 
 drivers/net/vbus-enet.c          |  933 ++++++++++++++++++++++
 drivers/vbus/devices/Kconfig     |   17 
 drivers/vbus/devices/Makefile    |    1 
 drivers/vbus/devices/venet-tap.c | 1587 ++++++++++++++++++++++++++++++++++++++
 drivers/vbus/proxy/Makefile      |    2 
 drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
 fs/proc/base.c                   |   96 ++
 include/linux/ioq.h              |  410 ++++++++++
 include/linux/kvm.h              |    4 
 include/linux/kvm_guest.h        |    7 
 include/linux/kvm_host.h         |   27 +
 include/linux/kvm_para.h         |   60 +
 include/linux/sched.h            |    4 
 include/linux/shm_signal.h       |  188 +++++
 include/linux/vbus.h             |  162 ++++
 include/linux/vbus_client.h      |  115 +++
 include/linux/vbus_device.h      |  423 ++++++++++
 include/linux/vbus_driver.h      |   80 ++
 include/linux/venet.h            |   82 ++
 kernel/Makefile                  |    1 
 kernel/exit.c                    |    2 
 kernel/fork.c                    |    2 
 kernel/vbus/Kconfig              |   38 +
 kernel/vbus/Makefile             |    6 
 kernel/vbus/attribute.c          |   52 +
 kernel/vbus/client.c             |  527 +++++++++++++
 kernel/vbus/config.c             |  275 +++++++
 kernel/vbus/core.c               |  626 +++++++++++++++
 kernel/vbus/devclass.c           |  124 +++
 kernel/vbus/map.c                |   72 ++
 kernel/vbus/map.h                |   41 +
 kernel/vbus/proxy.c              |  216 +++++
 kernel/vbus/shm-ioq.c            |   89 ++
 kernel/vbus/vbus.h               |  117 +++
 lib/Kconfig                      |   22 +
 lib/Makefile                     |    2 
 lib/ioq.c                        |  298 +++++++
 lib/shm_signal.c                 |  186 ++++
 virt/kvm/kvm_main.c              |   37 +
 virt/kvm/vbus.c                  | 1307 +++++++++++++++++++++++++++++++
 57 files changed, 9902 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/vbus.txt
 create mode 100644 arch/x86/kvm/dynirq.c
 create mode 100644 arch/x86/kvm/guest/Makefile
 create mode 100644 arch/x86/kvm/guest/dynirq.c
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 drivers/vbus/devices/Kconfig
 create mode 100644 drivers/vbus/devices/Makefile
 create mode 100644 drivers/vbus/devices/venet-tap.c
 create mode 100644 drivers/vbus/proxy/Makefile
 create mode 100644 drivers/vbus/proxy/kvm.c
 create mode 100644 include/linux/ioq.h
 create mode 100644 include/linux/kvm_guest.h
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 include/linux/vbus.h
 create mode 100644 include/linux/vbus_client.h
 create mode 100644 include/linux/vbus_device.h
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 include/linux/venet.h
 create mode 100644 kernel/vbus/Kconfig
 create mode 100644 kernel/vbus/Makefile
 create mode 100644 kernel/vbus/attribute.c
 create mode 100644 kernel/vbus/client.c
 create mode 100644 kernel/vbus/config.c
 create mode 100644 kernel/vbus/core.c
 create mode 100644 kernel/vbus/devclass.c
 create mode 100644 kernel/vbus/map.c
 create mode 100644 kernel/vbus/map.h
 create mode 100644 kernel/vbus/proxy.c
 create mode 100644 kernel/vbus/shm-ioq.c
 create mode 100644 kernel/vbus/vbus.h
 create mode 100644 lib/ioq.c
 create mode 100644 lib/shm_signal.c
 create mode 100644 virt/kvm/vbus.c

-- 
Signature

^ permalink raw reply	[flat|nested] 121+ messages in thread

end of thread, other threads:[~2009-04-05 16:45 UTC | newest]

Thread overview: 121+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <49D469D2020000A100045FA1@lucius.provo.novell.com>
2009-04-02 14:14 ` [RFC PATCH 00/17] virtual-bus Patrick Mullaney
2009-04-02 14:27   ` Avi Kivity
2009-04-02 15:31     ` Gregory Haskins
2009-04-02 15:49       ` Avi Kivity
2009-04-02 16:06         ` Herbert Xu
2009-04-02 16:51           ` Avi Kivity
2009-04-02 17:44         ` Gregory Haskins
2009-04-03 11:43           ` Avi Kivity
2009-04-03 14:58             ` Gregory Haskins
2009-04-03 15:37               ` Avi Kivity
2009-04-03 18:19                 ` Gregory Haskins
2009-04-05 10:50                   ` Avi Kivity
2009-04-03 17:09               ` Chris Wright
2009-04-03 18:32                 ` Gregory Haskins
2009-03-31 18:42 Gregory Haskins
2009-03-31 20:18 ` Andi Kleen
2009-04-01 12:03   ` Gregory Haskins
2009-04-01 13:23     ` Andi Kleen
2009-04-01 14:19       ` Gregory Haskins
2009-04-01 14:42         ` Gregory Haskins
2009-04-01 17:01         ` Andi Kleen
2009-04-01 18:45           ` Anthony Liguori
2009-04-01 20:40             ` Chris Wright
2009-04-01 21:11               ` Gregory Haskins
2009-04-01 21:28                 ` Chris Wright
2009-04-01 22:10                   ` Gregory Haskins
2009-04-02  6:00                     ` Chris Wright
2009-04-02  3:11               ` Herbert Xu
2009-04-01 21:09             ` Gregory Haskins
2009-04-02  0:29               ` Anthony Liguori
2009-04-02  3:11                 ` Gregory Haskins
2009-04-02  6:51               ` Avi Kivity
2009-04-02  8:52                 ` Herbert Xu
2009-04-02  9:02                   ` Avi Kivity
2009-04-02  9:16                     ` Herbert Xu
2009-04-02  9:27                       ` Avi Kivity
2009-04-02  9:29                         ` Herbert Xu
2009-04-02  9:33                           ` Herbert Xu
2009-04-02  9:38                           ` Avi Kivity
2009-04-02  9:41                             ` Herbert Xu
2009-04-02  9:43                               ` Avi Kivity
2009-04-02  9:44                                 ` Herbert Xu
2009-04-02 11:06                             ` Gregory Haskins
2009-04-02 11:59                               ` Avi Kivity
2009-04-02 12:30                                 ` Gregory Haskins
2009-04-02 12:43                                   ` Avi Kivity
2009-04-02 13:03                                     ` Gregory Haskins
2009-04-02 12:13                               ` Rusty Russell
2009-04-02 12:50                                 ` Gregory Haskins
2009-04-02 12:52                                   ` Gregory Haskins
2009-04-02 13:07                                   ` Avi Kivity
2009-04-02 13:22                                     ` Gregory Haskins
2009-04-02 13:27                                       ` Avi Kivity
2009-04-02 14:05                                         ` Gregory Haskins
2009-04-02 14:50                                     ` Herbert Xu
2009-04-02 15:00                                       ` Avi Kivity
2009-04-02 15:40                                         ` Herbert Xu
2009-04-02 15:57                                           ` Avi Kivity
2009-04-02 16:09                                             ` Herbert Xu
2009-04-02 16:54                                               ` Avi Kivity
2009-04-02 17:06                                                 ` Herbert Xu
2009-04-02 17:17                                                   ` Herbert Xu
2009-04-03 12:25                                                   ` Avi Kivity
2009-04-02 15:10                                 ` Michael S. Tsirkin
2009-04-03  4:43                                   ` Jeremy Fitzhardinge
2009-04-02 10:55                     ` Gregory Haskins
2009-04-02 11:48                       ` Avi Kivity
2009-04-03 10:58                     ` Gerd Hoffmann
2009-04-03 11:03                       ` Avi Kivity
2009-04-03 11:12                         ` Herbert Xu
2009-04-03 11:46                           ` Avi Kivity
2009-04-03 11:48                             ` Herbert Xu
2009-04-03 11:54                               ` Avi Kivity
2009-04-03 11:55                                 ` Herbert Xu
2009-04-03 12:02                                   ` Avi Kivity
2009-04-03 13:05                                     ` Herbert Xu
2009-04-03 11:18                       ` Andi Kleen
2009-04-03 11:34                         ` Herbert Xu
2009-04-03 11:46                         ` Avi Kivity
2009-04-03 11:28                       ` Gregory Haskins
2009-04-02 10:46                 ` Gregory Haskins
2009-04-02 11:43                   ` Avi Kivity
2009-04-02 12:22                     ` Gregory Haskins
2009-04-02 12:42                       ` Avi Kivity
2009-04-02 12:54                         ` Gregory Haskins
2009-04-02 13:08                           ` Avi Kivity
2009-04-02 13:36                             ` Gregory Haskins
2009-04-02 13:45                               ` Avi Kivity
2009-04-02 14:24                                 ` Gregory Haskins
2009-04-02 14:32                                   ` Avi Kivity
2009-04-02 14:41                                     ` Avi Kivity
2009-04-02 14:49                                       ` Anthony Liguori
2009-04-02 16:09                                         ` Anthony Liguori
2009-04-02 16:19                                           ` Avi Kivity
2009-04-02 18:18                                             ` Anthony Liguori
2009-04-03  1:11                                               ` Herbert Xu
2009-04-03 12:03                                           ` Gregory Haskins
2009-04-03 12:15                                             ` Avi Kivity
2009-04-03 13:13                                               ` Gregory Haskins
2009-04-03 13:37                                                 ` Avi Kivity
2009-04-03 16:28                                                   ` Gregory Haskins
2009-04-05 10:00                                                     ` Avi Kivity
2009-04-02  3:09             ` Herbert Xu
2009-04-02  6:46               ` Avi Kivity
2009-04-02  8:54                 ` Herbert Xu
2009-04-02  9:03                   ` Avi Kivity
2009-04-02  9:05                     ` Herbert Xu
2009-04-01 20:29           ` Gregory Haskins
2009-04-01 22:23             ` Andi Kleen
2009-04-01 23:05               ` Gregory Haskins
2009-04-01  6:08 ` Rusty Russell
2009-04-01 11:35   ` Gregory Haskins
2009-04-02  1:24     ` Rusty Russell
2009-04-02  2:27       ` Gregory Haskins
2009-04-01 16:10   ` Anthony Liguori
2009-04-05  3:44     ` Rusty Russell
2009-04-05  8:06       ` Avi Kivity
2009-04-05 14:13       ` Anthony Liguori
2009-04-05 16:10         ` Avi Kivity
2009-04-05 16:45           ` Anthony Liguori
2009-04-02  3:15   ` Herbert Xu
