Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>
>>> But surely you must have some specific use case in mind? Something
>>> that it does better than the various methods that are available
>>> today. Or rather there must be some problem you're trying
>>> to solve. I'm just not sure what that problem exactly is.
>>>
>> Performance.  We are trying to create a high performance IO infrastructure.
>>
> Ok. So the goal is to bypass user space qemu completely for better
> performance. Can you please put this into the initial patch
> description?

Yes, good point.  I will be sure to be more explicit in the next rev.

>> So the administrator can then set these attributes as
>> desired to manipulate the configuration of the instance of the device,
>> on a per device basis.
>>
> How would the guest learn of any changes in there?

The only events explicitly supported by the infrastructure of this
nature would be device-add and device-remove.  So when an admin adds or
removes a device on a bus, the guest would see driver::probe() and
driver::remove() callbacks, respectively.  All other events are left
(by design) to be handled by the device ABI itself, presumably over the
provided shm infrastructure.

So, for instance, I have on my todo list to add a third shm-ring for
events in the venet ABI.  One of the event-types I would like to
support is LINK_UP and LINK_DOWN.  These events would be coupled to the
administrative manipulation of the "enabled" attribute in sysfs.  Other
event-types could be added as needed/appropriate.

I decided to do it this way because I felt it didn't make sense for me
to expose the attributes directly, since they are often back-end
specific anyway.  Therefore I leave it to the device-specific ABI,
which has all the necessary tools for async events built in.

> I think the interesting part would be how e.g. a vnet device
> would be connected to the outside interfaces.

Ah, good question.  This ties into the statement I made earlier about
how, presumably, the administrative agent would know what a module is
and how it works.  As part of this, it would also handle any additional
work, such as wiring the back-end up.  Here is a script that I use for
testing that demonstrates this:

------------------
#!/bin/bash

set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

createtap()
{
    mkdir /config/vbus/devices/$1-dev
    echo venet-tap > /config/vbus/devices/$1-dev/type

    mkdir /config/vbus/instances/$1-bus
    ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus

    echo 1 > /sys/vbus/devices/$1-dev/enabled

    ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
    ifconfig $ifname up

    brctl addif $bridge $ifname
}

createtap client
createtap server
--------------------

This script creates two buses ("client-bus" and "server-bus"),
instantiates a single venet-tap on each of them, and then "wires" them
together with a private bridge instance called "vbus-br0".  To complete
the picture, you would want to launch two kvms, one on each of the
client-bus/server-bus instances.  You can do this via /proc/$pid/vbus,
e.g.:

# (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
# (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)

(And as noted, someday qemu will be able to do all the setup that the
script did, natively.  It would wire whatever tap it created to an
existing bridge with qemu-ifup, just like we do for tun-taps today.)
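In the meantime, connecting a venet-tap to the outside world rather
than to a private test bridge is just more of the same bridging.  A
rough sketch (the "eth0"/"br0" names and the single "extern" device are
assumptions; in a real setup the host's IP configuration would also
need to move onto the bridge):

------------------
#!/bin/bash
# Illustrative only: attach one venet-tap to a bridge that also
# contains the host's physical NIC, so the guest can reach the
# external network.
set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=br0

brctl addbr $bridge
brctl setfd $bridge 0
brctl addif $bridge eth0        # uplink to the physical network
ifconfig $bridge up

mkdir /config/vbus/devices/extern-dev
echo venet-tap > /config/vbus/devices/extern-dev/type

mkdir /config/vbus/instances/extern-bus
ln -s /config/vbus/devices/extern-dev /config/vbus/instances/extern-bus

echo 1 > /sys/vbus/devices/extern-dev/enabled

ifname=$(cat /sys/vbus/devices/extern-dev/ifname)
ifconfig $ifname up

brctl addif $bridge $ifname
--------------------

A guest would then be associated with "extern-bus" exactly as shown
above for client-bus/server-bus.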
One of the key details is where I do
"ifname=$(cat /sys/vbus/devices/$1-dev/ifname)".  The "ifname"
attribute of the venet-tap is a read-only attribute that reports back
the netif interface name that was returned when the device did a
register_netdev() (e.g. "eth3").  This register_netdev() operation
occurs as a result of echoing the "1" into the "enabled" attribute.
Deferring the registration until the admin explicitly does an "enable"
gives the admin a chance to change the MAC address of the
virtual-adapter before it is registered (note: the current code doesn't
support rw on the mac attributes yet... I need a parser first).

>> So the admin would instantiate this "vdisk" device and do:
>>
>> 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'
>>
> So it would act like a loop device? Would you reuse the loop device
> or write something new?

Well, keeping in mind that I haven't even looked at writing a block
device for this infrastructure yet... my blanket statement would be
"let's reuse as much as possible" ;)  If the existing loop
infrastructure would work here, great!

> How about VFS mount name spaces?

Yeah, ultimately I would love to be able to support a fairly wide range
of the normal userspace/kernel ABI through this mechanism.  In fact,
one of my original design goals was to somehow expose the syscall ABI
directly via some kind of syscall-proxy device on the bus.  I have
since backed away from that idea once I started thinking about things
some more and realized that a significant number of system calls are
really inappropriate for a guest-type environment due to their ability
to block.  We really don't want a vcpu to block... the AIO-type system
calls, on the other hand, have much more promise. ;)  TBD.

For right now I am focused more on the explicit virtual-device type
transport (disk, net, etc.), but in theory we should be able to express
a fairly broad range of services in terms of the call()/shm()
interfaces.

-Greg
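P.S.  Just to make the "vdisk" example above a bit more concrete:
nothing exists there yet, but I would expect the admin-side flow to
mirror the venet-tap recipe (the "vdisk" type and the "foo"/"disk-bus"
names are purely hypothetical):

------------------
# hypothetical vdisk flow; mirrors the venet-tap setup above
mkdir /config/vbus/devices/foo
echo vdisk > /config/vbus/devices/foo/type

mkdir /config/vbus/instances/disk-bus
ln -s /config/vbus/devices/foo /config/vbus/instances/disk-bus

# point the device at its backing file, then enable it
echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path
echo 1 > /sys/vbus/devices/foo/enabled
--------------------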