* [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O
@ 2013-05-26  9:32 Luke Gorrie
  2013-05-27  9:34 ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Luke Gorrie @ 2013-05-26  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: snabb-devel, stefanha, mst

[-- Attachment #1: Type: text/plain, Size: 2119 bytes --]

Dear qemu-devel hackers,

I am writing to ask for some technical advice. I am making embarrassingly
slow progress on finding a good way to integrate the Snabb Switch
user-space ethernet switch (http://snabb.co/snabbswitch/) with QEMU for
efficient ethernet I/O.

Stefan put us onto the highly promising track of vhost/virtio. We have
implemented this between Snabb Switch and the Linux kernel, but not
directly between Snabb Switch and QEMU guests. The "roadblock" we have hit
is embarrassingly basic: QEMU is using user-to-kernel system calls to set up
vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found
a good way to map these towards Snabb Switch instead of the kernel.

We have several ideas on the table and we would love some technical
feedback on what sounds like a good way forward -- ideally a better
alternative that we haven't thought of at all, not being QEMU gurus
ourselves. Here are the ideas on the table right now:

1. Use FUSE to implement a complete clone of /dev/net/tun and
   /dev/vhost-net inside Snabb Switch. Implement every ioctl() that QEMU
   requires.

2. Extend QEMU to support a user-user IPC mode of vhost. In this mode QEMU
   would not use ioctl() etc. but some other system calls that are
   appropriate for IPC between user-space processes.

3. Use the kernel as a middle-man. Create a double-ended "veth" interface
   and have Snabb Switch and QEMU each open a PF_PACKET socket and
   accelerate it with VHOST_NET.

#1 is appealing _if_ it can really be done. The risk is that we hit a
roadblock when implementing the behavior of the ioctl()s, for example
having trouble mapping guest memory or getting hold of an eventfd, and that
FUSE is rather heavyweight.

#2 is appealing _if_ it can be done in a nice way. I don't know which
system calls would be appropriate and I don't know how to write the code
inside QEMU in a neat way.

#3 is appealing _if_ there is no significant overhead, e.g. an extra memory
copy inside the kernel, or if it's really quick to do as a temporary
stop-gap.

We would love some words of wisdom about the options above and/or a new
idea :-)

Cheers,
-Luke

[-- Attachment #2: Type: text/html, Size: 2602 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O
  2013-05-26  9:32 [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O Luke Gorrie
@ 2013-05-27  9:34 ` Stefan Hajnoczi
  2013-05-27 15:18   ` Michael S. Tsirkin
  ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-05-27  9:34 UTC (permalink / raw)
  To: Luke Gorrie; +Cc: snabb-devel, qemu-devel, mst

On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote:
> Stefan put us onto the highly promising track of vhost/virtio. We have
> implemented this between Snabb Switch and the Linux kernel, but not
> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit
> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup
> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found
> a good way to map these towards Snabb Switch instead of the kernel.

vhost_net is about connecting a virtio-net speaking process to a
tun-like device.  The problem you are trying to solve is connecting a
virtio-net speaking process to Snabb Switch.

Either you need to replace vhost or you need a tun-like device
interface.

Replacing vhost would mean that your switch implements virtio-net,
shares guest RAM with the guest, and shares the ioeventfd and irqfd
which are used to signal with the guest.  At that point your switch is
similar to the virtio-net data plane work that Ping Fan Liu is working
on, but your switch is in a separate process rather than a thread.

How does your switch talk to hardware?  If you have userspace NIC
drivers that bypass the Linux network stack then the approach I
mentioned fits well.

If you are using the Linux network stack then it might be better to
integrate with vhost, maybe as a tun-like device driver.

Stefan

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 9:34 ` Stefan Hajnoczi @ 2013-05-27 15:18 ` Michael S. Tsirkin 2013-05-27 15:43 ` Paolo Bonzini 2013-05-28 10:10 ` Luke Gorrie 2 siblings, 0 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-27 15:18 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Luke Gorrie, snabb-devel, qemu-devel On Mon, May 27, 2013 at 11:34:09AM +0200, Stefan Hajnoczi wrote: > On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: > > Stefan put us onto the highly promising track of vhost/virtio. We have > > implemented this between Snabb Switch and the Linux kernel, but not > > directly between Snabb Switch and QEMU guests. The "roadblock" we have hit > > is embarrasingly basic: QEMU is using user-to-kernel system calls to setup > > vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found > > a good way to map these towards Snabb Switch instead of the kernel. > > vhost_net is about connecting the a virtio-net speaking process to a > tun-like device. The problem you are trying to solve is connecting a > virtio-net speaking process to Snabb Switch. > > Either you need to replace vhost or you need a tun-like device > interface. > > Replacing vhost would mean that your switch implements virtio-net, > shares guest RAM with the guest, and shares the ioeventfd and irqfd > which are used to signal with the guest. At that point your switch is > similar to the virtio-net data plane work that Ping Fan Liu is working > on but your switch is in a separate process rather than a thread. > > How does your switch talk to hardware? Yes that's my question as well. > If you have userspace NIC > drivers that bypass the Linux network stack then the approach I > mentioned fits well. > > If you are using the Linux network stack then it might be better to > integrate with vhost maybe as a tun-like device driver. > > Stefan Maybe you should bind macvtap passthrough mode to veth. 
The packet socket backend doesn't support TSO at this point, so it's slower.

-- 
MST

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 9:34 ` Stefan Hajnoczi 2013-05-27 15:18 ` Michael S. Tsirkin @ 2013-05-27 15:43 ` Paolo Bonzini 2013-05-27 16:18 ` Anthony Liguori 2013-05-28 10:10 ` Luke Gorrie 2 siblings, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-05-27 15:43 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Luke Gorrie, snabb-devel, qemu-devel, mst Il 27/05/2013 11:34, Stefan Hajnoczi ha scritto: > On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: >> Stefan put us onto the highly promising track of vhost/virtio. We have >> implemented this between Snabb Switch and the Linux kernel, but not >> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit >> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup >> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found >> a good way to map these towards Snabb Switch instead of the kernel. > > vhost_net is about connecting the a virtio-net speaking process to a > tun-like device. The problem you are trying to solve is connecting a > virtio-net speaking process to Snabb Switch. > > Either you need to replace vhost or you need a tun-like device > interface. > > How does your switch talk to hardware? And also, is your switch monolithic or does it consist of different processes? If you already have processes talking to each other, the first thing that came to my mind was a new network backend, similar to net/vde.c but more featureful (so that you support the virtio headers for offloading, for example). Then you would use "-netdev snabb,id=net0 -device e1000,netdev=net0". It would be slower than vhost-net, for example no zero-copy transmission. > 3. Use the kernel as a middle-man. Create a double-ended "veth" > interface and have Snabb Switch and QEMU each open a PF_PACKET > socket and accelerate it with VHOST_NET. 
As Michael mentioned, this could be macvtap on the interface that you
have already created in the switch and passed to vhost-net.  Then you do
not have to do anything in QEMU.

Paolo

> If you are using the Linux network stack then it might be better to
> integrate with vhost maybe as a tun-like device driver.
> 
> Stefan

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 15:43 ` Paolo Bonzini @ 2013-05-27 16:18 ` Anthony Liguori 2013-05-27 16:18 ` Paolo Bonzini 2013-05-28 10:39 ` Luke Gorrie 0 siblings, 2 replies; 49+ messages in thread From: Anthony Liguori @ 2013-05-27 16:18 UTC (permalink / raw) To: Paolo Bonzini, Stefan Hajnoczi; +Cc: Luke Gorrie, snabb-devel, qemu-devel, mst Paolo Bonzini <pbonzini@redhat.com> writes: > Il 27/05/2013 11:34, Stefan Hajnoczi ha scritto: >> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: >>> Stefan put us onto the highly promising track of vhost/virtio. We have >>> implemented this between Snabb Switch and the Linux kernel, but not >>> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit >>> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup >>> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found >>> a good way to map these towards Snabb Switch instead of the kernel. >> >> vhost_net is about connecting the a virtio-net speaking process to a >> tun-like device. The problem you are trying to solve is connecting a >> virtio-net speaking process to Snabb Switch. >> >> Either you need to replace vhost or you need a tun-like device >> interface. >> >> How does your switch talk to hardware? > > And also, is your switch monolithic or does it consist of different > processes? > > If you already have processes talking to each other, the first thing > that came to my mind was a new network backend, similar to net/vde.c but > more featureful (so that you support the virtio headers for offloading, > for example). Then you would use "-netdev snabb,id=net0 -device > e1000,netdev=net0". It would be very interesting to combine this with vmsplice/splice. > It would be slower than vhost-net, for example no zero-copy > transmission. 
With splice, I think you could at least get single copy guest-to-guest networking which is about as good as can be done. Regards, Anthony Liguori >> 3. Use the kernel as a middle-man. Create a double-ended "veth" >> interface and have Snabb Switch and QEMU each open a PF_PACKET >> socket and accelerate it with VHOST_NET. > > As Michael, mentioned, this could be macvtap on the interface that you > have already created in the switch and passed to vhost-net. Then you do > not have to do anything in QEMU. > > Paolo > >> If you are using the Linux network stack then it might be better to >> integrate with vhost maybe as a tun-like device driver. >> >> Stefan >> >> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 16:18 ` Anthony Liguori @ 2013-05-27 16:18 ` Paolo Bonzini 2013-05-27 17:01 ` Anthony Liguori 2013-05-28 10:39 ` Luke Gorrie 1 sibling, 1 reply; 49+ messages in thread From: Paolo Bonzini @ 2013-05-27 16:18 UTC (permalink / raw) To: Anthony Liguori Cc: Luke Gorrie, snabb-devel, qemu-devel, Stefan Hajnoczi, mst Il 27/05/2013 18:18, Anthony Liguori ha scritto: > Paolo Bonzini <pbonzini@redhat.com> writes: > >> Il 27/05/2013 11:34, Stefan Hajnoczi ha scritto: >>> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: >>>> Stefan put us onto the highly promising track of vhost/virtio. We have >>>> implemented this between Snabb Switch and the Linux kernel, but not >>>> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit >>>> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup >>>> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found >>>> a good way to map these towards Snabb Switch instead of the kernel. >>> >>> vhost_net is about connecting the a virtio-net speaking process to a >>> tun-like device. The problem you are trying to solve is connecting a >>> virtio-net speaking process to Snabb Switch. >>> >>> Either you need to replace vhost or you need a tun-like device >>> interface. >>> >>> How does your switch talk to hardware? >> >> And also, is your switch monolithic or does it consist of different >> processes? >> >> If you already have processes talking to each other, the first thing >> that came to my mind was a new network backend, similar to net/vde.c but >> more featureful (so that you support the virtio headers for offloading, >> for example). Then you would use "-netdev snabb,id=net0 -device >> e1000,netdev=net0". > > It would be very interesting to combine this with vmsplice/splice. Was zero-copy vmsplice/splice actually ever implemented? I thought it was reverted. 
Paolo >> It would be slower than vhost-net, for example no zero-copy >> transmission. > > With splice, I think you could at least get single copy guest-to-guest > networking which is about as good as can be done. > > Regards, > > Anthony Liguori > >>> 3. Use the kernel as a middle-man. Create a double-ended "veth" >>> interface and have Snabb Switch and QEMU each open a PF_PACKET >>> socket and accelerate it with VHOST_NET. >> >> As Michael, mentioned, this could be macvtap on the interface that you >> have already created in the switch and passed to vhost-net. Then you do >> not have to do anything in QEMU. >> >> Paolo >> >>> If you are using the Linux network stack then it might be better to >>> integrate with vhost maybe as a tun-like device driver. >>> >>> Stefan >>> >>> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 16:18 ` Paolo Bonzini @ 2013-05-27 17:01 ` Anthony Liguori 2013-05-27 17:13 ` Michael S. Tsirkin 0 siblings, 1 reply; 49+ messages in thread From: Anthony Liguori @ 2013-05-27 17:01 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Luke Gorrie, snabb-devel, qemu-devel, Stefan Hajnoczi, mst Paolo Bonzini <pbonzini@redhat.com> writes: > Il 27/05/2013 18:18, Anthony Liguori ha scritto: >> Paolo Bonzini <pbonzini@redhat.com> writes: >> >>> Il 27/05/2013 11:34, Stefan Hajnoczi ha scritto: >>>> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: >>>>> Stefan put us onto the highly promising track of vhost/virtio. We have >>>>> implemented this between Snabb Switch and the Linux kernel, but not >>>>> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit >>>>> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup >>>>> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found >>>>> a good way to map these towards Snabb Switch instead of the kernel. >>>> >>>> vhost_net is about connecting the a virtio-net speaking process to a >>>> tun-like device. The problem you are trying to solve is connecting a >>>> virtio-net speaking process to Snabb Switch. >>>> >>>> Either you need to replace vhost or you need a tun-like device >>>> interface. >>>> >>>> How does your switch talk to hardware? >>> >>> And also, is your switch monolithic or does it consist of different >>> processes? >>> >>> If you already have processes talking to each other, the first thing >>> that came to my mind was a new network backend, similar to net/vde.c but >>> more featureful (so that you support the virtio headers for offloading, >>> for example). Then you would use "-netdev snabb,id=net0 -device >>> e1000,netdev=net0". >> >> It would be very interesting to combine this with vmsplice/splice. > > Was zero-copy vmsplice/splice actually ever implemented? 
I thought it > was reverted.

Not sure what context you're talking about re: zero copy... a pipe can
store references to pages instead of having a buffer that stores data.
That certainly is there today--otherwise the interface is pointless.

When splicing from pipe to pipe, you can move those references without
copying the data.

When vmsplicing from a userspace region to a pipe, the kernel just
stores references to the pages.  vmsplicing from a pipe to userspace
OTOH will copy the data.  This is fixable at least when dealing with
GIFT'd pages.  For guest-to-guest traffic, you wouldn't be gifting the
pages I don't think.

For implementing guest-to-guest traffic, the source QEMU can vmsplice
the packet to a pipe that is shared with the vswitch.  The vswitch can
tee(2) the first N bytes to a second pipe such that it can read the
info needed for routing decisions.

Once the decision is made, if it's a local guest, it can splice() the
packet to the appropriate destination QEMU process or another vswitch
daemon (no data copy here).

Finally, the destination QEMU process can vmsplice() from the pipe which
will copy the data (this is the only copy).

If vswitch needs to route externally, then it would need to splice() to
a macvtap.

macvtap should be able to send the packet without copying the data.  Not
sure that this last part will work as expected but if it doesn't, that's
a bug that can/should be fixed.

The kernel cannot do better than the above modulo any overhead from
userspace context switching[*].  Guest-to-guest requires a copy.

Normally macvtap is undesirable because it's tightly connected to a
network adapter but that is a desirable trait in this case.

N.B., I'm not advocating making all switching decisions in userspace.
Just pointing out how it can be done efficiently.

[*] in theory the kernel could do zero copy receive but I'm not sure
it's feasible in practice.
Regards, Anthony Liguori > > Paolo > >>> It would be slower than vhost-net, for example no zero-copy >>> transmission. >> >> With splice, I think you could at least get single copy guest-to-guest >> networking which is about as good as can be done. >> >> Regards, >> >> Anthony Liguori >> >>>> 3. Use the kernel as a middle-man. Create a double-ended "veth" >>>> interface and have Snabb Switch and QEMU each open a PF_PACKET >>>> socket and accelerate it with VHOST_NET. >>> >>> As Michael, mentioned, this could be macvtap on the interface that you >>> have already created in the switch and passed to vhost-net. Then you do >>> not have to do anything in QEMU. >>> >>> Paolo >>> >>>> If you are using the Linux network stack then it might be better to >>>> integrate with vhost maybe as a tun-like device driver. >>>> >>>> Stefan >>>> >>>> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 17:01 ` Anthony Liguori @ 2013-05-27 17:13 ` Michael S. Tsirkin 2013-05-27 18:31 ` Anthony Liguori 0 siblings, 1 reply; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-27 17:13 UTC (permalink / raw) To: Anthony Liguori Cc: Luke Gorrie, Paolo Bonzini, snabb-devel, qemu-devel, Stefan Hajnoczi On Mon, May 27, 2013 at 12:01:07PM -0500, Anthony Liguori wrote: > Paolo Bonzini <pbonzini@redhat.com> writes: > > > Il 27/05/2013 18:18, Anthony Liguori ha scritto: > >> Paolo Bonzini <pbonzini@redhat.com> writes: > >> > >>> Il 27/05/2013 11:34, Stefan Hajnoczi ha scritto: > >>>> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote: > >>>>> Stefan put us onto the highly promising track of vhost/virtio. We have > >>>>> implemented this between Snabb Switch and the Linux kernel, but not > >>>>> directly between Snabb Switch and QEMU guests. The "roadblock" we have hit > >>>>> is embarrasingly basic: QEMU is using user-to-kernel system calls to setup > >>>>> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found > >>>>> a good way to map these towards Snabb Switch instead of the kernel. > >>>> > >>>> vhost_net is about connecting the a virtio-net speaking process to a > >>>> tun-like device. The problem you are trying to solve is connecting a > >>>> virtio-net speaking process to Snabb Switch. > >>>> > >>>> Either you need to replace vhost or you need a tun-like device > >>>> interface. > >>>> > >>>> How does your switch talk to hardware? > >>> > >>> And also, is your switch monolithic or does it consist of different > >>> processes? > >>> > >>> If you already have processes talking to each other, the first thing > >>> that came to my mind was a new network backend, similar to net/vde.c but > >>> more featureful (so that you support the virtio headers for offloading, > >>> for example). 
Then you would use "-netdev snabb,id=net0 -device > >>> e1000,netdev=net0". > >> > >> It would be very interesting to combine this with vmsplice/splice. > > > > Was zero-copy vmsplice/splice actually ever implemented? I thought it > > was reverted. > > Not sure what context you're talking about re: zero copy... a pipe can > store references to pages instead of having a buffer that stores data. > That certainly is there today--otherwise the interface is pointless. > > When splicing from pipe to pipe, you can move those references without > copying the data. > > When vmsplicing from a userspace region to a pipe, the kernel just > stores references to the pages. vmsplicing from a pipe to userspace > OTOH will copy the data. This is fixable at least when dealing with > GIFT'd pages. For guest-to-guest traffic, you wouldn't be gifting the > pages I don't think. > > For implementing guest-to-guest traffic, the source QEMU can vmsplice > the packet to a pipe that is shared with the vswitch. The vswitch can > tee(3) the first N bytes to a second pipe such that it can read the > info needed for routing decisions. > > Once the decision is made, if it's a local guest, it can splice() the > packet to the appropriate destination QEMU process or another vswitch > daemon (no data copy here). > > Finally, the destination QEMU process can vmsplice() from the pipe which > will copy the data (this is the only copy). AFAIK splice is mostly useless for networking as there's no way to get notified when packet has been sent. > If vswitch needs to route externally, then it would need to splice() to > a macvtap. > > macvtap should be able to send the packet without copying the data. Not > sure that this last work will work as expected but if it doesn't, that's > a bug that can/should be fixed. > > The kernel cannot do better than the above modulo any overhead from > userspace context switching[*]. Also modulo scheduler latency - kernel processes packets in interrupt context. 
There's a reason e.g. OVS runs data-path in kernel. > Guest-to-guest requires a copy. > Normally macvtap is undesirable because it's tightly connected to a > network adapter but that is a desirable trait in this case. > > N.B., I'm not advocating making all switching decisions in > userspace. Just pointing out how it can be done efficiently. > > [*] in theory the kernel could do zero copy receive but i'm not sure > it's feasible in practice. > > Regards, > > Anthony Liguori > > > > > Paolo > > > >>> It would be slower than vhost-net, for example no zero-copy > >>> transmission. > >> > >> With splice, I think you could at least get single copy guest-to-guest > >> networking which is about as good as can be done. > >> > >> Regards, > >> > >> Anthony Liguori > >> > >>>> 3. Use the kernel as a middle-man. Create a double-ended "veth" > >>>> interface and have Snabb Switch and QEMU each open a PF_PACKET > >>>> socket and accelerate it with VHOST_NET. > >>> > >>> As Michael, mentioned, this could be macvtap on the interface that you > >>> have already created in the switch and passed to vhost-net. Then you do > >>> not have to do anything in QEMU. > >>> > >>> Paolo > >>> > >>>> If you are using the Linux network stack then it might be better to > >>>> integrate with vhost maybe as a tun-like device driver. > >>>> > >>>> Stefan > >>>> > >>>> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 17:13 ` Michael S. Tsirkin @ 2013-05-27 18:31 ` Anthony Liguori 0 siblings, 0 replies; 49+ messages in thread From: Anthony Liguori @ 2013-05-27 18:31 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Luke Gorrie, Paolo Bonzini, snabb-devel, qemu-devel, Stefan Hajnoczi On Mon, May 27, 2013 at 12:13 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Mon, May 27, 2013 at 12:01:07PM -0500, Anthony Liguori wrote: > > Paolo Bonzini <pbonzini@redhat.com> writes: > > > > Finally, the destination QEMU process can vmsplice() from the pipe which > > will copy the data (this is the only copy). > > AFAIK splice is mostly useless for networking as there's no way to > get notified when packet has been sent. I suspect you could use a thread pool to work around this. It's certainly not useless if your goal is to do userspace switching... > > If vswitch needs to route externally, then it would need to splice() to > > a macvtap. > > > > macvtap should be able to send the packet without copying the data. Not > > sure that this last work will work as expected but if it doesn't, that's > > a bug that can/should be fixed. > > > > The kernel cannot do better than the above modulo any overhead from > > userspace context switching[*]. > > Also modulo scheduler latency - kernel processes packets > in interrupt context. There's a reason e.g. OVS runs data-path in > kernel. Ack. Like I say below, I think network routing belongs in the kernel. Regards, Anthony Liguori > > Guest-to-guest requires a copy. > > Normally macvtap is undesirable because it's tightly connected to a > > network adapter but that is a desirable trait in this case. > > > > N.B., I'm not advocating making all switching decisions in > > userspace. Just pointing out how it can be done efficiently. > > > > [*] in theory the kernel could do zero copy receive but i'm not sure > > it's feasible in practice. 
> > > > Regards, > > > > Anthony Liguori > > > > > > > > Paolo > > > > > >>> It would be slower than vhost-net, for example no zero-copy > > >>> transmission. > > >> > > >> With splice, I think you could at least get single copy guest-to-guest > > >> networking which is about as good as can be done. > > >> > > >> Regards, > > >> > > >> Anthony Liguori > > >> > > >>>> 3. Use the kernel as a middle-man. Create a double-ended "veth" > > >>>> interface and have Snabb Switch and QEMU each open a PF_PACKET > > >>>> socket and accelerate it with VHOST_NET. > > >>> > > >>> As Michael, mentioned, this could be macvtap on the interface that you > > >>> have already created in the switch and passed to vhost-net. Then you do > > >>> not have to do anything in QEMU. > > >>> > > >>> Paolo > > >>> > > >>>> If you are using the Linux network stack then it might be better to > > >>>> integrate with vhost maybe as a tun-like device driver. > > >>>> > > >>>> Stefan > > >>>> > > >>>> ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 16:18 ` Anthony Liguori 2013-05-27 16:18 ` Paolo Bonzini @ 2013-05-28 10:39 ` Luke Gorrie 1 sibling, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-05-28 10:39 UTC (permalink / raw) To: Anthony Liguori Cc: Paolo Bonzini, snabb-devel, qemu-devel, Stefan Hajnoczi, mst [-- Attachment #1: Type: text/plain, Size: 948 bytes --] Hi Anthony, On 27 May 2013 18:18, Anthony Liguori <anthony@codemonkey.ws> wrote: > It would be very interesting to combine this with vmsplice/splice. > Good point. This kernel-centric approach is a very promising one, though not the design we are exploring in the Snabb Switch project. Snabb Switch is instead very hardware-centric. That is: we see the world as CPU cores, DRAM banks, PCIe devices. We want to keep our inter-process communication as close to this model as possible, which is why Virtio is very appealing - it looks like a DMA-based interface between two pieces of hardware. In this sense the kernel is like a BIOS: something that got you up and running, and takes care of lots of ugly irrelevant details for you, but that you don't want to have the minimum possible interaction with. Some motivation explained in an old blog entry when deciding to take this route: http://blog.lukego.com/blog/2012/10/28/firmware-vs-software/ [-- Attachment #2: Type: text/html, Size: 1625 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-27 9:34 ` Stefan Hajnoczi 2013-05-27 15:18 ` Michael S. Tsirkin 2013-05-27 15:43 ` Paolo Bonzini @ 2013-05-28 10:10 ` Luke Gorrie 2013-05-28 10:35 ` Stefan Hajnoczi ` (2 more replies) 2 siblings, 3 replies; 49+ messages in thread From: Luke Gorrie @ 2013-05-28 10:10 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: snabb-devel, qemu-devel, mst [-- Attachment #1: Type: text/plain, Size: 1656 bytes --] On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com> wrote: > vhost_net is about connecting the a virtio-net speaking process to a > tun-like device. The problem you are trying to solve is connecting a > virtio-net speaking process to Snabb Switch. > Yep! > Either you need to replace vhost or you need a tun-like device > interface. > > Replacing vhost would mean that your switch implements virtio-net, > shares guest RAM with the guest, and shares the ioeventfd and irqfd > which are used to signal with the guest. This would be a great solution from my perspective. This is the design that I am now struggling to find a good implementation strategy for. > At that point your switch is similar to the virtio-net data plane work > that Ping Fan Liu is working > on but your switch is in a separate process rather than a thread. > Thanks for the reference! I was not aware of this work and it sounds highly relevant. How does your switch talk to hardware? If you have userspace NIC > drivers that bypass the Linux network stack then the approach I > mentioned fits well. > The switch talks to hardware using a built-in userspace ("kernel bypass") device driver. The switch runs in a single userspace process with realtime priority and polls for traffic. The design is similar to what Intel are now promoting with their Data Plane Development Kit. The only system call in the main traffic loop is to sleep for a microsecond or so when idle. 
The Intel 10G NIC driver is written in Lua, btw -- in case anybody is
curious to check out something so exotic, here's the link:
https://github.com/SnabbCo/snabbswitch/blob/master/src/intel10g.lua

[-- Attachment #2: Type: text/html, Size: 2967 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 10:10 ` Luke Gorrie @ 2013-05-28 10:35 ` Stefan Hajnoczi 2013-05-28 11:36 ` Julian Stecklina 2013-05-28 11:58 ` Stefan Hajnoczi 2 siblings, 0 replies; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-28 10:35 UTC (permalink / raw) To: Luke Gorrie; +Cc: snabb-devel, qemu-devel, Stefan Hajnoczi, Michael S. Tsirkin On Tue, May 28, 2013 at 12:10 PM, Luke Gorrie <lukego@gmail.com> wrote: > On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com> wrote: >> >> vhost_net is about connecting the a virtio-net speaking process to a >> tun-like device. The problem you are trying to solve is connecting a >> virtio-net speaking process to Snabb Switch. > > > Yep! > >> >> Either you need to replace vhost or you need a tun-like device >> interface. >> >> Replacing vhost would mean that your switch implements virtio-net, >> shares guest RAM with the guest, and shares the ioeventfd and irqfd >> which are used to signal with the guest. > > > This would be a great solution from my perspective. This is the design that > I am now struggling to find a good implementation strategy for. > >> >> At that point your switch is similar to the virtio-net data plane work >> that Ping Fan Liu is working >> on but your switch is in a separate process rather than a thread. > > > Thanks for the reference! I was not aware of this work and it sounds highly > relevant. > >> How does your switch talk to hardware? If you have userspace NIC >> drivers that bypass the Linux network stack then the approach I >> mentioned fits well. > > > The switch talks to hardware using a built-in userspace ("kernel bypass") > device driver. BTW there is an effort to get low-latency networking integrated into Linux: http://thread.gmane.org/gmane.linux.kernel/1493276 Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 10:10 ` Luke Gorrie 2013-05-28 10:35 ` Stefan Hajnoczi @ 2013-05-28 11:36 ` Julian Stecklina 2013-05-28 11:53 ` Michael S. Tsirkin 2013-05-28 17:00 ` [Qemu-devel] " Anthony Liguori 2013-05-28 11:58 ` Stefan Hajnoczi 2 siblings, 2 replies; 49+ messages in thread From: Julian Stecklina @ 2013-05-28 11:36 UTC (permalink / raw) To: snabb-devel, qemu-devel; +Cc: mst On 05/28/2013 12:10 PM, Luke Gorrie wrote: > On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > <mailto:stefanha@redhat.com>> wrote: > > vhost_net is about connecting the a virtio-net speaking process to a > tun-like device. The problem you are trying to solve is connecting a > virtio-net speaking process to Snabb Switch. > > > Yep! Since I am on a similar path as Luke, let me share another idea. What about extending qemu in a way to allow PCI device models to be implemented in another process. This is not as hard as it may sound. qemu would open a domain socket to this process and map VM memory over to the other side. This can be accomplished by having file descriptors in qemu to VM memory (reusing -mem-path code) and passing those over the domain socket. The other side can then just mmap them. The socket would also be used for configuration and I/O by the guest on the PCI I/O/memory regions. You could also use this to do IRQs or use eventfds, whatever works better. To have a zero copy userspace switch, the switch would offer virtio-net devices to any qemu that wants to connect to it and implement the complete device logic itself. Since it has access to all guest memory, it can just do memcpy for packet data. Of course, this only works for 64-bit systems, because you need vast amounts of virtual address space. In my experience, doing this in userspace is _way less painful_. If you can get away with polling in the switch the overhead of doing all this in userspace is zero. 
And as long as you can rate-limit explicit notifications over the socket, even that overhead should be okay. Opinions? Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:36 ` Julian Stecklina @ 2013-05-28 11:53 ` Michael S. Tsirkin 2013-05-28 12:09 ` Julian Stecklina ` (2 more replies) 2013-05-28 17:00 ` [Qemu-devel] " Anthony Liguori 1 sibling, 3 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-28 11:53 UTC (permalink / raw) To: Julian Stecklina; +Cc: snabb-devel, qemu-devel On Tue, May 28, 2013 at 01:36:36PM +0200, Julian Stecklina wrote: > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > > On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > > <mailto:stefanha@redhat.com>> wrote: > > > > vhost_net is about connecting the a virtio-net speaking process to a > > tun-like device. The problem you are trying to solve is connecting a > > virtio-net speaking process to Snabb Switch. > > > > > > Yep! > > Since I am on a similar path as Luke, let me share another idea. > > What about extending qemu in a way to allow PCI device models to be > implemented in another process. This is not as hard as it may sound. > qemu would open a domain socket to this process and map VM memory over > to the other side. This can be accomplished by having file descriptors > in qemu to VM memory (reusing -mem-path code) and passing those over the > domain socket. The other side can then just mmap them. The socket would > also be used for configuration and I/O by the guest on the PCI > I/O/memory regions. You could also use this to do IRQs or use eventfds, > whatever works better. > > To have a zero copy userspace switch, the switch would offer virtio-net > devices to any qemu that wants to connect to it and implement the > complete device logic itself. Since it has access to all guest memory, > it can just do memcpy for packet data. Of course, this only works for > 64-bit systems, because you need vast amounts of virtual address space. > In my experience, doing this in userspace is _way less painful_. 
> > If you can get away with polling in the switch the overhead of doing all > this in userspace is zero. And as long as you can rate-limit explicit > notifications over the socket even that overhead should be okay. > > Opinions? > > Julian Implementing out of process device logic would absolutely be useful for qemu, for security. Don't expect it to be zero overhead though, latency overhead of bouncing each packet through multiple processes would be especially painful. Yes, you can maybe trade some of this latency for power/CPU cycles by aggressive polling. Doing this in a way that does not waste a lot of power would be tricky. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:53 ` Michael S. Tsirkin @ 2013-05-28 12:09 ` Julian Stecklina 2013-05-28 13:56 ` Michael S. Tsirkin 2013-05-28 12:48 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie 2013-05-28 14:42 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie 2 siblings, 1 reply; 49+ messages in thread From: Julian Stecklina @ 2013-05-28 12:09 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: snabb-devel, qemu-devel On 05/28/2013 01:53 PM, Michael S. Tsirkin wrote: > Implementing out of process device logic would absolutely be useful for > qemu, for security. > > Don't expect it to be zero overhead though, latency overhead > of bouncing each packet through multiple processes would > be especially painful. Currently, latency for vhost is also quite bad compared to what it could be, because for VM-to-VM packets usually 4 CPUs are involved: the CPU that VM A's vcpu thread runs on, the CPU its vhost thread in the kernel runs on, the CPU VM B's vhost thread runs on, and finally the CPU VM B's vcpu thread runs on. It is possible to change the vhost implementation in the kernel to handle packet transmission to local VMs in a single thread, but it is rather hard. I have a hacky patch that implements that (which unfortunately I cannot make public :( ) and it improves latency and CPU utilization. I would suppose a userspace implementation of this is way simpler and still gives most of the performance benefits. It also removes the virtio implementation in the kernel (vhost) from the trusted computing base of other stuff in the system. IMHO implementing device emulation in the kernel is plain wrong from a security perspective. Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 12:09 ` Julian Stecklina @ 2013-05-28 13:56 ` Michael S. Tsirkin 2013-05-28 15:35 ` Julian Stecklina 0 siblings, 1 reply; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-28 13:56 UTC (permalink / raw) To: Julian Stecklina; +Cc: snabb-devel, qemu-devel On Tue, May 28, 2013 at 02:09:21PM +0200, Julian Stecklina wrote: > On 05/28/2013 01:53 PM, Michael S. Tsirkin wrote: > > Implementing out of process device logic would absolutely be useful for > > qemu, for security. > > > > Don't expect it to be zero overhead though, latency overhead > > of bouncing each packet through multiple processes would > > be especially painful. > > Currently, latency for vhost is also quite bad compared to what it could > be, because for VM-to-VM packets usually 4 CPUs are involved. The CPU > that VM A's vcpu thread runs on, the CPU its vhost thread in the kernel > runs on, the CPU VM B's vhost thread runs on and finally the CPU VM B's > vcpu thread runs on. > > It is possible to change the vhost implementation in the kernel to > handle packet transmission to local VMs in a single thread, but it is > rather hard. I have a hacky patch that implements that (that > unfortunately I cannot make public :( ) and it improves latency and CPU > utlization. Yes - and it's not new. Shirley Ma sent such prototype patches, and in fact that was how vhost worked originally. There were some issues to be fixed before it worked without issues, but we do plan to go back to that I think. And that's only for guest to guest. While important it is not the most common case. Guest to external is. For that we need to do things like process packets in softirq context. People are looking into all this now. > I would suppose a userspace implementation of this is way > simpler and still give most of the performance benefits. 
It also removes > the virtio implementation in the kernel (vhost) from the trusted > computing base of other stuff in the system. > > IMHO implementing device emulation in the kernel is plain wrong from a > security perspective. > > Julian It would be, yes. But vhost is not device emulation; the emulation is in qemu. vhost is an asynchronous kernel/userspace interface. kvm has support for ioeventfd/irqfd, which creates a fastpath way to signal the host kernel directly from the guest, bypassing qemu. But it's not a vhost feature, and anyway people are using vhost without it, so it's not a must. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 13:56 ` Michael S. Tsirkin @ 2013-05-28 15:35 ` Julian Stecklina 2013-05-28 15:44 ` Michael S. Tsirkin 0 siblings, 1 reply; 49+ messages in thread From: Julian Stecklina @ 2013-05-28 15:35 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: snabb-devel, qemu-devel [-- Attachment #1: Type: text/plain, Size: 289 bytes --] On 05/28/2013 03:56 PM, Michael S. Tsirkin wrote: > and in fact that was how vhost worked originally. > There were some issues to be fixed before it worked > without issues, but we do plan to go back to that I think. Do you know why they abandoned this execution model? Julian [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 15:35 ` Julian Stecklina @ 2013-05-28 15:44 ` Michael S. Tsirkin 0 siblings, 0 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-28 15:44 UTC (permalink / raw) To: Julian Stecklina; +Cc: snabb-devel, qemu-devel On Tue, May 28, 2013 at 05:35:55PM +0200, Julian Stecklina wrote: > On 05/28/2013 03:56 PM, Michael S. Tsirkin wrote: > > and in fact that was how vhost worked originally. > > There were some issues to be fixed before it worked > > without issues, but we do plan to go back to that I think. > > Do you know why they abandoned this execution model? > > Julian > Yes I do. Two main issues: 1. Originally vhost used a shared workqueue for everything. It turned out that an access to userspace memory might sometimes block e.g. if it hit swap. When this happened the whole workqueue got blocked, and no guest could make progress. 2. workqueue was sticking to one CPU too aggressively E.g. there could be 10 free CPUs on the box, workqueue was still using the same one which queued the work even if that one was very busy. This all got sorted out in core workqueue code, so we should go back and try using the regular workqueue. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:276] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:53 ` Michael S. Tsirkin 2013-05-28 12:09 ` Julian Stecklina @ 2013-05-28 12:48 ` Luke Gorrie 2013-05-28 13:12 ` Julian Stecklina 2013-05-28 14:42 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie 2 siblings, 1 reply; 49+ messages in thread From: Luke Gorrie @ 2013-05-28 12:48 UTC (permalink / raw) To: snabb-devel; +Cc: qemu-devel, Julian Stecklina [-- Attachment #1: Type: text/plain, Size: 645 bytes --] On 28 May 2013 13:53, Michael S. Tsirkin <mst@redhat.com> wrote: > Implementing out of process device logic would absolutely be useful for > qemu, for security. > This sounds wonderful from my perspective. The whole PCI device implemented in my process according to the Virtio spec? What would it take to make this possible? I really want snabbswitch <-> guest I/O with as little involvement from Linux and QEMU as possible. Personally I can work much more effectively in the snabbswitch code than in QEMU or Linux and that's why I haven't implemented Stefan's design, despite having several times sat down with the intention of doing so :) [-- Attachment #2: Type: text/html, Size: 1147 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:276] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 12:48 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie @ 2013-05-28 13:12 ` Julian Stecklina 2013-05-28 13:42 ` [Qemu-devel] [snabb-devel:280] " Luke Gorrie 0 siblings, 1 reply; 49+ messages in thread From: Julian Stecklina @ 2013-05-28 13:12 UTC (permalink / raw) Cc: snabb-devel, qemu-devel On 05/28/2013 02:48 PM, Luke Gorrie wrote: > On 28 May 2013 13:53, Michael S. Tsirkin <mst@redhat.com > <mailto:mst@redhat.com>> wrote: > > Implementing out of process device logic would absolutely be useful for > qemu, for security. > > > This sounds wonderful from my perspective. The whole PCI device > implemented in my process according to the Virtio spec? What would it > take to make this possible? AFAICS this can be implemented as a new device in qemu without touching qemu internals. Except for a way to get file descriptors for guest memory. I'll give it a shot today and tomorrow and we'll see how far I get... Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:280] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 13:12 ` Julian Stecklina @ 2013-05-28 13:42 ` Luke Gorrie 0 siblings, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-05-28 13:42 UTC (permalink / raw) To: snabb-devel; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 323 bytes --] On 28 May 2013 15:12, Julian Stecklina <jsteckli@os.inf.tu-dresden.de>wrote: > AFAICS this can be implemented as a new device in qemu without touching > qemu internals. Except for a way to get file descriptors for guest > memory. I'll give it a shot today and tomorrow and we'll see how far I > get... > Most intriguing! [-- Attachment #2: Type: text/html, Size: 729 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:276] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:53 ` Michael S. Tsirkin 2013-05-28 12:09 ` Julian Stecklina 2013-05-28 12:48 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie @ 2013-05-28 14:42 ` Luke Gorrie 2013-05-28 15:33 ` Julian Stecklina 2 siblings, 1 reply; 49+ messages in thread From: Luke Gorrie @ 2013-05-28 14:42 UTC (permalink / raw) To: snabb-devel; +Cc: qemu-devel, Julian Stecklina [-- Attachment #1: Type: text/plain, Size: 978 bytes --] On 28 May 2013 13:53, Michael S. Tsirkin <mst@redhat.com> wrote: > Yes, you can maybe trade some of this latency for power/CPU cycles by > aggressive polling. Doing this in a way that does not waste a lot of > power would be tricky. > For what it's worth, here is my mental model right now: Administrator budgets one CPU core for network I/O (VM and NIC). Switch uses that CPU to deliver sufficient speed (e.g. 10 M packets/sec ~ 40Gbps). Switch uses micro-sleeps to cut cpu usage to ~ 1% in idle periods. Today I really am statically provisioning a CPU core using the "linux isolcpus=..." boot option, but maybe e.g. scheduling with realtime priority would work too and free up some spare cycles for running VMs. I am hoping that administrators will feel that dedicating ~6.25% total CPU (one core out of 16) for networking will be OK. It remains to be seen whether we can really deliver this kind of performance with a mix of physical NICs and VMs, but that's the design goal. [-- Attachment #2: Type: text/html, Size: 1588 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:276] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 14:42 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie @ 2013-05-28 15:33 ` Julian Stecklina 0 siblings, 0 replies; 49+ messages in thread From: Julian Stecklina @ 2013-05-28 15:33 UTC (permalink / raw) Cc: snabb-devel, qemu-devel On 05/28/2013 04:42 PM, Luke Gorrie wrote: > On 28 May 2013 13:53, Michael S. Tsirkin <mst@redhat.com > <mailto:mst@redhat.com>> wrote: > > Yes, you can maybe trade some of this latency for power/CPU cycles by > aggressive polling. Doing this in a way that does not waste a lot of > power would be tricky. > > > For what it's worth, here is my mental model right now: > > Administrator budgets one CPU core for network I/O (VM and NIC). > Switch uses that CPU to deliver sufficient speed (e.g. 10 M packets/sec > ~ 40Gbps). > Switch uses micro-sleeps to cut cpu usage to ~ 1% in idle periods. With virtio the backend can decide whether it wants to be notified by the client. If you disable all notifications, you are in polling mode. If your backend/switch doesn't find anything to do, it can re-enable notifications and block. Thus you would naturally revert to non-polling mode when the network load is low. Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:36 ` Julian Stecklina 2013-05-28 11:53 ` Michael S. Tsirkin @ 2013-05-28 17:00 ` Anthony Liguori 2013-05-28 17:17 ` Michael S. Tsirkin 2013-05-29 12:32 ` Julian Stecklina 1 sibling, 2 replies; 49+ messages in thread From: Anthony Liguori @ 2013-05-28 17:00 UTC (permalink / raw) To: Julian Stecklina, snabb-devel, qemu-devel; +Cc: mst Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > On 05/28/2013 12:10 PM, Luke Gorrie wrote: >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com >> <mailto:stefanha@redhat.com>> wrote: >> >> vhost_net is about connecting the a virtio-net speaking process to a >> tun-like device. The problem you are trying to solve is connecting a >> virtio-net speaking process to Snabb Switch. >> >> >> Yep! > > Since I am on a similar path as Luke, let me share another idea. > > What about extending qemu in a way to allow PCI device models to be > implemented in another process. We aren't going to support any interface that enables out of tree devices. This is just plugins in a different form with even more downsides. You cannot easily keep track of dirty info, the guest physical address translation to host is difficult to keep in sync (imagine the complexity of memory hotplug). Basically, it's easy to hack up but extremely hard to do something that works correctly overall. There isn't a compelling reason to implement something like this other than avoiding getting code into QEMU. Best to just submit your device to QEMU for inclusion. If you want to avoid copying in a vswitch, better to use something like vmsplice as I outlined in another thread. > This is not as hard as it may sound. > qemu would open a domain socket to this process and map VM memory over > to the other side. This can be accomplished by having file descriptors > in qemu to VM memory (reusing -mem-path code) and passing those over the > domain socket. 
The other side can then just mmap them. The socket would > also be used for configuration and I/O by the guest on the PCI > I/O/memory regions. You could also use this to do IRQs or use eventfds, > whatever works better. > > To have a zero copy userspace switch, the switch would offer virtio-net > devices to any qemu that wants to connect to it and implement the > complete device logic itself. Since it has access to all guest memory, > it can just do memcpy for packet data. Of course, this only works for > 64-bit systems, because you need vast amounts of virtual address space. > In my experience, doing this in userspace is _way less painful_. > > If you can get away with polling in the switch the overhead of doing all > this in userspace is zero. And as long as you can rate-limit explicit > notifications over the socket even that overhead should be okay. > > Opinions? I don't see any compelling reason to do something like this. It's jumping through a tremendous number of hoops to avoid putting code that belongs in QEMU in tree. Regards, Anthony Liguori > > Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 17:00 ` [Qemu-devel] " Anthony Liguori @ 2013-05-28 17:17 ` Michael S. Tsirkin 2013-05-28 18:55 ` Anthony Liguori 2013-05-29 7:49 ` [Qemu-devel] " Stefan Hajnoczi 2013-05-29 12:32 ` Julian Stecklina 1 sibling, 2 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-28 17:17 UTC (permalink / raw) To: Anthony Liguori; +Cc: snabb-devel, qemu-devel, Julian Stecklina On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > >> <mailto:stefanha@redhat.com>> wrote: > >> > >> vhost_net is about connecting the a virtio-net speaking process to a > >> tun-like device. The problem you are trying to solve is connecting a > >> virtio-net speaking process to Snabb Switch. > >> > >> > >> Yep! > > > > Since I am on a similar path as Luke, let me share another idea. > > > > What about extending qemu in a way to allow PCI device models to be > > implemented in another process. > > We aren't going to support any interface that enables out of tree > devices. This is just plugins in a different form with even more > downsides. You cannot easily keep track of dirty info, the guest > physical address translation to host is difficult to keep in sync > (imagine the complexity of memory hotplug). > > Basically, it's easy to hack up but extremely hard to do something that > works correctly overall. > > There isn't a compelling reason to implement something like this other > than avoiding getting code into QEMU. Best to just submit your device > to QEMU for inclusion. > > If you want to avoid copying in a vswitch, better to use something like > vmsplice as I outlined in another thread. > > > This is not as hard as it may sound. 
> > qemu would open a domain socket to this process and map VM memory over > > to the other side. This can be accomplished by having file descriptors > > in qemu to VM memory (reusing -mem-path code) and passing those over the > > domain socket. The other side can then just mmap them. The socket would > > also be used for configuration and I/O by the guest on the PCI > > I/O/memory regions. You could also use this to do IRQs or use eventfds, > > whatever works better. > > > > To have a zero copy userspace switch, the switch would offer virtio-net > > devices to any qemu that wants to connect to it and implement the > > complete device logic itself. Since it has access to all guest memory, > > it can just do memcpy for packet data. Of course, this only works for > > 64-bit systems, because you need vast amounts of virtual address space. > > In my experience, doing this in userspace is _way less painful_. > > > > If you can get away with polling in the switch the overhead of doing all > > this in userspace is zero. And as long as you can rate-limit explicit > > notifications over the socket even that overhead should be okay. > > > > Opinions? > > I don't see any compelling reason to do something like this. It's > jumping through a tremendous number of hoops to avoid putting code that > belongs in QEMU in tree. > > Regards, > > Anthony Liguori > > > > > Julian OTOH an in-tree device that runs in a separate process would be useful e.g. for security. For example, we could limit a virtio-net device process to only access tap and vhost files. We can kill this process if there's a bug with the result that NIC gets stalled but everything else keeps going. Possibly restart on next guest reset. There could be other advantages. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 17:17 ` Michael S. Tsirkin @ 2013-05-28 18:55 ` Anthony Liguori 2013-05-29 10:31 ` Stefano Stabellini 2013-05-29 7:49 ` [Qemu-devel] " Stefan Hajnoczi 1 sibling, 1 reply; 49+ messages in thread From: Anthony Liguori @ 2013-05-28 18:55 UTC (permalink / raw) To: Michael S. Tsirkin Cc: snabb-devel, qemu-devel, Stefano Stabellini, Julian Stecklina "Michael S. Tsirkin" <mst@redhat.com> writes: > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: >> Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: >> >> >> I don't see any compelling reason to do something like this. It's >> jumping through a tremendous number of hoops to avoid putting code that >> belongs in QEMU in tree. >> >> Regards, >> >> Anthony Liguori >> >> > >> > Julian > > OTOH an in-tree device that runs in a separate process would > be useful e.g. for security. An *in-tree* device would at least be a reasonable place to have a discussion. I still think it's pretty hard to make work beyond just a hack. > For example, we could limit a virtio-net device process > to only access tap and vhost files. Stefano et al from the Xen community have some interest in this. I believe they've done some initial prototyping already. Regards, Anthony Liguori > We can kill this process if there's a bug > with the result that NIC gets stalled but everything else > keeps going. > Possibly restart on next guest reset. > There could be other advantages. > > -- > MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 18:55 ` Anthony Liguori @ 2013-05-29 10:31 ` Stefano Stabellini 2013-05-29 12:25 ` Michael S. Tsirkin 2013-06-04 12:19 ` [Qemu-devel] [snabb-devel:300] " Luke Gorrie 0 siblings, 2 replies; 49+ messages in thread From: Stefano Stabellini @ 2013-05-29 10:31 UTC (permalink / raw) To: Anthony Liguori Cc: snabb-devel, Stefano Stabellini, Michael S. Tsirkin, qemu-devel, julien.grall, Julian Stecklina On Tue, 28 May 2013, Anthony Liguori wrote: > "Michael S. Tsirkin" <mst@redhat.com> writes: > > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > >> Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > >> > >> > >> I don't see any compelling reason to do something like this. It's > >> jumping through a tremendous number of hoops to avoid putting code that > >> belongs in QEMU in tree. > >> > >> Regards, > >> > >> Anthony Liguori > >> > >> > > >> > Julian > > > > OTOH an in-tree device that runs in a separate process would > > be useful e.g. for security. > > An *in-tree* device would at least be a reasonable place to have a discussion. > > I still think it's pretty hard to make work beyond just a hack. > > > For example, we could limit a virtio-net device process > > to only access tap and vhost files. > > Stefano et al from the Xen community have some interest in this. I > believe they've done some initial prototyping already. Right, what Michael said are exactly the principal reasons why Julien started working on this a while back: http://marc.info/?l=qemu-devel&m=134566472209750&w=2 http://marc.info/?l=qemu-devel&m=134566262709001&w=2 Although he had a prototype fully running, the code never went upstream, and now Julien is working on something else. The work was based on Xen and the idea that you can have multiple device models (multiple QEMU instances) each of them emulating a different set of devices for the guest VM. 
Each device model would register with Xen the devices it is emulating and the corresponding MMIO regions for which it wants to receive IO requests. When the guest traps into Xen on a MMIO read/write, Xen would forward the IO emulation request to the right device model instance. This is very useful for reliability, because if you have a bug in your network device emulator, it is not going to bring down all the QEMU instances, just the one running the network device, and that one can be restarted without compromising the stability of the whole system. It is good for security, because you can limit what each QEMU process can do in a much more fine-grained way. And of course on Xen you can go much further by running the QEMU instances in different domains altogether. It is good for isolation because the QEMU processes don't need to be fully privileged and are completely isolated from one another, so if a malicious guest manages to break into one of them, for example because the network device has a security vulnerability, it won't be able to cause issues to the others. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 10:31 ` Stefano Stabellini @ 2013-05-29 12:25 ` Michael S. Tsirkin 2013-05-29 13:04 ` Stefano Stabellini 2013-06-04 12:19 ` [Qemu-devel] [snabb-devel:300] " Luke Gorrie 1 sibling, 1 reply; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-29 12:25 UTC (permalink / raw) To: Stefano Stabellini Cc: julien.grall, snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Wed, May 29, 2013 at 11:31:50AM +0100, Stefano Stabellini wrote: > On Tue, 28 May 2013, Anthony Liguori wrote: > > "Michael S. Tsirkin" <mst@redhat.com> writes: > > > > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > >> Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > >> > > >> > > >> I don't see any compelling reason to do something like this. It's > > >> jumping through a tremendous number of hoops to avoid putting code that > > >> belongs in QEMU in tree. > > >> > > >> Regards, > > >> > > >> Anthony Liguori > > >> > > >> > > > >> > Julian > > > > > > OTOH an in-tree device that runs in a separate process would > > > be useful e.g. for security. > > > > An *in-tree* device would at least be a reasonable place to have a discussion. > > > > I still think it's pretty hard to make work beyond just a hack. > > > > > For example, we could limit a virtio-net device process > > > to only access tap and vhost files. > > > > Stefano et al from the Xen community have some interest in this. I > > believe they've done some initial prototyping already. > > Right, what Michael said are exactly the principal reasons why Julien > started working on this a while back: > > http://marc.info/?l=qemu-devel&m=134566472209750&w=2 > http://marc.info/?l=qemu-devel&m=134566262709001&w=2 > > Although he had a prototype fully running, the code never went upstream, > and now Julien is working on something else. 
> > The work was based on Xen and the idea that you can have multiple device > models (multiple QEMU instances) each of them emulating a different set > of devices for the guest VM. Each device model would register with Xen > the devices that is emulating and the corresponding MMIO regions for > which it wants to receive IO requests. When the guest traps into Xen on > a MMIO read/write, Xen would forward the IO emulation request to the > right device model instance. > > This is very useful for reliability, because if you have a bug in your > network device emulator is not going to bring down all the QEMU > instances, just the one running the network device, and could be > restarted without compromising the stability of the whole system. > > It is good for security, because you can limit what each QEMU process > can do in a much more fine grained way. And of course on Xen you can go > much farther by running the QEMU instances in different domains > altogether. > > It is good for isolation because the QEMU processes don't need to be > fully privileged and are completely isolated from one another so if a > malicious guest manages to break into one of them, for example because > the network device has a security vulnerability, it won't be able to > cause issues to the others. I see. I think what we are discussing here is more along the lines of decoding the request in QEMU and forwarding to another process for slow-path setup. Do the bounce directly in kvm only for fast-path operations. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 12:25 ` Michael S. Tsirkin @ 2013-05-29 13:04 ` Stefano Stabellini 0 siblings, 0 replies; 49+ messages in thread From: Stefano Stabellini @ 2013-05-29 13:04 UTC (permalink / raw) To: Michael S. Tsirkin Cc: snabb-devel, Stefano Stabellini, qemu-devel, julien.grall, Anthony Liguori, Julian Stecklina On Wed, 29 May 2013, Michael S. Tsirkin wrote: > On Wed, May 29, 2013 at 11:31:50AM +0100, Stefano Stabellini wrote: > > On Tue, 28 May 2013, Anthony Liguori wrote: > > > "Michael S. Tsirkin" <mst@redhat.com> writes: > > > > > > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > > >> Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > >> > > > >> > > > >> I don't see any compelling reason to do something like this. It's > > > >> jumping through a tremendous number of hoops to avoid putting code that > > > >> belongs in QEMU in tree. > > > >> > > > >> Regards, > > > >> > > > >> Anthony Liguori > > > >> > > > >> > > > > >> > Julian > > > > > > > > OTOH an in-tree device that runs in a separate process would > > > > be useful e.g. for security. > > > > > > An *in-tree* device would at least be a reasonable place to have a discussion. > > > > > > I still think it's pretty hard to make work beyond just a hack. > > > > > > > For example, we could limit a virtio-net device process > > > > to only access tap and vhost files. > > > > > > Stefano et al from the Xen community have some interest in this. I > > > believe they've done some initial prototyping already. > > > > Right, what Michael said are exactly the principal reasons why Julien > > started working on this a while back: > > > > http://marc.info/?l=qemu-devel&m=134566472209750&w=2 > > http://marc.info/?l=qemu-devel&m=134566262709001&w=2 > > > > Although he had a prototype fully running, the code never went upstream, > > and now Julien is working on something else. 
> > > > The work was based on Xen and the idea that you can have multiple device > > models (multiple QEMU instances) each of them emulating a different set > > of devices for the guest VM. Each device model would register with Xen > > the devices that is emulating and the corresponding MMIO regions for > > which it wants to receive IO requests. When the guest traps into Xen on > > a MMIO read/write, Xen would forward the IO emulation request to the > > right device model instance. > > > > This is very useful for reliability, because if you have a bug in your > > network device emulator is not going to bring down all the QEMU > > instances, just the one running the network device, and could be > > restarted without compromising the stability of the whole system. > > > > It is good for security, because you can limit what each QEMU process > > can do in a much more fine grained way. And of course on Xen you can go > > much farther by running the QEMU instances in different domains > > altogether. > > > > It is good for isolation because the QEMU processes don't need to be > > fully privileged and are completely isolated from one another so if a > > malicious guest manages to break into one of them, for example because > > the network device has a security vulnerability, it won't be able to > > cause issues to the others. > > I see. I think what we are discussing here is more along the lines > of decoding the request in QEMU and forwarding to another process > for slow-path setup. > > Do the bounce directly in kvm only for fast-path operations. So you would keep the PCI decoder in QEMU. However you would still need an interface to register more than one QEMU process with KVM for the fast-path operations, right? What do you think this interface would look like? ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:300] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 10:31 ` Stefano Stabellini 2013-05-29 12:25 ` Michael S. Tsirkin @ 2013-06-04 12:19 ` Luke Gorrie 2013-06-04 12:49 ` Julian Stecklina 2013-06-04 12:56 ` [Qemu-devel] [snabb-devel:300] " Michael S. Tsirkin 1 sibling, 2 replies; 49+ messages in thread From: Luke Gorrie @ 2013-06-04 12:19 UTC (permalink / raw) To: snabb-devel Cc: Stefano Stabellini, Michael S. Tsirkin, qemu-devel, julien.grall, Anthony Liguori, Julian Stecklina [-- Attachment #1: Type: text/plain, Size: 2211 bytes --] Howdy, My brain is slowly catching up with all of the information shared in this thread. Here is my first attempt to tease out a way forward for Snabb Switch. The idea that excites me is to implement a complete PCI device in Snabb Switch and expose this to the guest at the basic PCI/MMIO/DMA level. The device would be a Virtio network adapter based on Rusty Russell's specification. The switch<->VM interface would be based on PCI rather than vhost. I _think_ this is the basic idea that Stefano Stabellini and Julian Stecklina are talking about. I like this because: - The abstraction level is primarily PCI hardware devices (hardware) rather than system calls (kernel) as with vhost/socket/splice/etc. This is a much better fit for the Snabb Switch code, which is already doing physical network I/O based on built-in drivers built on PCI MMIO/DMA. I invest my energy in learning more about PCI and Virtio rather than Linux and QEMU. - The code feels more generic. The software we develop is a standard Virtio PCI network device rather than a specific QEMU-vhost interface. In principle (...) we could reuse the same code with more hypervisors in the future. - The code that I am not well positioned to write myself - the hypervisor side - may have already been written/prototyped by others and available for testing, even though it's not available in mainline QEMU. 
I have some questions, if you don't mind: 1. Have I understood the idea correctly above? (Or what do I have wrong?) 2. Is this PCI integration available in some code base that I could test with? e.g. non-mainline QEMU, Xen, vbox, VMware, etc? 3. If I hack a proof-of-concept what is most likely to go wrong in an OpenStack context? I mean - the "memory hotplug" and "track what is dirty" issues that are alluded to. Is my code going to run slowly? drop packets? break during migration? crash VMs? Long-term I do need a solution that works with standard mainline QEMU but I could also start with something more custom and revisit the whole issue next year. The most important thing now is to start making forward progress and have something working and performant this summer/autumn. Cheers & thanks for all the information, -Luke [-- Attachment #2: Type: text/html, Size: 2777 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
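The register interface Luke would be implementing for such a device is the legacy virtio PCI I/O BAR. The offsets below are from my recollection of the 0.9.x drafts of Rusty Russell's specification and should be double-checked against it; the toy register file around them is purely illustrative:

```python
# Legacy virtio PCI I/O BAR layout (offsets from memory of the virtio
# 0.9.x spec -- verify against the spec before relying on them).
VIRTIO_HOST_FEATURES  = 0x00  # 32-bit RO: features the device offers
VIRTIO_GUEST_FEATURES = 0x04  # 32-bit RW: features the driver accepts
VIRTIO_QUEUE_PFN      = 0x08  # 32-bit RW: guest page frame of the vring
VIRTIO_QUEUE_SIZE     = 0x0C  # 16-bit RO: ring size of the selected queue
VIRTIO_QUEUE_SEL      = 0x0E  # 16-bit RW: which queue 0x08/0x0C refer to
VIRTIO_QUEUE_NOTIFY   = 0x10  # 16-bit WO: driver "kicks" a queue here
VIRTIO_STATUS         = 0x12  # 8-bit  RW: ACKNOWLEDGE/DRIVER/DRIVER_OK bits
VIRTIO_ISR            = 0x13  # 8-bit  RO: interrupt status, cleared on read

class ToyVirtioNetRegs:
    """Toy register file. A real device would back QUEUE_NOTIFY with avail-ring
    processing and QUEUE_PFN with a mapping of the vring in guest memory."""

    def __init__(self):
        self.status = 0
        self.queue_sel = 0
        self.queue_pfn = [0, 0]        # virtio-net: queue 0 = rx, queue 1 = tx
        self.queue_size = [256, 256]
        self.kicks = []

    def io_write(self, offset, value):
        if offset == VIRTIO_QUEUE_SEL:
            self.queue_sel = value
        elif offset == VIRTIO_QUEUE_PFN:
            self.queue_pfn[self.queue_sel] = value
        elif offset == VIRTIO_QUEUE_NOTIFY:
            self.kicks.append(value)   # real code: scan the avail ring now
        elif offset == VIRTIO_STATUS:
            self.status = value

    def io_read(self, offset):
        if offset == VIRTIO_QUEUE_SIZE:
            return self.queue_size[self.queue_sel]
        if offset == VIRTIO_STATUS:
            return self.status
        return 0
```

The appeal for a user-space switch is that this surface is small and hypervisor-agnostic: whoever forwards the guest's I/O accesses (QEMU, Xen, or anything else) only needs to deliver reads and writes at these offsets plus DMA access to guest memory.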
* Re: [Qemu-devel] [snabb-devel:300] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-06-04 12:19 ` [Qemu-devel] [snabb-devel:300] " Luke Gorrie @ 2013-06-04 12:49 ` Julian Stecklina 2013-06-04 20:09 ` [Qemu-devel] [snabb-devel:326] " Luke Gorrie 2013-06-04 12:56 ` [Qemu-devel] [snabb-devel:300] " Michael S. Tsirkin 1 sibling, 1 reply; 49+ messages in thread From: Julian Stecklina @ 2013-06-04 12:49 UTC (permalink / raw) To: Luke Gorrie Cc: snabb-devel, Stefano Stabellini, Michael S. Tsirkin, qemu-devel, julien.grall, Anthony Liguori -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 06/04/2013 02:19 PM, Luke Gorrie wrote: > The idea that excites me is to implement a complete PCI device in > Snabb Switch and expose this to the guest at the basic PCI/MMIO/DMA > level. The device would be a Virtio network adapter based on Rusty > Russell's specification. The switch<->VM interface would be based > on PCI rather than vhost. > > I _think_ this is the basic idea that Stefano Stabellini and > Julian Stecklina are talking about. Yes. Btw, progress is being made. Albeit a bit slower than expected. I will have to show something "soon". [...] > I have some questions, if you don't mind: > > 1. Have I understood the idea correctly above? (Or what do I have > wrong?) AFAICS yes. > 3. If I hack a proof-of-concept what is most likely to go wrong in > an OpenStack context? I mean - the "memory hotplug" and "track what > is dirty" issues that are alluded to. Is my code going to run > slowly? drop packets? break during migration? crash VMs? In the earliest implementation it will probably break during migration. I hope this can be fixed, but since I don't understand the magic qemu does for migration these days, I might be wrong. 
Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:326] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-06-04 12:49 ` Julian Stecklina @ 2013-06-04 20:09 ` Luke Gorrie 0 siblings, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-06-04 20:09 UTC (permalink / raw) To: snabb-devel Cc: julien.grall, Stefano Stabellini, qemu-devel, Anthony Liguori, Michael S. Tsirkin [-- Attachment #1: Type: text/plain, Size: 399 bytes --] On 4 June 2013 14:49, Julian Stecklina <jsteckli@os.inf.tu-dresden.de> wrote: > Yes. Btw, progress is being made. Albeit a bit slower than expected. I > will have to show something "soon". > Awesome! It's already about 9 months since I did my simple proof-of-concept integration (https://github.com/SnabbCo/QEMU/compare/master...shm) so I am not moving at a breakneck pace on this stuff myself :) [-- Attachment #2: Type: text/html, Size: 978 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:300] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-06-04 12:19 ` [Qemu-devel] [snabb-devel:300] " Luke Gorrie 2013-06-04 12:49 ` Julian Stecklina @ 2013-06-04 12:56 ` Michael S. Tsirkin 2013-06-05 6:09 ` [Qemu-devel] [snabb-devel:327] " Luke Gorrie 1 sibling, 1 reply; 49+ messages in thread From: Michael S. Tsirkin @ 2013-06-04 12:56 UTC (permalink / raw) To: Luke Gorrie Cc: snabb-devel, Stefano Stabellini, qemu-devel, julien.grall, Anthony Liguori, Julian Stecklina On Tue, Jun 04, 2013 at 02:19:23PM +0200, Luke Gorrie wrote: > The idea that excites me is to implement a complete PCI device in Snabb Switch > and expose this to the guest at the basic PCI/MMIO/DMA level. That would mean making snabb switch part of QEMU. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:327] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-06-04 12:56 ` [Qemu-devel] [snabb-devel:300] " Michael S. Tsirkin @ 2013-06-05 6:09 ` Luke Gorrie 0 siblings, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-06-05 6:09 UTC (permalink / raw) To: snabb-devel Cc: julien.grall, Stefano Stabellini, qemu-devel, Anthony Liguori, Julian Stecklina [-- Attachment #1: Type: text/plain, Size: 681 bytes --] On 4 June 2013 14:56, Michael S. Tsirkin <mst@redhat.com> wrote: > That would mean making snabb switch part of QEMU. > Just curious - not suggesting that this is practical - but what would that mean? Is the important thing to keep all device implementations in the same source tree so that QEMU developers can take responsibility for everything working? Or is it that the Snabb Switch code would need to execute inside the QEMU process at runtime? Snabb Switch is actually reasonably embeddable: less than 1MB, single threaded, hardly makes any system calls. The one "big dependency" we have is LuaJIT (luajit.org) but that is routinely embedded in video games and such like. [-- Attachment #2: Type: text/html, Size: 1236 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 17:17 ` Michael S. Tsirkin 2013-05-28 18:55 ` Anthony Liguori @ 2013-05-29 7:49 ` Stefan Hajnoczi 2013-05-29 9:08 ` Michael S. Tsirkin 1 sibling, 1 reply; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-29 7:49 UTC (permalink / raw) To: Michael S. Tsirkin Cc: snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Tue, May 28, 2013 at 08:17:42PM +0300, Michael S. Tsirkin wrote: > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > > > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > > >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > > >> <mailto:stefanha@redhat.com>> wrote: > > >> > > >> vhost_net is about connecting the a virtio-net speaking process to a > > >> tun-like device. The problem you are trying to solve is connecting a > > >> virtio-net speaking process to Snabb Switch. > > >> > > >> > > >> Yep! > > > > > > Since I am on a similar path as Luke, let me share another idea. > > > > > > What about extending qemu in a way to allow PCI device models to be > > > implemented in another process. > > > > We aren't going to support any interface that enables out of tree > > devices. This is just plugins in a different form with even more > > downsides. You cannot easily keep track of dirty info, the guest > > physical address translation to host is difficult to keep in sync > > (imagine the complexity of memory hotplug). > > > > Basically, it's easy to hack up but extremely hard to do something that > > works correctly overall. > > > > There isn't a compelling reason to implement something like this other > > than avoiding getting code into QEMU. Best to just submit your device > > to QEMU for inclusion. > > > > If you want to avoid copying in a vswitch, better to use something like > > vmsplice as I outlined in another thread. > > > > > This is not as hard as it may sound. 
> > > qemu would open a domain socket to this process and map VM memory over > > > to the other side. This can be accomplished by having file descriptors > > > in qemu to VM memory (reusing -mem-path code) and passing those over the > > > domain socket. The other side can then just mmap them. The socket would > > > also be used for configuration and I/O by the guest on the PCI > > > I/O/memory regions. You could also use this to do IRQs or use eventfds, > > > whatever works better. > > > > > > To have a zero copy userspace switch, the switch would offer virtio-net > > > devices to any qemu that wants to connect to it and implement the > > > complete device logic itself. Since it has access to all guest memory, > > > it can just do memcpy for packet data. Of course, this only works for > > > 64-bit systems, because you need vast amounts of virtual address space. > > > In my experience, doing this in userspace is _way less painful_. > > > > > > If you can get away with polling in the switch the overhead of doing all > > > this in userspace is zero. And as long as you can rate-limit explicit > > > notifications over the socket even that overhead should be okay. > > > > > > Opinions? > > > > I don't see any compelling reason to do something like this. It's > > jumping through a tremendous number of hoops to avoid putting code that > > belongs in QEMU in tree. > > > > Regards, > > > > Anthony Liguori > > > > > > > > Julian > > OTOH an in-tree device that runs in a separate process would > be useful e.g. for security. > For example, we could limit a virtio-net device process > to only access tap and vhost files. For tap or vhost files only this is good for security. I'm not sure it has many advantages over a QEMU process under SELinux though. Obviously when the switch process has shared memory access to multiple guests' RAM, the security is worse than a QEMU process solution but better than a vhost kernel solution. So the security story is not a clear win. 
Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
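The shared-memory mechanism in Julian's proposal quoted above — pass the guest-RAM file descriptors over a Unix domain socket and mmap them in the switch process — is mechanically simple. A minimal sketch, using a temporary file as a stand-in for QEMU's -mem-path backing file; it assumes a Unix system and Python 3.9+ for `socket.send_fds`/`recv_fds` (SCM_RIGHTS under the hood):

```python
import mmap
import os
import socket
import tempfile

def share_guest_ram(size=4096):
    """Hand a guest-RAM file descriptor to a 'switch' over a Unix socket.

    The temp file stands in for QEMU's -mem-path backing file. In the
    scheme quoted above, QEMU passes the fd over a domain socket and the
    switch mmaps it, so both processes see the same pages with no copy.
    """
    backing = tempfile.TemporaryFile()
    backing.truncate(size)

    qemu_side, switch_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    socket.send_fds(qemu_side, [b"ram"], [backing.fileno()])
    _, fds, _, _ = socket.recv_fds(switch_side, 16, 1)
    qemu_side.close()
    switch_side.close()

    guest_view = mmap.mmap(backing.fileno(), size)  # the "guest" writes here
    switch_view = mmap.mmap(fds[0], size)           # the switch sees the same pages
    os.close(fds[0])                                # mmap holds its own reference
    return guest_view, switch_view
```

After this handshake, packet data written through one mapping is immediately visible through the other — which is exactly why the security discussion above matters: the receiving process gains read/write access to everything in that memory, not just packet buffers.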
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 7:49 ` [Qemu-devel] " Stefan Hajnoczi @ 2013-05-29 9:08 ` Michael S. Tsirkin 2013-05-29 14:21 ` Stefan Hajnoczi 0 siblings, 1 reply; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-29 9:08 UTC (permalink / raw) To: Stefan Hajnoczi Cc: snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Wed, May 29, 2013 at 09:49:29AM +0200, Stefan Hajnoczi wrote: > On Tue, May 28, 2013 at 08:17:42PM +0300, Michael S. Tsirkin wrote: > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > > Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > > > > > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > > > >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > > > >> <mailto:stefanha@redhat.com>> wrote: > > > >> > > > >> vhost_net is about connecting the a virtio-net speaking process to a > > > >> tun-like device. The problem you are trying to solve is connecting a > > > >> virtio-net speaking process to Snabb Switch. > > > >> > > > >> > > > >> Yep! > > > > > > > > Since I am on a similar path as Luke, let me share another idea. > > > > > > > > What about extending qemu in a way to allow PCI device models to be > > > > implemented in another process. > > > > > > We aren't going to support any interface that enables out of tree > > > devices. This is just plugins in a different form with even more > > > downsides. You cannot easily keep track of dirty info, the guest > > > physical address translation to host is difficult to keep in sync > > > (imagine the complexity of memory hotplug). > > > > > > Basically, it's easy to hack up but extremely hard to do something that > > > works correctly overall. > > > > > > There isn't a compelling reason to implement something like this other > > > than avoiding getting code into QEMU. Best to just submit your device > > > to QEMU for inclusion. 
> > > > > > If you want to avoid copying in a vswitch, better to use something like > > > vmsplice as I outlined in another thread. > > > > > > > This is not as hard as it may sound. > > > > qemu would open a domain socket to this process and map VM memory over > > > > to the other side. This can be accomplished by having file descriptors > > > > in qemu to VM memory (reusing -mem-path code) and passing those over the > > > > domain socket. The other side can then just mmap them. The socket would > > > > also be used for configuration and I/O by the guest on the PCI > > > > I/O/memory regions. You could also use this to do IRQs or use eventfds, > > > > whatever works better. > > > > > > > > To have a zero copy userspace switch, the switch would offer virtio-net > > > > devices to any qemu that wants to connect to it and implement the > > > > complete device logic itself. Since it has access to all guest memory, > > > > it can just do memcpy for packet data. Of course, this only works for > > > > 64-bit systems, because you need vast amounts of virtual address space. > > > > In my experience, doing this in userspace is _way less painful_. > > > > > > > > If you can get away with polling in the switch the overhead of doing all > > > > this in userspace is zero. And as long as you can rate-limit explicit > > > > notifications over the socket even that overhead should be okay. > > > > > > > > Opinions? > > > > > > I don't see any compelling reason to do something like this. It's > > > jumping through a tremendous number of hoops to avoid putting code that > > > belongs in QEMU in tree. > > > > > > Regards, > > > > > > Anthony Liguori > > > > > > > > > > > Julian > > > > OTOH an in-tree device that runs in a separate process would > > be useful e.g. for security. > > For example, we could limit a virtio-net device process > > to only access tap and vhost files. > > For tap or vhost files only this is good for security. 
I'm not sure it > has many advantages over a QEMU process under SELinux though. At the moment SELinux necessarily gives QEMU rights to e.g. access the filesystem. This process would only get access to tap and vhost. We can also run it as a different user. Defence in depth. We can also limit e.g. the CPU of this process aggressively (as it's not doing anything on data path). I could go on. And it's really easy too, until you want to use it in production, at which point you need to cover lots of nasty details like hotplug and migration. > Obviously when the switch process has shared memory access to multiple > guests' RAM, the security is worse than a QEMU process solution but > better than a vhost kernel solution. > So the security story is not a clear win. > > Stefan How exactly you pass packets between guest and host is very unlikely to affect your security in a meaningful way. Except, if you lose networking, or if it's just slow beyond any measure, you are suddenly more secure against network-based attacks. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 9:08 ` Michael S. Tsirkin @ 2013-05-29 14:21 ` Stefan Hajnoczi 2013-05-29 14:48 ` Michael S. Tsirkin 2013-05-29 16:02 ` Julian Stecklina 0 siblings, 2 replies; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-29 14:21 UTC (permalink / raw) To: Michael S. Tsirkin Cc: snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Wed, May 29, 2013 at 12:08:59PM +0300, Michael S. Tsirkin wrote: > On Wed, May 29, 2013 at 09:49:29AM +0200, Stefan Hajnoczi wrote: > > On Tue, May 28, 2013 at 08:17:42PM +0300, Michael S. Tsirkin wrote: > > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > > > Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > > > > > > > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > > > > >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > > > > >> <mailto:stefanha@redhat.com>> wrote: > > > > >> > > > > >> vhost_net is about connecting the a virtio-net speaking process to a > > > > >> tun-like device. The problem you are trying to solve is connecting a > > > > >> virtio-net speaking process to Snabb Switch. > > > > >> > > > > >> > > > > >> Yep! > > > > > > > > > > Since I am on a similar path as Luke, let me share another idea. > > > > > > > > > > What about extending qemu in a way to allow PCI device models to be > > > > > implemented in another process. > > > > > > > > We aren't going to support any interface that enables out of tree > > > > devices. This is just plugins in a different form with even more > > > > downsides. You cannot easily keep track of dirty info, the guest > > > > physical address translation to host is difficult to keep in sync > > > > (imagine the complexity of memory hotplug). > > > > > > > > Basically, it's easy to hack up but extremely hard to do something that > > > > works correctly overall. 
> > > > > > > > There isn't a compelling reason to implement something like this other > > > > than avoiding getting code into QEMU. Best to just submit your device > > > > to QEMU for inclusion. > > > > > > > > If you want to avoid copying in a vswitch, better to use something like > > > > vmsplice as I outlined in another thread. > > > > > > > > > This is not as hard as it may sound. > > > > > qemu would open a domain socket to this process and map VM memory over > > > > > to the other side. This can be accomplished by having file descriptors > > > > > in qemu to VM memory (reusing -mem-path code) and passing those over the > > > > > domain socket. The other side can then just mmap them. The socket would > > > > > also be used for configuration and I/O by the guest on the PCI > > > > > I/O/memory regions. You could also use this to do IRQs or use eventfds, > > > > > whatever works better. > > > > > > > > > > To have a zero copy userspace switch, the switch would offer virtio-net > > > > > devices to any qemu that wants to connect to it and implement the > > > > > complete device logic itself. Since it has access to all guest memory, > > > > > it can just do memcpy for packet data. Of course, this only works for > > > > > 64-bit systems, because you need vast amounts of virtual address space. > > > > > In my experience, doing this in userspace is _way less painful_. > > > > > > > > > > If you can get away with polling in the switch the overhead of doing all > > > > > this in userspace is zero. And as long as you can rate-limit explicit > > > > > notifications over the socket even that overhead should be okay. > > > > > > > > > > Opinions? > > > > > > > > I don't see any compelling reason to do something like this. It's > > > > jumping through a tremendous number of hoops to avoid putting code that > > > > belongs in QEMU in tree. 
> > > > > > > > Regards, > > > > > > > > Anthony Liguori > > > > > > > > > > > > > > Julian > > > > > > OTOH an in-tree device that runs in a separate process would > > > be useful e.g. for security. > > > For example, we could limit a virtio-net device process > > > to only access tap and vhost files. > > > > For tap or vhost files only this is good for security. I'm not sure it > > has many advantages over a QEMU process under SELinux though. > > At the moment SELinux necessarily gives QEMU rights to > e.g. access the filesystem. > This process would only get access to tap and vhost. > > We can also run it as a different user. > Defence in depth. > > We can also limit e.g. the CPU of this process aggressively > (as it's not doing anything on data path). > > I could go on. > > And it's really easy too, until you want to use it in production, > at which point you need to cover lots of > nasty details like hotplug and migration. I think there are diminishing returns. Once QEMU is isolated so it cannot open arbitrary files, just has access to the resources granted by the management tool on startup, etc then I'm not sure it's worth the complexity and performance-cost of splitting the model up into even smaller pieces. IMO there isn't a trust boundary that's worth isolating here (compare to sshd privilege separation where separate uids really make sense and are necessary, with QEMU having multiple uids that lack capabilities to do much doesn't win much over the SELinux setup). > > Obviously when the switch process has shared memory access to multiple > > guests' RAM, the security is worse than a QEMU process solution but > > better than a vhost kernel solution. > > So the security story is not a clear win. > > > > Stefan > > How exactly you pass packets between guest and host is very unlikely to > affect your security in a meaningful way. 
> > Except, if you lose networking, orif it's just slow beyond any measure, > you are suddenly more secure against network-based attacks. The fact that a single switch process has shared memory access to all guests' RAM is critical. If the switch process is exploited, then that exposes other guests' data! (Think of a multi-tenant host with guests belonging to different users.) Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 14:21 ` Stefan Hajnoczi @ 2013-05-29 14:48 ` Michael S. Tsirkin 2013-05-29 16:02 ` Julian Stecklina 1 sibling, 0 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-29 14:48 UTC (permalink / raw) To: Stefan Hajnoczi Cc: snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Wed, May 29, 2013 at 04:21:43PM +0200, Stefan Hajnoczi wrote: > On Wed, May 29, 2013 at 12:08:59PM +0300, Michael S. Tsirkin wrote: > > On Wed, May 29, 2013 at 09:49:29AM +0200, Stefan Hajnoczi wrote: > > > On Tue, May 28, 2013 at 08:17:42PM +0300, Michael S. Tsirkin wrote: > > > > On Tue, May 28, 2013 at 12:00:38PM -0500, Anthony Liguori wrote: > > > > > Julian Stecklina <jsteckli@os.inf.tu-dresden.de> writes: > > > > > > > > > > > On 05/28/2013 12:10 PM, Luke Gorrie wrote: > > > > > >> On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com > > > > > >> <mailto:stefanha@redhat.com>> wrote: > > > > > >> > > > > > >> vhost_net is about connecting the a virtio-net speaking process to a > > > > > >> tun-like device. The problem you are trying to solve is connecting a > > > > > >> virtio-net speaking process to Snabb Switch. > > > > > >> > > > > > >> > > > > > >> Yep! > > > > > > > > > > > > Since I am on a similar path as Luke, let me share another idea. > > > > > > > > > > > > What about extending qemu in a way to allow PCI device models to be > > > > > > implemented in another process. > > > > > > > > > > We aren't going to support any interface that enables out of tree > > > > > devices. This is just plugins in a different form with even more > > > > > downsides. You cannot easily keep track of dirty info, the guest > > > > > physical address translation to host is difficult to keep in sync > > > > > (imagine the complexity of memory hotplug). > > > > > > > > > > Basically, it's easy to hack up but extremely hard to do something that > > > > > works correctly overall. 
> > > > > > > > > > There isn't a compelling reason to implement something like this other > > > > > than avoiding getting code into QEMU. Best to just submit your device > > > > > to QEMU for inclusion. > > > > > > > > > > If you want to avoid copying in a vswitch, better to use something like > > > > > vmsplice as I outlined in another thread. > > > > > > > > > > > This is not as hard as it may sound. > > > > > > qemu would open a domain socket to this process and map VM memory over > > > > > > to the other side. This can be accomplished by having file descriptors > > > > > > in qemu to VM memory (reusing -mem-path code) and passing those over the > > > > > > domain socket. The other side can then just mmap them. The socket would > > > > > > also be used for configuration and I/O by the guest on the PCI > > > > > > I/O/memory regions. You could also use this to do IRQs or use eventfds, > > > > > > whatever works better. > > > > > > > > > > > > To have a zero copy userspace switch, the switch would offer virtio-net > > > > > > devices to any qemu that wants to connect to it and implement the > > > > > > complete device logic itself. Since it has access to all guest memory, > > > > > > it can just do memcpy for packet data. Of course, this only works for > > > > > > 64-bit systems, because you need vast amounts of virtual address space. > > > > > > In my experience, doing this in userspace is _way less painful_. > > > > > > > > > > > > If you can get away with polling in the switch the overhead of doing all > > > > > > this in userspace is zero. And as long as you can rate-limit explicit > > > > > > notifications over the socket even that overhead should be okay. > > > > > > > > > > > > Opinions? > > > > > > > > > > I don't see any compelling reason to do something like this. It's > > > > > jumping through a tremendous number of hoops to avoid putting code that > > > > > belongs in QEMU in tree. 
> > > > > > > > > > Regards, > > > > > > > > > > Anthony Liguori > > > > > > > > > > > > > > > > > Julian > > > > > > > > OTOH an in-tree device that runs in a separate process would > > > > be useful e.g. for security. > > > > For example, we could limit a virtio-net device process > > > > to only access tap and vhost files. > > > > > > For tap or vhost files only this is good for security. I'm not sure it > > > has many advantages over a QEMU process under SELinux though. > > > > At the moment SELinux necessarily gives QEMU rights to > > e.g. access the filesystem. > > This process would only get access to tap and vhost. > > > > We can also run it as a different user. > > Defence in depth. > > > > We can also limit e.g. the CPU of this process aggressively > > (as it's not doing anything on data path). > > > > I could go on. > > > > And it's really easy too, until you want to use it in production, > > at which point you need to cover lots of > > nasty details like hotplug and migration. > > I think there are diminishing returns. Once QEMU is isolated so it > cannot open arbitrary files, just has access to the resources granted by > the management tool on startup, etc then I'm not sure it's worth the > complexity and performance-cost of splitting the model up into even > smaller pieces. Well, this part is network-facing so there is some value, to isolate it, I don't know how big it is. > IMO there isn't a trust boundary that's worth isolating > here (compare to sshd privilege separation where separate uids really > make sense and are necessary, with QEMU having multiple uids that lack > capabilities to do much doesn't win much over the SELinux setup). > > > > Obviously when the switch process has shared memory access to multiple > > > guests' RAM, the security is worse than a QEMU process solution but > > > better than a vhost kernel solution. > > > So the security story is not a clear win. 
> > > > > > Stefan > > > > How exactly you pass packets between guest and host is very unlikely to > > affect your security in a meaningful way. > > > > Except, if you lose networking, or if it's just slow beyond any measure, > > you are suddenly more secure against network-based attacks. > > The fact that a single switch process has shared memory access to all > guests' RAM is critical. If the switch process is exploited, then that > exposes other guests' data! (Think of a multi-tenant host with guests > belonging to different users.) > > Stefan Well, local privilege escalation bugs are common enough that you should be very careful in any network-facing application, whether it has access to all guests when well-behaved or not. -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 14:21 ` Stefan Hajnoczi 2013-05-29 14:48 ` Michael S. Tsirkin @ 2013-05-29 16:02 ` Julian Stecklina 2013-05-30 2:35 ` ronnie sahlberg 2013-05-30 6:46 ` Stefan Hajnoczi 1 sibling, 2 replies; 49+ messages in thread From: Julian Stecklina @ 2013-05-29 16:02 UTC (permalink / raw) To: Stefan Hajnoczi Cc: snabb-devel, qemu-devel, Anthony Liguori, Michael S. Tsirkin [-- Attachment #1: Type: text/plain, Size: 462 bytes --] On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote: > The fact that a single switch process has shared memory access to all > guests' RAM is critical. If the switch process is exploited, then that > exposes other guests' data! (Think of a multi-tenant host with guests > belonging to different users.) True. But people don't mind having instruction decoding and half of virtio in the kernel these days, so it can't be that security critical... Julian [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 16:02 ` Julian Stecklina @ 2013-05-30 2:35 ` ronnie sahlberg 2013-05-30 6:46 ` Stefan Hajnoczi 1 sibling, 0 replies; 49+ messages in thread From: ronnie sahlberg @ 2013-05-30 2:35 UTC (permalink / raw) To: Julian Stecklina Cc: Stefan Hajnoczi, snabb-devel, qemu-devel, Anthony Liguori, Michael S. Tsirkin Julian, Stefan's concerns are valid. (Hopefully, the kernel is harder to exploit and more carefully audited.) On Wed, May 29, 2013 at 9:02 AM, Julian Stecklina <jsteckli@os.inf.tu-dresden.de> wrote: > On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote: >> The fact that a single switch process has shared memory access to all >> guests' RAM is critical. If the switch process is exploited, then that >> exposes other guests' data! (Think of a multi-tenant host with guests >> belonging to different users.) > > True. But people don't mind having instruction decoding and half of > virtio in the kernel these days, so it can't be that security critical... > > Julian > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 16:02 ` Julian Stecklina 2013-05-30 2:35 ` ronnie sahlberg @ 2013-05-30 6:46 ` Stefan Hajnoczi 2013-05-30 6:55 ` Michael S. Tsirkin ` (2 more replies) 1 sibling, 3 replies; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-30 6:46 UTC (permalink / raw) To: Julian Stecklina Cc: snabb-devel, qemu-devel, Anthony Liguori, Michael S. Tsirkin On Wed, May 29, 2013 at 6:02 PM, Julian Stecklina <jsteckli@os.inf.tu-dresden.de> wrote: > On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote: >> The fact that a single switch process has shared memory access to all >> guests' RAM is critical. If the switch process is exploited, then that >> exposes other guests' data! (Think of a multi-tenant host with guests >> belonging to different users.) > > True. But people don't mind having instruction decoding and half of > virtio in the kernel these days, so it can't be that security critical... No, it's still security critical. If there were equivalent solutions with better security then I'm sure people would accept them. It's just that there isn't an equivalent solution yet :). Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-30 6:46 ` Stefan Hajnoczi @ 2013-05-30 6:55 ` Michael S. Tsirkin 2013-05-30 7:11 ` [Qemu-devel] [snabb-devel:308] " Luke Gorrie 2013-05-30 8:08 ` [Qemu-devel] " Julian Stecklina 2 siblings, 0 replies; 49+ messages in thread From: Michael S. Tsirkin @ 2013-05-30 6:55 UTC (permalink / raw) To: Stefan Hajnoczi Cc: snabb-devel, qemu-devel, Anthony Liguori, Julian Stecklina On Thu, May 30, 2013 at 08:46:42AM +0200, Stefan Hajnoczi wrote: > On Wed, May 29, 2013 at 6:02 PM, Julian Stecklina > <jsteckli@os.inf.tu-dresden.de> wrote: > > On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote: > >> The fact that a single switch process has shared memory access to all > >> guests' RAM is critical. If the switch process is exploited, then that > >> exposes other guests' data! (Think of a multi-tenant host with guests > >> belonging to different users.) > > > > True. But people don't mind having instruction decoding and half of > > virtio in the kernel these days, so it can't be that security critical... > > No, it's still security critical. If there were equivalent solutions > with better security then I'm sure people would accept them. It's > just that there isn't an equivalent solution yet :). > > Stefan Some people would accept them. Others run with selinux off ... -- MST ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] [snabb-devel:308] Re: snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-30 6:46 ` Stefan Hajnoczi 2013-05-30 6:55 ` Michael S. Tsirkin @ 2013-05-30 7:11 ` Luke Gorrie 2013-05-30 8:08 ` [Qemu-devel] " Julian Stecklina 2 siblings, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-05-30 7:11 UTC (permalink / raw) To: snabb-devel Cc: Michael S. Tsirkin, qemu-devel, Anthony Liguori, Julian Stecklina [-- Attachment #1: Type: text/plain, Size: 935 bytes --] On 30 May 2013 08:46, Stefan Hajnoczi <stefanha@gmail.com> wrote: > No, it's still security critical. If there were equivalent solutions > with better security then I'm sure people would accept them. It's > just that there isn't an equivalent solution yet :). > Security-wise this is where I would like to be in the long term: 1. Userspace switch running non-root. 2. Shared memory mappings with VMs for exactly the memory that will be used for packet buffers. (They could map it from me...) 3. IOMMU access to exactly the PCI devices under the switch's control. Again my perspective is pretty hardware-centric rather than kernel-centric i.e. I'm thinking more in terms of MMU/IOMMU than Unix system calls and permissions. Short-term agenda is to build something good enough that people will _want_ to spend the energy to make it highly secure (and highly portable, etc). So I run as root and use every trick in the sysfs book. [-- Attachment #2: Type: text/html, Size: 1521 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-30 6:46 ` Stefan Hajnoczi 2013-05-30 6:55 ` Michael S. Tsirkin 2013-05-30 7:11 ` [Qemu-devel] [snabb-devel:308] " Luke Gorrie @ 2013-05-30 8:08 ` Julian Stecklina 2 siblings, 0 replies; 49+ messages in thread From: Julian Stecklina @ 2013-05-30 8:08 UTC (permalink / raw) To: Stefan Hajnoczi Cc: snabb-devel, qemu-devel, Anthony Liguori, Michael S. Tsirkin -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 05/30/2013 08:46 AM, Stefan Hajnoczi wrote: > On Wed, May 29, 2013 at 6:02 PM, Julian Stecklina > <jsteckli@os.inf.tu-dresden.de> wrote: >> On 05/29/2013 04:21 PM, Stefan Hajnoczi wrote: >>> The fact that a single switch process has shared memory access >>> to all guests' RAM is critical. If the switch process is >>> exploited, then that exposes other guests' data! (Think of a >>> multi-tenant host with guests belonging to different users.) >> >> True. But people don't mind having instruction decoding and half >> of virtio in the kernel these days, so it can't be that security >> critical... > > No, it's still security critical. If there were equivalent > solutions with better security then I'm sure people would accept > them. It's just that there isn't an equivalent solution yet :). My comment was more or less meant in a resigned way. ;) At least we are not putting HTTP servers in there any more. Julian -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEARECAAYFAlGnCRMACgkQ2EtjUdW3H9mzFwCghZxvckYgZ4atLm3HLPPWF/Lb 688AnRXm12jbBlmCVOKSaDUHHejEdh7O =csrK -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 17:00 ` [Qemu-devel] " Anthony Liguori 2013-05-28 17:17 ` Michael S. Tsirkin @ 2013-05-29 12:32 ` Julian Stecklina 2013-05-29 14:31 ` Stefan Hajnoczi 1 sibling, 1 reply; 49+ messages in thread From: Julian Stecklina @ 2013-05-29 12:32 UTC (permalink / raw) To: qemu-devel On 05/28/2013 07:00 PM, Anthony Liguori wrote: > We aren't going to support any interface that enables out of tree > devices. This is just plugins in a different form with even more > downsides. You cannot easily keep track of dirty info, the guest > physical address translation to host is difficult to keep in sync > (imagine the complexity of memory hotplug). Is there a document describing the current qemu VM migration process, especially the dirty page tracking, apart from the code? As a side note: The downsides of a naive approach are about the same as those of PCI passthrough devices, with the opportunity to fix them later on. That being said, I can live with this not being included in mainline qemu. Julian ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 12:32 ` Julian Stecklina @ 2013-05-29 14:31 ` Stefan Hajnoczi 2013-05-29 15:59 ` Julian Stecklina 0 siblings, 1 reply; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-29 14:31 UTC (permalink / raw) To: Julian Stecklina; +Cc: qemu-devel On Wed, May 29, 2013 at 02:32:50PM +0200, Julian Stecklina wrote: > On 05/28/2013 07:00 PM, Anthony Liguori wrote: > > We aren't going to support any interface that enables out of tree > > devices. This is just plugins in a different form with even more > > downsides. You cannot easily keep track of dirty info, the guest > > physical address translation to host is difficult to keep in sync > > (imagine the complexity of memory hotplug). > > Is there a document describing the current qemu VM migration process, > especially the dirty page tracking, apart from the code? > > As a side note: The downsides of a naive approach are about the same as those of PCI > passthrough devices, with the opportunity to fix them later on. That being > said, I can live with this not being included in mainline qemu. Asking for documentation and in the next paragraph stating you don't mind keeping patches out-of-tree... The QEMU community exists because people are willing to contribute. If you intend to only "take" and not "give", then you'll find that people gradually stop responding to your emails. Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-29 14:31 ` Stefan Hajnoczi @ 2013-05-29 15:59 ` Julian Stecklina 0 siblings, 0 replies; 49+ messages in thread From: Julian Stecklina @ 2013-05-29 15:59 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 1519 bytes --] On 05/29/2013 04:31 PM, Stefan Hajnoczi wrote: > On Wed, May 29, 2013 at 02:32:50PM +0200, Julian Stecklina wrote: >> On 05/28/2013 07:00 PM, Anthony Liguori wrote: >>> We aren't going to support any interface that enables out of tree >>> devices. This is just plugins in a different form with even more >>> downsides. You cannot easily keep track of dirty info, the guest >>> physical address translation to host is difficult to keep in sync >>> (imagine the complexity of memory hotplug). >> >> Is there a document describing the current qemu VM migration process, >> especially the dirty page tracking, except the code? >> >> As a side note: The downsides of a naive approach are about that of PCI >> passtrough devices with the opportunity to fix them later on. That being >> said, I can live with this not being included in mainline qemu. > > Asking for documentation and in the next paragraph stating you don't > mind keeping patches out-of-tree... > > The QEMU community exists because people are willing to contribute. If > you intend to only "take" and not "give", then you'll find that people > gradually stop responding to your emails. I am not saying that this will be closed in any way. I'll create a github repo once I have to show something. If there is interest in getting this into mainline qemu, then why not. Currently, it seems that there is a categorical rejection of such functionality. But this discussion is moot until there is working code. ;) Julian [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 10:10 ` Luke Gorrie 2013-05-28 10:35 ` Stefan Hajnoczi 2013-05-28 11:36 ` Julian Stecklina @ 2013-05-28 11:58 ` Stefan Hajnoczi 2013-10-21 10:29 ` Luke Gorrie 2 siblings, 1 reply; 49+ messages in thread From: Stefan Hajnoczi @ 2013-05-28 11:58 UTC (permalink / raw) To: Luke Gorrie; +Cc: snabb-devel, qemu-devel, mst On Tue, May 28, 2013 at 12:10:50PM +0200, Luke Gorrie wrote: > On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > vhost_net is about connecting a virtio-net speaking process to a > > tun-like device. The problem you are trying to solve is connecting a > > virtio-net speaking process to Snabb Switch. > > > > Yep! > > > > Either you need to replace vhost or you need a tun-like device > > interface. > > > > Replacing vhost would mean that your switch implements virtio-net, > > shares guest RAM with the guest, and shares the ioeventfd and irqfd > > which are used to signal with the guest. > > > This would be a great solution from my perspective. This is the design that > I am now struggling to find a good implementation strategy for. The switch needs 3 resources for direct virtio-net communication with the guest: 1. Shared memory access to guest physical memory for guest physical to host userspace address translation. vhost and data plane automatically get access to guest memory and they learn about memory layout using the MemoryListener interface in QEMU (see hw/virtio/vhost.c:vhost_region_add() and friends). 2. Virtqueue kick notifier (ioeventfd) so the switch knows when the guest signals the host. See virtio_queue_get_host_notifier(vq). 3. Guest interrupt notifier (irqfd) so the switch can signal the guest. See virtio_queue_get_guest_notifier(vq). I don't have a detailed suggestion for how to interface the switch and QEMU processes.
It may be necessary to communicate back and forth (to handle the virtio device lifecycle) so a UNIX domain socket would be appropriate for passing file descriptors. Here is a rough idea: $ switch --listen-path=/var/run/switch.sock $ qemu --device virtio-net-pci,switch=/var/run/switch.sock On QEMU startup: (switch socket) add_port --id="qemu-$PID" --session-persistence (Here --session-persistence means that the port will be automatically destroyed if the switch socket session is terminated because the UNIX domain socket is closed by QEMU.) On virtio device status transition to DRIVER_OK: (switch socket) configure_port --id="qemu-$PID" --mem=/tmp/shm/qemu-$PID --ioeventfd=2 --irqfd=3 On virtio device status transition from DRIVER_OK: (switch socket) deconfigure_port --id="qemu-$PID" I skipped a bunch of things: 1. virtio-net has several virtqueues so you need multiple ioeventfds. 2. QEMU needs to communicate memory mapping information, this gets especially interesting with memory hotplug. Memory is more complicated than a single shmem blob. 3. Multiple NICs per guest should be supported. Stefan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O 2013-05-28 11:58 ` Stefan Hajnoczi @ 2013-10-21 10:29 ` Luke Gorrie 0 siblings, 0 replies; 49+ messages in thread From: Luke Gorrie @ 2013-10-21 10:29 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: snabb-devel, qemu-devel, Michael S. Tsirkin [-- Attachment #1: Type: text/plain, Size: 3513 bytes --] Hi all, Back in May we talked about efficiently connecting a user-space Ethernet switch to QEMU guests. Stefan Hajnoczi sketched the design of a userspace version of vhost that uses a Unix socket for its control interface. His design is in the mail quoted below. I'd like to ask you: if this feature were properly implemented and maintained, would you guys accept it into qemu? If so then I will work with a good QEMU hacker to develop it. Also, have there been any new developments in this area (vhost-net and userspace ethernet I/O) that we should take into account? On 28 May 2013 13:58, Stefan Hajnoczi <stefanha@redhat.com> wrote: > On Tue, May 28, 2013 at 12:10:50PM +0200, Luke Gorrie wrote: > > On 27 May 2013 11:34, Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > vhost_net is about connecting a virtio-net speaking process to a > > > tun-like device. The problem you are trying to solve is connecting a > > > virtio-net speaking process to Snabb Switch. > > > > > > > Yep! > > > > > > > Either you need to replace vhost or you need a tun-like device > > > interface. > > > > > > Replacing vhost would mean that your switch implements virtio-net, > > > shares guest RAM with the guest, and shares the ioeventfd and irqfd > > > which are used to signal with the guest. > > > > > > This would be a great solution from my perspective. This is the design > that > > I am now struggling to find a good implementation strategy for. > > The switch needs 3 resources for direct virtio-net communication with > the guest: > > 1.
Shared memory access to guest physical memory for guest physical to > host userspace address translation. vhost and data plane > automatically get access to guest memory and they learn about > memory layout using the MemoryListener interface in QEMU (see > hw/virtio/vhost.c:vhost_region_add() and friends). > > 2. Virtqueue kick notifier (ioeventfd) so the switch knows when the > guest signals the host. See virtio_queue_get_host_notifier(vq). > > 3. Guest interrupt notifier (irqfd) so the switch can signal the guest. > See virtio_queue_get_guest_notifier(vq). > > I don't have a detailed suggestion for how to interface the switch and > QEMU processes. It may be necessary to communicate back and forth (to > handle the virtio device lifecycle) so a UNIX domain socket would be > appropriate for passing file descriptors. Here is a rough idea: > > $ switch --listen-path=/var/run/switch.sock > $ qemu --device virtio-net-pci,switch=/var/run/switch.sock > > On QEMU startup: > > (switch socket) add_port --id="qemu-$PID" --session-persistence > > (Here --session-persistence means that the port will be automatically > destroyed if the switch socket session is terminated because the UNIX > domain socket is closed by QEMU.) > > On virtio device status transition to DRIVER_OK: > > (switch socket) configure_port --id="qemu-$PID" > --mem=/tmp/shm/qemu-$PID > --ioeventfd=2 > --irqfd=3 > > On virtio device status transition from DRIVER_OK: > > (switch socket) deconfigure_port --id="qemu-$PID" > > I skipped a bunch of things: > > 1. virtio-net has several virtqueues so you need multiple ioeventfds. > > 2. QEMU needs to communicate memory mapping information, this gets > especially interesting with memory hotplug. Memory is more > complicated than a single shmem blob. > > 3. Multiple NICs per guest should be supported. > > Stefan > [-- Attachment #2: Type: text/html, Size: 4463 bytes --] ^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2013-10-21 10:29 UTC | newest] Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-05-26 9:32 [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O Luke Gorrie 2013-05-27 9:34 ` Stefan Hajnoczi 2013-05-27 15:18 ` Michael S. Tsirkin 2013-05-27 15:43 ` Paolo Bonzini 2013-05-27 16:18 ` Anthony Liguori 2013-05-27 16:18 ` Paolo Bonzini 2013-05-27 17:01 ` Anthony Liguori 2013-05-27 17:13 ` Michael S. Tsirkin 2013-05-27 18:31 ` Anthony Liguori 2013-05-28 10:39 ` Luke Gorrie 2013-05-28 10:10 ` Luke Gorrie 2013-05-28 10:35 ` Stefan Hajnoczi 2013-05-28 11:36 ` Julian Stecklina 2013-05-28 11:53 ` Michael S. Tsirkin 2013-05-28 12:09 ` Julian Stecklina 2013-05-28 13:56 ` Michael S. Tsirkin 2013-05-28 15:35 ` Julian Stecklina 2013-05-28 15:44 ` Michael S. Tsirkin 2013-05-28 12:48 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie 2013-05-28 13:12 ` Julian Stecklina 2013-05-28 13:42 ` [Qemu-devel] [snabb-devel:280] " Luke Gorrie 2013-05-28 14:42 ` [Qemu-devel] [snabb-devel:276] " Luke Gorrie 2013-05-28 15:33 ` Julian Stecklina 2013-05-28 17:00 ` [Qemu-devel] " Anthony Liguori 2013-05-28 17:17 ` Michael S. Tsirkin 2013-05-28 18:55 ` Anthony Liguori 2013-05-29 10:31 ` Stefano Stabellini 2013-05-29 12:25 ` Michael S. Tsirkin 2013-05-29 13:04 ` Stefano Stabellini 2013-06-04 12:19 ` [Qemu-devel] [snabb-devel:300] " Luke Gorrie 2013-06-04 12:49 ` Julian Stecklina 2013-06-04 20:09 ` [Qemu-devel] [snabb-devel:326] " Luke Gorrie 2013-06-04 12:56 ` [Qemu-devel] [snabb-devel:300] " Michael S. Tsirkin 2013-06-05 6:09 ` [Qemu-devel] [snabb-devel:327] " Luke Gorrie 2013-05-29 7:49 ` [Qemu-devel] " Stefan Hajnoczi 2013-05-29 9:08 ` Michael S. Tsirkin 2013-05-29 14:21 ` Stefan Hajnoczi 2013-05-29 14:48 ` Michael S. Tsirkin 2013-05-29 16:02 ` Julian Stecklina 2013-05-30 2:35 ` ronnie sahlberg 2013-05-30 6:46 ` Stefan Hajnoczi 2013-05-30 6:55 ` Michael S. 
Tsirkin 2013-05-30 7:11 ` [Qemu-devel] [snabb-devel:308] " Luke Gorrie 2013-05-30 8:08 ` [Qemu-devel] " Julian Stecklina 2013-05-29 12:32 ` Julian Stecklina 2013-05-29 14:31 ` Stefan Hajnoczi 2013-05-29 15:59 ` Julian Stecklina 2013-05-28 11:58 ` Stefan Hajnoczi 2013-10-21 10:29 ` Luke Gorrie