Date: Mon, 27 May 2013 20:13:40 +0300
From: "Michael S. Tsirkin"
Message-ID: <20130527171339.GB18800@redhat.com>
References: <20130527093409.GH21969@stefanha-thinkpad.redhat.com> <51A37F06.2080300@redhat.com> <874ndoflc2.fsf@codemonkey.ws> <51A38770.4040106@redhat.com> <87wqqk8ii4.fsf@codemonkey.ws>
In-Reply-To: <87wqqk8ii4.fsf@codemonkey.ws>
Subject: Re: [Qemu-devel] snabbswitch integration with QEMU for userspace ethernet I/O
To: Anthony Liguori
Cc: Luke Gorrie, Paolo Bonzini, snabb-devel@googlegroups.com, qemu-devel@nongnu.org, Stefan Hajnoczi

On Mon, May 27, 2013 at 12:01:07PM -0500, Anthony Liguori wrote:
> Paolo Bonzini writes:
>
> > On 27/05/2013 18:18, Anthony Liguori wrote:
> >> Paolo Bonzini writes:
> >>
> >>> On 27/05/2013 11:34, Stefan Hajnoczi wrote:
> >>>> On Sun, May 26, 2013 at 11:32:49AM +0200, Luke Gorrie wrote:
> >>>>> Stefan put us onto the highly promising track of vhost/virtio. We have
> >>>>> implemented this between Snabb Switch and the Linux kernel, but not
> >>>>> directly between Snabb Switch and QEMU guests.
> >>>>> The "roadblock" we have hit
> >>>>> is embarrassingly basic: QEMU is using user-to-kernel system calls to set up
> >>>>> vhost (open /dev/net/tun and /dev/vhost-net, ioctl()s) and I haven't found
> >>>>> a good way to map these towards Snabb Switch instead of the kernel.
> >>>>
> >>>> vhost_net is about connecting a virtio-net speaking process to a
> >>>> tun-like device. The problem you are trying to solve is connecting a
> >>>> virtio-net speaking process to Snabb Switch.
> >>>>
> >>>> Either you need to replace vhost or you need a tun-like device
> >>>> interface.
> >>>>
> >>>> How does your switch talk to hardware?
> >>>
> >>> And also, is your switch monolithic or does it consist of different
> >>> processes?
> >>>
> >>> If you already have processes talking to each other, the first thing
> >>> that came to my mind was a new network backend, similar to net/vde.c but
> >>> more featureful (so that you support the virtio headers for offloading,
> >>> for example). Then you would use "-netdev snabb,id=net0 -device
> >>> e1000,netdev=net0".
> >>
> >> It would be very interesting to combine this with vmsplice/splice.
> >
> > Was zero-copy vmsplice/splice actually ever implemented? I thought it
> > was reverted.
>
> Not sure what context you're talking about re: zero copy... a pipe can
> store references to pages instead of having a buffer that stores data.
> That certainly is there today--otherwise the interface is pointless.
>
> When splicing from pipe to pipe, you can move those references without
> copying the data.
>
> When vmsplicing from a userspace region to a pipe, the kernel just
> stores references to the pages. vmsplicing from a pipe to userspace
> OTOH will copy the data. This is fixable at least when dealing with
> GIFT'd pages. For guest-to-guest traffic, you wouldn't be gifting the
> pages I don't think.
>
> For implementing guest-to-guest traffic, the source QEMU can vmsplice
> the packet to a pipe that is shared with the vswitch.
> The vswitch can
> tee(2) the first N bytes to a second pipe such that it can read the
> info needed for routing decisions.
>
> Once the decision is made, if it's a local guest, it can splice() the
> packet to the appropriate destination QEMU process or another vswitch
> daemon (no data copy here).
>
> Finally, the destination QEMU process can vmsplice() from the pipe which
> will copy the data (this is the only copy).

AFAIK splice is mostly useless for networking as there's no way
to get notified when a packet has been sent.

> If vswitch needs to route externally, then it would need to splice() to
> a macvtap.
>
> macvtap should be able to send the packet without copying the data. Not
> sure that this last part will work as expected but if it doesn't, that's
> a bug that can/should be fixed.
>
> The kernel cannot do better than the above modulo any overhead from
> userspace context switching[*].

Also modulo scheduler latency - the kernel processes packets
in interrupt context. There's a reason e.g. OVS runs its data path
in the kernel.

> Guest-to-guest requires a copy.
> Normally macvtap is undesirable because it's tightly connected to a
> network adapter but that is a desirable trait in this case.
>
> N.B., I'm not advocating making all switching decisions in
> userspace. Just pointing out how it can be done efficiently.
>
> [*] in theory the kernel could do zero copy receive but I'm not sure
> it's feasible in practice.
>
> Regards,
>
> Anthony Liguori
>
> >
> > Paolo
> >
> >>> It would be slower than vhost-net, for example no zero-copy
> >>> transmission.
> >>
> >> With splice, I think you could at least get single copy guest-to-guest
> >> networking which is about as good as can be done.
> >>
> >> Regards,
> >>
> >> Anthony Liguori
> >>
> >>>> 3. Use the kernel as a middle-man. Create a double-ended "veth"
> >>>> interface and have Snabb Switch and QEMU each open a PF_PACKET
> >>>> socket and accelerate it with VHOST_NET.
> >>>
> >>> As Michael mentioned, this could be macvtap on the interface that you
> >>> have already created in the switch and passed to vhost-net. Then you do
> >>> not have to do anything in QEMU.
> >>>
> >>> Paolo
> >>>
> >>>> If you are using the Linux network stack then it might be better to
> >>>> integrate with vhost maybe as a tun-like device driver.
> >>>>
> >>>> Stefan
> >>>>
> >>>>
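
For reference, the pipe-based guest-to-guest path described in the thread
(vmsplice into a shared pipe, tee the header for the routing decision,
splice to the destination, one copy on the final read) might look roughly
like the sketch below. It is only an illustration under assumptions: all
three roles run in one process, forward_packet() and HDR_LEN are made-up
names, the packet is assumed to fit in a default-sized pipe, and a plain
read() stands in for the destination side's copy out of the pipe.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: the first 14 bytes (an Ethernet header) suffice for routing. */
#define HDR_LEN 14

/*
 * Hypothetical helper: pushes one packet through source -> vswitch ->
 * destination pipes. Returns the number of bytes delivered, or -1.
 */
static ssize_t forward_packet(void *pkt, size_t len,
                              unsigned char *hdr_out, unsigned char *pkt_out)
{
    int src[2] = { -1, -1 }, peek[2] = { -1, -1 }, dst[2] = { -1, -1 };
    struct iovec iov = { .iov_base = pkt, .iov_len = len };
    ssize_t n = -1;

    if (len < HDR_LEN)
        return -1;
    if (pipe(src) < 0 || pipe(peek) < 0 || pipe(dst) < 0)
        goto out;

    /* "Source QEMU": the pipe takes references to the packet's pages. */
    if (vmsplice(src[1], &iov, 1, 0) != (ssize_t)len)
        goto out;

    /* "vswitch": duplicate the header without consuming the packet ... */
    if (tee(src[0], peek[1], HDR_LEN, 0) != HDR_LEN)
        goto out;
    /* ... and read it so that a routing decision could be made here. */
    if (read(peek[0], hdr_out, HDR_LEN) != HDR_LEN)
        goto out;

    /* "vswitch": move the page references to the destination pipe. */
    if (splice(src[0], NULL, dst[1], NULL, len, SPLICE_F_MOVE) != (ssize_t)len)
        goto out;

    /* "Destination QEMU": this read is the only copy of the payload. */
    n = read(dst[0], pkt_out, len);

out:
    if (src[0] >= 0)  { close(src[0]);  close(src[1]); }
    if (peek[0] >= 0) { close(peek[0]); close(peek[1]); }
    if (dst[0] >= 0)  { close(dst[0]);  close(dst[1]); }
    return n;
}
```

Note that the sketch leaves the source buffer untouched until the final
read() has returned, which sidesteps the objection raised above: with
separate processes the sender gets no notification of when the pipe has
let go of its pages, so it cannot tell when the buffer may be reused.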