Subject: Re: [PATCH] pci: endpoint: functions: Add a virtnet EP function
To: Alan Mikhak, Kishon Vijay Abraham I
Cc: mst@redhat.com, lorenzo.pieralisi@arm.com, Bjorn Helgaas,
 linux-pci@vger.kernel.org, haotian.wang@duke.edu, Jon Mason, KVM list
From: Jason Wang
Message-ID: <230895ea-e292-d23d-ae4c-cf97677bbbfc@redhat.com>
Date: Wed, 27 Nov 2019 11:04:09 +0800

On 2019/11/27 6:01 AM, Alan Mikhak wrote:
> On Tue, Nov 26, 2019 at 1:55 PM Alan Mikhak wrote:
>> On Tue, Nov 26, 2019 at 4:36 AM Kishon Vijay Abraham I wrote:
>>> Hi Jason,
>>>
>>> On 26/11/19 3:28 PM, Jason Wang wrote:
>>>> On 2019/11/25 8:49 PM, Kishon Vijay Abraham I wrote:
>>>>> +Alan, Jon
>>>>>
>>>>> Hi Jason, Haotian, Alan,
>>>>>
>>>>> On 05/09/19 8:26 AM, Jason Wang wrote:
>>>>>> On 2019/9/5 5:58 AM, Haotian Wang wrote:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>> I have an additional comment regarding using vring.
>>>>>>>
>>>>>>> On Tue, Sep 3, 2019 at 6:42 AM Jason Wang wrote:
>>>>>>>> Kind of. In order to address the above limitation, you probably want
>>>>>>>> to implement a vringh based netdevice and driver. It will work like
>>>>>>>> this: instead of trying to present a virtio-net device to the
>>>>>>>> endpoint, present a new type of network device that uses two vringh
>>>>>>>> rings instead of virtio rings. The vringh ring is usually used to
>>>>>>>> implement the counterpart of a virtio driver. The advantages are
>>>>>>>> obvious:
>>>>>>>>
>>>>>>>> - no need to deal with two sets of features, config space etc.
>>>>>>>> - network specific; from the point of view of endpoint Linux, it's
>>>>>>>> not a virtio device, so no need to care about transport stuff or
>>>>>>>> embedding internal virtio-net specific data structures
>>>>>>>> - reuse the existing code (vringh) to avoid duplicated bugs;
>>>>>>>> implementing a virtqueue is kind of a challenge
>>>>>>> With vringh.c, there is no easy way to interface with virtio_net.c.
>>>>>>>
>>>>>>> vringh.c is linked with vhost/net.c nicely
>>>>>> Let me clarify: vhost_net doesn't use vringh at all (though there's a
>>>>>> plan to switch to using vringh).
>>>>>>
>>>>>>
>>>>>>> but again it's not easy to
>>>>>>> interface vhost/net.c with the network stack of the endpoint kernel.
>>>>>>> The vhost drivers were not designed with the purpose of creating
>>>>>>> another suite of virtual devices in the host kernel in the first
>>>>>>> place. If I try to manually write code for this interfacing, it seems
>>>>>>> that I will do duplicate work that virtio_net.c does.
>>>>>> Let me explain:
>>>>>>
>>>>>> - I'm not suggesting to use vhost_net since it can only deal with
>>>>>> userspace virtio rings.
>>>>>> - I suggest to introduce a netdev that has vringh vrings associated.
>>>>>> Vringh was designed to deal with virtio rings located in different
>>>>>> types of memory. It supports userspace vrings and kernel vrings
>>>>>> currently, but it should not be too hard to add support for e.g. an
>>>>>> endpoint device that requires DMA or whatever other method to access
>>>>>> the vring. So it was by design able to talk directly with e.g. a
>>>>>> kernel virtio device.
>>>>>> - In your case, you can read the vring address from the virtio config
>>>>>> space through the endpoint framework and then create the vringh. It's
>>>>>> as simple as: create a netdev, read the vring address, and initialize
>>>>>> the vringh. Then you can use the vringh helpers to get iovs and build
>>>>>> skbs etc. (similar to caif_virtio).
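
To make that concrete, here is a rough sketch of what the endpoint-side RX
path could look like. This is only a sketch: epf_vnet and
epf_vnet_build_skb are made-up names, and it assumes the rings end up
mapped into kernel memory, hence the _kern helpers:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/vringh.h>

/* Hypothetical per-device state for a vringh based netdev. */
struct epf_vnet {
	struct net_device *ndev;
	struct vringh rx_vrh;	/* the host's TX ring is our RX ring */
};

/* Placeholder for the copy-and-wrap step (copy the host buffers
 * described by riov into a freshly allocated skb). */
static struct sk_buff *epf_vnet_build_skb(struct epf_vnet *vnet,
					  struct vringh_kiov *riov);

/* Called once the host has written the ring addresses into the
 * (emulated) config space; desc/avail/used are assumed to already be
 * mapped into kernel virtual memory by the EPF framework. */
static int epf_vnet_init_rx(struct epf_vnet *vnet, u64 features,
			    unsigned int num, struct vring_desc *desc,
			    struct vring_avail *avail,
			    struct vring_used *used)
{
	return vringh_init_kern(&vnet->rx_vrh, features, num, false,
				desc, avail, used);
}

/* RX path: pull descriptor chains off the ring, build skbs and feed
 * them to the stack, much like caif_virtio's cfv_rx_poll(). */
static void epf_vnet_rx(struct epf_vnet *vnet)
{
	struct vringh_kiov riov;
	struct sk_buff *skb;
	u16 head;

	vringh_kiov_init(&riov, NULL, 0);
	while (vringh_getdesc_kern(&vnet->rx_vrh, &riov, NULL,
				   &head, GFP_ATOMIC) > 0) {
		skb = epf_vnet_build_skb(vnet, &riov);
		if (skb)
			netif_rx(skb);
		/* We only read the host's buffers, so used len is 0. */
		vringh_complete_kern(&vnet->rx_vrh, head, 0);
	}
}

TX would be symmetric with a second vringh, and notification would go
through whatever doorbell mechanism the endpoint framework provides.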
>>>>> From the discussions above and from looking at Jason's mdev patches
>>>>> [1], I've created the block diagram below.
>>>>>
>>>>> While this patch (from Haotian) deals with the RC<->EP connection, I'd
>>>>> also like this to be extended for NTB (using multiple EP instances:
>>>>> RC<->EP<->EP<->RC) [2][3].
>>>>>
>>>>> +------------------------------------+   +---------------------------------------+
>>>>> |                                    |   |                                       |
>>>>> |  +------------+  +--------------+  |   |  +------------+  +--------------+     |
>>>>> |  | vringh_net |  | vringh_rpmsg |  |   |  | virtio_net |  | virtio_rpmsg |     |
>>>>> |  +------------+  +--------------+  |   |  +------------+  +--------------+     |
>>>>> |                                    |   |                                       |
>>>>> |  +-------------+                   |   |  +-------------+                      |
>>>>> |  | vringh_mdev |                   |   |  | virtio_mdev |                      |
>>>>> |  +-------------+                   |   |  +-------------+                      |
>>>>> |                                    |   |                                       |
>>>>> |  +------------+  +------------+    |   |  +-------------------+ +------------+ |
>>>>> |  | vringh_epf |  | vringh_ntb |    |   |  | virtio_pci_common | | virtio_ntb | |
>>>>> |  +------------+  +------------+    |   |  +-------------------+ +------------+ |
>>>>> |  (PCI EP Device) (NTB Secondary    |   |  (PCI RC)             (NTB Primary    |
>>>>> |                   Device)          |   |                        Device)        |
>>>>> |                                    |   |                                       |
>>>>> |                (A)                 |   |                  (B)                  |
>>>>> +------------------------------------+   +---------------------------------------+
>>>>>
>>>>> GUEST SIDE (B):
>>>>> ===============
>>>>> In virtualization terminology, the side labeled (B) will be the guest
>>>>> side. Here it will be the place where the PCIe host (RC) side SW will
>>>>> execute. (Ignore NTB for this discussion, since PCIe host side SW will
>>>>> execute on both ends of the link in the case of NTB. However, I've
>>>>> included it in the block diagram since the design we adopt should be
>>>>> able to be extended for NTB as well.)
>>>>>
>>>>> Most of the pieces in (B) already exist.
>>>>> 1) virtio_net and virtio_rpmsg: No modifications needed; they can be
>>>>>    used as is.
>>>>> 2) virtio_mdev: Jason has sent this [1]. This could be used as is for
>>>>>    EP use cases as well. Jason has created mvnet based on virtio_mdev,
>>>>>    but for EP use cases virtio_pci_common and virtio_ntb should use it.
>>>>
>>>> Can we implement NTB as a transport for virtio? Then there's no need
>>>> for virtio_mdev.
>>> Yes, we could have NTB-specific virtio_config_ops. Where exactly should
>>> virtio_mdev be used?

It would be used when you want to support a userspace driver, which is
not supported by the current virtio_config_ops.
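
For the NTB-as-a-transport idea, I'd expect it to boil down to a
virtio_config_ops implementation plus register_virtio_device(). A sketch
under that assumption (struct ntb_virtio and its fields are invented here;
only the config get/set pair is spelled out, the rest would follow the
same pattern):

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_ids.h>
#include <linux/io.h>

/* Hypothetical per-link state: 'cfg' points at the device config
 * structure inside the shared NTB memory window. */
struct ntb_virtio {
	struct virtio_device vdev;
	void __iomem *cfg;
};

#define to_ntb_virtio(vd) container_of(vd, struct ntb_virtio, vdev)

/* Config space accesses just go through the shared window. */
static void ntb_virtio_get(struct virtio_device *vdev, unsigned int offset,
			   void *buf, unsigned int len)
{
	memcpy_fromio(buf, to_ntb_virtio(vdev)->cfg + offset, len);
}

static void ntb_virtio_set(struct virtio_device *vdev, unsigned int offset,
			   const void *buf, unsigned int len)
{
	memcpy_toio(to_ntb_virtio(vdev)->cfg + offset, buf, len);
}

static const struct virtio_config_ops ntb_virtio_config_ops = {
	.get	= ntb_virtio_get,
	.set	= ntb_virtio_set,
	/* .get_status/.set_status/.reset/.find_vqs/.del_vqs/
	 * .get_features/.finalize_features would be filled in the
	 * same way; omitted in this sketch. */
};

static int ntb_virtio_probe(struct ntb_virtio *nv, struct device *parent)
{
	nv->vdev.dev.parent = parent;
	nv->vdev.config = &ntb_virtio_config_ops;
	nv->vdev.id.device = VIRTIO_ID_NET;	/* negotiated over the link */
	return register_virtio_device(&nv->vdev);
}

The interesting part will be find_vqs(), which has to place the vrings
somewhere both ends of the link can reach.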
>>>>> 3) virtio_pci_common: This should be used when a PCIe EPF is
>>>>>    connected. It should be modified to create a virtio_mdev instead of
>>>>>    directly creating a virtio device.
>>>>> 4) virtio_ntb: This is used for NTB, where one end of the link should
>>>>>    use virtio_ntb. This should create a virtio_mdev.
>>>>>
>>>>> With this, virtio_mdev can abstract virtio_pci_common and virtio_ntb,
>>>>> and ideally any virtio driver can be used for EP or NTB (in the block
>>>>> diagram above, virtio_net and virtio_rpmsg can be used).
>>>>>
>>>>> HOST SIDE (A):
>>>>> ==============
>>>>> In virtualization terminology, the side labeled (A) will be the host
>>>>> side. Here it will be the place where the PCIe device (endpoint) side
>>>>> SW will execute.
>>>>>
>>>>> Bits and pieces of (A) should exist, but there should be considerable
>>>>> work in this.
>>>>> 1) vringh_net: There should be vringh drivers corresponding to the
>>>>>    virtio drivers on the guest side (B). vringh_net should register
>>>>>    with the net core. The vringh_net device should be created by
>>>>>    vringh_mdev. This should be new development.
>>>>> 2) vringh_rpmsg: vringh_rpmsg should register with the rpmsg core. The
>>>>>    vringh_rpmsg device should be created by vringh_mdev.
>>>>> 3) vringh_mdev: This layer should define ops specific to vringh (e.g.
>>>>>    get_desc_addr() should give the vring descriptor address and will
>>>>>    depend on either the EP device or the NTB device). I haven't looked
>>>>>    further at what other ops will be needed. IMO this layer should also
>>>>>    decide whether the _kern() or _user() vringh helpers should be
>>>>>    invoked.
>>>>
>>>> Right, but it probably doesn't need to be called "mdev"; it could just
>>>> be some abstraction in the form of a set of callbacks.
>>> Yeah, we could have something like vringh_config_ops. Once we start to
>>> implement, this might get more clear.

Yes, it will.
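
Something along these lines might be all that's needed (a hypothetical
shape only; every name below is invented for illustration):

#include <linux/device.h>
#include <linux/types.h>

/* Hypothetical callback set that lets a generic vringh-net layer stay
 * agnostic of whether it sits on a PCIe EPF or an NTB device. */
struct vringh_config_ops {
	/* where the guest placed the vring for queue 'index'
	 * (Kishon's get_desc_addr(), generalized a little) */
	int (*get_vring_addr)(struct device *dev, unsigned int index,
			      u64 *desc, u64 *avail, u64 *used);
	/* true if the ring memory ends up kernel-mapped, i.e. the
	 * _kern() vringh helpers apply; false means DMA/iomem access */
	bool (*ring_is_kern)(struct device *dev);
	/* kick/interrupt the other end of the link */
	void (*notify)(struct device *dev, unsigned int index);
};

Then vringh_epf and vringh_ntb below are just two implementations of the
same ops.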
>>>>> 4) vringh_epf: This will be used for the PCIe endpoint. It will
>>>>>    implement the ops to get the vring descriptor address.
>>>>> 5) vringh_ntb: Similar to vringh_epf, but will interface with the NTB
>>>>>    device instead of the EPF device.
>>>>>
>>>>> Jason,
>>>>>
>>>>> Can you give your comments on the above design? Do you see any
>>>>> flaws/issues with the above approach?
>>>>
>>>> Looks good overall, see questions above.
>>> Thanks for your comments Jason.
>>>
>>> Haotian, Alan, me, or whoever gets to implement this first should try
>>> to follow the approach discussed above.
>> Kishon,
>>
>> Thank you, and Jason Wang, for comments and suggestions re: NTB.
>>
>> My preference is to see Haotian continue his work on this
>> patch, if and when possible. As for expanding the scope to
>> support NTB, I personally find it very interesting. I will
>> keep an eye open for a suitable hardware platform in house
>> before figuring out if and when it would be possible to do such
>> work. From your slides, you may get there first since you
>> seem to have a suitable hardware platform already.
> - haotian.wang@sifive.com
>
> other: haotian.wang@duke.edu
>
>> Regards,
>> Alan
>>
>>> Thanks
>>> Kishon
>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks
>>>>> Kishon
>>>>>
>>>>> [1] -> https://lkml.org/lkml/2019/11/18/261
>>>>> [2] -> https://lkml.org/lkml/2019/9/26/291
>>>>> [3] ->
>>>>> https://www.linuxplumbersconf.org/event/4/contributions/395/attachments/284/481/Implementing_NTB_Controller_Using_PCIe_Endpoint_-_final.pdf
>>>>>
>>>>>>> There will probably be two more main disadvantages.
>>>>>>>
>>>>>>> Firstly, there will be two layers of overhead. vhost/net.c uses
>>>>>>> vringh.c to channel data buffers into some struct sockets. This is
>>>>>>> the first layer of overhead. That the virtual network device will
>>>>>>> have to use these sockets somehow adds another layer of overhead.
>>>>>> As I said, it doesn't work like vhost and no socket is needed at all.
>>>>>>
>>>>>>
>>>>>>> Secondly, probing, initialization and de-initialization of the
>>>>>>> virtual network_device are already non-trivial. I'll likely copy this
>>>>>>> part almost verbatim from virtio_net.c in the end. So in the end,
>>>>>>> there will be more duplicate code.
>>>>>> It will be a new type of network device instead of virtio; you don't
>>>>>> need to care about any virtio stuff except vringh in your code. So it
>>>>>> looks to me like it would be much simpler and more compact.
>>>>>>
>>>>>> I'm not saying your method is a dead end, but you would have to deal
>>>>>> with lots of other issues, as I explained in the previous mail. What
>>>>>> you want to achieve is
>>>>>>
>>>>>> 1) Host (virtio-pci) <-> virtio ring <-> virtual eth device <-> virtio
>>>>>> ring <-> Endpoint (virtio with customized config_ops)
>>>>>>
>>>>>> But what I suggest is
>>>>>>
>>>>>> 2) Host (virtio-pci) <-> virtio ring <-> virtual eth device <-> vringh
>>>>>> vring (the virtio ring in the Host) <-> network device
>>>>>>
>>>>>> The differences are:
>>>>>>
>>>>>> - Complexity: In your proposal, there will be two virtio devices and 4
>>>>>> virtqueues. It means you need to prepare two sets of features, config
>>>>>> ops etc., and dealing with inconsistent features will be a pain. It
>>>>>> may work for a simple case like a virtio-net device with only _F_MAC,
>>>>>> but it would be hard to extend. If we decide to go for vringh, there
>>>>>> will be a single virtio device and 2 virtqueues. On the endpoint side,
>>>>>> there will be 2 vringh vrings (which actually point at the same
>>>>>> virtqueues from the Host side) and a normal network device. There's no
>>>>>> need to deal with inconsistency, since vringh basically serves as the
>>>>>> device implementation; the feature negotiation is just between the
>>>>>> device (network device with vringh) and the driver (virtio-pci) from
>>>>>> the view of Linux running on the PCI Host.
>>>>>> - Maintainability: A third path for dealing with virtio rings. We
>>>>>> already have vhost and vringh; a third path will add a lot of overhead
>>>>>> when trying to maintain them. My proposal reuses vringh, so there's no
>>>>>> need for a new path.
>>>>>> - Layer violation: We want to hide the transport details from the
>>>>>> device and make the virtio-net device usable without modification, but
>>>>>> your code pokes at internals like virtnet_info. My proposal is to just
>>>>>> have a new networking device that doesn't need to care about virtio at
>>>>>> all. Having a new type of netdev is not as hard as you imagine; I
>>>>>> suggest taking a look at how caif_virtio is done, it would be helpful.
>>>>>>
>>>>>> If you still decide to go with the two-virtio-device model, you
>>>>>> probably need to:
>>>>>> - provide two sets of config and features, and deal with inconsistency
>>>>>> - try to reuse the vringh code
>>>>>> - not refer to internal structures from virtio_net.c
>>>>>>
>>>>>> But I recommend taking a first step of trying the vringh method, which
>>>>>> should be much simpler.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> Thank you for your patience!
>>>>>>>
>>>>>>> Best,
>>>>>>> Haotian