From: Francois Ozog
Date: Wed, 10 Apr 2019 12:02:28 +0200
To: Tiwei Bie, dev
Cc: Alejandro Lucero, "Liang, Cunming", Bruce Richardson,
 Ilias Apalodimas, brouer@redhat.com
Subject: Re: [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
References: <20190403071844.21126-1-tiwei.bie@intel.com>
 <20190408093601.GA12313@dpdk-tbie.sh.intel.com>
In-Reply-To: <20190408093601.GA12313@dpdk-tbie.sh.intel.com>

Hi all,

I presented an approach in FOSDEM
(https://archive.fosdem.org/2018/schedule/event/netmdev/) and feel happy
someone is
picking it up.

If we step back a little, the mdev concept is to give userland direct
control over the hardware data path of a device that is still controlled
by the kernel. From a code base perspective, this can shrink PMD code
size significantly: only 10% of the PMD code is actual data path, the
rest being device control! The concept is perfect for DPDK, SPDK and
many other scenarios (AI accelerators). Should the work be triggered by
the DPDK community, it should be applicable to a broader set of
communities: SPDK, VPP, ODP, AF_XDP....

We bumped into many sharing (between kernel and userland) complexities,
particularly when a single PCI device controls two ports. So let's
assume we try to solve a subset of the cases: coherent IO memory and a
dedicated PCI space (by whatever mechanism) per port.

What are the "things to solve"?

1) enumeration: enumerating and "capturing" an mdev device (the patch, I
   assume)
2) bifurcation: designating the queues to capture in userland (may be
   all) with a hardware driven rule (flow director or more generic)
3) memory management: dealing with rings and buffer management on rx and
   tx paths

The bifurcation can be as simple as: all queues in userland, or quite
rich: TCP port 80 goes to userland while the rest (ICMP...) goes to the
kernel. If the kernel gets some of the traffic there will be a routing
information sharing problem to solve. We had a few experiments here.
The conclusion is that it's doable, but many corner cases make it a big
piece of work. And it would be nice if the queue selection could be made
very generic (and not tied to flow director). Let's state this is for
further study for now.

Let's focus on memory management of VFIO exposed devices. I haven't
refreshed my knowledge of the VFIO framework so you may want to correct
a few points...

First of all, DPDK is made to switch packets, particularly between
ports. With VFIO, this means all devices are in the same virtual IOVA,
which is tricky to implement in the kernel.
There are a few strategies to do that, all requiring significant mdev
extensions and more probably a kernel infrastructure change. The good
news is it can be made in such a way that selected drivers implement the
change, not requiring all the drivers to be touched.

Another big question is: does the kernel allocate the memory and the
userland get a map to it, or does the userland allocate the memory and
the kernel just maintain the IOVA mapping?

I would favor kernel allocation, with userland getting a map to it (in
the unified IOVA). One reason is that the memory allocation strategy can
be very different from hardware to hardware:
- driver allocates packet buffers and populates a single ring of packets
  per queue
- driver allocates packet buffers of different sizes and populates
  multiple rings per queue (for instance rings of 128, 256, 1024, 2048
  byte arrays per queue)
- driver allocates an unstructured memory area (say 32MB) and gives it
  to hardware (no prepopulation of rings).

So the userland framework (DPDK, SPDK, ODP, VPP, AF_XDP,
proprietary...) can just query the kernel driver for queues and rings,
since that driver knows what has to be done. The userland framework just
has to create the relevant objects (queues, rings, packet buffers)
according to the provided kernel information.

Exposing VFIO devices to DPDK and other frameworks is a major topic, and
I suggest that at the same time enumeration is done, a broader
discussion on the data path itself happens.

Data path discussion is about memory management (above) and packet
descriptors. Exposing hardware dependent structures in the userland is
not the most widely accepted wisdom. So I would rather assume hardware
natively produces hardware, vendor and OS independent descriptors.
Candidates can be: DPDK mbuf, VPP vlib_buf or virtio 1.1. I would favor
a packet descriptor that supports a combination of inline offloads
(VxLAN + IPSec + TSO...)
: if virtio 1.1 could be extended with some DPDK mbuf fields, that
would be perfect ;-) That looks like science fiction, but I know that on
some smartNICs and other hardware, the hardware-produced packet
descriptor format can be flexible....

Cheers

FF

On Mon, 8 Apr 2019 at 11:36, Tiwei Bie wrote:
>
> On Mon, Apr 08, 2019 at 09:44:07AM +0100, Alejandro Lucero wrote:
> > On Wed, Apr 3, 2019 at 8:19 AM Tiwei Bie wrote:
> > > Hi everyone,
> > >
> > > This is a draft implementation of the mdev (Mediated device [1])
> > > bus support in DPDK. Mdev is a way to virtualize devices in Linux
> > > kernel. Based on the device-api (mdev_type/device_api), there could
> > > be different types of mdev devices (e.g. vfio-pci). In this RFC,
> > > one mdev bus is introduced to scan the mdev devices in the system
> > > and do the probe based on the device-api.
> > >
> > > Take the mdev devices whose device-api is "vfio-pci" as an example,
> > > in this RFC, these devices will be probed by a mdev driver provided
> > > by PCI bus, which will plug them to the PCI bus. And they will be
> > > probed with the drivers registered on the PCI bus based on VendorID/
> > > DeviceID/... then.
> > >
> > >                        +----------+
> > >                        | mdev bus |
> > >                        +----+-----+
> > >                             |
> > >            +----------------+----+------+------+
> > >            |                     |      |      |
> > >      mdev_vfio_pci                   ......
> > > (device-api: vfio-pci)
> > >
> > > There are also other ways to add mdev device support in DPDK (e.g.
> > > let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
> > > appreciated!
> >
> > Hi Tiwei,
> >
> > Thanks for the patchset. I was close to send a patchset with the same mdev
> > support, but I'm glad to see your patchset first because I think it is
> > interesting to see another view of how to implemented this.
> >
> > After going through your patch I was a bit confused about how the mdev device
> > to mdev driver match was done.
> > But then I realized the approach you are
> > following is different to my implementation, likely due to having different
> > purposes. If I understand the idea behind, you want to have same PCI PMD
> > drivers working with devices, PCI devices, created from mediated devices.
>
> Exactly!
>
> > That
> > is the reason there is just one mdev driver, the one for vfio-pci mediated
> > devices type.
> >
> > My approach was different and I though having specific PMD mdev support was
> > necessary, with the PMD requiring to register a mdev driver. I can see, after
> > reading your patch, it can be perfectly possible to have the same PMDs for
> > "pure" PCI devices and PCI devices made from mediated devices, and if the PMD
> > requires to do something different due to the mediated devices intrinsics, then
> > explicitly supporting that per PMD. I got specific ioctl calls between the PMD
> > and the mediating driver but this can also be done with your approach.
> >
> > I'm working on having a mediated PF, what is a different purpose than the Intel
> > scalable I/O idea, so I will merge this patchset with my code and see if it
> > works.
>
> Cool! Thanks!
> >
> > Thanks!
> >
> >
> > > [1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt
> > >
> > > Thanks,
> > > Tiwei
> > >
> > > Tiwei Bie (3):
> > >   eal: add a helper for reading string from sysfs
> > >   bus/mdev: add mdev bus support
> > >   bus/pci: add mdev support
> > >
> > >  config/common_base                        |   5 +
> > >  config/common_linux                       |   1 +
> > >  drivers/bus/Makefile                      |   1 +
> > >  drivers/bus/mdev/Makefile                 |  41 +++
> > >  drivers/bus/mdev/linux/Makefile           |   6 +
> > >  drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
> > >  drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
> > >  drivers/bus/mdev/meson.build              |  15 ++
> > >  drivers/bus/mdev/private.h                |  90 +++++++
> > >  drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
> > >  drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
> > >  drivers/bus/meson.build                   |   2 +-
> > >  drivers/bus/pci/Makefile                  |   3 +
> > >  drivers/bus/pci/linux/Makefile            |   4 +
> > >  drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
> > >  drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
> > >  drivers/bus/pci/meson.build               |   4 +-
> > >  drivers/bus/pci/pci_common.c              |  17 +-
> > >  drivers/bus/pci/private.h                 |   9 +
> > >  drivers/bus/pci/rte_bus_pci.h             |  11 +-
> > >  lib/librte_eal/common/eal_filesystem.h    |   7 +
> > >  lib/librte_eal/freebsd/eal/eal.c          |  22 ++
> > >  lib/librte_eal/linux/eal/eal.c            |  22 ++
> > >  lib/librte_eal/rte_eal_version.map        |   1 +
> > >  mk/rte.app.mk                             |   1 +
> > >  25 files changed, 1163 insertions(+), 19 deletions(-)
> > >  create mode 100644 drivers/bus/mdev/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/mdev.c
> > >  create mode 100644 drivers/bus/mdev/mdev.c
> > >  create mode 100644 drivers/bus/mdev/meson.build
> > >  create mode 100644 drivers/bus/mdev/private.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
> > >  create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > >
> > > --
> > > 2.17.1
> >
> >

--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog