* [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
@ 2018-09-03  0:51 Kenneth Lee
  2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
                   ` (9 more replies)
  0 siblings, 10 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:51 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

WarpDrive is an accelerator framework that exposes hardware capabilities
directly to user space. It makes use of the existing vfio and vfio-mdev
facilities, so a user application can send requests and DMA to the
hardware without interacting with the kernel. This removes the syscall
latency.

WarpDrive is the name of the whole framework. The kernel component
is called SDMDEV, Share Domain Mediated Device. A device driver exposes its
hardware resources by registering with SDMDEV as a VFIO-Mdev, so the user
library of WarpDrive can access them via the VFIO interface.

The patchset contains documentation with the details. Please refer to it for
more information.
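
For readers who want a quick feel for the queue model before opening the
documentation patch, here is a minimal sketch of the send/receive flow. It
is not the sample code from patch 7. The function names and signatures
follow the API listed in the documentation, but the bodies are an in-memory
mock that simply hands requests back as completions. The struct wd_queue
layout, WD_MOCK_DEPTH, wd_mock_roundtrip() and the -EBUSY/-EAGAIN return
codes are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

#define WD_MOCK_DEPTH 4	/* hypothetical ring depth, not from the patchset */

/* Hypothetical layout; the real struct wd_queue lives in samples/warpdrive/wd.h. */
struct wd_queue {
	void *ring[WD_MOCK_DEPTH];
	unsigned int head;	/* next slot to receive from */
	unsigned int tail;	/* next slot to send to */
	unsigned int count;	/* requests currently in the ring */
};

/* Mock: the real call would open the mdev and mmap its MMIO/shared memory. */
int wd_request_queue(struct wd_queue *q)
{
	q->head = q->tail = q->count = 0;
	return 0;
}

void wd_release_queue(struct wd_queue *q)
{
	q->head = q->tail = q->count = 0;
}

/* Mock: queue a request without a syscall; -EBUSY (assumed) when full. */
int wd_send(struct wd_queue *q, void *req)
{
	if (q->count == WD_MOCK_DEPTH)
		return -EBUSY;
	q->ring[q->tail] = req;
	q->tail = (q->tail + 1) % WD_MOCK_DEPTH;
	q->count++;
	return 0;
}

/* Mock: fetch a completed request; -EAGAIN (assumed) when empty. */
int wd_recv(struct wd_queue *q, void **req)
{
	if (q->count == 0)
		return -EAGAIN;
	*req = q->ring[q->head];
	q->head = (q->head + 1) % WD_MOCK_DEPTH;
	q->count--;
	return 0;
}

/*
 * Send up to n requests, stop when the queue is full, then drain it.
 * Returns the number of completions received, or a negative errno.
 */
int wd_mock_roundtrip(int *reqs, int n)
{
	struct wd_queue q;
	void *done;
	int i, ret, completed = 0;

	ret = wd_request_queue(&q);
	if (ret)
		return ret;

	for (i = 0; i < n; i++) {
		if (wd_send(&q, &reqs[i]))
			break;	/* queue full: a real user would wait here */
	}

	while (wd_recv(&q, &done) == 0)
		completed++;

	assert(q.count == 0);	/* everything sent has been received */
	wd_release_queue(&q);
	return completed;
}
```

In the real framework the full and empty cases are where the user library
would wait on the queue instead of giving up, as the documentation patch
describes.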

This patchset is intended to be used with Jean Philippe Brucker's SVA
patchset [1], which adds not only IO-side page fault handling but also
PASID support to the IOMMU and VFIO.

With these features, WarpDrive can support non-pinned memory and
multiple processes on the same accelerator device. We tested it on our
SoC-integrated accelerator (board ID: D06, chip ID: HIP08). A reference work
tree can be found here: [2].

The SVA patches are not mandatory, however. This patchset is tested on the
latest mainline kernel without them, in which case it supports only one
process per accelerator.

We have noticed the IOMMU aware mdev RFC announced recently [3].

The IOMMU aware mdev has a similar idea but a different intention compared
to WarpDrive. It intends to dedicate part of the hardware resource to a VM,
and the design is supposed to be used with Scalable I/O Virtualization.
Sdmdev, in contrast, is intended to share the hardware resource among a
large number of processes. It only requires hardware that supports
per-process address translation (PCIe's PASID or ARM SMMU's substream ID).

But we don't see any serious conflict between the two designs. We believe
they can be unified into one.

Patch 1 is the documentation of the framework. Patches 2 and 3 add sdmdev
support. Patches 4, 5 and 6 are drivers for HiSilicon's ZIP accelerator,
which is registered to both crypto and WarpDrive (sdmdev) and can be
used from kernel and user space at the same time. Patch 7 is a user
space sample demonstrating how WarpDrive works.


Change History:
V2 changed from V1:
	1. Change kernel framework name from SPIMDEV (Share Parent IOMMU
	   Mdev) to SDMDEV (Share Domain Mdev).
	2. Allocate the hardware resource when a new mdev is created (in V1
	   it was allocated when the mdev was opened).
	3. Unmap pages from the shared domain when the sdmdev iommu group is
	   detached. (This procedure is necessary, but was missed in V1.)
	4. Update document accordingly.
	5. Rebase to the latest kernel (4.19.0-rc1)
	
	Following the review comments on RFCv1, we tried to use dma-buf
	as the back end of WarpDrive. It works properly with the current
	solution [4], but it cannot make use of the process's
	own memory address space directly, which is important in many
	acceleration scenarios. So dma-buf will be kept as a backup
	alternative for the noiommu scenario and will be added in a future
	version.


References:
[1] https://www.spinics.net/lists/kernel/msg2651481.html
[2] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-sva-v0.5
[3] https://lkml.org/lkml/2018/7/22/34
[4] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-v0.7-dmabuf

Best Regards
Kenneth Lee

Kenneth Lee (7):
  vfio/sdmdev: Add documents for WarpDrive framework
  iommu: Add share domain interface in iommu for sdmdev
  vfio: add sdmdev support
  crypto: add hisilicon Queue Manager driver
  crypto: Add Hisilicon Zip driver
  crypto: add sdmdev support to Hisilicon QM
  vfio/sdmdev: add user sample

 Documentation/00-INDEX                    |   2 +
 Documentation/warpdrive/warpdrive.rst     | 100 +++
 Documentation/warpdrive/wd-arch.svg       | 728 ++++++++++++++++
 drivers/crypto/Makefile                   |   2 +-
 drivers/crypto/hisilicon/Kconfig          |  25 +
 drivers/crypto/hisilicon/Makefile         |   2 +
 drivers/crypto/hisilicon/qm.c             | 979 ++++++++++++++++++++++
 drivers/crypto/hisilicon/qm.h             | 122 +++
 drivers/crypto/hisilicon/zip/Makefile     |   2 +
 drivers/crypto/hisilicon/zip/zip.h        |  57 ++
 drivers/crypto/hisilicon/zip/zip_crypto.c | 353 ++++++++
 drivers/crypto/hisilicon/zip/zip_crypto.h |   8 +
 drivers/crypto/hisilicon/zip/zip_main.c   | 195 +++++
 drivers/iommu/iommu.c                     |  29 +-
 drivers/vfio/Kconfig                      |   1 +
 drivers/vfio/Makefile                     |   1 +
 drivers/vfio/sdmdev/Kconfig               |  10 +
 drivers/vfio/sdmdev/Makefile              |   3 +
 drivers/vfio/sdmdev/vfio_sdmdev.c         | 363 ++++++++
 drivers/vfio/vfio_iommu_type1.c           | 151 +++-
 include/linux/iommu.h                     |  15 +
 include/linux/vfio_sdmdev.h               |  96 +++
 include/uapi/linux/vfio_sdmdev.h          |  29 +
 samples/warpdrive/AUTHORS                 |   2 +
 samples/warpdrive/ChangeLog               |   1 +
 samples/warpdrive/Makefile.am             |   9 +
 samples/warpdrive/NEWS                    |   1 +
 samples/warpdrive/README                  |  32 +
 samples/warpdrive/autogen.sh              |   3 +
 samples/warpdrive/cleanup.sh              |  13 +
 samples/warpdrive/configure.ac            |  52 ++
 samples/warpdrive/drv/hisi_qm_udrv.c      | 223 +++++
 samples/warpdrive/drv/hisi_qm_udrv.h      |  53 ++
 samples/warpdrive/test/Makefile.am        |   7 +
 samples/warpdrive/test/comp_hw.h          |  23 +
 samples/warpdrive/test/test_hisi_zip.c    | 206 +++++
 samples/warpdrive/wd.c                    | 309 +++++++
 samples/warpdrive/wd.h                    | 154 ++++
 samples/warpdrive/wd_adapter.c            |  74 ++
 samples/warpdrive/wd_adapter.h            |  43 +
 40 files changed, 4470 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/warpdrive/warpdrive.rst
 create mode 100644 Documentation/warpdrive/wd-arch.svg
 create mode 100644 drivers/crypto/hisilicon/qm.c
 create mode 100644 drivers/crypto/hisilicon/qm.h
 create mode 100644 drivers/crypto/hisilicon/zip/Makefile
 create mode 100644 drivers/crypto/hisilicon/zip/zip.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.c
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_main.c
 create mode 100644 drivers/vfio/sdmdev/Kconfig
 create mode 100644 drivers/vfio/sdmdev/Makefile
 create mode 100644 drivers/vfio/sdmdev/vfio_sdmdev.c
 create mode 100644 include/linux/vfio_sdmdev.h
 create mode 100644 include/uapi/linux/vfio_sdmdev.h
 create mode 100644 samples/warpdrive/AUTHORS
 create mode 100644 samples/warpdrive/ChangeLog
 create mode 100644 samples/warpdrive/Makefile.am
 create mode 100644 samples/warpdrive/NEWS
 create mode 100644 samples/warpdrive/README
 create mode 100755 samples/warpdrive/autogen.sh
 create mode 100755 samples/warpdrive/cleanup.sh
 create mode 100644 samples/warpdrive/configure.ac
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
 create mode 100644 samples/warpdrive/test/Makefile.am
 create mode 100644 samples/warpdrive/test/comp_hw.h
 create mode 100644 samples/warpdrive/test/test_hisi_zip.c
 create mode 100644 samples/warpdrive/wd.c
 create mode 100644 samples/warpdrive/wd.h
 create mode 100644 samples/warpdrive/wd_adapter.c
 create mode 100644 samples/warpdrive/wd_adapter.h

-- 
2.17.1



* [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
@ 2018-09-03  0:51 ` Kenneth Lee
  2018-09-06 18:36   ` Randy Dunlap
  2018-09-03  0:51 ` [PATCH 2/7] iommu: Add share domain interface in iommu for sdmdev Kenneth Lee
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:51 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

WarpDrive is a common user space accelerator framework. Its main component
in the kernel is called sdmdev, Share Domain Mediated Device. It exposes
the hardware capabilities to user space via vfio-mdev, so processes in
user land can obtain a "queue" by opening the device and then directly
access the hardware MMIO space or do DMA operations via the VFIO interface.

WarpDrive is intended to be used with Jean Philippe Brucker's SVA
patchset to support multiple processes, but this is not a must. Without the
SVA patches, WarpDrive still works, with one process per hardware
device.

This patch adds detailed documentation for the framework.

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
---
 Documentation/00-INDEX                |   2 +
 Documentation/warpdrive/warpdrive.rst | 100 ++++
 Documentation/warpdrive/wd-arch.svg   | 728 ++++++++++++++++++++++++++
 3 files changed, 830 insertions(+)
 create mode 100644 Documentation/warpdrive/warpdrive.rst
 create mode 100644 Documentation/warpdrive/wd-arch.svg

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 2754fe83f0d4..9959affab599 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -410,6 +410,8 @@ vm/
 	- directory with info on the Linux vm code.
 w1/
 	- directory with documents regarding the 1-wire (w1) subsystem.
+warpdrive/
+	- directory with documents about WarpDrive accelerator framework.
 watchdog/
 	- how to auto-reboot Linux if it has "fallen and can't get up". ;-)
 wimax/
diff --git a/Documentation/warpdrive/warpdrive.rst b/Documentation/warpdrive/warpdrive.rst
new file mode 100644
index 000000000000..6d2a5d1e08c4
--- /dev/null
+++ b/Documentation/warpdrive/warpdrive.rst
@@ -0,0 +1,100 @@
+Introduction of WarpDrive
+=========================
+
+*WarpDrive* is a general accelerator framework for user space. It intends to
+provide an interface for the user process to send requests to a hardware
+accelerator without heavy user-kernel interaction cost.
+
+The *WarpDrive* user library is meant to provide a pipe-based API such as: ::
+
+        int wd_request_queue(struct wd_queue *q);
+        void wd_release_queue(struct wd_queue *q);
+
+        int wd_send(struct wd_queue *q, void *req);
+        int wd_recv(struct wd_queue *q, void **req);
+        int wd_recv_sync(struct wd_queue *q, void **req);
+        int wd_flush(struct wd_queue *q);
+
+*wd_request_queue* creates the pipe connection, *queue*, between the
+application and the hardware. The application sends requests and pulls the
+answers back with the asynchronous wd_send/wd_recv, which interact directly
+with the hardware (by MMIO or shared memory) without a syscall.
+
+*WarpDrive* maintains a unified application address space among all involved
+accelerators, with the following APIs: ::
+
+        int wd_mem_share(struct wd_queue *q, const void *addr,
+                         size_t size, int flags);
+        void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t size);
+
+The process address space shared through these APIs can be accessed directly
+by the hardware. The process can also share its whole address space with the
+flag *WD_SHARE_ALL* (not in this patch yet).
+
+The name *WarpDrive* is simply a cool and general name meaning the framework
+makes the application faster. As explained later in this text, the facility
+in the kernel is called *SDMDEV*, namely "Share Domain Mediated Device".
+
+
+How does it work
+================
+
+*WarpDrive* is built upon *VFIO-MDEV*. The queue is wrapped as an *mdev* in
+VFIO, so memory sharing can be done via the standard VFIO DMA interface.
+
+The architecture is illustrated in the following figure:
+
+.. image:: wd-arch.svg
+        :alt: WarpDrive Architecture
+
+An accelerator driver shares its capability via the *SDMDEV* API: ::
+
+        vfio_sdmdev_register(struct vfio_sdmdev *sdmdev);
+        vfio_sdmdev_unregister(struct vfio_sdmdev *sdmdev);
+        vfio_sdmdev_wake_up(struct vfio_sdmdev_queue *q);
+
+*vfio_sdmdev_register* is a helper function to register the hardware to the
+*VFIO-MDEV* framework. Queues are created via the *mdev* creation interface.
+
+The *WarpDrive* user library mmaps the mdev to access its MMIO space and
+shared memory. Requests can be sent to, or received from, the hardware in
+this mmap-ed space until the queue is full or empty.
+
+The user library can wait on the queue with ioctl(VFIO_SDMDEV_CMD_WAIT) on
+the mdev if the queue is full or empty. When the queue status changes, the
+hardware driver uses *vfio_sdmdev_wake_up* to wake up the waiting process.
+
+
+Multiple processes support
+==========================
+
+In the latest mainline kernel (4.18) at the time of writing, multi-process
+is not yet supported in VFIO.
+
+Jean Philippe Brucker has a patchset to enable it [1]_. We have tested it
+with our hardware (which is known as *D06*). It works well. *WarpDrive* relies
+on it to support multiple processes. If it is not enabled, *WarpDrive* can
+still work, but it supports only one mdev for a process, which will share the
+same IO map table with the kernel. (But this is not a security problem,
+since the user application cannot access the kernel address space.)
+
+When multi-process is supported, mdevs can be created based on how many
+hardware resources (queues) are available. Because the VFIO framework accepts
+only one open from one mdev iommu_group, the mdev becomes the smallest unit
+for a process to use a queue. And the mdev is not released when the user
+process exits, so a resource agent is needed to manage the mdev allocation
+for user processes. This is outside the scope of this document.
+
+
+Legacy Mode Support
+===================
+For hardware on which an IOMMU is not supported, WarpDrive can run in
+*NOIOMMU* mode. That requires some updates to the mdev driver, which are not
+included in this version yet.
+
+
+References
+==========
+.. [1] https://patchwork.kernel.org/patch/10394851/
+
+.. vim: tw=78
diff --git a/Documentation/warpdrive/wd-arch.svg b/Documentation/warpdrive/wd-arch.svg
new file mode 100644
index 000000000000..2b0c467ee399
--- /dev/null
+++ b/Documentation/warpdrive/wd-arch.svg
@@ -0,0 +1,728 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+   xmlns:dc="http://purl.org/dc/elements/1.1/"
+   xmlns:cc="http://creativecommons.org/ns#"
+   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+   xmlns:svg="http://www.w3.org/2000/svg"
+   xmlns="http://www.w3.org/2000/svg"
+   xmlns:xlink="http://www.w3.org/1999/xlink"
+   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+   width="210mm"
+   height="193mm"
+   viewBox="0 0 744.09449 683.85823"
+   id="svg2"
+   version="1.1"
+   inkscape:version="0.92.3 (2405546, 2018-03-11)"
+   sodipodi:docname="wd-arch.svg">
+  <defs
+     id="defs4">
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient6830">
+      <stop
+         style="stop-color:#000000;stop-opacity:1;"
+         offset="0"
+         id="stop6832" />
+      <stop
+         style="stop-color:#000000;stop-opacity:0;"
+         offset="1"
+         id="stop6834" />
+    </linearGradient>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient5026"
+       id="linearGradient5032"
+       x1="353"
+       y1="211.3622"
+       x2="565.5"
+       y2="174.8622"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="translate(-89.949614,405.94594)" />
+    <linearGradient
+       inkscape:collect="always"
+       id="linearGradient5026">
+      <stop
+         style="stop-color:#f2f2f2;stop-opacity:1;"
+         offset="0"
+         id="stop5028" />
+      <stop
+         style="stop-color:#f2f2f2;stop-opacity:0;"
+         offset="1"
+         id="stop5030" />
+    </linearGradient>
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter4169-3"
+       x="-0.031597666"
+       width="1.0631953"
+       y="-0.099812768"
+       height="1.1996255">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="1.3307599"
+         id="feGaussianBlur4171-6" />
+    </filter>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient5026"
+       id="linearGradient5032-1"
+       x1="353"
+       y1="211.3622"
+       x2="565.5"
+       y2="174.8622"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="translate(175.77842,400.29111)" />
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter4169-3-0"
+       x="-0.031597666"
+       width="1.0631953"
+       y="-0.099812768"
+       height="1.1996255">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="1.3307599"
+         id="feGaussianBlur4171-6-9" />
+    </filter>
+    <marker
+       markerWidth="18.960653"
+       markerHeight="11.194658"
+       refX="9.4803267"
+       refY="5.5973287"
+       orient="auto"
+       id="marker4613">
+      <rect
+         y="-5.1589785"
+         x="5.8504119"
+         height="10.317957"
+         width="10.317957"
+         id="rect4212"
+         style="fill:#ffffff;stroke:#000000;stroke-width:0.69143367;stroke-miterlimit:4;stroke-dasharray:none"
+         transform="matrix(0.86111274,0.50841405,-0.86111274,0.50841405,0,0)">
+        <title
+           id="title4262">generation</title>
+      </rect>
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient5026"
+       id="linearGradient5032-3-9"
+       x1="353"
+       y1="211.3622"
+       x2="565.5"
+       y2="174.8622"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="matrix(1.2452511,0,0,0.98513016,-190.95632,540.33156)" />
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter4169-3-5-8"
+       x="-0.031597666"
+       width="1.0631953"
+       y="-0.099812768"
+       height="1.1996255">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="1.3307599"
+         id="feGaussianBlur4171-6-3-9" />
+    </filter>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2-1">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9-9"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient5026"
+       id="linearGradient5032-3-9-7"
+       x1="353"
+       y1="211.3622"
+       x2="565.5"
+       y2="174.8622"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="matrix(1.3742742,0,0,0.97786398,-234.52617,654.63367)" />
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter4169-3-5-8-5"
+       x="-0.031597666"
+       width="1.0631953"
+       y="-0.099812768"
+       height="1.1996255">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="1.3307599"
+         id="feGaussianBlur4171-6-3-9-0" />
+    </filter>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2-6">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9-1"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient5026"
+       id="linearGradient5032-3-9-4"
+       x1="353"
+       y1="211.3622"
+       x2="565.5"
+       y2="174.8622"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="matrix(1.3742912,0,0,2.0035845,-468.34428,342.56603)" />
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter4169-3-5-8-54"
+       x="-0.031597666"
+       width="1.0631953"
+       y="-0.099812768"
+       height="1.1996255">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="1.3307599"
+         id="feGaussianBlur4171-6-3-9-7" />
+    </filter>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2-1-8">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9-9-6"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2-1-8-8">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9-9-6-9"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-0">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-93"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-0-2">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-93-6"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+    <filter
+       inkscape:collect="always"
+       style="color-interpolation-filters:sRGB"
+       id="filter5382"
+       x="-0.089695387"
+       width="1.1793908"
+       y="-0.10052069"
+       height="1.2010413">
+      <feGaussianBlur
+         inkscape:collect="always"
+         stdDeviation="0.86758925"
+         id="feGaussianBlur5384" />
+    </filter>
+    <linearGradient
+       inkscape:collect="always"
+       xlink:href="#linearGradient6830"
+       id="linearGradient6836"
+       x1="362.73923"
+       y1="700.04059"
+       x2="340.4751"
+       y2="678.25488"
+       gradientUnits="userSpaceOnUse"
+       gradientTransform="translate(-23.771026,-135.76835)" />
+    <marker
+       markerWidth="11.227358"
+       markerHeight="12.355258"
+       refX="10"
+       refY="6.177629"
+       orient="auto"
+       id="marker4825-6-2-6-2">
+      <path
+         inkscape:connector-curvature="0"
+         id="path4757-1-9-1-9"
+         d="M 0.42024733,0.42806444 10.231357,6.3500844 0.24347733,11.918544"
+         style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" />
+    </marker>
+  </defs>
+  <sodipodi:namedview
+     id="base"
+     pagecolor="#ffffff"
+     bordercolor="#666666"
+     borderopacity="1.0"
+     inkscape:pageopacity="0.0"
+     inkscape:pageshadow="2"
+     inkscape:zoom="1.4"
+     inkscape:cx="313.72367"
+     inkscape:cy="307.5947"
+     inkscape:document-units="px"
+     inkscape:current-layer="layer1"
+     showgrid="false"
+     inkscape:window-width="1916"
+     inkscape:window-height="1033"
+     inkscape:window-x="1920"
+     inkscape:window-y="22"
+     inkscape:window-maximized="0"
+     fit-margin-right="0.3"
+     inkscape:snap-global="false" />
+  <metadata
+     id="metadata7">
+    <rdf:RDF>
+      <cc:Work
+         rdf:about="">
+        <dc:format>image/svg+xml</dc:format>
+        <dc:type
+           rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+        <dc:title />
+      </cc:Work>
+    </rdf:RDF>
+  </metadata>
+  <g
+     inkscape:label="Layer 1"
+     inkscape:groupmode="layer"
+     id="layer1"
+     transform="translate(0,-368.50374)">
+    <rect
+       style="fill:#000000;stroke:#000000;stroke-width:0.6465112;filter:url(#filter4169-3)"
+       id="rect4136-3-6"
+       width="101.07784"
+       height="31.998148"
+       x="283.01144"
+       y="588.80896" />
+    <rect
+       style="fill:url(#linearGradient5032);fill-opacity:1;stroke:#000000;stroke-width:0.6465112"
+       id="rect4136-2"
+       width="101.07784"
+       height="31.998148"
+       x="281.63498"
+       y="586.75739" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="294.21747"
+       y="612.50073"
+       id="text4138-6"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1"
+         x="294.21747"
+         y="612.50073"
+         style="font-size:15px;line-height:1.25">WarpDrive</tspan></text>
+    <rect
+       style="fill:#000000;stroke:#000000;stroke-width:0.6465112;filter:url(#filter4169-3-0)"
+       id="rect4136-3-6-3"
+       width="101.07784"
+       height="31.998148"
+       x="548.7395"
+       y="583.15417" />
+    <rect
+       style="fill:url(#linearGradient5032-1);fill-opacity:1;stroke:#000000;stroke-width:0.6465112"
+       id="rect4136-2-60"
+       width="101.07784"
+       height="31.998148"
+       x="547.36304"
+       y="581.1026" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="557.83484"
+       y="602.32745"
+       id="text4138-6-6"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-2"
+         x="557.83484"
+         y="602.32745"
+         style="font-size:15px;line-height:1.25">user_driver</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4613)"
+       d="m 547.36304,600.78954 -156.58203,0.0691"
+       id="path4855"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <rect
+       style="fill:#000000;stroke:#000000;stroke-width:0.6465112;filter:url(#filter4169-3-5-8)"
+       id="rect4136-3-6-5-7"
+       width="101.07784"
+       height="31.998148"
+       x="128.74678"
+       y="80.648842"
+       transform="matrix(1.2452511,0,0,0.98513016,113.15182,641.02594)" />
+    <rect
+       style="fill:url(#linearGradient5032-3-9);fill-opacity:1;stroke:#000000;stroke-width:0.71606314"
+       id="rect4136-2-6-3"
+       width="125.86729"
+       height="31.522341"
+       x="271.75983"
+       y="718.45435" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="306.29599"
+       y="746.50073"
+       id="text4138-6-2-6"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1"
+         x="306.29599"
+         y="746.50073"
+         style="font-size:15px;line-height:1.25">sdmdev</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4825-6-2)"
+       d="m 329.57309,619.72453 5.0373,97.14447"
+       id="path4661-3"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4825-6-2-1)"
+       d="m 342.57219,830.63108 -5.67699,-79.2841"
+       id="path4661-3-4"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <rect
+       style="fill:#000000;stroke:#000000;stroke-width:0.6465112;filter:url(#filter4169-3-5-8-5)"
+       id="rect4136-3-6-5-7-3"
+       width="101.07784"
+       height="31.998148"
+       x="128.74678"
+       y="80.648842"
+       transform="matrix(1.3742742,0,0,0.97786398,101.09126,754.58534)" />
+    <rect
+       style="fill:url(#linearGradient5032-3-9-7);fill-opacity:1;stroke:#000000;stroke-width:0.74946606"
+       id="rect4136-2-6-3-6"
+       width="138.90866"
+       height="31.289837"
+       x="276.13297"
+       y="831.44263" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="295.67819"
+       y="852.98224"
+       id="text4138-6-2-6-1"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0"
+         x="295.67819"
+         y="852.98224"
+         style="font-size:15px;line-height:1.25">Device Driver</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="349.31198"
+       y="829.46118"
+       id="text4138-6-2-6-1-6"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0-3"
+         x="349.31198"
+         y="829.46118"
+         style="font-size:15px;line-height:1.25">*</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="349.98282"
+       y="768.698"
+       id="text4138-6-2-6-1-6-2"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0-3-0"
+         x="349.98282"
+         y="768.698"
+         style="font-size:15px;line-height:1.25">1</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4825-6-2-6)"
+       d="m 568.1238,614.05402 0.51369,333.80219"
+       id="path4661-3-5"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="371.8013"
+       y="664.62476"
+       id="text4138-6-2-6-1-6-2-5"><tspan
+         sodipodi:role="line"
+         x="371.8013"
+         y="664.62476"
+         id="tspan4274"
+         style="font-size:15px;line-height:1.25">&lt;&lt;vfio&gt;&gt;</tspan><tspan
+         sodipodi:role="line"
+         x="371.8013"
+         y="683.37476"
+         id="tspan4305"
+         style="font-size:15px;line-height:1.25">resource management</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="389.92969"
+       y="587.44836"
+       id="text4138-6-2-6-1-6-2-56"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0-3-0-9"
+         x="389.92969"
+         y="587.44836"
+         style="font-size:15px;line-height:1.25">1</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="528.64813"
+       y="600.08429"
+       id="text4138-6-2-6-1-6-3"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0-3-7"
+         x="528.64813"
+         y="600.08429"
+         style="font-size:15px;line-height:1.25">*</tspan></text>
+    <rect
+       style="fill:#000000;stroke:#000000;stroke-width:0.6465112;filter:url(#filter4169-3-5-8-54)"
+       id="rect4136-3-6-5-7-4"
+       width="101.07784"
+       height="31.998148"
+       x="128.74678"
+       y="80.648842"
+       transform="matrix(1.3745874,0,0,1.8929066,-132.7754,556.04505)" />
+    <rect
+       style="fill:url(#linearGradient5032-3-9-4);fill-opacity:1;stroke:#000000;stroke-width:1.07280123"
+       id="rect4136-2-6-3-4"
+       width="138.91039"
+       height="64.111"
+       x="42.321312"
+       y="704.8371" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="110.30745"
+       y="722.94025"
+       id="text4138-6-2-6-3"><tspan
+         sodipodi:role="line"
+         x="111.99202"
+         y="722.94025"
+         id="tspan4366"
+         style="font-size:15px;line-height:1.25;text-align:center;text-anchor:middle">other standard </tspan><tspan
+         sodipodi:role="line"
+         x="110.30745"
+         y="741.69025"
+         id="tspan4368"
+         style="font-size:15px;line-height:1.25;text-align:center;text-anchor:middle">framework</tspan><tspan
+         sodipodi:role="line"
+         x="110.30745"
+         y="760.44025"
+         style="font-size:15px;line-height:1.25;text-align:center;text-anchor:middle"
+         id="tspan6840">(crypto/nic/others)</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4825-6-2-1-8)"
+       d="M 276.29661,849.04109 134.04449,771.90853"
+       id="path4661-3-4-8"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="313.70813"
+       y="730.06366"
+       id="text4138-6-2-6-36"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-7"
+         x="313.70813"
+         y="730.06366"
+         style="font-size:10px;line-height:1.25">&lt;&lt;lkm&gt;&gt;</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;text-align:start;letter-spacing:0px;word-spacing:0px;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="343.81625"
+       y="786.44141"
+       id="text4138-6-2-6-1-6-2-5-7-5"><tspan
+         sodipodi:role="line"
+         x="343.81625"
+         y="786.44141"
+         style="font-size:15px;line-height:1.25;text-align:start;text-anchor:start"
+         id="tspan2357">register as mdev with</tspan><tspan
+         sodipodi:role="line"
+         x="343.81625"
+         y="805.19141"
+         style="font-size:15px;line-height:1.25;text-align:start;text-anchor:start"
+         id="tspan1462">&quot;share domain&quot; attribute</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="29.145819"
+       y="833.44244"
+       id="text4138-6-2-6-1-6-2-5-7-5-2"><tspan
+         sodipodi:role="line"
+         x="29.145819"
+         y="833.44244"
+         id="tspan4301"
+         style="font-size:15px;line-height:1.25">register to other subsystem</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="301.20813"
+       y="597.29437"
+       id="text4138-6-2-6-36-1"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-7-2"
+         x="301.20813"
+         y="597.29437"
+         style="font-size:10px;line-height:1.25">&lt;&lt;user_lib&gt;&gt;</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="649.09613"
+       y="774.4798"
+       id="text4138-6-2-6-1-6-2-5-3"><tspan
+         sodipodi:role="line"
+         id="tspan4140-1-9-1-0-3-0-4-6"
+         x="649.09613"
+         y="774.4798"
+         style="font-size:15px;line-height:1.25">&lt;&lt;vfio&gt;&gt;</tspan><tspan
+         sodipodi:role="line"
+         x="649.09613"
+         y="793.2298"
+         id="tspan4274-7"
+         style="font-size:15px;line-height:1.25">Hardware Accessing</tspan></text>
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="371.01291"
+       y="529.23682"
+       id="text4138-6-2-6-1-6-2-5-36"><tspan
+         sodipodi:role="line"
+         x="371.01291"
+         y="529.23682"
+         id="tspan4305-3"
+         style="font-size:15px;line-height:1.25">wd user api</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       d="m 328.19325,585.87943 0,-23.57142"
+       id="path4348"
+       inkscape:connector-curvature="0" />
+    <ellipse
+       style="opacity:1;fill:#ffffff;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0"
+       id="path4350"
+       cx="328.01468"
+       cy="551.95081"
+       rx="11.607142"
+       ry="10.357142" />
+    <path
+       style="opacity:0.444;fill:url(#linearGradient6836);fill-opacity:1;fill-rule:evenodd;stroke:none;stroke-width:1;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;filter:url(#filter5382)"
+       id="path4350-2"
+       sodipodi:type="arc"
+       sodipodi:cx="329.44327"
+       sodipodi:cy="553.37933"
+       sodipodi:rx="11.607142"
+       sodipodi:ry="10.357142"
+       sodipodi:start="0"
+       sodipodi:end="6.2509098"
+       d="m 341.05041,553.37933 a 11.607142,10.357142 0 0 1 -11.51349,10.35681 11.607142,10.357142 0 0 1 -11.69928,-10.18967 11.607142,10.357142 0 0 1 11.32469,-10.52124 11.607142,10.357142 0 0 1 11.88204,10.01988"
+       sodipodi:open="true" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="543.91455"
+       y="978.22363"
+       id="text4138-6-2-6-1-6-2-5-36-3"><tspan
+         sodipodi:role="line"
+         x="543.91455"
+         y="978.22363"
+         id="tspan4305-3-67"
+         style="font-size:15px;line-height:1.25">Device(Hardware)</tspan></text>
+    <path
+       style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1;marker-end:url(#marker4825-6-2-6-2)"
+       d="m 347.51164,865.4527 153.19752,91.52439"
+       id="path4661-3-5-1"
+       inkscape:connector-curvature="0"
+       sodipodi:nodetypes="cc" />
+    <text
+       xml:space="preserve"
+       style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"
+       x="343.6398"
+       y="716.47754"
+       id="text4138-6-2-6-1-6-2-5-7-5-2-6"><tspan
+         sodipodi:role="line"
+         x="343.6398"
+         y="716.47754"
+         id="tspan4301-4"
+         style="font-style:italic;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:15px;line-height:1.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif Italic';stroke-width:1px">Share Domain mdev</tspan></text>
+  </g>
+</svg>
-- 
2.17.1



* [PATCH 2/7] iommu: Add share domain interface in iommu for sdmdev
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
  2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
@ 2018-09-03  0:51 ` Kenneth Lee
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:51 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

This patch adds a sharing interface for an iommu_group. The new interface:

	iommu_group_share_domain()
	iommu_group_unshare_domain()

can be used by a virtual iommu_group (such as the iommu_group of an sdmdev)
to share the domain of its parent's iommu_group.

While the domain of a group is shared, it cannot be changed until it is
unshared. This way, all users of the domain can assume that the shared
IOMMU has the same configuration.  In the future, a notification can be
added if updates are required.
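
The sharing rule described above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: the model_* types and
functions are hypothetical stand-ins, not the kernel interface added by this
patch, and a plain int replaces the atomic reference count.

```c
/*
 * Userspace model only. All model_* names are hypothetical and exist
 * purely to illustrate the share/unshare rule; they are not kernel API.
 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_domain {
	int id;
};

struct model_group {
	struct model_domain *default_domain;
	struct model_domain *domain;
	int shared_ref;		/* number of sharers of the current domain */
};

/* Share the current domain; allowed only while the default domain is in use. */
struct model_domain *model_share(struct model_group *g)
{
	if (g->domain != g->default_domain)
		return NULL;

	g->shared_ref++;
	return g->domain;
}

/* Drop one sharer; the count must never go below zero. */
void model_unshare(struct model_group *g)
{
	assert(g->shared_ref > 0);
	g->shared_ref--;
}

/* Attaching another domain is refused while the current one is shared. */
int model_attach(struct model_group *g, struct model_domain *d)
{
	if ((g->default_domain && g->domain != g->default_domain) ||
	    g->shared_ref > 0)
		return -EBUSY;

	g->domain = d;
	return 0;
}
```

In this model, an attach attempted between model_share() and model_unshare()
returns -EBUSY, which is the guarantee sharers rely on: the configuration they
observed stays in place for as long as they hold a reference.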

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
---
 drivers/iommu/iommu.c | 29 ++++++++++++++++++++++++++++-
 include/linux/iommu.h | 15 +++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c15c5980299..8e567e1037dd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -58,6 +58,9 @@ struct iommu_group {
 	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *domain;
+	atomic_t domain_shared_ref; /* Number of users of the current domain.
+				     * The domain cannot be modified if ref > 0
+				     */
 };
 
 struct group_device {
@@ -385,6 +388,7 @@ struct iommu_group *iommu_group_alloc(void)
 		return ERR_PTR(ret);
 	}
 	group->id = ret;
+	atomic_set(&group->domain_shared_ref, 0);
 
 	ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
 				   NULL, "%d", group->id);
@@ -518,6 +522,26 @@ int iommu_group_set_name(struct iommu_group *group, const char *name)
 }
 EXPORT_SYMBOL_GPL(iommu_group_set_name);
 
+struct iommu_domain *iommu_group_share_domain(struct iommu_group *group)
+{
+	/* the domain can be shared only when the default domain is used */
+	/* todo: more shareable check */
+	if (group->domain != group->default_domain)
+		return ERR_PTR(-EINVAL);
+
+	atomic_inc(&group->domain_shared_ref);
+	return group->domain;
+}
+EXPORT_SYMBOL_GPL(iommu_group_share_domain);
+
+struct iommu_domain *iommu_group_unshare_domain(struct iommu_group *group)
+{
+	atomic_dec(&group->domain_shared_ref);
+	WARN_ON(atomic_read(&group->domain_shared_ref) < 0);
+	return group->domain;
+}
+EXPORT_SYMBOL_GPL(iommu_group_unshare_domain);
+
 static int iommu_group_create_direct_mappings(struct iommu_group *group,
 					      struct device *dev)
 {
@@ -1437,7 +1461,8 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (group->default_domain && group->domain != group->default_domain)
+	if ((group->default_domain && group->domain != group->default_domain) ||
+	     atomic_read(&group->domain_shared_ref) > 0)
 		return -EBUSY;
 
 	ret = __iommu_group_for_each_dev(group, domain,
@@ -1474,6 +1499,8 @@ static void __iommu_detach_group(struct iommu_domain *domain,
 {
 	int ret;
 
+	WARN_ON(atomic_read(&group->domain_shared_ref) > 0);
+
 	if (!group->default_domain) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 87994c265bf5..013ac400b643 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -344,6 +344,9 @@ extern int iommu_domain_get_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
 extern int iommu_domain_set_attr(struct iommu_domain *domain, enum iommu_attr,
 				 void *data);
+extern struct iommu_domain *iommu_group_share_domain(struct iommu_group *group);
+extern struct iommu_domain *iommu_group_unshare_domain(
+		struct iommu_group *group);
 
 /* Window handling function prototypes */
 extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
@@ -616,6 +619,18 @@ static inline int iommu_domain_set_attr(struct iommu_domain *domain,
 	return -EINVAL;
 }
 
+static inline struct iommu_domain *iommu_group_share_domain(
+		struct iommu_group *group)
+{
+	return NULL;
+}
+
+static inline struct iommu_domain *iommu_group_unshare_domain(
+		struct iommu_group *group)
+{
+	return NULL;
+}
+
 static inline int  iommu_device_register(struct iommu_device *iommu)
 {
 	return -ENODEV;
-- 
2.17.1



* [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
  2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
  2018-09-03  0:51 ` [PATCH 2/7] iommu: Add share domain interface in iommu for sdmdev Kenneth Lee
@ 2018-09-03  0:52 ` Kenneth Lee
  2018-09-03  2:11   ` Randy Dunlap
                     ` (5 more replies)
  2018-09-03  0:52 ` [PATCH 4/7] crypto: add hisilicon Queue Manager driver Kenneth Lee
                   ` (6 subsequent siblings)
  9 siblings, 6 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:52 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

SDMDEV stands for "Share Domain Mdev". It is a vfio-mdev, but unlike a
general vfio-mdev, it shares its parent's IOMMU. If multi-PASID
support is enabled in the IOMMU (not yet in the current kernel HEAD),
multiple processes can share the IOMMU, each using a different PASID.
If it is not supported, only one process can share the IOMMU with the
kernel driver.

Currently only the vfio type-1 driver is updated to be aware of SDMDEV.

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
---
 drivers/vfio/Kconfig              |   1 +
 drivers/vfio/Makefile             |   1 +
 drivers/vfio/sdmdev/Kconfig       |  10 +
 drivers/vfio/sdmdev/Makefile      |   3 +
 drivers/vfio/sdmdev/vfio_sdmdev.c | 363 ++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c   | 151 ++++++++++++-
 include/linux/vfio_sdmdev.h       |  96 ++++++++
 include/uapi/linux/vfio_sdmdev.h  |  29 +++
 8 files changed, 648 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/sdmdev/Kconfig
 create mode 100644 drivers/vfio/sdmdev/Makefile
 create mode 100644 drivers/vfio/sdmdev/vfio_sdmdev.c
 create mode 100644 include/linux/vfio_sdmdev.h
 create mode 100644 include/uapi/linux/vfio_sdmdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index c84333eb5eb5..5af7d1db505e 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -47,4 +47,5 @@ menuconfig VFIO_NOIOMMU
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "drivers/vfio/mdev/Kconfig"
+source "drivers/vfio/sdmdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index de67c4725cce..678592360a7a 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
 obj-$(CONFIG_VFIO_MDEV) += mdev/
+obj-$(CONFIG_VFIO_SDMDEV) += sdmdev/
diff --git a/drivers/vfio/sdmdev/Kconfig b/drivers/vfio/sdmdev/Kconfig
new file mode 100644
index 000000000000..51474272870d
--- /dev/null
+++ b/drivers/vfio/sdmdev/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+config VFIO_SDMDEV
+	tristate "Support for Share Domain MDEV"
+	depends on VFIO_MDEV_DEVICE
+	help
+	  Support for VFIO Share Domain MDEV, which enables the kernel to
+	  support the lightweight hardware accelerator framework, WarpDrive.
+
+	  To compile this as a module, choose M here: the module will be called
+	  sdmdev.
diff --git a/drivers/vfio/sdmdev/Makefile b/drivers/vfio/sdmdev/Makefile
new file mode 100644
index 000000000000..ccaaa03f3184
--- /dev/null
+++ b/drivers/vfio/sdmdev/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+sdmdev-y := sdmdev.o
+obj-$(CONFIG_VFIO_SDMDEV) += vfio_sdmdev.o
diff --git a/drivers/vfio/sdmdev/vfio_sdmdev.c b/drivers/vfio/sdmdev/vfio_sdmdev.c
new file mode 100644
index 000000000000..c6eb5d4bdab0
--- /dev/null
+++ b/drivers/vfio/sdmdev/vfio_sdmdev.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/module.h>
+#include <linux/vfio_sdmdev.h>
+
+static struct class *sdmdev_class;
+
+static int vfio_sdmdev_dev_exist(struct device *dev, void *data)
+{
+	return !strcmp(dev_name(dev), dev_name((struct device *)data));
+}
+
+#ifdef CONFIG_IOMMU_SVA
+static bool vfio_sdmdev_is_valid_pasid(int pasid)
+{
+	struct mm_struct *mm;
+
+	mm = iommu_sva_find(pasid);
+	if (mm) {
+		mmput(mm);
+		return mm == current->mm;
+	}
+
+	return false;
+}
+#endif
+
+/* Check if the device is a mediated device belonging to vfio_sdmdev */
+int vfio_sdmdev_is_sdmdev(struct device *dev)
+{
+	struct mdev_device *mdev;
+	struct device *pdev;
+
+	mdev = mdev_from_dev(dev);
+	if (!mdev)
+		return 0;
+
+	pdev = mdev_parent_dev(mdev);
+	if (!pdev)
+		return 0;
+
+	return class_for_each_device(sdmdev_class, NULL, pdev,
+			vfio_sdmdev_dev_exist);
+}
+EXPORT_SYMBOL_GPL(vfio_sdmdev_is_sdmdev);
+
+struct vfio_sdmdev *vfio_sdmdev_pdev_sdmdev(struct device *dev)
+{
+	struct device *class_dev;
+
+	if (!dev)
+		return ERR_PTR(-EINVAL);
+
+	class_dev = class_find_device(sdmdev_class, NULL, dev,
+		(int(*)(struct device *, const void *))vfio_sdmdev_dev_exist);
+	if (!class_dev)
+		return ERR_PTR(-ENODEV);
+
+	return container_of(class_dev, struct vfio_sdmdev, cls_dev);
+}
+EXPORT_SYMBOL_GPL(vfio_sdmdev_pdev_sdmdev);
+
+struct vfio_sdmdev *mdev_sdmdev(struct mdev_device *mdev)
+{
+	struct device *pdev = mdev_parent_dev(mdev);
+
+	return vfio_sdmdev_pdev_sdmdev(pdev);
+}
+EXPORT_SYMBOL_GPL(mdev_sdmdev);
+
+static ssize_t iommu_type_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
+
+	if (IS_ERR(sdmdev))
+		return PTR_ERR(sdmdev);
+
+	return sprintf(buf, "%d\n", sdmdev->iommu_type);
+}
+
+static DEVICE_ATTR_RO(iommu_type);
+
+static ssize_t dma_flag_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
+
+	if (IS_ERR(sdmdev))
+		return PTR_ERR(sdmdev);
+
+	return sprintf(buf, "%d\n", sdmdev->dma_flag);
+}
+
+static DEVICE_ATTR_RO(dma_flag);
+
+/* mdev->dev_attr_groups */
+static struct attribute *vfio_sdmdev_attrs[] = {
+	&dev_attr_iommu_type.attr,
+	&dev_attr_dma_flag.attr,
+	NULL,
+};
+static const struct attribute_group vfio_sdmdev_group = {
+	.name  = VFIO_SDMDEV_PDEV_ATTRS_GRP_NAME,
+	.attrs = vfio_sdmdev_attrs,
+};
+const struct attribute_group *vfio_sdmdev_groups[] = {
+	&vfio_sdmdev_group,
+	NULL,
+};
+
+/* default attributes for mdev->supported_type_groups, used by registerer*/
+#define MDEV_TYPE_ATTR_RO_EXPORT(name) \
+		MDEV_TYPE_ATTR_RO(name); \
+		EXPORT_SYMBOL_GPL(mdev_type_attr_##name);
+
+#define DEF_SIMPLE_SDMDEV_ATTR(_name, sdmdev_member, format) \
+static ssize_t _name##_show(struct kobject *kobj, struct device *dev, \
+			    char *buf) \
+{ \
+	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev); \
+	if (!sdmdev) \
+		return -ENODEV; \
+	return sprintf(buf, format, sdmdev->sdmdev_member); \
+} \
+MDEV_TYPE_ATTR_RO_EXPORT(_name)
+
+DEF_SIMPLE_SDMDEV_ATTR(flags, flags, "%d");
+DEF_SIMPLE_SDMDEV_ATTR(name, name, "%s"); /* this should be algorithm name, */
+		/* but you would not care if you have only one algorithm */
+DEF_SIMPLE_SDMDEV_ATTR(device_api, api_ver, "%s");
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
+	int nr_inst = 0;
+
+	nr_inst = sdmdev->ops->get_available_instances ?
+		sdmdev->ops->get_available_instances(sdmdev) : 0;
+	return sprintf(buf, "%d", nr_inst);
+}
+MDEV_TYPE_ATTR_RO_EXPORT(available_instances);
+
+static int vfio_sdmdev_mdev_create(struct kobject *kobj,
+	struct mdev_device *mdev)
+{
+	struct device *pdev = mdev_parent_dev(mdev);
+	struct vfio_sdmdev_queue *q;
+	struct vfio_sdmdev *sdmdev = mdev_sdmdev(mdev);
+	int ret;
+
+	if (!sdmdev->ops->get_queue)
+		return -ENODEV;
+
+	ret = sdmdev->ops->get_queue(sdmdev, &q);
+	if (ret)
+		return ret;
+
+	q->sdmdev = sdmdev;
+	q->mdev = mdev;
+	init_waitqueue_head(&q->wait);
+
+	mdev_set_drvdata(mdev, q);
+	get_device(pdev);
+
+	return 0;
+}
+
+static int vfio_sdmdev_mdev_remove(struct mdev_device *mdev)
+{
+	struct vfio_sdmdev_queue *q =
+		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+	struct device *pdev = mdev_parent_dev(mdev);
+
+	put_device(pdev);
+
+	if (sdmdev->ops->put_queue)
+		sdmdev->ops->put_queue(q);
+
+	return 0;
+}
+
+/* Wake up the processes waiting on this queue */
+void vfio_sdmdev_wake_up(struct vfio_sdmdev_queue *q)
+{
+	wake_up_all(&q->wait);
+}
+EXPORT_SYMBOL_GPL(vfio_sdmdev_wake_up);
+
+static int vfio_sdmdev_mdev_mmap(struct mdev_device *mdev,
+				 struct vm_area_struct *vma)
+{
+	struct vfio_sdmdev_queue *q =
+		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+
+	if (sdmdev->ops->mmap)
+		return sdmdev->ops->mmap(q, vma);
+
+	dev_err(sdmdev->dev, "no driver mmap!\n");
+	return -EINVAL;
+}
+
+static inline int vfio_sdmdev_wait(struct vfio_sdmdev_queue *q,
+				   unsigned long timeout)
+{
+	int ret;
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+
+	if (!sdmdev->ops->mask_notify)
+		return -ENODEV;
+
+	sdmdev->ops->mask_notify(q, VFIO_SDMDEV_EVENT_Q_UPDATE);
+
+	ret = timeout ?  wait_event_interruptible_timeout(q->wait,
+			sdmdev->ops->is_q_updated(q), timeout) :
+		     wait_event_interruptible(q->wait,
+			sdmdev->ops->is_q_updated(q));
+
+	sdmdev->ops->mask_notify(q, 0);
+
+	return ret;
+}
+
+static long vfio_sdmdev_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct vfio_sdmdev_queue *q =
+		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+
+	switch (cmd) {
+	case VFIO_SDMDEV_CMD_WAIT:
+		return vfio_sdmdev_wait(q, arg);
+
+#ifdef CONFIG_IOMMU_SVA
+	case VFIO_SDMDEV_CMD_BIND_PASID:
+		int ret;
+
+		if (!vfio_sdmdev_is_valid_pasid(arg))
+			return -EINVAL;
+
+		mutex_lock(&q->mutex);
+		q->pasid = arg;
+
+		if (sdmdev->ops->start_queue)
+			ret = sdmdev->ops->start_queue(q);
+
+		mutex_unlock(&q->mutex);
+
+		return ret;
+#endif
+
+	default:
+		if (sdmdev->ops->ioctl)
+			return sdmdev->ops->ioctl(q, cmd, arg);
+
+		dev_err(sdmdev->dev, "ioctl cmd (%d) is not supported!\n", cmd);
+		return -EINVAL;
+	}
+}
+
+static void vfio_sdmdev_release(struct device *dev) { }
+
+static void vfio_sdmdev_mdev_release(struct mdev_device *mdev)
+{
+	struct vfio_sdmdev_queue *q =
+		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+
+	if (sdmdev->ops->stop_queue)
+		sdmdev->ops->stop_queue(q);
+}
+
+static int vfio_sdmdev_mdev_open(struct mdev_device *mdev)
+{
+#ifndef CONFIG_IOMMU_SVA
+	struct vfio_sdmdev_queue *q =
+		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
+	struct vfio_sdmdev *sdmdev = q->sdmdev;
+
+	if (sdmdev->ops->start_queue)
+		sdmdev->ops->start_queue(q);
+#endif
+
+	return 0;
+}
+
+/**
+ *	vfio_sdmdev_register - register a sdmdev
+ *	@sdmdev: device structure
+ */
+int vfio_sdmdev_register(struct vfio_sdmdev *sdmdev)
+{
+	int ret;
+
+	if (!sdmdev->dev)
+		return -ENODEV;
+
+	atomic_set(&sdmdev->ref, 0);
+	sdmdev->cls_dev.parent = sdmdev->dev;
+	sdmdev->cls_dev.class = sdmdev_class;
+	sdmdev->cls_dev.release = vfio_sdmdev_release;
+	dev_set_name(&sdmdev->cls_dev, "%s", dev_name(sdmdev->dev));
+	ret = device_register(&sdmdev->cls_dev);
+	if (ret)
+		goto err;
+
+	sdmdev->mdev_fops.owner			= THIS_MODULE;
+	sdmdev->mdev_fops.dev_attr_groups	= vfio_sdmdev_groups;
+	WARN_ON(!sdmdev->mdev_fops.supported_type_groups);
+	sdmdev->mdev_fops.create		= vfio_sdmdev_mdev_create;
+	sdmdev->mdev_fops.remove		= vfio_sdmdev_mdev_remove;
+	sdmdev->mdev_fops.ioctl			= vfio_sdmdev_mdev_ioctl;
+	sdmdev->mdev_fops.open			= vfio_sdmdev_mdev_open;
+	sdmdev->mdev_fops.release		= vfio_sdmdev_mdev_release;
+	sdmdev->mdev_fops.mmap			= vfio_sdmdev_mdev_mmap;
+
+	ret = mdev_register_device(sdmdev->dev, &sdmdev->mdev_fops);
+	if (ret)
+		goto err_with_cls_dev;
+
+	return 0;
+
+err_with_cls_dev:
+	device_unregister(&sdmdev->cls_dev);
+err:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_sdmdev_register);
+
+/**
+ * vfio_sdmdev_unregister - unregisters a sdmdev
+ * @sdmdev: device to unregister
+ *
+ * Unregister a sdmdev that was previously successfully registered with
+ * vfio_sdmdev_register().
+ */
+void vfio_sdmdev_unregister(struct vfio_sdmdev *sdmdev)
+{
+	mdev_unregister_device(sdmdev->dev);
+	device_unregister(&sdmdev->cls_dev);
+}
+EXPORT_SYMBOL_GPL(vfio_sdmdev_unregister);
+
+static int __init vfio_sdmdev_init(void)
+{
+	sdmdev_class = class_create(THIS_MODULE, VFIO_SDMDEV_CLASS_NAME);
+	return PTR_ERR_OR_ZERO(sdmdev_class);
+}
+
+static __exit void vfio_sdmdev_exit(void)
+{
+	class_destroy(sdmdev_class);
+}
+
+module_init(vfio_sdmdev_init);
+module_exit(vfio_sdmdev_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Hisilicon Tech. Co., Ltd.");
+MODULE_DESCRIPTION("VFIO Share Domain Mediated Device");
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d9fd3188615d..ba73231d8692 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/dma-iommu.h>
 #include <linux/irqdomain.h>
+#include <linux/vfio_sdmdev.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -89,6 +90,8 @@ struct vfio_dma {
 };
 
 struct vfio_group {
+	/* iommu_group of mdev's parent device */
+	struct iommu_group	*parent_group;
 	struct iommu_group	*iommu_group;
 	struct list_head	next;
 };
@@ -1327,6 +1330,109 @@ static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)
 	return ret;
 }
 
+/* return 0 if the device is not sdmdev.
+ * return 1 if the device is sdmdev, the data will be updated with parent
+ *	device's group.
+ * return -errno if other error.
+ */
+static int vfio_sdmdev_type(struct device *dev, void *data)
+{
+	struct iommu_group **group = data;
+	struct iommu_group *pgroup;
+	int (*_is_sdmdev)(struct device *dev);
+	struct device *pdev;
+	int ret = 1;
+
+	/* symbol_get() returns NULL if vfio_sdmdev is not built or loaded */
+	_is_sdmdev = symbol_get(vfio_sdmdev_is_sdmdev);
+	if (!_is_sdmdev)
+		return 0;
+
+	/* check if it belongs to vfio_sdmdev device */
+	if (!_is_sdmdev(dev)) {
+		ret = 0;
+		goto out;
+	}
+
+	pdev = dev->parent;
+	pgroup = iommu_group_get(pdev);
+	if (!pgroup) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	if (group) {
+		/* all devices must share the same parent group */
+		if (*group && *group != pgroup)
+			ret = -ENODEV;
+		else
+			*group = pgroup;
+	}
+
+	iommu_group_put(pgroup);
+
+out:
+	symbol_put(vfio_sdmdev_is_sdmdev);
+
+	return ret;
+}
+
+/* return 0 or -errno */
+static int vfio_sdmdev_bus(struct device *dev, void *data)
+{
+	struct bus_type **bus = data;
+
+	if (!dev->bus)
+		return -ENODEV;
+
+	/* ensure all devices have the same bus_type */
+	if (*bus && *bus != dev->bus)
+		return -EINVAL;
+
+	*bus = dev->bus;
+	return 0;
+}
+
+/* return 0 means it is not sd group, 1 means it is, or -EXXX for error */
+static int vfio_iommu_type1_attach_sdgroup(struct vfio_domain *domain,
+					    struct vfio_group *group,
+					    struct iommu_group *iommu_group)
+{
+	int ret;
+	struct bus_type *pbus = NULL;
+	struct iommu_group *pgroup = NULL;
+
+	ret = iommu_group_for_each_dev(iommu_group, &pgroup,
+				       vfio_sdmdev_type);
+	if (ret < 0)
+		goto out;
+	else if (ret > 0) {
+		domain->domain = iommu_group_share_domain(pgroup);
+		if (IS_ERR(domain->domain))
+			return PTR_ERR(domain->domain);
+		ret = iommu_group_for_each_dev(pgroup, &pbus,
+				       vfio_sdmdev_bus);
+		if (ret < 0)
+			goto err_with_share_domain;
+
+		if (pbus && iommu_capable(pbus, IOMMU_CAP_CACHE_COHERENCY))
+			domain->prot |= IOMMU_CACHE;
+
+		group->parent_group = pgroup;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+
+		return 1;
+	}
+
+	return 0;
+
+err_with_share_domain:
+	iommu_group_unshare_domain(pgroup);
+out:
+	return ret;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -1335,8 +1441,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL, *mdev_bus;
 	int ret;
-	bool resv_msi, msi_remap;
-	phys_addr_t resv_msi_base;
+	bool resv_msi = false, msi_remap;
+	phys_addr_t resv_msi_base = 0;
 
 	mutex_lock(&iommu->lock);
 
@@ -1373,6 +1479,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (mdev_bus) {
 		if ((bus == mdev_bus) && !iommu_present(bus)) {
 			symbol_put(mdev_bus_type);
+
+			ret = vfio_iommu_type1_attach_sdgroup(domain, group,
+					iommu_group);
+			if (ret < 0)
+				goto out_free;
+			else if (ret > 0)
+				goto replay_check;
+
 			if (!iommu->external_domain) {
 				INIT_LIST_HEAD(&domain->group_list);
 				iommu->external_domain = domain;
@@ -1451,12 +1565,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	vfio_test_domain_fgsp(domain);
 
+replay_check:
 	/* replay mappings on new domains */
 	ret = vfio_iommu_replay(iommu, domain);
 	if (ret)
 		goto out_detach;
 
-	if (resv_msi) {
+	if (!group->parent_group && resv_msi) {
 		ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
 		if (ret)
 			goto out_detach;
@@ -1471,7 +1586,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 out_detach:
 	iommu_detach_group(domain->domain, iommu_group);
 out_domain:
-	iommu_domain_free(domain->domain);
+	if (group->parent_group)
+		iommu_group_unshare_domain(group->parent_group);
+	else
+		iommu_domain_free(domain->domain);
 out_free:
 	kfree(domain);
 	kfree(group);
@@ -1527,12 +1645,25 @@ static void vfio_sanity_check_pfn_list(struct vfio_iommu *iommu)
 	WARN_ON(iommu->notifier.head);
 }
 
+static void vfio_iommu_undo(struct vfio_iommu *iommu,
+			    struct iommu_domain *domain)
+{
+	struct rb_node *n = rb_first(&iommu->dma_list);
+	struct vfio_dma *dma;
+
+	for (; n; n = rb_next(n)) {
+		dma = rb_entry(n, struct vfio_dma, node);
+		iommu_unmap(domain, dma->iova, dma->size);
+	}
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain;
 	struct vfio_group *group;
+	struct iommu_domain *sdomain = NULL;
 
 	mutex_lock(&iommu->lock);
 
@@ -1560,7 +1691,12 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		if (!group)
 			continue;
 
-		iommu_detach_group(domain->domain, iommu_group);
+		if (group->parent_group)
+			sdomain = iommu_group_unshare_domain(
+					group->parent_group);
+		else
+			iommu_detach_group(domain->domain, iommu_group);
+
 		list_del(&group->next);
 		kfree(group);
 		/*
@@ -1577,7 +1713,10 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 				else
 					vfio_iommu_unmap_unpin_reaccount(iommu);
 			}
-			iommu_domain_free(domain->domain);
+			if (domain->domain != sdomain)
+				iommu_domain_free(domain->domain);
+			else
+				vfio_iommu_undo(iommu, sdomain);
 			list_del(&domain->next);
 			kfree(domain);
 		}
diff --git a/include/linux/vfio_sdmdev.h b/include/linux/vfio_sdmdev.h
new file mode 100644
index 000000000000..fbc9fb3f4abc
--- /dev/null
+++ b/include/linux/vfio_sdmdev.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef __VFIO_SDMDEV_H
+#define __VFIO_SDMDEV_H
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+#include <linux/vfio.h>
+#include <uapi/linux/vfio_sdmdev.h>
+
+struct vfio_sdmdev_queue;
+struct vfio_sdmdev;
+
+/* event bit used to mask the hardware irq */
+#define VFIO_SDMDEV_EVENT_Q_UPDATE BIT(0) /* irq if queue is updated */
+
+/**
+ * struct vfio_sdmdev_ops - WarpDrive (WD) device operations
+ * @get_queue: get a queue from the device according to the algorithm
+ * @put_queue: release a queue back to the device
+ * @start_queue: put the queue into action with the current process's pasid
+ * @stop_queue: stop a running queue
+ * @is_q_updated: check whether the task is finished
+ * @mask_notify: mask the task irq of the queue
+ * @mmap: mmap the queue's addresses to user space
+ * @reset: reset the WD device
+ * @reset_queue: reset the queue
+ * @ioctl: ioctl for user space users of the queue
+ * @get_available_instances: get the number of queues remaining
+ */
+struct vfio_sdmdev_ops {
+	int (*get_queue)(struct vfio_sdmdev *sdmdev,
+			 struct vfio_sdmdev_queue **q);
+	void (*put_queue)(struct vfio_sdmdev_queue *q);
+	int (*start_queue)(struct vfio_sdmdev_queue *q);
+	void (*stop_queue)(struct vfio_sdmdev_queue *q);
+	int (*is_q_updated)(struct vfio_sdmdev_queue *q);
+	void (*mask_notify)(struct vfio_sdmdev_queue *q, int event_mask);
+	int (*mmap)(struct vfio_sdmdev_queue *q, struct vm_area_struct *vma);
+	int (*reset)(struct vfio_sdmdev *sdmdev);
+	int (*reset_queue)(struct vfio_sdmdev_queue *q);
+	long (*ioctl)(struct vfio_sdmdev_queue *q, unsigned int cmd,
+			unsigned long arg);
+	int (*get_available_instances)(struct vfio_sdmdev *sdmdev);
+};
+
+struct vfio_sdmdev_queue {
+	struct mutex mutex;
+	struct vfio_sdmdev *sdmdev;
+	__u32 flags;
+	void *priv;
+	wait_queue_head_t wait;
+	struct mdev_device *mdev;
+	int fd;
+	int container;
+#ifdef CONFIG_IOMMU_SVA
+	int pasid;
+#endif
+};
+
+struct vfio_sdmdev {
+	const char *name;
+	int status;
+	atomic_t ref;
+	const struct vfio_sdmdev_ops *ops;
+	struct device *dev;
+	struct device cls_dev;
+	bool is_vf;
+	u32 iommu_type;
+	u32 dma_flag;
+	void *priv;
+	int flags;
+	const char *api_ver;
+	struct mdev_parent_ops mdev_fops;
+};
+
+int vfio_sdmdev_register(struct vfio_sdmdev *sdmdev);
+void vfio_sdmdev_unregister(struct vfio_sdmdev *sdmdev);
+void vfio_sdmdev_wake_up(struct vfio_sdmdev_queue *q);
+int vfio_sdmdev_is_sdmdev(struct device *dev);
+struct vfio_sdmdev *vfio_sdmdev_pdev_sdmdev(struct device *dev);
+struct vfio_sdmdev *mdev_sdmdev(struct mdev_device *mdev);
+
+extern struct mdev_type_attribute mdev_type_attr_flags;
+extern struct mdev_type_attribute mdev_type_attr_name;
+extern struct mdev_type_attribute mdev_type_attr_device_api;
+extern struct mdev_type_attribute mdev_type_attr_available_instances;
+#define VFIO_SDMDEV_DEFAULT_MDEV_TYPE_ATTRS \
+	&mdev_type_attr_name.attr, \
+	&mdev_type_attr_device_api.attr, \
+	&mdev_type_attr_available_instances.attr, \
+	&mdev_type_attr_flags.attr
+
+#define _VFIO_SDMDEV_REGION(vm_pgoff)	(vm_pgoff & 0xf)
+
+#endif
diff --git a/include/uapi/linux/vfio_sdmdev.h b/include/uapi/linux/vfio_sdmdev.h
new file mode 100644
index 000000000000..79fa33fbc8c0
--- /dev/null
+++ b/include/uapi/linux/vfio_sdmdev.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef _UAPIVFIO_SDMDEV_H
+#define _UAPIVFIO_SDMDEV_H
+
+#include <linux/ioctl.h>
+
+#define VFIO_SDMDEV_CLASS_NAME		"sdmdev"
+
+/* Device ATTRs in parent dev SYSFS DIR */
+#define VFIO_SDMDEV_PDEV_ATTRS_GRP_NAME	"params"
+
+/* Parent device attributes */
+#define SDMDEV_IOMMU_TYPE	"iommu_type"
+#define SDMDEV_DMA_FLAG		"dma_flag"
+
+/* Maximum length of algorithm name string */
+#define VFIO_SDMDEV_ALG_NAME_SIZE		64
+
+/* the bits used in SDMDEV_DMA_FLAG attributes */
+#define VFIO_SDMDEV_DMA_INVALID			0
+#define	VFIO_SDMDEV_DMA_SINGLE_PROC_MAP		1
+#define	VFIO_SDMDEV_DMA_MULTI_PROC_MAP		2
+#define	VFIO_SDMDEV_DMA_SVM			4
+#define	VFIO_SDMDEV_DMA_SVM_NO_FAULT		8
+#define	VFIO_SDMDEV_DMA_PHY			16
+
+#define VFIO_SDMDEV_CMD_WAIT		_IO('W', 1)
+#define VFIO_SDMDEV_CMD_BIND_PASID	_IO('W', 2)
+#endif
-- 
2.17.1
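
A note on the callback convention in the vfio_iommu_type1.c hunk:
vfio_sdmdev_type() returns 0, 1 or -errno and is driven by
iommu_group_for_each_dev(). The following is a minimal user-space sketch
of that contract, not kernel code. It assumes the iterator stops at the
first non-zero callback return, as the mainline helper appears to do.
struct model_dev, model_for_each_dev() and model_sdmdev_type() are
hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in an iommu group. */
struct model_dev {
	int is_sdmdev;		/* what vfio_sdmdev_is_sdmdev() would report */
	int parent_group;	/* id of the parent's iommu group, 0 if none */
};

typedef int (*model_fn)(struct model_dev *dev, void *data);

/* Assumed iterator behaviour: stop at the first non-zero return. */
static int model_for_each_dev(struct model_dev *devs, size_t n, void *data,
			      model_fn fn)
{
	int ret = 0;
	size_t i;

	assert(fn != NULL);
	for (i = 0; i < n; i++) {
		ret = fn(&devs[i], data);
		if (ret)
			break;
	}
	return ret;
}

/* Same 0 / 1 / -errno contract as vfio_sdmdev_type() in the patch. */
static int model_sdmdev_type(struct model_dev *dev, void *data)
{
	int *group = data;

	if (!dev->is_sdmdev)
		return 0;		/* not an sdmdev, keep walking */
	if (!dev->parent_group)
		return -ENODEV;		/* parent has no iommu group */
	if (group) {
		if (*group && *group != dev->parent_group)
			return -ENODEV;	/* parents disagree */
		*group = dev->parent_group;
	}
	return 1;			/* sdmdev found, walk stops here */
}
```

Under that assumption the walk ends at the first sdmdev it meets, so a
later device with a different parent group is never compared against the
stored one. Returning 0 for a match and reporting "is sdmdev" through the
data pointer would be one way to visit every device.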



* [PATCH 4/7] crypto: add hisilicon Queue Manager driver
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (2 preceding siblings ...)
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
@ 2018-09-03  0:52 ` Kenneth Lee
  2018-09-03  2:15   ` Randy Dunlap
  2018-09-03  0:52 ` [PATCH 5/7] crypto: Add Hisilicon Zip driver Kenneth Lee
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:52 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

Hisilicon QM (Queue Manager) is a general IP block used by some Hisilicon
accelerators. It provides a general PCIe interface through which the CPU
and the accelerator share a group of queues.

This commit adds a library that accelerator drivers use to access the QM
hardware.

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Hao Fang <fanghao11@huawei.com>
---
 drivers/crypto/Makefile           |   2 +-
 drivers/crypto/hisilicon/Kconfig  |   8 +
 drivers/crypto/hisilicon/Makefile |   1 +
 drivers/crypto/hisilicon/qm.c     | 820 ++++++++++++++++++++++++++++++
 drivers/crypto/hisilicon/qm.h     | 110 ++++
 5 files changed, 940 insertions(+), 1 deletion(-)
 create mode 100644 drivers/crypto/hisilicon/qm.c
 create mode 100644 drivers/crypto/hisilicon/qm.h
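
For readers new to the QM, two mechanisms in qm.c are worth a small
illustration: the 64-bit doorbell word built by qm_db(), and the
phase-bit scheme that qm_poll_qp() and qm_cq_head_update() use to detect
new completion entries. The following is a user-space sketch under stated
assumptions, not driver code. The bit layout is copied from qm_db() in
this patch. MODEL_Q_DEPTH, struct model_cq and the model_* functions are
hypothetical names, and the ring holds only the phase bit of each entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_Q_DEPTH	4	/* tiny depth for the sketch; the patch uses 1024 */

/* Layout as in qm_db(): qn[15:0] cmd[31:16] index[47:32] priority[63:48] */
static uint64_t model_doorbell(uint16_t qn, uint8_t cmd, uint16_t index,
			       uint8_t priority)
{
	uint64_t db = (uint64_t)qn | ((uint64_t)cmd << 16);

	db |= ((uint64_t)index | ((uint64_t)priority << 16)) << 32;
	return db;
}

struct model_cq {
	uint8_t entry_phase[MODEL_Q_DEPTH];	/* phase bit written by "hardware" */
	uint16_t head;				/* next entry the consumer reads */
	uint8_t phase;				/* phase the consumer expects */
};

static void model_cq_init(struct model_cq *cq)
{
	memset(cq, 0, sizeof(*cq));
	cq->phase = 1;	/* first lap is written with phase 1 */
}

/* Consume every entry whose phase matches; return how many were consumed. */
static unsigned int model_cq_poll(struct model_cq *cq)
{
	unsigned int n = 0;

	assert(cq->head < MODEL_Q_DEPTH);
	while (cq->entry_phase[cq->head] == cq->phase) {
		n++;
		if (cq->head == MODEL_Q_DEPTH - 1) {
			/* wrap: entries of the next lap carry the other phase */
			cq->head = 0;
			cq->phase ^= 1;
		} else {
			cq->head++;
		}
	}
	return n;
}
```

Because the expected phase flips on every wrap, stale entries from the
previous lap never match, so the consumer needs no separate valid flag or
producer index to know where new completions end.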

diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index c23396f32c8a..f3a7abe42424 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -46,4 +46,4 @@ obj-$(CONFIG_CRYPTO_DEV_VMX) += vmx/
 obj-$(CONFIG_CRYPTO_DEV_BCM_SPU) += bcm/
 obj-$(CONFIG_CRYPTO_DEV_SAFEXCEL) += inside-secure/
 obj-$(CONFIG_CRYPTO_DEV_ARTPEC6) += axis/
-obj-y += hisilicon/
+obj-$(CONFIG_CRYPTO_DEV_HISILICON) += hisilicon/
diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
index 8ca9c503bcb0..02a6eef84101 100644
--- a/drivers/crypto/hisilicon/Kconfig
+++ b/drivers/crypto/hisilicon/Kconfig
@@ -1,4 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
+config CRYPTO_DEV_HISILICON
+	tristate "Support for HISILICON CRYPTO ACCELERATOR"
+	help
+	  Enable this to use Hisilicon Hardware Accelerators
 
 config CRYPTO_DEV_HISI_SEC
 	tristate "Support for Hisilicon SEC crypto block cipher accelerator"
@@ -12,3 +16,7 @@ config CRYPTO_DEV_HISI_SEC
 
 	  To compile this as a module, choose M here: the module
 	  will be called hisi_sec.
+
+config CRYPTO_DEV_HISI_QM
+	tristate
+	depends on ARM64 && PCI
diff --git a/drivers/crypto/hisilicon/Makefile b/drivers/crypto/hisilicon/Makefile
index 463f46ace182..05e9052e0f52 100644
--- a/drivers/crypto/hisilicon/Makefile
+++ b/drivers/crypto/hisilicon/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_CRYPTO_DEV_HISI_SEC) += sec/
+obj-$(CONFIG_CRYPTO_DEV_HISI_QM) += qm.o
diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
new file mode 100644
index 000000000000..ea618b4d0929
--- /dev/null
+++ b/drivers/crypto/hisilicon/qm.c
@@ -0,0 +1,820 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <asm/page.h>
+#include <linux/bitmap.h>
+#include <linux/dma-mapping.h>
+#include <linux/io.h>
+#include <linux/irqreturn.h>
+#include <linux/log2.h>
+#include "qm.h"
+
+#define QM_DEF_Q_NUM			128
+
+/* eq/aeq irq enable */
+#define QM_VF_AEQ_INT_SOURCE		0x0
+#define QM_VF_AEQ_INT_MASK		0x4
+#define QM_VF_EQ_INT_SOURCE		0x8
+#define QM_VF_EQ_INT_MASK		0xc
+
+/* mailbox */
+#define MAILBOX_CMD_SQC			0x0
+#define MAILBOX_CMD_CQC			0x1
+#define MAILBOX_CMD_EQC			0x2
+#define MAILBOX_CMD_SQC_BT		0x4
+#define MAILBOX_CMD_CQC_BT		0x5
+
+#define MAILBOX_CMD_SEND_BASE		0x300
+#define MAILBOX_EVENT_SHIFT		8
+#define MAILBOX_STATUS_SHIFT		9
+#define MAILBOX_BUSY_SHIFT		13
+#define MAILBOX_OP_SHIFT		14
+#define MAILBOX_QUEUE_SHIFT		16
+
+/* sqc shift */
+#define SQ_HEAD_SHIFT			0
+#define SQ_TAIL_SHIFI			16
+#define SQ_HOP_NUM_SHIFT		0
+#define SQ_PAGE_SIZE_SHIFT		4
+#define SQ_BUF_SIZE_SHIFT		8
+#define SQ_SQE_SIZE_SHIFT		12
+#define SQ_HEAD_IDX_SIG_SHIFT		0
+#define SQ_TAIL_IDX_SIG_SHIFT		0
+#define SQ_CQN_SHIFT			0
+#define SQ_PRIORITY_SHIFT		0
+#define SQ_ORDERS_SHIFT			4
+#define SQ_TYPE_SHIFT			8
+
+#define SQ_TYPE_MASK			0xf
+
+/* cqc shift */
+#define CQ_HEAD_SHIFT			0
+#define CQ_TAIL_SHIFI			16
+#define CQ_HOP_NUM_SHIFT		0
+#define CQ_PAGE_SIZE_SHIFT		4
+#define CQ_BUF_SIZE_SHIFT		8
+#define CQ_SQE_SIZE_SHIFT		12
+#define CQ_PASID			0
+#define CQ_HEAD_IDX_SIG_SHIFT		0
+#define CQ_TAIL_IDX_SIG_SHIFT		0
+#define CQ_CQN_SHIFT			0
+#define CQ_PRIORITY_SHIFT		16
+#define CQ_ORDERS_SHIFT			0
+#define CQ_TYPE_SHIFT			0
+#define CQ_PHASE_SHIFT			0
+#define CQ_FLAG_SHIFT			1
+
+#define CQC_HEAD_INDEX(cqc)		((cqc)->cq_head)
+#define CQC_PHASE(cqc)			(((cqc)->dw6) & 0x1)
+#define CQC_CQ_ADDRESS(cqc)		(((u64)((cqc)->cq_base_h) << 32) | \
+					 ((cqc)->cq_base_l))
+#define CQC_PHASE_BIT			0x1
+
+/* eqc shift */
+#define MB_EQC_EQE_SHIFT		12
+#define MB_EQC_PHASE_SHIFT		16
+
+#define EQC_HEAD_INDEX(eqc)		((eqc)->eq_head)
+#define EQC_TAIL_INDEX(eqc)		((eqc)->eq_tail)
+#define EQC_PHASE(eqc)			((((eqc)->dw6) >> 16) & 0x1)
+
+#define EQC_PHASE_BIT			0x00010000
+
+/* cqe shift */
+#define CQE_PHASE(cqe)			((cqe)->w7 & 0x1)
+#define CQE_SQ_NUM(cqe)			((cqe)->sq_num)
+#define CQE_SQ_HEAD_INDEX(cqe)		((cqe)->sq_head)
+
+/* eqe shift */
+#define EQE_PHASE(eqe)			(((eqe)->dw0 >> 16) & 0x1)
+#define EQE_CQN(eqe)			(((eqe)->dw0) & 0xffff)
+
+#define QM_EQE_CQN_MASK			0xffff
+
+/* doorbell */
+#define DOORBELL_CMD_SQ			0
+#define DOORBELL_CMD_CQ			1
+#define DOORBELL_CMD_EQ			2
+#define DOORBELL_CMD_AEQ		3
+
+#define DOORBELL_CMD_SEND_BASE		0x340
+
+#define QM_MEM_START_INIT		0x100040
+#define QM_MEM_INIT_DONE		0x100044
+#define QM_VFT_CFG_RDY			0x10006c
+#define QM_VFT_CFG_OP_WR		0x100058
+#define QM_VFT_CFG_TYPE			0x10005c
+#define QM_SQC_VFT			0x0
+#define QM_CQC_VFT			0x1
+#define QM_VFT_CFG_ADDRESS		0x100060
+#define QM_VFT_CFG_OP_ENABLE		0x100054
+
+#define QM_VFT_CFG_DATA_L		0x100064
+#define QM_VFT_CFG_DATA_H		0x100068
+#define QM_SQC_VFT_BUF_SIZE		(7ULL << 8)
+#define QM_SQC_VFT_SQC_SIZE		(5ULL << 12)
+#define QM_SQC_VFT_INDEX_NUMBER		(1ULL << 16)
+#define QM_SQC_VFT_BT_INDEX_SHIFT	22
+#define QM_SQC_VFT_START_SQN_SHIFT	28
+#define QM_SQC_VFT_VALID		(1ULL << 44)
+#define QM_CQC_VFT_BUF_SIZE		(7ULL << 8)
+#define QM_CQC_VFT_SQC_SIZE		(5ULL << 12)
+#define QM_CQC_VFT_INDEX_NUMBER		(1ULL << 16)
+#define QM_CQC_VFT_BT_INDEX_SHIFT	22
+#define QM_CQC_VFT_VALID		(1ULL << 28)
+
+struct cqe {
+	__le32 rsvd0;
+	__le16 cmd_id;
+	__le16 rsvd1;
+	__le16 sq_head;
+	__le16 sq_num;
+	__le16 rsvd2;
+	__le16 w7;
+};
+
+struct eqe {
+	__le32 dw0;
+};
+
+struct sqc {
+	__le16 head;
+	__le16 tail;
+	__le32 base_l;
+	__le32 base_h;
+	__le32 dw3;
+	__le16 qes;
+	__le16 rsvd0;
+	__le16 pasid;
+	__le16 w11;
+	__le16 cq_num;
+	__le16 w13;
+	__le32 rsvd1;
+};
+
+struct cqc {
+	__le16 head;
+	__le16 tail;
+	__le32 base_l;
+	__le32 base_h;
+	__le32 dw3;
+	__le16 qes;
+	__le16 rsvd0;
+	__le16 pasid;
+	__le16 w11;
+	__le32 dw6;
+	__le32 rsvd1;
+};
+
+#define INIT_QC(qc, base) do { \
+	(qc)->head = 0; \
+	(qc)->tail = 0; \
+	(qc)->base_l = lower_32_bits(base); \
+	(qc)->base_h = upper_32_bits(base); \
+	(qc)->pasid = 0; \
+	(qc)->w11 = 0; \
+	(qc)->rsvd1 = 0; \
+	(qc)->qes = QM_Q_DEPTH - 1; \
+} while (0)
+
+struct eqc {
+	__le16 head;
+	__le16 tail;
+	__le32 base_l;
+	__le32 base_h;
+	__le32 dw3;
+	__le32 rsvd[2];
+	__le32 dw6;
+};
+
+struct mailbox {
+	__le16 w0;
+	__le16 queue_num;
+	__le32 base_l;
+	__le32 base_h;
+	__le32 rsvd;
+};
+
+struct doorbell {
+	__le16 queue_num;
+	__le16 cmd;
+	__le16 index;
+	__le16 priority;
+};
+
+#define QM_DMA_BUF(p, buf) ((struct buf *)(p)->buf.addr)
+#define QM_SQC(p) QM_DMA_BUF(p, sqc)
+#define QM_CQC(p) QM_DMA_BUF(p, cqc)
+#define QM_EQC(p) QM_DMA_BUF(p, eqc)
+#define QM_EQE(p) QM_DMA_BUF(p, eqe)
+
+#define QP_SQE_DMA(qp) ((qp)->scqe.dma)
+#define QP_CQE(qp) ((struct cqe *)((qp)->scqe.addr + \
+				   qp->qm->sqe_size * QM_Q_DEPTH))
+#define QP_CQE_DMA(qp) ((qp)->scqe.dma + qp->qm->sqe_size * QM_Q_DEPTH)
+
+static inline void qm_writel(struct qm_info *qm, u32 val, u32 offset)
+{
+	writel(val, qm->io_base + offset);
+}
+
+struct qm_info;
+
+struct hisi_acc_qm_hw_ops {
+	int (*vft_config)(struct qm_info *qm, u16 base, u32 number);
+};
+
+static inline int hacc_qm_mb_is_busy(struct qm_info *qm)
+{
+	u32 val;
+
+	return readl_relaxed_poll_timeout(QM_ADDR(qm, MAILBOX_CMD_SEND_BASE),
+		val, !((val >> MAILBOX_BUSY_SHIFT) & 0x1), 10, 1000);
+}
+
+static inline void qm_mb_write(struct qm_info *qm, void *src)
+{
+	void __iomem *fun_base = QM_ADDR(qm, MAILBOX_CMD_SEND_BASE);
+	unsigned long tmp0 = 0, tmp1 = 0;
+
+	asm volatile("ldp %0, %1, %3\n"
+		     "stp %0, %1, %2\n"
+		     "dsb sy\n"
+		     : "=&r" (tmp0),
+		       "=&r" (tmp1),
+		       "+Q" (*((char *)fun_base))
+		     : "Q" (*((char *)src))
+		     : "memory");
+}
+
+static int qm_mb(struct qm_info *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
+		 bool op, bool event)
+{
+	struct mailbox mailbox;
+	int i = 0;
+	int ret = 0;
+
+	memset(&mailbox, 0, sizeof(struct mailbox));
+
+	mailbox.w0 = cmd |
+		     (event ? 0x1 << MAILBOX_EVENT_SHIFT : 0) |
+		     (op ? 0x1 << MAILBOX_OP_SHIFT : 0) |
+		     (0x1 << MAILBOX_BUSY_SHIFT);
+	mailbox.queue_num = queue;
+	mailbox.base_l = lower_32_bits(dma_addr);
+	mailbox.base_h = upper_32_bits(dma_addr);
+	mailbox.rsvd = 0;
+
+	mutex_lock(&qm->mailbox_lock);
+
+	while (hacc_qm_mb_is_busy(qm) && i < 10)
+		i++;
+	if (i >= 10) {
+		ret = -EBUSY;
+		dev_err(&qm->pdev->dev, "QM mail box is busy!\n");
+		goto busy_unlock;
+	}
+	qm_mb_write(qm, &mailbox);
+	i = 0;
+	while (hacc_qm_mb_is_busy(qm) && i < 10)
+		i++;
+	if (i >= 10) {
+		ret = -EBUSY;
+		dev_err(&qm->pdev->dev, "QM mail box is still busy!\n");
+		goto busy_unlock;
+	}
+
+busy_unlock:
+	mutex_unlock(&qm->mailbox_lock);
+
+	return ret;
+}
+
+static void qm_db(struct qm_info *qm, u16 qn, u8 cmd, u16 index, u8 priority)
+{
+	u64 doorbell = 0;
+
+	doorbell = (u64)qn | ((u64)cmd << 16);
+	doorbell |= ((u64)index | ((u64)priority << 16)) << 32;
+
+	writeq(doorbell, QM_ADDR(qm, DOORBELL_CMD_SEND_BASE));
+}
+
+/* @return 0 - cq/eq event, 1 - async event, 2 - abnormal error */
+static u32 qm_get_irq_source(struct qm_info *qm)
+{
+	return readl(QM_ADDR(qm, QM_VF_EQ_INT_SOURCE));
+}
+
+static inline struct hisi_qp *to_hisi_qp(struct qm_info *qm, struct eqe *eqe)
+{
+	u16 cqn = eqe->dw0 & QM_EQE_CQN_MASK;
+	struct hisi_qp *qp;
+
+	read_lock(&qm->qps_lock);
+	qp = qm->qp_array[cqn];
+	read_unlock(&qm->qps_lock);
+
+	return qp;
+}
+
+static inline void qm_cq_head_update(struct hisi_qp *qp)
+{
+	if (qp->qp_status.cq_head == QM_Q_DEPTH - 1) {
+		QM_CQC(qp)->dw6 = QM_CQC(qp)->dw6 ^ CQC_PHASE_BIT;
+		qp->qp_status.cq_head = 0;
+	} else {
+		qp->qp_status.cq_head++;
+	}
+}
+
+static inline void qm_poll_qp(struct hisi_qp *qp, struct qm_info *qm)
+{
+	struct cqe *cqe;
+
+	cqe = QP_CQE(qp) + qp->qp_status.cq_head;
+
+	if (qp->req_cb) {
+		while (CQE_PHASE(cqe) == CQC_PHASE(QM_CQC(qp))) {
+			dma_rmb();
+			qp->req_cb(qp, QP_SQE_ADDR(qp) +
+				   qm->sqe_size *
+				   CQE_SQ_HEAD_INDEX(cqe));
+			qm_cq_head_update(qp);
+			cqe = QP_CQE(qp) + qp->qp_status.cq_head;
+		}
+	} else if (qp->event_cb) {
+		qp->event_cb(qp);
+		qm_cq_head_update(qp);
+		cqe = QP_CQE(qp) + qp->qp_status.cq_head;
+	}
+
+	qm_db(qm, qp->queue_id, DOORBELL_CMD_CQ, qp->qp_status.cq_head, 0);
+
+	/* set c_flag */
+	qm_db(qm, qp->queue_id, DOORBELL_CMD_CQ, qp->qp_status.cq_head, 1);
+}
+
+static irqreturn_t qm_irq_thread(int irq, void *data)
+{
+	struct qm_info *qm = data;
+	struct eqe *eqe = QM_EQE(qm) + qm->eq_head;
+	struct eqc *eqc = QM_EQC(qm);
+	struct hisi_qp *qp;
+
+	while (EQE_PHASE(eqe) == EQC_PHASE(eqc)) {
+		qp = to_hisi_qp(qm, eqe);
+		if (qp)
+			qm_poll_qp(qp, qm);
+
+		if (qm->eq_head == QM_Q_DEPTH - 1) {
+			eqc->dw6 = eqc->dw6 ^ EQC_PHASE_BIT;
+			eqe = QM_EQE(qm);
+			qm->eq_head = 0;
+		} else {
+			eqe++;
+			qm->eq_head++;
+		}
+
+		qm_db(qm, 0, DOORBELL_CMD_EQ, qm->eq_head, 0);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static void qm_init_qp_status(struct hisi_qp *qp)
+{
+	struct hisi_acc_qp_status *qp_status = &qp->qp_status;
+
+	qp_status->sq_tail = 0;
+	qp_status->sq_head = 0;
+	qp_status->cq_head = 0;
+	qp_status->cqc_phase = 1;
+	qp_status->is_sq_full = 0;
+}
+
+/* check if bit in regs is 1 */
+static inline int qm_acc_check(struct qm_info *qm, u32 offset, u32 bit)
+{
+	int val;
+
+	return readl_relaxed_poll_timeout(QM_ADDR(qm, offset), val,
+					  val & BIT(bit), 10, 1000);
+}
+
+static inline int qm_init_q_buffer(struct device *dev, size_t size,
+				   struct qm_dma_buffer *db)
+{
+	db->size = size;
+	db->addr = dma_zalloc_coherent(dev, size, &db->dma, GFP_KERNEL);
+	if (!db->addr)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static inline void qm_uninit_q_buffer(struct device *dev,
+				      struct qm_dma_buffer *db)
+{
+	dma_free_coherent(dev, db->size, db->addr, db->dma);
+}
+
+static inline int qm_init_bt(struct qm_info *qm, struct device *dev,
+			     size_t size, struct qm_dma_buffer *db, int mb_cmd)
+{
+	int ret;
+
+	ret = qm_init_q_buffer(dev, size, db);
+	if (ret)
+		return -ENOMEM;
+
+	ret = qm_mb(qm, mb_cmd, db->dma, 0, 0, 0);
+	if (ret) {
+		qm_uninit_q_buffer(dev, db);
+		return ret;
+	}
+
+	return 0;
+}
+
+/* the config should be done after hisi_qm_mem_start() */
+static int qm_vft_common_config(struct qm_info *qm, u16 base, u32 number)
+{
+	u64 tmp;
+	int ret;
+
+	ret = qm_acc_check(qm, QM_VFT_CFG_RDY, 0);
+	if (ret)
+		return ret;
+	qm_writel(qm, 0x0, QM_VFT_CFG_OP_WR);
+	qm_writel(qm, QM_SQC_VFT, QM_VFT_CFG_TYPE);
+	qm_writel(qm, qm->pdev->devfn, QM_VFT_CFG_ADDRESS);
+
+	tmp = QM_SQC_VFT_BUF_SIZE			|
+	      QM_SQC_VFT_SQC_SIZE			|
+	      QM_SQC_VFT_INDEX_NUMBER			|
+	      QM_SQC_VFT_VALID				|
+	      (u64)base << QM_SQC_VFT_START_SQN_SHIFT;
+
+	qm_writel(qm, tmp & 0xffffffff, QM_VFT_CFG_DATA_L);
+	qm_writel(qm, tmp >> 32, QM_VFT_CFG_DATA_H);
+
+	qm_writel(qm, 0x0, QM_VFT_CFG_RDY);
+	qm_writel(qm, 0x1, QM_VFT_CFG_OP_ENABLE);
+	ret = qm_acc_check(qm, QM_VFT_CFG_RDY, 0);
+	if (ret)
+		return ret;
+	tmp = 0;
+
+	qm_writel(qm, 0x0, QM_VFT_CFG_OP_WR);
+	qm_writel(qm, QM_CQC_VFT, QM_VFT_CFG_TYPE);
+	qm_writel(qm, qm->pdev->devfn, QM_VFT_CFG_ADDRESS);
+
+	tmp = QM_CQC_VFT_BUF_SIZE			|
+	      QM_CQC_VFT_SQC_SIZE			|
+	      QM_CQC_VFT_INDEX_NUMBER			|
+	      QM_CQC_VFT_VALID;
+
+	qm_writel(qm, tmp & 0xffffffff, QM_VFT_CFG_DATA_L);
+	qm_writel(qm, tmp >> 32, QM_VFT_CFG_DATA_H);
+
+	qm_writel(qm, 0x0, QM_VFT_CFG_RDY);
+	qm_writel(qm, 0x1, QM_VFT_CFG_OP_ENABLE);
+	ret = qm_acc_check(qm, QM_VFT_CFG_RDY, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+static struct hisi_acc_qm_hw_ops qm_hw_ops_v1 = {
+	.vft_config = qm_vft_common_config,
+};
+
+struct hisi_qp *hisi_qm_create_qp(struct qm_info *qm, u8 alg_type)
+{
+	struct hisi_qp *qp;
+	int qp_index;
+	int ret;
+
+	write_lock(&qm->qps_lock);
+	qp_index = find_first_zero_bit(qm->qp_bitmap, qm->qp_num);
+	if (qp_index >= qm->qp_num) {
+		write_unlock(&qm->qps_lock);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(qp_index, qm->qp_bitmap);
+
+	qp = kzalloc(sizeof(*qp), GFP_ATOMIC);
+	if (!qp) {
+		ret = -ENOMEM;
+		write_unlock(&qm->qps_lock);
+		goto err_with_bitset;
+	}
+
+	qp->queue_id = qp_index;
+	qp->qm = qm;
+	qp->alg_type = alg_type;
+	qm_init_qp_status(qp);
+
+	write_unlock(&qm->qps_lock);
+	return qp;
+
+err_with_bitset:
+	clear_bit(qp_index, qm->qp_bitmap);
+
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(hisi_qm_create_qp);
+
+int hisi_qm_start_qp(struct hisi_qp *qp, unsigned long arg)
+{
+	struct qm_info *qm = qp->qm;
+	struct device *dev = &qm->pdev->dev;
+	struct sqc *sqc;
+	struct cqc *cqc;
+	int qp_index = qp->queue_id;
+	int pasid = arg;
+	int ret;
+
+	/* set sq and cq context */
+	qp->sqc.addr = QM_SQC(qm) + qp_index;
+	qp->sqc.dma = qm->sqc.dma + qp_index * sizeof(struct sqc);
+	sqc = QM_SQC(qp);
+
+	qp->cqc.addr = QM_CQC(qm) + qp_index;
+	qp->cqc.dma = qm->cqc.dma + qp_index * sizeof(struct cqc);
+	cqc = QM_CQC(qp);
+
+	/* allocate sq and cq */
+	ret = qm_init_q_buffer(dev,
+		qm->sqe_size * QM_Q_DEPTH + sizeof(struct cqe) * QM_Q_DEPTH,
+		&qp->scqe);
+	if (ret)
+		return ret;
+
+	INIT_QC(sqc, qp->scqe.dma);
+	sqc->pasid = pasid;
+	sqc->dw3 = (0 << SQ_HOP_NUM_SHIFT)      |
+		   (0 << SQ_PAGE_SIZE_SHIFT)    |
+		   (0 << SQ_BUF_SIZE_SHIFT)     |
+		   (ilog2(qm->sqe_size) << SQ_SQE_SIZE_SHIFT);
+	sqc->cq_num = qp_index;
+	sqc->w13 = 0 << SQ_PRIORITY_SHIFT	|
+		   1 << SQ_ORDERS_SHIFT		|
+		   (qp->alg_type & SQ_TYPE_MASK) << SQ_TYPE_SHIFT;
+
+	ret = qm_mb(qm, MAILBOX_CMD_SQC, qp->sqc.dma, qp_index, 0, 0);
+	if (ret)
+		return ret;
+
+	INIT_QC(cqc, qp->scqe.dma + qm->sqe_size * QM_Q_DEPTH);
+	cqc->dw3 = (0 << CQ_HOP_NUM_SHIFT)	|
+		   (0 << CQ_PAGE_SIZE_SHIFT)	|
+		   (0 << CQ_BUF_SIZE_SHIFT)	|
+		   (4 << CQ_SQE_SIZE_SHIFT);
+	cqc->dw6 = 1 << CQ_PHASE_SHIFT | 1 << CQ_FLAG_SHIFT;
+
+	ret = qm_mb(qm, MAILBOX_CMD_CQC, qp->cqc.dma, qp_index, 0, 0);
+	if (ret)
+		return ret;
+
+	write_lock(&qm->qps_lock);
+	qm->qp_array[qp_index] = qp;
+	init_completion(&qp->completion);
+	write_unlock(&qm->qps_lock);
+
+	return qp_index;
+}
+EXPORT_SYMBOL_GPL(hisi_qm_start_qp);
+
+void hisi_qm_release_qp(struct hisi_qp *qp)
+{
+	struct qm_info *qm = qp->qm;
+	struct device *dev = &qm->pdev->dev;
+
+	write_lock(&qm->qps_lock);
+	qm->qp_array[qp->queue_id] = NULL;
+	bitmap_clear(qm->qp_bitmap, qp->queue_id, 1);
+	write_unlock(&qm->qps_lock);
+
+	qm_uninit_q_buffer(dev, &qp->scqe);
+	kfree(qp);
+}
+EXPORT_SYMBOL_GPL(hisi_qm_release_qp);
+
+static void *qm_get_avail_sqe(struct hisi_qp *qp)
+{
+	struct hisi_acc_qp_status *qp_status = &qp->qp_status;
+	void *sq_base = QP_SQE_ADDR(qp);
+	u16 sq_tail = qp_status->sq_tail;
+
+	if (qp_status->is_sq_full == 1)
+		return NULL;
+
+	return sq_base + sq_tail * qp->qm->sqe_size;
+}
+
+int hisi_qp_send(struct hisi_qp *qp, void *msg)
+{
+	struct hisi_acc_qp_status *qp_status = &qp->qp_status;
+	u16 sq_tail = qp_status->sq_tail;
+	u16 sq_tail_next = (sq_tail + 1) % QM_Q_DEPTH;
+	unsigned long timeout = 100;
+	void *sqe = qm_get_avail_sqe(qp);
+
+	if (!sqe)
+		return -ENOSPC;
+
+	memcpy(sqe, msg, qp->qm->sqe_size);
+
+	qm_db(qp->qm, qp->queue_id, DOORBELL_CMD_SQ, sq_tail_next, 0);
+
+	qp_status->sq_tail = sq_tail_next;
+
+	if (qp_status->sq_tail == qp_status->sq_head)
+		qp_status->is_sq_full = 1;
+
+	/* wait until job finished */
+	wait_for_completion_timeout(&qp->completion, timeout);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hisi_qp_send);
+
+int hisi_qm_init(const char *dev_name, struct qm_info *qm)
+{
+	struct pci_dev *pdev = qm->pdev;
+	int ret;
+
+	ret = pci_enable_device_mem(pdev);
+	if (ret < 0) {
+		dev_err(&pdev->dev, "Can't enable device mem!\n");
+		return ret;
+	}
+
+	ret = pci_request_mem_regions(pdev, dev_name);
+	if (ret < 0) {
+		dev_err(&pdev->dev, "Can't request mem regions!\n");
+		goto err_with_pcidev;
+	}
+
+	qm->dev_name = dev_name;
+	qm->phys_base = pci_resource_start(pdev, 2);
+	qm->size = pci_resource_len(qm->pdev, 2);
+	qm->io_base = devm_ioremap(&pdev->dev, qm->phys_base, qm->size);
+	if (!qm->io_base) {
+		ret = -EIO;
+		goto err_with_mem_regions;
+	}
+
+	dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
+	pci_set_master(pdev);
+
+	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI);
+	if (ret < 0) {
+		dev_err(&pdev->dev, "Enable MSI vectors fail!\n");
+		goto err_with_mem_regions;
+	}
+
+	qm->eq_head = 0;
+	mutex_init(&qm->mailbox_lock);
+	rwlock_init(&qm->qps_lock);
+
+	if (qm->ver)
+		qm->ops = &qm_hw_ops_v1;
+
+	return 0;
+
+err_with_mem_regions:
+	pci_release_mem_regions(pdev);
+err_with_pcidev:
+	pci_disable_device(pdev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(hisi_qm_init);
+
+void hisi_qm_uninit(struct qm_info *qm)
+{
+	struct pci_dev *pdev = qm->pdev;
+
+	pci_free_irq_vectors(pdev);
+	pci_release_mem_regions(pdev);
+	pci_disable_device(pdev);
+}
+EXPORT_SYMBOL_GPL(hisi_qm_uninit);
+
+static irqreturn_t qm_irq(int irq, void *data)
+{
+	struct qm_info *qm = data;
+	u32 int_source;
+
+	int_source = qm_get_irq_source(qm);
+	if (int_source)
+		return IRQ_WAKE_THREAD;
+
+	dev_err(&qm->pdev->dev, "invalid int source %u\n", int_source);
+
+	return IRQ_HANDLED;
+}
+
+int hisi_qm_start(struct qm_info *qm)
+{
+	struct pci_dev *pdev = qm->pdev;
+	struct device *dev = &pdev->dev;
+	int ret;
+
+	if (qm->pdev->is_physfn)
+		qm->ops->vft_config(qm, qm->qp_base, qm->qp_num);
+
+	ret = qm_init_q_buffer(dev, sizeof(struct eqc), &qm->eqc);
+	if (ret)
+		goto err_out;
+
+	ret = qm_init_q_buffer(dev, sizeof(struct eqe) * QM_Q_DEPTH, &qm->eqe);
+	if (ret)
+		goto err_with_eqc;
+
+	QM_EQC(qm)->base_l = lower_32_bits(qm->eqe.dma);
+	QM_EQC(qm)->base_h = upper_32_bits(qm->eqe.dma);
+	QM_EQC(qm)->dw3 = 2 << MB_EQC_EQE_SHIFT;
+	QM_EQC(qm)->dw6 = (QM_Q_DEPTH - 1) | (1 << MB_EQC_PHASE_SHIFT);
+	ret = qm_mb(qm, MAILBOX_CMD_EQC, qm->eqc.dma, 0, 0, 0);
+	if (ret)
+		goto err_with_eqe;
+
+	ret = -ENOMEM;
+	qm->qp_bitmap = kcalloc(BITS_TO_LONGS(qm->qp_num), sizeof(long),
+				GFP_KERNEL);
+	if (!qm->qp_bitmap)
+		goto err_with_eqe;
+	qm->qp_array = kcalloc(qm->qp_num, sizeof(struct hisi_qp *),
+			       GFP_KERNEL);
+	if (!qm->qp_array)
+		goto err_with_bitmap;
+
+	/* Init sqc_bt */
+	ret = qm_init_bt(qm, dev, sizeof(struct sqc) * qm->qp_num, &qm->sqc,
+			 MAILBOX_CMD_SQC_BT);
+	if (ret)
+		goto err_with_qp_array;
+
+	/* Init cqc_bt */
+	ret = qm_init_bt(qm, dev, sizeof(struct cqc) * qm->qp_num, &qm->cqc,
+			 MAILBOX_CMD_CQC_BT);
+	if (ret)
+		goto err_with_sqc;
+
+	ret = request_threaded_irq(pci_irq_vector(pdev, 0), qm_irq,
+				   qm_irq_thread, IRQF_SHARED, qm->dev_name,
+				   qm);
+	if (ret)
+		goto err_with_cqc;
+
+	writel(0x0, QM_ADDR(qm, QM_VF_EQ_INT_MASK));
+
+	return 0;
+
+err_with_cqc:
+	qm_uninit_q_buffer(dev, &qm->cqc);
+err_with_sqc:
+	qm_uninit_q_buffer(dev, &qm->sqc);
+err_with_qp_array:
+	kfree(qm->qp_array);
+err_with_bitmap:
+	kfree(qm->qp_bitmap);
+err_with_eqe:
+	qm_uninit_q_buffer(dev, &qm->eqe);
+err_with_eqc:
+	qm_uninit_q_buffer(dev, &qm->eqc);
+err_out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(hisi_qm_start);
+
+void hisi_qm_stop(struct qm_info *qm)
+{
+	struct device *dev = &qm->pdev->dev;
+
+	free_irq(pci_irq_vector(qm->pdev, 0), qm);
+	qm_uninit_q_buffer(dev, &qm->cqc);
+	qm_uninit_q_buffer(dev, &qm->sqc);
+	kfree(qm->qp_array);
+	kfree(qm->qp_bitmap);
+	qm_uninit_q_buffer(dev, &qm->eqe);
+	qm_uninit_q_buffer(dev, &qm->eqc);
+}
+EXPORT_SYMBOL_GPL(hisi_qm_stop);
+
+/* put qm into init state so that the accelerator config becomes available */
+int hisi_qm_mem_start(struct qm_info *qm)
+{
+	u32 val;
+
+	qm_writel(qm, 0x1, QM_MEM_START_INIT);
+	return readl_relaxed_poll_timeout(QM_ADDR(qm, QM_MEM_INIT_DONE), val,
+					  val & BIT(0), 10, 1000);
+}
+EXPORT_SYMBOL_GPL(hisi_qm_mem_start);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Zhou Wang <wangzhou1@hisilicon.com>");
+MODULE_DESCRIPTION("HiSilicon Accelerator queue manager driver");
diff --git a/drivers/crypto/hisilicon/qm.h b/drivers/crypto/hisilicon/qm.h
new file mode 100644
index 000000000000..0e81182ac6a8
--- /dev/null
+++ b/drivers/crypto/hisilicon/qm.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef HISI_ACC_QM_H
+#define HISI_ACC_QM_H
+
+#include <linux/dmapool.h>
+#include <linux/iopoll.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+
+#define QM_CQE_SIZE			16
+/* default queue depth for sq/cq/eq */
+#define QM_Q_DEPTH			1024
+
+/* qm user domain */
+#define QM_ARUSER_M_CFG_1		0x100088
+#define QM_ARUSER_M_CFG_ENABLE		0x100090
+#define QM_AWUSER_M_CFG_1		0x100098
+#define QM_AWUSER_M_CFG_ENABLE		0x1000a0
+#define QM_WUSER_M_CFG_ENABLE		0x1000a8
+
+/* qm cache */
+#define QM_CACHE_CTL			0x100050
+#define QM_AXI_M_CFG			0x1000ac
+#define QM_AXI_M_CFG_ENABLE		0x1000b0
+#define QM_PEH_AXUSER_CFG		0x1000cc
+#define QM_PEH_AXUSER_CFG_ENABLE	0x1000d0
+
+#define QP_SQE_ADDR(qp) ((qp)->scqe.addr)
+
+struct qm_dma_buffer {
+	int size;
+	void *addr;
+	dma_addr_t dma;
+};
+
+struct qm_info {
+	int ver;
+	const char *dev_name;
+	struct pci_dev *pdev;
+
+	resource_size_t phys_base;
+	resource_size_t size;
+	void __iomem *io_base;
+
+	u32 sqe_size;
+	u32 qp_base;
+	u32 qp_num;
+
+	struct qm_dma_buffer sqc, cqc, eqc, eqe;
+
+	u32 eq_head;
+
+	rwlock_t qps_lock;
+	unsigned long *qp_bitmap;
+	struct hisi_qp **qp_array;
+
+	struct mutex mailbox_lock;
+
+	struct hisi_acc_qm_hw_ops *ops;
+
+};
+#define QM_ADDR(qm, off) ((qm)->io_base + off)
+
+struct hisi_acc_qp_status {
+	u16 sq_tail;
+	u16 sq_head;
+	u16 cq_head;
+	bool cqc_phase;
+	int is_sq_full;
+};
+
+struct hisi_qp;
+
+struct hisi_qp_ops {
+	int (*fill_sqe)(void *sqe, void *q_parm, void *d_parm);
+};
+
+struct hisi_qp {
+	/* sq number in this function */
+	u32 queue_id;
+	u8 alg_type;
+	u8 req_type;
+
+	struct qm_dma_buffer sqc, cqc;
+	struct qm_dma_buffer scqe;
+
+	struct hisi_acc_qp_status qp_status;
+
+	struct qm_info *qm;
+
+	/* for crypto sync API */
+	struct completion completion;
+
+	struct hisi_qp_ops *hw_ops;
+	void *qp_ctx;
+	void (*event_cb)(struct hisi_qp *qp);
+	void (*req_cb)(struct hisi_qp *qp, void *data);
+};
+
+extern int hisi_qm_init(const char *dev_name, struct qm_info *qm);
+extern void hisi_qm_uninit(struct qm_info *qm);
+extern int hisi_qm_start(struct qm_info *qm);
+extern void hisi_qm_stop(struct qm_info *qm);
+extern int hisi_qm_mem_start(struct qm_info *qm);
+extern struct hisi_qp *hisi_qm_create_qp(struct qm_info *qm, u8 alg_type);
+extern int hisi_qm_start_qp(struct hisi_qp *qp, unsigned long arg);
+extern void hisi_qm_release_qp(struct hisi_qp *qp);
+extern int hisi_qp_send(struct hisi_qp *qp, void *msg);
+#endif
-- 
2.17.1



* [PATCH 5/7] crypto: Add Hisilicon Zip driver
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (3 preceding siblings ...)
  2018-09-03  0:52 ` [PATCH 4/7] crypto: add hisilicon Queue Manager driver Kenneth Lee
@ 2018-09-03  0:52 ` Kenneth Lee
  2018-09-03  0:52 ` [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM Kenneth Lee
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:52 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

The Hisilicon ZIP accelerator provides zlib and gzip algorithm support
to software. It uses the Hisilicon QM as its interface to the CPU, so it
appears to the CPU as a PCIe device with a group of queues.

This commit provides a PCIe driver for the accelerator and registers it
with the crypto subsystem.
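
For reference, the algorithm-name lookup that the driver performs with
the COMP_NAME_TO_TYPE macro can be sketched as a plain function. This is
only an illustration that builds in user space: the enum and the helper
name hzip_req_type_from_name() are hypothetical and are not part of the
patch, while the codes 0x02 and 0x03 are the ones the macro uses.

```c
#include <assert.h>
#include <string.h>

/* Request-type codes as used by COMP_NAME_TO_TYPE (names are hypothetical). */
enum {
	HZIP_REQ_TYPE_NONE = 0x00,	/* algorithm name not recognised */
	HZIP_REQ_TYPE_ZLIB = 0x02,	/* "zlib-deflate" */
	HZIP_REQ_TYPE_GZIP = 0x03,	/* "gzip" */
};

/* Hypothetical helper: map a crypto algorithm name to a request type. */
static int hzip_req_type_from_name(const char *alg_name)
{
	assert(alg_name != NULL);

	if (!strcmp(alg_name, "zlib-deflate"))
		return HZIP_REQ_TYPE_ZLIB;
	if (!strcmp(alg_name, "gzip"))
		return HZIP_REQ_TYPE_GZIP;

	return HZIP_REQ_TYPE_NONE;
}
```

A function makes the "unknown name" case explicit, which the caller
could check before creating the queue pairs.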

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Hao Fang <fanghao11@huawei.com>
---
 drivers/crypto/hisilicon/Kconfig          |   7 +
 drivers/crypto/hisilicon/Makefile         |   1 +
 drivers/crypto/hisilicon/zip/Makefile     |   2 +
 drivers/crypto/hisilicon/zip/zip.h        |  57 ++++
 drivers/crypto/hisilicon/zip/zip_crypto.c | 353 ++++++++++++++++++++++
 drivers/crypto/hisilicon/zip/zip_crypto.h |   8 +
 drivers/crypto/hisilicon/zip/zip_main.c   | 195 ++++++++++++
 7 files changed, 623 insertions(+)
 create mode 100644 drivers/crypto/hisilicon/zip/Makefile
 create mode 100644 drivers/crypto/hisilicon/zip/zip.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.c
 create mode 100644 drivers/crypto/hisilicon/zip/zip_crypto.h
 create mode 100644 drivers/crypto/hisilicon/zip/zip_main.c
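
Note on the SQE address fields: the driver stores each 64-bit DMA
address as a pair of 32-bit words (for example source_addr_l and
source_addr_h). The following is a minimal user-space sketch of that
split, assuming a reduced struct that holds only the four address words.
The names lo32(), hi32(), struct sqe_addr_words, sqe_set_addrs() and
sqe_source_addr() are hypothetical stand-ins; the driver itself uses the
kernel's lower_32_bits()/upper_32_bits() and struct hisi_zip_sqe.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for lower_32_bits()/upper_32_bits(). */
static uint32_t lo32(uint64_t v) { return (uint32_t)(v & 0xffffffffu); }
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Reduced, hypothetical view of the address words of struct hisi_zip_sqe. */
struct sqe_addr_words {
	uint32_t source_addr_l;
	uint32_t source_addr_h;
	uint32_t dest_addr_l;
	uint32_t dest_addr_h;
};

/* The layout must stay four packed 32-bit words. */
static_assert(sizeof(struct sqe_addr_words) == 16, "four 32-bit words");

/* Split the source and destination addresses into low/high words. */
static void sqe_set_addrs(struct sqe_addr_words *w, uint64_t src, uint64_t dst)
{
	w->source_addr_l = lo32(src);
	w->source_addr_h = hi32(src);
	w->dest_addr_l = lo32(dst);
	w->dest_addr_h = hi32(dst);
}

/* Recombine the two source words into the original 64-bit address. */
static uint64_t sqe_source_addr(const struct sqe_addr_words *w)
{
	return ((uint64_t)w->source_addr_h << 32) | w->source_addr_l;
}
```

The round trip through sqe_source_addr() is only there to show that no
address bits are lost by the split.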

diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
index 02a6eef84101..1d155708cd69 100644
--- a/drivers/crypto/hisilicon/Kconfig
+++ b/drivers/crypto/hisilicon/Kconfig
@@ -20,3 +20,10 @@ config CRYPTO_DEV_HISI_SEC
 config CRYPTO_DEV_HISI_QM
 	tristate
 	depends on ARM64 && PCI
+
+config CRYPTO_DEV_HISI_ZIP
+	tristate "Support for HISI ZIP Driver"
+	depends on ARM64 && CRYPTO_DEV_HISILICON
+	select CRYPTO_DEV_HISI_QM
+	help
+	  Support for HiSilicon HIP08 ZIP Driver
diff --git a/drivers/crypto/hisilicon/Makefile b/drivers/crypto/hisilicon/Makefile
index 05e9052e0f52..c97c5b27c3cb 100644
--- a/drivers/crypto/hisilicon/Makefile
+++ b/drivers/crypto/hisilicon/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_CRYPTO_DEV_HISI_SEC) += sec/
 obj-$(CONFIG_CRYPTO_DEV_HISI_QM) += qm.o
+obj-$(CONFIG_CRYPTO_DEV_HISI_ZIP) += zip/
diff --git a/drivers/crypto/hisilicon/zip/Makefile b/drivers/crypto/hisilicon/zip/Makefile
new file mode 100644
index 000000000000..a936f099ee22
--- /dev/null
+++ b/drivers/crypto/hisilicon/zip/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_CRYPTO_DEV_HISI_ZIP) += hisi_zip.o
+hisi_zip-objs = zip_main.o zip_crypto.o
diff --git a/drivers/crypto/hisilicon/zip/zip.h b/drivers/crypto/hisilicon/zip/zip.h
new file mode 100644
index 000000000000..87515e158b17
--- /dev/null
+++ b/drivers/crypto/hisilicon/zip/zip.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef HISI_ZIP_H
+#define HISI_ZIP_H
+
+#include <linux/list.h>
+#include "../qm.h"
+
+#define HZIP_SQE_SIZE			128
+#define HZIP_SQ_SIZE			(HZIP_SQE_SIZE * QM_Q_DEPTH)
+#define QM_CQ_SIZE			(QM_CQE_SIZE * QM_Q_DEPTH)
+#define HZIP_PF_DEF_Q_NUM		64
+#define HZIP_PF_DEF_Q_BASE		0
+
+struct hisi_zip {
+	struct qm_info qm;
+	struct list_head list;
+
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+	struct vfio_sdmdev *sdmdev;
+#endif
+};
+
+struct hisi_zip_sqe {
+	__u32 consumed;
+	__u32 produced;
+	__u32 comp_data_length;
+	__u32 dw3;
+	__u32 input_data_length;
+	__u32 lba_l;
+	__u32 lba_h;
+	__u32 dw7;
+	__u32 dw8;
+	__u32 dw9;
+	__u32 dw10;
+	__u32 priv_info;
+	__u32 dw12;
+	__u32 tag;
+	__u32 dest_avail_out;
+	__u32 rsvd0;
+	__u32 comp_head_addr_l;
+	__u32 comp_head_addr_h;
+	__u32 source_addr_l;
+	__u32 source_addr_h;
+	__u32 dest_addr_l;
+	__u32 dest_addr_h;
+	__u32 stream_ctx_addr_l;
+	__u32 stream_ctx_addr_h;
+	__u32 cipher_key1_addr_l;
+	__u32 cipher_key1_addr_h;
+	__u32 cipher_key2_addr_l;
+	__u32 cipher_key2_addr_h;
+	__u32 rsvd1[4];
+};
+
+extern struct list_head hisi_zip_list;
+
+#endif
diff --git a/drivers/crypto/hisilicon/zip/zip_crypto.c b/drivers/crypto/hisilicon/zip/zip_crypto.c
new file mode 100644
index 000000000000..0a0e3de8e1d6
--- /dev/null
+++ b/drivers/crypto/hisilicon/zip/zip_crypto.c
@@ -0,0 +1,353 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#include <linux/crypto.h>
+#include <linux/dma-mapping.h>
+#include <linux/pci.h>
+#include <linux/topology.h>
+#include "../qm.h"
+#include "zip.h"
+
+#define INPUT_BUFFER_SIZE	(64 * 1024)
+#define OUTPUT_BUFFER_SIZE	(64 * 1024)
+
+#define COMP_NAME_TO_TYPE(alg_name)			\
+	(!strcmp((alg_name), "zlib-deflate") ? 0x02 :	\
+	 !strcmp((alg_name), "gzip") ? 0x03 : 0)
+
+struct hisi_zip_buffer {
+	u8 *input;
+	dma_addr_t input_dma;
+	u8 *output;
+	dma_addr_t output_dma;
+};
+
+struct hisi_zip_qp_ctx {
+	struct hisi_zip_buffer buffer;
+	struct hisi_qp *qp;
+	struct hisi_zip_sqe zip_sqe;
+};
+
+struct hisi_zip_ctx {
+#define QPC_COMP	0
+#define QPC_DECOMP	1
+	struct hisi_zip_qp_ctx qp_ctx[2];
+};
+
+static struct hisi_zip *find_zip_device(int node)
+{
+	struct hisi_zip *hisi_zip, *ret = NULL;
+	struct device *dev;
+	int min_distance = 100;
+
+	list_for_each_entry(hisi_zip, &hisi_zip_list, list) {
+		dev = &hisi_zip->qm.pdev->dev;
+		if (node_distance(dev->numa_node, node) < min_distance) {
+			ret = hisi_zip;
+			min_distance = node_distance(dev->numa_node, node);
+		}
+	}
+
+	return ret;
+}
+
+static void hisi_zip_qp_event_notifier(struct hisi_qp *qp)
+{
+	complete(&qp->completion);
+}
+
+static int hisi_zip_fill_sqe_v1(void *sqe, void *q_parm, u32 len)
+{
+	struct hisi_zip_sqe *zip_sqe = (struct hisi_zip_sqe *)sqe;
+	struct hisi_zip_qp_ctx *qp_ctx = (struct hisi_zip_qp_ctx *)q_parm;
+	struct hisi_zip_buffer *buffer = &qp_ctx->buffer;
+
+	memset(zip_sqe, 0, sizeof(struct hisi_zip_sqe));
+
+	zip_sqe->input_data_length = len;
+	zip_sqe->dw9 = qp_ctx->qp->req_type;
+	zip_sqe->dest_avail_out = OUTPUT_BUFFER_SIZE;
+	zip_sqe->source_addr_l = lower_32_bits(buffer->input_dma);
+	zip_sqe->source_addr_h = upper_32_bits(buffer->input_dma);
+	zip_sqe->dest_addr_l = lower_32_bits(buffer->output_dma);
+	zip_sqe->dest_addr_h = upper_32_bits(buffer->output_dma);
+
+	return 0;
+}
+
+/* allocate one buffer pair for now; may be a problem in the async case */
+static int hisi_zip_alloc_qp_buffer(struct hisi_zip_qp_ctx *hisi_zip_qp_ctx)
+{
+	struct hisi_zip_buffer *buffer = &hisi_zip_qp_ctx->buffer;
+	struct hisi_qp *qp = hisi_zip_qp_ctx->qp;
+	struct device *dev = &qp->qm->pdev->dev;
+	int ret;
+
+	buffer->input = dma_alloc_coherent(dev, INPUT_BUFFER_SIZE,
+					   &buffer->input_dma, GFP_KERNEL);
+	if (!buffer->input)
+		return -ENOMEM;
+
+	buffer->output = dma_alloc_coherent(dev, OUTPUT_BUFFER_SIZE,
+					    &buffer->output_dma, GFP_KERNEL);
+	if (!buffer->output) {
+		ret = -ENOMEM;
+		goto err_alloc_output_buffer;
+	}
+
+	return 0;
+
+err_alloc_output_buffer:
+	dma_free_coherent(dev, INPUT_BUFFER_SIZE, buffer->input,
+			  buffer->input_dma);
+	return ret;
+}
+
+static void hisi_zip_free_qp_buffer(struct hisi_zip_qp_ctx *hisi_zip_qp_ctx)
+{
+	struct hisi_zip_buffer *buffer = &hisi_zip_qp_ctx->buffer;
+	struct hisi_qp *qp = hisi_zip_qp_ctx->qp;
+	struct device *dev = &qp->qm->pdev->dev;
+
+	dma_free_coherent(dev, INPUT_BUFFER_SIZE, buffer->input,
+			  buffer->input_dma);
+	dma_free_coherent(dev, OUTPUT_BUFFER_SIZE, buffer->output,
+			  buffer->output_dma);
+}
+
+static int hisi_zip_create_qp(struct qm_info *qm, struct hisi_zip_qp_ctx *ctx,
+			      int alg_type, int req_type)
+{
+	struct hisi_qp *qp;
+	int ret;
+
+	qp = hisi_qm_create_qp(qm, alg_type);
+
+	if (IS_ERR(qp))
+		return PTR_ERR(qp);
+
+	qp->event_cb = hisi_zip_qp_event_notifier;
+	qp->req_type = req_type;
+
+	qp->qp_ctx = ctx;
+	ctx->qp = qp;
+
+	ret = hisi_zip_alloc_qp_buffer(ctx);
+	if (ret)
+		goto err_with_qp;
+
+	ret = hisi_qm_start_qp(qp, 0);
+	if (ret < 0)
+		goto err_with_qp_buffer;
+
+	return 0;
+err_with_qp_buffer:
+	hisi_zip_free_qp_buffer(ctx);
+err_with_qp:
+	hisi_qm_release_qp(qp);
+	return ret;
+}
+
+static void hisi_zip_release_qp(struct hisi_zip_qp_ctx *ctx)
+{
+	hisi_qm_release_qp(ctx->qp);
+	hisi_zip_free_qp_buffer(ctx);
+}
+
+static int hisi_zip_alloc_comp_ctx(struct crypto_tfm *tfm)
+{
+	struct hisi_zip_ctx *hisi_zip_ctx = crypto_tfm_ctx(tfm);
+	const char *alg_name = crypto_tfm_alg_name(tfm);
+	struct hisi_zip *hisi_zip;
+	struct qm_info *qm;
+	int ret, i, j;
+
+	u8 req_type = COMP_NAME_TO_TYPE(alg_name);
+
+	/* find the proper zip device */
+	hisi_zip = find_zip_device(cpu_to_node(smp_processor_id()));
+	if (!hisi_zip) {
+		pr_err("Can not find proper ZIP device!\n");
+		return -ENODEV;
+	}
+	qm = &hisi_zip->qm;
+
+	for (i = 0; i < 2; i++) {
+		/* alg_type happens to be 0 for compress and 1 for decompress */
+		ret = hisi_zip_create_qp(qm, &hisi_zip_ctx->qp_ctx[i], i,
+					 req_type);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	for (j = i-1; j >= 0; j--)
+		hisi_zip_release_qp(&hisi_zip_ctx->qp_ctx[j]);
+
+	return ret;
+}
+
+static void hisi_zip_free_comp_ctx(struct crypto_tfm *tfm)
+{
+	struct hisi_zip_ctx *hisi_zip_ctx = crypto_tfm_ctx(tfm);
+	int i;
+
+	/* release the qp */
+	for (i = 1; i >= 0; i--)
+		hisi_zip_release_qp(&hisi_zip_ctx->qp_ctx[i]);
+}
+
+static int hisi_zip_copy_data_to_buffer(struct hisi_zip_qp_ctx *qp_ctx,
+					const u8 *src, unsigned int slen)
+{
+	struct hisi_zip_buffer *buffer = &qp_ctx->buffer;
+
+	if (slen > INPUT_BUFFER_SIZE)
+		return -EINVAL;
+
+	memcpy(buffer->input, src, slen);
+
+	return 0;
+}
+
+static struct hisi_zip_sqe *hisi_zip_get_writeback_sqe(struct hisi_qp *qp)
+{
+	struct hisi_acc_qp_status *qp_status = &qp->qp_status;
+	struct hisi_zip_sqe *sq_base = QP_SQE_ADDR(qp);
+	u16 sq_head = qp_status->sq_head;
+
+	return sq_base + sq_head;
+}
+
+static int hisi_zip_copy_data_from_buffer(struct hisi_zip_qp_ctx *qp_ctx,
+					  u8 *dst, unsigned int *dlen)
+{
+	struct hisi_zip_buffer *buffer = &qp_ctx->buffer;
+	struct hisi_qp *qp = qp_ctx->qp;
+	struct hisi_zip_sqe *zip_sqe = hisi_zip_get_writeback_sqe(qp);
+	u32 status = zip_sqe->dw3 & 0xff;
+	u16 sq_head;
+
+	if (status != 0) {
+		pr_err("hisi zip: %s fail!\n", (qp->alg_type == 0) ?
+		       "compression" : "decompression");
+		return status;
+	}
+
+	if (zip_sqe->produced > OUTPUT_BUFFER_SIZE)
+		return -ENOMEM;
+
+	memcpy(dst, buffer->output, zip_sqe->produced);
+	*dlen = zip_sqe->produced;
+
+	sq_head = qp->qp_status.sq_head;
+	if (sq_head == QM_Q_DEPTH - 1)
+		qp->qp_status.sq_head = 0;
+	else
+		qp->qp_status.sq_head++;
+
+	return 0;
+}
+
+static int hisi_zip_compress(struct crypto_tfm *tfm, const u8 *src,
+			     unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+	struct hisi_zip_ctx *hisi_zip_ctx = crypto_tfm_ctx(tfm);
+	struct hisi_zip_qp_ctx *qp_ctx = &hisi_zip_ctx->qp_ctx[QPC_COMP];
+	struct hisi_qp *qp = qp_ctx->qp;
+	struct hisi_zip_sqe *zip_sqe = &qp_ctx->zip_sqe;
+	int ret;
+
+	ret = hisi_zip_copy_data_to_buffer(qp_ctx, src, slen);
+	if (ret < 0)
+		return ret;
+
+	hisi_zip_fill_sqe_v1(zip_sqe, qp_ctx, slen);
+
+	/* send command to start the compress job */
+	hisi_qp_send(qp, zip_sqe);
+
+	return hisi_zip_copy_data_from_buffer(qp_ctx, dst, dlen);
+}
+
+static int hisi_zip_decompress(struct crypto_tfm *tfm, const u8 *src,
+			       unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+	struct hisi_zip_ctx *hisi_zip_ctx = crypto_tfm_ctx(tfm);
+	struct hisi_zip_qp_ctx *qp_ctx = &hisi_zip_ctx->qp_ctx[QPC_DECOMP];
+	struct hisi_qp *qp = qp_ctx->qp;
+	struct hisi_zip_sqe *zip_sqe = &qp_ctx->zip_sqe;
+	int ret;
+
+	ret = hisi_zip_copy_data_to_buffer(qp_ctx, src, slen);
+	if (ret < 0)
+		return ret;
+
+	hisi_zip_fill_sqe_v1(zip_sqe, qp_ctx, slen);
+
+	/* send command to start the decompress job */
+	hisi_qp_send(qp, zip_sqe);
+
+	return hisi_zip_copy_data_from_buffer(qp_ctx, dst, dlen);
+}
+
+static struct crypto_alg hisi_zip_zlib = {
+	.cra_name		= "zlib-deflate",
+	.cra_flags		= CRYPTO_ALG_TYPE_COMPRESS,
+	.cra_ctxsize		= sizeof(struct hisi_zip_ctx),
+	.cra_priority           = 300,
+	.cra_module		= THIS_MODULE,
+	.cra_init		= hisi_zip_alloc_comp_ctx,
+	.cra_exit		= hisi_zip_free_comp_ctx,
+	.cra_u			= {
+		.compress = {
+			.coa_compress	= hisi_zip_compress,
+			.coa_decompress	= hisi_zip_decompress
+		}
+	}
+};
+
+static struct crypto_alg hisi_zip_gzip = {
+	.cra_name		= "gzip",
+	.cra_flags		= CRYPTO_ALG_TYPE_COMPRESS,
+	.cra_ctxsize		= sizeof(struct hisi_zip_ctx),
+	.cra_priority           = 300,
+	.cra_module		= THIS_MODULE,
+	.cra_init		= hisi_zip_alloc_comp_ctx,
+	.cra_exit		= hisi_zip_free_comp_ctx,
+	.cra_u			= {
+		.compress = {
+			.coa_compress	= hisi_zip_compress,
+			.coa_decompress	= hisi_zip_decompress
+		}
+	}
+};
+
+int hisi_zip_register_to_crypto(void)
+{
+	int ret;
+
+	ret = crypto_register_alg(&hisi_zip_zlib);
+	if (ret < 0) {
+		pr_err("Zlib algorithm registration failed\n");
+		return ret;
+	}
+
+	ret = crypto_register_alg(&hisi_zip_gzip);
+	if (ret < 0) {
+		pr_err("Gzip algorithm registration failed\n");
+		goto err_unregister_zlib;
+	}
+
+	return 0;
+
+err_unregister_zlib:
+	crypto_unregister_alg(&hisi_zip_zlib);
+
+	return ret;
+}
+
+void hisi_zip_unregister_from_crypto(void)
+{
+	crypto_unregister_alg(&hisi_zip_zlib);
+	crypto_unregister_alg(&hisi_zip_gzip);
+}
diff --git a/drivers/crypto/hisilicon/zip/zip_crypto.h b/drivers/crypto/hisilicon/zip/zip_crypto.h
new file mode 100644
index 000000000000..84eefd74c9c4
--- /dev/null
+++ b/drivers/crypto/hisilicon/zip/zip_crypto.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef HISI_ZIP_CRYPTO_H
+#define HISI_ZIP_CRYPTO_H
+
+int hisi_zip_register_to_crypto(void);
+void hisi_zip_unregister_from_crypto(void);
+
+#endif
diff --git a/drivers/crypto/hisilicon/zip/zip_main.c b/drivers/crypto/hisilicon/zip/zip_main.c
new file mode 100644
index 000000000000..cad4c97f4826
--- /dev/null
+++ b/drivers/crypto/hisilicon/zip/zip_main.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/bitops.h>
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include "zip.h"
+#include "zip_crypto.h"
+
+#define HZIP_VF_NUM			63
+#define HZIP_QUEUE_NUM_V1		4096
+#define HZIP_QUEUE_NUM_V2		1024
+
+#define HZIP_FSM_MAX_CNT		0x301008
+
+#define HZIP_PORT_ARCA_CHE_0		0x301040
+#define HZIP_PORT_ARCA_CHE_1		0x301044
+#define HZIP_PORT_AWCA_CHE_0		0x301060
+#define HZIP_PORT_AWCA_CHE_1		0x301064
+
+#define HZIP_BD_RUSER_32_63		0x301110
+#define HZIP_SGL_RUSER_32_63		0x30111c
+#define HZIP_DATA_RUSER_32_63		0x301128
+#define HZIP_DATA_WUSER_32_63		0x301134
+#define HZIP_BD_WUSER_32_63		0x301140
+
+LIST_HEAD(hisi_zip_list);
+DEFINE_MUTEX(hisi_zip_list_lock);
+
+static const char hisi_zip_name[] = "hisi_zip";
+
+static const struct pci_device_id hisi_zip_dev_ids[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, 0xa250) },
+	{ 0, }
+};
+
+static inline void hisi_zip_add_to_list(struct hisi_zip *hisi_zip)
+{
+	mutex_lock(&hisi_zip_list_lock);
+	list_add_tail(&hisi_zip->list, &hisi_zip_list);
+	mutex_unlock(&hisi_zip_list_lock);
+}
+
+static inline void hisi_zip_remove_from_list(struct hisi_zip *hisi_zip)
+{
+	mutex_lock(&hisi_zip_list_lock);
+	list_del(&hisi_zip->list);
+	mutex_unlock(&hisi_zip_list_lock);
+}
+
+static void hisi_zip_set_user_domain_and_cache(struct hisi_zip *hisi_zip)
+{
+	u32 val;
+
+	/* qm user domain */
+	writel(0x40001070, hisi_zip->qm.io_base + QM_ARUSER_M_CFG_1);
+	writel(0xfffffffe, hisi_zip->qm.io_base + QM_ARUSER_M_CFG_ENABLE);
+	writel(0x40001070, hisi_zip->qm.io_base + QM_AWUSER_M_CFG_1);
+	writel(0xfffffffe, hisi_zip->qm.io_base + QM_AWUSER_M_CFG_ENABLE);
+	writel(0xffffffff, hisi_zip->qm.io_base + QM_WUSER_M_CFG_ENABLE);
+	writel(0x4893,     hisi_zip->qm.io_base + QM_CACHE_CTL);
+
+	val = readl(hisi_zip->qm.io_base + QM_PEH_AXUSER_CFG);
+	val |= (1 << 11);
+	writel(val, hisi_zip->qm.io_base + QM_PEH_AXUSER_CFG);
+
+	/* qm cache */
+	writel(0xffff,     hisi_zip->qm.io_base + QM_AXI_M_CFG);
+	writel(0xffffffff, hisi_zip->qm.io_base + QM_AXI_M_CFG_ENABLE);
+	writel(0xffffffff, hisi_zip->qm.io_base + QM_PEH_AXUSER_CFG_ENABLE);
+
+	/* cache */
+	writel(0xffffffff, hisi_zip->qm.io_base + HZIP_PORT_ARCA_CHE_0);
+	writel(0xffffffff, hisi_zip->qm.io_base + HZIP_PORT_ARCA_CHE_1);
+	writel(0xffffffff, hisi_zip->qm.io_base + HZIP_PORT_AWCA_CHE_0);
+	writel(0xffffffff, hisi_zip->qm.io_base + HZIP_PORT_AWCA_CHE_1);
+	/* user domain configurations */
+	writel(0x40001070, hisi_zip->qm.io_base + HZIP_BD_RUSER_32_63);
+	writel(0x40001070, hisi_zip->qm.io_base + HZIP_SGL_RUSER_32_63);
+#ifdef CONFIG_IOMMU_SVA
+	writel(0x40001071, hisi_zip->qm.io_base + HZIP_DATA_RUSER_32_63);
+	writel(0x40001071, hisi_zip->qm.io_base + HZIP_DATA_WUSER_32_63);
+#else
+	writel(0x40001070, hisi_zip->qm.io_base + HZIP_DATA_RUSER_32_63);
+	writel(0x40001070, hisi_zip->qm.io_base + HZIP_DATA_WUSER_32_63);
+#endif
+	writel(0x40001070, hisi_zip->qm.io_base + HZIP_BD_WUSER_32_63);
+
+	/* fsm count */
+	writel(0xfffffff, hisi_zip->qm.io_base + HZIP_FSM_MAX_CNT);
+
+	/* clock gating, core, decompress verify enable */
+	writel(0x10005, hisi_zip->qm.io_base + 0x301004);
+}
+
+static int hisi_zip_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct hisi_zip *hisi_zip;
+	struct qm_info *qm;
+	int ret;
+	u8 rev_id;
+
+	hisi_zip = devm_kzalloc(&pdev->dev, sizeof(*hisi_zip), GFP_KERNEL);
+	if (!hisi_zip)
+		return -ENOMEM;
+	hisi_zip_add_to_list(hisi_zip);
+	pci_set_drvdata(pdev, hisi_zip);
+
+	qm = &hisi_zip->qm;
+	qm->pdev = pdev;
+
+	pci_read_config_byte(pdev, PCI_REVISION_ID, &rev_id);
+	if (rev_id == 0x20)
+		qm->ver = 1;
+	qm->sqe_size = HZIP_SQE_SIZE;
+	ret = hisi_qm_init(hisi_zip_name, qm);
+	if (ret)
+		goto err_with_hisi_zip;
+
+	if (pdev->is_physfn) {
+		ret = hisi_qm_mem_start(qm);
+		if (ret)
+			goto err_with_qm_init;
+
+		hisi_zip_set_user_domain_and_cache(hisi_zip);
+
+		qm->qp_base = HZIP_PF_DEF_Q_BASE;
+		qm->qp_num = HZIP_PF_DEF_Q_NUM;
+	}
+
+	ret = hisi_qm_start(qm);
+	if (ret)
+		goto err_with_qm_init;
+
+	return 0;
+
+err_with_qm_init:
+	hisi_qm_uninit(qm);
+err_with_hisi_zip:
+	kfree(hisi_zip);
+	return ret;
+}
+
+static void hisi_zip_remove(struct pci_dev *pdev)
+{
+	struct hisi_zip *hisi_zip = pci_get_drvdata(pdev);
+	struct qm_info *qm = &hisi_zip->qm;
+
+	hisi_qm_stop(qm);
+	hisi_qm_uninit(qm);
+	hisi_zip_remove_from_list(hisi_zip);
+	kfree(hisi_zip);
+}
+
+static struct pci_driver hisi_zip_pci_driver = {
+	.name		= "hisi_zip",
+	.id_table	= hisi_zip_dev_ids,
+	.probe		= hisi_zip_probe,
+	.remove		= hisi_zip_remove,
+};
+
+static int __init hisi_zip_init(void)
+{
+	int ret;
+
+	ret = pci_register_driver(&hisi_zip_pci_driver);
+	if (ret < 0) {
+		pr_err("zip: can't register hisi zip driver.\n");
+		return ret;
+	}
+
+	ret = hisi_zip_register_to_crypto();
+	if (ret < 0) {
+		pr_err("zip: can't register hisi zip to crypto.\n");
+		pci_unregister_driver(&hisi_zip_pci_driver);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit hisi_zip_exit(void)
+{
+	hisi_zip_unregister_from_crypto();
+	pci_unregister_driver(&hisi_zip_pci_driver);
+}
+
+module_init(hisi_zip_init);
+module_exit(hisi_zip_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Zhou Wang <wangzhou1@hisilicon.com>");
+MODULE_DESCRIPTION("Driver for HiSilicon ZIP accelerator");
+MODULE_DEVICE_TABLE(pci, hisi_zip_dev_ids);
-- 
2.17.1



* [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (4 preceding siblings ...)
  2018-09-03  0:52 ` [PATCH 5/7] crypto: Add Hisilicon Zip driver Kenneth Lee
@ 2018-09-03  0:52 ` Kenneth Lee
  2018-09-03  2:19   ` Randy Dunlap
  2018-09-03  0:52 ` [PATCH 7/7] vfio/sdmdev: add user sample Kenneth Lee
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:52 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

This commit adds sdmdev support to the Hisilicon QM driver, so that any
accelerator that uses QM can expose its queues to user space.

Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Hao Fang <fanghao11@huawei.com>
Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
---
 drivers/crypto/hisilicon/Kconfig |  10 ++
 drivers/crypto/hisilicon/qm.c    | 159 +++++++++++++++++++++++++++++++
 drivers/crypto/hisilicon/qm.h    |  12 +++
 3 files changed, 181 insertions(+)
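
Note on the mmap regions: hisi_qm_mmap() in this patch accepts two
regions, region 0 for the doorbell (at most one page) and region 1 for
the sq/cq buffer (at most scqe.size bytes), and rejects everything else
with -EINVAL. The size checks alone can be sketched in user space as
below. The enum names and qm_check_mmap() are hypothetical, and how the
region number is decoded from vm_pgoff (_VFIO_SDMDEV_REGION) is left
out because it is defined elsewhere in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Region numbers as dispatched by hisi_qm_mmap() (names are hypothetical). */
enum qm_region {
	QM_REGION_DOORBELL = 0,	/* device doorbell page */
	QM_REGION_SCQE = 1,	/* sq/cq DMA buffer */
};

/*
 * Hypothetical helper: return 0 if a mapping of sz bytes is acceptable
 * for the given region, -EINVAL otherwise.
 */
static int qm_check_mmap(unsigned int region, size_t sz, size_t page_size,
			 size_t scqe_size)
{
	assert(page_size > 0);

	switch (region) {
	case QM_REGION_DOORBELL:
		/* the doorbell mapping may not exceed one page */
		return sz > page_size ? -EINVAL : 0;
	case QM_REGION_SCQE:
		/* the queue mapping may not exceed the allocated buffer */
		return sz > scqe_size ? -EINVAL : 0;
	default:
		return -EINVAL;
	}
}
```

Keeping the bounds in one place like this would make it easier to add
further regions later without touching the remap calls.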

diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
index 1d155708cd69..b85fab48fdab 100644
--- a/drivers/crypto/hisilicon/Kconfig
+++ b/drivers/crypto/hisilicon/Kconfig
@@ -17,6 +17,16 @@ config CRYPTO_DEV_HISI_SEC
 	  To compile this as a module, choose M here: the module
 	  will be called hisi_sec.
 
+config CRYPTO_DEV_HISI_SDMDEV
+	bool "Enable SDMDEV interface"
+	depends on CRYPTO_DEV_HISILICON
+	select VFIO_SDMDEV
+	help
+	  Enable the SDMDEV ("shared IOMMU Domain Mediated Device")
+	  interface for all Hisilicon accelerators that support it. SDMDEV
+	  allows the WarpDrive user space accelerator driver to access the
+	  hardware function directly.
+
 config CRYPTO_DEV_HISI_QM
 	tristate
 	depends on ARM64 && PCI
diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index ea618b4d0929..c94e138e9b99 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -639,6 +639,155 @@ int hisi_qp_send(struct hisi_qp *qp, void *msg)
 }
 EXPORT_SYMBOL_GPL(hisi_qp_send);
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+/* mdev->supported_type_groups */
+static struct attribute *hisi_qm_type_attrs[] = {
+	VFIO_SDMDEV_DEFAULT_MDEV_TYPE_ATTRS,
+	NULL,
+};
+static struct attribute_group hisi_qm_type_group = {
+	.attrs = hisi_qm_type_attrs,
+};
+static struct attribute_group *mdev_type_groups[] = {
+	&hisi_qm_type_group,
+	NULL,
+};
+
+static void qm_qp_event_notifier(struct hisi_qp *qp)
+{
+	vfio_sdmdev_wake_up(qp->sdmdev_q);
+}
+
+static int hisi_qm_get_queue(struct vfio_sdmdev *sdmdev,
+			  struct vfio_sdmdev_queue **q)
+{
+	struct qm_info *qm = sdmdev->priv;
+	struct hisi_qp *qp = NULL;
+	struct vfio_sdmdev_queue *wd_q;
+	u8 alg_type = 0; /* FIXME: alg_type is hard-coded */
+	int ret;
+
+	qp = hisi_qm_create_qp(qm, alg_type);
+	if (IS_ERR(qp))
+		return PTR_ERR(qp);
+
+	wd_q = kzalloc(sizeof(struct vfio_sdmdev_queue), GFP_KERNEL);
+	if (!wd_q) {
+		ret = -ENOMEM;
+		goto err_with_qp;
+	}
+
+	wd_q->priv = qp;
+	wd_q->sdmdev = sdmdev;
+	*q = wd_q;
+	qp->sdmdev_q = wd_q;
+	qp->event_cb = qm_qp_event_notifier;
+
+	return 0;
+
+err_with_qp:
+	hisi_qm_release_qp(qp);
+	return ret;
+}
+
+void hisi_qm_put_queue(struct vfio_sdmdev_queue *q)
+{
+	struct hisi_qp *qp = q->priv;
+
+	hisi_qm_release_qp(qp);
+	kfree(q);
+}
+
+/* map sq/cq/doorbell to user space */
+static int hisi_qm_mmap(struct vfio_sdmdev_queue *q,
+			struct vm_area_struct *vma)
+{
+	struct hisi_qp *qp = (struct hisi_qp *)q->priv;
+	struct qm_info *qm = qp->qm;
+	struct device *dev = &qm->pdev->dev;
+	size_t sz = vma->vm_end - vma->vm_start;
+	u8 region;
+
+	vma->vm_flags |= (VM_IO | VM_LOCKED | VM_DONTEXPAND | VM_DONTDUMP);
+	region = _VFIO_SDMDEV_REGION(vma->vm_pgoff);
+
+	switch (region) {
+	case 0:
+		if (sz > PAGE_SIZE)
+			return -EINVAL;
+		/*
+		 * Warning: unsafe, as multiple queues share the same doorbell
+		 * (a v1 hardware interface problem). v2 will fix it.
+		 */
+		return remap_pfn_range(vma, vma->vm_start,
+				       qm->phys_base >> PAGE_SHIFT,
+				       sz, pgprot_noncached(vma->vm_page_prot));
+	case 1:
+		vma->vm_pgoff = 0;
+		if (sz > qp->scqe.size)
+			return -EINVAL;
+
+		return dma_mmap_coherent(dev, vma, qp->scqe.addr, qp->scqe.dma,
+				sz);
+
+	default:
+		return -EINVAL;
+	}
+}
+
+static int hisi_qm_start_queue(struct vfio_sdmdev_queue *q)
+{
+	struct hisi_qp *qp = q->priv;
+
+#ifdef CONFIG_IOMMU_SVA
+	return hisi_qm_start_qp(qp, q->pasid);
+#else
+	return hisi_qm_start_qp(qp, 0);
+#endif
+}
+
+static void hisi_qm_stop_queue(struct vfio_sdmdev_queue *q)
+{
+	/* the hardware should be stopped here, but v1 cannot support that */
+}
+
+static const struct vfio_sdmdev_ops qm_ops = {
+	.get_queue = hisi_qm_get_queue,
+	.put_queue = hisi_qm_put_queue,
+	.start_queue = hisi_qm_start_queue,
+	.stop_queue = hisi_qm_stop_queue,
+	.mmap = hisi_qm_mmap,
+};
+
+static int qm_register_sdmdev(struct qm_info *qm)
+{
+	struct pci_dev *pdev = qm->pdev;
+	struct vfio_sdmdev *sdmdev = &qm->sdmdev;
+
+	sdmdev->iommu_type = VFIO_TYPE1_IOMMU;
+
+#ifdef CONFIG_IOMMU_SVA
+	sdmdev->dma_flag = VFIO_SDMDEV_DMA_MULTI_PROC_MAP;
+#else
+	sdmdev->dma_flag = VFIO_SDMDEV_DMA_SINGLE_PROC_MAP;
+#endif
+
+	sdmdev->name = qm->dev_name;
+	sdmdev->dev = &pdev->dev;
+	sdmdev->is_vf = pdev->is_virtfn;
+	sdmdev->priv = qm;
+	sdmdev->api_ver = "hisi_qm_v1";
+	sdmdev->flags = 0;
+
+	sdmdev->mdev_fops.mdev_attr_groups = qm->mdev_dev_groups;
+	hisi_qm_type_group.name = qm->dev_name;
+	sdmdev->mdev_fops.supported_type_groups = mdev_type_groups;
+	sdmdev->ops = &qm_ops;
+
+	return vfio_sdmdev_register(sdmdev);
+}
+#endif
+
 int hisi_qm_init(const char *dev_name, struct qm_info *qm)
 {
 	struct pci_dev *pdev = qm->pdev;
@@ -769,6 +918,12 @@ int hisi_qm_start(struct qm_info *qm)
 	if (ret)
 		goto err_with_cqc;
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+	ret = qm_register_sdmdev(qm);
+	if (ret)
+		goto err_with_cqc;
+#endif
+
 	writel(0x0, QM_ADDR(qm, QM_VF_EQ_INT_MASK));
 
 	return 0;
@@ -795,6 +950,10 @@ void hisi_qm_stop(struct qm_info *qm)
 	struct pci_dev *pdev = qm->pdev;
 	struct device *dev = &pdev->dev;
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+	vfio_sdmdev_unregister(&qm->sdmdev);
+#endif
+
 	free_irq(pci_irq_vector(pdev, 0), qm);
 	qm_uninit_q_buffer(dev, &qm->cqc);
 	kfree(qm->qp_array);
diff --git a/drivers/crypto/hisilicon/qm.h b/drivers/crypto/hisilicon/qm.h
index 0e81182ac6a8..0d24e0bd42e8 100644
--- a/drivers/crypto/hisilicon/qm.h
+++ b/drivers/crypto/hisilicon/qm.h
@@ -8,6 +8,10 @@
 #include <linux/pci.h>
 #include <linux/slab.h>
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+#include <linux/vfio_sdmdev.h>
+#endif
+
 #define QM_CQE_SIZE			16
 /* default queue depth for sq/cq/eq */
 #define QM_Q_DEPTH			1024
@@ -59,6 +63,10 @@ struct qm_info {
 
 	struct hisi_acc_qm_hw_ops *ops;
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+	struct vfio_sdmdev sdmdev;
+	const struct attribute_group **mdev_dev_groups;
+#endif
 };
 #define QM_ADDR(qm, off) ((qm)->io_base + off)
 
@@ -89,6 +97,10 @@ struct hisi_qp {
 
 	struct qm_info *qm;
 
+#ifdef CONFIG_CRYPTO_DEV_HISI_SDMDEV
+	struct vfio_sdmdev_queue *sdmdev_q;
+#endif
+
 	/* for crypto sync API */
 	struct completion completion;
 
-- 
2.17.1



* [PATCH 7/7] vfio/sdmdev: add user sample
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (5 preceding siblings ...)
  2018-09-03  0:52 ` [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM Kenneth Lee
@ 2018-09-03  0:52 ` Kenneth Lee
  2018-09-03  2:25   ` Randy Dunlap
  2018-09-03  2:32 ` [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Lu Baolu
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-03  0:52 UTC (permalink / raw)
  To: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

From: Kenneth Lee <liguozhu@hisilicon.com>

This is the sample code to demonstrate how a WarpDrive user application
should be written.

It contains:

1. wd.[ch], the common library that provides the WarpDrive interface.
2. drv/*, the user driver that accesses the hardware on top of sdmdev.
3. test/*, the test application that uses the WarpDrive interface to access
   the hardware queue(s) of the accelerator.

The Hisilicon HIP08 ZIP accelerator is used in this sample.

Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
Signed-off-by: Hao Fang <fanghao11@huawei.com>
Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
---
 samples/warpdrive/AUTHORS              |   2 +
 samples/warpdrive/ChangeLog            |   1 +
 samples/warpdrive/Makefile.am          |   9 +
 samples/warpdrive/NEWS                 |   1 +
 samples/warpdrive/README               |  32 +++
 samples/warpdrive/autogen.sh           |   3 +
 samples/warpdrive/cleanup.sh           |  13 ++
 samples/warpdrive/configure.ac         |  52 +++++
 samples/warpdrive/drv/hisi_qm_udrv.c   | 223 ++++++++++++++++++
 samples/warpdrive/drv/hisi_qm_udrv.h   |  53 +++++
 samples/warpdrive/test/Makefile.am     |   7 +
 samples/warpdrive/test/comp_hw.h       |  23 ++
 samples/warpdrive/test/test_hisi_zip.c | 206 +++++++++++++++++
 samples/warpdrive/wd.c                 | 309 +++++++++++++++++++++++++
 samples/warpdrive/wd.h                 | 154 ++++++++++++
 samples/warpdrive/wd_adapter.c         |  74 ++++++
 samples/warpdrive/wd_adapter.h         |  43 ++++
 17 files changed, 1205 insertions(+)
 create mode 100644 samples/warpdrive/AUTHORS
 create mode 100644 samples/warpdrive/ChangeLog
 create mode 100644 samples/warpdrive/Makefile.am
 create mode 100644 samples/warpdrive/NEWS
 create mode 100644 samples/warpdrive/README
 create mode 100755 samples/warpdrive/autogen.sh
 create mode 100755 samples/warpdrive/cleanup.sh
 create mode 100644 samples/warpdrive/configure.ac
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
 create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
 create mode 100644 samples/warpdrive/test/Makefile.am
 create mode 100644 samples/warpdrive/test/comp_hw.h
 create mode 100644 samples/warpdrive/test/test_hisi_zip.c
 create mode 100644 samples/warpdrive/wd.c
 create mode 100644 samples/warpdrive/wd.h
 create mode 100644 samples/warpdrive/wd_adapter.c
 create mode 100644 samples/warpdrive/wd_adapter.h

diff --git a/samples/warpdrive/AUTHORS b/samples/warpdrive/AUTHORS
new file mode 100644
index 000000000000..fe7dc2413b0d
--- /dev/null
+++ b/samples/warpdrive/AUTHORS
@@ -0,0 +1,2 @@
+Kenneth Lee<liguozhu@hisilicon.com>
+Zaibo Xu<xuzaibo@huawei.com>
diff --git a/samples/warpdrive/ChangeLog b/samples/warpdrive/ChangeLog
new file mode 100644
index 000000000000..b1b716105590
--- /dev/null
+++ b/samples/warpdrive/ChangeLog
@@ -0,0 +1 @@
+init
diff --git a/samples/warpdrive/Makefile.am b/samples/warpdrive/Makefile.am
new file mode 100644
index 000000000000..41154a880a97
--- /dev/null
+++ b/samples/warpdrive/Makefile.am
@@ -0,0 +1,9 @@
+ACLOCAL_AMFLAGS = -I m4
+AUTOMAKE_OPTIONS = foreign subdir-objects
+AM_CFLAGS=-Wall -O0 -fno-strict-aliasing
+
+lib_LTLIBRARIES=libwd.la
+libwd_la_SOURCES=wd.c wd_adapter.c wd.h wd_adapter.h \
+		 drv/hisi_qm_udrv.c drv/hisi_qm_udrv.h
+
+SUBDIRS=. test
diff --git a/samples/warpdrive/NEWS b/samples/warpdrive/NEWS
new file mode 100644
index 000000000000..b1b716105590
--- /dev/null
+++ b/samples/warpdrive/NEWS
@@ -0,0 +1 @@
+init
diff --git a/samples/warpdrive/README b/samples/warpdrive/README
new file mode 100644
index 000000000000..3adf66b112fc
--- /dev/null
+++ b/samples/warpdrive/README
@@ -0,0 +1,32 @@
+WD User Land Demonstration
+==========================
+
+This directory contains some applications and libraries to demonstrate how a
+
+WarpDrive application can be constructed.
+
+
+As a demo, we try to keep it simple and easy to understand. It is not
+
+supposed to be used in a business scenario.
+
+
+The directory contains the following elements:
+
+wd.[ch]
+	A demonstration WarpDrive fundamental library which wraps the basic
+	operations on a WarpDrive-enabled device.
+
+wd_adapter.[ch]
+	User driver adaptor for wd.[ch]
+
+wd_utils.[ch]
+	Some utility functions used by WD and its drivers
+
+drv/*
+	User drivers. They help to fulfill the semantics of wd.[ch] for
+	particular hardware
+
+test/*
+	Test applications that use the WarpDrive library
+
diff --git a/samples/warpdrive/autogen.sh b/samples/warpdrive/autogen.sh
new file mode 100755
index 000000000000..58deaf49de2a
--- /dev/null
+++ b/samples/warpdrive/autogen.sh
@@ -0,0 +1,3 @@
+#!/bin/sh -x
+
+autoreconf -i -f -v
diff --git a/samples/warpdrive/cleanup.sh b/samples/warpdrive/cleanup.sh
new file mode 100755
index 000000000000..c5f3d21e5dc1
--- /dev/null
+++ b/samples/warpdrive/cleanup.sh
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+if [ -r Makefile ]; then
+	make distclean
+fi
+
+FILES="aclocal.m4 autom4te.cache compile config.guess config.h.in config.log \
+       config.status config.sub configure cscope.out depcomp install-sh      \
+       libsrc/Makefile libsrc/Makefile.in libtool ltmain.sh Makefile         \
+       ar-lib m4 \
+       Makefile.in missing src/Makefile src/Makefile.in test/Makefile.in"
+
+rm -vRf $FILES
diff --git a/samples/warpdrive/configure.ac b/samples/warpdrive/configure.ac
new file mode 100644
index 000000000000..53262f3197c2
--- /dev/null
+++ b/samples/warpdrive/configure.ac
@@ -0,0 +1,52 @@
+AC_PREREQ([2.69])
+AC_INIT([wrapdrive], [0.1], [liguozhu@hisilicon.com])
+AC_CONFIG_SRCDIR([wd.c])
+AM_INIT_AUTOMAKE([1.10 no-define])
+
+AC_CONFIG_MACRO_DIR([m4])
+AC_CONFIG_HEADERS([config.h])
+
+# Checks for programs.
+AC_PROG_CXX
+AC_PROG_AWK
+AC_PROG_CC
+AC_PROG_CPP
+AC_PROG_INSTALL
+AC_PROG_LN_S
+AC_PROG_MAKE_SET
+AC_PROG_RANLIB
+
+AM_PROG_AR
+AC_PROG_LIBTOOL
+AM_PROG_LIBTOOL
+LT_INIT
+AM_PROG_CC_C_O
+
+AC_DEFINE([HAVE_SVA], [0], [enable SVA support])
+AC_ARG_ENABLE([sva],
+	      [ --enable-sva 	enable to support sva feature],
+	      AC_DEFINE([HAVE_SVA], [1]))
+
+# Checks for libraries.
+
+# Checks for header files.
+AC_CHECK_HEADERS([fcntl.h stdint.h stdlib.h string.h sys/ioctl.h sys/time.h unistd.h])
+
+# Checks for typedefs, structures, and compiler characteristics.
+AC_CHECK_HEADER_STDBOOL
+AC_C_INLINE
+AC_TYPE_OFF_T
+AC_TYPE_SIZE_T
+AC_TYPE_UINT16_T
+AC_TYPE_UINT32_T
+AC_TYPE_UINT64_T
+AC_TYPE_UINT8_T
+
+# Checks for library functions.
+AC_FUNC_MALLOC
+AC_FUNC_MMAP
+AC_CHECK_FUNCS([memset munmap])
+
+AC_CONFIG_FILES([Makefile
+                 test/Makefile])
+AC_OUTPUT
diff --git a/samples/warpdrive/drv/hisi_qm_udrv.c b/samples/warpdrive/drv/hisi_qm_udrv.c
new file mode 100644
index 000000000000..777e2e3cff18
--- /dev/null
+++ b/samples/warpdrive/drv/hisi_qm_udrv.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <assert.h>
+#include <string.h>
+#include <stdint.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/eventfd.h>
+
+#include "hisi_qm_udrv.h"
+
+#if __AARCH64EL__ == 1
+#define mb() {asm volatile("dsb sy" : : : "memory"); }
+#else
+#warning "this file needs to be used in AARCH64EL mode"
+#define mb()
+#endif
+
+#define QM_SQE_SIZE		128
+#define QM_CQE_SIZE		16
+#define QM_EQ_DEPTH		1024
+
+/* cqe shift */
+#define CQE_PHASE(cq)	(((*((__u32 *)(cq) + 3)) >> 16) & 0x1)
+#define CQE_SQ_NUM(cq)	((*((__u32 *)(cq) + 2)) >> 16)
+#define CQE_SQ_HEAD_INDEX(cq)	((*((__u32 *)(cq) + 2)) & 0xffff)
+
+#define QM_IOMEM_SIZE		4096
+
+#define QM_DOORBELL_OFFSET	0x340
+
+struct cqe {
+	__le32 rsvd0;
+	__le16 cmd_id;
+	__le16 rsvd1;
+	__le16 sq_head;
+	__le16 sq_num;
+	__le16 rsvd2;
+	__le16 w7; /* phase, status */
+};
+
+struct hisi_qm_queue_info {
+	void *sq_base;
+	void *cq_base;
+	void *doorbell_base;
+	__u16 sq_tail_index;
+	__u16 sq_head_index;
+	__u16 cq_head_index;
+	__u16 sqn;
+	bool cqc_phase;
+	void *req_cache[QM_EQ_DEPTH];
+	int is_sq_full;
+};
+
+int hacc_db(struct hisi_qm_queue_info *q, __u8 cmd, __u16 index, __u8 priority)
+{
+	void *base = q->doorbell_base;
+	__u16 sqn = q->sqn;
+	__u64 doorbell = 0;
+
+	doorbell = (__u64)sqn | ((__u64)cmd << 16);
+	doorbell |= ((__u64)index | ((__u64)priority << 16)) << 32;
+
+	*((__u64 *)base) = doorbell;
+
+	return 0;
+}
+
+static int hisi_qm_fill_sqe(void *msg, struct hisi_qm_queue_info *info, __u16 i)
+{
+	struct hisi_qm_msg *sqe = (struct hisi_qm_msg *)info->sq_base + i;
+	memcpy((void *)sqe, msg, sizeof(struct hisi_qm_msg));
+	assert(!info->req_cache[i]);
+	info->req_cache[i] = msg;
+
+	return 0;
+}
+
+static int hisi_qm_recv_sqe(struct hisi_qm_msg *sqe, struct hisi_qm_queue_info *info, __u16 i)
+{
+	__u32 status = sqe->dw3 & 0xff;
+	__u32 type = sqe->dw9 & 0xff;
+
+	if (status != 0 && status != 0x0d) {
+		fprintf(stderr, "bad status (s=%u, t=%u)\n", status, type);
+		return -EIO;
+	}
+
+	assert(info->req_cache[i]);
+	memcpy((void *)info->req_cache[i], sqe, sizeof(struct hisi_qm_msg));
+	return 0;
+}
+
+int hisi_qm_set_queue_dio(struct wd_queue *q)
+{
+	struct hisi_qm_queue_info *info;
+	void *vaddr;
+	int ret;
+
+	alloc_obj(info);
+	if (!info)
+		return -1;
+
+	q->priv = info;
+
+	vaddr = mmap(NULL,
+		QM_SQE_SIZE * QM_EQ_DEPTH + QM_CQE_SIZE * QM_EQ_DEPTH,
+		PROT_READ | PROT_WRITE, MAP_SHARED, q->mdev, 4096);
+	if (vaddr == MAP_FAILED) {
+		ret = -errno;
+		goto err_with_info;
+	}
+	info->sq_base = vaddr;
+	info->cq_base = vaddr + QM_SQE_SIZE * QM_EQ_DEPTH;
+
+	vaddr = mmap(NULL, QM_IOMEM_SIZE,
+		PROT_READ | PROT_WRITE, MAP_SHARED, q->mdev, 0);
+	if (vaddr == MAP_FAILED) {
+		ret = -errno;
+		goto err_with_scq;
+	}
+	info->doorbell_base = vaddr + QM_DOORBELL_OFFSET;
+	info->sq_tail_index = 0;
+	info->sq_head_index = 0;
+	info->cq_head_index = 0;
+	info->cqc_phase = 1;
+
+	info->is_sq_full = 0;
+
+	return 0;
+
+err_with_scq:
+	munmap(info->sq_base,
+		QM_SQE_SIZE * QM_EQ_DEPTH + QM_CQE_SIZE * QM_EQ_DEPTH);
+err_with_info:
+	free(info);
+	return ret;
+}
+
+void hisi_qm_unset_queue_dio(struct wd_queue *q)
+{
+	struct hisi_qm_queue_info *info = (struct hisi_qm_queue_info *)q->priv;
+
+	munmap(info->doorbell_base - QM_DOORBELL_OFFSET, QM_IOMEM_SIZE);
+	munmap(info->cq_base, QM_CQE_SIZE * QM_EQ_DEPTH);
+	munmap(info->sq_base, QM_SQE_SIZE * QM_EQ_DEPTH);
+	free(info);
+	q->priv = NULL;
+}
+
+int hisi_qm_add_to_dio_q(struct wd_queue *q, void *req)
+{
+	struct hisi_qm_queue_info *info = (struct hisi_qm_queue_info *)q->priv;
+	__u16 i;
+
+	if (info->is_sq_full)
+		return -EBUSY;
+
+	i = info->sq_tail_index;
+
+	hisi_qm_fill_sqe(req, q->priv, i);
+
+	mb();
+
+	if (i == (QM_EQ_DEPTH - 1))
+		i = 0;
+	else
+		i++;
+
+	hacc_db(info, DOORBELL_CMD_SQ, i, 0);
+
+	info->sq_tail_index = i;
+
+	if (i == info->sq_head_index)
+		info->is_sq_full = 1;
+
+	return 0;
+}
+
+int hisi_qm_get_from_dio_q(struct wd_queue *q, void **resp)
+{
+	struct hisi_qm_queue_info *info = (struct hisi_qm_queue_info *)q->priv;
+	__u16 i = info->cq_head_index;
+	struct cqe *cq_base = info->cq_base;
+	struct hisi_qm_msg *sq_base = info->sq_base;
+	struct cqe *cqe = cq_base + i;
+	struct hisi_qm_msg *sqe;
+	int ret;
+
+	if (info->cqc_phase == CQE_PHASE(cqe)) {
+		sqe = sq_base + CQE_SQ_HEAD_INDEX(cqe);
+		ret = hisi_qm_recv_sqe(sqe, info, i);
+		if (ret < 0)
+			return -EIO;
+
+		if (info->is_sq_full)
+			info->is_sq_full = 0;
+	} else {
+		return -EAGAIN;
+	}
+
+	*resp = info->req_cache[i];
+	info->req_cache[i] = NULL;
+
+	if (i == (QM_EQ_DEPTH - 1)) {
+		info->cqc_phase = !(info->cqc_phase);
+		i = 0;
+	} else
+		i++;
+
+	hacc_db(info, DOORBELL_CMD_CQ, i, 0);
+
+	info->cq_head_index = i;
+	info->sq_head_index = i;
+
+
+	return ret;
+}
diff --git a/samples/warpdrive/drv/hisi_qm_udrv.h b/samples/warpdrive/drv/hisi_qm_udrv.h
new file mode 100644
index 000000000000..6a7a06a089c9
--- /dev/null
+++ b/samples/warpdrive/drv/hisi_qm_udrv.h
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __HZIP_DRV_H__
+#define __HZIP_DRV_H__
+
+#include <linux/types.h>
+#include "../wd.h"
+
+/* this is unnecessarily big; the hardware should optimize it */
+struct hisi_qm_msg {
+	__u32 consumed;
+	__u32 produced;
+	__u32 comp_date_length;
+	__u32 dw3;
+	__u32 input_date_length;
+	__u32 lba_l;
+	__u32 lba_h;
+	__u32 dw7; /* ... */
+	__u32 dw8; /* ... */
+	__u32 dw9; /* ... */
+	__u32 dw10; /* ... */
+	__u32 priv_info;
+	__u32 dw12; /* ... */
+	__u32 tag;
+	__u32 dest_avail_out;
+	__u32 rsvd0;
+	__u32 comp_head_addr_l;
+	__u32 comp_head_addr_h;
+	__u32 source_addr_l;
+	__u32 source_addr_h;
+	__u32 dest_addr_l;
+	__u32 dest_addr_h;
+	__u32 stream_ctx_addr_l;
+	__u32 stream_ctx_addr_h;
+	__u32 cipher_key1_addr_l;
+	__u32 cipher_key1_addr_h;
+	__u32 cipher_key2_addr_l;
+	__u32 cipher_key2_addr_h;
+	__u32 rsvd1[4];
+};
+
+struct hisi_acc_qm_sqc {
+	__u16 sqn;
+};
+
+#define DOORBELL_CMD_SQ		0
+#define DOORBELL_CMD_CQ		1
+
+int hisi_qm_set_queue_dio(struct wd_queue *q);
+void hisi_qm_unset_queue_dio(struct wd_queue *q);
+int hisi_qm_add_to_dio_q(struct wd_queue *q, void *req);
+int hisi_qm_get_from_dio_q(struct wd_queue *q, void **resp);
+
+#endif
diff --git a/samples/warpdrive/test/Makefile.am b/samples/warpdrive/test/Makefile.am
new file mode 100644
index 000000000000..ad80e80a47d7
--- /dev/null
+++ b/samples/warpdrive/test/Makefile.am
@@ -0,0 +1,7 @@
+AM_CFLAGS=-Wall -O0 -fno-strict-aliasing
+
+bin_PROGRAMS=test_hisi_zip
+
+test_hisi_zip_SOURCES=test_hisi_zip.c
+
+test_hisi_zip_LDADD=../.libs/libwd.a
diff --git a/samples/warpdrive/test/comp_hw.h b/samples/warpdrive/test/comp_hw.h
new file mode 100644
index 000000000000..79328fd0c1a0
--- /dev/null
+++ b/samples/warpdrive/test/comp_hw.h
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * This file is shared between the user and kernel space parts of WarpDrive.
+ * It holds the algorithm attributes that both the user and the driver care about
+ */
+
+#ifndef __VFIO_WDEV_COMP_H
+#define __VFIO_WDEV_COMP_H
+
+/* (De)compression algorithm parameters */
+struct vfio_wdev_comp_param {
+	__u32 window_size;
+	__u32 comp_level;
+	__u32 mode;
+	__u32 alg;
+};
+
+/* WD defines all the (de)compression algorithm names here */
+#define VFIO_WDEV_ZLIB			"zlib"
+#define VFIO_WDEV_GZIP			"gzip"
+#define VFIO_WDEV_LZ4			"lz4"
+
+#endif
diff --git a/samples/warpdrive/test/test_hisi_zip.c b/samples/warpdrive/test/test_hisi_zip.c
new file mode 100644
index 000000000000..5bf90c6d0e81
--- /dev/null
+++ b/samples/warpdrive/test/test_hisi_zip.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include "../wd.h"
+#include "comp_hw.h"
+#include "../drv/hisi_qm_udrv.h"
+
+#if defined(MSDOS) || defined(OS2) || defined(WIN32) || defined(__CYGWIN__)
+#  include <fcntl.h>
+#  include <io.h>
+#  define SET_BINARY_MODE(file) setmode(fileno(file), O_BINARY)
+#else
+#  define SET_BINARY_MODE(file)
+#endif
+
+#define PAGE_SHIFT	12
+#define PAGE_SIZE	(1 << PAGE_SHIFT)
+
+#define ASIZE (8*512*4096)	/*16MB*/
+
+#define SYS_ERR_COND(cond, msg)		\
+do {					\
+	if (cond) {			\
+		perror(msg);		\
+		exit(EXIT_FAILURE);	\
+	}				\
+} while (0)
+
+#define ZLIB 0
+#define GZIP 1
+
+#define CHUNK 65535
+
+
+int hizip_deflate(FILE *source, FILE *dest,  int type)
+{
+	__u64 in, out;
+	struct wd_queue q;
+	struct hisi_qm_msg *msg, *recv_msg;
+	void *a, *b;
+	char *src, *dst;
+	int ret, total_len;
+	int output_num;
+	int fd, file_msize;
+
+	memset(&q, 0, sizeof(q));
+	q.container = -1;
+	q.mdev_name = "22e09922-7a82-11e8-9cf6-d74cffa9e87b";
+	q.vfio_group_path = "/dev/vfio/10"; //fixme to the right path
+	q.iommu_ext_path = "/sys/class/sdmdev/0000:75:00.0/device/params/iommu_type";
+	q.dmaflag_ext_path = "/sys/class/sdmdev/0000:75:00.0/device/params/dma_flag";
+	q.device_api_path = "/sys/class/sdmdev/0000:75:00.0/device/mdev_supported_types/hisi_zip-hisi_zip/device_api";
+	ret = wd_request_queue(&q);
+	SYS_ERR_COND(ret, "wd_request_queue");
+
+	fprintf(stderr, "pasid=%d, dma_flag=%d\n", q.pasid, q.dma_flag);
+	fd = fileno(source);
+	struct stat s;
+
+	if (fstat(fd, &s) < 0) {
+		ret = -errno;
+		perror("fstat");
+		goto release_q;
+	}
+	total_len = s.st_size;
+
+	if (!total_len) {
+		ret = -EINVAL;
+		SYS_ERR_COND(ret, "input file length zero");
+	}
+	if (total_len > 16*1024*1024) {
+		fputs("error, input file size too large (must be <= 16MB)!\n", stderr);
+		goto release_q;
+	}
+	file_msize = !(total_len%PAGE_SIZE) ? total_len :
+			(total_len/PAGE_SIZE+1)*PAGE_SIZE;
+	/* mmap file and  DMA mapping */
+	a = mmap((void *)0x0, file_msize, PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE, fd, 0);
+	if (a == MAP_FAILED) {
+		fputs("mmap file fail!\n", stderr);
+		goto release_q;
+	}
+	ret = wd_mem_share(&q, a, file_msize, 0);
+	if (ret) {
+		fprintf(stderr, "wd_mem_share dma a buf fail!err=%d\n", -errno);
+		goto unmap_file;
+	}
+	/* Allocate some space and setup a DMA mapping */
+	b = mmap((void *)0x0, ASIZE, PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (b == MAP_FAILED) {
+		fputs("mmap b fail!\n", stderr);
+		goto unshare_file;
+	}
+	memset(b, 0, ASIZE);
+	ret = wd_mem_share(&q, b, ASIZE, 0);
+	if (ret) {
+		fputs("wd_mem_share dma b buf fail!\n", stderr);
+		goto unmap_mem;
+	}
+	src = (char *)a;
+	dst = (char *)b;
+
+	msg = malloc(sizeof(*msg));
+	if (!msg) {
+		fputs("alloc msg fail!\n", stderr);
+		goto alloc_msg_fail;
+	}
+	memset((void *)msg, 0, sizeof(*msg));
+	msg->input_date_length = total_len;
+	if (type == ZLIB)
+		msg->dw9 = 2;
+	else
+		msg->dw9 = 3;
+	msg->dest_avail_out = 0x800000;
+
+	in = (__u64)src;
+	out = (__u64)dst;
+
+	msg->source_addr_l = in & 0xffffffff;
+	msg->source_addr_h = in >> 32;
+	msg->dest_addr_l = out & 0xffffffff;
+	msg->dest_addr_h = out >> 32;
+
+	ret = wd_send(&q, msg);
+
+	if (ret == -EBUSY) {
+		usleep(1);
+		goto recv_again;
+	}
+	SYS_ERR_COND(ret, "send fail!\n");
+recv_again:
+	ret = wd_recv(&q, (void **)&recv_msg);
+	if (ret == -EIO) {
+		fputs(" wd_recv fail!\n", stderr);
+		goto alloc_msg_fail;
+	/* synchronous mode, if get none, then get again */
+	} else if (ret == -EAGAIN)
+		goto recv_again;
+
+	output_num = recv_msg->produced;
+	/* add the zlib header, then write header + compressed data to a file */
+	char zip_head[2] = {0x78, 0x9c};
+
+	fwrite(zip_head, 1, 2, dest);
+	fwrite((char *)out, 1, output_num, dest);
+	fclose(dest);
+
+	free(msg);
+alloc_msg_fail:
+	wd_mem_unshare(&q, b, ASIZE);
+unmap_mem:
+	munmap(b, ASIZE);
+unshare_file:
+	wd_mem_unshare(&q, a, file_msize);
+unmap_file:
+	munmap(a, file_msize);
+release_q:
+	wd_release_queue(&q);
+
+	return ret;
+}
+
+int main(int argc, char *argv[])
+{
+	int alg_type = 0;
+
+	/* avoid end-of-line conversions */
+	SET_BINARY_MODE(stdin);
+	SET_BINARY_MODE(stdout);
+
+	if (!argv[1]) {
+		fputs("<<use ./test_hisi_zip -h get more details>>\n", stderr);
+		goto EXIT;
+	}
+
+	if (!strcmp(argv[1], "-z"))
+		alg_type = ZLIB;
+	else if (!strcmp(argv[1], "-g")) {
+		alg_type = GZIP;
+	} else if (!strcmp(argv[1], "-h")) {
+		fputs("[version]:1.0.2\n", stderr);
+		fputs("[usage]: ./test_hisi_zip [type] < src_file > dest_file\n",
+			stderr);
+		fputs("     [type]:\n", stderr);
+		fputs("            -z  = zlib\n", stderr);
+		fputs("            -g  = gzip\n", stderr);
+		fputs("            -h  = usage\n", stderr);
+		fputs("Example:\n", stderr);
+		fputs("./test_hisi_zip -z < test.data > out.data\n", stderr);
+		goto EXIT;
+	} else {
+		fputs("Unknown option\n", stderr);
+		fputs("<<use ./test_hisi_zip -h get more details>>\n",
+			stderr);
+		goto EXIT;
+	}
+
+	hizip_deflate(stdin, stdout, alg_type);
+EXIT:
+	return EXIT_SUCCESS;
+}
diff --git a/samples/warpdrive/wd.c b/samples/warpdrive/wd.c
new file mode 100644
index 000000000000..8df071637341
--- /dev/null
+++ b/samples/warpdrive/wd.c
@@ -0,0 +1,309 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "config.h"
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/queue.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <assert.h>
+#include <dirent.h>
+#include <sys/poll.h>
+#include "wd.h"
+#include "wd_adapter.h"
+
+#if (defined(HAVE_SVA) & HAVE_SVA)
+static int _wd_bind_process(struct wd_queue *q)
+{
+	struct bind_data {
+		struct vfio_iommu_type1_bind bind;
+		struct vfio_iommu_type1_bind_process data;
+	} wd_bind;
+	int ret;
+	__u32 flags = 0;
+
+	if (q->dma_flag & VFIO_SDMDEV_DMA_MULTI_PROC_MAP)
+		flags = VFIO_IOMMU_BIND_PRIV;
+	else if (q->dma_flag & VFIO_SDMDEV_DMA_SVM_NO_FAULT)
+		flags = VFIO_IOMMU_BIND_NOPF;
+
+	wd_bind.bind.flags = VFIO_IOMMU_BIND_PROCESS;
+	wd_bind.bind.argsz = sizeof(wd_bind);
+	wd_bind.data.flags = flags;
+	ret = ioctl(q->container, VFIO_IOMMU_BIND, &wd_bind);
+	if (ret)
+		return ret;
+	q->pasid = wd_bind.data.pasid;
+	return ret;
+}
+
+static int _wd_unbind_process(struct wd_queue *q)
+{
+	struct bind_data {
+		struct vfio_iommu_type1_bind bind;
+		struct vfio_iommu_type1_bind_process data;
+	} wd_bind;
+	__u32 flags = 0;
+
+	if (q->dma_flag & VFIO_SDMDEV_DMA_MULTI_PROC_MAP)
+		flags = VFIO_IOMMU_BIND_PRIV;
+	else if (q->dma_flag & VFIO_SDMDEV_DMA_SVM_NO_FAULT)
+		flags = VFIO_IOMMU_BIND_NOPF;
+
+	wd_bind.bind.flags = VFIO_IOMMU_BIND_PROCESS;
+	wd_bind.data.pasid = q->pasid;
+	wd_bind.data.flags = flags;
+	wd_bind.bind.argsz = sizeof(wd_bind);
+
+	return ioctl(q->container, VFIO_IOMMU_UNBIND, &wd_bind);
+}
+#endif
+
+int wd_request_queue(struct wd_queue *q)
+{
+	struct vfio_group_status group_status = {
+		.argsz = sizeof(group_status) };
+	int iommu_ext;
+	int ret;
+
+	if (!q->vfio_group_path ||
+		!q->device_api_path ||
+		!q->dmaflag_ext_path ||
+		!q->iommu_ext_path) {
+		WD_ERR("please set vfio_group_path, dmaflag_ext_path, "
+		"device_api_path, and iommu_ext_path before calling %s\n", __func__);
+		return -EINVAL;
+	}
+
+	q->hw_type_id = 0; /* this can be set according to the device api_version in the future */
+
+	q->group = open(q->vfio_group_path, O_RDWR);
+	if (q->group < 0) {
+		WD_ERR("open vfio group(%s) fail, errno=%d\n",
+			q->vfio_group_path, errno);
+		return -errno;
+	}
+
+	if (q->container <= 0) {
+		q->container = open("/dev/vfio/vfio", O_RDWR);
+		if (q->container < 0) {
+			WD_ERR("Create VFIO container fail!\n");
+			ret = -ENODEV;
+			goto err_with_group;
+		}
+	}
+
+	if (ioctl(q->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+		WD_ERR("VFIO version check fail!\n");
+		ret = -EINVAL;
+		goto err_with_container;
+	}
+
+	q->dma_flag = _get_attr_int(q->dmaflag_ext_path);
+	if (q->dma_flag == INT_MIN) {
+		ret = -EINVAL;
+		goto err_with_container;
+	}
+
+	iommu_ext = _get_attr_int(q->iommu_ext_path);
+	if (iommu_ext == INT_MIN) {
+		ret = -EINVAL;
+		goto err_with_container;
+	}
+
+	if (!ioctl(q->container, VFIO_CHECK_EXTENSION, iommu_ext)) {
+		WD_ERR("VFIO iommu check (%d) fail!\n", iommu_ext);
+		ret = -EINVAL;
+		goto err_with_container;
+	}
+
+	ret = _get_attr_str(q->device_api_path, q->hw_type);
+	if (ret)
+		goto err_with_container;
+
+	ret = ioctl(q->group, VFIO_GROUP_GET_STATUS, &group_status);
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		WD_ERR("VFIO group is not viable\n");
+		goto err_with_container;
+	}
+
+	ret = ioctl(q->group, VFIO_GROUP_SET_CONTAINER, &q->container);
+	if (ret) {
+		WD_ERR("VFIO group fail on VFIO_GROUP_SET_CONTAINER\n");
+		goto err_with_container;
+	}
+
+	ret = ioctl(q->container, VFIO_SET_IOMMU, iommu_ext);
+	if (ret) {
+		WD_ERR("VFIO fail on VFIO_SET_IOMMU(%d)\n", iommu_ext);
+		goto err_with_container;
+	}
+
+	q->mdev = ioctl(q->group, VFIO_GROUP_GET_DEVICE_FD, q->mdev_name);
+	if (q->mdev < 0) {
+		WD_ERR("VFIO fail on VFIO_GROUP_GET_DEVICE_FD (%d)\n", q->mdev);
+		ret = q->mdev;
+		goto err_with_container;
+	}
+
+#if (defined(HAVE_SVA) & HAVE_SVA)
+	if (!(q->dma_flag & (VFIO_SDMDEV_DMA_PHY | VFIO_SDMDEV_DMA_SINGLE_PROC_MAP))) {
+		ret = _wd_bind_process(q);
+		if (ret) {
+			WD_ERR("VFIO fails to bind process!\n");
+			/* q->mdev is closed once, at err_with_mdev */
+			goto err_with_mdev;
+
+		}
+	}
+
+	ret = ioctl(q->mdev, VFIO_SDMDEV_CMD_BIND_PASID, (unsigned long)q->pasid);
+	if (ret < 0) {
+		WD_ERR("fail to bind pasid to device, errno=%d\n", errno);
+		goto err_with_mdev;
+	}
+#endif
+
+	ret = drv_open(q);
+	if (ret)
+		goto err_with_mdev;
+
+	return 0;
+
+err_with_mdev:
+	close(q->mdev);
+err_with_container:
+	close(q->container);
+err_with_group:
+	close(q->group);
+	return ret;
+}
+
+void wd_release_queue(struct wd_queue *q)
+{
+	drv_close(q);
+
+#if (defined(HAVE_SVA) & HAVE_SVA)
+	if (!(q->dma_flag & (VFIO_SDMDEV_DMA_PHY | VFIO_SDMDEV_DMA_SINGLE_PROC_MAP))) {
+		if (q->pasid <= 0) {
+			WD_ERR("invalid wd queue pasid! pasid=%d\n", q->pasid);
+			return;
+		}
+		if (_wd_unbind_process(q)) {
+			WD_ERR("VFIO fails to unbind process!\n");
+			return;
+		}
+	}
+#endif
+
+	close(q->mdev);
+	close(q->container);
+	close(q->group);
+}
+
+int wd_send(struct wd_queue *q, void *req)
+{
+	return drv_send(q, req);
+}
+
+int wd_recv(struct wd_queue *q, void **resp)
+{
+	return drv_recv(q, resp);
+}
+
+static int wd_flush_and_wait(struct wd_queue *q, int ms)
+{
+	wd_flush(q);
+	return ioctl(q->mdev, VFIO_SDMDEV_CMD_WAIT, ms);
+}
+
+int wd_recv_sync(struct wd_queue *q, void **resp, __u16 ms)
+{
+	int ret;
+
+	while (1) {
+		ret = wd_recv(q, resp);
+		if (ret == -EBUSY) {
+			ret = wd_flush_and_wait(q, ms);
+			if (ret)
+				return ret;
+		} else
+			return ret;
+	}
+}
+
+void wd_flush(struct wd_queue *q)
+{
+	drv_flush(q);
+}
+
+static int _wd_mem_share_type1(struct wd_queue *q, const void *addr,
+			       size_t size, int flags)
+{
+	struct vfio_iommu_type1_dma_map dma_map;
+
+	if (q->dma_flag & VFIO_SDMDEV_DMA_SVM_NO_FAULT)
+		return mlock(addr, size);
+
+#if (defined(HAVE_SVA) & HAVE_SVA)
+	else if ((q->dma_flag & VFIO_SDMDEV_DMA_MULTI_PROC_MAP) &&
+		 (q->pasid > 0))
+		dma_map.pasid = q->pasid;
+#endif
+	else if ((q->dma_flag & VFIO_SDMDEV_DMA_SINGLE_PROC_MAP))
+		; //todo
+	else
+		return -1;
+
+	dma_map.vaddr = (__u64)addr;
+	dma_map.size = size;
+	dma_map.iova = (__u64)addr;
+	dma_map.flags =
+		VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE | flags;
+	dma_map.argsz = sizeof(dma_map);
+
+	return ioctl(q->container, VFIO_IOMMU_MAP_DMA, &dma_map);
+}
+
+static void _wd_mem_unshare_type1(struct wd_queue *q, const void *addr,
+				  size_t size)
+{
+#if (defined(HAVE_SVA) & HAVE_SVA)
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+#endif
+
+	if (q->dma_flag & VFIO_SDMDEV_DMA_SVM_NO_FAULT) {
+		(void)munlock(addr, size);
+		return;
+	}
+
+#if (defined(HAVE_SVA) & HAVE_SVA)
+	dma_unmap.iova = (__u64)addr;
+	if ((q->dma_flag & VFIO_SDMDEV_DMA_MULTI_PROC_MAP) && (q->pasid > 0))
+		dma_unmap.flags = 0;
+	dma_unmap.size = size;
+	dma_unmap.argsz = sizeof(dma_unmap);
+	ioctl(q->container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+#endif
+}
+
+int wd_mem_share(struct wd_queue *q, const void *addr, size_t size, int flags)
+{
+	if (drv_can_do_mem_share(q))
+		return drv_share(q, addr, size, flags);
+	else
+		return _wd_mem_share_type1(q, addr, size, flags);
+}
+
+void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t size)
+{
+	if (drv_can_do_mem_share(q))
+		drv_unshare(q, addr, size);
+	else
+		_wd_mem_unshare_type1(q, addr, size);
+}
+
diff --git a/samples/warpdrive/wd.h b/samples/warpdrive/wd.h
new file mode 100644
index 000000000000..eccf43dc034d
--- /dev/null
+++ b/samples/warpdrive/wd.h
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __WD_H
+#define __WD_H
+#include <stdlib.h>
+#include <errno.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <limits.h>
+#include "../../include/uapi/linux/vfio.h"
+#include "../../include/uapi/linux/vfio_sdmdev.h"
+
+#define SYS_VAL_SIZE		16
+#define PATH_STR_SIZE 		256
+#define WD_NAME_SIZE 		64
+#define WD_MAX_MEMLIST_SZ 	128
+
+
+#ifndef dma_addr_t
+#define dma_addr_t __u64
+#endif
+
+typedef int bool;
+
+#ifndef true
+#define true 1
+#endif
+
+#ifndef false
+#define false 0
+#endif
+
+/* the flags used by wd_capa->flags, the high 16bits are for algorithm
+ * and the low 16bits are for Framework
+ */
+#define WD_FLAGS_FW_PREFER_LOCAL_ZONE 1
+
+#define WD_FLAGS_FW_MASK 0x0000FFFF
+#ifndef WD_ERR
+#define WD_ERR(format, args...) fprintf(stderr, format, ##args)
+#endif
+
+/* The default page size is assumed to be 4K */
+#define WDQ_MAP_REGION(region_index)	(((region_index) << 12) & 0xf000)
+#define WDQ_MAP_Q(q_index)		(((q_index) << 16) & 0xffff0000)
+
+static inline void wd_reg_write(void *reg_addr, uint32_t value)
+{
+	*((volatile uint32_t *)reg_addr) = value;
+}
+
+static inline uint32_t wd_reg_read(void *reg_addr)
+{
+	uint32_t temp;
+
+	temp = *((volatile uint32_t *)reg_addr);
+
+	return temp;
+}
+
+static inline int _get_attr_str(const char *path, char value[PATH_STR_SIZE])
+{
+	int fd, ret;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		WD_ERR("get_attr_str: open %s fail\n", path);
+		return fd;
+	}
+	memset(value, 0, PATH_STR_SIZE);
+	ret = read(fd, value, PATH_STR_SIZE - 1);
+	if (ret > 0) {
+		close(fd);
+		return 0;
+	}
+	close(fd);
+
+	WD_ERR("read nothing from %s\n", path);
+	return -EINVAL;
+}
+
+static inline int _get_attr_int(const char *path)
+{
+	char value[PATH_STR_SIZE];
+	if (_get_attr_str(path, value))
+		return INT_MIN;
+	else
+		return atoi(value);
+}
+
+/* Address types that the memory in an accelerator message can use */
+enum wd_addr_flags {
+	WD_AATTR_INVALID = 0,
+
+	 /* Common user virtual memory */
+	_WD_AATTR_COM_VIRT = 1,
+
+	 /* Physical address*/
+	_WD_AATTR_PHYS = 2,
+
+	/* I/O virtual address*/
+	_WD_AATTR_IOVA = 4,
+
+	/* SGL, user cares for */
+	WD_AATTR_SGL = 8,
+};
+
+#define WD_CAPA_PRIV_DATA_SIZE	64
+
+#define alloc_obj(objp) do { \
+	objp = malloc(sizeof(*objp)); \
+	if (objp) memset(objp, 0, sizeof(*objp)); \
+} while (0)
+#define free_obj(objp) if (objp)free(objp)
+
+struct wd_queue {
+	const char *mdev_name;
+	char hw_type[PATH_STR_SIZE];
+	int hw_type_id;
+	int dma_flag;
+	void *priv; /* private data used by the drv layer */
+	int container;
+	int group;
+	int mdev;
+	int pasid;
+	int iommu_type;
+	char *vfio_group_path;
+	char *iommu_ext_path;
+	char *dmaflag_ext_path;
+	char *device_api_path;
+};
+
+extern int wd_request_queue(struct wd_queue *q);
+extern void wd_release_queue(struct wd_queue *q);
+extern int wd_send(struct wd_queue *q, void *req);
+extern int wd_recv(struct wd_queue *q, void **resp);
+extern void wd_flush(struct wd_queue *q);
+extern int wd_recv_sync(struct wd_queue *q, void **resp, __u16 ms);
+extern int wd_mem_share(struct wd_queue *q, const void *addr,
+			size_t size, int flags);
+extern void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t size);
+
+/* for debug only */
+extern int wd_dump_all_algos(void);
+
+/* these are for driver use only */
+extern int wd_set_queue_attr(struct wd_queue *q, const char *name,
+				char *value);
+extern int __iommu_type(struct wd_queue *q);
+#endif
diff --git a/samples/warpdrive/wd_adapter.c b/samples/warpdrive/wd_adapter.c
new file mode 100644
index 000000000000..d8d55b75e99a
--- /dev/null
+++ b/samples/warpdrive/wd_adapter.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <string.h>
+#include <dirent.h>
+
+
+#include "wd_adapter.h"
+#include "./drv/hisi_qm_udrv.h"
+
+static struct wd_drv_dio_if hw_dio_tbl[] = { {
+		.hw_type = "hisi_qm_v1",
+		.open = hisi_qm_set_queue_dio,
+		.close = hisi_qm_unset_queue_dio,
+		.send = hisi_qm_add_to_dio_q,
+		.recv = hisi_qm_get_from_dio_q,
+	},
+	/* Add other drivers direct IO operations here */
+};
+
+/* todo: there should be some stable way to match the device and the driver */
+#define MAX_HW_TYPE (sizeof(hw_dio_tbl) / sizeof(hw_dio_tbl[0]))
+
+int drv_open(struct wd_queue *q)
+{
+	int i;
+
+	//todo: try to find another dev if the user driver is not available
+	for (i = 0; i < MAX_HW_TYPE; i++) {
+		if (!strcmp(q->hw_type,
+			hw_dio_tbl[i].hw_type)) {
+			q->hw_type_id = i;
+			return hw_dio_tbl[q->hw_type_id].open(q);
+		}
+	}
+	WD_ERR("No matching driver to use!\n");
+	errno = ENODEV;
+	return -ENODEV;
+}
+
+void drv_close(struct wd_queue *q)
+{
+	hw_dio_tbl[q->hw_type_id].close(q);
+}
+
+int drv_send(struct wd_queue *q, void *req)
+{
+	return hw_dio_tbl[q->hw_type_id].send(q, req);
+}
+
+int drv_recv(struct wd_queue *q, void **req)
+{
+	return hw_dio_tbl[q->hw_type_id].recv(q, req);
+}
+
+int drv_share(struct wd_queue *q, const void *addr, size_t size, int flags)
+{
+	return hw_dio_tbl[q->hw_type_id].share(q, addr, size, flags);
+}
+
+void drv_unshare(struct wd_queue *q, const void *addr, size_t size)
+{
+	hw_dio_tbl[q->hw_type_id].unshare(q, addr, size);
+}
+
+bool drv_can_do_mem_share(struct wd_queue *q)
+{
+	return hw_dio_tbl[q->hw_type_id].share != NULL;
+}
+
+void drv_flush(struct wd_queue *q)
+{
+	if (hw_dio_tbl[q->hw_type_id].flush)
+		hw_dio_tbl[q->hw_type_id].flush(q);
+}
diff --git a/samples/warpdrive/wd_adapter.h b/samples/warpdrive/wd_adapter.h
new file mode 100644
index 000000000000..bb3ab3ec112a
--- /dev/null
+++ b/samples/warpdrive/wd_adapter.h
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/* the common drv header define the unified interface for wd */
+#ifndef __WD_ADAPTER_H__
+#define __WD_ADAPTER_H__
+
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+
+#include "wd.h"
+
+struct wd_drv_dio_if {
+	char *hw_type;
+	int (*open)(struct wd_queue *q);
+	void (*close)(struct wd_queue *q);
+	int (*set_pasid)(struct wd_queue *q);
+	int (*unset_pasid)(struct wd_queue *q);
+	int (*send)(struct wd_queue *q, void *req);
+	int (*recv)(struct wd_queue *q, void **req);
+	void (*flush)(struct wd_queue *q);
+	int (*share)(struct wd_queue *q, const void *addr,
+		size_t size, int flags);
+	int (*unshare)(struct wd_queue *q, const void *addr, size_t size);
+};
+
+extern int drv_open(struct wd_queue *q);
+extern void drv_close(struct wd_queue *q);
+extern int drv_send(struct wd_queue *q, void *req);
+extern int drv_recv(struct wd_queue *q, void **req);
+extern void drv_flush(struct wd_queue *q);
+extern int drv_share(struct wd_queue *q, const void *addr,
+	size_t size, int flags);
+extern void drv_unshare(struct wd_queue *q, const void *addr, size_t size);
+extern bool drv_can_do_mem_share(struct wd_queue *q);
+
+#endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
@ 2018-09-03  2:11   ` Randy Dunlap
  2018-09-06  8:08     ` Kenneth Lee
  2018-09-03  2:55   ` Lu Baolu
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2018-09-03  2:11 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> diff --git a/drivers/vfio/sdmdev/Kconfig b/drivers/vfio/sdmdev/Kconfig
> new file mode 100644
> index 000000000000..51474272870d
> --- /dev/null
> +++ b/drivers/vfio/sdmdev/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config VFIO_SDMDEV
> +	tristate "Support for Share Domain MDEV"
> +	depends on VFIO_MDEV_DEVICE
> +	help
> +	  Support for VFIO Share Domain MDEV, which enables the kernel to
> +	  support light weight hardware accelerator framework, WarpDrive.

	          lightweight

> +
> +	  To compile this as a module, choose M here: the module will be called
> +	  sdmdev.


thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 4/7] crypto: add hisilicon Queue Manager driver
  2018-09-03  0:52 ` [PATCH 4/7] crypto: add hisilicon Queue Manager driver Kenneth Lee
@ 2018-09-03  2:15   ` Randy Dunlap
  2018-09-06  9:08     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2018-09-03  2:15 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
> index 8ca9c503bcb0..02a6eef84101 100644
> --- a/drivers/crypto/hisilicon/Kconfig
> +++ b/drivers/crypto/hisilicon/Kconfig
> @@ -1,4 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
> +config CRYPTO_DEV_HISILICON
> +	tristate "Support for HISILICON CRYPTO ACCELERATOR"
> +	help
> +	  Enable this to use Hisilicon Hardware Accelerators

	                                        Accelerators.


-- 
~Randy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM
  2018-09-03  0:52 ` [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM Kenneth Lee
@ 2018-09-03  2:19   ` Randy Dunlap
  2018-09-06  9:09     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2018-09-03  2:19 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
> index 1d155708cd69..b85fab48fdab 100644
> --- a/drivers/crypto/hisilicon/Kconfig
> +++ b/drivers/crypto/hisilicon/Kconfig
> @@ -17,6 +17,16 @@ config CRYPTO_DEV_HISI_SEC
>  	  To compile this as a module, choose M here: the module
>  	  will be called hisi_sec.
>  
> +config CRYPTO_DEV_HISI_SDMDEV
> +	bool "Enable SDMDEV interface"
> +	depends on CRYPTO_DEV_HISILICON
> +	select VFIO_SDMDEV
> +	help
> +	  Enable this enable the SDMDEV, "shared IOMMU Domain Mediated Device"

At a minimum:
	  Enable this to enable the SDMDEV,

although that could be done better.  Maybe just:
	  Enable the SDMDEV "shared IOMMU Domain Mediated Device"

	  
> +	  interface for all Hisilicon accelerators if they can. The SDMDEV

probably drop "if they can":          accelerators.  The SDMDEV interface

> +	  enable the WarpDrive user space accelerator driver to access the

	  enables

> +	  hardware function directly.
> +


-- 
~Randy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/7] vfio/sdmdev: add user sample
  2018-09-03  0:52 ` [PATCH 7/7] vfio/sdmdev: add user sample Kenneth Lee
@ 2018-09-03  2:25   ` Randy Dunlap
  2018-09-06  9:10     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2018-09-03  2:25 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>
> 
> This is the sample code to demostrate how WrapDrive user application
> should be.
> 
> It contains:
> 
> 1. wd.[ch], the common library to provide WrapDrive interface.

                                            WarpDrive

> 2. drv/*, the user driver to access the hardware upon spimdev
> 3. test/*, the test application to use WrapDrive interface to access the

                                         WarpDrive

>    hardware queue(s) of the accelerator.
> 
> The Hisilicon HIP08 ZIP accelerator is used in this sample.
> 
> Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> Signed-off-by: Hao Fang <fanghao11@huawei.com>
> Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>  samples/warpdrive/AUTHORS              |   2 +
>  samples/warpdrive/ChangeLog            |   1 +
>  samples/warpdrive/Makefile.am          |   9 +
>  samples/warpdrive/NEWS                 |   1 +
>  samples/warpdrive/README               |  32 +++
>  samples/warpdrive/autogen.sh           |   3 +
>  samples/warpdrive/cleanup.sh           |  13 ++
>  samples/warpdrive/configure.ac         |  52 +++++
>  samples/warpdrive/drv/hisi_qm_udrv.c   | 223 ++++++++++++++++++
>  samples/warpdrive/drv/hisi_qm_udrv.h   |  53 +++++
>  samples/warpdrive/test/Makefile.am     |   7 +
>  samples/warpdrive/test/comp_hw.h       |  23 ++
>  samples/warpdrive/test/test_hisi_zip.c | 206 +++++++++++++++++
>  samples/warpdrive/wd.c                 | 309 +++++++++++++++++++++++++
>  samples/warpdrive/wd.h                 | 154 ++++++++++++
>  samples/warpdrive/wd_adapter.c         |  74 ++++++
>  samples/warpdrive/wd_adapter.h         |  43 ++++
>  17 files changed, 1205 insertions(+)
>  create mode 100644 samples/warpdrive/AUTHORS
>  create mode 100644 samples/warpdrive/ChangeLog
>  create mode 100644 samples/warpdrive/Makefile.am
>  create mode 100644 samples/warpdrive/NEWS
>  create mode 100644 samples/warpdrive/README
>  create mode 100755 samples/warpdrive/autogen.sh
>  create mode 100755 samples/warpdrive/cleanup.sh
>  create mode 100644 samples/warpdrive/configure.ac
>  create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
>  create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
>  create mode 100644 samples/warpdrive/test/Makefile.am
>  create mode 100644 samples/warpdrive/test/comp_hw.h
>  create mode 100644 samples/warpdrive/test/test_hisi_zip.c
>  create mode 100644 samples/warpdrive/wd.c
>  create mode 100644 samples/warpdrive/wd.h
>  create mode 100644 samples/warpdrive/wd_adapter.c
>  create mode 100644 samples/warpdrive/wd_adapter.h

> diff --git a/samples/warpdrive/README b/samples/warpdrive/README
> new file mode 100644
> index 000000000000..3adf66b112fc
> --- /dev/null
> +++ b/samples/warpdrive/README
> @@ -0,0 +1,32 @@
> +WD User Land Demonstration
> +==========================
> +
> +This directory contains some applications and libraries to demonstrate how a
> +
> +WrapDrive application can be constructed.

   WarpDrive

> +
> +
> +As a demo, we try to make it simple and clear for understanding. It is not
> +
> +supposed to be used in business scenario.
> +
> +
> +The directory contains the following elements:
> +
> +wd.[ch]
> +	A demonstration WrapDrive fundamental library which wraps the basic

	                WarpDrive

> +	operations to the WrapDrive-ed device.

                          WarpDrive

> +
> +wd_adapter.[ch]
> +	User driver adaptor for wd.[ch]
> +
> +wd_utils.[ch]
> +	Some utitlities function used by WD and its drivers
> +
> +drv/*
> +	User drivers. It helps to fulfill the semantic of wd.[ch] for
> +	particular hardware
> +
> +test/*
> +	Test applications to use the wrapdrive library

	                             warpdrive

-- 
~Randy

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (6 preceding siblings ...)
  2018-09-03  0:52 ` [PATCH 7/7] vfio/sdmdev: add user sample Kenneth Lee
@ 2018-09-03  2:32 ` Lu Baolu
  2018-09-06  9:11   ` Kenneth Lee
  2018-09-04 15:00 ` Jerome Glisse
  2018-09-17  1:42 ` Jerome Glisse
  9 siblings, 1 reply; 58+ messages in thread
From: Lu Baolu @ 2018-09-03  2:32 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Sanjay Kumar
  Cc: baolu.lu, linuxarm

Hi,

On 09/03/2018 08:51 AM, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>
> 
> WarpDrive is an accelerator framework to expose the hardware capabilities
> directly to the user space. It makes use of the exist vfio and vfio-mdev
> facilities. So the user application can send request and DMA to the
> hardware without interaction with the kernel. This removes the latency
> of syscall.
> 
> WarpDrive is the name for the whole framework. The component in kernel
> is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> library of WarpDrive can access it via VFIO interface.
> 
> The patchset contains document for the detail. Please refer to it for more
> information.
> 
> This patchset is intended to be used with Jean Philippe Brucker's SVA
> patch [1], which enables not only IO side page fault, but also PASID
> support to IOMMU and VFIO.
> 
> With these features, WarpDrive can support non-pinned memory and
> multi-process in the same accelerator device.  We tested it in our SoC
> integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> tree can be found here: [2].
> 
> But it is not mandatory. This patchset is tested in the latest mainline
> kernel without the SVA patches.  So it supports only one process for each
> accelerator.
> 
> We have noticed the IOMMU aware mdev RFC announced recently [3].
> 
> The IOMMU aware mdev has similar idea but different intention comparing to
> WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> And the design is supposed to be used with Scalable I/O Virtualization.
> While sdmdev is intended to share the hardware resource with a big amount
> of processes.  It just requires the hardware supporting address
> translation per process (PCIE's PASID or ARM SMMU's substream ID).
> 
> But we don't see serious confliction on both design. We believe they can be
> normalized as one.
> 
> The patch 1 is document of the framework. The patch 2 and 3 add sdmdev
> support. The patch 4, 5 and 6 is drivers for Hislicon's ZIP Accelerator
> which is registered to both crypto and warpdrive(sdmdev) and can be
> used from kernel or user space at the same time. The patch 7 is a user
> space sample demonstrating how WarpDrive works.
> 
> 
> Change History:
> V2 changed from V1:
> 	1. Change kernel framework name from SPIMDEV (Share Parent IOMMU
> 	   Mdev) to SDMDEV (Share Domain Mdev).
> 	2. Allocate Hardware Resource when a new mdev is created (While
> 	   it is allocated when the mdev is openned)
> 	3. Unmap pages from the shared domain when the sdmdev iommu group is
> 	   detached. (This procedure is necessary, but missed in V1)
> 	4. Update document accordingly.
> 	5. Rebase to the latest kernel (4.19.0-rc1)
> 	
> 	According the review comment on RFCv1, We did try to use dma-buf
> 	as back end of WarpDrive. It can work properly with the current
> 	solution [4], but it cannot make use of process's
> 	own memory address space directly. This is important to many
> 	acceleration scenario. So dma-buf will be taken as a backup
> 	alternative for noiommu scenario, it will be added in the future
> 	version.
> 
> 
> Refernces:
> [1] https://www.spinics.net/lists/kernel/msg2651481.html
> [2] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-sva-v0.5
> [3] https://lkml.org/lkml/2018/7/22/34

Please refer to the latest version posted here for discussion.

https://lkml.org/lkml/2018/8/30/107

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
  2018-09-03  2:11   ` Randy Dunlap
@ 2018-09-03  2:55   ` Lu Baolu
  2018-09-06  9:01     ` Kenneth Lee
  2018-09-04 15:31   ` [RFC PATCH] vfio: vfio_sdmdev_groups[] can be static kbuild test robot
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 58+ messages in thread
From: Lu Baolu @ 2018-09-03  2:55 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Sanjay Kumar
  Cc: baolu.lu, linuxarm

Hi,

On 09/03/2018 08:52 AM, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>
> 
> SDMDEV is "Share Domain Mdev". It is a vfio-mdev. But differ from
> the general vfio-mdev, it shares its parent's IOMMU. If Multi-PASID
> support is enabled in the IOMMU (not yet in the current kernel HEAD),
> multiple process can share the IOMMU by different PASID. If it is not
> support, only one process can share the IOMMU with the kernel driver.
> 

If this is only for the purpose of sharing a domain, I don't think it's
necessary to create a new device type.

> Currently only the vfio type-1 driver is updated to make it to be aware
> of.
> 
> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
> Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>   drivers/vfio/Kconfig              |   1 +
>   drivers/vfio/Makefile             |   1 +
>   drivers/vfio/sdmdev/Kconfig       |  10 +
>   drivers/vfio/sdmdev/Makefile      |   3 +
>   drivers/vfio/sdmdev/vfio_sdmdev.c | 363 ++++++++++++++++++++++++++++++
>   drivers/vfio/vfio_iommu_type1.c   | 151 ++++++++++++-
>   include/linux/vfio_sdmdev.h       |  96 ++++++++
>   include/uapi/linux/vfio_sdmdev.h  |  29 +++
>   8 files changed, 648 insertions(+), 6 deletions(-)
>   create mode 100644 drivers/vfio/sdmdev/Kconfig
>   create mode 100644 drivers/vfio/sdmdev/Makefile
>   create mode 100644 drivers/vfio/sdmdev/vfio_sdmdev.c
>   create mode 100644 include/linux/vfio_sdmdev.h
>   create mode 100644 include/uapi/linux/vfio_sdmdev.h
> 

[--cut for short --]

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d9fd3188615d..ba73231d8692 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -41,6 +41,7 @@
>   #include <linux/notifier.h>
>   #include <linux/dma-iommu.h>
>   #include <linux/irqdomain.h>
> +#include <linux/vfio_sdmdev.h>
>   
>   #define DRIVER_VERSION  "0.2"
>   #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -89,6 +90,8 @@ struct vfio_dma {
>   };
>   
>   struct vfio_group {
> +	/* iommu_group of mdev's parent device */
> +	struct iommu_group	*parent_group;
>   	struct iommu_group	*iommu_group;
>   	struct list_head	next;
>   };
> @@ -1327,6 +1330,109 @@ static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)
>   	return ret;
>   }
>   
> +/* return 0 if the device is not sdmdev.
> + * return 1 if the device is sdmdev, the data will be updated with parent
> + *	device's group.
> + * return -errno if other error.
> + */
> +static int vfio_sdmdev_type(struct device *dev, void *data)
> +{
> +	struct iommu_group **group = data;
> +	struct iommu_group *pgroup;
> +	int (*_is_sdmdev)(struct device *dev);
> +	struct device *pdev;
> +	int ret = 1;
> +
> +	/* vfio_sdmdev module is not configurated */
> +	_is_sdmdev = symbol_get(vfio_sdmdev_is_sdmdev);
> +	if (!_is_sdmdev)
> +		return 0;
> +
> +	/* check if it belongs to vfio_sdmdev device */
> +	if (!_is_sdmdev(dev)) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	pdev = dev->parent;
> +	pgroup = iommu_group_get(pdev);
> +	if (!pgroup) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	if (group) {
> +		/* check if all parent devices is the same */
> +		if (*group && *group != pgroup)
> +			ret = -ENODEV;
> +		else
> +			*group = pgroup;
> +	}
> +
> +	iommu_group_put(pgroup);
> +
> +out:
> +	symbol_put(vfio_sdmdev_is_sdmdev);
> +
> +	return ret;
> +}
> +
> +/* return 0 or -errno */
> +static int vfio_sdmdev_bus(struct device *dev, void *data)
> +{
> +	struct bus_type **bus = data;
> +
> +	if (!dev->bus)
> +		return -ENODEV;
> +
> +	/* ensure all devices has the same bus_type */
> +	if (*bus && *bus != dev->bus)
> +		return -EINVAL;
> +
> +	*bus = dev->bus;
> +	return 0;
> +}
> +
> +/* return 0 means it is not sd group, 1 means it is, or -EXXX for error */
> +static int vfio_iommu_type1_attach_sdgroup(struct vfio_domain *domain,
> +					    struct vfio_group *group,
> +					    struct iommu_group *iommu_group)
> +{
> +	int ret;
> +	struct bus_type *pbus = NULL;
> +	struct iommu_group *pgroup = NULL;
> +
> +	ret = iommu_group_for_each_dev(iommu_group, &pgroup,
> +				       vfio_sdmdev_type);
> +	if (ret < 0)
> +		goto out;
> +	else if (ret > 0) {
> +		domain->domain = iommu_group_share_domain(pgroup);
> +		if (IS_ERR(domain->domain))
> +			goto out;
> +		ret = iommu_group_for_each_dev(pgroup, &pbus,
> +				       vfio_sdmdev_bus);
> +		if (ret < 0)
> +			goto err_with_share_domain;
> +
> +		if (pbus && iommu_capable(pbus, IOMMU_CAP_CACHE_COHERENCY))
> +			domain->prot |= IOMMU_CACHE;
> +
> +		group->parent_group = pgroup;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +
> +		return 1;
> +	}

This doesn't match the function name. It only gets the domain from the
parent device. It hasn't been really attached.

> +
> +	return 0;
> +
> +err_with_share_domain:
> +	iommu_group_unshare_domain(pgroup);
> +out:
> +	return ret;
> +}
> +
>   static int vfio_iommu_type1_attach_group(void *iommu_data,
>   					 struct iommu_group *iommu_group)
>   {
> @@ -1335,8 +1441,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>   	struct vfio_domain *domain, *d;
>   	struct bus_type *bus = NULL, *mdev_bus;
>   	int ret;
> -	bool resv_msi, msi_remap;
> -	phys_addr_t resv_msi_base;
> +	bool resv_msi = false, msi_remap;
> +	phys_addr_t resv_msi_base = 0;
>   
>   	mutex_lock(&iommu->lock);
>   
> @@ -1373,6 +1479,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>   	if (mdev_bus) {
>   		if ((bus == mdev_bus) && !iommu_present(bus)) {
>   			symbol_put(mdev_bus_type);
> +
> +			ret = vfio_iommu_type1_attach_sdgroup(domain, group,
> +					iommu_group);
> +			if (ret < 0)
> +				goto out_free;
> +			else if (ret > 0)
> +				goto replay_check;

Here you get the domain from the parent device and save it for later
use. The actual attaching is ignored.

I don't think this follows the philosophy of this function. It actually
makes all devices in the group with the same bus type share a single
domain.

Furthermore, the parent domain might be a domain of type
IOMMU_DOMAIN_DMA. Such a domain cannot be used as an
IOMMU_DOMAIN_UNMANAGED domain with the iommu APIs.

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (7 preceding siblings ...)
  2018-09-03  2:32 ` [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Lu Baolu
@ 2018-09-04 15:00 ` Jerome Glisse
  2018-09-04 16:15   ` Alex Williamson
  2018-09-17  1:42 ` Jerome Glisse
  9 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-04 15:00 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>
> 
> WarpDrive is an accelerator framework to expose the hardware capabilities
> directly to the user space. It makes use of the exist vfio and vfio-mdev
> facilities. So the user application can send request and DMA to the
> hardware without interaction with the kernel. This removes the latency
> of syscall.
> 
> WarpDrive is the name for the whole framework. The component in kernel
> is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> library of WarpDrive can access it via VFIO interface.
> 
> The patchset contains document for the detail. Please refer to it for more
> information.
> 
> This patchset is intended to be used with Jean Philippe Brucker's SVA
> patch [1], which enables not only IO side page fault, but also PASID
> support to IOMMU and VFIO.
> 
> With these features, WarpDrive can support non-pinned memory and
> multi-process in the same accelerator device.  We tested it in our SoC
> integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> tree can be found here: [2].
> 
> But it is not mandatory. This patchset is tested in the latest mainline
> kernel without the SVA patches.  So it supports only one process for each
> accelerator.
> 
> We have noticed the IOMMU aware mdev RFC announced recently [3].
> 
> The IOMMU aware mdev has similar idea but different intention comparing to
> WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> And the design is supposed to be used with Scalable I/O Virtualization.
> While sdmdev is intended to share the hardware resource with a big amount
> of processes.  It just requires the hardware supporting address
> translation per process (PCIE's PASID or ARM SMMU's substream ID).
> 
> But we don't see serious confliction on both design. We believe they can be
> normalized as one.
> 

So once again I do not understand why you are trying to do things
this way. The kernel already has tons of examples of everything you
want to do without a new framework. Moreover, I believe you are
confused by VFIO. To me VFIO is for VMs, not for creating a general
device driver framework.

So here is your use case as I understand it. You have a device
with a limited number of command queues (can be just one) and in
some cases it can support SVA/SVM (when the hardware supports it and
it is not disabled). The final requirement is being able to schedule
commands from userspace without an ioctl. All of this already exists
upstream in a few device drivers.


So here is how everybody else is doing it. Please explain why
this does not work.

1 Userspace opens the device file. The kernel device driver creates
  a context and associates it with that open. This context can be
  unique to the process and can bind hardware resources (like a
  command queue) to the process.
2 Userspace binds/acquires a command queue and initializes it with
  an ioctl on the device file. Through that ioctl userspace can
  be informed whether SVA/SVM works for the device. If SVA/SVM
  works, the kernel device driver binds the process to the
  device as part of this ioctl.
3 If SVM/SVA does not work, userspace does an ioctl to create a DMA
  buffer or something that does exactly the same thing.
4 Userspace mmaps the command queue (mmap of the device file,
  using information gathered at step 2).
5 Userspace can write commands into the queue it mapped.
6 When userspace closes the device file, all resources are released
  just like in any existing device driver.
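
Steps 4 to 6 above can be sketched in userspace C. This is only a
minimal sketch under stated assumptions, not the ABI of any existing
driver: the ring layout (struct cmd_ring, RING_SLOTS) and the helpers
ring_map(), ring_push() and ring_unmap() are hypothetical names, the
ioctl of step 2 is left out because it is driver specific, and the
__atomic builtins assume GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define RING_SLOTS 8u	/* hypothetical queue depth */

/* Hypothetical layout of a command queue shared with the device. */
struct cmd_ring {
	uint32_t head;			/* next slot the device consumes */
	uint32_t tail;			/* next slot userspace fills */
	uint64_t slot[RING_SLOTS];	/* opaque command words */
};

/* Step 4: map the queue that the device file exposes at 'offset'. */
struct cmd_ring *ring_map(int fd, off_t offset)
{
	void *p = mmap(NULL, sizeof(struct cmd_ring), PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, offset);

	return p == MAP_FAILED ? NULL : p;
}

/* Step 5: queue one command without a syscall; -1 means the ring is full. */
int ring_push(struct cmd_ring *r, uint64_t cmd)
{
	uint32_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
	uint32_t tail = r->tail;

	if (tail - head == RING_SLOTS)
		return -1;
	r->slot[tail % RING_SLOTS] = cmd;
	/* Publish the slot contents before the consumer can see the new tail. */
	__atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
	return 0;
}

/* Step 6: drop the mapping; close() on the fd releases everything else. */
void ring_unmap(struct cmd_ring *r)
{
	munmap(r, sizeof(*r));
}
```

A real device would also need a doorbell write or similar to notice the
new tail; that part is hardware specific and is not shown here.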

Now if you want to create a device driver framework that exposes
a device file with a generic API for all of the above steps, fine.
But it does not need to be part of VFIO whatsoever; if you think
it does, please explain why.


Note that if the IOMMU is fully disabled you probably want to block
userspace from directly scheduling commands onto the hardware, as
that would allow userspace to DMA anywhere and thus would open the
kernel to easy exploits. In this case you can still keep the same
API as above and use page fault tricks to validate commands written
by userspace into a fake command ring. This will be as slow as, or
maybe even slower than, an ioctl, but at least it allows you to
validate commands.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH] vfio: vfio_sdmdev_groups[] can be static
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
  2018-09-03  2:11   ` Randy Dunlap
  2018-09-03  2:55   ` Lu Baolu
@ 2018-09-04 15:31   ` kbuild test robot
  2018-09-04 15:32   ` [PATCH 3/7] vfio: add sdmdev support kbuild test robot
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2018-09-04 15:31 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: kbuild-all, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm


Fixes: 1e47d5e60865 ("vfio: add sdmdev support")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
---
 vfio_sdmdev.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/sdmdev/vfio_sdmdev.c b/drivers/vfio/sdmdev/vfio_sdmdev.c
index c6eb5d4..e7d3c23 100644
--- a/drivers/vfio/sdmdev/vfio_sdmdev.c
+++ b/drivers/vfio/sdmdev/vfio_sdmdev.c
@@ -103,7 +103,7 @@ static const struct attribute_group vfio_sdmdev_group = {
 	.name  = VFIO_SDMDEV_PDEV_ATTRS_GRP_NAME,
 	.attrs = vfio_sdmdev_attrs,
 };
-const struct attribute_group *vfio_sdmdev_groups[] = {
+static const struct attribute_group *vfio_sdmdev_groups[] = {
 	&vfio_sdmdev_group,
 	NULL,
 };

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
                     ` (2 preceding siblings ...)
  2018-09-04 15:31   ` [RFC PATCH] vfio: vfio_sdmdev_groups[] can be static kbuild test robot
@ 2018-09-04 15:32   ` kbuild test robot
  2018-09-04 15:32   ` kbuild test robot
  2018-09-05  7:27   ` Dan Carpenter
  5 siblings, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2018-09-04 15:32 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: kbuild-all, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

Hi Kenneth,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cryptodev/master]
[also build test WARNING on v4.19-rc2 next-20180831]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kenneth-Lee/A-General-Accelerator-Framework-WarpDrive/20180903-162733
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__
:::::: branch date: 3 hours ago
:::::: commit date: 3 hours ago

>> drivers/vfio/sdmdev/vfio_sdmdev.c:106:30: sparse: symbol 'vfio_sdmdev_groups' was not declared. Should it be static?
   drivers/vfio/sdmdev/vfio_sdmdev.c: In function 'vfio_sdmdev_mdev_remove':
   drivers/vfio/sdmdev/vfio_sdmdev.c:178:2: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
     if (sdmdev->ops->put_queue);
     ^~
   drivers/vfio/sdmdev/vfio_sdmdev.c:179:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
      sdmdev->ops->put_queue(q);
      ^~~~~~
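
The -Wmisleading-indentation warning above points at a stray semicolon
after the 'if' condition: the 'if' then guards an empty statement and
the call below it always runs, even when the callback pointer is NULL.
The following is a minimal standalone sketch of the guarded form, not
the actual driver code; struct queue, struct queue_ops, demo_put_queue()
and remove_queue() are hypothetical stand-ins.

```c
#include <stddef.h>

struct queue {
	int released;	/* counts how often the callback ran */
};

struct queue_ops {
	void (*put_queue)(struct queue *q);	/* optional, may be NULL */
};

/* Hypothetical stand-in for a driver's put_queue implementation. */
void demo_put_queue(struct queue *q)
{
	q->released++;
}

/*
 * With "if (ops->put_queue);" the trailing semicolon ends the if, and
 * the call on the next line is unconditional. Without the semicolon
 * the call is only made when the callback is set.
 */
void remove_queue(const struct queue_ops *ops, struct queue *q)
{
	if (ops->put_queue)
		ops->put_queue(q);
}
```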

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
                     ` (3 preceding siblings ...)
  2018-09-04 15:32   ` [PATCH 3/7] vfio: add sdmdev support kbuild test robot
@ 2018-09-04 15:32   ` kbuild test robot
  2018-09-05  7:27   ` Dan Carpenter
  5 siblings, 0 replies; 58+ messages in thread
From: kbuild test robot @ 2018-09-04 15:32 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: kbuild-all, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

[-- Attachment #1: Type: text/plain, Size: 2563 bytes --]

Hi Kenneth,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cryptodev/master]
[also build test WARNING on v4.19-rc2 next-20180831]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kenneth-Lee/A-General-Accelerator-Framework-WarpDrive/20180903-162733
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 
:::::: branch date: 2 hours ago
:::::: commit date: 2 hours ago

All warnings (new ones prefixed by >>):

   drivers/vfio/sdmdev/vfio_sdmdev.c: In function 'vfio_sdmdev_mdev_remove':
>> drivers/vfio/sdmdev/vfio_sdmdev.c:178:2: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
     if (sdmdev->ops->put_queue);
     ^~
   drivers/vfio/sdmdev/vfio_sdmdev.c:179:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
      sdmdev->ops->put_queue(q);
      ^~~~~~

# https://github.com/0day-ci/linux/commit/1e47d5e608652b4a2c813dbeaf5aa6811f6ceaf7
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 1e47d5e608652b4a2c813dbeaf5aa6811f6ceaf7
vim +/if +178 drivers/vfio/sdmdev/vfio_sdmdev.c

1e47d5e6 Kenneth Lee 2018-09-03  168  
1e47d5e6 Kenneth Lee 2018-09-03  169  static int vfio_sdmdev_mdev_remove(struct mdev_device *mdev)
1e47d5e6 Kenneth Lee 2018-09-03  170  {
1e47d5e6 Kenneth Lee 2018-09-03  171  	struct vfio_sdmdev_queue *q =
1e47d5e6 Kenneth Lee 2018-09-03  172  		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  173  	struct vfio_sdmdev *sdmdev = q->sdmdev;
1e47d5e6 Kenneth Lee 2018-09-03  174  	struct device *pdev = mdev_parent_dev(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  175  
1e47d5e6 Kenneth Lee 2018-09-03  176  	put_device(pdev);
1e47d5e6 Kenneth Lee 2018-09-03  177  
1e47d5e6 Kenneth Lee 2018-09-03 @178  	if (sdmdev->ops->put_queue);
1e47d5e6 Kenneth Lee 2018-09-03  179  		sdmdev->ops->put_queue(q);
1e47d5e6 Kenneth Lee 2018-09-03  180  
1e47d5e6 Kenneth Lee 2018-09-03  181  	return 0;
1e47d5e6 Kenneth Lee 2018-09-03  182  }
1e47d5e6 Kenneth Lee 2018-09-03  183  
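
The warning at line 178 comes from the stray semicolon after the condition:
the if guards an empty statement, so put_queue is called unconditionally, even
when the pointer is NULL. A minimal userspace model of the corrected check
follows. It assumes simplified stand-in structures: struct queue, struct ops
and remove_queue() are hypothetical names and are not part of the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for vfio_sdmdev_queue and vfio_sdmdev_ops. */
struct queue {
	int released;
};

struct ops {
	void (*put_queue)(struct queue *q);	/* optional callback, may be NULL */
};

static void mark_released(struct queue *q)
{
	q->released = 1;
}

/* Corrected form: no semicolon after the condition, so the call is guarded. */
static int remove_queue(const struct ops *ops, struct queue *q)
{
	if (ops->put_queue)
		ops->put_queue(q);

	return 0;
}
```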

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 64103 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-04 15:00 ` Jerome Glisse
@ 2018-09-04 16:15   ` Alex Williamson
  2018-09-06  9:45     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Alex Williamson @ 2018-09-04 16:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Tue, 4 Sep 2018 11:00:19 -0400
Jerome Glisse <jglisse@redhat.com> wrote:

> On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > From: Kenneth Lee <liguozhu@hisilicon.com>
> > 
> > WarpDrive is an accelerator framework to expose the hardware capabilities
> > directly to the user space. It makes use of the existing vfio and vfio-mdev
> > facilities, so the user application can send requests and DMA to the
> > hardware without interaction with the kernel. This removes the latency
> > of syscalls.
> > 
> > WarpDrive is the name for the whole framework. The component in the kernel
> > is called SDMDEV, Share Domain Mediated Device. A device driver exposes its
> > hardware resource by registering to SDMDEV as a VFIO-Mdev, so the user
> > library of WarpDrive can access it via the VFIO interface.
> > 
> > The patchset contains documentation with the details. Please refer to it
> > for more information.
> > 
> > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > patch [1], which enables not only IO side page fault, but also PASID
> > support to IOMMU and VFIO.
> > 
> > With these features, WarpDrive can support non-pinned memory and
> > multi-process in the same accelerator device.  We tested it in our SoC
> > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > tree can be found here: [2].
> > 
> > But it is not mandatory. This patchset is tested in the latest mainline
> > kernel without the SVA patches.  So it supports only one process for each
> > accelerator.
> > 
> > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > 
> > The IOMMU aware mdev has a similar idea but a different intention compared
> > to WarpDrive. It intends to dedicate part of the hardware resource to a VM,
> > and the design is supposed to be used with Scalable I/O Virtualization,
> > while sdmdev is intended to share the hardware resource among a large number
> > of processes.  It just requires the hardware to support address
> > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > 
> > But we don't see a serious conflict between the two designs. We believe
> > they can be normalized as one.
> >   
> 
> So once again I do not understand why you are trying to do things
> this way. The kernel already has tons of examples of everything you
> want to do without a new framework. Moreover, I believe you are
> confused by VFIO. To me, VFIO is for VMs, not for creating a general
> device driver framework.

VFIO is a userspace driver framework; the VM use case just happens to
be a rather prolific one.  VFIO was never intended to be solely a VM
device interface and has several other userspace users, notably DPDK
and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
wrapper, and perhaps others that I'm not aware of.  Whether vfio is
the appropriate interface here might certainly still be a debatable
topic, but I would strongly disagree with your last sentence above.  Thanks,

Alex

> So here is your use case as I understand it. You have a device
> with a limited number of command queues (can be just one) and in
> some cases it can support SVA/SVM (when the hardware supports it and
> it is not disabled). The final requirement is being able to schedule
> cmds from userspace without ioctl. All of this already exists
> upstream in a few device drivers.
> 
> 
> So here is how everybody else is doing it. Please explain why
> this does not work.
> 
> 1 Userspace opens the device file. The kernel device driver creates
>   a context and associates it with the open. This context can be
>   unique to the process and can bind hardware resources (like a
>   command queue) to the process.
> 2 Userspace binds/acquires a command queue and initializes it with
>   an ioctl on the device file. Through that ioctl userspace can
>   be informed whether SVA/SVM works for the device. If SVA/SVM
>   works, the kernel device driver binds the process to the
>   device as part of this ioctl.
> 3 If SVM/SVA does not work, userspace does an ioctl to create a dma
>   buffer or something that does exactly the same thing.
> 4 Userspace mmaps the command queue (mmap of the device file,
>   using the information gathered at step 2).
> 5 Userspace can write commands into the queue it mapped.
> 6 When userspace closes the device file, all resources are released,
>   just like in any existing device driver.
> 
> Now if you want to create a device driver framework that exposes
> a device file with a generic API for all of the above steps, fine.
> But it does not need to be part of VFIO whatsoever; otherwise explain
> why.
> 
> 
> Note that if the IOMMU is fully disabled you probably want to block
> userspace from being able to directly schedule commands onto
> the hardware, as it would allow userspace to DMA anywhere and thus
> would open the kernel to easy exploits. In this case you can still
> keep the same API as above and use page fault tricks to validate
> commands written by userspace into a fake command ring. This will
> be as slow or maybe even slower than ioctl but at least it allows
> you to validate commands.
> 
> Cheers,
> Jérôme


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
                     ` (4 preceding siblings ...)
  2018-09-04 15:32   ` kbuild test robot
@ 2018-09-05  7:27   ` Dan Carpenter
  5 siblings, 0 replies; 58+ messages in thread
From: Dan Carpenter @ 2018-09-05  7:27 UTC (permalink / raw)
  To: kbuild, Kenneth Lee
  Cc: kbuild-all, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

Hi Kenneth,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cryptodev/master]
[also build test WARNING on v4.19-rc2 next-20180905]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kenneth-Lee/A-General-Accelerator-Framework-WarpDrive/20180903-162733
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master

smatch warnings:
drivers/vfio/sdmdev/vfio_sdmdev.c:78 iommu_type_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:91 dma_flag_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:127 flags_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:128 name_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:130 device_api_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:138 available_instances_show() error: 'sdmdev' dereferencing possible ERR_PTR()
drivers/vfio/sdmdev/vfio_sdmdev.c:178 vfio_sdmdev_mdev_remove() warn: if();

# https://github.com/0day-ci/linux/commit/1e47d5e608652b4a2c813dbeaf5aa6811f6ceaf7
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 1e47d5e608652b4a2c813dbeaf5aa6811f6ceaf7
vim +/sdmdev +78 drivers/vfio/sdmdev/vfio_sdmdev.c

1e47d5e6 Kenneth Lee 2018-09-03   69  
1e47d5e6 Kenneth Lee 2018-09-03   70  static ssize_t iommu_type_show(struct device *dev,
1e47d5e6 Kenneth Lee 2018-09-03   71  			       struct device_attribute *attr, char *buf)
1e47d5e6 Kenneth Lee 2018-09-03   72  {
1e47d5e6 Kenneth Lee 2018-09-03   73  	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
                                                                     ^^^^^^^^^^^^^^^^^^^^^^^
Presumably this returns error pointers instead of NULL?

1e47d5e6 Kenneth Lee 2018-09-03   74  
1e47d5e6 Kenneth Lee 2018-09-03   75  	if (!sdmdev)
1e47d5e6 Kenneth Lee 2018-09-03   76  		return -ENODEV;
1e47d5e6 Kenneth Lee 2018-09-03   77  
1e47d5e6 Kenneth Lee 2018-09-03  @78  	return sprintf(buf, "%d\n", sdmdev->iommu_type);
1e47d5e6 Kenneth Lee 2018-09-03   79  }
1e47d5e6 Kenneth Lee 2018-09-03   80  
1e47d5e6 Kenneth Lee 2018-09-03   81  static DEVICE_ATTR_RO(iommu_type);
1e47d5e6 Kenneth Lee 2018-09-03   82  
1e47d5e6 Kenneth Lee 2018-09-03   83  static ssize_t dma_flag_show(struct device *dev,
1e47d5e6 Kenneth Lee 2018-09-03   84  			     struct device_attribute *attr, char *buf)
1e47d5e6 Kenneth Lee 2018-09-03   85  {
1e47d5e6 Kenneth Lee 2018-09-03   86  	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
1e47d5e6 Kenneth Lee 2018-09-03   87  
1e47d5e6 Kenneth Lee 2018-09-03   88  	if (!sdmdev)
1e47d5e6 Kenneth Lee 2018-09-03   89  		return -ENODEV;
1e47d5e6 Kenneth Lee 2018-09-03   90  
1e47d5e6 Kenneth Lee 2018-09-03  @91  	return sprintf(buf, "%d\n", sdmdev->dma_flag);
1e47d5e6 Kenneth Lee 2018-09-03   92  }
1e47d5e6 Kenneth Lee 2018-09-03   93  
1e47d5e6 Kenneth Lee 2018-09-03   94  static DEVICE_ATTR_RO(dma_flag);
1e47d5e6 Kenneth Lee 2018-09-03   95  
1e47d5e6 Kenneth Lee 2018-09-03   96  /* mdev->dev_attr_groups */
1e47d5e6 Kenneth Lee 2018-09-03   97  static struct attribute *vfio_sdmdev_attrs[] = {
1e47d5e6 Kenneth Lee 2018-09-03   98  	&dev_attr_iommu_type.attr,
1e47d5e6 Kenneth Lee 2018-09-03   99  	&dev_attr_dma_flag.attr,
1e47d5e6 Kenneth Lee 2018-09-03  100  	NULL,
1e47d5e6 Kenneth Lee 2018-09-03  101  };
1e47d5e6 Kenneth Lee 2018-09-03  102  static const struct attribute_group vfio_sdmdev_group = {
1e47d5e6 Kenneth Lee 2018-09-03  103  	.name  = VFIO_SDMDEV_PDEV_ATTRS_GRP_NAME,
1e47d5e6 Kenneth Lee 2018-09-03  104  	.attrs = vfio_sdmdev_attrs,
1e47d5e6 Kenneth Lee 2018-09-03  105  };
1e47d5e6 Kenneth Lee 2018-09-03  106  const struct attribute_group *vfio_sdmdev_groups[] = {
1e47d5e6 Kenneth Lee 2018-09-03  107  	&vfio_sdmdev_group,
1e47d5e6 Kenneth Lee 2018-09-03  108  	NULL,
1e47d5e6 Kenneth Lee 2018-09-03  109  };
1e47d5e6 Kenneth Lee 2018-09-03  110  
1e47d5e6 Kenneth Lee 2018-09-03  111  /* default attributes for mdev->supported_type_groups, used by registerer*/
1e47d5e6 Kenneth Lee 2018-09-03  112  #define MDEV_TYPE_ATTR_RO_EXPORT(name) \
1e47d5e6 Kenneth Lee 2018-09-03  113  		MDEV_TYPE_ATTR_RO(name); \
1e47d5e6 Kenneth Lee 2018-09-03  114  		EXPORT_SYMBOL_GPL(mdev_type_attr_##name);
1e47d5e6 Kenneth Lee 2018-09-03  115  
1e47d5e6 Kenneth Lee 2018-09-03  116  #define DEF_SIMPLE_SDMDEV_ATTR(_name, sdmdev_member, format) \
1e47d5e6 Kenneth Lee 2018-09-03  117  static ssize_t _name##_show(struct kobject *kobj, struct device *dev, \
1e47d5e6 Kenneth Lee 2018-09-03  118  			    char *buf) \
1e47d5e6 Kenneth Lee 2018-09-03  119  { \
1e47d5e6 Kenneth Lee 2018-09-03  120  	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev); \
1e47d5e6 Kenneth Lee 2018-09-03  121  	if (!sdmdev) \
1e47d5e6 Kenneth Lee 2018-09-03  122  		return -ENODEV; \
1e47d5e6 Kenneth Lee 2018-09-03  123  	return sprintf(buf, format, sdmdev->sdmdev_member); \
1e47d5e6 Kenneth Lee 2018-09-03  124  } \
1e47d5e6 Kenneth Lee 2018-09-03  125  MDEV_TYPE_ATTR_RO_EXPORT(_name)
1e47d5e6 Kenneth Lee 2018-09-03  126  
1e47d5e6 Kenneth Lee 2018-09-03 @127  DEF_SIMPLE_SDMDEV_ATTR(flags, flags, "%d");
1e47d5e6 Kenneth Lee 2018-09-03 @128  DEF_SIMPLE_SDMDEV_ATTR(name, name, "%s"); /* this should be algorithm name, */
1e47d5e6 Kenneth Lee 2018-09-03  129  		/* but you would not care if you have only one algorithm */
1e47d5e6 Kenneth Lee 2018-09-03 @130  DEF_SIMPLE_SDMDEV_ATTR(device_api, api_ver, "%s");
1e47d5e6 Kenneth Lee 2018-09-03  131  
1e47d5e6 Kenneth Lee 2018-09-03  132  static ssize_t
1e47d5e6 Kenneth Lee 2018-09-03  133  available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
1e47d5e6 Kenneth Lee 2018-09-03  134  {
1e47d5e6 Kenneth Lee 2018-09-03  135  	struct vfio_sdmdev *sdmdev = vfio_sdmdev_pdev_sdmdev(dev);
1e47d5e6 Kenneth Lee 2018-09-03  136  	int nr_inst = 0;
1e47d5e6 Kenneth Lee 2018-09-03  137  
1e47d5e6 Kenneth Lee 2018-09-03 @138  	nr_inst = sdmdev->ops->get_available_instances ?
1e47d5e6 Kenneth Lee 2018-09-03  139  		sdmdev->ops->get_available_instances(sdmdev) : 0;
1e47d5e6 Kenneth Lee 2018-09-03  140  	return sprintf(buf, "%d", nr_inst);
1e47d5e6 Kenneth Lee 2018-09-03  141  }
1e47d5e6 Kenneth Lee 2018-09-03  142  MDEV_TYPE_ATTR_RO_EXPORT(available_instances);
1e47d5e6 Kenneth Lee 2018-09-03  143  
1e47d5e6 Kenneth Lee 2018-09-03  144  static int vfio_sdmdev_mdev_create(struct kobject *kobj,
1e47d5e6 Kenneth Lee 2018-09-03  145  	struct mdev_device *mdev)
1e47d5e6 Kenneth Lee 2018-09-03  146  {
1e47d5e6 Kenneth Lee 2018-09-03  147  	struct device *pdev = mdev_parent_dev(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  148  	struct vfio_sdmdev_queue *q;
1e47d5e6 Kenneth Lee 2018-09-03  149  	struct vfio_sdmdev *sdmdev = mdev_sdmdev(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  150  	int ret;
1e47d5e6 Kenneth Lee 2018-09-03  151  
1e47d5e6 Kenneth Lee 2018-09-03  152  	if (!sdmdev->ops->get_queue)
1e47d5e6 Kenneth Lee 2018-09-03  153  		return -ENODEV;
1e47d5e6 Kenneth Lee 2018-09-03  154  
1e47d5e6 Kenneth Lee 2018-09-03  155  	ret = sdmdev->ops->get_queue(sdmdev, &q);
1e47d5e6 Kenneth Lee 2018-09-03  156  	if (ret)
1e47d5e6 Kenneth Lee 2018-09-03  157  		return ret;
1e47d5e6 Kenneth Lee 2018-09-03  158  
1e47d5e6 Kenneth Lee 2018-09-03  159  	q->sdmdev = sdmdev;
1e47d5e6 Kenneth Lee 2018-09-03  160  	q->mdev = mdev;
1e47d5e6 Kenneth Lee 2018-09-03  161  	init_waitqueue_head(&q->wait);
1e47d5e6 Kenneth Lee 2018-09-03  162  
1e47d5e6 Kenneth Lee 2018-09-03  163  	mdev_set_drvdata(mdev, q);
1e47d5e6 Kenneth Lee 2018-09-03  164  	get_device(pdev);
1e47d5e6 Kenneth Lee 2018-09-03  165  
1e47d5e6 Kenneth Lee 2018-09-03  166  	return 0;
1e47d5e6 Kenneth Lee 2018-09-03  167  }
1e47d5e6 Kenneth Lee 2018-09-03  168  
1e47d5e6 Kenneth Lee 2018-09-03  169  static int vfio_sdmdev_mdev_remove(struct mdev_device *mdev)
1e47d5e6 Kenneth Lee 2018-09-03  170  {
1e47d5e6 Kenneth Lee 2018-09-03  171  	struct vfio_sdmdev_queue *q =
1e47d5e6 Kenneth Lee 2018-09-03  172  		(struct vfio_sdmdev_queue *)mdev_get_drvdata(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  173  	struct vfio_sdmdev *sdmdev = q->sdmdev;
1e47d5e6 Kenneth Lee 2018-09-03  174  	struct device *pdev = mdev_parent_dev(mdev);
1e47d5e6 Kenneth Lee 2018-09-03  175  
1e47d5e6 Kenneth Lee 2018-09-03  176  	put_device(pdev);
1e47d5e6 Kenneth Lee 2018-09-03  177  
1e47d5e6 Kenneth Lee 2018-09-03 @178  	if (sdmdev->ops->put_queue);
                                                                   ^
Extra semicolon breaks the code.

1e47d5e6 Kenneth Lee 2018-09-03  179  		sdmdev->ops->put_queue(q);
1e47d5e6 Kenneth Lee 2018-09-03  180  
1e47d5e6 Kenneth Lee 2018-09-03  181  	return 0;
1e47d5e6 Kenneth Lee 2018-09-03  182  }
1e47d5e6 Kenneth Lee 2018-09-03  183  
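
On the ERR_PTR() reports: a "!sdmdev" test only catches NULL, while an error
pointer encodes a negative errno in the pointer value and is therefore
non-NULL. If vfio_sdmdev_pdev_sdmdev() can return error pointers, the callers
would need IS_ERR()/PTR_ERR() rather than a NULL test. A userspace model of
the idea follows. It assumes simplified re-implementations of the helpers from
include/linux/err.h; struct sdmdev_model, lookup() and show_type() are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified userspace versions of the kernel's err.h helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct sdmdev_model {
	int iommu_type;
};

/* Hypothetical lookup that reports failure as an error pointer. */
static struct sdmdev_model *lookup(struct sdmdev_model *dev, int registered)
{
	return registered ? dev : ERR_PTR(-ENODEV);
}

/* Check with IS_ERR() before dereferencing; a NULL test would miss it. */
static int show_type(struct sdmdev_model *dev, int registered, char *buf,
		     size_t len)
{
	struct sdmdev_model *s = lookup(dev, registered);

	if (IS_ERR(s))
		return (int)PTR_ERR(s);

	return snprintf(buf, len, "%d\n", s->iommu_type);
}
```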

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  2:11   ` Randy Dunlap
@ 2018-09-06  8:08     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  8:08 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 02, 2018 at 07:11:12PM -0700, Randy Dunlap wrote:
> On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> > diff --git a/drivers/vfio/sdmdev/Kconfig b/drivers/vfio/sdmdev/Kconfig
> > new file mode 100644
> > index 000000000000..51474272870d
> > --- /dev/null
> > +++ b/drivers/vfio/sdmdev/Kconfig
> > @@ -0,0 +1,10 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +config VFIO_SDMDEV
> > +	tristate "Support for Share Domain MDEV"
> > +	depends on VFIO_MDEV_DEVICE
> > +	help
> > +	  Support for VFIO Share Domain MDEV, which enables the kernel to
> > +	  support light weight hardware accelerator framework, WarpDrive.
> 
> 	          lightweight
> 
Thank you, will fix it.
> > +
> > +	  To compile this as a module, choose M here: the module will be called
> > +	  sdmdev.
> 
> 
> thanks,
> -- 
> ~Randy

-- 
			-Kenneth(Hisilicon)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 3/7] vfio: add sdmdev support
  2018-09-03  2:55   ` Lu Baolu
@ 2018-09-06  9:01     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:01 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Sanjay Kumar, linuxarm

On Mon, Sep 03, 2018 at 10:55:57AM +0800, Lu Baolu wrote:
> Hi,
> 
> On 09/03/2018 08:52 AM, Kenneth Lee wrote:
> >From: Kenneth Lee <liguozhu@hisilicon.com>
> >
> >SDMDEV is "Share Domain Mdev". It is a vfio-mdev, but unlike
> >the general vfio-mdev, it shares its parent's IOMMU. If Multi-PASID
> >support is enabled in the IOMMU (not yet in the current kernel HEAD),
> >multiple processes can share the IOMMU by different PASIDs. If it is not
> >supported, only one process can share the IOMMU with the kernel driver.
> >
> 
> If only for share domain purpose, I don't think it's necessary to create
> a new device type.
> 

Yes, if it were ONLY for the share-domain purpose. But we also need to share
the interrupt.

> >Currently only the vfio type-1 driver is updated to be aware of it.
> >
> >Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> >Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
> >Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
> >---
> >  drivers/vfio/Kconfig              |   1 +
> >  drivers/vfio/Makefile             |   1 +
> >  drivers/vfio/sdmdev/Kconfig       |  10 +
> >  drivers/vfio/sdmdev/Makefile      |   3 +
> >  drivers/vfio/sdmdev/vfio_sdmdev.c | 363 ++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_type1.c   | 151 ++++++++++++-
> >  include/linux/vfio_sdmdev.h       |  96 ++++++++
> >  include/uapi/linux/vfio_sdmdev.h  |  29 +++
> >  8 files changed, 648 insertions(+), 6 deletions(-)
> >  create mode 100644 drivers/vfio/sdmdev/Kconfig
> >  create mode 100644 drivers/vfio/sdmdev/Makefile
> >  create mode 100644 drivers/vfio/sdmdev/vfio_sdmdev.c
> >  create mode 100644 include/linux/vfio_sdmdev.h
> >  create mode 100644 include/uapi/linux/vfio_sdmdev.h
> >
> 
> [--cut for short --]
> 
> >diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >index d9fd3188615d..ba73231d8692 100644
> >--- a/drivers/vfio/vfio_iommu_type1.c
> >+++ b/drivers/vfio/vfio_iommu_type1.c
> >@@ -41,6 +41,7 @@
> >  #include <linux/notifier.h>
> >  #include <linux/dma-iommu.h>
> >  #include <linux/irqdomain.h>
> >+#include <linux/vfio_sdmdev.h>
> >  #define DRIVER_VERSION  "0.2"
> >  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> >@@ -89,6 +90,8 @@ struct vfio_dma {
> >  };
> >  struct vfio_group {
> >+	/* iommu_group of mdev's parent device */
> >+	struct iommu_group	*parent_group;
> >  	struct iommu_group	*iommu_group;
> >  	struct list_head	next;
> >  };
> >@@ -1327,6 +1330,109 @@ static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)
> >  	return ret;
> >  }
> >+/* return 0 if the device is not sdmdev.
> >+ * return 1 if the device is sdmdev, the data will be updated with parent
> >+ *	device's group.
> >+ * return -errno if other error.
> >+ */
> >+static int vfio_sdmdev_type(struct device *dev, void *data)
> >+{
> >+	struct iommu_group **group = data;
> >+	struct iommu_group *pgroup;
> >+	int (*_is_sdmdev)(struct device *dev);
> >+	struct device *pdev;
> >+	int ret = 1;
> >+
> >+	/* vfio_sdmdev module is not configurated */
> >+	_is_sdmdev = symbol_get(vfio_sdmdev_is_sdmdev);
> >+	if (!_is_sdmdev)
> >+		return 0;
> >+
> >+	/* check if it belongs to vfio_sdmdev device */
> >+	if (!_is_sdmdev(dev)) {
> >+		ret = 0;
> >+		goto out;
> >+	}
> >+
> >+	pdev = dev->parent;
> >+	pgroup = iommu_group_get(pdev);
> >+	if (!pgroup) {
> >+		ret = -ENODEV;
> >+		goto out;
> >+	}
> >+
> >+	if (group) {
> >+		/* check if all parent devices is the same */
> >+		if (*group && *group != pgroup)
> >+			ret = -ENODEV;
> >+		else
> >+			*group = pgroup;
> >+	}
> >+
> >+	iommu_group_put(pgroup);
> >+
> >+out:
> >+	symbol_put(vfio_sdmdev_is_sdmdev);
> >+
> >+	return ret;
> >+}
> >+
> >+/* return 0 or -errno */
> >+static int vfio_sdmdev_bus(struct device *dev, void *data)
> >+{
> >+	struct bus_type **bus = data;
> >+
> >+	if (!dev->bus)
> >+		return -ENODEV;
> >+
> >+	/* ensure all devices has the same bus_type */
> >+	if (*bus && *bus != dev->bus)
> >+		return -EINVAL;
> >+
> >+	*bus = dev->bus;
> >+	return 0;
> >+}
> >+
> >+/* return 0 means it is not sd group, 1 means it is, or -EXXX for error */
> >+static int vfio_iommu_type1_attach_sdgroup(struct vfio_domain *domain,
> >+					    struct vfio_group *group,
> >+					    struct iommu_group *iommu_group)
> >+{
> >+	int ret;
> >+	struct bus_type *pbus = NULL;
> >+	struct iommu_group *pgroup = NULL;
> >+
> >+	ret = iommu_group_for_each_dev(iommu_group, &pgroup,
> >+				       vfio_sdmdev_type);
> >+	if (ret < 0)
> >+		goto out;
> >+	else if (ret > 0) {
> >+		domain->domain = iommu_group_share_domain(pgroup);
> >+		if (IS_ERR(domain->domain))
> >+			goto out;
> >+		ret = iommu_group_for_each_dev(pgroup, &pbus,
> >+				       vfio_sdmdev_bus);
> >+		if (ret < 0)
> >+			goto err_with_share_domain;
> >+
> >+		if (pbus && iommu_capable(pbus, IOMMU_CAP_CACHE_COHERENCY))
> >+			domain->prot |= IOMMU_CACHE;
> >+
> >+		group->parent_group = pgroup;
> >+		INIT_LIST_HEAD(&domain->group_list);
> >+		list_add(&group->next, &domain->group_list);
> >+
> >+		return 1;
> >+	}
> 
> This doesn't match the function name. It only gets the domain from the
> parent device. It hasn't been really attached.
> 
> >+
> >+	return 0;
> >+
> >+err_with_share_domain:
> >+	iommu_group_unshare_domain(pgroup);
> >+out:
> >+	return ret;
> >+}
> >+
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  					 struct iommu_group *iommu_group)
> >  {
> >@@ -1335,8 +1441,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  	struct vfio_domain *domain, *d;
> >  	struct bus_type *bus = NULL, *mdev_bus;
> >  	int ret;
> >-	bool resv_msi, msi_remap;
> >-	phys_addr_t resv_msi_base;
> >+	bool resv_msi = false, msi_remap;
> >+	phys_addr_t resv_msi_base = 0;
> >  	mutex_lock(&iommu->lock);
> >@@ -1373,6 +1479,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  	if (mdev_bus) {
> >  		if ((bus == mdev_bus) && !iommu_present(bus)) {
> >  			symbol_put(mdev_bus_type);
> >+
> >+			ret = vfio_iommu_type1_attach_sdgroup(domain, group,
> >+					iommu_group);
> >+			if (ret < 0)
> >+				goto out_free;
> >+			else if (ret > 0)
> >+				goto replay_check;
> 
> Here you get the domain from the parent device and save it for later
> use. The actual attaching is ignored.
> 
> I don't think this follows the philosophy of this function. It actually
> makes all devices in the group with the same bus type share a single
> domain.

I think the original logic here is:

1. Create a new vfio_domain along with an iommu_domain for the group attached to
   the container.
2. Try to match the vfio_domain against the domain list in the container. If
   there is a match, free the newly created one and reuse the existing one;
   otherwise add the new vfio_domain to the list.

   With this design, the same configuration of the IOMMU (unit) is applied
   only once.


For an iommu_group that shares the IOMMU with its parent, the configuration will
never be the same (the PASID will be different), so it is not necessary to merge
them.
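
As a rough model of the matching step described above: the container keeps a
list of domains, and a newly created domain is merged into an existing one only
when their configurations are identical. The sketch below assumes a
hypothetical single-integer config key in place of the real comparison done in
vfio_iommu_type1_attach_group(); struct domain_model, struct container_model
and attach_group() are illustrative names only.

```c
#include <stddef.h>
#include <stdlib.h>

struct domain_model {
	int config;		/* hypothetical stand-in for the IOMMU configuration */
	int nr_groups;		/* groups attached to this domain */
	struct domain_model *next;
};

struct container_model {
	struct domain_model *domains;
};

/*
 * Attach one group: create a domain, then reuse a listed domain with the
 * same configuration if there is one, otherwise add the new domain.
 * Returns the domain the group ended up in, or NULL on allocation failure.
 */
static struct domain_model *attach_group(struct container_model *c, int config)
{
	struct domain_model *d, *fresh = calloc(1, sizeof(*fresh));

	if (!fresh)
		return NULL;
	fresh->config = config;

	for (d = c->domains; d; d = d->next) {
		if (d->config == config) {
			free(fresh);	/* match: drop the new one, reuse d */
			d->nr_groups++;
			return d;
		}
	}

	fresh->nr_groups = 1;
	fresh->next = c->domains;
	c->domains = fresh;
	return fresh;
}
```

Under this model, two share-domain groups with different PASIDs would carry
different configurations and would therefore never merge.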

> 
> Furthermore, the parent domain might be a domain of type
> IOMMU_DOMAIN_DMA. That cannot be used as an
> IOMMU_DOMAIN_UNMANAGED domain for iommu APIs.

Indeed, it should be checked when the domain is shared. Unmanaged domain should
not be used for sharing. I will update it in the future.

> 
> Best regards,
> Lu Baolu

-- 
			-Kenneth(Hisilicon)

================================================================================
本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁
止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中
的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the information contained herein in any way (including, but not
limited to, total or partial disclosure, reproduction, or dissemination) by
persons other than the intended recipient(s) is prohibited. If you receive this
e-mail in error, please notify the sender by phone or email immediately and
delete it!


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 4/7] crypto: add hisilicon Queue Manager driver
  2018-09-03  2:15   ` Randy Dunlap
@ 2018-09-06  9:08     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:08 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 02, 2018 at 07:15:07PM -0700, Randy Dunlap wrote:
> Date: Sun, 2 Sep 2018 19:15:07 -0700
> From: Randy Dunlap <rdunlap@infradead.org>
> To: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>,
>  Sanjay Kumar <sanjay.k.kumar@intel.com>
> CC: linuxarm@huawei.com
> Subject: Re: [PATCH 4/7] crypto: add hisilicon Queue Manager driver
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.9.1
> Message-ID: <4e46a451-d1cd-ac68-84b4-20792fdbc733@infradead.org>
> 
> On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> > diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
> > index 8ca9c503bcb0..02a6eef84101 100644
> > --- a/drivers/crypto/hisilicon/Kconfig
> > +++ b/drivers/crypto/hisilicon/Kconfig
> > @@ -1,4 +1,8 @@
> >  # SPDX-License-Identifier: GPL-2.0
> > +config CRYPTO_DEV_HISILICON
> > +	tristate "Support for HISILICON CRYPTO ACCELERATOR"
> > +	help
> > +	  Enable this to use Hisilicon Hardware Accelerators
> 
> 	                                        Accelerators.

Thanks, I will change it in the next version.

> 
> 
> -- 
> ~Randy

-- 
			-Kenneth(Hisilicon)




* Re: [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM
  2018-09-03  2:19   ` Randy Dunlap
@ 2018-09-06  9:09     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:09 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 02, 2018 at 07:19:03PM -0700, Randy Dunlap wrote:
> Date: Sun, 2 Sep 2018 19:19:03 -0700
> From: Randy Dunlap <rdunlap@infradead.org>
> To: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>,
>  Sanjay Kumar <sanjay.k.kumar@intel.com>
> CC: linuxarm@huawei.com
> Subject: Re: [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.9.1
> Message-ID: <46e57b2b-756e-e256-5b3c-30749d865512@infradead.org>
> 
> On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> > diff --git a/drivers/crypto/hisilicon/Kconfig b/drivers/crypto/hisilicon/Kconfig
> > index 1d155708cd69..b85fab48fdab 100644
> > --- a/drivers/crypto/hisilicon/Kconfig
> > +++ b/drivers/crypto/hisilicon/Kconfig
> > @@ -17,6 +17,16 @@ config CRYPTO_DEV_HISI_SEC
> >  	  To compile this as a module, choose M here: the module
> >  	  will be called hisi_sec.
> >  
> > +config CRYPTO_DEV_HISI_SDMDEV
> > +	bool "Enable SDMDEV interface"
> > +	depends on CRYPTO_DEV_HISILICON
> > +	select VFIO_SDMDEV
> > +	help
> > +	  Enable this enable the SDMDEV, "shared IOMMU Domain Mediated Device"
> 
> At a minimum:
> 	  Enable this to enable the SDMDEV,
> 
> although that could be done better.  Maybe just:
> 	  Enable the SDMDEV "shared IOMMU Domain Mediated Device"
> 
> 	  
> > +	  interface for all Hisilicon accelerators if they can. The SDMDEV
> 
> probably drop "if they can":          accelerators.  The SDMDEV interface
> 
> > +	  enable the WarpDrive user space accelerator driver to access the
> 
> 	  enables

Thank you, will change them all in the coming version.

> 
> > +	  hardware function directly.
> > +
> 
> 
> -- 
> ~Randy

-- 
			-Kenneth(Hisilicon)




* Re: [PATCH 7/7] vfio/sdmdev: add user sample
  2018-09-03  2:25   ` Randy Dunlap
@ 2018-09-06  9:10     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:10 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 02, 2018 at 07:25:12PM -0700, Randy Dunlap wrote:
> Date: Sun, 2 Sep 2018 19:25:12 -0700
> From: Randy Dunlap <rdunlap@infradead.org>
> To: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>,
>  Sanjay Kumar <sanjay.k.kumar@intel.com>
> CC: linuxarm@huawei.com
> Subject: Re: [PATCH 7/7] vfio/sdmdev: add user sample
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.9.1
> Message-ID: <ff75ff6a-2cd8-b25e-8d19-52e8f57141da@infradead.org>
> 
> On 09/02/2018 05:52 PM, Kenneth Lee wrote:
> > From: Kenneth Lee <liguozhu@hisilicon.com>
> > 
> > This is the sample code to demostrate how WrapDrive user application
> > should be.
> > 
> > It contains:
> > 
> > 1. wd.[ch], the common library to provide WrapDrive interface.
> 
>                                             WarpDrive
> 
> > 2. drv/*, the user driver to access the hardware upon spimdev
> > 3. test/*, the test application to use WrapDrive interface to access the
> 
>                                          WarpDrive
> 
> >    hardware queue(s) of the accelerator.
> > 
> > The Hisilicon HIP08 ZIP accelerator is used in this sample.
> > 
> > Signed-off-by: Zaibo Xu <xuzaibo@huawei.com>
> > Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> > Signed-off-by: Hao Fang <fanghao11@huawei.com>
> > Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com>
> > ---
> >  samples/warpdrive/AUTHORS              |   2 +
> >  samples/warpdrive/ChangeLog            |   1 +
> >  samples/warpdrive/Makefile.am          |   9 +
> >  samples/warpdrive/NEWS                 |   1 +
> >  samples/warpdrive/README               |  32 +++
> >  samples/warpdrive/autogen.sh           |   3 +
> >  samples/warpdrive/cleanup.sh           |  13 ++
> >  samples/warpdrive/configure.ac         |  52 +++++
> >  samples/warpdrive/drv/hisi_qm_udrv.c   | 223 ++++++++++++++++++
> >  samples/warpdrive/drv/hisi_qm_udrv.h   |  53 +++++
> >  samples/warpdrive/test/Makefile.am     |   7 +
> >  samples/warpdrive/test/comp_hw.h       |  23 ++
> >  samples/warpdrive/test/test_hisi_zip.c | 206 +++++++++++++++++
> >  samples/warpdrive/wd.c                 | 309 +++++++++++++++++++++++++
> >  samples/warpdrive/wd.h                 | 154 ++++++++++++
> >  samples/warpdrive/wd_adapter.c         |  74 ++++++
> >  samples/warpdrive/wd_adapter.h         |  43 ++++
> >  17 files changed, 1205 insertions(+)
> >  create mode 100644 samples/warpdrive/AUTHORS
> >  create mode 100644 samples/warpdrive/ChangeLog
> >  create mode 100644 samples/warpdrive/Makefile.am
> >  create mode 100644 samples/warpdrive/NEWS
> >  create mode 100644 samples/warpdrive/README
> >  create mode 100755 samples/warpdrive/autogen.sh
> >  create mode 100755 samples/warpdrive/cleanup.sh
> >  create mode 100644 samples/warpdrive/configure.ac
> >  create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.c
> >  create mode 100644 samples/warpdrive/drv/hisi_qm_udrv.h
> >  create mode 100644 samples/warpdrive/test/Makefile.am
> >  create mode 100644 samples/warpdrive/test/comp_hw.h
> >  create mode 100644 samples/warpdrive/test/test_hisi_zip.c
> >  create mode 100644 samples/warpdrive/wd.c
> >  create mode 100644 samples/warpdrive/wd.h
> >  create mode 100644 samples/warpdrive/wd_adapter.c
> >  create mode 100644 samples/warpdrive/wd_adapter.h
> 
> > diff --git a/samples/warpdrive/README b/samples/warpdrive/README
> > new file mode 100644
> > index 000000000000..3adf66b112fc
> > --- /dev/null
> > +++ b/samples/warpdrive/README
> > @@ -0,0 +1,32 @@
> > +WD User Land Demonstration
> > +==========================
> > +
> > +This directory contains some applications and libraries to demonstrate how a
> > +
> > +WrapDrive application can be constructed.
> 
>    WarpDrive
> 
> > +
> > +
> > +As a demo, we try to make it simple and clear for understanding. It is not
> > +
> > +supposed to be used in business scenario.
> > +
> > +
> > +The directory contains the following elements:
> > +
> > +wd.[ch]
> > +	A demonstration WrapDrive fundamental library which wraps the basic
> 
> 	                WarpDrive
> 
> > +	operations to the WrapDrive-ed device.
> 
>                           WarpDrive
> 
> > +
> > +wd_adapter.[ch]
> > +	User driver adaptor for wd.[ch]
> > +
> > +wd_utils.[ch]
> > +	Some utitlities function used by WD and its drivers
> > +
> > +drv/*
> > +	User drivers. It helps to fulfill the semantic of wd.[ch] for
> > +	particular hardware
> > +
> > +test/*
> > +	Test applications to use the wrapdrive library
> 
> 	                             warpdrive
> 
> -- 
> ~Randy

Thank you, will change them all in the coming version.

-- 
			-Kenneth(Hisilicon)




* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-03  2:32 ` [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Lu Baolu
@ 2018-09-06  9:11   ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:11 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Sanjay Kumar, linuxarm

On Mon, Sep 03, 2018 at 10:32:16AM +0800, Lu Baolu wrote:
> Date: Mon, 3 Sep 2018 10:32:16 +0800
> From: Lu Baolu <baolu.lu@linux.intel.com>
> To: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Sanjay Kumar
>  <sanjay.k.kumar@intel.com>
> CC: baolu.lu@linux.intel.com, linuxarm@huawei.com
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.9.1
> Message-ID: <81edb8ff-d046-34e5-aee7-d8564e2517c2@linux.intel.com>
> 
> Hi,
> 
> On 09/03/2018 08:51 AM, Kenneth Lee wrote:
> >From: Kenneth Lee <liguozhu@hisilicon.com>
> >
> >WarpDrive is an accelerator framework to expose the hardware capabilities
> >directly to the user space. It makes use of the exist vfio and vfio-mdev
> >facilities. So the user application can send request and DMA to the
> >hardware without interaction with the kernel. This removes the latency
> >of syscall.
> >
> >WarpDrive is the name for the whole framework. The component in kernel
> >is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> >hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> >library of WarpDrive can access it via VFIO interface.
> >
> >The patchset contains document for the detail. Please refer to it for more
> >information.
> >
> >This patchset is intended to be used with Jean Philippe Brucker's SVA
> >patch [1], which enables not only IO side page fault, but also PASID
> >support to IOMMU and VFIO.
> >
> >With these features, WarpDrive can support non-pinned memory and
> >multi-process in the same accelerator device.  We tested it in our SoC
> >integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> >tree can be found here: [2].
> >
> >But it is not mandatory. This patchset is tested in the latest mainline
> >kernel without the SVA patches.  So it supports only one process for each
> >accelerator.
> >
> >We have noticed the IOMMU aware mdev RFC announced recently [3].
> >
> >The IOMMU aware mdev has similar idea but different intention comparing to
> >WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> >And the design is supposed to be used with Scalable I/O Virtualization.
> >While sdmdev is intended to share the hardware resource with a big amount
> >of processes.  It just requires the hardware supporting address
> >translation per process (PCIE's PASID or ARM SMMU's substream ID).
> >
> >But we don't see serious confliction on both design. We believe they can be
> >normalized as one.
> >
> >The patch 1 is document of the framework. The patch 2 and 3 add sdmdev
> >support. The patch 4, 5 and 6 is drivers for Hislicon's ZIP Accelerator
> >which is registered to both crypto and warpdrive(sdmdev) and can be
> >used from kernel or user space at the same time. The patch 7 is a user
> >space sample demonstrating how WarpDrive works.
> >
> >
> >Change History:
> >V2 changed from V1:
> >	1. Change kernel framework name from SPIMDEV (Share Parent IOMMU
> >	   Mdev) to SDMDEV (Share Domain Mdev).
> >	2. Allocate Hardware Resource when a new mdev is created (While
> >	   it is allocated when the mdev is openned)
> >	3. Unmap pages from the shared domain when the sdmdev iommu group is
> >	   detached. (This procedure is necessary, but missed in V1)
> >	4. Update document accordingly.
> >	5. Rebase to the latest kernel (4.19.0-rc1)
> >	
> >	According the review comment on RFCv1, We did try to use dma-buf
> >	as back end of WarpDrive. It can work properly with the current
> >	solution [4], but it cannot make use of process's
> >	own memory address space directly. This is important to many
> >	acceleration scenario. So dma-buf will be taken as a backup
> >	alternative for noiommu scenario, it will be added in the future
> >	version.
> >
> >
> >Refernces:
> >[1] https://www.spinics.net/lists/kernel/msg2651481.html
> >[2] https://github.com/Kenneth-Lee/linux-kernel-warpdrive/tree/warpdrive-sva-v0.5
> >[3] https://lkml.org/lkml/2018/7/22/34
> 
> Please refer to the latest version posted here for discussion.
> 
> https://lkml.org/lkml/2018/8/30/107

Sure. Thank you.

> 
> Best regards,
> Lu Baolu

-- 
			-Kenneth(Hisilicon)




* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-04 16:15   ` Alex Williamson
@ 2018-09-06  9:45     ` Kenneth Lee
  2018-09-06 13:31       ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-06  9:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jerome Glisse, Kenneth Lee, Jonathan Corbet, Herbert Xu,
	David S . Miller, Joerg Roedel, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> Date: Tue, 4 Sep 2018 10:15:09 -0600
> From: Alex Williamson <alex.williamson@redhat.com>
> To: Jerome Glisse <jglisse@redhat.com>
> CC: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
>  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
>  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Kenneth Lee
>  <liguozhu@hisilicon.com>, Hao Fang <fanghao11@huawei.com>, Zhou Wang
>  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe
>  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman
>  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,
>  linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
>  linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,
>  kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org, Lu Baolu
>  <baolu.lu@linux.intel.com>, Sanjay Kumar <sanjay.k.kumar@intel.com>,
>  linuxarm@huawei.com
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> Message-ID: <20180904101509.62314b67@t450s.home>
> 
> On Tue, 4 Sep 2018 11:00:19 -0400
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > From: Kenneth Lee <liguozhu@hisilicon.com>
> > > 
> > > WarpDrive is an accelerator framework to expose the hardware capabilities
> > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > facilities. So the user application can send request and DMA to the
> > > hardware without interaction with the kernel. This removes the latency
> > > of syscall.
> > > 
> > > WarpDrive is the name for the whole framework. The component in kernel
> > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> > > library of WarpDrive can access it via VFIO interface.
> > > 
> > > The patchset contains document for the detail. Please refer to it for more
> > > information.
> > > 
> > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > patch [1], which enables not only IO side page fault, but also PASID
> > > support to IOMMU and VFIO.
> > > 
> > > With these features, WarpDrive can support non-pinned memory and
> > > multi-process in the same accelerator device.  We tested it in our SoC
> > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > > tree can be found here: [2].
> > > 
> > > But it is not mandatory. This patchset is tested in the latest mainline
> > > kernel without the SVA patches.  So it supports only one process for each
> > > accelerator.
> > > 
> > > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > > 
> > > The IOMMU aware mdev has similar idea but different intention comparing to
> > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> > > And the design is supposed to be used with Scalable I/O Virtualization.
> > > While sdmdev is intended to share the hardware resource with a big amount
> > > of processes.  It just requires the hardware supporting address
> > > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > > 
> > > But we don't see serious confliction on both design. We believe they can be
> > > normalized as one.
> > >   
> > 
> > So once again i do not understand why you are trying to do things
> > this way. Kernel already have tons of example of everything you
> > want to do without a new framework. Moreover i believe you are
> > confuse by VFIO. To me VFIO is for VM not to create general device
> > driver frame work.
> 
> VFIO is a userspace driver framework, the VM use case just happens to
> be a rather prolific one.  VFIO was never intended to be solely a VM
> device interface and has several other userspace users, notably DPDK
> and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
> wrapper, and perhaps others that I'm not aware of.  Whether vfio is
> appropriate interface here might certainly still be a debatable topic,
> but I would strongly disagree with your last sentence above.  Thanks,
> 
> Alex
> 

Yes, that is also my standpoint here.

> > So here is your use case as i understand it. You have a device
> > with a limited number of command queues (can be just one) and in
> > some case it can support SVA/SVM (when hardware support it and it
> > is not disabled). Final requirement is being able to schedule cmds
> > from userspace without ioctl. All of this exists already exists
> > upstream in few device drivers.
> > 
> > 
> > So here is how every body else is doing it. Please explain why
> > this does not work.
> > 
> > 1 Userspace open device file driver. Kernel device driver create
> >   a context and associate it with on open. This context can be
> >   uniq to the process and can bind hardware resources (like a
> >   command queue) to the process.
> > 2 Userspace bind/acquire a commands queue and initialize it with
> >   an ioctl on the device file. Through that ioctl userspace can
> >   be inform wether either SVA/SVM works for the device. If SVA/
> >   SVM works then kernel device driver bind the process to the
> >   device as part of this ioctl.
> > 3 If SVM/SVA does not work userspace do an ioctl to create dma
> >   buffer or something that does exactly the same thing.
> > 4 Userspace mmap the command queue (mmap of the device file by
> >   using informations gather at step 2)
> > 5 Userspace can write commands into the queue it mapped
> > 6 When userspace close the device file all resources are release
> >   just like any existing device drivers.
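
For reference, the userspace side of steps 4 and 5 in the list quoted above can
be sketched roughly as below. This is only an illustration under assumptions:
the ring layout, the queue depth and all demo_* names are hypothetical, and the
driver-specific ioctl of step 2 is left out because its number and arguments
depend on the driver. The atomic accessors are GCC/Clang builtins.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define DEMO_RING_ENTRIES 16	/* hypothetical queue depth */

/* Hypothetical command format; a real device defines its own. */
struct demo_cmd {
	uint64_t src;		/* address of the input data */
	uint64_t dst;		/* address of the output buffer */
	uint32_t len;		/* number of bytes to process */
	uint32_t opcode;	/* operation requested from the device */
};

/* Hypothetical ring shared between userspace and the device. */
struct demo_ring {
	uint32_t head;		/* advanced by the device (consumer) */
	uint32_t tail;		/* advanced by userspace (producer) */
	struct demo_cmd cmds[DEMO_RING_ENTRIES];
};

/* Step 4: map the command queue exported through the device file "fd".
 * Returns NULL if the mapping fails. */
static struct demo_ring *demo_ring_map(int fd)
{
	void *p = mmap(NULL, sizeof(struct demo_ring),
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Step 5: queue one command without entering the kernel.
 * Returns 0 on success, -1 if the ring is full. One slot is kept empty
 * so that "full" and "empty" can be told apart. */
static int demo_ring_submit(struct demo_ring *r, const struct demo_cmd *cmd)
{
	uint32_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
	uint32_t next = (r->tail + 1) % DEMO_RING_ENTRIES;

	if (next == head)
		return -1;

	r->cmds[r->tail] = *cmd;
	/* Publish the command only after its contents are written. */
	__atomic_store_n(&r->tail, next, __ATOMIC_RELEASE);
	return 0;
}
```

Step 1 would be a plain open() of the device node before demo_ring_map(), and
step 6 a munmap() followed by close() once the process is done with the queue.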

Hi, Jerome,

Just one thing: as I said in the cover letter, dma-buf requires the application
to use memory created by the driver for DMA. I did try the dma-buf way in
WarpDrive (refer to [4] in the cover letter). It is a good backup for NOIOMMU
mode, or for the case where we cannot solve the problem in VFIO.

But in many of my application scenarios, the application already has some memory
in hand, maybe allocated by a framework or by libraries. In any case, it does
not get the memory from my library; it just passes a pointer for the data
operation. The buffer may also contain pointers, and those pointers may be used
by the accelerator. So I need the hardware to fully share the address space with
the application. That is what dma-buf cannot do.
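
To make the difference concrete, here is a small userspace sketch. It assumes a
hypothetical application structure (struct demo_seg) that embeds pointers, and
it uses ordinary CPU loops to stand in for the accelerator. None of these names
come from the patchset.

```c
#include <stddef.h>
#include <string.h>

/* Application-owned segment list: each node points at its payload and
 * at the next node, using ordinary process virtual addresses. */
struct demo_seg {
	const void *data;
	size_t len;
	struct demo_seg *next;
};

/*
 * Shared address space: the device could be handed "head" as it is and
 * follow the same pointers the CPU uses. This loop stands in for the
 * device. Returns the total payload length.
 */
static size_t demo_walk_shared(const struct demo_seg *head)
{
	size_t total = 0;

	for (; head; head = head->next)
		total += head->len;
	return total;
}

/*
 * Driver-provided buffer (the dma-buf style): the embedded pointers mean
 * nothing to the device, so every payload has to be copied into the
 * staging buffer first. Returns the number of bytes copied, or 0 if the
 * list is empty or does not fit into "cap" bytes.
 */
static size_t demo_flatten_to_staging(const struct demo_seg *head,
				      void *staging, size_t cap)
{
	size_t off = 0;

	for (; head; head = head->next) {
		if (head->len > cap - off)
			return 0;
		memcpy((char *)staging + off, head->data, head->len);
		off += head->len;
	}
	return off;
}
```

The second function is the extra copy and repacking that the application has to
do when it cannot pass its own pointers to the hardware.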

> > 
> > Now if you want to create a device driver framework that expose
> > a device file with generic API for all of the above steps fine.
> > But it does not need to be part of VFIO whatsoever or explain
> > why.
> > 
> > 
> > Note that if IOMMU is fully disabled you probably want to block
> > userspace from being able to directly scheduling commands onto
> > the hardware as it would allow userspace to DMA anywhere and thus
> > would open the kernel to easy exploits. In this case you can still
> > keeps the same API as above and use page fault tricks to valid
> > commands written by userspace into fake commands ring. This will
> > be as slow or maybe even slower than ioctl but at least it allows
> > you to validate commands.
> > 
> > Cheers,
> > Jérôme

-- 
			-Kenneth(Hisilicon)




* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-06  9:45     ` Kenneth Lee
@ 2018-09-06 13:31       ` Jerome Glisse
  2018-09-07  4:01         ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-06 13:31 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Alex Williamson, Kenneth Lee, Jonathan Corbet, Herbert Xu,
	David S . Miller, Joerg Roedel, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > Date: Tue, 4 Sep 2018 10:15:09 -0600
> > From: Alex Williamson <alex.williamson@redhat.com>
> > To: Jerome Glisse <jglisse@redhat.com>
> > CC: Kenneth Lee <nek.in.cn@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
> >  Herbert Xu <herbert@gondor.apana.org.au>, "David S . Miller"
> >  <davem@davemloft.net>, Joerg Roedel <joro@8bytes.org>, Kenneth Lee
> >  <liguozhu@hisilicon.com>, Hao Fang <fanghao11@huawei.com>, Zhou Wang
> >  <wangzhou1@hisilicon.com>, Zaibo Xu <xuzaibo@huawei.com>, Philippe
> >  Ombredanne <pombredanne@nexb.com>, Greg Kroah-Hartman
> >  <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>,
> >  linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
> >  linux-crypto@vger.kernel.org, iommu@lists.linux-foundation.org,
> >  kvm@vger.kernel.org, linux-accelerators@lists.ozlabs.org, Lu Baolu
> >  <baolu.lu@linux.intel.com>, Sanjay Kumar <sanjay.k.kumar@intel.com>,
> >  linuxarm@huawei.com
> > Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> > Message-ID: <20180904101509.62314b67@t450s.home>
> > 
> > On Tue, 4 Sep 2018 11:00:19 -0400
> > Jerome Glisse <jglisse@redhat.com> wrote:
> > 
> > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > From: Kenneth Lee <liguozhu@hisilicon.com>
> > > > 
> > > > WarpDrive is an accelerator framework to expose the hardware capabilities
> > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > facilities. So the user application can send request and DMA to the
> > > > hardware without interaction with the kernel. This removes the latency
> > > > of syscall.
> > > > 
> > > > WarpDrive is the name for the whole framework. The component in kernel
> > > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> > > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> > > > library of WarpDrive can access it via VFIO interface.
> > > > 
> > > > The patchset contains document for the detail. Please refer to it for more
> > > > information.
> > > > 
> > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > patch [1], which enables not only IO side page fault, but also PASID
> > > > support to IOMMU and VFIO.
> > > > 
> > > > With these features, WarpDrive can support non-pinned memory and
> > > > multi-process in the same accelerator device.  We tested it in our SoC
> > > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > > > tree can be found here: [2].
> > > > 
> > > > But it is not mandatory. This patchset is tested in the latest mainline
> > > > kernel without the SVA patches.  So it supports only one process for each
> > > > accelerator.
> > > > 
> > > > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > > > 
> > > > The IOMMU aware mdev has similar idea but different intention comparing to
> > > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> > > > And the design is supposed to be used with Scalable I/O Virtualization.
> > > > While sdmdev is intended to share the hardware resource with a big amount
> > > > of processes.  It just requires the hardware supporting address
> > > > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > > > 
> > > > But we don't see serious confliction on both design. We believe they can be
> > > > normalized as one.
> > > >   
> > > 
> > > So once again i do not understand why you are trying to do things
> > > this way. Kernel already have tons of example of everything you
> > > want to do without a new framework. Moreover i believe you are
> > > confuse by VFIO. To me VFIO is for VM not to create general device
> > > driver frame work.
> > 
> > VFIO is a userspace driver framework, the VM use case just happens to
> > be a rather prolific one.  VFIO was never intended to be solely a VM
> > device interface and has several other userspace users, notably DPDK
> > and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
> > wrapper, and perhaps others that I'm not aware of.  Whether vfio is
> > appropriate interface here might certainly still be a debatable topic,
> > but I would strongly disagree with your last sentence above.  Thanks,
> > 
> > Alex
> > 
> 
> Yes, that is also my standpoint here.
> 
> > > So here is your use case as i understand it. You have a device
> > > with a limited number of command queues (can be just one) and in
> > > some case it can support SVA/SVM (when hardware support it and it
> > > is not disabled). Final requirement is being able to schedule cmds
> > > from userspace without ioctl. All of this exists already exists
> > > upstream in few device drivers.
> > > 
> > > 
> > > So here is how every body else is doing it. Please explain why
> > > this does not work.
> > > 
> > > 1 Userspace open device file driver. Kernel device driver create
> > >   a context and associate it with on open. This context can be
> > >   uniq to the process and can bind hardware resources (like a
> > >   command queue) to the process.
> > > 2 Userspace bind/acquire a commands queue and initialize it with
> > >   an ioctl on the device file. Through that ioctl userspace can
> > >   be inform wether either SVA/SVM works for the device. If SVA/
> > >   SVM works then kernel device driver bind the process to the
> > >   device as part of this ioctl.
> > > 3 If SVM/SVA does not work userspace do an ioctl to create dma
> > >   buffer or something that does exactly the same thing.
> > > 4 Userspace mmap the command queue (mmap of the device file by
> > >   using informations gather at step 2)
> > > 5 Userspace can write commands into the queue it mapped
> > > 6 When userspace close the device file all resources are release
> > >   just like any existing device drivers.
> 
> Hi, Jerome,
> 
> Just one thing, as I said in the cover letter, dma-buf requires the application
> to use memory created by the driver for DMA. I did try the dma-buf way in
> WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU
> mode or we cannot solve the problem in VFIO.
> 
> But, in many of my application scenario, the application already has some memory
> in hand, maybe allocated by the framework or libraries. Anyway, they don't get
> memory from my library, and they pass the poiter for data operation. And they
> may also have pointer in the buffer. Those pointer may be used by the
> accelerator. So I need hardware fully share the address space with the
> application. That is what dmabuf cannot do.

dmabuf can do that ... it is called uptr; you can look at i915 for
instance. Still, this does not answer my question above: why do
you need to be in VFIO to do any of the above? The kernel has
tons of examples that do all of the above and are not in VFIO
(including using an existing user pointer with a device).

Cheers,
Jérôme

* Re: [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework
  2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
@ 2018-09-06 18:36   ` Randy Dunlap
  2018-09-07  2:21     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Randy Dunlap @ 2018-09-06 18:36 UTC (permalink / raw)
  To: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang,
	Zaibo Xu, Philippe Ombredanne, Greg Kroah-Hartman,
	Thomas Gleixner, linux-doc, linux-kernel, linux-crypto, iommu,
	kvm, linux-accelerators, Lu Baolu, Sanjay Kumar
  Cc: linuxarm

Hi,

On 09/02/2018 05:51 PM, Kenneth Lee wrote:
> From: Kenneth Lee <liguozhu@hisilicon.com>
> 
> WarpDrive is a common user space accelerator framework.  Its main component
> in Kernel is called sdmdev, Share Domain Mediated Device. It exposes
> the hardware capabilities to the user space via vfio-mdev. So processes in
> user land can obtain a "queue" by open the device and direct access the
> hardware MMIO space or do DMA operation via VFIO interface.
> 
> WarpDrive is intended to be used with Jean Philippe Brucker's SVA
> patchset to support multi-process. But This is not a must.  Without the
> SVA patches, WarpDrive can still work for one process for every hardware
> device.
> 
> This patch add detail documents for the framework.
> 
> Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> ---
>  Documentation/00-INDEX                |   2 +
>  Documentation/warpdrive/warpdrive.rst | 100 ++++
>  Documentation/warpdrive/wd-arch.svg   | 728 ++++++++++++++++++++++++++
>  3 files changed, 830 insertions(+)
>  create mode 100644 Documentation/warpdrive/warpdrive.rst
>  create mode 100644 Documentation/warpdrive/wd-arch.svg

> diff --git a/Documentation/warpdrive/warpdrive.rst b/Documentation/warpdrive/warpdrive.rst
> new file mode 100644
> index 000000000000..6d2a5d1e08c4
> --- /dev/null
> +++ b/Documentation/warpdrive/warpdrive.rst
> @@ -0,0 +1,100 @@
> +Introduction of WarpDrive
> +=========================
> +
> +*WarpDrive* is a general accelerator framework for user space. It intends to
> +provide interface for the user process to send request to hardware
> +accelerator without heavy user-kernel interaction cost.
> +
> +The *WarpDrive* user library is supposed to provide a pipe-based API, such as:

Do you say "is supposed to" because it doesn't do that (yet)?
Or you could just change that to say:

   The WarpDrive user library provides a pipe-based API, such as:


> +        ::
> +        int wd_request_queue(struct wd_queue *q);
> +        void wd_release_queue(struct wd_queue *q);
> +
> +        int wd_send(struct wd_queue *q, void *req);
> +        int wd_recv(struct wd_queue *q, void **req);
> +        int wd_recv_sync(struct wd_queue *q, void **req);
> +        int wd_flush(struct wd_queue *q);
> +
> +*wd_request_queue* creates the pipe connection, *queue*, between the
> +application and the hardware. The application sends request and pulls the
> +answer back by asynchronized wd_send/wd_recv, which directly interact with the
> +hardware (by MMIO or share memory) without syscall.
> +
> +*WarpDrive* maintains a unified application address space among all involved
> +accelerators.  With the following APIs: ::

Seems like an extra '.' there.  How about:

  accelerators with the following APIs: ::

> +
> +        int wd_mem_share(struct wd_queue *q, const void *addr,
> +                         size_t size, int flags);
> +        void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t size);
> +
> +The referred process space shared by these APIs can be directly referred by the
> +hardware. The process can also dedicate its whole process space with flags,
> +*WD_SHARE_ALL* (not in this patch yet).
> +
> +The name *WarpDrive* is simply a cool and general name meaning the framework
> +makes the application faster. As it will be explained in this text later, the
> +facility in kernel is called *SDMDEV*, namely "Share Domain Mediated Device".
> +
> +
> +How does it work
> +================
> +
> +*WarpDrive* is built upon *VFIO-MDEV*. The queue is wrapped as *mdev* in VFIO.
> +So memory sharing can be done via standard VFIO standard DMA interface.
> +
> +The architecture is illustrated as follow figure:
> +
> +.. image:: wd-arch.svg
> +        :alt: WarpDrive Architecture
> +
> +Accelerator driver shares its capability via *SDMDEV* API: ::
> +
> +        vfio_sdmdev_register(struct vfio_sdmdev *sdmdev);
> +        vfio_sdmdev_unregister(struct vfio_sdmdev *sdmdev);
> +        vfio_sdmdev_wake_up(struct spimdev_queue *q);
> +
> +*vfio_sdmdev_register* is a helper function to register the hardware to the
> +*VFIO_MDEV* framework. The queue creation is done by *mdev* creation interface.
> +
> +*WarpDrive* User library mmap the mdev to access its mmio space and shared

s/mmio/MMIO/

> +memory. Request can be sent to, or receive from, hardware in this mmap-ed
> +space until the queue is full or empty.
> +
> +The user library can wait on the queue by ioctl(VFIO_SDMDEV_CMD_WAIT) the mdev
> +if the queue is full or empty. If the queue status is changed, the hardware
> +driver use *vfio_sdmdev_wake_up* to wake up the waiting process.
> +
> +
> +Multiple processes support
> +==========================
> +
> +In the latest mainline kernel (4.18) when this document is written,
> +multi-process is not supported in VFIO yet.
> +
> +Jean Philippe Brucker has a patchset to enable it[1]_. We have tested it
> +with our hardware (which is known as *D06*). It works well. *WarpDrive* rely
> +on them to support multiple processes. If it is not enabled, *WarpDrive* can
> +still work, but it support only one mdev for a process, which will share the
> +same io map table with kernel. (But it is not going to be a security problem,
> +since the user application cannot access the kernel address space)
> +
> +When multiprocess is support, mdev can be created based on how many
> +hardware resource (queue) is available. Because the VFIO framework accepts only
> +one open from one mdev iommu_group. Mdev become the smallest unit for process
> +to use queue. And the mdev will not be released if the user process exist. So
> +it will need a resource agent to manage the mdev allocation for the user
> +process. This is not in this document's range.
> +
> +
> +Legacy Mode Support
> +===================
> +For the hardware on which IOMMU is not support, WarpDrive can run on *NOIOMMU*
> +mode. That require some update to the mdev driver, which is not included in
> +this version yet.
> +
> +
> +References
> +==========
> +.. [1] https://patchwork.kernel.org/patch/10394851/
> +
> +.. vim: tw=78

thanks,
-- 
~Randy

* Re: [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework
  2018-09-06 18:36   ` Randy Dunlap
@ 2018-09-07  2:21     ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-07  2:21 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Thu, Sep 06, 2018 at 11:36:36AM -0700, Randy Dunlap wrote:
> Hi,
> 
> On 09/02/2018 05:51 PM, Kenneth Lee wrote:
> > From: Kenneth Lee <liguozhu@hisilicon.com>
> > 
> > WarpDrive is a common user space accelerator framework.  Its main component
> > in Kernel is called sdmdev, Share Domain Mediated Device. It exposes
> > the hardware capabilities to the user space via vfio-mdev. So processes in
> > user land can obtain a "queue" by open the device and direct access the
> > hardware MMIO space or do DMA operation via VFIO interface.
> > 
> > WarpDrive is intended to be used with Jean Philippe Brucker's SVA
> > patchset to support multi-process. But This is not a must.  Without the
> > SVA patches, WarpDrive can still work for one process for every hardware
> > device.
> > 
> > This patch add detail documents for the framework.
> > 
> > Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com>
> > ---
> >  Documentation/00-INDEX                |   2 +
> >  Documentation/warpdrive/warpdrive.rst | 100 ++++
> >  Documentation/warpdrive/wd-arch.svg   | 728 ++++++++++++++++++++++++++
> >  3 files changed, 830 insertions(+)
> >  create mode 100644 Documentation/warpdrive/warpdrive.rst
> >  create mode 100644 Documentation/warpdrive/wd-arch.svg
> 
> > diff --git a/Documentation/warpdrive/warpdrive.rst b/Documentation/warpdrive/warpdrive.rst
> > new file mode 100644
> > index 000000000000..6d2a5d1e08c4
> > --- /dev/null
> > +++ b/Documentation/warpdrive/warpdrive.rst
> > @@ -0,0 +1,100 @@
> > +Introduction of WarpDrive
> > +=========================
> > +
> > +*WarpDrive* is a general accelerator framework for user space. It intends to
> > +provide interface for the user process to send request to hardware
> > +accelerator without heavy user-kernel interaction cost.
> > +
> > +The *WarpDrive* user library is supposed to provide a pipe-based API, such as:
> 
> Do you say "is supposed to" because it doesn't do that (yet)?
> Or you could just change that to say:
> 
>    The WarpDrive user library provides a pipe-based API, such as:
> 

Actually, I meant to say that it can be defined like this, but people can
choose another implementation on top of the same kernel API.

I will say so explicitly in the next version. Thank you.
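
As an illustration only, here is a minimal sketch of how an application might
drive a pipe-style API with these prototypes. The prototypes are the ones
quoted from the document; the layout of struct wd_queue, the ring size, the
function bodies and wd_demo_roundtrip() are all assumptions made up for this
example. A real library would reach the hardware queue through the mmap-ed
mdev region rather than through an in-process array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue layout: the patch only names struct wd_queue and
 * does not show its members. A small ring stands in for the hardware
 * queue that a real library would access through mmap. */
#define WD_RING_SLOTS 4u

struct wd_queue {
	void *slot[WD_RING_SLOTS];
	unsigned int head;	/* next slot to receive from */
	unsigned int tail;	/* next slot to send into */
	int ready;
};

/* The bodies below are mock implementations (assumptions); only the
 * prototypes come from the quoted document. */
int wd_request_queue(struct wd_queue *q)
{
	q->head = q->tail = 0;
	q->ready = 1;
	return 0;
}

void wd_release_queue(struct wd_queue *q)
{
	q->ready = 0;
}

int wd_send(struct wd_queue *q, void *req)
{
	if (!q->ready || q->tail - q->head == WD_RING_SLOTS)
		return -1;	/* queue full: the caller would wait here */
	q->slot[q->tail % WD_RING_SLOTS] = req;
	q->tail++;
	return 0;
}

int wd_recv(struct wd_queue *q, void **req)
{
	if (!q->ready || q->head == q->tail)
		return -1;	/* queue empty */
	*req = q->slot[q->head % WD_RING_SLOTS];
	q->head++;
	return 0;
}

/* Send one request and pull the answer back; returns 0 on success. */
int wd_demo_roundtrip(void)
{
	struct wd_queue q;
	int request = 42;
	void *answer = NULL;
	int ret;

	if (wd_request_queue(&q))
		return -1;
	ret = wd_send(&q, &request) || wd_recv(&q, &answer);
	wd_release_queue(&q);
	if (ret)
		return -1;
	return answer == &request ? 0 : -1;
}
```

Neither wd_send() nor wd_recv() enters the kernel in this sketch, which is the
property the document is after. Only the full or empty case would need the
wait ioctl.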

> 
> > +        ::
> > +        int wd_request_queue(struct wd_queue *q);
> > +        void wd_release_queue(struct wd_queue *q);
> > +
> > +        int wd_send(struct wd_queue *q, void *req);
> > +        int wd_recv(struct wd_queue *q, void **req);
> > +        int wd_recv_sync(struct wd_queue *q, void **req);
> > +        int wd_flush(struct wd_queue *q);
> > +
> > +*wd_request_queue* creates the pipe connection, *queue*, between the
> > +application and the hardware. The application sends request and pulls the
> > +answer back by asynchronized wd_send/wd_recv, which directly interact with the
> > +hardware (by MMIO or share memory) without syscall.
> > +
> > +*WarpDrive* maintains a unified application address space among all involved
> > +accelerators.  With the following APIs: ::
> 
> Seems like an extra '.' there.  How about:
> 
>   accelerators with the following APIs: ::
> 

Err, the "with..." clause belongs to the following sentence, "The referred
process space...".

> > +
> > +        int wd_mem_share(struct wd_queue *q, const void *addr,
> > +                         size_t size, int flags);
> > +        void wd_mem_unshare(struct wd_queue *q, const void *addr, size_t size);
> > +
> > +The referred process space shared by these APIs can be directly referred by the
> > +hardware. The process can also dedicate its whole process space with flags,
> > +*WD_SHARE_ALL* (not in this patch yet).
> > +
> > +The name *WarpDrive* is simply a cool and general name meaning the framework
> > +makes the application faster. As it will be explained in this text later, the
> > +facility in kernel is called *SDMDEV*, namely "Share Domain Mediated Device".
> > +
> > +
> > +How does it work
> > +================
> > +
> > +*WarpDrive* is built upon *VFIO-MDEV*. The queue is wrapped as *mdev* in VFIO.
> > +So memory sharing can be done via standard VFIO standard DMA interface.
> > +
> > +The architecture is illustrated as follow figure:
> > +
> > +.. image:: wd-arch.svg
> > +        :alt: WarpDrive Architecture
> > +
> > +Accelerator driver shares its capability via *SDMDEV* API: ::
> > +
> > +        vfio_sdmdev_register(struct vfio_sdmdev *sdmdev);
> > +        vfio_sdmdev_unregister(struct vfio_sdmdev *sdmdev);
> > +        vfio_sdmdev_wake_up(struct spimdev_queue *q);
> > +
> > +*vfio_sdmdev_register* is a helper function to register the hardware to the
> > +*VFIO_MDEV* framework. The queue creation is done by *mdev* creation interface.
> > +
> > +*WarpDrive* User library mmap the mdev to access its mmio space and shared
> 
> s/mmio/MMIO/
> 
> > +memory. Request can be sent to, or receive from, hardware in this mmap-ed
> > +space until the queue is full or empty.
> > +
> > +The user library can wait on the queue by ioctl(VFIO_SDMDEV_CMD_WAIT) the mdev
> > +if the queue is full or empty. If the queue status is changed, the hardware
> > +driver use *vfio_sdmdev_wake_up* to wake up the waiting process.
> > +
> > +
> > +Multiple processes support
> > +==========================
> > +
> > +In the latest mainline kernel (4.18) when this document is written,
> > +multi-process is not supported in VFIO yet.
> > +
> > +Jean Philippe Brucker has a patchset to enable it[1]_. We have tested it
> > +with our hardware (which is known as *D06*). It works well. *WarpDrive* rely
> > +on them to support multiple processes. If it is not enabled, *WarpDrive* can
> > +still work, but it support only one mdev for a process, which will share the
> > +same io map table with kernel. (But it is not going to be a security problem,
> > +since the user application cannot access the kernel address space)
> > +
> > +When multiprocess is support, mdev can be created based on how many
> > +hardware resource (queue) is available. Because the VFIO framework accepts only
> > +one open from one mdev iommu_group. Mdev become the smallest unit for process
> > +to use queue. And the mdev will not be released if the user process exist. So
> > +it will need a resource agent to manage the mdev allocation for the user
> > +process. This is not in this document's range.
> > +
> > +
> > +Legacy Mode Support
> > +===================
> > +For the hardware on which IOMMU is not support, WarpDrive can run on *NOIOMMU*
> > +mode. That require some update to the mdev driver, which is not included in
> > +this version yet.
> > +
> > +
> > +References
> > +==========
> > +.. [1] https://patchwork.kernel.org/patch/10394851/
> > +
> > +.. vim: tw=78
> 
> thanks,
> -- 
> ~Randy

-- 
			-Kenneth(Hisilicon)

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-06 13:31       ` Jerome Glisse
@ 2018-09-07  4:01         ` Kenneth Lee
  2018-09-07 16:53           ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-07  4:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Alex Williamson, Kenneth Lee, Jonathan Corbet, Herbert Xu,
	David S . Miller, Joerg Roedel, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > On Tue, 4 Sep 2018 11:00:19 -0400
> > > Jerome Glisse <jglisse@redhat.com> wrote:
> > > 
> > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > From: Kenneth Lee <liguozhu@hisilicon.com>
> > > > > 
> > > > > WarpDrive is an accelerator framework to expose the hardware capabilities
> > > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > > facilities. So the user application can send request and DMA to the
> > > > > hardware without interaction with the kernel. This removes the latency
> > > > > of syscall.
> > > > > 
> > > > > WarpDrive is the name for the whole framework. The component in kernel
> > > > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> > > > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> > > > > library of WarpDrive can access it via VFIO interface.
> > > > > 
> > > > > The patchset contains document for the detail. Please refer to it for more
> > > > > information.
> > > > > 
> > > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > > patch [1], which enables not only IO side page fault, but also PASID
> > > > > support to IOMMU and VFIO.
> > > > > 
> > > > > With these features, WarpDrive can support non-pinned memory and
> > > > > multi-process in the same accelerator device.  We tested it in our SoC
> > > > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > > > > tree can be found here: [2].
> > > > > 
> > > > > But it is not mandatory. This patchset is tested in the latest mainline
> > > > > kernel without the SVA patches.  So it supports only one process for each
> > > > > accelerator.
> > > > > 
> > > > > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > > > > 
> > > > > The IOMMU aware mdev has similar idea but different intention comparing to
> > > > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> > > > > And the design is supposed to be used with Scalable I/O Virtualization.
> > > > > While sdmdev is intended to share the hardware resource with a big amount
> > > > > of processes.  It just requires the hardware supporting address
> > > > > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > > > > 
> > > > > But we don't see serious confliction on both design. We believe they can be
> > > > > normalized as one.
> > > > >   
> > > > 
> > > > So once again i do not understand why you are trying to do things
> > > > this way. Kernel already have tons of example of everything you
> > > > want to do without a new framework. Moreover i believe you are
> > > > confuse by VFIO. To me VFIO is for VM not to create general device
> > > > driver frame work.
> > > 
> > > VFIO is a userspace driver framework, the VM use case just happens to
> > > be a rather prolific one.  VFIO was never intended to be solely a VM
> > > device interface and has several other userspace users, notably DPDK
> > > and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
> > > wrapper, and perhaps others that I'm not aware of.  Whether vfio is
> > > appropriate interface here might certainly still be a debatable topic,
> > > but I would strongly disagree with your last sentence above.  Thanks,
> > > 
> > > Alex
> > > 
> > 
> > Yes, that is also my standpoint here.
> > 
> > > > So here is your use case as i understand it. You have a device
> > > > with a limited number of command queues (can be just one) and in
> > > > some case it can support SVA/SVM (when hardware support it and it
> > > > is not disabled). Final requirement is being able to schedule cmds
> > > > from userspace without ioctl. All of this exists already exists
> > > > upstream in few device drivers.
> > > > 
> > > > 
> > > > So here is how every body else is doing it. Please explain why
> > > > this does not work.
> > > > 
> > > > 1 Userspace open device file driver. Kernel device driver create
> > > >   a context and associate it with on open. This context can be
> > > >   uniq to the process and can bind hardware resources (like a
> > > >   command queue) to the process.
> > > > 2 Userspace bind/acquire a commands queue and initialize it with
> > > >   an ioctl on the device file. Through that ioctl userspace can
> > > >   be inform wether either SVA/SVM works for the device. If SVA/
> > > >   SVM works then kernel device driver bind the process to the
> > > >   device as part of this ioctl.
> > > > 3 If SVM/SVA does not work userspace do an ioctl to create dma
> > > >   buffer or something that does exactly the same thing.
> > > > 4 Userspace mmap the command queue (mmap of the device file by
> > > >   using informations gather at step 2)
> > > > 5 Userspace can write commands into the queue it mapped
> > > > 6 When userspace close the device file all resources are release
> > > >   just like any existing device drivers.
> > 
> > Hi, Jerome,
> > 
> > Just one thing, as I said in the cover letter, dma-buf requires the application
> > to use memory created by the driver for DMA. I did try the dma-buf way in
> > WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU
> > mode or we cannot solve the problem in VFIO.
> > 
> > But, in many of my application scenario, the application already has some memory
> > in hand, maybe allocated by the framework or libraries. Anyway, they don't get
> > memory from my library, and they pass the poiter for data operation. And they
> > may also have pointer in the buffer. Those pointer may be used by the
> > accelerator. So I need hardware fully share the address space with the
> > application. That is what dmabuf cannot do.
> 
> dmabuf can do that ... it is call uptr you can look at i915 for
> instance. Still this does not answer my question above, why do
> you need to be in VFIO to do any of the above thing ? Kernel has
> tons of examples that does all of the above and are not in VFIO
> (including usinng existing user pointer with device).
> 
> Cheers,
> Jérôme

I took a look at i915_gem_execbuffer_ioctl(). It seems to copy_from_user() the
user memory into the kernel. That is not what we need. What we are trying to
get is this: the user application does something to its data, pushes it to the
accelerator, and says: "I'm tired, it is your turn to do the job...". The
accelerator then has the memory and can refer to any portion of it with the
same VAs as the application, even when those VAs are stored inside the memory
itself.
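
To make the requirement concrete, here is a small sketch of data that carries
VAs inside itself. Everything in it is hypothetical: struct node is an invented
request format, and accel_sum() is an ordinary function standing in for the
accelerator. It only illustrates why the device has to see the same address
space as the application.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request format: a chain of nodes linked by ordinary
 * process virtual addresses, as an application might already hold. */
struct node {
	int value;
	struct node *next;	/* a VA stored inside the data itself */
};

/* Stand-in for the accelerator: it follows the embedded VAs directly.
 * This only works if the device shares the application's address space. */
static int accel_sum(const struct node *head)
{
	int sum = 0;

	for (; head; head = head->next)
		sum += head->value;
	return sum;
}

/* Returns 0 when both checks hold. */
int va_sharing_demo(void)
{
	struct node c = { 3, NULL };
	struct node b = { 2, &c };
	struct node a = { 1, &b };
	struct node copy;

	/* Shared address space: hand over the pointer, no copy, no fixup. */
	if (accel_sum(&a) != 6)
		return -1;

	/* Copying only the first buffer does not carry the rest along:
	 * copy.next still holds the original VA of b. */
	memcpy(&copy, &a, sizeof(copy));
	return copy.next == &b ? 0 : -1;
}
```

A scheme that copies the first buffer, or remaps it at a different device
address, would have to find and rewrite every embedded pointer. With a shared
address space the pointers can be used as they are.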

And I don't understand why I should avoid using VFIO. As Alex said, VFIO is a
userspace driver framework, and a userspace driver interface is exactly what I
need. Why should I reinvent the wheel? It has most of the stuff I need:

1. Connecting multiple devices to the same application space
2. Pinning and DMA from the application space to the whole set of devices
3. Managing hardware resources by device

We just need the last step: make sure multiple applications and the kernel can
share the same IOMMU. Then why shouldn't we use VFIO?
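
As a rough illustration of item 2, the sketch below fills in the argument of
VFIO's existing type1 mapping ioctl, VFIO_IOMMU_MAP_DMA, for an application
buffer. The struct and the flag names come from <linux/vfio.h>. The helper
wd_fill_dma_map() is a hypothetical name, and choosing iova equal to the
buffer's own VA is an assumption of this example, made so that the device
sees the buffer at the same address as the application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/* Describe an application buffer for VFIO_IOMMU_MAP_DMA so that the
 * device-side address (iova) equals the application's own VA.
 * Using iova == vaddr is an assumption of this sketch, not something
 * the patchset mandates. */
void wd_fill_dma_map(struct vfio_iommu_type1_dma_map *map,
		     void *buf, size_t size)
{
	memset(map, 0, sizeof(*map));
	map->argsz = sizeof(*map);
	map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map->vaddr = (uint64_t)(uintptr_t)buf;
	map->iova  = map->vaddr;	/* device uses the same address */
	map->size  = size;
}
```

The filled structure would then be handed to ioctl(container_fd,
VFIO_IOMMU_MAP_DMA, &map) on a VFIO container whose IOMMU type has already
been set. The type1 backend expects page-aligned addresses and sizes, so a
real caller would round the buffer out to page boundaries first.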

And personally, I believe the maturity and correctness of a framework are driven
by applications. The problem in the accelerator world now is that we don't have
a direction. If we believe the requirement is right, the method itself is not a
big problem in the end. We just need to give people a unified platform on which
to share their work.

Cheers
-- 
			-Kenneth(Hisilicon)

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-07  4:01         ` Kenneth Lee
@ 2018-09-07 16:53           ` Jerome Glisse
  2018-09-07 17:55             ` Jean-Philippe Brucker
  2018-09-10  3:28             ` Kenneth Lee
  0 siblings, 2 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-07 16:53 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > On Tue, 4 Sep 2018 11:00:19 -0400
> > > > Jerome Glisse <jglisse@redhat.com> wrote:
> > > > 
> > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > > From: Kenneth Lee <liguozhu@hisilicon.com>
> > > > > > 
> > > > > > WarpDrive is an accelerator framework to expose the hardware capabilities
> > > > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > > > facilities. So the user application can send request and DMA to the
> > > > > > hardware without interaction with the kernel. This removes the latency
> > > > > > of syscall.
> > > > > > 
> > > > > > WarpDrive is the name for the whole framework. The component in kernel
> > > > > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> > > > > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> > > > > > library of WarpDrive can access it via VFIO interface.
> > > > > > 
> > > > > > The patchset contains document for the detail. Please refer to it for more
> > > > > > information.
> > > > > > 
> > > > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > > > patch [1], which enables not only IO side page fault, but also PASID
> > > > > > support to IOMMU and VFIO.
> > > > > > 
> > > > > > With these features, WarpDrive can support non-pinned memory and
> > > > > > multi-process in the same accelerator device.  We tested it in our SoC
> > > > > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > > > > > tree can be found here: [2].
> > > > > > 
> > > > > > But it is not mandatory. This patchset is tested in the latest mainline
> > > > > > kernel without the SVA patches.  So it supports only one process for each
> > > > > > accelerator.
> > > > > > 
> > > > > > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > > > > > 
> > > > > > The IOMMU aware mdev has similar idea but different intention comparing to
> > > > > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> > > > > > And the design is supposed to be used with Scalable I/O Virtualization.
> > > > > > While sdmdev is intended to share the hardware resource with a big amount
> > > > > > of processes.  It just requires the hardware supporting address
> > > > > > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > > > > > 
> > > > > > But we don't see serious confliction on both design. We believe they can be
> > > > > > normalized as one.
> > > > > >   
> > > > > 
> > > > > So once again i do not understand why you are trying to do things
> > > > > this way. Kernel already have tons of example of everything you
> > > > > want to do without a new framework. Moreover i believe you are
> > > > > confuse by VFIO. To me VFIO is for VM not to create general device
> > > > > driver frame work.
> > > > 
> > > > VFIO is a userspace driver framework, the VM use case just happens to
> > > > be a rather prolific one.  VFIO was never intended to be solely a VM
> > > > device interface and has several other userspace users, notably DPDK
> > > > and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
> > > > wrapper, and perhaps others that I'm not aware of.  Whether vfio is
> > > > appropriate interface here might certainly still be a debatable topic,
> > > > but I would strongly disagree with your last sentence above.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > Yes, that is also my standpoint here.
> > > 
> > > > > So here is your use case as i understand it. You have a device
> > > > > with a limited number of command queues (can be just one) and in
> > > > > some case it can support SVA/SVM (when hardware support it and it
> > > > > is not disabled). Final requirement is being able to schedule cmds
> > > > > from userspace without ioctl. All of this exists already exists
> > > > > upstream in few device drivers.
> > > > > 
> > > > > 
> > > > > So here is how every body else is doing it. Please explain why
> > > > > this does not work.
> > > > > 
> > > > > 1 Userspace open device file driver. Kernel device driver create
> > > > >   a context and associate it with on open. This context can be
> > > > >   uniq to the process and can bind hardware resources (like a
> > > > >   command queue) to the process.
> > > > > 2 Userspace bind/acquire a commands queue and initialize it with
> > > > >   an ioctl on the device file. Through that ioctl userspace can
> > > > >   be inform wether either SVA/SVM works for the device. If SVA/
> > > > >   SVM works then kernel device driver bind the process to the
> > > > >   device as part of this ioctl.
> > > > > 3 If SVM/SVA does not work userspace do an ioctl to create dma
> > > > >   buffer or something that does exactly the same thing.
> > > > > 4 Userspace mmap the command queue (mmap of the device file by
> > > > >   using informations gather at step 2)
> > > > > 5 Userspace can write commands into the queue it mapped
> > > > > 6 When userspace close the device file all resources are release
> > > > >   just like any existing device drivers.
> > > 
> > > Hi, Jerome,
> > > 
> > > Just one thing, as I said in the cover letter, dma-buf requires the application
> > > to use memory created by the driver for DMA. I did try the dma-buf way in
> > > WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU
> > > mode or we cannot solve the problem in VFIO.
> > > 
> > > But, in many of my application scenario, the application already has some memory
> > > in hand, maybe allocated by the framework or libraries. Anyway, they don't get
> > > memory from my library, and they pass the poiter for data operation. And they
> > > may also have pointer in the buffer. Those pointer may be used by the
> > > accelerator. So I need hardware fully share the address space with the
> > > application. That is what dmabuf cannot do.
> > 
> > dmabuf can do that ... it is call uptr you can look at i915 for
> > instance. Still this does not answer my question above, why do
> > you need to be in VFIO to do any of the above thing ? Kernel has
> > tons of examples that does all of the above and are not in VFIO
> > (including usinng existing user pointer with device).
> > 
> > Cheers,
> > Jérôme
> 
> I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> user memory to the kernel. That is not what we need. What we try to get is: the
> user application do something on its data, and push it away to the accelerator,
> and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> the memory, referring any portion of it with the same VAs of the application,
> even the VAs are stored inside the memory itself.

You were not looking at the right place; see drivers/gpu/drm/i915/i915_gem_userptr.c.
It does GUP and creates a GEM object; AFAICR you can wrap that GEM object into a
dma buffer object.

> 
> And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> user driver framework. And I need exactly a user driver interface. Why should I
> invent another wheel? It has most of stuff I need:
> 
> 1. Connecting multiple devices to the same application space
> 2. Pinning and DMA from the application space to the whole set of device
> 3. Managing hardware resource by device
> 
> We just need the last step: make sure multiple applications and the kernel can
> share the same IOMMU. Then why shouldn't we use VFIO?

Because tons of other drivers already do all of the above outside VFIO. Many
drivers have a sizeable userspace side to them (anything with ioctl does) so they
can be construed as userspace drivers too.

So there is no reason to do that under VFIO. Especially as in your example
it is not a real userspace device driver; the userspace portion only knows
about writing commands into a command buffer AFAICT.

VFIO is for real userspace drivers where interrupts, configuration, ... ie
the whole driver is handled in userspace. This means that userspace has
to be trusted as it could program the device to do DMA to anywhere (if
IOMMU is disabled at boot which is still the default configuration in the
kernel).

So i do not see any reason to do anything you want inside VFIO. All you
want to do can be done outside as easily. Moreover it would be better if
you defined each scenario clearly, because from where i sit it looks like
you are opening the door wide for userspace to DMA anywhere when IOMMU
is disabled.

When IOMMU is disabled you can _not_ expose a command queue to userspace
unless your device has its own page table, all commands are relative
to that page table, and the device page table is populated by the kernel
driver in a secure way (ie by checking that what is populated can be accessed).

I do not believe your example device has such a page table, nor do i see
a fallback path when IOMMU is disabled that forces the user to do an ioctl
for each command.

Yes i understand that you target SVA/SVM but still you claim to support
non SVA/SVM. The point is that userspace can not be trusted if you want
to have random programs use your device. I am pretty sure that all users
of VFIO are trusted processes (like QEMU).


Finally i am convinced that the IOMMU grouping stuff related to VFIO is
useless for your use case. I really do not see the point of it; it
complicates things for you for no reason AFAICT.


> 
> And personally, I believe the maturity and correctness of a framework are driven
> by applications. Now the problem in accelerator world is that we don't have a
> direction. If we believe the requirement is right, the method itself is not a
> big problem in the end. We just need to let people have a unify platform to
> share their work together.

I am not against that but it seems to me that all you want to do is only
a matter of simplifying discovery of such devices and sharing a few common
ioctls (DMA mapping, creating command queue, managing command queue, ...)
and again for all this i do not see the point of doing this under VFIO.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-07 16:53           ` Jerome Glisse
@ 2018-09-07 17:55             ` Jean-Philippe Brucker
  2018-09-07 18:04               ` Jerome Glisse
  2018-09-10  3:28             ` Kenneth Lee
  1 sibling, 1 reply; 58+ messages in thread
From: Jean-Philippe Brucker @ 2018-09-07 17:55 UTC (permalink / raw)
  To: Jerome Glisse, Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, linux-doc, Sanjay Kumar, Hao Fang,
	linux-kernel, linuxarm, iommu, David S . Miller, linux-crypto,
	Philippe Ombredanne, Thomas Gleixner, linux-accelerators

On 07/09/2018 17:53, Jerome Glisse wrote:
> So there is no reasons to do that under VFIO. Especialy as in your example
> it is not a real user space device driver, the userspace portion only knows
> about writting command into command buffer AFAICT.
> 
> VFIO is for real userspace driver where interrupt, configurations, ... ie
> all the driver is handled in userspace. This means that the userspace have
> to be trusted as it could program the device to do DMA to anywhere (if
> IOMMU is disabled at boot which is still the default configuration in the
> kernel).

If the IOMMU is disabled (not exactly a kernel default by the way, I
think most IOMMU drivers enable it by default), your userspace driver
can't bypass DMA isolation by accident. It just won't be allowed to
access the device. VFIO requires an IOMMU unless the admin forces the
NOIOMMU mode with the "enable_unsafe_noiommu_mode" module parameter, and
the userspace explicitly asks for it with VFIO_NOIOMMU_IOMMU, which
taints the kernel. Not for production. A normal userspace driver that
uses VFIO can only do DMA to its own memory.
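
As a rough sketch of the check described above (assuming the uapi header
<linux/vfio.h> and the usual /dev/vfio/vfio container node; the helper name
vfio_iommu_model_supported() is made up for illustration), userspace can ask
a VFIO container which IOMMU model it offers before doing anything else:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: ask a VFIO container whether it supports a given
 * IOMMU model, e.g. VFIO_TYPE1_IOMMU or VFIO_NOIOMMU_IOMMU.
 * Returns 1 if supported, 0 if not, and -1 if the container cannot be
 * opened or reports an unexpected API version.
 */
static int vfio_iommu_model_supported(const char *container_path, int model)
{
	int fd = open(container_path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;

	/* The container must speak the API version we were built against. */
	if (ioctl(fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
		close(fd);
		return -1;
	}

	/* VFIO_CHECK_EXTENSION returns > 0 when the model is available. */
	ret = ioctl(fd, VFIO_CHECK_EXTENSION, (unsigned long)model);
	close(fd);

	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

A caller would typically pass "/dev/vfio/vfio" with VFIO_TYPE1_IOMMU and
treat a non-positive result as "do not touch the device".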

Thanks,
Jean

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-07 17:55             ` Jean-Philippe Brucker
@ 2018-09-07 18:04               ` Jerome Glisse
  0 siblings, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-07 18:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Kenneth Lee, Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, linux-doc, Sanjay Kumar, Hao Fang, iommu,
	linux-kernel, linuxarm, Alex Williamson, linux-crypto,
	Philippe Ombredanne, Thomas Gleixner, David S . Miller,
	linux-accelerators

On Fri, Sep 07, 2018 at 06:55:45PM +0100, Jean-Philippe Brucker wrote:
> On 07/09/2018 17:53, Jerome Glisse wrote:
> > So there is no reasons to do that under VFIO. Especialy as in your example
> > it is not a real user space device driver, the userspace portion only knows
> > about writting command into command buffer AFAICT.
> > 
> > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > all the driver is handled in userspace. This means that the userspace have
> > to be trusted as it could program the device to do DMA to anywhere (if
> > IOMMU is disabled at boot which is still the default configuration in the
> > kernel).
> 
> If the IOMMU is disabled (not exactly a kernel default by the way, I
> think most IOMMU drivers enable it by default), your userspace driver
> can't bypass DMA isolation by accident. It just won't be allowed to
> access the device. VFIO requires an IOMMU unless the admin forces the
> NOIOMMU mode with the "enable_unsafe_noiommu_mode" module parameter, and
> the userspace explicitly asks for it with VFIO_NOIOMMU_IOMMU, which
> taints the kernel. Not for production. A normal userspace driver that
> uses VFIO can only do DMA to its own memory.
> 

Didn't know about the VFIO check, which is a sane thing. On Intel the IOMMU
is disabled by default (see the INTEL_IOMMU_DEFAULT_ON Kconfig option).
I am pretty sure it used to be the same for AMD but maybe it is now
enabled by default.
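
Whatever the Kconfig default is, one way to see what a running system
actually ended up with is to count the entries under /sys/kernel/iommu_groups.
A sketch only; the helper name count_iommu_groups() is made up, and i am
assuming the directory is empty or missing when no IOMMU driver is active:

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count the entries in an IOMMU groups directory,
 * normally /sys/kernel/iommu_groups.
 * Returns the number of groups, or -1 if the directory cannot be opened.
 */
static int count_iommu_groups(const char *groups_dir)
{
	DIR *dir = opendir(groups_dir);
	struct dirent *entry;
	int count = 0;

	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		/* Skip the "." and ".." entries every directory has. */
		if (strcmp(entry->d_name, ".") == 0 ||
		    strcmp(entry->d_name, "..") == 0)
			continue;
		count++;
	}

	closedir(dir);
	return count;
}
```

A result of 0 or -1 for "/sys/kernel/iommu_groups" would suggest that no
device is currently behind an IOMMU.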

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-07 16:53           ` Jerome Glisse
  2018-09-07 17:55             ` Jean-Philippe Brucker
@ 2018-09-10  3:28             ` Kenneth Lee
  2018-09-10 14:54               ` Jerome Glisse
  1 sibling, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-10  3:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> Date: Fri, 7 Sep 2018 12:53:06 -0400
> From: Jerome Glisse <jglisse@redhat.com>
> To: Kenneth Lee <liguozhu@hisilicon.com>
> CC: Kenneth Lee <nek.in.cn@gmail.com>, Herbert Xu
>  <herbert@gondor.apana.org.au>, kvm@vger.kernel.org, Jonathan Corbet
>  <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Joerg
>  Roedel <joro@8bytes.org>, linux-doc@vger.kernel.org, Sanjay Kumar
>  <sanjay.k.kumar@intel.com>, Hao Fang <fanghao11@huawei.com>,
>  iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
>  linuxarm@huawei.com, Alex Williamson <alex.williamson@redhat.com>, Thomas
>  Gleixner <tglx@linutronix.de>, linux-crypto@vger.kernel.org, Zhou Wang
>  <wangzhou1@hisilicon.com>, Philippe Ombredanne <pombredanne@nexb.com>,
>  Zaibo Xu <xuzaibo@huawei.com>, "David S . Miller" <davem@davemloft.net>,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.0 (2018-05-17)
> Message-ID: <20180907165303.GA3519@redhat.com>
> 
> On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > On Tue, 4 Sep 2018 11:00:19 -0400
> > > > > Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > 
> > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > > > From: Kenneth Lee <liguozhu@hisilicon.com>
> > > > > > > 
> > > > > > > WarpDrive is an accelerator framework to expose the hardware capabilities
> > > > > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > > > > facilities. So the user application can send request and DMA to the
> > > > > > > hardware without interaction with the kernel. This removes the latency
> > > > > > > of syscall.
> > > > > > > 
> > > > > > > WarpDrive is the name for the whole framework. The component in kernel
> > > > > > > is called SDMDEV, Share Domain Mediated Device. Driver driver exposes its
> > > > > > > hardware resource by registering to SDMDEV as a VFIO-Mdev. So the user
> > > > > > > library of WarpDrive can access it via VFIO interface.
> > > > > > > 
> > > > > > > The patchset contains document for the detail. Please refer to it for more
> > > > > > > information.
> > > > > > > 
> > > > > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > > > > patch [1], which enables not only IO side page fault, but also PASID
> > > > > > > support to IOMMU and VFIO.
> > > > > > > 
> > > > > > > With these features, WarpDrive can support non-pinned memory and
> > > > > > > multi-process in the same accelerator device.  We tested it in our SoC
> > > > > > > integrated Accelerator (board ID: D06, Chip ID: HIP08). A reference work
> > > > > > > tree can be found here: [2].
> > > > > > > 
> > > > > > > But it is not mandatory. This patchset is tested in the latest mainline
> > > > > > > kernel without the SVA patches.  So it supports only one process for each
> > > > > > > accelerator.
> > > > > > > 
> > > > > > > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > > > > > > 
> > > > > > > The IOMMU aware mdev has similar idea but different intention comparing to
> > > > > > > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> > > > > > > And the design is supposed to be used with Scalable I/O Virtualization.
> > > > > > > While sdmdev is intended to share the hardware resource with a big amount
> > > > > > > of processes.  It just requires the hardware supporting address
> > > > > > > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > > > > > > 
> > > > > > > But we don't see serious confliction on both design. We believe they can be
> > > > > > > normalized as one.
> > > > > > >   
> > > > > > 
> > > > > > So once again i do not understand why you are trying to do things
> > > > > > this way. Kernel already have tons of example of everything you
> > > > > > want to do without a new framework. Moreover i believe you are
> > > > > > confuse by VFIO. To me VFIO is for VM not to create general device
> > > > > > driver frame work.
> > > > > 
> > > > > VFIO is a userspace driver framework, the VM use case just happens to
> > > > > be a rather prolific one.  VFIO was never intended to be solely a VM
> > > > > device interface and has several other userspace users, notably DPDK
> > > > > and SPDK, an NVMe backend in QEMU, a userspace NVMe driver, a ruby
> > > > > wrapper, and perhaps others that I'm not aware of.  Whether vfio is
> > > > > appropriate interface here might certainly still be a debatable topic,
> > > > > but I would strongly disagree with your last sentence above.  Thanks,
> > > > > 
> > > > > Alex
> > > > > 
> > > > 
> > > > Yes, that is also my standpoint here.
> > > > 
> > > > > > So here is your use case as i understand it. You have a device
> > > > > > with a limited number of command queues (can be just one) and in
> > > > > > some case it can support SVA/SVM (when hardware support it and it
> > > > > > is not disabled). Final requirement is being able to schedule cmds
> > > > > > from userspace without ioctl. All of this exists already exists
> > > > > > upstream in few device drivers.
> > > > > > 
> > > > > > 
> > > > > > So here is how every body else is doing it. Please explain why
> > > > > > this does not work.
> > > > > > 
> > > > > > 1 Userspace open device file driver. Kernel device driver create
> > > > > >   a context and associate it with on open. This context can be
> > > > > >   uniq to the process and can bind hardware resources (like a
> > > > > >   command queue) to the process.
> > > > > > 2 Userspace bind/acquire a commands queue and initialize it with
> > > > > >   an ioctl on the device file. Through that ioctl userspace can
> > > > > >   be inform wether either SVA/SVM works for the device. If SVA/
> > > > > >   SVM works then kernel device driver bind the process to the
> > > > > >   device as part of this ioctl.
> > > > > > 3 If SVM/SVA does not work userspace do an ioctl to create dma
> > > > > >   buffer or something that does exactly the same thing.
> > > > > > 4 Userspace mmap the command queue (mmap of the device file by
> > > > > >   using informations gather at step 2)
> > > > > > 5 Userspace can write commands into the queue it mapped
> > > > > > 6 When userspace close the device file all resources are release
> > > > > >   just like any existing device drivers.
> > > > 
> > > > Hi, Jerome,
> > > > 
> > > > Just one thing, as I said in the cover letter, dma-buf requires the application
> > > > to use memory created by the driver for DMA. I did try the dma-buf way in
> > > > WrapDrive (refer to [4] in the cover letter), it is a good backup for NOIOMMU
> > > > mode or we cannot solve the problem in VFIO.
> > > > 
> > > > But, in many of my application scenario, the application already has some memory
> > > > in hand, maybe allocated by the framework or libraries. Anyway, they don't get
> > > > memory from my library, and they pass the poiter for data operation. And they
> > > > may also have pointer in the buffer. Those pointer may be used by the
> > > > accelerator. So I need hardware fully share the address space with the
> > > > application. That is what dmabuf cannot do.
> > > 
> > > dmabuf can do that ... it is call uptr you can look at i915 for
> > > instance. Still this does not answer my question above, why do
> > > you need to be in VFIO to do any of the above thing ? Kernel has
> > > tons of examples that does all of the above and are not in VFIO
> > > (including usinng existing user pointer with device).
> > > 
> > > Cheers,
> > > Jérôme
> > 
> > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > user memory to the kernel. That is not what we need. What we try to get is: the
> > user application do something on its data, and push it away to the accelerator,
> > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > the memory, referring any portion of it with the same VAs of the application,
> > even the VAs are stored inside the memory itself.
> 
> You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> dma buffer object.
> 

Thank you for directing me to this implementation. It is interesting:).

But it does not yet solve my problem. If I understand it right, the userptr in
i915 does the following:

1. The user process passes a user pointer and a size to the kernel via ioctl.
2. The kernel wraps it as a dma-buf and keeps the process's mm for further
   reference.
3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
   can be shared between the user space and the hardware.

But my scenario is: 

1. The user process has some data in the user space, pointed by a pointer, say
   ptr1. And within the memory, there may be some other pointers, let's say one
   of them is ptr2.
2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
   hardware must dereference ptr1 and ptr2 *directly* for data.

Userptr lets the hardware and process share the same memory space. But I need
them to share the same *address space*. So IOMMU is a MUST for WarpDrive;
NOIOMMU mode, as Jean said, is just for verifying that some of the procedure is OK.
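
To illustrate the difference, here is a small user-space sketch (not code from
the patchset; struct job, device_run_shared_va() and rebase_pointer() are names
made up for this example). The first function models a device that shares the
process address space and can follow an embedded pointer as-is. The second
shows the fix-up every embedded pointer would need if the device saw the same
memory at a different address:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request: ptr1 points at this, payload plays the role of ptr2. */
struct job {
	const uint8_t *payload;	/* pointer stored inside the shared memory */
	size_t len;
};

/*
 * Models a device sharing the process address space: it dereferences
 * ptr1 and the pointer embedded in it without any translation.
 */
static uint32_t device_run_shared_va(const struct job *ptr1)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < ptr1->len; i++)
		sum += ptr1->payload[i];
	return sum;
}

/*
 * Models the shared-memory-only case: the buffer that starts at cpu_base
 * in the process is visible to the device at dev_base, so each embedded
 * pointer has to be rebased before the device can use it.
 */
static uint64_t rebase_pointer(uint64_t cpu_va, uint64_t cpu_base,
			       uint64_t dev_base)
{
	return dev_base + (cpu_va - cpu_base);
}
```

With a shared address space the second function is never needed, which is the
point: the application does not have to know where pointers are hidden in its
own data.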

> > 
> > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > user driver framework. And I need exactly a user driver interface. Why should I
> > invent another wheel? It has most of stuff I need:
> > 
> > 1. Connecting multiple devices to the same application space
> > 2. Pinning and DMA from the application space to the whole set of device
> > 3. Managing hardware resource by device
> > 
> > We just need the last step: make sure multiple applications and the kernel can
> > share the same IOMMU. Then why shouldn't we use VFIO?
> 
> Because tons of other drivers already do all of the above outside VFIO. Many
> driver have a sizeable userspace side to them (anything with ioctl do) so they
> can be construded as userspace driver too.
> 

Setting aside whether there really are *tons* of drivers doing that;), even if
I did the same as i915, I would still have to solve the address space problem
myself. If I do not need to do that with VFIO, why should I spend so much
effort doing it again?

> So there is no reasons to do that under VFIO. Especialy as in your example
> it is not a real user space device driver, the userspace portion only knows
> about writting command into command buffer AFAICT.
> 
> VFIO is for real userspace driver where interrupt, configurations, ... ie
> all the driver is handled in userspace. This means that the userspace have
> to be trusted as it could program the device to do DMA to anywhere (if
> IOMMU is disabled at boot which is still the default configuration in the
> kernel).
> 

But as Alex explained, VFIO is not simply used by VMs. So it need not have all
the stuff of a driver in the host system. And I do need to share the user space
as a DMA buffer with the hardware. I can get that with just a little update, and
then it serves me perfectly. I don't understand why I should choose the long route.

> So i do not see any reasons to do anything you want inside VFIO. All you
> want to do can be done outside as easily. Moreover it would be better if
> you define clearly each scenario because from where i sit it looks like
> you are opening the door wide open to userspace to DMA anywhere when IOMMU
> is disabled.
> 
> When IOMMU is disabled you can _not_ expose command queue to userspace
> unless your device has its own page table and all commands are relative
> to that page table and the device page table is populated by kernel driver
> in secure way (ie by checking that what is populated can be access).
> 
> I do not believe your example device to have such page table nor do i see
> a fallback path when IOMMU is disabled that force user to do ioctl for
> each commands.
> 
> Yes i understand that you target SVA/SVM but still you claim to support
> non SVA/SVM. The point is that userspace can not be trusted if you want
> to have random program use your device. I am pretty sure that all user
> of VFIO are trusted process (like QEMU).
> 
> 
> Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> useless for your usecase. I really do not see the point of that, it
> does complicate things for you for no reasons AFAICT.

Indeed, I don't like the group thing. I believe VFIO's maintainers would not like
it very much either;). But the problem is, the group reflects the same
IOMMU (unit), which may be shared with other devices. It is a security problem. I
cannot ignore it. I have to take it into account even if I don't use VFIO.

> 
> 
> > 
> > And personally, I believe the maturity and correctness of a framework are driven
> > by applications. Now the problem in accelerator world is that we don't have a
> > direction. If we believe the requirement is right, the method itself is not a
> > big problem in the end. We just need to let people have a unify platform to
> > share their work together.
> 
> I am not against that but it seems to me that all you want to do is only
> a matter of simplifying discovery of such devices and sharing few common
> ioctl (DMA mapping, creating command queue, managing command queue, ...)
> and again for all this i do not see the point of doing this under VFIO.

It is not a problem of device management, it is a problem of sharing address
space.

Cheers,

> 
> 
> Cheers,
> Jérôme

-- 
			-Kenneth(Hisilicon)



* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-10  3:28             ` Kenneth Lee
@ 2018-09-10 14:54               ` Jerome Glisse
  2018-09-11  2:42                 ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-10 14:54 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:

[...]

> > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > user application do something on its data, and push it away to the accelerator,
> > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > the memory, referring any portion of it with the same VAs of the application,
> > > even the VAs are stored inside the memory itself.
> > 
> > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > dma buffer object.
> > 
> 
> Thank you for directing me to this implementation. It is interesting:).
> 
> But it is not yet solve my problem. If I understand it right, the userptr in
> i915 do the following:
> 
> 1. The user process sets a user pointer with size to the kernel via ioctl.
> 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
>    reference.
> 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
>    can be shared between the user space and the hardware.
> 
> But my scenario is: 
> 
> 1. The user process has some data in the user space, pointed by a pointer, say
>    ptr1. And within the memory, there may be some other pointers, let's say one
>    of them is ptr2.
> 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
>    hardware must refer ptr1 and ptr2 *directly* for data.
> 
> Userptr lets the hardware and process share the same memory space. But I need
> them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.

So to be 100% clear, should we _ignore_ the non-SVA/SVM case?
If so, then wait for the necessary SVA/SVM support to land and do WarpDrive
without a non-SVA/SVM path.

If you still want a non-SVA/SVM path, what you want to do only works
if both ptr1 and ptr2 are in a range that is DMA mapped to the
device (moreover, you need the DMA address to match the process address,
which is not an easy feat).
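
A minimal userspace sketch of that constraint, assuming a single mapped
range. The names dma_range, request, range_covers_identity() and
device_can_follow() are hypothetical and only model the check; nothing here
comes from the patchset or from a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one DMA-mapped range: process VA -> device address. */
struct dma_range {
	uintptr_t va;   /* start of the range in the process address space */
	uint64_t iova;  /* address the device uses for the first byte */
	size_t len;     /* length of the range in bytes */
};

/* A buffer that carries a pointer inside it, like ptr1 holding ptr2. */
struct request {
	void *payload;      /* plays the role of ptr2 */
	size_t payload_len;
};

/* True if [p, p + len) lies inside r and the device address equals the VA. */
static bool range_covers_identity(const struct dma_range *r,
				  const void *p, size_t len)
{
	uintptr_t a = (uintptr_t)p;

	assert(r != NULL);
	if (r->iova != (uint64_t)r->va)   /* device would see another address */
		return false;
	if (a < r->va || len > r->len)
		return false;
	return a - r->va <= r->len - len; /* overflow-safe upper bound check */
}

/* The device can chase ptr1 -> ptr2 only if both objects are covered. */
static bool device_can_follow(const struct dma_range *r,
			      const struct request *ptr1)
{
	return range_covers_identity(r, ptr1, sizeof(*ptr1)) &&
	       range_covers_identity(r, ptr1->payload, ptr1->payload_len);
}
```

If either pointer falls outside the range, or the device address differs
from the process address, the device cannot dereference the stored pointer
as is.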

Now even if you only want SVA/SVM, I do not see the point
of doing this inside VFIO. The AMD GPU driver does not, and there would
be no benefit for it to be there. Well, an AMD VFIO mdev device
driver for QEMU guests might be useful, but they have SVIO IIRC.

For SVA/SVM your usage model is:

Setup:
    - user space creates a warp drive context for the process
    - user space creates a device specific context for the process
    - user space creates a user space command queue for the device
    - user space binds the command queue

    At this point the kernel driver has bound the process address
    space to the device with a command queue and userspace

Usage:
    - user space schedules work and calls the appropriate flush/update
      ioctl from time to time. This might be optional depending on the
      hardware, but it is probably a good idea to enforce it so that the
      kernel can unbind the command queue to bind another process's
      command queue.
    ...

Cleanup:
    - user space unbinds the command queue
    - user space destroys the device specific context
    - user space destroys the warp drive context
    All the above can be implicit when closing the device file.

So again, in the above model I do not see anything from
VFIO that would benefit it.
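
The setup/usage/cleanup ordering above can be sketched as a small state
machine. This is only a userspace model under stated assumptions: every
wd_* name is hypothetical, and wd_flush() merely stands in for whatever
flush/update ioctl a real driver would define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names are hypothetical; they only mirror the steps listed above. */
enum wd_state { WD_CLOSED, WD_CTX, WD_DEV_CTX, WD_QUEUE, WD_BOUND };

struct wd_session {
	enum wd_state state;
	unsigned int pending;  /* commands written since the last flush */
	unsigned int flushed;  /* commands handed over to the device */
};

/* Advance from `from` to `to` only when the session is currently in `from`. */
static bool wd_step(struct wd_session *s, enum wd_state from, enum wd_state to)
{
	assert(s != NULL);
	if (s->state != from)
		return false;
	s->state = to;
	return true;
}

/* Setup: the four steps must happen in this order. */
static bool wd_create_context(struct wd_session *s)
{ return wd_step(s, WD_CLOSED, WD_CTX); }
static bool wd_create_dev_context(struct wd_session *s)
{ return wd_step(s, WD_CTX, WD_DEV_CTX); }
static bool wd_create_queue(struct wd_session *s)
{ return wd_step(s, WD_DEV_CTX, WD_QUEUE); }
static bool wd_bind_queue(struct wd_session *s)
{ return wd_step(s, WD_QUEUE, WD_BOUND); }

/* Usage: work may only be scheduled while the queue is bound. */
static bool wd_submit(struct wd_session *s)
{
	if (s->state != WD_BOUND)
		return false;
	s->pending++;
	return true;
}

/* Stand-in for the periodic flush/update call. */
static bool wd_flush(struct wd_session *s)
{
	if (s->state != WD_BOUND)
		return false;
	s->flushed += s->pending;
	s->pending = 0;
	return true;
}

/* Cleanup: closing the device file implicitly unbinds and destroys all. */
static void wd_close(struct wd_session *s)
{
	s->state = WD_CLOSED;
	s->pending = 0;
}
```

Because the flush call is the only point where work is handed over, it is
also the natural point where a kernel driver could unbind one queue and
bind another.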


> > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > user driver framework. And I need exactly a user driver interface. Why should I
> > > invent another wheel? It has most of stuff I need:
> > > 
> > > 1. Connecting multiple devices to the same application space
> > > 2. Pinning and DMA from the application space to the whole set of device
> > > 3. Managing hardware resource by device
> > > 
> > > We just need the last step: make sure multiple applications and the kernel can
> > > share the same IOMMU. Then why shouldn't we use VFIO?
> > 
> > Because tons of other drivers already do all of the above outside VFIO. Many
> > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > can be construded as userspace driver too.
> > 
> 
> Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> i915 and solve the address space problem. And if I don't need to with VFIO, why
> should I spend so much effort to do it again?

Because you do not need any code from VFIO, nor do you need to reinvent
things. If non-SVA/SVM matters to you, then use a dma buffer. If not, then
I do not see anything in VFIO that you need.


> > So there is no reasons to do that under VFIO. Especialy as in your example
> > it is not a real user space device driver, the userspace portion only knows
> > about writting command into command buffer AFAICT.
> > 
> > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > all the driver is handled in userspace. This means that the userspace have
> > to be trusted as it could program the device to do DMA to anywhere (if
> > IOMMU is disabled at boot which is still the default configuration in the
> > kernel).
> > 
> 
> But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> stuffs as a driver in host system. And I do need to share the user space as DMA
> buffer to the hardware. And I can get it with just a little update, then it can
> service me perfectly. I don't understand why I should choose a long route.

Again, this is not the long route. I do not see anything in VFIO that
benefits you in the SVA/SVM case. A basic character device driver can
do that.


> > So i do not see any reasons to do anything you want inside VFIO. All you
> > want to do can be done outside as easily. Moreover it would be better if
> > you define clearly each scenario because from where i sit it looks like
> > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > is disabled.
> > 
> > When IOMMU is disabled you can _not_ expose command queue to userspace
> > unless your device has its own page table and all commands are relative
> > to that page table and the device page table is populated by kernel driver
> > in secure way (ie by checking that what is populated can be access).
> > 
> > I do not believe your example device to have such page table nor do i see
> > a fallback path when IOMMU is disabled that force user to do ioctl for
> > each commands.
> > 
> > Yes i understand that you target SVA/SVM but still you claim to support
> > non SVA/SVM. The point is that userspace can not be trusted if you want
> > to have random program use your device. I am pretty sure that all user
> > of VFIO are trusted process (like QEMU).
> > 
> > 
> > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > useless for your usecase. I really do not see the point of that, it
> > does complicate things for you for no reasons AFAICT.
> 
> Indeed, I don't like the group thing. I believe VFIO's maintainers would not
> like it very much either ;). But the problem is that the group corresponds to
> the same IOMMU (unit), which may be shared with other devices. It is a security
> problem. I cannot ignore it. I have to take it into account even if I don't use
> VFIO.

To me it seems you are making a policy decision in kernel space, i.e.
whether the device should be isolated in its own group or not is a
decision that is up to the sysadmin or something in userspace.
Right now existing users of SVA/SVM don't (at least AFAICT).

Do we really want to force such isolation?


> > > And personally, I believe the maturity and correctness of a framework are driven
> > > by applications. Now the problem in accelerator world is that we don't have a
> > > direction. If we believe the requirement is right, the method itself is not a
> > > big problem in the end. We just need to let people have a unify platform to
> > > share their work together.
> > 
> > I am not against that but it seems to me that all you want to do is only
> > a matter of simplifying discovery of such devices and sharing few common
> > ioctl (DMA mapping, creating command queue, managing command queue, ...)
> > and again for all this i do not see the point of doing this under VFIO.
> 
> It is not a problem of device management, it is a problem of sharing address
> space.

This ties back to IOMMU SVA/SVM group isolation above.

Jérôme


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-10 14:54               ` Jerome Glisse
@ 2018-09-11  2:42                 ` Kenneth Lee
  2018-09-11  3:33                   ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-11  2:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> 
> [...]
> 
> > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > user application do something on its data, and push it away to the accelerator,
> > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > the memory, referring any portion of it with the same VAs of the application,
> > > > even the VAs are stored inside the memory itself.
> > > 
> > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > dma buffer object.
> > > 
> > 
> > Thank you for directing me to this implementation. It is interesting:).
> > 
> > But it is not yet solve my problem. If I understand it right, the userptr in
> > i915 do the following:
> > 
> > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> >    reference.
> > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> >    can be shared between the user space and the hardware.
> > 
> > But my scenario is: 
> > 
> > 1. The user process has some data in the user space, pointed by a pointer, say
> >    ptr1. And within the memory, there may be some other pointers, let's say one
> >    of them is ptr2.
> > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> >    hardware must refer ptr1 and ptr2 *directly* for data.
> > 
> > Userptr lets the hardware and process share the same memory space. But I need
> > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> 
> So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> If so then wait for necessary SVA/SVM to land and do warp drive
> without non SVA/SVM path.
> 

I think we should clarify the concept of SVA/SVM here. As I understand it,
Shared Virtual Address/Memory means that any virtual address in a process can
be used by the device at the same time. This requires the IOMMU device to
support PASID. Optionally, it also requires support for page faults from the
device.

But before that feature settles down, the IOMMU can be used immediately in the
current kernel. That makes it possible to assign ONE process's virtual addresses
to the device's IOMMU page table with GUP. This makes WarpDrive work well for
one process.

Now we are talking about SVA and PASID just to make sure WarpDrive can benefit
from the feature in the future. It does not mean WarpDrive is useless before
that. And it works for our Zip and RSA accelerators in the physical world.

> If you still want non SVA/SVM path what you want to do only works
> if both ptr1 and ptr2 are in a range that is DMA mapped to the
> device (moreover you need DMA address to match process address
> which is not an easy feat).
> 
> Now even if you only want SVA/SVM, i do not see what is the point
> of doing this inside VFIO. AMD GPU driver does not and there would
> be no benefit for them to be there. Well a AMD VFIO mdev device
> driver for QEMU guest might be useful but they have SVIO IIRC.
> 
> For SVA/SVM your usage model is:
> 
> Setup:
>     - user space create a warp drive context for the process
>     - user space create a device specific context for the process
>     - user space create a user space command queue for the device
>     - user space bind command queue
> 
>     At this point the kernel driver has bound the process address
>     space to the device with a command queue and userspace
> 
> Usage:
>     - user space schedule work and call appropriate flush/update
>       ioctl from time to time. Might be optional depends on the
>       hardware, but probably a good idea to enforce so that kernel
>       can unbind the command queue to bind another process command
>       queue.
>     ...
> 
> Cleanup:
>     - user space unbind command queue
>     - user space destroy device specific context
>     - user space destroy warp drive context
>     All the above can be implicit when closing the device file.
> 
> So again in the above model i do not see anywhere something from
> VFIO that would benefit this model.
> 

Let me show you what the model looks like if I use VFIO:

Setup (Kernel part)
	- The kernel driver does everything as usual to serve its other
	  functionality: a NIC can still be registered to netdev, an encryptor
	  can still be registered to crypto...
	- At the same time, the driver can devote some of its hardware resources
	  and register them as an mdev creator with the VFIO framework. This
	  just needs a limited change to the VFIO type1 driver.

Setup (User space)
	- The system administrator creates an mdev via the mdev creator interface.
	- Following the VFIO setup routine, user space opens the mdev's group;
	  there is only one group for one device.
	- Without PASID support, you don't need to do anything. With PASID, bind
	  the PASID to the device via the VFIO interface.
	- Get the device from the group via the VFIO interface and mmap it into
	  user space for the device's MMIO access (for the queue).
	- Map whatever memory you need to share with the device via the VFIO
	  interface.
	- (opt) Add more devices into the container if you want to share the
	  same address space with them.

Cleanup:
	- User space closes the group file handle.
	- There will be a problem in letting other processes know that the mdev
	  is free to be used again. My RFCv1 chose a file handle solution. Alex
	  does not like it. But it is not a big problem. We can always have a
	  scheduler process to manage the state of the mdev, or we can even
	  switch back to the RFCv1 solution without too much effort if we like
	  in the future.

Except for the minimal update to the type1 driver and the use of sdmdev to
manage interrupt sharing, I don't need any extra code to gain the address-sharing
capability. And the capability will be strengthened as VFIO is upgraded.
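
For reference, the user-space part of that flow looks roughly like the
sketch below when written against the stock VFIO group/container API from
<linux/vfio.h>. It is an illustration only: vfio_share_buffer() is a
hypothetical helper name, the group path and device name are supplied by
the caller, the PASID bind and the mmap of the queue region are left out,
and it does not reflect the modified type1 driver in this patchset.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Open a VFIO container, attach one group to it, and map `len` bytes at
 * `buf` so that the device sees them at IOVA == process VA.
 * `buf` and `len` are expected to be page aligned.
 * Returns the device fd on success, -1 on any failure. On success the
 * container and group fds are deliberately left open for the process.
 */
static int vfio_share_buffer(const char *group_path, const char *dev_name,
			     void *buf, size_t len)
{
	struct vfio_group_status status = { .argsz = sizeof(status) };
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	int container, group = -1, device = -1;

	assert(group_path != NULL && dev_name != NULL);

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0)
		return -1;
	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		goto out;

	/* The group must be viable before it can join a container. */
	group = open(group_path, O_RDWR);
	if (group < 0)
		goto out;
	if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
	    !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		goto out;

	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0 ||
	    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU) < 0)
		goto out;

	/* Share the buffer with the device at the same address the CPU uses. */
	map.vaddr = (uintptr_t)buf;
	map.iova = (uintptr_t)buf;
	map.size = len;
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map) < 0)
		goto out;

	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, dev_name);
out:
	if (device < 0) {
		if (group >= 0)
			close(group);
		close(container);
		return -1;
	}
	return device;
}
```

Setting iova equal to vaddr is what gives the device the same addresses as
the process for the mapped range; adding more groups to the same container
would let further devices share that mapping.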

> 
> > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > invent another wheel? It has most of stuff I need:
> > > > 
> > > > 1. Connecting multiple devices to the same application space
> > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > 3. Managing hardware resource by device
> > > > 
> > > > We just need the last step: make sure multiple applications and the kernel can
> > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > 
> > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > can be construded as userspace driver too.
> > > 
> > 
> > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > should I spend so much effort to do it again?
> 
> Because you do not need any code from VFIO, nor do you need to reinvent
> things. If non SVA/SVM matters to you then use dma buffer. If not then
> i do not see anything in VFIO that you need.
> 

As I have explained, if I don't use VFIO, at least I have to do everything that
has been done in i915, or even more than that.

> 
> > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > it is not a real user space device driver, the userspace portion only knows
> > > about writting command into command buffer AFAICT.
> > > 
> > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > all the driver is handled in userspace. This means that the userspace have
> > > to be trusted as it could program the device to do DMA to anywhere (if
> > > IOMMU is disabled at boot which is still the default configuration in the
> > > kernel).
> > > 
> > 
> > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > stuffs as a driver in host system. And I do need to share the user space as DMA
> > buffer to the hardware. And I can get it with just a little update, then it can
> > service me perfectly. I don't understand why I should choose a long route.
> 
> Again this is not the long route i do not see anything in VFIO that
> benefit you in the SVA/SVM case. A basic character device driver can
> do that.
> 
> 
> > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > want to do can be done outside as easily. Moreover it would be better if
> > > you define clearly each scenario because from where i sit it looks like
> > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > is disabled.
> > > 
> > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > unless your device has its own page table and all commands are relative
> > > to that page table and the device page table is populated by kernel driver
> > > in secure way (ie by checking that what is populated can be access).
> > > 
> > > I do not believe your example device to have such page table nor do i see
> > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > each commands.
> > > 
> > > Yes i understand that you target SVA/SVM but still you claim to support
> > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > to have random program use your device. I am pretty sure that all user
> > > of VFIO are trusted process (like QEMU).
> > > 
> > > 
> > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > useless for your usecase. I really do not see the point of that, it
> > > does complicate things for you for no reasons AFAICT.
> > 
> > Indeed, I don't like the group thing. I believe VFIO's maintainers would not
> > like it very much either ;). But the problem is that the group corresponds to
> > the same IOMMU (unit), which may be shared with other devices. It is a security
> > problem. I cannot ignore it. I have to take it into account even if I don't use
> > VFIO.
> 
> To me it seems you are making a policy decission in kernel space ie
> wether the device should be isolated in its own group or not is a
> decission that is up to the sys admin or something in userspace.
> Right now existing user of SVA/SVM don't (at least AFAICT).
> 
> Do we really want to force such isolation ?
> 

But it is not my decision; that is how the iommu subsystem is designed.
Personally I don't like it at all, because all our hardware has its own stream
ID (device ID). I don't need the group concept at all. But the iommu subsystem
assumes some devices may share the same device ID on a single IOMMU.

> 
> > > > And personally, I believe the maturity and correctness of a framework are driven
> > > > by applications. Now the problem in accelerator world is that we don't have a
> > > > direction. If we believe the requirement is right, the method itself is not a
> > > > big problem in the end. We just need to let people have a unify platform to
> > > > share their work together.
> > > 
> > > I am not against that but it seems to me that all you want to do is only
> > > a matter of simplifying discovery of such devices and sharing few common
> > > ioctl (DMA mapping, creating command queue, managing command queue, ...)
> > > and again for all this i do not see the point of doing this under VFIO.
> > 
> > It is not a problem of device management, it is a problem of sharing address
> > space.
> 
> This ties back to IOMMU SVA/SVM group isolation above.
> 
> Jérôme

Cheers
-- 
			-Kenneth(Hisilicon)



* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-11  2:42                 ` Kenneth Lee
@ 2018-09-11  3:33                   ` Jerome Glisse
  2018-09-11  6:40                     ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-11  3:33 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Zaibo Xu, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	David S . Miller, linux-accelerators, Lu Baolu

On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > 
> > [...]
> > 
> > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > user application do something on its data, and push it away to the accelerator,
> > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > even the VAs are stored inside the memory itself.
> > > > 
> > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > dma buffer object.
> > > > 
> > > 
> > > Thank you for directing me to this implementation. It is interesting:).
> > > 
> > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > i915 do the following:
> > > 
> > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > >    reference.
> > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > >    can be shared between the user space and the hardware.
> > > 
> > > But my scenario is: 
> > > 
> > > 1. The user process has some data in the user space, pointed by a pointer, say
> > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > >    of them is ptr2.
> > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > 
> > > Userptr lets the hardware and process share the same memory space. But I need
> > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > 
> > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > If so then wait for necessary SVA/SVM to land and do warp drive
> > without non SVA/SVM path.
> > 
> 
> I think we should clear the concept of SVA/SVM here. As my understanding, Share
> Virtual Address/Memory means: any virtual address in a process can be used by
> device at the same time. This requires IOMMU device to support PASID. And
> optionally, it requires the feature of page-fault-from-device.

Yes, we agree on what SVA/SVM is. There is one gotcha though: access
to ranges that are MMIO mapped, i.e. CPU page table entries pointing to
IO memory. IIRC it is undefined what happens on some platforms when a
device tries to access those using SVA/SVM.


> But before the feature is settled down, IOMMU can be used immediately in the
> current kernel. That make it possible to assign ONE process's virtual addresses
> to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> process.

Uh? How? You want to GUP _every_ single valid address in the process
and map it to the device? How do you handle new vmas, pages being replaced
(despite GUP, because of things that ultimately call zap pte) ...

Again here you said that the device must be able to access _any_ valid
pointer. With GUP this is insane.

So I am assuming this is not what you want to do without SVA/SVM, i.e. with
GUP you have a different programming model, one in which userspace
must first bind a _range_ of memory to the device and get a DMA address
for the range.

Again, GUPing a range of the process address space to map it to a device,
so that userspace can use the device on the mapped range, is something
that already exists in various places in the kernel.
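
A toy model of that range-based programming model, assuming a fixed-size
table and a trivial bump allocator for device addresses. bind_table,
bind_range() and to_dma() are hypothetical names; a real driver would pin
the pages and obtain the DMA address from the kernel DMA API instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BOUND 8

/* One range that userspace bound to the device ahead of time. */
struct bound_range {
	uintptr_t va;  /* start in the process address space */
	uint64_t dma;  /* address the device uses for the first byte */
	size_t len;
};

struct bind_table {
	struct bound_range r[MAX_BOUND];
	size_t count;
	uint64_t next_dma;  /* toy allocator; must start non-zero */
};

/* Bind [va, va + len) and return its DMA address, or 0 on failure. */
static uint64_t bind_range(struct bind_table *t, const void *va, size_t len)
{
	struct bound_range *slot;

	assert(t != NULL);
	if (t->count == MAX_BOUND || len == 0)
		return 0;
	slot = &t->r[t->count++];
	slot->va = (uintptr_t)va;
	slot->dma = t->next_dma;
	slot->len = len;
	t->next_dma += len;
	return slot->dma;
}

/* Translate a process pointer; fails for anything outside a bound range. */
static bool to_dma(const struct bind_table *t, const void *p, uint64_t *dma)
{
	uintptr_t a = (uintptr_t)p;

	for (size_t i = 0; i < t->count; i++) {
		const struct bound_range *r = &t->r[i];

		if (a >= r->va && a - r->va < r->len) {
			*dma = r->dma + (a - r->va);
			return true;
		}
	}
	return false;
}
```

In this model userspace has to translate every pointer it hands to the
device, which is exactly what a shared address space would make
unnecessary.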

> Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> from the feature in the future. It dose not means WarpDrive is useless before
> that. And it works for our Zip and RSA accelerators in physical world.

Just not with random process addresses ...

> > If you still want non SVA/SVM path what you want to do only works
> > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > device (moreover you need DMA address to match process address
> > which is not an easy feat).
> > 
> > Now even if you only want SVA/SVM, i do not see what is the point
> > of doing this inside VFIO. AMD GPU driver does not and there would
> > be no benefit for them to be there. Well a AMD VFIO mdev device
> > driver for QEMU guest might be useful but they have SVIO IIRC.
> > 
> > For SVA/SVM your usage model is:
> > 
> > Setup:
> >     - user space create a warp drive context for the process
> >     - user space create a device specific context for the process
> >     - user space create a user space command queue for the device
> >     - user space bind command queue
> > 
> >     At this point the kernel driver has bound the process address
> >     space to the device with a command queue and userspace
> > 
> > Usage:
> >     - user space schedule work and call appropriate flush/update
> >       ioctl from time to time. Might be optional depends on the
> >       hardware, but probably a good idea to enforce so that kernel
> >       can unbind the command queue to bind another process command
> >       queue.
> >     ...
> > 
> > Cleanup:
> >     - user space unbind command queue
> >     - user space destroy device specific context
> >     - user space destroy warp drive context
> >     All the above can be implicit when closing the device file.
> > 
> > So again in the above model i do not see anywhere something from
> > VFIO that would benefit this model.
> > 
> 
> Let me show you how the model will be if I use VFIO:
> 
> Setup (Kernel part)
> 	- Kernel driver do every as usual to serve the other functionality, NIC
> 	  can still be registered to netdev, encryptor can still be registered
> 	  to crypto...
> 	- At the same time, the driver can devote some of its hardware resource
> 	  and register them as a mdev creator to the VFIO framework. This just
> 	  need limited change to the VFIO type1 driver.

In the above, VFIO does not help you one bit ... you can do that with
as much code with a new common device as front end.

> Setup (User space)
> 	- System administrator create mdev via the mdev creator interface.
> 	- Following VFIO setup routine, user space open the mdev's group, there is
> 	  only one group for one device.
> 	- Without PASID support, you don't need to do anything. With PASID, bind
> 	  the PASID to the device via VFIO interface.
> 	- Get the device from the group via VFIO interface and mmap it the user
> 	  space for device's MMIO access (for the queue).
> 	- Map whatever memory you need to share with the device with VFIO
> 	  interface.
> 	- (opt) Add more devices into the container if you want to share the
> 	  same address space with them

So all VFIO buys you here is boilerplate code that does insert_pfn()
to handle the MMIO mapping, which is just a couple hundred lines of
boilerplate code.

> 
> Cleanup:
> 	- User space close the group file handler
> 	- There will be a problem to let the other process know the mdev is
> 	  freed to be used again. My RFCv1 choose a file handler solution. Alex
> 	  dose not like it. But it is not a big problem. We can always have a
> 	  scheduler process to manage the state of the mdev or even we can
> 	  switch back to the RFCv1 solution without too much effort if we like
> 	  in the future.

If you were outside VFIO you would have more freedom in how to do that.
For instance, each process opening the device file can be placed on a
queue, and the first one in the queue gets to use the device until it
closes/releases the device. Then the next one in the queue gets the
device ...
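
The queueing idea above can be sketched in a few lines of C. This is only
a user-space model of the bookkeeping, not code from any driver: struct
devq, devq_open(), devq_release() and DEVQ_MAX are hypothetical names made
up for illustration, and a real driver would also need locking and a way
to put waiting openers to sleep.

```c
#include <assert.h>
#include <stddef.h>

#define DEVQ_MAX 16	/* hypothetical cap on queued openers */

/* FIFO of process ids; the head of the queue is the current owner. */
struct devq {
	int pids[DEVQ_MAX];
	size_t head;	/* index of the current owner */
	size_t count;	/* number of queued openers, owner included */
};

void devq_init(struct devq *q)
{
	q->head = 0;
	q->count = 0;
}

/* Pid of the current owner, or -1 when nobody holds the device. */
int devq_owner(const struct devq *q)
{
	return q->count ? q->pids[q->head] : -1;
}

/* Called on open(): queue the opener. Returns 0, or -1 if the queue is full. */
int devq_open(struct devq *q, int pid)
{
	if (q->count == DEVQ_MAX)
		return -1;
	q->pids[(q->head + q->count) % DEVQ_MAX] = pid;
	q->count++;
	return 0;
}

/*
 * Called when the owner closes/releases the device: hand the device to
 * the next opener. Returns the new owner, or -1 if the queue is empty.
 */
int devq_release(struct devq *q)
{
	assert(q->count > 0);
	q->head = (q->head + 1) % DEVQ_MAX;
	q->count--;
	return devq_owner(q);
}
```

The first opener becomes the owner immediately; later openers only become
the owner once every opener ahead of them has released the device.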

> Except for the minimum update to the type1 driver and use sdmdev to manage the
> interrupt sharing, I don't need any extra code to gain the address sharing
> capability. And the capability will be strengthen along with the upgrade of VFIO.
> 
> > 
> > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > invent another wheel? It has most of stuff I need:
> > > > > 
> > > > > 1. Connecting multiple devices to the same application space
> > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > 3. Managing hardware resource by device
> > > > > 
> > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > 
> > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > > can be construded as userspace driver too.
> > > > 
> > > 
> > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > should I spend so much effort to do it again?
> > 
> > Because you do not need any code from VFIO, nor do you need to reinvent
> > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > i do not see anything in VFIO that you need.
> > 
> 
> As I have explain, if I don't use VFIO, at lease I have to do all that has been
> done in i915 or even more than that.

So besides the MMIO mmap() handling and the DMA mapping of ranges of the
user space address space (again, all very boilerplate code duplicated
across the kernel several times in different forms), you do not gain
anything from being inside VFIO, right?


> > > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > > it is not a real user space device driver, the userspace portion only knows
> > > > about writting command into command buffer AFAICT.
> > > > 
> > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > all the driver is handled in userspace. This means that the userspace have
> > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > kernel).
> > > > 
> > > 
> > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > buffer to the hardware. And I can get it with just a little update, then it can
> > > service me perfectly. I don't understand why I should choose a long route.
> > 
> > Again this is not the long route i do not see anything in VFIO that
> > benefit you in the SVA/SVM case. A basic character device driver can
> > do that.
> > 
> > 
> > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > want to do can be done outside as easily. Moreover it would be better if
> > > > you define clearly each scenario because from where i sit it looks like
> > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > is disabled.
> > > > 
> > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > unless your device has its own page table and all commands are relative
> > > > to that page table and the device page table is populated by kernel driver
> > > > in secure way (ie by checking that what is populated can be access).
> > > > 
> > > > I do not believe your example device to have such page table nor do i see
> > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > each commands.
> > > > 
> > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > to have random program use your device. I am pretty sure that all user
> > > > of VFIO are trusted process (like QEMU).
> > > > 
> > > > 
> > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > > useless for your usecase. I really do not see the point of that, it
> > > > does complicate things for you for no reasons AFAICT.
> > > 
> > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like
> > > it very much either;). But the problem is, the group reflects to the same
> > > IOMMU(unit), which may shared with other devices.  It is a security problem. I
> > > cannot ignore it. I have to take it into account event I don't use VFIO.
> > 
> > To me it seems you are making a policy decission in kernel space ie
> > wether the device should be isolated in its own group or not is a
> > decission that is up to the sys admin or something in userspace.
> > Right now existing user of SVA/SVM don't (at least AFAICT).
> > 
> > Do we really want to force such isolation ?
> > 
> 
> But it is not my decision, that how the iommu subsystem is designed. Personally
> I don't like it at all, because all our hardwares have their own stream id
> (device id). I don't need the group concept at all. But the iommu subsystem
> assume some devices may share the name device ID to a single IOMMU.

My question was: do you really want to force group isolation for the
device? Existing SVA/SVM capable drivers do not force that; they let
userspace decide this (sysadmin, distributions, ...). Being part of
VFIO (in the way you do it; there are likely ways to avoid this inside
VFIO too) forces this decision, i.e. it makes a policy decision without
userspace having anything to say about it.


The IOMMU group thing has always been doubtful to me. It is advertised
as allowing devices to share resources (i.e. the IOMMU page table).
But this assumes that all device drivers in the group have some way of
communicating with each other to share common DMA addresses that point
to the memory the devices care about. I believe only VFIO does that, and
probably only when used by QEMU.


Anyway my question is:

Is it that useful to be inside VFIO (to avoid a few hundred lines
of boilerplate code) given that it forces you into a model (group
isolation) that so far has never been the preferred way for any of the
existing device drivers that already do what you want to achieve?


From where I stand I do not see overwhelming reasons to do what you
are doing inside VFIO.

To me it would make more sense to have a regular device driver. Such
drivers can all have their device files under the same hierarchy to make
devices with the same programming model easy to discover.

Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-11  3:33                   ` Jerome Glisse
@ 2018-09-11  6:40                     ` Kenneth Lee
  2018-09-11 13:40                       ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-11  6:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Zaibo Xu, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	David S . Miller, linux-accelerators, Lu Baolu

On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> Date: Mon, 10 Sep 2018 23:33:59 -0400
> From: Jerome Glisse <jglisse@redhat.com>
> To: Kenneth Lee <liguozhu@hisilicon.com>
> CC: Kenneth Lee <nek.in.cn@gmail.com>, Zaibo Xu <xuzaibo@huawei.com>,
>  Herbert Xu <herbert@gondor.apana.org.au>, kvm@vger.kernel.org, Jonathan
>  Corbet <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
>  Joerg Roedel <joro@8bytes.org>, linux-doc@vger.kernel.org, Sanjay Kumar
>  <sanjay.k.kumar@intel.com>, Hao Fang <fanghao11@huawei.com>,
>  iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
>  linuxarm@huawei.com, Alex Williamson <alex.williamson@redhat.com>,
>  linux-crypto@vger.kernel.org, Zhou Wang <wangzhou1@hisilicon.com>,
>  Philippe Ombredanne <pombredanne@nexb.com>, Thomas Gleixner
>  <tglx@linutronix.de>, "David S . Miller" <davem@davemloft.net>,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.1 (2018-07-13)
> Message-ID: <20180911033358.GA4730@redhat.com>
> 
> On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > 
> > > [...]
> > > 
> > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > user application do something on its data, and push it away to the accelerator,
> > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > > even the VAs are stored inside the memory itself.
> > > > > 
> > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > dma buffer object.
> > > > > 
> > > > 
> > > > Thank you for directing me to this implementation. It is interesting:).
> > > > 
> > > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > > i915 do the following:
> > > > 
> > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > >    reference.
> > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > >    can be shared between the user space and the hardware.
> > > > 
> > > > But my scenario is: 
> > > > 
> > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > >    of them is ptr2.
> > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > 
> > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > 
> > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > without non SVA/SVM path.
> > > 
> > 
> > I think we should clear the concept of SVA/SVM here. As my understanding, Share
> > Virtual Address/Memory means: any virtual address in a process can be used by
> > device at the same time. This requires IOMMU device to support PASID. And
> > optionally, it requires the feature of page-fault-from-device.
> 
> Yes we agree on what SVA/SVM is. There is a one gotcha thought, access
> to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
> it is undefined what happens on some platform for a device trying to
> access those using SVA/SVM.
> 
> 
> > But before the feature is settled down, IOMMU can be used immediately in the
> > current kernel. That make it possible to assign ONE process's virtual addresses
> > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > process.
> 
> UH ? How ? You want to GUP _every_ single valid address in the process
> and map it to the device ? How do you handle new vma, page being replace
> (despite GUP because of things that utimately calls zap pte) ...
> 
> Again here you said that the device must be able to access _any_ valid
> pointer. With GUP this is insane.
> 
> So i am assuming this is not what you want to do without SVA/SVM ie with
> GUP you have a different programming model, one in which the userspace
> must first bind _range_ of memory to the device and get a DMA address
> for the range.
> 
> Again, GUP range of process address space to map it to a device so that
> userspace can use the device on the mapped range is something that do
> exist in various places in the kernel.
> 

Yes, just as you expect. In WarpDrive, we use the concept of "sharing" to do
so. If some memory is going to be shared between the process and devices, we
call wd_share_mem(queue, ptr, size) to share that memory. When the queue is
working in this mode, a pointer is valid as long as it falls within those
memory segments. wd_share_mem() calls the VFIO DMA map ioctl, which does the
GUP.

If SVA/SVM is enabled, user space can set the SHARE_ALL flag on the queue. Then
wd_share_mem() is not necessary.
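
To make the range-sharing step concrete, here is a minimal sketch of what
such a call could look like on top of the VFIO type1 interface. It is an
assumption, not the WarpDrive implementation: sketch_share_mem() is a
hypothetical stand-in for the inside of wd_share_mem(), and it takes a
container file descriptor directly because the layout of the queue object
is not shown in this thread. The sketch also assumes that the IOVA is set
equal to the process virtual address, so that pointers stored inside the
shared buffer remain meaningful to the device.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Map [ptr, ptr + size) for DMA through an already configured VFIO
 * container (type1 IOMMU set, group attached). The range is widened to
 * page boundaries and mapped with iova == vaddr. Returns 0 on success
 * or a negative errno value.
 */
int sketch_share_mem(int container_fd, void *ptr, size_t size)
{
	long page = sysconf(_SC_PAGESIZE);
	struct vfio_iommu_type1_dma_map map;
	uintptr_t mask, start, end;

	assert(page > 0);
	mask = (uintptr_t)page - 1;
	start = (uintptr_t)ptr & ~mask;
	end = ((uintptr_t)ptr + size + mask) & ~mask;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = start;	/* process virtual address */
	map.iova = start;	/* assumed: device sees the same address */
	map.size = end - start;

	/* The kernel side pins the pages and programs the IOMMU. */
	if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
		return -errno;
	return 0;
}
```

With the SHARE_ALL mode described above, no such per-range call would be
needed, since the device would use the process page table directly.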

This approach was really not popular when we started the work on WarpDrive.
The GUP documentation said it should be used only while mmap_sem is held,
because GUP simply increases the page refcount; it does not preserve the
mapping between the page and the vma. We keep our work together with VFIO to
make sure the problem is solved in one place.

And now that we have GUP-longterm and a lot of accounting work in VFIO, we
don't want to do that again.

> > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > from the feature in the future. It dose not means WarpDrive is useless before
> > that. And it works for our Zip and RSA accelerators in physical world.
> 
> Just not with random process address ...
> 
> > > If you still want non SVA/SVM path what you want to do only works
> > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > device (moreover you need DMA address to match process address
> > > which is not an easy feat).
> > > 
> > > Now even if you only want SVA/SVM, i do not see what is the point
> > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > 
> > > For SVA/SVM your usage model is:
> > > 
> > > Setup:
> > >     - user space create a warp drive context for the process
> > >     - user space create a device specific context for the process
> > >     - user space create a user space command queue for the device
> > >     - user space bind command queue
> > > 
> > >     At this point the kernel driver has bound the process address
> > >     space to the device with a command queue and userspace
> > > 
> > > Usage:
> > >     - user space schedule work and call appropriate flush/update
> > >       ioctl from time to time. Might be optional depends on the
> > >       hardware, but probably a good idea to enforce so that kernel
> > >       can unbind the command queue to bind another process command
> > >       queue.
> > >     ...
> > > 
> > > Cleanup:
> > >     - user space unbind command queue
> > >     - user space destroy device specific context
> > >     - user space destroy warp drive context
> > >     All the above can be implicit when closing the device file.
> > > 
> > > So again in the above model i do not see anywhere something from
> > > VFIO that would benefit this model.
> > > 
> > 
> > Let me show you how the model will be if I use VFIO:
> > 
> > Setup (Kernel part)
> > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > 	  can still be registered to netdev, encryptor can still be registered
> > 	  to crypto...
> > 	- At the same time, the driver can devote some of its hardware resource
> > 	  and register them as a mdev creator to the VFIO framework. This just
> > 	  need limited change to the VFIO type1 driver.
> 
> In the above VFIO does not help you one bit ... you can do that with
> as much code with new common device as front end.
> 
> > Setup (User space)
> > 	- System administrator create mdev via the mdev creator interface.
> > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > 	  only one group for one device.
> > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > 	  the PASID to the device via VFIO interface.
> > 	- Get the device from the group via VFIO interface and mmap it the user
> > 	  space for device's MMIO access (for the queue).
> > 	- Map whatever memory you need to share with the device with VFIO
> > 	  interface.
> > 	- (opt) Add more devices into the container if you want to share the
> > 	  same address space with them
> 
> So all VFIO buys you here is boiler plate code that does insert_pfn()
> to handle MMIO mapping. Which is just couple hundred lines of boiler
> plate code.
> 

No. With VFIO, I don't need to:

1. Do the GUP and the accounting against RLIMIT_MEMLOCK
2. Keep track of all GUPed pages for release (VFIO uses an rb-tree to do so)
3. Handle the PASID on the SMMU (ARM's IOMMU) myself
4. Manage multiple devices (VFIO uses the container to manage this)

And even as boilerplate, it is valuable. The memory handling is a sensitive
interface to user space; it can easily become a security problem. If I can
achieve my goal within the scope of VFIO, why not? At least it has been
proved to be safe for the time being.
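
As an illustration of the accounting mentioned in the first item of the
list, the following sketch models charging pinned pages against a locked
memory limit. It is a simplified user-space model and not VFIO's code:
struct pin_account, pin_charge() and pin_uncharge() are hypothetical
names, and the limit is assumed to be RLIMIT_MEMLOCK already converted to
a page count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process accounting state, counted in pages. */
struct pin_account {
	size_t locked;	/* pages currently pinned */
	size_t limit;	/* RLIMIT_MEMLOCK expressed in pages */
};

/*
 * Charge npages before pinning them. Returns 0 on success, or -1 if the
 * request would exceed the limit, in which case nothing is charged.
 */
int pin_charge(struct pin_account *acct, size_t npages)
{
	/* locked never exceeds limit, so the subtraction cannot wrap. */
	if (npages > acct->limit - acct->locked)
		return -1;
	acct->locked += npages;
	return 0;
}

/* Return npages to the budget after the pages have been unpinned. */
void pin_uncharge(struct pin_account *acct, size_t npages)
{
	assert(npages <= acct->locked);
	acct->locked -= npages;
}
```

Getting this bookkeeping wrong in either direction is a user-visible
problem: undercounting lets a process pin more memory than its limit
allows, and overcounting makes legitimate mappings fail.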

> > 
> > Cleanup:
> > 	- User space close the group file handler
> > 	- There will be a problem to let the other process know the mdev is
> > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex
> > 	  dose not like it. But it is not a big problem. We can always have a
> > 	  scheduler process to manage the state of the mdev or even we can
> > 	  switch back to the RFCv1 solution without too much effort if we like
> > 	  in the future.
> 
> If you were outside VFIO you would have more freedom on how to do that.
> For instance process opening the device file can be placed on queue and
> first one in the queue get to use the device until it closes/release the
> device. Then next one in queue get the device ...

Yes. I do like the file handle solution. But I hope the solution becomes mature
as soon as possible. Many of our products, and as far as I know some of our
partners' as well, are waiting for a long-term solution to give them direction.
If I rely on an immature solution, they may choose some divergent, customized
solution. That would be much more harmful. Compared to this, the freedom is not
so important...

> 
> > Except for the minimum update to the type1 driver and use sdmdev to manage the
> > interrupt sharing, I don't need any extra code to gain the address sharing
> > capability. And the capability will be strengthen along with the upgrade of VFIO.
> > 
> > > 
> > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > invent another wheel? It has most of stuff I need:
> > > > > > 
> > > > > > 1. Connecting multiple devices to the same application space
> > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > 3. Managing hardware resource by device
> > > > > > 
> > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > 
> > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > > > can be construded as userspace driver too.
> > > > > 
> > > > 
> > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > should I spend so much effort to do it again?
> > > 
> > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > i do not see anything in VFIO that you need.
> > > 
> > 
> > As I have explain, if I don't use VFIO, at lease I have to do all that has been
> > done in i915 or even more than that.
> 
> So beside the MMIO mmap() handling and dma mapping of range of user space
> address space (again all very boiler plate code duplicated accross the
> kernel several time in different forms). You do not gain anything being
> inside VFIO right ?
> 

As I said, the rb-tree for GUP, the rlimit accounting, the cooperation with the
SMMU, and a mature user interface are our concerns.

> 
> > > > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > about writting command into command buffer AFAICT.
> > > > > 
> > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > kernel).
> > > > > 
> > > > 
> > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > service me perfectly. I don't understand why I should choose a long route.
> > > 
> > > Again this is not the long route i do not see anything in VFIO that
> > > benefit you in the SVA/SVM case. A basic character device driver can
> > > do that.
> > > 
> > > 
> > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > you define clearly each scenario because from where i sit it looks like
> > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > is disabled.
> > > > > 
> > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > unless your device has its own page table and all commands are relative
> > > > > to that page table and the device page table is populated by kernel driver
> > > > > in secure way (ie by checking that what is populated can be access).
> > > > > 
> > > > > I do not believe your example device to have such page table nor do i see
> > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > each commands.
> > > > > 
> > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > to have random program use your device. I am pretty sure that all user
> > > > > of VFIO are trusted process (like QEMU).
> > > > > 
> > > > > 
> > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > > > useless for your usecase. I really do not see the point of that, it
> > > > > does complicate things for you for no reasons AFAICT.
> > > > 
> > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like
> > > > it very much either;). But the problem is, the group reflects to the same
> > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I
> > > > cannot ignore it. I have to take it into account event I don't use VFIO.
> > > 
> > > To me it seems you are making a policy decission in kernel space ie
> > > wether the device should be isolated in its own group or not is a
> > > decission that is up to the sys admin or something in userspace.
> > > Right now existing user of SVA/SVM don't (at least AFAICT).
> > > 
> > > Do we really want to force such isolation ?
> > > 
> > 
> > But it is not my decision, that how the iommu subsystem is designed. Personally
> > I don't like it at all, because all our hardwares have their own stream id
> > (device id). I don't need the group concept at all. But the iommu subsystem
> > assume some devices may share the name device ID to a single IOMMU.
> 
> My question was do you really want to force group isolation for the
> device ? Existing SVA/SVM capable driver do not force that, they let
> the userspace decide this (sysadm, distributions, ...). Being part of
> VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> force this decision ie make a policy decision without userspace having
> anything to say about it.
> 
> 
> The IOMMU group thing as always been doubt full to me, it is advertise
> as allowing to share resources (ie IOMMU page table) between devices.
> But this assume that all device driver in the group have some way of
> communicating with each other to share common DMA address that point
> to memory devices care. I believe only VFIO does that and probably
> only when use by QEMU.
> 
> 
> Anyway my question is:
> 
> Is it that much useful to be inside VFIO (to avoid few hundred lines
> of boiler plate code) given that it forces you into a model (group
> isolation) that so far have never been the prefered way for all
> existing device driver that already do what you want to achieve ?
> 

You mean I should create another framework and copy most of the code from VFIO?
It is hard to believe the mainline kernel would take that code. So how about
letting me try the VFIO way first, and trying that if it does not work? ;)

> 
> From where i stand i do not see overwhelming reasons to do what you
> are doing inside VFIO.
> 
> To me it would make more sense to have regular device driver. They
> all can have device file under same hierarchy to make devices with
> same programming model easy to discover.
> 
> Jérôme

Cheers
-- 
			-Kenneth(Hisilicon)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-11  6:40                     ` Kenneth Lee
@ 2018-09-11 13:40                       ` Jerome Glisse
  2018-09-13  8:32                         ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-11 13:40 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Sanjay Kumar, Hao Fang,
	linux-kernel, linuxarm, iommu, David S . Miller, linux-crypto,
	Zhou Wang, Philippe Ombredanne, Thomas Gleixner, Joerg Roedel,
	linux-accelerators, Lu Baolu

On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > > user application do something on its data, and push it away to the accelerator,
> > > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > > > even the VAs are stored inside the memory itself.
> > > > > > 
> > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > > dma buffer object.
> > > > > > 
> > > > > 
> > > > > Thank you for directing me to this implementation. It is interesting:).
> > > > > 
> > > > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > > > i915 do the following:
> > > > > 
> > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > > >    reference.
> > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > > >    can be shared between the user space and the hardware.
> > > > > 
> > > > > But my scenario is: 
> > > > > 
> > > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > > >    of them is ptr2.
> > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > 
> > > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > > 
> > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > without non SVA/SVM path.
> > > > 
> > > 
> > > I think we should clear the concept of SVA/SVM here. As my understanding, Share
> > > Virtual Address/Memory means: any virtual address in a process can be used by
> > > device at the same time. This requires IOMMU device to support PASID. And
> > > optionally, it requires the feature of page-fault-from-device.
> > 
> > Yes we agree on what SVA/SVM is. There is a one gotcha thought, access
> > to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
> > it is undefined what happens on some platform for a device trying to
> > access those using SVA/SVM.
> > 
> > 
> > > But before the feature is settled down, IOMMU can be used immediately in the
> > > current kernel. That make it possible to assign ONE process's virtual addresses
> > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > > process.
> > 
> > UH ? How ? You want to GUP _every_ single valid address in the process
> > and map it to the device ? How do you handle new vma, page being replace
> > (despite GUP because of things that utimately calls zap pte) ...
> > 
> > Again here you said that the device must be able to access _any_ valid
> > pointer. With GUP this is insane.
> > 
> > So i am assuming this is not what you want to do without SVA/SVM ie with
> > GUP you have a different programming model, one in which the userspace
> > must first bind _range_ of memory to the device and get a DMA address
> > for the range.
> > 
> > Again, GUP range of process address space to map it to a device so that
> > userspace can use the device on the mapped range is something that do
> > exist in various places in the kernel.
> > 
> 
> Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to
> do so. If some memory is going to be shared among process and devices, we use
> wd_share_mem(queue, ptr, size) to share those memory. When the queue is working
> in this mode, the point is valid in those memory segments. The wd_share_mem call
> vfio dma map syscall which will do GUP. 
> 
> If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
> wd_share_mem() is not necessary.
> 
> This is really not popular when we started the work on WarpDrive. The GUP
> document said it should be put within the scope of mm_sem is locked. Because GUP
> simply increase the page refcount, not keep the mapping between the page and the
> vma. We keep our work together with VFIO to make sure the problem can be solved
> in one deal.

The problem cannot be solved in one deal: you cannot keep a vaddr
pointing to the same page after a fork(). This cannot be solved without
using mmu notifiers and invalidating the device DMA mapping! So being
part of VFIO will not help you there.

AFAIK VFIO is fine the way it is because QEMU does not fork() once it
is running a guest, so the assumption that a vaddr keeps pointing to the
same physical page is never broken by COW. So i doubt VFIO folks have
any incentive to go down the mmu notifier path and invalidate device
mappings. They also have the replay thing, which probably handles some
of the fork cases by trusting the user space program to do it. In your
case you cannot trust the user space program.

In your case, AFAICT, i do not see any warning or gotcha, so the following
scenario is broken (in the non SVA/SVM case):
    1) program sets up the device (open container, mdev, setup queue, ...)
    2) program maps some range of its address space with VFIO_IOMMU_MAP_DMA
    3) program starts using the device with the mapping set up in 2)
    ...
    4) program fork()s
    5) parent triggers COW inside the range set up in 2)

    At this point it is the child process that can write to the pages that
    are accessed by the device (which were mapped by the parent in 2)). The
    parent can no longer access that memory from the CPU.

There is just no sane way to fix this besides invalidating the device
mapping on fork (and you cannot rely on userspace to do so), and thus
stopping the device on fork (the SVA/SVM case has no issue here).

> And now we have GUP-longterm and many accounting work in VFIO, we don't want to
> do that again.

GUP-longterm does not solve any GUP problem; it just blocks people from
doing GUP on DAX-backed vmas to avoid pinning persistent memory, as that
is a nightmare to handle in the block device driver and file system code.

The accounting is the rlimit thing and is literally 10 lines of
code, so i would not see that as hard to replicate.
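
To give an idea of how small that accounting is, here is a user-space
model of it. This is a sketch under assumptions, not the VFIO code: struct
pin_account and the pin_account_*() helpers are hypothetical names, and a
kernel implementation would charge against the mm's own counters and allow
a CAP_IPC_LOCK bypass instead of keeping a private structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical per-process pin accounting, modelled in user space. */
struct pin_account {
    unsigned long pinned_pages;  /* pages currently charged */
    unsigned long limit_pages;   /* RLIMIT_MEMLOCK expressed in pages */
};

/* Initialise from an explicit byte limit (keeps the logic easy to test). */
static void pin_account_init(struct pin_account *acct,
                             unsigned long limit_bytes,
                             unsigned long page_size)
{
    acct->pinned_pages = 0;
    acct->limit_pages = limit_bytes / page_size;
}

/* Initialise from the calling process's real RLIMIT_MEMLOCK soft limit. */
static int pin_account_init_from_rlimit(struct pin_account *acct)
{
    struct rlimit rl;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) < 0)
        return -1;
    acct->pinned_pages = 0;
    if (rl.rlim_cur == RLIM_INFINITY)
        acct->limit_pages = ~0UL;
    else
        acct->limit_pages = (unsigned long)(rl.rlim_cur / (rlim_t)page_size);
    return 0;
}

/* Charge npages before pinning; refuse if the limit would be exceeded. */
static bool pin_account_charge(struct pin_account *acct, unsigned long npages)
{
    /* pinned_pages never exceeds limit_pages, so this cannot underflow */
    if (npages > acct->limit_pages - acct->pinned_pages)
        return false;
    acct->pinned_pages += npages;
    return true;
}

/* Give the pages back once they have been unpinned. */
static void pin_account_uncharge(struct pin_account *acct, unsigned long npages)
{
    assert(npages <= acct->pinned_pages);
    acct->pinned_pages -= npages;
}
```

A map path would call pin_account_charge() before GUP and fail the request
when it returns false; the unmap path calls pin_account_uncharge() after
put_page().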


> > > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > > from the feature in the future. It dose not means WarpDrive is useless before
> > > that. And it works for our Zip and RSA accelerators in physical world.
> > 
> > Just not with random process address ...
> > 
> > > > If you still want non SVA/SVM path what you want to do only works
> > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > > device (moreover you need DMA address to match process address
> > > > which is not an easy feat).
> > > > 
> > > > Now even if you only want SVA/SVM, i do not see what is the point
> > > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > > 
> > > > For SVA/SVM your usage model is:
> > > > 
> > > > Setup:
> > > >     - user space create a warp drive context for the process
> > > >     - user space create a device specific context for the process
> > > >     - user space create a user space command queue for the device
> > > >     - user space bind command queue
> > > > 
> > > >     At this point the kernel driver has bound the process address
> > > >     space to the device with a command queue and userspace
> > > > 
> > > > Usage:
> > > >     - user space schedule work and call appropriate flush/update
> > > >       ioctl from time to time. Might be optional depends on the
> > > >       hardware, but probably a good idea to enforce so that kernel
> > > >       can unbind the command queue to bind another process command
> > > >       queue.
> > > >     ...
> > > > 
> > > > Cleanup:
> > > >     - user space unbind command queue
> > > >     - user space destroy device specific context
> > > >     - user space destroy warp drive context
> > > >     All the above can be implicit when closing the device file.
> > > > 
> > > > So again in the above model i do not see anywhere something from
> > > > VFIO that would benefit this model.
> > > > 
> > > 
> > > Let me show you how the model will be if I use VFIO:
> > > 
> > > Setup (Kernel part)
> > > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > > 	  can still be registered to netdev, encryptor can still be registered
> > > 	  to crypto...
> > > 	- At the same time, the driver can devote some of its hardware resource
> > > 	  and register them as a mdev creator to the VFIO framework. This just
> > > 	  need limited change to the VFIO type1 driver.
> > 
> > In the above VFIO does not help you one bit ... you can do that with
> > as much code with new common device as front end.
> > 
> > > Setup (User space)
> > > 	- System administrator create mdev via the mdev creator interface.
> > > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > > 	  only one group for one device.
> > > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > > 	  the PASID to the device via VFIO interface.
> > > 	- Get the device from the group via VFIO interface and mmap it the user
> > > 	  space for device's MMIO access (for the queue).
> > > 	- Map whatever memory you need to share with the device with VFIO
> > > 	  interface.
> > > 	- (opt) Add more devices into the container if you want to share the
> > > 	  same address space with them
> > 
> > So all VFIO buys you here is boiler plate code that does insert_pfn()
> > to handle MMIO mapping. Which is just couple hundred lines of boiler
> > plate code.
> > 
> 
> No. With VFIO, I don't need to:
> 
> 1. GUP and accounting for RLIMIT_MEMLOCK

That's 10 lines of code ...

> 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)

GUP pages are not part of the rb_tree, and what you want to do can be done
in a few lines of code. Here is some pseudo code:

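warp_dma_map_range(ulong vaddr, ulong npages)
{
    /* one struct page pointer per page in the range */
    struct page **pages = kvzalloc(npages * sizeof(*pages));
    ulong i;

    for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
        GUP(vaddr, &pages[i]);  /* pin the page backing vaddr */
        /* iova == vaddr: the device uses the process addresses */
        iommu_map(vaddr, page_to_phys(pages[i]));
    }
    kvfree(pages);  /* the IOMMU page table remembers the pages */
}

warp_dma_unmap_range(ulong vaddr, ulong npages)
{
    ulong i;

    for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
        unsigned long pfn;

        /* recover the pinned page from the IOMMU mapping */
        pfn = iommu_iova_to_phys(vaddr) >> PAGE_SHIFT;
        iommu_unmap(vaddr);
        put_page(pfn_to_page(pfn)); /* set dirty if mapped write */
    }
}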

Add locking, error handling, dirtying and comments and you are barely
looking at a couple hundred lines of code. You do not need any of the
complexity of VFIO as you do not have the same requirements. Namely,
VFIO has to keep track of iova and physical mappings for things like
migration (migrating a guest between hosts) and a few other very
virtualization-centric requirements.


> 3. Handle the PASID on SMMU (ARM's IOMMU) myself.

Existing drivers do that in 20 lines of code, comments and error
handling included (see kfd_iommu_bind_process_to_device() for instance).
I doubt you need much more than that.


> 4. Multiple devices management (VFIO uses container to manage this)

All the vfio_group* stuff? OK, that's boiler plate code, not that
hard to replicate though.

> And even as boiler plate, it is valuable: the memory thing is a sensitive
> interface to user space, and it can easily become a security problem. If I can
> achieve my target within the scope of VFIO, why not? At least it has been
> proved to be safe for the time being.

The thing is, being part of VFIO imposes things on you, things that you
do not need. Like one device per group (maybe it is you imposing this,
i am losing track here). Or the complex DMA mapping tracking ...


> > > Cleanup:
> > > 	- User space close the group file handler
> > > 	- There will be a problem to let the other process know the mdev is
> > > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex
> > > 	  dose not like it. But it is not a big problem. We can always have a
> > > 	  scheduler process to manage the state of the mdev or even we can
> > > 	  switch back to the RFCv1 solution without too much effort if we like
> > > 	  in the future.
> > 
> > If you were outside VFIO you would have more freedom on how to do that.
> > For instance process opening the device file can be placed on queue and
> > first one in the queue get to use the device until it closes/release the
> > device. Then next one in queue get the device ...
> 
> Yes. I do like the file handle solution. But I hope the solution become mature
> as soon as possible. Many of our products, and as I know include some of our
> partners, are waiting for a long term solution as direction. If I rely on some
> unmature solution, they may choose some deviated, customized solution. That will
> be much harmful. Compare to this, the freedom is not so important...

I do not see how being part of VFIO protects you from people doing crazy
things to their kernel ... Time to market being key in this world, i doubt
that being part of VFIO would make anyone think twice before taking a
shortcut.

I have seen horrible things on that front and only players like Google
can impose a minimum level of sanity.


> > > Except for the minimum update to the type1 driver and use sdmdev to manage the
> > > interrupt sharing, I don't need any extra code to gain the address sharing
> > > capability. And the capability will be strengthen along with the upgrade of VFIO.
> > > 
> > > > 
> > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > > invent another wheel? It has most of stuff I need:
> > > > > > > 
> > > > > > > 1. Connecting multiple devices to the same application space
> > > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > > 3. Managing hardware resource by device
> > > > > > > 
> > > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > > 
> > > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > > > > can be construded as userspace driver too.
> > > > > > 
> > > > > 
> > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > > should I spend so much effort to do it again?
> > > > 
> > > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > > i do not see anything in VFIO that you need.
> > > > 
> > > 
> > > As I have explain, if I don't use VFIO, at lease I have to do all that has been
> > > done in i915 or even more than that.
> > 
> > So beside the MMIO mmap() handling and dma mapping of range of user space
> > address space (again all very boiler plate code duplicated accross the
> > kernel several time in different forms). You do not gain anything being
> > inside VFIO right ?
> > 
> 
> As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
> user interface are our concern.
> > 
> > > > > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > > about writting command into command buffer AFAICT.
> > > > > > 
> > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > > kernel).
> > > > > > 
> > > > > 
> > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > > service me perfectly. I don't understand why I should choose a long route.
> > > > 
> > > > Again this is not the long route i do not see anything in VFIO that
> > > > benefit you in the SVA/SVM case. A basic character device driver can
> > > > do that.
> > > > 
> > > > 
> > > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > > you define clearly each scenario because from where i sit it looks like
> > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > > is disabled.
> > > > > > 
> > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > > unless your device has its own page table and all commands are relative
> > > > > > to that page table and the device page table is populated by kernel driver
> > > > > > in secure way (ie by checking that what is populated can be access).
> > > > > > 
> > > > > > I do not believe your example device to have such page table nor do i see
> > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > > each commands.
> > > > > > 
> > > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > > to have random program use your device. I am pretty sure that all user
> > > > > > of VFIO are trusted process (like QEMU).
> > > > > > 
> > > > > > 
> > > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > > > > useless for your usecase. I really do not see the point of that, it
> > > > > > does complicate things for you for no reasons AFAICT.
> > > > > 
> > > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like
> > > > > it very much either;). But the problem is, the group reflects to the same
> > > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I
> > > > > cannot ignore it. I have to take it into account event I don't use VFIO.
> > > > 
> > > > To me it seems you are making a policy decission in kernel space ie
> > > > wether the device should be isolated in its own group or not is a
> > > > decission that is up to the sys admin or something in userspace.
> > > > Right now existing user of SVA/SVM don't (at least AFAICT).
> > > > 
> > > > Do we really want to force such isolation ?
> > > > 
> > > 
> > > But it is not my decision, that how the iommu subsystem is designed. Personally
> > > I don't like it at all, because all our hardwares have their own stream id
> > > (device id). I don't need the group concept at all. But the iommu subsystem
> > > assume some devices may share the name device ID to a single IOMMU.
> > 
> > My question was do you really want to force group isolation for the
> > device ? Existing SVA/SVM capable driver do not force that, they let
> > the userspace decide this (sysadm, distributions, ...). Being part of
> > VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> > force this decision ie make a policy decision without userspace having
> > anything to say about it.

You still have not answered my question: do you really want to force group
isolation for devices in your framework? That is a policy decision from
my POV and thus belongs to userspace and should not be enforced by the kernel.


> > The IOMMU group thing as always been doubt full to me, it is advertise
> > as allowing to share resources (ie IOMMU page table) between devices.
> > But this assume that all device driver in the group have some way of
> > communicating with each other to share common DMA address that point
> > to memory devices care. I believe only VFIO does that and probably
> > only when use by QEMU.
> > 
> > 
> > Anyway my question is:
> > 
> > Is it that much useful to be inside VFIO (to avoid few hundred lines
> > of boiler plate code) given that it forces you into a model (group
> > isolation) that so far have never been the prefered way for all
> > existing device driver that already do what you want to achieve ?
> > 
> 
> You mean to say I create another framework and copy most of the code from VFIO?
> It is hard to believe the mainline kernel will take my code. So how about let me
> try the VFIO way first and try that if it won't work? ;)

There is no trying, this is the kernel: once you expose something to
userspace you have to keep supporting it forever ... There is no "hey,
let's add this new framework and see how it goes" and then removing it a
few kernel versions later ...

That is why i am being pedantic :) about making sure there are good
reasons to do what you do inside VFIO. I do believe that we want a common
framework like the one you are proposing, but i do not believe it should
be part of VFIO, given the baggage it comes with that is not relevant
to the use cases for this kind of device.

Cheers,
Jérôme


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-11 13:40                       ` Jerome Glisse
@ 2018-09-13  8:32                         ` Kenneth Lee
  2018-09-13 14:51                           ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-13  8:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Sanjay Kumar, Hao Fang,
	linux-kernel, linuxarm, iommu, David S . Miller, linux-crypto,
	Zhou Wang, Philippe Ombredanne, Thomas Gleixner, Joerg Roedel,
	linux-accelerators, Lu Baolu

On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:
> 
> On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > > > user application do something on its data, and push it away to the accelerator,
> > > > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > > > > even the VAs are stored inside the memory itself.
> > > > > > > 
> > > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > > > dma buffer object.
> > > > > > > 
> > > > > > 
> > > > > > Thank you for directing me to this implementation. It is interesting:).
> > > > > > 
> > > > > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > > > > i915 do the following:
> > > > > > 
> > > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > > > >    reference.
> > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > > > >    can be shared between the user space and the hardware.
> > > > > > 
> > > > > > But my scenario is: 
> > > > > > 
> > > > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > > > >    of them is ptr2.
> > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > > 
> > > > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > > > 
> > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > > without non SVA/SVM path.
> > > > > 
> > > > 
> > > > I think we should clear the concept of SVA/SVM here. As my understanding, Share
> > > > Virtual Address/Memory means: any virtual address in a process can be used by
> > > > device at the same time. This requires IOMMU device to support PASID. And
> > > > optionally, it requires the feature of page-fault-from-device.
> > > 
> > > Yes we agree on what SVA/SVM is. There is a one gotcha thought, access
> > > to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
> > > it is undefined what happens on some platform for a device trying to
> > > access those using SVA/SVM.
> > > 
> > > 
> > > > But before the feature is settled down, IOMMU can be used immediately in the
> > > > current kernel. That make it possible to assign ONE process's virtual addresses
> > > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > > > process.
> > > 
> > > UH ? How ? You want to GUP _every_ single valid address in the process
> > > and map it to the device ? How do you handle new vma, page being replace
> > > (despite GUP because of things that utimately calls zap pte) ...
> > > 
> > > Again here you said that the device must be able to access _any_ valid
> > > pointer. With GUP this is insane.
> > > 
> > > So i am assuming this is not what you want to do without SVA/SVM ie with
> > > GUP you have a different programming model, one in which the userspace
> > > must first bind _range_ of memory to the device and get a DMA address
> > > for the range.
> > > 
> > > Again, GUP range of process address space to map it to a device so that
> > > userspace can use the device on the mapped range is something that do
> > > exist in various places in the kernel.
> > > 
> > 
> > Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to
> > do so. If some memory is going to be shared among process and devices, we use
> > wd_share_mem(queue, ptr, size) to share those memory. When the queue is working
> > in this mode, the point is valid in those memory segments. The wd_share_mem call
> > vfio dma map syscall which will do GUP. 
> > 
> > If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
> > wd_share_mem() is not necessary.
> > 
> > This is really not popular when we started the work on WarpDrive. The GUP
> > document said it should be put within the scope of mm_sem is locked. Because GUP
> > simply increase the page refcount, not keep the mapping between the page and the
> > vma. We keep our work together with VFIO to make sure the problem can be solved
> > in one deal.
> 
> The problem can not be solved in one deal, you can not maintain vaddr
> pointing to same page after a fork() this can not be solve without the
> use of mmu notifier and device dma mapping invalidation ! So being part
> of VFIO will not help you there.

Good point. But sadly, even with mmu notifier and dma mapping invalidation, I
cannot do anything here. If the process forks a sub-process, the sub-process
needs a new PASID and hardware resource. The mapped IOMMU space should not be
used. The parent process should be aware of this, and unmap and close the
device file before the fork. I have the same limitation as VFIO:(

I don't think I can change much here. If I can, VFIO can too:)

> 
> AFAIK VFIO is fine with the way it is as QEMU do not fork() once it
> is running a guest and thus the COW that would invalidate vaddr to
> physical page assumption is not broken. So i doubt VFIO folks have
> any incentive to go down the mmu notifier path and invalidate device
> mapping. They also have the replay thing that probably handle some
> of fork cases by trusting user space program to do it. In your case
> you can not trust the user space program.
> 
> In your case AFAICT i do not see any warning or gotcha so the following
> scenario is broken (in non SVA/SVM):
>     1) program setup the device (open container, mdev, setup queue, ...)
>     2) program map some range of its address space wih VFIO_IOMMU_MAP_DMA
>     3) program start using the device using map setup in 2)
>     ...
>     4) program fork()
>     5) parent trigger COW inside the range setup in 2)
> 
>     At this point it is the child process that can write to the page that
>     are access by the device (which was map by the parent in 2)). The
>     parent can no longer access that memory from the CPU.
> 
> There is just no sane way to fix this beside invalidating device mapping
> on fork (and you can not rely on userspace to do so) and thus stopping
> the device on fork (SVA/SVM case do not have any issue here).

Indeed. But as soon as we choose to expose the device space to user space,
the limitation is already there. If we want to solve the problem, we have to
add a hook in the copy_process() procedure, copy the parent's queue state to
a new queue, assign it to the child's fd and redirect the child's mmap to
it. If I can do so, the same logic can also be applied to VFIO.

The good side is, this is not a security leak. The hardware has been given to
the process. It is the process that chooses to share it. If it doesn't work, it
is the process's problem;)
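
For the GUP-pinned (non SVA/SVM) case there is also an existing knob a
cooperating process can use instead of unmapping before fork():
madvise(MADV_DONTFORK) keeps a range out of the child entirely, so a later
write in the parent does not COW the pinned pages away. The sketch below
only illustrates that behaviour; child_inherits_range() is a hypothetical
helper, not part of WarpDrive, and this still relies on user space doing
the right thing, which is the trust problem raised above.

```c
#define _DEFAULT_SOURCE         /* for MADV_DONTFORK and MAP_ANONYMOUS */
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): map one page, optionally mark
 * it MADV_DONTFORK, fork, and report whether the child still has the
 * range mapped. Returns 1 if mapped in the child, 0 if not, -1 on error.
 */
static int child_inherits_range(int dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf;
    int status, code;
    pid_t pid;

    if (page <= 0)
        return -1;
    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;                 /* fault the page in, as a DMA buffer would be */

    if (dontfork && madvise(buf, (size_t)page, MADV_DONTFORK) < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        int p[2];

        /* Probe the address without crashing: write() copies from buf
         * and fails with EFAULT when the range is not mapped. */
        if (pipe(p) < 0)
            _exit(2);
        if (write(p[1], buf, 1) == 1)
            _exit(1);           /* range is mapped in the child */
        _exit(errno == EFAULT ? 0 : 2);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        munmap(buf, (size_t)page);
        return -1;
    }
    munmap(buf, (size_t)page);
    code = WEXITSTATUS(status);
    return code == 1 ? 1 : (code == 0 ? 0 : -1);
}
```

child_inherits_range(0) returns 1 (the child inherits the range as usual),
while child_inherits_range(1) returns 0 (the range does not exist in the
child at all).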

> 
> > And now we have GUP-longterm and many accounting work in VFIO, we don't want to
> > do that again.
> 
> GUP-longterm does not solve any GUP problem, it just block people to
> do GUP on DAX backed vma to avoid pining persistent memory as it is
> a nightmare to handle in the block device driver and file system code.
> 
> The accounting is the rt limit thing and is litteraly 10 lines of
> code so i would not see that as hard to replicate.

OK. Agree.

> 
> 
> > > > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > > > from the feature in the future. It dose not means WarpDrive is useless before
> > > > that. And it works for our Zip and RSA accelerators in physical world.
> > > 
> > > Just not with random process address ...
> > > 
> > > > > If you still want non SVA/SVM path what you want to do only works
> > > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > > > device (moreover you need DMA address to match process address
> > > > > which is not an easy feat).
> > > > > 
> > > > > Now even if you only want SVA/SVM, i do not see what is the point
> > > > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > > > 
> > > > > For SVA/SVM your usage model is:
> > > > > 
> > > > > Setup:
> > > > >     - user space create a warp drive context for the process
> > > > >     - user space create a device specific context for the process
> > > > >     - user space create a user space command queue for the device
> > > > >     - user space bind command queue
> > > > > 
> > > > >     At this point the kernel driver has bound the process address
> > > > >     space to the device with a command queue and userspace
> > > > > 
> > > > > Usage:
> > > > >     - user space schedule work and call appropriate flush/update
> > > > >       ioctl from time to time. Might be optional depends on the
> > > > >       hardware, but probably a good idea to enforce so that kernel
> > > > >       can unbind the command queue to bind another process command
> > > > >       queue.
> > > > >     ...
> > > > > 
> > > > > Cleanup:
> > > > >     - user space unbind command queue
> > > > >     - user space destroy device specific context
> > > > >     - user space destroy warp drive context
> > > > >     All the above can be implicit when closing the device file.
> > > > > 
> > > > > So again in the above model i do not see anywhere something from
> > > > > VFIO that would benefit this model.
> > > > > 
> > > > 
> > > > Let me show you how the model will be if I use VFIO:
> > > > 
> > > > Setup (Kernel part)
> > > > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > > > 	  can still be registered to netdev, encryptor can still be registered
> > > > 	  to crypto...
> > > > 	- At the same time, the driver can devote some of its hardware resource
> > > > 	  and register them as a mdev creator to the VFIO framework. This just
> > > > 	  need limited change to the VFIO type1 driver.
> > > 
> > > In the above VFIO does not help you one bit ... you can do that with
> > > as much code with new common device as front end.
> > > 
> > > > Setup (User space)
> > > > 	- System administrator create mdev via the mdev creator interface.
> > > > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > > > 	  only one group for one device.
> > > > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > > > 	  the PASID to the device via VFIO interface.
> > > > 	- Get the device from the group via VFIO interface and mmap it the user
> > > > 	  space for device's MMIO access (for the queue).
> > > > 	- Map whatever memory you need to share with the device with VFIO
> > > > 	  interface.
> > > > 	- (opt) Add more devices into the container if you want to share the
> > > > 	  same address space with them
> > > 
> > > So all VFIO buys you here is boiler plate code that does insert_pfn()
> > > to handle MMIO mapping. Which is just couple hundred lines of boiler
> > > plate code.
> > > 
> > 
> > No. With VFIO, I don't need to:
> > 
> > 1. GUP and accounting for RLIMIT_MEMLOCK
> 
> That's 10 line of code ...
> 
> > 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)
> 
> GUP pages are not part of rb_tree and what you want to do can be done
> in few lines of code here is pseudo code:
> 
> /* Pin npages of user memory starting at vaddr and map them, iova == vaddr. */
> warp_dma_map_range(ulong vaddr, ulong npages)
> {
>     struct page **pages = kvzalloc(npages * sizeof(*pages));
> 
>     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
>         GUP(vaddr, &pages[i]);
>         iommu_map(vaddr, page_to_phys(pages[i]));
>     }
>     kvfree(pages);
> }
> 
> /* Undo the above: look up each page through the IOMMU, unmap and unpin it. */
> warp_dma_unmap_range(ulong vaddr, ulong npages)
> {
>     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
>         unsigned long pfn;
> 
>         pfn = iommu_iova_to_phys(vaddr) >> PAGE_SHIFT;
>         iommu_unmap(vaddr);
>         put_page(pfn_to_page(pfn)); /* set dirty if mapped write */
>     }
> }
> 

But what if the process exits without unmapping? The pages will be pinned in the
kernel forever.

> Add locking, error handling, dirtying and comments and you are barely
> looking at a couple hundred lines of code. You do not need any of the
> complexity of VFIO as you do not have the same requirements. Namely
> VFIO has to keep track of iova and physical mapping for things like
> migration (migrating a guest between hosts) and a few other very
> virtualization centric requirements.
> 
> 
> > 2. Handle the PASID on SMMU (ARM's IOMMU) myself.
> 
> Existing drivers do that in 20 lines of code, with comments and error
> handling (see kfd_iommu_bind_process_to_device() for instance). i
> doubt you need much more than that.
> 

OK, I agree.

> 
> > 3. Multiple devices menagement (VFIO uses container to manage this)
> 
> All the vfio_group* stuff ? OK that's boiler plate code, not that
> hard to replicate though.

No, I meant the container thing. Several devices/groups can be assigned to the
same container, and a DMA mapping made on the container applies to all of those
devices. So several devices can share the same address space.

> 
> > And even as a boiler plate, it is valueable, the memory thing is sensitive
> > interface to user space, it can easily become a security problem. If I can
> > achieve my target within the scope of VFIO, why not? At lease it has been
> > proved to be safe for the time being.
> 
> The thing is being part of VFIO imposes things on you, things that you
> do not need. Like one device per group (maybe it is you imposing this,
> i am losing track here). Or the complex dma mapping tracking ...
> 

Err... But the one-device-per-group is not VFIO's decision. It is IOMMU's :).
Unless I don't use IOMMU.

> 
> > > > Cleanup:
> > > > 	- User space close the group file handler
> > > > 	- There will be a problem to let the other process know the mdev is
> > > > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex
> > > > 	  dose not like it. But it is not a big problem. We can always have a
> > > > 	  scheduler process to manage the state of the mdev or even we can
> > > > 	  switch back to the RFCv1 solution without too much effort if we like
> > > > 	  in the future.
> > > 
> > > If you were outside VFIO you would have more freedom on how to do that.
> > > For instance process opening the device file can be placed on queue and
> > > first one in the queue get to use the device until it closes/release the
> > > device. Then next one in queue get the device ...
> > 
> > Yes. I do like the file handle solution. But I hope the solution become mature
> > as soon as possible. Many of our products, and as I know include some of our
> > partners, are waiting for a long term solution as direction. If I rely on some
> > unmature solution, they may choose some deviated, customized solution. That will
> > be much harmful. Compare to this, the freedom is not so important...
> 
> I do not see how being part of VFIO protects you from people doing crazy
> things to their kernel ... Time to market being key in this world, i doubt
> that being part of VFIO would make anyone think twice before taking a
> shortcut.
> 
> I have seen horrible things on that front and only players like Google
> can impose a minimum level of sanity.
> 

OK, my fault for bringing up time to market. It has nothing to do with the
architecture decision. But I don't yet see what harm using VFIO would do when it
can fulfill almost all of my requirements.

> 
> > > > Except for the minimum update to the type1 driver and use sdmdev to manage the
> > > > interrupt sharing, I don't need any extra code to gain the address sharing
> > > > capability. And the capability will be strengthen along with the upgrade of VFIO.
> > > > 
> > > > > 
> > > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > > > invent another wheel? It has most of stuff I need:
> > > > > > > > 
> > > > > > > > 1. Connecting multiple devices to the same application space
> > > > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > > > 3. Managing hardware resource by device
> > > > > > > > 
> > > > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > > > 
> > > > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > > > > > can be construded as userspace driver too.
> > > > > > > 
> > > > > > 
> > > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > > > should I spend so much effort to do it again?
> > > > > 
> > > > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > > > i do not see anything in VFIO that you need.
> > > > > 
> > > > 
> > > > As I have explain, if I don't use VFIO, at lease I have to do all that has been
> > > > done in i915 or even more than that.
> > > 
> > > So beside the MMIO mmap() handling and dma mapping of range of user space
> > > address space (again all very boiler plate code duplicated accross the
> > > kernel several time in different forms). You do not gain anything being
> > > inside VFIO right ?
> > > 
> > 
> > As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
> > user interface are our concern.
> > > 
> > > > > > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > > > about writting command into command buffer AFAICT.
> > > > > > > 
> > > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > > > kernel).
> > > > > > > 
> > > > > > 
> > > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > > > service me perfectly. I don't understand why I should choose a long route.
> > > > > 
> > > > > Again this is not the long route i do not see anything in VFIO that
> > > > > benefit you in the SVA/SVM case. A basic character device driver can
> > > > > do that.
> > > > > 
> > > > > 
> > > > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > > > you define clearly each scenario because from where i sit it looks like
> > > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > > > is disabled.
> > > > > > > 
> > > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > > > unless your device has its own page table and all commands are relative
> > > > > > > to that page table and the device page table is populated by kernel driver
> > > > > > > in secure way (ie by checking that what is populated can be access).
> > > > > > > 
> > > > > > > I do not believe your example device to have such page table nor do i see
> > > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > > > each commands.
> > > > > > > 
> > > > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > > > to have random program use your device. I am pretty sure that all user
> > > > > > > of VFIO are trusted process (like QEMU).
> > > > > > > 
> > > > > > > 
> > > > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > > > > > useless for your usecase. I really do not see the point of that, it
> > > > > > > does complicate things for you for no reasons AFAICT.
> > > > > > 
> > > > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like
> > > > > > it very much either;). But the problem is, the group reflects to the same
> > > > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I
> > > > > > cannot ignore it. I have to take it into account event I don't use VFIO.
> > > > > 
> > > > > To me it seems you are making a policy decission in kernel space ie
> > > > > wether the device should be isolated in its own group or not is a
> > > > > decission that is up to the sys admin or something in userspace.
> > > > > Right now existing user of SVA/SVM don't (at least AFAICT).
> > > > > 
> > > > > Do we really want to force such isolation ?
> > > > > 
> > > > 
> > > > But it is not my decision, that how the iommu subsystem is designed. Personally
> > > > I don't like it at all, because all our hardwares have their own stream id
> > > > (device id). I don't need the group concept at all. But the iommu subsystem
> > > > assume some devices may share the name device ID to a single IOMMU.
> > > 
> > > My question was do you really want to force group isolation for the
> > > device ? Existing SVA/SVM capable driver do not force that, they let
> > > the userspace decide this (sysadm, distributions, ...). Being part of
> > > VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> > > force this decision ie make a policy decision without userspace having
> > > anything to say about it.
> 
> You still do not answer my question, do you really want to force group
> isolation for devices in your framework ? Which is a policy decision from
> my POV and thus belongs to userspace and should not be enforced by the kernel.

No. But I have to follow the rules defined by the IOMMU subsystem, don't I?

> 
> 
> > > The IOMMU group thing as always been doubt full to me, it is advertise
> > > as allowing to share resources (ie IOMMU page table) between devices.
> > > But this assume that all device driver in the group have some way of
> > > communicating with each other to share common DMA address that point
> > > to memory devices care. I believe only VFIO does that and probably
> > > only when use by QEMU.
> > > 
> > > 
> > > Anyway my question is:
> > > 
> > > Is it that much useful to be inside VFIO (to avoid few hundred lines
> > > of boiler plate code) given that it forces you into a model (group
> > > isolation) that so far have never been the prefered way for all
> > > existing device driver that already do what you want to achieve ?
> > > 
> > 
> > You mean to say I create another framework and copy most of the code from VFIO?
> > It is hard to believe the mainline kernel will take my code. So how about let me
> > try the VFIO way first and try that if it won't work? ;)
> 
> There is no trying, this is the kernel, once you expose something to
> userspace you have to keep supporting it forever ... There is no, hey
> let's add this new framework and see how it goes and remove it a few
> kernel versions later ...
> 

No, I didn't mean I was not serious when I said "try". I was just not sure
whether the community would accept it.

Can Alex say something on this? Is this scenario within the future scope of
VFIO? If it is, we have a reason to solve the problems along the way. If it is
not, we should choose another way, even if we have to copy most of the code.

> That is why i am being pedantic :) on making sure there are good reasons
> to do what you do inside VFIO. I do believe that we want a common
> framework like the one you are proposing but i do not believe it should be
> part of VFIO given the baggage it comes with, which is not relevant
> to the use cases for this kind of device.

Understood. And I appreciate the discussion and help:)

Cheers
> 
> Cheers,
> Jérôme

-- 
			-Kenneth(Hisilicon)


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-13  8:32                         ` Kenneth Lee
@ 2018-09-13 14:51                           ` Jerome Glisse
  2018-09-14  3:12                             ` Kenneth Lee
  2018-09-14  6:50                             ` Tian, Kevin
  0 siblings, 2 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-13 14:51 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Sanjay Kumar, Hao Fang,
	linux-kernel, linuxarm, iommu, David S . Miller, linux-crypto,
	Zhou Wang, Philippe Ombredanne, Thomas Gleixner, Joerg Roedel,
	linux-accelerators, Lu Baolu

On Thu, Sep 13, 2018 at 04:32:32PM +0800, Kenneth Lee wrote:
> On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:
> > On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> > > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > > > > user application do something on its data, and push it away to the accelerator,
> > > > > > > > > and says: "I'm tied, it is your turn to do the job...". Then the accelerator has
> > > > > > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > > > > > even the VAs are stored inside the memory itself.
> > > > > > > > 
> > > > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > > > > dma buffer object.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thank you for directing me to this implementation. It is interesting:).
> > > > > > > 
> > > > > > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > > > > > i915 do the following:
> > > > > > > 
> > > > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > > > > >    reference.
> > > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > > > > >    can be shared between the user space and the hardware.
> > > > > > > 
> > > > > > > But my scenario is: 
> > > > > > > 
> > > > > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > > > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > > > > >    of them is ptr2.
> > > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > > > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > > > 
> > > > > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > > > > 
> > > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > > > without non SVA/SVM path.
> > > > > > 
> > > > > 
> > > > > I think we should clear the concept of SVA/SVM here. As my understanding, Share
> > > > > Virtual Address/Memory means: any virtual address in a process can be used by
> > > > > device at the same time. This requires IOMMU device to support PASID. And
> > > > > optionally, it requires the feature of page-fault-from-device.
> > > > 
> > > > Yes we agree on what SVA/SVM is. There is a one gotcha thought, access
> > > > to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
> > > > it is undefined what happens on some platform for a device trying to
> > > > access those using SVA/SVM.
> > > > 
> > > > 
> > > > > But before the feature is settled down, IOMMU can be used immediately in the
> > > > > current kernel. That make it possible to assign ONE process's virtual addresses
> > > > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > > > > process.
> > > > 
> > > > UH ? How ? You want to GUP _every_ single valid address in the process
> > > > and map it to the device ? How do you handle new vma, page being replace
> > > > (despite GUP because of things that utimately calls zap pte) ...
> > > > 
> > > > Again here you said that the device must be able to access _any_ valid
> > > > pointer. With GUP this is insane.
> > > > 
> > > > So i am assuming this is not what you want to do without SVA/SVM ie with
> > > > GUP you have a different programming model, one in which the userspace
> > > > must first bind _range_ of memory to the device and get a DMA address
> > > > for the range.
> > > > 
> > > > Again, GUP range of process address space to map it to a device so that
> > > > userspace can use the device on the mapped range is something that do
> > > > exist in various places in the kernel.
> > > > 
> > > 
> > > Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to
> > > do so. If some memory is going to be shared among process and devices, we use
> > > wd_share_mem(queue, ptr, size) to share those memory. When the queue is working
> > > in this mode, the point is valid in those memory segments. The wd_share_mem call
> > > vfio dma map syscall which will do GUP. 
> > > 
> > > If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
> > > wd_share_mem() is not necessary.
> > > 
> > > This is really not popular when we started the work on WarpDrive. The GUP
> > > document said it should be put within the scope of mm_sem is locked. Because GUP
> > > simply increase the page refcount, not keep the mapping between the page and the
> > > vma. We keep our work together with VFIO to make sure the problem can be solved
> > > in one deal.
> > 
> > The problem can not be solved in one deal, you can not maintain vaddr
> > pointing to same page after a fork() this can not be solve without the
> > use of mmu notifier and device dma mapping invalidation ! So being part
> > of VFIO will not help you there.
> 
> Good point. But sadly, even with mmu notifier and dma mapping invalidation, I
> cannot do anything here. If the process forks a sub-process, the sub-process
> needs a new pasid and hardware resource. The mapped IOMMU space should not be
> used. The parent process should be aware of this, and unmap and close the
> device file before the fork. I have the same limitation as VFIO:(
> 
> I don't think I can change much here. If I can, VFIO can too:)

Forbidding the child from accessing the device is easy in the kernel: whenever
someone opens the device file, force set the O_CLOEXEC flag on the file.
Some device drivers already do that and so should you. With that you
should always have a one to one relationship between struct file and mm
struct, and thus one PASID per struct file ie per open of the device file.

That does not solve the GUP/fork issue i describe below.
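
To make the close-on-exec point concrete, here is a small user-space sketch of
what the flag looks like from the process side. It is only an illustration:
the helper names are made up, /dev/null stands in for the device file in any
usage, and a driver would set the flag from its open path rather than through
fcntl(). Note that close-on-exec only takes effect at execve(); a forked child
that never execs still inherits the descriptor, which is why other checks are
needed on top.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is close-on-exec, 0 if it is not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Open path read-only with close-on-exec set atomically at open time. */
int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Force the flag on an already open descriptor; 0 on success, -1 on error. */
int force_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```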


> > AFAIK VFIO is fine with the way it is as QEMU do not fork() once it
> > is running a guest and thus the COW that would invalidate vaddr to
> > physical page assumption is not broken. So i doubt VFIO folks have
> > any incentive to go down the mmu notifier path and invalidate device
> > mapping. They also have the replay thing that probably handle some
> > of fork cases by trusting user space program to do it. In your case
> > you can not trust the user space program.
> > 
> > In your case AFAICT i do not see any warning or gotcha so the following
> > scenario is broken (in non SVA/SVM):
> >     1) program setup the device (open container, mdev, setup queue, ...)
> >     2) program map some range of its address space with VFIO_IOMMU_MAP_DMA
> >     3) program start using the device using map setup in 2)
> >     ...
> >     4) program fork()
> >     5) parent trigger COW inside the range setup in 2)
> > 
> >     At this point it is the child process that can write to the page that
> >     are access by the device (which was map by the parent in 2)). The
> >     parent can no longer access that memory from the CPU.
> > 
> > There is just no sane way to fix this beside invalidating device mapping
> > on fork (and you can not rely on userspace to do so) and thus stopping
> > the device on fork (SVA/SVM case do not have any issue here).
> 
> Indeed. But as soon as we choose to expose the device space to the user space,
> the limitation is already there. If we want to solve the problem, we have to
> have a hook in the copy_process() procedure and copy the parent's queue state to
> a new queue, assign it to the child's fd and redirect the child's mmap to
> it. If I can do so, the same logic can also be applied to VFIO.

Except we do not want to do that and this does not solve the COW i describe
above unless you disable COW altogether which is a big no.
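
The CPU side of that COW behaviour can be observed from user space with a
minimal sketch like the one below. It only shows that, once the parent writes
after fork(), parent and child no longer see the same data at the same virtual
address; it can not show which physical page a device mapping still points at.
The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Put `before` in a private anonymous page, fork, let the parent overwrite
 * it with `after`, then return the byte the child reads at the same
 * virtual address (or -1 on error).
 */
int child_view_after_parent_write(unsigned char before, unsigned char after)
{
    int go[2], back[2];
    unsigned char *page, seen = 0;
    char token = 0;
    pid_t pid;
    long psz = sysconf(_SC_PAGESIZE);

    assert(psz > 0);
    page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    page[0] = before;
    if (pipe(go) != 0 || pipe(back) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: wait until the parent has written, then report our view. */
        close(go[1]);
        close(back[0]);
        if (read(go[0], &token, 1) != 1)
            _exit(1);
        if (write(back[1], &page[0], 1) != 1)
            _exit(1);
        _exit(0);
    }

    close(go[0]);
    close(back[1]);
    page[0] = after;                 /* parent write breaks the sharing */
    if (write(go[1], &token, 1) != 1)
        return -1;
    if (read(back[0], &seen, 1) != 1)
        return -1;
    waitpid(pid, NULL, 0);
    close(go[1]);
    close(back[0]);
    munmap(page, (size_t)psz);
    return seen;
}
```

The child still reads the old value, so after the write the two processes are
backed by different pages even though the virtual address is unchanged.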

> The good side is, this is not a security leak. The hardware has been given to
> the process. It is the process who choose to share it. If it won't work, it is
> the process's problem;)

No this is bad, you can not expect every single userspace program to know and
be aware of that. If some trusted application (say systemd or firefox, ...)
starts using your device and is unaware of, or does not comprehend, all the
side effects, it would allow the child to access/change its parent's memory and
that is bad. Device drivers need to protect against users doing stupid things.
All existing device drivers that do ATS/PASID already protect themselves
against the child (ie the child can not use the device, through a mix of
O_CLOEXEC and other checks).

You must set the O_CLOEXEC flag but that does not solve the COW problem above.

My motto in life is "do not trust userspace" :) which also translates to "do
not expect userspace to do the right thing".


> > > And now we have GUP-longterm and many accounting work in VFIO, we don't want to
> > > do that again.
> > 
> > GUP-longterm does not solve any GUP problem, it just blocks people from
> > doing GUP on DAX backed vma to avoid pinning persistent memory, as that is
> > a nightmare to handle in the block device driver and file system code.
> > 
> > The accounting is the rlimit thing and is literally 10 lines of
> > code so i would not see that as hard to replicate.
> 
> OK. Agree.
> 
> > 
> > 
> > > > > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > > > > from the feature in the future. It dose not means WarpDrive is useless before
> > > > > that. And it works for our Zip and RSA accelerators in physical world.
> > > > 
> > > > Just not with random process address ...
> > > > 
> > > > > > If you still want non SVA/SVM path what you want to do only works
> > > > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > > > > device (moreover you need DMA address to match process address
> > > > > > which is not an easy feat).
> > > > > > 
> > > > > > Now even if you only want SVA/SVM, i do not see what is the point
> > > > > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > > > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > > > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > > > > 
> > > > > > For SVA/SVM your usage model is:
> > > > > > 
> > > > > > Setup:
> > > > > >     - user space create a warp drive context for the process
> > > > > >     - user space create a device specific context for the process
> > > > > >     - user space create a user space command queue for the device
> > > > > >     - user space bind command queue
> > > > > > 
> > > > > >     At this point the kernel driver has bound the process address
> > > > > >     space to the device with a command queue and userspace
> > > > > > 
> > > > > > Usage:
> > > > > >     - user space schedule work and call appropriate flush/update
> > > > > >       ioctl from time to time. Might be optional depends on the
> > > > > >       hardware, but probably a good idea to enforce so that kernel
> > > > > >       can unbind the command queue to bind another process command
> > > > > >       queue.
> > > > > >     ...
> > > > > > 
> > > > > > Cleanup:
> > > > > >     - user space unbind command queue
> > > > > >     - user space destroy device specific context
> > > > > >     - user space destroy warp drive context
> > > > > >     All the above can be implicit when closing the device file.
> > > > > > 
> > > > > > So again in the above model i do not see anywhere something from
> > > > > > VFIO that would benefit this model.
> > > > > > 
> > > > > 
> > > > > Let me show you how the model will be if I use VFIO:
> > > > > 
> > > > > Setup (Kernel part)
> > > > > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > > > > 	  can still be registered to netdev, encryptor can still be registered
> > > > > 	  to crypto...
> > > > > 	- At the same time, the driver can devote some of its hardware resource
> > > > > 	  and register them as a mdev creator to the VFIO framework. This just
> > > > > 	  need limited change to the VFIO type1 driver.
> > > > 
> > > > In the above VFIO does not help you one bit ... you can do that with
> > > > as much code with new common device as front end.
> > > > 
> > > > > Setup (User space)
> > > > > 	- System administrator create mdev via the mdev creator interface.
> > > > > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > > > > 	  only one group for one device.
> > > > > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > > > > 	  the PASID to the device via VFIO interface.
> > > > > 	- Get the device from the group via VFIO interface and mmap it the user
> > > > > 	  space for device's MMIO access (for the queue).
> > > > > 	- Map whatever memory you need to share with the device with VFIO
> > > > > 	  interface.
> > > > > 	- (opt) Add more devices into the container if you want to share the
> > > > > 	  same address space with them
> > > > 
> > > > So all VFIO buys you here is boiler plate code that does insert_pfn()
> > > > to handle MMIO mapping. Which is just couple hundred lines of boiler
> > > > plate code.
> > > > 
> > > 
> > > No. With VFIO, I don't need to:
> > > 
> > > 1. GUP and accounting for RLIMIT_MEMLOCK
> > 
> > That's 10 line of code ...
> > 
> > > 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)
> > 
> > GUP pages are not part of rb_tree and what you want to do can be done
> > in few lines of code here is pseudo code:
> > 
> > /* Pin npages of user memory starting at vaddr and map them, iova == vaddr. */
> > warp_dma_map_range(ulong vaddr, ulong npages)
> > {
> >     struct page **pages = kvzalloc(npages * sizeof(*pages));
> > 
> >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> >         GUP(vaddr, &pages[i]);
> >         iommu_map(vaddr, page_to_phys(pages[i]));
> >     }
> >     kvfree(pages);
> > }
> > 
> > /* Undo the above: look up each page through the IOMMU, unmap and unpin it. */
> > warp_dma_unmap_range(ulong vaddr, ulong npages)
> > {
> >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> >         unsigned long pfn;
> > 
> >         pfn = iommu_iova_to_phys(vaddr) >> PAGE_SHIFT;
> >         iommu_unmap(vaddr);
> >         put_page(pfn_to_page(pfn)); /* set dirty if mapped write */
> >     }
> > }
> > 
> 
> But what if the process exits without unmapping? The pages will be pinned in the
> kernel forever.

Yeah, add a struct warp_map { struct list_head list; unsigned long vaddr;
unsigned long npages; } for every mapping, store the list head in the
private data field of the device's struct file, and when the release fs
callback is called you can walk down the list to force unmap any leftover.
This is not that much code. You can even use an interval tree, which is 3
lines of code with interval_tree_generic.h, to speed up warp_map lookup on
the unmap ioctl.
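
A rough user-space model of that bookkeeping, as a sketch only. The names
warp_file, warp_track_map(), warp_untrack_map() and warp_release() are
hypothetical, a plain singly linked list stands in for struct list_head, and
the pinned_pages counter stands in for the real GUP/iommu_map() work:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of one tracked mapping (not kernel API). */
struct warp_map {
    struct warp_map *next;
    unsigned long vaddr;
    unsigned long npages;
};

/* Stand-in for the per-open private data hung off struct file. */
struct warp_file {
    struct warp_map *maps;      /* head of the mapping list */
    unsigned long pinned_pages; /* models pages currently pinned */
};

/* Map ioctl path: record the range; a driver would GUP + iommu_map() here. */
static int warp_track_map(struct warp_file *wf, unsigned long vaddr,
                          unsigned long npages)
{
    struct warp_map *m = malloc(sizeof(*m));

    if (!m)
        return -1;
    m->vaddr = vaddr;
    m->npages = npages;
    m->next = wf->maps;
    wf->maps = m;
    wf->pinned_pages += npages;
    return 0;
}

/* Unmap ioctl path: look the record up by start address and drop it. */
static int warp_untrack_map(struct warp_file *wf, unsigned long vaddr)
{
    struct warp_map **pp;

    for (pp = &wf->maps; *pp; pp = &(*pp)->next) {
        if ((*pp)->vaddr == vaddr) {
            struct warp_map *m = *pp;

            *pp = m->next;
            wf->pinned_pages -= m->npages;
            free(m);
            return 0;
        }
    }
    return -1; /* no such mapping */
}

/* Release path: force-unmap whatever is left; returns how many were left. */
static unsigned long warp_release(struct warp_file *wf)
{
    unsigned long leftovers = 0;

    while (wf->maps) {
        struct warp_map *m = wf->maps;

        wf->maps = m->next;
        wf->pinned_pages -= m->npages;
        free(m);
        leftovers++;
    }
    return leftovers;
}
```

The linear lookup in warp_untrack_map() is the part an interval tree would
replace.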


> > Add locking, error handling, dirtying and comments and you are barely
> > looking at couple hundred lines of code. You do not need any of the
> > complexity of VFIO as you do not have the same requirements. Namely
> > VFIO have to keep track of iova and physical mapping for things like
> > migration (migrating guest between host) and few others very
> > virtualization centric requirements.
> > 
> > 
> > > 2. Handle the PASID on SMMU (ARM's IOMMU) myself.
> > 
> > Existing drivers do that in 20 lines of code, comments and error
> > handling included (see kfd_iommu_bind_process_to_device() for instance); i
> > doubt you need much more than that.
> > 
> 
> OK, I agree.
> 
> > 
> > > 3. Multiple devices management (VFIO uses container to manage this)
> > 
> > All the vfio_group* stuff ? OK that's boiler plate code, not that
> > hard to replicate though.
> 
> No, I meant the container thing. Several devices/groups can be assigned to the
> same container and the DMA mappings on the container apply to all those
> devices. So we can have several devices share the same address space.

This was the motivation of my question below, to me this is a policy
decision and it should be left to userspace to decide but not forced
upon userspace because it uses a given device driver.

Maybe i am wrong but i think you can create container and device group
without having a VFIO driver for the devices in the group. It is not
something i do often so i might be wrong here.


> > > And even as boilerplate, it is valuable. The memory handling is a sensitive
> > > interface to user space; it can easily become a security problem. If I can
> > > achieve my target within the scope of VFIO, why not? At least it has been
> > > proved to be safe for the time being.
> > 
> > The thing is, being part of VFIO imposes things on you, things that you
> > do not need. Like one device per group (maybe it is you imposing this,
> > i am losing track here). Or the complex dma mapping tracking ...
> > 
> 
> Err... But the one-device-per-group is not VFIO's decision. It is IOMMU's :).
> Unless I don't use IOMMU.

AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
by default at boot or at least all devices behind the same bridge.

Maybe there are kernel options to avoid that, and a userspace init program
can definitely re-arrange that based on sysadmin policy.


> > > > > Cleanup:
> > > > > 	- User space close the group file handler
> > > > > 	- There will be a problem to let the other process know the mdev is
> > > > > 	  freed to be used again. My RFCv1 chose a file handler solution. Alex
> > > > > 	  does not like it. But it is not a big problem. We can always have a
> > > > > 	  scheduler process to manage the state of the mdev or even we can
> > > > > 	  switch back to the RFCv1 solution without too much effort if we like
> > > > > 	  in the future.
> > > > 
> > > > If you were outside VFIO you would have more freedom on how to do that.
> > > > For instance process opening the device file can be placed on queue and
> > > > first one in the queue get to use the device until it closes/release the
> > > > device. Then next one in queue get the device ...
> > > 
> > > Yes. I do like the file handle solution. But I hope the solution becomes mature
> > > as soon as possible. Many of our products, and as far as I know some of our
> > > partners', are waiting for a long-term solution as a direction. If I rely on some
> > > immature solution, they may choose some deviated, customized solution. That would
> > > be much more harmful. Compared to this, the freedom is not so important...
> > 
> > I do not see how being part of VFIO protect you from people doing crazy
> > thing to their kernel ... Time to market being key in this world, i doubt
> > that being part of VFIO would make anyone think twice before taking a
> > shortcut.
> > 
> > I have seen horrible things on that front and only players like Google
> > can impose a minimum level of sanity.
> > 
> 
> OK. My fault for bringing up time to market. It has nothing to do with the
> architecture decision. But I don't yet see what harm will be done if I use VFIO
> when it can fulfill almost all my requirements.

The harm is in forcing the device group isolation policy, which is not
necessary for all devices, as you said yourself: on ARM the device stream
id lets the IOMMU identify individual devices.

So i would rather see the device isolation as something orthogonal to
what you want to achieve and that should not be forced upon the user, ie the
sysadmin should control that and distributions can have sane defaults for
each platform.


> > > > > Except for the minimum update to the type1 driver and use sdmdev to manage the
> > > > > interrupt sharing, I don't need any extra code to gain the address sharing
> > > > > capability. And the capability will be strengthen along with the upgrade of VFIO.
> > > > > 
> > > > > > 
> > > > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > > > > invent another wheel? It has most of stuff I need:
> > > > > > > > > 
> > > > > > > > > 1. Connecting multiple devices to the same application space
> > > > > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > > > > 3. Managing hardware resource by device
> > > > > > > > > 
> > > > > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > > > > 
> > > > > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > > > > drivers have a sizeable userspace side to them (anything with ioctl does) so they
> > > > > > > > can be construed as userspace drivers too.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > > > > should I spend so much effort to do it again?
> > > > > > 
> > > > > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > > > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > > > > i do not see anything in VFIO that you need.
> > > > > > 
> > > > > 
> > > > > As I have explained, if I don't use VFIO, at least I have to do all that has been
> > > > > done in i915 or even more than that.
> > > > 
> > > > So beside the MMIO mmap() handling and dma mapping of range of user space
> > > > address space (again all very boiler plate code duplicated accross the
> > > > kernel several time in different forms). You do not gain anything being
> > > > inside VFIO right ?
> > > > 
> > > 
> > > As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
> > > user interface are our concern.
> > > > 
> > > > > > > > So there is no reason to do that under VFIO. Especially as in your example
> > > > > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > > > > about writing commands into a command buffer AFAICT.
> > > > > > > > 
> > > > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > > > > kernel).
> > > > > > > > 
> > > > > > > 
> > > > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > > > > service me perfectly. I don't understand why I should choose a long route.
> > > > > > 
> > > > > > Again this is not the long route i do not see anything in VFIO that
> > > > > > benefit you in the SVA/SVM case. A basic character device driver can
> > > > > > do that.
> > > > > > 
> > > > > > 
> > > > > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > > > > you define clearly each scenario because from where i sit it looks like
> > > > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > > > > is disabled.
> > > > > > > > 
> > > > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > > > > unless your device has its own page table and all commands are relative
> > > > > > > > to that page table and the device page table is populated by kernel driver
> > > > > > > > in secure way (ie by checking that what is populated can be access).
> > > > > > > > 
> > > > > > > > I do not believe your example device to have such page table nor do i see
> > > > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > > > > each commands.
> > > > > > > > 
> > > > > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > > > > to have random program use your device. I am pretty sure that all user
> > > > > > > > of VFIO are trusted process (like QEMU).
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Finally i am convinced that the IOMMU grouping stuff related to VFIO is
> > > > > > > > useless for your usecase. I really do not see the point of that, it
> > > > > > > > does complicate things for you for no reason AFAICT.
> > > > > > > 
> > > > > > > Indeed, I don't like the group thing. I believe VFIO's maintainers would not like
> > > > > > > it very much either;). But the problem is, the group reflects the same
> > > > > > > IOMMU (unit), which may be shared with other devices.  It is a security problem. I
> > > > > > > cannot ignore it. I have to take it into account even if I don't use VFIO.
> > > > > > 
> > > > > > To me it seems you are making a policy decision in kernel space, ie
> > > > > > whether the device should be isolated in its own group or not is a
> > > > > > decision that is up to the sys admin or something in userspace.
> > > > > > Right now existing users of SVA/SVM don't (at least AFAICT).
> > > > > > 
> > > > > > Do we really want to force such isolation ?
> > > > > > 
> > > > > 
> > > > > But it is not my decision, that is how the iommu subsystem is designed. Personally
> > > > > I don't like it at all, because all our hardware has its own stream id
> > > > > (device id). I don't need the group concept at all. But the iommu subsystem
> > > > > assumes some devices may present the same device ID to a single IOMMU.
> > > > 
> > > > My question was do you really want to force group isolation for the
> > > > device ? Existing SVA/SVM capable driver do not force that, they let
> > > > the userspace decide this (sysadm, distributions, ...). Being part of
> > > > VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> > > > force this decision ie make a policy decision without userspace having
> > > > anything to say about it.
> > 
> > You still do not answer my question, do you really want to force group
> > isolation for device in your framework ? Which is a policy decision from
> > my POV and thus belong to userspace and should not be enforce by kernel.
> 
> No. But I have to follow the rule defined by IOMMU, haven't I?

The IOMMU rule does not say that every device _must_ always be in one
group and one domain only. I am pretty sure on x86 by default you get
one domain for all PCIE devices behind same bridge.

My point is that the device grouping into domain/group should be an
orthogonal decision, ie it should not be a requirement of the device
driver and should be under control of userspace as it is a policy
decision.

One exception is if it is unsafe to have devices share a domain, in which
case, following my motto, the driver should refuse to work and return
an error on open (and a kernel message explaining why). But this depends
on device and platform.


> > > > The IOMMU group thing has always been doubtful to me, it is advertised
> > > > as allowing to share resources (ie IOMMU page table) between devices.
> > > > But this assumes that all device drivers in the group have some way of
> > > > communicating with each other to share common DMA addresses that point
> > > > to memory the devices care about. I believe only VFIO does that and probably
> > > > only when used by QEMU.
> > > > 
> > > > 
> > > > Anyway my question is:
> > > > 
> > > > Is it that much useful to be inside VFIO (to avoid few hundred lines
> > > > of boiler plate code) given that it forces you into a model (group
> > > > isolation) that so far have never been the prefered way for all
> > > > existing device driver that already do what you want to achieve ?
> > > > 
> > > 
> > > You mean to say I create another framework and copy most of the code from VFIO?
> > > It is hard to believe the mainline kernel will take my code. So how about let me
> > > try the VFIO way first and try that if it won't work? ;)
> > 
> > There is no trying, this is the kernel, once you expose something to
> > userspace you have to keep supporting it forever ... There is no, hey
> > let's add this new framework and see how it goes and remove it a few
> > kernel versions later ...
> > 
> 
> No, I didn't mean I was not serious when I said "try". I was just not sure if the
> community can accept it.
> 
> Can Alex say something on this? Is this scenario in the future scope of VFIO? If
> it is, we have a reason to solve the problem along the way. If it is not, we
> should choose another way even if we have to copy most of the code.
> 
> > That is why i am being pedantic :) on making sure there is good reasons
> > to do what you do inside VFIO. I do believe that we want a common frame-
> > work like the one you are proposing but i do not believe it should be
> > part of VFIO given the baggages it comes with and that are not relevant
> > to the use cases for this kind of devices.
> 
> Understood. And I appreciate the discussion and help:)

Thank you for bearing with me in this long discussion :)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-13 14:51                           ` Jerome Glisse
@ 2018-09-14  3:12                             ` Kenneth Lee
  2018-09-14 14:05                               ` Jerome Glisse
  2018-09-14  6:50                             ` Tian, Kevin
  1 sibling, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-14  3:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Sanjay Kumar, Hao Fang,
	linux-kernel, linuxarm, iommu, David S . Miller, linux-crypto,
	Zhou Wang, Philippe Ombredanne, Thomas Gleixner, Joerg Roedel,
	linux-accelerators, Lu Baolu

On Thu, Sep 13, 2018 at 10:51:50AM -0400, Jerome Glisse wrote:
> Date: Thu, 13 Sep 2018 10:51:50 -0400
> From: Jerome Glisse <jglisse@redhat.com>
> To: Kenneth Lee <liguozhu@hisilicon.com>
> CC: Kenneth Lee <nek.in.cn@gmail.com>, Alex Williamson
>  <alex.williamson@redhat.com>, Herbert Xu <herbert@gondor.apana.org.au>,
>  kvm@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>, Greg Kroah-Hartman
>  <gregkh@linuxfoundation.org>, Zaibo Xu <xuzaibo@huawei.com>,
>  linux-doc@vger.kernel.org, Sanjay Kumar <sanjay.k.kumar@intel.com>, Hao
>  Fang <fanghao11@huawei.com>, linux-kernel@vger.kernel.org,
>  linuxarm@huawei.com, iommu@lists.linux-foundation.org, "David S . Miller"
>  <davem@davemloft.net>, linux-crypto@vger.kernel.org, Zhou Wang
>  <wangzhou1@hisilicon.com>, Philippe Ombredanne <pombredanne@nexb.com>,
>  Thomas Gleixner <tglx@linutronix.de>, Joerg Roedel <joro@8bytes.org>,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.1 (2018-07-13)
> Message-ID: <20180913145149.GB3576@redhat.com>
> 
> On Thu, Sep 13, 2018 at 04:32:32PM +0800, Kenneth Lee wrote:
> > On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:
> > > On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> > > > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > > > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it uses copy_from_user() to
> > > > > > > > > > copy the user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > > > > > user application does something on its data, pushes it away to the accelerator,
> > > > > > > > > > and says: "I'm tired, it is your turn to do the job...". Then the accelerator has
> > > > > > > > > > the memory, referring to any portion of it with the same VAs as the application,
> > > > > > > > > > even if the VAs are stored inside the memory itself.
> > > > > > > > > 
> > > > > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > > > > > dma buffer object.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Thank you for directing me to this implementation. It is interesting:).
> > > > > > > > 
> > > > > > > > But it is not yet solve my problem. If I understand it right, the userptr in
> > > > > > > > i915 do the following:
> > > > > > > > 
> > > > > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > > > > > >    reference.
> > > > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > > > > > >    can be shared between the user space and the hardware.
> > > > > > > > 
> > > > > > > > But my scenario is: 
> > > > > > > > 
> > > > > > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > > > > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > > > > > >    of them is ptr2.
> > > > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > > > > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > > > > 
> > > > > > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > > > > > 
> > > > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > > > > without non SVA/SVM path.
> > > > > > > 
> > > > > > 
> > > > > > I think we should clarify the concept of SVA/SVM here. In my understanding, Shared
> > > > > > Virtual Address/Memory means: any virtual address in a process can be used by the
> > > > > > device at the same time. This requires the IOMMU device to support PASID. And
> > > > > > optionally, it requires the feature of page-fault-from-device.
> > > > > 
> > > > > Yes we agree on what SVA/SVM is. There is one gotcha though: access
> > > > > to ranges that are MMIO mapped, ie CPU page table pointing to IO memory. IIRC
> > > > > it is undefined what happens on some platforms for a device trying to
> > > > > access those using SVA/SVM.
> > > > > 
> > > > > 
> > > > > > But before the feature is settled down, IOMMU can be used immediately in the
> > > > > > current kernel. That make it possible to assign ONE process's virtual addresses
> > > > > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > > > > > process.
> > > > > 
> > > > > UH ? How ? You want to GUP _every_ single valid address in the process
> > > > > and map it to the device ? How do you handle new vma, pages being replaced
> > > > > (despite GUP because of things that ultimately call zap pte) ...
> > > > > 
> > > > > Again here you said that the device must be able to access _any_ valid
> > > > > pointer. With GUP this is insane.
> > > > > 
> > > > > So i am assuming this is not what you want to do without SVA/SVM ie with
> > > > > GUP you have a different programming model, one in which the userspace
> > > > > must first bind _range_ of memory to the device and get a DMA address
> > > > > for the range.
> > > > > 
> > > > > Again, GUP range of process address space to map it to a device so that
> > > > > userspace can use the device on the mapped range is something that do
> > > > > exist in various places in the kernel.
> > > > > 
> > > > 
> > > > Yes, same as your expectation. In WarpDrive, we use the concept of "sharing" to
> > > > do so. If some memory is going to be shared among processes and devices, we use
> > > > wd_share_mem(queue, ptr, size) to share that memory. When the queue is working
> > > > in this mode, pointers are valid only within those memory segments. wd_share_mem()
> > > > calls the VFIO DMA map ioctl, which does the GUP.
> > > > 
> > > > If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
> > > > wd_share_mem() is not necessary.
> > > > 
> > > > This was really not popular when we started the work on WarpDrive. The GUP
> > > > documentation says it should be called with mmap_sem held, because GUP
> > > > simply increases the page refcount and does not keep the mapping between the
> > > > page and the vma. We keep our work together with VFIO to make sure the problem
> > > > can be solved in one deal.
> > > 
> > > The problem can not be solved in one deal, you can not maintain vaddr
> > > pointing to same page after a fork() this can not be solve without the
> > > use of mmu notifier and device dma mapping invalidation ! So being part
> > > of VFIO will not help you there.
> > 
> > Good point. But sadly, even with mmu notifier and dma mapping invalidation, I
> > cannot do anything here. If the process forks a sub-process, the sub-process needs
> > a new PASID and hardware resource. The mapped IOMMU space should not be used. The
> > parent process should be aware of this, and unmap and close the device file before
> > the fork. I have the same limitation as VFIO:(
> > 
> > I don't think I can change much here. If I can, VFIO can too:)
> 
> Forbidding the child from accessing the device is easy: in the kernel, whenever
> someone opens the device file, force-set the O_CLOEXEC flag on the file.
> Some device drivers already do that and so should you. With that you
> should always have a one-to-one relationship between struct file and mm
> struct, and thus one PASID per struct file, ie per open of the device file.

I considered the O_CLOEXEC flag, but it seems it works only for exec, not fork.
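
A minimal user-space sketch of that observation, assuming a Linux system with
/dev/null available. The function name is only illustrative; it checks whether
a descriptor opened with O_CLOEXEC is still valid in a child created by fork()
with no exec:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a descriptor opened with O_CLOEXEC is still open in a
 * fork()ed child, 0 if it was closed, and -1 on error.
 */
static int cloexec_fd_survives_fork(void)
{
    int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
    int status;
    pid_t pid;

    if (fd < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: F_GETFD fails (EBADF) only if fd is no longer open. */
        _exit(fcntl(fd, F_GETFD) == -1 ? 1 : 0);
    }

    close(fd);
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The close-on-exec flag is applied at execve() time only, so the child above
inherits a working descriptor.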

> 
> That does not solve the GUP/fork issue i describe below.
> 
> 
> > > AFAIK VFIO is fine with the way it is as QEMU do not fork() once it
> > > is running a guest and thus the COW that would invalidate vaddr to
> > > physical page assumption is not broken. So i doubt VFIO folks have
> > > any incentive to go down the mmu notifier path and invalidate device
> > > mapping. They also have the replay thing that probably handle some
> > > of fork cases by trusting user space program to do it. In your case
> > > you can not trust the user space program.
> > > 
> > > In your case AFAICT i do not see any warning or gotcha so the following
> > > scenario is broken (in non SVA/SVM):
> > >     1) program setup the device (open container, mdev, setup queue, ...)
> > >     2) program map some range of its address space with VFIO_IOMMU_MAP_DMA
> > >     3) program start using the device using map setup in 2)
> > >     ...
> > >     4) program fork()
> > >     5) parent trigger COW inside the range setup in 2)
> > > 
> > >     At this point it is the child process that can write to the page that
> > >     are access by the device (which was map by the parent in 2)). The
> > >     parent can no longer access that memory from the CPU.
> > > 
> > > There is just no sane way to fix this beside invalidating device mapping
> > > on fork (and you can not rely on userspace to do so) and thus stopping
> > > the device on fork (SVA/SVM case do not have any issue here).
> > 
> > Indeed. But as soon as we choose to expose the device space to the user space,
> > the limitation is already there. If we want to solve the problem, we have to
> > have a hook in the copy_process() procedure and copy the parent's queue state to
> > a new queue, assign it to the child's fd and redirect the child's mmap to
> > it. If I can do so, the same logic can also be applied to VFIO.
> 
> Except we do not want to do that and this does not solve the COW i describe
> above unless you disable COW altogether which is a big no.
> 
> > The good side is, this is not a security leak. The hardware has been given to
> > the process. It is the process who choose to share it. If it won't work, it is
> > the process's problem;)
> 
> No this is bad, you can not expect every single userspace program to know and
> be aware of that. If some trusted application (say systemd or firefox, ...)
> starts using your device and is unaware of or does not comprehend all the side
> effects, it would allow the child to access/change its parent's memory and that
> is bad. Device drivers need to protect against users doing stupid things. All
> existing device drivers that do ATS/PASID already protect themselves against the
> child (ie the child can not use the device, through a mix of O_CLOEXEC and other checks).
> 
> You must set the O_CLOEXEC flag but that does not solve the COW problem above.
> 
> My motto in life is "do not trust userspace" :) which also translate to "do
> not expect userspace will do the right thing".
> 

We don't really trust user space here. We trust that the process cannot do more
than what has been given to it. If an application uses WarpDrive and shares its
memory with it, it should know what happens when it forks a sub-process.  This is
just like cloning a sub-process with CLONE_FILES or CLONE_VM: the parent
process should know what happens.
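
For reference, the copy-on-write split behind the fork scenario quoted above
can be observed from plain user space. This is only a sketch with a
hypothetical function name and no device involved: it shows that a write the
parent makes after fork() lands in a page the child no longer shares, which is
why a DMA mapping set up before the fork can end up pointing at a page only
one of the two processes still sees.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child observes a value the parent wrote after fork(),
 * 0 if it still sees the pre-fork value, and -1 on error.
 */
static int child_sees_parent_write_after_fork(void)
{
    volatile int *buf = malloc(sizeof(*buf));
    int pipefd[2];
    char token = 'x';
    int status, wrote;
    pid_t pid;

    if (!buf)
        return -1;
    *buf = 1; /* value both processes see at fork time */

    if (pipe(pipefd) < 0) {
        free((void *)buf);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        free((void *)buf);
        return -1;
    }
    if (pid == 0) {
        close(pipefd[1]);
        /* Child: wait until the parent has done its write. */
        if (read(pipefd[0], &token, 1) != 1)
            _exit(2);
        _exit(*buf == 2 ? 1 : 0);
    }

    close(pipefd[0]);
    *buf = 2; /* parent write after fork: triggers copy-on-write */
    wrote = write(pipefd[1], &token, 1) == 1;
    close(pipefd[1]);

    if (waitpid(pid, &status, 0) < 0) {
        free((void *)buf);
        return -1;
    }
    free((void *)buf);
    if (!wrote || !WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```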

> 
> > > > And now we have GUP-longterm and many accounting work in VFIO, we don't want to
> > > > do that again.
> > > 
> > > GUP-longterm does not solve any GUP problem, it just block people to
> > > do GUP on DAX backed vma to avoid pining persistent memory as it is
> > > a nightmare to handle in the block device driver and file system code.
> > > 
> > > The accounting is the rlimit thing and is literally 10 lines of
> > > code so i would not see that as hard to replicate.
> > 
> > OK. Agree.
> > 
> > > 
> > > 
> > > > > > Now we are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > > > > > from the feature in the future. It does not mean WarpDrive is useless before
> > > > > > that. And it works for our Zip and RSA accelerators in the physical world.
> > > > > 
> > > > > Just not with random process address ...
> > > > > 
> > > > > > > If you still want non SVA/SVM path what you want to do only works
> > > > > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > > > > > device (moreover you need DMA address to match process address
> > > > > > > which is not an easy feat).
> > > > > > > 
> > > > > > > Now even if you only want SVA/SVM, i do not see what is the point
> > > > > > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > > > > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > > > > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > > > > > 
> > > > > > > For SVA/SVM your usage model is:
> > > > > > > 
> > > > > > > Setup:
> > > > > > >     - user space create a warp drive context for the process
> > > > > > >     - user space create a device specific context for the process
> > > > > > >     - user space create a user space command queue for the device
> > > > > > >     - user space bind command queue
> > > > > > > 
> > > > > > >     At this point the kernel driver has bound the process address
> > > > > > >     space to the device with a command queue and userspace
> > > > > > > 
> > > > > > > Usage:
> > > > > > >     - user space schedule work and call appropriate flush/update
> > > > > > >       ioctl from time to time. Might be optional depends on the
> > > > > > >       hardware, but probably a good idea to enforce so that kernel
> > > > > > >       can unbind the command queue to bind another process command
> > > > > > >       queue.
> > > > > > >     ...
> > > > > > > 
> > > > > > > Cleanup:
> > > > > > >     - user space unbind command queue
> > > > > > >     - user space destroy device specific context
> > > > > > >     - user space destroy warp drive context
> > > > > > >     All the above can be implicit when closing the device file.
> > > > > > > 
> > > > > > > So again in the above model i do not see anywhere something from
> > > > > > > VFIO that would benefit this model.
> > > > > > > 
> > > > > > 
> > > > > > Let me show you how the model will be if I use VFIO:
> > > > > > 
> > > > > > Setup (Kernel part)
> > > > > > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > > > > > 	  can still be registered to netdev, encryptor can still be registered
> > > > > > 	  to crypto...
> > > > > > 	- At the same time, the driver can devote some of its hardware resource
> > > > > > 	  and register them as a mdev creator to the VFIO framework. This just
> > > > > > 	  need limited change to the VFIO type1 driver.
> > > > > 
> > > > > In the above VFIO does not help you one bit ... you can do that with
> > > > > as much code with new common device as front end.
> > > > > 
> > > > > > Setup (User space)
> > > > > > 	- System administrator create mdev via the mdev creator interface.
> > > > > > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > > > > > 	  only one group for one device.
> > > > > > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > > > > > 	  the PASID to the device via VFIO interface.
> > > > > > 	- Get the device from the group via VFIO interface and mmap it the user
> > > > > > 	  space for device's MMIO access (for the queue).
> > > > > > 	- Map whatever memory you need to share with the device with VFIO
> > > > > > 	  interface.
> > > > > > 	- (opt) Add more devices into the container if you want to share the
> > > > > > 	  same address space with them
> > > > > 
> > > > > So all VFIO buys you here is boiler plate code that does insert_pfn()
> > > > > to handle MMIO mapping. Which is just couple hundred lines of boiler
> > > > > plate code.
> > > > > 
> > > > 
> > > > No. With VFIO, I don't need to:
> > > > 
> > > > 1. GUP and accounting for RLIMIT_MEMLOCK
> > > 
> > > That's 10 lines of code ...
> > > 
> > > > 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)
> > > 
> > > GUP pages are not part of rb_tree and what you want to do can be done
> > > in few lines of code here is pseudo code:
> > > 
> > > warp_dma_map_range(ulong vaddr, ulong npages)
> > > {
> > >     /* array of page pointers, one slot per pinned page */
> > >     struct page **pages = kvzalloc(npages * sizeof(*pages), GFP_KERNEL);
> > > 
> > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> > >         GUP(vaddr, &pages[i]);    /* pin the user page */
> > >         iommu_map(vaddr, page_to_pfn(pages[i]));    /* iova == vaddr */
> > >     }
> > >     kvfree(pages);    /* pages stay pinned; only the array is freed */
> > > }
> > > 
> > > warp_dma_unmap_range(ulong vaddr, ulong npages)
> > > {
> > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> > >         unsigned long pfn;
> > > 
> > >         /* recover the pinned page from the IOMMU mapping itself */
> > >         pfn = iommu_iova_to_phys(vaddr) >> PAGE_SHIFT;
> > >         iommu_unmap(vaddr);
> > >         put_page(pfn_to_page(pfn)); /* set dirty if mapped write */
> > >     }
> > > }
> > > 
> > 
> > But what if the process exits without unmapping? The pages will be pinned in the
> > kernel forever.
> 
> Yeah add a struct warp_map { struct list_head list; unsigned long vaddr;
> unsigned long npages; } for every mapping, store the head into the device
> file private field of struct file and when the release fs callback is
> called you can walk down the list to force unmap any leftover. This is not
> that much code. You can even use an interval tree, which is 3 lines of code
> with interval_tree_generic.h, to speed up warp_map lookup on unmap ioctl.
> 

Yes, when you add all of them... it is VFIO:)
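
To make the comparison concrete, here is a minimal userspace sketch of that
bookkeeping, assuming a plain singly linked list instead of list_head or an
interval tree. The names warp_file, warp_map_add(), warp_map_del() and
warp_release() are hypothetical, and the GUP/iommu_map() work is only modelled
by a page counter. Once the release path has to walk the list and undo
everything, it starts to resemble the tracking VFIO already does:

```c
/* Userspace model of per-file mapping bookkeeping; not kernel code. */
#include <assert.h>
#include <stdlib.h>

struct warp_map {                /* modelled on the struct quoted above */
    struct warp_map *next;
    unsigned long vaddr;
    unsigned long npages;
};

struct warp_file {               /* stands in for the struct file private data */
    struct warp_map *maps;
    unsigned long pinned;        /* pages currently accounted as pinned */
};

/* Record a mapping; a real driver would pin and iommu_map() the range here. */
static int warp_map_add(struct warp_file *f, unsigned long vaddr,
                        unsigned long npages)
{
    struct warp_map *m = malloc(sizeof(*m));

    if (!m)
        return -1;
    m->vaddr = vaddr;
    m->npages = npages;
    m->next = f->maps;
    f->maps = m;
    f->pinned += npages;
    return 0;
}

/* Drop one mapping by start address; returns 0 if it was found. */
static int warp_map_del(struct warp_file *f, unsigned long vaddr)
{
    struct warp_map **pp;

    for (pp = &f->maps; *pp; pp = &(*pp)->next) {
        if ((*pp)->vaddr == vaddr) {
            struct warp_map *m = *pp;

            *pp = m->next;
            f->pinned -= m->npages;
            free(m);
            return 0;
        }
    }
    return -1;
}

/* Model of the release callback: force-unmap whatever the process left. */
static unsigned long warp_release(struct warp_file *f)
{
    unsigned long released = 0;

    while (f->maps) {
        struct warp_map *m = f->maps;

        f->maps = m->next;
        released += m->npages;
        free(m);
    }
    f->pinned -= released;
    assert(f->pinned == 0);      /* nothing may stay pinned after release */
    return released;
}
```

Locking, RLIMIT_MEMLOCK accounting and error unwinding are all left out of the
sketch, and those are the parts that are easy to get wrong.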

> 
> > > Add locking, error handling, dirtying and comments and you are barely
> > > looking at couple hundred lines of code. You do not need any of the
> > > complexity of VFIO as you do not have the same requirements. Namely
> > > VFIO have to keep track of iova and physical mapping for things like
> > > migration (migrating guest between host) and few others very
> > > virtualization centric requirements.
> > > 
> > > 
> > > > 2. Handle the PASID on SMMU (ARM's IOMMU) myself.
> > > 
> > > Existing drivers do that in 20 lines of code, comments and error
> > > handling included (see kfd_iommu_bind_process_to_device() for instance) i
> > > doubt you need much more than that.
> > > 
> > 
> > OK, I agree.
> > 
> > > 
> > > > 3. Multiple devices management (VFIO uses container to manage this)
> > > 
> > > All the vfio_group* stuff ? OK that's boiler plate code, not that
> > > hard to replicate though.
> > 
> > No, I meant the container thing. Several devices/group can be assigned to the
> > same container and the DMA on the container can be assigned to all those
> > devices. So we can have some devices to share the same name space.
> 
> This was the motivation of my question below, to me this is a policy
> decision and it should be left to userspace to decide but not forced
> upon userspace because it uses a given device driver.
> 
> Maybe i am wrong but i think you can create container and device group
> without having a VFIO driver for the devices in the group. It is not
> something i do often so i might be wrong here.
> 

The container maintains a unified virtual address space for all the groups/IOMMUs
added to it. That simplifies address management. But yes, you can choose to do
it all in user space.
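
What the container buys can be modelled in a few lines. This is a userspace
sketch only, assuming small fixed-size tables; struct container, struct domain,
container_attach() and container_map() are made-up names, not the VFIO API. A
single map request reaches every attached domain, and a domain attached later
gets the existing mappings replayed:

```c
/* Userspace model of one address space shared by several IOMMU domains. */
#include <assert.h>

#define MAX_DOMAINS 4
#define MAX_MAPS    8

struct dma_map {
    unsigned long iova, vaddr, size;
};

struct domain {                  /* models one IOMMU domain's page table */
    struct dma_map maps[MAX_MAPS];
    int nr_maps;
};

struct container {               /* models the shared address space */
    struct dma_map maps[MAX_MAPS];
    int nr_maps;
    struct domain *domains[MAX_DOMAINS];
    int nr_domains;
};

static int domain_map(struct domain *d, const struct dma_map *m)
{
    if (d->nr_maps == MAX_MAPS)
        return -1;
    d->maps[d->nr_maps++] = *m;
    return 0;
}

/* Attaching a domain replays every mapping the container already holds. */
static int container_attach(struct container *c, struct domain *d)
{
    int i;

    assert(c && d);
    if (c->nr_domains == MAX_DOMAINS)
        return -1;
    for (i = 0; i < c->nr_maps; i++)
        if (domain_map(d, &c->maps[i]))
            return -1;
    c->domains[c->nr_domains++] = d;
    return 0;
}

/* One map request is applied to every attached domain. */
static int container_map(struct container *c, unsigned long iova,
                         unsigned long vaddr, unsigned long size)
{
    struct dma_map m = { iova, vaddr, size };
    int i;

    assert(c && size > 0);
    if (c->nr_maps == MAX_MAPS)
        return -1;
    for (i = 0; i < c->nr_domains; i++)
        if (domain_map(c->domains[i], &m))
            return -1;
    c->maps[c->nr_maps++] = m;
    return 0;
}
```

Doing the same in user space means the application has to repeat every map
call per device and keep the tables in sync itself.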

> 
> > > > And even as boiler plate, it is valuable: the memory thing is a sensitive
> > > > interface to user space, and it can easily become a security problem. If I can
> > > > achieve my target within the scope of VFIO, why not? At least it has been
> > > > proved to be safe for the time being.
> > > 
> > > The thing is being part of VFIO imposes things on you, things that you
> > > do not need. Like one device per group (maybe it is you imposing this,
> > > i am losing track here). Or the complex dma mapping tracking ...
> > > 
> > 
> > Err... But the one-device-per-group is not VFIO's decision. It is IOMMU's :).
> > Unless I don't use IOMMU.
> 
> AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> by default at boot or at least all devices behind the same bridge.
> 
> Maybe there are kernel options to avoid that, and a userspace init program
> can definitely re-arrange that based on sysadmin policy.
> 

But if the IOMMU is enabled, all PCIE devices have their own device IDs. So the
IOMMU can use a different page table for each of them.

> 
> > > > > > Cleanup:
> > > > > > 	- User space closes the group file handle.
> > > > > > 	- There will be a problem in letting other processes know the mdev is
> > > > > > 	  free to be used again. My RFCv1 chose a file handle solution. Alex
> > > > > > 	  does not like it. But it is not a big problem. We can always have a
> > > > > > 	  scheduler process to manage the state of the mdev, or we can even
> > > > > > 	  switch back to the RFCv1 solution without too much effort if we like
> > > > > > 	  in the future.
> > > > > 
> > > > > If you were outside VFIO you would have more freedom on how to do that.
> > > > > For instance process opening the device file can be placed on queue and
> > > > > first one in the queue get to use the device until it closes/release the
> > > > > device. Then next one in queue get the device ...
> > > > 
> > > > Yes. I do like the file handle solution. But I hope the solution becomes mature
> > > > as soon as possible. Many of our products, and as far as I know some of our
> > > > partners' too, are waiting for a long-term solution as a direction. If I rely on
> > > > some immature solution, they may choose a deviating, customized solution. That
> > > > would be much more harmful. Compared to this, the freedom is not so important...
> > > 
> > > I do not see how being part of VFIO protect you from people doing crazy
> > > thing to their kernel ... Time to market being key in this world, i doubt
> > > that being part of VFIO would make anyone think twice before taking a
> > > shortcut.
> > > 
> > > I have seen horrible things on that front and only players like Google
> > > can impose a minimum level of sanity.
> > > 
> > 
> > OK. My fault for bringing up time to market (TTM). It has nothing to do with the
> > architecture decision. But I don't yet see what harm will be done if I use VFIO
> > when it can fulfill almost all my requirements.
> 
> The harm is in forcing the device group isolation policy which is not
> necessary for all devices like you said so yourself on ARM with the
> device stream id so that IOMMU can identify individual devices.

No. Some mini SoCs share the same stream id among several small devices. The
IOMMU has to treat them as the same device. That is why the IOMMU subsystem
introduces the concept of iommu_group. Personally I don't like that. If they
share the same stream id, they should be treated as the same hardware. But that
is the decision for the time being...

> 
> So i would rather see the device isolation as something orthogonal to
> what you want to achieve, and that should be left to the user, ie the
> sysadmin should control that; distributions can have sane defaults for
> each platform.
> 
> 
> > > > > > Except for the minimal update to the type1 driver and the use of sdmdev to
> > > > > > manage the interrupt sharing, I don't need any extra code to gain the address
> > > > > > sharing capability. And the capability will be strengthened along with the
> > > > > > upgrade of VFIO.
> > > > > > 
> > > > > > > 
> > > > > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > > > > > invent another wheel? It has most of stuff I need:
> > > > > > > > > > 
> > > > > > > > > > 1. Connecting multiple devices to the same application space
> > > > > > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > > > > > 3. Managing hardware resource by device
> > > > > > > > > > 
> > > > > > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > > > > > 
> > > > > > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > > > > > drivers have a sizeable userspace side to them (anything with ioctl does) so they
> > > > > > > > > can be construed as userspace drivers too.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > > > > > should I spend so much effort to do it again?
> > > > > > > 
> > > > > > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > > > > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > > > > > i do not see anything in VFIO that you need.
> > > > > > > 
> > > > > > 
> > > > > > As I have explained, if I don't use VFIO, at least I have to do all that has been
> > > > > > done in i915 or even more than that.
> > > > > 
> > > > > So beside the MMIO mmap() handling and dma mapping of range of user space
> > > > > address space (again all very boiler plate code duplicated across the
> > > > > kernel several times in different forms). You do not gain anything being
> > > > > inside VFIO right ?
> > > > > 
> > > > 
> > > > As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
> > > > user interface are our concern.
> > > > > 
> > > > > > > > > So there is no reason to do that under VFIO. Especially as in your example
> > > > > > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > > > > > about writing commands into the command buffer AFAICT.
> > > > > > > > > 
> > > > > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > > > > > kernel).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > > > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > > > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > > > > > service me perfectly. I don't understand why I should choose a long route.
> > > > > > > 
> > > > > > > Again this is not the long route i do not see anything in VFIO that
> > > > > > > benefit you in the SVA/SVM case. A basic character device driver can
> > > > > > > do that.
> > > > > > > 
> > > > > > > 
> > > > > > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > > > > > you define clearly each scenario because from where i sit it looks like
> > > > > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > > > > > is disabled.
> > > > > > > > > 
> > > > > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > > > > > unless your device has its own page table and all commands are relative
> > > > > > > > > to that page table and the device page table is populated by kernel driver
> > > > > > > > > in secure way (ie by checking that what is populated can be access).
> > > > > > > > > 
> > > > > > > > > I do not believe your example device to have such page table nor do i see
> > > > > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > > > > > each commands.
> > > > > > > > > 
> > > > > > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > > > > > to have random program use your device. I am pretty sure that all user
> > > > > > > > > of VFIO are trusted process (like QEMU).
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Finally i am convinced that the IOMMU grouping stuff related to VFIO is
> > > > > > > > > useless for your usecase. I really do not see the point of that, it
> > > > > > > > > does complicate things for you for no reasons AFAICT.
> > > > > > > > 
> > > > > > > > Indeed, I don't like the group thing. I believe VFIO's maintainers would not like
> > > > > > > > it very much either;). But the problem is, the group reflects the same
> > > > > > > > IOMMU (unit), which may be shared with other devices.  It is a security problem. I
> > > > > > > > cannot ignore it. I have to take it into account even if I don't use VFIO.
> > > > > > > 
> > > > > > > To me it seems you are making a policy decision in kernel space ie
> > > > > > > whether the device should be isolated in its own group or not is a
> > > > > > > decision that is up to the sys admin or something in userspace.
> > > > > > > Right now existing users of SVA/SVM don't (at least AFAICT).
> > > > > > > 
> > > > > > > Do we really want to force such isolation ?
> > > > > > > 
> > > > > > 
> > > > > > But it is not my decision, that is how the iommu subsystem is designed. Personally
> > > > > > I don't like it at all, because all our hardware has its own stream id
> > > > > > (device id). I don't need the group concept at all. But the iommu subsystem
> > > > > > assumes some devices may share the same device ID on a single IOMMU.
> > > > > 
> > > > > My question was do you really want to force group isolation for the
> > > > > device ? Existing SVA/SVM capable driver do not force that, they let
> > > > > the userspace decide this (sysadm, distributions, ...). Being part of
> > > > > VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> > > > > force this decision ie make a policy decision without userspace having
> > > > > anything to say about it.
> > > 
> > > You still do not answer my question, do you really want to force group
> > > isolation for device in your framework ? Which is a policy decision from
> > > my POV and thus belong to userspace and should not be enforce by kernel.
> > 
> > No. But I have to follow the rule defined by IOMMU, haven't I?
> 
> The IOMMU rule does not say that every device _must_ always be in one
> group and one domain only. I am pretty sure on x86 by default you get
> one domain for all PCIE devices behind same bridge.
> 

Really? Give me some more time. I need to check it out.

> My point is that the device grouping into domain/group should be an
> orthogonal decision ie it should not be a requirement by the device
> driver and should be under control of userspace as it is a policy
> decision.
> 
> One exception is if it is unsafe to have devices share a domain, in which
> case following my motto the driver should refuse to work and return
> an error on open (and a kernel message explaining why). But this depends on
> device and platform.
> 
> 
> > > > > The IOMMU group thing has always been doubtful to me, it is advertised
> > > > > as allowing to share resources (ie IOMMU page table) between devices.
> > > > > But this assumes that all device drivers in the group have some way of
> > > > > communicating with each other to share common DMA addresses that point
> > > > > to memory the devices care about. I believe only VFIO does that and probably
> > > > > only when used by QEMU.
> > > > > 
> > > > > 
> > > > > Anyway my question is:
> > > > > 
> > > > > Is it that useful to be inside VFIO (to avoid a few hundred lines
> > > > > of boiler plate code) given that it forces you into a model (group
> > > > > isolation) that so far has never been the preferred way for all
> > > > > existing device drivers that already do what you want to achieve ?
> > > > > 
> > > > 
> > > > You mean to say I create another framework and copy most of the code from VFIO?
> > > > It is hard to believe the mainline kernel will take my code. So how about let me
> > > > try the VFIO way first and try that if it won't work? ;)
> > > 
> > > There is no trying, this is the kernel, once you expose something to
> > > userspace you have to keep supporting it forever ... There is no, hey
> > > let's add this new framework and see how it goes and remove it a few
> > > kernel versions later ...
> > > 
> > 
> > No, I didn't mean I was not serious when I said "try". I was just not sure if the
> > community can accept it. 
> > 
> > Can Alex say something on this? Is this scenario in the future scope of VFIO? If
> > it is, we have a reason to solve the problem on the way. If it is not, we
> > should choose another way even if we have to copy most of the code.
> > 
> > > That is why i am being pedantic :) on making sure there is good reasons
> > > to do what you do inside VFIO. I do believe that we want a common frame-
> > > work like the one you are proposing but i do not believe it should be
> > > part of VFIO given the baggages it comes with and that are not relevant
> > > to the use cases for this kind of devices.
> > 
> > Understood. And I appreciate the discussion and help:)
> 
> Thank you for bearing with me in this long discussion :)
> 
> Cheers,
> Jérôme

Cheers

-- 
			-Kenneth(Hisilicon)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-13 14:51                           ` Jerome Glisse
  2018-09-14  3:12                             ` Kenneth Lee
@ 2018-09-14  6:50                             ` Tian, Kevin
  2018-09-14 13:05                               ` Kenneth Lee
  2018-09-14 14:13                               ` Jerome Glisse
  1 sibling, 2 replies; 58+ messages in thread
From: Tian, Kevin @ 2018-09-14  6:50 UTC (permalink / raw)
  To: Jerome Glisse, Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Kumar, Sanjay K,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Joerg Roedel, linux-accelerators, Lu Baolu

> From: Jerome Glisse
> Sent: Thursday, September 13, 2018 10:52 PM
>
[...]
> AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> by default at boot or at least all devices behind the same bridge.

The group thing reflects a physical hierarchy limitation; it does not change
across boots. Please note an iommu group defines the minimal isolation
boundary - all devices within the same group must be attached to the
same iommu domain or address space, because physically the IOMMU
cannot differentiate DMAs coming from those devices. Devices behind
a legacy PCI-X bridge are one example. Other examples include devices
behind a PCIe switch port which doesn't support ACS and thus cannot
route p2p transactions to the IOMMU. If talking about a typical PCIe
endpoint (with upstream ports all supporting ACS), you'll get
one device per group.

One iommu group today is attached to only one iommu domain.
In the future one group may attach to multiple domains, as in the
aux domain concept being discussed in another thread.
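
Group membership is visible from user space under
/sys/kernel/iommu_groups/<N>/devices/. Below is a minimal sketch, assuming that
sysfs layout, which counts the devices in one group. The helper name and the
base-path parameter (handy for pointing it at another directory) are
illustrative only:

```c
/* Count the devices that share one IOMMU group, via sysfs. */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Returns the number of entries in <base>/<group>/devices, or -1 on error. */
static int iommu_group_device_count(const char *base, const char *group)
{
    char path[512];
    struct dirent *de;
    DIR *dir;
    int len, n = 0;

    assert(base && group);
    len = snprintf(path, sizeof(path), "%s/%s/devices", base, group);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;               /* path would have been truncated */

    dir = opendir(path);
    if (!dir)
        return -1;               /* no such group, or IOMMU not enabled */

    while ((de = readdir(dir)) != NULL) {
        /* every entry except "." and ".." is one device in the group */
        if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
            n++;
    }
    closedir(dir);
    return n;
}
```

Calling it as iommu_group_device_count("/sys/kernel/iommu_groups", "0") and
getting 1 back would correspond to the one-device-per-group case; a larger
count means those devices cannot be isolated from each other.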

> 
> Maybe there are kernel options to avoid that, and a userspace init program
> can definitely re-arrange that based on sysadmin policy.

I don't think there is such an option, as it may break the isolation model
enabled by the IOMMU.

[...]
> > > > That is why i am being pedantic :) on making sure there is good reasons
> > > > to do what you do inside VFIO. I do believe that we want a common
> > > > framework like the one you are proposing but i do not believe it should be
> > > > part of VFIO given the baggages it comes with and that are not relevant
> > > > to the use cases for this kind of devices.
> >

The purpose of VFIO is clear - it is the kernel portal for granting generic
device resources (mmio, irq, etc.) to user space. VFIO doesn't care
what exactly a resource is used for (queue, cmd reg, etc.). If pursuing
the VFIO path is really necessary, maybe such a common framework
should live in user space, getting all granted resources from the
kernel driver through VFIO and then providing accelerator services to
other processes?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-14  6:50                             ` Tian, Kevin
@ 2018-09-14 13:05                               ` Kenneth Lee
  2018-09-14 14:13                               ` Jerome Glisse
  1 sibling, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-14 13:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jerome Glisse, Kenneth Lee, Alex Williamson, Herbert Xu, kvm,
	Jonathan Corbet, Greg Kroah-Hartman, Zaibo Xu, linux-doc, Kumar,
	Sanjay K, Hao Fang, linux-kernel, linuxarm, iommu,
	David S . Miller, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Thomas Gleixner, Joerg Roedel, linux-accelerators, Lu Baolu

On Fri, Sep 14, 2018 at 06:50:55AM +0000, Tian, Kevin wrote:
> 
> > From: Jerome Glisse
> > Sent: Thursday, September 13, 2018 10:52 PM
> >
> [...]
> > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> > by default at boot or at least all devices behind the same bridge.
> 
> The group thing reflects a physical hierarchy limitation; it does not change
> across boots. Please note an iommu group defines the minimal isolation
> boundary - all devices within the same group must be attached to the
> same iommu domain or address space, because physically the IOMMU
> cannot differentiate DMAs coming from those devices. Devices behind
> a legacy PCI-X bridge are one example. Other examples include devices
> behind a PCIe switch port which doesn't support ACS and thus cannot
> route p2p transactions to the IOMMU. If talking about a typical PCIe
> endpoint (with upstream ports all supporting ACS), you'll get
> one device per group.
> 
> One iommu group today is attached to only one iommu domain.
> In the future one group may attach to multiple domains, as in the
> aux domain concept being discussed in another thread.
> 
> > 
> > Maybe there are kernel options to avoid that, and a userspace init program
> > can definitely re-arrange that based on sysadmin policy.
> 
> I don't think there is such an option, as it may break the isolation model
> enabled by the IOMMU.
> 
> [...]
> > > > > That is why i am being pedantic :) on making sure there is good reasons
> > > > > to do what you do inside VFIO. I do believe that we want a common
> > > > > framework like the one you are proposing but i do not believe it should be
> > > > > part of VFIO given the baggages it comes with and that are not relevant
> > > > > to the use cases for this kind of devices.
> > >
> 
> The purpose of VFIO is clear - it is the kernel portal for granting generic
> device resources (mmio, irq, etc.) to user space. VFIO doesn't care
> what exactly a resource is used for (queue, cmd reg, etc.). If pursuing
> the VFIO path is really necessary, maybe such a common framework
> should live in user space, getting all granted resources from the
> kernel driver through VFIO and then providing accelerator services to
> other processes?

Yes. I think this is exactly what WarpDrive is doing now. This patch just lets
the type1 driver use the parent IOMMU for the mdev.

> 
> Thanks
> Kevin

-- 
			-Kenneth(Hisilicon)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-14  3:12                             ` Kenneth Lee
@ 2018-09-14 14:05                               ` Jerome Glisse
  0 siblings, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-14 14:05 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Sanjay Kumar, Hao Fang,
	iommu, linux-kernel, linuxarm, Alex Williamson, linux-crypto,
	Zhou Wang, Philippe Ombredanne, Thomas Gleixner, Lu Baolu,
	David S . Miller, linux-accelerators, Joerg Roedel

On Fri, Sep 14, 2018 at 11:12:01AM +0800, Kenneth Lee wrote:
> On Thu, Sep 13, 2018 at 10:51:50AM -0400, Jerome Glisse wrote:
> > On Thu, Sep 13, 2018 at 04:32:32PM +0800, Kenneth Lee wrote:
> > > On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:
> > > > On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> > > > > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > > > > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > > > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse wrote:
> > > > > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee wrote:
> > > > > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth Lee wrote:
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it "copy_from_user" the
> > > > > > > > > > > user memory to the kernel. That is not what we need. What we try to get is: the
> > > > > > > > > > > user application do something on its data, and push it away to the accelerator,
> > > > > > > > > > > > and says: "I'm tired, it is your turn to do the job...". Then the accelerator has
> > > > > > > > > > > the memory, referring any portion of it with the same VAs of the application,
> > > > > > > > > > > even the VAs are stored inside the memory itself.
> > > > > > > > > > 
> > > > > > > > > > You were not looking at right place see drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > > > > > It does GUP and create GEM object AFAICR you can wrap that GEM object into a
> > > > > > > > > > dma buffer object.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thank you for directing me to this implementation. It is interesting:).
> > > > > > > > > 
> > > > > > > > > > But it does not yet solve my problem. If I understand it right, the userptr in
> > > > > > > > > i915 do the following:
> > > > > > > > > 
> > > > > > > > > 1. The user process sets a user pointer with size to the kernel via ioctl.
> > > > > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's mm for further
> > > > > > > > >    reference.
> > > > > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the device. So the data
> > > > > > > > >    can be shared between the user space and the hardware.
> > > > > > > > > 
> > > > > > > > > But my scenario is: 
> > > > > > > > > 
> > > > > > > > > 1. The user process has some data in the user space, pointed by a pointer, say
> > > > > > > > >    ptr1. And within the memory, there may be some other pointers, let's say one
> > > > > > > > >    of them is ptr2.
> > > > > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO space. And the
> > > > > > > > >    hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > > > > > 
> > > > > > > > > Userptr lets the hardware and process share the same memory space. But I need
> > > > > > > > > them to share the same *address space*. So IOMMU is a MUST for WarpDrive,
> > > > > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the procedure is OK.
> > > > > > > > 
> > > > > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > > > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > > > > > without non SVA/SVM path.
> > > > > > > > 
> > > > > > > 
> > > > > > > > I think we should clarify the concept of SVA/SVM here. As I understand it, Shared
> > > > > > > > Virtual Address/Memory means: any virtual address in a process can be used by
> > > > > > > > the device at the same time. This requires the IOMMU to support PASID. And
> > > > > > > > optionally, it requires the feature of page-fault-from-device.
> > > > > > 
> > > > > > > Yes we agree on what SVA/SVM is. There is one gotcha though, access
> > > > > > to range that are MMIO map ie CPU page table pointing to IO memory, IIRC
> > > > > > it is undefined what happens on some platform for a device trying to
> > > > > > access those using SVA/SVM.
> > > > > > 
> > > > > > 
> > > > > > > But before the feature is settled down, IOMMU can be used immediately in the
> > > > > > > current kernel. That make it possible to assign ONE process's virtual addresses
> > > > > > > to the device's IOMMU page table with GUP. This make WarpDrive work well for one
> > > > > > > process.
> > > > > > 
> > > > > > UH ? How ? You want to GUP _every_ single valid address in the process
> > > > > > and map it to the device ? How do you handle new vma, page being replace
> > > > > > > (despite GUP because of things that ultimately call zap pte) ...
> > > > > > 
> > > > > > Again here you said that the device must be able to access _any_ valid
> > > > > > pointer. With GUP this is insane.
> > > > > > 
> > > > > > So i am assuming this is not what you want to do without SVA/SVM ie with
> > > > > > GUP you have a different programming model, one in which the userspace
> > > > > > must first bind _range_ of memory to the device and get a DMA address
> > > > > > for the range.
> > > > > > 
> > > > > > Again, GUP range of process address space to map it to a device so that
> > > > > > userspace can use the device on the mapped range is something that do
> > > > > > exist in various places in the kernel.
> > > > > > 
> > > > > 
> > > > > Yes same as your expectation, in WarpDrive, we use the concept of "sharing" to
> > > > > do so. If some memory is going to be shared among process and devices, we use
> > > > > wd_share_mem(queue, ptr, size) to share those memory. When the queue is working
> > > > > in this mode, the point is valid in those memory segments. The wd_share_mem call
> > > > > vfio dma map syscall which will do GUP. 
> > > > > 
> > > > > If SVA/SVM is enabled, user space can set SHARE_ALL flags to the queue. Then
> > > > > wd_share_mem() is not necessary.
> > > > > 
> > > > > This was really not popular when we started the work on WarpDrive. The GUP
> > > > > documentation says it should be used within the scope where mmap_sem is held, because
> > > > > GUP simply increases the page refcount and does not keep the mapping between the page
> > > > > and the vma. We keep our work together with VFIO to make sure the problem can be solved
> > > > > in one deal.
> > > > 
> > > > The problem can not be solved in one deal, you can not maintain vaddr
> > > > pointing to same page after a fork() this can not be solve without the
> > > > use of mmu notifier and device dma mapping invalidation ! So being part
> > > > of VFIO will not help you there.
> > > 
> > > Good point. But sadly, even with mmu notifier and dma mapping invalidation, I
> > > cannot do anything here. If the process forks a sub-process, the sub-process needs
> > > a new pasid and hardware resource. The mapped IOMMU space should not be used. The
> > > parent process should be aware of this, and unmap and close the device file before
> > > the fork. I have the same limitation as VFIO:(
> > > 
> > > I don't think I can change much here. If I can, VFIO can too:)
> > 
> > Forbidding the child from accessing the device is easy in the kernel: whenever
> > someone opens the device file, force-set the O_CLOEXEC flag on the file.
> > Some device drivers already do that and so should you. With that you
> > should always have a one to one struct file - mm struct relationship
> > and thus one PASID per struct file, ie per open of the device file.
> 
> I considered the O_CLOEXEC flag, but it seems it works only for exec, not fork.

Every device driver I know of protects against that (the child having
access) by at least setting VM_DONTCOPY on the mmap of device IO (or even
on all mmaps of the device file).

But you are right, O_CLOEXEC does nothing on fork(); sometimes I forget POSIX.


> > That does not solve the GUP/fork issue i describe below.
> > 
> > 
> > > > AFAIK VFIO is fine with the way it is as QEMU do not fork() once it
> > > > is running a guest and thus the COW that would invalidate vaddr to
> > > > physical page assumption is not broken. So i doubt VFIO folks have
> > > > any incentive to go down the mmu notifier path and invalidate device
> > > > mapping. They also have the replay thing that probably handle some
> > > > of fork cases by trusting user space program to do it. In your case
> > > > you can not trust the user space program.
> > > > 
> > > > In your case AFAICT i do not see any warning or gotcha so the following
> > > > scenario is broken (in non SVA/SVM):
> > > >     1) program setup the device (open container, mdev, setup queue, ...)
> > > >     2) program map some range of its address space wih VFIO_IOMMU_MAP_DMA
> > > >     3) program start using the device using map setup in 2)
> > > >     ...
> > > >     4) program fork()
> > > >     5) parent trigger COW inside the range setup in 2)
> > > > 
> > > >     At this point it is the child process that can write to the page that
> > > >     are access by the device (which was map by the parent in 2)). The
> > > >     parent can no longer access that memory from the CPU.
> > > > 
> > > > There is just no sane way to fix this beside invalidating device mapping
> > > > on fork (and you can not rely on userspace to do so) and thus stopping
> > > > the device on fork (SVA/SVM case do not have any issue here).
> > > 
> > > Indeed. But as soon as we choose to expose the device space to the user space,
> > > the limitation is already there. If we want to solve the problem, we have to
> > > have a hook in the copy_process() procedure and copy the parent's queue state to
> > > a new queue, assign it to the child's fd and redirect the child's mmap to
> > > it. If I can do so, the same logic can also be applied to VFIO.
> > 
> > Except we do not want to do that and this does not solve the COW i describe
> > above unless you disable COW altogether which is a big no.
> > 
> > > The good side is, this is not a security leak. The hardware has been given to
> > > the process. It is the process who choose to share it. If it won't work, it is
> > > the process's problem;)
> > 
> > No this is bad, you can not expect every single userspace program to know and
> > be aware of that. If some trusted application (say systemd or firefox, ...)
> > start using your device and is unaware or does not comprehend all side effect
> > it would allow the child to access/change its parent memory and that is bad.
> > Device driver need to protect against user doing stupid thing. All existing
> > device drivers that do ATS/PASID already protect themselves against the child (ie the
> > child can not use the device, through a mix of O_CLOEXEC and other checks).
> > 
> > You must set the O_CLOEXEC flag but that does not solve the COW problem above.
> > 
> > My motto in life is "do not trust userspace" :) which also translate to "do
> > not expect userspace will do the right thing".
> > 
> 
> We don't really trust user space here. We trust that the process cannot do more
> than what has been given to it. If an application use WarpDrive, and share its
> memory with it, it should know what happen when it fork a sub-process.  This is
> just like you clone a sub-process with CLONE_FILES or CLONE_VM, the parent
> process should know what happen.

Except that applications do not always know; the whole container business
is an example of that. Most applications have tons of dependencies and rely
on various libraries. There is no way they audit all the libraries all the
time.

So if a library decides to use such a device, this trickles down into the
application without its knowledge, and you can open security issues
through that.

To be clear, what worries me is the non-SVA/SVM case: there is no way
for the device driver to intercept fork() and thus no way for it to break
COW for the parent, so the child could potentially write hardware commands
for the device (this depends on the device, for instance if the cmd queue
mmap only points to regular memory from which the device fetches its
commands). I guess relying on the IOMMU might be good enough (ie only the
ranges mapped by the first user of the hardware would be accessible).
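
As an aside, the user-space counterpart of VM_DONTCOPY is
madvise(MADV_DONTFORK). The sketch below is only an illustration of what a
cooperating process could do for a buffer it has handed to a device; it relies
on userspace doing the right thing, which is exactly what the driver cannot
count on. The function name buffer_visible_in_child() is invented for this
example, and an anonymous page stands in for the DMA buffer.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: maps one anonymous page (standing in for a buffer
 * shared with a device), optionally marks it MADV_DONTFORK, forks, and
 * returns 1 if the child can still read the page, 0 if it cannot,
 * -1 on error.
 */
int buffer_visible_in_child(int dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    int pipefd[2], status, ret = -1;
    pid_t pid;
    char *buf;

    assert(page > 0);
    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'x';

    if (dontfork && madvise(buf, (size_t)page, MADV_DONTFORK) != 0)
        goto out_unmap;
    if (pipe(pipefd) < 0)
        goto out_unmap;

    pid = fork();
    if (pid == 0) {
        /* write() fails with EFAULT if buf is not mapped in the child. */
        _exit(write(pipefd[1], buf, 1) == 1 ? 1 : 0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        ret = WEXITSTATUS(status);

    close(pipefd[0]);
    close(pipefd[1]);
out_unmap:
    munmap(buf, (size_t)page);
    return ret;
}
```

Without the madvise() call the child inherits the range copy-on-write, which is
the situation described in the fork scenario earlier in this thread.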


> > > > > And now we have GUP-longterm and many accounting work in VFIO, we don't want to
> > > > > do that again.
> > > > 
> > > > GUP-longterm does not solve any GUP problem, it just block people to
> > > > do GUP on DAX backed vma to avoid pining persistent memory as it is
> > > > a nightmare to handle in the block device driver and file system code.
> > > > 
> > > > The accounting is the rt limit thing and is litteraly 10 lines of
> > > > code so i would not see that as hard to replicate.
> > > 
> > > OK. Agree.
> > > 
> > > > 
> > > > 
> > > > > > > Now We are talking about SVA and PASID, just to make sure WarpDrive can benefit
> > > > > > > from the feature in the future. It dose not means WarpDrive is useless before
> > > > > > > that. And it works for our Zip and RSA accelerators in physical world.
> > > > > > 
> > > > > > Just not with random process address ...
> > > > > > 
> > > > > > > > If you still want non SVA/SVM path what you want to do only works
> > > > > > > > if both ptr1 and ptr2 are in a range that is DMA mapped to the
> > > > > > > > device (moreover you need DMA address to match process address
> > > > > > > > which is not an easy feat).
> > > > > > > > 
> > > > > > > > Now even if you only want SVA/SVM, i do not see what is the point
> > > > > > > > of doing this inside VFIO. AMD GPU driver does not and there would
> > > > > > > > be no benefit for them to be there. Well a AMD VFIO mdev device
> > > > > > > > driver for QEMU guest might be useful but they have SVIO IIRC.
> > > > > > > > 
> > > > > > > > For SVA/SVM your usage model is:
> > > > > > > > 
> > > > > > > > Setup:
> > > > > > > >     - user space create a warp drive context for the process
> > > > > > > >     - user space create a device specific context for the process
> > > > > > > >     - user space create a user space command queue for the device
> > > > > > > >     - user space bind command queue
> > > > > > > > 
> > > > > > > >     At this point the kernel driver has bound the process address
> > > > > > > >     space to the device with a command queue and userspace
> > > > > > > > 
> > > > > > > > Usage:
> > > > > > > >     - user space schedule work and call appropriate flush/update
> > > > > > > >       ioctl from time to time. Might be optional depends on the
> > > > > > > >       hardware, but probably a good idea to enforce so that kernel
> > > > > > > >       can unbind the command queue to bind another process command
> > > > > > > >       queue.
> > > > > > > >     ...
> > > > > > > > 
> > > > > > > > Cleanup:
> > > > > > > >     - user space unbind command queue
> > > > > > > >     - user space destroy device specific context
> > > > > > > >     - user space destroy warp drive context
> > > > > > > >     All the above can be implicit when closing the device file.
> > > > > > > > 
> > > > > > > > So again in the above model i do not see anywhere something from
> > > > > > > > VFIO that would benefit this model.
> > > > > > > > 
> > > > > > > 
> > > > > > > Let me show you how the model will be if I use VFIO:
> > > > > > > 
> > > > > > > Setup (Kernel part)
> > > > > > > 	- Kernel driver do every as usual to serve the other functionality, NIC
> > > > > > > 	  can still be registered to netdev, encryptor can still be registered
> > > > > > > 	  to crypto...
> > > > > > > 	- At the same time, the driver can devote some of its hardware resource
> > > > > > > 	  and register them as a mdev creator to the VFIO framework. This just
> > > > > > > 	  need limited change to the VFIO type1 driver.
> > > > > > 
> > > > > > In the above VFIO does not help you one bit ... you can do that with
> > > > > > as much code with new common device as front end.
> > > > > > 
> > > > > > > Setup (User space)
> > > > > > > 	- System administrator create mdev via the mdev creator interface.
> > > > > > > 	- Following VFIO setup routine, user space open the mdev's group, there is
> > > > > > > 	  only one group for one device.
> > > > > > > 	- Without PASID support, you don't need to do anything. With PASID, bind
> > > > > > > 	  the PASID to the device via VFIO interface.
> > > > > > > 	- Get the device from the group via VFIO interface and mmap it the user
> > > > > > > 	  space for device's MMIO access (for the queue).
> > > > > > > 	- Map whatever memory you need to share with the device with VFIO
> > > > > > > 	  interface.
> > > > > > > 	- (opt) Add more devices into the container if you want to share the
> > > > > > > 	  same address space with them
> > > > > > 
> > > > > > So all VFIO buys you here is boiler plate code that does insert_pfn()
> > > > > > to handle MMIO mapping. Which is just couple hundred lines of boiler
> > > > > > plate code.
> > > > > > 
> > > > > 
> > > > > No. With VFIO, I don't need to:
> > > > > 
> > > > > 1. GUP and accounting for RLIMIT_MEMLOCK
> > > > 
> > > > That's 10 line of code ...
> > > > 
> > > > > 2. Keep all GUP pages for releasing (VFIO uses the rb_tree to do so)
> > > > 
> > > > GUP pages are not part of rb_tree and what you want to do can be done
> > > > in few lines of code here is pseudo code:
> > > > 
> > > > warp_dma_map_range(ulong vaddr, ulong npages)
> > > > {
> > > >     /* array of page pointers, one per page to pin */
> > > >     struct page **pages = kvzalloc(npages * sizeof(*pages), GFP_KERNEL);
> > > > 
> > > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> > > >         GUP(vaddr, &pages[i]);
> > > >         iommu_map(vaddr, page_to_pfn(pages[i]));
> > > >     }
> > > >     kvfree(pages);
> > > > }
> > > > 
> > > > warp_dma_unmap_range(ulong vaddr, ulong npages)
> > > > {
> > > >     for (i = 0; i < npages; ++i, vaddr += PAGE_SIZE) {
> > > >         unsigned long pfn;
> > > > 
> > > >         /* iommu_iova_to_phys() returns a physical address, not a pfn */
> > > >         pfn = iommu_iova_to_phys(vaddr) >> PAGE_SHIFT;
> > > >         iommu_unmap(vaddr);
> > > >         put_page(pfn_to_page(pfn)); /* set dirty if mapped write */
> > > >     }
> > > > }
> > > > 
> > > 
> > > But what if the process exist without unmapping? The pages will be pinned in the
> > > kernel forever.
> > 
> > Yeah add a struct warp_map { struct list_head list; unsigned long vaddr,
> > unsigned long npages; } for every mapping, store the head into the device
> > file private field of struct file and when the release fs callback is
> > call you can walk done the list to force unmap any leftover. This is not
> > that much code. You ca even use interval tree which is 3 lines of code
> > with interval_tree_generic.h to speed up warp_map lookup on unmap ioctl.
> > 
> 
> Yes, when you add all of them... it is VFIO:)

No, VFIO not only needs to track all mappings, it also needs to track iova
to vaddr, which you do not need. The VFIO code is much more complex because
of that. What you would need is only ~200 lines of code.
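
To make the bookkeeping concrete, here is a user-space model of the per-file
mapping list suggested earlier in this exchange. It is a sketch, not kernel
code: the names warp_ctx, warp_track_map(), warp_untrack_map() and
warp_release_all() are hypothetical, a plain singly linked list replaces
struct list_head, and the actual pin/unpin and IOMMU calls are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* One record per mapped range (hypothetical layout). */
struct warp_map {
    struct warp_map *next;
    unsigned long vaddr;
    unsigned long npages;
};

/* Stands in for the private data hanging off the device's struct file. */
struct warp_ctx {
    struct warp_map *maps;
};

/* Map path: remember the range so it can be undone later. */
int warp_track_map(struct warp_ctx *ctx, unsigned long vaddr,
                   unsigned long npages)
{
    struct warp_map *m;

    assert(ctx != NULL && npages > 0);
    m = malloc(sizeof(*m));
    if (!m)
        return -1;
    m->vaddr = vaddr;
    m->npages = npages;
    m->next = ctx->maps;
    ctx->maps = m;
    return 0;
}

/* Unmap path: returns the number of pages to unpin, 0 if not found. */
unsigned long warp_untrack_map(struct warp_ctx *ctx, unsigned long vaddr)
{
    struct warp_map **pp;

    assert(ctx != NULL);
    for (pp = &ctx->maps; *pp; pp = &(*pp)->next) {
        struct warp_map *m = *pp;

        if (m->vaddr == vaddr) {
            unsigned long npages = m->npages;

            *pp = m->next;
            free(m);
            return npages;
        }
    }
    return 0;
}

/* Release path: force-unmap whatever the process left behind. */
unsigned long warp_release_all(struct warp_ctx *ctx)
{
    unsigned long total = 0;

    assert(ctx != NULL);
    while (ctx->maps) {
        struct warp_map *m = ctx->maps;

        ctx->maps = m->next;
        total += m->npages;
        free(m);
    }
    return total;
}
```

A real driver would call the unpin/iommu_unmap logic where this model only
counts pages, and could swap the list for an interval tree if lookup on unmap
becomes a bottleneck.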

> > 
> > > > Add locking, error handling, dirtying and comments and you are barely
> > > > looking at couple hundred lines of code. You do not need any of the
> > > > complexity of VFIO as you do not have the same requirements. Namely
> > > > VFIO have to keep track of iova and physical mapping for things like
> > > > migration (migrating guest between host) and few others very
> > > > virtualization centric requirements.
> > > > 
> > > > 
> > > > > 3. Handle the PASID on SMMU (ARM's IOMMU) myself.
> > > > 
> > > > Existing drivers do that with 20 lines of code, with comments and error
> > > > handling (see kfd_iommu_bind_process_to_device() for instance); i
> > > > doubt you need much more than that.
> > > > 
> > > 
> > > OK, I agree.
> > > 
> > > > 
> > > > > 4. Multiple devices management (VFIO uses container to manage this)
> > > > 
> > > > All the vfio_group* stuff ? OK that's boiler plate code, not that
> > > > hard to replicate though.
> > > 
> > > No, I meant the container thing. Several devices/group can be assigned to the
> > > same container and the DMA on the container can be assigned to all those
> > > devices. So we can have some devices to share the same name space.
> > 
> > This was the motivation of my question below, to me this is a policy
> > decision and it should be left to userspace to decide but not forced
> > upon userspace because it uses a given device driver.
> > 
> > Maybe i am wrong but i think you can create container and device group
> > without having a VFIO driver for the devices in the group. It is not
> > something i do often so i might be wrong here.
> > 
> 
> Container maintains a virtual unify address space for all group/iommu which is
> added to it. It simplify the address management. But yes, you can choose to do
> it all in the user space.
>
> > 
> > > > > And even as a boiler plate, it is valueable, the memory thing is sensitive
> > > > > interface to user space, it can easily become a security problem. If I can
> > > > > achieve my target within the scope of VFIO, why not? At lease it has been
> > > > > proved to be safe for the time being.
> > > > 
> > > > The thing is being part of VFIO impose things on you, things that you
> > > > do not need. Like one device per group (maybe it is you imposing this,
> > > > i am loosing track here). Or the complex dma mapping tracking ...
> > > > 
> > > 
> > > Err... But the one-device-per-group is not VFIO's decision. It is IOMMU's :).
> > > Unless I don't use IOMMU.
> > 
> > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> > by default at boot or at least all devices behind the same bridge.
> > 
> > Maybe they are kernel option to avoid that and userspace init program
> > can definitly re-arrange that base on sysadmin policy).
> > 
> 
> But if the IOMMU is enabled, all PCIE devices have their own device IDs. So the
> IOMMU can use different page table for every of them.

On PCIE, AFAIR, the IOMMU can not differentiate between devices, which is
why by default they end up with the same domain in the same group.


> > > > > > > Cleanup:
> > > > > > > 	- User space close the group file handler
> > > > > > > 	- There will be a problem to let the other process know the mdev is
> > > > > > > 	  freed to be used again. My RFCv1 choose a file handler solution. Alex
> > > > > > > 	  dose not like it. But it is not a big problem. We can always have a
> > > > > > > 	  scheduler process to manage the state of the mdev or even we can
> > > > > > > 	  switch back to the RFCv1 solution without too much effort if we like
> > > > > > > 	  in the future.
> > > > > > 
> > > > > > If you were outside VFIO you would have more freedom on how to do that.
> > > > > > For instance process opening the device file can be placed on queue and
> > > > > > first one in the queue get to use the device until it closes/release the
> > > > > > device. Then next one in queue get the device ...
> > > > > 
> > > > > Yes. I do like the file handle solution. But I hope the solution become mature
> > > > > as soon as possible. Many of our products, and as I know include some of our
> > > > > partners, are waiting for a long term solution as direction. If I rely on some
> > > > > unmature solution, they may choose some deviated, customized solution. That will
> > > > > be much harmful. Compare to this, the freedom is not so important...
> > > > 
> > > > I do not see how being part of VFIO protect you from people doing crazy
> > > > thing to their kernel ... Time to market being key in this world, i doubt
> > > > that being part of VFIO would make anyone think twice before taking a
> > > > shortcut.
> > > > 
> > > > I have seen horrible things on that front and only players like Google
> > > > can impose a minimum level of sanity.
> > > > 
> > > 
> > > OK. My fault, to talk about TTM. It has nothing doing with the architecture
> > > decision. But I don't yet see what harm will be brought if I use VFIO when it
> > > can fulfill almost all my requirements.
> > 
> > The harm is in forcing the device group isolation policy which is not
> > necessary for all devices like you said so yourself on ARM with the
> > device stream id so that IOMMU can identify individual devices.
> 
> No. Some mini SoC share the same stream id among several small devices. The
> iommu has to treat them as the same. That is why IOMMU introduces the concept of
> iommu_group. Personally I don't like that. If they share the same stream id,
> they should be treated as the same hardware. But it is the decision for the time
> being...

Still, a policy decision, ie how to group devices together, should be
something left to the sysadmin/distribution. When the hardware has
constraints, it just limits the choices of userspace.


> > So i would rather see the device isolation as something orthogonal to
> > what you want to achieve and that should be forced upon user ie sysadmin
> > should control that distribution can have sane default for each platform.
> > 
> > 
> > > > > > > Except for the minimum update to the type1 driver and use sdmdev to manage the
> > > > > > > interrupt sharing, I don't need any extra code to gain the address sharing
> > > > > > > capability. And the capability will be strengthen along with the upgrade of VFIO.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > > And I don't understand why I should avoid to use VFIO? As Alex said, VFIO is the
> > > > > > > > > > > user driver framework. And I need exactly a user driver interface. Why should I
> > > > > > > > > > > invent another wheel? It has most of stuff I need:
> > > > > > > > > > > 
> > > > > > > > > > > 1. Connecting multiple devices to the same application space
> > > > > > > > > > > 2. Pinning and DMA from the application space to the whole set of device
> > > > > > > > > > > 3. Managing hardware resource by device
> > > > > > > > > > > 
> > > > > > > > > > > We just need the last step: make sure multiple applications and the kernel can
> > > > > > > > > > > share the same IOMMU. Then why shouldn't we use VFIO?
> > > > > > > > > > 
> > > > > > > > > > Because tons of other drivers already do all of the above outside VFIO. Many
> > > > > > > > > > driver have a sizeable userspace side to them (anything with ioctl do) so they
> > > > > > > > > > can be construded as userspace driver too.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Ignoring if there are *tons* of drivers are doing that;), even I do the same as
> > > > > > > > > i915 and solve the address space problem. And if I don't need to with VFIO, why
> > > > > > > > > should I spend so much effort to do it again?
> > > > > > > > 
> > > > > > > > Because you do not need any code from VFIO, nor do you need to reinvent
> > > > > > > > things. If non SVA/SVM matters to you then use dma buffer. If not then
> > > > > > > > i do not see anything in VFIO that you need.
> > > > > > > > 
> > > > > > > 
> > > > > > > As I have explain, if I don't use VFIO, at lease I have to do all that has been
> > > > > > > done in i915 or even more than that.
> > > > > > 
> > > > > > So beside the MMIO mmap() handling and dma mapping of range of user space
> > > > > > address space (again all very boiler plate code duplicated accross the
> > > > > > kernel several time in different forms). You do not gain anything being
> > > > > > inside VFIO right ?
> > > > > > 
> > > > > 
> > > > > As I said, rb-tree for gup, rlimit accounting, cooperation on SMMU, and mature
> > > > > user interface are our concern.
> > > > > > 
> > > > > > > > > > So there is no reasons to do that under VFIO. Especialy as in your example
> > > > > > > > > > it is not a real user space device driver, the userspace portion only knows
> > > > > > > > > > about writting command into command buffer AFAICT.
> > > > > > > > > > 
> > > > > > > > > > VFIO is for real userspace driver where interrupt, configurations, ... ie
> > > > > > > > > > all the driver is handled in userspace. This means that the userspace have
> > > > > > > > > > to be trusted as it could program the device to do DMA to anywhere (if
> > > > > > > > > > IOMMU is disabled at boot which is still the default configuration in the
> > > > > > > > > > kernel).
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > But as Alex explained, VFIO is not simply used by VM. So it need not to have all
> > > > > > > > > stuffs as a driver in host system. And I do need to share the user space as DMA
> > > > > > > > > buffer to the hardware. And I can get it with just a little update, then it can
> > > > > > > > > service me perfectly. I don't understand why I should choose a long route.
> > > > > > > > 
> > > > > > > > Again this is not the long route i do not see anything in VFIO that
> > > > > > > > benefit you in the SVA/SVM case. A basic character device driver can
> > > > > > > > do that.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > So i do not see any reasons to do anything you want inside VFIO. All you
> > > > > > > > > > want to do can be done outside as easily. Moreover it would be better if
> > > > > > > > > > you define clearly each scenario because from where i sit it looks like
> > > > > > > > > > you are opening the door wide open to userspace to DMA anywhere when IOMMU
> > > > > > > > > > is disabled.
> > > > > > > > > > 
> > > > > > > > > > When IOMMU is disabled you can _not_ expose command queue to userspace
> > > > > > > > > > unless your device has its own page table and all commands are relative
> > > > > > > > > > to that page table and the device page table is populated by kernel driver
> > > > > > > > > > in secure way (ie by checking that what is populated can be access).
> > > > > > > > > > 
> > > > > > > > > > I do not believe your example device to have such page table nor do i see
> > > > > > > > > > a fallback path when IOMMU is disabled that force user to do ioctl for
> > > > > > > > > > each commands.
> > > > > > > > > > 
> > > > > > > > > > Yes i understand that you target SVA/SVM but still you claim to support
> > > > > > > > > > non SVA/SVM. The point is that userspace can not be trusted if you want
> > > > > > > > > > to have random program use your device. I am pretty sure that all user
> > > > > > > > > > of VFIO are trusted process (like QEMU).
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Finaly i am convince that the IOMMU grouping stuff related to VFIO is
> > > > > > > > > > useless for your usecase. I really do not see the point of that, it
> > > > > > > > > > does complicate things for you for no reasons AFAICT.
> > > > > > > > > 
> > > > > > > > > Indeed, I don't like the group thing. I believe VFIO's maintains would not like
> > > > > > > > > it very much either;). But the problem is, the group reflects to the same
> > > > > > > > > IOMMU(unit), which may shared with other devices.  It is a security problem. I
> > > > > > > > > cannot ignore it. I have to take it into account event I don't use VFIO.
> > > > > > > > 
> > > > > > > > To me it seems you are making a policy decission in kernel space ie
> > > > > > > > wether the device should be isolated in its own group or not is a
> > > > > > > > decission that is up to the sys admin or something in userspace.
> > > > > > > > Right now existing user of SVA/SVM don't (at least AFAICT).
> > > > > > > > 
> > > > > > > > Do we really want to force such isolation ?
> > > > > > > > 
> > > > > > > 
> > > > > > > But it is not my decision, that how the iommu subsystem is designed. Personally
> > > > > > > I don't like it at all, because all our hardwares have their own stream id
> > > > > > > (device id). I don't need the group concept at all. But the iommu subsystem
> > > > > > > assume some devices may share the name device ID to a single IOMMU.
> > > > > > 
> > > > > > My question was do you really want to force group isolation for the
> > > > > > device ? Existing SVA/SVM capable driver do not force that, they let
> > > > > > the userspace decide this (sysadm, distributions, ...). Being part of
> > > > > > VFIO (in the way you do, likely ways to avoid this inside VFIO too)
> > > > > > force this decision ie make a policy decision without userspace having
> > > > > > anything to say about it.
> > > > 
> > > > You still do not answer my question, do you really want to force group
> > > > isolation for device in your framework ? Which is a policy decision from
> > > > my POV and thus belong to userspace and should not be enforce by kernel.
> > > 
> > > No. But I have to follow the rule defined by IOMMU, haven't I?
> > 
> > The IOMMU rule does not say that every device _must_ always be in one
> > group and one domain only. I am pretty sure on x86 by default you get
> > one domain for all PCIE devices behind same bridge.
> > 
> 
> Really? Give me some more time. I need to check it out.

On my computer right now I have one group per PCIe bridge, and all
devices behind the same bridge are in the same group.

ls /sys/kernel/iommu_groups/*/devices
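
The same check can be scripted. Below is a small sketch that walks the sysfs
layout used by the ls command above and counts groups holding more than one
device. The function names are made up for this example, and the root path is
passed in so the code can be pointed at any directory with the same layout.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count directory entries other than "." and ".."; -1 if unreadable. */
static int count_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    int n = 0;

    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") && strcmp(ent->d_name, ".."))
            n++;
    }
    closedir(dir);
    return n;
}

/*
 * Hypothetical helper: prints the device count of every group found under
 * root (normally /sys/kernel/iommu_groups) and returns how many groups
 * hold more than one device, or -1 if root can not be read.
 */
int count_shared_iommu_groups(const char *root)
{
    DIR *dir;
    struct dirent *ent;
    int shared = 0;

    assert(root != NULL);
    dir = opendir(root);
    if (!dir)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char path[4096];
        int n;

        if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
            continue;
        snprintf(path, sizeof(path), "%s/%s/devices", root, ent->d_name);
        n = count_entries(path);
        if (n < 0)
            continue;       /* not a group directory, skip it */
        printf("group %s: %d device(s)\n", ent->d_name, n);
        if (n > 1)
            shared++;
    }
    closedir(dir);
    return shared;
}
```

Calling count_shared_iommu_groups("/sys/kernel/iommu_groups") on a machine
with the IOMMU enabled lists each group; a non-zero return means at least one
group contains several devices.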

> > My point is that the device grouping into domain/group should be an
> > orthogonal decision ie it should not be a requirement by the device
> > driver and should be under control of userspace as it is a policy
> > decission.
> > 
> > One exception is if it is unsafe to have devices share a domain, in which
> > case, following my motto, the driver should refuse to work and return
> > an error on open (and a kernel message explaining why). But this depends on
> > device and platform.
> > 

Cheers
Jérôme



* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-14  6:50                             ` Tian, Kevin
  2018-09-14 13:05                               ` Kenneth Lee
@ 2018-09-14 14:13                               ` Jerome Glisse
  1 sibling, 0 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-14 14:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kenneth Lee, Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Zaibo Xu, linux-doc, Kumar, Sanjay K,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Lu Baolu, David S . Miller, linux-accelerators, Joerg Roedel

On Fri, Sep 14, 2018 at 06:50:55AM +0000, Tian, Kevin wrote:
> > From: Jerome Glisse
> > Sent: Thursday, September 13, 2018 10:52 PM
> >
> [...]
> > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> > by default at boot or at least all devices behind the same bridge.
> 
> the group thing reflects physical hierarchy limitation, not changed
> cross boot. Please note iommu group defines the minimal isolation
> boundary - all devices within same group must be attached to the
> same iommu domain or address space, because physically IOMMU
> cannot differentiate DMAs out of those devices. devices behind
> legacy PCI-X bridge is one example. other examples include devices
> behind a PCIe switch port which doesn't support ACS thus cannot
> route p2p transaction to IOMMU. If talking about typical PCIe 
> endpoint (with upstreaming ports all supporting ACS), you'll get
> one device per group.
> 
> One iommu group today is attached to only one iommu domain.
> In the future one group may attach to multiple domains, as the
> aux domain concept being discussed in another thread.

Thanks for the info.

> 
> > 
> > Maybe they are kernel option to avoid that and userspace init program
> > can definitly re-arrange that base on sysadmin policy).
> 
> I don't think there is such option, as it may break isolation model
> enabled by IOMMU.
> 
> [...]
> > > > That is why i am being pedantic :) on making sure there is good reasons
> > > > to do what you do inside VFIO. I do believe that we want a common
> > > > framework like the one you are proposing but i do not believe it should be
> > > > part of VFIO given the baggages it comes with and that are not relevant
> > > > to the use cases for this kind of devices.
> > >
> 
> The purpose of VFIO is clear - the kernel portal for granting generic 
> device resource (mmio, irq, etc.) to user space. VFIO doesn't care
> what exactly a resource is used for (queue, cmd reg, etc.). If really
> pursuing VFIO path is necessary, maybe such common framework
> should lay down in user space, which gets all granted resource from
> kernel driver thru VFIO and then provides accelerator services to 
> other processes?

Except that many existing device drivers fall under that description
(ie exposing mmio, command queues, ...) and are not under VFIO.

Up to mdev, VFIO was all about handing a full device over to userspace,
AFAIK. With the introduction of mdev, a host kernel driver can "slice" its
device and share it through VFIO with userspace. Note that in that case
it might never hand over any mmio, irq, ...; the host driver might just
be handing over memory and polling it to schedule work on
the real hardware.


The question I am asking about WarpDrive is whether being in VFIO is
necessary, as I do not see the requirement myself.

Cheers,
Jérôme


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
                   ` (8 preceding siblings ...)
  2018-09-04 15:00 ` Jerome Glisse
@ 2018-09-17  1:42 ` Jerome Glisse
  2018-09-17  8:39   ` Kenneth Lee
  2018-09-21 10:03   ` Kenneth Lee
  9 siblings, 2 replies; 58+ messages in thread
From: Jerome Glisse @ 2018-09-17  1:42 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Jonathan Corbet, Herbert Xu, David S . Miller, Joerg Roedel,
	Alex Williamson, Kenneth Lee, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

So i want to summarize the issues i have, as this thread has dug deep into
details. For this i would like to differentiate two cases: first the easy
one, when relying on SVA/SVM, then the second one, when there is no SVA/SVM.
In both cases your objectives, as i understand them, are:

[R1]- expose a common user space API that makes it easy to share boiler
      plate code across many devices (discovering devices, opening
      device, creating context, creating command queue ...).
[R2]- try to share the device as much as possible up to device limits
      (number of independent queues the device has)
[R3]- minimize syscalls by allowing user space to directly schedule on the
      device queue without a round trip to the kernel

I don't think i missed any.


(1) Device with SVA/SVM

For that case it is easy, you do not need to be in VFIO or part of
anything specific in the kernel. There is no security risk (modulo bugs in
the SVA/SVM silicon). Fork/exec is properly handled and binding a process
to a device is just a couple dozen lines of code.


(2) Device does not have SVA/SVM (or it is disabled)

You still want to allow such devices to be part of your framework. However
here i see fundamental security issues, and you move the burden of
being careful to user space, which i think is a bad idea. We should
never trust userspace from kernel space.

To keep the same API for the user space code you want a 1:1 mapping
between device physical address and process virtual address (i.e. if
the device accesses device physical address A it is accessing the same
memory as what is backing virtual address A in the process).
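
For illustration, such a 1:1 mapping could be requested through the
existing VFIO type1 interface roughly as below. This is only a sketch: the
warp_* helper names are made up for this mail, container_fd is assumed to
be an already configured VFIO container, and as far as i know the type1
backend wants vaddr, iova and size aligned to the IOMMU page size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: fill a type1 map request so that the address the
 * device uses (iova) equals the process virtual address of the buffer.
 */
static void warp_fill_identity_map(struct vfio_iommu_type1_dma_map *map,
                                   void *buf, size_t len)
{
    assert(map != NULL);
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);
    map->flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map->vaddr = (uint64_t)(uintptr_t)buf;
    map->iova = map->vaddr;     /* the 1:1 part */
    map->size = len;
}

/*
 * Hypothetical helper: issue the mapping on a VFIO container.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int warp_map_identity(int container_fd, void *buf, size_t len)
{
    struct vfio_iommu_type1_dma_map map;

    warp_fill_identity_map(&map, buf, len);
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Note that nothing in this request tells the kernel what kind of memory
sits behind buf, which is where the issues listed next come from.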

The issues i see are:
[I1]- fork/exec, a process which opened any such device and created an
      active queue can, without its knowledge, transfer control of its
      command queue through COW. The parent maps some anonymous region
      to the device as a command queue buffer, but because of COW the
      parent can be the first to copy on write and thus the child can
      inherit the original pages that are mapped to the hardware.
      Here the parent loses control and the child gains it.

[I2]- Because of [R3] you want to allow userspace to schedule commands
      on the device without doing an ioctl, and thus user space
      can schedule any command to the device with any address. What
      happens if that address has not been mapped by user space
      is undefined, and in fact can not be defined, as what each IOMMU
      does on an invalid address access differs from IOMMU to IOMMU.

      In case of a bad IOMMU, or simply an IOMMU improperly set up by
      the kernel, this can potentially allow user space to DMA anywhere.

[I3]- By relying on GUP in VFIO you are not abiding by the implicit
      contract (at least i hope it is implicit) that you should not
      try to map to the device any file backed vma (private or shared).

      The VFIO code never checks the vma controlling the addresses that
      are provided to the VFIO_IOMMU_MAP_DMA ioctl. Which means that
      user space can provide a file backed range.

      I am guessing that the VFIO code never had any issues because its
      number one user is QEMU and QEMU never does that (and that's good
      as no one should ever do that).

      So if a process does that you are opening yourself to serious file
      system corruption (depending on the file system this can lead to
      total data loss for the filesystem).

      The issue is that once you GUP you never abide by file system
      flushing, which write protects the page before writing it to disk.
      Because the page is still mapped with write permission to the device
      (assuming VFIO_IOMMU_MAP_DMA was a write map), the device can
      write to the page while it is in the middle of being written back
      to disk. Consult your nearest file system specialist to ask
      how bad that can be.

[I4]- Design issue: the mdev design, As Far As I Understand It, is about
      sharing a single device among multiple clients (the most obvious
      case here is again a QEMU guest). But you are going against that
      model, in fact AFAIUI you are doing the exact opposite. When there
      is no SVA/SVM you want only one mdev device that can not be shared.

      So this is counter intuitive to the existing mdev design. It is
      not about sharing a device among multiple users but about giving
      exclusive access to the device to one user.
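
To make [I1] a bit more concrete, here is a small user space sketch (the
function name is made up). It only shows that after fork a write by the
parent makes the two processes stop sharing the page, so the child keeps
seeing the pre-fork contents. Which of the two ends up with the physical
page that a device would have pinned is not visible from user space here;
the concern in [I1] is that it can be the child.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the byte a forked child reads from a private anonymous page
 * after the parent has overwritten that page post-fork, or -1 on a
 * setup failure.
 */
static int child_view_after_parent_write(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char token = 0;
    int go[2], status, wrote;
    pid_t pid;
    char *buf;

    assert(page > 0);
    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED || pipe(go) != 0)
        return -1;

    buf[0] = 'A';               /* contents at fork time */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(go[1]);
        /* Block until the parent has done its write. */
        if (read(go[0], &token, 1) != 1)
            _exit(0);
        _exit(buf[0]);          /* report what the child sees */
    }

    close(go[0]);
    buf[0] = 'B';               /* parent writes first, triggering COW */
    wrote = write(go[1], &token, 1) == 1;
    close(go[1]);
    if (waitpid(pid, &status, 0) != pid || !wrote || !WIFEXITED(status))
        return -1;
    munmap(buf, (size_t)page);
    return WEXITSTATUS(status);
}
```

The child reports 'A' even though the parent wrote 'B', so a command
buffer set up before fork is no longer one single thing afterwards.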



All the reasons above are why i believe a different model would serve
you and your users better. Below is a design that avoids all of the
above issues and still delivers all of your objectives, with the
exception of the third one [R3] when there is no SVA/SVM.


Create a subsystem (very much boiler plate code) which allows devices to
register themselves against it (very much like what you do in your current
patchset but outside of VFIO).

That subsystem will create a device file for each registered device and
expose a common API (i.e. a set of ioctls) for each of those device files.

When user space creates a queue (through an ioctl after opening the device
file) the kernel can return -EBUSY if all the device queues are in use,
or create a device queue and return a flag like SYNC_ONLY for devices that
do not have SVA/SVM.

For devices with SVA/SVM, at the time the process creates a queue you bind
the process PASID to the device queue. From there on userspace can
schedule commands and use the device without going to kernel space.

For devices without SVA/SVM you create a fake queue that is just pure
memory and is not related to the device. From there on userspace must
call an ioctl every time it wants the device to consume its queue
(hence the SYNC_ONLY flag, for synchronous operation only). The
kernel portion reads the fake queue exposed to user space and copies
commands into the real hardware queue, but first it properly maps any
of the process memory needed for those commands to the device and
replaces the addresses in the commands with the ones it gets from the
dma_map API.
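
The submit path just described can be modelled in plain C as below. This
is a user space simulation under assumptions, not driver code: the
warp_* names and the command layout are invented, and the warp_mapping
table stands in for whatever the real DMA mapping calls would return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_QUEUE_DEPTH 8

struct warp_cmd {
    uint32_t opcode;
    uint64_t addr;  /* user VA in the fake queue, device address in hw */
    uint64_t len;
};

struct warp_queue {
    struct warp_cmd cmds[WARP_QUEUE_DEPTH];
    size_t count;
};

/* Stand-in for a DMA mapping: a user range and where the device sees it. */
struct warp_mapping {
    uint64_t user_va, len, dev_addr;
};

static int warp_translate(const struct warp_mapping *maps, size_t nr_maps,
                          uint64_t va, uint64_t len, uint64_t *dev_addr)
{
    for (size_t i = 0; i < nr_maps; i++) {
        const struct warp_mapping *m = &maps[i];

        /* Overflow-safe check that [va, va + len) lies inside m. */
        if (va >= m->user_va && len <= m->len &&
            va - m->user_va <= m->len - len) {
            *dev_addr = m->dev_addr + (va - m->user_va);
            return 0;
        }
    }
    return -EFAULT;
}

/* Model of the "consume my queue" ioctl. Returns commands queued or -errno. */
static int warp_submit(struct warp_queue *fake, struct warp_queue *hw,
                       const struct warp_mapping *maps, size_t nr_maps)
{
    struct warp_cmd staged[WARP_QUEUE_DEPTH];
    size_t n, i;

    assert(fake && hw && hw->count <= WARP_QUEUE_DEPTH);
    n = fake->count;            /* read the user-controlled count once */
    if (n > WARP_QUEUE_DEPTH)
        return -EINVAL;
    if (n > WARP_QUEUE_DEPTH - hw->count)
        return -EBUSY;

    for (i = 0; i < n; i++) {
        int ret;

        staged[i] = fake->cmds[i];      /* snapshot, then validate */
        ret = warp_translate(maps, nr_maps, staged[i].addr,
                             staged[i].len, &staged[i].addr);
        if (ret)
            return ret;                 /* nothing reached the hardware */
    }
    for (i = 0; i < n; i++)
        hw->cmds[hw->count++] = staged[i];
    fake->count = 0;
    return (int)n;
}
```

The point of the staging copy is that the hardware queue only ever sees
commands whose addresses the kernel has checked and rewritten itself.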

With that model it is "easy" to listen to mmu_notifier and to abide by
it to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
issue by only mapping a fake device queue to userspace.

So yes, with that model every device that wishes to support
the non SVA/SVM case will have to do extra work (i.e. emulate its command
queue in software in the kernel). But by doing so, you support an
unlimited number of processes on your device (i.e. all the processes can
share one single hardware command queue or multiple hardware queues).

The big advantage i see here is that the process does not have to worry
about doing something wrong. You are protecting yourself and your users
from stupid mistakes.


I hope this is useful to you.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-17  1:42 ` Jerome Glisse
@ 2018-09-17  8:39   ` Kenneth Lee
  2018-09-17 12:37     ` Jerome Glisse
  2018-09-21 10:03   ` Kenneth Lee
  1 sibling, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-17  8:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> Date: Sun, 16 Sep 2018 21:42:44 -0400
> From: Jerome Glisse <jglisse@redhat.com>
> To: Kenneth Lee <nek.in.cn@gmail.com>
> CC: Jonathan Corbet <corbet@lwn.net>, Herbert Xu
>  <herbert@gondor.apana.org.au>, "David S . Miller" <davem@davemloft.net>,
>  Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>,
>  Sanjay Kumar <sanjay.k.kumar@intel.com>, linuxarm@huawei.com
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.1 (2018-07-13)
> Message-ID: <20180917014244.GA27596@redhat.com>
> 
> So i want to summarize issues i have as this threads have dig deep into
> details. For this i would like to differentiate two cases first the easy
> one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.

Thank you very much for the summary.

> In both cases your objectives as i understand them:
> 
> [R1]- expose a common user space API that make it easy to share boiler
>       plate code accross many devices (discovering devices, opening
>       device, creating context, creating command queue ...).
> [R2]- try to share the device as much as possible up to device limits
>       (number of independant queues the device has)
> [R3]- minimize syscall by allowing user space to directly schedule on the
>       device queue without a round trip to the kernel
> 
> I don't think i missed any.
> 
> 
> (1) Device with SVA/SVM
> 
> For that case it is easy, you do not need to be in VFIO or part of any
> thing specific in the kernel. There is no security risk (modulo bug in
> the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> to a device is just couple dozen lines of code.
> 

This is right...logically. But the kernel has no clear definition of a "Device
with SVA/SVM" and no boiler plate for supporting one. So VFIO may become one of
the boiler plates.

VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
one. If we add that support within VFIO, which solves most of the problems of
SVA/SVM, it will save a lot of work in the future.

I think this is the key conflict between us. So could Alex please say
something here? If VFIO is going to take this into its scope, we can try
together to solve all the problems on the way. If it is not, it is also
simple: we can just go another way to fulfill this part of the requirements,
even if we have to duplicate most of the code.

Another point I need to emphasize here: because we have to replace the hardware
queue on fork, it won't be very simple even in the SVA/SVM case.

> 
> (2) Device does not have SVA/SVM (or it is disabled)
> 
> You want to still allow device to be part of your framework. However
> here i see fundamentals securities issues and you move the burden of
> being careful to user space which i think is a bad idea. We should
> never trus the userspace from kernel space.
> 
> To keep the same API for the user space code you want a 1:1 mapping
> between device physical address and process virtual address (ie if
> device access device physical address A it is accessing the same
> memory as what is backing the virtual address A in the process.
> 
> Security issues are on two things:
> [I1]- fork/exec, a process who opened any such device and created an
>       active queue can transfer without its knowledge control of its
>       commands queue through COW. The parent map some anonymous region
>       to the device as a command queue buffer but because of COW the
>       parent can be the first to copy on write and thus the child can
>       inherit the original pages that are mapped to the hardware.
>       Here parent lose control and child gain it.

This is indeed an issue. But it remains an issue only if you continue to use the
queue and the memory after fork. We can use an at_fork kind of gadget to fix it
in user space.

From some perspectives, I think the issue can be solved by iommu_notifier. For
example, when the process is forked, we can set the mapped device mmio space as
COW for the child process, so a new queue can be created and set to the same
state as the parent's if the space is accessed. Then we can have two separate
queues for the parent and the child. The memory part can be done in the
same way.

The thing is, the same strategy can be applied to VFIO without changing its
original feature.

> 
> [I2]- Because of [R3] you want to allow userspace to schedule commands
>       on the device without doing an ioctl and thus here user space
>       can schedule any commands to the device with any address. What
>       happens if that address have not been mapped by the user space
>       is undefined and in fact can not be defined as what each IOMMU
>       does on invalid address access is different from IOMMU to IOMMU.
> 
>       In case of a bad IOMMU, or simply an IOMMU improperly setup by
>       the kernel, this can potentialy allow user space to DMA anywhere.

I don't think this is an issue. If you cannot trust the IOMMU and the proper
setup of the IOMMU in the kernel, you cannot trust anything. And the whole VFIO
framework is untrustworthy.

> 
> [I3]- By relying on GUP in VFIO you are not abiding by the implicit
>       contract (at least i hope it is implicit) that you should not
>       try to map to the device any file backed vma (private or share).
> 
>       The VFIO code never check the vma controlling the addresses that
>       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
>       user space can provide file backed range.
> 
>       I am guessing that the VFIO code never had any issues because its
>       number one user is QEMU and QEMU never does that (and that's good
>       as no one should ever do that).
> 
>       So if process does that you are opening your self to serious file
>       system corruption (depending on file system this can lead to total
>       data loss for the filesystem).
> 
>       Issue is that once you GUP you never abide to file system flushing
>       which write protect the page before writing to the disk. So
>       because the page is still map with write permission to the device
>       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
>       write to the page while it is in the middle of being written back
>       to disk. Consult your nearest file system specialist to ask him
>       how bad that can be.

Same as I2, it is an issue, but the problem can be solved in VFIO if we really
take it in the scope of VFIO.

> 
> [I4]- Design issue, mdev design As Far As I Understand It is about
>       sharing a single device to multiple clients (most obvious case
>       here is again QEMU guest). But you are going against that model,
>       in fact AFAIUI you are doing the exect opposite. When there is
>       no SVA/SVM you want only one mdev device that can not be share.

Wait. It is NOT "I want only one mdev device when there is no SVA/SVM", it is "I
can support only one mdev when there is no PASID support for the IOMMU".

> 
>       So this is counter intuitive to the mdev existing design. It is
>       not about sharing device among multiple users but about giving
>       exclusive access to the device to one user.
> 
> 
> 
> All the reasons above is why i believe a different model would serve
> you and your user better. Below is a design that avoids all of the
> above issues and still delivers all of your objectives with the
> exceptions of the third one [R3] when there is no SVA/SVM.
> 
> 
> Create a subsystem (very much boiler plate code) which allow device to
> register themself against (very much like what you do in your current
> patchset but outside of VFIO).
> 
> That subsystem will create a device file for each registered system and
> expose a common API (ie set of ioctl) for each of those device files.
> 
> When user space create a queue (through an ioctl after opening the device
> file) the kernel can return -EBUSY if all the device queue are in use,
> or create a device queue and return a flag like SYNC_ONLY for device that
> do not have SVA/SVM.
> 
> For device with SVA/SVM at the time the process create a queue you bind
> the process PASID to the device queue. From there on the userspace can
> schedule commands and use the device without going to kernel space.

As mentioned previously, this is not enough for the fork scenario.

> 
> For device without SVA/SVM you create a fake queue that is just pure
> memory is not related to the device. From there on the userspace must
> call an ioctl every time it wants the device to consume its queue
> (hence why the SYNC_ONLY flag for synchronous operation only). The
> kernel portion read the fake queue expose to user space and copy
> commands into the real hardware queue but first it properly map any
> of the process memory needed for those commands to the device and
> adjust the device physical address with the one it gets from dma_map
> API.
> 

But in this way, we will lose most of the benefit of avoiding syscalls.

> With that model it is "easy" to listen to mmu_notifier and to abide by
> them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> issue by only mapping a fake device queue to userspace.
> 
> So yes with that models it means that every device that wish to support
> the non SVA/SVM case will have to do extra work (ie emulate its command
> queue in software in the kernel). But by doing so, you support an
> unlimited number of process on your device (ie all the process can share
> one single hardware command queues or multiple hardware queues).

If I can do this, I will not need WarpDrive at all:(

> 
> The big advantages i see here is that the process do not have to worry
> about doing something wrong. You are protecting yourself and your user
> from stupid mistakes.
> 
> 
> I hope this is useful to you.
> 

Anyway, I will try to address the problems you mentioned in the next version,
and make both options (on-VFIO or off-VFIO) available. Thanks.

> Cheers,
> Jérôme


Cheers
-- 
			-Kenneth(Hisilicon)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-17  8:39   ` Kenneth Lee
@ 2018-09-17 12:37     ` Jerome Glisse
  2018-09-18  6:00       ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-17 12:37 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > So i want to summarize issues i have as this threads have dig deep into
> > details. For this i would like to differentiate two cases first the easy
> > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> 
> Thank you very much for the summary.
> 
> > In both cases your objectives as i understand them:
> > 
> > [R1]- expose a common user space API that make it easy to share boiler
> >       plate code accross many devices (discovering devices, opening
> >       device, creating context, creating command queue ...).
> > [R2]- try to share the device as much as possible up to device limits
> >       (number of independant queues the device has)
> > [R3]- minimize syscall by allowing user space to directly schedule on the
> >       device queue without a round trip to the kernel
> > 
> > I don't think i missed any.
> > 
> > 
> > (1) Device with SVA/SVM
> > 
> > For that case it is easy, you do not need to be in VFIO or part of any
> > thing specific in the kernel. There is no security risk (modulo bug in
> > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > to a device is just couple dozen lines of code.
> > 
> 
> This is right...logically. But the kernel has no clear definition about "Device
> with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> boiler plate.
> 
> VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> one. If we add that support within VFIO, which solve most of the problem of
> SVA/SVM, it will save a lot of work in the future.

You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM users
all do the SVA/SVM setup in a couple dozen lines and i fail to see how it
would require any more than that in your case.


> I think this is the key confliction between us. So could Alex please say
> something here? If the VFIO is going to take this into its scope, we can try
> together to solve all the problem on the way. If it it is not, it is also
> simple, we can just go to another way to fulfill this part of requirements even
> we have to duplicate most of the code.
> 
> Another point I need to emphasis here: because we have to replace the hardware
> queue when fork, so it won't be very simple even in SVA/SVM case.

I am assuming the hardware queue can only be set up by the kernel, and thus
you are totally safe forkwise: the queue is set up against a PASID, the
child does not bind to any PASID, and you use VM_DONTCOPY on the
mmap of the hardware MMIO queue because you should really use that flag
for that.
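
The effect of VM_DONTCOPY can be seen from user space without any driver,
because madvise(MADV_DONTFORK) sets that same flag on an ordinary mapping.
The sketch below (function name made up) uses anonymous memory as a
stand-in for the MMIO queue mapping; a driver would set the flag itself
in its mmap handler instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child faults when touching a MADV_DONTFORK
 * region, 0 if the child could read it, -1 on a setup failure.
 */
static int child_faults_on_dontfork_region(void)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *queue;
    int status;
    pid_t pid;

    assert(page > 0);
    queue = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (queue == MAP_FAILED)
        return -1;
    queue[0] = 1;       /* stand-in for queue state set up by the parent */

    /* Same effect on the vma as a driver setting VM_DONTCOPY. */
    if (madvise((void *)queue, (size_t)page, MADV_DONTFORK) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The mapping was not copied, so this read should fault. */
        char c = queue[0];

        _exit(c ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    munmap((void *)queue, (size_t)page);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

The child simply has no mapping at that address, so it can neither read
the queue state nor ring the device through it.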


> > (2) Device does not have SVA/SVM (or it is disabled)
> > 
> > You want to still allow device to be part of your framework. However
> > here i see fundamentals securities issues and you move the burden of
> > being careful to user space which i think is a bad idea. We should
> > never trus the userspace from kernel space.
> > 
> > To keep the same API for the user space code you want a 1:1 mapping
> > between device physical address and process virtual address (ie if
> > device access device physical address A it is accessing the same
> > memory as what is backing the virtual address A in the process.
> > 
> > Security issues are on two things:
> > [I1]- fork/exec, a process who opened any such device and created an
> >       active queue can transfer without its knowledge control of its
> >       commands queue through COW. The parent map some anonymous region
> >       to the device as a command queue buffer but because of COW the
> >       parent can be the first to copy on write and thus the child can
> >       inherit the original pages that are mapped to the hardware.
> >       Here parent lose control and child gain it.
> 
> This is indeed an issue. But it remains an issue only if you continue to use the
> queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> user space.

Trusting user space is a no go from my point of view.


> From some perspectives, I think the issue can be solved by iommu_notifier. For
> example, when the process is fork-ed, we can set the mapped device mmio space as
> COW for the child process, so a new queue can be created and set to the same
> state as the parent's if the space is accessed. Then we can have two separated
> queues for both the parent and the child. The memory part can be done in the
> same way.

The mmap of mmio space for the queue is not an issue, just use VM_DONTCOPY
for it. The issue is with COW and the IOMMU mapping of pages, and this can
not be solved in your model.

> The thing is, the same strategy can be applied to VFIO without changing its
> original feature.

No it can not, it would break the existing VFIO contract (which should only
be used against private anonymous vma).
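
As a user space approximation of what "private anonymous" means for a
given range, a process can look itself up in /proc/self/maps. The helper
below is hypothetical and only a sketch: it treats a mapping with the
private bit set and inode 0 as private anonymous, and it wants the whole
range to sit inside a single maps entry. In the kernel the equivalent
test would be done on the vma itself rather than by parsing text.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Returns 1 if [addr, addr + len) lies inside one private mapping that
 * has no backing inode, 0 otherwise, -1 if the maps file is unreadable.
 */
static int range_is_private_anon(const void *addr, size_t len)
{
    uintptr_t lo = (uintptr_t)addr, hi = lo + len;
    char line[512];
    int result = 0;
    FILE *f;

    assert(len > 0 && hi > lo);
    f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        uintmax_t start, end, inode;
        char perms[5];

        /* Line format: start-end perms offset dev inode [path] */
        if (sscanf(line, "%jx-%jx %4s %*s %*s %ju",
                   &start, &end, perms, &inode) != 4)
            continue;
        if (lo >= start && hi <= end) {
            result = perms[3] == 'p' && inode == 0;
            break;
        }
    }
    fclose(f);
    return result;
}
```

A buffer on the stack passes this check while a string literal, which
lives in the file backed image of the executable, does not.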

> 
> > 
> > [I2]- Because of [R3] you want to allow userspace to schedule commands
> >       on the device without doing an ioctl and thus here user space
> >       can schedule any commands to the device with any address. What
> >       happens if that address have not been mapped by the user space
> >       is undefined and in fact can not be defined as what each IOMMU
> >       does on invalid address access is different from IOMMU to IOMMU.
> > 
> >       In case of a bad IOMMU, or simply an IOMMU improperly setup by
> >       the kernel, this can potentialy allow user space to DMA anywhere.
> 
> I don't think this is an issue. If you cannot trust IOMMU and proper setup of
> IOMMU in kernel, you cannot trust anything. And the whole VFIO framework is
> untrustable.

VFIO devices are usually restricted to trusted users, and other devices that
do DMA do various checks to make sure user space can not abuse them. The
assumption i have always seen so far is to not trust that the IOMMU will do
all the work. So exposing user space access to a device with DMA capabilities
should be done carefully IMHO.

To be thorough, the list of potential bugs i am concerned about:
    - IOMMU hardware bug
    - IOMMU does not isolate the device after too many failed attempts
    - firmware sets up the IOMMU in some unsafe way and the linux kernel
      does not catch that (like passthrough on error or when there is no
      entry for the PA, which i am told is a thing for debug)
    - bug in the linux IOMMU kernel driver



> > 
> > [I3]- By relying on GUP in VFIO you are not abiding by the implicit
> >       contract (at least i hope it is implicit) that you should not
> >       try to map to the device any file backed vma (private or share).
> > 
> >       The VFIO code never check the vma controlling the addresses that
> >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
> >       user space can provide file backed range.
> > 
> >       I am guessing that the VFIO code never had any issues because its
> >       number one user is QEMU and QEMU never does that (and that's good
> >       as no one should ever do that).
> > 
> >       So if process does that you are opening your self to serious file
> >       system corruption (depending on file system this can lead to total
> >       data loss for the filesystem).
> > 
> >       Issue is that once you GUP you never abide to file system flushing
> >       which write protect the page before writing to the disk. So
> >       because the page is still map with write permission to the device
> >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
> >       write to the page while it is in the middle of being written back
> >       to disk. Consult your nearest file system specialist to ask him
> >       how bad that can be.
> 
> Same as I2, it is an issue, but the problem can be solved in VFIO if we really
> take it in the scope of VFIO.

Except it can not be solved without breaking VFIO. If you use mmu_notifier
it means that the IOMMU mapping can vanish at _any_ time, and because
you allow user space to directly schedule work on the hardware command
queue you do not have any synchronization point to use.

When the notifier calls you, you must stop all hardware access and wait for
any pending work that might still dereference the affected range. Finally
you can restore the mapping only once the notifier is done. AFAICT this
would break all existing VFIO users.

So solving that for your case means you would have to:
warp_invalidate_range_callback() {
    - take some lock to protect against restore
    - unmap the command queue (zap_range) from userspace to stop further
      commands
    - wait for any pending commands on the hardware to complete
    - clear all IOMMU mappings for the range
    - put_pages()
    - drop restore lock
    return to let invalidation complete its work
}

warp_restore() {
    // This is called by the page fault handler on the command queue mapped
    // to user space.
    - take restore lock
    - go over all IOMMU mappings and restore them (GUP)
    - remap command queue to userspace
    - drop restore lock
}

This model does not work for existing VFIO users AFAICT.
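
The ordering in the two callbacks above can be checked with a small user
space model. Everything here is a stand-in: the struct, the boolean state
and the function names are invented, a pthread mutex plays the restore
lock, and "waiting for the hardware" is reduced to clearing a counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct warp_ctx {
    pthread_mutex_t restore_lock;
    bool queue_mapped;      /* command queue is mapped in user space */
    bool iommu_mapped;      /* device can reach the process pages */
    unsigned int inflight;  /* commands the hardware still owns */
};

#define WARP_CTX_INIT { PTHREAD_MUTEX_INITIALIZER, true, true, 0 }

/* Model of warp_invalidate_range_callback(). */
static void warp_invalidate(struct warp_ctx *c)
{
    pthread_mutex_lock(&c->restore_lock);
    c->queue_mapped = false;    /* zap: no new commands can be written */
    c->inflight = 0;            /* stand-in for draining the hardware */
    c->iommu_mapped = false;    /* clear IOMMU mappings, drop the pages */
    pthread_mutex_unlock(&c->restore_lock);
}

/* Model of warp_restore(); the caller holds restore_lock. */
static void warp_restore_locked(struct warp_ctx *c)
{
    c->iommu_mapped = true;     /* re-pin and re-map first */
    c->queue_mapped = true;     /* only then expose the queue again */
}

/* Model of user space writing a command to the queue. */
static void warp_submit(struct warp_ctx *c)
{
    pthread_mutex_lock(&c->restore_lock);
    if (!c->queue_mapped)       /* the write would fault into restore */
        warp_restore_locked(c);
    /* Invariant: hardware never runs a command without its mappings. */
    assert(c->queue_mapped && c->iommu_mapped);
    c->inflight++;
    pthread_mutex_unlock(&c->restore_lock);
}
```

The zap of the command queue is what gives the kernel a synchronization
point here, and that is exactly the piece a direct hardware queue in VFIO
does not offer.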


> > [I4]- Design issue, mdev design As Far As I Understand It is about
> >       sharing a single device to multiple clients (most obvious case
> >       here is again QEMU guest). But you are going against that model,
> >       in fact AFAIUI you are doing the exect opposite. When there is
> >       no SVA/SVM you want only one mdev device that can not be share.
> 
> Wait. It is NOT "I want only one mdev device when there is no SVA/SVM", it is "I
> can support only one mdev when there is no PASID support for the IOMMU".

Except you can support more than one user without SVA/SVM with the model
i outlined below.


> > 
> >       So this is counter intuitive to the mdev existing design. It is
> >       not about sharing device among multiple users but about giving
> >       exclusive access to the device to one user.
> > 
> > 
> > 
> > All the reasons above is why i believe a different model would serve
> > you and your user better. Below is a design that avoids all of the
> > above issues and still delivers all of your objectives with the
> > exceptions of the third one [R3] when there is no SVA/SVM.
> > 
> > 
> > Create a subsystem (very much boiler plate code) which allow device to
> > register themself against (very much like what you do in your current
> > patchset but outside of VFIO).
> > 
> > That subsystem will create a device file for each registered system and
> > expose a common API (ie set of ioctl) for each of those device files.
> > 
> > When user space create a queue (through an ioctl after opening the device
> > file) the kernel can return -EBUSY if all the device queue are in use,
> > or create a device queue and return a flag like SYNC_ONLY for device that
> > do not have SVA/SVM.
> > 
> > For device with SVA/SVM at the time the process create a queue you bind
> > the process PASID to the device queue. From there on the userspace can
> > schedule commands and use the device without going to kernel space.
> 
> As mentioned previously, this is not enough for fork scenario.

It is for every existing user of SVA/SVM, so i fail to see why it would
be any different in your case. Note that they all use the VM_DONTCOPY flag
on the queue mapped to userspace.


> > For device without SVA/SVM you create a fake queue that is just pure
> > memory is not related to the device. From there on the userspace must
> > call an ioctl every time it wants the device to consume its queue
> > (hence why the SYNC_ONLY flag for synchronous operation only). The
> > kernel portion read the fake queue expose to user space and copy
> > commands into the real hardware queue but first it properly map any
> > of the process memory needed for those commands to the device and
> > adjust the device physical address with the one it gets from dma_map
> > API.
> > 
> 
> But in this way, we will lost most of the benefit of avoiding syscall.

Yes but only when there is no SVA/SVM. What i am trying to stress is that
there is no sane way to mirror a user space address space onto a device
without mmu_notifier, so short of that, this model where you have to
syscall to schedule things on the hardware is the easiest thing to do.


> > With that model it is "easy" to listen to mmu_notifier and to abide by
> > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> > issue by only mapping a fake device queue to userspace.
> > 
> > So yes with that models it means that every device that wish to support
> > the non SVA/SVM case will have to do extra work (ie emulate its command
> > queue in software in the kernel). But by doing so, you support an
> > unlimited number of process on your device (ie all the process can share
> > one single hardware command queues or multiple hardware queues).
> 
> If I can do this, I will not need WarpDrive at all:(

This is only needed if you wish to support non SVA/SVM. Shifting the
burden to the kernel is always the thing to do, especially if there are
legitimate security concerns.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-17 12:37     ` Jerome Glisse
@ 2018-09-18  6:00       ` Kenneth Lee
  2018-09-18 13:03         ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-18  6:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > So i want to summarize issues i have as this threads have dig deep into
> > > details. For this i would like to differentiate two cases first the easy
> > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > 
> > Thank you very much for the summary.
> > 
> > > In both cases your objectives as i understand them:
> > > 
> > > [R1]- expose a common user space API that make it easy to share boiler
> > >       plate code accross many devices (discovering devices, opening
> > >       device, creating context, creating command queue ...).
> > > [R2]- try to share the device as much as possible up to device limits
> > >       (number of independant queues the device has)
> > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > >       device queue without a round trip to the kernel
> > > 
> > > I don't think i missed any.
> > > 
> > > 
> > > (1) Device with SVA/SVM
> > > 
> > > For that case it is easy, you do not need to be in VFIO or part of any
> > > thing specific in the kernel. There is no security risk (modulo bug in
> > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > to a device is just couple dozen lines of code.
> > > 
> > 
> > This is right...logically. But the kernel has no clear definition about "Device
> > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> > boiler plate.
> > 
> > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> > one. If we add that support within VFIO, which solve most of the problem of
> > SVA/SVM, it will save a lot of work in the future.
> 
> You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> would require any more than that in your case.
> 
> 
> > I think this is the key confliction between us. So could Alex please say
> > something here? If the VFIO is going to take this into its scope, we can try
> > together to solve all the problem on the way. If it it is not, it is also
> > simple, we can just go to another way to fulfill this part of requirements even
> > we have to duplicate most of the code.
> > 
> > Another point I need to emphasis here: because we have to replace the hardware
> > queue when fork, so it won't be very simple even in SVA/SVM case.
> 
> I am assuming hardware queue can only be setup by the kernel and thus
> you are totaly safe forkwise as the queue is setup against a PASID and
> the child does not bind to any PASID and you use VM_DONTCOPY on the
> mmap of the hardware MMIO queue because you should really use that flag
> for that.
> 
> 
> > > (2) Device does not have SVA/SVM (or it is disabled)
> > > 
> > > You want to still allow device to be part of your framework. However
> > > here i see fundamentals securities issues and you move the burden of
> > > being careful to user space which i think is a bad idea. We should
> > > never trus the userspace from kernel space.
> > > 
> > > To keep the same API for the user space code you want a 1:1 mapping
> > > between device physical address and process virtual address (ie if
> > > device access device physical address A it is accessing the same
> > > memory as what is backing the virtual address A in the process.
> > > 
> > > Security issues are on two things:
> > > [I1]- fork/exec, a process who opened any such device and created an
> > >       active queue can transfer without its knowledge control of its
> > >       commands queue through COW. The parent map some anonymous region
> > >       to the device as a command queue buffer but because of COW the
> > >       parent can be the first to copy on write and thus the child can
> > >       inherit the original pages that are mapped to the hardware.
> > >       Here parent lose control and child gain it.
> > 
> > This is indeed an issue. But it remains an issue only if you continue to use the
> > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> > user space.
> 
> Trusting user space is a no go from my point of view.

Can we dive deeper on this? Maybe we have a different understanding of "trusting
user space". As I understand it, the requirement is that "no matter what
the user process does, it should only hurt itself and anything given to it, not
the kernel or other processes".

In our case, we create a channel between a process and the hardware. The process
can do whatever it likes to its own memory and to the channel itself. It won't
hurt other processes or the kernel. And if the process forks a child and gives
the channel to the child, the freedom over those resources should remain within
the parent and the child. We are not trusting anyone else.

So do you refer to something else here?

> 
> 
> > From some perspectives, I think the issue can be solved by iommu_notifier. For
> > example, when the process is fork-ed, we can set the mapped device mmio space as
> > COW for the child process, so a new queue can be created and set to the same
> > state as the parent's if the space is accessed. Then we can have two separated
> > queues for both the parent and the child. The memory part can be done in the
> > same way.
> 
> The mmap of mmio space for the queue is not an issue just use VM_DONTCOPY
> for it. Issue is with COW and IOMMU mapping of pages and this can not be
> solve in your model.
> 
> > The thing is, the same strategy can be applied to VFIO without changing its
> > original feature.
> 
> No it can not it would break existing VFIO contract (which only should be
> use against private anonymous vma).
> 
> > 
> > > 
> > > [I2]- Because of [R3] you want to allow userspace to schedule commands
> > >       on the device without doing an ioctl and thus here user space
> > >       can schedule any commands to the device with any address. What
> > >       happens if that address have not been mapped by the user space
> > >       is undefined and in fact can not be defined as what each IOMMU
> > >       does on invalid address access is different from IOMMU to IOMMU.
> > > 
> > >       In case of a bad IOMMU, or simply an IOMMU improperly setup by
> > >       the kernel, this can potentialy allow user space to DMA anywhere.
> > 
> > I don't think this is an issue. If you cannot trust IOMMU and proper setup of
> > IOMMU in kernel, you cannot trust anything. And the whole VFIO framework is
> > untrustable.
> 
> VFIO device is usualy restricted to trusted user and other device that
> do DMA do various checks to make sure user space can not abuse them, the
> assumption i have always seen so far is to not trust that IOMMU will do
> all the work. So exposing user space access to device with DMA capabilities
> should be done carefuly IMHO.
> 
> To be thorough list of potential bugs i am concern about:
>     - IOMMU hardware bug
>     - IOMMU does not isolate device after too many fail attempt
>     - firmware setup the IOMMU in some unsafe way and linux kernel does
>       not catch that (like passthrough on error or when there is no
>       entry for the PA which i am told is a thing for debug)
>     - bug in the linux IOMMU kernel driver
> 
> 
> 
> > > 
> > > [I3]- By relying on GUP in VFIO you are not abiding by the implicit
> > >       contract (at least i hope it is implicit) that you should not
> > >       try to map to the device any file backed vma (private or share).
> > > 
> > >       The VFIO code never check the vma controlling the addresses that
> > >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
> > >       user space can provide file backed range.
> > > 
> > >       I am guessing that the VFIO code never had any issues because its
> > >       number one user is QEMU and QEMU never does that (and that's good
> > >       as no one should ever do that).
> > > 
> > >       So if process does that you are opening your self to serious file
> > >       system corruption (depending on file system this can lead to total
> > >       data loss for the filesystem).
> > > 
> > >       Issue is that once you GUP you never abide to file system flushing
> > >       which write protect the page before writing to the disk. So
> > >       because the page is still map with write permission to the device
> > >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
> > >       write to the page while it is in the middle of being written back
> > >       to disk. Consult your nearest file system specialist to ask him
> > >       how bad that can be.
> > 
> > Same as I2, it is an issue, but the problem can be solved in VFIO if we really
> > take it in the scope of VFIO.
> 
> Except it can not be solve without breaking VFIO. If you use mmu_notifier
> it means that the the IOMMU mapping can vanish at _any_ time and because
> you allow user space to directly schedule work on the hardware command
> queue than you do not have any synchronization point to use.
> 
> When notifier calls you must stop all hardware access and wait for any
> pending work that might still dereference affected range. Finaly you can
> restore mapping only once the notifier is done. AFAICT this would break
> all existing VFIO user.
> 
> So solving that for your case it means you would have to:
> warp_invalidate_range_callback() {
>     - take some lock to protect against restore
>     - unmap the command queue (zap_range) from userspace to stop further
>       commands
>     - wait for any pending commands on the hardware to complete
>     - clear all IOMMU mappings for the range
>     - put_pages()
>     - drop restore lock
>     return to let invalidation complete its work
> }
> 
> warp_restore() {
>     // This is call by the page fault handler on the command queue mapped
>     // to user space.
>     - take restore lock
>     - go over all IOMMU mapping and restore them (GUP)
>     - remap command queue to userspace
>     - drop restore lock
> }
> 
> This model does not work for VFIO existing users AFAICT.
> 
> 
> > > [I4]- Design issue, mdev design As Far As I Understand It is about
> > >       sharing a single device to multiple clients (most obvious case
> > >       here is again QEMU guest). But you are going against that model,
> > >       in fact AFAIUI you are doing the exect opposite. When there is
> > >       no SVA/SVM you want only one mdev device that can not be share.
> > 
> > Wait. It is NOT "I want only one mdev device when there is no SVA/SVM", it is "I
> > can support only one mdev when there is no PASID support for the IOMMU".
> 
> Except you can support more than one user when no SVA/SVM with the model
> i outlined below.
> 
> 
> > > 
> > >       So this is counter intuitive to the mdev existing design. It is
> > >       not about sharing device among multiple users but about giving
> > >       exclusive access to the device to one user.
> > > 
> > > 
> > > 
> > > All the reasons above is why i believe a different model would serve
> > > you and your user better. Below is a design that avoids all of the
> > > above issues and still delivers all of your objectives with the
> > > exceptions of the third one [R3] when there is no SVA/SVM.
> > > 
> > > 
> > > Create a subsystem (very much boiler plate code) which allow device to
> > > register themself against (very much like what you do in your current
> > > patchset but outside of VFIO).
> > > 
> > > That subsystem will create a device file for each registered system and
> > > expose a common API (ie set of ioctl) for each of those device files.
> > > 
> > > When user space create a queue (through an ioctl after opening the device
> > > file) the kernel can return -EBUSY if all the device queue are in use,
> > > or create a device queue and return a flag like SYNC_ONLY for device that
> > > do not have SVA/SVM.
> > > 
> > > For device with SVA/SVM at the time the process create a queue you bind
> > > the process PASID to the device queue. From there on the userspace can
> > > schedule commands and use the device without going to kernel space.
> > 
> > As mentioned previously, this is not enough for fork scenario.
> 
> It is for every existing user of SVA/SVM so i fail to see why it would
> be any different in your case. Note that they all use VM_DONTCOPY flag
> on the queue mapped to userspace.
> 
> 
> > > For device without SVA/SVM you create a fake queue that is just pure
> > > memory is not related to the device. From there on the userspace must
> > > call an ioctl every time it wants the device to consume its queue
> > > (hence why the SYNC_ONLY flag for synchronous operation only). The
> > > kernel portion read the fake queue expose to user space and copy
> > > commands into the real hardware queue but first it properly map any
> > > of the process memory needed for those commands to the device and
> > > adjust the device physical address with the one it gets from dma_map
> > > API.
> > > 
> > 
> > But in this way, we will lost most of the benefit of avoiding syscall.
> 
> Yes but only when there is SVA/SVM. What i am trying to stress is that
> there is no sane way to mirror user space address space onto device
> without mmu_notifier so short of that this model where you have to
> syscall to schedules thing on the hardware is the easiest thing to do.
> 
> 
> > > With that model it is "easy" to listen to mmu_notifier and to abide by
> > > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> > > issue by only mapping a fake device queue to userspace.
> > > 
> > > So yes with that models it means that every device that wish to support
> > > the non SVA/SVM case will have to do extra work (ie emulate its command
> > > queue in software in the kernel). But by doing so, you support an
> > > unlimited number of process on your device (ie all the process can share
> > > one single hardware command queues or multiple hardware queues).
> > 
> > If I can do this, I will not need WarpDrive at all:(
> 
> This is only needed if you wish to support non SVA/SVM, shifting the
> burden to the kernel is always the thing to do especially if they are
> legitimate security concerns.

For the other part, please give me some more time to write some test code and
come back to the discussion. Thank you.

Cheers

> 
> 
> Cheers,
> Jérôme

-- 
			-Kenneth(Hisilicon)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-18  6:00       ` Kenneth Lee
@ 2018-09-18 13:03         ` Jerome Glisse
  2018-09-20  5:55           ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-18 13:03 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > > So i want to summarize issues i have as this threads have dig deep into
> > > > details. For this i would like to differentiate two cases first the easy
> > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > > 
> > > Thank you very much for the summary.
> > > 
> > > > In both cases your objectives as i understand them:
> > > > 
> > > > [R1]- expose a common user space API that make it easy to share boiler
> > > >       plate code accross many devices (discovering devices, opening
> > > >       device, creating context, creating command queue ...).
> > > > [R2]- try to share the device as much as possible up to device limits
> > > >       (number of independant queues the device has)
> > > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > > >       device queue without a round trip to the kernel
> > > > 
> > > > I don't think i missed any.
> > > > 
> > > > 
> > > > (1) Device with SVA/SVM
> > > > 
> > > > For that case it is easy, you do not need to be in VFIO or part of any
> > > > thing specific in the kernel. There is no security risk (modulo bug in
> > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > > to a device is just couple dozen lines of code.
> > > > 
> > > 
> > > This is right...logically. But the kernel has no clear definition about "Device
> > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> > > boiler plate.
> > > 
> > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> > > one. If we add that support within VFIO, which solve most of the problem of
> > > SVA/SVM, it will save a lot of work in the future.
> > 
> > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> > all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> > would require any more than that in your case.
> > 
> > 
> > > I think this is the key confliction between us. So could Alex please say
> > > something here? If the VFIO is going to take this into its scope, we can try
> > > together to solve all the problem on the way. If it it is not, it is also
> > > simple, we can just go to another way to fulfill this part of requirements even
> > > we have to duplicate most of the code.
> > > 
> > > Another point I need to emphasis here: because we have to replace the hardware
> > > queue when fork, so it won't be very simple even in SVA/SVM case.
> > 
> > I am assuming hardware queue can only be setup by the kernel and thus
> > you are totaly safe forkwise as the queue is setup against a PASID and
> > the child does not bind to any PASID and you use VM_DONTCOPY on the
> > mmap of the hardware MMIO queue because you should really use that flag
> > for that.
> > 
> > 
> > > > (2) Device does not have SVA/SVM (or it is disabled)
> > > > 
> > > > You want to still allow device to be part of your framework. However
> > > > here i see fundamentals securities issues and you move the burden of
> > > > being careful to user space which i think is a bad idea. We should
> > > > never trus the userspace from kernel space.
> > > > 
> > > > To keep the same API for the user space code you want a 1:1 mapping
> > > > between device physical address and process virtual address (ie if
> > > > device access device physical address A it is accessing the same
> > > > memory as what is backing the virtual address A in the process.
> > > > 
> > > > Security issues are on two things:
> > > > [I1]- fork/exec, a process who opened any such device and created an
> > > >       active queue can transfer without its knowledge control of its
> > > >       commands queue through COW. The parent map some anonymous region
> > > >       to the device as a command queue buffer but because of COW the
> > > >       parent can be the first to copy on write and thus the child can
> > > >       inherit the original pages that are mapped to the hardware.
> > > >       Here parent lose control and child gain it.
> > > 
> > > This is indeed an issue. But it remains an issue only if you continue to use the
> > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> > > user space.
> > 
> > Trusting user space is a no go from my point of view.
> 
> Can we dive deeper on this? Maybe we have different understanding on "Trusting
> user space". As my understanding, "trusting user space" means "no matter what
> the user process does, it should only hurt itself and anything give to it, no
> the kernel and the other process".
> 
> In our case, we create a channel between a process and the hardware. The process
> can do whateven it like to its own memory the channel itself. It won't hurt the
> other process and the kernel. And if the process fork a child and give the
> channel to the child, it should the freedom on those resource remain within the
> parent and the child. We are not trust another else.
> 
> So do you refer to something else here?
> 

I am referring to COW giving the child control over what happens
in the parent from the device's point of view. A process hurting itself is
fine, but if a process now has to take special steps to protect itself from
its child, ie make sure that its children can not hurt it, then i see
that as a kernel bug. We can not ask a user space process to know about
all the thousands of things that need to be done to avoid issues with
each device driver that the process may use (a process can be totally
unaware that it is using a device if that device is used by a library it
links to).


Maybe what needs to happen will explain it better. So if userspace
wants to be secure and protect itself from its child taking over the
device through COW:

    - parent opened a device and is using it

    ... when parent wants to fork/exec it must:

    - parent _must_ flush device command queue and wait for the
      device to finish all pending jobs

    - parent _must_ unmap all ranges mapped to the device

    - parent should first close the device file (unless you force set
      the CLOEXEC flag in the kernel); it could also just flush,
      but if you are not mapping the device command queue with
      VM_DONTCOPY then you should really be closing the device

    - now parent can fork/exec

    - parent must force COW ie write at least one byte to _all_
      pages in the range it wants to use with the device

    - parent re-open the device and re-initialize everything


So this puts quite a burden on the parent: there are a number of steps it
_must_ do in order to keep control of memory exposed to the device.
Not doing so can potentially lead (depending on who does the COW
first) to the child taking control of memory used by the device,
memory which was mapped by the parent before the child was created.
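
As a rough illustration of that dance, the steps listed above could be
wrapped like this in a user-space library. This is only a sketch: struct
dev_ops and its four hooks are hypothetical placeholders for whatever the
device library would provide, and the demo uses stubs that merely record
the order of the calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-device hooks; a real library would supply these. */
struct dev_ops {
    void (*flush)(void *ctx);     /* drain the queue, wait for idle */
    void (*unmap)(void *ctx);     /* drop every range mapped to the device */
    void (*close_dev)(void *ctx); /* close the device file */
    int  (*reopen)(void *ctx);    /* re-open and re-init, 0 on success */
};

/*
 * Fork while keeping the parent in control of its device buffers.
 * Returns the child's pid in the parent, 0 in the child, -1 on error
 * (in this sketch a failed reopen also reports -1).
 */
pid_t fork_quiesced(const struct dev_ops *ops, void *ctx,
                    volatile unsigned char *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    pid_t pid;

    ops->flush(ctx);
    ops->unmap(ctx);
    ops->close_dev(ctx);

    pid = fork();
    if (pid == 0)
        return 0;                 /* child: never touches the device */

    if (pid > 0) {
        /* Force COW now: write one byte to every page of the range. */
        for (size_t off = 0; off < len; off += (size_t)page)
            buf[off] = buf[off];
    }
    if (ops->reopen(ctx) != 0)
        return -1;
    return pid;
}

/* Stub hooks that only record the order in which they were called. */
struct trace { char log[8]; size_t n; };

static void note(void *ctx, char c)
{
    struct trace *t = ctx;

    if (t->n < sizeof(t->log) - 1)
        t->log[t->n++] = c;
}
static void t_flush(void *ctx) { note(ctx, 'F'); }
static void t_unmap(void *ctx) { note(ctx, 'U'); }
static void t_close(void *ctx) { note(ctx, 'C'); }
static int  t_reopen(void *ctx) { note(ctx, 'R'); return 0; }

/* Returns 1 if the parent ran flush, unmap, close, reopen in that order. */
int fork_quiesced_demo(void)
{
    static unsigned char buf[3 * 4096];
    struct trace t = { { 0 }, 0 };
    const struct dev_ops ops = { t_flush, t_unmap, t_close, t_reopen };
    pid_t pid = fork_quiesced(&ops, &t, buf, sizeof(buf));

    if (pid == 0)
        _exit(0);
    if (pid < 0)
        return 0;
    waitpid(pid, NULL, 0);
    return strcmp(t.log, "FUCR") == 0;
}
```

Every library that ends up using the device on behalf of the process would
need to get this right at every fork, which is the burden i am talking
about.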

Forcing CLOEXEC and VM_DONTCOPY helps simplify this somewhat,
but you still need to stop, flush and unmap before fork/exec and then
re-init everything after.
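
For readers who have not used VM_DONTCOPY: a driver sets that flag on the
vma in its mmap handler, and the closest thing user space can do for its
own mappings is madvise(MADV_DONTFORK), which sets the same vma flag. The
sketch below only shows the effect on an ordinary anonymous mapping (the
child does not inherit the range at all); dontfork_demo() is a made-up
name and no device is involved.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child that touches a MADV_DONTFORK region dies with
 * SIGSEGV (the mapping was not copied into the child), 0 if the child
 * could read it, -1 on error.
 */
int dontfork_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int status, ret = -1;
    pid_t pid;

    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 0x5a;
    if (madvise(buf, len, MADV_DONTFORK) != 0)
        goto out;

    pid = fork();
    if (pid == 0) {
        /* The range does not exist in the child, so this read faults. */
        volatile unsigned char *p = buf;

        _exit(p[0] == 0x5a ? 0 : 1);
    }
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        goto out;
    ret = WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
out:
    munmap(buf, len);
    return ret;
}
```

This keeps the child away from the queue mapping, but it does nothing for
pages that were pinned and mapped to the device before the fork, hence the
stop, flush and unmap steps are still needed.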


This is only when not using SVA/SVM; SVA/SVM is totally fine from
that point of view, no issues whatsoever.

The solution i outlined in my previous email does not have the above
issue either, no need to rely on user space doing that dance.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-18 13:03         ` Jerome Glisse
@ 2018-09-20  5:55           ` Kenneth Lee
  2018-09-20 14:23             ` Jerome Glisse
  0 siblings, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-20  5:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Tue, Sep 18, 2018 at 09:03:14AM -0400, Jerome Glisse wrote:
> On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> > On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> > > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > > > So i want to summarize issues i have as this threads have dig deep into
> > > > > details. For this i would like to differentiate two cases first the easy
> > > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > > > 
> > > > Thank you very much for the summary.
> > > > 
> > > > > In both cases your objectives as i understand them:
> > > > > 
> > > > > [R1]- expose a common user space API that make it easy to share boiler
> > > > >       plate code accross many devices (discovering devices, opening
> > > > >       device, creating context, creating command queue ...).
> > > > > [R2]- try to share the device as much as possible up to device limits
> > > > >       (number of independant queues the device has)
> > > > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > > > >       device queue without a round trip to the kernel
> > > > > 
> > > > > I don't think i missed any.
> > > > > 
> > > > > 
> > > > > (1) Device with SVA/SVM
> > > > > 
> > > > > For that case it is easy, you do not need to be in VFIO or part of any
> > > > > thing specific in the kernel. There is no security risk (modulo bug in
> > > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > > > to a device is just couple dozen lines of code.
> > > > > 
> > > > 
> > > > This is right...logically. But the kernel has no clear definition about "Device
> > > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> > > > boiler plate.
> > > > 
> > > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> > > > one. If we add that support within VFIO, which solve most of the problem of
> > > > SVA/SVM, it will save a lot of work in the future.
> > > 
> > > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> > > all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> > > would require any more than that in your case.
> > > 
> > > 
> > > > I think this is the key confliction between us. So could Alex please say
> > > > something here? If the VFIO is going to take this into its scope, we can try
> > > > together to solve all the problem on the way. If it it is not, it is also
> > > > simple, we can just go to another way to fulfill this part of requirements even
> > > > we have to duplicate most of the code.
> > > > 
> > > > Another point I need to emphasis here: because we have to replace the hardware
> > > > queue when fork, so it won't be very simple even in SVA/SVM case.
> > > 
> > > I am assuming hardware queue can only be setup by the kernel and thus
> > > you are totaly safe forkwise as the queue is setup against a PASID and
> > > the child does not bind to any PASID and you use VM_DONTCOPY on the
> > > mmap of the hardware MMIO queue because you should really use that flag
> > > for that.
> > > 
> > > 
> > > > > (2) Device does not have SVA/SVM (or it is disabled)
> > > > > 
> > > > > You want to still allow device to be part of your framework. However
> > > > > here i see fundamentals securities issues and you move the burden of
> > > > > being careful to user space which i think is a bad idea. We should
> > > > > never trus the userspace from kernel space.
> > > > > 
> > > > > To keep the same API for the user space code you want a 1:1 mapping
> > > > > between device physical address and process virtual address (ie if
> > > > > device access device physical address A it is accessing the same
> > > > > memory as what is backing the virtual address A in the process.
> > > > > 
> > > > > Security issues are on two things:
> > > > > [I1]- fork/exec, a process who opened any such device and created an
> > > > >       active queue can transfer without its knowledge control of its
> > > > >       commands queue through COW. The parent map some anonymous region
> > > > >       to the device as a command queue buffer but because of COW the
> > > > >       parent can be the first to copy on write and thus the child can
> > > > >       inherit the original pages that are mapped to the hardware.
> > > > >       Here parent lose control and child gain it.
> > > > 
> > > > This is indeed an issue. But it remains an issue only if you continue to use the
> > > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> > > > user space.
> > > 
> > > Trusting user space is a no go from my point of view.
> > 
> > Can we dive deeper on this? Maybe we have different understanding on "Trusting
> > user space". As my understanding, "trusting user space" means "no matter what
> > the user process does, it should only hurt itself and anything give to it, no
> > the kernel and the other process".
> > 
> > In our case, we create a channel between a process and the hardware. The process
> > can do whateven it like to its own memory the channel itself. It won't hurt the
> > other process and the kernel. And if the process fork a child and give the
> > channel to the child, it should the freedom on those resource remain within the
> > parent and the child. We are not trust another else.
> > 
> > So do you refer to something else here?
> > 
> 
> I am refering to COW giving control to the child on to what happens
> in the parent from device point of view. A process hurting itself is
> fine, but if process now has to do special steps to protect from
> its child ie make sure that its childs can not hurt it, then i see
> that as a kernel bug. We can not ask user space process to know about
> all the thousands things that needs to be done to avoid issues with
> each device driver that the process may use (process can be totaly
> ignorant it is using a device if that device is use by a library it
> links to).
> 
> 
> Maybe what needs to happen will explain it better. So if userspace
> wants to be secure and protect itself from its child taking over the
> device through COW:
> 
>     - parent opened a device and is using it
> 
>     ... when parent wants to fork/exec it must:
> 
>     - parent _must_ flush device command queue and wait for the
>       device to finish all pending jobs
> 
>     - parent _must_ unmap all range mapped to the device
> 
>     - parent should first close device file (unless you force set
>       the CLOEXEC flag in the kernel)/it could also just flush
>       but if you are not mapping the device command queue with
>       VM_DONTCOPY then you should really be closing the device
> 
>     - now parent can fork/exec
> 
>     - parent must force COW ie write at least one byte to _all_
>       pages in the range it wants to use with the device
> 
>     - parent re-open the device and re-initialize everything
> 
> 
> So this is putting quite a burden on a number of steps the parent
> _must_ do in order to keep control of memory exposed to the device.
> Not doing so can potentialy lead (it depends on who does the COW
> first) to the child taking control of memory use by the device,
> memory which was mapped by the parent before the child was created.
> 
> Forcing CLOEXEC and VM_DONTCOPY somewhat help to simplify this,
> but you still need to stop, flush, unmap, before fork/exec and then
> re-init everything after.
> 
> 
> This is only when not using SVA/SVM, SVA/SVM is totaly fine from
> that point of view, no issues whatsoever.
> 
> The solution i outlined in previous email do not have that above
> issue either, no need to rely on user space doing that dance.

Thank you. I get the point. I'm now trying to see if I can solve the problem by
setting the vma to VM_SHARED when the portion is "shared to the hardware".

> 
> Cheers,
> Jérôme

-- 
			-Kenneth(Hisilicon)




* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-20  5:55           ` Kenneth Lee
@ 2018-09-20 14:23             ` Jerome Glisse
  2018-09-21 10:05               ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-20 14:23 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Thu, Sep 20, 2018 at 01:55:43PM +0800, Kenneth Lee wrote:
> On Tue, Sep 18, 2018 at 09:03:14AM -0400, Jerome Glisse wrote:
> > On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> > > On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> > > > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > > > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > > > > So i want to summarize issues i have as this threads have dig deep into
> > > > > > details. For this i would like to differentiate two cases first the easy
> > > > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > > > > 
> > > > > Thank you very much for the summary.
> > > > > 
> > > > > > In both cases your objectives as i understand them:
> > > > > > 
> > > > > > [R1]- expose a common user space API that make it easy to share boiler
> > > > > >       plate code accross many devices (discovering devices, opening
> > > > > >       device, creating context, creating command queue ...).
> > > > > > [R2]- try to share the device as much as possible up to device limits
> > > > > >       (number of independant queues the device has)
> > > > > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > > > > >       device queue without a round trip to the kernel
> > > > > > 
> > > > > > I don't think i missed any.
> > > > > > 
> > > > > > 
> > > > > > (1) Device with SVA/SVM
> > > > > > 
> > > > > > For that case it is easy, you do not need to be in VFIO or part of any
> > > > > > thing specific in the kernel. There is no security risk (modulo bug in
> > > > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > > > > to a device is just couple dozen lines of code.
> > > > > > 
> > > > > 
> > > > > This is right...logically. But the kernel has no clear definition about "Device
> > > > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> > > > > boiler plate.
> > > > > 
> > > > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> > > > > one. If we add that support within VFIO, which solve most of the problem of
> > > > > SVA/SVM, it will save a lot of work in the future.
> > > > 
> > > > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> > > > all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> > > > would require any more than that in your case.
> > > > 
> > > > 
> > > > > I think this is the key confliction between us. So could Alex please say
> > > > > something here? If the VFIO is going to take this into its scope, we can try
> > > > > together to solve all the problem on the way. If it it is not, it is also
> > > > > simple, we can just go to another way to fulfill this part of requirements even
> > > > > we have to duplicate most of the code.
> > > > > 
> > > > > Another point I need to emphasis here: because we have to replace the hardware
> > > > > queue when fork, so it won't be very simple even in SVA/SVM case.
> > > > 
> > > > I am assuming hardware queue can only be setup by the kernel and thus
> > > > you are totaly safe forkwise as the queue is setup against a PASID and
> > > > the child does not bind to any PASID and you use VM_DONTCOPY on the
> > > > mmap of the hardware MMIO queue because you should really use that flag
> > > > for that.
> > > > 
> > > > 
> > > > > > (2) Device does not have SVA/SVM (or it is disabled)
> > > > > > 
> > > > > > You want to still allow device to be part of your framework. However
> > > > > > here i see fundamentals securities issues and you move the burden of
> > > > > > being careful to user space which i think is a bad idea. We should
> > > > > > never trus the userspace from kernel space.
> > > > > > 
> > > > > > To keep the same API for the user space code you want a 1:1 mapping
> > > > > > between device physical address and process virtual address (ie if
> > > > > > device access device physical address A it is accessing the same
> > > > > > memory as what is backing the virtual address A in the process.
> > > > > > 
> > > > > > Security issues are on two things:
> > > > > > [I1]- fork/exec, a process who opened any such device and created an
> > > > > >       active queue can transfer without its knowledge control of its
> > > > > >       commands queue through COW. The parent map some anonymous region
> > > > > >       to the device as a command queue buffer but because of COW the
> > > > > >       parent can be the first to copy on write and thus the child can
> > > > > >       inherit the original pages that are mapped to the hardware.
> > > > > >       Here parent lose control and child gain it.
> > > > > 
> > > > > This is indeed an issue. But it remains an issue only if you continue to use the
> > > > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> > > > > user space.
> > > > 
> > > > Trusting user space is a no go from my point of view.
> > > 
> > > > Can we dive deeper on this? Maybe we have different understandings of "trusting
> > > > user space". As I understand it, "trusting user space" means "no matter what
> > > > the user process does, it should only hurt itself and anything given to it, not
> > > > the kernel or other processes".
> > > > 
> > > > In our case, we create a channel between a process and the hardware. The process
> > > > can do whatever it likes to its own memory and the channel itself. It won't hurt
> > > > other processes or the kernel. And if the process forks a child and gives the
> > > > channel to the child, the freedom over those resources should remain within the
> > > > parent and the child. We are not trusting anyone else.
> > > 
> > > So do you refer to something else here?
> > > 
> > 
> > I am refering to COW giving control to the child on to what happens
> > in the parent from device point of view. A process hurting itself is
> > fine, but if process now has to do special steps to protect from
> > its child ie make sure that its childs can not hurt it, then i see
> > that as a kernel bug. We can not ask user space process to know about
> > all the thousands things that needs to be done to avoid issues with
> > each device driver that the process may use (process can be totaly
> > ignorant it is using a device if that device is use by a library it
> > links to).
> > 
> > 
> > Maybe what needs to happen will explain it better. So if userspace
> > wants to be secure and protect itself from its child taking over the
> > device through COW:
> > 
> >     - parent opened a device and is using it
> > 
> >     ... when parent wants to fork/exec it must:
> > 
> >     - parent _must_ flush device command queue and wait for the
> >       device to finish all pending jobs
> > 
> >     - parent _must_ unmap all range mapped to the device
> > 
> >     - parent should first close device file (unless you force set
> >       the CLOEXEC flag in the kernel)/it could also just flush
> >       but if you are not mapping the device command queue with
> >       VM_DONTCOPY then you should really be closing the device
> > 
> >     - now parent can fork/exec
> > 
> >     - parent must force COW ie write at least one byte to _all_
> >       pages in the range it wants to use with the device
> > 
> >     - parent re-open the device and re-initialize everything
> > 
> > 
> > So this is putting quite a burden on a number of steps the parent
> > _must_ do in order to keep control of memory exposed to the device.
> > Not doing so can potentialy lead (it depends on who does the COW
> > first) to the child taking control of memory use by the device,
> > memory which was mapped by the parent before the child was created.
> > 
> > Forcing CLOEXEC and VM_DONTCOPY somewhat help to simplify this,
> > but you still need to stop, flush, unmap, before fork/exec and then
> > re-init everything after.
> > 
> > 
> > This is only when not using SVA/SVM, SVA/SVM is totaly fine from
> > that point of view, no issues whatsoever.
> > 
> > The solution i outlined in previous email do not have that above
> > issue either, no need to rely on user space doing that dance.
> 
> Thank you. I get the point. I'm now trying to see if I can solve the problem by
> setting the vma to VM_SHARED when the portion is "shared to the hardware".
> 

FYI you cannot convert a private anonymous vma to a shared one. It is
illegal AFAIK; at least i have never heard of it, and i am pretty sure the
mm code would break if that happened. User space is the one that
decides what flags a vma has, not the kernel, modulo a few flags like
DONTCOPY that can be force-set by a device driver for its own vma, ie the
vma of an mmap against the device file.
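
For illustration only, here is a minimal userspace sketch of user space
picking such a flag itself: madvise(MADV_DONTFORK) marks a range so that
it is not inherited across fork() (on Linux this sets VM_DONTCOPY on the
vma). The helper name dontfork_hides_range_from_child() is made up for
this example, and mincore() is used only as a convenient way for the
child to probe whether the range is still mapped.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a MADV_DONTFORK range is absent from
 * the child after fork(), 0 if the child still has it mapped, and -1 on
 * a setup error.
 */
int dontfork_hides_range_from_child(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char vec[1];
	int status;
	pid_t pid;
	void *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	((char *)buf)[0] = 1;			/* fault the page in */

	if (madvise(buf, len, MADV_DONTFORK))	/* range is not copied on fork */
		goto err;

	pid = fork();
	if (pid < 0)
		goto err;
	if (pid == 0) {
		/* mincore() fails with ENOMEM when the range is unmapped. */
		int gone = mincore(buf, len, vec) == -1 && errno == ENOMEM;

		_exit(gone ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid)
		goto err;
	munmap(buf, len);
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
err:
	munmap(buf, len);
	return -1;
}
```

This only shows the flag being chosen from user space for an anonymous
range; it says nothing about converting a private vma into a shared one.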

If you don't like my solution here is another one, but it is ugly and
i think it is a bad idea. Again this is for the non SVA/SVM case and
it assumes that the command queue is a mmap() of the device file:
  (A) register a mmu_notifier
  (B) on _every_ invalidate range callback (_no matter_ what the
      range is) you zap the command queue mapped to user space (this is
      because you can't tell if the callback happens for a fork or
      something else), wait for the hardware queue to finish, clear
      all the iommu/dma mappings and unpin all the pages, ie
      put_page()
  (C) in the device file vma page fault handler (vm_operations_struct.
      fault) you redo all the GUP, redo all the iommu/dma mappings
      and remap the command queue to user space

In (C) you can remap a different command queue in the child
than in the parent (just look at current->mm and compare it to the
one the command queue was created against).
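
To make the (A)/(B)/(C) flow a bit more concrete, here is a toy userspace
model of the state machine. It is not kernel code: struct mm, struct cmdq
and the cmdq_*() functions are hypothetical names for this sketch. A real
driver would run (B) from its mmu_notifier invalidate callback and (C)
from its vm_operations_struct fault handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct mm { int id; };			/* stands in for struct mm_struct */

struct cmdq {
	const struct mm *owner;		/* mm the queue was created against */
	bool mapped;			/* command queue visible to user space */
	bool dma_mapped;		/* pages pinned and iommu/dma mapped */
	unsigned long teardowns;	/* how many times (B) did real work */
	unsigned long rebuilds;		/* how many times (C) did real work */
};

void cmdq_init(struct cmdq *q, const struct mm *owner)
{
	assert(q && owner);
	q->owner = owner;
	q->mapped = true;
	q->dma_mapped = true;
	q->teardowns = 0;
	q->rebuilds = 0;
}

/* (B): called for every invalidate callback, whatever the range is. */
void cmdq_invalidate(struct cmdq *q)
{
	if (!q->mapped && !q->dma_mapped)
		return;			/* already torn down */
	q->mapped = false;		/* zap the user mapping */
	/* a real driver would wait for the hardware queue to drain here */
	q->dma_mapped = false;		/* clear iommu/dma mappings, put_page() */
	q->teardowns++;
}

/*
 * (C): fault on the device-file vma. Returns true when q was rebuilt for
 * its owner. Returns false when the faulting mm is not the owner (for
 * example a forked child), which must be given a different queue.
 */
bool cmdq_fault(struct cmdq *q, const struct mm *current_mm)
{
	if (current_mm != q->owner)
		return false;
	if (!q->dma_mapped) {
		q->dma_mapped = true;	/* redo GUP and the iommu/dma mapping */
		q->rebuilds++;
	}
	q->mapped = true;		/* remap the queue to user space */
	return true;
}
```

The teardowns and rebuilds counters are there only to make the cost
visible: every unrelated invalidation followed by a fault pays for a full
teardown and rebuild.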

Note that this solution will be much __slower__ than what i described
in my previous email. You will see that mmu notifier callbacks happen
often and for tons of reasons, and you will be _constantly_ undoing and
redoing tons of work.

This can be mitigated if you can differentiate the reasons behind a mmu
notifier callback. I posted a patchset to do that a while ago and i
intend to post it again in the next month or so. But this would still
be a bad idea, and the solution i described previously is much more sane.

Trying to pretend you can have the same thing as SVA/SVM without SVA
is not a good idea. The non SVA case can still expose the same API (like
i described previously) but should go through the kernel for _every_
hardware submission (you can batch multiple commands in one submission).
Not doing so is way too risky from my POV.
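
As a rough userspace model of what "go through the kernel for every
submission" could look like, assume a fake queue in plain memory and a
single registered buffer. submit_batch() stands in for the submission
ioctl. All struct and function names are invented for this sketch, and
the address arithmetic stands in for whatever the dma_map API would
return in a real driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model; every name here is hypothetical. */
#define FAKE_Q_DEPTH 8

struct cmd { uint64_t addr; uint64_t len; };	/* filled by user space */

struct region {				/* one buffer registered by the process */
	uint64_t start;			/* process virtual address */
	uint64_t len;
	uint64_t dma_base;		/* device address of the same memory */
};

struct fake_queue {			/* plain memory shared with user space */
	struct cmd cmds[FAKE_Q_DEPTH];
	size_t count;
};

struct hw_queue {			/* only the kernel side writes here */
	struct cmd cmds[FAKE_Q_DEPTH];
	size_t count;
};

/* Translate a user range to a device address, or fail if not covered. */
static int translate(const struct region *r, const struct cmd *c,
		     uint64_t *dma)
{
	uint64_t off;

	if (c->len == 0 || c->addr < r->start)
		return -EFAULT;
	off = c->addr - r->start;
	if (off >= r->len || c->len > r->len - off)
		return -EFAULT;
	*dma = r->dma_base + off;
	return 0;
}

/*
 * One call consumes the whole batch. Returns the number of commands
 * queued or a negative errno. Nothing reaches the hardware queue unless
 * every command in the batch validates.
 */
int submit_batch(const struct region *r, struct fake_queue *fq,
		 struct hw_queue *hw)
{
	struct cmd checked[FAKE_Q_DEPTH];
	size_t i, n = fq->count;	/* snapshot, user space may change it */

	assert(hw->count <= FAKE_Q_DEPTH);
	if (n > FAKE_Q_DEPTH)
		return -EINVAL;
	if (n > FAKE_Q_DEPTH - hw->count)
		return -EBUSY;
	for (i = 0; i < n; i++) {
		struct cmd c = fq->cmds[i];	/* copy before validating */
		int ret = translate(r, &c, &checked[i].addr);

		if (ret)
			return ret;
		checked[i].len = c.len;
	}
	for (i = 0; i < n; i++)
		hw->cmds[hw->count++] = checked[i];
	fq->count = 0;
	return (int)n;
}
```

The point of the sketch is the shape: user space never writes device
addresses, and one call can carry several commands, so the round trip to
the kernel is paid once per batch and not once per command.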

Cheers,
Jérôme


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-17  1:42 ` Jerome Glisse
  2018-09-17  8:39   ` Kenneth Lee
@ 2018-09-21 10:03   ` Kenneth Lee
  2018-09-21 14:52     ` Jerome Glisse
  1 sibling, 1 reply; 58+ messages in thread
From: Kenneth Lee @ 2018-09-21 10:03 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Jonathan Corbet, Herbert Xu, David S . Miller,
	Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
	Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner,
	linux-doc, linux-kernel, linux-crypto, iommu, kvm,
	linux-accelerators, Lu Baolu, Sanjay Kumar, linuxarm

On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> So i want to summarize issues i have as this threads have dig deep into
> details. For this i would like to differentiate two cases first the easy
> one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> In both cases your objectives as i understand them:
> 
> [R1]- expose a common user space API that make it easy to share boiler
>       plate code accross many devices (discovering devices, opening
>       device, creating context, creating command queue ...).
> [R2]- try to share the device as much as possible up to device limits
>       (number of independant queues the device has)
> [R3]- minimize syscall by allowing user space to directly schedule on the
>       device queue without a round trip to the kernel
> 
> I don't think i missed any.
> 
> 
> (1) Device with SVA/SVM
> 
> For that case it is easy, you do not need to be in VFIO or part of any
> thing specific in the kernel. There is no security risk (modulo bug in
> the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> to a device is just couple dozen lines of code.
> 
> 
> (2) Device does not have SVA/SVM (or it is disabled)
> 
> You still want to allow such devices to be part of your framework. However
> here i see fundamental security issues, and you move the burden of
> being careful to user space, which i think is a bad idea. We should
> never trust userspace from kernel space.
> 
> To keep the same API for the user space code you want a 1:1 mapping
> between device physical address and process virtual address (ie if
> the device accesses device physical address A it is accessing the same
> memory as what is backing virtual address A in the process).
> 
> Security issues are on two things:
> [I1]- fork/exec, a process who opened any such device and created an
>       active queue can transfer without its knowledge control of its
>       commands queue through COW. The parent map some anonymous region
>       to the device as a command queue buffer but because of COW the
>       parent can be the first to copy on write and thus the child can
>       inherit the original pages that are mapped to the hardware.
>       Here parent lose control and child gain it.
> 

Hi, Jerome,

I have reconsidered your logic, and I think the problem can be solved. Let us
split the SVA/SVM feature into two parts: fault-from-device and
device-va-awareness. A device with an iommu can support either
device-va-awareness only, or both.

VFIO works on top of the iommu, so it will support at least device-va-awareness.
The COW problem can be treated as an mmu synchronization issue: if the mmu page
table changes, the change should be synchronized to the iommu (via
iommu_notifier). If the device supports fault-from-device, this works fine. If
it supports only device-va-awareness, we can prefault (handle_mm_fault), also
via iommu_notifier, and write the result back to the iommu page table.

So this can be considered a bug in VFIO, can't it?

> [I2]- Because of [R3] you want to allow userspace to schedule commands
>       on the device without doing an ioctl and thus here user space
>       can schedule any commands to the device with any address. What
>       happens if that address have not been mapped by the user space
>       is undefined and in fact can not be defined as what each IOMMU
>       does on invalid address access is different from IOMMU to IOMMU.
> 
>       In case of a bad IOMMU, or simply an IOMMU improperly set up by
>       the kernel, this can potentially allow user space to DMA anywhere.
> 
> [I3]- By relying on GUP in VFIO you are not abiding by the implicit
>       contract (at least i hope it is implicit) that you should not
>       try to map to the device any file backed vma (private or share).
> 
>       The VFIO code never check the vma controlling the addresses that
>       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
>       user space can provide file backed range.
> 
>       I am guessing that the VFIO code never had any issues because its
>       number one user is QEMU and QEMU never does that (and that's good
>       as no one should ever do that).
> 
>       So if a process does that you are opening yourself to serious file
>       system corruption (depending on the file system this can lead to total
>       data loss for the filesystem).
> 
>       The issue is that once you GUP you never abide by file system flushing,
>       which write-protects the page before writing it to disk. So
>       because the page is still mapped with write permission to the device
>       (assuming VFIO_IOMMU_MAP_DMA was a write map) the device can
>       write to the page while it is in the middle of being written back
>       to disk. Consult your nearest file system specialist to ask
>       how bad that can be.

In this case, we cannot do anything if the device does not support
fault-from-device. But we can reject a write map of a file-backed mapping.

It seems both issues can be solved under the VFIO framework :) (But of course,
I don't mean it has to be.)
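
A small userspace sketch of that policy, for illustration only. In the
kernel the check would be made on the vma (for example by looking at
vma->vm_file); here /proc/self/maps is parsed as a stand-in, where a
non-zero inode means the mapping is file-backed. addr_is_file_backed()
and check_dma_map() are hypothetical names, not VFIO functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Returns 1 if addr lies in a file-backed mapping of this process,
 * 0 if the mapping is anonymous, -1 if addr is unmapped or on error.
 */
int addr_is_file_backed(const void *addr)
{
	unsigned long start, end, inode;
	uintptr_t a = (uintptr_t)addr;
	char line[512];
	int ret = -1;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* line format: start-end perms offset dev inode [path] */
		if (sscanf(line, "%lx-%lx %*s %*x %*x:%*x %lu",
			   &start, &end, &inode) != 3)
			continue;
		if (a >= start && a < end) {
			ret = inode != 0;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Hypothetical policy: refuse writable DMA maps of file-backed memory. */
int check_dma_map(const void *addr, int want_write)
{
	int fb = addr_is_file_backed(addr);

	if (fb < 0)
		return -EFAULT;
	if (fb && want_write)
		return -EINVAL;
	return 0;
}
```

With this policy a read-only map of file-backed memory is still accepted,
and anonymous memory can be mapped either way.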

> 
> [I4]- Design issue: the mdev design, As Far As I Understand It, is about
>       sharing a single device among multiple clients (the most obvious case
>       here is again a QEMU guest). But you are going against that model;
>       in fact AFAIUI you are doing the exact opposite. When there is
>       no SVA/SVM you want only one mdev device, which cannot be shared.
> 
>       So this is counter intuitive to the mdev existing design. It is
>       not about sharing device among multiple users but about giving
>       exclusive access to the device to one user.
> 
> 
> 
> All the reasons above is why i believe a different model would serve
> you and your user better. Below is a design that avoids all of the
> above issues and still delivers all of your objectives with the
> exceptions of the third one [R3] when there is no SVA/SVM.
> 
> 
> Create a subsystem (very much boiler plate code) which allow device to
> register themself against (very much like what you do in your current
> patchset but outside of VFIO).
> 
> That subsystem will create a device file for each registered system and
> expose a common API (ie set of ioctl) for each of those device files.
> 
> When user space create a queue (through an ioctl after opening the device
> file) the kernel can return -EBUSY if all the device queue are in use,
> or create a device queue and return a flag like SYNC_ONLY for device that
> do not have SVA/SVM.
> 
> For device with SVA/SVM at the time the process create a queue you bind
> the process PASID to the device queue. From there on the userspace can
> schedule commands and use the device without going to kernel space.
> 
> For devices without SVA/SVM you create a fake queue that is just pure
> memory and is not related to the device. From there on userspace must
> call an ioctl every time it wants the device to consume its queue
> (hence the SYNC_ONLY flag, for synchronous operation only). The
> kernel portion reads the fake queue exposed to user space and copies
> commands into the real hardware queue, but first it properly maps any
> of the process memory needed for those commands to the device and
> adjusts the device physical addresses to the ones it gets from the
> dma_map API.
> 
> With that model it is "easy" to listen to mmu_notifier and to abide by
> them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> issue by only mapping a fake device queue to userspace.
> 
> So yes, with that model every device that wishes to support
> the non SVA/SVM case will have to do extra work (ie emulate its command
> queue in software in the kernel). But by doing so, you support an
> unlimited number of processes on your device (ie all the processes can
> share a single hardware command queue or multiple hardware queues).
> 
> The big advantages i see here is that the process do not have to worry
> about doing something wrong. You are protecting yourself and your user
> from stupid mistakes.
> 
> 
> I hope this is useful to you.
> 
> Cheers,
> Jérôme

Cheers
-- 
			-Kenneth(Hisilicon)



* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-20 14:23             ` Jerome Glisse
@ 2018-09-21 10:05               ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-21 10:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Alex Williamson, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, linux-kernel, linuxarm, iommu, David S . Miller,
	linux-crypto, Zhou Wang, Philippe Ombredanne, Thomas Gleixner,
	Zaibo Xu, linux-accelerators, Lu Baolu

On Thu, Sep 20, 2018 at 10:23:40AM -0400, Jerome Glisse wrote:
> On Thu, Sep 20, 2018 at 01:55:43PM +0800, Kenneth Lee wrote:
> > On Tue, Sep 18, 2018 at 09:03:14AM -0400, Jerome Glisse wrote:
> > > On Tue, Sep 18, 2018 at 02:00:14PM +0800, Kenneth Lee wrote:
> > > > On Mon, Sep 17, 2018 at 08:37:45AM -0400, Jerome Glisse wrote:
> > > > > On Mon, Sep 17, 2018 at 04:39:40PM +0800, Kenneth Lee wrote:
> > > > > > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > > > > > So i want to summarize issues i have as this threads have dig deep into
> > > > > > > details. For this i would like to differentiate two cases first the easy
> > > > > > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > > > > > 
> > > > > > Thank you very much for the summary.
> > > > > > 
> > > > > > > In both cases your objectives as i understand them:
> > > > > > > 
> > > > > > > [R1]- expose a common user space API that make it easy to share boiler
> > > > > > >       plate code accross many devices (discovering devices, opening
> > > > > > >       device, creating context, creating command queue ...).
> > > > > > > [R2]- try to share the device as much as possible up to device limits
> > > > > > >       (number of independant queues the device has)
> > > > > > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > > > > > >       device queue without a round trip to the kernel
> > > > > > > 
> > > > > > > I don't think i missed any.
> > > > > > > 
> > > > > > > 
> > > > > > > (1) Device with SVA/SVM
> > > > > > > 
> > > > > > > For that case it is easy, you do not need to be in VFIO or part of any
> > > > > > > thing specific in the kernel. There is no security risk (modulo bug in
> > > > > > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > > > > > to a device is just couple dozen lines of code.
> > > > > > > 
> > > > > > 
> > > > > > This is right...logically. But the kernel has no clear definition about "Device
> > > > > > with SVA/SVM" and no boiler plate for doing so. Then VFIO may become one of the
> > > > > > boiler plate.
> > > > > > 
> > > > > > VFIO is one of the wrappers for IOMMU for user space. And maybe it is the only
> > > > > > one. If we add that support within VFIO, which solve most of the problem of
> > > > > > SVA/SVM, it will save a lot of work in the future.
> > > > > 
> > > > > You do not need to "wrap" IOMMU for SVA/SVM. Existing upstream SVA/SVM user
> > > > > all do the SVA/SVM setup in couple dozen lines and i failed to see how it
> > > > > would require any more than that in your case.
> > > > > 
> > > > > 
> > > > > > I think this is the key confliction between us. So could Alex please say
> > > > > > something here? If the VFIO is going to take this into its scope, we can try
> > > > > > together to solve all the problem on the way. If it it is not, it is also
> > > > > > simple, we can just go to another way to fulfill this part of requirements even
> > > > > > we have to duplicate most of the code.
> > > > > > 
> > > > > > Another point I need to emphasis here: because we have to replace the hardware
> > > > > > queue when fork, so it won't be very simple even in SVA/SVM case.
> > > > > 
> > > > > I am assuming hardware queue can only be setup by the kernel and thus
> > > > > you are totaly safe forkwise as the queue is setup against a PASID and
> > > > > the child does not bind to any PASID and you use VM_DONTCOPY on the
> > > > > mmap of the hardware MMIO queue because you should really use that flag
> > > > > for that.
> > > > > 
> > > > > 
> > > > > > > (2) Device does not have SVA/SVM (or it is disabled)
> > > > > > > 
> > > > > > > You want to still allow device to be part of your framework. However
> > > > > > > here i see fundamentals securities issues and you move the burden of
> > > > > > > being careful to user space which i think is a bad idea. We should
> > > > > > > never trus the userspace from kernel space.
> > > > > > > 
> > > > > > > To keep the same API for the user space code you want a 1:1 mapping
> > > > > > > between device physical address and process virtual address (ie if
> > > > > > > device access device physical address A it is accessing the same
> > > > > > > memory as what is backing the virtual address A in the process.
> > > > > > > 
> > > > > > > Security issues are on two things:
> > > > > > > [I1]- fork/exec, a process who opened any such device and created an
> > > > > > >       active queue can transfer without its knowledge control of its
> > > > > > >       commands queue through COW. The parent map some anonymous region
> > > > > > >       to the device as a command queue buffer but because of COW the
> > > > > > >       parent can be the first to copy on write and thus the child can
> > > > > > >       inherit the original pages that are mapped to the hardware.
> > > > > > >       Here parent lose control and child gain it.
> > > > > > 
> > > > > > This is indeed an issue. But it remains an issue only if you continue to use the
> > > > > > queue and the memory after fork. We can use at_fork kinds of gadget to fix it in
> > > > > > user space.
> > > > > 
> > > > > Trusting user space is a no go from my point of view.
> > > > 
> > > > Can we dive deeper on this? Maybe we have different understanding on "Trusting
> > > > user space". As my understanding, "trusting user space" means "no matter what
> > > > the user process does, it should only hurt itself and anything give to it, no
> > > > the kernel and the other process".
> > > > 
> > > > In our case, we create a channel between a process and the hardware. The process
> > > > can do whateven it like to its own memory the channel itself. It won't hurt the
> > > > other process and the kernel. And if the process fork a child and give the
> > > > channel to the child, it should the freedom on those resource remain within the
> > > > parent and the child. We are not trust another else.
> > > > 
> > > > So do you refer to something else here?
> > > > 
> > > 
> > > I am refering to COW giving control to the child on to what happens
> > > in the parent from device point of view. A process hurting itself is
> > > fine, but if process now has to do special steps to protect from
> > > its child ie make sure that its childs can not hurt it, then i see
> > > that as a kernel bug. We can not ask user space process to know about
> > > all the thousands things that needs to be done to avoid issues with
> > > each device driver that the process may use (process can be totaly
> > > ignorant it is using a device if that device is use by a library it
> > > links to).
> > > 
> > > 
> > > Maybe what needs to happen will explain it better. So if userspace
> > > wants to be secure and protect itself from its child taking over the
> > > device through COW:
> > > 
> > >     - parent opened a device and is using it
> > > 
> > >     ... when parent wants to fork/exec it must:
> > > 
> > >     - parent _must_ flush device command queue and wait for the
> > >       device to finish all pending jobs
> > > 
> > >     - parent _must_ unmap all range mapped to the device
> > > 
> > >     - parent should first close device file (unless you force set
> > >       the CLOEXEC flag in the kernel)/it could also just flush
> > >       but if you are not mapping the device command queue with
> > >       VM_DONTCOPY then you should really be closing the device
> > > 
> > >     - now parent can fork/exec
> > > 
> > >     - parent must force COW ie write at least one byte to _all_
> > >       pages in the range it wants to use with the device
> > > 
> > >     - parent re-open the device and re-initialize everything
> > > 
> > > 
> > > So this is putting quite a burden on a number of steps the parent
> > > _must_ do in order to keep control of memory exposed to the device.
> > > Not doing so can potentialy lead (it depends on who does the COW
> > > first) to the child taking control of memory use by the device,
> > > memory which was mapped by the parent before the child was created.
> > > 
> > > Forcing CLOEXEC and VM_DONTCOPY somewhat help to simplify this,
> > > but you still need to stop, flush, unmap, before fork/exec and then
> > > re-init everything after.
> > > 
> > > 
> > > This is only when not using SVA/SVM, SVA/SVM is totaly fine from
> > > that point of view, no issues whatsoever.
> > > 
> > > The solution i outlined in previous email do not have that above
> > > issue either, no need to rely on user space doing that dance.
> > 
> > Thank you. I get the point. I'm now trying to see if I can solve the problem by
> > seting the vma to VM_SHARED when the portiong is "shared to the hardware".
> > 
> 
> FYI you can not convert a private anonymous vma to a share one it is
> illegal AFAIK at least i never heard of it and i am pretty sure the
> mm code would break if that happens. The user space is the one that
> decide what flags a vma has, not the kernel. Modulo few flags like
> DONTCOPY that can be force set by device driver for their vma ie vma
> of an mmap against the device file.
> 
> If you don't like my solution here is another one but it is ugly and
> i think it is a bad idea. Again this is for the non SVA/SVM case and
> it assumes that the command queue is a mmap() of the device file:
>   (A) register mmu_notifier
>   (B) on _every_ invalidate range callback (_no matter_ what is the
>       range) you zap the command queue mapped to user space (this is
>       because you can't tell if the callback happens for a fork or
>       something else) wait for the hardware queue to finish and clear
>       all the iommu/dma mapping and you unpin all the pages ie
>       put_page()
>   (C) in device file vma page fault handler (vm_operations_struct.
>       fault) you redo all the GUP and redo all the iommu/dma mapping
>       and you remap the command queue to the userspace
> 
> In (C) you can remap different command queue if you are in the child
> than in the parent (just look at current->mm and compare it to the
> one the command queue was created against).
> 
> Note that this solution will be much __slower__ than what i described
> in my previous email. You will see that mmu notifier callbacks happens
> often and for tons of reasons and you will be _constantly_ undoing and
> redoing tons of work.
> 
> This can be mitigated if you can differentiate reasons behind a mmu
> notifier callback. I posted patchset to do that a while ago and i
> intend to post it again in the next month or so. But this would still
> be a bad idea and solution i described previously is much more sane.
> 
> Trying to pretend you can have the same thing as SVA/SVM without SVA
> is not a good idea. The non SVA case can still expose same API (like
> i described previously) but should go through kernel for _every_
> hardware submission (you can batch multiple commands in one submission).
> Not doing so is way too risky from my POV.
> 
> Cheers,
> Jérôme

You are quite right. I tried every way I could to find a loophole in the mm
system and failed. I will try another way, or maybe drop the non-SVA scenario.

Cheers,
-- 
			-Kenneth(Hisilicon)



* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-21 10:03   ` Kenneth Lee
@ 2018-09-21 14:52     ` Jerome Glisse
  2018-09-25  5:55       ` Kenneth Lee
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Glisse @ 2018-09-21 14:52 UTC (permalink / raw)
  To: Kenneth Lee
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Fri, Sep 21, 2018 at 06:03:14PM +0800, Kenneth Lee wrote:
> On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > 
> > So i want to summarize issues i have as this threads have dig deep into
> > details. For this i would like to differentiate two cases first the easy
> > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > In both cases your objectives as i understand them:
> > 
> > [R1]- expose a common user space API that make it easy to share boiler
> >       plate code accross many devices (discovering devices, opening
> >       device, creating context, creating command queue ...).
> > [R2]- try to share the device as much as possible up to device limits
> >       (number of independant queues the device has)
> > [R3]- minimize syscall by allowing user space to directly schedule on the
> >       device queue without a round trip to the kernel
> > 
> > I don't think i missed any.
> > 
> > 
> > (1) Device with SVA/SVM
> > 
> > For that case it is easy, you do not need to be in VFIO or part of any
> > thing specific in the kernel. There is no security risk (modulo bug in
> > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > to a device is just couple dozen lines of code.
> > 
> > 
> > (2) Device does not have SVA/SVM (or it is disabled)
> > 
> > You want to still allow device to be part of your framework. However
> > here i see fundamentals securities issues and you move the burden of
> > being careful to user space which i think is a bad idea. We should
> > never trus the userspace from kernel space.
> > 
> > To keep the same API for the user space code you want a 1:1 mapping
> > between device physical address and process virtual address (ie if
> > device access device physical address A it is accessing the same
> > memory as what is backing the virtual address A in the process.
> > 
> > Security issues are on two things:
> > [I1]- fork/exec, a process who opened any such device and created an
> >       active queue can transfer without its knowledge control of its
> >       commands queue through COW. The parent map some anonymous region
> >       to the device as a command queue buffer but because of COW the
> >       parent can be the first to copy on write and thus the child can
> >       inherit the original pages that are mapped to the hardware.
> >       Here parent lose control and child gain it.
> > 
> 
> Hi, Jerome, 
> 
> I reconsider your logic. I think the problem can be solved. Let us separate the
> SVA/SVM feature into two: fault-from-device and device-va-awareness. A device
> with iommu can support only device-va-awareness or both.

Not sure I follow; are you also talking about the non SVA/SVM case here?
Either the device has SVA/SVM or it does not. The fact that the device can
use the same physical address PA for its accesses as the process virtual
address VA does not change any of the issues listed here; the same issues
would apply if the device was using PA != VA ...


> VFIO works on top of iommu, so it will support at least device-va-awareness. For
> the COW problem, it can be taken as a mmu synchronization issue. If the mmu page
> table is changed, it should be synchronize to iommu (via iommu_notifier). In the
> case that the device support fault-from-device, it will work fine. In the case
> that it supports only device-va-awareness, we can prefault (handle_mm_fault)
> also via iommu_notifier and reset to iommu page table.
> 
> So this can be considered as a bug of VFIO, cannot it?

So again SVA/SVM is fine because the device uses the same page table as the
CPU, so anything done to the process address space is reflected automatically
on the device. There is nothing to do for SVA/SVM.

For non SVA/SVM you _must_ unmap the device command queues, flush and wait
for all pending commands, unmap all the IOMMU mappings and wait for the
process to fault on the unmapped device command queue before trying to
restore anything. This would be terribly slow; I described it in another
email.

Also, you can not fault inside an mmu_notifier callback, it is illegal. All
you can do is unmap things and wait for accesses to finish.
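
To make the ordering above concrete, here is a toy model in plain Python
(not kernel code). The class name, fields and log strings are all made up
for illustration; the only point is that the invalidate path tears things
down and never restores, while the fault path is the only place that
rebuilds.

```python
class NonSvaQueue:
    """Toy model of the teardown/restore ordering; hypothetical names."""

    def __init__(self):
        self.queue_mapped = True   # command queue mmap visible to user space
        self.iommu_mapped = True   # IOMMU/DMA mappings in place
        self.pending = 3           # commands still in flight on the device
        self.log = []

    def invalidate(self):
        """Stand-in for an invalidate callback: teardown only, no faulting."""
        self.queue_mapped = False
        self.log.append("zap queue mapping")
        self.pending = 0
        self.log.append("flush and wait for pending commands")
        self.iommu_mapped = False
        self.log.append("unmap IOMMU and unpin pages")

    def fault(self):
        """Stand-in for the vma fault handler: the only place that restores."""
        if self.queue_mapped:
            return  # nothing was torn down, nothing to redo
        self.iommu_mapped = True
        self.log.append("re-pin pages and redo IOMMU mapping")
        self.queue_mapped = True
        self.log.append("remap queue")
```

Every invalidate pays the full teardown and every later access pays the
full rebuild, which is where the slowness comes from.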

> > [I2]- Because of [R3] you want to allow userspace to schedule commands
> >       on the device without doing an ioctl and thus here user space
> >       can schedule any commands to the device with any address. What
> >       happens if that address have not been mapped by the user space
> >       is undefined and in fact can not be defined as what each IOMMU
> >       does on invalid address access is different from IOMMU to IOMMU.
> > 
> >       In case of a bad IOMMU, or simply an IOMMU improperly setup by
> >       the kernel, this can potentialy allow user space to DMA anywhere.
> > 
> > [I3]- By relying on GUP in VFIO you are not abiding by the implicit
> >       contract (at least i hope it is implicit) that you should not
> >       try to map to the device any file backed vma (private or share).
> > 
> >       The VFIO code never check the vma controlling the addresses that
> >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
> >       user space can provide file backed range.
> > 
> >       I am guessing that the VFIO code never had any issues because its
> >       number one user is QEMU and QEMU never does that (and that's good
> >       as no one should ever do that).
> > 
> >       So if process does that you are opening your self to serious file
> >       system corruption (depending on file system this can lead to total
> >       data loss for the filesystem).
> > 
> >       Issue is that once you GUP you never abide to file system flushing
> >       which write protect the page before writing to the disk. So
> >       because the page is still map with write permission to the device
> >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
> >       write to the page while it is in the middle of being written back
> >       to disk. Consult your nearest file system specialist to ask him
> >       how bad that can be.
> 
> In the case, we cannot do anything if the device do not support
> fault-from-device. But we can reject write map with file-backed mapping.

Yes, this would avoid most issues with file-backed mappings; truncate would
still cause weird behavior, but only for your device. But then your non
SVA/SVM case does not behave like the SVA/SVM case, so why not do what I
outlined below, which allows the same behavior modulo command flushing ...

> It seems both issues can be solved under VFIO framework:) (But of cause, I don't
> mean it has to)

COW can not be solved without either the solution I described in the
previous mail (quoted below) or the other one with the mmu notifier. The one
below is saner and has better performance.
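
For readers who want the [I1] hazard spelled out, here is a toy model in
plain Python (not kernel code; Page, Process, fork and demo are made-up
names). It assumes the simplest COW rule: whoever writes to a shared page
first gets a private copy and the other side keeps the original. The device
is modelled as a plain reference to the page that was pinned before the
fork.

```python
class Page:
    def __init__(self, data):
        self.data = data


class Process:
    def __init__(self, name, page):
        self.name = name
        self.page = page  # physical page backing one virtual address

    def write(self, data, sharers):
        """Write with copy-on-write: copy first if the page is shared."""
        others = [p for p in sharers if p is not self and p.page is self.page]
        if others:
            self.page = Page(self.page.data)  # the writer takes the copy
        self.page.data = data


def fork(parent):
    # After fork the child maps the very same physical page as the parent.
    return Process("child", parent.page)


def demo():
    parent = Process("parent", Page("cmd0"))
    device_page = parent.page      # pinned and mapped to the device
    child = fork(parent)
    parent.write("cmd1", [parent, child])   # parent COWs first
    child.write("evil", [parent, child])    # child writes in place
    return parent, child, device_page
```

After demo() the device still points at the original page, which now
belongs to the child only; the parent's later writes land on a copy the
device never sees.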


> > 
> > [I4]- Design issue, mdev design As Far As I Understand It is about
> >       sharing a single device to multiple clients (most obvious case
> >       here is again QEMU guest). But you are going against that model,
> >       in fact AFAIUI you are doing the exect opposite. When there is
> >       no SVA/SVM you want only one mdev device that can not be share.
> > 
> >       So this is counter intuitive to the mdev existing design. It is
> >       not about sharing device among multiple users but about giving
> >       exclusive access to the device to one user.
> > 
> > 
> > 
> > All the reasons above is why i believe a different model would serve
> > you and your user better. Below is a design that avoids all of the
> > above issues and still delivers all of your objectives with the
> > exceptions of the third one [R3] when there is no SVA/SVM.
> > 
> > 
> > Create a subsystem (very much boiler plate code) which allow device to
> > register themself against (very much like what you do in your current
> > patchset but outside of VFIO).
> > 
> > That subsystem will create a device file for each registered system and
> > expose a common API (ie set of ioctl) for each of those device files.
> > 
> > When user space create a queue (through an ioctl after opening the device
> > file) the kernel can return -EBUSY if all the device queue are in use,
> > or create a device queue and return a flag like SYNC_ONLY for device that
> > do not have SVA/SVM.
> > 
> > For device with SVA/SVM at the time the process create a queue you bind
> > the process PASID to the device queue. From there on the userspace can
> > schedule commands and use the device without going to kernel space.
> > 
> > For device without SVA/SVM you create a fake queue that is just pure
> > memory is not related to the device. From there on the userspace must
> > call an ioctl every time it wants the device to consume its queue
> > (hence why the SYNC_ONLY flag for synchronous operation only). The
> > kernel portion read the fake queue expose to user space and copy
> > commands into the real hardware queue but first it properly map any
> > of the process memory needed for those commands to the device and
> > adjust the device physical address with the one it gets from dma_map
> > API.
> > 
> > With that model it is "easy" to listen to mmu_notifier and to abide by
> > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> > issue by only mapping a fake device queue to userspace.
> > 
> > So yes with that models it means that every device that wish to support
> > the non SVA/SVM case will have to do extra work (ie emulate its command
> > queue in software in the kernel). But by doing so, you support an
> > unlimited number of process on your device (ie all the process can share
> > one single hardware command queues or multiple hardware queues).
> > 
> > The big advantages i see here is that the process do not have to worry
> > about doing something wrong. You are protecting yourself and your user
> > from stupid mistakes.
> > 
> > 
> > I hope this is useful to you.
> > 
> > Cheers,
> > Jérôme
> 
> Cheers
> -- 
> 			-Kenneth(Hisilicon)
> 


* Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
  2018-09-21 14:52     ` Jerome Glisse
@ 2018-09-25  5:55       ` Kenneth Lee
  0 siblings, 0 replies; 58+ messages in thread
From: Kenneth Lee @ 2018-09-25  5:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kenneth Lee, Herbert Xu, kvm, Jonathan Corbet,
	Greg Kroah-Hartman, Joerg Roedel, linux-doc, Sanjay Kumar,
	Hao Fang, iommu, linux-kernel, linuxarm, Alex Williamson,
	Thomas Gleixner, linux-crypto, Zhou Wang, Philippe Ombredanne,
	Zaibo Xu, David S . Miller, linux-accelerators, Lu Baolu

On Fri, Sep 21, 2018 at 10:52:01AM -0400, Jerome Glisse wrote:
> On Fri, Sep 21, 2018 at 06:03:14PM +0800, Kenneth Lee wrote:
> > On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> > > 
> > > So i want to summarize issues i have as this threads have dig deep into
> > > details. For this i would like to differentiate two cases first the easy
> > > one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> > > In both cases your objectives as i understand them:
> > > 
> > > [R1]- expose a common user space API that make it easy to share boiler
> > >       plate code accross many devices (discovering devices, opening
> > >       device, creating context, creating command queue ...).
> > > [R2]- try to share the device as much as possible up to device limits
> > >       (number of independant queues the device has)
> > > [R3]- minimize syscall by allowing user space to directly schedule on the
> > >       device queue without a round trip to the kernel
> > > 
> > > I don't think i missed any.
> > > 
> > > 
> > > (1) Device with SVA/SVM
> > > 
> > > For that case it is easy, you do not need to be in VFIO or part of any
> > > thing specific in the kernel. There is no security risk (modulo bug in
> > > the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> > > to a device is just couple dozen lines of code.
> > > 
> > > 
> > > (2) Device does not have SVA/SVM (or it is disabled)
> > > 
> > > You want to still allow device to be part of your framework. However
> > > here i see fundamentals securities issues and you move the burden of
> > > being careful to user space which i think is a bad idea. We should
> > > never trus the userspace from kernel space.
> > > 
> > > To keep the same API for the user space code you want a 1:1 mapping
> > > between device physical address and process virtual address (ie if
> > > device access device physical address A it is accessing the same
> > > memory as what is backing the virtual address A in the process.
> > > 
> > > Security issues are on two things:
> > > [I1]- fork/exec, a process who opened any such device and created an
> > >       active queue can transfer without its knowledge control of its
> > >       commands queue through COW. The parent map some anonymous region
> > >       to the device as a command queue buffer but because of COW the
> > >       parent can be the first to copy on write and thus the child can
> > >       inherit the original pages that are mapped to the hardware.
> > >       Here parent lose control and child gain it.
> > > 
> > 
> > Hi, Jerome, 
> > 
> > I reconsider your logic. I think the problem can be solved. Let us separate the
> > SVA/SVM feature into two: fault-from-device and device-va-awareness. A device
> > with iommu can support only device-va-awareness or both.
> 
> Not sure i follow, are you also talking about the non SVA/SVM case here ?
> Either device has SVA/SVM, either it does not. The fact that device can
> use same physical address PA for its access as the process virtual address
> VA does not change any of the issues listed here, same issues would apply
> if device was using PA != VA ...
> 
> 
> > VFIO works on top of iommu, so it will support at least device-va-awareness. For
> > the COW problem, it can be taken as a mmu synchronization issue. If the mmu page
> > table is changed, it should be synchronize to iommu (via iommu_notifier). In the
> > case that the device support fault-from-device, it will work fine. In the case
> > that it supports only device-va-awareness, we can prefault (handle_mm_fault)
> > also via iommu_notifier and reset to iommu page table.
> > 
> > So this can be considered as a bug of VFIO, cannot it?
> 
> So again SVA/SVM is fine because it uses the same CPU page table so anything
> done to the process address space reflect automaticly to the device. Nothing
> to do for SVA/SVM.
> 
> For non SVA/SVM you _must_ unmap device command queues, flush and wait for
> all pending commands, unmap all the IOMMU mapping and wait for the process
> to fault on the unmapped device command queue before trying to restore any
> thing. This would be terribly slow i described it in another email.
> 
> Also you can not fault inside a mmu_notifier it is illegal, all you can do
> is unmap thing and wait for access to finish.
> 
> > > [I2]- Because of [R3] you want to allow userspace to schedule commands
> > >       on the device without doing an ioctl and thus here user space
> > >       can schedule any commands to the device with any address. What
> > >       happens if that address have not been mapped by the user space
> > >       is undefined and in fact can not be defined as what each IOMMU
> > >       does on invalid address access is different from IOMMU to IOMMU.
> > > 
> > >       In case of a bad IOMMU, or simply an IOMMU improperly setup by
> > >       the kernel, this can potentialy allow user space to DMA anywhere.
> > > 
> > > [I3]- By relying on GUP in VFIO you are not abiding by the implicit
> > >       contract (at least i hope it is implicit) that you should not
> > >       try to map to the device any file backed vma (private or share).
> > > 
> > >       The VFIO code never check the vma controlling the addresses that
> > >       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
> > >       user space can provide file backed range.
> > > 
> > >       I am guessing that the VFIO code never had any issues because its
> > >       number one user is QEMU and QEMU never does that (and that's good
> > >       as no one should ever do that).
> > > 
> > >       So if process does that you are opening your self to serious file
> > >       system corruption (depending on file system this can lead to total
> > >       data loss for the filesystem).
> > > 
> > >       Issue is that once you GUP you never abide to file system flushing
> > >       which write protect the page before writing to the disk. So
> > >       because the page is still map with write permission to the device
> > >       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
> > >       write to the page while it is in the middle of being written back
> > >       to disk. Consult your nearest file system specialist to ask him
> > >       how bad that can be.
> > 
> > In the case, we cannot do anything if the device do not support
> > fault-from-device. But we can reject write map with file-backed mapping.
> 
> Yes this would avoid most issues with file-backed mapping, truncate would
> still cause weird behavior but only for your device. But then your non
> SVA/SVM case does not behave as the SVA/SVM case so why not do what i out-
> lined below that allow same behavior modulo command flushing ...
> 
> > It seems both issues can be solved under VFIO framework:) (But of cause, I don't
> > mean it has to)
> 
> The COW can not be solve without either the solution i described in this
> previous mail below or the other one with mmu notifier. But the one below
> is saner and has better performance.
> 
> 
> > > 
> > > [I4]- Design issue, mdev design As Far As I Understand It is about
> > >       sharing a single device to multiple clients (most obvious case
> > >       here is again QEMU guest). But you are going against that model,
> > >       in fact AFAIUI you are doing the exect opposite. When there is
> > >       no SVA/SVM you want only one mdev device that can not be share.
> > > 
> > >       So this is counter intuitive to the mdev existing design. It is
> > >       not about sharing device among multiple users but about giving
> > >       exclusive access to the device to one user.
> > > 
> > > 
> > > 
> > > All the reasons above is why i believe a different model would serve
> > > you and your user better. Below is a design that avoids all of the
> > > above issues and still delivers all of your objectives with the
> > > exceptions of the third one [R3] when there is no SVA/SVM.
> > > 
> > > 
> > > Create a subsystem (very much boiler plate code) which allow device to
> > > register themself against (very much like what you do in your current
> > > patchset but outside of VFIO).
> > > 
> > > That subsystem will create a device file for each registered system and
> > > expose a common API (ie set of ioctl) for each of those device files.
> > > 
> > > When user space create a queue (through an ioctl after opening the device
> > > file) the kernel can return -EBUSY if all the device queue are in use,
> > > or create a device queue and return a flag like SYNC_ONLY for device that
> > > do not have SVA/SVM.
> > > 
> > > For device with SVA/SVM at the time the process create a queue you bind
> > > the process PASID to the device queue. From there on the userspace can
> > > schedule commands and use the device without going to kernel space.
> > > 
> > > For device without SVA/SVM you create a fake queue that is just pure
> > > memory is not related to the device. From there on the userspace must
> > > call an ioctl every time it wants the device to consume its queue
> > > (hence why the SYNC_ONLY flag for synchronous operation only). The
> > > kernel portion read the fake queue expose to user space and copy
> > > commands into the real hardware queue but first it properly map any
> > > of the process memory needed for those commands to the device and
> > > adjust the device physical address with the one it gets from dma_map
> > > API.
> > > 
> > > With that model it is "easy" to listen to mmu_notifier and to abide by
> > > them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> > > issue by only mapping a fake device queue to userspace.
> > > 
> > > So yes, with that model every device that wishes to support
> > > the non SVA/SVM case will have to do extra work (ie emulate its command
> > > queue in software in the kernel). But by doing so, you support an
> > > unlimited number of processes on your device (ie all the processes can share
> > > one single hardware command queue or multiple hardware queues).
> > > 
> > > The big advantage I see here is that the process does not have to worry
> > > about doing something wrong. You are protecting yourself and your user
> > > from stupid mistakes.
> > > 
> > > 
> > > I hope this is useful to you.
> > > 
> > > Cheers,
> > > Jérôme
> > 
> > Cheers
> > -- 
> > 			-Kenneth(Hisilicon)
> > 

Thank you very much, Jerome. I will push RFCv3 without VFIO soon.

The basic idea will be:

1. Make a chrdev interface for any WarpDrive device instance. Allocate a queue
   when it is opened.

2. For devices with SVA/SVM, the device (in the queue context) shares the
   application's process page table via the IOMMU.

3. For devices without SVA/SVM, the application can mmap any WarpDrive device as
   its "shared memory". The memory will be shared with any queue opened by this
   process, before or after the queues are opened. The VAs will still be shared
   with the hardware (by the IOMMU or by kernel translation). This will fulfill the
   requirement of "Sharing VA", but it will require the application to use only
   that mmapped memory.
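
As a rough illustration of point 1, the open-time queue allocation could
behave like the userspace model below, combined with the -EBUSY and SYNC_ONLY
behaviour suggested in the quoted mail. This is only a sketch under
assumptions: every name in it (wd_device, wd_queue, wd_open_queue,
wd_release_queue, wd_example, WD_NUM_HW_QUEUES, WD_QUEUE_FLAG_SYNC_ONLY) is
hypothetical and not taken from the patchset, and the real logic would
presumably sit in the chrdev open/release handlers with proper locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical values, chosen only for this illustration. */
#define WD_NUM_HW_QUEUES        2   /* hardware queues per device */
#define WD_QUEUE_FLAG_SYNC_ONLY 0x1 /* queue must be driven through the kernel */

struct wd_device {
	bool has_sva;                      /* device supports SVA/SVM */
	bool queue_busy[WD_NUM_HW_QUEUES]; /* true while a process owns the queue */
};

struct wd_queue {
	int index;          /* index of the hardware queue backing this handle */
	unsigned int flags; /* 0, or WD_QUEUE_FLAG_SYNC_ONLY without SVA/SVM */
};

/*
 * Models "allocate a queue when the chrdev is opened": hand out the first
 * free hardware queue, or fail with -EBUSY when all of them are in use.
 * No locking here; a real driver would need it.
 */
static int wd_open_queue(struct wd_device *dev, struct wd_queue *q)
{
	for (int i = 0; i < WD_NUM_HW_QUEUES; i++) {
		if (!dev->queue_busy[i]) {
			dev->queue_busy[i] = true;
			q->index = i;
			q->flags = dev->has_sva ? 0 : WD_QUEUE_FLAG_SYNC_ONLY;
			return 0;
		}
	}
	return -EBUSY;
}

/* Models the release handler: give the hardware queue back to the device. */
static void wd_release_queue(struct wd_device *dev, struct wd_queue *q)
{
	assert(q->index >= 0 && q->index < WD_NUM_HW_QUEUES);
	dev->queue_busy[q->index] = false;
	q->index = -1;
}

/* Small walk-through; returns 0 if the model behaves as described. */
static int wd_example(void)
{
	struct wd_device dev = { .has_sva = false };
	struct wd_queue a, b, c;

	if (wd_open_queue(&dev, &a) != 0 || wd_open_queue(&dev, &b) != 0)
		return -1;
	if (!(a.flags & WD_QUEUE_FLAG_SYNC_ONLY))
		return -1;              /* no SVA/SVM: must be sync-only */
	if (wd_open_queue(&dev, &c) != -EBUSY)
		return -1;              /* both hardware queues are taken */

	wd_release_queue(&dev, &a);     /* free one queue ... */
	if (wd_open_queue(&dev, &c) != 0)
		return -1;              /* ... so the next open succeeds */

	wd_release_queue(&dev, &b);
	wd_release_queue(&dev, &c);
	return 0;
}
```

Calling wd_example() from a main() should return 0. The shared-memory mmap in
point 3 and the kernel-side command translation are deliberately left out of
this model.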

Cheers

-- 
			-Kenneth(Hisilicon)

Thread overview: 58+ messages
2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
2018-09-06 18:36   ` Randy Dunlap
2018-09-07  2:21     ` Kenneth Lee
2018-09-03  0:51 ` [PATCH 2/7] iommu: Add share domain interface in iommu for sdmdev Kenneth Lee
2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
2018-09-03  2:11   ` Randy Dunlap
2018-09-06  8:08     ` Kenneth Lee
2018-09-03  2:55   ` Lu Baolu
2018-09-06  9:01     ` Kenneth Lee
2018-09-04 15:31   ` [RFC PATCH] vfio: vfio_sdmdev_groups[] can be static kbuild test robot
2018-09-04 15:32   ` [PATCH 3/7] vfio: add sdmdev support kbuild test robot
2018-09-04 15:32   ` kbuild test robot
2018-09-05  7:27   ` Dan Carpenter
2018-09-03  0:52 ` [PATCH 4/7] crypto: add hisilicon Queue Manager driver Kenneth Lee
2018-09-03  2:15   ` Randy Dunlap
2018-09-06  9:08     ` Kenneth Lee
2018-09-03  0:52 ` [PATCH 5/7] crypto: Add Hisilicon Zip driver Kenneth Lee
2018-09-03  0:52 ` [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM Kenneth Lee
2018-09-03  2:19   ` Randy Dunlap
2018-09-06  9:09     ` Kenneth Lee
2018-09-03  0:52 ` [PATCH 7/7] vfio/sdmdev: add user sample Kenneth Lee
2018-09-03  2:25   ` Randy Dunlap
2018-09-06  9:10     ` Kenneth Lee
2018-09-03  2:32 ` [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Lu Baolu
2018-09-06  9:11   ` Kenneth Lee
2018-09-04 15:00 ` Jerome Glisse
2018-09-04 16:15   ` Alex Williamson
2018-09-06  9:45     ` Kenneth Lee
2018-09-06 13:31       ` Jerome Glisse
2018-09-07  4:01         ` Kenneth Lee
2018-09-07 16:53           ` Jerome Glisse
2018-09-07 17:55             ` Jean-Philippe Brucker
2018-09-07 18:04               ` Jerome Glisse
2018-09-10  3:28             ` Kenneth Lee
2018-09-10 14:54               ` Jerome Glisse
2018-09-11  2:42                 ` Kenneth Lee
2018-09-11  3:33                   ` Jerome Glisse
2018-09-11  6:40                     ` Kenneth Lee
2018-09-11 13:40                       ` Jerome Glisse
2018-09-13  8:32                         ` Kenneth Lee
2018-09-13 14:51                           ` Jerome Glisse
2018-09-14  3:12                             ` Kenneth Lee
2018-09-14 14:05                               ` Jerome Glisse
2018-09-14  6:50                             ` Tian, Kevin
2018-09-14 13:05                               ` Kenneth Lee
2018-09-14 14:13                               ` Jerome Glisse
2018-09-17  1:42 ` Jerome Glisse
2018-09-17  8:39   ` Kenneth Lee
2018-09-17 12:37     ` Jerome Glisse
2018-09-18  6:00       ` Kenneth Lee
2018-09-18 13:03         ` Jerome Glisse
2018-09-20  5:55           ` Kenneth Lee
2018-09-20 14:23             ` Jerome Glisse
2018-09-21 10:05               ` Kenneth Lee
2018-09-21 10:03   ` Kenneth Lee
2018-09-21 14:52     ` Jerome Glisse
2018-09-25  5:55       ` Kenneth Lee
