* [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO
@ 2021-07-20  8:16 Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 01/27] Update kernel headers Yishai Hadas
                   ` (27 more replies)
  0 siblings, 28 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

This series introduces an mlx5 user space driver over VFIO.

This enables an application to take full ownership of the opened device and
run any firmware command (e.g. port up/down) without any concern of
interfering with other users of the device.

The application's look and feel is like a regular RDMA application over
DEVX: it uses the verbs API to open/close a device and then mostly uses
DEVX APIs to interact with the device.

To achieve this, a few mlx5 DV APIs were introduced:
- An API to get an ibv_device for a given mlx5 PCI name.
- APIs to manage device-specific events.

Detailed man pages were added to describe the expected usage of those APIs.
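
For illustration, below is a minimal sketch of the intended flow; the
mlx5dv_vfio_context_attr layout is assumed from the new man pages and the
device list cleanup is left out:

/* Minimal usage sketch (assumptions noted above); see the
 * mlx5dv_get_vfio_device_list(3) and mlx5dv_vfio_process_events(3) man
 * pages added by this series for the authoritative definitions.
 */
#include <poll.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

static struct ibv_context *open_mlx5_vfio(const char *pci_name)
{
	struct mlx5dv_vfio_context_attr attr = {
		.pci_name = pci_name,		/* e.g. "0000:3b:00.0" */
	};
	struct ibv_device **list;

	/* Get the ibv_device matching the given mlx5 PCI name */
	list = mlx5dv_get_vfio_device_list(&attr);
	if (!list || !list[0])
		return NULL;

	/* From here on it looks like a regular verbs/DEVX application */
	return ibv_open_device(list[0]);
}

static void drive_device_events(struct ibv_context *ctx)
{
	struct pollfd pfd = {
		.fd = mlx5dv_vfio_get_events_fd(ctx),
		.events = POLLIN,
	};

	/* The application owns the device, so it must let the driver
	 * process device events (async events, command completions).
	 */
	while (poll(&pfd, 1, -1) > 0)
		mlx5dv_vfio_process_events(ctx);
}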

The mlx5 VFIO driver implements the basic verbs APIs for managing PDs and
MRs, as well as the DEVX APIs which are required to write an RDMA
application.
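
As a rough illustration of that surface, PD/MR management plus a raw DEVX
command could look as below; the command buffers are placeholders rather
than a valid PRM command:

/* Sketch only: 'in'/'out' are placeholder buffers, not a real PRM command. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

static int basic_objects_example(struct ibv_context *ctx)
{
	uint32_t in[16] = {0}, out[16] = {0};	/* placeholder command layout */
	void *buf = calloc(1, 4096);
	struct ibv_pd *pd;
	struct ibv_mr *mr;
	int ret = -1;

	pd = ibv_alloc_pd(ctx);
	if (!pd)
		goto out_free;

	/* Registered memory is DMA mapped through VFIO under the hood */
	mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		goto out_pd;

	/* Any firmware command can be issued through the DEVX general
	 * command interface; the opcode and layout come from the PRM.
	 */
	ret = mlx5dv_devx_general_cmd(ctx, in, sizeof(in), out, sizeof(out));

	ibv_dereg_mr(mr);
out_pd:
	ibv_dealloc_pd(pd);
out_free:
	free(buf);
	return ret;
}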

The VFIO uAPIs are used to set up the mlx5 vfio context: reading the device
initialization segment, setting up DMA and enabling the device command
interface.
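
For reference, a condensed sketch of that VFIO uAPI sequence follows; error
handling and the sysfs lookup of the IOMMU group number are omitted, and
the group path and IOVA used below are illustrative:

/* Condensed sketch of the VFIO setup flow (see the assumptions above). */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static void *vfio_setup_sketch(const char *group_path,	/* "/dev/vfio/<grp>" */
			       const char *pci_name,	/* "0000:3b:00.0" */
			       void *cmd_buf, size_t cmd_len)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open(group_path, O_RDWR);
	struct vfio_group_status gstatus = { .argsz = sizeof(gstatus) };
	struct vfio_device_info dinfo = { .argsz = sizeof(dinfo) };
	struct vfio_region_info reg = { .argsz = sizeof(reg),
					.index = VFIO_PCI_BAR0_REGION_INDEX };
	struct vfio_iommu_type1_dma_map dma = {
		.argsz = sizeof(dma),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)cmd_buf,
		.iova = 0,
		.size = cmd_len,
	};
	int device;

	ioctl(group, VFIO_GROUP_GET_STATUS, &gstatus);	/* must be VIABLE */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, pci_name);
	ioctl(device, VFIO_DEVICE_GET_INFO, &dinfo);
	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

	/* Make the command buffer reachable by the device (DMA setup) */
	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma);

	/* Map BAR0 to read the initialization segment and drive commands */
	return mmap(NULL, reg.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device, reg.offset);
}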

In addition, the series includes pyverbs support which runs DEVX-like tests
over both RDMA and VFIO mlx5 devices.

Some extra documentation of the benefits and the motivation for using VFIO
can be found here [1].

A PR was sent [2].

[1] https://www.kernel.org/doc/html/latest/driver-api/vfio.html
[2] https://github.com/linux-rdma/rdma-core/pull/1034

Yishai

Edward Srouji (10):
  pyverbs: Support DevX UMEM registration
  pyverbs/mlx5: Support EQN querying
  pyverbs/mlx5: Support more DevX objects
  pyverbs: Add auxiliary memory functions
  pyverbs/mlx5: Add support to extract mlx5dv objects
  pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums
  tests: Add MAC address to the tests' args
  tests: Add mlx5 DevX data path test
  pyverbs/mlx5: Support mlx5 devices over VFIO
  tests: Add a test for mlx5 over VFIO

Maor Gottlieb (1):
  mlx5: Enable debug functionality for vfio

Mark Zhang (5):
  util: Add interval_set support
  mlx5: Support fast teardown over vfio
  mlx5: VFIO poll_health support
  mlx5: Set DV context ops
  mlx5: Implement mlx5dv devx_obj APIs over vfio

Yishai Hadas (11):
  Update kernel headers
  mlx5: Introduce mlx5dv_get_vfio_device_list()
  verbs: Enable verbs_open_device() to work over non sysfs devices
  mlx5: Setup mlx5 vfio context
  mlx5: Add mlx5_vfio_cmd_exec() support
  mlx5: vfio setup function support
  mlx5: vfio setup basic caps
  mlx5: Enable interrupt command mode over vfio
  mlx5: Introduce vfio APIs to process events
  mlx5: Implement basic verbs operation for PD and MR over vfio
  mlx5: Support initial DEVX/DV APIs over vfio

 Documentation/pyverbs.md                           |   34 +
 debian/ibverbs-providers.symbols                   |    4 +
 kernel-headers/CMakeLists.txt                      |    4 +
 kernel-headers/linux/vfio.h                        | 1374 ++++++++
 libibverbs/device.c                                |   39 +-
 libibverbs/sysfs.c                                 |    5 +
 providers/mlx5/CMakeLists.txt                      |    3 +-
 providers/mlx5/dr_rule.c                           |   10 +-
 providers/mlx5/libmlx5.map                         |    7 +
 providers/mlx5/man/CMakeLists.txt                  |    3 +
 .../mlx5/man/mlx5dv_get_vfio_device_list.3.md      |   64 +
 providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md  |   41 +
 providers/mlx5/man/mlx5dv_vfio_process_events.3.md |   43 +
 providers/mlx5/mlx5.c                              |  376 ++-
 providers/mlx5/mlx5.h                              |  187 +-
 providers/mlx5/mlx5_ifc.h                          | 1206 ++++++-
 providers/mlx5/mlx5_vfio.c                         | 3379 ++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h                         |  329 ++
 providers/mlx5/mlx5dv.h                            |   25 +
 providers/mlx5/verbs.c                             |  966 +++++-
 pyverbs/CMakeLists.txt                             |    7 +
 pyverbs/dma_util.pyx                               |   25 +
 pyverbs/mem_alloc.pyx                              |   46 +-
 pyverbs/pd.pyx                                     |    4 +
 pyverbs/providers/mlx5/CMakeLists.txt              |    4 +-
 pyverbs/providers/mlx5/libmlx5.pxd                 |  103 +-
 pyverbs/providers/mlx5/mlx5_vfio.pxd               |   15 +
 pyverbs/providers/mlx5/mlx5_vfio.pyx               |  116 +
 pyverbs/providers/mlx5/mlx5dv.pxd                  |   20 +
 pyverbs/providers/mlx5/mlx5dv.pyx                  |  277 +-
 pyverbs/providers/mlx5/mlx5dv_enums.pxd            |   42 +
 pyverbs/providers/mlx5/mlx5dv_objects.pxd          |   28 +
 pyverbs/providers/mlx5/mlx5dv_objects.pyx          |  214 ++
 tests/CMakeLists.txt                               |    3 +
 tests/args_parser.py                               |    5 +
 tests/base.py                                      |   14 +-
 tests/mlx5_base.py                                 |  460 ++-
 tests/mlx5_prm_structs.py                          | 1046 ++++++
 tests/test_mlx5_devx.py                            |   23 +
 tests/test_mlx5_vfio.py                            |  104 +
 util/CMakeLists.txt                                |    2 +
 util/interval_set.c                                |  208 ++
 util/interval_set.h                                |   77 +
 util/util.h                                        |    5 +
 44 files changed, 10650 insertions(+), 297 deletions(-)
 create mode 100644 kernel-headers/linux/vfio.h
 create mode 100644 providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md
 create mode 100644 providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md
 create mode 100644 providers/mlx5/man/mlx5dv_vfio_process_events.3.md
 create mode 100644 providers/mlx5/mlx5_vfio.c
 create mode 100644 providers/mlx5/mlx5_vfio.h
 create mode 100644 pyverbs/dma_util.pyx
 create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pxd
 create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pyx
 create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pxd
 create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pyx
 create mode 100644 tests/mlx5_prm_structs.py
 create mode 100644 tests/test_mlx5_devx.py
 create mode 100644 tests/test_mlx5_vfio.py
 create mode 100644 util/interval_set.c
 create mode 100644 util/interval_set.h

-- 
1.8.3.1



* [PATCH rdma-core 01/27] Update kernel headers
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 02/27] mlx5: Introduce mlx5dv_get_vfio_device_list() Yishai Hadas
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

To commit e73f0f0ee754 ("Linux 5.14-rc1").

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 kernel-headers/CMakeLists.txt |    4 +
 kernel-headers/linux/vfio.h   | 1374 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1378 insertions(+)
 create mode 100644 kernel-headers/linux/vfio.h

diff --git a/kernel-headers/CMakeLists.txt b/kernel-headers/CMakeLists.txt
index b961892..d9621ee 100644
--- a/kernel-headers/CMakeLists.txt
+++ b/kernel-headers/CMakeLists.txt
@@ -26,6 +26,10 @@ publish_internal_headers(rdma
   rdma/vmw_pvrdma-abi.h
   )
 
+publish_internal_headers(linux
+  linux/vfio.h
+  )
+
 publish_internal_headers(rdma/hfi
   rdma/hfi/hfi1_ioctl.h
   rdma/hfi/hfi1_user.h
diff --git a/kernel-headers/linux/vfio.h b/kernel-headers/linux/vfio.h
new file mode 100644
index 0000000..78e4dcd
--- /dev/null
+++ b/kernel-headers/linux/vfio.h
@@ -0,0 +1,1374 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UAPIVFIO_H
+#define _UAPIVFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#ifndef __kernel
+#define __user
+#endif
+
+#define VFIO_API_VERSION	0
+
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_TYPE1_IOMMU		1
+#define VFIO_SPAPR_TCE_IOMMU		2
+#define VFIO_TYPE1v2_IOMMU		3
+/*
+ * IOMMU enforces DMA cache coherence (ex. PCIe NoSnoop stripping).  This
+ * capability is subject to change as groups are added or removed.
+ */
+#define VFIO_DMA_CC_IOMMU		4
+
+/* Check if EEH is supported */
+#define VFIO_EEH			5
+
+/* Two-stage IOMMU */
+#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
+
+#define VFIO_SPAPR_TCE_v2_IOMMU		7
+
+/*
+ * The No-IOMMU IOMMU offers no translation or isolation for devices and
+ * supports no ioctls outside of VFIO_CHECK_EXTENSION.  Use of VFIO's No-IOMMU
+ * code will taint the host kernel and should be used with extreme caution.
+ */
+#define VFIO_NOIOMMU_IOMMU		8
+
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL			9
+
+/* Supports the vaddr flag for DMA map and unmap */
+#define VFIO_UPDATE_VADDR		10
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities.  A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set.  This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id.  The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+	__u16	id;		/* Identifies capability */
+	__u16	version;	/* Version specific to the capability ID */
+	__u32	next;		/* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size.  For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
+#define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+#define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
+#define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
+#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
+#define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+	__u32   cap_offset;	/* Offset within info struct of first cap */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/*
+ * Vendor driver using Mediated device framework should provide device_api
+ * attribute in supported type attribute groups. Device API string should be one
+ * of the following corresponding to device flags in vfio_device_info structure.
+ */
+
+#define VFIO_DEVICE_API_PCI_STRING		"vfio-pci"
+#define VFIO_DEVICE_API_PLATFORM_STRING		"vfio-platform"
+#define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
+#define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
+#define VFIO_DEVICE_API_AP_STRING		"vfio-ap"
+
+/*
+ * The following capabilities are unique to s390 zPCI devices.  Their contents
+ * are further-defined in vfio_zdev.h
+ */
+#define VFIO_DEVICE_INFO_CAP_ZPCI_BASE		1
+#define VFIO_DEVICE_INFO_CAP_ZPCI_GROUP		2
+#define VFIO_DEVICE_INFO_CAP_ZPCI_UTIL		3
+#define VFIO_DEVICE_INFO_CAP_ZPCI_PFIP		4
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS	(1 << 3) /* Info supports caps */
+	__u32	index;		/* Region index */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support.  When specified, the user should only
+ * mmap the offset ranges specified by the areas array.  mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP	1
+
+struct vfio_region_sparse_mmap_area {
+	__u64	offset;	/* Offset of mmap'able area within region */
+	__u64	size;	/* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+	struct vfio_info_cap_header header;
+	__u32	nr_areas;
+	__u32	reserved;
+	struct vfio_region_sparse_mmap_area areas[];
+};
+
+/*
+ * The device specific type capability allows regions unique to a specific
+ * device or class of devices to be exposed.  This helps solve the problem for
+ * vfio bus drivers of defining which region indexes correspond to which region
+ * on the device, without needing to resort to static indexes, as done by
+ * vfio-pci.  For instance, if we were to go back in time, we might remove
+ * VFIO_PCI_VGA_REGION_INDEX and let vfio-pci simply define that all indexes
+ * greater than or equal to VFIO_PCI_NUM_REGIONS are device specific and we'd
+ * make a "VGA" device specific type to describe the VGA access space.  This
+ * means that non-VGA devices wouldn't need to waste this index, and thus the
+ * address space associated with it due to implementation of device file
+ * descriptor offsets in vfio-pci.
+ *
+ * The current implementation is now part of the user ABI, so we can't use this
+ * for VGA, but there are other upcoming use cases, such as opregions for Intel
+ * IGD devices and framebuffers for vGPU devices.  We missed VGA, but we'll
+ * use this for future additions.
+ *
+ * The structure below defines version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_TYPE	2
+
+struct vfio_region_info_cap_type {
+	struct vfio_info_cap_header header;
+	__u32 type;	/* global per bus driver */
+	__u32 subtype;	/* type specific */
+};
+
+/*
+ * List of region types, global per bus driver.
+ * If you introduce a new type, please add it here.
+ */
+
+/* PCI region type containing a PCI vendor part */
+#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE	(1 << 31)
+#define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
+#define VFIO_REGION_TYPE_GFX                    (1)
+#define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
+
+/* sub-types for VFIO_REGION_TYPE_PCI_* */
+
+/* 8086 vendor PCI sub-types */
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION	(1)
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
+
+/* 10de vendor PCI sub-types */
+/*
+ * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
+ *
+ * Deprecated, region no longer provided
+ */
+#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM	(1)
+
+/* 1014 vendor PCI sub-types */
+/*
+ * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
+ * to do TLB invalidation on a GPU.
+ *
+ * Deprecated, region no longer provided
+ */
+#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
+
+/* sub-types for VFIO_REGION_TYPE_GFX */
+#define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
+
+/**
+ * struct vfio_region_gfx_edid - EDID region layout.
+ *
+ * Set display link state and EDID blob.
+ *
+ * The EDID blob has monitor information such as brand, name, serial
+ * number, physical size, supported video modes and more.
+ *
+ * This special region allows userspace (typically qemu) set a virtual
+ * EDID for the virtual monitor, which allows a flexible display
+ * configuration.
+ *
+ * For the edid blob spec look here:
+ *    https://en.wikipedia.org/wiki/Extended_Display_Identification_Data
+ *
+ * On linux systems you can find the EDID blob in sysfs:
+ *    /sys/class/drm/${card}/${connector}/edid
+ *
+ * You can use the edid-decode utility (comes with xorg-x11-utils) to
+ * decode the EDID blob.
+ *
+ * @edid_offset: location of the edid blob, relative to the
+ *               start of the region (readonly).
+ * @edid_max_size: max size of the edid blob (readonly).
+ * @edid_size: actual edid size (read/write).
+ * @link_state: display link state (read/write).
+ * VFIO_DEVICE_GFX_LINK_STATE_UP: Monitor is turned on.
+ * VFIO_DEVICE_GFX_LINK_STATE_DOWN: Monitor is turned off.
+ * @max_xres: max display width (0 == no limitation, readonly).
+ * @max_yres: max display height (0 == no limitation, readonly).
+ *
+ * EDID update protocol:
+ *   (1) set link-state to down.
+ *   (2) update edid blob and size.
+ *   (3) set link-state to up.
+ */
+struct vfio_region_gfx_edid {
+	__u32 edid_offset;
+	__u32 edid_max_size;
+	__u32 edid_size;
+	__u32 max_xres;
+	__u32 max_yres;
+	__u32 link_state;
+#define VFIO_DEVICE_GFX_LINK_STATE_UP    1
+#define VFIO_DEVICE_GFX_LINK_STATE_DOWN  2
+};
+
+/* sub-types for VFIO_REGION_TYPE_CCW */
+#define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
+#define VFIO_REGION_SUBTYPE_CCW_SCHIB		(2)
+#define VFIO_REGION_SUBTYPE_CCW_CRW		(3)
+
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * The structure vfio_device_migration_info is placed at the 0th offset of
+ * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
+ * migration information. Field accesses from this structure are only supported
+ * at their native width and alignment. Otherwise, the result is undefined and
+ * vendor drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      - The user application writes to this field to inform the vendor driver
+ *        about the device state to be transitioned to.
+ *      - The vendor driver should take the necessary actions to change the
+ *        device state. After successful transition to a given state, the
+ *        vendor driver should return success on write(device_state, state)
+ *        system call. If the device state transition fails, the vendor driver
+ *        should return an appropriate -errno for the fault condition.
+ *      - On the user application side, if the device state transition fails,
+ *	  that is, if write(device_state, state) returns an error, read
+ *	  device_state again to determine the current state of the device from
+ *	  the vendor driver.
+ *      - The vendor driver should return previous state of the device unless
+ *        the vendor driver has encountered an internal error, in which case
+ *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
+ *      - The user application must use the device reset ioctl to recover the
+ *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
+ *        indicated to be in a valid device state by reading device_state, the
+ *        user application may attempt to transition the device to any valid
+ *        state reachable from the current state or terminate itself.
+ *
+ *      device_state consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
+ *        it indicates the _STOP state. When the device state is changed to
+ *        _STOP, driver should stop the device before write() returns.
+ *      - If bit 1 is set, it indicates the _SAVING state, which means that the
+ *        driver should start gathering device state information that will be
+ *        provided to the VFIO user application to save the device's state.
+ *      - If bit 2 is set, it indicates the _RESUMING state, which means that
+ *        the driver should prepare to resume the device. Data provided through
+ *        the migration region should be used to resume the device.
+ *      Bits 3 - 31 are reserved for future use. To preserve them, the user
+ *      application should perform a read-modify-write operation on this
+ *      field when modifying the specified bits.
+ *
+ *  +------- _RESUMING
+ *  |+------ _SAVING
+ *  ||+----- _RUNNING
+ *  |||
+ *  000b => Device Stopped, not saving or resuming
+ *  001b => Device running, which is the default state
+ *  010b => Stop the device & save the device state, stop-and-copy state
+ *  011b => Device running and save the device state, pre-copy state
+ *  100b => Device stopped and the device state is resuming
+ *  101b => Invalid state
+ *  110b => Error state
+ *  111b => Invalid state
+ *
+ * State transitions:
+ *
+ *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
+ *                (100b)     (001b)     (011b)        (010b)       (000b)
+ * 0. Running or default state
+ *                             |
+ *
+ * 1. Normal Shutdown (optional)
+ *                             |------------------------------------->|
+ *
+ * 2. Save the state or suspend
+ *                             |------------------------->|---------->|
+ *
+ * 3. Save the state during live migration
+ *                             |----------->|------------>|---------->|
+ *
+ * 4. Resuming
+ *                  |<---------|
+ *
+ * 5. Resumed
+ *                  |--------->|
+ *
+ * 0. Default state of VFIO device is _RUNNING when the user application starts.
+ * 1. During normal shutdown of the user application, the user application may
+ *    optionally change the VFIO device state from _RUNNING to _STOP. This
+ *    transition is optional. The vendor driver must support this transition but
+ *    must not require it.
+ * 2. When the user application saves state or suspends the application, the
+ *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
+ *    On state transition from _RUNNING to stop-and-copy, driver must stop the
+ *    device, save the device state and send it to the application through the
+ *    migration region. The sequence to be followed for such transition is given
+ *    below.
+ * 3. In live migration of user application, the state transitions from _RUNNING
+ *    to pre-copy, to stop-and-copy, and to _STOP.
+ *    On state transition from _RUNNING to pre-copy, the driver should start
+ *    gathering the device state while the application is still running and send
+ *    the device state data to application through the migration region.
+ *    On state transition from pre-copy to stop-and-copy, the driver must stop
+ *    the device, save the device state and send it to the user application
+ *    through the migration region.
+ *    Vendor drivers must support the pre-copy state even for implementations
+ *    where no data is provided to the user before the stop-and-copy state. The
+ *    user must not be required to consume all migration data before the device
+ *    transitions to a new state, including the stop-and-copy state.
+ *    The sequence to be followed for above two transitions is given below.
+ * 4. To start the resuming phase, the device state should be transitioned from
+ *    the _RUNNING to the _RESUMING state.
+ *    In the _RESUMING state, the driver should use the device state data
+ *    received through the migration region to resume the device.
+ * 5. After providing saved device data to the driver, the application should
+ *    change the state from _RESUMING to _RUNNING.
+ *
+ * reserved:
+ *      Reads on this field return zero and writes are ignored.
+ *
+ * pending_bytes: (read only)
+ *      The number of pending bytes still to be migrated from the vendor driver.
+ *
+ * data_offset: (read only)
+ *      The user application should read data_offset field from the migration
+ *      region. The user application should read the device data from this
+ *      offset within the migration region during the _SAVING state or write
+ *      the device data during the _RESUMING state. See below for details of
+ *      sequence to be followed.
+ *
+ * data_size: (read/write)
+ *      The user application should read data_size to get the size in bytes of
+ *      the data copied in the migration region during the _SAVING state and
+ *      write the size in bytes of the data copied in the migration region
+ *      during the _RESUMING state.
+ *
+ * The format of the migration region is as follows:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * The structure vfio_device_migration_info is always followed by the data
+ * section in the region, so data_offset will always be nonzero. The offset
+ * from where the data is copied is decided by the kernel driver. The data
+ * section can be trapped, mmapped, or partitioned, depending on how the kernel
+ * driver defines the data section. The data section partition can be defined
+ * as mapped by the sparse mmap capability. If mmapped, data_offset must be
+ * page aligned, whereas initial section which contains the
+ * vfio_device_migration_info structure, might not end at the offset, which is
+ * page aligned. The user is not required to access through mmap regardless
+ * of the capabilities of the region mmap.
+ * The vendor driver should determine whether and how to partition the data
+ * section. The vendor driver should return data_offset accordingly.
+ *
+ * The sequence to be followed while in pre-copy state and stop-and-copy state
+ * is as follows:
+ * a. Read pending_bytes, indicating the start of a new iteration to get device
+ *    data. Repeated read on pending_bytes at this stage should have no side
+ *    effects.
+ *    If pending_bytes == 0, the user application should not iterate to get data
+ *    for that device.
+ *    If pending_bytes > 0, perform the following steps.
+ * b. Read data_offset, indicating that the vendor driver should make data
+ *    available through the data section. The vendor driver should return this
+ *    read operation only after data is available from (region + data_offset)
+ *    to (region + data_offset + data_size).
+ * c. Read data_size, which is the amount of data in bytes available through
+ *    the migration region.
+ *    Read on data_offset and data_size should return the offset and size of
+ *    the current buffer if the user application reads data_offset and
+ *    data_size more than once here.
+ * d. Read data_size bytes of data from (region + data_offset) from the
+ *    migration region.
+ * e. Process the data.
+ * f. Read pending_bytes, which indicates that the data from the previous
+ *    iteration has been read. If pending_bytes > 0, go to step b.
+ *
+ * The user application can transition from the _SAVING|_RUNNING
+ * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
+ * number of pending bytes. The user application should iterate in _SAVING
+ * (stop-and-copy) until pending_bytes is 0.
+ *
+ * The sequence to be followed while _RESUMING device state is as follows:
+ * While data for this device is available, repeat the following steps:
+ * a. Read data_offset from where the user application should write data.
+ * b. Write migration data starting at the migration region + data_offset for
+ *    the length determined by data_size from the migration source.
+ * c. Write data_size, which indicates to the vendor driver that data is
+ *    written in the migration region. Vendor driver must return this write
+ *    operations on consuming data. Vendor driver should apply the
+ *    user-provided migration region data to the device resume state.
+ *
+ * If an error occurs during the above sequences, the vendor driver can return
+ * an error code for next read() or write() operation, which will terminate the
+ * loop. The user application should then take the next necessary action, for
+ * example, failing migration or terminating the user application.
+ *
+ * For the user application, data is opaque. The user application should write
+ * data in the same order as the data is received and the data should be of
+ * same transaction size at the source.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_STOP      (0)
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_VALID(state) \
+	(state & VFIO_DEVICE_STATE_RESUMING ? \
+	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+
+#define VFIO_DEVICE_STATE_IS_ERROR(state) \
+	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
+					      VFIO_DEVICE_STATE_RESUMING))
+
+#define VFIO_DEVICE_STATE_SET_ERROR(state) \
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_SAVING | \
+					     VFIO_DEVICE_STATE_RESUMING)
+
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+};
+
+/*
+ * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
+ * which allows direct access to non-MSIX registers which happened to be within
+ * the same system page.
+ *
+ * Even though the userspace gets direct access to the MSIX data, the existing
+ * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
+ */
+#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
+
+/*
+ * Capability with compressed real address (aka SSA - small system address)
+ * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing
+ * and by the userspace to associate a NVLink bridge with a GPU.
+ *
+ * Deprecated, capability no longer provided
+ */
+#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT	4
+
+struct vfio_region_info_cap_nvlink2_ssatgt {
+	struct vfio_info_cap_header header;
+	__u64 tgt;
+};
+
+/*
+ * Capability with an NVLink link speed. The value is read by
+ * the NVlink2 bridge driver from the bridge's "ibm,nvlink-speed"
+ * property in the device tree. The value is fixed in the hardware
+ * and failing to provide the correct value results in the link
+ * not working with no indication from the driver why.
+ *
+ * Deprecated, capability no longer provided
+ */
+#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD	5
+
+struct vfio_region_info_cap_nvlink2_lnkspd {
+	struct vfio_info_cap_header header;
+	__u32 link_speed;
+	__u32 __pad;
+};
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero count irq blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flags indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are setup as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or tear
+ * down setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__u32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows sparse support for the same on arrays of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be trigger for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctls calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (ie. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__u32	start;
+	__u32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	/*
+	 * Expose VGA regions defined for PCI base class 03, subclass 00.
+	 * This includes I/O port ranges 0x3b0 to 0x3bb and 0x3c0 to 0x3df
+	 * as well as the MMIO range 0xa0000 to 0xbffff.  Each implemented
+	 * range is found at it's identity mapped offset from the region
+	 * offset, for example 0x3b0 is region_info.offset + 0x3b0.  Areas
+	 * between described ranges are unimplemented.
+	 */
+	VFIO_PCI_VGA_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS = 9 /* Fixed user ABI, region indexes >=9 use */
+				 /* device specific cap to define content. */
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_ERR_IRQ_INDEX,
+	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
+/*
+ * The vfio-ccw bus driver makes use of the following fixed region and
+ * IRQ index mapping. Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_CCW_CONFIG_REGION_INDEX,
+	VFIO_CCW_NUM_REGIONS
+};
+
+enum {
+	VFIO_CCW_IO_IRQ_INDEX,
+	VFIO_CCW_CRW_IRQ_INDEX,
+	VFIO_CCW_REQ_IRQ_INDEX,
+	VFIO_CCW_NUM_IRQS
+};
+
+/**
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IORW(VFIO_TYPE, VFIO_BASE + 12,
+ *					      struct vfio_pci_hot_reset_info)
+ *
+ * Return: 0 on success, -errno on failure:
+ *	-enospc = insufficient buffer, -enodev = unsupported for device.
+ */
+struct vfio_pci_dependent_device {
+	__u32	group_id;
+	__u16	segment;
+	__u8	bus;
+	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
+};
+
+struct vfio_pci_hot_reset_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	struct vfio_pci_dependent_device	devices[];
+};
+
+#define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
+ *				    struct vfio_pci_hot_reset)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_pci_hot_reset {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	__s32	group_fds[];
+};
+
+#define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_DEVICE_QUERY_GFX_PLANE - _IOW(VFIO_TYPE, VFIO_BASE + 14,
+ *                                    struct vfio_device_query_gfx_plane)
+ *
+ * Set the drm_plane_type and flags, then retrieve the gfx plane info.
+ *
+ * flags supported:
+ * - VFIO_GFX_PLANE_TYPE_PROBE and VFIO_GFX_PLANE_TYPE_DMABUF are set
+ *   to ask if the mdev supports dma-buf. 0 on support, -EINVAL on no
+ *   support for dma-buf.
+ * - VFIO_GFX_PLANE_TYPE_PROBE and VFIO_GFX_PLANE_TYPE_REGION are set
+ *   to ask if the mdev supports region. 0 on support, -EINVAL on no
+ *   support for region.
+ * - VFIO_GFX_PLANE_TYPE_DMABUF or VFIO_GFX_PLANE_TYPE_REGION is set
+ *   with each call to query the plane info.
+ * - Others are invalid and return -EINVAL.
+ *
+ * Note:
+ * 1. Plane could be disabled by guest. In that case, success will be
+ *    returned with zero-initialized drm_format, size, width and height
+ *    fields.
+ * 2. x_hot/y_hot is set to 0xFFFFFFFF if no hotspot information available
+ *
+ * Return: 0 on success, -errno on other failure.
+ */
+struct vfio_device_gfx_plane_info {
+	__u32 argsz;
+	__u32 flags;
+#define VFIO_GFX_PLANE_TYPE_PROBE (1 << 0)
+#define VFIO_GFX_PLANE_TYPE_DMABUF (1 << 1)
+#define VFIO_GFX_PLANE_TYPE_REGION (1 << 2)
+	/* in */
+	__u32 drm_plane_type;	/* type of plane: DRM_PLANE_TYPE_* */
+	/* out */
+	__u32 drm_format;	/* drm format of plane */
+	__u64 drm_format_mod;   /* tiled mode */
+	__u32 width;	/* width of plane */
+	__u32 height;	/* height of plane */
+	__u32 stride;	/* stride of plane */
+	__u32 size;	/* size of plane in bytes, align on page*/
+	__u32 x_pos;	/* horizontal position of cursor plane */
+	__u32 y_pos;	/* vertical position of cursor plane*/
+	__u32 x_hot;    /* horizontal position of cursor hotspot */
+	__u32 y_hot;    /* vertical position of cursor hotspot */
+	union {
+		__u32 region_index;	/* region index */
+		__u32 dmabuf_id;	/* dma-buf id */
+	};
+};
+
+#define VFIO_DEVICE_QUERY_GFX_PLANE _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+/**
+ * VFIO_DEVICE_GET_GFX_DMABUF - _IOW(VFIO_TYPE, VFIO_BASE + 15, __u32)
+ *
+ * Return a new dma-buf file descriptor for an exposed guest framebuffer
+ * described by the provided dmabuf_id. The dmabuf_id is returned from VFIO_
+ * DEVICE_QUERY_GFX_PLANE as a token of the exposed guest framebuffer.
+ */
+
+#define VFIO_DEVICE_GET_GFX_DMABUF _IO(VFIO_TYPE, VFIO_BASE + 15)
+
+/**
+ * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 16,
+ *                              struct vfio_device_ioeventfd)
+ *
+ * Perform a write to the device at the specified device fd offset, with
+ * the specified data and width when the provided eventfd is triggered.
+ * vfio bus drivers may not support this for all regions, for all widths,
+ * or at all.  vfio-pci currently only enables support for BAR regions,
+ * excluding the MSI-X vector table.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_ioeventfd {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_IOEVENTFD_8		(1 << 0) /* 1-byte write */
+#define VFIO_DEVICE_IOEVENTFD_16	(1 << 1) /* 2-byte write */
+#define VFIO_DEVICE_IOEVENTFD_32	(1 << 2) /* 4-byte write */
+#define VFIO_DEVICE_IOEVENTFD_64	(1 << 3) /* 8-byte write */
+#define VFIO_DEVICE_IOEVENTFD_SIZE_MASK	(0xf)
+	__u64	offset;			/* device fd offset of write */
+	__u64	data;			/* data to be written */
+	__s32	fd;			/* -1 for de-assignment */
+};
+
+#define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/**
+ * VFIO_DEVICE_FEATURE - _IORW(VFIO_TYPE, VFIO_BASE + 17,
+ *			       struct vfio_device_feature)
+ *
+ * Get, set, or probe feature data of the device.  The feature is selected
+ * using the FEATURE_MASK portion of the flags field.  Support for a feature
+ * can be probed by setting both the FEATURE_MASK and PROBE bits.  A probe
+ * may optionally include the GET and/or SET bits to determine read vs write
+ * access of the feature respectively.  Probing a feature will return success
+ * if the feature is supported and all of the optionally indicated GET/SET
+ * methods are supported.  The format of the data portion of the structure is
+ * specific to the given feature.  The data portion is not required for
+ * probing.  GET and SET are mutually exclusive, except for use with PROBE.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+struct vfio_device_feature {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FEATURE_MASK	(0xffff) /* 16-bit feature index */
+#define VFIO_DEVICE_FEATURE_GET		(1 << 16) /* Get feature into data[] */
+#define VFIO_DEVICE_FEATURE_SET		(1 << 17) /* Set feature from data[] */
+#define VFIO_DEVICE_FEATURE_PROBE	(1 << 18) /* Probe feature support */
+	__u8	data[];
+};
+
+#define VFIO_DEVICE_FEATURE		_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/*
+ * Provide support for setting a PCI VF Token, which is used as a shared
+ * secret between PF and VF drivers.  This feature may only be set on a
+ * PCI SR-IOV PF when SR-IOV is enabled on the PF and there are no existing
+ * open VFs.  Data provided when setting this feature is a 16-byte array
+ * (__u8 b[16]), representing a UUID.
+ */
+#define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
+
+/* -------- API for Type1 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_info)
+ *
+ * Retrieve information about the IOMMU object. Fills in provided
+ * struct vfio_iommu_info. Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_type1_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
+	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
+	__u32   cap_offset;	/* Offset within info struct of first cap */
+};
+
+/*
+ * The IOVA capability allows to report the valid IOVA range(s)
+ * excluding any non-relaxable reserved regions exposed by
+ * devices attached to the container. Any DMA map attempt
+ * outside the valid iova range will return error.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE  1
+
+struct vfio_iova_range {
+	__u64	start;
+	__u64	end;
+};
+
+struct vfio_iommu_type1_info_cap_iova_range {
+	struct	vfio_info_cap_header header;
+	__u32	nr_iovas;
+	__u32	reserved;
+	struct	vfio_iova_range iova_ranges[];
+};
+
+/*
+ * The migration capability allows to report supported features for migration.
+ *
+ * The structures below define version 1 of this capability.
+ *
+ * The existence of this capability indicates that IOMMU kernel driver supports
+ * dirty page logging.
+ *
+ * pgsize_bitmap: Kernel driver returns bitmap of supported page sizes for dirty
+ * page logging.
+ * max_dirty_bitmap_size: Kernel driver returns maximum supported dirty bitmap
+ * size in bytes that can be used by user applications when getting the dirty
+ * bitmap.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION  2
+
+struct vfio_iommu_type1_info_cap_migration {
+	struct	vfio_info_cap_header header;
+	__u32	flags;
+	__u64	pgsize_bitmap;
+	__u64	max_dirty_bitmap_size;		/* in bytes */
+};
+
+/*
+ * The DMA available capability allows to report the current number of
+ * simultaneously outstanding DMA mappings that are allowed.
+ *
+ * The structure below defines version 1 of this capability.
+ *
+ * avail: specifies the current number of outstanding DMA mappings allowed.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL 3
+
+struct vfio_iommu_type1_info_dma_avail {
+	struct	vfio_info_cap_header header;
+	__u32	avail;
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
+ */
+struct vfio_iommu_type1_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
+	__u64	vaddr;				/* Process virtual address */
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+struct vfio_bitmap {
+	__u64        pgsize;	/* page size for bitmap in bytes */
+	__u64        size;	/* in bytes */
+	__u64 __user *data;	/* one bit per page */
+};
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOWR(VFIO_TYPE, VFIO_BASE + 14,
+ *							struct vfio_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct vfio_dma_unmap.
+ * Caller sets argsz.  The actual unmapped size is returned in the size
+ * field.  No guarantee is made to the user that arbitrary unmaps of iova
+ * or size different from those used in the original mapping call will
+ * succeed.
+ *
+ * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
+ * before unmapping IO virtual addresses. When this flag is set, the user must
+ * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
+ * memory via vfio_bitmap.data and its size in the vfio_bitmap.size field.
+ * A bit in the bitmap represents one page, of user provided page size in
+ * vfio_bitmap.pgsize field, consecutively starting from iova offset. Bit set
+ * indicates that the page at that offset from iova is dirty. A Bitmap of the
+ * pages in the range of unmapped size is returned in the user-provided
+ * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This
+ * cannot be combined with the get-dirty-bitmap flag.
+ */
+struct vfio_iommu_type1_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+	__u8    data[];
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+/*
+ * IOCTLs to enable/disable IOMMU container usage.
+ * No parameters are supported.
+ */
+#define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/**
+ * VFIO_IOMMU_DIRTY_PAGES - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
+ *                                     struct vfio_iommu_type1_dirty_bitmap)
+ * IOCTL is used for dirty pages logging.
+ * Caller should set flag depending on which operation to perform, details as
+ * below:
+ *
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_START flag set, instructs
+ * the IOMMU driver to log pages that are dirtied or potentially dirtied by
+ * the device; designed to be used when a migration is in progress. Dirty pages
+ * are logged until logging is disabled by user application by calling the IOCTL
+ * with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag.
+ *
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP flag set, instructs
+ * the IOMMU driver to stop logging dirtied pages.
+ *
+ * Calling the IOCTL with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP flag set
+ * returns the dirty pages bitmap for IOMMU container for a given IOVA range.
+ * The user must specify the IOVA range and the pgsize through the structure
+ * vfio_iommu_type1_dirty_bitmap_get in the data[] portion. This interface
+ * supports getting a bitmap of the smallest supported pgsize only and can be
+ * modified in future to get a bitmap of any specified supported pgsize. The
+ * user must provide a zeroed memory area for the bitmap memory and specify its
+ * size in bitmap.size. One bit is used to represent one page consecutively
+ * starting from iova offset. The user should provide page size in bitmap.pgsize
+ * field. A bit set in the bitmap indicates that the page at that offset from
+ * iova is dirty. The caller must set argsz to a value including the size of
+ * structure vfio_iommu_type1_dirty_bitmap_get, but excluding the size of the
+ * actual bitmap. If dirty pages logging is not enabled, an error will be
+ * returned.
+ *
+ * Only one of the flags _START, _STOP and _GET may be specified at a time.
+ *
+ */
+struct vfio_iommu_type1_dirty_bitmap {
+	__u32        argsz;
+	__u32        flags;
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_START	(1 << 0)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP	(1 << 1)
+#define VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP	(1 << 2)
+	__u8         data[];
+};
+
+struct vfio_iommu_type1_dirty_bitmap_get {
+	__u64              iova;	/* IO virtual address */
+	__u64              size;	/* Size of iova range */
+	struct vfio_bitmap bitmap;
+};
+
+#define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
+
+/*
+ * The SPAPR TCE DDW info struct provides the information about
+ * the details of Dynamic DMA window capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+	__u64 pgsizes;			/* Bitmap of supported page sizes */
+	__u32 max_dynamic_windows_supported;
+	__u32 levels;
+};
+
+/*
+ * The SPAPR TCE info struct provides the information about the PCI bus
+ * address ranges available for DMA, these values are programmed into
+ * the hardware so the guest has to know that information.
+ *
+ * The DMA 32 bit window start is an absolute PCI bus address.
+ * The IOVA address passed via map/unmap ioctls are absolute PCI bus
+ * addresses too so the window works as a filter rather than an offset
+ * for IOVA addresses.
+ *
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ *   (DDW) support is present. @ddw is only supported when DDW is present.
+ */
+struct vfio_iommu_spapr_tce_info {
+	__u32 argsz;
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
+	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
+	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	struct vfio_iommu_spapr_tce_ddw_info ddw;
+};
+
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/*
+ * EEH PE operation struct provides ways to:
+ * - enable/disable EEH functionality;
+ * - unfreeze IO/DMA for frozen PE;
+ * - read PE state;
+ * - reset PE;
+ * - configure PE;
+ * - inject EEH error.
+ */
+struct vfio_eeh_pe_err {
+	__u32 type;
+	__u32 func;
+	__u64 addr;
+	__u64 mask;
+};
+
+struct vfio_eeh_pe_op {
+	__u32 argsz;
+	__u32 flags;
+	__u32 op;
+	union {
+		struct vfio_eeh_pe_err err;
+	};
+};
+
+#define VFIO_EEH_PE_DISABLE		0	/* Disable EEH functionality */
+#define VFIO_EEH_PE_ENABLE		1	/* Enable EEH functionality  */
+#define VFIO_EEH_PE_UNFREEZE_IO		2	/* Enable IO for frozen PE   */
+#define VFIO_EEH_PE_UNFREEZE_DMA	3	/* Enable DMA for frozen PE  */
+#define VFIO_EEH_PE_GET_STATE		4	/* PE state retrieval        */
+#define  VFIO_EEH_PE_STATE_NORMAL	0	/* PE in functional state    */
+#define  VFIO_EEH_PE_STATE_RESET	1	/* PE reset in progress      */
+#define  VFIO_EEH_PE_STATE_STOPPED	2	/* Stopped DMA and IO        */
+#define  VFIO_EEH_PE_STATE_STOPPED_DMA	4	/* Stopped DMA only          */
+#define  VFIO_EEH_PE_STATE_UNAVAIL	5	/* State unavailable         */
+#define VFIO_EEH_PE_RESET_DEACTIVATE	5	/* Deassert PE reset         */
+#define VFIO_EEH_PE_RESET_HOT		6	/* Assert hot reset          */
+#define VFIO_EEH_PE_RESET_FUNDAMENTAL	7	/* Assert fundamental reset  */
+#define VFIO_EEH_PE_CONFIGURE		8	/* PE configuration          */
+#define VFIO_EEH_PE_INJECT_ERR		9	/* Inject EEH error          */
+
+#define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * It allocates and returns an offset on a PCI bus of the new DMA window.
+ */
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u32 __resv1;
+	__u64 window_size;
+	__u32 levels;
+	__u32 __resv2;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/* ***************************************************************** */
+
+#endif /* _UAPIVFIO_H */
-- 
1.8.3.1


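For reference, the dirty-bitmap unmap flow documented in this header can be
driven from userspace roughly as follows (a minimal sketch: it assumes a
configured type1 container fd in container_fd, an existing mapping of size
bytes at iova, and a 4K IOMMU page size; error handling is omitted):

	struct vfio_iommu_type1_dma_unmap *unmap;
	struct vfio_bitmap *bitmap;
	uint64_t npages = size / 4096;
	size_t argsz = sizeof(*unmap) + sizeof(*bitmap);

	unmap = calloc(1, argsz);
	unmap->argsz = argsz;
	unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
	unmap->iova = iova;
	unmap->size = size;

	bitmap = (struct vfio_bitmap *)unmap->data;
	bitmap->pgsize = 4096;			/* one bit per 4K page */
	bitmap->size = (npages + 7) / 8;	/* bytes of zero-allocated bitmap */
	bitmap->data = calloc(1, bitmap->size);

	if (!ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, unmap))
		/* set bits in bitmap->data mark dirty pages in [iova, iova + size) */;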

* [PATCH rdma-core 02/27] mlx5: Introduce mlx5dv_get_vfio_device_list()
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 01/27] Update kernel headers Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio Yishai Hadas
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Introduce the mlx5dv_get_vfio_device_list() API for getting a list of
mlx5 devices which can be used over VFIO.

A man page with the expected usage was added.
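
A typical flow is to describe the device by its PCI name, take the single
returned entry and open it with the regular verbs API, for example (a minimal
sketch; the PCI name below is an arbitrary placeholder and error handling is
omitted):

	struct mlx5dv_vfio_context_attr attr = {
		.pci_name = "0000:3b:00.0",	/* example PCI name */
		.flags = 0,
		.comp_mask = 0,
	};
	struct ibv_device **list;
	struct ibv_context *ctx;

	list = mlx5dv_get_vfio_device_list(&attr);
	if (!list)
		return NULL;	/* errno is set */

	/* open the device before freeing the list (see the man page NOTES) */
	ctx = ibv_open_device(list[0]);
	ibv_free_device_list(list);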

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 debian/ibverbs-providers.symbols                   |   2 +
 providers/mlx5/CMakeLists.txt                      |   3 +-
 providers/mlx5/libmlx5.map                         |   5 +
 providers/mlx5/man/CMakeLists.txt                  |   1 +
 .../mlx5/man/mlx5dv_get_vfio_device_list.3.md      |  64 +++++++
 providers/mlx5/mlx5.c                              |   4 +-
 providers/mlx5/mlx5.h                              |   1 +
 providers/mlx5/mlx5_vfio.c                         | 190 +++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h                         |  27 +++
 providers/mlx5/mlx5dv.h                            |  13 ++
 10 files changed, 307 insertions(+), 3 deletions(-)
 create mode 100644 providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md
 create mode 100644 providers/mlx5/mlx5_vfio.c
 create mode 100644 providers/mlx5/mlx5_vfio.h

diff --git a/debian/ibverbs-providers.symbols b/debian/ibverbs-providers.symbols
index 294832b..64e29b1 100644
--- a/debian/ibverbs-providers.symbols
+++ b/debian/ibverbs-providers.symbols
@@ -28,6 +28,7 @@ libmlx5.so.1 ibverbs-providers #MINVER#
  MLX5_1.18@MLX5_1.18 34
  MLX5_1.19@MLX5_1.19 35
  MLX5_1.20@MLX5_1.20 36
+ MLX5_1.21@MLX5_1.21 37
  mlx5dv_init_obj@MLX5_1.0 13
  mlx5dv_init_obj@MLX5_1.2 15
  mlx5dv_query_device@MLX5_1.0 13
@@ -133,6 +134,7 @@ libmlx5.so.1 ibverbs-providers #MINVER#
  mlx5dv_map_ah_to_qp@MLX5_1.20 36
  mlx5dv_qp_cancel_posted_send_wrs@MLX5_1.20 36
  _mlx5dv_mkey_check@MLX5_1.20 36
+ mlx5dv_get_vfio_device_list@MLX5_1.21 37
 libefa.so.1 ibverbs-providers #MINVER#
 * Build-Depends-Package: libibverbs-dev
  EFA_1.0@EFA_1.0 24
diff --git a/providers/mlx5/CMakeLists.txt b/providers/mlx5/CMakeLists.txt
index 69abdd1..45e397e 100644
--- a/providers/mlx5/CMakeLists.txt
+++ b/providers/mlx5/CMakeLists.txt
@@ -11,7 +11,7 @@ if (MLX5_MW_DEBUG)
 endif()
 
 rdma_shared_provider(mlx5 libmlx5.map
-  1 1.20.${PACKAGE_VERSION}
+  1 1.21.${PACKAGE_VERSION}
   buf.c
   cq.c
   dbrec.c
@@ -30,6 +30,7 @@ rdma_shared_provider(mlx5 libmlx5.map
   dr_table.c
   dr_send.c
   mlx5.c
+  mlx5_vfio.c
   qp.c
   srq.c
   verbs.c
diff --git a/providers/mlx5/libmlx5.map b/providers/mlx5/libmlx5.map
index af7541d..3e8a4d8 100644
--- a/providers/mlx5/libmlx5.map
+++ b/providers/mlx5/libmlx5.map
@@ -189,3 +189,8 @@ MLX5_1.20 {
 		mlx5dv_qp_cancel_posted_send_wrs;
 		_mlx5dv_mkey_check;
 } MLX5_1.19;
+
+MLX5_1.21 {
+        global:
+		mlx5dv_get_vfio_device_list;
+} MLX5_1.20;
diff --git a/providers/mlx5/man/CMakeLists.txt b/providers/mlx5/man/CMakeLists.txt
index bb6499d..91aebed 100644
--- a/providers/mlx5/man/CMakeLists.txt
+++ b/providers/mlx5/man/CMakeLists.txt
@@ -22,6 +22,7 @@ rdma_man_pages(
   mlx5dv_dump.3.md
   mlx5dv_flow_action_esp.3.md
   mlx5dv_get_clock_info.3
+  mlx5dv_get_vfio_device_list.3.md
   mlx5dv_init_obj.3
   mlx5dv_is_supported.3.md
   mlx5dv_map_ah_to_qp.3.md
diff --git a/providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md b/providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md
new file mode 100644
index 0000000..13c8e63
--- /dev/null
+++ b/providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md
@@ -0,0 +1,64 @@
+---
+layout: page
+title: mlx5dv_get_vfio_device_list
+section: 3
+tagline: Verbs
+---
+
+# NAME
+
+mlx5dv_get_vfio_device_list - Get list of available devices to be used over VFIO
+
+# SYNOPSIS
+
+```c
+#include <infiniband/mlx5dv.h>
+
+struct ibv_device **
+mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr);
+```
+
+# DESCRIPTION
+
+Returns a NULL-terminated array of devices based on input *attr*.
+
+# ARGUMENTS
+
+*attr*
+:	Describes the VFIO devices to return in the list.
+
+## *attr* argument
+
+```c
+struct mlx5dv_vfio_context_attr {
+	const char *pci_name;
+	uint32_t flags;
+	uint64_t comp_mask;
+};
+```
+*pci_name*
+:      The PCI name of the required device.
+
+*flags*
+:       A bitwise OR of the various values described below.
+
+        *MLX5DV_VFIO_CTX_FLAGS_INIT_LINK_DOWN*:
+        Upon device initialization, the link should stay down.
+
+*comp_mask*
+:       Bitmask specifying what fields in the structure are valid.
+
+# RETURN VALUE
+Returns the array of the matching devices, or sets errno and returns NULL if the request fails.
+
+# NOTES
+Client code should open all the devices it intends to use with ibv_open_device() before calling ibv_free_device_list(). Once it frees the array with ibv_free_device_list(), it will be able to
+use only the open devices; pointers to unopened devices will no longer be valid.
+
+# SEE ALSO
+
+*ibv_open_device(3)* *ibv_free_device_list(3)*
+
+# AUTHOR
+
+Yishai Hadas <yishaih@nvidia.com>
diff --git a/providers/mlx5/mlx5.c b/providers/mlx5/mlx5.c
index e172b9d..46d7748 100644
--- a/providers/mlx5/mlx5.c
+++ b/providers/mlx5/mlx5.c
@@ -62,7 +62,7 @@ static void mlx5_free_context(struct ibv_context *ibctx);
 #endif
 
 #define HCA(v, d) VERBS_PCI_MATCH(PCI_VENDOR_ID_##v, d, NULL)
-static const struct verbs_match_ent hca_table[] = {
+const struct verbs_match_ent mlx5_hca_table[] = {
 	VERBS_DRIVER_ID(RDMA_DRIVER_MLX5),
 	HCA(MELLANOX, 0x1011),	/* MT4113 Connect-IB */
 	HCA(MELLANOX, 0x1012),	/* Connect-IB Virtual Function */
@@ -2410,7 +2410,7 @@ static const struct verbs_device_ops mlx5_dev_ops = {
 	.name = "mlx5",
 	.match_min_abi_version = MLX5_UVERBS_MIN_ABI_VERSION,
 	.match_max_abi_version = MLX5_UVERBS_MAX_ABI_VERSION,
-	.match_table = hca_table,
+	.match_table = mlx5_hca_table,
 	.alloc_device = mlx5_device_alloc,
 	.uninit_device = mlx5_uninit_device,
 	.alloc_context = mlx5_alloc_context,
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index ac2f88c..3862007 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -94,6 +94,7 @@ enum {
 
 extern uint32_t mlx5_debug_mask;
 extern int mlx5_freeze_on_error_cqe;
+extern const struct verbs_match_ent mlx5_hca_table[];
 
 #ifdef MLX5_DEBUG
 #define mlx5_dbg(fp, mask, format, arg...)				\
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
new file mode 100644
index 0000000..69c7662
--- /dev/null
+++ b/providers/mlx5/mlx5_vfio.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#define _GNU_SOURCE
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <string.h>
+#include <sys/param.h>
+
+#include "mlx5dv.h"
+#include "mlx5_vfio.h"
+#include "mlx5.h"
+
+static struct verbs_context *
+mlx5_vfio_alloc_context(struct ibv_device *ibdev,
+			int cmd_fd, void *private_data)
+{
+	return NULL;
+}
+
+static void mlx5_vfio_uninit_device(struct verbs_device *verbs_device)
+{
+	struct mlx5_vfio_device *dev = to_mvfio_dev(&verbs_device->device);
+
+	free(dev->pci_name);
+	free(dev);
+}
+
+static const struct verbs_device_ops mlx5_vfio_dev_ops = {
+	.name = "mlx5_vfio",
+	.alloc_context = mlx5_vfio_alloc_context,
+	.uninit_device = mlx5_vfio_uninit_device,
+};
+
+static bool is_mlx5_pci(const char *pci_path)
+{
+	const struct verbs_match_ent *ent;
+	uint16_t vendor_id, device_id;
+	char pci_info_path[256];
+	char buff[128];
+	int fd;
+
+	snprintf(pci_info_path, sizeof(pci_info_path), "%s/vendor", pci_path);
+	fd = open(pci_info_path, O_RDONLY);
+	if (fd < 0)
+		return false;
+
+	if (read(fd, buff, sizeof(buff)) <= 0)
+		goto err;
+
+	vendor_id = strtoul(buff, NULL, 0);
+	close(fd);
+
+	snprintf(pci_info_path, sizeof(pci_info_path), "%s/device", pci_path);
+	fd = open(pci_info_path, O_RDONLY);
+	if (fd < 0)
+		return false;
+
+	if (read(fd, buff, sizeof(buff)) <= 0)
+		goto err;
+
+	device_id = strtoul(buff, NULL, 0);
+	close(fd);
+
+	for (ent = mlx5_hca_table; ent->kind != VERBS_MATCH_SENTINEL; ent++) {
+		if (ent->kind != VERBS_MATCH_PCI)
+			continue;
+		if (ent->device == device_id && ent->vendor == vendor_id)
+			return true;
+	}
+
+	return false;
+
+err:
+	close(fd);
+	return false;
+}
+
+static int mlx5_vfio_get_iommu_group_id(const char *pci_name)
+{
+	int seg, bus, slot, func;
+	int ret, groupid;
+	char path[128], iommu_group_path[128], *group_name;
+	struct stat st;
+	ssize_t len;
+
+	ret = sscanf(pci_name, "%04x:%02x:%02x.%d", &seg, &bus, &slot, &func);
+	if (ret != 4)
+		return -1;
+
+	snprintf(path, sizeof(path),
+		 "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
+		 seg, bus, slot, func);
+
+	ret = stat(path, &st);
+	if (ret < 0)
+		return -1;
+
+	if (!is_mlx5_pci(path))
+		return -1;
+
+	strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+	len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
+	if (len <= 0)
+		return -1;
+
+	iommu_group_path[len] = 0;
+	group_name = basename(iommu_group_path);
+
+	if (sscanf(group_name, "%d", &groupid) != 1)
+		return -1;
+
+	snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
+	ret = stat(path, &st);
+	if (ret < 0)
+		return -1;
+
+	return groupid;
+}
+
+static int mlx5_vfio_get_handle(struct mlx5_vfio_device *vfio_dev,
+			 struct mlx5dv_vfio_context_attr *attr)
+{
+	int iommu_group;
+
+	iommu_group = mlx5_vfio_get_iommu_group_id(attr->pci_name);
+	if (iommu_group < 0)
+		return -1;
+
+	sprintf(vfio_dev->vfio_path, "/dev/vfio/%d", iommu_group);
+	vfio_dev->pci_name = strdup(attr->pci_name);
+
+	return 0;
+}
+
+struct ibv_device **
+mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr)
+{
+	struct mlx5_vfio_device *vfio_dev;
+	struct ibv_device **list = NULL;
+	int err;
+
+	if (!check_comp_mask(attr->comp_mask, 0) ||
+	    !check_comp_mask(attr->flags, MLX5DV_VFIO_CTX_FLAGS_INIT_LINK_DOWN)) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	list = calloc(1, sizeof(struct ibv_device *));
+	if (!list) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	vfio_dev = calloc(1, sizeof(*vfio_dev));
+	if (!vfio_dev) {
+		errno = ENOMEM;
+		goto end;
+	}
+
+	vfio_dev->vdev.ops = &mlx5_vfio_dev_ops;
+	atomic_init(&vfio_dev->vdev.refcount, 1);
+
+	/* Find the vfio handle for attrs, store in mlx5_vfio_device */
+	err = mlx5_vfio_get_handle(vfio_dev, attr);
+	if (err)
+		goto err_get;
+
+	vfio_dev->flags = attr->flags;
+	vfio_dev->page_size = sysconf(_SC_PAGESIZE);
+
+	list[0] = &vfio_dev->vdev.device;
+	return list;
+
+err_get:
+	free(vfio_dev);
+end:
+	free(list);
+	return NULL;
+}
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
new file mode 100644
index 0000000..6ba4254
--- /dev/null
+++ b/providers/mlx5/mlx5_vfio.h
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef MLX5_VFIO_H
+#define MLX5_VFIO_H
+
+#include <stddef.h>
+#include <stdio.h>
+
+#include <infiniband/driver.h>
+
+struct mlx5_vfio_device {
+	struct verbs_device vdev;
+	char *pci_name;
+	char vfio_path[IBV_SYSFS_PATH_MAX];
+	int page_size;
+	uint32_t flags;
+};
+
+static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
+{
+	return container_of(ibdev, struct mlx5_vfio_device, vdev.device);
+}
+
+#endif
diff --git a/providers/mlx5/mlx5dv.h b/providers/mlx5/mlx5dv.h
index 2eba232..e657527 100644
--- a/providers/mlx5/mlx5dv.h
+++ b/providers/mlx5/mlx5dv.h
@@ -1474,6 +1474,19 @@ struct mlx5dv_context_attr {
 
 bool mlx5dv_is_supported(struct ibv_device *device);
 
+enum mlx5dv_vfio_context_attr_flags {
+	MLX5DV_VFIO_CTX_FLAGS_INIT_LINK_DOWN = 1 << 0,
+};
+
+struct mlx5dv_vfio_context_attr {
+	const char *pci_name;
+	uint32_t flags; /* Use enum mlx5dv_vfio_context_attr_flags */
+	uint64_t comp_mask;
+};
+
+struct ibv_device **
+mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr);
+
 struct ibv_context *
 mlx5dv_open_device(struct ibv_device *device, struct mlx5dv_context_attr *attr);
 
-- 
1.8.3.1



* [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 01/27] Update kernel headers Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 02/27] mlx5: Introduce mlx5dv_get_vfio_device_list() Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:51   ` Leon Romanovsky
  2021-07-20  8:16 ` [PATCH rdma-core 04/27] util: Add interval_set support Yishai Hadas
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Maor Gottlieb <maorg@nvidia.com>

Usage will be introduced in subsequent patches.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
 providers/mlx5/mlx5.h |  4 ++++
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/providers/mlx5/mlx5.c b/providers/mlx5/mlx5.c
index 46d7748..1abaa8c 100644
--- a/providers/mlx5/mlx5.c
+++ b/providers/mlx5/mlx5.c
@@ -583,7 +583,7 @@ static int get_total_uuars(int page_size)
 	return size;
 }
 
-static void open_debug_file(struct mlx5_context *ctx)
+void mlx5_open_debug_file(FILE **dbg_fp)
 {
 	char *env;
 	FILE *default_dbg_fp = NULL;
@@ -594,25 +594,25 @@ static void open_debug_file(struct mlx5_context *ctx)
 
 	env = getenv("MLX5_DEBUG_FILE");
 	if (!env) {
-		ctx->dbg_fp = default_dbg_fp;
+		*dbg_fp = default_dbg_fp;
 		return;
 	}
 
-	ctx->dbg_fp = fopen(env, "aw+");
-	if (!ctx->dbg_fp) {
-		ctx->dbg_fp = default_dbg_fp;
-		mlx5_err(ctx->dbg_fp, "Failed opening debug file %s\n", env);
+	*dbg_fp = fopen(env, "aw+");
+	if (!*dbg_fp) {
+		*dbg_fp = default_dbg_fp;
+		mlx5_err(*dbg_fp, "Failed opening debug file %s\n", env);
 		return;
 	}
 }
 
-static void close_debug_file(struct mlx5_context *ctx)
+void mlx5_close_debug_file(FILE *dbg_fp)
 {
-	if (ctx->dbg_fp && ctx->dbg_fp != stderr)
-		fclose(ctx->dbg_fp);
+	if (dbg_fp && dbg_fp != stderr)
+		fclose(dbg_fp);
 }
 
-static void set_debug_mask(void)
+void mlx5_set_debug_mask(void)
 {
 	char *env;
 
@@ -2036,7 +2036,7 @@ static int get_uar_info(struct mlx5_device *mdev,
 
 static void mlx5_uninit_context(struct mlx5_context *context)
 {
-	close_debug_file(context);
+	mlx5_close_debug_file(context->dbg_fp);
 
 	verbs_uninit_context(&context->ibv_ctx);
 	free(context);
@@ -2056,8 +2056,8 @@ static struct mlx5_context *mlx5_init_context(struct ibv_device *ibdev,
 	if (!context)
 		return NULL;
 
-	open_debug_file(context);
-	set_debug_mask();
+	mlx5_open_debug_file(&context->dbg_fp);
+	mlx5_set_debug_mask();
 	set_freeze_on_error();
 	if (gethostname(context->hostname, sizeof(context->hostname)))
 		strcpy(context->hostname, "host_unknown");
@@ -2377,7 +2377,7 @@ static void mlx5_free_context(struct ibv_context *ibctx)
 		       page_size);
 	if (context->clock_info_page)
 		munmap((void *)context->clock_info_page, page_size);
-	close_debug_file(context);
+	mlx5_close_debug_file(context->dbg_fp);
 	clean_dyn_uars(ibctx);
 	reserved_qpn_blks_free(context);
 
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index 3862007..7436bc8 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -992,6 +992,10 @@ static inline struct mlx5_flow *to_mflow(struct ibv_flow *flow_id)
 
 bool is_mlx5_dev(struct ibv_device *device);
 
+void mlx5_open_debug_file(FILE **dbg_fp);
+void mlx5_close_debug_file(FILE *dbg_fp);
+void mlx5_set_debug_mask(void);
+
 int mlx5_alloc_buf(struct mlx5_buf *buf, size_t size, int page_size);
 void mlx5_free_buf(struct mlx5_buf *buf);
 int mlx5_alloc_buf_contig(struct mlx5_context *mctx, struct mlx5_buf *buf,
-- 
1.8.3.1



* [PATCH rdma-core 04/27] util: Add interval_set support
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (2 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 05/27] verbs: Enable verbs_open_device() to work over non sysfs devices Yishai Hadas
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Mark Zhang <markzhang@nvidia.com>

Add interval_set functionality to support range management.

This functionality makes it possible to insert and get valid ranges of
values, and internally maintains their contiguity.

This will be used in downstream patches of this series to set/get
valid IOVA ranges in this data structure.
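
For instance, a user of this API could register the usable ranges and then
carve naturally aligned chunks out of them, along these lines (a minimal
sketch with arbitrary values):

	struct iset *set = iset_create();
	uint64_t start;

	/* two contiguous ranges are merged into [0x10000, 0x30000) */
	iset_insert_range(set, 0x10000, 0x10000);
	iset_insert_range(set, 0x20000, 0x10000);

	/* length must be a power of two; the returned start is aligned to it */
	if (!iset_alloc_range(set, 0x1000, &start)) {
		/* [start, start + 0x1000) is now reserved and 4K aligned */
	}

	iset_destroy(set);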

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 util/CMakeLists.txt |   2 +
 util/interval_set.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/interval_set.h |  77 +++++++++++++++++++
 3 files changed, 287 insertions(+)
 create mode 100644 util/interval_set.c
 create mode 100644 util/interval_set.h

diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index e8646bf..d8a66be 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -1,6 +1,7 @@
 publish_internal_headers(util
   cl_qmap.h
   compiler.h
+  interval_set.h
   node_name_map.h
   rdma_nl.h
   symver.h
@@ -9,6 +10,7 @@ publish_internal_headers(util
 
 set(C_FILES
   cl_map.c
+  interval_set.c
   node_name_map.c
   open_cdev.c
   rdma_nl.c
diff --git a/util/interval_set.c b/util/interval_set.c
new file mode 100644
index 0000000..9fb9bde
--- /dev/null
+++ b/util/interval_set.c
@@ -0,0 +1,208 @@
+/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file */
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdlib.h>
+
+#include <ccan/list.h>
+#include <util/interval_set.h>
+#include <util/util.h>
+
+struct iset {
+	struct list_head head;
+	pthread_mutex_t lock;
+};
+
+struct iset_range {
+	struct list_node entry;
+	uint64_t start;
+	uint64_t length;
+};
+
+struct iset *iset_create(void)
+{
+	struct iset *iset;
+
+	iset = calloc(1, sizeof(*iset));
+	if (!iset) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	pthread_mutex_init(&iset->lock, NULL);
+	list_head_init(&iset->head);
+	return iset;
+}
+
+void iset_destroy(struct iset *iset)
+{
+	struct iset_range *range, *tmp;
+
+	list_for_each_safe(&iset->head, range, tmp, entry)
+		free(range);
+
+	free(iset);
+}
+
+static int
+range_overlap(uint64_t s1, uint64_t len1, uint64_t s2, uint64_t len2)
+{
+	if (((s1 < s2) && (s1 + len1 - 1 < s2)) ||
+	    ((s1 > s2) && (s1 > s2 + len2 - 1)))
+		return 0;
+
+	return 1;
+}
+
+static struct iset_range *create_range(uint64_t start, uint64_t length)
+{
+	struct iset_range *range;
+
+	range = calloc(1, sizeof(*range));
+	if (!range) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	range->start = start;
+	range->length = length;
+	return range;
+}
+
+static void delete_range(struct iset_range *r)
+{
+	list_del(&r->entry);
+	free(r);
+}
+
+static bool check_do_combine(struct iset *iset,
+			     struct iset_range *p, struct iset_range *n,
+			     uint64_t start, uint64_t length)
+{
+	bool combined2prev = false, combined2next = false;
+
+	if (p && (p->start + p->length == start)) {
+		p->length += length;
+		combined2prev = true;
+	}
+
+	if (n && (start + length == n->start)) {
+		if (combined2prev) {
+			p->length += n->length;
+			delete_range(n);
+		} else {
+			n->start = start;
+			n->length += length;
+		}
+		combined2next = true;
+	}
+
+	return combined2prev || combined2next;
+}
+
+int iset_insert_range(struct iset *iset, uint64_t start, uint64_t length)
+{
+	struct iset_range *prev = NULL, *r, *rnew;
+	bool found = false, combined;
+	int ret = 0;
+
+	if (!length || (start + length - 1 < start)) {
+		errno = EINVAL;
+		return errno;
+	}
+
+	pthread_mutex_lock(&iset->lock);
+	list_for_each(&iset->head, r, entry) {
+		if (range_overlap(r->start, r->length, start, length)) {
+			errno = EINVAL;
+			ret = errno;
+			goto out;
+		}
+
+		if (r->start > start) {
+			found = true;
+			break;
+		}
+
+		prev = r;
+	}
+
+	combined = check_do_combine(iset, prev, found ? r : NULL,
+				    start, length);
+	if (!combined) {
+		rnew = create_range(start, length);
+		if (!rnew) {
+			ret = errno;
+			goto out;
+		}
+
+		if (!found)
+			list_add_tail(&iset->head, &rnew->entry);
+		else
+			list_add_before(&iset->head, &r->entry, &rnew->entry);
+	}
+
+out:
+	pthread_mutex_unlock(&iset->lock);
+	return ret;
+}
+
+static int power_of_two(uint64_t x)
+{
+	return ((x != 0) && !(x & (x - 1)));
+}
+
+int iset_alloc_range(struct iset *iset, uint64_t length, uint64_t *start)
+{
+	struct iset_range *r, *rnew;
+	uint64_t astart, rend;
+	bool found = false;
+	int ret = 0;
+
+	if (!power_of_two(length)) {
+		errno = EINVAL;
+		return errno;
+	}
+
+	pthread_mutex_lock(&iset->lock);
+	list_for_each(&iset->head, r, entry) {
+		astart = align(r->start, length);
+		/* Check for wrap around */
+		if ((astart + length - 1 >= astart) &&
+		    (astart + length - 1 <= r->start + r->length - 1)) {
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		errno = ENOSPC;
+		ret = errno;
+		goto out;
+	}
+
+	if (r->start == astart) {
+		if (r->length == length) { /* Case #1 */
+			delete_range(r);
+		} else {	/* Case #2 */
+			r->start += length;
+			r->length -= length;
+		}
+	} else {
+		rend = r->start + r->length;
+		if (astart + length != rend) { /* Case #4 */
+			rnew = create_range(astart + length,
+					    rend - astart - length);
+			if (!rnew) {
+				ret = errno;
+				goto out;
+			}
+			list_add_after(&iset->head, &r->entry, &rnew->entry);
+		}
+		r->length = astart - r->start; /* Case #3 & #4 */
+	}
+
+	*start = astart;
+out:
+	pthread_mutex_unlock(&iset->lock);
+	return ret;
+}
diff --git a/util/interval_set.h b/util/interval_set.h
new file mode 100644
index 0000000..d5b1f56
--- /dev/null
+++ b/util/interval_set.h
@@ -0,0 +1,77 @@
+/* GPLv2 or OpenIB.org BSD (MIT) See COPYING file */
+
+#include <stdint.h>
+
+struct iset;
+
+/**
+ * iset_create - Create an interval set
+ *
+ * Return the created iset if succeeded, NULL otherwise, with errno set
+ */
+struct iset *iset_create(void);
+
+/**
+ * iset_destroy - Destroy an interval set
+ * @iset: The set to be destroyed
+ */
+void iset_destroy(struct iset *iset);
+
+/**
+ * iset_insert_range - Insert a range to the set
+ * @iset: The set to be operated
+ * @start: The start address of the range
+ * @length: The length of the range
+ *
+ * If this range is contiguous with the adjacent ranges (before and/or after),
+ * they will be combined into a larger one.
+ *
+ * Return 0 if succeeded, errno otherwise
+ */
+int iset_insert_range(struct iset *iset, uint64_t start, uint64_t length);
+
+/**
+ * iset_alloc_range - Allocate a range from the set
+ *
+ * @iset: The set to be operated
+ * @length: The length of the range, must be power of two
+ * @start: The start address of the allocated range, aligned with @length
+ *
+ * Return 0 if succeeded, errno otherwise
+ *
+ * Note: The following cases can occur:
+ *
+Case 1: Original range is fully taken
++------------------+
+|XXXXXXXXXXXXXXXXXX|
++------------------+
+=>  (NULL)
+
+Case 2: Original range shrunk
++------------------+
+|XXXXX             |
++------------------+
+=>
+      +------------+
+      |            |
+      +------------+
+
+Case 3: Original range shrunk
++------------------+
+|             XXXXX|
++------------------+
+=>
++------------+
+|            |
++------------+
+
+Case 4: Original range split
++------------------+
+|      XXXXX       |
++------------------+
+=>
++-----+     +------+
+|     |     |      |
++-----+     +------+
+*/
+int iset_alloc_range(struct iset *iset, uint64_t length, uint64_t *start);
-- 
1.8.3.1



* [PATCH rdma-core 05/27] verbs: Enable verbs_open_device() to work over non sysfs devices
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (3 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 04/27] util: Add interval_set support Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 06/27] mlx5: Setup mlx5 vfio context Yishai Hadas
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Enable verbs_open_device() to work over non-sysfs devices, such as
mlx5 over VFIO.

Any other API that relies on verbs_sysfs_dev should fail cleanly.
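
For instance, with a device that has no sysfs entry (an mlx5 VFIO device),
the sketch below is expected to behave as follows after this change (dev is
assumed to come from mlx5dv_get_vfio_device_list()):

	int idx = ibv_get_device_index(dev);	/* returns -1: no sysfs-backed index */
	/* sysfs reads such as ibv_read_ibdev_sysfs_file() fail with errno == EINVAL */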

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 libibverbs/device.c | 39 ++++++++++++++++++++++-----------------
 libibverbs/sysfs.c  |  5 +++++
 2 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/libibverbs/device.c b/libibverbs/device.c
index 2f0b3b7..bd6a019 100644
--- a/libibverbs/device.c
+++ b/libibverbs/device.c
@@ -124,7 +124,7 @@ LATEST_SYMVER_FUNC(ibv_get_device_guid, 1_1, "IBVERBS_1.1",
 	int i;
 
 	pthread_mutex_lock(&dev_list_lock);
-	if (sysfs_dev->flags & VSYSFS_READ_NODE_GUID) {
+	if (sysfs_dev && sysfs_dev->flags & VSYSFS_READ_NODE_GUID) {
 		guid = sysfs_dev->node_guid;
 		pthread_mutex_unlock(&dev_list_lock);
 		return htobe64(guid);
@@ -154,7 +154,7 @@ int ibv_get_device_index(struct ibv_device *device)
 {
 	struct verbs_sysfs_dev *sysfs_dev = verbs_get_device(device)->sysfs;
 
-	return sysfs_dev->ibdev_idx;
+	return sysfs_dev ? sysfs_dev->ibdev_idx : -1;
 }
 
 void verbs_init_cq(struct ibv_cq *cq, struct ibv_context *context,
@@ -323,18 +323,20 @@ static void set_lib_ops(struct verbs_context *vctx)
 struct ibv_context *verbs_open_device(struct ibv_device *device, void *private_data)
 {
 	struct verbs_device *verbs_device = verbs_get_device(device);
-	int cmd_fd;
+	int cmd_fd = -1;
 	struct verbs_context *context_ex;
 	int ret;
 
-	/*
-	 * We'll only be doing writes, but we need O_RDWR in case the
-	 * provider needs to mmap() the file.
-	 */
-	cmd_fd = open_cdev(verbs_device->sysfs->sysfs_name,
-			   verbs_device->sysfs->sysfs_cdev);
-	if (cmd_fd < 0)
-		return NULL;
+	if (verbs_device->sysfs) {
+		/*
+		 * We'll only be doing writes, but we need O_RDWR in case the
+		 * provider needs to mmap() the file.
+		 */
+		cmd_fd = open_cdev(verbs_device->sysfs->sysfs_name,
+				   verbs_device->sysfs->sysfs_cdev);
+		if (cmd_fd < 0)
+			return NULL;
+	}
 
 	/*
 	 * cmd_fd ownership is transferred into alloc_context, if it fails
@@ -345,11 +347,13 @@ struct ibv_context *verbs_open_device(struct ibv_device *device, void *private_d
 		return NULL;
 
 	set_lib_ops(context_ex);
-	if (context_ex->context.async_fd == -1) {
-		ret = ibv_cmd_alloc_async_fd(&context_ex->context);
-		if (ret) {
-			ibv_close_device(&context_ex->context);
-			return NULL;
+	if (verbs_device->sysfs) {
+		if (context_ex->context.async_fd == -1) {
+			ret = ibv_cmd_alloc_async_fd(&context_ex->context);
+			if (ret) {
+				ibv_close_device(&context_ex->context);
+				return NULL;
+			}
 		}
 	}
 
@@ -428,7 +432,8 @@ out:
 void verbs_uninit_context(struct verbs_context *context_ex)
 {
 	free(context_ex->priv);
-	close(context_ex->context.cmd_fd);
+	if (context_ex->context.cmd_fd != -1)
+		close(context_ex->context.cmd_fd);
 	if (context_ex->context.async_fd != -1)
 		close(context_ex->context.async_fd);
 	ibverbs_device_put(context_ex->context.device);
diff --git a/libibverbs/sysfs.c b/libibverbs/sysfs.c
index 8ba4472..d898432 100644
--- a/libibverbs/sysfs.c
+++ b/libibverbs/sysfs.c
@@ -127,6 +127,11 @@ int ibv_read_ibdev_sysfs_file(char *buf, size_t size,
 	va_list va;
 	int res;
 
+	if (!sysfs_dev) {
+		errno = EINVAL;
+		return -1;
+	}
+
 	va_start(va, fnfmt);
 	if (vasprintf(&path, fnfmt, va) < 0) {
 		va_end(va);
-- 
1.8.3.1



* [PATCH rdma-core 06/27] mlx5: Setup mlx5 vfio context
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (4 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 05/27] verbs: Enable verbs_open_device() to work over non sysfs devices Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 07/27] mlx5: Add mlx5_vfio_cmd_exec() support Yishai Hadas
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Set up the mlx5 vfio context by using the VFIO ioctl API to read the
device initialization segment and other required information.

As part of this, the applicable IOVA ranges are prepared for DMA
usage and the command interface is initialized.
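
The VFIO setup itself follows the usual container/group/device sequence,
condensed below from mlx5_vfio_open_fds() in this patch (error handling
omitted):

	struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
	int container_fd, group_fd, device_fd;

	container_fd = open("/dev/vfio/vfio", O_RDWR);
	ioctl(container_fd, VFIO_GET_API_VERSION);		/* expect VFIO_API_VERSION */
	ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);
	group_fd = open(mdev->vfio_path, O_RDWR);		/* /dev/vfio/<iommu group> */
	ioctl(group_fd, VFIO_GROUP_GET_STATUS, &group_status);	/* must be VIABLE */
	ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
	ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
	device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, mdev->pci_name);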

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5.h      |   2 +
 providers/mlx5/mlx5_vfio.c | 740 +++++++++++++++++++++++++++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h | 134 ++++++++
 3 files changed, 876 insertions(+)

diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index 7436bc8..7e7d70d 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -72,6 +72,7 @@ enum {
 
 enum {
 	MLX5_ADAPTER_PAGE_SIZE		= 4096,
+	MLX5_ADAPTER_PAGE_SHIFT		= 12,
 };
 
 #define MLX5_CQ_PREFIX "MLX_CQ"
@@ -90,6 +91,7 @@ enum {
 	MLX5_DBG_CQ_CQE		= 1 << 4,
 	MLX5_DBG_CONTIG		= 1 << 5,
 	MLX5_DBG_DR		= 1 << 6,
+	MLX5_DBG_VFIO		= 1 << 7,
 };
 
 extern uint32_t mlx5_debug_mask;
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 69c7662..86f14f1 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -15,15 +15,755 @@
 #include <sys/mman.h>
 #include <string.h>
 #include <sys/param.h>
+#include <linux/vfio.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
 
 #include "mlx5dv.h"
 #include "mlx5_vfio.h"
 #include "mlx5.h"
 
+static void mlx5_vfio_free_cmd_msg(struct mlx5_vfio_context *ctx,
+				   struct mlx5_cmd_msg *msg);
+
+static int mlx5_vfio_alloc_cmd_msg(struct mlx5_vfio_context *ctx,
+				   uint32_t size, struct mlx5_cmd_msg *msg);
+
+static int mlx5_vfio_register_mem(struct mlx5_vfio_context *ctx,
+				  void *vaddr, uint64_t iova, uint64_t size)
+{
+	struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
+
+	dma_map.vaddr = (uintptr_t)vaddr;
+	dma_map.size = size;
+	dma_map.iova = iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	return ioctl(ctx->container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+}
+
+static void mlx5_vfio_unregister_mem(struct mlx5_vfio_context *ctx,
+				     uint64_t iova, uint64_t size)
+{
+	struct vfio_iommu_type1_dma_unmap dma_unmap = {};
+
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = size;
+	dma_unmap.iova = iova;
+
+	if (ioctl(ctx->container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap))
+		assert(false);
+}
+
+static struct page_block *mlx5_vfio_new_block(struct mlx5_vfio_context *ctx)
+{
+	struct page_block *page_block;
+	int err;
+
+	page_block = calloc(1, sizeof(*page_block));
+	if (!page_block) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	err = posix_memalign(&page_block->page_ptr, MLX5_VFIO_BLOCK_SIZE,
+			     MLX5_VFIO_BLOCK_SIZE);
+	if (err) {
+		errno = err;
+		goto err;
+	}
+
+	err = iset_alloc_range(ctx->iova_alloc, MLX5_VFIO_BLOCK_SIZE, &page_block->iova);
+	if (err)
+		goto err_range;
+
+	bitmap_fill(page_block->free_pages, MLX5_VFIO_BLOCK_NUM_PAGES);
+	err = mlx5_vfio_register_mem(ctx, page_block->page_ptr, page_block->iova,
+				     MLX5_VFIO_BLOCK_SIZE);
+	if (err)
+		goto err_reg;
+
+	list_add(&ctx->mem_alloc.block_list, &page_block->next_block);
+	return page_block;
+
+err_reg:
+	iset_insert_range(ctx->iova_alloc, page_block->iova,
+			  MLX5_VFIO_BLOCK_SIZE);
+err_range:
+	free(page_block->page_ptr);
+err:
+	free(page_block);
+	return NULL;
+}
+
+static void mlx5_vfio_free_block(struct mlx5_vfio_context *ctx,
+				 struct page_block *page_block)
+{
+	mlx5_vfio_unregister_mem(ctx, page_block->iova, MLX5_VFIO_BLOCK_SIZE);
+	iset_insert_range(ctx->iova_alloc, page_block->iova, MLX5_VFIO_BLOCK_SIZE);
+	list_del(&page_block->next_block);
+	free(page_block->page_ptr);
+	free(page_block);
+}
+
+static int mlx5_vfio_alloc_page(struct mlx5_vfio_context *ctx, uint64_t *iova)
+{
+	struct page_block *page_block;
+	unsigned long pg;
+	int ret = 0;
+
+	pthread_mutex_lock(&ctx->mem_alloc.block_list_mutex);
+	while (true) {
+		list_for_each(&ctx->mem_alloc.block_list, page_block, next_block) {
+			pg = bitmap_ffs(page_block->free_pages, 0, MLX5_VFIO_BLOCK_NUM_PAGES);
+			if (pg != MLX5_VFIO_BLOCK_NUM_PAGES) {
+				bitmap_clear_bit(page_block->free_pages, pg);
+				*iova = page_block->iova + pg * MLX5_ADAPTER_PAGE_SIZE;
+				goto end;
+			}
+		}
+		if (!mlx5_vfio_new_block(ctx)) {
+			ret = -1;
+			goto end;
+		}
+	}
+end:
+	pthread_mutex_unlock(&ctx->mem_alloc.block_list_mutex);
+	return ret;
+}
+
+static void mlx5_vfio_free_page(struct mlx5_vfio_context *ctx, uint64_t iova)
+{
+	struct page_block *page_block;
+	unsigned long pg;
+
+	pthread_mutex_lock(&ctx->mem_alloc.block_list_mutex);
+	list_for_each(&ctx->mem_alloc.block_list, page_block, next_block) {
+		if (page_block->iova > iova ||
+		    (page_block->iova + MLX5_VFIO_BLOCK_SIZE <= iova))
+			continue;
+
+		pg = (iova - page_block->iova) / MLX5_ADAPTER_PAGE_SIZE;
+		assert(!bitmap_test_bit(page_block->free_pages, pg));
+		bitmap_set_bit(page_block->free_pages, pg);
+		if (bitmap_full(page_block->free_pages, MLX5_VFIO_BLOCK_NUM_PAGES))
+			mlx5_vfio_free_block(ctx, page_block);
+
+		goto end;
+	}
+
+	assert(false);
+end:
+	pthread_mutex_unlock(&ctx->mem_alloc.block_list_mutex);
+}
+
+static int mlx5_vfio_enable_pci_cmd(struct mlx5_vfio_context *ctx)
+{
+	struct vfio_region_info pci_config_reg = {};
+	uint16_t pci_com_buf = 0x6;
+	char buffer[4096];
+
+	pci_config_reg.argsz = sizeof(pci_config_reg);
+	pci_config_reg.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+	if (ioctl(ctx->device_fd, VFIO_DEVICE_GET_REGION_INFO, &pci_config_reg))
+		return -1;
+
+	if (pwrite(ctx->device_fd, &pci_com_buf, 2, pci_config_reg.offset + 0x4) != 2)
+		return -1;
+
+	if (pread(ctx->device_fd, buffer, pci_config_reg.size, pci_config_reg.offset)
+			!= pci_config_reg.size)
+		return -1;
+
+	return 0;
+}
+
+static void free_cmd_box(struct mlx5_vfio_context *ctx,
+			 struct mlx5_cmd_mailbox *mailbox)
+{
+	mlx5_vfio_unregister_mem(ctx, mailbox->iova, MLX5_ADAPTER_PAGE_SIZE);
+	iset_insert_range(ctx->iova_alloc, mailbox->iova, MLX5_ADAPTER_PAGE_SIZE);
+	free(mailbox->buf);
+	free(mailbox);
+}
+
+static struct mlx5_cmd_mailbox *alloc_cmd_box(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_cmd_mailbox *mailbox;
+	int ret;
+
+	mailbox = calloc(1, sizeof(*mailbox));
+	if (!mailbox) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	ret = posix_memalign(&mailbox->buf, MLX5_ADAPTER_PAGE_SIZE,
+			     MLX5_ADAPTER_PAGE_SIZE);
+	if (ret) {
+		errno = ret;
+		goto err_free;
+	}
+
+	memset(mailbox->buf, 0, MLX5_ADAPTER_PAGE_SIZE);
+
+	ret = iset_alloc_range(ctx->iova_alloc, MLX5_ADAPTER_PAGE_SIZE, &mailbox->iova);
+	if (ret)
+		goto err_tree;
+
+	ret = mlx5_vfio_register_mem(ctx, mailbox->buf, mailbox->iova,
+				     MLX5_ADAPTER_PAGE_SIZE);
+	if (ret)
+		goto err_reg;
+
+	return mailbox;
+
+err_reg:
+	iset_insert_range(ctx->iova_alloc, mailbox->iova,
+			  MLX5_ADAPTER_PAGE_SIZE);
+err_tree:
+	free(mailbox->buf);
+err_free:
+	free(mailbox);
+	return NULL;
+}
+
+static int mlx5_calc_cmd_blocks(uint32_t msg_len)
+{
+	int size = msg_len;
+	int blen = size - min_t(int, 16, size);
+
+	return DIV_ROUND_UP(blen, MLX5_CMD_DATA_BLOCK_SIZE);
+}
+
+static void mlx5_vfio_free_cmd_msg(struct mlx5_vfio_context *ctx,
+				   struct mlx5_cmd_msg *msg)
+{
+	struct mlx5_cmd_mailbox *head = msg->next;
+	struct mlx5_cmd_mailbox *next;
+
+	while (head) {
+		next = head->next;
+		free_cmd_box(ctx, head);
+		head = next;
+	}
+	msg->len = 0;
+}
+
+static int mlx5_vfio_alloc_cmd_msg(struct mlx5_vfio_context *ctx,
+				   uint32_t size, struct mlx5_cmd_msg *msg)
+{
+	struct mlx5_cmd_mailbox *tmp, *head = NULL;
+	struct mlx5_cmd_block *block;
+	int i, num_blocks;
+
+	msg->len = size;
+	num_blocks = mlx5_calc_cmd_blocks(size);
+
+	for (i = 0; i < num_blocks; i++) {
+		tmp = alloc_cmd_box(ctx);
+		if (!tmp)
+			goto err_alloc;
+
+		block = tmp->buf;
+		tmp->next = head;
+		block->next = htobe64(tmp->next ? tmp->next->iova : 0);
+		block->block_num = htobe32(num_blocks - i - 1);
+		head = tmp;
+	}
+	msg->next = head;
+	return 0;
+
+err_alloc:
+	while (head) {
+		tmp = head->next;
+		free_cmd_box(ctx, head);
+		head = tmp;
+	}
+	msg->len = 0;
+	return -1;
+}
+
+static void mlx5_vfio_free_cmd_slot(struct mlx5_vfio_context *ctx, int slot)
+{
+	struct mlx5_vfio_cmd_slot *cmd_slot = &ctx->cmd.cmds[slot];
+
+	mlx5_vfio_free_cmd_msg(ctx, &cmd_slot->in);
+	mlx5_vfio_free_cmd_msg(ctx, &cmd_slot->out);
+	close(cmd_slot->completion_event_fd);
+}
+
+static int mlx5_vfio_setup_cmd_slot(struct mlx5_vfio_context *ctx, int slot)
+{
+	struct mlx5_vfio_cmd *cmd = &ctx->cmd;
+	struct mlx5_vfio_cmd_slot *cmd_slot = &cmd->cmds[slot];
+	struct mlx5_cmd_layout *cmd_lay;
+	int ret;
+
+	ret = mlx5_vfio_alloc_cmd_msg(ctx, 4096, &cmd_slot->in);
+	if (ret)
+		return ret;
+
+	ret = mlx5_vfio_alloc_cmd_msg(ctx, 4096, &cmd_slot->out);
+	if (ret)
+		goto err;
+
+	cmd_lay = cmd->vaddr + (slot * (1 << cmd->log_stride));
+	cmd_lay->type = MLX5_PCI_CMD_XPORT;
+	cmd_lay->iptr = htobe64(cmd_slot->in.next->iova);
+	cmd_lay->optr = htobe64(cmd_slot->out.next->iova);
+
+	cmd_slot->lay = cmd_lay;
+	cmd_slot->completion_event_fd = eventfd(0, EFD_CLOEXEC);
+	if (cmd_slot->completion_event_fd < 0) {
+		ret = -1;
+		goto err_fd;
+	}
+
+	pthread_mutex_init(&cmd_slot->lock, NULL);
+
+	return 0;
+
+err_fd:
+	mlx5_vfio_free_cmd_msg(ctx, &cmd_slot->out);
+err:
+	mlx5_vfio_free_cmd_msg(ctx, &cmd_slot->in);
+	return ret;
+}
+
+static int mlx5_vfio_init_cmd_interface(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_init_seg *init_seg = ctx->bar_map;
+	struct mlx5_vfio_cmd *cmd = &ctx->cmd;
+	uint16_t cmdif_rev;
+	uint32_t cmd_h, cmd_l;
+	int ret;
+
+	cmdif_rev = be32toh(init_seg->cmdif_rev_fw_sub) >> 16;
+
+	if (cmdif_rev != 5) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	cmd_l = be32toh(init_seg->cmdq_addr_l_sz) & 0xff;
+	ctx->cmd.log_sz = cmd_l >> 4 & 0xf;
+	ctx->cmd.log_stride = cmd_l & 0xf;
+	if (1 << ctx->cmd.log_sz > MLX5_MAX_COMMANDS) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	if (ctx->cmd.log_sz + ctx->cmd.log_stride > MLX5_ADAPTER_PAGE_SHIFT) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	/* The initial address must be 4K aligned */
+	ret = posix_memalign(&cmd->vaddr, MLX5_ADAPTER_PAGE_SIZE,
+			     MLX5_ADAPTER_PAGE_SIZE);
+	if (ret) {
+		errno = ret;
+		return -1;
+	}
+
+	memset(cmd->vaddr, 0, MLX5_ADAPTER_PAGE_SIZE);
+
+	ret = iset_alloc_range(ctx->iova_alloc, MLX5_ADAPTER_PAGE_SIZE, &cmd->iova);
+	if (ret)
+		goto err_free;
+
+	ret = mlx5_vfio_register_mem(ctx, cmd->vaddr, cmd->iova, MLX5_ADAPTER_PAGE_SIZE);
+	if (ret)
+		goto err_reg;
+
+	cmd_h = (uint32_t)((uint64_t)(cmd->iova) >> 32);
+	cmd_l = (uint32_t)(uint64_t)(cmd->iova);
+
+	init_seg->cmdq_addr_h = htobe32(cmd_h);
+	init_seg->cmdq_addr_l_sz = htobe32(cmd_l);
+
+	/* Make sure firmware sees the complete address before we proceed */
+	udma_to_device_barrier();
+
+	ret = mlx5_vfio_setup_cmd_slot(ctx, 0);
+	if (ret)
+		goto err_slot_0;
+
+	ret = mlx5_vfio_setup_cmd_slot(ctx, MLX5_MAX_COMMANDS - 1);
+	if (ret)
+		goto err_slot_1;
+
+	ret = mlx5_vfio_enable_pci_cmd(ctx);
+	if (!ret)
+		return 0;
+
+	mlx5_vfio_free_cmd_slot(ctx, MLX5_MAX_COMMANDS - 1);
+err_slot_1:
+	mlx5_vfio_free_cmd_slot(ctx, 0);
+err_slot_0:
+	mlx5_vfio_unregister_mem(ctx, cmd->iova, MLX5_ADAPTER_PAGE_SIZE);
+err_reg:
+	iset_insert_range(ctx->iova_alloc, cmd->iova, MLX5_ADAPTER_PAGE_SIZE);
+err_free:
+	free(cmd->vaddr);
+	return ret;
+}
+
+static void mlx5_vfio_clean_cmd_interface(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_vfio_cmd *cmd = &ctx->cmd;
+
+	mlx5_vfio_free_cmd_slot(ctx, 0);
+	mlx5_vfio_free_cmd_slot(ctx, MLX5_MAX_COMMANDS - 1);
+	mlx5_vfio_unregister_mem(ctx, cmd->iova, MLX5_ADAPTER_PAGE_SIZE);
+	iset_insert_range(ctx->iova_alloc, cmd->iova, MLX5_ADAPTER_PAGE_SIZE);
+	free(cmd->vaddr);
+}
+
+static void set_iova_min_page_size(struct mlx5_vfio_context *ctx,
+				   uint64_t iova_pgsizes)
+{
+	int i;
+
+	for (i = MLX5_ADAPTER_PAGE_SHIFT; i < 64; i++) {
+		if (iova_pgsizes & (1 << i)) {
+			ctx->iova_min_page_size = 1 << i;
+			return;
+		}
+	}
+
+	assert(false);
+}
+
+/* if the kernel does not report usable IOVA regions, choose the legacy region */
+#define MLX5_VFIO_IOVA_MIN1 0x10000ULL
+#define MLX5_VFIO_IOVA_MAX1 0xFEDFFFFFULL
+#define MLX5_VFIO_IOVA_MIN2 0xFEF00000ULL
+#define MLX5_VFIO_IOVA_MAX2 ((1ULL << 39) - 1)
+
+static int mlx5_vfio_get_iommu_info(struct mlx5_vfio_context *ctx)
+{
+	struct vfio_iommu_type1_info *info;
+	int ret, i;
+	void *ptr;
+	uint32_t offset;
+
+	info = calloc(1, sizeof(*info));
+	if (!info) {
+		errno = ENOMEM;
+		return -1;
+	}
+
+	info->argsz = sizeof(*info);
+	ret = ioctl(ctx->container_fd, VFIO_IOMMU_GET_INFO, info);
+	if (ret)
+		goto end;
+
+	if (info->argsz > sizeof(*info)) {
+		info = realloc(info, info->argsz);
+		if (!info) {
+			errno = ENOMEM;
+			ret = -1;
+			goto end;
+		}
+
+		ret = ioctl(ctx->container_fd, VFIO_IOMMU_GET_INFO, info);
+		if (ret)
+			goto end;
+	}
+
+	set_iova_min_page_size(ctx, (info->flags & VFIO_IOMMU_INFO_PGSIZES) ?
+			       info->iova_pgsizes : 4096);
+
+	if (!(info->flags & VFIO_IOMMU_INFO_CAPS))
+		goto set_legacy;
+
+	offset = info->cap_offset;
+	while (offset) {
+		struct vfio_iommu_type1_info_cap_iova_range *iova_range;
+		struct vfio_info_cap_header *header;
+
+		ptr = (void *)info + offset;
+		header = ptr;
+
+		if (header->id != VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
+			offset = header->next;
+			continue;
+		}
+
+		iova_range = (struct vfio_iommu_type1_info_cap_iova_range *)header;
+
+		for (i = 0; i < iova_range->nr_iovas; i++) {
+			mlx5_dbg(ctx->dbg_fp, MLX5_DBG_VFIO, "\t%02d: %016llx - %016llx\n", i,
+				 iova_range->iova_ranges[i].start,
+				 iova_range->iova_ranges[i].end);
+			ret = iset_insert_range(ctx->iova_alloc, iova_range->iova_ranges[i].start,
+						iova_range->iova_ranges[i].end -
+						iova_range->iova_ranges[i].start + 1);
+			if (ret)
+				goto end;
+		}
+
+		goto end;
+	}
+
+set_legacy:
+	ret = iset_insert_range(ctx->iova_alloc, MLX5_VFIO_IOVA_MIN1,
+				MLX5_VFIO_IOVA_MAX1 - MLX5_VFIO_IOVA_MIN1 + 1);
+	if (!ret)
+		ret = iset_insert_range(ctx->iova_alloc, MLX5_VFIO_IOVA_MIN2,
+					MLX5_VFIO_IOVA_MAX2 - MLX5_VFIO_IOVA_MIN2 + 1);
+
+end:
+	free(info);
+	return ret;
+}
+
+static void mlx5_vfio_clean_device_dma(struct mlx5_vfio_context *ctx)
+{
+	struct page_block *page_block, *tmp;
+
+	list_for_each_safe(&ctx->mem_alloc.block_list, page_block,
+			   tmp, next_block)
+		mlx5_vfio_free_block(ctx, page_block);
+
+	iset_destroy(ctx->iova_alloc);
+}
+
+static int mlx5_vfio_init_device_dma(struct mlx5_vfio_context *ctx)
+{
+	ctx->iova_alloc = iset_create();
+	if (!ctx->iova_alloc)
+		return -1;
+
+	list_head_init(&ctx->mem_alloc.block_list);
+	pthread_mutex_init(&ctx->mem_alloc.block_list_mutex, NULL);
+
+	if (mlx5_vfio_get_iommu_info(ctx))
+		goto err;
+
+	/* create an initial block of DMA memory ready to be used */
+	if (!mlx5_vfio_new_block(ctx))
+		goto err;
+
+	return 0;
+err:
+	iset_destroy(ctx->iova_alloc);
+	return -1;
+}
+
+static void mlx5_vfio_uninit_bar0(struct mlx5_vfio_context *ctx)
+{
+	munmap(ctx->bar_map, ctx->bar_map_size);
+}
+
+static int mlx5_vfio_init_bar0(struct mlx5_vfio_context *ctx)
+{
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	void *base;
+	int err;
+
+	reg.index = 0;
+	err = ioctl(ctx->device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+	if (err)
+		return err;
+
+	base = mmap(NULL, reg.size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		    ctx->device_fd, reg.offset);
+	if (base == MAP_FAILED)
+		return -1;
+
+	ctx->bar_map = (struct mlx5_init_seg *)base;
+	ctx->bar_map_size = reg.size;
+	return 0;
+}
+
+#define MLX5_VFIO_MAX_INTR_VEC_ID 1
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+			      sizeof(int) * (MLX5_VFIO_MAX_INTR_VEC_ID))
+
+/* enable MSI-X interrupts */
+static int
+mlx5_vfio_enable_msix(struct mlx5_vfio_context *ctx)
+{
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int len;
+	int *fd_ptr;
+
+	len = sizeof(irq_set_buf);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = len;
+	irq_set->count = 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[0] = ctx->cmd_comp_fd;
+
+	return ioctl(ctx->device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+}
+
+static int mlx5_vfio_init_async_fd(struct mlx5_vfio_context *ctx)
+{
+	struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+	irq.index = VFIO_PCI_MSIX_IRQ_INDEX;
+	if (ioctl(ctx->device_fd, VFIO_DEVICE_GET_IRQ_INFO, &irq))
+		return -1;
+
+	/* fail if this vector cannot be used with eventfd */
+	if ((irq.flags & VFIO_IRQ_INFO_EVENTFD) == 0)
+		return -1;
+
+	/* set up an eventfd for command completion interrupts */
+	ctx->cmd_comp_fd = eventfd(0, EFD_CLOEXEC);
+	if (ctx->cmd_comp_fd < 0)
+		return -1;
+
+	if (mlx5_vfio_enable_msix(ctx))
+		goto err_msix;
+
+	return 0;
+
+err_msix:
+	close(ctx->cmd_comp_fd);
+	return -1;
+}
+
+static void mlx5_vfio_close_fds(struct mlx5_vfio_context *ctx)
+{
+	close(ctx->device_fd);
+	close(ctx->container_fd);
+	close(ctx->group_fd);
+	close(ctx->cmd_comp_fd);
+}
+
+static int mlx5_vfio_open_fds(struct mlx5_vfio_context *ctx,
+			      struct mlx5_vfio_device *mdev)
+{
+	struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
+
+	/* Create a new container */
+	ctx->container_fd = open("/dev/vfio/vfio", O_RDWR);
+
+	if (ctx->container_fd < 0)
+		return -1;
+
+	if (ioctl(ctx->container_fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
+		goto close_cont;
+
+	if (!ioctl(ctx->container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
+		/* Doesn't support the IOMMU driver we want. */
+		goto close_cont;
+
+	/* Open the group */
+	ctx->group_fd = open(mdev->vfio_path, O_RDWR);
+	if (ctx->group_fd < 0)
+		goto close_cont;
+
+	/* Test the group is viable and available */
+	if (ioctl(ctx->group_fd, VFIO_GROUP_GET_STATUS, &group_status))
+		goto close_group;
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		/* Group is not viable (ie, not all devices bound for vfio) */
+		errno = EINVAL;
+		goto close_group;
+	}
+
+	/* Add the group to the container */
+	if (ioctl(ctx->group_fd, VFIO_GROUP_SET_CONTAINER, &ctx->container_fd))
+		goto close_group;
+
+	/* Enable the IOMMU model we want */
+	if (ioctl(ctx->container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU))
+		goto close_group;
+
+	/* Get a file descriptor for the device */
+	ctx->device_fd = ioctl(ctx->group_fd, VFIO_GROUP_GET_DEVICE_FD,
+			       mdev->pci_name);
+	if (ctx->device_fd < 0)
+		goto close_group;
+
+	if (mlx5_vfio_init_async_fd(ctx))
+		goto close_group;
+
+	return 0;
+
+close_group:
+	close(ctx->group_fd);
+close_cont:
+	close(ctx->container_fd);
+	return -1;
+}
+
+static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
+{
+	mlx5_close_debug_file(ctx->dbg_fp);
+
+	verbs_uninit_context(&ctx->vctx);
+	free(ctx);
+}
+
+static void mlx5_vfio_free_context(struct ibv_context *ibctx)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+
+	mlx5_vfio_clean_cmd_interface(ctx);
+	mlx5_vfio_clean_device_dma(ctx);
+	mlx5_vfio_uninit_bar0(ctx);
+	mlx5_vfio_close_fds(ctx);
+	mlx5_vfio_uninit_context(ctx);
+}
+
+static const struct verbs_context_ops mlx5_vfio_common_ops = {
+	.free_context = mlx5_vfio_free_context,
+};
+
 static struct verbs_context *
 mlx5_vfio_alloc_context(struct ibv_device *ibdev,
 			int cmd_fd, void *private_data)
 {
+	struct mlx5_vfio_device *mdev = to_mvfio_dev(ibdev);
+	struct mlx5_vfio_context *mctx;
+
+	cmd_fd = -1;
+
+	mctx = verbs_init_and_alloc_context(ibdev, cmd_fd, mctx, vctx,
+					    RDMA_DRIVER_UNKNOWN);
+	if (!mctx)
+		return NULL;
+
+	mlx5_open_debug_file(&mctx->dbg_fp);
+	mlx5_set_debug_mask();
+
+	if (mlx5_vfio_open_fds(mctx, mdev))
+		goto err;
+
+	if (mlx5_vfio_init_bar0(mctx))
+		goto close_fds;
+
+	if (mlx5_vfio_init_device_dma(mctx))
+		goto err_bar;
+
+	if (mlx5_vfio_init_cmd_interface(mctx))
+		goto err_dma;
+
+	verbs_set_ops(&mctx->vctx, &mlx5_vfio_common_ops);
+	return &mctx->vctx;
+
+err_dma:
+	mlx5_vfio_clean_device_dma(mctx);
+err_bar:
+	mlx5_vfio_uninit_bar0(mctx);
+close_fds:
+	mlx5_vfio_close_fds(mctx);
+err:
+	mlx5_vfio_uninit_context(mctx);
 	return NULL;
 }
 
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 6ba4254..392ddcb 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -8,8 +8,21 @@
 
 #include <stddef.h>
 #include <stdio.h>
+#include "mlx5.h"
 
 #include <infiniband/driver.h>
+#include <util/interval_set.h>
+
+enum {
+	MLX5_MAX_COMMANDS = 32,
+	MLX5_CMD_DATA_BLOCK_SIZE = 512,
+	MLX5_PCI_CMD_XPORT = 7,
+};
+
+enum {
+	MLX5_VFIO_BLOCK_SIZE = 2 * 1024 * 1024,
+	MLX5_VFIO_BLOCK_NUM_PAGES = MLX5_VFIO_BLOCK_SIZE / MLX5_ADAPTER_PAGE_SIZE,
+};
 
 struct mlx5_vfio_device {
 	struct verbs_device vdev;
@@ -19,9 +32,130 @@ struct mlx5_vfio_device {
 	uint32_t flags;
 };
 
+struct health_buffer {
+	__be32		assert_var[5];
+	__be32		rsvd0[3];
+	__be32		assert_exit_ptr;
+	__be32		assert_callra;
+	__be32		rsvd1[2];
+	__be32		fw_ver;
+	__be32		hw_id;
+	__be32		rfr;
+	uint8_t		irisc_index;
+	uint8_t		synd;
+	__be16		ext_synd;
+};
+
+struct mlx5_init_seg {
+	__be32			fw_rev;
+	__be32			cmdif_rev_fw_sub;
+	__be32			rsvd0[2];
+	__be32			cmdq_addr_h;
+	__be32			cmdq_addr_l_sz;
+	__be32			cmd_dbell;
+	__be32			rsvd1[120];
+	__be32			initializing;
+	struct health_buffer	health;
+	__be32			rsvd2[880];
+	__be32			internal_timer_h;
+	__be32			internal_timer_l;
+	__be32			rsvd3[2];
+	__be32			health_counter;
+	__be32			rsvd4[1019];
+	__be64			ieee1588_clk;
+	__be32			ieee1588_clk_type;
+	__be32			clr_intx;
+};
+
+struct mlx5_cmd_layout {
+	uint8_t		type;
+	uint8_t		rsvd0[3];
+	__be32		ilen;
+	__be64		iptr;
+	__be32		in[4];
+	__be32		out[4];
+	__be64		optr;
+	__be32		olen;
+	uint8_t		token;
+	uint8_t		sig;
+	uint8_t		rsvd1;
+	uint8_t		status_own;
+};
+
+struct mlx5_cmd_block {
+	uint8_t		data[MLX5_CMD_DATA_BLOCK_SIZE];
+	uint8_t		rsvd0[48];
+	__be64		next;
+	__be32		block_num;
+	uint8_t		rsvd1;
+	uint8_t		token;
+	uint8_t		ctrl_sig;
+	uint8_t		sig;
+};
+
+struct page_block {
+	void *page_ptr;
+	uint64_t iova;
+	struct list_node next_block;
+	BITMAP_DECLARE(free_pages, MLX5_VFIO_BLOCK_NUM_PAGES);
+};
+
+struct vfio_mem_allocator {
+	struct list_head block_list;
+	pthread_mutex_t block_list_mutex;
+};
+
+struct mlx5_cmd_mailbox {
+	void *buf;
+	uint64_t iova;
+	struct mlx5_cmd_mailbox *next;
+};
+
+struct mlx5_cmd_msg {
+	uint32_t len;
+	struct mlx5_cmd_mailbox *next;
+};
+
+struct mlx5_vfio_cmd_slot {
+	struct mlx5_cmd_layout *lay;
+	struct mlx5_cmd_msg in;
+	struct mlx5_cmd_msg out;
+	pthread_mutex_t lock;
+	int completion_event_fd;
+};
+
+struct mlx5_vfio_cmd {
+	void *vaddr; /* cmd page address */
+	uint64_t iova;
+	uint8_t log_sz;
+	uint8_t log_stride;
+	struct mlx5_vfio_cmd_slot cmds[MLX5_MAX_COMMANDS];
+};
+
+struct mlx5_vfio_context {
+	struct verbs_context vctx;
+	int container_fd;
+	int group_fd;
+	int device_fd;
+	int cmd_comp_fd; /* command completion FD */
+	struct iset *iova_alloc;
+	uint64_t iova_min_page_size;
+	FILE *dbg_fp;
+	struct vfio_mem_allocator mem_alloc;
+	struct mlx5_init_seg *bar_map;
+	size_t bar_map_size;
+	struct mlx5_vfio_cmd cmd;
+	bool have_eq;
+};
+
 static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
 {
 	return container_of(ibdev, struct mlx5_vfio_device, vdev.device);
 }
 
+static inline struct mlx5_vfio_context *to_mvfio_ctx(struct ibv_context *ibctx)
+{
+	return container_of(ibctx, struct mlx5_vfio_context, vctx.context);
+}
+
 #endif
-- 
1.8.3.1



* [PATCH rdma-core 07/27] mlx5: Add mlx5_vfio_cmd_exec() support
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (5 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 06/27] mlx5: Setup mlx5 vfio context Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 08/27] mlx5: vfio setup function support Yishai Hadas
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Add mlx5_vfio_cmd_exec() support to enable running commands.

This includes the functionality required to prepare both the in/out
command layouts and to handle the response once the command completes.
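
For illustration, a caller builds the command payload with the DEVX layout
macros and passes both buffers to mlx5_vfio_cmd_exec(); a minimal sketch,
using the enable_hca layouts that are added later in this series:

	uint32_t in[DEVX_ST_SZ_DW(enable_hca_in)] = {};
	uint32_t out[DEVX_ST_SZ_DW(enable_hca_out)] = {};
	int err;

	DEVX_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
	/* slot 0 is used here; each slot has its own layout and lock */
	err = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);

On failure, the status/syndrome from the output mailbox is logged by
mlx5_vfio_cmd_check() and mapped to an errno value.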

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  |  38 ++++++
 providers/mlx5/mlx5_vfio.c | 287 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 325 insertions(+)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 31d0bdc..aef6196 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -73,6 +73,25 @@ enum {
 	MLX5_CMD_OP_SYNC_STEERING = 0xb00,
 };
 
+enum {
+	MLX5_CMD_STAT_OK = 0x0,
+	MLX5_CMD_STAT_INT_ERR = 0x1,
+	MLX5_CMD_STAT_BAD_OP_ERR = 0x2,
+	MLX5_CMD_STAT_BAD_PARAM_ERR = 0x3,
+	MLX5_CMD_STAT_BAD_SYS_STATE_ERR = 0x4,
+	MLX5_CMD_STAT_BAD_RES_ERR = 0x5,
+	MLX5_CMD_STAT_RES_BUSY = 0x6,
+	MLX5_CMD_STAT_LIM_ERR = 0x8,
+	MLX5_CMD_STAT_BAD_RES_STATE_ERR = 0x9,
+	MLX5_CMD_STAT_IX_ERR = 0xa,
+	MLX5_CMD_STAT_NO_RES_ERR = 0xf,
+	MLX5_CMD_STAT_BAD_INP_LEN_ERR = 0x50,
+	MLX5_CMD_STAT_BAD_OUTP_LEN_ERR = 0x51,
+	MLX5_CMD_STAT_BAD_QP_STATE_ERR = 0x10,
+	MLX5_CMD_STAT_BAD_PKT_ERR = 0x30,
+	MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR = 0x40,
+};
+
 struct mlx5_ifc_atomic_caps_bits {
 	u8         reserved_at_0[0x40];
 
@@ -4093,4 +4112,23 @@ struct mlx5_ifc_create_psv_in_bits {
 	u8         reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_mbox_out_bits {
+	u8	status[0x8];
+	u8	reserved_at_8[0x18];
+
+	u8	syndrome[0x20];
+
+	u8	reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_mbox_in_bits {
+	u8	opcode[0x10];
+	u8	uid[0x10];
+
+	u8	reserved_at_20[0x10];
+	u8	op_mod[0x10];
+
+	u8	reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 86f14f1..37f06a9 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -9,6 +9,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
+#include <sys/time.h>
 #include <errno.h>
 #include <sys/stat.h>
 #include <fcntl.h>
@@ -18,10 +19,12 @@
 #include <linux/vfio.h>
 #include <sys/eventfd.h>
 #include <sys/ioctl.h>
+#include <util/mmio.h>
 
 #include "mlx5dv.h"
 #include "mlx5_vfio.h"
 #include "mlx5.h"
+#include "mlx5_ifc.h"
 
 static void mlx5_vfio_free_cmd_msg(struct mlx5_vfio_context *ctx,
 				   struct mlx5_cmd_msg *msg);
@@ -157,6 +160,290 @@ end:
 	pthread_mutex_unlock(&ctx->mem_alloc.block_list_mutex);
 }
 
+static int cmd_status_to_err(uint8_t status)
+{
+	switch (status) {
+	case MLX5_CMD_STAT_OK:				return 0;
+	case MLX5_CMD_STAT_INT_ERR:			return EIO;
+	case MLX5_CMD_STAT_BAD_OP_ERR:			return EINVAL;
+	case MLX5_CMD_STAT_BAD_PARAM_ERR:		return EINVAL;
+	case MLX5_CMD_STAT_BAD_SYS_STATE_ERR:		return EIO;
+	case MLX5_CMD_STAT_BAD_RES_ERR:			return EINVAL;
+	case MLX5_CMD_STAT_RES_BUSY:			return EBUSY;
+	case MLX5_CMD_STAT_LIM_ERR:			return ENOMEM;
+	case MLX5_CMD_STAT_BAD_RES_STATE_ERR:		return EINVAL;
+	case MLX5_CMD_STAT_IX_ERR:			return EINVAL;
+	case MLX5_CMD_STAT_NO_RES_ERR:			return EAGAIN;
+	case MLX5_CMD_STAT_BAD_INP_LEN_ERR:		return EIO;
+	case MLX5_CMD_STAT_BAD_OUTP_LEN_ERR:		return EIO;
+	case MLX5_CMD_STAT_BAD_QP_STATE_ERR:		return EINVAL;
+	case MLX5_CMD_STAT_BAD_PKT_ERR:			return EINVAL;
+	case MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR:	return EINVAL;
+	default:					return EIO;
+	}
+}
+
+static const char *cmd_status_str(uint8_t status)
+{
+	switch (status) {
+	case MLX5_CMD_STAT_OK:
+		return "OK";
+	case MLX5_CMD_STAT_INT_ERR:
+		return "internal error";
+	case MLX5_CMD_STAT_BAD_OP_ERR:
+		return "bad operation";
+	case MLX5_CMD_STAT_BAD_PARAM_ERR:
+		return "bad parameter";
+	case MLX5_CMD_STAT_BAD_SYS_STATE_ERR:
+		return "bad system state";
+	case MLX5_CMD_STAT_BAD_RES_ERR:
+		return "bad resource";
+	case MLX5_CMD_STAT_RES_BUSY:
+		return "resource busy";
+	case MLX5_CMD_STAT_LIM_ERR:
+		return "limits exceeded";
+	case MLX5_CMD_STAT_BAD_RES_STATE_ERR:
+		return "bad resource state";
+	case MLX5_CMD_STAT_IX_ERR:
+		return "bad index";
+	case MLX5_CMD_STAT_NO_RES_ERR:
+		return "no resources";
+	case MLX5_CMD_STAT_BAD_INP_LEN_ERR:
+		return "bad input length";
+	case MLX5_CMD_STAT_BAD_OUTP_LEN_ERR:
+		return "bad output length";
+	case MLX5_CMD_STAT_BAD_QP_STATE_ERR:
+		return "bad QP state";
+	case MLX5_CMD_STAT_BAD_PKT_ERR:
+		return "bad packet (discarded)";
+	case MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR:
+		return "bad size too many outstanding CQEs";
+	default:
+		return "unknown status";
+	}
+}
+
+static void mlx5_cmd_mbox_status(void *out, uint8_t *status, uint32_t *syndrome)
+{
+	*status = DEVX_GET(mbox_out, out, status);
+	*syndrome = DEVX_GET(mbox_out, out, syndrome);
+}
+
+static int mlx5_vfio_cmd_check(struct mlx5_vfio_context *ctx, void *in, void *out)
+{
+	uint32_t syndrome;
+	uint8_t  status;
+	uint16_t opcode;
+	uint16_t op_mod;
+
+	mlx5_cmd_mbox_status(out, &status, &syndrome);
+	if (!status)
+		return 0;
+
+	opcode = DEVX_GET(mbox_in, in, opcode);
+	op_mod = DEVX_GET(mbox_in, in, op_mod);
+
+	mlx5_err(ctx->dbg_fp,
+		 "mlx5_vfio_op_code(0x%x), op_mod(0x%x) failed, status %s(0x%x), syndrome (0x%x)\n",
+		 opcode, op_mod,
+		 cmd_status_str(status), status, syndrome);
+
+	errno = cmd_status_to_err(status);
+	return errno;
+}
+
+static int mlx5_copy_from_msg(void *to, struct mlx5_cmd_msg *from, int size,
+			      struct mlx5_cmd_layout *cmd_lay)
+{
+	struct mlx5_cmd_block *block;
+	struct mlx5_cmd_mailbox *next;
+	int copy;
+
+	copy = min_t(int, size, sizeof(cmd_lay->out));
+	memcpy(to, cmd_lay->out, copy);
+	size -= copy;
+	to += copy;
+
+	next = from->next;
+	while (size) {
+		if (!next) {
+			assert(false);
+			errno = ENOMEM;
+			return errno;
+		}
+
+		copy = min_t(int, size, MLX5_CMD_DATA_BLOCK_SIZE);
+		block = next->buf;
+
+		memcpy(to, block->data, copy);
+		to += copy;
+		size -= copy;
+		next = next->next;
+	}
+
+	return 0;
+}
+
+static int mlx5_copy_to_msg(struct mlx5_cmd_msg *to, void *from, int size,
+			    struct mlx5_cmd_layout *cmd_lay)
+{
+	struct mlx5_cmd_block *block;
+	struct mlx5_cmd_mailbox *next;
+	int copy;
+
+	copy = min_t(int, size, sizeof(cmd_lay->in));
+	memcpy(cmd_lay->in, from, copy);
+	size -= copy;
+	from += copy;
+
+	next = to->next;
+	while (size) {
+		if (!next) {
+			assert(false);
+			errno = ENOMEM;
+			return errno;
+		}
+
+		copy = min_t(int, size, MLX5_CMD_DATA_BLOCK_SIZE);
+		block = next->buf;
+		memcpy(block->data, from, copy);
+		from += copy;
+		size -= copy;
+		next = next->next;
+	}
+
+	return 0;
+}
+
+static int mlx5_vfio_enlarge_cmd_msg(struct mlx5_vfio_context *ctx, struct mlx5_cmd_msg *cmd_msg,
+				     struct mlx5_cmd_layout *cmd_lay, uint32_t len, bool is_in)
+{
+	int err;
+
+	mlx5_vfio_free_cmd_msg(ctx, cmd_msg);
+	err = mlx5_vfio_alloc_cmd_msg(ctx, len, cmd_msg);
+	if (err)
+		return err;
+
+	if (is_in)
+		cmd_lay->iptr = htobe64(cmd_msg->next->iova);
+	else
+		cmd_lay->optr = htobe64(cmd_msg->next->iova);
+
+	return 0;
+}
+
+/* One minute for the sake of bringup */
+#define MLX5_CMD_TIMEOUT_MSEC (60 * 1000)
+
+static int mlx5_vfio_poll_timeout(struct mlx5_cmd_layout *cmd_lay)
+{
+	static struct timeval start, curr;
+	uint64_t ms_start, ms_curr;
+
+	gettimeofday(&start, NULL);
+	ms_start = (uint64_t)start.tv_sec * 1000 + start.tv_usec / 1000;
+	do {
+		if (!(mmio_read8(&cmd_lay->status_own) & 0x1))
+			return 0;
+		pthread_yield();
+		gettimeofday(&curr, NULL);
+		ms_curr = (uint64_t)curr.tv_sec * 1000 + curr.tv_usec / 1000;
+	} while (ms_curr - ms_start < MLX5_CMD_TIMEOUT_MSEC);
+
+	errno = ETIMEDOUT;
+	return errno;
+}
+
+static int mlx5_vfio_cmd_prep_in(struct mlx5_vfio_context *ctx,
+				 struct mlx5_cmd_msg *cmd_in,
+				 struct mlx5_cmd_layout *cmd_lay,
+				 void *in, int ilen)
+{
+	int err;
+
+	if (ilen > cmd_in->len) {
+		err = mlx5_vfio_enlarge_cmd_msg(ctx, cmd_in, cmd_lay, ilen, true);
+		if (err)
+			return err;
+	}
+
+	err = mlx5_copy_to_msg(cmd_in, in, ilen, cmd_lay);
+	if (err)
+		return err;
+
+	cmd_lay->ilen = htobe32(ilen);
+	return 0;
+}
+
+static int mlx5_vfio_cmd_prep_out(struct mlx5_vfio_context *ctx,
+				  struct mlx5_cmd_msg *cmd_out,
+				  struct mlx5_cmd_layout *cmd_lay, int olen)
+{
+	struct mlx5_cmd_mailbox *tmp;
+	struct mlx5_cmd_block *block;
+
+	cmd_lay->olen = htobe32(olen);
+
+	/* zeroing output header */
+	memset(cmd_lay->out, 0, sizeof(cmd_lay->out));
+
+	if (olen > cmd_out->len)
+		/* Upon enlarge output message is zeroed */
+		return mlx5_vfio_enlarge_cmd_msg(ctx, cmd_out, cmd_lay, olen, false);
+
+	/* zeroing output message */
+	tmp = cmd_out->next;
+	olen -= min_t(int, olen, sizeof(cmd_lay->out));
+	while (olen > 0) {
+		block = tmp->buf;
+		memset(block->data, 0, MLX5_CMD_DATA_BLOCK_SIZE);
+		olen -= MLX5_CMD_DATA_BLOCK_SIZE;
+		tmp = tmp->next;
+		assert(tmp || olen <= 0);
+	}
+	return 0;
+}
+
+static int mlx5_vfio_cmd_exec(struct mlx5_vfio_context *ctx, void *in,
+			      int ilen, void *out, int olen,
+			      unsigned int slot)
+{
+	struct mlx5_init_seg *init_seg = ctx->bar_map;
+	struct mlx5_cmd_layout *cmd_lay = ctx->cmd.cmds[slot].lay;
+	struct mlx5_cmd_msg *cmd_in = &ctx->cmd.cmds[slot].in;
+	struct mlx5_cmd_msg *cmd_out = &ctx->cmd.cmds[slot].out;
+	int err;
+
+	pthread_mutex_lock(&ctx->cmd.cmds[slot].lock);
+
+	err = mlx5_vfio_cmd_prep_in(ctx, cmd_in, cmd_lay, in, ilen);
+	if (err)
+		goto end;
+
+	err = mlx5_vfio_cmd_prep_out(ctx, cmd_out, cmd_lay, olen);
+	if (err)
+		goto end;
+
+	cmd_lay->status_own = 0x1;
+
+	udma_to_device_barrier();
+	mmio_write32_be(&init_seg->cmd_dbell, htobe32(0x1 << slot));
+
+	err = mlx5_vfio_poll_timeout(cmd_lay);
+	if (err)
+		goto end;
+	udma_from_device_barrier();
+	err = mlx5_copy_from_msg(out, cmd_out, olen, cmd_lay);
+	if (err)
+		goto end;
+
+	err = mlx5_vfio_cmd_check(ctx, in, out);
+end:
+	pthread_mutex_unlock(&ctx->cmd.cmds[slot].lock);
+	return err;
+}
+
 static int mlx5_vfio_enable_pci_cmd(struct mlx5_vfio_context *ctx)
 {
 	struct vfio_region_info pci_config_reg = {};
-- 
1.8.3.1



* [PATCH rdma-core 08/27] mlx5: vfio setup function support
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (6 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 07/27] mlx5: Add mlx5_vfio_cmd_exec() support Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 09/27] mlx5: vfio setup basic caps Yishai Hadas
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Set up the device function by following the command sequence required by
the device specification.
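
The sequence implemented in mlx5_vfio_setup_function() is, roughly (a
sketch of the order, not the exact code):

	wait_fw_init(ctx->bar_map, FW_PRE_INIT_TIMEOUT_MILI); /* poll init segment */
	mlx5_vfio_enable_hca(ctx);               /* ENABLE_HCA */
	mlx5_vfio_set_issi(ctx);                 /* QUERY_ISSI / SET_ISSI */
	mlx5_vfio_satisfy_startup_pages(ctx, 1); /* give boot pages */
	mlx5_vfio_set_hca_ctrl(ctx);             /* host endianness via ACCESS_REG */
	mlx5_vfio_satisfy_startup_pages(ctx, 0); /* give init pages */
	mlx5_vfio_init_hca(ctx);                 /* INIT_HCA */

Each step bails out on the first failure.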

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  | 215 +++++++++++++++++++++++++++++++++++++++
 providers/mlx5/mlx5_vfio.c | 246 +++++++++++++++++++++++++++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h |  16 +++
 providers/mlx5/mlx5dv.h    |   4 +
 4 files changed, 481 insertions(+)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index aef6196..ac741cd 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -41,6 +41,13 @@ enum mlx5_cap_mode {
 
 enum {
 	MLX5_CMD_OP_QUERY_HCA_CAP = 0x100,
+	MLX5_CMD_OP_INIT_HCA = 0x102,
+	MLX5_CMD_OP_TEARDOWN_HCA = 0x103,
+	MLX5_CMD_OP_ENABLE_HCA = 0x104,
+	MLX5_CMD_OP_QUERY_PAGES = 0x107,
+	MLX5_CMD_OP_MANAGE_PAGES = 0x108,
+	MLX5_CMD_OP_QUERY_ISSI = 0x10a,
+	MLX5_CMD_OP_SET_ISSI = 0x10b,
 	MLX5_CMD_OP_CREATE_MKEY = 0x200,
 	MLX5_CMD_OP_CREATE_QP = 0x500,
 	MLX5_CMD_OP_RST2INIT_QP = 0x502,
@@ -55,6 +62,7 @@ enum {
 	MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT = 0x752,
 	MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
 	MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
+	MLX5_CMD_OP_ACCESS_REG = 0x805,
 	MLX5_CMD_OP_QUERY_LAG = 0x842,
 	MLX5_CMD_OP_CREATE_TIR = 0x900,
 	MLX5_CMD_OP_MODIFY_SQ = 0x905,
@@ -92,6 +100,16 @@ enum {
 	MLX5_CMD_STAT_BAD_SIZE_OUTS_CQES_ERR = 0x40,
 };
 
+enum {
+	MLX5_PAGES_CANT_GIVE = 0,
+	MLX5_PAGES_GIVE = 1,
+	MLX5_PAGES_TAKE = 2,
+};
+
+enum {
+	MLX5_REG_HOST_ENDIANNESS = 0x7004,
+};
+
 struct mlx5_ifc_atomic_caps_bits {
 	u8         reserved_at_0[0x40];
 
@@ -4131,4 +4149,201 @@ struct mlx5_ifc_mbox_in_bits {
 	u8	reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_enable_hca_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         function_id[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_enable_hca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x20];
+};
+
+struct mlx5_ifc_query_issi_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x10];
+	u8         current_issi[0x10];
+
+	u8         reserved_at_60[0xa0];
+
+	u8         reserved_at_100[76][0x8];
+	u8         supported_issi_dw0[0x20];
+};
+
+struct mlx5_ifc_query_issi_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_set_issi_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_set_issi_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         current_issi[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_query_pages_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8	   embedded_cpu_function[0x01];
+	u8	   reserved_bits[0x0f];
+	u8	   function_id[0x10];
+
+	u8	   num_pages[0x20];
+};
+
+struct mlx5_ifc_query_pages_in_bits {
+	u8	opcode[0x10];
+	u8	reserved_at_10[0x10];
+
+	u8	reserved_at_20[0x10];
+	u8	op_mod[0x10];
+
+	u8	reserved_at_40[0x10];
+	u8	function_id[0x10];
+
+	u8	reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_manage_pages_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         output_num_entries[0x20];
+
+	u8         reserved_at_60[0x20];
+
+	u8         pas[][0x40];
+};
+
+struct mlx5_ifc_manage_pages_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         embedded_cpu_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	u8         input_num_entries[0x20];
+
+	u8         pas[][0x40];
+};
+
+struct mlx5_ifc_teardown_hca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x3f];
+
+	u8         state[0x1];
+};
+
+enum {
+	MLX5_TEARDOWN_HCA_IN_PROFILE_GRACEFUL_CLOSE = 0x0,
+};
+
+struct mlx5_ifc_teardown_hca_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         profile[0x10];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_init_hca_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_init_hca_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_access_register_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+
+	u8         register_data[][0x20];
+};
+
+struct mlx5_ifc_access_register_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x10];
+	u8         register_id[0x10];
+
+	u8         argument[0x20];
+
+	u8         register_data[][0x20];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 37f06a9..4d12807 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -988,6 +988,246 @@ close_cont:
 	return -1;
 }
 
+static int mlx5_vfio_enable_hca(struct mlx5_vfio_context *ctx)
+{
+	uint32_t in[DEVX_ST_SZ_DW(enable_hca_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(enable_hca_out)] = {};
+
+	DEVX_SET(enable_hca_in, in, opcode, MLX5_CMD_OP_ENABLE_HCA);
+	return mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+}
+
+static int mlx5_vfio_set_issi(struct mlx5_vfio_context *ctx)
+{
+	uint32_t query_in[DEVX_ST_SZ_DW(query_issi_in)] = {};
+	uint32_t query_out[DEVX_ST_SZ_DW(query_issi_out)] = {};
+	uint32_t set_in[DEVX_ST_SZ_DW(set_issi_in)] = {};
+	uint32_t set_out[DEVX_ST_SZ_DW(set_issi_out)] = {};
+	uint32_t sup_issi;
+	int err;
+
+	DEVX_SET(query_issi_in, query_in, opcode, MLX5_CMD_OP_QUERY_ISSI);
+	err = mlx5_vfio_cmd_exec(ctx, query_in, sizeof(query_in), query_out,
+				 sizeof(query_out), 0);
+	if (err)
+		return err;
+
+	sup_issi = DEVX_GET(query_issi_out, query_out, supported_issi_dw0);
+
+	if (!(sup_issi & (1 << 1))) {
+		errno = EOPNOTSUPP;
+		return errno;
+	}
+
+	DEVX_SET(set_issi_in, set_in, opcode, MLX5_CMD_OP_SET_ISSI);
+	DEVX_SET(set_issi_in, set_in, current_issi, 1);
+	return mlx5_vfio_cmd_exec(ctx, set_in, sizeof(set_in), set_out,
+				  sizeof(set_out), 0);
+}
+
+static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx,
+				uint16_t func_id,
+				int32_t npages)
+{
+	int32_t out[DEVX_ST_SZ_DW(manage_pages_out)] = {};
+	int inlen = DEVX_ST_SZ_BYTES(manage_pages_in);
+	int i, err;
+	int32_t *in;
+	uint64_t iova;
+
+	inlen += npages * DEVX_FLD_SZ_BYTES(manage_pages_in, pas[0]);
+	in = calloc(1, inlen);
+	if (!in) {
+		errno = ENOMEM;
+		return errno;
+	}
+
+	for (i = 0; i < npages; i++) {
+		err = mlx5_vfio_alloc_page(ctx, &iova);
+		if (err)
+			goto err;
+
+		DEVX_ARRAY_SET64(manage_pages_in, in, pas, i, iova);
+	}
+
+	DEVX_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
+	DEVX_SET(manage_pages_in, in, op_mod, MLX5_PAGES_GIVE);
+	DEVX_SET(manage_pages_in, in, function_id, func_id);
+	DEVX_SET(manage_pages_in, in, input_num_entries, npages);
+
+	err = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out),
+				 MLX5_MAX_COMMANDS - 1);
+	if (!err)
+		goto end;
+err:
+	for (i--; i >= 0; i--)
+		mlx5_vfio_free_page(ctx, DEVX_GET64(manage_pages_in, in, pas[i]));
+end:
+	free(in);
+	return err;
+}
+
+static int mlx5_vfio_query_pages(struct mlx5_vfio_context *ctx, int boot,
+				 uint16_t *func_id, int32_t *npages)
+{
+	uint32_t query_pages_in[DEVX_ST_SZ_DW(query_pages_in)] = {};
+	uint32_t query_pages_out[DEVX_ST_SZ_DW(query_pages_out)] = {};
+	int ret;
+
+	DEVX_SET(query_pages_in, query_pages_in, opcode, MLX5_CMD_OP_QUERY_PAGES);
+	DEVX_SET(query_pages_in, query_pages_in, op_mod, boot ? 0x01 : 0x02);
+
+	ret = mlx5_vfio_cmd_exec(ctx, query_pages_in, sizeof(query_pages_in),
+				 query_pages_out, sizeof(query_pages_out), 0);
+	if (ret)
+		return ret;
+
+	*npages = DEVX_GET(query_pages_out, query_pages_out, num_pages);
+	*func_id = DEVX_GET(query_pages_out, query_pages_out, function_id);
+
+	return 0;
+}
+
+static int mlx5_vfio_satisfy_startup_pages(struct mlx5_vfio_context *ctx,
+					   int boot)
+{
+	uint16_t function_id;
+	int32_t npages = 0;
+	int ret;
+
+	ret = mlx5_vfio_query_pages(ctx, boot, &function_id, &npages);
+	if (ret)
+		return ret;
+
+	return mlx5_vfio_give_pages(ctx, function_id, npages);
+}
+
+static int mlx5_vfio_access_reg(struct mlx5_vfio_context *ctx, void *data_in,
+				int size_in, void *data_out, int size_out,
+				uint16_t reg_id, int arg, int write)
+{
+	int outlen = DEVX_ST_SZ_BYTES(access_register_out) + size_out;
+	int inlen = DEVX_ST_SZ_BYTES(access_register_in) + size_in;
+	int err = ENOMEM;
+	uint32_t *out = NULL;
+	uint32_t *in = NULL;
+	void *data;
+
+	in = calloc(1, inlen);
+	out = calloc(1, outlen);
+	if (!in || !out) {
+		errno = ENOMEM;
+		goto out;
+	}
+
+	data = DEVX_ADDR_OF(access_register_in, in, register_data);
+	memcpy(data, data_in, size_in);
+
+	DEVX_SET(access_register_in, in, opcode, MLX5_CMD_OP_ACCESS_REG);
+	DEVX_SET(access_register_in, in, op_mod, !write);
+	DEVX_SET(access_register_in, in, argument, arg);
+	DEVX_SET(access_register_in, in, register_id, reg_id);
+
+	err = mlx5_vfio_cmd_exec(ctx, in, inlen, out, outlen, 0);
+	if (err)
+		goto out;
+
+	data = DEVX_ADDR_OF(access_register_out, out, register_data);
+	memcpy(data_out, data, size_out);
+
+out:
+	free(out);
+	free(in);
+	return err;
+}
+
+static int mlx5_vfio_set_hca_ctrl(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_reg_host_endianness he_in = {};
+	struct mlx5_reg_host_endianness he_out = {};
+
+	he_in.he = MLX5_SET_HOST_ENDIANNESS;
+	return mlx5_vfio_access_reg(ctx, &he_in, sizeof(he_in),
+				    &he_out, sizeof(he_out),
+				    MLX5_REG_HOST_ENDIANNESS, 0, 1);
+}
+
+static int mlx5_vfio_init_hca(struct mlx5_vfio_context *ctx)
+{
+	uint32_t in[DEVX_ST_SZ_DW(init_hca_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(init_hca_out)] = {};
+
+	DEVX_SET(init_hca_in, in, opcode, MLX5_CMD_OP_INIT_HCA);
+	return mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+}
+
+static int fw_initializing(struct mlx5_init_seg *init_seg)
+{
+	return be32toh(init_seg->initializing) >> 31;
+}
+
+static int wait_fw_init(struct mlx5_init_seg *init_seg, uint32_t max_wait_mili)
+{
+	int num_loops = max_wait_mili / FW_INIT_WAIT_MS;
+	int loop = 0;
+
+	while (fw_initializing(init_seg)) {
+		usleep(FW_INIT_WAIT_MS * 1000);
+		loop++;
+		if (loop == num_loops) {
+			errno = EBUSY;
+			return errno;
+		}
+	}
+
+	return 0;
+}
+
+static int mlx5_vfio_teardown_hca(struct mlx5_vfio_context *ctx)
+{
+	uint32_t in[DEVX_ST_SZ_DW(teardown_hca_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(teardown_hca_out)] = {};
+
+	DEVX_SET(teardown_hca_in, in, opcode, MLX5_CMD_OP_TEARDOWN_HCA);
+	DEVX_SET(teardown_hca_in, in, profile, MLX5_TEARDOWN_HCA_IN_PROFILE_GRACEFUL_CLOSE);
+	return mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+}
+
+static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
+{
+	int err;
+
+	err = wait_fw_init(ctx->bar_map, FW_PRE_INIT_TIMEOUT_MILI);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_enable_hca(ctx);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_set_issi(ctx);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_satisfy_startup_pages(ctx, 1);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_set_hca_ctrl(ctx);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_satisfy_startup_pages(ctx, 0);
+	if (err)
+		return err;
+
+	err = mlx5_vfio_init_hca(ctx);
+	if (err)
+		return err;
+
+	return 0;
+}
+
 static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
 {
 	mlx5_close_debug_file(ctx->dbg_fp);
@@ -1000,6 +1240,7 @@ static void mlx5_vfio_free_context(struct ibv_context *ibctx)
 {
 	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
 
+	mlx5_vfio_teardown_hca(ctx);
 	mlx5_vfio_clean_cmd_interface(ctx);
 	mlx5_vfio_clean_device_dma(ctx);
 	mlx5_vfio_uninit_bar0(ctx);
@@ -1040,9 +1281,14 @@ mlx5_vfio_alloc_context(struct ibv_device *ibdev,
 	if (mlx5_vfio_init_cmd_interface(mctx))
 		goto err_dma;
 
+	if (mlx5_vfio_setup_function(mctx))
+		goto clean_cmd;
+
 	verbs_set_ops(&mctx->vctx, &mlx5_vfio_common_ops);
 	return &mctx->vctx;
 
+clean_cmd:
+	mlx5_vfio_clean_cmd_interface(mctx);
 err_dma:
 	mlx5_vfio_clean_device_dma(mctx);
 err_bar:
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 392ddcb..36b1f40 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -13,6 +13,9 @@
 #include <infiniband/driver.h>
 #include <util/interval_set.h>
 
+#define FW_INIT_WAIT_MS 2
+#define FW_PRE_INIT_TIMEOUT_MILI 120000
+
 enum {
 	MLX5_MAX_COMMANDS = 32,
 	MLX5_CMD_DATA_BLOCK_SIZE = 512,
@@ -32,6 +35,19 @@ struct mlx5_vfio_device {
 	uint32_t flags;
 };
 
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+#define MLX5_SET_HOST_ENDIANNESS 0
+#elif __BYTE_ORDER == __BIG_ENDIAN
+#define MLX5_SET_HOST_ENDIANNESS 0x80
+#else
+#error Host endianness not defined
+#endif
+
+struct mlx5_reg_host_endianness {
+	uint8_t he;
+	uint8_t rsvd[15];
+};
+
 struct health_buffer {
 	__be32		assert_var[5];
 	__be32		rsvd0[3];
diff --git a/providers/mlx5/mlx5dv.h b/providers/mlx5/mlx5dv.h
index e657527..6aaea37 100644
--- a/providers/mlx5/mlx5dv.h
+++ b/providers/mlx5/mlx5dv.h
@@ -1687,6 +1687,10 @@ static inline uint64_t _devx_get64(const void *p, size_t bit_off)
 
 #define DEVX_GET64(typ, p, fld) _devx_get64(p, __devx_bit_off(typ, fld))
 
+#define DEVX_ARRAY_SET64(typ, p, fld, idx, v) do { \
+	DEVX_SET64(typ, p, fld[idx], v); \
+} while (0)
+
 struct mlx5dv_dr_domain;
 struct mlx5dv_dr_table;
 struct mlx5dv_dr_matcher;
-- 
1.8.3.1



* [PATCH rdma-core 09/27] mlx5: vfio setup basic caps
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (7 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 08/27] mlx5: vfio setup function support Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 10/27] mlx5: Support fast teardown over vfio Yishai Hadas
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Set basic caps that are required to initialize the device properly.
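
Capabilities are queried into ctx->caps (current and maximum values) and
are then inspected through the MLX5_VFIO_CAP_* macros added below; for
example, RoCE is enabled on the vport only for Ethernet ports:

	if (MLX5_VFIO_CAP_GEN(ctx, port_type) == MLX5_CAP_PORT_TYPE_ETH)
		err = mlx5_vfio_nic_vport_update_roce_state(ctx,
							    MLX5_VPORT_ROCE_ENABLED);

Changed capabilities are written back with MLX5_CMD_OP_SET_HCA_CAP, see
handle_hca_cap() and handle_hca_cap_roce().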

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  |  87 ++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.c | 185 ++++++++++++++++++++++++++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.h |  21 +++++
 3 files changed, 290 insertions(+), 3 deletions(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index ac741cd..082ac1f 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -36,6 +36,7 @@
 #define u8 uint8_t
 
 enum mlx5_cap_mode {
+	HCA_CAP_OPMOD_GET_MAX = 0,
 	HCA_CAP_OPMOD_GET_CUR	= 1,
 };
 
@@ -46,6 +47,7 @@ enum {
 	MLX5_CMD_OP_ENABLE_HCA = 0x104,
 	MLX5_CMD_OP_QUERY_PAGES = 0x107,
 	MLX5_CMD_OP_MANAGE_PAGES = 0x108,
+	MLX5_CMD_OP_SET_HCA_CAP = 0x109,
 	MLX5_CMD_OP_QUERY_ISSI = 0x10a,
 	MLX5_CMD_OP_SET_ISSI = 0x10b,
 	MLX5_CMD_OP_CREATE_MKEY = 0x200,
@@ -61,6 +63,7 @@ enum {
 	MLX5_CMD_OP_QUERY_DCT = 0x713,
 	MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT = 0x752,
 	MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
+	MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT = 0x755,
 	MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
 	MLX5_CMD_OP_ACCESS_REG = 0x805,
 	MLX5_CMD_OP_QUERY_LAG = 0x842,
@@ -110,6 +113,11 @@ enum {
 	MLX5_REG_HOST_ENDIANNESS = 0x7004,
 };
 
+enum {
+	MLX5_CAP_PORT_TYPE_IB  = 0x0,
+	MLX5_CAP_PORT_TYPE_ETH = 0x1,
+};
+
 struct mlx5_ifc_atomic_caps_bits {
 	u8         reserved_at_0[0x40];
 
@@ -140,7 +148,8 @@ struct mlx5_ifc_atomic_caps_bits {
 };
 
 struct mlx5_ifc_roce_cap_bits {
-	u8         reserved_0[0x5];
+	u8         reserved_0[0x4];
+	u8         sw_r_roce_src_udp_port[0x1];
 	u8         fl_rc_qp_when_roce_disabled[0x1];
 	u8         fl_rc_qp_when_roce_enabled[0x1];
 	u8         reserved_at_7[0x17];
@@ -912,7 +921,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         uar_4k[0x1];
 	u8         reserved_at_241[0x9];
 	u8         uar_sz[0x6];
-	u8         reserved_at_250[0x3];
+	u8         reserved_at_250[0x2];
+	u8         umem_uid_0[0x1];
 	u8         log_max_dc_cnak_qps[0x5];
 	u8         log_pg_sz[0x8];
 
@@ -1339,8 +1349,11 @@ struct mlx5_ifc_query_hca_cap_in_bits {
 };
 
 enum mlx5_cap_type {
+	MLX5_CAP_GENERAL = 0,
 	MLX5_CAP_ODP = 2,
 	MLX5_CAP_ATOMIC = 3,
+	MLX5_CAP_ROCE,
+	MLX5_CAP_NUM,
 };
 
 enum {
@@ -4346,4 +4359,74 @@ struct mlx5_ifc_access_register_in_bits {
 	u8         register_data[][0x20];
 };
 
+struct mlx5_ifc_modify_nic_vport_context_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_modify_nic_vport_field_select_bits {
+	u8         reserved_at_0[0x12];
+	u8         affiliation[0x1];
+	u8         reserved_at_13[0x1];
+	u8         disable_uc_local_lb[0x1];
+	u8         disable_mc_local_lb[0x1];
+	u8         node_guid[0x1];
+	u8         port_guid[0x1];
+	u8         min_inline[0x1];
+	u8         mtu[0x1];
+	u8         change_event[0x1];
+	u8         promisc[0x1];
+	u8         permanent_address[0x1];
+	u8         addresses_list[0x1];
+	u8         roce_en[0x1];
+	u8         reserved_at_1f[0x1];
+};
+
+struct mlx5_ifc_modify_nic_vport_context_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	struct mlx5_ifc_modify_nic_vport_field_select_bits field_select;
+
+	u8         reserved_at_80[0x780];
+
+	struct mlx5_ifc_nic_vport_context_bits nic_vport_context;
+};
+
+struct mlx5_ifc_set_hca_cap_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_set_hca_cap_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         other_function[0x1];
+	u8         reserved_at_41[0xf];
+	u8         function_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	union mlx5_ifc_hca_cap_union_bits capability;
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 4d12807..bd128c2 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -1141,6 +1141,177 @@ out:
 	return err;
 }
 
+static int mlx5_vfio_get_caps_mode(struct mlx5_vfio_context *ctx,
+				   enum mlx5_cap_type cap_type,
+				   enum mlx5_cap_mode cap_mode)
+{
+	uint8_t in[DEVX_ST_SZ_BYTES(query_hca_cap_in)] = {};
+	int out_sz = DEVX_ST_SZ_BYTES(query_hca_cap_out);
+	void *out, *hca_caps;
+	uint16_t opmod = (cap_type << 1) | (cap_mode & 0x01);
+	int err;
+
+	out = calloc(1, out_sz);
+	if (!out) {
+		errno = ENOMEM;
+		return errno;
+	}
+
+	DEVX_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
+	DEVX_SET(query_hca_cap_in, in, op_mod, opmod);
+	err = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, out_sz, 0);
+	if (err)
+		goto query_ex;
+
+	hca_caps = DEVX_ADDR_OF(query_hca_cap_out, out, capability);
+
+	switch (cap_mode) {
+	case HCA_CAP_OPMOD_GET_MAX:
+		memcpy(ctx->caps.hca_max[cap_type], hca_caps,
+		       DEVX_UN_SZ_BYTES(hca_cap_union));
+		break;
+	case HCA_CAP_OPMOD_GET_CUR:
+		memcpy(ctx->caps.hca_cur[cap_type], hca_caps,
+		       DEVX_UN_SZ_BYTES(hca_cap_union));
+		break;
+	default:
+		err = EINVAL;
+		assert(false);
+		break;
+	}
+
+query_ex:
+	free(out);
+	return err;
+}
+
+enum mlx5_vport_roce_state {
+	MLX5_VPORT_ROCE_DISABLED = 0,
+	MLX5_VPORT_ROCE_ENABLED  = 1,
+};
+
+static int mlx5_vfio_nic_vport_update_roce_state(struct mlx5_vfio_context *ctx,
+						 enum mlx5_vport_roce_state state)
+{
+	uint32_t out[DEVX_ST_SZ_DW(modify_nic_vport_context_out)] = {};
+	int inlen = DEVX_ST_SZ_BYTES(modify_nic_vport_context_in);
+	void *in;
+	int err;
+
+	in = calloc(1, inlen);
+	if (!in) {
+		errno = ENOMEM;
+		return errno;
+	}
+
+	DEVX_SET(modify_nic_vport_context_in, in, field_select.roce_en, 1);
+	DEVX_SET(modify_nic_vport_context_in, in, nic_vport_context.roce_en,
+		 state);
+	DEVX_SET(modify_nic_vport_context_in, in, opcode,
+		 MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT);
+
+	err = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out), 0);
+
+	free(in);
+
+	return err;
+}
+
+static int mlx5_vfio_get_caps(struct mlx5_vfio_context *ctx, enum mlx5_cap_type cap_type)
+{
+	int ret;
+
+	ret = mlx5_vfio_get_caps_mode(ctx, cap_type, HCA_CAP_OPMOD_GET_CUR);
+	if (ret)
+		return ret;
+
+	return mlx5_vfio_get_caps_mode(ctx, cap_type, HCA_CAP_OPMOD_GET_MAX);
+}
+
+static int handle_hca_cap_roce(struct mlx5_vfio_context *ctx, void *set_ctx,
+			       int ctx_size)
+{
+	int err;
+	uint32_t out[DEVX_ST_SZ_DW(set_hca_cap_out)] = {};
+	void *set_hca_cap;
+
+	if (!MLX5_VFIO_CAP_GEN(ctx, roce))
+		return 0;
+
+	err = mlx5_vfio_get_caps(ctx, MLX5_CAP_ROCE);
+	if (err)
+		return err;
+
+	if (MLX5_VFIO_CAP_ROCE(ctx, sw_r_roce_src_udp_port) ||
+	    !MLX5_VFIO_CAP_ROCE_MAX(ctx, sw_r_roce_src_udp_port))
+		return 0;
+
+	set_hca_cap = DEVX_ADDR_OF(set_hca_cap_in, set_ctx, capability);
+	memcpy(set_hca_cap, ctx->caps.hca_cur[MLX5_CAP_ROCE],
+	       DEVX_ST_SZ_BYTES(roce_cap));
+	DEVX_SET(roce_cap, set_hca_cap, sw_r_roce_src_udp_port, 1);
+	DEVX_SET(set_hca_cap_in, set_ctx, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	DEVX_SET(set_hca_cap_in, set_ctx, op_mod, MLX5_SET_HCA_CAP_OP_MOD_ROCE);
+	return mlx5_vfio_cmd_exec(ctx, set_ctx, ctx_size, out, sizeof(out), 0);
+}
+
+static int handle_hca_cap(struct mlx5_vfio_context *ctx, void *set_ctx, int set_sz)
+{
+	struct mlx5_vfio_device *dev = to_mvfio_dev(ctx->vctx.context.device);
+	int sys_page_shift = ilog32(dev->page_size - 1);
+	uint32_t out[DEVX_ST_SZ_DW(set_hca_cap_out)] = {};
+	void *set_hca_cap;
+	int err;
+
+	err = mlx5_vfio_get_caps(ctx, MLX5_CAP_GENERAL);
+	if (err)
+		return err;
+
+	set_hca_cap = DEVX_ADDR_OF(set_hca_cap_in, set_ctx,
+				   capability);
+	memcpy(set_hca_cap, ctx->caps.hca_cur[MLX5_CAP_GENERAL],
+	       DEVX_ST_SZ_BYTES(cmd_hca_cap));
+
+	/* disable cmdif checksum */
+	DEVX_SET(cmd_hca_cap, set_hca_cap, cmdif_checksum, 0);
+
+	if (dev->flags & MLX5DV_VFIO_CTX_FLAGS_INIT_LINK_DOWN)
+		DEVX_SET(cmd_hca_cap, set_hca_cap, disable_link_up_by_init_hca, 1);
+
+	DEVX_SET(cmd_hca_cap, set_hca_cap, log_uar_page_sz, sys_page_shift - 12);
+
+	if (MLX5_VFIO_CAP_GEN_MAX(ctx, mkey_by_name))
+		DEVX_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
+
+	DEVX_SET(set_hca_cap_in, set_ctx, opcode, MLX5_CMD_OP_SET_HCA_CAP);
+	DEVX_SET(set_hca_cap_in, set_ctx, op_mod, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
+
+	return mlx5_vfio_cmd_exec(ctx, set_ctx, set_sz, out, sizeof(out), 0);
+}
+
+static int set_hca_cap(struct mlx5_vfio_context *ctx)
+{
+	int set_sz = DEVX_ST_SZ_BYTES(set_hca_cap_in);
+	void *set_ctx;
+	int err;
+
+	set_ctx = calloc(1, set_sz);
+	if (!set_ctx) {
+		errno = ENOMEM;
+		return errno;
+	}
+
+	err = handle_hca_cap(ctx, set_ctx, set_sz);
+	if (err)
+		goto out;
+
+	memset(set_ctx, 0, set_sz);
+	err = handle_hca_cap_roce(ctx, set_ctx, set_sz);
+out:
+	free(set_ctx);
+	return err;
+}
+
 static int mlx5_vfio_set_hca_ctrl(struct mlx5_vfio_context *ctx)
 {
 	struct mlx5_reg_host_endianness he_in = {};
@@ -1217,6 +1388,15 @@ static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
 	if (err)
 		return err;
 
+	err = set_hca_cap(ctx);
+	if (err)
+		return err;
+
+	if (!MLX5_VFIO_CAP_GEN(ctx, umem_uid_0)) {
+		errno = EOPNOTSUPP;
+		return errno;
+	}
+
 	err = mlx5_vfio_satisfy_startup_pages(ctx, 0);
 	if (err)
 		return err;
@@ -1225,7 +1405,10 @@ static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
 	if (err)
 		return err;
 
-	return 0;
+	if (MLX5_VFIO_CAP_GEN(ctx, port_type) == MLX5_CAP_PORT_TYPE_ETH)
+		err = mlx5_vfio_nic_vport_update_roce_state(ctx, MLX5_VPORT_ROCE_ENABLED);
+
+	return err;
 }
 
 static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 36b1f40..225c1b9 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -12,6 +12,7 @@
 
 #include <infiniband/driver.h>
 #include <util/interval_set.h>
+#include "mlx5_ifc.h"
 
 #define FW_INIT_WAIT_MS 2
 #define FW_PRE_INIT_TIMEOUT_MILI 120000
@@ -43,6 +44,22 @@ struct mlx5_vfio_device {
 #error Host endianness not defined
 #endif
 
+/* GET Dev Caps macros */
+#define MLX5_VFIO_CAP_GEN(ctx, cap) \
+	DEVX_GET(cmd_hca_cap, ctx->caps.hca_cur[MLX5_CAP_GENERAL], cap)
+
+#define MLX5_VFIO_CAP_GEN_64(mdev, cap) \
+	DEVX_GET64(cmd_hca_cap, mdev->caps.hca_cur[MLX5_CAP_GENERAL], cap)
+
+#define MLX5_VFIO_CAP_GEN_MAX(ctx, cap) \
+	DEVX_GET(cmd_hca_cap, ctx->caps.hca_max[MLX5_CAP_GENERAL], cap)
+
+#define MLX5_VFIO_CAP_ROCE(ctx, cap) \
+	DEVX_GET(roce_cap, ctx->caps.hca_cur[MLX5_CAP_ROCE], cap)
+
+#define MLX5_VFIO_CAP_ROCE_MAX(ctx, cap) \
+	DEVX_GET(roce_cap, ctx->caps.hca_max[MLX5_CAP_ROCE], cap)
+
 struct mlx5_reg_host_endianness {
 	uint8_t he;
 	uint8_t rsvd[15];
@@ -162,6 +179,10 @@ struct mlx5_vfio_context {
 	size_t bar_map_size;
 	struct mlx5_vfio_cmd cmd;
 	bool have_eq;
+	struct {
+		uint32_t hca_cur[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
+		uint32_t hca_max[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
+	} caps;
 };
 
 static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
-- 
1.8.3.1



* [PATCH rdma-core 10/27] mlx5: Support fast teardown over vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (8 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 09/27] mlx5: vfio setup basic caps Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 11/27] mlx5: Enable interrupt command mode " Yishai Hadas
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Mark Zhang <markzhang@nvidia.com>

Add vfio fast teardown support; if it fails, fall back to the regular
teardown.
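
Fast teardown first asks the firmware to prepare, then forces the NIC
interface state to disabled through the init segment; a short sketch of
the flow implemented below:

	/* TEARDOWN_HCA with the PREPARE_FAST_TEARDOWN profile */
	DEVX_SET(teardown_hca_in, in, profile,
		 MLX5_TEARDOWN_HCA_IN_PROFILE_PREPARE_FAST_TEARDOWN);

	/* then disable the NIC interface and poll for completion */
	mlx5_vfio_set_nic_state(ctx, MLX5_NIC_IFC_DISABLED);
	/* mlx5_vfio_get_nic_state() is polled for up to
	 * MLX5_FAST_TEARDOWN_WAIT_MS before giving up
	 */

If any of these steps fails, the regular TEARDOWN_HCA profile is used
instead.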

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  |  5 +++
 providers/mlx5/mlx5_vfio.c | 76 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 082ac1f..4b7a4c2 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -4286,6 +4286,10 @@ struct mlx5_ifc_manage_pages_in_bits {
 	u8         pas[][0x40];
 };
 
+enum {
+	MLX5_TEARDOWN_HCA_OUT_FORCE_STATE_FAIL = 0x1,
+};
+
 struct mlx5_ifc_teardown_hca_out_bits {
 	u8         status[0x8];
 	u8         reserved_at_8[0x18];
@@ -4299,6 +4303,7 @@ struct mlx5_ifc_teardown_hca_out_bits {
 
 enum {
 	MLX5_TEARDOWN_HCA_IN_PROFILE_GRACEFUL_CLOSE = 0x0,
+	MLX5_TEARDOWN_HCA_IN_PROFILE_PREPARE_FAST_TEARDOWN = 0x2,
 };
 
 struct mlx5_ifc_teardown_hca_in_bits {
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index bd128c2..97d3ce6 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -1354,7 +1354,7 @@ static int wait_fw_init(struct mlx5_init_seg *init_seg, uint32_t max_wait_mili)
 	return 0;
 }
 
-static int mlx5_vfio_teardown_hca(struct mlx5_vfio_context *ctx)
+static int mlx5_vfio_teardown_hca_regular(struct mlx5_vfio_context *ctx)
 {
 	uint32_t in[DEVX_ST_SZ_DW(teardown_hca_in)] = {};
 	uint32_t out[DEVX_ST_SZ_DW(teardown_hca_out)] = {};
@@ -1364,6 +1364,80 @@ static int mlx5_vfio_teardown_hca(struct mlx5_vfio_context *ctx)
 	return mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
 }
 
+enum mlx5_cmd_addr_l_sz_offset {
+	MLX5_NIC_IFC_OFFSET = 8,
+};
+
+enum {
+	MLX5_NIC_IFC_DISABLED = 1,
+};
+
+static uint8_t mlx5_vfio_get_nic_state(struct mlx5_vfio_context *ctx)
+{
+	return (be32toh(mmio_read32_be(&ctx->bar_map->cmdq_addr_l_sz)) >> 8) & 7;
+}
+
+static void mlx5_vfio_set_nic_state(struct mlx5_vfio_context *ctx, uint8_t state)
+{
+	uint32_t cur_cmdq_addr_l_sz;
+
+	cur_cmdq_addr_l_sz = be32toh(mmio_read32_be(&ctx->bar_map->cmdq_addr_l_sz));
+	mmio_write32_be(&ctx->bar_map->cmdq_addr_l_sz,
+			htobe32((cur_cmdq_addr_l_sz & 0xFFFFF000) |
+				state << MLX5_NIC_IFC_OFFSET));
+}
+
+#define MLX5_FAST_TEARDOWN_WAIT_MS 3000
+#define MLX5_FAST_TEARDOWN_WAIT_ONCE_MS 1
+static int mlx5_vfio_teardown_hca_fast(struct mlx5_vfio_context *ctx)
+{
+	uint32_t out[DEVX_ST_SZ_DW(teardown_hca_out)] = {};
+	uint32_t in[DEVX_ST_SZ_DW(teardown_hca_in)] = {};
+	int waited = 0, state, ret;
+
+	DEVX_SET(teardown_hca_in, in, opcode, MLX5_CMD_OP_TEARDOWN_HCA);
+	DEVX_SET(teardown_hca_in, in, profile,
+		 MLX5_TEARDOWN_HCA_IN_PROFILE_PREPARE_FAST_TEARDOWN);
+	ret = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	if (ret)
+		return ret;
+
+	state = DEVX_GET(teardown_hca_out, out, state);
+	if (state == MLX5_TEARDOWN_HCA_OUT_FORCE_STATE_FAIL) {
+		mlx5_err(ctx->dbg_fp, "teardown with fast mode failed\n");
+		return EIO;
+	}
+
+	mlx5_vfio_set_nic_state(ctx, MLX5_NIC_IFC_DISABLED);
+	do {
+		if (mlx5_vfio_get_nic_state(ctx) == MLX5_NIC_IFC_DISABLED)
+			break;
+		usleep(MLX5_FAST_TEARDOWN_WAIT_ONCE_MS * 1000);
+		waited += MLX5_FAST_TEARDOWN_WAIT_ONCE_MS;
+	} while (waited < MLX5_FAST_TEARDOWN_WAIT_MS);
+
+	if (mlx5_vfio_get_nic_state(ctx) != MLX5_NIC_IFC_DISABLED) {
+		mlx5_err(ctx->dbg_fp, "NIC IFC still %d after %ums.\n",
+			 mlx5_vfio_get_nic_state(ctx), waited);
+		return EIO;
+	}
+
+	return 0;
+}
+
+static int mlx5_vfio_teardown_hca(struct mlx5_vfio_context *ctx)
+{
+	int err;
+
+	if (MLX5_VFIO_CAP_GEN(ctx, fast_teardown)) {
+		err = mlx5_vfio_teardown_hca_fast(ctx);
+		if (!err)
+			return 0;
+	}
+
+	return mlx5_vfio_teardown_hca_regular(ctx);
+}
+
 static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
 {
 	int err;
-- 
1.8.3.1



* [PATCH rdma-core 11/27] mlx5: Enable interrupt command mode over vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (9 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 10/27] mlx5: Support fast teardown over vfio Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 12/27] mlx5: Introduce vfio APIs to process events Yishai Hadas
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Enable interrupt command mode over vfio by creating an EQ and the
related device resources (UAR, MSI-X vector).
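
A sketch of the wiring done below: MSI-X vector 0 is routed to an eventfd
through VFIO, the async EQ is created with the command-completion event
enabled, and mlx5_vfio_cmd_exec() then waits on the per-slot eventfd
instead of busy polling:

	/* VFIO_DEVICE_SET_IRQS routes the command-completion vector */
	fd_ptr[MLX5_VFIO_CMD_VEC_IDX] = ctx->cmd_comp_fd;

	/* once ctx->have_eq is set, command completion is event driven */
	if (ctx->have_eq)
		err = mlx5_vfio_wait_event(ctx, slot);
	else
		err = mlx5_vfio_poll_timeout(cmd_lay);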

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  | 150 ++++++++++++++++++
 providers/mlx5/mlx5_vfio.c | 373 ++++++++++++++++++++++++++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.h |  65 ++++++++
 3 files changed, 582 insertions(+), 6 deletions(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 4b7a4c2..2129779 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -51,6 +51,8 @@ enum {
 	MLX5_CMD_OP_QUERY_ISSI = 0x10a,
 	MLX5_CMD_OP_SET_ISSI = 0x10b,
 	MLX5_CMD_OP_CREATE_MKEY = 0x200,
+	MLX5_CMD_OP_CREATE_EQ = 0x301,
+	MLX5_CMD_OP_DESTROY_EQ = 0x302,
 	MLX5_CMD_OP_CREATE_QP = 0x500,
 	MLX5_CMD_OP_RST2INIT_QP = 0x502,
 	MLX5_CMD_OP_INIT2RTR_QP = 0x503,
@@ -65,6 +67,8 @@ enum {
 	MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
 	MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT = 0x755,
 	MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
+	MLX5_CMD_OP_ALLOC_UAR = 0x802,
+	MLX5_CMD_OP_DEALLOC_UAR = 0x803,
 	MLX5_CMD_OP_ACCESS_REG = 0x805,
 	MLX5_CMD_OP_QUERY_LAG = 0x842,
 	MLX5_CMD_OP_CREATE_TIR = 0x900,
@@ -118,6 +122,15 @@ enum {
 	MLX5_CAP_PORT_TYPE_ETH = 0x1,
 };
 
+enum mlx5_event {
+	MLX5_EVENT_TYPE_CMD = 0x0a,
+	MLX5_EVENT_TYPE_PAGE_REQUEST = 0xb,
+};
+
+enum {
+	MLX5_EQ_DOORBEL_OFFSET = 0x40,
+};
+
 struct mlx5_ifc_atomic_caps_bits {
 	u8         reserved_at_0[0x40];
 
@@ -4434,4 +4447,141 @@ struct mlx5_ifc_set_hca_cap_in_bits {
 	union mlx5_ifc_hca_cap_union_bits capability;
 };
 
+struct mlx5_ifc_alloc_uar_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         uar[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_alloc_uar_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_dealloc_uar_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_dealloc_uar_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x8];
+	u8         uar[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_eqc_bits {
+	u8         status[0x4];
+	u8         reserved_at_4[0x9];
+	u8         ec[0x1];
+	u8         oi[0x1];
+	u8         reserved_at_f[0x5];
+	u8         st[0x4];
+	u8         reserved_at_18[0x8];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x14];
+	u8         page_offset[0x6];
+	u8         reserved_at_5a[0x6];
+
+	u8         reserved_at_60[0x3];
+	u8         log_eq_size[0x5];
+	u8         uar_page[0x18];
+
+	u8         reserved_at_80[0x20];
+
+	u8         reserved_at_a0[0x18];
+	u8         intr[0x8];
+
+	u8         reserved_at_c0[0x3];
+	u8         log_page_size[0x5];
+	u8         reserved_at_c8[0x18];
+
+	u8         reserved_at_e0[0x60];
+
+	u8         reserved_at_140[0x8];
+	u8         consumer_counter[0x18];
+
+	u8         reserved_at_160[0x8];
+	u8         producer_counter[0x18];
+
+	u8         reserved_at_180[0x80];
+};
+
+struct mlx5_ifc_create_eq_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x18];
+	u8         eq_number[0x8];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_eq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+
+	struct mlx5_ifc_eqc_bits eq_context_entry;
+
+	u8         reserved_at_280[0x40];
+
+	u8         event_bitmask[4][0x40];
+
+	u8         reserved_at_3c0[0x4c0];
+
+	u8         pas[][0x40];
+};
+
+struct mlx5_ifc_destroy_eq_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_destroy_eq_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x18];
+	u8         eq_number[0x8];
+
+	u8         reserved_at_60[0x20];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 97d3ce6..dbb9858 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -19,6 +19,7 @@
 #include <linux/vfio.h>
 #include <sys/eventfd.h>
 #include <sys/ioctl.h>
+#include <poll.h>
 #include <util/mmio.h>
 
 #include "mlx5dv.h"
@@ -26,6 +27,10 @@
 #include "mlx5.h"
 #include "mlx5_ifc.h"
 
+enum {
+	MLX5_VFIO_CMD_VEC_IDX,
+};
+
 static void mlx5_vfio_free_cmd_msg(struct mlx5_vfio_context *ctx,
 				   struct mlx5_cmd_msg *msg);
 
@@ -223,6 +228,37 @@ static const char *cmd_status_str(uint8_t status)
 	}
 }
 
+static struct mlx5_eqe *get_eqe(struct mlx5_eq *eq, uint32_t entry)
+{
+	return eq->vaddr + entry * MLX5_EQE_SIZE;
+}
+
+static struct mlx5_eqe *mlx5_eq_get_eqe(struct mlx5_eq *eq, uint32_t cc)
+{
+	uint32_t ci = eq->cons_index + cc;
+	struct mlx5_eqe *eqe;
+
+	eqe = get_eqe(eq, ci & (eq->nent - 1));
+	eqe = ((eqe->owner & 1) ^ !!(ci & eq->nent)) ? NULL : eqe;
+
+	if (eqe)
+		udma_from_device_barrier();
+
+	return eqe;
+}
+
+static void eq_update_ci(struct mlx5_eq *eq, uint32_t cc, int arm)
+{
+	__be32 *addr = eq->doorbell + (arm ? 0 : 2);
+	uint32_t val;
+
+	eq->cons_index += cc;
+	val = (eq->cons_index & 0xffffff) | (eq->eqn << 24);
+
+	mmio_write32_be(addr, htobe32(val));
+	udma_to_device_barrier();
+}
+
 static void mlx5_cmd_mbox_status(void *out, uint8_t *status, uint32_t *syndrome)
 {
 	*status = DEVX_GET(mbox_out, out, status);
@@ -315,6 +351,85 @@ static int mlx5_copy_to_msg(struct mlx5_cmd_msg *to, void *from, int size,
 	return 0;
 }
 
+/* The HCA will think the queue has overflowed if we don't tell it we've been
+ * processing events.
+ * We create EQs with MLX5_NUM_SPARE_EQE extra entries,
+ * so we must update our consumer index at least that often.
+ */
+static inline uint32_t mlx5_eq_update_cc(struct mlx5_eq *eq, uint32_t cc)
+{
+	if (unlikely(cc >= MLX5_NUM_SPARE_EQE)) {
+		eq_update_ci(eq, cc, 0);
+		cc = 0;
+	}
+	return cc;
+}
+
+static int mlx5_vfio_cmd_comp(struct mlx5_vfio_context *ctx, unsigned long slot)
+{
+	uint64_t u = 1;
+	ssize_t s;
+
+	s = write(ctx->cmd.cmds[slot].completion_event_fd, &u,
+		  sizeof(uint64_t));
+	if (s != sizeof(uint64_t))
+		return -1;
+
+	return 0;
+}
+
+static int mlx5_vfio_process_cmd_eqe(struct mlx5_vfio_context *ctx,
+				     struct mlx5_eqe *eqe)
+{
+	struct mlx5_eqe_cmd *cmd_eqe = &eqe->data.cmd;
+	unsigned long vector = be32toh(cmd_eqe->vector);
+	unsigned long slot;
+	int count = 0;
+	int ret;
+
+	for (slot = 0; slot < MLX5_MAX_COMMANDS; slot++) {
+		if (vector & (1 << slot)) {
+			assert(ctx->cmd.cmds[slot].comp_func);
+			ret = ctx->cmd.cmds[slot].comp_func(ctx, slot);
+			if (ret)
+				return ret;
+
+			vector &= ~(1 << slot);
+			count++;
+		}
+	}
+
+	assert(!vector && count);
+	return 0;
+}
+
+static int mlx5_vfio_process_async_events(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_eqe *eqe;
+	int ret = 0;
+	int cc = 0;
+
+	pthread_mutex_lock(&ctx->eq_lock);
+	while ((eqe = mlx5_eq_get_eqe(&ctx->async_eq, cc))) {
+		switch (eqe->type) {
+		case MLX5_EVENT_TYPE_CMD:
+			ret = mlx5_vfio_process_cmd_eqe(ctx, eqe);
+			break;
+		default:
+			break;
+		}
+
+		cc = mlx5_eq_update_cc(&ctx->async_eq, ++cc);
+		if (ret)
+			goto out;
+	}
+
+out:
+	eq_update_ci(&ctx->async_eq, cc, 1);
+	pthread_mutex_unlock(&ctx->eq_lock);
+	return ret;
+}
+
 static int mlx5_vfio_enlarge_cmd_msg(struct mlx5_vfio_context *ctx, struct mlx5_cmd_msg *cmd_msg,
 				     struct mlx5_cmd_layout *cmd_lay, uint32_t len, bool is_in)
 {
@@ -333,6 +448,49 @@ static int mlx5_vfio_enlarge_cmd_msg(struct mlx5_vfio_context *ctx, struct mlx5_
 	return 0;
 }
 
+static int mlx5_vfio_wait_event(struct mlx5_vfio_context *ctx,
+				unsigned int slot)
+{
+	struct mlx5_cmd_layout *cmd_lay = ctx->cmd.cmds[slot].lay;
+	uint64_t u;
+	ssize_t s;
+	int err;
+
+	struct pollfd fds[2] = {
+		{ .fd = ctx->cmd_comp_fd, .events = POLLIN },
+		{ .fd = ctx->cmd.cmds[slot].completion_event_fd, .events = POLLIN }
+		};
+
+	while (true) {
+		err = poll(fds, 2, -1);
+		if (err < 0 && errno != EAGAIN) {
+			mlx5_err(ctx->dbg_fp, "mlx5_vfio_wait_event, poll failed, errno=%d\n", errno);
+			return errno;
+		}
+		if (fds[0].revents & POLLIN) {
+			s = read(fds[0].fd, &u, sizeof(uint64_t));
+			if (s < 0 && errno != EAGAIN) {
+				mlx5_err(ctx->dbg_fp, "mlx5_vfio_wait_event, read failed, errno=%d\n", errno);
+				return errno;
+			}
+
+			err = mlx5_vfio_process_async_events(ctx);
+			if (err)
+				return err;
+		}
+		if (fds[1].revents & POLLIN) {
+			s = read(fds[1].fd, &u, sizeof(uint64_t));
+			if (s < 0 && errno != EAGAIN) {
+				mlx5_err(ctx->dbg_fp, "mlx5_vfio_wait_event, read failed, slot=%d, errno=%d\n",
+					 slot, errno);
+				return errno;
+			}
+			if (!(mmio_read8(&cmd_lay->status_own) & 0x1))
+				return 0;
+		}
+	}
+}
+
 /* One minute for the sake of bringup */
 #define MLX5_CMD_TIMEOUT_MSEC (60 * 1000)
 
@@ -430,10 +588,17 @@ static int mlx5_vfio_cmd_exec(struct mlx5_vfio_context *ctx, void *in,
 	udma_to_device_barrier();
 	mmio_write32_be(&init_seg->cmd_dbell, htobe32(0x1 << slot));
 
-	err = mlx5_vfio_poll_timeout(cmd_lay);
-	if (err)
-		goto end;
-	udma_from_device_barrier();
+	if (ctx->have_eq) {
+		err = mlx5_vfio_wait_event(ctx, slot);
+		if (err)
+			goto end;
+	} else {
+		err = mlx5_vfio_poll_timeout(cmd_lay);
+		if (err)
+			goto end;
+		udma_from_device_barrier();
+	}
+
 	err = mlx5_copy_from_msg(out, cmd_out, olen, cmd_lay);
 	if (err)
 		goto end;
@@ -608,6 +773,9 @@ static int mlx5_vfio_setup_cmd_slot(struct mlx5_vfio_context *ctx, int slot)
 		goto err_fd;
 	}
 
+	if (slot != MLX5_MAX_COMMANDS - 1)
+		cmd_slot->comp_func = mlx5_vfio_cmd_comp;
+
 	pthread_mutex_init(&cmd_slot->lock, NULL);
 
 	return 0;
@@ -889,7 +1057,7 @@ mlx5_vfio_enable_msix(struct mlx5_vfio_context *ctx)
 	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
 	irq_set->start = 0;
 	fd_ptr = (int *)&irq_set->data;
-	fd_ptr[0] = ctx->cmd_comp_fd;
+	fd_ptr[MLX5_VFIO_CMD_VEC_IDX] = ctx->cmd_comp_fd;
 
 	return ioctl(ctx->device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
 }
@@ -907,7 +1075,7 @@ static int mlx5_vfio_init_async_fd(struct mlx5_vfio_context *ctx)
 		return -1;
 
 	/* set up an eventfd for command completion interrupts */
-	ctx->cmd_comp_fd = eventfd(0, EFD_CLOEXEC);
+	ctx->cmd_comp_fd = eventfd(0, EFD_CLOEXEC | O_NONBLOCK);
 	if (ctx->cmd_comp_fd < 0)
 		return -1;
 
@@ -988,6 +1156,193 @@ close_cont:
 	return -1;
 }
 
+enum {
+	MLX5_EQE_OWNER_INIT_VAL = 0x1,
+};
+
+static void init_eq_buf(struct mlx5_eq *eq)
+{
+	struct mlx5_eqe *eqe;
+	int i;
+
+	for (i = 0; i < eq->nent; i++) {
+		eqe = get_eqe(eq, i);
+		eqe->owner = MLX5_EQE_OWNER_INIT_VAL;
+	}
+}
+
+static uint64_t uar2iova(struct mlx5_vfio_context *ctx, uint32_t index)
+{
+	return (uint64_t)((void *)ctx->bar_map + (index * MLX5_ADAPTER_PAGE_SIZE));
+}
+
+static int mlx5_vfio_alloc_uar(struct mlx5_vfio_context *ctx, uint32_t *uarn)
+{
+	uint32_t out[DEVX_ST_SZ_DW(alloc_uar_out)] = {};
+	uint32_t in[DEVX_ST_SZ_DW(alloc_uar_in)] = {};
+	int err;
+
+	DEVX_SET(alloc_uar_in, in, opcode, MLX5_CMD_OP_ALLOC_UAR);
+	err = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	if (!err)
+		*uarn = DEVX_GET(alloc_uar_out, out, uar);
+
+	return err;
+}
+
+static void mlx5_vfio_dealloc_uar(struct mlx5_vfio_context *ctx, uint32_t uarn)
+{
+	uint32_t out[DEVX_ST_SZ_DW(dealloc_uar_out)] = {};
+	uint32_t in[DEVX_ST_SZ_DW(dealloc_uar_in)] = {};
+
+	DEVX_SET(dealloc_uar_in, in, opcode, MLX5_CMD_OP_DEALLOC_UAR);
+	DEVX_SET(dealloc_uar_in, in, uar, uarn);
+	mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+}
+
+static void mlx5_vfio_destroy_eq(struct mlx5_vfio_context *ctx, struct mlx5_eq *eq)
+{
+	uint32_t in[DEVX_ST_SZ_DW(destroy_eq_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(destroy_eq_out)] = {};
+
+	DEVX_SET(destroy_eq_in, in, opcode, MLX5_CMD_OP_DESTROY_EQ);
+	DEVX_SET(destroy_eq_in, in, eq_number, eq->eqn);
+
+	mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	mlx5_vfio_unregister_mem(ctx, eq->iova, eq->iova_size);
+	iset_insert_range(ctx->iova_alloc, eq->iova, eq->iova_size);
+	free(eq->vaddr);
+}
+
+static void destroy_async_eqs(struct mlx5_vfio_context *ctx)
+{
+	ctx->have_eq = false;
+	mlx5_vfio_destroy_eq(ctx, &ctx->async_eq);
+	mlx5_vfio_dealloc_uar(ctx, ctx->eqs_uar.uarn);
+}
+
+static int
+create_map_eq(struct mlx5_vfio_context *ctx, struct mlx5_eq *eq,
+	      struct mlx5_eq_param *param)
+{
+	uint32_t out[DEVX_ST_SZ_DW(create_eq_out)] = {};
+	uint8_t vecidx = param->irq_index;
+	__be64 *pas;
+	void *eqc;
+	int inlen;
+	uint32_t *in;
+	int err;
+	int i;
+	int alloc_size;
+
+	pthread_mutex_init(&ctx->eq_lock, NULL);
+	eq->nent = roundup_pow_of_two(param->nent + MLX5_NUM_SPARE_EQE);
+	eq->cons_index = 0;
+	alloc_size = eq->nent * MLX5_EQE_SIZE;
+	eq->iova_size = max(roundup_pow_of_two(alloc_size), ctx->iova_min_page_size);
+
+	inlen = DEVX_ST_SZ_BYTES(create_eq_in) +
+		DEVX_FLD_SZ_BYTES(create_eq_in, pas[0]) * 1;
+
+	in = calloc(1, inlen);
+	if (!in)
+		return ENOMEM;
+
+	pas = (__be64 *)DEVX_ADDR_OF(create_eq_in, in, pas);
+
+	err = posix_memalign(&eq->vaddr, eq->iova_size, alloc_size);
+	if (err) {
+		errno = err;
+		goto end;
+	}
+
+	err = iset_alloc_range(ctx->iova_alloc, eq->iova_size, &eq->iova);
+	if (err)
+		goto err_range;
+
+	err = mlx5_vfio_register_mem(ctx, eq->vaddr, eq->iova, eq->iova_size);
+	if (err)
+		goto err_reg;
+
+	pas[0] = htobe64(eq->iova);
+	init_eq_buf(eq);
+	DEVX_SET(create_eq_in, in, opcode, MLX5_CMD_OP_CREATE_EQ);
+
+	for (i = 0; i < 4; i++)
+		DEVX_ARRAY_SET64(create_eq_in, in, event_bitmask, i,
+				 param->mask[i]);
+
+	eqc = DEVX_ADDR_OF(create_eq_in, in, eq_context_entry);
+	DEVX_SET(eqc, eqc, log_eq_size, ilog32(eq->nent - 1));
+	DEVX_SET(eqc, eqc, uar_page, ctx->eqs_uar.uarn);
+	DEVX_SET(eqc, eqc, intr, vecidx);
+	DEVX_SET(eqc, eqc, log_page_size, ilog32(eq->iova_size - 1) - MLX5_ADAPTER_PAGE_SHIFT);
+
+	err = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out), 0);
+	if (err)
+		goto err_cmd;
+
+	eq->vecidx = vecidx;
+	eq->eqn = DEVX_GET(create_eq_out, out, eq_number);
+	eq->doorbell = (void *)ctx->eqs_uar.iova + MLX5_EQ_DOORBEL_OFFSET;
+
+	free(in);
+	return 0;
+
+err_cmd:
+	mlx5_vfio_unregister_mem(ctx, eq->iova, eq->iova_size);
+err_reg:
+	iset_insert_range(ctx->iova_alloc, eq->iova, eq->iova_size);
+err_range:
+	free(eq->vaddr);
+end:
+	free(in);
+	return err;
+}
+
+static int
+setup_async_eq(struct mlx5_vfio_context *ctx, struct mlx5_eq_param *param,
+	       struct mlx5_eq *eq)
+{
+	int err;
+
+	err = create_map_eq(ctx, eq, param);
+	if (err)
+		return err;
+
+	eq_update_ci(eq, 0, 1);
+
+	return 0;
+}
+
+static int create_async_eqs(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_eq_param param = {};
+	int err;
+
+	err = mlx5_vfio_alloc_uar(ctx, &ctx->eqs_uar.uarn);
+	if (err)
+		return err;
+
+	ctx->eqs_uar.iova = uar2iova(ctx, ctx->eqs_uar.uarn);
+
+	param = (struct mlx5_eq_param) {
+		.irq_index = MLX5_VFIO_CMD_VEC_IDX,
+		.nent = MLX5_NUM_CMD_EQE,
+		.mask[0] = 1ull << MLX5_EVENT_TYPE_CMD,
+	};
+
+	err = setup_async_eq(ctx, &param, &ctx->async_eq);
+	if (err)
+		goto err;
+
+	ctx->have_eq = true;
+	return 0;
+err:
+	mlx5_vfio_dealloc_uar(ctx, ctx->eqs_uar.uarn);
+	return err;
+}
+
 static int mlx5_vfio_enable_hca(struct mlx5_vfio_context *ctx)
 {
 	uint32_t in[DEVX_ST_SZ_DW(enable_hca_in)] = {};
@@ -1497,6 +1852,7 @@ static void mlx5_vfio_free_context(struct ibv_context *ibctx)
 {
 	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
 
+	destroy_async_eqs(ctx);
 	mlx5_vfio_teardown_hca(ctx);
 	mlx5_vfio_clean_cmd_interface(ctx);
 	mlx5_vfio_clean_device_dma(ctx);
@@ -1541,9 +1897,14 @@ mlx5_vfio_alloc_context(struct ibv_device *ibdev,
 	if (mlx5_vfio_setup_function(mctx))
 		goto clean_cmd;
 
+	if (create_async_eqs(mctx))
+		goto func_teardown;
+
 	verbs_set_ops(&mctx->vctx, &mlx5_vfio_common_ops);
 	return &mctx->vctx;
 
+func_teardown:
+	mlx5_vfio_teardown_hca(mctx);
 clean_cmd:
 	mlx5_vfio_clean_cmd_interface(mctx);
 err_dma:
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 225c1b9..449a5c5 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -60,6 +60,8 @@ struct mlx5_vfio_device {
 #define MLX5_VFIO_CAP_ROCE_MAX(ctx, cap) \
 	DEVX_GET(roce_cap, ctx->caps.hca_max[MLX5_CAP_ROCE], cap)
 
+struct mlx5_vfio_context;
+
 struct mlx5_reg_host_endianness {
 	uint8_t he;
 	uint8_t rsvd[15];
@@ -149,12 +151,16 @@ struct mlx5_cmd_msg {
 	struct mlx5_cmd_mailbox *next;
 };
 
+typedef int (*vfio_cmd_slot_comp)(struct mlx5_vfio_context *ctx,
+				  unsigned long slot);
+
 struct mlx5_vfio_cmd_slot {
 	struct mlx5_cmd_layout *lay;
 	struct mlx5_cmd_msg in;
 	struct mlx5_cmd_msg out;
 	pthread_mutex_t lock;
 	int completion_event_fd;
+	vfio_cmd_slot_comp comp_func;
 };
 
 struct mlx5_vfio_cmd {
@@ -165,6 +171,62 @@ struct mlx5_vfio_cmd {
 	struct mlx5_vfio_cmd_slot cmds[MLX5_MAX_COMMANDS];
 };
 
+struct mlx5_eq_param {
+	uint8_t irq_index;
+	int nent;
+	uint64_t mask[4];
+};
+
+struct mlx5_eq {
+	__be32 *doorbell;
+	uint32_t cons_index;
+	unsigned int vecidx;
+	uint8_t eqn;
+	int nent;
+	void *vaddr;
+	uint64_t iova;
+	uint64_t iova_size;
+};
+
+struct mlx5_eqe_cmd {
+	__be32 vector;
+	__be32 rsvd[6];
+};
+
+struct mlx5_eqe_page_req {
+	__be16 ec_function;
+	__be16 func_id;
+	__be32 num_pages;
+	__be32 rsvd1[5];
+};
+
+union ev_data {
+	__be32 raw[7];
+	struct mlx5_eqe_cmd cmd;
+	struct mlx5_eqe_page_req req_pages;
+};
+
+struct mlx5_eqe {
+	uint8_t rsvd0;
+	uint8_t type;
+	uint8_t rsvd1;
+	uint8_t sub_type;
+	__be32 rsvd2[7];
+	union ev_data data;
+	__be16 rsvd3;
+	uint8_t signature;
+	uint8_t owner;
+};
+
+#define MLX5_EQE_SIZE (sizeof(struct mlx5_eqe))
+#define MLX5_NUM_CMD_EQE   (32)
+#define MLX5_NUM_SPARE_EQE (0x80)
+
+struct mlx5_vfio_eqs_uar {
+	uint32_t uarn;
+	uint64_t iova;
+};
+
 struct mlx5_vfio_context {
 	struct verbs_context vctx;
 	int container_fd;
@@ -183,6 +245,9 @@ struct mlx5_vfio_context {
 		uint32_t hca_cur[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 		uint32_t hca_max[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 	} caps;
+	struct mlx5_eq async_eq;
+	struct mlx5_vfio_eqs_uar eqs_uar;
+	pthread_mutex_t eq_lock;
 };
 
 static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 12/27] mlx5: Introduce vfio APIs to process events
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (10 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 11/27] mlx5: Enable interrupt command mode " Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 13/27] mlx5: VFIO poll_health support Yishai Hadas
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Introduce vfio APIs to process events; these include:
- mlx5dv_vfio_get_events_fd()
- mlx5dv_vfio_process_events()

The first API returns a file descriptor that the application should poll to
detect whether events exist.

The second API should be called to process the existing events.

As part of this step, PAGE request event support was added, along with an
async command mode that lets mlx5dv_vfio_process_events() return to the
caller without blocking.

Detailed man pages were added to describe the expected usage of the new
APIs.
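
A minimal usage sketch (illustrative only, not part of the patch): "ctx" is
assumed to be an ibv_context opened on a device returned by
mlx5dv_get_vfio_device_list(), and error handling is trimmed.

	#include <poll.h>
	#include <infiniband/mlx5dv.h>

	static void drive_vfio_events(struct ibv_context *ctx)
	{
		struct pollfd pfd = {
			.fd = mlx5dv_vfio_get_events_fd(ctx),
			.events = POLLIN,
		};

		for (;;) {
			/* block until the driver has events to process */
			if (poll(&pfd, 1, -1) < 0)
				continue;
			if (pfd.revents & POLLIN)
				mlx5dv_vfio_process_events(ctx);
		}
	}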

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 debian/ibverbs-providers.symbols                   |   2 +
 providers/mlx5/libmlx5.map                         |   2 +
 providers/mlx5/man/CMakeLists.txt                  |   2 +
 providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md  |  41 ++++
 providers/mlx5/man/mlx5dv_vfio_process_events.3.md |  43 ++++
 providers/mlx5/mlx5_vfio.c                         | 231 ++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.h                         |  14 ++
 providers/mlx5/mlx5dv.h                            |   8 +
 8 files changed, 332 insertions(+), 11 deletions(-)
 create mode 100644 providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md
 create mode 100644 providers/mlx5/man/mlx5dv_vfio_process_events.3.md

diff --git a/debian/ibverbs-providers.symbols b/debian/ibverbs-providers.symbols
index 64e29b1..3e36592 100644
--- a/debian/ibverbs-providers.symbols
+++ b/debian/ibverbs-providers.symbols
@@ -135,6 +135,8 @@ libmlx5.so.1 ibverbs-providers #MINVER#
  mlx5dv_qp_cancel_posted_send_wrs@MLX5_1.20 36
  _mlx5dv_mkey_check@MLX5_1.20 36
  mlx5dv_get_vfio_device_list@MLX5_1.21 37
+ mlx5dv_vfio_get_events_fd@MLX5_1.21 37
+ mlx5dv_vfio_process_events@MLX5_1.21 37
 libefa.so.1 ibverbs-providers #MINVER#
 * Build-Depends-Package: libibverbs-dev
  EFA_1.0@EFA_1.0 24
diff --git a/providers/mlx5/libmlx5.map b/providers/mlx5/libmlx5.map
index 3e8a4d8..d6294d8 100644
--- a/providers/mlx5/libmlx5.map
+++ b/providers/mlx5/libmlx5.map
@@ -193,4 +193,6 @@ MLX5_1.20 {
 MLX5_1.21 {
         global:
 		mlx5dv_get_vfio_device_list;
+		mlx5dv_vfio_get_events_fd;
+		mlx5dv_vfio_process_events;
 } MLX5_1.20;
diff --git a/providers/mlx5/man/CMakeLists.txt b/providers/mlx5/man/CMakeLists.txt
index 91aebed..cb3525c 100644
--- a/providers/mlx5/man/CMakeLists.txt
+++ b/providers/mlx5/man/CMakeLists.txt
@@ -40,6 +40,8 @@ rdma_man_pages(
   mlx5dv_sched_node_create.3.md
   mlx5dv_ts_to_ns.3
   mlx5dv_wr_mkey_configure.3.md
+  mlx5dv_vfio_get_events_fd.3.md
+  mlx5dv_vfio_process_events.3.md
   mlx5dv_wr_post.3.md
   mlx5dv_wr_set_mkey_sig_block.3.md
   mlx5dv.7
diff --git a/providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md b/providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md
new file mode 100644
index 0000000..3023bb2
--- /dev/null
+++ b/providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md
@@ -0,0 +1,41 @@
+---
+layout: page
+title: mlx5dv_vfio_get_events_fd
+section: 3
+tagline: Verbs
+---
+
+# NAME
+
+mlx5dv_vfio_get_events_fd - Get the file descriptor to manage driver events.
+
+# SYNOPSIS
+
+```c
+#include <infiniband/mlx5dv.h>
+
+int mlx5dv_vfio_get_events_fd(struct ibv_context *ctx);
+```
+
+# DESCRIPTION
+
+Returns the file descriptor to be used for managing driver events.
+
+# ARGUMENTS
+
+*ctx*
+:	device context that was opened for VFIO by calling mlx5dv_get_vfio_device_list().
+
+# RETURN VALUE
+Returns the file descriptor to be polled for driver events.
+
+# NOTES
+Client code should poll the returned file descriptor and, once it becomes readable, immediately call *mlx5dv_vfio_process_events()*.
+
+# SEE ALSO
+
+*ibv_open_device(3)* *ibv_free_device_list(3)* *mlx5dv_get_vfio_device_list(3)*
+
+# AUTHOR
+
+Yishai Hadas <yishaih@nvidia.com>
diff --git a/providers/mlx5/man/mlx5dv_vfio_process_events.3.md b/providers/mlx5/man/mlx5dv_vfio_process_events.3.md
new file mode 100644
index 0000000..6c4123b
--- /dev/null
+++ b/providers/mlx5/man/mlx5dv_vfio_process_events.3.md
@@ -0,0 +1,43 @@
+---
+layout: page
+title: mlx5dv_vfio_process_events
+section: 3
+tagline: Verbs
+---
+
+# NAME
+
+mlx5dv_vfio_process_events - process vfio driver events
+
+# SYNOPSIS
+
+```c
+#include <infiniband/mlx5dv.h>
+
+int mlx5dv_vfio_process_events(struct ibv_context *ctx);
+```
+
+# DESCRIPTION
+
+This API should be called from an application thread to process device events.
+The application is responsible for getting the events FD by calling *mlx5dv_vfio_get_events_fd()*
+and, once the FD is readable, calling this API to let the driver process its internal events.
+
+# ARGUMENTS
+
+*ctx*
+:	device context that was opened for VFIO by calling mlx5dv_get_vfio_device_list().
+
+# RETURN VALUE
+Returns 0 upon success, or an errno value if a failure has occurred.
+
+# NOTES
+The application can also call this API periodically to check the device health state, even if no events exist.
+
+# SEE ALSO
+
+*ibv_open_device(3)* *ibv_free_device_list(3)* *mlx5dv_get_vfio_device_list(3)* *mlx5dv_vfio_get_events_fd(3)*
+
+# AUTHOR
+
+Yishai Hadas <yishaih@nvidia.com>
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index dbb9858..85ee25b 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -31,12 +31,21 @@ enum {
 	MLX5_VFIO_CMD_VEC_IDX,
 };
 
+static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx, uint16_t func_id,
+				int32_t npages, bool is_event);
+static int mlx5_vfio_reclaim_pages(struct mlx5_vfio_context *ctx, uint32_t func_id,
+				   int npages);
+
 static void mlx5_vfio_free_cmd_msg(struct mlx5_vfio_context *ctx,
 				   struct mlx5_cmd_msg *msg);
 
 static int mlx5_vfio_alloc_cmd_msg(struct mlx5_vfio_context *ctx,
 				   uint32_t size, struct mlx5_cmd_msg *msg);
 
+static int mlx5_vfio_post_cmd(struct mlx5_vfio_context *ctx, void *in,
+			      int ilen, void *out, int olen,
+			      unsigned int slot, bool async);
+
 static int mlx5_vfio_register_mem(struct mlx5_vfio_context *ctx,
 				  void *vaddr, uint64_t iova, uint64_t size)
 {
@@ -259,6 +268,22 @@ static void eq_update_ci(struct mlx5_eq *eq, uint32_t cc, int arm)
 	udma_to_device_barrier();
 }
 
+static int mlx5_vfio_handle_page_req_event(struct mlx5_vfio_context *ctx,
+					   struct mlx5_eqe *eqe)
+{
+	struct mlx5_eqe_page_req *req = &eqe->data.req_pages;
+	int32_t num_pages;
+	int16_t func_id;
+
+	func_id = be16toh(req->func_id);
+	num_pages = be32toh(req->num_pages);
+
+	if (num_pages > 0)
+		return mlx5_vfio_give_pages(ctx, func_id, num_pages, true);
+
+	return mlx5_vfio_reclaim_pages(ctx, func_id, -1 * num_pages);
+}
+
 static void mlx5_cmd_mbox_status(void *out, uint8_t *status, uint32_t *syndrome)
 {
 	*status = DEVX_GET(mbox_out, out, status);
@@ -365,6 +390,52 @@ static inline uint32_t mlx5_eq_update_cc(struct mlx5_eq *eq, uint32_t cc)
 	return cc;
 }
 
+static int mlx5_vfio_process_page_request_comp(struct mlx5_vfio_context *ctx,
+					       unsigned long slot)
+{
+	struct mlx5_vfio_cmd_slot *cmd_slot = &ctx->cmd.cmds[slot];
+	struct cmd_async_data *cmd_data = &cmd_slot->curr;
+	int num_claimed;
+	int ret, i;
+
+	ret = mlx5_copy_from_msg(cmd_data->buff_out, &cmd_slot->out,
+				 cmd_data->olen, cmd_slot->lay);
+	if (ret)
+		goto end;
+
+	ret = mlx5_vfio_cmd_check(ctx, cmd_data->buff_in, cmd_data->buff_out);
+	if (ret)
+		goto end;
+
+	if (DEVX_GET(manage_pages_in, cmd_data->buff_in, op_mod) == MLX5_PAGES_GIVE)
+		goto end;
+
+	num_claimed = DEVX_GET(manage_pages_out, cmd_data->buff_out, output_num_entries);
+	if (num_claimed > DEVX_GET(manage_pages_in, cmd_data->buff_in, input_num_entries)) {
+		ret = EINVAL;
+		errno = ret;
+		goto end;
+	}
+
+	for (i = 0; i < num_claimed; i++)
+		mlx5_vfio_free_page(ctx, DEVX_GET64(manage_pages_out, cmd_data->buff_out, pas[i]));
+
+end:
+	free(cmd_data->buff_in);
+	free(cmd_data->buff_out);
+	cmd_slot->in_use = false;
+	if (!ret && cmd_slot->is_pending) {
+		cmd_data = &cmd_slot->pending;
+
+		pthread_mutex_lock(&cmd_slot->lock);
+		cmd_slot->is_pending = false;
+		ret = mlx5_vfio_post_cmd(ctx, cmd_data->buff_in, cmd_data->ilen,
+					 cmd_data->buff_out, cmd_data->olen, slot, true);
+		pthread_mutex_unlock(&cmd_slot->lock);
+	}
+	return ret;
+}
+
 static int mlx5_vfio_cmd_comp(struct mlx5_vfio_context *ctx, unsigned long slot)
 {
 	uint64_t u = 1;
@@ -415,6 +486,9 @@ static int mlx5_vfio_process_async_events(struct mlx5_vfio_context *ctx)
 		case MLX5_EVENT_TYPE_CMD:
 			ret = mlx5_vfio_process_cmd_eqe(ctx, eqe);
 			break;
+		case MLX5_EVENT_TYPE_PAGE_REQUEST:
+			ret = mlx5_vfio_handle_page_req_event(ctx, eqe);
+			break;
 		default:
 			break;
 		}
@@ -563,9 +637,9 @@ static int mlx5_vfio_cmd_prep_out(struct mlx5_vfio_context *ctx,
 	return 0;
 }
 
-static int mlx5_vfio_cmd_exec(struct mlx5_vfio_context *ctx, void *in,
+static int mlx5_vfio_post_cmd(struct mlx5_vfio_context *ctx, void *in,
 			      int ilen, void *out, int olen,
-			      unsigned int slot)
+			      unsigned int slot, bool async)
 {
 	struct mlx5_init_seg *init_seg = ctx->bar_map;
 	struct mlx5_cmd_layout *cmd_lay = ctx->cmd.cmds[slot].lay;
@@ -573,20 +647,62 @@ static int mlx5_vfio_cmd_exec(struct mlx5_vfio_context *ctx, void *in,
 	struct mlx5_cmd_msg *cmd_out = &ctx->cmd.cmds[slot].out;
 	int err;
 
-	pthread_mutex_lock(&ctx->cmd.cmds[slot].lock);
+	/* Lock was taken by caller */
+	if (async && ctx->cmd.cmds[slot].in_use) {
+		struct cmd_async_data *pending = &ctx->cmd.cmds[slot].pending;
+
+		if (ctx->cmd.cmds[slot].is_pending) {
+			assert(false);
+			return EINVAL;
+		}
+
+		/* We might get another PAGE EVENT before previous CMD was completed.
+		 * Save the new work and once get the CMD completion go and do the job.
+		 */
+		pending->buff_in = in;
+		pending->buff_out = out;
+		pending->ilen = ilen;
+		pending->olen = olen;
+
+		ctx->cmd.cmds[slot].is_pending = true;
+		return 0;
+	}
 
 	err = mlx5_vfio_cmd_prep_in(ctx, cmd_in, cmd_lay, in, ilen);
 	if (err)
-		goto end;
+		return err;
 
 	err = mlx5_vfio_cmd_prep_out(ctx, cmd_out, cmd_lay, olen);
 	if (err)
-		goto end;
+		return err;
+
+	if (async) {
+		ctx->cmd.cmds[slot].in_use = true;
+		ctx->cmd.cmds[slot].curr.ilen = ilen;
+		ctx->cmd.cmds[slot].curr.olen = olen;
+		ctx->cmd.cmds[slot].curr.buff_in = in;
+		ctx->cmd.cmds[slot].curr.buff_out = out;
+	}
 
 	cmd_lay->status_own = 0x1;
 
 	udma_to_device_barrier();
 	mmio_write32_be(&init_seg->cmd_dbell, htobe32(0x1 << slot));
+	return 0;
+}
+
+static int mlx5_vfio_cmd_exec(struct mlx5_vfio_context *ctx, void *in,
+			       int ilen, void *out, int olen,
+			       unsigned int slot)
+{
+	struct mlx5_cmd_layout *cmd_lay = ctx->cmd.cmds[slot].lay;
+	struct mlx5_cmd_msg *cmd_out = &ctx->cmd.cmds[slot].out;
+	int err;
+
+	pthread_mutex_lock(&ctx->cmd.cmds[slot].lock);
+	err = mlx5_vfio_post_cmd(ctx, in, ilen, out, olen, slot, false);
+	if (err)
+		goto end;
 
 	if (ctx->have_eq) {
 		err = mlx5_vfio_wait_event(ctx, slot);
@@ -775,6 +891,8 @@ static int mlx5_vfio_setup_cmd_slot(struct mlx5_vfio_context *ctx, int slot)
 
 	if (slot != MLX5_MAX_COMMANDS - 1)
 		cmd_slot->comp_func = mlx5_vfio_cmd_comp;
+	else
+		cmd_slot->comp_func = mlx5_vfio_process_page_request_comp;
 
 	pthread_mutex_init(&cmd_slot->lock, NULL);
 
@@ -1329,7 +1447,8 @@ static int create_async_eqs(struct mlx5_vfio_context *ctx)
 	param = (struct mlx5_eq_param) {
 		.irq_index = MLX5_VFIO_CMD_VEC_IDX,
 		.nent = MLX5_NUM_CMD_EQE,
-		.mask[0] = 1ull << MLX5_EVENT_TYPE_CMD,
+		.mask[0] = 1ull << MLX5_EVENT_TYPE_CMD |
+			   1ull << MLX5_EVENT_TYPE_PAGE_REQUEST,
 	};
 
 	err = setup_async_eq(ctx, &param, &ctx->async_eq);
@@ -1343,6 +1462,49 @@ err:
 	return err;
 }
 
+static int mlx5_vfio_reclaim_pages(struct mlx5_vfio_context *ctx, uint32_t func_id,
+				   int npages)
+{
+	uint32_t inlen = DEVX_ST_SZ_BYTES(manage_pages_in);
+	int outlen;
+	uint32_t *out;
+	void *in;
+	int err;
+	int slot = MLX5_MAX_COMMANDS - 1;
+
+	outlen = DEVX_ST_SZ_BYTES(manage_pages_out);
+
+	outlen += npages * DEVX_FLD_SZ_BYTES(manage_pages_out, pas[0]);
+	out = calloc(1, outlen);
+	if (!out) {
+		errno = ENOMEM;
+		return errno;
+	}
+
+	in = calloc(1, inlen);
+	if (!in) {
+		err = ENOMEM;
+		errno = err;
+		goto out_free;
+	}
+
+	DEVX_SET(manage_pages_in, in, opcode, MLX5_CMD_OP_MANAGE_PAGES);
+	DEVX_SET(manage_pages_in, in, op_mod, MLX5_PAGES_TAKE);
+	DEVX_SET(manage_pages_in, in, function_id, func_id);
+	DEVX_SET(manage_pages_in, in, input_num_entries, npages);
+
+	pthread_mutex_lock(&ctx->cmd.cmds[slot].lock);
+	err = mlx5_vfio_post_cmd(ctx, in, inlen, out, outlen, slot, true);
+	pthread_mutex_unlock(&ctx->cmd.cmds[slot].lock);
+	if (!err)
+		return 0;
+
+	free(in);
+out_free:
+	free(out);
+	return err;
+}
+
 static int mlx5_vfio_enable_hca(struct mlx5_vfio_context *ctx)
 {
 	uint32_t in[DEVX_ST_SZ_DW(enable_hca_in)] = {};
@@ -1382,10 +1544,13 @@ static int mlx5_vfio_set_issi(struct mlx5_vfio_context *ctx)
 
 static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx,
 				uint16_t func_id,
-				int32_t npages)
+				int32_t npages,
+				bool is_event)
 {
 	int32_t out[DEVX_ST_SZ_DW(manage_pages_out)] = {};
 	int inlen = DEVX_ST_SZ_BYTES(manage_pages_in);
+	int slot = MLX5_MAX_COMMANDS - 1;
+	void *outp = out;
 	int i, err;
 	int32_t *in;
 	uint64_t iova;
@@ -1397,6 +1562,15 @@ static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx,
 		return errno;
 	}
 
+	if (is_event) {
+		outp = calloc(1, sizeof(out));
+		if (!outp) {
+			errno = ENOMEM;
+			err = errno;
+			goto end;
+		}
+	}
+
 	for (i = 0; i < npages; i++) {
 		err = mlx5_vfio_alloc_page(ctx, &iova);
 		if (err)
@@ -1410,11 +1584,22 @@ static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx,
 	DEVX_SET(manage_pages_in, in, function_id, func_id);
 	DEVX_SET(manage_pages_in, in, input_num_entries, npages);
 
-	err = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out),
-				 MLX5_MAX_COMMANDS - 1);
-	if (!err)
+	if (is_event) {
+		pthread_mutex_lock(&ctx->cmd.cmds[slot].lock);
+		err = mlx5_vfio_post_cmd(ctx, in, inlen, outp, sizeof(out), slot, true);
+		pthread_mutex_unlock(&ctx->cmd.cmds[slot].lock);
+	} else {
+		err = mlx5_vfio_cmd_exec(ctx, in, inlen, outp, sizeof(out), slot);
+	}
+
+	if (!err) {
+		if (is_event)
+			return 0;
 		goto end;
+	}
 err:
+	if (is_event)
+		free(outp);
 	for (i--; i >= 0; i--)
 		mlx5_vfio_free_page(ctx, DEVX_GET64(manage_pages_in, in, pas[i]));
 end:
@@ -1454,7 +1639,7 @@ static int mlx5_vfio_satisfy_startup_pages(struct mlx5_vfio_context *ctx,
 	if (ret)
 		return ret;
 
-	return mlx5_vfio_give_pages(ctx, function_id, npages);
+	return mlx5_vfio_give_pages(ctx, function_id, npages, false);
 }
 
 static int mlx5_vfio_access_reg(struct mlx5_vfio_context *ctx, void *data_in,
@@ -2034,6 +2219,30 @@ static int mlx5_vfio_get_handle(struct mlx5_vfio_device *vfio_dev,
 	return 0;
 }
 
+int mlx5dv_vfio_get_events_fd(struct ibv_context *ibctx)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+
+	return ctx->cmd_comp_fd;
+}
+
+int mlx5dv_vfio_process_events(struct ibv_context *ibctx)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+	uint64_t u;
+	ssize_t s;
+
+	/* read to re-arm the FD and process all existing events */
+	s = read(ctx->cmd_comp_fd, &u, sizeof(uint64_t));
+	if (s < 0 && errno != EAGAIN) {
+		mlx5_err(ctx->dbg_fp, "%s, read failed, errno=%d\n",
+			 __func__, errno);
+		return errno;
+	}
+
+	return mlx5_vfio_process_async_events(ctx);
+}
+
 struct ibv_device **
 mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr)
 {
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 449a5c5..8e240c8 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -151,9 +151,17 @@ struct mlx5_cmd_msg {
 	struct mlx5_cmd_mailbox *next;
 };
 
+
 typedef int (*vfio_cmd_slot_comp)(struct mlx5_vfio_context *ctx,
 				  unsigned long slot);
 
+struct cmd_async_data {
+	void *buff_in;
+	int ilen;
+	void *buff_out;
+	int olen;
+};
+
 struct mlx5_vfio_cmd_slot {
 	struct mlx5_cmd_layout *lay;
 	struct mlx5_cmd_msg in;
@@ -161,6 +169,11 @@ struct mlx5_vfio_cmd_slot {
 	pthread_mutex_t lock;
 	int completion_event_fd;
 	vfio_cmd_slot_comp comp_func;
+	/* async cmd caller data */
+	bool in_use;
+	struct cmd_async_data curr;
+	bool is_pending;
+	struct cmd_async_data pending;
 };
 
 struct mlx5_vfio_cmd {
@@ -245,6 +258,7 @@ struct mlx5_vfio_context {
 		uint32_t hca_cur[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 		uint32_t hca_max[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 	} caps;
+
 	struct mlx5_eq async_eq;
 	struct mlx5_vfio_eqs_uar eqs_uar;
 	pthread_mutex_t eq_lock;
diff --git a/providers/mlx5/mlx5dv.h b/providers/mlx5/mlx5dv.h
index 6aaea37..c9699ec 100644
--- a/providers/mlx5/mlx5dv.h
+++ b/providers/mlx5/mlx5dv.h
@@ -1487,6 +1487,14 @@ struct mlx5dv_vfio_context_attr {
 struct ibv_device **
 mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr);
 
+int mlx5dv_vfio_get_events_fd(struct ibv_context *ibctx);
+
+/* This API should run from application thread and maintain device events.
+ * The application is responsible to get the events FD by calling mlx5dv_vfio_get_events_fd
+ * and once the FD is pollable call the API to let driver process the ready events.
+ */
+int mlx5dv_vfio_process_events(struct ibv_context *context);
+
 struct ibv_context *
 mlx5dv_open_device(struct ibv_device *device, struct mlx5dv_context_attr *attr);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 13/27] mlx5: VFIO poll_health support
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (11 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 12/27] mlx5: Introduce vfio APIs to process events Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 14/27] mlx5: Implement basic verbs operation for PD and MR over vfio Yishai Hadas
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Mark Zhang <markzhang@nvidia.com>

Add firmware health polling support to the vfio driver.

A health failure is not expected and is treated as a fatal error in the
firmware that should be avoided/fixed.

The health buffer check is triggered by the application upon its call to
mlx5dv_vfio_process_events().
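
Since the health buffer is only sampled from within
mlx5dv_vfio_process_events(), a caller that wants periodic health monitoring
should bound its wait instead of blocking forever. A sketch of the calling
side (illustrative only; the 1000 ms value mirrors the driver's
POLL_HEALTH_INTERVAL added below):

	struct pollfd pfd = {
		.fd = mlx5dv_vfio_get_events_fd(ctx),
		.events = POLLIN,
	};

	for (;;) {
		/* wake up at least once per health interval, even when idle */
		poll(&pfd, 1, 1000);
		mlx5dv_vfio_process_events(ctx);
	}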

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_vfio.c | 168 +++++++++++++++++++++++++++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h |  10 ++-
 2 files changed, 177 insertions(+), 1 deletion(-)

diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 85ee25b..c37358c 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -22,6 +22,8 @@
 #include <poll.h>
 #include <util/mmio.h>
 
+#include <ccan/array_size.h>
+
 #include "mlx5dv.h"
 #include "mlx5_vfio.h"
 #include "mlx5.h"
@@ -1910,6 +1912,7 @@ enum mlx5_cmd_addr_l_sz_offset {
 
 enum {
 	MLX5_NIC_IFC_DISABLED = 1,
+	MLX5_NIC_IFC_SW_RESET = 7,
 };
 
 static uint8_t mlx5_vfio_get_nic_state(struct mlx5_vfio_context *ctx)
@@ -1978,6 +1981,169 @@ static int mlx5_vfio_teardown_hca(struct mlx5_vfio_context *ctx)
 	return mlx5_vfio_teardown_hca_regular(ctx);
 }
 
+static bool sensor_pci_not_working(struct mlx5_init_seg *init_seg)
+{
+	/* Offline PCI reads return 0xffffffff */
+	return (be32toh(mmio_read32_be(&init_seg->health.fw_ver)) == 0xffffffff);
+}
+
+enum mlx5_fatal_assert_bit_offsets {
+	MLX5_RFR_OFFSET = 31,
+};
+
+static bool sensor_fw_synd_rfr(struct mlx5_init_seg *init_seg)
+{
+	uint32_t rfr = be32toh(mmio_read32_be(&init_seg->health.rfr)) >> MLX5_RFR_OFFSET;
+	uint8_t synd = mmio_read8(&init_seg->health.synd);
+
+	return (rfr && synd);
+}
+
+enum  {
+	MLX5_SENSOR_NO_ERR = 0,
+	MLX5_SENSOR_PCI_COMM_ERR = 1,
+	MLX5_SENSOR_NIC_DISABLED = 3,
+	MLX5_SENSOR_NIC_SW_RESET = 4,
+	MLX5_SENSOR_FW_SYND_RFR = 5,
+};
+
+static uint32_t mlx5_health_check_fatal_sensors(struct mlx5_vfio_context *ctx)
+{
+	if (sensor_pci_not_working(ctx->bar_map))
+		return MLX5_SENSOR_PCI_COMM_ERR;
+
+	if (mlx5_vfio_get_nic_state(ctx) == MLX5_NIC_IFC_DISABLED)
+		return MLX5_SENSOR_NIC_DISABLED;
+
+	if (mlx5_vfio_get_nic_state(ctx) == MLX5_NIC_IFC_SW_RESET)
+		return MLX5_SENSOR_NIC_SW_RESET;
+
+	if (sensor_fw_synd_rfr(ctx->bar_map))
+		return MLX5_SENSOR_FW_SYND_RFR;
+
+	return MLX5_SENSOR_NO_ERR;
+}
+
+enum {
+	MLX5_HEALTH_SYNDR_FW_ERR = 0x1,
+	MLX5_HEALTH_SYNDR_IRISC_ERR = 0x7,
+	MLX5_HEALTH_SYNDR_HW_UNRECOVERABLE_ERR = 0x8,
+	MLX5_HEALTH_SYNDR_CRC_ERR = 0x9,
+	MLX5_HEALTH_SYNDR_FETCH_PCI_ERR = 0xa,
+	MLX5_HEALTH_SYNDR_HW_FTL_ERR = 0xb,
+	MLX5_HEALTH_SYNDR_ASYNC_EQ_OVERRUN_ERR = 0xc,
+	MLX5_HEALTH_SYNDR_EQ_ERR = 0xd,
+	MLX5_HEALTH_SYNDR_EQ_INV = 0xe,
+	MLX5_HEALTH_SYNDR_FFSER_ERR = 0xf,
+	MLX5_HEALTH_SYNDR_HIGH_TEMP = 0x10,
+};
+
+static const char *hsynd_str(u8 synd)
+{
+	switch (synd) {
+	case MLX5_HEALTH_SYNDR_FW_ERR:
+		return "firmware internal error";
+	case MLX5_HEALTH_SYNDR_IRISC_ERR:
+		return "irisc not responding";
+	case MLX5_HEALTH_SYNDR_HW_UNRECOVERABLE_ERR:
+		return "unrecoverable hardware error";
+	case MLX5_HEALTH_SYNDR_CRC_ERR:
+		return "firmware CRC error";
+	case MLX5_HEALTH_SYNDR_FETCH_PCI_ERR:
+		return "ICM fetch PCI error";
+	case MLX5_HEALTH_SYNDR_HW_FTL_ERR:
+		return "HW fatal error";
+	case MLX5_HEALTH_SYNDR_ASYNC_EQ_OVERRUN_ERR:
+		return "async EQ buffer overrun";
+	case MLX5_HEALTH_SYNDR_EQ_ERR:
+		return "EQ error";
+	case MLX5_HEALTH_SYNDR_EQ_INV:
+		return "Invalid EQ referenced";
+	case MLX5_HEALTH_SYNDR_FFSER_ERR:
+		return "FFSER error";
+	case MLX5_HEALTH_SYNDR_HIGH_TEMP:
+		return "High temperature";
+	default:
+		return "unrecognized error";
+	}
+}
+
+static void print_health_info(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_init_seg *iseg = ctx->bar_map;
+	struct health_buffer *h = &iseg->health;
+	char fw_str[18] = {};
+	int i;
+
+	/* If the syndrome is 0, the device is OK and no need to print buffer */
+	if (!mmio_read8(&h->synd))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(h->assert_var); i++)
+		mlx5_err(ctx->dbg_fp, "assert_var[%d] 0x%08x\n",
+			 i, be32toh(mmio_read32_be(h->assert_var + i)));
+
+	mlx5_err(ctx->dbg_fp, "assert_exit_ptr 0x%08x\n",
+		 be32toh(mmio_read32_be(&h->assert_exit_ptr)));
+	mlx5_err(ctx->dbg_fp, "assert_callra 0x%08x\n",
+		 be32toh(mmio_read32_be(&h->assert_callra)));
+	sprintf(fw_str, "%d.%d.%d",
+		be32toh(mmio_read32_be(&iseg->fw_rev)) & 0xffff,
+		be32toh(mmio_read32_be(&iseg->fw_rev)) >> 16,
+		be32toh(mmio_read32_be(&iseg->cmdif_rev_fw_sub)) & 0xffff);
+	mlx5_err(ctx->dbg_fp, "fw_ver %s\n", fw_str);
+	mlx5_err(ctx->dbg_fp, "hw_id 0x%08x\n", be32toh(mmio_read32_be(&h->hw_id)));
+	mlx5_err(ctx->dbg_fp, "irisc_index %d\n", mmio_read8(&h->irisc_index));
+	mlx5_err(ctx->dbg_fp, "synd 0x%x: %s\n", mmio_read8(&h->synd),
+		 hsynd_str(mmio_read8(&h->synd)));
+	mlx5_err(ctx->dbg_fp, "ext_synd 0x%04x\n",
+		 be16toh(mmio_read16_be(&h->ext_synd)));
+	mlx5_err(ctx->dbg_fp, "raw fw_ver 0x%08x\n",
+		 be32toh(mmio_read32_be(&iseg->fw_rev)));
+}
+
+static void mlx5_vfio_poll_health(struct mlx5_vfio_context *ctx)
+{
+	struct mlx5_vfio_health_state *hstate = &ctx->health_state;
+	uint32_t fatal_error, count;
+	struct timeval tv;
+	uint64_t time;
+	int ret;
+
+	ret = gettimeofday(&tv, NULL);
+	if (ret)
+		return;
+
+	time = (uint64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
+	if (time - hstate->prev_time < POLL_HEALTH_INTERVAL)
+		return;
+
+	fatal_error = mlx5_health_check_fatal_sensors(ctx);
+	if (fatal_error) {
+		mlx5_err(ctx->dbg_fp, "%s: Fatal error %u detected\n",
+			 __func__, fatal_error);
+		goto err;
+	}
+	count = be32toh(mmio_read32_be(&ctx->bar_map->health_counter)) & 0xffffff;
+	if (count == hstate->prev_count)
+		++hstate->miss_counter;
+	else
+		hstate->miss_counter = 0;
+
+	hstate->prev_time = time;
+	hstate->prev_count = count;
+	if (hstate->miss_counter == MAX_MISSES) {
+		mlx5_err(ctx->dbg_fp,
+			 "device's health compromised - reached miss count\n");
+		goto err;
+	}
+
+	return;
+err:
+	print_health_info(ctx);
+	abort();
+}
+
 static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
 {
 	int err;
@@ -2232,6 +2398,8 @@ int mlx5dv_vfio_process_events(struct ibv_context *ibctx)
 	uint64_t u;
 	ssize_t s;
 
+	mlx5_vfio_poll_health(ctx);
+
 	/* read to re-arm the FD and process all existing events */
 	s = read(ctx->cmd_comp_fd, &u, sizeof(uint64_t));
 	if (s < 0 && errno != EAGAIN) {
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 8e240c8..296d6d1 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -240,6 +240,14 @@ struct mlx5_vfio_eqs_uar {
 	uint64_t iova;
 };
 
+#define POLL_HEALTH_INTERVAL 1000 /* ms */
+#define MAX_MISSES 3
+struct mlx5_vfio_health_state {
+	uint64_t prev_time; /* ms */
+	uint32_t prev_count;
+	uint32_t miss_counter;
+};
+
 struct mlx5_vfio_context {
 	struct verbs_context vctx;
 	int container_fd;
@@ -258,7 +266,7 @@ struct mlx5_vfio_context {
 		uint32_t hca_cur[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 		uint32_t hca_max[MLX5_CAP_NUM][DEVX_UN_SZ_DW(hca_cap_union)];
 	} caps;
-
+	struct mlx5_vfio_health_state health_state;
 	struct mlx5_eq async_eq;
 	struct mlx5_vfio_eqs_uar eqs_uar;
 	pthread_mutex_t eq_lock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 14/27] mlx5: Implement basic verbs operation for PD and MR over vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (12 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 13/27] mlx5: VFIO poll_health support Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 15/27] mlx5: Set DV context ops Yishai Hadas
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Implement basic verbs operations for PD and MR over vfio, as sketched below;
this includes:
- PD alloc/dealloc
- MR reg/dereg.
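
A usage sketch through the regular verbs entry points (illustrative only;
"ctx" is assumed to be an ibv_context opened over VFIO, and error handling
is trimmed):

	#include <stdlib.h>
	#include <unistd.h>
	#include <infiniband/verbs.h>

	static int pd_mr_example(struct ibv_context *ctx)
	{
		size_t len = 1 << 20;
		struct ibv_pd *pd;
		struct ibv_mr *mr;
		void *buf;
		int ret;

		ret = posix_memalign(&buf, sysconf(_SC_PAGESIZE), len);
		if (ret)
			return ret;

		pd = ibv_alloc_pd(ctx);			/* ALLOC_PD command */
		mr = ibv_reg_mr(pd, buf, len,		/* CREATE_MKEY + DMA mapping */
				IBV_ACCESS_LOCAL_WRITE |
				IBV_ACCESS_REMOTE_READ);

		/* ... use mr->lkey / mr->rkey with DEVX-created objects ... */

		ibv_dereg_mr(mr);			/* DESTROY_MKEY + DMA unmap */
		ibv_dealloc_pd(pd);			/* DEALLOC_PD command */
		free(buf);
		return 0;
	}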

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/mlx5_ifc.h  |  76 ++++++++++++-
 providers/mlx5/mlx5_vfio.c | 273 +++++++++++++++++++++++++++++++++++++++++++++
 providers/mlx5/mlx5_vfio.h |  25 +++++
 util/util.h                |   5 +
 4 files changed, 377 insertions(+), 2 deletions(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 2129779..1cbe846 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -51,6 +51,7 @@ enum {
 	MLX5_CMD_OP_QUERY_ISSI = 0x10a,
 	MLX5_CMD_OP_SET_ISSI = 0x10b,
 	MLX5_CMD_OP_CREATE_MKEY = 0x200,
+	MLX5_CMD_OP_DESTROY_MKEY = 0x202,
 	MLX5_CMD_OP_CREATE_EQ = 0x301,
 	MLX5_CMD_OP_DESTROY_EQ = 0x302,
 	MLX5_CMD_OP_CREATE_QP = 0x500,
@@ -67,6 +68,8 @@ enum {
 	MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
 	MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT = 0x755,
 	MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
+	MLX5_CMD_OP_ALLOC_PD = 0x800,
+	MLX5_CMD_OP_DEALLOC_PD = 0x801,
 	MLX5_CMD_OP_ALLOC_UAR = 0x802,
 	MLX5_CMD_OP_DEALLOC_UAR = 0x803,
 	MLX5_CMD_OP_ACCESS_REG = 0x805,
@@ -1380,7 +1383,8 @@ enum {
 };
 
 enum {
-	MLX5_MKC_ACCESS_MODE_KLMS  = 0x2,
+	MLX5_MKC_ACCESS_MODE_MTT = 0x1,
+	MLX5_MKC_ACCESS_MODE_KLMS = 0x2,
 };
 
 struct mlx5_ifc_mkc_bits {
@@ -1425,7 +1429,9 @@ struct mlx5_ifc_mkc_bits {
 
 	u8         translations_octword_size[0x20];
 
-	u8         reserved_at_1c0[0x1b];
+	u8         reserved_at_1c0[0x19];
+	u8         relaxed_ordering_read[0x1];
+	u8         reserved_at_1d9[0x1];
 	u8         log_page_size[0x5];
 
 	u8         reserved_at_1e0[0x20];
@@ -1467,6 +1473,28 @@ struct mlx5_ifc_create_mkey_in_bits {
 	u8         klm_pas_mtt[0][0x20];
 };
 
+struct mlx5_ifc_destroy_mkey_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_destroy_mkey_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x8];
+	u8         mkey_index[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
 struct mlx5_ifc_l2_hdr_bits {
 	u8         dmac_47_16[0x20];
 	u8         dmac_15_0[0x10];
@@ -4584,4 +4612,48 @@ struct mlx5_ifc_destroy_eq_in_bits {
 	u8         reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_alloc_pd_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         pd[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_alloc_pd_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_dealloc_pd_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_dealloc_pd_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x8];
+	u8         pd[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index c37358c..e85a8cc 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -33,6 +33,12 @@ enum {
 	MLX5_VFIO_CMD_VEC_IDX,
 };
 
+enum {
+	MLX5_VFIO_SUPP_MR_ACCESS_FLAGS = IBV_ACCESS_LOCAL_WRITE |
+		IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ |
+		IBV_ACCESS_REMOTE_ATOMIC | IBV_ACCESS_RELAXED_ORDERING,
+};
+
 static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx, uint16_t func_id,
 				int32_t npages, bool is_event);
 static int mlx5_vfio_reclaim_pages(struct mlx5_vfio_context *ctx, uint32_t func_id,
@@ -2191,6 +2197,268 @@ static int mlx5_vfio_setup_function(struct mlx5_vfio_context *ctx)
 	return err;
 }
 
+static struct ibv_pd *mlx5_vfio_alloc_pd(struct ibv_context *ibctx)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+	uint32_t in[DEVX_ST_SZ_DW(alloc_pd_in)] = {0};
+	uint32_t out[DEVX_ST_SZ_DW(alloc_pd_out)] = {0};
+	int err;
+	struct mlx5_pd *pd;
+
+	pd = calloc(1, sizeof(*pd));
+	if (!pd)
+		return NULL;
+
+	DEVX_SET(alloc_pd_in, in, opcode, MLX5_CMD_OP_ALLOC_PD);
+	err = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+
+	if (err)
+		goto err;
+
+	pd->pdn = DEVX_GET(alloc_pd_out, out, pd);
+
+	return &pd->ibv_pd;
+err:
+	free(pd);
+	return NULL;
+}
+
+static int mlx5_vfio_dealloc_pd(struct ibv_pd *pd)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(pd->context);
+	uint32_t in[DEVX_ST_SZ_DW(dealloc_pd_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(dealloc_pd_out)] = {};
+	struct mlx5_pd *mpd = to_mpd(pd);
+	int ret;
+
+	DEVX_SET(dealloc_pd_in, in, opcode, MLX5_CMD_OP_DEALLOC_PD);
+	DEVX_SET(dealloc_pd_in, in, pd, mpd->pdn);
+
+	ret = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	if (ret)
+		return ret;
+
+	free(mpd);
+	return 0;
+}
+
+static size_t calc_num_dma_blocks(uint64_t iova, size_t length,
+				   unsigned long pgsz)
+{
+	return (size_t)((align(iova + length, pgsz) -
+			 align_down(iova, pgsz)) / pgsz);
+}
+
+static int get_octo_len(uint64_t addr, uint64_t len, int page_shift)
+{
+	uint64_t page_size = 1ULL << page_shift;
+	uint64_t offset;
+	int npages;
+
+	offset = addr & (page_size - 1);
+	npages = align(len + offset, page_size) >> page_shift;
+	return (npages + 1) / 2;
+}
+
+static inline uint32_t mlx5_mkey_to_idx(uint32_t mkey)
+{
+	return mkey >> 8;
+}
+
+static inline uint32_t mlx5_idx_to_mkey(uint32_t mkey_idx)
+{
+	return mkey_idx << 8;
+}
+
+static void set_mkc_access_pd_addr_fields(void *mkc, int acc, uint64_t start_addr,
+					  struct ibv_pd *pd)
+{
+	struct mlx5_pd *mpd = to_mpd(pd);
+
+	DEVX_SET(mkc, mkc, a, !!(acc & IBV_ACCESS_REMOTE_ATOMIC));
+	DEVX_SET(mkc, mkc, rw, !!(acc & IBV_ACCESS_REMOTE_WRITE));
+	DEVX_SET(mkc, mkc, rr, !!(acc & IBV_ACCESS_REMOTE_READ));
+	DEVX_SET(mkc, mkc, lw, !!(acc & IBV_ACCESS_LOCAL_WRITE));
+	DEVX_SET(mkc, mkc, lr, 1);
+	/* Application is responsible to set based on caps */
+	DEVX_SET(mkc, mkc, relaxed_ordering_write,
+		 !!(acc & IBV_ACCESS_RELAXED_ORDERING));
+	DEVX_SET(mkc, mkc, relaxed_ordering_read,
+		 !!(acc & IBV_ACCESS_RELAXED_ORDERING));
+	DEVX_SET(mkc, mkc, pd, mpd->pdn);
+	DEVX_SET(mkc, mkc, qpn, 0xffffff);
+	DEVX_SET64(mkc, mkc, start_addr, start_addr);
+}
+
+static int mlx5_vfio_dereg_mr(struct verbs_mr *vmr)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(vmr->ibv_mr.context);
+	struct mlx5_vfio_mr *mr = to_mvfio_mr(&vmr->ibv_mr);
+	uint32_t in[DEVX_ST_SZ_DW(destroy_mkey_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(destroy_mkey_in)] = {};
+	int ret;
+
+	DEVX_SET(destroy_mkey_in, in, opcode, MLX5_CMD_OP_DESTROY_MKEY);
+	DEVX_SET(destroy_mkey_in, in, mkey_index, mlx5_mkey_to_idx(vmr->ibv_mr.lkey));
+	ret = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	if (ret)
+		return ret;
+
+	mlx5_vfio_unregister_mem(ctx, mr->iova + mr->iova_aligned_offset,
+				 mr->iova_reg_size);
+	iset_insert_range(ctx->iova_alloc, mr->iova, mr->iova_page_size);
+
+	free(vmr);
+	return 0;
+}
+
+static void mlx5_vfio_populate_pas(uint64_t dma_addr, int num_dma, size_t page_size,
+				  __be64 *pas, uint64_t access_flags)
+{
+	int i;
+
+	for (i = 0; i < num_dma; i++) {
+		*pas = htobe64(dma_addr | access_flags);
+		pas++;
+		dma_addr += page_size;
+	}
+}
+
+static uint64_t calc_spanning_page_size(uint64_t start, uint64_t length)
+{
+	/* Compute a page_size such that:
+	 * start & (page_size-1) == (start + length) & (page_size - 1)
+	 */
+	uint64_t diffs = start ^ (start + length - 1);
+
+	return roundup_pow_of_two(diffs + 1);
+}
+
+static struct ibv_mr *mlx5_vfio_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
+				       uint64_t hca_va, int access)
+{
+	struct mlx5_vfio_device *dev = to_mvfio_dev(pd->context->device);
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(pd->context);
+	uint32_t out[DEVX_ST_SZ_DW(create_mkey_out)] = {};
+	uint32_t mkey_index;
+	uint32_t *in;
+	int inlen, num_pas, ret;
+	struct mlx5_vfio_mr *mr;
+	struct verbs_mr *vmr;
+	int page_shift, iova_min_page_shift;
+	__be64 *pas;
+	uint8_t key;
+	void *mkc;
+	void *aligned_va;
+
+	if (!check_comp_mask(access, MLX5_VFIO_SUPP_MR_ACCESS_FLAGS)) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	if (((uint64_t)addr & (ctx->iova_min_page_size - 1)) !=
+	    (hca_va & (ctx->iova_min_page_size - 1))) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	mr = calloc(1, sizeof(*mr));
+	if (!mr) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	/* Page size that encloses the start and end of the mkey's hca_va range */
+	mr->iova_page_size = max(calc_spanning_page_size(hca_va, length),
+				 ctx->iova_min_page_size);
+
+	ret = iset_alloc_range(ctx->iova_alloc, mr->iova_page_size, &mr->iova);
+	if (ret)
+		goto end;
+
+	aligned_va = (void *)((unsigned long)addr & ~(ctx->iova_min_page_size - 1));
+	page_shift = ilog32(mr->iova_page_size - 1);
+	iova_min_page_shift = ilog32(ctx->iova_min_page_size - 1);
+	if (page_shift > iova_min_page_shift)
+		/* Ensure the low bits of the mkey VA match the low bits of the IOVA because the mkc
+		 * start_addr specifies both the wire VA and the DMA VA.
+		 */
+		mr->iova_aligned_offset = hca_va & GENMASK(page_shift - 1, iova_min_page_shift);
+
+	mr->iova_reg_size = align(length + hca_va, ctx->iova_min_page_size) -
+				  align_down(hca_va, ctx->iova_min_page_size);
+
+	assert(mr->iova_page_size >= mr->iova_aligned_offset + mr->iova_reg_size);
+	ret = mlx5_vfio_register_mem(ctx, aligned_va,
+				     mr->iova + mr->iova_aligned_offset,
+				     mr->iova_reg_size);
+
+	if (ret)
+		goto err_reg;
+
+	num_pas = 1;
+	if (page_shift > MLX5_MAX_PAGE_SHIFT) {
+		page_shift = MLX5_MAX_PAGE_SHIFT;
+		num_pas = calc_num_dma_blocks(hca_va, length, (1ULL << MLX5_MAX_PAGE_SHIFT));
+	}
+
+	inlen = DEVX_ST_SZ_BYTES(create_mkey_in) + (sizeof(*pas) * align(num_pas, 2));
+
+	in = calloc(1, inlen);
+	if (!in) {
+		errno = ENOMEM;
+		goto err_in;
+	}
+
+	pas = (__be64 *)DEVX_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
+	mlx5_vfio_populate_pas(mr->iova, num_pas, (1ULL << page_shift), pas, MLX5_MTT_PRESENT);
+
+	DEVX_SET(create_mkey_in, in, opcode, MLX5_CMD_OP_CREATE_MKEY);
+	DEVX_SET(create_mkey_in, in, pg_access, 1);
+	mkc = DEVX_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+	set_mkc_access_pd_addr_fields(mkc, access, hca_va, pd);
+	DEVX_SET(mkc, mkc, free, 0);
+	DEVX_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
+	DEVX_SET64(mkc, mkc, len, length);
+	DEVX_SET(mkc, mkc, bsf_octword_size, 0);
+	DEVX_SET(mkc, mkc, translations_octword_size,
+		 get_octo_len(hca_va, length, page_shift));
+	DEVX_SET(mkc, mkc, log_page_size, page_shift);
+
+	DEVX_SET(create_mkey_in, in, translations_octword_actual_size,
+		 get_octo_len(hca_va, length, page_shift));
+
+	key = atomic_fetch_add(&dev->mkey_var, 1);
+	DEVX_SET(mkc, mkc, mkey_7_0, key);
+
+	ret = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out), 0);
+	if (ret)
+		goto err_exec;
+
+	free(in);
+	mkey_index = DEVX_GET(create_mkey_out, out, mkey_index);
+	vmr = &mr->vmr;
+	vmr->ibv_mr.lkey = key | mlx5_idx_to_mkey(mkey_index);
+	vmr->ibv_mr.rkey = vmr->ibv_mr.lkey;
+	vmr->ibv_mr.context = pd->context;
+	vmr->mr_type = IBV_MR_TYPE_MR;
+	vmr->access = access;
+	vmr->ibv_mr.handle = 0;
+
+	return &mr->vmr.ibv_mr;
+
+err_exec:
+	free(in);
+err_in:
+	mlx5_vfio_unregister_mem(ctx, mr->iova + mr->iova_aligned_offset,
+				 mr->iova_reg_size);
+err_reg:
+	iset_insert_range(ctx->iova_alloc, mr->iova, mr->iova_page_size);
+end:
+	free(mr);
+	return NULL;
+}
+
 static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
 {
 	mlx5_close_debug_file(ctx->dbg_fp);
@@ -2213,6 +2481,10 @@ static void mlx5_vfio_free_context(struct ibv_context *ibctx)
 }
 
 static const struct verbs_context_ops mlx5_vfio_common_ops = {
+	.alloc_pd = mlx5_vfio_alloc_pd,
+	.dealloc_pd = mlx5_vfio_dealloc_pd,
+	.reg_mr = mlx5_vfio_reg_mr,
+	.dereg_mr = mlx5_vfio_dereg_mr,
 	.free_context = mlx5_vfio_free_context,
 };
 
@@ -2446,6 +2718,7 @@ mlx5dv_get_vfio_device_list(struct mlx5dv_vfio_context_attr *attr)
 
 	vfio_dev->flags = attr->flags;
 	vfio_dev->page_size = sysconf(_SC_PAGESIZE);
+	atomic_init(&vfio_dev->mkey_var, 0);
 
 	list[0] = &vfio_dev->vdev.device;
 	return list;
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 296d6d1..5311c6f 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -23,17 +23,37 @@ enum {
 	MLX5_PCI_CMD_XPORT = 7,
 };
 
+enum mlx5_ib_mtt_access_flags {
+	MLX5_MTT_READ  = (1 << 0),
+	MLX5_MTT_WRITE = (1 << 1),
+};
+
+enum {
+	MLX5_MAX_PAGE_SHIFT = 31,
+};
+
+#define MLX5_MTT_PRESENT (MLX5_MTT_READ | MLX5_MTT_WRITE)
+
 enum {
 	MLX5_VFIO_BLOCK_SIZE = 2 * 1024 * 1024,
 	MLX5_VFIO_BLOCK_NUM_PAGES = MLX5_VFIO_BLOCK_SIZE / MLX5_ADAPTER_PAGE_SIZE,
 };
 
+struct mlx5_vfio_mr {
+	struct verbs_mr vmr;
+	uint64_t iova;
+	uint64_t iova_page_size;
+	uint64_t iova_aligned_offset;
+	uint64_t iova_reg_size;
+};
+
 struct mlx5_vfio_device {
 	struct verbs_device vdev;
 	char *pci_name;
 	char vfio_path[IBV_SYSFS_PATH_MAX];
 	int page_size;
 	uint32_t flags;
+	atomic_int mkey_var;
 };
 
 #if __BYTE_ORDER == __LITTLE_ENDIAN
@@ -282,4 +302,9 @@ static inline struct mlx5_vfio_context *to_mvfio_ctx(struct ibv_context *ibctx)
 	return container_of(ibctx, struct mlx5_vfio_context, vctx.context);
 }
 
+static inline struct mlx5_vfio_mr *to_mvfio_mr(struct ibv_mr *ibmr)
+{
+	return container_of(ibmr, struct mlx5_vfio_mr, vmr.ibv_mr);
+}
+
 #endif
diff --git a/util/util.h b/util/util.h
index 2c05631..45f5065 100644
--- a/util/util.h
+++ b/util/util.h
@@ -70,6 +70,11 @@ static inline unsigned long align(unsigned long val, unsigned long align)
 	return (val + align - 1) & ~(align - 1);
 }
 
+static inline unsigned long align_down(unsigned long val, unsigned long _align)
+{
+	return align(val - (_align - 1), _align);
+}
+
 static inline uint64_t roundup_pow_of_two(uint64_t n)
 {
 	return n == 1 ? 1 : 1ULL << ilog64(n - 1);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 15/27] mlx5: Set DV context ops
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (13 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 14/27] mlx5: Implement basic verbs operation for PD and MR over vfio Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 16/27] mlx5: Support initial DEVX/DV APIs over vfio Yishai Hadas
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Mark Zhang <markzhang@nvidia.com>

Wrap the DV APIs and call the matching function pointer when it is supported.
This handles both the rdma and vfio flows.
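
A sketch of the dispatch pattern ("some_op" is a placeholder, not a real DV
API; the per-context table is struct mlx5_dv_context_ops, looked up with
mlx5_get_dv_ops()):

	int mlx5dv_some_op(struct ibv_context *ctx)
	{
		struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);

		/* rdma and vfio contexts fill in different subsets of the table */
		if (!dvops || !dvops->some_op)
			return EOPNOTSUPP;

		return dvops->some_op(ctx);
	}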

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 providers/mlx5/dr_rule.c   |  10 +-
 providers/mlx5/mlx5.c      | 344 ++++++++++++----
 providers/mlx5/mlx5.h      | 180 ++++++++-
 providers/mlx5/mlx5_vfio.c |  27 ++
 providers/mlx5/mlx5_vfio.h |   1 +
 providers/mlx5/verbs.c     | 966 +++++++++++++++++++++++++++++++++++++--------
 6 files changed, 1287 insertions(+), 241 deletions(-)

diff --git a/providers/mlx5/dr_rule.c b/providers/mlx5/dr_rule.c
index f763399..6291685 100644
--- a/providers/mlx5/dr_rule.c
+++ b/providers/mlx5/dr_rule.c
@@ -1341,11 +1341,11 @@ dr_rule_create_rule_root(struct mlx5dv_dr_matcher *matcher,
 	if (ret)
 		goto free_attr_aux;
 
-	rule->flow = __mlx5dv_create_flow(matcher->dv_matcher,
-					  value,
-					  num_actions,
-					  attr,
-					  attr_aux);
+	rule->flow = _mlx5dv_create_flow(matcher->dv_matcher,
+					 value,
+					 num_actions,
+					 attr,
+					 attr_aux);
 	if (!rule->flow)
 		goto remove_action_members;
 
diff --git a/providers/mlx5/mlx5.c b/providers/mlx5/mlx5.c
index 1abaa8c..963581a 100644
--- a/providers/mlx5/mlx5.c
+++ b/providers/mlx5/mlx5.c
@@ -50,8 +50,10 @@
 #include "mlx5-abi.h"
 #include "wqe.h"
 #include "mlx5_ifc.h"
+#include "mlx5_vfio.h"
 
 static void mlx5_free_context(struct ibv_context *ibctx);
+static bool is_mlx5_dev(struct ibv_device *device);
 
 #ifndef CPU_OR
 #define CPU_OR(x, y, z) do {} while (0)
@@ -819,15 +821,12 @@ static uint32_t get_dc_odp_caps(struct ibv_context *ctx)
 	return ret;
 }
 
-int mlx5dv_query_device(struct ibv_context *ctx_in,
-			 struct mlx5dv_context *attrs_out)
+static int _mlx5dv_query_device(struct ibv_context *ctx_in,
+				struct mlx5dv_context *attrs_out)
 {
 	struct mlx5_context *mctx = to_mctx(ctx_in);
 	uint64_t comp_mask_out = 0;
 
-	if (!is_mlx5_dev(ctx_in->device))
-		return EOPNOTSUPP;
-
 	attrs_out->version   = 0;
 	attrs_out->flags     = 0;
 
@@ -921,15 +920,23 @@ int mlx5dv_query_device(struct ibv_context *ctx_in,
 	return 0;
 }
 
+int mlx5dv_query_device(struct ibv_context *ctx_in,
+			struct mlx5dv_context *attrs_out)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx_in);
+
+	if (!dvops || !dvops->query_device)
+		return EOPNOTSUPP;
+
+	return dvops->query_device(ctx_in, attrs_out);
+}
+
 static int mlx5dv_get_qp(struct ibv_qp *qp_in,
 			 struct mlx5dv_qp *qp_out)
 {
 	struct mlx5_qp *mqp = to_mqp(qp_in);
 	uint64_t mask_out = 0;
 
-	if (!is_mlx5_dev(qp_in->context->device))
-		return EOPNOTSUPP;
-
 	qp_out->dbrec     = mqp->db;
 
 	if (mqp->sq_buf_size)
@@ -980,9 +987,6 @@ static int mlx5dv_get_cq(struct ibv_cq *cq_in,
 	struct mlx5_cq *mcq = to_mcq(cq_in);
 	struct mlx5_context *mctx = to_mctx(cq_in->context);
 
-	if (!is_mlx5_dev(cq_in->context->device))
-		return EOPNOTSUPP;
-
 	cq_out->comp_mask = 0;
 	cq_out->cqn       = mcq->cqn;
 	cq_out->cqe_cnt   = mcq->verbs_cq.cq.cqe + 1;
@@ -1001,9 +1005,6 @@ static int mlx5dv_get_rwq(struct ibv_wq *wq_in,
 {
 	struct mlx5_rwq *mrwq = to_mrwq(wq_in);
 
-	if (!is_mlx5_dev(wq_in->context->device))
-		return EOPNOTSUPP;
-
 	rwq_out->comp_mask = 0;
 	rwq_out->buf       = mrwq->pbuff;
 	rwq_out->dbrec     = mrwq->recv_db;
@@ -1019,9 +1020,6 @@ static int mlx5dv_get_srq(struct ibv_srq *srq_in,
 	struct mlx5_srq *msrq;
 	uint64_t mask_out = 0;
 
-	if (!is_mlx5_dev(srq_in->context->device))
-		return EOPNOTSUPP;
-
 	msrq = container_of(srq_in, struct mlx5_srq, vsrq.srq);
 
 	srq_out->buf       = msrq->buf.buf;
@@ -1045,9 +1043,6 @@ static int mlx5dv_get_dm(struct ibv_dm *dm_in,
 	struct mlx5_dm *mdm = to_mdm(dm_in);
 	uint64_t mask_out = 0;
 
-	if (!is_mlx5_dev(dm_in->context->device))
-		return EOPNOTSUPP;
-
 	dm_out->buf       = mdm->start_va;
 	dm_out->length    = mdm->length;
 
@@ -1066,9 +1061,6 @@ static int mlx5dv_get_av(struct ibv_ah *ah_in,
 {
 	struct mlx5_ah *mah = to_mah(ah_in);
 
-	if (!is_mlx5_dev(ah_in->context->device))
-		return EOPNOTSUPP;
-
 	ah_out->comp_mask = 0;
 	ah_out->av	  = &mah->av;
 
@@ -1080,9 +1072,6 @@ static int mlx5dv_get_pd(struct ibv_pd *pd_in,
 {
 	struct mlx5_pd *mpd = to_mpd(pd_in);
 
-	if (!is_mlx5_dev(pd_in->context->device))
-		return EOPNOTSUPP;
-
 	pd_out->comp_mask = 0;
 	pd_out->pdn = mpd->pdn;
 
@@ -1119,8 +1108,7 @@ static bool lag_operation_supported(struct ibv_qp *qp)
 	struct mlx5_context *mctx = to_mctx(qp->context);
 	struct mlx5_qp *mqp = to_mqp(qp);
 
-	if (!is_mlx5_dev(qp->context->device) ||
-	    (mctx->entropy_caps.num_lag_ports <= 1))
+	if (mctx->entropy_caps.num_lag_ports <= 1)
 		return false;
 
 	if ((qp->qp_type == IBV_QPT_RC) ||
@@ -1136,8 +1124,8 @@ static bool lag_operation_supported(struct ibv_qp *qp)
 }
 
 
-int mlx5dv_query_qp_lag_port(struct ibv_qp *qp, uint8_t *port_num,
-			     uint8_t *active_port_num)
+static int _mlx5dv_query_qp_lag_port(struct ibv_qp *qp, uint8_t *port_num,
+				     uint8_t *active_port_num)
 {
 	uint8_t lag_state, tx_remap_affinity_1, tx_remap_affinity_2;
 	uint32_t in_tis[DEVX_ST_SZ_DW(query_tis_in)] = {};
@@ -1201,6 +1189,18 @@ int mlx5dv_query_qp_lag_port(struct ibv_qp *qp, uint8_t *port_num,
 	return 0;
 }
 
+int mlx5dv_query_qp_lag_port(struct ibv_qp *qp, uint8_t *port_num,
+			     uint8_t *active_port_num)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->query_qp_lag_port)
+		return EOPNOTSUPP;
+
+	return dvops->query_qp_lag_port(qp, port_num,
+					active_port_num);
+}
+
 static int modify_tis_lag_port(struct ibv_qp *qp, uint8_t port_num)
 {
 	uint32_t out[DEVX_ST_SZ_DW(modify_tis_out)] = {};
@@ -1232,7 +1232,7 @@ static int modify_qp_lag_port(struct ibv_qp *qp, uint8_t port_num)
 	return mlx5dv_devx_qp_modify(qp, in, sizeof(in), out, sizeof(out));
 }
 
-int mlx5dv_modify_qp_lag_port(struct ibv_qp *qp, uint8_t port_num)
+static int _mlx5dv_modify_qp_lag_port(struct ibv_qp *qp, uint8_t port_num)
 {
 	uint8_t curr_configured, curr_active;
 	struct mlx5_qp *mqp = to_mqp(qp);
@@ -1263,15 +1263,23 @@ int mlx5dv_modify_qp_lag_port(struct ibv_qp *qp, uint8_t port_num)
 	}
 }
 
-int mlx5dv_modify_qp_udp_sport(struct ibv_qp *qp, uint16_t udp_sport)
+int mlx5dv_modify_qp_lag_port(struct ibv_qp *qp, uint8_t port_num)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->modify_qp_lag_port)
+		return EOPNOTSUPP;
+
+	return dvops->modify_qp_lag_port(qp, port_num);
+
+}
+
+static int _mlx5dv_modify_qp_udp_sport(struct ibv_qp *qp, uint16_t udp_sport)
 {
 	uint32_t in[DEVX_ST_SZ_DW(rts2rts_qp_in)] = {};
 	uint32_t out[DEVX_ST_SZ_DW(rts2rts_qp_out)] = {};
 	struct mlx5_context *mctx = to_mctx(qp->context);
 
-	if (!is_mlx5_dev(qp->context->device))
-		return EOPNOTSUPP;
-
 	switch (qp->qp_type) {
 	case IBV_QPT_RC:
 	case IBV_QPT_UC:
@@ -1293,6 +1301,16 @@ int mlx5dv_modify_qp_udp_sport(struct ibv_qp *qp, uint16_t udp_sport)
 				     sizeof(out));
 }
 
+int mlx5dv_modify_qp_udp_sport(struct ibv_qp *qp, uint16_t udp_sport)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->modify_qp_udp_sport)
+		return EOPNOTSUPP;
+
+	return dvops->modify_qp_udp_sport(qp, udp_sport);
+}
+
 static bool sched_supported(struct ibv_context *ctx)
 {
 	struct mlx5_qos_caps *qc = &to_mctx(ctx)->qos_caps;
@@ -1412,18 +1430,13 @@ static bool sched_attr_valid(const struct mlx5dv_sched_attr *attr, bool node)
 	return true;
 }
 
-struct mlx5dv_sched_node *
-mlx5dv_sched_node_create(struct ibv_context *ctx,
-			 const struct mlx5dv_sched_attr *attr)
+static struct mlx5dv_sched_node *
+_mlx5dv_sched_node_create(struct ibv_context *ctx,
+			   const struct mlx5dv_sched_attr *attr)
 {
 	struct mlx5dv_sched_node *node;
 	struct mlx5dv_devx_obj *obj;
 
-	if (!is_mlx5_dev(ctx->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	if (!sched_attr_valid(attr, true)) {
 		errno = EINVAL;
 		return NULL;
@@ -1453,10 +1466,24 @@ err_sched_nic_create:
 	return NULL;
 }
 
-struct mlx5dv_sched_leaf *
-mlx5dv_sched_leaf_create(struct ibv_context *ctx,
+struct mlx5dv_sched_node *
+mlx5dv_sched_node_create(struct ibv_context *ctx,
 			 const struct mlx5dv_sched_attr *attr)
 {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->sched_node_create) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->sched_node_create(ctx, attr);
+}
+
+static struct mlx5dv_sched_leaf *
+_mlx5dv_sched_leaf_create(struct ibv_context *ctx,
+			   const struct mlx5dv_sched_attr *attr)
+{
 	struct mlx5dv_sched_leaf *leaf;
 	struct mlx5dv_devx_obj *obj;
 
@@ -1490,8 +1517,22 @@ err_sched_nic_create:
 	return NULL;
 }
 
-int mlx5dv_sched_node_modify(struct mlx5dv_sched_node *node,
-			     const struct mlx5dv_sched_attr *attr)
+struct mlx5dv_sched_leaf *
+mlx5dv_sched_leaf_create(struct ibv_context *ctx,
+			 const struct mlx5dv_sched_attr *attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->sched_leaf_create) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->sched_leaf_create(ctx, attr);
+}
+
+static int _mlx5dv_sched_node_modify(struct mlx5dv_sched_node *node,
+				     const struct mlx5dv_sched_attr *attr)
 {
 	if (!node || !sched_attr_valid(attr, true)) {
 		errno = EINVAL;
@@ -1507,9 +1548,20 @@ int mlx5dv_sched_node_modify(struct mlx5dv_sched_node *node,
 				       MLX5_SCHED_ELEM_TYPE_TSAR);
 }
 
-int mlx5dv_sched_leaf_modify(struct mlx5dv_sched_leaf *leaf,
+int mlx5dv_sched_node_modify(struct mlx5dv_sched_node *node,
 			     const struct mlx5dv_sched_attr *attr)
 {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(node->obj->context);
+
+	if (!dvops || !dvops->sched_node_modify)
+		return EOPNOTSUPP;
+
+	return dvops->sched_node_modify(node, attr);
+}
+
+static int _mlx5dv_sched_leaf_modify(struct mlx5dv_sched_leaf *leaf,
+				     const struct mlx5dv_sched_attr *attr)
+{
 	if (!leaf || !sched_attr_valid(attr, false)) {
 		errno = EINVAL;
 		return errno;
@@ -1524,7 +1576,18 @@ int mlx5dv_sched_leaf_modify(struct mlx5dv_sched_leaf *leaf,
 				       MLX5_SCHED_ELEM_TYPE_QUEUE_GROUP);
 }
 
-int mlx5dv_sched_node_destroy(struct mlx5dv_sched_node *node)
+int mlx5dv_sched_leaf_modify(struct mlx5dv_sched_leaf *leaf,
+			     const struct mlx5dv_sched_attr *attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(leaf->obj->context);
+
+	if (!dvops || !dvops->sched_leaf_modify)
+		return EOPNOTSUPP;
+
+	return dvops->sched_leaf_modify(leaf, attr);
+}
+
+static int _mlx5dv_sched_node_destroy(struct mlx5dv_sched_node *node)
 {
 	int ret;
 
@@ -1536,7 +1599,17 @@ int mlx5dv_sched_node_destroy(struct mlx5dv_sched_node *node)
 	return 0;
 }
 
-int mlx5dv_sched_leaf_destroy(struct mlx5dv_sched_leaf *leaf)
+int mlx5dv_sched_node_destroy(struct mlx5dv_sched_node *node)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(node->obj->context);
+
+	if (!dvops || !dvops->sched_node_destroy)
+		return EOPNOTSUPP;
+
+	return dvops->sched_node_destroy(node);
+}
+
+static int _mlx5dv_sched_leaf_destroy(struct mlx5dv_sched_leaf *leaf)
 {
 	int ret;
 
@@ -1548,6 +1621,16 @@ int mlx5dv_sched_leaf_destroy(struct mlx5dv_sched_leaf *leaf)
 	return 0;
 }
 
+int mlx5dv_sched_leaf_destroy(struct mlx5dv_sched_leaf *leaf)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(leaf->obj->context);
+
+	if (!dvops || !dvops->sched_leaf_destroy)
+		return EOPNOTSUPP;
+
+	return dvops->sched_leaf_destroy(leaf);
+}
+
 static int modify_ib_qp_sched_elem_init(struct ibv_qp *qp,
 					uint32_t req_id, uint32_t resp_id)
 {
@@ -1630,9 +1713,9 @@ static int modify_raw_qp_sched_elem(struct ibv_qp *qp, uint32_t qos_id)
 	return mlx5dv_devx_qp_modify(qp, min, sizeof(min), mout, sizeof(mout));
 }
 
-int mlx5dv_modify_qp_sched_elem(struct ibv_qp *qp,
-				const struct mlx5dv_sched_leaf *requestor,
-				const struct mlx5dv_sched_leaf *responder)
+static int _mlx5dv_modify_qp_sched_elem(struct ibv_qp *qp,
+					const struct mlx5dv_sched_leaf *requestor,
+					const struct mlx5dv_sched_leaf *responder)
 {
 	struct mlx5_qos_caps *qc = &to_mctx(qp->context)->qos_caps;
 
@@ -1659,6 +1742,18 @@ int mlx5dv_modify_qp_sched_elem(struct ibv_qp *qp,
 	}
 }
 
+int mlx5dv_modify_qp_sched_elem(struct ibv_qp *qp,
+				const struct mlx5dv_sched_leaf *requestor,
+				const struct mlx5dv_sched_leaf *responder)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->modify_qp_sched_elem)
+		return EOPNOTSUPP;
+
+	return dvops->modify_qp_sched_elem(qp, requestor, responder);
+}
+
 int mlx5_modify_qp_drain_sigerr(struct ibv_qp *qp)
 {
 	uint64_t mask = MLX5_QPC_OPT_MASK_INIT2INIT_DRAIN_SIGERR;
@@ -1750,15 +1845,14 @@ static void reserved_qpn_blks_free(struct mlx5_context *mctx)
 * always starts from the last allocation position, to make sure the QPN
 * always moves forward to prevent stale QPNs.
  */
-int mlx5dv_reserved_qpn_alloc(struct ibv_context *ctx, uint32_t *qpn)
+static int _mlx5dv_reserved_qpn_alloc(struct ibv_context *ctx, uint32_t *qpn)
 {
 	struct mlx5_context *mctx = to_mctx(ctx);
 	struct reserved_qpn_blk *blk;
 	uint32_t qpns_per_obj;
 	int ret = 0;
 
-	if (!is_mlx5_dev(ctx->device) ||
-	    !(mctx->general_obj_types_caps & (1ULL << MLX5_OBJ_TYPE_RESERVED_QPN)))
+	if (!(mctx->general_obj_types_caps & (1ULL << MLX5_OBJ_TYPE_RESERVED_QPN)))
 		return EOPNOTSUPP;
 
 	qpns_per_obj = 1 << mctx->hca_cap_2_caps.log_reserved_qpns_per_obj;
@@ -1786,11 +1880,21 @@ end:
 	return ret;
 }
 
+int mlx5dv_reserved_qpn_alloc(struct ibv_context *ctx, uint32_t *qpn)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->reserved_qpn_alloc)
+		return EOPNOTSUPP;
+
+	return dvops->reserved_qpn_alloc(ctx, qpn);
+}
+
 /**
  * Deallocate a reserved QPN. The FW object is destroyed only when all QPNs
  * in this object were used and freed.
  */
-int mlx5dv_reserved_qpn_dealloc(struct ibv_context *ctx, uint32_t qpn)
+static int _mlx5dv_reserved_qpn_dealloc(struct ibv_context *ctx, uint32_t qpn)
 {
 	struct mlx5_context *mctx = to_mctx(ctx);
 	struct reserved_qpn_blk *blk, *tmp;
@@ -1829,9 +1933,17 @@ end:
 	return ret;
 }
 
-LATEST_SYMVER_FUNC(mlx5dv_init_obj, 1_2, "MLX5_1.2",
-		   int,
-		   struct mlx5dv_obj *obj, uint64_t obj_type)
+int mlx5dv_reserved_qpn_dealloc(struct ibv_context *ctx, uint32_t qpn)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->reserved_qpn_dealloc)
+		return EOPNOTSUPP;
+
+	return dvops->reserved_qpn_dealloc(ctx, qpn);
+}
+
+static int _mlx5dv_init_obj(struct mlx5dv_obj *obj, uint64_t obj_type)
 {
 	int ret = 0;
 
@@ -1853,6 +1965,46 @@ LATEST_SYMVER_FUNC(mlx5dv_init_obj, 1_2, "MLX5_1.2",
 	return ret;
 }
 
+static struct ibv_context *
+get_context_from_obj(struct mlx5dv_obj *obj, uint64_t obj_type)
+{
+	if (obj_type & MLX5DV_OBJ_QP)
+		return obj->qp.in->context;
+	if (obj_type & MLX5DV_OBJ_CQ)
+		return obj->cq.in->context;
+	if (obj_type & MLX5DV_OBJ_SRQ)
+		return obj->srq.in->context;
+	if (obj_type & MLX5DV_OBJ_RWQ)
+		return obj->rwq.in->context;
+	if (obj_type & MLX5DV_OBJ_DM)
+		return obj->dm.in->context;
+	if (obj_type & MLX5DV_OBJ_AH)
+		return obj->ah.in->context;
+	if (obj_type & MLX5DV_OBJ_PD)
+		return obj->pd.in->context;
+
+	return NULL;
+}
+
+LATEST_SYMVER_FUNC(mlx5dv_init_obj, 1_2, "MLX5_1.2",
+		   int,
+		   struct mlx5dv_obj *obj, uint64_t obj_type)
+{
+	struct mlx5_dv_context_ops *dvops;
+	struct ibv_context *ctx;
+
+	ctx = get_context_from_obj(obj, obj_type);
+	if (!ctx)
+		return EINVAL;
+
+	dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->init_obj)
+		return EOPNOTSUPP;
+
+	return dvops->init_obj(obj, obj_type);
+}
+
 COMPAT_SYMVER_FUNC(mlx5dv_init_obj, 1_0, "MLX5_1.0",
 		   int,
 		   struct mlx5dv_obj *obj, uint64_t obj_type)
@@ -1922,14 +2074,12 @@ out:
 	return uar->reg;
 }
 
-int mlx5dv_set_context_attr(struct ibv_context *ibv_ctx,
-			enum mlx5dv_set_ctx_attr_type type, void *attr)
+static int _mlx5dv_set_context_attr(struct ibv_context *ibv_ctx,
+				    enum mlx5dv_set_ctx_attr_type type,
+				    void *attr)
 {
 	struct mlx5_context *ctx = to_mctx(ibv_ctx);
 
-	if (!is_mlx5_dev(ibv_ctx->device))
-		return EOPNOTSUPP;
-
 	switch (type) {
 	case MLX5DV_CTX_ATTR_BUF_ALLOCATORS:
 		ctx->extern_alloc = *((struct mlx5dv_ctx_allocators *)attr);
@@ -1941,8 +2091,19 @@ int mlx5dv_set_context_attr(struct ibv_context *ibv_ctx,
 	return 0;
 }
 
-int mlx5dv_get_clock_info(struct ibv_context *ctx_in,
-			  struct mlx5dv_clock_info *clock_info)
+int mlx5dv_set_context_attr(struct ibv_context *ibv_ctx,
+			    enum mlx5dv_set_ctx_attr_type type, void *attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ibv_ctx);
+
+	if (!dvops || !dvops->set_context_attr)
+		return EOPNOTSUPP;
+
+	return dvops->set_context_attr(ibv_ctx, type, attr);
+}
+
+static int _mlx5dv_get_clock_info(struct ibv_context *ctx_in,
+				  struct mlx5dv_clock_info *clock_info)
 {
 	struct mlx5_context *ctx = to_mctx(ctx_in);
 	const struct mlx5_ib_clock_info *ci;
@@ -1980,6 +2141,41 @@ repeat:
 	return 0;
 }
 
+int mlx5dv_get_clock_info(struct ibv_context *ctx_in,
+			  struct mlx5dv_clock_info *clock_info)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx_in);
+
+	if (!dvops || !dvops->get_clock_info)
+		return EOPNOTSUPP;
+
+	return dvops->get_clock_info(ctx_in, clock_info);
+}
+
+static struct mlx5_dv_context_ops mlx5_dv_ctx_ops = {
+	.query_device = _mlx5dv_query_device,
+
+	.query_qp_lag_port = _mlx5dv_query_qp_lag_port,
+	.modify_qp_lag_port = _mlx5dv_modify_qp_lag_port,
+
+	.modify_qp_udp_sport = _mlx5dv_modify_qp_udp_sport,
+
+	.sched_node_create = _mlx5dv_sched_node_create,
+	.sched_leaf_create = _mlx5dv_sched_leaf_create,
+	.sched_node_modify = _mlx5dv_sched_node_modify,
+	.sched_leaf_modify = _mlx5dv_sched_leaf_modify,
+	.sched_node_destroy = _mlx5dv_sched_node_destroy,
+	.sched_leaf_destroy = _mlx5dv_sched_leaf_destroy,
+	.modify_qp_sched_elem = _mlx5dv_modify_qp_sched_elem,
+
+	.reserved_qpn_alloc = _mlx5dv_reserved_qpn_alloc,
+	.reserved_qpn_dealloc = _mlx5dv_reserved_qpn_dealloc,
+
+	.set_context_attr = _mlx5dv_set_context_attr,
+	.get_clock_info = _mlx5dv_get_clock_info,
+	.init_obj = _mlx5dv_init_obj,
+};
+
 static void adjust_uar_info(struct mlx5_device *mdev,
 			    struct mlx5_context *context,
 			    struct mlx5_ib_alloc_ucontext_resp *resp)
@@ -2234,6 +2430,7 @@ bf_done:
 		else
 			goto err_free;
 	}
+	context->dv_ctx_ops = &mlx5_dv_ctx_ops;
 
 	mlx5_query_device_ctx(context);
 
@@ -2403,6 +2600,7 @@ static struct verbs_device *mlx5_device_alloc(struct verbs_sysfs_dev *sysfs_dev)
 	dev->page_size   = sysconf(_SC_PAGESIZE);
 	dev->driver_abi_ver = sysfs_dev->abi_ver;
 
+	mlx5_set_dv_ctx_ops(&mlx5_dv_ctx_ops);
 	return &dev->verbs_dev;
 }
 
@@ -2417,10 +2615,20 @@ static const struct verbs_device_ops mlx5_dev_ops = {
 	.import_context = mlx5_import_context,
 };
 
-bool is_mlx5_dev(struct ibv_device *device)
+static bool is_mlx5_dev(struct ibv_device *device)
 {
 	struct verbs_device *verbs_device = verbs_get_device(device);
 
 	return verbs_device->ops == &mlx5_dev_ops;
 }
+
+struct mlx5_dv_context_ops *mlx5_get_dv_ops(struct ibv_context *ibctx)
+{
+	if (is_mlx5_dev(ibctx->device))
+		return to_mctx(ibctx)->dv_ctx_ops;
+	else if (is_mlx5_vfio_dev(ibctx->device))
+		return to_mvfio_ctx(ibctx)->dv_ctx_ops;
+	else
+		return NULL;
+}
 PROVIDER_DRIVER(mlx5, mlx5_dev_ops);
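
As an aside, every exported mlx5dv_* symbol above now follows the same shape:
resolve the per-context ops table through mlx5_get_dv_ops() and either dispatch
into the backend's implementation or fail with EOPNOTSUPP. A minimal standalone
sketch of that dispatch scheme follows; all names in it (sample_ops, sample_ctx,
sample_public_op) are invented for illustration and are not part of rdma-core:

#include <errno.h>
#include <stdio.h>

struct sample_ops {
	int (*do_op)(int arg);		/* NULL when a backend lacks support */
};

static int regular_do_op(int arg)
{
	return arg * 2;			/* stand-in for a real implementation */
}

static struct sample_ops regular_ops = { .do_op = regular_do_op };
static struct sample_ops stub_ops;	/* backend that has not wired the op yet */

struct sample_ctx {
	struct sample_ops *ops;		/* set once when the context is created */
};

/* Public entry point: dispatch when possible, otherwise report EOPNOTSUPP */
static int sample_public_op(struct sample_ctx *ctx, int arg, int *result)
{
	if (!ctx->ops || !ctx->ops->do_op)
		return EOPNOTSUPP;

	*result = ctx->ops->do_op(arg);
	return 0;
}

int main(void)
{
	struct sample_ctx a = { .ops = &regular_ops };
	struct sample_ctx b = { .ops = &stub_ops };
	int out = 0;
	int ret;

	ret = sample_public_op(&a, 21, &out);
	printf("regular backend: ret=%d out=%d\n", ret, out);

	ret = sample_public_op(&b, 21, &out);
	printf("stub backend:    ret=%d (EOPNOTSUPP)\n", ret);
	return 0;
}

The stub backend models a context whose provider has not filled in the op,
which is exactly how the partially-populated VFIO ops table behaves below.
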
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index 7e7d70d..fd02199 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -302,6 +302,8 @@ struct mlx5_reserved_qpns {
 	pthread_mutex_t mutex;
 };
 
+struct mlx5_dv_context_ops;
+
 struct mlx5_context {
 	struct verbs_context		ibv_ctx;
 	int				max_num_qps;
@@ -403,6 +405,7 @@ struct mlx5_context {
 	void				*cq_uar_reg;
 	struct mlx5_reserved_qpns	reserved_qpns;
 	uint8_t				qp_data_in_order_cap:1;
+	struct mlx5_dv_context_ops	*dv_ctx_ops;
 };
 
 struct mlx5_bitmap {
@@ -865,11 +868,11 @@ struct mlx5dv_sched_leaf {
 };
 
 struct ibv_flow *
-__mlx5dv_create_flow(struct mlx5dv_flow_matcher *flow_matcher,
-		     struct mlx5dv_flow_match_parameters *match_value,
-		     size_t num_actions,
-		     struct mlx5dv_flow_action_attr actions_attr[],
-		     struct mlx5_flow_action_attr_aux actions_attr_aux[]);
+_mlx5dv_create_flow(struct mlx5dv_flow_matcher *flow_matcher,
+		    struct mlx5dv_flow_match_parameters *match_value,
+		    size_t num_actions,
+		    struct mlx5dv_flow_action_attr actions_attr[],
+		    struct mlx5_flow_action_attr_aux actions_attr_aux[]);
 
 extern int mlx5_stall_num_loop;
 extern int mlx5_stall_cq_poll_min;
@@ -992,7 +995,7 @@ static inline struct mlx5_flow *to_mflow(struct ibv_flow *flow_id)
 	return container_of(flow_id, struct mlx5_flow, flow_id);
 }
 
-bool is_mlx5_dev(struct ibv_device *device);
+bool is_mlx5_vfio_dev(struct ibv_device *device);
 
 void mlx5_open_debug_file(FILE **dbg_fp);
 void mlx5_close_debug_file(FILE *dbg_fp);
@@ -1340,4 +1343,169 @@ static inline bool srq_has_waitq(struct mlx5_srq *srq)
 
 bool srq_cooldown_wqe(struct mlx5_srq *srq, int ind);
 
+struct mlx5_dv_context_ops {
+	int (*devx_general_cmd)(struct ibv_context *context, const void *in,
+				size_t inlen, void *out, size_t outlen);
+
+	struct mlx5dv_devx_obj *(*devx_obj_create)(struct ibv_context *context,
+						   const void *in, size_t inlen,
+						   void *out, size_t outlen);
+	int (*devx_obj_query)(struct mlx5dv_devx_obj *obj, const void *in,
+			      size_t inlen, void *out, size_t outlen);
+	int (*devx_obj_modify)(struct mlx5dv_devx_obj *obj, const void *in,
+			       size_t inlen, void *out, size_t outlen);
+	int (*devx_obj_destroy)(struct mlx5dv_devx_obj *obj);
+
+	int (*devx_query_eqn)(struct ibv_context *context, uint32_t vector,
+			      uint32_t *eqn);
+
+	int (*devx_cq_query)(struct ibv_cq *cq, const void *in, size_t inlen,
+			     void *out, size_t outlen);
+	int (*devx_cq_modify)(struct ibv_cq *cq, const void *in, size_t inlen,
+			      void *out, size_t outlen);
+
+	int (*devx_qp_query)(struct ibv_qp *qp, const void *in, size_t inlen,
+			     void *out, size_t outlen);
+	int (*devx_qp_modify)(struct ibv_qp *qp, const void *in, size_t inlen,
+			      void *out, size_t outlen);
+
+	int (*devx_srq_query)(struct ibv_srq *srq, const void *in, size_t inlen,
+			      void *out, size_t outlen);
+	int (*devx_srq_modify)(struct ibv_srq *srq, const void *in, size_t inlen,
+			       void *out, size_t outlen);
+
+	int (*devx_wq_query)(struct ibv_wq *wq, const void *in, size_t inlen,
+			     void *out, size_t outlen);
+	int (*devx_wq_modify)(struct ibv_wq *wq, const void *in, size_t inlen,
+			      void *out, size_t outlen);
+
+	int (*devx_ind_tbl_query)(struct ibv_rwq_ind_table *ind_tbl, const void *in,
+				  size_t inlen, void *out, size_t outlen);
+	int (*devx_ind_tbl_modify)(struct ibv_rwq_ind_table *ind_tbl, const void *in,
+				   size_t inlen, void *out, size_t outlen);
+
+	struct mlx5dv_devx_cmd_comp *(*devx_create_cmd_comp)(struct ibv_context *context);
+	void (*devx_destroy_cmd_comp)(struct mlx5dv_devx_cmd_comp *cmd_comp);
+
+	struct mlx5dv_devx_event_channel *(*devx_create_event_channel)(struct ibv_context *context,
+								       enum mlx5dv_devx_create_event_channel_flags flags);
+	void (*devx_destroy_event_channel)(struct mlx5dv_devx_event_channel *dv_event_channel);
+	int (*devx_subscribe_devx_event)(struct mlx5dv_devx_event_channel *dv_event_channel,
+					 struct mlx5dv_devx_obj *obj,
+					 uint16_t events_sz,
+					 uint16_t events_num[],
+					 uint64_t cookie);
+	int (*devx_subscribe_devx_event_fd)(struct mlx5dv_devx_event_channel *dv_event_channel,
+					    int fd,
+					    struct mlx5dv_devx_obj *obj,
+					    uint16_t event_num);
+
+	int (*devx_obj_query_async)(struct mlx5dv_devx_obj *obj, const void *in,
+				    size_t inlen, size_t outlen,
+				    uint64_t wr_id,
+				    struct mlx5dv_devx_cmd_comp *cmd_comp);
+	int (*devx_get_async_cmd_comp)(struct mlx5dv_devx_cmd_comp *cmd_comp,
+				       struct mlx5dv_devx_async_cmd_hdr *cmd_resp,
+				       size_t cmd_resp_len);
+
+	ssize_t (*devx_get_event)(struct mlx5dv_devx_event_channel *event_channel,
+				  struct mlx5dv_devx_async_event_hdr *event_data,
+				  size_t event_resp_len);
+
+	struct mlx5dv_devx_uar *(*devx_alloc_uar)(struct ibv_context *context,
+						       uint32_t flags);
+	void (*devx_free_uar)(struct mlx5dv_devx_uar *dv_devx_uar);
+
+	struct mlx5dv_devx_umem *(*devx_umem_reg)(struct ibv_context *context,
+						  void *addr, size_t size, uint32_t access);
+	struct mlx5dv_devx_umem *(*devx_umem_reg_ex)(struct ibv_context *ctx,
+						     struct mlx5dv_devx_umem_in *umem_in);
+	int (*devx_umem_dereg)(struct mlx5dv_devx_umem *dv_devx_umem);
+
+	struct mlx5dv_mkey *(*create_mkey)(struct mlx5dv_mkey_init_attr *mkey_init_attr);
+	int (*destroy_mkey)(struct mlx5dv_mkey *dv_mkey);
+
+	struct mlx5dv_var *(*alloc_var)(struct ibv_context *context, uint32_t flags);
+	void (*free_var)(struct mlx5dv_var *dv_var);
+
+	struct mlx5dv_pp *(*pp_alloc)(struct ibv_context *context, size_t pp_context_sz,
+				      const void *pp_context, uint32_t flags);
+	void (*pp_free)(struct mlx5dv_pp *dv_pp);
+
+	int (*init_obj)(struct mlx5dv_obj *obj, uint64_t obj_type);
+	struct ibv_cq_ex *(*create_cq)(struct ibv_context *context,
+				       struct ibv_cq_init_attr_ex *cq_attr,
+				       struct mlx5dv_cq_init_attr *mlx5_cq_attr);
+	struct ibv_qp *(*create_qp)(struct ibv_context *context,
+				    struct ibv_qp_init_attr_ex *qp_attr,
+				    struct mlx5dv_qp_init_attr *mlx5_qp_attr);
+	struct mlx5dv_qp_ex *(*qp_ex_from_ibv_qp_ex)(struct ibv_qp_ex *qp); /* Is this needed? */
+	struct ibv_wq *(*create_wq)(struct ibv_context *context,
+				    struct ibv_wq_init_attr *attr,
+				    struct mlx5dv_wq_init_attr *mlx5_wq_attr);
+
+	struct ibv_dm *(*alloc_dm)(struct ibv_context *context,
+				   struct ibv_alloc_dm_attr *dm_attr,
+				   struct mlx5dv_alloc_dm_attr *mlx5_dm_attr);
+	void *(*dm_map_op_addr)(struct ibv_dm *dm, uint8_t op);
+
+	struct ibv_flow_action *
+	(*create_flow_action_esp)(struct ibv_context *ctx,
+				  struct ibv_flow_action_esp_attr *esp,
+				  struct mlx5dv_flow_action_esp *mlx5_attr);
+	struct ibv_flow_action *
+	(*create_flow_action_modify_header)(struct ibv_context *ctx,
+					    size_t actions_sz,
+					    uint64_t actions[],
+					    enum mlx5dv_flow_table_type ft_type);
+	struct ibv_flow_action *
+	(*create_flow_action_packet_reformat)(struct ibv_context *ctx,
+					      size_t data_sz,
+					      void *data,
+					      enum mlx5dv_flow_action_packet_reformat_type reformat_type,
+					      enum mlx5dv_flow_table_type ft_type);
+
+	struct mlx5dv_flow_matcher *(*create_flow_matcher)(struct ibv_context *context,
+							   struct mlx5dv_flow_matcher_attr *attr);
+	int (*destroy_flow_matcher)(struct mlx5dv_flow_matcher *flow_matcher);
+	struct ibv_flow *(*create_flow)(struct mlx5dv_flow_matcher *flow_matcher,
+					struct mlx5dv_flow_match_parameters *match_value,
+					size_t num_actions,
+					struct mlx5dv_flow_action_attr actions_attr[],
+					struct mlx5_flow_action_attr_aux actions_attr_aux[]);
+
+	int (*query_device)(struct ibv_context *ctx_in, struct mlx5dv_context *attrs_out);
+
+	int (*query_qp_lag_port)(struct ibv_qp *qp, uint8_t *port_num,
+				 uint8_t *active_port_num);
+	int (*modify_qp_lag_port)(struct ibv_qp *qp, uint8_t port_num);
+	int (*modify_qp_udp_sport)(struct ibv_qp *qp, uint16_t udp_sport);
+
+	struct mlx5dv_sched_node *(*sched_node_create)(struct ibv_context *ctx,
+						       const struct mlx5dv_sched_attr *attr);
+	struct mlx5dv_sched_leaf *(*sched_leaf_create)(struct ibv_context *ctx,
+						       const struct mlx5dv_sched_attr *attr);
+	int (*sched_node_modify)(struct mlx5dv_sched_node *node,
+				 const struct mlx5dv_sched_attr *attr);
+	int (*sched_leaf_modify)(struct mlx5dv_sched_leaf *leaf,
+				 const struct mlx5dv_sched_attr *attr);
+	int (*sched_node_destroy)(struct mlx5dv_sched_node *node);
+	int (*sched_leaf_destroy)(struct mlx5dv_sched_leaf *leaf);
+	int (*modify_qp_sched_elem)(struct ibv_qp *qp,
+				    const struct mlx5dv_sched_leaf *requestor,
+				    const struct mlx5dv_sched_leaf *responder);
+	int (*reserved_qpn_alloc)(struct ibv_context *ctx, uint32_t *qpn);
+	int (*reserved_qpn_dealloc)(struct ibv_context *ctx, uint32_t qpn);
+	int (*set_context_attr)(struct ibv_context *ibv_ctx,
+				enum mlx5dv_set_ctx_attr_type type, void *attr);
+	int (*get_clock_info)(struct ibv_context *ctx_in,
+			      struct mlx5dv_clock_info *clock_info);
+	int (*query_port)(struct ibv_context *context, uint32_t port_num,
+			  struct mlx5dv_port *info, size_t info_len);
+	int (*map_ah_to_qp)(struct ibv_ah *ah, uint32_t qp_num);
+};
+
+struct mlx5_dv_context_ops *mlx5_get_dv_ops(struct ibv_context *context);
+void mlx5_set_dv_ctx_ops(struct mlx5_dv_context_ops *ops);
+
 #endif /* MLX5_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index e85a8cc..23c6eeb 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -2459,6 +2459,25 @@ end:
 	return NULL;
 }
 
+static struct mlx5dv_devx_obj *
+vfio_devx_obj_create(struct ibv_context *context, const void *in,
+		     size_t inlen, void *out, size_t outlen)
+{
+	errno = EOPNOTSUPP;
+	return NULL;
+}
+
+static int vfio_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in,
+				size_t inlen, void *out, size_t outlen)
+{
+	return EOPNOTSUPP;
+}
+
+static struct mlx5_dv_context_ops mlx5_vfio_dv_ctx_ops = {
+	.devx_obj_create = vfio_devx_obj_create,
+	.devx_obj_query = vfio_devx_obj_query,
+};
+
 static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
 {
 	mlx5_close_debug_file(ctx->dbg_fp);
@@ -2524,6 +2543,7 @@ mlx5_vfio_alloc_context(struct ibv_device *ibdev,
 		goto func_teardown;
 
 	verbs_set_ops(&mctx->vctx, &mlx5_vfio_common_ops);
+	mctx->dv_ctx_ops = &mlx5_vfio_dv_ctx_ops;
 	return &mctx->vctx;
 
 func_teardown:
@@ -2729,3 +2749,10 @@ end:
 	free(list);
 	return NULL;
 }
+
+bool is_mlx5_vfio_dev(struct ibv_device *device)
+{
+	struct verbs_device *verbs_device = verbs_get_device(device);
+
+	return verbs_device->ops == &mlx5_vfio_dev_ops;
+}
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 5311c6f..79b8033 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -290,6 +290,7 @@ struct mlx5_vfio_context {
 	struct mlx5_eq async_eq;
 	struct mlx5_vfio_eqs_uar eqs_uar;
 	pthread_mutex_t eq_lock;
+	struct mlx5_dv_context_ops *dv_ctx_ops;
 };
 
 static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
diff --git a/providers/mlx5/verbs.c b/providers/mlx5/verbs.c
index 833b7cb..33b19df 100644
--- a/providers/mlx5/verbs.c
+++ b/providers/mlx5/verbs.c
@@ -1169,17 +1169,12 @@ struct ibv_cq_ex *mlx5_create_cq_ex(struct ibv_context *context,
 	return create_cq(context, cq_attr, MLX5_CQ_FLAGS_EXTENDED, NULL);
 }
 
-struct ibv_cq_ex *mlx5dv_create_cq(struct ibv_context *context,
-				      struct ibv_cq_init_attr_ex *cq_attr,
-				      struct mlx5dv_cq_init_attr *mlx5_cq_attr)
+static struct ibv_cq_ex *_mlx5dv_create_cq(struct ibv_context *context,
+					   struct ibv_cq_init_attr_ex *cq_attr,
+					   struct mlx5dv_cq_init_attr *mlx5_cq_attr)
 {
 	struct ibv_cq_ex *cq;
 
-	if (!is_mlx5_dev(context->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	cq = create_cq(context, cq_attr, MLX5_CQ_FLAGS_EXTENDED, mlx5_cq_attr);
 	if (!cq)
 		return NULL;
@@ -1189,6 +1184,20 @@ struct ibv_cq_ex *mlx5dv_create_cq(struct ibv_context *context,
 	return cq;
 }
 
+struct ibv_cq_ex *mlx5dv_create_cq(struct ibv_context *context,
+				      struct ibv_cq_init_attr_ex *cq_attr,
+				      struct mlx5dv_cq_init_attr *mlx5_cq_attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->create_cq) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_cq(context, cq_attr, mlx5_cq_attr);
+}
+
 int mlx5_resize_cq(struct ibv_cq *ibcq, int cqe)
 {
 	struct mlx5_cq *cq = to_mcq(ibcq);
@@ -3112,7 +3121,7 @@ int mlx5_destroy_ah(struct ibv_ah *ah)
 	return 0;
 }
 
-int mlx5dv_map_ah_to_qp(struct ibv_ah *ah, uint32_t qp_num)
+static int _mlx5dv_map_ah_to_qp(struct ibv_ah *ah, uint32_t qp_num)
 {
 	uint32_t out[DEVX_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
 	uint32_t in[DEVX_ST_SZ_DW(create_av_qp_mapping_in)] = {};
@@ -3122,9 +3131,6 @@ int mlx5dv_map_ah_to_qp(struct ibv_ah *ah, uint32_t qp_num)
 	void *attr;
 	int ret = 0;
 
-	if (!is_mlx5_dev(ah->context->device))
-		return EOPNOTSUPP;
-
 	if (!(mctx->general_obj_types_caps &
 	      (1ULL << MLX5_OBJ_TYPE_AV_QP_MAPPING)) ||
 	    !mah->is_global)
@@ -3159,6 +3165,16 @@ int mlx5dv_map_ah_to_qp(struct ibv_ah *ah, uint32_t qp_num)
 	return ret;
 }
 
+int mlx5dv_map_ah_to_qp(struct ibv_ah *ah, uint32_t qp_num)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ah->context);
+
+	if (!dvops || !dvops->map_ah_to_qp)
+		return EOPNOTSUPP;
+
+	return dvops->map_ah_to_qp(ah, qp_num);
+}
+
 int mlx5_attach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid)
 {
 	return ibv_cmd_attach_mcast(qp, gid, lid);
@@ -3175,16 +3191,25 @@ struct ibv_qp *mlx5_create_qp_ex(struct ibv_context *context,
 	return create_qp(context, attr, NULL);
 }
 
+static struct ibv_qp *_mlx5dv_create_qp(struct ibv_context *context,
+				struct ibv_qp_init_attr_ex *qp_attr,
+				struct mlx5dv_qp_init_attr *mlx5_qp_attr)
+{
+	return create_qp(context, qp_attr, mlx5_qp_attr);
+}
+
 struct ibv_qp *mlx5dv_create_qp(struct ibv_context *context,
 				struct ibv_qp_init_attr_ex *qp_attr,
 				struct mlx5dv_qp_init_attr *mlx5_qp_attr)
 {
-	if (!is_mlx5_dev(context->device)) {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->create_qp) {
 		errno = EOPNOTSUPP;
 		return NULL;
 	}
 
-	return create_qp(context, qp_attr, mlx5_qp_attr);
+	return dvops->create_qp(context, qp_attr, mlx5_qp_attr);
 }
 
 struct mlx5dv_qp_ex *mlx5dv_qp_ex_from_ibv_qp_ex(struct ibv_qp_ex *qp)
@@ -4009,16 +4034,25 @@ struct ibv_wq *mlx5_create_wq(struct ibv_context *context,
 	return create_wq(context, attr, NULL);
 }
 
+static struct ibv_wq *_mlx5dv_create_wq(struct ibv_context *context,
+					struct ibv_wq_init_attr *attr,
+					struct mlx5dv_wq_init_attr *mlx5_wq_attr)
+{
+	return create_wq(context, attr, mlx5_wq_attr);
+}
+
 struct ibv_wq *mlx5dv_create_wq(struct ibv_context *context,
 				struct ibv_wq_init_attr *attr,
 				struct mlx5dv_wq_init_attr *mlx5_wq_attr)
 {
-	if (!is_mlx5_dev(context->device)) {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->create_wq) {
 		errno = EOPNOTSUPP;
 		return NULL;
 	}
 
-	return create_wq(context, attr, mlx5_wq_attr);
+	return dvops->create_wq(context, attr, mlx5_wq_attr);
 }
 
 int mlx5_modify_wq(struct ibv_wq *wq, struct ibv_wq_attr *attr)
@@ -4298,9 +4332,10 @@ struct ibv_flow_action *mlx5_create_flow_action_esp(struct ibv_context *ctx,
 	return _mlx5_create_flow_action_esp(ctx, attr, NULL);
 }
 
-struct ibv_flow_action *mlx5dv_create_flow_action_esp(struct ibv_context *ctx,
-						      struct ibv_flow_action_esp_attr *esp,
-						      struct mlx5dv_flow_action_esp *mlx5_attr)
+static struct ibv_flow_action *
+_mlx5dv_create_flow_action_esp(struct ibv_context *ctx,
+			       struct ibv_flow_action_esp_attr *esp,
+			       struct mlx5dv_flow_action_esp *mlx5_attr)
 {
 	DECLARE_COMMAND_BUFFER_LINK(driver_attr, UVERBS_OBJECT_FLOW_ACTION,
 				    UVERBS_METHOD_FLOW_ACTION_ESP_CREATE, 1,
@@ -4325,6 +4360,21 @@ struct ibv_flow_action *mlx5dv_create_flow_action_esp(struct ibv_context *ctx,
 	return _mlx5_create_flow_action_esp(ctx, esp, driver_attr);
 }
 
+struct ibv_flow_action *mlx5dv_create_flow_action_esp(struct ibv_context *ctx,
+						      struct ibv_flow_action_esp_attr *esp,
+						      struct mlx5dv_flow_action_esp *mlx5_attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->create_flow_action_esp) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_flow_action_esp(ctx, esp,
+						      mlx5_attr);
+}
+
 int mlx5_modify_flow_action_esp(struct ibv_flow_action *action,
 				struct ibv_flow_action_esp_attr *attr)
 {
@@ -4337,10 +4387,11 @@ int mlx5_modify_flow_action_esp(struct ibv_flow_action *action,
 	return ibv_cmd_modify_flow_action_esp(vaction, attr, NULL);
 }
 
-struct ibv_flow_action *mlx5dv_create_flow_action_modify_header(struct ibv_context *ctx,
-								size_t actions_sz,
-								uint64_t actions[],
-								enum mlx5dv_flow_table_type ft_type)
+static struct ibv_flow_action *
+_mlx5dv_create_flow_action_modify_header(struct ibv_context *ctx,
+					 size_t actions_sz,
+					 uint64_t actions[],
+					 enum mlx5dv_flow_table_type ft_type)
 {
 	DECLARE_COMMAND_BUFFER(cmd, UVERBS_OBJECT_FLOW_ACTION,
 			       MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER,
@@ -4375,12 +4426,29 @@ struct ibv_flow_action *mlx5dv_create_flow_action_modify_header(struct ibv_conte
 	return &action->action;
 }
 
-struct ibv_flow_action *
-mlx5dv_create_flow_action_packet_reformat(struct ibv_context *ctx,
-					  size_t data_sz,
-					  void *data,
-					  enum mlx5dv_flow_action_packet_reformat_type reformat_type,
-					  enum mlx5dv_flow_table_type ft_type)
+struct ibv_flow_action *mlx5dv_create_flow_action_modify_header(struct ibv_context *ctx,
+								size_t actions_sz,
+								uint64_t actions[],
+								enum mlx5dv_flow_table_type ft_type)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->create_flow_action_modify_header) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_flow_action_modify_header(ctx, actions_sz,
+								actions, ft_type);
+}
+
+static struct ibv_flow_action *
+_mlx5dv_create_flow_action_packet_reformat(struct ibv_context *ctx,
+					   size_t data_sz,
+					   void *data,
+					   enum mlx5dv_flow_action_packet_reformat_type reformat_type,
+					   enum mlx5dv_flow_table_type ft_type)
 {
 	DECLARE_COMMAND_BUFFER(cmd, UVERBS_OBJECT_FLOW_ACTION,
 			       MLX5_IB_METHOD_FLOW_ACTION_CREATE_PACKET_REFORMAT, 4);
@@ -4425,6 +4493,24 @@ mlx5dv_create_flow_action_packet_reformat(struct ibv_context *ctx,
 	return &action->action;
 }
 
+struct ibv_flow_action *
+mlx5dv_create_flow_action_packet_reformat(struct ibv_context *ctx,
+					  size_t data_sz,
+					  void *data,
+					  enum mlx5dv_flow_action_packet_reformat_type reformat_type,
+					  enum mlx5dv_flow_table_type ft_type)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->create_flow_action_packet_reformat) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_flow_action_packet_reformat(ctx, data_sz, data,
+								  reformat_type, ft_type);
+}
+
 int mlx5_destroy_flow_action(struct ibv_flow_action *action)
 {
 	struct verbs_flow_action *vaction =
@@ -4502,7 +4588,7 @@ static void *dm_mmap(struct ibv_context *context, struct mlx5_dm *mdm,
 		    context->cmd_fd, page_size * offset);
 }
 
-void *mlx5dv_dm_map_op_addr(struct ibv_dm *dm, uint8_t op)
+static void *_mlx5dv_dm_map_op_addr(struct ibv_dm *dm, uint8_t op)
 {
 	int page_size = to_mdev(dm->context->device)->page_size;
 	struct mlx5_dm *mdm = to_mdm(dm);
@@ -4511,11 +4597,6 @@ void *mlx5dv_dm_map_op_addr(struct ibv_dm *dm, uint8_t op)
 	void *va;
 	int ret;
 
-	if (!is_mlx5_dev(dm->context->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	DECLARE_COMMAND_BUFFER(cmdb, UVERBS_OBJECT_DM,
 			       MLX5_IB_METHOD_DM_MAP_OP_ADDR, 4);
 	fill_attr_in_obj(cmdb, MLX5_IB_ATTR_DM_MAP_OP_ADDR_REQ_HANDLE,
@@ -4538,6 +4619,18 @@ void *mlx5dv_dm_map_op_addr(struct ibv_dm *dm, uint8_t op)
 	return va + (start_offset & (page_size - 1));
 }
 
+void *mlx5dv_dm_map_op_addr(struct ibv_dm *dm, uint8_t op)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(dm->context);
+
+	if (!dvops || !dvops->dm_map_op_addr) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->dm_map_op_addr(dm, op);
+}
+
 void mlx5_unimport_dm(struct ibv_dm *ibdm)
 {
 	struct mlx5_dm *dm = to_mdm(ibdm);
@@ -4653,10 +4746,10 @@ static int alloc_dm_steering_sw_icm(struct ibv_context *ctx,
 	return 0;
 }
 
-struct ibv_dm *
-mlx5dv_alloc_dm(struct ibv_context *context,
-		struct ibv_alloc_dm_attr *dm_attr,
-		struct mlx5dv_alloc_dm_attr *mlx5_dm_attr)
+static struct ibv_dm *
+_mlx5dv_alloc_dm(struct ibv_context *context,
+		 struct ibv_alloc_dm_attr *dm_attr,
+		 struct mlx5dv_alloc_dm_attr *mlx5_dm_attr)
 {
 	DECLARE_COMMAND_BUFFER(cmdb, UVERBS_OBJECT_DM, UVERBS_METHOD_DM_ALLOC,
 			       3);
@@ -4706,6 +4799,21 @@ err_free_mem:
 	return NULL;
 }
 
+struct ibv_dm *
+mlx5dv_alloc_dm(struct ibv_context *context,
+		struct ibv_alloc_dm_attr *dm_attr,
+		struct mlx5dv_alloc_dm_attr *mlx5_dm_attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->alloc_dm) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->alloc_dm(context, dm_attr, mlx5_dm_attr);
+}
+
 int mlx5_free_dm(struct ibv_dm *ibdm)
 {
 	struct mlx5_device *mdev = to_mdev(ibdm->context->device);
@@ -4845,9 +4953,9 @@ int mlx5_read_counters(struct ibv_counters *counters,
 
 }
 
-struct mlx5dv_flow_matcher *
-mlx5dv_create_flow_matcher(struct ibv_context *context,
-			   struct mlx5dv_flow_matcher_attr *attr)
+static struct mlx5dv_flow_matcher *
+_mlx5dv_create_flow_matcher(struct ibv_context *context,
+			    struct mlx5dv_flow_matcher_attr *attr)
 {
 	DECLARE_COMMAND_BUFFER(cmd, MLX5_IB_OBJECT_FLOW_MATCHER,
 			       MLX5_IB_METHOD_FLOW_MATCHER_CREATE,
@@ -4904,7 +5012,21 @@ err:
 	return NULL;
 }
 
-int mlx5dv_destroy_flow_matcher(struct mlx5dv_flow_matcher *flow_matcher)
+struct mlx5dv_flow_matcher *
+mlx5dv_create_flow_matcher(struct ibv_context *context,
+			   struct mlx5dv_flow_matcher_attr *attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->create_flow_matcher) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_flow_matcher(context, attr);
+}
+
+static int _mlx5dv_destroy_flow_matcher(struct mlx5dv_flow_matcher *flow_matcher)
 {
 	DECLARE_COMMAND_BUFFER(cmd, MLX5_IB_OBJECT_FLOW_MATCHER,
 			       MLX5_IB_METHOD_FLOW_MATCHER_DESTROY,
@@ -4922,13 +5044,23 @@ int mlx5dv_destroy_flow_matcher(struct mlx5dv_flow_matcher *flow_matcher)
 	return 0;
 }
 
+int mlx5dv_destroy_flow_matcher(struct mlx5dv_flow_matcher *flow_matcher)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(flow_matcher->context);
+
+	if (!dvops || !dvops->destroy_flow_matcher)
+		return EOPNOTSUPP;
+
+	return dvops->destroy_flow_matcher(flow_matcher);
+}
+
 #define CREATE_FLOW_MAX_FLOW_ACTIONS_SUPPORTED 8
 struct ibv_flow *
-__mlx5dv_create_flow(struct mlx5dv_flow_matcher *flow_matcher,
-		     struct mlx5dv_flow_match_parameters *match_value,
-		     size_t num_actions,
-		     struct mlx5dv_flow_action_attr actions_attr[],
-		     struct mlx5_flow_action_attr_aux actions_attr_aux[])
+_mlx5dv_create_flow(struct mlx5dv_flow_matcher *flow_matcher,
+		    struct mlx5dv_flow_match_parameters *match_value,
+		    size_t num_actions,
+		    struct mlx5dv_flow_action_attr actions_attr[],
+		    struct mlx5_flow_action_attr_aux actions_attr_aux[])
 {
 	uint32_t flow_actions[CREATE_FLOW_MAX_FLOW_ACTIONS_SUPPORTED];
 	struct verbs_flow_action *vaction;
@@ -5074,15 +5206,22 @@ mlx5dv_create_flow(struct mlx5dv_flow_matcher *flow_matcher,
 		   size_t num_actions,
 		   struct mlx5dv_flow_action_attr actions_attr[])
 {
-	return __mlx5dv_create_flow(flow_matcher,
-				    match_value,
-				    num_actions,
-				    actions_attr,
-				    NULL);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(flow_matcher->context);
+
+	if (!dvops || !dvops->create_flow) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_flow(flow_matcher,
+				  match_value,
+				  num_actions,
+				  actions_attr,
+				  NULL);
 }
 
 static struct mlx5dv_devx_umem *
-_mlx5dv_devx_umem_reg_ex(struct ibv_context *context,
+__mlx5dv_devx_umem_reg_ex(struct ibv_context *context,
 			 struct mlx5dv_devx_umem_in *in,
 			 bool legacy)
 {
@@ -5139,8 +5278,8 @@ err:
 	return NULL;
 }
 
-struct mlx5dv_devx_umem *
-mlx5dv_devx_umem_reg(struct ibv_context *context, void *addr, size_t size, uint32_t access)
+static struct mlx5dv_devx_umem *
+_mlx5dv_devx_umem_reg(struct ibv_context *context, void *addr, size_t size, uint32_t access)
 {
 	struct mlx5dv_devx_umem_in umem_in = {};
 
@@ -5150,16 +5289,43 @@ mlx5dv_devx_umem_reg(struct ibv_context *context, void *addr, size_t size, uint3
 
 	umem_in.pgsz_bitmap = UINT64_MAX & ~(MLX5_ADAPTER_PAGE_SIZE - 1);
 
-	return _mlx5dv_devx_umem_reg_ex(context, &umem_in, true);
+	return __mlx5dv_devx_umem_reg_ex(context, &umem_in, true);
+}
+
+static struct mlx5dv_devx_umem *
+_mlx5dv_devx_umem_reg_ex(struct ibv_context *ctx, struct mlx5dv_devx_umem_in *umem_in)
+{
+	return __mlx5dv_devx_umem_reg_ex(ctx, umem_in, false);
 }
 
 struct mlx5dv_devx_umem *
 mlx5dv_devx_umem_reg_ex(struct ibv_context *ctx, struct mlx5dv_devx_umem_in *umem_in)
 {
-	return _mlx5dv_devx_umem_reg_ex(ctx, umem_in, false);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ctx);
+
+	if (!dvops || !dvops->devx_umem_reg_ex) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_umem_reg_ex(ctx, umem_in);
+}
+
+struct mlx5dv_devx_umem *
+mlx5dv_devx_umem_reg(struct ibv_context *context, void *addr, size_t size, uint32_t access)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_umem_reg) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_umem_reg(context, addr, size, access);
 }
 
-int mlx5dv_devx_umem_dereg(struct mlx5dv_devx_umem *dv_devx_umem)
+static int _mlx5dv_devx_umem_dereg(struct mlx5dv_devx_umem *dv_devx_umem)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_UMEM,
@@ -5179,6 +5345,19 @@ int mlx5dv_devx_umem_dereg(struct mlx5dv_devx_umem *dv_devx_umem)
 	return 0;
 }
 
+int mlx5dv_devx_umem_dereg(struct mlx5dv_devx_umem *dv_devx_umem)
+{
+	struct mlx5_devx_umem *umem = container_of(dv_devx_umem, struct mlx5_devx_umem,
+						   dv_devx_umem);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(umem->context);
+
+	if (!dvops || !dvops->devx_umem_dereg)
+		return EOPNOTSUPP;
+
+	return dvops->devx_umem_dereg(dv_devx_umem);
+}
+
 static void set_devx_obj_info(const void *in, const void *out,
 			      struct mlx5dv_devx_obj *obj)
 {
@@ -5241,9 +5420,9 @@ static void set_devx_obj_info(const void *in, const void *out,
 	}
 }
 
-struct mlx5dv_devx_obj *
-mlx5dv_devx_obj_create(struct ibv_context *context, const void *in, size_t inlen,
-				void *out, size_t outlen)
+static struct mlx5dv_devx_obj *
+_mlx5dv_devx_obj_create(struct ibv_context *context, const void *in,
+			size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5270,14 +5449,30 @@ mlx5dv_devx_obj_create(struct ibv_context *context, const void *in, size_t inlen
 	obj->handle = read_attr_obj(MLX5_IB_ATTR_DEVX_OBJ_CREATE_HANDLE, handle);
 	obj->context = context;
 	set_devx_obj_info(in, out, obj);
+
 	return obj;
 err:
 	free(obj);
 	return NULL;
 }
 
-int mlx5dv_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in, size_t inlen,
-				void *out, size_t outlen)
+struct mlx5dv_devx_obj *
+mlx5dv_devx_obj_create(struct ibv_context *context, const void *in,
+			 size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_obj_create) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_obj_create(context, in, inlen, out, outlen);
+}
+
+static int
+_mlx5dv_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in,
+		       size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5291,8 +5486,19 @@ int mlx5dv_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in, size_t in
 	return execute_ioctl(obj->context, cmd);
 }
 
-int mlx5dv_devx_obj_modify(struct mlx5dv_devx_obj *obj, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in, size_t inlen,
+			  void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->devx_obj_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_obj_query(obj, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_obj_modify(struct mlx5dv_devx_obj *obj, const void *in,
+				   size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5306,7 +5512,18 @@ int mlx5dv_devx_obj_modify(struct mlx5dv_devx_obj *obj, const void *in, size_t i
 	return execute_ioctl(obj->context, cmd);
 }
 
-int mlx5dv_devx_obj_destroy(struct mlx5dv_devx_obj *obj)
+int mlx5dv_devx_obj_modify(struct mlx5dv_devx_obj *obj, const void *in,
+			   size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->devx_obj_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_obj_modify(obj, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_obj_destroy(struct mlx5dv_devx_obj *obj)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5323,8 +5540,18 @@ int mlx5dv_devx_obj_destroy(struct mlx5dv_devx_obj *obj)
 	return 0;
 }
 
-int mlx5dv_devx_general_cmd(struct ibv_context *context, const void *in, size_t inlen,
-			void *out, size_t outlen)
+int mlx5dv_devx_obj_destroy(struct mlx5dv_devx_obj *obj)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->devx_obj_destroy)
+		return EOPNOTSUPP;
+
+	return dvops->devx_obj_destroy(obj);
+}
+
+static int _mlx5dv_devx_general_cmd(struct ibv_context *context, const void *in,
+				    size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX,
@@ -5337,24 +5564,44 @@ int mlx5dv_devx_general_cmd(struct ibv_context *context, const void *in, size_t
 	return execute_ioctl(context, cmd);
 }
 
-int _mlx5dv_query_port(struct ibv_context *context,
-		       uint32_t port_num,
-		       struct mlx5dv_port *info, size_t info_len)
+int mlx5dv_devx_general_cmd(struct ibv_context *context, const void *in, size_t inlen,
+			    void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_general_cmd)
+		return EOPNOTSUPP;
+
+	return dvops->devx_general_cmd(context, in, inlen, out, outlen);
+}
+
+static int __mlx5dv_query_port(struct ibv_context *context,
+			       uint32_t port_num,
+			       struct mlx5dv_port *info, size_t info_len)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       UVERBS_OBJECT_DEVICE,
 			       MLX5_IB_METHOD_QUERY_PORT,
 			       2);
 
-	if (!is_mlx5_dev(context->device))
-		return EOPNOTSUPP;
-
 	fill_attr_in_uint32(cmd, MLX5_IB_ATTR_QUERY_PORT_PORT_NUM, port_num);
 	fill_attr_out(cmd, MLX5_IB_ATTR_QUERY_PORT, info, info_len);
 
 	return execute_ioctl(context, cmd);
 }
 
+int _mlx5dv_query_port(struct ibv_context *context,
+		       uint32_t port_num,
+		       struct mlx5dv_port *info, size_t info_len)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->query_port)
+		return EOPNOTSUPP;
+
+	return dvops->query_port(context, port_num, info, info_len);
+}
+
 void clean_dyn_uars(struct ibv_context *context)
 {
 	struct mlx5_context *ctx = to_mctx(context);
@@ -5379,8 +5626,8 @@ void clean_dyn_uars(struct ibv_context *context)
 		mlx5_free_uar(context, ctx->nc_uar);
 }
 
-struct mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(struct ibv_context *context,
-					      uint32_t flags)
+static struct mlx5dv_devx_uar *
+_mlx5dv_devx_alloc_uar(struct ibv_context *context, uint32_t flags)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX,
@@ -5390,11 +5637,6 @@ struct mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(struct ibv_context *context,
 	int ret;
 	struct mlx5_bf *bf;
 
-	if (!is_mlx5_dev(context->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	if (!check_comp_mask(flags, MLX5_IB_UAPI_UAR_ALLOC_TYPE_NC)) {
 		errno = EOPNOTSUPP;
 		return NULL;
@@ -5430,7 +5672,20 @@ struct mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(struct ibv_context *context,
 	return &bf->devx_uar.dv_devx_uar;
 }
 
-void mlx5dv_devx_free_uar(struct mlx5dv_devx_uar *dv_devx_uar)
+struct mlx5dv_devx_uar *
+mlx5dv_devx_alloc_uar(struct ibv_context *context, uint32_t flags)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_alloc_uar) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_alloc_uar(context, flags);
+}
+
+static void _mlx5dv_devx_free_uar(struct mlx5dv_devx_uar *dv_devx_uar)
 {
 	struct mlx5_bf *bf = container_of(dv_devx_uar, struct mlx5_bf,
 					  devx_uar.dv_devx_uar);
@@ -5441,8 +5696,20 @@ void mlx5dv_devx_free_uar(struct mlx5dv_devx_uar *dv_devx_uar)
 	mlx5_detach_dedicated_uar(bf->devx_uar.context, bf);
 }
 
-int mlx5dv_devx_query_eqn(struct ibv_context *context, uint32_t vector,
-			  uint32_t *eqn)
+void mlx5dv_devx_free_uar(struct mlx5dv_devx_uar *dv_devx_uar)
+{
+	struct mlx5_devx_uar *uar = container_of(dv_devx_uar, struct mlx5_devx_uar,
+						 dv_devx_uar);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(uar->context);
+
+	if (!dvops || !dvops->devx_free_uar)
+		return;
+
+	dvops->devx_free_uar(dv_devx_uar);
+}
+
+static int _mlx5dv_devx_query_eqn(struct ibv_context *context,
+				   uint32_t vector, uint32_t *eqn)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX,
@@ -5455,8 +5722,19 @@ int mlx5dv_devx_query_eqn(struct ibv_context *context, uint32_t vector,
 	return execute_ioctl(context, cmd);
 }
 
-int mlx5dv_devx_cq_query(struct ibv_cq *cq, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_query_eqn(struct ibv_context *context, uint32_t vector,
+			  uint32_t *eqn)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_query_eqn)
+		return EOPNOTSUPP;
+
+	return dvops->devx_query_eqn(context, vector, eqn);
+}
+
+static int _mlx5dv_devx_cq_query(struct ibv_cq *cq, const void *in,
+				  size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5470,9 +5748,20 @@ int mlx5dv_devx_cq_query(struct ibv_cq *cq, const void *in, size_t inlen,
 	return execute_ioctl(cq->context, cmd);
 }
 
-int mlx5dv_devx_cq_modify(struct ibv_cq *cq, const void *in, size_t inlen,
+int mlx5dv_devx_cq_query(struct ibv_cq *cq, const void *in, size_t inlen,
 				void *out, size_t outlen)
 {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(cq->context);
+
+	if (!dvops || !dvops->devx_cq_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_cq_query(cq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_cq_modify(struct ibv_cq *cq, const void *in,
+				   size_t inlen, void *out, size_t outlen)
+{
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
 			       MLX5_IB_METHOD_DEVX_OBJ_MODIFY,
@@ -5485,9 +5774,20 @@ int mlx5dv_devx_cq_modify(struct ibv_cq *cq, const void *in, size_t inlen,
 	return execute_ioctl(cq->context, cmd);
 }
 
-int mlx5dv_devx_qp_query(struct ibv_qp *qp, const void *in, size_t inlen,
+int mlx5dv_devx_cq_modify(struct ibv_cq *cq, const void *in, size_t inlen,
 				void *out, size_t outlen)
 {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(cq->context);
+
+	if (!dvops || !dvops->devx_cq_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_cq_modify(cq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_qp_query(struct ibv_qp *qp, const void *in,
+				  size_t inlen, void *out, size_t outlen)
+{
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
 			       MLX5_IB_METHOD_DEVX_OBJ_QUERY,
@@ -5500,9 +5800,20 @@ int mlx5dv_devx_qp_query(struct ibv_qp *qp, const void *in, size_t inlen,
 	return execute_ioctl(qp->context, cmd);
 }
 
-int mlx5dv_devx_qp_modify(struct ibv_qp *qp, const void *in, size_t inlen,
+int mlx5dv_devx_qp_query(struct ibv_qp *qp, const void *in, size_t inlen,
 				void *out, size_t outlen)
 {
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->devx_qp_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_qp_query(qp, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_qp_modify(struct ibv_qp *qp, const void *in,
+				   size_t inlen, void *out, size_t outlen)
+{
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
 			       MLX5_IB_METHOD_DEVX_OBJ_MODIFY,
@@ -5515,8 +5826,19 @@ int mlx5dv_devx_qp_modify(struct ibv_qp *qp, const void *in, size_t inlen,
 	return execute_ioctl(qp->context, cmd);
 }
 
-int mlx5dv_devx_srq_query(struct ibv_srq *srq, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_qp_modify(struct ibv_qp *qp, const void *in, size_t inlen,
+			  void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(qp->context);
+
+	if (!dvops || !dvops->devx_qp_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_qp_modify(qp, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_srq_query(struct ibv_srq *srq, const void *in,
+				   size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5530,8 +5852,19 @@ int mlx5dv_devx_srq_query(struct ibv_srq *srq, const void *in, size_t inlen,
 	return execute_ioctl(srq->context, cmd);
 }
 
-int mlx5dv_devx_srq_modify(struct ibv_srq *srq, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_srq_query(struct ibv_srq *srq, const void *in, size_t inlen,
+			  void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(srq->context);
+
+	if (!dvops || !dvops->devx_srq_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_srq_query(srq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_srq_modify(struct ibv_srq *srq, const void *in,
+				    size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5545,8 +5878,19 @@ int mlx5dv_devx_srq_modify(struct ibv_srq *srq, const void *in, size_t inlen,
 	return execute_ioctl(srq->context, cmd);
 }
 
-int mlx5dv_devx_wq_query(struct ibv_wq *wq, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_srq_modify(struct ibv_srq *srq, const void *in, size_t inlen,
+			   void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(srq->context);
+
+	if (!dvops || !dvops->devx_srq_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_srq_modify(srq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_wq_query(struct ibv_wq *wq, const void *in, size_t inlen,
+				  void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5560,8 +5904,19 @@ int mlx5dv_devx_wq_query(struct ibv_wq *wq, const void *in, size_t inlen,
 	return execute_ioctl(wq->context, cmd);
 }
 
-int mlx5dv_devx_wq_modify(struct ibv_wq *wq, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_wq_query(struct ibv_wq *wq, const void *in, size_t inlen,
+			 void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(wq->context);
+
+	if (!dvops || !dvops->devx_wq_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_wq_query(wq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_wq_modify(struct ibv_wq *wq, const void *in,
+				   size_t inlen, void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5575,8 +5930,20 @@ int mlx5dv_devx_wq_modify(struct ibv_wq *wq, const void *in, size_t inlen,
 	return execute_ioctl(wq->context, cmd);
 }
 
-int mlx5dv_devx_ind_tbl_query(struct ibv_rwq_ind_table *ind_tbl, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_wq_modify(struct ibv_wq *wq, const void *in, size_t inlen,
+			  void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(wq->context);
+
+	if (!dvops || !dvops->devx_wq_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_wq_modify(wq, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_ind_tbl_query(struct ibv_rwq_ind_table *ind_tbl,
+				       const void *in, size_t inlen,
+				       void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5590,8 +5957,21 @@ int mlx5dv_devx_ind_tbl_query(struct ibv_rwq_ind_table *ind_tbl, const void *in,
 	return execute_ioctl(ind_tbl->context, cmd);
 }
 
-int mlx5dv_devx_ind_tbl_modify(struct ibv_rwq_ind_table *ind_tbl, const void *in, size_t inlen,
-				void *out, size_t outlen)
+int mlx5dv_devx_ind_tbl_query(struct ibv_rwq_ind_table *ind_tbl, const void *in,
+			      size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ind_tbl->context);
+
+	if (!dvops || !dvops->devx_ind_tbl_query)
+		return EOPNOTSUPP;
+
+	return dvops->devx_ind_tbl_query(ind_tbl, in, inlen, out, outlen);
+}
+
+static int _mlx5dv_devx_ind_tbl_modify(struct ibv_rwq_ind_table *ind_tbl,
+					const void *in, size_t inlen,
+					void *out, size_t outlen)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5605,8 +5985,20 @@ int mlx5dv_devx_ind_tbl_modify(struct ibv_rwq_ind_table *ind_tbl, const void *in
 	return execute_ioctl(ind_tbl->context, cmd);
 }
 
-struct mlx5dv_devx_cmd_comp *
-mlx5dv_devx_create_cmd_comp(struct ibv_context *context)
+int mlx5dv_devx_ind_tbl_modify(struct ibv_rwq_ind_table *ind_tbl,
+			       const void *in, size_t inlen,
+			       void *out, size_t outlen)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ind_tbl->context);
+
+	if (!dvops || !dvops->devx_ind_tbl_modify)
+		return EOPNOTSUPP;
+
+	return dvops->devx_ind_tbl_modify(ind_tbl, in, inlen, out, outlen);
+}
+
+static struct mlx5dv_devx_cmd_comp *
+_mlx5dv_devx_create_cmd_comp(struct ibv_context *context)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD,
@@ -5638,16 +6030,35 @@ err:
 	return NULL;
 }
 
-void mlx5dv_devx_destroy_cmd_comp(
+struct mlx5dv_devx_cmd_comp *
+mlx5dv_devx_create_cmd_comp(struct ibv_context *context)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_create_cmd_comp) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_create_cmd_comp(context);
+}
+
+static void _mlx5dv_devx_destroy_cmd_comp(
 			struct mlx5dv_devx_cmd_comp *cmd_comp)
 {
 	close(cmd_comp->fd);
 	free(cmd_comp);
 }
 
-struct mlx5dv_devx_event_channel *
-mlx5dv_devx_create_event_channel(struct ibv_context *context,
-				 enum mlx5dv_devx_create_event_channel_flags flags)
+void mlx5dv_devx_destroy_cmd_comp(
+			struct mlx5dv_devx_cmd_comp *cmd_comp)
+{
+	_mlx5dv_devx_destroy_cmd_comp(cmd_comp);
+}
+
+static struct mlx5dv_devx_event_channel *
+_mlx5dv_devx_create_event_channel(struct ibv_context *context,
+				   enum mlx5dv_devx_create_event_channel_flags flags)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_ASYNC_EVENT_FD,
@@ -5682,7 +6093,21 @@ err:
 	return NULL;
 }
 
-void mlx5dv_devx_destroy_event_channel(
+struct mlx5dv_devx_event_channel *
+mlx5dv_devx_create_event_channel(struct ibv_context *context,
+				 enum mlx5dv_devx_create_event_channel_flags flags)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->devx_create_event_channel) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->devx_create_event_channel(context, flags);
+}
+
+static void _mlx5dv_devx_destroy_event_channel(
 			struct mlx5dv_devx_event_channel *dv_event_channel)
 {
 	struct mlx5_devx_event_channel *event_channel =
@@ -5693,11 +6118,26 @@ void mlx5dv_devx_destroy_event_channel(
 	free(event_channel);
 }
 
-int mlx5dv_devx_subscribe_devx_event(struct mlx5dv_devx_event_channel *dv_event_channel,
-				     struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
-				     uint16_t events_sz,
-				     uint16_t events_num[],
-				     uint64_t cookie)
+void mlx5dv_devx_destroy_event_channel(
+			struct mlx5dv_devx_event_channel *dv_event_channel)
+{
+	struct mlx5_devx_event_channel *ech =
+			container_of(dv_event_channel, struct mlx5_devx_event_channel,
+				     dv_event_channel);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ech->context);
+
+	if (!dvops || !dvops->devx_destroy_event_channel)
+		return;
+
+	dvops->devx_destroy_event_channel(dv_event_channel);
+}
+
+static int
+_mlx5dv_devx_subscribe_devx_event(struct mlx5dv_devx_event_channel *dv_event_channel,
+				  struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
+				  uint16_t events_sz,
+				  uint16_t events_num[],
+				  uint64_t cookie)
 {
 	struct mlx5_devx_event_channel *event_channel =
 			container_of(dv_event_channel, struct mlx5_devx_event_channel,
@@ -5717,10 +6157,26 @@ int mlx5dv_devx_subscribe_devx_event(struct mlx5dv_devx_event_channel *dv_event_
 	return execute_ioctl(event_channel->context, cmd);
 }
 
-int mlx5dv_devx_subscribe_devx_event_fd(struct mlx5dv_devx_event_channel *dv_event_channel,
-					int fd,
-					struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
-					uint16_t event_num)
+int mlx5dv_devx_subscribe_devx_event(struct mlx5dv_devx_event_channel *dv_event_channel,
+				     struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
+				     uint16_t events_sz,
+				     uint16_t events_num[],
+				     uint64_t cookie)
+{
+	/* obj may be NULL for unaffiliated events, resolve the ops from the channel */
+	struct mlx5_devx_event_channel *ech =
+			container_of(dv_event_channel, struct mlx5_devx_event_channel,
+				     dv_event_channel);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ech->context);
+
+	if (!dvops || !dvops->devx_subscribe_devx_event)
+		return EOPNOTSUPP;
+
+	return dvops->devx_subscribe_devx_event(dv_event_channel, obj,
+							 events_sz, events_num,
+							 cookie);
+}
+
+static int _mlx5dv_devx_subscribe_devx_event_fd(struct mlx5dv_devx_event_channel *dv_event_channel,
+						int fd,
+						struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
+						uint16_t event_num)
 {
 	struct mlx5_devx_event_channel *event_channel =
 			container_of(dv_event_channel, struct mlx5_devx_event_channel,
@@ -5740,10 +6196,24 @@ int mlx5dv_devx_subscribe_devx_event_fd(struct mlx5dv_devx_event_channel *dv_eve
 	return execute_ioctl(event_channel->context, cmd);
 }
 
-int mlx5dv_devx_obj_query_async(struct mlx5dv_devx_obj *obj, const void *in,
-				size_t inlen, size_t outlen,
-				uint64_t wr_id,
-				struct mlx5dv_devx_cmd_comp *cmd_comp)
+int mlx5dv_devx_subscribe_devx_event_fd(struct mlx5dv_devx_event_channel *dv_event_channel,
+					int fd,
+					struct mlx5dv_devx_obj *obj, /* can be NULL for unaffiliated events */
+					uint16_t event_num)
+{
+	/* obj may be NULL for unaffiliated events, resolve the ops from the channel */
+	struct mlx5_devx_event_channel *ech =
+			container_of(dv_event_channel, struct mlx5_devx_event_channel,
+				     dv_event_channel);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(ech->context);
+
+	if (!dvops || !dvops->devx_subscribe_devx_event_fd)
+		return EOPNOTSUPP;
+
+	return dvops->devx_subscribe_devx_event_fd(dv_event_channel, fd,
+							    obj, event_num);
+}
+
+static int _mlx5dv_devx_obj_query_async(struct mlx5dv_devx_obj *obj, const void *in,
+					size_t inlen, size_t outlen,
+					uint64_t wr_id,
+					struct mlx5dv_devx_cmd_comp *cmd_comp)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_DEVX_OBJ,
@@ -5759,9 +6229,23 @@ int mlx5dv_devx_obj_query_async(struct mlx5dv_devx_obj *obj, const void *in,
 	return execute_ioctl(obj->context, cmd);
 }
 
-int mlx5dv_devx_get_async_cmd_comp(struct mlx5dv_devx_cmd_comp *cmd_comp,
-				   struct mlx5dv_devx_async_cmd_hdr *cmd_resp,
-				   size_t cmd_resp_len)
+int mlx5dv_devx_obj_query_async(struct mlx5dv_devx_obj *obj, const void *in,
+				size_t inlen, size_t outlen,
+				uint64_t wr_id,
+				struct mlx5dv_devx_cmd_comp *cmd_comp)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->devx_obj_query_async)
+		return EOPNOTSUPP;
+
+	return dvops->devx_obj_query_async(obj, in, inlen, outlen,
+						    wr_id, cmd_comp);
+}
+
+static int _mlx5dv_devx_get_async_cmd_comp(struct mlx5dv_devx_cmd_comp *cmd_comp,
+					   struct mlx5dv_devx_async_cmd_hdr *cmd_resp,
+					   size_t cmd_resp_len)
 {
 	ssize_t bytes;
 
@@ -5775,24 +6259,12 @@ int mlx5dv_devx_get_async_cmd_comp(struct mlx5dv_devx_cmd_comp *cmd_comp,
 	return 0;
 }
 
-ssize_t mlx5dv_devx_get_event(struct mlx5dv_devx_event_channel *event_channel,
-				   struct mlx5dv_devx_async_event_hdr *event_data,
-				   size_t event_resp_len)
+int mlx5dv_devx_get_async_cmd_comp(struct mlx5dv_devx_cmd_comp *cmd_comp,
+				   struct mlx5dv_devx_async_cmd_hdr *cmd_resp,
+				   size_t cmd_resp_len)
 {
-	ssize_t bytes;
-
-	bytes = read(event_channel->fd, event_data, event_resp_len);
-	if (bytes < 0)
-		return -1;
-
-	/* cookie should be always exist */
-	if (bytes < sizeof(*event_data)) {
-		errno = EINVAL;
-		return -1;
-	}
-
-	/* event data may be omitted in case no EQE data exists (e.g. completion event on a CQ) */
-	return bytes;
+	return _mlx5dv_devx_get_async_cmd_comp(cmd_comp, cmd_resp,
+					       cmd_resp_len);
 }
 
 static int mlx5_destroy_sig_psvs(struct mlx5_sig_ctx *sig)
@@ -5880,7 +6352,37 @@ static int mlx5_destroy_sig_ctx(struct mlx5_sig_ctx *sig)
 	return ret;
 }
 
-struct mlx5dv_mkey *mlx5dv_create_mkey(struct mlx5dv_mkey_init_attr *mkey_init_attr)
+static ssize_t _mlx5dv_devx_get_event(struct mlx5dv_devx_event_channel *event_channel,
+				      struct mlx5dv_devx_async_event_hdr *event_data,
+				      size_t event_resp_len)
+{
+	ssize_t bytes;
+
+	bytes = read(event_channel->fd, event_data, event_resp_len);
+	if (bytes < 0)
+		return -1;
+
+	/* the cookie must always be present */
+	if (bytes < sizeof(*event_data)) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	/* event data may be omitted in case no EQE data exists (e.g. completion event on a CQ) */
+	return bytes;
+}
+
+ssize_t mlx5dv_devx_get_event(struct mlx5dv_devx_event_channel *event_channel,
+			      struct mlx5dv_devx_async_event_hdr *event_data,
+			      size_t event_resp_len)
+{
+	return _mlx5dv_devx_get_event(event_channel,
+				      event_data,
+				      event_resp_len);
+}
+
+static struct mlx5dv_mkey *
+_mlx5dv_create_mkey(struct mlx5dv_mkey_init_attr *mkey_init_attr)
 {
 	uint32_t out[DEVX_ST_SZ_DW(create_mkey_out)] = {};
 	uint32_t in[DEVX_ST_SZ_DW(create_mkey_in)] = {};
@@ -5953,7 +6455,19 @@ err_free_mkey:
 	return NULL;
 }
 
-int mlx5dv_destroy_mkey(struct mlx5dv_mkey *dv_mkey)
+struct mlx5dv_mkey *mlx5dv_create_mkey(struct mlx5dv_mkey_init_attr *mkey_init_attr)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(mkey_init_attr->pd->context);
+
+	if (!dvops || !dvops->create_mkey) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->create_mkey(mkey_init_attr);
+}
+
+static int _mlx5dv_destroy_mkey(struct mlx5dv_mkey *dv_mkey)
 {
 	struct mlx5_mkey *mkey = container_of(dv_mkey, struct mlx5_mkey,
 					  dv_mkey);
@@ -5977,6 +6491,18 @@ int mlx5dv_destroy_mkey(struct mlx5dv_mkey *dv_mkey)
 	return 0;
 }
 
+int mlx5dv_destroy_mkey(struct mlx5dv_mkey *dv_mkey)
+{
+	struct mlx5_mkey *mkey = container_of(dv_mkey, struct mlx5_mkey,
+					      dv_mkey);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(mkey->devx_obj->context);
+
+	if (!dvops || !dvops->destroy_mkey)
+		return EOPNOTSUPP;
+
+	return dvops->destroy_mkey(dv_mkey);
+}
+
 enum {
 	MLX5_SIGERR_CQE_SYNDROME_REFTAG = 1 << 11,
 	MLX5_SIGERR_CQE_SYNDROME_APPTAG = 1 << 12,
@@ -6088,8 +6614,8 @@ int _mlx5dv_mkey_check(struct mlx5dv_mkey *dv_mkey,
 	return 0;
 }
 
-struct mlx5dv_var *
-mlx5dv_alloc_var(struct ibv_context *context, uint32_t flags)
+static struct mlx5dv_var *
+_mlx5dv_alloc_var(struct ibv_context *context, uint32_t flags)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_VAR,
@@ -6100,11 +6626,6 @@ mlx5dv_alloc_var(struct ibv_context *context, uint32_t flags)
 	struct mlx5_var_obj *obj;
 	int ret;
 
-	if (!is_mlx5_dev(context->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	if (flags) {
 		errno = EOPNOTSUPP;
 		return NULL;
@@ -6138,8 +6659,20 @@ err:
 	return NULL;
 }
 
+struct mlx5dv_var *
+mlx5dv_alloc_var(struct ibv_context *context, uint32_t flags)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
 
-void mlx5dv_free_var(struct mlx5dv_var *dv_var)
+	if (!dvops || !dvops->alloc_var) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->alloc_var(context, flags);
+}
+
+static void _mlx5dv_free_var(struct mlx5dv_var *dv_var)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_VAR,
@@ -6156,10 +6689,22 @@ void mlx5dv_free_var(struct mlx5dv_var *dv_var)
 	free(obj);
 }
 
-struct mlx5dv_pp *mlx5dv_pp_alloc(struct ibv_context *context,
-				  size_t pp_context_sz,
-				  const void *pp_context,
-				  uint32_t flags)
+void mlx5dv_free_var(struct mlx5dv_var *dv_var)
+{
+	struct mlx5_var_obj *obj = container_of(dv_var, struct mlx5_var_obj,
+						dv_var);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->free_var)
+		return;
+
+	return dvops->free_var(dv_var);
+}
+
+static struct mlx5dv_pp *_mlx5dv_pp_alloc(struct ibv_context *context,
+					  size_t pp_context_sz,
+					  const void *pp_context,
+					  uint32_t flags)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_PP,
@@ -6170,11 +6715,6 @@ struct mlx5dv_pp *mlx5dv_pp_alloc(struct ibv_context *context,
 	struct mlx5_pp_obj *obj;
 	int ret;
 
-	if (!is_mlx5_dev(context->device)) {
-		errno = EOPNOTSUPP;
-		return NULL;
-	}
-
 	if (!check_comp_mask(flags,
 	    MLX5_IB_UAPI_PP_ALLOC_FLAGS_DEDICATED_INDEX)) {
 		errno = EOPNOTSUPP;
@@ -6208,7 +6748,23 @@ err:
 	return NULL;
 }
 
-void mlx5dv_pp_free(struct mlx5dv_pp *dv_pp)
+struct mlx5dv_pp *mlx5dv_pp_alloc(struct ibv_context *context,
+				  size_t pp_context_sz,
+				  const void *pp_context,
+				  uint32_t flags)
+{
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);
+
+	if (!dvops || !dvops->pp_alloc) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return dvops->pp_alloc(context, pp_context_sz,
+			       pp_context, flags);
+}
+
+static void _mlx5dv_pp_free(struct mlx5dv_pp *dv_pp)
 {
 	DECLARE_COMMAND_BUFFER(cmd,
 			       MLX5_IB_OBJECT_PP,
@@ -6224,3 +6780,89 @@ void mlx5dv_pp_free(struct mlx5dv_pp *dv_pp)
 
 	free(obj);
 }
+
+void mlx5dv_pp_free(struct mlx5dv_pp *dv_pp)
+{
+	struct mlx5_pp_obj *obj = container_of(dv_pp, struct mlx5_pp_obj, dv_pp);
+	struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(obj->context);
+
+	if (!dvops || !dvops->pp_free)
+		return;
+
+	dvops->pp_free(dv_pp);
+}
+
+void mlx5_set_dv_ctx_ops(struct mlx5_dv_context_ops *ops)
+{
+	ops->devx_general_cmd = _mlx5dv_devx_general_cmd;
+
+	ops->devx_obj_create = _mlx5dv_devx_obj_create;
+
+	ops->devx_obj_query = _mlx5dv_devx_obj_query;
+	ops->devx_obj_modify = _mlx5dv_devx_obj_modify;
+	ops->devx_obj_destroy = _mlx5dv_devx_obj_destroy;
+
+	ops->devx_query_eqn = _mlx5dv_devx_query_eqn;
+
+	ops->devx_cq_query = _mlx5dv_devx_cq_query;
+	ops->devx_cq_modify = _mlx5dv_devx_cq_modify;
+
+	ops->devx_qp_query = _mlx5dv_devx_qp_query;
+	ops->devx_qp_modify = _mlx5dv_devx_qp_modify;
+
+	ops->devx_srq_query = _mlx5dv_devx_srq_query;
+	ops->devx_srq_modify = _mlx5dv_devx_srq_modify;
+
+	ops->devx_wq_query = _mlx5dv_devx_wq_query;
+	ops->devx_wq_modify = _mlx5dv_devx_wq_modify;
+
+	ops->devx_ind_tbl_query = _mlx5dv_devx_ind_tbl_query;
+	ops->devx_ind_tbl_modify = _mlx5dv_devx_ind_tbl_modify;
+
+	ops->devx_create_cmd_comp = _mlx5dv_devx_create_cmd_comp;
+	ops->devx_destroy_cmd_comp = _mlx5dv_devx_destroy_cmd_comp;
+
+	ops->devx_create_event_channel = _mlx5dv_devx_create_event_channel;
+	ops->devx_destroy_event_channel = _mlx5dv_devx_destroy_event_channel;
+
+	ops->devx_subscribe_devx_event = _mlx5dv_devx_subscribe_devx_event;
+	ops->devx_subscribe_devx_event_fd = _mlx5dv_devx_subscribe_devx_event_fd;
+
+	ops->devx_obj_query_async = _mlx5dv_devx_obj_query_async;
+	ops->devx_get_async_cmd_comp = _mlx5dv_devx_get_async_cmd_comp;
+
+	ops->devx_get_event = _mlx5dv_devx_get_event;
+
+	ops->devx_alloc_uar = _mlx5dv_devx_alloc_uar;
+	ops->devx_free_uar = _mlx5dv_devx_free_uar;
+
+	ops->devx_umem_reg = _mlx5dv_devx_umem_reg;
+	ops->devx_umem_reg_ex = _mlx5dv_devx_umem_reg_ex;
+	ops->devx_umem_dereg = _mlx5dv_devx_umem_dereg;
+
+	ops->create_mkey = _mlx5dv_create_mkey;
+	ops->destroy_mkey = _mlx5dv_destroy_mkey;
+
+	ops->alloc_var = _mlx5dv_alloc_var;
+	ops->free_var = _mlx5dv_free_var;
+
+	ops->pp_alloc = _mlx5dv_pp_alloc;
+	ops->pp_free = _mlx5dv_pp_free;
+
+	ops->create_cq = _mlx5dv_create_cq;
+	ops->create_qp = _mlx5dv_create_qp;
+	ops->create_wq = _mlx5dv_create_wq;
+
+	ops->alloc_dm = _mlx5dv_alloc_dm;
+	ops->dm_map_op_addr = _mlx5dv_dm_map_op_addr;
+
+	ops->create_flow_action_esp = _mlx5dv_create_flow_action_esp;
+	ops->create_flow_action_modify_header = _mlx5dv_create_flow_action_modify_header;
+	ops->create_flow_action_packet_reformat = _mlx5dv_create_flow_action_packet_reformat;
+	ops->create_flow_matcher = _mlx5dv_create_flow_matcher;
+	ops->destroy_flow_matcher = _mlx5dv_destroy_flow_matcher;
+	ops->create_flow = _mlx5dv_create_flow;
+
+	ops->map_ah_to_qp = _mlx5dv_map_ah_to_qp;
+	ops->query_port = __mlx5dv_query_port;
+}
-- 
1.8.3.1
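
The dispatch table filled in by mlx5_set_dv_ctx_ops() above is the whole
mechanism behind this split: every public mlx5dv_* entry point resolves the
per-context ops via mlx5_get_dv_ops() and either forwards the call to the
registered implementation or fails with EOPNOTSUPP. A minimal sketch of the
same wiring for a hypothetical new DV call (mlx5dv_foo and the foo callback
are illustrative only, they are not part of this series):

	int mlx5dv_foo(struct ibv_context *context, uint32_t arg)
	{
		struct mlx5_dv_context_ops *dvops = mlx5_get_dv_ops(context);

		/* providers that do not implement the callback leave it NULL */
		if (!dvops || !dvops->foo)
			return EOPNOTSUPP;

		return dvops->foo(context, arg);
	}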



* [PATCH rdma-core 16/27] mlx5: Support initial DEVX/DV APIs over vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (14 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 15/27] mlx5: Set DV context ops Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 17/27] mlx5: Implement mlx5dv devx_obj " Yishai Hadas
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

Support initial DEVX/DV APIs over vfio for UMEM/UAR/EQN usage.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
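
For reference, a minimal application-side sketch of the entry points enabled
here, assuming ctx is an ibv_context that was opened on a VFIO-backed mlx5
device (buffer size, access flags and error handling are illustrative):

	#include <errno.h>
	#include <stdlib.h>
	#include <infiniband/mlx5dv.h>

	static int devx_basics(struct ibv_context *ctx)
	{
		struct mlx5dv_devx_uar *uar;
		struct mlx5dv_devx_umem *umem;
		uint32_t eqn;
		void *buf;

		/* vector 0 maps to the single async EQ exposed over vfio */
		if (mlx5dv_devx_query_eqn(ctx, 0, &eqn))
			return errno;

		uar = mlx5dv_devx_alloc_uar(ctx, MLX5DV_UAR_ALLOC_TYPE_NC);
		if (!uar)
			return errno;

		if (posix_memalign(&buf, 4096, 4096)) {
			mlx5dv_devx_free_uar(uar);
			return ENOMEM;
		}

		umem = mlx5dv_devx_umem_reg(ctx, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
		if (!umem) {
			free(buf);
			mlx5dv_devx_free_uar(uar);
			return errno;
		}

		/* umem->umem_id, uar->page_id and eqn can now be plugged into
		 * DEVX object creation commands (e.g. CREATE_CQ) */
		mlx5dv_devx_umem_dereg(umem);
		free(buf);
		mlx5dv_devx_free_uar(uar);
		return 0;
	}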
---
 providers/mlx5/mlx5_ifc.h  |  70 ++++++++++++++
 providers/mlx5/mlx5_vfio.c | 228 ++++++++++++++++++++++++++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.h |  10 ++
 3 files changed, 307 insertions(+), 1 deletion(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 1cbe846..1bd7466 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -88,6 +88,8 @@ enum {
 	MLX5_CMD_OP_CREATE_GENERAL_OBJECT = 0xa00,
 	MLX5_CMD_OP_MODIFY_GENERAL_OBJECT = 0xa01,
 	MLX5_CMD_OP_QUERY_GENERAL_OBJECT = 0xa02,
+	MLX5_CMD_OP_CREATE_UMEM = 0xa08,
+	MLX5_CMD_OP_DESTROY_UMEM = 0xa0a,
 	MLX5_CMD_OP_SYNC_STEERING = 0xb00,
 };
 
@@ -4656,4 +4658,72 @@ struct mlx5_ifc_dealloc_pd_in_bits {
 	u8         reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_mtt_bits {
+	u8         ptag_63_32[0x20];
+
+	u8         ptag_31_8[0x18];
+	u8         reserved_at_38[0x6];
+	u8         wr_en[0x1];
+	u8         rd_en[0x1];
+};
+
+struct mlx5_ifc_umem_bits {
+	u8         reserved_at_0[0x80];
+
+	u8         reserved_at_80[0x1b];
+	u8         log_page_size[0x5];
+
+	u8         page_offset[0x20];
+
+	u8         num_of_mtt[0x40];
+
+	struct mlx5_ifc_mtt_bits  mtt[];
+};
+
+struct mlx5_ifc_create_umem_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         reserved_at_40[0x40];
+
+	struct mlx5_ifc_umem_bits  umem;
+};
+
+struct mlx5_ifc_create_umem_out_bits {
+	u8         status[0x8];
+	u8         reserved_at_8[0x18];
+
+	u8         syndrome[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         umem_id[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_umem_in_bits {
+	u8        opcode[0x10];
+	u8        uid[0x10];
+
+	u8        reserved_at_20[0x10];
+	u8        op_mod[0x10];
+
+	u8        reserved_at_40[0x8];
+	u8        umem_id[0x18];
+
+	u8        reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_umem_out_bits {
+	u8        status[0x8];
+	u8        reserved_at_8[0x18];
+
+	u8        syndrome[0x20];
+
+	u8        reserved_at_40[0x40];
+};
+
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 23c6eeb..5e55697 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -37,6 +37,8 @@ enum {
 	MLX5_VFIO_SUPP_MR_ACCESS_FLAGS = IBV_ACCESS_LOCAL_WRITE |
 		IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ |
 		IBV_ACCESS_REMOTE_ATOMIC | IBV_ACCESS_RELAXED_ORDERING,
+	MLX5_VFIO_SUPP_UMEM_ACCESS_FLAGS = IBV_ACCESS_LOCAL_WRITE |
+		IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ,
 };
 
 static int mlx5_vfio_give_pages(struct mlx5_vfio_context *ctx, uint16_t func_id,
@@ -173,7 +175,6 @@ static void mlx5_vfio_free_page(struct mlx5_vfio_context *ctx, uint64_t iova)
 		bitmap_set_bit(page_block->free_pages, pg);
 		if (bitmap_full(page_block->free_pages, MLX5_VFIO_BLOCK_NUM_PAGES))
 			mlx5_vfio_free_block(ctx, page_block);
-
 		goto end;
 	}
 
@@ -2467,6 +2468,220 @@ vfio_devx_obj_create(struct ibv_context *context, const void *in,
 	return NULL;
 }
 
+static int vfio_devx_query_eqn(struct ibv_context *ibctx, uint32_t vector,
+			       uint32_t *eqn)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+
+	if (vector > ibctx->num_comp_vectors - 1)
+		return EINVAL;
+
+	/* For now use the singleton EQN created for async events */
+	*eqn = ctx->async_eq.eqn;
+	return 0;
+}
+
+static struct mlx5dv_devx_uar *
+vfio_devx_alloc_uar(struct ibv_context *ibctx, uint32_t flags)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(ibctx);
+	struct mlx5_devx_uar *uar;
+
+	if (flags != MLX5_IB_UAPI_UAR_ALLOC_TYPE_NC) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	uar = calloc(1, sizeof(*uar));
+	if (!uar) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	uar->dv_devx_uar.page_id = ctx->eqs_uar.uarn;
+	uar->dv_devx_uar.base_addr = (void *)ctx->eqs_uar.iova;
+	uar->dv_devx_uar.reg_addr = uar->dv_devx_uar.base_addr + MLX5_BF_OFFSET;
+	uar->context = ibctx;
+
+	return &uar->dv_devx_uar;
+}
+
+static void vfio_devx_free_uar(struct mlx5dv_devx_uar *dv_devx_uar)
+{
+	free(dv_devx_uar);
+}
+
+static struct mlx5dv_devx_umem *
+_vfio_devx_umem_reg(struct ibv_context *context,
+		    void *addr, size_t size, uint32_t access,
+		    uint64_t pgsz_bitmap)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(context);
+	uint32_t out[DEVX_ST_SZ_DW(create_umem_out)] = {};
+	struct mlx5_vfio_devx_umem *vfio_umem;
+	int iova_page_shift;
+	uint64_t iova_size;
+	int ret;
+	void *in;
+	uint32_t inlen;
+	__be64 *mtt;
+	void *umem;
+	bool writeable;
+	void *aligned_va;
+	int num_pas;
+
+	if (!check_comp_mask(access, MLX5_VFIO_SUPP_UMEM_ACCESS_FLAGS)) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	if ((access & IBV_ACCESS_REMOTE_WRITE) &&
+	    !(access & IBV_ACCESS_LOCAL_WRITE)) {
+		errno = EINVAL;
+		return NULL;
+	}
+
+	/* Page size that encloses the start and end of the umem range */
+	iova_size = max(roundup_pow_of_two(size + ((uint64_t) addr & (ctx->iova_min_page_size - 1))),
+			ctx->iova_min_page_size);
+
+	if (!(iova_size & pgsz_bitmap)) {
+		/* input should include the iova page size */
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	writeable = access &
+		(IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
+
+	vfio_umem = calloc(1, sizeof(*vfio_umem));
+	if (!vfio_umem) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	vfio_umem->iova_size = iova_size;
+	if (ibv_dontfork_range(addr, size))
+		goto err;
+
+	ret = iset_alloc_range(ctx->iova_alloc, vfio_umem->iova_size, &vfio_umem->iova);
+	if (ret)
+		goto err_alloc;
+
+	/* The registration arguments must reflect the real VA currently mapped into the process */
+	aligned_va = (void *) ((unsigned long) addr & ~(ctx->iova_min_page_size - 1));
+	vfio_umem->iova_reg_size = align((addr + size) - aligned_va, ctx->iova_min_page_size);
+	ret = mlx5_vfio_register_mem(ctx, aligned_va, vfio_umem->iova, vfio_umem->iova_reg_size);
+	if (ret)
+		goto err_reg;
+
+	iova_page_shift = ilog32(vfio_umem->iova_size - 1);
+	num_pas = 1;
+	if (iova_page_shift > MLX5_MAX_PAGE_SHIFT) {
+		iova_page_shift = MLX5_MAX_PAGE_SHIFT;
+		num_pas = DIV_ROUND_UP(vfio_umem->iova_size, (1ULL << iova_page_shift));
+	}
+
+	inlen = DEVX_ST_SZ_BYTES(create_umem_in) + DEVX_ST_SZ_BYTES(mtt) * num_pas;
+
+	in = calloc(1, inlen);
+	if (!in) {
+		errno = ENOMEM;
+		goto err_in;
+	}
+
+	umem = DEVX_ADDR_OF(create_umem_in, in, umem);
+	mtt = (__be64 *)DEVX_ADDR_OF(umem, umem, mtt);
+
+	DEVX_SET(create_umem_in, in, opcode, MLX5_CMD_OP_CREATE_UMEM);
+	DEVX_SET64(umem, umem, num_of_mtt, num_pas);
+	DEVX_SET(umem, umem, log_page_size, iova_page_shift - MLX5_ADAPTER_PAGE_SHIFT);
+	DEVX_SET(umem, umem, page_offset, addr - aligned_va);
+
+	mlx5_vfio_populate_pas(vfio_umem->iova, num_pas, (1ULL << iova_page_shift), mtt,
+			       (writeable ? MLX5_MTT_WRITE : 0) | MLX5_MTT_READ);
+
+	ret = mlx5_vfio_cmd_exec(ctx, in, inlen, out, sizeof(out), 0);
+	if (ret)
+		goto err_exec;
+
+	free(in);
+
+	vfio_umem->dv_devx_umem.umem_id = DEVX_GET(create_umem_out, out, umem_id);
+	vfio_umem->context = context;
+	vfio_umem->addr = addr;
+	vfio_umem->size = size;
+	return &vfio_umem->dv_devx_umem;
+
+err_exec:
+	free(in);
+err_in:
+	mlx5_vfio_unregister_mem(ctx, vfio_umem->iova, vfio_umem->iova_reg_size);
+err_reg:
+	iset_insert_range(ctx->iova_alloc, vfio_umem->iova, vfio_umem->iova_size);
+err_alloc:
+	ibv_dofork_range(addr, size);
+err:
+	free(vfio_umem);
+	return NULL;
+}
+
+static struct mlx5dv_devx_umem *
+vfio_devx_umem_reg(struct ibv_context *context,
+		   void *addr, size_t size, uint32_t access)
+{
+	return _vfio_devx_umem_reg(context, addr, size, access, UINT64_MAX);
+}
+
+static struct mlx5dv_devx_umem *
+vfio_devx_umem_reg_ex(struct ibv_context *ctx, struct mlx5dv_devx_umem_in *in)
+{
+	if (!check_comp_mask(in->comp_mask, 0)) {
+		errno = EOPNOTSUPP;
+		return NULL;
+	}
+
+	return _vfio_devx_umem_reg(ctx, in->addr, in->size, in->access, in->pgsz_bitmap);
+}
+
+static int vfio_devx_umem_dereg(struct mlx5dv_devx_umem *dv_devx_umem)
+{
+	struct mlx5_vfio_devx_umem *vfio_umem =
+		container_of(dv_devx_umem, struct mlx5_vfio_devx_umem,
+			     dv_devx_umem);
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(vfio_umem->context);
+	uint32_t in[DEVX_ST_SZ_DW(create_umem_in)] = {};
+	uint32_t out[DEVX_ST_SZ_DW(create_umem_out)] = {};
+	int ret;
+
+	DEVX_SET(destroy_umem_in, in, opcode, MLX5_CMD_OP_DESTROY_UMEM);
+	DEVX_SET(destroy_umem_in, in, umem_id, dv_devx_umem->umem_id);
+
+	ret = mlx5_vfio_cmd_exec(ctx, in, sizeof(in), out, sizeof(out), 0);
+	if (ret)
+		return ret;
+
+	mlx5_vfio_unregister_mem(ctx, vfio_umem->iova, vfio_umem->iova_reg_size);
+	iset_insert_range(ctx->iova_alloc, vfio_umem->iova, vfio_umem->iova_size);
+	ibv_dofork_range(vfio_umem->addr, vfio_umem->size);
+	free(vfio_umem);
+	return 0;
+}
+
+static int vfio_init_obj(struct mlx5dv_obj *obj, uint64_t obj_type)
+{
+	struct ibv_pd *pd_in = obj->pd.in;
+	struct mlx5dv_pd *pd_out = obj->pd.out;
+	struct mlx5_pd *mpd = to_mpd(pd_in);
+
+	if (obj_type != MLX5DV_OBJ_PD)
+		return EOPNOTSUPP;
+
+	pd_out->comp_mask = 0;
+	pd_out->pdn = mpd->pdn;
+	return 0;
+}
+
 static int vfio_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in,
 				size_t inlen, void *out, size_t outlen)
 {
@@ -2476,6 +2691,13 @@ static int vfio_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in,
 static struct mlx5_dv_context_ops mlx5_vfio_dv_ctx_ops = {
 	.devx_obj_create = vfio_devx_obj_create,
 	.devx_obj_query = vfio_devx_obj_query,
+	.devx_query_eqn = vfio_devx_query_eqn,
+	.devx_alloc_uar = vfio_devx_alloc_uar,
+	.devx_free_uar = vfio_devx_free_uar,
+	.devx_umem_reg = vfio_devx_umem_reg,
+	.devx_umem_reg_ex = vfio_devx_umem_reg_ex,
+	.devx_umem_dereg = vfio_devx_umem_dereg,
+	.init_obj = vfio_init_obj,
 };
 
 static void mlx5_vfio_uninit_context(struct mlx5_vfio_context *ctx)
@@ -2544,6 +2766,10 @@ mlx5_vfio_alloc_context(struct ibv_device *ibdev,
 
 	verbs_set_ops(&mctx->vctx, &mlx5_vfio_common_ops);
 	mctx->dv_ctx_ops = &mlx5_vfio_dv_ctx_ops;
+
+	/* For now only a singleton EQ is supported */
+	mctx->vctx.context.num_comp_vectors = 1;
+
 	return &mctx->vctx;
 
 func_teardown:
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 79b8033..766c48c 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -47,6 +47,16 @@ struct mlx5_vfio_mr {
 	uint64_t iova_reg_size;
 };
 
+struct mlx5_vfio_devx_umem {
+	struct mlx5dv_devx_umem dv_devx_umem;
+	struct ibv_context *context;
+	void *addr;
+	size_t size;
+	uint64_t iova;
+	uint64_t iova_size;
+	uint64_t iova_reg_size;
+};
+
 struct mlx5_vfio_device {
 	struct verbs_device vdev;
 	char *pci_name;
-- 
1.8.3.1
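
To make the IOVA sizing in _vfio_devx_umem_reg() above concrete, consider an
illustrative registration of 8192 bytes starting 0x200 bytes into a 4KiB page,
with iova_min_page_size = 4096 and a pgsz_bitmap that allows 16KiB pages:

	iova_size       = max(roundup_pow_of_two(8192 + 0x200), 4096) = 16384
	iova_page_shift = ilog32(16384 - 1)                           = 14
	log_page_size   = 14 - MLX5_ADAPTER_PAGE_SHIFT (12)           = 2
	page_offset     = addr - aligned_va                           = 0x200
	iova_reg_size   = align((addr + 8192) - aligned_va, 4096)     = 12288

so a single MTT entry describes the 16KiB IOVA window to the device, while
only the 12KiB of host pages that actually back the VA range are registered
through the VFIO DMA mapping.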



* [PATCH rdma-core 17/27] mlx5: Implement mlx5dv devx_obj APIs over vfio
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (15 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 16/27] mlx5: Support initial DEVX/DV APIs over vfio Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 18/27] pyverbs: Support DevX UMEM registration Yishai Hadas
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards

From: Mark Zhang <markzhang@nvidia.com>

Implement the mlx5dv devx_obj APIs over vfio
(devx_obj_create/query/modify/destroy), as well as devx_general_cmd.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
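
As an illustration of what these entry points execute, a minimal, hedged
sketch of an application allocating a PD through the generic object path; the
opcode value 0x800 matches MLX5_CMD_OP_ALLOC_PD in mlx5_ifc.h, the dword
offsets follow the PRM alloc_pd_in/alloc_pd_out layouts, and the helper name
and error handling are illustrative:

	#include <endian.h>
	#include <errno.h>
	#include <infiniband/mlx5dv.h>

	static int devx_alloc_pd(struct ibv_context *ctx, uint32_t *pdn)
	{
		uint32_t in[4] = {};	/* alloc_pd_in is 4 dwords */
		uint32_t out[4] = {};	/* alloc_pd_out is 4 dwords */
		struct mlx5dv_devx_obj *pd;

		in[0] = htobe32(0x800 << 16);	/* opcode = MLX5_CMD_OP_ALLOC_PD */
		pd = mlx5dv_devx_obj_create(ctx, in, sizeof(in), out, sizeof(out));
		if (!pd)
			return errno;

		*pdn = be32toh(out[2]) & 0xffffff;	/* pd field of alloc_pd_out */

		/* over vfio the matching DEALLOC_PD command is derived at create
		 * time by devx_obj_build_destroy_cmd() below */
		return mlx5dv_devx_obj_destroy(pd);
	}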
---
 providers/mlx5/mlx5_ifc.h  | 565 ++++++++++++++++++++++++++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.c | 413 ++++++++++++++++++++++++++++++++-
 providers/mlx5/mlx5_vfio.h |   8 +
 3 files changed, 975 insertions(+), 11 deletions(-)

diff --git a/providers/mlx5/mlx5_ifc.h b/providers/mlx5/mlx5_ifc.h
index 1bd7466..175fe4a 100644
--- a/providers/mlx5/mlx5_ifc.h
+++ b/providers/mlx5/mlx5_ifc.h
@@ -54,7 +54,10 @@ enum {
 	MLX5_CMD_OP_DESTROY_MKEY = 0x202,
 	MLX5_CMD_OP_CREATE_EQ = 0x301,
 	MLX5_CMD_OP_DESTROY_EQ = 0x302,
+	MLX5_CMD_OP_CREATE_CQ = 0x400,
+	MLX5_CMD_OP_DESTROY_CQ = 0x401,
 	MLX5_CMD_OP_CREATE_QP = 0x500,
+	MLX5_CMD_OP_DESTROY_QP = 0x501,
 	MLX5_CMD_OP_RST2INIT_QP = 0x502,
 	MLX5_CMD_OP_INIT2RTR_QP = 0x503,
 	MLX5_CMD_OP_RTR2RTS_QP = 0x504,
@@ -63,31 +66,71 @@ enum {
 	MLX5_CMD_OP_INIT2INIT_QP = 0x50e,
 	MLX5_CMD_OP_CREATE_PSV = 0x600,
 	MLX5_CMD_OP_DESTROY_PSV = 0x601,
+	MLX5_CMD_OP_CREATE_SRQ = 0x700,
+	MLX5_CMD_OP_DESTROY_SRQ = 0x701,
+	MLX5_CMD_OP_CREATE_XRC_SRQ = 0x705,
+	MLX5_CMD_OP_DESTROY_XRC_SRQ = 0x706,
+	MLX5_CMD_OP_CREATE_DCT = 0x710,
+	MLX5_CMD_OP_DESTROY_DCT = 0x711,
 	MLX5_CMD_OP_QUERY_DCT = 0x713,
+	MLX5_CMD_OP_CREATE_XRQ = 0x717,
+	MLX5_CMD_OP_DESTROY_XRQ = 0x718,
 	MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT = 0x752,
 	MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
 	MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT = 0x755,
 	MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
+	MLX5_CMD_OP_ALLOC_Q_COUNTER = 0x771,
+	MLX5_CMD_OP_DEALLOC_Q_COUNTER = 0x772,
+	MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT = 0x782,
+	MLX5_CMD_OP_DESTROY_SCHEDULING_ELEMENT = 0x783,
 	MLX5_CMD_OP_ALLOC_PD = 0x800,
 	MLX5_CMD_OP_DEALLOC_PD = 0x801,
 	MLX5_CMD_OP_ALLOC_UAR = 0x802,
 	MLX5_CMD_OP_DEALLOC_UAR = 0x803,
 	MLX5_CMD_OP_ACCESS_REG = 0x805,
+	MLX5_CMD_OP_ATTACH_TO_MCG = 0x806,
+	MLX5_CMD_OP_DETACH_FROM_MCG = 0x807,
+	MLX5_CMD_OP_ALLOC_XRCD = 0x80e,
+	MLX5_CMD_OP_DEALLOC_XRCD = 0x80f,
+	MLX5_CMD_OP_ALLOC_TRANSPORT_DOMAIN = 0x816,
+	MLX5_CMD_OP_DEALLOC_TRANSPORT_DOMAIN = 0x817,
+	MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT = 0x827,
+	MLX5_CMD_OP_DELETE_VXLAN_UDP_DPORT = 0x828,
+	MLX5_CMD_OP_SET_L2_TABLE_ENTRY = 0x829,
+	MLX5_CMD_OP_DELETE_L2_TABLE_ENTRY = 0x82b,
 	MLX5_CMD_OP_QUERY_LAG = 0x842,
 	MLX5_CMD_OP_CREATE_TIR = 0x900,
+	MLX5_CMD_OP_DESTROY_TIR = 0x902,
+	MLX5_CMD_OP_CREATE_SQ = 0x904,
 	MLX5_CMD_OP_MODIFY_SQ = 0x905,
+	MLX5_CMD_OP_DESTROY_SQ = 0x906,
+	MLX5_CMD_OP_CREATE_RQ = 0x908,
+	MLX5_CMD_OP_DESTROY_RQ = 0x90a,
+	MLX5_CMD_OP_CREATE_RMP = 0x90c,
+	MLX5_CMD_OP_DESTROY_RMP = 0x90e,
+	MLX5_CMD_OP_CREATE_TIS = 0x912,
 	MLX5_CMD_OP_MODIFY_TIS = 0x913,
+	MLX5_CMD_OP_DESTROY_TIS = 0x914,
 	MLX5_CMD_OP_QUERY_TIS = 0x915,
+	MLX5_CMD_OP_CREATE_RQT = 0x916,
+	MLX5_CMD_OP_DESTROY_RQT = 0x918,
 	MLX5_CMD_OP_CREATE_FLOW_TABLE = 0x930,
+	MLX5_CMD_OP_DESTROY_FLOW_TABLE = 0x931,
 	MLX5_CMD_OP_QUERY_FLOW_TABLE = 0x932,
 	MLX5_CMD_OP_CREATE_FLOW_GROUP = 0x933,
+	MLX5_CMD_OP_DESTROY_FLOW_GROUP = 0x934,
 	MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY = 0x936,
+	MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY = 0x938,
 	MLX5_CMD_OP_CREATE_FLOW_COUNTER = 0x939,
+	MLX5_CMD_OP_DEALLOC_FLOW_COUNTER = 0x93a,
 	MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT = 0x93d,
 	MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT = 0x93e,
+	MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT = 0x940,
+	MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT = 0x941,
 	MLX5_CMD_OP_CREATE_GENERAL_OBJECT = 0xa00,
 	MLX5_CMD_OP_MODIFY_GENERAL_OBJECT = 0xa01,
 	MLX5_CMD_OP_QUERY_GENERAL_OBJECT = 0xa02,
+	MLX5_CMD_OP_DESTROY_GENERAL_OBJECT = 0xa03,
 	MLX5_CMD_OP_CREATE_UMEM = 0xa08,
 	MLX5_CMD_OP_DESTROY_UMEM = 0xa0a,
 	MLX5_CMD_OP_SYNC_STEERING = 0xb00,
@@ -236,6 +279,27 @@ struct mlx5_ifc_create_flow_table_out_bits {
 	u8         icm_address_31_0[0x20];
 };
 
+struct mlx5_ifc_destroy_flow_table_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         table_type[0x8];
+	u8         reserved_at_88[0x18];
+
+	u8         reserved_at_a0[0x8];
+	u8         table_id[0x18];
+
+	u8         reserved_at_c0[0x140];
+};
+
 struct mlx5_ifc_query_flow_table_in_bits {
 	u8         opcode[0x10];
 	u8         reserved_at_10[0x10];
@@ -2991,6 +3055,17 @@ struct mlx5_ifc_alloc_flow_counter_out_bits {
 	u8	reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_dealloc_flow_counter_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         flow_counter_id[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
 enum {
 	MLX5_OBJ_TYPE_FLOW_METER = 0x000a,
 	MLX5_OBJ_TYPE_MATCH_DEFINER = 0x0018,
@@ -3422,6 +3497,18 @@ struct mlx5_ifc_create_tir_out_bits {
 	u8         icm_address_31_0[0x20];
 };
 
+struct mlx5_ifc_destroy_tir_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         tirn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
 struct mlx5_ifc_create_qp_out_bits {
 	u8         status[0x8];
 	u8         reserved_at_8[0x18];
@@ -3459,6 +3546,18 @@ struct mlx5_ifc_create_qp_in_bits {
 	u8         pas[0][0x40];
 };
 
+struct mlx5_ifc_destroy_qp_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         qpn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
 enum mlx5_qpc_opt_mask_32 {
 	MLX5_QPC_OPT_MASK_32_QOS_QUEUE_GROUP_ID = 1 << 1,
 	MLX5_QPC_OPT_MASK_32_UDP_SPORT = 1 << 2,
@@ -3898,7 +3997,13 @@ struct mlx5_ifc_create_flow_group_in_bits {
 	u8         opcode[0x10];
 	u8         reserved_at_10[0x10];
 
-	u8         reserved_at_20[0x60];
+	u8         reserved_at_20[0x20];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	u8         reserved_at_60[0x20];
 
 	u8         table_type[0x8];
 	u8         reserved_at_88[0x18];
@@ -3921,6 +4026,29 @@ struct mlx5_ifc_create_flow_group_out_bits {
 	u8         reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_destroy_flow_group_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         table_type[0x8];
+	u8         reserved_at_88[0x18];
+
+	u8         reserved_at_a0[0x8];
+	u8         table_id[0x18];
+
+	u8         group_id[0x20];
+
+	u8         reserved_at_e0[0x120];
+};
+
 struct mlx5_ifc_dest_format_bits {
 	u8         destination_type[0x8];
 	u8         destination_id[0x18];
@@ -3977,7 +4105,14 @@ struct mlx5_ifc_set_fte_in_bits {
 	u8         opcode[0x10];
 	u8         reserved_at_10[0x10];
 
-	u8         reserved_at_20[0x60];
+	u8         reserved_at_20[0x10];
+	u8         op_mod[0x10];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	u8         reserved_at_60[0x20];
 
 	u8         table_type[0x8];
 	u8         reserved_at_88[0x18];
@@ -4186,6 +4321,18 @@ struct mlx5_ifc_create_psv_in_bits {
 	u8         reserved_at_60[0x20];
 };
 
+struct mlx5_ifc_destroy_psv_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         psvn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
 struct mlx5_ifc_mbox_out_bits {
 	u8	status[0x8];
 	u8	reserved_at_8[0x18];
@@ -4726,4 +4873,418 @@ struct mlx5_ifc_destroy_umem_out_bits {
 	u8        reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_delete_fte_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         other_vport[0x1];
+	u8         reserved_at_41[0xf];
+	u8         vport_number[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         table_type[0x8];
+	u8         reserved_at_88[0x18];
+
+	u8         reserved_at_a0[0x8];
+	u8         table_id[0x18];
+
+	u8         reserved_at_c0[0x40];
+
+	u8         flow_index[0x20];
+
+	u8         reserved_at_120[0xe0];
+};
+
+struct mlx5_ifc_create_cq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         cqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_cq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         cqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_alloc_transport_domain_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         transport_domain[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_dealloc_transport_domain_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         transport_domain[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_rmp_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         rmpn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_rmp_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         rmpn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_sq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         sqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_sq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         sqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_rq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         rqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_rq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         rqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_rqt_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         rqtn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_rqt_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         rqtn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_tis_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         tisn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_tis_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         tisn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_alloc_q_counter_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x18];
+	u8         counter_set_id[0x8];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_dealloc_q_counter_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x18];
+	u8         counter_set_id[0x8];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_alloc_modify_header_context_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         modify_header_id[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_dealloc_modify_header_context_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         modify_header_id[0x20];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_scheduling_element_out_bits {
+	u8         reserved_at_0[0x80];
+
+	u8         scheduling_element_id[0x20];
+
+	u8         reserved_at_a0[0x160];
+};
+
+struct mlx5_ifc_create_scheduling_element_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         scheduling_hierarchy[0x8];
+	u8         reserved_at_48[0x18];
+
+	u8         reserved_at_60[0x3a0];
+};
+
+struct mlx5_ifc_destroy_scheduling_element_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         scheduling_hierarchy[0x8];
+	u8         reserved_at_48[0x18];
+
+	u8         scheduling_element_id[0x20];
+
+	u8         reserved_at_80[0x180];
+};
+
+struct mlx5_ifc_add_vxlan_udp_dport_in_bits {
+	u8         reserved_at_0[0x60];
+
+	u8         reserved_at_60[0x10];
+	u8         vxlan_udp_port[0x10];
+};
+
+struct mlx5_ifc_delete_vxlan_udp_dport_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x40];
+
+	u8         reserved_at_60[0x10];
+	u8         vxlan_udp_port[0x10];
+};
+
+struct mlx5_ifc_set_l2_table_entry_in_bits {
+	u8         reserved_at_0[0xa0];
+
+	u8         reserved_at_a0[0x8];
+	u8         table_index[0x18];
+
+	u8         reserved_at_c0[0x140];
+
+};
+
+struct mlx5_ifc_delete_l2_table_entry_in_bits {
+	u8         opcode[0x10];
+	u8         reserved_at_10[0x10];
+
+	u8         reserved_at_20[0x80];
+
+	u8         reserved_at_a0[0x8];
+	u8         table_index[0x18];
+
+	u8         reserved_at_c0[0x140];
+};
+
+struct mlx5_ifc_create_srq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         srqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_srq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         srqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_xrc_srq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         xrc_srqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_xrc_srq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         xrc_srqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_dct_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         dctn[0x18];
+
+	u8         ece[0x20];
+};
+
+struct mlx5_ifc_destroy_dct_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         dctn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_create_xrq_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         xrqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_destroy_xrq_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         xrqn[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_attach_to_mcg_in_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         qpn[0x18];
+
+	u8         reserved_at_60[0x20];
+
+	u8         multicast_gid[16][0x8];
+};
+
+struct mlx5_ifc_detach_from_mcg_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         qpn[0x18];
+
+	u8         reserved_at_60[0x20];
+
+	u8         multicast_gid[16][0x8];
+};
+
+struct mlx5_ifc_alloc_xrcd_out_bits {
+	u8         reserved_at_0[0x40];
+
+	u8         reserved_at_40[0x8];
+	u8         xrcd[0x18];
+
+	u8         reserved_at_60[0x20];
+};
+
+struct mlx5_ifc_dealloc_xrcd_in_bits {
+	u8         opcode[0x10];
+	u8         uid[0x10];
+
+	u8         reserved_at_20[0x20];
+
+	u8         reserved_at_40[0x8];
+	u8         xrcd[0x18];
+
+	u8         reserved_at_60[0x20];
+};
 #endif /* MLX5_IFC_H */
diff --git a/providers/mlx5/mlx5_vfio.c b/providers/mlx5/mlx5_vfio.c
index 5e55697..0bc9aed 100644
--- a/providers/mlx5/mlx5_vfio.c
+++ b/providers/mlx5/mlx5_vfio.c
@@ -2460,14 +2460,6 @@ end:
 	return NULL;
 }
 
-static struct mlx5dv_devx_obj *
-vfio_devx_obj_create(struct ibv_context *context, const void *in,
-		     size_t inlen, void *out, size_t outlen)
-{
-	errno = EOPNOTSUPP;
-	return NULL;
-}
-
 static int vfio_devx_query_eqn(struct ibv_context *ibctx, uint32_t vector,
 			       uint32_t *eqn)
 {
@@ -2682,15 +2674,418 @@ static int vfio_init_obj(struct mlx5dv_obj *obj, uint64_t obj_type)
 	return 0;
 }
 
+static int vfio_devx_general_cmd(struct ibv_context *context, const void *in,
+				 size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(context);
+
+	return mlx5_vfio_cmd_exec(ctx, (void *)in, inlen, out, outlen, 0);
+}
+
+static bool devx_is_obj_create_cmd(const void *in)
+{
+	uint16_t opcode = DEVX_GET(general_obj_in_cmd_hdr, in, opcode);
+
+	switch (opcode) {
+	case MLX5_CMD_OP_CREATE_GENERAL_OBJECT:
+	case MLX5_CMD_OP_CREATE_MKEY:
+	case MLX5_CMD_OP_CREATE_CQ:
+	case MLX5_CMD_OP_ALLOC_PD:
+	case MLX5_CMD_OP_ALLOC_TRANSPORT_DOMAIN:
+	case MLX5_CMD_OP_CREATE_RMP:
+	case MLX5_CMD_OP_CREATE_SQ:
+	case MLX5_CMD_OP_CREATE_RQ:
+	case MLX5_CMD_OP_CREATE_RQT:
+	case MLX5_CMD_OP_CREATE_TIR:
+	case MLX5_CMD_OP_CREATE_TIS:
+	case MLX5_CMD_OP_ALLOC_Q_COUNTER:
+	case MLX5_CMD_OP_CREATE_FLOW_TABLE:
+	case MLX5_CMD_OP_CREATE_FLOW_GROUP:
+	case MLX5_CMD_OP_CREATE_FLOW_COUNTER:
+	case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
+	case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
+	case MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT:
+	case MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT:
+	case MLX5_CMD_OP_SET_L2_TABLE_ENTRY:
+	case MLX5_CMD_OP_CREATE_QP:
+	case MLX5_CMD_OP_CREATE_SRQ:
+	case MLX5_CMD_OP_CREATE_XRC_SRQ:
+	case MLX5_CMD_OP_CREATE_DCT:
+	case MLX5_CMD_OP_CREATE_XRQ:
+	case MLX5_CMD_OP_ATTACH_TO_MCG:
+	case MLX5_CMD_OP_ALLOC_XRCD:
+		return true;
+	case MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY:
+	{
+		uint8_t op_mod = DEVX_GET(set_fte_in, in, op_mod);
+
+		if (op_mod == 0)
+			return true;
+		return false;
+	}
+	case MLX5_CMD_OP_CREATE_PSV:
+	{
+		uint8_t num_psv = DEVX_GET(create_psv_in, in, num_psv);
+
+		if (num_psv == 1)
+			return true;
+		return false;
+	}
+	default:
+		return false;
+	}
+}
+
+static uint32_t devx_get_created_obj_id(const void *in, const void *out,
+					uint16_t opcode)
+{
+	switch (opcode) {
+	case MLX5_CMD_OP_CREATE_GENERAL_OBJECT:
+		return DEVX_GET(general_obj_out_cmd_hdr, out, obj_id);
+	case MLX5_CMD_OP_CREATE_UMEM:
+		return DEVX_GET(create_umem_out, out, umem_id);
+	case MLX5_CMD_OP_CREATE_MKEY:
+		return DEVX_GET(create_mkey_out, out, mkey_index);
+	case MLX5_CMD_OP_CREATE_CQ:
+		return DEVX_GET(create_cq_out, out, cqn);
+	case MLX5_CMD_OP_ALLOC_PD:
+		return DEVX_GET(alloc_pd_out, out, pd);
+	case MLX5_CMD_OP_ALLOC_TRANSPORT_DOMAIN:
+		return DEVX_GET(alloc_transport_domain_out, out,
+				transport_domain);
+	case MLX5_CMD_OP_CREATE_RMP:
+		return DEVX_GET(create_rmp_out, out, rmpn);
+	case MLX5_CMD_OP_CREATE_SQ:
+		return DEVX_GET(create_sq_out, out, sqn);
+	case MLX5_CMD_OP_CREATE_RQ:
+		return DEVX_GET(create_rq_out, out, rqn);
+	case MLX5_CMD_OP_CREATE_RQT:
+		return DEVX_GET(create_rqt_out, out, rqtn);
+	case MLX5_CMD_OP_CREATE_TIR:
+		return DEVX_GET(create_tir_out, out, tirn);
+	case MLX5_CMD_OP_CREATE_TIS:
+		return DEVX_GET(create_tis_out, out, tisn);
+	case MLX5_CMD_OP_ALLOC_Q_COUNTER:
+		return DEVX_GET(alloc_q_counter_out, out, counter_set_id);
+	case MLX5_CMD_OP_CREATE_FLOW_TABLE:
+		return DEVX_GET(create_flow_table_out, out, table_id);
+	case MLX5_CMD_OP_CREATE_FLOW_GROUP:
+		return DEVX_GET(create_flow_group_out, out, group_id);
+	case MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY:
+		return DEVX_GET(set_fte_in, in, flow_index);
+	case MLX5_CMD_OP_CREATE_FLOW_COUNTER:
+		return DEVX_GET(alloc_flow_counter_out, out, flow_counter_id);
+	case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
+		return DEVX_GET(alloc_packet_reformat_context_out, out,
+				packet_reformat_id);
+	case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
+		return DEVX_GET(alloc_modify_header_context_out, out,
+				modify_header_id);
+	case MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT:
+		return DEVX_GET(create_scheduling_element_out, out,
+				scheduling_element_id);
+	case MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT:
+		return DEVX_GET(add_vxlan_udp_dport_in, in, vxlan_udp_port);
+	case MLX5_CMD_OP_SET_L2_TABLE_ENTRY:
+		return DEVX_GET(set_l2_table_entry_in, in, table_index);
+	case MLX5_CMD_OP_CREATE_QP:
+		return DEVX_GET(create_qp_out, out, qpn);
+	case MLX5_CMD_OP_CREATE_SRQ:
+		return DEVX_GET(create_srq_out, out, srqn);
+	case MLX5_CMD_OP_CREATE_XRC_SRQ:
+		return DEVX_GET(create_xrc_srq_out, out, xrc_srqn);
+	case MLX5_CMD_OP_CREATE_DCT:
+		return DEVX_GET(create_dct_out, out, dctn);
+	case MLX5_CMD_OP_CREATE_XRQ:
+		return DEVX_GET(create_xrq_out, out, xrqn);
+	case MLX5_CMD_OP_ATTACH_TO_MCG:
+		return DEVX_GET(attach_to_mcg_in, in, qpn);
+	case MLX5_CMD_OP_ALLOC_XRCD:
+		return DEVX_GET(alloc_xrcd_out, out, xrcd);
+	case MLX5_CMD_OP_CREATE_PSV:
+		return DEVX_GET(create_psv_out, out, psv0_index);
+	default:
+		/* The opcode must match one of those handled in devx_is_obj_create_cmd() */
+		assert(false);
+		return 0;
+	}
+}
+
+static void devx_obj_build_destroy_cmd(const void *in, void *out,
+				       void *din, uint32_t *dinlen,
+				       struct mlx5dv_devx_obj *obj)
+{
+	uint16_t opcode = DEVX_GET(general_obj_in_cmd_hdr, in, opcode);
+	uint16_t uid = DEVX_GET(general_obj_in_cmd_hdr, in, uid);
+	uint32_t *obj_id = &obj->object_id;
+
+	*obj_id = devx_get_created_obj_id(in, out, opcode);
+	*dinlen = DEVX_ST_SZ_BYTES(general_obj_in_cmd_hdr);
+	DEVX_SET(general_obj_in_cmd_hdr, din, uid, uid);
+
+	switch (opcode) {
+	case MLX5_CMD_OP_CREATE_GENERAL_OBJECT:
+		DEVX_SET(general_obj_in_cmd_hdr, din, opcode, MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
+		DEVX_SET(general_obj_in_cmd_hdr, din, obj_id, *obj_id);
+		DEVX_SET(general_obj_in_cmd_hdr, din, obj_type,
+			 DEVX_GET(general_obj_in_cmd_hdr, in, obj_type));
+		break;
+
+	case MLX5_CMD_OP_CREATE_UMEM:
+		DEVX_SET(destroy_umem_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_UMEM);
+		DEVX_SET(destroy_umem_in, din, umem_id, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_MKEY:
+		DEVX_SET(destroy_mkey_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_MKEY);
+		DEVX_SET(destroy_mkey_in, din, mkey_index, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_CQ:
+		DEVX_SET(destroy_cq_in, din, opcode, MLX5_CMD_OP_DESTROY_CQ);
+		DEVX_SET(destroy_cq_in, din, cqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_PD:
+		DEVX_SET(dealloc_pd_in, din, opcode, MLX5_CMD_OP_DEALLOC_PD);
+		DEVX_SET(dealloc_pd_in, din, pd, *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_TRANSPORT_DOMAIN:
+		DEVX_SET(dealloc_transport_domain_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_TRANSPORT_DOMAIN);
+		DEVX_SET(dealloc_transport_domain_in, din, transport_domain,
+			 *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_RMP:
+		DEVX_SET(destroy_rmp_in, din, opcode, MLX5_CMD_OP_DESTROY_RMP);
+		DEVX_SET(destroy_rmp_in, din, rmpn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_SQ:
+		DEVX_SET(destroy_sq_in, din, opcode, MLX5_CMD_OP_DESTROY_SQ);
+		DEVX_SET(destroy_sq_in, din, sqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_RQ:
+		DEVX_SET(destroy_rq_in, din, opcode, MLX5_CMD_OP_DESTROY_RQ);
+		DEVX_SET(destroy_rq_in, din, rqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_RQT:
+		DEVX_SET(destroy_rqt_in, din, opcode, MLX5_CMD_OP_DESTROY_RQT);
+		DEVX_SET(destroy_rqt_in, din, rqtn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_TIR:
+		DEVX_SET(destroy_tir_in, din, opcode, MLX5_CMD_OP_DESTROY_TIR);
+		DEVX_SET(destroy_tir_in, din, tirn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_TIS:
+		DEVX_SET(destroy_tis_in, din, opcode, MLX5_CMD_OP_DESTROY_TIS);
+		DEVX_SET(destroy_tis_in, din, tisn, *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_Q_COUNTER:
+		DEVX_SET(dealloc_q_counter_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_Q_COUNTER);
+		DEVX_SET(dealloc_q_counter_in, din, counter_set_id, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_FLOW_TABLE:
+		*dinlen = DEVX_ST_SZ_BYTES(destroy_flow_table_in);
+		DEVX_SET(destroy_flow_table_in, din, other_vport,
+			 DEVX_GET(create_flow_table_in,  in, other_vport));
+		DEVX_SET(destroy_flow_table_in, din, vport_number,
+			 DEVX_GET(create_flow_table_in,  in, vport_number));
+		DEVX_SET(destroy_flow_table_in, din, table_type,
+			 DEVX_GET(create_flow_table_in,  in, table_type));
+		DEVX_SET(destroy_flow_table_in, din, table_id, *obj_id);
+		DEVX_SET(destroy_flow_table_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_FLOW_TABLE);
+		break;
+	case MLX5_CMD_OP_CREATE_FLOW_GROUP:
+		*dinlen = DEVX_ST_SZ_BYTES(destroy_flow_group_in);
+		DEVX_SET(destroy_flow_group_in, din, other_vport,
+			 DEVX_GET(create_flow_group_in, in, other_vport));
+		DEVX_SET(destroy_flow_group_in, din, vport_number,
+			 DEVX_GET(create_flow_group_in, in, vport_number));
+		DEVX_SET(destroy_flow_group_in, din, table_type,
+			 DEVX_GET(create_flow_group_in, in, table_type));
+		DEVX_SET(destroy_flow_group_in, din, table_id,
+			 DEVX_GET(create_flow_group_in, in, table_id));
+		DEVX_SET(destroy_flow_group_in, din, group_id, *obj_id);
+		DEVX_SET(destroy_flow_group_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_FLOW_GROUP);
+		break;
+	case MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY:
+		*dinlen = DEVX_ST_SZ_BYTES(delete_fte_in);
+		DEVX_SET(delete_fte_in, din, other_vport,
+			 DEVX_GET(set_fte_in,  in, other_vport));
+		DEVX_SET(delete_fte_in, din, vport_number,
+			 DEVX_GET(set_fte_in, in, vport_number));
+		DEVX_SET(delete_fte_in, din, table_type,
+			 DEVX_GET(set_fte_in, in, table_type));
+		DEVX_SET(delete_fte_in, din, table_id,
+			 DEVX_GET(set_fte_in, in, table_id));
+		DEVX_SET(delete_fte_in, din, flow_index, *obj_id);
+		DEVX_SET(delete_fte_in, din, opcode,
+			 MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY);
+		break;
+	case MLX5_CMD_OP_CREATE_FLOW_COUNTER:
+		DEVX_SET(dealloc_flow_counter_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_FLOW_COUNTER);
+		DEVX_SET(dealloc_flow_counter_in, din, flow_counter_id,
+			 *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
+		DEVX_SET(dealloc_packet_reformat_context_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT);
+		DEVX_SET(dealloc_packet_reformat_context_in, din,
+			 packet_reformat_id, *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
+		DEVX_SET(dealloc_modify_header_context_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT);
+		DEVX_SET(dealloc_modify_header_context_in, din,
+			 modify_header_id, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT:
+		*dinlen = DEVX_ST_SZ_BYTES(destroy_scheduling_element_in);
+		DEVX_SET(destroy_scheduling_element_in, din,
+			 scheduling_hierarchy,
+			 DEVX_GET(create_scheduling_element_in, in,
+				  scheduling_hierarchy));
+		DEVX_SET(destroy_scheduling_element_in, din,
+			 scheduling_element_id, *obj_id);
+		DEVX_SET(destroy_scheduling_element_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_SCHEDULING_ELEMENT);
+		break;
+	case MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT:
+		*dinlen = DEVX_ST_SZ_BYTES(delete_vxlan_udp_dport_in);
+		DEVX_SET(delete_vxlan_udp_dport_in, din, vxlan_udp_port, *obj_id);
+		DEVX_SET(delete_vxlan_udp_dport_in, din, opcode,
+			 MLX5_CMD_OP_DELETE_VXLAN_UDP_DPORT);
+		break;
+	case MLX5_CMD_OP_SET_L2_TABLE_ENTRY:
+		*dinlen = DEVX_ST_SZ_BYTES(delete_l2_table_entry_in);
+		DEVX_SET(delete_l2_table_entry_in, din, table_index, *obj_id);
+		DEVX_SET(delete_l2_table_entry_in, din, opcode,
+			 MLX5_CMD_OP_DELETE_L2_TABLE_ENTRY);
+		break;
+	case MLX5_CMD_OP_CREATE_QP:
+		DEVX_SET(destroy_qp_in, din, opcode, MLX5_CMD_OP_DESTROY_QP);
+		DEVX_SET(destroy_qp_in, din, qpn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_SRQ:
+		DEVX_SET(destroy_srq_in, din, opcode, MLX5_CMD_OP_DESTROY_SRQ);
+		DEVX_SET(destroy_srq_in, din, srqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_XRC_SRQ:
+		DEVX_SET(destroy_xrc_srq_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_XRC_SRQ);
+		DEVX_SET(destroy_xrc_srq_in, din, xrc_srqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_DCT:
+		DEVX_SET(destroy_dct_in, din, opcode, MLX5_CMD_OP_DESTROY_DCT);
+		DEVX_SET(destroy_dct_in, din, dctn, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_XRQ:
+		DEVX_SET(destroy_xrq_in, din, opcode, MLX5_CMD_OP_DESTROY_XRQ);
+		DEVX_SET(destroy_xrq_in, din, xrqn, *obj_id);
+		break;
+	case MLX5_CMD_OP_ATTACH_TO_MCG:
+		*dinlen = DEVX_ST_SZ_BYTES(detach_from_mcg_in);
+		DEVX_SET(detach_from_mcg_in, din, qpn,
+			 DEVX_GET(attach_to_mcg_in, in, qpn));
+		memcpy(DEVX_ADDR_OF(detach_from_mcg_in, din, multicast_gid),
+		       DEVX_ADDR_OF(attach_to_mcg_in, in, multicast_gid),
+		       DEVX_FLD_SZ_BYTES(attach_to_mcg_in, multicast_gid));
+		DEVX_SET(detach_from_mcg_in, din, opcode,
+			 MLX5_CMD_OP_DETACH_FROM_MCG);
+		DEVX_SET(detach_from_mcg_in, din, qpn, *obj_id);
+		break;
+	case MLX5_CMD_OP_ALLOC_XRCD:
+		DEVX_SET(dealloc_xrcd_in, din, opcode,
+			 MLX5_CMD_OP_DEALLOC_XRCD);
+		DEVX_SET(dealloc_xrcd_in, din, xrcd, *obj_id);
+		break;
+	case MLX5_CMD_OP_CREATE_PSV:
+		DEVX_SET(destroy_psv_in, din, opcode,
+			 MLX5_CMD_OP_DESTROY_PSV);
+		DEVX_SET(destroy_psv_in, din, psvn, *obj_id);
+		break;
+	default:
+		/* The opcode must match one of those handled in devx_is_obj_create_cmd() */
+		assert(false);
+		break;
+	}
+}
+
+static struct mlx5dv_devx_obj *
+vfio_devx_obj_create(struct ibv_context *context, const void *in,
+		     size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(context);
+	struct mlx5_devx_obj *obj;
+	int ret;
+
+	if (!devx_is_obj_create_cmd(in)) {
+		errno = EINVAL;
+		return NULL;
+	}
+
+	obj = calloc(1, sizeof(*obj));
+	if (!obj) {
+		errno = ENOMEM;
+		return NULL;
+	}
+
+	ret = mlx5_vfio_cmd_exec(ctx, (void *)in, inlen, out, outlen, 0);
+	if (ret)
+		goto fail;
+
+	devx_obj_build_destroy_cmd(in, out, obj->dinbox,
+				   &obj->dinlen, &obj->dv_obj);
+	obj->dv_obj.context = context;
+
+	return &obj->dv_obj;
+fail:
+	free(obj);
+	return NULL;
+}
+
 static int vfio_devx_obj_query(struct mlx5dv_devx_obj *obj, const void *in,
 				size_t inlen, void *out, size_t outlen)
 {
-	return EOPNOTSUPP;
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(obj->context);
+
+	return mlx5_vfio_cmd_exec(ctx, (void *)in, inlen, out, outlen, 0);
+}
+
+static int vfio_devx_obj_modify(struct mlx5dv_devx_obj *obj, const void *in,
+				size_t inlen, void *out, size_t outlen)
+{
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(obj->context);
+
+	return mlx5_vfio_cmd_exec(ctx, (void *)in, inlen, out, outlen, 0);
+}
+
+static int vfio_devx_obj_destroy(struct mlx5dv_devx_obj *obj)
+{
+	struct mlx5_devx_obj *mobj = container_of(obj,
+						  struct mlx5_devx_obj, dv_obj);
+	struct mlx5_vfio_context *ctx = to_mvfio_ctx(obj->context);
+	uint32_t out[DEVX_ST_SZ_DW(general_obj_out_cmd_hdr)];
+	int ret;
+
+	ret = mlx5_vfio_cmd_exec(ctx, mobj->dinbox, mobj->dinlen,
+				 out, sizeof(out), 0);
+	if (ret)
+		return ret;
+
+	free(mobj);
+	return 0;
 }
 
 static struct mlx5_dv_context_ops mlx5_vfio_dv_ctx_ops = {
+	.devx_general_cmd = vfio_devx_general_cmd,
 	.devx_obj_create = vfio_devx_obj_create,
 	.devx_obj_query = vfio_devx_obj_query,
+	.devx_obj_modify = vfio_devx_obj_modify,
+	.devx_obj_destroy = vfio_devx_obj_destroy,
 	.devx_query_eqn = vfio_devx_query_eqn,
 	.devx_alloc_uar = vfio_devx_alloc_uar,
 	.devx_free_uar = vfio_devx_free_uar,
diff --git a/providers/mlx5/mlx5_vfio.h b/providers/mlx5/mlx5_vfio.h
index 766c48c..2165a22 100644
--- a/providers/mlx5/mlx5_vfio.h
+++ b/providers/mlx5/mlx5_vfio.h
@@ -9,6 +9,7 @@
 #include <stddef.h>
 #include <stdio.h>
 #include "mlx5.h"
+#include "mlx5_ifc.h"
 
 #include <infiniband/driver.h>
 #include <util/interval_set.h>
@@ -303,6 +304,13 @@ struct mlx5_vfio_context {
 	struct mlx5_dv_context_ops *dv_ctx_ops;
 };
 
+#define MLX5_MAX_DESTROY_INBOX_SIZE_DW	DEVX_ST_SZ_DW(delete_fte_in)
+struct mlx5_devx_obj {
+	struct mlx5dv_devx_obj dv_obj;
+	uint32_t dinbox[MLX5_MAX_DESTROY_INBOX_SIZE_DW];
+	uint32_t dinlen;
+};
+
 static inline struct mlx5_vfio_device *to_mvfio_dev(struct ibv_device *ibdev)
 {
 	return container_of(ibdev, struct mlx5_vfio_device, vdev.device);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 18/27] pyverbs: Support DevX UMEM registration
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (16 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 17/27] mlx5: Implement mlx5dv devx_obj " Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 19/27] pyverbs/mlx5: Support EQN querying Yishai Hadas
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Add support to register a DevX UMEM (regular and extended), i.e. user
memory to be used by the DevX interface for DMA.
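
A rough usage sketch; the device name, buffer sizes, access flag and
page-size bitmap below are illustrative placeholders, not values taken
from this patch:

```python
from pyverbs.providers.mlx5.mlx5dv import Mlx5Context, Mlx5DVContextAttr, Mlx5UMEM
import pyverbs.providers.mlx5.mlx5_enums as dve
import pyverbs.enums as e

ctx = Mlx5Context(Mlx5DVContextAttr(dve.MLX5DV_CONTEXT_FLAGS_DEVX), 'mlx5_0')
# Basic registration: pyverbs allocates an aligned buffer internally
umem = Mlx5UMEM(ctx, size=4096, access=e.IBV_ACCESS_LOCAL_WRITE)
# Passing pgsz_bitmap switches to the extended mlx5dv_devx_umem_reg_ex() path
umem_ex = Mlx5UMEM(ctx, size=4096, access=e.IBV_ACCESS_LOCAL_WRITE,
                   pgsz_bitmap=0x1000)
print(umem.umem_id)  # umem_id to reference from DevX commands
umem_ex.close()
umem.close()
```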

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/providers/mlx5/libmlx5.pxd | 21 +++++++--
 pyverbs/providers/mlx5/mlx5dv.pxd  |  8 ++++
 pyverbs/providers/mlx5/mlx5dv.pyx  | 93 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index 8dd0141..66b8705 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -212,6 +212,17 @@ cdef extern from 'infiniband/mlx5dv.h':
         uint8_t     fm_ce_se
         uint32_t    imm
 
+    cdef struct mlx5dv_devx_umem:
+        uint32_t umem_id;
+
+    cdef struct mlx5dv_devx_umem_in:
+        void        *addr
+        size_t      size
+        uint32_t    access
+        uint64_t    pgsz_bitmap
+        uint64_t    comp_mask
+
+
     void mlx5dv_set_ctrl_seg(mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
                              uint8_t opmod, uint32_t qp_num, uint8_t fm_ce_se,
                              uint8_t ds, uint8_t signature, uint32_t imm)
@@ -298,11 +309,15 @@ cdef extern from 'infiniband/mlx5dv.h':
     int mlx5dv_map_ah_to_qp(v.ibv_ah *ah, uint32_t qp_num)
 
     # DevX APIs
-    mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(v.ibv_context *context,
-                                           uint32_t flags)
+    mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(v.ibv_context *context, uint32_t flags)
     void mlx5dv_devx_free_uar(mlx5dv_devx_uar *devx_uar)
     int mlx5dv_devx_general_cmd(v.ibv_context *context, const void *in_,
-                                size_t inlen, void *out, size_t outlen);
+                                size_t inlen, void *out, size_t outlen)
+    mlx5dv_devx_umem *mlx5dv_devx_umem_reg(v.ibv_context *ctx, void *addr,
+                                           size_t size, unsigned long access)
+    mlx5dv_devx_umem *mlx5dv_devx_umem_reg_ex(v.ibv_context *ctx,
+                                              mlx5dv_devx_umem_in *umem_in)
+    int mlx5dv_devx_umem_dereg(mlx5dv_devx_umem *umem)
 
     # Mkey setters
     void mlx5dv_wr_mkey_configure(mlx5dv_qp_ex *mqp, mlx5dv_mkey *mkey,
diff --git a/pyverbs/providers/mlx5/mlx5dv.pxd b/pyverbs/providers/mlx5/mlx5dv.pxd
index 18b208e..154a117 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv.pxd
@@ -11,6 +11,8 @@ from pyverbs.cq cimport CQEX
 
 
 cdef class Mlx5Context(Context):
+    cdef object devx_umems
+    cdef add_ref(self, obj)
     cpdef close(self)
 
 cdef class Mlx5DVContextAttr(PyverbsObject):
@@ -69,3 +71,9 @@ cdef class Wqe(PyverbsCM):
     cdef void *addr
     cdef int is_user_addr
     cdef object segments
+
+cdef class Mlx5UMEM(PyverbsCM):
+    cdef dv.mlx5dv_devx_umem *umem
+    cdef Context context
+    cdef void *addr
+    cdef object is_user_addr
diff --git a/pyverbs/providers/mlx5/mlx5dv.pyx b/pyverbs/providers/mlx5/mlx5dv.pyx
index 07dd7db..d16aed1 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pyx
+++ b/pyverbs/providers/mlx5/mlx5dv.pyx
@@ -2,10 +2,11 @@
 # Copyright (c) 2019 Mellanox Technologies, Inc. All rights reserved. See COPYING file
 
 from libc.stdint cimport uintptr_t, uint8_t, uint16_t, uint32_t
-from libc.stdlib cimport calloc, free, malloc
-from libc.string cimport memcpy
+from libc.string cimport memcpy, memset
+from libc.stdlib cimport calloc, free
 from posix.mman cimport munmap
 import logging
+import weakref
 
 from pyverbs.providers.mlx5.mlx5dv_mkey cimport Mlx5MrInterleaved, Mlx5Mkey, \
     Mlx5MkeyConfAttr, Mlx5SigBlockAttr
@@ -13,6 +14,7 @@ from pyverbs.pyverbs_error import PyverbsUserError, PyverbsRDMAError, PyverbsErr
 from pyverbs.providers.mlx5.mlx5dv_sched cimport Mlx5dvSchedLeaf
 cimport pyverbs.providers.mlx5.mlx5dv_enums as dve
 cimport pyverbs.providers.mlx5.libmlx5 as dv
+from pyverbs.mem_alloc import posix_memalign
 from pyverbs.qp cimport QPInitAttrEx, QPEx
 from pyverbs.base import PyverbsRDMAErrno
 from pyverbs.base cimport close_weakrefs
@@ -156,6 +158,7 @@ cdef class Mlx5Context(Context):
         if self.context == NULL:
             raise PyverbsRDMAErrno('Failed to open mlx5 context on {dev}'
                                    .format(dev=self.name))
+        self.devx_umems = weakref.WeakSet()
 
     def query_mlx5_device(self, comp_mask=-1):
         """
@@ -259,12 +262,21 @@ cdef class Mlx5Context(Context):
         free(clock_info)
         return ns_time
 
+    cdef add_ref(self, obj):
+        try:
+            Context.add_ref(self, obj)
+        except PyverbsError:
+            if isinstance(obj, Mlx5UMEM):
+                self.devx_umems.add(obj)
+            else:
+                raise PyverbsError('Unrecognized object type')
+
     def __dealloc__(self):
         self.close()
 
     cpdef close(self):
         if self.context != NULL:
-            close_weakrefs([self.pps])
+            close_weakrefs([self.pps, self.devx_umems])
             super(Mlx5Context, self).close()
 
 
@@ -1307,3 +1319,78 @@ cdef class Wqe(PyverbsCM):
             if not self.is_user_addr:
                 free(self.addr)
             self.addr = NULL
+
+
+cdef class Mlx5UMEM(PyverbsCM):
+    def __init__(self, Context context not None, size, addr=None, alignment=64,
+                 access=0, pgsz_bitmap=0, comp_mask=0):
+        """
+        User memory object to be used by the DevX interface.
+        If pgsz_bitmap or comp_mask were passed, the extended umem registration
+        will be used.
+        :param context: RDMA device context to register the UMEM on
+        :param size: The size of the addr buffer (or the internal buffer to be
+                     allocated if addr is None)
+        :param alignment: The alignment of the internally allocated buffer
+                          (Valid if addr is None)
+        :param addr: The memory start address to register (if None, the address
+                     will be allocated internally)
+        :param access: The desired memory protection attributes (default: 0)
+        :param pgsz_bitmap: Represents the required page sizes
+        :param comp_mask: Compatibility mask
+        """
+        super().__init__()
+        cdef dv.mlx5dv_devx_umem_in umem_in
+
+        if addr is not None:
+            self.addr = <void*><uintptr_t>addr
+            self.is_user_addr = True
+        else:
+            self.addr = <void*><uintptr_t>posix_memalign(size, alignment)
+            memset(self.addr, 0, size)
+            self.is_user_addr = False
+
+        if pgsz_bitmap or comp_mask:
+            umem_in.addr = self.addr
+            umem_in.size = size
+            umem_in.access = access
+            umem_in.pgsz_bitmap = pgsz_bitmap
+            umem_in.comp_mask = comp_mask
+            self.umem = dv.mlx5dv_devx_umem_reg_ex(context.context, &umem_in)
+        else:
+            self.umem = dv.mlx5dv_devx_umem_reg(context.context, self.addr,
+                                                size, access)
+        if self.umem == NULL:
+            raise PyverbsRDMAErrno("Failed to register a UMEM.")
+        self.context = context
+        self.context.add_ref(self)
+
+    def __dealloc__(self):
+        self.close()
+
+    cpdef close(self):
+        if self.umem != NULL:
+            self.logger.debug('Closing Mlx5UMEM')
+            rc = dv.mlx5dv_devx_umem_dereg(self.umem)
+            try:
+                if rc:
+                    raise PyverbsError("Failed to dereg UMEM.", rc)
+            finally:
+                if not self.is_user_addr:
+                    free(self.addr)
+            self.umem = NULL
+            self.context = None
+
+    def __str__(self):
+        print_format = '{:20}: {:<20}\n'
+        return print_format.format('umem id', self.umem_id) + \
+               print_format.format('reg addr', self.umem_addr)
+
+    @property
+    def umem_id(self):
+        return self.umem.umem_id
+
+    @property
+    def umem_addr(self):
+        if self.addr:
+            return <uintptr_t><void*>self.addr
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 19/27] pyverbs/mlx5: Support EQN querying
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (17 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 18/27] pyverbs: Support DevX UMEM registration Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 20/27] pyverbs/mlx5: Support more DevX objects Yishai Hadas
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Add the ability to retrieve a device's EQ number which relates to a
given input vector.
The query is done over DevX.
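
A minimal calling sketch, assuming 'ctx' is a DevX-enabled Mlx5Context;
the vector number is an arbitrary example:

```python
# Query which device EQ serves completion vector 0
eqn = ctx.devx_query_eqn(0)
print(f'EQ number for completion vector 0: {eqn}')
```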

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/providers/mlx5/libmlx5.pxd |  1 +
 pyverbs/providers/mlx5/mlx5dv.pyx  | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index 66b8705..ba2c6ec 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -318,6 +318,7 @@ cdef extern from 'infiniband/mlx5dv.h':
     mlx5dv_devx_umem *mlx5dv_devx_umem_reg_ex(v.ibv_context *ctx,
                                               mlx5dv_devx_umem_in *umem_in)
     int mlx5dv_devx_umem_dereg(mlx5dv_devx_umem *umem)
+    int mlx5dv_devx_query_eqn(v.ibv_context *context, uint32_t vector, uint32_t *eqn)
 
     # Mkey setters
     void mlx5dv_wr_mkey_configure(mlx5dv_qp_ex *mqp, mlx5dv_mkey *mkey,
diff --git a/pyverbs/providers/mlx5/mlx5dv.pyx b/pyverbs/providers/mlx5/mlx5dv.pyx
index d16aed1..2c47cb6 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pyx
+++ b/pyverbs/providers/mlx5/mlx5dv.pyx
@@ -262,6 +262,18 @@ cdef class Mlx5Context(Context):
         free(clock_info)
         return ns_time
 
+    def devx_query_eqn(self, vector):
+        """
+        Query EQN for a given vector id.
+        :param vector: Completion vector number
+        :return: The device EQ number which relates to the given input vector
+        """
+        cdef uint32_t eqn
+        rc = dv.mlx5dv_devx_query_eqn(self.context, vector, &eqn)
+        if rc:
+            raise PyverbsRDMAError('Failed to query EQN', rc)
+        return eqn
+
     cdef add_ref(self, obj):
         try:
             Context.add_ref(self, obj)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 20/27] pyverbs/mlx5: Support more DevX objects
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (18 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 19/27] pyverbs/mlx5: Support EQN querying Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 21/27] pyverbs: Add auxiliary memory functions Yishai Hadas
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

A DevX object represents some underlying firmware object; the input
command to create it is raw data given by the user application, which
should match the device specification.

A new class, Mlx5DevxObj, is added to represent any DevX object that
can be created using the create_obj DevX API.
The command's output is stored in the object's instance, and the object
can be queried, modified and destroyed.
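
A minimal lifecycle sketch, assuming 'ctx' is a DevX-enabled Mlx5Context
and 'cmd_in' is a PRM-formatted create command such as the ALLOC_PD
example added to pyverbs.md by this series (16 is that command's output
length):

```python
from pyverbs.providers.mlx5.mlx5dv import Mlx5DevxObj

obj = Mlx5DevxObj(ctx, cmd_in, 16)
out = obj.out_view   # raw device output, e.g. to extract the object number
obj.close()          # DevX objects must be closed manually, in the right order
```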

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 Documentation/pyverbs.md           |  34 ++++++++++++
 pyverbs/providers/mlx5/libmlx5.pxd |   7 +++
 pyverbs/providers/mlx5/mlx5dv.pxd  |   6 +++
 pyverbs/providers/mlx5/mlx5dv.pyx  | 103 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/Documentation/pyverbs.md b/Documentation/pyverbs.md
index c7fa761..bcbde05 100644
--- a/Documentation/pyverbs.md
+++ b/Documentation/pyverbs.md
@@ -745,3 +745,37 @@ Below is the output when printing the spec.
     Dst mac         : de:de:de:00:de:de    mask: ff:ff:ff:ff:ff:ff
     Ether type      : 8451                 mask: 65535
     Vlan tag        : 8961                 mask: 65535
+
+
+##### MLX5 DevX Objects
+A DevX object represents some underlying firmware object; the input command to
+create it is raw data given by the user application, which should match the
+device specification.
+Upon successful creation, the output buffer includes the raw data from the device
+according to its specification and is stored in the Mlx5DevxObj instance. This
+data can be used as part of related firmware commands to this object.
+In addition to creation, the user can query/modify and destroy the object.
+
+Although weakrefs and DevX object closure are added and handled by
+Pyverbs, users must manually close these objects when finished, and
+should not leave them to the GC or to the closure of the Mlx5Context itself,
+since there is then no guarantee that the DevX objects are closed in the
+correct order, because Mlx5DevxObj is a general class that can represent any
+of the device's available objects.
+Pyverbs does, however, guarantee to close DevX UARs and UMEMs in order, after
+closing the other DevX objects.
+
+The following code snippet shows how to allocate and destroy a PD object over DevX.
+```python
+from pyverbs.providers.mlx5.mlx5dv import Mlx5Context, Mlx5DVContextAttr, Mlx5DevxObj
+import pyverbs.providers.mlx5.mlx5_enums as dve
+import struct
+
+attr = Mlx5DVContextAttr(dve.MLX5DV_CONTEXT_FLAGS_DEVX)
+ctx = Mlx5Context(attr, 'rocep8s0f0')
+MLX5_CMD_OP_ALLOC_PD = 0x800
+MLX5_CMD_OP_ALLOC_PD_OUTLEN = 0x10
+cmd_in = struct.pack('!H14s', MLX5_CMD_OP_ALLOC_PD, bytes(0))
+pd = Mlx5DevxObj(ctx, cmd_in, MLX5_CMD_OP_ALLOC_PD_OUTLEN)
+pd.close()
+```
diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index ba2c6ec..34691a9 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -319,6 +319,13 @@ cdef extern from 'infiniband/mlx5dv.h':
                                               mlx5dv_devx_umem_in *umem_in)
     int mlx5dv_devx_umem_dereg(mlx5dv_devx_umem *umem)
     int mlx5dv_devx_query_eqn(v.ibv_context *context, uint32_t vector, uint32_t *eqn)
+    mlx5dv_devx_obj *mlx5dv_devx_obj_create(v.ibv_context *context, const void *_in,
+                                            size_t inlen, void *out, size_t outlen)
+    int mlx5dv_devx_obj_query(mlx5dv_devx_obj *obj, const void *in_,
+                              size_t inlen, void *out, size_t outlen)
+    int mlx5dv_devx_obj_modify(mlx5dv_devx_obj *obj, const void *in_,
+                               size_t inlen, void *out, size_t outlen)
+    int mlx5dv_devx_obj_destroy(mlx5dv_devx_obj *obj)
 
     # Mkey setters
     void mlx5dv_wr_mkey_configure(mlx5dv_qp_ex *mqp, mlx5dv_mkey *mkey,
diff --git a/pyverbs/providers/mlx5/mlx5dv.pxd b/pyverbs/providers/mlx5/mlx5dv.pxd
index 154a117..2b758fe 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv.pxd
@@ -12,6 +12,7 @@ from pyverbs.cq cimport CQEX
 
 cdef class Mlx5Context(Context):
     cdef object devx_umems
+    cdef object devx_objs
     cdef add_ref(self, obj)
     cpdef close(self)
 
@@ -77,3 +78,8 @@ cdef class Mlx5UMEM(PyverbsCM):
     cdef Context context
     cdef void *addr
     cdef object is_user_addr
+
+cdef class Mlx5DevxObj(PyverbsCM):
+    cdef dv.mlx5dv_devx_obj *obj
+    cdef Context context
+    cdef object out_view
diff --git a/pyverbs/providers/mlx5/mlx5dv.pyx b/pyverbs/providers/mlx5/mlx5dv.pyx
index 2c47cb6..ab0bd4a 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pyx
+++ b/pyverbs/providers/mlx5/mlx5dv.pyx
@@ -140,6 +140,104 @@ cdef class Mlx5DVContextAttr(PyverbsObject):
         self.attr.comp_mask = val
 
 
+cdef class Mlx5DevxObj(PyverbsCM):
+    """
+    Represents mlx5dv_devx_obj C struct.
+    """
+    def __init__(self, Context context, in_, outlen):
+        """
+        Creates a DevX object.
+        If the object was successfully created, the command's output would be
+        stored as a memoryview in self.out_view.
+        :param in_: Bytes of the obj_create command's input data provided in a
+                    device specification format.
+                    (Stream of bytes or __bytes__ is implemented)
+        :param outlen: Expected output length in bytes
+        """
+        super().__init__()
+        in_bytes = bytes(in_)
+        cdef char *in_mailbox = _prepare_devx_inbox(in_bytes)
+        cdef char *out_mailbox = _prepare_devx_outbox(outlen)
+        self.obj = dv.mlx5dv_devx_obj_create(context.context, in_mailbox,
+                                             len(in_bytes), out_mailbox, outlen)
+        try:
+            if self.obj == NULL:
+                raise PyverbsRDMAErrno('Failed to create DevX object')
+            self.out_view = memoryview(out_mailbox[:outlen])
+            status = hex(self.out_view[0])
+            syndrome = self.out_view[4:8].hex()
+            if status != hex(0):
+                raise PyverbsRDMAError('Failed to create DevX object with status '
+                                       f'({status}) and syndrome (0x{syndrome})')
+        finally:
+            free(in_mailbox)
+            free(out_mailbox)
+        self.context = context
+        self.context.add_ref(self)
+
+    def query(self, in_, outlen):
+        """
+        Queries the DevX object.
+        :param in_: Bytes of the obj_query command's input data provided in a
+                    device specification format.
+                    (Stream of bytes or __bytes__ is implemented)
+        :param outlen: Expected output length in bytes
+        :return: Bytes of the command's output
+        """
+        in_bytes = bytes(in_)
+        cdef char *in_mailbox = _prepare_devx_inbox(in_bytes)
+        cdef char *out_mailbox = _prepare_devx_outbox(outlen)
+        rc = dv.mlx5dv_devx_obj_query(self.obj, in_mailbox, len(in_bytes),
+                                      out_mailbox, outlen)
+        try:
+            if rc:
+                raise PyverbsRDMAError('Failed to query DevX object', rc)
+            out = <bytes>out_mailbox[:outlen]
+        finally:
+            free(in_mailbox)
+            free(out_mailbox)
+        return out
+
+    def modify(self, in_, outlen):
+        """
+        Modifies the DevX object.
+        :param in_: Bytes of the obj_modify command's input data provided in a
+                    device specification format.
+                    (Stream of bytes or __bytes__ is implemented)
+        :param outlen: Expected output length in bytes
+        :return: Bytes of the command's output
+        """
+        in_bytes = bytes(in_)
+        cdef char *in_mailbox = _prepare_devx_inbox(in_bytes)
+        cdef char *out_mailbox = _prepare_devx_outbox(outlen)
+        rc = dv.mlx5dv_devx_obj_modify(self.obj, in_mailbox, len(in_bytes),
+                                       out_mailbox, outlen)
+        try:
+            if rc:
+                raise PyverbsRDMAError('Failed to modify DevX object', rc)
+            out = <bytes>out_mailbox[:outlen]
+        finally:
+            free(in_mailbox)
+            free(out_mailbox)
+        return out
+
+    @property
+    def out_view(self):
+        return self.out_view
+
+    def __dealloc__(self):
+        self.close()
+
+    cpdef close(self):
+        if self.obj != NULL:
+            self.logger.debug('Closing Mlx5DevxObj')
+            rc = dv.mlx5dv_devx_obj_destroy(self.obj)
+            if rc:
+                raise PyverbsRDMAError('Failed to destroy a DevX object', rc)
+            self.obj = NULL
+            self.context = None
+
+
 cdef class Mlx5Context(Context):
     """
     Represent mlx5 context, which extends Context.
@@ -159,6 +257,7 @@ cdef class Mlx5Context(Context):
             raise PyverbsRDMAErrno('Failed to open mlx5 context on {dev}'
                                    .format(dev=self.name))
         self.devx_umems = weakref.WeakSet()
+        self.devx_objs = weakref.WeakSet()
 
     def query_mlx5_device(self, comp_mask=-1):
         """
@@ -280,6 +379,8 @@ cdef class Mlx5Context(Context):
         except PyverbsError:
             if isinstance(obj, Mlx5UMEM):
                 self.devx_umems.add(obj)
+            elif isinstance(obj, Mlx5DevxObj):
+                self.devx_objs.add(obj)
             else:
                 raise PyverbsError('Unrecognized object type')
 
@@ -288,7 +389,7 @@ cdef class Mlx5Context(Context):
 
     cpdef close(self):
         if self.context != NULL:
-            close_weakrefs([self.pps, self.devx_umems])
+            close_weakrefs([self.pps, self.devx_objs, self.devx_umems])
             super(Mlx5Context, self).close()
 
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 21/27] pyverbs: Add auxiliary memory functions
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (19 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 20/27] pyverbs/mlx5: Support more DevX objects Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 22/27] pyverbs/mlx5: Add support to extract mlx5dv objects Yishai Hadas
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Add some auxiliary functions to the mem_alloc module to expose to
Python users the ability to read/write memory in chunks of 4/8 bytes.
In addition, expose the udma memory barriers and the ability to write
to MMIO.
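
A small composition sketch; the buffer size and values are arbitrary,
and the dma_util module is only built when coherent DMA is supported:

```python
from pyverbs.mem_alloc import posix_memalign, free, writebe32, read32
from pyverbs.dma_util import udma_to_dev_barrier

buf = posix_memalign(64)     # 64-byte buffer, default alignment
writebe32(buf, 0x12345678)   # stored in big-endian byte order
udma_to_dev_barrier()        # order the store before e.g. ringing a doorbell
raw = read32(buf)            # raw host-order word holding the BE bytes
free(buf)
```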

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/CMakeLists.txt |  7 +++++++
 pyverbs/dma_util.pyx   | 25 +++++++++++++++++++++++++
 pyverbs/mem_alloc.pyx  | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 pyverbs/dma_util.pyx

diff --git a/pyverbs/CMakeLists.txt b/pyverbs/CMakeLists.txt
index c532b4c..f403719 100644
--- a/pyverbs/CMakeLists.txt
+++ b/pyverbs/CMakeLists.txt
@@ -12,6 +12,12 @@ else()
   set(DMABUF_ALLOC dmabuf_alloc_stub.c)
 endif()
 
+if (HAVE_COHERENT_DMA)
+  set(DMA_UTIL dma_util.pyx)
+else()
+  set(DMA_UTIL "")
+endif()
+
 rdma_cython_module(pyverbs ""
   addr.pyx
   base.pyx
@@ -19,6 +25,7 @@ rdma_cython_module(pyverbs ""
   cmid.pyx
   cq.pyx
   device.pyx
+  ${DMA_UTIL}
   dmabuf.pyx
   ${DMABUF_ALLOC}
   enums.pyx
diff --git a/pyverbs/dma_util.pyx b/pyverbs/dma_util.pyx
new file mode 100644
index 0000000..36d5f9b
--- /dev/null
+++ b/pyverbs/dma_util.pyx
@@ -0,0 +1,25 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+
+#cython: language_level=3
+
+from libc.stdint cimport uintptr_t, uint64_t
+
+cdef extern from 'util/udma_barrier.h':
+    cdef void udma_to_device_barrier()
+    cdef void udma_from_device_barrier()
+
+cdef extern from 'util/mmio.h':
+    cdef void mmio_write64_be(void *addr, uint64_t val)
+
+
+def udma_to_dev_barrier():
+    udma_to_device_barrier()
+
+
+def udma_from_dev_barrier():
+    udma_from_device_barrier()
+
+
+def mmio_write64_as_be(addr, val):
+    mmio_write64_be(<void*><uintptr_t> addr, val)
diff --git a/pyverbs/mem_alloc.pyx b/pyverbs/mem_alloc.pyx
index 24be4f1..c6290f7 100644
--- a/pyverbs/mem_alloc.pyx
+++ b/pyverbs/mem_alloc.pyx
@@ -6,13 +6,17 @@
 from posix.stdlib cimport posix_memalign as c_posix_memalign
 from libc.stdlib cimport malloc as c_malloc, free as c_free
 from posix.mman cimport mmap as c_mmap, munmap as c_munmap
-from libc.stdint cimport uintptr_t
+from libc.stdint cimport uintptr_t, uint32_t, uint64_t
 from libc.string cimport memset
 cimport posix.mman as mm
 
 cdef extern from 'sys/mman.h':
     cdef void* MAP_FAILED
 
+cdef extern from 'endian.h':
+    unsigned long htobe32(unsigned long host_32bits)
+    unsigned long htobe64(unsigned long host_64bits)
+
 
 def mmap(addr=0, length=100, prot=mm.PROT_READ | mm.PROT_WRITE,
          flags=mm.MAP_PRIVATE | mm.MAP_ANONYMOUS, fd=0, offset=0):
@@ -82,6 +86,46 @@ def free(ptr):
     c_free(<void*><uintptr_t>ptr)
 
 
+def writebe32(addr, val, offset=0):
+    """
+    Write 32-bit value <val> as Big Endian to address <addr> and offset <offset>
+    :param addr: The start of the address to write the value to
+    :param val: Value to write
+    :param offset: Offset of the address to write the value to (in 4-bytes)
+    """
+    (<uint32_t*><void*><uintptr_t>addr)[offset] = htobe32(val)
+
+
+def writebe64(addr, val, offset=0):
+    """
+    Write 64-bit value <val> as Big Endian to address <addr> and offset <offset>
+    :param addr: The start of the address to write the value to
+    :param val: Value to write
+    :param offset: Offset of the address to write the value to (in 8-bytes)
+    """
+    (<uint64_t*><void*><uintptr_t>addr)[offset] = htobe64(val)
+
+
+def read32(addr, offset=0):
+    """
+    Read 32-bit value from address <addr> and offset <offset>
+    :param addr: The start of the address to read from
+    :param offset: Offset of the address to read from (in 4-bytes)
+    :return: The read value
+    """
+    return (<uint32_t*><uintptr_t>addr)[offset]
+
+
+def read64(addr, offset=0):
+    """
+    Read 64-bit value from address <addr> and offset <offset>
+    :param addr: The start of the address to read from
+    :param offset: Offset of the address to read from (in 8-bytes)
+    :return: The read value
+    """
+    return (<uint64_t*><uintptr_t>addr)[offset]
+
+
 # protection bits for mmap/mprotect
 PROT_EXEC_ = mm.PROT_EXEC
 PROT_READ_ = mm.PROT_READ
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 22/27] pyverbs/mlx5: Add support to extract mlx5dv objects
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (20 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 21/27] pyverbs: Add auxiliary memory functions Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 23/27] pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums Yishai Hadas
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Support extraction of mlx5dv objects from ibv objects.
This allows users to access the object numbers and many more attributes
needed for some use cases, such as DevX.
Currently the following objects are supported: PD, CQ, QP and SRQ.
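
A short extraction sketch, assuming 'cq' and 'pd' are existing pyverbs
CQ/PD objects created on an mlx5 device:

```python
from pyverbs.providers.mlx5.mlx5dv_objects import Mlx5DvObj
import pyverbs.providers.mlx5.mlx5_enums as dve

dv_obj = Mlx5DvObj(dve.MLX5DV_OBJ_CQ | dve.MLX5DV_OBJ_PD, cq=cq, pd=pd)
print(dv_obj.dvcq.cqn, dv_obj.dvcq.cqe_size, dv_obj.dvpd.pdn)
```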

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/pd.pyx                            |   4 +
 pyverbs/providers/mlx5/CMakeLists.txt     |   1 +
 pyverbs/providers/mlx5/libmlx5.pxd        |  52 ++++++++
 pyverbs/providers/mlx5/mlx5dv_enums.pxd   |  17 +++
 pyverbs/providers/mlx5/mlx5dv_objects.pxd |  28 ++++
 pyverbs/providers/mlx5/mlx5dv_objects.pyx | 214 ++++++++++++++++++++++++++++++
 6 files changed, 316 insertions(+)
 create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pxd
 create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pyx

diff --git a/pyverbs/pd.pyx b/pyverbs/pd.pyx
index 2c0c424..e8d3e1d 100644
--- a/pyverbs/pd.pyx
+++ b/pyverbs/pd.pyx
@@ -139,6 +139,10 @@ cdef class PD(PyverbsCM):
     def handle(self):
         return self.pd.handle
 
+    @property
+    def pd(self):
+        return <object>self.pd
+
 
 cdef void *pd_alloc(v.ibv_pd *pd, void *pd_context, size_t size,
                   size_t alignment, v.uint64_t resource_type):
diff --git a/pyverbs/providers/mlx5/CMakeLists.txt b/pyverbs/providers/mlx5/CMakeLists.txt
index cdb6be2..4763c61 100644
--- a/pyverbs/providers/mlx5/CMakeLists.txt
+++ b/pyverbs/providers/mlx5/CMakeLists.txt
@@ -11,5 +11,6 @@ rdma_cython_module(pyverbs/providers/mlx5 mlx5
   mlx5_enums.pyx
   mlx5dv_flow.pyx
   mlx5dv_mkey.pyx
+  mlx5dv_objects.pyx
   mlx5dv_sched.pyx
 )
diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index 34691a9..de4008d 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -4,6 +4,7 @@
 include 'mlx5dv_enums.pxd'
 
 from libc.stdint cimport uint8_t, uint16_t, uint32_t, uint64_t, uintptr_t
+from posix.types cimport off_t
 from libcpp cimport bool
 
 cimport pyverbs.libibverbs as v
@@ -222,6 +223,56 @@ cdef extern from 'infiniband/mlx5dv.h':
         uint64_t    pgsz_bitmap
         uint64_t    comp_mask
 
+    cdef struct mlx5dv_pd:
+        uint32_t    pdn
+        uint64_t    comp_mask
+
+    cdef struct mlx5dv_cq:
+        void        *buf
+        uint32_t    *dbrec
+        uint32_t    cqe_cnt
+        uint32_t    cqe_size
+        void        *cq_uar
+        uint32_t    cqn
+        uint64_t    comp_mask
+
+    cdef struct mlx5dv_qp:
+        uint64_t    comp_mask
+        off_t       uar_mmap_offset
+        uint32_t    tirn
+        uint32_t    tisn
+        uint32_t    rqn
+        uint32_t    sqn
+
+    cdef struct mlx5dv_srq:
+        uint32_t    stride
+        uint32_t    head
+        uint32_t    tail
+        uint64_t    comp_mask
+        uint32_t    srqn
+
+    cdef struct pd:
+        v.ibv_pd    *in_ "in"
+        mlx5dv_pd   *out
+
+    cdef struct cq:
+        v.ibv_cq    *in_ "in"
+        mlx5dv_cq   *out
+
+    cdef struct qp:
+        v.ibv_qp    *in_ "in"
+        mlx5dv_qp   *out
+
+    cdef struct srq:
+        v.ibv_srq   *in_ "in"
+        mlx5dv_srq  *out
+
+    cdef struct mlx5dv_obj:
+        pd  pd
+        cq  cq
+        qp  qp
+        srq srq
+
 
     void mlx5dv_set_ctrl_seg(mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
                              uint8_t opmod, uint32_t qp_num, uint8_t fm_ce_se,
@@ -326,6 +377,7 @@ cdef extern from 'infiniband/mlx5dv.h':
     int mlx5dv_devx_obj_modify(mlx5dv_devx_obj *obj, const void *in_,
                                size_t inlen, void *out, size_t outlen)
     int mlx5dv_devx_obj_destroy(mlx5dv_devx_obj *obj)
+    int mlx5dv_init_obj(mlx5dv_obj *obj, uint64_t obj_type)
 
     # Mkey setters
     void mlx5dv_wr_mkey_configure(mlx5dv_qp_ex *mqp, mlx5dv_mkey *mkey,
diff --git a/pyverbs/providers/mlx5/mlx5dv_enums.pxd b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
index d0a29bc..9f8d1a1 100644
--- a/pyverbs/providers/mlx5/mlx5dv_enums.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
@@ -176,6 +176,23 @@ cdef extern from 'infiniband/mlx5dv.h':
         MLX5DV_DR_DOMAIN_TYPE_NIC_TX
         MLX5DV_DR_DOMAIN_TYPE_FDB
 
+    cpdef enum mlx5dv_qp_comp_mask:
+        MLX5DV_QP_MASK_UAR_MMAP_OFFSET
+        MLX5DV_QP_MASK_RAW_QP_HANDLES
+        MLX5DV_QP_MASK_RAW_QP_TIR_ADDR
+
+    cpdef enum mlx5dv_srq_comp_mask:
+        MLX5DV_SRQ_MASK_SRQN
+
+    cpdef enum mlx5dv_obj_type:
+        MLX5DV_OBJ_QP
+        MLX5DV_OBJ_CQ
+        MLX5DV_OBJ_SRQ
+        MLX5DV_OBJ_RWQ
+        MLX5DV_OBJ_DM
+        MLX5DV_OBJ_AH
+        MLX5DV_OBJ_PD
+
     cpdef unsigned long long MLX5DV_RES_TYPE_QP
     cpdef unsigned long long MLX5DV_RES_TYPE_RWQ
     cpdef unsigned long long MLX5DV_RES_TYPE_DBR
diff --git a/pyverbs/providers/mlx5/mlx5dv_objects.pxd b/pyverbs/providers/mlx5/mlx5dv_objects.pxd
new file mode 100644
index 0000000..e8eab84
--- /dev/null
+++ b/pyverbs/providers/mlx5/mlx5dv_objects.pxd
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+
+#cython: language_level=3
+
+cimport pyverbs.providers.mlx5.libmlx5 as dv
+from pyverbs.base cimport PyverbsObject
+
+
+cdef class Mlx5DvPD(PyverbsObject):
+    cdef dv.mlx5dv_pd dv_pd
+
+cdef class Mlx5DvCQ(PyverbsObject):
+    cdef dv.mlx5dv_cq dv_cq
+
+cdef class Mlx5DvQP(PyverbsObject):
+    cdef dv.mlx5dv_qp dv_qp
+
+cdef class Mlx5DvSRQ(PyverbsObject):
+    cdef dv.mlx5dv_srq dv_srq
+
+cdef class Mlx5DvObj(PyverbsObject):
+    cdef dv.mlx5dv_obj obj
+    cdef Mlx5DvCQ dv_cq
+    cdef Mlx5DvQP dv_qp
+    cdef Mlx5DvPD dv_pd
+    cdef Mlx5DvSRQ dv_srq
+
diff --git a/pyverbs/providers/mlx5/mlx5dv_objects.pyx b/pyverbs/providers/mlx5/mlx5dv_objects.pyx
new file mode 100644
index 0000000..ec6eeb6
--- /dev/null
+++ b/pyverbs/providers/mlx5/mlx5dv_objects.pyx
@@ -0,0 +1,214 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+"""
+This module wraps mlx5dv_<obj> C structs, such as mlx5dv_cq, mlx5dv_qp etc.
+It exposes to users the mlx5 driver-specific attributes of ibv objects by
+extracting them via the mlx5dv_init_obj() API, using the Mlx5DvObj class which
+holds all the (currently) supported Mlx5Dv<Obj> objects.
+Note: This is not to be confused with Mlx5<Obj>, which holds the ibv_<obj>_ex
+      that was created using mlx5dv_create_<obj>().
+"""
+
+from libc.stdint cimport uintptr_t, uint32_t
+
+from pyverbs.pyverbs_error import PyverbsUserError, PyverbsRDMAError
+cimport pyverbs.providers.mlx5.mlx5dv_enums as dve
+cimport pyverbs.libibverbs as v
+
+
+cdef class Mlx5DvPD(PyverbsObject):
+
+    @property
+    def pdn(self):
+        """ The protection domain object number """
+        return self.dv_pd.pdn
+
+    @property
+    def comp_mask(self):
+        return self.dv_pd.comp_mask
+
+
+cdef class Mlx5DvCQ(PyverbsObject):
+
+    @property
+    def cqe_size(self):
+        return self.dv_cq.cqe_size
+
+    @property
+    def comp_mask(self):
+        return self.dv_cq.comp_mask
+
+    @property
+    def cqn(self):
+        return self.dv_cq.cqn
+
+    @property
+    def buf(self):
+        return <uintptr_t><void*>self.dv_cq.buf
+
+    @property
+    def cq_uar(self):
+        return <uintptr_t><void*>self.dv_cq.cq_uar
+
+    @property
+    def dbrec(self):
+        return <uintptr_t><uint32_t*>self.dv_cq.dbrec
+
+    @property
+    def cqe_cnt(self):
+        return self.dv_cq.cqe_cnt
+
+
+cdef class Mlx5DvQP(PyverbsObject):
+
+    @property
+    def rqn(self):
+        """ The receive queue number of the QP"""
+        return self.dv_qp.rqn
+
+    @property
+    def sqn(self):
+        """ The send queue number of the QP"""
+        return self.dv_qp.sqn
+
+    @property
+    def tirn(self):
+        """
+        The number of the transport interface receive object that attached
+        to the RQ of the QP
+        """
+        return self.dv_qp.tirn
+
+    @property
+    def tisn(self):
+        """
+        The number of the transport interface send object that attached
+        to the SQ of the QP
+        """
+        return self.dv_qp.tisn
+
+    @property
+    def comp_mask(self):
+        return self.dv_qp.comp_mask
+    @comp_mask.setter
+    def comp_mask(self, val):
+        self.dv_qp.comp_mask = val
+
+    @property
+    def uar_mmap_offset(self):
+        return self.dv_qp.uar_mmap_offset
+
+
+cdef class Mlx5DvSRQ(PyverbsObject):
+
+    @property
+    def stride(self):
+        return self.dv_srq.stride
+
+    @property
+    def head(self):
+        return self.dv_srq.head
+
+    @property
+    def tail(self):
+        return self.dv_srq.tail
+
+    @property
+    def comp_mask(self):
+        return self.dv_srq.comp_mask
+    @comp_mask.setter
+    def comp_mask(self, val):
+        self.dv_srq.comp_mask = val
+
+    @property
+    def srqn(self):
+        """ The shared receive queue object number """
+        return self.dv_srq.srqn
+
+
+cdef class Mlx5DvObj(PyverbsObject):
+    """
+    Mlx5DvObj represents mlx5dv_obj C struct.
+    """
+    def __init__(self, obj_type=None, **kwargs):
+        """
+        Retrieves DV objects from ibv object to be able to extract attributes
+        (such as cqe_size of a CQ).
+        Currently supports CQ, QP, PD and SRQ objects.
+        The initialized objects can be accessed using self.dvobj (e.g. self.dvcq).
+        :param obj_type: Bitmask which defines which objects were provided.
+                         Currently it supports: MLX5DV_OBJ_CQ, MLX5DV_OBJ_QP,
+                         MLX5DV_OBJ_SRQ and MLX5DV_OBJ_PD.
+        :param kwargs: List of objects (cq, qp, pd, srq) from which to extract
+                       data and their comp_masks if applicable. If comp_mask is
+                       not provided by user, mask all by default.
+        """
+        self.dv_pd = self.dv_cq = self.dv_qp = self.dv_srq = None
+        if obj_type is None:
+            return
+        self.init_obj(obj_type, **kwargs)
+
+    def init_obj(self, obj_type, **kwargs):
+        """
+        Initialize DV objects.
+        The objects are re-initialized if they're already extracted.
+        """
+        supported_obj_types = dve.MLX5DV_OBJ_CQ | dve.MLX5DV_OBJ_QP | \
+                         dve.MLX5DV_OBJ_PD | dve.MLX5DV_OBJ_SRQ
+        if not obj_type & supported_obj_types:
+            raise PyverbsUserError('Invalid obj_type was provided')
+
+        cq = kwargs.get('cq') if obj_type & dve.MLX5DV_OBJ_CQ else None
+        qp = kwargs.get('qp') if obj_type & dve.MLX5DV_OBJ_QP else None
+        pd = kwargs.get('pd') if obj_type & dve.MLX5DV_OBJ_PD else None
+        srq = kwargs.get('srq') if obj_type & dve.MLX5DV_OBJ_SRQ else None
+        if cq is qp is pd is srq is None:
+            raise PyverbsUserError("No supported object was provided.")
+
+        if cq:
+            dv_cq = Mlx5DvCQ()
+            self.obj.cq.in_ = <v.ibv_cq*>cq.cq
+            self.obj.cq.out = &(dv_cq.dv_cq)
+            self.dv_cq = dv_cq
+        if qp:
+            dv_qp = Mlx5DvQP()
+            comp_mask = kwargs.get('qp_comp_mask')
+            dv_qp.comp_mask = comp_mask if comp_mask else \
+                dv.MLX5DV_QP_MASK_UAR_MMAP_OFFSET | \
+                dv.MLX5DV_QP_MASK_RAW_QP_HANDLES | \
+                dv.MLX5DV_QP_MASK_RAW_QP_TIR_ADDR
+            self.obj.qp.in_ = <v.ibv_qp*>qp.qp
+            self.obj.qp.out = &(dv_qp.dv_qp)
+            self.dv_qp = dv_qp
+        if pd:
+            dv_pd = Mlx5DvPD()
+            self.obj.pd.in_ = <v.ibv_pd*>pd.pd
+            self.obj.pd.out = &(dv_pd.dv_pd)
+            self.dv_pd = dv_pd
+        if srq:
+            dv_srq = Mlx5DvSRQ()
+            comp_mask = kwargs.get('srq_comp_mask')
+            dv_srq.comp_mask = comp_mask if comp_mask else dv.MLX5DV_SRQ_MASK_SRQN
+            self.obj.srq.in_ = <v.ibv_srq*>srq.srq
+            self.obj.srq.out = &(dv_srq.dv_srq)
+            self.dv_srq = dv_srq
+
+        rc = dv.mlx5dv_init_obj(&self.obj, obj_type)
+        if rc != 0:
+            raise PyverbsRDMAError("Failed to initialize Mlx5DvObj", rc)
+
+    @property
+    def dvcq(self):
+        return self.dv_cq
+
+    @property
+    def dvqp(self):
+        return self.dv_qp
+
+    @property
+    def dvpd(self):
+        return self.dv_pd
+
+    @property
+    def dvsrq(self):
+        return self.dv_srq
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 23/27] pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (21 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 22/27] pyverbs/mlx5: Add support to extract mlx5dv objects Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 24/27] tests: Add MAC address to the tests' args Yishai Hadas
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Add an Mlx5Cqe64 class that wraps the mlx5_cqe64 C struct and provides
users with an easy way to get/set its fields.
In addition, expose the related enums.
Both are needed for the DevX data path.
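
A parsing sketch, assuming 'cq_buf' holds the CQ buffer address (e.g.
taken from Mlx5DvCQ.buf) with 64-byte CQEs; index 0 is arbitrary:

```python
from pyverbs.providers.mlx5.mlx5dv import Mlx5Cqe64
import pyverbs.providers.mlx5.mlx5_enums as dve

cqe = Mlx5Cqe64(cq_buf + 0 * 64)      # wrap the CQE at index 0
if not cqe.is_empty() and cqe.opcode != dve.MLX5_CQE_INVALID:
    print(f'opcode={cqe.opcode} byte_cnt={cqe.byte_cnt} '
          f'wqe_counter={cqe.wqe_counter}')
```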

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/providers/mlx5/libmlx5.pxd      | 14 +++++++
 pyverbs/providers/mlx5/mlx5dv.pxd       |  3 ++
 pyverbs/providers/mlx5/mlx5dv.pyx       | 71 +++++++++++++++++++++++++++++++++
 pyverbs/providers/mlx5/mlx5dv_enums.pxd | 22 ++++++++++
 4 files changed, 110 insertions(+)

diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index de4008d..af034ad 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -273,12 +273,26 @@ cdef extern from 'infiniband/mlx5dv.h':
         qp  qp
         srq srq
 
+    cdef struct mlx5_cqe64:
+        uint16_t    wqe_id
+        uint32_t    imm_inval_pkey
+        uint32_t    byte_cnt
+        uint64_t    timestamp
+        uint16_t    wqe_counter
+        uint8_t     signature
+        uint8_t     op_own
+
 
     void mlx5dv_set_ctrl_seg(mlx5_wqe_ctrl_seg *seg, uint16_t pi, uint8_t opcode,
                              uint8_t opmod, uint32_t qp_num, uint8_t fm_ce_se,
                              uint8_t ds, uint8_t signature, uint32_t imm)
     void mlx5dv_set_data_seg(mlx5_wqe_data_seg *seg, uint32_t length,
                              uint32_t lkey, uintptr_t address)
+    uint8_t mlx5dv_get_cqe_owner(mlx5_cqe64 *cqe)
+    void mlx5dv_set_cqe_owner(mlx5_cqe64 *cqe, uint8_t val)
+    uint8_t mlx5dv_get_cqe_se(mlx5_cqe64 *cqe)
+    uint8_t mlx5dv_get_cqe_format(mlx5_cqe64 *cqe)
+    uint8_t mlx5dv_get_cqe_opcode(mlx5_cqe64 *cqe)
     bool mlx5dv_is_supported(v.ibv_device *device)
     v.ibv_context* mlx5dv_open_device(v.ibv_device *device,
                                       mlx5dv_context_attr *attr)
diff --git a/pyverbs/providers/mlx5/mlx5dv.pxd b/pyverbs/providers/mlx5/mlx5dv.pxd
index 2b758fe..968cbdb 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv.pxd
@@ -83,3 +83,6 @@ cdef class Mlx5DevxObj(PyverbsCM):
     cdef dv.mlx5dv_devx_obj *obj
     cdef Context context
     cdef object out_view
+
+cdef class Mlx5Cqe64(PyverbsObject):
+    cdef dv.mlx5_cqe64 *cqe
diff --git a/pyverbs/providers/mlx5/mlx5dv.pyx b/pyverbs/providers/mlx5/mlx5dv.pyx
index ab0bd4a..8d6bae0 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pyx
+++ b/pyverbs/providers/mlx5/mlx5dv.pyx
@@ -1507,3 +1507,74 @@ cdef class Mlx5UMEM(PyverbsCM):
     def umem_addr(self):
         if self.addr:
             return <uintptr_t><void*>self.addr
+
+
+cdef class Mlx5Cqe64(PyverbsObject):
+    def __init__(self, addr):
+        self.cqe = <dv.mlx5_cqe64*><uintptr_t> addr
+
+    def dump(self):
+        dump_format = '{:08x} {:08x} {:08x} {:08x}\n'
+        str = ''
+        for i in range(0, 16, 4):
+            str += dump_format.format(be32toh((<uint32_t*>self.cqe)[i]),
+                                      be32toh((<uint32_t*>self.cqe)[i + 1]),
+                                      be32toh((<uint32_t*>self.cqe)[i + 2]),
+                                      be32toh((<uint32_t*>self.cqe)[i + 3]))
+        return str
+
+    def is_empty(self):
+        for i in range(16):
+            if be32toh((<uint32_t*>self.cqe)[i]) != 0:
+                return False
+        return True
+
+    @property
+    def owner(self):
+        return dv.mlx5dv_get_cqe_owner(self.cqe)
+    @owner.setter
+    def owner(self, val):
+        dv.mlx5dv_set_cqe_owner(self.cqe, <uint8_t> val)
+
+    @property
+    def se(self):
+        return dv.mlx5dv_get_cqe_se(self.cqe)
+
+    @property
+    def format(self):
+        return dv.mlx5dv_get_cqe_format(self.cqe)
+
+    @property
+    def opcode(self):
+        return dv.mlx5dv_get_cqe_opcode(self.cqe)
+
+    @property
+    def imm_inval_pkey(self):
+        return be32toh(self.cqe.imm_inval_pkey)
+
+    @property
+    def wqe_id(self):
+        return be16toh(self.cqe.wqe_id)
+
+    @property
+    def byte_cnt(self):
+        return be32toh(self.cqe.byte_cnt)
+
+    @property
+    def timestamp(self):
+        return be64toh(self.cqe.timestamp)
+
+    @property
+    def wqe_counter(self):
+        return be16toh(self.cqe.wqe_counter)
+
+    @property
+    def signature(self):
+        return self.cqe.signature
+
+    @property
+    def op_own(self):
+        return self.cqe.op_own
+
+    def __str__(self):
+        return (<dv.mlx5_cqe64>((<dv.mlx5_cqe64*>self.cqe)[0])).__str__()
diff --git a/pyverbs/providers/mlx5/mlx5dv_enums.pxd b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
index 9f8d1a1..60713e8 100644
--- a/pyverbs/providers/mlx5/mlx5dv_enums.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
@@ -193,6 +193,28 @@ cdef extern from 'infiniband/mlx5dv.h':
         MLX5DV_OBJ_AH
         MLX5DV_OBJ_PD
 
+    cpdef enum:
+        MLX5_RCV_DBR
+        MLX5_SND_DBR
+
+    cpdef enum:
+        MLX5_CQE_OWNER_MASK
+        MLX5_CQE_REQ
+        MLX5_CQE_RESP_WR_IMM
+        MLX5_CQE_RESP_SEND
+        MLX5_CQE_RESP_SEND_IMM
+        MLX5_CQE_RESP_SEND_INV
+        MLX5_CQE_RESIZE_CQ
+        MLX5_CQE_NO_PACKET
+        MLX5_CQE_SIG_ERR
+        MLX5_CQE_REQ_ERR
+        MLX5_CQE_RESP_ERR
+        MLX5_CQE_INVALID
+
+    cpdef enum:
+        MLX5_SEND_WQE_BB
+        MLX5_SEND_WQE_SHIFT
+
     cpdef unsigned long long MLX5DV_RES_TYPE_QP
     cpdef unsigned long long MLX5DV_RES_TYPE_RWQ
     cpdef unsigned long long MLX5DV_RES_TYPE_DBR
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 24/27] tests: Add MAC address to the tests' args
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (22 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 23/27] pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 25/27] tests: Add mlx5 DevX data path test Yishai Hadas
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Append the MAC address of the relevant interface to the tests'
arguments.

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 tests/base.py | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/tests/base.py b/tests/base.py
index dcebdc7..f5518d1 100644
--- a/tests/base.py
+++ b/tests/base.py
@@ -131,6 +131,7 @@ class RDMATestCase(unittest.TestCase):
         self.pkey_index = pkey_index
         self.gid_type = gid_type if gid_index is None else None
         self.ip_addr = None
+        self.mac_addr = None
         self.pre_environment = {}
         self.server = None
         self.client = None
@@ -168,13 +169,14 @@ class RDMATestCase(unittest.TestCase):
         return out.decode().split('\n')[0]
 
     @staticmethod
-    def get_ip_address(ifname):
+    def get_ip_mac_address(ifname):
         out = subprocess.check_output(['ip', '-j', 'addr', 'show', ifname])
         loaded_json = json.loads(out.decode())
         interface = loaded_json[0]['addr_info'][0]['local']
+        mac = loaded_json[0]['address']
         if 'fe80::' in interface:
             interface = interface + '%' + ifname
-        return interface
+        return interface, mac
 
     def setUp(self):
         """
@@ -242,11 +244,11 @@ class RDMATestCase(unittest.TestCase):
                 continue
             net_name = self.get_net_name(dev)
             try:
-                ip_addr = self.get_ip_address(net_name)
+                ip_addr, mac_addr = self.get_ip_mac_address(net_name)
             except (KeyError, IndexError):
-                self.args.append([dev, port, idx, None])
+                self.args.append([dev, port, idx, None, None])
             else:
-                self.args.append([dev, port, idx, ip_addr])
+                self.args.append([dev, port, idx, ip_addr, mac_addr])
 
     def _add_gids_per_device(self, ctx, dev):
         self._add_gids_per_port(ctx, dev, self.ib_port)
@@ -264,6 +266,7 @@ class RDMATestCase(unittest.TestCase):
         self.ib_port = args[1]
         self.gid_index = args[2]
         self.ip_addr = args[3]
+        self.mac_addr = args[4]
 
     def set_env_variable(self, var, value):
         """
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH rdma-core 25/27] tests: Add mlx5 DevX data path test
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (23 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 24/27] tests: Add MAC address to the tests' args Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 26/27] pyverbs/mlx5: Support mlx5 devices over VFIO Yishai Hadas
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Extend the mlx5 tests' infrastructure to support creation of mlx5 DevX
objects.
Add a data path test that creates all the resources needed for an RC
QP, modifies it to the RTS state and runs SEND_IMM traffic over DevX
objects.
The test validates the data, including the received "immediate" value,
and checks the validity of the received CQEs.
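
A worked example of the SQ sizing done by create_queue_attrs() below,
assuming a 16-byte data segment and the 64-byte MLX5_SEND_WQE_BB basic
block used by the mlx5 provider:

```python
MLX5_SEND_WQE_BB = 64                  # send WQE basic block size
wqe_size = 192 + 16                    # worst-case RC overhead + one data seg
aligned = (wqe_size + MLX5_SEND_WQE_BB - 1) & ~(MLX5_SEND_WQE_BB - 1)
assert aligned == 256                  # each send WQE spans 4 basic blocks
```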

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 tests/CMakeLists.txt      |    2 +
 tests/base.py             |    1 +
 tests/mlx5_base.py        |  460 +++++++++++++++++++-
 tests/mlx5_prm_structs.py | 1046 +++++++++++++++++++++++++++++++++++++++++++++
 tests/test_mlx5_devx.py   |   23 +
 5 files changed, 1529 insertions(+), 3 deletions(-)
 create mode 100644 tests/mlx5_prm_structs.py
 create mode 100644 tests/test_mlx5_devx.py

diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt
index 7333f25..7b079d8 100644
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -8,6 +8,7 @@ rdma_python_test(tests
   base_rdmacm.py
   efa_base.py
   mlx5_base.py
+  mlx5_prm_structs.py
   rdmacm_utils.py
   test_addr.py
   test_cq.py
@@ -20,6 +21,7 @@ rdma_python_test(tests
   test_fork.py
   test_mlx5_cq.py
   test_mlx5_dc.py
+  test_mlx5_devx.py
   test_mlx5_dm_ops.py
   test_mlx5_dr.py
   test_mlx5_flow.py
diff --git a/tests/base.py b/tests/base.py
index f5518d1..e9f36ab 100644
--- a/tests/base.py
+++ b/tests/base.py
@@ -30,6 +30,7 @@ PATH_MTU = e.IBV_MTU_1024
 MAX_DEST_RD_ATOMIC = 1
 NUM_OF_PROCESSES = 2
 MC_IP_PREFIX = '230'
+MAX_RDMA_ATOMIC = 20
 MAX_RD_ATOMIC = 1
 MIN_RNR_TIMER =12
 RETRY_CNT = 7
diff --git a/tests/mlx5_base.py b/tests/mlx5_base.py
index ded4bc1..47e6ebc 100644
--- a/tests/mlx5_base.py
+++ b/tests/mlx5_base.py
@@ -2,20 +2,35 @@
 # Copyright (c) 2020 NVIDIA Corporation . All rights reserved. See COPYING file
 
 import unittest
+import resource
 import random
+import struct
 import errno
+import math
+import time
 
 from pyverbs.providers.mlx5.mlx5dv import Mlx5Context, Mlx5DVContextAttr, \
-    Mlx5DVQPInitAttr, Mlx5QP, Mlx5DVDCInitAttr
+    Mlx5DVQPInitAttr, Mlx5QP, Mlx5DVDCInitAttr, Mlx5DevxObj, Mlx5UMEM, Mlx5UAR, \
+    WqeDataSeg, WqeCtrlSeg, Wqe, Mlx5Cqe64
 from tests.base import TrafficResources, set_rnr_attributes, DCT_KEY, \
-    RDMATestCase, PyverbsAPITestCase, RDMACMBaseTest
-from pyverbs.pyverbs_error import PyverbsRDMAError, PyverbsUserError
+    RDMATestCase, PyverbsAPITestCase, RDMACMBaseTest, BaseResources, PATH_MTU, \
+    RNR_RETRY, RETRY_CNT, MIN_RNR_TIMER, TIMEOUT, MAX_RDMA_ATOMIC
+from pyverbs.pyverbs_error import PyverbsRDMAError, PyverbsUserError, \
+    PyverbsError
+from pyverbs.providers.mlx5.mlx5dv_objects import Mlx5DvObj
 from pyverbs.qp import QPCap, QPInitAttrEx, QPAttr
 import pyverbs.providers.mlx5.mlx5_enums as dve
 from pyverbs.addr import AHAttr, GlobalRoute
+import pyverbs.mem_alloc as mem
+import pyverbs.dma_util as dma
 import pyverbs.device as d
+from pyverbs.pd import PD
 import pyverbs.enums as e
 from pyverbs.mr import MR
+import tests.utils
+
+MLX5_CQ_SET_CI = 0
+POLL_CQ_TIMEOUT = 5  # In seconds
 
 
 MELLANOX_VENDOR_ID = 0x02c9
@@ -155,3 +170,442 @@ class Mlx5DcResources(TrafficResources):
             if ex.error_code == errno.EOPNOTSUPP:
                 raise unittest.SkipTest(f'Create DC QP is not supported')
             raise ex
+
+
+class WqAttrs:
+    def __init__(self):
+        super().__init__()
+        self.wqe_num = 0
+        self.wqe_size = 0
+        self.wq_size = 0
+        self.head = 0
+        self.post_idx = 0
+        self.wqe_shift = 0
+        self.offset = 0
+
+    def __str__(self):
+        return str(vars(self))
+
+    def __format__(self, format_spec):
+        return str(self).__format__(format_spec)
+
+
+class CqAttrs:
+    def __init__(self):
+        super().__init__()
+        self.cons_idx = 0
+        self.cqe_size = 64
+        self.ncqes = 256
+
+    def __str__(self):
+        return str(vars(self))
+
+    def __format__(self, format_spec):
+        return str(self).__format__(format_spec)
+
+
+class QueueAttrs:
+    def __init__(self):
+        self.rq = WqAttrs()
+        self.sq = WqAttrs()
+        self.cq = CqAttrs()
+
+    def __str__(self):
+        print_format = '{}:\n\t{}\n'
+        return print_format.format('RQ Attributes', self.rq) + \
+               print_format.format('SQ Attributes', self.sq) + \
+               print_format.format('CQ Attributes', self.cq)
+
+
+class Mlx5DevxRcResources(BaseResources):
+    """
+    Creates all the DevX resources needed for a traffic-ready RC DevX QP,
+    including methods to transit the WQs into RTS state.
+    It also includes traffic methods for post send/receive and poll.
+    The class currently supports post send with immediate, but can be
+    easily extended to support other opcodes in the future.
+    """
+    def __init__(self, dev_name, ib_port, gid_index, msg_size=1024):
+        super().__init__(dev_name, ib_port, gid_index)
+        self.umems = {}
+        self.msg_size = msg_size
+        self.num_msgs = 1000
+        self.imm = 0x03020100
+        self.uar = {}
+        self.max_recv_sge = 1
+        self.eqn = None
+        self.pd = None
+        self.dv_pd = None
+        self.mr = None
+        self.cq = None
+        self.qp = None
+        self.qpn = None
+        self.psn = None
+        self.lid = None
+        self.gid = [0, 0, 0, 0]
+        # Remote attrs
+        self.rqpn = None
+        self.rpsn = None
+        self.rlid = None
+        self.rgid = [0, 0, 0, 0]
+        self.rmac = None
+        self.devx_objs = []
+        self.qattr = QueueAttrs()
+        self.init_resources()
+
+    def init_resources(self):
+        if not self.is_eth():
+            self.query_lid()
+        else:
+            self.query_gid()
+        self.create_pd()
+        self.create_mr()
+        self.query_eqn()
+        self.create_uar()
+        self.create_queue_attrs()
+        self.create_cq()
+        self.create_qp()
+        # The objects' closure order matters and must be handled manually in DevX
+        self.devx_objs = [self.qp, self.cq] + list(self.uar.values()) + list(self.umems.values())
+
+    def query_lid(self):
+        from tests.mlx5_prm_structs import QueryHcaVportContextIn, \
+            QueryHcaVportContextOut, QueryHcaCapIn, QueryCmdHcaCapOut
+
+        query_cap_in = QueryHcaCapIn(op_mod=0x1)
+        query_cap_out = QueryCmdHcaCapOut(self.ctx.devx_general_cmd(
+            query_cap_in, len(QueryCmdHcaCapOut())))
+        if query_cap_out.status:
+            raise PyverbsRDMAError('Failed to query general HCA CAPs with syndrome '
+                                   f'({query_cap_out.syndrome})')
+        port_num = self.ib_port if query_cap_out.capability.num_ports >= 2 else 0
+        query_port_in = QueryHcaVportContextIn(port_num=port_num)
+        query_port_out = QueryHcaVportContextOut(self.ctx.devx_general_cmd(
+            query_port_in, len(QueryHcaVportContextOut())))
+        if query_port_out.status:
+            raise PyverbsRDMAError('Failed to query vport with syndrome '
+                                   f'({query_port_out.syndrome})')
+        self.lid = query_port_out.hca_vport_context.lid
+
+    def query_gid(self):
+        gid = self.ctx.query_gid(self.ib_port, self.gid_index).gid.split(':')
+        for i in range(0, len(gid), 2):
+            self.gid[int(i/2)] = int(gid[i] + gid[i+1], 16)
+
+    def is_eth(self):
+        from tests.mlx5_prm_structs import QueryHcaCapIn, \
+            QueryCmdHcaCapOut
+
+        query_cap_in = QueryHcaCapIn(op_mod=0x1)
+        query_cap_out = QueryCmdHcaCapOut(self.ctx.devx_general_cmd(
+            query_cap_in, len(QueryCmdHcaCapOut())))
+        if query_cap_out.status:
+            raise PyverbsRDMAError('Failed to query general HCA CAPs with syndrome '
+                                   f'({query_cap_out.syndrome})')
+        return query_cap_out.capability.port_type  # 0:IB, 1:ETH
+
+    @staticmethod
+    def roundup_pow_of_two(val):
+        return pow(2, math.ceil(math.log2(val)))
+
+    def create_queue_attrs(self):
+        # RQ calculations
+        wqe_size = WqeDataSeg.sizeof() * self.max_recv_sge
+        self.qattr.rq.wqe_size = self.roundup_pow_of_two(wqe_size)
+        max_recv_wr = self.roundup_pow_of_two(self.num_msgs)
+        self.qattr.rq.wq_size = max(self.qattr.rq.wqe_size * max_recv_wr,
+                                    dve.MLX5_SEND_WQE_BB)
+        self.qattr.rq.wqe_num = math.ceil(self.qattr.rq.wq_size / self.qattr.rq.wqe_size)
+        self.qattr.rq.wqe_shift = int(math.log2(self.qattr.rq.wqe_size - 1)) + 1
+
+        # SQ calculations
+        self.qattr.sq.offset = self.qattr.rq.wq_size
+        # 192 = max overhead size of all structs needed for all operations in RC
+        wqe_size = 192 + WqeDataSeg.sizeof()
+        # Align wqe size to MLX5_SEND_WQE_BB
+        self.qattr.sq.wqe_size = (wqe_size + dve.MLX5_SEND_WQE_BB - 1) & ~(dve.MLX5_SEND_WQE_BB - 1)
+        self.qattr.sq.wq_size = self.roundup_pow_of_two(self.qattr.sq.wqe_size * self.num_msgs)
+        self.qattr.sq.wqe_num = math.ceil(self.qattr.sq.wq_size / dve.MLX5_SEND_WQE_BB)
+        self.qattr.sq.wqe_shift = int(math.log2(dve.MLX5_SEND_WQE_BB))
+
+    def create_context(self):
+        try:
+            attr = Mlx5DVContextAttr(dve.MLX5DV_CONTEXT_FLAGS_DEVX)
+            self.ctx = Mlx5Context(attr, self.dev_name)
+        except PyverbsUserError as ex:
+            raise unittest.SkipTest(f'Could not open mlx5 context ({ex})')
+        except PyverbsRDMAError:
+            raise unittest.SkipTest('Opening mlx5 DevX context is not supported')
+
+    def create_pd(self):
+        self.pd = PD(self.ctx)
+        self.dv_pd = Mlx5DvObj(dve.MLX5DV_OBJ_PD, pd=self.pd).dvpd
+
+    def create_mr(self):
+        access = e.IBV_ACCESS_REMOTE_WRITE | e.IBV_ACCESS_LOCAL_WRITE | \
+                 e.IBV_ACCESS_REMOTE_READ
+        self.mr = MR(self.pd, self.msg_size, access)
+
+    def create_umem(self, size,
+                    access=e.IBV_ACCESS_LOCAL_WRITE,
+                    alignment=resource.getpagesize()):
+        return Mlx5UMEM(self.ctx, size=size, alignment=alignment, access=access)
+
+    def create_uar(self):
+        self.uar['qp'] = Mlx5UAR(self.ctx, dve._MLX5DV_UAR_ALLOC_TYPE_NC)
+        self.uar['cq'] = Mlx5UAR(self.ctx, dve._MLX5DV_UAR_ALLOC_TYPE_NC)
+        if not self.uar['cq'].page_id or not self.uar['qp'].page_id:
+            raise PyverbsRDMAError('Failed to allocate UAR')
+
+    def query_eqn(self):
+        self.eqn = self.ctx.devx_query_eqn(0)
+
+    def create_cq(self):
+        from tests.mlx5_prm_structs import CreateCqIn, SwCqc, CreateCqOut
+
+        cq_size = self.roundup_pow_of_two(self.qattr.cq.cqe_size * self.qattr.cq.ncqes)
+        # Align to page size
+        pg_size = resource.getpagesize()
+        cq_size = (cq_size + pg_size - 1) & ~(pg_size - 1)
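+        # e.g. with the defaults (64B CQEs * 256 entries = 16KB) the size is already 4KB-page aligned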
+        self.umems['cq'] = self.create_umem(size=cq_size)
+        self.umems['cq_dbr'] = self.create_umem(size=8, alignment=8)
+        log_cq_size = math.ceil(math.log2(self.qattr.cq.ncqes))
+        cmd_in = CreateCqIn(cq_umem_valid=1, cq_umem_id=self.umems['cq'].umem_id,
+                            sw_cqc=SwCqc(c_eqn=self.eqn, uar_page=self.uar['cq'].page_id,
+                                         log_cq_size=log_cq_size, dbr_umem_valid=1,
+                                         dbr_umem_id=self.umems['cq_dbr'].umem_id))
+        self.cq = Mlx5DevxObj(self.ctx, cmd_in, len(CreateCqOut()))
+
+    def create_qp(self):
+        from tests.mlx5_prm_structs import SwQpc, CreateQpIn, DevxOps,\
+            CreateQpOut, CreateCqOut
+
+        self.psn = random.getrandbits(24)
+        qp_size = self.roundup_pow_of_two(self.qattr.rq.wq_size + self.qattr.sq.wq_size)
+        # Align to page size
+        pg_size = resource.getpagesize()
+        qp_size = (qp_size + pg_size - 1) & ~(pg_size - 1)
+        self.umems['qp'] = self.create_umem(size=qp_size)
+        self.umems['qp_dbr'] = self.create_umem(size=8, alignment=8)
+        log_rq_size = int(math.log2(self.qattr.rq.wqe_num - 1)) + 1
+        # Size of a receive WQE is 16*pow(2, log_rq_stride)
+        log_rq_stride = self.qattr.rq.wqe_shift - 4
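+        # e.g. with a single SGE: wqe_size=16 -> wqe_shift=4 -> log_rq_stride=0 (16B strides)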
+        log_sq_size = int(math.log2(self.qattr.sq.wqe_num - 1)) + 1
+        cqn = CreateCqOut(self.cq.out_view).cqn
+        qpc = SwQpc(st=DevxOps.MLX5_QPC_ST_RC, pd=self.dv_pd.pdn,
+                    pm_state=DevxOps.MLX5_QPC_PM_STATE_MIGRATED,
+                    log_rq_size=log_rq_size, log_sq_size=log_sq_size, ts_format=0x1,
+                    log_rq_stride=log_rq_stride, uar_page=self.uar['qp'].page_id,
+                    cqn_snd=cqn, cqn_rcv=cqn, dbr_umem_id=self.umems['qp_dbr'].umem_id,
+                    dbr_umem_valid=1)
+        cmd_in = CreateQpIn(sw_qpc=qpc, wq_umem_id=self.umems['qp'].umem_id,
+                            wq_umem_valid=1)
+        self.qp = Mlx5DevxObj(self.ctx, cmd_in, len(CreateQpOut()))
+        self.qpn = CreateQpOut(self.qp.out_view).qpn
+
+    def to_rts(self):
+        """
+        Moves the created QP to RTS state by modifying it using DevX through all
+        the needed states with all the required attributes.
+        rlid, rpsn, rqpn and rgid (when valid) must be already updated before
+        calling this method.
+        """
+        from tests.mlx5_prm_structs import DevxOps, ModifyQpIn, ModifyQpOut,\
+            CreateQpOut, SwQpc
+        cmd_out_len = len(ModifyQpOut())
+
+        # RST2INIT
+        qpn = CreateQpOut(self.qp.out_view).qpn
+        swqpc = SwQpc(rre=1, rwe=1)
+        swqpc.primary_address_path.vhca_port_num = self.ib_port
+        cmd_in = ModifyQpIn(opcode=DevxOps.MLX5_CMD_OP_RST2INIT_QP, qpn=qpn,
+                            sw_qpc=swqpc)
+        self.qp.modify(cmd_in, cmd_out_len)
+
+        # INIT2RTR
+        swqpc = SwQpc(mtu=PATH_MTU, log_msg_max=20, remote_qpn=self.rqpn,
+                      min_rnr_nak=MIN_RNR_TIMER, next_rcv_psn=self.rpsn)
+        swqpc.primary_address_path.vhca_port_num = self.ib_port
+        swqpc.primary_address_path.rlid = self.rlid
+        if self.is_eth():
+            # GID field is a must for Eth (or if GRH is set in IB)
+            swqpc.primary_address_path.rgid_rip = self.rgid
+            swqpc.primary_address_path.rmac = self.rmac
+            swqpc.primary_address_path.src_addr_index = self.gid_index
+            swqpc.primary_address_path.hop_limit = tests.utils.PacketConsts.TTL_HOP_LIMIT
+            # The UDP source port is set only for RoCE v2; it must be reserved for RoCE v1 and v1.5
+            if self.ctx.query_gid_type(self.ib_port, self.gid_index) == e.IBV_GID_TYPE_SYSFS_ROCE_V2:
+                swqpc.primary_address_path.udp_sport = 0xdcba
+        else:
+            swqpc.primary_address_path.rlid = self.rlid
+        cmd_in = ModifyQpIn(opcode=DevxOps.MLX5_CMD_OP_INIT2RTR_QP, qpn=qpn,
+                            sw_qpc=swqpc)
+        self.qp.modify(cmd_in, cmd_out_len)
+
+        # RTR2RTS
+        swqpc = SwQpc(retry_count=RETRY_CNT, rnr_retry=RNR_RETRY,
+                      next_send_psn=self.psn, log_sra_max=MAX_RDMA_ATOMIC)
+        swqpc.primary_address_path.vhca_port_num = self.ib_port
+        swqpc.primary_address_path.ack_timeout = TIMEOUT
+        cmd_in = ModifyQpIn(opcode=DevxOps.MLX5_CMD_OP_RTR2RTS_QP, qpn=qpn,
+                            sw_qpc=swqpc)
+        self.qp.modify(cmd_in, cmd_out_len)
+
+    def pre_run(self, rpsn, rqpn, rgid=0, rlid=0, rmac=0):
+        """
+        Configures the resources before running traffic.
+        :param rpsn: Remote PSN (packet serial number)
+        :param rqpn: Remote QP number
+        :param rgid: Remote GID
+        :param rlid: Remote LID
+        :param rmac: Remote MAC (valid for RoCE)
+        :return: None
+        """
+        self.rpsn = rpsn
+        self.rqpn = rqpn
+        self.rgid = rgid
+        self.rlid = rlid
+        self.rmac = rmac
+        self.to_rts()
+
+    def post_send(self):
+        """
+        Posts one send WQE to the SQ by doing all the required work such as
+        building the control/data segments, updating the doorbell record,
+        ringing the doorbell, updating the producer indexes, etc.
+        """
+        idx = self.qattr.sq.post_idx if self.qattr.sq.post_idx < self.qattr.sq.wqe_num else 0
+        buf_offset = self.qattr.sq.offset + (idx << dve.MLX5_SEND_WQE_SHIFT)
+        # Prepare WQE
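+        # The immediate data is byte-swapped so it is laid out in memory in big-endian (device) byte order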
+        imm_be32 = struct.unpack("<I", struct.pack(">I", self.imm + self.qattr.sq.post_idx))[0]
+        ctrl_seg = WqeCtrlSeg(imm=imm_be32, fm_ce_se=dve.MLX5_WQE_CTRL_CQ_UPDATE)
+        data_seg = WqeDataSeg(self.mr.length, self.mr.lkey, self.mr.buf)
+        ctrl_seg.opmod_idx_opcode = (self.qattr.sq.post_idx & 0xffff) << 8 | dve.MLX5_OPCODE_SEND_IMM
+        size_in_octowords = int((ctrl_seg.sizeof() + data_seg.sizeof()) / 16)
+        ctrl_seg.qpn_ds = self.qpn << 8 | size_in_octowords
+        Wqe([ctrl_seg, data_seg], self.umems['qp'].umem_addr + buf_offset)
+        self.qattr.sq.post_idx += int((size_in_octowords * 16 +
+                                       dve.MLX5_SEND_WQE_BB - 1) / dve.MLX5_SEND_WQE_BB)
+        # Make sure descriptors are written
+        dma.udma_to_dev_barrier()
+        # Update the doorbell record
+        mem.writebe32(self.umems['qp_dbr'].umem_addr,
+                      self.qattr.sq.post_idx & 0xffff, dve.MLX5_SND_DBR)
+        dma.udma_to_dev_barrier()
+        # Ring the doorbell and post the WQE
+        dma.mmio_write64_as_be(self.uar['qp'].reg_addr, mem.read64(ctrl_seg.addr))
+
+    def post_recv(self):
+        """
+        Posts one receive WQE to the RQ by doing all the required work such as
+        building the data segment, updating the dbr and the producer
+        indexes.
+        """
+        buf_offset = self.qattr.rq.offset + self.qattr.rq.wqe_size * self.qattr.rq.head
+        # Prepare WQE
+        data_seg = WqeDataSeg(self.mr.length, self.mr.lkey, self.mr.buf)
+        Wqe([data_seg], self.umems['qp'].umem_addr + buf_offset)
+        # Update indexes
+        self.qattr.rq.post_idx += 1
+        self.qattr.rq.head = self.qattr.rq.head + 1 if self.qattr.rq.head + 1 < self.qattr.rq.wqe_num else 0
+        # Update the doorbell record
+        dma.udma_to_dev_barrier()
+        mem.writebe32(self.umems['qp_dbr'].umem_addr,
+                      self.qattr.rq.post_idx & 0xffff, dve.MLX5_RCV_DBR)
+
+    def poll_cq(self):
+        """
+        Polls the CQ once and updates the consumer index upon success.
+        The CQE opcode and owner bit are checked and verified.
+        This method does busy-waiting as long as it gets an empty CQE, until a
+        timeout of POLL_CQ_TIMEOUT seconds.
+        """
+        idx = self.qattr.cq.cons_idx % self.qattr.cq.ncqes
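+        # The expected owner bit flips every time the consumer index wraps around the CQ size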
+        cq_owner_flip = bool(self.qattr.cq.cons_idx & self.qattr.cq.ncqes)
+        cqe_start_addr = self.umems['cq'].umem_addr + (idx * self.qattr.cq.cqe_size)
+        cqe = None
+        start_poll_t = time.perf_counter()
+        while cqe is None:
+            cqe = Mlx5Cqe64(cqe_start_addr)
+            if (cqe.opcode == dve.MLX5_CQE_INVALID) or \
+                    (cqe.owner ^ cq_owner_flip) or cqe.is_empty():
+                if time.perf_counter() - start_poll_t >= POLL_CQ_TIMEOUT:
+                    raise PyverbsRDMAError(f'CQE #{self.qattr.cq.cons_idx} '
+                                           f'is empty or invalid:\n{cqe.dump()}')
+                cqe = None
+
+        # After the CQE ownership check, a memory barrier is required before re-reading the CQE.
+        dma.udma_from_dev_barrier()
+        cqe = Mlx5Cqe64(cqe_start_addr)
+
+        if cqe.opcode == dve.MLX5_CQE_RESP_ERR:
+            raise PyverbsRDMAError(f'Got a CQE #{self.qattr.cq.cons_idx} '
+                                   f'with responder error:\n{cqe.dump()}')
+        elif cqe.opcode == dve.MLX5_CQE_REQ_ERR:
+            raise PyverbsRDMAError(f'Got a CQE #{self.qattr.cq.cons_idx} '
+                                   f'with requester error:\n{cqe.dump()}')
+
+        self.qattr.cq.cons_idx += 1
+        mem.writebe32(self.umems['cq_dbr'].umem_addr,
+                      self.qattr.cq.cons_idx & 0xffffff, MLX5_CQ_SET_CI)
+        return cqe
+
+    def close_resources(self):
+        for obj in self.devx_objs:
+            if obj:
+                obj.close()
+
+
+class Mlx5DevxTrafficBase(Mlx5RDMATestCase):
+    """
+    A base class for mlx5 DevX traffic tests.
+    This class does not include any tests; it provides quick creation of the
+    players (client, server) and a traffic method.
+    """
+    def tearDown(self):
+        if self.server:
+            self.server.close_resources()
+        if self.client:
+            self.client.close_resources()
+        super().tearDown()
+
+    def create_players(self, resources, **resource_arg):
+        """
+        Initializes the test resources.
+        :param resources: The RDMA resources to use.
+        :param resource_arg: Dictionary of args that specify the resources'
+                             specific attributes.
+        :return: None
+        """
+        self.server = resources(**self.dev_info, **resource_arg)
+        self.client = resources(**self.dev_info, **resource_arg)
+        self.pre_run()
+
+    def pre_run(self):
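+        # Exchange the remote attributes (PSN, QPN, GID/LID, MAC) between the players before moving to RTS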
+        self.server.pre_run(self.client.psn, self.client.qpn, self.client.gid,
+                            self.client.lid, self.mac_addr)
+        self.client.pre_run(self.server.psn, self.server.qpn, self.server.gid,
+                            self.server.lid, self.mac_addr)
+
+    def send_imm_traffic(self):
+        self.client.mr.write('c' * self.client.msg_size, self.client.msg_size)
+        for _ in range(self.client.num_msgs):
+            cons_idx = self.client.qattr.cq.cons_idx
+            self.server.post_recv()
+            self.client.post_send()
+            # Poll client and verify received cqe opcode
+            send_cqe = self.client.poll_cq()
+            self.assertEqual(send_cqe.opcode, dve.MLX5_CQE_REQ,
+                             'Unexpected CQE opcode')
+            # Poll server and verify received cqe opcode
+            recv_cqe = self.server.poll_cq()
+            self.assertEqual(recv_cqe.opcode, dve.MLX5_CQE_RESP_SEND_IMM,
+                             'Unexpected CQE opcode')
+            msg_received = self.server.mr.read(self.server.msg_size, 0)
+            # Validate data (of received message and immediate value)
+            tests.utils.validate(msg_received, True, self.server.msg_size)
+            self.assertEqual(recv_cqe.imm_inval_pkey,
+                             self.client.imm + cons_idx)
+            self.server.mr.write('s' * self.server.msg_size,
+                                 self.server.msg_size)
diff --git a/tests/mlx5_prm_structs.py b/tests/mlx5_prm_structs.py
new file mode 100644
index 0000000..1999a3b
--- /dev/null
+++ b/tests/mlx5_prm_structs.py
@@ -0,0 +1,1046 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia Inc. All rights reserved. See COPYING file
+
+"""
+This module provides scapy-based classes that represent the mlx5 PRM structs.
+"""
+import unittest
+
+try:
+    import logging
+    logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
+    from scapy.packet import Packet
+    from scapy.fields import BitField, ByteField, IntField, \
+        ShortField, LongField, StrFixedLenField, PacketField, \
+        PacketListField, ConditionalField, PadField, FieldListField, MACField
+except ImportError:
+    raise unittest.SkipTest('scapy package is needed in order to run DevX tests')
+
+
+class DevxOps:
+    MLX5_CMD_OP_ALLOC_PD = 0x800
+    MLX5_CMD_OP_CREATE_CQ = 0x400
+    MLX5_CMD_OP_QUERY_CQ = 0x402
+    MLX5_CMD_OP_MODIFY_CQ = 0x403
+    MLX5_CMD_OP_CREATE_QP = 0x500
+    MLX5_CMD_OP_QUERY_QP = 0x50b
+    MLX5_CMD_OP_RST2INIT_QP = 0x502
+    MLX5_CMD_OP_INIT2RTR_QP = 0x503
+    MLX5_CMD_OP_RTR2RTS_QP = 0x504
+    MLX5_CMD_OP_RTS2RTS_QP = 0x505
+    MLX5_CMD_OP_QUERY_HCA_VPORT_CONTEXT = 0x762
+    MLX5_CMD_OP_QUERY_HCA_VPORT_GID = 0x764
+    MLX5_QPC_ST_RC = 0X0
+    MLX5_QPC_PM_STATE_MIGRATED = 0x3
+    MLX5_CMD_OP_QUERY_HCA_CAP = 0x100
+
+
+# Common
+class SwPas(Packet):
+    fields_desc = [
+        IntField('pa_h', 0),
+        BitField('pa_l', 0, 20),
+        BitField('reserved1', 0, 12),
+    ]
+
+
+# PD
+class AllocPdIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_ALLOC_PD),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        StrFixedLenField('reserved2', None, length=8),
+    ]
+
+
+class AllocPdOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        ByteField('reserved2', 0),
+        BitField('pd', 0, 24),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+# CQ
+class CmdInputFieldSelectResizeCq(Packet):
+    fields_desc = [
+        BitField('reserved1', 0, 28),
+        BitField('umem', 0, 1),
+        BitField('log_page_size', 0, 1),
+        BitField('page_offset', 0, 1),
+        BitField('log_cq_size', 0, 1),
+    ]
+
+
+class CmdInputFieldSelectModifyCqFields(Packet):
+    fields_desc = [
+        BitField('reserved_0', 0, 26),
+        BitField('status', 0, 1),
+        BitField('cq_period_mode', 0, 1),
+        BitField('c_eqn', 0, 1),
+        BitField('oi', 0, 1),
+        BitField('cq_max_count', 0, 1),
+        BitField('cq_period', 0, 1),
+    ]
+
+
+class SwCqc(Packet):
+    fields_desc = [
+        BitField('status', 0, 4),
+        BitField('as_notify', 0, 1),
+        BitField('initiator_src_dct', 0, 1),
+        BitField('dbr_umem_valid', 0, 1),
+        BitField('reserved1', 0, 1),
+        BitField('cqe_sz', 0, 3),
+        BitField('cc', 0, 1),
+        BitField('reserved2', 0, 1),
+        BitField('scqe_break_moderation_en', 0, 1),
+        BitField('oi', 0, 1),
+        BitField('cq_period_mode', 0, 2),
+        BitField('cqe_compression_en', 0, 1),
+        BitField('mini_cqe_res_format', 0, 2),
+        BitField('st', 0, 4),
+        ByteField('reserved3', 0),
+        IntField('dbr_umem_id', 0),
+        BitField('reserved4', 0, 20),
+        BitField('page_offset', 0, 6),
+        BitField('reserved5', 0, 6),
+        BitField('reserved6', 0, 3),
+        BitField('log_cq_size', 0, 5),
+        BitField('uar_page', 0, 24),
+        BitField('reserved7', 0, 4),
+        BitField('cq_period', 0, 12),
+        ShortField('cq_max_count', 0),
+        BitField('reserved8', 0, 24),
+        ByteField('c_eqn', 0),
+        BitField('reserved9', 0, 3),
+        BitField('log_page_size', 0, 5),
+        BitField('reserved10', 0, 24),
+        StrFixedLenField('reserved11', None, length=4),
+        ByteField('reserved12', 0),
+        BitField('last_notified_index', 0, 24),
+        ByteField('reserved13', 0),
+        BitField('last_solicit_index', 0, 24),
+        ByteField('reserved14', 0),
+        BitField('consumer_counter', 0, 24),
+        ByteField('reserved15', 0),
+        BitField('producer_counter', 0, 24),
+        BitField('local_partition_id', 0, 12),
+        BitField('process_id', 0, 20),
+        ShortField('reserved16', 0),
+        ShortField('thread_id', 0),
+        IntField('db_record_addr_63_32', 0),
+        BitField('db_record_addr_31_3', 0, 29),
+        BitField('reserved17', 0, 3),
+    ]
+
+
+class CreateCqIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_CREATE_CQ),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        ByteField('reserved2', 0),
+        BitField('cqn', 0, 24),
+        StrFixedLenField('reserved3', None, length=4),
+        PacketField('sw_cqc', SwCqc(), SwCqc),
+        LongField('e_mtt_pointer_or_cq_umem_offset', 0),
+        IntField('cq_umem_id', 0),
+        BitField('cq_umem_valid', 0, 1),
+        BitField('reserved4', 0, 31),
+        StrFixedLenField('reserved5', None, length=176),
+        PacketListField('pas', [SwPas() for x in range(0)], SwPas, count_from=lambda pkt: 0),
+    ]
+
+
+class CreateCqOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        ByteField('reserved2', 0),
+        BitField('cqn', 0, 24),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+# QP
+class SwAds(Packet):
+    fields_desc = [
+        BitField('fl', 0, 1),
+        BitField('free_ar', 0, 1),
+        BitField('reserved1', 0, 14),
+        ShortField('pkey_index', 0),
+        ByteField('reserved2', 0),
+        BitField('grh', 0, 1),
+        BitField('mlid', 0, 7),
+        ShortField('rlid', 0),
+        BitField('ack_timeout', 0, 5),
+        BitField('reserved3', 0, 3),
+        ByteField('src_addr_index', 0),
+        BitField('log_rtm', 0, 4),
+        BitField('stat_rate', 0, 4),
+        ByteField('hop_limit', 0),
+        BitField('reserved4', 0, 4),
+        BitField('tclass', 0, 8),
+        BitField('flow_label', 0, 20),
+        FieldListField('rgid_rip', [0 for x in range(4)], IntField('', 0),
+                       count_from=lambda pkt: 4),
+        BitField('reserved5', 0, 4),
+        BitField('f_dscp', 0, 1),
+        BitField('f_ecn', 0, 1),
+        BitField('reserved6', 0, 1),
+        BitField('f_eth_prio', 0, 1),
+        BitField('ecn', 0, 2),
+        BitField('dscp', 0, 6),
+        ShortField('udp_sport', 0),
+        BitField('dei_cfi_reserved_from_prm_041', 0, 1),
+        BitField('eth_prio', 0, 3),
+        BitField('sl', 0, 4),
+        ByteField('vhca_port_num', 0),
+        MACField('rmac', '00:00:00:00:00:00'),
+
+    ]
+
+
+class SwQpc(Packet):
+    fields_desc = [
+        BitField('state', 0, 4),
+        BitField('lag_tx_port_affinity', 0, 4),
+        ByteField('st', 0),
+        BitField('reserved1', 0, 3),
+        BitField('pm_state', 0, 2),
+        BitField('reserved2', 0, 1),
+        BitField('req_e2e_credit_mode', 0, 2),
+        BitField('offload_type', 0, 4),
+        BitField('end_padding_mode', 0, 2),
+        BitField('reserved3', 0, 2),
+        BitField('wq_signature', 0, 1),
+        BitField('block_lb_mc', 0, 1),
+        BitField('atomic_like_write_en', 0, 1),
+        BitField('latency_sensitive', 0, 1),
+        BitField('dual_write', 0, 1),
+        BitField('drain_sigerr', 0, 1),
+        BitField('multi_path', 0, 1),
+        BitField('reserved4', 0, 1),
+        BitField('pd', 0, 24),
+        BitField('mtu', 0, 3),
+        BitField('log_msg_max', 0, 5),
+        BitField('reserved5', 0, 1),
+        BitField('log_rq_size', 0, 4),
+        BitField('log_rq_stride', 0, 3),
+        BitField('no_sq', 0, 1),
+        BitField('log_sq_size', 0, 4),
+        BitField('reserved6', 0, 1),
+        BitField('retry_mode', 0, 2),
+        BitField('ts_format', 0, 2),
+        BitField('data_in_order', 0, 1),
+        BitField('rlkey', 0, 1),
+        BitField('ulp_stateless_offload_mode', 0, 4),
+        ByteField('counter_set_id', 0),
+        BitField('uar_page', 0, 24),
+        BitField('reserved7', 0, 3),
+        BitField('full_handshake', 0, 1),
+        BitField('cnak_reverse_sl', 0, 4),
+        BitField('user_index', 0, 24),
+        BitField('reserved8', 0, 3),
+        BitField('log_page_size', 0, 5),
+        BitField('remote_qpn', 0, 24),
+        PacketField('primary_address_path', SwAds(), SwAds),
+        PacketField('secondary_address_path', SwAds(), SwAds),
+        BitField('log_ack_req_freq', 0, 4),
+        BitField('reserved9', 0, 4),
+        BitField('log_sra_max', 0, 3),
+        BitField('extended_rnr_retry_valid', 0, 1),
+        BitField('reserved10', 0, 1),
+        BitField('retry_count', 0, 3),
+        BitField('rnr_retry', 0, 3),
+        BitField('extended_retry_count_valid', 0, 1),
+        BitField('fre', 0, 1),
+        BitField('cur_rnr_retry', 0, 3),
+        BitField('cur_retry_count', 0, 3),
+        BitField('extended_log_rnr_retry', 0, 5),
+        ShortField('extended_cur_rnr_retry', 0),
+        ShortField('packet_pacing_rate_limit_index', 0),
+        ByteField('reserved11', 0),
+        BitField('next_send_psn', 0, 24),
+        ByteField('reserved12', 0),
+        BitField('cqn_snd', 0, 24),
+        ByteField('reserved13', 0),
+        BitField('deth_sqpn', 0, 24),
+        ByteField('reserved14', 0),
+        ByteField('extended_retry_count', 0),
+        ByteField('reserved15', 0),
+        ByteField('extended_cur_retry_count', 0),
+        ByteField('reserved16', 0),
+        BitField('last_acked_psn', 0, 24),
+        ByteField('reserved17', 0),
+        BitField('ssn', 0, 24),
+        ByteField('reserved18', 0),
+        BitField('log_rra_max', 0, 3),
+        BitField('reserved19', 0, 1),
+        BitField('atomic_mode', 0, 4),
+        BitField('rre', 0, 1),
+        BitField('rwe', 0, 1),
+        BitField('rae', 0, 1),
+        BitField('reserved20', 0, 1),
+        BitField('page_offset', 0, 6),
+        BitField('reserved21', 0, 3),
+        BitField('cd_slave_receive', 0, 1),
+        BitField('cd_slave_send', 0, 1),
+        BitField('cd_master', 0, 1),
+        BitField('reserved22', 0, 3),
+        BitField('min_rnr_nak', 0, 5),
+        BitField('next_rcv_psn', 0, 24),
+        ByteField('reserved23', 0),
+        BitField('xrcd', 0, 24),
+        ByteField('reserved24', 0),
+        BitField('cqn_rcv', 0, 24),
+        LongField('dbr_addr', 0),
+        IntField('q_key', 0),
+        BitField('reserved25', 0, 5),
+        BitField('rq_type', 0, 3),
+        BitField('srqn_rmpn_xrqn', 0, 24),
+        ByteField('reserved26', 0),
+        BitField('rmsn', 0, 24),
+        ShortField('hw_sq_wqebb_counter', 0),
+        ShortField('sw_sq_wqebb_counter', 0),
+        IntField('hw_rq_counter', 0),
+        IntField('sw_rq_counter', 0),
+        ByteField('reserved27', 0),
+        BitField('roce_adp_retrans_rtt', 0, 24),
+        BitField('reserved28', 0, 15),
+        BitField('cgs', 0, 1),
+        ByteField('cs_req', 0),
+        ByteField('cs_res', 0),
+        LongField('dc_access_key', 0),
+        BitField('rdma_active', 0, 1),
+        BitField('comm_est', 0, 1),
+        BitField('suspended', 0, 1),
+        BitField('dbr_umem_valid', 0, 1),
+        BitField('reserved29', 0, 4),
+        BitField('send_msg_psn', 0, 24),
+        ByteField('reserved30', 0),
+        BitField('rcv_msg_psn', 0, 24),
+        LongField('rdma_va', 0),
+        IntField('rdma_key', 0),
+        IntField('dbr_umem_id', 0),
+    ]
+
+
+class CreateQpIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_CREATE_QP),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        ByteField('reserved2', 0),
+        BitField('input_qpn', 0, 24),
+        BitField('reserved3', 0, 1),
+        BitField('cmd_on_behalf', 0, 1),
+        BitField('reserved4', 0, 14),
+        ShortField('vhca_id', 0),
+        IntField('opt_param_mask', 0),
+        StrFixedLenField('reserved5', None, length=4),
+        PacketField('sw_qpc', SwQpc(), SwQpc),
+        LongField('e_mtt_pointer_or_wq_umem_offset', 0),
+        IntField('wq_umem_id', 0),
+        BitField('wq_umem_valid', 0, 1),
+        BitField('reserved6', 0, 31),
+        PacketListField('pas', [SwPas() for x in range(0)], SwPas,
+                        count_from=lambda pkt: 0),
+    ]
+
+
+class CreateQpOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        ByteField('reserved2', 0),
+        BitField('qpn', 0, 24),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+class ModifyQpIn(Packet):
+    fields_desc = [
+        ShortField('opcode', 0),
+        ShortField('uid', 0),
+        ShortField('vhca_tunnel_id', 0),
+        ShortField('op_mod', 0),
+        ByteField('reserved2', 0),
+        BitField('qpn', 0, 24),
+        IntField('reserved3', 0),
+        IntField('opt_param_mask', 0),
+        IntField('ece', 0),
+        PacketField('sw_qpc', SwQpc(), SwQpc),
+        StrFixedLenField('reserved4', None, length=16),
+    ]
+
+
+class ModifyQpOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        StrFixedLenField('reserved2', None, length=8),
+    ]
+
+
+class QueryQpIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_QUERY_QP),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        ByteField('reserved2', 0),
+        BitField('qpn', 0, 24),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+class QueryQpOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        StrFixedLenField('reserved2', None, length=8),
+        IntField('opt_param_mask', 0),
+        StrFixedLenField('reserved3', None, length=4),
+        PacketField('sw_qpc', SwQpc(), SwQpc),
+        LongField('e_mtt_pointer', 0),
+        StrFixedLenField('reserved4', None, length=8),
+        PacketListField('pas', [SwPas() for x in range(0)], SwPas,
+                        count_from=lambda pkt: 0),
+    ]
+
+
+# Query HCA VPORT Context
+class QueryHcaVportContextIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_QUERY_HCA_VPORT_CONTEXT),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        BitField('other_vport', 0, 1),
+        BitField('reserved2', 0, 11),
+        BitField('port_num', 0, 4),
+        ShortField('vport_number', 0),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+class HcaVportContext(Packet):
+    fields_desc = [
+        IntField('field_select', 0),
+        StrFixedLenField('reserved1', None, length=28),
+        BitField('sm_virt_aware', 0, 1),
+        BitField('has_smi', 0, 1),
+        BitField('has_raw', 0, 1),
+        BitField('grh_required', 0, 1),
+        BitField('reserved2', 0, 1),
+        BitField('min_wqe_inline_mode', 0, 3),
+        ByteField('reserved3', 0),
+        BitField('port_physical_state', 0, 4),
+        BitField('vport_state_policy', 0, 4),
+        BitField('port_state', 0, 4),
+        BitField('vport_state', 0, 4),
+        StrFixedLenField('reserved4', None, length=4),
+        LongField('system_image_guid', 0),
+        LongField('port_guid', 0),
+        LongField('node_guid', 0),
+        IntField('cap_mask1', 0),
+        IntField('cap_mask1_field_select', 0),
+        IntField('cap_mask2', 0),
+        IntField('cap_mask2_field_select', 0),
+        ShortField('reserved5', 0),
+        ShortField('ooo_sl_mask', 0),
+        StrFixedLenField('reserved6', None, length=12),
+        ShortField('lid', 0),
+        BitField('reserved7', 0, 4),
+        BitField('init_type_reply', 0, 4),
+        BitField('lmc', 0, 3),
+        BitField('subnet_timeout', 0, 5),
+        ShortField('sm_lid', 0),
+        BitField('sm_sl', 0, 4),
+        BitField('reserved8', 0, 12),
+        ShortField('qkey_violation_counter', 0),
+        ShortField('pkey_violation_counter', 0),
+        StrFixedLenField('reserved9', None, length=404),
+    ]
+
+
+class QueryHcaVportContextOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        StrFixedLenField('reserved2', None, length=8),
+        PacketField('hca_vport_context', HcaVportContext(), HcaVportContext),
+    ]
+
+
+# Query HCA VPORT GID
+class QueryHcaVportGidIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_QUERY_HCA_VPORT_GID),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        BitField('other_vport', 0, 1),
+        BitField('reserved2', 0, 11),
+        BitField('port_num', 0, 4),
+        ShortField('vport_number', 0),
+        ShortField('reserved3', 0),
+        ShortField('gid_index', 0),
+    ]
+
+
+class IbGidCmd(Packet):
+    fields_desc = [
+        LongField('prefix', 0),
+        LongField('guid', 0),
+    ]
+
+
+class QueryHcaVportGidOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        StrFixedLenField('reserved2', None, length=4),
+        ShortField('gids_num', 0),
+        ShortField('reserved3', 0),
+        PacketField('gid0', IbGidCmd(), IbGidCmd),
+    ]
+
+
+# Query HCA CAP
+class QueryHcaCapIn(Packet):
+    fields_desc = [
+        ShortField('opcode', DevxOps.MLX5_CMD_OP_QUERY_HCA_CAP),
+        ShortField('uid', 0),
+        ShortField('reserved1', 0),
+        ShortField('op_mod', 0),
+        BitField('other_function', 0, 1),
+        BitField('reserved2', 0, 15),
+        ShortField('function_id', 0),
+        StrFixedLenField('reserved3', None, length=4),
+    ]
+
+
+class CmdHcaCap(Packet):
+    fields_desc = [
+        BitField('access_other_hca_roce', 0, 1),
+        BitField('reserved1', 0, 30),
+        BitField('vhca_resource_manager', 0, 1),
+        BitField('hca_cap_2', 0, 1),
+        BitField('reserved2', 0, 2),
+        BitField('event_on_vhca_state_teardown_request', 0, 1),
+        BitField('event_on_vhca_state_in_use', 0, 1),
+        BitField('event_on_vhca_state_active', 0, 1),
+        BitField('event_on_vhca_state_allocated', 0, 1),
+        BitField('event_on_vhca_state_invalid', 0, 1),
+        ByteField('transpose_max_element_size', 0),
+        ShortField('vhca_id', 0),
+        ByteField('transpose_max_cols', 0),
+        ByteField('transpose_max_rows', 0),
+        ShortField('transpose_max_size', 0),
+        BitField('reserved3', 0, 1),
+        BitField('sw_steering_icm_large_scale_steering', 0, 1),
+        BitField('qp_data_in_order', 0, 1),
+        BitField('log_regexp_scatter_gather_size', 0, 5),
+        BitField('reserved4', 0, 3),
+        BitField('log_dma_mmo_max_size', 0, 5),
+        BitField('relaxed_ordering_write_pci_enabled', 0, 1),
+        BitField('reserved5', 0, 2),
+        BitField('log_compress_max_size', 0, 5),
+        BitField('reserved6', 0, 3),
+        BitField('log_decompress_max_size', 0, 5),
+        ByteField('log_max_srq_sz', 0),
+        ByteField('log_max_qp_sz', 0),
+        BitField('event_cap', 0, 1),
+        BitField('reserved7', 0, 2),
+        BitField('isolate_vl_tc_new', 0, 1),
+        BitField('reserved8', 0, 2),
+        BitField('nvmeotcp', 0, 1),
+        BitField('pcie_hanged', 0, 1),
+        BitField('prio_tag_required', 0, 1),
+        BitField('wqe_index_ignore_cap', 0, 1),
+        BitField('reserved9', 0, 1),
+        BitField('log_max_qp', 0, 5),
+        BitField('regexp', 0, 1),
+        BitField('regexp_params', 0, 1),
+        BitField('regexp_alloc_onbehalf_umem', 0, 1),
+        BitField('ece', 0, 1),
+        BitField('regexp_num_of_engines', 0, 4),
+        BitField('allow_pause_tx', 0, 1),
+        BitField('reg_c_preserve', 0, 1),
+        BitField('isolate_vl_tc', 0, 1),
+        BitField('log_max_srqs', 0, 5),
+        BitField('psp', 0, 1),
+        BitField('reserved10', 0, 1),
+        BitField('ts_cqe_to_dest_cqn', 0, 1),
+        BitField('regexp_log_crspace_size', 0, 5),
+        BitField('selective_repeat', 0, 1),
+        BitField('go_back_n', 0, 1),
+        BitField('reserved11', 0, 1),
+        BitField('scatter_fcs_w_decap_disable', 0, 1),
+        BitField('reserved12', 0, 4),
+        ByteField('max_sgl_for_optimized_performance', 0),
+        ByteField('log_max_cq_sz', 0),
+        BitField('relaxed_ordering_write_umr', 0, 1),
+        BitField('relaxed_ordering_read_umr', 0, 1),
+        BitField('access_register_user', 0, 1),
+        BitField('reserved13', 0, 5),
+        BitField('upt_device_emulation_manager', 0, 1),
+        BitField('virtio_net_device_emulation_manager', 0, 1),
+        BitField('virtio_blk_device_emulation_manager', 0, 1),
+        BitField('log_max_cq', 0, 5),
+        ByteField('log_max_eq_sz', 0),
+        BitField('relaxed_ordering_write', 0, 1),
+        BitField('relaxed_ordering_read', 0, 1),
+        BitField('log_max_mkey', 0, 6),
+        BitField('tunneled_atomic', 0, 1),
+        BitField('as_notify', 0, 1),
+        BitField('m_pci_port', 0, 1),
+        BitField('m_vhca_mk', 0, 1),
+        BitField('hotplug_manager', 0, 1),
+        BitField('nvme_device_emulation_manager', 0, 1),
+        BitField('terminate_scatter_list_mkey', 0, 1),
+        BitField('repeated_mkey', 0, 1),
+        BitField('dump_fill_mkey', 0, 1),
+        BitField('dpp', 0, 1),
+        BitField('resources_on_nvme_emulation_manager', 0, 1),
+        BitField('fast_teardown', 0, 1),
+        BitField('log_max_eq', 0, 4),
+        ByteField('max_indirection', 0),
+        BitField('fixed_buffer_size', 0, 1),
+        BitField('log_max_mrw_sz', 0, 7),
+        BitField('force_teardown', 0, 1),
+        BitField('prepare_fast_teardown_allways_1', 0, 1),
+        BitField('log_max_bsf_list_size', 0, 6),
+        BitField('umr_extended_translation_offset', 0, 1),
+        BitField('null_mkey', 0, 1),
+        BitField('log_max_klm_list_size', 0, 6),
+        BitField('non_wire_sq', 0, 1),
+        BitField('ats_ro_dependence', 0, 1),
+        BitField('qp_context_extension', 0, 1),
+        BitField('log_max_static_sq_wq_size', 0, 5),
+        BitField('resources_on_virtio_net_emulation_manager', 0, 1),
+        BitField('resources_on_virtio_blk_emulation_manager', 0, 1),
+        BitField('log_max_ra_req_dc', 0, 6),
+        BitField('vhca_trust_level_reg', 0, 1),
+        BitField('eth_wqe_too_small_mode', 0, 1),
+        BitField('vnic_env_eth_wqe_too_small', 0, 1),
+        BitField('log_max_static_sq_wq', 0, 5),
+        BitField('ooo_sl_mask', 0, 1),
+        BitField('vnic_env_cq_overrun', 0, 1),
+        BitField('log_max_ra_res_dc', 0, 6),
+        BitField('cc_roce_ecn_rp_classify_mode', 0, 1),
+        BitField('cc_roce_ecn_rp_dynamic_rtt', 0, 1),
+        BitField('cc_roce_ecn_rp_dynamic_ai', 0, 1),
+        BitField('cc_roce_ecn_rp_dynamic_g', 0, 1),
+        BitField('cc_roce_ecn_rp_burst_decouple', 0, 1),
+        BitField('release_all_pages', 0, 1),
+        BitField('depracated_do_not_use', 0, 1),
+        BitField('sig_crc64_xp10', 0, 1),
+        BitField('sig_crc32c', 0, 1),
+        BitField('roce_accl', 0, 1),
+        BitField('log_max_ra_req_qp', 0, 6),
+        BitField('reserved14', 0, 1),
+        BitField('rts2rts_udp_sport', 0, 1),
+        BitField('rts2rts_lag_tx_port_affinity', 0, 1),
+        BitField('dma_mmo', 0, 1),
+        BitField('compress_min_block_size', 0, 4),
+        BitField('compress', 0, 1),
+        BitField('decompress', 0, 1),
+        BitField('log_max_ra_res_qp', 0, 6),
+        BitField('end_pad', 0, 1),
+        BitField('cc_query_allowed', 0, 1),
+        BitField('cc_modify_allowed', 0, 1),
+        BitField('start_pad', 0, 1),
+        BitField('cache_line_128byte', 0, 1),
+        BitField('gid_table_size_ro', 0, 1),
+        BitField('pkey_table_size_ro', 0, 1),
+        BitField('rts2rts_qp_rmp', 0, 1),
+        BitField('rnr_nak_q_counters', 0, 1),
+        BitField('rts2rts_qp_counters_set_id', 0, 1),
+        BitField('rts2rts_qp_dscp', 0, 1),
+        BitField('gen3_cc_negotiation', 0, 1),
+        BitField('vnic_env_int_rq_oob', 0, 1),
+        BitField('sbcam_reg', 0, 1),
+        BitField('cwcam_reg', 0, 1),
+        BitField('qcam_reg', 0, 1),
+        ShortField('gid_table_size', 0),
+        BitField('out_of_seq_cnt', 0, 1),
+        BitField('vport_counters', 0, 1),
+        BitField('retransmission_q_counters', 0, 1),
+        BitField('debug', 0, 1),
+        BitField('modify_rq_counters_set_id', 0, 1),
+        BitField('rq_delay_drop', 0, 1),
+        BitField('max_qp_cnt', 0, 10),
+        ShortField('pkey_table_size', 0),
+        BitField('vport_group_manager', 0, 1),
+        BitField('vhca_group_manager', 0, 1),
+        BitField('ib_virt', 0, 1),
+        BitField('eth_virt', 0, 1),
+        BitField('vnic_env_queue_counters', 0, 1),
+        BitField('ets', 0, 1),
+        BitField('nic_flow_table', 0, 1),
+        BitField('eswitch_manager', 0, 1),
+        BitField('device_memory', 0, 1),
+        BitField('mcam_reg', 0, 1),
+        BitField('pcam_reg', 0, 1),
+        BitField('local_ca_ack_delay', 0, 5),
+        BitField('port_module_event', 0, 1),
+        BitField('enhanced_retransmission_q_counters', 0, 1),
+        BitField('port_checks', 0, 1),
+        BitField('pulse_gen_control', 0, 1),
+        BitField('disable_link_up_by_init_hca', 0, 1),
+        BitField('beacon_led', 0, 1),
+        BitField('port_type', 0, 2),
+        ByteField('num_ports', 0),
+        BitField('snapshot', 0, 1),
+        BitField('pps', 0, 1),
+        BitField('pps_modify', 0, 1),
+        BitField('log_max_msg', 0, 5),
+        BitField('multi_path_xrc_rdma', 0, 1),
+        BitField('multi_path_dc_rdma', 0, 1),
+        BitField('multi_path_rc_rdma', 0, 1),
+        BitField('traffic_fast_control', 0, 1),
+        BitField('max_tc', 0, 4),
+        BitField('temp_warn_event', 0, 1),
+        BitField('dcbx', 0, 1),
+        BitField('general_notification_event', 0, 1),
+        BitField('multi_prio_sq', 0, 1),
+        BitField('afu_owner', 0, 1),
+        BitField('fpga', 0, 1),
+        BitField('rol_s', 0, 1),
+        BitField('rol_g', 0, 1),
+        BitField('ib_port_sniffer', 0, 1),
+        BitField('wol_s', 0, 1),
+        BitField('wol_g', 0, 1),
+        BitField('wol_a', 0, 1),
+        BitField('wol_b', 0, 1),
+        BitField('wol_m', 0, 1),
+        BitField('wol_u', 0, 1),
+        BitField('wol_p', 0, 1),
+        ShortField('stat_rate_support', 0),
+        BitField('sig_block_4048', 0, 1),
+        BitField('pci_sync_for_fw_update_event', 0, 1),
+        BitField('init2rtr_drain_sigerr', 0, 1),
+        BitField('log_max_extended_rnr_retry', 0, 5),
+        BitField('init2_lag_tx_port_affinity', 0, 1),
+        BitField('flow_group_type_hash_split', 0, 1),
+        BitField('reserved15', 0, 1),
+        BitField('wqe_based_flow_table_update', 0, 1),
+        BitField('cqe_version', 0, 4),
+        BitField('compact_address_vector', 0, 1),
+        BitField('eth_striding_wq', 0, 1),
+        BitField('reserved16', 0, 1),
+        BitField('ipoib_enhanced_offloads', 0, 1),
+        BitField('ipoib_basic_offloads', 0, 1),
+        BitField('ib_link_list_striding_wq', 0, 1),
+        BitField('repeated_block_disabled', 0, 1),
+        BitField('umr_modify_entity_size_disabled', 0, 1),
+        BitField('umr_modify_atomic_disabled', 0, 1),
+        BitField('umr_indirect_mkey_disabled', 0, 1),
+        BitField('umr_fence', 0, 2),
+        BitField('dc_req_sctr_data_cqe', 0, 1),
+        BitField('dc_connect_qp', 0, 1),
+        BitField('dc_cnak_trace', 0, 1),
+        BitField('drain_sigerr', 0, 1),
+        BitField('cmdif_checksum', 0, 2),
+        BitField('sigerr_cqe', 0, 1),
+        BitField('e_psv', 0, 1),
+        BitField('wq_signature', 0, 1),
+        BitField('sctr_data_cqe', 0, 1),
+        BitField('bsf_in_create_mkey', 0, 1),
+        BitField('sho', 0, 1),
+        BitField('tph', 0, 1),
+        BitField('rf', 0, 1),
+        BitField('dct', 0, 1),
+        BitField('qos', 0, 1),
+        BitField('eth_net_offloads', 0, 1),
+        BitField('roce', 0, 1),
+        BitField('atomic', 0, 1),
+        BitField('extended_retry_count', 0, 1),
+        BitField('cq_oi', 0, 1),
+        BitField('cq_resize', 0, 1),
+        BitField('cq_moderation', 0, 1),
+        BitField('cq_period_mode_modify', 0, 1),
+        BitField('cq_invalidate', 0, 1),
+        BitField('reserved17', 0, 1),
+        BitField('cq_eq_remap', 0, 1),
+        BitField('pg', 0, 1),
+        BitField('block_lb_mc', 0, 1),
+        BitField('exponential_backoff', 0, 1),
+        BitField('scqe_break_moderation', 0, 1),
+        BitField('cq_period_start_from_cqe', 0, 1),
+        BitField('cd', 0, 1),
+        BitField('atm', 0, 1),
+        BitField('apm', 0, 1),
+        BitField('vector_calc', 0, 1),
+        BitField('umr_ptr_rlkey', 0, 1),
+        BitField('imaicl', 0, 1),
+        BitField('qp_packet_based', 0, 1),
+        BitField('ib_cyclic_striding_wq', 0, 1),
+        BitField('ipoib_enhanced_pkey_change', 0, 1),
+        BitField('initiator_src_dct_in_cqe', 0, 1),
+        BitField('qkv', 0, 1),
+        BitField('pkv', 0, 1),
+        BitField('set_deth_sqpn', 0, 1),
+        BitField('rts2rts_primary_sl', 0, 1),
+        BitField('initiator_src_dct', 0, 1),
+        BitField('dc_v2', 0, 1),
+        BitField('xrc', 0, 1),
+        BitField('ud', 0, 1),
+        BitField('uc', 0, 1),
+        BitField('rc', 0, 1),
+        BitField('uar_4k', 0, 1),
+        BitField('reserved18', 0, 7),
+        BitField('fl_rc_qp_when_roce_disabled', 0, 1),
+        BitField('reserved19', 0, 1),
+        BitField('uar_sz', 0, 6),
+        BitField('reserved20', 0, 3),
+        BitField('log_max_dc_cnak_qps', 0, 5),
+        ByteField('log_pg_sz', 0),
+        BitField('bf', 0, 1),
+        BitField('driver_version', 0, 1),
+        BitField('pad_tx_eth_packet', 0, 1),
+        BitField('query_driver_version', 0, 1),
+        BitField('max_qp_retry_freq', 0, 1),
+        BitField('qp_by_name', 0, 1),
+        BitField('mkey_by_name', 0, 1),
+        BitField('reserved21', 0, 4),
+        BitField('log_bf_reg_size', 0, 5),
+        BitField('reserved22', 0, 6),
+        BitField('lag_dct', 0, 2),
+        BitField('lag_tx_port_affinity', 0, 1),
+        BitField('lag_native_fdb_selection', 0, 1),
+        BitField('must_be_0', 0, 1),
+        BitField('lag_master', 0, 1),
+        BitField('num_lag_ports', 0, 4),
+        ShortField('num_of_diagnostic_counters', 0),
+        ShortField('max_wqe_sz_sq', 0),
+        ShortField('reserved23', 0),
+        ShortField('max_wqe_sz_rq', 0),
+        ShortField('max_flow_counter_31_16', 0),
+        ShortField('max_wqe_sz_sq_dc', 0),
+        BitField('reserved24', 0, 7),
+        BitField('max_qp_mcg', 0, 25),
+        ShortField('mlnx_tag_ethertype', 0),
+        ByteField('flow_counter_bulk_alloc', 0),
+        ByteField('log_max_mcg', 0),
+        BitField('reserved25', 0, 3),
+        BitField('log_max_transport_domain', 0, 5),
+        BitField('reserved26', 0, 3),
+        BitField('log_max_pd', 0, 5),
+        BitField('reserved27', 0, 11),
+        BitField('log_max_xrcd', 0, 5),
+        BitField('nic_receive_steering_discard', 0, 1),
+        BitField('receive_discard_vport_down', 0, 1),
+        BitField('transmit_discard_vport_down', 0, 1),
+        BitField('eq_overrun_count', 0, 1),
+        BitField('nic_receive_steering_depth', 0, 1),
+        BitField('invalid_command_count', 0, 1),
+        BitField('quota_exceeded_count', 0, 1),
+        BitField('flow_counter_by_name', 0, 1),
+        ByteField('log_max_flow_counter_bulk', 0),
+        ShortField('max_flow_counter_15_0', 0),
+        BitField('modify_tis', 0, 1),
+        BitField('flow_counters_dump', 0, 1),
+        BitField('reserved28', 0, 1),
+        BitField('log_max_rq', 0, 5),
+        BitField('reserved29', 0, 3),
+        BitField('log_max_sq', 0, 5),
+        BitField('reserved30', 0, 3),
+        BitField('log_max_tir', 0, 5),
+        BitField('reserved31', 0, 3),
+        BitField('log_max_tis', 0, 5),
+        BitField('basic_cyclic_rcv_wqe', 0, 1),
+        BitField('reserved32', 0, 2),
+        BitField('log_max_rmp', 0, 5),
+        BitField('reserved33', 0, 3),
+        BitField('log_max_rqt', 0, 5),
+        BitField('reserved34', 0, 3),
+        BitField('log_max_rqt_size', 0, 5),
+        BitField('reserved35', 0, 3),
+        BitField('log_max_tis_per_sq', 0, 5),
+        BitField('ext_stride_num_range', 0, 1),
+        BitField('reserved36', 0, 2),
+        BitField('log_max_stride_sz_rq', 0, 5),
+        BitField('reserved37', 0, 3),
+        BitField('log_min_stride_sz_rq', 0, 5),
+        BitField('reserved38', 0, 3),
+        BitField('log_max_stride_sz_sq', 0, 5),
+        BitField('reserved39', 0, 3),
+        BitField('log_min_stride_sz_sq', 0, 5),
+        BitField('hairpin_eth_raw', 0, 1),
+        BitField('reserved40', 0, 2),
+        BitField('log_max_hairpin_queues', 0, 5),
+        BitField('hairpin_ib_raw', 0, 1),
+        BitField('hairpin_eth2ipoib', 0, 1),
+        BitField('hairpin_ipoib2eth', 0, 1),
+        BitField('log_max_hairpin_wq_data_sz', 0, 5),
+        BitField('reserved41', 0, 3),
+        BitField('log_max_hairpin_num_packets', 0, 5),
+        BitField('reserved42', 0, 3),
+        BitField('log_max_wq_sz', 0, 5),
+        BitField('nic_vport_change_event', 0, 1),
+        BitField('disable_local_lb_uc', 0, 1),
+        BitField('disable_local_lb_mc', 0, 1),
+        BitField('log_min_hairpin_wq_data_sz', 0, 5),
+        BitField('system_image_guid_modifiable', 0, 1),
+        BitField('reserved43', 0, 1),
+        BitField('vhca_state', 0, 1),
+        BitField('log_max_vlan_list', 0, 5),
+        BitField('reserved44', 0, 3),
+        BitField('log_max_current_mc_list', 0, 5),
+        BitField('reserved45', 0, 3),
+        BitField('log_max_current_uc_list', 0, 5),
+        LongField('general_obj_types', 0),
+        BitField('sq_ts_format', 0, 2),
+        BitField('rq_ts_format', 0, 2),
+        BitField('steering_format_version', 0, 4),
+        BitField('create_qp_start_hint', 0, 24),
+        BitField('tls', 0, 1),
+        BitField('ats', 0, 1),
+        BitField('reserved46', 0, 1),
+        BitField('log_max_uctx', 0, 5),
+        BitField('aes_xts', 0, 1),
+        BitField('crypto', 0, 1),
+        BitField('ipsec_offload', 0, 1),
+        BitField('log_max_umem', 0, 5),
+        ShortField('max_num_eqs', 0),
+        BitField('reserved47', 0, 1),
+        BitField('tls_tx', 0, 1),
+        BitField('tls_rx', 0, 1),
+        BitField('log_max_l2_table', 0, 5),
+        ByteField('reserved48', 0),
+        ShortField('log_uar_page_sz', 0),
+        BitField('e', 0, 1),
+        BitField('reserved49', 0, 31),
+        IntField('device_frequency_mhz', 0),
+        IntField('device_frequency_khz', 0),
+        BitField('capi', 0, 1),
+        BitField('create_pec', 0, 1),
+        BitField('nvmf_target_offload', 0, 1),
+        BitField('capi_invalidate', 0, 1),
+        BitField('reserved50', 0, 23),
+        BitField('log_max_pasid', 0, 5),
+        IntField('num_of_uars_per_page', 0),
+        IntField('flex_parser_protocols', 0),
+        ByteField('max_geneve_tlv_options', 0),
+        BitField('reserved51', 0, 3),
+        BitField('max_geneve_tlv_option_data_len', 0, 5),
+        BitField('flex_parser_header_modify', 0, 1),
+        BitField('reserved52', 0, 2),
+        BitField('log_max_guaranteed_connections', 0, 5),
+        BitField('reserved53', 0, 3),
+        BitField('log_max_dct_connections', 0, 5),
+        ByteField('log_max_atomic_size_qp', 0),
+        BitField('reserved54', 0, 3),
+        BitField('log_max_dci_stream_channels', 0, 5),
+        BitField('reserved55', 0, 3),
+        BitField('log_max_dci_errored_streams', 0, 5),
+        ByteField('log_max_atomic_size_dc', 0),
+        ShortField('max_multi_user_group_size', 0),
+        BitField('reserved56', 0, 2),
+        BitField('crossing_vhca_mkey', 0, 1),
+        BitField('log_max_dek', 0, 5),
+        BitField('reserved57', 0, 1),
+        BitField('mini_cqe_resp_l3l4header', 0, 1),
+        BitField('mini_cqe_resp_flow_tag', 0, 1),
+        BitField('enhanced_cqe_compression', 0, 1),
+        BitField('mini_cqe_resp_stride_index', 0, 1),
+        BitField('cqe_128_always', 0, 1),
+        BitField('cqe_compression_128b', 0, 1),
+        BitField('cqe_compression', 0, 1),
+        ShortField('cqe_compression_timeout', 0),
+        ShortField('cqe_compression_max_num', 0),
+        BitField('reserved58', 0, 3),
+        BitField('wqe_based_flow_table_update_dest_type_offset', 0, 5),
+        BitField('flex_parser_id_gtpu_dw_0', 0, 4),
+        BitField('log_max_tm_offloaded_op_size', 0, 4),
+        BitField('tag_matching', 0, 1),
+        BitField('rndv_offload_rc', 0, 1),
+        BitField('rndv_offload_dc', 0, 1),
+        BitField('log_tag_matching_list_sz', 0, 5),
+        BitField('reserved59', 0, 3),
+        BitField('log_max_xrq', 0, 5),
+        ByteField('affiliate_nic_vport_criteria', 0),
+        ByteField('native_port_num', 0),
+        ByteField('num_vhca_ports', 0),
+        BitField('flex_parser_id_gtpu_teid', 0, 4),
+        BitField('reserved60', 0, 1),
+        BitField('trusted_vnic_vhca', 0, 1),
+        BitField('sw_owner_id', 0, 1),
+        BitField('reserve_not_to_use', 0, 1),
+        ShortField('max_num_of_monitor_counters', 0),
+        ShortField('num_ppcnt_monitor_counters', 0),
+        ShortField('max_num_sf', 0),
+        ShortField('num_q_monitor_counters', 0),
+        StrFixedLenField('reserved61', None, length=4),
+        BitField('sf', 0, 1),
+        BitField('sf_set_partition', 0, 1),
+        BitField('reserved62', 0, 1),
+        BitField('log_max_sf', 0, 5),
+        ByteField('reserved63', 0),
+        ByteField('log_min_sf_size', 0),
+        ByteField('max_num_sf_partitions', 0),
+        IntField('uctx_permission', 0),
+        BitField('flex_parser_id_mpls_over_x_cw', 0, 4),
+        BitField('flex_parser_id_geneve_tlv_option_0', 0, 4),
+        BitField('flex_parser_id_icmp_dw1', 0, 4),
+        BitField('flex_parser_id_icmp_dw0', 0, 4),
+        BitField('flex_parser_id_icmpv6_dw1', 0, 4),
+        BitField('flex_parser_id_icmpv6_dw0', 0, 4),
+        BitField('flex_parser_id_outer_first_mpls_over_gre', 0, 4),
+        BitField('flex_parser_id_outer_first_mpls_over_udp_label', 0, 4),
+        ShortField('max_num_match_definer', 0),
+        ShortField('sf_base_id', 0),
+        BitField('flex_parser_id_gtpu_dw_2', 0, 4),
+        BitField('flex_parser_id_gtpu_first_ext_dw_0', 0, 4),
+        BitField('num_total_dynamic_vf_msix', 0, 24),
+        BitField('reserved64', 0, 3),
+        BitField('log_flow_hit_aso_granularity', 0, 5),
+        BitField('reserved65', 0, 3),
+        BitField('log_flow_hit_aso_max_alloc', 0, 5),
+        BitField('reserved66', 0, 4),
+        BitField('dynamic_msix_table_size', 0, 12),
+        BitField('reserved67', 0, 3),
+        BitField('log_max_num_flow_hit_aso', 0, 5),
+        BitField('reserved68', 0, 4),
+        BitField('min_dynamic_vf_msix_table_size', 0, 4),
+        BitField('reserved69', 0, 4),
+        BitField('max_dynamic_vf_msix_table_size', 0, 12),
+        BitField('reserved70', 0, 3),
+        BitField('log_max_num_header_modify_argument', 0, 5),
+        BitField('reserved71', 0, 4),
+        BitField('log_header_modify_argument_granularity', 0, 4),
+        BitField('reserved72', 0, 3),
+        BitField('log_header_modify_argument_max_alloc', 0, 5),
+        BitField('reserved73', 0, 3),
+        BitField('max_flow_execute_aso', 0, 5),
+        LongField('vhca_tunnel_commands', 0),
+        LongField('match_definer_format_supported', 0),
+    ]
+
+
+class QueryCmdHcaCapOut(Packet):
+    fields_desc = [
+        ByteField('status', 0),
+        BitField('reserved1', 0, 24),
+        IntField('syndrome', 0),
+        StrFixedLenField('reserved2', None, length=8),
+        PadField(PacketField('capability', CmdHcaCap(), CmdHcaCap), 2048, padwith=b"\x00"),
+    ]
diff --git a/tests/test_mlx5_devx.py b/tests/test_mlx5_devx.py
new file mode 100644
index 0000000..c43dcd5
--- /dev/null
+++ b/tests/test_mlx5_devx.py
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia Inc. All rights reserved. See COPYING file
+
+"""
+Test module for mlx5 DevX.
+"""
+
+from tests.mlx5_base import Mlx5DevxRcResources, Mlx5DevxTrafficBase
+
+
+class Mlx5DevxRcTrafficTest(Mlx5DevxTrafficBase):
+    """
+    Test various functionality of mlx5 DevX objects
+    """
+
+    def test_devx_rc_qp_send_imm_traffic(self):
+        """
+        Creates two DevX RC QPs and modifies them to RTS state.
+        Then does SEND_IMM traffic.
+        """
+        self.create_players(Mlx5DevxRcResources)
+        # Send traffic
+        self.send_imm_traffic()
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread
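
A minimal usage sketch (not part of the patch) of how the scapy PRM structs above
can drive a DevX command: it assumes QueryHcaCapIn from tests/mlx5_prm_structs.py
and Mlx5Context.devx_general_cmd() from pyverbs, which are part of this series but
not shown in this hunk, and an mlx5 device named 'mlx5_0'.

# Illustrative only: query the general HCA capabilities over DevX and parse the
# result with the scapy structs defined above.
from pyverbs.providers.mlx5.mlx5dv import Mlx5Context, Mlx5DVContextAttr
import pyverbs.providers.mlx5.mlx5_enums as dve
from tests.mlx5_prm_structs import QueryHcaCapIn, QueryCmdHcaCapOut

ctx = Mlx5Context(Mlx5DVContextAttr(dve.MLX5DV_CONTEXT_FLAGS_DEVX), name='mlx5_0')
cmd_in = QueryHcaCapIn(op_mod=0x1)  # general device caps, current values
out = ctx.devx_general_cmd(cmd_in, len(QueryCmdHcaCapOut()))
caps = QueryCmdHcaCapOut(out)
if caps.status:
    raise RuntimeError(f'QUERY_HCA_CAP failed, syndrome=0x{caps.syndrome:x}')
print('log_max_sf:', caps.capability.log_max_sf)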

* [PATCH rdma-core 26/27] pyverbs/mlx5: Support mlx5 devices over VFIO
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (24 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 25/27] tests: Add mlx5 DevX data path test Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-07-20  8:16 ` [PATCH rdma-core 27/27] tests: Add a test for mlx5 " Yishai Hadas
  2021-08-01  8:00 ` [PATCH rdma-core 00/27] Introduce mlx5 user space driver " Yishai Hadas
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Support opening an mlx5 device and creating a context over VFIO by
adding an Mlx5VfioContext class, based on Mlx5Context, to allow using
DevX APIs.

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 pyverbs/providers/mlx5/CMakeLists.txt   |   3 +-
 pyverbs/providers/mlx5/libmlx5.pxd      |   8 +++
 pyverbs/providers/mlx5/mlx5_vfio.pxd    |  15 +++++
 pyverbs/providers/mlx5/mlx5_vfio.pyx    | 116 ++++++++++++++++++++++++++++++++
 pyverbs/providers/mlx5/mlx5dv.pxd       |   3 +
 pyverbs/providers/mlx5/mlx5dv_enums.pxd |   3 +
 6 files changed, 147 insertions(+), 1 deletion(-)
 create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pxd
 create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pyx

diff --git a/pyverbs/providers/mlx5/CMakeLists.txt b/pyverbs/providers/mlx5/CMakeLists.txt
index 4763c61..c0b5869 100644
--- a/pyverbs/providers/mlx5/CMakeLists.txt
+++ b/pyverbs/providers/mlx5/CMakeLists.txt
@@ -7,8 +7,9 @@ rdma_cython_module(pyverbs/providers/mlx5 mlx5
   dr_matcher.pyx
   dr_rule.pyx
   dr_table.pyx
-  mlx5dv.pyx
   mlx5_enums.pyx
+  mlx5_vfio.pyx
+  mlx5dv.pyx
   mlx5dv_flow.pyx
   mlx5dv_mkey.pyx
   mlx5dv_objects.pyx
diff --git a/pyverbs/providers/mlx5/libmlx5.pxd b/pyverbs/providers/mlx5/libmlx5.pxd
index af034ad..e0904f5 100644
--- a/pyverbs/providers/mlx5/libmlx5.pxd
+++ b/pyverbs/providers/mlx5/libmlx5.pxd
@@ -223,6 +223,11 @@ cdef extern from 'infiniband/mlx5dv.h':
         uint64_t    pgsz_bitmap
         uint64_t    comp_mask
 
+    cdef struct mlx5dv_vfio_context_attr:
+        const char  *pci_name
+        uint32_t    flags
+        uint64_t    comp_mask
+
     cdef struct mlx5dv_pd:
         uint32_t    pdn
         uint64_t    comp_mask
@@ -372,6 +377,9 @@ cdef extern from 'infiniband/mlx5dv.h':
                              uint64_t device_timestamp)
     int mlx5dv_get_clock_info(v.ibv_context *ctx_in, mlx5dv_clock_info *clock_info)
     int mlx5dv_map_ah_to_qp(v.ibv_ah *ah, uint32_t qp_num)
+    v.ibv_device **mlx5dv_get_vfio_device_list(mlx5dv_vfio_context_attr *attr)
+    int mlx5dv_vfio_get_events_fd(v.ibv_context *ibctx)
+    int mlx5dv_vfio_process_events(v.ibv_context *context)
 
     # DevX APIs
     mlx5dv_devx_uar *mlx5dv_devx_alloc_uar(v.ibv_context *context, uint32_t flags)
diff --git a/pyverbs/providers/mlx5/mlx5_vfio.pxd b/pyverbs/providers/mlx5/mlx5_vfio.pxd
new file mode 100644
index 0000000..0e9facd
--- /dev/null
+++ b/pyverbs/providers/mlx5/mlx5_vfio.pxd
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+
+#cython: language_level=3
+
+from pyverbs.providers.mlx5.mlx5dv cimport Mlx5Context
+cimport pyverbs.providers.mlx5.libmlx5 as dv
+from pyverbs.base cimport PyverbsObject
+
+
+cdef class Mlx5VfioContext(Mlx5Context):
+    pass
+
+cdef class Mlx5VfioAttr(PyverbsObject):
+    cdef dv.mlx5dv_vfio_context_attr attr
diff --git a/pyverbs/providers/mlx5/mlx5_vfio.pyx b/pyverbs/providers/mlx5/mlx5_vfio.pyx
new file mode 100644
index 0000000..2978b61
--- /dev/null
+++ b/pyverbs/providers/mlx5/mlx5_vfio.pyx
@@ -0,0 +1,116 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+
+#cython: language_level=3
+
+from cpython.mem cimport PyMem_Malloc, PyMem_Free
+from libc.string cimport strcpy
+import weakref
+
+from pyverbs.pyverbs_error import PyverbsRDMAError
+cimport pyverbs.providers.mlx5.libmlx5 as dv
+from pyverbs.base import PyverbsRDMAErrno
+from pyverbs.base cimport close_weakrefs
+from pyverbs.device cimport Context
+cimport pyverbs.libibverbs as v
+
+
+cdef class Mlx5VfioAttr(PyverbsObject):
+    """
+    Mlx5VfioAttr class, represents the mlx5dv_vfio_context_attr C struct.
+    """
+    def __init__(self, pci_name, flags=0, comp_mask=0):
+        self.pci_name = pci_name
+        self.attr.flags = flags
+        self.attr.comp_mask = comp_mask
+
+    def __dealloc__(self):
+        if self.attr.pci_name != NULL:
+            PyMem_Free(<void*>self.attr.pci_name)
+            self.attr.pci_name = NULL
+
+    @property
+    def flags(self):
+        return self.attr.flags
+    @flags.setter
+    def flags(self, val):
+        self.attr.flags = val
+
+    @property
+    def comp_mask(self):
+        return self.attr.comp_mask
+    @comp_mask.setter
+    def comp_mask(self, val):
+        self.attr.comp_mask = val
+
+    @property
+    def pci_name(self):
+        return self.attr.pci_name[:]
+    @pci_name.setter
+    def pci_name(self, val):
+        if self.attr.pci_name != NULL:
+            PyMem_Free(<void*>self.attr.pci_name)
+        pci_name_bytes = val.encode()
+        self.attr.pci_name = <char*>PyMem_Malloc(len(pci_name_bytes) + 1)  # room for the NUL written by strcpy
+        strcpy(<char*>self.attr.pci_name, pci_name_bytes)
+
+
+cdef class Mlx5VfioContext(Mlx5Context):
+    """
+    Mlx5VfioContext class is used to easily initialize and open a context over
+    an mlx5 vfio device.
+    It is initialized based on the passed mlx5 vfio attributes (Mlx5VfioAttr),
+    by getting the relevant vfio device and opening it (creating a context).
+    """
+    def __init__(self, Mlx5VfioAttr attr):
+        super(Context, self).__init__()
+        cdef v.ibv_device **dev_list
+
+        self.name = attr.pci_name
+        self.pds = weakref.WeakSet()
+        self.devx_umems = weakref.WeakSet()
+        self.devx_objs = weakref.WeakSet()
+        self.uars = weakref.WeakSet()
+
+        dev_list = dv.mlx5dv_get_vfio_device_list(&attr.attr)
+        if dev_list == NULL:
+            raise PyverbsRDMAErrno('Failed to get VFIO device list')
+        self.device = dev_list[0]
+        if self.device == NULL:
+            raise PyverbsRDMAError('Failed to get VFIO device')
+        try:
+            self.context = v.ibv_open_device(self.device)
+            if self.context == NULL:
+                raise PyverbsRDMAErrno('Failed to open mlx5 VFIO device '
+                                       f'({self.device.name.decode()})')
+        finally:
+            v.ibv_free_device_list(dev_list)
+
+    def get_events_fd(self):
+        """
+        Gets the file descriptor to manage driver events.
+        :return: The file descriptor to be used for managing driver events.
+        """
+        fd = dv.mlx5dv_vfio_get_events_fd(self.context)
+        if fd < 0:
+            raise PyverbsRDMAError('Failed to get VFIO events FD', -fd)
+        return fd
+
+    def process_events(self):
+        """
+        Process events on the vfio device.
+        This method should run from an application thread to maintain device events.
+        :return: None
+        """
+        rc = dv.mlx5dv_vfio_process_events(self.context)
+        if rc:
+            raise PyverbsRDMAError('VFIO process events failed', rc)
+
+    cpdef close(self):
+        if self.context != NULL:
+            self.logger.debug('Closing Mlx5VfioContext')
+            close_weakrefs([self.pds, self.devx_objs, self.devx_umems, self.uars])
+            rc = v.ibv_close_device(self.context)
+            if rc != 0:
+                raise PyverbsRDMAErrno(f'Failed to close device {self.name}')
+            self.context = NULL
diff --git a/pyverbs/providers/mlx5/mlx5dv.pxd b/pyverbs/providers/mlx5/mlx5dv.pxd
index 968cbdb..490c697 100644
--- a/pyverbs/providers/mlx5/mlx5dv.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv.pxd
@@ -86,3 +86,6 @@ cdef class Mlx5DevxObj(PyverbsCM):
 
 cdef class Mlx5Cqe64(PyverbsObject):
     cdef dv.mlx5_cqe64 *cqe
+
+cdef class Mlx5VfioAttr(PyverbsObject):
+    cdef dv.mlx5dv_vfio_context_attr attr
diff --git a/pyverbs/providers/mlx5/mlx5dv_enums.pxd b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
index 60713e8..94599ab 100644
--- a/pyverbs/providers/mlx5/mlx5dv_enums.pxd
+++ b/pyverbs/providers/mlx5/mlx5dv_enums.pxd
@@ -215,6 +215,9 @@ cdef extern from 'infiniband/mlx5dv.h':
         MLX5_SEND_WQE_BB
         MLX5_SEND_WQE_SHIFT
 
+    cpdef enum mlx5dv_vfio_context_attr_flags:
+        MLX5DV_VFIO_CTX_FLAGS_INIT_LINK_DOWN
+
     cpdef unsigned long long MLX5DV_RES_TYPE_QP
     cpdef unsigned long long MLX5DV_RES_TYPE_RWQ
     cpdef unsigned long long MLX5DV_RES_TYPE_DBR
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread
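
A minimal usage sketch (not part of the patch) of the new class: the PCI address,
error handling and surrounding application flow are illustrative assumptions, and
the device is assumed to be already bound to vfio-pci.

# Open an mlx5 VFIO context and keep driver events serviced from an application
# thread, as the process_events() docstring above describes, while DevX is used.
from threading import Thread
import select

from pyverbs.providers.mlx5.mlx5_vfio import Mlx5VfioAttr, Mlx5VfioContext

ctx = Mlx5VfioContext(attr=Mlx5VfioAttr(pci_name='0000:08:00.0'))
keep_running = True

def event_loop():
    # Wait on the events FD and let the driver process device events (EQs, health).
    fd = ctx.get_events_fd()
    with select.epoll() as ep:
        ep.register(fd, select.EPOLLIN)
        while keep_running:
            if ep.poll(timeout=0.1):
                ctx.process_events()

t = Thread(target=event_loop)
t.start()
# ... create PDs, UMEMs, UARs and DevX objects over ctx, as with a regular DevX context ...
keep_running = False
t.join()
ctx.close()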

* [PATCH rdma-core 27/27] tests: Add a test for mlx5 over VFIO
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (25 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 26/27] pyverbs/mlx5: Support mlx5 devices over VFIO Yishai Hadas
@ 2021-07-20  8:16 ` Yishai Hadas
  2021-08-01  8:00 ` [PATCH rdma-core 00/27] Introduce mlx5 user space driver " Yishai Hadas
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  8:16 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, yishaih, maorg, markzhang, edwards, Ido Kalir

From: Edward Srouji <edwards@nvidia.com>

Add a test that opens an mlx5 vfio context, creates two DevX QPs on
it, modifies them to RTS state and then performs SEND_IMM traffic.
An additional command line argument is added to the tests, to allow
users to pass a PCI device instead of an RDMA device for this test.

Reviewed-by: Ido Kalir <idok@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 tests/CMakeLists.txt    |   1 +
 tests/args_parser.py    |   5 +++
 tests/test_mlx5_vfio.py | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 110 insertions(+)
 create mode 100644 tests/test_mlx5_vfio.py

diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt
index 7b079d8..cabeba0 100644
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -36,6 +36,7 @@ rdma_python_test(tests
   test_mlx5_uar.py
   test_mlx5_udp_sport.py
   test_mlx5_var.py
+  test_mlx5_vfio.py
   test_mr.py
   test_odp.py
   test_pd.py
diff --git a/tests/args_parser.py b/tests/args_parser.py
index 5bc53b0..4312121 100644
--- a/tests/args_parser.py
+++ b/tests/args_parser.py
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
 # Copyright (c) 2020 Kamal Heib <kamalheib1@gmail.com>, All rights reserved.  See COPYING file
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
 
 import argparse
 import sys
@@ -16,6 +17,10 @@ class ArgsParser(object):
         parser = argparse.ArgumentParser()
         parser.add_argument('--dev',
                             help='RDMA device to run the tests on')
+        parser.add_argument('--pci-dev',
+                            help='PCI device to run the tests on, which is '
+                                 'needed by some tests where the RDMA device is '
+                                 'not available (e.g. VFIO)')
         parser.add_argument('--port',
                             help='Use port <port> of RDMA device', type=int,
                             default=1)
diff --git a/tests/test_mlx5_vfio.py b/tests/test_mlx5_vfio.py
new file mode 100644
index 0000000..06da8dd
--- /dev/null
+++ b/tests/test_mlx5_vfio.py
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
+# Copyright (c) 2021 Nvidia, Inc. All rights reserved. See COPYING file
+"""
+Test module for pyverbs' mlx5_vfio module.
+"""
+
+from threading import Thread
+import unittest
+import select
+import errno
+
+from pyverbs.providers.mlx5.mlx5_vfio import Mlx5VfioAttr, Mlx5VfioContext
+from tests.mlx5_base import Mlx5DevxRcResources, Mlx5DevxTrafficBase
+from pyverbs.pyverbs_error import PyverbsRDMAError
+
+
+class Mlx5VfioResources(Mlx5DevxRcResources):
+    def __init__(self, ib_port, pci_name, gid_index=None, ctx=None):
+        self.pci_name = pci_name
+        self.ctx = ctx
+        super().__init__(None, ib_port, gid_index)
+
+    def create_context(self):
+        """
+        Opens an mlx5 VFIO context.
+        Since only one context can be opened on a VFIO device, the user must
+        pass that context to the remaining resources, in which case the
+        same context is reused.
+        :return: None
+        """
+        if self.ctx:
+            return
+        try:
+            vfio_attr = Mlx5VfioAttr(pci_name=self.pci_name)
+            vfio_attr.pci_name = self.pci_name
+            self.ctx = Mlx5VfioContext(attr=vfio_attr)
+        except PyverbsRDMAError as ex:
+            if ex.error_code == errno.EOPNOTSUPP:
+                raise unittest.SkipTest(f'Mlx5 VFIO is not supported ({ex})')
+            raise ex
+
+    def query_gid(self):
+        """
+        Currently Mlx5VfioResources does not support Eth port type.
+        Query GID would just be skipped.
+        """
+        pass
+
+
+class Mlx5VfioTrafficTest(Mlx5DevxTrafficBase):
+    """
+    Test various functionality of an mlx5-vfio device.
+    """
+    def setUp(self):
+        """
+        Verifies that the user has passed a PCI device name to work with.
+        """
+        self.pci_dev = self.config['pci_dev']
+        if not self.pci_dev:
+            raise unittest.SkipTest('PCI device must be passed by the user')
+
+    def create_players(self):
+        self.server = Mlx5VfioResources(ib_port=self.ib_port, pci_name=self.pci_dev)
+        self.client = Mlx5VfioResources(ib_port=self.ib_port, pci_name=self.pci_dev,
+                                        ctx=self.server.ctx)
+
+    def vfio_process_events(self):
+        """
+        Processes mlx5 vfio device events.
+        This method should run from an application thread to maintain the events.
+        """
+        # Server and client use the same context
+        events_fd = self.server.ctx.get_events_fd()
+        with select.epoll() as epoll_events:
+            epoll_events.register(events_fd, select.EPOLLIN)
+            while self.proc_events:
+                for fd, event in epoll_events.poll(timeout=0.1):
+                    if fd == events_fd:
+                        if not (event & select.EPOLLIN):
+                            self.event_ex.append(PyverbsRDMAError(f'Unexpected vfio event: {event}'))
+                        self.server.ctx.process_events()
+
+    def test_mlx5vfio_rc_qp_send_imm_traffic(self):
+        """
+        Opens one mlx5 vfio context, creates two DevX RC QPs on it, and modifies
+        them to RTS state.
+        Then does SEND_IMM traffic.
+        """
+        self.create_players()
+        if self.server.is_eth():
+            raise unittest.SkipTest(f'{self.__class__.__name__} is currently supported over IB only')
+        self.event_ex = []
+        self.proc_events = True
+        proc_events = Thread(target=self.vfio_process_events)
+        proc_events.start()
+        # Move the DevX QPs to RTS state
+        self.pre_run()
+        # Send traffic
+        self.send_imm_traffic()
+        # Stop listening to events
+        self.proc_events = False
+        proc_events.join()
+        if self.event_ex:
+            raise PyverbsRDMAError(f'Received unexpected vfio events: {self.event_ex}')
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread
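
An illustrative sketch (not part of the patch) of running this test: the PCI address,
the sysfs binding steps and the runner location (./build/bin/run_tests.py) are
assumptions, and it is assumed the runner forwards extra arguments to unittest for
test selection.

# Bind the device to vfio-pci (requires root and a loaded vfio-pci module), then run
# the mlx5 VFIO test with the new --pci-dev argument added by this patch.
import subprocess

PCI = '0000:08:00.0'

with open(f'/sys/bus/pci/devices/{PCI}/driver/unbind', 'w') as f:
    f.write(PCI)                      # unbind from the current driver (e.g. mlx5_core)
with open(f'/sys/bus/pci/devices/{PCI}/driver_override', 'w') as f:
    f.write('vfio-pci')               # prefer vfio-pci on the next probe
with open('/sys/bus/pci/drivers_probe', 'w') as f:
    f.write(PCI)                      # re-probe, now binding to vfio-pci

subprocess.run(['./build/bin/run_tests.py', '--pci-dev', PCI, 'tests.test_mlx5_vfio'],
               check=True)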

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20  8:16 ` [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio Yishai Hadas
@ 2021-07-20  8:51   ` Leon Romanovsky
  2021-07-20  9:27     ` Yishai Hadas
  0 siblings, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2021-07-20  8:51 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: linux-rdma, jgg, maorg, markzhang, edwards

On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
> From: Maor Gottlieb <maorg@nvidia.com>
> 
> Usage will be in next patches.
> 
> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>  providers/mlx5/mlx5.h |  4 ++++
>  2 files changed, 18 insertions(+), 14 deletions(-)

Probably, this patch will need to be changed after the
"Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030

Thanks

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20  8:51   ` Leon Romanovsky
@ 2021-07-20  9:27     ` Yishai Hadas
  2021-07-20 12:27       ` Leon Romanovsky
  0 siblings, 1 reply; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20  9:27 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, jgg, maorg, markzhang, edwards

On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
> On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
>> From: Maor Gottlieb <maorg@nvidia.com>
>>
>> Usage will be in next patches.
>>
>> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>>   providers/mlx5/mlx5.h |  4 ++++
>>   2 files changed, 18 insertions(+), 14 deletions(-)
> Probably, this patch will be needed to be changed after
> "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
>
> Thanks

Well, not really, this patch just reorganizes things inside mlx5 for 
code sharing.

Yishai


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20  9:27     ` Yishai Hadas
@ 2021-07-20 12:27       ` Leon Romanovsky
  2021-07-20 14:57         ` Yishai Hadas
  0 siblings, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2021-07-20 12:27 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: linux-rdma, jgg, maorg, markzhang, edwards

On Tue, Jul 20, 2021 at 12:27:46PM +0300, Yishai Hadas wrote:
> On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
> > On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
> > > From: Maor Gottlieb <maorg@nvidia.com>
> > > 
> > > Usage will be in next patches.
> > > 
> > > Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
> > > Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> > > ---
> > >   providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
> > >   providers/mlx5/mlx5.h |  4 ++++
> > >   2 files changed, 18 insertions(+), 14 deletions(-)
> > Probably, this patch will be needed to be changed after
> > "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
> > 
> > Thanks
> 
> Well, not really, this patch just reorganizes things inside mlx5 for code
> sharing.

After Gal's PR, I expect to see all mlx5 file/debug logic gone except
some minimal conversion logic to support old legacy variables.

All that get_env_... code will go.

Thanks

> 
> Yishai
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20 12:27       ` Leon Romanovsky
@ 2021-07-20 14:57         ` Yishai Hadas
  2021-07-21  7:05           ` Gal Pressman
  0 siblings, 1 reply; 36+ messages in thread
From: Yishai Hadas @ 2021-07-20 14:57 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, jgg, maorg, markzhang, edwards

On 7/20/2021 3:27 PM, Leon Romanovsky wrote:
> On Tue, Jul 20, 2021 at 12:27:46PM +0300, Yishai Hadas wrote:
>> On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
>>> On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
>>>> From: Maor Gottlieb <maorg@nvidia.com>
>>>>
>>>> Usage will be in next patches.
>>>>
>>>> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>> ---
>>>>    providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>>>>    providers/mlx5/mlx5.h |  4 ++++
>>>>    2 files changed, 18 insertions(+), 14 deletions(-)
>>> Probably, this patch will be needed to be changed after
>>> "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
>>>
>>> Thanks
>> Well, not really, this patch just reorganizes things inside mlx5 for code
>> sharing.
> After Gal's PR, I expect to see all mlx5 file/debug logic gone except
> some minimal conversion logic to support old legacy variables.
>
> All that get_env_... code will go.
>
> Thanks
>
Looking at the current verbs logging PR, it doesn't give a clean
conversion path from mlx5.

For example, it misses debug granularity per object (e.g. QP, CQ, etc.);
in addition, it doesn't define driver-specific options as we have
today in mlx5 (e.g. MLX5_DBG_DR).

I believe that this should be added before going forward with the 
logging PR to enable a clean transition later on.

The transition of mlx5 should preserve current API/semantics (ENV, etc.) 
and is an orthogonal task.

Yishai



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-20 14:57         ` Yishai Hadas
@ 2021-07-21  7:05           ` Gal Pressman
  2021-07-21  7:58             ` Yishai Hadas
  0 siblings, 1 reply; 36+ messages in thread
From: Gal Pressman @ 2021-07-21  7:05 UTC (permalink / raw)
  To: Yishai Hadas, Leon Romanovsky
  Cc: linux-rdma, jgg, maorg, markzhang, edwards, Leybovich, Yossi

On 20/07/2021 17:57, Yishai Hadas wrote:
> On 7/20/2021 3:27 PM, Leon Romanovsky wrote:
>> On Tue, Jul 20, 2021 at 12:27:46PM +0300, Yishai Hadas wrote:
>>> On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
>>>> On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
>>>>> From: Maor Gottlieb <maorg@nvidia.com>
>>>>>
>>>>> Usage will be in next patches.
>>>>>
>>>>> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>> ---
>>>>>    providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>>>>>    providers/mlx5/mlx5.h |  4 ++++
>>>>>    2 files changed, 18 insertions(+), 14 deletions(-)
>>>> Probably, this patch will be needed to be changed after
>>>> "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
>>>>
>>>> Thanks
>>> Well, not really, this patch just reorganizes things inside mlx5 for code
>>> sharing.
>> After Gal's PR, I expect to see all mlx5 file/debug logic gone except
>> some minimal conversion logic to support old legacy variables.
>>
>> All that get_env_... code will go.
>>
>> Thanks
>>
> Looking on current VERBs logging PR, it doesn't give a clean path conversion
> from mlx5.
> 
> For example it missed a debug granularity per object (e.g. QP, CQ, etc.) , in
> addition it doesn't define a driver specific options as we have today in mlx5
> (e.g. MLX5_DBG_DR).
> 
> I believe that this should be added before going forward with the logging PR to
> enable a clean transition later on.
> 
> The transition of mlx5 should preserve current API/semantics (ENV, etc.) and is
> an orthogonal task.

Yishai, if you have any more concerns please share them in the PR. The
discussion there has been going on for a while and you've been quiet, so I
assumed you're fine with it.

I disagree about needing to support everything that exists in mlx5 today; the
purpose of the generic mechanism is to unify the environment variables, and
driver-specific options do the opposite. IMO masking out a few prints isn't
worth the divergence.

If the mlx5 provider has backwards compatibility issues it doesn't necessarily
have to use this API, but we can at least convert most existing providers and all
future providers that wish to support such functionality in a unified way.

BTW, why even insist on having backwards compatibility here? Do you actually
have useful code that relies on debug environment variables?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-21  7:05           ` Gal Pressman
@ 2021-07-21  7:58             ` Yishai Hadas
  2021-07-21  8:51               ` Gal Pressman
  0 siblings, 1 reply; 36+ messages in thread
From: Yishai Hadas @ 2021-07-21  7:58 UTC (permalink / raw)
  To: Gal Pressman, Leon Romanovsky
  Cc: linux-rdma, jgg, maorg, markzhang, edwards, Leybovich, Yossi

On 7/21/2021 10:05 AM, Gal Pressman wrote:
> On 20/07/2021 17:57, Yishai Hadas wrote:
>> On 7/20/2021 3:27 PM, Leon Romanovsky wrote:
>>> On Tue, Jul 20, 2021 at 12:27:46PM +0300, Yishai Hadas wrote:
>>>> On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
>>>>> On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
>>>>>> From: Maor Gottlieb <maorg@nvidia.com>
>>>>>>
>>>>>> Usage will be in next patches.
>>>>>>
>>>>>> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
>>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>>> ---
>>>>>>     providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>>>>>>     providers/mlx5/mlx5.h |  4 ++++
>>>>>>     2 files changed, 18 insertions(+), 14 deletions(-)
>>>>> Probably, this patch will be needed to be changed after
>>>>> "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
>>>>>
>>>>> Thanks
>>>> Well, not really, this patch just reorganizes things inside mlx5 for code
>>>> sharing.
>>> After Gal's PR, I expect to see all mlx5 file/debug logic gone except
>>> some minimal conversion logic to support old legacy variables.
>>>
>>> All that get_env_... code will go.
>>>
>>> Thanks
>>>
>> Looking on current VERBs logging PR, it doesn't give a clean path conversion
>> from mlx5.
>>
>> For example it missed a debug granularity per object (e.g. QP, CQ, etc.) , in
>> addition it doesn't define a driver specific options as we have today in mlx5
>> (e.g. MLX5_DBG_DR).
>>
>> I believe that this should be added before going forward with the logging PR to
>> enable a clean transition later on.
>>
>> The transition of mlx5 should preserve current API/semantics (ENV, etc.) and is
>> an orthogonal task.
> Yishai, if you have any more concerns please share them in the PR.. The
> discussion there is going on for a while and you've been quiet so I assumed
> you're fine with it.
>
> I disagree about needing to support everything that exists in mlx5 today, the
> purpose of the generic mechanism is to unify the environment variables, driver
> specific options do the opposite. IMO masking out a few prints isn't worth the
> divergence.


The options in mlx5 give more granularity and support vendor-specific
options; this may be needed down the road by other vendors as well.

If we come up with a new API, we need to consider such options in the general case.

NP, I'll comment in the logging PR as well.

>
> If the mlx5 provider has backwards compatibility issues it doesn't necessarily
> have to use this API, but we can at least convert most existing providers and all
> future providers that wish to support such functionality in a unified way.


The point was that the current suggestion doesn't allow a clean transition
for mlx5, so we won't use it unless the API gives a matching
alternative.

> BTW, why even insist on having backwards compatibility here? Do you actually
> have useful code that relies on debug environment variables?

Logging options (env, mask, etc.) are a kind of API which needs to be
preserved; they are used in the field for debug purposes, and there is no
reason to lose granularity and tracing.

Yishai


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio
  2021-07-21  7:58             ` Yishai Hadas
@ 2021-07-21  8:51               ` Gal Pressman
  0 siblings, 0 replies; 36+ messages in thread
From: Gal Pressman @ 2021-07-21  8:51 UTC (permalink / raw)
  To: Yishai Hadas, Leon Romanovsky
  Cc: linux-rdma, jgg, maorg, markzhang, edwards, Leybovich, Yossi

On 21/07/2021 10:58, Yishai Hadas wrote:
> On 7/21/2021 10:05 AM, Gal Pressman wrote:
>> On 20/07/2021 17:57, Yishai Hadas wrote:
>>> On 7/20/2021 3:27 PM, Leon Romanovsky wrote:
>>>> On Tue, Jul 20, 2021 at 12:27:46PM +0300, Yishai Hadas wrote:
>>>>> On 7/20/2021 11:51 AM, Leon Romanovsky wrote:
>>>>>> On Tue, Jul 20, 2021 at 11:16:23AM +0300, Yishai Hadas wrote:
>>>>>>> From: Maor Gottlieb <maorg@nvidia.com>
>>>>>>>
>>>>>>> Usage will be in next patches.
>>>>>>>
>>>>>>> Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
>>>>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>>>>> ---
>>>>>>>     providers/mlx5/mlx5.c | 28 ++++++++++++++--------------
>>>>>>>     providers/mlx5/mlx5.h |  4 ++++
>>>>>>>     2 files changed, 18 insertions(+), 14 deletions(-)
>>>>>> Probably, this patch will be needed to be changed after
>>>>>> "Verbs logging API" PR https://github.com/linux-rdma/rdma-core/pull/1030
>>>>>>
>>>>>> Thanks
>>>>> Well, not really, this patch just reorganizes things inside mlx5 for code
>>>>> sharing.
>>>> After Gal's PR, I expect to see all mlx5 file/debug logic gone except
>>>> some minimal conversion logic to support old legacy variables.
>>>>
>>>> All that get_env_... code will go.
>>>>
>>>> Thanks
>>>>
>>> Looking on current VERBs logging PR, it doesn't give a clean path conversion
>>> from mlx5.
>>>
>>> For example it missed a debug granularity per object (e.g. QP, CQ, etc.) , in
>>> addition it doesn't define a driver specific options as we have today in mlx5
>>> (e.g. MLX5_DBG_DR).
>>>
>>> I believe that this should be added before going forward with the logging PR to
>>> enable a clean transition later on.
>>>
>>> The transition of mlx5 should preserve current API/semantics (ENV, etc.) and is
>>> an orthogonal task.
>> Yishai, if you have any more concerns please share them in the PR.. The
>> discussion there is going on for a while and you've been quiet so I assumed
>> you're fine with it.
>>
>> I disagree about needing to support everything that exists in mlx5 today, the
>> purpose of the generic mechanism is to unify the environment variables, driver
>> specific options do the opposite. IMO masking out a few prints isn't worth the
>> divergence.
> 
> 
> The options in mlx5 gives more granularity and supports vendor specific options,
> this may be needed down the road by other vendors as well.
> 
> If we come with a new API need to consider such options in the general case.
> 
> NP, I'll comment in the logging PR as well.
> 
>>
>> If the mlx5 provider has backwards compatibility issues it doesn't necessarily
>> have to use this API, but we can at least convert most existing providers and all
>> future providers that wish to support such functionality in a unified way.
> 
> 
> The point was that current suggestion doesn't allow a clean transition for mlx5,
> so we won't use it unless the API will give a matching alternative.
> 
>> BTW, why even insist on having backwards compatibility here? Do you actually
>> have useful code that relies on debug environment variables?
> 
> Logging options (env, mask, etc.) are kind of API which need to be preserved,
> it's used in the field for debug purposes, no reason to lose granularity and
> trace.

It's used in the field by *people*, who, unlike legacy code, can learn to use
the new debug env variables; they don't need backwards compatibility (the same
as users of debugfs).
I addressed the granularity issue in the PR.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO
  2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
                   ` (26 preceding siblings ...)
  2021-07-20  8:16 ` [PATCH rdma-core 27/27] tests: Add a test for mlx5 " Yishai Hadas
@ 2021-08-01  8:00 ` Yishai Hadas
  27 siblings, 0 replies; 36+ messages in thread
From: Yishai Hadas @ 2021-08-01  8:00 UTC (permalink / raw)
  To: linux-rdma; +Cc: jgg, maorg, markzhang, edwards

On 7/20/2021 11:16 AM, Yishai Hadas wrote:
> This series introduces mlx5 user space driver over VFIO.
>
> This enables an application to take full ownership on the opened device and run
> any firmware command (e.g. port up/down) without any concern to hurt someone
> else.
>
> The application look and feel is like regular RDMA application over DEVX, it
> uses verbs API to open/close a device and then mostly uses DEVX APIs to
> interact with the device.
>
> To achieve it, few mlx5 DV APIs were introduced, it includes:
> - An API to get ibv_device for a given mlx5 PCI name.
> - APIs to manage device specific events.
>
> Detailed man pages were added to describe the expected usage of those APIs.
>
> The mlx5 VFIO driver implemented the basic verbs APIs as of managing MR/PD and
> the DEVX APIs which are required to write an RDMA application.
>
> The VFIO uAPIs were used to setup mlx5 vfio context by reading the device
> initialization segment, setup DMA and enables the device command interface.
>
> In addition, the series includes pyverbs stuff which runs DEVX like testing
> over RDMA and VFIO mlx5 devices.
>
> Some extra documentation of the benefits and the motivation to use VFIO can be
> found here [1].
>
> PR was sent [2].
>
> [1] https://www.kernel.org/doc/html/latest/driver-api/vfio.html
> [2] https://github.com/linux-rdma/rdma-core/pull/1034
>
> Yishai
>
> Edward Srouji (10):
>    pyverbs: Support DevX UMEM registration
>    pyverbs/mlx5: Support EQN querying
>    pyverbs/mlx5: Support more DevX objects
>    pyverbs: Add auxiliary memory functions
>    pyverbs/mlx5: Add support to extract mlx5dv objects
>    pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums
>    tests: Add MAC address to the tests' args
>    tests: Add mlx5 DevX data path test
>    pyverbs/mlx5: Support mlx5 devices over VFIO
>    tests: Add a test for mlx5 over VFIO
>
> Maor Gottlieb (1):
>    mlx5: Enable debug functionality for vfio
>
> Mark Zhang (5):
>    util: Add interval_set support
>    mlx5: Support fast teardown over vfio
>    mlx5: VFIO poll_health support
>    mlx5: Set DV context ops
>    mlx5: Implement mlx5dv devx_obj APIs over vfio
>
> Yishai Hadas (11):
>    Update kernel headers
>    mlx5: Introduce mlx5dv_get_vfio_device_list()
>    verbs: Enable verbs_open_device() to work over non sysfs devices
>    mlx5: Setup mlx5 vfio context
>    mlx5: Add mlx5_vfio_cmd_exec() support
>    mlx5: vfio setup function support
>    mlx5: vfio setup basic caps
>    mlx5: Enable interrupt command mode over vfio
>    mlx5: Introduce vfio APIs to process events
>    mlx5: Implement basic verbs operation for PD and MR over vfio
>    mlx5: Support initial DEVX/DV APIs over vfio
>
>   Documentation/pyverbs.md                           |   34 +
>   debian/ibverbs-providers.symbols                   |    4 +
>   kernel-headers/CMakeLists.txt                      |    4 +
>   kernel-headers/linux/vfio.h                        | 1374 ++++++++
>   libibverbs/device.c                                |   39 +-
>   libibverbs/sysfs.c                                 |    5 +
>   providers/mlx5/CMakeLists.txt                      |    3 +-
>   providers/mlx5/dr_rule.c                           |   10 +-
>   providers/mlx5/libmlx5.map                         |    7 +
>   providers/mlx5/man/CMakeLists.txt                  |    3 +
>   .../mlx5/man/mlx5dv_get_vfio_device_list.3.md      |   64 +
>   providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md  |   41 +
>   providers/mlx5/man/mlx5dv_vfio_process_events.3.md |   43 +
>   providers/mlx5/mlx5.c                              |  376 ++-
>   providers/mlx5/mlx5.h                              |  187 +-
>   providers/mlx5/mlx5_ifc.h                          | 1206 ++++++-
>   providers/mlx5/mlx5_vfio.c                         | 3379 ++++++++++++++++++++
>   providers/mlx5/mlx5_vfio.h                         |  329 ++
>   providers/mlx5/mlx5dv.h                            |   25 +
>   providers/mlx5/verbs.c                             |  966 +++++-
>   pyverbs/CMakeLists.txt                             |    7 +
>   pyverbs/dma_util.pyx                               |   25 +
>   pyverbs/mem_alloc.pyx                              |   46 +-
>   pyverbs/pd.pyx                                     |    4 +
>   pyverbs/providers/mlx5/CMakeLists.txt              |    4 +-
>   pyverbs/providers/mlx5/libmlx5.pxd                 |  103 +-
>   pyverbs/providers/mlx5/mlx5_vfio.pxd               |   15 +
>   pyverbs/providers/mlx5/mlx5_vfio.pyx               |  116 +
>   pyverbs/providers/mlx5/mlx5dv.pxd                  |   20 +
>   pyverbs/providers/mlx5/mlx5dv.pyx                  |  277 +-
>   pyverbs/providers/mlx5/mlx5dv_enums.pxd            |   42 +
>   pyverbs/providers/mlx5/mlx5dv_objects.pxd          |   28 +
>   pyverbs/providers/mlx5/mlx5dv_objects.pyx          |  214 ++
>   tests/CMakeLists.txt                               |    3 +
>   tests/args_parser.py                               |    5 +
>   tests/base.py                                      |   14 +-
>   tests/mlx5_base.py                                 |  460 ++-
>   tests/mlx5_prm_structs.py                          | 1046 ++++++
>   tests/test_mlx5_devx.py                            |   23 +
>   tests/test_mlx5_vfio.py                            |  104 +
>   util/CMakeLists.txt                                |    2 +
>   util/interval_set.c                                |  208 ++
>   util/interval_set.h                                |   77 +
>   util/util.h                                        |    5 +
>   44 files changed, 10650 insertions(+), 297 deletions(-)
>   create mode 100644 kernel-headers/linux/vfio.h
>   create mode 100644 providers/mlx5/man/mlx5dv_get_vfio_device_list.3.md
>   create mode 100644 providers/mlx5/man/mlx5dv_vfio_get_events_fd.3.md
>   create mode 100644 providers/mlx5/man/mlx5dv_vfio_process_events.3.md
>   create mode 100644 providers/mlx5/mlx5_vfio.c
>   create mode 100644 providers/mlx5/mlx5_vfio.h
>   create mode 100644 pyverbs/dma_util.pyx
>   create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pxd
>   create mode 100644 pyverbs/providers/mlx5/mlx5_vfio.pyx
>   create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pxd
>   create mode 100644 pyverbs/providers/mlx5/mlx5dv_objects.pyx
>   create mode 100644 tests/mlx5_prm_structs.py
>   create mode 100644 tests/test_mlx5_devx.py
>   create mode 100644 tests/test_mlx5_vfio.py
>   create mode 100644 util/interval_set.c
>   create mode 100644 util/interval_set.h
>
The PR was merged.

Yishai


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-08-01  8:01 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
2021-07-20  8:16 [PATCH rdma-core 00/27] Introduce mlx5 user space driver over VFIO Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 01/27] Update kernel headers Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 02/27] mlx5: Introduce mlx5dv_get_vfio_device_list() Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 03/27] mlx5: Enable debug functionality for vfio Yishai Hadas
2021-07-20  8:51   ` Leon Romanovsky
2021-07-20  9:27     ` Yishai Hadas
2021-07-20 12:27       ` Leon Romanovsky
2021-07-20 14:57         ` Yishai Hadas
2021-07-21  7:05           ` Gal Pressman
2021-07-21  7:58             ` Yishai Hadas
2021-07-21  8:51               ` Gal Pressman
2021-07-20  8:16 ` [PATCH rdma-core 04/27] util: Add interval_set support Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 05/27] verbs: Enable verbs_open_device() to work over non sysfs devices Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 06/27] mlx5: Setup mlx5 vfio context Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 07/27] mlx5: Add mlx5_vfio_cmd_exec() support Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 08/27] mlx5: vfio setup function support Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 09/27] mlx5: vfio setup basic caps Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 10/27] mlx5: Support fast teardown over vfio Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 11/27] mlx5: Enable interrupt command mode " Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 12/27] mlx5: Introduce vfio APIs to process events Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 13/27] mlx5: VFIO poll_health support Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 14/27] mlx5: Implement basic verbs operation for PD and MR over vfio Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 15/27] mlx5: Set DV context ops Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 16/27] mlx5: Support initial DEVX/DV APIs over vfio Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 17/27] mlx5: Implement mlx5dv devx_obj " Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 18/27] pyverbs: Support DevX UMEM registration Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 19/27] pyverbs/mlx5: Support EQN querying Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 20/27] pyverbs/mlx5: Support more DevX objects Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 21/27] pyverbs: Add auxiliary memory functions Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 22/27] pyverbs/mlx5: Add support to extract mlx5dv objects Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 23/27] pyverbs/mlx5: Wrap mlx5_cqe64 struct and add enums Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 24/27] tests: Add MAC address to the tests' args Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 25/27] tests: Add mlx5 DevX data path test Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 26/27] pyverbs/mlx5: Support mlx5 devices over VFIO Yishai Hadas
2021-07-20  8:16 ` [PATCH rdma-core 27/27] tests: Add a test for mlx5 " Yishai Hadas
2021-08-01  8:00 ` [PATCH rdma-core 00/27] Introduce mlx5 user space driver " Yishai Hadas
