All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
@ 2022-02-17 17:44 Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
                   ` (16 more replies)
  0 siblings, 17 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Klaus Jensen, Keith Busch, Lukasz Maniak, qemu-block, Łukasz Gieryk

Changes since v4:
- Added hello world example for SR-IOV to the docs
- Moved AER initialization from nvme_init_ctrl to nvme_init_state
- Fixed division by zero issue in calculation of vqfrt and vifrt
  capabilities

Knut Omang (2):
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

Lukasz Maniak (4):
  hw/nvme: Add support for SR-IOV
  hw/nvme: Add support for Primary Controller Capabilities
  hw/nvme: Add support for Secondary Controller List
  docs: Add documentation for SR-IOV and Virtualization Enhancements

Łukasz Gieryk (9):
  pcie: Add a helper to the SR/IOV API
  pcie: Add 1.2 version token for the Power Management Capability
  hw/nvme: Implement the Function Level Reset
  hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  hw/nvme: Remove reg_size variable and update BAR0 size calculation
  hw/nvme: Calculate BAR attributes in a function
  hw/nvme: Initialize capability structures for primary/secondary
    controllers
  hw/nvme: Add support for the Virtualization Management command
  hw/nvme: Update the initalization place for the AER queue

 docs/pcie_sriov.txt          | 115 ++++++
 docs/system/devices/nvme.rst |  82 +++++
 hw/nvme/ctrl.c               | 674 ++++++++++++++++++++++++++++++++---
 hw/nvme/ns.c                 |   2 +-
 hw/nvme/nvme.h               |  55 ++-
 hw/nvme/subsys.c             |  75 +++-
 hw/nvme/trace-events         |   6 +
 hw/pci/meson.build           |   1 +
 hw/pci/pci.c                 | 100 ++++--
 hw/pci/pcie.c                |   5 +
 hw/pci/pcie_sriov.c          | 302 ++++++++++++++++
 hw/pci/trace-events          |   5 +
 include/block/nvme.h         |  65 ++++
 include/hw/pci/pci.h         |  12 +-
 include/hw/pci/pci_ids.h     |   1 +
 include/hw/pci/pci_regs.h    |   1 +
 include/hw/pci/pcie.h        |   6 +
 include/hw/pci/pcie_sriov.h  |  77 ++++
 include/qemu/typedefs.h      |   2 +
 19 files changed, 1505 insertions(+), 81 deletions(-)
 create mode 100644 docs/pcie_sriov.txt
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

-- 
2.25.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-18  8:24   ` Michael S. Tsirkin
  2022-02-17 17:44 ` [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Keith Busch, Klaus Jensen, Knut Omang

From: Knut Omang <knut.omang@oracle.com>

This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and register/unregister
SR/IOV Virtual Functions.

Signed-off-by: Knut Omang <knuto@ifi.uio.no>
---
 hw/pci/meson.build          |   1 +
 hw/pci/pci.c                | 100 +++++++++---
 hw/pci/pcie.c               |   5 +
 hw/pci/pcie_sriov.c         | 294 ++++++++++++++++++++++++++++++++++++
 hw/pci/trace-events         |   5 +
 include/hw/pci/pci.h        |  12 +-
 include/hw/pci/pcie.h       |   6 +
 include/hw/pci/pcie_sriov.h |  71 +++++++++
 include/qemu/typedefs.h     |   2 +
 9 files changed, 470 insertions(+), 26 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

diff --git a/hw/pci/meson.build b/hw/pci/meson.build
index 5c4bbac8171..bcc9c75919f 100644
--- a/hw/pci/meson.build
+++ b/hw/pci/meson.build
@@ -5,6 +5,7 @@ pci_ss.add(files(
   'pci.c',
   'pci_bridge.c',
   'pci_host.c',
+  'pcie_sriov.c',
   'shpc.c',
   'slotid_cap.c'
 ))
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 5d30f9ca60e..ba8fb92efc6 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
 {
     uint8_t type;
 
+    /* PCIe virtual functions do not have their own BARs */
+    assert(!pci_is_vf(d));
+
     if (reg != PCI_ROM_SLOT)
         return PCI_BASE_ADDRESS_0 + reg * 4;
 
@@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
     }
 }
 
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
 {
     int r;
+    if (pci_is_vf(dev)) {
+        return;
+    }
+
+    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
+        PCIIORegion *region = &dev->io_regions[r];
+        if (!region->size) {
+            continue;
+        }
 
+        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
+            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
+        } else {
+            pci_set_long(dev->config + pci_bar(dev, r), region->type);
+        }
+    }
+}
+
+static void pci_do_device_reset(PCIDevice *dev)
+{
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
@@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
                               pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
                               pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
     dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
-    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
-        PCIIORegion *region = &dev->io_regions[r];
-        if (!region->size) {
-            continue;
-        }
-
-        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
-            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
-        } else {
-            pci_set_long(dev->config + pci_bar(dev, r), region->type);
-        }
-    }
+    pci_reset_regions(dev);
     pci_update_mappings(dev);
 
     msi_reset(dev);
@@ -884,6 +895,16 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
         dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
     }
 
+    /*
+     * With SR/IOV and ARI, a device at function 0 need not be a multifunction
+     * device, as it may just be a VF that ended up with function 0 in
+     * the legacy PCI interpretation. Avoid failing in such cases:
+     */
+    if (pci_is_vf(dev) &&
+        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+        return;
+    }
+
     /*
      * multifunction bit is interpreted in two ways as follows.
      *   - all functions must set the bit to 1.
@@ -1083,6 +1104,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
                    bus->devices[devfn]->name);
         return NULL;
     } else if (dev->hotplugged &&
+               !pci_is_vf(pci_dev) &&
                pci_get_function_0(pci_dev)) {
         error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
                    " new func %s cannot be exposed to guest.",
@@ -1191,6 +1213,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
     pcibus_t size = memory_region_size(memory);
     uint8_t hdr_type;
 
+    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
     assert(region_num >= 0);
     assert(region_num < PCI_NUM_REGIONS);
     assert(is_power_of_2(size));
@@ -1294,11 +1317,45 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
     return pci_dev->io_regions[region_num].addr;
 }
 
-static pcibus_t pci_bar_address(PCIDevice *d,
-                                int reg, uint8_t type, pcibus_t size)
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+                                        uint8_t type, pcibus_t size)
+{
+    pcibus_t new_addr;
+    if (!pci_is_vf(d)) {
+        int bar = pci_bar(d, reg);
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(d->config + bar);
+        } else {
+            new_addr = pci_get_long(d->config + bar);
+        }
+    } else {
+        PCIDevice *pf = d->exp.sriov_vf.pf;
+        uint16_t sriov_cap = pf->exp.sriov_cap;
+        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
+        uint16_t vf_offset =
+            pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+        uint16_t vf_stride =
+            pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
+
+        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            new_addr = pci_get_quad(pf->config + bar);
+        } else {
+            new_addr = pci_get_long(pf->config + bar);
+        }
+        new_addr += vf_num * size;
+    }
+    /* The ROM slot has a specific enable bit, keep it intact */
+    if (reg != PCI_ROM_SLOT) {
+        new_addr &= ~(size - 1);
+    }
+    return new_addr;
+}
+
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size)
 {
     pcibus_t new_addr, last_addr;
-    int bar = pci_bar(d, reg);
     uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
     Object *machine = qdev_get_machine();
     ObjectClass *oc = object_get_class(machine);
@@ -1309,7 +1366,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
         if (!(cmd & PCI_COMMAND_IO)) {
             return PCI_BAR_UNMAPPED;
         }
-        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
+        new_addr = pci_config_get_bar_addr(d, reg, type, size);
         last_addr = new_addr + size - 1;
         /* Check if 32 bit BAR wraps around explicitly.
          * TODO: make priorities correct and remove this work around.
@@ -1324,11 +1381,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     if (!(cmd & PCI_COMMAND_MEMORY)) {
         return PCI_BAR_UNMAPPED;
     }
-    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
-        new_addr = pci_get_quad(d->config + bar);
-    } else {
-        new_addr = pci_get_long(d->config + bar);
-    }
+    new_addr = pci_config_get_bar_addr(d, reg, type, size);
     /* the ROM slot has a specific enable bit */
     if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
         return PCI_BAR_UNMAPPED;
@@ -1473,6 +1526,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
 
     msi_write_config(d, addr, val_in, l);
     msix_write_config(d, addr, val_in, l);
+    pcie_sriov_config_write(d, addr, val_in, l);
 }
 
 /***********************************************************/
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index d7d73a31e4c..3c44204cf31 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -446,6 +446,11 @@ void pcie_cap_slot_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
     PCIDevice *pci_dev = PCI_DEVICE(dev);
     uint32_t lnkcap = pci_get_long(exp_cap + PCI_EXP_LNKCAP);
 
+    if (pci_is_vf(pci_dev)) {
+        /* Virtual function cannot be physically disconnected */
+        return;
+    }
+
     /* Don't send event when device is enabled during qemu machine creation:
      * it is present on boot, no hotplug event is necessary. We do send an
      * event when the device is disabled later. */
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
new file mode 100644
index 00000000000..3f256d483fa
--- /dev/null
+++ b/hw/pci/pcie_sriov.c
@@ -0,0 +1,294 @@
+/*
+ * pcie_sriov.c:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2015-2017 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pcie.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/qdev-properties.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn,
+                              const char *name, uint16_t vf_num);
+static void unregister_vfs(PCIDevice *dev);
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride)
+{
+    uint8_t *cfg = dev->config + offset;
+    uint8_t *wmask;
+
+    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
+                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
+    dev->exp.sriov_cap = offset;
+    dev->exp.sriov_pf.num_vfs = 0;
+    dev->exp.sriov_pf.vfname = g_strdup(vfname);
+    dev->exp.sriov_pf.vf = NULL;
+
+    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
+    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
+
+    /*
+     * Mandatory page sizes to support.
+     * Device implementations can call pcie_sriov_pf_add_sup_pgsize()
+     * to set more bits:
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, SRIOV_SUP_PGSIZE_MINREQ);
+
+    /*
+     * Default is to use 4K pages, software can modify it
+     * to any of the supported bits
+     */
+    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
+
+    /* Set up device ID and initial/total number of VFs available */
+    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
+    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
+    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
+    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
+
+    /* Write enable control bits */
+    wmask = dev->wmask + offset;
+    pci_set_word(wmask + PCI_SRIOV_CTRL,
+                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
+    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
+
+    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
+}
+
+void pcie_sriov_pf_exit(PCIDevice *dev)
+{
+    unregister_vfs(dev);
+    g_free((char *)dev->exp.sriov_pf.vfname);
+    dev->exp.sriov_pf.vfname = NULL;
+}
+
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size)
+{
+    uint32_t addr;
+    uint64_t wmask;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    assert(sriov_cap > 0);
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    assert(region_num != PCI_ROM_SLOT);
+
+    wmask = ~(size - 1);
+    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
+
+    pci_set_long(dev->config + addr, type);
+    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
+        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        pci_set_quad(dev->wmask + addr, wmask);
+        pci_set_quad(dev->cmask + addr, ~0ULL);
+    } else {
+        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
+        pci_set_long(dev->cmask + addr, 0xffffffff);
+    }
+    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
+}
+
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory)
+{
+    PCIIORegion *r;
+    PCIBus *bus = pci_get_bus(dev);
+    uint8_t type;
+    pcibus_t size = memory_region_size(memory);
+
+    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
+    assert(region_num >= 0);
+    assert(region_num < PCI_NUM_REGIONS);
+    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
+
+    if (!is_power_of_2(size)) {
+        error_report("%s: PCI region size must be a power"
+                     " of two - type=0x%x, size=0x%"FMT_PCIBUS,
+                     __func__, type, size);
+        exit(1);
+    }
+
+    r = &dev->io_regions[region_num];
+    r->memory = memory;
+    r->address_space =
+        type & PCI_BASE_ADDRESS_SPACE_IO
+        ? bus->address_space_io
+        : bus->address_space_mem;
+    r->size = size;
+    r->type = type;
+
+    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
+    if (r->addr != PCI_BAR_UNMAPPED) {
+        memory_region_add_subregion_overlap(r->address_space,
+                                            r->addr, r->memory, 1);
+    }
+}
+
+static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
+                              uint16_t vf_num)
+{
+    PCIDevice *dev = pci_new(devfn, name);
+    dev->exp.sriov_vf.pf = pf;
+    dev->exp.sriov_vf.vf_number = vf_num;
+    PCIBus *bus = pci_get_bus(pf);
+    Error *local_err = NULL;
+
+    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return NULL;
+    }
+
+    /* set vid/did according to sr/iov spec - they are not used */
+    pci_config_set_vendor_id(dev->config, 0xffff);
+    pci_config_set_device_id(dev->config, 0xffff);
+
+    return dev;
+}
+
+static void register_vfs(PCIDevice *dev)
+{
+    uint16_t num_vfs;
+    uint16_t i;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    uint16_t vf_offset =
+        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
+    uint16_t vf_stride =
+        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
+    int32_t devfn = dev->devfn + vf_offset;
+
+    assert(sriov_cap > 0);
+    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+
+    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
+    assert(dev->exp.sriov_pf.vf);
+
+    trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
+                             PCI_FUNC(dev->devfn), num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn,
+                                              dev->exp.sriov_pf.vfname, i);
+        if (!dev->exp.sriov_pf.vf[i]) {
+            num_vfs = i;
+            break;
+        }
+        devfn += vf_stride;
+    }
+    dev->exp.sriov_pf.num_vfs = num_vfs;
+}
+
+static void unregister_vfs(PCIDevice *dev)
+{
+    Error *local_err = NULL;
+    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
+    uint16_t i;
+
+    trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
+                               PCI_FUNC(dev->devfn), num_vfs);
+    for (i = 0; i < num_vfs; i++) {
+        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
+        object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
+        if (local_err) {
+            fprintf(stderr, "Failed to unplug: %s\n",
+                    error_get_pretty(local_err));
+            error_free(local_err);
+        }
+        object_unparent(OBJECT(vf));
+    }
+    g_free(dev->exp.sriov_pf.vf);
+    dev->exp.sriov_pf.vf = NULL;
+    dev->exp.sriov_pf.num_vfs = 0;
+    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
+}
+
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
+                             uint32_t val, int len)
+{
+    uint32_t off;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+
+    if (!sriov_cap || address < sriov_cap) {
+        return;
+    }
+    off = address - sriov_cap;
+    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
+        return;
+    }
+
+    trace_sriov_config_write(dev->name, PCI_SLOT(dev->devfn),
+                             PCI_FUNC(dev->devfn), off, val, len);
+
+    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+        if (dev->exp.sriov_pf.num_vfs) {
+            if (!(val & PCI_SRIOV_CTRL_VFE)) {
+                unregister_vfs(dev);
+            }
+        } else {
+            if (val & PCI_SRIOV_CTRL_VFE) {
+                register_vfs(dev);
+            }
+        }
+    }
+}
+
+
+/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
+{
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    if (sriov_cap) {
+        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
+        if (val & PCI_SRIOV_CTRL_VFE) {
+            val &= ~PCI_SRIOV_CTRL_VFE;
+            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
+        }
+    }
+}
+
+/* Add optional supported page sizes to the mask of supported page sizes */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
+{
+    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
+    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
+
+    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
+
+    sup_pgsize |= opt_sup_pgsize;
+
+    /*
+     * Make sure the new bits are set, and that system page size
+     * also can be set to any of the new values according to spec:
+     */
+    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
+    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
+}
+
+
+uint16_t pcie_sriov_vf_number(PCIDevice *dev)
+{
+    assert(pci_is_vf(dev));
+    return dev->exp.sriov_vf.vf_number;
+}
+
+
+PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
+{
+    return dev->exp.sriov_vf.pf;
+}
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index 7570752c404..aaf46bc92df 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -10,3 +10,8 @@ pci_cfg_write(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, unsig
 
 # msix.c
 msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
+
+# hw/pci/pcie_sriov.c
+sriov_register_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: creating %d vf devs"
+sriov_unregister_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: Unregistering %d vf devs"
+sriov_config_write(const char *name, int slot, int fun, uint32_t offset, uint32_t val, uint32_t len) "%s %02x:%x: sriov offset 0x%x val 0x%x len %d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index c3f3c90473e..3a32b8dd40a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -7,9 +7,6 @@
 /* PCI includes legacy ISA access.  */
 #include "hw/isa/isa.h"
 
-#include "hw/pci/pcie.h"
-#include "qom/object.h"
-
 extern bool pci_available;
 
 /* PCI bus */
@@ -157,6 +154,7 @@ enum {
 #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
 
 #include "hw/pci/pci_regs.h"
+#include "hw/pci/pcie.h"
 
 /* PCI HEADER_TYPE */
 #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
@@ -499,6 +497,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
 
+pcibus_t pci_bar_address(PCIDevice *d,
+                         int reg, uint8_t type, pcibus_t size);
+
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
 {
@@ -779,6 +780,11 @@ static inline int pci_is_express_downstream_port(const PCIDevice *d)
     return type == PCI_EXP_TYPE_DOWNSTREAM || type == PCI_EXP_TYPE_ROOT_PORT;
 }
 
+static inline int pci_is_vf(const PCIDevice *d)
+{
+    return d->exp.sriov_vf.pf != NULL;
+}
+
 static inline uint32_t pci_config_size(const PCIDevice *d)
 {
     return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 6063bee0ec6..168950a83b7 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -24,6 +24,7 @@
 #include "hw/pci/pci_regs.h"
 #include "hw/pci/pcie_regs.h"
 #include "hw/pci/pcie_aer.h"
+#include "hw/pci/pcie_sriov.h"
 #include "hw/hotplug.h"
 
 typedef enum {
@@ -81,6 +82,11 @@ struct PCIExpressDevice {
 
     /* ACS */
     uint16_t acs_cap;
+
+    /* SR/IOV */
+    uint16_t sriov_cap;
+    PCIESriovPF sriov_pf;
+    PCIESriovVF sriov_vf;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
new file mode 100644
index 00000000000..990cff0a1c6
--- /dev/null
+++ b/include/hw/pci/pcie_sriov.h
@@ -0,0 +1,71 @@
+/*
+ * pcie_sriov.h:
+ *
+ * Implementation of SR/IOV emulation support.
+ *
+ * Copyright (c) 2015 Knut Omang <knut.omang@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_PCIE_SRIOV_H
+#define QEMU_PCIE_SRIOV_H
+
+struct PCIESriovPF {
+    uint16_t num_vfs;   /* Number of virtual functions created */
+    uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
+    const char *vfname; /* Reference to the device type used for the VFs */
+    PCIDevice **vf;     /* Pointer to an array of num_vfs VF devices */
+};
+
+struct PCIESriovVF {
+    PCIDevice *pf;      /* Pointer back to owner physical function */
+    uint16_t vf_number; /* Logical VF number of this function */
+};
+
+void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+                        const char *vfname, uint16_t vf_dev_id,
+                        uint16_t init_vfs, uint16_t total_vfs,
+                        uint16_t vf_offset, uint16_t vf_stride);
+void pcie_sriov_pf_exit(PCIDevice *dev);
+
+/* Set up a VF bar in the SR/IOV bar area */
+void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
+                               uint8_t type, dma_addr_t size);
+
+/* Instantiate a bar for a VF */
+void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
+                                MemoryRegion *memory);
+
+/*
+ * Default (minimal) page size support values
+ * as required by the SR/IOV standard:
+ * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
+ */
+#define SRIOV_SUP_PGSIZE_MINREQ 0x553
+
+/*
+ * Optionally add supported page sizes to the mask of supported page sizes
+ * Page size values are interpreted as opt_sup_pgsize << 12.
+ */
+void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
+
+/* SR/IOV capability config write handler */
+void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
+                             uint32_t val, int len);
+
+/* Reset SR/IOV VF Enable bit to unregister all VFs */
+void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
+
+/* Get logical VF number of a VF - only valid for VFs */
+uint16_t pcie_sriov_vf_number(PCIDevice *dev);
+
+/*
+ * Get the physical function that owns this VF.
+ * Returns NULL if dev is not a virtual function
+ */
+PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
+
+#endif /* QEMU_PCIE_SRIOV_H */
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index ee60eb3de47..5b302cb2145 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -86,6 +86,8 @@ typedef struct PCIDevice PCIDevice;
 typedef struct PCIEAERErr PCIEAERErr;
 typedef struct PCIEAERLog PCIEAERLog;
 typedef struct PCIEAERMsg PCIEAERMsg;
+typedef struct PCIESriovPF PCIESriovPF;
+typedef struct PCIESriovVF PCIESriovVF;
 typedef struct PCIEPort PCIEPort;
 typedef struct PCIESlot PCIESlot;
 typedef struct PCIExpressDevice PCIExpressDevice;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-18  8:24   ` Michael S. Tsirkin
  2022-02-17 17:44 ` [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API Lukasz Maniak
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Keith Busch, Klaus Jensen, Knut Omang

From: Knut Omang <knut.omang@oracle.com>

Add a small intro + minimal documentation for how to
implement SR/IOV support for an emulated device.

Signed-off-by: Knut Omang <knuto@ifi.uio.no>
---
 docs/pcie_sriov.txt | 115 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 docs/pcie_sriov.txt

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
new file mode 100644
index 00000000000..f5e891e1d45
--- /dev/null
+++ b/docs/pcie_sriov.txt
@@ -0,0 +1,115 @@
+PCI SR/IOV EMULATION SUPPORT
+============================
+
+Description
+===========
+SR/IOV (Single Root I/O Virtualization) is an optional extended capability
+of a PCI Express device. It allows a single physical function (PF) to appear as multiple
+virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+Qemu now implements the basic common functionality to enable an emulated device
+to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a
+proof-of-concept hack of the Intel igb can be found here:
+
+git://github.com/knuto/qemu.git sriov_patches_v5
+
+Implementation
+==============
+Implementing emulation of an SR/IOV capable device typically consists of
+implementing support for two types of device classes; the "normal" physical device
+(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just
+like other devices, except that some of their properties are derived from
+the PF.
+
+A virtual function is different from a physical function in that the BAR
+space for all VFs are defined by the BAR registers in the PFs SR/IOV
+capability. All VFs have the same BARs and BAR sizes.
+
+Accesses to these virtual BARs then is computed as
+
+   <VF BAR start> + <VF number> * <BAR sz> + <offset>
+
+From our emulation perspective this means that there is a separate call for
+setting up a BAR for a VF.
+
+1) To enable SR/IOV support in the PF, it must be a PCI Express device so
+   you would need to add a PCI Express capability in the normal PCI
+   capability list. You might also want to add an ARI (Alternative
+   Routing-ID Interpretation) capability to indicate that your device
+   supports functions beyond it's "own" function space (0-7),
+   which is necessary to support more than 7 functions, or
+   if functions extends beyond offset 7 because they are placed at an
+   offset > 1 or have stride > 1.
+
+   ...
+   #include "hw/pci/pcie.h"
+   #include "hw/pci/pcie_sriov.h"
+
+   pci_your_pf_dev_realize( ... )
+   {
+      ...
+      int ret = pcie_endpoint_cap_init(d, 0x70);
+      ...
+      pcie_ari_init(d, 0x100, 1);
+      ...
+
+      /* Add and initialize the SR/IOV capability */
+      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+                       vf_devid, initial_vfs, total_vfs,
+                       fun_offset, stride);
+
+      /* Set up individual VF BARs (parameters as for normal BARs) */
+      pcie_sriov_pf_init_vf_bar( ... )
+      ...
+   }
+
+   For cleanup, you simply call:
+
+      pcie_sriov_pf_exit(device);
+
+   which will delete all the virtual functions and associated resources.
+
+2) Similarly in the implementation of the virtual function, you need to
+   make it a PCI Express device and add a similar set of capabilities
+   except for the SR/IOV capability. Then you need to set up the VF BARs as
+   subregions of the PFs SR/IOV VF BARs by calling
+   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
+
+   pci_your_vf_dev_realize( ... )
+   {
+      ...
+      int ret = pcie_endpoint_cap_init(d, 0x60);
+      ...
+      pcie_ari_init(d, 0x100, 1);
+      ...
+      memory_region_init(mr, ... )
+      pcie_sriov_vf_register_bar(d, bar_nr, mr);
+      ...
+   }
+
+Testing on Linux guest
+======================
+The easiest is if your device driver supports sysfs based SR/IOV
+enabling. Support for this was added in kernel v.3.8, so not all drivers
+support it yet.
+
+To enable 4 VFs for a device at 01:00.0:
+
+	modprobe yourdriver
+	echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+You should now see 4 VFs with lspci.
+To turn SR/IOV off again - the standard requires you to turn it off before you can enable
+another VF count, and the emulation enforces this:
+
+	echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+Older drivers typically provide a max_vfs module parameter
+to enable it at load time:
+
+	modprobe yourdriver max_vfs=4
+
+To disable the VFs again then, you simply have to unload the driver:
+
+	rmmod yourdriver
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-18  8:25   ` Michael S. Tsirkin
  2022-02-17 17:44 ` [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Keith Busch, Klaus Jensen, Knut Omang

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Convenience function for retrieving the PCIDevice object of the N-th VF.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Reviewed-by: Knut Omang <knuto@ifi.uio.no>
---
 hw/pci/pcie_sriov.c         | 10 +++++++++-
 include/hw/pci/pcie_sriov.h |  6 ++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 3f256d483fa..87abad6ac86 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -287,8 +287,16 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
     return dev->exp.sriov_vf.vf_number;
 }
 
-
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 {
     return dev->exp.sriov_vf.pf;
 }
+
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
+{
+    assert(!pci_is_vf(dev));
+    if (n < dev->exp.sriov_pf.num_vfs) {
+        return dev->exp.sriov_pf.vf[n];
+    }
+    return NULL;
+}
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 990cff0a1c6..80f5c84e75c 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -68,4 +68,10 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev);
  */
 PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
 
+/*
+ * Get the n-th VF of this physical function - only valid for PF.
+ * Returns NULL if index is invalid
+ */
+PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
+
 #endif /* QEMU_PCIE_SRIOV_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (2 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-18  8:25   ` Michael S. Tsirkin
  2022-02-17 17:44 ` [PATCH v5 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Keith Busch, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 include/hw/pci/pci_regs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index 77ba64b9314..a5901409622 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -4,5 +4,6 @@
 #include "standard-headers/linux/pci_regs.h"
 
 #define  PCI_PM_CAP_VER_1_1     0x0002  /* PCI PM spec ver. 1.1 */
+#define  PCI_PM_CAP_VER_1_2     0x0003  /* PCI PM spec ver. 1.2 */
 
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 05/15] hw/nvme: Add support for SR-IOV
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (3 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-18  7:06   ` Klaus Jensen
  2022-02-17 17:44 ` [PATCH v5 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk,
	Lukasz Maniak, Keith Busch, Klaus Jensen

This patch implements initial support for Single Root I/O Virtualization
on an NVMe device.

Essentially, it allows to define the maximum number of virtual functions
supported by the NVMe controller via sriov_max_vfs parameter.

Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
capability by a physical controller and ARI capability by both the
physical and virtual function devices.

NVMe controllers created via virtual functions mirror functionally
the physical controller, which may not entirely be the case, thus
consideration would be needed on the way to limit the capabilities of
the VF.

NVMe subsystem is required for the use of SR-IOV.

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 hw/nvme/ctrl.c           | 85 ++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h           |  3 +-
 include/hw/pci/pci_ids.h |  1 +
 3 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 98aac98bef5..adeba0b2b6d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -35,6 +35,7 @@
  *              mdts=<N[optional]>,vsl=<N[optional]>, \
  *              zoned.zasl=<N[optional]>, \
  *              zoned.auto_transition=<on|off[optional]>, \
+ *              sriov_max_vfs=<N[optional]> \
  *              subsys=<subsys_id>
  *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
  *              zoned=<true|false[optional]>, \
@@ -106,6 +107,12 @@
  *   transitioned to zone state closed for resource management purposes.
  *   Defaults to 'on'.
  *
+ * - `sriov_max_vfs`
+ *   Indicates the maximum number of PCIe virtual functions supported
+ *   by the controller. The default value is 0. Specifying a non-zero value
+ *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
+ *   Virtual function controllers will not report SR-IOV capability.
+ *
  * nvme namespace device parameters
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  * - `shared`
@@ -160,6 +167,7 @@
 #include "sysemu/block-backend.h"
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
+#include "hw/pci/pcie_sriov.h"
 #include "migration/vmstate.h"
 
 #include "nvme.h"
@@ -175,6 +183,9 @@
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+#define NVME_MAX_VFS 127
+#define NVME_VF_OFFSET 0x1
+#define NVME_VF_STRIDE 1
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -5742,6 +5753,10 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
         g_free(event);
     }
 
+    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
+        pcie_sriov_pf_disable_vfs(&n->parent_obj);
+    }
+
     n->aer_queued = 0;
     n->outstanding_aers = 0;
     n->qs_created = false;
@@ -6423,6 +6438,29 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
         error_setg(errp, "vsl must be non-zero");
         return;
     }
+
+    if (params->sriov_max_vfs) {
+        if (!n->subsys) {
+            error_setg(errp, "subsystem is required for the use of SR-IOV");
+            return;
+        }
+
+        if (params->sriov_max_vfs > NVME_MAX_VFS) {
+            error_setg(errp, "sriov_max_vfs must be between 0 and %d",
+                       NVME_MAX_VFS);
+            return;
+        }
+
+        if (params->cmb_size_mb) {
+            error_setg(errp, "CMB is not supported with SR-IOV");
+            return;
+        }
+
+        if (n->pmr.dev) {
+            error_setg(errp, "PMR is not supported with SR-IOV");
+            return;
+        }
+    }
 }
 
 static void nvme_init_state(NvmeCtrl *n)
@@ -6480,6 +6518,20 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
+                            uint64_t bar_size)
+{
+    uint16_t vf_dev_id = n->params.use_intel_id ?
+                         PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+
+    pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
+                       n->params.sriov_max_vfs, n->params.sriov_max_vfs,
+                       NVME_VF_OFFSET, NVME_VF_STRIDE);
+
+    pcie_sriov_pf_init_vf_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                              PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
@@ -6494,7 +6546,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     if (n->params.use_intel_id) {
         pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
-        pci_config_set_device_id(pci_conf, 0x5845);
+        pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME);
     } else {
         pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
         pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME);
@@ -6502,6 +6554,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
     pcie_endpoint_cap_init(pci_dev, 0x80);
+    if (n->params.sriov_max_vfs) {
+        pcie_ari_init(pci_dev, 0x100, 1);
+    }
 
     bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
     msix_table_offset = bar_size;
@@ -6520,8 +6575,12 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
                           n->reg_size);
     memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
-    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
-                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
+    if (pci_is_vf(pci_dev)) {
+        pcie_sriov_vf_register_bar(pci_dev, 0, &n->bar0);
+    } else {
+        pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                         PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
+    }
     ret = msix_init(pci_dev, n->params.msix_qsize,
                     &n->bar0, 0, msix_table_offset,
                     &n->bar0, 0, msix_pba_offset, 0, &err);
@@ -6542,6 +6601,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         nvme_init_pmr(n, pci_dev);
     }
 
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+        nvme_init_sriov(n, pci_dev, 0x120, bar_size);
+    }
+
     return 0;
 }
 
@@ -6691,6 +6754,16 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NvmeCtrl *n = NVME(pci_dev);
     NvmeNamespace *ns;
     Error *local_err = NULL;
+    NvmeCtrl *pn = NVME(pcie_sriov_get_pf(pci_dev));
+
+    if (pci_is_vf(pci_dev)) {
+        /*
+         * VFs derive settings from the parent. PF's lifespan exceeds
+         * that of VF's, so it's safe to share params.serial.
+         */
+        memcpy(&n->params, &pn->params, sizeof(NvmeParams));
+        n->subsys = pn->subsys;
+    }
 
     nvme_check_constraints(n, &local_err);
     if (local_err) {
@@ -6755,6 +6828,11 @@ static void nvme_exit(PCIDevice *pci_dev)
     if (n->pmr.dev) {
         host_memory_backend_set_mapped(n->pmr.dev, false);
     }
+
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+        pcie_sriov_pf_exit(pci_dev);
+    }
+
     msix_uninit(pci_dev, &n->bar0, &n->bar0);
     memory_region_del_subregion(&n->bar0, &n->iomem);
 }
@@ -6779,6 +6857,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
+    DEFINE_PROP_UINT8("sriov_max_vfs", NvmeCtrl, params.sriov_max_vfs, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 90c0bb7ce23..17245db96b5 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -24,7 +24,7 @@
 
 #include "block/nvme.h"
 
-#define NVME_MAX_CONTROLLERS 32
+#define NVME_MAX_CONTROLLERS 256
 #define NVME_MAX_NAMESPACES  256
 #define NVME_EUI64_DEFAULT ((uint64_t)0x5254000000000000)
 
@@ -401,6 +401,7 @@ typedef struct NvmeParams {
     uint8_t  zasl;
     bool     auto_transition_zones;
     bool     legacy_cmb;
+    uint8_t  sriov_max_vfs;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index 11abe22d460..992426768e6 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -237,6 +237,7 @@
 #define PCI_DEVICE_ID_INTEL_82801BA_11   0x244e
 #define PCI_DEVICE_ID_INTEL_82801D       0x24CD
 #define PCI_DEVICE_ID_INTEL_ESB_9        0x25ab
+#define PCI_DEVICE_ID_INTEL_NVME         0x5845
 #define PCI_DEVICE_ID_INTEL_82371SB_0    0x7000
 #define PCI_DEVICE_ID_INTEL_82371SB_1    0x7010
 #define PCI_DEVICE_ID_INTEL_82371SB_2    0x7020
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 06/15] hw/nvme: Add support for Primary Controller Capabilities
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (4 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Lukasz Maniak,
	Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Łukasz Gieryk,
	Klaus Jensen

Implementation of Primary Controller Capabilities data
structure (Identify command with CNS value of 14h).

Currently, the command returns only ID of a primary controller.
Handling of remaining fields are added in subsequent patches
implementing virtualization enhancements.

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c       | 23 ++++++++++++++++++-----
 hw/nvme/nvme.h       |  2 ++
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 23 +++++++++++++++++++++++
 4 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index adeba0b2b6d..0bd55948ce1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4697,6 +4697,14 @@ static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
     return nvme_c2h(n, (uint8_t *)list, sizeof(list), req);
 }
 
+static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
+{
+    trace_pci_nvme_identify_pri_ctrl_cap(le16_to_cpu(n->pri_ctrl_cap.cntlid));
+
+    return nvme_c2h(n, (uint8_t *)&n->pri_ctrl_cap,
+                    sizeof(NvmePriCtrlCap), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
                                      bool active)
 {
@@ -4915,6 +4923,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
         return nvme_identify_ctrl_list(n, req, true);
     case NVME_ID_CNS_CTRL_LIST:
         return nvme_identify_ctrl_list(n, req, false);
+    case NVME_ID_CNS_PRIMARY_CTRL_CAP:
+        return nvme_identify_pri_ctrl_cap(n, req);
     case NVME_ID_CNS_CS_NS:
         return nvme_identify_ns_csi(n, req, true);
     case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6465,6 +6475,8 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 
 static void nvme_init_state(NvmeCtrl *n)
 {
+    NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
                            2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
@@ -6474,6 +6486,8 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+
+    cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -6774,15 +6788,14 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     qbus_init(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
               &pci_dev->qdev, n->parent_obj.qdev.id);
 
-    nvme_init_state(n);
-    if (nvme_init_pci(n, pci_dev, errp)) {
-        return;
-    }
-
     if (nvme_init_subsys(n, errp)) {
         error_propagate(errp, local_err);
         return;
     }
+    nvme_init_state(n);
+    if (nvme_init_pci(n, pci_dev, errp)) {
+        return;
+    }
     nvme_init_ctrl(n, pci_dev);
 
     /* setup a namespace if the controller drive property was given */
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 17245db96b5..2db48eb25c9 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -471,6 +471,8 @@ typedef struct NvmeCtrl {
         };
         uint32_t    async_config;
     } features;
+
+    NvmePriCtrlCap  pri_ctrl_cap;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 90730d802fe..bfc09dddc62 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -52,6 +52,7 @@ pci_nvme_identify_ctrl(void) "identify controller"
 pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16""
+pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index cd068ac8914..73666cc900a 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1019,6 +1019,7 @@ enum NvmeIdCns {
     NVME_ID_CNS_NS_PRESENT            = 0x11,
     NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
     NVME_ID_CNS_CTRL_LIST             = 0x13,
+    NVME_ID_CNS_PRIMARY_CTRL_CAP      = 0x14,
     NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
     NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
     NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
@@ -1503,6 +1504,27 @@ typedef enum NvmeZoneState {
     NVME_ZONE_STATE_OFFLINE          = 0x0f,
 } NvmeZoneState;
 
+typedef struct QEMU_PACKED NvmePriCtrlCap {
+    uint16_t    cntlid;
+    uint16_t    portid;
+    uint8_t     crt;
+    uint8_t     rsvd5[27];
+    uint32_t    vqfrt;
+    uint32_t    vqrfa;
+    uint16_t    vqrfap;
+    uint16_t    vqprt;
+    uint16_t    vqfrsm;
+    uint16_t    vqgran;
+    uint8_t     rsvd48[16];
+    uint32_t    vifrt;
+    uint32_t    virfa;
+    uint16_t    virfap;
+    uint16_t    viprt;
+    uint16_t    vifrsm;
+    uint16_t    vigran;
+    uint8_t     rsvd80[4016];
+} NvmePriCtrlCap;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1535,5 +1557,6 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDifTuple) != 8);
+    QEMU_BUILD_BUG_ON(sizeof(NvmePriCtrlCap) != 4096);
 }
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 07/15] hw/nvme: Add support for Secondary Controller List
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (5 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 08/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Klaus Jensen

Introduce handling for Secondary Controller List (Identify command with
CNS value of 15h).

Secondary controller ids are unique in the subsystem, hence they are
reserved by it upon initialization of the primary controller to the
number of sriov_max_vfs.

ID reservation requires the addition of an intermediate controller slot
state, so the reserved controller has the address 0xFFFF.
A secondary controller is in the reserved state when it has no virtual
function assigned, but its primary controller is realized.
Secondary controller reservations are released to NULL when its primary
controller is unregistered.

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 hw/nvme/ctrl.c       | 35 +++++++++++++++++++++
 hw/nvme/ns.c         |  2 +-
 hw/nvme/nvme.h       | 18 +++++++++++
 hw/nvme/subsys.c     | 75 ++++++++++++++++++++++++++++++++++++++------
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 20 ++++++++++++
 6 files changed, 141 insertions(+), 10 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 0bd55948ce1..05acd681656 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4705,6 +4705,29 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, NvmeRequest *req)
                     sizeof(NvmePriCtrlCap), req);
 }
 
+static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint16_t pri_ctrl_id = le16_to_cpu(n->pri_ctrl_cap.cntlid);
+    uint16_t min_id = le16_to_cpu(c->ctrlid);
+    uint8_t num_sec_ctrl = n->sec_ctrl_list.numcntl;
+    NvmeSecCtrlList list = {0};
+    uint8_t i;
+
+    for (i = 0; i < num_sec_ctrl; i++) {
+        if (n->sec_ctrl_list.sec[i].scid >= min_id) {
+            list.numcntl = num_sec_ctrl - i;
+            memcpy(&list.sec, n->sec_ctrl_list.sec + i,
+                   list.numcntl * sizeof(NvmeSecCtrlEntry));
+            break;
+        }
+    }
+
+    trace_pci_nvme_identify_sec_ctrl_list(pri_ctrl_id, list.numcntl);
+
+    return nvme_c2h(n, (uint8_t *)&list, sizeof(list), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
                                      bool active)
 {
@@ -4925,6 +4948,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
         return nvme_identify_ctrl_list(n, req, false);
     case NVME_ID_CNS_PRIMARY_CTRL_CAP:
         return nvme_identify_pri_ctrl_cap(n, req);
+    case NVME_ID_CNS_SECONDARY_CTRL_LIST:
+        return nvme_identify_sec_ctrl_list(n, req);
     case NVME_ID_CNS_CS_NS:
         return nvme_identify_ns_csi(n, req, true);
     case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6476,6 +6501,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 static void nvme_init_state(NvmeCtrl *n)
 {
     NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i;
 
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
@@ -6487,6 +6515,13 @@ static void nvme_init_state(NvmeCtrl *n)
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
+    list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
+    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+        sctrl = &list->sec[i];
+        sctrl->pcid = cpu_to_le16(n->cntlid);
+        sctrl->vfn = cpu_to_le16(i + 1);
+    }
+
     cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index ee673f1a5be..d42fba117f1 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -567,7 +567,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
             for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
                 NvmeCtrl *ctrl = subsys->ctrls[i];
 
-                if (ctrl) {
+                if (ctrl && ctrl != SUBSYS_SLOT_RSVD) {
                     nvme_attach_ns(ctrl, ns);
                 }
             }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 2db48eb25c9..f4494e5236f 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -43,6 +43,7 @@ typedef struct NvmeBus {
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
     OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
+#define SUBSYS_SLOT_RSVD (void *)0xFFFF
 
 typedef struct NvmeSubsystem {
     DeviceState parent_obj;
@@ -67,6 +68,10 @@ static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
         return NULL;
     }
 
+    if (subsys->ctrls[cntlid] == SUBSYS_SLOT_RSVD) {
+        return NULL;
+    }
+
     return subsys->ctrls[cntlid];
 }
 
@@ -473,6 +478,7 @@ typedef struct NvmeCtrl {
     } features;
 
     NvmePriCtrlCap  pri_ctrl_cap;
+    NvmeSecCtrlList sec_ctrl_list;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
@@ -507,6 +513,18 @@ static inline uint16_t nvme_cid(NvmeRequest *req)
     return le16_to_cpu(req->cqe.cid);
 }
 
+static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n)
+{
+    PCIDevice *pci_dev = &n->parent_obj;
+    NvmeCtrl *pf = NVME(pcie_sriov_get_pf(pci_dev));
+
+    if (pci_is_vf(pci_dev)) {
+        return &pf->sec_ctrl_list.sec[pcie_sriov_vf_number(pci_dev)];
+    }
+
+    return NULL;
+}
+
 void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns);
 uint16_t nvme_bounce_data(NvmeCtrl *n, void *ptr, uint32_t len,
                           NvmeTxDirection dir, NvmeRequest *req);
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index fb58d639504..f445ca86c25 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -11,20 +11,71 @@
 
 #include "nvme.h"
 
-int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
+static int nvme_subsys_reserve_cntlids(NvmeCtrl *n, int start, int num)
 {
     NvmeSubsystem *subsys = n->subsys;
-    int cntlid, nsid;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i, cnt = 0;
+
+    for (i = start; i < ARRAY_SIZE(subsys->ctrls) && cnt < num; i++) {
+        if (!subsys->ctrls[i]) {
+            sctrl = &list->sec[cnt];
+            sctrl->scid = cpu_to_le16(i);
+            subsys->ctrls[i] = SUBSYS_SLOT_RSVD;
+            cnt++;
+        }
+    }
+
+    return cnt;
+}
 
-    for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
-        if (!subsys->ctrls[cntlid]) {
-            break;
+static void nvme_subsys_unreserve_cntlids(NvmeCtrl *n)
+{
+    NvmeSubsystem *subsys = n->subsys;
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    NvmeSecCtrlEntry *sctrl;
+    int i, cntlid;
+
+    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+        sctrl = &list->sec[i];
+        cntlid = le16_to_cpu(sctrl->scid);
+
+        if (cntlid) {
+            assert(subsys->ctrls[cntlid] == SUBSYS_SLOT_RSVD);
+            subsys->ctrls[cntlid] = NULL;
+            sctrl->scid = 0;
         }
     }
+}
+
+int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
+{
+    NvmeSubsystem *subsys = n->subsys;
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
+    int cntlid, nsid, num_rsvd, num_vfs = n->params.sriov_max_vfs;
+
+    if (pci_is_vf(&n->parent_obj)) {
+        cntlid = le16_to_cpu(sctrl->scid);
+    } else {
+        for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
+            if (!subsys->ctrls[cntlid]) {
+                break;
+            }
+        }
 
-    if (cntlid == ARRAY_SIZE(subsys->ctrls)) {
-        error_setg(errp, "no more free controller id");
-        return -1;
+        if (cntlid == ARRAY_SIZE(subsys->ctrls)) {
+            error_setg(errp, "no more free controller id");
+            return -1;
+        }
+
+        num_rsvd = nvme_subsys_reserve_cntlids(n, cntlid + 1, num_vfs);
+        if (num_rsvd != num_vfs) {
+            nvme_subsys_unreserve_cntlids(n);
+            error_setg(errp,
+                       "no more free controller ids for secondary controllers");
+            return -1;
+        }
     }
 
     subsys->ctrls[cntlid] = n;
@@ -41,7 +92,13 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 
 void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
 {
-    subsys->ctrls[n->cntlid] = NULL;
+    if (pci_is_vf(&n->parent_obj)) {
+        subsys->ctrls[n->cntlid] = SUBSYS_SLOT_RSVD;
+    } else {
+        subsys->ctrls[n->cntlid] = NULL;
+        nvme_subsys_unreserve_cntlids(n);
+    }
+
     n->cntlid = -1;
 }
 
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index bfc09dddc62..12afb8478a4 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -53,6 +53,7 @@ pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ctrl_list(uint8_t cns, uint16_t cntid) "cns 0x%"PRIx8" cntid %"PRIu16""
 pci_nvme_identify_pri_ctrl_cap(uint16_t cntlid) "identify primary controller capabilities cntlid=%"PRIu16""
+pci_nvme_identify_sec_ctrl_list(uint16_t cntlid, uint8_t numcntl) "identify secondary controller list cntlid=%"PRIu16" numcntl=%"PRIu8""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 73666cc900a..fde4ddfceec 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1020,6 +1020,7 @@ enum NvmeIdCns {
     NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
     NVME_ID_CNS_CTRL_LIST             = 0x13,
     NVME_ID_CNS_PRIMARY_CTRL_CAP      = 0x14,
+    NVME_ID_CNS_SECONDARY_CTRL_LIST   = 0x15,
     NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
     NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
     NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
@@ -1525,6 +1526,23 @@ typedef struct QEMU_PACKED NvmePriCtrlCap {
     uint8_t     rsvd80[4016];
 } NvmePriCtrlCap;
 
+typedef struct QEMU_PACKED NvmeSecCtrlEntry {
+    uint16_t    scid;
+    uint16_t    pcid;
+    uint8_t     scs;
+    uint8_t     rsvd5[3];
+    uint16_t    vfn;
+    uint16_t    nvq;
+    uint16_t    nvi;
+    uint8_t     rsvd14[18];
+} NvmeSecCtrlEntry;
+
+typedef struct QEMU_PACKED NvmeSecCtrlList {
+    uint8_t             numcntl;
+    uint8_t             rsvd1[31];
+    NvmeSecCtrlEntry    sec[127];
+} NvmeSecCtrlList;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1558,5 +1576,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDifTuple) != 8);
     QEMU_BUILD_BUG_ON(sizeof(NvmePriCtrlCap) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeSecCtrlEntry) != 32);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeSecCtrlList) != 4096);
 }
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 08/15] hw/nvme: Implement the Function Level Reset
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (6 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-17 17:44 ` [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Klaus Jensen, Lukasz Maniak, Keith Busch,
	Łukasz Gieryk, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

This patch implements the Function Level Reset, a feature currently not
implemented for the Nvme device, while listed as a mandatory ("shall")
in the 1.4 spec.

The implementation reuses FLR-related building blocks defined for the
pci-bridge module, and follows the same logic:
    - FLR capability is advertised in the PCIE config,
    - custom pci_write_config callback detects a write to the trigger
      register and performs the PCI reset,
    - which, eventually, calls the custom dc->reset handler.

Depending on reset type, parts of the state should (or should not) be
cleared. To distinguish the type of reset, an additional parameter is
passed to the reset function.

This patch also enables advertisement of the Power Management PCI
capability. The main reason behind it is to announce the no_soft_reset=1
bit, to signal SR-IOV support where each VF can be reset individually.

The implementation purposedly ignores writes to the PMCS.PS register,
as even such naïve behavior is enough to correctly handle the D3->D0
transition.

It’s worth to note, that the power state transition back to to D3, with
all the corresponding side effects, wasn't and stil isn't handled
properly.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c       | 52 ++++++++++++++++++++++++++++++++++++++++----
 hw/nvme/nvme.h       |  5 +++++
 hw/nvme/trace-events |  1 +
 3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 05acd681656..7c1dd80f21d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5757,7 +5757,7 @@ static void nvme_process_sq(void *opaque)
     }
 }
 
-static void nvme_ctrl_reset(NvmeCtrl *n)
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
     NvmeNamespace *ns;
     int i;
@@ -5789,7 +5789,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
     }
 
     if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
-        pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        if (rst != NVME_RESET_CONTROLLER) {
+            pcie_sriov_pf_disable_vfs(&n->parent_obj);
+        }
     }
 
     n->aer_queued = 0;
@@ -6023,7 +6025,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
             }
         } else if (!NVME_CC_EN(data) && NVME_CC_EN(cc)) {
             trace_pci_nvme_mmio_stopped();
-            nvme_ctrl_reset(n);
+            nvme_ctrl_reset(n, NVME_RESET_CONTROLLER);
             cc = 0;
             csts &= ~NVME_CSTS_READY;
         }
@@ -6581,6 +6583,28 @@ static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                               PCI_BASE_ADDRESS_MEM_TYPE_64, bar_size);
 }
 
+static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
+{
+    Error *err = NULL;
+    int ret;
+
+    ret = pci_add_capability(pci_dev, PCI_CAP_ID_PM, offset,
+                             PCI_PM_SIZEOF, &err);
+    if (err) {
+        error_report_err(err);
+        return ret;
+    }
+
+    pci_set_word(pci_dev->config + offset + PCI_PM_PMC,
+                 PCI_PM_CAP_VER_1_2);
+    pci_set_word(pci_dev->config + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_NO_SOFT_RESET);
+    pci_set_word(pci_dev->wmask + offset + PCI_PM_CTRL,
+                 PCI_PM_CTRL_STATE_MASK);
+
+    return 0;
+}
+
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
@@ -6602,7 +6626,9 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+    nvme_add_pm_capability(pci_dev, 0x60);
     pcie_endpoint_cap_init(pci_dev, 0x80);
+    pcie_cap_flr_init(pci_dev);
     if (n->params.sriov_max_vfs) {
         pcie_ari_init(pci_dev, 0x100, 1);
     }
@@ -6852,7 +6878,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     NvmeNamespace *ns;
     int i;
 
-    nvme_ctrl_reset(n);
+    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
     if (n->subsys) {
         for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
@@ -6951,6 +6977,22 @@ static void nvme_set_smart_warning(Object *obj, Visitor *v, const char *name,
     }
 }
 
+static void nvme_pci_reset(DeviceState *qdev)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(qdev);
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_pci_reset();
+    nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
+}
+
+static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
+                                  uint32_t val, int len)
+{
+    pci_default_write_config(dev, address, val, len);
+    pcie_cap_flr_write_config(dev, address, val, len);
+}
+
 static const VMStateDescription nvme_vmstate = {
     .name = "nvme",
     .unmigratable = 1,
@@ -6962,6 +7004,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
     PCIDeviceClass *pc = PCI_DEVICE_CLASS(oc);
 
     pc->realize = nvme_realize;
+    pc->config_write = nvme_pci_write_config;
     pc->exit = nvme_exit;
     pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
     pc->revision = 2;
@@ -6970,6 +7013,7 @@ static void nvme_class_init(ObjectClass *oc, void *data)
     dc->desc = "Non-Volatile Memory Express";
     device_class_set_props(dc, nvme_props);
     dc->vmsd = &nvme_vmstate;
+    dc->reset = nvme_pci_reset;
 }
 
 static void nvme_instance_init(Object *obj)
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index f4494e5236f..5ba07b62dff 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -481,6 +481,11 @@ typedef struct NvmeCtrl {
     NvmeSecCtrlList sec_ctrl_list;
 } NvmeCtrl;
 
+typedef enum NvmeResetType {
+    NVME_RESET_FUNCTION   = 0,
+    NVME_RESET_CONTROLLER = 1,
+} NvmeResetType;
+
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
     if (!nsid || nsid > NVME_MAX_NAMESPACES) {
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index 12afb8478a4..f8bf85fb78f 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -106,6 +106,7 @@ pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_
 pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_zoned_zrwa_implicit_flush(uint64_t zslba, uint32_t nlb) "zslba 0x%"PRIx64" nlb %"PRIu32""
+pci_nvme_pci_reset(void) "PCI Function Level Reset"
 
 # error conditions
 pci_nvme_err_mdts(size_t len) "len %zu"
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (7 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 08/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-03-01 12:22   ` Klaus Jensen
  2022-02-17 17:44 ` [PATCH v5 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation Lukasz Maniak
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Klaus Jensen, Keith Busch, Lukasz Maniak, qemu-block, Łukasz Gieryk

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

The NVMe device defines two properties: max_ioqpairs, msix_qsize. Having
them as constants is problematic for SR-IOV support.

SR-IOV introduces virtual resources (queues, interrupts) that can be
assigned to PF and its dependent VFs. Each device, following a reset,
should work with the configured number of queues. A single constant is
no longer sufficient to hold the whole state.

This patch tries to solve the problem by introducing additional
variables in NvmeCtrl’s state. The variables for, e.g., managing queues
are therefore organized as:
 - n->params.max_ioqpairs – no changes, constant set by the user
 - n->(mutable_state) – (not a part of this patch) user-configurable,
                        specifies number of queues available _after_
                        reset
 - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
                      n->params.max_ioqpairs; initialized in realize()
                      and updated during reset() to reflect user’s
                      changes to the mutable state

Since the number of available i/o queues and interrupts can change in
runtime, buffers for sq/cqs and the MSIX-related structures are
allocated big enough to handle the limits, to completely avoid the
complicated reallocation. A helper function (nvme_update_msixcap_ts)
updates the corresponding capability register, to signal configuration
changes.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c | 52 ++++++++++++++++++++++++++++++++++----------------
 hw/nvme/nvme.h |  2 ++
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 7c1dd80f21d..f1b4026e4f8 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -445,12 +445,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-    return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
+    return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-    return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
+    return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -4188,8 +4188,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
         trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
         return NVME_INVALID_CQID | NVME_DNR;
     }
-    if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
-        n->sq[sqid] != NULL)) {
+    if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) {
         trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
         return NVME_INVALID_QID | NVME_DNR;
     }
@@ -4541,8 +4540,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
                              NVME_CQ_FLAGS_IEN(qflags) != 0);
 
-    if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
-        n->cq[cqid] != NULL)) {
+    if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) {
         trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
         return NVME_INVALID_QID | NVME_DNR;
     }
@@ -4558,7 +4556,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
         trace_pci_nvme_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
-    if (unlikely(vector >= n->params.msix_qsize)) {
+    if (unlikely(vector >= n->conf_msix_qsize)) {
         trace_pci_nvme_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
@@ -5155,13 +5153,12 @@ defaults:
 
         break;
     case NVME_NUMBER_OF_QUEUES:
-        result = (n->params.max_ioqpairs - 1) |
-            ((n->params.max_ioqpairs - 1) << 16);
+        result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16);
         trace_pci_nvme_getfeat_numq(result);
         break;
     case NVME_INTERRUPT_VECTOR_CONF:
         iv = dw11 & 0xffff;
-        if (iv >= n->params.max_ioqpairs + 1) {
+        if (iv >= n->conf_ioqpairs + 1) {
             return NVME_INVALID_FIELD | NVME_DNR;
         }
 
@@ -5316,10 +5313,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 
         trace_pci_nvme_setfeat_numq((dw11 & 0xffff) + 1,
                                     ((dw11 >> 16) & 0xffff) + 1,
-                                    n->params.max_ioqpairs,
-                                    n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+                                    n->conf_ioqpairs,
+                                    n->conf_ioqpairs);
+        req->cqe.result = cpu_to_le32((n->conf_ioqpairs - 1) |
+                                      ((n->conf_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
@@ -5757,8 +5754,24 @@ static void nvme_process_sq(void *opaque)
     }
 }
 
+static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
+{
+    uint8_t *config;
+
+    if (!msix_present(pci_dev)) {
+        return;
+    }
+
+    assert(table_size > 0 && table_size <= pci_dev->msix_entries_nr);
+
+    config = pci_dev->config + pci_dev->msix_cap;
+    pci_set_word_by_mask(config + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_QSIZE,
+                         table_size - 1);
+}
+
 static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
+    PCIDevice *pci_dev = &n->parent_obj;
     NvmeNamespace *ns;
     int i;
 
@@ -5788,15 +5801,17 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
         g_free(event);
     }
 
-    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
+    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
         if (rst != NVME_RESET_CONTROLLER) {
-            pcie_sriov_pf_disable_vfs(&n->parent_obj);
+            pcie_sriov_pf_disable_vfs(pci_dev);
         }
     }
 
     n->aer_queued = 0;
     n->outstanding_aers = 0;
     n->qs_created = false;
+
+    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -6507,6 +6522,9 @@ static void nvme_init_state(NvmeCtrl *n)
     NvmeSecCtrlEntry *sctrl;
     int i;
 
+    n->conf_ioqpairs = n->params.max_ioqpairs;
+    n->conf_msix_qsize = n->params.msix_qsize;
+
     /* add one to max_ioqpairs to account for the admin queue pair */
     n->reg_size = pow2ceil(sizeof(NvmeBar) +
                            2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
@@ -6668,6 +6686,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         }
     }
 
+    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
+
     if (n->params.cmb_size_mb) {
         nvme_init_cmb(n, pci_dev);
     }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 5ba07b62dff..314a2894759 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -434,6 +434,8 @@ typedef struct NvmeCtrl {
     uint64_t    starttime_ms;
     uint16_t    temperature;
     uint8_t     smart_critical_warning;
+    uint32_t    conf_msix_qsize;
+    uint32_t    conf_ioqpairs;
 
     struct {
         MemoryRegion mem;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (8 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
@ 2022-02-17 17:44 ` Lukasz Maniak
  2022-02-17 17:45 ` [PATCH v5 11/15] hw/nvme: Calculate BAR attributes in a function Lukasz Maniak
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Klaus Jensen, Lukasz Maniak, Keith Busch,
	Łukasz Gieryk, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

The n->reg_size parameter unnecessarily splits the BAR0 size calculation
in two phases; removed to simplify the code.

With all the calculations done in one place, it seems the pow2ceil,
applied originally to reg_size, is unnecessary. The rounding should
happen as the last step, when BAR size includes Nvme registers, queue
registers, and MSIX-related space.

Finally, the size of the mmio memory region is extended to cover the 1st
4KiB padding (see the map below). Access to this range is handled as
interaction with a non-existing queue and generates an error trace, so
actually nothing changes, while the reg_size variable is no longer needed.

    --------------------
    |      BAR0        |
    --------------------
    [Nvme Registers    ]
    [Queues            ]
    [power-of-2 padding] - removed in this patch
    [4KiB padding (1)  ]
    [MSIX TABLE        ]
    [4KiB padding (2)  ]
    [MSIX PBA          ]
    [power-of-2 padding]

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c | 10 +++++-----
 hw/nvme/nvme.h |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f1b4026e4f8..6abec8e4369 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6525,9 +6525,6 @@ static void nvme_init_state(NvmeCtrl *n)
     n->conf_ioqpairs = n->params.max_ioqpairs;
     n->conf_msix_qsize = n->params.msix_qsize;
 
-    /* add one to max_ioqpairs to account for the admin queue pair */
-    n->reg_size = pow2ceil(sizeof(NvmeBar) +
-                           2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
     n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
     n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
     n->temperature = NVME_TEMPERATURE;
@@ -6651,7 +6648,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         pcie_ari_init(pci_dev, 0x100, 1);
     }
 
-    bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
+    /* add one to max_ioqpairs to account for the admin queue pair */
+    bar_size = sizeof(NvmeBar) +
+               2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
     msix_table_offset = bar_size;
     msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
 
@@ -6665,7 +6665,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 
     memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-                          n->reg_size);
+                          msix_table_offset);
     memory_region_add_subregion(&n->bar0, 0, &n->iomem);
 
     if (pci_is_vf(pci_dev)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 314a2894759..86b5b321331 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -424,7 +424,6 @@ typedef struct NvmeCtrl {
     uint16_t    max_prp_ents;
     uint16_t    cqe_size;
     uint16_t    sqe_size;
-    uint32_t    reg_size;
     uint32_t    max_q_ents;
     uint8_t     outstanding_aers;
     uint32_t    irq_status;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 11/15] hw/nvme: Calculate BAR attributes in a function
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (9 preceding siblings ...)
  2022-02-17 17:44 ` [PATCH v5 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation Lukasz Maniak
@ 2022-02-17 17:45 ` Lukasz Maniak
  2022-02-17 17:45 ` [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Klaus Jensen, Lukasz Maniak, Keith Busch,
	Łukasz Gieryk, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

An NVMe device with SR-IOV capability calculates the BAR size
differently for PF and VF, so it makes sense to extract the common code
to a separate function.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c | 45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 6abec8e4369..73707565345 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6584,6 +6584,34 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
     memory_region_set_enabled(&n->pmr.dev->mr, false);
 }
 
+static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
+                              unsigned *msix_table_offset,
+                              unsigned *msix_pba_offset)
+{
+    uint64_t bar_size, msix_table_size, msix_pba_size;
+
+    bar_size = sizeof(NvmeBar) + 2 * total_queues * NVME_DB_SIZE;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_table_offset) {
+        *msix_table_offset = bar_size;
+    }
+
+    msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
+    bar_size += msix_table_size;
+    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+    if (msix_pba_offset) {
+        *msix_pba_offset = bar_size;
+    }
+
+    msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
+    bar_size += msix_pba_size;
+
+    bar_size = pow2ceil(bar_size);
+    return bar_size;
+}
+
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
                             uint64_t bar_size)
 {
@@ -6623,7 +6651,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
-    uint64_t bar_size, msix_table_size, msix_pba_size;
+    uint64_t bar_size;
     unsigned msix_table_offset, msix_pba_offset;
     int ret;
 
@@ -6649,19 +6677,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     /* add one to max_ioqpairs to account for the admin queue pair */
-    bar_size = sizeof(NvmeBar) +
-               2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
-    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-    msix_table_offset = bar_size;
-    msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
-
-    bar_size += msix_table_size;
-    bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-    msix_pba_offset = bar_size;
-    msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
-
-    bar_size += msix_pba_size;
-    bar_size = pow2ceil(bar_size);
+    bar_size = nvme_bar_size(n->params.max_ioqpairs + 1, n->params.msix_qsize,
+                             &msix_table_offset, &msix_pba_offset);
 
     memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (10 preceding siblings ...)
  2022-02-17 17:45 ` [PATCH v5 11/15] hw/nvme: Calculate BAR attributes in a function Lukasz Maniak
@ 2022-02-17 17:45 ` Lukasz Maniak
  2022-02-18 14:37   ` Lukasz Maniak
  2022-02-17 17:45 ` [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

With four new properties:
 - sriov_v{i,q}_flexible,
 - sriov_max_v{i,q}_per_vf,
one can configure the number of available flexible resources, as well as
the limits. The primary and secondary controller capability structures
are initialized accordingly.

Since the number of available queues (interrupts) now varies between
VF/PF, BAR size calculation is also adjusted.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c       | 142 ++++++++++++++++++++++++++++++++++++++++---
 hw/nvme/nvme.h       |   4 ++
 include/block/nvme.h |   5 ++
 3 files changed, 144 insertions(+), 7 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 73707565345..2a6a36e733d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -36,6 +36,10 @@
  *              zoned.zasl=<N[optional]>, \
  *              zoned.auto_transition=<on|off[optional]>, \
  *              sriov_max_vfs=<N[optional]> \
+ *              sriov_vq_flexible=<N[optional]> \
+ *              sriov_vi_flexible=<N[optional]> \
+ *              sriov_max_vi_per_vf=<N[optional]> \
+ *              sriov_max_vq_per_vf=<N[optional]> \
  *              subsys=<subsys_id>
  *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
  *              zoned=<true|false[optional]>, \
@@ -113,6 +117,29 @@
  *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
  *   Virtual function controllers will not report SR-IOV capability.
  *
+ *   NOTE: Single Root I/O Virtualization support is experimental.
+ *   All the related parameters may be subject to change.
+ *
+ * - `sriov_vq_flexible`
+ *   Indicates the total number of flexible queue resources assignable to all
+ *   the secondary controllers. Implicitly sets the number of primary
+ *   controller's private resources to `(max_ioqpairs - sriov_vq_flexible)`.
+ *
+ * - `sriov_vi_flexible`
+ *   Indicates the total number of flexible interrupt resources assignable to
+ *   all the secondary controllers. Implicitly sets the number of primary
+ *   controller's private resources to `(msix_qsize - sriov_vi_flexible)`.
+ *
+ * - `sriov_max_vi_per_vf`
+ *   Indicates the maximum number of virtual interrupt resources assignable
+ *   to a secondary controller. The default 0 resolves to
+ *   `(sriov_vi_flexible / sriov_max_vfs)`.
+ *
+ * - `sriov_max_vq_per_vf`
+ *   Indicates the maximum number of virtual queue resources assignable to
+ *   a secondary controller. The default 0 resolves to
+ *   `(sriov_vq_flexible / sriov_max_vfs)`.
+ *
  * nvme namespace device parameters
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  * - `shared`
@@ -184,6 +211,7 @@
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 #define NVME_MAX_VFS 127
+#define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
 
@@ -6512,6 +6540,54 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
             error_setg(errp, "PMR is not supported with SR-IOV");
             return;
         }
+
+        if (!params->sriov_vq_flexible || !params->sriov_vi_flexible) {
+            error_setg(errp, "both sriov_vq_flexible and sriov_vi_flexible"
+                       " must be set for the use of SR-IOV");
+            return;
+        }
+
+        if (params->sriov_vq_flexible < params->sriov_max_vfs * 2) {
+            error_setg(errp, "sriov_vq_flexible must be greater than or equal"
+                       " to %d (sriov_max_vfs * 2)", params->sriov_max_vfs * 2);
+            return;
+        }
+
+        if (params->max_ioqpairs < params->sriov_vq_flexible + 2) {
+            error_setg(errp, "sriov_vq_flexible - max_ioqpairs (PF-private"
+                       " queue resources) must be greater than or equal to 2");
+            return;
+        }
+
+        if (params->sriov_vi_flexible < params->sriov_max_vfs) {
+            error_setg(errp, "sriov_vi_flexible must be greater than or equal"
+                       " to %d (sriov_max_vfs)", params->sriov_max_vfs);
+            return;
+        }
+
+        if (params->msix_qsize < params->sriov_vi_flexible + 1) {
+            error_setg(errp, "sriov_vi_flexible - msix_qsize (PF-private"
+                       " interrupt resources) must be greater than or equal"
+                       " to 1");
+            return;
+        }
+
+        if (params->sriov_max_vi_per_vf &&
+            (params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
+            error_setg(errp, "sriov_max_vi_per_vf must meet:"
+                       " (X - 1) %% %d == 0 and X >= 1",
+                       NVME_VF_RES_GRANULARITY);
+            return;
+        }
+
+        if (params->sriov_max_vq_per_vf &&
+            (params->sriov_max_vq_per_vf < 2 ||
+             (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY)) {
+            error_setg(errp, "sriov_max_vq_per_vf must meet:"
+                       " (X - 1) %% %d == 0 and X >= 2",
+                       NVME_VF_RES_GRANULARITY);
+            return;
+        }
     }
 }
 
@@ -6520,10 +6596,19 @@ static void nvme_init_state(NvmeCtrl *n)
     NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
     NvmeSecCtrlList *list = &n->sec_ctrl_list;
     NvmeSecCtrlEntry *sctrl;
+    uint8_t max_vfs;
     int i;
 
-    n->conf_ioqpairs = n->params.max_ioqpairs;
-    n->conf_msix_qsize = n->params.msix_qsize;
+    if (pci_is_vf(&n->parent_obj)) {
+        sctrl = nvme_sctrl(n);
+        max_vfs = 0;
+        n->conf_ioqpairs = sctrl->nvq ? le16_to_cpu(sctrl->nvq) - 1 : 0;
+        n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1;
+    } else {
+        max_vfs = n->params.sriov_max_vfs;
+        n->conf_ioqpairs = n->params.max_ioqpairs;
+        n->conf_msix_qsize = n->params.msix_qsize;
+    }
 
     n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
     n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
@@ -6532,14 +6617,41 @@ static void nvme_init_state(NvmeCtrl *n)
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
-    list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
-    for (i = 0; i < n->params.sriov_max_vfs; i++) {
+    list->numcntl = cpu_to_le16(max_vfs);
+    for (i = 0; i < max_vfs; i++) {
         sctrl = &list->sec[i];
         sctrl->pcid = cpu_to_le16(n->cntlid);
         sctrl->vfn = cpu_to_le16(i + 1);
     }
 
     cap->cntlid = cpu_to_le16(n->cntlid);
+    cap->crt = NVME_CRT_VQ | NVME_CRT_VI;
+
+    if (pci_is_vf(&n->parent_obj)) {
+        cap->vqprt = cpu_to_le16(1 + n->conf_ioqpairs);
+    } else {
+        cap->vqprt = cpu_to_le16(1 + n->params.max_ioqpairs -
+                                 n->params.sriov_vq_flexible);
+        cap->vqfrt = cpu_to_le32(n->params.sriov_vq_flexible);
+        cap->vqrfap = cap->vqfrt;
+        cap->vqgran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
+        cap->vqfrsm = n->params.sriov_max_vq_per_vf ?
+                        cpu_to_le16(n->params.sriov_max_vq_per_vf) :
+                        cap->vqfrt / MAX(max_vfs, 1);
+    }
+
+    if (pci_is_vf(&n->parent_obj)) {
+        cap->viprt = cpu_to_le16(n->conf_msix_qsize);
+    } else {
+        cap->viprt = cpu_to_le16(n->params.msix_qsize -
+                                 n->params.sriov_vi_flexible);
+        cap->vifrt = cpu_to_le32(n->params.sriov_vi_flexible);
+        cap->virfap = cap->vifrt;
+        cap->vigran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
+        cap->vifrsm = n->params.sriov_max_vi_per_vf ?
+                        cpu_to_le16(n->params.sriov_max_vi_per_vf) :
+                        cap->vifrt / MAX(max_vfs, 1);
+    }
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -6612,11 +6724,14 @@ static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
     return bar_size;
 }
 
-static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
-                            uint64_t bar_size)
+static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
 {
     uint16_t vf_dev_id = n->params.use_intel_id ?
                          PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+    NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+    uint64_t bar_size = nvme_bar_size(le16_to_cpu(cap->vqfrsm),
+                                      le16_to_cpu(cap->vifrsm),
+                                      NULL, NULL);
 
     pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
                        n->params.sriov_max_vfs, n->params.sriov_max_vfs,
@@ -6714,7 +6829,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
     }
 
     if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
-        nvme_init_sriov(n, pci_dev, 0x120, bar_size);
+        nvme_init_sriov(n, pci_dev, 0x120);
     }
 
     return 0;
@@ -6738,6 +6853,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NvmeIdCtrl *id = &n->id_ctrl;
     uint8_t *pci_conf = pci_dev->config;
     uint64_t cap = ldq_le_p(&n->bar.cap);
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
@@ -6829,6 +6945,10 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 
     stl_le_p(&n->bar.vs, NVME_SPEC_VER);
     n->bar.intmc = n->bar.intms = 0;
+
+    if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
+        stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
+    }
 }
 
 static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
@@ -6969,6 +7089,14 @@ static Property nvme_props[] = {
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
     DEFINE_PROP_UINT8("sriov_max_vfs", NvmeCtrl, params.sriov_max_vfs, 0),
+    DEFINE_PROP_UINT16("sriov_vq_flexible", NvmeCtrl,
+                       params.sriov_vq_flexible, 0),
+    DEFINE_PROP_UINT16("sriov_vi_flexible", NvmeCtrl,
+                       params.sriov_vi_flexible, 0),
+    DEFINE_PROP_UINT8("sriov_max_vi_per_vf", NvmeCtrl,
+                      params.sriov_max_vi_per_vf, 0),
+    DEFINE_PROP_UINT8("sriov_max_vq_per_vf", NvmeCtrl,
+                      params.sriov_max_vq_per_vf, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 86b5b321331..82f11bb08f0 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -407,6 +407,10 @@ typedef struct NvmeParams {
     bool     auto_transition_zones;
     bool     legacy_cmb;
     uint8_t  sriov_max_vfs;
+    uint16_t sriov_vq_flexible;
+    uint16_t sriov_vi_flexible;
+    uint8_t  sriov_max_vq_per_vf;
+    uint8_t  sriov_max_vi_per_vf;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index fde4ddfceec..a8192edcd9d 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1526,6 +1526,11 @@ typedef struct QEMU_PACKED NvmePriCtrlCap {
     uint8_t     rsvd80[4016];
 } NvmePriCtrlCap;
 
+typedef enum NvmePriCtrlCapCrt {
+    NVME_CRT_VQ             = 1 << 0,
+    NVME_CRT_VI             = 1 << 1,
+} NvmePriCtrlCapCrt;
+
 typedef struct QEMU_PACKED NvmeSecCtrlEntry {
     uint16_t    scid;
     uint16_t    pcid;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (11 preceding siblings ...)
  2022-02-17 17:45 ` [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
@ 2022-02-17 17:45 ` Lukasz Maniak
  2022-03-01 13:07   ` Klaus Jensen
  2022-02-17 17:45 ` [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Lukasz Maniak, Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Klaus Jensen

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

With the new command one can:
 - assign flexible resources (queues, interrupts) to primary and
   secondary controllers,
 - toggle the online/offline state of given controller.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c       | 257 ++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/nvme.h       |  20 ++++
 hw/nvme/trace-events |   3 +
 include/block/nvme.h |  17 +++
 4 files changed, 295 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2a6a36e733d..a9742cf5051 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -188,6 +188,7 @@
 #include "qemu/error-report.h"
 #include "qemu/log.h"
 #include "qemu/units.h"
+#include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/sysemu.h"
@@ -259,6 +260,7 @@ static const uint32_t nvme_cse_acs[256] = {
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+    [NVME_ADM_CMD_VIRT_MNGMT]       = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -290,6 +292,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 };
 
 static void nvme_process_sq(void *opaque);
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
 
 static uint16_t nvme_sqid(NvmeRequest *req)
 {
@@ -5694,6 +5697,167 @@ out:
     return status;
 }
 
+static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total,
+                                  int *num_prim, int *num_sec)
+{
+    *num_total = le32_to_cpu(rt ?
+                             n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt);
+    *num_prim = le16_to_cpu(rt ?
+                            n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap);
+    *num_sec = le16_to_cpu(rt ? n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa);
+}
+
+static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req,
+                                             uint16_t cntlid, uint8_t rt,
+                                             int nr)
+{
+    int num_total, num_prim, num_sec;
+
+    if (cntlid != n->cntlid) {
+        return NVME_INVALID_CTRL_ID | NVME_DNR;
+    }
+
+    nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+
+    if (nr > num_total) {
+        return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+    }
+
+    if (nr > num_total - num_sec) {
+        return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+    }
+
+    if (rt) {
+        n->next_pri_ctrl_cap.virfap = cpu_to_le16(nr);
+    } else {
+        n->next_pri_ctrl_cap.vqrfap = cpu_to_le16(nr);
+    }
+
+    req->cqe.result = cpu_to_le32(nr);
+    return req->status;
+}
+
+static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl,
+                                 uint8_t rt, int nr)
+{
+    int prev_nr, prev_total;
+
+    if (rt) {
+        prev_nr = le16_to_cpu(sctrl->nvi);
+        prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa);
+        sctrl->nvi = cpu_to_le16(nr);
+        n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr);
+    } else {
+        prev_nr = le16_to_cpu(sctrl->nvq);
+        prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa);
+        sctrl->nvq = cpu_to_le16(nr);
+        n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr);
+    }
+}
+
+static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req,
+                                            uint16_t cntlid, uint8_t rt, int nr)
+{
+    int num_total, num_prim, num_sec, num_free, diff, limit;
+    NvmeSecCtrlEntry *sctrl;
+
+    sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+    if (!sctrl) {
+        return NVME_INVALID_CTRL_ID | NVME_DNR;
+    }
+
+    if (sctrl->scs) {
+        return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+    }
+
+    limit = le16_to_cpu(rt ? n->pri_ctrl_cap.vifrsm : n->pri_ctrl_cap.vqfrsm);
+    if (nr > limit) {
+        return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+    }
+
+    nvme_get_virt_res_num(n, rt, &num_total, &num_prim, &num_sec);
+    num_free = num_total - num_prim - num_sec;
+    diff = nr - le16_to_cpu(rt ? sctrl->nvi : sctrl->nvq);
+
+    if (diff > num_free) {
+        return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+    }
+
+    nvme_update_virt_res(n, sctrl, rt, nr);
+    req->cqe.result = cpu_to_le32(nr);
+
+    return req->status;
+}
+
+static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online)
+{
+    NvmeCtrl *sn = NULL;
+    NvmeSecCtrlEntry *sctrl;
+    int vf_index;
+
+    sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+    if (!sctrl) {
+        return NVME_INVALID_CTRL_ID | NVME_DNR;
+    }
+
+    if (!pci_is_vf(&n->parent_obj)) {
+        vf_index = le16_to_cpu(sctrl->vfn) - 1;
+        sn = NVME(pcie_sriov_get_vf_at_index(&n->parent_obj, vf_index));
+    }
+
+    if (online) {
+        if (!sctrl->nvi || (le16_to_cpu(sctrl->nvq) < 2) || !sn) {
+            return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+        }
+
+        if (!sctrl->scs) {
+            sctrl->scs = 0x1;
+            nvme_ctrl_reset(sn, NVME_RESET_FUNCTION);
+        }
+    } else {
+        nvme_update_virt_res(n, sctrl, NVME_VIRT_RES_INTERRUPT, 0);
+        nvme_update_virt_res(n, sctrl, NVME_VIRT_RES_QUEUE, 0);
+
+        if (sctrl->scs) {
+            sctrl->scs = 0x0;
+            if (sn) {
+                nvme_ctrl_reset(sn, NVME_RESET_FUNCTION);
+            }
+        }
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_virt_mngmt(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
+    uint8_t act = dw10 & 0xf;
+    uint8_t rt = (dw10 >> 8) & 0x7;
+    uint16_t cntlid = (dw10 >> 16) & 0xffff;
+    int nr = dw11 & 0xffff;
+
+    trace_pci_nvme_virt_mngmt(nvme_cid(req), act, cntlid, rt ? "VI" : "VQ", nr);
+
+    if (rt != NVME_VIRT_RES_QUEUE && rt != NVME_VIRT_RES_INTERRUPT) {
+        return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+    }
+
+    switch (act) {
+    case NVME_VIRT_MNGMT_ACTION_SEC_ASSIGN:
+        return nvme_assign_virt_res_to_sec(n, req, cntlid, rt, nr);
+    case NVME_VIRT_MNGMT_ACTION_PRM_ALLOC:
+        return nvme_assign_virt_res_to_prim(n, req, cntlid, rt, nr);
+    case NVME_VIRT_MNGMT_ACTION_SEC_ONLINE:
+        return nvme_virt_set_state(n, cntlid, true);
+    case NVME_VIRT_MNGMT_ACTION_SEC_OFFLINE:
+        return nvme_virt_set_state(n, cntlid, false);
+    default:
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
@@ -5736,6 +5900,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_aer(n, req);
     case NVME_ADM_CMD_NS_ATTACHMENT:
         return nvme_ns_attachment(n, req);
+    case NVME_ADM_CMD_VIRT_MNGMT:
+        return nvme_virt_mngmt(n, req);
     case NVME_ADM_CMD_FORMAT_NVM:
         return nvme_format(n, req);
     default:
@@ -5797,9 +5963,33 @@ static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
                          table_size - 1);
 }
 
+static void nvme_activate_virt_res(NvmeCtrl *n)
+{
+    PCIDevice *pci_dev = &n->parent_obj;
+    NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
+    NvmeSecCtrlEntry *sctrl;
+
+    /* -1 to account for the admin queue */
+    if (pci_is_vf(pci_dev)) {
+        sctrl = nvme_sctrl(n);
+        cap->vqprt = sctrl->nvq;
+        cap->viprt = sctrl->nvi;
+        n->conf_ioqpairs = sctrl->nvq ? le16_to_cpu(sctrl->nvq) - 1 : 0;
+        n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1;
+    } else {
+        cap->vqrfap = n->next_pri_ctrl_cap.vqrfap;
+        cap->virfap = n->next_pri_ctrl_cap.virfap;
+        n->conf_ioqpairs = le16_to_cpu(cap->vqprt) +
+                           le16_to_cpu(cap->vqrfap) - 1;
+        n->conf_msix_qsize = le16_to_cpu(cap->viprt) +
+                             le16_to_cpu(cap->virfap);
+    }
+}
+
 static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
 {
     PCIDevice *pci_dev = &n->parent_obj;
+    NvmeSecCtrlEntry *sctrl;
     NvmeNamespace *ns;
     int i;
 
@@ -5829,9 +6019,20 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
         g_free(event);
     }
 
-    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
+    if (n->params.sriov_max_vfs) {
+        if (!pci_is_vf(pci_dev)) {
+            for (i = 0; i < n->sec_ctrl_list.numcntl; i++) {
+                sctrl = &n->sec_ctrl_list.sec[i];
+                nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
+            }
+
+            if (rst != NVME_RESET_CONTROLLER) {
+                pcie_sriov_pf_disable_vfs(pci_dev);
+            }
+        }
+
         if (rst != NVME_RESET_CONTROLLER) {
-            pcie_sriov_pf_disable_vfs(pci_dev);
+            nvme_activate_virt_res(n);
         }
     }
 
@@ -5840,6 +6041,13 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
     n->qs_created = false;
 
     nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
+
+    if (pci_is_vf(pci_dev)) {
+        sctrl = nvme_sctrl(n);
+        stl_le_p(&n->bar.csts, sctrl->scs ? 0 : NVME_CSTS_FAILED);
+    } else {
+        stl_le_p(&n->bar.csts, 0);
+    }
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -5885,7 +6093,15 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     uint64_t acq = ldq_le_p(&n->bar.acq);
     uint32_t page_bits = NVME_CC_MPS(cc) + 12;
     uint32_t page_size = 1 << page_bits;
+    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 
+    if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
+        trace_pci_nvme_err_startfail_virt_state(le16_to_cpu(sctrl->nvi),
+                                                le16_to_cpu(sctrl->nvq),
+                                                sctrl->scs ? "ONLINE" :
+                                                             "OFFLINE");
+        return -1;
+    }
     if (unlikely(n->cq[0])) {
         trace_pci_nvme_err_startfail_cq();
         return -1;
@@ -6268,6 +6484,12 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
         return 0;
     }
 
+    if (pci_is_vf(&n->parent_obj) && !nvme_sctrl(n)->scs &&
+        addr != NVME_REG_CSTS) {
+        trace_pci_nvme_err_ignored_mmio_vf_offline(addr, size);
+        return 0;
+    }
+
     /*
      * When PMRWBM bit 1 is set then read from
      * from PMRSTS should ensure prior writes
@@ -6417,6 +6639,12 @@ static void nvme_mmio_write(void *opaque, hwaddr addr, uint64_t data,
 
     trace_pci_nvme_mmio_write(addr, data, size);
 
+    if (pci_is_vf(&n->parent_obj) && !nvme_sctrl(n)->scs &&
+        addr != NVME_REG_CSTS) {
+        trace_pci_nvme_err_ignored_mmio_vf_offline(addr, size);
+        return;
+    }
+
     if (addr < sizeof(n->bar)) {
         nvme_write_bar(n, addr, data, size);
     } else {
@@ -7151,9 +7379,34 @@ static void nvme_pci_reset(DeviceState *qdev)
     nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 }
 
+static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
+                                      uint32_t val, int len)
+{
+    NvmeCtrl *n = NVME(dev);
+    NvmeSecCtrlEntry *sctrl;
+    uint16_t sriov_cap = dev->exp.sriov_cap;
+    uint32_t off = address - sriov_cap;
+    int i, num_vfs;
+
+    if (!sriov_cap) {
+        return;
+    }
+
+    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
+        if (!(val & PCI_SRIOV_CTRL_VFE)) {
+            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
+            for (i = 0; i < num_vfs; i++) {
+                sctrl = &n->sec_ctrl_list.sec[i];
+                nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
+            }
+        }
+    }
+}
+
 static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
                                   uint32_t val, int len)
 {
+    nvme_sriov_pre_write_ctrl(dev, address, val, len);
     pci_default_write_config(dev, address, val, len);
     pcie_cap_flr_write_config(dev, address, val, len);
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 82f11bb08f0..279dd7582f4 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -335,6 +335,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc)
     case NVME_ADM_CMD_GET_FEATURES:     return "NVME_ADM_CMD_GET_FEATURES";
     case NVME_ADM_CMD_ASYNC_EV_REQ:     return "NVME_ADM_CMD_ASYNC_EV_REQ";
     case NVME_ADM_CMD_NS_ATTACHMENT:    return "NVME_ADM_CMD_NS_ATTACHMENT";
+    case NVME_ADM_CMD_VIRT_MNGMT:       return "NVME_ADM_CMD_VIRT_MNGMT";
     case NVME_ADM_CMD_FORMAT_NVM:       return "NVME_ADM_CMD_FORMAT_NVM";
     default:                            return "NVME_ADM_CMD_UNKNOWN";
     }
@@ -484,6 +485,10 @@ typedef struct NvmeCtrl {
 
     NvmePriCtrlCap  pri_ctrl_cap;
     NvmeSecCtrlList sec_ctrl_list;
+    struct {
+        uint16_t    vqrfap;
+        uint16_t    virfap;
+    } next_pri_ctrl_cap;    /* These override pri_ctrl_cap after reset */
 } NvmeCtrl;
 
 typedef enum NvmeResetType {
@@ -535,6 +540,21 @@ static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n)
     return NULL;
 }
 
+static inline NvmeSecCtrlEntry *nvme_sctrl_for_cntlid(NvmeCtrl *n,
+                                                      uint16_t cntlid)
+{
+    NvmeSecCtrlList *list = &n->sec_ctrl_list;
+    uint8_t i;
+
+    for (i = 0; i < list->numcntl; i++) {
+        if (le16_to_cpu(list->sec[i].scid) == cntlid) {
+            return &list->sec[i];
+        }
+    }
+
+    return NULL;
+}
+
 void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns);
 uint16_t nvme_bounce_data(NvmeCtrl *n, void *ptr, uint32_t len,
                           NvmeTxDirection dir, NvmeRequest *req);
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index f8bf85fb78f..2a3f3bdb8d6 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -107,6 +107,7 @@ pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", sl
 pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_zoned_zrwa_implicit_flush(uint64_t zslba, uint32_t nlb) "zslba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_pci_reset(void) "PCI Function Level Reset"
+pci_nvme_virt_mngmt(uint16_t cid, uint16_t act, uint16_t cntlid, const char* rt, uint16_t nr) "cid %"PRIu16", act=0x%"PRIx16", ctrlid=%"PRIu16" %s nr=%"PRIu16""
 
 # error conditions
 pci_nvme_err_mdts(size_t len) "len %zu"
@@ -176,7 +177,9 @@ pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
 pci_nvme_err_startfail_zasl_too_small(uint32_t zasl, uint32_t pagesz) "nvme_start_ctrl failed because zone append size limit %"PRIu32" is too small, needs to be >= %"PRIu32""
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_startfail_virt_state(uint16_t vq, uint16_t vi, const char *state) "nvme_start_ctrl failed due to ctrl state: vi=%u vq=%u %s"
 pci_nvme_err_invalid_mgmt_action(uint8_t action) "action=0x%"PRIx8""
+pci_nvme_err_ignored_mmio_vf_offline(uint64_t addr, unsigned size) "addr 0x%"PRIx64" size %d"
 
 # undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a8192edcd9d..ad19d4327e1 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -595,6 +595,7 @@ enum NvmeAdminCommands {
     NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
     NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
     NVME_ADM_CMD_NS_ATTACHMENT  = 0x15,
+    NVME_ADM_CMD_VIRT_MNGMT     = 0x1c,
     NVME_ADM_CMD_FORMAT_NVM     = 0x80,
     NVME_ADM_CMD_SECURITY_SEND  = 0x81,
     NVME_ADM_CMD_SECURITY_RECV  = 0x82,
@@ -886,6 +887,10 @@ enum NvmeStatusCodes {
     NVME_NS_PRIVATE             = 0x0119,
     NVME_NS_NOT_ATTACHED        = 0x011a,
     NVME_NS_CTRL_LIST_INVALID   = 0x011c,
+    NVME_INVALID_CTRL_ID        = 0x011f,
+    NVME_INVALID_SEC_CTRL_STATE = 0x0120,
+    NVME_INVALID_NUM_RESOURCES  = 0x0121,
+    NVME_INVALID_RESOURCE_ID    = 0x0122,
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
@@ -1548,6 +1553,18 @@ typedef struct QEMU_PACKED NvmeSecCtrlList {
     NvmeSecCtrlEntry    sec[127];
 } NvmeSecCtrlList;
 
+typedef enum NvmeVirtMngmtAction {
+    NVME_VIRT_MNGMT_ACTION_PRM_ALLOC    = 0x01,
+    NVME_VIRT_MNGMT_ACTION_SEC_OFFLINE  = 0x07,
+    NVME_VIRT_MNGMT_ACTION_SEC_ASSIGN   = 0x08,
+    NVME_VIRT_MNGMT_ACTION_SEC_ONLINE   = 0x09,
+} NvmeVirtMngmtAction;
+
+typedef enum NvmeVirtualResourceType {
+    NVME_VIRT_RES_QUEUE         = 0x00,
+    NVME_VIRT_RES_INTERRUPT     = 0x01,
+} NvmeVirtualResourceType;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (12 preceding siblings ...)
  2022-02-17 17:45 ` [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
@ 2022-02-17 17:45 ` Lukasz Maniak
  2022-03-01 12:23   ` Klaus Jensen
  2022-02-17 17:45 ` [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue Lukasz Maniak
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Heinrich Schuchardt, qemu-block, Klaus Jensen, Lukasz Maniak,
	Philippe Mathieu-Daudé,
	Keith Busch, Łukasz Gieryk, Klaus Jensen, Alex Bennée

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
---
 docs/system/devices/nvme.rst | 82 ++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index b5acb2a9c19..aba253304e4 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -239,3 +239,85 @@ The virtual namespace device supports DIF- and DIX-based protection information
   to ``1`` to transfer protection information as the first eight bytes of
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
+
+Virtualization Enhancements and SR-IOV (Experimental Support)
+-------------------------------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present (**please note, that they may be
+subject to change**):
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_vq_flexible``
+  Indicates the total number of flexible queue resources assignable to all
+  the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
+
+``sriov_vi_flexible``
+  Indicates the total number of flexible interrupt resources assignable to
+  all the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
+
+``sriov_max_vi_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. The default ``0`` resolves to
+  ``(sriov_vi_flexible / sriov_max_vfs)``
+
+``sriov_max_vq_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. The default ``0`` resolves to
+  ``(sriov_vq_flexible / sriov_max_vfs)``
+
+The simplest possible invocation enables the capability to set up one VF
+controller and assign an admin queue, an IO queue, and a MSI-X interrupt.
+
+.. code-block:: console
+
+   -device nvme-subsys,id=subsys0
+   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
+    sriov_vq_flexible=2,sriov_vi_flexible=1
+
+The minimum steps required to configure a functional NVMe secondary
+controller are:
+
+  * unbind flexible resources from the primary controller
+
+.. code-block:: console
+
+   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
+   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
+
+  * perform a Function Level Reset on the primary controller to actually
+    release the resources
+
+.. code-block:: console
+
+   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
+
+  * enable VF
+
+.. code-block:: console
+
+   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+  * assign the flexible resources to the VF and set it ONLINE
+
+.. code-block:: console
+
+   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
+   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
+   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
+
+  * bind the NVMe driver to the VF
+
+.. code-block:: console
+
+   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
\ No newline at end of file
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (13 preceding siblings ...)
  2022-02-17 17:45 ` [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
@ 2022-02-17 17:45 ` Lukasz Maniak
  2022-02-18  6:49   ` Klaus Jensen
  2022-02-18  8:23 ` [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Michael S. Tsirkin
  2022-02-18  8:26 ` Michael S. Tsirkin
  16 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-17 17:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Klaus Jensen, Keith Busch, Lukasz Maniak, qemu-block, Łukasz Gieryk

From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

This patch updates the initialization place for the AER queue, so it’s
initialized once, at controller initialization, and not every time
controller is enabled.

While the original version works for a non-SR-IOV device, as it’s hard
to interact with the controller if it’s not enabled, the multiple
reinitialization is not necessarily correct.

With the SR/IOV feature enabled a segfault can happen: a VF can have its
controller disabled, while a namespace can still be attached to the
controller through the parent PF. An event generated in such case ends
up on an uninitialized queue.

While it’s an interesting question whether a VF should support AER in
the first place, I don’t think it must be answered today.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
---
 hw/nvme/ctrl.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index a9742cf5051..ae41fced596 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6182,8 +6182,6 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     nvme_set_timestamp(n, 0ULL);
 
-    QTAILQ_INIT(&n->aer_queue);
-
     nvme_select_iocs(n);
 
     return 0;
@@ -6844,6 +6842,7 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+    QTAILQ_INIT(&n->aer_queue);
 
     list->numcntl = cpu_to_le16(max_vfs);
     for (i = 0; i < max_vfs; i++) {
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue
  2022-02-17 17:45 ` [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue Lukasz Maniak
@ 2022-02-18  6:49   ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2022-02-18  6:49 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Keith Busch, Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]

On Feb 17 18:45, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> This patch updates the initialization place for the AER queue, so it’s
> initialized once, at controller initialization, and not every time
> controller is enabled.
> 
> While the original version works for a non-SR-IOV device, as it’s hard
> to interact with the controller if it’s not enabled, the multiple
> reinitialization is not necessarily correct.
> 
> With the SR/IOV feature enabled a segfault can happen: a VF can have its
> controller disabled, while a namespace can still be attached to the
> controller through the parent PF. An event generated in such case ends
> up on an uninitialized queue.
> 
> While it’s an interesting question whether a VF should support AER in
> the first place, I don’t think it must be answered today.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Looks good.

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 05/15] hw/nvme: Add support for SR-IOV
  2022-02-17 17:44 ` [PATCH v5 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
@ 2022-02-18  7:06   ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2022-02-18  7:06 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Michael S. Tsirkin, Łukasz Gieryk, qemu-devel,
	Keith Busch

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

On Feb 17 18:44, Lukasz Maniak wrote:
> This patch implements initial support for Single Root I/O Virtualization
> on an NVMe device.
> 
> Essentially, it allows to define the maximum number of virtual functions
> supported by the NVMe controller via sriov_max_vfs parameter.
> 
> Passing a non-zero value to sriov_max_vfs triggers reporting of SR-IOV
> capability by a physical controller and ARI capability by both the
> physical and virtual function devices.
> 
> NVMe controllers created via virtual functions mirror functionally
> the physical controller, which may not entirely be the case, thus
> consideration would be needed on the way to limit the capabilities of
> the VF.
> 
> NVMe subsystem is required for the use of SR-IOV.
> 
> Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> ---
>  hw/nvme/ctrl.c           | 85 ++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h           |  3 +-
>  include/hw/pci/pci_ids.h |  1 +
>  3 files changed, 85 insertions(+), 4 deletions(-)
> 

LGTM.

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (14 preceding siblings ...)
  2022-02-17 17:45 ` [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue Lukasz Maniak
@ 2022-02-18  8:23 ` Michael S. Tsirkin
  2022-02-18 14:33   ` Lukasz Maniak
  2022-02-18  8:26 ` Michael S. Tsirkin
  16 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:23 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Klaus Jensen, Keith Busch, qemu-devel, qemu-block, Łukasz Gieryk

On Thu, Feb 17, 2022 at 06:44:49PM +0100, Lukasz Maniak wrote:
> Changes since v4:
> - Added hello world example for SR-IOV to the docs
> - Moved AER initialization from nvme_init_ctrl to nvme_init_state
> - Fixed division by zero issue in calculation of vqfrt and vifrt
>   capabilities


BTW you should copy all reviewers on the cover letter.



> Knut Omang (2):
>   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
>   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> 
> Lukasz Maniak (4):
>   hw/nvme: Add support for SR-IOV
>   hw/nvme: Add support for Primary Controller Capabilities
>   hw/nvme: Add support for Secondary Controller List
>   docs: Add documentation for SR-IOV and Virtualization Enhancements
> 
> Łukasz Gieryk (9):
>   pcie: Add a helper to the SR/IOV API
>   pcie: Add 1.2 version token for the Power Management Capability
>   hw/nvme: Implement the Function Level Reset
>   hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
>   hw/nvme: Remove reg_size variable and update BAR0 size calculation
>   hw/nvme: Calculate BAR attributes in a function
>   hw/nvme: Initialize capability structures for primary/secondary
>     controllers
>   hw/nvme: Add support for the Virtualization Management command
>   hw/nvme: Update the initalization place for the AER queue
> 
>  docs/pcie_sriov.txt          | 115 ++++++
>  docs/system/devices/nvme.rst |  82 +++++
>  hw/nvme/ctrl.c               | 674 ++++++++++++++++++++++++++++++++---
>  hw/nvme/ns.c                 |   2 +-
>  hw/nvme/nvme.h               |  55 ++-
>  hw/nvme/subsys.c             |  75 +++-
>  hw/nvme/trace-events         |   6 +
>  hw/pci/meson.build           |   1 +
>  hw/pci/pci.c                 | 100 ++++--
>  hw/pci/pcie.c                |   5 +
>  hw/pci/pcie_sriov.c          | 302 ++++++++++++++++
>  hw/pci/trace-events          |   5 +
>  include/block/nvme.h         |  65 ++++
>  include/hw/pci/pci.h         |  12 +-
>  include/hw/pci/pci_ids.h     |   1 +
>  include/hw/pci/pci_regs.h    |   1 +
>  include/hw/pci/pcie.h        |   6 +
>  include/hw/pci/pcie_sriov.h  |  77 ++++
>  include/qemu/typedefs.h      |   2 +
>  19 files changed, 1505 insertions(+), 81 deletions(-)
>  create mode 100644 docs/pcie_sriov.txt
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> -- 
> 2.25.1
> 
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  2022-02-17 17:44 ` [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
@ 2022-02-18  8:24   ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:24 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel, Keith Busch,
	Klaus Jensen, Knut Omang

On Thu, Feb 17, 2022 at 06:44:50PM +0100, Lukasz Maniak wrote:
> From: Knut Omang <knut.omang@oracle.com>
> 
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
> 
> Signed-off-by: Knut Omang <knuto@ifi.uio.no>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/pci/meson.build          |   1 +
>  hw/pci/pci.c                | 100 +++++++++---
>  hw/pci/pcie.c               |   5 +
>  hw/pci/pcie_sriov.c         | 294 ++++++++++++++++++++++++++++++++++++
>  hw/pci/trace-events         |   5 +
>  include/hw/pci/pci.h        |  12 +-
>  include/hw/pci/pcie.h       |   6 +
>  include/hw/pci/pcie_sriov.h |  71 +++++++++
>  include/qemu/typedefs.h     |   2 +
>  9 files changed, 470 insertions(+), 26 deletions(-)
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> diff --git a/hw/pci/meson.build b/hw/pci/meson.build
> index 5c4bbac8171..bcc9c75919f 100644
> --- a/hw/pci/meson.build
> +++ b/hw/pci/meson.build
> @@ -5,6 +5,7 @@ pci_ss.add(files(
>    'pci.c',
>    'pci_bridge.c',
>    'pci_host.c',
> +  'pcie_sriov.c',
>    'shpc.c',
>    'slotid_cap.c'
>  ))
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60e..ba8fb92efc6 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
>  {
>      uint8_t type;
>  
> +    /* PCIe virtual functions do not have their own BARs */
> +    assert(!pci_is_vf(d));
> +
>      if (reg != PCI_ROM_SLOT)
>          return PCI_BASE_ADDRESS_0 + reg * 4;
>  
> @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
>      }
>  }
>  
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>  {
>      int r;
> +    if (pci_is_vf(dev)) {
> +        return;
> +    }
> +
> +    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> +        PCIIORegion *region = &dev->io_regions[r];
> +        if (!region->size) {
> +            continue;
> +        }
>  
> +        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> +        } else {
> +            pci_set_long(dev->config + pci_bar(dev, r), region->type);
> +        }
> +    }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
>                                pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>                                pci_get_word(dev->w1cmask + PCI_INTERRUPT_LINE));
>      dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> -        PCIIORegion *region = &dev->io_regions[r];
> -        if (!region->size) {
> -            continue;
> -        }
> -
> -        if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> -            region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -            pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> -        } else {
> -            pci_set_long(dev->config + pci_bar(dev, r), region->type);
> -        }
> -    }
> +    pci_reset_regions(dev);
>      pci_update_mappings(dev);
>  
>      msi_reset(dev);
> @@ -884,6 +895,16 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
>          dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>      }
>  
> +    /*
> +     * With SR/IOV and ARI, a device at function 0 need not be a multifunction
> +     * device, as it may just be a VF that ended up with function 0 in
> +     * the legacy PCI interpretation. Avoid failing in such cases:
> +     */
> +    if (pci_is_vf(dev) &&
> +        dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +        return;
> +    }
> +
>      /*
>       * multifunction bit is interpreted in two ways as follows.
>       *   - all functions must set the bit to 1.
> @@ -1083,6 +1104,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>                     bus->devices[devfn]->name);
>          return NULL;
>      } else if (dev->hotplugged &&
> +               !pci_is_vf(pci_dev) &&
>                 pci_get_function_0(pci_dev)) {
>          error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
>                     " new func %s cannot be exposed to guest.",
> @@ -1191,6 +1213,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
>      pcibus_t size = memory_region_size(memory);
>      uint8_t hdr_type;
>  
> +    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
>      assert(region_num >= 0);
>      assert(region_num < PCI_NUM_REGIONS);
>      assert(is_power_of_2(size));
> @@ -1294,11 +1317,45 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num)
>      return pci_dev->io_regions[region_num].addr;
>  }
>  
> -static pcibus_t pci_bar_address(PCIDevice *d,
> -                                int reg, uint8_t type, pcibus_t size)
> +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> +                                        uint8_t type, pcibus_t size)
> +{
> +    pcibus_t new_addr;
> +    if (!pci_is_vf(d)) {
> +        int bar = pci_bar(d, reg);
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(d->config + bar);
> +        } else {
> +            new_addr = pci_get_long(d->config + bar);
> +        }
> +    } else {
> +        PCIDevice *pf = d->exp.sriov_vf.pf;
> +        uint16_t sriov_cap = pf->exp.sriov_cap;
> +        int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> +        uint16_t vf_offset =
> +            pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +        uint16_t vf_stride =
> +            pci_get_word(pf->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +        uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / vf_stride;
> +
> +        if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            new_addr = pci_get_quad(pf->config + bar);
> +        } else {
> +            new_addr = pci_get_long(pf->config + bar);
> +        }
> +        new_addr += vf_num * size;
> +    }
> +    /* The ROM slot has a specific enable bit, keep it intact */
> +    if (reg != PCI_ROM_SLOT) {
> +        new_addr &= ~(size - 1);
> +    }
> +    return new_addr;
> +}
> +
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size)
>  {
>      pcibus_t new_addr, last_addr;
> -    int bar = pci_bar(d, reg);
>      uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
>      Object *machine = qdev_get_machine();
>      ObjectClass *oc = object_get_class(machine);
> @@ -1309,7 +1366,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>          if (!(cmd & PCI_COMMAND_IO)) {
>              return PCI_BAR_UNMAPPED;
>          }
> -        new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> +        new_addr = pci_config_get_bar_addr(d, reg, type, size);
>          last_addr = new_addr + size - 1;
>          /* Check if 32 bit BAR wraps around explicitly.
>           * TODO: make priorities correct and remove this work around.
> @@ -1324,11 +1381,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>      if (!(cmd & PCI_COMMAND_MEMORY)) {
>          return PCI_BAR_UNMAPPED;
>      }
> -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -        new_addr = pci_get_quad(d->config + bar);
> -    } else {
> -        new_addr = pci_get_long(d->config + bar);
> -    }
> +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
>      /* the ROM slot has a specific enable bit */
>      if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
>          return PCI_BAR_UNMAPPED;
> @@ -1473,6 +1526,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
>  
>      msi_write_config(d, addr, val_in, l);
>      msix_write_config(d, addr, val_in, l);
> +    pcie_sriov_config_write(d, addr, val_in, l);
>  }
>  
>  /***********************************************************/
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index d7d73a31e4c..3c44204cf31 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -446,6 +446,11 @@ void pcie_cap_slot_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev,
>      PCIDevice *pci_dev = PCI_DEVICE(dev);
>      uint32_t lnkcap = pci_get_long(exp_cap + PCI_EXP_LNKCAP);
>  
> +    if (pci_is_vf(pci_dev)) {
> +        /* Virtual function cannot be physically disconnected */
> +        return;
> +    }
> +
>      /* Don't send event when device is enabled during qemu machine creation:
>       * it is present on boot, no hotplug event is necessary. We do send an
>       * event when the device is disabled later. */
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> new file mode 100644
> index 00000000000..3f256d483fa
> --- /dev/null
> +++ b/hw/pci/pcie_sriov.c
> @@ -0,0 +1,294 @@
> +/*
> + * pcie_sriov.c:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2015-2017 Knut Omang <knut.omang@oracle.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/pci/pci.h"
> +#include "hw/pci/pcie.h"
> +#include "hw/pci/pci_bus.h"
> +#include "hw/qdev-properties.h"
> +#include "qemu/error-report.h"
> +#include "qemu/range.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn,
> +                              const char *name, uint16_t vf_num);
> +static void unregister_vfs(PCIDevice *dev);
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride)
> +{
> +    uint8_t *cfg = dev->config + offset;
> +    uint8_t *wmask;
> +
> +    pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
> +                        offset, PCI_EXT_CAP_SRIOV_SIZEOF);
> +    dev->exp.sriov_cap = offset;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +    dev->exp.sriov_pf.vfname = g_strdup(vfname);
> +    dev->exp.sriov_pf.vf = NULL;
> +
> +    pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
> +    pci_set_word(cfg + PCI_SRIOV_VF_STRIDE, vf_stride);
> +
> +    /*
> +     * Mandatory page sizes to support.
> +     * Device implementations can call pcie_sriov_pf_add_sup_pgsize()
> +     * to set more bits:
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, SRIOV_SUP_PGSIZE_MINREQ);
> +
> +    /*
> +     * Default is to use 4K pages, software can modify it
> +     * to any of the supported bits
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SYS_PGSIZE, 0x1);
> +
> +    /* Set up device ID and initial/total number of VFs available */
> +    pci_set_word(cfg + PCI_SRIOV_VF_DID, vf_dev_id);
> +    pci_set_word(cfg + PCI_SRIOV_INITIAL_VF, init_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_TOTAL_VF, total_vfs);
> +    pci_set_word(cfg + PCI_SRIOV_NUM_VF, 0);
> +
> +    /* Write enable control bits */
> +    wmask = dev->wmask + offset;
> +    pci_set_word(wmask + PCI_SRIOV_CTRL,
> +                 PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
> +    pci_set_word(wmask + PCI_SRIOV_NUM_VF, 0xffff);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, 0x553);
> +
> +    qdev_prop_set_bit(&dev->qdev, "multifunction", true);
> +}
> +
> +void pcie_sriov_pf_exit(PCIDevice *dev)
> +{
> +    unregister_vfs(dev);
> +    g_free((char *)dev->exp.sriov_pf.vfname);
> +    dev->exp.sriov_pf.vfname = NULL;
> +}
> +
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size)
> +{
> +    uint32_t addr;
> +    uint64_t wmask;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    assert(sriov_cap > 0);
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    assert(region_num != PCI_ROM_SLOT);
> +
> +    wmask = ~(size - 1);
> +    addr = sriov_cap + PCI_SRIOV_BAR + region_num * 4;
> +
> +    pci_set_long(dev->config + addr, type);
> +    if (!(type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +        type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +        pci_set_quad(dev->wmask + addr, wmask);
> +        pci_set_quad(dev->cmask + addr, ~0ULL);
> +    } else {
> +        pci_set_long(dev->wmask + addr, wmask & 0xffffffff);
> +        pci_set_long(dev->cmask + addr, 0xffffffff);
> +    }
> +    dev->exp.sriov_pf.vf_bar_type[region_num] = type;
> +}
> +
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory)
> +{
> +    PCIIORegion *r;
> +    PCIBus *bus = pci_get_bus(dev);
> +    uint8_t type;
> +    pcibus_t size = memory_region_size(memory);
> +
> +    assert(pci_is_vf(dev)); /* PFs must use pci_register_bar */
> +    assert(region_num >= 0);
> +    assert(region_num < PCI_NUM_REGIONS);
> +    type = dev->exp.sriov_vf.pf->exp.sriov_pf.vf_bar_type[region_num];
> +
> +    if (!is_power_of_2(size)) {
> +        error_report("%s: PCI region size must be a power"
> +                     " of two - type=0x%x, size=0x%"FMT_PCIBUS,
> +                     __func__, type, size);
> +        exit(1);
> +    }
> +
> +    r = &dev->io_regions[region_num];
> +    r->memory = memory;
> +    r->address_space =
> +        type & PCI_BASE_ADDRESS_SPACE_IO
> +        ? bus->address_space_io
> +        : bus->address_space_mem;
> +    r->size = size;
> +    r->type = type;
> +
> +    r->addr = pci_bar_address(dev, region_num, r->type, r->size);
> +    if (r->addr != PCI_BAR_UNMAPPED) {
> +        memory_region_add_subregion_overlap(r->address_space,
> +                                            r->addr, r->memory, 1);
> +    }
> +}
> +
> +static PCIDevice *register_vf(PCIDevice *pf, int devfn, const char *name,
> +                              uint16_t vf_num)
> +{
> +    PCIDevice *dev = pci_new(devfn, name);
> +    dev->exp.sriov_vf.pf = pf;
> +    dev->exp.sriov_vf.vf_number = vf_num;
> +    PCIBus *bus = pci_get_bus(pf);
> +    Error *local_err = NULL;
> +
> +    qdev_realize(&dev->qdev, &bus->qbus, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return NULL;
> +    }
> +
> +    /* set vid/did according to sr/iov spec - they are not used */
> +    pci_config_set_vendor_id(dev->config, 0xffff);
> +    pci_config_set_device_id(dev->config, 0xffff);
> +
> +    return dev;
> +}
> +
> +static void register_vfs(PCIDevice *dev)
> +{
> +    uint16_t num_vfs;
> +    uint16_t i;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    uint16_t vf_offset =
> +        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_OFFSET);
> +    uint16_t vf_stride =
> +        pci_get_word(dev->config + sriov_cap + PCI_SRIOV_VF_STRIDE);
> +    int32_t devfn = dev->devfn + vf_offset;
> +
> +    assert(sriov_cap > 0);
> +    num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
> +
> +    dev->exp.sriov_pf.vf = g_malloc(sizeof(PCIDevice *) * num_vfs);
> +    assert(dev->exp.sriov_pf.vf);
> +
> +    trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
> +                             PCI_FUNC(dev->devfn), num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        dev->exp.sriov_pf.vf[i] = register_vf(dev, devfn,
> +                                              dev->exp.sriov_pf.vfname, i);
> +        if (!dev->exp.sriov_pf.vf[i]) {
> +            num_vfs = i;
> +            break;
> +        }
> +        devfn += vf_stride;
> +    }
> +    dev->exp.sriov_pf.num_vfs = num_vfs;
> +}
> +
> +static void unregister_vfs(PCIDevice *dev)
> +{
> +    Error *local_err = NULL;
> +    uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
> +    uint16_t i;
> +
> +    trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
> +                               PCI_FUNC(dev->devfn), num_vfs);
> +    for (i = 0; i < num_vfs; i++) {
> +        PCIDevice *vf = dev->exp.sriov_pf.vf[i];
> +        object_property_set_bool(OBJECT(vf), "realized", false, &local_err);
> +        if (local_err) {
> +            fprintf(stderr, "Failed to unplug: %s\n",
> +                    error_get_pretty(local_err));
> +            error_free(local_err);
> +        }
> +        object_unparent(OBJECT(vf));
> +    }
> +    g_free(dev->exp.sriov_pf.vf);
> +    dev->exp.sriov_pf.vf = NULL;
> +    dev->exp.sriov_pf.num_vfs = 0;
> +    pci_set_word(dev->config + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0);
> +}
> +
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> +                             uint32_t val, int len)
> +{
> +    uint32_t off;
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +
> +    if (!sriov_cap || address < sriov_cap) {
> +        return;
> +    }
> +    off = address - sriov_cap;
> +    if (off >= PCI_EXT_CAP_SRIOV_SIZEOF) {
> +        return;
> +    }
> +
> +    trace_sriov_config_write(dev->name, PCI_SLOT(dev->devfn),
> +                             PCI_FUNC(dev->devfn), off, val, len);
> +
> +    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
> +        if (dev->exp.sriov_pf.num_vfs) {
> +            if (!(val & PCI_SRIOV_CTRL_VFE)) {
> +                unregister_vfs(dev);
> +            }
> +        } else {
> +            if (val & PCI_SRIOV_CTRL_VFE) {
> +                register_vfs(dev);
> +            }
> +        }
> +    }
> +}
> +
> +
> +/* Reset SR/IOV VF Enable bit to trigger an unregister of all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev)
> +{
> +    uint16_t sriov_cap = dev->exp.sriov_cap;
> +    if (sriov_cap) {
> +        uint32_t val = pci_get_byte(dev->config + sriov_cap + PCI_SRIOV_CTRL);
> +        if (val & PCI_SRIOV_CTRL_VFE) {
> +            val &= ~PCI_SRIOV_CTRL_VFE;
> +            pcie_sriov_config_write(dev, sriov_cap + PCI_SRIOV_CTRL, val, 1);
> +        }
> +    }
> +}
> +
> +/* Add optional supported page sizes to the mask of supported page sizes */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize)
> +{
> +    uint8_t *cfg = dev->config + dev->exp.sriov_cap;
> +    uint8_t *wmask = dev->wmask + dev->exp.sriov_cap;
> +
> +    uint16_t sup_pgsize = pci_get_word(cfg + PCI_SRIOV_SUP_PGSIZE);
> +
> +    sup_pgsize |= opt_sup_pgsize;
> +
> +    /*
> +     * Make sure the new bits are set, and that system page size
> +     * also can be set to any of the new values according to spec:
> +     */
> +    pci_set_word(cfg + PCI_SRIOV_SUP_PGSIZE, sup_pgsize);
> +    pci_set_word(wmask + PCI_SRIOV_SYS_PGSIZE, sup_pgsize);
> +}
> +
> +
> +uint16_t pcie_sriov_vf_number(PCIDevice *dev)
> +{
> +    assert(pci_is_vf(dev));
> +    return dev->exp.sriov_vf.vf_number;
> +}
> +
> +
> +PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
> +{
> +    return dev->exp.sriov_vf.pf;
> +}
> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> index 7570752c404..aaf46bc92df 100644
> --- a/hw/pci/trace-events
> +++ b/hw/pci/trace-events
> @@ -10,3 +10,8 @@ pci_cfg_write(const char *dev, uint32_t bus, uint32_t slot, uint32_t func, unsig
>  
>  # msix.c
>  msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
> +
> +# hw/pci/pcie_sriov.c
> +sriov_register_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: creating %d vf devs"
> +sriov_unregister_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: Unregistering %d vf devs"
> +sriov_config_write(const char *name, int slot, int fun, uint32_t offset, uint32_t val, uint32_t len) "%s %02x:%x: sriov offset 0x%x val 0x%x len %d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index c3f3c90473e..3a32b8dd40a 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -7,9 +7,6 @@
>  /* PCI includes legacy ISA access.  */
>  #include "hw/isa/isa.h"
>  
> -#include "hw/pci/pcie.h"
> -#include "qom/object.h"
> -
>  extern bool pci_available;
>  
>  /* PCI bus */
> @@ -157,6 +154,7 @@ enum {
>  #define QEMU_PCI_VGA_IO_HI_SIZE 0x20
>  
>  #include "hw/pci/pci_regs.h"
> +#include "hw/pci/pcie.h"
>  
>  /* PCI HEADER_TYPE */
>  #define  PCI_HEADER_TYPE_MULTI_FUNCTION 0x80
> @@ -499,6 +497,9 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
>  
> +pcibus_t pci_bar_address(PCIDevice *d,
> +                         int reg, uint8_t type, pcibus_t size);
> +
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
>  {
> @@ -779,6 +780,11 @@ static inline int pci_is_express_downstream_port(const PCIDevice *d)
>      return type == PCI_EXP_TYPE_DOWNSTREAM || type == PCI_EXP_TYPE_ROOT_PORT;
>  }
>  
> +static inline int pci_is_vf(const PCIDevice *d)
> +{
> +    return d->exp.sriov_vf.pf != NULL;
> +}
> +
>  static inline uint32_t pci_config_size(const PCIDevice *d)
>  {
>      return pci_is_express(d) ? PCIE_CONFIG_SPACE_SIZE : PCI_CONFIG_SPACE_SIZE;
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index 6063bee0ec6..168950a83b7 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -24,6 +24,7 @@
>  #include "hw/pci/pci_regs.h"
>  #include "hw/pci/pcie_regs.h"
>  #include "hw/pci/pcie_aer.h"
> +#include "hw/pci/pcie_sriov.h"
>  #include "hw/hotplug.h"
>  
>  typedef enum {
> @@ -81,6 +82,11 @@ struct PCIExpressDevice {
>  
>      /* ACS */
>      uint16_t acs_cap;
> +
> +    /* SR/IOV */
> +    uint16_t sriov_cap;
> +    PCIESriovPF sriov_pf;
> +    PCIESriovVF sriov_vf;
>  };
>  
>  #define COMPAT_PROP_PCP "power_controller_present"
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> new file mode 100644
> index 00000000000..990cff0a1c6
> --- /dev/null
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -0,0 +1,71 @@
> +/*
> + * pcie_sriov.h:
> + *
> + * Implementation of SR/IOV emulation support.
> + *
> + * Copyright (c) 2015 Knut Omang <knut.omang@oracle.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_PCIE_SRIOV_H
> +#define QEMU_PCIE_SRIOV_H
> +
> +struct PCIESriovPF {
> +    uint16_t num_vfs;   /* Number of virtual functions created */
> +    uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
> +    const char *vfname; /* Reference to the device type used for the VFs */
> +    PCIDevice **vf;     /* Pointer to an array of num_vfs VF devices */
> +};
> +
> +struct PCIESriovVF {
> +    PCIDevice *pf;      /* Pointer back to owner physical function */
> +    uint16_t vf_number; /* Logical VF number of this function */
> +};
> +
> +void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
> +                        const char *vfname, uint16_t vf_dev_id,
> +                        uint16_t init_vfs, uint16_t total_vfs,
> +                        uint16_t vf_offset, uint16_t vf_stride);
> +void pcie_sriov_pf_exit(PCIDevice *dev);
> +
> +/* Set up a VF bar in the SR/IOV bar area */
> +void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
> +                               uint8_t type, dma_addr_t size);
> +
> +/* Instantiate a bar for a VF */
> +void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
> +                                MemoryRegion *memory);
> +
> +/*
> + * Default (minimal) page size support values
> + * as required by the SR/IOV standard:
> + * 0x553 << 12 = 0x553000 = 4K + 8K + 64K + 256K + 1M + 4M
> + */
> +#define SRIOV_SUP_PGSIZE_MINREQ 0x553
> +
> +/*
> + * Optionally add supported page sizes to the mask of supported page sizes
> + * Page size values are interpreted as opt_sup_pgsize << 12.
> + */
> +void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
> +
> +/* SR/IOV capability config write handler */
> +void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
> +                             uint32_t val, int len);
> +
> +/* Reset SR/IOV VF Enable bit to unregister all VFs */
> +void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
> +
> +/* Get logical VF number of a VF - only valid for VFs */
> +uint16_t pcie_sriov_vf_number(PCIDevice *dev);
> +
> +/*
> + * Get the physical function that owns this VF.
> + * Returns NULL if dev is not a virtual function
> + */
> +PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
> +
> +#endif /* QEMU_PCIE_SRIOV_H */
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index ee60eb3de47..5b302cb2145 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -86,6 +86,8 @@ typedef struct PCIDevice PCIDevice;
>  typedef struct PCIEAERErr PCIEAERErr;
>  typedef struct PCIEAERLog PCIEAERLog;
>  typedef struct PCIEAERMsg PCIEAERMsg;
> +typedef struct PCIESriovPF PCIESriovPF;
> +typedef struct PCIESriovVF PCIESriovVF;
>  typedef struct PCIEPort PCIEPort;
>  typedef struct PCIESlot PCIESlot;
>  typedef struct PCIExpressDevice PCIExpressDevice;
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
  2022-02-17 17:44 ` [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
@ 2022-02-18  8:24   ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:24 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel, Keith Busch,
	Klaus Jensen, Knut Omang

On Thu, Feb 17, 2022 at 06:44:51PM +0100, Lukasz Maniak wrote:
> From: Knut Omang <knut.omang@oracle.com>
> 
> Add a small intro + minimal documentation for how to
> implement SR/IOV support for an emulated device.
> 
> Signed-off-by: Knut Omang <knuto@ifi.uio.no>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  docs/pcie_sriov.txt | 115 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 115 insertions(+)
>  create mode 100644 docs/pcie_sriov.txt
> 
> diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> new file mode 100644
> index 00000000000..f5e891e1d45
> --- /dev/null
> +++ b/docs/pcie_sriov.txt
> @@ -0,0 +1,115 @@
> +PCI SR/IOV EMULATION SUPPORT
> +============================
> +
> +Description
> +===========
> +SR/IOV (Single Root I/O Virtualization) is an optional extended capability
> +of a PCI Express device. It allows a single physical function (PF) to appear as multiple
> +virtual functions (VFs) for the main purpose of eliminating software
> +overhead in I/O from virtual machines.
> +
> +Qemu now implements the basic common functionality to enable an emulated device
> +to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a
> +proof-of-concept hack of the Intel igb can be found here:
> +
> +git://github.com/knuto/qemu.git sriov_patches_v5
> +
> +Implementation
> +==============
> +Implementing emulation of an SR/IOV capable device typically consists of
> +implementing support for two types of device classes; the "normal" physical device
> +(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just
> +like other devices, except that some of their properties are derived from
> +the PF.
> +
> +A virtual function is different from a physical function in that the BAR
> +space for all VFs are defined by the BAR registers in the PFs SR/IOV
> +capability. All VFs have the same BARs and BAR sizes.
> +
> +Accesses to these virtual BARs then is computed as
> +
> +   <VF BAR start> + <VF number> * <BAR sz> + <offset>
> +
> +From our emulation perspective this means that there is a separate call for
> +setting up a BAR for a VF.
> +
> +1) To enable SR/IOV support in the PF, it must be a PCI Express device so
> +   you would need to add a PCI Express capability in the normal PCI
> +   capability list. You might also want to add an ARI (Alternative
> +   Routing-ID Interpretation) capability to indicate that your device
> +   supports functions beyond it's "own" function space (0-7),
> +   which is necessary to support more than 7 functions, or
> +   if functions extends beyond offset 7 because they are placed at an
> +   offset > 1 or have stride > 1.
> +
> +   ...
> +   #include "hw/pci/pcie.h"
> +   #include "hw/pci/pcie_sriov.h"
> +
> +   pci_your_pf_dev_realize( ... )
> +   {
> +      ...
> +      int ret = pcie_endpoint_cap_init(d, 0x70);
> +      ...
> +      pcie_ari_init(d, 0x100, 1);
> +      ...
> +
> +      /* Add and initialize the SR/IOV capability */
> +      pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
> +                       vf_devid, initial_vfs, total_vfs,
> +                       fun_offset, stride);
> +
> +      /* Set up individual VF BARs (parameters as for normal BARs) */
> +      pcie_sriov_pf_init_vf_bar( ... )
> +      ...
> +   }
> +
> +   For cleanup, you simply call:
> +
> +      pcie_sriov_pf_exit(device);
> +
> +   which will delete all the virtual functions and associated resources.
> +
> +2) Similarly in the implementation of the virtual function, you need to
> +   make it a PCI Express device and add a similar set of capabilities
> +   except for the SR/IOV capability. Then you need to set up the VF BARs as
> +   subregions of the PFs SR/IOV VF BARs by calling
> +   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
> +
> +   pci_your_vf_dev_realize( ... )
> +   {
> +      ...
> +      int ret = pcie_endpoint_cap_init(d, 0x60);
> +      ...
> +      pcie_ari_init(d, 0x100, 1);
> +      ...
> +      memory_region_init(mr, ... )
> +      pcie_sriov_vf_register_bar(d, bar_nr, mr);
> +      ...
> +   }
> +
> +Testing on Linux guest
> +======================
> +The easiest is if your device driver supports sysfs based SR/IOV
> +enabling. Support for this was added in kernel v.3.8, so not all drivers
> +support it yet.
> +
> +To enable 4 VFs for a device at 01:00.0:
> +
> +	modprobe yourdriver
> +	echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> +
> +You should now see 4 VFs with lspci.
> +To turn SR/IOV off again - the standard requires you to turn it off before you can enable
> +another VF count, and the emulation enforces this:
> +
> +	echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> +
> +Older drivers typically provide a max_vfs module parameter
> +to enable it at load time:
> +
> +	modprobe yourdriver max_vfs=4
> +
> +To disable the VFs again then, you simply have to unload the driver:
> +
> +	rmmod yourdriver
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API
  2022-02-17 17:44 ` [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API Lukasz Maniak
@ 2022-02-18  8:25   ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:25 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel, Keith Busch,
	Klaus Jensen, Knut Omang

On Thu, Feb 17, 2022 at 06:44:52PM +0100, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> Convenience function for retrieving the PCIDevice object of the N-th VF.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> Reviewed-by: Knut Omang <knuto@ifi.uio.no>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  hw/pci/pcie_sriov.c         | 10 +++++++++-
>  include/hw/pci/pcie_sriov.h |  6 ++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index 3f256d483fa..87abad6ac86 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -287,8 +287,16 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
>      return dev->exp.sriov_vf.vf_number;
>  }
>  
> -
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
>  {
>      return dev->exp.sriov_vf.pf;
>  }
> +
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
> +{
> +    assert(!pci_is_vf(dev));
> +    if (n < dev->exp.sriov_pf.num_vfs) {
> +        return dev->exp.sriov_pf.vf[n];
> +    }
> +    return NULL;
> +}
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> index 990cff0a1c6..80f5c84e75c 100644
> --- a/include/hw/pci/pcie_sriov.h
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -68,4 +68,10 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev);
>   */
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
>  
> +/*
> + * Get the n-th VF of this physical function - only valid for PF.
> + * Returns NULL if index is invalid
> + */
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
> +
>  #endif /* QEMU_PCIE_SRIOV_H */
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability
  2022-02-17 17:44 ` [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
@ 2022-02-18  8:25   ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:25 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel, Keith Busch, Klaus Jensen

On Thu, Feb 17, 2022 at 06:44:53PM +0100, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/hw/pci/pci_regs.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
> index 77ba64b9314..a5901409622 100644
> --- a/include/hw/pci/pci_regs.h
> +++ b/include/hw/pci/pci_regs.h
> @@ -4,5 +4,6 @@
>  #include "standard-headers/linux/pci_regs.h"
>  
>  #define  PCI_PM_CAP_VER_1_1     0x0002  /* PCI PM spec ver. 1.1 */
> +#define  PCI_PM_CAP_VER_1_2     0x0003  /* PCI PM spec ver. 1.2 */
>  
>  #endif
> -- 
> 2.25.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
                   ` (15 preceding siblings ...)
  2022-02-18  8:23 ` [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Michael S. Tsirkin
@ 2022-02-18  8:26 ` Michael S. Tsirkin
  2022-02-18  8:51   ` Klaus Jensen
  16 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  8:26 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Klaus Jensen, Keith Busch, qemu-devel, qemu-block, Łukasz Gieryk

On Thu, Feb 17, 2022 at 06:44:49PM +0100, Lukasz Maniak wrote:
> Changes since v4:
> - Added hello world example for SR-IOV to the docs
> - Moved AER initialization from nvme_init_ctrl to nvme_init_state
> - Fixed division by zero issue in calculation of vqfrt and vifrt
>   capabilities


So do you want to merge it all with nvme bits? which tree is this for?
Or would you like me to merge the pci bits for now?
Thanks!

> Knut Omang (2):
>   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
>   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> 
> Lukasz Maniak (4):
>   hw/nvme: Add support for SR-IOV
>   hw/nvme: Add support for Primary Controller Capabilities
>   hw/nvme: Add support for Secondary Controller List
>   docs: Add documentation for SR-IOV and Virtualization Enhancements
> 
> Łukasz Gieryk (9):
>   pcie: Add a helper to the SR/IOV API
>   pcie: Add 1.2 version token for the Power Management Capability
>   hw/nvme: Implement the Function Level Reset
>   hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
>   hw/nvme: Remove reg_size variable and update BAR0 size calculation
>   hw/nvme: Calculate BAR attributes in a function
>   hw/nvme: Initialize capability structures for primary/secondary
>     controllers
>   hw/nvme: Add support for the Virtualization Management command
>   hw/nvme: Update the initalization place for the AER queue
> 
>  docs/pcie_sriov.txt          | 115 ++++++
>  docs/system/devices/nvme.rst |  82 +++++
>  hw/nvme/ctrl.c               | 674 ++++++++++++++++++++++++++++++++---
>  hw/nvme/ns.c                 |   2 +-
>  hw/nvme/nvme.h               |  55 ++-
>  hw/nvme/subsys.c             |  75 +++-
>  hw/nvme/trace-events         |   6 +
>  hw/pci/meson.build           |   1 +
>  hw/pci/pci.c                 | 100 ++++--
>  hw/pci/pcie.c                |   5 +
>  hw/pci/pcie_sriov.c          | 302 ++++++++++++++++
>  hw/pci/trace-events          |   5 +
>  include/block/nvme.h         |  65 ++++
>  include/hw/pci/pci.h         |  12 +-
>  include/hw/pci/pci_ids.h     |   1 +
>  include/hw/pci/pci_regs.h    |   1 +
>  include/hw/pci/pcie.h        |   6 +
>  include/hw/pci/pcie_sriov.h  |  77 ++++
>  include/qemu/typedefs.h      |   2 +
>  19 files changed, 1505 insertions(+), 81 deletions(-)
>  create mode 100644 docs/pcie_sriov.txt
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> -- 
> 2.25.1
> 
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2022-02-18  8:26 ` Michael S. Tsirkin
@ 2022-02-18  8:51   ` Klaus Jensen
  2022-02-18  9:33     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2022-02-18  8:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Keith Busch, Łukasz Gieryk, Lukasz Maniak, qemu-block, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 940 bytes --]

On Feb 18 03:26, Michael S. Tsirkin wrote:
> On Thu, Feb 17, 2022 at 06:44:49PM +0100, Lukasz Maniak wrote:
> > Changes since v4:
> > - Added hello world example for SR-IOV to the docs
> > - Moved AER initialization from nvme_init_ctrl to nvme_init_state
> > - Fixed division by zero issue in calculation of vqfrt and vifrt
> >   capabilities
> 
> 
> So do you want to merge it all with nvme bits? which tree is this for?
> Or would you like me to merge the pci bits for now?
> Thanks!
> 

I was wondering how to approach that as well. I think maybe it could all
go through your tree so the pcie bits doesnt just sit their without
being used by anything? It's up to you, but note that nvme bits are not
fully reviewed yet.

If you are fine with merging the pcie bits then lets do that and we
merge the nvme bits through the nvme tree. The nvme bits is fully acked,
so it will go in, just need to finalize the reviews.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2022-02-18  8:51   ` Klaus Jensen
@ 2022-02-18  9:33     ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2022-02-18  9:33 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Keith Busch, Łukasz Gieryk, Lukasz Maniak, qemu-block, qemu-devel

On Fri, Feb 18, 2022 at 09:51:12AM +0100, Klaus Jensen wrote:
> On Feb 18 03:26, Michael S. Tsirkin wrote:
> > On Thu, Feb 17, 2022 at 06:44:49PM +0100, Lukasz Maniak wrote:
> > > Changes since v4:
> > > - Added hello world example for SR-IOV to the docs
> > > - Moved AER initialization from nvme_init_ctrl to nvme_init_state
> > > - Fixed division by zero issue in calculation of vqfrt and vifrt
> > >   capabilities
> > 
> > 
> > So do you want to merge it all with nvme bits? which tree is this for?
> > Or would you like me to merge the pci bits for now?
> > Thanks!
> > 
> 
> I was wondering how to approach that as well. I think maybe it could all
> go through your tree so the pcie bits doesnt just sit their without
> being used by anything? It's up to you, but note that nvme bits are not
> fully reviewed yet.
> 
> If you are fine with merging the pcie bits then lets do that and we
> merge the nvme bits through the nvme tree. The nvme bits is fully acked,
> so it will go in, just need to finalize the reviews.

Yes, I'm fine with this.

-- 
MST



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
  2022-02-18  8:23 ` [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Michael S. Tsirkin
@ 2022-02-18 14:33   ` Lukasz Maniak
  0 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-18 14:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Klaus Jensen, Keith Busch, qemu-devel, qemu-block, Łukasz Gieryk

On Fri, Feb 18, 2022 at 03:23:15AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 17, 2022 at 06:44:49PM +0100, Lukasz Maniak wrote:
> > Changes since v4:
> > - Added hello world example for SR-IOV to the docs
> > - Moved AER initialization from nvme_init_ctrl to nvme_init_state
> > - Fixed division by zero issue in calculation of vqfrt and vifrt
> >   capabilities
> 
> 
> BTW you should copy all reviewers on the cover letter.
> 
Yep, will do next time. Sorry about that.
> 
> 
> > Knut Omang (2):
> >   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
> >   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> > 
> > Lukasz Maniak (4):
> >   hw/nvme: Add support for SR-IOV
> >   hw/nvme: Add support for Primary Controller Capabilities
> >   hw/nvme: Add support for Secondary Controller List
> >   docs: Add documentation for SR-IOV and Virtualization Enhancements
> > 
> > Łukasz Gieryk (9):
> >   pcie: Add a helper to the SR/IOV API
> >   pcie: Add 1.2 version token for the Power Management Capability
> >   hw/nvme: Implement the Function Level Reset
> >   hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
> >   hw/nvme: Remove reg_size variable and update BAR0 size calculation
> >   hw/nvme: Calculate BAR attributes in a function
> >   hw/nvme: Initialize capability structures for primary/secondary
> >     controllers
> >   hw/nvme: Add support for the Virtualization Management command
> >   hw/nvme: Update the initalization place for the AER queue
> > 
> >  docs/pcie_sriov.txt          | 115 ++++++
> >  docs/system/devices/nvme.rst |  82 +++++
> >  hw/nvme/ctrl.c               | 674 ++++++++++++++++++++++++++++++++---
> >  hw/nvme/ns.c                 |   2 +-
> >  hw/nvme/nvme.h               |  55 ++-
> >  hw/nvme/subsys.c             |  75 +++-
> >  hw/nvme/trace-events         |   6 +
> >  hw/pci/meson.build           |   1 +
> >  hw/pci/pci.c                 | 100 ++++--
> >  hw/pci/pcie.c                |   5 +
> >  hw/pci/pcie_sriov.c          | 302 ++++++++++++++++
> >  hw/pci/trace-events          |   5 +
> >  include/block/nvme.h         |  65 ++++
> >  include/hw/pci/pci.h         |  12 +-
> >  include/hw/pci/pci_ids.h     |   1 +
> >  include/hw/pci/pci_regs.h    |   1 +
> >  include/hw/pci/pcie.h        |   6 +
> >  include/hw/pci/pcie_sriov.h  |  77 ++++
> >  include/qemu/typedefs.h      |   2 +
> >  19 files changed, 1505 insertions(+), 81 deletions(-)
> >  create mode 100644 docs/pcie_sriov.txt
> >  create mode 100644 hw/pci/pcie_sriov.c
> >  create mode 100644 include/hw/pci/pcie_sriov.h
> > 
> > -- 
> > 2.25.1
> > 
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers
  2022-02-17 17:45 ` [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
@ 2022-02-18 14:37   ` Lukasz Maniak
  0 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-02-18 14:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Klaus Jensen

On Thu, Feb 17, 2022 at 06:45:01PM +0100, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> With four new properties:
>  - sriov_v{i,q}_flexible,
>  - sriov_max_v{i,q}_per_vf,
> one can configure the number of available flexible resources, as well as
> the limits. The primary and secondary controller capability structures
> are initialized accordingly.
> 
> Since the number of available queues (interrupts) now varies between
> VF/PF, BAR size calculation is also adjusted.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> ---
>  hw/nvme/ctrl.c       | 142 ++++++++++++++++++++++++++++++++++++++++---
>  hw/nvme/nvme.h       |   4 ++
>  include/block/nvme.h |   5 ++
>  3 files changed, 144 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 73707565345..2a6a36e733d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -36,6 +36,10 @@
>   *              zoned.zasl=<N[optional]>, \
>   *              zoned.auto_transition=<on|off[optional]>, \
>   *              sriov_max_vfs=<N[optional]> \
> + *              sriov_vq_flexible=<N[optional]> \
> + *              sriov_vi_flexible=<N[optional]> \
> + *              sriov_max_vi_per_vf=<N[optional]> \
> + *              sriov_max_vq_per_vf=<N[optional]> \
>   *              subsys=<subsys_id>
>   *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>,\
>   *              zoned=<true|false[optional]>, \
> @@ -113,6 +117,29 @@
>   *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
>   *   Virtual function controllers will not report SR-IOV capability.
>   *
> + *   NOTE: Single Root I/O Virtualization support is experimental.
> + *   All the related parameters may be subject to change.
> + *
> + * - `sriov_vq_flexible`
> + *   Indicates the total number of flexible queue resources assignable to all
> + *   the secondary controllers. Implicitly sets the number of primary
> + *   controller's private resources to `(max_ioqpairs - sriov_vq_flexible)`.
> + *
> + * - `sriov_vi_flexible`
> + *   Indicates the total number of flexible interrupt resources assignable to
> + *   all the secondary controllers. Implicitly sets the number of primary
> + *   controller's private resources to `(msix_qsize - sriov_vi_flexible)`.
> + *
> + * - `sriov_max_vi_per_vf`
> + *   Indicates the maximum number of virtual interrupt resources assignable
> + *   to a secondary controller. The default 0 resolves to
> + *   `(sriov_vi_flexible / sriov_max_vfs)`.
> + *
> + * - `sriov_max_vq_per_vf`
> + *   Indicates the maximum number of virtual queue resources assignable to
> + *   a secondary controller. The default 0 resolves to
> + *   `(sriov_vq_flexible / sriov_max_vfs)`.
> + *
>   * nvme namespace device parameters
>   * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   * - `shared`
> @@ -184,6 +211,7 @@
>  #define NVME_NUM_FW_SLOTS 1
>  #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
>  #define NVME_MAX_VFS 127
> +#define NVME_VF_RES_GRANULARITY 1
>  #define NVME_VF_OFFSET 0x1
>  #define NVME_VF_STRIDE 1
>  
> @@ -6512,6 +6540,54 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
>              error_setg(errp, "PMR is not supported with SR-IOV");
>              return;
>          }
> +
> +        if (!params->sriov_vq_flexible || !params->sriov_vi_flexible) {
> +            error_setg(errp, "both sriov_vq_flexible and sriov_vi_flexible"
> +                       " must be set for the use of SR-IOV");
> +            return;
> +        }
> +
> +        if (params->sriov_vq_flexible < params->sriov_max_vfs * 2) {
> +            error_setg(errp, "sriov_vq_flexible must be greater than or equal"
> +                       " to %d (sriov_max_vfs * 2)", params->sriov_max_vfs * 2);
> +            return;
> +        }
> +
> +        if (params->max_ioqpairs < params->sriov_vq_flexible + 2) {
> +            error_setg(errp, "sriov_vq_flexible - max_ioqpairs (PF-private"
After posting, we realized that the error string is confusing. This will
be fixed in v6.

> +                       " queue resources) must be greater than or equal to 2");
> +            return;
> +        }
> +
> +        if (params->sriov_vi_flexible < params->sriov_max_vfs) {
> +            error_setg(errp, "sriov_vi_flexible must be greater than or equal"
> +                       " to %d (sriov_max_vfs)", params->sriov_max_vfs);
> +            return;
> +        }
> +
> +        if (params->msix_qsize < params->sriov_vi_flexible + 1) {
> +            error_setg(errp, "sriov_vi_flexible - msix_qsize (PF-private"
Same here.

> +                       " interrupt resources) must be greater than or equal"
> +                       " to 1");
> +            return;
> +        }
> +
> +        if (params->sriov_max_vi_per_vf &&
> +            (params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
> +            error_setg(errp, "sriov_max_vi_per_vf must meet:"
> +                       " (X - 1) %% %d == 0 and X >= 1",
> +                       NVME_VF_RES_GRANULARITY);
> +            return;
> +        }
> +
> +        if (params->sriov_max_vq_per_vf &&
> +            (params->sriov_max_vq_per_vf < 2 ||
> +             (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY)) {
> +            error_setg(errp, "sriov_max_vq_per_vf must meet:"
> +                       " (X - 1) %% %d == 0 and X >= 2",
> +                       NVME_VF_RES_GRANULARITY);
> +            return;
> +        }
>      }
>  }
>  
> @@ -6520,10 +6596,19 @@ static void nvme_init_state(NvmeCtrl *n)
>      NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
>      NvmeSecCtrlList *list = &n->sec_ctrl_list;
>      NvmeSecCtrlEntry *sctrl;
> +    uint8_t max_vfs;
>      int i;
>  
> -    n->conf_ioqpairs = n->params.max_ioqpairs;
> -    n->conf_msix_qsize = n->params.msix_qsize;
> +    if (pci_is_vf(&n->parent_obj)) {
> +        sctrl = nvme_sctrl(n);
> +        max_vfs = 0;
> +        n->conf_ioqpairs = sctrl->nvq ? le16_to_cpu(sctrl->nvq) - 1 : 0;
> +        n->conf_msix_qsize = sctrl->nvi ? le16_to_cpu(sctrl->nvi) : 1;
> +    } else {
> +        max_vfs = n->params.sriov_max_vfs;
> +        n->conf_ioqpairs = n->params.max_ioqpairs;
> +        n->conf_msix_qsize = n->params.msix_qsize;
> +    }
>  
>      n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
>      n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
> @@ -6532,14 +6617,41 @@ static void nvme_init_state(NvmeCtrl *n)
>      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
>      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
>  
> -    list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
> -    for (i = 0; i < n->params.sriov_max_vfs; i++) {
> +    list->numcntl = cpu_to_le16(max_vfs);
> +    for (i = 0; i < max_vfs; i++) {
>          sctrl = &list->sec[i];
>          sctrl->pcid = cpu_to_le16(n->cntlid);
>          sctrl->vfn = cpu_to_le16(i + 1);
>      }
>  
>      cap->cntlid = cpu_to_le16(n->cntlid);
> +    cap->crt = NVME_CRT_VQ | NVME_CRT_VI;
> +
> +    if (pci_is_vf(&n->parent_obj)) {
> +        cap->vqprt = cpu_to_le16(1 + n->conf_ioqpairs);
> +    } else {
> +        cap->vqprt = cpu_to_le16(1 + n->params.max_ioqpairs -
> +                                 n->params.sriov_vq_flexible);
> +        cap->vqfrt = cpu_to_le32(n->params.sriov_vq_flexible);
> +        cap->vqrfap = cap->vqfrt;
> +        cap->vqgran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
> +        cap->vqfrsm = n->params.sriov_max_vq_per_vf ?
> +                        cpu_to_le16(n->params.sriov_max_vq_per_vf) :
> +                        cap->vqfrt / MAX(max_vfs, 1);
> +    }
> +
> +    if (pci_is_vf(&n->parent_obj)) {
> +        cap->viprt = cpu_to_le16(n->conf_msix_qsize);
> +    } else {
> +        cap->viprt = cpu_to_le16(n->params.msix_qsize -
> +                                 n->params.sriov_vi_flexible);
> +        cap->vifrt = cpu_to_le32(n->params.sriov_vi_flexible);
> +        cap->virfap = cap->vifrt;
> +        cap->vigran = cpu_to_le16(NVME_VF_RES_GRANULARITY);
> +        cap->vifrsm = n->params.sriov_max_vi_per_vf ?
> +                        cpu_to_le16(n->params.sriov_max_vi_per_vf) :
> +                        cap->vifrt / MAX(max_vfs, 1);
> +    }
>  }
>  
>  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> @@ -6612,11 +6724,14 @@ static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
>      return bar_size;
>  }
>  
> -static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
> -                            uint64_t bar_size)
> +static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
>  {
>      uint16_t vf_dev_id = n->params.use_intel_id ?
>                           PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
> +    NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
> +    uint64_t bar_size = nvme_bar_size(le16_to_cpu(cap->vqfrsm),
> +                                      le16_to_cpu(cap->vifrsm),
> +                                      NULL, NULL);
>  
>      pcie_sriov_pf_init(pci_dev, offset, "nvme", vf_dev_id,
>                         n->params.sriov_max_vfs, n->params.sriov_max_vfs,
> @@ -6714,7 +6829,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>      }
>  
>      if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
> -        nvme_init_sriov(n, pci_dev, 0x120, bar_size);
> +        nvme_init_sriov(n, pci_dev, 0x120);
>      }
>  
>      return 0;
> @@ -6738,6 +6853,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NvmeIdCtrl *id = &n->id_ctrl;
>      uint8_t *pci_conf = pci_dev->config;
>      uint64_t cap = ldq_le_p(&n->bar.cap);
> +    NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
>  
>      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
>      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
> @@ -6829,6 +6945,10 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>  
>      stl_le_p(&n->bar.vs, NVME_SPEC_VER);
>      n->bar.intmc = n->bar.intms = 0;
> +
> +    if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
> +        stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
> +    }
>  }
>  
>  static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
> @@ -6969,6 +7089,14 @@ static Property nvme_props[] = {
>      DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
>                       params.auto_transition_zones, true),
>      DEFINE_PROP_UINT8("sriov_max_vfs", NvmeCtrl, params.sriov_max_vfs, 0),
> +    DEFINE_PROP_UINT16("sriov_vq_flexible", NvmeCtrl,
> +                       params.sriov_vq_flexible, 0),
> +    DEFINE_PROP_UINT16("sriov_vi_flexible", NvmeCtrl,
> +                       params.sriov_vi_flexible, 0),
> +    DEFINE_PROP_UINT8("sriov_max_vi_per_vf", NvmeCtrl,
> +                      params.sriov_max_vi_per_vf, 0),
> +    DEFINE_PROP_UINT8("sriov_max_vq_per_vf", NvmeCtrl,
> +                      params.sriov_max_vq_per_vf, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 86b5b321331..82f11bb08f0 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -407,6 +407,10 @@ typedef struct NvmeParams {
>      bool     auto_transition_zones;
>      bool     legacy_cmb;
>      uint8_t  sriov_max_vfs;
> +    uint16_t sriov_vq_flexible;
> +    uint16_t sriov_vi_flexible;
> +    uint8_t  sriov_max_vq_per_vf;
> +    uint8_t  sriov_max_vi_per_vf;
>  } NvmeParams;
>  
>  typedef struct NvmeCtrl {
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index fde4ddfceec..a8192edcd9d 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -1526,6 +1526,11 @@ typedef struct QEMU_PACKED NvmePriCtrlCap {
>      uint8_t     rsvd80[4016];
>  } NvmePriCtrlCap;
>  
> +typedef enum NvmePriCtrlCapCrt {
> +    NVME_CRT_VQ             = 1 << 0,
> +    NVME_CRT_VI             = 1 << 1,
> +} NvmePriCtrlCapCrt;
> +
>  typedef struct QEMU_PACKED NvmeSecCtrlEntry {
>      uint16_t    scid;
>      uint16_t    pcid;
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  2022-02-17 17:44 ` [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
@ 2022-03-01 12:22   ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2022-03-01 12:22 UTC (permalink / raw)
  To: Lukasz Maniak; +Cc: Keith Busch, Łukasz Gieryk, qemu-devel, qemu-block

[-- Attachment #1: Type: text/plain, Size: 8173 bytes --]

On Feb 17 18:44, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> The NVMe device defines two properties: max_ioqpairs, msix_qsize. Having
> them as constants is problematic for SR-IOV support.
> 
> SR-IOV introduces virtual resources (queues, interrupts) that can be
> assigned to PF and its dependent VFs. Each device, following a reset,
> should work with the configured number of queues. A single constant is
> no longer sufficient to hold the whole state.
> 
> This patch tries to solve the problem by introducing additional
> variables in NvmeCtrl’s state. The variables for, e.g., managing queues
> are therefore organized as:
>  - n->params.max_ioqpairs – no changes, constant set by the user
>  - n->(mutable_state) – (not a part of this patch) user-configurable,
>                         specifies number of queues available _after_
>                         reset
>  - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
>                       n->params.max_ioqpairs; initialized in realize()
>                       and updated during reset() to reflect user’s
>                       changes to the mutable state
> 
> Since the number of available i/o queues and interrupts can change in
> runtime, buffers for sq/cqs and the MSIX-related structures are
> allocated big enough to handle the limits, to completely avoid the
> complicated reallocation. A helper function (nvme_update_msixcap_ts)
> updates the corresponding capability register, to signal configuration
> changes.
> 
> Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>

LGTM.

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/nvme/ctrl.c | 52 ++++++++++++++++++++++++++++++++++----------------
>  hw/nvme/nvme.h |  2 ++
>  2 files changed, 38 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 7c1dd80f21d..f1b4026e4f8 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -445,12 +445,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
>  
>  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
>  {
> -    return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
> +    return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
>  }
>  
>  static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
>  {
> -    return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
> +    return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
>  }
>  
>  static void nvme_inc_cq_tail(NvmeCQueue *cq)
> @@ -4188,8 +4188,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
>          trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
>          return NVME_INVALID_CQID | NVME_DNR;
>      }
> -    if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
> -        n->sq[sqid] != NULL)) {
> +    if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) {
>          trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
>          return NVME_INVALID_QID | NVME_DNR;
>      }
> @@ -4541,8 +4540,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>      trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
>                               NVME_CQ_FLAGS_IEN(qflags) != 0);
>  
> -    if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
> -        n->cq[cqid] != NULL)) {
> +    if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) {
>          trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
>          return NVME_INVALID_QID | NVME_DNR;
>      }
> @@ -4558,7 +4556,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>          trace_pci_nvme_err_invalid_create_cq_vector(vector);
>          return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
>      }
> -    if (unlikely(vector >= n->params.msix_qsize)) {
> +    if (unlikely(vector >= n->conf_msix_qsize)) {
>          trace_pci_nvme_err_invalid_create_cq_vector(vector);
>          return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
>      }
> @@ -5155,13 +5153,12 @@ defaults:
>  
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> -        result = (n->params.max_ioqpairs - 1) |
> -            ((n->params.max_ioqpairs - 1) << 16);
> +        result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16);
>          trace_pci_nvme_getfeat_numq(result);
>          break;
>      case NVME_INTERRUPT_VECTOR_CONF:
>          iv = dw11 & 0xffff;
> -        if (iv >= n->params.max_ioqpairs + 1) {
> +        if (iv >= n->conf_ioqpairs + 1) {
>              return NVME_INVALID_FIELD | NVME_DNR;
>          }
>  
> @@ -5316,10 +5313,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>  
>          trace_pci_nvme_setfeat_numq((dw11 & 0xffff) + 1,
>                                      ((dw11 >> 16) & 0xffff) + 1,
> -                                    n->params.max_ioqpairs,
> -                                    n->params.max_ioqpairs);
> -        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
> -                                      ((n->params.max_ioqpairs - 1) << 16));
> +                                    n->conf_ioqpairs,
> +                                    n->conf_ioqpairs);
> +        req->cqe.result = cpu_to_le32((n->conf_ioqpairs - 1) |
> +                                      ((n->conf_ioqpairs - 1) << 16));
>          break;
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          n->features.async_config = dw11;
> @@ -5757,8 +5754,24 @@ static void nvme_process_sq(void *opaque)
>      }
>  }
>  
> +static void nvme_update_msixcap_ts(PCIDevice *pci_dev, uint32_t table_size)
> +{
> +    uint8_t *config;
> +
> +    if (!msix_present(pci_dev)) {
> +        return;
> +    }
> +
> +    assert(table_size > 0 && table_size <= pci_dev->msix_entries_nr);
> +
> +    config = pci_dev->config + pci_dev->msix_cap;
> +    pci_set_word_by_mask(config + PCI_MSIX_FLAGS, PCI_MSIX_FLAGS_QSIZE,
> +                         table_size - 1);
> +}
> +
>  static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
>  {
> +    PCIDevice *pci_dev = &n->parent_obj;
>      NvmeNamespace *ns;
>      int i;
>  
> @@ -5788,15 +5801,17 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst)
>          g_free(event);
>      }
>  
> -    if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) {
> +    if (!pci_is_vf(pci_dev) && n->params.sriov_max_vfs) {
>          if (rst != NVME_RESET_CONTROLLER) {
> -            pcie_sriov_pf_disable_vfs(&n->parent_obj);
> +            pcie_sriov_pf_disable_vfs(pci_dev);
>          }
>      }
>  
>      n->aer_queued = 0;
>      n->outstanding_aers = 0;
>      n->qs_created = false;
> +
> +    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
>  }
>  
>  static void nvme_ctrl_shutdown(NvmeCtrl *n)
> @@ -6507,6 +6522,9 @@ static void nvme_init_state(NvmeCtrl *n)
>      NvmeSecCtrlEntry *sctrl;
>      int i;
>  
> +    n->conf_ioqpairs = n->params.max_ioqpairs;
> +    n->conf_msix_qsize = n->params.msix_qsize;
> +
>      /* add one to max_ioqpairs to account for the admin queue pair */
>      n->reg_size = pow2ceil(sizeof(NvmeBar) +
>                             2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
> @@ -6668,6 +6686,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>          }
>      }
>  
> +    nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
> +
>      if (n->params.cmb_size_mb) {
>          nvme_init_cmb(n, pci_dev);
>      }
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 5ba07b62dff..314a2894759 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -434,6 +434,8 @@ typedef struct NvmeCtrl {
>      uint64_t    starttime_ms;
>      uint16_t    temperature;
>      uint8_t     smart_critical_warning;
> +    uint32_t    conf_msix_qsize;
> +    uint32_t    conf_ioqpairs;
>  
>      struct {
>          MemoryRegion mem;
> -- 
> 2.25.1
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
  2022-02-17 17:45 ` [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
@ 2022-03-01 12:23   ` Klaus Jensen
  2022-03-21 12:36     ` Lukasz Maniak
  0 siblings, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2022-03-01 12:23 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel,
	Philippe Mathieu-Daudé,
	Heinrich Schuchardt, Keith Busch, Klaus Jensen, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 4146 bytes --]

On Feb 17 18:45, Lukasz Maniak wrote:
> Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>

Please add a short commit description as well. Otherwise,

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  docs/system/devices/nvme.rst | 82 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 
> diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
> index b5acb2a9c19..aba253304e4 100644
> --- a/docs/system/devices/nvme.rst
> +++ b/docs/system/devices/nvme.rst
> @@ -239,3 +239,85 @@ The virtual namespace device supports DIF- and DIX-based protection information
>    to ``1`` to transfer protection information as the first eight bytes of
>    metadata. Otherwise, the protection information is transferred as the last
>    eight bytes.
> +
> +Virtualization Enhancements and SR-IOV (Experimental Support)
> +-------------------------------------------------------------
> +
> +The ``nvme`` device supports Single Root I/O Virtualization and Sharing
> +along with Virtualization Enhancements. The controller has to be linked to
> +an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
> +
> +A number of parameters are present (**please note, that they may be
> +subject to change**):
> +
> +``sriov_max_vfs`` (default: ``0``)
> +  Indicates the maximum number of PCIe virtual functions supported
> +  by the controller. Specifying a non-zero value enables reporting of both
> +  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
> +  by the NVMe device. Virtual function controllers will not report SR-IOV.
> +
> +``sriov_vq_flexible``
> +  Indicates the total number of flexible queue resources assignable to all
> +  the secondary controllers. Implicitly sets the number of primary
> +  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
> +
> +``sriov_vi_flexible``
> +  Indicates the total number of flexible interrupt resources assignable to
> +  all the secondary controllers. Implicitly sets the number of primary
> +  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
> +
> +``sriov_max_vi_per_vf`` (default: ``0``)
> +  Indicates the maximum number of virtual interrupt resources assignable
> +  to a secondary controller. The default ``0`` resolves to
> +  ``(sriov_vi_flexible / sriov_max_vfs)``
> +
> +``sriov_max_vq_per_vf`` (default: ``0``)
> +  Indicates the maximum number of virtual queue resources assignable to
> +  a secondary controller. The default ``0`` resolves to
> +  ``(sriov_vq_flexible / sriov_max_vfs)``
> +
> +The simplest possible invocation enables the capability to set up one VF
> +controller and assign an admin queue, an IO queue, and a MSI-X interrupt.
> +
> +.. code-block:: console
> +
> +   -device nvme-subsys,id=subsys0
> +   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
> +    sriov_vq_flexible=2,sriov_vi_flexible=1
> +
> +The minimum steps required to configure a functional NVMe secondary
> +controller are:
> +
> +  * unbind flexible resources from the primary controller
> +
> +.. code-block:: console
> +
> +   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
> +   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
> +
> +  * perform a Function Level Reset on the primary controller to actually
> +    release the resources
> +
> +.. code-block:: console
> +
> +   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
> +
> +  * enable VF
> +
> +.. code-block:: console
> +
> +   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> +
> +  * assign the flexible resources to the VF and set it ONLINE
> +
> +.. code-block:: console
> +
> +   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
> +   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
> +   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
> +
> +  * bind the NVMe driver to the VF
> +
> +.. code-block:: console
> +
> +   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
> \ No newline at end of file
> -- 
> 2.25.1
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command
  2022-02-17 17:45 ` [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
@ 2022-03-01 13:07   ` Klaus Jensen
  2022-03-09 12:41     ` Łukasz Gieryk
  0 siblings, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2022-03-01 13:07 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Łukasz Gieryk,
	qemu-devel, Philippe Mathieu-Daudé,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 3696 bytes --]

On Feb 17 18:45, Lukasz Maniak wrote:
> From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> 
> With the new command one can:
>  - assign flexible resources (queues, interrupts) to primary and
>    secondary controllers,
>  - toggle the online/offline state of given controller.
> 

QEMU segfaults (or asserts depending on the wind blowing) if the SR-IOV
enabled device is hotplugged after being configured (i.e. follow the
docs for a simple setup and then do a `device_del <nvme-device>` in the
monitor. I suspect this is related to freeing the queues and something
getting double-freed.

The device can be removed just fine if SR-IOV is configured (as in,
parameters are set), but no resources are reserved, onlined etc.


Snip from the backtrace (assert):

qemu-system-x86_64: ../util/qemu-thread-posix.c:78: qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.

Thread 4 "qemu-system-x86" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fcb8ffff640 (LWP 174907)]
0x00007fcb9a85534c in __pthread_kill_implementation () from /usr/lib64/libc.so.6
(gdb) bt
#0  0x00007fcb9a85534c in __pthread_kill_implementation () at /usr/lib64/libc.so.6
#1  0x00007fcb9a8084b8 in raise () at /usr/lib64/libc.so.6
#2  0x00007fcb9a7f2534 in abort () at /usr/lib64/libc.so.6
#3  0x00007fcb9a7f245c in _nl_load_domain.cold () at /usr/lib64/libc.so.6
#4  0x00007fcb9a801116 in  () at /usr/lib64/libc.so.6
#5  0x0000556c1fffc342 in qemu_mutex_lock_impl (mutex=<optimized out>, file=<optimized out>, line=<optimized out>) at ../util/qemu-thread-posix.c:78
#6  qemu_mutex_lock_impl (mutex=<optimized out>, file=<optimized out>, line=<optimized out>) at ../util/qemu-thread-posix.c:74
#7  0x0000556c2001af05 in timer_del (ts=ts@entry=0x7fc9780000a0) at ../util/qemu-timer.c:432
#8  0x0000556c1fc28657 in timer_free (ts=0x7fc9780000a0) at /home/kbj/work/src/qemu/include/qemu/timer.h:633
#9  timer_free (ts=0x7fc9780000a0) at /home/kbj/work/src/qemu/include/qemu/timer.h:630
#10 nvme_free_sq (sq=0x7fc978000090, n=<optimized out>, n=<optimized out>) at ../hw/nvme/ctrl.c:4129
#11 0x0000556c1fc2a369 in nvme_ctrl_reset (n=0x7fc978436e70, rst=NVME_RESET_FUNCTION) at ../hw/nvme/ctrl.c:6007
#12 0x0000556c1fc2a84c in nvme_virt_set_state (n=n@entry=0x556c22d486b0, cntlid=<optimized out>, online=online@entry=0x0) at ../hw/nvme/ctrl.c:5815
#13 0x0000556c1fc2a5c6 in nvme_ctrl_reset (n=0x556c22d486b0, rst=NVME_RESET_FUNCTION) at ../hw/nvme/ctrl.c:6026
#14 0x0000556c1fc2a9e3 in nvme_exit (pci_dev=0x556c22d486b0) at ../hw/nvme/ctrl.c:7265
#15 0x0000556c1fc450e3 in pci_qdev_unrealize (dev=<optimized out>) at ../hw/pci/pci.c:1200
... more here


Snip from the backtrace (segfault)

Thread 7 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f6c635fe640 (LWP 175552)]
0x0000555e275ab63a in nvme_free_sq (sq=0xfea03000, n=<optimized out>, n=<optimized out>) at ../hw/nvme/ctrl.c:4128
4128	    n->sq[sq->sqid] = NULL;
(gdb) bt
#0  0x0000555e275ab63a in nvme_free_sq (sq=0xfea03000, n=<optimized out>, n=<optimized out>) at ../hw/nvme/ctrl.c:4128
#1  0x0000555e275ad369 in nvme_ctrl_reset (n=0x7f6e683793e0, rst=NVME_RESET_FUNCTION) at ../hw/nvme/ctrl.c:6007
#2  0x0000555e275ad84c in nvme_virt_set_state (n=n@entry=0x555e2a2626b0, cntlid=<optimized out>, online=online@entry=0x0) at ../hw/nvme/ctrl.c:5815
#3  0x0000555e275ad5c6 in nvme_ctrl_reset (n=0x555e2a2626b0, rst=NVME_RESET_FUNCTION) at ../hw/nvme/ctrl.c:6026
#4  0x0000555e275ad9e3 in nvme_exit (pci_dev=0x555e2a2626b0) at ../hw/nvme/ctrl.c:7265
#5  0x0000555e275c80e3 in pci_qdev_unrealize (dev=<optimized out>) at ../hw/pci/pci.c:1200
... more here

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command
  2022-03-01 13:07   ` Klaus Jensen
@ 2022-03-09 12:41     ` Łukasz Gieryk
  2022-03-11 12:20       ` Lukasz Maniak
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Gieryk @ 2022-03-09 12:41 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Lukasz Maniak, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, Keith Busch,
	Philippe Mathieu-Daudé

On Tue, Mar 01, 2022 at 02:07:08PM +0100, Klaus Jensen wrote:
> On Feb 17 18:45, Lukasz Maniak wrote:
> > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > 
> > With the new command one can:
> >  - assign flexible resources (queues, interrupts) to primary and
> >    secondary controllers,
> >  - toggle the online/offline state of given controller.
> > 
> 
> QEMU segfaults (or asserts depending on the wind blowing) if the SR-IOV
> enabled device is hotplugged after being configured (i.e. follow the
> docs for a simple setup and then do a `device_del <nvme-device>` in the
> monitor. I suspect this is related to freeing the queues and something
> getting double-freed.
> 

I’ve finally found some time to look at the issue.

Long story short: the hot-plug mechanism deletes all VFs without the PF
knowing, then PF tries to reset and delete all the already non-existing
devices.

I have a solution for the problem, but there’s high a chance it’s not
the correct one. I’m still reading through the specs, as my knowledge in
the area of hot-plug/ACPI is quite limited.

Soon we will release the next patch set, with the fix included. I hope
the ACPI maintainers will chime in then. Till that happens, this is the
summary of my findings:

1) The current SR-IOV implementation assumes it’s the PF that creates
   and deletes VFs.
2) It’s a design decision (the Nvme device at least) for the VFs to be
   of the same class as PF. Effectively, they share the dc->hotpluggable
   value.
3) When a VF is created, it’s added as a child node to PF’s PCI bus
   slot.
4) Monitor/device_del triggers the ACPI mechanism. The implementation is
   not aware of SR/IOV and ejects PF’s PCI slot, directly unrealizing all
   hot-pluggable (!acpi_pcihp_pc_no_hotplug) children nodes.
5) VFs are unrealized directly, and it doesn’t work well with (1).
   SR/IOV structures are not updated, so when it’s PF’s turn to be
   unrealized, it works on stale pointers to already-deleted VFs.

My proposed ‘fix’ is to make the PCI ACPI code aware of SR/IOV:


diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
index f4d706e47d..090bdb8e74 100644
--- a/hw/acpi/pcihp.c
+++ b/hw/acpi/pcihp.c
@@ -196,8 +196,12 @@ static bool acpi_pcihp_pc_no_hotplug(AcpiPciHpState *s, PCIDevice *dev)
      * ACPI doesn't allow hotplug of bridge devices.  Don't allow
      * hot-unplug of bridge devices unless they were added by hotplug
      * (and so, not described by acpi).
+     *
+     * Don't allow hot-unplug of SR-IOV Virtual Functions, as they
+     * will be removed implicitly, when Physical Function is unplugged.
      */
-    return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable;
+    return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable ||
+           pci_is_vf(dev);
 }



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command
  2022-03-09 12:41     ` Łukasz Gieryk
@ 2022-03-11 12:20       ` Lukasz Maniak
  0 siblings, 0 replies; 35+ messages in thread
From: Lukasz Maniak @ 2022-03-11 12:20 UTC (permalink / raw)
  To: Łukasz Gieryk
  Cc: Fam Zheng, Kevin Wolf, Ani Sinha, qemu-block, Michael S. Tsirkin,
	qemu-devel, Philippe Mathieu-Daudé,
	Keith Busch, Hanna Reitz, Stefan Hajnoczi, Klaus Jensen,
	Igor Mammedov

On Wed, Mar 09, 2022 at 01:41:27PM +0100, Łukasz Gieryk wrote:
> On Tue, Mar 01, 2022 at 02:07:08PM +0100, Klaus Jensen wrote:
> > On Feb 17 18:45, Lukasz Maniak wrote:
> > > From: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
> > > 
> > > With the new command one can:
> > >  - assign flexible resources (queues, interrupts) to primary and
> > >    secondary controllers,
> > >  - toggle the online/offline state of given controller.
> > > 
> > 
> > QEMU segfaults (or asserts depending on the wind blowing) if the SR-IOV
> > enabled device is hotplugged after being configured (i.e. follow the
> > docs for a simple setup and then do a `device_del <nvme-device>` in the
> > monitor. I suspect this is related to freeing the queues and something
> > getting double-freed.
> > 
> 
> I’ve finally found some time to look at the issue.
> 
> Long story short: the hot-plug mechanism deletes all VFs without the PF
> knowing, then PF tries to reset and delete all the already non-existing
> devices.
> 
> I have a solution for the problem, but there’s high a chance it’s not
> the correct one. I’m still reading through the specs, as my knowledge in
> the area of hot-plug/ACPI is quite limited.
> 
> Soon we will release the next patch set, with the fix included. I hope
> the ACPI maintainers will chime in then. Till that happens, this is the
> summary of my findings:
> 
> 1) The current SR-IOV implementation assumes it’s the PF that creates
>    and deletes VFs.
> 2) It’s a design decision (the Nvme device at least) for the VFs to be
>    of the same class as PF. Effectively, they share the dc->hotpluggable
>    value.
> 3) When a VF is created, it’s added as a child node to PF’s PCI bus
>    slot.
> 4) Monitor/device_del triggers the ACPI mechanism. The implementation is
>    not aware of SR/IOV and ejects PF’s PCI slot, directly unrealizing all
>    hot-pluggable (!acpi_pcihp_pc_no_hotplug) children nodes.
> 5) VFs are unrealized directly, and it doesn’t work well with (1).
>    SR/IOV structures are not updated, so when it’s PF’s turn to be
>    unrealized, it works on stale pointers to already-deleted VFs.
> 
> My proposed ‘fix’ is to make the PCI ACPI code aware of SR/IOV:
> 

CC'ing ACPI/SMBIOS maintainers/reviewers on the proposed fix.

> 
> diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
> index f4d706e47d..090bdb8e74 100644
> --- a/hw/acpi/pcihp.c
> +++ b/hw/acpi/pcihp.c
> @@ -196,8 +196,12 @@ static bool acpi_pcihp_pc_no_hotplug(AcpiPciHpState *s, PCIDevice *dev)
>       * ACPI doesn't allow hotplug of bridge devices.  Don't allow
>       * hot-unplug of bridge devices unless they were added by hotplug
>       * (and so, not described by acpi).
> +     *
> +     * Don't allow hot-unplug of SR-IOV Virtual Functions, as they
> +     * will be removed implicitly, when Physical Function is unplugged.
>       */
> -    return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable;
> +    return (pc->is_bridge && !dev->qdev.hotplugged) || !dc->hotpluggable ||
> +           pci_is_vf(dev);
>  }
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
  2022-03-01 12:23   ` Klaus Jensen
@ 2022-03-21 12:36     ` Lukasz Maniak
  2022-03-22  6:15       ` Klaus Jensen
  0 siblings, 1 reply; 35+ messages in thread
From: Lukasz Maniak @ 2022-03-21 12:36 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: qemu-block, Łukasz Gieryk, qemu-devel,
	Philippe Mathieu-Daudé,
	Heinrich Schuchardt, Keith Busch, Klaus Jensen, Alex Bennée

On Tue, Mar 01, 2022 at 01:23:18PM +0100, Klaus Jensen wrote:
> On Feb 17 18:45, Lukasz Maniak wrote:
> > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> 
> Please add a short commit description as well. Otherwise,

Klaus,

Sorry I forgot to add the description in v6 aka v7, been really busy
recently.
I am going to add the description for v8.

Regards,
Lukasz
> 
> Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> 
> > ---
> >  docs/system/devices/nvme.rst | 82 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> > diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
> > index b5acb2a9c19..aba253304e4 100644
> > --- a/docs/system/devices/nvme.rst
> > +++ b/docs/system/devices/nvme.rst
> > @@ -239,3 +239,85 @@ The virtual namespace device supports DIF- and DIX-based protection information
> >    to ``1`` to transfer protection information as the first eight bytes of
> >    metadata. Otherwise, the protection information is transferred as the last
> >    eight bytes.
> > +
> > +Virtualization Enhancements and SR-IOV (Experimental Support)
> > +-------------------------------------------------------------
> > +
> > +The ``nvme`` device supports Single Root I/O Virtualization and Sharing
> > +along with Virtualization Enhancements. The controller has to be linked to
> > +an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
> > +
> > +A number of parameters are present (**please note, that they may be
> > +subject to change**):
> > +
> > +``sriov_max_vfs`` (default: ``0``)
> > +  Indicates the maximum number of PCIe virtual functions supported
> > +  by the controller. Specifying a non-zero value enables reporting of both
> > +  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
> > +  by the NVMe device. Virtual function controllers will not report SR-IOV.
> > +
> > +``sriov_vq_flexible``
> > +  Indicates the total number of flexible queue resources assignable to all
> > +  the secondary controllers. Implicitly sets the number of primary
> > +  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
> > +
> > +``sriov_vi_flexible``
> > +  Indicates the total number of flexible interrupt resources assignable to
> > +  all the secondary controllers. Implicitly sets the number of primary
> > +  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
> > +
> > +``sriov_max_vi_per_vf`` (default: ``0``)
> > +  Indicates the maximum number of virtual interrupt resources assignable
> > +  to a secondary controller. The default ``0`` resolves to
> > +  ``(sriov_vi_flexible / sriov_max_vfs)``
> > +
> > +``sriov_max_vq_per_vf`` (default: ``0``)
> > +  Indicates the maximum number of virtual queue resources assignable to
> > +  a secondary controller. The default ``0`` resolves to
> > +  ``(sriov_vq_flexible / sriov_max_vfs)``
> > +
> > +The simplest possible invocation enables the capability to set up one VF
> > +controller and assign an admin queue, an IO queue, and a MSI-X interrupt.
> > +
> > +.. code-block:: console
> > +
> > +   -device nvme-subsys,id=subsys0
> > +   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
> > +    sriov_vq_flexible=2,sriov_vi_flexible=1
> > +
> > +The minimum steps required to configure a functional NVMe secondary
> > +controller are:
> > +
> > +  * unbind flexible resources from the primary controller
> > +
> > +.. code-block:: console
> > +
> > +   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
> > +   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
> > +
> > +  * perform a Function Level Reset on the primary controller to actually
> > +    release the resources
> > +
> > +.. code-block:: console
> > +
> > +   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
> > +
> > +  * enable VF
> > +
> > +.. code-block:: console
> > +
> > +   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> > +
> > +  * assign the flexible resources to the VF and set it ONLINE
> > +
> > +.. code-block:: console
> > +
> > +   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
> > +   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
> > +   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
> > +
> > +  * bind the NVMe driver to the VF
> > +
> > +.. code-block:: console
> > +
> > +   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
> > \ No newline at end of file
> > -- 
> > 2.25.1
> > 
> 
> -- 
> One of us - No more doubt, silence or taboo about mental illness.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements
  2022-03-21 12:36     ` Lukasz Maniak
@ 2022-03-22  6:15       ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2022-03-22  6:15 UTC (permalink / raw)
  To: Lukasz Maniak
  Cc: qemu-block, Łukasz Gieryk, qemu-devel,
	Philippe Mathieu-Daudé,
	Heinrich Schuchardt, Keith Busch, Klaus Jensen, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 540 bytes --]

On Mar 21 13:36, Lukasz Maniak wrote:
> On Tue, Mar 01, 2022 at 01:23:18PM +0100, Klaus Jensen wrote:
> > On Feb 17 18:45, Lukasz Maniak wrote:
> > > Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
> > 
> > Please add a short commit description as well. Otherwise,
> 
> Klaus,
> 
> Sorry I forgot to add the description in v6 aka v7, been really busy
> recently.
> I am going to add the description for v8.
> 

No problem :) Let's hope the ACPI maintainers Ack's your proposed fix
and we can move from there :)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-03-22  6:24 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-17 17:44 [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Lukasz Maniak
2022-02-17 17:44 ` [PATCH v5 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV) Lukasz Maniak
2022-02-18  8:24   ` Michael S. Tsirkin
2022-02-17 17:44 ` [PATCH v5 02/15] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt Lukasz Maniak
2022-02-18  8:24   ` Michael S. Tsirkin
2022-02-17 17:44 ` [PATCH v5 03/15] pcie: Add a helper to the SR/IOV API Lukasz Maniak
2022-02-18  8:25   ` Michael S. Tsirkin
2022-02-17 17:44 ` [PATCH v5 04/15] pcie: Add 1.2 version token for the Power Management Capability Lukasz Maniak
2022-02-18  8:25   ` Michael S. Tsirkin
2022-02-17 17:44 ` [PATCH v5 05/15] hw/nvme: Add support for SR-IOV Lukasz Maniak
2022-02-18  7:06   ` Klaus Jensen
2022-02-17 17:44 ` [PATCH v5 06/15] hw/nvme: Add support for Primary Controller Capabilities Lukasz Maniak
2022-02-17 17:44 ` [PATCH v5 07/15] hw/nvme: Add support for Secondary Controller List Lukasz Maniak
2022-02-17 17:44 ` [PATCH v5 08/15] hw/nvme: Implement the Function Level Reset Lukasz Maniak
2022-02-17 17:44 ` [PATCH v5 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime Lukasz Maniak
2022-03-01 12:22   ` Klaus Jensen
2022-02-17 17:44 ` [PATCH v5 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation Lukasz Maniak
2022-02-17 17:45 ` [PATCH v5 11/15] hw/nvme: Calculate BAR attributes in a function Lukasz Maniak
2022-02-17 17:45 ` [PATCH v5 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers Lukasz Maniak
2022-02-18 14:37   ` Lukasz Maniak
2022-02-17 17:45 ` [PATCH v5 13/15] hw/nvme: Add support for the Virtualization Management command Lukasz Maniak
2022-03-01 13:07   ` Klaus Jensen
2022-03-09 12:41     ` Łukasz Gieryk
2022-03-11 12:20       ` Lukasz Maniak
2022-02-17 17:45 ` [PATCH v5 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements Lukasz Maniak
2022-03-01 12:23   ` Klaus Jensen
2022-03-21 12:36     ` Lukasz Maniak
2022-03-22  6:15       ` Klaus Jensen
2022-02-17 17:45 ` [PATCH v5 15/15] hw/nvme: Update the initalization place for the AER queue Lukasz Maniak
2022-02-18  6:49   ` Klaus Jensen
2022-02-18  8:23 ` [PATCH v5 00/15] hw/nvme: SR-IOV with Virtualization Enhancements Michael S. Tsirkin
2022-02-18 14:33   ` Lukasz Maniak
2022-02-18  8:26 ` Michael S. Tsirkin
2022-02-18  8:51   ` Klaus Jensen
2022-02-18  9:33     ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.