qemu-devel.nongnu.org archive mirror
* [PATCH v4] nvme: allow cmb and pmr emulation on same device
@ 2020-07-01 21:48 Andrzej Jakowski
  2020-07-01 21:48 ` [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register Andrzej Jakowski
  2020-07-01 21:48 ` [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device Andrzej Jakowski
  0 siblings, 2 replies; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-01 21:48 UTC (permalink / raw)
  To: kbusch, kwolf, mreitz; +Cc: qemu-devel, qemu-block

Hi All, 

Resending the series recently posted on the mailing list, which extends
the nvme device, with a couple of fixes after review.

This patch series does the following:
 - Fixes a problem where the CMBS bit was not set in the controller
   capabilities register, so CMB support was not correctly advertised
   to the guest. This is a resend of a patch that has already been
   submitted and reviewed by Klaus [1]
 - Introduces BAR4 sharing between the MSI-X vectors and the CMB. This
   allows CMB and PMR to be emulated on the same device. This extension
   was suggested by Keith [2]

v4:
 - modified BAR4 initialization so that it now consists of a single
   memory region with the CMB, MSI-X and PBA regions overlapped on top
   of it. This reduces patch complexity significantly (Klaus [5])

v3:
 - code style fixes including: removal of spurious line break, moving
   define into define section and code alignment (Klaus [4])
 - removed unintended code reintroduction (Klaus [4])

v2:
 - rebase on Kevin's latest block branch (Klaus [3])
 - improved comments section (Klaus [3])
 - simplified calculation of BAR4 size (Klaus [3])

v1:
 - initial push of the patch

[1]: https://lore.kernel.org/qemu-devel/20200408055607.g2ii7gwqbnv6cd3w@apples.localdomain/
[2]: https://lore.kernel.org/qemu-devel/20200330165518.GA8234@redsun51.ssa.fujisawa.hgst.com/
[3]: https://lore.kernel.org/qemu-devel/20200605181043.28782-1-andrzej.jakowski@linux.intel.com/
[4]: https://lore.kernel.org/qemu-devel/20200618092524.posxi5mysb3jjtpn@apples.localdomain/
[5]: https://lore.kernel.org/qemu-devel/20200626055033.6vxqvi4s5pll7som@apples.localdomain/




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register
  2020-07-01 21:48 [PATCH v4] nvme: allow cmb and pmr emulation on same device Andrzej Jakowski
@ 2020-07-01 21:48 ` Andrzej Jakowski
  2020-07-07 16:27   ` Maxim Levitsky
  2020-07-01 21:48 ` [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device Andrzej Jakowski
  1 sibling, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-01 21:48 UTC (permalink / raw)
  To: kbusch, kwolf, mreitz
  Cc: Klaus Jensen, Andrzej Jakowski, qemu-devel, qemu-block

This patch sets the CMBS bit in the controller capabilities register when
the user configures the NVMe device with CMB support, so the capabilities
are correctly reported to the guest OS.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c      | 2 +-
 include/block/nvme.h | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1aee042d4c..9f11f3e9da 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1582,6 +1582,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
     NVME_CAP_SET_CSS(n->bar.cap, 1);
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
+    NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
 
     n->bar.vs = 0x00010200;
     n->bar.intmc = n->bar.intms = 0;
@@ -1591,7 +1592,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
     NvmeCtrl *n = NVME(pci_dev);
     Error *local_err = NULL;
-
     int i;
 
     nvme_check_constraints(n, &local_err);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 1720ee1d51..14cf398dfa 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -35,6 +35,7 @@ enum NvmeCapShift {
     CAP_MPSMIN_SHIFT   = 48,
     CAP_MPSMAX_SHIFT   = 52,
     CAP_PMR_SHIFT      = 56,
+    CAP_CMB_SHIFT      = 57,
 };
 
 enum NvmeCapMask {
@@ -48,6 +49,7 @@ enum NvmeCapMask {
     CAP_MPSMIN_MASK    = 0xf,
     CAP_MPSMAX_MASK    = 0xf,
     CAP_PMR_MASK       = 0x1,
+    CAP_CMB_MASK       = 0x1,
 };
 
 #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
@@ -78,8 +80,10 @@ enum NvmeCapMask {
                                                            << CAP_MPSMIN_SHIFT)
 #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\
                                                             << CAP_MPSMAX_SHIFT)
-#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
+#define NVME_CAP_SET_PMRS(cap, val)   (cap |= (uint64_t)(val & CAP_PMR_MASK)   \
                                                             << CAP_PMR_SHIFT)
+#define NVME_CAP_SET_CMBS(cap, val)   (cap |= (uint64_t)(val & CAP_CMB_MASK)   \
+                                                           << CAP_CMB_SHIFT)
 
 enum NvmeCcShift {
     CC_EN_SHIFT     = 0,
-- 
2.21.1




* [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-01 21:48 [PATCH v4] nvme: allow cmb and pmr emulation on same device Andrzej Jakowski
  2020-07-01 21:48 ` [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register Andrzej Jakowski
@ 2020-07-01 21:48 ` Andrzej Jakowski
  2020-07-02 10:13   ` Klaus Jensen
  1 sibling, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-01 21:48 UTC (permalink / raw)
  To: kbusch, kwolf, mreitz; +Cc: Andrzej Jakowski, qemu-devel, qemu-block

So far it has not been possible to have CMB and PMR emulated on the same
device, because BAR2 was used exclusively for either the PMR or the CMB.
This patch places the CMB at an offset in BAR4 so that it does not
conflict with the MSI-X vectors.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
---
 hw/block/nvme.c      | 101 +++++++++++++++++++++++++++++--------------
 hw/block/nvme.h      |   1 +
 include/block/nvme.h |   4 +-
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9f11f3e9da..f5156bcfe5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -22,12 +22,12 @@
  *              [pmrdev=<mem_backend_file_id>,] \
  *              max_ioqpairs=<N[optional]>
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Note cmb_size_mb denotes size of CMB in MB. CMB when configured is assumed
+ * to be resident in BAR4 at certain offset - this is because BAR4 is also
+ * used for storing MSI-X table that is available at offset 0 in BAR4.
  *
- * cmb_size_mb= and pmrdev= options are mutually exclusive due to limitation
- * in available BAR's. cmb_size_mb= will take precedence over pmrdev= when
- * both provided.
+ * pmrdev is assumed to be resident in BAR2/BAR3. When configured it consumes
+ * whole BAR2/BAR3 exclusively.
  * Enabling pmr emulation can be achieved by pointing to memory-backend-file.
  * For example:
  * -object memory-backend-file,id=<mem_id>,share=on,mem-path=<file_path>, \
@@ -57,8 +57,8 @@
 #define NVME_MAX_IOQPAIRS 0xffff
 #define NVME_REG_SIZE 0x1000
 #define NVME_DB_SIZE  4
-#define NVME_CMB_BIR 2
 #define NVME_PMR_BIR 2
+#define NVME_MSIX_BIR 4
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -1395,7 +1395,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
         return;
     }
 
-    if (!n->params.cmb_size_mb && n->pmrdev) {
+    if (n->pmrdev) {
         if (host_memory_backend_is_mapped(n->pmrdev)) {
             char *path = object_get_canonical_path_component(OBJECT(n->pmrdev));
             error_setg(errp, "can't use already busy memdev: %s", path);
@@ -1453,33 +1453,70 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     id_ns->nuse = id_ns->ncap;
 }
 
-static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+static void nvme_bar4_init(PCIDevice *pci_dev, Error **errp)
 {
-    NVME_CMBLOC_SET_BIR(n->bar.cmbloc, NVME_CMB_BIR);
-    NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
-
-    NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-    NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
-    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-    NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
-    NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-    NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
-    NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
-
-    n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
-                          "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-    pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+    NvmeCtrl *n = NVME(pci_dev);
+    int status;
+    uint64_t bar_size, cmb_offset = 0;
+    uint32_t msix_vectors;
+    uint32_t nvme_pba_offset;
+    uint32_t cmb_size_units;
+
+    msix_vectors = n->params.max_ioqpairs + 1;
+    nvme_pba_offset = PCI_MSIX_ENTRY_SIZE * msix_vectors;
+    bar_size = nvme_pba_offset + QEMU_ALIGN_UP(msix_vectors, 64) / 8;
+
+    if (n->params.cmb_size_mb) {
+        NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+        NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+        NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+        NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
+        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+        cmb_size_units = NVME_CMBSZ_GETSIZEUNITS(n->bar.cmbsz);
+        cmb_offset = QEMU_ALIGN_UP(bar_size, cmb_size_units);
+
+        NVME_CMBLOC_SET_BIR(n->bar.cmbloc, NVME_MSIX_BIR);
+        NVME_CMBLOC_SET_OFST(n->bar.cmbloc, cmb_offset / cmb_size_units);
+
+        n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+
+        bar_size += cmb_offset;
+        bar_size += NVME_CMBSZ_GETSIZE(n->bar.cmbsz);
+    }
+
+    bar_size = pow2ceil(bar_size);
+
+    /* Create memory region for BAR4, then overlap cmb, msix and pba
+     * tables on top of it.*/
+    memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4", bar_size);
+
+    if (n->params.cmb_size_mb) {
+        memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+                              "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+
+        memory_region_add_subregion(&n->bar4, cmb_offset, &n->ctrl_mem);
+    }
+
+    status = msix_init(pci_dev, n->params.msix_qsize,
+                       &n->bar4, NVME_MSIX_BIR, 0,
+                       &n->bar4, NVME_MSIX_BIR, nvme_pba_offset,
+                       0, errp);
+
+    if (status) {
+        return;
+    }
+
+    pci_register_bar(pci_dev, NVME_MSIX_BIR,
                      PCI_BASE_ADDRESS_SPACE_MEMORY |
                      PCI_BASE_ADDRESS_MEM_TYPE_64 |
-                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+                     PCI_BASE_ADDRESS_MEM_PREFETCH, &n->bar4);
 }
 
 static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
 {
-    /* Controller Capabilities register */
-    NVME_CAP_SET_PMRS(n->bar.cap, 1);
-
     /* PMR Capabities register */
     n->bar.pmrcap = 0;
     NVME_PMRCAP_SET_RDS(n->bar.pmrcap, 0);
@@ -1537,13 +1574,10 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
                           n->reg_size);
     pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
                      PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
-    if (msix_init_exclusive_bar(pci_dev, n->params.msix_qsize, 4, errp)) {
-        return;
-    }
 
-    if (n->params.cmb_size_mb) {
-        nvme_init_cmb(n, pci_dev);
-    } else if (n->pmrdev) {
+    nvme_bar4_init(pci_dev, errp);
+
+    if (n->pmrdev) {
         nvme_init_pmr(n, pci_dev);
     }
 }
@@ -1583,6 +1617,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CSS(n->bar.cap, 1);
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
     NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
+    NVME_CAP_SET_PMRS(n->bar.cap, n->pmrdev ? 1 : 0);
 
     n->bar.vs = 0x00010200;
     n->bar.intmc = n->bar.intms = 0;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1d30c0bca2..b2b9d727a5 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -81,6 +81,7 @@ typedef struct NvmeCtrl {
     PCIDevice    parent_obj;
     MemoryRegion iomem;
     MemoryRegion ctrl_mem;
+    MemoryRegion bar4;
     NvmeBar      bar;
     BlockConf    conf;
     NvmeParams   params;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 14cf398dfa..76d15bdf9f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -216,9 +216,11 @@ enum NvmeCmbszMask {
     (cmbsz |= (uint64_t)(val & CMBSZ_SZU_MASK) << CMBSZ_SZU_SHIFT)
 #define NVME_CMBSZ_SET_SZ(cmbsz, val)    \
     (cmbsz |= (uint64_t)(val & CMBSZ_SZ_MASK) << CMBSZ_SZ_SHIFT)
+#define NVME_CMBSZ_GETSIZEUNITS(cmbsz) \
+    (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz)))
 
 #define NVME_CMBSZ_GETSIZE(cmbsz) \
-    (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
+    (NVME_CMBSZ_SZ(cmbsz) * NVME_CMBSZ_GETSIZEUNITS(cmbsz))
 
 enum NvmePmrcapShift {
     PMRCAP_RDS_SHIFT      = 3,
-- 
2.21.1




* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-01 21:48 ` [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device Andrzej Jakowski
@ 2020-07-02 10:13   ` Klaus Jensen
  2020-07-02 10:31     ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-02 10:13 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul  1 14:48, Andrzej Jakowski wrote:
> So far it has not been possible to have CMB and PMR emulated on the same
> device, because BAR2 was used exclusively for either the PMR or the CMB.
> This patch places the CMB at an offset in BAR4 so that it does not
> conflict with the MSI-X vectors.
> 

Linux craps out when I test this (1MB CMB):

Misaligned __add_pages start: 0xfdd00 end: #fdeff

I tracked it down to check_pfn_span (mm/memory_hotplug.c) failing
because it's not on a 2MB boundary. I then tried to monkey patch the
cmb_offset to be 2MB instead and it succeeds in registering the cmb:

[    8.384565] memmap_init_zone_device initialised 512 pages in 0ms
[    8.385715] nvme 0000:03:00.0: added peer-to-peer DMA memory [mem 0xfd200000-0xfd3fffff 64bit
pref]
[    8.419372] nvme nvme0: 1/0/0 default/read/poll queues

But the kernel then continues to really crap out with a kernel panic:

[    8.440891] DMAR: DRHD: handling fault status reg 2
[    8.440934] BUG: kernel NULL pointer dereference, address: 0000000000000120
[    8.441713] DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr fd200000 [faul
t reason 06] PTE Read access is not set
[    8.442630] #PF: supervisor write access in kernel mode
[    8.444972] #PF: error_code(0x0002) - not-present page
[    8.445640] PGD 0 P4D 0
[    8.445965] Oops: 0002 [#1] PREEMPT SMP PTI
[    8.446499] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc1-00034-gb6cf9836d07f-dirty #19
[    8.447547] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb0
2-prebuilt.qemu.org 04/01/2014
[    8.448898] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
[    8.449525] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 89 c7 0f 1f 44 00 0
0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 41 89 87 2a 01 00 00
e8 28 04 33 c1 0f b7
[    8.451662] RSP: 0018:ffffc9000015cf20 EFLAGS: 00010803
[    8.452321] RAX: 000000000000400b RBX: ffff88826faad0c0 RCX: 0000000000000000
[    8.453293] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    8.454312] RBP: ffff8882725d38e4 R08: 00000001f71e225c R09: 0000000000000000
[    8.455319] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888270bb0000
[    8.456334] R13: 0000000000000000 R14: ffffc9000015cfac R15: 0000000000000000
[    8.457311] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
[    8.458441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.459380] CR2: 0000000000000120 CR3: 0000000271c8c006 CR4: 0000000000360ee0
[    8.460507] Call Trace:
[    8.460906]  <IRQ>
[    8.461272]  nvme_irq+0x10/0x20 [nvme]
[    8.461951]  __handle_irq_event_percpu+0x45/0x1b0
[    8.462803]  ? handle_fasteoi_irq+0x210/0x210
[    8.463585]  handle_irq_event+0x58/0xb0
[    8.464312]  handle_edge_irq+0xae/0x270
[    8.465027]  asm_call_on_stack+0x12/0x20
[    8.465686]  </IRQ>
[    8.466053]  common_interrupt+0x120/0x1f0
[    8.466717]  asm_common_interrupt+0x1e/0x40
[    8.467429] RIP: 0010:default_idle+0x21/0x170
[    8.468140] Code: eb a6 e8 82 27 ff ff cc cc 0f 1f 44 00 00 41 54 55 53 e8 e2 2e ff ff 0f 1f 4
4 00 00 e9 07 00 00 00 0f 00 2d f3 d4 44 00 fb f4 <e8> ca 2e ff ff 89 c5 0f 1f 44 00 00 5b 5d 41
5c c3 89 c5 65 8b 05
[    8.471286] RSP: 0018:ffffc9000009fec8 EFLAGS: 00000282
[    8.472202] RAX: 0000000000000003 RBX: ffff888276ff0000 RCX: 0000000000000001
[    8.473405] RDX: 0000000000000001 RSI: ffffffff8212355f RDI: ffffffff8212d699
[    8.474571] RBP: 0000000000000003 R08: ffff888277d9e4a0 R09: 0000000000000020
[    8.475717] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    8.476921] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    8.478110]  ? default_idle+0xe/0x170
[    8.478728]  do_idle+0x1e1/0x240
[    8.479283]  ? _raw_spin_lock_irqsave+0x19/0x40
[    8.480040]  cpu_startup_entry+0x19/0x20
[    8.480705]  start_secondary+0x153/0x190
[    8.481400]  secondary_startup_64+0xb6/0xc0
[    8.482114] Modules linked in: libata nvme nvme_core scsi_mod t10_pi crc_t10dif crct10dif_gene
ric crct10dif_common virtio_net net_failover failover virtio_rng rng_core
[    8.484675] CR2: 0000000000000120
[    8.485264] ---[ end trace ff1849437c76af12 ]---
[    8.486065] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
[    8.486953] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 89 c7 0f 1f 44 00 0
0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 41 89 87 2a 01 00 00
e8 28 04 33 c1 0f b7
[    8.490234] RSP: 0018:ffffc9000015cf20 EFLAGS: 00010803
[    8.491144] RAX: 000000000000400b RBX: ffff88826faad0c0 RCX: 0000000000000000
[    8.492445] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    8.493681] RBP: ffff8882725d38e4 R08: 00000001f71e225c R09: 0000000000000000
[    8.494907] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888270bb0000
[    8.496130] R13: 0000000000000000 R14: ffffc9000015cfac R15: 0000000000000000
[    8.497363] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
[    8.498726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.499713] CR2: 0000000000000120 CR3: 0000000271c8c006 CR4: 0000000000360ee0
[    8.500935] Kernel panic - not syncing: Fatal exception in interrupt
[    8.502113] Kernel Offset: disabled
[    8.502728] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

I'm out of my depth here, but the CMB/BAR offsets/setup seems sane enough -
could this be a kernel bug?

Keith, any thoughts on this?

> [quoted patch snipped; identical to the patch above]



* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-02 10:13   ` Klaus Jensen
@ 2020-07-02 10:31     ` Klaus Jensen
  2020-07-02 15:07       ` Andrzej Jakowski
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-02 10:31 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul  2 12:13, Klaus Jensen wrote:
> On Jul  1 14:48, Andrzej Jakowski wrote:
> > So far it has not been possible to have CMB and PMR emulated on the same
> > device, because BAR2 was used exclusively for either the PMR or the CMB.
> > This patch places the CMB at an offset in BAR4 so that it does not
> > conflict with the MSI-X vectors.
> > 
> 
> Linux craps out when I test this (1MB CMB):
> 
> Misaligned __add_pages start: 0xfdd00 end: #fdeff
> 
> I tracked it down to check_pfn_span (mm/memory_hotplug.c) failing
> because it's not on a 2MB boundary. I then tried to monkey patch the
> cmb_offset to be 2MB instead and it succeeds in registering the cmb:
> 
> [    8.384565] memmap_init_zone_device initialised 512 pages in 0ms
> [    8.385715] nvme 0000:03:00.0: added peer-to-peer DMA memory [mem 0xfd200000-0xfd3fffff 64bit
> pref]
> [    8.419372] nvme nvme0: 1/0/0 default/read/poll queues
> 
> But the kernel then continues to really crap out with a kernel panic:
> 
> [kernel panic trace snipped; see the previous message]
> 
> I'm out of my depth here, but the CMB/BAR offsets and setup seem sane enough -
> could this be a kernel bug?
> 
> Keith, any thoughts on this?
> 

Aight, an update here. This only happens when QEMU is run with a virtual
IOMMU. Otherwise, the kernel is happy.

With the vIOMMU, qemu also craps out a bit:

qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)

So I think we are back in QEMU land for the bug.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-02 10:31     ` Klaus Jensen
@ 2020-07-02 15:07       ` Andrzej Jakowski
  2020-07-02 17:51         ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-02 15:07 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/2/20 3:31 AM, Klaus Jensen wrote:
> Aight, an update here. This only happens when QEMU is run with a virtual
> IOMMU. Otherwise, the kernel is happy.
> 
> With the vIOMMU, qemu also craps out a bit:
> 
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
> 
> So I think we are back in QEMU land for the bug.

Can you share command line for that?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-02 15:07       ` Andrzej Jakowski
@ 2020-07-02 17:51         ` Klaus Jensen
  2020-07-02 23:33           ` Andrzej Jakowski
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-02 17:51 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul  2 08:07, Andrzej Jakowski wrote:
> On 7/2/20 3:31 AM, Klaus Jensen wrote:
> > Aight, an update here. This only happens when QEMU is run with a virtual
> > IOMMU. Otherwise, the kernel is happy.
> > 
> > With the vIOMMU, qemu also craps out a bit:
> > 
> > qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
> > 
> > So I think we are back in QEMU land for the bug.
> 
> Can you share command line for that?
> 
> 

qemu-system-x86_64 \
  -nodefaults \
  -display none \
  -device intel-iommu,pt,intremap=on,device-iotlb=on \
  -machine type=q35,accel=kvm,kernel_irqchip=split \
  -cpu host \
  -smp 4 \
  -m 8G \
  -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 \
  -device virtio-rng-pci \
  -drive id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap \
  -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
  -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
  -device xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
  -drive id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap \
  -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
  -device nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2 \
  -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
  -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
  -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
  -virtfs local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules \
  -serial mon:stdio \
  -trace pci_nvme*




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-02 17:51         ` Klaus Jensen
@ 2020-07-02 23:33           ` Andrzej Jakowski
  2020-07-06  7:15             ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-02 23:33 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/2/20 10:51 AM, Klaus Jensen wrote:
> On Jul  2 08:07, Andrzej Jakowski wrote:
>> On 7/2/20 3:31 AM, Klaus Jensen wrote:
>>> Aight, an update here. This only happens when QEMU is run with a virtual
>>> IOMMU. Otherwise, the kernel is happy.
>>>
>>> With the vIOMMU, qemu also craps out a bit:
>>>
>>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
>>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
>>>
>>> So I think we are back in QEMU land for the bug.
>>
>> Can you share command line for that?
>>
>>
> 
> qemu-system-x86_64 \
>   -nodefaults \
>   -display none \
>   -device intel-iommu,pt,intremap=on,device-iotlb=on \
>   -machine type=q35,accel=kvm,kernel_irqchip=split \
>   -cpu host \
>   -smp 4 \
>   -m 8G \
>   -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 \
>   -device virtio-rng-pci \
>   -drive id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap \
>   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
>   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
>   -device xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
>   -drive id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap \
>   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
>   -device nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2 \
>   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
>   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
>   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
>   -virtfs local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules \
>   -serial mon:stdio \
>   -trace pci_nvme*
> 
> 

I focused on reproduction, and it looks to me that my patch doesn't
necessarily introduce a regression. I ran it with and without the patch, in
both cases getting an error while registering the CMB. Here is the guest
kernel log:

[   87.606482] nvme nvme0: pci function 0000:00:04.0
[   87.635577] dev=0000000095b0a83b bar=2 size=134217728 offset=0
[   87.636593] nvme nvme0: failed to register the CMB ret=-95
[   87.643262] nvme nvme0: 12/0/0 default/read/poll queues

Any thoughts?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-02 23:33           ` Andrzej Jakowski
@ 2020-07-06  7:15             ` Klaus Jensen
  2020-07-08  4:44               ` Andrzej Jakowski
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-06  7:15 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul  2 16:33, Andrzej Jakowski wrote:
> On 7/2/20 10:51 AM, Klaus Jensen wrote:
> > On Jul  2 08:07, Andrzej Jakowski wrote:
> >> On 7/2/20 3:31 AM, Klaus Jensen wrote:
> >>> Aight, an update here. This only happens when QEMU is run with a virtual
> >>> IOMMU. Otherwise, the kernel is happy.
> >>>
> >>> With the vIOMMU, qemu also craps out a bit:
> >>>
> >>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
> >>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
> >>>
> >>> So I think we are back in QEMU land for the bug.
> >>
> >> Can you share command line for that?
> >>
> >>
> > 
> > qemu-system-x86_64 \
> >   -nodefaults \
> >   -display none \
> >   -device intel-iommu,pt,intremap=on,device-iotlb=on \
> >   -machine type=q35,accel=kvm,kernel_irqchip=split \
> >   -cpu host \
> >   -smp 4 \
> >   -m 8G \
> >   -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 \
> >   -device virtio-rng-pci \
> >   -drive id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap \
> >   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
> >   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
> >   -device xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
> >   -drive id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap \
> >   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
> >   -device nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2 \
> >   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
> >   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
> >   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
> >   -virtfs local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules \
> >   -serial mon:stdio \
> >   -trace pci_nvme*
> > 
> > 
> 
> I focused on reproduction, and it looks to me that my patch doesn't
> necessarily introduce a regression. I ran it with and without the patch, in
> both cases getting an error while registering the CMB. Here is the guest
> kernel log:
> 
> [   87.606482] nvme nvme0: pci function 0000:00:04.0
> [   87.635577] dev=0000000095b0a83b bar=2 size=134217728 offset=0
> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
> 
> Any thoughts?
> 

Hmm, that's not what I am seeing.

With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
pukes both with and without iommu.

BUT! This doesn't mean that your patch is bad; it looks more like an
issue in the kernel. I still think the BAR configuration looks sane, but
I am no expert on this.

To satisfy my curiosity I tried mending your patch to put the CMB at
offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
suggested back in the day). That works, with and without the IOMMU. So, I
think it is an issue with the Linux kernel not being too happy about the
CMB being at an offset. It doesn't directly look like an issue in the
nvme driver, since the issue shows up far lower in the memory subsystem,
but it would be nice to have the Linux NVMe gang at least acknowledge
the issue.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register
  2020-07-01 21:48 ` [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register Andrzej Jakowski
@ 2020-07-07 16:27   ` Maxim Levitsky
  2020-07-07 19:15     ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2020-07-07 16:27 UTC (permalink / raw)
  To: Andrzej Jakowski, kbusch, kwolf, mreitz
  Cc: Klaus Jensen, qemu-devel, qemu-block

On Wed, 2020-07-01 at 14:48 -0700, Andrzej Jakowski wrote:
> This patch sets CMBS bit in controller capabilities register when user
> configures NVMe driver with CMB support, so capabilites are correctly
> reported to guest OS.
> 
> Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
> Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c      | 2 +-
>  include/block/nvme.h | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 1aee042d4c..9f11f3e9da 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1582,6 +1582,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NVME_CAP_SET_TO(n->bar.cap, 0xf);
>      NVME_CAP_SET_CSS(n->bar.cap, 1);
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> +    NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
>  
>      n->bar.vs = 0x00010200;
>      n->bar.intmc = n->bar.intms = 0;
> @@ -1591,7 +1592,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  {
>      NvmeCtrl *n = NVME(pci_dev);
>      Error *local_err = NULL;
> -
>      int i;
>  
>      nvme_check_constraints(n, &local_err);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 1720ee1d51..14cf398dfa 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -35,6 +35,7 @@ enum NvmeCapShift {
>      CAP_MPSMIN_SHIFT   = 48,
>      CAP_MPSMAX_SHIFT   = 52,
>      CAP_PMR_SHIFT      = 56,
> +    CAP_CMB_SHIFT      = 57,
>  };
>  
>  enum NvmeCapMask {
> @@ -48,6 +49,7 @@ enum NvmeCapMask {
>      CAP_MPSMIN_MASK    = 0xf,
>      CAP_MPSMAX_MASK    = 0xf,
>      CAP_PMR_MASK       = 0x1,
> +    CAP_CMB_MASK       = 0x1,
>  };
>  
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> @@ -78,8 +80,10 @@ enum NvmeCapMask {
>                                                             << CAP_MPSMIN_SHIFT)
>  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\
>                                                              << CAP_MPSMAX_SHIFT)
> -#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
> +#define NVME_CAP_SET_PMRS(cap, val)   (cap |= (uint64_t)(val & CAP_PMR_MASK)   \
>                                                              << CAP_PMR_SHIFT)
> +#define NVME_CAP_SET_CMBS(cap, val)   (cap |= (uint64_t)(val & CAP_CMB_MASK)   \
> +                                                           << CAP_CMB_SHIFT)
>  
>  enum NvmeCcShift {
>      CC_EN_SHIFT     = 0,


I wonder how this could have been forgotten. Hmm.
I see that the Linux kernel uses CMBSZ != 0 for that.
I guess this explains it.

Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register
  2020-07-07 16:27   ` Maxim Levitsky
@ 2020-07-07 19:15     ` Klaus Jensen
  2020-07-30 11:26       ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-07 19:15 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kwolf, qemu-block, Klaus Jensen, qemu-devel, mreitz,
	Andrzej Jakowski, kbusch

On Jul  7 19:27, Maxim Levitsky wrote:
> On Wed, 2020-07-01 at 14:48 -0700, Andrzej Jakowski wrote:
> > This patch sets the CMBS bit in the controller capabilities register when the
> > user configures the NVMe device with CMB support, so capabilities are correctly
> > reported to the guest OS.
> > 
> > Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
> > Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme.c      | 2 +-
> >  include/block/nvme.h | 6 +++++-
> >  2 files changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 1aee042d4c..9f11f3e9da 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1582,6 +1582,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> >      NVME_CAP_SET_TO(n->bar.cap, 0xf);
> >      NVME_CAP_SET_CSS(n->bar.cap, 1);
> >      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> > +    NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
> >  
> >      n->bar.vs = 0x00010200;
> >      n->bar.intmc = n->bar.intms = 0;
> > @@ -1591,7 +1592,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  {
> >      NvmeCtrl *n = NVME(pci_dev);
> >      Error *local_err = NULL;
> > -
> >      int i;
> >  
> >      nvme_check_constraints(n, &local_err);
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 1720ee1d51..14cf398dfa 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -35,6 +35,7 @@ enum NvmeCapShift {
> >      CAP_MPSMIN_SHIFT   = 48,
> >      CAP_MPSMAX_SHIFT   = 52,
> >      CAP_PMR_SHIFT      = 56,
> > +    CAP_CMB_SHIFT      = 57,
> >  };
> >  
> >  enum NvmeCapMask {
> > @@ -48,6 +49,7 @@ enum NvmeCapMask {
> >      CAP_MPSMIN_MASK    = 0xf,
> >      CAP_MPSMAX_MASK    = 0xf,
> >      CAP_PMR_MASK       = 0x1,
> > +    CAP_CMB_MASK       = 0x1,
> >  };
> >  
> >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> > @@ -78,8 +80,10 @@ enum NvmeCapMask {
> >                                                             << CAP_MPSMIN_SHIFT)
> >  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\
> >                                                              << CAP_MPSMAX_SHIFT)
> > -#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
> > +#define NVME_CAP_SET_PMRS(cap, val)   (cap |= (uint64_t)(val & CAP_PMR_MASK)   \
> >                                                              << CAP_PMR_SHIFT)
> > +#define NVME_CAP_SET_CMBS(cap, val)   (cap |= (uint64_t)(val & CAP_CMB_MASK)   \
> > +                                                           << CAP_CMB_SHIFT)
> >  
> >  enum NvmeCcShift {
> >      CC_EN_SHIFT     = 0,
> 
> 
> I wonder how this could have been forgotten. Hmm.
> I see that the Linux kernel uses CMBSZ != 0 for that.
> I guess this explains it.
> 
> Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
> 

It is a v1.4 field. CMB support was added when NVMe was at v1.2, and the
Linux kernel also basically adheres to v1.3 wrt. CMB support. In v1.4 the
host actually needs to explicitly enable the CMB - and that is not
something the kernel currently does, IIRC.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-06  7:15             ` Klaus Jensen
@ 2020-07-08  4:44               ` Andrzej Jakowski
  2020-07-15  8:06                 ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-08  4:44 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/6/20 12:15 AM, Klaus Jensen wrote:
> On Jul  2 16:33, Andrzej Jakowski wrote:
>> On 7/2/20 10:51 AM, Klaus Jensen wrote:
>>> On Jul  2 08:07, Andrzej Jakowski wrote:
>>>> On 7/2/20 3:31 AM, Klaus Jensen wrote:
>>>>> Aight, an update here. This only happens when QEMU is run with a virtual
>>>>> IOMMU. Otherwise, the kernel is happy.
>>>>>
>>>>> With the vIOMMU, qemu also craps out a bit:
>>>>>
>>>>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
>>>>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
>>>>>
>>>>> So I think we are back in QEMU land for the bug.
>>>>
>>>> Can you share command line for that?
>>>>
>>>>
>>>
>>> [qemu command line snipped; quoted in full above]
>>>
>>>
>>
>> I focused on reproduction, and it looks to me that my patch doesn't
>> necessarily introduce a regression. I ran it with and without the patch, in
>> both cases getting an error while registering the CMB. Here is the guest
>> kernel log:
>>
>> [   87.606482] nvme nvme0: pci function 0000:00:04.0
>> [   87.635577] dev=0000000095b0a83b bar=2 size=134217728 offset=0
>> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
>> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
>>
>> Any thoughts?
>>
> 
> Hmm, that's not what I am seeing.
> 
> With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
> without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
> pukes both with and without iommu.
> 
> BUT! This doesn't mean that your patch is bad, it looks more like an
> issue in the kernel. I still think the BAR configuration looks sane, but
> I am no expert on this.
> 
> To satisfy my curiosity I tried mending your patch to put the CMB on
> offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
> suggested back in the day). That works. With and without IOMMU. So, I
> think it is an issue with the Linux kernel not being too happy about the
> CMB being at an offset. It doesn't directly look like an issue in the
> nvme driver since the issue shows up far lower in the memory subsystem,
> but it would be nice to have the linux nvme gang at least acknowledge
> the issue.
> 

I have managed to reproduce that problem and played with the patch to see
when the problem occurs and when it doesn't.
When I put MSI-X back in BAR2 (no PMR at all) and left the CMB in BAR4, but
starting at offset 0, I was still able to reproduce the issue.
So then I played with the memory region API and, interestingly, observed
that the problem occurs when region overlaying is used via:

memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4",  bar_size);

if (n->params.cmb_size_mb) {
    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                          "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));

    memory_region_add_subregion_overlap(&n->bar4, cmb_offset, &n->ctrl_mem, 1);
}

On the other hand, when the CMB memory region is initialized without region
overlaying, that is:

memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                      "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));

I get no reproduction.

I also observed qemu complaining about a failed translation:
qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfe400000, level=0x2, slpte=0x0, write=0)
qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfe400000)

Not sure how we want to proceed. Any suggestions?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-08  4:44               ` Andrzej Jakowski
@ 2020-07-15  8:06                 ` Klaus Jensen
  2020-07-15  8:21                   ` Klaus Jensen
  2020-07-21 21:54                   ` Andrzej Jakowski
  0 siblings, 2 replies; 20+ messages in thread
From: Klaus Jensen @ 2020-07-15  8:06 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul  7 21:44, Andrzej Jakowski wrote:
> On 7/6/20 12:15 AM, Klaus Jensen wrote:
> > On Jul  2 16:33, Andrzej Jakowski wrote:
> >> On 7/2/20 10:51 AM, Klaus Jensen wrote:
> >>> On Jul  2 08:07, Andrzej Jakowski wrote:
> >>>> On 7/2/20 3:31 AM, Klaus Jensen wrote:
> >>>>> Aight, an update here. This only happens when QEMU is run with a virtual
> >>>>> IOMMU. Otherwise, the kernel is happy.
> >>>>>
> >>>>> With the vIOMMU, qemu also craps out a bit:
> >>>>>
> >>>>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfd200000, level=0x2, slpte=0x0, write=0)
> >>>>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfd200000)
> >>>>>
> >>>>> So I think we are back in QEMU land for the bug.
> >>>>
> >>>> Can you share command line for that?
> >>>>
> >>>>
> >>>
> >>> [qemu command line snipped; quoted in full above]
> >>>
> >>>
> >>
> >> I focused on reproduction, and it looks to me that my patch doesn't
> >> necessarily introduce a regression. I ran it with and without the patch, in
> >> both cases getting an error while registering the CMB. Here is the guest
> >> kernel log:
> >>
> >> [   87.606482] nvme nvme0: pci function 0000:00:04.0
> >> [   87.635577] dev=0000000095b0a83b bar=2 size=134217728 offset=0
> >> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
> >> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
> >>
> >> Any thoughts?
> >>
> > 
> > Hmm, that's not what I am seeing.
> > 
> > With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
> > without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
> > pukes both with and without iommu.
> > 
> > BUT! This doesn't mean that your patch is bad, it looks more like an
> > issue in the kernel. I still think the BAR configuration looks sane, but
> > I am no expert on this.
> > 
> > To satisfy my curiosity I tried mending your patch to put the CMB on
> > offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
> > suggested back in the day). That works. With and without IOMMU. So, I
> > think it is an issue with the Linux kernel not being too happy about the
> > CMB being at an offset. It doesn't directly look like an issue in the
> > nvme driver since the issue shows up far lower in the memory subsystem,
> > but it would be nice to have the linux nvme gang at least acknowledge
> > the issue.
> > 
> 
> I have managed to reproduce that problem and played with the patch to see
> when the problem occurs and when it doesn't.
> When I put MSI-X back in BAR2 (no PMR at all) and left the CMB in BAR4, but
> starting at offset 0, I was still able to reproduce the issue.
> So then I played with the memory region API and, interestingly, observed
> that the problem occurs when region overlaying is used via:
> 
> memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4",  bar_size);
> 
> if (n->params.cmb_size_mb) {
>     memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
>                           "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> 
>     memory_region_add_subregion_overlap(&n->bar4, cmb_offset, &n->ctrl_mem, 1);
> }
> 
> On the other hand, when the CMB memory region is initialized without region
> overlaying, that is:
> 
> memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
>                       "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> 
> I get no reproduction.
> 
> I also observed qemu complaining about a failed translation:
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error (iova=0xfe400000, level=0x2, slpte=0x0, write=0)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=03:00:00, iova=0xfe400000)
> 
> Not sure how we want to proceed. Any suggestions?
> 

Hi Andrzej,

I've not been ignoring this, but sorry for not following up earlier.

I'm hesitant to merge anything that very obviously breaks an OS that we
know is used a lot with this device, especially because the issue has
not been analyzed well enough to actually know whether this is a QEMU or
a kernel issue.

Now, as far as I can test, having the MSI-X vector table and PBA in BAR
0, the PMR in BAR 2 and the CMB in BAR 4 seems to make everyone happy
(regardless of whether the IOMMU is on or off).

Later, when the issue is better understood, we can add options to set
offsets, BIRs etc.

The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
git://git.infradead.org/qemu-nvme.git nvme-next branch.

Can you reproduce the issues with that patch? I can't on a stock Arch
Linux 5.7.5-arch1-1 kernel.


---
 hw/block/nvme.c | 41 +++++++++++++++++++++++++++++------------
 hw/block/nvme.h |  1 +
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ce96a14f9f1a..903362d0bdb8 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -25,9 +25,8 @@
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
  *
- * cmb_size_mb= and pmrdev= options are mutually exclusive due to limitation
- * in available BAR's. cmb_size_mb= will take precedence over pmrdev= when
- * both provided.
+ * pmrdev is assumed to be resident in BAR4. When configured it consumes
+ * whole BAR4/BAR5 exclusively.
  * Enabling pmr emulation can be achieved by pointing to memory-backend-file.
  * For example:
  * -object memory-backend-file,id=<mem_id>,share=on,mem-path=<file_path>, \
@@ -57,7 +56,7 @@
 #define NVME_MAX_IOQPAIRS 0xffff
 #define NVME_DB_SIZE  4
 #define NVME_CMB_BIR 2
-#define NVME_PMR_BIR 2
+#define NVME_PMR_BIR 4
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -1394,7 +1393,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
         return;
     }
 
-    if (!n->params.cmb_size_mb && n->pmrdev) {
+    if (n->pmrdev) {
         if (host_memory_backend_is_mapped(n->pmrdev)) {
             char *path = object_get_canonical_path_component(OBJECT(n->pmrdev));
             error_setg(errp, "can't use already busy memdev: %s", path);
@@ -1476,9 +1475,6 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
 
 static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
 {
-    /* Controller Capabilities register */
-    NVME_CAP_SET_PMRS(n->bar.cap, 1);
-
     /* PMR Capabities register */
     n->bar.pmrcap = 0;
     NVME_PMRCAP_SET_RDS(n->bar.pmrcap, 0);
@@ -1526,23 +1522,43 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice *pci_dev)
 static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
+    uint32_t msix_vectors;
+    uint32_t pba_offset;
+    uint64_t bar_size, msix_offset = 0;
 
     pci_conf[PCI_INTERRUPT_PIN] = 1;
     pci_config_set_prog_interface(pci_conf, 0x2);
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
     pcie_endpoint_cap_init(pci_dev, 0x80);
 
+    msix_offset = n->reg_size;
+    bar_size = n->reg_size;
+
+    msix_vectors = n->params.max_ioqpairs + 1;
+    pba_offset = msix_offset + PCI_MSIX_ENTRY_SIZE * msix_vectors;
+    bar_size += pba_offset + QEMU_ALIGN_UP(msix_vectors, 64) / 8;
+    bar_size = pow2ceil(bar_size);
+
+    memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
                           n->reg_size);
-    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
-                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->iomem);
-    if (msix_init_exclusive_bar(pci_dev, n->params.msix_qsize, 4, errp)) {
+    memory_region_add_subregion(&n->bar0, 0, &n->iomem);
+
+    if (msix_init(pci_dev, n->params.msix_qsize,
+                  &n->bar0, 0, msix_offset,
+                  &n->bar0, 0, pba_offset,
+                  0, errp)) {
         return;
     }
 
+    pci_register_bar(pci_dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY |
+                     PCI_BASE_ADDRESS_MEM_TYPE_64, &n->bar0);
+
     if (n->params.cmb_size_mb) {
         nvme_init_cmb(n, pci_dev);
-    } else if (n->pmrdev) {
+    }
+
+    if (n->pmrdev) {
         nvme_init_pmr(n, pci_dev);
     }
 }
@@ -1582,6 +1598,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CSS(n->bar.cap, 1);
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
     NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
+    NVME_CAP_SET_PMRS(n->bar.cap, n->pmrdev ? 1 : 0);
 
     n->bar.vs = 0x00010200;
     n->bar.intmc = n->bar.intms = 0;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1d30c0bca283..f312d18b315b 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -79,6 +79,7 @@ static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
 
 typedef struct NvmeCtrl {
     PCIDevice    parent_obj;
+    MemoryRegion bar0;
     MemoryRegion iomem;
     MemoryRegion ctrl_mem;
     NvmeBar      bar;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-15  8:06                 ` Klaus Jensen
@ 2020-07-15  8:21                   ` Klaus Jensen
  2020-07-21 21:54                   ` Andrzej Jakowski
  1 sibling, 0 replies; 20+ messages in thread
From: Klaus Jensen @ 2020-07-15  8:21 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul 15 10:06, Klaus Jensen wrote:
> 
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.
> 

I was afraid I said this a bit too early, since the stock kernel does not
enable the IOMMU by default. I tested with another kernel and, yes, I still
cannot reproduce the VT-d errors or any complaints from the kernel (with
or without the IOMMU).



* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-15  8:06                 ` Klaus Jensen
  2020-07-15  8:21                   ` Klaus Jensen
@ 2020-07-21 21:54                   ` Andrzej Jakowski
  2020-07-22  7:43                     ` Klaus Jensen
  1 sibling, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-21 21:54 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/15/20 1:06 AM, Klaus Jensen wrote:
> Hi Andrzej,
> 
> I've not been ignoring this, but sorry for not following up earlier.
> 
> I'm hesitent to merge anything that very obviously breaks an OS that we
> know is used a lot to this using this device. Also because the issue has
> not been analyzed well enough to actually know if this is a QEMU or
> kernel issue.

Hi Klaus,

Thx for your response! I understand your hesitance on merging stuff that
obviously breaks guest OS. 

> 
> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> (irregardless of IOMMU on/off).
> 
> Later, when the issue is better understood, we can add options to set
> offsets, BIRs etc.
> 
> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> git://git.infradead.org/qemu-nvme.git nvme-next branch.
> 
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.

While I'm happy that the approach with MSI-X and PBA in BAR0 works fine, I
feel that the investigation into why it works while mine doesn't is
missing. It looks to me that both patches follow basically the same
approach: create a memory subregion and overlay it on top of another memory
region. Why does one work and the other not?

With that in mind, I have recently focused on understanding the problem.
I observed that when the guest assigns an address to BAR4, the addr field in
the nvme-bar4 memory region gets populated, but it doesn't get automatically
populated in the ctrl_mem (CMB) memory subregion, so later when nvme_addr_is_cmb()
is called the address check works incorrectly and, as a consequence, the VMM does
a DMA read instead of a memcpy.
I created a patch that sets the correct address on the ctrl_mem subregion, and the
guest OS boots up correctly.

When I looked into the PCI and memory region code I noticed that indeed the address
is only assigned to the top-level memory region, not to the contained subregions.
I think your approach works fine because the CMB grabs the whole BAR exclusively.

Here is my question (perhaps PCI folks can help answer :)): if we consider
memory region overlapping for PCI devices a valid use case, should the PCI
code, on configuration cycles, walk through all contained subregions and
update their addr fields accordingly?

Thx!



* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-21 21:54                   ` Andrzej Jakowski
@ 2020-07-22  7:43                     ` Klaus Jensen
  2020-07-22 17:00                       ` Andrzej Jakowski
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-22  7:43 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

@keith, please see below - can you comment on the Linux kernel 2 MB
boundary requirement for the CMB? Or should we hail Stephen (or Logan
maybe) since this seems to be related to p2pdma?

On Jul 21 14:54, Andrzej Jakowski wrote:
> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> > Hi Andrzej,
> > 
> > I've not been ignoring this, but sorry for not following up earlier.
> > 
> > I'm hesitent to merge anything that very obviously breaks an OS that we
> > know is used a lot to this using this device. Also because the issue has
> > not been analyzed well enough to actually know if this is a QEMU or
> > kernel issue.
> 
> Hi Klaus,
> 
> Thx for your response! I understand your hesitance on merging stuff that
> obviously breaks guest OS. 
> 
> > 
> > Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> > 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> > (irregardless of IOMMU on/off).
> > 
> > Later, when the issue is better understood, we can add options to set
> > offsets, BIRs etc.
> > 
> > The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> > be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> > git://git.infradead.org/qemu-nvme.git nvme-next branch.
> > 
> > Can you reproduce the issues with that patch? I can't on a stock Arch
> > Linux 5.7.5-arch1-1 kernel.
> 
> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
> feel that investigation part why it works while mine doesn't is
> missing. It looks to me that both patches are basically following same 
> approach: create memory subregion and overlay on top of other memory
> region. Why one works and the other doesn't then?
> 
> Having in mind that, I have recently focused on understanding problem.
> I observed that when guest assigns address to BAR4, addr field in
> nvme-bar4 memory region gets populated, but it doesn't get automatically
> populated in ctrl_mem (cmb) memory subregion, so later when nvme_addr_is_cmb() 
> is called address check works incorrectly and as a consequence vmm does dma 
> read instead of memcpy.
> I created a patch that sets correct address on ctrl_mem subregion and guest 
> OS boots up correctly.
> 
> When I looked into pci and memory region code I noticed that indeed address
> is only assigned to top level memory region but not to contained subregions.
> I think that because in your approach cmb grabs whole bar exclusively it works
> fine.
> 
> Here is my question (perhaps pci folks can help answer :)): if we consider 
> memory region overlapping for pci devices as valid use case should pci 
> code on configuration cycles walk through all contained subregion and
> update addr field accordingly?
> 
> Thx!
> 

Hi Andrzej,

Thanks for looking into this. I think your analysis helped me nail this.
The problem is that we added the use of a subregion and have some
assumptions that no longer hold.

nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
But when the memory region is a subregion, addr holds an offset into the
parent container instead. Thus, changing all occurrences of
n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
(this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
into your original patch[1]. The reason my version worked is that there
was no subregion involved for the CMB, so the existing address
validation calculations were still correct.

This leaves us with the Linux kernel complaining about not being able to
register the CMB if it is not on a 2MB boundary - this is probably just
a constraint in the kernel that we can't do anything about (but I'm no
kernel hacker...), which can be fixed by either being "nice" towards the
Linux kernel by forcing a 2 MB alignment in the device or exposing the
SZU field such that the user can choose 16MiB size units (or higher)
instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
device such that we do not have to introduce new cmb_size parameters,
while also making it easier for the user to configure. But I'm not
really sure.

  [1]: Message-Id: <20200701214858.28515-3-andrzej.jakowski@linux.intel.com>



* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-22  7:43                     ` Klaus Jensen
@ 2020-07-22 17:00                       ` Andrzej Jakowski
  2020-07-22 17:21                         ` Klaus Jensen
  0 siblings, 1 reply; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-22 17:00 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/22/20 12:43 AM, Klaus Jensen wrote:
> @keith, please see below - can you comment on the Linux kernel 2 MB
> boundary requirement for the CMB? Or should we hail Stephen (or Logan
> maybe) since this seems to be related to p2pdma?
> 
> On Jul 21 14:54, Andrzej Jakowski wrote:
>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>> [...]
>>
>> Hi Klaus,
>>
>> Thx for your response! I understand your hesitance on merging stuff that
>> obviously breaks guest OS. 
>>
>>> [...]
>>
>> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
>> feel that investigation part why it works while mine doesn't is
>> missing. It looks to me that both patches are basically following same 
>> approach: create memory subregion and overlay on top of other memory
>> region. Why one works and the other doesn't then?
>>
>> Having in mind that, I have recently focused on understanding problem.
>> I observed that when guest assigns address to BAR4, addr field in
>> nvme-bar4 memory region gets populated, but it doesn't get automatically
>> populated in ctrl_mem (cmb) memory subregion, so later when nvme_addr_is_cmb() 
>> is called address check works incorrectly and as a consequence vmm does dma 
>> read instead of memcpy.
>> I created a patch that sets correct address on ctrl_mem subregion and guest 
>> OS boots up correctly.
>>
>> When I looked into pci and memory region code I noticed that indeed address
>> is only assigned to top level memory region but not to contained subregions.
>> I think that because in your approach cmb grabs whole bar exclusively it works
>> fine.
>>
>> Here is my question (perhaps pci folks can help answer :)): if we consider 
>> memory region overlapping for pci devices as valid use case should pci 
>> code on configuration cycles walk through all contained subregion and
>> update addr field accordingly?
>>
>> Thx!
>>
> 
> Hi Andrzej,
> 
> Thanks for looking into this. I think your analysis helped me nail this.
> The problem is that we added the use of a subregion and have some
> assumptions that no longer hold.
> 
> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> But when the memory region is a subregion, addr holds an offset into the
> parent container instead. Thus, changing all occurances of
> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> in your original patch[1]. The reason my version worked is because there
> was no subregion involved for the CMB, so the existing address
> validation calculations were still correct.

I'm a little bit concerned with this approach,
(n->bar0.addr + n->ctrl_mem.addr), and hoping to have some debate. Let me
describe my understanding of the problem.
It looks to me that the addr field sometimes contains an *absolute* address
(when no hierarchy is used) and other times a *relative* address (when a
hierarchy is created). From my perspective the use of this field is
inconsistent and thus error-prone.
Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
solve the root problem and is still prone to the same issue if in the future
we build an even more complicated hierarchy.
I think that we could solve it by introducing a helper function like

hwaddr memory_region_get_abs_addr(MemoryRegion *mr)

to retrieve the absolute address, and noting in the documentation that the addr
field can be relative or absolute and that it is recommended to use the above
function to retrieve the absolute address.
What do you think?

> 
> This leaves us with the Linux kernel complaining about not being able to
> register the CMB if it is not on a 2MB boundary - this is probably just
> a constraint in the kernel that we can't do anything about (but I'm no
> kernel hacker...), which can be fixed by either being "nice" towards the
> Linux kernel by forcing a 2 MB alignment in the device or exposing the
> SZU field such that the user can choose 16MiB size units (or higher)
> instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
> device such that we do not have to introduce new cmb_size parameters,
> while also making it easier for the user to configure. But I'm not
> really sure.
This is a kernel limitation that we have to live with. The minimum granularity
of devm_memremap_pages() is 2MB and it must be 2MB aligned. It used to be worse,
at 128MB size+align (the section size), but the sub-section memory-hotplug
patches adjusted that to a 2MB section.
Ensuring 2MiB size and alignment in the device emulation makes sense to me.
Perhaps we could document those limitations, making users more aware of them.

> 
>   [1]: Message-Id: <20200701214858.28515-3-andrzej.jakowski@linux.intel.com>
> 




* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-22 17:00                       ` Andrzej Jakowski
@ 2020-07-22 17:21                         ` Klaus Jensen
  2020-07-22 18:14                           ` Andrzej Jakowski
  0 siblings, 1 reply; 20+ messages in thread
From: Klaus Jensen @ 2020-07-22 17:21 UTC (permalink / raw)
  To: Andrzej Jakowski; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On Jul 22 10:00, Andrzej Jakowski wrote:
> On 7/22/20 12:43 AM, Klaus Jensen wrote:
> > @keith, please see below - can you comment on the Linux kernel 2 MB
> > boundary requirement for the CMB? Or should we hail Stephen (or Logan
> > maybe) since this seems to be related to p2pdma?
> > 
> > On Jul 21 14:54, Andrzej Jakowski wrote:
> >> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> >>> [...]
> >>
> >> Hi Klaus,
> >>
> >> Thx for your response! I understand your hesitance on merging stuff that
> >> obviously breaks guest OS. 
> >>
> >>> [...]
> >>
> >> [...]
> >>
> > 
> > Hi Andrzej,
> > 
> > Thanks for looking into this. I think your analysis helped me nail this.
> > The problem is that we added the use of a subregion and have some
> > assumptions that no longer hold.
> > 
> > nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> > But when the memory region is a subregion, addr holds an offset into the
> > parent container instead. Thus, changing all occurances of
> > n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> > (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> > in your original patch[1]. The reason my version worked is because there
> > was no subregion involved for the CMB, so the existing address
> > validation calculations were still correct.
> 
> I'm a little bit concerned with this approach:
> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
> describe my understanding of the problem.

Oh. In the context of your patch I meant bar4 of course, but anyway.

> It looks to me that addr field sometimes contains *absolute* address (when no 
> hierarchy is used) and other times it contains *relative* address (when
> hierarchy is created). From my perspective use of this field is inconsistent
> and thus error-prone.  
> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
> solve root problem and is still prone to the same problem if in the future
> we potentially build even more complicated hierarchy.
> I think that we could solve it by introducing helper function like
> 
> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
> 
> to retrieve absolute address and in the documentation indicate that addr field
> can be relative or absolute and it is recommended to use above function to 
> retrieve absolute address.
> What do you think?
> 

I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
did just to convince myself that this was the issue ;)

I think the helper might already be there in memory.c. It's just not
exported.

   static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr, hwaddr offset)

> > 
> > This leaves us with the Linux kernel complaining about not being able to
> > register the CMB if it is not on a 2MB boundary - this is probably just
> > a constraint in the kernel that we can't do anything about (but I'm no
> > kernel hacker...), which can be fixed by either being "nice" towards the
> > Linux kernel by forcing a 2 MB alignment in the device or exposing the
> > SZU field such that the user can choose 16MiB size units (or higher)
> > instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
> > device such that we do not have to introduce new cmb_size parameters,
> > while also making it easier for the user to configure. But I'm not
> > really sure.
> This is kernel limitation that we have to live with. The minimum granularity
> of devm_memremap_pages() is 2MB and it must be 2MB aligned. It used to worse
> at 128MB size+align (section-size), but sub-section memory-hotplug patches 
> adjusted that to a 2MB section. 

Thanks for that explanation!

> Ensuring 2MiB size and alignment in the device emulation makes sense to me.
> Perhaps we could document that limitations - making user more aware of it.
> 

Sounds good to me.



* Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
  2020-07-22 17:21                         ` Klaus Jensen
@ 2020-07-22 18:14                           ` Andrzej Jakowski
  0 siblings, 0 replies; 20+ messages in thread
From: Andrzej Jakowski @ 2020-07-22 18:14 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: kbusch, kwolf, qemu-devel, qemu-block, mreitz

On 7/22/20 10:21 AM, Klaus Jensen wrote:
> On Jul 22 10:00, Andrzej Jakowski wrote:
>> On 7/22/20 12:43 AM, Klaus Jensen wrote:
>>> @keith, please see below - can you comment on the Linux kernel 2 MB
>>> boundary requirement for the CMB? Or should we hail Stephen (or Logan
>>> maybe) since this seems to be related to p2pdma?
>>>
>>> On Jul 21 14:54, Andrzej Jakowski wrote:
>>>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>>>> [...]
>>>>
>>>> Hi Klaus,
>>>>
>>>> Thx for your response! I understand your hesitance on merging stuff that
>>>> obviously breaks guest OS. 
>>>>
>>>>> [...]
>>>>
>>>> [...]
>>>>
>>>
>>> Hi Andrzej,
>>>
>>> Thanks for looking into this. I think your analysis helped me nail this.
>>> The problem is that we added the use of a subregion and have some
>>> assumptions that no longer hold.
>>>
>>> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
>>> But when the memory region is a subregion, addr holds an offset into the
>>> parent container instead. Thus, changing all occurances of
>>> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
>>> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
>>> in your original patch[1]. The reason my version worked is because there
>>> was no subregion involved for the CMB, so the existing address
>>> validation calculations were still correct.
>>
>> I'm a little bit concerned with this approach:
>> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
>> describe my understanding of the problem.
> 
> Oh. In the context of your patch I meant bar4 of course, but anyway.
> 
>> It looks to me that addr field sometimes contains *absolute* address (when no 
>> hierarchy is used) and other times it contains *relative* address (when
>> hierarchy is created). From my perspective use of this field is inconsistent
>> and thus error-prone.  
>> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
>> solve root problem and is still prone to the same problem if in the future
>> we potentially build even more complicated hierarchy.
>> I think that we could solve it by introducing helper function like
>>
>> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
>>
>> to retrieve absolute address and in the documentation indicate that addr field
>> can be relative or absolute and it is recommended to use above function to 
>> retrieve absolute address.
>> What do you think?
>>
> 
> I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
> did just to convince myself that this was the issue ;)
> 
> I think the helper might already be there in memory.c. It's just not
> exported.
> 
>    static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr, hwaddr offset)

Awesome! I didn't notice. Let me check and repost series soon :)

> 
>>>
>>> This leaves us with the Linux kernel complaining about not being able to
>>> register the CMB if it is not on a 2MB boundary - this is probably just
>>> a constraint in the kernel that we can't do anything about (but I'm no
>>> kernel hacker...), which can be fixed by either being "nice" towards the
>>> Linux kernel by forcing a 2 MB alignment in the device or exposing the
>>> SZU field such that the user can choose 16MiB size units (or higher)
>>> instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
>>> device such that we do not have to introduce new cmb_size parameters,
>>> while also making it easier for the user to configure. But I'm not
>>> really sure.
>> This is kernel limitation that we have to live with. The minimum granularity
>> of devm_memremap_pages() is 2MB and it must be 2MB aligned. It used to worse
>> at 128MB size+align (section-size), but sub-section memory-hotplug patches 
>> adjusted that to a 2MB section. 
> 
> Thanks for that explanation!
> 
>> Ensuring 2MiB size and alignment in the device emulation makes sense to me.
>> Perhaps we could document that limitations - making user more aware of it.
>>
> 
> Sounds good to me.
> 




* Re: [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register
  2020-07-07 19:15     ` Klaus Jensen
@ 2020-07-30 11:26       ` Maxim Levitsky
  0 siblings, 0 replies; 20+ messages in thread
From: Maxim Levitsky @ 2020-07-30 11:26 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: kwolf, qemu-block, Klaus Jensen, qemu-devel, mreitz,
	Andrzej Jakowski, kbusch

On Tue, 2020-07-07 at 21:15 +0200, Klaus Jensen wrote:
> On Jul  7 19:27, Maxim Levitsky wrote:
> > On Wed, 2020-07-01 at 14:48 -0700, Andrzej Jakowski wrote:
> > > This patch sets CMBS bit in controller capabilities register when user
> > > configures NVMe driver with CMB support, so capabilites are correctly
> > > reported to guest OS.
> > > 
> > > Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
> > > Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/nvme.c      | 2 +-
> > >  include/block/nvme.h | 6 +++++-
> > >  2 files changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 1aee042d4c..9f11f3e9da 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1582,6 +1582,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> > >      NVME_CAP_SET_TO(n->bar.cap, 0xf);
> > >      NVME_CAP_SET_CSS(n->bar.cap, 1);
> > >      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> > > +    NVME_CAP_SET_CMBS(n->bar.cap, n->params.cmb_size_mb ? 1 : 0);
> > >  
> > >      n->bar.vs = 0x00010200;
> > >      n->bar.intmc = n->bar.intms = 0;
> > > @@ -1591,7 +1592,6 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > >  {
> > >      NvmeCtrl *n = NVME(pci_dev);
> > >      Error *local_err = NULL;
> > > -
> > >      int i;
> > >  
> > >      nvme_check_constraints(n, &local_err);
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index 1720ee1d51..14cf398dfa 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -35,6 +35,7 @@ enum NvmeCapShift {
> > >      CAP_MPSMIN_SHIFT   = 48,
> > >      CAP_MPSMAX_SHIFT   = 52,
> > >      CAP_PMR_SHIFT      = 56,
> > > +    CAP_CMB_SHIFT      = 57,
> > >  };
> > >  
> > >  enum NvmeCapMask {
> > > @@ -48,6 +49,7 @@ enum NvmeCapMask {
> > >      CAP_MPSMIN_MASK    = 0xf,
> > >      CAP_MPSMAX_MASK    = 0xf,
> > >      CAP_PMR_MASK       = 0x1,
> > > +    CAP_CMB_MASK       = 0x1,
> > >  };
> > >  
> > >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> > > @@ -78,8 +80,10 @@ enum NvmeCapMask {
> > >                                                             << CAP_MPSMIN_SHIFT)
> > >  #define NVME_CAP_SET_MPSMAX(cap, val) (cap |= (uint64_t)(val & CAP_MPSMAX_MASK)\
> > >                                                              << CAP_MPSMAX_SHIFT)
> > > -#define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
> > > +#define NVME_CAP_SET_PMRS(cap, val)   (cap |= (uint64_t)(val & CAP_PMR_MASK)   \
> > >                                                              << CAP_PMR_SHIFT)
> > > +#define NVME_CAP_SET_CMBS(cap, val)   (cap |= (uint64_t)(val & CAP_CMB_MASK)   \
> > > +                                                           << CAP_CMB_SHIFT)
> > >  
> > >  enum NvmeCcShift {
> > >      CC_EN_SHIFT     = 0,
> > 
> > I wonder how this could have been forgotten. Hmm.
> > I see that the Linux kernel uses CMBSZ != 0 for that.
> > I guess this explains it.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
> > 
> 
> It is a v1.4 field. The CMB support was added when NVMe was at v1.2.
> And the Linux kernel is also basically adhering to v1.3 wrt. CMB
> support. In v1.4 the host actually needs to specifically enable the CMB
> - and that is not something the kernel does currently IIRC.
> 
Ah, makes sense!
By now I have the specs for each NVMe revision, but I sometimes get lazy
about cross-checking them.

Best regards,
	Maxim Levitsky




end of thread, other threads:[~2020-07-30 11:27 UTC | newest]

Thread overview: 20+ messages
-- links below jump to the message on this page --
2020-07-01 21:48 [PATCH v4] nvme: allow cmb and pmr emulation on same device Andrzej Jakowski
2020-07-01 21:48 ` [PATCH v4 1/2] nvme: indicate CMB support through controller capabilities register Andrzej Jakowski
2020-07-07 16:27   ` Maxim Levitsky
2020-07-07 19:15     ` Klaus Jensen
2020-07-30 11:26       ` Maxim Levitsky
2020-07-01 21:48 ` [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device Andrzej Jakowski
2020-07-02 10:13   ` Klaus Jensen
2020-07-02 10:31     ` Klaus Jensen
2020-07-02 15:07       ` Andrzej Jakowski
2020-07-02 17:51         ` Klaus Jensen
2020-07-02 23:33           ` Andrzej Jakowski
2020-07-06  7:15             ` Klaus Jensen
2020-07-08  4:44               ` Andrzej Jakowski
2020-07-15  8:06                 ` Klaus Jensen
2020-07-15  8:21                   ` Klaus Jensen
2020-07-21 21:54                   ` Andrzej Jakowski
2020-07-22  7:43                     ` Klaus Jensen
2020-07-22 17:00                       ` Andrzej Jakowski
2020-07-22 17:21                         ` Klaus Jensen
2020-07-22 18:14                           ` Andrzej Jakowski
