* [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
@ 2016-12-20 16:31 Fam Zheng
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

This series adds a new protocol driver that is intended to achieve about 20%
better performance for latency-bound workloads (i.e. synchronous I/O) than
linux-aio when the guest exclusively accesses an NVMe device, by talking to the
device directly instead of going through the kernel file system layers and the
kernel's NVMe driver.
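
As a rough usage sketch (not part of this series; the option names follow the
"device"/"namespace" runtime options declared in patch 4, and the PCI address
and namespace number below are placeholders), the driver could be attached
along these lines, after unbinding the controller from the kernel nvme driver
and binding it to vfio-pci:

    qemu-system-x86_64 ... \
        -drive if=none,id=drive0,format=raw,file.driver=nvme,file.device=0000:44:00.0,file.namespace=1 \
        -device virtio-blk-pci,drive=drive0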

This applies on top of Stefan's block-next tree, which has the busy polling
patches; the new driver supports polling as well.

A git branch is also available as:

    https://github.com/famz/qemu nvme

See patch 4 for benchmark numbers.

Tests are done on QEMU's NVMe emulation and on a real Intel P3700 NVMe SSD.
Most dd/fio/mkfs/kernel-build and OS-installation tests work well, but a weird
write fault looking similar to [1] is consistently seen when installing a RHEL
7.3 guest; this is still under investigation.

[1]: http://lists.infradead.org/pipermail/linux-nvme/2015-May/001840.html

Also, the RAM block notifier is not sufficient for hot-plugged block devices,
because in that case the notifier is installed _after_ the RAM blocks have been
added, so it never receives the corresponding events.
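
A possible follow-up (not implemented in this series) is to replay the
already-registered RAM blocks when a notifier is added, e.g. by walking them
with the existing qemu_ram_foreach_block() iterator; a rough sketch, assuming
that iterator keeps its current signature:

    /* Sketch only: replay pre-existing RAM blocks to a newly added notifier,
     * so that users registered after hotplug still see all of guest RAM. */
    static int ram_block_notify_existing(const char *block_name, void *host_addr,
                                         ram_addr_t offset, ram_addr_t length,
                                         void *opaque)
    {
        RAMBlockNotifier *n = opaque;

        n->ram_block_added(n, host_addr, length);
        return 0;
    }

    void ram_block_notifier_add(RAMBlockNotifier *n)
    {
        QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
        qemu_ram_foreach_block(ram_block_notify_existing, n);
    }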

Fam Zheng (3):
  util: Add a notifier list for qemu_vfree()
  util: Add VFIO helper library
  block: Add VFIO based NVMe driver

Paolo Bonzini (1):
  ramblock-notifier: new

 block/Makefile.objs         |    1 +
 block/nvme.c                | 1026 +++++++++++++++++++++++++++++++++++++++++++
 exec.c                      |    5 +
 include/exec/memory.h       |    6 +-
 include/exec/ram_addr.h     |   46 +-
 include/exec/ramlist.h      |   72 +++
 include/qemu/notify.h       |    1 +
 include/qemu/vfio-helpers.h |   29 ++
 numa.c                      |   29 ++
 stubs/Makefile.objs         |    1 +
 stubs/ram.c                 |   10 +
 util/Makefile.objs          |    1 +
 util/oslib-posix.c          |    9 +
 util/vfio-helpers.c         |  713 ++++++++++++++++++++++++++++++
 xen-mapcache.c              |    3 +
 15 files changed, 1903 insertions(+), 49 deletions(-)
 create mode 100644 block/nvme.c
 create mode 100644 include/exec/ramlist.h
 create mode 100644 include/qemu/vfio-helpers.h
 create mode 100644 stubs/ram.c
 create mode 100644 util/vfio-helpers.c

-- 
2.9.3


* [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
@ 2016-12-20 16:31 ` Fam Zheng
  2016-12-22  9:56   ` Paolo Bonzini
  2017-01-11  5:38   ` Stefan Weil
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree() Fam Zheng
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

From: Paolo Bonzini <pbonzini@redhat.com>

This adds a notifier interface for RAM block additions and removals.

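A minimal consumer looks like this (the names are illustrative; the real user
is added by patch 3 in util/vfio-helpers.c):

    static void my_ram_block_added(RAMBlockNotifier *n, void *host, size_t size)
    {
        /* e.g. set up a DMA mapping covering [host, host + size) */
    }

    static void my_ram_block_removed(RAMBlockNotifier *n, void *host, size_t size)
    {
        /* tear the mapping down again */
    }

    static RAMBlockNotifier my_notifier = {
        .ram_block_added   = my_ram_block_added,
        .ram_block_removed = my_ram_block_removed,
    };

    /* somewhere during initialization: */
    ram_block_notifier_add(&my_notifier);
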
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 exec.c                  |  5 ++++
 include/exec/memory.h   |  6 +----
 include/exec/ram_addr.h | 46 ++-----------------------------
 include/exec/ramlist.h  | 72 +++++++++++++++++++++++++++++++++++++++++++++++++
 numa.c                  | 29 ++++++++++++++++++++
 stubs/Makefile.objs     |  1 +
 stubs/ram.c             | 10 +++++++
 xen-mapcache.c          |  3 +++
 8 files changed, 123 insertions(+), 49 deletions(-)
 create mode 100644 include/exec/ramlist.h
 create mode 100644 stubs/ram.c

diff --git a/exec.c b/exec.c
index 08c558e..027c301 100644
--- a/exec.c
+++ b/exec.c
@@ -1654,6 +1654,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);
         /* MADV_DONTFORK is also needed by KVM in absence of synchronous MMU */
         qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_DONTFORK);
+        ram_block_notify_add(new_block->host, new_block->max_length);
     }
 }
 
@@ -1784,6 +1785,10 @@ void qemu_ram_free(RAMBlock *block)
         return;
     }
 
+    if (block->host) {
+        ram_block_notify_remove(block->host, block->max_length);
+    }
+
     qemu_mutex_lock_ramlist();
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9728a2f..fbc0a8f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -16,16 +16,12 @@
 
 #ifndef CONFIG_USER_ONLY
 
-#define DIRTY_MEMORY_VGA       0
-#define DIRTY_MEMORY_CODE      1
-#define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
-
 #include "exec/cpu-common.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
 #endif
 #include "exec/memattrs.h"
+#include "exec/ramlist.h"
 #include "qemu/queue.h"
 #include "qemu/int128.h"
 #include "qemu/notify.h"
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 54d7108..3e79466 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "exec/ramlist.h"
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -35,6 +36,7 @@ struct RAMBlock {
     char idstr[256];
     /* RCU-enabled, writes protected by the ramlist lock */
     QLIST_ENTRY(RAMBlock) next;
+    QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
     int fd;
     size_t page_size;
 };
@@ -50,51 +52,7 @@ static inline void *ramblock_ptr(RAMBlock *block, ram_addr_t offset)
     return (char *)block->host + offset;
 }
 
-/* The dirty memory bitmap is split into fixed-size blocks to allow growth
- * under RCU.  The bitmap for a block can be accessed as follows:
- *
- *   rcu_read_lock();
- *
- *   DirtyMemoryBlocks *blocks =
- *       atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
- *
- *   ram_addr_t idx = (addr >> TARGET_PAGE_BITS) / DIRTY_MEMORY_BLOCK_SIZE;
- *   unsigned long *block = blocks.blocks[idx];
- *   ...access block bitmap...
- *
- *   rcu_read_unlock();
- *
- * Remember to check for the end of the block when accessing a range of
- * addresses.  Move on to the next block if you reach the end.
- *
- * Organization into blocks allows dirty memory to grow (but not shrink) under
- * RCU.  When adding new RAMBlocks requires the dirty memory to grow, a new
- * DirtyMemoryBlocks array is allocated with pointers to existing blocks kept
- * the same.  Other threads can safely access existing blocks while dirty
- * memory is being grown.  When no threads are using the old DirtyMemoryBlocks
- * anymore it is freed by RCU (but the underlying blocks stay because they are
- * pointed to from the new DirtyMemoryBlocks).
- */
-#define DIRTY_MEMORY_BLOCK_SIZE ((ram_addr_t)256 * 1024 * 8)
-typedef struct {
-    struct rcu_head rcu;
-    unsigned long *blocks[];
-} DirtyMemoryBlocks;
-
-typedef struct RAMList {
-    QemuMutex mutex;
-    RAMBlock *mru_block;
-    /* RCU-enabled, writes protected by the ramlist lock. */
-    QLIST_HEAD(, RAMBlock) blocks;
-    DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
-    uint32_t version;
-} RAMList;
-extern RAMList ram_list;
-
 ram_addr_t last_ram_offset(void);
-void qemu_mutex_lock_ramlist(void);
-void qemu_mutex_unlock_ramlist(void);
-
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                    bool share, const char *mem_path,
                                    Error **errp);
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
new file mode 100644
index 0000000..c59880d
--- /dev/null
+++ b/include/exec/ramlist.h
@@ -0,0 +1,72 @@
+#ifndef RAMLIST_H
+#define RAMLIST_H
+
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#include "qemu/rcu.h"
+
+typedef struct RAMBlockNotifier RAMBlockNotifier;
+
+#define DIRTY_MEMORY_VGA       0
+#define DIRTY_MEMORY_CODE      1
+#define DIRTY_MEMORY_MIGRATION 2
+#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+
+/* The dirty memory bitmap is split into fixed-size blocks to allow growth
+ * under RCU.  The bitmap for a block can be accessed as follows:
+ *
+ *   rcu_read_lock();
+ *
+ *   DirtyMemoryBlocks *blocks =
+ *       atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
+ *
+ *   ram_addr_t idx = (addr >> TARGET_PAGE_BITS) / DIRTY_MEMORY_BLOCK_SIZE;
+ *   unsigned long *block = blocks.blocks[idx];
+ *   ...access block bitmap...
+ *
+ *   rcu_read_unlock();
+ *
+ * Remember to check for the end of the block when accessing a range of
+ * addresses.  Move on to the next block if you reach the end.
+ *
+ * Organization into blocks allows dirty memory to grow (but not shrink) under
+ * RCU.  When adding new RAMBlocks requires the dirty memory to grow, a new
+ * DirtyMemoryBlocks array is allocated with pointers to existing blocks kept
+ * the same.  Other threads can safely access existing blocks while dirty
+ * memory is being grown.  When no threads are using the old DirtyMemoryBlocks
+ * anymore it is freed by RCU (but the underlying blocks stay because they are
+ * pointed to from the new DirtyMemoryBlocks).
+ */
+#define DIRTY_MEMORY_BLOCK_SIZE ((ram_addr_t)256 * 1024 * 8)
+typedef struct {
+    struct rcu_head rcu;
+    unsigned long *blocks[];
+} DirtyMemoryBlocks;
+
+typedef struct RAMList {
+    QemuMutex mutex;
+    RAMBlock *mru_block;
+    /* RCU-enabled, writes protected by the ramlist lock. */
+    QLIST_HEAD(, RAMBlock) blocks;
+    DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
+    uint32_t version;
+    QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
+} RAMList;
+extern RAMList ram_list;
+
+void qemu_mutex_lock_ramlist(void);
+void qemu_mutex_unlock_ramlist(void);
+
+struct RAMBlockNotifier {
+    void (*ram_block_added)(RAMBlockNotifier *n, void *host, size_t size);
+    void (*ram_block_removed)(RAMBlockNotifier *n, void *host, size_t size);
+    QLIST_ENTRY(RAMBlockNotifier) next;
+};
+
+void ram_block_notifier_add(RAMBlockNotifier *n);
+void ram_block_notifier_remove(RAMBlockNotifier *n);
+void ram_block_notify_add(void *host, size_t size);
+void ram_block_notify_remove(void *host, size_t size);
+
+
+#endif /* RAMLIST_H */
diff --git a/numa.c b/numa.c
index 9c09e45..81f29f4 100644
--- a/numa.c
+++ b/numa.c
@@ -25,6 +25,7 @@
 #include "qemu/osdep.h"
 #include "sysemu/numa.h"
 #include "exec/cpu-common.h"
+#include "exec/ramlist.h"
 #include "qemu/bitmap.h"
 #include "qom/cpu.h"
 #include "qemu/error-report.h"
@@ -562,3 +563,31 @@ int numa_get_node_for_cpu(int idx)
     }
     return i;
 }
+
+void ram_block_notifier_add(RAMBlockNotifier *n)
+{
+    QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
+}
+
+void ram_block_notifier_remove(RAMBlockNotifier *n)
+{
+    QLIST_REMOVE(n, next);
+}
+
+void ram_block_notify_add(void *host, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        notifier->ram_block_added(notifier, host, size);
+    }
+}
+
+void ram_block_notify_remove(void *host, size_t size)
+{
+    RAMBlockNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+        notifier->ram_block_removed(notifier, host, size);
+    }
+}
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 2b5bb74..1829c7f 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -27,6 +27,7 @@ stub-obj-y += mon-is-qmp.o
 stub-obj-y += monitor-init.o
 stub-obj-y += notify-event.o
 stub-obj-y += qtest.o
+stub-obj-y += ram.o
 stub-obj-y += replay.o
 stub-obj-y += replay-user.o
 stub-obj-y += reset.o
diff --git a/stubs/ram.c b/stubs/ram.c
new file mode 100644
index 0000000..ea5d528
--- /dev/null
+++ b/stubs/ram.c
@@ -0,0 +1,10 @@
+#include "qemu/osdep.h"
+#include "exec/ramlist.h"
+
+void ram_block_notifier_add(RAMBlockNotifier *n)
+{
+}
+
+void ram_block_notifier_remove(RAMBlockNotifier *n)
+{
+}
diff --git a/xen-mapcache.c b/xen-mapcache.c
index 8f3a592..dc9b321 100644
--- a/xen-mapcache.c
+++ b/xen-mapcache.c
@@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
     err = g_malloc0(nb_pfn * sizeof (int));
 
     if (entry->vaddr_base != NULL) {
+        ram_block_notify_remove(entry->vaddr_base, entry->size);
         if (munmap(entry->vaddr_base, entry->size) != 0) {
             perror("unmap fails");
             exit(-1);
@@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
     entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
             BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
 
+    ram_block_notify_add(entry->vaddr_base, entry->size);
     bitmap_zero(entry->valid_mapping, nb_pfn);
     for (i = 0; i < nb_pfn; i++) {
         if (!err[i]) {
@@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
     }
 
     pentry->next = entry->next;
+    ram_block_notify_remove(entry->vaddr_base, entry->size);
     if (munmap(entry->vaddr_base, entry->size) != 0) {
         perror("unmap fails");
         exit(-1);
-- 
2.9.3


* [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree()
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
@ 2016-12-20 16:31 ` Fam Zheng
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

This allows other code to intercept buffer freeing and perform customized
unmapping operations.

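Usage is the plain Notifier pattern; patch 4 registers one to tear down VFIO
DMA mappings for bounce buffers. Roughly (names are illustrative):

    static void my_vfree_cb(Notifier *n, void *data)
    {
        void *buf = data;   /* the pointer that is being passed to qemu_vfree() */

        /* undo whatever mapping was set up for buf before it is freed */
    }

    static Notifier my_vfree_notifier = { .notify = my_vfree_cb };

    /* during initialization: */
    qemu_vfree_add_notifier(&my_vfree_notifier);
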
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/qemu/notify.h | 1 +
 util/oslib-posix.c    | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/qemu/notify.h b/include/qemu/notify.h
index a3d73e4..24c4f5f 100644
--- a/include/qemu/notify.h
+++ b/include/qemu/notify.h
@@ -69,4 +69,5 @@ void notifier_with_return_remove(NotifierWithReturn *notifier);
 int notifier_with_return_list_notify(NotifierWithReturnList *list,
                                      void *data);
 
+void qemu_vfree_add_notifier(Notifier *n);
 #endif
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..b0dd207 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -125,9 +125,18 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
     return ptr;
 }
 
+static NotifierList qemu_vfree_notifiers =
+    NOTIFIER_LIST_INITIALIZER(qemu_vfree_notifiers);
+
+void qemu_vfree_add_notifier(Notifier *n)
+{
+    notifier_list_add(&qemu_vfree_notifiers, n);
+}
+
 void qemu_vfree(void *ptr)
 {
     trace_qemu_vfree(ptr);
+    notifier_list_notify(&qemu_vfree_notifiers, ptr);
     free(ptr);
 }
 
-- 
2.9.3


* [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree() Fam Zheng
@ 2016-12-20 16:31 ` Fam Zheng
  2016-12-21 15:46   ` Paolo Bonzini
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

This is a simple helper library to be used by VFIO device drivers in
QEMU, making it easy to interface with /dev/vfio and to set up DMA
mappings.

In particular, once initialized, this module proactively maps all guest RAM
regions into the I/O address space (in the host IOMMU context), so that on
the I/O path it is very cheap to look up the IOVAs of the SG buffers we get
from the guest driver, without doing any syscalls.

Bounce buffers allocated by QEMU are mapped lazily and transparently when the
caller invokes the same DMA mapping function. This is relatively slower but
should still be efficient thanks to HBitmap. However, the caller is
responsible for unmapping bounce buffers explicitly.

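In other words, the intended calling convention is roughly the following
(illustrative sketch; the real callers are added by patch 4, and s, guest_buf,
bounce_buf, len and iova_list are placeholders):

    /* Guest RAM: already mapped up-front by the ram block notifier, so this
     * call only looks up the IOVA without issuing any syscall.  The host
     * pointer and len must be page aligned. */
    uint64_t iova;
    qemu_vfio_dma_map(s, guest_buf, len, true, &iova);

    /* Bounce buffer allocated by QEMU: mapped lazily on first use, and the
     * caller must unmap it explicitly when done. */
    qemu_vfio_dma_map(s, bounce_buf, len, false, iova_list);
    /* ... do I/O ... */
    qemu_vfio_dma_unmap(s, bounce_buf);
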
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/qemu/vfio-helpers.h |  29 ++
 util/Makefile.objs          |   1 +
 util/vfio-helpers.c         | 713 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 743 insertions(+)
 create mode 100644 include/qemu/vfio-helpers.h
 create mode 100644 util/vfio-helpers.c

diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
new file mode 100644
index 0000000..4096dd8
--- /dev/null
+++ b/include/qemu/vfio-helpers.h
@@ -0,0 +1,29 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VFIO_H
+#define QEMU_VFIO_H
+#include "qemu/queue.h"
+
+typedef struct QEMUVFIOState QEMUVFIOState;
+
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
+void qemu_vfio_close(QEMUVFIOState *s);
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+                      bool contiguous, uint64_t *iova_list);
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp);
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar);
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index ad0f9c7..16c714a 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -36,3 +36,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += vfio-helpers.o
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
new file mode 100644
index 0000000..55d35e7
--- /dev/null
+++ b/util/vfio-helpers.c
@@ -0,0 +1,713 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "trace.h"
+#include "qemu/queue.h"
+#include "qemu/vfio-helpers.h"
+#include "qemu/error-report.h"
+#include "standard-headers/linux/pci_regs.h"
+#include "qemu/event_notifier.h"
+#include "qemu/hbitmap.h"
+
+#define VFIO_DEBUG 0
+
+#define DPRINTF(...) \
+    if (VFIO_DEBUG) { \
+        printf(__VA_ARGS__); \
+    }
+
+/* XXX: Once VFIO exposes the iova bit width in the IOMMU capability interface,
+ * we can use a runtime limit. It's also possible to do platform specific
+ * detection by reading sysfs entries. Until then, 39 is a safe bet. */
+#define QEMU_VFIO_IOVA_MAX (1ULL << 39)
+
+/* DMA address space is managed in chunks. */
+#define QEMU_VFIO_CHUNK_SIZE (2ULL << 20)
+
+#define QEMU_VFIO_ALLOC_BITMAP_SIZE (QEMU_VFIO_IOVA_MAX / QEMU_VFIO_CHUNK_SIZE)
+
+typedef struct IOVARange {
+    uint64_t iova;
+    uint64_t nr_pages;
+    QSIMPLEQ_ENTRY(IOVARange) next;
+} IOVARange;
+
+typedef struct {
+    /* Page aligned. */
+    void *host;
+    size_t size;
+    /* The IOVA range list to which the [host, host + size) area is mapped. */
+    QSIMPLEQ_HEAD(, IOVARange) iova_list;
+} IOVAMapping;
+
+struct QEMUVFIOState {
+    int container;
+    int group;
+    int device;
+    RAMBlockNotifier ram_notifier;
+    struct vfio_region_info config_region_info, bar_region_info[6];
+
+    /* Allocation bitmap of the IOVA address space; each bit represents
+     * QEMU_VFIO_CHUNK_SIZE bytes.  Set bits mean free, cleared bits mean
+     * used/reserved. */
+    HBitmap *free_chunks;
+    IOVAMapping *mappings;
+    int nr_mappings;
+};
+
+static int sysfs_find_group_file(const char *device, char **path, Error **errp)
+{
+    int ret;
+    char *sysfs_link = NULL;
+    char *sysfs_group = NULL;
+    char *p;
+
+    sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group",
+                                 device);
+    sysfs_group = g_malloc(PATH_MAX);
+    ret = readlink(sysfs_link, sysfs_group, PATH_MAX - 1);
+    if (ret == -1) {
+        error_setg_errno(errp, errno, "Failed to find iommu group sysfs path");
+        ret = -errno;
+        goto out;
+    }
+    ret = 0;
+    p = strrchr(sysfs_group, '/');
+    if (!p) {
+        error_setg(errp, "Failed to find iommu group number");
+        ret = -errno;
+        goto out;
+    }
+
+    *path = g_strdup_printf("/dev/vfio/%s", p + 1);
+out:
+    g_free(sysfs_link);
+    g_free(sysfs_group);
+    return ret;
+}
+
+static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, unsigned int index,
+                                  Error **errp)
+{
+    assert(index < ARRAY_SIZE(s->bar_region_info));
+    s->bar_region_info[index] = (struct vfio_region_info) {
+        .index = VFIO_PCI_BAR0_REGION_INDEX + index,
+        .argsz = sizeof(struct vfio_region_info),
+    };
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) {
+        error_setg_errno(errp, errno, "Failed to get BAR region info");
+        return -errno;
+    }
+
+    return 0;
+}
+
+/**
+ * Map a PCI bar area.
+ */
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp)
+{
+    void *p;
+    assert(index >= 0 && index < 6);
+    p = mmap(NULL, MIN(8192, s->bar_region_info[index].size),
+             PROT_READ | PROT_WRITE, MAP_SHARED,
+             s->device, s->bar_region_info[index].offset);
+    if (p == MAP_FAILED) {
+        error_setg_errno(errp, errno, "Failed to map BAR region");
+        p = NULL;
+    }
+    return p;
+}
+
+/**
+ * Unmap a PCI bar area.
+ */
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar)
+{
+    if (bar) {
+        munmap(bar, MIN(8192, s->bar_region_info[index].size));
+    }
+}
+
+/**
+ * Initialize the device IRQ with @irq_type and register an event notifier.
+ */
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+                           int irq_type, Error **errp)
+{
+    int r;
+    struct vfio_irq_set *irq_set;
+    size_t irq_set_size;
+    struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+
+    irq_info.index = irq_type;
+    if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
+        error_setg_errno(errp, errno, "Failed to get device interrupt info");
+        return -errno;
+    }
+    if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+        error_setg(errp, "Device interrupt doesn't support eventfd");
+        return -EINVAL;
+    }
+
+    irq_set_size = sizeof(*irq_set) + sizeof(int);
+    irq_set = g_malloc0(irq_set_size);
+
+    /* Get to a known IRQ state */
+    *irq_set = (struct vfio_irq_set) {
+        .argsz = irq_set_size,
+        .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = irq_info.index,
+        .start = 0,
+        .count = 1,
+    };
+
+    *(int *)&irq_set->data = event_notifier_get_fd(e);
+    r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (r) {
+        error_setg_errno(errp, errno, "Failed to setup device interrupt");
+        return -errno;
+    }
+    return 0;
+}
+
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf, int size, int ofs)
+{
+    if (pread(s->device, buf, size, s->config_region_info.offset + ofs) == size) {
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int ofs)
+{
+    if (pwrite(s->device, buf, size,
+               s->config_region_info.offset + ofs) == size) {
+        return 0;
+    }
+
+    return -1;
+}
+
+static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
+                              Error **errp)
+{
+    int ret;
+    int i;
+    uint16_t pci_cmd;
+    struct vfio_group_status group_status =
+    { .argsz = sizeof(group_status) };
+    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
+    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+    char *group_file = NULL;
+
+    /* Create a new container */
+    s->container = open("/dev/vfio/vfio", O_RDWR);
+
+    if (ioctl(s->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+        error_setg(errp, "Invalid VFIO version");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "VFIO IOMMU check failed");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* Open the group */
+    ret = sysfs_find_group_file(device, &group_file, errp);
+    if (ret) {
+        goto out;
+    }
+
+    s->group = open(group_file, O_RDWR);
+    g_free(group_file);
+    if (s->group <= 0) {
+        error_setg_errno(errp, errno, "Failed to open VFIO group file");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Test the group is viable and available */
+    if (ioctl(s->group, VFIO_GROUP_GET_STATUS, &group_status)) {
+        error_setg_errno(errp, errno, "Failed to get VFIO group status");
+        ret = -errno;
+        goto out;
+    }
+
+    if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_setg(errp, "VFIO group is not viable");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* Add the group to the container */
+    if (ioctl(s->group, VFIO_GROUP_SET_CONTAINER, &s->container)) {
+        error_setg_errno(errp, errno, "Failed to add group to VFIO container");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Enable the IOMMU model we want */
+    if (ioctl(s->container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+        error_setg_errno(errp, errno, "Failed to set VFIO IOMMU type");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Get additional IOMMU info */
+    if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
+        error_setg_errno(errp, errno, "Failed to get IOMMU info");
+        ret = -errno;
+        goto out;
+    }
+
+    s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device);
+
+    if (s->device < 0) {
+        error_setg_errno(errp, errno, "Failed to get device fd");
+        ret = -errno;
+        goto out;
+    }
+
+    /* Test and setup the device */
+    if (ioctl(s->device, VFIO_DEVICE_GET_INFO, &device_info)) {
+        error_setg_errno(errp, errno, "Failed to get device info");
+        ret = -errno;
+        goto out;
+    }
+
+    if (device_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX) {
+        error_setg(errp, "Invalid device regions");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    s->config_region_info = (struct vfio_region_info) {
+        .index = VFIO_PCI_CONFIG_REGION_INDEX,
+        .argsz = sizeof(struct vfio_region_info),
+    };
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->config_region_info)) {
+        error_setg_errno(errp, errno, "Failed to get config region info");
+        ret = -errno;
+        goto out;
+    }
+
+    for (i = 0; i < 6; i++) {
+        ret = qemu_vfio_pci_init_bar(s, i, errp);
+        if (ret) {
+            goto out;
+        }
+    }
+
+    /* Enable bus master */
+    if (qemu_vfio_pci_read_config(s, &pci_cmd, sizeof(pci_cmd),
+                                  PCI_COMMAND) < 0) {
+        goto out;
+    }
+    pci_cmd |= PCI_COMMAND_MASTER;
+    if (qemu_vfio_pci_write_config(s, &pci_cmd, sizeof(pci_cmd),
+                                   PCI_COMMAND) < 0) {
+        goto out;
+    }
+out:
+    return ret;
+}
+
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host, size_t size)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    /* XXX: handle existing ram regions? */
+    DPRINTF("ram block added %p %lx\n", host, size);
+    qemu_vfio_dma_map(s, host, size, true, NULL);
+}
+
+static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host, size_t size)
+{
+    QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+    if (host) {
+        DPRINTF("ram block removed %p %lx\n", host, size);
+        qemu_vfio_dma_unmap(s, host);
+    }
+}
+
+/**
+ * Open a PCI device, e.g. "0000:00:01.0".
+ */
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp)
+{
+    int r;
+    QEMUVFIOState *s = g_new0(QEMUVFIOState, 1);
+
+    r = qemu_vfio_init_pci(s, device, errp);
+    if (r) {
+        g_free(s);
+        return NULL;
+    }
+
+    s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
+    s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
+    ram_block_notifier_add(&s->ram_notifier);
+    s->free_chunks = hbitmap_alloc(QEMU_VFIO_ALLOC_BITMAP_SIZE, 0);
+    hbitmap_set(s->free_chunks, 1, QEMU_VFIO_ALLOC_BITMAP_SIZE - 1);
+
+    return s;
+}
+
+static void qemu_vfio_dump_mapping(IOVAMapping *m)
+{
+    DPRINTF("  vfio mapping %p %lx\n", m->host, m->size);
+    IOVARange *r;
+
+    QSIMPLEQ_FOREACH(r, &m->iova_list, next) {
+        DPRINTF("   IOVA %lx len %lx\n", r->iova, r->nr_pages * getpagesize());
+    }
+}
+
+static void qemu_vfio_dump_mappings(QEMUVFIOState *s)
+{
+    int i;
+    DPRINTF("vfio mappings\n");
+    for (i = 0; i < s->nr_mappings; ++i) {
+        qemu_vfio_dump_mapping(&s->mappings[i]);
+    }
+}
+
+/**
+ * Find the mapping entry that contains [host, host + size) and set @index to
+ * the position. If no entry contains it, @index is the position _after_ which
+ * to insert the new mapping. IOW, it is the index of the largest element that
+ * is smaller than @host, or -1 if there is no such entry. */
+static IOVAMapping *qemu_vfio_find_mapping(QEMUVFIOState *s,
+                                           void *host, size_t size,
+                                           int *index)
+{
+    IOVAMapping *p = s->mappings;
+    IOVAMapping *q = p ? p + s->nr_mappings - 1 : NULL;
+    IOVAMapping *mid = p ? p + (q - p) / 2 : NULL;
+    DPRINTF("qemu vfio find mapping %p %lx... ", host, size);
+    if (!p) {
+        *index = -1;
+        return NULL;
+    }
+    while (true) {
+        mid = p + (q - p) / 2;
+        if (mid == p) {
+            break;
+        }
+        if (mid->host > host) {
+            q = mid;
+        } else if (mid->host < host) {
+            p = mid;
+        } else {
+            break;
+        }
+    }
+    if (mid->host > host) {
+        mid--;
+    } else if (mid < &s->mappings[s->nr_mappings - 1]
+               && (mid + 1)->host <= host) {
+        mid++;
+    }
+    *index = mid - &s->mappings[0];
+    if (mid >= &s->mappings[0] &&
+        mid->host <= host && mid->host + mid->size > host) {
+        assert(mid < &s->mappings[s->nr_mappings]);
+        assert(mid->host + mid->size >= host + size);
+        DPRINTF("found, index %d\n", *index);
+        return mid;
+    }
+    DPRINTF("not found, index %d\n", *index);
+    return NULL;
+}
+
+/**
+ * Allocate IOVA space, create a new mapping record and insert it in @s. If
+ * contiguous is true, the mapped IOVA must be contiguous, otherwise
+ * segmentation is allowed to make the allocation more likely to succeed.
+ */
+static IOVAMapping *qemu_vfio_new_mapping(QEMUVFIOState *s,
+                                          void *host, size_t size,
+                                          int index,
+                                          bool contiguous)
+{
+    int i;
+    int shift;
+    const size_t pages_per_chunk = QEMU_VFIO_CHUNK_SIZE / getpagesize();
+    size_t pages = DIV_ROUND_UP(size, getpagesize());
+    size_t chunks = DIV_ROUND_UP(pages, pages_per_chunk);
+    int64_t next;
+    IOVAMapping m = {.host = host, .size = size};
+    IOVAMapping *insert;
+    HBitmapIter iter;
+    IOVARange *r;
+
+    if (DIV_ROUND_UP(pages, pages_per_chunk) > hbitmap_count(s->free_chunks)) {
+        return NULL;
+    }
+    QSIMPLEQ_INIT(&m.iova_list);
+
+    r = NULL;
+    hbitmap_iter_init(&iter, s->free_chunks, 1);
+    if (contiguous) {
+        while (true) {
+            bool satisfy = true;
+            next = hbitmap_iter_next(&iter);
+            if (next < 0) {
+                return NULL;
+            }
+            for (i = 1; i < chunks; i++) {
+                if (!hbitmap_get(s->free_chunks, next + i)) {
+                    satisfy = false;
+                    break;
+                }
+            }
+            if (satisfy) {
+                break;
+            }
+        }
+        hbitmap_reset(s->free_chunks, next, chunks);
+        r = g_new(IOVARange, 1);
+        r->iova = next * pages_per_chunk * getpagesize();
+        r->nr_pages = pages;
+        QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
+    } else {
+        next = hbitmap_iter_next(&iter);
+        while (pages) {
+            uint64_t chunk;
+            if (next < 0) {
+                hbitmap_iter_init(&iter, s->free_chunks, 1);
+                next = hbitmap_iter_next(&iter);
+            }
+            assert(next >= 0);
+            chunk = next;
+            DPRINTF("using chunk %ld\n", chunk);
+            next = hbitmap_iter_next(&iter);
+            hbitmap_reset(s->free_chunks, chunk, 1);
+            if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
+                r->nr_pages += MIN(pages, pages_per_chunk);
+            } else {
+                r = g_new(IOVARange, 1);
+                r->iova = chunk * pages_per_chunk * getpagesize();
+                r->nr_pages = MIN(pages, pages_per_chunk);
+                QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
+            }
+            pages -= MIN(pages, pages_per_chunk);
+        }
+    }
+
+    assert(index >= 0);
+    s->nr_mappings++;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+    insert = &s->mappings[index];
+    shift = s->nr_mappings - index - 1;
+    DPRINTF("inserting to %d shift %d\n", index, shift);
+    if (shift) {
+        memmove(insert + 1, insert, shift * sizeof(s->mappings[0]));
+    }
+    *insert = m;
+    return insert;
+}
+
+/* Remove the mapping record from @s and undo the IOVA mapping with VFIO. */
+static void qemu_vfio_free_mapping(QEMUVFIOState *s, IOVAMapping *mapping,
+                                   bool can_fail)
+{
+    IOVARange *r, *tmp;
+
+    QSIMPLEQ_FOREACH_SAFE(r, &mapping->iova_list, next, tmp) {
+        size_t size = r->nr_pages * getpagesize();
+        struct vfio_iommu_type1_dma_unmap unmap = {
+            .argsz = sizeof(unmap),
+            .flags = 0,
+            .iova = r->iova,
+            .size = size,
+        };
+        if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+            if (!can_fail) {
+                error_report("VFIO_UNMAP_DMA: %d", -errno);
+            }
+        }
+        hbitmap_set(s->free_chunks, r->iova / QEMU_VFIO_CHUNK_SIZE,
+                    DIV_ROUND_UP(size, QEMU_VFIO_CHUNK_SIZE));
+        g_free(r);
+    }
+}
+
+/* Do the DMA mapping with VFIO. */
+static int qemu_vfio_do_mapping(QEMUVFIOState *s, IOVAMapping *mapping)
+{
+    IOVARange *r;
+    uint8_t *p = mapping->host;
+
+    QSIMPLEQ_FOREACH(r, &mapping->iova_list, next) {
+        struct vfio_iommu_type1_dma_map dma_map = {
+            .argsz = sizeof(dma_map),
+            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+            .iova = r->iova,
+            .vaddr = (uintptr_t)p,
+            .size = r->nr_pages * getpagesize(),
+        };
+        DPRINTF("vfio map %p pages %ld to %lx\n", p, r->nr_pages, r->iova);
+
+        if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+            error_report("VFIO_MAP_DMA: %d", -errno);
+            return -errno;
+        }
+        p += r->nr_pages * getpagesize();
+    }
+    return 0;
+}
+
+/* Check if the mapping list is (ascending) ordered. */
+static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
+{
+    int i;
+    for (i = 0; i < s->nr_mappings - 1; ++i) {
+        if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
+            fprintf(stderr, "item %d not sorted!\n", i);
+            qemu_vfio_dump_mappings(s);
+            return false;
+        }
+        if (!(s->mappings[i].host + s->mappings[i].size <=
+              s->mappings[i + 1].host)) {
+            fprintf(stderr, "item %d overlap with next!\n", i);
+            qemu_vfio_dump_mappings(s);
+            return false;
+        }
+    }
+    return true;
+}
+
+/* Map [host, host + size) area, and optionally store the IOVA in iova_list
+ * buffer. The area must be aligned to page size, and mustn't overlap or cross
+ * boundary of existing mappings.
+ *
+ * If the area is already mapped, this function can be used to retrieve the
+ * IOVA by passing non-NULL iova_list: for contiguous=true, only one entry is
+ * stored to @iova_list; otherwise, up to (size / page_size) entries will be
+ * written. The caller must ensure the passed buffer is big enough.
+ *
+ * If the area is not mapped, this will try to allocate a sequence of free
+ * IOVAs (contiguous can be enforced by passing true), and report them to
+ * iova_list.
+ */
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+                      bool contiguous, uint64_t *iova_list)
+{
+    int ret;
+    int index;
+    IOVAMapping *mapping;
+
+    DPRINTF("vfio dma map %p %lx contiguous %d\n", host, size, contiguous);
+    assert(QEMU_PTR_IS_ALIGNED(host, getpagesize()));
+    assert(QEMU_IS_ALIGNED(size, getpagesize()));
+    mapping = qemu_vfio_find_mapping(s, host, size, &index);
+    if (!mapping) {
+        mapping = qemu_vfio_new_mapping(s, host, size, index + 1, contiguous);
+        if (!mapping) {
+            return -ENOMEM;
+        }
+        assert(qemu_vfio_verify_mappings(s));
+        ret = qemu_vfio_do_mapping(s, mapping);
+        if (ret) {
+            qemu_vfio_free_mapping(s, mapping, true);
+            return ret;
+        }
+        qemu_vfio_dump_mappings(s);
+    } else {
+        /* XXX: check contiguous for existing mapping? */
+    }
+    if (!mapping) {
+        return -ENOMEM;
+    }
+    if (iova_list) {
+        uint64_t *p = iova_list;
+        uint64_t skip_pages = (host - mapping->host) / getpagesize();
+        IOVARange *r;
+        for (r = QSIMPLEQ_FIRST(&mapping->iova_list);
+             skip_pages > r->nr_pages;
+             r = QSIMPLEQ_NEXT(r, next)) {
+            skip_pages -= r->nr_pages;
+        }
+        while (size) {
+            uint64_t i;
+            assert(r);
+            for (i = skip_pages; size && i < r->nr_pages; ++i) {
+                *p = r->iova + i * getpagesize();
+                assert(*p >= r->iova);
+                assert(*p < r->iova + r->nr_pages * getpagesize());
+                p++;
+                size -= getpagesize();
+                if (contiguous) {
+                    goto out;
+                }
+            }
+            r = QSIMPLEQ_NEXT(r, next);
+            skip_pages = 0;
+        }
+    }
+out:
+    return 0;
+}
+
+/* Unmap the whole area that was previously mapped with
+ * qemu_vfio_dma_map(). */
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host)
+{
+    int index = 0;
+    IOVAMapping *m;
+
+    if (!host) {
+        return;
+    }
+
+    DPRINTF("vfio unmap %p\n", host);
+    m = qemu_vfio_find_mapping(s, host, 4096, &index);
+    if (!m) {
+        return;
+    }
+    qemu_vfio_free_mapping(s, m, false);
+    assert(s->nr_mappings);
+    memmove(&s->mappings[index], &s->mappings[index + 1],
+            sizeof(s->mappings[0]) * (s->nr_mappings - index - 1));
+    s->nr_mappings--;
+    s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+                              s->nr_mappings);
+}
+
+/* Close and free the VFIO resources. */
+void qemu_vfio_close(QEMUVFIOState *s)
+{
+    int i;
+
+    if (!s) {
+        return;
+    }
+    for (i = 0; i < s->nr_mappings; ++i) {
+        qemu_vfio_free_mapping(s, &s->mappings[i], false);
+    }
+    hbitmap_free(s->free_chunks);
+    ram_block_notifier_remove(&s->ram_notifier);
+    close(s->device);
+    close(s->group);
+    close(s->container);
+}
-- 
2.9.3


* [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
                   ` (2 preceding siblings ...)
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
@ 2016-12-20 16:31 ` Fam Zheng
  2016-12-20 16:39   ` Paolo Bonzini
  2016-12-21 11:59   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio.

                                         nvme://    linux-aio
------------------------------------------------------------------
  fio bs=4k  iodepth=1 (IOPS)             30014      24009
  fio bs=4k  iodepth=1 (IOPS)  +polling   45148      34689
  fio bs=64k iodepth=1 (IOPS)             17801      14954
  fio bs=64k iodepth=1 (IOPS)  +polling   17970      14564

  fio bs=4k  iodepth=32 (IOPS)           110637     121480
  fio bs=4k  iodepth=32 (IOPS) +polling  145533     166878
  fio bs=64k iodepth=32 (IOPS)            50525      50596
  fio bs=64k iodepth=32 (IOPS) +polling   50482      50534

  (host) qemu-img bench -c 8000000 (sec)  15.00      43.13

("+polling" means poll-max-ns=18000, a rule-of-thumb value for the NVMe device
under test; otherwise it defaults to 0.)
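
(For reference, and assuming the fio runs use a virtio-blk device on an
iothread with the poll-max-ns property from Stefan's polling series, the
"+polling" rows correspond to a command-line fragment roughly like:

    -object iothread,id=io0,poll-max-ns=18000 \
    -device virtio-blk-pci,drive=drive0,iothread=io0

while the other rows use the default poll-max-ns=0.)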

For the rows where linux-aio is faster, a lot of evidence shows that the I/O
queue is more likely to stay full when request completions happen faster, as
in the nvme:// case, and hence there is less batch submission and request
merging than with linux-aio.

This patch is co-authored by Paolo and me.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/Makefile.objs |    1 +
 block/nvme.c        | 1026 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1027 insertions(+)
 create mode 100644 block/nvme.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 67a036a..fe975a6 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-obj-y += null.o mirror.o commit.o io.o
 block-obj-y += throttle-groups.o
+block-obj-y += nvme.o
 
 block-obj-y += nbd.o nbd-client.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
diff --git a/block/nvme.c b/block/nvme.c
new file mode 100644
index 0000000..5c9ffde
--- /dev/null
+++ b/block/nvme.c
@@ -0,0 +1,1026 @@
+/*
+ * NVMe block driver based on vfio
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng <famz@redhat.com>
+ *   Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "block/block_int.h"
+#include "qemu/error-report.h"
+#include "qemu/vfio-helpers.h"
+
+#define NVME_DEBUG 0
+
+#define DPRINTF(...) \
+    if (NVME_DEBUG) { \
+        printf(__VA_ARGS__); \
+    }
+
+#define NVME_SQ_ENTRY_BYTES 64
+#define NVME_CQ_ENTRY_BYTES 16
+#define NVME_QUEUE_SIZE 128
+
+/* TODO: share definitions with hw/block/nvme.c? */
+enum NvmeAdminCommands {
+    NVME_ADM_CMD_DELETE_SQ      = 0x00,
+    NVME_ADM_CMD_CREATE_SQ      = 0x01,
+    NVME_ADM_CMD_GET_LOG_PAGE   = 0x02,
+    NVME_ADM_CMD_DELETE_CQ      = 0x04,
+    NVME_ADM_CMD_CREATE_CQ      = 0x05,
+    NVME_ADM_CMD_IDENTIFY       = 0x06,
+    NVME_ADM_CMD_ABORT          = 0x08,
+    NVME_ADM_CMD_SET_FEATURES   = 0x09,
+    NVME_ADM_CMD_GET_FEATURES   = 0x0a,
+    NVME_ADM_CMD_ASYNC_EV_REQ   = 0x0c,
+    NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
+    NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
+    NVME_ADM_CMD_FORMAT_NVM     = 0x80,
+    NVME_ADM_CMD_SECURITY_SEND  = 0x81,
+    NVME_ADM_CMD_SECURITY_RECV  = 0x82,
+};
+
+enum NvmeIoCommands {
+    NVME_CMD_FLUSH              = 0x00,
+    NVME_CMD_WRITE              = 0x01,
+    NVME_CMD_READ               = 0x02,
+    NVME_CMD_WRITE_UNCOR        = 0x04,
+    NVME_CMD_COMPARE            = 0x05,
+    NVME_CMD_DSM                = 0x09,
+};
+
+typedef struct {
+    uint8_t  opcode;
+    uint8_t  flags;
+    uint16_t cid;
+    uint32_t nsid;
+    uint64_t reserved;
+    uint64_t mptr;
+    uint64_t prp1;
+    uint64_t prp2;
+    uint32_t cdw10;
+    uint32_t cdw11;
+    uint32_t cdw12;
+    uint32_t cdw13;
+    uint32_t cdw14;
+    uint32_t cdw15;
+} QEMU_PACKED NVMeCommand;
+
+typedef struct {
+    uint32_t cmd_specific;
+    uint32_t reserved;
+    uint16_t sq_head;
+    uint16_t sqid;
+    uint16_t cid;
+    uint16_t status;
+} QEMU_PACKED NVMeCompletion;
+
+typedef struct {
+    int32_t  head, tail;
+    uint8_t  *queue;
+    uint64_t iova;
+    volatile uint32_t *doorbell;
+} NVMeQueue;
+
+typedef struct {
+    BlockCompletionFunc *cb;
+    void *opaque;
+    int cid;
+    void *prp_list_page;
+    uint64_t prp_list_iova;
+} NVMeRequest;
+
+typedef struct {
+    int         index;
+    NVMeQueue   sq, cq;
+    int         cq_phase;
+    uint8_t     *prp_list_pages;
+    uint64_t    prp_list_base_iova;
+    NVMeRequest reqs[NVME_QUEUE_SIZE];
+    CoQueue     wait_queue;
+    bool        busy;
+    int         need_kick;
+    int         inflight;
+} NVMeQueuePair;
+
+typedef volatile struct {
+    uint64_t cap;
+    uint32_t vs;
+    uint32_t intms;
+    uint32_t intmc;
+    uint32_t cc;
+    uint32_t reserved0;
+    uint32_t csts;
+    uint32_t nssr;
+    uint32_t aqa;
+    uint64_t asq;
+    uint64_t acq;
+    uint32_t cmbloc;
+    uint32_t cmbsz;
+    uint8_t  reserved1[0xec0];
+    uint8_t  cmd_set_specfic[0x100];
+    uint32_t doorbells[];
+} QEMU_PACKED NVMeRegs;
+
+QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
+
+typedef struct {
+    QEMUVFIOState *vfio;
+    NVMeRegs *regs;
+    /* The submission/completion queue pairs.
+     * [0]: admin queue.
+     * [1..]: io queues.
+     */
+    NVMeQueuePair **queues;
+    int nr_queues;
+    size_t page_size;
+    /* How many uint32_t elements does each doorbell entry take. */
+    size_t doorbell_scale;
+    bool write_cache;
+    EventNotifier event_notifier;
+    uint64_t nsze; /* Namespace size reported by identify command */
+    int nsid;      /* The namespace id to read/write data. */
+    uint64_t max_transfer;
+    int plugged;
+    Notifier vfree_notify;
+} BDRVNVMeState;
+
+#define NVME_BLOCK_OPT_DEVICE "device"
+#define NVME_BLOCK_OPT_NAMESPACE "namespace"
+
+static QemuOptsList runtime_opts = {
+    .name = "nvme",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = NVME_BLOCK_OPT_DEVICE,
+            .type = QEMU_OPT_STRING,
+            .help = "NVMe PCI device address",
+        },
+        {
+            .name = NVME_BLOCK_OPT_NAMESPACE,
+            .type = QEMU_OPT_NUMBER,
+            .help = "NVMe namespace",
+        },
+        { /* end of list */ }
+    },
+};
+
+static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
+                            int nentries, int entry_bytes, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    size_t bytes;
+    int r;
+
+    bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
+    q->head = q->tail = 0;
+    q->queue = qemu_try_blockalign0(bs, bytes);
+
+    if (!q->queue) {
+        error_setg(errp, "Cannot allocate queue");
+        return;
+    }
+    r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, true, &q->iova);
+    if (r) {
+        error_setg(errp, "Cannot map queue");
+    }
+}
+
+static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
+{
+    qemu_vfree(q->prp_list_pages);
+    qemu_vfree(q->sq.queue);
+    qemu_vfree(q->cq.queue);
+    g_free(q);
+}
+
+static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
+                                             int idx, int size,
+                                             Error **errp)
+{
+    int i, r;
+    BDRVNVMeState *s = bs->opaque;
+    Error *local_err = NULL;
+    NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
+    uint64_t prp_list_iovas[NVME_QUEUE_SIZE];
+
+    q->index = idx;
+    qemu_co_queue_init(&q->wait_queue);
+    q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
+    r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
+                          s->page_size * NVME_QUEUE_SIZE,
+                          false, prp_list_iovas);
+    if (r) {
+        goto fail;
+    }
+    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+        NVMeRequest *req = &q->reqs[i];
+        req->cid = i + 1;
+        req->prp_list_page = q->prp_list_pages + i * s->page_size;
+        req->prp_list_iova = prp_list_iovas[i];
+    }
+    nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto fail;
+    }
+    q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
+
+    nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto fail;
+    }
+    q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+
+    return q;
+fail:
+    nvme_free_queue_pair(bs, q);
+    return NULL;
+}
+
+static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+    if (s->plugged || !q->need_kick) {
+        return;
+    }
+    DPRINTF("nvme kick queue %d\n", q->index);
+    assert(!(q->sq.tail & 0xFF00));
+    /* Fence the write to the submission queue entry before notifying the device. */
+    smp_wmb();
+    *q->sq.doorbell = cpu_to_le32(q->sq.tail);
+    q->inflight += q->need_kick;
+    q->need_kick = 0;
+}
+
+static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
+{
+    int i;
+    if (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
+        /* We have to leave one slot empty as that is the full queue case (head
+         * == tail + 1). */
+        return NULL;
+    }
+    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+        if (!q->reqs[i].cb) {
+            return &q->reqs[i];
+        }
+    }
+    return NULL;
+}
+
+static inline int nvme_translate_error(const NVMeCompletion *c)
+{
+    if ((le16_to_cpu(c->status) >> 1) & 0xFF) {
+        DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",
+                c->cmd_specific, c->sq_head, c->sqid, c->cid, c->status);
+    }
+    switch ((le16_to_cpu(c->status) >> 1) & 0xFF) {
+    case 0:
+        return 0;
+    case 1:
+        return -ENOSYS;
+    case 2:
+        return -EINVAL;
+    default:
+        return -EIO;
+    }
+}
+
+static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+    bool progress = false;
+    NVMeRequest *req;
+    NVMeCompletion *c;
+
+    DPRINTF("nvme process completion %d inflight %d\n", q->index, q->inflight);
+    if (q->busy || s->plugged) {
+        DPRINTF("queue busy\n");
+        return false;
+    }
+    q->busy = true;
+    assert(q->inflight >= 0);
+    while (q->inflight) {
+        c = (NVMeCompletion *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
+        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+            break;
+        }
+        q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
+        if (!q->cq.head) {
+            q->cq_phase = !q->cq_phase;
+        }
+        /* XXX: error handling instead of assert as it's from device? */
+        assert(c->cid > 0);
+        assert(c->cid <= NVME_QUEUE_SIZE);
+        DPRINTF("nvme completing command %d\n", c->cid);
+        req = &q->reqs[c->cid - 1];
+        assert(req->cid == c->cid);
+        assert(req->cb);
+        req->cb(req->opaque, nvme_translate_error(c));
+        req->cb = req->opaque = NULL;
+        qemu_co_enter_next(&q->wait_queue);
+        c->cid = 0;
+        q->inflight--;
+        /* Flip Phase Tag bit. */
+        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
+        progress = true;
+    }
+    if (progress) {
+        /* Notify the device so it can post more completions. */
+        smp_mb_release();
+        *q->cq.doorbell = cpu_to_le32(q->cq.head);
+    }
+    q->busy = false;
+    return progress;
+}
+
+static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
+                                NVMeRequest *req,
+                                NVMeCommand *cmd, BlockCompletionFunc cb,
+                                void *opaque)
+{
+    req->cb = cb;
+    req->opaque = opaque;
+    cmd->cid = cpu_to_le16(req->cid);
+    DPRINTF("nvme submit command %d to queue %d\n", req->cid, q->index);
+    memcpy((uint8_t *)q->sq.queue +
+           q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
+    q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
+    q->need_kick++;
+    nvme_kick(s, q);
+    nvme_process_completion(s, q);
+}
+
+static void nvme_cmd_sync_cb(void *opaque, int ret)
+{
+    int *pret = opaque;
+    *pret = ret;
+}
+
+static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
+                         NVMeCommand *cmd)
+{
+    NVMeRequest *req;
+    BDRVNVMeState *s = bs->opaque;
+    int ret = -EINPROGRESS;
+    req = nvme_get_free_req(q);
+    if (!req) {
+        return -EBUSY;
+    }
+    nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
+
+    BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
+    return ret;
+}
+
+static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    uint8_t *resp;
+    int r;
+    uint64_t iova;
+    NVMeCommand cmd = {
+        .opcode = NVME_ADM_CMD_IDENTIFY,
+        .cdw10 = cpu_to_le32(0x1),
+    };
+
+    resp = qemu_try_blockalign0(bs, 4096);
+    if (!resp) {
+        error_setg(errp, "Cannot allocate buffer for identify response");
+        return false;
+    }
+    r = qemu_vfio_dma_map(s->vfio, resp, 4096, true, &iova);
+    if (r) {
+        error_setg(errp, "Cannot map buffer for DMA");
+        goto fail;
+    }
+    cmd.prp1 = cpu_to_le64(iova);
+
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to identify controller");
+        goto fail;
+    }
+
+    if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) {
+        error_setg(errp, "Invalid namespace");
+        goto fail;
+    }
+    s->write_cache = le32_to_cpu(resp[525]) & 0x1;
+    s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size;
+    /* For now the page list buffer per command is one page, to hold at most
+     * s->page_size / sizeof(uint64_t) entries. */
+    s->max_transfer = MIN_NON_ZERO(s->max_transfer,
+                          s->page_size / sizeof(uint64_t) * s->page_size);
+
+    memset((char *)resp, 0, 4096);
+
+    cmd.cdw10 = 0;
+    cmd.nsid = namespace;
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to identify namespace");
+        goto fail;
+    }
+
+    /*qemu_hexdump((const char *)resp, stdout, "NS", 4096);*/
+    s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]);
+
+    qemu_vfree(resp);
+    return true;
+fail:
+    qemu_vfree(resp);
+    return false;
+}
+
+static void nvme_handle_event(EventNotifier *n)
+{
+    int i;
+    BDRVNVMeState *s = container_of(n, BDRVNVMeState, event_notifier);
+
+    DPRINTF("nvme handle event\n");
+    event_notifier_test_and_clear(n);
+    for (i = 0; i < s->nr_queues; i++) {
+        while (nvme_process_completion(s, s->queues[i])) {
+            /* Keep polling until no progress. */
+        }
+    }
+}
+
+static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    int n = s->nr_queues;
+    NVMeQueuePair *q;
+    NVMeCommand cmd;
+    int queue_size = NVME_QUEUE_SIZE;
+
+    q = nvme_create_queue_pair(bs, n, queue_size, errp);
+    if (!q) {
+        return false;
+    }
+    cmd = (NVMeCommand) {
+        .opcode = NVME_ADM_CMD_CREATE_CQ,
+        .prp1 = cpu_to_le64(q->cq.iova),
+        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+        .cdw11 = cpu_to_le32(0x3),
+    };
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to create io queue [%d]", n);
+        nvme_free_queue_pair(bs, q);
+        return false;
+    }
+    cmd = (NVMeCommand) {
+        .opcode = NVME_ADM_CMD_CREATE_SQ,
+        .prp1 = cpu_to_le64(q->sq.iova),
+        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+        .cdw11 = cpu_to_le32(0x1 | (n << 16)),
+    };
+    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+        error_setg(errp, "Failed to create io queue [%d]", n);
+        nvme_free_queue_pair(bs, q);
+        return false;
+    }
+    s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
+    s->queues[n] = q;
+    s->nr_queues++;
+    return true;
+}
+
+static bool nvme_poll_cb(void *opaque)
+{
+    int i;
+    EventNotifier *e = opaque;
+    BDRVNVMeState *s = container_of(e, BDRVNVMeState, event_notifier);
+    bool progress = false;
+
+    DPRINTF("nvme poll cb\n");
+    for (i = 0; i < s->nr_queues; i++) {
+        while (nvme_process_completion(s, s->queues[i])) {
+            progress = true;
+        }
+    }
+    return progress;
+}
+
+static void nvme_vfree_cb(Notifier *n, void *p)
+{
+    BDRVNVMeState *s = container_of(n, BDRVNVMeState, vfree_notify);
+    qemu_vfio_dma_unmap(s->vfio, p);
+}
+
+static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
+                     Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+    int ret;
+    uint64_t cap;
+    uint64_t timeout_ms;
+    uint64_t deadline, now;
+
+    s->nsid = namespace;
+    ret = event_notifier_init(&s->event_notifier, 0);
+    if (ret) {
+        error_setg(errp, "Failed to init event notifier");
+        return ret;
+    }
+
+    s->vfree_notify.notify = nvme_vfree_cb;
+    qemu_vfree_add_notifier(&s->vfree_notify);
+
+    s->vfio = qemu_vfio_open_pci(device, errp);
+    if (!s->vfio) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, errp);
+    if (!s->regs) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    /* Perform the initialization sequence as described in NVMe spec
+     * "7.6.1 Initialization". */
+
+    cap = le64_to_cpu(s->regs->cap);
+    if (!(cap & (1ULL << 37))) {
+        error_setg(errp, "Device doesn't support NVMe command set");
+        ret = -EINVAL;
+        goto fail;
+    }
+
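+    /* CAP.MPSMIN (bits 51:48): minimum page size is 2^(12 + MPSMIN).
+     * CAP.DSTRD (bits 35:32): doorbell stride is (4 << DSTRD) bytes.
+     * CAP.TO (bits 31:24): worst-case ready timeout in units of 500 ms. */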
+    s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
+    s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
+    bs->bl.opt_mem_alignment = s->page_size;
+    timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
+
+    /* Reset device to get a clean state. */
+    s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
+    /* Wait for CSTS.RDY = 0. */
+    deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
+    while (le32_to_cpu(s->regs->csts) & 0x1) {
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+            error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
+                       timeout_ms);
+            ret = -ETIMEDOUT;
+            goto fail;
+        }
+    }
+
+    /* Set up admin queue. */
+    s->queues = g_new(NVMeQueuePair *, 1);
+    s->nr_queues = 1;
+    s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
+    if (!s->queues[0]) {
+        ret = -EINVAL;
+        goto fail;
+    }
+    QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
+    /* AQA.ACQS and AQA.ASQS are zero-based queue sizes. */
+    s->regs->aqa = cpu_to_le32(((NVME_QUEUE_SIZE - 1) << 16) |
+                               (NVME_QUEUE_SIZE - 1));
+    s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
+    s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
+
+    /* After setting up all control registers we can enable device now. */
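+    /* CC.IOCQES (bits 23:20) and CC.IOSQES (bits 19:16) take log2 of the
+     * completion/submission queue entry sizes; CC.EN (bit 0) enables the
+     * controller. */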
+    s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
+                              (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
+                              0x1);
+    /* Wait for CSTS.RDY = 1. */
+    now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    deadline = now + timeout_ms * 1000000;
+    while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
+        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+            error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
+                       timeout_ms);
+            ret = -ETIMEDOUT;
+            goto fail_queue;
+        }
+    }
+
+    ret = qemu_vfio_pci_init_irq(s->vfio, &s->event_notifier,
+                                 VFIO_PCI_MSIX_IRQ_INDEX, errp);
+    if (ret) {
+        goto fail_queue;
+    }
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+                           false, nvme_handle_event, nvme_poll_cb);
+
+    if (!nvme_identify(bs, namespace, errp)) {
+        ret = -EIO;
+        goto fail_handler;
+    }
+
+    /* Set up command queues. */
+    /* XXX: multiple io queues? */
+    if (!nvme_add_io_queue(bs, errp)) {
+        ret = -EIO;
+        goto fail_handler;
+    }
+    return 0;
+
+fail_handler:
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+                           false, NULL, NULL);
+fail_queue:
+    nvme_free_queue_pair(bs, s->queues[0]);
+fail:
+    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
+    qemu_vfio_close(s->vfio);
+    event_notifier_cleanup(&s->event_notifier);
+    return ret;
+}
+
+static void nvme_parse_filename(const char *filename, QDict *options,
+                                Error **errp)
+{
+    int pref = strlen("nvme://");
+
+    /* XXX: support namespace in URI like nvme://0000:00:00.0/1 ? */
+    if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
+        qdict_put(options, NVME_BLOCK_OPT_DEVICE,
+                  qstring_from_str(filename + pref));
+    }
+}
+
+static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
+                          Error **errp)
+{
+    const char *device;
+    QemuOpts *opts;
+    int namespace;
+    int ret;
+
+    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+    qemu_opts_absorb_qdict(opts, options, &error_abort);
+    device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
+    if (!device) {
+        error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
+        return -EINVAL;
+    }
+
+    namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
+    ret = nvme_init(bs, device, namespace, errp);
+
+    qemu_opts_del(opts);
+    return ret;
+}
+
+static void nvme_close(BlockDriverState *bs)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+
+    for (i = 0; i < s->nr_queues; ++i) {
+        nvme_free_queue_pair(bs, s->queues[i]);
+    }
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+                           false, NULL, NULL);
+    notifier_remove(&s->vfree_notify);
+    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
+    qemu_vfio_close(s->vfio);
+}
+
+static int64_t nvme_getlength(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+
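+    /* nsze is in logical blocks; the driver assumes a 512-byte LBA format. */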
+    return s->nsze << BDRV_SECTOR_BITS;
+}
+
+static inline int nvme_cmd_map_qiov(BlockDriverState *bs, NVMeCommand *cmd,
+                                    NVMeRequest *req, QEMUIOVector *qiov)
+{
+    BDRVNVMeState *s = bs->opaque;
+    uint64_t *pagelist = req->prp_list_page;
+    int i, r;
+    unsigned int entries = 0;
+
+    assert(qiov->size);
+    assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
+    assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
+    for (i = 0; i < qiov->niov; ++i) {
+        r = qemu_vfio_dma_map(s->vfio,
+                              qiov->iov[i].iov_base,
+                              qiov->iov[i].iov_len,
+                              false, &pagelist[entries]);
+        if (r) {
+            return r;
+        }
+
+        entries += qiov->iov[i].iov_len / s->page_size;
+        assert(entries <= s->page_size / sizeof(uint64_t));
+    }
+
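+    /* NVMe PRP scheme: PRP1 is the first page of the transfer; PRP2 is
+     * either the second (and last) page, or the IOVA of the PRP list page
+     * that holds the remaining page addresses. */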
+    switch (entries) {
+    case 0:
+        abort();
+    case 1:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = 0;
+        break;
+    case 2:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = cpu_to_le64(pagelist[1]);
+        break;
+    default:
+        cmd->prp1 = cpu_to_le64(pagelist[0]);
+        cmd->prp2 = cpu_to_le64(req->prp_list_iova);
+        for (i = 0; i < entries - 1; ++i) {
+            pagelist[i] = cpu_to_le64(pagelist[i + 1]);
+        }
+        pagelist[entries] = 0;
+        break;
+    }
+    return 0;
+}
+
+typedef struct {
+    Coroutine *co;
+    int ret;
+    AioContext *ctx;
+} NVMeCoData;
+
+static void nvme_rw_cb_bh(void *opaque)
+{
+    NVMeCoData *data = opaque;
+    qemu_coroutine_enter(data->co);
+}
+
+static void nvme_rw_cb(void *opaque, int ret)
+{
+    NVMeCoData *data = opaque;
+    data->ret = ret;
+    if (!data->co) {
+        /* The rw coroutine hasn't yielded, don't try to enter. */
+        return;
+    }
+    if (qemu_coroutine_self() != data->co) {
+        qemu_coroutine_enter(data->co);
+    } else {
+        aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
+    }
+}
+
+static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
+                                            uint64_t offset, uint64_t bytes,
+                                            QEMUIOVector *qiov,
+                                            bool is_write)
+{
+    int r;
+    BDRVNVMeState *s = bs->opaque;
+    NVMeQueuePair *ioq = s->queues[1];
+    NVMeRequest *req;
+    NVMeCommand cmd = {
+        .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
+        .nsid = cpu_to_le32(s->nsid),
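+        /* CDW10/11: 64-bit starting LBA (low/high 32 bits); CDW12 bits
+         * 15:0: number of logical blocks, zero-based. */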
+        .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
+        .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
+        .cdw12 = cpu_to_le32(((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF),
+    };
+    NVMeCoData data = {
+        .ctx = bdrv_get_aio_context(bs),
+        .ret = -EINPROGRESS,
+    };
+
+    DPRINTF(">>> NVMe %s offset %lx bytes %lx qiov[%d]\n",
+            is_write ? "write" : "read",
+            offset, bytes, qiov->niov);
+    assert(s->nr_queues > 1);
+    while (true) {
+        req = nvme_get_free_req(ioq);
+        if (req) {
+            break;
+        }
+        DPRINTF("nvme wait req\n");
+        qemu_co_queue_wait(&ioq->wait_queue);
+        DPRINTF("nvme wait req done\n");
+    }
+
+    r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
+    if (r) {
+        return r;
+    }
+    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+    data.co = qemu_coroutine_self();
+    while (data.ret == -EINPROGRESS) {
+        qemu_coroutine_yield();
+    }
+
+    DPRINTF("<<< NVMe %s offset %lx bytes %lx ret %d\n",
+            is_write ? "write" : "read",
+            offset, bytes, data.ret);
+    return data.ret;
+}
+
+static inline bool nvme_qiov_aligned(BlockDriverState *bs,
+                                     const QEMUIOVector *qiov)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+
+    for (i = 0; i < qiov->niov; ++i) {
+        if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
+            !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
+            return false;
+        }
+    }
+    return true;
+}
+
+static inline coroutine_fn
+int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+                QEMUIOVector *qiov, bool is_write)
+{
+    int r;
+    uint8_t *buf;
+    QEMUIOVector local_qiov;
+
+    assert(offset % BDRV_SECTOR_SIZE == 0);
+    assert(bytes % BDRV_SECTOR_SIZE == 0);
+    if (nvme_qiov_aligned(bs, qiov)) {
+        return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write);
+    }
+    buf = qemu_try_blockalign(bs, bytes);
+
+    if (!buf) {
+        return -ENOMEM;
+    }
+    qemu_iovec_init(&local_qiov, 1);
+    qemu_iovec_add(&local_qiov, buf, bytes);
+    if (is_write) {
+        /* Fill the bounce buffer with the caller's data before writing. */
+        qemu_iovec_to_buf(qiov, 0, buf, bytes);
+    }
+    r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write);
+    qemu_iovec_destroy(&local_qiov);
+    if (!r && !is_write) {
+        /* Copy the read data from the bounce buffer back to the caller. */
+        qemu_iovec_from_buf(qiov, 0, buf, bytes);
+    }
+    qemu_vfree(buf);
+    return r;
+}
+
+static coroutine_fn int nvme_co_preadv(BlockDriverState *bs,
+                                       uint64_t offset, uint64_t bytes,
+                                       QEMUIOVector *qiov, int flags)
+{
+    return nvme_co_prw(bs, offset, bytes, qiov, false);
+}
+
+static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs,
+                                        uint64_t offset, uint64_t bytes,
+                                        QEMUIOVector *qiov, int flags)
+{
+    return nvme_co_prw(bs, offset, bytes, qiov, true);
+}
+
+static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+    NVMeQueuePair *ioq = s->queues[1];
+    NVMeRequest *req;
+    NVMeCommand cmd = {
+        .opcode = NVME_CMD_FLUSH,
+        .nsid = cpu_to_le32(s->nsid),
+    };
+    NVMeCoData data = {
+        .ctx = bdrv_get_aio_context(bs),
+        .ret = -EINPROGRESS,
+    };
+
+    assert(s->nr_queues > 1);
+    while (true) {
+        req = nvme_get_free_req(ioq);
+        if (req) {
+            break;
+        }
+        qemu_co_queue_wait(&ioq->wait_queue);
+    }
+
+    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+    data.co = qemu_coroutine_self();
+    if (data.ret == -EINPROGRESS) {
+        qemu_coroutine_yield();
+    }
+
+    return data.ret;
+}
+
+
+static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
+                               BlockReopenQueue *queue, Error **errp)
+{
+    return 0;
+}
+
+static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
+                                                     int64_t sector_num,
+                                                     int nb_sectors, int *pnum,
+                                                     BlockDriverState **file)
+{
+    *pnum = nb_sectors;
+    *file = bs;
+
+    return BDRV_BLOCK_OFFSET_VALID | (sector_num << BDRV_SECTOR_BITS);
+}
+
+static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
+{
+    QINCREF(opts);
+    qdict_del(opts, "filename");
+
+    if (!qdict_size(opts)) {
+        snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://",
+                 bs->drv->format_name);
+    }
+
+    qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name));
+    bs->full_open_options = opts;
+}
+
+static void nvme_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    /* XXX: other fields from identify command/controller capacity? */
+    bs->bl.opt_mem_alignment = s->page_size;
+    /* XXX: are there other limits? */
+    bs->bl.request_alignment = s->page_size;
+    bs->bl.max_transfer = s->max_transfer;
+}
+
+static void nvme_detach_aio_context(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+                           false, NULL, NULL);
+}
+
+static void nvme_attach_aio_context(BlockDriverState *bs,
+                                    AioContext *new_context)
+{
+    BDRVNVMeState *s = bs->opaque;
+
+    aio_set_event_notifier(new_context, &s->event_notifier,
+                           false, nvme_handle_event, nvme_poll_cb);
+}
+
+static void nvme_aio_plug(BlockDriverState *bs)
+{
+    BDRVNVMeState *s = bs->opaque;
+    s->plugged++;
+}
+
+static void nvme_aio_unplug(BlockDriverState *bs)
+{
+    int i;
+    BDRVNVMeState *s = bs->opaque;
+    assert(s->plugged);
+    if (!--s->plugged) {
+        for (i = 1; i < s->nr_queues; i++) {
+            nvme_kick(s, s->queues[i]);
+            nvme_process_completion(s, s->queues[i]);
+        }
+    }
+}
+
+static BlockDriver bdrv_nvme = {
+    .format_name              = "nvme",
+    .protocol_name            = "nvme",
+    .instance_size            = sizeof(BDRVNVMeState),
+
+    .bdrv_parse_filename      = nvme_parse_filename,
+    .bdrv_file_open           = nvme_file_open,
+    .bdrv_close               = nvme_close,
+    .bdrv_getlength           = nvme_getlength,
+
+    .bdrv_co_preadv           = nvme_co_preadv,
+    .bdrv_co_pwritev          = nvme_co_pwritev,
+    .bdrv_co_flush_to_disk    = nvme_co_flush,
+    .bdrv_reopen_prepare      = nvme_reopen_prepare,
+
+    .bdrv_co_get_block_status = nvme_co_get_block_status,
+
+    .bdrv_refresh_filename    = nvme_refresh_filename,
+    .bdrv_refresh_limits      = nvme_refresh_limits,
+
+    .bdrv_detach_aio_context  = nvme_detach_aio_context,
+    .bdrv_attach_aio_context  = nvme_attach_aio_context,
+
+    .bdrv_io_plug             = nvme_aio_plug,
+    .bdrv_io_unplug           = nvme_aio_unplug,
+};
+
+static void bdrv_nvme_init(void)
+{
+    bdrv_register(&bdrv_nvme);
+}
+
+block_init(bdrv_nvme_init);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
@ 2016-12-20 16:39   ` Paolo Bonzini
  2016-12-21 11:59   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-20 16:39 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi



----- Original Message -----
> From: "Fam Zheng" <famz@redhat.com>
> To: qemu-devel@nongnu.org
> Cc: "Paolo Bonzini" <pbonzini@redhat.com>, qemu-block@nongnu.org, "Karl Rister" <krister@redhat.com>, "Kevin Wolf"
> <kwolf@redhat.com>, "Max Reitz" <mreitz@redhat.com>, borntraeger@de.ibm.com, "Stefan Hajnoczi" <stefanha@redhat.com>
> Sent: Tuesday, December 20, 2016 5:31:39 PM
> Subject: [PATCH 4/4] block: Add VFIO based NVMe driver
> 
> This is a new protocol driver that exclusively opens a host NVMe
> controller through VFIO. It achieves better latency than linux-aio.
> 
>                                          nvme://    linux-aio
> ------------------------------------------------------------------
>   fio bs=4k  iodepth=1 (IOPS)             30014      24009
>   fio bs=4k  iodepth=1 (IOPS)  +polling   45148      34689
>   fio bs=64k iodepth=1 (IOPS)             17801      14954
>   fio bs=64k iodepth=1 (IOPS)  +polling   17970      14564
> 
>   fio bs=4k  iodepth=32 (IOPS)           110637     121480
>   fio bs=4k  iodepth=32 (IOPS) +polling  145533     166878
>   fio bs=64k iodepth=32 (IOPS)            50525      50596
>   fio bs=64k iodepth=32 (IOPS) +polling   50482      50534
> 
>   (host) qemu-img bench -c 8000000 (sec)  15.00      43.13
> 
> ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> the used NVMe device, otherwise it defaults to 0).
> 
> For the rows where linux-aio is faster, a lot of evidence shows that the
> io queue is more likely to stay full if the request completions happen
> faster, as in the nvme:// case, hence less batch submission and request
> merging than linux-aio.
> 
> This patch is co-authored by Paolo and me.

... though to be fair it's at least 95% Fam's work. :)

Paolo

> [...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
                   ` (3 preceding siblings ...)
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
@ 2016-12-20 23:04 ` no-reply
  2016-12-21  1:38   ` Fam Zheng
  2016-12-21  0:48 ` no-reply
  2016-12-29  4:09 ` Tian, Kevin
  6 siblings, 1 reply; 20+ messages in thread
From: no-reply @ 2016-12-20 23:04 UTC (permalink / raw)
  To: famz
  Cc: qemu-devel, kwolf, qemu-block, mreitz, borntraeger, stefanha,
	pbonzini, krister

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
Message-id: 20161220163139.12016-1-famz@redhat.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new

=== OUTPUT BEGIN ===
Checking PATCH 1/4: ramblock-notifier: new...
Checking PATCH 2/4: util: Add a notifier list for qemu_vfree()...
Checking PATCH 3/4: util: Add VFIO helper library...
WARNING: line over 80 characters
#184: FILE: util/vfio-helpers.c:111:
+    if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) {

WARNING: line over 80 characters
#262: FILE: util/vfio-helpers.c:189:
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf, int size, int ofs)

WARNING: line over 80 characters
#264: FILE: util/vfio-helpers.c:191:
+    if (pread(s->device, buf, size, s->config_region_info.offset + ofs) == size) {

WARNING: line over 80 characters
#270: FILE: util/vfio-helpers.c:197:
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int ofs)

ERROR: that open brace { should be on the previous line
#287: FILE: util/vfio-helpers.c:214:
+    struct vfio_group_status group_status =
+    { .argsz = sizeof(group_status) };

WARNING: line over 80 characters
#407: FILE: util/vfio-helpers.c:334:
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host, size_t size)

WARNING: line over 80 characters
#415: FILE: util/vfio-helpers.c:342:
+static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host, size_t size)

total: 1 errors, 6 warnings, 746 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/4: block: Add VFIO based NVMe driver...
ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#145: FILE: block/nvme.c:92:
+    volatile uint32_t *doorbell;

ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#169: FILE: block/nvme.c:116:
+typedef volatile struct {

WARNING: line over 80 characters
#339: FILE: block/nvme.c:286:
+        DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",

WARNING: line over 80 characters
#626: FILE: block/nvme.c:573:
+            error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",

WARNING: line over 80 characters
#655: FILE: block/nvme.c:602:
+            error_setg(errp, "Timeout while waiting for device to start (%ld ms)",

total: 2 errors, 3 warnings, 1033 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
                   ` (4 preceding siblings ...)
  2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
@ 2016-12-21  0:48 ` no-reply
  2016-12-29  4:09 ` Tian, Kevin
  6 siblings, 0 replies; 20+ messages in thread
From: no-reply @ 2016-12-21  0:48 UTC (permalink / raw)
  To: famz
  Cc: qemu-devel, kwolf, qemu-block, mreitz, borntraeger, stefanha,
	pbonzini, krister

Hi,

Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.

Type: series
Message-id: 20161220163139.12016-1-famz@redhat.com
Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
make docker-test-build@min-glib
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-m_z3bizr/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPY    RUNNER
    RUN test-quick in qemu:centos6 
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache     tar git make gcc g++     zlib-devel glib2-devel SDL-devel pixman-devel     epel-release
HOSTNAME=71a2d1232481
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix    /var/tmp/qemu-build/install
BIOS directory    /var/tmp/qemu-build/install/share/qemu
binary directory  /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory  /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory  /var/tmp/qemu-build/install/etc
local state directory   /var/tmp/qemu-build/install/var
Manual directory  /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path       /tmp/qemu-test/src
C compiler        cc
Host C compiler   cc
C++ compiler      
Objective-C compiler cc
ARFLAGS           rv
CFLAGS            -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS       -I/usr/include/pixman-1    -pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include   -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv  -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-all
LDFLAGS           -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
make              make
install           install
python            python -B
smbd              /usr/sbin/smbd
module support    no
host CPU          x86_64
host big endian   no
target list       x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled     no
sparse enabled    no
strip binaries    yes
profiler          no
static build      no
pixman            system
SDL support       yes (1.2.14)
GTK support       no 
GTK GL support    no
VTE support       no 
TLS priority      NORMAL
GNUTLS support    no
GNUTLS rnd        no
libgcrypt         no
libgcrypt kdf     no
nettle            no 
nettle kdf        no
libtasn1          no
curses support    no
virgl support     no
curl support      no
mingw32 support   no
Audio drivers     oss
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS support    no
VNC support       yes
VNC SASL support  no
VNC JPEG support  no
VNC PNG support   no
xen support       no
brlapi support    no
bluez  support    no
Documentation     no
PIE               yes
vde support       no
netmap support    no
Linux AIO support no
ATTR/XATTR support yes
Install blobs     yes
KVM support       yes
COLO support      yes
RDMA support      no
TCG interpreter   no
fdt support       yes
preadv support    yes
fdatasync         yes
madvise           yes
posix_madvise     yes
libcap-ng support no
vhost-net support yes
vhost-scsi support yes
vhost-vsock support yes
Trace backends    log
spice support     no 
rbd support       no
xfsctl support    no
smartcard support no
libusb            no
usb net redir     no
OpenGL support    no
OpenGL dmabufs    no
libiscsi support  no
libnfs support    no
build guest agent yes
QGA VSS support   no
QGA w32 disk info no
QGA MSI support   no
seccomp support   no
coroutine backend ucontext
coroutine pool    yes
debug stack usage no
GlusterFS support no
Archipelago support no
gcov              gcov
gcov enabled      no
TPM support       yes
libssh2 support   no
TPM passthrough   yes
QOM debugging     yes
lzo support       no
snappy support    no
bzip2 support     no
NUMA host support no
tcmalloc support  no
jemalloc support  no
avx2 optimization no
replication support yes
  GEN     x86_64-softmmu/config-devices.mak.tmp
  GEN     aarch64-softmmu/config-devices.mak.tmp
  GEN     config-host.h
  GEN     qemu-options.def
  GEN     qmp-commands.h
  GEN     qapi-types.h
  GEN     qapi-visit.h
  GEN     qapi-event.h
  GEN     qmp-introspect.h
  GEN     x86_64-softmmu/config-devices.mak
  GEN     aarch64-softmmu/config-devices.mak
  GEN     module_block.h
  GEN     tests/test-qapi-visit.h
  GEN     tests/test-qapi-types.h
  GEN     tests/test-qmp-commands.h
  GEN     tests/test-qapi-event.h
  GEN     tests/test-qmp-introspect.h
  GEN     trace/generated-tracers.h
  GEN     trace/generated-tcg-tracers.h
  GEN     trace/generated-helpers-wrappers.h
  GEN     trace/generated-helpers.h
  GEN     config-all-devices.mak
  CC      tests/qemu-iotests/socket_scm_helper.o
  GEN     qga/qapi-generated/qga-qmp-commands.h
  GEN     qga/qapi-generated/qga-qapi-types.h
  GEN     qga/qapi-generated/qga-qapi-visit.h
  GEN     qga/qapi-generated/qga-qapi-types.c
  GEN     qga/qapi-generated/qga-qapi-visit.c
  GEN     qga/qapi-generated/qga-qmp-marshal.c
  GEN     qapi-types.c
  GEN     qmp-introspect.c
  GEN     qapi-visit.c
  GEN     qapi-event.c
  CC      qapi/qapi-visit-core.o
  CC      qapi/qapi-dealloc-visitor.o
  CC      qapi/qobject-input-visitor.o
  CC      qapi/qobject-output-visitor.o
  CC      qapi/qmp-registry.o
  CC      qapi/qmp-dispatch.o
  CC      qapi/string-input-visitor.o
  CC      qapi/string-output-visitor.o
  CC      qapi/opts-visitor.o
  CC      qapi/qapi-clone-visitor.o
  CC      qapi/qmp-event.o
  CC      qapi/qapi-util.o
  CC      qobject/qnull.o
  CC      qobject/qint.o
  CC      qobject/qstring.o
  CC      qobject/qdict.o
  CC      qobject/qlist.o
  CC      qobject/qfloat.o
  CC      qobject/qbool.o
  CC      qobject/qjson.o
  CC      qobject/qobject.o
  CC      qobject/json-lexer.o
  CC      qobject/json-streamer.o
  CC      qobject/json-parser.o
  GEN     trace/generated-tracers.c
  CC      trace/control.o
  CC      trace/qmp.o
  CC      util/osdep.o
  CC      util/cutils.o
  CC      util/unicode.o
  CC      util/qemu-timer-common.o
  CC      util/bufferiszero.o
  CC      util/compatfd.o
  CC      util/event_notifier-posix.o
  CC      util/mmap-alloc.o
  CC      util/oslib-posix.o
  CC      util/qemu-openpty.o
  CC      util/qemu-thread-posix.o
  CC      util/memfd.o
  CC      util/envlist.o
  CC      util/path.o
  CC      util/module.o
  CC      util/bitmap.o
  CC      util/bitops.o
  CC      util/hbitmap.o
  CC      util/fifo8.o
  CC      util/acl.o
  CC      util/qemu-error.o
  CC      util/error.o
  CC      util/id.o
  CC      util/iov.o
  CC      util/qemu-config.o
  CC      util/qemu-sockets.o
  CC      util/uri.o
  CC      util/notify.o
  CC      util/qemu-option.o
  CC      util/qemu-progress.o
  CC      util/hexdump.o
  CC      util/crc32c.o
  CC      util/uuid.o
  CC      util/getauxval.o
  CC      util/throttle.o
  CC      util/readline.o
  CC      util/rcu.o
  CC      util/qemu-coroutine.o
  CC      util/qemu-coroutine-lock.o
  CC      util/qemu-coroutine-io.o
  CC      util/qemu-coroutine-sleep.o
  CC      util/coroutine-ucontext.o
  CC      util/buffer.o
  CC      util/timed-average.o
  CC      util/base64.o
  CC      util/log.o
  CC      util/qdist.o
  CC      util/qht.o
  CC      util/range.o
  CC      util/vfio-helpers.o
  CC      crypto/pbkdf-stub.o
  CC      stubs/arch-query-cpu-def.o
  CC      stubs/arch-query-cpu-model-expansion.o
  CC      stubs/arch-query-cpu-model-comparison.o
  CC      stubs/arch-query-cpu-model-baseline.o
  CC      stubs/bdrv-next-monitor-owned.o
  CC      stubs/blk-commit-all.o
  CC      stubs/blockdev-close-all-bdrv-states.o
  CC      stubs/cpu-get-clock.o
  CC      stubs/clock-warp.o
  CC      stubs/cpu-get-icount.o
  CC      stubs/dump.o
  CC      stubs/error-printf.o
  CC      stubs/fdset-add-fd.o
  CC      stubs/fdset-find-fd.o
  CC      stubs/fdset-get-fd.o
  CC      stubs/fdset-remove-fd.o
  CC      stubs/gdbstub.o
  CC      stubs/get-fd.o
  CC      stubs/get-next-serial.o
  CC      stubs/get-vm-name.o
  CC      stubs/iothread.o
  CC      stubs/iothread-lock.o
  CC      stubs/is-daemonized.o
  CC      stubs/machine-init-done.o
  CC      stubs/migr-blocker.o
  CC      stubs/mon-is-qmp.o
  CC      stubs/monitor-init.o
  CC      stubs/notify-event.o
  CC      stubs/qtest.o
  CC      stubs/ram.o
  CC      stubs/replay.o
  CC      stubs/reset.o
  CC      stubs/replay-user.o
  CC      stubs/runstate-check.o
  CC      stubs/set-fd-handler.o
  CC      stubs/slirp.o
  CC      stubs/sysbus.o
  CC      stubs/trace-control.o
  CC      stubs/uuid.o
  CC      stubs/vm-stop.o
  CC      stubs/vmstate.o
  CC      stubs/cpus.o
  CC      stubs/kvm.o
  CC      stubs/qmp_pc_dimm_device_list.o
  CC      stubs/target-monitor-defs.o
  CC      stubs/target-get-monitor-def.o
  CC      stubs/vhost.o
  CC      stubs/iohandler.o
  CC      stubs/ipmi.o
  CC      stubs/smbios_type_38.o
  CC      stubs/pc_madt_cpu_entry.o
  CC      stubs/migration-colo.o
  CC      contrib/ivshmem-client/ivshmem-client.o
  CC      contrib/ivshmem-client/main.o
  CC      contrib/ivshmem-server/ivshmem-server.o
  CC      contrib/ivshmem-server/main.o
  CC      qemu-nbd.o
  CC      async.o
  CC      thread-pool.o
  CC      block.o
  CC      blockjob.o
  CC      main-loop.o
  CC      iohandler.o
  CC      qemu-timer.o
  CC      aio-posix.o
  CC      qemu-io-cmds.o
  CC      replication.o
  CC      block/raw_bsd.o
  CC      block/qcow.o
  CC      block/vdi.o
  CC      block/vmdk.o
  CC      block/cloop.o
  CC      block/bochs.o
  CC      block/vpc.o
  CC      block/vvfat.o
  CC      block/dmg.o
  CC      block/qcow2.o
  CC      block/qcow2-cluster.o
  CC      block/qcow2-refcount.o
  CC      block/qcow2-snapshot.o
  CC      block/qcow2-cache.o
  CC      block/qed.o
  CC      block/qed-gencb.o
  CC      block/qed-l2-cache.o
  CC      block/qed-table.o
  CC      block/qed-cluster.o
  CC      block/qed-check.o
  CC      block/vhdx.o
  CC      block/vhdx-endian.o
  CC      block/vhdx-log.o
  CC      block/quorum.o
  CC      block/parallels.o
  CC      block/blkdebug.o
  CC      block/blkverify.o
  CC      block/blkreplay.o
  CC      block/block-backend.o
  CC      block/snapshot.o
  CC      block/qapi.o
  CC      block/raw-posix.o
  CC      block/null.o
  CC      block/mirror.o
  CC      block/commit.o
  CC      block/io.o
  CC      block/throttle-groups.o
  CC      block/nvme.o
  CC      block/nbd.o
  CC      block/nbd-client.o
  CC      block/sheepdog.o
  CC      block/accounting.o
  CC      block/dirty-bitmap.o
  CC      block/write-threshold.o
  CC      block/backup.o
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_init’:
/tmp/qemu-test/src/block/nvme.c:615: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c:632: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_close’:
/tmp/qemu-test/src/block/nvme.c:685: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_detach_aio_context’:
/tmp/qemu-test/src/block/nvme.c:963: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_attach_aio_context’:
/tmp/qemu-test/src/block/nvme.c:972: error: too many arguments to function ‘aio_set_event_notifier’
make: *** [block/nvme.o] Error 1
make: *** Waiting for unfinished jobs....
make[1]: *** [docker-run] Error 2
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-m_z3bizr/src'
make: *** [docker-run-test-quick@centos6] Error 2
=== OUTPUT END ===

Test command exited with code: 2


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
  2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
@ 2016-12-21  1:38   ` Fam Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-21  1:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, qemu-block, mreitz, borntraeger, stefanha, pbonzini, krister

On Tue, 12/20 15:04, no-reply@patchew.org wrote:
> ERROR: that open brace { should be on the previous line
> #287: FILE: util/vfio-helpers.c:214:
> +    struct vfio_group_status group_status =
> +    { .argsz = sizeof(group_status) };

Hmm, it may indeed look better.

> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #145: FILE: block/nvme.c:92:
> +    volatile uint32_t *doorbell;
> 
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #169: FILE: block/nvme.c:116:
> +typedef volatile struct {

These are shared with the hardware; even volatile-considered-harmful.txt admits
this can be valid:

> - Pointers to data structures in coherent memory which might be modified
>   by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
>   used by a network adapter, where that adapter changes pointers to
>   indicate which descriptors have been processed, is an example of this
>   type of situation.
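
As a minimal sketch (not taken from the patch; names invented), this is the
access pattern in question: the doorbell lives in BAR memory mapped from the
device, so the compiler must not cache or drop the stores:

/*
 * Illustrative sketch only.  'volatile' keeps the compiler from caching or
 * coalescing accesses to a register that the device itself observes.
 */
#include <stdint.h>

typedef struct {
    volatile uint32_t *doorbell;  /* points into PCI BAR memory, not RAM */
    uint32_t tail;
} ExampleSubmissionQueue;

static void example_ring_doorbell(ExampleSubmissionQueue *q)
{
    /* A write barrier ordering the SQ entry before this store is still
     * needed separately (smp_wmb() in the patch); volatile only
     * guarantees the store itself is emitted where it appears. */
    *q->doorbell = q->tail;
}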

Fam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
  2016-12-20 16:39   ` Paolo Bonzini
@ 2016-12-21 11:59   ` Stefan Hajnoczi
  2016-12-21 14:05     ` Fam Zheng
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2016-12-21 11:59 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
	Stefan Hajnoczi, Paolo Bonzini, Karl Rister

[-- Attachment #1: Type: text/plain, Size: 36425 bytes --]

On Wed, Dec 21, 2016 at 12:31:39AM +0800, Fam Zheng wrote:
> This is a new protocol driver that exclusively opens a host NVMe
> controller through VFIO. It achieves better latency than linux-aio.

This is an interesting block driver to have for performance comparisons.
Definitely something that is worth merging.

>                                          nvme://    linux-aio
> ------------------------------------------------------------------
>   fio bs=4k  iodepth=1 (IOPS)             30014      24009
>   fio bs=4k  iodepth=1 (IOPS)  +polling   45148      34689
>   fio bs=64k iodepth=1 (IOPS)             17801      14954
>   fio bs=64k iodepth=1 (IOPS)  +polling   17970      14564
> 
>   fio bs=4k  iodepth=32 (IOPS)           110637     121480
>   fio bs=4k  iodepth=32 (IOPS) +polling  145533     166878
>   fio bs=64k iodepth=32 (IOPS)            50525      50596
>   fio bs=64k iodepth=32 (IOPS) +polling   50482      50534
> 
>   (host) qemu-img bench -c 8000000 (sec)  15.00      43.13
> 
> ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> the used NVMe device, otherwise it defaults to 0).
> 
> For the rows where linux-aio is faster, a lot of evidence shows that the
> io queue is more likely to stay full if the request completions happen
> faster, as in the nvme:// case, hence less batch submission and request
> merging than linux-aio.

I'm not sure I understand this paragraph.  Sounds like you are saying
that nvme:// has lower latency and this defeats batch submission.
Higher numbers are still better at the end of the day so it's worth
studying this more closely and coming up with solutions.  Maybe at a
certain rate of submission it makes sense to favor throughput
(batching) even with nvme://.

Regarding merging: are you doing sequential I/O?  Please try random
instead.

> This patch is co-authored by Paolo and me.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/Makefile.objs |    1 +
>  block/nvme.c        | 1026 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 1027 insertions(+)
>  create mode 100644 block/nvme.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 67a036a..fe975a6 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  block-obj-y += null.o mirror.o commit.o io.o
>  block-obj-y += throttle-groups.o
> +block-obj-y += nvme.o
>  
>  block-obj-y += nbd.o nbd-client.o sheepdog.o
>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> diff --git a/block/nvme.c b/block/nvme.c
> new file mode 100644
> index 0000000..5c9ffde
> --- /dev/null
> +++ b/block/nvme.c
> @@ -0,0 +1,1026 @@
> +/*
> + * NVMe block driver based on vfio
> + *
> + * Copyright 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *   Fam Zheng <famz@redhat.com>
> + *   Paolo Bonzini <pbonzini@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +#include "qapi/error.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp/qstring.h"
> +#include "block/block_int.h"
> +#include "qemu/error-report.h"
> +#include "qemu/vfio-helpers.h"
> +
> +#define NVME_DEBUG 0
> +
> +#define DPRINTF(...) \
> +    if (NVME_DEBUG) { \
> +        printf(__VA_ARGS__); \
> +    }
> +
> +#define NVME_SQ_ENTRY_BYTES 64
> +#define NVME_CQ_ENTRY_BYTES 16
> +#define NVME_QUEUE_SIZE 128
> +
> +/* TODO: share definitions with hw/block/nvme.c? */

Sounds good.

> +enum NvmeAdminCommands {
> +    NVME_ADM_CMD_DELETE_SQ      = 0x00,
> +    NVME_ADM_CMD_CREATE_SQ      = 0x01,
> +    NVME_ADM_CMD_GET_LOG_PAGE   = 0x02,
> +    NVME_ADM_CMD_DELETE_CQ      = 0x04,
> +    NVME_ADM_CMD_CREATE_CQ      = 0x05,
> +    NVME_ADM_CMD_IDENTIFY       = 0x06,
> +    NVME_ADM_CMD_ABORT          = 0x08,
> +    NVME_ADM_CMD_SET_FEATURES   = 0x09,
> +    NVME_ADM_CMD_GET_FEATURES   = 0x0a,
> +    NVME_ADM_CMD_ASYNC_EV_REQ   = 0x0c,
> +    NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
> +    NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
> +    NVME_ADM_CMD_FORMAT_NVM     = 0x80,
> +    NVME_ADM_CMD_SECURITY_SEND  = 0x81,
> +    NVME_ADM_CMD_SECURITY_RECV  = 0x82,
> +};
> +
> +enum NvmeIoCommands {
> +    NVME_CMD_FLUSH              = 0x00,
> +    NVME_CMD_WRITE              = 0x01,
> +    NVME_CMD_READ               = 0x02,
> +    NVME_CMD_WRITE_UNCOR        = 0x04,
> +    NVME_CMD_COMPARE            = 0x05,
> +    NVME_CMD_DSM                = 0x09,
> +};
> +
> +typedef struct {
> +    uint8_t  opcode;
> +    uint8_t  flags;
> +    uint16_t cid;
> +    uint32_t nsid;
> +    uint64_t reserved;
> +    uint64_t mptr;
> +    uint64_t prp1;
> +    uint64_t prp2;
> +    uint32_t cdw10;
> +    uint32_t cdw11;
> +    uint32_t cdw12;
> +    uint32_t cdw13;
> +    uint32_t cdw14;
> +    uint32_t cdw15;
> +} QEMU_PACKED NVMeCommand;
> +
> +typedef struct {
> +    uint32_t cmd_specific;
> +    uint32_t reserved;
> +    uint16_t sq_head;
> +    uint16_t sqid;
> +    uint16_t cid;
> +    uint16_t status;
> +} QEMU_PACKED NVMeCompletion;
> +
> +typedef struct {
> +    int32_t  head, tail;
> +    uint8_t  *queue;
> +    uint64_t iova;
> +    volatile uint32_t *doorbell;
> +} NVMeQueue;
> +
> +typedef struct {
> +    BlockCompletionFunc *cb;
> +    void *opaque;
> +    int cid;
> +    void *prp_list_page;
> +    uint64_t prp_list_iova;
> +} NVMeRequest;
> +
> +typedef struct {
> +    int         index;
> +    NVMeQueue   sq, cq;
> +    int         cq_phase;
> +    uint8_t     *prp_list_pages;
> +    uint64_t    prp_list_base_iova;
> +    NVMeRequest reqs[NVME_QUEUE_SIZE];
> +    CoQueue     wait_queue;

"free_req_queue" describes the purpose of this queue.

> +    bool        busy;
> +    int         need_kick;
> +    int         inflight;
> +} NVMeQueuePair;
> +
> +typedef volatile struct {
> +    uint64_t cap;
> +    uint32_t vs;
> +    uint32_t intms;
> +    uint32_t intmc;
> +    uint32_t cc;
> +    uint32_t reserved0;
> +    uint32_t csts;
> +    uint32_t nssr;
> +    uint32_t aqa;
> +    uint64_t asq;
> +    uint64_t acq;
> +    uint32_t cmbloc;
> +    uint32_t cmbsz;
> +    uint8_t  reserved1[0xec0];
> +    uint8_t  cmd_set_specfic[0x100];
> +    uint32_t doorbells[];
> +} QEMU_PACKED NVMeRegs;
> +
> +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
> +
> +typedef struct {
> +    QEMUVFIOState *vfio;
> +    NVMeRegs *regs;
> +    /* The submission/completion queue pairs.
> +     * [0]: admin queue.
> +     * [1..]: io queues.
> +     */
> +    NVMeQueuePair **queues;
> +    int nr_queues;
> +    size_t page_size;
> +    /* How many uint32_t elements does each doorbell entry take. */
> +    size_t doorbell_scale;
> +    bool write_cache;
> +    EventNotifier event_notifier;

"event_notifier" describes the type, not the purpose of the field.
"irq_notifier" is clearer.

> +    uint64_t nsze; /* Namespace size reported by identify command */
> +    int nsid;      /* The namespace id to read/write data. */
> +    uint64_t max_transfer;
> +    int plugged;
> +    Notifier vfree_notify;
> +} BDRVNVMeState;
> +
> +#define NVME_BLOCK_OPT_DEVICE "device"
> +#define NVME_BLOCK_OPT_NAMESPACE "namespace"
> +
> +static QemuOptsList runtime_opts = {
> +    .name = "nvme",
> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = NVME_BLOCK_OPT_DEVICE,
> +            .type = QEMU_OPT_STRING,
> +            .help = "NVMe PCI device address",
> +        },
> +        {
> +            .name = NVME_BLOCK_OPT_NAMESPACE,
> +            .type = QEMU_OPT_NUMBER,
> +            .help = "NVMe namespace",
> +        },
> +        { /* end of list */ }
> +    },
> +};
> +
> +static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
> +                            int nentries, int entry_bytes, Error **errp)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    size_t bytes;
> +    int r;
> +
> +    bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
> +    q->head = q->tail = 0;
> +    q->queue = qemu_try_blockalign0(bs, bytes);
> +
> +    if (!q->queue) {
> +        error_setg(errp, "Cannot allocate queue");
> +        return;
> +    }
> +    r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, true, &q->iova);
> +    if (r) {
> +        error_setg(errp, "Cannot map queue");
> +    }
> +}
> +
> +static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
> +{
> +    qemu_vfree(q->prp_list_pages);
> +    qemu_vfree(q->sq.queue);
> +    qemu_vfree(q->cq.queue);
> +    g_free(q);
> +}
> +
> +static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
> +                                             int idx, int size,
> +                                             Error **errp)
> +{
> +    int i, r;
> +    BDRVNVMeState *s = bs->opaque;
> +    Error *local_err = NULL;
> +    NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
> +    uint64_t prp_list_iovas[NVME_QUEUE_SIZE];
> +
> +    q->index = idx;
> +    qemu_co_queue_init(&q->wait_queue);
> +    q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
> +    r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
> +                          s->page_size * NVME_QUEUE_SIZE,
> +                          false, prp_list_iovas);
> +    if (r) {
> +        goto fail;
> +    }
> +    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> +        NVMeRequest *req = &q->reqs[i];
> +        req->cid = i + 1;
> +        req->prp_list_page = q->prp_list_pages + i * s->page_size;
> +        req->prp_list_iova = prp_list_iovas[i];
> +    }
> +    nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        goto fail;
> +    }
> +    q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
> +
> +    nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        goto fail;
> +    }
> +    q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> +
> +    return q;
> +fail:
> +    nvme_free_queue_pair(bs, q);
> +    return NULL;
> +}
> +
> +static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> +    if (s->plugged || !q->need_kick) {
> +        return;
> +    }
> +    DPRINTF("nvme kick queue %d\n", q->index);
> +    assert(!(q->sq.tail & 0xFF00));
> +    /* Fence the write to submission queue entry before notifying the guest. */
> +    smp_wmb();
> +    *q->sq.doorbell = cpu_to_le32(q->sq.tail);
> +    q->inflight += q->need_kick;
> +    q->need_kick = 0;
> +}
> +
> +static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
> +{
> +    int i;
> +    if (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
> +        /* We have to leave one slot empty as that is the full queue case (head
> +         * == tail + 1). */
> +        return NULL;
> +    }
> +    for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> +        if (!q->reqs[i].cb) {
> +            return &q->reqs[i];
> +        }
> +    }
> +    return NULL;
> +}
> +
> +static inline int nvme_translate_error(const NVMeCompletion *c)
> +{
> +    if ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> +        DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",
> +                c->cmd_specific, c->sq_head, c->sqid, c->cid, c->status);
> +    }
> +    switch ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> +    case 0:
> +        return 0;
> +    case 1:
> +        return -ENOSYS;
> +    case 2:
> +        return -EINVAL;
> +    default:
> +        return -EIO;
> +    }
> +}
> +
> +static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> +    bool progress = false;
> +    NVMeRequest *req;
> +    NVMeCompletion *c;
> +
> +    DPRINTF("nvme process completion %d inflight %d\n", q->index, q->inflight);
> +    if (q->busy || s->plugged) {
> +        DPRINTF("queue busy\n");
> +        return false;
> +    }
> +    q->busy = true;
> +    assert(q->inflight >= 0);
> +    while (q->inflight) {
> +        c = (NVMeCompletion *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> +        if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> +            break;
> +        }
> +        q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> +        if (!q->cq.head) {
> +            q->cq_phase = !q->cq_phase;
> +        }
> +        /* XXX: error handling instead of assert as it's from device? */
> +        assert(c->cid > 0);
> +        assert(c->cid <= NVME_QUEUE_SIZE);
> +        DPRINTF("nvme completing command %d\n", c->cid);
> +        req = &q->reqs[c->cid - 1];
> +        assert(req->cid == c->cid);
> +        assert(req->cb);
> +        req->cb(req->opaque, nvme_translate_error(c));
> +        req->cb = req->opaque = NULL;
> +        qemu_co_enter_next(&q->wait_queue);
> +        c->cid = 0;
> +        q->inflight--;
> +        /* Flip Phase Tag bit. */
> +        c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> +        progress = true;
> +    }
> +    if (progress) {
> +        /* Notify the device so it can post more completions. */
> +        smp_mb_release();
> +        *q->cq.doorbell = cpu_to_le32(q->cq.head);
> +    }
> +    q->busy = false;
> +    return progress;
> +}
> +
> +static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
> +                                NVMeRequest *req,
> +                                NVMeCommand *cmd, BlockCompletionFunc cb,
> +                                void *opaque)
> +{
> +    req->cb = cb;
> +    req->opaque = opaque;
> +    cmd->cid = cpu_to_le32(req->cid);
> +    DPRINTF("nvme submit command %d to queue %d\n", req->cid, q->index);
> +    memcpy((uint8_t *)q->sq.queue +
> +           q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
> +    q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
> +    q->need_kick++;
> +    nvme_kick(s, q);
> +    nvme_process_completion(s, q);
> +}
> +
> +static void nvme_cmd_sync_cb(void *opaque, int ret)
> +{
> +    int *pret = opaque;
> +    *pret = ret;
> +}
> +
> +static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
> +                         NVMeCommand *cmd)
> +{
> +    NVMeRequest *req;
> +    BDRVNVMeState *s = bs->opaque;
> +    int ret = -EINPROGRESS;
> +    req = nvme_get_free_req(q);
> +    if (!req) {
> +        return -EBUSY;
> +    }
> +    nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
> +
> +    BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
> +    return ret;
> +}
> +
> +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    uint8_t *resp;
> +    int r;
> +    uint64_t iova;
> +    NVMeCommand cmd = {
> +        .opcode = NVME_ADM_CMD_IDENTIFY,
> +        .cdw10 = cpu_to_le32(0x1),
> +    };
> +
> +    resp = qemu_try_blockalign0(bs, 4096);
> +    if (!resp) {
> +        error_setg(errp, "Cannot allocate buffer for identify response");
> +        return false;
> +    }
> +    r = qemu_vfio_dma_map(s->vfio, resp, 4096, true, &iova);
> +    if (r) {
> +        error_setg(errp, "Cannot map buffer for DMA");
> +        goto fail;
> +    }
> +    cmd.prp1 = cpu_to_le64(iova);
> +
> +    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> +        error_setg(errp, "Failed to identify controller");
> +        goto fail;
> +    }
> +
> +    if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) {
> +        error_setg(errp, "Invalid namespace");
> +        goto fail;
> +    }
> +    s->write_cache = le32_to_cpu(resp[525]) & 0x1;
> +    s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size;
> +    /* For now the page list buffer per command is one page, to hold at most
> +     * s->page_size / sizeof(uint64_t) entries. */
> +    s->max_transfer = MIN_NON_ZERO(s->max_transfer,
> +                          s->page_size / sizeof(uint64_t) * s->page_size);
> +
> +    memset((char *)resp, 0, 4096);
> +
> +    cmd.cdw10 = 0;
> +    cmd.nsid = namespace;
> +    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> +        error_setg(errp, "Failed to identify namespace");
> +        goto fail;
> +    }
> +
> +    /*qemu_hexdump((const char *)resp, stdout, "NS", 4096);*/
> +    s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]);
> +
> +    qemu_vfree(resp);
> +    return true;
> +fail:
> +    qemu_vfree(resp);
> +    return false;
> +}
> +
> +static void nvme_handle_event(EventNotifier *n)
> +{
> +    int i;
> +    BDRVNVMeState *s = container_of(n, BDRVNVMeState, event_notifier);
> +
> +    DPRINTF("nvme handle event\n");
> +    event_notifier_test_and_clear(n);
> +    for (i = 0; i < s->nr_queues; i++) {
> +        while (nvme_process_completion(s, s->queues[i])) {
> +            /* Keep polling until no progress. */
> +        }
> +    }
> +}
> +
> +static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    int n = s->nr_queues;
> +    NVMeQueuePair *q;
> +    NVMeCommand cmd;
> +    int queue_size = NVME_QUEUE_SIZE;
> +
> +    q = nvme_create_queue_pair(bs, n, queue_size, errp);
> +    if (!q) {
> +        return false;
> +    }
> +    cmd = (NVMeCommand) {
> +        .opcode = NVME_ADM_CMD_CREATE_CQ,
> +        .prp1 = cpu_to_le64(q->cq.iova),
> +        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> +        .cdw11 = cpu_to_le32(0x3),
> +    };
> +    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> +        error_setg(errp, "Failed to create io queue [%d]", n);
> +        nvme_free_queue_pair(bs, q);
> +        return false;
> +    }
> +    cmd = (NVMeCommand) {
> +        .opcode = NVME_ADM_CMD_CREATE_SQ,
> +        .prp1 = cpu_to_le64(q->sq.iova),
> +        .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> +        .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> +    };
> +    if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> +        error_setg(errp, "Failed to create io queue [%d]", n);
> +        nvme_free_queue_pair(bs, q);
> +        return false;
> +    }
> +    s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
> +    s->queues[n] = q;
> +    s->nr_queues++;
> +    return true;
> +}
> +
> +static bool nvme_poll_cb(void *opaque)
> +{
> +    int i;
> +    EventNotifier *e = opaque;
> +    BDRVNVMeState *s = container_of(e, BDRVNVMeState, event_notifier);
> +    bool progress = false;
> +
> +    DPRINTF("nvme poll cb\n");
> +    for (i = 0; i < s->nr_queues; i++) {
> +        while (nvme_process_completion(s, s->queues[i])) {
> +            progress = true;
> +        }
> +    }
> +    return progress;
> +}
> +
> +static void nvme_vfree_cb(Notifier *n, void *p)
> +{
> +    BDRVNVMeState *s = container_of(n, BDRVNVMeState, vfree_notify);
> +    qemu_vfio_dma_unmap(s->vfio, p);
> +}
> +
> +static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
> +                     Error **errp)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    int ret;
> +    uint64_t cap;
> +    uint64_t timeout_ms;
> +    uint64_t deadline, now;
> +
> +    s->nsid = namespace;
> +    ret = event_notifier_init(&s->event_notifier, 0);
> +    if (ret) {
> +        error_setg(errp, "Failed to init event notifier");
> +        return ret;
> +    }
> +
> +    s->vfree_notify.notify = nvme_vfree_cb;
> +    qemu_vfree_add_notifier(&s->vfree_notify);
> +
> +    s->vfio = qemu_vfio_open_pci(device, errp);
> +    if (!s->vfio) {
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, errp);
> +    if (!s->regs) {
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    /* Perform initialize sequence as described in NVMe spec "7.6.1
> +     * Initialization". */
> +
> +    cap = le64_to_cpu(s->regs->cap);
> +    if (!(cap & (1ULL << 37))) {
> +        error_setg(errp, "Device doesn't support NVMe command set");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
> +    s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
> +    bs->bl.opt_mem_alignment = s->page_size;
> +    timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
> +
> +    /* Reset device to get a clean state. */
> +    s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
> +    /* Wait for CSTS.RDY = 0. */
> +    deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
> +    while (le32_to_cpu(s->regs->csts) & 0x1) {
> +        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> +            error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
> +                       timeout_ms);
> +            ret = -ETIMEDOUT;
> +            goto fail;
> +        }
> +    }
> +
> +    /* Set up admin queue. */
> +    s->queues = g_new(NVMeQueuePair *, 1);
> +    s->nr_queues = 1;
> +    s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
> +    if (!s->queues[0]) {
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
> +    s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
> +    s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
> +    s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
> +
> +    /* After setting up all control registers we can enable device now. */
> +    s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
> +                              (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
> +                              0x1);
> +    /* Wait for CSTS.RDY = 1. */
> +    now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +    deadline = now + timeout_ms * 1000000;
> +    while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
> +        if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> +            error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
> +                       timeout_ms);
> +            ret = -ETIMEDOUT;
> +            goto fail_queue;
> +        }
> +    }
> +
> +    ret = qemu_vfio_pci_init_irq(s->vfio, &s->event_notifier,
> +                                 VFIO_PCI_MSIX_IRQ_INDEX, errp);
> +    if (ret) {
> +        goto fail_queue;
> +    }
> +    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> +                           false, nvme_handle_event, nvme_poll_cb);
> +
> +    if (!nvme_identify(bs, namespace, errp)) {
> +        ret = -EIO;
> +        goto fail_handler;
> +    }
> +
> +    /* Set up command queues. */
> +    /* XXX: multiple io queues? */
> +    if (!nvme_add_io_queue(bs, errp)) {
> +        ret = -EIO;
> +        goto fail_handler;
> +    }
> +    return 0;
> +
> +fail_handler:
> +    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> +                           false, NULL, NULL);
> +fail_queue:
> +    nvme_free_queue_pair(bs, s->queues[0]);
> +fail:
> +    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> +    qemu_vfio_close(s->vfio);
> +    event_notifier_cleanup(&s->event_notifier);
> +    return ret;
> +}
> +
> +static void nvme_parse_filename(const char *filename, QDict *options,
> +                                Error **errp)
> +{
> +    int pref = strlen("nvme://");
> +
> +    /* XXX: support namespace in URI like nvme://0000:00:00.0/1 ? */
> +    if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
> +        qdict_put(options, NVME_BLOCK_OPT_DEVICE,
> +                  qstring_from_str(filename + pref));
> +    }
> +}
> +
> +static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
> +                          Error **errp)
> +{
> +    const char *device;
> +    QemuOpts *opts;
> +    int namespace;
> +
> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> +    qemu_opts_absorb_qdict(opts, options, &error_abort);
> +    device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> +    if (!device) {
> +        error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
> +        return -EINVAL;
> +    }
> +
> +    namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
> +    nvme_init(bs, device, namespace, errp);
> +
> +    qemu_opts_del(opts);
> +    return 0;
> +}
> +
> +static void nvme_close(BlockDriverState *bs)
> +{
> +    int i;
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    for (i = 0; i < s->nr_queues; ++i) {
> +        nvme_free_queue_pair(bs, s->queues[i]);
> +    }
> +    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> +                           false, NULL, NULL);
> +    notifier_remove(&s->vfree_notify);
> +    qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> +    qemu_vfio_close(s->vfio);
> +}
> +
> +static int64_t nvme_getlength(BlockDriverState *bs)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    return s->nsze << BDRV_SECTOR_BITS;
> +}
> +
> +static inline int nvme_cmd_map_qiov(BlockDriverState *bs, NVMeCommand *cmd,
> +                                    NVMeRequest *req, QEMUIOVector *qiov)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    uint64_t *pagelist = req->prp_list_page;
> +    int i, r;
> +    unsigned int entries = 0;
> +
> +    assert(qiov->size);
> +    assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
> +    assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
> +    for (i = 0; i < qiov->niov; ++i) {
> +        r = qemu_vfio_dma_map(s->vfio,
> +                              qiov->iov[i].iov_base,
> +                              qiov->iov[i].iov_len,
> +                              false, &pagelist[entries]);
> +        if (r) {
> +            return r;
> +        }
> +
> +        entries += qiov->iov[i].iov_len / s->page_size;
> +        assert(entries <= s->page_size / sizeof(uint64_t));
> +    }
> +
> +    switch (entries) {
> +    case 0:
> +        abort();
> +    case 1:
> +        cmd->prp1 = cpu_to_le64(pagelist[0]);
> +        cmd->prp2 = 0;
> +        break;
> +    case 2:
> +        cmd->prp1 = cpu_to_le64(pagelist[0]);
> +        cmd->prp2 = cpu_to_le64(pagelist[1]);;
> +        break;
> +    default:
> +        cmd->prp1 = cpu_to_le64(pagelist[0]);
> +        cmd->prp2 = cpu_to_le64(req->prp_list_iova);
> +        for (i = 0; i < entries - 1; ++i) {
> +            pagelist[i] = cpu_to_le64(pagelist[i + 1]);
> +        }
> +        pagelist[entries] = 0;
> +        break;
> +    }
> +    return 0;
> +}
> +
> +typedef struct {
> +    Coroutine *co;
> +    int ret;
> +    AioContext *ctx;
> +} NVMeCoData;
> +
> +static void nvme_rw_cb_bh(void *opaque)
> +{
> +    NVMeCoData *data = opaque;
> +    qemu_coroutine_enter(data->co);
> +}
> +
> +static void nvme_rw_cb(void *opaque, int ret)
> +{
> +    NVMeCoData *data = opaque;
> +    data->ret = ret;
> +    if (!data->co) {
> +        /* The rw coroutine hasn't yielded, don't try to enter. */
> +        return;
> +    }
> +    if (qemu_coroutine_self() != data->co) {
> +        qemu_coroutine_enter(data->co);
> +    } else {
> +        aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
> +    }
> +}
> +
> +static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
> +                                            uint64_t offset, uint64_t bytes,
> +                                            QEMUIOVector *qiov,
> +                                            bool is_write)
> +{
> +    int r;
> +    BDRVNVMeState *s = bs->opaque;
> +    NVMeQueuePair *ioq = s->queues[1];
> +    NVMeRequest *req;
> +    NVMeCommand cmd = {
> +        .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
> +        .nsid = cpu_to_le32(s->nsid),
> +        .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
> +        .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
> +        .cdw12 = cpu_to_le32(((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF),
> +    };
> +    NVMeCoData data = {
> +        .ctx = bdrv_get_aio_context(bs),
> +        .ret = -EINPROGRESS,
> +    };
> +
> +    DPRINTF(">>> NVMe %s offset %lx bytes %lx qiov[%d]\n",
> +            is_write ? "write" : "read",
> +            offset, bytes, qiov->niov);
> +    assert(s->nr_queues > 1);
> +    while (true) {
> +        req = nvme_get_free_req(ioq);
> +        if (req) {
> +            break;
> +        }
> +        DPRINTF("nvme wait req\n");
> +        qemu_co_queue_wait(&ioq->wait_queue);
> +        DPRINTF("nvme wait req done\n");
> +    }
> +
> +    r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
> +    if (r) {
> +        return r;

Is req leaked?

> +    }
> +    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> +    data.co = qemu_coroutine_self();
> +    while (data.ret == -EINPROGRESS) {
> +        qemu_coroutine_yield();
> +    }
> +
> +    DPRINTF("<<< NVMe %s offset %lx bytes %lx ret %d\n",
> +            is_write ? "write" : "read",
> +            offset, bytes, data.ret);
> +    return data.ret;
> +}
> +
> +static inline bool nvme_qiov_aligned(BlockDriverState *bs,
> +                                     const QEMUIOVector *qiov)
> +{
> +    int i;
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    for (i = 0; i < qiov->niov; ++i) {
> +        if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
> +            !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
> +            return false;
> +        }
> +    }
> +    return true;
> +}
> +
> +static inline coroutine_fn
> +int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +                QEMUIOVector *qiov, bool is_write)
> +{
> +    int r;
> +    uint8_t *buf;
> +    QEMUIOVector local_qiov;
> +
> +    assert(offset % BDRV_SECTOR_SIZE == 0);
> +    assert(bytes % BDRV_SECTOR_SIZE == 0);
> +    if (nvme_qiov_aligned(bs, qiov)) {
> +        return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write);
> +    }
> +    buf = qemu_try_blockalign(bs, bytes);
> +
> +    if (!buf) {
> +        return -ENOMEM;
> +    }
> +    qemu_iovec_init(&local_qiov, 1);
> +    qemu_iovec_add(&local_qiov, buf, bytes);
> +    r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write);
> +    qemu_iovec_destroy(&local_qiov);
> +    if (!r) {
> +        qemu_iovec_from_buf(qiov, 0, buf, bytes);
> +    }
> +    qemu_vfree(buf);
> +    return r;
> +}
> +
> +static coroutine_fn int nvme_co_preadv(BlockDriverState *bs,
> +                                       uint64_t offset, uint64_t bytes,
> +                                       QEMUIOVector *qiov, int flags)
> +{
> +    return nvme_co_prw(bs, offset, bytes, qiov, false);
> +}
> +
> +static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs,
> +                                        uint64_t offset, uint64_t bytes,
> +                                        QEMUIOVector *qiov, int flags)
> +{
> +    return nvme_co_prw(bs, offset, bytes, qiov, true);
> +}
> +
> +static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    NVMeQueuePair *ioq = s->queues[1];
> +    NVMeRequest *req;
> +    NVMeCommand cmd = {
> +        .opcode = NVME_CMD_FLUSH,
> +        .nsid = cpu_to_le32(s->nsid),
> +    };
> +    NVMeCoData data = {
> +        .ctx = bdrv_get_aio_context(bs),
> +        .ret = -EINPROGRESS,
> +    };
> +
> +    assert(s->nr_queues > 1);
> +    while (true) {
> +        req = nvme_get_free_req(ioq);
> +        if (req) {
> +            break;
> +        }
> +        qemu_co_queue_wait(&ioq->wait_queue);
> +    }
> +
> +    nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> +    data.co = qemu_coroutine_self();
> +    if (data.ret == -EINPROGRESS) {
> +        qemu_coroutine_yield();
> +    }
> +
> +    return data.ret;
> +}
> +
> +
> +static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> +                               BlockReopenQueue *queue, Error **errp)
> +{
> +    return 0;
> +}
> +
> +static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
> +                                                     int64_t sector_num,
> +                                                     int nb_sectors, int *pnum,
> +                                                     BlockDriverState **file)
> +{
> +    *pnum = nb_sectors;
> +    *file = bs;
> +
> +    return BDRV_BLOCK_OFFSET_VALID | (sector_num << BDRV_SECTOR_BITS);
> +}
> +
> +static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
> +{
> +    QINCREF(opts);
> +    qdict_del(opts, "filename");
> +
> +    if (!qdict_size(opts)) {
> +        snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://",
> +                 bs->drv->format_name);
> +    }
> +
> +    qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name));
> +    bs->full_open_options = opts;
> +}
> +
> +static void nvme_refresh_limits(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    /* XXX: other fields from identify command/controller capacity? */
> +    bs->bl.opt_mem_alignment = s->page_size;
> +    /* XXX: are there other limits? */
> +    bs->bl.request_alignment = s->page_size;
> +    bs->bl.max_transfer = s->max_transfer;
> +}
> +
> +static void nvme_detach_aio_context(BlockDriverState *bs)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> +                           false, NULL, NULL);
> +}
> +
> +static void nvme_attach_aio_context(BlockDriverState *bs,
> +                                    AioContext *new_context)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +
> +    aio_set_event_notifier(new_context, &s->event_notifier,
> +                           false, nvme_handle_event, nvme_poll_cb);
> +}
> +
> +static void nvme_aio_plug(BlockDriverState *bs)
> +{
> +    BDRVNVMeState *s = bs->opaque;
> +    s->plugged++;
> +}
> +
> +static void nvme_aio_unplug(BlockDriverState *bs)
> +{
> +    int i;
> +    BDRVNVMeState *s = bs->opaque;
> +    assert(s->plugged);
> +    if (!--s->plugged) {
> +        for (i = 1; i < s->nr_queues; i++) {
> +            nvme_kick(s, s->queues[i]);
> +            nvme_process_completion(s, s->queues[i]);
> +        }
> +    }
> +}
> +
> +static BlockDriver bdrv_nvme = {
> +    .format_name              = "nvme",
> +    .protocol_name            = "nvme",
> +    .instance_size            = sizeof(BDRVNVMeState),
> +
> +    .bdrv_parse_filename      = nvme_parse_filename,
> +    .bdrv_file_open           = nvme_file_open,
> +    .bdrv_close               = nvme_close,
> +    .bdrv_getlength           = nvme_getlength,
> +
> +    .bdrv_co_preadv           = nvme_co_preadv,
> +    .bdrv_co_pwritev          = nvme_co_pwritev,
> +    .bdrv_co_flush_to_disk    = nvme_co_flush,
> +    .bdrv_reopen_prepare      = nvme_reopen_prepare,
> +
> +    .bdrv_co_get_block_status = nvme_co_get_block_status,
> +
> +    .bdrv_refresh_filename    = nvme_refresh_filename,
> +    .bdrv_refresh_limits      = nvme_refresh_limits,
> +
> +    .bdrv_detach_aio_context  = nvme_detach_aio_context,
> +    .bdrv_attach_aio_context  = nvme_attach_aio_context,
> +
> +    .bdrv_io_plug             = nvme_aio_plug,
> +    .bdrv_io_unplug           = nvme_aio_unplug,
> +};
> +
> +static void bdrv_nvme_init(void)
> +{
> +    bdrv_register(&bdrv_nvme);
> +}
> +
> +block_init(bdrv_nvme_init);
> -- 
> 2.9.3
> 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver
  2016-12-21 11:59   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2016-12-21 14:05     ` Fam Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-21 14:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
	Stefan Hajnoczi, Paolo Bonzini, Karl Rister

On Wed, 12/21 11:59, Stefan Hajnoczi wrote:
> On Wed, Dec 21, 2016 at 12:31:39AM +0800, Fam Zheng wrote:
> > This is a new protocol driver that exclusively opens a host NVMe
> > controller through VFIO. It achieves better latency than linux-aio.
> 
> This is an interesting block driver to have for performance comparisons.
> Definitely something that is worth merging.
> 
> >                                          nvme://    linux-aio
> > ------------------------------------------------------------------
> >   fio bs=4k  iodepth=1 (IOPS)             30014      24009
> >   fio bs=4k  iodepth=1 (IOPS)  +polling   45148      34689
> >   fio bs=64k iodepth=1 (IOPS)             17801      14954
> >   fio bs=64k iodepth=1 (IOPS)  +polling   17970      14564
> > 
> >   fio bs=4k  iodepth=32 (IOPS)           110637     121480
> >   fio bs=4k  iodepth=32 (IOPS) +polling  145533     166878
> >   fio bs=64k iodepth=32 (IOPS)            50525      50596
> >   fio bs=64k iodepth=32 (IOPS) +polling   50482      50534
> > 
> >   (host) qemu-img bench -c 8000000 (sec)  15.00      43.13
> > 
> > ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> > the used NVMe device, otherwise it defaults to 0).
> > 
> > For the rows where linux-aio is faster, a lot of evidence shows that the
> > io queue is more likely to stay full if the request completions happen
> > faster, as in the nvme:// case, hence less batch submission and request
> > merging than linux-aio.
> 
> I'm not sure I understand this paragraph.  Sounds like you are saying
> that nvme:// has lower latency and this defeats batch submission.
> Higher numbers are still better at the end of the day so it's worth
> studying this more closely and coming up with solutions.  Maybe at a
> certain rate of submission it makes sense to favor throughput
> (batching) even with nvme://.

Good question! Busy polling on the nvme:// side reduces batched completions, and
busy polling on the virtio side reduces batched submissions. I think this is a
common pattern, and we should figure out a general strategy to "favor throughput"
based on the submission rate.
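
To make that concrete, here is a rough, purely hypothetical sketch of what a
rate-based heuristic could look like (nothing below exists in the patch; all
names are invented):

/*
 * Hypothetical rate-based batching heuristic.  If submissions arrive faster
 * than a threshold, defer the doorbell write so several requests share one
 * kick; otherwise kick immediately to keep latency low.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t last_submit_ns;  /* time of the previous submission */
    int     pending;         /* submissions since the last kick */
} ExampleBatchState;

static bool example_should_kick(ExampleBatchState *b, int64_t now_ns,
                                int64_t fast_threshold_ns, int max_batch)
{
    bool arriving_fast = (now_ns - b->last_submit_ns) < fast_threshold_ns;

    b->last_submit_ns = now_ns;
    b->pending++;
    if (!arriving_fast || b->pending >= max_batch) {
        b->pending = 0;
        return true;   /* ring the doorbell now */
    }
    return false;      /* defer; more requests are likely coming */
}

A real implementation would also need a guaranteed flush point (e.g. in
io_unplug or before polling goes idle) so a deferred kick cannot be lost.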

> 
> Regarding merging: are you doing sequential I/O?  Please try random
> instead.

Yes, it is sequential. I've also tried random, but the results are mostly the
same. That means merging is the smaller factor here, and host <-> guest
communication is the bigger one. Maybe we should go ahead and experiment with
busy polling on the driver side now.

> > +typedef struct {
> > +    int         index;
> > +    NVMeQueue   sq, cq;
> > +    int         cq_phase;
> > +    uint8_t     *prp_list_pages;
> > +    uint64_t    prp_list_base_iova;
> > +    NVMeRequest reqs[NVME_QUEUE_SIZE];
> > +    CoQueue     wait_queue;
> 
> "free_req_queue" describes the purpose of this queue.

OK, I'll rename it.

> 
> > +    bool        busy;
> > +    int         need_kick;
> > +    int         inflight;
> > +} NVMeQueuePair;
> > +
> > +typedef volatile struct {
> > +    uint64_t cap;
> > +    uint32_t vs;
> > +    uint32_t intms;
> > +    uint32_t intmc;
> > +    uint32_t cc;
> > +    uint32_t reserved0;
> > +    uint32_t csts;
> > +    uint32_t nssr;
> > +    uint32_t aqa;
> > +    uint64_t asq;
> > +    uint64_t acq;
> > +    uint32_t cmbloc;
> > +    uint32_t cmbsz;
> > +    uint8_t  reserved1[0xec0];
> > +    uint8_t  cmd_set_specfic[0x100];
> > +    uint32_t doorbells[];
> > +} QEMU_PACKED NVMeRegs;
> > +
> > +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
> > +
> > +typedef struct {
> > +    QEMUVFIOState *vfio;
> > +    NVMeRegs *regs;
> > +    /* The submission/completion queue pairs.
> > +     * [0]: admin queue.
> > +     * [1..]: io queues.
> > +     */
> > +    NVMeQueuePair **queues;
> > +    int nr_queues;
> > +    size_t page_size;
> > +    /* How many uint32_t elements does each doorbell entry take. */
> > +    size_t doorbell_scale;
> > +    bool write_cache;
> > +    EventNotifier event_notifier;
> 
> "event_notifier" describes the type, not the purpose of the field.
> "irq_notifier" is clearer.

Good idea.

> > +    while (true) {
> > +        req = nvme_get_free_req(ioq);
> > +        if (req) {
> > +            break;
> > +        }
> > +        DPRINTF("nvme wait req\n");
> > +        qemu_co_queue_wait(&ioq->wait_queue);
> > +        DPRINTF("nvme wait req done\n");
> > +    }
> > +
> > +    r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
> > +    if (r) {
> > +        return r;
> 
> Is req leaked?

No. It is a pointer to an ioq->reqs element. ioq->reqs is an object pool with
the same lifecycle as ioq. Until nvme_submit_command() is called, *req is still
"free".

Fam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
@ 2016-12-21 15:46   ` Paolo Bonzini
  2016-12-21 16:19     ` Fam Zheng
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-21 15:46 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: qemu-block, Karl Rister, Kevin Wolf, Max Reitz, borntraeger,
	Stefan Hajnoczi



On 20/12/2016 17:31, Fam Zheng wrote:
> +    hbitmap_iter_init(&iter, s->free_chunks, 1);
> +    if (contiguous) {
> +        while (true) {
> +            bool satisfy = true;
> +            next = hbitmap_iter_next(&iter);
> +            if (next < 0) {
> +                return NULL;
> +            }
> +            for (i = 1; i < chunks; i++) {
> +                if (!hbitmap_get(s->free_chunks, next + i)) {
> +                    satisfy = false;
> +                    break;
> +                }
> +            }
> +            if (satisfy) {
> +                break;
> +            }
> +        }
> +        hbitmap_reset(s->free_chunks, next, chunks);
> +        r = g_new(IOVARange, 1);
> +        r->iova = next * pages_per_chunk * getpagesize();
> +        r->nr_pages = pages;
> +        QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> +    } else {
> +        next = hbitmap_iter_next(&iter);
> +        while (pages) {
> +            uint64_t chunk;
> +            if (next < 0) {
> +                hbitmap_iter_init(&iter, s->free_chunks, 1);
> +                next = hbitmap_iter_next(&iter);
> +            }
> +            assert(next >= 0);
> +            chunk = next;
> +            DPRINTF("using chunk %ld\n", chunk);
> +            next = hbitmap_iter_next(&iter);
> +            hbitmap_reset(s->free_chunks, chunk, 1);
> +            if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
> +                r->nr_pages += MIN(pages, pages_per_chunk);
> +            } else {
> +                r = g_new(IOVARange, 1);
> +                r->iova = chunk * pages_per_chunk * getpagesize();
> +                r->nr_pages = MIN(pages, pages_per_chunk);
> +                QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> +            }
> +            pages -= MIN(pages, pages_per_chunk);
> +        }

I'm not sure HBitmap tracking is useful.  If we exhaust the IOVA space,
we can just throw everything away with a single VFIO_IOMMU_UNMAP_DMA.
Then replay the RAMBlockNotifier mappings (we need to add this anyway
for hotplug support) and keep on mapping lazily whatever comes later.
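
The reset path would then be little more than one big unmap followed by a
replay loop, along these lines (a sketch only; it assumes the helper records
the RAM blocks it was notified about, and the field names on QEMUVFIOState are
made up for illustration):

    static int qemu_vfio_reset_iova_space(QEMUVFIOState *s)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = 0,
            .size  = s->iova_space_size,   /* covers every current mapping */
        };
        int i;

        if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
            return -errno;
        }
        /* Replay the long-lived mappings reported by the RAMBlockNotifier;
         * everything else is mapped again lazily as I/O comes in. */
        for (i = 0; i < s->nr_ram_blocks; i++) {
            struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)s->ram_blocks[i].host,
                .iova  = s->ram_blocks[i].iova,
                .size  = s->ram_blocks[i].size,
            };
            if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &map)) {
                return -errno;
            }
        }
        return 0;
    }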

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
  2016-12-21 15:46   ` Paolo Bonzini
@ 2016-12-21 16:19     ` Fam Zheng
  2016-12-21 17:02       ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Fam Zheng @ 2016-12-21 16:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi

On Wed, 12/21 16:46, Paolo Bonzini wrote:
> 
> 
> On 20/12/2016 17:31, Fam Zheng wrote:
> > +    hbitmap_iter_init(&iter, s->free_chunks, 1);
> > +    if (contiguous) {
> > +        while (true) {
> > +            bool satisfy = true;
> > +            next = hbitmap_iter_next(&iter);
> > +            if (next < 0) {
> > +                return NULL;
> > +            }
> > +            for (i = 1; i < chunks; i++) {
> > +                if (!hbitmap_get(s->free_chunks, next + i)) {
> > +                    satisfy = false;
> > +                    break;
> > +                }
> > +            }
> > +            if (satisfy) {
> > +                break;
> > +            }
> > +        }
> > +        hbitmap_reset(s->free_chunks, next, chunks);
> > +        r = g_new(IOVARange, 1);
> > +        r->iova = next * pages_per_chunk * getpagesize();
> > +        r->nr_pages = pages;
> > +        QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> > +    } else {
> > +        next = hbitmap_iter_next(&iter);
> > +        while (pages) {
> > +            uint64_t chunk;
> > +            if (next < 0) {
> > +                hbitmap_iter_init(&iter, s->free_chunks, 1);
> > +                next = hbitmap_iter_next(&iter);
> > +            }
> > +            assert(next >= 0);
> > +            chunk = next;
> > +            DPRINTF("using chunk %ld\n", chunk);
> > +            next = hbitmap_iter_next(&iter);
> > +            hbitmap_reset(s->free_chunks, chunk, 1);
> > +            if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
> > +                r->nr_pages += MIN(pages, pages_per_chunk);
> > +            } else {
> > +                r = g_new(IOVARange, 1);
> > +                r->iova = chunk * pages_per_chunk * getpagesize();
> > +                r->nr_pages = MIN(pages, pages_per_chunk);
> > +                QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> > +            }
> > +            pages -= MIN(pages, pages_per_chunk);
> > +        }
> 
> I'm not sure HBitmap tracking is useful.  If we exhaust the IOVA space,
> we can just throw everything away with a single VFIO_IOMMU_UNMAP_DMA.
> Then replay the RAMBlockNotifier mappings (we need to add this anyway
> for hotplug support) and keep on mapping lazily whatever comes later.

It's clever! It'd be a bit more complicated than that, though. Things like
queues etc in block/nvme.c have to be preserved, and if we already ensure that,
ram blocks can be preserved similarly, but indeed bounce buffers can be handled
that way. I still need to think about how to make sure none of the invalidated
IOVA addresses are in use by other requests.

Also I wonder how expensive the huge VFIO_IOMMU_UNMAP_DMA is. In the worst case
the "throwaway" IOVAs can be limited to a small range.

Fam

> 
> Thanks,
> 
> Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
  2016-12-21 16:19     ` Fam Zheng
@ 2016-12-21 17:02       ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-21 17:02 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
	borntraeger, Stefan Hajnoczi



On 21/12/2016 17:19, Fam Zheng wrote:
> It's clever! It'd be a bit more complicated than that, though. Things like
> queues etc in block/nvme.c have to be preserved, and if we already ensure that,
> ram blocks can be preserved similarly, but indeed bounce buffers can be handled
> that way. I still need to think about how to make sure none of the invalidated
> IOVA addresses are in use by other requests.

Hmm, that's true.  As you said, we'll probably want to split the IOVA
space in two, with a relatively small part for "volatile" addresses.

You can add two counters that track how many requests are using volatile
space.  When it's time to do the VFIO_IOMMU_UNMAP_DMA, you do something
like:

    if (vfio->next_phase == vfio->current_phase) {
        vfio->next_phase = !vfio->current_phase;
        while (vfio->request_counter[vfio->current_phase] != 0) {
            wait on CoQueue
        }
        ioctl(VFIO_IOMMU_UNMAP_DMA)
        vfio->current_phase = vfio->next_phase;
        wake up everyone on CoQueue
    } else {
        /* wait for the unmap to happen */
        while (vfio->next_phase != vfio->current_phase) {
            wait on CoQueue
        }
    }

As an optimization, incrementing/decrementing request_counter can be
delayed until you find an item of the QEMUIOVector that needs a volatile
IOVA.  Then it should never be incremented in practice during guest
execution.
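
Fleshed out with the coroutine primitives the driver already uses, the sketch
above would be something like this (still only a sketch; the QEMUVFIOState
fields and helper names are assumptions, not existing code):

    /* Called when a request needs a volatile IOVA but the range is
     * exhausted.  Request completions decrement request_counter[] and
     * restart reclaim_queue. */
    static void coroutine_fn qemu_vfio_reclaim_volatile(QEMUVFIOState *s)
    {
        if (s->next_phase == s->current_phase) {
            /* We get to do the unmap: flip the phase so new requests are
             * counted separately, then wait for the old ones to drain. */
            s->next_phase = !s->current_phase;
            while (s->request_counter[s->current_phase] != 0) {
                qemu_co_queue_wait(&s->reclaim_queue);
            }
            qemu_vfio_unmap_volatile_range(s);   /* VFIO_IOMMU_UNMAP_DMA */
            s->current_phase = s->next_phase;
            qemu_co_queue_restart_all(&s->reclaim_queue);
        } else {
            /* Somebody else is already draining; wait for the unmap. */
            while (s->next_phase != s->current_phase) {
                qemu_co_queue_wait(&s->reclaim_queue);
            }
        }
    }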

Paolo

> Also I wonder how expensive the huge VFIO_IOMMU_UNMAP_DMA is. In the worst case
> the "throwaway" IOVAs can be limited to a small range.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
@ 2016-12-22  9:56   ` Paolo Bonzini
  2017-01-11  5:38   ` Stefan Weil
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-22  9:56 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
	Karl Rister


> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index 8f3a592..dc9b321 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      err = g_malloc0(nb_pfn * sizeof (int));
>  
>      if (entry->vaddr_base != NULL) {
> +        ram_block_removed(entry->vaddr_base, entry->size);
>          if (munmap(entry->vaddr_base, entry->size) != 0) {
>              perror("unmap fails");
>              exit(-1);
> @@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
>              BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
>  
> +    ram_block_added(entry->vaddr_base, entry->size);
>      bitmap_zero(entry->valid_mapping, nb_pfn);
>      for (i = 0; i < nb_pfn; i++) {
>          if (!err[i]) {
> @@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
>      }
>  
>      pentry->next = entry->next;
> +    ram_block_removed(entry->vaddr_base, entry->size);
>      if (munmap(entry->vaddr_base, entry->size) != 0) {
>          perror("unmap fails");
>          exit(-1);
> 

My mistake here, these should be ram_block_notify_{add,remove} instead.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
  2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
                   ` (5 preceding siblings ...)
  2016-12-21  0:48 ` no-reply
@ 2016-12-29  4:09 ` Tian, Kevin
  2016-12-30  0:46   ` Fam Zheng
  6 siblings, 1 reply; 20+ messages in thread
From: Tian, Kevin @ 2016-12-29  4:09 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
	Paolo Bonzini, Karl Rister

> From: Fam Zheng
> Sent: Wednesday, December 21, 2016 12:32 AM
> 
> This series adds a new protocol driver that is intended to achieve about 20%
> better performance for latency bound workloads (i.e. synchronous I/O) than
> linux-aio when guest is exclusively accessing a NVMe device, by talking to the
> device directly instead of through kernel file system layers and its NVMe
> driver.
> 

Curious... if the NVMe device is exclusively owned by the guest, why
not pass it through to the guest directly? Is it a tradeoff between
performance (better than linux-aio) and composability (snapshot and
live migration, which are not supported by direct passthrough)?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
  2016-12-29  4:09 ` Tian, Kevin
@ 2016-12-30  0:46   ` Fam Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-30  0:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
	Stefan Hajnoczi, Paolo Bonzini, Karl Rister

On Thu, 12/29 04:09, Tian, Kevin wrote:
> Is it a tradeoff between performance (better than linux-aio) and composability
> (snapshot and live migration, which are not supported by direct passthrough)?

Yes.

Fam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
  2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
  2016-12-22  9:56   ` Paolo Bonzini
@ 2017-01-11  5:38   ` Stefan Weil
  2017-01-11  5:48     ` Stefan Weil
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Weil @ 2017-01-11  5:38 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
	Paolo Bonzini, Karl Rister, Vincent Palatin

Hi,

this fails for me when building with XEN support.
I noticed the failure when testing the latest HAXM patches.
See compiler output below.

Regards
Stefan


On 12/20/16 17:31, Fam Zheng wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> This adds a notify interface of ram block additions and removals.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
[...]
> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index 8f3a592..dc9b321 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      err = g_malloc0(nb_pfn * sizeof (int));
>
>      if (entry->vaddr_base != NULL) {
> +        ram_block_removed(entry->vaddr_base, entry->size);
>          if (munmap(entry->vaddr_base, entry->size) != 0) {
>              perror("unmap fails");
>              exit(-1);
> @@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
>              BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
>
> +    ram_block_added(entry->vaddr_base, entry->size);
>      bitmap_zero(entry->valid_mapping, nb_pfn);
>      for (i = 0; i < nb_pfn; i++) {
>          if (!err[i]) {
> @@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
>      }
>
>      pentry->next = entry->next;
> +    ram_block_removed(entry->vaddr_base, entry->size);
>      if (munmap(entry->vaddr_base, entry->size) != 0) {
>          perror("unmap fails");
>          exit(-1);
>

   CC      x86_64-softmmu/xen-mapcache.o
/qemu/xen-mapcache.c: In function 'xen_remap_bucket':
/qemu/xen-mapcache.c:166:9: error: implicit declaration of function 
'ram_block_removed' [-Werror=implicit-function-declaration]
          ram_block_removed(entry->vaddr_base, entry->size);
          ^~~~~~~~~~~~~~~~~
/qemu/xen-mapcache.c:166:9: error: nested extern declaration of 
'ram_block_removed' [-Werror=nested-externs]
/qemu/xen-mapcache.c:192:5: error: implicit declaration of function 
'ram_block_added' [-Werror=implicit-function-declaration]
      ram_block_added(entry->vaddr_base, entry->size);
      ^~~~~~~~~~~~~~~
/qemu/xen-mapcache.c:192:5: error: nested extern declaration of 
'ram_block_added' [-Werror=nested-externs]
cc1: all warnings being treated as errors
/qemu/rules.mak:64: recipe for target 'xen-mapcache.o' failed
make[1]: *** [xen-mapcache.o] Error 1
Makefile:203: recipe for target 'subdir-x86_64-softmmu' failed
make: *** [subdir-x86_64-softmmu] Error 2

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
  2017-01-11  5:38   ` Stefan Weil
@ 2017-01-11  5:48     ` Stefan Weil
  2017-01-11  6:41       ` Fam Zheng
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Weil @ 2017-01-11  5:48 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: Kevin Wolf, Vincent Palatin, qemu-block, Max Reitz, borntraeger,
	Stefan Hajnoczi, Paolo Bonzini, Karl Rister

On 01/11/17 06:38, Stefan Weil wrote:
> Hi,
>
> this fails for me when building with XEN support.
> I noticed the failure when testing the latest HAXM patches.
> See compiler output below.
>
> Regards
> Stefan

The patch compiles with this modification:


diff --git a/xen-mapcache.c b/xen-mapcache.c
index dc9b321491..31debdfb2c 100644
--- a/xen-mapcache.c
+++ b/xen-mapcache.c
@@ -163,7 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
      err = g_malloc0(nb_pfn * sizeof (int));

      if (entry->vaddr_base != NULL) {
-        ram_block_removed(entry->vaddr_base, entry->size);
+        ram_block_notify_remove(entry->vaddr_base, entry->size);
          if (munmap(entry->vaddr_base, entry->size) != 0) {
              perror("unmap fails");
              exit(-1);
@@ -189,7 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
      entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned 
long) *
              BITS_TO_LONGS(size >> XC_PAGE_SHIFT));

-    ram_block_added(entry->vaddr_base, entry->size);
+    ram_block_notify_add(entry->vaddr_base, entry->size);
      bitmap_zero(entry->valid_mapping, nb_pfn);
      for (i = 0; i < nb_pfn; i++) {
          if (!err[i]) {
@@ -399,7 +399,7 @@ static void 
xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
      }

      pentry->next = entry->next;
-    ram_block_removed(entry->vaddr_base, entry->size);
+    ram_block_notify_remove(entry->vaddr_base, entry->size);
      if (munmap(entry->vaddr_base, entry->size) != 0) {
          perror("unmap fails");
          exit(-1);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
  2017-01-11  5:48     ` Stefan Weil
@ 2017-01-11  6:41       ` Fam Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2017-01-11  6:41 UTC (permalink / raw)
  To: Stefan Weil
  Cc: qemu-devel, Kevin Wolf, Vincent Palatin, qemu-block, Max Reitz,
	borntraeger, Stefan Hajnoczi, Paolo Bonzini, Karl Rister

On Wed, 01/11 06:48, Stefan Weil wrote:
> On 01/11/17 06:38, Stefan Weil wrote:
> > Hi,
> > 
> > this fails for me when building with XEN support.
> > I noticed the failure when testing the latest HAXM patches.
> > See compiler output below.
> > 
> > Regards
> > Stefan
> 
> The patch compiles with this modification:
> 
> 
> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index dc9b321491..31debdfb2c 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,7 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      err = g_malloc0(nb_pfn * sizeof (int));
> 
>      if (entry->vaddr_base != NULL) {
> -        ram_block_removed(entry->vaddr_base, entry->size);
> +        ram_block_notify_remove(entry->vaddr_base, entry->size);
>          if (munmap(entry->vaddr_base, entry->size) != 0) {
>              perror("unmap fails");
>              exit(-1);
> @@ -189,7 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
>      entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned
> long) *
>              BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
> 
> -    ram_block_added(entry->vaddr_base, entry->size);
> +    ram_block_notify_add(entry->vaddr_base, entry->size);
>      bitmap_zero(entry->valid_mapping, nb_pfn);
>      for (i = 0; i < nb_pfn; i++) {
>          if (!err[i]) {
> @@ -399,7 +399,7 @@ static void
> xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
>      }
> 
>      pentry->next = entry->next;
> -    ram_block_removed(entry->vaddr_base, entry->size);
> +    ram_block_notify_remove(entry->vaddr_base, entry->size);
>      if (munmap(entry->vaddr_base, entry->size) != 0) {
>          perror("unmap fails");
>          exit(-1);
> 

Yes, this matches what Paolo pointed out in his reply. I'll fix that in the next
revision.

Fam

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-01-11  6:41 UTC | newest]

Thread overview: 20+ messages
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
2016-12-22  9:56   ` Paolo Bonzini
2017-01-11  5:38   ` Stefan Weil
2017-01-11  5:48     ` Stefan Weil
2017-01-11  6:41       ` Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree() Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
2016-12-21 15:46   ` Paolo Bonzini
2016-12-21 16:19     ` Fam Zheng
2016-12-21 17:02       ` Paolo Bonzini
2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
2016-12-20 16:39   ` Paolo Bonzini
2016-12-21 11:59   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-12-21 14:05     ` Fam Zheng
2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
2016-12-21  1:38   ` Fam Zheng
2016-12-21  0:48 ` no-reply
2016-12-29  4:09 ` Tian, Kevin
2016-12-30  0:46   ` Fam Zheng
