* [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
@ 2016-12-20 16:31 Fam Zheng
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
This series adds a new protocol driver that is intended to achieve about 20%
better performance than linux-aio for latency-bound workloads (i.e. synchronous
I/O) when the guest is exclusively accessing an NVMe device, by talking to the
device directly instead of going through the kernel file system layers and its
NVMe driver.
This applies on top of Stefan's block-next tree, which has the busy polling
patches; the new driver supports busy polling too.
A git branch is also available as:
https://github.com/famz/qemu nvme
See patch 4 for benchmark numbers.
Tests are done on QEMU's NVMe emulation and a real Intel P3700 SSD NVMe card.
Most dd/fio/mkfs/kernel build and OS installation tests work well, but a
weird write fault looking similar to [1] is consistently seen when installing a
RHEL 7.3 guest; this is still under investigation.
[1]: http://lists.infradead.org/pipermail/linux-nvme/2015-May/001840.html
Also, the ram notifier is not enough for hot-plugged block devices, because in
that case the notifier is installed _after_ the ram blocks are added, so it
won't get the events.
Fam Zheng (3):
util: Add a notifier list for qemu_vfree()
util: Add VFIO helper library
block: Add VFIO based NVMe driver
Paolo Bonzini (1):
ramblock-notifier: new
block/Makefile.objs | 1 +
block/nvme.c | 1026 +++++++++++++++++++++++++++++++++++++++++++
exec.c | 5 +
include/exec/memory.h | 6 +-
include/exec/ram_addr.h | 46 +-
include/exec/ramlist.h | 72 +++
include/qemu/notify.h | 1 +
include/qemu/vfio-helpers.h | 29 ++
numa.c | 29 ++
stubs/Makefile.objs | 1 +
stubs/ram.c | 10 +
util/Makefile.objs | 1 +
util/oslib-posix.c | 9 +
util/vfio-helpers.c | 713 ++++++++++++++++++++++++++++++
xen-mapcache.c | 3 +
15 files changed, 1903 insertions(+), 49 deletions(-)
create mode 100644 block/nvme.c
create mode 100644 include/exec/ramlist.h
create mode 100644 include/qemu/vfio-helpers.h
create mode 100644 stubs/ram.c
create mode 100644 util/vfio-helpers.c
--
2.9.3
* [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
From: Paolo Bonzini <pbonzini@redhat.com>
This adds a notifier interface for ram block additions and removals.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
exec.c | 5 ++++
include/exec/memory.h | 6 +----
include/exec/ram_addr.h | 46 ++-----------------------------
include/exec/ramlist.h | 72 +++++++++++++++++++++++++++++++++++++++++++++++++
numa.c | 29 ++++++++++++++++++++
stubs/Makefile.objs | 1 +
stubs/ram.c | 10 +++++++
xen-mapcache.c | 3 +++
8 files changed, 123 insertions(+), 49 deletions(-)
create mode 100644 include/exec/ramlist.h
create mode 100644 stubs/ram.c
diff --git a/exec.c b/exec.c
index 08c558e..027c301 100644
--- a/exec.c
+++ b/exec.c
@@ -1654,6 +1654,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_HUGEPAGE);
/* MADV_DONTFORK is also needed by KVM in absence of synchronous MMU */
qemu_madvise(new_block->host, new_block->max_length, QEMU_MADV_DONTFORK);
+ ram_block_notify_add(new_block->host, new_block->max_length);
}
}
@@ -1784,6 +1785,10 @@ void qemu_ram_free(RAMBlock *block)
return;
}
+ if (block->host) {
+ ram_block_notify_remove(block->host, block->max_length);
+ }
+
qemu_mutex_lock_ramlist();
QLIST_REMOVE_RCU(block, next);
ram_list.mru_block = NULL;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9728a2f..fbc0a8f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -16,16 +16,12 @@
#ifndef CONFIG_USER_ONLY
-#define DIRTY_MEMORY_VGA 0
-#define DIRTY_MEMORY_CODE 1
-#define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM 3 /* num of dirty bits */
-
#include "exec/cpu-common.h"
#ifndef CONFIG_USER_ONLY
#include "exec/hwaddr.h"
#endif
#include "exec/memattrs.h"
+#include "exec/ramlist.h"
#include "qemu/queue.h"
#include "qemu/int128.h"
#include "qemu/notify.h"
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 54d7108..3e79466 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
#ifndef CONFIG_USER_ONLY
#include "hw/xen/xen.h"
+#include "exec/ramlist.h"
struct RAMBlock {
struct rcu_head rcu;
@@ -35,6 +36,7 @@ struct RAMBlock {
char idstr[256];
/* RCU-enabled, writes protected by the ramlist lock */
QLIST_ENTRY(RAMBlock) next;
+ QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
int fd;
size_t page_size;
};
@@ -50,51 +52,7 @@ static inline void *ramblock_ptr(RAMBlock *block, ram_addr_t offset)
return (char *)block->host + offset;
}
-/* The dirty memory bitmap is split into fixed-size blocks to allow growth
- * under RCU. The bitmap for a block can be accessed as follows:
- *
- * rcu_read_lock();
- *
- * DirtyMemoryBlocks *blocks =
- * atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
- *
- * ram_addr_t idx = (addr >> TARGET_PAGE_BITS) / DIRTY_MEMORY_BLOCK_SIZE;
- * unsigned long *block = blocks.blocks[idx];
- * ...access block bitmap...
- *
- * rcu_read_unlock();
- *
- * Remember to check for the end of the block when accessing a range of
- * addresses. Move on to the next block if you reach the end.
- *
- * Organization into blocks allows dirty memory to grow (but not shrink) under
- * RCU. When adding new RAMBlocks requires the dirty memory to grow, a new
- * DirtyMemoryBlocks array is allocated with pointers to existing blocks kept
- * the same. Other threads can safely access existing blocks while dirty
- * memory is being grown. When no threads are using the old DirtyMemoryBlocks
- * anymore it is freed by RCU (but the underlying blocks stay because they are
- * pointed to from the new DirtyMemoryBlocks).
- */
-#define DIRTY_MEMORY_BLOCK_SIZE ((ram_addr_t)256 * 1024 * 8)
-typedef struct {
- struct rcu_head rcu;
- unsigned long *blocks[];
-} DirtyMemoryBlocks;
-
-typedef struct RAMList {
- QemuMutex mutex;
- RAMBlock *mru_block;
- /* RCU-enabled, writes protected by the ramlist lock. */
- QLIST_HEAD(, RAMBlock) blocks;
- DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
- uint32_t version;
-} RAMList;
-extern RAMList ram_list;
-
ram_addr_t last_ram_offset(void);
-void qemu_mutex_lock_ramlist(void);
-void qemu_mutex_unlock_ramlist(void);
-
RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
bool share, const char *mem_path,
Error **errp);
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
new file mode 100644
index 0000000..c59880d
--- /dev/null
+++ b/include/exec/ramlist.h
@@ -0,0 +1,72 @@
+#ifndef RAMLIST_H
+#define RAMLIST_H
+
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#include "qemu/rcu.h"
+
+typedef struct RAMBlockNotifier RAMBlockNotifier;
+
+#define DIRTY_MEMORY_VGA 0
+#define DIRTY_MEMORY_CODE 1
+#define DIRTY_MEMORY_MIGRATION 2
+#define DIRTY_MEMORY_NUM 3 /* num of dirty bits */
+
+/* The dirty memory bitmap is split into fixed-size blocks to allow growth
+ * under RCU. The bitmap for a block can be accessed as follows:
+ *
+ * rcu_read_lock();
+ *
+ * DirtyMemoryBlocks *blocks =
+ * atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
+ *
+ * ram_addr_t idx = (addr >> TARGET_PAGE_BITS) / DIRTY_MEMORY_BLOCK_SIZE;
+ * unsigned long *block = blocks.blocks[idx];
+ * ...access block bitmap...
+ *
+ * rcu_read_unlock();
+ *
+ * Remember to check for the end of the block when accessing a range of
+ * addresses. Move on to the next block if you reach the end.
+ *
+ * Organization into blocks allows dirty memory to grow (but not shrink) under
+ * RCU. When adding new RAMBlocks requires the dirty memory to grow, a new
+ * DirtyMemoryBlocks array is allocated with pointers to existing blocks kept
+ * the same. Other threads can safely access existing blocks while dirty
+ * memory is being grown. When no threads are using the old DirtyMemoryBlocks
+ * anymore it is freed by RCU (but the underlying blocks stay because they are
+ * pointed to from the new DirtyMemoryBlocks).
+ */
+#define DIRTY_MEMORY_BLOCK_SIZE ((ram_addr_t)256 * 1024 * 8)
+typedef struct {
+ struct rcu_head rcu;
+ unsigned long *blocks[];
+} DirtyMemoryBlocks;
+
+typedef struct RAMList {
+ QemuMutex mutex;
+ RAMBlock *mru_block;
+ /* RCU-enabled, writes protected by the ramlist lock. */
+ QLIST_HEAD(, RAMBlock) blocks;
+ DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
+ uint32_t version;
+ QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
+} RAMList;
+extern RAMList ram_list;
+
+void qemu_mutex_lock_ramlist(void);
+void qemu_mutex_unlock_ramlist(void);
+
+struct RAMBlockNotifier {
+ void (*ram_block_added)(RAMBlockNotifier *n, void *host, size_t size);
+ void (*ram_block_removed)(RAMBlockNotifier *n, void *host, size_t size);
+ QLIST_ENTRY(RAMBlockNotifier) next;
+};
+
+void ram_block_notifier_add(RAMBlockNotifier *n);
+void ram_block_notifier_remove(RAMBlockNotifier *n);
+void ram_block_notify_add(void *host, size_t size);
+void ram_block_notify_remove(void *host, size_t size);
+
+
+#endif /* RAMLIST_H */
diff --git a/numa.c b/numa.c
index 9c09e45..81f29f4 100644
--- a/numa.c
+++ b/numa.c
@@ -25,6 +25,7 @@
#include "qemu/osdep.h"
#include "sysemu/numa.h"
#include "exec/cpu-common.h"
+#include "exec/ramlist.h"
#include "qemu/bitmap.h"
#include "qom/cpu.h"
#include "qemu/error-report.h"
@@ -562,3 +563,31 @@ int numa_get_node_for_cpu(int idx)
}
return i;
}
+
+void ram_block_notifier_add(RAMBlockNotifier *n)
+{
+ QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
+}
+
+void ram_block_notifier_remove(RAMBlockNotifier *n)
+{
+ QLIST_REMOVE(n, next);
+}
+
+void ram_block_notify_add(void *host, size_t size)
+{
+ RAMBlockNotifier *notifier;
+
+ QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+ notifier->ram_block_added(notifier, host, size);
+ }
+}
+
+void ram_block_notify_remove(void *host, size_t size)
+{
+ RAMBlockNotifier *notifier;
+
+ QLIST_FOREACH(notifier, &ram_list.ramblock_notifiers, next) {
+ notifier->ram_block_removed(notifier, host, size);
+ }
+}
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 2b5bb74..1829c7f 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -27,6 +27,7 @@ stub-obj-y += mon-is-qmp.o
stub-obj-y += monitor-init.o
stub-obj-y += notify-event.o
stub-obj-y += qtest.o
+stub-obj-y += ram.o
stub-obj-y += replay.o
stub-obj-y += replay-user.o
stub-obj-y += reset.o
diff --git a/stubs/ram.c b/stubs/ram.c
new file mode 100644
index 0000000..ea5d528
--- /dev/null
+++ b/stubs/ram.c
@@ -0,0 +1,10 @@
+#include "qemu/osdep.h"
+#include "exec/ramlist.h"
+
+void ram_block_notifier_add(RAMBlockNotifier *n)
+{
+}
+
+void ram_block_notifier_remove(RAMBlockNotifier *n)
+{
+}
diff --git a/xen-mapcache.c b/xen-mapcache.c
index 8f3a592..dc9b321 100644
--- a/xen-mapcache.c
+++ b/xen-mapcache.c
@@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
err = g_malloc0(nb_pfn * sizeof (int));
if (entry->vaddr_base != NULL) {
+        ram_block_notify_remove(entry->vaddr_base, entry->size);
if (munmap(entry->vaddr_base, entry->size) != 0) {
perror("unmap fails");
exit(-1);
@@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
+    ram_block_notify_add(entry->vaddr_base, entry->size);
bitmap_zero(entry->valid_mapping, nb_pfn);
for (i = 0; i < nb_pfn; i++) {
if (!err[i]) {
@@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
}
pentry->next = entry->next;
+    ram_block_notify_remove(entry->vaddr_base, entry->size);
if (munmap(entry->vaddr_base, entry->size) != 0) {
perror("unmap fails");
exit(-1);
--
2.9.3
* [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree()
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
This allows other code to intercept buffer freeing and perform customized
unmapping operations.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
include/qemu/notify.h | 1 +
util/oslib-posix.c | 9 +++++++++
2 files changed, 10 insertions(+)
diff --git a/include/qemu/notify.h b/include/qemu/notify.h
index a3d73e4..24c4f5f 100644
--- a/include/qemu/notify.h
+++ b/include/qemu/notify.h
@@ -69,4 +69,5 @@ void notifier_with_return_remove(NotifierWithReturn *notifier);
int notifier_with_return_list_notify(NotifierWithReturnList *list,
void *data);
+void qemu_vfree_add_notifier(Notifier *n);
#endif
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..b0dd207 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -125,9 +125,18 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
return ptr;
}
+static NotifierList qemu_vfree_notifiers =
+ NOTIFIER_LIST_INITIALIZER(qemu_vfree_notifiers);
+
+void qemu_vfree_add_notifier(Notifier *n)
+{
+ notifier_list_add(&qemu_vfree_notifiers, n);
+}
+
void qemu_vfree(void *ptr)
{
trace_qemu_vfree(ptr);
+ notifier_list_notify(&qemu_vfree_notifiers, ptr);
free(ptr);
}
--
2.9.3
* [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
This is a simple helper library to be used by VFIO device drivers in
QEMU, making it easy to interface with /dev/vfio and set up DMA
mappings.
In particular, once initialized, this module proactively maps all guest ram
regions into the IO address space (in the host IOMMU context), so that on
subsequent I/O paths it is very cheap to look up the IOVAs of the SG buffers
we get from the guest driver, without doing any syscalls.
For bounce buffers allocated by QEMU, the mapping is done lazily and
transparently when the caller invokes the same DMA mapping function. This is
relatively slower but should still be efficient thanks to HBitmap. The caller
is, however, responsible for unmapping bounce buffers explicitly.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
include/qemu/vfio-helpers.h | 29 ++
util/Makefile.objs | 1 +
util/vfio-helpers.c | 713 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 743 insertions(+)
create mode 100644 include/qemu/vfio-helpers.h
create mode 100644 util/vfio-helpers.c
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
new file mode 100644
index 0000000..4096dd8
--- /dev/null
+++ b/include/qemu/vfio-helpers.h
@@ -0,0 +1,29 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ * Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VFIO_H
+#define QEMU_VFIO_H
+#include "qemu/queue.h"
+
+typedef struct QEMUVFIOState QEMUVFIOState;
+
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
+void qemu_vfio_close(QEMUVFIOState *s);
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+ bool contiguous, uint64_t *iova_list);
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp);
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar);
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+ int irq_type, Error **errp);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index ad0f9c7..16c714a 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -36,3 +36,4 @@ util-obj-y += log.o
util-obj-y += qdist.o
util-obj-y += qht.o
util-obj-y += range.o
+util-obj-y += vfio-helpers.o
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
new file mode 100644
index 0000000..55d35e7
--- /dev/null
+++ b/util/vfio-helpers.c
@@ -0,0 +1,713 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ * Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "trace.h"
+#include "qemu/queue.h"
+#include "qemu/vfio-helpers.h"
+#include "qemu/error-report.h"
+#include "standard-headers/linux/pci_regs.h"
+#include "qemu/event_notifier.h"
+#include "qemu/hbitmap.h"
+
+#define VFIO_DEBUG 0
+
+#define DPRINTF(...) do { \
+    if (VFIO_DEBUG) { \
+        printf(__VA_ARGS__); \
+    } } while (0)
+
+/* XXX: Once VFIO exposes the iova bit width in the IOMMU capability interface,
+ * we can use a runtime limit. It's also possible to do platform specific
+ * detection by reading sysfs entries. Until then, 39 is a safe bet. */
+#define QEMU_VFIO_IOVA_MAX (1ULL << 39)
+
+/* DMA address space is managed in chunks. */
+#define QEMU_VFIO_CHUNK_SIZE (2ULL << 20)
+
+#define QEMU_VFIO_ALLOC_BITMAP_SIZE (QEMU_VFIO_IOVA_MAX / QEMU_VFIO_CHUNK_SIZE)
+
+typedef struct IOVARange {
+ uint64_t iova;
+ uint64_t nr_pages;
+ QSIMPLEQ_ENTRY(IOVARange) next;
+} IOVARange;
+
+typedef struct {
+ /* Page aligned. */
+ void *host;
+ size_t size;
+ /* The IOVA range list to which the [host, host + size) area is mapped. */
+ QSIMPLEQ_HEAD(, IOVARange) iova_list;
+} IOVAMapping;
+
+struct QEMUVFIOState {
+ int container;
+ int group;
+ int device;
+ RAMBlockNotifier ram_notifier;
+ struct vfio_region_info config_region_info, bar_region_info[6];
+
+    /* Allocation bitmap of IOVA address space; each bit represents
+     * QEMU_VFIO_CHUNK_SIZE bytes. Set bits mean free, unset means
+     * used/reserved. */
+ HBitmap *free_chunks;
+ IOVAMapping *mappings;
+ int nr_mappings;
+};
+
+static int sysfs_find_group_file(const char *device, char **path, Error **errp)
+{
+ int ret;
+ char *sysfs_link = NULL;
+ char *sysfs_group = NULL;
+ char *p;
+
+ sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group",
+ device);
+ sysfs_group = g_malloc(PATH_MAX);
+ ret = readlink(sysfs_link, sysfs_group, PATH_MAX - 1);
+ if (ret == -1) {
+ error_setg_errno(errp, errno, "Failed to find iommu group sysfs path");
+ ret = -errno;
+ goto out;
+ }
+ ret = 0;
+ p = strrchr(sysfs_group, '/');
+ if (!p) {
+ error_setg(errp, "Failed to find iommu group number");
+        ret = -EINVAL;
+ goto out;
+ }
+
+ *path = g_strdup_printf("/dev/vfio/%s", p + 1);
+out:
+ g_free(sysfs_link);
+ g_free(sysfs_group);
+ return ret;
+}
+
+static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, unsigned int index,
+ Error **errp)
+{
+ assert(index < ARRAY_SIZE(s->bar_region_info));
+ s->bar_region_info[index] = (struct vfio_region_info) {
+ .index = VFIO_PCI_BAR0_REGION_INDEX + index,
+ .argsz = sizeof(struct vfio_region_info),
+ };
+ if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) {
+ error_setg_errno(errp, errno, "Failed to get BAR region info");
+ return -errno;
+ }
+
+ return 0;
+}
+
+/**
+ * Map a PCI bar area.
+ */
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp)
+{
+ void *p;
+ assert(index >= 0 && index < 6);
+ p = mmap(NULL, MIN(8192, s->bar_region_info[index].size),
+ PROT_READ | PROT_WRITE, MAP_SHARED,
+ s->device, s->bar_region_info[index].offset);
+ if (p == MAP_FAILED) {
+ error_setg_errno(errp, errno, "Failed to map BAR region");
+ p = NULL;
+ }
+ return p;
+}
+
+/**
+ * Unmap a PCI bar area.
+ */
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar)
+{
+ if (bar) {
+ munmap(bar, MIN(8192, s->bar_region_info[index].size));
+ }
+}
+
+/**
+ * Initialize device IRQ with @irq_type and register an event notifier.
+ */
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+ int irq_type, Error **errp)
+{
+ int r;
+ struct vfio_irq_set *irq_set;
+ size_t irq_set_size;
+ struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
+
+ irq_info.index = irq_type;
+ if (ioctl(s->device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info)) {
+ error_setg_errno(errp, errno, "Failed to get device interrupt info");
+ return -errno;
+ }
+ if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+ error_setg(errp, "Device interrupt doesn't support eventfd");
+ return -EINVAL;
+ }
+
+ irq_set_size = sizeof(*irq_set) + sizeof(int);
+ irq_set = g_malloc0(irq_set_size);
+
+ /* Get to a known IRQ state */
+ *irq_set = (struct vfio_irq_set) {
+ .argsz = irq_set_size,
+ .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+ .index = irq_info.index,
+ .start = 0,
+ .count = 1,
+ };
+
+ *(int *)&irq_set->data = event_notifier_get_fd(e);
+ r = ioctl(s->device, VFIO_DEVICE_SET_IRQS, irq_set);
+ g_free(irq_set);
+ if (r) {
+ error_setg_errno(errp, errno, "Failed to setup device interrupt");
+ return -errno;
+ }
+ return 0;
+}
+
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf, int size, int ofs)
+{
+ if (pread(s->device, buf, size, s->config_region_info.offset + ofs) == size) {
+ return 0;
+ }
+ return -1;
+}
+
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int ofs)
+{
+ if (pwrite(s->device, buf, size,
+ s->config_region_info.offset + ofs) == size) {
+ return 0;
+ }
+
+ return -1;
+}
+
+static int qemu_vfio_init_pci(QEMUVFIOState *s, const char *device,
+ Error **errp)
+{
+ int ret;
+ int i;
+ uint16_t pci_cmd;
+ struct vfio_group_status group_status =
+ { .argsz = sizeof(group_status) };
+ struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
+ struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+ char *group_file = NULL;
+
+ /* Create a new container */
+ s->container = open("/dev/vfio/vfio", O_RDWR);
+
+ if (ioctl(s->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+ error_setg(errp, "Invalid VFIO version");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (!ioctl(s->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+ error_setg_errno(errp, errno, "VFIO IOMMU check failed");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Open the group */
+ ret = sysfs_find_group_file(device, &group_file, errp);
+ if (ret) {
+ goto out;
+ }
+
+ s->group = open(group_file, O_RDWR);
+ g_free(group_file);
+ if (s->group <= 0) {
+ error_setg_errno(errp, errno, "Failed to open VFIO group file");
+ ret = -errno;
+ goto out;
+ }
+
+ /* Test the group is viable and available */
+ if (ioctl(s->group, VFIO_GROUP_GET_STATUS, &group_status)) {
+ error_setg_errno(errp, errno, "Failed to get VFIO group status");
+ ret = -errno;
+ goto out;
+ }
+
+ if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+ error_setg(errp, "VFIO group is not viable");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Add the group to the container */
+ if (ioctl(s->group, VFIO_GROUP_SET_CONTAINER, &s->container)) {
+ error_setg_errno(errp, errno, "Failed to add group to VFIO container");
+ ret = -errno;
+ goto out;
+ }
+
+ /* Enable the IOMMU model we want */
+ if (ioctl(s->container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+ error_setg_errno(errp, errno, "Failed to set VFIO IOMMU type");
+ ret = -errno;
+ goto out;
+ }
+
+ /* Get additional IOMMU info */
+ if (ioctl(s->container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
+ error_setg_errno(errp, errno, "Failed to get IOMMU info");
+ ret = -errno;
+ goto out;
+ }
+
+ s->device = ioctl(s->group, VFIO_GROUP_GET_DEVICE_FD, device);
+
+ if (s->device < 0) {
+ error_setg_errno(errp, errno, "Failed to get device fd");
+ ret = -errno;
+ goto out;
+ }
+
+ /* Test and setup the device */
+ if (ioctl(s->device, VFIO_DEVICE_GET_INFO, &device_info)) {
+ error_setg_errno(errp, errno, "Failed to get device info");
+ ret = -errno;
+ goto out;
+ }
+
+ if (device_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX) {
+ error_setg(errp, "Invalid device regions");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ s->config_region_info = (struct vfio_region_info) {
+ .index = VFIO_PCI_CONFIG_REGION_INDEX,
+ .argsz = sizeof(struct vfio_region_info),
+ };
+ if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->config_region_info)) {
+ error_setg_errno(errp, errno, "Failed to get config region info");
+ ret = -errno;
+ goto out;
+ }
+
+ for (i = 0; i < 6; i++) {
+ ret = qemu_vfio_pci_init_bar(s, i, errp);
+ if (ret) {
+ goto out;
+ }
+ }
+
+ /* Enable bus master */
+ if (qemu_vfio_pci_read_config(s, &pci_cmd, sizeof(pci_cmd),
+ PCI_COMMAND) < 0) {
+ goto out;
+ }
+ pci_cmd |= PCI_COMMAND_MASTER;
+ if (qemu_vfio_pci_write_config(s, &pci_cmd, sizeof(pci_cmd),
+ PCI_COMMAND) < 0) {
+ goto out;
+ }
+out:
+ return ret;
+}
+
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host, size_t size)
+{
+ QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+ /* XXX: handle existing ram regions? */
+ DPRINTF("ram block added %p %lx\n", host, size);
+ qemu_vfio_dma_map(s, host, size, true, NULL);
+}
+
+static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host, size_t size)
+{
+ QEMUVFIOState *s = container_of(n, QEMUVFIOState, ram_notifier);
+ if (host) {
+ DPRINTF("ram block removed %p %lx\n", host, size);
+ qemu_vfio_dma_unmap(s, host);
+ }
+}
+
+/**
+ * Open a PCI device, e.g. "0000:00:01.0".
+ */
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp)
+{
+ int r;
+ QEMUVFIOState *s = g_new0(QEMUVFIOState, 1);
+
+ r = qemu_vfio_init_pci(s, device, errp);
+ if (r) {
+ g_free(s);
+ return NULL;
+ }
+
+ s->ram_notifier.ram_block_added = qemu_vfio_ram_block_added;
+ s->ram_notifier.ram_block_removed = qemu_vfio_ram_block_removed;
+ ram_block_notifier_add(&s->ram_notifier);
+ s->free_chunks = hbitmap_alloc(QEMU_VFIO_ALLOC_BITMAP_SIZE, 0);
+ hbitmap_set(s->free_chunks, 1, QEMU_VFIO_ALLOC_BITMAP_SIZE - 1);
+
+ return s;
+}
+
+static void qemu_vfio_dump_mapping(IOVAMapping *m)
+{
+ DPRINTF(" vfio mapping %p %lx\n", m->host, m->size);
+ IOVARange *r;
+
+ QSIMPLEQ_FOREACH(r, &m->iova_list, next) {
+ DPRINTF(" IOVA %lx len %lx\n", r->iova, r->nr_pages * getpagesize());
+ }
+}
+
+static void qemu_vfio_dump_mappings(QEMUVFIOState *s)
+{
+ int i;
+ DPRINTF("vfio mappings\n");
+ for (i = 0; i < s->nr_mappings; ++i) {
+ qemu_vfio_dump_mapping(&s->mappings[i]);
+ }
+}
+
+/**
+ * Find the mapping entry that contains [host, host + size) and set @index to
+ * the position. If no entry contains it, @index is the position _after_ which
+ * to insert the new mapping. IOW, it is the index of the largest element that
+ * is smaller than @host, or -1 if there is no such entry. */
+static IOVAMapping *qemu_vfio_find_mapping(QEMUVFIOState *s,
+ void *host, size_t size,
+ int *index)
+{
+ IOVAMapping *p = s->mappings;
+ IOVAMapping *q = p ? p + s->nr_mappings - 1 : NULL;
+ IOVAMapping *mid = p ? p + (q - p) / 2 : NULL;
+ DPRINTF("qemu vfio find mapping %p %lx... ", host, size);
+ if (!p) {
+ *index = -1;
+ return NULL;
+ }
+ while (true) {
+ mid = p + (q - p) / 2;
+ if (mid == p) {
+ break;
+ }
+ if (mid->host > host) {
+ q = mid;
+ } else if (mid->host < host) {
+ p = mid;
+ } else {
+ break;
+ }
+ }
+ if (mid->host > host) {
+ mid--;
+ } else if (mid < &s->mappings[s->nr_mappings - 1]
+ && (mid + 1)->host <= host) {
+ mid++;
+ }
+ *index = mid - &s->mappings[0];
+ if (mid >= &s->mappings[0] &&
+ mid->host <= host && mid->host + mid->size > host) {
+ assert(mid < &s->mappings[s->nr_mappings]);
+ assert(mid->host + mid->size >= host + size);
+ DPRINTF("found, index %d\n", *index);
+ return mid;
+ }
+ DPRINTF("not found, index %d\n", *index);
+ return NULL;
+}
+
+/**
+ * Allocate IOVA space, create a new mapping record, and insert it in @s. If
+ * contiguous is true, the mapped IOVA must be contiguous, otherwise
+ * segmentation is allowed to make the allocation more likely to succeed.
+ */
+static IOVAMapping *qemu_vfio_new_mapping(QEMUVFIOState *s,
+ void *host, size_t size,
+ int index,
+ bool contiguous)
+{
+ int i;
+ int shift;
+ const size_t pages_per_chunk = QEMU_VFIO_CHUNK_SIZE / getpagesize();
+ size_t pages = DIV_ROUND_UP(size, getpagesize());
+ size_t chunks = DIV_ROUND_UP(pages, pages_per_chunk);
+ int64_t next;
+ IOVAMapping m = {.host = host, .size = size};
+ IOVAMapping *insert;
+ HBitmapIter iter;
+ IOVARange *r;
+
+ if (DIV_ROUND_UP(pages, pages_per_chunk) > hbitmap_count(s->free_chunks)) {
+ return NULL;
+ }
+ QSIMPLEQ_INIT(&m.iova_list);
+
+ r = NULL;
+ hbitmap_iter_init(&iter, s->free_chunks, 1);
+ if (contiguous) {
+ while (true) {
+ bool satisfy = true;
+ next = hbitmap_iter_next(&iter);
+ if (next < 0) {
+ return NULL;
+ }
+ for (i = 1; i < chunks; i++) {
+ if (!hbitmap_get(s->free_chunks, next + i)) {
+ satisfy = false;
+ break;
+ }
+ }
+ if (satisfy) {
+ break;
+ }
+ }
+ hbitmap_reset(s->free_chunks, next, chunks);
+ r = g_new(IOVARange, 1);
+ r->iova = next * pages_per_chunk * getpagesize();
+ r->nr_pages = pages;
+ QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
+ } else {
+ next = hbitmap_iter_next(&iter);
+ while (pages) {
+ uint64_t chunk;
+ if (next < 0) {
+ hbitmap_iter_init(&iter, s->free_chunks, 1);
+ next = hbitmap_iter_next(&iter);
+ }
+ assert(next >= 0);
+ chunk = next;
+ DPRINTF("using chunk %ld\n", chunk);
+ next = hbitmap_iter_next(&iter);
+ hbitmap_reset(s->free_chunks, chunk, 1);
+ if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
+ r->nr_pages += MIN(pages, pages_per_chunk);
+ } else {
+ r = g_new(IOVARange, 1);
+ r->iova = chunk * pages_per_chunk * getpagesize();
+ r->nr_pages = MIN(pages, pages_per_chunk);
+ QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
+ }
+ pages -= MIN(pages, pages_per_chunk);
+ }
+ }
+
+ assert(index >= 0);
+ s->nr_mappings++;
+ s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+ s->nr_mappings);
+ insert = &s->mappings[index];
+ shift = s->nr_mappings - index - 1;
+ DPRINTF("inserting to %d shift %d\n", index, shift);
+ if (shift) {
+ memmove(insert + 1, insert, shift * sizeof(s->mappings[0]));
+ }
+ *insert = m;
+ return insert;
+}
+
+/* Remove the mapping record from @s and undo the IOVA mapping with VFIO. */
+static void qemu_vfio_free_mapping(QEMUVFIOState *s, IOVAMapping *mapping,
+ bool can_fail)
+{
+ IOVARange *r, *tmp;
+
+ QSIMPLEQ_FOREACH_SAFE(r, &mapping->iova_list, next, tmp) {
+ size_t size = r->nr_pages * getpagesize();
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .flags = 0,
+ .iova = r->iova,
+ .size = size,
+ };
+ if (ioctl(s->container, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+ if (!can_fail) {
+ error_report("VFIO_UNMAP_DMA: %d", -errno);
+ }
+ }
+ hbitmap_set(s->free_chunks, r->iova / QEMU_VFIO_CHUNK_SIZE,
+ DIV_ROUND_UP(size, QEMU_VFIO_CHUNK_SIZE));
+ g_free(r);
+ }
+}
+
+/* Do the DMA mapping with VFIO. */
+static int qemu_vfio_do_mapping(QEMUVFIOState *s, IOVAMapping *mapping)
+{
+ IOVARange *r;
+ uint8_t *p = mapping->host;
+
+ QSIMPLEQ_FOREACH(r, &mapping->iova_list, next) {
+ struct vfio_iommu_type1_dma_map dma_map = {
+ .argsz = sizeof(dma_map),
+ .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+ .iova = r->iova,
+ .vaddr = (uintptr_t)p,
+ .size = r->nr_pages * getpagesize(),
+ };
+ DPRINTF("vfio map %p pages %ld to %lx\n", p, r->nr_pages, r->iova);
+
+ if (ioctl(s->container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+ error_report("VFIO_MAP_DMA: %d", -errno);
+ return -errno;
+ }
+ p += r->nr_pages * getpagesize();
+ }
+ return 0;
+}
+
+/* Check that the mapping list is sorted in ascending order and non-overlapping. */
+static bool qemu_vfio_verify_mappings(QEMUVFIOState *s)
+{
+ int i;
+ for (i = 0; i < s->nr_mappings - 1; ++i) {
+ if (!(s->mappings[i].host < s->mappings[i + 1].host)) {
+ fprintf(stderr, "item %d not sorted!\n", i);
+ qemu_vfio_dump_mappings(s);
+ return false;
+ }
+ if (!(s->mappings[i].host + s->mappings[i].size <=
+ s->mappings[i + 1].host)) {
+ fprintf(stderr, "item %d overlap with next!\n", i);
+ qemu_vfio_dump_mappings(s);
+ return false;
+ }
+ }
+ return true;
+}
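As an editorial aside (not code from the patch), the invariant this function enforces — mappings sorted by host address with no overlap — can be modeled standalone with simplified types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Standalone model of the qemu_vfio_verify_mappings() invariant:
 * entries strictly ascending by host address and non-overlapping. */
typedef struct {
    const unsigned char *host;
    size_t size;
} Mapping;

static bool mappings_are_valid(const Mapping *m, int n)
{
    for (int i = 0; i < n - 1; i++) {
        if (!(m[i].host < m[i + 1].host)) {
            return false;   /* not strictly ascending */
        }
        if (!(m[i].host + m[i].size <= m[i + 1].host)) {
            return false;   /* overlaps the next mapping */
        }
    }
    return true;
}
```

The same two comparisons appear in the patch; keeping the array sorted is what lets qemu_vfio_find_mapping() use a binary search.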
+
+/* Map [host, host + size) area, and optionally store the IOVA in iova_list
+ * buffer. The area must be aligned to page size, and must not overlap or
+ * cross the boundary of an existing mapping.
+ *
+ * If the area is already mapped, this function can be used to retrieve the
+ * IOVA by passing non-NULL iova_list: for contiguous=true, only one entry is
+ * stored to @iova_list; otherwise, up to (size / page_size) entries will be
+ * written. The caller must ensure the passed buffer is big enough.
+ *
+ * If the area is not mapped, this will try to allocate a sequence of free
+ * IOVAs (contiguous can be enforced by passing true), and report them to
+ * iova_list.
+ */
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+ bool contiguous, uint64_t *iova_list)
+{
+ int ret;
+ int index;
+ IOVAMapping *mapping;
+
+ DPRINTF("vfio dma map %p %lx contiguous %d\n", host, size, contiguous);
+ assert(QEMU_PTR_IS_ALIGNED(host, getpagesize()));
+ assert(QEMU_IS_ALIGNED(size, getpagesize()));
+ mapping = qemu_vfio_find_mapping(s, host, size, &index);
+ if (!mapping) {
+ mapping = qemu_vfio_new_mapping(s, host, size, index + 1, contiguous);
+ if (!mapping) {
+ return -ENOMEM;
+ }
+ assert(qemu_vfio_verify_mappings(s));
+ ret = qemu_vfio_do_mapping(s, mapping);
+ if (ret) {
+ qemu_vfio_free_mapping(s, mapping, true);
+ return ret;
+ }
+ qemu_vfio_dump_mappings(s);
+ } else {
+ /* XXX: check contiguous for existing mapping? */
+ }
+ if (!mapping) {
+ return -ENOMEM;
+ }
+ if (iova_list) {
+ uint64_t *p = iova_list;
+ uint64_t skip_pages = (host - mapping->host) / getpagesize();
+ IOVARange *r;
+ for (r = QSIMPLEQ_FIRST(&mapping->iova_list);
+ skip_pages > r->nr_pages;
+ r = QSIMPLEQ_NEXT(r, next)) {
+ skip_pages -= r->nr_pages;
+ }
+ while (size) {
+ uint64_t i;
+ assert(r);
+ for (i = skip_pages; size && i < r->nr_pages; ++i) {
+ *p = r->iova + i * getpagesize();
+ assert(*p >= r->iova);
+ assert(*p < r->iova + r->nr_pages * getpagesize());
+ p++;
+ size -= getpagesize();
+ if (contiguous) {
+ goto out;
+ }
+ }
+ r = QSIMPLEQ_NEXT(r, next);
+ skip_pages = 0;
+ }
+ }
+out:
+ return 0;
+}
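The IOVA-reporting walk above (skip the pages before @host, then emit one IOVA per page across the range list) is easy to get wrong at range boundaries. A standalone model — an editorial sketch using a plain array instead of a QSIMPLEQ, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE 4096ULL

/* Model of the iova_list reporting loop in qemu_vfio_dma_map(). */
typedef struct {
    uint64_t iova;
    uint64_t nr_pages;
} Range;

static size_t collect_iovas(const Range *r, size_t nranges,
                            uint64_t skip_pages, size_t size,
                            uint64_t *out)
{
    size_t n = 0, ri = 0;

    while (size) {
        assert(ri < nranges);
        /* Skip whole ranges that lie before the requested area. */
        while (skip_pages >= r[ri].nr_pages) {
            skip_pages -= r[ri].nr_pages;
            ri++;
            assert(ri < nranges);
        }
        for (uint64_t i = skip_pages; size && i < r[ri].nr_pages; i++) {
            out[n++] = r[ri].iova + i * PAGE;
            size -= PAGE;
        }
        ri++;
        skip_pages = 0;   /* only the first range is entered mid-way */
    }
    return n;
}
```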
+
+/* Unmap the whole area that was previously mapped with
+ * qemu_vfio_dma_map(). */
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host)
+{
+ int index = 0;
+ IOVAMapping *m;
+
+ if (!host) {
+ return;
+ }
+
+ DPRINTF("vfio unmap %p\n", host);
+ m = qemu_vfio_find_mapping(s, host, 4096, &index);
+ if (!m) {
+ return;
+ }
+ qemu_vfio_free_mapping(s, m, false);
+ assert(s->nr_mappings);
+ memmove(&s->mappings[index], &s->mappings[index + 1],
+ sizeof(s->mappings[0]) * (s->nr_mappings - index - 1));
+ s->nr_mappings--;
+ s->mappings = g_realloc_n(s->mappings, sizeof(s->mappings[0]),
+ s->nr_mappings);
+}
+
+/* Close and free the VFIO resources. */
+void qemu_vfio_close(QEMUVFIOState *s)
+{
+ int i;
+
+ if (!s) {
+ return;
+ }
+ for (i = 0; i < s->nr_mappings; ++i) {
+ qemu_vfio_free_mapping(s, &s->mappings[i], false);
+ }
+ hbitmap_free(s->free_chunks);
+ ram_block_notifier_remove(&s->ram_notifier);
+ close(s->device);
+ close(s->group);
+ close(s->container);
+}
--
2.9.3
* [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
` (2 preceding siblings ...)
2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
@ 2016-12-20 16:31 ` Fam Zheng
2016-12-20 16:39 ` Paolo Bonzini
2016-12-21 11:59 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
` (2 subsequent siblings)
6 siblings, 2 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-20 16:31 UTC (permalink / raw)
To: qemu-devel
Cc: Paolo Bonzini, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio.
nvme:// linux-aio
------------------------------------------------------------------
fio bs=4k iodepth=1 (IOPS) 30014 24009
fio bs=4k iodepth=1 (IOPS) +polling 45148 34689
fio bs=64k iodepth=1 (IOPS) 17801 14954
fio bs=64k iodepth=1 (IOPS) +polling 17970 14564
fio bs=4k iodepth=32 (IOPS) 110637 121480
fio bs=4k iodepth=32 (IOPS) +polling 145533 166878
fio bs=64k iodepth=32 (IOPS) 50525 50596
fio bs=64k iodepth=32 (IOPS) +polling 50482 50534
(host) qemu-img bench -c 8000000 (sec) 15.00 43.13
("+polling" means poll-max-ns=18000, a rule-of-thumb value for the NVMe
device under test; the parameter otherwise defaults to 0.)
For the rows where linux-aio is faster, the evidence suggests that the io
queue is more likely to stay full when request completions arrive sooner, as
in the nvme:// case, leaving less opportunity for batch submission and
request merging than linux-aio gets.
This patch is co-authored by Paolo and me.
Signed-off-by: Fam Zheng <famz@redhat.com>
---
block/Makefile.objs | 1 +
block/nvme.c | 1026 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 1027 insertions(+)
create mode 100644 block/nvme.c
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 67a036a..fe975a6 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
block-obj-y += null.o mirror.o commit.o io.o
block-obj-y += throttle-groups.o
+block-obj-y += nvme.o
block-obj-y += nbd.o nbd-client.o sheepdog.o
block-obj-$(CONFIG_LIBISCSI) += iscsi.o
diff --git a/block/nvme.c b/block/nvme.c
new file mode 100644
index 0000000..5c9ffde
--- /dev/null
+++ b/block/nvme.c
@@ -0,0 +1,1026 @@
+/*
+ * NVMe block driver based on vfio
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ * Fam Zheng <famz@redhat.com>
+ * Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "block/block_int.h"
+#include "qemu/error-report.h"
+#include "qemu/vfio-helpers.h"
+
+#define NVME_DEBUG 0
+
+#define DPRINTF(...) \
+ if (NVME_DEBUG) { \
+ printf(__VA_ARGS__); \
+ }
+
+#define NVME_SQ_ENTRY_BYTES 64
+#define NVME_CQ_ENTRY_BYTES 16
+#define NVME_QUEUE_SIZE 128
+
+/* TODO: share definitions with hw/block/nvme.c? */
+enum NvmeAdminCommands {
+ NVME_ADM_CMD_DELETE_SQ = 0x00,
+ NVME_ADM_CMD_CREATE_SQ = 0x01,
+ NVME_ADM_CMD_GET_LOG_PAGE = 0x02,
+ NVME_ADM_CMD_DELETE_CQ = 0x04,
+ NVME_ADM_CMD_CREATE_CQ = 0x05,
+ NVME_ADM_CMD_IDENTIFY = 0x06,
+ NVME_ADM_CMD_ABORT = 0x08,
+ NVME_ADM_CMD_SET_FEATURES = 0x09,
+ NVME_ADM_CMD_GET_FEATURES = 0x0a,
+ NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c,
+ NVME_ADM_CMD_ACTIVATE_FW = 0x10,
+ NVME_ADM_CMD_DOWNLOAD_FW = 0x11,
+ NVME_ADM_CMD_FORMAT_NVM = 0x80,
+ NVME_ADM_CMD_SECURITY_SEND = 0x81,
+ NVME_ADM_CMD_SECURITY_RECV = 0x82,
+};
+
+enum NvmeIoCommands {
+ NVME_CMD_FLUSH = 0x00,
+ NVME_CMD_WRITE = 0x01,
+ NVME_CMD_READ = 0x02,
+ NVME_CMD_WRITE_UNCOR = 0x04,
+ NVME_CMD_COMPARE = 0x05,
+ NVME_CMD_DSM = 0x09,
+};
+
+typedef struct {
+ uint8_t opcode;
+ uint8_t flags;
+ uint16_t cid;
+ uint32_t nsid;
+ uint64_t reserved;
+ uint64_t mptr;
+ uint64_t prp1;
+ uint64_t prp2;
+ uint32_t cdw10;
+ uint32_t cdw11;
+ uint32_t cdw12;
+ uint32_t cdw13;
+ uint32_t cdw14;
+ uint32_t cdw15;
+} QEMU_PACKED NVMeCommand;
+
+typedef struct {
+ uint32_t cmd_specific;
+ uint32_t reserved;
+ uint16_t sq_head;
+ uint16_t sqid;
+ uint16_t cid;
+ uint16_t status;
+} QEMU_PACKED NVMeCompletion;
+
+typedef struct {
+ int32_t head, tail;
+ uint8_t *queue;
+ uint64_t iova;
+ volatile uint32_t *doorbell;
+} NVMeQueue;
+
+typedef struct {
+ BlockCompletionFunc *cb;
+ void *opaque;
+ int cid;
+ void *prp_list_page;
+ uint64_t prp_list_iova;
+} NVMeRequest;
+
+typedef struct {
+ int index;
+ NVMeQueue sq, cq;
+ int cq_phase;
+ uint8_t *prp_list_pages;
+ uint64_t prp_list_base_iova;
+ NVMeRequest reqs[NVME_QUEUE_SIZE];
+ CoQueue wait_queue;
+ bool busy;
+ int need_kick;
+ int inflight;
+} NVMeQueuePair;
+
+typedef volatile struct {
+ uint64_t cap;
+ uint32_t vs;
+ uint32_t intms;
+ uint32_t intmc;
+ uint32_t cc;
+ uint32_t reserved0;
+ uint32_t csts;
+ uint32_t nssr;
+ uint32_t aqa;
+ uint64_t asq;
+ uint64_t acq;
+ uint32_t cmbloc;
+ uint32_t cmbsz;
+ uint8_t reserved1[0xec0];
+ uint8_t cmd_set_specific[0x100];
+ uint32_t doorbells[];
+} QEMU_PACKED NVMeRegs;
+
+QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
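The build-time assertion above pins the doorbell array to byte offset 0x1000 from BAR0, as required by the NVMe register map. As an editorial aside, the same check can be reproduced standalone (field names mirror the patch; QEMU_PACKED becomes a plain GCC attribute):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standalone copy of the NVMe controller register layout used by the
 * driver; the doorbell registers must begin at offset 0x1000. */
struct nvme_regs {
    uint64_t cap;
    uint32_t vs;
    uint32_t intms;
    uint32_t intmc;
    uint32_t cc;
    uint32_t reserved0;
    uint32_t csts;
    uint32_t nssr;
    uint32_t aqa;
    uint64_t asq;
    uint64_t acq;
    uint32_t cmbloc;
    uint32_t cmbsz;
    uint8_t  reserved1[0xec0];
    uint8_t  cmd_set_specific[0x100];
    uint32_t doorbells[];
} __attribute__((packed));
```

The fixed fields occupy 0x40 bytes, the reserved hole 0xec0, and the command-set-specific area 0x100, which together land doorbells[] exactly at 0x1000.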
+
+typedef struct {
+ QEMUVFIOState *vfio;
+ NVMeRegs *regs;
+ /* The submission/completion queue pairs.
+ * [0]: admin queue.
+ * [1..]: io queues.
+ */
+ NVMeQueuePair **queues;
+ int nr_queues;
+ size_t page_size;
+ /* How many uint32_t elements does each doorbell entry take. */
+ size_t doorbell_scale;
+ bool write_cache;
+ EventNotifier event_notifier;
+ uint64_t nsze; /* Namespace size reported by identify command */
+ int nsid; /* The namespace id to read/write data. */
+ uint64_t max_transfer;
+ int plugged;
+ Notifier vfree_notify;
+} BDRVNVMeState;
+
+#define NVME_BLOCK_OPT_DEVICE "device"
+#define NVME_BLOCK_OPT_NAMESPACE "namespace"
+
+static QemuOptsList runtime_opts = {
+ .name = "nvme",
+ .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+ .desc = {
+ {
+ .name = NVME_BLOCK_OPT_DEVICE,
+ .type = QEMU_OPT_STRING,
+ .help = "NVMe PCI device address",
+ },
+ {
+ .name = NVME_BLOCK_OPT_NAMESPACE,
+ .type = QEMU_OPT_NUMBER,
+ .help = "NVMe namespace",
+ },
+ { /* end of list */ }
+ },
+};
+
+static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
+ int nentries, int entry_bytes, Error **errp)
+{
+ BDRVNVMeState *s = bs->opaque;
+ size_t bytes;
+ int r;
+
+ bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
+ q->head = q->tail = 0;
+ q->queue = qemu_try_blockalign0(bs, bytes);
+
+ if (!q->queue) {
+ error_setg(errp, "Cannot allocate queue");
+ return;
+ }
+ r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, true, &q->iova);
+ if (r) {
+ error_setg(errp, "Cannot map queue");
+ }
+}
+
+static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
+{
+ qemu_vfree(q->prp_list_pages);
+ qemu_vfree(q->sq.queue);
+ qemu_vfree(q->cq.queue);
+ g_free(q);
+}
+
+static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
+ int idx, int size,
+ Error **errp)
+{
+ int i, r;
+ BDRVNVMeState *s = bs->opaque;
+ Error *local_err = NULL;
+ NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
+ uint64_t prp_list_iovas[NVME_QUEUE_SIZE];
+
+ q->index = idx;
+ qemu_co_queue_init(&q->wait_queue);
+ q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
+ r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
+ s->page_size * NVME_QUEUE_SIZE,
+ false, prp_list_iovas);
+ if (r) {
+ goto fail;
+ }
+ for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+ NVMeRequest *req = &q->reqs[i];
+ req->cid = i + 1;
+ req->prp_list_page = q->prp_list_pages + i * s->page_size;
+ req->prp_list_iova = prp_list_iovas[i];
+ }
+ nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
+ if (local_err) {
+ error_propagate(errp, local_err);
+ goto fail;
+ }
+ q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
+
+ nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
+ if (local_err) {
+ error_propagate(errp, local_err);
+ goto fail;
+ }
+ q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
+
+ return q;
+fail:
+ nvme_free_queue_pair(bs, q);
+ return NULL;
+}
+
+static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+ if (s->plugged || !q->need_kick) {
+ return;
+ }
+ DPRINTF("nvme kick queue %d\n", q->index);
+ assert(!(q->sq.tail & 0xFF00));
+ /* Fence the write to submission queue entry before notifying the device. */
+ smp_wmb();
+ *q->sq.doorbell = cpu_to_le32(q->sq.tail);
+ q->inflight += q->need_kick;
+ q->need_kick = 0;
+}
+
+static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
+{
+ int i;
+ if (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
+ /* We have to leave one slot empty as that is the full-queue condition
+ * (tail + 1 == head). */
+ return NULL;
+ }
+ for (i = 0; i < NVME_QUEUE_SIZE; i++) {
+ if (!q->reqs[i].cb) {
+ return &q->reqs[i];
+ }
+ }
+ return NULL;
+}
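The admission check above caps submitted-plus-unsubmitted requests at NVME_QUEUE_SIZE - 2, keeping the ring from ever reaching the ambiguous full state. A standalone model of that bound (an editorial sketch, not code from the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define QSIZE 128   /* mirrors NVME_QUEUE_SIZE */

/* A ring buffer can only distinguish "full" from "empty" if at least one
 * slot stays unused, hence the QSIZE - 2 bound the patch applies to
 * in-flight plus not-yet-kicked requests. */
static bool can_submit(int inflight, int need_kick)
{
    return inflight + need_kick <= QSIZE - 2;
}
```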
+
+static inline int nvme_translate_error(const NVMeCompletion *c)
+{
+ if ((le16_to_cpu(c->status) >> 1) & 0xFF) {
+ DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",
+ c->cmd_specific, c->sq_head, c->sqid, c->cid, c->status);
+ }
+ switch ((le16_to_cpu(c->status) >> 1) & 0xFF) {
+ case 0:
+ return 0;
+ case 1:
+ return -ENOSYS;
+ case 2:
+ return -EINVAL;
+ default:
+ return -EIO;
+ }
+}
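The translation above keys off bits 8:1 of the completion Status field, which carry the NVMe status code (bit 0 is the Phase Tag). A standalone model on a plain host-endian status word, as an editorial illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Model of nvme_translate_error(): map the NVMe status code to errno. */
static int translate_status(uint16_t status)
{
    switch ((status >> 1) & 0xFF) {
    case 0:
        return 0;        /* Successful Completion */
    case 1:
        return -ENOSYS;  /* Invalid Command Opcode */
    case 2:
        return -EINVAL;  /* Invalid Field in Command */
    default:
        return -EIO;
    }
}
```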
+
+static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
+{
+ bool progress = false;
+ NVMeRequest *req;
+ NVMeCompletion *c;
+
+ DPRINTF("nvme process completion %d inflight %d\n", q->index, q->inflight);
+ if (q->busy || s->plugged) {
+ DPRINTF("queue busy\n");
+ return false;
+ }
+ q->busy = true;
+ assert(q->inflight >= 0);
+ while (q->inflight) {
+ c = (NVMeCompletion *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
+ if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
+ break;
+ }
+ q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
+ if (!q->cq.head) {
+ q->cq_phase = !q->cq_phase;
+ }
+ /* XXX: error handling instead of assert as it's from device? */
+ assert(c->cid > 0);
+ assert(c->cid <= NVME_QUEUE_SIZE);
+ DPRINTF("nvme completing command %d\n", c->cid);
+ req = &q->reqs[c->cid - 1];
+ assert(req->cid == c->cid);
+ assert(req->cb);
+ req->cb(req->opaque, nvme_translate_error(c));
+ req->cb = req->opaque = NULL;
+ qemu_co_enter_next(&q->wait_queue);
+ c->cid = 0;
+ q->inflight--;
+ /* Flip Phase Tag bit. */
+ c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
+ progress = true;
+ }
+ if (progress) {
+ /* Notify the device so it can post more completions. */
+ smp_mb_release();
+ *q->cq.doorbell = cpu_to_le32(q->cq.head);
+ }
+ q->busy = false;
+ return progress;
+}
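The completion loop above relies on the standard NVMe phase-tag protocol: the head advances modulo the queue size, and the phase the driver compares against flips on every wrap-around. The cursor arithmetic in isolation, as an editorial sketch rather than code from the patch:

```c
#include <assert.h>

#define QSIZE 128   /* mirrors NVME_QUEUE_SIZE */

/* Completion-queue cursor: head index plus the phase value that marks
 * entries as stale; a new entry's Phase Tag differs from cq_phase. */
typedef struct {
    int head;
    int phase;
} CQCursor;

static void cq_advance(CQCursor *c)
{
    c->head = (c->head + 1) % QSIZE;
    if (c->head == 0) {
        c->phase = !c->phase;   /* flip on wrap-around */
    }
}
```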
+
+static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
+ NVMeRequest *req,
+ NVMeCommand *cmd, BlockCompletionFunc cb,
+ void *opaque)
+{
+ req->cb = cb;
+ req->opaque = opaque;
+ cmd->cid = cpu_to_le16(req->cid);
+ DPRINTF("nvme submit command %d to queue %d\n", req->cid, q->index);
+ memcpy((uint8_t *)q->sq.queue +
+ q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
+ q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
+ q->need_kick++;
+ nvme_kick(s, q);
+ nvme_process_completion(s, q);
+}
+
+static void nvme_cmd_sync_cb(void *opaque, int ret)
+{
+ int *pret = opaque;
+ *pret = ret;
+}
+
+static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
+ NVMeCommand *cmd)
+{
+ NVMeRequest *req;
+ BDRVNVMeState *s = bs->opaque;
+ int ret = -EINPROGRESS;
+ req = nvme_get_free_req(q);
+ if (!req) {
+ return -EBUSY;
+ }
+ nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
+
+ BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
+ return ret;
+}
+
+static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
+{
+ BDRVNVMeState *s = bs->opaque;
+ uint8_t *resp;
+ int r;
+ uint64_t iova;
+ NVMeCommand cmd = {
+ .opcode = NVME_ADM_CMD_IDENTIFY,
+ .cdw10 = cpu_to_le32(0x1),
+ };
+
+ resp = qemu_try_blockalign0(bs, 4096);
+ if (!resp) {
+ error_setg(errp, "Cannot allocate buffer for identify response");
+ return false;
+ }
+ r = qemu_vfio_dma_map(s->vfio, resp, 4096, true, &iova);
+ if (r) {
+ error_setg(errp, "Cannot map buffer for DMA");
+ goto fail;
+ }
+ cmd.prp1 = cpu_to_le64(iova);
+
+ if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+ error_setg(errp, "Failed to identify controller");
+ goto fail;
+ }
+
+ if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) {
+ error_setg(errp, "Invalid namespace");
+ goto fail;
+ }
+ s->write_cache = resp[525] & 0x1;
+ s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size;
+ /* For now the page list buffer per command is one page, to hold at most
+ * s->page_size / sizeof(uint64_t) entries. */
+ s->max_transfer = MIN_NON_ZERO(s->max_transfer,
+ s->page_size / sizeof(uint64_t) * s->page_size);
+
+ memset((char *)resp, 0, 4096);
+
+ cmd.cdw10 = 0;
+ cmd.nsid = namespace;
+ if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+ error_setg(errp, "Failed to identify namespace");
+ goto fail;
+ }
+
+ /*qemu_hexdump((const char *)resp, stdout, "NS", 4096);*/
+ s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]);
+
+ qemu_vfree(resp);
+ return true;
+fail:
+ qemu_vfree(resp);
+ return false;
+}
+
+static void nvme_handle_event(EventNotifier *n)
+{
+ int i;
+ BDRVNVMeState *s = container_of(n, BDRVNVMeState, event_notifier);
+
+ DPRINTF("nvme handle event\n");
+ event_notifier_test_and_clear(n);
+ for (i = 0; i < s->nr_queues; i++) {
+ while (nvme_process_completion(s, s->queues[i])) {
+ /* Keep polling until no progress. */
+ }
+ }
+}
+
+static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
+{
+ BDRVNVMeState *s = bs->opaque;
+ int n = s->nr_queues;
+ NVMeQueuePair *q;
+ NVMeCommand cmd;
+ int queue_size = NVME_QUEUE_SIZE;
+
+ q = nvme_create_queue_pair(bs, n, queue_size, errp);
+ if (!q) {
+ return false;
+ }
+ cmd = (NVMeCommand) {
+ .opcode = NVME_ADM_CMD_CREATE_CQ,
+ .prp1 = cpu_to_le64(q->cq.iova),
+ .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+ .cdw11 = cpu_to_le32(0x3),
+ };
+ if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+ error_setg(errp, "Failed to create io queue [%d]", n);
+ nvme_free_queue_pair(bs, q);
+ return false;
+ }
+ cmd = (NVMeCommand) {
+ .opcode = NVME_ADM_CMD_CREATE_SQ,
+ .prp1 = cpu_to_le64(q->sq.iova),
+ .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
+ .cdw11 = cpu_to_le32(0x1 | (n << 16)),
+ };
+ if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
+ error_setg(errp, "Failed to create io queue [%d]", n);
+ nvme_free_queue_pair(bs, q);
+ return false;
+ }
+ s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
+ s->queues[n] = q;
+ s->nr_queues++;
+ return true;
+}
+
+static bool nvme_poll_cb(void *opaque)
+{
+ int i;
+ EventNotifier *e = opaque;
+ BDRVNVMeState *s = container_of(e, BDRVNVMeState, event_notifier);
+ bool progress = false;
+
+ DPRINTF("nvme poll cb\n");
+ for (i = 0; i < s->nr_queues; i++) {
+ while (nvme_process_completion(s, s->queues[i])) {
+ progress = true;
+ }
+ }
+ return progress;
+}
+
+static void nvme_vfree_cb(Notifier *n, void *p)
+{
+ BDRVNVMeState *s = container_of(n, BDRVNVMeState, vfree_notify);
+ qemu_vfio_dma_unmap(s->vfio, p);
+}
+
+static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
+ Error **errp)
+{
+ BDRVNVMeState *s = bs->opaque;
+ int ret;
+ uint64_t cap;
+ uint64_t timeout_ms;
+ uint64_t deadline, now;
+
+ s->nsid = namespace;
+ ret = event_notifier_init(&s->event_notifier, 0);
+ if (ret) {
+ error_setg(errp, "Failed to init event notifier");
+ return ret;
+ }
+
+ s->vfree_notify.notify = nvme_vfree_cb;
+ qemu_vfree_add_notifier(&s->vfree_notify);
+
+ s->vfio = qemu_vfio_open_pci(device, errp);
+ if (!s->vfio) {
+ ret = -EINVAL;
+ goto fail;
+ }
+
+ s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, errp);
+ if (!s->regs) {
+ ret = -EINVAL;
+ goto fail;
+ }
+
+ /* Perform the initialization sequence described in NVMe spec section
+ * "7.6.1 Initialization". */
+
+ cap = le64_to_cpu(s->regs->cap);
+ if (!(cap & (1ULL << 37))) {
+ error_setg(errp, "Device doesn't support NVMe command set");
+ ret = -EINVAL;
+ goto fail;
+ }
+
+ s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
+ s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
+ bs->bl.opt_mem_alignment = s->page_size;
+ timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
+
+ /* Reset device to get a clean state. */
+ s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
+ /* Wait for CSTS.RDY = 0. */
+ deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
+ while (le32_to_cpu(s->regs->csts) & 0x1) {
+ if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+ error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
+ timeout_ms);
+ ret = -ETIMEDOUT;
+ goto fail;
+ }
+ }
+
+ /* Set up admin queue. */
+ s->queues = g_new(NVMeQueuePair *, 1);
+ s->nr_queues = 1;
+ s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
+ if (!s->queues[0]) {
+ ret = -EINVAL;
+ goto fail;
+ }
+ QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
+ s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
+ s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
+ s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
+
+ /* After setting up all control registers we can enable the device. */
+ s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
+ (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
+ 0x1);
+ /* Wait for CSTS.RDY = 1. */
+ now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+ deadline = now + timeout_ms * 1000000;
+ while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
+ if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
+ error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
+ timeout_ms);
+ ret = -ETIMEDOUT;
+ goto fail_queue;
+ }
+ }
+
+ ret = qemu_vfio_pci_init_irq(s->vfio, &s->event_notifier,
+ VFIO_PCI_MSIX_IRQ_INDEX, errp);
+ if (ret) {
+ goto fail_queue;
+ }
+ aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+ false, nvme_handle_event, nvme_poll_cb);
+
+ if (!nvme_identify(bs, namespace, errp)) {
+ ret = -EIO;
+ goto fail_handler;
+ }
+
+ /* Set up command queues. */
+ /* XXX: multiple io queues? */
+ if (!nvme_add_io_queue(bs, errp)) {
+ ret = -EIO;
+ goto fail_handler;
+ }
+ return 0;
+
+fail_handler:
+ aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+ false, NULL, NULL);
+fail_queue:
+ nvme_free_queue_pair(bs, s->queues[0]);
+fail:
+ qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
+ qemu_vfio_close(s->vfio);
+ event_notifier_cleanup(&s->event_notifier);
+ return ret;
+}
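nvme_init() derives three values from the CAP register: the page size from MPSMIN (CAP[51:48]), the doorbell stride from DSTRD (CAP[35:32]), and the ready timeout from TO (CAP[31:24], in 500 ms units, capped at 30 s here). The decoding in isolation, as an editorial sketch mirroring the expressions above:

```c
#include <assert.h>
#include <stdint.h>

/* Minimum page size: 2^(12 + MPSMIN), floored at 4 KiB. */
static uint64_t cap_page_size(uint64_t cap)
{
    uint64_t sz = 1ULL << (12 + ((cap >> 48) & 0xF));
    return sz < 4096 ? 4096 : sz;
}

/* Doorbell stride in uint32_t units: (4 << DSTRD) bytes per doorbell. */
static uint64_t cap_doorbell_scale(uint64_t cap)
{
    return (4ULL << ((cap >> 32) & 0xF)) / sizeof(uint32_t);
}

/* Worst-case ready timeout in milliseconds, capped at 30 seconds. */
static uint64_t cap_timeout_ms(uint64_t cap)
{
    uint64_t ms = 500 * ((cap >> 24) & 0xFF);
    return ms > 30000 ? 30000 : ms;
}
```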
+
+static void nvme_parse_filename(const char *filename, QDict *options,
+ Error **errp)
+{
+ int pref = strlen("nvme://");
+
+ /* XXX: support namespace in URI like nvme://0000:00:00.0/1 ? */
+ if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
+ qdict_put(options, NVME_BLOCK_OPT_DEVICE,
+ qstring_from_str(filename + pref));
+ }
+}
+
+static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
+ Error **errp)
+{
+ const char *device;
+ QemuOpts *opts;
+ int namespace;
+ int ret;
+
+ opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
+ qemu_opts_absorb_qdict(opts, options, &error_abort);
+ device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
+ if (!device) {
+ error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
+ qemu_opts_del(opts);
+ return -EINVAL;
+ }
+
+ namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
+ ret = nvme_init(bs, device, namespace, errp);
+
+ qemu_opts_del(opts);
+ return ret;
+}
+
+static void nvme_close(BlockDriverState *bs)
+{
+ int i;
+ BDRVNVMeState *s = bs->opaque;
+
+ for (i = 0; i < s->nr_queues; ++i) {
+ nvme_free_queue_pair(bs, s->queues[i]);
+ }
+ aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+ false, NULL, NULL);
+ notifier_remove(&s->vfree_notify);
+ qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
+ qemu_vfio_close(s->vfio);
+}
+
+static int64_t nvme_getlength(BlockDriverState *bs)
+{
+ BDRVNVMeState *s = bs->opaque;
+
+ return s->nsze << BDRV_SECTOR_BITS;
+}
+
+static inline int nvme_cmd_map_qiov(BlockDriverState *bs, NVMeCommand *cmd,
+ NVMeRequest *req, QEMUIOVector *qiov)
+{
+ BDRVNVMeState *s = bs->opaque;
+ uint64_t *pagelist = req->prp_list_page;
+ int i, r;
+ unsigned int entries = 0;
+
+ assert(qiov->size);
+ assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
+ assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
+ for (i = 0; i < qiov->niov; ++i) {
+ r = qemu_vfio_dma_map(s->vfio,
+ qiov->iov[i].iov_base,
+ qiov->iov[i].iov_len,
+ false, &pagelist[entries]);
+ if (r) {
+ return r;
+ }
+
+ entries += qiov->iov[i].iov_len / s->page_size;
+ assert(entries <= s->page_size / sizeof(uint64_t));
+ }
+
+ switch (entries) {
+ case 0:
+ abort();
+ case 1:
+ cmd->prp1 = cpu_to_le64(pagelist[0]);
+ cmd->prp2 = 0;
+ break;
+ case 2:
+ cmd->prp1 = cpu_to_le64(pagelist[0]);
+ cmd->prp2 = cpu_to_le64(pagelist[1]);
+ break;
+ default:
+ cmd->prp1 = cpu_to_le64(pagelist[0]);
+ cmd->prp2 = cpu_to_le64(req->prp_list_iova);
+ for (i = 0; i < entries - 1; ++i) {
+ pagelist[i] = cpu_to_le64(pagelist[i + 1]);
+ }
+ pagelist[entries - 1] = 0;
+ break;
+ }
+ return 0;
+}
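The switch above implements the NVMe PRP rules: one page goes straight into PRP1, two pages use PRP1 and PRP2 as addresses, and anything longer makes PRP2 point at an in-memory PRP list. The selection rule in isolation, as an editorial sketch (not code from the patch):

```c
#include <assert.h>

/* How the second PRP field is used, as a function of page count. */
enum prp2_kind { PRP2_UNUSED, PRP2_ADDR, PRP2_LIST };

static enum prp2_kind prp2_for(unsigned entries)
{
    switch (entries) {
    case 0:
        return PRP2_UNUSED;  /* the patch aborts on this case */
    case 1:
        return PRP2_UNUSED;  /* PRP1 alone covers the transfer */
    case 2:
        return PRP2_ADDR;    /* PRP2 is the second page address */
    default:
        return PRP2_LIST;    /* PRP2 points at a PRP list page */
    }
}
```

This is also why the driver caps max_transfer at one page worth of PRP list entries: a single list page holds at most page_size / sizeof(uint64_t) pointers.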
+
+typedef struct {
+ Coroutine *co;
+ int ret;
+ AioContext *ctx;
+} NVMeCoData;
+
+static void nvme_rw_cb_bh(void *opaque)
+{
+ NVMeCoData *data = opaque;
+ qemu_coroutine_enter(data->co);
+}
+
+static void nvme_rw_cb(void *opaque, int ret)
+{
+ NVMeCoData *data = opaque;
+ data->ret = ret;
+ if (!data->co) {
+ /* The rw coroutine hasn't yielded, don't try to enter. */
+ return;
+ }
+ if (qemu_coroutine_self() != data->co) {
+ qemu_coroutine_enter(data->co);
+ } else {
+ aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
+ }
+}
+
+static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
+ uint64_t offset, uint64_t bytes,
+ QEMUIOVector *qiov,
+ bool is_write)
+{
+ int r;
+ BDRVNVMeState *s = bs->opaque;
+ NVMeQueuePair *ioq = s->queues[1];
+ NVMeRequest *req;
+ NVMeCommand cmd = {
+ .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
+ .nsid = cpu_to_le32(s->nsid),
+ .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
+ .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
+ .cdw12 = cpu_to_le32(((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF),
+ };
+ NVMeCoData data = {
+ .ctx = bdrv_get_aio_context(bs),
+ .ret = -EINPROGRESS,
+ };
+
+ DPRINTF(">>> NVMe %s offset %lx bytes %lx qiov[%d]\n",
+ is_write ? "write" : "read",
+ offset, bytes, qiov->niov);
+ assert(s->nr_queues > 1);
+ while (true) {
+ req = nvme_get_free_req(ioq);
+ if (req) {
+ break;
+ }
+ DPRINTF("nvme wait req\n");
+ qemu_co_queue_wait(&ioq->wait_queue);
+ DPRINTF("nvme wait req done\n");
+ }
+
+ r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
+ if (r) {
+ return r;
+ }
+ nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+ data.co = qemu_coroutine_self();
+ while (data.ret == -EINPROGRESS) {
+ qemu_coroutine_yield();
+ }
+
+ DPRINTF("<<< NVMe %s offset %lx bytes %lx ret %d\n",
+ is_write ? "write" : "read",
+ offset, bytes, data.ret);
+ return data.ret;
+}
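The command construction above packs the byte offset and length into dwords: CDW10/CDW11 carry the low and high halves of the 64-bit starting LBA, and CDW12 the zero-based sector count. The encoding in isolation, as an editorial sketch mirroring the expressions in the patch:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9   /* mirrors BDRV_SECTOR_BITS */

/* Model of the CDW10/11/12 packing in nvme_co_prw_aligned(). */
static void encode_rw(uint64_t offset, uint64_t bytes,
                      uint32_t *cdw10, uint32_t *cdw11, uint32_t *cdw12)
{
    uint64_t slba = offset >> SECTOR_BITS;

    *cdw10 = slba & 0xFFFFFFFF;                      /* SLBA low dword */
    *cdw11 = (slba >> 32) & 0xFFFFFFFF;              /* SLBA high dword */
    *cdw12 = ((bytes >> SECTOR_BITS) - 1) & 0xFFFF;  /* NLB, 0-based */
}
```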
+
+static inline bool nvme_qiov_aligned(BlockDriverState *bs,
+ const QEMUIOVector *qiov)
+{
+ int i;
+ BDRVNVMeState *s = bs->opaque;
+
+ for (i = 0; i < qiov->niov; ++i) {
+ if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
+ !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
+ return false;
+ }
+ }
+ return true;
+}
+
+static inline coroutine_fn
+int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+ QEMUIOVector *qiov, bool is_write)
+{
+ int r;
+ uint8_t *buf;
+ QEMUIOVector local_qiov;
+
+ assert(offset % BDRV_SECTOR_SIZE == 0);
+ assert(bytes % BDRV_SECTOR_SIZE == 0);
+ if (nvme_qiov_aligned(bs, qiov)) {
+ return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write);
+ }
+ buf = qemu_try_blockalign(bs, bytes);
+
+ if (!buf) {
+ return -ENOMEM;
+ }
+ qemu_iovec_init(&local_qiov, 1);
+ qemu_iovec_add(&local_qiov, buf, bytes);
+ r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write);
+ qemu_iovec_destroy(&local_qiov);
+ if (!r) {
+ qemu_iovec_from_buf(qiov, 0, buf, bytes);
+ }
+ qemu_vfree(buf);
+ return r;
+}
+
+static coroutine_fn int nvme_co_preadv(BlockDriverState *bs,
+ uint64_t offset, uint64_t bytes,
+ QEMUIOVector *qiov, int flags)
+{
+ return nvme_co_prw(bs, offset, bytes, qiov, false);
+}
+
+static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs,
+ uint64_t offset, uint64_t bytes,
+ QEMUIOVector *qiov, int flags)
+{
+ return nvme_co_prw(bs, offset, bytes, qiov, true);
+}
+
+static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
+{
+ BDRVNVMeState *s = bs->opaque;
+ NVMeQueuePair *ioq = s->queues[1];
+ NVMeRequest *req;
+ NVMeCommand cmd = {
+ .opcode = NVME_CMD_FLUSH,
+ .nsid = cpu_to_le32(s->nsid),
+ };
+ NVMeCoData data = {
+ .ctx = bdrv_get_aio_context(bs),
+ .ret = -EINPROGRESS,
+ };
+
+ assert(s->nr_queues > 1);
+ while (true) {
+ req = nvme_get_free_req(ioq);
+ if (req) {
+ break;
+ }
+ qemu_co_queue_wait(&ioq->wait_queue);
+ }
+
+ nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
+
+ data.co = qemu_coroutine_self();
+ if (data.ret == -EINPROGRESS) {
+ qemu_coroutine_yield();
+ }
+
+ return data.ret;
+}
+
+
+static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
+ BlockReopenQueue *queue, Error **errp)
+{
+ return 0;
+}
+
+static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors, int *pnum,
+ BlockDriverState **file)
+{
+ *pnum = nb_sectors;
+ *file = bs;
+
+ return BDRV_BLOCK_OFFSET_VALID | (sector_num << BDRV_SECTOR_BITS);
+}
+
+static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
+{
+ QINCREF(opts);
+ qdict_del(opts, "filename");
+
+ if (!qdict_size(opts)) {
+ snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://",
+ bs->drv->format_name);
+ }
+
+ qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name));
+ bs->full_open_options = opts;
+}
+
+static void nvme_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+ BDRVNVMeState *s = bs->opaque;
+
+ /* XXX: other fields from identify command/controller capacity? */
+ bs->bl.opt_mem_alignment = s->page_size;
+ /* XXX: are there other limits? */
+ bs->bl.request_alignment = s->page_size;
+ bs->bl.max_transfer = s->max_transfer;
+}
+
+static void nvme_detach_aio_context(BlockDriverState *bs)
+{
+ BDRVNVMeState *s = bs->opaque;
+
+ aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
+ false, NULL, NULL);
+}
+
+static void nvme_attach_aio_context(BlockDriverState *bs,
+ AioContext *new_context)
+{
+ BDRVNVMeState *s = bs->opaque;
+
+ aio_set_event_notifier(new_context, &s->event_notifier,
+ false, nvme_handle_event, nvme_poll_cb);
+}
+
+static void nvme_aio_plug(BlockDriverState *bs)
+{
+ BDRVNVMeState *s = bs->opaque;
+ s->plugged++;
+}
+
+static void nvme_aio_unplug(BlockDriverState *bs)
+{
+ int i;
+ BDRVNVMeState *s = bs->opaque;
+ assert(s->plugged);
+ if (!--s->plugged) {
+ for (i = 1; i < s->nr_queues; i++) {
+ nvme_kick(s, s->queues[i]);
+ nvme_process_completion(s, s->queues[i]);
+ }
+ }
+}
+
+static BlockDriver bdrv_nvme = {
+ .format_name = "nvme",
+ .protocol_name = "nvme",
+ .instance_size = sizeof(BDRVNVMeState),
+
+ .bdrv_parse_filename = nvme_parse_filename,
+ .bdrv_file_open = nvme_file_open,
+ .bdrv_close = nvme_close,
+ .bdrv_getlength = nvme_getlength,
+
+ .bdrv_co_preadv = nvme_co_preadv,
+ .bdrv_co_pwritev = nvme_co_pwritev,
+ .bdrv_co_flush_to_disk = nvme_co_flush,
+ .bdrv_reopen_prepare = nvme_reopen_prepare,
+
+ .bdrv_co_get_block_status = nvme_co_get_block_status,
+
+ .bdrv_refresh_filename = nvme_refresh_filename,
+ .bdrv_refresh_limits = nvme_refresh_limits,
+
+ .bdrv_detach_aio_context = nvme_detach_aio_context,
+ .bdrv_attach_aio_context = nvme_attach_aio_context,
+
+ .bdrv_io_plug = nvme_aio_plug,
+ .bdrv_io_unplug = nvme_aio_unplug,
+};
+
+static void bdrv_nvme_init(void)
+{
+ bdrv_register(&bdrv_nvme);
+}
+
+block_init(bdrv_nvme_init);
--
2.9.3
^ permalink raw reply related [flat|nested] 20+ messages in thread
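[Editor's note] The bdrv_io_plug/bdrv_io_unplug hooks in the patch above batch doorbell writes: while `s->plugged` is non-zero, nvme_kick() and nvme_process_completion() return early, and the final unplug rings each I/O queue's doorbell once for all queued submissions. A minimal standalone model of that pattern (illustrative names only, not the QEMU API):

```python
class Queue:
    def __init__(self):
        self.pending = []   # submitted entries not yet announced to the device
        self.kicks = 0      # doorbell (MMIO) writes, the cost being amortized

    def kick(self):
        # One doorbell write publishes every pending submission at once,
        # mirroring nvme_kick() writing sq.tail a single time.
        if self.pending:
            self.kicks += 1
            self.pending.clear()

class Driver:
    def __init__(self, nqueues=2):
        self.queues = [Queue() for _ in range(nqueues)]
        self.plugged = 0    # nesting counter, as in BDRVNVMeState.plugged

    def plug(self):
        self.plugged += 1

    def unplug(self):
        assert self.plugged > 0
        if self.plugged == 1:
            for q in self.queues:
                q.kick()    # flush everything accumulated while plugged
        self.plugged -= 1

    def submit(self, q, req):
        q.pending.append(req)
        if not self.plugged:
            q.kick()        # unbatched path: ring the doorbell immediately

d = Driver()
q = d.queues[1]
d.plug()
for i in range(8):
    d.submit(q, i)
d.unplug()
assert q.kicks == 1         # eight requests, a single doorbell write
```

Without the plug/unplug pair, the same eight submissions would cost eight MMIO writes; the counter makes the batching safe to nest.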
* Re: [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver
2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
@ 2016-12-20 16:39 ` Paolo Bonzini
2016-12-21 11:59 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-20 16:39 UTC (permalink / raw)
To: Fam Zheng
Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
----- Original Message -----
> From: "Fam Zheng" <famz@redhat.com>
> To: qemu-devel@nongnu.org
> Cc: "Paolo Bonzini" <pbonzini@redhat.com>, qemu-block@nongnu.org, "Karl Rister" <krister@redhat.com>, "Kevin Wolf"
> <kwolf@redhat.com>, "Max Reitz" <mreitz@redhat.com>, borntraeger@de.ibm.com, "Stefan Hajnoczi" <stefanha@redhat.com>
> Sent: Tuesday, December 20, 2016 5:31:39 PM
> Subject: [PATCH 4/4] block: Add VFIO based NVMe driver
>
> This is a new protocol driver that exclusively opens a host NVMe
> controller through VFIO. It achieves better latency than linux-aio.
>
> nvme:// linux-aio
> ------------------------------------------------------------------
> fio bs=4k iodepth=1 (IOPS) 30014 24009
> fio bs=4k iodepth=1 (IOPS) +polling 45148 34689
> fio bs=64k iodepth=1 (IOPS) 17801 14954
> fio bs=64k iodepth=1 (IOPS) +polling 17970 14564
>
> fio bs=4k iodepth=32 (IOPS) 110637 121480
> fio bs=4k iodepth=32 (IOPS) +polling 145533 166878
> fio bs=64k iodepth=32 (IOPS) 50525 50596
> fio bs=64k iodepth=32 (IOPS) +polling 50482 50534
>
> (host) qemu-img bench -c 8000000 (sec) 15.00 43.13
>
> ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> the used NVMe device, otherwise it defaults to 0).
>
> For the rows where linux-aio is faster, a lot of evidence shows that the
> io queue is more likely to stay full if the request completions happen
> faster, as in the nvme:// case, hence less batch submission and request
> merging than linux-aio.
>
> This patch is co-authored by Paolo and me.
... though to be fair it's at least 95% Fam's work. :)
Paolo
>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
> block/Makefile.objs | 1 +
> block/nvme.c | 1026
> +++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 1027 insertions(+)
> create mode 100644 block/nvme.c
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 67a036a..fe975a6 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
> block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> block-obj-y += null.o mirror.o commit.o io.o
> block-obj-y += throttle-groups.o
> +block-obj-y += nvme.o
>
> block-obj-y += nbd.o nbd-client.o sheepdog.o
> block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> diff --git a/block/nvme.c b/block/nvme.c
> new file mode 100644
> index 0000000..5c9ffde
> --- /dev/null
> +++ b/block/nvme.c
> @@ -0,0 +1,1026 @@
> +/*
> + * NVMe block driver based on vfio
> + *
> + * Copyright 2016 Red Hat, Inc.
> + *
> + * Authors:
> + * Fam Zheng <famz@redhat.com>
> + * Paolo Bonzini <pbonzini@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +#include "qapi/error.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp/qstring.h"
> +#include "block/block_int.h"
> +#include "qemu/error-report.h"
> +#include "qemu/vfio-helpers.h"
> +
> +#define NVME_DEBUG 0
> +
> +#define DPRINTF(...) \
> + if (NVME_DEBUG) { \
> + printf(__VA_ARGS__); \
> + }
> +
> +#define NVME_SQ_ENTRY_BYTES 64
> +#define NVME_CQ_ENTRY_BYTES 16
> +#define NVME_QUEUE_SIZE 128
> +
> +/* TODO: share definitions with hw/block/nvme.c? */
> +enum NvmeAdminCommands {
> + NVME_ADM_CMD_DELETE_SQ = 0x00,
> + NVME_ADM_CMD_CREATE_SQ = 0x01,
> + NVME_ADM_CMD_GET_LOG_PAGE = 0x02,
> + NVME_ADM_CMD_DELETE_CQ = 0x04,
> + NVME_ADM_CMD_CREATE_CQ = 0x05,
> + NVME_ADM_CMD_IDENTIFY = 0x06,
> + NVME_ADM_CMD_ABORT = 0x08,
> + NVME_ADM_CMD_SET_FEATURES = 0x09,
> + NVME_ADM_CMD_GET_FEATURES = 0x0a,
> + NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c,
> + NVME_ADM_CMD_ACTIVATE_FW = 0x10,
> + NVME_ADM_CMD_DOWNLOAD_FW = 0x11,
> + NVME_ADM_CMD_FORMAT_NVM = 0x80,
> + NVME_ADM_CMD_SECURITY_SEND = 0x81,
> + NVME_ADM_CMD_SECURITY_RECV = 0x82,
> +};
> +
> +enum NvmeIoCommands {
> + NVME_CMD_FLUSH = 0x00,
> + NVME_CMD_WRITE = 0x01,
> + NVME_CMD_READ = 0x02,
> + NVME_CMD_WRITE_UNCOR = 0x04,
> + NVME_CMD_COMPARE = 0x05,
> + NVME_CMD_DSM = 0x09,
> +};
> +
> +typedef struct {
> + uint8_t opcode;
> + uint8_t flags;
> + uint16_t cid;
> + uint32_t nsid;
> + uint64_t reserved;
> + uint64_t mptr;
> + uint64_t prp1;
> + uint64_t prp2;
> + uint32_t cdw10;
> + uint32_t cdw11;
> + uint32_t cdw12;
> + uint32_t cdw13;
> + uint32_t cdw14;
> + uint32_t cdw15;
> +} QEMU_PACKED NVMeCommand;
> +
> +typedef struct {
> + uint32_t cmd_specific;
> + uint32_t reserved;
> + uint16_t sq_head;
> + uint16_t sqid;
> + uint16_t cid;
> + uint16_t status;
> +} QEMU_PACKED NVMeCompletion;
> +
> +typedef struct {
> + int32_t head, tail;
> + uint8_t *queue;
> + uint64_t iova;
> + volatile uint32_t *doorbell;
> +} NVMeQueue;
> +
> +typedef struct {
> + BlockCompletionFunc *cb;
> + void *opaque;
> + int cid;
> + void *prp_list_page;
> + uint64_t prp_list_iova;
> +} NVMeRequest;
> +
> +typedef struct {
> + int index;
> + NVMeQueue sq, cq;
> + int cq_phase;
> + uint8_t *prp_list_pages;
> + uint64_t prp_list_base_iova;
> + NVMeRequest reqs[NVME_QUEUE_SIZE];
> + CoQueue wait_queue;
> + bool busy;
> + int need_kick;
> + int inflight;
> +} NVMeQueuePair;
> +
> +typedef volatile struct {
> + uint64_t cap;
> + uint32_t vs;
> + uint32_t intms;
> + uint32_t intmc;
> + uint32_t cc;
> + uint32_t reserved0;
> + uint32_t csts;
> + uint32_t nssr;
> + uint32_t aqa;
> + uint64_t asq;
> + uint64_t acq;
> + uint32_t cmbloc;
> + uint32_t cmbsz;
> + uint8_t reserved1[0xec0];
> + uint8_t cmd_set_specfic[0x100];
> + uint32_t doorbells[];
> +} QEMU_PACKED NVMeRegs;
> +
> +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
> +
> +typedef struct {
> + QEMUVFIOState *vfio;
> + NVMeRegs *regs;
> + /* The submission/completion queue pairs.
> + * [0]: admin queue.
> + * [1..]: io queues.
> + */
> + NVMeQueuePair **queues;
> + int nr_queues;
> + size_t page_size;
> + /* How many uint32_t elements does each doorbell entry take. */
> + size_t doorbell_scale;
> + bool write_cache;
> + EventNotifier event_notifier;
> + uint64_t nsze; /* Namespace size reported by identify command */
> + int nsid; /* The namespace id to read/write data. */
> + uint64_t max_transfer;
> + int plugged;
> + Notifier vfree_notify;
> +} BDRVNVMeState;
> +
> +#define NVME_BLOCK_OPT_DEVICE "device"
> +#define NVME_BLOCK_OPT_NAMESPACE "namespace"
> +
> +static QemuOptsList runtime_opts = {
> + .name = "nvme",
> + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> + .desc = {
> + {
> + .name = NVME_BLOCK_OPT_DEVICE,
> + .type = QEMU_OPT_STRING,
> + .help = "NVMe PCI device address",
> + },
> + {
> + .name = NVME_BLOCK_OPT_NAMESPACE,
> + .type = QEMU_OPT_NUMBER,
> + .help = "NVMe namespace",
> + },
> + { /* end of list */ }
> + },
> +};
> +
> +static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
> + int nentries, int entry_bytes, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + size_t bytes;
> + int r;
> +
> + bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
> + q->head = q->tail = 0;
> + q->queue = qemu_try_blockalign0(bs, bytes);
> +
> + if (!q->queue) {
> + error_setg(errp, "Cannot allocate queue");
> + return;
> + }
> + r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, true, &q->iova);
> + if (r) {
> + error_setg(errp, "Cannot map queue");
> + }
> +}
> +
> +static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
> +{
> + qemu_vfree(q->prp_list_pages);
> + qemu_vfree(q->sq.queue);
> + qemu_vfree(q->cq.queue);
> + g_free(q);
> +}
> +
> +static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
> + int idx, int size,
> + Error **errp)
> +{
> + int i, r;
> + BDRVNVMeState *s = bs->opaque;
> + Error *local_err = NULL;
> + NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
> + uint64_t prp_list_iovas[NVME_QUEUE_SIZE];
> +
> + q->index = idx;
> + qemu_co_queue_init(&q->wait_queue);
> + q->prp_list_pages = qemu_blockalign0(bs, s->page_size *
> NVME_QUEUE_SIZE);
> + r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
> + s->page_size * NVME_QUEUE_SIZE,
> + false, prp_list_iovas);
> + if (r) {
> + goto fail;
> + }
> + for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> + NVMeRequest *req = &q->reqs[i];
> + req->cid = i + 1;
> + req->prp_list_page = q->prp_list_pages + i * s->page_size;
> + req->prp_list_iova = prp_list_iovas[i];
> + }
> + nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
> + if (local_err) {
> + error_propagate(errp, local_err);
> + goto fail;
> + }
> + q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
> +
> + nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
> + if (local_err) {
> + error_propagate(errp, local_err);
> + goto fail;
> + }
> + q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> +
> + return q;
> +fail:
> + nvme_free_queue_pair(bs, q);
> + return NULL;
> +}
> +
> +static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> + if (s->plugged || !q->need_kick) {
> + return;
> + }
> + DPRINTF("nvme kick queue %d\n", q->index);
> + assert(!(q->sq.tail & 0xFF00));
> + /* Fence the write to submission queue entry before notifying the guest.
> */
> + smp_wmb();
> + *q->sq.doorbell = cpu_to_le32(q->sq.tail);
> + q->inflight += q->need_kick;
> + q->need_kick = 0;
> +}
> +
> +static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
> +{
> + int i;
> + if (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
> + /* We have to leave one slot empty as that is the full queue case
> (head
> + * == tail + 1). */
> + return NULL;
> + }
> + for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> + if (!q->reqs[i].cb) {
> + return &q->reqs[i];
> + }
> + }
> + return NULL;
> +}
> +
> +static inline int nvme_translate_error(const NVMeCompletion *c)
> +{
> + if ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> + DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status
> %x\n",
> + c->cmd_specific, c->sq_head, c->sqid, c->cid, c->status);
> + }
> + switch ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> + case 0:
> + return 0;
> + case 1:
> + return -ENOSYS;
> + case 2:
> + return -EINVAL;
> + default:
> + return -EIO;
> + }
> +}
> +
> +static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> + bool progress = false;
> + NVMeRequest *req;
> + NVMeCompletion *c;
> +
> + DPRINTF("nvme process completion %d inflight %d\n", q->index,
> q->inflight);
> + if (q->busy || s->plugged) {
> + DPRINTF("queue busy\n");
> + return false;
> + }
> + q->busy = true;
> + assert(q->inflight >= 0);
> + while (q->inflight) {
> + c = (NVMeCompletion *)&q->cq.queue[q->cq.head *
> NVME_CQ_ENTRY_BYTES];
> + if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> + break;
> + }
> + q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> + if (!q->cq.head) {
> + q->cq_phase = !q->cq_phase;
> + }
> + /* XXX: error handling instead of assert as it's from device? */
> + assert(c->cid > 0);
> + assert(c->cid <= NVME_QUEUE_SIZE);
> + DPRINTF("nvme completing command %d\n", c->cid);
> + req = &q->reqs[c->cid - 1];
> + assert(req->cid == c->cid);
> + assert(req->cb);
> + req->cb(req->opaque, nvme_translate_error(c));
> + req->cb = req->opaque = NULL;
> + qemu_co_enter_next(&q->wait_queue);
> + c->cid = 0;
> + q->inflight--;
> + /* Flip Phase Tag bit. */
> + c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> + progress = true;
> + }
> + if (progress) {
> + /* Notify the device so it can post more completions. */
> + smp_mb_release();
> + *q->cq.doorbell = cpu_to_le32(q->cq.head);
> + }
> + q->busy = false;
> + return progress;
> +}
> +
> +static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
> + NVMeRequest *req,
> + NVMeCommand *cmd, BlockCompletionFunc cb,
> + void *opaque)
> +{
> + req->cb = cb;
> + req->opaque = opaque;
> + cmd->cid = cpu_to_le32(req->cid);
> + DPRINTF("nvme submit command %d to queue %d\n", req->cid, q->index);
> + memcpy((uint8_t *)q->sq.queue +
> + q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
> + q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
> + q->need_kick++;
> + nvme_kick(s, q);
> + nvme_process_completion(s, q);
> +}
> +
> +static void nvme_cmd_sync_cb(void *opaque, int ret)
> +{
> + int *pret = opaque;
> + *pret = ret;
> +}
> +
> +static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
> + NVMeCommand *cmd)
> +{
> + NVMeRequest *req;
> + BDRVNVMeState *s = bs->opaque;
> + int ret = -EINPROGRESS;
> + req = nvme_get_free_req(q);
> + if (!req) {
> + return -EBUSY;
> + }
> + nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
> +
> + BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
> + return ret;
> +}
> +
> +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + uint8_t *resp;
> + int r;
> + uint64_t iova;
> + NVMeCommand cmd = {
> + .opcode = NVME_ADM_CMD_IDENTIFY,
> + .cdw10 = cpu_to_le32(0x1),
> + };
> +
> + resp = qemu_try_blockalign0(bs, 4096);
> + if (!resp) {
> + error_setg(errp, "Cannot allocate buffer for identify response");
> + return false;
> + }
> + r = qemu_vfio_dma_map(s->vfio, resp, 4096, true, &iova);
> + if (r) {
> + error_setg(errp, "Cannot map buffer for DMA");
> + goto fail;
> + }
> + cmd.prp1 = cpu_to_le64(iova);
> +
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to identify controller");
> + goto fail;
> + }
> +
> + if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) {
> + error_setg(errp, "Invalid namespace");
> + goto fail;
> + }
> + s->write_cache = le32_to_cpu(resp[525]) & 0x1;
> + s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size;
> + /* For now the page list buffer per command is one page, to hold at most
> + * s->page_size / sizeof(uint64_t) entries. */
> + s->max_transfer = MIN_NON_ZERO(s->max_transfer,
> + s->page_size / sizeof(uint64_t) * s->page_size);
> +
> + memset((char *)resp, 0, 4096);
> +
> + cmd.cdw10 = 0;
> + cmd.nsid = namespace;
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to identify namespace");
> + goto fail;
> + }
> +
> + /*qemu_hexdump((const char *)resp, stdout, "NS", 4096);*/
> + s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]);
> +
> + qemu_vfree(resp);
> + return true;
> +fail:
> + qemu_vfree(resp);
> + return false;
> +}
> +
> +static void nvme_handle_event(EventNotifier *n)
> +{
> + int i;
> + BDRVNVMeState *s = container_of(n, BDRVNVMeState, event_notifier);
> +
> + DPRINTF("nvme handle event\n");
> + event_notifier_test_and_clear(n);
> + for (i = 0; i < s->nr_queues; i++) {
> + while (nvme_process_completion(s, s->queues[i])) {
> + /* Keep polling until no progress. */
> + }
> + }
> +}
> +
> +static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + int n = s->nr_queues;
> + NVMeQueuePair *q;
> + NVMeCommand cmd;
> + int queue_size = NVME_QUEUE_SIZE;
> +
> + q = nvme_create_queue_pair(bs, n, queue_size, errp);
> + if (!q) {
> + return false;
> + }
> + cmd = (NVMeCommand) {
> + .opcode = NVME_ADM_CMD_CREATE_CQ,
> + .prp1 = cpu_to_le64(q->cq.iova),
> + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> + .cdw11 = cpu_to_le32(0x3),
> + };
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to create io queue [%d]", n);
> + nvme_free_queue_pair(bs, q);
> + return false;
> + }
> + cmd = (NVMeCommand) {
> + .opcode = NVME_ADM_CMD_CREATE_SQ,
> + .prp1 = cpu_to_le64(q->sq.iova),
> + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> + .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> + };
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to create io queue [%d]", n);
> + nvme_free_queue_pair(bs, q);
> + return false;
> + }
> + s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
> + s->queues[n] = q;
> + s->nr_queues++;
> + return true;
> +}
> +
> +static bool nvme_poll_cb(void *opaque)
> +{
> + int i;
> + EventNotifier *e = opaque;
> + BDRVNVMeState *s = container_of(e, BDRVNVMeState, event_notifier);
> + bool progress = false;
> +
> + DPRINTF("nvme poll cb\n");
> + for (i = 0; i < s->nr_queues; i++) {
> + while (nvme_process_completion(s, s->queues[i])) {
> + progress = true;
> + }
> + }
> + return progress;
> +}
> +
> +static void nvme_vfree_cb(Notifier *n, void *p)
> +{
> + BDRVNVMeState *s = container_of(n, BDRVNVMeState, vfree_notify);
> + qemu_vfio_dma_unmap(s->vfio, p);
> +}
> +
> +static int nvme_init(BlockDriverState *bs, const char *device, int
> namespace,
> + Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + int ret;
> + uint64_t cap;
> + uint64_t timeout_ms;
> + uint64_t deadline, now;
> +
> + s->nsid = namespace;
> + ret = event_notifier_init(&s->event_notifier, 0);
> + if (ret) {
> + error_setg(errp, "Failed to init event notifier");
> + return ret;
> + }
> +
> + s->vfree_notify.notify = nvme_vfree_cb;
> + qemu_vfree_add_notifier(&s->vfree_notify);
> +
> + s->vfio = qemu_vfio_open_pci(device, errp);
> + if (!s->vfio) {
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, errp);
> + if (!s->regs) {
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + /* Perform initialize sequence as described in NVMe spec "7.6.1
> + * Initialization". */
> +
> + cap = le64_to_cpu(s->regs->cap);
> + if (!(cap & (1ULL << 37))) {
> + error_setg(errp, "Device doesn't support NVMe command set");
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
> + s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
> + bs->bl.opt_mem_alignment = s->page_size;
> + timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
> +
> + /* Reset device to get a clean state. */
> + s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
> + /* Wait for CSTS.RDY = 0. */
> + deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms *
> 1000000ULL;
> + while (le32_to_cpu(s->regs->csts) & 0x1) {
> + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> + error_setg(errp, "Timeout while waiting for device to reset (%ld
> ms)",
> + timeout_ms);
> + ret = -ETIMEDOUT;
> + goto fail;
> + }
> + }
> +
> + /* Set up admin queue. */
> + s->queues = g_new(NVMeQueuePair *, 1);
> + s->nr_queues = 1;
> + s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
> + if (!s->queues[0]) {
> + ret = -EINVAL;
> + goto fail;
> + }
> + QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
> + s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
> + s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
> + s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
> +
> + /* After setting up all control registers we can enable device now. */
> + s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
> + (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
> + 0x1);
> + /* Wait for CSTS.RDY = 1. */
> + now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> + deadline = now + timeout_ms * 1000000;
> + while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
> + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> + error_setg(errp, "Timeout while waiting for device to start (%ld
> ms)",
> + timeout_ms);
> + ret = -ETIMEDOUT;
> + goto fail_queue;
> + }
> + }
> +
> + ret = qemu_vfio_pci_init_irq(s->vfio, &s->event_notifier,
> + VFIO_PCI_MSIX_IRQ_INDEX, errp);
> + if (ret) {
> + goto fail_queue;
> + }
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, nvme_handle_event, nvme_poll_cb);
> +
> + if (!nvme_identify(bs, namespace, errp)) {
> + ret = -EIO;
> + goto fail_handler;
> + }
> +
> + /* Set up command queues. */
> + /* XXX: multiple io queues? */
> + if (!nvme_add_io_queue(bs, errp)) {
> + ret = -EIO;
> + goto fail_handler;
> + }
> + return 0;
> +
> +fail_handler:
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, NULL, NULL);
> +fail_queue:
> + nvme_free_queue_pair(bs, s->queues[0]);
> +fail:
> + qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> + qemu_vfio_close(s->vfio);
> + event_notifier_cleanup(&s->event_notifier);
> + return ret;
> +}
> +
> +static void nvme_parse_filename(const char *filename, QDict *options,
> + Error **errp)
> +{
> + int pref = strlen("nvme://");
> +
> + /* XXX: support namespace in URI like nvme://0000:00:00.0/1 ? */
> + if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
> + qdict_put(options, NVME_BLOCK_OPT_DEVICE,
> + qstring_from_str(filename + pref));
> + }
> +}
> +
> +static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
> + Error **errp)
> +{
> + const char *device;
> + QemuOpts *opts;
> + int namespace;
> +
> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> + qemu_opts_absorb_qdict(opts, options, &error_abort);
> + device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> + if (!device) {
> + error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
> + return -EINVAL;
> + }
> +
> + namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
> + nvme_init(bs, device, namespace, errp);
> +
> + qemu_opts_del(opts);
> + return 0;
> +}
> +
> +static void nvme_close(BlockDriverState *bs)
> +{
> + int i;
> + BDRVNVMeState *s = bs->opaque;
> +
> + for (i = 0; i < s->nr_queues; ++i) {
> + nvme_free_queue_pair(bs, s->queues[i]);
> + }
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, NULL, NULL);
> + notifier_remove(&s->vfree_notify);
> + qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> + qemu_vfio_close(s->vfio);
> +}
> +
> +static int64_t nvme_getlength(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> +
> + return s->nsze << BDRV_SECTOR_BITS;
> +}
> +
> +static inline int nvme_cmd_map_qiov(BlockDriverState *bs, NVMeCommand *cmd,
> + NVMeRequest *req, QEMUIOVector *qiov)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + uint64_t *pagelist = req->prp_list_page;
> + int i, r;
> + unsigned int entries = 0;
> +
> + assert(qiov->size);
> + assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
> + assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
> + for (i = 0; i < qiov->niov; ++i) {
> + r = qemu_vfio_dma_map(s->vfio,
> + qiov->iov[i].iov_base,
> + qiov->iov[i].iov_len,
> + false, &pagelist[entries]);
> + if (r) {
> + return r;
> + }
> +
> + entries += qiov->iov[i].iov_len / s->page_size;
> + assert(entries <= s->page_size / sizeof(uint64_t));
> + }
> +
> + switch (entries) {
> + case 0:
> + abort();
> + case 1:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = 0;
> + break;
> + case 2:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = cpu_to_le64(pagelist[1]);;
> + break;
> + default:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = cpu_to_le64(req->prp_list_iova);
> + for (i = 0; i < entries - 1; ++i) {
> + pagelist[i] = cpu_to_le64(pagelist[i + 1]);
> + }
> + pagelist[entries] = 0;
> + break;
> + }
> + return 0;
> +}
> +
> +typedef struct {
> + Coroutine *co;
> + int ret;
> + AioContext *ctx;
> +} NVMeCoData;
> +
> +static void nvme_rw_cb_bh(void *opaque)
> +{
> + NVMeCoData *data = opaque;
> + qemu_coroutine_enter(data->co);
> +}
> +
> +static void nvme_rw_cb(void *opaque, int ret)
> +{
> + NVMeCoData *data = opaque;
> + data->ret = ret;
> + if (!data->co) {
> + /* The rw coroutine hasn't yielded, don't try to enter. */
> + return;
> + }
> + if (qemu_coroutine_self() != data->co) {
> + qemu_coroutine_enter(data->co);
> + } else {
> + aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
> + }
> +}
> +
> +static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov,
> + bool is_write)
> +{
> + int r;
> + BDRVNVMeState *s = bs->opaque;
> + NVMeQueuePair *ioq = s->queues[1];
> + NVMeRequest *req;
> + NVMeCommand cmd = {
> + .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
> + .nsid = cpu_to_le32(s->nsid),
> + .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
> + .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) &
> 0xFFFFFFFF),
> + .cdw12 = cpu_to_le32(((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF),
> + };
> + NVMeCoData data = {
> + .ctx = bdrv_get_aio_context(bs),
> + .ret = -EINPROGRESS,
> + };
> +
> + DPRINTF(">>> NVMe %s offset %lx bytes %lx qiov[%d]\n",
> + is_write ? "write" : "read",
> + offset, bytes, qiov->niov);
> + assert(s->nr_queues > 1);
> + while (true) {
> + req = nvme_get_free_req(ioq);
> + if (req) {
> + break;
> + }
> + DPRINTF("nvme wait req\n");
> + qemu_co_queue_wait(&ioq->wait_queue);
> + DPRINTF("nvme wait req done\n");
> + }
> +
> + r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
> + if (r) {
> + return r;
> + }
> + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> + data.co = qemu_coroutine_self();
> + while (data.ret == -EINPROGRESS) {
> + qemu_coroutine_yield();
> + }
> +
> + DPRINTF("<<< NVMe %s offset %lx bytes %lx ret %d\n",
> + is_write ? "write" : "read",
> + offset, bytes, data.ret);
> + return data.ret;
> +}
> +
> +static inline bool nvme_qiov_aligned(BlockDriverState *bs,
> + const QEMUIOVector *qiov)
> +{
> + int i;
> + BDRVNVMeState *s = bs->opaque;
> +
> + for (i = 0; i < qiov->niov; ++i) {
> + if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
> + !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static inline coroutine_fn
> +int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, bool is_write)
> +{
> + int r;
> + uint8_t *buf;
> + QEMUIOVector local_qiov;
> +
> + assert(offset % BDRV_SECTOR_SIZE == 0);
> + assert(bytes % BDRV_SECTOR_SIZE == 0);
> + if (nvme_qiov_aligned(bs, qiov)) {
> + return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write);
> + }
> + buf = qemu_try_blockalign(bs, bytes);
> +
> + if (!buf) {
> + return -ENOMEM;
> + }
> + qemu_iovec_init(&local_qiov, 1);
> + qemu_iovec_add(&local_qiov, buf, bytes);
> + r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write);
> + qemu_iovec_destroy(&local_qiov);
> + if (!r) {
> + qemu_iovec_from_buf(qiov, 0, buf, bytes);
> + }
> + qemu_vfree(buf);
> + return r;
> +}
> +
> +static coroutine_fn int nvme_co_preadv(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, int flags)
> +{
> + return nvme_co_prw(bs, offset, bytes, qiov, false);
> +}
> +
> +static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, int flags)
> +{
> + return nvme_co_prw(bs, offset, bytes, qiov, true);
> +}
> +
> +static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + NVMeQueuePair *ioq = s->queues[1];
> + NVMeRequest *req;
> + NVMeCommand cmd = {
> + .opcode = NVME_CMD_FLUSH,
> + .nsid = cpu_to_le32(s->nsid),
> + };
> + NVMeCoData data = {
> + .ctx = bdrv_get_aio_context(bs),
> + .ret = -EINPROGRESS,
> + };
> +
> + assert(s->nr_queues > 1);
> + while (true) {
> + req = nvme_get_free_req(ioq);
> + if (req) {
> + break;
> + }
> + qemu_co_queue_wait(&ioq->wait_queue);
> + }
> +
> + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> + data.co = qemu_coroutine_self();
> + if (data.ret == -EINPROGRESS) {
> + qemu_coroutine_yield();
> + }
> +
> + return data.ret;
> +}
> +
> [...]
^ permalink raw reply [flat|nested] 20+ messages in thread
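[Editor's note] The nvme_cmd_map_qiov() function in the quoted patch picks the command's PRP fields by the NVMe rule: one page goes directly in PRP1; exactly two pages use PRP1 and PRP2; more than two put the first page in PRP1 and the IOVA of a PRP list (holding every remaining page) in PRP2. A simplified sketch of just that selection (hypothetical helper, not the patch's code, which additionally byte-swaps and writes the list into a pre-mapped page):

```python
def build_prps(page_iovas, prp_list_iova):
    """Return (prp1, prp2, prp_list) for an I/O covering the given
    page-aligned IOVAs, mirroring the three cases in nvme_cmd_map_qiov()."""
    n = len(page_iovas)
    assert n > 0
    if n == 1:
        return page_iovas[0], 0, []
    if n == 2:
        return page_iovas[0], page_iovas[1], []
    # More than two pages: PRP2 points at an in-memory list that holds
    # every page after the first.
    return page_iovas[0], prp_list_iova, page_iovas[1:]

pages = [0x1000 * i for i in range(1, 5)]          # four 4 KiB pages
prp1, prp2, lst = build_prps(pages, 0xff000)
assert prp1 == 0x1000
assert prp2 == 0xff000                             # PRP list IOVA
assert lst == [0x2000, 0x3000, 0x4000]
```

This one-page PRP list is also what bounds `s->max_transfer` in nvme_identify(): at most page_size / 8 list entries, hence at most that many pages per command.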
* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
` (3 preceding siblings ...)
2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
@ 2016-12-20 23:04 ` no-reply
2016-12-21 1:38 ` Fam Zheng
2016-12-21 0:48 ` no-reply
2016-12-29 4:09 ` Tian, Kevin
6 siblings, 1 reply; 20+ messages in thread
From: no-reply @ 2016-12-20 23:04 UTC (permalink / raw)
To: famz
Cc: qemu-devel, kwolf, qemu-block, mreitz, borntraeger, stefanha,
pbonzini, krister
Hi,
Your series seems to have some coding style problems. See output below for
more information:
Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
Message-id: 20161220163139.12016-1-famz@redhat.com
Type: series
=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===
Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new
=== OUTPUT BEGIN ===
Checking PATCH 1/4: ramblock-notifier: new...
Checking PATCH 2/4: util: Add a notifier list for qemu_vfree()...
Checking PATCH 3/4: util: Add VFIO helper library...
WARNING: line over 80 characters
#184: FILE: util/vfio-helpers.c:111:
+ if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, &s->bar_region_info[index])) {
WARNING: line over 80 characters
#262: FILE: util/vfio-helpers.c:189:
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf, int size, int ofs)
WARNING: line over 80 characters
#264: FILE: util/vfio-helpers.c:191:
+ if (pread(s->device, buf, size, s->config_region_info.offset + ofs) == size) {
WARNING: line over 80 characters
#270: FILE: util/vfio-helpers.c:197:
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, int ofs)
ERROR: that open brace { should be on the previous line
#287: FILE: util/vfio-helpers.c:214:
+ struct vfio_group_status group_status =
+ { .argsz = sizeof(group_status) };
WARNING: line over 80 characters
#407: FILE: util/vfio-helpers.c:334:
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host, size_t size)
WARNING: line over 80 characters
#415: FILE: util/vfio-helpers.c:342:
+static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host, size_t size)
total: 1 errors, 6 warnings, 746 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 4/4: block: Add VFIO based NVMe driver...
ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#145: FILE: block/nvme.c:92:
+ volatile uint32_t *doorbell;
ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#169: FILE: block/nvme.c:116:
+typedef volatile struct {
WARNING: line over 80 characters
#339: FILE: block/nvme.c:286:
+ DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",
WARNING: line over 80 characters
#626: FILE: block/nvme.c:573:
+ error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
WARNING: line over 80 characters
#655: FILE: block/nvme.c:602:
+ error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
total: 2 errors, 3 warnings, 1033 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===
Test command exited with code: 1
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
` (4 preceding siblings ...)
2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
@ 2016-12-21 0:48 ` no-reply
2016-12-29 4:09 ` Tian, Kevin
6 siblings, 0 replies; 20+ messages in thread
From: no-reply @ 2016-12-21 0:48 UTC (permalink / raw)
To: famz
Cc: qemu-devel, kwolf, qemu-block, mreitz, borntraeger, stefanha,
pbonzini, krister
Hi,
Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.
Type: series
Message-id: 20161220163139.12016-1-famz@redhat.com
Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
make docker-test-build@min-glib
=== TEST SCRIPT END ===
Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new
=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
BUILD centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-m_z3bizr/src'
ARCHIVE qemu.tgz
ARCHIVE dtc.tgz
COPY RUNNER
RUN test-quick in qemu:centos6
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64
Environment variables:
PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel glib2-devel SDL-devel pixman-devel epel-release
HOSTNAME=71a2d1232481
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env
Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix /var/tmp/qemu-build/install
BIOS directory /var/tmp/qemu-build/install/share/qemu
binary directory /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory /var/tmp/qemu-build/install/etc
local state directory /var/tmp/qemu-build/install/var
Manual directory /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path /tmp/qemu-test/src
C compiler cc
Host C compiler cc
C++ compiler
Objective-C compiler cc
ARFLAGS rv
CFLAGS -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g
QEMU_CFLAGS -I/usr/include/pixman-1 -pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-all
LDFLAGS -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g
make make
install install
python python -B
smbd /usr/sbin/smbd
module support no
host CPU x86_64
host big endian no
target list x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled no
sparse enabled no
strip binaries yes
profiler no
static build no
pixman system
SDL support yes (1.2.14)
GTK support no
GTK GL support no
VTE support no
TLS priority NORMAL
GNUTLS support no
GNUTLS rnd no
libgcrypt no
libgcrypt kdf no
nettle no
nettle kdf no
libtasn1 no
curses support no
virgl support no
curl support no
mingw32 support no
Audio drivers oss
Block whitelist (rw)
Block whitelist (ro)
VirtFS support no
VNC support yes
VNC SASL support no
VNC JPEG support no
VNC PNG support no
xen support no
brlapi support no
bluez support no
Documentation no
PIE yes
vde support no
netmap support no
Linux AIO support no
ATTR/XATTR support yes
Install blobs yes
KVM support yes
COLO support yes
RDMA support no
TCG interpreter no
fdt support yes
preadv support yes
fdatasync yes
madvise yes
posix_madvise yes
libcap-ng support no
vhost-net support yes
vhost-scsi support yes
vhost-vsock support yes
Trace backends log
spice support no
rbd support no
xfsctl support no
smartcard support no
libusb no
usb net redir no
OpenGL support no
OpenGL dmabufs no
libiscsi support no
libnfs support no
build guest agent yes
QGA VSS support no
QGA w32 disk info no
QGA MSI support no
seccomp support no
coroutine backend ucontext
coroutine pool yes
debug stack usage no
GlusterFS support no
Archipelago support no
gcov gcov
gcov enabled no
TPM support yes
libssh2 support no
TPM passthrough yes
QOM debugging yes
lzo support no
snappy support no
bzip2 support no
NUMA host support no
tcmalloc support no
jemalloc support no
avx2 optimization no
replication support yes
GEN x86_64-softmmu/config-devices.mak.tmp
GEN aarch64-softmmu/config-devices.mak.tmp
GEN config-host.h
GEN qemu-options.def
GEN qmp-commands.h
GEN qapi-types.h
GEN qapi-visit.h
GEN qapi-event.h
GEN qmp-introspect.h
GEN x86_64-softmmu/config-devices.mak
GEN aarch64-softmmu/config-devices.mak
GEN module_block.h
GEN tests/test-qapi-visit.h
GEN tests/test-qapi-types.h
GEN tests/test-qmp-commands.h
GEN tests/test-qapi-event.h
GEN tests/test-qmp-introspect.h
GEN trace/generated-tracers.h
GEN trace/generated-tcg-tracers.h
GEN trace/generated-helpers-wrappers.h
GEN trace/generated-helpers.h
GEN config-all-devices.mak
CC tests/qemu-iotests/socket_scm_helper.o
GEN qga/qapi-generated/qga-qmp-commands.h
GEN qga/qapi-generated/qga-qapi-types.h
GEN qga/qapi-generated/qga-qapi-visit.h
GEN qga/qapi-generated/qga-qapi-types.c
GEN qga/qapi-generated/qga-qapi-visit.c
GEN qga/qapi-generated/qga-qmp-marshal.c
GEN qapi-types.c
GEN qmp-introspect.c
GEN qapi-visit.c
GEN qapi-event.c
CC qapi/qapi-visit-core.o
CC qapi/qapi-dealloc-visitor.o
CC qapi/qobject-input-visitor.o
CC qapi/qobject-output-visitor.o
CC qapi/qmp-registry.o
CC qapi/qmp-dispatch.o
CC qapi/string-input-visitor.o
CC qapi/string-output-visitor.o
CC qapi/opts-visitor.o
CC qapi/qapi-clone-visitor.o
CC qapi/qmp-event.o
CC qapi/qapi-util.o
CC qobject/qnull.o
CC qobject/qint.o
CC qobject/qstring.o
CC qobject/qdict.o
CC qobject/qlist.o
CC qobject/qfloat.o
CC qobject/qbool.o
CC qobject/qjson.o
CC qobject/qobject.o
CC qobject/json-lexer.o
CC qobject/json-streamer.o
CC qobject/json-parser.o
GEN trace/generated-tracers.c
CC trace/control.o
CC trace/qmp.o
CC util/osdep.o
CC util/cutils.o
CC util/unicode.o
CC util/qemu-timer-common.o
CC util/bufferiszero.o
CC util/compatfd.o
CC util/event_notifier-posix.o
CC util/mmap-alloc.o
CC util/oslib-posix.o
CC util/qemu-openpty.o
CC util/qemu-thread-posix.o
CC util/memfd.o
CC util/envlist.o
CC util/path.o
CC util/module.o
CC util/bitmap.o
CC util/bitops.o
CC util/hbitmap.o
CC util/fifo8.o
CC util/acl.o
CC util/qemu-error.o
CC util/error.o
CC util/id.o
CC util/iov.o
CC util/qemu-config.o
CC util/qemu-sockets.o
CC util/uri.o
CC util/notify.o
CC util/qemu-option.o
CC util/qemu-progress.o
CC util/hexdump.o
CC util/crc32c.o
CC util/uuid.o
CC util/getauxval.o
CC util/throttle.o
CC util/readline.o
CC util/rcu.o
CC util/qemu-coroutine.o
CC util/qemu-coroutine-lock.o
CC util/qemu-coroutine-io.o
CC util/qemu-coroutine-sleep.o
CC util/coroutine-ucontext.o
CC util/buffer.o
CC util/timed-average.o
CC util/base64.o
CC util/log.o
CC util/qdist.o
CC util/qht.o
CC util/range.o
CC util/vfio-helpers.o
CC crypto/pbkdf-stub.o
CC stubs/arch-query-cpu-def.o
CC stubs/arch-query-cpu-model-expansion.o
CC stubs/arch-query-cpu-model-comparison.o
CC stubs/arch-query-cpu-model-baseline.o
CC stubs/bdrv-next-monitor-owned.o
CC stubs/blk-commit-all.o
CC stubs/blockdev-close-all-bdrv-states.o
CC stubs/cpu-get-clock.o
CC stubs/clock-warp.o
CC stubs/cpu-get-icount.o
CC stubs/dump.o
CC stubs/error-printf.o
CC stubs/fdset-add-fd.o
CC stubs/fdset-find-fd.o
CC stubs/fdset-get-fd.o
CC stubs/fdset-remove-fd.o
CC stubs/gdbstub.o
CC stubs/get-fd.o
CC stubs/get-next-serial.o
CC stubs/get-vm-name.o
CC stubs/iothread.o
CC stubs/iothread-lock.o
CC stubs/is-daemonized.o
CC stubs/machine-init-done.o
CC stubs/migr-blocker.o
CC stubs/mon-is-qmp.o
CC stubs/monitor-init.o
CC stubs/notify-event.o
CC stubs/qtest.o
CC stubs/ram.o
CC stubs/replay.o
CC stubs/reset.o
CC stubs/replay-user.o
CC stubs/runstate-check.o
CC stubs/set-fd-handler.o
CC stubs/slirp.o
CC stubs/sysbus.o
CC stubs/trace-control.o
CC stubs/uuid.o
CC stubs/vm-stop.o
CC stubs/vmstate.o
CC stubs/cpus.o
CC stubs/kvm.o
CC stubs/qmp_pc_dimm_device_list.o
CC stubs/target-monitor-defs.o
CC stubs/target-get-monitor-def.o
CC stubs/vhost.o
CC stubs/iohandler.o
CC stubs/ipmi.o
CC stubs/smbios_type_38.o
CC stubs/pc_madt_cpu_entry.o
CC stubs/migration-colo.o
CC contrib/ivshmem-client/ivshmem-client.o
CC contrib/ivshmem-client/main.o
CC contrib/ivshmem-server/ivshmem-server.o
CC contrib/ivshmem-server/main.o
CC qemu-nbd.o
CC async.o
CC thread-pool.o
CC block.o
CC blockjob.o
CC main-loop.o
CC iohandler.o
CC qemu-timer.o
CC aio-posix.o
CC qemu-io-cmds.o
CC replication.o
CC block/raw_bsd.o
CC block/qcow.o
CC block/vdi.o
CC block/vmdk.o
CC block/cloop.o
CC block/bochs.o
CC block/vpc.o
CC block/vvfat.o
CC block/dmg.o
CC block/qcow2.o
CC block/qcow2-cluster.o
CC block/qcow2-refcount.o
CC block/qcow2-snapshot.o
CC block/qcow2-cache.o
CC block/qed.o
CC block/qed-gencb.o
CC block/qed-l2-cache.o
CC block/qed-table.o
CC block/qed-cluster.o
CC block/qed-check.o
CC block/vhdx.o
CC block/vhdx-endian.o
CC block/vhdx-log.o
CC block/quorum.o
CC block/parallels.o
CC block/blkdebug.o
CC block/blkverify.o
CC block/blkreplay.o
CC block/block-backend.o
CC block/snapshot.o
CC block/qapi.o
CC block/raw-posix.o
CC block/null.o
CC block/mirror.o
CC block/commit.o
CC block/io.o
CC block/throttle-groups.o
CC block/nvme.o
CC block/nbd.o
CC block/nbd-client.o
CC block/sheepdog.o
CC block/accounting.o
CC block/dirty-bitmap.o
CC block/write-threshold.o
CC block/backup.o
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_init’:
/tmp/qemu-test/src/block/nvme.c:615: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c:632: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_close’:
/tmp/qemu-test/src/block/nvme.c:685: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_detach_aio_context’:
/tmp/qemu-test/src/block/nvme.c:963: error: too many arguments to function ‘aio_set_event_notifier’
/tmp/qemu-test/src/block/nvme.c: In function ‘nvme_attach_aio_context’:
/tmp/qemu-test/src/block/nvme.c:972: error: too many arguments to function ‘aio_set_event_notifier’
make: *** [block/nvme.o] Error 1
make: *** Waiting for unfinished jobs....
make[1]: *** [docker-run] Error 2
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-m_z3bizr/src'
make: *** [docker-run-test-quick@centos6] Error 2
=== OUTPUT END ===
Test command exited with code: 2
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
@ 2016-12-21 1:38 ` Fam Zheng
0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-21 1:38 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, qemu-block, mreitz, borntraeger, stefanha, pbonzini, krister
On Tue, 12/20 15:04, no-reply@patchew.org wrote:
> ERROR: that open brace { should be on the previous line
> #287: FILE: util/vfio-helpers.c:214:
> + struct vfio_group_status group_status =
> + { .argsz = sizeof(group_status) };
Hmm, it may indeed look better.
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #145: FILE: block/nvme.c:92:
> + volatile uint32_t *doorbell;
>
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #169: FILE: block/nvme.c:116:
> +typedef volatile struct {
These are shared with hardware, and even volatile-considered-harmful.txt
admits that this can be valid:
> - Pointers to data structures in coherent memory which might be modified
> by I/O devices can, sometimes, legitimately be volatile. A ring buffer
> used by a network adapter, where that adapter changes pointers to
> indicate which descriptors have been processed, is an example of this
> type of situation.
Fam
* Re: [Qemu-devel] [Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver
2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
2016-12-20 16:39 ` Paolo Bonzini
@ 2016-12-21 11:59 ` Stefan Hajnoczi
2016-12-21 14:05 ` Fam Zheng
1 sibling, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2016-12-21 11:59 UTC (permalink / raw)
To: Fam Zheng
Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
Stefan Hajnoczi, Paolo Bonzini, Karl Rister
On Wed, Dec 21, 2016 at 12:31:39AM +0800, Fam Zheng wrote:
> This is a new protocol driver that exclusively opens a host NVMe
> controller through VFIO. It achieves better latency than linux-aio.
This is an interesting block driver to have for performance comparisons.
Definitely something that is worth merging.
> nvme:// linux-aio
> ------------------------------------------------------------------
> fio bs=4k iodepth=1 (IOPS) 30014 24009
> fio bs=4k iodepth=1 (IOPS) +polling 45148 34689
> fio bs=64k iodepth=1 (IOPS) 17801 14954
> fio bs=64k iodepth=1 (IOPS) +polling 17970 14564
>
> fio bs=4k iodepth=32 (IOPS) 110637 121480
> fio bs=4k iodepth=32 (IOPS) +polling 145533 166878
> fio bs=64k iodepth=32 (IOPS) 50525 50596
> fio bs=64k iodepth=32 (IOPS) +polling 50482 50534
>
> (host) qemu-img bench -c 8000000 (sec) 15.00 43.13
>
> ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> the used NVMe device, otherwise it defaults to 0).
>
> For the rows where linux-aio is faster, a lot of evidence shows that the
> io queue is more likely to stay full if the request completions happen
> faster, as in the nvme:// case, hence less batch submission and request
> merging than linux-aio.
I'm not sure I understand this paragraph. Sounds like you are saying
that nvme:// has lower latency and this defeats batch submission.
Higher numbers are still better at the end of the day so it's worth
studying this more closely and coming up with solutions. Maybe at a
certain rate of submission it makes sense to favor throughput
(batching) even with nvme://.
Regarding merging: are you doing sequential I/O? Please try random
instead.
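For reference, a random-read job along those lines could look like the
following fio job file (the filename and runtime are placeholders, not
values from the benchmarks in the patch):

```ini
; hypothetical job: 4k random reads at queue depth 1, which avoids the
; sequential request merging discussed above
[randread-qd1]
filename=/dev/vdb   ; guest block device backed by nvme:// (placeholder)
rw=randread
bs=4k
iodepth=1
direct=1
ioengine=libaio
runtime=60
time_based=1
```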
> This patch is co-authored by Paolo and me.
>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
> block/Makefile.objs | 1 +
> block/nvme.c | 1026 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 1027 insertions(+)
> create mode 100644 block/nvme.c
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 67a036a..fe975a6 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
> block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> block-obj-y += null.o mirror.o commit.o io.o
> block-obj-y += throttle-groups.o
> +block-obj-y += nvme.o
>
> block-obj-y += nbd.o nbd-client.o sheepdog.o
> block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> diff --git a/block/nvme.c b/block/nvme.c
> new file mode 100644
> index 0000000..5c9ffde
> --- /dev/null
> +++ b/block/nvme.c
> @@ -0,0 +1,1026 @@
> +/*
> + * NVMe block driver based on vfio
> + *
> + * Copyright 2016 Red Hat, Inc.
> + *
> + * Authors:
> + * Fam Zheng <famz@redhat.com>
> + * Paolo Bonzini <pbonzini@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <linux/vfio.h>
> +#include "qapi/error.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp/qstring.h"
> +#include "block/block_int.h"
> +#include "qemu/error-report.h"
> +#include "qemu/vfio-helpers.h"
> +
> +#define NVME_DEBUG 0
> +
> +#define DPRINTF(...) \
> + if (NVME_DEBUG) { \
> + printf(__VA_ARGS__); \
> + }
> +
> +#define NVME_SQ_ENTRY_BYTES 64
> +#define NVME_CQ_ENTRY_BYTES 16
> +#define NVME_QUEUE_SIZE 128
> +
> +/* TODO: share definitions with hw/block/nvme.c? */
Sounds good.
> +enum NvmeAdminCommands {
> + NVME_ADM_CMD_DELETE_SQ = 0x00,
> + NVME_ADM_CMD_CREATE_SQ = 0x01,
> + NVME_ADM_CMD_GET_LOG_PAGE = 0x02,
> + NVME_ADM_CMD_DELETE_CQ = 0x04,
> + NVME_ADM_CMD_CREATE_CQ = 0x05,
> + NVME_ADM_CMD_IDENTIFY = 0x06,
> + NVME_ADM_CMD_ABORT = 0x08,
> + NVME_ADM_CMD_SET_FEATURES = 0x09,
> + NVME_ADM_CMD_GET_FEATURES = 0x0a,
> + NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c,
> + NVME_ADM_CMD_ACTIVATE_FW = 0x10,
> + NVME_ADM_CMD_DOWNLOAD_FW = 0x11,
> + NVME_ADM_CMD_FORMAT_NVM = 0x80,
> + NVME_ADM_CMD_SECURITY_SEND = 0x81,
> + NVME_ADM_CMD_SECURITY_RECV = 0x82,
> +};
> +
> +enum NvmeIoCommands {
> + NVME_CMD_FLUSH = 0x00,
> + NVME_CMD_WRITE = 0x01,
> + NVME_CMD_READ = 0x02,
> + NVME_CMD_WRITE_UNCOR = 0x04,
> + NVME_CMD_COMPARE = 0x05,
> + NVME_CMD_DSM = 0x09,
> +};
> +
> +typedef struct {
> + uint8_t opcode;
> + uint8_t flags;
> + uint16_t cid;
> + uint32_t nsid;
> + uint64_t reserved;
> + uint64_t mptr;
> + uint64_t prp1;
> + uint64_t prp2;
> + uint32_t cdw10;
> + uint32_t cdw11;
> + uint32_t cdw12;
> + uint32_t cdw13;
> + uint32_t cdw14;
> + uint32_t cdw15;
> +} QEMU_PACKED NVMeCommand;
> +
> +typedef struct {
> + uint32_t cmd_specific;
> + uint32_t reserved;
> + uint16_t sq_head;
> + uint16_t sqid;
> + uint16_t cid;
> + uint16_t status;
> +} QEMU_PACKED NVMeCompletion;
> +
> +typedef struct {
> + int32_t head, tail;
> + uint8_t *queue;
> + uint64_t iova;
> + volatile uint32_t *doorbell;
> +} NVMeQueue;
> +
> +typedef struct {
> + BlockCompletionFunc *cb;
> + void *opaque;
> + int cid;
> + void *prp_list_page;
> + uint64_t prp_list_iova;
> +} NVMeRequest;
> +
> +typedef struct {
> + int index;
> + NVMeQueue sq, cq;
> + int cq_phase;
> + uint8_t *prp_list_pages;
> + uint64_t prp_list_base_iova;
> + NVMeRequest reqs[NVME_QUEUE_SIZE];
> + CoQueue wait_queue;
"free_req_queue" describes the purpose of this queue.
> + bool busy;
> + int need_kick;
> + int inflight;
> +} NVMeQueuePair;
> +
> +typedef volatile struct {
> + uint64_t cap;
> + uint32_t vs;
> + uint32_t intms;
> + uint32_t intmc;
> + uint32_t cc;
> + uint32_t reserved0;
> + uint32_t csts;
> + uint32_t nssr;
> + uint32_t aqa;
> + uint64_t asq;
> + uint64_t acq;
> + uint32_t cmbloc;
> + uint32_t cmbsz;
> + uint8_t reserved1[0xec0];
> + uint8_t cmd_set_specific[0x100];
> + uint32_t doorbells[];
> +} QEMU_PACKED NVMeRegs;
> +
> +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
> +
> +typedef struct {
> + QEMUVFIOState *vfio;
> + NVMeRegs *regs;
> + /* The submission/completion queue pairs.
> + * [0]: admin queue.
> + * [1..]: io queues.
> + */
> + NVMeQueuePair **queues;
> + int nr_queues;
> + size_t page_size;
> + /* How many uint32_t elements does each doorbell entry take. */
> + size_t doorbell_scale;
> + bool write_cache;
> + EventNotifier event_notifier;
"event_notifier" describes the type, not the purpose of the field.
"irq_notifier" is clearer.
> + uint64_t nsze; /* Namespace size reported by identify command */
> + int nsid; /* The namespace id to read/write data. */
> + uint64_t max_transfer;
> + int plugged;
> + Notifier vfree_notify;
> +} BDRVNVMeState;
> +
> +#define NVME_BLOCK_OPT_DEVICE "device"
> +#define NVME_BLOCK_OPT_NAMESPACE "namespace"
> +
> +static QemuOptsList runtime_opts = {
> + .name = "nvme",
> + .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> + .desc = {
> + {
> + .name = NVME_BLOCK_OPT_DEVICE,
> + .type = QEMU_OPT_STRING,
> + .help = "NVMe PCI device address",
> + },
> + {
> + .name = NVME_BLOCK_OPT_NAMESPACE,
> + .type = QEMU_OPT_NUMBER,
> + .help = "NVMe namespace",
> + },
> + { /* end of list */ }
> + },
> +};
> +
> +static void nvme_init_queue(BlockDriverState *bs, NVMeQueue *q,
> + int nentries, int entry_bytes, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + size_t bytes;
> + int r;
> +
> + bytes = ROUND_UP(nentries * entry_bytes, s->page_size);
> + q->head = q->tail = 0;
> + q->queue = qemu_try_blockalign0(bs, bytes);
> +
> + if (!q->queue) {
> + error_setg(errp, "Cannot allocate queue");
> + return;
> + }
> + r = qemu_vfio_dma_map(s->vfio, q->queue, bytes, true, &q->iova);
> + if (r) {
> + error_setg(errp, "Cannot map queue");
> + }
> +}
> +
> +static void nvme_free_queue_pair(BlockDriverState *bs, NVMeQueuePair *q)
> +{
> + qemu_vfree(q->prp_list_pages);
> + qemu_vfree(q->sq.queue);
> + qemu_vfree(q->cq.queue);
> + g_free(q);
> +}
> +
> +static NVMeQueuePair *nvme_create_queue_pair(BlockDriverState *bs,
> + int idx, int size,
> + Error **errp)
> +{
> + int i, r;
> + BDRVNVMeState *s = bs->opaque;
> + Error *local_err = NULL;
> + NVMeQueuePair *q = g_new0(NVMeQueuePair, 1);
> + uint64_t prp_list_iovas[NVME_QUEUE_SIZE];
> +
> + q->index = idx;
> + qemu_co_queue_init(&q->wait_queue);
> + q->prp_list_pages = qemu_blockalign0(bs, s->page_size * NVME_QUEUE_SIZE);
> + r = qemu_vfio_dma_map(s->vfio, q->prp_list_pages,
> + s->page_size * NVME_QUEUE_SIZE,
> + false, prp_list_iovas);
> + if (r) {
> + goto fail;
> + }
> + for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> + NVMeRequest *req = &q->reqs[i];
> + req->cid = i + 1;
> + req->prp_list_page = q->prp_list_pages + i * s->page_size;
> + req->prp_list_iova = prp_list_iovas[i];
> + }
> + nvme_init_queue(bs, &q->sq, size, NVME_SQ_ENTRY_BYTES, &local_err);
> + if (local_err) {
> + error_propagate(errp, local_err);
> + goto fail;
> + }
> + q->sq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale];
> +
> + nvme_init_queue(bs, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
> + if (local_err) {
> + error_propagate(errp, local_err);
> + goto fail;
> + }
> + q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> +
> + return q;
> +fail:
> + nvme_free_queue_pair(bs, q);
> + return NULL;
> +}
> +
> +static void nvme_kick(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> + if (s->plugged || !q->need_kick) {
> + return;
> + }
> + DPRINTF("nvme kick queue %d\n", q->index);
> + assert(!(q->sq.tail & 0xFF00));
> + /* Fence the write to submission queue entry before notifying the device. */
> + smp_wmb();
> + *q->sq.doorbell = cpu_to_le32(q->sq.tail);
> + q->inflight += q->need_kick;
> + q->need_kick = 0;
> +}
> +
> +static NVMeRequest *nvme_get_free_req(NVMeQueuePair *q)
> +{
> + int i;
> + if (q->inflight + q->need_kick > NVME_QUEUE_SIZE - 2) {
> + /* We have to leave one slot empty as that is the full queue case (head
> + * == tail + 1). */
> + return NULL;
> + }
> + for (i = 0; i < NVME_QUEUE_SIZE; i++) {
> + if (!q->reqs[i].cb) {
> + return &q->reqs[i];
> + }
> + }
> + return NULL;
> +}
> +
> +static inline int nvme_translate_error(const NVMeCompletion *c)
> +{
> + if ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> + DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status %x\n",
> + c->cmd_specific, c->sq_head, c->sqid, c->cid, c->status);
> + }
> + switch ((le16_to_cpu(c->status) >> 1) & 0xFF) {
> + case 0:
> + return 0;
> + case 1:
> + return -ENOSYS;
> + case 2:
> + return -EINVAL;
> + default:
> + return -EIO;
> + }
> +}
> +
> +static bool nvme_process_completion(BDRVNVMeState *s, NVMeQueuePair *q)
> +{
> + bool progress = false;
> + NVMeRequest *req;
> + NVMeCompletion *c;
> +
> + DPRINTF("nvme process completion %d inflight %d\n", q->index, q->inflight);
> + if (q->busy || s->plugged) {
> + DPRINTF("queue busy\n");
> + return false;
> + }
> + q->busy = true;
> + assert(q->inflight >= 0);
> + while (q->inflight) {
> + c = (NVMeCompletion *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> + if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> + break;
> + }
> + q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> + if (!q->cq.head) {
> + q->cq_phase = !q->cq_phase;
> + }
> + /* XXX: error handling instead of assert as it's from device? */
> + assert(c->cid > 0);
> + assert(c->cid <= NVME_QUEUE_SIZE);
> + DPRINTF("nvme completing command %d\n", c->cid);
> + req = &q->reqs[c->cid - 1];
> + assert(req->cid == c->cid);
> + assert(req->cb);
> + req->cb(req->opaque, nvme_translate_error(c));
> + req->cb = req->opaque = NULL;
> + qemu_co_enter_next(&q->wait_queue);
> + c->cid = 0;
> + q->inflight--;
> + /* Flip Phase Tag bit. */
> + c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> + progress = true;
> + }
> + if (progress) {
> + /* Notify the device so it can post more completions. */
> + smp_mb_release();
> + *q->cq.doorbell = cpu_to_le32(q->cq.head);
> + }
> + q->busy = false;
> + return progress;
> +}
> +
> +static void nvme_submit_command(BDRVNVMeState *s, NVMeQueuePair *q,
> + NVMeRequest *req,
> + NVMeCommand *cmd, BlockCompletionFunc cb,
> + void *opaque)
> +{
> + req->cb = cb;
> + req->opaque = opaque;
> + cmd->cid = cpu_to_le16(req->cid);
> + DPRINTF("nvme submit command %d to queue %d\n", req->cid, q->index);
> + memcpy((uint8_t *)q->sq.queue +
> + q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd));
> + q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE;
> + q->need_kick++;
> + nvme_kick(s, q);
> + nvme_process_completion(s, q);
> +}
> +
> +static void nvme_cmd_sync_cb(void *opaque, int ret)
> +{
> + int *pret = opaque;
> + *pret = ret;
> +}
> +
> +static int nvme_cmd_sync(BlockDriverState *bs, NVMeQueuePair *q,
> + NVMeCommand *cmd)
> +{
> + NVMeRequest *req;
> + BDRVNVMeState *s = bs->opaque;
> + int ret = -EINPROGRESS;
> + req = nvme_get_free_req(q);
> + if (!req) {
> + return -EBUSY;
> + }
> + nvme_submit_command(s, q, req, cmd, nvme_cmd_sync_cb, &ret);
> +
> + BDRV_POLL_WHILE(bs, ret == -EINPROGRESS);
> + return ret;
> +}
> +
> +static bool nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + uint8_t *resp;
> + int r;
> + uint64_t iova;
> + NVMeCommand cmd = {
> + .opcode = NVME_ADM_CMD_IDENTIFY,
> + .cdw10 = cpu_to_le32(0x1),
> + };
> +
> + resp = qemu_try_blockalign0(bs, 4096);
> + if (!resp) {
> + error_setg(errp, "Cannot allocate buffer for identify response");
> + return false;
> + }
> + r = qemu_vfio_dma_map(s->vfio, resp, 4096, true, &iova);
> + if (r) {
> + error_setg(errp, "Cannot map buffer for DMA");
> + goto fail;
> + }
> + cmd.prp1 = cpu_to_le64(iova);
> +
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to identify controller");
> + goto fail;
> + }
> +
> + if (le32_to_cpu(*(uint32_t *)&resp[516]) < namespace) {
> + error_setg(errp, "Invalid namespace");
> + goto fail;
> + }
> + s->write_cache = le32_to_cpu(resp[525]) & 0x1;
> + s->max_transfer = (resp[77] ? 1 << resp[77] : 0) * s->page_size;
> + /* For now the page list buffer per command is one page, to hold at most
> + * s->page_size / sizeof(uint64_t) entries. */
> + s->max_transfer = MIN_NON_ZERO(s->max_transfer,
> + s->page_size / sizeof(uint64_t) * s->page_size);
> +
> + memset((char *)resp, 0, 4096);
> +
> + cmd.cdw10 = 0;
> + cmd.nsid = namespace;
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to identify namespace");
> + goto fail;
> + }
> +
> + /*qemu_hexdump((const char *)resp, stdout, "NS", 4096);*/
> + s->nsze = le64_to_cpu(*(uint64_t *)&resp[0]);
> +
> + qemu_vfree(resp);
> + return true;
> +fail:
> + qemu_vfree(resp);
> + return false;
> +}
> +
> +static void nvme_handle_event(EventNotifier *n)
> +{
> + int i;
> + BDRVNVMeState *s = container_of(n, BDRVNVMeState, event_notifier);
> +
> + DPRINTF("nvme handle event\n");
> + event_notifier_test_and_clear(n);
> + for (i = 0; i < s->nr_queues; i++) {
> + while (nvme_process_completion(s, s->queues[i])) {
> + /* Keep polling until no progress. */
> + }
> + }
> +}
> +
> +static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + int n = s->nr_queues;
> + NVMeQueuePair *q;
> + NVMeCommand cmd;
> + int queue_size = NVME_QUEUE_SIZE;
> +
> + q = nvme_create_queue_pair(bs, n, queue_size, errp);
> + if (!q) {
> + return false;
> + }
> + cmd = (NVMeCommand) {
> + .opcode = NVME_ADM_CMD_CREATE_CQ,
> + .prp1 = cpu_to_le64(q->cq.iova),
> + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> + .cdw11 = cpu_to_le32(0x3),
> + };
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to create io queue [%d]", n);
> + nvme_free_queue_pair(bs, q);
> + return false;
> + }
> + cmd = (NVMeCommand) {
> + .opcode = NVME_ADM_CMD_CREATE_SQ,
> + .prp1 = cpu_to_le64(q->sq.iova),
> + .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> + .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> + };
> + if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> + error_setg(errp, "Failed to create io queue [%d]", n);
> + nvme_free_queue_pair(bs, q);
> + return false;
> + }
> + s->queues = g_renew(NVMeQueuePair *, s->queues, n + 1);
> + s->queues[n] = q;
> + s->nr_queues++;
> + return true;
> +}
> +
> +static bool nvme_poll_cb(void *opaque)
> +{
> + int i;
> + EventNotifier *e = opaque;
> + BDRVNVMeState *s = container_of(e, BDRVNVMeState, event_notifier);
> + bool progress = false;
> +
> + DPRINTF("nvme poll cb\n");
> + for (i = 0; i < s->nr_queues; i++) {
> + while (nvme_process_completion(s, s->queues[i])) {
> + progress = true;
> + }
> + }
> + return progress;
> +}
> +
> +static void nvme_vfree_cb(Notifier *n, void *p)
> +{
> + BDRVNVMeState *s = container_of(n, BDRVNVMeState, vfree_notify);
> + qemu_vfio_dma_unmap(s->vfio, p);
> +}
> +
> +static int nvme_init(BlockDriverState *bs, const char *device, int namespace,
> + Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + int ret;
> + uint64_t cap;
> + uint64_t timeout_ms;
> + uint64_t deadline, now;
> +
> + s->nsid = namespace;
> + ret = event_notifier_init(&s->event_notifier, 0);
> + if (ret) {
> + error_setg(errp, "Failed to init event notifier");
> + return ret;
> + }
> +
> + s->vfree_notify.notify = nvme_vfree_cb;
> + qemu_vfree_add_notifier(&s->vfree_notify);
> +
> + s->vfio = qemu_vfio_open_pci(device, errp);
> + if (!s->vfio) {
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, errp);
> + if (!s->regs) {
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + /* Perform initialize sequence as described in NVMe spec "7.6.1
> + * Initialization". */
> +
> + cap = le64_to_cpu(s->regs->cap);
> + if (!(cap & (1ULL << 37))) {
> + error_setg(errp, "Device doesn't support NVMe command set");
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + s->page_size = MAX(4096, 1 << (12 + ((cap >> 48) & 0xF)));
> + s->doorbell_scale = (4 << (((cap >> 32) & 0xF))) / sizeof(uint32_t);
> + bs->bl.opt_mem_alignment = s->page_size;
> + timeout_ms = MIN(500 * ((cap >> 24) & 0xFF), 30000);
> +
> + /* Reset device to get a clean state. */
> + s->regs->cc = cpu_to_le32(le32_to_cpu(s->regs->cc) & 0xFE);
> + /* Wait for CSTS.RDY = 0. */
> + deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout_ms * 1000000ULL;
> + while (le32_to_cpu(s->regs->csts) & 0x1) {
> + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> + error_setg(errp, "Timeout while waiting for device to reset (%ld ms)",
> + timeout_ms);
> + ret = -ETIMEDOUT;
> + goto fail;
> + }
> + }
> +
> + /* Set up admin queue. */
> + s->queues = g_new(NVMeQueuePair *, 1);
> + s->nr_queues = 1;
> + s->queues[0] = nvme_create_queue_pair(bs, 0, NVME_QUEUE_SIZE, errp);
> + if (!s->queues[0]) {
> + ret = -EINVAL;
> + goto fail;
> + }
> + QEMU_BUILD_BUG_ON(NVME_QUEUE_SIZE & 0xF000);
> + s->regs->aqa = cpu_to_le32((NVME_QUEUE_SIZE << 16) | NVME_QUEUE_SIZE);
> + s->regs->asq = cpu_to_le64(s->queues[0]->sq.iova);
> + s->regs->acq = cpu_to_le64(s->queues[0]->cq.iova);
> +
> + /* After setting up all control registers we can enable device now. */
> + s->regs->cc = cpu_to_le32((ctz32(NVME_CQ_ENTRY_BYTES) << 20) |
> + (ctz32(NVME_SQ_ENTRY_BYTES) << 16) |
> + 0x1);
> + /* Wait for CSTS.RDY = 1. */
> + now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> + deadline = now + timeout_ms * 1000000;
> + while (!(le32_to_cpu(s->regs->csts) & 0x1)) {
> + if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) > deadline) {
> + error_setg(errp, "Timeout while waiting for device to start (%ld ms)",
> + timeout_ms);
> + ret = -ETIMEDOUT;
> + goto fail_queue;
> + }
> + }
> +
> + ret = qemu_vfio_pci_init_irq(s->vfio, &s->event_notifier,
> + VFIO_PCI_MSIX_IRQ_INDEX, errp);
> + if (ret) {
> + goto fail_queue;
> + }
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, nvme_handle_event, nvme_poll_cb);
> +
> + if (!nvme_identify(bs, namespace, errp)) {
> + ret = -EIO;
> + goto fail_handler;
> + }
> +
> + /* Set up command queues. */
> + /* XXX: multiple io queues? */
> + if (!nvme_add_io_queue(bs, errp)) {
> + ret = -EIO;
> + goto fail_handler;
> + }
> + return 0;
> +
> +fail_handler:
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, NULL, NULL);
> +fail_queue:
> + nvme_free_queue_pair(bs, s->queues[0]);
> +fail:
> + qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> + qemu_vfio_close(s->vfio);
> + event_notifier_cleanup(&s->event_notifier);
> + return ret;
> +}
> +
> +static void nvme_parse_filename(const char *filename, QDict *options,
> + Error **errp)
> +{
> + int pref = strlen("nvme://");
> +
> + /* XXX: support namespace in URI like nvme://0000:00:00.0/1 ? */
> + if (strlen(filename) > pref && !strncmp(filename, "nvme://", pref)) {
> + qdict_put(options, NVME_BLOCK_OPT_DEVICE,
> + qstring_from_str(filename + pref));
> + }
> +}
> +
> +static int nvme_file_open(BlockDriverState *bs, QDict *options, int flags,
> + Error **errp)
> +{
> + const char *device;
> + QemuOpts *opts;
> + int namespace;
> +
> + opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> + qemu_opts_absorb_qdict(opts, options, &error_abort);
> + device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> + if (!device) {
> + error_setg(errp, "'" NVME_BLOCK_OPT_DEVICE "' option is required");
> + return -EINVAL;
> + }
> +
> + namespace = qemu_opt_get_number(opts, NVME_BLOCK_OPT_NAMESPACE, 1);
> + nvme_init(bs, device, namespace, errp);
> +
> + qemu_opts_del(opts);
> + return 0;
> +}
> +
> +static void nvme_close(BlockDriverState *bs)
> +{
> + int i;
> + BDRVNVMeState *s = bs->opaque;
> +
> + for (i = 0; i < s->nr_queues; ++i) {
> + nvme_free_queue_pair(bs, s->queues[i]);
> + }
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, NULL, NULL);
> + notifier_remove(&s->vfree_notify);
> + qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs);
> + qemu_vfio_close(s->vfio);
> +}
> +
> +static int64_t nvme_getlength(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> +
> + return s->nsze << BDRV_SECTOR_BITS;
> +}
> +
> +static inline int nvme_cmd_map_qiov(BlockDriverState *bs, NVMeCommand *cmd,
> + NVMeRequest *req, QEMUIOVector *qiov)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + uint64_t *pagelist = req->prp_list_page;
> + int i, r;
> + unsigned int entries = 0;
> +
> + assert(qiov->size);
> + assert(QEMU_IS_ALIGNED(qiov->size, s->page_size));
> + assert(qiov->size / s->page_size <= s->page_size / sizeof(uint64_t));
> + for (i = 0; i < qiov->niov; ++i) {
> + r = qemu_vfio_dma_map(s->vfio,
> + qiov->iov[i].iov_base,
> + qiov->iov[i].iov_len,
> + false, &pagelist[entries]);
> + if (r) {
> + return r;
> + }
> +
> + entries += qiov->iov[i].iov_len / s->page_size;
> + assert(entries <= s->page_size / sizeof(uint64_t));
> + }
> +
> + switch (entries) {
> + case 0:
> + abort();
> + case 1:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = 0;
> + break;
> + case 2:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = cpu_to_le64(pagelist[1]);
> + break;
> + default:
> + cmd->prp1 = cpu_to_le64(pagelist[0]);
> + cmd->prp2 = cpu_to_le64(req->prp_list_iova);
> + for (i = 0; i < entries - 1; ++i) {
> + pagelist[i] = cpu_to_le64(pagelist[i + 1]);
> + }
> + pagelist[entries] = 0;
> + break;
> + }
> + return 0;
> +}
> +
> +typedef struct {
> + Coroutine *co;
> + int ret;
> + AioContext *ctx;
> +} NVMeCoData;
> +
> +static void nvme_rw_cb_bh(void *opaque)
> +{
> + NVMeCoData *data = opaque;
> + qemu_coroutine_enter(data->co);
> +}
> +
> +static void nvme_rw_cb(void *opaque, int ret)
> +{
> + NVMeCoData *data = opaque;
> + data->ret = ret;
> + if (!data->co) {
> + /* The rw coroutine hasn't yielded, don't try to enter. */
> + return;
> + }
> + if (qemu_coroutine_self() != data->co) {
> + qemu_coroutine_enter(data->co);
> + } else {
> + aio_bh_schedule_oneshot(data->ctx, nvme_rw_cb_bh, data);
> + }
> +}
> +
> +static coroutine_fn int nvme_co_prw_aligned(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov,
> + bool is_write)
> +{
> + int r;
> + BDRVNVMeState *s = bs->opaque;
> + NVMeQueuePair *ioq = s->queues[1];
> + NVMeRequest *req;
> + NVMeCommand cmd = {
> + .opcode = is_write ? NVME_CMD_WRITE : NVME_CMD_READ,
> + .nsid = cpu_to_le32(s->nsid),
> + .cdw10 = cpu_to_le32((offset >> BDRV_SECTOR_BITS) & 0xFFFFFFFF),
> + .cdw11 = cpu_to_le32(((offset >> BDRV_SECTOR_BITS) >> 32) & 0xFFFFFFFF),
> + .cdw12 = cpu_to_le32(((bytes >> BDRV_SECTOR_BITS) - 1) & 0xFFFF),
> + };
> + NVMeCoData data = {
> + .ctx = bdrv_get_aio_context(bs),
> + .ret = -EINPROGRESS,
> + };
> +
> + DPRINTF(">>> NVMe %s offset %lx bytes %lx qiov[%d]\n",
> + is_write ? "write" : "read",
> + offset, bytes, qiov->niov);
> + assert(s->nr_queues > 1);
> + while (true) {
> + req = nvme_get_free_req(ioq);
> + if (req) {
> + break;
> + }
> + DPRINTF("nvme wait req\n");
> + qemu_co_queue_wait(&ioq->wait_queue);
> + DPRINTF("nvme wait req done\n");
> + }
> +
> + r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
> + if (r) {
> + return r;
Is req leaked?
> + }
> + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> + data.co = qemu_coroutine_self();
> + while (data.ret == -EINPROGRESS) {
> + qemu_coroutine_yield();
> + }
> +
> + DPRINTF("<<< NVMe %s offset %lx bytes %lx ret %d\n",
> + is_write ? "write" : "read",
> + offset, bytes, data.ret);
> + return data.ret;
> +}
> +
> +static inline bool nvme_qiov_aligned(BlockDriverState *bs,
> + const QEMUIOVector *qiov)
> +{
> + int i;
> + BDRVNVMeState *s = bs->opaque;
> +
> + for (i = 0; i < qiov->niov; ++i) {
> + if (!QEMU_PTR_IS_ALIGNED(qiov->iov[i].iov_base, s->page_size) ||
> + !QEMU_IS_ALIGNED(qiov->iov[i].iov_len, s->page_size)) {
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static inline coroutine_fn
> +int nvme_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, bool is_write)
> +{
> + int r;
> + uint8_t *buf;
> + QEMUIOVector local_qiov;
> +
> + assert(offset % BDRV_SECTOR_SIZE == 0);
> + assert(bytes % BDRV_SECTOR_SIZE == 0);
> + if (nvme_qiov_aligned(bs, qiov)) {
> + return nvme_co_prw_aligned(bs, offset, bytes, qiov, is_write);
> + }
> + buf = qemu_try_blockalign(bs, bytes);
> +
> + if (!buf) {
> + return -ENOMEM;
> + }
> + qemu_iovec_init(&local_qiov, 1);
> + qemu_iovec_add(&local_qiov, buf, bytes);
> + r = nvme_co_prw_aligned(bs, offset, bytes, &local_qiov, is_write);
> + qemu_iovec_destroy(&local_qiov);
> + if (!r) {
> + qemu_iovec_from_buf(qiov, 0, buf, bytes);
> + }
> + qemu_vfree(buf);
> + return r;
> +}
> +
> +static coroutine_fn int nvme_co_preadv(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, int flags)
> +{
> + return nvme_co_prw(bs, offset, bytes, qiov, false);
> +}
> +
> +static coroutine_fn int nvme_co_pwritev(BlockDriverState *bs,
> + uint64_t offset, uint64_t bytes,
> + QEMUIOVector *qiov, int flags)
> +{
> + return nvme_co_prw(bs, offset, bytes, qiov, true);
> +}
> +
> +static coroutine_fn int nvme_co_flush(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + NVMeQueuePair *ioq = s->queues[1];
> + NVMeRequest *req;
> + NVMeCommand cmd = {
> + .opcode = NVME_CMD_FLUSH,
> + .nsid = cpu_to_le32(s->nsid),
> + };
> + NVMeCoData data = {
> + .ctx = bdrv_get_aio_context(bs),
> + .ret = -EINPROGRESS,
> + };
> +
> + assert(s->nr_queues > 1);
> + while (true) {
> + req = nvme_get_free_req(ioq);
> + if (req) {
> + break;
> + }
> + qemu_co_queue_wait(&ioq->wait_queue);
> + }
> +
> + nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> + data.co = qemu_coroutine_self();
> + if (data.ret == -EINPROGRESS) {
> + qemu_coroutine_yield();
> + }
> +
> + return data.ret;
> +}
> +
> +
> +static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> + BlockReopenQueue *queue, Error **errp)
> +{
> + return 0;
> +}
> +
> +static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
> + int64_t sector_num,
> + int nb_sectors, int *pnum,
> + BlockDriverState **file)
> +{
> + *pnum = nb_sectors;
> + *file = bs;
> +
> + return BDRV_BLOCK_OFFSET_VALID | (sector_num << BDRV_SECTOR_BITS);
> +}
> +
> +static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
> +{
> + QINCREF(opts);
> + qdict_del(opts, "filename");
> +
> + if (!qdict_size(opts)) {
> + snprintf(bs->exact_filename, sizeof(bs->exact_filename), "%s://",
> + bs->drv->format_name);
> + }
> +
> + qdict_put(opts, "driver", qstring_from_str(bs->drv->format_name));
> + bs->full_open_options = opts;
> +}
> +
> +static void nvme_refresh_limits(BlockDriverState *bs, Error **errp)
> +{
> + BDRVNVMeState *s = bs->opaque;
> +
> + /* XXX: other fields from identify command/controller capacity? */
> + bs->bl.opt_mem_alignment = s->page_size;
> + /* XXX: are there other limits? */
> + bs->bl.request_alignment = s->page_size;
> + bs->bl.max_transfer = s->max_transfer;
> +}
> +
> +static void nvme_detach_aio_context(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> +
> + aio_set_event_notifier(bdrv_get_aio_context(bs), &s->event_notifier,
> + false, NULL, NULL);
> +}
> +
> +static void nvme_attach_aio_context(BlockDriverState *bs,
> + AioContext *new_context)
> +{
> + BDRVNVMeState *s = bs->opaque;
> +
> + aio_set_event_notifier(new_context, &s->event_notifier,
> + false, nvme_handle_event, nvme_poll_cb);
> +}
> +
> +static void nvme_aio_plug(BlockDriverState *bs)
> +{
> + BDRVNVMeState *s = bs->opaque;
> + s->plugged++;
> +}
> +
> +static void nvme_aio_unplug(BlockDriverState *bs)
> +{
> + int i;
> + BDRVNVMeState *s = bs->opaque;
> + assert(s->plugged);
> + if (!--s->plugged) {
> + for (i = 1; i < s->nr_queues; i++) {
> + nvme_kick(s, s->queues[i]);
> + nvme_process_completion(s, s->queues[i]);
> + }
> + }
> +}
> +
> +static BlockDriver bdrv_nvme = {
> + .format_name = "nvme",
> + .protocol_name = "nvme",
> + .instance_size = sizeof(BDRVNVMeState),
> +
> + .bdrv_parse_filename = nvme_parse_filename,
> + .bdrv_file_open = nvme_file_open,
> + .bdrv_close = nvme_close,
> + .bdrv_getlength = nvme_getlength,
> +
> + .bdrv_co_preadv = nvme_co_preadv,
> + .bdrv_co_pwritev = nvme_co_pwritev,
> + .bdrv_co_flush_to_disk = nvme_co_flush,
> + .bdrv_reopen_prepare = nvme_reopen_prepare,
> +
> + .bdrv_co_get_block_status = nvme_co_get_block_status,
> +
> + .bdrv_refresh_filename = nvme_refresh_filename,
> + .bdrv_refresh_limits = nvme_refresh_limits,
> +
> + .bdrv_detach_aio_context = nvme_detach_aio_context,
> + .bdrv_attach_aio_context = nvme_attach_aio_context,
> +
> + .bdrv_io_plug = nvme_aio_plug,
> + .bdrv_io_unplug = nvme_aio_unplug,
> +};
> +
> +static void bdrv_nvme_init(void)
> +{
> + bdrv_register(&bdrv_nvme);
> +}
> +
> +block_init(bdrv_nvme_init);
> --
> 2.9.3
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver
2016-12-21 11:59 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2016-12-21 14:05 ` Fam Zheng
0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-21 14:05 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
Stefan Hajnoczi, Paolo Bonzini, Karl Rister
On Wed, 12/21 11:59, Stefan Hajnoczi wrote:
> On Wed, Dec 21, 2016 at 12:31:39AM +0800, Fam Zheng wrote:
> > This is a new protocol driver that exclusively opens a host NVMe
> > controller through VFIO. It achieves better latency than linux-aio.
>
> This is an interesting block driver to have for performance comparisons.
> Definitely something that is worth merging.
>
> > nvme:// linux-aio
> > ------------------------------------------------------------------
> > fio bs=4k iodepth=1 (IOPS) 30014 24009
> > fio bs=4k iodepth=1 (IOPS) +polling 45148 34689
> > fio bs=64k iodepth=1 (IOPS) 17801 14954
> > fio bs=64k iodepth=1 (IOPS) +polling 17970 14564
> >
> > fio bs=4k iodepth=32 (IOPS) 110637 121480
> > fio bs=4k iodepth=32 (IOPS) +polling 145533 166878
> > fio bs=64k iodepth=32 (IOPS) 50525 50596
> > fio bs=64k iodepth=32 (IOPS) +polling 50482 50534
> >
> > (host) qemu-img bench -c 8000000 (sec) 15.00 43.13
> >
> > ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> > the used NVMe device, otherwise it defaults to 0).
> >
> > For the rows where linux-aio is faster, a lot of evidence shows that the
> > io queue is more likely to stay full if the request completions happen
> > faster, as in the nvme:// case, hence less batch submission and request
> > merging than linux-aio.
>
> I'm not sure I understand this paragraph. Sounds like you are saying
> that nvme:// has lower latency and this defeats batch submission.
> Higher numbers are still better at the end of the day so it's worth
> studying this more closely and coming up with solutions. Maybe at a
> certain rate of submission it makes sense to favor throughput
> (batching) even with nvme://.
Good question! Busy polling at nvme:// side reduces batched completion, and busy
polling at virtio side reduces batched submission. I think this is a common
pattern and we should figure out a general strategy to "favor throughput"
based on the rate.
>
> Regarding merging: are you doing sequential I/O? Please try random
> instead.
Yes, it is sequential. I've also tried random but the results are mostly the
same. It means merging is the smaller factor here, and host <-> guest
communication is the bigger one. Maybe we should go ahead and experiment with
busy polling at the driver side now.
> > +typedef struct {
> > + int index;
> > + NVMeQueue sq, cq;
> > + int cq_phase;
> > + uint8_t *prp_list_pages;
> > + uint64_t prp_list_base_iova;
> > + NVMeRequest reqs[NVME_QUEUE_SIZE];
> > + CoQueue wait_queue;
>
> "free_req_queue" describes the purpose of this queue.
OK, I'll rename it.
>
> > + bool busy;
> > + int need_kick;
> > + int inflight;
> > +} NVMeQueuePair;
> > +
> > +typedef volatile struct {
> > + uint64_t cap;
> > + uint32_t vs;
> > + uint32_t intms;
> > + uint32_t intmc;
> > + uint32_t cc;
> > + uint32_t reserved0;
> > + uint32_t csts;
> > + uint32_t nssr;
> > + uint32_t aqa;
> > + uint64_t asq;
> > + uint64_t acq;
> > + uint32_t cmbloc;
> > + uint32_t cmbsz;
> > + uint8_t reserved1[0xec0];
> > + uint8_t cmd_set_specfic[0x100];
> > + uint32_t doorbells[];
> > +} QEMU_PACKED NVMeRegs;
> > +
> > +QEMU_BUILD_BUG_ON(offsetof(NVMeRegs, doorbells) != 0x1000);
> > +
> > +typedef struct {
> > + QEMUVFIOState *vfio;
> > + NVMeRegs *regs;
> > + /* The submission/completion queue pairs.
> > + * [0]: admin queue.
> > + * [1..]: io queues.
> > + */
> > + NVMeQueuePair **queues;
> > + int nr_queues;
> > + size_t page_size;
> > + /* How many uint32_t elements does each doorbell entry take. */
> > + size_t doorbell_scale;
> > + bool write_cache;
> > + EventNotifier event_notifier;
>
> "event_notifier" describes the type, not the purpose of the field.
> "irq_notifier" is clearer.
Good idea.
> > + while (true) {
> > + req = nvme_get_free_req(ioq);
> > + if (req) {
> > + break;
> > + }
> > + DPRINTF("nvme wait req\n");
> > + qemu_co_queue_wait(&ioq->wait_queue);
> > + DPRINTF("nvme wait req done\n");
> > + }
> > +
> > + r = nvme_cmd_map_qiov(bs, &cmd, req, qiov);
> > + if (r) {
> > + return r;
>
> Is req leaked?
No. It is a pointer to an ioq->reqs element. ioq->reqs is an object pool with
the same lifecycle as ioq. Until nvme_submit_command() is called, *req is still
"free".
Fam
* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
@ 2016-12-21 15:46 ` Paolo Bonzini
2016-12-21 16:19 ` Fam Zheng
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-21 15:46 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: qemu-block, Karl Rister, Kevin Wolf, Max Reitz, borntraeger,
Stefan Hajnoczi
On 20/12/2016 17:31, Fam Zheng wrote:
> + hbitmap_iter_init(&iter, s->free_chunks, 1);
> + if (contiguous) {
> + while (true) {
> + bool satisfy = true;
> + next = hbitmap_iter_next(&iter);
> + if (next < 0) {
> + return NULL;
> + }
> + for (i = 1; i < chunks; i++) {
> + if (!hbitmap_get(s->free_chunks, next + i)) {
> + satisfy = false;
> + break;
> + }
> + }
> + if (satisfy) {
> + break;
> + }
> + }
> + hbitmap_reset(s->free_chunks, next, chunks);
> + r = g_new(IOVARange, 1);
> + r->iova = next * pages_per_chunk * getpagesize();
> + r->nr_pages = pages;
> + QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> + } else {
> + next = hbitmap_iter_next(&iter);
> + while (pages) {
> + uint64_t chunk;
> + if (next < 0) {
> + hbitmap_iter_init(&iter, s->free_chunks, 1);
> + next = hbitmap_iter_next(&iter);
> + }
> + assert(next >= 0);
> + chunk = next;
> + DPRINTF("using chunk %ld\n", chunk);
> + next = hbitmap_iter_next(&iter);
> + hbitmap_reset(s->free_chunks, chunk, 1);
> + if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
> + r->nr_pages += MIN(pages, pages_per_chunk);
> + } else {
> + r = g_new(IOVARange, 1);
> + r->iova = chunk * pages_per_chunk * getpagesize();
> + r->nr_pages = MIN(pages, pages_per_chunk);
> + QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> + }
> + pages -= MIN(pages, pages_per_chunk);
> + }
I'm not sure HBitmap tracking is useful. If we exhaust the IOVA space,
we can just throw everything away with a single VFIO_IOMMU_UNMAP_DMA.
Then replay the RAMBlockNotifier mappings (we need to add this anyway
for hotplug support) and keep on mapping lazily whatever comes later.
Thanks,
Paolo
* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
2016-12-21 15:46 ` Paolo Bonzini
@ 2016-12-21 16:19 ` Fam Zheng
2016-12-21 17:02 ` Paolo Bonzini
0 siblings, 1 reply; 20+ messages in thread
From: Fam Zheng @ 2016-12-21 16:19 UTC (permalink / raw)
To: Paolo Bonzini
Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
On Wed, 12/21 16:46, Paolo Bonzini wrote:
>
>
> On 20/12/2016 17:31, Fam Zheng wrote:
> > + hbitmap_iter_init(&iter, s->free_chunks, 1);
> > + if (contiguous) {
> > + while (true) {
> > + bool satisfy = true;
> > + next = hbitmap_iter_next(&iter);
> > + if (next < 0) {
> > + return NULL;
> > + }
> > + for (i = 1; i < chunks; i++) {
> > + if (!hbitmap_get(s->free_chunks, next + i)) {
> > + satisfy = false;
> > + break;
> > + }
> > + }
> > + if (satisfy) {
> > + break;
> > + }
> > + }
> > + hbitmap_reset(s->free_chunks, next, chunks);
> > + r = g_new(IOVARange, 1);
> > + r->iova = next * pages_per_chunk * getpagesize();
> > + r->nr_pages = pages;
> > + QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> > + } else {
> > + next = hbitmap_iter_next(&iter);
> > + while (pages) {
> > + uint64_t chunk;
> > + if (next < 0) {
> > + hbitmap_iter_init(&iter, s->free_chunks, 1);
> > + next = hbitmap_iter_next(&iter);
> > + }
> > + assert(next >= 0);
> > + chunk = next;
> > + DPRINTF("using chunk %ld\n", chunk);
> > + next = hbitmap_iter_next(&iter);
> > + hbitmap_reset(s->free_chunks, chunk, 1);
> > + if (r && r->iova + r->nr_pages == chunk * pages_per_chunk) {
> > + r->nr_pages += MIN(pages, pages_per_chunk);
> > + } else {
> > + r = g_new(IOVARange, 1);
> > + r->iova = chunk * pages_per_chunk * getpagesize();
> > + r->nr_pages = MIN(pages, pages_per_chunk);
> > + QSIMPLEQ_INSERT_TAIL(&m.iova_list, r, next);
> > + }
> > + pages -= MIN(pages, pages_per_chunk);
> > + }
>
> I'm not sure HBitmap tracking is useful. If we exhaust the IOVA space,
> we can just throw everything away with a single VFIO_IOMMU_UNMAP_DMA.
> Then replay the RAMBlockNotifier mappings (we need to add this anyway
> for hotplug support) and keep on mapping lazily whatever comes later.
It's clever! It'd be a bit more complicated than that, though. Things like
queues etc in block/nvme.c have to be preserved, and if we already ensure that,
ram blocks can be preserved similarly, but indeed bounce buffers can be handled
that way. I still need to think about how to make sure none of the invalidated
IOVA addresses are in use by other requests.
Also I wonder how expensive the huge VFIO_IOMMU_UNMAP_DMA is. In the worst case
the "throwaway" IOVAs can be limited to a small range.
Fam
>
> Thanks,
>
> Paolo
* Re: [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library
2016-12-21 16:19 ` Fam Zheng
@ 2016-12-21 17:02 ` Paolo Bonzini
0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-21 17:02 UTC (permalink / raw)
To: Fam Zheng
Cc: qemu-devel, qemu-block, Karl Rister, Kevin Wolf, Max Reitz,
borntraeger, Stefan Hajnoczi
On 21/12/2016 17:19, Fam Zheng wrote:
> It's clever! It'd be a bit more complicated than that, though. Things like
> queues etc in block/nvme.c have to be preserved, and if we already ensure that,
> ram blocks can be preserved similarly, but indeed bounce buffers can be handled
> that way. I still need to think about how to make sure none of the invalidated
> IOVA addresses are in use by other requests.
Hmm, that's true. As you said, we'll probably want to split the IOVA
space in two, with a relatively small part for "volatile" addresses.
You can add two counters that track how many requests are using volatile
space. When it's time to do the VFIO_IOMMU_UNMAP_DMA, you do something
like:
if (vfio->next_phase == vfio->current_phase) {
vfio->next_phase = !vfio->current_phase;
while (vfio->request_counter[vfio->current_phase] != 0) {
wait on CoQueue
}
ioctl(VFIO_IOMMU_UNMAP_DMA)
vfio->current_phase = vfio->next_phase;
wake up everyone on CoQueue
} else {
/* wait for the unmap to happen */
while (vfio->next_phase != vfio->current_phase) {
wait on CoQueue
}
}
As an optimization, incrementing/decrementing request_counter can be
delayed until you find an item of the QEMUIOVector that needs a volatile
IOVA. Then it should never be incremented in practice during guest
execution.
Paolo
> Also I wonder how expensive the huge VFIO_IOMMU_UNMAP_DMA is. In the worst case
> the "throwaway" IOVAs can be limited to a small range.
* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
@ 2016-12-22 9:56 ` Paolo Bonzini
2017-01-11 5:38 ` Stefan Weil
1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-12-22 9:56 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
Karl Rister
> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index 8f3a592..dc9b321 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> err = g_malloc0(nb_pfn * sizeof (int));
>
> if (entry->vaddr_base != NULL) {
> + ram_block_removed(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
> @@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
> BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
>
> + ram_block_added(entry->vaddr_base, entry->size);
> bitmap_zero(entry->valid_mapping, nb_pfn);
> for (i = 0; i < nb_pfn; i++) {
> if (!err[i]) {
> @@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
> }
>
> pentry->next = entry->next;
> + ram_block_removed(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
>
My mistake here, these should be ram_block_notify_{add,remove} instead.
Paolo
* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
` (5 preceding siblings ...)
2016-12-21 0:48 ` no-reply
@ 2016-12-29 4:09 ` Tian, Kevin
2016-12-30 0:46 ` Fam Zheng
6 siblings, 1 reply; 20+ messages in thread
From: Tian, Kevin @ 2016-12-29 4:09 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
Paolo Bonzini, Karl Rister
> From: Fam Zheng
> Sent: Wednesday, December 21, 2016 12:32 AM
>
> This series adds a new protocol driver that is intended to achieve about 20%
> better performance for latency bound workloads (i.e. synchronous I/O) than
> linux-aio when guest is exclusively accessing a NVMe device, by talking to the
> device directly instead of through kernel file system layers and its NVMe
> driver.
>
Curious... if the NVMe device is exclusively owned by the guest, why
not pass it through to the guest directly? Is it a tradeoff between
performance (better than linux-aio) and composability (snapshot and
live migration, which are not supported by direct passthrough)?
Thanks
Kevin
* Re: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
2016-12-29 4:09 ` Tian, Kevin
@ 2016-12-30 0:46 ` Fam Zheng
0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2016-12-30 0:46 UTC (permalink / raw)
To: Tian, Kevin
Cc: qemu-devel, Kevin Wolf, qemu-block, Max Reitz, borntraeger,
Stefan Hajnoczi, Paolo Bonzini, Karl Rister
On Thu, 12/29 04:09, Tian, Kevin wrote:
> is it a tradeoff between performance (better than linux-aio) and composability
> (snapshot and live migration which not supported by direct passthrough)?
Yes.
Fam
* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
2016-12-22 9:56 ` Paolo Bonzini
@ 2017-01-11 5:38 ` Stefan Weil
2017-01-11 5:48 ` Stefan Weil
1 sibling, 1 reply; 20+ messages in thread
From: Stefan Weil @ 2017-01-11 5:38 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: Kevin Wolf, qemu-block, Max Reitz, borntraeger, Stefan Hajnoczi,
Paolo Bonzini, Karl Rister, Vincent Palatin
Hi,
this fails for me when building with XEN support.
I noticed the failure when testing the latest HAXM patches.
See compiler output below.
Regards
Stefan
On 12/20/16 17:31, Fam Zheng wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> This adds a notify interface of ram block additions and removals.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
[...]
> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index 8f3a592..dc9b321 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,6 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> err = g_malloc0(nb_pfn * sizeof (int));
>
> if (entry->vaddr_base != NULL) {
> + ram_block_removed(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
> @@ -188,6 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
> BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
>
> + ram_block_added(entry->vaddr_base, entry->size);
> bitmap_zero(entry->valid_mapping, nb_pfn);
> for (i = 0; i < nb_pfn; i++) {
> if (!err[i]) {
> @@ -397,6 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
> }
>
> pentry->next = entry->next;
> + ram_block_removed(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
>
CC x86_64-softmmu/xen-mapcache.o
/qemu/xen-mapcache.c: In function 'xen_remap_bucket':
/qemu/xen-mapcache.c:166:9: error: implicit declaration of function
'ram_block_removed' [-Werror=implicit-function-declaration]
ram_block_removed(entry->vaddr_base, entry->size);
^~~~~~~~~~~~~~~~~
/qemu/xen-mapcache.c:166:9: error: nested extern declaration of
'ram_block_removed' [-Werror=nested-externs]
/qemu/xen-mapcache.c:192:5: error: implicit declaration of function
'ram_block_added' [-Werror=implicit-function-declaration]
ram_block_added(entry->vaddr_base, entry->size);
^~~~~~~~~~~~~~~
/qemu/xen-mapcache.c:192:5: error: nested extern declaration of
'ram_block_added' [-Werror=nested-externs]
cc1: all warnings being treated as errors
/qemu/rules.mak:64: recipe for target 'xen-mapcache.o' failed
make[1]: *** [xen-mapcache.o] Error 1
Makefile:203: recipe for target 'subdir-x86_64-softmmu' failed
make: *** [subdir-x86_64-softmmu] Error 2
* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
2017-01-11 5:38 ` Stefan Weil
@ 2017-01-11 5:48 ` Stefan Weil
2017-01-11 6:41 ` Fam Zheng
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Weil @ 2017-01-11 5:48 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: Kevin Wolf, Vincent Palatin, qemu-block, Max Reitz, borntraeger,
Stefan Hajnoczi, Paolo Bonzini, Karl Rister
On 01/11/17 06:38, Stefan Weil wrote:
> Hi,
>
> this fails for me when building with XEN support.
> I noticed the failure when testing the latest HAXM patches.
> See compiler output below.
>
> Regards
> Stefan
The patch compiles with this modification:
diff --git a/xen-mapcache.c b/xen-mapcache.c
index dc9b321491..31debdfb2c 100644
--- a/xen-mapcache.c
+++ b/xen-mapcache.c
@@ -163,7 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
err = g_malloc0(nb_pfn * sizeof (int));
if (entry->vaddr_base != NULL) {
- ram_block_removed(entry->vaddr_base, entry->size);
+ ram_block_notify_remove(entry->vaddr_base, entry->size);
if (munmap(entry->vaddr_base, entry->size) != 0) {
perror("unmap fails");
exit(-1);
@@ -189,7 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
- ram_block_added(entry->vaddr_base, entry->size);
+ ram_block_notify_add(entry->vaddr_base, entry->size);
bitmap_zero(entry->valid_mapping, nb_pfn);
for (i = 0; i < nb_pfn; i++) {
if (!err[i]) {
@@ -399,7 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
}
pentry->next = entry->next;
- ram_block_removed(entry->vaddr_base, entry->size);
+ ram_block_notify_remove(entry->vaddr_base, entry->size);
if (munmap(entry->vaddr_base, entry->size) != 0) {
perror("unmap fails");
exit(-1);
* Re: [Qemu-devel] [PATCH 1/4] ramblock-notifier: new
2017-01-11 5:48 ` Stefan Weil
@ 2017-01-11 6:41 ` Fam Zheng
0 siblings, 0 replies; 20+ messages in thread
From: Fam Zheng @ 2017-01-11 6:41 UTC (permalink / raw)
To: Stefan Weil
Cc: qemu-devel, Kevin Wolf, Vincent Palatin, qemu-block, Max Reitz,
borntraeger, Stefan Hajnoczi, Paolo Bonzini, Karl Rister
On Wed, 01/11 06:48, Stefan Weil wrote:
> On 01/11/17 06:38, Stefan Weil wrote:
> > Hi,
> >
> > this fails for me when building with XEN support.
> > I noticed the failure when testing the latest HAXM patches.
> > See compiler output below.
> >
> > Regards
> > Stefan
>
> The patch compiles with this modification:
>
>
> diff --git a/xen-mapcache.c b/xen-mapcache.c
> index dc9b321491..31debdfb2c 100644
> --- a/xen-mapcache.c
> +++ b/xen-mapcache.c
> @@ -163,7 +163,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> err = g_malloc0(nb_pfn * sizeof (int));
>
> if (entry->vaddr_base != NULL) {
> - ram_block_removed(entry->vaddr_base, entry->size);
> + ram_block_notify_remove(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
> @@ -189,7 +189,7 @@ static void xen_remap_bucket(MapCacheEntry *entry,
> entry->valid_mapping = (unsigned long *) g_malloc0(sizeof(unsigned long) *
> BITS_TO_LONGS(size >> XC_PAGE_SHIFT));
>
> - ram_block_added(entry->vaddr_base, entry->size);
> + ram_block_notify_add(entry->vaddr_base, entry->size);
> bitmap_zero(entry->valid_mapping, nb_pfn);
> for (i = 0; i < nb_pfn; i++) {
> if (!err[i]) {
> @@ -399,7 +399,7 @@ static void xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
> }
>
> pentry->next = entry->next;
> - ram_block_removed(entry->vaddr_base, entry->size);
> + ram_block_notify_remove(entry->vaddr_base, entry->size);
> if (munmap(entry->vaddr_base, entry->size) != 0) {
> perror("unmap fails");
> exit(-1);
>
Yes, this matches what Paolo pointed out in his reply. I'll fix that in the next
revision.
Fam
end of thread, other threads:[~2017-01-11 6:41 UTC | newest]
Thread overview: 20+ messages
2016-12-20 16:31 [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 1/4] ramblock-notifier: new Fam Zheng
2016-12-22 9:56 ` Paolo Bonzini
2017-01-11 5:38 ` Stefan Weil
2017-01-11 5:48 ` Stefan Weil
2017-01-11 6:41 ` Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 2/4] util: Add a notifier list for qemu_vfree() Fam Zheng
2016-12-20 16:31 ` [Qemu-devel] [PATCH 3/4] util: Add VFIO helper library Fam Zheng
2016-12-21 15:46 ` Paolo Bonzini
2016-12-21 16:19 ` Fam Zheng
2016-12-21 17:02 ` Paolo Bonzini
2016-12-20 16:31 ` [Qemu-devel] [PATCH 4/4] block: Add VFIO based NVMe driver Fam Zheng
2016-12-20 16:39 ` Paolo Bonzini
2016-12-21 11:59 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2016-12-21 14:05 ` Fam Zheng
2016-12-20 23:04 ` [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device no-reply
2016-12-21 1:38 ` Fam Zheng
2016-12-21 0:48 ` no-reply
2016-12-29 4:09 ` Tian, Kevin
2016-12-30 0:46 ` Fam Zheng