linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
@ 2021-07-20  3:31 longli
  2021-07-20  3:31 ` [Patch v4 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices longli
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: longli @ 2021-07-20  3:31 UTC (permalink / raw)
  To: linux-fs, linux-block, linux-kernel, linux-hyperv; +Cc: Long Li

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 5166 bytes --]

From: Long Li <longli@microsoft.com>

Microsoft Azure Blob storage service exposes a REST API to applications
for data access.
(https://docs.microsoft.com/en-us/rest/api/storageservices/blob-service-rest-api)

This patchset implements a VSC (Virtualization Service Consumer) that
communicates with a VSP (Virtualization Service Provider) on the Hyper-V
host to execute Blob storage access via native network stack on the host.
This VSC doesn't implement the semantics of REST API. Those are implemented
in user-space. The VSC provides a fast data path to VSP.

Answers to some previous questions discussing the driver:

Q: Why this driver doesn't use the block layer

A: The Azure Blob is based on a model of object oriented storage. The
storage object is not modeled in block sectors. While it's possible to
present the storage object as a block device (assuming it makes sense to
fake all the block device attributes), we lose the ability to express
functionality that are defined in the REST API. 

Q: You just lost all use of caching and io_uring and loads of other kernel
infrastructure that has been developed and relied on for decades?

A: The REST API is not designed to have caching at system level. This
driver doesn't attempt to improve on this. There are discussions on
supporting ioctl() on io_uring (https://lwn.net/Articles/844875/), that
will benefit this driver. The block I/O scheduling is not helpful in this
case, as the Blob application and Blob storage server have complete
knowledge on the I/O pattern based on storage object type. This knowledge
doesn't get easily consumed by the block layer.

Q: You also just abandoned the POSIX model and forced people to use a
random-custom-library just to access their storage devices, breaking all
existing programs in the world?

A: The existing Blob applications access storage via HTTP (REST API). They
don't use POSIX interface. The interface for Azure Blob is not designed
on POSIX.

Q: What programs today use this new API?

A: Currently none is released. But per above, there are also none using
POSIX.

Q: Where is the API published and what ensures that it will remain stable?

A: Cloud based REST protocols have similar considerations to the kernel in
terms of interface stability. Applications depend on cloud services via
REST in much the same way as they depend on kernel services. Because
applications can consume cloud APIs over the Internet, there is no
opportunity to recompile applications to ensure compatibility. This causes
the underlying APIs to be exceptionally stable, and Azure Blob has not
removed support for an exposed API to date. This driver is supporting a
pass-through model where requests in a guest process can be reflected to a
VM host environment. Like the current REST interface, the goal is to ensure
that each host provide a high degree of compatibility with each guest, but
that task is largely outside the scope of this driver, which exists to
communicate requests in the same way an HTTP stack would. Just like an HTTP
stack does not require updates to add a new custom header or receive one
from a server, this driver does not require updates for new functionality
so long as the high level request/response model is retained.

Q: What happens when it changes over time, do we have to rebuild all
userspace applications?

A: No. We don’t rebuild them all to talk HTTP either. In the current HTTP
scheme, applications specify the version of the protocol they talk, and the
storage backend responds with that version.

Q: What happens to the kernel code over time, how do you handle changes to
the API there?

A: The goal of this driver is to get requests to the Hyper-V host, so the
kernel isn’t involved in API changes, in the same way that HTTP
implementations are robust to extra functionality being added to HTTP.

Long Li (3):
  Drivers: hv: vmbus: add support to ignore certain PCIE devices
  Drivers: hv: add Azure Blob driver
  Drivers: hv: Add to maintainer for Azure Blob driver

Changes:

v2:
Refactored the code in vmbus to scan devices
Reworked Azure Blob driver and moved user-mode interfaces to uapi

v3:
Changed licensing language
Patch format passed "checkpatch --strict"
debugfs and logging, module parameter cleanup
General code clean up
Fix device removal race conditions

v4:
addressed licencing issues
changed to dynamic device model

Long Li (3):
  Drivers: hv: vmbus: add support to ignore certain PCIE devices
  Drivers: hv: add Azure Blob driver
  Drivers: hv: Add to maintainer for Hyper-V/Azure drivers

 .../userspace-api/ioctl/ioctl-number.rst      |   2 +
 MAINTAINERS                                   |   2 +
 drivers/hv/Kconfig                            |  11 +
 drivers/hv/Makefile                           |   1 +
 drivers/hv/channel_mgmt.c                     |  55 +-
 drivers/hv/hv_azure_blob.c                    | 628 ++++++++++++++++++
 include/linux/hyperv.h                        |   9 +
 include/uapi/misc/hv_azure_blob.h             |  34 +
 8 files changed, 736 insertions(+), 6 deletions(-)
 create mode 100644 drivers/hv/hv_azure_blob.c
 create mode 100644 include/uapi/misc/hv_azure_blob.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Patch v4 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices
  2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
@ 2021-07-20  3:31 ` longli
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: longli @ 2021-07-20  3:31 UTC (permalink / raw)
  To: linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Long Li, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui

From: Long Li <longli@microsoft.com>

Under certain configurations when Hyper-v presents a device to VMBUS, it
may have a VMBUS channel and a PCIe channel. The PCIe channel is not used
in Linux and does not have a backing PCIE device on Hyper-v. For such
devices, ignore those PCIe channels so they are not getting probed by the
PCI subsystem.

Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/hv/channel_mgmt.c | 48 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index caf6d0c4bc1b..705e95d7a040 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -26,6 +26,21 @@
 
 static void init_vp_index(struct vmbus_channel *channel);
 
+/*
+ * For some VMBUS devices, Hyper-v also presents certain PCIE devices as
+ * part of the host device implementation. Those devices have no real
+ * PCI implementation in Hyper-V, and should be ignored and not presented
+ * to the PCI layer.
+ */
+static const guid_t vpci_ignore_instances[] = {
+	/*
+	 * Azure Blob instance ID in VPCI
+	 * {d4573da2-2caa-4711-a8f9-bbabf4ee9685}
+	 */
+	GUID_INIT(0xd4573da2, 0x2caa, 0x4711, 0xa8, 0xf9,
+		  0xbb, 0xab, 0xf4, 0xee, 0x96, 0x85),
+};
+
 const struct vmbus_device vmbus_devs[] = {
 	/* IDE */
 	{ .dev_type = HV_IDE,
@@ -187,20 +202,19 @@ static bool is_unsupported_vmbus_devs(const guid_t *guid)
 	return false;
 }
 
-static u16 hv_get_dev_type(const struct vmbus_channel *channel)
+static u16 hv_get_dev_type(const guid_t *guid)
 {
-	const guid_t *guid = &channel->offermsg.offer.if_type;
 	u16 i;
 
-	if (is_hvsock_channel(channel) || is_unsupported_vmbus_devs(guid))
+	if (is_unsupported_vmbus_devs(guid))
 		return HV_UNKNOWN;
 
 	for (i = HV_IDE; i < HV_UNKNOWN; i++) {
 		if (guid_equal(guid, &vmbus_devs[i].guid))
-			return i;
+			return vmbus_devs[i].dev_type;
 	}
 	pr_info("Unknown GUID: %pUl\n", guid);
-	return i;
+	return HV_UNKNOWN;
 }
 
 /**
@@ -487,6 +501,16 @@ void vmbus_free_channels(void)
 	}
 }
 
+static bool ignore_pcie_device(guid_t *if_instance)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vpci_ignore_instances); i++)
+		if (guid_equal(&vpci_ignore_instances[i], if_instance))
+			return true;
+	return false;
+}
+
 /* Note: the function can run concurrently for primary/sub channels. */
 static void vmbus_add_channel_work(struct work_struct *work)
 {
@@ -910,7 +934,11 @@ static void vmbus_setup_channel_state(struct vmbus_channel *channel,
 	       sizeof(struct vmbus_channel_offer_channel));
 	channel->monitor_grp = (u8)offer->monitorid / 32;
 	channel->monitor_bit = (u8)offer->monitorid % 32;
-	channel->device_id = hv_get_dev_type(channel);
+	if (is_hvsock_channel(channel))
+		channel->device_id = HV_UNKNOWN;
+	else
+		channel->device_id =
+			hv_get_dev_type(&channel->offermsg.offer.if_type);
 }
 
 /*
@@ -972,6 +1000,14 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
 
 	trace_vmbus_onoffer(offer);
 
+	/* Check to see if we should ignore this PCIe channel */
+	if (hv_get_dev_type(&offer->offer.if_type) == HV_PCIE &&
+	    ignore_pcie_device(&offer->offer.if_instance)) {
+		pr_debug("Ignore instance %pUl over VPCI\n",
+			 &offer->offer.if_instance);
+		return;
+	}
+
 	if (!vmbus_is_valid_device(&offer->offer.if_type)) {
 		pr_err_ratelimited("Invalid offer %d from the host supporting isolation\n",
 				   offer->child_relid);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
  2021-07-20  3:31 ` [Patch v4 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices longli
@ 2021-07-20  3:31 ` longli
  2021-07-20  5:26   ` Greg Kroah-Hartman
                     ` (3 more replies)
  2021-07-20  3:31 ` [Patch v4 3/3] Drivers: hv: Add to maintainer for Hyper-V/Azure drivers longli
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 23+ messages in thread
From: longli @ 2021-07-20  3:31 UTC (permalink / raw)
  To: linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Long Li, Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Greg Kroah-Hartman,
	Bjorn Andersson, Hans de Goede, Dan Williams, Maximilian Luz,
	Mike Rapoport, Ben Widawsky, Jiri Slaby, Andra Paraschiv,
	Siddharth Gupta, Hannes Reinecke, linux-doc

From: Long Li <longli@microsoft.com>

Azure Blob provides scalable, secure and shared storage services for Azure.

This driver adds support for accelerated access to Azure Blob storage. As an
alternative to REST APIs, it provides a fast data path that uses host native
network stack and direct data link for storage server access.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Maximilian Luz <luzmaximilian@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Andra Paraschiv <andraprs@amazon.com>
Cc: Siddharth Gupta <sidgup@codeaurora.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Long Li <longli@microsoft.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   2 +
 drivers/hv/Kconfig                            |  11 +
 drivers/hv/Makefile                           |   1 +
 drivers/hv/channel_mgmt.c                     |   7 +
 drivers/hv/hv_azure_blob.c                    | 628 ++++++++++++++++++
 include/linux/hyperv.h                        |   9 +
 include/uapi/misc/hv_azure_blob.h             |  34 +
 7 files changed, 692 insertions(+)
 create mode 100644 drivers/hv/hv_azure_blob.c
 create mode 100644 include/uapi/misc/hv_azure_blob.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 9bfc2b510c64..1ee8c0c7bd2e 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -180,6 +180,8 @@ Code  Seq#    Include File                                           Comments
 'R'   01     linux/rfkill.h                                          conflict!
 'R'   C0-DF  net/bluetooth/rfcomm.h
 'R'   E0     uapi/linux/fsl_mc.h
+'R'   F0-FF  uapi/misc/hv_azure_blob.h                               Microsoft Azure Blob driver
+                                                                     <mailto:longli@microsoft.com>
 'S'   all    linux/cdrom.h                                           conflict!
 'S'   80-81  scsi/scsi_ioctl.h                                       conflict!
 'S'   82-FF  scsi/scsi.h                                             conflict!
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 66c794d92391..53bebd0ad812 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -27,4 +27,15 @@ config HYPERV_BALLOON
 	help
 	  Select this option to enable Hyper-V Balloon driver.
 
+config HYPERV_AZURE_BLOB
+	tristate "Microsoft Azure Blob driver"
+	depends on HYPERV && X86_64
+	help
+	  Select this option to enable Microsoft Azure Blob driver.
+
+	  This driver implements a fast datapath over Hyper-V to support
+	  accelerated access to Microsoft Azure Blob services.
+	  To compile this driver as a module, choose M here. The module will be
+	  called azure_blob.
+
 endmenu
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 94daf8240c95..272644532245 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
 obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
 obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
+obj-$(CONFIG_HYPERV_AZURE_BLOB)	+= hv_azure_blob.o
 
 CFLAGS_hv_trace.o = -I$(src)
 CFLAGS_hv_balloon.o = -I$(src)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 705e95d7a040..3095611045b5 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -154,6 +154,13 @@ const struct vmbus_device vmbus_devs[] = {
 	  .allowed_in_isolated = false,
 	},
 
+	/* Azure Blob */
+	{ .dev_type = HV_AZURE_BLOB,
+	  HV_AZURE_BLOB_GUID,
+	  .perf_device = false,
+	  .allowed_in_isolated = false,
+	},
+
 	/* Unknown GUID */
 	{ .dev_type = HV_UNKNOWN,
 	  .perf_device = false,
diff --git a/drivers/hv/hv_azure_blob.c b/drivers/hv/hv_azure_blob.c
new file mode 100644
index 000000000000..04bec92aa058
--- /dev/null
+++ b/drivers/hv/hv_azure_blob.c
@@ -0,0 +1,628 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 Microsoft Corporation. */
+
+#include <uapi/misc/hv_azure_blob.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/debugfs.h>
+#include <linux/pagemap.h>
+#include <linux/hyperv.h>
+#include <linux/miscdevice.h>
+#include <linux/uio.h>
+
+struct az_blob_device {
+	struct hv_device *device;
+
+	/* Opened files maintained by this device */
+	struct list_head file_list;
+	/* Lock for protecting file_list */
+	spinlock_t file_lock;
+
+	/* The refcount for this device */
+	refcount_t count;
+
+	/* Pending requests to VSP */
+	atomic_t pending;
+	wait_queue_head_t waiting_to_drain;
+
+	bool removing;
+};
+
+/* VSP messages */
+enum az_blob_vsp_request_type {
+	AZ_BLOB_DRIVER_REQUEST_FIRST     = 0x100,
+	AZ_BLOB_DRIVER_USER_REQUEST      = 0x100,
+	AZ_BLOB_DRIVER_REGISTER_BUFFER   = 0x101,
+	AZ_BLOB_DRIVER_DEREGISTER_BUFFER = 0x102,
+};
+
+/* VSC->VSP request */
+struct az_blob_vsp_request {
+	u32 version;
+	u32 timeout_ms;
+	u32 data_buffer_offset;
+	u32 data_buffer_length;
+	u32 data_buffer_valid;
+	u32 operation_type;
+	u32 request_buffer_offset;
+	u32 request_buffer_length;
+	u32 response_buffer_offset;
+	u32 response_buffer_length;
+	guid_t transaction_id;
+} __packed;
+
+/* VSP->VSC response */
+struct az_blob_vsp_response {
+	u32 length;
+	u32 error;
+	u32 response_len;
+} __packed;
+
+struct az_blob_vsp_request_ctx {
+	struct list_head list;
+	struct completion wait_vsp;
+	struct az_blob_request_sync *request;
+};
+
+struct az_blob_file_ctx {
+	struct list_head list;
+
+	/* List of pending requests to VSP */
+	struct list_head vsp_pending_requests;
+	/* Lock for protecting vsp_pending_requests */
+	spinlock_t vsp_pending_lock;
+	wait_queue_head_t wait_vsp_pending;
+
+	pid_t pid;
+
+	struct az_blob_device *dev;
+};
+
+/* The maximum number of pages we can pass to VSP in a single packet */
+#define AZ_BLOB_MAX_PAGES 8192
+
+static struct az_blob_device *az_blob_dev;
+#define AZ_DEV (&az_blob_dev->device->device)
+
+/* Ring buffer size in bytes */
+#define AZ_BLOB_RING_SIZE (128 * 1024)
+
+/* System wide device queue depth */
+#define AZ_BLOB_QUEUE_DEPTH 1024
+
+/* The VSP protocol version this driver understands */
+#define VSP_PROTOCOL_VERSION_V1 0
+
+static const struct hv_vmbus_device_id id_table[] = {
+	{ HV_AZURE_BLOB_GUID,
+	  .driver_data = 0
+	},
+	{ },
+};
+
+static void az_blob_on_channel_callback(void *context)
+{
+	struct vmbus_channel *channel = (struct vmbus_channel *)context;
+	const struct vmpacket_descriptor *desc;
+
+	foreach_vmbus_pkt(desc, channel) {
+		struct az_blob_vsp_request_ctx *request_ctx;
+		struct az_blob_vsp_response *response;
+		u64 cmd_rqst = vmbus_request_addr(&channel->requestor,
+						  desc->trans_id);
+		if (cmd_rqst == VMBUS_RQST_ERROR) {
+			dev_err(AZ_DEV, "incorrect transaction id %llu\n",
+				desc->trans_id);
+			continue;
+		}
+		request_ctx = (struct az_blob_vsp_request_ctx *)cmd_rqst;
+		response = hv_pkt_data(desc);
+
+		dev_dbg(AZ_DEV, "got response for request %pUb status %u "
+			    "response_len %u\n",
+			    &request_ctx->request->guid, response->error,
+			    response->response_len);
+		request_ctx->request->response.status = response->error;
+		request_ctx->request->response.response_len =
+			response->response_len;
+		complete(&request_ctx->wait_vsp);
+	}
+}
+
+static int az_blob_fop_open(struct inode *inode, struct file *file)
+{
+	struct az_blob_device *dev = az_blob_dev;
+	struct az_blob_file_ctx *file_ctx;
+	unsigned long flags;
+
+	file_ctx = kzalloc(sizeof(*file_ctx), GFP_KERNEL);
+	if (!file_ctx)
+		return -ENOMEM;
+
+	file_ctx->pid = task_tgid_vnr(current);
+	INIT_LIST_HEAD(&file_ctx->vsp_pending_requests);
+	init_waitqueue_head(&file_ctx->wait_vsp_pending);
+	spin_lock_init(&file_ctx->vsp_pending_lock);
+	file_ctx->dev = dev;
+	refcount_inc(&dev->count);
+
+	spin_lock_irqsave(&dev->file_lock, flags);
+	list_add_tail(&file_ctx->list, &dev->file_list);
+	spin_unlock_irqrestore(&dev->file_lock, flags);
+
+	file->private_data = file_ctx;
+	return 0;
+}
+
+static int az_blob_fop_release(struct inode *inode, struct file *file)
+{
+	struct az_blob_file_ctx *file_ctx = file->private_data;
+	struct az_blob_device *dev = file_ctx->dev;
+	unsigned long flags;
+
+	wait_event(file_ctx->wait_vsp_pending,
+		   list_empty(&file_ctx->vsp_pending_requests));
+
+	spin_lock_irqsave(&dev->file_lock, flags);
+	list_del(&file_ctx->list);
+	spin_unlock_irqrestore(&dev->file_lock, flags);
+
+	kfree(file_ctx);
+	if (refcount_dec_and_test(&dev->count))
+		kfree(dev);
+
+	return 0;
+}
+
+static inline bool az_blob_safe_file_access(struct file *file)
+{
+	return file->f_cred == current_cred() && !uaccess_kernel();
+}
+
+/* Pin the user buffer pages into memory for passing to VSP */
+static int get_buffer_pages(int rw, void __user *buffer, u32 buffer_len,
+			    struct page ***ppages, size_t *start,
+			    size_t *num_pages)
+{
+	struct iovec iov;
+	struct iov_iter iter;
+	int ret;
+	ssize_t result;
+	struct page **pages;
+
+	ret = import_single_range(rw, buffer, buffer_len, &iov, &iter);
+	if (ret) {
+		dev_dbg(AZ_DEV, "request buffer access error %d\n", ret);
+		return ret;
+	}
+
+	result = iov_iter_get_pages_alloc(&iter, &pages, buffer_len, start);
+	if (result < 0) {
+		dev_dbg(AZ_DEV, "failed to pin user pages result=%ld\n", result);
+		return result;
+	}
+
+	*num_pages = (result + *start + PAGE_SIZE - 1) / PAGE_SIZE;
+	if (result != buffer_len) {
+		dev_dbg(AZ_DEV, "can't pin user pages requested %d got %ld\n",
+			buffer_len, result);
+		for (i = 0; i < *num_pages; i++)
+			put_page(pages[i]);
+		kvfree(pages);
+		return -EFAULT;
+	}
+
+	*ppages = pages;
+	return 0;
+}
+
+static void fill_in_page_buffer(u64 *pfn_array, int *index,
+				struct page **pages, unsigned long num_pages)
+{
+	int i, page_idx = *index;
+
+	for (i = 0; i < num_pages; i++)
+		pfn_array[page_idx++] = page_to_pfn(pages[i]);
+	*index = page_idx;
+}
+
+static void free_buffer_pages(size_t num_pages, struct page **pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < num_pages; i++)
+		if (pages && pages[i])
+			put_page(pages[i]);
+	kvfree(pages);
+}
+
+static long az_blob_ioctl_user_request(struct file *filp, unsigned long arg)
+{
+	struct az_blob_file_ctx *file_ctx = filp->private_data;
+	struct az_blob_device *dev = file_ctx->dev;
+	struct az_blob_request_sync __user *request_user =
+		(struct az_blob_request_sync __user *)arg;
+	struct az_blob_request_sync request;
+	struct az_blob_vsp_request_ctx request_ctx;
+	unsigned long flags;
+	int ret;
+	size_t request_start, request_num_pages = 0;
+	size_t response_start, response_num_pages = 0;
+	size_t data_start, data_num_pages = 0, total_num_pages;
+	struct page **request_pages = NULL, **response_pages = NULL;
+	struct page **data_pages = NULL;
+	struct vmbus_packet_mpb_array *desc;
+	u64 *pfn_array;
+	int desc_size;
+	int page_idx;
+	struct az_blob_vsp_request *vsp_request;
+
+	/* Fast fail if device is being removed */
+	if (dev->removing)
+		return -ENODEV;
+
+	if (!az_blob_safe_file_access(filp)) {
+		dev_dbg(AZ_DEV, "process %d(%s) changed security contexts after"
+			    " opening file descriptor\n",
+			    task_tgid_vnr(current), current->comm);
+		return -EACCES;
+	}
+
+	if (copy_from_user(&request, request_user, sizeof(request))) {
+		dev_dbg(AZ_DEV, "don't have permission to user provided buffer\n");
+		return -EFAULT;
+	}
+
+	dev_dbg(AZ_DEV, "az_blob ioctl request guid %pUb timeout %u request_len %u"
+		    " response_len %u data_len %u request_buffer %llx "
+		    "response_buffer %llx data_buffer %llx\n",
+		    &request.guid, request.timeout, request.request_len,
+		    request.response_len, request.data_len, request.request_buffer,
+		    request.response_buffer, request.data_buffer);
+
+	if (!request.request_len || !request.response_len)
+		return -EINVAL;
+
+	if (request.data_len && request.data_len < request.data_valid)
+		return -EINVAL;
+
+	if (request.data_len > PAGE_SIZE * AZ_BLOB_MAX_PAGES ||
+	    request.request_len > PAGE_SIZE * AZ_BLOB_MAX_PAGES ||
+	    request.response_len > PAGE_SIZE * AZ_BLOB_MAX_PAGES)
+		return -EINVAL;
+
+	init_completion(&request_ctx.wait_vsp);
+	request_ctx.request = &request;
+
+	ret = get_buffer_pages(READ, (void __user *)request.request_buffer,
+			       request.request_len, &request_pages,
+			       &request_start, &request_num_pages);
+	if (ret)
+		goto get_user_page_failed;
+
+	ret = get_buffer_pages(READ | WRITE,
+			       (void __user *)request.response_buffer,
+			       request.response_len, &response_pages,
+			       &response_start, &response_num_pages);
+	if (ret)
+		goto get_user_page_failed;
+
+	if (request.data_len) {
+		ret = get_buffer_pages(READ | WRITE,
+				       (void __user *)request.data_buffer,
+				       request.data_len, &data_pages,
+				       &data_start, &data_num_pages);
+		if (ret)
+			goto get_user_page_failed;
+	}
+
+	total_num_pages = request_num_pages + response_num_pages +
+				data_num_pages;
+	if (total_num_pages > AZ_BLOB_MAX_PAGES) {
+		dev_dbg(AZ_DEV, "number of DMA pages %lu buffer exceeding %u\n",
+			total_num_pages, AZ_BLOB_MAX_PAGES);
+		ret = -EINVAL;
+		goto get_user_page_failed;
+	}
+
+	/* Construct a VMBUS packet and send it over to VSP */
+	desc_size = struct_size(desc, range.pfn_array, total_num_pages);
+	desc = kzalloc(desc_size, GFP_KERNEL);
+	vsp_request = kzalloc(sizeof(*vsp_request), GFP_KERNEL);
+	if (!desc || !vsp_request) {
+		kfree(desc);
+		kfree(vsp_request);
+		ret = -ENOMEM;
+		goto get_user_page_failed;
+	}
+
+	desc->range.offset = 0;
+	desc->range.len = total_num_pages * PAGE_SIZE;
+	pfn_array = desc->range.pfn_array;
+	page_idx = 0;
+
+	if (request.data_len) {
+		fill_in_page_buffer(pfn_array, &page_idx, data_pages,
+				    data_num_pages);
+		vsp_request->data_buffer_offset = data_start;
+		vsp_request->data_buffer_length = request.data_len;
+		vsp_request->data_buffer_valid = request.data_valid;
+	}
+
+	fill_in_page_buffer(pfn_array, &page_idx, request_pages,
+			    request_num_pages);
+	vsp_request->request_buffer_offset = request_start +
+						data_num_pages * PAGE_SIZE;
+	vsp_request->request_buffer_length = request.request_len;
+
+	fill_in_page_buffer(pfn_array, &page_idx, response_pages,
+			    response_num_pages);
+	vsp_request->response_buffer_offset = response_start +
+		(data_num_pages + request_num_pages) * PAGE_SIZE;
+	vsp_request->response_buffer_length = request.response_len;
+
+	vsp_request->version = VSP_PROTOCOL_VERSION_V1;
+	vsp_request->timeout_ms = request.timeout;
+	vsp_request->operation_type = AZ_BLOB_DRIVER_USER_REQUEST;
+	guid_copy(&vsp_request->transaction_id, &request.guid);
+
+	spin_lock_irqsave(&file_ctx->vsp_pending_lock, flags);
+	list_add_tail(&request_ctx.list, &file_ctx->vsp_pending_requests);
+	spin_unlock_irqrestore(&file_ctx->vsp_pending_lock, flags);
+
+	atomic_inc(&dev->pending);
+
+	/* Check if device is being removed */
+	if (dev->removing) {
+		ret = -ENODEV;
+		goto vmbus_send_failed;
+	}
+
+	ret = vmbus_sendpacket_mpb_desc(dev->device->channel, desc, desc_size,
+					vsp_request, sizeof(*vsp_request),
+					(u64)&request_ctx);
+
+	kfree(desc);
+	kfree(vsp_request);
+	if (ret)
+		goto vmbus_send_failed;
+
+	wait_for_completion(&request_ctx.wait_vsp);
+
+	/*
+	 * At this point, the response is already written to request
+	 * by VMBUS completion handler, copy them to user-mode buffers
+	 * and return to user-mode
+	 */
+	if (copy_to_user(&request_user->response, &request.response,
+			 sizeof(request.response)))
+		ret = -EFAULT;
+
+vmbus_send_failed:
+	if (atomic_dec_and_test(&dev->pending) && dev->removing)
+		wake_up(&dev->waiting_to_drain);
+
+	spin_lock_irqsave(&file_ctx->vsp_pending_lock, flags);
+	list_del(&request_ctx.list);
+	if (list_empty(&file_ctx->vsp_pending_requests))
+		wake_up(&file_ctx->wait_vsp_pending);
+	spin_unlock_irqrestore(&file_ctx->vsp_pending_lock, flags);
+
+get_user_page_failed:
+	free_buffer_pages(request_num_pages, request_pages);
+	free_buffer_pages(response_num_pages, response_pages);
+	free_buffer_pages(data_num_pages, data_pages);
+
+	return ret;
+}
+
+static long az_blob_fop_ioctl(struct file *filp, unsigned int cmd,
+			      unsigned long arg)
+{
+	switch (cmd) {
+	case IOCTL_AZ_BLOB_DRIVER_USER_REQUEST:
+		return az_blob_ioctl_user_request(filp, arg);
+
+	default:
+		dev_dbg(AZ_DEV, "unrecognized IOCTL code %u\n", cmd);
+	}
+
+	return -EINVAL;
+}
+
+static const struct file_operations az_blob_client_fops = {
+	.owner		= THIS_MODULE,
+	.open		= az_blob_fop_open,
+	.unlocked_ioctl = az_blob_fop_ioctl,
+	.release	= az_blob_fop_release,
+};
+
+static struct miscdevice az_blob_misc_device = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "azure_blob",
+	.fops	= &az_blob_client_fops,
+};
+
+static int az_blob_show_pending_requests(struct seq_file *m, void *v)
+{
+	unsigned long flags, flags2;
+	struct az_blob_vsp_request_ctx *request_ctx;
+	struct az_blob_file_ctx *file_ctx;
+	struct az_blob_device *dev = az_blob_dev;
+
+	seq_puts(m, "List of pending requests\n");
+	seq_puts(m, "UUID request_len response_len data_len "
+		"request_buffer response_buffer data_buffer\n");
+	spin_lock_irqsave(&dev->file_lock, flags);
+	list_for_each_entry(file_ctx, &dev->file_list, list) {
+		spin_lock_irqsave(&file_ctx->vsp_pending_lock, flags2);
+		seq_printf(m, "file context for pid %u\n", file_ctx->pid);
+		list_for_each_entry(request_ctx,
+				    &file_ctx->vsp_pending_requests, list) {
+			seq_printf(m, "%pUb ", &request_ctx->request->guid);
+			seq_printf(m, "%u ", request_ctx->request->request_len);
+			seq_printf(m, "%u ",
+				   request_ctx->request->response_len);
+			seq_printf(m, "%u ", request_ctx->request->data_len);
+			seq_printf(m, "%llx ",
+				   request_ctx->request->request_buffer);
+			seq_printf(m, "%llx ",
+				   request_ctx->request->response_buffer);
+			seq_printf(m, "%llx\n",
+				   request_ctx->request->data_buffer);
+		}
+		spin_unlock_irqrestore(&file_ctx->vsp_pending_lock, flags2);
+	}
+	spin_unlock_irqrestore(&dev->file_lock, flags);
+
+	return 0;
+}
+
+static int az_blob_debugfs_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, az_blob_show_pending_requests, NULL);
+}
+
+static const struct file_operations az_blob_debugfs_fops = {
+	.open		= az_blob_debugfs_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release
+};
+
+static void az_blob_remove_device(void)
+{
+	struct dentry *debugfs_root = debugfs_lookup("az_blob", NULL);
+
+	debugfs_remove_recursive(debugfs_root);
+	misc_deregister(&az_blob_misc_device);
+}
+
+static int az_blob_create_device(struct az_blob_device *dev)
+{
+	int ret;
+	struct dentry *debugfs_root;
+
+	ret = misc_register(&az_blob_misc_device);
+	if (ret)
+		return ret;
+
+	debugfs_root = debugfs_create_dir("az_blob", NULL);
+	debugfs_create_file("pending_requests", 0400, debugfs_root, NULL,
+			    &az_blob_debugfs_fops);
+
+	return 0;
+}
+
+static int az_blob_connect_to_vsp(struct hv_device *device,
+				  struct az_blob_device *dev, u32 ring_size)
+{
+	int ret;
+
+	dev->device = device;
+	device->channel->rqstor_size = AZ_BLOB_QUEUE_DEPTH;
+
+	ret = vmbus_open(device->channel, ring_size, ring_size, NULL, 0,
+			 az_blob_on_channel_callback, device->channel);
+
+	if (ret)
+		return ret;
+
+	hv_set_drvdata(device, dev);
+
+	return ret;
+}
+
+static void az_blob_remove_vmbus(struct hv_device *device)
+{
+	hv_set_drvdata(device, NULL);
+	vmbus_close(device->channel);
+}
+
+static int az_blob_probe(struct hv_device *device,
+			 const struct hv_vmbus_device_id *dev_id)
+{
+	int ret;
+	struct az_blob_device *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	spin_lock_init(&dev->file_lock);
+	INIT_LIST_HEAD(&dev->file_list);
+	atomic_set(&dev->pending, 0);
+	init_waitqueue_head(&dev->waiting_to_drain);
+
+	ret = az_blob_connect_to_vsp(device, dev, AZ_BLOB_RING_SIZE);
+	if (ret)
+		goto fail;
+
+	refcount_set(&dev->count, 1);
+	az_blob_dev = dev;
+
+	// create user-mode client library facing device
+	ret = az_blob_create_device(dev);
+	if (ret) {
+		dev_err(AZ_DEV, "failed to create device ret=%d\n", ret);
+		az_blob_remove_vmbus(device);
+		goto fail;
+	}
+
+	dev_info(AZ_DEV, "successfully probed device\n");
+	return 0;
+
+fail:
+	kfree(dev);
+	return ret;
+}
+
+static int az_blob_remove(struct hv_device *device)
+{
+	struct az_blob_device *dev = hv_get_drvdata(device);
+
+	dev->removing = true;
+	az_blob_remove_device();
+
+	/*
+	 * We won't get any new requests from user-mode. There may be
+	 * pending requests already sent over VMBUS and we may get more
+	 * responses from VSP. Wait until no VSC/VSP traffic is possible.
+	 */
+	wait_event(dev->waiting_to_drain,
+		   atomic_read(&dev->pending) == 0);
+
+	az_blob_remove_vmbus(device);
+
+	if (refcount_dec_and_test(&dev->count))
+		kfree(dev);
+
+	return 0;
+}
+
+static struct hv_driver az_blob_drv = {
+	.name		= KBUILD_MODNAME,
+	.id_table	= id_table,
+	.probe		= az_blob_probe,
+	.remove		= az_blob_remove,
+	.driver		= {
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	},
+};
+
+static int __init az_blob_drv_init(void)
+{
+	return vmbus_driver_register(&az_blob_drv);
+}
+
+static void __exit az_blob_drv_exit(void)
+{
+	vmbus_driver_unregister(&az_blob_drv);
+}
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Microsoft Azure Blob driver");
+module_init(az_blob_drv_init);
+module_exit(az_blob_drv_exit);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index d1e59dbef1dd..ac3136284ace 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -772,6 +772,7 @@ enum vmbus_device_type {
 	HV_FCOPY,
 	HV_BACKUP,
 	HV_DM,
+	HV_AZURE_BLOB,
 	HV_UNKNOWN,
 };
 
@@ -1349,6 +1350,14 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size);
 	.guid = GUID_INIT(0xba6163d9, 0x04a1, 0x4d29, 0xb6, 0x05, \
 			  0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f)
 
+/*
+ * Azure Blob GUID
+ * {0590b792-db64-45cc-81db-b8d70c577c9e}
+ */
+#define HV_AZURE_BLOB_GUID \
+	.guid = GUID_INIT(0x0590b792, 0xdb64, 0x45cc, 0x81, 0xdb, \
+			  0xb8, 0xd7, 0x0c, 0x57, 0x7c, 0x9e)
+
 /*
  * Shutdown GUID
  * {0e0b6031-5213-4934-818b-38d90ced39db}
diff --git a/include/uapi/misc/hv_azure_blob.h b/include/uapi/misc/hv_azure_blob.h
new file mode 100644
index 000000000000..ebb141b2762a
--- /dev/null
+++ b/include/uapi/misc/hv_azure_blob.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+/* Copyright (c) 2021 Microsoft Corporation. */
+
+#ifndef _AZ_BLOB_H
+#define _AZ_BLOB_H
+
+#include <linux/kernel.h>
+#include <linux/uuid.h>
+
+/* user-mode sync request sent through ioctl */
+struct az_blob_request_sync_response {
+	__u32 status;
+	__u32 response_len;
+};
+
+struct az_blob_request_sync {
+	guid_t guid;
+	__u32 timeout;
+	__u32 request_len;
+	__u32 response_len;
+	__u32 data_len;
+	__u32 data_valid;
+	__aligned_u64 request_buffer;
+	__aligned_u64 response_buffer;
+	__aligned_u64 data_buffer;
+	struct az_blob_request_sync_response response;
+};
+
+#define AZ_BLOB_MAGIC_NUMBER	'R'
+#define IOCTL_AZ_BLOB_DRIVER_USER_REQUEST \
+		_IOWR(AZ_BLOB_MAGIC_NUMBER, 0xf0, \
+			struct az_blob_request_sync)
+
+#endif /* define _AZ_BLOB_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Patch v4 3/3] Drivers: hv: Add to maintainer for Hyper-V/Azure drivers
  2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
  2021-07-20  3:31 ` [Patch v4 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices longli
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
@ 2021-07-20  3:31 ` longli
  2021-07-20  4:37 ` [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob Bart Van Assche
  2021-07-20 15:54 ` Greg KH
  4 siblings, 0 replies; 23+ messages in thread
From: longli @ 2021-07-20  3:31 UTC (permalink / raw)
  To: linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Long Li, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui

From: Long Li <longli@microsoft.com>

Add longli@microsoft.com to maintainer list.

Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 948706174fae..bbf062936f21 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8440,6 +8440,7 @@ M:	Haiyang Zhang <haiyangz@microsoft.com>
 M:	Stephen Hemminger <sthemmin@microsoft.com>
 M:	Wei Liu <wei.liu@kernel.org>
 M:	Dexuan Cui <decui@microsoft.com>
+M:	Long Li <longli@microsoft.com>
 L:	linux-hyperv@vger.kernel.org
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
@@ -8468,6 +8469,7 @@ F:	include/asm-generic/mshyperv.h
 F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
 F:	include/uapi/linux/hyperv.h
+F:	include/uapi/misc/hv_azure_blob.h
 F:	net/vmw_vsock/hyperv_transport.c
 F:	tools/hv/
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
                   ` (2 preceding siblings ...)
  2021-07-20  3:31 ` [Patch v4 3/3] Drivers: hv: Add to maintainer for Hyper-V/Azure drivers longli
@ 2021-07-20  4:37 ` Bart Van Assche
  2021-07-20  6:01   ` Christoph Hellwig
  2021-07-20 15:54 ` Greg KH
  4 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2021-07-20  4:37 UTC (permalink / raw)
  To: longli, linux-fs, linux-block, linux-kernel, linux-hyperv; +Cc: Long Li

On 7/19/21 8:31 PM, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> Microsoft Azure Blob storage service exposes a REST API to applications
> for data access.
> (https://docs.microsoft.com/en-us/rest/api/storageservices/blob-service-rest-api)
> 
> This patchset implements a VSC (Virtualization Service Consumer) that
> communicates with a VSP (Virtualization Service Provider) on the Hyper-V
> host to execute Blob storage access via native network stack on the host.
> This VSC doesn't implement the semantics of REST API. Those are implemented
> in user-space. The VSC provides a fast data path to VSP.
> 
> Answers to some previous questions discussing the driver:
> 
> Q: Why this driver doesn't use the block layer
> 
> A: The Azure Blob is based on a model of object oriented storage. The
> storage object is not modeled in block sectors. While it's possible to
> present the storage object as a block device (assuming it makes sense to
> fake all the block device attributes), we lose the ability to express
> functionality that are defined in the REST API. 
> 
> Q: You just lost all use of caching and io_uring and loads of other kernel
> infrastructure that has been developed and relied on for decades?
> 
> A: The REST API is not designed to have caching at system level. This
> driver doesn't attempt to improve on this. There are discussions on
> supporting ioctl() on io_uring (https://lwn.net/Articles/844875/), that
> will benefit this driver. The block I/O scheduling is not helpful in this
> case, as the Blob application and Blob storage server have complete
> knowledge on the I/O pattern based on storage object type. This knowledge
> doesn't get easily consumed by the block layer.
> 
> Q: You also just abandoned the POSIX model and forced people to use a
> random-custom-library just to access their storage devices, breaking all
> existing programs in the world?
> 
> A: The existing Blob applications access storage via HTTP (REST API). They
> don't use POSIX interface. The interface for Azure Blob is not designed
> on POSIX.
> 
> Q: What programs today use this new API?
> 
> A: Currently none is released. But per above, there are also none using
> POSIX.
> 
> Q: Where is the API published and what ensures that it will remain stable?
> 
> A: Cloud based REST protocols have similar considerations to the kernel in
> terms of interface stability. Applications depend on cloud services via
> REST in much the same way as they depend on kernel services. Because
> applications can consume cloud APIs over the Internet, there is no
> opportunity to recompile applications to ensure compatibility. This causes
> the underlying APIs to be exceptionally stable, and Azure Blob has not
> removed support for an exposed API to date. This driver is supporting a
> pass-through model where requests in a guest process can be reflected to a
> VM host environment. Like the current REST interface, the goal is to ensure
> that each host provide a high degree of compatibility with each guest, but
> that task is largely outside the scope of this driver, which exists to
> communicate requests in the same way an HTTP stack would. Just like an HTTP
> stack does not require updates to add a new custom header or receive one
> from a server, this driver does not require updates for new functionality
> so long as the high level request/response model is retained.
> 
> Q: What happens when it changes over time, do we have to rebuild all
> userspace applications?
> 
> A: No. We don’t rebuild them all to talk HTTP either. In the current HTTP
> scheme, applications specify the version of the protocol they talk, and the
> storage backend responds with that version.
> 
> Q: What happens to the kernel code over time, how do you handle changes to
> the API there?
> 
> A: The goal of this driver is to get requests to the Hyper-V host, so the
> kernel isn’t involved in API changes, in the same way that HTTP
> implementations are robust to extra functionality being added to HTTP.

Another question is why do we need this in the kernel? Has it been
considered to provide a driver similar to vfio on top of the Hyper-V bus
such that this object storage driver can be implemented as a user-space
library instead of as a kernel driver? As you may know vfio users can
either use eventfds for completion notifications or polling. An
interface like io_uring can be built easily on top of vfio.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
@ 2021-07-20  5:26   ` Greg Kroah-Hartman
  2021-07-20  5:30   ` Greg Kroah-Hartman
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-20  5:26 UTC (permalink / raw)
  To: longli
  Cc: linux-fs, linux-block, linux-kernel, linux-hyperv, Long Li,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Bjorn Andersson,
	Hans de Goede, Dan Williams, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Jiri Slaby, Andra Paraschiv, Siddharth Gupta,
	Hannes Reinecke, linux-doc

On Mon, Jul 19, 2021 at 08:31:05PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> Azure Blob provides scalable, secure and shared storage services for Azure.
> 
> This driver adds support for accelerated access to Azure Blob storage. As an
> alternative to REST APIs, it provides a fast data path that uses host native
> network stack and direct data link for storage server access.

Thank you for fixing up the license info, but you did not include any
information here about how to use this, where the userspace tools are,
nor why you are going around the existing kernel block layer and proof
that going around the block layer is a good idea.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
  2021-07-20  5:26   ` Greg Kroah-Hartman
@ 2021-07-20  5:30   ` Greg Kroah-Hartman
  2021-07-20  7:34   ` Greg Kroah-Hartman
  2021-07-20 11:10   ` Jiri Slaby
  3 siblings, 0 replies; 23+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-20  5:30 UTC (permalink / raw)
  To: longli
  Cc: linux-fs, linux-block, linux-kernel, linux-hyperv, Long Li,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Bjorn Andersson,
	Hans de Goede, Dan Williams, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Jiri Slaby, Andra Paraschiv, Siddharth Gupta,
	Hannes Reinecke, linux-doc

On Mon, Jul 19, 2021 at 08:31:05PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> Azure Blob provides scalable, secure and shared storage services for Azure.
> 
> This driver adds support for accelerated access to Azure Blob storage. As an
> alternative to REST APIs, it provides a fast data path that uses host native
> network stack and direct data link for storage server access.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Dexuan Cui <decui@microsoft.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Hans de Goede <hdegoede@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Maximilian Luz <luzmaximilian@gmail.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Jiri Slaby <jirislaby@kernel.org>
> Cc: Andra Paraschiv <andraprs@amazon.com>
> Cc: Siddharth Gupta <sidgup@codeaurora.org>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: linux-doc@vger.kernel.org
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |   2 +
>  drivers/hv/Kconfig                            |  11 +
>  drivers/hv/Makefile                           |   1 +
>  drivers/hv/channel_mgmt.c                     |   7 +
>  drivers/hv/hv_azure_blob.c                    | 628 ++++++++++++++++++
>  include/linux/hyperv.h                        |   9 +
>  include/uapi/misc/hv_azure_blob.h             |  34 +
>  7 files changed, 692 insertions(+)
>  create mode 100644 drivers/hv/hv_azure_blob.c
>  create mode 100644 include/uapi/misc/hv_azure_blob.h
> 

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- This looks like a new version of a previously submitted patch, but you
  did not list below the --- line any changes from the previous version.
  Please read the section entitled "The canonical patch format" in the
  kernel file, Documentation/SubmittingPatches for what needs to be done
  here to properly describe this.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20  4:37 ` [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob Bart Van Assche
@ 2021-07-20  6:01   ` Christoph Hellwig
  2021-07-20  7:05     ` Long Li
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2021-07-20  6:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv, Long Li

On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
> such that this object storage driver can be implemented as a user-space
> library instead of as a kernel driver? As you may know vfio users can
> either use eventfds for completion notifications or polling. An
> interface like io_uring can be built easily on top of vfio.

Yes.  Similar to say the NVMe K/V command set this does not look like
a candidate for a kernel driver.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20  6:01   ` Christoph Hellwig
@ 2021-07-20  7:05     ` Long Li
  2021-07-20 15:15       ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Long Li @ 2021-07-20  7:05 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv, gregkh

> Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> access to Microsoft Azure Blob
> 
> On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
> > such that this object storage driver can be implemented as a
> > user-space library instead of as a kernel driver? As you may know vfio
> > users can either use eventfds for completion notifications or polling.
> > An interface like io_uring can be built easily on top of vfio.
> 
> Yes.  Similar to say the NVMe K/V command set this does not look like a
> candidate for a kernel driver.

The driver is modeled to support multiple processes/users over a VMBUS
channel. I don't see a way that this can be implemented through VFIO? 

Even if it can be done, this exposes a security risk as the same VMBUS
channel is shared by multiple processes in user-mode.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
  2021-07-20  5:26   ` Greg Kroah-Hartman
  2021-07-20  5:30   ` Greg Kroah-Hartman
@ 2021-07-20  7:34   ` Greg Kroah-Hartman
  2021-07-20 19:57     ` Long Li
  2021-07-20 11:10   ` Jiri Slaby
  3 siblings, 1 reply; 23+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-20  7:34 UTC (permalink / raw)
  To: longli
  Cc: linux-fs, linux-block, linux-kernel, linux-hyperv, Long Li,
	Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Bjorn Andersson,
	Hans de Goede, Dan Williams, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Jiri Slaby, Andra Paraschiv, Siddharth Gupta,
	Hannes Reinecke, linux-doc

On Mon, Jul 19, 2021 at 08:31:05PM -0700, longli@linuxonhyperv.com wrote:
> +struct az_blob_device {
> +	struct hv_device *device;
> +
> +	/* Opened files maintained by this device */
> +	struct list_head file_list;
> +	/* Lock for protecting file_list */
> +	spinlock_t file_lock;
> +
> +	/* The refcount for this device */
> +	refcount_t count;

Just use a kref please if you really need this.  Are you sure you do?
You already have 2 other reference counted objects being used here, why
make it 3?

> +	/* Pending requests to VSP */
> +	atomic_t pending;

Why does this need to be atomic?


> +	wait_queue_head_t waiting_to_drain;
> +
> +	bool removing;

Are you sure this actually works properly?  Why is it needed vs. any
other misc device?


> +/* VSC->VSP request */
> +struct az_blob_vsp_request {
> +	u32 version;
> +	u32 timeout_ms;
> +	u32 data_buffer_offset;
> +	u32 data_buffer_length;
> +	u32 data_buffer_valid;
> +	u32 operation_type;
> +	u32 request_buffer_offset;
> +	u32 request_buffer_length;
> +	u32 response_buffer_offset;
> +	u32 response_buffer_length;
> +	guid_t transaction_id;
> +} __packed;

Why packed?  If this is going across the wire somewhere, you need to
specify the endian-ness of these values, right?  If this is not going
across the wire, no need for it to be packed.

> +
> +/* VSP->VSC response */
> +struct az_blob_vsp_response {
> +	u32 length;
> +	u32 error;
> +	u32 response_len;
> +} __packed;

Same here.

> +
> +struct az_blob_vsp_request_ctx {
> +	struct list_head list;
> +	struct completion wait_vsp;
> +	struct az_blob_request_sync *request;
> +};
> +
> +struct az_blob_file_ctx {
> +	struct list_head list;
> +
> +	/* List of pending requests to VSP */
> +	struct list_head vsp_pending_requests;
> +	/* Lock for protecting vsp_pending_requests */
> +	spinlock_t vsp_pending_lock;
> +	wait_queue_head_t wait_vsp_pending;
> +
> +	pid_t pid;

Why do you need a pid?  What namespace is this pid in?

> +static int az_blob_probe(struct hv_device *device,
> +			 const struct hv_vmbus_device_id *dev_id)
> +{
> +	int ret;
> +	struct az_blob_device *dev;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&dev->file_lock);
> +	INIT_LIST_HEAD(&dev->file_list);
> +	atomic_set(&dev->pending, 0);
> +	init_waitqueue_head(&dev->waiting_to_drain);
> +
> +	ret = az_blob_connect_to_vsp(device, dev, AZ_BLOB_RING_SIZE);
> +	if (ret)
> +		goto fail;
> +
> +	refcount_set(&dev->count, 1);
> +	az_blob_dev = dev;
> +
> +	// create user-mode client library facing device
> +	ret = az_blob_create_device(dev);
> +	if (ret) {
> +		dev_err(AZ_DEV, "failed to create device ret=%d\n", ret);
> +		az_blob_remove_vmbus(device);
> +		goto fail;
> +	}
> +
> +	dev_info(AZ_DEV, "successfully probed device\n");

When drivers are working properly, they should be quiet.

And what is with the AZ_DEV macro mess?

And can you handle more than one device in the system at one time?  I
think your debugfs logic will get really confused.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
                     ` (2 preceding siblings ...)
  2021-07-20  7:34   ` Greg Kroah-Hartman
@ 2021-07-20 11:10   ` Jiri Slaby
  2021-07-20 22:12     ` Long Li
  3 siblings, 1 reply; 23+ messages in thread
From: Jiri Slaby @ 2021-07-20 11:10 UTC (permalink / raw)
  To: longli, linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Long Li, Jonathan Corbet, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Greg Kroah-Hartman,
	Bjorn Andersson, Hans de Goede, Dan Williams, Maximilian Luz,
	Mike Rapoport, Ben Widawsky, Andra Paraschiv, Siddharth Gupta,
	Hannes Reinecke, linux-doc

On 20. 07. 21, 5:31, longli@linuxonhyperv.com wrote:
> --- /dev/null
> +++ b/include/uapi/misc/hv_azure_blob.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> +/* Copyright (c) 2021 Microsoft Corporation. */
> +
> +#ifndef _AZ_BLOB_H
> +#define _AZ_BLOB_H
> +
> +#include <linux/kernel.h>
> +#include <linux/uuid.h>

Quoting from 
https://lore.kernel.org/linux-doc/MWHPR21MB159375586D810EC5DCB66AF0D7039@MWHPR21MB1593.namprd21.prod.outlook.com/:
=====
Seems like a #include of asm/ioctl.h (or something similar)
is needed so that _IOWR is defined.  Also, a #include
is needed for __u32, __aligned_u64, guid_t, etc.
=====

Why was no include added?

> +
> +/* user-mode sync request sent through ioctl */
> +struct az_blob_request_sync_response {
> +	__u32 status;
> +	__u32 response_len;
> +};
> +
> +struct az_blob_request_sync {
> +	guid_t guid;
> +	__u32 timeout;
> +	__u32 request_len;
> +	__u32 response_len;
> +	__u32 data_len;
> +	__u32 data_valid;
> +	__aligned_u64 request_buffer;
> +	__aligned_u64 response_buffer;
> +	__aligned_u64 data_buffer;
> +	struct az_blob_request_sync_response response;
> +};
> +
> +#define AZ_BLOB_MAGIC_NUMBER	'R'
> +#define IOCTL_AZ_BLOB_DRIVER_USER_REQUEST \
> +		_IOWR(AZ_BLOB_MAGIC_NUMBER, 0xf0, \
> +			struct az_blob_request_sync)
> +
> +#endif /* define _AZ_BLOB_H */
> 

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20  7:05     ` Long Li
@ 2021-07-20 15:15       ` Bart Van Assche
  2021-07-20 17:33         ` Long Li
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2021-07-20 15:15 UTC (permalink / raw)
  To: Long Li, Christoph Hellwig
  Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv, gregkh

On 7/20/21 12:05 AM, Long Li wrote:
>> Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
>> access to Microsoft Azure Blob
>>
>> On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
>>> such that this object storage driver can be implemented as a
>>> user-space library instead of as a kernel driver? As you may know vfio
>>> users can either use eventfds for completion notifications or polling.
>>> An interface like io_uring can be built easily on top of vfio.
>>
>> Yes.  Similar to say the NVMe K/V command set this does not look like a
>> candidate for a kernel driver.
> 
> The driver is modeled to support multiple processes/users over a VMBUS
> channel. I don't see a way that this can be implemented through VFIO? 
> 
> Even if it can be done, this exposes a security risk as the same VMBUS
> channel is shared by multiple processes in user-mode.

Sharing a VMBUS channel among processes is not necessary. I propose to
assign one VMBUS channel to each process and to multiplex I/O submitted
to channels associated with the same blob storage object inside e.g. the
hypervisor. This is not a new idea. In the NVMe specification there is a
diagram that shows that multiple NVMe controllers can provide access to
the same NVMe namespace. See also diagram "Figure 416: NVM Subsystem
with Three I/O Controllers" in version 1.4 of the NVMe specification.

Bart.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
                   ` (3 preceding siblings ...)
  2021-07-20  4:37 ` [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob Bart Van Assche
@ 2021-07-20 15:54 ` Greg KH
  2021-07-20 18:37   ` Long Li
  4 siblings, 1 reply; 23+ messages in thread
From: Greg KH @ 2021-07-20 15:54 UTC (permalink / raw)
  To: longli; +Cc: linux-fs, linux-block, linux-kernel, linux-hyperv, Long Li

On Mon, Jul 19, 2021 at 08:31:03PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> Microsoft Azure Blob storage service exposes a REST API to applications
> for data access.
> (https://docs.microsoft.com/en-us/rest/api/storageservices/blob-service-rest-api)
> 
> This patchset implements a VSC (Virtualization Service Consumer) that
> communicates with a VSP (Virtualization Service Provider) on the Hyper-V
> host to execute Blob storage access via native network stack on the host.
> This VSC doesn't implement the semantics of REST API. Those are implemented
> in user-space. The VSC provides a fast data path to VSP.
> 
> Answers to some previous questions discussing the driver:
> 
> Q: Why this driver doesn't use the block layer
> 
> A: The Azure Blob is based on a model of object oriented storage. The
> storage object is not modeled in block sectors. While it's possible to
> present the storage object as a block device (assuming it makes sense to
> fake all the block device attributes), we lose the ability to express
> functionality that are defined in the REST API. 

What "REST API"?

Why doesn't object oriented storage map to a file handle to read from?
No need to mess with "blocks", why would you care about them?

And again, you loose all caching, this thing has got to be really slow,
why add a slow storage interface?  What workload requires this type of
slow block storage?

> Q: You also just abandoned the POSIX model and forced people to use a
> random-custom-library just to access their storage devices, breaking all
> existing programs in the world?
> 
> A: The existing Blob applications access storage via HTTP (REST API). They
> don't use POSIX interface. The interface for Azure Blob is not designed
> on POSIX.

I do not see a HTTP interface here, what does that have to do with the
kernel?

I see a single ioctl interface, that's all.

> Q: What programs today use this new API?
> 
> A: Currently none is released. But per above, there are also none using
> POSIX.

Great, so no one uses this, so who is asking for and requiring this
thing?

What about just doing it all from userspace using FUSE?  Why does this
HAVE to be a kernel driver?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20 15:15       ` Bart Van Assche
@ 2021-07-20 17:33         ` Long Li
  2021-07-20 18:16           ` gregkh
  0 siblings, 1 reply; 23+ messages in thread
From: Long Li @ 2021-07-20 17:33 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv, gregkh

> Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> access to Microsoft Azure Blob
> 
> On 7/20/21 12:05 AM, Long Li wrote:
> >> Subject: Re: [Patch v4 0/3] Introduce a driver to support host
> >> accelerated access to Microsoft Azure Blob
> >>
> >> On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
> >>> such that this object storage driver can be implemented as a
> >>> user-space library instead of as a kernel driver? As you may know
> >>> vfio users can either use eventfds for completion notifications or polling.
> >>> An interface like io_uring can be built easily on top of vfio.
> >>
> >> Yes.  Similar to say the NVMe K/V command set this does not look like
> >> a candidate for a kernel driver.
> >
> > The driver is modeled to support multiple processes/users over a VMBUS
> > channel. I don't see a way that this can be implemented through VFIO?
> >
> > Even if it can be done, this exposes a security risk as the same VMBUS
> > channel is shared by multiple processes in user-mode.
> 
> Sharing a VMBUS channel among processes is not necessary. I propose to
> assign one VMBUS channel to each process and to multiplex I/O submitted to
> channels associated with the same blob storage object inside e.g. the
> hypervisor. This is not a new idea. In the NVMe specification there is a
> diagram that shows that multiple NVMe controllers can provide access to the
> same NVMe namespace. See also diagram "Figure 416: NVM Subsystem with
> Three I/O Controllers" in version 1.4 of the NVMe specification.
> 
> Bart.

Currently, the Hyper-V is not designed to have one VMBUS channel for each process.
In Hyper-V, a channel is offered from the host to the guest VM. The host doesn't
know in advance how many processes are going to use this service so it can't
offer those channels in advance. There is no mechanism to offer dynamic
per-process allocated channels based on guest needs. Some devices (e.g.
network and storage) use multiple channels for scalability but they are not
for serving individual processes.

Assigning one VMBUS channel per process needs significant change on the Hyper-V side.

Long

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20 17:33         ` Long Li
@ 2021-07-20 18:16           ` gregkh
  2021-07-20 18:52             ` Long Li
  0 siblings, 1 reply; 23+ messages in thread
From: gregkh @ 2021-07-20 18:16 UTC (permalink / raw)
  To: Long Li
  Cc: Bart Van Assche, Christoph Hellwig, longli, linux-fs,
	linux-block, linux-kernel, linux-hyperv

On Tue, Jul 20, 2021 at 05:33:47PM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> > access to Microsoft Azure Blob
> > 
> > On 7/20/21 12:05 AM, Long Li wrote:
> > >> Subject: Re: [Patch v4 0/3] Introduce a driver to support host
> > >> accelerated access to Microsoft Azure Blob
> > >>
> > >> On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
> > >>> such that this object storage driver can be implemented as a
> > >>> user-space library instead of as a kernel driver? As you may know
> > >>> vfio users can either use eventfds for completion notifications or polling.
> > >>> An interface like io_uring can be built easily on top of vfio.
> > >>
> > >> Yes.  Similar to say the NVMe K/V command set this does not look like
> > >> a candidate for a kernel driver.
> > >
> > > The driver is modeled to support multiple processes/users over a VMBUS
> > > channel. I don't see a way that this can be implemented through VFIO?
> > >
> > > Even if it can be done, this exposes a security risk as the same VMBUS
> > > channel is shared by multiple processes in user-mode.
> > 
> > Sharing a VMBUS channel among processes is not necessary. I propose to
> > assign one VMBUS channel to each process and to multiplex I/O submitted to
> > channels associated with the same blob storage object inside e.g. the
> > hypervisor. This is not a new idea. In the NVMe specification there is a
> > diagram that shows that multiple NVMe controllers can provide access to the
> > same NVMe namespace. See also diagram "Figure 416: NVM Subsystem with
> > Three I/O Controllers" in version 1.4 of the NVMe specification.
> > 
> > Bart.
> 
> Currently, the Hyper-V is not designed to have one VMBUS channel for each process.

So it's a slow interface :(

> In Hyper-V, a channel is offered from the host to the guest VM. The host doesn't
> know in advance how many processes are going to use this service so it can't
> offer those channels in advance. There is no mechanism to offer dynamic
> per-process allocated channels based on guest needs. Some devices (e.g.
> network and storage) use multiple channels for scalability but they are not
> for serving individual processes.
> 
> Assigning one VMBUS channel per process needs significant change on the Hyper-V side.

What is the throughput of a single channel as-is?  You provided no
benchmarks or numbers at all in this patchset which would justify this
new kernel driver :(

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20 15:54 ` Greg KH
@ 2021-07-20 18:37   ` Long Li
  2021-07-21  5:18     ` Greg KH
  0 siblings, 1 reply; 23+ messages in thread
From: Long Li @ 2021-07-20 18:37 UTC (permalink / raw)
  To: Greg KH, longli; +Cc: linux-fs, linux-block, linux-kernel, linux-hyperv

> Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> access to Microsoft Azure Blob
> 
> On Mon, Jul 19, 2021 at 08:31:03PM -0700, longli@linuxonhyperv.com wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > Microsoft Azure Blob storage service exposes a REST API to
> > applications for data access.
> >
> (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdoc
> > s.microsoft.com%2Fen-us%2Frest%2Fapi%2Fstorageservices%2Fblob-
> service-
> > rest-
> api&amp;data=04%7C01%7Clongli%40microsoft.com%7Ce499fbe161554232e
> >
> b1608d94b96a772%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637
> 623932
> >
> 843247787%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIj
> oiV2luMzIi
> >
> LCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=9CKNHAmurdBWp
> ZeLfkiC18
> > CXNg66UhKsSZzzHZkzf0Y%3D&amp;reserved=0)
> >
> > This patchset implements a VSC (Virtualization Service Consumer) that
> > communicates with a VSP (Virtualization Service Provider) on the
> > Hyper-V host to execute Blob storage access via native network stack on
> the host.
> > This VSC doesn't implement the semantics of REST API. Those are
> > implemented in user-space. The VSC provides a fast data path to VSP.
> >
> > Answers to some previous questions discussing the driver:
> >
> > Q: Why this driver doesn't use the block layer
> >
> > A: The Azure Blob is based on a model of object oriented storage. The
> > storage object is not modeled in block sectors. While it's possible to
> > present the storage object as a block device (assuming it makes sense
> > to fake all the block device attributes), we lose the ability to
> > express functionality that are defined in the REST API.
> 
> What "REST API"?
> 
> Why doesn't object oriented storage map to a file handle to read from?
> No need to mess with "blocks", why would you care about them?
> 
> And again, you loose all caching, this thing has got to be really slow, why add
> a slow storage interface?  What workload requires this type of slow block
> storage?

"Blob REST API" expresses storage request semantics through HTTP. In Blob
REST API, each request is associated with a context meta data expressed in
HTTP headers. The ability to express those semantics is rich, it's only limited
by HTTP protocol.

There are attempts to implement the Blob as a file system.
Here is an example filesystem (BlobFuse) implemented for Blob:
(https://github.com/Azure/azure-storage-fuse).

It's doable, but at the same time you lose all the performance and
shareable/security features presented in the REST API for Blob. A POSIX
interface cannot express same functionality as the REST API for Blob.

For example, The Blob API for read (Get Blob, 
https://docs.microsoft.com/en-us/rest/api/storageservices/get-blob)
has rich meta data context that cannot easily be mapped to POSIX. The same
goes to other Blob API to manage security tokens and the life cycle of shareable
objects.

BlobFuse (above) filesystem demonstrated why Blob should not be implemented
on a filesystem. It's useable for data management purposes. It's not usable for an I/O
intensive workload. It's not usable for managing sharable objects and security tokens.

Blob is designed not to use file system caching and block layer I/O scheduling.
There are many solutions existing today, that access raw disk for I/O, bypassing
filesystem and block layer. For example, many database applications access raw
disks for I/O. When the application knows the I/O pattern and its intended behavior,
it doesn't get much benefit from filesystem or block.

> 
> > Q: You also just abandoned the POSIX model and forced people to use a
> > random-custom-library just to access their storage devices, breaking
> > all existing programs in the world?
> >
> > A: The existing Blob applications access storage via HTTP (REST API).
> > They don't use POSIX interface. The interface for Azure Blob is not
> > designed on POSIX.
> 
> I do not see a HTTP interface here, what does that have to do with the kernel?
> 
> I see a single ioctl interface, that's all.

The driver doesn't attempt to implement Blob API or HTTP. It presents a fast data
path to pass Blob requests to hosts, so the guest VM doesn't need to assemble
a HTTP packet for each Blob REST requests. This also eliminates additional
overhead in guest network stack to send the HTTP packets over TCP/IP.

Once the request reaches the Azure host, it knows the best way to reach to the 
backend storage and serving the Blob request, while at the same time all the 
security and integrity features are preserved.


> 
> > Q: What programs today use this new API?
> >
> > A: Currently none is released. But per above, there are also none
> > using POSIX.
> 
> Great, so no one uses this, so who is asking for and requiring this thing?
> 
> What about just doing it all from userspace using FUSE?  Why does this HAVE
> to be a kernel driver?

We have a major workload nearing the end of development. Compared with
REST over HTTP, this device model presented faster data access and CPU savings
in that there is no overhead of sending HTTP over network.

Eventually, all the existing Blob REST API applications can use this new API, once
it gets to their Blob transport libraries.

I have explained why BlobFuse is not suitable for production workloads. There
are people using BlobFuse but mostly for management purposes.

Thanks,
Long

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20 18:16           ` gregkh
@ 2021-07-20 18:52             ` Long Li
  0 siblings, 0 replies; 23+ messages in thread
From: Long Li @ 2021-07-20 18:52 UTC (permalink / raw)
  To: gregkh
  Cc: Bart Van Assche, Christoph Hellwig, longli, linux-fs,
	linux-block, linux-kernel, linux-hyperv

> Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> access to Microsoft Azure Blob
> 
> On Tue, Jul 20, 2021 at 05:33:47PM +0000, Long Li wrote:
> > > Subject: Re: [Patch v4 0/3] Introduce a driver to support host
> > > accelerated access to Microsoft Azure Blob
> > >
> > > On 7/20/21 12:05 AM, Long Li wrote:
> > > >> Subject: Re: [Patch v4 0/3] Introduce a driver to support host
> > > >> accelerated access to Microsoft Azure Blob
> > > >>
> > > >> On Mon, Jul 19, 2021 at 09:37:56PM -0700, Bart Van Assche wrote:
> > > >>> such that this object storage driver can be implemented as a
> > > >>> user-space library instead of as a kernel driver? As you may
> > > >>> know vfio users can either use eventfds for completion notifications
> or polling.
> > > >>> An interface like io_uring can be built easily on top of vfio.
> > > >>
> > > >> Yes.  Similar to say the NVMe K/V command set this does not look
> > > >> like a candidate for a kernel driver.
> > > >
> > > > The driver is modeled to support multiple processes/users over a
> > > > VMBUS channel. I don't see a way that this can be implemented
> through VFIO?
> > > >
> > > > Even if it can be done, this exposes a security risk as the same
> > > > VMBUS channel is shared by multiple processes in user-mode.
> > >
> > > Sharing a VMBUS channel among processes is not necessary. I propose
> > > to assign one VMBUS channel to each process and to multiplex I/O
> > > submitted to channels associated with the same blob storage object
> > > inside e.g. the hypervisor. This is not a new idea. In the NVMe
> > > specification there is a diagram that shows that multiple NVMe
> > > controllers can provide access to the same NVMe namespace. See also
> > > diagram "Figure 416: NVM Subsystem with Three I/O Controllers" in
> version 1.4 of the NVMe specification.
> > >
> > > Bart.
> >
> > Currently, the Hyper-V is not designed to have one VMBUS channel for
> each process.
> 
> So it's a slow interface :(
> 
> > In Hyper-V, a channel is offered from the host to the guest VM. The
> > host doesn't know in advance how many processes are going to use this
> > service so it can't offer those channels in advance. There is no
> > mechanism to offer dynamic per-process allocated channels based on
> guest needs. Some devices (e.g.
> > network and storage) use multiple channels for scalability but they
> > are not for serving individual processes.
> >
> > Assigning one VMBUS channel per process needs significant change on the
> Hyper-V side.
> 
> What is the throughput of a single channel as-is?  You provided no
> benchmarks or numbers at all in this patchset which would justify this new
> kernel driver :(

Test data shows a single channel is not a limitation of the target workload.
The VSP/VSC protocol is designed to avoid data copy as much as possible.
Being a VMBUS device, the Hyper-V is capable of allocating multiple channels
on different CPUs if a single channel proves to be a bottleneck. 

Preliminary test results show the performance increase of up to 30% in data
center environment, while at the same time reducing the number of servers
and CPUs serving Blob requests, as compared to going through the complete
HTTP stack. This also enables the use of transport technology directly to backend
server that are not available to VMs (for example RDMA transport) due to security reasons.

Long

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20  7:34   ` Greg Kroah-Hartman
@ 2021-07-20 19:57     ` Long Li
  2021-07-21  5:08       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 23+ messages in thread
From: Long Li @ 2021-07-20 19:57 UTC (permalink / raw)
  To: Greg Kroah-Hartman, longli
  Cc: linux-fs, linux-block, linux-kernel, linux-hyperv,
	Jonathan Corbet, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, Bjorn Andersson, Hans de Goede, Williams,
	Dan J, Maximilian Luz, Mike Rapoport, Ben Widawsky, Jiri Slaby,
	Andra Paraschiv, Siddharth Gupta, Hannes Reinecke, linux-doc

> Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
> 
> On Mon, Jul 19, 2021 at 08:31:05PM -0700, longli@linuxonhyperv.com wrote:
> > +struct az_blob_device {
> > +	struct hv_device *device;
> > +
> > +	/* Opened files maintained by this device */
> > +	struct list_head file_list;
> > +	/* Lock for protecting file_list */
> > +	spinlock_t file_lock;
> > +
> > +	/* The refcount for this device */
> > +	refcount_t count;
> 
> Just use a kref please if you really need this.  Are you sure you do?
> You already have 2 other reference counted objects being used here, why make
> it 3?

The "count" is to keep track how many user-mode instances and vmbus instance
are opened on this device. Being a VMBUS device, this device can be removed 
at any time (host servicing etc). We must remove the device when this happens
even if the device is still opened by some user-mode program. The "count" will
guarantee the lifecycle of the device object after all user-mode has released the device.

I looked at using "file_list" (it's used for tracking opened files by user-mode) for this purpose, 
but I found out I still need to manage the device count at the vmbus side. 

> 
> > +	/* Pending requests to VSP */
> > +	atomic_t pending;
> 
> Why does this need to be atomic?

"pending' is per-device maintained that could change when multiple-user access
the device at the same time.

> 
> 
> > +	wait_queue_head_t waiting_to_drain;
> > +
> > +	bool removing;
> 
> Are you sure this actually works properly?  Why is it needed vs. any other misc
> device?

When removing this device from vmbus, we need to guarantee there is no possible packets to
vmbus. This is a requirement before calling vmbus_close(). Other drivers of vmbus follows
the same procedure.

The reason why this driver needs this is that the device removal can happen in the middle of
az_blob_ioctl_user_request(), which can send packet over vmbus.

> 
> 
> > +/* VSC->VSP request */
> > +struct az_blob_vsp_request {
> > +	u32 version;
> > +	u32 timeout_ms;
> > +	u32 data_buffer_offset;
> > +	u32 data_buffer_length;
> > +	u32 data_buffer_valid;
> > +	u32 operation_type;
> > +	u32 request_buffer_offset;
> > +	u32 request_buffer_length;
> > +	u32 response_buffer_offset;
> > +	u32 response_buffer_length;
> > +	guid_t transaction_id;
> > +} __packed;
> 
> Why packed?  If this is going across the wire somewhere, you need to specify
> the endian-ness of these values, right?  If this is not going across the wire, no
> need for it to be packed.

Those data go through the wire.

All data structures specified in the Hyper-V and guest VM use Little Endian byte
ordering.  All HV core drivers have a dependence on X86, that guarantees this
ordering.

> 
> > +
> > +/* VSP->VSC response */
> > +struct az_blob_vsp_response {
> > +	u32 length;
> > +	u32 error;
> > +	u32 response_len;
> > +} __packed;
> 
> Same here.
> 
> > +
> > +struct az_blob_vsp_request_ctx {
> > +	struct list_head list;
> > +	struct completion wait_vsp;
> > +	struct az_blob_request_sync *request; };
> > +
> > +struct az_blob_file_ctx {
> > +	struct list_head list;
> > +
> > +	/* List of pending requests to VSP */
> > +	struct list_head vsp_pending_requests;
> > +	/* Lock for protecting vsp_pending_requests */
> > +	spinlock_t vsp_pending_lock;
> > +	wait_queue_head_t wait_vsp_pending;
> > +
> > +	pid_t pid;
> 
> Why do you need a pid?  What namespace is this pid in?

It's a request from user library team for production troubleshooting
purposes. It's exposed as informal in debugfs.

> 
> > +static int az_blob_probe(struct hv_device *device,
> > +			 const struct hv_vmbus_device_id *dev_id) {
> > +	int ret;
> > +	struct az_blob_device *dev;
> > +
> > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +	if (!dev)
> > +		return -ENOMEM;
> > +
> > +	spin_lock_init(&dev->file_lock);
> > +	INIT_LIST_HEAD(&dev->file_list);
> > +	atomic_set(&dev->pending, 0);
> > +	init_waitqueue_head(&dev->waiting_to_drain);
> > +
> > +	ret = az_blob_connect_to_vsp(device, dev, AZ_BLOB_RING_SIZE);
> > +	if (ret)
> > +		goto fail;
> > +
> > +	refcount_set(&dev->count, 1);
> > +	az_blob_dev = dev;
> > +
> > +	// create user-mode client library facing device
> > +	ret = az_blob_create_device(dev);
> > +	if (ret) {
> > +		dev_err(AZ_DEV, "failed to create device ret=%d\n", ret);
> > +		az_blob_remove_vmbus(device);
> > +		goto fail;
> > +	}
> > +
> > +	dev_info(AZ_DEV, "successfully probed device\n");
> 
> When drivers are working properly, they should be quiet.

The reason is that in production environment when dealing with custom support
cases, there is no good way to check if the channel is opened on the device. Having
this message will greatly clear confusions on possible mis-configurations.

> 
> And what is with the AZ_DEV macro mess?

It's not required, it's just for saving code length. I can put "&az_blob_dev->device->device"
in every dev_err(), but it makes the code look a lot longer.

> 
> And can you handle more than one device in the system at one time?  I think
> your debugfs logic will get really confused.

There can be one device object active in the system at any given time. The debugfs grabs
the current active device object. If the device is being removed, removed or added, 
the current active device object is updated accordingly.

> 
> thanks,
> 
> greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20 11:10   ` Jiri Slaby
@ 2021-07-20 22:12     ` Long Li
  2021-07-21  4:57       ` Jiri Slaby
  0 siblings, 1 reply; 23+ messages in thread
From: Long Li @ 2021-07-20 22:12 UTC (permalink / raw)
  To: Jiri Slaby, longli, linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Jonathan Corbet, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, Greg Kroah-Hartman, Bjorn Andersson,
	Hans de Goede, Williams, Dan J, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Andra Paraschiv, Siddharth Gupta, Hannes Reinecke,
	linux-doc

> Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
> 
> On 20. 07. 21, 5:31, longli@linuxonhyperv.com wrote:
> > --- /dev/null
> > +++ b/include/uapi/misc/hv_azure_blob.h
> > @@ -0,0 +1,34 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> > +/* Copyright (c) 2021 Microsoft Corporation. */
> > +
> > +#ifndef _AZ_BLOB_H
> > +#define _AZ_BLOB_H
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/uuid.h>
> 
> Quoting from
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> kernel.org%2Flinux-
> doc%2FMWHPR21MB159375586D810EC5DCB66AF0D7039%40MWHPR21MB1
> 593.namprd21.prod.outlook.com%2F&amp;data=04%7C01%7Clongli%40micr
> osoft.com%7C7fdf2d6ed15d4d4122a308d94b6eeed0%7C72f988bf86f141af91
> ab2d7cd011db47%7C1%7C0%7C637623762292949381%7CUnknown%7CTWFp
> bGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVC
> I6Mn0%3D%7C3000&amp;sdata=kv0ZkU1QL6TxlJJZEQEsT7aqLFL9lmP2SStz8k
> U5sIs%3D&amp;reserved=0:
> =====
> Seems like a #include of asm/ioctl.h (or something similar) is needed so that
> _IOWR is defined.  Also, a #include is needed for __u32, __aligned_u64,
> guid_t, etc.
> =====

The user-space code includes "sys/ioctl.h" for calling into ioctl(). "sys/ioctl.h"
includes <linux/ioctl.h>, so it has no problem finding _IOWR.

guid_t is defined in <uapi/linux/uuid.h>, included from <linux/uuid.h> (in this file)
__u32 and __aligned_u64 are defined in <uapi/linux/types.>, which is included from <linux/kernel.h> (in this file)




> 
> Why was no include added?
> 
> > +
> > +/* user-mode sync request sent through ioctl */ struct
> > +az_blob_request_sync_response {
> > +	__u32 status;
> > +	__u32 response_len;
> > +};
> > +
> > +struct az_blob_request_sync {
> > +	guid_t guid;
> > +	__u32 timeout;
> > +	__u32 request_len;
> > +	__u32 response_len;
> > +	__u32 data_len;
> > +	__u32 data_valid;
> > +	__aligned_u64 request_buffer;
> > +	__aligned_u64 response_buffer;
> > +	__aligned_u64 data_buffer;
> > +	struct az_blob_request_sync_response response; };
> > +
> > +#define AZ_BLOB_MAGIC_NUMBER	'R'
> > +#define IOCTL_AZ_BLOB_DRIVER_USER_REQUEST \
> > +		_IOWR(AZ_BLOB_MAGIC_NUMBER, 0xf0, \
> > +			struct az_blob_request_sync)
> > +
> > +#endif /* define _AZ_BLOB_H */
> >
> 
> thanks,
> --
> js
> suse labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20 22:12     ` Long Li
@ 2021-07-21  4:57       ` Jiri Slaby
  2021-07-21 16:07         ` Long Li
  0 siblings, 1 reply; 23+ messages in thread
From: Jiri Slaby @ 2021-07-21  4:57 UTC (permalink / raw)
  To: Long Li, longli, linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Jonathan Corbet, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, Greg Kroah-Hartman, Bjorn Andersson,
	Hans de Goede, Williams, Dan J, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Andra Paraschiv, Siddharth Gupta, Hannes Reinecke,
	linux-doc

On 21. 07. 21, 0:12, Long Li wrote:
>> Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
>>
>> On 20. 07. 21, 5:31, longli@linuxonhyperv.com wrote:
>>> --- /dev/null
>>> +++ b/include/uapi/misc/hv_azure_blob.h
>>> @@ -0,0 +1,34 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
>>> +/* Copyright (c) 2021 Microsoft Corporation. */
>>> +
>>> +#ifndef _AZ_BLOB_H
>>> +#define _AZ_BLOB_H
>>> +
>>> +#include <linux/kernel.h>
>>> +#include <linux/uuid.h>
>>
>> Quoting from
>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
>> kernel.org%2Flinux-
>> doc%2FMWHPR21MB159375586D810EC5DCB66AF0D7039%40MWHPR21MB1
>> 593.namprd21.prod.outlook.com%2F&amp;data=04%7C01%7Clongli%40micr
>> osoft.com%7C7fdf2d6ed15d4d4122a308d94b6eeed0%7C72f988bf86f141af91
>> ab2d7cd011db47%7C1%7C0%7C637623762292949381%7CUnknown%7CTWFp
>> bGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVC
>> I6Mn0%3D%7C3000&amp;sdata=kv0ZkU1QL6TxlJJZEQEsT7aqLFL9lmP2SStz8k
>> U5sIs%3D&amp;reserved=0:
>> =====
>> Seems like a #include of asm/ioctl.h (or something similar) is needed so that
>> _IOWR is defined.  Also, a #include is needed for __u32, __aligned_u64,
>> guid_t, etc.
>> =====
> 
> The user-space code includes "sys/ioctl.h" for calling into ioctl(). "sys/ioctl.h"
> includes <linux/ioctl.h>, so it has no problem finding _IOWR.
> 
> guid_t is defined in <uapi/linux/uuid.h>, included from <linux/uuid.h> (in this file)
> __u32 and __aligned_u64 are defined in <uapi/linux/types.>, which is included from <linux/kernel.h> (in this file)

No, please don't rely on implicit include chains. Nor that userspace 
solves the includes for you.

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-20 19:57     ` Long Li
@ 2021-07-21  5:08       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 23+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-21  5:08 UTC (permalink / raw)
  To: Long Li
  Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv,
	Jonathan Corbet, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, Bjorn Andersson, Hans de Goede, Williams,
	Dan J, Maximilian Luz, Mike Rapoport, Ben Widawsky, Jiri Slaby,
	Andra Paraschiv, Siddharth Gupta, Hannes Reinecke, linux-doc

On Tue, Jul 20, 2021 at 07:57:56PM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
> > 
> > On Mon, Jul 19, 2021 at 08:31:05PM -0700, longli@linuxonhyperv.com wrote:
> > > +struct az_blob_device {
> > > +	struct hv_device *device;
> > > +
> > > +	/* Opened files maintained by this device */
> > > +	struct list_head file_list;
> > > +	/* Lock for protecting file_list */
> > > +	spinlock_t file_lock;
> > > +
> > > +	/* The refcount for this device */
> > > +	refcount_t count;
> > 
> > Just use a kref please if you really need this.  Are you sure you do?
> > You already have 2 other reference counted objects being used here, why make
> > it 3?
> 
> The "count" is to keep track how many user-mode instances and vmbus instance
> are opened on this device.

That will not work at all, sorry.  Please NEVER try to count "open"
calls to a driver, as the driver will never be told how many userspace
programs are accessing it at any time, nor should it even ever care.

Use the existing apis, there is no need to attempt to count this as you
will always get it wrong (hint, what happens when processes share file
descriptors...)

> Being a VMBUS device, this device can be removed 
> at any time (host servicing etc). We must remove the device when this happens
> even if the device is still opened by some user-mode program. The "count" will
> guarantee the lifecycle of the device object after all user-mode has released the device.

No it will not, just use the existing apis and all will be fine.

> I looked at using "file_list" (it's used for tracking opened files by user-mode) for this purpose, 
> but I found out I still need to manage the device count at the vmbus side. 

Again, you should not need this.  As proof, no other driver needs
this...

> > > +	/* Pending requests to VSP */
> > > +	atomic_t pending;
> > 
> > Why does this need to be atomic?
> 
> "pending' is per-device maintained that could change when multiple-user access
> the device at the same time.

How do you know this, and why does it matter?

> > > +	wait_queue_head_t waiting_to_drain;
> > > +
> > > +	bool removing;
> > 
> > Are you sure this actually works properly?  Why is it needed vs. any other misc
> > device?
> 
> When removing this device from vmbus, we need to guarantee there is no possible packets to
> vmbus.

Why?  Shouldn't the vmbus api handle this for you automatically?  Why is
this driver unique here?

> This is a requirement before calling vmbus_close(). Other drivers of vmbus follows
> the same procedure.

Ah.  Why not fix this in the vmbus core?  That sounds like extra logic
that everyone would have to duplicate for no good reason.

> The reason why this driver needs this is that the device removal can happen in the middle of
> az_blob_ioctl_user_request(), which can send packet over vmbus.

Sure, but the bus should handle that for you.

> > > +/* VSC->VSP request */
> > > +struct az_blob_vsp_request {
> > > +	u32 version;
> > > +	u32 timeout_ms;
> > > +	u32 data_buffer_offset;
> > > +	u32 data_buffer_length;
> > > +	u32 data_buffer_valid;
> > > +	u32 operation_type;
> > > +	u32 request_buffer_offset;
> > > +	u32 request_buffer_length;
> > > +	u32 response_buffer_offset;
> > > +	u32 response_buffer_length;
> > > +	guid_t transaction_id;
> > > +} __packed;
> > 
> > Why packed?  If this is going across the wire somewhere, you need to specify
> > the endian-ness of these values, right?  If this is not going across the wire, no
> > need for it to be packed.
> 
> Those data go through the wire.
> 
> All data structures specified in the Hyper-V and guest VM use Little Endian byte
> ordering.  All HV core drivers have a dependence on X86, that guarantees this
> ordering.

Then specify little endian please.

> > > +	struct completion wait_vsp;
> > > +	struct az_blob_request_sync *request; };
> > > +
> > > +struct az_blob_file_ctx {
> > > +	struct list_head list;
> > > +
> > > +	/* List of pending requests to VSP */
> > > +	struct list_head vsp_pending_requests;
> > > +	/* Lock for protecting vsp_pending_requests */
> > > +	spinlock_t vsp_pending_lock;
> > > +	wait_queue_head_t wait_vsp_pending;
> > > +
> > > +	pid_t pid;
> > 
> > Why do you need a pid?  What namespace is this pid in?
> 
> It's a request from user library team for production troubleshooting
> purposes. It's exposed as informal in debugfs.

Then it will be wrong.  Put all of your "debugging" stuff into one place
so it is obvious what it is for, and so that people can ignore it.

> > > +static int az_blob_probe(struct hv_device *device,
> > > +			 const struct hv_vmbus_device_id *dev_id) {
> > > +	int ret;
> > > +	struct az_blob_device *dev;
> > > +
> > > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > > +	if (!dev)
> > > +		return -ENOMEM;
> > > +
> > > +	spin_lock_init(&dev->file_lock);
> > > +	INIT_LIST_HEAD(&dev->file_list);
> > > +	atomic_set(&dev->pending, 0);
> > > +	init_waitqueue_head(&dev->waiting_to_drain);
> > > +
> > > +	ret = az_blob_connect_to_vsp(device, dev, AZ_BLOB_RING_SIZE);
> > > +	if (ret)
> > > +		goto fail;
> > > +
> > > +	refcount_set(&dev->count, 1);
> > > +	az_blob_dev = dev;
> > > +
> > > +	// create user-mode client library facing device
> > > +	ret = az_blob_create_device(dev);
> > > +	if (ret) {
> > > +		dev_err(AZ_DEV, "failed to create device ret=%d\n", ret);
> > > +		az_blob_remove_vmbus(device);
> > > +		goto fail;
> > > +	}
> > > +
> > > +	dev_info(AZ_DEV, "successfully probed device\n");
> > 
> > When drivers are working properly, they should be quiet.
> 
> The reason is that in production environment when dealing with custom support
> cases, there is no good way to check if the channel is opened on the device. Having
> this message will greatly clear confusions on possible mis-configurations.

Again, no, the driver should be quiet.  If you REALLY need this type of
thing, the bus should be the one doing that type of notification for all
devices on the bus.  Do not make this a per-driver choice.

> > And what is with the AZ_DEV macro mess?
> 
> It's not required, it's just for saving code length. I can put "&az_blob_dev->device->device"
> in every dev_err(), but it makes the code look a lot longer.

Then perhaps your structures are not correct?  Please spell it out so
that we can see that your implementation needs fixing.

> > And can you handle more than one device in the system at one time?  I think
> > your debugfs logic will get really confused.
> 
> There can be one device object active in the system at any given time. The debugfs grabs
> the current active device object. If the device is being removed, removed or added, 
> the current active device object is updated accordingly.

If I remember, your debugfs initialization seems to not use the device
name, but rather a "driver" name, which implies that this is not going
to work well with multiple devices.  But I could be wrong.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob
  2021-07-20 18:37   ` Long Li
@ 2021-07-21  5:18     ` Greg KH
  0 siblings, 0 replies; 23+ messages in thread
From: Greg KH @ 2021-07-21  5:18 UTC (permalink / raw)
  To: Long Li; +Cc: longli, linux-fs, linux-block, linux-kernel, linux-hyperv

On Tue, Jul 20, 2021 at 06:37:38PM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 0/3] Introduce a driver to support host accelerated
> > access to Microsoft Azure Blob
> > 
> > On Mon, Jul 19, 2021 at 08:31:03PM -0700, longli@linuxonhyperv.com wrote:
> > > From: Long Li <longli@microsoft.com>
> > >
> > > Microsoft Azure Blob storage service exposes a REST API to
> > > applications for data access.
> > >
> > (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdoc
> > > s.microsoft.com%2Fen-us%2Frest%2Fapi%2Fstorageservices%2Fblob-
> > service-
> > > rest-
> > api&amp;data=04%7C01%7Clongli%40microsoft.com%7Ce499fbe161554232e
> > >
> > b1608d94b96a772%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637
> > 623932
> > >
> > 843247787%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIj
> > oiV2luMzIi
> > >
> > LCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=9CKNHAmurdBWp
> > ZeLfkiC18
> > > CXNg66UhKsSZzzHZkzf0Y%3D&amp;reserved=0)
> > >
> > > This patchset implements a VSC (Virtualization Service Consumer) that
> > > communicates with a VSP (Virtualization Service Provider) on the
> > > Hyper-V host to execute Blob storage access via native network stack on
> > the host.
> > > This VSC doesn't implement the semantics of REST API. Those are
> > > implemented in user-space. The VSC provides a fast data path to VSP.
> > >
> > > Answers to some previous questions discussing the driver:
> > >
> > > Q: Why this driver doesn't use the block layer
> > >
> > > A: The Azure Blob is based on a model of object oriented storage. The
> > > storage object is not modeled in block sectors. While it's possible to
> > > present the storage object as a block device (assuming it makes sense
> > > to fake all the block device attributes), we lose the ability to
> > > express functionality that are defined in the REST API.
> > 
> > What "REST API"?
> > 
> > Why doesn't object oriented storage map to a file handle to read from?
> > No need to mess with "blocks", why would you care about them?
> > 
> > And again, you loose all caching, this thing has got to be really slow, why add
> > a slow storage interface?  What workload requires this type of slow block
> > storage?
> 
> "Blob REST API" expresses storage request semantics through HTTP. In Blob
> REST API, each request is associated with a context meta data expressed in
> HTTP headers. The ability to express those semantics is rich, it's only limited
> by HTTP protocol.

HTTP has nothing to do with the kernel, so I do not understand what you
are talking about here.

> There are attempts to implement the Blob as a file system.
> Here is an example filesystem (BlobFuse) implemented for Blob:
> (https://github.com/Azure/azure-storage-fuse).
> 
> It's doable, but at the same time you lose all the performance and
> shareable/security features presented in the REST API for Blob.

What sharable/security features are in this driver instead?  I saw none.

> A POSIX
> interface cannot express same functionality as the REST API for Blob.

But you are not putting a REST api into this kernel driver, so I fail to
understand this.

> For example, The Blob API for read (Get Blob, 
> https://docs.microsoft.com/en-us/rest/api/storageservices/get-blob)
> has rich meta data context that cannot easily be mapped to POSIX. The same
> goes to other Blob API to manage security tokens and the life cycle of shareable
> objects.

How can you have sharable objects in this ioctl interface instead?

> BlobFuse (above) filesystem demonstrated why Blob should not be implemented
> on a filesystem. It's useable for data management purposes. It's not usable for an I/O
> intensive workload. It's not usable for managing sharable objects and security tokens.

What is the bottleneck for the throughput/performance issues involved?

> Blob is designed not to use file system caching and block layer I/O scheduling.
> There are many solutions existing today, that access raw disk for I/O, bypassing
> filesystem and block layer. For example, many database applications access raw
> disks for I/O. When the application knows the I/O pattern and its intended behavior,
> it doesn't get much benefit from filesystem or block.

Databases that use raw i/o "know what they are doing" and are constantly
fighting with the kernel.  Don't expect kernel developers to just think
that this is ok.

But this is not what you are doing here at all, this is an object
storage, you are still being forced to open/ioctl/close to get an object
instead of just doing open/read/close, so I fail to understand where the
performance issues are.

And if they are in the FUSE layer, why not just write a filesystem layer
in the kernel instead to resolve that?  Who is insisting that you do
this through a character device driver to get filesystem data?

> > > Q: You also just abandoned the POSIX model and forced people to use a
> > > random-custom-library just to access their storage devices, breaking
> > > all existing programs in the world?
> > >
> > > A: The existing Blob applications access storage via HTTP (REST API).
> > > They don't use POSIX interface. The interface for Azure Blob is not
> > > designed on POSIX.
> > 
> > I do not see a HTTP interface here, what does that have to do with the kernel?
> > 
> > I see a single ioctl interface, that's all.
> 
> The driver doesn't attempt to implement Blob API or HTTP. It presents a fast data
> path to pass Blob requests to hosts, so the guest VM doesn't need to assemble
> a HTTP packet for each Blob REST requests. This also eliminates additional
> overhead in guest network stack to send the HTTP packets over TCP/IP.

Again, I fail to understand how http or tcp/ip comes into play here at
all, that's not what this driver does.

> Once the request reaches the Azure host, it knows the best way to reach to the 
> backend storage and serving the Blob request, while at the same time all the 
> security and integrity features are preserved.

I do not understand this statement at all.

> > > Q: What programs today use this new API?
> > >
> > > A: Currently none is released. But per above, there are also none
> > > using POSIX.
> > 
> > Great, so no one uses this, so who is asking for and requiring this thing?
> > 
> > What about just doing it all from userspace using FUSE?  Why does this HAVE
> > to be a kernel driver?
> 
> We have a major workload nearing the end of development. Compared with
> REST over HTTP, this device model presented faster data access and CPU savings
> in that there is no overhead of sending HTTP over network.

Your development cycle means nothing to us, sorry.  Please realize that
if you submit something that is not acceptable to us, there is no
requirement that we take it just because we feel this is implemented in
totally the wrong way.

And you have not, again, proven that there is any performance
improvement anywhere due to lack of numbers and data.   And again, what
does HTTP have to do with this driver.

> Eventually, all the existing Blob REST API applications can use this new API, once
> it gets to their Blob transport libraries.

What applications?  What libraries?  Who is using any of this?

> I have explained why BlobFuse is not suitable for production workloads. There
> are people using BlobFuse but mostly for management purposes.

I fail to see why it is not usable as you have not provided any real
information, sorry.

Please step back and write up a document that explains the problem you
are trying to solve and then we can go from there.  Right now you are
throwing a driver at us and expecting us to just accept it, despite it
looking like it is completly wrong for the problem space it is
attempting to interact with.

Please work with some experienced Linux kernel developers on your team
to do all of this _BEFORE_ submitting it to the community again.  Based
on the mistakes made so far, it looks like you could use some guidance
in turning this into something that might be acceptable.  As it is, you
are a long way off.

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
  2021-07-21  4:57       ` Jiri Slaby
@ 2021-07-21 16:07         ` Long Li
  0 siblings, 0 replies; 23+ messages in thread
From: Long Li @ 2021-07-21 16:07 UTC (permalink / raw)
  To: Jiri Slaby, longli, linux-fs, linux-block, linux-kernel, linux-hyperv
  Cc: Jonathan Corbet, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Wei Liu, Dexuan Cui, Greg Kroah-Hartman, Bjorn Andersson,
	Hans de Goede, Williams, Dan J, Maximilian Luz, Mike Rapoport,
	Ben Widawsky, Andra Paraschiv, Siddharth Gupta, Hannes Reinecke,
	linux-doc

> Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
> 
> On 21. 07. 21, 0:12, Long Li wrote:
> >> Subject: Re: [Patch v4 2/3] Drivers: hv: add Azure Blob driver
> >>
> >> On 20. 07. 21, 5:31, longli@linuxonhyperv.com wrote:
> >>> --- /dev/null
> >>> +++ b/include/uapi/misc/hv_azure_blob.h
> >>> @@ -0,0 +1,34 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> >>> +/* Copyright (c) 2021 Microsoft Corporation. */
> >>> +
> >>> +#ifndef _AZ_BLOB_H
> >>> +#define _AZ_BLOB_H
> >>> +
> >>> +#include <linux/kernel.h>
> >>> +#include <linux/uuid.h>
> >>
> >> Quoting from
> >>
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> >> kernel.org%2Flinux-
> >>
> doc%2FMWHPR21MB159375586D810EC5DCB66AF0D7039%40MWHPR21M
> B1
> >>
> 593.namprd21.prod.outlook.com%2F&amp;data=04%7C01%7Clongli%40micr
> >>
> osoft.com%7C7fdf2d6ed15d4d4122a308d94b6eeed0%7C72f988bf86f141af
> 91
> >>
> ab2d7cd011db47%7C1%7C0%7C637623762292949381%7CUnknown%7CTW
> Fp
> >>
> bGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVC
> >>
> I6Mn0%3D%7C3000&amp;sdata=kv0ZkU1QL6TxlJJZEQEsT7aqLFL9lmP2SStz8k
> >> U5sIs%3D&amp;reserved=0:
> >> =====
> >> Seems like a #include of asm/ioctl.h (or something similar) is needed
> >> so that _IOWR is defined.  Also, a #include is needed for __u32,
> >> __aligned_u64, guid_t, etc.
> >> =====
> >
> > The user-space code includes "sys/ioctl.h" for calling into ioctl(). "sys/ioctl.h"
> > includes <linux/ioctl.h>, so it has no problem finding _IOWR.
> >
> > guid_t is defined in <uapi/linux/uuid.h>, included from <linux/uuid.h>
> > (in this file)
> > __u32 and __aligned_u64 are defined in <uapi/linux/types.>, which is
> > included from <linux/kernel.h> (in this file)
> 
> No, please don't rely on implicit include chains. Nor that userspace solves the
> includes for you.

Will fix this.

> 
> thanks,
> --
> js
> suse labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-07-21 16:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-20  3:31 [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob longli
2021-07-20  3:31 ` [Patch v4 1/3] Drivers: hv: vmbus: add support to ignore certain PCIE devices longli
2021-07-20  3:31 ` [Patch v4 2/3] Drivers: hv: add Azure Blob driver longli
2021-07-20  5:26   ` Greg Kroah-Hartman
2021-07-20  5:30   ` Greg Kroah-Hartman
2021-07-20  7:34   ` Greg Kroah-Hartman
2021-07-20 19:57     ` Long Li
2021-07-21  5:08       ` Greg Kroah-Hartman
2021-07-20 11:10   ` Jiri Slaby
2021-07-20 22:12     ` Long Li
2021-07-21  4:57       ` Jiri Slaby
2021-07-21 16:07         ` Long Li
2021-07-20  3:31 ` [Patch v4 3/3] Drivers: hv: Add to maintainer for Hyper-V/Azure drivers longli
2021-07-20  4:37 ` [Patch v4 0/3] Introduce a driver to support host accelerated access to Microsoft Azure Blob Bart Van Assche
2021-07-20  6:01   ` Christoph Hellwig
2021-07-20  7:05     ` Long Li
2021-07-20 15:15       ` Bart Van Assche
2021-07-20 17:33         ` Long Li
2021-07-20 18:16           ` gregkh
2021-07-20 18:52             ` Long Li
2021-07-20 15:54 ` Greg KH
2021-07-20 18:37   ` Long Li
2021-07-21  5:18     ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).