linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface
@ 2020-11-21  0:30 Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Nuno Das Neves
                   ` (19 more replies)
  0 siblings, 20 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

This patch series provides a userspace interface for creating and running guest
virtual machines while running on the Microsoft Hypervisor [0].

Since managing guest machines can only be done when Linux is the root partition,
this series depends on the RFC already posted by Wei Liu:
https://lore.kernel.org/linux-hyperv/20201105165814.29233-1-wei.liu@kernel.org/T/#t

The first two patches provide some helpers for converting hypervisor status
codes to linux error codes, and easily printing hypervisor status codes to dmesg
for debugging.

Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h are
split into uapi and non-uapi. The uapi versions contain structures used in both
the ioctl interface and the kernel.

The mshv API is introduced in virt/mshv/mshv_main.c. As each interface is
introduced, documentation is added in Documentation/virt/mshv/api.rst.
The API is file-desciptor based, like KVM. The entry point is /dev/mshv.

/dev/mshv ioctls:
MSHV_REQUEST_VERSION
MSHV_CREATE_PARTITION

Partition (vm) ioctls:
MSHV_MAP_GUEST_MEMORY, MSHV_UNMAP_GUEST_MEMORY
MSHV_INSTALL_INTERCEPT
MSHV_ASSERT_INTERRUPT
MSHV_GET_PARTITION_PROPERTY, MSHV_SET_PARTITION_PROPERTY
MSHV_CREATE_VP

Vp (vcpu) ioctls:
MSHV_GET_VP_REGISTERS, MSHV_SET_VP_REGISTERS
MSHV_RUN_VP
MSHV_GET_VP_STATE, MSHV_SET_VP_STATE
mmap() (register page)

[0] Hyper-V is more well-known, but it really refers to the whole stack
    including the hypervisor and other components that run in Windows kernel
    and userspace.

Nuno Das Neves (18):
  x86/hyperv: convert hyperv statuses to linux error codes
  asm-generic/hyperv: convert hyperv statuses to strings
  virt/mshv: minimal mshv module (/dev/mshv/)
  virt/mshv: request version ioctl
  virt/mshv: create partition ioctl
  virt/mshv: create, initialize, finalize, delete partition hypercalls
  virt/mshv: withdraw memory hypercall
  virt/mshv: map and unmap guest memory
  virt/mshv: create vcpu ioctl
  virt/mshv: get and set vcpu registers ioctls
  virt/mshv: set up synic pages for intercept messages
  virt/mshv: run vp ioctl and isr
  virt/mshv: install intercept ioctl
  virt/mshv: assert interrupt ioctl
  virt/mshv: get and set vp state ioctls
  virt/mshv: mmap vp register page
  virt/mshv: get and set partition property ioctls
  virt/mshv: Add enlightenment bits to create partition ioctl

 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 Documentation/virt/mshv/api.rst               |  173 ++
 arch/x86/Kconfig                              |    2 +
 arch/x86/hyperv/Kconfig                       |   22 +
 arch/x86/hyperv/Makefile                      |    4 +
 arch/x86/hyperv/hv_init.c                     |    2 +-
 arch/x86/hyperv/hv_proc.c                     |   40 +-
 arch/x86/include/asm/hyperv-tlfs.h            |   44 +-
 arch/x86/include/asm/mshyperv.h               |    1 +
 arch/x86/include/uapi/asm/hyperv-tlfs.h       | 1312 +++++++++++
 arch/x86/kernel/cpu/mshyperv.c                |   16 +
 include/asm-generic/hyperv-tlfs.h             |  324 ++-
 include/asm-generic/mshyperv.h                |    3 +
 include/linux/mshv.h                          |   61 +
 include/uapi/asm-generic/hyperv-tlfs.h        |  160 ++
 include/uapi/linux/mshv.h                     |  109 +
 virt/mshv/mshv_main.c                         | 2054 +++++++++++++++++
 17 files changed, 4178 insertions(+), 151 deletions(-)
 create mode 100644 Documentation/virt/mshv/api.rst
 create mode 100644 arch/x86/hyperv/Kconfig
 create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
 create mode 100644 include/linux/mshv.h
 create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
 create mode 100644 include/uapi/linux/mshv.h
 create mode 100644 virt/mshv/mshv_main.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-09 13:04   ` Vitaly Kuznetsov
  2020-11-21  0:30 ` [RFC PATCH 02/18] asm-generic/hyperv: convert hyperv statuses to strings Nuno Das Neves
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Return linux-friendly error codes from hypercall wrapper functions.
This will be needed in the mshv module.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/hyperv/hv_proc.c         | 30 ++++++++++++++++++++++++++---
 arch/x86/include/asm/mshyperv.h   |  1 +
 include/asm-generic/hyperv-tlfs.h | 32 +++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
index 0fd972c9129a..8f86f8e86748 100644
--- a/arch/x86/hyperv/hv_proc.c
+++ b/arch/x86/hyperv/hv_proc.c
@@ -18,6 +18,30 @@
 #define HV_DEPOSIT_MAX_ORDER (8)
 #define HV_DEPOSIT_MAX (1 << HV_DEPOSIT_MAX_ORDER)
 
+int hv_status_to_errno(int hv_status)
+{
+	switch (hv_status) {
+	case HV_STATUS_SUCCESS:
+		return 0;
+	case HV_STATUS_INVALID_PARAMETER:
+	case HV_STATUS_UNKNOWN_PROPERTY:
+	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
+	case HV_STATUS_INVALID_VP_INDEX:
+	case HV_STATUS_INVALID_REGISTER_VALUE:
+	case HV_STATUS_INVALID_LP_INDEX:
+		return EINVAL;
+	case HV_STATUS_ACCESS_DENIED:
+	case HV_STATUS_OPERATION_DENIED:
+		return EACCES;
+	case HV_STATUS_NOT_ACKNOWLEDGED:
+	case HV_STATUS_INVALID_VP_STATE:
+	case HV_STATUS_INVALID_PARTITION_STATE:
+		return EBADFD;
+	}
+	return ENOTRECOVERABLE;
+}
+EXPORT_SYMBOL_GPL(hv_status_to_errno);
+
 /*
  * Deposits exact number of pages
  * Must be called with interrupts enabled
@@ -99,7 +123,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 
 	if (status != HV_STATUS_SUCCESS) {
 		pr_err("Failed to deposit pages: %d\n", status);
-		ret = status;
+		ret = -hv_status_to_errno(status);
 		goto err_free_allocations;
 	}
 
@@ -155,7 +179,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 			if (status != HV_STATUS_SUCCESS) {
 				pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
 				       lp_index, apic_id, status);
-				ret = status;
+				ret = -hv_status_to_errno(status);
 			}
 			break;
 		}
@@ -203,7 +227,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 			if (status != HV_STATUS_SUCCESS) {
 				pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
 				       vp_index, flags, status);
-				ret = status;
+				ret = -hv_status_to_errno(status);
 			}
 			break;
 		}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index cbee72550a12..eb75faa4d4c5 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -243,6 +243,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
 int hyperv_fill_flush_guest_mapping_list(
 		struct hv_guest_mapping_flush_list *flush,
 		u64 start_gfn, u64 end_gfn);
+int hv_status_to_errno(int hv_status);
 
 extern bool hv_root_partition;
 
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index dd385c6a71b5..445244192fa4 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -181,16 +181,28 @@ enum HV_GENERIC_SET_FORMAT {
 #define HV_HYPERCALL_REP_START_MASK	GENMASK_ULL(59, 48)
 
 /* hypercall status code */
-#define HV_STATUS_SUCCESS			0
-#define HV_STATUS_INVALID_HYPERCALL_CODE	2
-#define HV_STATUS_INVALID_HYPERCALL_INPUT	3
-#define HV_STATUS_INVALID_ALIGNMENT		4
-#define HV_STATUS_INVALID_PARAMETER		5
-#define HV_STATUS_OPERATION_DENIED		8
-#define HV_STATUS_INSUFFICIENT_MEMORY		11
-#define HV_STATUS_INVALID_PORT_ID		17
-#define HV_STATUS_INVALID_CONNECTION_ID		18
-#define HV_STATUS_INSUFFICIENT_BUFFERS		19
+#define HV_STATUS_SUCCESS			0x0
+#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
+#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x3
+#define HV_STATUS_INVALID_ALIGNMENT		0x4
+#define HV_STATUS_INVALID_PARAMETER		0x5
+#define HV_STATUS_ACCESS_DENIED			0x6
+#define HV_STATUS_INVALID_PARTITION_STATE	0x7
+#define HV_STATUS_OPERATION_DENIED		0x8
+#define HV_STATUS_UNKNOWN_PROPERTY		0x9
+#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
+#define HV_STATUS_INSUFFICIENT_MEMORY		0xB
+#define HV_STATUS_INVALID_PARTITION_ID		0xD
+#define HV_STATUS_INVALID_VP_INDEX		0xE
+#define HV_STATUS_NOT_FOUND			0x10
+#define HV_STATUS_INVALID_PORT_ID		0x11
+#define HV_STATUS_INVALID_CONNECTION_ID		0x12
+#define HV_STATUS_INSUFFICIENT_BUFFERS		0x13
+#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
+#define HV_STATUS_INVALID_VP_STATE		0x15
+#define HV_STATUS_NO_RESOURCES			0x1D
+#define HV_STATUS_INVALID_LP_INDEX		0x41
+#define HV_STATUS_INVALID_REGISTER_VALUE	0x50
 
 /*
  * The Hyper-V TimeRefCount register and the TSC
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 02/18] asm-generic/hyperv: convert hyperv statuses to strings
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 03/18] virt/mshv: minimal mshv module (/dev/mshv/) Nuno Das Neves
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Allow hyperv hypercall failures to be debugged more easily with dmesg.
This will be used in the mshv module.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/hyperv/hv_init.c         |  2 +-
 arch/x86/hyperv/hv_proc.c         | 10 +++---
 include/asm-generic/hyperv-tlfs.h | 60 +++++++++++++++++++------------
 3 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 2c2189832da7..2a8cd2cf0745 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -363,7 +363,7 @@ void __init hv_get_partition_id(void)
 	status = hv_do_hypercall(HVCALL_GET_PARTITION_ID, NULL, output_page) &
 		HV_HYPERCALL_RESULT_MASK;
 	if (status != HV_STATUS_SUCCESS)
-		pr_err("Failed to get partition ID: %d\n", status);
+		pr_err("Failed to get partition ID: %s\n", hv_status_to_string(status));
 	else
 		hv_current_partition_id = output_page->partition_id;
 	local_irq_restore(flags);
diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
index 8f86f8e86748..a88ed6873fbd 100644
--- a/arch/x86/hyperv/hv_proc.c
+++ b/arch/x86/hyperv/hv_proc.c
@@ -122,7 +122,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 	local_irq_restore(flags);
 
 	if (status != HV_STATUS_SUCCESS) {
-		pr_err("Failed to deposit pages: %d\n", status);
+		pr_err("Failed to deposit pages: %s\n", hv_status_to_string(status));
 		ret = -hv_status_to_errno(status);
 		goto err_free_allocations;
 	}
@@ -177,8 +177,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 
 		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (status != HV_STATUS_SUCCESS) {
-				pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
-				       lp_index, apic_id, status);
+				pr_err("%s: cpu %u apic ID %u, %s\n", __func__,
+				       lp_index, apic_id, hv_status_to_string(status));
 				ret = -hv_status_to_errno(status);
 			}
 			break;
@@ -225,8 +225,8 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 
 		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (status != HV_STATUS_SUCCESS) {
-				pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
-				       vp_index, flags, status);
+				pr_err("%s: vcpu %u, lp %u, %s\n", __func__,
+				       vp_index, flags, hv_status_to_string(status));
 				ret = -hv_status_to_errno(status);
 			}
 			break;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 445244192fa4..05b9dc9896ab 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -181,28 +181,44 @@ enum HV_GENERIC_SET_FORMAT {
 #define HV_HYPERCALL_REP_START_MASK	GENMASK_ULL(59, 48)
 
 /* hypercall status code */
-#define HV_STATUS_SUCCESS			0x0
-#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
-#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x3
-#define HV_STATUS_INVALID_ALIGNMENT		0x4
-#define HV_STATUS_INVALID_PARAMETER		0x5
-#define HV_STATUS_ACCESS_DENIED			0x6
-#define HV_STATUS_INVALID_PARTITION_STATE	0x7
-#define HV_STATUS_OPERATION_DENIED		0x8
-#define HV_STATUS_UNKNOWN_PROPERTY		0x9
-#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
-#define HV_STATUS_INSUFFICIENT_MEMORY		0xB
-#define HV_STATUS_INVALID_PARTITION_ID		0xD
-#define HV_STATUS_INVALID_VP_INDEX		0xE
-#define HV_STATUS_NOT_FOUND			0x10
-#define HV_STATUS_INVALID_PORT_ID		0x11
-#define HV_STATUS_INVALID_CONNECTION_ID		0x12
-#define HV_STATUS_INSUFFICIENT_BUFFERS		0x13
-#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
-#define HV_STATUS_INVALID_VP_STATE		0x15
-#define HV_STATUS_NO_RESOURCES			0x1D
-#define HV_STATUS_INVALID_LP_INDEX		0x41
-#define HV_STATUS_INVALID_REGISTER_VALUE	0x50
+#define __HV_STATUS_DEF(OP) \
+	OP(HV_STATUS_SUCCESS,				0x0) \
+	OP(HV_STATUS_INVALID_HYPERCALL_CODE,		0x2) \
+	OP(HV_STATUS_INVALID_HYPERCALL_INPUT,		0x3) \
+	OP(HV_STATUS_INVALID_ALIGNMENT,			0x4) \
+	OP(HV_STATUS_INVALID_PARAMETER,			0x5) \
+	OP(HV_STATUS_ACCESS_DENIED,			0x6) \
+	OP(HV_STATUS_INVALID_PARTITION_STATE,		0x7) \
+	OP(HV_STATUS_OPERATION_DENIED,			0x8) \
+	OP(HV_STATUS_UNKNOWN_PROPERTY,			0x9) \
+	OP(HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE,	0xA) \
+	OP(HV_STATUS_INSUFFICIENT_MEMORY,		0xB) \
+	OP(HV_STATUS_INVALID_PARTITION_ID,		0xD) \
+	OP(HV_STATUS_INVALID_VP_INDEX,			0xE) \
+	OP(HV_STATUS_NOT_FOUND,				0x10) \
+	OP(HV_STATUS_INVALID_PORT_ID,			0x11) \
+	OP(HV_STATUS_INVALID_CONNECTION_ID,		0x12) \
+	OP(HV_STATUS_INSUFFICIENT_BUFFERS,		0x13) \
+	OP(HV_STATUS_NOT_ACKNOWLEDGED,			0x14) \
+	OP(HV_STATUS_INVALID_VP_STATE,			0x15) \
+	OP(HV_STATUS_NO_RESOURCES,			0x1D) \
+	OP(HV_STATUS_INVALID_LP_INDEX,			0x41) \
+	OP(HV_STATUS_INVALID_REGISTER_VALUE,		0x50)
+
+#define __HV_MAKE_HV_STATUS_ENUM(NAME, VAL) NAME = (VAL),
+#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME);
+
+enum hv_status {
+	__HV_STATUS_DEF(__HV_MAKE_HV_STATUS_ENUM)
+};
+
+static inline const char *hv_status_to_string(enum hv_status status)
+{
+	switch (status) {
+	__HV_STATUS_DEF(__HV_MAKE_HV_STATUS_CASE)
+	default : return "Unknown";
+	}
+}
 
 /*
  * The Hyper-V TimeRefCount register and the TSC
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 03/18] virt/mshv: minimal mshv module (/dev/mshv/)
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 02/18] asm-generic/hyperv: convert hyperv statuses to strings Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 04/18] virt/mshv: request version ioctl Nuno Das Neves
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce a barebones module file for the mshv API.
Introduce CONFIG_HYPERV_ROOT_API for controlling compilation of mshv.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/Kconfig         |  2 ++
 arch/x86/hyperv/Kconfig  | 22 +++++++++++++
 arch/x86/hyperv/Makefile |  4 +++
 virt/mshv/mshv_main.c    | 70 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+)
 create mode 100644 arch/x86/hyperv/Kconfig
 create mode 100644 virt/mshv/mshv_main.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..8d3848eea358 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2901,3 +2901,5 @@ source "drivers/firmware/Kconfig"
 source "arch/x86/kvm/Kconfig"
 
 source "arch/x86/Kconfig.assembler"
+
+source "arch/x86/hyperv/Kconfig"
diff --git a/arch/x86/hyperv/Kconfig b/arch/x86/hyperv/Kconfig
new file mode 100644
index 000000000000..81e783ab3514
--- /dev/null
+++ b/arch/x86/hyperv/Kconfig
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# HYPERV_ROOT_API configuration
+#
+
+config HYPERV_ROOT_API
+	tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
+	depends on HYPERV
+	help
+	  Provides access to interfaces for managing guest virtual machines
+	  running under the Microsoft Hypervisor.
+
+	  These interfaces will only work when Linux is running as root
+	  partition on the Microsoft Hypervisor.
+
+	  The interfaces are provided via a device named /dev/mshv.
+
+	  To compile this as a module, choose M here: the module
+	  will be called mshv.
+
+	  If unsure, say N.
+
diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 2ebcf3969121..86f6dc1c5118 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -5,3 +5,7 @@ obj-$(CONFIG_X86_64)	+= hv_apic.o hv_proc.o irqdomain.o
 ifdef CONFIG_X86_64
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= hv_spinlock.o
 endif
+
+MSHV := ../../../virt/mshv
+mshv-y                          += $(MSHV)/mshv_main.o
+obj-$(CONFIG_HYPERV_ROOT_API)   += mshv.o
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
new file mode 100644
index 000000000000..ecb9089761fe
--- /dev/null
+++ b/virt/mshv/mshv_main.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, Microsoft Corporation.
+ *
+ * Authors:
+ *   Nuno Das Neves <nudasnev@microsoft.com>
+ *   Lillian Grassin-Drake <ligrassi@microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+MODULE_AUTHOR("Microsoft");
+MODULE_LICENSE("GPL");
+
+static long
+mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+	return -ENOTTY;
+}
+
+static int
+mshv_dev_open(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static int
+mshv_dev_release(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static const struct file_operations mshv_dev_fops = {
+	.owner = THIS_MODULE,
+	.open = mshv_dev_open,
+	.release = mshv_dev_release,
+	.unlocked_ioctl = mshv_dev_ioctl,
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice mshv_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mshv",
+	.fops = &mshv_dev_fops,
+	.mode = 600,
+};
+
+static int
+__init mshv_init(void)
+{
+	int r;
+
+	r = misc_register(&mshv_dev);
+	if (r)
+		pr_err("%s: misc device register failed\n", __func__);
+
+	return r;
+}
+
+static void
+__exit mshv_exit(void)
+{
+	misc_deregister(&mshv_dev);
+}
+
+module_init(mshv_init);
+module_exit(mshv_exit);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 04/18] virt/mshv: request version ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (2 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 03/18] virt/mshv: minimal mshv module (/dev/mshv/) Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:41   ` Michael Kelley
  2021-02-09 13:11   ` Vitaly Kuznetsov
  2020-11-21  0:30 ` [RFC PATCH 05/18] virt/mshv: create partition ioctl Nuno Das Neves
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Reserve ioctl number in userpsace-api/ioctl/ioctl-number.rst
Introduce MSHV_REQUEST_VERSION ioctl.
Introduce documentation for /dev/mshv in Documentation/virt/mshv

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |  2 +
 Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
 include/linux/mshv.h                          | 11 ++++
 include/uapi/linux/mshv.h                     | 19 ++++++
 virt/mshv/mshv_main.c                         | 49 +++++++++++++++
 5 files changed, 143 insertions(+)
 create mode 100644 Documentation/virt/mshv/api.rst
 create mode 100644 include/linux/mshv.h
 create mode 100644 include/uapi/linux/mshv.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 55a2d9b2ce33..13a4d3ecafca 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
 0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-remoteproc@vger.kernel.org>
 0xB6  all    linux/fpga-dfl.h
 0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
+0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
+                                                                     <mailto:linux-hyperv@vger.kernel.org>
 0xC0  00-0F  linux/usb/iowarrior.h
 0xCA  00-0F  uapi/misc/cxl.h
 0xCA  10-2F  uapi/misc/ocxl.h
diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
new file mode 100644
index 000000000000..82e32de48d03
--- /dev/null
+++ b/Documentation/virt/mshv/api.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Microsoft Hypervisor Root Partition API Documentation
+=====================================================
+
+1. Overview
+===========
+
+This document describes APIs for creating and managing guest virtual machines
+when running Linux as the root partition on the Microsoft Hypervisor.
+
+This API is not yet stable.
+
+2. Glossary/Terms
+=================
+
+hv
+--
+Short for Hyper-V. This name is used in the kernel to describe interfaces to
+the Microsoft Hypervisor.
+
+mshv
+----
+Short for Microsoft Hypervisor. This is the name of the userland API module
+described in this document.
+
+Partition
+---------
+A virtual machine running on the Microsoft Hypervisor.
+
+Root Partition
+--------------
+The partition that is created and assumes control when the machine boots. The
+root partition can use mshv APIs to create guest partitions.
+
+3. API description
+==================
+
+The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
+
+Mshv is file descriptor-based, following a similar pattern to KVM.
+
+To get a handle to the mshv driver, use open("/dev/mshv").
+
+3.1 MSHV_REQUEST_VERSION
+------------------------
+:Type: /dev/mshv ioctl
+:Parameters: pointer to a u32
+:Returns: 0 on success
+
+Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
+establish the interface version with the kernel module.
+
+The caller should pass the MSHV_VERSION as an argument.
+
+The kernel module will check which interface versions it supports and return 0
+if one of them matches.
+
+This /dev/mshv file descriptor will remain 'locked' to that version as long as
+it is open - this ioctl can only be called once per open.
+
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
new file mode 100644
index 000000000000..a0982fe2c0b8
--- /dev/null
+++ b/include/linux/mshv.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_MSHV_H
+#define _LINUX_MSHV_H
+
+/*
+ * Microsoft Hypervisor root partition driver for /dev/mshv
+ */
+
+#include <uapi/linux/mshv.h>
+
+#endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
new file mode 100644
index 000000000000..dd30fc2f0a80
--- /dev/null
+++ b/include/uapi/linux/mshv.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_MSHV_H
+#define _UAPI_LINUX_MSHV_H
+
+/*
+ * Userspace interface for /dev/mshv
+ * Microsoft Hypervisor root partition APIs
+ */
+
+#include <linux/types.h>
+
+#define MSHV_VERSION	0x0
+
+#define MSHV_IOCTL 0xB8
+
+/* mshv device */
+#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
+
+#endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index ecb9089761fe..62f631f85301 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -11,25 +11,74 @@
 #include <linux/module.h>
 #include <linux/fs.h>
 #include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/mshv.h>
 
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 
+#define MSHV_INVALID_VERSION	0xFFFFFFFF
+#define MSHV_CURRENT_VERSION	MSHV_VERSION
+
+static u32 supported_versions[] = {
+	MSHV_CURRENT_VERSION,
+};
+
+static long
+mshv_ioctl_request_version(u32 *version, void __user *user_arg)
+{
+	u32 arg;
+	int i;
+
+	if (copy_from_user(&arg, user_arg, sizeof(arg)))
+		return -EFAULT;
+
+	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
+		if (supported_versions[i] == arg) {
+			*version = supported_versions[i];
+			return 0;
+		}
+	}
+	return -ENOTSUPP;
+}
+
 static long
 mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
+	u32 *version = (u32 *)filp->private_data;
+
+	if (ioctl == MSHV_REQUEST_VERSION) {
+		/* Version can only be set once */
+		if (*version != MSHV_INVALID_VERSION)
+			return -EBADFD;
+
+		return mshv_ioctl_request_version(version, (void __user *)arg);
+	}
+
+	/* Version must be set before other ioctls can be called */
+	if (*version == MSHV_INVALID_VERSION)
+		return -EBADFD;
+
+	/* TODO other ioctls */
+
 	return -ENOTTY;
 }
 
 static int
 mshv_dev_open(struct inode *inode, struct file *filp)
 {
+	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
+	if (!filp->private_data)
+		return -ENOMEM;
+	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
+
 	return 0;
 }
 
 static int
 mshv_dev_release(struct inode *inode, struct file *filp)
 {
+	kfree(filp->private_data);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 05/18] virt/mshv: create partition ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (3 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 04/18] virt/mshv: request version ioctl Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-09 13:15   ` Vitaly Kuznetsov
  2020-11-21  0:30 ` [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls Nuno Das Neves
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Add MSHV_CREATE_PARTITION, which creates an fd to track a new partition.
Partition is not yet created in the hypervisor itself.
Introduce header files for userspace-facing hyperv structures.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         |  12 ++
 arch/x86/include/asm/hyperv-tlfs.h      |   1 +
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 124 ++++++++++++++++
 include/asm-generic/hyperv-tlfs.h       |   1 +
 include/linux/mshv.h                    |  16 +++
 include/uapi/asm-generic/hyperv-tlfs.h  |  14 ++
 include/uapi/linux/mshv.h               |   7 +
 virt/mshv/mshv_main.c                   | 179 +++++++++++++++++++++---
 8 files changed, 338 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
 create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 82e32de48d03..ce651a1738e0 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -39,6 +39,9 @@ root partition can use mshv APIs to create guest partitions.
 
 The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
 
+The uapi header files you need are linux/mshv.h, asm/hyperv-tlfs.h, and
+asm-generic/hyperv-tlfs.h.
+
 Mshv is file descriptor-based, following a similar pattern to KVM.
 
 To get a handle to the mshv driver, use open("/dev/mshv").
@@ -60,3 +63,12 @@ if one of them matches.
 This /dev/mshv file descriptor will remain 'locked' to that version as long as
 it is open - this ioctl can only be called once per open.
 
+3.2 MSHV_CREATE_PARTITION
+-------------------------
+:Type: /dev/mshv ioctl
+:Parameters: struct mshv_create_partition
+:Returns: partition file descriptor, or -1 on failure
+
+This ioctl creates a guest partition, returning a file descriptor to use as a
+handle for partition ioctls.
+
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 592c75e51e0f..4cd44ae9bffb 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -11,6 +11,7 @@
 
 #include <linux/types.h>
 #include <asm/page.h>
+#include <uapi/asm/hyperv-tlfs.h>
 /*
  * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
  * is set by CPUID(HvCpuIdFunctionVersionAndFeatures).
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
new file mode 100644
index 000000000000..72150c25ffe6
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_HYPERV_TLFS_USER_H
+#define _UAPI_ASM_X86_HYPERV_TLFS_USER_H
+
+#include <linux/types.h>
+
+#define HV_PARTITION_PROCESSOR_FEATURE_BANKS 2
+
+union hv_partition_processor_features {
+	struct {
+		__u64 sse3_support:1;
+		__u64 lahf_sahf_support:1;
+		__u64 ssse3_support:1;
+		__u64 sse4_1_support:1;
+		__u64 sse4_2_support:1;
+		__u64 sse4a_support:1;
+		__u64 xop_support:1;
+		__u64 pop_cnt_support:1;
+		__u64 cmpxchg16b_support:1;
+		__u64 altmovcr8_support:1;
+		__u64 lzcnt_support:1;
+		__u64 mis_align_sse_support:1;
+		__u64 mmx_ext_support:1;
+		__u64 amd3dnow_support:1;
+		__u64 extended_amd3dnow_support:1;
+		__u64 page_1gb_support:1;
+		__u64 aes_support:1;
+		__u64 pclmulqdq_support:1;
+		__u64 pcid_support:1;
+		__u64 fma4_support:1;
+		__u64 f16c_support:1;
+		__u64 rd_rand_support:1;
+		__u64 rd_wr_fs_gs_support:1;
+		__u64 smep_support:1;
+		__u64 enhanced_fast_string_support:1;
+		__u64 bmi1_support:1;
+		__u64 bmi2_support:1;
+		__u64 hle_support_deprecated:1;
+		__u64 rtm_support_deprecated:1;
+		__u64 movbe_support:1;
+		__u64 npiep1_support:1;
+		__u64 dep_x87_fpu_save_support:1;
+		__u64 rd_seed_support:1;
+		__u64 adx_support:1;
+		__u64 intel_prefetch_support:1;
+		__u64 smap_support:1;
+		__u64 hle_support:1;
+		__u64 rtm_support:1;
+		__u64 rdtscp_support:1;
+		__u64 clflushopt_support:1;
+		__u64 clwb_support:1;
+		__u64 sha_support:1;
+		__u64 x87_pointers_saved_support:1;
+		__u64 invpcid_support:1;
+		__u64 ibrs_support:1;
+		__u64 stibp_support:1;
+		__u64 ibpb_support: 1;
+		__u64 unrestricted_guest_support:1;
+		__u64 mdd_support:1;
+		__u64 fast_short_rep_mov_support:1;
+		__u64 l1dcache_flush_support:1;
+		__u64 rdcl_no_support:1;
+		__u64 ibrs_all_support:1;
+		__u64 skip_l1df_support:1;
+		__u64 ssb_no_support:1;
+		__u64 rsb_a_no_support:1;
+		__u64 virt_spec_ctrl_support:1;
+		__u64 rd_pid_support:1;
+		__u64 umip_support:1;
+		__u64 mbs_no_support:1;
+		__u64 mb_clear_support:1;
+		__u64 taa_no_support:1;
+		__u64 tsx_ctrl_support:1;
+		/*
+		 * N.B. The final processor feature bit in bank 0 is reserved to
+		 * simplify potential downlevel backports.
+		 */
+		__u64 reserved_bank0:1;
+
+		/* N.B. Begin bank 1 processor features. */
+		__u64 acount_mcount_support:1;
+		__u64 tsc_invariant_support:1;
+		__u64 cl_zero_support:1;
+		__u64 rdpru_support:1;
+		__u64 la57_support:1;
+		__u64 mbec_support:1;
+		__u64 nested_virt_support:1;
+		__u64 psfd_support:1;
+		__u64 cet_ss_support:1;
+		__u64 cet_ibt_support:1;
+		__u64 vmx_exception_inject_support:1;
+		__u64 enqcmd_support:1;
+		__u64 umwait_tpause_support:1;
+		__u64 movdiri_support:1;
+		__u64 movdir64b_support:1;
+		__u64 cldemote_support:1;
+		__u64 serialize_support:1;
+		__u64 tsc_deadline_tmr_support:1;
+		__u64 tsc_adjust_support:1;
+		__u64 fzlrep_movsb:1;
+		__u64 fsrep_stosb:1;
+		__u64 fsrep_cmpsb:1;
+		__u64 reserved_bank1:42;
+	};
+	__u64 as_uint64[HV_PARTITION_PROCESSOR_FEATURE_BANKS];
+};
+
+union hv_partition_processor_xsave_features {
+	struct {
+		__u64 xsave_support : 1;
+		__u64 xsaveopt_support : 1;
+		__u64 avx_support : 1;
+		__u64 reserved1 : 61;
+	};
+	__u64 as_uint64;
+};
+
+struct hv_partition_creation_properties {
+	union hv_partition_processor_features disabled_processor_features;
+	union hv_partition_processor_xsave_features
+		disabled_processor_xsave_features;
+};
+
+#endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 05b9dc9896ab..2ff580780ce4 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/bits.h>
 #include <linux/time64.h>
+#include <uapi/asm-generic/hyperv-tlfs.h>
 
 /*
  * While not explicitly listed in the TLFS, Hyper-V always runs with a page size
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index a0982fe2c0b8..fc4f35089b2c 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -6,6 +6,22 @@
  * Microsoft Hypervisor root partition driver for /dev/mshv
  */
 
+#include <linux/spinlock.h>
 #include <uapi/linux/mshv.h>
 
+#define MSHV_MAX_PARTITIONS		128
+
+struct mshv_partition {
+	u64 id;
+	refcount_t ref_count;
+};
+
+struct mshv {
+	struct {
+		spinlock_t lock;
+		u64 count;
+		struct mshv_partition *array[MSHV_MAX_PARTITIONS];
+	} partitions;
+};
+
 #endif
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
new file mode 100644
index 000000000000..140cc0b4f98f
--- /dev/null
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
+#define _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
+
+#ifndef BIT
+#define BIT(X)	(1ULL << (X))
+#endif
+
+#define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
+#define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
+#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
+#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
+
+#endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index dd30fc2f0a80..3788f8bc5caa 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -8,12 +8,19 @@
  */
 
 #include <linux/types.h>
+#include <asm/hyperv-tlfs.h>
 
 #define MSHV_VERSION	0x0
 
+struct mshv_create_partition {
+	__u64 flags;
+	struct hv_partition_creation_properties partition_creation_properties;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
 #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
+#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
 
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 62f631f85301..4dcbe4907430 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -12,6 +12,8 @@
 #include <linux/fs.h>
 #include <linux/miscdevice.h>
 #include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
 #include <linux/mshv.h>
 
 MODULE_AUTHOR("Microsoft");
@@ -24,6 +26,161 @@ static u32 supported_versions[] = {
 	MSHV_CURRENT_VERSION,
 };
 
+static struct mshv mshv = {};
+
+static void mshv_partition_put(struct mshv_partition *partition);
+static int mshv_partition_release(struct inode *inode, struct file *filp);
+static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+
+static int mshv_dev_open(struct inode *inode, struct file *filp);
+static int mshv_dev_release(struct inode *inode, struct file *filp);
+static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+
+static const struct file_operations mshv_partition_fops = {
+	.release = mshv_partition_release,
+	.unlocked_ioctl = mshv_partition_ioctl,
+	.llseek = noop_llseek,
+};
+
+static const struct file_operations mshv_dev_fops = {
+	.owner = THIS_MODULE,
+	.open = mshv_dev_open,
+	.release = mshv_dev_release,
+	.unlocked_ioctl = mshv_dev_ioctl,
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice mshv_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mshv",
+	.fops = &mshv_dev_fops,
+	.mode = 600,
+};
+
+static long
+mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+	return -ENOTTY;
+}
+
+static void
+destroy_partition(struct mshv_partition *partition)
+{
+	unsigned long flags;
+	int i;
+
+	/* Remove from list of partitions */
+	spin_lock_irqsave(&mshv.partitions.lock, flags);
+
+	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
+		if (mshv.partitions.array[i] == partition)
+			break;
+	}
+
+	if (i == MSHV_MAX_PARTITIONS) {
+		pr_err("%s: failed to locate partition in array\n", __func__);
+	} else {
+		mshv.partitions.count--;
+		mshv.partitions.array[i] = NULL;
+	}
+
+	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
+
+	kfree(partition);
+}
+
+static void
+mshv_partition_put(struct mshv_partition *partition)
+{
+	if (refcount_dec_and_test(&partition->ref_count))
+		destroy_partition(partition);
+}
+
+static int
+mshv_partition_release(struct inode *inode, struct file *filp)
+{
+	struct mshv_partition *partition = filp->private_data;
+
+	mshv_partition_put(partition);
+
+	return 0;
+}
+
+static int
+add_partition(struct mshv_partition *partition)
+{
+	unsigned long flags;
+	int i, ret = 0;
+
+	spin_lock_irqsave(&mshv.partitions.lock, flags);
+
+	if (mshv.partitions.count >= MSHV_MAX_PARTITIONS) {
+		pr_err("%s: too many partitions\n", __func__);
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
+		if (!mshv.partitions.array[i])
+			break;
+	}
+
+	mshv.partitions.count++;
+	mshv.partitions.array[i] = partition;
+
+out_unlock:
+	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
+
+	return ret;
+}
+
+static long
+mshv_ioctl_create_partition(void __user *user_arg)
+{
+	struct mshv_create_partition args;
+	struct mshv_partition *partition;
+	struct file *file;
+	int fd;
+	long ret;
+
+	if (copy_from_user(&args, user_arg, sizeof(args)))
+		return -EFAULT;
+
+	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
+	if (!partition)
+		return -ENOMEM;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto free_partition;
+	}
+
+	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
+				  partition, O_RDWR);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto put_fd;
+	}
+	refcount_set(&partition->ref_count, 1);
+
+	ret = add_partition(partition);
+	if (ret)
+		goto release_file;
+
+	fd_install(fd, file);
+
+	return fd;
+
+release_file:
+	file->f_op->release(file->f_inode, file);
+put_fd:
+	put_unused_fd(fd);
+free_partition:
+	kfree(partition);
+	return ret;
+}
+
 static long
 mshv_ioctl_request_version(u32 *version, void __user *user_arg)
 {
@@ -59,7 +216,10 @@ mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	if (*version == MSHV_INVALID_VERSION)
 		return -EBADFD;
 
-	/* TODO other ioctls */
+	switch (ioctl) {
+	case MSHV_CREATE_PARTITION:
+		return mshv_ioctl_create_partition((void __user *)arg);
+	}
 
 	return -ENOTTY;
 }
@@ -82,21 +242,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static const struct file_operations mshv_dev_fops = {
-	.owner = THIS_MODULE,
-	.open = mshv_dev_open,
-	.release = mshv_dev_release,
-	.unlocked_ioctl = mshv_dev_ioctl,
-	.llseek = noop_llseek,
-};
-
-static struct miscdevice mshv_dev = {
-	.minor = MISC_DYNAMIC_MINOR,
-	.name = "mshv",
-	.fops = &mshv_dev_fops,
-	.mode = 600,
-};
-
 static int
 __init mshv_init(void)
 {
@@ -106,6 +251,8 @@ __init mshv_init(void)
 	if (r)
 		pr_err("%s: misc device register failed\n", __func__);
 
+	spin_lock_init(&mshv.partitions.lock);
+
 	return r;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (4 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 05/18] virt/mshv: create partition ioctl Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:42   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall Nuno Das Neves
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Add hypercalls for fully setting up and mostly tearing down a guest
partition.
The teardown operation will generate an error as the deposited
memory has not been withdrawn.
This is fixed in the next patch.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 include/asm-generic/hyperv-tlfs.h      |  52 +++++++-
 include/uapi/asm-generic/hyperv-tlfs.h |   1 +
 include/uapi/linux/mshv.h              |   1 +
 virt/mshv/mshv_main.c                  | 169 ++++++++++++++++++++++++-
 4 files changed, 220 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 2ff580780ce4..ab6ae6c164f5 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -142,6 +142,10 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
 #define HVCALL_SEND_IPI_EX			0x0015
+#define HVCALL_CREATE_PARTITION			0x0040
+#define HVCALL_INITIALIZE_PARTITION		0x0041
+#define HVCALL_FINALIZE_PARTITION		0x0042
+#define HVCALL_DELETE_PARTITION			0x0043
 #define HVCALL_GET_PARTITION_ID			0x0046
 #define HVCALL_DEPOSIT_MEMORY			0x0048
 #define HVCALL_CREATE_VP			0x004e
@@ -451,7 +455,7 @@ struct hv_get_partition_id {
 struct hv_deposit_memory {
 	u64 partition_id;
 	u64 gpa_page_list[];
-} __packed;
+};
 
 struct hv_proximity_domain_flags {
 	u32 proximity_preferred : 1;
@@ -767,4 +771,50 @@ struct hv_input_unmap_device_interrupt {
 #define HV_SOURCE_SHADOW_NONE               0x0
 #define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE   0x1
 
+#define HV_MAKE_COMPATIBILITY_VERSION(major_, minor_)                          \
+	((u32)((major_) << 8 | (minor_)))
+
+enum hv_compatibility_version {
+	HV_COMPATIBILITY_19_H1 = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X5),
+	HV_COMPATIBILITY_MANGANESE = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X7),
+	HV_COMPATIBILITY_PRERELEASE = HV_MAKE_COMPATIBILITY_VERSION(0XFE, 0X0),
+	HV_COMPATIBILITY_EXPERIMENT = HV_MAKE_COMPATIBILITY_VERSION(0XFF, 0X0),
+};
+
+union hv_partition_isolation_properties {
+	u64 as_uint64;
+	struct {
+		u64 isolation_type: 5;
+		u64 rsvd_z: 7;
+		u64 shared_gpa_boundary_page_number: 52;
+	};
+};
+
+/* Non-userspace-visible partition creation flags */
+#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
+
+struct hv_create_partition_in {
+	u64 flags;
+	union hv_proximity_domain_info proximity_domain_info;
+	enum hv_compatibility_version compatibility_version;
+	struct hv_partition_creation_properties partition_creation_properties;
+	union hv_partition_isolation_properties isolation_properties;
+};
+
+struct hv_create_partition_out {
+	u64 partition_id;
+};
+
+struct hv_initialize_partition {
+	u64 partition_id;
+};
+
+struct hv_finalize_partition {
+	u64 partition_id;
+};
+
+struct hv_delete_partition {
+	u64 partition_id;
+};
+
 #endif
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index 140cc0b4f98f..7a858226a9c5 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -6,6 +6,7 @@
 #define BIT(X)	(1ULL << (X))
 #endif
 
+/* Userspace-visible partition creation flags */
 #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
 #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
 #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 3788f8bc5caa..4f8da9a6fde2 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -9,6 +9,7 @@
 
 #include <linux/types.h>
 #include <asm/hyperv-tlfs.h>
+#include <asm-generic/hyperv-tlfs.h>
 
 #define MSHV_VERSION	0x0
 
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 4dcbe4907430..c4130a6508e5 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -15,6 +15,7 @@
 #include <linux/file.h>
 #include <linux/anon_inodes.h>
 #include <linux/mshv.h>
+#include <asm/mshyperv.h>
 
 MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
@@ -31,7 +32,6 @@ static struct mshv mshv = {};
 static void mshv_partition_put(struct mshv_partition *partition);
 static int mshv_partition_release(struct inode *inode, struct file *filp);
 static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
-
 static int mshv_dev_open(struct inode *inode, struct file *filp);
 static int mshv_dev_release(struct inode *inode, struct file *filp);
 static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
@@ -57,6 +57,143 @@ static struct miscdevice mshv_dev = {
 	.mode = 600,
 };
 
+#define HV_INIT_PARTITION_DEPOSIT_PAGES 208
+
+static int
+hv_call_create_partition(
+		u64 flags,
+		struct hv_partition_creation_properties creation_properties,
+		u64 *partition_id)
+{
+	struct hv_create_partition_in *input;
+	struct hv_create_partition_out *output;
+	int status;
+	int ret;
+	unsigned long irq_flags;
+	int i;
+
+	do {
+		local_irq_save(irq_flags);
+		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
+			hyperv_pcpu_output_arg));
+
+		input->flags = flags;
+		input->proximity_domain_info.as_uint64 = 0;
+		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
+		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
+			input->partition_creation_properties
+				.disabled_processor_features.as_uint64[i] = 0;
+		input->partition_creation_properties
+			.disabled_processor_xsave_features.as_uint64 = 0;
+		input->isolation_properties.as_uint64 = 0;
+
+		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
+					 input, output);
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status == HV_STATUS_SUCCESS)
+				*partition_id = output->partition_id;
+			else
+				pr_err("%s: %s\n",
+				       __func__, hv_status_to_string(status));
+			local_irq_restore(irq_flags);
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+		local_irq_restore(irq_flags);
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    hv_current_partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+static int
+hv_call_initialize_partition(u64 partition_id)
+{
+	struct hv_initialize_partition *input;
+	int status;
+	int ret;
+	unsigned long flags;
+
+	ret = hv_call_deposit_pages(
+				NUMA_NO_NODE,
+				partition_id,
+				HV_INIT_PARTITION_DEPOSIT_PAGES);
+	if (ret)
+		return ret;
+
+	do {
+		local_irq_save(flags);
+		input = (struct hv_initialize_partition *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+		input->partition_id = partition_id;
+
+		status = hv_do_hypercall(
+				HVCALL_INITIALIZE_PARTITION,
+				input, NULL);
+		local_irq_restore(flags);
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status != HV_STATUS_SUCCESS)
+				pr_err("%s: %s\n",
+				       __func__, hv_status_to_string(status));
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+static int
+hv_call_finalize_partition(u64 partition_id)
+{
+	struct hv_finalize_partition *input;
+	int status;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	input = (struct hv_finalize_partition *)(*this_cpu_ptr(
+		hyperv_pcpu_input_arg));
+
+	input->partition_id = partition_id;
+	status = hv_do_hypercall(
+			HVCALL_FINALIZE_PARTITION,
+			input, NULL);
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS)
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+
+	return -hv_status_to_errno(status);
+}
+
+static int
+hv_call_delete_partition(u64 partition_id)
+{
+	struct hv_delete_partition *input;
+	int status;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	input = (struct hv_delete_partition *)(*this_cpu_ptr(
+		hyperv_pcpu_input_arg));
+
+	input->partition_id = partition_id;
+	status = hv_do_hypercall(
+			HVCALL_DELETE_PARTITION,
+			input, NULL);
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS)
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+
+	return -hv_status_to_errno(status);
+}
+
 static long
 mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
@@ -86,6 +223,17 @@ destroy_partition(struct mshv_partition *partition)
 
 	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
 
+	/*
+	 * There are no remaining references to the partition or vps,
+	 * so the remaining cleanup can be lockless
+	 */
+
+	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
+	hv_call_finalize_partition(partition->id);
+	/* TODO: Withdraw and free all pages we deposited */
+
+	hv_call_delete_partition(partition->id);
+
 	kfree(partition);
 }
 
@@ -146,6 +294,9 @@ mshv_ioctl_create_partition(void __user *user_arg)
 	if (copy_from_user(&args, user_arg, sizeof(args)))
 		return -EFAULT;
 
+	/* Only support EXO partitions */
+	args.flags |= HV_PARTITION_CREATION_FLAG_EXO_PARTITION;
+
 	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
 	if (!partition)
 		return -ENOMEM;
@@ -156,11 +307,21 @@ mshv_ioctl_create_partition(void __user *user_arg)
 		goto free_partition;
 	}
 
+	ret = hv_call_create_partition(args.flags,
+				       args.partition_creation_properties,
+				       &partition->id);
+	if (ret)
+		goto put_fd;
+
+	ret = hv_call_initialize_partition(partition->id);
+	if (ret)
+		goto delete_partition;
+
 	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
 				  partition, O_RDWR);
 	if (IS_ERR(file)) {
 		ret = PTR_ERR(file);
-		goto put_fd;
+		goto finalize_partition;
 	}
 	refcount_set(&partition->ref_count, 1);
 
@@ -174,6 +335,10 @@ mshv_ioctl_create_partition(void __user *user_arg)
 
 release_file:
 	file->f_op->release(file->f_inode, file);
+finalize_partition:
+	hv_call_finalize_partition(partition->id);
+delete_partition:
+	hv_call_delete_partition(partition->id);
 put_fd:
 	put_unused_fd(fd);
 free_partition:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (5 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:44   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 08/18] virt/mshv: map and unmap guest memory Nuno Das Neves
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Withdraw the memory from a finalized partition and free the pages.
The partition is now cleaned up correctly when the fd is released.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 include/asm-generic/hyperv-tlfs.h | 10 ++++++
 virt/mshv/mshv_main.c             | 54 ++++++++++++++++++++++++++++++-
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index ab6ae6c164f5..2a49503b7396 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -148,6 +148,7 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_DELETE_PARTITION			0x0043
 #define HVCALL_GET_PARTITION_ID			0x0046
 #define HVCALL_DEPOSIT_MEMORY			0x0048
+#define HVCALL_WITHDRAW_MEMORY			0x0049
 #define HVCALL_CREATE_VP			0x004e
 #define HVCALL_GET_VP_REGISTERS			0x0050
 #define HVCALL_SET_VP_REGISTERS			0x0051
@@ -472,6 +473,15 @@ union hv_proximity_domain_info {
 	u64 as_uint64;
 };
 
+struct hv_withdraw_memory_in {
+	u64 partition_id;
+	union hv_proximity_domain_info proximity_domain_info;
+};
+
+struct hv_withdraw_memory_out {
+	u64 gpa_page_list[0];
+};
+
 struct hv_lp_startup_status {
 	u64 hv_status;
 	u64 substatus1;
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index c4130a6508e5..162a1bb42a4a 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/file.h>
 #include <linux/anon_inodes.h>
+#include <linux/mm.h>
 #include <linux/mshv.h>
 #include <asm/mshyperv.h>
 
@@ -57,8 +58,58 @@ static struct miscdevice mshv_dev = {
 	.mode = 600,
 };
 
+#define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
 #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
 
+static int
+hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
+{
+	struct hv_withdraw_memory_in *input_page;
+	struct hv_withdraw_memory_out *output_page;
+	u16 completed;
+	u64 hypercall_status;
+	unsigned long remaining = count;
+	int status;
+	int i;
+	unsigned long flags;
+
+	while (remaining) {
+		local_irq_save(flags);
+
+		input_page = (struct hv_withdraw_memory_in *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+		output_page = (struct hv_withdraw_memory_out *)(*this_cpu_ptr(
+			hyperv_pcpu_output_arg));
+
+		input_page->partition_id = partition_id;
+		input_page->proximity_domain_info.as_uint64 = 0;
+		hypercall_status = hv_do_rep_hypercall(
+			HVCALL_WITHDRAW_MEMORY,
+			min(remaining, HV_WITHDRAW_BATCH_SIZE), 0, input_page,
+			output_page);
+
+		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
+			    HV_HYPERCALL_REP_COMP_OFFSET;
+
+		for (i = 0; i < completed; i++)
+			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
+
+		local_irq_restore(flags);
+
+		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
+		if (status != HV_STATUS_SUCCESS) {
+			if (status != HV_STATUS_NO_RESOURCES)
+				pr_err("%s: %s\n", __func__,
+				       hv_status_to_string(status));
+			break;
+		}
+
+		remaining -= completed;
+	}
+
+	return -hv_status_to_errno(status);
+}
+
 static int
 hv_call_create_partition(
 		u64 flags,
@@ -230,7 +281,8 @@ destroy_partition(struct mshv_partition *partition)
 
 	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
 	hv_call_finalize_partition(partition->id);
-	/* TODO: Withdraw and free all pages we deposited */
+	/* Withdraw and free all pages we deposited */
+	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->id);
 
 	hv_call_delete_partition(partition->id);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (6 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:45   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 09/18] virt/mshv: create vcpu ioctl Nuno Das Neves
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctls for mapping and unmapping regions of guest memory.

Uses a table of memory 'slots' similar to KVM, but the slot
number is not visible to userspace.

For now, this simple implementation requires each new mapping to be
disjoint - the underlying hypercalls have no such restriction, and
implicitly overwrite any mappings on the pages in the specified regions.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst        |  15 ++
 include/asm-generic/hyperv-tlfs.h      |  15 ++
 include/linux/mshv.h                   |  14 ++
 include/uapi/asm-generic/hyperv-tlfs.h |   9 +
 include/uapi/linux/mshv.h              |  15 ++
 virt/mshv/mshv_main.c                  | 322 ++++++++++++++++++++++++-
 6 files changed, 388 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index ce651a1738e0..530efc29d354 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -72,3 +72,18 @@ it is open - this ioctl can only be called once per open.
 This ioctl creates a guest partition, returning a file descriptor to use as a
 handle for partition ioctls.
 
+3.3 MSHV_MAP_GUEST_MEMORY and MSHV_UNMAP_GUEST_MEMORY
+-----------------------------------------------------
+:Type: partition ioctl
+:Parameters: struct mshv_user_mem_region
+:Returns: 0 on success
+
+Create a mapping from a region of process memory to a region of physical memory
+in a guest partition.
+
+Mappings must be disjoint in process address space and guest address space.
+
+Note: In the current implementation, this memory is pinned to stop the pages
+being moved by linux and subsequently clobbered by the hypervisor. So the region
+is backed by physical memory.
+
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 2a49503b7396..6e5072e29897 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -149,6 +149,8 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_GET_PARTITION_ID			0x0046
 #define HVCALL_DEPOSIT_MEMORY			0x0048
 #define HVCALL_WITHDRAW_MEMORY			0x0049
+#define HVCALL_MAP_GPA_PAGES			0x004b
+#define HVCALL_UNMAP_GPA_PAGES			0x004c
 #define HVCALL_CREATE_VP			0x004e
 #define HVCALL_GET_VP_REGISTERS			0x0050
 #define HVCALL_SET_VP_REGISTERS			0x0051
@@ -827,4 +829,17 @@ struct hv_delete_partition {
 	u64 partition_id;
 };
 
+struct hv_map_gpa_pages {
+	u64 target_partition_id;
+	u64 target_gpa_base;
+	u32 map_flags;
+	u64 source_gpa_page_list[];
+};
+
+struct hv_unmap_gpa_pages {
+	u64 target_partition_id;
+	u64 target_gpa_base;
+	u32 unmap_flags;
+};
+
 #endif
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index fc4f35089b2c..91a742f37440 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -7,13 +7,27 @@
  */
 
 #include <linux/spinlock.h>
+#include <linux/mutex.h>
 #include <uapi/linux/mshv.h>
 
 #define MSHV_MAX_PARTITIONS		128
+#define MSHV_MAX_MEM_REGIONS		64
+
+struct mshv_mem_region {
+	u64 size; /* bytes */
+	u64 guest_pfn;
+	u64 userspace_addr; /* start of the userspace allocated memory */
+	struct page **pages;
+};
 
 struct mshv_partition {
 	u64 id;
 	refcount_t ref_count;
+	struct mutex mutex;
+	struct {
+		u32 count;
+		struct mshv_mem_region slots[MSHV_MAX_MEM_REGIONS];
+	} regions;
 };
 
 struct mshv {
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index 7a858226a9c5..e7b09b9f00de 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -12,4 +12,13 @@
 #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
 #define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
 
+/* HV Map GPA (Guest Physical Address) Flags */
+#define HV_MAP_GPA_PERMISSIONS_NONE     0x0
+#define HV_MAP_GPA_READABLE             0x1
+#define HV_MAP_GPA_WRITABLE             0x2
+#define HV_MAP_GPA_KERNEL_EXECUTABLE    0x4
+#define HV_MAP_GPA_USER_EXECUTABLE      0x8
+#define HV_MAP_GPA_EXECUTABLE           0xC
+#define HV_MAP_GPA_PERMISSIONS_MASK     0xF
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 4f8da9a6fde2..47be03ef4e86 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -18,10 +18,25 @@ struct mshv_create_partition {
 	struct hv_partition_creation_properties partition_creation_properties;
 };
 
+/*
+ * Mappings can't overlap in GPA space or userspace
+ * To unmap, these fields must match an existing mapping
+ */
+struct mshv_user_mem_region {
+	__u64 size;		/* bytes */
+	__u64 guest_pfn;
+	__u64 userspace_addr;	/* start of the userspace allocated memory */
+	__u32 flags;		/* ignored on unmap */
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
 #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
 #define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
 
+/* partition device */
+#define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
+#define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
+
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 162a1bb42a4a..ce480598e67f 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -60,6 +60,10 @@ static struct miscdevice mshv_dev = {
 
 #define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
 #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
+#define HV_MAP_GPA_MASK		(0x0000000FFFFFFFFFULL)
+#define HV_MAP_GPA_BATCH_SIZE	\
+		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
+#define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
 
 static int
 hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
@@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
 	return -hv_status_to_errno(status);
 }
 
+static int
+hv_call_map_gpa_pages(u64 partition_id,
+		      u64 gpa_target,
+		      u64 page_count, u32 flags,
+		      struct page **pages)
+{
+	struct hv_map_gpa_pages *input_page;
+	int status;
+	int i;
+	struct page **p;
+	u32 completed = 0;
+	u64 hypercall_status;
+	unsigned long remaining = page_count;
+	int rep_count;
+	unsigned long irq_flags;
+	int ret = 0;
+
+	while (remaining) {
+
+		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
+
+		local_irq_save(irq_flags);
+		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+
+		input_page->target_partition_id = partition_id;
+		input_page->target_gpa_base = gpa_target;
+		input_page->map_flags = flags;
+
+		for (i = 0, p = pages; i < rep_count; i++, p++)
+			input_page->source_gpa_page_list[i] =
+				page_to_pfn(*p) & HV_MAP_GPA_MASK;
+		hypercall_status = hv_do_rep_hypercall(
+			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
+		local_irq_restore(irq_flags);
+
+		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
+		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
+				HV_HYPERCALL_REP_COMP_OFFSET;
+
+		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
+			ret = hv_call_deposit_pages(NUMA_NO_NODE,
+						    partition_id, 256);
+			if (ret)
+				break;
+		} else if (status != HV_STATUS_SUCCESS) {
+			pr_err("%s: completed %llu out of %llu, %s\n",
+			       __func__,
+			       page_count - remaining, page_count,
+			       hv_status_to_string(status));
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+
+		pages += completed;
+		remaining -= completed;
+		gpa_target += completed;
+	}
+
+	if (ret && completed) {
+		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
+		       __func__);
+		ret = -EBADFD;
+	}
+
+	return ret;
+}
+
+static int
+hv_call_unmap_gpa_pages(u64 partition_id,
+			u64 gpa_target,
+			u64 page_count, u32 flags)
+{
+	struct hv_unmap_gpa_pages *input_page;
+	int status;
+	int ret = 0;
+	u32 completed = 0;
+	u64 hypercall_status;
+	unsigned long remaining = page_count;
+	int rep_count;
+	unsigned long irq_flags;
+
+	local_irq_save(irq_flags);
+	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
+		hyperv_pcpu_input_arg));
+
+	input_page->target_partition_id = partition_id;
+	input_page->target_gpa_base = gpa_target;
+	input_page->unmap_flags = flags;
+
+	while (remaining) {
+		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
+		hypercall_status = hv_do_rep_hypercall(
+			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
+		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
+		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
+				HV_HYPERCALL_REP_COMP_OFFSET;
+		if (status != HV_STATUS_SUCCESS) {
+			pr_err("%s: completed %llu out of %llu, %s\n",
+			       __func__,
+			       page_count - remaining, page_count,
+			       hv_status_to_string(status));
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+
+		remaining -= completed;
+		gpa_target += completed;
+		input_page->target_gpa_base = gpa_target;
+	}
+	local_irq_restore(irq_flags);
+
+	if (ret && completed) {
+		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
+		       __func__);
+		ret = -EBADFD;
+	}
+
+	return ret;
+}
+
+static long
+mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
+				struct mshv_user_mem_region __user *user_mem)
+{
+	struct mshv_user_mem_region mem;
+	struct mshv_mem_region *region;
+	int completed;
+	unsigned long remaining, batch_size;
+	int i;
+	struct page **pages;
+	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
+	u64 region_page_count, region_user_start, region_user_end;
+	u64 region_gpfn_start, region_gpfn_end;
+	long ret = 0;
+
+	/* Check we have enough slots*/
+	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
+		pr_err("%s: not enough memory region slots\n", __func__);
+		return -ENOSPC;
+	}
+
+	if (copy_from_user(&mem, user_mem, sizeof(mem)))
+		return -EFAULT;
+
+	if (!mem.size ||
+	    mem.size & (PAGE_SIZE - 1) ||
+	    mem.userspace_addr & (PAGE_SIZE - 1) ||
+	    !access_ok(mem.userspace_addr, mem.size))
+		return -EINVAL;
+
+	/* Reject overlapping regions */
+	page_count = mem.size >> PAGE_SHIFT;
+	user_start = mem.userspace_addr;
+	user_end = mem.userspace_addr + mem.size;
+	gpfn_start = mem.guest_pfn;
+	gpfn_end = mem.guest_pfn + page_count;
+	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
+		region = &partition->regions.slots[i];
+		if (!region->size)
+			continue;
+		region_page_count = region->size >> PAGE_SHIFT;
+		region_user_start = region->userspace_addr;
+		region_user_end = region->userspace_addr + region->size;
+		region_gpfn_start = region->guest_pfn;
+		region_gpfn_end = region->guest_pfn + region_page_count;
+
+		if (!(
+		     (user_end <= region_user_start) ||
+		     (region_user_end <= user_start))) {
+			return -EEXIST;
+		}
+		if (!(
+		     (gpfn_end <= region_gpfn_start) ||
+		     (region_gpfn_end <= gpfn_start))) {
+			return -EEXIST;
+		}
+	}
+
+	/* Pin the userspace pages */
+	pages = vzalloc(sizeof(struct page *) * page_count);
+	if (!pages)
+		return -ENOMEM;
+
+	remaining = page_count;
+	while (remaining) {
+		/*
+		 * We need to batch this, as pin_user_pages_fast with the
+		 * FOLL_LONGTERM flag does a big temporary allocation
+		 * of contiguous memory
+		 */
+		batch_size = min(remaining, PIN_PAGES_BATCH_SIZE);
+		completed = pin_user_pages_fast(
+				mem.userspace_addr +
+					(page_count - remaining) * PAGE_SIZE,
+				batch_size,
+				FOLL_WRITE | FOLL_LONGTERM,
+				&pages[page_count - remaining]);
+		if (completed < 0) {
+			pr_err("%s: failed to pin user pages error %i\n",
+			       __func__,
+			       completed);
+			ret = completed;
+			goto err_unpin_pages;
+		}
+		remaining -= completed;
+	}
+
+	/* Map the pages to GPA pages */
+	ret = hv_call_map_gpa_pages(partition->id, mem.guest_pfn,
+				    page_count, mem.flags, pages);
+	if (ret)
+		goto err_unpin_pages;
+
+	/* Install the new region */
+	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
+		if (!partition->regions.slots[i].size) {
+			region = &partition->regions.slots[i];
+			break;
+		}
+	}
+	region->pages = pages;
+	region->size = mem.size;
+	region->guest_pfn = mem.guest_pfn;
+	region->userspace_addr = mem.userspace_addr;
+
+	partition->regions.count++;
+
+	return 0;
+
+err_unpin_pages:
+	unpin_user_pages(pages, page_count - remaining);
+	vfree(pages);
+
+	return ret;
+}
+
+static long
+mshv_partition_ioctl_unmap_memory(struct mshv_partition *partition,
+				  struct mshv_user_mem_region __user *user_mem)
+{
+	struct mshv_user_mem_region mem;
+	struct mshv_mem_region *region_ptr;
+	int i;
+	u64 page_count;
+	long ret;
+
+	if (!partition->regions.count)
+		return -EINVAL;
+
+	if (copy_from_user(&mem, user_mem, sizeof(mem)))
+		return -EFAULT;
+
+	/* Find matching region */
+	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
+		if (!partition->regions.slots[i].size)
+			continue;
+		region_ptr = &partition->regions.slots[i];
+		if (region_ptr->userspace_addr == mem.userspace_addr &&
+		    region_ptr->size == mem.size &&
+		    region_ptr->guest_pfn == mem.guest_pfn)
+			break;
+	}
+
+	if (i == MSHV_MAX_MEM_REGIONS)
+		return -EINVAL;
+
+	page_count = region_ptr->size >> PAGE_SHIFT;
+	ret = hv_call_unmap_gpa_pages(partition->id, region_ptr->guest_pfn,
+				      page_count, 0);
+	if (ret)
+		return ret;
+
+	unpin_user_pages(region_ptr->pages, page_count);
+	vfree(region_ptr->pages);
+	memset(region_ptr, 0, sizeof(*region_ptr));
+	partition->regions.count--;
+
+	return 0;
+}
+
 static long
 mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
-	return -ENOTTY;
+	struct mshv_partition *partition = filp->private_data;
+	long ret;
+
+	if (mutex_lock_killable(&partition->mutex))
+		return -EINTR;
+
+	switch (ioctl) {
+	case MSHV_MAP_GUEST_MEMORY:
+		ret = mshv_partition_ioctl_map_memory(partition,
+							(void __user *)arg);
+		break;
+	case MSHV_UNMAP_GUEST_MEMORY:
+		ret = mshv_partition_ioctl_unmap_memory(partition,
+							(void __user *)arg);
+		break;
+	default:
+		ret = -ENOTTY;
+	}
+
+	mutex_unlock(&partition->mutex);
+	return ret;
 }
 
 static void
 destroy_partition(struct mshv_partition *partition)
 {
-	unsigned long flags;
+	unsigned long flags, page_count;
+	struct mshv_mem_region *region;
 	int i;
 
 	/* Remove from list of partitions */
@@ -286,6 +592,16 @@ destroy_partition(struct mshv_partition *partition)
 
 	hv_call_delete_partition(partition->id);
 
+	/* Remove regions and unpin the pages */
+	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
+		region = &partition->regions.slots[i];
+		if (!region->size)
+			continue;
+		page_count = region->size >> PAGE_SHIFT;
+		unpin_user_pages(region->pages, page_count);
+		vfree(region->pages);
+	}
+
 	kfree(partition);
 }
 
@@ -353,6 +669,8 @@ mshv_ioctl_create_partition(void __user *user_arg)
 	if (!partition)
 		return -ENOMEM;
 
+	mutex_init(&partition->mutex);
+
 	fd = get_unused_fd_flags(O_CLOEXEC);
 	if (fd < 0) {
 		ret = fd;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 09/18] virt/mshv: create vcpu ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (7 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 08/18] virt/mshv: map and unmap guest memory Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls Nuno Das Neves
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctl for creating a virtual processor in a partition.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst |   9 +++
 include/linux/mshv.h            |  10 +++
 include/uapi/linux/mshv.h       |   5 ++
 virt/mshv/mshv_main.c           | 121 +++++++++++++++++++++++++++++++-
 4 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 530efc29d354..f997f49f8690 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -87,3 +87,12 @@ Note: In the current implementation, this memory is pinned to stop the pages
 being moved by linux and subsequently clobbered by the hypervisor. So the region
 is backed by physical memory.
 
+3.4 MSHV_CREATE_VP
+------------------
+:Type: partition ioctl
+:Parameters: struct mshv_create_vp
+:Returns: vp file descriptor, or -1 on failure
+
+Create a virtual processor in a guest partition, returning a file descriptor to
+represent the vp and perform ioctls on.
+
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index 91a742f37440..50521c5f7948 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -12,6 +12,12 @@
 
 #define MSHV_MAX_PARTITIONS		128
 #define MSHV_MAX_MEM_REGIONS		64
+#define MSHV_MAX_VPS			256
+
+struct mshv_vp {
+	u32 index;
+	struct mshv_partition *partition;
+};
 
 struct mshv_mem_region {
 	u64 size; /* bytes */
@@ -28,6 +34,10 @@ struct mshv_partition {
 		u32 count;
 		struct mshv_mem_region slots[MSHV_MAX_MEM_REGIONS];
 	} regions;
+	struct {
+		u32 count;
+		struct mshv_vp *array[MSHV_MAX_VPS];
+	} vps;
 };
 
 struct mshv {
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 47be03ef4e86..1f053eae68a6 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -29,6 +29,10 @@ struct mshv_user_mem_region {
 	__u32 flags;		/* ignored on unmap */
 };
 
+struct mshv_create_vp {
+	__u32 vp_index;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -38,5 +42,6 @@ struct mshv_user_mem_region {
 /* partition device */
 #define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
 #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
+#define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
 
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index ce480598e67f..3be9d9a468c1 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -30,6 +30,10 @@ static u32 supported_versions[] = {
 
 static struct mshv mshv = {};
 
+static int mshv_vp_release(struct inode *inode, struct file *filp);
+static long mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+
+static struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
 static void mshv_partition_put(struct mshv_partition *partition);
 static int mshv_partition_release(struct inode *inode, struct file *filp);
 static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
@@ -37,6 +41,12 @@ static int mshv_dev_open(struct inode *inode, struct file *filp);
 static int mshv_dev_release(struct inode *inode, struct file *filp);
 static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
 
+static const struct file_operations mshv_vp_fops = {
+	.release = mshv_vp_release,
+	.unlocked_ioctl = mshv_vp_ioctl,
+	.llseek = noop_llseek,
+};
+
 static const struct file_operations mshv_partition_fops = {
 	.release = mshv_partition_release,
 	.unlocked_ioctl = mshv_partition_ioctl,
@@ -370,6 +380,94 @@ hv_call_unmap_gpa_pages(u64 partition_id,
 	return ret;
 }
 
+static long
+mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+	return -ENOTTY;
+}
+
+static int
+mshv_vp_release(struct inode *inode, struct file *filp)
+{
+	struct mshv_vp *vp = filp->private_data;
+	mshv_partition_put(vp->partition);
+
+	/* Rest of VP cleanup happens in destroy_partition() */
+	return 0;
+}
+
+static long
+mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
+			       void __user *arg)
+{
+	struct mshv_create_vp args;
+	struct mshv_vp *vp;
+	struct file *file;
+	int fd;
+	long ret;
+
+	if (copy_from_user(&args, arg, sizeof(args)))
+		return -EFAULT;
+
+	if (args.vp_index >= MSHV_MAX_VPS)
+		return -EINVAL;
+
+	if (partition->vps.array[args.vp_index])
+		return -EEXIST;
+
+	vp = kzalloc(sizeof(*vp), GFP_KERNEL);
+
+	if (!vp)
+		return -ENOMEM;
+
+	vp->index = args.vp_index;
+	vp->partition = mshv_partition_get(partition);
+	if (!vp->partition) {
+		ret = -EBADF;
+		goto free_vp;
+	}
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		ret = fd;
+		goto put_partition;
+	}
+
+	file = anon_inode_getfile("mshv_vp", &mshv_vp_fops, vp, O_RDWR);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto put_fd;
+	}
+
+	ret = hv_call_create_vp(
+			NUMA_NO_NODE,
+			partition->id,
+			args.vp_index,
+			0 /* Only valid for root partition VPs */
+			);
+	if (ret)
+		goto release_file;
+
+	/* already exclusive with the partition mutex for all ioctls */
+	partition->vps.count++;
+	partition->vps.array[args.vp_index] = vp;
+
+	fd_install(fd, file);
+
+	return fd;
+
+release_file:
+	file->f_op->release(file->f_inode, file);
+put_fd:
+	put_unused_fd(fd);
+put_partition:
+	mshv_partition_put(partition);
+free_vp:
+	kfree(vp);
+
+	return ret;
+}
+
 static long
 mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
 				struct mshv_user_mem_region __user *user_mem)
@@ -548,6 +646,10 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		ret = mshv_partition_ioctl_unmap_memory(partition,
 							(void __user *)arg);
 		break;
+	case MSHV_CREATE_VP:
+		ret = mshv_partition_ioctl_create_vp(partition,
+							(void __user *)arg);
+		break;
 	default:
 		ret = -ENOTTY;
 	}
@@ -560,6 +662,7 @@ static void
 destroy_partition(struct mshv_partition *partition)
 {
 	unsigned long flags, page_count;
+	struct mshv_vp *vp;
 	struct mshv_mem_region *region;
 	int i;
 
@@ -581,7 +684,7 @@ destroy_partition(struct mshv_partition *partition)
 	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
 
 	/*
-	 * There are no remaining references to the partition or vps,
+	 * There are no remaining references to the partition,
 	 * so the remaining cleanup can be lockless
 	 */
 
@@ -591,6 +694,14 @@ destroy_partition(struct mshv_partition *partition)
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->id);
 
 	hv_call_delete_partition(partition->id);
+	
+	/* Remove vps */
+	for (i = 0; i < MSHV_MAX_VPS; ++i) {
+		vp = partition->vps.array[i];
+		if (!vp)
+			continue;
+		kfree(vp);
+	}
 
 	/* Remove regions and unpin the pages */
 	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
@@ -605,6 +716,14 @@ destroy_partition(struct mshv_partition *partition)
 	kfree(partition);
 }
 
+static struct
+mshv_partition *mshv_partition_get(struct mshv_partition *partition)
+{
+	if (refcount_inc_not_zero(&partition->ref_count))
+		return partition;
+	return NULL;
+}
+
 static void
 mshv_partition_put(struct mshv_partition *partition)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (8 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 09/18] virt/mshv: create vcpu ioctl Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:47   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages Nuno Das Neves
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Add ioctls for getting and setting virtual processor registers.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         |  11 +
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 601 ++++++++++++++++++++++++
 include/asm-generic/hyperv-tlfs.h       |  65 +--
 include/linux/mshv.h                    |   1 +
 include/uapi/linux/mshv.h               |  12 +
 virt/mshv/mshv_main.c                   | 258 +++++++++-
 6 files changed, 903 insertions(+), 45 deletions(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index f997f49f8690..20a626ac02d4 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -96,3 +96,14 @@ is backed by physical memory.
 Create a virtual processor in a guest partition, returning a file descriptor to
 represent the vp and perform ioctls on.
 
+3.5 MSHV_GET_VP_REGISTERS and MSHV_SET_VP_REGISTERS
+---------------------------------------------------
+:Type: vp ioctl
+:Parameters: struct mshv_vp_registers
+:Returns: 0 on success
+
+Get/set vp registers. See asm/hyperv-tlfs.h for the complete set of registers.
+Includes general purpose platform registers, MSRs, and virtual registers that
+are part of Microsoft Hypervisor platform and not directly exposed to the guest.
+
+
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index 72150c25ffe6..2ff655962738 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -121,4 +121,605 @@ struct hv_partition_creation_properties {
 		disabled_processor_xsave_features;
 };
 
+enum hv_register_name {
+	/* Suspend Registers */
+	HV_REGISTER_EXPLICIT_SUSPEND		= 0x00000000,
+	HV_REGISTER_INTERCEPT_SUSPEND		= 0x00000001,
+	HV_REGISTER_INSTRUCTION_EMULATION_HINTS	= 0x00000002,
+	HV_REGISTER_DISPATCH_SUSPEND		= 0x00000003,
+	HV_REGISTER_INTERNAL_ACTIVITY_STATE	= 0x00000004,
+
+	/* Version */
+	HV_REGISTER_HYPERVISOR_VERSION	= 0x00000100, /* 128-bit result same as CPUID 0x40000002 */
+
+	/* Feature Access (registers are 128 bits) - same as CPUID 0x40000003 - 0x4000000B */
+	HV_REGISTER_PRIVILEGES_AND_FEATURES_INFO	= 0x00000200,
+	HV_REGISTER_FEATURES_INFO			= 0x00000201,
+	HV_REGISTER_IMPLEMENTATION_LIMITS_INFO		= 0x00000202,
+	HV_REGISTER_HARDWARE_FEATURES_INFO		= 0x00000203,
+	HV_REGISTER_CPU_MANAGEMENT_FEATURES_INFO	= 0x00000204,
+	HV_REGISTER_SVM_FEATURES_INFO			= 0x00000205,
+	HV_REGISTER_SKIP_LEVEL_FEATURES_INFO		= 0x00000206,
+	HV_REGISTER_NESTED_VIRT_FEATURES_INFO		= 0x00000207,
+	HV_REGISTER_IPT_FEATURES_INFO			= 0x00000208,
+
+	/* Guest Crash Registers */
+	HV_REGISTER_GUEST_CRASH_P0	= 0x00000210,
+	HV_REGISTER_GUEST_CRASH_P1	= 0x00000211,
+	HV_REGISTER_GUEST_CRASH_P2	= 0x00000212,
+	HV_REGISTER_GUEST_CRASH_P3	= 0x00000213,
+	HV_REGISTER_GUEST_CRASH_P4	= 0x00000214,
+	HV_REGISTER_GUEST_CRASH_CTL	= 0x00000215,
+
+	/* Power State Configuration */
+	HV_REGISTER_POWER_STATE_CONFIG_C1	= 0x00000220,
+	HV_REGISTER_POWER_STATE_TRIGGER_C1	= 0x00000221,
+	HV_REGISTER_POWER_STATE_CONFIG_C2	= 0x00000222,
+	HV_REGISTER_POWER_STATE_TRIGGER_C2	= 0x00000223,
+	HV_REGISTER_POWER_STATE_CONFIG_C3	= 0x00000224,
+	HV_REGISTER_POWER_STATE_TRIGGER_C3	= 0x00000225,
+
+	/* Frequency Registers */
+	HV_REGISTER_PROCESSOR_CLOCK_FREQUENCY	= 0x00000240,
+	HV_REGISTER_INTERRUPT_CLOCK_FREQUENCY	= 0x00000241,
+
+	/* Idle Register */
+	HV_REGISTER_GUEST_IDLE	= 0x00000250,
+
+	/* Guest Debug */
+	HV_REGISTER_DEBUG_DEVICE_OPTIONS	= 0x00000260,
+
+	/* Memory Zeroing Conrol Register */
+	HV_REGISTER_MEMORY_ZEROING_CONTROL	= 0x00000270,
+
+	/* Pending Event Register */
+	HV_REGISTER_PENDING_EVENT0	= 0x00010004,
+	HV_REGISTER_PENDING_EVENT1	= 0x00010005,
+
+	/* Misc */
+	HV_REGISTER_VP_RUNTIME			= 0x00090000,
+	HV_REGISTER_GUEST_OS_ID			= 0x00090002,
+	HV_REGISTER_VP_INDEX			= 0x00090003,
+	HV_REGISTER_TIME_REF_COUNT		= 0x00090004,
+	HV_REGISTER_CPU_MANAGEMENT_VERSION	= 0x00090007,
+	HV_REGISTER_VP_ASSIST_PAGE		= 0x00090013,
+	HV_REGISTER_VP_ROOT_SIGNAL_COUNT	= 0x00090014,
+	HV_REGISTER_REFERENCE_TSC		= 0x00090017,
+
+	/* Performance statistics Registers */
+	HV_REGISTER_STATS_PARTITION_RETAIL	= 0x00090020,
+	HV_REGISTER_STATS_PARTITION_INTERNAL	= 0x00090021,
+	HV_REGISTER_STATS_VP_RETAIL		= 0x00090022,
+	HV_REGISTER_STATS_VP_INTERNAL		= 0x00090023,
+
+	HV_REGISTER_NESTED_VP_INDEX	= 0x00091003,
+
+	/* Hypervisor-defined Registers (Synic) */
+	HV_REGISTER_SINT0	= 0x000A0000,
+	HV_REGISTER_SINT1	= 0x000A0001,
+	HV_REGISTER_SINT2	= 0x000A0002,
+	HV_REGISTER_SINT3	= 0x000A0003,
+	HV_REGISTER_SINT4	= 0x000A0004,
+	HV_REGISTER_SINT5	= 0x000A0005,
+	HV_REGISTER_SINT6	= 0x000A0006,
+	HV_REGISTER_SINT7	= 0x000A0007,
+	HV_REGISTER_SINT8	= 0x000A0008,
+	HV_REGISTER_SINT9	= 0x000A0009,
+	HV_REGISTER_SINT10	= 0x000A000A,
+	HV_REGISTER_SINT11	= 0x000A000B,
+	HV_REGISTER_SINT12	= 0x000A000C,
+	HV_REGISTER_SINT13	= 0x000A000D,
+	HV_REGISTER_SINT14	= 0x000A000E,
+	HV_REGISTER_SINT15	= 0x000A000F,
+	HV_REGISTER_SCONTROL	= 0x000A0010,
+	HV_REGISTER_SVERSION	= 0x000A0011,
+	HV_REGISTER_SIFP	= 0x000A0012,
+	HV_REGISTER_SIPP	= 0x000A0013,
+	HV_REGISTER_EOM		= 0x000A0014,
+	HV_REGISTER_SIRBP	= 0x000A0015,
+
+	HV_REGISTER_NESTED_SINT0	= 0x000A1000,
+	HV_REGISTER_NESTED_SINT1	= 0x000A1001,
+	HV_REGISTER_NESTED_SINT2	= 0x000A1002,
+	HV_REGISTER_NESTED_SINT3	= 0x000A1003,
+	HV_REGISTER_NESTED_SINT4	= 0x000A1004,
+	HV_REGISTER_NESTED_SINT5	= 0x000A1005,
+	HV_REGISTER_NESTED_SINT6	= 0x000A1006,
+	HV_REGISTER_NESTED_SINT7	= 0x000A1007,
+	HV_REGISTER_NESTED_SINT8	= 0x000A1008,
+	HV_REGISTER_NESTED_SINT9	= 0x000A1009,
+	HV_REGISTER_NESTED_SINT10	= 0x000A100A,
+	HV_REGISTER_NESTED_SINT11	= 0x000A100B,
+	HV_REGISTER_NESTED_SINT12	= 0x000A100C,
+	HV_REGISTER_NESTED_SINT13	= 0x000A100D,
+	HV_REGISTER_NESTED_SINT14	= 0x000A100E,
+	HV_REGISTER_NESTED_SINT15	= 0x000A100F,
+	HV_REGISTER_NESTED_SCONTROL	= 0x000A1010,
+	HV_REGISTER_NESTED_SVERSION	= 0x000A1011,
+	HV_REGISTER_NESTED_SIFP		= 0x000A1012,
+	HV_REGISTER_NESTED_SIPP		= 0x000A1013,
+	HV_REGISTER_NESTED_EOM		= 0x000A1014,
+	HV_REGISTER_NESTED_SIRBP	= 0x000a1015,
+
+
+	/* Hypervisor-defined Registers (Synthetic Timers) */
+	HV_REGISTER_STIMER0_CONFIG		= 0x000B0000,
+	HV_REGISTER_STIMER0_COUNT		= 0x000B0001,
+	HV_REGISTER_STIMER1_CONFIG		= 0x000B0002,
+	HV_REGISTER_STIMER1_COUNT		= 0x000B0003,
+	HV_REGISTER_STIMER2_CONFIG		= 0x000B0004,
+	HV_REGISTER_STIMER2_COUNT		= 0x000B0005,
+	HV_REGISTER_STIMER3_CONFIG		= 0x000B0006,
+	HV_REGISTER_STIMER3_COUNT		= 0x000B0007,
+	HV_REGISTER_STIME_UNHALTED_TIMER_CONFIG	= 0x000B0100,
+	HV_REGISTER_STIME_UNHALTED_TIMER_COUNT	= 0x000b0101,
+
+	/* Synthetic VSM registers */
+
+	/* 0x000D0000-1 are available for future use. */
+	HV_REGISTER_VSM_CODE_PAGE_OFFSETS	= 0x000D0002,
+	HV_REGISTER_VSM_VP_STATUS		= 0x000D0003,
+	HV_REGISTER_VSM_PARTITION_STATUS	= 0x000D0004,
+	HV_REGISTER_VSM_VINA			= 0x000D0005,
+	HV_REGISTER_VSM_CAPABILITIES		= 0x000D0006,
+	HV_REGISTER_VSM_PARTITION_CONFIG	= 0x000D0007,
+
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL0	= 0x000D0010,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL1	= 0x000D0011,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL2	= 0x000D0012,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL3	= 0x000D0013,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL4	= 0x000D0014,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL5	= 0x000D0015,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL6	= 0x000D0016,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL7	= 0x000D0017,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL8	= 0x000D0018,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL9	= 0x000D0019,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL10	= 0x000D001A,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL11	= 0x000D001B,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL12	= 0x000D001C,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL13	= 0x000D001D,
+	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL14	= 0x000D001E,
+
+	HV_REGISTER_VSM_VP_WAIT_FOR_TLB_LOCK	= 0x000D0020,
+
+	HV_REGISTER_ISOLATION_CAPABILITIES	= 0x000D0100,
+
+	/* Pending Interruption Register */
+	HV_REGISTER_PENDING_INTERRUPTION	= 0x00010002,
+
+	/* Interrupt State register */
+	HV_REGISTER_INTERRUPT_STATE	= 0x00010003,
+
+	/* Interruptible notification register */
+	HV_X64_REGISTER_DELIVERABILITY_NOTIFICATIONS	= 0x00010006,
+
+	/* X64 User-Mode Registers */
+	HV_X64_REGISTER_RAX	= 0x00020000,
+	HV_X64_REGISTER_RCX	= 0x00020001,
+	HV_X64_REGISTER_RDX	= 0x00020002,
+	HV_X64_REGISTER_RBX	= 0x00020003,
+	HV_X64_REGISTER_RSP	= 0x00020004,
+	HV_X64_REGISTER_RBP	= 0x00020005,
+	HV_X64_REGISTER_RSI	= 0x00020006,
+	HV_X64_REGISTER_RDI	= 0x00020007,
+	HV_X64_REGISTER_R8	= 0x00020008,
+	HV_X64_REGISTER_R9	= 0x00020009,
+	HV_X64_REGISTER_R10	= 0x0002000A,
+	HV_X64_REGISTER_R11	= 0x0002000B,
+	HV_X64_REGISTER_R12	= 0x0002000C,
+	HV_X64_REGISTER_R13	= 0x0002000D,
+	HV_X64_REGISTER_R14	= 0x0002000E,
+	HV_X64_REGISTER_R15	= 0x0002000F,
+	HV_X64_REGISTER_RIP	= 0x00020010,
+	HV_X64_REGISTER_RFLAGS	= 0x00020011,
+
+	/* X64 Floating Point and Vector Registers */
+	HV_X64_REGISTER_XMM0			= 0x00030000,
+	HV_X64_REGISTER_XMM1			= 0x00030001,
+	HV_X64_REGISTER_XMM2			= 0x00030002,
+	HV_X64_REGISTER_XMM3			= 0x00030003,
+	HV_X64_REGISTER_XMM4			= 0x00030004,
+	HV_X64_REGISTER_XMM5			= 0x00030005,
+	HV_X64_REGISTER_XMM6			= 0x00030006,
+	HV_X64_REGISTER_XMM7			= 0x00030007,
+	HV_X64_REGISTER_XMM8			= 0x00030008,
+	HV_X64_REGISTER_XMM9			= 0x00030009,
+	HV_X64_REGISTER_XMM10			= 0x0003000A,
+	HV_X64_REGISTER_XMM11			= 0x0003000B,
+	HV_X64_REGISTER_XMM12			= 0x0003000C,
+	HV_X64_REGISTER_XMM13			= 0x0003000D,
+	HV_X64_REGISTER_XMM14			= 0x0003000E,
+	HV_X64_REGISTER_XMM15			= 0x0003000F,
+	HV_X64_REGISTER_FP_MMX0			= 0x00030010,
+	HV_X64_REGISTER_FP_MMX1			= 0x00030011,
+	HV_X64_REGISTER_FP_MMX2			= 0x00030012,
+	HV_X64_REGISTER_FP_MMX3			= 0x00030013,
+	HV_X64_REGISTER_FP_MMX4			= 0x00030014,
+	HV_X64_REGISTER_FP_MMX5			= 0x00030015,
+	HV_X64_REGISTER_FP_MMX6			= 0x00030016,
+	HV_X64_REGISTER_FP_MMX7			= 0x00030017,
+	HV_X64_REGISTER_FP_CONTROL_STATUS	= 0x00030018,
+	HV_X64_REGISTER_XMM_CONTROL_STATUS	= 0x00030019,
+
+	/* X64 Control Registers */
+	HV_X64_REGISTER_CR0	= 0x00040000,
+	HV_X64_REGISTER_CR2	= 0x00040001,
+	HV_X64_REGISTER_CR3	= 0x00040002,
+	HV_X64_REGISTER_CR4	= 0x00040003,
+	HV_X64_REGISTER_CR8	= 0x00040004,
+	HV_X64_REGISTER_XFEM	= 0x00040005,
+
+	/* X64 Intermediate Control Registers */
+	HV_X64_REGISTER_INTERMEDIATE_CR0	= 0x00041000,
+	HV_X64_REGISTER_INTERMEDIATE_CR4	= 0x00041003,
+	HV_X64_REGISTER_INTERMEDIATE_CR8	= 0x00041004,
+
+	/* X64 Debug Registers */
+	HV_X64_REGISTER_DR0	= 0x00050000,
+	HV_X64_REGISTER_DR1	= 0x00050001,
+	HV_X64_REGISTER_DR2	= 0x00050002,
+	HV_X64_REGISTER_DR3	= 0x00050003,
+	HV_X64_REGISTER_DR6	= 0x00050004,
+	HV_X64_REGISTER_DR7	= 0x00050005,
+
+	/* X64 Segment Registers */
+	HV_X64_REGISTER_ES	= 0x00060000,
+	HV_X64_REGISTER_CS	= 0x00060001,
+	HV_X64_REGISTER_SS	= 0x00060002,
+	HV_X64_REGISTER_DS	= 0x00060003,
+	HV_X64_REGISTER_FS	= 0x00060004,
+	HV_X64_REGISTER_GS	= 0x00060005,
+	HV_X64_REGISTER_LDTR	= 0x00060006,
+	HV_X64_REGISTER_TR	= 0x00060007,
+
+	/* X64 Table Registers */
+	HV_X64_REGISTER_IDTR	= 0x00070000,
+	HV_X64_REGISTER_GDTR	= 0x00070001,
+
+	/* X64 Virtualized MSRs */
+	HV_X64_REGISTER_TSC		= 0x00080000,
+	HV_X64_REGISTER_EFER		= 0x00080001,
+	HV_X64_REGISTER_KERNEL_GS_BASE	= 0x00080002,
+	HV_X64_REGISTER_APIC_BASE	= 0x00080003,
+	HV_X64_REGISTER_PAT		= 0x00080004,
+	HV_X64_REGISTER_SYSENTER_CS	= 0x00080005,
+	HV_X64_REGISTER_SYSENTER_EIP	= 0x00080006,
+	HV_X64_REGISTER_SYSENTER_ESP	= 0x00080007,
+	HV_X64_REGISTER_STAR		= 0x00080008,
+	HV_X64_REGISTER_LSTAR		= 0x00080009,
+	HV_X64_REGISTER_CSTAR		= 0x0008000A,
+	HV_X64_REGISTER_SFMASK		= 0x0008000B,
+	HV_X64_REGISTER_INITIAL_APIC_ID	= 0x0008000C,
+
+	/* X64 Cache control MSRs */
+	HV_X64_REGISTER_MSR_MTRR_CAP		= 0x0008000D,
+	HV_X64_REGISTER_MSR_MTRR_DEF_TYPE	= 0x0008000E,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0	= 0x00080010,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1	= 0x00080011,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2	= 0x00080012,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3	= 0x00080013,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4	= 0x00080014,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5	= 0x00080015,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6	= 0x00080016,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7	= 0x00080017,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8	= 0x00080018,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9	= 0x00080019,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA	= 0x0008001A,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB	= 0x0008001B,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC	= 0x0008001C,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASED	= 0x0008001D,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE	= 0x0008001E,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF	= 0x0008001F,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0	= 0x00080040,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1	= 0x00080041,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2	= 0x00080042,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3	= 0x00080043,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4	= 0x00080044,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5	= 0x00080045,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6	= 0x00080046,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7	= 0x00080047,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8	= 0x00080048,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9	= 0x00080049,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA	= 0x0008004A,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB	= 0x0008004B,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC	= 0x0008004C,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD	= 0x0008004D,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE	= 0x0008004E,
+	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF	= 0x0008004F,
+	HV_X64_REGISTER_MSR_MTRR_FIX64K00000	= 0x00080070,
+	HV_X64_REGISTER_MSR_MTRR_FIX16K80000	= 0x00080071,
+	HV_X64_REGISTER_MSR_MTRR_FIX16KA0000	= 0x00080072,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KC0000	= 0x00080073,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KC8000	= 0x00080074,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KD0000	= 0x00080075,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KD8000	= 0x00080076,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KE0000	= 0x00080077,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KE8000	= 0x00080078,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KF0000	= 0x00080079,
+	HV_X64_REGISTER_MSR_MTRR_FIX4KF8000	= 0x0008007A,
+
+	HV_X64_REGISTER_TSC_AUX		= 0x0008007B,
+	HV_X64_REGISTER_BNDCFGS		= 0x0008007C,
+	HV_X64_REGISTER_DEBUG_CTL	= 0x0008007D,
+
+	/* Available */
+	HV_X64_REGISTER_AVAILABLE0008007E	= 0x0008007E,
+	HV_X64_REGISTER_AVAILABLE0008007F	= 0x0008007F,
+
+	HV_X64_REGISTER_SGX_LAUNCH_CONTROL0	= 0x00080080,
+	HV_X64_REGISTER_SGX_LAUNCH_CONTROL1	= 0x00080081,
+	HV_X64_REGISTER_SGX_LAUNCH_CONTROL2	= 0x00080082,
+	HV_X64_REGISTER_SGX_LAUNCH_CONTROL3	= 0x00080083,
+	HV_X64_REGISTER_SPEC_CTRL		= 0x00080084,
+	HV_X64_REGISTER_PRED_CMD		= 0x00080085,
+	HV_X64_REGISTER_VIRT_SPEC_CTRL		= 0x00080086,
+
+	/* Other MSRs */
+	HV_X64_REGISTER_MSR_IA32_MISC_ENABLE		= 0x000800A0,
+	HV_X64_REGISTER_IA32_FEATURE_CONTROL		= 0x000800A1,
+	HV_X64_REGISTER_IA32_VMX_BASIC			= 0x000800A2,
+	HV_X64_REGISTER_IA32_VMX_PINBASED_CTLS		= 0x000800A3,
+	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS		= 0x000800A4,
+	HV_X64_REGISTER_IA32_VMX_EXIT_CTLS		= 0x000800A5,
+	HV_X64_REGISTER_IA32_VMX_ENTRY_CTLS		= 0x000800A6,
+	HV_X64_REGISTER_IA32_VMX_MISC			= 0x000800A7,
+	HV_X64_REGISTER_IA32_VMX_CR0_FIXED0		= 0x000800A8,
+	HV_X64_REGISTER_IA32_VMX_CR0_FIXED1		= 0x000800A9,
+	HV_X64_REGISTER_IA32_VMX_CR4_FIXED0		= 0x000800AA,
+	HV_X64_REGISTER_IA32_VMX_CR4_FIXED1		= 0x000800AB,
+	HV_X64_REGISTER_IA32_VMX_VMCS_ENUM		= 0x000800AC,
+	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS2	= 0x000800AD,
+	HV_X64_REGISTER_IA32_VMX_EPT_VPID_CAP		= 0x000800AE,
+	HV_X64_REGISTER_IA32_VMX_TRUE_PINBASED_CTLS	= 0x000800AF,
+	HV_X64_REGISTER_IA32_VMX_TRUE_PROCBASED_CTLS	= 0x000800B0,
+	HV_X64_REGISTER_IA32_VMX_TRUE_EXIT_CTLS		= 0x000800B1,
+	HV_X64_REGISTER_IA32_VMX_TRUE_ENTRY_CTLS	= 0x000800B2,
+
+	/* Performance monitoring MSRs */
+	HV_X64_REGISTER_PERF_GLOBAL_CTRL	= 0x00081000,
+	HV_X64_REGISTER_PERF_GLOBAL_STATUS	= 0x00081001,
+	HV_X64_REGISTER_PERF_GLOBAL_IN_USE	= 0x00081002,
+	HV_X64_REGISTER_FIXED_CTR_CTRL		= 0x00081003,
+	HV_X64_REGISTER_DS_AREA			= 0x00081004,
+	HV_X64_REGISTER_PEBS_ENABLE		= 0x00081005,
+	HV_X64_REGISTER_PEBS_LD_LAT		= 0x00081006,
+	HV_X64_REGISTER_PEBS_FRONTEND		= 0x00081007,
+	HV_X64_REGISTER_PERF_EVT_SEL0		= 0x00081100,
+	HV_X64_REGISTER_PMC0			= 0x00081200,
+	HV_X64_REGISTER_FIXED_CTR0		= 0x00081300,
+
+	HV_X64_REGISTER_LBR_TOS		= 0x00082000,
+	HV_X64_REGISTER_LBR_SELECT	= 0x00082001,
+	HV_X64_REGISTER_LER_FROM_LIP	= 0x00082002,
+	HV_X64_REGISTER_LER_TO_LIP	= 0x00082003,
+	HV_X64_REGISTER_LBR_FROM0	= 0x00082100,
+	HV_X64_REGISTER_LBR_TO0		= 0x00082200,
+	HV_X64_REGISTER_LBR_INFO0	= 0x00083300,
+
+	/* Intel processor trace MSRs */
+	HV_X64_REGISTER_RTIT_CTL		= 0x00081008,
+	HV_X64_REGISTER_RTIT_STATUS		= 0x00081009,
+	HV_X64_REGISTER_RTIT_OUTPUT_BASE	= 0x0008100A,
+	HV_X64_REGISTER_RTIT_OUTPUT_MASK_PTRS	= 0x0008100B,
+	HV_X64_REGISTER_RTIT_CR3_MATCH		= 0x0008100C,
+	HV_X64_REGISTER_RTIT_ADDR0A		= 0x00081400,
+
+	/* RtitAddr0A/B - RtitAddr3A/B occupy 0x00081400-0x00081407. */
+
+	/* X64 Apic registers. These match the equivalent x2APIC MSR offsets. */
+	HV_X64_REGISTER_APIC_ID		= 0x00084802,
+	HV_X64_REGISTER_APIC_VERSION	= 0x00084803,
+
+	/* Hypervisor-defined registers (Misc) */
+	HV_X64_REGISTER_HYPERCALL	= 0x00090001,
+
+	/* X64 Virtual APIC registers synthetic MSRs */
+	HV_X64_REGISTER_SYNTHETIC_EOI	= 0x00090010,
+	HV_X64_REGISTER_SYNTHETIC_ICR	= 0x00090011,
+	HV_X64_REGISTER_SYNTHETIC_TPR	= 0x00090012,
+
+	/* Partition Timer Assist Registers */
+	HV_X64_REGISTER_EMULATED_TIMER_PERIOD	= 0x00090030,
+	HV_X64_REGISTER_EMULATED_TIMER_CONTROL	= 0x00090031,
+	HV_X64_REGISTER_PM_TIMER_ASSIST		= 0x00090032,
+
+	/* Intercept Control Registers */
+	HV_X64_REGISTER_CR_INTERCEPT_CONTROL			= 0x000E0000,
+	HV_X64_REGISTER_CR_INTERCEPT_CR0_MASK			= 0x000E0001,
+	HV_X64_REGISTER_CR_INTERCEPT_CR4_MASK			= 0x000E0002,
+	HV_X64_REGISTER_CR_INTERCEPT_IA32_MISC_ENABLE_MASK	= 0x000E0003,
+
+};
+
+struct hv_u128 {
+	__u64 high_part;
+	__u64 low_part;
+};
+
+union hv_x64_fp_register {
+	struct hv_u128 as_uint128;
+	struct {
+		__u64 mantissa;
+		__u64 biased_exponent : 15;
+		__u64 sign : 1;
+		__u64 reserved : 48;
+	};
+};
+
+union hv_x64_fp_control_status_register {
+	struct hv_u128 as_uint128;
+	struct {
+		__u16 fp_control;
+		__u16 fp_status;
+		__u8 fp_tag;
+		__u8 reserved;
+		__u16 last_fp_op;
+		union {
+			/* long mode */
+			__u64 last_fp_rip;
+			/* 32 bit mode */
+			struct {
+				__u32 last_fp_eip;
+				__u16 last_fp_cs;
+			};
+		};
+	};
+};
+
+union hv_x64_xmm_control_status_register {
+	struct hv_u128 as_uint128;
+	struct {
+		union {
+			/* long mode */
+			__u64 last_fp_rdp;
+			/* 32 bit mode */
+			struct {
+				__u32 last_fp_dp;
+				__u16 last_fp_ds;
+			};
+		};
+		__u32 xmm_status_control;
+		__u32 xmm_status_control_mask;
+	};
+};
+
+struct hv_x64_segment_register {
+	__u64 base;
+	__u32 limit;
+	__u16 selector;
+	union {
+		struct {
+			__u16 segment_type : 4;
+			__u16 non_system_segment : 1;
+			__u16 descriptor_privilege_level : 2;
+			__u16 present : 1;
+			__u16 reserved : 4;
+			__u16 available : 1;
+			__u16 _long : 1;
+			__u16 _default : 1;
+			__u16 granularity : 1;
+		};
+		__u16 attributes;
+	};
+};
+
+struct hv_x64_table_register {
+	__u16 pad[3];
+	__u16 limit;
+	__u64 base;
+};
+
+union hv_explicit_suspend_register {
+	__u64 as_uint64;
+	struct {
+		__u64 suspended : 1;
+		__u64 reserved : 63;
+	};
+};
+
+union hv_intercept_suspend_register {
+	__u64 as_uint64;
+	struct {
+		__u64 suspended : 1;
+		__u64 reserved : 63;
+	};
+};
+
+union hv_dispatch_suspend_register {
+	__u64 as_uint64;
+	struct {
+		__u64 suspended : 1;
+		__u64 reserved : 63;
+	};
+};
+
+union hv_x64_interrupt_state_register {
+	__u64 as_uint64;
+	struct {
+		__u64 interrupt_shadow : 1;
+		__u64 nmi_masked : 1;
+		__u64 reserved : 62;
+	};
+};
+
+union hv_x64_pending_interruption_register {
+	__u64 as_uint64;
+	struct {
+		__u32 interruption_pending : 1;
+		__u32 interruption_type : 3;
+		__u32 deliver_error_code : 1;
+		__u32 instruction_length : 4;
+		__u32 nested_event : 1;
+		__u32 reserved : 6;
+		__u32 interruption_vector : 16;
+		__u32 error_code;
+	};
+};
+
+union hv_x64_msr_npiep_config_contents {
+	__u64 as_uint64;
+	struct {
+		/*
+		 * These bits enable instruction execution prevention for
+		 * specific instructions.
+		 */
+		__u64 prevents_gdt : 1;
+		__u64 prevents_idt : 1;
+		__u64 prevents_ldt : 1;
+		__u64 prevents_tr : 1;
+
+		/* The reserved bits must always be 0. */
+		__u64 reserved : 60;
+	};
+};
+
+union hv_x64_pending_exception_event {
+	__u64 as_uint64[2];
+	struct {
+		__u32 event_pending : 1;
+		__u32 event_type : 3;
+		__u32 reserved0 : 4;
+		__u32 deliver_error_code : 1;
+		__u32 reserved1 : 7;
+		__u32 vector : 16;
+		__u32 error_code;
+		__u64 exception_parameter;
+	};
+};
+
+union hv_x64_pending_virtualization_fault_event {
+	__u64 as_uint64[2];
+	struct {
+		__u32 event_pending : 1;
+		__u32 event_type : 3;
+		__u32 reserved0 : 4;
+		__u32 reserved1 : 8;
+		__u32 parameter0 : 16;
+		__u32 code;
+		__u64 parameter1;
+	};
+};
+
+union hv_register_value {
+	struct hv_u128 reg128;
+	__u64 reg64;
+	__u32 reg32;
+	__u16 reg16;
+	__u8 reg8;
+	union hv_x64_fp_register fp;
+	union hv_x64_fp_control_status_register fp_control_status;
+	union hv_x64_xmm_control_status_register xmm_control_status;
+	struct hv_x64_segment_register segment;
+	struct hv_x64_table_register table;
+	union hv_explicit_suspend_register explicit_suspend;
+	union hv_intercept_suspend_register intercept_suspend;
+	union hv_dispatch_suspend_register dispatch_suspend;
+	union hv_x64_interrupt_state_register interrupt_state;
+	union hv_x64_pending_interruption_register pending_interruption;
+	union hv_x64_msr_npiep_config_contents npiep_config;
+	union hv_x64_pending_exception_event pending_exception_event;
+	union hv_x64_pending_virtualization_fault_event
+		pending_virtualization_fault_event;
+};
+
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 6e5072e29897..b9295400c20b 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -622,53 +622,30 @@ struct hv_retarget_device_interrupt {
 } __packed __aligned(8);
 
 
-/* HvGetVpRegisters hypercall input with variable size reg name list*/
-struct hv_get_vp_registers_input {
-	struct {
-		u64 partitionid;
-		u32 vpindex;
-		u8  inputvtl;
-		u8  padding[3];
-	} header;
-	struct input {
-		u32 name0;
-		u32 name1;
-	} element[];
-} __packed;
-
+/* HvGetVpRegisters hypercall with variable size reg name list*/
+struct hv_get_vp_registers {
+	u64 partition_id;
+	u32 vp_index;
+	u8  input_vtl;
+	u8  rsvd_z8;
+	u16 rsvd_z16;
+	__aligned(8) enum hv_register_name names[];
+} __aligned(8);
 
-/* HvGetVpRegisters returns an array of these output elements */
-struct hv_get_vp_registers_output {
-	union {
-		struct {
-			u32 a;
-			u32 b;
-			u32 c;
-			u32 d;
-		} as32 __packed;
-		struct {
-			u64 low;
-			u64 high;
-		} as64 __packed;
-	};
+/* HvSetVpRegisters hypercall with variable size reg name/value list*/
+struct hv_register_assoc {
+	enum hv_register_name name;
+	__aligned(16) union hv_register_value value;
 };
 
-/* HvSetVpRegisters hypercall with variable size reg name/value list*/
-struct hv_set_vp_registers_input {
-	struct {
-		u64 partitionid;
-		u32 vpindex;
-		u8  inputvtl;
-		u8  padding[3];
-	} header;
-	struct {
-		u32 name;
-		u32 padding1;
-		u64 padding2;
-		u64 valuelow;
-		u64 valuehigh;
-	} element[];
-} __packed;
+struct hv_set_vp_registers {
+	u64 partition_id;
+	u32 vp_index;
+	u8  input_vtl;
+	u8  rsvd_z8;
+	u16 rsvd_z16;
+	struct hv_register_assoc elements[];
+} __aligned(16);
 
 enum hv_device_type {
 	HV_DEVICE_TYPE_LOGICAL = 0,
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index 50521c5f7948..dfe469f573f9 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -17,6 +17,7 @@
 struct mshv_vp {
 	u32 index;
 	struct mshv_partition *partition;
+	struct mutex mutex;
 };
 
 struct mshv_mem_region {
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 1f053eae68a6..5d53ed655429 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -33,6 +33,14 @@ struct mshv_create_vp {
 	__u32 vp_index;
 };
 
+#define MSHV_VP_MAX_REGISTERS	128
+
+struct mshv_vp_registers {
+	int count; /* at most MSHV_VP_MAX_REGISTERS */
+	enum hv_register_name *names;
+	union hv_register_value *values;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -44,4 +52,8 @@ struct mshv_create_vp {
 #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
 #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
 
+/* vp device */
+#define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
+#define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
+
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 3be9d9a468c1..2a10137a1e84 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -74,6 +74,12 @@ static struct miscdevice mshv_dev = {
 #define HV_MAP_GPA_BATCH_SIZE	\
 		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
 #define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
+#define HV_GET_REGISTER_BATCH_SIZE	\
+	(PAGE_SIZE / \
+	 sizeof(struct hv_get_vp_registers) / sizeof(enum hv_register_name))
+#define HV_SET_REGISTER_BATCH_SIZE	\
+	(PAGE_SIZE / \
+	 sizeof(struct hv_set_vp_registers) / sizeof(struct hv_register_assoc))
 
 static int
 hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
@@ -380,10 +386,258 @@ hv_call_unmap_gpa_pages(u64 partition_id,
 	return ret;
 }
 
+static int
+hv_call_get_vp_registers(u32 vp_index,
+			 u64 partition_id,
+			 u16 count,
+			 const enum hv_register_name *names,
+			 union hv_register_value *values)
+{
+	struct hv_get_vp_registers *input_page;
+	union hv_register_value *output_page;
+	u16 completed = 0;
+	u64 hypercall_status;
+	unsigned long remaining = count;
+	int rep_count;
+	int status;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	input_page = (struct hv_get_vp_registers *)(*this_cpu_ptr(
+		hyperv_pcpu_input_arg));
+	output_page = (union hv_register_value *)(*this_cpu_ptr(
+		hyperv_pcpu_output_arg));
+
+	input_page->partition_id = partition_id;
+	input_page->vp_index = vp_index;
+	input_page->input_vtl = 0;
+	input_page->rsvd_z8 = 0;
+	input_page->rsvd_z16 = 0;
+
+	while (remaining) {
+		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
+		memcpy(input_page->names, names,
+			sizeof(enum hv_register_name) * rep_count);
+
+		hypercall_status =
+			hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
+					    0, input_page, output_page);
+		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
+		if (status != HV_STATUS_SUCCESS) {
+			pr_err("%s: completed %li out of %u, %s\n",
+			       __func__,
+			       count - remaining, count,
+			       hv_status_to_string(status));
+			break;
+		}
+		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
+			    HV_HYPERCALL_REP_COMP_OFFSET;
+		memcpy(values, output_page,
+			sizeof(union hv_register_value) * completed);
+
+		names += completed;
+		values += completed;
+		remaining -= completed;
+	}
+	local_irq_restore(flags);
+
+	return -hv_status_to_errno(status);
+}
+
+static int
+hv_call_set_vp_registers(u32 vp_index,
+			 u64 partition_id,
+			 u16 count,
+			 struct hv_register_assoc *registers)
+{
+	struct hv_set_vp_registers *input_page;
+	u16 completed = 0;
+	u64 hypercall_status;
+	unsigned long remaining = count;
+	int rep_count;
+	int status;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	input_page = (struct hv_set_vp_registers *)(*this_cpu_ptr(
+		hyperv_pcpu_input_arg));
+
+	input_page->partition_id = partition_id;
+	input_page->vp_index = vp_index;
+	input_page->input_vtl = 0;
+	input_page->rsvd_z8 = 0;
+	input_page->rsvd_z16 = 0;
+
+	while (remaining) {
+		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
+		memcpy(input_page->elements, registers,
+			sizeof(struct hv_register_assoc) * rep_count);
+
+		hypercall_status =
+			hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
+					    0, input_page, NULL);
+		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
+		if (status != HV_STATUS_SUCCESS) {
+			pr_err("%s: completed %li out of %u, %s\n",
+			       __func__,
+			       count - remaining, count,
+			       hv_status_to_string(status));
+			break;
+		}
+		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
+			    HV_HYPERCALL_REP_COMP_OFFSET;
+		registers += completed;
+		remaining -= completed;
+	}
+
+	local_irq_restore(flags);
+
+	return -hv_status_to_errno(status);
+}
+
+static long
+mshv_vp_ioctl_get_regs(struct mshv_vp *vp, void __user *user_args)
+{
+	struct mshv_vp_registers args;
+	enum hv_register_name *names;
+	union hv_register_value *values;
+	long ret;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.count > MSHV_VP_MAX_REGISTERS)
+		return -EINVAL;
+
+	names = kmalloc_array(args.count,
+			      sizeof(enum hv_register_name),
+			      GFP_KERNEL);
+	if (!names)
+		return -ENOMEM;
+
+	values = kmalloc_array(args.count,
+			       sizeof(union hv_register_value),
+			       GFP_KERNEL);
+	if (!values) {
+		kfree(names);
+		return -ENOMEM;
+	}
+
+	if (copy_from_user(names, args.names,
+			   sizeof(enum hv_register_name) * args.count)) {
+		ret = -EFAULT;
+		goto free_return;
+	}
+
+	ret = hv_call_get_vp_registers(vp->index, vp->partition->id,
+				       args.count, names, values);
+	if (ret)
+		goto free_return;
+
+	if (copy_to_user(args.values, values,
+			 sizeof(union hv_register_value) * args.count)) {
+		ret = -EFAULT;
+	}
+
+free_return:
+	kfree(names);
+	kfree(values);
+	return ret;
+}
+
+static long
+mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
+{
+	int i;
+	struct mshv_vp_registers args;
+	struct hv_register_assoc *registers;
+	enum hv_register_name *names;
+	union hv_register_value *values;
+	long ret;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	if (args.count > MSHV_VP_MAX_REGISTERS)
+		return -EINVAL;
+
+	names = kmalloc_array(args.count,
+			      sizeof(enum hv_register_name),
+			      GFP_KERNEL);
+	if (!names)
+		return -ENOMEM;
+
+	values = kmalloc_array(args.count,
+			       sizeof(union hv_register_value),
+			       GFP_KERNEL);
+	if (!values) {
+		kfree(names);
+		return -ENOMEM;
+	}
+
+	registers = kmalloc_array(args.count,
+				  sizeof(struct hv_register_assoc),
+				  GFP_KERNEL);
+	if (!registers) {
+		kfree(values);
+		kfree(names);
+		return -ENOMEM;
+	}
+
+	if (copy_from_user(names, args.names,
+			   sizeof(enum hv_register_name) * args.count)) {
+		ret = -EFAULT;
+		goto free_return;
+	}
+
+	if (copy_from_user(values, args.values,
+			   sizeof(union hv_register_value) * args.count)) {
+		ret = -EFAULT;
+		goto free_return;
+	}
+
+	for (i = 0; i < args.count; i++) {
+		memcpy(&registers[i].name, &names[i],
+		       sizeof(enum hv_register_name));
+		memcpy(&registers[i].value, &values[i],
+		       sizeof(union hv_register_value));
+	}
+
+	ret = hv_call_set_vp_registers(vp->index, vp->partition->id,
+				       args.count, registers);
+
+free_return:
+	kfree(names);
+	kfree(values);
+	kfree(registers);
+	return ret;
+}
+
+
 static long
 mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
-	return -ENOTTY;
+	struct mshv_vp *vp = filp->private_data;
+	long r = 0;
+
+	if (mutex_lock_killable(&vp->mutex))
+		return -EINTR;
+
+	switch (ioctl) {
+	case MSHV_GET_VP_REGISTERS:
+		r = mshv_vp_ioctl_get_regs(vp, (void __user *)arg);
+		break;
+	case MSHV_SET_VP_REGISTERS:
+		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
+		break;
+	default:
+		r = -ENOTTY;
+		break;
+	}
+	mutex_unlock(&vp->mutex);
+
+	return r;
 }
 
 static int
@@ -420,6 +674,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	if (!vp)
 		return -ENOMEM;
 
+	mutex_init(&vp->mutex);
+
 	vp->index = args.vp_index;
 	vp->partition = mshv_partition_get(partition);
 	if (!vp->partition) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (9 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:47   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr Nuno Das Neves
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
and hv_synic_disable_regs().
Setting up synic registers in both vmbus driver and mshv would clobber
them, but the vmbus driver will not run in the root partition, so this
is safe.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
 include/asm-generic/hyperv-tlfs.h       |  46 +----
 include/linux/mshv.h                    |   1 +
 include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
 virt/mshv/mshv_main.c                   |  98 ++++++++-
 6 files changed, 404 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 4cd44ae9bffb..c34a6bb4f457 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
 #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
 #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
 
-
-/* Define hypervisor message types. */
-enum hv_message_type {
-	HVMSG_NONE			= 0x00000000,
-
-	/* Memory access messages. */
-	HVMSG_UNMAPPED_GPA		= 0x80000000,
-	HVMSG_GPA_INTERCEPT		= 0x80000001,
-
-	/* Timer notification messages. */
-	HVMSG_TIMER_EXPIRED		= 0x80000010,
-
-	/* Error messages. */
-	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
-	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
-	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
-
-	/* Trace buffer complete messages. */
-	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
-
-	/* Platform-specific processor intercept messages. */
-	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
-	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
-	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
-	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
-	HVMSG_X64_APIC_EOI		= 0x80010004,
-	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
-};
-
 struct hv_nested_enlightenments_control {
 	struct {
 		__u32 directhypercall:1;
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index 2ff655962738..c6a27053f791 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -722,4 +722,268 @@ union hv_register_value {
 		pending_virtualization_fault_event;
 };
 
+/* Define hypervisor message types. */
+enum hv_message_type {
+	HVMSG_NONE				= 0x00000000,
+
+	/* Memory access messages. */
+	HVMSG_UNMAPPED_GPA			= 0x80000000,
+	HVMSG_GPA_INTERCEPT			= 0x80000001,
+
+	/* Timer notification messages. */
+	HVMSG_TIMER_EXPIRED			= 0x80000010,
+
+	/* Error messages. */
+	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
+	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
+	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
+
+	/* Trace buffer complete messages. */
+	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
+
+	/* Platform-specific processor intercept messages. */
+	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
+	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
+	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
+	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
+	HVMSG_X64_APIC_EOI			= 0x80010004,
+	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
+	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
+	HVMSG_X64_HALT				= 0x80010007,
+	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
+	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
+};
+
+
+union hv_x64_vp_execution_state {
+	__u16 as_uint16;
+	struct {
+		__u16 cpl:2;
+		__u16 cr0_pe:1;
+		__u16 cr0_am:1;
+		__u16 efer_lma:1;
+		__u16 debug_active:1;
+		__u16 interruption_pending:1;
+		__u16 vtl:4;
+		__u16 enclave_mode:1;
+		__u16 interrupt_shadow:1;
+		__u16 virtualization_fault_active:1;
+		__u16 reserved:2;
+	};
+};
+
+/* Values for intercept_access_type field */
+#define HV_INTERCEPT_ACCESS_READ	0
+#define HV_INTERCEPT_ACCESS_WRITE	1
+#define HV_INTERCEPT_ACCESS_EXECUTE	2
+
+struct hv_x64_intercept_message_header {
+	__u32 vp_index;
+	__u8 instruction_length:4;
+	__u8 cr8:4; // only set for exo partitions
+	__u8 intercept_access_type;
+	union hv_x64_vp_execution_state execution_state;
+	struct hv_x64_segment_register cs_segment;
+	__u64 rip;
+	__u64 rflags;
+};
+
+#define HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS 6
+
+struct hv_x64_hypercall_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u64 rax;
+	__u64 rbx;
+	__u64 rcx;
+	__u64 rdx;
+	__u64 r8;
+	__u64 rsi;
+	__u64 rdi;
+	struct hv_u128 xmmregisters[HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS];
+	struct {
+		__u32 isolated:1;
+		__u32 reserved:31;
+	};
+};
+
+union hv_x64_register_access_info {
+	union hv_register_value source_value;
+	enum hv_register_name destination_register;
+	__u64 source_address;
+	__u64 destination_address;
+};
+
+struct hv_x64_register_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	struct {
+		__u8 is_memory_op:1;
+		__u8 reserved:7;
+	};
+	__u8 reserved8;
+	__u16 reserved16;
+	enum hv_register_name register_name;
+	union hv_x64_register_access_info access_info;
+};
+
+union hv_x64_memory_access_info {
+	__u8 as_uint8;
+	struct {
+		__u8 gva_valid:1;
+		__u8 gva_gpa_valid:1;
+		__u8 hypercall_output_pending:1;
+		__u8 tlb_locked_no_overlay:1;
+		__u8 reserved:4;
+	};
+};
+
+union hv_x64_io_port_access_info {
+	__u8 as_uint8;
+	struct {
+		__u8 access_size:3;
+		__u8 string_op:1;
+		__u8 rep_prefix:1;
+		__u8 reserved:3;
+	};
+};
+
+union hv_x64_exception_info {
+	__u8 as_uint8;
+	struct {
+		__u8 error_code_valid:1;
+		__u8 software_exception:1;
+		__u8 reserved:6;
+	};
+};
+
+enum hv_cache_type {
+	HV_CACHE_TYPE_UNCACHED	   = 0,
+	HV_CACHE_TYPE_WRITE_COMBINING = 1,
+	HV_CACHE_TYPE_WRITE_THROUGH   = 4,
+	HV_CACHE_TYPE_WRITE_PROTECTED = 5,
+	HV_CACHE_TYPE_WRITE_BACK	  = 6
+};
+
+struct hv_x64_memory_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	enum hv_cache_type cache_type;
+	__u8 instruction_byte_count;
+	union hv_x64_memory_access_info memory_access_info;
+	__u8 tpr_priority;
+	__u8 reserved1;
+	__u64 guest_virtual_address;
+	__u64 guest_physical_address;
+	__u8 instruction_bytes[16];
+};
+
+struct hv_x64_cpuid_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u64 rax;
+	__u64 rcx;
+	__u64 rdx;
+	__u64 rbx;
+	__u64 default_result_rax;
+	__u64 default_result_rcx;
+	__u64 default_result_rdx;
+	__u64 default_result_rbx;
+};
+
+struct hv_x64_msr_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u32 msr_number;
+	__u32 reserved;
+	__u64 rdx;
+	__u64 rax;
+};
+
+struct hv_x64_io_port_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u16 port_number;
+	union hv_x64_io_port_access_info access_info;
+	__u8 instruction_byte_count;
+	__u32 reserved;
+	__u64 rax;
+	__u8 instruction_bytes[16];
+	struct hv_x64_segment_register ds_segment;
+	struct hv_x64_segment_register es_segment;
+	__u64 rcx;
+	__u64 rsi;
+	__u64 rdi;
+};
+
+struct hv_x64_exception_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u16 exception_vector;
+	union hv_x64_exception_info exception_info;
+	__u8 instruction_byte_count;
+	__u32 error_code;
+	__u64 exception_parameter;
+	__u64 reserved;
+	__u8 instruction_bytes[16];
+	struct hv_x64_segment_register ds_segment;
+	struct hv_x64_segment_register ss_segment;
+	__u64 rax;
+	__u64 rcx;
+	__u64 rdx;
+	__u64 rbx;
+	__u64 rsp;
+	__u64 rbp;
+	__u64 rsi;
+	__u64 rdi;
+	__u64 r8;
+	__u64 r9;
+	__u64 r10;
+	__u64 r11;
+	__u64 r12;
+	__u64 r13;
+	__u64 r14;
+	__u64 r15;
+};
+
+struct hv_x64_invalid_vp_register_message {
+	__u32 vp_index;
+	__u32 reserved;
+};
+
+struct hv_x64_unrecoverable_exception_message {
+	struct hv_x64_intercept_message_header header;
+};
+
+enum hv_x64_unsupported_feature_code {
+	hv_unsupported_feature_intercept = 1,
+	hv_unsupported_feature_task_switch_tss = 2
+};
+
+struct hv_x64_unsupported_feature_message {
+	__u32 vp_index;
+	enum hv_x64_unsupported_feature_code feature_code;
+	__u64 feature_parameter;
+};
+
+struct hv_x64_halt_message {
+	struct hv_x64_intercept_message_header header;
+};
+
+enum hv_x64_pending_interruption_type {
+	HV_X64_PENDING_INTERRUPT	= 0,
+	HV_X64_PENDING_NMI		= 2,
+	HV_X64_PENDING_EXCEPTION	= 3
+};
+
+struct hv_x64_interruption_deliverable_message {
+	struct hv_x64_intercept_message_header header;
+	enum hv_x64_pending_interruption_type deliverable_type;
+	__u32 rsvd;
+};
+
+struct hv_x64_sipi_intercept_message {
+	struct hv_x64_intercept_message_header header;
+	__u32 target_vp_index;
+	__u32 interrupt_vector;
+};
+
+struct hv_x64_apic_eoi_message {
+	__u32 vp_index;
+	__u32 interrupt_vector;
+};
+
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index b9295400c20b..e0185c3872a9 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -241,6 +241,8 @@ static inline const char *hv_status_to_string(enum hv_status status)
 /* Valid SynIC vectors are 16-255. */
 #define HV_SYNIC_FIRST_VALID_VECTOR	(16)
 
+#define HV_SYNIC_INTERCEPTION_SINT_INDEX 0x00000000
+
 #define HV_SYNIC_CONTROL_ENABLE		(1ULL << 0)
 #define HV_SYNIC_SIMP_ENABLE		(1ULL << 0)
 #define HV_SYNIC_SIEFP_ENABLE		(1ULL << 0)
@@ -250,49 +252,6 @@ static inline const char *hv_status_to_string(enum hv_status status)
 
 #define HV_SYNIC_STIMER_COUNT		(4)
 
-/* Define synthetic interrupt controller message constants. */
-#define HV_MESSAGE_SIZE			(256)
-#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
-#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
-
-/* Define synthetic interrupt controller message flags. */
-union hv_message_flags {
-	__u8 asu8;
-	struct {
-		__u8 msg_pending:1;
-		__u8 reserved:7;
-	} __packed;
-};
-
-/* Define port identifier type. */
-union hv_port_id {
-	__u32 asu32;
-	struct {
-		__u32 id:24;
-		__u32 reserved:8;
-	} __packed u;
-};
-
-/* Define synthetic interrupt controller message header. */
-struct hv_message_header {
-	__u32 message_type;
-	__u8 payload_size;
-	union hv_message_flags message_flags;
-	__u8 reserved[2];
-	union {
-		__u64 sender;
-		union hv_port_id port;
-	};
-} __packed;
-
-/* Define synthetic interrupt controller message format. */
-struct hv_message {
-	struct hv_message_header header;
-	union {
-		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
-	} u;
-} __packed;
-
 /* Define the synthetic interrupt message page layout. */
 struct hv_message_page {
 	struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
@@ -306,7 +265,6 @@ struct hv_timer_message_payload {
 	__u64 delivery_time;	/* When the message was delivered */
 } __packed;
 
-
 /* Define synthetic interrupt controller flag constants. */
 #define HV_EVENT_FLAGS_COUNT		(256 * 8)
 #define HV_EVENT_FLAGS_LONG_COUNT	(256 / sizeof(unsigned long))
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index dfe469f573f9..7709aaa1e064 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -42,6 +42,7 @@ struct mshv_partition {
 };
 
 struct mshv {
+	struct hv_message_page __percpu **synic_message_page;
 	struct {
 		spinlock_t lock;
 		u64 count;
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index e7b09b9f00de..e87389054b68 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -6,6 +6,49 @@
 #define BIT(X)	(1ULL << (X))
 #endif
 
+/* Define synthetic interrupt controller message constants. */
+#define HV_MESSAGE_SIZE			(256)
+#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
+#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
+
+/* Define synthetic interrupt controller message flags. */
+union hv_message_flags {
+	__u8 asu8;
+	struct {
+		__u8 msg_pending:1;
+		__u8 reserved:7;
+	};
+};
+
+/* Define port identifier type. */
+union hv_port_id {
+	__u32 asu32;
+	struct {
+		__u32 id:24;
+		__u32 reserved:8;
+	} u;
+};
+
+/* Define synthetic interrupt controller message header. */
+struct hv_message_header {
+	enum hv_message_type message_type;
+	__u8 payload_size;
+	union hv_message_flags message_flags;
+	__u8 reserved[2];
+	union {
+		__u64 sender;
+		union hv_port_id port;
+	};
+};
+
+/* Define synthetic interrupt controller message format. */
+struct hv_message {
+	struct hv_message_header header;
+	union {
+		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
+	} u;
+};
+
 /* Userspace-visible partition creation flags */
 #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
 #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 2a10137a1e84..c9445d2edb37 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -15,6 +15,8 @@
 #include <linux/file.h>
 #include <linux/anon_inodes.h>
 #include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/cpuhotplug.h>
 #include <linux/mshv.h>
 #include <asm/mshyperv.h>
 
@@ -1152,23 +1154,111 @@ mshv_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static int
+mshv_synic_init(unsigned int cpu)
+{
+	union hv_synic_simp simp;
+	union hv_synic_sint sint;
+	union hv_synic_scontrol sctrl;
+	struct hv_message_page **msg_page =
+			this_cpu_ptr(mshv.synic_message_page);
+
+	/* Setup the Synic's message page */
+	hv_get_simp(simp.as_uint64);
+	simp.simp_enabled = true;
+	*msg_page = memremap(simp.base_simp_gpa << PAGE_SHIFT,
+			     PAGE_SIZE, MEMREMAP_WB);
+	if (!msg_page) {
+		pr_err("%s: memremap failed\n", __func__);
+		return -EFAULT;
+	}
+	hv_set_simp(simp.as_uint64);
+
+	/* Enable intercepts */
+	sint.as_uint64 = 0;
+	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.masked = false;
+	sint.auto_eoi = hv_recommend_using_aeoi();
+	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
+
+	/* Enable global synic bit */
+	hv_get_synic_state(sctrl.as_uint64);
+	sctrl.enable = 1;
+	hv_set_synic_state(sctrl.as_uint64);
+
+	return 0;
+}
+
+static int
+mshv_synic_cleanup(unsigned int cpu)
+{
+	union hv_synic_sint sint;
+	union hv_synic_simp simp;
+	union hv_synic_scontrol sctrl;
+	struct hv_message_page **msg_page =
+			this_cpu_ptr(mshv.synic_message_page);
+
+	/* Disable the interrupt */
+	hv_get_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
+	sint.masked = true;
+	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
+
+	/* Disable Synic's message page */
+	hv_get_simp(simp.as_uint64);
+	simp.simp_enabled = false;
+	hv_set_simp(simp.as_uint64);
+	memunmap(*msg_page);
+
+	/* Disable global synic bit */
+	hv_get_synic_state(sctrl.as_uint64);
+	sctrl.enable = 0;
+	hv_set_synic_state(sctrl.as_uint64);
+
+	return 0;
+}
+
+static int mshv_cpuhp_online;
+
 static int
 __init mshv_init(void)
 {
-	int r;
+	int ret;
 
-	r = misc_register(&mshv_dev);
-	if (r)
+	ret = misc_register(&mshv_dev);
+	if (ret) {
 		pr_err("%s: misc device register failed\n", __func__);
+		return ret;
+	}
+	spin_lock_init(&mshv.partitions.lock);
 
+	mshv.synic_message_page = alloc_percpu(struct hv_message_page *);
+	if (!mshv.synic_message_page) {
+		pr_err("%s: failed to allocate percpu synic page\n", __func__);
+		misc_deregister(&mshv_dev);
+		return -ENOMEM;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
+				mshv_synic_init,
+				mshv_synic_cleanup);
+	if (ret < 0) {
+		pr_err("%s: failed to setup cpu hotplug state: %i\n",
+		       __func__, ret);
+		return ret;
+	}
+
+	mshv_cpuhp_online = ret;
 	spin_lock_init(&mshv.partitions.lock);
 
-	return r;
+	return 0;
 }
 
 static void
 __exit mshv_exit(void)
 {
+	cpuhp_remove_state(mshv_cpuhp_online);
+	free_percpu(mshv.synic_message_page);
+
 	misc_deregister(&mshv_dev);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (10 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-24 16:15   ` Wei Liu
  2020-11-21  0:30 ` [RFC PATCH 13/18] virt/mshv: install intercept ioctl Nuno Das Neves
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce an ioctl for running a vp and an isr to copy messages from
the synic page to the vp data structure.

Add synchronization primitives to ensure that the isr is finished
when the run vp ioctl is entered.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst |  14 ++
 arch/x86/kernel/cpu/mshyperv.c  |  16 ++
 include/asm-generic/mshyperv.h  |   3 +
 include/linux/mshv.h            |   7 +
 include/uapi/linux/mshv.h       |   1 +
 virt/mshv/mshv_main.c           | 270 +++++++++++++++++++++++++++++++-
 6 files changed, 310 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 20a626ac02d4..f525c81f2bdd 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -106,4 +106,18 @@ Get/set vp registers. See asm/hyperv-tlfs.h for the complete set of registers.
 Includes general purpose platform registers, MSRs, and virtual registers that
 are part of Microsoft Hypervisor platform and not directly exposed to the guest.
 
+3.6 MSHV_RUN_VP
+---------------
+:Type: vp ioctl
+:Parameters: struct hv_message
+:Returns: 0 on success
+
+Run the vp, returning when it triggers an intercept, or if the calling thread
+is interrupted by a signal. In this case errno will be set to EINTR.
+
+On return, the vp will be suspended.
+This ioctl will fail on any vp that's already running (not suspended).
+
+Information about the intercept is returned in the hv_message struct.
+
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 4795e54550e6..e6ff4ed13233 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -37,6 +37,7 @@ struct ms_hyperv_info ms_hyperv;
 EXPORT_SYMBOL_GPL(ms_hyperv);
 
 #if IS_ENABLED(CONFIG_HYPERV)
+static void (*mshv_handler)(void);
 static void (*vmbus_handler)(void);
 static void (*hv_stimer0_handler)(void);
 static void (*hv_kexec_handler)(void);
@@ -47,6 +48,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	inc_irq_stat(irq_hv_callback_count);
+	if (mshv_handler)
+		mshv_handler();
+
 	if (vmbus_handler)
 		vmbus_handler();
 
@@ -56,6 +60,18 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	set_irq_regs(old_regs);
 }
 
+void hv_setup_mshv_irq(void (*handler)(void))
+{
+	mshv_handler = handler;
+}
+
+void hv_remove_mshv_irq(void)
+{
+	mshv_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(hv_setup_mshv_irq);
+EXPORT_SYMBOL_GPL(hv_remove_mshv_irq);
+
 int hv_setup_vmbus_irq(int irq, void (*handler)(void))
 {
 	/*
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index c57799684170..3283a8059ed5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -94,6 +94,9 @@ void hv_remove_vmbus_irq(void);
 void hv_enable_vmbus_irq(void);
 void hv_disable_vmbus_irq(void);
 
+void hv_setup_mshv_irq(void (*handler)(void));
+void hv_remove_mshv_irq(void);
+
 void hv_setup_kexec_handler(void (*handler)(void));
 void hv_remove_kexec_handler(void);
 void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs));
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index 7709aaa1e064..3933d80294f1 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -8,6 +8,8 @@
 
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/semaphore.h>
+#include <linux/sched.h>
 #include <uapi/linux/mshv.h>
 
 #define MSHV_MAX_PARTITIONS		128
@@ -18,6 +20,11 @@ struct mshv_vp {
 	u32 index;
 	struct mshv_partition *partition;
 	struct mutex mutex;
+	struct {
+		struct semaphore sem;
+		struct task_struct *task;
+		struct hv_message *intercept_message;
+	} run;
 };
 
 struct mshv_mem_region {
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 5d53ed655429..5be9e2d23893 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -55,5 +55,6 @@ struct mshv_vp_registers {
 /* vp device */
 #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
 #define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
+#define MSHV_RUN_VP		_IOR(MSHV_IOCTL, 0x07, struct hv_message)
 
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index c9445d2edb37..7ddb66d260ce 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -17,6 +17,7 @@
 #include <linux/mm.h>
 #include <linux/io.h>
 #include <linux/cpuhotplug.h>
+#include <linux/random.h>
 #include <linux/mshv.h>
 #include <asm/mshyperv.h>
 
@@ -498,6 +499,240 @@ hv_call_set_vp_registers(u32 vp_index,
 	return -hv_status_to_errno(status);
 }
 
+static void
+mshv_isr(void)
+{
+	struct hv_message_page **msg_page =
+			this_cpu_ptr(mshv.synic_message_page);
+	struct hv_message *msg;
+	enum hv_message_type message_type;
+	struct mshv_partition *partition;
+	struct mshv_vp *vp;
+	u64 partition_id;
+	u32 vp_index;
+	int i;
+	unsigned long flags;
+	struct task_struct *task;
+
+	if (unlikely(!(*msg_page))) {
+		pr_err("%s: Missing synic page!\n", __func__);
+		return;
+	}
+
+	msg = &((*msg_page)->sint_message[HV_SYNIC_INTERCEPTION_SINT_INDEX]);
+
+	/*
+	 * If the type isn't set, there isn't really a message;
+	 * it may be some other hyperv interrupt
+	 */
+	message_type = msg->header.message_type;
+	if (message_type == HVMSG_NONE)
+		return;
+
+	/* Look for the partition */
+	partition_id = msg->header.sender;
+
+	/* Hold this lock for the rest of the isr, because the partition could
+	 * be released anytime.
+	 * e.g. the MSHV_RUN_VP thread could wake on another cpu; it could
+	 * release the partition unless we hold this!
+	 */
+	spin_lock_irqsave(&mshv.partitions.lock, flags);
+
+	for (i = 0; i < MSHV_MAX_PARTITIONS; i++) {
+		partition = mshv.partitions.array[i];
+		if (partition && partition->id == partition_id)
+			break;
+	}
+
+	if (unlikely(i == MSHV_MAX_PARTITIONS)) {
+		pr_err("%s: failed to find partition\n", __func__);
+		goto unlock_out;
+	}
+
+	/*
+	 * Since we directly index the vp, and it has to exist for us to be here
+	 * (because the vp is only deleted when the partition is), no additional
+	 * locking is needed here
+	 */
+	vp_index = ((struct hv_x64_intercept_message_header *)msg->u.payload)->vp_index;
+	vp = partition->vps.array[vp_index];
+	if (unlikely(!vp)) {
+		pr_err("%s: failed to find vp\n", __func__);
+		goto unlock_out;
+	}
+
+	memcpy(vp->run.intercept_message, msg, sizeof(struct hv_message));
+
+	if (unlikely(!vp->run.task)) {
+		pr_err("%s: vp run task not set\n", __func__);
+		goto unlock_out;
+	}
+
+	/* Save the task and reset it so we can wake without racing */
+	task = vp->run.task;
+	vp->run.task = NULL;
+
+	/*
+	 * up the semaphore before waking so that we don't race with
+	 * down_trylock
+	 */
+	up(&vp->run.sem);
+
+	/*
+	 * Finally, wake the process. If it wakes the vp and generates
+	 * another intercept then the message will be queued by the hypervisor
+	 */
+	wake_up_process(task);
+
+unlock_out:
+	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
+
+	/* Acknowledge message with hypervisor */
+	msg->header.message_type = HVMSG_NONE;
+	wrmsrl(HV_X64_MSR_EOM, 0);
+
+	add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
+}
+
+
+static long
+mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_message)
+{
+	long ret;
+	enum hv_message_type msg_type;
+	struct hv_register_assoc set_registers[2] = {
+		{ .name = HV_REGISTER_EXPLICIT_SUSPEND },
+		{ .name = HV_REGISTER_INTERCEPT_SUSPEND }
+	};
+	const enum hv_register_name get_register_names[2] = {
+		HV_REGISTER_EXPLICIT_SUSPEND,
+		HV_REGISTER_INTERCEPT_SUSPEND
+	};
+	union hv_register_value get_register_values[2];
+	/* Pointers to values for convenience */
+	union hv_explicit_suspend_register *set_explicit_suspend =
+				&set_registers[0].value.explicit_suspend;
+	union hv_intercept_suspend_register *set_intercept_suspend =
+				&set_registers[1].value.intercept_suspend;
+	union hv_explicit_suspend_register *get_explicit_suspend =
+				&get_register_values[0].explicit_suspend;
+	union hv_intercept_suspend_register *get_intercept_suspend =
+				&get_register_values[1].intercept_suspend;
+
+	/* Check that the VP is suspended */
+	ret = hv_call_get_vp_registers(
+			vp->index,
+			vp->partition->id,
+			2,
+			get_register_names,
+			get_register_values
+			);
+	if (ret)
+		return ret;
+
+	if (!get_explicit_suspend->suspended &&
+	    !get_intercept_suspend->suspended) {
+		pr_err("%s: vp not suspended!\n", __func__);
+		return -EBADFD;
+	}
+
+	/*
+	 * If intercept_suspend is set, we missed a message and need to
+	 * wait for mshv_isr to complete
+	 */
+	if (get_intercept_suspend->suspended) {
+		if (down_interruptible(&vp->run.sem))
+			return -EINTR;
+		if (copy_to_user(ret_message, vp->run.intercept_message,
+				 sizeof(struct hv_message)))
+			return -EFAULT;
+		return 0;
+	}
+
+	/*
+	 * At this point the semaphore ensures that mshv_isr is done,
+	 * and the mutex ensures that no other threads are touching this vp
+	 */
+	vp->run.task = current;
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	/* Now actually start the vp running */
+	set_explicit_suspend->suspended = 0;
+	set_intercept_suspend->suspended = 0;
+	ret = hv_call_set_vp_registers(
+			vp->index,
+			vp->partition->id,
+			2,
+			set_registers);
+	if (ret) {
+		pr_err("%s: failed to clear suspend bits\n", __func__);
+		set_current_state(TASK_RUNNING);
+		vp->run.task = NULL;
+		return ret;
+	}
+
+	schedule();
+
+	/* Explicitly suspend the vp to make sure it's stopped */
+	set_explicit_suspend->suspended = 1;
+	ret = hv_call_set_vp_registers(
+		vp->index,
+		vp->partition->id,
+		1,
+		&set_registers[0]);
+	if (ret) {
+		pr_err("%s: failed to set explicit suspend bit\n", __func__);
+		return -EBADFD;
+	}
+
+	/*
+	 * Check if woken up by a signal
+	 * Note that if the signal came after being woken by mshv_isr(),
+	 * we will still get the message correctly on re-entry
+	 */
+	if (signal_pending(current)) {
+		pr_debug("%s: woke up, received signal\n", __func__);
+		return -EINTR;
+	}
+
+	/*
+	 * No signal pending, so we were woken by hv_host_isr()
+	 * The isr can't be running now, and the intercept_suspend bit is set
+	 * We use it as a flag to tell if we missed a message due to a signal,
+	 * so we must clear it here and reset the semaphore
+	 */
+	set_intercept_suspend->suspended = 0;
+	ret = hv_call_set_vp_registers(
+		vp->index,
+		vp->partition->id,
+		1,
+		&set_registers[1]);
+	if (ret) {
+		pr_err("%s: failed to clear intercept suspend bit\n", __func__);
+		return -EBADFD;
+	}
+	if (down_trylock(&vp->run.sem)) {
+		pr_err("%s: semaphore in unexpected state\n", __func__);
+		return -EBADFD;
+	}
+
+	msg_type = vp->run.intercept_message->header.message_type;
+
+	if (msg_type == HVMSG_NONE) {
+		pr_err("%s: woke up, but no message\n", __func__);
+		return -ENOMSG;
+	}
+
+	if (copy_to_user(ret_message, vp->run.intercept_message,
+			 sizeof(struct hv_message)))
+		return -EFAULT;
+
+	return 0;
+}
+
+
+
 static long
 mshv_vp_ioctl_get_regs(struct mshv_vp *vp, void __user *user_args)
 {
@@ -600,6 +835,19 @@ mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
 	}
 
 	for (i = 0; i < args.count; i++) {
+
+		/*
+		 * Disallow setting suspend registers to ensure run vp state
+		 * is consistent
+		 */
+		if (names[i] == HV_REGISTER_EXPLICIT_SUSPEND ||
+		    names[i] == HV_REGISTER_INTERCEPT_SUSPEND) {
+			pr_err("%s: not allowed to set suspend registers\n",
+			       __func__);
+			ret = -EINVAL;
+			goto free_return;
+		}
+
 		memcpy(&registers[i].name, &names[i],
 		       sizeof(enum hv_register_name));
 		memcpy(&registers[i].value, &values[i],
@@ -627,6 +875,9 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		return -EINTR;
 
 	switch (ioctl) {
+	case MSHV_RUN_VP:
+		r = mshv_vp_ioctl_run_vp(vp, (void __user *)arg);
+		break;
 	case MSHV_GET_VP_REGISTERS:
 		r = mshv_vp_ioctl_get_regs(vp, (void __user *)arg);
 		break;
@@ -677,12 +928,20 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 		return -ENOMEM;
 
 	mutex_init(&vp->mutex);
+	sema_init(&vp->run.sem, 0);
+
+	vp->run.intercept_message =
+		(struct hv_message *)get_zeroed_page(GFP_KERNEL);
+	if (!vp->run.intercept_message) {
+		ret = -ENOMEM;
+		goto free_vp;
+	}
 
 	vp->index = args.vp_index;
 	vp->partition = mshv_partition_get(partition);
 	if (!vp->partition) {
 		ret = -EBADF;
-		goto free_vp;
+		goto free_message;
 	}
 
 	fd = get_unused_fd_flags(O_CLOEXEC);
@@ -720,6 +979,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	put_unused_fd(fd);
 put_partition:
 	mshv_partition_put(partition);
+free_message:
+	free_page((unsigned long)vp->run.intercept_message);
 free_vp:
 	kfree(vp);
 
@@ -939,6 +1200,9 @@ destroy_partition(struct mshv_partition *partition)
 		mshv.partitions.array[i] = NULL;
 	}
 
+	if (!mshv.partitions.count)
+		hv_remove_mshv_irq();
+
 	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
 
 	/*
@@ -958,6 +1222,7 @@ destroy_partition(struct mshv_partition *partition)
 		vp = partition->vps.array[i];
 		if (!vp)
 			continue;
+		free_page((unsigned long)vp->run.intercept_message);
 		kfree(vp);
 	}
 
@@ -1021,6 +1286,9 @@ add_partition(struct mshv_partition *partition)
 	mshv.partitions.count++;
 	mshv.partitions.array[i] = partition;
 
+	if (mshv.partitions.count == 1)
+		hv_setup_mshv_irq(mshv_isr);
+
 out_unlock:
 	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 13/18] virt/mshv: install intercept ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (11 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 14/18] virt/mshv: assert interrupt ioctl Nuno Das Neves
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctl for configuring intercept messages from a guest partition.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         |  9 +++++
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 43 ++++++++++++++++++++++
 include/asm-generic/hyperv-tlfs.h       |  8 ++++
 include/uapi/linux/mshv.h               |  7 ++++
 virt/mshv/mshv_main.c                   | 49 ++++++++++++++++++++++++-
 5 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index f525c81f2bdd..95ec77dc73f0 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -120,4 +120,13 @@ This ioctl will fail on any vp that's already running (not suspended).
 
 Information about the intercept is returned in the hv_message struct.
 
+3.7 MSHV_INSTALL_INTERCEPT
+--------------------------
+:Type: partition ioctl
+:Parameters: struct mshv_install_intercept
+:Returns: 0 on success
+
+Enable and configure different types of intercepts. Intercepts are events in a
+guest partition that will suspend the guest vp and send a message to the root
+partition (returned from MSHV_RUN_VP).
 
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index c6a27053f791..28917301b6df 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -986,4 +986,47 @@ struct hv_x64_apic_eoi_message {
 	__u32 interrupt_vector;
 };
 
+enum hv_intercept_type {
+	HV_INTERCEPT_TYPE_X64_IO_PORT			= 0X00000000,
+	HV_INTERCEPT_TYPE_X64_MSR			= 0X00000001,
+	HV_INTERCEPT_TYPE_X64_CPUID			= 0X00000002,
+	HV_INTERCEPT_TYPE_EXCEPTION			= 0X00000003,
+	HV_INTERCEPT_TYPE_REGISTER			= 0X00000004,
+	HV_INTERCEPT_TYPE_MMIO				= 0X00000005,
+	HV_INTERCEPT_TYPE_X64_GLOBAL_CPUID		= 0X00000006,
+	HV_INTERCEPT_TYPE_X64_APIC_SMI			= 0X00000007,
+	HV_INTERCEPT_TYPE_HYPERCALL			= 0X00000008,
+	HV_INTERCEPT_TYPE_X64_APIC_INIT_SIPI		= 0X00000009,
+	HV_INTERCEPT_MC_UPDATE_PATCH_LEVEL_MSR_READ	= 0X0000000A,
+	HV_INTERCEPT_TYPE_X64_APIC_WRITE		= 0X0000000B,
+	HV_INTERCEPT_TYPE_MAX,
+	HV_INTERCEPT_TYPE_INVALID			= 0XFFFFFFFF,
+};
+
+union hv_intercept_parameters {
+	__u64 as_uint64;
+
+	/* hv_intercept_type_x64_io_port */
+	__u16 io_port;
+
+	/* hv_intercept_type_x64_cpuid */
+	__u32 cpuid_index;
+
+	/* hv_intercept_type_x64_apic_write */
+	__u32 apic_write_mask;
+
+	/* hv_intercept_type_exception */
+	__u16 exception_vector;
+
+	/* N.B. Other intercept types do not have any parameters. */
+};
+
+/* Access types for the install intercept hypercall parameter */
+#define HV_INTERCEPT_ACCESS_MASK_NONE		0x00
+#define HV_INTERCEPT_ACCESS_MASK_READ		0X01
+#define HV_INTERCEPT_ACCESS_MASK_WRITE		0x02
+#define HV_INTERCEPT_ACCESS_MASK_EXECUTE	0x04
+
+
+
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index e0185c3872a9..93571bbab3a6 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -151,6 +151,7 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_WITHDRAW_MEMORY			0x0049
 #define HVCALL_MAP_GPA_PAGES			0x004b
 #define HVCALL_UNMAP_GPA_PAGES			0x004c
+#define HVCALL_INSTALL_INTERCEPT		0x004d
 #define HVCALL_CREATE_VP			0x004e
 #define HVCALL_GET_VP_REGISTERS			0x0050
 #define HVCALL_SET_VP_REGISTERS			0x0051
@@ -777,4 +778,11 @@ struct hv_unmap_gpa_pages {
 	u32 unmap_flags;
 };
 
+struct hv_install_intercept {
+	u64 partition_id;
+	u32 access_type; /* mask */
+	enum hv_intercept_type intercept_type;
+	union hv_intercept_parameters intercept_parameter;
+};
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 5be9e2d23893..e784b2d1a3fd 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -41,6 +41,12 @@ struct mshv_vp_registers {
 	union hv_register_value *values;
 };
 
+struct mshv_install_intercept {
+	__u32 access_type_mask;
+	enum hv_intercept_type intercept_type;
+	union hv_intercept_parameters intercept_parameter;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -51,6 +57,7 @@ struct mshv_vp_registers {
 #define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
 #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
 #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
+#define MSHV_INSTALL_INTERCEPT	_IOW(MSHV_IOCTL, 0x08, struct mshv_install_intercept)
 
 /* vp device */
 #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 7ddb66d260ce..8392d5a45e04 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -1148,7 +1148,50 @@ mshv_partition_ioctl_unmap_memory(struct mshv_partition *partition,
 }
 
 static long
-mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+mshv_partition_ioctl_install_intercept(struct mshv_partition *partition,
+				       void __user *user_args)
+{
+	struct mshv_install_intercept args;
+	struct hv_install_intercept *input;
+	unsigned long flags;
+	int status;
+	int ret;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	do {
+		local_irq_save(flags);
+		input = (struct hv_install_intercept *)(*this_cpu_ptr(
+					hyperv_pcpu_input_arg));
+		input->partition_id = partition->id;
+		input->access_type = args.access_type_mask;
+		input->intercept_type = args.intercept_type;
+		input->intercept_parameter = args.intercept_parameter;
+		status = hv_do_hypercall(
+				HVCALL_INSTALL_INTERCEPT, input, NULL) &
+					HV_HYPERCALL_RESULT_MASK;
+
+		local_irq_restore(flags);
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status != HV_STATUS_SUCCESS) {
+				pr_err("%s: %s\n", __func__,
+				       hv_status_to_string(status));
+			}
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition->id, 1);
+
+	} while (!ret);
+
+	return ret;
+}
+
+static long
+mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
+		     unsigned long arg)
 {
 	struct mshv_partition *partition = filp->private_data;
 	long ret;
@@ -1169,6 +1212,10 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		ret = mshv_partition_ioctl_create_vp(partition,
 							(void __user *)arg);
 		break;
+	case MSHV_INSTALL_INTERCEPT:
+		ret = mshv_partition_ioctl_install_intercept(partition,
+							(void __user *)arg);
+		break;
 	default:
 		ret = -ENOTTY;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 14/18] virt/mshv: assert interrupt ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (12 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 13/18] virt/mshv: install intercept ioctl Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls Nuno Das Neves
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctl for asserting an interrupt on a given APIC within a
guest partition.

Co-developed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         | 11 ++++++++
 arch/x86/include/asm/hyperv-tlfs.h      | 14 ----------
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 22 +++++++++++++++
 include/asm-generic/hyperv-tlfs.h       | 11 ++++++++
 include/uapi/linux/mshv.h               |  7 +++++
 virt/mshv/mshv_main.c                   | 36 +++++++++++++++++++++++++
 6 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 95ec77dc73f0..694f978131f9 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -130,3 +130,14 @@ Enable and configure different types of intercepts. Intercepts are events in a
 guest partition that will suspend the guest vp and send a message to the root
 partition (returned from MSHV_RUN_VP).
 
+3.8 MSHV_ASSERT_INTERRUPT
+--------------------------
+:Type: partition ioctl
+:Parameters: struct mshv_assert_interrupt
+:Returns: 0 on success
+
+Assert interrupts in partitions that use Microsoft Hypervisor's internal
+emulated LAPIC. This must be enabled on partition creation with the flag:
+HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
+
+
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index c34a6bb4f457..0de3c2e30a21 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -498,20 +498,6 @@ struct hv_partition_assist_pg {
 	u32 tlb_lock_count;
 };
 
-enum hv_interrupt_type {
-	HV_X64_INTERRUPT_TYPE_FIXED             = 0x0000,
-	HV_X64_INTERRUPT_TYPE_LOWESTPRIORITY    = 0x0001,
-	HV_X64_INTERRUPT_TYPE_SMI               = 0x0002,
-	HV_X64_INTERRUPT_TYPE_REMOTEREAD        = 0x0003,
-	HV_X64_INTERRUPT_TYPE_NMI               = 0x0004,
-	HV_X64_INTERRUPT_TYPE_INIT              = 0x0005,
-	HV_X64_INTERRUPT_TYPE_SIPI              = 0x0006,
-	HV_X64_INTERRUPT_TYPE_EXTINT            = 0x0007,
-	HV_X64_INTERRUPT_TYPE_LOCALINT0         = 0x0008,
-	HV_X64_INTERRUPT_TYPE_LOCALINT1         = 0x0009,
-	HV_X64_INTERRUPT_TYPE_MAXIMUM           = 0x000A,
-};
-
 #include <asm-generic/hyperv-tlfs.h>
 
 #endif
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index 28917301b6df..5478d4943bfc 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -1027,6 +1027,28 @@ union hv_intercept_parameters {
 #define HV_INTERCEPT_ACCESS_MASK_WRITE		0x02
 #define HV_INTERCEPT_ACCESS_MASK_EXECUTE	0x04
 
+enum hv_interrupt_type {
+	HV_X64_INTERRUPT_TYPE_FIXED             = 0x0000,
+	HV_X64_INTERRUPT_TYPE_LOWESTPRIORITY    = 0x0001,
+	HV_X64_INTERRUPT_TYPE_SMI               = 0x0002,
+	HV_X64_INTERRUPT_TYPE_REMOTEREAD        = 0x0003,
+	HV_X64_INTERRUPT_TYPE_NMI               = 0x0004,
+	HV_X64_INTERRUPT_TYPE_INIT              = 0x0005,
+	HV_X64_INTERRUPT_TYPE_SIPI              = 0x0006,
+	HV_X64_INTERRUPT_TYPE_EXTINT            = 0x0007,
+	HV_X64_INTERRUPT_TYPE_LOCALINT0         = 0x0008,
+	HV_X64_INTERRUPT_TYPE_LOCALINT1         = 0x0009,
+	HV_X64_INTERRUPT_TYPE_MAXIMUM           = 0x000A
+};
 
+union hv_interrupt_control {
+	struct {
+		enum hv_interrupt_type interrupt_type;
+		__u32 level_triggered : 1;
+		__u32 logical_dest_mode : 1;
+		__u32 rsvd : 30;
+	};
+	__u64 as_uint64;
+};
 
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 93571bbab3a6..2cd46241c545 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -164,6 +164,7 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_MAP_DEVICE_INTERRUPT		0x007c
 #define HVCALL_UNMAP_DEVICE_INTERRUPT		0x007d
 #define HVCALL_RETARGET_INTERRUPT		0x007e
+#define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
 
@@ -785,4 +786,14 @@ struct hv_install_intercept {
 	union hv_intercept_parameters intercept_parameter;
 };
 
+struct hv_assert_virtual_interrupt {
+	u64 partition_id;
+	union hv_interrupt_control control;
+	u64 dest_addr; /* cpu's apic id */
+	u32 vector;
+	u8 target_vtl;
+	u8 rsvd_z0;
+	u16 rsvd_z1;
+};
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index e784b2d1a3fd..faed9d065bb7 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -47,6 +47,12 @@ struct mshv_install_intercept {
 	union hv_intercept_parameters intercept_parameter;
 };
 
+struct mshv_assert_interrupt {
+	union hv_interrupt_control control;
+	__u64 dest_addr;
+	__u32 vector;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -58,6 +64,7 @@ struct mshv_install_intercept {
 #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
 #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
 #define MSHV_INSTALL_INTERCEPT	_IOW(MSHV_IOCTL, 0x08, struct mshv_install_intercept)
+#define MSHV_ASSERT_INTERRUPT	_IOW(MSHV_IOCTL, 0x09, struct mshv_assert_interrupt)
 
 /* vp device */
 #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 8392d5a45e04..9cf236ade50a 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -1189,6 +1189,38 @@ mshv_partition_ioctl_install_intercept(struct mshv_partition *partition,
 	return ret;
 }
 
+static long
+mshv_partition_ioctl_assert_interrupt(struct mshv_partition *partition,
+				      void __user *user_args)
+{
+	struct mshv_assert_interrupt args;
+	int status;
+	unsigned long flags;
+	struct hv_assert_virtual_interrupt *input;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	local_irq_save(flags);
+	input = (struct hv_assert_virtual_interrupt *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition->id;
+	input->control = args.control;
+	input->dest_addr = args.dest_addr;
+	input->vector = args.vector;
+	status = hv_do_hypercall(HVCALL_ASSERT_VIRTUAL_INTERRUPT, input,
+			NULL) & HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS) {
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+		return -hv_status_to_errno(status);
+	}
+
+	return 0;
+}
+
 static long
 mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
 		     unsigned long arg)
@@ -1216,6 +1248,10 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
 		ret = mshv_partition_ioctl_install_intercept(partition,
 							(void __user *)arg);
 		break;
+	case MSHV_ASSERT_INTERRUPT:
+		ret = mshv_partition_ioctl_assert_interrupt(partition,
+							(void __user *)arg);
+		break;
 	default:
 		ret = -ENOTTY;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (13 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 14/18] virt/mshv: assert interrupt ioctl Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:48   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 16/18] virt/mshv: mmap vp register page Nuno Das Neves
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctls for getting and setting guest vcpu emulated LAPIC
state, and xsave data.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         |   8 +
 arch/x86/include/uapi/asm/hyperv-tlfs.h |  59 ++++++
 include/asm-generic/hyperv-tlfs.h       |  41 ++++
 include/uapi/asm-generic/hyperv-tlfs.h  |  28 +++
 include/uapi/linux/mshv.h               |  13 ++
 virt/mshv/mshv_main.c                   | 262 ++++++++++++++++++++++++
 6 files changed, 411 insertions(+)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 694f978131f9..7fd75f248eff 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -140,4 +140,12 @@ Assert interrupts in partitions that use Microsoft Hypervisor's internal
 emulated LAPIC. This must be enabled on partition creation with the flag:
 HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
 
+3.9 MSHV_GET_VP_STATE and MSHV_SET_VP_STATE
+--------------------------
+:Type: vp ioctl
+:Parameters: struct mshv_vp_state
+:Returns: 0 on success
+
+Get/set various vp state. Currently these can be used to get and set
+emulated LAPIC state, and xsave data.
 
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index 5478d4943bfc..78758aedf23e 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -1051,4 +1051,63 @@ union hv_interrupt_control {
 	__u64 as_uint64;
 };
 
+struct hv_local_interrupt_controller_state {
+	__u32 apic_id;
+	__u32 apic_version;
+	__u32 apic_ldr;
+	__u32 apic_dfr;
+	__u32 apic_spurious;
+	__u32 apic_isr[8];
+	__u32 apic_tmr[8];
+	__u32 apic_irr[8];
+	__u32 apic_esr;
+	__u32 apic_icr_high;
+	__u32 apic_icr_low;
+	__u32 apic_lvt_timer;
+	__u32 apic_lvt_thermal;
+	__u32 apic_lvt_perfmon;
+	__u32 apic_lvt_lint0;
+	__u32 apic_lvt_lint1;
+	__u32 apic_lvt_error;
+	__u32 apic_lvt_cmci;
+	__u32 apic_error_status;
+	__u32 apic_initial_count;
+	__u32 apic_counter_value;
+	__u32 apic_divide_configuration;
+	__u32 apic_remote_read;
+};
+
+#define HV_XSAVE_DATA_NO_XMM_REGISTERS 1
+
+union hv_x64_xsave_xfem_register {
+	__u64 as_uint64;
+	struct {
+		__u32 low_uint32;
+		__u32 high_uint32;
+	};
+	struct {
+		__u64 legacy_x87: 1;
+		__u64 legacy_sse: 1;
+		__u64 avx: 1;
+		__u64 mpx_bndreg: 1;
+		__u64 mpx_bndcsr: 1;
+		__u64 avx_512_op_mask: 1;
+		__u64 avx_512_zmmhi: 1;
+		__u64 avx_512_zmm16_31: 1;
+		__u64 rsvd8_9: 2;
+		__u64 pasid: 1;
+		__u64 cet_u: 1;
+		__u64 cet_s: 1;
+		__u64 rsvd13_16: 4;
+		__u64 xtile_cfg: 1;
+		__u64 xtile_data: 1;
+		__u64 rsvd19_63: 45;
+	};
+};
+
+struct hv_vp_state_data_xsave {
+	__u64 flags;
+	union hv_x64_xsave_xfem_register states;
+};
+
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 2cd46241c545..4bc59a0344ce 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -167,6 +167,9 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
+#define HVCALL_MAP_VP_STATE_PAGE			0x00e1
+#define HVCALL_GET_VP_STATE				0x00e3
+#define HVCALL_SET_VP_STATE				0x00e4
 
 #define HV_FLUSH_ALL_PROCESSORS			BIT(0)
 #define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES	BIT(1)
@@ -796,4 +799,42 @@ struct hv_assert_virtual_interrupt {
 	u16 rsvd_z1;
 };
 
+struct hv_vp_state_data {
+	enum hv_get_set_vp_state_type type;
+	u32 rsvd;
+	struct hv_vp_state_data_xsave xsave;
+
+};
+
+struct hv_get_vp_state_in {
+	u64 partition_id;
+	u32 vp_index;
+	u8 input_vtl;
+	u8 rsvd0;
+	u16 rsvd1;
+	struct hv_vp_state_data state_data;
+	u64 output_data_pfns[];
+};
+
+union hv_get_vp_state_out {
+	struct hv_local_interrupt_controller_state interrupt_controller_state;
+	/* Not supported yet */
+	/* struct hv_synthetic_timers_state synthetic_timers_state; */
+};
+
+union hv_input_set_vp_state_data {
+	u64 pfns;
+	u8 bytes;
+};
+
+struct hv_set_vp_state_in {
+	u64 partition_id;
+	u32 vp_index;
+	u8 input_vtl;
+	u8 rsvd0;
+	u16 rsvd1;
+	struct hv_vp_state_data state_data;
+	union hv_input_set_vp_state_data data[];
+};
+
 #endif
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index e87389054b68..b3c84c69b73f 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -64,4 +64,32 @@ struct hv_message {
 #define HV_MAP_GPA_EXECUTABLE           0xC
 #define HV_MAP_GPA_PERMISSIONS_MASK     0xF
 
+/*
+ * For getting and setting VP state, there are two options based on the state type:
+ *
+ *     1.) Data that is accessed by PFNs in the input hypercall page. This is used
+ *         for state which may not fit into the hypercall pages.
+ *     2.) Data that is accessed directly in the input\output hypercall pages.
+ *         This is used for state that will always fit into the hypercall pages.
+ *
+ * In the future this could be dynamic based on the size if needed.
+ *
+ * Note these hypercalls have an 8-byte aligned variable header size as per the tlfs
+ */
+
+#define HV_GET_SET_VP_STATE_TYPE_PFN	BIT(31)
+
+enum hv_get_set_vp_state_type {
+	HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE = 0,
+
+	HV_GET_SET_VP_STATE_XSAVE		= 1 | HV_GET_SET_VP_STATE_TYPE_PFN,
+	/* Synthetic message page */
+	HV_GET_SET_VP_STATE_SIM_PAGE		= 2 | HV_GET_SET_VP_STATE_TYPE_PFN,
+	/* Synthetic interrupt event flags page. */
+	HV_GET_SET_VP_STATE_SIEF_PAGE		= 3 | HV_GET_SET_VP_STATE_TYPE_PFN,
+
+	/* Synthetic timers. */
+	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
+};
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index faed9d065bb7..ae0bb64bbec3 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -53,6 +53,17 @@ struct mshv_assert_interrupt {
 	__u32 vector;
 };
 
+struct mshv_vp_state {
+	enum hv_get_set_vp_state_type type;
+	struct hv_vp_state_data_xsave xsave; /* only for xsave request */
+
+	__u64 buf_size; /* If xsave, must be page-aligned */
+	union {
+		struct hv_local_interrupt_controller_state *lapic;
+		__u8 *bytes; /* Xsave data. must be page-aligned */
+	} buf;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -70,5 +81,7 @@ struct mshv_assert_interrupt {
 #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
 #define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
 #define MSHV_RUN_VP		_IOR(MSHV_IOCTL, 0x07, struct hv_message)
+#define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
+#define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
 
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 9cf236ade50a..70172d9488de 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -864,6 +864,262 @@ mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
 	return ret;
 }
 
+static int
+hv_call_get_vp_state(u32 vp_index,
+		     u64 partition_id,
+		     enum hv_get_set_vp_state_type type,
+		     struct hv_vp_state_data_xsave xsave,
+		    /* Choose between pages and ret_output */
+		     u64 page_count,
+		     struct page **pages,
+		     union hv_get_vp_state_out *ret_output)
+{
+	struct hv_get_vp_state_in *input;
+	union hv_get_vp_state_out *output;
+	int status;
+	int i;
+	u64 control;
+	unsigned long flags;
+	int ret = 0;
+
+	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
+		return -EINVAL;
+
+	if (!page_count && !ret_output)
+		return -EINVAL;
+
+	do {
+		local_irq_save(flags);
+		input = (struct hv_get_vp_state_in *)
+				(*this_cpu_ptr(hyperv_pcpu_input_arg));
+		output = (union hv_get_vp_state_out *)
+				(*this_cpu_ptr(hyperv_pcpu_output_arg));
+		memset(input, 0, sizeof(*input));
+		memset(output, 0, sizeof(*output));
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->state_data.type = type;
+		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
+		for (i = 0; i < page_count; i++)
+			input->output_data_pfns[i] =
+				page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
+
+		control = (HVCALL_GET_VP_STATE) |
+			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
+
+		status = hv_do_hypercall(control, input, output) &
+			 HV_HYPERCALL_RESULT_MASK;
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status != HV_STATUS_SUCCESS)
+				pr_err("%s: %s\n", __func__,
+				       hv_status_to_string(status));
+			else if (ret_output)
+				memcpy(ret_output, output, sizeof(*output));
+
+			local_irq_restore(flags);
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+static int
+hv_call_set_vp_state(u32 vp_index,
+		     u64 partition_id,
+		     enum hv_get_set_vp_state_type type,
+		     struct hv_vp_state_data_xsave xsave,
+		    /* Choose between pages and bytes */
+		     u64 page_count,
+		     struct page **pages,
+		     u32 num_bytes,
+		     u8 *bytes)
+{
+	struct hv_set_vp_state_in *input;
+	int status;
+	int i;
+	u64 control;
+	unsigned long flags;
+	int ret = 0;
+	u16 varhead_sz;
+
+	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
+		return -EINVAL;
+	if (sizeof(*input) + num_bytes > PAGE_SIZE)
+		return -EINVAL;
+
+	if (num_bytes)
+		/* round up to 8 and divide by 8 */
+		varhead_sz = (num_bytes + 7) >> 3;
+	else if (page_count)
+		varhead_sz =  page_count;
+	else
+		return -EINVAL;
+
+	do {
+		local_irq_save(flags);
+		input = (struct hv_set_vp_state_in *)
+				(*this_cpu_ptr(hyperv_pcpu_input_arg));
+		memset(input, 0, sizeof(*input));
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->state_data.type = type;
+		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
+		if (num_bytes) {
+			memcpy((u8 *)input->data, bytes, num_bytes);
+		} else {
+			for (i = 0; i < page_count; i++)
+				input->data[i].pfns =
+					page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
+		}
+
+		control = (HVCALL_SET_VP_STATE) |
+			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
+
+		status = hv_do_hypercall(control, input, NULL) &
+			 HV_HYPERCALL_RESULT_MASK;
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status != HV_STATUS_SUCCESS)
+				pr_err("%s: %s\n", __func__,
+				       hv_status_to_string(status));
+
+			local_irq_restore(flags);
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE,
+					    partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
+static long
+mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
+				struct mshv_vp_state *args,
+				bool is_set)
+{
+	u64 page_count, remaining;
+	int completed;
+	struct page **pages;
+	long ret;
+	unsigned long u_buf;
+
+	/* Buffer must be page aligned */
+	if (args->buf_size & (PAGE_SIZE - 1) ||
+	    (u64)args->buf.bytes & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	if (!access_ok(args->buf.bytes, args->buf_size))
+		return -EFAULT;
+
+	/* Pin user pages so hypervisor can copy directly to them */
+	page_count = args->buf_size >> PAGE_SHIFT;
+	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	remaining = page_count;
+	u_buf = (unsigned long)args->buf.bytes;
+	while (remaining) {
+		completed = pin_user_pages_fast(
+				u_buf,
+				remaining,
+				FOLL_WRITE,
+				&pages[page_count - remaining]);
+		if (completed < 0) {
+			pr_err("%s: failed to pin user pages error %i\n",
+			       __func__, completed);
+			ret = completed;
+			goto unpin_pages;
+		}
+		remaining -= completed;
+		u_buf += completed * PAGE_SIZE;
+	}
+
+	if (is_set)
+		ret = hv_call_set_vp_state(vp->index,
+					   vp->partition->id,
+					   args->type, args->xsave,
+					   page_count, pages,
+					   0, NULL);
+	else
+		ret = hv_call_get_vp_state(vp->index,
+					   vp->partition->id,
+					   args->type, args->xsave,
+					   page_count, pages,
+					   NULL);
+
+unpin_pages:
+	unpin_user_pages(pages, page_count - remaining);
+	kfree(pages);
+	return ret;
+}
+
+static long
+mshv_vp_ioctl_get_set_state(struct mshv_vp *vp, void __user *user_args, bool is_set)
+{
+	struct mshv_vp_state args;
+	long ret = 0;
+	union hv_get_vp_state_out vp_state;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	/* For now just support these */
+	if (args.type != HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE &&
+	    args.type != HV_GET_SET_VP_STATE_XSAVE)
+		return -EINVAL;
+
+	/* If we need to pin pfns, delegate to helper */
+	if (args.type & HV_GET_SET_VP_STATE_TYPE_PFN)
+		return mshv_vp_ioctl_get_set_state_pfn(vp, &args, is_set);
+
+	if (args.buf_size < sizeof(vp_state))
+		return -EINVAL;
+
+	if (is_set) {
+		if (copy_from_user(
+				&vp_state,
+				args.buf.lapic,
+				sizeof(vp_state)))
+			return -EFAULT;
+
+		return hv_call_set_vp_state(vp->index,
+					    vp->partition->id,
+					    args.type, args.xsave,
+					    0, NULL,
+					    sizeof(vp_state),
+					    (u8 *)&vp_state);
+	}
+
+	ret = hv_call_get_vp_state(vp->index,
+				   vp->partition->id,
+				   args.type, args.xsave,
+				   0, NULL,
+				   &vp_state);
+
+	if (ret)
+		return ret;
+
+	if (copy_to_user(args.buf.lapic,
+			 &vp_state.interrupt_controller_state,
+			 sizeof(vp_state.interrupt_controller_state)))
+		return -EFAULT;
+
+	return 0;
+}
 
 static long
 mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
@@ -884,6 +1140,12 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case MSHV_SET_VP_REGISTERS:
 		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
 		break;
+	case MSHV_GET_VP_STATE:
+		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
+		break;
+	case MSHV_SET_VP_STATE:
+		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
+		break;
 	default:
 		r = -ENOTTY;
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 16/18] virt/mshv: mmap vp register page
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (14 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2021-02-08 19:49   ` Michael Kelley
  2020-11-21  0:30 ` [RFC PATCH 17/18] virt/mshv: get and set partition property ioctls Nuno Das Neves
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce mmap interface for a virtual processor, exposing a page for
setting and getting common registers while the VP is suspended.

This provides a more performant and convenient way to get and set these
registers in the context of a vmm's run-loop.

Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         | 11 ++++
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 74 ++++++++++++++++++++++
 include/asm-generic/hyperv-tlfs.h       | 10 +++
 include/linux/mshv.h                    |  1 +
 include/uapi/asm-generic/hyperv-tlfs.h  |  5 ++
 include/uapi/linux/mshv.h               | 12 ++++
 virt/mshv/mshv_main.c                   | 82 +++++++++++++++++++++++++
 7 files changed, 195 insertions(+)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 7fd75f248eff..89c276a8778f 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -149,3 +149,14 @@ HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
 Get/set various vp state. Currently these can be used to get and set
 emulated LAPIC state, and xsave data.
 
+3.10 mmap(vp)
+-------------
+:Type: vp mmap
+:Parameters: offset should be HV_VP_MMAP_REGISTERS_OFFSET
+:Returns: 0 on success
+
+Maps a page into userspace that can be used to get and set common registers
+while the vp is suspended.
+The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
+
+
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index 78758aedf23e..a241178567ff 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -1110,4 +1110,78 @@ struct hv_vp_state_data_xsave {
 	union hv_x64_xsave_xfem_register states;
 };
 
+/* Bits for dirty mask of hv_vp_register_page */
+#define HV_X64_REGISTER_CLASS_GENERAL	0
+#define HV_X64_REGISTER_CLASS_IP	1
+#define HV_X64_REGISTER_CLASS_XMM	2
+#define HV_X64_REGISTER_CLASS_SEGMENT	3
+#define HV_X64_REGISTER_CLASS_FLAGS	4
+
+#define HV_VP_REGISTER_PAGE_VERSION_1	1u
+
+struct hv_vp_register_page {
+	__u16 version;
+	bool isvalid;
+	__u8 rsvdz;
+	__u32 dirty;
+	union {
+		struct {
+			__u64 rax;
+			__u64 rcx;
+			__u64 rdx;
+			__u64 rbx;
+			__u64 rsp;
+			__u64 rbp;
+			__u64 rsi;
+			__u64 rdi;
+			__u64 r8;
+			__u64 r9;
+			__u64 r10;
+			__u64 r11;
+			__u64 r12;
+			__u64 r13;
+			__u64 r14;
+			__u64 r15;
+		};
+
+		__u64 gp_registers[16];
+	};
+	__u64 rip;
+	__u64 rflags;
+	union {
+		struct {
+			struct hv_u128 xmm0;
+			struct hv_u128 xmm1;
+			struct hv_u128 xmm2;
+			struct hv_u128 xmm3;
+			struct hv_u128 xmm4;
+			struct hv_u128 xmm5;
+		};
+
+		struct hv_u128 xmm_registers[6];
+	};
+	union {
+		struct {
+			struct hv_x64_segment_register es;
+			struct hv_x64_segment_register cs;
+			struct hv_x64_segment_register ss;
+			struct hv_x64_segment_register ds;
+			struct hv_x64_segment_register fs;
+			struct hv_x64_segment_register gs;
+		};
+
+		struct hv_x64_segment_register segment_registers[6];
+	};
+	/* read only */
+	__u64 cr0;
+	__u64 cr3;
+	__u64 cr4;
+	__u64 cr8;
+	__u64 efer;
+	__u64 dr7;
+	union hv_x64_pending_interruption_register pending_interruption;
+	union hv_x64_interrupt_state_register interrupt_state;
+	__u64 instruction_emulation_hints;
+};
+
 #endif
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 4bc59a0344ce..9eed4b869110 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -837,4 +837,14 @@ struct hv_set_vp_state_in {
 	union hv_input_set_vp_state_data data[];
 };
 
+struct hv_map_vp_state_page_in {
+	u64 partition_id;
+	u32 vp_index;
+	enum hv_vp_state_page_type type;
+};
+
+struct hv_map_vp_state_page_out {
+	u64 map_location; /* page number */
+};
+
 #endif
diff --git a/include/linux/mshv.h b/include/linux/mshv.h
index 3933d80294f1..33f4d0cfee11 100644
--- a/include/linux/mshv.h
+++ b/include/linux/mshv.h
@@ -20,6 +20,7 @@ struct mshv_vp {
 	u32 index;
 	struct mshv_partition *partition;
 	struct mutex mutex;
+	struct page *register_page;
 	struct {
 		struct semaphore sem;
 		struct task_struct *task;
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index b3c84c69b73f..a747f39b132a 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -92,4 +92,9 @@ enum hv_get_set_vp_state_type {
 	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
 };
 
+enum hv_vp_state_page_type {
+	HV_VP_STATE_PAGE_REGISTERS = 0,
+	HV_VP_STATE_PAGE_COUNT
+};
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index ae0bb64bbec3..8537ff29aee5 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -13,6 +13,8 @@
 
 #define MSHV_VERSION	0x0
 
+#define MSHV_VP_MMAP_REGISTERS_OFFSET (HV_VP_STATE_PAGE_REGISTERS * 0x1000)
+
 struct mshv_create_partition {
 	__u64 flags;
 	struct hv_partition_creation_properties partition_creation_properties;
@@ -84,4 +86,14 @@ struct mshv_vp_state {
 #define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
 #define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
 
+/* register page mapping example:
+ * struct hv_vp_register_page *regs = mmap(NULL,
+ *					   4096,
+ *					   PROT_READ | PROT_WRITE,
+ *					   MAP_SHARED,
+ *					   vp_fd,
+ *					   HV_VP_MMAP_REGISTERS_OFFSET);
+ * munmap(regs, 4096);
+ */
+
 #endif
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index 70172d9488de..a597254fa4f4 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -43,11 +43,18 @@ static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned
 static int mshv_dev_open(struct inode *inode, struct file *filp);
 static int mshv_dev_release(struct inode *inode, struct file *filp);
 static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
+static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
+static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
+
+static const struct vm_operations_struct mshv_vp_vm_ops = {
+	.fault = mshv_vp_fault,
+};
 
 static const struct file_operations mshv_vp_fops = {
 	.release = mshv_vp_release,
 	.unlocked_ioctl = mshv_vp_ioctl,
 	.llseek = noop_llseek,
+	.mmap = mshv_vp_mmap,
 };
 
 static const struct file_operations mshv_partition_fops = {
@@ -499,6 +506,47 @@ hv_call_set_vp_registers(u32 vp_index,
 	return -hv_status_to_errno(status);
 }
 
+static int
+hv_call_map_vp_state_page(u32 vp_index, u64 partition_id,
+			  struct page **state_page)
+{
+	struct hv_map_vp_state_page_in *input;
+	struct hv_map_vp_state_page_out *output;
+	int status;
+	int ret;
+	unsigned long flags;
+
+	do {
+		local_irq_save(flags);
+		input = (struct hv_map_vp_state_page_in *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+		output = (struct hv_map_vp_state_page_out *)(*this_cpu_ptr(
+			hyperv_pcpu_output_arg));
+
+		input->partition_id = partition_id;
+		input->vp_index = vp_index;
+		input->type = HV_VP_STATE_PAGE_REGISTERS;
+		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE,
+						   input, output);
+
+		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
+			if (status == HV_STATUS_SUCCESS)
+				*state_page = pfn_to_page(output->map_location);
+			else
+				pr_err("%s: %s\n", __func__,
+				       hv_status_to_string(status));
+			local_irq_restore(flags);
+			ret = -hv_status_to_errno(status);
+			break;
+		}
+		local_irq_restore(flags);
+
+		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+	} while (!ret);
+
+	return ret;
+}
+
 static void
 mshv_isr(void)
 {
@@ -1155,6 +1203,40 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	return r;
 }
 
+static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
+{
+	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
+
+	vmf->page = vp->register_page;
+
+	return 0;
+}
+
+static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct mshv_vp *vp = file->private_data;
+
+	if (vma->vm_pgoff != MSHV_VP_MMAP_REGISTERS_OFFSET)
+		return -EINVAL;
+
+	if (mutex_lock_killable(&vp->mutex))
+		return -EINTR;
+
+	if (!vp->register_page) {
+		ret = hv_call_map_vp_state_page(vp->index,
+						vp->partition->id,
+						&vp->register_page);
+		if (ret)
+			return ret;
+	}
+
+	mutex_unlock(&vp->mutex);
+
+	vma->vm_ops = &mshv_vp_vm_ops;
+	return 0;
+}
+
 static int
 mshv_vp_release(struct inode *inode, struct file *filp)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 17/18] virt/mshv: get and set partition property ioctls
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (15 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 16/18] virt/mshv: mmap vp register page Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-21  0:30 ` [RFC PATCH 18/18] virt/mshv: Add enlightenment bits to create partition ioctl Nuno Das Neves
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce ioctls for getting and setting properties of guest partitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst        |  8 +++
 include/asm-generic/hyperv-tlfs.h      | 17 ++++++
 include/uapi/asm-generic/hyperv-tlfs.h | 59 ++++++++++++++++++++
 include/uapi/linux/mshv.h              |  9 +++
 virt/mshv/mshv_main.c                  | 76 ++++++++++++++++++++++++++
 5 files changed, 169 insertions(+)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 89c276a8778f..609400313b7e 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -159,4 +159,12 @@ Maps a page into userspace that can be used to get and set common registers
 while the vp is suspended.
 The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
 
+3.11 MSHV_SET_PARTITION_PROPERTY and MSHV_GET_PARTITION_PROPERTY
+----------------------------------------------------------------
+:Type: partition ioctl
+:Parameters: struct mshv_partition_property
+:Returns: 0 on success
+
+Can be used to get/set various properties of a partition.
+
 
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 9eed4b869110..f3998027f6a3 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -146,6 +146,8 @@ struct ms_hyperv_tsc_page {
 #define HVCALL_INITIALIZE_PARTITION		0x0041
 #define HVCALL_FINALIZE_PARTITION		0x0042
 #define HVCALL_DELETE_PARTITION			0x0043
+#define HVCALL_GET_PARTITION_PROPERTY		0x0044
+#define HVCALL_SET_PARTITION_PROPERTY		0x0045
 #define HVCALL_GET_PARTITION_ID			0x0046
 #define HVCALL_DEPOSIT_MEMORY			0x0048
 #define HVCALL_WITHDRAW_MEMORY			0x0049
@@ -847,4 +849,19 @@ struct hv_map_vp_state_page_out {
 	u64 map_location; /* page number */
 };
 
+struct hv_get_partition_property_in {
+	u64 partition_id;
+	enum hv_partition_property_code property_code;
+};
+
+struct hv_get_partition_property_out {
+	u64 property_value;
+};
+
+struct hv_set_partition_property {
+	u64 partition_id;
+	enum hv_partition_property_code property_code;
+	u64 property_value;
+};
+
 #endif
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index a747f39b132a..d1c341de34fe 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -97,4 +97,63 @@ enum hv_vp_state_page_type {
 	HV_VP_STATE_PAGE_COUNT
 };
 
+enum hv_partition_property_code {
+	/* Privilege properties */
+	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS				= 0x00010000,
+
+	/* Scheduling properties */
+	HV_PARTITION_PROPERTY_SUSPEND					= 0x00020000,
+	HV_PARTITION_PROPERTY_CPU_RESERVE				= 0x00020001,
+	HV_PARTITION_PROPERTY_CPU_CAP					= 0x00020002,
+	HV_PARTITION_PROPERTY_CPU_WEIGHT				= 0x00020003,
+	HV_PARTITION_PROPERTY_CPU_GROUP_ID				= 0x00020004,
+
+	/* Time properties */
+	HV_PARTITION_PROPERTY_TIME_FREEZE				= 0x00030003,
+
+	/* Debugging properties */
+	HV_PARTITION_PROPERTY_DEBUG_CHANNEL_ID				= 0x00040000,
+
+	/* Resource properties */
+	HV_PARTITION_PROPERTY_VIRTUAL_TLB_PAGE_COUNT			= 0x00050000,
+	HV_PARTITION_PROPERTY_VSM_CONFIG				= 0x00050001,
+	HV_PARTITION_PROPERTY_ZERO_MEMORY_ON_RESET			= 0x00050002,
+	HV_PARTITION_PROPERTY_PROCESSORS_PER_SOCKET			= 0x00050003,
+	HV_PARTITION_PROPERTY_NESTED_TLB_SIZE				= 0x00050004,
+	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING			= 0x00050005,
+	HV_PARTITION_PROPERTY_VSM_PERMISSIONS_DIRTY_SINCE_LAST_QUERY	= 0x00050006,
+	HV_PARTITION_PROPERTY_SGX_LAUNCH_CONTROL_CONFIG			= 0x00050007,
+	HV_PARTITION_PROPERTY_DEFAULT_SGX_LAUNCH_CONTROL0		= 0x00050008,
+	HV_PARTITION_PROPERTY_DEFAULT_SGX_LAUNCH_CONTROL1		= 0x00050009,
+	HV_PARTITION_PROPERTY_DEFAULT_SGX_LAUNCH_CONTROL2		= 0x0005000a,
+	HV_PARTITION_PROPERTY_DEFAULT_SGX_LAUNCH_CONTROL3		= 0x0005000b,
+	HV_PARTITION_PROPERTY_ISOLATION_STATE				= 0x0005000c,
+	HV_PARTITION_PROPERTY_ISOLATION_CONTROL				= 0x0005000d,
+	HV_PARTITION_PROPERTY_RDT_L3_COS_INDEX				= 0x0005000e,
+	HV_PARTITION_PROPERTY_RDT_RMID					= 0x0005000f,
+	HV_PARTITION_PROPERTY_IMPLEMENTED_PHYSICAL_ADDRESS_BITS		= 0x00050010,
+	HV_PARTITION_PROPERTY_NON_ARCHITECTURAL_CORE_SHARING		= 0x00050011,
+	HV_PARTITION_PROPERTY_HYPERCALL_DOORBELL_PAGE			= 0x00050012,
+
+	/* Compatibility properties */
+	HV_PARTITION_PROPERTY_PROCESSOR_VENDOR				= 0x00060000,
+	HV_PARTITION_PROPERTY_PROCESSOR_FEATURES_DEPRECATED		= 0x00060001,
+	HV_PARTITION_PROPERTY_PROCESSOR_XSAVE_FEATURES			= 0x00060002,
+	HV_PARTITION_PROPERTY_PROCESSOR_CL_FLUSH_SIZE			= 0x00060003,
+	HV_PARTITION_PROPERTY_ENLIGHTENMENT_MODIFICATIONS		= 0x00060004,
+	HV_PARTITION_PROPERTY_COMPATIBILITY_VERSION			= 0x00060005,
+	HV_PARTITION_PROPERTY_PHYSICAL_ADDRESS_WIDTH			= 0x00060006,
+	HV_PARTITION_PROPERTY_XSAVE_STATES				= 0x00060007,
+	HV_PARTITION_PROPERTY_MAX_XSAVE_DATA_SIZE			= 0x00060008,
+	HV_PARTITION_PROPERTY_PROCESSOR_CLOCK_FREQUENCY			= 0x00060009,
+	HV_PARTITION_PROPERTY_PROCESSOR_FEATURES0			= 0x0006000a,
+	HV_PARTITION_PROPERTY_PROCESSOR_FEATURES1			= 0x0006000b,
+
+	/* Guest software properties */
+	HV_PARTITION_PROPERTY_GUEST_OS_ID				= 0x00070000,
+
+	/* Nested virtualization properties */
+	HV_PARTITION_PROPERTY_PROCESSOR_VIRTUALIZATION_FEATURES		= 0x00080000,
+};
+
 #endif
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 8537ff29aee5..721f5b1999d5 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -66,6 +66,11 @@ struct mshv_vp_state {
 	} buf;
 };
 
+struct mshv_partition_property {
+	enum hv_partition_property_code property_code;
+	__u64 property_value;
+};
+
 #define MSHV_IOCTL 0xB8
 
 /* mshv device */
@@ -78,6 +83,10 @@ struct mshv_vp_state {
 #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
 #define MSHV_INSTALL_INTERCEPT	_IOW(MSHV_IOCTL, 0x08, struct mshv_install_intercept)
 #define MSHV_ASSERT_INTERRUPT	_IOW(MSHV_IOCTL, 0x09, struct mshv_assert_interrupt)
+#define MSHV_SET_PARTITION_PROPERTY \
+				_IOW(MSHV_IOCTL, 0xC, struct mshv_partition_property)
+#define MSHV_GET_PARTITION_PROPERTY \
+				_IOWR(MSHV_IOCTL, 0xD, struct mshv_partition_property)
 
 /* vp device */
 #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index a597254fa4f4..bfbadeb4f1fe 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -1331,6 +1331,74 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	return ret;
 }
 
+static long
+mshv_partition_ioctl_get_property(struct mshv_partition *partition,
+				  void __user *user_args)
+{
+	struct mshv_partition_property args;
+	int status;
+	unsigned long flags;
+	struct hv_get_partition_property_in *input;
+	struct hv_get_partition_property_out *output;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	local_irq_save(flags);
+	input = (struct hv_get_partition_property_in *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+	output = (struct hv_get_partition_property_out *)(*this_cpu_ptr(
+			hyperv_pcpu_output_arg));
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition->id;
+	input->property_code = args.property_code;
+	status = hv_do_hypercall(HVCALL_GET_PARTITION_PROPERTY, input,
+			output) & HV_HYPERCALL_RESULT_MASK;
+	args.property_value = output->property_value;
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS) {
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+		return -hv_status_to_errno(status);
+	}
+
+	if (copy_to_user(user_args, &args, sizeof(args)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long
+mshv_partition_ioctl_set_property(struct mshv_partition *partition,
+				  void __user *user_args)
+{
+	struct mshv_partition_property args;
+	int status;
+	unsigned long flags;
+	struct hv_set_partition_property *input;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	local_irq_save(flags);
+	input = (struct hv_set_partition_property *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition->id;
+	input->property_code = args.property_code;
+	input->property_value = args.property_value;
+	status = hv_do_hypercall(HVCALL_SET_PARTITION_PROPERTY, input,
+			NULL) & HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS) {
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+		return -hv_status_to_errno(status);
+	}
+
+	return 0;
+}
+
 static long
 mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
 				struct mshv_user_mem_region __user *user_mem)
@@ -1596,6 +1664,14 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
 		ret = mshv_partition_ioctl_assert_interrupt(partition,
 							(void __user *)arg);
 		break;
+	case MSHV_GET_PARTITION_PROPERTY:
+		ret = mshv_partition_ioctl_get_property(partition,
+							(void __user *)arg);
+		break;
+	case MSHV_SET_PARTITION_PROPERTY:
+		ret = mshv_partition_ioctl_set_property(partition,
+							(void __user *)arg);
+		break;
 	default:
 		ret = -ENOTTY;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC PATCH 18/18] virt/mshv: Add enlightenment bits to create partition ioctl
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (16 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 17/18] virt/mshv: get and set partition property ioctls Nuno Das Neves
@ 2020-11-21  0:30 ` Nuno Das Neves
  2020-11-24 16:18 ` [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Wei Liu
  2021-02-08 19:40 ` Michael Kelley
  19 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2020-11-21  0:30 UTC (permalink / raw)
  To: linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Introduce hv_partition_synthetic_processor features mask to
MSHV_CREATE_PARTITION ioctl, which can be used to enable hypervisor
enlightenments for exo partitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 Documentation/virt/mshv/api.rst         |   3 +
 arch/x86/include/uapi/asm/hyperv-tlfs.h | 125 ++++++++++++++++++++++++
 include/uapi/asm-generic/hyperv-tlfs.h  |   1 +
 include/uapi/linux/mshv.h               |   1 +
 virt/mshv/mshv_main.c                   |  57 +++++++----
 5 files changed, 167 insertions(+), 20 deletions(-)

diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
index 609400313b7e..afd56fff1038 100644
--- a/Documentation/virt/mshv/api.rst
+++ b/Documentation/virt/mshv/api.rst
@@ -167,4 +167,7 @@ The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
 
 Can be used to get/set various properties of a partition.
 
+Some properties can only be set at partition creation. For these, there are
+parameters in MSHV_CREATE_PARTITION.
+
 
diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
index a241178567ff..65cd0d166d5b 100644
--- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
+++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
@@ -1184,4 +1184,129 @@ struct hv_vp_register_page {
 	__u64 instruction_emulation_hints;
 };
 
+#define HV_PARTITION_SYNTHETIC_PROCESSOR_FEATURES_BANKS 1
+
+union hv_partition_synthetic_processor_features {
+	__u64 as_uint64[HV_PARTITION_SYNTHETIC_PROCESSOR_FEATURES_BANKS];
+
+	struct {
+		/* Report a hypervisor is present. CPUID leaves
+		 * 0x40000000 and 0x40000001 are supported.
+		 */
+		__u64 hypervisor_present:1;
+
+		/*
+		 * Features associated with HV#1:
+		 */
+
+		/* Report support for Hv1 (CPUID leaves 0x40000000 - 0x40000006). */
+		__u64 hv1:1;
+
+		/* Access to HV_X64_MSR_VP_RUNTIME.
+		 * Corresponds to access_vp_run_time_reg privilege.
+		 */
+		__u64 access_vp_run_time_reg:1;
+
+		/* Access to HV_X64_MSR_TIME_REF_COUNT.
+		 * Corresponds to access_partition_reference_counter privilege.
+		 */
+		__u64 access_partition_reference_counter:1;
+
+		/* Access to SINT-related registers (HV_X64_MSR_SCONTROL through
+		 * HV_X64_MSR_EOM and HV_X64_MSR_SINT0 through HV_X64_MSR_SINT15).
+		 * Corresponds to access_synic_regs privilege.
+		 */
+		__u64 access_synic_regs:1;
+
+		/* Access to synthetic timers and associated MSRs
+		 * (HV_X64_MSR_STIMER0_CONFIG through HV_X64_MSR_STIMER3_COUNT).
+		 * Corresponds to access_synthetic_timer_regs privilege.
+		 */
+		__u64 access_synthetic_timer_regs:1;
+
+		/* Access to APIC MSRs (HV_X64_MSR_EOI, HV_X64_MSR_ICR and HV_X64_MSR_TPR)
+		 * as well as the VP assist page.
+		 * Corresponds to access_intr_ctrl_regs privilege.
+		 */
+		__u64 access_intr_ctrl_regs:1;
+
+		/* Access to registers associated with hypercalls (HV_X64_MSR_GUEST_OS_ID
+		 * and HV_X64_MSR_HYPERCALL).
+		 * Corresponds to access_hypercall_msrs privilege.
+		 */
+		__u64 access_hypercall_regs:1;
+
+		/* VP index can be queried. corresponds to access_vp_index privilege. */
+		__u64 access_vp_index:1;
+
+		/* Access to the reference TSC. Corresponds to access_partition_reference_tsc
+		 * privilege.
+		 */
+		__u64 access_partition_reference_tsc:1;
+
+		/* Partition has access to the guest idle reg. Corresponds to
+		 * access_guest_idle_reg privilege.
+		 */
+		__u64 access_guest_idle_reg:1;
+
+		/* Partition has access to frequency regs. corresponds to access_frequency_regs
+		 * privilege.
+		 */
+		__u64 access_frequency_regs:1;
+
+		__u64 reserved_z12:1; /* Reserved for access_reenlightenment_controls. */
+		__u64 reserved_z13:1; /* Reserved for access_root_scheduler_reg. */
+		__u64 reserved_z14:1; /* Reserved for access_tsc_invariant_controls. */
+
+		/* Extended GVA ranges for HvCallFlushVirtualAddressList hypercall.
+		 * Corresponds to privilege.
+		 */
+		__u64 enable_extended_gva_ranges_for_flush_virtual_address_list:1;
+
+		__u64 reserved_z16:1; /* Reserved for access_vsm. */
+		__u64 reserved_z17:1; /* Reserved for access_vp_registers. */
+
+		/* Use fast hypercall output. Corresponds to privilege. */
+		__u64 fast_hypercall_output:1;
+
+		__u64 reserved_z19:1; /* Reserved for enable_extended_hypercalls. */
+
+		/*
+		 * HvStartVirtualProcessor can be used to start virtual processors.
+		 * Corresponds to privilege.
+		 */
+		__u64 start_virtual_processor:1;
+
+		__u64 reserved_z21:1; /* Reserved for Isolation. */
+
+		/* Synthetic timers in direct mode. */
+		__u64 direct_synthetic_timers:1;
+
+		__u64 reserved_z23:1; /* Reserved for synthetic time unhalted timer */
+
+		/* Use extended processor masks. */
+		__u64 extended_processor_masks:1;
+
+		/* HvCallFlushVirtualAddressSpace / HvCallFlushVirtualAddressList are supported. */
+		__u64 tb_flush_hypercalls:1;
+
+		/* HvCallSendSyntheticClusterIpi is supported. */
+		__u64 synthetic_cluster_ipi:1;
+
+		/* HvCallNotifyLongSpinWait is supported. */
+		__u64 notify_long_spin_wait:1;
+
+		/* HvCallQueryNumaDistance is supported. */
+		__u64 query_numa_distance:1;
+
+		/* HvCallSignalEvent is supported. Corresponds to privilege. */
+		__u64 signal_events:1;
+
+		/* HvCallRetargetDeviceInterrupt is supported. */
+		__u64 retarget_device_interrupt:1;
+
+		__u64 reserved:33;
+	};
+};
+
 #endif
diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
index d1c341de34fe..5c6379a3cfd5 100644
--- a/include/uapi/asm-generic/hyperv-tlfs.h
+++ b/include/uapi/asm-generic/hyperv-tlfs.h
@@ -100,6 +100,7 @@ enum hv_vp_state_page_type {
 enum hv_partition_property_code {
 	/* Privilege properties */
 	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS				= 0x00010000,
+	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES			= 0x00010001,
 
 	/* Scheduling properties */
 	HV_PARTITION_PROPERTY_SUSPEND					= 0x00020000,
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 721f5b1999d5..bf2d8c8a0a37 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -18,6 +18,7 @@
 struct mshv_create_partition {
 	__u64 flags;
 	struct hv_partition_creation_properties partition_creation_properties;
+	union hv_partition_synthetic_processor_features synthetic_processor_features;
 };
 
 /*
diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
index bfbadeb4f1fe..78a1e70cac96 100644
--- a/virt/mshv/mshv_main.c
+++ b/virt/mshv/mshv_main.c
@@ -547,6 +547,33 @@ hv_call_map_vp_state_page(u32 vp_index, u64 partition_id,
 	return ret;
 }
 
+static long
+hv_call_set_partition_property(u64 partition_id,
+			       u64 property_code,
+			       u64 property_value)
+{
+	int status;
+	unsigned long flags;
+	struct hv_set_partition_property *input;
+
+	local_irq_save(flags);
+	input = (struct hv_set_partition_property *)(*this_cpu_ptr(
+			hyperv_pcpu_input_arg));
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition_id;
+	input->property_code = property_code;
+	input->property_value = property_value;
+	status = hv_do_hypercall(HVCALL_SET_PARTITION_PROPERTY,
+				 input,
+				 NULL) & HV_HYPERCALL_RESULT_MASK;
+	local_irq_restore(flags);
+
+	if (status != HV_STATUS_SUCCESS)
+		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
+
+	return -hv_status_to_errno(status);
+}
+
 static void
 mshv_isr(void)
 {
@@ -1373,30 +1400,13 @@ mshv_partition_ioctl_set_property(struct mshv_partition *partition,
 				  void __user *user_args)
 {
 	struct mshv_partition_property args;
-	int status;
-	unsigned long flags;
-	struct hv_set_partition_property *input;
 
 	if (copy_from_user(&args, user_args, sizeof(args)))
 		return -EFAULT;
 
-	local_irq_save(flags);
-	input = (struct hv_set_partition_property *)(*this_cpu_ptr(
-			hyperv_pcpu_input_arg));
-	memset(input, 0, sizeof(*input));
-	input->partition_id = partition->id;
-	input->property_code = args.property_code;
-	input->property_value = args.property_value;
-	status = hv_do_hypercall(HVCALL_SET_PARTITION_PROPERTY, input,
-			NULL) & HV_HYPERCALL_RESULT_MASK;
-	local_irq_restore(flags);
-
-	if (status != HV_STATUS_SUCCESS) {
-		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
-		return -hv_status_to_errno(status);
-	}
-
-	return 0;
+	return hv_call_set_partition_property(partition->id,
+					      args.property_code,
+					      args.property_value);
 }
 
 static long
@@ -1831,6 +1841,13 @@ mshv_ioctl_create_partition(void __user *user_arg)
 	if (ret)
 		goto put_fd;
 
+	ret = hv_call_set_partition_property(
+				partition->id,
+				HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES,
+				args.synthetic_processor_features.as_uint64[0]);
+	if (ret)
+		goto delete_partition;
+
 	ret = hv_call_initialize_partition(partition->id);
 	if (ret)
 		goto delete_partition;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr
  2020-11-21  0:30 ` [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr Nuno Das Neves
@ 2020-11-24 16:15   ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2020-11-24 16:15 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, virtualization, linux-kernel, mikelley, viremana,
	sunilmut, wei.liu, ligrassi, kys

On Fri, Nov 20, 2020 at 04:30:31PM -0800, Nuno Das Neves wrote:
[...]
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index c9445d2edb37..7ddb66d260ce 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -17,6 +17,7 @@
>  #include <linux/mm.h>
>  #include <linux/io.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/random.h>
>  #include <linux/mshv.h>
>  #include <asm/mshyperv.h>
>  
> @@ -498,6 +499,240 @@ hv_call_set_vp_registers(u32 vp_index,
>  	return -hv_status_to_errno(status);
>  }
>  
> +static void
> +mshv_isr(void)
> +{
[...]
> +
> +	/* Hold this lock for the rest of the isr, because the partition could
> +	 * be released anytime.
> +	 * e.g. the MSHV_RUN_VP thread could wake on another cpu; it could
> +	 * release the partition unless we hold this!
> +	 */
> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
> +

This should be switched to rwlock variant, otherwise vcpus can't run
concurrently.

You will take the read lock and only the ioctl that changes the list
will need to take the write lock.

There may be better and cheaper primitives than rwlock. Not sure if RCU
can be used in this context.

Wei.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (17 preceding siblings ...)
  2020-11-21  0:30 ` [RFC PATCH 18/18] virt/mshv: Add enlightenment bits to create partition ioctl Nuno Das Neves
@ 2020-11-24 16:18 ` Wei Liu
  2021-02-08 19:40 ` Michael Kelley
  19 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2020-11-24 16:18 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, virtualization, linux-kernel, mikelley, viremana,
	sunilmut, wei.liu, ligrassi, kys

On Fri, Nov 20, 2020 at 04:30:19PM -0800, Nuno Das Neves wrote:
> This patch series provides a userspace interface for creating and running guest
> virtual machines while running on the Microsoft Hypervisor [0].
> 
> Since managing guest machines can only be done when Linux is the root partition,
> this series depends on the RFC already posted by Wei Liu:
> https://lore.kernel.org/linux-hyperv/20201105165814.29233-1-wei.liu@kernel.org/T/#t
> 
> The first two patches provide some helpers for converting hypervisor status
> codes to linux error codes, and easily printing hypervisor status codes to dmesg
> for debugging.
> 
> Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h are
> split into uapi and non-uapi. The uapi versions contain structures used in both
> the ioctl interface and the kernel.
> 
> The mshv API is introduced in virt/mshv/mshv_main.c. As each interface is

Given this new file is placed under an arch-agnostic directory, please
make sure it doesn't break builds for other architecture. We can start
with running ARM builds for this series.

Wei.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface
  2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
                   ` (18 preceding siblings ...)
  2020-11-24 16:18 ` [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Wei Liu
@ 2021-02-08 19:40 ` Michael Kelley
  19 siblings, 0 replies; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:40 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> This patch series provides a userspace interface for creating and running guest
> virtual machines while running on the Microsoft Hypervisor [0].
> 
> Since managing guest machines can only be done when Linux is the root partition,
> this series depends on the RFC already posted by Wei Liu:
> https://lore.kernel.org/linux-hyperv/20201105165814.29233-1-wei.liu@kernel.org/T/#t
> 
> The first two patches provide some helpers for converting hypervisor status
> codes to linux error codes, and easily printing hypervisor status codes to dmesg
> for debugging.
> 
> Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h are
> split into uapi and non-uapi. The uapi versions contain structures used in both
> the ioctl interface and the kernel.
> 
> The mshv API is introduced in virt/mshv/mshv_main.c. As each interface is
> introduced, documentation is added in Documentation/virt/mshv/api.rst.
> The API is file-desciptor based, like KVM. The entry point is /dev/mshv.
> 
> /dev/mshv ioctls:
> MSHV_REQUEST_VERSION
> MSHV_CREATE_PARTITION
> 
> Partition (vm) ioctls:
> MSHV_MAP_GUEST_MEMORY, MSHV_UNMAP_GUEST_MEMORY
> MSHV_INSTALL_INTERCEPT
> MSHV_ASSERT_INTERRUPT
> MSHV_GET_PARTITION_PROPERTY, MSHV_SET_PARTITION_PROPERTY
> MSHV_CREATE_VP
> 
> Vp (vcpu) ioctls:
> MSHV_GET_VP_REGISTERS, MSHV_SET_VP_REGISTERS
> MSHV_RUN_VP
> MSHV_GET_VP_STATE, MSHV_SET_VP_STATE
> mmap() (register page)
> 
> [0] Hyper-V is more well-known, but it really refers to the whole stack
>     including the hypervisor and other components that run in Windows kernel
>     and userspace.
> 
> Nuno Das Neves (18):
>   x86/hyperv: convert hyperv statuses to linux error codes
>   asm-generic/hyperv: convert hyperv statuses to strings
>   virt/mshv: minimal mshv module (/dev/mshv/)
>   virt/mshv: request version ioctl
>   virt/mshv: create partition ioctl
>   virt/mshv: create, initialize, finalize, delete partition hypercalls
>   virt/mshv: withdraw memory hypercall
>   virt/mshv: map and unmap guest memory
>   virt/mshv: create vcpu ioctl
>   virt/mshv: get and set vcpu registers ioctls
>   virt/mshv: set up synic pages for intercept messages
>   virt/mshv: run vp ioctl and isr
>   virt/mshv: install intercept ioctl
>   virt/mshv: assert interrupt ioctl
>   virt/mshv: get and set vp state ioctls
>   virt/mshv: mmap vp register page
>   virt/mshv: get and set partition property ioctls
>   virt/mshv: Add enlightenment bits to create partition ioctl
> 
>  .../userspace-api/ioctl/ioctl-number.rst      |    2 +
>  Documentation/virt/mshv/api.rst               |  173 ++
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/hyperv/Kconfig                       |   22 +
>  arch/x86/hyperv/Makefile                      |    4 +
>  arch/x86/hyperv/hv_init.c                     |    2 +-
>  arch/x86/hyperv/hv_proc.c                     |   40 +-
>  arch/x86/include/asm/hyperv-tlfs.h            |   44 +-
>  arch/x86/include/asm/mshyperv.h               |    1 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h       | 1312 +++++++++++
>  arch/x86/kernel/cpu/mshyperv.c                |   16 +
>  include/asm-generic/hyperv-tlfs.h             |  324 ++-
>  include/asm-generic/mshyperv.h                |    3 +
>  include/linux/mshv.h                          |   61 +
>  include/uapi/asm-generic/hyperv-tlfs.h        |  160 ++
>  include/uapi/linux/mshv.h                     |  109 +
>  virt/mshv/mshv_main.c                         | 2054 +++++++++++++++++
>  17 files changed, 4178 insertions(+), 151 deletions(-)
>  create mode 100644 Documentation/virt/mshv/api.rst
>  create mode 100644 arch/x86/hyperv/Kconfig
>  create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
>  create mode 100644 include/linux/mshv.h
>  create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
>  create mode 100644 include/uapi/linux/mshv.h
>  create mode 100644 virt/mshv/mshv_main.c
> 
> --
> 2.25.1

I finally made it through reviewing this patch series.  Nice
work -- to you, and to Lillian as the original author of significant
portions!  There's a lot code, but it is well organized for reviewing
and overall is done well.

I have a three general comments:

1) Historically we have very precisely specified the layout of data
structures that are shared with Hyper-V.  Each field has an explicit
width (i.e., u16, u32, u64, etc.) and we have avoided field types that
lack an explicit width (int, enum, bool, etc.).  These patches make
liberal use of enum types in the Hyper-V data structures, and I saw
one occurrence of bool.  While treating enum and bool as 32 bits
works, I have a concern that such specifications aren't consistent
with the original rigor we tried to use.

Related, there are several places where the proper layout depends
on the compiler inserting padding (and not inserting padding in the
wrong places) to achieve the needed alignment.  In my view, we
should be explicitly adding the padding.  A couple years back at
Vitaly Kuznetsov's initiative, we added __packed on all the data
structures to instruct the compiler to not add padding, so as to
prevent padding being added at any inappropriate places.

I started by flagging all of these places I saw either of these two
Issues, but I stopped doing so in some of the later patches, figuring
that you could find the issues across the entire series.

2) With all the new hypercalls added with this patch series, and with
Wei Liu's patch series for Linux in the root partition, I've noticed that
we're inconsistent in how the hypercall status is checked.   The
current code works, but is sloppy with types and doesn't always
conform to the letter of the TLFS.  Your new hv_status_to_errno() is
a nice addition, but I think we would be well served by using a 
consistent pattern.  I'm planning to send out a separate email to
the linux-hyperv mailing list with a specific suggestion that we can
all review and comment on.  Once we have agreement, we can do
a cleanup exercise on existing code and on recent patches.

3) I've flagged a few places where the code does not handle configurations
where PAGE_SIZE is other than 4 Kbytes.  While this will never happen
on x86/x64, it could happen on other architectures like ARM64.  Of course,
we may never want to run Linux in the root partition with a page size
other than 4 Kbytes, even on ARM64, so I'm OK with not fixing all these
places.  But I've flagged some places where HV_HYP_PAGE_SIZE would
be more appropriate than PAGE_SIZE (and similar) and I think it makes
sense to fix those now, if just to express that the usage is tied to the
page size used by the Hyper-V interface, and not the guest page size.

I'll also send replies to many of the individual patches with specific
comments embedded.  I have not given "Reviewed-by:" on any of the
patches since they were submitted as RFC, but I can do so for a few
of the patches if that would be helpful.

Michael

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2020-11-21  0:30 ` [RFC PATCH 04/18] virt/mshv: request version ioctl Nuno Das Neves
@ 2021-02-08 19:41   ` Michael Kelley
  2021-03-04 21:35     ` Nuno Das Neves
  2021-02-09 13:11   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:41 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Reserve ioctl number in userpsace-api/ioctl/ioctl-number.rst
> Introduce MSHV_REQUEST_VERSION ioctl.
> Introduce documentation for /dev/mshv in Documentation/virt/mshv
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |  2 +
>  Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
>  include/linux/mshv.h                          | 11 ++++
>  include/uapi/linux/mshv.h                     | 19 ++++++
>  virt/mshv/mshv_main.c                         | 49 +++++++++++++++
>  5 files changed, 143 insertions(+)
>  create mode 100644 Documentation/virt/mshv/api.rst
>  create mode 100644 include/linux/mshv.h
>  create mode 100644 include/uapi/linux/mshv.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst
> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 55a2d9b2ce33..13a4d3ecafca 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
>  0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-
> remoteproc@vger.kernel.org>
>  0xB6  all    linux/fpga-dfl.h
>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-
> remoteproc@vger.kernel.org>
> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> new file mode 100644
> index 000000000000..82e32de48d03
> --- /dev/null
> +++ b/Documentation/virt/mshv/api.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +Microsoft Hypervisor Root Partition API Documentation
> +=====================================================
> +
> +1. Overview
> +===========
> +
> +This document describes APIs for creating and managing guest virtual machines
> +when running Linux as the root partition on the Microsoft Hypervisor.
> +
> +This API is not yet stable.
> +
> +2. Glossary/Terms
> +=================
> +
> +hv
> +--
> +Short for Hyper-V. This name is used in the kernel to describe interfaces to
> +the Microsoft Hypervisor.
> +
> +mshv
> +----
> +Short for Microsoft Hypervisor. This is the name of the userland API module
> +described in this document.
> +
> +Partition
> +---------
> +A virtual machine running on the Microsoft Hypervisor.
> +
> +Root Partition
> +--------------
> +The partition that is created and assumes control when the machine boots. The
> +root partition can use mshv APIs to create guest partitions.
> +
> +3. API description
> +==================
> +
> +The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
> +
> +Mshv is file descriptor-based, following a similar pattern to KVM.
> +
> +To get a handle to the mshv driver, use open("/dev/mshv").
> +
> +3.1 MSHV_REQUEST_VERSION
> +------------------------
> +:Type: /dev/mshv ioctl
> +:Parameters: pointer to a u32
> +:Returns: 0 on success
> +
> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
> +establish the interface version with the kernel module.
> +
> +The caller should pass the MSHV_VERSION as an argument.
> +
> +The kernel module will check which interface versions it supports and return 0
> +if one of them matches.
> +
> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
> +it is open - this ioctl can only be called once per open.

To clarify the wording:

The caller should pass the requested version as an argument.  If the requested
version is one that the kernel module supports, the ioctl will return 0.  If the
requested version is not supported by the kernel module, the caller may try
the ioctl repeatedly to find a version that the caller supports and that the kernel
module supports.   Once a match is found, the /dev/mshv file descriptor is
'locked' to that version as long as it is open; i.e., the ioctl can succeed
only once per open.

> +
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> new file mode 100644
> index 000000000000..a0982fe2c0b8
> --- /dev/null
> +++ b/include/linux/mshv.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_MSHV_H
> +#define _LINUX_MSHV_H
> +
> +/*
> + * Microsoft Hypervisor root partition driver for /dev/mshv
> + */
> +
> +#include <uapi/linux/mshv.h>
> +
> +#endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> new file mode 100644
> index 000000000000..dd30fc2f0a80
> --- /dev/null
> +++ b/include/uapi/linux/mshv.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_MSHV_H
> +#define _UAPI_LINUX_MSHV_H
> +
> +/*
> + * Userspace interface for /dev/mshv
> + * Microsoft Hypervisor root partition APIs
> + */
> +
> +#include <linux/types.h>
> +
> +#define MSHV_VERSION	0x0
> +
> +#define MSHV_IOCTL 0xB8
> +
> +/* mshv device */
> +#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
> +
> +#endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index ecb9089761fe..62f631f85301 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -11,25 +11,74 @@
>  #include <linux/module.h>
>  #include <linux/fs.h>
>  #include <linux/miscdevice.h>
> +#include <linux/slab.h>
> +#include <linux/mshv.h>
> 
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
> 
> +#define MSHV_INVALID_VERSION	0xFFFFFFFF
> +#define MSHV_CURRENT_VERSION	MSHV_VERSION
> +
> +static u32 supported_versions[] = {
> +	MSHV_CURRENT_VERSION,
> +};

I'm not sure that the concept of "CURRENT_VERSION" makes sense
as a fixed constant.  We have an array of supported versions, any of
which are valid and supported by the kernel module.   The array
should list individual versions.   The current version is 0, which 
might be labelled as MSHV_VERSION_PRERELEASE, or something
similar.  Then later we might have MSHV_VERSION_RELEASE_1,
HSMV_VERSION_RELEASE_2, as needed.  Or maybe the versions
are tied to releases of the Microsoft Hypervisor.

> +
> +static long
> +mshv_ioctl_request_version(u32 *version, void __user *user_arg)
> +{
> +	u32 arg;
> +	int i;
> +
> +	if (copy_from_user(&arg, user_arg, sizeof(arg)))
> +		return -EFAULT;
> +
> +	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
> +		if (supported_versions[i] == arg) {
> +			*version = supported_versions[i];
> +			return 0;
> +		}
> +	}
> +	return -ENOTSUPP;
> +}
> +
>  static long
>  mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> +	u32 *version = (u32 *)filp->private_data;
> +
> +	if (ioctl == MSHV_REQUEST_VERSION) {
> +		/* Version can only be set once */
> +		if (*version != MSHV_INVALID_VERSION)
> +			return -EBADFD;
> +
> +		return mshv_ioctl_request_version(version, (void __user *)arg);
> +	}
> +
> +	/* Version must be set before other ioctls can be called */
> +	if (*version == MSHV_INVALID_VERSION)
> +		return -EBADFD;
> +
> +	/* TODO other ioctls */
> +
>  	return -ENOTTY;
>  }
> 
>  static int
>  mshv_dev_open(struct inode *inode, struct file *filp)
>  {
> +	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
> +	if (!filp->private_data)
> +		return -ENOMEM;
> +	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
> +
>  	return 0;
>  }
> 
>  static int
>  mshv_dev_release(struct inode *inode, struct file *filp)
>  {
> +	kfree(filp->private_data);
>  	return 0;
>  }
> 
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
  2020-11-21  0:30 ` [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls Nuno Das Neves
@ 2021-02-08 19:42   ` Michael Kelley
  2021-03-04 23:49     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:42 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Add hypercalls for fully setting up and mostly tearing down a guest
> partition.
> The teardown operation will generate an error as the deposited
> memory has not been withdrawn.
> This is fixed in the next patch.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/asm-generic/hyperv-tlfs.h      |  52 +++++++-
>  include/uapi/asm-generic/hyperv-tlfs.h |   1 +
>  include/uapi/linux/mshv.h              |   1 +
>  virt/mshv/mshv_main.c                  | 169 ++++++++++++++++++++++++-
>  4 files changed, 220 insertions(+), 3 deletions(-)
> 
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2ff580780ce4..ab6ae6c164f5 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -142,6 +142,10 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
>  #define HVCALL_SEND_IPI_EX			0x0015
> +#define HVCALL_CREATE_PARTITION			0x0040
> +#define HVCALL_INITIALIZE_PARTITION		0x0041
> +#define HVCALL_FINALIZE_PARTITION		0x0042
> +#define HVCALL_DELETE_PARTITION			0x0043
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>  #define HVCALL_CREATE_VP			0x004e
> @@ -451,7 +455,7 @@ struct hv_get_partition_id {
>  struct hv_deposit_memory {
>  	u64 partition_id;
>  	u64 gpa_page_list[];
> -} __packed;
> +};

Why remove __packed?

> 
>  struct hv_proximity_domain_flags {
>  	u32 proximity_preferred : 1;
> @@ -767,4 +771,50 @@ struct hv_input_unmap_device_interrupt {
>  #define HV_SOURCE_SHADOW_NONE               0x0
>  #define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE   0x1
> 
> +#define HV_MAKE_COMPATIBILITY_VERSION(major_, minor_)                          \
> +	((u32)((major_) << 8 | (minor_)))
> +
> +enum hv_compatibility_version {
> +	HV_COMPATIBILITY_19_H1 = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X5),
> +	HV_COMPATIBILITY_MANGANESE = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X7),

Avoid use of "Manganese", which is an internal code name.  I'd suggest calling it
20_H1 instead, which at least has some broader meaning.

> +	HV_COMPATIBILITY_PRERELEASE = HV_MAKE_COMPATIBILITY_VERSION(0XFE, 0X0),
> +	HV_COMPATIBILITY_EXPERIMENT = HV_MAKE_COMPATIBILITY_VERSION(0XFF, 0X0),
> +};
> +
> +union hv_partition_isolation_properties {
> +	u64 as_uint64;
> +	struct {
> +		u64 isolation_type: 5;
> +		u64 rsvd_z: 7;
> +		u64 shared_gpa_boundary_page_number: 52;
> +	};
> +};

Add __packed.

> +
> +/* Non-userspace-visible partition creation flags */
> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
> +
> +struct hv_create_partition_in {
> +	u64 flags;
> +	union hv_proximity_domain_info proximity_domain_info;
> +	enum hv_compatibility_version compatibility_version;

An "enum" is a 32 bit value in gcc and I would presume that
Hyper-V is expecting a 64 bit value.  In general, using an enum in a data
structure with exact layout requirements is problematic because the "C"
language doesn't specify how big an enum is.  In such cases, it's better
to use an integer field with an explicit size (like u64) and #defines for
the possible values.

> +	struct hv_partition_creation_properties partition_creation_properties;
> +	union hv_partition_isolation_properties isolation_properties;
> +};
> +
> +struct hv_create_partition_out {
> +	u64 partition_id;
> +};
> +
> +struct hv_initialize_partition {
> +	u64 partition_id;
> +};
> +
> +struct hv_finalize_partition {
> +	u64 partition_id;
> +};
> +
> +struct hv_delete_partition {
> +	u64 partition_id;
> +};

All of the above should have __packed for consistency with the other
Hyper-V data structures.

> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
> tlfs.h
> index 140cc0b4f98f..7a858226a9c5 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -6,6 +6,7 @@
>  #define BIT(X)	(1ULL << (X))
>  #endif
> 
> +/* Userspace-visible partition creation flags */

Could this comment be included in the earlier patch with the #defines so
that you avoid the trivial change here?

>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 3788f8bc5caa..4f8da9a6fde2 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -9,6 +9,7 @@
> 
>  #include <linux/types.h>
>  #include <asm/hyperv-tlfs.h>
> +#include <asm-generic/hyperv-tlfs.h>

Similarly, consider adding this #include in the earlier patch so that
this trivial change isn't needed here.

> 
>  #define MSHV_VERSION	0x0
> 
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 4dcbe4907430..c4130a6508e5 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -15,6 +15,7 @@
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/mshv.h>
> +#include <asm/mshyperv.h>
> 
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
> @@ -31,7 +32,6 @@ static struct mshv mshv = {};
>  static void mshv_partition_put(struct mshv_partition *partition);
>  static int mshv_partition_release(struct inode *inode, struct file *filp);
>  static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> -

Spurious whitespace change?

>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> @@ -57,6 +57,143 @@ static struct miscdevice mshv_dev = {
>  	.mode = 600,
>  };
> 
> +#define HV_INIT_PARTITION_DEPOSIT_PAGES 208

A comment about how this value is determined would be useful.
I'm assuming it was determined empirically.

> +
> +static int
> +hv_call_create_partition(
> +		u64 flags,
> +		struct hv_partition_creation_properties creation_properties,
> +		u64 *partition_id)
> +{
> +	struct hv_create_partition_in *input;
> +	struct hv_create_partition_out *output;
> +	int status;
> +	int ret;
> +	unsigned long irq_flags;
> +	int i;
> +
> +	do {
> +		local_irq_save(irq_flags);
> +		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input->flags = flags;
> +		input->proximity_domain_info.as_uint64 = 0;
> +		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
> +		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
> +			input->partition_creation_properties
> +				.disabled_processor_features.as_uint64[i] = 0;
> +		input->partition_creation_properties
> +			.disabled_processor_xsave_features.as_uint64 = 0;
> +		input->isolation_properties.as_uint64 = 0;
> +
> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> +					 input, output);

hv_do_hypercall returns a u64, which should then be masked with
HV_HYPERCALL_RESULT_MASK before checking the result.

> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status == HV_STATUS_SUCCESS)
> +				*partition_id = output->partition_id;
> +			else
> +				pr_err("%s: %s\n",
> +				       __func__, hv_status_to_string(status));
> +			local_irq_restore(irq_flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(irq_flags);
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    hv_current_partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_initialize_partition(u64 partition_id)
> +{
> +	struct hv_initialize_partition *input;
> +	int status;
> +	int ret;
> +	unsigned long flags;
> +
> +	ret = hv_call_deposit_pages(
> +				NUMA_NO_NODE,
> +				partition_id,
> +				HV_INIT_PARTITION_DEPOSIT_PAGES);
> +	if (ret)
> +		return ret;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_initialize_partition *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		input->partition_id = partition_id;
> +
> +		status = hv_do_hypercall(
> +				HVCALL_INITIALIZE_PARTITION,
> +				input, NULL);

FWIW, since the input is a single 64 bit value, and there's no output,
this could use hv_do_fast_hypercall8() instead, and avoid
needing to use the input arg page and the irq save/restore.  Would have
to check that the particular hypercall supports the "fast" version.

> +		local_irq_restore(flags);
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {

Same comment about status being u64 and masking.

> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n",
> +				       __func__, hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_finalize_partition(u64 partition_id)
> +{
> +	struct hv_finalize_partition *input;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = (struct hv_finalize_partition *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input->partition_id = partition_id;
> +	status = hv_do_hypercall(
> +			HVCALL_FINALIZE_PARTITION,
> +			input, NULL);
> +	local_irq_restore(flags);


Same comment about hv_do_fast_hypercall8() and about status
being a u64 and masking.

> +
> +	if (status != HV_STATUS_SUCCESS)
> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static int
> +hv_call_delete_partition(u64 partition_id)
> +{
> +	struct hv_delete_partition *input;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = (struct hv_delete_partition *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input->partition_id = partition_id;
> +	status = hv_do_hypercall(
> +			HVCALL_DELETE_PARTITION,
> +			input, NULL);
> +	local_irq_restore(flags);

Same comments about hv_do_fast_hypercall8(), and
the status and masking.

> +
> +	if (status != HV_STATUS_SUCCESS)
> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
> +
> +	return -hv_status_to_errno(status);
> +}
> +
>  static long
>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> @@ -86,6 +223,17 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> 
> +	/*
> +	 * There are no remaining references to the partition or vps,
> +	 * so the remaining cleanup can be lockless
> +	 */
> +
> +	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
> +	hv_call_finalize_partition(partition->id);
> +	/* TODO: Withdraw and free all pages we deposited */
> +
> +	hv_call_delete_partition(partition->id);
> +
>  	kfree(partition);
>  }
> 
> @@ -146,6 +294,9 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  	if (copy_from_user(&args, user_arg, sizeof(args)))
>  		return -EFAULT;
> 
> +	/* Only support EXO partitions */
> +	args.flags |= HV_PARTITION_CREATION_FLAG_EXO_PARTITION;
> +
>  	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
>  	if (!partition)
>  		return -ENOMEM;
> @@ -156,11 +307,21 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  		goto free_partition;
>  	}
> 
> +	ret = hv_call_create_partition(args.flags,
> +				       args.partition_creation_properties,
> +				       &partition->id);
> +	if (ret)
> +		goto put_fd;
> +
> +	ret = hv_call_initialize_partition(partition->id);
> +	if (ret)
> +		goto delete_partition;
> +
>  	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
>  				  partition, O_RDWR);
>  	if (IS_ERR(file)) {
>  		ret = PTR_ERR(file);
> -		goto put_fd;
> +		goto finalize_partition;
>  	}
>  	refcount_set(&partition->ref_count, 1);
> 
> @@ -174,6 +335,10 @@ mshv_ioctl_create_partition(void __user *user_arg)
> 
>  release_file:
>  	file->f_op->release(file->f_inode, file);
> +finalize_partition:
> +	hv_call_finalize_partition(partition->id);
> +delete_partition:
> +	hv_call_delete_partition(partition->id);
>  put_fd:
>  	put_unused_fd(fd);
>  free_partition:
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall
  2020-11-21  0:30 ` [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall Nuno Das Neves
@ 2021-02-08 19:44   ` Michael Kelley
  2021-03-05 21:01     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:44 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Withdraw the memory from a finalized partition and free the pages.
> The partition is now cleaned up correctly when the fd is released.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/asm-generic/hyperv-tlfs.h | 10 ++++++
>  virt/mshv/mshv_main.c             | 54 ++++++++++++++++++++++++++++++-
>  2 files changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index ab6ae6c164f5..2a49503b7396 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -148,6 +148,7 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_DELETE_PARTITION			0x0043
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
> +#define HVCALL_WITHDRAW_MEMORY			0x0049
>  #define HVCALL_CREATE_VP			0x004e
>  #define HVCALL_GET_VP_REGISTERS			0x0050
>  #define HVCALL_SET_VP_REGISTERS			0x0051
> @@ -472,6 +473,15 @@ union hv_proximity_domain_info {
>  	u64 as_uint64;
>  };
> 
> +struct hv_withdraw_memory_in {
> +	u64 partition_id;
> +	union hv_proximity_domain_info proximity_domain_info;
> +};
> +
> +struct hv_withdraw_memory_out {
> +	u64 gpa_page_list[0];

For a variable size array, the Linux kernel community has an effort
underway to replace occurrences of [0] and [1] with just [].  I think
[] can be used here.

> +};
> +

Add __packed to the above two structs.

>  struct hv_lp_startup_status {
>  	u64 hv_status;
>  	u64 substatus1;
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index c4130a6508e5..162a1bb42a4a 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/mm.h>
>  #include <linux/mshv.h>
>  #include <asm/mshyperv.h>
> 
> @@ -57,8 +58,58 @@ static struct miscdevice mshv_dev = {
>  	.mode = 600,
>  };
> 
> +#define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))

Use HV_HYP_PAGE_SIZE so that we're explicit that the dependency
is on the page size used by Hyper-V, which might be different from the
guest page size (at least on architectures like ARM64).

>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> 
> +static int
> +hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> +{
> +	struct hv_withdraw_memory_in *input_page;
> +	struct hv_withdraw_memory_out *output_page;
> +	u16 completed;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int status;
> +	int i;
> +	unsigned long flags;
> +
> +	while (remaining) {
> +		local_irq_save(flags);
> +
> +		input_page = (struct hv_withdraw_memory_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output_page = (struct hv_withdraw_memory_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input_page->partition_id = partition_id;
> +		input_page->proximity_domain_info.as_uint64 = 0;
> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_WITHDRAW_MEMORY,
> +			min(remaining, HV_WITHDRAW_BATCH_SIZE), 0, input_page,
> +			output_page);
> +
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +
> +		for (i = 0; i < completed; i++)
> +			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
> +
> +		local_irq_restore(flags);

Seems like there's some risk that we have interrupts disabled for too long.
We could be calling __free_page() up to 512 times.  It might be better for this
function to allocate its own page to be used as the output page, so that interrupts
can be enabled immediately after the hypercall completes.  Then the __free_page()
loop can execute with interrupts enabled.   We have the per-cpu input and output
pages to avoid the overhead of allocating/freeing pages for each hypercall, but in this
case a private output page might be warranted.

> +
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			if (status != HV_STATUS_NO_RESOURCES)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			break;
> +		}
> +
> +		remaining -= completed;
> +	}
> +
> +	return -hv_status_to_errno(status);
> +}
> +
>  static int
>  hv_call_create_partition(
>  		u64 flags,
> @@ -230,7 +281,8 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
>  	hv_call_finalize_partition(partition->id);
> -	/* TODO: Withdraw and free all pages we deposited */
> +	/* Withdraw and free all pages we deposited */
> +	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->id);
> 
>  	hv_call_delete_partition(partition->id);
> 
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
  2020-11-21  0:30 ` [RFC PATCH 08/18] virt/mshv: map and unmap guest memory Nuno Das Neves
@ 2021-02-08 19:45   ` Michael Kelley
  2021-03-08 19:14     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:45 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Introduce ioctls for mapping and unmapping regions of guest memory.
> 
> Uses a table of memory 'slots' similar to KVM, but the slot
> number is not visible to userspace.
> 
> For now, this simple implementation requires each new mapping to be
> disjoint - the underlying hypercalls have no such restriction, and
> implicitly overwrite any mappings on the pages in the specified regions.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst        |  15 ++
>  include/asm-generic/hyperv-tlfs.h      |  15 ++
>  include/linux/mshv.h                   |  14 ++
>  include/uapi/asm-generic/hyperv-tlfs.h |   9 +
>  include/uapi/linux/mshv.h              |  15 ++
>  virt/mshv/mshv_main.c                  | 322 ++++++++++++++++++++++++-
>  6 files changed, 388 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index ce651a1738e0..530efc29d354 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -72,3 +72,18 @@ it is open - this ioctl can only be called once per open.
>  This ioctl creates a guest partition, returning a file descriptor to use as a
>  handle for partition ioctls.
> 
> +3.3 MSHV_MAP_GUEST_MEMORY and MSHV_UNMAP_GUEST_MEMORY
> +-----------------------------------------------------
> +:Type: partition ioctl
> +:Parameters: struct mshv_user_mem_region
> +:Returns: 0 on success
> +
> +Create a mapping from a region of process memory to a region of physical memory
> +in a guest partition.

Just to be super explicit:

Create a mapping from memory in the user space of the calling process (running
in the root partition) to a region of guest physical memory in a guest partition.

> +
> +Mappings must be disjoint in process address space and guest address space.
> +
> +Note: In the current implementation, this memory is pinned to stop the pages
> +being moved by linux and subsequently clobbered by the hypervisor. So the region
> +is backed by physical memory.

Again to be super explicit:

Note: In the current implementation, this memory is pinned to real physical
memory to stop the pages being moved by Linux in the root partition,
and subsequently being clobbered by the hypervisor.  So the region is backed
by real physical memory.

> +
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2a49503b7396..6e5072e29897 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -149,6 +149,8 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>  #define HVCALL_WITHDRAW_MEMORY			0x0049
> +#define HVCALL_MAP_GPA_PAGES			0x004b
> +#define HVCALL_UNMAP_GPA_PAGES			0x004c
>  #define HVCALL_CREATE_VP			0x004e
>  #define HVCALL_GET_VP_REGISTERS			0x0050
>  #define HVCALL_SET_VP_REGISTERS			0x0051
> @@ -827,4 +829,17 @@ struct hv_delete_partition {
>  	u64 partition_id;
>  };
> 
> +struct hv_map_gpa_pages {
> +	u64 target_partition_id;
> +	u64 target_gpa_base;
> +	u32 map_flags;

Is there a reserved 32 bit field here?  Hyper-V always aligns
things on 64 bit boundaries.

> +	u64 source_gpa_page_list[];
> +};
> +
> +struct hv_unmap_gpa_pages {
> +	u64 target_partition_id;
> +	u64 target_gpa_base;
> +	u32 unmap_flags;

Is there a reserved 32 bit field here?  Hyper-V always aligns
things on 64 bit boundaries.

> +};

Add __packed to the above two structs after sorting out
the alignment issues.

> +
>  #endif
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index fc4f35089b2c..91a742f37440 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -7,13 +7,27 @@
>   */
> 
>  #include <linux/spinlock.h>
> +#include <linux/mutex.h>
>  #include <uapi/linux/mshv.h>
> 
>  #define MSHV_MAX_PARTITIONS		128
> +#define MSHV_MAX_MEM_REGIONS		64
> +
> +struct mshv_mem_region {
> +	u64 size; /* bytes */
> +	u64 guest_pfn;
> +	u64 userspace_addr; /* start of the userspace allocated memory */
> +	struct page **pages;
> +};
> 
>  struct mshv_partition {
>  	u64 id;
>  	refcount_t ref_count;
> +	struct mutex mutex;
> +	struct {
> +		u32 count;
> +		struct mshv_mem_region slots[MSHV_MAX_MEM_REGIONS];
> +	} regions;
>  };
> 
>  struct mshv {
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
> tlfs.h
> index 7a858226a9c5..e7b09b9f00de 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -12,4 +12,13 @@
>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
>  #define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> 
> +/* HV Map GPA (Guest Physical Address) Flags */
> +#define HV_MAP_GPA_PERMISSIONS_NONE     0x0
> +#define HV_MAP_GPA_READABLE             0x1
> +#define HV_MAP_GPA_WRITABLE             0x2
> +#define HV_MAP_GPA_KERNEL_EXECUTABLE    0x4
> +#define HV_MAP_GPA_USER_EXECUTABLE      0x8
> +#define HV_MAP_GPA_EXECUTABLE           0xC
> +#define HV_MAP_GPA_PERMISSIONS_MASK     0xF
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 4f8da9a6fde2..47be03ef4e86 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -18,10 +18,25 @@ struct mshv_create_partition {
>  	struct hv_partition_creation_properties partition_creation_properties;
>  };
> 
> +/*
> + * Mappings can't overlap in GPA space or userspace
> + * To unmap, these fields must match an existing mapping
> + */
> +struct mshv_user_mem_region {
> +	__u64 size;		/* bytes */
> +	__u64 guest_pfn;
> +	__u64 userspace_addr;	/* start of the userspace allocated memory */
> +	__u32 flags;		/* ignored on unmap */
> +};
> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>  #define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
> 
> +/* partition device */
> +#define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
> +#define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 162a1bb42a4a..ce480598e67f 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -60,6 +60,10 @@ static struct miscdevice mshv_dev = {
> 
>  #define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> +#define HV_MAP_GPA_MASK		(0x0000000FFFFFFFFFULL)
> +#define HV_MAP_GPA_BATCH_SIZE	\
> +		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))

Hmmm. Shouldn't this be:

	((HV_HYP_PAGE_SIZE - sizeof(struct hv_map_gpa_pages))/sizeof(u64))


> +#define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
> 
>  static int
>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
>  	return -hv_status_to_errno(status);
>  }
> 
> +static int
> +hv_call_map_gpa_pages(u64 partition_id,
> +		      u64 gpa_target,
> +		      u64 page_count, u32 flags,
> +		      struct page **pages)
> +{
> +	struct hv_map_gpa_pages *input_page;
> +	int status;
> +	int i;
> +	struct page **p;
> +	u32 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = page_count;
> +	int rep_count;
> +	unsigned long irq_flags;
> +	int ret = 0;
> +
> +	while (remaining) {
> +
> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> +
> +		local_irq_save(irq_flags);
> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +
> +		input_page->target_partition_id = partition_id;
> +		input_page->target_gpa_base = gpa_target;
> +		input_page->map_flags = flags;
> +
> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> +			input_page->source_gpa_page_list[i] =
> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;

The masking seems a bit weird.  The mask allows for up to 64G page frames,
which is 256 Tbytes of total physical memory, which is probably the current
Hyper-V limit on memory size (48 bit physical address space, though 52 bit
physical address spaces are coming).  So the masking shouldn't ever be doing
anything.   And if it was doing something, that probably should be treated as
an error rather than simply dropping the high bits.

Note that this code does not handle the case where PAGE_SIZE !=
HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
page size other than 4K.

> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> +		local_irq_restore(irq_flags);
> +
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +				HV_HYPERCALL_REP_COMP_OFFSET;
> +
> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    partition_id, 256);

Why adding 256 pages?  I'm just contrasting with other places that add
1 page at a time.  Maybe a comment to explain ....

> +			if (ret)
> +				break;
> +		} else if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %llu out of %llu, %s\n",
> +			       __func__,
> +			       page_count - remaining, page_count,
> +			       hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +
> +		pages += completed;
> +		remaining -= completed;
> +		gpa_target += completed;
> +	}
> +
> +	if (ret && completed) {

Is the above the right test?  Completed could be zero from the most
recent iteration, but still could be partially succeeded based on a previous
successful iteration.   I think this needs to check whether remaining equals
page_count.

> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> +		       __func__);
> +		ret = -EBADFD;
> +	}
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_unmap_gpa_pages(u64 partition_id,
> +			u64 gpa_target,
> +			u64 page_count, u32 flags)
> +{
> +	struct hv_unmap_gpa_pages *input_page;
> +	int status;
> +	int ret = 0;
> +	u32 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = page_count;
> +	int rep_count;
> +	unsigned long irq_flags;
> +
> +	local_irq_save(irq_flags);
> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input_page->target_partition_id = partition_id;
> +	input_page->target_gpa_base = gpa_target;
> +	input_page->unmap_flags = flags;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);

Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.

> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +				HV_HYPERCALL_REP_COMP_OFFSET;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %llu out of %llu, %s\n",
> +			       __func__,
> +			       page_count - remaining, page_count,
> +			       hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +
> +		remaining -= completed;
> +		gpa_target += completed;
> +		input_page->target_gpa_base = gpa_target;
> +	}
> +	local_irq_restore(irq_flags);

I have some concern about holding interrupts disabled for this long.

> +
> +	if (ret && completed) {

Same comment as before.

> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> +		       __func__);
> +		ret = -EBADFD;
> +	}
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> +				struct mshv_user_mem_region __user *user_mem)
> +{
> +	struct mshv_user_mem_region mem;
> +	struct mshv_mem_region *region;
> +	int completed;
> +	unsigned long remaining, batch_size;
> +	int i;
> +	struct page **pages;
> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> +	u64 region_page_count, region_user_start, region_user_end;
> +	u64 region_gpfn_start, region_gpfn_end;
> +	long ret = 0;
> +
> +	/* Check we have enough slots*/
> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> +		pr_err("%s: not enough memory region slots\n", __func__);
> +		return -ENOSPC;
> +	}
> +
> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> +		return -EFAULT;
> +
> +	if (!mem.size ||
> +	    mem.size & (PAGE_SIZE - 1) ||
> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||

There's a PAGE_ALIGNED macro that expresses exactly what
each of the previous two tests is doing.

> +	    !access_ok(mem.userspace_addr, mem.size))
> +		return -EINVAL;
> +
> +	/* Reject overlapping regions */
> +	page_count = mem.size >> PAGE_SHIFT;
> +	user_start = mem.userspace_addr;
> +	user_end = mem.userspace_addr + mem.size;
> +	gpfn_start = mem.guest_pfn;
> +	gpfn_end = mem.guest_pfn + page_count;
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		region = &partition->regions.slots[i];
> +		if (!region->size)
> +			continue;
> +		region_page_count = region->size >> PAGE_SHIFT;
> +		region_user_start = region->userspace_addr;
> +		region_user_end = region->userspace_addr + region->size;
> +		region_gpfn_start = region->guest_pfn;
> +		region_gpfn_end = region->guest_pfn + region_page_count;
> +
> +		if (!(
> +		     (user_end <= region_user_start) ||
> +		     (region_user_end <= user_start))) {
> +			return -EEXIST;
> +		}
> +		if (!(
> +		     (gpfn_end <= region_gpfn_start) ||
> +		     (region_gpfn_end <= gpfn_start))) {
> +			return -EEXIST;

You could apply De Morgan's theorem to the conditions
in each "if" statement and get rid of the "!".  That might make
these slightly easier to understand, but I have no strong
preference.

> +		}
> +	}
> +
> +	/* Pin the userspace pages */
> +	pages = vzalloc(sizeof(struct page *) * page_count);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	remaining = page_count;
> +	while (remaining) {
> +		/*
> +		 * We need to batch this, as pin_user_pages_fast with the
> +		 * FOLL_LONGTERM flag does a big temporary allocation
> +		 * of contiguous memory
> +		 */
> +		batch_size = min(remaining, PIN_PAGES_BATCH_SIZE);
> +		completed = pin_user_pages_fast(
> +				mem.userspace_addr +
> +					(page_count - remaining) * PAGE_SIZE,
> +				batch_size,
> +				FOLL_WRITE | FOLL_LONGTERM,
> +				&pages[page_count - remaining]);
> +		if (completed < 0) {
> +			pr_err("%s: failed to pin user pages error %i\n",
> +			       __func__,
> +			       completed);
> +			ret = completed;
> +			goto err_unpin_pages;
> +		}
> +		remaining -= completed;
> +	}
> +
> +	/* Map the pages to GPA pages */
> +	ret = hv_call_map_gpa_pages(partition->id, mem.guest_pfn,
> +				    page_count, mem.flags, pages);
> +	if (ret)
> +		goto err_unpin_pages;
> +
> +	/* Install the new region */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		if (!partition->regions.slots[i].size) {
> +			region = &partition->regions.slots[i];
> +			break;
> +		}
> +	}
> +	region->pages = pages;
> +	region->size = mem.size;
> +	region->guest_pfn = mem.guest_pfn;
> +	region->userspace_addr = mem.userspace_addr;
> +
> +	partition->regions.count++;
> +
> +	return 0;
> +
> +err_unpin_pages:
> +	unpin_user_pages(pages, page_count - remaining);
> +	vfree(pages);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_unmap_memory(struct mshv_partition *partition,
> +				  struct mshv_user_mem_region __user *user_mem)
> +{
> +	struct mshv_user_mem_region mem;
> +	struct mshv_mem_region *region_ptr;
> +	int i;
> +	u64 page_count;
> +	long ret;
> +
> +	if (!partition->regions.count)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> +		return -EFAULT;
> +
> +	/* Find matching region */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		if (!partition->regions.slots[i].size)
> +			continue;
> +		region_ptr = &partition->regions.slots[i];
> +		if (region_ptr->userspace_addr == mem.userspace_addr &&
> +		    region_ptr->size == mem.size &&
> +		    region_ptr->guest_pfn == mem.guest_pfn)
> +			break;
> +	}
> +
> +	if (i == MSHV_MAX_MEM_REGIONS)
> +		return -EINVAL;
> +
> +	page_count = region_ptr->size >> PAGE_SHIFT;
> +	ret = hv_call_unmap_gpa_pages(partition->id, region_ptr->guest_pfn,
> +				      page_count, 0);
> +	if (ret)
> +		return ret;
> +
> +	unpin_user_pages(region_ptr->pages, page_count);
> +	vfree(region_ptr->pages);
> +	memset(region_ptr, 0, sizeof(*region_ptr));
> +	partition->regions.count--;
> +
> +	return 0;
> +}
> +
>  static long
>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> -	return -ENOTTY;
> +	struct mshv_partition *partition = filp->private_data;
> +	long ret;
> +
> +	if (mutex_lock_killable(&partition->mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_MAP_GUEST_MEMORY:
> +		ret = mshv_partition_ioctl_map_memory(partition,
> +							(void __user *)arg);
> +		break;
> +	case MSHV_UNMAP_GUEST_MEMORY:
> +		ret = mshv_partition_ioctl_unmap_memory(partition,
> +							(void __user *)arg);
> +		break;
> +	default:
> +		ret = -ENOTTY;
> +	}
> +
> +	mutex_unlock(&partition->mutex);
> +	return ret;
>  }
> 
>  static void
>  destroy_partition(struct mshv_partition *partition)
>  {
> -	unsigned long flags;
> +	unsigned long flags, page_count;
> +	struct mshv_mem_region *region;
>  	int i;
> 
>  	/* Remove from list of partitions */
> @@ -286,6 +592,16 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	hv_call_delete_partition(partition->id);
> 
> +	/* Remove regions and unpin the pages */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		region = &partition->regions.slots[i];
> +		if (!region->size)
> +			continue;
> +		page_count = region->size >> PAGE_SHIFT;
> +		unpin_user_pages(region->pages, page_count);
> +		vfree(region->pages);
> +	}
> +
>  	kfree(partition);
>  }
> 
> @@ -353,6 +669,8 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  	if (!partition)
>  		return -ENOMEM;
> 
> +	mutex_init(&partition->mutex);
> +
>  	fd = get_unused_fd_flags(O_CLOEXEC);
>  	if (fd < 0) {
>  		ret = fd;
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls
  2020-11-21  0:30 ` [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls Nuno Das Neves
@ 2021-02-08 19:47   ` Michael Kelley
  2021-03-09  1:39     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:47 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Add ioctls for getting and setting virtual processor registers.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |  11 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 601 ++++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |  65 +--
>  include/linux/mshv.h                    |   1 +
>  include/uapi/linux/mshv.h               |  12 +
>  virt/mshv/mshv_main.c                   | 258 +++++++++-
>  6 files changed, 903 insertions(+), 45 deletions(-)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index f997f49f8690..20a626ac02d4 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -96,3 +96,14 @@ is backed by physical memory.
>  Create a virtual processor in a guest partition, returning a file descriptor to
>  represent the vp and perform ioctls on.
> 
> +3.5 MSHV_GET_VP_REGISTERS and MSHV_SET_VP_REGISTERS
> +---------------------------------------------------
> +:Type: vp ioctl
> +:Parameters: struct mshv_vp_registers
> +:Returns: 0 on success
> +
> +Get/set vp registers. See asm/hyperv-tlfs.h for the complete set of registers.
> +Includes general purpose platform registers, MSRs, and virtual registers that
> +are part of Microsoft Hypervisor platform and not directly exposed to the guest.
> +
> +
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
> tlfs.h
> index 72150c25ffe6..2ff655962738 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -121,4 +121,605 @@ struct hv_partition_creation_properties {
>  		disabled_processor_xsave_features;
>  };
> 
> +enum hv_register_name {
> +	/* Suspend Registers */
> +	HV_REGISTER_EXPLICIT_SUSPEND		= 0x00000000,
> +	HV_REGISTER_INTERCEPT_SUSPEND		= 0x00000001,
> +	HV_REGISTER_INSTRUCTION_EMULATION_HINTS	= 0x00000002,
> +	HV_REGISTER_DISPATCH_SUSPEND		= 0x00000003,
> +	HV_REGISTER_INTERNAL_ACTIVITY_STATE	= 0x00000004,
> +
> +	/* Version */
> +	HV_REGISTER_HYPERVISOR_VERSION	= 0x00000100, /* 128-bit result same as CPUID 0x40000002 */
> +
> +	/* Feature Access (registers are 128 bits) - same as CPUID 0x40000003 - 0x4000000B */
> +	HV_REGISTER_PRIVILEGES_AND_FEATURES_INFO	= 0x00000200,
> +	HV_REGISTER_FEATURES_INFO			= 0x00000201,
> +	HV_REGISTER_IMPLEMENTATION_LIMITS_INFO		= 0x00000202,
> +	HV_REGISTER_HARDWARE_FEATURES_INFO		= 0x00000203,
> +	HV_REGISTER_CPU_MANAGEMENT_FEATURES_INFO	= 0x00000204,
> +	HV_REGISTER_SVM_FEATURES_INFO			= 0x00000205,
> +	HV_REGISTER_SKIP_LEVEL_FEATURES_INFO		= 0x00000206,
> +	HV_REGISTER_NESTED_VIRT_FEATURES_INFO		= 0x00000207,
> +	HV_REGISTER_IPT_FEATURES_INFO			= 0x00000208,
> +
> +	/* Guest Crash Registers */
> +	HV_REGISTER_GUEST_CRASH_P0	= 0x00000210,
> +	HV_REGISTER_GUEST_CRASH_P1	= 0x00000211,
> +	HV_REGISTER_GUEST_CRASH_P2	= 0x00000212,
> +	HV_REGISTER_GUEST_CRASH_P3	= 0x00000213,
> +	HV_REGISTER_GUEST_CRASH_P4	= 0x00000214,
> +	HV_REGISTER_GUEST_CRASH_CTL	= 0x00000215,
> +
> +	/* Power State Configuration */
> +	HV_REGISTER_POWER_STATE_CONFIG_C1	= 0x00000220,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C1	= 0x00000221,
> +	HV_REGISTER_POWER_STATE_CONFIG_C2	= 0x00000222,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C2	= 0x00000223,
> +	HV_REGISTER_POWER_STATE_CONFIG_C3	= 0x00000224,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C3	= 0x00000225,
> +
> +	/* Frequency Registers */
> +	HV_REGISTER_PROCESSOR_CLOCK_FREQUENCY	= 0x00000240,
> +	HV_REGISTER_INTERRUPT_CLOCK_FREQUENCY	= 0x00000241,
> +
> +	/* Idle Register */
> +	HV_REGISTER_GUEST_IDLE	= 0x00000250,
> +
> +	/* Guest Debug */
> +	HV_REGISTER_DEBUG_DEVICE_OPTIONS	= 0x00000260,
> +
> +	/* Memory Zeroing Conrol Register */
> +	HV_REGISTER_MEMORY_ZEROING_CONTROL	= 0x00000270,
> +
> +	/* Pending Event Register */
> +	HV_REGISTER_PENDING_EVENT0	= 0x00010004,
> +	HV_REGISTER_PENDING_EVENT1	= 0x00010005,
> +
> +	/* Misc */
> +	HV_REGISTER_VP_RUNTIME			= 0x00090000,
> +	HV_REGISTER_GUEST_OS_ID			= 0x00090002,
> +	HV_REGISTER_VP_INDEX			= 0x00090003,
> +	HV_REGISTER_TIME_REF_COUNT		= 0x00090004,
> +	HV_REGISTER_CPU_MANAGEMENT_VERSION	= 0x00090007,
> +	HV_REGISTER_VP_ASSIST_PAGE		= 0x00090013,
> +	HV_REGISTER_VP_ROOT_SIGNAL_COUNT	= 0x00090014,
> +	HV_REGISTER_REFERENCE_TSC		= 0x00090017,
> +
> +	/* Performance statistics Registers */
> +	HV_REGISTER_STATS_PARTITION_RETAIL	= 0x00090020,
> +	HV_REGISTER_STATS_PARTITION_INTERNAL	= 0x00090021,
> +	HV_REGISTER_STATS_VP_RETAIL		= 0x00090022,
> +	HV_REGISTER_STATS_VP_INTERNAL		= 0x00090023,
> +
> +	HV_REGISTER_NESTED_VP_INDEX	= 0x00091003,
> +
> +	/* Hypervisor-defined Registers (Synic) */
> +	HV_REGISTER_SINT0	= 0x000A0000,
> +	HV_REGISTER_SINT1	= 0x000A0001,
> +	HV_REGISTER_SINT2	= 0x000A0002,
> +	HV_REGISTER_SINT3	= 0x000A0003,
> +	HV_REGISTER_SINT4	= 0x000A0004,
> +	HV_REGISTER_SINT5	= 0x000A0005,
> +	HV_REGISTER_SINT6	= 0x000A0006,
> +	HV_REGISTER_SINT7	= 0x000A0007,
> +	HV_REGISTER_SINT8	= 0x000A0008,
> +	HV_REGISTER_SINT9	= 0x000A0009,
> +	HV_REGISTER_SINT10	= 0x000A000A,
> +	HV_REGISTER_SINT11	= 0x000A000B,
> +	HV_REGISTER_SINT12	= 0x000A000C,
> +	HV_REGISTER_SINT13	= 0x000A000D,
> +	HV_REGISTER_SINT14	= 0x000A000E,
> +	HV_REGISTER_SINT15	= 0x000A000F,
> +	HV_REGISTER_SCONTROL	= 0x000A0010,
> +	HV_REGISTER_SVERSION	= 0x000A0011,
> +	HV_REGISTER_SIFP	= 0x000A0012,
> +	HV_REGISTER_SIPP	= 0x000A0013,
> +	HV_REGISTER_EOM		= 0x000A0014,
> +	HV_REGISTER_SIRBP	= 0x000A0015,
> +
> +	HV_REGISTER_NESTED_SINT0	= 0x000A1000,
> +	HV_REGISTER_NESTED_SINT1	= 0x000A1001,
> +	HV_REGISTER_NESTED_SINT2	= 0x000A1002,
> +	HV_REGISTER_NESTED_SINT3	= 0x000A1003,
> +	HV_REGISTER_NESTED_SINT4	= 0x000A1004,
> +	HV_REGISTER_NESTED_SINT5	= 0x000A1005,
> +	HV_REGISTER_NESTED_SINT6	= 0x000A1006,
> +	HV_REGISTER_NESTED_SINT7	= 0x000A1007,
> +	HV_REGISTER_NESTED_SINT8	= 0x000A1008,
> +	HV_REGISTER_NESTED_SINT9	= 0x000A1009,
> +	HV_REGISTER_NESTED_SINT10	= 0x000A100A,
> +	HV_REGISTER_NESTED_SINT11	= 0x000A100B,
> +	HV_REGISTER_NESTED_SINT12	= 0x000A100C,
> +	HV_REGISTER_NESTED_SINT13	= 0x000A100D,
> +	HV_REGISTER_NESTED_SINT14	= 0x000A100E,
> +	HV_REGISTER_NESTED_SINT15	= 0x000A100F,
> +	HV_REGISTER_NESTED_SCONTROL	= 0x000A1010,
> +	HV_REGISTER_NESTED_SVERSION	= 0x000A1011,
> +	HV_REGISTER_NESTED_SIFP		= 0x000A1012,
> +	HV_REGISTER_NESTED_SIPP		= 0x000A1013,
> +	HV_REGISTER_NESTED_EOM		= 0x000A1014,
> +	HV_REGISTER_NESTED_SIRBP	= 0x000a1015,
> +
> +
> +	/* Hypervisor-defined Registers (Synthetic Timers) */
> +	HV_REGISTER_STIMER0_CONFIG		= 0x000B0000,
> +	HV_REGISTER_STIMER0_COUNT		= 0x000B0001,
> +	HV_REGISTER_STIMER1_CONFIG		= 0x000B0002,
> +	HV_REGISTER_STIMER1_COUNT		= 0x000B0003,
> +	HV_REGISTER_STIMER2_CONFIG		= 0x000B0004,
> +	HV_REGISTER_STIMER2_COUNT		= 0x000B0005,
> +	HV_REGISTER_STIMER3_CONFIG		= 0x000B0006,
> +	HV_REGISTER_STIMER3_COUNT		= 0x000B0007,
> +	HV_REGISTER_STIME_UNHALTED_TIMER_CONFIG	= 0x000B0100,
> +	HV_REGISTER_STIME_UNHALTED_TIMER_COUNT	= 0x000b0101,
> +
> +	/* Synthetic VSM registers */
> +
> +	/* 0x000D0000-1 are available for future use. */
> +	HV_REGISTER_VSM_CODE_PAGE_OFFSETS	= 0x000D0002,
> +	HV_REGISTER_VSM_VP_STATUS		= 0x000D0003,
> +	HV_REGISTER_VSM_PARTITION_STATUS	= 0x000D0004,
> +	HV_REGISTER_VSM_VINA			= 0x000D0005,
> +	HV_REGISTER_VSM_CAPABILITIES		= 0x000D0006,
> +	HV_REGISTER_VSM_PARTITION_CONFIG	= 0x000D0007,
> +
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL0	= 0x000D0010,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL1	= 0x000D0011,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL2	= 0x000D0012,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL3	= 0x000D0013,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL4	= 0x000D0014,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL5	= 0x000D0015,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL6	= 0x000D0016,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL7	= 0x000D0017,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL8	= 0x000D0018,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL9	= 0x000D0019,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL10	= 0x000D001A,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL11	= 0x000D001B,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL12	= 0x000D001C,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL13	= 0x000D001D,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL14	= 0x000D001E,
> +
> +	HV_REGISTER_VSM_VP_WAIT_FOR_TLB_LOCK	= 0x000D0020,
> +
> +	HV_REGISTER_ISOLATION_CAPABILITIES	= 0x000D0100,
> +
> +	/* Pending Interruption Register */
> +	HV_REGISTER_PENDING_INTERRUPTION	= 0x00010002,
> +
> +	/* Interrupt State register */
> +	HV_REGISTER_INTERRUPT_STATE	= 0x00010003,
> +
> +	/* Interruptible notification register */
> +	HV_X64_REGISTER_DELIVERABILITY_NOTIFICATIONS	= 0x00010006,
> +
> +	/* X64 User-Mode Registers */
> +	HV_X64_REGISTER_RAX	= 0x00020000,
> +	HV_X64_REGISTER_RCX	= 0x00020001,
> +	HV_X64_REGISTER_RDX	= 0x00020002,
> +	HV_X64_REGISTER_RBX	= 0x00020003,
> +	HV_X64_REGISTER_RSP	= 0x00020004,
> +	HV_X64_REGISTER_RBP	= 0x00020005,
> +	HV_X64_REGISTER_RSI	= 0x00020006,
> +	HV_X64_REGISTER_RDI	= 0x00020007,
> +	HV_X64_REGISTER_R8	= 0x00020008,
> +	HV_X64_REGISTER_R9	= 0x00020009,
> +	HV_X64_REGISTER_R10	= 0x0002000A,
> +	HV_X64_REGISTER_R11	= 0x0002000B,
> +	HV_X64_REGISTER_R12	= 0x0002000C,
> +	HV_X64_REGISTER_R13	= 0x0002000D,
> +	HV_X64_REGISTER_R14	= 0x0002000E,
> +	HV_X64_REGISTER_R15	= 0x0002000F,
> +	HV_X64_REGISTER_RIP	= 0x00020010,
> +	HV_X64_REGISTER_RFLAGS	= 0x00020011,
> +
> +	/* X64 Floating Point and Vector Registers */
> +	HV_X64_REGISTER_XMM0			= 0x00030000,
> +	HV_X64_REGISTER_XMM1			= 0x00030001,
> +	HV_X64_REGISTER_XMM2			= 0x00030002,
> +	HV_X64_REGISTER_XMM3			= 0x00030003,
> +	HV_X64_REGISTER_XMM4			= 0x00030004,
> +	HV_X64_REGISTER_XMM5			= 0x00030005,
> +	HV_X64_REGISTER_XMM6			= 0x00030006,
> +	HV_X64_REGISTER_XMM7			= 0x00030007,
> +	HV_X64_REGISTER_XMM8			= 0x00030008,
> +	HV_X64_REGISTER_XMM9			= 0x00030009,
> +	HV_X64_REGISTER_XMM10			= 0x0003000A,
> +	HV_X64_REGISTER_XMM11			= 0x0003000B,
> +	HV_X64_REGISTER_XMM12			= 0x0003000C,
> +	HV_X64_REGISTER_XMM13			= 0x0003000D,
> +	HV_X64_REGISTER_XMM14			= 0x0003000E,
> +	HV_X64_REGISTER_XMM15			= 0x0003000F,
> +	HV_X64_REGISTER_FP_MMX0			= 0x00030010,
> +	HV_X64_REGISTER_FP_MMX1			= 0x00030011,
> +	HV_X64_REGISTER_FP_MMX2			= 0x00030012,
> +	HV_X64_REGISTER_FP_MMX3			= 0x00030013,
> +	HV_X64_REGISTER_FP_MMX4			= 0x00030014,
> +	HV_X64_REGISTER_FP_MMX5			= 0x00030015,
> +	HV_X64_REGISTER_FP_MMX6			= 0x00030016,
> +	HV_X64_REGISTER_FP_MMX7			= 0x00030017,
> +	HV_X64_REGISTER_FP_CONTROL_STATUS	= 0x00030018,
> +	HV_X64_REGISTER_XMM_CONTROL_STATUS	= 0x00030019,
> +
> +	/* X64 Control Registers */
> +	HV_X64_REGISTER_CR0	= 0x00040000,
> +	HV_X64_REGISTER_CR2	= 0x00040001,
> +	HV_X64_REGISTER_CR3	= 0x00040002,
> +	HV_X64_REGISTER_CR4	= 0x00040003,
> +	HV_X64_REGISTER_CR8	= 0x00040004,
> +	HV_X64_REGISTER_XFEM	= 0x00040005,
> +
> +	/* X64 Intermediate Control Registers */
> +	HV_X64_REGISTER_INTERMEDIATE_CR0	= 0x00041000,
> +	HV_X64_REGISTER_INTERMEDIATE_CR4	= 0x00041003,
> +	HV_X64_REGISTER_INTERMEDIATE_CR8	= 0x00041004,
> +
> +	/* X64 Debug Registers */
> +	HV_X64_REGISTER_DR0	= 0x00050000,
> +	HV_X64_REGISTER_DR1	= 0x00050001,
> +	HV_X64_REGISTER_DR2	= 0x00050002,
> +	HV_X64_REGISTER_DR3	= 0x00050003,
> +	HV_X64_REGISTER_DR6	= 0x00050004,
> +	HV_X64_REGISTER_DR7	= 0x00050005,
> +
> +	/* X64 Segment Registers */
> +	HV_X64_REGISTER_ES	= 0x00060000,
> +	HV_X64_REGISTER_CS	= 0x00060001,
> +	HV_X64_REGISTER_SS	= 0x00060002,
> +	HV_X64_REGISTER_DS	= 0x00060003,
> +	HV_X64_REGISTER_FS	= 0x00060004,
> +	HV_X64_REGISTER_GS	= 0x00060005,
> +	HV_X64_REGISTER_LDTR	= 0x00060006,
> +	HV_X64_REGISTER_TR	= 0x00060007,
> +
> +	/* X64 Table Registers */
> +	HV_X64_REGISTER_IDTR	= 0x00070000,
> +	HV_X64_REGISTER_GDTR	= 0x00070001,
> +
> +	/* X64 Virtualized MSRs */
> +	HV_X64_REGISTER_TSC		= 0x00080000,
> +	HV_X64_REGISTER_EFER		= 0x00080001,
> +	HV_X64_REGISTER_KERNEL_GS_BASE	= 0x00080002,
> +	HV_X64_REGISTER_APIC_BASE	= 0x00080003,
> +	HV_X64_REGISTER_PAT		= 0x00080004,
> +	HV_X64_REGISTER_SYSENTER_CS	= 0x00080005,
> +	HV_X64_REGISTER_SYSENTER_EIP	= 0x00080006,
> +	HV_X64_REGISTER_SYSENTER_ESP	= 0x00080007,
> +	HV_X64_REGISTER_STAR		= 0x00080008,
> +	HV_X64_REGISTER_LSTAR		= 0x00080009,
> +	HV_X64_REGISTER_CSTAR		= 0x0008000A,
> +	HV_X64_REGISTER_SFMASK		= 0x0008000B,
> +	HV_X64_REGISTER_INITIAL_APIC_ID	= 0x0008000C,
> +
> +	/* X64 Cache control MSRs */
> +	HV_X64_REGISTER_MSR_MTRR_CAP		= 0x0008000D,
> +	HV_X64_REGISTER_MSR_MTRR_DEF_TYPE	= 0x0008000E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0	= 0x00080010,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1	= 0x00080011,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2	= 0x00080012,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3	= 0x00080013,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4	= 0x00080014,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5	= 0x00080015,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6	= 0x00080016,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7	= 0x00080017,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8	= 0x00080018,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9	= 0x00080019,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA	= 0x0008001A,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB	= 0x0008001B,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC	= 0x0008001C,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASED	= 0x0008001D,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE	= 0x0008001E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF	= 0x0008001F,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0	= 0x00080040,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1	= 0x00080041,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2	= 0x00080042,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3	= 0x00080043,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4	= 0x00080044,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5	= 0x00080045,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6	= 0x00080046,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7	= 0x00080047,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8	= 0x00080048,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9	= 0x00080049,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA	= 0x0008004A,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB	= 0x0008004B,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC	= 0x0008004C,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD	= 0x0008004D,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE	= 0x0008004E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF	= 0x0008004F,
> +	HV_X64_REGISTER_MSR_MTRR_FIX64K00000	= 0x00080070,
> +	HV_X64_REGISTER_MSR_MTRR_FIX16K80000	= 0x00080071,
> +	HV_X64_REGISTER_MSR_MTRR_FIX16KA0000	= 0x00080072,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KC0000	= 0x00080073,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KC8000	= 0x00080074,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KD0000	= 0x00080075,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KD8000	= 0x00080076,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KE0000	= 0x00080077,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KE8000	= 0x00080078,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KF0000	= 0x00080079,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KF8000	= 0x0008007A,
> +
> +	HV_X64_REGISTER_TSC_AUX		= 0x0008007B,
> +	HV_X64_REGISTER_BNDCFGS		= 0x0008007C,
> +	HV_X64_REGISTER_DEBUG_CTL	= 0x0008007D,
> +
> +	/* Available */
> +	HV_X64_REGISTER_AVAILABLE0008007E	= 0x0008007E,
> +	HV_X64_REGISTER_AVAILABLE0008007F	= 0x0008007F,
> +
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL0	= 0x00080080,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL1	= 0x00080081,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL2	= 0x00080082,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL3	= 0x00080083,
> +	HV_X64_REGISTER_SPEC_CTRL		= 0x00080084,
> +	HV_X64_REGISTER_PRED_CMD		= 0x00080085,
> +	HV_X64_REGISTER_VIRT_SPEC_CTRL		= 0x00080086,
> +
> +	/* Other MSRs */
> +	HV_X64_REGISTER_MSR_IA32_MISC_ENABLE		= 0x000800A0,
> +	HV_X64_REGISTER_IA32_FEATURE_CONTROL		= 0x000800A1,
> +	HV_X64_REGISTER_IA32_VMX_BASIC			= 0x000800A2,
> +	HV_X64_REGISTER_IA32_VMX_PINBASED_CTLS		= 0x000800A3,
> +	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS		= 0x000800A4,
> +	HV_X64_REGISTER_IA32_VMX_EXIT_CTLS		= 0x000800A5,
> +	HV_X64_REGISTER_IA32_VMX_ENTRY_CTLS		= 0x000800A6,
> +	HV_X64_REGISTER_IA32_VMX_MISC			= 0x000800A7,
> +	HV_X64_REGISTER_IA32_VMX_CR0_FIXED0		= 0x000800A8,
> +	HV_X64_REGISTER_IA32_VMX_CR0_FIXED1		= 0x000800A9,
> +	HV_X64_REGISTER_IA32_VMX_CR4_FIXED0		= 0x000800AA,
> +	HV_X64_REGISTER_IA32_VMX_CR4_FIXED1		= 0x000800AB,
> +	HV_X64_REGISTER_IA32_VMX_VMCS_ENUM		= 0x000800AC,
> +	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS2	= 0x000800AD,
> +	HV_X64_REGISTER_IA32_VMX_EPT_VPID_CAP		= 0x000800AE,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_PINBASED_CTLS	= 0x000800AF,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_PROCBASED_CTLS	= 0x000800B0,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_EXIT_CTLS		= 0x000800B1,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_ENTRY_CTLS	= 0x000800B2,
> +
> +	/* Performance monitoring MSRs */
> +	HV_X64_REGISTER_PERF_GLOBAL_CTRL	= 0x00081000,
> +	HV_X64_REGISTER_PERF_GLOBAL_STATUS	= 0x00081001,
> +	HV_X64_REGISTER_PERF_GLOBAL_IN_USE	= 0x00081002,
> +	HV_X64_REGISTER_FIXED_CTR_CTRL		= 0x00081003,
> +	HV_X64_REGISTER_DS_AREA			= 0x00081004,
> +	HV_X64_REGISTER_PEBS_ENABLE		= 0x00081005,
> +	HV_X64_REGISTER_PEBS_LD_LAT		= 0x00081006,
> +	HV_X64_REGISTER_PEBS_FRONTEND		= 0x00081007,
> +	HV_X64_REGISTER_PERF_EVT_SEL0		= 0x00081100,
> +	HV_X64_REGISTER_PMC0			= 0x00081200,
> +	HV_X64_REGISTER_FIXED_CTR0		= 0x00081300,
> +
> +	HV_X64_REGISTER_LBR_TOS		= 0x00082000,
> +	HV_X64_REGISTER_LBR_SELECT	= 0x00082001,
> +	HV_X64_REGISTER_LER_FROM_LIP	= 0x00082002,
> +	HV_X64_REGISTER_LER_TO_LIP	= 0x00082003,
> +	HV_X64_REGISTER_LBR_FROM0	= 0x00082100,
> +	HV_X64_REGISTER_LBR_TO0		= 0x00082200,
> +	HV_X64_REGISTER_LBR_INFO0	= 0x00083300,
> +
> +	/* Intel processor trace MSRs */
> +	HV_X64_REGISTER_RTIT_CTL		= 0x00081008,
> +	HV_X64_REGISTER_RTIT_STATUS		= 0x00081009,
> +	HV_X64_REGISTER_RTIT_OUTPUT_BASE	= 0x0008100A,
> +	HV_X64_REGISTER_RTIT_OUTPUT_MASK_PTRS	= 0x0008100B,
> +	HV_X64_REGISTER_RTIT_CR3_MATCH		= 0x0008100C,
> +	HV_X64_REGISTER_RTIT_ADDR0A		= 0x00081400,
> +
> +	/* RtitAddr0A/B - RtitAddr3A/B occupy 0x00081400-0x00081407. */
> +
> +	/* X64 Apic registers. These match the equivalent x2APIC MSR offsets. */
> +	HV_X64_REGISTER_APIC_ID		= 0x00084802,
> +	HV_X64_REGISTER_APIC_VERSION	= 0x00084803,
> +
> +	/* Hypervisor-defined registers (Misc) */
> +	HV_X64_REGISTER_HYPERCALL	= 0x00090001,
> +
> +	/* X64 Virtual APIC registers synthetic MSRs */
> +	HV_X64_REGISTER_SYNTHETIC_EOI	= 0x00090010,
> +	HV_X64_REGISTER_SYNTHETIC_ICR	= 0x00090011,
> +	HV_X64_REGISTER_SYNTHETIC_TPR	= 0x00090012,
> +
> +	/* Partition Timer Assist Registers */
> +	HV_X64_REGISTER_EMULATED_TIMER_PERIOD	= 0x00090030,
> +	HV_X64_REGISTER_EMULATED_TIMER_CONTROL	= 0x00090031,
> +	HV_X64_REGISTER_PM_TIMER_ASSIST		= 0x00090032,
> +
> +	/* Intercept Control Registers */
> +	HV_X64_REGISTER_CR_INTERCEPT_CONTROL			= 0x000E0000,
> +	HV_X64_REGISTER_CR_INTERCEPT_CR0_MASK			= 0x000E0001,
> +	HV_X64_REGISTER_CR_INTERCEPT_CR4_MASK			= 0x000E0002,
> +	HV_X64_REGISTER_CR_INTERCEPT_IA32_MISC_ENABLE_MASK	= 0x000E0003,
> +
> +};
> +
> +struct hv_u128 {
> +	__u64 high_part;
> +	__u64 low_part;
> +};
> +
> +union hv_x64_fp_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		__u64 mantissa;
> +		__u64 biased_exponent : 15;
> +		__u64 sign : 1;
> +		__u64 reserved : 48;
> +	};
> +};
> +
> +union hv_x64_fp_control_status_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		__u16 fp_control;
> +		__u16 fp_status;
> +		__u8 fp_tag;
> +		__u8 reserved;
> +		__u16 last_fp_op;
> +		union {
> +			/* long mode */
> +			__u64 last_fp_rip;
> +			/* 32 bit mode */
> +			struct {
> +				__u32 last_fp_eip;
> +				__u16 last_fp_cs;
> +			};
> +		};
> +	};
> +};
> +
> +union hv_x64_xmm_control_status_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		union {
> +			/* long mode */
> +			__u64 last_fp_rdp;
> +			/* 32 bit mode */
> +			struct {
> +				__u32 last_fp_dp;
> +				__u16 last_fp_ds;
> +			};
> +		};
> +		__u32 xmm_status_control;
> +		__u32 xmm_status_control_mask;
> +	};
> +};
> +
> +struct hv_x64_segment_register {
> +	__u64 base;
> +	__u32 limit;
> +	__u16 selector;
> +	union {
> +		struct {
> +			__u16 segment_type : 4;
> +			__u16 non_system_segment : 1;
> +			__u16 descriptor_privilege_level : 2;
> +			__u16 present : 1;
> +			__u16 reserved : 4;
> +			__u16 available : 1;
> +			__u16 _long : 1;
> +			__u16 _default : 1;
> +			__u16 granularity : 1;
> +		};
> +		__u16 attributes;
> +	};
> +};
> +
> +struct hv_x64_table_register {
> +	__u16 pad[3];
> +	__u16 limit;
> +	__u64 base;
> +};
> +
> +union hv_explicit_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_intercept_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_dispatch_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_x64_interrupt_state_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 interrupt_shadow : 1;
> +		__u64 nmi_masked : 1;
> +		__u64 reserved : 62;
> +	};
> +};
> +
> +union hv_x64_pending_interruption_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u32 interruption_pending : 1;
> +		__u32 interruption_type : 3;
> +		__u32 deliver_error_code : 1;
> +		__u32 instruction_length : 4;
> +		__u32 nested_event : 1;
> +		__u32 reserved : 6;
> +		__u32 interruption_vector : 16;
> +		__u32 error_code;
> +	};
> +};
> +
> +union hv_x64_msr_npiep_config_contents {
> +	__u64 as_uint64;
> +	struct {
> +		/*
> +		 * These bits enable instruction execution prevention for
> +		 * specific instructions.
> +		 */
> +		__u64 prevents_gdt : 1;
> +		__u64 prevents_idt : 1;
> +		__u64 prevents_ldt : 1;
> +		__u64 prevents_tr : 1;
> +
> +		/* The reserved bits must always be 0. */
> +		__u64 reserved : 60;
> +	};
> +};
> +
> +union hv_x64_pending_exception_event {
> +	__u64 as_uint64[2];
> +	struct {
> +		__u32 event_pending : 1;
> +		__u32 event_type : 3;
> +		__u32 reserved0 : 4;
> +		__u32 deliver_error_code : 1;
> +		__u32 reserved1 : 7;
> +		__u32 vector : 16;
> +		__u32 error_code;
> +		__u64 exception_parameter;
> +	};
> +};
> +
> +union hv_x64_pending_virtualization_fault_event {
> +	__u64 as_uint64[2];
> +	struct {
> +		__u32 event_pending : 1;
> +		__u32 event_type : 3;
> +		__u32 reserved0 : 4;
> +		__u32 reserved1 : 8;
> +		__u32 parameter0 : 16;
> +		__u32 code;
> +		__u64 parameter1;
> +	};
> +};
> +
> +union hv_register_value {
> +	struct hv_u128 reg128;
> +	__u64 reg64;
> +	__u32 reg32;
> +	__u16 reg16;
> +	__u8 reg8;
> +	union hv_x64_fp_register fp;
> +	union hv_x64_fp_control_status_register fp_control_status;
> +	union hv_x64_xmm_control_status_register xmm_control_status;
> +	struct hv_x64_segment_register segment;
> +	struct hv_x64_table_register table;
> +	union hv_explicit_suspend_register explicit_suspend;
> +	union hv_intercept_suspend_register intercept_suspend;
> +	union hv_dispatch_suspend_register dispatch_suspend;
> +	union hv_x64_interrupt_state_register interrupt_state;
> +	union hv_x64_pending_interruption_register pending_interruption;
> +	union hv_x64_msr_npiep_config_contents npiep_config;
> +	union hv_x64_pending_exception_event pending_exception_event;
> +	union hv_x64_pending_virtualization_fault_event
> +		pending_virtualization_fault_event;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 6e5072e29897..b9295400c20b 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -622,53 +622,30 @@ struct hv_retarget_device_interrupt {
>  } __packed __aligned(8);
> 
> 
> -/* HvGetVpRegisters hypercall input with variable size reg name list*/
> -struct hv_get_vp_registers_input {
> -	struct {
> -		u64 partitionid;
> -		u32 vpindex;
> -		u8  inputvtl;
> -		u8  padding[3];
> -	} header;
> -	struct input {
> -		u32 name0;
> -		u32 name1;
> -	} element[];
> -} __packed;
> -
> +/* HvGetVpRegisters hypercall with variable size reg name list*/
> +struct hv_get_vp_registers {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8  input_vtl;
> +	u8  rsvd_z8;
> +	u16 rsvd_z16;
> +	__aligned(8) enum hv_register_name names[];
> +} __aligned(8);
> 
> -/* HvGetVpRegisters returns an array of these output elements */
> -struct hv_get_vp_registers_output {
> -	union {
> -		struct {
> -			u32 a;
> -			u32 b;
> -			u32 c;
> -			u32 d;
> -		} as32 __packed;
> -		struct {
> -			u64 low;
> -			u64 high;
> -		} as64 __packed;
> -	};
> +/* HvSetVpRegisters hypercall with variable size reg name/value list*/
> +struct hv_register_assoc {
> +	enum hv_register_name name;
> +	__aligned(16) union hv_register_value value;
>  };
> 
> -/* HvSetVpRegisters hypercall with variable size reg name/value list*/
> -struct hv_set_vp_registers_input {
> -	struct {
> -		u64 partitionid;
> -		u32 vpindex;
> -		u8  inputvtl;
> -		u8  padding[3];
> -	} header;
> -	struct {
> -		u32 name;
> -		u32 padding1;
> -		u64 padding2;
> -		u64 valuelow;
> -		u64 valuehigh;
> -	} element[];
> -} __packed;
> +struct hv_set_vp_registers {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8  input_vtl;
> +	u8  rsvd_z8;
> +	u16 rsvd_z16;
> +	struct hv_register_assoc elements[];
> +} __aligned(16);

Throughout these structures, I think the approach needs to be more
explicit about the memory layout.  The current definitions assume that
the compiler is inserting padding in the expected places, and not in
any unexpected places.  My previous concerns about use of enum
also apply.

The code also removes some layouts that are used in the
not-yet-accepted patches for ARM64.   Let sync on how to get
those back in.

> 
>  enum hv_device_type {
>  	HV_DEVICE_TYPE_LOGICAL = 0,
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index 50521c5f7948..dfe469f573f9 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -17,6 +17,7 @@
>  struct mshv_vp {
>  	u32 index;
>  	struct mshv_partition *partition;
> +	struct mutex mutex;
>  };
> 
>  struct mshv_mem_region {
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 1f053eae68a6..5d53ed655429 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -33,6 +33,14 @@ struct mshv_create_vp {
>  	__u32 vp_index;
>  };
> 
> +#define MSHV_VP_MAX_REGISTERS	128
> +
> +struct mshv_vp_registers {
> +	int count; /* at most MSHV_VP_MAX_REGISTERS */
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +};

Having separate arrays for the names and values results in an extra
copy of the data down in the ioctl code.  Any reason the caller couldn't
supply the data as an array, where each entry is already a name/value
pair?

> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
> @@ -44,4 +52,8 @@ struct mshv_create_vp {
>  #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct
> mshv_user_mem_region)
>  #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
> 
> +/* vp device */
> +#define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct
> mshv_vp_registers)
> +#define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 3be9d9a468c1..2a10137a1e84 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -74,6 +74,12 @@ static struct miscdevice mshv_dev = {
>  #define HV_MAP_GPA_BATCH_SIZE	\
>  		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
>  #define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
> +#define HV_GET_REGISTER_BATCH_SIZE	\
> +	(PAGE_SIZE / \
> +	 sizeof(struct hv_get_vp_registers) / sizeof(enum hv_register_name))
> +#define HV_SET_REGISTER_BATCH_SIZE	\
> +	(PAGE_SIZE / \
> +	 sizeof(struct hv_set_vp_registers) / sizeof(struct hv_register_assoc))

These new size calculations have the same bug as HV_MAP_GPA_BATCH_SIZE.
The first divide operations should be subtraction.

With the correct calculation, HV_GET_REGISTER_BATCH_SIZE  will be
too large.  The input page will accommodate more 32 bit register names
than the output page will accommodate 128 bit register values.  The limit
should be based on the latter, not the former.  Or calculate both the
input and output limit and use the minimum.

> 
>  static int
>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> @@ -380,10 +386,258 @@ hv_call_unmap_gpa_pages(u64 partition_id,
>  	return ret;
>  }
> 
> +static int
> +hv_call_get_vp_registers(u32 vp_index,
> +			 u64 partition_id,
> +			 u16 count,
> +			 const enum hv_register_name *names,
> +			 union hv_register_value *values)
> +{
> +	struct hv_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = (struct hv_get_vp_registers *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +	output_page = (union hv_register_value *)(*this_cpu_ptr(
> +		hyperv_pcpu_output_arg));
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl = 0;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->names, names,
> +			sizeof(enum hv_register_name) * rep_count);
> +
> +		hypercall_status =
> +			hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> +					    0, input_page, output_page);
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +		memcpy(values, output_page,
> +			sizeof(union hv_register_value) * completed);
> +
> +		names += completed;
> +		values += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static int
> +hv_call_set_vp_registers(u32 vp_index,
> +			 u64 partition_id,
> +			 u16 count,
> +			 struct hv_register_assoc *registers)
> +{
> +	struct hv_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = (struct hv_set_vp_registers *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl = 0;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +			sizeof(struct hv_register_assoc) * rep_count);
> +
> +		hypercall_status =
> +			hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> +					    0, input_page, NULL);
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static long
> +mshv_vp_ioctl_get_regs(struct mshv_vp *vp, void __user *user_args)
> +{
> +	struct mshv_vp_registers args;
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.count > MSHV_VP_MAX_REGISTERS)
> +		return -EINVAL;
> +
> +	names = kmalloc_array(args.count,
> +			      sizeof(enum hv_register_name),
> +			      GFP_KERNEL);
> +	if (!names)
> +		return -ENOMEM;
> +
> +	values = kmalloc_array(args.count,
> +			       sizeof(union hv_register_value),
> +			       GFP_KERNEL);
> +	if (!values) {
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	if (copy_from_user(names, args.names,
> +			   sizeof(enum hv_register_name) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	ret = hv_call_get_vp_registers(vp->index, vp->partition->id,
> +				       args.count, names, values);
> +	if (ret)
> +		goto free_return;
> +
> +	if (copy_to_user(args.values, values,
> +			 sizeof(union hv_register_value) * args.count)) {
> +		ret = -EFAULT;
> +	}
> +
> +free_return:
> +	kfree(names);
> +	kfree(values);
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
> +{
> +	int i;
> +	struct mshv_vp_registers args;
> +	struct hv_register_assoc *registers;
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.count > MSHV_VP_MAX_REGISTERS)
> +		return -EINVAL;
> +
> +	names = kmalloc_array(args.count,
> +			      sizeof(enum hv_register_name),
> +			      GFP_KERNEL);
> +	if (!names)
> +		return -ENOMEM;
> +
> +	values = kmalloc_array(args.count,
> +			       sizeof(union hv_register_value),
> +			       GFP_KERNEL);
> +	if (!values) {
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	registers = kmalloc_array(args.count,
> +				  sizeof(struct hv_register_assoc),
> +				  GFP_KERNEL);
> +	if (!registers) {
> +		kfree(values);
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	if (copy_from_user(names, args.names,
> +			   sizeof(enum hv_register_name) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	if (copy_from_user(values, args.values,
> +			   sizeof(union hv_register_value) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	for (i = 0; i < args.count; i++) {
> +		memcpy(&registers[i].name, &names[i],
> +		       sizeof(enum hv_register_name));
> +		memcpy(&registers[i].value, &values[i],
> +		       sizeof(union hv_register_value));
> +	}

The above will result in uninitialized memory being sent to
Hyper-V, since there is implicit padding associated with the
32 bit name field.

> +
> +	ret = hv_call_set_vp_registers(vp->index, vp->partition->id,
> +				       args.count, registers);
> +
> +free_return:
> +	kfree(names);
> +	kfree(values);
> +	kfree(registers);
> +	return ret;
> +}
> +
> +
>  static long
>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> -	return -ENOTTY;
> +	struct mshv_vp *vp = filp->private_data;
> +	long r = 0;
> +
> +	if (mutex_lock_killable(&vp->mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_GET_VP_REGISTERS:
> +		r = mshv_vp_ioctl_get_regs(vp, (void __user *)arg);
> +		break;
> +	case MSHV_SET_VP_REGISTERS:
> +		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
> +		break;
> +	default:
> +		r = -ENOTTY;
> +		break;
> +	}
> +	mutex_unlock(&vp->mutex);
> +
> +	return r;
>  }
> 
>  static int
> @@ -420,6 +674,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>  	if (!vp)
>  		return -ENOMEM;
> 
> +	mutex_init(&vp->mutex);
> +
>  	vp->index = args.vp_index;
>  	vp->partition = mshv_partition_get(partition);
>  	if (!vp->partition) {
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
  2020-11-21  0:30 ` [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages Nuno Das Neves
@ 2021-02-08 19:47   ` Michael Kelley
  2021-03-11 19:37     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:47 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> 
> Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
> and hv_synic_disable_regs().
> Setting up synic registers in both vmbus driver and mshv would clobber
> them, but the vmbus driver will not run in the root partition, so this
> is safe.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |  46 +----
>  include/linux/mshv.h                    |   1 +
>  include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
>  virt/mshv/mshv_main.c                   |  98 ++++++++-
>  6 files changed, 404 insertions(+), 77 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 4cd44ae9bffb..c34a6bb4f457 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
> 
> -
> -/* Define hypervisor message types. */
> -enum hv_message_type {
> -	HVMSG_NONE			= 0x00000000,
> -
> -	/* Memory access messages. */
> -	HVMSG_UNMAPPED_GPA		= 0x80000000,
> -	HVMSG_GPA_INTERCEPT		= 0x80000001,
> -
> -	/* Timer notification messages. */
> -	HVMSG_TIMER_EXPIRED		= 0x80000010,
> -
> -	/* Error messages. */
> -	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
> -	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
> -	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
> -
> -	/* Trace buffer complete messages. */
> -	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
> -
> -	/* Platform-specific processor intercept messages. */
> -	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
> -	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
> -	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
> -	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
> -	HVMSG_X64_APIC_EOI		= 0x80010004,
> -	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
> -};
> -
>  struct hv_nested_enlightenments_control {
>  	struct {
>  		__u32 directhypercall:1;
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
> tlfs.h
> index 2ff655962738..c6a27053f791 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -722,4 +722,268 @@ union hv_register_value {
>  		pending_virtualization_fault_event;
>  };
> 
> +/* Define hypervisor message types. */
> +enum hv_message_type {
> +	HVMSG_NONE				= 0x00000000,
> +
> +	/* Memory access messages. */
> +	HVMSG_UNMAPPED_GPA			= 0x80000000,
> +	HVMSG_GPA_INTERCEPT			= 0x80000001,
> +
> +	/* Timer notification messages. */
> +	HVMSG_TIMER_EXPIRED			= 0x80000010,
> +
> +	/* Error messages. */
> +	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
> +	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
> +	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
> +
> +	/* Trace buffer complete messages. */
> +	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
> +
> +	/* Platform-specific processor intercept messages. */
> +	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
> +	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
> +	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
> +	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
> +	HVMSG_X64_APIC_EOI			= 0x80010004,
> +	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
> +	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
> +	HVMSG_X64_HALT				= 0x80010007,
> +	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
> +	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
> +};

I have a separate patch series that moves this enum to the
asm-generic portion of hyperv-tlfs.h because there's not a good way
to separate the arch neutral from arch dependent values.

> +
> +
> +union hv_x64_vp_execution_state {
> +	__u16 as_uint16;
> +	struct {
> +		__u16 cpl:2;
> +		__u16 cr0_pe:1;
> +		__u16 cr0_am:1;
> +		__u16 efer_lma:1;
> +		__u16 debug_active:1;
> +		__u16 interruption_pending:1;
> +		__u16 vtl:4;
> +		__u16 enclave_mode:1;
> +		__u16 interrupt_shadow:1;
> +		__u16 virtualization_fault_active:1;
> +		__u16 reserved:2;
> +	};
> +};
> +
> +/* Values for intercept_access_type field */
> +#define HV_INTERCEPT_ACCESS_READ	0
> +#define HV_INTERCEPT_ACCESS_WRITE	1
> +#define HV_INTERCEPT_ACCESS_EXECUTE	2
> +
> +struct hv_x64_intercept_message_header {
> +	__u32 vp_index;
> +	__u8 instruction_length:4;
> +	__u8 cr8:4; // only set for exo partitions
> +	__u8 intercept_access_type;
> +	union hv_x64_vp_execution_state execution_state;
> +	struct hv_x64_segment_register cs_segment;
> +	__u64 rip;
> +	__u64 rflags;
> +};
> +
> +#define HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS 6
> +
> +struct hv_x64_hypercall_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u64 rax;
> +	__u64 rbx;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 r8;
> +	__u64 rsi;
> +	__u64 rdi;
> +	struct hv_u128 xmmregisters[HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS];
> +	struct {
> +		__u32 isolated:1;
> +		__u32 reserved:31;
> +	};
> +};
> +
> +union hv_x64_register_access_info {
> +	union hv_register_value source_value;
> +	enum hv_register_name destination_register;
> +	__u64 source_address;
> +	__u64 destination_address;
> +};
> +
> +struct hv_x64_register_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	struct {
> +		__u8 is_memory_op:1;
> +		__u8 reserved:7;
> +	};
> +	__u8 reserved8;
> +	__u16 reserved16;
> +	enum hv_register_name register_name;
> +	union hv_x64_register_access_info access_info;
> +};
> +
> +union hv_x64_memory_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 gva_gpa_valid:1;
> +		__u8 hypercall_output_pending:1;
> +		__u8 tlb_locked_no_overlay:1;
> +		__u8 reserved:4;
> +	};
> +};
> +
> +union hv_x64_io_port_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 access_size:3;
> +		__u8 string_op:1;
> +		__u8 rep_prefix:1;
> +		__u8 reserved:3;
> +	};
> +};
> +
> +union hv_x64_exception_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 error_code_valid:1;
> +		__u8 software_exception:1;
> +		__u8 reserved:6;
> +	};
> +};
> +
> +enum hv_cache_type {
> +	HV_CACHE_TYPE_UNCACHED	   = 0,
> +	HV_CACHE_TYPE_WRITE_COMBINING = 1,
> +	HV_CACHE_TYPE_WRITE_THROUGH   = 4,
> +	HV_CACHE_TYPE_WRITE_PROTECTED = 5,
> +	HV_CACHE_TYPE_WRITE_BACK	  = 6
> +};
> +
> +struct hv_x64_memory_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	enum hv_cache_type cache_type;
> +	__u8 instruction_byte_count;
> +	union hv_x64_memory_access_info memory_access_info;
> +	__u8 tpr_priority;
> +	__u8 reserved1;
> +	__u64 guest_virtual_address;
> +	__u64 guest_physical_address;
> +	__u8 instruction_bytes[16];
> +};
> +
> +struct hv_x64_cpuid_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u64 rax;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 rbx;
> +	__u64 default_result_rax;
> +	__u64 default_result_rcx;
> +	__u64 default_result_rdx;
> +	__u64 default_result_rbx;
> +};
> +
> +struct hv_x64_msr_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 msr_number;
> +	__u32 reserved;
> +	__u64 rdx;
> +	__u64 rax;
> +};
> +
> +struct hv_x64_io_port_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u16 port_number;
> +	union hv_x64_io_port_access_info access_info;
> +	__u8 instruction_byte_count;
> +	__u32 reserved;
> +	__u64 rax;
> +	__u8 instruction_bytes[16];
> +	struct hv_x64_segment_register ds_segment;
> +	struct hv_x64_segment_register es_segment;
> +	__u64 rcx;
> +	__u64 rsi;
> +	__u64 rdi;
> +};
> +
> +struct hv_x64_exception_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u16 exception_vector;
> +	union hv_x64_exception_info exception_info;
> +	__u8 instruction_byte_count;
> +	__u32 error_code;
> +	__u64 exception_parameter;
> +	__u64 reserved;
> +	__u8 instruction_bytes[16];
> +	struct hv_x64_segment_register ds_segment;
> +	struct hv_x64_segment_register ss_segment;
> +	__u64 rax;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 rbx;

Is the above the correct ordering (rax, rcd, rdx, rbx)?
It's just what you would expect ....

> +	__u64 rsp;
> +	__u64 rbp;
> +	__u64 rsi;
> +	__u64 rdi;
> +	__u64 r8;
> +	__u64 r9;
> +	__u64 r10;
> +	__u64 r11;
> +	__u64 r12;
> +	__u64 r13;
> +	__u64 r14;
> +	__u64 r15;
> +};
> +
> +struct hv_x64_invalid_vp_register_message {
> +	__u32 vp_index;
> +	__u32 reserved;
> +};
> +
> +struct hv_x64_unrecoverable_exception_message {
> +	struct hv_x64_intercept_message_header header;
> +};
> +
> +enum hv_x64_unsupported_feature_code {
> +	hv_unsupported_feature_intercept = 1,
> +	hv_unsupported_feature_task_switch_tss = 2
> +};
> +
> +struct hv_x64_unsupported_feature_message {
> +	__u32 vp_index;
> +	enum hv_x64_unsupported_feature_code feature_code;
> +	__u64 feature_parameter;
> +};
> +
> +struct hv_x64_halt_message {
> +	struct hv_x64_intercept_message_header header;
> +};
> +
> +enum hv_x64_pending_interruption_type {
> +	HV_X64_PENDING_INTERRUPT	= 0,
> +	HV_X64_PENDING_NMI		= 2,
> +	HV_X64_PENDING_EXCEPTION	= 3
> +};
> +
> +struct hv_x64_interruption_deliverable_message {
> +	struct hv_x64_intercept_message_header header;
> +	enum hv_x64_pending_interruption_type deliverable_type;
> +	__u32 rsvd;
> +};
> +
> +struct hv_x64_sipi_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 target_vp_index;
> +	__u32 interrupt_vector;
> +};
> +
> +struct hv_x64_apic_eoi_message {
> +	__u32 vp_index;
> +	__u32 interrupt_vector;
> +};

Same comments as before about enum types, not depending
on the compiler to add padding, and marking as __packed.

> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index b9295400c20b..e0185c3872a9 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -241,6 +241,8 @@ static inline const char *hv_status_to_string(enum hv_status status)
>  /* Valid SynIC vectors are 16-255. */
>  #define HV_SYNIC_FIRST_VALID_VECTOR	(16)
> 
> +#define HV_SYNIC_INTERCEPTION_SINT_INDEX 0x00000000
> +
>  #define HV_SYNIC_CONTROL_ENABLE		(1ULL << 0)
>  #define HV_SYNIC_SIMP_ENABLE		(1ULL << 0)
>  #define HV_SYNIC_SIEFP_ENABLE		(1ULL << 0)
> @@ -250,49 +252,6 @@ static inline const char *hv_status_to_string(enum hv_status
> status)
> 
>  #define HV_SYNIC_STIMER_COUNT		(4)
> 
> -/* Define synthetic interrupt controller message constants. */
> -#define HV_MESSAGE_SIZE			(256)
> -#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
> -#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
> -
> -/* Define synthetic interrupt controller message flags. */
> -union hv_message_flags {
> -	__u8 asu8;
> -	struct {
> -		__u8 msg_pending:1;
> -		__u8 reserved:7;
> -	} __packed;
> -};
> -
> -/* Define port identifier type. */
> -union hv_port_id {
> -	__u32 asu32;
> -	struct {
> -		__u32 id:24;
> -		__u32 reserved:8;
> -	} __packed u;
> -};
> -
> -/* Define synthetic interrupt controller message header. */
> -struct hv_message_header {
> -	__u32 message_type;
> -	__u8 payload_size;
> -	union hv_message_flags message_flags;
> -	__u8 reserved[2];
> -	union {
> -		__u64 sender;
> -		union hv_port_id port;
> -	};
> -} __packed;
> -
> -/* Define synthetic interrupt controller message format. */
> -struct hv_message {
> -	struct hv_message_header header;
> -	union {
> -		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
> -	} u;
> -} __packed;
> -
>  /* Define the synthetic interrupt message page layout. */
>  struct hv_message_page {
>  	struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
> @@ -306,7 +265,6 @@ struct hv_timer_message_payload {
>  	__u64 delivery_time;	/* When the message was delivered */
>  } __packed;
> 
> -
>  /* Define synthetic interrupt controller flag constants. */
>  #define HV_EVENT_FLAGS_COUNT		(256 * 8)
>  #define HV_EVENT_FLAGS_LONG_COUNT	(256 / sizeof(unsigned long))
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index dfe469f573f9..7709aaa1e064 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -42,6 +42,7 @@ struct mshv_partition {
>  };
> 
>  struct mshv {
> +	struct hv_message_page __percpu **synic_message_page;
>  	struct {
>  		spinlock_t lock;
>  		u64 count;
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
> tlfs.h
> index e7b09b9f00de..e87389054b68 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -6,6 +6,49 @@
>  #define BIT(X)	(1ULL << (X))
>  #endif
> 
> +/* Define synthetic interrupt controller message constants. */
> +#define HV_MESSAGE_SIZE			(256)
> +#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
> +#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
> +
> +/* Define synthetic interrupt controller message flags. */
> +union hv_message_flags {
> +	__u8 asu8;
> +	struct {
> +		__u8 msg_pending:1;
> +		__u8 reserved:7;
> +	};
> +};
> +
> +/* Define port identifier type. */
> +union hv_port_id {
> +	__u32 asu32;
> +	struct {
> +		__u32 id:24;
> +		__u32 reserved:8;
> +	} u;
> +};
> +
> +/* Define synthetic interrupt controller message header. */
> +struct hv_message_header {
> +	enum hv_message_type message_type;
> +	__u8 payload_size;
> +	union hv_message_flags message_flags;
> +	__u8 reserved[2];
> +	union {
> +		__u64 sender;
> +		union hv_port_id port;
> +	};
> +};
> +
> +/* Define synthetic interrupt controller message format. */
> +struct hv_message {
> +	struct hv_message_header header;
> +	union {
> +		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
> +	} u;
> +};
> +
>  /* Userspace-visible partition creation flags */
>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 2a10137a1e84..c9445d2edb37 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -15,6 +15,8 @@
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/cpuhotplug.h>
>  #include <linux/mshv.h>
>  #include <asm/mshyperv.h>
> 
> @@ -1152,23 +1154,111 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
> 
> +static int
> +mshv_synic_init(unsigned int cpu)
> +{
> +	union hv_synic_simp simp;
> +	union hv_synic_sint sint;
> +	union hv_synic_scontrol sctrl;
> +	struct hv_message_page **msg_page =
> +			this_cpu_ptr(mshv.synic_message_page);
> +
> +	/* Setup the Synic's message page */
> +	hv_get_simp(simp.as_uint64);
> +	simp.simp_enabled = true;
> +	*msg_page = memremap(simp.base_simp_gpa << PAGE_SHIFT,
> +			     PAGE_SIZE, MEMREMAP_WB);

Use HV_HYP_PAGE_SHIFT and HV_HYP_PAGE_SIZE.

> +	if (!msg_page) {
> +		pr_err("%s: memremap failed\n", __func__);
> +		return -EFAULT;
> +	}
> +	hv_set_simp(simp.as_uint64);
> +
> +	/* Enable intercepts */
> +	sint.as_uint64 = 0;
> +	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.masked = false;
> +	sint.auto_eoi = hv_recommend_using_aeoi();
> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +
> +	/* Enable global synic bit */
> +	hv_get_synic_state(sctrl.as_uint64);
> +	sctrl.enable = 1;
> +	hv_set_synic_state(sctrl.as_uint64);
> +
> +	return 0;
> +}
> +
> +static int
> +mshv_synic_cleanup(unsigned int cpu)
> +{
> +	union hv_synic_sint sint;
> +	union hv_synic_simp simp;
> +	union hv_synic_scontrol sctrl;
> +	struct hv_message_page **msg_page =
> +			this_cpu_ptr(mshv.synic_message_page);
> +
> +	/* Disable the interrupt */
> +	hv_get_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +	sint.masked = true;
> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +
> +	/* Disable Synic's message page */
> +	hv_get_simp(simp.as_uint64);
> +	simp.simp_enabled = false;
> +	hv_set_simp(simp.as_uint64);
> +	memunmap(*msg_page);
> +
> +	/* Disable global synic bit */
> +	hv_get_synic_state(sctrl.as_uint64);
> +	sctrl.enable = 0;
> +	hv_set_synic_state(sctrl.as_uint64);
> +
> +	return 0;
> +}
> +
> +static int mshv_cpuhp_online;
> +
>  static int
>  __init mshv_init(void)
>  {
> -	int r;
> +	int ret;

Ideally, change the name of the variable in the earlier patch so this
one isn't cluttered with the change.

> 
> -	r = misc_register(&mshv_dev);
> -	if (r)
> +	ret = misc_register(&mshv_dev);
> +	if (ret) {
>  		pr_err("%s: misc device register failed\n", __func__);
> +		return ret;
> +	}
> +	spin_lock_init(&mshv.partitions.lock);
> 
> +	mshv.synic_message_page = alloc_percpu(struct hv_message_page *);
> +	if (!mshv.synic_message_page) {
> +		pr_err("%s: failed to allocate percpu synic page\n", __func__);
> +		misc_deregister(&mshv_dev);
> +		return -ENOMEM;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> +				mshv_synic_init,
> +				mshv_synic_cleanup);
> +	if (ret < 0) {
> +		pr_err("%s: failed to setup cpu hotplug state: %i\n",
> +		       __func__, ret);
> +		return ret;
> +	}
> +
> +	mshv_cpuhp_online = ret;
>  	spin_lock_init(&mshv.partitions.lock);

It looks like the spin lock is being initialized twice.

> 
> -	return r;
> +	return 0;
>  }
> 
>  static void
>  __exit mshv_exit(void)
>  {
> +	cpuhp_remove_state(mshv_cpuhp_online);
> +	free_percpu(mshv.synic_message_page);
> +
>  	misc_deregister(&mshv_dev);
>  }
> 
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
  2020-11-21  0:30 ` [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls Nuno Das Neves
@ 2021-02-08 19:48   ` Michael Kelley
  2021-03-11 23:38     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:48 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> To: linux-hyperv@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org; linux-kernel@vger.kernel.org; Michael Kelley
> <mikelley@microsoft.com>; viremana@linux.microsoft.com; Sunil Muthuswamy
> <sunilmut@microsoft.com>; nunodasneves@linux.microsoft.com; wei.liu@kernel.org;
> Lillian Grassin-Drake <Lillian.GrassinDrake@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>
> Subject: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
> 
> Introduce ioctls for getting and setting guest vcpu emulated LAPIC
> state, and xsave data.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |   8 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h |  59 ++++++
>  include/asm-generic/hyperv-tlfs.h       |  41 ++++
>  include/uapi/asm-generic/hyperv-tlfs.h  |  28 +++
>  include/uapi/linux/mshv.h               |  13 ++
>  virt/mshv/mshv_main.c                   | 262 ++++++++++++++++++++++++
>  6 files changed, 411 insertions(+)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 694f978131f9..7fd75f248eff 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -140,4 +140,12 @@ Assert interrupts in partitions that use Microsoft Hypervisor's
> internal
>  emulated LAPIC. This must be enabled on partition creation with the flag:
>  HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
> 
> +3.9 MSHV_GET_VP_STATE and MSHV_SET_VP_STATE
> +--------------------------
> +:Type: vp ioctl
> +:Parameters: struct mshv_vp_state
> +:Returns: 0 on success
> +
> +Get/set various vp state. Currently these can be used to get and set
> +emulated LAPIC state, and xsave data.
> 
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
> tlfs.h
> index 5478d4943bfc..78758aedf23e 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -1051,4 +1051,63 @@ union hv_interrupt_control {
>  	__u64 as_uint64;
>  };
> 
> +struct hv_local_interrupt_controller_state {
> +	__u32 apic_id;
> +	__u32 apic_version;
> +	__u32 apic_ldr;
> +	__u32 apic_dfr;
> +	__u32 apic_spurious;
> +	__u32 apic_isr[8];
> +	__u32 apic_tmr[8];
> +	__u32 apic_irr[8];
> +	__u32 apic_esr;
> +	__u32 apic_icr_high;
> +	__u32 apic_icr_low;
> +	__u32 apic_lvt_timer;
> +	__u32 apic_lvt_thermal;
> +	__u32 apic_lvt_perfmon;
> +	__u32 apic_lvt_lint0;
> +	__u32 apic_lvt_lint1;
> +	__u32 apic_lvt_error;
> +	__u32 apic_lvt_cmci;
> +	__u32 apic_error_status;
> +	__u32 apic_initial_count;
> +	__u32 apic_counter_value;
> +	__u32 apic_divide_configuration;
> +	__u32 apic_remote_read;
> +};
> +
> +#define HV_XSAVE_DATA_NO_XMM_REGISTERS 1
> +
> +union hv_x64_xsave_xfem_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u32 low_uint32;
> +		__u32 high_uint32;
> +	};
> +	struct {
> +		__u64 legacy_x87: 1;
> +		__u64 legacy_sse: 1;
> +		__u64 avx: 1;
> +		__u64 mpx_bndreg: 1;
> +		__u64 mpx_bndcsr: 1;
> +		__u64 avx_512_op_mask: 1;
> +		__u64 avx_512_zmmhi: 1;
> +		__u64 avx_512_zmm16_31: 1;
> +		__u64 rsvd8_9: 2;
> +		__u64 pasid: 1;
> +		__u64 cet_u: 1;
> +		__u64 cet_s: 1;
> +		__u64 rsvd13_16: 4;
> +		__u64 xtile_cfg: 1;
> +		__u64 xtile_data: 1;
> +		__u64 rsvd19_63: 45;
> +	};
> +};
> +
> +struct hv_vp_state_data_xsave {
> +	__u64 flags;
> +	union hv_x64_xsave_xfem_register states;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2cd46241c545..4bc59a0344ce 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -167,6 +167,9 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
> +#define HVCALL_MAP_VP_STATE_PAGE			0x00e1
> +#define HVCALL_GET_VP_STATE				0x00e3
> +#define HVCALL_SET_VP_STATE				0x00e4
> 
>  #define HV_FLUSH_ALL_PROCESSORS			BIT(0)
>  #define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES	BIT(1)
> @@ -796,4 +799,42 @@ struct hv_assert_virtual_interrupt {
>  	u16 rsvd_z1;
>  };
> 
> +struct hv_vp_state_data {
> +	enum hv_get_set_vp_state_type type;
> +	u32 rsvd;
> +	struct hv_vp_state_data_xsave xsave;
> +
> +};
> +
> +struct hv_get_vp_state_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8 input_vtl;
> +	u8 rsvd0;
> +	u16 rsvd1;
> +	struct hv_vp_state_data state_data;
> +	u64 output_data_pfns[];
> +};
> +
> +union hv_get_vp_state_out {
> +	struct hv_local_interrupt_controller_state interrupt_controller_state;
> +	/* Not supported yet */
> +	/* struct hv_synthetic_timers_state synthetic_timers_state; */
> +};
> +
> +union hv_input_set_vp_state_data {
> +	u64 pfns;
> +	u8 bytes;
> +};
> +
> +struct hv_set_vp_state_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8 input_vtl;
> +	u8 rsvd0;
> +	u16 rsvd1;
> +	struct hv_vp_state_data state_data;
> +	union hv_input_set_vp_state_data data[];
> +};
> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
> tlfs.h
> index e87389054b68..b3c84c69b73f 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -64,4 +64,32 @@ struct hv_message {
>  #define HV_MAP_GPA_EXECUTABLE           0xC
>  #define HV_MAP_GPA_PERMISSIONS_MASK     0xF
> 
> +/*
> + * For getting and setting VP state, there are two options based on the state type:
> + *
> + *     1.) Data that is accessed by PFNs in the input hypercall page. This is used
> + *         for state which may not fit into the hypercall pages.
> + *     2.) Data that is accessed directly in the input\output hypercall pages.
> + *         This is used for state that will always fit into the hypercall pages.
> + *
> + * In the future this could be dynamic based on the size if needed.
> + *
> + * Note these hypercalls have an 8-byte aligned variable header size as per the tlfs
> + */
> +
> +#define HV_GET_SET_VP_STATE_TYPE_PFN	BIT(31)
> +
> +enum hv_get_set_vp_state_type {
> +	HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE = 0,
> +
> +	HV_GET_SET_VP_STATE_XSAVE		= 1 |
> HV_GET_SET_VP_STATE_TYPE_PFN,
> +	/* Synthetic message page */
> +	HV_GET_SET_VP_STATE_SIM_PAGE		= 2 |
> HV_GET_SET_VP_STATE_TYPE_PFN,
> +	/* Synthetic interrupt event flags page. */
> +	HV_GET_SET_VP_STATE_SIEF_PAGE		= 3 |
> HV_GET_SET_VP_STATE_TYPE_PFN,
> +
> +	/* Synthetic timers. */
> +	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
> +};
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index faed9d065bb7..ae0bb64bbec3 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -53,6 +53,17 @@ struct mshv_assert_interrupt {
>  	__u32 vector;
>  };
> 
> +struct mshv_vp_state {
> +	enum hv_get_set_vp_state_type type;
> +	struct hv_vp_state_data_xsave xsave; /* only for xsave request */
> +
> +	__u64 buf_size; /* If xsave, must be page-aligned */
> +	union {
> +		struct hv_local_interrupt_controller_state *lapic;
> +		__u8 *bytes; /* Xsave data. must be page-aligned */
> +	} buf;
> +};
> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
> @@ -70,5 +81,7 @@ struct mshv_assert_interrupt {
>  #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct
> mshv_vp_registers)
>  #define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
>  #define MSHV_RUN_VP		_IOR(MSHV_IOCTL, 0x07, struct hv_message)
> +#define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
> +#define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
> 
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 9cf236ade50a..70172d9488de 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -864,6 +864,262 @@ mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user
> *user_args)
>  	return ret;
>  }
> 
> +static int
> +hv_call_get_vp_state(u32 vp_index,
> +		     u64 partition_id,
> +		     enum hv_get_set_vp_state_type type,
> +		     struct hv_vp_state_data_xsave xsave,
> +		    /* Choose between pages and ret_output */
> +		     u64 page_count,
> +		     struct page **pages,
> +		     union hv_get_vp_state_out *ret_output)
> +{
> +	struct hv_get_vp_state_in *input;
> +	union hv_get_vp_state_out *output;
> +	int status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
> +		return -EINVAL;

Nit:  Stylistically, you are handling this differently from the BATCH_SIZE
macros, which are essentially doing the same thing of calculating
how many entries will fit in the input page.   Note to use
HV_HYP_PAGE_SIZE.

> +
> +	if (!page_count && !ret_output)
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_get_vp_state_in *)
> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
> +		output = (union hv_get_vp_state_out *)
> +				(*this_cpu_ptr(hyperv_pcpu_output_arg));
> +		memset(input, 0, sizeof(*input));
> +		memset(output, 0, sizeof(*output));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data.type = type;
> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
> +		for (i = 0; i < page_count; i++)
> +			input->output_data_pfns[i] =
> +				page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
> +
> +		control = (HVCALL_GET_VP_STATE) |
> +			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, output) &
> +			 HV_HYPERCALL_RESULT_MASK;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			else if (ret_output)
> +				memcpy(ret_output, output, sizeof(*output));
> +
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_set_vp_state(u32 vp_index,
> +		     u64 partition_id,
> +		     enum hv_get_set_vp_state_type type,
> +		     struct hv_vp_state_data_xsave xsave,
> +		    /* Choose between pages and bytes */
> +		     u64 page_count,
> +		     struct page **pages,
> +		     u32 num_bytes,
> +		     u8 *bytes)
> +{
> +	struct hv_set_vp_state_in *input;
> +	int status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +	u16 varhead_sz;
> +
> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)

Same comment as above.

> +		return -EINVAL;
> +	if (sizeof(*input) + num_bytes > PAGE_SIZE)

Use HV_HYP_PAGE_SIZE.

> +		return -EINVAL;
> +
> +	if (num_bytes)
> +		/* round up to 8 and divide by 8 */
> +		varhead_sz = (num_bytes + 7) >> 3;
> +	else if (page_count)
> +		varhead_sz =  page_count;
> +	else
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_set_vp_state_in *)
> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
> +		memset(input, 0, sizeof(*input));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data.type = type;
> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
> +		if (num_bytes) {
> +			memcpy((u8 *)input->data, bytes, num_bytes);
> +		} else {
> +			for (i = 0; i < page_count; i++)
> +				input->data[i].pfns =
> +					page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;

Same comment as in earlier patch about GPA_MASK.  Also, this doesn't work
if PAGE_SIZE != HV_HYP_PAGE_SIZE, though it may be fine to not handle that case
for now.

> +		}
> +
> +		control = (HVCALL_SET_VP_STATE) |
> +			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, NULL) &
> +			 HV_HYPERCALL_RESULT_MASK;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
> +				struct mshv_vp_state *args,
> +				bool is_set)
> +{
> +	u64 page_count, remaining;
> +	int completed;
> +	struct page **pages;
> +	long ret;
> +	unsigned long u_buf;
> +
> +	/* Buffer must be page aligned */
> +	if (args->buf_size & (PAGE_SIZE - 1) ||
> +	    (u64)args->buf.bytes & (PAGE_SIZE - 1))
> +		return -EINVAL;

Use PAGE_ALIGNED macro.

> +
> +	if (!access_ok(args->buf.bytes, args->buf_size))
> +		return -EFAULT;
> +
> +	/* Pin user pages so hypervisor can copy directly to them */
> +	page_count = args->buf_size >> PAGE_SHIFT;
> +	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	remaining = page_count;
> +	u_buf = (unsigned long)args->buf.bytes;
> +	while (remaining) {
> +		completed = pin_user_pages_fast(
> +				u_buf,
> +				remaining,
> +				FOLL_WRITE,
> +				&pages[page_count - remaining]);
> +		if (completed < 0) {
> +			pr_err("%s: failed to pin user pages error %i\n",
> +			       __func__, completed);
> +			ret = completed;
> +			goto unpin_pages;
> +		}
> +		remaining -= completed;
> +		u_buf += completed * PAGE_SIZE;
> +	}
> +
> +	if (is_set)
> +		ret = hv_call_set_vp_state(vp->index,
> +					   vp->partition->id,
> +					   args->type, args->xsave,
> +					   page_count, pages,
> +					   0, NULL);
> +	else
> +		ret = hv_call_get_vp_state(vp->index,
> +					   vp->partition->id,
> +					   args->type, args->xsave,
> +					   page_count, pages,
> +					   NULL);
> +
> +unpin_pages:
> +	unpin_user_pages(pages, page_count - remaining);
> +	kfree(pages);
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp, void __user *user_args, bool is_set)
> +{
> +	struct mshv_vp_state args;
> +	long ret = 0;
> +	union hv_get_vp_state_out vp_state;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	/* For now just support these */
> +	if (args.type != HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE &&
> +	    args.type != HV_GET_SET_VP_STATE_XSAVE)
> +		return -EINVAL;
> +
> +	/* If we need to pin pfns, delegate to helper */
> +	if (args.type & HV_GET_SET_VP_STATE_TYPE_PFN)
> +		return mshv_vp_ioctl_get_set_state_pfn(vp, &args, is_set);
> +
> +	if (args.buf_size < sizeof(vp_state))
> +		return -EINVAL;
> +
> +	if (is_set) {
> +		if (copy_from_user(
> +				&vp_state,
> +				args.buf.lapic,
> +				sizeof(vp_state)))
> +			return -EFAULT;
> +
> +		return hv_call_set_vp_state(vp->index,
> +					    vp->partition->id,
> +					    args.type, args.xsave,
> +					    0, NULL,
> +					    sizeof(vp_state),
> +					    (u8 *)&vp_state);
> +	}
> +
> +	ret = hv_call_get_vp_state(vp->index,
> +				   vp->partition->id,
> +				   args.type, args.xsave,
> +				   0, NULL,
> +				   &vp_state);
> +
> +	if (ret)
> +		return ret;
> +
> +	if (copy_to_user(args.buf.lapic,
> +			 &vp_state.interrupt_controller_state,
> +			 sizeof(vp_state.interrupt_controller_state)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> 
>  static long
>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> @@ -884,6 +1140,12 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long
> arg)
>  	case MSHV_SET_VP_REGISTERS:
>  		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
>  		break;
> +	case MSHV_GET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
> +		break;
> +	case MSHV_SET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
> +		break;
>  	default:
>  		r = -ENOTTY;
>  		break;
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 16/18] virt/mshv: mmap vp register page
  2020-11-21  0:30 ` [RFC PATCH 16/18] virt/mshv: mmap vp register page Nuno Das Neves
@ 2021-02-08 19:49   ` Michael Kelley
  2021-03-25 17:36     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Michael Kelley @ 2021-02-08 19:49 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> 
> Introduce mmap interface for a virtual processor, exposing a page for
> setting and getting common registers while the VP is suspended.
> 
> This provides a more performant and convenient way to get and set these
> registers in the context of a vmm's run-loop.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         | 11 ++++
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 74 ++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       | 10 +++
>  include/linux/mshv.h                    |  1 +
>  include/uapi/asm-generic/hyperv-tlfs.h  |  5 ++
>  include/uapi/linux/mshv.h               | 12 ++++
>  virt/mshv/mshv_main.c                   | 82 +++++++++++++++++++++++++
>  7 files changed, 195 insertions(+)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 7fd75f248eff..89c276a8778f 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -149,3 +149,14 @@ HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
>  Get/set various vp state. Currently these can be used to get and set
>  emulated LAPIC state, and xsave data.
> 
> +3.10 mmap(vp)
> +-------------
> +:Type: vp mmap
> +:Parameters: offset should be HV_VP_MMAP_REGISTERS_OFFSET
> +:Returns: 0 on success
> +
> +Maps a page into userspace that can be used to get and set common registers
> +while the vp is suspended.
> +The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
> +

I'm assuming there's no support for the corresponding munmap().
What happens if munmap is called?  Does it just fail and the page remains
mapped?

> +
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
> tlfs.h
> index 78758aedf23e..a241178567ff 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -1110,4 +1110,78 @@ struct hv_vp_state_data_xsave {
>  	union hv_x64_xsave_xfem_register states;
>  };
> 
> +/* Bits for dirty mask of hv_vp_register_page */
> +#define HV_X64_REGISTER_CLASS_GENERAL	0
> +#define HV_X64_REGISTER_CLASS_IP	1
> +#define HV_X64_REGISTER_CLASS_XMM	2
> +#define HV_X64_REGISTER_CLASS_SEGMENT	3
> +#define HV_X64_REGISTER_CLASS_FLAGS	4
> +
> +#define HV_VP_REGISTER_PAGE_VERSION_1	1u
> +
> +struct hv_vp_register_page {
> +	__u16 version;
> +	bool isvalid;

Like enum, avoid type "bool" in data structures shared with
Hyper-V.

> +	__u8 rsvdz;
> +	__u32 dirty;
> +	union {
> +		struct {
> +			__u64 rax;
> +			__u64 rcx;
> +			__u64 rdx;
> +			__u64 rbx;
> +			__u64 rsp;
> +			__u64 rbp;
> +			__u64 rsi;
> +			__u64 rdi;
> +			__u64 r8;
> +			__u64 r9;
> +			__u64 r10;
> +			__u64 r11;
> +			__u64 r12;
> +			__u64 r13;
> +			__u64 r14;
> +			__u64 r15;
> +		};
> +
> +		__u64 gp_registers[16];
> +	};
> +	__u64 rip;
> +	__u64 rflags;
> +	union {
> +		struct {
> +			struct hv_u128 xmm0;
> +			struct hv_u128 xmm1;
> +			struct hv_u128 xmm2;
> +			struct hv_u128 xmm3;
> +			struct hv_u128 xmm4;
> +			struct hv_u128 xmm5;
> +		};
> +
> +		struct hv_u128 xmm_registers[6];
> +	};
> +	union {
> +		struct {
> +			struct hv_x64_segment_register es;
> +			struct hv_x64_segment_register cs;
> +			struct hv_x64_segment_register ss;
> +			struct hv_x64_segment_register ds;
> +			struct hv_x64_segment_register fs;
> +			struct hv_x64_segment_register gs;
> +		};
> +
> +		struct hv_x64_segment_register segment_registers[6];
> +	};
> +	/* read only */
> +	__u64 cr0;
> +	__u64 cr3;
> +	__u64 cr4;
> +	__u64 cr8;
> +	__u64 efer;
> +	__u64 dr7;
> +	union hv_x64_pending_interruption_register pending_interruption;
> +	union hv_x64_interrupt_state_register interrupt_state;
> +	__u64 instruction_emulation_hints;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 4bc59a0344ce..9eed4b869110 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -837,4 +837,14 @@ struct hv_set_vp_state_in {
>  	union hv_input_set_vp_state_data data[];
>  };
> 
> +struct hv_map_vp_state_page_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	enum hv_vp_state_page_type type;
> +};
> +
> +struct hv_map_vp_state_page_out {
> +	u64 map_location; /* page number */
> +};
> +
>  #endif
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index 3933d80294f1..33f4d0cfee11 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -20,6 +20,7 @@ struct mshv_vp {
>  	u32 index;
>  	struct mshv_partition *partition;
>  	struct mutex mutex;
> +	struct page *register_page;
>  	struct {
>  		struct semaphore sem;
>  		struct task_struct *task;
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
> tlfs.h
> index b3c84c69b73f..a747f39b132a 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -92,4 +92,9 @@ enum hv_get_set_vp_state_type {
>  	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
>  };
> 
> +enum hv_vp_state_page_type {
> +	HV_VP_STATE_PAGE_REGISTERS = 0,
> +	HV_VP_STATE_PAGE_COUNT
> +};
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index ae0bb64bbec3..8537ff29aee5 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -13,6 +13,8 @@
> 
>  #define MSHV_VERSION	0x0
> 
> +#define MSHV_VP_MMAP_REGISTERS_OFFSET (HV_VP_STATE_PAGE_REGISTERS * 0x1000)
> +
>  struct mshv_create_partition {
>  	__u64 flags;
>  	struct hv_partition_creation_properties partition_creation_properties;
> @@ -84,4 +86,14 @@ struct mshv_vp_state {
>  #define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
>  #define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
> 
> +/* register page mapping example:
> + * struct hv_vp_register_page *regs = mmap(NULL,
> + *					   4096,
> + *					   PROT_READ | PROT_WRITE,
> + *					   MAP_SHARED,
> + *					   vp_fd,
> + *					   HV_VP_MMAP_REGISTERS_OFFSET);
> + * munmap(regs, 4096);
> + */
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 70172d9488de..a597254fa4f4 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -43,11 +43,18 @@ static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
> unsigned
>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
> +
> +static const struct vm_operations_struct mshv_vp_vm_ops = {
> +	.fault = mshv_vp_fault,
> +};
> 
>  static const struct file_operations mshv_vp_fops = {
>  	.release = mshv_vp_release,
>  	.unlocked_ioctl = mshv_vp_ioctl,
>  	.llseek = noop_llseek,
> +	.mmap = mshv_vp_mmap,
>  };
> 
>  static const struct file_operations mshv_partition_fops = {
> @@ -499,6 +506,47 @@ hv_call_set_vp_registers(u32 vp_index,
>  	return -hv_status_to_errno(status);
>  }
> 
> +static int
> +hv_call_map_vp_state_page(u32 vp_index, u64 partition_id,
> +			  struct page **state_page)
> +{
> +	struct hv_map_vp_state_page_in *input;
> +	struct hv_map_vp_state_page_out *output;
> +	int status;
> +	int ret;
> +	unsigned long flags;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_map_vp_state_page_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output = (struct hv_map_vp_state_page_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->type = HV_VP_STATE_PAGE_REGISTERS;
> +		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE,
> +						   input, output);
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status == HV_STATUS_SUCCESS)
> +				*state_page = pfn_to_page(output->map_location);
> +			else
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
>  static void
>  mshv_isr(void)
>  {
> @@ -1155,6 +1203,40 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long
> arg)
>  	return r;
>  }
> 
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
> +{
> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
> +
> +	vmf->page = vp->register_page;
> +
> +	return 0;
> +}
> +
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct mshv_vp *vp = file->private_data;
> +
> +	if (vma->vm_pgoff != MSHV_VP_MMAP_REGISTERS_OFFSET)
> +		return -EINVAL;
> +
> +	if (mutex_lock_killable(&vp->mutex))
> +		return -EINTR;
> +
> +	if (!vp->register_page) {
> +		ret = hv_call_map_vp_state_page(vp->index,
> +						vp->partition->id,
> +						&vp->register_page);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	mutex_unlock(&vp->mutex);
> +
> +	vma->vm_ops = &mshv_vp_vm_ops;
> +	return 0;
> +}
> +
>  static int
>  mshv_vp_release(struct inode *inode, struct file *filp)
>  {
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes
  2020-11-21  0:30 ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Nuno Das Neves
@ 2021-02-09 13:04   ` Vitaly Kuznetsov
  2021-03-04 18:24     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-02-09 13:04 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> Return linux-friendly error codes from hypercall wrapper functions.
> This will be needed in the mshv module.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_proc.c         | 30 ++++++++++++++++++++++++++---
>  arch/x86/include/asm/mshyperv.h   |  1 +
>  include/asm-generic/hyperv-tlfs.h | 32 +++++++++++++++++++++----------
>  3 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
> index 0fd972c9129a..8f86f8e86748 100644
> --- a/arch/x86/hyperv/hv_proc.c
> +++ b/arch/x86/hyperv/hv_proc.c
> @@ -18,6 +18,30 @@
>  #define HV_DEPOSIT_MAX_ORDER (8)
>  #define HV_DEPOSIT_MAX (1 << HV_DEPOSIT_MAX_ORDER)
>  
> +int hv_status_to_errno(int hv_status)
> +{
> +	switch (hv_status) {
> +	case HV_STATUS_SUCCESS:
> +		return 0;
> +	case HV_STATUS_INVALID_PARAMETER:
> +	case HV_STATUS_UNKNOWN_PROPERTY:
> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
> +	case HV_STATUS_INVALID_VP_INDEX:
> +	case HV_STATUS_INVALID_REGISTER_VALUE:
> +	case HV_STATUS_INVALID_LP_INDEX:
> +		return EINVAL;
> +	case HV_STATUS_ACCESS_DENIED:
> +	case HV_STATUS_OPERATION_DENIED:
> +		return EACCES;
> +	case HV_STATUS_NOT_ACKNOWLEDGED:
> +	case HV_STATUS_INVALID_VP_STATE:
> +	case HV_STATUS_INVALID_PARTITION_STATE:
> +		return EBADFD;
> +	}
> +	return ENOTRECOVERABLE;
> +}
> +EXPORT_SYMBOL_GPL(hv_status_to_errno);
> +
>  /*
>   * Deposits exact number of pages
>   * Must be called with interrupts enabled
> @@ -99,7 +123,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>  
>  	if (status != HV_STATUS_SUCCESS) {
>  		pr_err("Failed to deposit pages: %d\n", status);
> -		ret = status;
> +		ret = -hv_status_to_errno(status);

"-hv_status_to_errno" looks weird, could we just return
'-EINVAL'/'-EACCES'/... from hv_status_to_errno() instead?

>  		goto err_free_allocations;
>  	}
>  
> @@ -155,7 +179,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>  			if (status != HV_STATUS_SUCCESS) {
>  				pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
>  				       lp_index, apic_id, status);
> -				ret = status;
> +				ret = -hv_status_to_errno(status);
>  			}
>  			break;
>  		}
> @@ -203,7 +227,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  			if (status != HV_STATUS_SUCCESS) {
>  				pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
>  				       vp_index, flags, status);
> -				ret = status;
> +				ret = -hv_status_to_errno(status);
>  			}
>  			break;
>  		}
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index cbee72550a12..eb75faa4d4c5 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -243,6 +243,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
>  int hyperv_fill_flush_guest_mapping_list(
>  		struct hv_guest_mapping_flush_list *flush,
>  		u64 start_gfn, u64 end_gfn);
> +int hv_status_to_errno(int hv_status);
>  
>  extern bool hv_root_partition;
>  
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index dd385c6a71b5..445244192fa4 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -181,16 +181,28 @@ enum HV_GENERIC_SET_FORMAT {
>  #define HV_HYPERCALL_REP_START_MASK	GENMASK_ULL(59, 48)
>  
>  /* hypercall status code */
> -#define HV_STATUS_SUCCESS			0
> -#define HV_STATUS_INVALID_HYPERCALL_CODE	2
> -#define HV_STATUS_INVALID_HYPERCALL_INPUT	3
> -#define HV_STATUS_INVALID_ALIGNMENT		4
> -#define HV_STATUS_INVALID_PARAMETER		5
> -#define HV_STATUS_OPERATION_DENIED		8
> -#define HV_STATUS_INSUFFICIENT_MEMORY		11
> -#define HV_STATUS_INVALID_PORT_ID		17
> -#define HV_STATUS_INVALID_CONNECTION_ID		18
> -#define HV_STATUS_INSUFFICIENT_BUFFERS		19
> +#define HV_STATUS_SUCCESS			0x0
> +#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x3
> +#define HV_STATUS_INVALID_ALIGNMENT		0x4
> +#define HV_STATUS_INVALID_PARAMETER		0x5
> +#define HV_STATUS_ACCESS_DENIED			0x6
> +#define HV_STATUS_INVALID_PARTITION_STATE	0x7
> +#define HV_STATUS_OPERATION_DENIED		0x8
> +#define HV_STATUS_UNKNOWN_PROPERTY		0x9
> +#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
> +#define HV_STATUS_INSUFFICIENT_MEMORY		0xB
> +#define HV_STATUS_INVALID_PARTITION_ID		0xD
> +#define HV_STATUS_INVALID_VP_INDEX		0xE
> +#define HV_STATUS_NOT_FOUND			0x10
> +#define HV_STATUS_INVALID_PORT_ID		0x11
> +#define HV_STATUS_INVALID_CONNECTION_ID		0x12
> +#define HV_STATUS_INSUFFICIENT_BUFFERS		0x13
> +#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
> +#define HV_STATUS_INVALID_VP_STATE		0x15
> +#define HV_STATUS_NO_RESOURCES			0x1D
> +#define HV_STATUS_INVALID_LP_INDEX		0x41
> +#define HV_STATUS_INVALID_REGISTER_VALUE	0x50
>  
>  /*
>   * The Hyper-V TimeRefCount register and the TSC

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2020-11-21  0:30 ` [RFC PATCH 04/18] virt/mshv: request version ioctl Nuno Das Neves
  2021-02-08 19:41   ` Michael Kelley
@ 2021-02-09 13:11   ` Vitaly Kuznetsov
  2021-03-04 18:43     ` Nuno Das Neves
  1 sibling, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-02-09 13:11 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> Reserve ioctl number in userpsace-api/ioctl/ioctl-number.rst
> Introduce MSHV_REQUEST_VERSION ioctl.
> Introduce documentation for /dev/mshv in Documentation/virt/mshv
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |  2 +
>  Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
>  include/linux/mshv.h                          | 11 ++++
>  include/uapi/linux/mshv.h                     | 19 ++++++
>  virt/mshv/mshv_main.c                         | 49 +++++++++++++++
>  5 files changed, 143 insertions(+)
>  create mode 100644 Documentation/virt/mshv/api.rst
>  create mode 100644 include/linux/mshv.h
>  create mode 100644 include/uapi/linux/mshv.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 55a2d9b2ce33..13a4d3ecafca 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
>  0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-remoteproc@vger.kernel.org>
>  0xB6  all    linux/fpga-dfl.h
>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> new file mode 100644
> index 000000000000..82e32de48d03
> --- /dev/null
> +++ b/Documentation/virt/mshv/api.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +Microsoft Hypervisor Root Partition API Documentation
> +=====================================================
> +
> +1. Overview
> +===========
> +
> +This document describes APIs for creating and managing guest virtual machines
> +when running Linux as the root partition on the Microsoft Hypervisor.
> +
> +This API is not yet stable.
> +
> +2. Glossary/Terms
> +=================
> +
> +hv
> +--
> +Short for Hyper-V. This name is used in the kernel to describe interfaces to
> +the Microsoft Hypervisor.
> +
> +mshv
> +----
> +Short for Microsoft Hypervisor. This is the name of the userland API module
> +described in this document.
> +
> +Partition
> +---------
> +A virtual machine running on the Microsoft Hypervisor.
> +
> +Root Partition
> +--------------
> +The partition that is created and assumes control when the machine boots. The
> +root partition can use mshv APIs to create guest partitions.
> +
> +3. API description
> +==================
> +
> +The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
> +
> +Mshv is file descriptor-based, following a similar pattern to KVM.
> +
> +To get a handle to the mshv driver, use open("/dev/mshv").
> +
> +3.1 MSHV_REQUEST_VERSION
> +------------------------
> +:Type: /dev/mshv ioctl
> +:Parameters: pointer to a u32
> +:Returns: 0 on success
> +
> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
> +establish the interface version with the kernel module.
> +
> +The caller should pass the MSHV_VERSION as an argument.
> +
> +The kernel module will check which interface versions it supports and return 0
> +if one of them matches.
> +
> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
> +it is open - this ioctl can only be called once per open.
> +

KVM used to have KVM_GET_API_VERSION too but this turned out to be not
very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
instead.

> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> new file mode 100644
> index 000000000000..a0982fe2c0b8
> --- /dev/null
> +++ b/include/linux/mshv.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_MSHV_H
> +#define _LINUX_MSHV_H
> +
> +/*
> + * Microsoft Hypervisor root partition driver for /dev/mshv
> + */
> +
> +#include <uapi/linux/mshv.h>
> +
> +#endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> new file mode 100644
> index 000000000000..dd30fc2f0a80
> --- /dev/null
> +++ b/include/uapi/linux/mshv.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_MSHV_H
> +#define _UAPI_LINUX_MSHV_H
> +
> +/*
> + * Userspace interface for /dev/mshv
> + * Microsoft Hypervisor root partition APIs
> + */
> +
> +#include <linux/types.h>
> +
> +#define MSHV_VERSION	0x0
> +
> +#define MSHV_IOCTL 0xB8
> +
> +/* mshv device */
> +#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
> +
> +#endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index ecb9089761fe..62f631f85301 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -11,25 +11,74 @@
>  #include <linux/module.h>
>  #include <linux/fs.h>
>  #include <linux/miscdevice.h>
> +#include <linux/slab.h>
> +#include <linux/mshv.h>
>  
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
>  
> +#define MSHV_INVALID_VERSION	0xFFFFFFFF
> +#define MSHV_CURRENT_VERSION	MSHV_VERSION
> +
> +static u32 supported_versions[] = {
> +	MSHV_CURRENT_VERSION,
> +};
> +
> +static long
> +mshv_ioctl_request_version(u32 *version, void __user *user_arg)
> +{
> +	u32 arg;
> +	int i;
> +
> +	if (copy_from_user(&arg, user_arg, sizeof(arg)))
> +		return -EFAULT;
> +
> +	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
> +		if (supported_versions[i] == arg) {
> +			*version = supported_versions[i];
> +			return 0;
> +		}
> +	}
> +	return -ENOTSUPP;
> +}
> +
>  static long
>  mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> +	u32 *version = (u32 *)filp->private_data;
> +
> +	if (ioctl == MSHV_REQUEST_VERSION) {
> +		/* Version can only be set once */
> +		if (*version != MSHV_INVALID_VERSION)
> +			return -EBADFD;
> +
> +		return mshv_ioctl_request_version(version, (void __user *)arg);
> +	}
> +
> +	/* Version must be set before other ioctls can be called */
> +	if (*version == MSHV_INVALID_VERSION)
> +		return -EBADFD;
> +
> +	/* TODO other ioctls */
> +
>  	return -ENOTTY;
>  }
>  
>  static int
>  mshv_dev_open(struct inode *inode, struct file *filp)
>  {
> +	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
> +	if (!filp->private_data)
> +		return -ENOMEM;
> +	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
> +
>  	return 0;
>  }
>  
>  static int
>  mshv_dev_release(struct inode *inode, struct file *filp)
>  {
> +	kfree(filp->private_data);
>  	return 0;
>  }

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/18] virt/mshv: create partition ioctl
  2020-11-21  0:30 ` [RFC PATCH 05/18] virt/mshv: create partition ioctl Nuno Das Neves
@ 2021-02-09 13:15   ` Vitaly Kuznetsov
  2021-03-04 18:44     ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-02-09 13:15 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	nunodasneves, wei.liu, ligrassi, kys

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> Add MSHV_CREATE_PARTITION, which creates an fd to track a new partition.
> Partition is not yet created in the hypervisor itself.
> Introduce header files for userspace-facing hyperv structures.
>
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |  12 ++
>  arch/x86/include/asm/hyperv-tlfs.h      |   1 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 124 ++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |   1 +
>  include/linux/mshv.h                    |  16 +++
>  include/uapi/asm-generic/hyperv-tlfs.h  |  14 ++
>  include/uapi/linux/mshv.h               |   7 +
>  virt/mshv/mshv_main.c                   | 179 +++++++++++++++++++++---
>  8 files changed, 338 insertions(+), 16 deletions(-)
>  create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
>  create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
>
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 82e32de48d03..ce651a1738e0 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -39,6 +39,9 @@ root partition can use mshv APIs to create guest partitions.
>  
>  The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
>  
> +The uapi header files you need are linux/mshv.h, asm/hyperv-tlfs.h, and
> +asm-generic/hyperv-tlfs.h.
> +
>  Mshv is file descriptor-based, following a similar pattern to KVM.
>  
>  To get a handle to the mshv driver, use open("/dev/mshv").
> @@ -60,3 +63,12 @@ if one of them matches.
>  This /dev/mshv file descriptor will remain 'locked' to that version as long as
>  it is open - this ioctl can only be called once per open.
>  
> +3.2 MSHV_CREATE_PARTITION
> +-------------------------
> +:Type: /dev/mshv ioctl
> +:Parameters: struct mshv_create_partition
> +:Returns: partition file descriptor, or -1 on failure
> +
> +This ioctl creates a guest partition, returning a file descriptor to use as a
> +handle for partition ioctls.
> +
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 592c75e51e0f..4cd44ae9bffb 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -11,6 +11,7 @@
>  
>  #include <linux/types.h>
>  #include <asm/page.h>
> +#include <uapi/asm/hyperv-tlfs.h>
>  /*
>   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
>   * is set by CPUID(HvCpuIdFunctionVersionAndFeatures).
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> new file mode 100644
> index 000000000000..72150c25ffe6
> --- /dev/null
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -0,0 +1,124 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_X86_HYPERV_TLFS_USER_H
> +#define _UAPI_ASM_X86_HYPERV_TLFS_USER_H
> +
> +#include <linux/types.h>
> +
> +#define HV_PARTITION_PROCESSOR_FEATURE_BANKS 2
> +
> +union hv_partition_processor_features {
> +	struct {
> +		__u64 sse3_support:1;
> +		__u64 lahf_sahf_support:1;
> +		__u64 ssse3_support:1;
> +		__u64 sse4_1_support:1;
> +		__u64 sse4_2_support:1;
> +		__u64 sse4a_support:1;
> +		__u64 xop_support:1;
> +		__u64 pop_cnt_support:1;
> +		__u64 cmpxchg16b_support:1;
> +		__u64 altmovcr8_support:1;
> +		__u64 lzcnt_support:1;
> +		__u64 mis_align_sse_support:1;
> +		__u64 mmx_ext_support:1;
> +		__u64 amd3dnow_support:1;
> +		__u64 extended_amd3dnow_support:1;
> +		__u64 page_1gb_support:1;
> +		__u64 aes_support:1;
> +		__u64 pclmulqdq_support:1;
> +		__u64 pcid_support:1;
> +		__u64 fma4_support:1;
> +		__u64 f16c_support:1;
> +		__u64 rd_rand_support:1;
> +		__u64 rd_wr_fs_gs_support:1;
> +		__u64 smep_support:1;
> +		__u64 enhanced_fast_string_support:1;
> +		__u64 bmi1_support:1;
> +		__u64 bmi2_support:1;
> +		__u64 hle_support_deprecated:1;
> +		__u64 rtm_support_deprecated:1;
> +		__u64 movbe_support:1;
> +		__u64 npiep1_support:1;
> +		__u64 dep_x87_fpu_save_support:1;
> +		__u64 rd_seed_support:1;
> +		__u64 adx_support:1;
> +		__u64 intel_prefetch_support:1;
> +		__u64 smap_support:1;
> +		__u64 hle_support:1;
> +		__u64 rtm_support:1;
> +		__u64 rdtscp_support:1;
> +		__u64 clflushopt_support:1;
> +		__u64 clwb_support:1;
> +		__u64 sha_support:1;
> +		__u64 x87_pointers_saved_support:1;
> +		__u64 invpcid_support:1;
> +		__u64 ibrs_support:1;
> +		__u64 stibp_support:1;
> +		__u64 ibpb_support: 1;
> +		__u64 unrestricted_guest_support:1;
> +		__u64 mdd_support:1;
> +		__u64 fast_short_rep_mov_support:1;
> +		__u64 l1dcache_flush_support:1;
> +		__u64 rdcl_no_support:1;
> +		__u64 ibrs_all_support:1;
> +		__u64 skip_l1df_support:1;
> +		__u64 ssb_no_support:1;
> +		__u64 rsb_a_no_support:1;
> +		__u64 virt_spec_ctrl_support:1;
> +		__u64 rd_pid_support:1;
> +		__u64 umip_support:1;
> +		__u64 mbs_no_support:1;
> +		__u64 mb_clear_support:1;
> +		__u64 taa_no_support:1;
> +		__u64 tsx_ctrl_support:1;
> +		/*
> +		 * N.B. The final processor feature bit in bank 0 is reserved to
> +		 * simplify potential downlevel backports.
> +		 */
> +		__u64 reserved_bank0:1;
> +
> +		/* N.B. Begin bank 1 processor features. */
> +		__u64 acount_mcount_support:1;
> +		__u64 tsc_invariant_support:1;
> +		__u64 cl_zero_support:1;
> +		__u64 rdpru_support:1;
> +		__u64 la57_support:1;
> +		__u64 mbec_support:1;
> +		__u64 nested_virt_support:1;
> +		__u64 psfd_support:1;
> +		__u64 cet_ss_support:1;
> +		__u64 cet_ibt_support:1;
> +		__u64 vmx_exception_inject_support:1;
> +		__u64 enqcmd_support:1;
> +		__u64 umwait_tpause_support:1;
> +		__u64 movdiri_support:1;
> +		__u64 movdir64b_support:1;
> +		__u64 cldemote_support:1;
> +		__u64 serialize_support:1;
> +		__u64 tsc_deadline_tmr_support:1;
> +		__u64 tsc_adjust_support:1;
> +		__u64 fzlrep_movsb:1;
> +		__u64 fsrep_stosb:1;
> +		__u64 fsrep_cmpsb:1;
> +		__u64 reserved_bank1:42;
> +	};
> +	__u64 as_uint64[HV_PARTITION_PROCESSOR_FEATURE_BANKS];
> +};
> +
> +union hv_partition_processor_xsave_features {
> +	struct {
> +		__u64 xsave_support : 1;
> +		__u64 xsaveopt_support : 1;
> +		__u64 avx_support : 1;
> +		__u64 reserved1 : 61;
> +	};
> +	__u64 as_uint64;
> +};
> +
> +struct hv_partition_creation_properties {
> +	union hv_partition_processor_features disabled_processor_features;
> +	union hv_partition_processor_xsave_features
> +		disabled_processor_xsave_features;
> +};
> +
> +#endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 05b9dc9896ab..2ff580780ce4 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -12,6 +12,7 @@
>  #include <linux/types.h>
>  #include <linux/bits.h>
>  #include <linux/time64.h>
> +#include <uapi/asm-generic/hyperv-tlfs.h>
>  
>  /*
>   * While not explicitly listed in the TLFS, Hyper-V always runs with a page size
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index a0982fe2c0b8..fc4f35089b2c 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -6,6 +6,22 @@
>   * Microsoft Hypervisor root partition driver for /dev/mshv
>   */
>  
> +#include <linux/spinlock.h>
>  #include <uapi/linux/mshv.h>
>  
> +#define MSHV_MAX_PARTITIONS		128
> +
> +struct mshv_partition {
> +	u64 id;
> +	refcount_t ref_count;
> +};
> +
> +struct mshv {
> +	struct {
> +		spinlock_t lock;
> +		u64 count;
> +		struct mshv_partition *array[MSHV_MAX_PARTITIONS];
> +	} partitions;
> +};
> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> new file mode 100644
> index 000000000000..140cc0b4f98f
> --- /dev/null
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
> +#define _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
> +
> +#ifndef BIT
> +#define BIT(X)	(1ULL << (X))
> +#endif
> +
> +#define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
> +#define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> +
> +#endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index dd30fc2f0a80..3788f8bc5caa 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -8,12 +8,19 @@
>   */
>  
>  #include <linux/types.h>
> +#include <asm/hyperv-tlfs.h>
>  
>  #define MSHV_VERSION	0x0
>  
> +struct mshv_create_partition {
> +	__u64 flags;
> +	struct hv_partition_creation_properties partition_creation_properties;
> +};
> +
>  #define MSHV_IOCTL 0xB8
>  
>  /* mshv device */
>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
> +#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
>  
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 62f631f85301..4dcbe4907430 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -12,6 +12,8 @@
>  #include <linux/fs.h>
>  #include <linux/miscdevice.h>
>  #include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
>  #include <linux/mshv.h>
>  
>  MODULE_AUTHOR("Microsoft");
> @@ -24,6 +26,161 @@ static u32 supported_versions[] = {
>  	MSHV_CURRENT_VERSION,
>  };
>  
> +static struct mshv mshv = {};
> +
> +static void mshv_partition_put(struct mshv_partition *partition);
> +static int mshv_partition_release(struct inode *inode, struct file *filp);
> +static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +
> +static int mshv_dev_open(struct inode *inode, struct file *filp);
> +static int mshv_dev_release(struct inode *inode, struct file *filp);
> +static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +
> +static const struct file_operations mshv_partition_fops = {
> +	.release = mshv_partition_release,
> +	.unlocked_ioctl = mshv_partition_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static const struct file_operations mshv_dev_fops = {
> +	.owner = THIS_MODULE,
> +	.open = mshv_dev_open,
> +	.release = mshv_dev_release,
> +	.unlocked_ioctl = mshv_dev_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static struct miscdevice mshv_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mshv",
> +	.fops = &mshv_dev_fops,
> +	.mode = 600,
> +};
> +
> +static long
> +mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> +{
> +	return -ENOTTY;
> +}
> +
> +static void
> +destroy_partition(struct mshv_partition *partition)
> +{
> +	unsigned long flags;
> +	int i;
> +
> +	/* Remove from list of partitions */
> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
> +
> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
> +		if (mshv.partitions.array[i] == partition)
> +			break;
> +	}
> +
> +	if (i == MSHV_MAX_PARTITIONS) {
> +		pr_err("%s: failed to locate partition in array\n", __func__);
> +	} else {
> +		mshv.partitions.count--;
> +		mshv.partitions.array[i] = NULL;
> +	}
> +
> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> +
> +	kfree(partition);
> +}
> +
> +static void
> +mshv_partition_put(struct mshv_partition *partition)
> +{
> +	if (refcount_dec_and_test(&partition->ref_count))
> +		destroy_partition(partition);
> +}
> +
> +static int
> +mshv_partition_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_partition *partition = filp->private_data;
> +
> +	mshv_partition_put(partition);
> +
> +	return 0;
> +}
> +
> +static int
> +add_partition(struct mshv_partition *partition)
> +{
> +	unsigned long flags;
> +	int i, ret = 0;
> +
> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
> +
> +	if (mshv.partitions.count >= MSHV_MAX_PARTITIONS) {
> +		pr_err("%s: too many partitions\n", __func__);
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +
> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
> +		if (!mshv.partitions.array[i])
> +			break;
> +	}
> +
> +	mshv.partitions.count++;
> +	mshv.partitions.array[i] = partition;
> +
> +out_unlock:
> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_ioctl_create_partition(void __user *user_arg)
> +{
> +	struct mshv_create_partition args;
> +	struct mshv_partition *partition;
> +	struct file *file;
> +	int fd;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_arg, sizeof(args)))
> +		return -EFAULT;
> +
> +	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
> +	if (!partition)
> +		return -ENOMEM;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		ret = fd;
> +		goto free_partition;
> +	}
> +
> +	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
> +				  partition, O_RDWR);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto put_fd;
> +	}
> +	refcount_set(&partition->ref_count, 1);
> +
> +	ret = add_partition(partition);
> +	if (ret)
> +		goto release_file;
> +
> +	fd_install(fd, file);
> +
> +	return fd;
> +
> +release_file:
> +	file->f_op->release(file->f_inode, file);
> +put_fd:
> +	put_unused_fd(fd);
> +free_partition:
> +	kfree(partition);
> +	return ret;
> +}
> +
>  static long
>  mshv_ioctl_request_version(u32 *version, void __user *user_arg)
>  {
> @@ -59,7 +216,10 @@ mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	if (*version == MSHV_INVALID_VERSION)
>  		return -EBADFD;
>  
> -	/* TODO other ioctls */
> +	switch (ioctl) {
> +	case MSHV_CREATE_PARTITION:
> +		return mshv_ioctl_create_partition((void __user *)arg);
> +	}
>  
>  	return -ENOTTY;
>  }
> @@ -82,21 +242,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> -static const struct file_operations mshv_dev_fops = {
> -	.owner = THIS_MODULE,
> -	.open = mshv_dev_open,
> -	.release = mshv_dev_release,
> -	.unlocked_ioctl = mshv_dev_ioctl,
> -	.llseek = noop_llseek,
> -};
> -
> -static struct miscdevice mshv_dev = {
> -	.minor = MISC_DYNAMIC_MINOR,
> -	.name = "mshv",
> -	.fops = &mshv_dev_fops,
> -	.mode = 600,
> -};
> -

This looks like an unneeded code churn as these structs just got added a
few patches ago. It would probably be possible to put it to the right
place from the very beginning so you don't need to move it in this
patch.

>  static int
>  __init mshv_init(void)
>  {
> @@ -106,6 +251,8 @@ __init mshv_init(void)
>  	if (r)
>  		pr_err("%s: misc device register failed\n", __func__);
>  
> +	spin_lock_init(&mshv.partitions.lock);
> +
>  	return r;
>  }

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes
  2021-02-09 13:04   ` Vitaly Kuznetsov
@ 2021-03-04 18:24     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-04 18:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys

On 2/9/2021 5:04 AM, Vitaly Kuznetsov wrote:
> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> 
>> Return linux-friendly error codes from hypercall wrapper functions.
>> This will be needed in the mshv module.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/hv_proc.c         | 30 ++++++++++++++++++++++++++---
>>  arch/x86/include/asm/mshyperv.h   |  1 +
>>  include/asm-generic/hyperv-tlfs.h | 32 +++++++++++++++++++++----------
>>  3 files changed, 50 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
>> index 0fd972c9129a..8f86f8e86748 100644
>> --- a/arch/x86/hyperv/hv_proc.c
>> +++ b/arch/x86/hyperv/hv_proc.c
>> @@ -18,6 +18,30 @@
>>  #define HV_DEPOSIT_MAX_ORDER (8)
>>  #define HV_DEPOSIT_MAX (1 << HV_DEPOSIT_MAX_ORDER)
>>  
>> +int hv_status_to_errno(int hv_status)
>> +{
>> +	switch (hv_status) {
>> +	case HV_STATUS_SUCCESS:
>> +		return 0;
>> +	case HV_STATUS_INVALID_PARAMETER:
>> +	case HV_STATUS_UNKNOWN_PROPERTY:
>> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
>> +	case HV_STATUS_INVALID_VP_INDEX:
>> +	case HV_STATUS_INVALID_REGISTER_VALUE:
>> +	case HV_STATUS_INVALID_LP_INDEX:
>> +		return EINVAL;
>> +	case HV_STATUS_ACCESS_DENIED:
>> +	case HV_STATUS_OPERATION_DENIED:
>> +		return EACCES;
>> +	case HV_STATUS_NOT_ACKNOWLEDGED:
>> +	case HV_STATUS_INVALID_VP_STATE:
>> +	case HV_STATUS_INVALID_PARTITION_STATE:
>> +		return EBADFD;
>> +	}
>> +	return ENOTRECOVERABLE;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_status_to_errno);
>> +
>>  /*
>>   * Deposits exact number of pages
>>   * Must be called with interrupts enabled
>> @@ -99,7 +123,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>  
>>  	if (status != HV_STATUS_SUCCESS) {
>>  		pr_err("Failed to deposit pages: %d\n", status);
>> -		ret = status;
>> +		ret = -hv_status_to_errno(status);
> 
> "-hv_status_to_errno" looks weird, could we just return
> '-EINVAL'/'-EACCES'/... from hv_status_to_errno() instead?
> 

Yes, good idea.

>>  		goto err_free_allocations;
>>  	}
>>  
>> @@ -155,7 +179,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>  			if (status != HV_STATUS_SUCCESS) {
>>  				pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
>>  				       lp_index, apic_id, status);
>> -				ret = status;
>> +				ret = -hv_status_to_errno(status);
>>  			}
>>  			break;
>>  		}
>> @@ -203,7 +227,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>  			if (status != HV_STATUS_SUCCESS) {
>>  				pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
>>  				       vp_index, flags, status);
>> -				ret = status;
>> +				ret = -hv_status_to_errno(status);
>>  			}
>>  			break;
>>  		}
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index cbee72550a12..eb75faa4d4c5 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -243,6 +243,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
>>  int hyperv_fill_flush_guest_mapping_list(
>>  		struct hv_guest_mapping_flush_list *flush,
>>  		u64 start_gfn, u64 end_gfn);
>> +int hv_status_to_errno(int hv_status);
>>  
>>  extern bool hv_root_partition;
>>  
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index dd385c6a71b5..445244192fa4 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -181,16 +181,28 @@ enum HV_GENERIC_SET_FORMAT {
>>  #define HV_HYPERCALL_REP_START_MASK	GENMASK_ULL(59, 48)
>>  
>>  /* hypercall status code */
>> -#define HV_STATUS_SUCCESS			0
>> -#define HV_STATUS_INVALID_HYPERCALL_CODE	2
>> -#define HV_STATUS_INVALID_HYPERCALL_INPUT	3
>> -#define HV_STATUS_INVALID_ALIGNMENT		4
>> -#define HV_STATUS_INVALID_PARAMETER		5
>> -#define HV_STATUS_OPERATION_DENIED		8
>> -#define HV_STATUS_INSUFFICIENT_MEMORY		11
>> -#define HV_STATUS_INVALID_PORT_ID		17
>> -#define HV_STATUS_INVALID_CONNECTION_ID		18
>> -#define HV_STATUS_INSUFFICIENT_BUFFERS		19
>> +#define HV_STATUS_SUCCESS			0x0
>> +#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
>> +#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x3
>> +#define HV_STATUS_INVALID_ALIGNMENT		0x4
>> +#define HV_STATUS_INVALID_PARAMETER		0x5
>> +#define HV_STATUS_ACCESS_DENIED			0x6
>> +#define HV_STATUS_INVALID_PARTITION_STATE	0x7
>> +#define HV_STATUS_OPERATION_DENIED		0x8
>> +#define HV_STATUS_UNKNOWN_PROPERTY		0x9
>> +#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
>> +#define HV_STATUS_INSUFFICIENT_MEMORY		0xB
>> +#define HV_STATUS_INVALID_PARTITION_ID		0xD
>> +#define HV_STATUS_INVALID_VP_INDEX		0xE
>> +#define HV_STATUS_NOT_FOUND			0x10
>> +#define HV_STATUS_INVALID_PORT_ID		0x11
>> +#define HV_STATUS_INVALID_CONNECTION_ID		0x12
>> +#define HV_STATUS_INSUFFICIENT_BUFFERS		0x13
>> +#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
>> +#define HV_STATUS_INVALID_VP_STATE		0x15
>> +#define HV_STATUS_NO_RESOURCES			0x1D
>> +#define HV_STATUS_INVALID_LP_INDEX		0x41
>> +#define HV_STATUS_INVALID_REGISTER_VALUE	0x50
>>  
>>  /*
>>   * The Hyper-V TimeRefCount register and the TSC
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-02-09 13:11   ` Vitaly Kuznetsov
@ 2021-03-04 18:43     ` Nuno Das Neves
  2021-03-05  9:18       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-04 18:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys

On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> 
>> Reserve ioctl number in userpsace-api/ioctl/ioctl-number.rst
>> Introduce MSHV_REQUEST_VERSION ioctl.
>> Introduce documentation for /dev/mshv in Documentation/virt/mshv
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  .../userspace-api/ioctl/ioctl-number.rst      |  2 +
>>  Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
>>  include/linux/mshv.h                          | 11 ++++
>>  include/uapi/linux/mshv.h                     | 19 ++++++
>>  virt/mshv/mshv_main.c                         | 49 +++++++++++++++
>>  5 files changed, 143 insertions(+)
>>  create mode 100644 Documentation/virt/mshv/api.rst
>>  create mode 100644 include/linux/mshv.h
>>  create mode 100644 include/uapi/linux/mshv.h
>>
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 55a2d9b2ce33..13a4d3ecafca 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
>>  0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-remoteproc@vger.kernel.org>
>>  0xB6  all    linux/fpga-dfl.h
>>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
>> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
>> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>>  0xC0  00-0F  linux/usb/iowarrior.h
>>  0xCA  00-0F  uapi/misc/cxl.h
>>  0xCA  10-2F  uapi/misc/ocxl.h
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> new file mode 100644
>> index 000000000000..82e32de48d03
>> --- /dev/null
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -0,0 +1,62 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +Microsoft Hypervisor Root Partition API Documentation
>> +=====================================================
>> +
>> +1. Overview
>> +===========
>> +
>> +This document describes APIs for creating and managing guest virtual machines
>> +when running Linux as the root partition on the Microsoft Hypervisor.
>> +
>> +This API is not yet stable.
>> +
>> +2. Glossary/Terms
>> +=================
>> +
>> +hv
>> +--
>> +Short for Hyper-V. This name is used in the kernel to describe interfaces to
>> +the Microsoft Hypervisor.
>> +
>> +mshv
>> +----
>> +Short for Microsoft Hypervisor. This is the name of the userland API module
>> +described in this document.
>> +
>> +Partition
>> +---------
>> +A virtual machine running on the Microsoft Hypervisor.
>> +
>> +Root Partition
>> +--------------
>> +The partition that is created and assumes control when the machine boots. The
>> +root partition can use mshv APIs to create guest partitions.
>> +
>> +3. API description
>> +==================
>> +
>> +The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
>> +
>> +Mshv is file descriptor-based, following a similar pattern to KVM.
>> +
>> +To get a handle to the mshv driver, use open("/dev/mshv").
>> +
>> +3.1 MSHV_REQUEST_VERSION
>> +------------------------
>> +:Type: /dev/mshv ioctl
>> +:Parameters: pointer to a u32
>> +:Returns: 0 on success
>> +
>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>> +establish the interface version with the kernel module.
>> +
>> +The caller should pass the MSHV_VERSION as an argument.
>> +
>> +The kernel module will check which interface versions it supports and return 0
>> +if one of them matches.
>> +
>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>> +it is open - this ioctl can only be called once per open.
>> +
> 
> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
> instead.
> 

The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
When we add new features/ioctls beyond the core we can use an extension/capability
approach like KVM.

>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> new file mode 100644
>> index 000000000000..a0982fe2c0b8
>> --- /dev/null
>> +++ b/include/linux/mshv.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef _LINUX_MSHV_H
>> +#define _LINUX_MSHV_H
>> +
>> +/*
>> + * Microsoft Hypervisor root partition driver for /dev/mshv
>> + */
>> +
>> +#include <uapi/linux/mshv.h>
>> +
>> +#endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> new file mode 100644
>> index 000000000000..dd30fc2f0a80
>> --- /dev/null
>> +++ b/include/uapi/linux/mshv.h
>> @@ -0,0 +1,19 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_LINUX_MSHV_H
>> +#define _UAPI_LINUX_MSHV_H
>> +
>> +/*
>> + * Userspace interface for /dev/mshv
>> + * Microsoft Hypervisor root partition APIs
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +#define MSHV_VERSION	0x0
>> +
>> +#define MSHV_IOCTL 0xB8
>> +
>> +/* mshv device */
>> +#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>> +
>> +#endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index ecb9089761fe..62f631f85301 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -11,25 +11,74 @@
>>  #include <linux/module.h>
>>  #include <linux/fs.h>
>>  #include <linux/miscdevice.h>
>> +#include <linux/slab.h>
>> +#include <linux/mshv.h>
>>  
>>  MODULE_AUTHOR("Microsoft");
>>  MODULE_LICENSE("GPL");
>>  
>> +#define MSHV_INVALID_VERSION	0xFFFFFFFF
>> +#define MSHV_CURRENT_VERSION	MSHV_VERSION
>> +
>> +static u32 supported_versions[] = {
>> +	MSHV_CURRENT_VERSION,
>> +};
>> +
>> +static long
>> +mshv_ioctl_request_version(u32 *version, void __user *user_arg)
>> +{
>> +	u32 arg;
>> +	int i;
>> +
>> +	if (copy_from_user(&arg, user_arg, sizeof(arg)))
>> +		return -EFAULT;
>> +
>> +	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
>> +		if (supported_versions[i] == arg) {
>> +			*version = supported_versions[i];
>> +			return 0;
>> +		}
>> +	}
>> +	return -ENOTSUPP;
>> +}
>> +
>>  static long
>>  mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  {
>> +	u32 *version = (u32 *)filp->private_data;
>> +
>> +	if (ioctl == MSHV_REQUEST_VERSION) {
>> +		/* Version can only be set once */
>> +		if (*version != MSHV_INVALID_VERSION)
>> +			return -EBADFD;
>> +
>> +		return mshv_ioctl_request_version(version, (void __user *)arg);
>> +	}
>> +
>> +	/* Version must be set before other ioctls can be called */
>> +	if (*version == MSHV_INVALID_VERSION)
>> +		return -EBADFD;
>> +
>> +	/* TODO other ioctls */
>> +
>>  	return -ENOTTY;
>>  }
>>  
>>  static int
>>  mshv_dev_open(struct inode *inode, struct file *filp)
>>  {
>> +	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
>> +	if (!filp->private_data)
>> +		return -ENOMEM;
>> +	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
>> +
>>  	return 0;
>>  }
>>  
>>  static int
>>  mshv_dev_release(struct inode *inode, struct file *filp)
>>  {
>> +	kfree(filp->private_data);
>>  	return 0;
>>  }
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 05/18] virt/mshv: create partition ioctl
  2021-02-09 13:15   ` Vitaly Kuznetsov
@ 2021-03-04 18:44     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-04 18:44 UTC (permalink / raw)
  To: Vitaly Kuznetsov, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys

On 2/9/2021 5:15 AM, Vitaly Kuznetsov wrote:
> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> 
>> Add MSHV_CREATE_PARTITION, which creates an fd to track a new partition.
>> Partition is not yet created in the hypervisor itself.
>> Introduce header files for userspace-facing hyperv structures.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  Documentation/virt/mshv/api.rst         |  12 ++
>>  arch/x86/include/asm/hyperv-tlfs.h      |   1 +
>>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 124 ++++++++++++++++
>>  include/asm-generic/hyperv-tlfs.h       |   1 +
>>  include/linux/mshv.h                    |  16 +++
>>  include/uapi/asm-generic/hyperv-tlfs.h  |  14 ++
>>  include/uapi/linux/mshv.h               |   7 +
>>  virt/mshv/mshv_main.c                   | 179 +++++++++++++++++++++---
>>  8 files changed, 338 insertions(+), 16 deletions(-)
>>  create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
>>  create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
>>
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> index 82e32de48d03..ce651a1738e0 100644
>> --- a/Documentation/virt/mshv/api.rst
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -39,6 +39,9 @@ root partition can use mshv APIs to create guest partitions.
>>  
>>  The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
>>  
>> +The uapi header files you need are linux/mshv.h, asm/hyperv-tlfs.h, and
>> +asm-generic/hyperv-tlfs.h.
>> +
>>  Mshv is file descriptor-based, following a similar pattern to KVM.
>>  
>>  To get a handle to the mshv driver, use open("/dev/mshv").
>> @@ -60,3 +63,12 @@ if one of them matches.
>>  This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>  it is open - this ioctl can only be called once per open.
>>  
>> +3.2 MSHV_CREATE_PARTITION
>> +-------------------------
>> +:Type: /dev/mshv ioctl
>> +:Parameters: struct mshv_create_partition
>> +:Returns: partition file descriptor, or -1 on failure
>> +
>> +This ioctl creates a guest partition, returning a file descriptor to use as a
>> +handle for partition ioctls.
>> +
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>> index 592c75e51e0f..4cd44ae9bffb 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -11,6 +11,7 @@
>>  
>>  #include <linux/types.h>
>>  #include <asm/page.h>
>> +#include <uapi/asm/hyperv-tlfs.h>
>>  /*
>>   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
>>   * is set by CPUID(HvCpuIdFunctionVersionAndFeatures).
>> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> new file mode 100644
>> index 000000000000..72150c25ffe6
>> --- /dev/null
>> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> @@ -0,0 +1,124 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_X86_HYPERV_TLFS_USER_H
>> +#define _UAPI_ASM_X86_HYPERV_TLFS_USER_H
>> +
>> +#include <linux/types.h>
>> +
>> +#define HV_PARTITION_PROCESSOR_FEATURE_BANKS 2
>> +
>> +union hv_partition_processor_features {
>> +	struct {
>> +		__u64 sse3_support:1;
>> +		__u64 lahf_sahf_support:1;
>> +		__u64 ssse3_support:1;
>> +		__u64 sse4_1_support:1;
>> +		__u64 sse4_2_support:1;
>> +		__u64 sse4a_support:1;
>> +		__u64 xop_support:1;
>> +		__u64 pop_cnt_support:1;
>> +		__u64 cmpxchg16b_support:1;
>> +		__u64 altmovcr8_support:1;
>> +		__u64 lzcnt_support:1;
>> +		__u64 mis_align_sse_support:1;
>> +		__u64 mmx_ext_support:1;
>> +		__u64 amd3dnow_support:1;
>> +		__u64 extended_amd3dnow_support:1;
>> +		__u64 page_1gb_support:1;
>> +		__u64 aes_support:1;
>> +		__u64 pclmulqdq_support:1;
>> +		__u64 pcid_support:1;
>> +		__u64 fma4_support:1;
>> +		__u64 f16c_support:1;
>> +		__u64 rd_rand_support:1;
>> +		__u64 rd_wr_fs_gs_support:1;
>> +		__u64 smep_support:1;
>> +		__u64 enhanced_fast_string_support:1;
>> +		__u64 bmi1_support:1;
>> +		__u64 bmi2_support:1;
>> +		__u64 hle_support_deprecated:1;
>> +		__u64 rtm_support_deprecated:1;
>> +		__u64 movbe_support:1;
>> +		__u64 npiep1_support:1;
>> +		__u64 dep_x87_fpu_save_support:1;
>> +		__u64 rd_seed_support:1;
>> +		__u64 adx_support:1;
>> +		__u64 intel_prefetch_support:1;
>> +		__u64 smap_support:1;
>> +		__u64 hle_support:1;
>> +		__u64 rtm_support:1;
>> +		__u64 rdtscp_support:1;
>> +		__u64 clflushopt_support:1;
>> +		__u64 clwb_support:1;
>> +		__u64 sha_support:1;
>> +		__u64 x87_pointers_saved_support:1;
>> +		__u64 invpcid_support:1;
>> +		__u64 ibrs_support:1;
>> +		__u64 stibp_support:1;
>> +		__u64 ibpb_support: 1;
>> +		__u64 unrestricted_guest_support:1;
>> +		__u64 mdd_support:1;
>> +		__u64 fast_short_rep_mov_support:1;
>> +		__u64 l1dcache_flush_support:1;
>> +		__u64 rdcl_no_support:1;
>> +		__u64 ibrs_all_support:1;
>> +		__u64 skip_l1df_support:1;
>> +		__u64 ssb_no_support:1;
>> +		__u64 rsb_a_no_support:1;
>> +		__u64 virt_spec_ctrl_support:1;
>> +		__u64 rd_pid_support:1;
>> +		__u64 umip_support:1;
>> +		__u64 mbs_no_support:1;
>> +		__u64 mb_clear_support:1;
>> +		__u64 taa_no_support:1;
>> +		__u64 tsx_ctrl_support:1;
>> +		/*
>> +		 * N.B. The final processor feature bit in bank 0 is reserved to
>> +		 * simplify potential downlevel backports.
>> +		 */
>> +		__u64 reserved_bank0:1;
>> +
>> +		/* N.B. Begin bank 1 processor features. */
>> +		__u64 acount_mcount_support:1;
>> +		__u64 tsc_invariant_support:1;
>> +		__u64 cl_zero_support:1;
>> +		__u64 rdpru_support:1;
>> +		__u64 la57_support:1;
>> +		__u64 mbec_support:1;
>> +		__u64 nested_virt_support:1;
>> +		__u64 psfd_support:1;
>> +		__u64 cet_ss_support:1;
>> +		__u64 cet_ibt_support:1;
>> +		__u64 vmx_exception_inject_support:1;
>> +		__u64 enqcmd_support:1;
>> +		__u64 umwait_tpause_support:1;
>> +		__u64 movdiri_support:1;
>> +		__u64 movdir64b_support:1;
>> +		__u64 cldemote_support:1;
>> +		__u64 serialize_support:1;
>> +		__u64 tsc_deadline_tmr_support:1;
>> +		__u64 tsc_adjust_support:1;
>> +		__u64 fzlrep_movsb:1;
>> +		__u64 fsrep_stosb:1;
>> +		__u64 fsrep_cmpsb:1;
>> +		__u64 reserved_bank1:42;
>> +	};
>> +	__u64 as_uint64[HV_PARTITION_PROCESSOR_FEATURE_BANKS];
>> +};
>> +
>> +union hv_partition_processor_xsave_features {
>> +	struct {
>> +		__u64 xsave_support : 1;
>> +		__u64 xsaveopt_support : 1;
>> +		__u64 avx_support : 1;
>> +		__u64 reserved1 : 61;
>> +	};
>> +	__u64 as_uint64;
>> +};
>> +
>> +struct hv_partition_creation_properties {
>> +	union hv_partition_processor_features disabled_processor_features;
>> +	union hv_partition_processor_xsave_features
>> +		disabled_processor_xsave_features;
>> +};
>> +
>> +#endif
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 05b9dc9896ab..2ff580780ce4 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -12,6 +12,7 @@
>>  #include <linux/types.h>
>>  #include <linux/bits.h>
>>  #include <linux/time64.h>
>> +#include <uapi/asm-generic/hyperv-tlfs.h>
>>  
>>  /*
>>   * While not explicitly listed in the TLFS, Hyper-V always runs with a page size
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> index a0982fe2c0b8..fc4f35089b2c 100644
>> --- a/include/linux/mshv.h
>> +++ b/include/linux/mshv.h
>> @@ -6,6 +6,22 @@
>>   * Microsoft Hypervisor root partition driver for /dev/mshv
>>   */
>>  
>> +#include <linux/spinlock.h>
>>  #include <uapi/linux/mshv.h>
>>  
>> +#define MSHV_MAX_PARTITIONS		128
>> +
>> +struct mshv_partition {
>> +	u64 id;
>> +	refcount_t ref_count;
>> +};
>> +
>> +struct mshv {
>> +	struct {
>> +		spinlock_t lock;
>> +		u64 count;
>> +		struct mshv_partition *array[MSHV_MAX_PARTITIONS];
>> +	} partitions;
>> +};
>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
>> new file mode 100644
>> index 000000000000..140cc0b4f98f
>> --- /dev/null
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -0,0 +1,14 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
>> +#define _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
>> +
>> +#ifndef BIT
>> +#define BIT(X)	(1ULL << (X))
>> +#endif
>> +
>> +#define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>> +#define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
>> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
>> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
>> +
>> +#endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index dd30fc2f0a80..3788f8bc5caa 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -8,12 +8,19 @@
>>   */
>>  
>>  #include <linux/types.h>
>> +#include <asm/hyperv-tlfs.h>
>>  
>>  #define MSHV_VERSION	0x0
>>  
>> +struct mshv_create_partition {
>> +	__u64 flags;
>> +	struct hv_partition_creation_properties partition_creation_properties;
>> +};
>> +
>>  #define MSHV_IOCTL 0xB8
>>  
>>  /* mshv device */
>>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>> +#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
>>  
>>  #endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 62f631f85301..4dcbe4907430 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -12,6 +12,8 @@
>>  #include <linux/fs.h>
>>  #include <linux/miscdevice.h>
>>  #include <linux/slab.h>
>> +#include <linux/file.h>
>> +#include <linux/anon_inodes.h>
>>  #include <linux/mshv.h>
>>  
>>  MODULE_AUTHOR("Microsoft");
>> @@ -24,6 +26,161 @@ static u32 supported_versions[] = {
>>  	MSHV_CURRENT_VERSION,
>>  };
>>  
>> +static struct mshv mshv = {};
>> +
>> +static void mshv_partition_put(struct mshv_partition *partition);
>> +static int mshv_partition_release(struct inode *inode, struct file *filp);
>> +static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>> +
>> +static int mshv_dev_open(struct inode *inode, struct file *filp);
>> +static int mshv_dev_release(struct inode *inode, struct file *filp);
>> +static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>> +
>> +static const struct file_operations mshv_partition_fops = {
>> +	.release = mshv_partition_release,
>> +	.unlocked_ioctl = mshv_partition_ioctl,
>> +	.llseek = noop_llseek,
>> +};
>> +
>> +static const struct file_operations mshv_dev_fops = {
>> +	.owner = THIS_MODULE,
>> +	.open = mshv_dev_open,
>> +	.release = mshv_dev_release,
>> +	.unlocked_ioctl = mshv_dev_ioctl,
>> +	.llseek = noop_llseek,
>> +};
>> +
>> +static struct miscdevice mshv_dev = {
>> +	.minor = MISC_DYNAMIC_MINOR,
>> +	.name = "mshv",
>> +	.fops = &mshv_dev_fops,
>> +	.mode = 600,
>> +};
>> +
>> +static long
>> +mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>> +{
>> +	return -ENOTTY;
>> +}
>> +
>> +static void
>> +destroy_partition(struct mshv_partition *partition)
>> +{
>> +	unsigned long flags;
>> +	int i;
>> +
>> +	/* Remove from list of partitions */
>> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
>> +
>> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
>> +		if (mshv.partitions.array[i] == partition)
>> +			break;
>> +	}
>> +
>> +	if (i == MSHV_MAX_PARTITIONS) {
>> +		pr_err("%s: failed to locate partition in array\n", __func__);
>> +	} else {
>> +		mshv.partitions.count--;
>> +		mshv.partitions.array[i] = NULL;
>> +	}
>> +
>> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
>> +
>> +	kfree(partition);
>> +}
>> +
>> +static void
>> +mshv_partition_put(struct mshv_partition *partition)
>> +{
>> +	if (refcount_dec_and_test(&partition->ref_count))
>> +		destroy_partition(partition);
>> +}
>> +
>> +static int
>> +mshv_partition_release(struct inode *inode, struct file *filp)
>> +{
>> +	struct mshv_partition *partition = filp->private_data;
>> +
>> +	mshv_partition_put(partition);
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +add_partition(struct mshv_partition *partition)
>> +{
>> +	unsigned long flags;
>> +	int i, ret = 0;
>> +
>> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
>> +
>> +	if (mshv.partitions.count >= MSHV_MAX_PARTITIONS) {
>> +		pr_err("%s: too many partitions\n", __func__);
>> +		ret = -ENOSPC;
>> +		goto out_unlock;
>> +	}
>> +
>> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
>> +		if (!mshv.partitions.array[i])
>> +			break;
>> +	}
>> +
>> +	mshv.partitions.count++;
>> +	mshv.partitions.array[i] = partition;
>> +
>> +out_unlock:
>> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
>> +
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_ioctl_create_partition(void __user *user_arg)
>> +{
>> +	struct mshv_create_partition args;
>> +	struct mshv_partition *partition;
>> +	struct file *file;
>> +	int fd;
>> +	long ret;
>> +
>> +	if (copy_from_user(&args, user_arg, sizeof(args)))
>> +		return -EFAULT;
>> +
>> +	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
>> +	if (!partition)
>> +		return -ENOMEM;
>> +
>> +	fd = get_unused_fd_flags(O_CLOEXEC);
>> +	if (fd < 0) {
>> +		ret = fd;
>> +		goto free_partition;
>> +	}
>> +
>> +	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
>> +				  partition, O_RDWR);
>> +	if (IS_ERR(file)) {
>> +		ret = PTR_ERR(file);
>> +		goto put_fd;
>> +	}
>> +	refcount_set(&partition->ref_count, 1);
>> +
>> +	ret = add_partition(partition);
>> +	if (ret)
>> +		goto release_file;
>> +
>> +	fd_install(fd, file);
>> +
>> +	return fd;
>> +
>> +release_file:
>> +	file->f_op->release(file->f_inode, file);
>> +put_fd:
>> +	put_unused_fd(fd);
>> +free_partition:
>> +	kfree(partition);
>> +	return ret;
>> +}
>> +
>>  static long
>>  mshv_ioctl_request_version(u32 *version, void __user *user_arg)
>>  {
>> @@ -59,7 +216,10 @@ mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  	if (*version == MSHV_INVALID_VERSION)
>>  		return -EBADFD;
>>  
>> -	/* TODO other ioctls */
>> +	switch (ioctl) {
>> +	case MSHV_CREATE_PARTITION:
>> +		return mshv_ioctl_create_partition((void __user *)arg);
>> +	}
>>  
>>  	return -ENOTTY;
>>  }
>> @@ -82,21 +242,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>>  	return 0;
>>  }
>>  
>> -static const struct file_operations mshv_dev_fops = {
>> -	.owner = THIS_MODULE,
>> -	.open = mshv_dev_open,
>> -	.release = mshv_dev_release,
>> -	.unlocked_ioctl = mshv_dev_ioctl,
>> -	.llseek = noop_llseek,
>> -};
>> -
>> -static struct miscdevice mshv_dev = {
>> -	.minor = MISC_DYNAMIC_MINOR,
>> -	.name = "mshv",
>> -	.fops = &mshv_dev_fops,
>> -	.mode = 600,
>> -};
>> -
> 
> This looks like an unneeded code churn as these structs just got added a
> few patches ago. It would probably be possible to put it to the right
> place from the very beginning so you don't need to move it in this
> patch.
> 

Agreed

>>  static int
>>  __init mshv_init(void)
>>  {
>> @@ -106,6 +251,8 @@ __init mshv_init(void)
>>  	if (r)
>>  		pr_err("%s: misc device register failed\n", __func__);
>>  
>> +	spin_lock_init(&mshv.partitions.lock);
>> +
>>  	return r;
>>  }
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-02-08 19:41   ` Michael Kelley
@ 2021-03-04 21:35     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-04 21:35 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

On 2/8/2021 11:41 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
>>
>> Reserve ioctl number in userpsace-api/ioctl/ioctl-number.rst
>> Introduce MSHV_REQUEST_VERSION ioctl.
>> Introduce documentation for /dev/mshv in Documentation/virt/mshv
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  .../userspace-api/ioctl/ioctl-number.rst      |  2 +
>>  Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
>>  include/linux/mshv.h                          | 11 ++++
>>  include/uapi/linux/mshv.h                     | 19 ++++++
>>  virt/mshv/mshv_main.c                         | 49 +++++++++++++++
>>  5 files changed, 143 insertions(+)
>>  create mode 100644 Documentation/virt/mshv/api.rst
>>  create mode 100644 include/linux/mshv.h
>>  create mode 100644 include/uapi/linux/mshv.h
>>
>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> index 55a2d9b2ce33..13a4d3ecafca 100644
>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>> @@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
>>  0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-
>> remoteproc@vger.kernel.org>
>>  0xB6  all    linux/fpga-dfl.h
>>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-
>> remoteproc@vger.kernel.org>
>> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
>> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>>  0xC0  00-0F  linux/usb/iowarrior.h
>>  0xCA  00-0F  uapi/misc/cxl.h
>>  0xCA  10-2F  uapi/misc/ocxl.h
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> new file mode 100644
>> index 000000000000..82e32de48d03
>> --- /dev/null
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -0,0 +1,62 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +Microsoft Hypervisor Root Partition API Documentation
>> +=====================================================
>> +
>> +1. Overview
>> +===========
>> +
>> +This document describes APIs for creating and managing guest virtual machines
>> +when running Linux as the root partition on the Microsoft Hypervisor.
>> +
>> +This API is not yet stable.
>> +
>> +2. Glossary/Terms
>> +=================
>> +
>> +hv
>> +--
>> +Short for Hyper-V. This name is used in the kernel to describe interfaces to
>> +the Microsoft Hypervisor.
>> +
>> +mshv
>> +----
>> +Short for Microsoft Hypervisor. This is the name of the userland API module
>> +described in this document.
>> +
>> +Partition
>> +---------
>> +A virtual machine running on the Microsoft Hypervisor.
>> +
>> +Root Partition
>> +--------------
>> +The partition that is created and assumes control when the machine boots. The
>> +root partition can use mshv APIs to create guest partitions.
>> +
>> +3. API description
>> +==================
>> +
>> +The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
>> +
>> +Mshv is file descriptor-based, following a similar pattern to KVM.
>> +
>> +To get a handle to the mshv driver, use open("/dev/mshv").
>> +
>> +3.1 MSHV_REQUEST_VERSION
>> +------------------------
>> +:Type: /dev/mshv ioctl
>> +:Parameters: pointer to a u32
>> +:Returns: 0 on success
>> +
>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>> +establish the interface version with the kernel module.
>> +
>> +The caller should pass the MSHV_VERSION as an argument.
>> +
>> +The kernel module will check which interface versions it supports and return 0
>> +if one of them matches.
>> +
>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>> +it is open - this ioctl can only be called once per open.
> 
> To clarify the wording:
> 
> The caller should pass the requested version as an argument.  If the requested
> version is one that the kernel module supports, the ioctl will return 0.  If the
> requested version is not supported by the kernel module, the caller may try
> the ioctl repeatedly to find a version that the caller supports and that the kernel
> module supports.   Once a match is found, the /dev/mshv file descriptor is
> 'locked' to that version as long as it is open; i.e., the ioctl can succeed
> only once per open.
> 

Thanks, yes that's a bit clearer!

>> +
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> new file mode 100644
>> index 000000000000..a0982fe2c0b8
>> --- /dev/null
>> +++ b/include/linux/mshv.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef _LINUX_MSHV_H
>> +#define _LINUX_MSHV_H
>> +
>> +/*
>> + * Microsoft Hypervisor root partition driver for /dev/mshv
>> + */
>> +
>> +#include <uapi/linux/mshv.h>
>> +
>> +#endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> new file mode 100644
>> index 000000000000..dd30fc2f0a80
>> --- /dev/null
>> +++ b/include/uapi/linux/mshv.h
>> @@ -0,0 +1,19 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_LINUX_MSHV_H
>> +#define _UAPI_LINUX_MSHV_H
>> +
>> +/*
>> + * Userspace interface for /dev/mshv
>> + * Microsoft Hypervisor root partition APIs
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +#define MSHV_VERSION	0x0
>> +
>> +#define MSHV_IOCTL 0xB8
>> +
>> +/* mshv device */
>> +#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>> +
>> +#endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index ecb9089761fe..62f631f85301 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -11,25 +11,74 @@
>>  #include <linux/module.h>
>>  #include <linux/fs.h>
>>  #include <linux/miscdevice.h>
>> +#include <linux/slab.h>
>> +#include <linux/mshv.h>
>>
>>  MODULE_AUTHOR("Microsoft");
>>  MODULE_LICENSE("GPL");
>>
>> +#define MSHV_INVALID_VERSION	0xFFFFFFFF
>> +#define MSHV_CURRENT_VERSION	MSHV_VERSION
>> +
>> +static u32 supported_versions[] = {
>> +	MSHV_CURRENT_VERSION,
>> +};
> 
> I'm not sure that the concept of "CURRENT_VERSION" makes sense
> as a fixed constant.  We have an array of supported versions, any of
> which are valid and supported by the kernel module.   The array
> should list individual versions.   The current version is 0, which 
> might be labelled as MSHV_VERSION_PRERELEASE, or something
> similar.  Then later we might have MSHV_VERSION_RELEASE_1,
> HSMV_VERSION_RELEASE_2, as needed.  Or maybe the versions
> are tied to releases of the Microsoft Hypervisor.
> 

The idea was that CURRENT_VERSION matches the version in the shared
header file, which would change each release. I can see how this would
be confusing - I will change it as you suggest.

>> +
>> +static long
>> +mshv_ioctl_request_version(u32 *version, void __user *user_arg)
>> +{
>> +	u32 arg;
>> +	int i;
>> +
>> +	if (copy_from_user(&arg, user_arg, sizeof(arg)))
>> +		return -EFAULT;
>> +
>> +	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
>> +		if (supported_versions[i] == arg) {
>> +			*version = supported_versions[i];
>> +			return 0;
>> +		}
>> +	}
>> +	return -ENOTSUPP;
>> +}
>> +
>>  static long
>>  mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  {
>> +	u32 *version = (u32 *)filp->private_data;
>> +
>> +	if (ioctl == MSHV_REQUEST_VERSION) {
>> +		/* Version can only be set once */
>> +		if (*version != MSHV_INVALID_VERSION)
>> +			return -EBADFD;
>> +
>> +		return mshv_ioctl_request_version(version, (void __user *)arg);
>> +	}
>> +
>> +	/* Version must be set before other ioctls can be called */
>> +	if (*version == MSHV_INVALID_VERSION)
>> +		return -EBADFD;
>> +
>> +	/* TODO other ioctls */
>> +
>>  	return -ENOTTY;
>>  }
>>
>>  static int
>>  mshv_dev_open(struct inode *inode, struct file *filp)
>>  {
>> +	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
>> +	if (!filp->private_data)
>> +		return -ENOMEM;
>> +	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
>> +
>>  	return 0;
>>  }
>>
>>  static int
>>  mshv_dev_release(struct inode *inode, struct file *filp)
>>  {
>> +	kfree(filp->private_data);
>>  	return 0;
>>  }
>>
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
  2021-02-08 19:42   ` Michael Kelley
@ 2021-03-04 23:49     ` Nuno Das Neves
  2021-03-04 23:58       ` Michael Kelley
  0 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-04 23:49 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

On 2/8/2021 11:42 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
>>
>> Add hypercalls for fully setting up and mostly tearing down a guest
>> partition.
>> The teardown operation will generate an error as the deposited
>> memory has not been withdrawn.
>> This is fixed in the next patch.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  include/asm-generic/hyperv-tlfs.h      |  52 +++++++-
>>  include/uapi/asm-generic/hyperv-tlfs.h |   1 +
>>  include/uapi/linux/mshv.h              |   1 +
>>  virt/mshv/mshv_main.c                  | 169 ++++++++++++++++++++++++-
>>  4 files changed, 220 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 2ff580780ce4..ab6ae6c164f5 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -142,6 +142,10 @@ struct ms_hyperv_tsc_page {
>>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
>>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
>>  #define HVCALL_SEND_IPI_EX			0x0015
>> +#define HVCALL_CREATE_PARTITION			0x0040
>> +#define HVCALL_INITIALIZE_PARTITION		0x0041
>> +#define HVCALL_FINALIZE_PARTITION		0x0042
>> +#define HVCALL_DELETE_PARTITION			0x0043
>>  #define HVCALL_GET_PARTITION_ID			0x0046
>>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>>  #define HVCALL_CREATE_VP			0x004e
>> @@ -451,7 +455,7 @@ struct hv_get_partition_id {
>>  struct hv_deposit_memory {
>>  	u64 partition_id;
>>  	u64 gpa_page_list[];
>> -} __packed;
>> +};
> 
> Why remove __packed?
> 

I change it to match the hyperv structures exactly.
I will re-add __packed here and explicitly lay out all the
structures as you have suggested.

>>
>>  struct hv_proximity_domain_flags {
>>  	u32 proximity_preferred : 1;
>> @@ -767,4 +771,50 @@ struct hv_input_unmap_device_interrupt {
>>  #define HV_SOURCE_SHADOW_NONE               0x0
>>  #define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE   0x1
>>
>> +#define HV_MAKE_COMPATIBILITY_VERSION(major_, minor_)                          \
>> +	((u32)((major_) << 8 | (minor_)))
>> +
>> +enum hv_compatibility_version {
>> +	HV_COMPATIBILITY_19_H1 = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X5),
>> +	HV_COMPATIBILITY_MANGANESE = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X7),
> 
> Avoid use of "Manganese", which is an internal code name.  I'd suggest calling it
> 20_H1 instead, which at least has some broader meaning.
> 
>> +	HV_COMPATIBILITY_PRERELEASE = HV_MAKE_COMPATIBILITY_VERSION(0XFE, 0X0),
>> +	HV_COMPATIBILITY_EXPERIMENT = HV_MAKE_COMPATIBILITY_VERSION(0XFF, 0X0),
>> +};
>> +
>> +union hv_partition_isolation_properties {
>> +	u64 as_uint64;
>> +	struct {
>> +		u64 isolation_type: 5;
>> +		u64 rsvd_z: 7;
>> +		u64 shared_gpa_boundary_page_number: 52;
>> +	};
>> +};
> 
> Add __packed.
> 

Will do.

>> +
>> +/* Non-userspace-visible partition creation flags */
>> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
>> +
>> +struct hv_create_partition_in {
>> +	u64 flags;
>> +	union hv_proximity_domain_info proximity_domain_info;
>> +	enum hv_compatibility_version compatibility_version;
> 
> An "enum" is a 32 bit value in gcc and I would presume that
> Hyper-V is expecting a 64 bit value.  In general, using an enum in a data
> structure with exact layout requirements is problematic because the "C"
> language doesn't specify how big an enum is.  In such cases, it's better
> to use an integer field with an explicit size (like u64) and #defines for
> the possible values.
> 

Will do.

>> +	struct hv_partition_creation_properties partition_creation_properties;
>> +	union hv_partition_isolation_properties isolation_properties;
>> +};
>> +
>> +struct hv_create_partition_out {
>> +	u64 partition_id;
>> +};
>> +
>> +struct hv_initialize_partition {
>> +	u64 partition_id;
>> +};
>> +
>> +struct hv_finalize_partition {
>> +	u64 partition_id;
>> +};
>> +
>> +struct hv_delete_partition {
>> +	u64 partition_id;
>> +};
> 
> All of the above should have __packed for consistency with the other
> Hyper-V data structures.
> 

Will do.

>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
>> tlfs.h
>> index 140cc0b4f98f..7a858226a9c5 100644
>> --- a/include/uapi/asm-generic/hyperv-tlfs.h
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -6,6 +6,7 @@
>>  #define BIT(X)	(1ULL << (X))
>>  #endif
>>
>> +/* Userspace-visible partition creation flags */
> 
> Could this comment be included in the earlier patch with the #defines so
> that you avoid the trivial change here?
> 

Yep, will do.

>>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
>>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index 3788f8bc5caa..4f8da9a6fde2 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -9,6 +9,7 @@
>>
>>  #include <linux/types.h>
>>  #include <asm/hyperv-tlfs.h>
>> +#include <asm-generic/hyperv-tlfs.h>
> 
> Similarly, consider adding this #include in the earlier patch so that
> this trivial change isn't needed here.
> 

Will do.

>>
>>  #define MSHV_VERSION	0x0
>>
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 4dcbe4907430..c4130a6508e5 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -15,6 +15,7 @@
>>  #include <linux/file.h>
>>  #include <linux/anon_inodes.h>
>>  #include <linux/mshv.h>
>> +#include <asm/mshyperv.h>
>>
>>  MODULE_AUTHOR("Microsoft");
>>  MODULE_LICENSE("GPL");
>> @@ -31,7 +32,6 @@ static struct mshv mshv = {};
>>  static void mshv_partition_put(struct mshv_partition *partition);
>>  static int mshv_partition_release(struct inode *inode, struct file *filp);
>>  static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>> -
> 
> Spurious whitespace change?
> 

Yes - removed it.

>>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>> @@ -57,6 +57,143 @@ static struct miscdevice mshv_dev = {
>>  	.mode = 600,
>>  };
>>
>> +#define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> 
> A comment about how this value is determined would be useful.
> I'm assuming it was determined empirically.
> 

Correct - I'll add a comment.

>> +
>> +static int
>> +hv_call_create_partition(
>> +		u64 flags,
>> +		struct hv_partition_creation_properties creation_properties,
>> +		u64 *partition_id)
>> +{
>> +	struct hv_create_partition_in *input;
>> +	struct hv_create_partition_out *output;
>> +	int status;
>> +	int ret;
>> +	unsigned long irq_flags;
>> +	int i;
>> +
>> +	do {
>> +		local_irq_save(irq_flags);
>> +		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
>> +			hyperv_pcpu_input_arg));
>> +		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
>> +			hyperv_pcpu_output_arg));
>> +
>> +		input->flags = flags;
>> +		input->proximity_domain_info.as_uint64 = 0;
>> +		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
>> +		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
>> +			input->partition_creation_properties
>> +				.disabled_processor_features.as_uint64[i] = 0;
>> +		input->partition_creation_properties
>> +			.disabled_processor_xsave_features.as_uint64 = 0;
>> +		input->isolation_properties.as_uint64 = 0;
>> +
>> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
>> +					 input, output);
> 
> hv_do_hypercall returns a u64, which should then be masked with
> HV_HYPERCALL_RESULT_MASK before checking the result.
> 

Yes, I'll fix this everywhere.

>> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			if (status == HV_STATUS_SUCCESS)
>> +				*partition_id = output->partition_id;
>> +			else
>> +				pr_err("%s: %s\n",
>> +				       __func__, hv_status_to_string(status));
>> +			local_irq_restore(irq_flags);
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +		local_irq_restore(irq_flags);
>> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> +					    hv_current_partition_id, 1);
>> +	} while (!ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +hv_call_initialize_partition(u64 partition_id)
>> +{
>> +	struct hv_initialize_partition *input;
>> +	int status;
>> +	int ret;
>> +	unsigned long flags;
>> +
>> +	ret = hv_call_deposit_pages(
>> +				NUMA_NO_NODE,
>> +				partition_id,
>> +				HV_INIT_PARTITION_DEPOSIT_PAGES);
>> +	if (ret)
>> +		return ret;
>> +
>> +	do {
>> +		local_irq_save(flags);
>> +		input = (struct hv_initialize_partition *)(*this_cpu_ptr(
>> +			hyperv_pcpu_input_arg));
>> +		input->partition_id = partition_id;
>> +
>> +		status = hv_do_hypercall(
>> +				HVCALL_INITIALIZE_PARTITION,
>> +				input, NULL);
> 
> FWIW, since the input is a single 64 bit value, and there's no output,
> this could use hv_do_fast_hypercall8() instead, and avoid
> needing to use the input arg page and the irq save/restore.  Would have
> to check that the particular hypercall supports the "fast" version.
> 

Good idea! I tested it and confirmed it works.

>> +		local_irq_restore(flags);
>> +
>> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> 
> Same comment about status being u64 and masking.
> 

Will do.

>> +			if (status != HV_STATUS_SUCCESS)
>> +				pr_err("%s: %s\n",
>> +				       __func__, hv_status_to_string(status));
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>> +	} while (!ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +hv_call_finalize_partition(u64 partition_id)
>> +{
>> +	struct hv_finalize_partition *input;
>> +	int status;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	input = (struct hv_finalize_partition *)(*this_cpu_ptr(
>> +		hyperv_pcpu_input_arg));
>> +
>> +	input->partition_id = partition_id;
>> +	status = hv_do_hypercall(
>> +			HVCALL_FINALIZE_PARTITION,
>> +			input, NULL);
>> +	local_irq_restore(flags);
> 
> 
> Same comment about hv_do_fast_hypercall8() and about status
> being a u64 and masking.
> 

Will do.

>> +
>> +	if (status != HV_STATUS_SUCCESS)
>> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
>> +
>> +	return -hv_status_to_errno(status);
>> +}
>> +
>> +static int
>> +hv_call_delete_partition(u64 partition_id)
>> +{
>> +	struct hv_delete_partition *input;
>> +	int status;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	input = (struct hv_delete_partition *)(*this_cpu_ptr(
>> +		hyperv_pcpu_input_arg));
>> +
>> +	input->partition_id = partition_id;
>> +	status = hv_do_hypercall(
>> +			HVCALL_DELETE_PARTITION,
>> +			input, NULL);
>> +	local_irq_restore(flags);
> 
> Same comments about hv_do_fast_hypercall8(), and
> the status and masking.
> 

Will do.

>> +
>> +	if (status != HV_STATUS_SUCCESS)
>> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
>> +
>> +	return -hv_status_to_errno(status);
>> +}
>> +
>>  static long
>>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  {
>> @@ -86,6 +223,17 @@ destroy_partition(struct mshv_partition *partition)
>>
>>  	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
>>
>> +	/*
>> +	 * There are no remaining references to the partition or vps,
>> +	 * so the remaining cleanup can be lockless
>> +	 */
>> +
>> +	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
>> +	hv_call_finalize_partition(partition->id);
>> +	/* TODO: Withdraw and free all pages we deposited */
>> +
>> +	hv_call_delete_partition(partition->id);
>> +
>>  	kfree(partition);
>>  }
>>
>> @@ -146,6 +294,9 @@ mshv_ioctl_create_partition(void __user *user_arg)
>>  	if (copy_from_user(&args, user_arg, sizeof(args)))
>>  		return -EFAULT;
>>
>> +	/* Only support EXO partitions */
>> +	args.flags |= HV_PARTITION_CREATION_FLAG_EXO_PARTITION;
>> +
>>  	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
>>  	if (!partition)
>>  		return -ENOMEM;
>> @@ -156,11 +307,21 @@ mshv_ioctl_create_partition(void __user *user_arg)
>>  		goto free_partition;
>>  	}
>>
>> +	ret = hv_call_create_partition(args.flags,
>> +				       args.partition_creation_properties,
>> +				       &partition->id);
>> +	if (ret)
>> +		goto put_fd;
>> +
>> +	ret = hv_call_initialize_partition(partition->id);
>> +	if (ret)
>> +		goto delete_partition;
>> +
>>  	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
>>  				  partition, O_RDWR);
>>  	if (IS_ERR(file)) {
>>  		ret = PTR_ERR(file);
>> -		goto put_fd;
>> +		goto finalize_partition;
>>  	}
>>  	refcount_set(&partition->ref_count, 1);
>>
>> @@ -174,6 +335,10 @@ mshv_ioctl_create_partition(void __user *user_arg)
>>
>>  release_file:
>>  	file->f_op->release(file->f_inode, file);
>> +finalize_partition:
>> +	hv_call_finalize_partition(partition->id);
>> +delete_partition:
>> +	hv_call_delete_partition(partition->id);
>>  put_fd:
>>  	put_unused_fd(fd);
>>  free_partition:
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
  2021-03-04 23:49     ` Nuno Das Neves
@ 2021-03-04 23:58       ` Michael Kelley
  0 siblings, 0 replies; 53+ messages in thread
From: Michael Kelley @ 2021-03-04 23:58 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 4, 2021 3:49 PM
> 
> On 2/8/2021 11:42 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:30 PM
> >>

[snip]

> >> +
> >> +static int
> >> +hv_call_create_partition(
> >> +		u64 flags,
> >> +		struct hv_partition_creation_properties creation_properties,
> >> +		u64 *partition_id)
> >> +{
> >> +	struct hv_create_partition_in *input;
> >> +	struct hv_create_partition_out *output;
> >> +	int status;
> >> +	int ret;
> >> +	unsigned long irq_flags;
> >> +	int i;
> >> +
> >> +	do {
> >> +		local_irq_save(irq_flags);
> >> +		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_output_arg));
> >> +
> >> +		input->flags = flags;
> >> +		input->proximity_domain_info.as_uint64 = 0;
> >> +		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
> >> +		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
> >> +			input->partition_creation_properties
> >> +				.disabled_processor_features.as_uint64[i] = 0;
> >> +		input->partition_creation_properties
> >> +			.disabled_processor_xsave_features.as_uint64 = 0;
> >> +		input->isolation_properties.as_uint64 = 0;
> >> +
> >> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> >> +					 input, output);
> >
> > hv_do_hypercall returns a u64, which should then be masked with
> > HV_HYPERCALL_RESULT_MASK before checking the result.
> >
> 
> Yes, I'll fix this everywhere.
> 
> >> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			if (status == HV_STATUS_SUCCESS)
> >> +				*partition_id = output->partition_id;
> >> +			else
> >> +				pr_err("%s: %s\n",
> >> +				       __func__, hv_status_to_string(status));
> >> +			local_irq_restore(irq_flags);
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +		local_irq_restore(irq_flags);
> >> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +					    hv_current_partition_id, 1);
> >> +	} while (!ret);
> >> +
> >> +	return ret;
> >> +}
> >> +

I had a separate thread on the linux-hyperv mailing list about the
inconsistency in how we check hypercall status in current upstream
code, and proposed some helper functions to make it easier and
more consistent.  Joe Salisbury has started work on a patch to
provide those helper functions and to start using them in current
upstream code.  You could coordinate with Joe to get the helper
functions as well and use them as discussed in that thread.  Then
later on we won't have to come back and fix up the uses in this
patch series.

Michael

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-03-04 18:43     ` Nuno Das Neves
@ 2021-03-05  9:18       ` Vitaly Kuznetsov
  2021-04-07  0:21         ` Nuno Das Neves
  0 siblings, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-05  9:18 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>> 
...
>>> +
>>> +3.1 MSHV_REQUEST_VERSION
>>> +------------------------
>>> +:Type: /dev/mshv ioctl
>>> +:Parameters: pointer to a u32
>>> +:Returns: 0 on success
>>> +
>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>>> +establish the interface version with the kernel module.
>>> +
>>> +The caller should pass the MSHV_VERSION as an argument.
>>> +
>>> +The kernel module will check which interface versions it supports and return 0
>>> +if one of them matches.
>>> +
>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>> +it is open - this ioctl can only be called once per open.
>>> +
>> 
>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
>> instead.
>> 
>
> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
> When we add new features/ioctls beyond the core we can use an extension/capability
> approach like KVM.
>

Driver versions is a very bad idea from distribution/stable kernel point
of view as it presumes that the history is linear. It is not.

Imagine you have the following history upstream:

MSHV_REQUEST_VERSION = 1
<100 commits with features/fixes>
MSHV_REQUEST_VERSION = 2
<another 100 commits with features/fixes>
MSHV_REQUEST_VERSION = 2

Now I'm a linux distribution / stable kernel maintainer. My kernel is at
MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
history now looks like

MSHV_REQUEST_VERSION = 1
<5 commits from between VER=1 and VER=2>
   Which version should I declare here???? 
<5 commits from between VER=2 and VER=3>
   Which version should I declare here???? 

If I keep VER=1 then userspace will think that I don't have any extra
features added and just won't use them. If I change VER to 2/3, it'll
think I have *all* features from between these versions.

The only reasonable way to manage this is to attach a "capability" to
every ABI change and expose this capability *in the same commit which
introduces the change to the ABI*. This way userspace will now exactly
which ioctls are available and what are their interfaces.

Also, trying to define "core set" is hard but you don't really need
to.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall
  2021-02-08 19:44   ` Michael Kelley
@ 2021-03-05 21:01     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-05 21:01 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

On 2/8/2021 11:44 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
>>
>> Withdraw the memory from a finalized partition and free the pages.
>> The partition is now cleaned up correctly when the fd is released.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  include/asm-generic/hyperv-tlfs.h | 10 ++++++
>>  virt/mshv/mshv_main.c             | 54 ++++++++++++++++++++++++++++++-
>>  2 files changed, 63 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index ab6ae6c164f5..2a49503b7396 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -148,6 +148,7 @@ struct ms_hyperv_tsc_page {
>>  #define HVCALL_DELETE_PARTITION			0x0043
>>  #define HVCALL_GET_PARTITION_ID			0x0046
>>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>> +#define HVCALL_WITHDRAW_MEMORY			0x0049
>>  #define HVCALL_CREATE_VP			0x004e
>>  #define HVCALL_GET_VP_REGISTERS			0x0050
>>  #define HVCALL_SET_VP_REGISTERS			0x0051
>> @@ -472,6 +473,15 @@ union hv_proximity_domain_info {
>>  	u64 as_uint64;
>>  };
>>
>> +struct hv_withdraw_memory_in {
>> +	u64 partition_id;
>> +	union hv_proximity_domain_info proximity_domain_info;
>> +};
>> +
>> +struct hv_withdraw_memory_out {
>> +	u64 gpa_page_list[0];
> 
> For a variable size array, the Linux kernel community has an effort
> underway to replace occurrences of [0] and [1] with just [].  I think
> [] can be used here.
> 

It seems the compiler doesn't like that, because there's no other members in the struct?

./include/asm-generic/hyperv-tlfs.h:452:6: error: flexible array member in a struct with no named members
  452 |  u64 gpa_page_list[];

I'll add a comment explaining that it's a hack to work around the compiler.

>> +};
>> +
> 
> Add __packed to the above two structs.
> 

Will do.

>>  struct hv_lp_startup_status {
>>  	u64 hv_status;
>>  	u64 substatus1;
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index c4130a6508e5..162a1bb42a4a 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/file.h>
>>  #include <linux/anon_inodes.h>
>> +#include <linux/mm.h>
>>  #include <linux/mshv.h>
>>  #include <asm/mshyperv.h>
>>
>> @@ -57,8 +58,58 @@ static struct miscdevice mshv_dev = {
>>  	.mode = 600,
>>  };
>>
>> +#define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
> 
> Use HV_HYP_PAGE_SIZE so that we're explicit that the dependency
> is on the page size used by Hyper-V, which might be different from the
> guest page size (at least on architectures like ARM64).
> 

Yes, will do.

>>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
>>
>> +static int
>> +hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
>> +{
>> +	struct hv_withdraw_memory_in *input_page;
>> +	struct hv_withdraw_memory_out *output_page;
>> +	u16 completed;
>> +	u64 hypercall_status;
>> +	unsigned long remaining = count;
>> +	int status;
>> +	int i;
>> +	unsigned long flags;
>> +
>> +	while (remaining) {
>> +		local_irq_save(flags);
>> +
>> +		input_page = (struct hv_withdraw_memory_in *)(*this_cpu_ptr(
>> +			hyperv_pcpu_input_arg));
>> +		output_page = (struct hv_withdraw_memory_out *)(*this_cpu_ptr(
>> +			hyperv_pcpu_output_arg));
>> +
>> +		input_page->partition_id = partition_id;
>> +		input_page->proximity_domain_info.as_uint64 = 0;
>> +		hypercall_status = hv_do_rep_hypercall(
>> +			HVCALL_WITHDRAW_MEMORY,
>> +			min(remaining, HV_WITHDRAW_BATCH_SIZE), 0, input_page,
>> +			output_page);
>> +
>> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
>> +			    HV_HYPERCALL_REP_COMP_OFFSET;
>> +
>> +		for (i = 0; i < completed; i++)
>> +			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
>> +
>> +		local_irq_restore(flags);
> 
> Seems like there's some risk that we have interrupts disabled for too long.
> We could be calling __free_page() up to 512 times.  It might be better for this
> function to allocate its own page to be used as the output page, so that interrupts
> can be enabled immediately after the hypercall completes.  Then the __free_page()
> loop can execute with interrupts enabled.   We have the per-cpu input and output
> pages to avoid the overhead of allocating/freeing pages for each hypercall, but in this
> case a private output page might be warranted.
> 

Good idea, I'll do that.

>> +
>> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
>> +		if (status != HV_STATUS_SUCCESS) {
>> +			if (status != HV_STATUS_NO_RESOURCES)
>> +				pr_err("%s: %s\n", __func__,
>> +				       hv_status_to_string(status));
>> +			break;
>> +		}
>> +
>> +		remaining -= completed;
>> +	}
>> +
>> +	return -hv_status_to_errno(status);
>> +}
>> +
>>  static int
>>  hv_call_create_partition(
>>  		u64 flags,
>> @@ -230,7 +281,8 @@ destroy_partition(struct mshv_partition *partition)
>>
>>  	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
>>  	hv_call_finalize_partition(partition->id);
>> -	/* TODO: Withdraw and free all pages we deposited */
>> +	/* Withdraw and free all pages we deposited */
>> +	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->id);
>>
>>  	hv_call_delete_partition(partition->id);
>>
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
  2021-02-08 19:45   ` Michael Kelley
@ 2021-03-08 19:14     ` Nuno Das Neves
  2021-03-08 19:30       ` Michael Kelley
  0 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-08 19:14 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

On 2/8/2021 11:45 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
>>
>> Introduce ioctls for mapping and unmapping regions of guest memory.
>>
>> Uses a table of memory 'slots' similar to KVM, but the slot
>> number is not visible to userspace.
>>
>> For now, this simple implementation requires each new mapping to be
>> disjoint - the underlying hypercalls have no such restriction, and
>> implicitly overwrite any mappings on the pages in the specified regions.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  Documentation/virt/mshv/api.rst        |  15 ++
>>  include/asm-generic/hyperv-tlfs.h      |  15 ++
>>  include/linux/mshv.h                   |  14 ++
>>  include/uapi/asm-generic/hyperv-tlfs.h |   9 +
>>  include/uapi/linux/mshv.h              |  15 ++
>>  virt/mshv/mshv_main.c                  | 322 ++++++++++++++++++++++++-
>>  6 files changed, 388 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> index ce651a1738e0..530efc29d354 100644
>> --- a/Documentation/virt/mshv/api.rst
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -72,3 +72,18 @@ it is open - this ioctl can only be called once per open.
>>  This ioctl creates a guest partition, returning a file descriptor to use as a
>>  handle for partition ioctls.
>>
>> +3.3 MSHV_MAP_GUEST_MEMORY and MSHV_UNMAP_GUEST_MEMORY
>> +-----------------------------------------------------
>> +:Type: partition ioctl
>> +:Parameters: struct mshv_user_mem_region
>> +:Returns: 0 on success
>> +
>> +Create a mapping from a region of process memory to a region of physical memory
>> +in a guest partition.
> 
> Just to be super explicit:
> 
> Create a mapping from memory in the user space of the calling process (running
> in the root partition) to a region of guest physical memory in a guest partition.
> 

Thanks, yes this is clearer.

>> +
>> +Mappings must be disjoint in process address space and guest address space.
>> +
>> +Note: In the current implementation, this memory is pinned to stop the pages
>> +being moved by linux and subsequently clobbered by the hypervisor. So the region
>> +is backed by physical memory.
> 
> Again to be super explicit:
> 
> Note: In the current implementation, this memory is pinned to real physical
> memory to stop the pages being moved by Linux in the root partition,
> and subsequently being clobbered by the hypervisor.  So the region is backed
> by real physical memory.
> 

Yep, I'll update this also.

>> +
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 2a49503b7396..6e5072e29897 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -149,6 +149,8 @@ struct ms_hyperv_tsc_page {
>>  #define HVCALL_GET_PARTITION_ID			0x0046
>>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>>  #define HVCALL_WITHDRAW_MEMORY			0x0049
>> +#define HVCALL_MAP_GPA_PAGES			0x004b
>> +#define HVCALL_UNMAP_GPA_PAGES			0x004c
>>  #define HVCALL_CREATE_VP			0x004e
>>  #define HVCALL_GET_VP_REGISTERS			0x0050
>>  #define HVCALL_SET_VP_REGISTERS			0x0051
>> @@ -827,4 +829,17 @@ struct hv_delete_partition {
>>  	u64 partition_id;
>>  };
>>
>> +struct hv_map_gpa_pages {
>> +	u64 target_partition_id;
>> +	u64 target_gpa_base;
>> +	u32 map_flags;
> 
> Is there a reserved 32 bit field here?  Hyper-V always aligns
> things on 64 bit boundaries.
> 

The hypervisor code uses implicit padding here, and I copied it directly.
Yes, it should be 8 byte aligned. I will insert a padding field and add __packed.

>> +	u64 source_gpa_page_list[];
>> +};
>> +
>> +struct hv_unmap_gpa_pages {
>> +	u64 target_partition_id;
>> +	u64 target_gpa_base;
>> +	u32 unmap_flags;
> 
> Is there a reserved 32 bit field here?  Hyper-V always aligns
> things on 64 bit boundaries.
> 

ditto as above.

>> +};
> 
> Add __packed to the above two structs after sorting out
> the alignment issues.
> 

Yep.

>> +
>>  #endif
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> index fc4f35089b2c..91a742f37440 100644
>> --- a/include/linux/mshv.h
>> +++ b/include/linux/mshv.h
>> @@ -7,13 +7,27 @@
>>   */
>>
>>  #include <linux/spinlock.h>
>> +#include <linux/mutex.h>
>>  #include <uapi/linux/mshv.h>
>>
>>  #define MSHV_MAX_PARTITIONS		128
>> +#define MSHV_MAX_MEM_REGIONS		64
>> +
>> +struct mshv_mem_region {
>> +	u64 size; /* bytes */
>> +	u64 guest_pfn;
>> +	u64 userspace_addr; /* start of the userspace allocated memory */
>> +	struct page **pages;
>> +};
>>
>>  struct mshv_partition {
>>  	u64 id;
>>  	refcount_t ref_count;
>> +	struct mutex mutex;
>> +	struct {
>> +		u32 count;
>> +		struct mshv_mem_region slots[MSHV_MAX_MEM_REGIONS];
>> +	} regions;
>>  };
>>
>>  struct mshv {
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
>> tlfs.h
>> index 7a858226a9c5..e7b09b9f00de 100644
>> --- a/include/uapi/asm-generic/hyperv-tlfs.h
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -12,4 +12,13 @@
>>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
>>  #define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
>>
>> +/* HV Map GPA (Guest Physical Address) Flags */
>> +#define HV_MAP_GPA_PERMISSIONS_NONE     0x0
>> +#define HV_MAP_GPA_READABLE             0x1
>> +#define HV_MAP_GPA_WRITABLE             0x2
>> +#define HV_MAP_GPA_KERNEL_EXECUTABLE    0x4
>> +#define HV_MAP_GPA_USER_EXECUTABLE      0x8
>> +#define HV_MAP_GPA_EXECUTABLE           0xC
>> +#define HV_MAP_GPA_PERMISSIONS_MASK     0xF
>> +
>>  #endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index 4f8da9a6fde2..47be03ef4e86 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -18,10 +18,25 @@ struct mshv_create_partition {
>>  	struct hv_partition_creation_properties partition_creation_properties;
>>  };
>>
>> +/*
>> + * Mappings can't overlap in GPA space or userspace
>> + * To unmap, these fields must match an existing mapping
>> + */
>> +struct mshv_user_mem_region {
>> +	__u64 size;		/* bytes */
>> +	__u64 guest_pfn;
>> +	__u64 userspace_addr;	/* start of the userspace allocated memory */
>> +	__u32 flags;		/* ignored on unmap */
>> +};
>> +
>>  #define MSHV_IOCTL 0xB8
>>
>>  /* mshv device */
>>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>>  #define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
>>
>> +/* partition device */
>> +#define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
>> +#define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
>> +
>>  #endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 162a1bb42a4a..ce480598e67f 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -60,6 +60,10 @@ static struct miscdevice mshv_dev = {
>>
>>  #define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
>>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
>> +#define HV_MAP_GPA_MASK		(0x0000000FFFFFFFFFULL)
>> +#define HV_MAP_GPA_BATCH_SIZE	\
>> +		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
> 
> Hmmm. Shouldn't this be:
> 
> 	((HV_HYP_PAGE_SIZE - sizeof(struct hv_map_gpa_pages))/sizeof(u64))
> 
> 

Yes! Not sure how that happened.

>> +#define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
>>
>>  static int
>>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
>> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
>>  	return -hv_status_to_errno(status);
>>  }
>>
>> +static int
>> +hv_call_map_gpa_pages(u64 partition_id,
>> +		      u64 gpa_target,
>> +		      u64 page_count, u32 flags,
>> +		      struct page **pages)
>> +{
>> +	struct hv_map_gpa_pages *input_page;
>> +	int status;
>> +	int i;
>> +	struct page **p;
>> +	u32 completed = 0;
>> +	u64 hypercall_status;
>> +	unsigned long remaining = page_count;
>> +	int rep_count;
>> +	unsigned long irq_flags;
>> +	int ret = 0;
>> +
>> +	while (remaining) {
>> +
>> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
>> +
>> +		local_irq_save(irq_flags);
>> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
>> +			hyperv_pcpu_input_arg));
>> +
>> +		input_page->target_partition_id = partition_id;
>> +		input_page->target_gpa_base = gpa_target;
>> +		input_page->map_flags = flags;
>> +
>> +		for (i = 0, p = pages; i < rep_count; i++, p++)
>> +			input_page->source_gpa_page_list[i] =
>> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;
> 
> The masking seems a bit weird.  The mask allows for up to 64G page frames,
> which is 256 Tbytes of total physical memory, which is probably the current
> Hyper-V limit on memory size (48 bit physical address space, though 52 bit
> physical address spaces are coming).  So the masking shouldn't ever be doing
> anything.   And if it was doing something, that probably should be treated as
> an error rather than simply dropping the high bits.

Good point - It looks like the mask isn't needed.

> 
> Note that this code does not handle the case where PAGE_SIZE !=
> HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
> page size other than 4K.
> 

For now on x86 it won't happen, but maybe on ARM?
It shouldn't be hard to support this case, especially since
PAGE_SIZE >= HV_HYP_PAGE_SIZE. Do you think we need it in this patch set?

>> +		hypercall_status = hv_do_rep_hypercall(
>> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
>> +		local_irq_restore(irq_flags);
>> +
>> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
>> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
>> +				HV_HYPERCALL_REP_COMP_OFFSET;
>> +
>> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> +						    partition_id, 256);
> 
> Why adding 256 pages?  I'm just contrasting with other places that add
> 1 page at a time.  Maybe a comment to explain ....
> 

Empirically determined. I'll add a #define and comment.

>> +			if (ret)
>> +				break;
>> +		} else if (status != HV_STATUS_SUCCESS) {
>> +			pr_err("%s: completed %llu out of %llu, %s\n",
>> +			       __func__,
>> +			       page_count - remaining, page_count,
>> +			       hv_status_to_string(status));
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +
>> +		pages += completed;
>> +		remaining -= completed;
>> +		gpa_target += completed;
>> +	}
>> +
>> +	if (ret && completed) {
> 
> Is the above the right test?  Completed could be zero from the most
> recent iteration, but still could be partially succeeded based on a previous
> successful iteration.   I think this needs to check whether remaining equals
> page_count.
> 

You're right; I'll change it to (ret && remaining < page_count)

>> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
>> +		       __func__);
>> +		ret = -EBADFD;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +hv_call_unmap_gpa_pages(u64 partition_id,
>> +			u64 gpa_target,
>> +			u64 page_count, u32 flags)
>> +{
>> +	struct hv_unmap_gpa_pages *input_page;
>> +	int status;
>> +	int ret = 0;
>> +	u32 completed = 0;
>> +	u64 hypercall_status;
>> +	unsigned long remaining = page_count;
>> +	int rep_count;
>> +	unsigned long irq_flags;
>> +
>> +	local_irq_save(irq_flags);
>> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
>> +		hyperv_pcpu_input_arg));
>> +
>> +	input_page->target_partition_id = partition_id;
>> +	input_page->target_gpa_base = gpa_target;
>> +	input_page->unmap_flags = flags;
>> +
>> +	while (remaining) {
>> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
>> +		hypercall_status = hv_do_rep_hypercall(
>> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> 
> Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.
> 

As above - do we need this for this patch set? This won't happen on x86.

>> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
>> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
>> +				HV_HYPERCALL_REP_COMP_OFFSET;
>> +		if (status != HV_STATUS_SUCCESS) {
>> +			pr_err("%s: completed %llu out of %llu, %s\n",
>> +			       __func__,
>> +			       page_count - remaining, page_count,
>> +			       hv_status_to_string(status));
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +
>> +		remaining -= completed;
>> +		gpa_target += completed;
>> +		input_page->target_gpa_base = gpa_target;
>> +	}
>> +	local_irq_restore(irq_flags);
> 
> I have some concern about holding interrupts disabled for this long.
> 

How about I move the interrupt enabling/disabling inside the loop? i.e.:
        while (remaining) {
                local_irq_save(irq_flags);
                input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
                        hyperv_pcpu_input_arg));

                input_page->target_partition_id = partition_id;
                input_page->target_gpa_base = gpa_target;
                input_page->unmap_flags = flags;
                rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
                status = hv_do_rep_hypercall(
                        HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
                local_irq_restore(irq_flags);

                completed = (status & HV_HYPERCALL_REP_COMP_MASK) >>
                                HV_HYPERCALL_REP_COMP_OFFSET;
                status &= HV_HYPERCALL_RESULT_MASK;
                if (status != HV_STATUS_SUCCESS) {
                        pr_err("%s: completed %llu out of %llu, %s\n",
                               __func__,
                               page_count - remaining, page_count,
                               hv_status_to_string(status));
                        ret = hv_status_to_errno(status);
                        break;
                }

                remaining -= completed;
                gpa_target += completed;
        }


>> +
>> +	if (ret && completed) {
> 
> Same comment as before.
> 

Ditto as above.

>> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
>> +		       __func__);
>> +		ret = -EBADFD;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
>> +				struct mshv_user_mem_region __user *user_mem)
>> +{
>> +	struct mshv_user_mem_region mem;
>> +	struct mshv_mem_region *region;
>> +	int completed;
>> +	unsigned long remaining, batch_size;
>> +	int i;
>> +	struct page **pages;
>> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
>> +	u64 region_page_count, region_user_start, region_user_end;
>> +	u64 region_gpfn_start, region_gpfn_end;
>> +	long ret = 0;
>> +
>> +	/* Check we have enough slots*/
>> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
>> +		pr_err("%s: not enough memory region slots\n", __func__);
>> +		return -ENOSPC;
>> +	}
>> +
>> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
>> +		return -EFAULT;
>> +
>> +	if (!mem.size ||
>> +	    mem.size & (PAGE_SIZE - 1) ||
>> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||
> 
> There's a PAGE_ALIGNED macro that expresses exactly what
> each of the previous two tests is doing.
> 

Since these need to be HV_HYP_PAGE_SIZE aligned, I will add a
HV_HYP_PAGE_ALIGNED macro for this.

>> +	    !access_ok(mem.userspace_addr, mem.size))
>> +		return -EINVAL;
>> +
>> +	/* Reject overlapping regions */
>> +	page_count = mem.size >> PAGE_SHIFT;
>> +	user_start = mem.userspace_addr;
>> +	user_end = mem.userspace_addr + mem.size;
>> +	gpfn_start = mem.guest_pfn;
>> +	gpfn_end = mem.guest_pfn + page_count;
>> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
>> +		region = &partition->regions.slots[i];
>> +		if (!region->size)
>> +			continue;
>> +		region_page_count = region->size >> PAGE_SHIFT;
>> +		region_user_start = region->userspace_addr;
>> +		region_user_end = region->userspace_addr + region->size;
>> +		region_gpfn_start = region->guest_pfn;
>> +		region_gpfn_end = region->guest_pfn + region_page_count;
>> +
>> +		if (!(
>> +		     (user_end <= region_user_start) ||
>> +		     (region_user_end <= user_start))) {
>> +			return -EEXIST;
>> +		}
>> +		if (!(
>> +		     (gpfn_end <= region_gpfn_start) ||
>> +		     (region_gpfn_end <= gpfn_start))) {
>> +			return -EEXIST;
> 
> You could apply De Morgan's theorem to the conditions
> in each "if" statement and get rid of the "!".  That might make
> these slightly easier to understand, but I have no strong
> preference.
> 

I agree, I think that would be a bit clearer. I will change it.

>> +		}
>> +	}
>> +
>> +	/* Pin the userspace pages */
>> +	pages = vzalloc(sizeof(struct page *) * page_count);
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	remaining = page_count;
>> +	while (remaining) {
>> +		/*
>> +		 * We need to batch this, as pin_user_pages_fast with the
>> +		 * FOLL_LONGTERM flag does a big temporary allocation
>> +		 * of contiguous memory
>> +		 */
>> +		batch_size = min(remaining, PIN_PAGES_BATCH_SIZE);
>> +		completed = pin_user_pages_fast(
>> +				mem.userspace_addr +
>> +					(page_count - remaining) * PAGE_SIZE,
>> +				batch_size,
>> +				FOLL_WRITE | FOLL_LONGTERM,
>> +				&pages[page_count - remaining]);
>> +		if (completed < 0) {
>> +			pr_err("%s: failed to pin user pages error %i\n",
>> +			       __func__,
>> +			       completed);
>> +			ret = completed;
>> +			goto err_unpin_pages;
>> +		}
>> +		remaining -= completed;
>> +	}
>> +
>> +	/* Map the pages to GPA pages */
>> +	ret = hv_call_map_gpa_pages(partition->id, mem.guest_pfn,
>> +				    page_count, mem.flags, pages);
>> +	if (ret)
>> +		goto err_unpin_pages;
>> +
>> +	/* Install the new region */
>> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
>> +		if (!partition->regions.slots[i].size) {
>> +			region = &partition->regions.slots[i];
>> +			break;
>> +		}
>> +	}
>> +	region->pages = pages;
>> +	region->size = mem.size;
>> +	region->guest_pfn = mem.guest_pfn;
>> +	region->userspace_addr = mem.userspace_addr;
>> +
>> +	partition->regions.count++;
>> +
>> +	return 0;
>> +
>> +err_unpin_pages:
>> +	unpin_user_pages(pages, page_count - remaining);
>> +	vfree(pages);
>> +
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_partition_ioctl_unmap_memory(struct mshv_partition *partition,
>> +				  struct mshv_user_mem_region __user *user_mem)
>> +{
>> +	struct mshv_user_mem_region mem;
>> +	struct mshv_mem_region *region_ptr;
>> +	int i;
>> +	u64 page_count;
>> +	long ret;
>> +
>> +	if (!partition->regions.count)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
>> +		return -EFAULT;
>> +
>> +	/* Find matching region */
>> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
>> +		if (!partition->regions.slots[i].size)
>> +			continue;
>> +		region_ptr = &partition->regions.slots[i];
>> +		if (region_ptr->userspace_addr == mem.userspace_addr &&
>> +		    region_ptr->size == mem.size &&
>> +		    region_ptr->guest_pfn == mem.guest_pfn)
>> +			break;
>> +	}
>> +
>> +	if (i == MSHV_MAX_MEM_REGIONS)
>> +		return -EINVAL;
>> +
>> +	page_count = region_ptr->size >> PAGE_SHIFT;
>> +	ret = hv_call_unmap_gpa_pages(partition->id, region_ptr->guest_pfn,
>> +				      page_count, 0);
>> +	if (ret)
>> +		return ret;
>> +
>> +	unpin_user_pages(region_ptr->pages, page_count);
>> +	vfree(region_ptr->pages);
>> +	memset(region_ptr, 0, sizeof(*region_ptr));
>> +	partition->regions.count--;
>> +
>> +	return 0;
>> +}
>> +
>>  static long
>>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  {
>> -	return -ENOTTY;
>> +	struct mshv_partition *partition = filp->private_data;
>> +	long ret;
>> +
>> +	if (mutex_lock_killable(&partition->mutex))
>> +		return -EINTR;
>> +
>> +	switch (ioctl) {
>> +	case MSHV_MAP_GUEST_MEMORY:
>> +		ret = mshv_partition_ioctl_map_memory(partition,
>> +							(void __user *)arg);
>> +		break;
>> +	case MSHV_UNMAP_GUEST_MEMORY:
>> +		ret = mshv_partition_ioctl_unmap_memory(partition,
>> +							(void __user *)arg);
>> +		break;
>> +	default:
>> +		ret = -ENOTTY;
>> +	}
>> +
>> +	mutex_unlock(&partition->mutex);
>> +	return ret;
>>  }
>>
>>  static void
>>  destroy_partition(struct mshv_partition *partition)
>>  {
>> -	unsigned long flags;
>> +	unsigned long flags, page_count;
>> +	struct mshv_mem_region *region;
>>  	int i;
>>
>>  	/* Remove from list of partitions */
>> @@ -286,6 +592,16 @@ destroy_partition(struct mshv_partition *partition)
>>
>>  	hv_call_delete_partition(partition->id);
>>
>> +	/* Remove regions and unpin the pages */
>> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
>> +		region = &partition->regions.slots[i];
>> +		if (!region->size)
>> +			continue;
>> +		page_count = region->size >> PAGE_SHIFT;
>> +		unpin_user_pages(region->pages, page_count);
>> +		vfree(region->pages);
>> +	}
>> +
>>  	kfree(partition);
>>  }
>>
>> @@ -353,6 +669,8 @@ mshv_ioctl_create_partition(void __user *user_arg)
>>  	if (!partition)
>>  		return -ENOMEM;
>>
>> +	mutex_init(&partition->mutex);
>> +
>>  	fd = get_unused_fd_flags(O_CLOEXEC);
>>  	if (fd < 0) {
>>  		ret = fd;
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
  2021-03-08 19:14     ` Nuno Das Neves
@ 2021-03-08 19:30       ` Michael Kelley
  0 siblings, 0 replies; 53+ messages in thread
From: Michael Kelley @ 2021-03-08 19:30 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, March 8, 2021 11:14 AM
> 
> On 2/8/2021 11:45 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:30 PM
> >>

[snip]

> >> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
> >>  	return -hv_status_to_errno(status);
> >>  }
> >>
> >> +static int
> >> +hv_call_map_gpa_pages(u64 partition_id,
> >> +		      u64 gpa_target,
> >> +		      u64 page_count, u32 flags,
> >> +		      struct page **pages)
> >> +{
> >> +	struct hv_map_gpa_pages *input_page;
> >> +	int status;
> >> +	int i;
> >> +	struct page **p;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +	int ret = 0;
> >> +
> >> +	while (remaining) {
> >> +
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +		input_page->map_flags = flags;
> >> +
> >> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> >> +			input_page->source_gpa_page_list[i] =
> >> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;
> >
> > The masking seems a bit weird.  The mask allows for up to 64G page frames,
> > which is 256 Tbytes of total physical memory, which is probably the current
> > Hyper-V limit on memory size (48 bit physical address space, though 52 bit
> > physical address spaces are coming).  So the masking shouldn't ever be doing
> > anything.   And if it was doing something, that probably should be treated as
> > an error rather than simply dropping the high bits.
> 
> Good point - It looks like the mask isn't needed.
> 
> >
> > Note that this code does not handle the case where PAGE_SIZE !=
> > HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
> > page size other than 4K.
> >
> 
> For now on x86 it won't happen, but maybe on ARM?
> It shouldn't be hard to support this case, especially since
> PAGE_SIZE >= HV_HYP_PAGE_SIZE. Do you think we need it in this patch set?

No, from my perspective, this case does not need to be handled in 
this patch set.

> 
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +
> >> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +						    partition_id, 256);
> >
> > Why adding 256 pages?  I'm just contrasting with other places that add
> > 1 page at a time.  Maybe a comment to explain ....
> >
> 
> Empirically determined. I'll add a #define and comment.
> 
> >> +			if (ret)
> >> +				break;
> >> +		} else if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		pages += completed;
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +	}
> >> +
> >> +	if (ret && completed) {
> >
> > Is the above the right test?  Completed could be zero from the most
> > recent iteration, but still could be partially succeeded based on a previous
> > successful iteration.   I think this needs to check whether remaining equals
> > page_count.
> >
> 
> You're right; I'll change it to (ret && remaining < page_count)
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +hv_call_unmap_gpa_pages(u64 partition_id,
> >> +			u64 gpa_target,
> >> +			u64 page_count, u32 flags)
> >> +{
> >> +	struct hv_unmap_gpa_pages *input_page;
> >> +	int status;
> >> +	int ret = 0;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +
> >> +	local_irq_save(irq_flags);
> >> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> >> +		hyperv_pcpu_input_arg));
> >> +
> >> +	input_page->target_partition_id = partition_id;
> >> +	input_page->target_gpa_base = gpa_target;
> >> +	input_page->unmap_flags = flags;
> >> +
> >> +	while (remaining) {
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >
> > Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.
> >
> 
> As above - do we need this for this patch set? This won't happen on x86.

Again, not needed from my perspective.

> 
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +		if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +	}
> >> +	local_irq_restore(irq_flags);
> >
> > I have some concern about holding interrupts disabled for this long.
> >
> 
> How about I move the interrupt enabling/disabling inside the loop? i.e.:
>         while (remaining) {
>                 local_irq_save(irq_flags);
>                 input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
>                         hyperv_pcpu_input_arg));
> 
>                 input_page->target_partition_id = partition_id;
>                 input_page->target_gpa_base = gpa_target;
>                 input_page->unmap_flags = flags;
>                 rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
>                 status = hv_do_rep_hypercall(
>                         HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
>                 local_irq_restore(irq_flags);
> 
>                 completed = (status & HV_HYPERCALL_REP_COMP_MASK) >>
>                                 HV_HYPERCALL_REP_COMP_OFFSET;
>                 status &= HV_HYPERCALL_RESULT_MASK;
>                 if (status != HV_STATUS_SUCCESS) {
>                         pr_err("%s: completed %llu out of %llu, %s\n",
>                                __func__,
>                                page_count - remaining, page_count,
>                                hv_status_to_string(status));
>                         ret = hv_status_to_errno(status);
>                         break;
>                 }
> 
>                 remaining -= completed;
>                 gpa_target += completed;
>         }
> 
> 

Yes, that would help.

> >> +
> >> +	if (ret && completed) {
> >
> > Same comment as before.
> >
> 
> Ditto as above.
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long
> >> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> >> +				struct mshv_user_mem_region __user *user_mem)
> >> +{
> >> +	struct mshv_user_mem_region mem;
> >> +	struct mshv_mem_region *region;
> >> +	int completed;
> >> +	unsigned long remaining, batch_size;
> >> +	int i;
> >> +	struct page **pages;
> >> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> >> +	u64 region_page_count, region_user_start, region_user_end;
> >> +	u64 region_gpfn_start, region_gpfn_end;
> >> +	long ret = 0;
> >> +
> >> +	/* Check we have enough slots*/
> >> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> >> +		pr_err("%s: not enough memory region slots\n", __func__);
> >> +		return -ENOSPC;
> >> +	}
> >> +
> >> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> >> +		return -EFAULT;
> >> +
> >> +	if (!mem.size ||
> >> +	    mem.size & (PAGE_SIZE - 1) ||
> >> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||
> >
> > There's a PAGE_ALIGNED macro that expresses exactly what
> > each of the previous two tests is doing.
> >
> 
> Since these need to be HV_HYP_PAGE_SIZE aligned, I will add a
> HV_HYP_PAGE_ALIGNED macro for this.

I was thinking that PAGE_SIZE and PAGE_ALIGNED are correct.   If
this code were running on an ARM64 system with a 64K page
size, the 64K alignment would be fine and will make sense from
the user space perspective.   You don't want to be mapping part
of a user space page.  And 64K alignment will certainly satisfy
Hyper-V's requirement for 4K alignment.  The real requirement
from Hyper-V's standpoint is that the alignment not be smaller
than 4K.  But maybe I'm misunderstanding.

Michael

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls
  2021-02-08 19:47   ` Michael Kelley
@ 2021-03-09  1:39     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-09  1:39 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

On 2/8/2021 11:47 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
>>
>> Add ioctls for getting and setting virtual processor registers.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  Documentation/virt/mshv/api.rst         |  11 +
>>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 601 ++++++++++++++++++++++++
>>  include/asm-generic/hyperv-tlfs.h       |  65 +--
>>  include/linux/mshv.h                    |   1 +
>>  include/uapi/linux/mshv.h               |  12 +
>>  virt/mshv/mshv_main.c                   | 258 +++++++++-
>>  6 files changed, 903 insertions(+), 45 deletions(-)
>>
[snip]
>> +
>> +union hv_register_value {
>> +	struct hv_u128 reg128;
>> +	__u64 reg64;
>> +	__u32 reg32;
>> +	__u16 reg16;
>> +	__u8 reg8;
>> +	union hv_x64_fp_register fp;
>> +	union hv_x64_fp_control_status_register fp_control_status;
>> +	union hv_x64_xmm_control_status_register xmm_control_status;
>> +	struct hv_x64_segment_register segment;
>> +	struct hv_x64_table_register table;
>> +	union hv_explicit_suspend_register explicit_suspend;
>> +	union hv_intercept_suspend_register intercept_suspend;
>> +	union hv_dispatch_suspend_register dispatch_suspend;
>> +	union hv_x64_interrupt_state_register interrupt_state;
>> +	union hv_x64_pending_interruption_register pending_interruption;
>> +	union hv_x64_msr_npiep_config_contents npiep_config;
>> +	union hv_x64_pending_exception_event pending_exception_event;
>> +	union hv_x64_pending_virtualization_fault_event
>> +		pending_virtualization_fault_event;
>> +};
>> +
>>  #endif
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 6e5072e29897..b9295400c20b 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -622,53 +622,30 @@ struct hv_retarget_device_interrupt {
>>  } __packed __aligned(8);
>>
>>
>> -/* HvGetVpRegisters hypercall input with variable size reg name list*/
>> -struct hv_get_vp_registers_input {
>> -	struct {
>> -		u64 partitionid;
>> -		u32 vpindex;
>> -		u8  inputvtl;
>> -		u8  padding[3];
>> -	} header;
>> -	struct input {
>> -		u32 name0;
>> -		u32 name1;
>> -	} element[];
>> -} __packed;
>> -
>> +/* HvGetVpRegisters hypercall with variable size reg name list*/
>> +struct hv_get_vp_registers {
>> +	u64 partition_id;
>> +	u32 vp_index;
>> +	u8  input_vtl;
>> +	u8  rsvd_z8;
>> +	u16 rsvd_z16;
>> +	__aligned(8) enum hv_register_name names[];
>> +} __aligned(8);
>>
>> -/* HvGetVpRegisters returns an array of these output elements */
>> -struct hv_get_vp_registers_output {
>> -	union {
>> -		struct {
>> -			u32 a;
>> -			u32 b;
>> -			u32 c;
>> -			u32 d;
>> -		} as32 __packed;
>> -		struct {
>> -			u64 low;
>> -			u64 high;
>> -		} as64 __packed;
>> -	};
>> +/* HvSetVpRegisters hypercall with variable size reg name/value list*/
>> +struct hv_register_assoc {
>> +	enum hv_register_name name;
>> +	__aligned(16) union hv_register_value value;
>>  };
>>
>> -/* HvSetVpRegisters hypercall with variable size reg name/value list*/
>> -struct hv_set_vp_registers_input {
>> -	struct {
>> -		u64 partitionid;
>> -		u32 vpindex;
>> -		u8  inputvtl;
>> -		u8  padding[3];
>> -	} header;
>> -	struct {
>> -		u32 name;
>> -		u32 padding1;
>> -		u64 padding2;
>> -		u64 valuelow;
>> -		u64 valuehigh;
>> -	} element[];
>> -} __packed;
>> +struct hv_set_vp_registers {
>> +	u64 partition_id;
>> +	u32 vp_index;
>> +	u8  input_vtl;
>> +	u8  rsvd_z8;
>> +	u16 rsvd_z16;
>> +	struct hv_register_assoc elements[];
>> +} __aligned(16);
> 
> Throughout these structures, I think the approach needs to be more
> explicit about the memory layout.  The current definitions assume that
> the compiler is inserting padding in the expected places, and not in
> any unexpected places.  My previous concerns about use of enum
> also apply.
> 
> The code also removes some layouts that are used in the
> not-yet-accepted patches for ARM64.   Let sync on how to get
> those back in.
> 

I'll add __packed to all these structures.
The hv_register_name enum can be replaced by #defines, and the type can be
u32 or u64 (it only needs 32 bits).

I'll sync with you on the ARM64 structs.

>>
>>  enum hv_device_type {
>>  	HV_DEVICE_TYPE_LOGICAL = 0,
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> index 50521c5f7948..dfe469f573f9 100644
>> --- a/include/linux/mshv.h
>> +++ b/include/linux/mshv.h
>> @@ -17,6 +17,7 @@
>>  struct mshv_vp {
>>  	u32 index;
>>  	struct mshv_partition *partition;
>> +	struct mutex mutex;
>>  };
>>
>>  struct mshv_mem_region {
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index 1f053eae68a6..5d53ed655429 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -33,6 +33,14 @@ struct mshv_create_vp {
>>  	__u32 vp_index;
>>  };
>>
>> +#define MSHV_VP_MAX_REGISTERS	128
>> +
>> +struct mshv_vp_registers {
>> +	int count; /* at most MSHV_VP_MAX_REGISTERS */
>> +	enum hv_register_name *names;
>> +	union hv_register_value *values;
>> +};
> 
> Having separate arrays for the names and values results in an extra
> copy of the data down in the ioctl code.  Any reason the caller couldn't
> supply the data as an array, where each entry is already a name/value
> pair?
> 

I initially thought it would not make a difference to the number of copies,
but it turns out it does. I will change it to use hv_register_assoc everywhere.

>> +
>>  #define MSHV_IOCTL 0xB8
>>
>>  /* mshv device */
>> @@ -44,4 +52,8 @@ struct mshv_create_vp {
>>  #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct
>> mshv_user_mem_region)
>>  #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
>>
>> +/* vp device */
>> +#define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct
>> mshv_vp_registers)
>> +#define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
>> +
>>  #endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 3be9d9a468c1..2a10137a1e84 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -74,6 +74,12 @@ static struct miscdevice mshv_dev = {
>>  #define HV_MAP_GPA_BATCH_SIZE	\
>>  		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
>>  #define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
>> +#define HV_GET_REGISTER_BATCH_SIZE	\
>> +	(PAGE_SIZE / \
>> +	 sizeof(struct hv_get_vp_registers) / sizeof(enum hv_register_name))
>> +#define HV_SET_REGISTER_BATCH_SIZE	\
>> +	(PAGE_SIZE / \
>> +	 sizeof(struct hv_set_vp_registers) / sizeof(struct hv_register_assoc))
> 
> These new size calculations have the same bug as HV_MAP_GPA_BATCH_SIZE.
> The first divide operations should be subtraction.
> 

Yep, I'll fix it.

> With the correct calculation, HV_GET_REGISTER_BATCH_SIZE  will be
> too large.  The input page will accommodate more 32 bit register names
> than the output page will accommodate 128 bit register values.  The limit
> should be based on the latter, not the former.  Or calculate both the
> input and output limit and use the minimum.
> 

I didn't think about this previously! Will fix.

>>
>>  static int
>>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
>> @@ -380,10 +386,258 @@ hv_call_unmap_gpa_pages(u64 partition_id,
>>  	return ret;
>>  }
>>
>> +static int
>> +hv_call_get_vp_registers(u32 vp_index,
>> +			 u64 partition_id,
>> +			 u16 count,
>> +			 const enum hv_register_name *names,
>> +			 union hv_register_value *values)
>> +{
>> +	struct hv_get_vp_registers *input_page;
>> +	union hv_register_value *output_page;
>> +	u16 completed = 0;
>> +	u64 hypercall_status;
>> +	unsigned long remaining = count;
>> +	int rep_count;
>> +	int status;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +
>> +	input_page = (struct hv_get_vp_registers *)(*this_cpu_ptr(
>> +		hyperv_pcpu_input_arg));
>> +	output_page = (union hv_register_value *)(*this_cpu_ptr(
>> +		hyperv_pcpu_output_arg));
>> +
>> +	input_page->partition_id = partition_id;
>> +	input_page->vp_index = vp_index;
>> +	input_page->input_vtl = 0;
>> +	input_page->rsvd_z8 = 0;
>> +	input_page->rsvd_z16 = 0;
>> +
>> +	while (remaining) {
>> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>> +		memcpy(input_page->names, names,
>> +			sizeof(enum hv_register_name) * rep_count);
>> +
>> +		hypercall_status =
>> +			hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
>> +					    0, input_page, output_page);
>> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
>> +		if (status != HV_STATUS_SUCCESS) {
>> +			pr_err("%s: completed %li out of %u, %s\n",
>> +			       __func__,
>> +			       count - remaining, count,
>> +			       hv_status_to_string(status));
>> +			break;
>> +		}
>> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
>> +			    HV_HYPERCALL_REP_COMP_OFFSET;
>> +		memcpy(values, output_page,
>> +			sizeof(union hv_register_value) * completed);
>> +
>> +		names += completed;
>> +		values += completed;
>> +		remaining -= completed;
>> +	}
>> +	local_irq_restore(flags);
>> +
>> +	return -hv_status_to_errno(status);
>> +}
>> +
>> +static int
>> +hv_call_set_vp_registers(u32 vp_index,
>> +			 u64 partition_id,
>> +			 u16 count,
>> +			 struct hv_register_assoc *registers)
>> +{
>> +	struct hv_set_vp_registers *input_page;
>> +	u16 completed = 0;
>> +	u64 hypercall_status;
>> +	unsigned long remaining = count;
>> +	int rep_count;
>> +	int status;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +	input_page = (struct hv_set_vp_registers *)(*this_cpu_ptr(
>> +		hyperv_pcpu_input_arg));
>> +
>> +	input_page->partition_id = partition_id;
>> +	input_page->vp_index = vp_index;
>> +	input_page->input_vtl = 0;
>> +	input_page->rsvd_z8 = 0;
>> +	input_page->rsvd_z16 = 0;
>> +
>> +	while (remaining) {
>> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
>> +		memcpy(input_page->elements, registers,
>> +			sizeof(struct hv_register_assoc) * rep_count);
>> +
>> +		hypercall_status =
>> +			hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
>> +					    0, input_page, NULL);
>> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
>> +		if (status != HV_STATUS_SUCCESS) {
>> +			pr_err("%s: completed %li out of %u, %s\n",
>> +			       __func__,
>> +			       count - remaining, count,
>> +			       hv_status_to_string(status));
>> +			break;
>> +		}
>> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
>> +			    HV_HYPERCALL_REP_COMP_OFFSET;
>> +		registers += completed;
>> +		remaining -= completed;
>> +	}
>> +
>> +	local_irq_restore(flags);
>> +
>> +	return -hv_status_to_errno(status);
>> +}
>> +
>> +static long
>> +mshv_vp_ioctl_get_regs(struct mshv_vp *vp, void __user *user_args)
>> +{
>> +	struct mshv_vp_registers args;
>> +	enum hv_register_name *names;
>> +	union hv_register_value *values;
>> +	long ret;
>> +
>> +	if (copy_from_user(&args, user_args, sizeof(args)))
>> +		return -EFAULT;
>> +
>> +	if (args.count > MSHV_VP_MAX_REGISTERS)
>> +		return -EINVAL;
>> +
>> +	names = kmalloc_array(args.count,
>> +			      sizeof(enum hv_register_name),
>> +			      GFP_KERNEL);
>> +	if (!names)
>> +		return -ENOMEM;
>> +
>> +	values = kmalloc_array(args.count,
>> +			       sizeof(union hv_register_value),
>> +			       GFP_KERNEL);
>> +	if (!values) {
>> +		kfree(names);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (copy_from_user(names, args.names,
>> +			   sizeof(enum hv_register_name) * args.count)) {
>> +		ret = -EFAULT;
>> +		goto free_return;
>> +	}
>> +
>> +	ret = hv_call_get_vp_registers(vp->index, vp->partition->id,
>> +				       args.count, names, values);
>> +	if (ret)
>> +		goto free_return;
>> +
>> +	if (copy_to_user(args.values, values,
>> +			 sizeof(union hv_register_value) * args.count)) {
>> +		ret = -EFAULT;
>> +	}
>> +
>> +free_return:
>> +	kfree(names);
>> +	kfree(values);
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
>> +{
>> +	int i;
>> +	struct mshv_vp_registers args;
>> +	struct hv_register_assoc *registers;
>> +	enum hv_register_name *names;
>> +	union hv_register_value *values;
>> +	long ret;
>> +
>> +	if (copy_from_user(&args, user_args, sizeof(args)))
>> +		return -EFAULT;
>> +
>> +	if (args.count > MSHV_VP_MAX_REGISTERS)
>> +		return -EINVAL;
>> +
>> +	names = kmalloc_array(args.count,
>> +			      sizeof(enum hv_register_name),
>> +			      GFP_KERNEL);
>> +	if (!names)
>> +		return -ENOMEM;
>> +
>> +	values = kmalloc_array(args.count,
>> +			       sizeof(union hv_register_value),
>> +			       GFP_KERNEL);
>> +	if (!values) {
>> +		kfree(names);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	registers = kmalloc_array(args.count,
>> +				  sizeof(struct hv_register_assoc),
>> +				  GFP_KERNEL);
>> +	if (!registers) {
>> +		kfree(values);
>> +		kfree(names);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (copy_from_user(names, args.names,
>> +			   sizeof(enum hv_register_name) * args.count)) {
>> +		ret = -EFAULT;
>> +		goto free_return;
>> +	}
>> +
>> +	if (copy_from_user(values, args.values,
>> +			   sizeof(union hv_register_value) * args.count)) {
>> +		ret = -EFAULT;
>> +		goto free_return;
>> +	}
>> +
>> +	for (i = 0; i < args.count; i++) {
>> +		memcpy(&registers[i].name, &names[i],
>> +		       sizeof(enum hv_register_name));
>> +		memcpy(&registers[i].value, &values[i],
>> +		       sizeof(union hv_register_value));
>> +	}
> 
> The above will result in uninitialized memory being sent to
> Hyper-V, since there is implicit padding associated with the
> 32 bit name field.
> 

This shouldn't be an issue after I change this to use hv_register_assoc,
instead of separate names and values buffers.

>> +
>> +	ret = hv_call_set_vp_registers(vp->index, vp->partition->id,
>> +				       args.count, registers);
>> +
>> +free_return:
>> +	kfree(names);
>> +	kfree(values);
>> +	kfree(registers);
>> +	return ret;
>> +}
>> +
>> +
>>  static long
>>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>>  {
>> -	return -ENOTTY;
>> +	struct mshv_vp *vp = filp->private_data;
>> +	long r = 0;
>> +
>> +	if (mutex_lock_killable(&vp->mutex))
>> +		return -EINTR;
>> +
>> +	switch (ioctl) {
>> +	case MSHV_GET_VP_REGISTERS:
>> +		r = mshv_vp_ioctl_get_regs(vp, (void __user *)arg);
>> +		break;
>> +	case MSHV_SET_VP_REGISTERS:
>> +		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
>> +		break;
>> +	default:
>> +		r = -ENOTTY;
>> +		break;
>> +	}
>> +	mutex_unlock(&vp->mutex);
>> +
>> +	return r;
>>  }
>>
>>  static int
>> @@ -420,6 +674,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>>  	if (!vp)
>>  		return -ENOMEM;
>>
>> +	mutex_init(&vp->mutex);
>> +
>>  	vp->index = args.vp_index;
>>  	vp->partition = mshv_partition_get(partition);
>>  	if (!vp->partition) {
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
  2021-02-08 19:47   ` Michael Kelley
@ 2021-03-11 19:37     ` Nuno Das Neves
  2021-03-11 20:45       ` Michael Kelley
  0 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-11 19:37 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan


On 2/8/2021 11:47 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
>>
>> Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
>> and hv_synic_disable_regs().
>> Setting up synic registers in both vmbus driver and mshv would clobber
>> them, but the vmbus driver will not run in the root partition, so this
>> is safe.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
>>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
>>  include/asm-generic/hyperv-tlfs.h       |  46 +----
>>  include/linux/mshv.h                    |   1 +
>>  include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
>>  virt/mshv/mshv_main.c                   |  98 ++++++++-
>>  6 files changed, 404 insertions(+), 77 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>> index 4cd44ae9bffb..c34a6bb4f457 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
>>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
>>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
>>
>> -
>> -/* Define hypervisor message types. */
>> -enum hv_message_type {
>> -	HVMSG_NONE			= 0x00000000,
>> -
>> -	/* Memory access messages. */
>> -	HVMSG_UNMAPPED_GPA		= 0x80000000,
>> -	HVMSG_GPA_INTERCEPT		= 0x80000001,
>> -
>> -	/* Timer notification messages. */
>> -	HVMSG_TIMER_EXPIRED		= 0x80000010,
>> -
>> -	/* Error messages. */
>> -	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
>> -	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
>> -	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
>> -
>> -	/* Trace buffer complete messages. */
>> -	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
>> -
>> -	/* Platform-specific processor intercept messages. */
>> -	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
>> -	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
>> -	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
>> -	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
>> -	HVMSG_X64_APIC_EOI		= 0x80010004,
>> -	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
>> -};
>> -
>>  struct hv_nested_enlightenments_control {
>>  	struct {
>>  		__u32 directhypercall:1;
>> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
>> tlfs.h
>> index 2ff655962738..c6a27053f791 100644
>> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> @@ -722,4 +722,268 @@ union hv_register_value {
>>  		pending_virtualization_fault_event;
>>  };
>>
>> +/* Define hypervisor message types. */
>> +enum hv_message_type {
>> +	HVMSG_NONE				= 0x00000000,
>> +
>> +	/* Memory access messages. */
>> +	HVMSG_UNMAPPED_GPA			= 0x80000000,
>> +	HVMSG_GPA_INTERCEPT			= 0x80000001,
>> +
>> +	/* Timer notification messages. */
>> +	HVMSG_TIMER_EXPIRED			= 0x80000010,
>> +
>> +	/* Error messages. */
>> +	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
>> +	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
>> +	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
>> +
>> +	/* Trace buffer complete messages. */
>> +	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
>> +
>> +	/* Platform-specific processor intercept messages. */
>> +	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
>> +	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
>> +	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
>> +	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
>> +	HVMSG_X64_APIC_EOI			= 0x80010004,
>> +	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
>> +	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
>> +	HVMSG_X64_HALT				= 0x80010007,
>> +	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
>> +	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
>> +};
> 
> I have a separate patch series that moves this enum to the
> asm-generic portion of hyperv-tlfs.h because there's not a good way
> to separate the arch neutral from arch dependent values.
> 

Ok, but it should also be changed to #define instead of an enum, right?
I will do that in this patch.
This requires a couple of changes in other files in drivers/hv
where this enum is used.

>> +
>> +
>> +union hv_x64_vp_execution_state {
>> +	__u16 as_uint16;
>> +	struct {
>> +		__u16 cpl:2;
>> +		__u16 cr0_pe:1;
>> +		__u16 cr0_am:1;
>> +		__u16 efer_lma:1;
>> +		__u16 debug_active:1;
>> +		__u16 interruption_pending:1;
>> +		__u16 vtl:4;
>> +		__u16 enclave_mode:1;
>> +		__u16 interrupt_shadow:1;
>> +		__u16 virtualization_fault_active:1;
>> +		__u16 reserved:2;
>> +	};
>> +};
>> +
>> +/* Values for intercept_access_type field */
>> +#define HV_INTERCEPT_ACCESS_READ	0
>> +#define HV_INTERCEPT_ACCESS_WRITE	1
>> +#define HV_INTERCEPT_ACCESS_EXECUTE	2
>> +
>> +struct hv_x64_intercept_message_header {
>> +	__u32 vp_index;
>> +	__u8 instruction_length:4;
>> +	__u8 cr8:4; // only set for exo partitions
>> +	__u8 intercept_access_type;
>> +	union hv_x64_vp_execution_state execution_state;
>> +	struct hv_x64_segment_register cs_segment;
>> +	__u64 rip;
>> +	__u64 rflags;
>> +};
>> +
>> +#define HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS 6
>> +
>> +struct hv_x64_hypercall_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u64 rax;
>> +	__u64 rbx;
>> +	__u64 rcx;
>> +	__u64 rdx;
>> +	__u64 r8;
>> +	__u64 rsi;
>> +	__u64 rdi;
>> +	struct hv_u128 xmmregisters[HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS];
>> +	struct {
>> +		__u32 isolated:1;
>> +		__u32 reserved:31;
>> +	};
>> +};
>> +
>> +union hv_x64_register_access_info {
>> +	union hv_register_value source_value;
>> +	enum hv_register_name destination_register;
>> +	__u64 source_address;
>> +	__u64 destination_address;
>> +};
>> +
>> +struct hv_x64_register_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	struct {
>> +		__u8 is_memory_op:1;
>> +		__u8 reserved:7;
>> +	};
>> +	__u8 reserved8;
>> +	__u16 reserved16;
>> +	enum hv_register_name register_name;
>> +	union hv_x64_register_access_info access_info;
>> +};
>> +
>> +union hv_x64_memory_access_info {
>> +	__u8 as_uint8;
>> +	struct {
>> +		__u8 gva_valid:1;
>> +		__u8 gva_gpa_valid:1;
>> +		__u8 hypercall_output_pending:1;
>> +		__u8 tlb_locked_no_overlay:1;
>> +		__u8 reserved:4;
>> +	};
>> +};
>> +
>> +union hv_x64_io_port_access_info {
>> +	__u8 as_uint8;
>> +	struct {
>> +		__u8 access_size:3;
>> +		__u8 string_op:1;
>> +		__u8 rep_prefix:1;
>> +		__u8 reserved:3;
>> +	};
>> +};
>> +
>> +union hv_x64_exception_info {
>> +	__u8 as_uint8;
>> +	struct {
>> +		__u8 error_code_valid:1;
>> +		__u8 software_exception:1;
>> +		__u8 reserved:6;
>> +	};
>> +};
>> +
>> +enum hv_cache_type {
>> +	HV_CACHE_TYPE_UNCACHED	   = 0,
>> +	HV_CACHE_TYPE_WRITE_COMBINING = 1,
>> +	HV_CACHE_TYPE_WRITE_THROUGH   = 4,
>> +	HV_CACHE_TYPE_WRITE_PROTECTED = 5,
>> +	HV_CACHE_TYPE_WRITE_BACK	  = 6
>> +};
>> +
>> +struct hv_x64_memory_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	enum hv_cache_type cache_type;
>> +	__u8 instruction_byte_count;
>> +	union hv_x64_memory_access_info memory_access_info;
>> +	__u8 tpr_priority;
>> +	__u8 reserved1;
>> +	__u64 guest_virtual_address;
>> +	__u64 guest_physical_address;
>> +	__u8 instruction_bytes[16];
>> +};
>> +
>> +struct hv_x64_cpuid_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u64 rax;
>> +	__u64 rcx;
>> +	__u64 rdx;
>> +	__u64 rbx;
>> +	__u64 default_result_rax;
>> +	__u64 default_result_rcx;
>> +	__u64 default_result_rdx;
>> +	__u64 default_result_rbx;
>> +};
>> +
>> +struct hv_x64_msr_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u32 msr_number;
>> +	__u32 reserved;
>> +	__u64 rdx;
>> +	__u64 rax;
>> +};
>> +
>> +struct hv_x64_io_port_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u16 port_number;
>> +	union hv_x64_io_port_access_info access_info;
>> +	__u8 instruction_byte_count;
>> +	__u32 reserved;
>> +	__u64 rax;
>> +	__u8 instruction_bytes[16];
>> +	struct hv_x64_segment_register ds_segment;
>> +	struct hv_x64_segment_register es_segment;
>> +	__u64 rcx;
>> +	__u64 rsi;
>> +	__u64 rdi;
>> +};
>> +
>> +struct hv_x64_exception_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u16 exception_vector;
>> +	union hv_x64_exception_info exception_info;
>> +	__u8 instruction_byte_count;
>> +	__u32 error_code;
>> +	__u64 exception_parameter;
>> +	__u64 reserved;
>> +	__u8 instruction_bytes[16];
>> +	struct hv_x64_segment_register ds_segment;
>> +	struct hv_x64_segment_register ss_segment;
>> +	__u64 rax;
>> +	__u64 rcx;
>> +	__u64 rdx;
>> +	__u64 rbx;
> 
> Is the above the correct ordering (rax, rcd, rdx, rbx)?
> It's just what you would expect ....
> 

This ordering is correct.
Why is it this way? I don't know.

>> +	__u64 rsp;
>> +	__u64 rbp;
>> +	__u64 rsi;
>> +	__u64 rdi;
>> +	__u64 r8;
>> +	__u64 r9;
>> +	__u64 r10;
>> +	__u64 r11;
>> +	__u64 r12;
>> +	__u64 r13;
>> +	__u64 r14;
>> +	__u64 r15;
>> +};
>> +
>> +struct hv_x64_invalid_vp_register_message {
>> +	__u32 vp_index;
>> +	__u32 reserved;
>> +};
>> +
>> +struct hv_x64_unrecoverable_exception_message {
>> +	struct hv_x64_intercept_message_header header;
>> +};
>> +
>> +enum hv_x64_unsupported_feature_code {
>> +	hv_unsupported_feature_intercept = 1,
>> +	hv_unsupported_feature_task_switch_tss = 2
>> +};
>> +
>> +struct hv_x64_unsupported_feature_message {
>> +	__u32 vp_index;
>> +	enum hv_x64_unsupported_feature_code feature_code;
>> +	__u64 feature_parameter;
>> +};
>> +
>> +struct hv_x64_halt_message {
>> +	struct hv_x64_intercept_message_header header;
>> +};
>> +
>> +enum hv_x64_pending_interruption_type {
>> +	HV_X64_PENDING_INTERRUPT	= 0,
>> +	HV_X64_PENDING_NMI		= 2,
>> +	HV_X64_PENDING_EXCEPTION	= 3
>> +};
>> +
>> +struct hv_x64_interruption_deliverable_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	enum hv_x64_pending_interruption_type deliverable_type;
>> +	__u32 rsvd;
>> +};
>> +
>> +struct hv_x64_sipi_intercept_message {
>> +	struct hv_x64_intercept_message_header header;
>> +	__u32 target_vp_index;
>> +	__u32 interrupt_vector;
>> +};
>> +
>> +struct hv_x64_apic_eoi_message {
>> +	__u32 vp_index;
>> +	__u32 interrupt_vector;
>> +};
> 
> Same comments as before about enum types, not depending
> on the compiler to add padding, and marking as __packed.
> 

Yep, will do.

>> +
>>  #endif
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index b9295400c20b..e0185c3872a9 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -241,6 +241,8 @@ static inline const char *hv_status_to_string(enum hv_status status)
>>  /* Valid SynIC vectors are 16-255. */
>>  #define HV_SYNIC_FIRST_VALID_VECTOR	(16)
>>
>> +#define HV_SYNIC_INTERCEPTION_SINT_INDEX 0x00000000
>> +
>>  #define HV_SYNIC_CONTROL_ENABLE		(1ULL << 0)
>>  #define HV_SYNIC_SIMP_ENABLE		(1ULL << 0)
>>  #define HV_SYNIC_SIEFP_ENABLE		(1ULL << 0)
>> @@ -250,49 +252,6 @@ static inline const char *hv_status_to_string(enum hv_status
>> status)
>>
>>  #define HV_SYNIC_STIMER_COUNT		(4)
>>
>> -/* Define synthetic interrupt controller message constants. */
>> -#define HV_MESSAGE_SIZE			(256)
>> -#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
>> -#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
>> -
>> -/* Define synthetic interrupt controller message flags. */
>> -union hv_message_flags {
>> -	__u8 asu8;
>> -	struct {
>> -		__u8 msg_pending:1;
>> -		__u8 reserved:7;
>> -	} __packed;
>> -};
>> -
>> -/* Define port identifier type. */
>> -union hv_port_id {
>> -	__u32 asu32;
>> -	struct {
>> -		__u32 id:24;
>> -		__u32 reserved:8;
>> -	} __packed u;
>> -};
>> -
>> -/* Define synthetic interrupt controller message header. */
>> -struct hv_message_header {
>> -	__u32 message_type;
>> -	__u8 payload_size;
>> -	union hv_message_flags message_flags;
>> -	__u8 reserved[2];
>> -	union {
>> -		__u64 sender;
>> -		union hv_port_id port;
>> -	};
>> -} __packed;
>> -
>> -/* Define synthetic interrupt controller message format. */
>> -struct hv_message {
>> -	struct hv_message_header header;
>> -	union {
>> -		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
>> -	} u;
>> -} __packed;
>> -
>>  /* Define the synthetic interrupt message page layout. */
>>  struct hv_message_page {
>>  	struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
>> @@ -306,7 +265,6 @@ struct hv_timer_message_payload {
>>  	__u64 delivery_time;	/* When the message was delivered */
>>  } __packed;
>>
>> -
>>  /* Define synthetic interrupt controller flag constants. */
>>  #define HV_EVENT_FLAGS_COUNT		(256 * 8)
>>  #define HV_EVENT_FLAGS_LONG_COUNT	(256 / sizeof(unsigned long))
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> index dfe469f573f9..7709aaa1e064 100644
>> --- a/include/linux/mshv.h
>> +++ b/include/linux/mshv.h
>> @@ -42,6 +42,7 @@ struct mshv_partition {
>>  };
>>
>>  struct mshv {
>> +	struct hv_message_page __percpu **synic_message_page;
>>  	struct {
>>  		spinlock_t lock;
>>  		u64 count;
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
>> tlfs.h
>> index e7b09b9f00de..e87389054b68 100644
>> --- a/include/uapi/asm-generic/hyperv-tlfs.h
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -6,6 +6,49 @@
>>  #define BIT(X)	(1ULL << (X))
>>  #endif
>>
>> +/* Define synthetic interrupt controller message constants. */
>> +#define HV_MESSAGE_SIZE			(256)
>> +#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
>> +#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
>> +
>> +/* Define synthetic interrupt controller message flags. */
>> +union hv_message_flags {
>> +	__u8 asu8;
>> +	struct {
>> +		__u8 msg_pending:1;
>> +		__u8 reserved:7;
>> +	};
>> +};
>> +
>> +/* Define port identifier type. */
>> +union hv_port_id {
>> +	__u32 asu32;
>> +	struct {
>> +		__u32 id:24;
>> +		__u32 reserved:8;
>> +	} u;
>> +};
>> +
>> +/* Define synthetic interrupt controller message header. */
>> +struct hv_message_header {
>> +	enum hv_message_type message_type;
>> +	__u8 payload_size;
>> +	union hv_message_flags message_flags;
>> +	__u8 reserved[2];
>> +	union {
>> +		__u64 sender;
>> +		union hv_port_id port;
>> +	};
>> +};
>> +
>> +/* Define synthetic interrupt controller message format. */
>> +struct hv_message {
>> +	struct hv_message_header header;
>> +	union {
>> +		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
>> +	} u;
>> +};
>> +
>>  /* Userspace-visible partition creation flags */
>>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 2a10137a1e84..c9445d2edb37 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -15,6 +15,8 @@
>>  #include <linux/file.h>
>>  #include <linux/anon_inodes.h>
>>  #include <linux/mm.h>
>> +#include <linux/io.h>
>> +#include <linux/cpuhotplug.h>
>>  #include <linux/mshv.h>
>>  #include <asm/mshyperv.h>
>>
>> @@ -1152,23 +1154,111 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>>  	return 0;
>>  }
>>
>> +static int
>> +mshv_synic_init(unsigned int cpu)
>> +{
>> +	union hv_synic_simp simp;
>> +	union hv_synic_sint sint;
>> +	union hv_synic_scontrol sctrl;
>> +	struct hv_message_page **msg_page =
>> +			this_cpu_ptr(mshv.synic_message_page);
>> +
>> +	/* Setup the Synic's message page */
>> +	hv_get_simp(simp.as_uint64);
>> +	simp.simp_enabled = true;
>> +	*msg_page = memremap(simp.base_simp_gpa << PAGE_SHIFT,
>> +			     PAGE_SIZE, MEMREMAP_WB);
> 
> Use HV_HYP_PAGE_SHIFT and HV_HYP_PAGE_SIZE.
> 

Yep, will do.

>> +	if (!msg_page) {
>> +		pr_err("%s: memremap failed\n", __func__);
>> +		return -EFAULT;
>> +	}
>> +	hv_set_simp(simp.as_uint64);
>> +
>> +	/* Enable intercepts */
>> +	sint.as_uint64 = 0;
>> +	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
>> +	sint.masked = false;
>> +	sint.auto_eoi = hv_recommend_using_aeoi();
>> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
>> +
>> +	/* Enable global synic bit */
>> +	hv_get_synic_state(sctrl.as_uint64);
>> +	sctrl.enable = 1;
>> +	hv_set_synic_state(sctrl.as_uint64);
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +mshv_synic_cleanup(unsigned int cpu)
>> +{
>> +	union hv_synic_sint sint;
>> +	union hv_synic_simp simp;
>> +	union hv_synic_scontrol sctrl;
>> +	struct hv_message_page **msg_page =
>> +			this_cpu_ptr(mshv.synic_message_page);
>> +
>> +	/* Disable the interrupt */
>> +	hv_get_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
>> +	sint.masked = true;
>> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
>> +
>> +	/* Disable Synic's message page */
>> +	hv_get_simp(simp.as_uint64);
>> +	simp.simp_enabled = false;
>> +	hv_set_simp(simp.as_uint64);
>> +	memunmap(*msg_page);
>> +
>> +	/* Disable global synic bit */
>> +	hv_get_synic_state(sctrl.as_uint64);
>> +	sctrl.enable = 0;
>> +	hv_set_synic_state(sctrl.as_uint64);
>> +
>> +	return 0;
>> +}
>> +
>> +static int mshv_cpuhp_online;
>> +
>>  static int
>>  __init mshv_init(void)
>>  {
>> -	int r;
>> +	int ret;
> 
> Ideally, change the name of the variable in the earlier patch so this
> one isn't cluttered with the change.
> 

Will do.

>>
>> -	r = misc_register(&mshv_dev);
>> -	if (r)
>> +	ret = misc_register(&mshv_dev);
>> +	if (ret) {
>>  		pr_err("%s: misc device register failed\n", __func__);
>> +		return ret;
>> +	}
>> +	spin_lock_init(&mshv.partitions.lock);
>>
>> +	mshv.synic_message_page = alloc_percpu(struct hv_message_page *);
>> +	if (!mshv.synic_message_page) {
>> +		pr_err("%s: failed to allocate percpu synic page\n", __func__);
>> +		misc_deregister(&mshv_dev);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
>> +				mshv_synic_init,
>> +				mshv_synic_cleanup);
>> +	if (ret < 0) {
>> +		pr_err("%s: failed to setup cpu hotplug state: %i\n",
>> +		       __func__, ret);
>> +		return ret;
>> +	}
>> +
>> +	mshv_cpuhp_online = ret;
>>  	spin_lock_init(&mshv.partitions.lock);
> 
> It looks like the spin lock is being initialized twice.
> 

Oops!

>>
>> -	return r;
>> +	return 0;
>>  }
>>
>>  static void
>>  __exit mshv_exit(void)
>>  {
>> +	cpuhp_remove_state(mshv_cpuhp_online);
>> +	free_percpu(mshv.synic_message_page);
>> +
>>  	misc_deregister(&mshv_dev);
>>  }
>>
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
  2021-03-11 19:37     ` Nuno Das Neves
@ 2021-03-11 20:45       ` Michael Kelley
  0 siblings, 0 replies; 53+ messages in thread
From: Michael Kelley @ 2021-03-11 20:45 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 11, 2021 11:38 AM
> 
> On 2/8/2021 11:47 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:31 PM
> >>
> >> Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
> >> and hv_synic_disable_regs().
> >> Setting up synic registers in both vmbus driver and mshv would clobber
> >> them, but the vmbus driver will not run in the root partition, so this
> >> is safe.
> >>
> >> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> >> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >>  arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
> >>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
> >>  include/asm-generic/hyperv-tlfs.h       |  46 +----
> >>  include/linux/mshv.h                    |   1 +
> >>  include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
> >>  virt/mshv/mshv_main.c                   |  98 ++++++++-
> >>  6 files changed, 404 insertions(+), 77 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> >> index 4cd44ae9bffb..c34a6bb4f457 100644
> >> --- a/arch/x86/include/asm/hyperv-tlfs.h
> >> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> >> @@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
> >>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
> >>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
> >>
> >> -
> >> -/* Define hypervisor message types. */
> >> -enum hv_message_type {
> >> -	HVMSG_NONE			= 0x00000000,
> >> -
> >> -	/* Memory access messages. */
> >> -	HVMSG_UNMAPPED_GPA		= 0x80000000,
> >> -	HVMSG_GPA_INTERCEPT		= 0x80000001,
> >> -
> >> -	/* Timer notification messages. */
> >> -	HVMSG_TIMER_EXPIRED		= 0x80000010,
> >> -
> >> -	/* Error messages. */
> >> -	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
> >> -	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
> >> -	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
> >> -
> >> -	/* Trace buffer complete messages. */
> >> -	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
> >> -
> >> -	/* Platform-specific processor intercept messages. */
> >> -	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
> >> -	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
> >> -	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
> >> -	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
> >> -	HVMSG_X64_APIC_EOI		= 0x80010004,
> >> -	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
> >> -};
> >> -
> >>  struct hv_nested_enlightenments_control {
> >>  	struct {
> >>  		__u32 directhypercall:1;
> >> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> b/arch/x86/include/uapi/asm/hyperv-
> >> tlfs.h
> >> index 2ff655962738..c6a27053f791 100644
> >> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> >> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> >> @@ -722,4 +722,268 @@ union hv_register_value {
> >>  		pending_virtualization_fault_event;
> >>  };
> >>
> >> +/* Define hypervisor message types. */
> >> +enum hv_message_type {
> >> +	HVMSG_NONE				= 0x00000000,
> >> +
> >> +	/* Memory access messages. */
> >> +	HVMSG_UNMAPPED_GPA			= 0x80000000,
> >> +	HVMSG_GPA_INTERCEPT			= 0x80000001,
> >> +
> >> +	/* Timer notification messages. */
> >> +	HVMSG_TIMER_EXPIRED			= 0x80000010,
> >> +
> >> +	/* Error messages. */
> >> +	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
> >> +	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
> >> +	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
> >> +
> >> +	/* Trace buffer complete messages. */
> >> +	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
> >> +
> >> +	/* Platform-specific processor intercept messages. */
> >> +	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
> >> +	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
> >> +	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
> >> +	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
> >> +	HVMSG_X64_APIC_EOI			= 0x80010004,
> >> +	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
> >> +	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
> >> +	HVMSG_X64_HALT				= 0x80010007,
> >> +	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
> >> +	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
> >> +};
> >
> > I have a separate patch series that moves this enum to the
> > asm-generic portion of hyperv-tlfs.h because there's not a good way
> > to separate the arch neutral from arch dependent values.
> >
> 
> Ok, but it should also be changed to #define instead of an enum, right?
> I will do that in this patch.
> This requires a couple of changes in other files in drivers/hv
> where this enum is used.

Because of the other uses of the enum in places that don't depend
on exact structure layouts, I left it as an enum when I moved it.
When one of the enum values is passed to Hyper-V, the enum
is assigned to a u32 field, which I think is acceptable.  You could
do the same with the other enums your already have -- keep the
constant definitions as members of an enum, but assign to a u32
field in the structures that get passed to Hyper-V.  There may
actually be some benefit in that approach, particularly if the enum
is passed as an individual argument into some function(s). 

Others may have an opinion on this approach .....

Michael

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
  2021-02-08 19:48   ` Michael Kelley
@ 2021-03-11 23:38     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-11 23:38 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan


On 2/8/2021 11:48 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
>> To: linux-hyperv@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org; linux-kernel@vger.kernel.org; Michael Kelley
>> <mikelley@microsoft.com>; viremana@linux.microsoft.com; Sunil Muthuswamy
>> <sunilmut@microsoft.com>; nunodasneves@linux.microsoft.com; wei.liu@kernel.org;
>> Lillian Grassin-Drake <Lillian.GrassinDrake@microsoft.com>; KY Srinivasan
>> <kys@microsoft.com>
>> Subject: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
>>
>> Introduce ioctls for getting and setting guest vcpu emulated LAPIC
>> state, and xsave data.
>>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  Documentation/virt/mshv/api.rst         |   8 +
>>  arch/x86/include/uapi/asm/hyperv-tlfs.h |  59 ++++++
>>  include/asm-generic/hyperv-tlfs.h       |  41 ++++
>>  include/uapi/asm-generic/hyperv-tlfs.h  |  28 +++
>>  include/uapi/linux/mshv.h               |  13 ++
>>  virt/mshv/mshv_main.c                   | 262 ++++++++++++++++++++++++
>>  6 files changed, 411 insertions(+)
>>
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> index 694f978131f9..7fd75f248eff 100644
>> --- a/Documentation/virt/mshv/api.rst
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -140,4 +140,12 @@ Assert interrupts in partitions that use Microsoft Hypervisor's
>> internal
>>  emulated LAPIC. This must be enabled on partition creation with the flag:
>>  HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
>>
>> +3.9 MSHV_GET_VP_STATE and MSHV_SET_VP_STATE
>> +--------------------------
>> +:Type: vp ioctl
>> +:Parameters: struct mshv_vp_state
>> +:Returns: 0 on success
>> +
>> +Get/set various vp state. Currently these can be used to get and set
>> +emulated LAPIC state, and xsave data.
>>
>> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
>> tlfs.h
>> index 5478d4943bfc..78758aedf23e 100644
>> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> @@ -1051,4 +1051,63 @@ union hv_interrupt_control {
>>  	__u64 as_uint64;
>>  };
>>
>> +struct hv_local_interrupt_controller_state {
>> +	__u32 apic_id;
>> +	__u32 apic_version;
>> +	__u32 apic_ldr;
>> +	__u32 apic_dfr;
>> +	__u32 apic_spurious;
>> +	__u32 apic_isr[8];
>> +	__u32 apic_tmr[8];
>> +	__u32 apic_irr[8];
>> +	__u32 apic_esr;
>> +	__u32 apic_icr_high;
>> +	__u32 apic_icr_low;
>> +	__u32 apic_lvt_timer;
>> +	__u32 apic_lvt_thermal;
>> +	__u32 apic_lvt_perfmon;
>> +	__u32 apic_lvt_lint0;
>> +	__u32 apic_lvt_lint1;
>> +	__u32 apic_lvt_error;
>> +	__u32 apic_lvt_cmci;
>> +	__u32 apic_error_status;
>> +	__u32 apic_initial_count;
>> +	__u32 apic_counter_value;
>> +	__u32 apic_divide_configuration;
>> +	__u32 apic_remote_read;
>> +};
>> +
>> +#define HV_XSAVE_DATA_NO_XMM_REGISTERS 1
>> +
>> +union hv_x64_xsave_xfem_register {
>> +	__u64 as_uint64;
>> +	struct {
>> +		__u32 low_uint32;
>> +		__u32 high_uint32;
>> +	};
>> +	struct {
>> +		__u64 legacy_x87: 1;
>> +		__u64 legacy_sse: 1;
>> +		__u64 avx: 1;
>> +		__u64 mpx_bndreg: 1;
>> +		__u64 mpx_bndcsr: 1;
>> +		__u64 avx_512_op_mask: 1;
>> +		__u64 avx_512_zmmhi: 1;
>> +		__u64 avx_512_zmm16_31: 1;
>> +		__u64 rsvd8_9: 2;
>> +		__u64 pasid: 1;
>> +		__u64 cet_u: 1;
>> +		__u64 cet_s: 1;
>> +		__u64 rsvd13_16: 4;
>> +		__u64 xtile_cfg: 1;
>> +		__u64 xtile_data: 1;
>> +		__u64 rsvd19_63: 45;
>> +	};
>> +};
>> +
>> +struct hv_vp_state_data_xsave {
>> +	__u64 flags;
>> +	union hv_x64_xsave_xfem_register states;
>> +};
>> +
>>  #endif
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 2cd46241c545..4bc59a0344ce 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -167,6 +167,9 @@ struct ms_hyperv_tsc_page {
>>  #define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
>>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
>>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
>> +#define HVCALL_MAP_VP_STATE_PAGE			0x00e1
>> +#define HVCALL_GET_VP_STATE				0x00e3
>> +#define HVCALL_SET_VP_STATE				0x00e4
>>
>>  #define HV_FLUSH_ALL_PROCESSORS			BIT(0)
>>  #define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES	BIT(1)
>> @@ -796,4 +799,42 @@ struct hv_assert_virtual_interrupt {
>>  	u16 rsvd_z1;
>>  };
>>
>> +struct hv_vp_state_data {
>> +	enum hv_get_set_vp_state_type type;
>> +	u32 rsvd;
>> +	struct hv_vp_state_data_xsave xsave;
>> +
>> +};
>> +
>> +struct hv_get_vp_state_in {
>> +	u64 partition_id;
>> +	u32 vp_index;
>> +	u8 input_vtl;
>> +	u8 rsvd0;
>> +	u16 rsvd1;
>> +	struct hv_vp_state_data state_data;
>> +	u64 output_data_pfns[];
>> +};
>> +
>> +union hv_get_vp_state_out {
>> +	struct hv_local_interrupt_controller_state interrupt_controller_state;
>> +	/* Not supported yet */
>> +	/* struct hv_synthetic_timers_state synthetic_timers_state; */
>> +};
>> +
>> +union hv_input_set_vp_state_data {
>> +	u64 pfns;
>> +	u8 bytes;
>> +};
>> +
>> +struct hv_set_vp_state_in {
>> +	u64 partition_id;
>> +	u32 vp_index;
>> +	u8 input_vtl;
>> +	u8 rsvd0;
>> +	u16 rsvd1;
>> +	struct hv_vp_state_data state_data;
>> +	union hv_input_set_vp_state_data data[];
>> +};
>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
>> tlfs.h
>> index e87389054b68..b3c84c69b73f 100644
>> --- a/include/uapi/asm-generic/hyperv-tlfs.h
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -64,4 +64,32 @@ struct hv_message {
>>  #define HV_MAP_GPA_EXECUTABLE           0xC
>>  #define HV_MAP_GPA_PERMISSIONS_MASK     0xF
>>
>> +/*
>> + * For getting and setting VP state, there are two options based on the state type:
>> + *
>> + *     1.) Data that is accessed by PFNs in the input hypercall page. This is used
>> + *         for state which may not fit into the hypercall pages.
>> + *     2.) Data that is accessed directly in the input\output hypercall pages.
>> + *         This is used for state that will always fit into the hypercall pages.
>> + *
>> + * In the future this could be dynamic based on the size if needed.
>> + *
>> + * Note these hypercalls have an 8-byte aligned variable header size as per the tlfs
>> + */
>> +
>> +#define HV_GET_SET_VP_STATE_TYPE_PFN	BIT(31)
>> +
>> +enum hv_get_set_vp_state_type {
>> +	HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE = 0,
>> +
>> +	HV_GET_SET_VP_STATE_XSAVE		= 1 |
>> HV_GET_SET_VP_STATE_TYPE_PFN,
>> +	/* Synthetic message page */
>> +	HV_GET_SET_VP_STATE_SIM_PAGE		= 2 |
>> HV_GET_SET_VP_STATE_TYPE_PFN,
>> +	/* Synthetic interrupt event flags page. */
>> +	HV_GET_SET_VP_STATE_SIEF_PAGE		= 3 |
>> HV_GET_SET_VP_STATE_TYPE_PFN,
>> +
>> +	/* Synthetic timers. */
>> +	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
>> +};
>> +
>>  #endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index faed9d065bb7..ae0bb64bbec3 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -53,6 +53,17 @@ struct mshv_assert_interrupt {
>>  	__u32 vector;
>>  };
>>
>> +struct mshv_vp_state {
>> +	enum hv_get_set_vp_state_type type;
>> +	struct hv_vp_state_data_xsave xsave; /* only for xsave request */
>> +
>> +	__u64 buf_size; /* If xsave, must be page-aligned */
>> +	union {
>> +		struct hv_local_interrupt_controller_state *lapic;
>> +		__u8 *bytes; /* Xsave data. must be page-aligned */
>> +	} buf;
>> +};
>> +
>>  #define MSHV_IOCTL 0xB8
>>
>>  /* mshv device */
>> @@ -70,5 +81,7 @@ struct mshv_assert_interrupt {
>>  #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct
>> mshv_vp_registers)
>>  #define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
>>  #define MSHV_RUN_VP		_IOR(MSHV_IOCTL, 0x07, struct hv_message)
>> +#define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
>> +#define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
>>
>>  #endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 9cf236ade50a..70172d9488de 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -864,6 +864,262 @@ mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user
>> *user_args)
>>  	return ret;
>>  }
>>
>> +static int
>> +hv_call_get_vp_state(u32 vp_index,
>> +		     u64 partition_id,
>> +		     enum hv_get_set_vp_state_type type,
>> +		     struct hv_vp_state_data_xsave xsave,
>> +		    /* Choose between pages and ret_output */
>> +		     u64 page_count,
>> +		     struct page **pages,
>> +		     union hv_get_vp_state_out *ret_output)
>> +{
>> +	struct hv_get_vp_state_in *input;
>> +	union hv_get_vp_state_out *output;
>> +	int status;
>> +	int i;
>> +	u64 control;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
>> +		return -EINVAL;
> 
> Nit:  Stylistically, you are handling this differently from the BATCH_SIZE
> macros, which are essentially doing the same thing of calculating
> how many entries will fit in the input page.   Note to use
> HV_HYP_PAGE_SIZE.
> 

Hmm, I didn't notice this. I guess it's ok either way, but for consistency I will add:
#define HV_GET_VP_STATE_BATCH_SIZE ((HV_HYP_PAGE_SIZE - sizeof(struct hv_get_vp_state_in)) / sizeof(u64))
And change the condition to:
if (page_count > HV_GET_VP_STATE_BATCH_SIZE)

>> +
>> +	if (!page_count && !ret_output)
>> +		return -EINVAL;
>> +
>> +	do {
>> +		local_irq_save(flags);
>> +		input = (struct hv_get_vp_state_in *)
>> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
>> +		output = (union hv_get_vp_state_out *)
>> +				(*this_cpu_ptr(hyperv_pcpu_output_arg));
>> +		memset(input, 0, sizeof(*input));
>> +		memset(output, 0, sizeof(*output));
>> +
>> +		input->partition_id = partition_id;
>> +		input->vp_index = vp_index;
>> +		input->state_data.type = type;
>> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
>> +		for (i = 0; i < page_count; i++)
>> +			input->output_data_pfns[i] =
>> +				page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
>> +
>> +		control = (HVCALL_GET_VP_STATE) |
>> +			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
>> +
>> +		status = hv_do_hypercall(control, input, output) &
>> +			 HV_HYPERCALL_RESULT_MASK;
>> +
>> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			if (status != HV_STATUS_SUCCESS)
>> +				pr_err("%s: %s\n", __func__,
>> +				       hv_status_to_string(status));
>> +			else if (ret_output)
>> +				memcpy(ret_output, output, sizeof(*output));
>> +
>> +			local_irq_restore(flags);
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +		local_irq_restore(flags);
>> +
>> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> +					    partition_id, 1);
>> +	} while (!ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +hv_call_set_vp_state(u32 vp_index,
>> +		     u64 partition_id,
>> +		     enum hv_get_set_vp_state_type type,
>> +		     struct hv_vp_state_data_xsave xsave,
>> +		    /* Choose between pages and bytes */
>> +		     u64 page_count,
>> +		     struct page **pages,
>> +		     u32 num_bytes,
>> +		     u8 *bytes)
>> +{
>> +	struct hv_set_vp_state_in *input;
>> +	int status;
>> +	int i;
>> +	u64 control;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +	u16 varhead_sz;
>> +
>> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
> 
> Same comment as above.
> 

I'll do the same as above.

>> +		return -EINVAL;
>> +	if (sizeof(*input) + num_bytes > PAGE_SIZE)
> 
> Use HV_HYP_PAGE_SIZE.
> 

Will do.

>> +		return -EINVAL;
>> +
>> +	if (num_bytes)
>> +		/* round up to 8 and divide by 8 */
>> +		varhead_sz = (num_bytes + 7) >> 3;
>> +	else if (page_count)
>> +		varhead_sz =  page_count;
>> +	else
>> +		return -EINVAL;
>> +
>> +	do {
>> +		local_irq_save(flags);
>> +		input = (struct hv_set_vp_state_in *)
>> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
>> +		memset(input, 0, sizeof(*input));
>> +
>> +		input->partition_id = partition_id;
>> +		input->vp_index = vp_index;
>> +		input->state_data.type = type;
>> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
>> +		if (num_bytes) {
>> +			memcpy((u8 *)input->data, bytes, num_bytes);
>> +		} else {
>> +			for (i = 0; i < page_count; i++)
>> +				input->data[i].pfns =
>> +					page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
> 
> Same comment as in earlier patch about GPA_MASK.  Also, this doesn't work
> if PAGE_SIZE != HV_HYP_PAGE_SIZE, though it may be fine to not handle that case
> for now.
> 

Will remove the mask.
As before, won't handle PAGE_SIZE != HV_HYP_PAGE_SIZE in this patch set.

>> +		}
>> +
>> +		control = (HVCALL_SET_VP_STATE) |
>> +			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
>> +
>> +		status = hv_do_hypercall(control, input, NULL) &
>> +			 HV_HYPERCALL_RESULT_MASK;
>> +
>> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			if (status != HV_STATUS_SUCCESS)
>> +				pr_err("%s: %s\n", __func__,
>> +				       hv_status_to_string(status));
>> +
>> +			local_irq_restore(flags);
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +		local_irq_restore(flags);
>> +
>> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> +					    partition_id, 1);
>> +	} while (!ret);
>> +
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
>> +				struct mshv_vp_state *args,
>> +				bool is_set)
>> +{
>> +	u64 page_count, remaining;
>> +	int completed;
>> +	struct page **pages;
>> +	long ret;
>> +	unsigned long u_buf;
>> +
>> +	/* Buffer must be page aligned */
>> +	if (args->buf_size & (PAGE_SIZE - 1) ||
>> +	    (u64)args->buf.bytes & (PAGE_SIZE - 1))
>> +		return -EINVAL;
> 
> Use PAGE_ALIGNED macro.
> 

Will do.

>> +
>> +	if (!access_ok(args->buf.bytes, args->buf_size))
>> +		return -EFAULT;
>> +
>> +	/* Pin user pages so hypervisor can copy directly to them */
>> +	page_count = args->buf_size >> PAGE_SHIFT;
>> +	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	remaining = page_count;
>> +	u_buf = (unsigned long)args->buf.bytes;
>> +	while (remaining) {
>> +		completed = pin_user_pages_fast(
>> +				u_buf,
>> +				remaining,
>> +				FOLL_WRITE,
>> +				&pages[page_count - remaining]);
>> +		if (completed < 0) {
>> +			pr_err("%s: failed to pin user pages error %i\n",
>> +			       __func__, completed);
>> +			ret = completed;
>> +			goto unpin_pages;
>> +		}
>> +		remaining -= completed;
>> +		u_buf += completed * PAGE_SIZE;
>> +	}
>> +
>> +	if (is_set)
>> +		ret = hv_call_set_vp_state(vp->index,
>> +					   vp->partition->id,
>> +					   args->type, args->xsave,
>> +					   page_count, pages,
>> +					   0, NULL);
>> +	else
>> +		ret = hv_call_get_vp_state(vp->index,
>> +					   vp->partition->id,
>> +					   args->type, args->xsave,
>> +					   page_count, pages,
>> +					   NULL);
>> +
>> +unpin_pages:
>> +	unpin_user_pages(pages, page_count - remaining);
>> +	kfree(pages);
>> +	return ret;
>> +}
>> +
>> +static long
>> +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp, void __user *user_args, bool is_set)
>> +{
>> +	struct mshv_vp_state args;
>> +	long ret = 0;
>> +	union hv_get_vp_state_out vp_state;
>> +
>> +	if (copy_from_user(&args, user_args, sizeof(args)))
>> +		return -EFAULT;
>> +
>> +	/* For now just support these */
>> +	if (args.type != HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE &&
>> +	    args.type != HV_GET_SET_VP_STATE_XSAVE)
>> +		return -EINVAL;
>> +
>> +	/* If we need to pin pfns, delegate to helper */
>> +	if (args.type & HV_GET_SET_VP_STATE_TYPE_PFN)
>> +		return mshv_vp_ioctl_get_set_state_pfn(vp, &args, is_set);
>> +
>> +	if (args.buf_size < sizeof(vp_state))
>> +		return -EINVAL;
>> +
>> +	if (is_set) {
>> +		if (copy_from_user(
>> +				&vp_state,
>> +				args.buf.lapic,
>> +				sizeof(vp_state)))
>> +			return -EFAULT;
>> +
>> +		return hv_call_set_vp_state(vp->index,
>> +					    vp->partition->id,
>> +					    args.type, args.xsave,
>> +					    0, NULL,
>> +					    sizeof(vp_state),
>> +					    (u8 *)&vp_state);
>> +	}
>> +
>> +	ret = hv_call_get_vp_state(vp->index,
>> +				   vp->partition->id,
>> +				   args.type, args.xsave,
>> +				   0, NULL,
>> +				   &vp_state);
>> +
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (copy_to_user(args.buf.lapic,
>> +			 &vp_state.interrupt_controller_state,
>> +			 sizeof(vp_state.interrupt_controller_state)))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>>
>>  static long
>>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>> @@ -884,6 +1140,12 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long
>> arg)
>>  	case MSHV_SET_VP_REGISTERS:
>>  		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
>>  		break;
>> +	case MSHV_GET_VP_STATE:
>> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
>> +		break;
>> +	case MSHV_SET_VP_STATE:
>> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
>> +		break;
>>  	default:
>>  		r = -ENOTTY;
>>  		break;
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 16/18] virt/mshv: mmap vp register page
  2021-02-08 19:49   ` Michael Kelley
@ 2021-03-25 17:36     ` Nuno Das Neves
  0 siblings, 0 replies; 53+ messages in thread
From: Nuno Das Neves @ 2021-03-25 17:36 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv
  Cc: virtualization, linux-kernel, viremana, Sunil Muthuswamy,
	wei.liu, Lillian Grassin-Drake, KY Srinivasan


On 2/8/2021 11:49 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
>>
>> Introduce mmap interface for a virtual processor, exposing a page for
>> setting and getting common registers while the VP is suspended.
>>
>> This provides a more performant and convenient way to get and set these
>> registers in the context of a vmm's run-loop.
>>
>> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> ---
>>  Documentation/virt/mshv/api.rst         | 11 ++++
>>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 74 ++++++++++++++++++++++
>>  include/asm-generic/hyperv-tlfs.h       | 10 +++
>>  include/linux/mshv.h                    |  1 +
>>  include/uapi/asm-generic/hyperv-tlfs.h  |  5 ++
>>  include/uapi/linux/mshv.h               | 12 ++++
>>  virt/mshv/mshv_main.c                   | 82 +++++++++++++++++++++++++
>>  7 files changed, 195 insertions(+)
>>
>> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
>> index 7fd75f248eff..89c276a8778f 100644
>> --- a/Documentation/virt/mshv/api.rst
>> +++ b/Documentation/virt/mshv/api.rst
>> @@ -149,3 +149,14 @@ HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
>>  Get/set various vp state. Currently these can be used to get and set
>>  emulated LAPIC state, and xsave data.
>>
>> +3.10 mmap(vp)
>> +-------------
>> +:Type: vp mmap
>> +:Parameters: offset should be HV_VP_MMAP_REGISTERS_OFFSET
>> +:Returns: 0 on success
>> +
>> +Maps a page into userspace that can be used to get and set common registers
>> +while the vp is suspended.
>> +The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
>> +
> 
> I'm assuming there's no support for the corresponding munmap().
> What happens if munmap is called?  Does it just fail and the page remains
> mapped?
> 

munmap() will successfully unmap the page from userspace.
The physical state page remains mapped in the hypervisor, tracked in mshv in vp->register_page.
This is re-used on subsequent mmap()s.

>> +
>> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-
>> tlfs.h
>> index 78758aedf23e..a241178567ff 100644
>> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
>> @@ -1110,4 +1110,78 @@ struct hv_vp_state_data_xsave {
>>  	union hv_x64_xsave_xfem_register states;
>>  };
>>
>> +/* Bits for dirty mask of hv_vp_register_page */
>> +#define HV_X64_REGISTER_CLASS_GENERAL	0
>> +#define HV_X64_REGISTER_CLASS_IP	1
>> +#define HV_X64_REGISTER_CLASS_XMM	2
>> +#define HV_X64_REGISTER_CLASS_SEGMENT	3
>> +#define HV_X64_REGISTER_CLASS_FLAGS	4
>> +
>> +#define HV_VP_REGISTER_PAGE_VERSION_1	1u
>> +
>> +struct hv_vp_register_page {
>> +	__u16 version;
>> +	bool isvalid;
> 
> Like enum, avoid type "bool" in data structures shared with
> Hyper-V.
> 

Indeed - this should be u8. I will change it.

>> +	__u8 rsvdz;
>> +	__u32 dirty;
>> +	union {
>> +		struct {
>> +			__u64 rax;
>> +			__u64 rcx;
>> +			__u64 rdx;
>> +			__u64 rbx;
>> +			__u64 rsp;
>> +			__u64 rbp;
>> +			__u64 rsi;
>> +			__u64 rdi;
>> +			__u64 r8;
>> +			__u64 r9;
>> +			__u64 r10;
>> +			__u64 r11;
>> +			__u64 r12;
>> +			__u64 r13;
>> +			__u64 r14;
>> +			__u64 r15;
>> +		};
>> +
>> +		__u64 gp_registers[16];
>> +	};
>> +	__u64 rip;
>> +	__u64 rflags;
>> +	union {
>> +		struct {
>> +			struct hv_u128 xmm0;
>> +			struct hv_u128 xmm1;
>> +			struct hv_u128 xmm2;
>> +			struct hv_u128 xmm3;
>> +			struct hv_u128 xmm4;
>> +			struct hv_u128 xmm5;
>> +		};
>> +
>> +		struct hv_u128 xmm_registers[6];
>> +	};
>> +	union {
>> +		struct {
>> +			struct hv_x64_segment_register es;
>> +			struct hv_x64_segment_register cs;
>> +			struct hv_x64_segment_register ss;
>> +			struct hv_x64_segment_register ds;
>> +			struct hv_x64_segment_register fs;
>> +			struct hv_x64_segment_register gs;
>> +		};
>> +
>> +		struct hv_x64_segment_register segment_registers[6];
>> +	};
>> +	/* read only */
>> +	__u64 cr0;
>> +	__u64 cr3;
>> +	__u64 cr4;
>> +	__u64 cr8;
>> +	__u64 efer;
>> +	__u64 dr7;
>> +	union hv_x64_pending_interruption_register pending_interruption;
>> +	union hv_x64_interrupt_state_register interrupt_state;
>> +	__u64 instruction_emulation_hints;
>> +};
>> +
>>  #endif
>> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
>> index 4bc59a0344ce..9eed4b869110 100644
>> --- a/include/asm-generic/hyperv-tlfs.h
>> +++ b/include/asm-generic/hyperv-tlfs.h
>> @@ -837,4 +837,14 @@ struct hv_set_vp_state_in {
>>  	union hv_input_set_vp_state_data data[];
>>  };
>>
>> +struct hv_map_vp_state_page_in {
>> +	u64 partition_id;
>> +	u32 vp_index;
>> +	enum hv_vp_state_page_type type;
>> +};
>> +
>> +struct hv_map_vp_state_page_out {
>> +	u64 map_location; /* page number */
>> +};
>> +
>>  #endif
>> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
>> index 3933d80294f1..33f4d0cfee11 100644
>> --- a/include/linux/mshv.h
>> +++ b/include/linux/mshv.h
>> @@ -20,6 +20,7 @@ struct mshv_vp {
>>  	u32 index;
>>  	struct mshv_partition *partition;
>>  	struct mutex mutex;
>> +	struct page *register_page;
>>  	struct {
>>  		struct semaphore sem;
>>  		struct task_struct *task;
>> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-
>> tlfs.h
>> index b3c84c69b73f..a747f39b132a 100644
>> --- a/include/uapi/asm-generic/hyperv-tlfs.h
>> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
>> @@ -92,4 +92,9 @@ enum hv_get_set_vp_state_type {
>>  	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
>>  };
>>
>> +enum hv_vp_state_page_type {
>> +	HV_VP_STATE_PAGE_REGISTERS = 0,
>> +	HV_VP_STATE_PAGE_COUNT
>> +};
>> +
>>  #endif
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index ae0bb64bbec3..8537ff29aee5 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -13,6 +13,8 @@
>>
>>  #define MSHV_VERSION	0x0
>>
>> +#define MSHV_VP_MMAP_REGISTERS_OFFSET (HV_VP_STATE_PAGE_REGISTERS * 0x1000)
>> +
>>  struct mshv_create_partition {
>>  	__u64 flags;
>>  	struct hv_partition_creation_properties partition_creation_properties;
>> @@ -84,4 +86,14 @@ struct mshv_vp_state {
>>  #define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
>>  #define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
>>
>> +/* register page mapping example:
>> + * struct hv_vp_register_page *regs = mmap(NULL,
>> + *					   4096,
>> + *					   PROT_READ | PROT_WRITE,
>> + *					   MAP_SHARED,
>> + *					   vp_fd,
>> + *					   HV_VP_MMAP_REGISTERS_OFFSET);
>> + * munmap(regs, 4096);
>> + */
>> +
>>  #endif
>> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
>> index 70172d9488de..a597254fa4f4 100644
>> --- a/virt/mshv/mshv_main.c
>> +++ b/virt/mshv/mshv_main.c
>> @@ -43,11 +43,18 @@ static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl,
>> unsigned
>>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
>> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
>> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
>> +
>> +static const struct vm_operations_struct mshv_vp_vm_ops = {
>> +	.fault = mshv_vp_fault,
>> +};
>>
>>  static const struct file_operations mshv_vp_fops = {
>>  	.release = mshv_vp_release,
>>  	.unlocked_ioctl = mshv_vp_ioctl,
>>  	.llseek = noop_llseek,
>> +	.mmap = mshv_vp_mmap,
>>  };
>>
>>  static const struct file_operations mshv_partition_fops = {
>> @@ -499,6 +506,47 @@ hv_call_set_vp_registers(u32 vp_index,
>>  	return -hv_status_to_errno(status);
>>  }
>>
>> +static int
>> +hv_call_map_vp_state_page(u32 vp_index, u64 partition_id,
>> +			  struct page **state_page)
>> +{
>> +	struct hv_map_vp_state_page_in *input;
>> +	struct hv_map_vp_state_page_out *output;
>> +	int status;
>> +	int ret;
>> +	unsigned long flags;
>> +
>> +	do {
>> +		local_irq_save(flags);
>> +		input = (struct hv_map_vp_state_page_in *)(*this_cpu_ptr(
>> +			hyperv_pcpu_input_arg));
>> +		output = (struct hv_map_vp_state_page_out *)(*this_cpu_ptr(
>> +			hyperv_pcpu_output_arg));
>> +
>> +		input->partition_id = partition_id;
>> +		input->vp_index = vp_index;
>> +		input->type = HV_VP_STATE_PAGE_REGISTERS;
>> +		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE,
>> +						   input, output);
>> +
>> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			if (status == HV_STATUS_SUCCESS)
>> +				*state_page = pfn_to_page(output->map_location);
>> +			else
>> +				pr_err("%s: %s\n", __func__,
>> +				       hv_status_to_string(status));
>> +			local_irq_restore(flags);
>> +			ret = -hv_status_to_errno(status);
>> +			break;
>> +		}
>> +		local_irq_restore(flags);
>> +
>> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>> +	} while (!ret);
>> +
>> +	return ret;
>> +}
>> +
>>  static void
>>  mshv_isr(void)
>>  {
>> @@ -1155,6 +1203,40 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long
>> arg)
>>  	return r;
>>  }
>>
>> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
>> +{
>> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
>> +
>> +	vmf->page = vp->register_page;
>> +
>> +	return 0;
>> +}
>> +
>> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	int ret;
>> +	struct mshv_vp *vp = file->private_data;
>> +
>> +	if (vma->vm_pgoff != MSHV_VP_MMAP_REGISTERS_OFFSET)
>> +		return -EINVAL;
>> +
>> +	if (mutex_lock_killable(&vp->mutex))
>> +		return -EINTR;
>> +
>> +	if (!vp->register_page) {
>> +		ret = hv_call_map_vp_state_page(vp->index,
>> +						vp->partition->id,
>> +						&vp->register_page);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	mutex_unlock(&vp->mutex);
>> +
>> +	vma->vm_ops = &mshv_vp_vm_ops;
>> +	return 0;
>> +}
>> +
>>  static int
>>  mshv_vp_release(struct inode *inode, struct file *filp)
>>  {
>> --
>> 2.25.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-03-05  9:18       ` Vitaly Kuznetsov
@ 2021-04-07  0:21         ` Nuno Das Neves
  2021-04-07  7:38           ` Vitaly Kuznetsov
  0 siblings, 1 reply; 53+ messages in thread
From: Nuno Das Neves @ 2021-04-07  0:21 UTC (permalink / raw)
  To: Vitaly Kuznetsov, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys


On 3/5/2021 1:18 AM, Vitaly Kuznetsov wrote:
> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> 
>> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
>>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>>>
> ...
>>>> +
>>>> +3.1 MSHV_REQUEST_VERSION
>>>> +------------------------
>>>> +:Type: /dev/mshv ioctl
>>>> +:Parameters: pointer to a u32
>>>> +:Returns: 0 on success
>>>> +
>>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>>>> +establish the interface version with the kernel module.
>>>> +
>>>> +The caller should pass the MSHV_VERSION as an argument.
>>>> +
>>>> +The kernel module will check which interface versions it supports and return 0
>>>> +if one of them matches.
>>>> +
>>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>>> +it is open - this ioctl can only be called once per open.
>>>> +
>>>
>>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
>>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
>>> instead.
>>>
>>
>> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
>> When we add new features/ioctls beyond the core we can use an extension/capability
>> approach like KVM.
>>
> 
> Driver versions is a very bad idea from distribution/stable kernel point
> of view as it presumes that the history is linear. It is not.
> 
> Imagine you have the following history upstream:
> 
> MSHV_REQUEST_VERSION = 1
> <100 commits with features/fixes>
> MSHV_REQUEST_VERSION = 2
> <another 100 commits with features/fixes>
> MSHV_REQUEST_VERSION = 2
> 
> Now I'm a linux distribution / stable kernel maintainer. My kernel is at
> MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
> VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
> history now looks like
> 
> MSHV_REQUEST_VERSION = 1
> <5 commits from between VER=1 and VER=2>
>    Which version should I declare here???? 
> <5 commits from between VER=2 and VER=3>
>    Which version should I declare here???? 
> 
> If I keep VER=1 then userspace will think that I don't have any extra
> features added and just won't use them. If I change VER to 2/3, it'll
> think I have *all* features from between these versions.
> 
> The only reasonable way to manage this is to attach a "capability" to
> every ABI change and expose this capability *in the same commit which
> introduces the change to the ABI*. This way userspace will now exactly
> which ioctls are available and what are their interfaces.
> 
> Also, trying to define "core set" is hard but you don't really need
> to.
> 

We've had some internal discussion on this.

There is bound to be some iteration before this ABI is stable, since even the
underlying Microsoft hypervisor interfaces aren't stable just yet.

It might make more sense to just have an IOCTL to check if the API is stable yet.
This would be analogous to checking if kVM_GET_API_VERSION returns 12.

How does this sound as a proposal?
An MSHV_CHECK_EXTENSION ioctl to query extensions to the core /dev/mshv API.

It takes a single argument, an integer named MSHV_CAP_* corresponding to
the extension to check the existence of.

The ioctl will return 0 if the extension is unsupported, or a positive integer
if supported.

We can initially include a capability called MSHV_CAP_CORE_API_STABLE.
If supported, the core APIs are stable.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-04-07  0:21         ` Nuno Das Neves
@ 2021-04-07  7:38           ` Vitaly Kuznetsov
  2021-04-07 13:43             ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-04-07  7:38 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: virtualization, linux-kernel, mikelley, viremana, sunilmut,
	wei.liu, ligrassi, kys

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> On 3/5/2021 1:18 AM, Vitaly Kuznetsov wrote:
>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>> 
>>> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
>>>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>>>>
>> ...
>>>>> +
>>>>> +3.1 MSHV_REQUEST_VERSION
>>>>> +------------------------
>>>>> +:Type: /dev/mshv ioctl
>>>>> +:Parameters: pointer to a u32
>>>>> +:Returns: 0 on success
>>>>> +
>>>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>>>>> +establish the interface version with the kernel module.
>>>>> +
>>>>> +The caller should pass the MSHV_VERSION as an argument.
>>>>> +
>>>>> +The kernel module will check which interface versions it supports and return 0
>>>>> +if one of them matches.
>>>>> +
>>>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>>>> +it is open - this ioctl can only be called once per open.
>>>>> +
>>>>
>>>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
>>>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
>>>> instead.
>>>>
>>>
>>> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
>>> When we add new features/ioctls beyond the core we can use an extension/capability
>>> approach like KVM.
>>>
>> 
>> Driver versions is a very bad idea from distribution/stable kernel point
>> of view as it presumes that the history is linear. It is not.
>> 
>> Imagine you have the following history upstream:
>> 
>> MSHV_REQUEST_VERSION = 1
>> <100 commits with features/fixes>
>> MSHV_REQUEST_VERSION = 2
>> <another 100 commits with features/fixes>
>> MSHV_REQUEST_VERSION = 2
>> 
>> Now I'm a linux distribution / stable kernel maintainer. My kernel is at
>> MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
>> VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
>> history now looks like
>> 
>> MSHV_REQUEST_VERSION = 1
>> <5 commits from between VER=1 and VER=2>
>>    Which version should I declare here???? 
>> <5 commits from between VER=2 and VER=3>
>>    Which version should I declare here???? 
>> 
>> If I keep VER=1 then userspace will think that I don't have any extra
>> features added and just won't use them. If I change VER to 2/3, it'll
>> think I have *all* features from between these versions.
>> 
>> The only reasonable way to manage this is to attach a "capability" to
>> every ABI change and expose this capability *in the same commit which
>> introduces the change to the ABI*. This way userspace will now exactly
>> which ioctls are available and what are their interfaces.
>> 
>> Also, trying to define "core set" is hard but you don't really need
>> to.
>> 
>
> We've had some internal discussion on this.
>
> There is bound to be some iteration before this ABI is stable, since even the
> underlying Microsoft hypervisor interfaces aren't stable just yet.
>
> It might make more sense to just have an IOCTL to check if the API is stable yet.
> This would be analogous to checking if kVM_GET_API_VERSION returns 12.
>
> How does this sound as a proposal?
> An MSHV_CHECK_EXTENSION ioctl to query extensions to the core /dev/mshv API.
>
> It takes a single argument, an integer named MSHV_CAP_* corresponding to
> the extension to check the existence of.
>
> The ioctl will return 0 if the extension is unsupported, or a positive integer
> if supported.
>
> We can initially include a capability called MSHV_CAP_CORE_API_STABLE.
> If supported, the core APIs are stable.

This sounds reasonable, I'd suggest you reserve MSHV_CAP_CORE_API_STABLE
right away but don't expose it yet so it's clear the API is not yet
stable. Test userspace you have may always assume it's running with the
latest kernel.

Also, please be clear about the fact that /dev/mshv doesn't
provide a stable API yet so nobody builds an application on top of
it.

One more though: it is probably a good idea to introduce selftests for
/dev/mshv (similar to KVM's selftests in
/tools/testing/selftests/kvm). Selftests don't really need a stable ABI
as they live in the same linux.git and can be updated in the same patch
series which changes /dev/mshv behavior. Selftests are very useful for
checking there are no regressions, especially in the situation when
there's no publicly available userspace for /dev/mshv.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-04-07  7:38           ` Vitaly Kuznetsov
@ 2021-04-07 13:43             ` Wei Liu
  2021-04-07 14:02               ` Vitaly Kuznetsov
  0 siblings, 1 reply; 53+ messages in thread
From: Wei Liu @ 2021-04-07 13:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nuno Das Neves, linux-hyperv, virtualization, linux-kernel,
	mikelley, viremana, sunilmut, wei.liu, ligrassi, kys

On Wed, Apr 07, 2021 at 09:38:21AM +0200, Vitaly Kuznetsov wrote:
> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> 
> > On 3/5/2021 1:18 AM, Vitaly Kuznetsov wrote:
> >> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> >> 
> >>> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
> >>>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
> >>>>
> >> ...
> >>>>> +
> >>>>> +3.1 MSHV_REQUEST_VERSION
> >>>>> +------------------------
> >>>>> +:Type: /dev/mshv ioctl
> >>>>> +:Parameters: pointer to a u32
> >>>>> +:Returns: 0 on success
> >>>>> +
> >>>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
> >>>>> +establish the interface version with the kernel module.
> >>>>> +
> >>>>> +The caller should pass the MSHV_VERSION as an argument.
> >>>>> +
> >>>>> +The kernel module will check which interface versions it supports and return 0
> >>>>> +if one of them matches.
> >>>>> +
> >>>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
> >>>>> +it is open - this ioctl can only be called once per open.
> >>>>> +
> >>>>
> >>>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
> >>>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
> >>>> instead.
> >>>>
> >>>
> >>> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
> >>> When we add new features/ioctls beyond the core we can use an extension/capability
> >>> approach like KVM.
> >>>
> >> 
> >> Driver versions is a very bad idea from distribution/stable kernel point
> >> of view as it presumes that the history is linear. It is not.
> >> 
> >> Imagine you have the following history upstream:
> >> 
> >> MSHV_REQUEST_VERSION = 1
> >> <100 commits with features/fixes>
> >> MSHV_REQUEST_VERSION = 2
> >> <another 100 commits with features/fixes>
> >> MSHV_REQUEST_VERSION = 2
> >> 
> >> Now I'm a linux distribution / stable kernel maintainer. My kernel is at
> >> MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
> >> VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
> >> history now looks like
> >> 
> >> MSHV_REQUEST_VERSION = 1
> >> <5 commits from between VER=1 and VER=2>
> >>    Which version should I declare here???? 
> >> <5 commits from between VER=2 and VER=3>
> >>    Which version should I declare here???? 
> >> 
> >> If I keep VER=1 then userspace will think that I don't have any extra
> >> features added and just won't use them. If I change VER to 2/3, it'll
> >> think I have *all* features from between these versions.
> >> 
> >> The only reasonable way to manage this is to attach a "capability" to
> >> every ABI change and expose this capability *in the same commit which
> >> introduces the change to the ABI*. This way userspace will now exactly
> >> which ioctls are available and what are their interfaces.
> >> 
> >> Also, trying to define "core set" is hard but you don't really need
> >> to.
> >> 
> >
> > We've had some internal discussion on this.
> >
> > There is bound to be some iteration before this ABI is stable, since even the
> > underlying Microsoft hypervisor interfaces aren't stable just yet.
> >
> > It might make more sense to just have an IOCTL to check if the API is stable yet.
> > This would be analogous to checking if kVM_GET_API_VERSION returns 12.
> >
> > How does this sound as a proposal?
> > An MSHV_CHECK_EXTENSION ioctl to query extensions to the core /dev/mshv API.
> >
> > It takes a single argument, an integer named MSHV_CAP_* corresponding to
> > the extension to check the existence of.
> >
> > The ioctl will return 0 if the extension is unsupported, or a positive integer
> > if supported.
> >
> > We can initially include a capability called MSHV_CAP_CORE_API_STABLE.
> > If supported, the core APIs are stable.
> 
> This sounds reasonable, I'd suggest you reserve MSHV_CAP_CORE_API_STABLE
> right away but don't expose it yet so it's clear the API is not yet
> stable. Test userspace you have may always assume it's running with the
> latest kernel.
> 
> Also, please be clear about the fact that /dev/mshv doesn't
> provide a stable API yet so nobody builds an application on top of
> it.
> 

Very good discussion and suggestions. Thank you Vitaly.

> One more though: it is probably a good idea to introduce selftests for
> /dev/mshv (similar to KVM's selftests in
> /tools/testing/selftests/kvm). Selftests don't really need a stable ABI
> as they live in the same linux.git and can be updated in the same patch
> series which changes /dev/mshv behavior. Selftests are very useful for
> checking there are no regressions, especially in the situation when
> there's no publicly available userspace for /dev/mshv.

I think this can wait until we merge the first implementation in tree.
There are still a lot of moving parts. Our (currently limited) internal
test cases need more cleaning up before they are ready. I certainly
don't want to distract Nuno from getting the foundation right.

Wei.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-04-07 13:43             ` Wei Liu
@ 2021-04-07 14:02               ` Vitaly Kuznetsov
  2021-04-07 14:19                 ` Wei Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Vitaly Kuznetsov @ 2021-04-07 14:02 UTC (permalink / raw)
  To: Wei Liu
  Cc: Nuno Das Neves, linux-hyperv, virtualization, linux-kernel,
	mikelley, viremana, sunilmut, wei.liu, ligrassi, kys

Wei Liu <wei.liu@kernel.org> writes:

> On Wed, Apr 07, 2021 at 09:38:21AM +0200, Vitaly Kuznetsov wrote:
>
>> One more though: it is probably a good idea to introduce selftests for
>> /dev/mshv (similar to KVM's selftests in
>> /tools/testing/selftests/kvm). Selftests don't really need a stable ABI
>> as they live in the same linux.git and can be updated in the same patch
>> series which changes /dev/mshv behavior. Selftests are very useful for
>> checking there are no regressions, especially in the situation when
>> there's no publicly available userspace for /dev/mshv.
>
> I think this can wait until we merge the first implementation in tree.
> There are still a lot of moving parts. Our (currently limited) internal
> test cases need more cleaning up before they are ready. I certainly
> don't want to distract Nuno from getting the foundation right.
>

I'm absolutely fine with this approach, selftests are a nice add-on, not
a requirement for the initial implementation. Also, to make them more
useful to mere mortals, a doc on how to run Linux as root Hyper-V
partition would come handy)

-- 
Vitaly


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
  2021-04-07 14:02               ` Vitaly Kuznetsov
@ 2021-04-07 14:19                 ` Wei Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Wei Liu @ 2021-04-07 14:19 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Wei Liu, Nuno Das Neves, linux-hyperv, virtualization,
	linux-kernel, mikelley, viremana, sunilmut, ligrassi, kys

On Wed, Apr 07, 2021 at 04:02:56PM +0200, Vitaly Kuznetsov wrote:
> Wei Liu <wei.liu@kernel.org> writes:
> 
> > On Wed, Apr 07, 2021 at 09:38:21AM +0200, Vitaly Kuznetsov wrote:
> >
> >> One more though: it is probably a good idea to introduce selftests for
> >> /dev/mshv (similar to KVM's selftests in
> >> /tools/testing/selftests/kvm). Selftests don't really need a stable ABI
> >> as they live in the same linux.git and can be updated in the same patch
> >> series which changes /dev/mshv behavior. Selftests are very useful for
> >> checking there are no regressions, especially in the situation when
> >> there's no publicly available userspace for /dev/mshv.
> >
> > I think this can wait until we merge the first implementation in tree.
> > There are still a lot of moving parts. Our (currently limited) internal
> > test cases need more cleaning up before they are ready. I certainly
> > don't want to distract Nuno from getting the foundation right.
> >
> 
> I'm absolutely fine with this approach, selftests are a nice add-on, not
> a requirement for the initial implementation. Also, to make them more
> useful to mere mortals, a doc on how to run Linux as root Hyper-V
> partition would come handy)

Making this system easier for others to use and consume is on our radar.
Currently you need Windows bootloader and a not-yet-released loader to
load the hypervisor. We're making progress in bringing in GRUB.

Needless to say there are technical and non-technical challenges for
this work, so don't expect it to happen very soon. :-)

Wei.

> 
> -- 
> Vitaly
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2021-04-07 14:19 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-21  0:30 [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Nuno Das Neves
2021-02-09 13:04   ` Vitaly Kuznetsov
2021-03-04 18:24     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 02/18] asm-generic/hyperv: convert hyperv statuses to strings Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 03/18] virt/mshv: minimal mshv module (/dev/mshv/) Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 04/18] virt/mshv: request version ioctl Nuno Das Neves
2021-02-08 19:41   ` Michael Kelley
2021-03-04 21:35     ` Nuno Das Neves
2021-02-09 13:11   ` Vitaly Kuznetsov
2021-03-04 18:43     ` Nuno Das Neves
2021-03-05  9:18       ` Vitaly Kuznetsov
2021-04-07  0:21         ` Nuno Das Neves
2021-04-07  7:38           ` Vitaly Kuznetsov
2021-04-07 13:43             ` Wei Liu
2021-04-07 14:02               ` Vitaly Kuznetsov
2021-04-07 14:19                 ` Wei Liu
2020-11-21  0:30 ` [RFC PATCH 05/18] virt/mshv: create partition ioctl Nuno Das Neves
2021-02-09 13:15   ` Vitaly Kuznetsov
2021-03-04 18:44     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls Nuno Das Neves
2021-02-08 19:42   ` Michael Kelley
2021-03-04 23:49     ` Nuno Das Neves
2021-03-04 23:58       ` Michael Kelley
2020-11-21  0:30 ` [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall Nuno Das Neves
2021-02-08 19:44   ` Michael Kelley
2021-03-05 21:01     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 08/18] virt/mshv: map and unmap guest memory Nuno Das Neves
2021-02-08 19:45   ` Michael Kelley
2021-03-08 19:14     ` Nuno Das Neves
2021-03-08 19:30       ` Michael Kelley
2020-11-21  0:30 ` [RFC PATCH 09/18] virt/mshv: create vcpu ioctl Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls Nuno Das Neves
2021-02-08 19:47   ` Michael Kelley
2021-03-09  1:39     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages Nuno Das Neves
2021-02-08 19:47   ` Michael Kelley
2021-03-11 19:37     ` Nuno Das Neves
2021-03-11 20:45       ` Michael Kelley
2020-11-21  0:30 ` [RFC PATCH 12/18] virt/mshv: run vp ioctl and isr Nuno Das Neves
2020-11-24 16:15   ` Wei Liu
2020-11-21  0:30 ` [RFC PATCH 13/18] virt/mshv: install intercept ioctl Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 14/18] virt/mshv: assert interrupt ioctl Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls Nuno Das Neves
2021-02-08 19:48   ` Michael Kelley
2021-03-11 23:38     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 16/18] virt/mshv: mmap vp register page Nuno Das Neves
2021-02-08 19:49   ` Michael Kelley
2021-03-25 17:36     ` Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 17/18] virt/mshv: get and set partition property ioctls Nuno Das Neves
2020-11-21  0:30 ` [RFC PATCH 18/18] virt/mshv: Add enlightenment bits to create partition ioctl Nuno Das Neves
2020-11-24 16:18 ` [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Wei Liu
2021-02-08 19:40 ` Michael Kelley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).