virtualization.lists.linux-foundation.org archive mirror
* RE: [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface
       [not found] <1605918637-12192-1-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:40 ` Michael Kelley via Virtualization
       [not found] ` <1605918637-12192-5-git-send-email-nunodasneves@linux.microsoft.com>
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:40 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> This patch series provides a userspace interface for creating and running guest
> virtual machines while running on the Microsoft Hypervisor [0].
> 
> Since managing guest machines can only be done when Linux is the root partition,
> this series depends on the RFC already posted by Wei Liu:
> https://lore.kernel.org/linux-hyperv/20201105165814.29233-1-wei.liu@kernel.org/T/#t
> 
> The first two patches provide some helpers for converting hypervisor status
> codes to linux error codes, and easily printing hypervisor status codes to dmesg
> for debugging.
> 
> Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h are
> split into uapi and non-uapi. The uapi versions contain structures used in both
> the ioctl interface and the kernel.
> 
> The mshv API is introduced in virt/mshv/mshv_main.c. As each interface is
> introduced, documentation is added in Documentation/virt/mshv/api.rst.
> The API is file-descriptor based, like KVM. The entry point is /dev/mshv.
> 
> /dev/mshv ioctls:
> MSHV_REQUEST_VERSION
> MSHV_CREATE_PARTITION
> 
> Partition (vm) ioctls:
> MSHV_MAP_GUEST_MEMORY, MSHV_UNMAP_GUEST_MEMORY
> MSHV_INSTALL_INTERCEPT
> MSHV_ASSERT_INTERRUPT
> MSHV_GET_PARTITION_PROPERTY, MSHV_SET_PARTITION_PROPERTY
> MSHV_CREATE_VP
> 
> Vp (vcpu) ioctls:
> MSHV_GET_VP_REGISTERS, MSHV_SET_VP_REGISTERS
> MSHV_RUN_VP
> MSHV_GET_VP_STATE, MSHV_SET_VP_STATE
> mmap() (register page)
> 
> [0] Hyper-V is more well-known, but it really refers to the whole stack
>     including the hypervisor and other components that run in Windows kernel
>     and userspace.
> 
> Nuno Das Neves (18):
>   x86/hyperv: convert hyperv statuses to linux error codes
>   asm-generic/hyperv: convert hyperv statuses to strings
>   virt/mshv: minimal mshv module (/dev/mshv/)
>   virt/mshv: request version ioctl
>   virt/mshv: create partition ioctl
>   virt/mshv: create, initialize, finalize, delete partition hypercalls
>   virt/mshv: withdraw memory hypercall
>   virt/mshv: map and unmap guest memory
>   virt/mshv: create vcpu ioctl
>   virt/mshv: get and set vcpu registers ioctls
>   virt/mshv: set up synic pages for intercept messages
>   virt/mshv: run vp ioctl and isr
>   virt/mshv: install intercept ioctl
>   virt/mshv: assert interrupt ioctl
>   virt/mshv: get and set vp state ioctls
>   virt/mshv: mmap vp register page
>   virt/mshv: get and set partition property ioctls
>   virt/mshv: Add enlightenment bits to create partition ioctl
> 
>  .../userspace-api/ioctl/ioctl-number.rst      |    2 +
>  Documentation/virt/mshv/api.rst               |  173 ++
>  arch/x86/Kconfig                              |    2 +
>  arch/x86/hyperv/Kconfig                       |   22 +
>  arch/x86/hyperv/Makefile                      |    4 +
>  arch/x86/hyperv/hv_init.c                     |    2 +-
>  arch/x86/hyperv/hv_proc.c                     |   40 +-
>  arch/x86/include/asm/hyperv-tlfs.h            |   44 +-
>  arch/x86/include/asm/mshyperv.h               |    1 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h       | 1312 +++++++++++
>  arch/x86/kernel/cpu/mshyperv.c                |   16 +
>  include/asm-generic/hyperv-tlfs.h             |  324 ++-
>  include/asm-generic/mshyperv.h                |    3 +
>  include/linux/mshv.h                          |   61 +
>  include/uapi/asm-generic/hyperv-tlfs.h        |  160 ++
>  include/uapi/linux/mshv.h                     |  109 +
>  virt/mshv/mshv_main.c                         | 2054 +++++++++++++++++
>  17 files changed, 4178 insertions(+), 151 deletions(-)
>  create mode 100644 Documentation/virt/mshv/api.rst
>  create mode 100644 arch/x86/hyperv/Kconfig
>  create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
>  create mode 100644 include/linux/mshv.h
>  create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
>  create mode 100644 include/uapi/linux/mshv.h
>  create mode 100644 virt/mshv/mshv_main.c
> 
> --
> 2.25.1

I finally made it through reviewing this patch series.  Nice
work -- to you, and to Lillian as the original author of significant
portions!  There's a lot of code, but it is well organized for reviewing
and overall is done well.

I have three general comments:

1) Historically we have very precisely specified the layout of data
structures that are shared with Hyper-V.  Each field has an explicit
width (i.e., u16, u32, u64, etc.) and we have avoided field types that
lack an explicit width (int, enum, bool, etc.).  These patches make
liberal use of enum types in the Hyper-V data structures, and I saw
one occurrence of bool.  While treating enum and bool as 32 bits
works, I have a concern that such specifications aren't consistent
with the original rigor we tried to use.

Related, there are several places where the proper layout depends
on the compiler inserting padding (and not inserting padding in the
wrong places) to achieve the needed alignment.  In my view, we
should be explicitly adding the padding.  A couple years back at
Vitaly Kuznetsov's initiative, we added __packed on all the data
structures to instruct the compiler to not add padding, so as to
prevent padding being added at any inappropriate places.

I started by flagging all of the places where I saw either of these two
issues, but I stopped doing so in some of the later patches, figuring
that you could find the issues across the entire series.
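
As a minimal sketch of the layout style described in point 1, the structure
below uses only fixed-width fields, spells out its padding, and is marked
packed.  The structure name and its fields are hypothetical and are not taken
from the TLFS or from these patches; the stdint types stand in for the
kernel's u8/u32/u64 so the fragment builds outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical structure shared with the hypervisor.  Every field has an
 * explicit width, and the padding that the compiler would otherwise have
 * to insert before 'flags' is written out as a reserved field.
 */
struct hv_example_input {
	uint64_t partition_id;
	uint32_t vp_index;
	uint8_t  target_vtl;
	uint8_t  rsvd_z[3];	/* explicit padding up to the next 8-byte boundary */
	uint64_t flags;
} __attribute__((packed));	/* the kernel spells this __packed */

/* Build-time checks so that any layout drift fails the compile. */
static_assert(offsetof(struct hv_example_input, vp_index) == 8, "vp_index offset");
static_assert(offsetof(struct hv_example_input, flags) == 16, "flags offset");
static_assert(sizeof(struct hv_example_input) == 24, "total size");
```

With the padding written out, the packed attribute changes nothing about the
layout; it only guarantees that the compiler cannot add more.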

2) With all the new hypercalls added with this patch series, and with
Wei Liu's patch series for Linux in the root partition, I've noticed that
we're inconsistent in how the hypercall status is checked.   The
current code works, but is sloppy with types and doesn't always
conform to the letter of the TLFS.  Your new hv_status_to_errno() is
a nice addition, but I think we would be well served by using a 
consistent pattern.  I'm planning to send out a separate email to
the linux-hyperv mailing list with a specific suggestion that we can
all review and comment on.  Once we have agreement, we can do
a cleanup exercise on existing code and on recent patches.

3) I've flagged a few places where the code does not handle configurations
where PAGE_SIZE is other than 4 Kbytes.  While this will never happen
on x86/x64, it could happen on other architectures like ARM64.  Of course,
we may never want to run Linux in the root partition with a page size
other than 4 Kbytes, even on ARM64, so I'm OK with not fixing all these
places.  But I've flagged some places where HV_HYP_PAGE_SIZE would
be more appropriate than PAGE_SIZE (and similar) and I think it makes
sense to fix those now, if just to express that the usage is tied to the
page size used by the Hyper-V interface, and not the guest page size.

I'll also send replies to many of the individual patches with specific
comments embedded.  I have not given "Reviewed-by:" on any of the
patches since they were submitted as RFC, but I can do so for a few
of the patches if that would be helpful.

Michael
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

* RE: [RFC PATCH 04/18] virt/mshv: request version ioctl
       [not found] ` <1605918637-12192-5-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:41   ` Michael Kelley via Virtualization
       [not found]   ` <87y2fxmlmb.fsf@vitty.brq.redhat.com>
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:41 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Reserve ioctl number in userspace-api/ioctl/ioctl-number.rst
> Introduce MSHV_REQUEST_VERSION ioctl.
> Introduce documentation for /dev/mshv in Documentation/virt/mshv
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |  2 +
>  Documentation/virt/mshv/api.rst               | 62 +++++++++++++++++++
>  include/linux/mshv.h                          | 11 ++++
>  include/uapi/linux/mshv.h                     | 19 ++++++
>  virt/mshv/mshv_main.c                         | 49 +++++++++++++++
>  5 files changed, 143 insertions(+)
>  create mode 100644 Documentation/virt/mshv/api.rst
>  create mode 100644 include/linux/mshv.h
>  create mode 100644 include/uapi/linux/mshv.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 55a2d9b2ce33..13a4d3ecafca 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -343,6 +343,8 @@ Code  Seq#    Include File                                           Comments
>  0xB5  00-0F  uapi/linux/rpmsg.h                                      <mailto:linux-remoteproc@vger.kernel.org>
>  0xB6  all    linux/fpga-dfl.h
>  0xB7  all    uapi/linux/remoteproc_cdev.h                            <mailto:linux-remoteproc@vger.kernel.org>
> +0xB8  all    uapi/linux/mshv.h                                       Microsoft Hypervisor root partition APIs
> +                                                                     <mailto:linux-hyperv@vger.kernel.org>
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h
>  0xCA  10-2F  uapi/misc/ocxl.h
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> new file mode 100644
> index 000000000000..82e32de48d03
> --- /dev/null
> +++ b/Documentation/virt/mshv/api.rst
> @@ -0,0 +1,62 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================================
> +Microsoft Hypervisor Root Partition API Documentation
> +=====================================================
> +
> +1. Overview
> +===========
> +
> +This document describes APIs for creating and managing guest virtual machines
> +when running Linux as the root partition on the Microsoft Hypervisor.
> +
> +This API is not yet stable.
> +
> +2. Glossary/Terms
> +=================
> +
> +hv
> +--
> +Short for Hyper-V. This name is used in the kernel to describe interfaces to
> +the Microsoft Hypervisor.
> +
> +mshv
> +----
> +Short for Microsoft Hypervisor. This is the name of the userland API module
> +described in this document.
> +
> +Partition
> +---------
> +A virtual machine running on the Microsoft Hypervisor.
> +
> +Root Partition
> +--------------
> +The partition that is created and assumes control when the machine boots. The
> +root partition can use mshv APIs to create guest partitions.
> +
> +3. API description
> +==================
> +
> +The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
> +
> +Mshv is file descriptor-based, following a similar pattern to KVM.
> +
> +To get a handle to the mshv driver, use open("/dev/mshv").
> +
> +3.1 MSHV_REQUEST_VERSION
> +------------------------
> +:Type: /dev/mshv ioctl
> +:Parameters: pointer to a u32
> +:Returns: 0 on success
> +
> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
> +establish the interface version with the kernel module.
> +
> +The caller should pass the MSHV_VERSION as an argument.
> +
> +The kernel module will check which interface versions it supports and return 0
> +if one of them matches.
> +
> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
> +it is open - this ioctl can only be called once per open.

To clarify the wording:

The caller should pass the requested version as an argument.  If the requested
version is one that the kernel module supports, the ioctl will return 0.  If the
requested version is not supported by the kernel module, the caller may try
the ioctl repeatedly to find a version that the caller supports and that the kernel
module supports.   Once a match is found, the /dev/mshv file descriptor is
'locked' to that version as long as it is open; i.e., the ioctl can succeed
only once per open.
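
The retry behavior in the reworded text could look roughly like this from the
caller's side.  This is only a sketch: mshv_negotiate_version() and the
callback type are hypothetical names, and the callback is where a real caller
would issue ioctl(fd, MSHV_REQUEST_VERSION, &version).  A stand-in callback is
included so the fragment can be exercised without /dev/mshv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical callback: request one candidate version and return 0 if the
 * kernel module accepts it, or a negative value if it does not.
 */
typedef int (*mshv_try_version_fn)(uint32_t version, void *ctx);

/*
 * Try each candidate in the caller's order of preference and stop at the
 * first one that is accepted.  Returns 0 and stores the agreed version in
 * *chosen, or -1 if no candidate is accepted.
 */
static int mshv_negotiate_version(mshv_try_version_fn try_version, void *ctx,
				  const uint32_t *candidates, size_t count,
				  uint32_t *chosen)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (try_version(candidates[i], ctx) == 0) {
			*chosen = candidates[i];
			return 0;
		}
	}
	return -1;
}

/*
 * Stand-in for the kernel side, for demonstration only: accepts exactly the
 * one version that ctx points to.
 */
static int mshv_fake_try_version(uint32_t version, void *ctx)
{
	const uint32_t *supported = ctx;

	return version == *supported ? 0 : -1;
}
```

Because the descriptor is locked on the first success, the loop has to stop
there rather than keep probing.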

> +
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> new file mode 100644
> index 000000000000..a0982fe2c0b8
> --- /dev/null
> +++ b/include/linux/mshv.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_MSHV_H
> +#define _LINUX_MSHV_H
> +
> +/*
> + * Microsoft Hypervisor root partition driver for /dev/mshv
> + */
> +
> +#include <uapi/linux/mshv.h>
> +
> +#endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> new file mode 100644
> index 000000000000..dd30fc2f0a80
> --- /dev/null
> +++ b/include/uapi/linux/mshv.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_MSHV_H
> +#define _UAPI_LINUX_MSHV_H
> +
> +/*
> + * Userspace interface for /dev/mshv
> + * Microsoft Hypervisor root partition APIs
> + */
> +
> +#include <linux/types.h>
> +
> +#define MSHV_VERSION	0x0
> +
> +#define MSHV_IOCTL 0xB8
> +
> +/* mshv device */
> +#define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
> +
> +#endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index ecb9089761fe..62f631f85301 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -11,25 +11,74 @@
>  #include <linux/module.h>
>  #include <linux/fs.h>
>  #include <linux/miscdevice.h>
> +#include <linux/slab.h>
> +#include <linux/mshv.h>
> 
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
> 
> +#define MSHV_INVALID_VERSION	0xFFFFFFFF
> +#define MSHV_CURRENT_VERSION	MSHV_VERSION
> +
> +static u32 supported_versions[] = {
> +	MSHV_CURRENT_VERSION,
> +};

I'm not sure that the concept of "CURRENT_VERSION" makes sense
as a fixed constant.  We have an array of supported versions, any of
which are valid and supported by the kernel module.   The array
should list individual versions.   The current version is 0, which 
might be labelled as MSHV_VERSION_PRERELEASE, or something
similar.  Then later we might have MSHV_VERSION_RELEASE_1,
MSHV_VERSION_RELEASE_2, as needed.  Or maybe the versions
are tied to releases of the Microsoft Hypervisor.
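
A minimal sketch of that shape, assuming the naming suggested above.  Only the
value 0 exists in the patch; MSHV_VERSION_PRERELEASE and the lookup helper are
hypothetical names, and the stdint type stands in for the kernel's u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-version names; only value 0 is defined by the patch. */
#define MSHV_VERSION_PRERELEASE	0x0

/* Each supported version is listed individually. */
static const uint32_t supported_versions[] = {
	MSHV_VERSION_PRERELEASE,
	/* later entries, e.g. MSHV_VERSION_RELEASE_1, would be added here */
};

#define MSHV_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

static_assert(MSHV_ARRAY_SIZE(supported_versions) >= 1,
	      "at least one version must be supported");

/* Returns 1 if the requested version is in the table, 0 otherwise. */
static int mshv_version_supported(uint32_t requested)
{
	size_t i;

	for (i = 0; i < MSHV_ARRAY_SIZE(supported_versions); i++) {
		if (supported_versions[i] == requested)
			return 1;
	}
	return 0;
}
```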

> +
> +static long
> +mshv_ioctl_request_version(u32 *version, void __user *user_arg)
> +{
> +	u32 arg;
> +	int i;
> +
> +	if (copy_from_user(&arg, user_arg, sizeof(arg)))
> +		return -EFAULT;
> +
> +	for (i = 0; i < ARRAY_SIZE(supported_versions); ++i) {
> +		if (supported_versions[i] == arg) {
> +			*version = supported_versions[i];
> +			return 0;
> +		}
> +	}
> +	return -ENOTSUPP;
> +}
> +
>  static long
>  mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> +	u32 *version = (u32 *)filp->private_data;
> +
> +	if (ioctl == MSHV_REQUEST_VERSION) {
> +		/* Version can only be set once */
> +		if (*version != MSHV_INVALID_VERSION)
> +			return -EBADFD;
> +
> +		return mshv_ioctl_request_version(version, (void __user *)arg);
> +	}
> +
> +	/* Version must be set before other ioctls can be called */
> +	if (*version == MSHV_INVALID_VERSION)
> +		return -EBADFD;
> +
> +	/* TODO other ioctls */
> +
>  	return -ENOTTY;
>  }
> 
>  static int
>  mshv_dev_open(struct inode *inode, struct file *filp)
>  {
> +	filp->private_data = kmalloc(sizeof(u32), GFP_KERNEL);
> +	if (!filp->private_data)
> +		return -ENOMEM;
> +	*(u32 *)filp->private_data = MSHV_INVALID_VERSION;
> +
>  	return 0;
>  }
> 
>  static int
>  mshv_dev_release(struct inode *inode, struct file *filp)
>  {
> +	kfree(filp->private_data);
>  	return 0;
>  }
> 
> --
> 2.25.1

* RE: [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
       [not found] ` <1605918637-12192-7-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:42   ` Michael Kelley via Virtualization
       [not found]     ` <e6cc796d-f9ee-5203-95a9-05906f95d3f8@linux.microsoft.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:42 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Add hypercalls for fully setting up and mostly tearing down a guest
> partition.
> The teardown operation will generate an error as the deposited
> memory has not been withdrawn.
> This is fixed in the next patch.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/asm-generic/hyperv-tlfs.h      |  52 +++++++-
>  include/uapi/asm-generic/hyperv-tlfs.h |   1 +
>  include/uapi/linux/mshv.h              |   1 +
>  virt/mshv/mshv_main.c                  | 169 ++++++++++++++++++++++++-
>  4 files changed, 220 insertions(+), 3 deletions(-)
> 
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2ff580780ce4..ab6ae6c164f5 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -142,6 +142,10 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
>  #define HVCALL_SEND_IPI_EX			0x0015
> +#define HVCALL_CREATE_PARTITION			0x0040
> +#define HVCALL_INITIALIZE_PARTITION		0x0041
> +#define HVCALL_FINALIZE_PARTITION		0x0042
> +#define HVCALL_DELETE_PARTITION			0x0043
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>  #define HVCALL_CREATE_VP			0x004e
> @@ -451,7 +455,7 @@ struct hv_get_partition_id {
>  struct hv_deposit_memory {
>  	u64 partition_id;
>  	u64 gpa_page_list[];
> -} __packed;
> +};

Why remove __packed?

> 
>  struct hv_proximity_domain_flags {
>  	u32 proximity_preferred : 1;
> @@ -767,4 +771,50 @@ struct hv_input_unmap_device_interrupt {
>  #define HV_SOURCE_SHADOW_NONE               0x0
>  #define HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE   0x1
> 
> +#define HV_MAKE_COMPATIBILITY_VERSION(major_, minor_)                          \
> +	((u32)((major_) << 8 | (minor_)))
> +
> +enum hv_compatibility_version {
> +	HV_COMPATIBILITY_19_H1 = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X5),
> +	HV_COMPATIBILITY_MANGANESE = HV_MAKE_COMPATIBILITY_VERSION(0X6, 0X7),

Avoid use of "Manganese", which is an internal code name.  I'd suggest calling it
20_H1 instead, which at least has some broader meaning.

> +	HV_COMPATIBILITY_PRERELEASE = HV_MAKE_COMPATIBILITY_VERSION(0XFE, 0X0),
> +	HV_COMPATIBILITY_EXPERIMENT = HV_MAKE_COMPATIBILITY_VERSION(0XFF, 0X0),
> +};
> +
> +union hv_partition_isolation_properties {
> +	u64 as_uint64;
> +	struct {
> +		u64 isolation_type: 5;
> +		u64 rsvd_z: 7;
> +		u64 shared_gpa_boundary_page_number: 52;
> +	};
> +};

Add __packed.

> +
> +/* Non-userspace-visible partition creation flags */
> +#define HV_PARTITION_CREATION_FLAG_EXO_PARTITION                    BIT(8)
> +
> +struct hv_create_partition_in {
> +	u64 flags;
> +	union hv_proximity_domain_info proximity_domain_info;
> +	enum hv_compatibility_version compatibility_version;

An "enum" is a 32 bit value in gcc and I would presume that
Hyper-V is expecting a 64 bit value.  In general, using an enum in a data
structure with exact layout requirements is problematic because the "C"
language doesn't specify how big an enum is.  In such cases, it's better
to use an integer field with an explicit size (like u64) and #defines for
the possible values.
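
For illustration, the enum could be replaced along these lines.  The macro and
the version values come from the patch, and 20_H1 is the name suggested above.
The structure is a hypothetical two-field fragment, and the u64 width of
compatibility_version is only my presumption; the real width has to come from
the hypervisor's definition of the input structure.

```c
#include <assert.h>
#include <stdint.h>

#define HV_MAKE_COMPATIBILITY_VERSION(major_, minor_) \
	((uint32_t)((major_) << 8 | (minor_)))

/* Plain constants instead of an enum whose size the compiler chooses. */
#define HV_COMPATIBILITY_19_H1		HV_MAKE_COMPATIBILITY_VERSION(0x6, 0x5)
#define HV_COMPATIBILITY_20_H1		HV_MAKE_COMPATIBILITY_VERSION(0x6, 0x7)
#define HV_COMPATIBILITY_PRERELEASE	HV_MAKE_COMPATIBILITY_VERSION(0xFE, 0x0)
#define HV_COMPATIBILITY_EXPERIMENT	HV_MAKE_COMPATIBILITY_VERSION(0xFF, 0x0)

/*
 * Hypothetical fragment of a hypercall input structure.  The field width
 * used here (64 bits) is an assumption, not something confirmed by the TLFS.
 */
struct hv_example_create_in {
	uint64_t flags;
	uint64_t compatibility_version;	/* one of HV_COMPATIBILITY_* */
} __attribute__((packed));

static_assert(sizeof(struct hv_example_create_in) == 16, "fixed layout");
```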

> +	struct hv_partition_creation_properties partition_creation_properties;
> +	union hv_partition_isolation_properties isolation_properties;
> +};
> +
> +struct hv_create_partition_out {
> +	u64 partition_id;
> +};
> +
> +struct hv_initialize_partition {
> +	u64 partition_id;
> +};
> +
> +struct hv_finalize_partition {
> +	u64 partition_id;
> +};
> +
> +struct hv_delete_partition {
> +	u64 partition_id;
> +};

All of the above should have __packed for consistency with the other
Hyper-V data structures.

> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> index 140cc0b4f98f..7a858226a9c5 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -6,6 +6,7 @@
>  #define BIT(X)	(1ULL << (X))
>  #endif
> 
> +/* Userspace-visible partition creation flags */

Could this comment be included in the earlier patch with the #defines so
that you avoid the trivial change here?

>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 3788f8bc5caa..4f8da9a6fde2 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -9,6 +9,7 @@
> 
>  #include <linux/types.h>
>  #include <asm/hyperv-tlfs.h>
> +#include <asm-generic/hyperv-tlfs.h>

Similarly, consider adding this #include in the earlier patch so that
this trivial change isn't needed here.

> 
>  #define MSHV_VERSION	0x0
> 
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 4dcbe4907430..c4130a6508e5 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -15,6 +15,7 @@
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/mshv.h>
> +#include <asm/mshyperv.h>
> 
>  MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
> @@ -31,7 +32,6 @@ static struct mshv mshv = {};
>  static void mshv_partition_put(struct mshv_partition *partition);
>  static int mshv_partition_release(struct inode *inode, struct file *filp);
>  static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> -

Spurious whitespace change?

>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> @@ -57,6 +57,143 @@ static struct miscdevice mshv_dev = {
>  	.mode = 600,
>  };
> 
> +#define HV_INIT_PARTITION_DEPOSIT_PAGES 208

A comment about how this value is determined would be useful.
I'm assuming it was determined empirically.

> +
> +static int
> +hv_call_create_partition(
> +		u64 flags,
> +		struct hv_partition_creation_properties creation_properties,
> +		u64 *partition_id)
> +{
> +	struct hv_create_partition_in *input;
> +	struct hv_create_partition_out *output;
> +	int status;
> +	int ret;
> +	unsigned long irq_flags;
> +	int i;
> +
> +	do {
> +		local_irq_save(irq_flags);
> +		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input->flags = flags;
> +		input->proximity_domain_info.as_uint64 = 0;
> +		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
> +		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
> +			input->partition_creation_properties
> +				.disabled_processor_features.as_uint64[i] = 0;
> +		input->partition_creation_properties
> +			.disabled_processor_xsave_features.as_uint64 = 0;
> +		input->isolation_properties.as_uint64 = 0;
> +
> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> +					 input, output);

hv_do_hypercall returns a u64, which should then be masked with
HV_HYPERCALL_RESULT_MASK before checking the result.
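
A small sketch of the masking, assuming the constant values that I understand
include/asm-generic/hyperv-tlfs.h to define (result code in bits 15:0,
completed rep count in bits 43:32).  The two helper names are hypothetical,
and uint64_t stands in for the kernel's u64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the kernel header definitions. */
#define HV_HYPERCALL_RESULT_MASK	0xFFFFULL		/* bits 15:0 */
#define HV_HYPERCALL_REP_COMP_OFFSET	32
#define HV_HYPERCALL_REP_COMP_MASK	(0xFFFULL << 32)	/* bits 43:32 */

#define HV_STATUS_SUCCESS		0
#define HV_STATUS_INSUFFICIENT_MEMORY	11

/* Hypothetical helper: extract the status code from the raw return value. */
static inline uint16_t hv_example_result(uint64_t raw)
{
	return (uint16_t)(raw & HV_HYPERCALL_RESULT_MASK);
}

/* Hypothetical helper: extract the completed rep count. */
static inline uint16_t hv_example_reps_completed(uint64_t raw)
{
	return (uint16_t)((raw & HV_HYPERCALL_REP_COMP_MASK) >>
			  HV_HYPERCALL_REP_COMP_OFFSET);
}
```

Comparing the raw value directly against HV_STATUS_* only works while every
bit outside the result field happens to be zero.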

> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status == HV_STATUS_SUCCESS)
> +				*partition_id = output->partition_id;
> +			else
> +				pr_err("%s: %s\n",
> +				       __func__, hv_status_to_string(status));
> +			local_irq_restore(irq_flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(irq_flags);
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    hv_current_partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_initialize_partition(u64 partition_id)
> +{
> +	struct hv_initialize_partition *input;
> +	int status;
> +	int ret;
> +	unsigned long flags;
> +
> +	ret = hv_call_deposit_pages(
> +				NUMA_NO_NODE,
> +				partition_id,
> +				HV_INIT_PARTITION_DEPOSIT_PAGES);
> +	if (ret)
> +		return ret;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_initialize_partition *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		input->partition_id = partition_id;
> +
> +		status = hv_do_hypercall(
> +				HVCALL_INITIALIZE_PARTITION,
> +				input, NULL);

FWIW, since the input is a single 64 bit value, and there's no output,
this could use hv_do_fast_hypercall8() instead, and avoid
needing to use the input arg page and the irq save/restore.  Would have
to check that the particular hypercall supports the "fast" version.

> +		local_irq_restore(flags);
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {

Same comment about status being u64 and masking.

> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n",
> +				       __func__, hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_finalize_partition(u64 partition_id)
> +{
> +	struct hv_finalize_partition *input;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = (struct hv_finalize_partition *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input->partition_id = partition_id;
> +	status = hv_do_hypercall(
> +			HVCALL_FINALIZE_PARTITION,
> +			input, NULL);
> +	local_irq_restore(flags);


Same comment about hv_do_fast_hypercall8() and about status
being a u64 and masking.

> +
> +	if (status != HV_STATUS_SUCCESS)
> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static int
> +hv_call_delete_partition(u64 partition_id)
> +{
> +	struct hv_delete_partition *input;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = (struct hv_delete_partition *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input->partition_id = partition_id;
> +	status = hv_do_hypercall(
> +			HVCALL_DELETE_PARTITION,
> +			input, NULL);
> +	local_irq_restore(flags);

Same comments about hv_do_fast_hypercall8(), and
the status and masking.

> +
> +	if (status != HV_STATUS_SUCCESS)
> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
> +
> +	return -hv_status_to_errno(status);
> +}
> +
>  static long
>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> @@ -86,6 +223,17 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> 
> +	/*
> +	 * There are no remaining references to the partition or vps,
> +	 * so the remaining cleanup can be lockless
> +	 */
> +
> +	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
> +	hv_call_finalize_partition(partition->id);
> +	/* TODO: Withdraw and free all pages we deposited */
> +
> +	hv_call_delete_partition(partition->id);
> +
>  	kfree(partition);
>  }
> 
> @@ -146,6 +294,9 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  	if (copy_from_user(&args, user_arg, sizeof(args)))
>  		return -EFAULT;
> 
> +	/* Only support EXO partitions */
> +	args.flags |= HV_PARTITION_CREATION_FLAG_EXO_PARTITION;
> +
>  	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
>  	if (!partition)
>  		return -ENOMEM;
> @@ -156,11 +307,21 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  		goto free_partition;
>  	}
> 
> +	ret = hv_call_create_partition(args.flags,
> +				       args.partition_creation_properties,
> +				       &partition->id);
> +	if (ret)
> +		goto put_fd;
> +
> +	ret = hv_call_initialize_partition(partition->id);
> +	if (ret)
> +		goto delete_partition;
> +
>  	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
>  				  partition, O_RDWR);
>  	if (IS_ERR(file)) {
>  		ret = PTR_ERR(file);
> -		goto put_fd;
> +		goto finalize_partition;
>  	}
>  	refcount_set(&partition->ref_count, 1);
> 
> @@ -174,6 +335,10 @@ mshv_ioctl_create_partition(void __user *user_arg)
> 
>  release_file:
>  	file->f_op->release(file->f_inode, file);
> +finalize_partition:
> +	hv_call_finalize_partition(partition->id);
> +delete_partition:
> +	hv_call_delete_partition(partition->id);
>  put_fd:
>  	put_unused_fd(fd);
>  free_partition:
> --
> 2.25.1

* RE: [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall
       [not found] ` <1605918637-12192-8-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:44   ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:44 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Withdraw the memory from a finalized partition and free the pages.
> The partition is now cleaned up correctly when the fd is released.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  include/asm-generic/hyperv-tlfs.h | 10 ++++++
>  virt/mshv/mshv_main.c             | 54 ++++++++++++++++++++++++++++++-
>  2 files changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index ab6ae6c164f5..2a49503b7396 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -148,6 +148,7 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_DELETE_PARTITION			0x0043
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
> +#define HVCALL_WITHDRAW_MEMORY			0x0049
>  #define HVCALL_CREATE_VP			0x004e
>  #define HVCALL_GET_VP_REGISTERS			0x0050
>  #define HVCALL_SET_VP_REGISTERS			0x0051
> @@ -472,6 +473,15 @@ union hv_proximity_domain_info {
>  	u64 as_uint64;
>  };
> 
> +struct hv_withdraw_memory_in {
> +	u64 partition_id;
> +	union hv_proximity_domain_info proximity_domain_info;
> +};
> +
> +struct hv_withdraw_memory_out {
> +	u64 gpa_page_list[0];

For a variable size array, the Linux kernel community has an effort
underway to replace occurrences of [0] and [1] with just [].  I think
[] can be used here.
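
As a rough userspace illustration of the [] form (struct page_list, its
count field and page_list_alloc() are hypothetical names for this sketch, not
the hypercall output layout; ISO C wants at least one named member ahead of a
flexible array member, hence the count field):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout for illustration; not the hypercall output format. */
struct page_list {
	uint64_t count;   /* number of valid entries in pfns[] */
	uint64_t pfns[];  /* C99 flexible array member, replaces [0] / [1] */
};

/* The flexible array member contributes no storage to sizeof. */
static_assert(sizeof(struct page_list) == sizeof(uint64_t),
	      "[] adds no storage");

/* Allocate a zeroed list with room for n entries; NULL on failure. */
static struct page_list *page_list_alloc(size_t n)
{
	struct page_list *pl;

	/* Reject sizes whose byte count would overflow size_t. */
	if (n > (SIZE_MAX - sizeof(*pl)) / sizeof(pl->pfns[0]))
		return NULL;

	pl = calloc(1, sizeof(*pl) + n * sizeof(pl->pfns[0]));
	if (pl)
		pl->count = n;
	return pl;
}
```

The caller sizes the allocation explicitly, so the compiler never has to
pretend the array has zero or one element.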

> +};
> +

Add __packed to the above two structs.

>  struct hv_lp_startup_status {
>  	u64 hv_status;
>  	u64 substatus1;
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index c4130a6508e5..162a1bb42a4a 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/mm.h>
>  #include <linux/mshv.h>
>  #include <asm/mshyperv.h>
> 
> @@ -57,8 +58,58 @@ static struct miscdevice mshv_dev = {
>  	.mode = 600,
>  };
> 
> +#define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))

Use HV_HYP_PAGE_SIZE so that we're explicit that the dependency
is on the page size used by Hyper-V, which might be different from the
guest page size (at least on architectures like ARM64).
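
A small sketch of why the distinction matters, assuming the kernel's 4 Kbyte
HV_HYP_PAGE_SIZE and a 64 Kbyte guest page size as one possible ARM64
configuration (pfn_entries() is a made-up helper for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HV_HYP_PAGE_SIZE 4096u  /* value of the kernel macro, 4 KiB */

static_assert(HV_HYP_PAGE_SIZE / sizeof(uint64_t) == 512,
	      "one Hyper-V page holds 512 PFN entries");

/* Number of u64 PFN entries that fit in a buffer of buf_size bytes. */
static size_t pfn_entries(size_t buf_size)
{
	return buf_size / sizeof(uint64_t);
}
```

With a 64 Kbyte PAGE_SIZE the batch would come out as 8192 entries, far more
than the 512 that fit in a single Hyper-V page.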

>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> 
> +static int
> +hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> +{
> +	struct hv_withdraw_memory_in *input_page;
> +	struct hv_withdraw_memory_out *output_page;
> +	u16 completed;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int status;
> +	int i;
> +	unsigned long flags;
> +
> +	while (remaining) {
> +		local_irq_save(flags);
> +
> +		input_page = (struct hv_withdraw_memory_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output_page = (struct hv_withdraw_memory_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input_page->partition_id = partition_id;
> +		input_page->proximity_domain_info.as_uint64 = 0;
> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_WITHDRAW_MEMORY,
> +			min(remaining, HV_WITHDRAW_BATCH_SIZE), 0, input_page,
> +			output_page);
> +
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +
> +		for (i = 0; i < completed; i++)
> +			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
> +
> +		local_irq_restore(flags);

There seems to be some risk of keeping interrupts disabled for too long, since
we could be calling __free_page() up to 512 times per batch.  It might be better
for this function to allocate its own page to use as the output page, so that
interrupts can be re-enabled immediately after the hypercall completes.  The
__free_page() loop can then run with interrupts enabled.  We have the per-cpu
input and output pages to avoid the overhead of allocating/freeing pages for
each hypercall, but in this case a private output page might be warranted.

> +
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			if (status != HV_STATUS_NO_RESOURCES)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			break;
> +		}
> +
> +		remaining -= completed;
> +	}
> +
> +	return -hv_status_to_errno(status);
> +}
> +
>  static int
>  hv_call_create_partition(
>  		u64 flags,
> @@ -230,7 +281,8 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
>  	hv_call_finalize_partition(partition->id);
> -	/* TODO: Withdraw and free all pages we deposited */
> +	/* Withdraw and free all pages we deposited */
> +	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->id);
> 
>  	hv_call_delete_partition(partition->id);
> 
> --
> 2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
       [not found] ` <1605918637-12192-9-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:45   ` Michael Kelley via Virtualization
       [not found]     ` <d63330fa-de83-85de-c8ec-74cc90d680e3@linux.microsoft.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:45 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Introduce ioctls for mapping and unmapping regions of guest memory.
> 
> Uses a table of memory 'slots' similar to KVM, but the slot
> number is not visible to userspace.
> 
> For now, this simple implementation requires each new mapping to be
> disjoint - the underlying hypercalls have no such restriction, and
> implicitly overwrite any mappings on the pages in the specified regions.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst        |  15 ++
>  include/asm-generic/hyperv-tlfs.h      |  15 ++
>  include/linux/mshv.h                   |  14 ++
>  include/uapi/asm-generic/hyperv-tlfs.h |   9 +
>  include/uapi/linux/mshv.h              |  15 ++
>  virt/mshv/mshv_main.c                  | 322 ++++++++++++++++++++++++-
>  6 files changed, 388 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index ce651a1738e0..530efc29d354 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -72,3 +72,18 @@ it is open - this ioctl can only be called once per open.
>  This ioctl creates a guest partition, returning a file descriptor to use as a
>  handle for partition ioctls.
> 
> +3.3 MSHV_MAP_GUEST_MEMORY and MSHV_UNMAP_GUEST_MEMORY
> +-----------------------------------------------------
> +:Type: partition ioctl
> +:Parameters: struct mshv_user_mem_region
> +:Returns: 0 on success
> +
> +Create a mapping from a region of process memory to a region of physical memory
> +in a guest partition.

Just to be super explicit:

Create a mapping from memory in the user space of the calling process (running
in the root partition) to a region of guest physical memory in a guest partition.

> +
> +Mappings must be disjoint in process address space and guest address space.
> +
> +Note: In the current implementation, this memory is pinned to stop the pages
> +being moved by linux and subsequently clobbered by the hypervisor. So the region
> +is backed by physical memory.

Again to be super explicit:

Note: In the current implementation, this memory is pinned to real physical
memory to stop the pages being moved by Linux in the root partition,
and subsequently being clobbered by the hypervisor.  So the region is backed
by real physical memory.

> +
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2a49503b7396..6e5072e29897 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -149,6 +149,8 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_GET_PARTITION_ID			0x0046
>  #define HVCALL_DEPOSIT_MEMORY			0x0048
>  #define HVCALL_WITHDRAW_MEMORY			0x0049
> +#define HVCALL_MAP_GPA_PAGES			0x004b
> +#define HVCALL_UNMAP_GPA_PAGES			0x004c
>  #define HVCALL_CREATE_VP			0x004e
>  #define HVCALL_GET_VP_REGISTERS			0x0050
>  #define HVCALL_SET_VP_REGISTERS			0x0051
> @@ -827,4 +829,17 @@ struct hv_delete_partition {
>  	u64 partition_id;
>  };
> 
> +struct hv_map_gpa_pages {
> +	u64 target_partition_id;
> +	u64 target_gpa_base;
> +	u32 map_flags;

Is there a reserved 32 bit field here?  Hyper-V always aligns
things on 64 bit boundaries.

> +	u64 source_gpa_page_list[];
> +};
> +
> +struct hv_unmap_gpa_pages {
> +	u64 target_partition_id;
> +	u64 target_gpa_base;
> +	u32 unmap_flags;

Is there a reserved 32 bit field here?  Hyper-V always aligns
things on 64 bit boundaries.

> +};

Add __packed to the above two structs after sorting out
the alignment issues.
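
To show how the two points interact, here is a userspace sketch. The reserved
field is an assumption on my part (the question above is still open), the
struct names are invented, and __attribute__((packed)) is the GCC/Clang
spelling that the kernel's __packed expands to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* With an explicit (assumed) reserved field the page list stays 8-byte aligned. */
struct map_gpa_hdr_padded {
	uint64_t target_partition_id;
	uint64_t target_gpa_base;
	uint32_t map_flags;
	uint32_t reserved;  /* assumption: padding that must be zero */
	uint64_t source_gpa_page_list[];
} __attribute__((packed));

/* Without it, packing pulls the page list forward to offset 20. */
struct map_gpa_hdr_unpadded {
	uint64_t target_partition_id;
	uint64_t target_gpa_base;
	uint32_t map_flags;
	uint64_t source_gpa_page_list[];
} __attribute__((packed));

static_assert(offsetof(struct map_gpa_hdr_padded, source_gpa_page_list) == 24,
	      "page list starts on a 64 bit boundary");
static_assert(offsetof(struct map_gpa_hdr_unpadded, source_gpa_page_list) == 20,
	      "packing without padding misaligns the page list");
```

So adding __packed before the padding question is settled could silently move
the page list, which is why the order of the two fixes matters.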

> +
>  #endif
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index fc4f35089b2c..91a742f37440 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -7,13 +7,27 @@
>   */
> 
>  #include <linux/spinlock.h>
> +#include <linux/mutex.h>
>  #include <uapi/linux/mshv.h>
> 
>  #define MSHV_MAX_PARTITIONS		128
> +#define MSHV_MAX_MEM_REGIONS		64
> +
> +struct mshv_mem_region {
> +	u64 size; /* bytes */
> +	u64 guest_pfn;
> +	u64 userspace_addr; /* start of the userspace allocated memory */
> +	struct page **pages;
> +};
> 
>  struct mshv_partition {
>  	u64 id;
>  	refcount_t ref_count;
> +	struct mutex mutex;
> +	struct {
> +		u32 count;
> +		struct mshv_mem_region slots[MSHV_MAX_MEM_REGIONS];
> +	} regions;
>  };
> 
>  struct mshv {
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> index 7a858226a9c5..e7b09b9f00de 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -12,4 +12,13 @@
>  #define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
>  #define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> 
> +/* HV Map GPA (Guest Physical Address) Flags */
> +#define HV_MAP_GPA_PERMISSIONS_NONE     0x0
> +#define HV_MAP_GPA_READABLE             0x1
> +#define HV_MAP_GPA_WRITABLE             0x2
> +#define HV_MAP_GPA_KERNEL_EXECUTABLE    0x4
> +#define HV_MAP_GPA_USER_EXECUTABLE      0x8
> +#define HV_MAP_GPA_EXECUTABLE           0xC
> +#define HV_MAP_GPA_PERMISSIONS_MASK     0xF
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 4f8da9a6fde2..47be03ef4e86 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -18,10 +18,25 @@ struct mshv_create_partition {
>  	struct hv_partition_creation_properties partition_creation_properties;
>  };
> 
> +/*
> + * Mappings can't overlap in GPA space or userspace
> + * To unmap, these fields must match an existing mapping
> + */
> +struct mshv_user_mem_region {
> +	__u64 size;		/* bytes */
> +	__u64 guest_pfn;
> +	__u64 userspace_addr;	/* start of the userspace allocated memory */
> +	__u32 flags;		/* ignored on unmap */
> +};
> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
>  #define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
> 
> +/* partition device */
> +#define MSHV_MAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x02, struct mshv_user_mem_region)
> +#define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 162a1bb42a4a..ce480598e67f 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -60,6 +60,10 @@ static struct miscdevice mshv_dev = {
> 
>  #define HV_WITHDRAW_BATCH_SIZE	(PAGE_SIZE / sizeof(u64))
>  #define HV_INIT_PARTITION_DEPOSIT_PAGES 208
> +#define HV_MAP_GPA_MASK		(0x0000000FFFFFFFFFULL)
> +#define HV_MAP_GPA_BATCH_SIZE	\
> +		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))

Hmmm. Shouldn't this be:

	((HV_HYP_PAGE_SIZE - sizeof(struct hv_map_gpa_pages))/sizeof(u64))
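
A quick userspace check of the two expressions, assuming a typical 64-bit ABI
where the header below occupies 24 bytes (the struct is copied from the patch
with stdint types; the BATCH_* names are made up for the comparison):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HV_HYP_PAGE_SIZE 4096u  /* value of the kernel macro, 4 KiB */

/* Input header from the patch, with userspace integer types. */
struct hv_map_gpa_pages {
	uint64_t target_partition_id;
	uint64_t target_gpa_base;
	uint32_t map_flags;
	uint64_t source_gpa_page_list[];
};

/* As posted: divides by the header size instead of subtracting it. */
#define BATCH_AS_POSTED \
	(HV_HYP_PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(uint64_t))

/* As suggested: entries that fit in the page after the header. */
#define BATCH_SUGGESTED \
	((HV_HYP_PAGE_SIZE - sizeof(struct hv_map_gpa_pages)) / sizeof(uint64_t))

static_assert(BATCH_SUGGESTED > BATCH_AS_POSTED,
	      "the posted formula under-uses the input page");
```

Under that assumption the posted formula gives 4096 / 24 / 8 = 21 reps per
hypercall, while the suggested one gives (4096 - 24) / 8 = 509.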


> +#define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
> 
>  static int
>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
>  	return -hv_status_to_errno(status);
>  }
> 
> +static int
> +hv_call_map_gpa_pages(u64 partition_id,
> +		      u64 gpa_target,
> +		      u64 page_count, u32 flags,
> +		      struct page **pages)
> +{
> +	struct hv_map_gpa_pages *input_page;
> +	int status;
> +	int i;
> +	struct page **p;
> +	u32 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = page_count;
> +	int rep_count;
> +	unsigned long irq_flags;
> +	int ret = 0;
> +
> +	while (remaining) {
> +
> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> +
> +		local_irq_save(irq_flags);
> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +
> +		input_page->target_partition_id = partition_id;
> +		input_page->target_gpa_base = gpa_target;
> +		input_page->map_flags = flags;
> +
> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> +			input_page->source_gpa_page_list[i] =
> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;

The masking seems a bit weird.  The mask allows for up to 64G page frames,
which with 4 Kbyte pages is 256 Tbytes of total physical memory, and that is
probably the current Hyper-V limit on memory size (48 bit physical address
space, though 52 bit physical address spaces are coming).  So the masking
shouldn't ever be doing anything.  And if it were doing something, that
probably should be treated as an error rather than simply dropping the high
bits.
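
The arithmetic and the error-instead-of-truncate idea can be sketched in
userspace as follows (checked_hv_pfn() is a hypothetical helper, and -ERANGE
is just one plausible choice of error code):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define HV_MAP_GPA_MASK 0x0000000FFFFFFFFFULL  /* 36 bits of page frame number */

/* 2^36 frames of 4 KiB each cover 2^48 bytes, i.e. 256 TiB. */
static_assert(((HV_MAP_GPA_MASK + 1) << 12) == (1ULL << 48),
	      "mask covers a 48 bit physical address space");

/*
 * Hypothetical helper: pass the PFN through unchanged if it fits in the
 * mask, and report an error instead of silently dropping high bits.
 */
static int checked_hv_pfn(uint64_t pfn, uint64_t *out)
{
	if (pfn & ~HV_MAP_GPA_MASK)
		return -ERANGE;
	*out = pfn;
	return 0;
}
```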

Note that this code does not handle the case where PAGE_SIZE !=
HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
page size other than 4K.
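
If that case ever did need handling, one possible shape is to expand each root
page frame into the Hyper-V page frames it covers. This is only a sketch:
expand_pfn() and its parameters are invented here, and nothing like it exists
in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12

static_assert((1u << HV_HYP_PAGE_SHIFT) == 4096, "Hyper-V pages are 4 KiB");

/*
 * Write the Hyper-V PFNs covering root page frame pfn into dst.
 * page_shift is the root partition's page shift. Returns the number of
 * entries written, or 0 if page_shift is too small or dst lacks room.
 */
static size_t expand_pfn(uint64_t pfn, unsigned int page_shift,
			 uint64_t *dst, size_t cap)
{
	unsigned int delta;
	size_t per_page, i;

	if (page_shift < HV_HYP_PAGE_SHIFT)
		return 0;

	delta = page_shift - HV_HYP_PAGE_SHIFT;
	per_page = (size_t)1 << delta;
	if (per_page > cap)
		return 0;

	for (i = 0; i < per_page; i++)
		dst[i] = (pfn << delta) + i;
	return per_page;
}
```

With 4 Kbyte root pages this degenerates to a one-to-one copy; with 64 Kbyte
pages each root frame becomes 16 consecutive Hyper-V frames.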

> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> +		local_irq_restore(irq_flags);
> +
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +				HV_HYPERCALL_REP_COMP_OFFSET;
> +
> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    partition_id, 256);

Why add 256 pages here?  I'm just contrasting with other places that add
1 page at a time.  Maybe add a comment to explain the choice.

> +			if (ret)
> +				break;
> +		} else if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %llu out of %llu, %s\n",
> +			       __func__,
> +			       page_count - remaining, page_count,
> +			       hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +
> +		pages += completed;
> +		remaining -= completed;
> +		gpa_target += completed;
> +	}
> +
> +	if (ret && completed) {

Is the above the right test?  "completed" could be zero from the most recent
iteration even though an earlier iteration succeeded, in which case the
mapping has still partially succeeded.  I think this needs to check whether
remaining equals page_count.
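
To make the difference concrete, here is a small userspace model of the loop.
It is illustrative only: struct batch, struct verdict and run_batches() are
invented for this sketch, and a failed batch skips the "remaining -= completed"
update just as the patch does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of one simulated rep hypercall. */
struct batch {
	uint32_t completed;  /* reps the hypervisor finished */
	bool ok;             /* false models a non-success status */
};

struct verdict {
	bool posted;     /* "ret && completed" as in the patch */
	bool suggested;  /* "ret && remaining != page_count" */
	bool either;     /* both conditions combined */
};

static struct verdict run_batches(const struct batch *b, size_t n,
				  uint64_t page_count)
{
	uint64_t remaining = page_count;
	uint32_t completed = 0;
	bool failed = false;
	size_t i;

	for (i = 0; i < n && remaining; i++) {
		assert(b[i].completed <= remaining);
		completed = b[i].completed;
		if (!b[i].ok) {
			failed = true;
			break;  /* remaining is not updated on failure */
		}
		remaining -= completed;
	}

	return (struct verdict){
		.posted = failed && completed,
		.suggested = failed && remaining != page_count,
		.either = failed && (completed || remaining != page_count),
	};
}
```

In this model the remaining != page_count test catches an earlier successful
batch, but by itself it would miss a first batch that fails after completing
a few reps, so checking both conditions may be the safer option.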

> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> +		       __func__);
> +		ret = -EBADFD;
> +	}
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_unmap_gpa_pages(u64 partition_id,
> +			u64 gpa_target,
> +			u64 page_count, u32 flags)
> +{
> +	struct hv_unmap_gpa_pages *input_page;
> +	int status;
> +	int ret = 0;
> +	u32 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = page_count;
> +	int rep_count;
> +	unsigned long irq_flags;
> +
> +	local_irq_save(irq_flags);
> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input_page->target_partition_id = partition_id;
> +	input_page->target_gpa_base = gpa_target;
> +	input_page->unmap_flags = flags;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> +		hypercall_status = hv_do_rep_hypercall(
> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);

Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.

> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +				HV_HYPERCALL_REP_COMP_OFFSET;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %llu out of %llu, %s\n",
> +			       __func__,
> +			       page_count - remaining, page_count,
> +			       hv_status_to_string(status));
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +
> +		remaining -= completed;
> +		gpa_target += completed;
> +		input_page->target_gpa_base = gpa_target;
> +	}
> +	local_irq_restore(irq_flags);

I have some concern about holding interrupts disabled for this long.

> +
> +	if (ret && completed) {

Same comment as for the corresponding test in hv_call_map_gpa_pages().

> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> +		       __func__);
> +		ret = -EBADFD;
> +	}
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> +				struct mshv_user_mem_region __user *user_mem)
> +{
> +	struct mshv_user_mem_region mem;
> +	struct mshv_mem_region *region;
> +	int completed;
> +	unsigned long remaining, batch_size;
> +	int i;
> +	struct page **pages;
> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> +	u64 region_page_count, region_user_start, region_user_end;
> +	u64 region_gpfn_start, region_gpfn_end;
> +	long ret = 0;
> +
> +	/* Check we have enough slots*/
> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> +		pr_err("%s: not enough memory region slots\n", __func__);
> +		return -ENOSPC;
> +	}
> +
> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> +		return -EFAULT;
> +
> +	if (!mem.size ||
> +	    mem.size & (PAGE_SIZE - 1) ||
> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||

There's a PAGE_ALIGNED macro that expresses exactly what
each of the previous two tests is doing.
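
Roughly what the checks would look like, as a userspace sketch. The
SKETCH_PAGE_SIZE and SKETCH_PAGE_ALIGNED() names are stand-ins for the
kernel's PAGE_SIZE and PAGE_ALIGNED(), a 4 Kbyte page is assumed, and
region_args_valid() is a made-up wrapper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for the example */

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* Userspace stand-in for the kernel's PAGE_ALIGNED(). */
#define SKETCH_PAGE_ALIGNED(x) \
	(((uintptr_t)(x) & (SKETCH_PAGE_SIZE - 1)) == 0)

/* Size must be non-zero, and size and address must both be page aligned. */
static bool region_args_valid(uint64_t size, uint64_t userspace_addr)
{
	return size &&
	       SKETCH_PAGE_ALIGNED(size) &&
	       SKETCH_PAGE_ALIGNED(userspace_addr);
}
```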

> +	    !access_ok(mem.userspace_addr, mem.size))
> +		return -EINVAL;
> +
> +	/* Reject overlapping regions */
> +	page_count = mem.size >> PAGE_SHIFT;
> +	user_start = mem.userspace_addr;
> +	user_end = mem.userspace_addr + mem.size;
> +	gpfn_start = mem.guest_pfn;
> +	gpfn_end = mem.guest_pfn + page_count;
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		region = &partition->regions.slots[i];
> +		if (!region->size)
> +			continue;
> +		region_page_count = region->size >> PAGE_SHIFT;
> +		region_user_start = region->userspace_addr;
> +		region_user_end = region->userspace_addr + region->size;
> +		region_gpfn_start = region->guest_pfn;
> +		region_gpfn_end = region->guest_pfn + region_page_count;
> +
> +		if (!(
> +		     (user_end <= region_user_start) ||
> +		     (region_user_end <= user_start))) {
> +			return -EEXIST;
> +		}
> +		if (!(
> +		     (gpfn_end <= region_gpfn_start) ||
> +		     (region_gpfn_end <= gpfn_start))) {
> +			return -EEXIST;

You could apply De Morgan's theorem to the conditions
in each "if" statement and get rid of the "!".  That might make
these slightly easier to understand, but I have no strong
preference.
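
For reference, the transformed condition for half-open ranges would look
something like this (ranges_overlap() is a hypothetical helper, not something
in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Half-open ranges [a_start, a_end) and [b_start, b_end) overlap when each
 * one starts before the other ends. This is
 * !(a_end <= b_start || b_end <= a_start) with De Morgan applied.
 */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
			   uint64_t b_start, uint64_t b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_start < b_end && b_start < a_end;
}
```

The same helper could then serve both the userspace-address check and the
guest PFN check.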

> +		}
> +	}
> +
> +	/* Pin the userspace pages */
> +	pages = vzalloc(sizeof(struct page *) * page_count);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	remaining = page_count;
> +	while (remaining) {
> +		/*
> +		 * We need to batch this, as pin_user_pages_fast with the
> +		 * FOLL_LONGTERM flag does a big temporary allocation
> +		 * of contiguous memory
> +		 */
> +		batch_size = min(remaining, PIN_PAGES_BATCH_SIZE);
> +		completed = pin_user_pages_fast(
> +				mem.userspace_addr +
> +					(page_count - remaining) * PAGE_SIZE,
> +				batch_size,
> +				FOLL_WRITE | FOLL_LONGTERM,
> +				&pages[page_count - remaining]);
> +		if (completed < 0) {
> +			pr_err("%s: failed to pin user pages error %i\n",
> +			       __func__,
> +			       completed);
> +			ret = completed;
> +			goto err_unpin_pages;
> +		}
> +		remaining -= completed;
> +	}
> +
> +	/* Map the pages to GPA pages */
> +	ret = hv_call_map_gpa_pages(partition->id, mem.guest_pfn,
> +				    page_count, mem.flags, pages);
> +	if (ret)
> +		goto err_unpin_pages;
> +
> +	/* Install the new region */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		if (!partition->regions.slots[i].size) {
> +			region = &partition->regions.slots[i];
> +			break;
> +		}
> +	}
> +	region->pages = pages;
> +	region->size = mem.size;
> +	region->guest_pfn = mem.guest_pfn;
> +	region->userspace_addr = mem.userspace_addr;
> +
> +	partition->regions.count++;
> +
> +	return 0;
> +
> +err_unpin_pages:
> +	unpin_user_pages(pages, page_count - remaining);
> +	vfree(pages);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_partition_ioctl_unmap_memory(struct mshv_partition *partition,
> +				  struct mshv_user_mem_region __user *user_mem)
> +{
> +	struct mshv_user_mem_region mem;
> +	struct mshv_mem_region *region_ptr;
> +	int i;
> +	u64 page_count;
> +	long ret;
> +
> +	if (!partition->regions.count)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> +		return -EFAULT;
> +
> +	/* Find matching region */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		if (!partition->regions.slots[i].size)
> +			continue;
> +		region_ptr = &partition->regions.slots[i];
> +		if (region_ptr->userspace_addr == mem.userspace_addr &&
> +		    region_ptr->size == mem.size &&
> +		    region_ptr->guest_pfn == mem.guest_pfn)
> +			break;
> +	}
> +
> +	if (i == MSHV_MAX_MEM_REGIONS)
> +		return -EINVAL;
> +
> +	page_count = region_ptr->size >> PAGE_SHIFT;
> +	ret = hv_call_unmap_gpa_pages(partition->id, region_ptr->guest_pfn,
> +				      page_count, 0);
> +	if (ret)
> +		return ret;
> +
> +	unpin_user_pages(region_ptr->pages, page_count);
> +	vfree(region_ptr->pages);
> +	memset(region_ptr, 0, sizeof(*region_ptr));
> +	partition->regions.count--;
> +
> +	return 0;
> +}
> +
>  static long
>  mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> -	return -ENOTTY;
> +	struct mshv_partition *partition = filp->private_data;
> +	long ret;
> +
> +	if (mutex_lock_killable(&partition->mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_MAP_GUEST_MEMORY:
> +		ret = mshv_partition_ioctl_map_memory(partition,
> +							(void __user *)arg);
> +		break;
> +	case MSHV_UNMAP_GUEST_MEMORY:
> +		ret = mshv_partition_ioctl_unmap_memory(partition,
> +							(void __user *)arg);
> +		break;
> +	default:
> +		ret = -ENOTTY;
> +	}
> +
> +	mutex_unlock(&partition->mutex);
> +	return ret;
>  }
> 
>  static void
>  destroy_partition(struct mshv_partition *partition)
>  {
> -	unsigned long flags;
> +	unsigned long flags, page_count;
> +	struct mshv_mem_region *region;
>  	int i;
> 
>  	/* Remove from list of partitions */
> @@ -286,6 +592,16 @@ destroy_partition(struct mshv_partition *partition)
> 
>  	hv_call_delete_partition(partition->id);
> 
> +	/* Remove regions and unpin the pages */
> +	for (i = 0; i < MSHV_MAX_MEM_REGIONS; ++i) {
> +		region = &partition->regions.slots[i];
> +		if (!region->size)
> +			continue;
> +		page_count = region->size >> PAGE_SHIFT;
> +		unpin_user_pages(region->pages, page_count);
> +		vfree(region->pages);
> +	}
> +
>  	kfree(partition);
>  }
> 
> @@ -353,6 +669,8 @@ mshv_ioctl_create_partition(void __user *user_arg)
>  	if (!partition)
>  		return -ENOMEM;
> 
> +	mutex_init(&partition->mutex);
> +
>  	fd = get_unused_fd_flags(O_CLOEXEC);
>  	if (fd < 0) {
>  		ret = fd;
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls
       [not found] ` <1605918637-12192-11-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:47   ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:47 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> 
> Add ioctls for getting and setting virtual processor registers.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |  11 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 601 ++++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |  65 +--
>  include/linux/mshv.h                    |   1 +
>  include/uapi/linux/mshv.h               |  12 +
>  virt/mshv/mshv_main.c                   | 258 +++++++++-
>  6 files changed, 903 insertions(+), 45 deletions(-)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index f997f49f8690..20a626ac02d4 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -96,3 +96,14 @@ is backed by physical memory.
>  Create a virtual processor in a guest partition, returning a file descriptor to
>  represent the vp and perform ioctls on.
> 
> +3.5 MSHV_GET_VP_REGISTERS and MSHV_SET_VP_REGISTERS
> +---------------------------------------------------
> +:Type: vp ioctl
> +:Parameters: struct mshv_vp_registers
> +:Returns: 0 on success
> +
> +Get/set vp registers. See asm/hyperv-tlfs.h for the complete set of registers.
> +Includes general purpose platform registers, MSRs, and virtual registers that
> +are part of Microsoft Hypervisor platform and not directly exposed to the guest.
> +
> +
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> index 72150c25ffe6..2ff655962738 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -121,4 +121,605 @@ struct hv_partition_creation_properties {
>  		disabled_processor_xsave_features;
>  };
> 
> +enum hv_register_name {
> +	/* Suspend Registers */
> +	HV_REGISTER_EXPLICIT_SUSPEND		= 0x00000000,
> +	HV_REGISTER_INTERCEPT_SUSPEND		= 0x00000001,
> +	HV_REGISTER_INSTRUCTION_EMULATION_HINTS	= 0x00000002,
> +	HV_REGISTER_DISPATCH_SUSPEND		= 0x00000003,
> +	HV_REGISTER_INTERNAL_ACTIVITY_STATE	= 0x00000004,
> +
> +	/* Version */
> +	HV_REGISTER_HYPERVISOR_VERSION	= 0x00000100, /* 128-bit result same as CPUID 0x40000002 */
> +
> +	/* Feature Access (registers are 128 bits) - same as CPUID 0x40000003 - 0x4000000B */
> +	HV_REGISTER_PRIVILEGES_AND_FEATURES_INFO	= 0x00000200,
> +	HV_REGISTER_FEATURES_INFO			= 0x00000201,
> +	HV_REGISTER_IMPLEMENTATION_LIMITS_INFO		= 0x00000202,
> +	HV_REGISTER_HARDWARE_FEATURES_INFO		= 0x00000203,
> +	HV_REGISTER_CPU_MANAGEMENT_FEATURES_INFO	= 0x00000204,
> +	HV_REGISTER_SVM_FEATURES_INFO			= 0x00000205,
> +	HV_REGISTER_SKIP_LEVEL_FEATURES_INFO		= 0x00000206,
> +	HV_REGISTER_NESTED_VIRT_FEATURES_INFO		= 0x00000207,
> +	HV_REGISTER_IPT_FEATURES_INFO			= 0x00000208,
> +
> +	/* Guest Crash Registers */
> +	HV_REGISTER_GUEST_CRASH_P0	= 0x00000210,
> +	HV_REGISTER_GUEST_CRASH_P1	= 0x00000211,
> +	HV_REGISTER_GUEST_CRASH_P2	= 0x00000212,
> +	HV_REGISTER_GUEST_CRASH_P3	= 0x00000213,
> +	HV_REGISTER_GUEST_CRASH_P4	= 0x00000214,
> +	HV_REGISTER_GUEST_CRASH_CTL	= 0x00000215,
> +
> +	/* Power State Configuration */
> +	HV_REGISTER_POWER_STATE_CONFIG_C1	= 0x00000220,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C1	= 0x00000221,
> +	HV_REGISTER_POWER_STATE_CONFIG_C2	= 0x00000222,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C2	= 0x00000223,
> +	HV_REGISTER_POWER_STATE_CONFIG_C3	= 0x00000224,
> +	HV_REGISTER_POWER_STATE_TRIGGER_C3	= 0x00000225,
> +
> +	/* Frequency Registers */
> +	HV_REGISTER_PROCESSOR_CLOCK_FREQUENCY	= 0x00000240,
> +	HV_REGISTER_INTERRUPT_CLOCK_FREQUENCY	= 0x00000241,
> +
> +	/* Idle Register */
> +	HV_REGISTER_GUEST_IDLE	= 0x00000250,
> +
> +	/* Guest Debug */
> +	HV_REGISTER_DEBUG_DEVICE_OPTIONS	= 0x00000260,
> +
> +	/* Memory Zeroing Control Register */
> +	HV_REGISTER_MEMORY_ZEROING_CONTROL	= 0x00000270,
> +
> +	/* Pending Event Register */
> +	HV_REGISTER_PENDING_EVENT0	= 0x00010004,
> +	HV_REGISTER_PENDING_EVENT1	= 0x00010005,
> +
> +	/* Misc */
> +	HV_REGISTER_VP_RUNTIME			= 0x00090000,
> +	HV_REGISTER_GUEST_OS_ID			= 0x00090002,
> +	HV_REGISTER_VP_INDEX			= 0x00090003,
> +	HV_REGISTER_TIME_REF_COUNT		= 0x00090004,
> +	HV_REGISTER_CPU_MANAGEMENT_VERSION	= 0x00090007,
> +	HV_REGISTER_VP_ASSIST_PAGE		= 0x00090013,
> +	HV_REGISTER_VP_ROOT_SIGNAL_COUNT	= 0x00090014,
> +	HV_REGISTER_REFERENCE_TSC		= 0x00090017,
> +
> +	/* Performance statistics Registers */
> +	HV_REGISTER_STATS_PARTITION_RETAIL	= 0x00090020,
> +	HV_REGISTER_STATS_PARTITION_INTERNAL	= 0x00090021,
> +	HV_REGISTER_STATS_VP_RETAIL		= 0x00090022,
> +	HV_REGISTER_STATS_VP_INTERNAL		= 0x00090023,
> +
> +	HV_REGISTER_NESTED_VP_INDEX	= 0x00091003,
> +
> +	/* Hypervisor-defined Registers (Synic) */
> +	HV_REGISTER_SINT0	= 0x000A0000,
> +	HV_REGISTER_SINT1	= 0x000A0001,
> +	HV_REGISTER_SINT2	= 0x000A0002,
> +	HV_REGISTER_SINT3	= 0x000A0003,
> +	HV_REGISTER_SINT4	= 0x000A0004,
> +	HV_REGISTER_SINT5	= 0x000A0005,
> +	HV_REGISTER_SINT6	= 0x000A0006,
> +	HV_REGISTER_SINT7	= 0x000A0007,
> +	HV_REGISTER_SINT8	= 0x000A0008,
> +	HV_REGISTER_SINT9	= 0x000A0009,
> +	HV_REGISTER_SINT10	= 0x000A000A,
> +	HV_REGISTER_SINT11	= 0x000A000B,
> +	HV_REGISTER_SINT12	= 0x000A000C,
> +	HV_REGISTER_SINT13	= 0x000A000D,
> +	HV_REGISTER_SINT14	= 0x000A000E,
> +	HV_REGISTER_SINT15	= 0x000A000F,
> +	HV_REGISTER_SCONTROL	= 0x000A0010,
> +	HV_REGISTER_SVERSION	= 0x000A0011,
> +	HV_REGISTER_SIFP	= 0x000A0012,
> +	HV_REGISTER_SIPP	= 0x000A0013,
> +	HV_REGISTER_EOM		= 0x000A0014,
> +	HV_REGISTER_SIRBP	= 0x000A0015,
> +
> +	HV_REGISTER_NESTED_SINT0	= 0x000A1000,
> +	HV_REGISTER_NESTED_SINT1	= 0x000A1001,
> +	HV_REGISTER_NESTED_SINT2	= 0x000A1002,
> +	HV_REGISTER_NESTED_SINT3	= 0x000A1003,
> +	HV_REGISTER_NESTED_SINT4	= 0x000A1004,
> +	HV_REGISTER_NESTED_SINT5	= 0x000A1005,
> +	HV_REGISTER_NESTED_SINT6	= 0x000A1006,
> +	HV_REGISTER_NESTED_SINT7	= 0x000A1007,
> +	HV_REGISTER_NESTED_SINT8	= 0x000A1008,
> +	HV_REGISTER_NESTED_SINT9	= 0x000A1009,
> +	HV_REGISTER_NESTED_SINT10	= 0x000A100A,
> +	HV_REGISTER_NESTED_SINT11	= 0x000A100B,
> +	HV_REGISTER_NESTED_SINT12	= 0x000A100C,
> +	HV_REGISTER_NESTED_SINT13	= 0x000A100D,
> +	HV_REGISTER_NESTED_SINT14	= 0x000A100E,
> +	HV_REGISTER_NESTED_SINT15	= 0x000A100F,
> +	HV_REGISTER_NESTED_SCONTROL	= 0x000A1010,
> +	HV_REGISTER_NESTED_SVERSION	= 0x000A1011,
> +	HV_REGISTER_NESTED_SIFP		= 0x000A1012,
> +	HV_REGISTER_NESTED_SIPP		= 0x000A1013,
> +	HV_REGISTER_NESTED_EOM		= 0x000A1014,
> +	HV_REGISTER_NESTED_SIRBP	= 0x000a1015,
> +
> +
> +	/* Hypervisor-defined Registers (Synthetic Timers) */
> +	HV_REGISTER_STIMER0_CONFIG		= 0x000B0000,
> +	HV_REGISTER_STIMER0_COUNT		= 0x000B0001,
> +	HV_REGISTER_STIMER1_CONFIG		= 0x000B0002,
> +	HV_REGISTER_STIMER1_COUNT		= 0x000B0003,
> +	HV_REGISTER_STIMER2_CONFIG		= 0x000B0004,
> +	HV_REGISTER_STIMER2_COUNT		= 0x000B0005,
> +	HV_REGISTER_STIMER3_CONFIG		= 0x000B0006,
> +	HV_REGISTER_STIMER3_COUNT		= 0x000B0007,
> +	HV_REGISTER_STIME_UNHALTED_TIMER_CONFIG	= 0x000B0100,
> +	HV_REGISTER_STIME_UNHALTED_TIMER_COUNT	= 0x000b0101,
> +
> +	/* Synthetic VSM registers */
> +
> +	/* 0x000D0000-1 are available for future use. */
> +	HV_REGISTER_VSM_CODE_PAGE_OFFSETS	= 0x000D0002,
> +	HV_REGISTER_VSM_VP_STATUS		= 0x000D0003,
> +	HV_REGISTER_VSM_PARTITION_STATUS	= 0x000D0004,
> +	HV_REGISTER_VSM_VINA			= 0x000D0005,
> +	HV_REGISTER_VSM_CAPABILITIES		= 0x000D0006,
> +	HV_REGISTER_VSM_PARTITION_CONFIG	= 0x000D0007,
> +
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL0	= 0x000D0010,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL1	= 0x000D0011,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL2	= 0x000D0012,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL3	= 0x000D0013,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL4	= 0x000D0014,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL5	= 0x000D0015,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL6	= 0x000D0016,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL7	= 0x000D0017,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL8	= 0x000D0018,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL9	= 0x000D0019,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL10	= 0x000D001A,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL11	= 0x000D001B,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL12	= 0x000D001C,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL13	= 0x000D001D,
> +	HV_REGISTER_VSM_VP_SECURE_CONFIG_VTL14	= 0x000D001E,
> +
> +	HV_REGISTER_VSM_VP_WAIT_FOR_TLB_LOCK	= 0x000D0020,
> +
> +	HV_REGISTER_ISOLATION_CAPABILITIES	= 0x000D0100,
> +
> +	/* Pending Interruption Register */
> +	HV_REGISTER_PENDING_INTERRUPTION	= 0x00010002,
> +
> +	/* Interrupt State register */
> +	HV_REGISTER_INTERRUPT_STATE	= 0x00010003,
> +
> +	/* Interruptible notification register */
> +	HV_X64_REGISTER_DELIVERABILITY_NOTIFICATIONS	= 0x00010006,
> +
> +	/* X64 User-Mode Registers */
> +	HV_X64_REGISTER_RAX	= 0x00020000,
> +	HV_X64_REGISTER_RCX	= 0x00020001,
> +	HV_X64_REGISTER_RDX	= 0x00020002,
> +	HV_X64_REGISTER_RBX	= 0x00020003,
> +	HV_X64_REGISTER_RSP	= 0x00020004,
> +	HV_X64_REGISTER_RBP	= 0x00020005,
> +	HV_X64_REGISTER_RSI	= 0x00020006,
> +	HV_X64_REGISTER_RDI	= 0x00020007,
> +	HV_X64_REGISTER_R8	= 0x00020008,
> +	HV_X64_REGISTER_R9	= 0x00020009,
> +	HV_X64_REGISTER_R10	= 0x0002000A,
> +	HV_X64_REGISTER_R11	= 0x0002000B,
> +	HV_X64_REGISTER_R12	= 0x0002000C,
> +	HV_X64_REGISTER_R13	= 0x0002000D,
> +	HV_X64_REGISTER_R14	= 0x0002000E,
> +	HV_X64_REGISTER_R15	= 0x0002000F,
> +	HV_X64_REGISTER_RIP	= 0x00020010,
> +	HV_X64_REGISTER_RFLAGS	= 0x00020011,
> +
> +	/* X64 Floating Point and Vector Registers */
> +	HV_X64_REGISTER_XMM0			= 0x00030000,
> +	HV_X64_REGISTER_XMM1			= 0x00030001,
> +	HV_X64_REGISTER_XMM2			= 0x00030002,
> +	HV_X64_REGISTER_XMM3			= 0x00030003,
> +	HV_X64_REGISTER_XMM4			= 0x00030004,
> +	HV_X64_REGISTER_XMM5			= 0x00030005,
> +	HV_X64_REGISTER_XMM6			= 0x00030006,
> +	HV_X64_REGISTER_XMM7			= 0x00030007,
> +	HV_X64_REGISTER_XMM8			= 0x00030008,
> +	HV_X64_REGISTER_XMM9			= 0x00030009,
> +	HV_X64_REGISTER_XMM10			= 0x0003000A,
> +	HV_X64_REGISTER_XMM11			= 0x0003000B,
> +	HV_X64_REGISTER_XMM12			= 0x0003000C,
> +	HV_X64_REGISTER_XMM13			= 0x0003000D,
> +	HV_X64_REGISTER_XMM14			= 0x0003000E,
> +	HV_X64_REGISTER_XMM15			= 0x0003000F,
> +	HV_X64_REGISTER_FP_MMX0			= 0x00030010,
> +	HV_X64_REGISTER_FP_MMX1			= 0x00030011,
> +	HV_X64_REGISTER_FP_MMX2			= 0x00030012,
> +	HV_X64_REGISTER_FP_MMX3			= 0x00030013,
> +	HV_X64_REGISTER_FP_MMX4			= 0x00030014,
> +	HV_X64_REGISTER_FP_MMX5			= 0x00030015,
> +	HV_X64_REGISTER_FP_MMX6			= 0x00030016,
> +	HV_X64_REGISTER_FP_MMX7			= 0x00030017,
> +	HV_X64_REGISTER_FP_CONTROL_STATUS	= 0x00030018,
> +	HV_X64_REGISTER_XMM_CONTROL_STATUS	= 0x00030019,
> +
> +	/* X64 Control Registers */
> +	HV_X64_REGISTER_CR0	= 0x00040000,
> +	HV_X64_REGISTER_CR2	= 0x00040001,
> +	HV_X64_REGISTER_CR3	= 0x00040002,
> +	HV_X64_REGISTER_CR4	= 0x00040003,
> +	HV_X64_REGISTER_CR8	= 0x00040004,
> +	HV_X64_REGISTER_XFEM	= 0x00040005,
> +
> +	/* X64 Intermediate Control Registers */
> +	HV_X64_REGISTER_INTERMEDIATE_CR0	= 0x00041000,
> +	HV_X64_REGISTER_INTERMEDIATE_CR4	= 0x00041003,
> +	HV_X64_REGISTER_INTERMEDIATE_CR8	= 0x00041004,
> +
> +	/* X64 Debug Registers */
> +	HV_X64_REGISTER_DR0	= 0x00050000,
> +	HV_X64_REGISTER_DR1	= 0x00050001,
> +	HV_X64_REGISTER_DR2	= 0x00050002,
> +	HV_X64_REGISTER_DR3	= 0x00050003,
> +	HV_X64_REGISTER_DR6	= 0x00050004,
> +	HV_X64_REGISTER_DR7	= 0x00050005,
> +
> +	/* X64 Segment Registers */
> +	HV_X64_REGISTER_ES	= 0x00060000,
> +	HV_X64_REGISTER_CS	= 0x00060001,
> +	HV_X64_REGISTER_SS	= 0x00060002,
> +	HV_X64_REGISTER_DS	= 0x00060003,
> +	HV_X64_REGISTER_FS	= 0x00060004,
> +	HV_X64_REGISTER_GS	= 0x00060005,
> +	HV_X64_REGISTER_LDTR	= 0x00060006,
> +	HV_X64_REGISTER_TR	= 0x00060007,
> +
> +	/* X64 Table Registers */
> +	HV_X64_REGISTER_IDTR	= 0x00070000,
> +	HV_X64_REGISTER_GDTR	= 0x00070001,
> +
> +	/* X64 Virtualized MSRs */
> +	HV_X64_REGISTER_TSC		= 0x00080000,
> +	HV_X64_REGISTER_EFER		= 0x00080001,
> +	HV_X64_REGISTER_KERNEL_GS_BASE	= 0x00080002,
> +	HV_X64_REGISTER_APIC_BASE	= 0x00080003,
> +	HV_X64_REGISTER_PAT		= 0x00080004,
> +	HV_X64_REGISTER_SYSENTER_CS	= 0x00080005,
> +	HV_X64_REGISTER_SYSENTER_EIP	= 0x00080006,
> +	HV_X64_REGISTER_SYSENTER_ESP	= 0x00080007,
> +	HV_X64_REGISTER_STAR		= 0x00080008,
> +	HV_X64_REGISTER_LSTAR		= 0x00080009,
> +	HV_X64_REGISTER_CSTAR		= 0x0008000A,
> +	HV_X64_REGISTER_SFMASK		= 0x0008000B,
> +	HV_X64_REGISTER_INITIAL_APIC_ID	= 0x0008000C,
> +
> +	/* X64 Cache control MSRs */
> +	HV_X64_REGISTER_MSR_MTRR_CAP		= 0x0008000D,
> +	HV_X64_REGISTER_MSR_MTRR_DEF_TYPE	= 0x0008000E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0	= 0x00080010,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1	= 0x00080011,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2	= 0x00080012,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3	= 0x00080013,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4	= 0x00080014,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5	= 0x00080015,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6	= 0x00080016,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7	= 0x00080017,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8	= 0x00080018,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9	= 0x00080019,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA	= 0x0008001A,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB	= 0x0008001B,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC	= 0x0008001C,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASED	= 0x0008001D,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE	= 0x0008001E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF	= 0x0008001F,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0	= 0x00080040,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1	= 0x00080041,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2	= 0x00080042,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3	= 0x00080043,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4	= 0x00080044,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5	= 0x00080045,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6	= 0x00080046,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7	= 0x00080047,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8	= 0x00080048,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9	= 0x00080049,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA	= 0x0008004A,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB	= 0x0008004B,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC	= 0x0008004C,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD	= 0x0008004D,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE	= 0x0008004E,
> +	HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF	= 0x0008004F,
> +	HV_X64_REGISTER_MSR_MTRR_FIX64K00000	= 0x00080070,
> +	HV_X64_REGISTER_MSR_MTRR_FIX16K80000	= 0x00080071,
> +	HV_X64_REGISTER_MSR_MTRR_FIX16KA0000	= 0x00080072,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KC0000	= 0x00080073,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KC8000	= 0x00080074,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KD0000	= 0x00080075,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KD8000	= 0x00080076,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KE0000	= 0x00080077,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KE8000	= 0x00080078,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KF0000	= 0x00080079,
> +	HV_X64_REGISTER_MSR_MTRR_FIX4KF8000	= 0x0008007A,
> +
> +	HV_X64_REGISTER_TSC_AUX		= 0x0008007B,
> +	HV_X64_REGISTER_BNDCFGS		= 0x0008007C,
> +	HV_X64_REGISTER_DEBUG_CTL	= 0x0008007D,
> +
> +	/* Available */
> +	HV_X64_REGISTER_AVAILABLE0008007E	= 0x0008007E,
> +	HV_X64_REGISTER_AVAILABLE0008007F	= 0x0008007F,
> +
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL0	= 0x00080080,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL1	= 0x00080081,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL2	= 0x00080082,
> +	HV_X64_REGISTER_SGX_LAUNCH_CONTROL3	= 0x00080083,
> +	HV_X64_REGISTER_SPEC_CTRL		= 0x00080084,
> +	HV_X64_REGISTER_PRED_CMD		= 0x00080085,
> +	HV_X64_REGISTER_VIRT_SPEC_CTRL		= 0x00080086,
> +
> +	/* Other MSRs */
> +	HV_X64_REGISTER_MSR_IA32_MISC_ENABLE		= 0x000800A0,
> +	HV_X64_REGISTER_IA32_FEATURE_CONTROL		= 0x000800A1,
> +	HV_X64_REGISTER_IA32_VMX_BASIC			= 0x000800A2,
> +	HV_X64_REGISTER_IA32_VMX_PINBASED_CTLS		= 0x000800A3,
> +	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS		= 0x000800A4,
> +	HV_X64_REGISTER_IA32_VMX_EXIT_CTLS		= 0x000800A5,
> +	HV_X64_REGISTER_IA32_VMX_ENTRY_CTLS		= 0x000800A6,
> +	HV_X64_REGISTER_IA32_VMX_MISC			= 0x000800A7,
> +	HV_X64_REGISTER_IA32_VMX_CR0_FIXED0		= 0x000800A8,
> +	HV_X64_REGISTER_IA32_VMX_CR0_FIXED1		= 0x000800A9,
> +	HV_X64_REGISTER_IA32_VMX_CR4_FIXED0		= 0x000800AA,
> +	HV_X64_REGISTER_IA32_VMX_CR4_FIXED1		= 0x000800AB,
> +	HV_X64_REGISTER_IA32_VMX_VMCS_ENUM		= 0x000800AC,
> +	HV_X64_REGISTER_IA32_VMX_PROCBASED_CTLS2	= 0x000800AD,
> +	HV_X64_REGISTER_IA32_VMX_EPT_VPID_CAP		= 0x000800AE,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_PINBASED_CTLS	= 0x000800AF,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_PROCBASED_CTLS	= 0x000800B0,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_EXIT_CTLS		= 0x000800B1,
> +	HV_X64_REGISTER_IA32_VMX_TRUE_ENTRY_CTLS	= 0x000800B2,
> +
> +	/* Performance monitoring MSRs */
> +	HV_X64_REGISTER_PERF_GLOBAL_CTRL	= 0x00081000,
> +	HV_X64_REGISTER_PERF_GLOBAL_STATUS	= 0x00081001,
> +	HV_X64_REGISTER_PERF_GLOBAL_IN_USE	= 0x00081002,
> +	HV_X64_REGISTER_FIXED_CTR_CTRL		= 0x00081003,
> +	HV_X64_REGISTER_DS_AREA			= 0x00081004,
> +	HV_X64_REGISTER_PEBS_ENABLE		= 0x00081005,
> +	HV_X64_REGISTER_PEBS_LD_LAT		= 0x00081006,
> +	HV_X64_REGISTER_PEBS_FRONTEND		= 0x00081007,
> +	HV_X64_REGISTER_PERF_EVT_SEL0		= 0x00081100,
> +	HV_X64_REGISTER_PMC0			= 0x00081200,
> +	HV_X64_REGISTER_FIXED_CTR0		= 0x00081300,
> +
> +	HV_X64_REGISTER_LBR_TOS		= 0x00082000,
> +	HV_X64_REGISTER_LBR_SELECT	= 0x00082001,
> +	HV_X64_REGISTER_LER_FROM_LIP	= 0x00082002,
> +	HV_X64_REGISTER_LER_TO_LIP	= 0x00082003,
> +	HV_X64_REGISTER_LBR_FROM0	= 0x00082100,
> +	HV_X64_REGISTER_LBR_TO0		= 0x00082200,
> +	HV_X64_REGISTER_LBR_INFO0	= 0x00083300,
> +
> +	/* Intel processor trace MSRs */
> +	HV_X64_REGISTER_RTIT_CTL		= 0x00081008,
> +	HV_X64_REGISTER_RTIT_STATUS		= 0x00081009,
> +	HV_X64_REGISTER_RTIT_OUTPUT_BASE	= 0x0008100A,
> +	HV_X64_REGISTER_RTIT_OUTPUT_MASK_PTRS	= 0x0008100B,
> +	HV_X64_REGISTER_RTIT_CR3_MATCH		= 0x0008100C,
> +	HV_X64_REGISTER_RTIT_ADDR0A		= 0x00081400,
> +
> +	/* RtitAddr0A/B - RtitAddr3A/B occupy 0x00081400-0x00081407. */
> +
> +	/* X64 Apic registers. These match the equivalent x2APIC MSR offsets. */
> +	HV_X64_REGISTER_APIC_ID		= 0x00084802,
> +	HV_X64_REGISTER_APIC_VERSION	= 0x00084803,
> +
> +	/* Hypervisor-defined registers (Misc) */
> +	HV_X64_REGISTER_HYPERCALL	= 0x00090001,
> +
> +	/* X64 Virtual APIC registers synthetic MSRs */
> +	HV_X64_REGISTER_SYNTHETIC_EOI	= 0x00090010,
> +	HV_X64_REGISTER_SYNTHETIC_ICR	= 0x00090011,
> +	HV_X64_REGISTER_SYNTHETIC_TPR	= 0x00090012,
> +
> +	/* Partition Timer Assist Registers */
> +	HV_X64_REGISTER_EMULATED_TIMER_PERIOD	= 0x00090030,
> +	HV_X64_REGISTER_EMULATED_TIMER_CONTROL	= 0x00090031,
> +	HV_X64_REGISTER_PM_TIMER_ASSIST		= 0x00090032,
> +
> +	/* Intercept Control Registers */
> +	HV_X64_REGISTER_CR_INTERCEPT_CONTROL			= 0x000E0000,
> +	HV_X64_REGISTER_CR_INTERCEPT_CR0_MASK			= 0x000E0001,
> +	HV_X64_REGISTER_CR_INTERCEPT_CR4_MASK			= 0x000E0002,
> +	HV_X64_REGISTER_CR_INTERCEPT_IA32_MISC_ENABLE_MASK	= 0x000E0003,
> +
> +};
> +
> +struct hv_u128 {
> +	__u64 high_part;
> +	__u64 low_part;
> +};
> +
> +union hv_x64_fp_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		__u64 mantissa;
> +		__u64 biased_exponent : 15;
> +		__u64 sign : 1;
> +		__u64 reserved : 48;
> +	};
> +};
> +
> +union hv_x64_fp_control_status_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		__u16 fp_control;
> +		__u16 fp_status;
> +		__u8 fp_tag;
> +		__u8 reserved;
> +		__u16 last_fp_op;
> +		union {
> +			/* long mode */
> +			__u64 last_fp_rip;
> +			/* 32 bit mode */
> +			struct {
> +				__u32 last_fp_eip;
> +				__u16 last_fp_cs;
> +			};
> +		};
> +	};
> +};
> +
> +union hv_x64_xmm_control_status_register {
> +	struct hv_u128 as_uint128;
> +	struct {
> +		union {
> +			/* long mode */
> +			__u64 last_fp_rdp;
> +			/* 32 bit mode */
> +			struct {
> +				__u32 last_fp_dp;
> +				__u16 last_fp_ds;
> +			};
> +		};
> +		__u32 xmm_status_control;
> +		__u32 xmm_status_control_mask;
> +	};
> +};
> +
> +struct hv_x64_segment_register {
> +	__u64 base;
> +	__u32 limit;
> +	__u16 selector;
> +	union {
> +		struct {
> +			__u16 segment_type : 4;
> +			__u16 non_system_segment : 1;
> +			__u16 descriptor_privilege_level : 2;
> +			__u16 present : 1;
> +			__u16 reserved : 4;
> +			__u16 available : 1;
> +			__u16 _long : 1;
> +			__u16 _default : 1;
> +			__u16 granularity : 1;
> +		};
> +		__u16 attributes;
> +	};
> +};
> +
> +struct hv_x64_table_register {
> +	__u16 pad[3];
> +	__u16 limit;
> +	__u64 base;
> +};
> +
> +union hv_explicit_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_intercept_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_dispatch_suspend_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 suspended : 1;
> +		__u64 reserved : 63;
> +	};
> +};
> +
> +union hv_x64_interrupt_state_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u64 interrupt_shadow : 1;
> +		__u64 nmi_masked : 1;
> +		__u64 reserved : 62;
> +	};
> +};
> +
> +union hv_x64_pending_interruption_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u32 interruption_pending : 1;
> +		__u32 interruption_type : 3;
> +		__u32 deliver_error_code : 1;
> +		__u32 instruction_length : 4;
> +		__u32 nested_event : 1;
> +		__u32 reserved : 6;
> +		__u32 interruption_vector : 16;
> +		__u32 error_code;
> +	};
> +};
> +
> +union hv_x64_msr_npiep_config_contents {
> +	__u64 as_uint64;
> +	struct {
> +		/*
> +		 * These bits enable instruction execution prevention for
> +		 * specific instructions.
> +		 */
> +		__u64 prevents_gdt : 1;
> +		__u64 prevents_idt : 1;
> +		__u64 prevents_ldt : 1;
> +		__u64 prevents_tr : 1;
> +
> +		/* The reserved bits must always be 0. */
> +		__u64 reserved : 60;
> +	};
> +};
> +
> +union hv_x64_pending_exception_event {
> +	__u64 as_uint64[2];
> +	struct {
> +		__u32 event_pending : 1;
> +		__u32 event_type : 3;
> +		__u32 reserved0 : 4;
> +		__u32 deliver_error_code : 1;
> +		__u32 reserved1 : 7;
> +		__u32 vector : 16;
> +		__u32 error_code;
> +		__u64 exception_parameter;
> +	};
> +};
> +
> +union hv_x64_pending_virtualization_fault_event {
> +	__u64 as_uint64[2];
> +	struct {
> +		__u32 event_pending : 1;
> +		__u32 event_type : 3;
> +		__u32 reserved0 : 4;
> +		__u32 reserved1 : 8;
> +		__u32 parameter0 : 16;
> +		__u32 code;
> +		__u64 parameter1;
> +	};
> +};
> +
> +union hv_register_value {
> +	struct hv_u128 reg128;
> +	__u64 reg64;
> +	__u32 reg32;
> +	__u16 reg16;
> +	__u8 reg8;
> +	union hv_x64_fp_register fp;
> +	union hv_x64_fp_control_status_register fp_control_status;
> +	union hv_x64_xmm_control_status_register xmm_control_status;
> +	struct hv_x64_segment_register segment;
> +	struct hv_x64_table_register table;
> +	union hv_explicit_suspend_register explicit_suspend;
> +	union hv_intercept_suspend_register intercept_suspend;
> +	union hv_dispatch_suspend_register dispatch_suspend;
> +	union hv_x64_interrupt_state_register interrupt_state;
> +	union hv_x64_pending_interruption_register pending_interruption;
> +	union hv_x64_msr_npiep_config_contents npiep_config;
> +	union hv_x64_pending_exception_event pending_exception_event;
> +	union hv_x64_pending_virtualization_fault_event
> +		pending_virtualization_fault_event;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 6e5072e29897..b9295400c20b 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -622,53 +622,30 @@ struct hv_retarget_device_interrupt {
>  } __packed __aligned(8);
> 
> 
> -/* HvGetVpRegisters hypercall input with variable size reg name list*/
> -struct hv_get_vp_registers_input {
> -	struct {
> -		u64 partitionid;
> -		u32 vpindex;
> -		u8  inputvtl;
> -		u8  padding[3];
> -	} header;
> -	struct input {
> -		u32 name0;
> -		u32 name1;
> -	} element[];
> -} __packed;
> -
> +/* HvGetVpRegisters hypercall with variable size reg name list*/
> +struct hv_get_vp_registers {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8  input_vtl;
> +	u8  rsvd_z8;
> +	u16 rsvd_z16;
> +	__aligned(8) enum hv_register_name names[];
> +} __aligned(8);
> 
> -/* HvGetVpRegisters returns an array of these output elements */
> -struct hv_get_vp_registers_output {
> -	union {
> -		struct {
> -			u32 a;
> -			u32 b;
> -			u32 c;
> -			u32 d;
> -		} as32 __packed;
> -		struct {
> -			u64 low;
> -			u64 high;
> -		} as64 __packed;
> -	};
> +/* HvSetVpRegisters hypercall with variable size reg name/value list*/
> +struct hv_register_assoc {
> +	enum hv_register_name name;
> +	__aligned(16) union hv_register_value value;
>  };
> 
> -/* HvSetVpRegisters hypercall with variable size reg name/value list*/
> -struct hv_set_vp_registers_input {
> -	struct {
> -		u64 partitionid;
> -		u32 vpindex;
> -		u8  inputvtl;
> -		u8  padding[3];
> -	} header;
> -	struct {
> -		u32 name;
> -		u32 padding1;
> -		u64 padding2;
> -		u64 valuelow;
> -		u64 valuehigh;
> -	} element[];
> -} __packed;
> +struct hv_set_vp_registers {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8  input_vtl;
> +	u8  rsvd_z8;
> +	u16 rsvd_z16;
> +	struct hv_register_assoc elements[];
> +} __aligned(16);

Throughout these structures, I think the approach needs to be more
explicit about the memory layout.  The current definitions assume that
the compiler is inserting padding in the expected places, and not in
any unexpected places.  My previous concerns about the use of enum
also apply.

The code also removes some layouts that are used in the
not-yet-accepted patches for ARM64.  Let's sync on how to get
those back in.
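
As a rough illustration of the "explicit layout" point, here is a
standalone userspace sketch, not the patch's definitions.  The demo_*
names are hypothetical.  The field order mirrors the removed
hv_set_vp_registers_input layout quoted above; the build-time checks
are an assumption about how one might pin the layout down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register value. */
struct demo_reg_value {
	uint64_t qword[2];
};

/* Name/value pair with every byte accounted for, no implicit padding. */
struct demo_register_assoc {
	uint32_t name;		/* fixed-width register name, not an enum */
	uint32_t reserved1;	/* explicit pad, must be zero */
	uint64_t reserved2;	/* explicit pad, must be zero */
	struct demo_reg_value value;
};

/* Fixed 16-byte header followed by a variable number of pairs. */
struct demo_set_vp_registers {
	uint64_t partition_id;
	uint32_t vp_index;
	uint8_t  input_vtl;
	uint8_t  rsvd_z8;
	uint16_t rsvd_z16;
	struct demo_register_assoc elements[];
};

/* Fail the build if the compiler lays anything out differently. */
static_assert(sizeof(struct demo_register_assoc) == 32,
	      "pair must be 32 bytes");
static_assert(offsetof(struct demo_register_assoc, value) == 16,
	      "value must start at byte 16");
static_assert(offsetof(struct demo_set_vp_registers, elements) == 16,
	      "pairs must follow the 16-byte header");
```

With the pads spelled out and the name carried as a 32-bit integer,
the layout no longer depends on the size the compiler picks for an
enum or on where it chooses to insert padding.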

> 
>  enum hv_device_type {
>  	HV_DEVICE_TYPE_LOGICAL = 0,
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index 50521c5f7948..dfe469f573f9 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -17,6 +17,7 @@
>  struct mshv_vp {
>  	u32 index;
>  	struct mshv_partition *partition;
> +	struct mutex mutex;
>  };
> 
>  struct mshv_mem_region {
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index 1f053eae68a6..5d53ed655429 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -33,6 +33,14 @@ struct mshv_create_vp {
>  	__u32 vp_index;
>  };
> 
> +#define MSHV_VP_MAX_REGISTERS	128
> +
> +struct mshv_vp_registers {
> +	int count; /* at most MSHV_VP_MAX_REGISTERS */
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +};

Having separate arrays for the names and values results in an extra
copy of the data down in the ioctl code.  Any reason the caller couldn't
supply the data as an array, where each entry is already a name/value
pair?
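
A minimal sketch of what a single-array argument could look like,
written as standalone userspace C.  All demo_* names, the reserved
fields, and the choice to carry the user pointer as a 64-bit integer
are assumptions for illustration, not the interface in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name/value pair, laid out the way the hypercall wants it. */
struct demo_reg_pair {
	uint32_t name;
	uint32_t reserved1;	/* must be zero */
	uint64_t reserved2;	/* must be zero */
	uint64_t value[2];	/* 128-bit register value */
};

/* Hypothetical ioctl argument: one count and one array of pairs. */
struct demo_vp_registers {
	uint32_t count;		/* number of entries in the array */
	uint32_t reserved;	/* must be zero */
	uint64_t regs;		/* user address of struct demo_reg_pair[count] */
};

/*
 * Number of bytes a handler would copy in with a single copy for this
 * request, or 0 if the count is out of range.
 */
static size_t demo_copy_size(const struct demo_vp_registers *args,
			     uint32_t max_regs)
{
	if (args->count == 0 || args->count > max_regs)
		return 0;
	return (size_t)args->count * sizeof(struct demo_reg_pair);
}
```

With this shape the set path needs one copy from user space, and the
resulting buffer is already in name/value order, so no per-element
repacking loop is required.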

> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
> @@ -44,4 +52,8 @@ struct mshv_create_vp {
>  #define MSHV_UNMAP_GUEST_MEMORY	_IOW(MSHV_IOCTL, 0x03, struct mshv_user_mem_region)
>  #define MSHV_CREATE_VP		_IOW(MSHV_IOCTL, 0x04, struct mshv_create_vp)
> 
> +/* vp device */
> +#define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
> +#define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 3be9d9a468c1..2a10137a1e84 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -74,6 +74,12 @@ static struct miscdevice mshv_dev = {
>  #define HV_MAP_GPA_BATCH_SIZE	\
>  		(PAGE_SIZE / sizeof(struct hv_map_gpa_pages) / sizeof(u64))
>  #define PIN_PAGES_BATCH_SIZE	(0x10000000 / PAGE_SIZE)
> +#define HV_GET_REGISTER_BATCH_SIZE	\
> +	(PAGE_SIZE / \
> +	 sizeof(struct hv_get_vp_registers) / sizeof(enum hv_register_name))
> +#define HV_SET_REGISTER_BATCH_SIZE	\
> +	(PAGE_SIZE / \
> +	 sizeof(struct hv_set_vp_registers) / sizeof(struct hv_register_assoc))

These new size calculations have the same bug as HV_MAP_GPA_BATCH_SIZE.
In each macro, the first division should be a subtraction.

With the correct calculation, HV_GET_REGISTER_BATCH_SIZE will be
too large.  The input page will accommodate more 32-bit register names
than the output page will accommodate 128-bit register values.  The limit
should be based on the latter, not the former.  Or calculate both the
input and output limits and use the minimum.
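
To make the arithmetic concrete, here is a standalone sketch.  It
assumes a 4096-byte page, a 16-byte fixed header, 4-byte register
names, 16-byte register values, and 32-byte name/value pairs; the
DEMO_* names are hypothetical and the sizes are assumptions, not
values taken from the headers.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE		4096u	/* assumed hypercall page size */
#define DEMO_HEADER_SIZE	16u	/* assumed fixed input header */
#define DEMO_NAME_SIZE		4u	/* 32-bit register name */
#define DEMO_VALUE_SIZE		16u	/* 128-bit register value */
#define DEMO_ASSOC_SIZE		32u	/* assumed padded name/value pair */

#define DEMO_MIN(a, b)		((a) < (b) ? (a) : (b))

/* As posted: divides by the header size instead of subtracting it. */
#define DEMO_GET_BATCH_AS_POSTED \
	(DEMO_PAGE_SIZE / DEMO_HEADER_SIZE / DEMO_NAME_SIZE)
#define DEMO_SET_BATCH_AS_POSTED \
	(DEMO_PAGE_SIZE / DEMO_HEADER_SIZE / DEMO_ASSOC_SIZE)

/* Names that fit in the input page after the header. */
#define DEMO_GET_INPUT_LIMIT \
	((DEMO_PAGE_SIZE - DEMO_HEADER_SIZE) / DEMO_NAME_SIZE)
/* Values that fit in the output page, which has no header. */
#define DEMO_GET_OUTPUT_LIMIT \
	(DEMO_PAGE_SIZE / DEMO_VALUE_SIZE)
/* A get batch must satisfy both limits. */
#define DEMO_GET_BATCH \
	DEMO_MIN(DEMO_GET_INPUT_LIMIT, DEMO_GET_OUTPUT_LIMIT)

/* Pairs that fit in the input page after the header. */
#define DEMO_SET_BATCH \
	((DEMO_PAGE_SIZE - DEMO_HEADER_SIZE) / DEMO_ASSOC_SIZE)
```

Under these assumptions the input page holds 1020 names but the output
page holds only 256 values, so the get batch is bounded by the output
side.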

> 
>  static int
>  hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
> @@ -380,10 +386,258 @@ hv_call_unmap_gpa_pages(u64 partition_id,
>  	return ret;
>  }
> 
> +static int
> +hv_call_get_vp_registers(u32 vp_index,
> +			 u64 partition_id,
> +			 u16 count,
> +			 const enum hv_register_name *names,
> +			 union hv_register_value *values)
> +{
> +	struct hv_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = (struct hv_get_vp_registers *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +	output_page = (union hv_register_value *)(*this_cpu_ptr(
> +		hyperv_pcpu_output_arg));
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl = 0;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->names, names,
> +			sizeof(enum hv_register_name) * rep_count);
> +
> +		hypercall_status =
> +			hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> +					    0, input_page, output_page);
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +		memcpy(values, output_page,
> +			sizeof(union hv_register_value) * completed);
> +
> +		names += completed;
> +		values += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static int
> +hv_call_set_vp_registers(u32 vp_index,
> +			 u64 partition_id,
> +			 u16 count,
> +			 struct hv_register_assoc *registers)
> +{
> +	struct hv_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	u64 hypercall_status;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	int status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = (struct hv_set_vp_registers *)(*this_cpu_ptr(
> +		hyperv_pcpu_input_arg));
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl = 0;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +			sizeof(struct hv_register_assoc) * rep_count);
> +
> +		hypercall_status =
> +			hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> +					    0, input_page, NULL);
> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> +		if (status != HV_STATUS_SUCCESS) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> +			    HV_HYPERCALL_REP_COMP_OFFSET;
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return -hv_status_to_errno(status);
> +}
> +
> +static long
> +mshv_vp_ioctl_get_regs(struct mshv_vp *vp, void __user *user_args)
> +{
> +	struct mshv_vp_registers args;
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.count > MSHV_VP_MAX_REGISTERS)
> +		return -EINVAL;
> +
> +	names = kmalloc_array(args.count,
> +			      sizeof(enum hv_register_name),
> +			      GFP_KERNEL);
> +	if (!names)
> +		return -ENOMEM;
> +
> +	values = kmalloc_array(args.count,
> +			       sizeof(union hv_register_value),
> +			       GFP_KERNEL);
> +	if (!values) {
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	if (copy_from_user(names, args.names,
> +			   sizeof(enum hv_register_name) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	ret = hv_call_get_vp_registers(vp->index, vp->partition->id,
> +				       args.count, names, values);
> +	if (ret)
> +		goto free_return;
> +
> +	if (copy_to_user(args.values, values,
> +			 sizeof(union hv_register_value) * args.count)) {
> +		ret = -EFAULT;
> +	}
> +
> +free_return:
> +	kfree(names);
> +	kfree(values);
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
> +{
> +	int i;
> +	struct mshv_vp_registers args;
> +	struct hv_register_assoc *registers;
> +	enum hv_register_name *names;
> +	union hv_register_value *values;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.count > MSHV_VP_MAX_REGISTERS)
> +		return -EINVAL;
> +
> +	names = kmalloc_array(args.count,
> +			      sizeof(enum hv_register_name),
> +			      GFP_KERNEL);
> +	if (!names)
> +		return -ENOMEM;
> +
> +	values = kmalloc_array(args.count,
> +			       sizeof(union hv_register_value),
> +			       GFP_KERNEL);
> +	if (!values) {
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	registers = kmalloc_array(args.count,
> +				  sizeof(struct hv_register_assoc),
> +				  GFP_KERNEL);
> +	if (!registers) {
> +		kfree(values);
> +		kfree(names);
> +		return -ENOMEM;
> +	}
> +
> +	if (copy_from_user(names, args.names,
> +			   sizeof(enum hv_register_name) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	if (copy_from_user(values, args.values,
> +			   sizeof(union hv_register_value) * args.count)) {
> +		ret = -EFAULT;
> +		goto free_return;
> +	}
> +
> +	for (i = 0; i < args.count; i++) {
> +		memcpy(&registers[i].name, &names[i],
> +		       sizeof(enum hv_register_name));
> +		memcpy(&registers[i].value, &values[i],
> +		       sizeof(union hv_register_value));
> +	}

The above will result in uninitialized memory being sent to
Hyper-V.  The registers array comes from kmalloc_array(), which
does not zero it, and there is implicit padding after the 32-bit
name field that the loop never writes.
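
A small standalone sketch of one possible mitigation: zero the whole
destination array before filling in the fields, so the padding bytes
have a known value.  The demo_* names are hypothetical, and in kernel
code a zeroing allocator such as kcalloc() would play the role of the
memset() here; defining the pad fields explicitly is the more robust
fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit register value. */
struct demo_value {
	uint64_t qword[2];
};

/* 4-byte name, then 12 bytes of implicit padding before the value. */
struct demo_assoc {
	uint32_t name;
	_Alignas(16) struct demo_value value;
};

/* Build the pair array from separate name and value arrays. */
static void demo_fill(struct demo_assoc *dst, const uint32_t *names,
		      const struct demo_value *values, size_t count)
{
	/* Clear everything first so the padding bytes are zero. */
	memset(dst, 0, count * sizeof(*dst));

	for (size_t i = 0; i < count; i++) {
		dst[i].name = names[i];
		dst[i].value = values[i];
	}
}
```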

> +
> +	ret = hv_call_set_vp_registers(vp->index, vp->partition->id,
> +				       args.count, registers);
> +
> +free_return:
> +	kfree(names);
> +	kfree(values);
> +	kfree(registers);
> +	return ret;
> +}
> +
> +
>  static long
>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  {
> -	return -ENOTTY;
> +	struct mshv_vp *vp = filp->private_data;
> +	long r = 0;
> +
> +	if (mutex_lock_killable(&vp->mutex))
> +		return -EINTR;
> +
> +	switch (ioctl) {
> +	case MSHV_GET_VP_REGISTERS:
> +		r = mshv_vp_ioctl_get_regs(vp, (void __user *)arg);
> +		break;
> +	case MSHV_SET_VP_REGISTERS:
> +		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
> +		break;
> +	default:
> +		r = -ENOTTY;
> +		break;
> +	}
> +	mutex_unlock(&vp->mutex);
> +
> +	return r;
>  }
> 
>  static int
> @@ -420,6 +674,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
>  	if (!vp)
>  		return -ENOMEM;
> 
> +	mutex_init(&vp->mutex);
> +
>  	vp->index = args.vp_index;
>  	vp->partition = mshv_partition_get(partition);
>  	if (!vp->partition) {
> --
> 2.25.1



* RE: [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
       [not found] ` <1605918637-12192-12-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:47   ` Michael Kelley via Virtualization
       [not found]     ` <9e06a119-880f-5199-903b-056675331d6f@linux.microsoft.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:47 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> 
> Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
> and hv_synic_disable_regs().
> Setting up synic registers in both vmbus driver and mshv would clobber
> them, but the vmbus driver will not run in the root partition, so this
> is safe.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |  46 +----
>  include/linux/mshv.h                    |   1 +
>  include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
>  virt/mshv/mshv_main.c                   |  98 ++++++++-
>  6 files changed, 404 insertions(+), 77 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 4cd44ae9bffb..c34a6bb4f457 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
> 
> -
> -/* Define hypervisor message types. */
> -enum hv_message_type {
> -	HVMSG_NONE			= 0x00000000,
> -
> -	/* Memory access messages. */
> -	HVMSG_UNMAPPED_GPA		= 0x80000000,
> -	HVMSG_GPA_INTERCEPT		= 0x80000001,
> -
> -	/* Timer notification messages. */
> -	HVMSG_TIMER_EXPIRED		= 0x80000010,
> -
> -	/* Error messages. */
> -	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
> -	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
> -	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
> -
> -	/* Trace buffer complete messages. */
> -	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
> -
> -	/* Platform-specific processor intercept messages. */
> -	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
> -	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
> -	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
> -	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
> -	HVMSG_X64_APIC_EOI		= 0x80010004,
> -	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
> -};
> -
>  struct hv_nested_enlightenments_control {
>  	struct {
>  		__u32 directhypercall:1;
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> index 2ff655962738..c6a27053f791 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -722,4 +722,268 @@ union hv_register_value {
>  		pending_virtualization_fault_event;
>  };
> 
> +/* Define hypervisor message types. */
> +enum hv_message_type {
> +	HVMSG_NONE				= 0x00000000,
> +
> +	/* Memory access messages. */
> +	HVMSG_UNMAPPED_GPA			= 0x80000000,
> +	HVMSG_GPA_INTERCEPT			= 0x80000001,
> +
> +	/* Timer notification messages. */
> +	HVMSG_TIMER_EXPIRED			= 0x80000010,
> +
> +	/* Error messages. */
> +	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
> +	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
> +	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
> +
> +	/* Trace buffer complete messages. */
> +	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
> +
> +	/* Platform-specific processor intercept messages. */
> +	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
> +	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
> +	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
> +	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
> +	HVMSG_X64_APIC_EOI			= 0x80010004,
> +	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
> +	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
> +	HVMSG_X64_HALT				= 0x80010007,
> +	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
> +	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
> +};

I have a separate patch series that moves this enum to the
asm-generic portion of hyperv-tlfs.h because there's not a good way
to separate the arch-neutral values from the arch-dependent ones.

> +
> +
> +union hv_x64_vp_execution_state {
> +	__u16 as_uint16;
> +	struct {
> +		__u16 cpl:2;
> +		__u16 cr0_pe:1;
> +		__u16 cr0_am:1;
> +		__u16 efer_lma:1;
> +		__u16 debug_active:1;
> +		__u16 interruption_pending:1;
> +		__u16 vtl:4;
> +		__u16 enclave_mode:1;
> +		__u16 interrupt_shadow:1;
> +		__u16 virtualization_fault_active:1;
> +		__u16 reserved:2;
> +	};
> +};
> +
> +/* Values for intercept_access_type field */
> +#define HV_INTERCEPT_ACCESS_READ	0
> +#define HV_INTERCEPT_ACCESS_WRITE	1
> +#define HV_INTERCEPT_ACCESS_EXECUTE	2
> +
> +struct hv_x64_intercept_message_header {
> +	__u32 vp_index;
> +	__u8 instruction_length:4;
> +	__u8 cr8:4; // only set for exo partitions
> +	__u8 intercept_access_type;
> +	union hv_x64_vp_execution_state execution_state;
> +	struct hv_x64_segment_register cs_segment;
> +	__u64 rip;
> +	__u64 rflags;
> +};
> +
> +#define HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS 6
> +
> +struct hv_x64_hypercall_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u64 rax;
> +	__u64 rbx;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 r8;
> +	__u64 rsi;
> +	__u64 rdi;
> +	struct hv_u128 xmmregisters[HV_HYPERCALL_INTERCEPT_MAX_XMM_REGISTERS];
> +	struct {
> +		__u32 isolated:1;
> +		__u32 reserved:31;
> +	};
> +};
> +
> +union hv_x64_register_access_info {
> +	union hv_register_value source_value;
> +	enum hv_register_name destination_register;
> +	__u64 source_address;
> +	__u64 destination_address;
> +};
> +
> +struct hv_x64_register_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	struct {
> +		__u8 is_memory_op:1;
> +		__u8 reserved:7;
> +	};
> +	__u8 reserved8;
> +	__u16 reserved16;
> +	enum hv_register_name register_name;
> +	union hv_x64_register_access_info access_info;
> +};
> +
> +union hv_x64_memory_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 gva_gpa_valid:1;
> +		__u8 hypercall_output_pending:1;
> +		__u8 tlb_locked_no_overlay:1;
> +		__u8 reserved:4;
> +	};
> +};
> +
> +union hv_x64_io_port_access_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 access_size:3;
> +		__u8 string_op:1;
> +		__u8 rep_prefix:1;
> +		__u8 reserved:3;
> +	};
> +};
> +
> +union hv_x64_exception_info {
> +	__u8 as_uint8;
> +	struct {
> +		__u8 error_code_valid:1;
> +		__u8 software_exception:1;
> +		__u8 reserved:6;
> +	};
> +};
> +
> +enum hv_cache_type {
> +	HV_CACHE_TYPE_UNCACHED	   = 0,
> +	HV_CACHE_TYPE_WRITE_COMBINING = 1,
> +	HV_CACHE_TYPE_WRITE_THROUGH   = 4,
> +	HV_CACHE_TYPE_WRITE_PROTECTED = 5,
> +	HV_CACHE_TYPE_WRITE_BACK	  = 6
> +};
> +
> +struct hv_x64_memory_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	enum hv_cache_type cache_type;
> +	__u8 instruction_byte_count;
> +	union hv_x64_memory_access_info memory_access_info;
> +	__u8 tpr_priority;
> +	__u8 reserved1;
> +	__u64 guest_virtual_address;
> +	__u64 guest_physical_address;
> +	__u8 instruction_bytes[16];
> +};
> +
> +struct hv_x64_cpuid_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u64 rax;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 rbx;
> +	__u64 default_result_rax;
> +	__u64 default_result_rcx;
> +	__u64 default_result_rdx;
> +	__u64 default_result_rbx;
> +};
> +
> +struct hv_x64_msr_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 msr_number;
> +	__u32 reserved;
> +	__u64 rdx;
> +	__u64 rax;
> +};
> +
> +struct hv_x64_io_port_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u16 port_number;
> +	union hv_x64_io_port_access_info access_info;
> +	__u8 instruction_byte_count;
> +	__u32 reserved;
> +	__u64 rax;
> +	__u8 instruction_bytes[16];
> +	struct hv_x64_segment_register ds_segment;
> +	struct hv_x64_segment_register es_segment;
> +	__u64 rcx;
> +	__u64 rsi;
> +	__u64 rdi;
> +};
> +
> +struct hv_x64_exception_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u16 exception_vector;
> +	union hv_x64_exception_info exception_info;
> +	__u8 instruction_byte_count;
> +	__u32 error_code;
> +	__u64 exception_parameter;
> +	__u64 reserved;
> +	__u8 instruction_bytes[16];
> +	struct hv_x64_segment_register ds_segment;
> +	struct hv_x64_segment_register ss_segment;
> +	__u64 rax;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 rbx;

Is the above the correct ordering (rax, rcx, rdx, rbx)?
It's just what you would expect ....

> +	__u64 rsp;
> +	__u64 rbp;
> +	__u64 rsi;
> +	__u64 rdi;
> +	__u64 r8;
> +	__u64 r9;
> +	__u64 r10;
> +	__u64 r11;
> +	__u64 r12;
> +	__u64 r13;
> +	__u64 r14;
> +	__u64 r15;
> +};
> +
> +struct hv_x64_invalid_vp_register_message {
> +	__u32 vp_index;
> +	__u32 reserved;
> +};
> +
> +struct hv_x64_unrecoverable_exception_message {
> +	struct hv_x64_intercept_message_header header;
> +};
> +
> +enum hv_x64_unsupported_feature_code {
> +	hv_unsupported_feature_intercept = 1,
> +	hv_unsupported_feature_task_switch_tss = 2
> +};
> +
> +struct hv_x64_unsupported_feature_message {
> +	__u32 vp_index;
> +	enum hv_x64_unsupported_feature_code feature_code;
> +	__u64 feature_parameter;
> +};
> +
> +struct hv_x64_halt_message {
> +	struct hv_x64_intercept_message_header header;
> +};
> +
> +enum hv_x64_pending_interruption_type {
> +	HV_X64_PENDING_INTERRUPT	= 0,
> +	HV_X64_PENDING_NMI		= 2,
> +	HV_X64_PENDING_EXCEPTION	= 3
> +};
> +
> +struct hv_x64_interruption_deliverable_message {
> +	struct hv_x64_intercept_message_header header;
> +	enum hv_x64_pending_interruption_type deliverable_type;
> +	__u32 rsvd;
> +};
> +
> +struct hv_x64_sipi_intercept_message {
> +	struct hv_x64_intercept_message_header header;
> +	__u32 target_vp_index;
> +	__u32 interrupt_vector;
> +};
> +
> +struct hv_x64_apic_eoi_message {
> +	__u32 vp_index;
> +	__u32 interrupt_vector;
> +};

Same comments as before about enum types, not depending
on the compiler to add padding, and marking as __packed.
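
A sketch of what I mean, using the unsupported-feature message as the example. This is a userspace approximation only: the "_sketch" struct name is made up for illustration, and the <stdint.h> types stand in for __u32/__u64. The enum becomes a fixed-width field, any padding would be spelled out as explicit reserved fields, and the size is checked at build time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature codes stay as named constants; the field itself is fixed-width. */
#define HV_UNSUPPORTED_FEATURE_INTERCEPT	1u
#define HV_UNSUPPORTED_FEATURE_TASK_SWITCH_TSS	2u

/* Hypothetical rework of struct hv_x64_unsupported_feature_message. */
struct hv_x64_unsupported_feature_message_sketch {
	uint32_t vp_index;
	uint32_t feature_code;		/* was: enum hv_x64_unsupported_feature_code */
	uint64_t feature_parameter;
} __attribute__((packed));

/* The layout is part of the hypervisor ABI, so fail the build if it drifts. */
static_assert(sizeof(struct hv_x64_unsupported_feature_message_sketch) == 16,
	      "unexpected message size");
static_assert(offsetof(struct hv_x64_unsupported_feature_message_sketch,
		       feature_parameter) == 8,
	      "unexpected field offset");
```

In the kernel the attribute would be spelled __packed, as in the structures being moved out of asm-generic/hyperv-tlfs.h above.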

> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index b9295400c20b..e0185c3872a9 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -241,6 +241,8 @@ static inline const char *hv_status_to_string(enum hv_status status)
>  /* Valid SynIC vectors are 16-255. */
>  #define HV_SYNIC_FIRST_VALID_VECTOR	(16)
> 
> +#define HV_SYNIC_INTERCEPTION_SINT_INDEX 0x00000000
> +
>  #define HV_SYNIC_CONTROL_ENABLE		(1ULL << 0)
>  #define HV_SYNIC_SIMP_ENABLE		(1ULL << 0)
>  #define HV_SYNIC_SIEFP_ENABLE		(1ULL << 0)
> @@ -250,49 +252,6 @@ static inline const char *hv_status_to_string(enum hv_status status)
> 
>  #define HV_SYNIC_STIMER_COUNT		(4)
> 
> -/* Define synthetic interrupt controller message constants. */
> -#define HV_MESSAGE_SIZE			(256)
> -#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
> -#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
> -
> -/* Define synthetic interrupt controller message flags. */
> -union hv_message_flags {
> -	__u8 asu8;
> -	struct {
> -		__u8 msg_pending:1;
> -		__u8 reserved:7;
> -	} __packed;
> -};
> -
> -/* Define port identifier type. */
> -union hv_port_id {
> -	__u32 asu32;
> -	struct {
> -		__u32 id:24;
> -		__u32 reserved:8;
> -	} __packed u;
> -};
> -
> -/* Define synthetic interrupt controller message header. */
> -struct hv_message_header {
> -	__u32 message_type;
> -	__u8 payload_size;
> -	union hv_message_flags message_flags;
> -	__u8 reserved[2];
> -	union {
> -		__u64 sender;
> -		union hv_port_id port;
> -	};
> -} __packed;
> -
> -/* Define synthetic interrupt controller message format. */
> -struct hv_message {
> -	struct hv_message_header header;
> -	union {
> -		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
> -	} u;
> -} __packed;
> -
>  /* Define the synthetic interrupt message page layout. */
>  struct hv_message_page {
>  	struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
> @@ -306,7 +265,6 @@ struct hv_timer_message_payload {
>  	__u64 delivery_time;	/* When the message was delivered */
>  } __packed;
> 
> -
>  /* Define synthetic interrupt controller flag constants. */
>  #define HV_EVENT_FLAGS_COUNT		(256 * 8)
>  #define HV_EVENT_FLAGS_LONG_COUNT	(256 / sizeof(unsigned long))
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index dfe469f573f9..7709aaa1e064 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -42,6 +42,7 @@ struct mshv_partition {
>  };
> 
>  struct mshv {
> +	struct hv_message_page __percpu **synic_message_page;
>  	struct {
>  		spinlock_t lock;
>  		u64 count;
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> index e7b09b9f00de..e87389054b68 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -6,6 +6,49 @@
>  #define BIT(X)	(1ULL << (X))
>  #endif
> 
> +/* Define synthetic interrupt controller message constants. */
> +#define HV_MESSAGE_SIZE			(256)
> +#define HV_MESSAGE_PAYLOAD_BYTE_COUNT	(240)
> +#define HV_MESSAGE_PAYLOAD_QWORD_COUNT	(30)
> +
> +/* Define synthetic interrupt controller message flags. */
> +union hv_message_flags {
> +	__u8 asu8;
> +	struct {
> +		__u8 msg_pending:1;
> +		__u8 reserved:7;
> +	};
> +};
> +
> +/* Define port identifier type. */
> +union hv_port_id {
> +	__u32 asu32;
> +	struct {
> +		__u32 id:24;
> +		__u32 reserved:8;
> +	} u;
> +};
> +
> +/* Define synthetic interrupt controller message header. */
> +struct hv_message_header {
> +	enum hv_message_type message_type;
> +	__u8 payload_size;
> +	union hv_message_flags message_flags;
> +	__u8 reserved[2];
> +	union {
> +		__u64 sender;
> +		union hv_port_id port;
> +	};
> +};
> +
> +/* Define synthetic interrupt controller message format. */
> +struct hv_message {
> +	struct hv_message_header header;
> +	union {
> +		__u64 payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
> +	} u;
> +};
> +
>  /* Userspace-visible partition creation flags */
>  #define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
>  #define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 2a10137a1e84..c9445d2edb37 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -15,6 +15,8 @@
>  #include <linux/file.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/cpuhotplug.h>
>  #include <linux/mshv.h>
>  #include <asm/mshyperv.h>
> 
> @@ -1152,23 +1154,111 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
> 
> +static int
> +mshv_synic_init(unsigned int cpu)
> +{
> +	union hv_synic_simp simp;
> +	union hv_synic_sint sint;
> +	union hv_synic_scontrol sctrl;
> +	struct hv_message_page **msg_page =
> +			this_cpu_ptr(mshv.synic_message_page);
> +
> +	/* Setup the Synic's message page */
> +	hv_get_simp(simp.as_uint64);
> +	simp.simp_enabled = true;
> +	*msg_page = memremap(simp.base_simp_gpa << PAGE_SHIFT,
> +			     PAGE_SIZE, MEMREMAP_WB);

Use HV_HYP_PAGE_SHIFT and HV_HYP_PAGE_SIZE.
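
The page number in the SIMP register is in units of the hypervisor page size, which need not match the guest's PAGE_SIZE. A userspace sketch of the conversion, assuming a 4 KiB hypervisor page (the two macros are redefined locally for the sketch, and the helper name is hypothetical):

```c
#include <stdint.h>

/* Local stand-ins for the kernel's Hyper-V page constants (assumed 4 KiB). */
#define HV_HYP_PAGE_SHIFT	12
#define HV_HYP_PAGE_SIZE	(1ULL << HV_HYP_PAGE_SHIFT)

/* Hypothetical helper: turn the SIMP page number into a physical address. */
static inline uint64_t simp_pfn_to_phys(uint64_t base_simp_gpa)
{
	return base_simp_gpa << HV_HYP_PAGE_SHIFT;
}
```

In the patch that would amount to memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT, HV_HYP_PAGE_SIZE, MEMREMAP_WB).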

> +	if (!msg_page) {
> +		pr_err("%s: memremap failed\n", __func__);
> +		return -EFAULT;
> +	}
> +	hv_set_simp(simp.as_uint64);
> +
> +	/* Enable intercepts */
> +	sint.as_uint64 = 0;
> +	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.masked = false;
> +	sint.auto_eoi = hv_recommend_using_aeoi();
> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +
> +	/* Enable global synic bit */
> +	hv_get_synic_state(sctrl.as_uint64);
> +	sctrl.enable = 1;
> +	hv_set_synic_state(sctrl.as_uint64);
> +
> +	return 0;
> +}
> +
> +static int
> +mshv_synic_cleanup(unsigned int cpu)
> +{
> +	union hv_synic_sint sint;
> +	union hv_synic_simp simp;
> +	union hv_synic_scontrol sctrl;
> +	struct hv_message_page **msg_page =
> +			this_cpu_ptr(mshv.synic_message_page);
> +
> +	/* Disable the interrupt */
> +	hv_get_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +	sint.masked = true;
> +	hv_set_synint_state(HV_SYNIC_INTERCEPTION_SINT_INDEX, sint.as_uint64);
> +
> +	/* Disable Synic's message page */
> +	hv_get_simp(simp.as_uint64);
> +	simp.simp_enabled = false;
> +	hv_set_simp(simp.as_uint64);
> +	memunmap(*msg_page);
> +
> +	/* Disable global synic bit */
> +	hv_get_synic_state(sctrl.as_uint64);
> +	sctrl.enable = 0;
> +	hv_set_synic_state(sctrl.as_uint64);
> +
> +	return 0;
> +}
> +
> +static int mshv_cpuhp_online;
> +
>  static int
>  __init mshv_init(void)
>  {
> -	int r;
> +	int ret;

Ideally, change the name of the variable in the earlier patch so this
one isn't cluttered with the change.

> 
> -	r = misc_register(&mshv_dev);
> -	if (r)
> +	ret = misc_register(&mshv_dev);
> +	if (ret) {
>  		pr_err("%s: misc device register failed\n", __func__);
> +		return ret;
> +	}
> +	spin_lock_init(&mshv.partitions.lock);
> 
> +	mshv.synic_message_page = alloc_percpu(struct hv_message_page *);
> +	if (!mshv.synic_message_page) {
> +		pr_err("%s: failed to allocate percpu synic page\n", __func__);
> +		misc_deregister(&mshv_dev);
> +		return -ENOMEM;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> +				mshv_synic_init,
> +				mshv_synic_cleanup);
> +	if (ret < 0) {
> +		pr_err("%s: failed to setup cpu hotplug state: %i\n",
> +		       __func__, ret);
> +		return ret;
> +	}
> +
> +	mshv_cpuhp_online = ret;
>  	spin_lock_init(&mshv.partitions.lock);

It looks like the spin lock is being initialized twice.

> 
> -	return r;
> +	return 0;
>  }
> 
>  static void
>  __exit mshv_exit(void)
>  {
> +	cpuhp_remove_state(mshv_cpuhp_online);
> +	free_percpu(mshv.synic_message_page);
> +
>  	misc_deregister(&mshv_dev);
>  }
> 
> --
> 2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
       [not found] ` <1605918637-12192-16-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:48   ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:48 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> To: linux-hyperv@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org; linux-kernel@vger.kernel.org; Michael Kelley
> <mikelley@microsoft.com>; viremana@linux.microsoft.com; Sunil Muthuswamy
> <sunilmut@microsoft.com>; nunodasneves@linux.microsoft.com; wei.liu@kernel.org;
> Lillian Grassin-Drake <Lillian.GrassinDrake@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>
> Subject: [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls
> 
> Introduce ioctls for getting and setting guest vcpu emulated LAPIC
> state, and xsave data.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |   8 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h |  59 ++++++
>  include/asm-generic/hyperv-tlfs.h       |  41 ++++
>  include/uapi/asm-generic/hyperv-tlfs.h  |  28 +++
>  include/uapi/linux/mshv.h               |  13 ++
>  virt/mshv/mshv_main.c                   | 262 ++++++++++++++++++++++++
>  6 files changed, 411 insertions(+)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 694f978131f9..7fd75f248eff 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -140,4 +140,12 @@ Assert interrupts in partitions that use Microsoft Hypervisor's internal
>  emulated LAPIC. This must be enabled on partition creation with the flag:
>  HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
> 
> +3.9 MSHV_GET_VP_STATE and MSHV_SET_VP_STATE
> +--------------------------
> +:Type: vp ioctl
> +:Parameters: struct mshv_vp_state
> +:Returns: 0 on success
> +
> +Get/set various vp state. Currently these can be used to get and set
> +emulated LAPIC state, and xsave data.
> 
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> index 5478d4943bfc..78758aedf23e 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -1051,4 +1051,63 @@ union hv_interrupt_control {
>  	__u64 as_uint64;
>  };
> 
> +struct hv_local_interrupt_controller_state {
> +	__u32 apic_id;
> +	__u32 apic_version;
> +	__u32 apic_ldr;
> +	__u32 apic_dfr;
> +	__u32 apic_spurious;
> +	__u32 apic_isr[8];
> +	__u32 apic_tmr[8];
> +	__u32 apic_irr[8];
> +	__u32 apic_esr;
> +	__u32 apic_icr_high;
> +	__u32 apic_icr_low;
> +	__u32 apic_lvt_timer;
> +	__u32 apic_lvt_thermal;
> +	__u32 apic_lvt_perfmon;
> +	__u32 apic_lvt_lint0;
> +	__u32 apic_lvt_lint1;
> +	__u32 apic_lvt_error;
> +	__u32 apic_lvt_cmci;
> +	__u32 apic_error_status;
> +	__u32 apic_initial_count;
> +	__u32 apic_counter_value;
> +	__u32 apic_divide_configuration;
> +	__u32 apic_remote_read;
> +};
> +
> +#define HV_XSAVE_DATA_NO_XMM_REGISTERS 1
> +
> +union hv_x64_xsave_xfem_register {
> +	__u64 as_uint64;
> +	struct {
> +		__u32 low_uint32;
> +		__u32 high_uint32;
> +	};
> +	struct {
> +		__u64 legacy_x87: 1;
> +		__u64 legacy_sse: 1;
> +		__u64 avx: 1;
> +		__u64 mpx_bndreg: 1;
> +		__u64 mpx_bndcsr: 1;
> +		__u64 avx_512_op_mask: 1;
> +		__u64 avx_512_zmmhi: 1;
> +		__u64 avx_512_zmm16_31: 1;
> +		__u64 rsvd8_9: 2;
> +		__u64 pasid: 1;
> +		__u64 cet_u: 1;
> +		__u64 cet_s: 1;
> +		__u64 rsvd13_16: 4;
> +		__u64 xtile_cfg: 1;
> +		__u64 xtile_data: 1;
> +		__u64 rsvd19_63: 45;
> +	};
> +};
> +
> +struct hv_vp_state_data_xsave {
> +	__u64 flags;
> +	union hv_x64_xsave_xfem_register states;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 2cd46241c545..4bc59a0344ce 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -167,6 +167,9 @@ struct ms_hyperv_tsc_page {
>  #define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
> +#define HVCALL_MAP_VP_STATE_PAGE			0x00e1
> +#define HVCALL_GET_VP_STATE				0x00e3
> +#define HVCALL_SET_VP_STATE				0x00e4
> 
>  #define HV_FLUSH_ALL_PROCESSORS			BIT(0)
>  #define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES	BIT(1)
> @@ -796,4 +799,42 @@ struct hv_assert_virtual_interrupt {
>  	u16 rsvd_z1;
>  };
> 
> +struct hv_vp_state_data {
> +	enum hv_get_set_vp_state_type type;
> +	u32 rsvd;
> +	struct hv_vp_state_data_xsave xsave;
> +
> +};
> +
> +struct hv_get_vp_state_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8 input_vtl;
> +	u8 rsvd0;
> +	u16 rsvd1;
> +	struct hv_vp_state_data state_data;
> +	u64 output_data_pfns[];
> +};
> +
> +union hv_get_vp_state_out {
> +	struct hv_local_interrupt_controller_state interrupt_controller_state;
> +	/* Not supported yet */
> +	/* struct hv_synthetic_timers_state synthetic_timers_state; */
> +};
> +
> +union hv_input_set_vp_state_data {
> +	u64 pfns;
> +	u8 bytes;
> +};
> +
> +struct hv_set_vp_state_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	u8 input_vtl;
> +	u8 rsvd0;
> +	u16 rsvd1;
> +	struct hv_vp_state_data state_data;
> +	union hv_input_set_vp_state_data data[];
> +};
> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> index e87389054b68..b3c84c69b73f 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -64,4 +64,32 @@ struct hv_message {
>  #define HV_MAP_GPA_EXECUTABLE           0xC
>  #define HV_MAP_GPA_PERMISSIONS_MASK     0xF
> 
> +/*
> + * For getting and setting VP state, there are two options based on the state type:
> + *
> + *     1.) Data that is accessed by PFNs in the input hypercall page. This is used
> + *         for state which may not fit into the hypercall pages.
> + *     2.) Data that is accessed directly in the input\output hypercall pages.
> + *         This is used for state that will always fit into the hypercall pages.
> + *
> + * In the future this could be dynamic based on the size if needed.
> + *
> + * Note these hypercalls have an 8-byte aligned variable header size as per the tlfs
> + */
> +
> +#define HV_GET_SET_VP_STATE_TYPE_PFN	BIT(31)
> +
> +enum hv_get_set_vp_state_type {
> +	HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE = 0,
> +
> +	HV_GET_SET_VP_STATE_XSAVE		= 1 | HV_GET_SET_VP_STATE_TYPE_PFN,
> +	/* Synthetic message page */
> +	HV_GET_SET_VP_STATE_SIM_PAGE		= 2 | HV_GET_SET_VP_STATE_TYPE_PFN,
> +	/* Synthetic interrupt event flags page. */
> +	HV_GET_SET_VP_STATE_SIEF_PAGE		= 3 | HV_GET_SET_VP_STATE_TYPE_PFN,
> +
> +	/* Synthetic timers. */
> +	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
> +};
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index faed9d065bb7..ae0bb64bbec3 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -53,6 +53,17 @@ struct mshv_assert_interrupt {
>  	__u32 vector;
>  };
> 
> +struct mshv_vp_state {
> +	enum hv_get_set_vp_state_type type;
> +	struct hv_vp_state_data_xsave xsave; /* only for xsave request */
> +
> +	__u64 buf_size; /* If xsave, must be page-aligned */
> +	union {
> +		struct hv_local_interrupt_controller_state *lapic;
> +		__u8 *bytes; /* Xsave data. must be page-aligned */
> +	} buf;
> +};
> +
>  #define MSHV_IOCTL 0xB8
> 
>  /* mshv device */
> @@ -70,5 +81,7 @@ struct mshv_assert_interrupt {
>  #define MSHV_GET_VP_REGISTERS   _IOWR(MSHV_IOCTL, 0x05, struct mshv_vp_registers)
>  #define MSHV_SET_VP_REGISTERS   _IOW(MSHV_IOCTL, 0x06, struct mshv_vp_registers)
>  #define MSHV_RUN_VP		_IOR(MSHV_IOCTL, 0x07, struct hv_message)
> +#define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
> +#define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
> 
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 9cf236ade50a..70172d9488de 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -864,6 +864,262 @@ mshv_vp_ioctl_set_regs(struct mshv_vp *vp, void __user *user_args)
>  	return ret;
>  }
> 
> +static int
> +hv_call_get_vp_state(u32 vp_index,
> +		     u64 partition_id,
> +		     enum hv_get_set_vp_state_type type,
> +		     struct hv_vp_state_data_xsave xsave,
> +		    /* Choose between pages and ret_output */
> +		     u64 page_count,
> +		     struct page **pages,
> +		     union hv_get_vp_state_out *ret_output)
> +{
> +	struct hv_get_vp_state_in *input;
> +	union hv_get_vp_state_out *output;
> +	int status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)
> +		return -EINVAL;

Nit: Stylistically, you are handling this differently from the BATCH_SIZE
macros, which do essentially the same thing: calculate how many entries
will fit in the input page. Also, use HV_HYP_PAGE_SIZE here rather than
PAGE_SIZE.
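
Something along these lines would match that style. It is a userspace sketch: the "_sketch" structs mirror the input layout from this patch with <stdint.h> types, the type field is shown as a plain 32-bit integer, and HV_GET_VP_STATE_BATCH_SIZE is a name I made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define HV_HYP_PAGE_SIZE	4096u	/* assumed 4 KiB hypervisor page */

struct hv_vp_state_data_xsave_sketch {
	uint64_t flags;
	uint64_t states;
};

struct hv_vp_state_data_sketch {
	uint32_t type;
	uint32_t rsvd;
	struct hv_vp_state_data_xsave_sketch xsave;
};

/* Mirror of struct hv_get_vp_state_in: fixed header, then a PFN list. */
struct hv_get_vp_state_in_sketch {
	uint64_t partition_id;
	uint32_t vp_index;
	uint8_t input_vtl;
	uint8_t rsvd0;
	uint16_t rsvd1;
	struct hv_vp_state_data_sketch state_data;
	uint64_t output_data_pfns[];
};

/* Hypothetical macro: number of PFNs that fit after the fixed header. */
#define HV_GET_VP_STATE_BATCH_SIZE \
	((HV_HYP_PAGE_SIZE - sizeof(struct hv_get_vp_state_in_sketch)) \
	 / sizeof(uint64_t))

/* The range check then compares counts instead of byte totals. */
static inline int get_vp_state_page_count_ok(uint64_t page_count)
{
	return page_count <= HV_GET_VP_STATE_BATCH_SIZE;
}
```

Comparing page_count against the batch size also avoids multiplying a caller-supplied count before the check.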

> +
> +	if (!page_count && !ret_output)
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_get_vp_state_in *)
> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
> +		output = (union hv_get_vp_state_out *)
> +				(*this_cpu_ptr(hyperv_pcpu_output_arg));
> +		memset(input, 0, sizeof(*input));
> +		memset(output, 0, sizeof(*output));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data.type = type;
> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
> +		for (i = 0; i < page_count; i++)
> +			input->output_data_pfns[i] =
> +				page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;
> +
> +		control = (HVCALL_GET_VP_STATE) |
> +			  (page_count << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, output) &
> +			 HV_HYPERCALL_RESULT_MASK;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			else if (ret_output)
> +				memcpy(ret_output, output, sizeof(*output));
> +
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int
> +hv_call_set_vp_state(u32 vp_index,
> +		     u64 partition_id,
> +		     enum hv_get_set_vp_state_type type,
> +		     struct hv_vp_state_data_xsave xsave,
> +		    /* Choose between pages and bytes */
> +		     u64 page_count,
> +		     struct page **pages,
> +		     u32 num_bytes,
> +		     u8 *bytes)
> +{
> +	struct hv_set_vp_state_in *input;
> +	int status;
> +	int i;
> +	u64 control;
> +	unsigned long flags;
> +	int ret = 0;
> +	u16 varhead_sz;
> +
> +	if (sizeof(*input) + (page_count * sizeof(u64)) > PAGE_SIZE)

Same comment as above.

> +		return -EINVAL;
> +	if (sizeof(*input) + num_bytes > PAGE_SIZE)

Use HV_HYP_PAGE_SIZE.

> +		return -EINVAL;
> +
> +	if (num_bytes)
> +		/* round up to 8 and divide by 8 */
> +		varhead_sz = (num_bytes + 7) >> 3;
> +	else if (page_count)
> +		varhead_sz =  page_count;
> +	else
> +		return -EINVAL;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_set_vp_state_in *)
> +				(*this_cpu_ptr(hyperv_pcpu_input_arg));
> +		memset(input, 0, sizeof(*input));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->state_data.type = type;
> +		memcpy(&input->state_data.xsave, &xsave, sizeof(xsave));
> +		if (num_bytes) {
> +			memcpy((u8 *)input->data, bytes, num_bytes);
> +		} else {
> +			for (i = 0; i < page_count; i++)
> +				input->data[i].pfns =
> +					page_to_pfn(pages[i]) & HV_MAP_GPA_MASK;

Same comment as in earlier patch about GPA_MASK.  Also, this doesn't work
if PAGE_SIZE != HV_HYP_PAGE_SIZE, though it may be fine to not handle that case
for now.
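
If that case is handled later, each guest page would need to be expanded into several hypervisor-sized PFNs. A rough userspace sketch of that expansion, assuming a 4 KiB hypervisor page and a guest page shift passed in by the caller (the helper name and signature are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT	12	/* assumed 4 KiB hypervisor page */

/*
 * Hypothetical helper: write the hypervisor PFNs that cover one guest page
 * into hv_pfns[] and return how many were written. Returns 0 if the guest
 * page is smaller than a hypervisor page or the output array is too small.
 */
static size_t linux_pfn_to_hv_pfns(uint64_t pfn, unsigned int page_shift,
				   uint64_t *hv_pfns, size_t max)
{
	unsigned int delta;
	size_t n, i;

	if (page_shift < HV_HYP_PAGE_SHIFT)
		return 0;

	delta = page_shift - HV_HYP_PAGE_SHIFT;
	n = (size_t)1 << delta;
	if (n > max)
		return 0;

	for (i = 0; i < n; i++)
		hv_pfns[i] = (pfn << delta) + i;

	return n;
}
```

With 4 KiB guest pages this degenerates to a single entry equal to the guest PFN, which is what the loop in the patch assumes today.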

> +		}
> +
> +		control = (HVCALL_SET_VP_STATE) |
> +			  (varhead_sz << HV_HYPERCALL_VARHEAD_OFFSET);
> +
> +		status = hv_do_hypercall(control, input, NULL) &
> +			 HV_HYPERCALL_RESULT_MASK;
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status != HV_STATUS_SUCCESS)
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +					    partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_get_set_state_pfn(struct mshv_vp *vp,
> +				struct mshv_vp_state *args,
> +				bool is_set)
> +{
> +	u64 page_count, remaining;
> +	int completed;
> +	struct page **pages;
> +	long ret;
> +	unsigned long u_buf;
> +
> +	/* Buffer must be page aligned */
> +	if (args->buf_size & (PAGE_SIZE - 1) ||
> +	    (u64)args->buf.bytes & (PAGE_SIZE - 1))
> +		return -EINVAL;

Use the PAGE_ALIGNED() macro.
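
For illustration, a userspace sketch of the same check. The macros are redefined locally to approximate the kernel's, with a 4 KiB PAGE_SIZE assumed where the platform headers do not define one, and vp_state_buf_ok() is a hypothetical helper:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins approximating the kernel macros. */
#ifndef PAGE_SIZE
#define PAGE_SIZE		4096UL
#endif
#ifndef IS_ALIGNED
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#endif
#ifndef PAGE_ALIGNED
#define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
#endif

/* Hypothetical helper: both the buffer size and address must be aligned. */
static inline bool vp_state_buf_ok(uint64_t buf_size, const void *buf)
{
	return PAGE_ALIGNED(buf_size) && PAGE_ALIGNED(buf);
}
```

The condition in the patch would then read: if (!PAGE_ALIGNED(args->buf_size) || !PAGE_ALIGNED(args->buf.bytes)) return -EINVAL;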

> +
> +	if (!access_ok(args->buf.bytes, args->buf_size))
> +		return -EFAULT;
> +
> +	/* Pin user pages so hypervisor can copy directly to them */
> +	page_count = args->buf_size >> PAGE_SHIFT;
> +	pages = kcalloc(page_count, sizeof(struct page *), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	remaining = page_count;
> +	u_buf = (unsigned long)args->buf.bytes;
> +	while (remaining) {
> +		completed = pin_user_pages_fast(
> +				u_buf,
> +				remaining,
> +				FOLL_WRITE,
> +				&pages[page_count - remaining]);
> +		if (completed < 0) {
> +			pr_err("%s: failed to pin user pages error %i\n",
> +			       __func__, completed);
> +			ret = completed;
> +			goto unpin_pages;
> +		}
> +		remaining -= completed;
> +		u_buf += completed * PAGE_SIZE;
> +	}
> +
> +	if (is_set)
> +		ret = hv_call_set_vp_state(vp->index,
> +					   vp->partition->id,
> +					   args->type, args->xsave,
> +					   page_count, pages,
> +					   0, NULL);
> +	else
> +		ret = hv_call_get_vp_state(vp->index,
> +					   vp->partition->id,
> +					   args->type, args->xsave,
> +					   page_count, pages,
> +					   NULL);
> +
> +unpin_pages:
> +	unpin_user_pages(pages, page_count - remaining);
> +	kfree(pages);
> +	return ret;
> +}
> +
> +static long
> +mshv_vp_ioctl_get_set_state(struct mshv_vp *vp, void __user *user_args, bool is_set)
> +{
> +	struct mshv_vp_state args;
> +	long ret = 0;
> +	union hv_get_vp_state_out vp_state;
> +
> +	if (copy_from_user(&args, user_args, sizeof(args)))
> +		return -EFAULT;
> +
> +	/* For now just support these */
> +	if (args.type != HV_GET_SET_VP_STATE_LOCAL_INTERRUPT_CONTROLLER_STATE &&
> +	    args.type != HV_GET_SET_VP_STATE_XSAVE)
> +		return -EINVAL;
> +
> +	/* If we need to pin pfns, delegate to helper */
> +	if (args.type & HV_GET_SET_VP_STATE_TYPE_PFN)
> +		return mshv_vp_ioctl_get_set_state_pfn(vp, &args, is_set);
> +
> +	if (args.buf_size < sizeof(vp_state))
> +		return -EINVAL;
> +
> +	if (is_set) {
> +		if (copy_from_user(
> +				&vp_state,
> +				args.buf.lapic,
> +				sizeof(vp_state)))
> +			return -EFAULT;
> +
> +		return hv_call_set_vp_state(vp->index,
> +					    vp->partition->id,
> +					    args.type, args.xsave,
> +					    0, NULL,
> +					    sizeof(vp_state),
> +					    (u8 *)&vp_state);
> +	}
> +
> +	ret = hv_call_get_vp_state(vp->index,
> +				   vp->partition->id,
> +				   args.type, args.xsave,
> +				   0, NULL,
> +				   &vp_state);
> +
> +	if (ret)
> +		return ret;
> +
> +	if (copy_to_user(args.buf.lapic,
> +			 &vp_state.interrupt_controller_state,
> +			 sizeof(vp_state.interrupt_controller_state)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> 
>  static long
>  mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> @@ -884,6 +1140,12 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	case MSHV_SET_VP_REGISTERS:
>  		r = mshv_vp_ioctl_set_regs(vp, (void __user *)arg);
>  		break;
> +	case MSHV_GET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, false);
> +		break;
> +	case MSHV_SET_VP_STATE:
> +		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
> +		break;
>  	default:
>  		r = -ENOTTY;
>  		break;
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC PATCH 16/18] virt/mshv: mmap vp register page
       [not found] ` <1605918637-12192-17-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-08 19:49   ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-02-08 19:49 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:31 PM
> 
> Introduce mmap interface for a virtual processor, exposing a page for
> setting and getting common registers while the VP is suspended.
> 
> This provides a more performant and convenient way to get and set these
> registers in the context of a vmm's run-loop.
> 
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         | 11 ++++
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 74 ++++++++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       | 10 +++
>  include/linux/mshv.h                    |  1 +
>  include/uapi/asm-generic/hyperv-tlfs.h  |  5 ++
>  include/uapi/linux/mshv.h               | 12 ++++
>  virt/mshv/mshv_main.c                   | 82 +++++++++++++++++++++++++
>  7 files changed, 195 insertions(+)
> 
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 7fd75f248eff..89c276a8778f 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -149,3 +149,14 @@ HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED
>  Get/set various vp state. Currently these can be used to get and set
>  emulated LAPIC state, and xsave data.
> 
> +3.10 mmap(vp)
> +-------------
> +:Type: vp mmap
> +:Parameters: offset should be HV_VP_MMAP_REGISTERS_OFFSET
> +:Returns: 0 on success
> +
> +Maps a page into userspace that can be used to get and set common registers
> +while the vp is suspended.
> +The page is laid out in struct hv_vp_register_page in asm/hyperv-tlfs.h.
> +

I'm assuming there's no support for the corresponding munmap().
What happens if munmap is called?  Does it just fail and the page remains
mapped?

> +
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> index 78758aedf23e..a241178567ff 100644
> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -1110,4 +1110,78 @@ struct hv_vp_state_data_xsave {
>  	union hv_x64_xsave_xfem_register states;
>  };
> 
> +/* Bits for dirty mask of hv_vp_register_page */
> +#define HV_X64_REGISTER_CLASS_GENERAL	0
> +#define HV_X64_REGISTER_CLASS_IP	1
> +#define HV_X64_REGISTER_CLASS_XMM	2
> +#define HV_X64_REGISTER_CLASS_SEGMENT	3
> +#define HV_X64_REGISTER_CLASS_FLAGS	4
> +
> +#define HV_VP_REGISTER_PAGE_VERSION_1	1u
> +
> +struct hv_vp_register_page {
> +	__u16 version;
> +	bool isvalid;

As with enum, avoid the type "bool" in data structures shared with
Hyper-V. Its size is not fixed by the C standard, so use a fixed-width
type such as __u8 instead.
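
A rough sketch of what I mean, covering only the first few fields. The
struct name is a hypothetical stand-in, and it uses <stdint.h> types so
the fragment compiles on its own; the uapi header would use __u8, __u16
and __u32 instead:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the leading fields of
 * struct hv_vp_register_page, with the flag stored in a fixed-width byte.
 */
struct vp_register_page_header {
	uint16_t version;
	uint8_t  isvalid;	/* 0 or 1; one byte wide on every compiler */
	uint8_t  rsvdz;
	uint32_t dirty;
};

/* Compile-time checks that the layout is the one both sides expect. */
_Static_assert(offsetof(struct vp_register_page_header, isvalid) == 2,
	       "isvalid must sit at byte offset 2");
_Static_assert(offsetof(struct vp_register_page_header, dirty) == 4,
	       "dirty must sit at byte offset 4");
_Static_assert(sizeof(struct vp_register_page_header) == 8,
	       "header must be exactly 8 bytes");
```

With a fixed-width field the layout no longer depends on how a given
compiler represents bool.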

> +	__u8 rsvdz;
> +	__u32 dirty;
> +	union {
> +		struct {
> +			__u64 rax;
> +			__u64 rcx;
> +			__u64 rdx;
> +			__u64 rbx;
> +			__u64 rsp;
> +			__u64 rbp;
> +			__u64 rsi;
> +			__u64 rdi;
> +			__u64 r8;
> +			__u64 r9;
> +			__u64 r10;
> +			__u64 r11;
> +			__u64 r12;
> +			__u64 r13;
> +			__u64 r14;
> +			__u64 r15;
> +		};
> +
> +		__u64 gp_registers[16];
> +	};
> +	__u64 rip;
> +	__u64 rflags;
> +	union {
> +		struct {
> +			struct hv_u128 xmm0;
> +			struct hv_u128 xmm1;
> +			struct hv_u128 xmm2;
> +			struct hv_u128 xmm3;
> +			struct hv_u128 xmm4;
> +			struct hv_u128 xmm5;
> +		};
> +
> +		struct hv_u128 xmm_registers[6];
> +	};
> +	union {
> +		struct {
> +			struct hv_x64_segment_register es;
> +			struct hv_x64_segment_register cs;
> +			struct hv_x64_segment_register ss;
> +			struct hv_x64_segment_register ds;
> +			struct hv_x64_segment_register fs;
> +			struct hv_x64_segment_register gs;
> +		};
> +
> +		struct hv_x64_segment_register segment_registers[6];
> +	};
> +	/* read only */
> +	__u64 cr0;
> +	__u64 cr3;
> +	__u64 cr4;
> +	__u64 cr8;
> +	__u64 efer;
> +	__u64 dr7;
> +	union hv_x64_pending_interruption_register pending_interruption;
> +	union hv_x64_interrupt_state_register interrupt_state;
> +	__u64 instruction_emulation_hints;
> +};
> +
>  #endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 4bc59a0344ce..9eed4b869110 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -837,4 +837,14 @@ struct hv_set_vp_state_in {
>  	union hv_input_set_vp_state_data data[];
>  };
> 
> +struct hv_map_vp_state_page_in {
> +	u64 partition_id;
> +	u32 vp_index;
> +	enum hv_vp_state_page_type type;
> +};
> +
> +struct hv_map_vp_state_page_out {
> +	u64 map_location; /* page number */
> +};
> +
>  #endif
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index 3933d80294f1..33f4d0cfee11 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -20,6 +20,7 @@ struct mshv_vp {
>  	u32 index;
>  	struct mshv_partition *partition;
>  	struct mutex mutex;
> +	struct page *register_page;
>  	struct {
>  		struct semaphore sem;
>  		struct task_struct *task;
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> index b3c84c69b73f..a747f39b132a 100644
> --- a/include/uapi/asm-generic/hyperv-tlfs.h
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -92,4 +92,9 @@ enum hv_get_set_vp_state_type {
>  	HV_GET_SET_VP_STATE_SYNTHETIC_TIMERS	= 4,
>  };
> 
> +enum hv_vp_state_page_type {
> +	HV_VP_STATE_PAGE_REGISTERS = 0,
> +	HV_VP_STATE_PAGE_COUNT
> +};
> +
>  #endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index ae0bb64bbec3..8537ff29aee5 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -13,6 +13,8 @@
> 
>  #define MSHV_VERSION	0x0
> 
> +#define MSHV_VP_MMAP_REGISTERS_OFFSET (HV_VP_STATE_PAGE_REGISTERS * 0x1000)
> +
>  struct mshv_create_partition {
>  	__u64 flags;
>  	struct hv_partition_creation_properties partition_creation_properties;
> @@ -84,4 +86,14 @@ struct mshv_vp_state {
>  #define MSHV_GET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0A, struct mshv_vp_state)
>  #define MSHV_SET_VP_STATE	_IOWR(MSHV_IOCTL, 0x0B, struct mshv_vp_state)
> 
> +/* register page mapping example:
> + * struct hv_vp_register_page *regs = mmap(NULL,
> + *					   4096,
> + *					   PROT_READ | PROT_WRITE,
> + *					   MAP_SHARED,
> + *					   vp_fd,
> + *					   HV_VP_MMAP_REGISTERS_OFFSET);
> + * munmap(regs, 4096);
> + */
> +
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 70172d9488de..a597254fa4f4 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -43,11 +43,18 @@ static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned
>  static int mshv_dev_open(struct inode *inode, struct file *filp);
>  static int mshv_dev_release(struct inode *inode, struct file *filp);
>  static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma);
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf);
> +
> +static const struct vm_operations_struct mshv_vp_vm_ops = {
> +	.fault = mshv_vp_fault,
> +};
> 
>  static const struct file_operations mshv_vp_fops = {
>  	.release = mshv_vp_release,
>  	.unlocked_ioctl = mshv_vp_ioctl,
>  	.llseek = noop_llseek,
> +	.mmap = mshv_vp_mmap,
>  };
> 
>  static const struct file_operations mshv_partition_fops = {
> @@ -499,6 +506,47 @@ hv_call_set_vp_registers(u32 vp_index,
>  	return -hv_status_to_errno(status);
>  }
> 
> +static int
> +hv_call_map_vp_state_page(u32 vp_index, u64 partition_id,
> +			  struct page **state_page)
> +{
> +	struct hv_map_vp_state_page_in *input;
> +	struct hv_map_vp_state_page_out *output;
> +	int status;
> +	int ret;
> +	unsigned long flags;
> +
> +	do {
> +		local_irq_save(flags);
> +		input = (struct hv_map_vp_state_page_in *)(*this_cpu_ptr(
> +			hyperv_pcpu_input_arg));
> +		output = (struct hv_map_vp_state_page_out *)(*this_cpu_ptr(
> +			hyperv_pcpu_output_arg));
> +
> +		input->partition_id = partition_id;
> +		input->vp_index = vp_index;
> +		input->type = HV_VP_STATE_PAGE_REGISTERS;
> +		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE,
> +						   input, output);
> +
> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> +			if (status == HV_STATUS_SUCCESS)
> +				*state_page = pfn_to_page(output->map_location);
> +			else
> +				pr_err("%s: %s\n", __func__,
> +				       hv_status_to_string(status));
> +			local_irq_restore(flags);
> +			ret = -hv_status_to_errno(status);
> +			break;
> +		}
> +		local_irq_restore(flags);
> +
> +		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
>  static void
>  mshv_isr(void)
>  {
> @@ -1155,6 +1203,40 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	return r;
>  }
> 
> +static vm_fault_t mshv_vp_fault(struct vm_fault *vmf)
> +{
> +	struct mshv_vp *vp = vmf->vma->vm_file->private_data;
> +
> +	vmf->page = vp->register_page;
> +
> +	return 0;
> +}
> +
> +static int mshv_vp_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct mshv_vp *vp = file->private_data;
> +
> +	if (vma->vm_pgoff != MSHV_VP_MMAP_REGISTERS_OFFSET)
> +		return -EINVAL;
> +
> +	if (mutex_lock_killable(&vp->mutex))
> +		return -EINTR;
> +
> +	if (!vp->register_page) {
> +		ret = hv_call_map_vp_state_page(vp->index,
> +						vp->partition->id,
> +						&vp->register_page);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	mutex_unlock(&vp->mutex);
> +
> +	vma->vm_ops = &mshv_vp_vm_ops;
> +	return 0;
> +}
> +
>  static int
>  mshv_vp_release(struct inode *inode, struct file *filp)
>  {
> --
> 2.25.1


* Re: [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes
       [not found] ` <1605918637-12192-2-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-09 13:04   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 17+ messages in thread
From: Vitaly Kuznetsov @ 2021-02-09 13:04 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, linux-kernel, mikelley, nunodasneves, sunilmut,
	virtualization, viremana, ligrassi

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> Return linux-friendly error codes from hypercall wrapper functions.
> This will be needed in the mshv module.
>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_proc.c         | 30 ++++++++++++++++++++++++++---
>  arch/x86/include/asm/mshyperv.h   |  1 +
>  include/asm-generic/hyperv-tlfs.h | 32 +++++++++++++++++++++----------
>  3 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
> index 0fd972c9129a..8f86f8e86748 100644
> --- a/arch/x86/hyperv/hv_proc.c
> +++ b/arch/x86/hyperv/hv_proc.c
> @@ -18,6 +18,30 @@
>  #define HV_DEPOSIT_MAX_ORDER (8)
>  #define HV_DEPOSIT_MAX (1 << HV_DEPOSIT_MAX_ORDER)
>  
> +int hv_status_to_errno(int hv_status)
> +{
> +	switch (hv_status) {
> +	case HV_STATUS_SUCCESS:
> +		return 0;
> +	case HV_STATUS_INVALID_PARAMETER:
> +	case HV_STATUS_UNKNOWN_PROPERTY:
> +	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
> +	case HV_STATUS_INVALID_VP_INDEX:
> +	case HV_STATUS_INVALID_REGISTER_VALUE:
> +	case HV_STATUS_INVALID_LP_INDEX:
> +		return EINVAL;
> +	case HV_STATUS_ACCESS_DENIED:
> +	case HV_STATUS_OPERATION_DENIED:
> +		return EACCES;
> +	case HV_STATUS_NOT_ACKNOWLEDGED:
> +	case HV_STATUS_INVALID_VP_STATE:
> +	case HV_STATUS_INVALID_PARTITION_STATE:
> +		return EBADFD;
> +	}
> +	return ENOTRECOVERABLE;
> +}
> +EXPORT_SYMBOL_GPL(hv_status_to_errno);
> +
>  /*
>   * Deposits exact number of pages
>   * Must be called with interrupts enabled
> @@ -99,7 +123,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>  
>  	if (status != HV_STATUS_SUCCESS) {
>  		pr_err("Failed to deposit pages: %d\n", status);
> -		ret = status;
> +		ret = -hv_status_to_errno(status);

"-hv_status_to_errno" looks weird; could we just return
'-EINVAL'/'-EACCES'/... from hv_status_to_errno() instead?
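
Something along these lines, as a minimal sketch. The status values are
copied from this patch only so the fragment stands alone, and the
mapping itself is unchanged apart from the sign:

```c
#include <errno.h>

/* Status values as defined by this patch in asm-generic/hyperv-tlfs.h. */
#define HV_STATUS_SUCCESS			0x0
#define HV_STATUS_INVALID_PARAMETER		0x5
#define HV_STATUS_ACCESS_DENIED			0x6
#define HV_STATUS_INVALID_PARTITION_STATE	0x7
#define HV_STATUS_OPERATION_DENIED		0x8
#define HV_STATUS_UNKNOWN_PROPERTY		0x9
#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
#define HV_STATUS_INVALID_VP_INDEX		0xE
#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
#define HV_STATUS_INVALID_VP_STATE		0x15
#define HV_STATUS_INVALID_LP_INDEX		0x41
#define HV_STATUS_INVALID_REGISTER_VALUE	0x50

/* Sketch: return a negative errno directly, so callers need no unary minus. */
int hv_status_to_errno(int hv_status)
{
	switch (hv_status) {
	case HV_STATUS_SUCCESS:
		return 0;
	case HV_STATUS_INVALID_PARAMETER:
	case HV_STATUS_UNKNOWN_PROPERTY:
	case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE:
	case HV_STATUS_INVALID_VP_INDEX:
	case HV_STATUS_INVALID_REGISTER_VALUE:
	case HV_STATUS_INVALID_LP_INDEX:
		return -EINVAL;
	case HV_STATUS_ACCESS_DENIED:
	case HV_STATUS_OPERATION_DENIED:
		return -EACCES;
	case HV_STATUS_NOT_ACKNOWLEDGED:
	case HV_STATUS_INVALID_VP_STATE:
	case HV_STATUS_INVALID_PARTITION_STATE:
		return -EBADFD;
	}
	/* Any status without an explicit mapping. */
	return -ENOTRECOVERABLE;
}
```

The call sites would then read "ret = hv_status_to_errno(status);".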

>  		goto err_free_allocations;
>  	}
>  
> @@ -155,7 +179,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>  			if (status != HV_STATUS_SUCCESS) {
>  				pr_err("%s: cpu %u apic ID %u, %d\n", __func__,
>  				       lp_index, apic_id, status);
> -				ret = status;
> +				ret = -hv_status_to_errno(status);
>  			}
>  			break;
>  		}
> @@ -203,7 +227,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>  			if (status != HV_STATUS_SUCCESS) {
>  				pr_err("%s: vcpu %u, lp %u, %d\n", __func__,
>  				       vp_index, flags, status);
> -				ret = status;
> +				ret = -hv_status_to_errno(status);
>  			}
>  			break;
>  		}
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index cbee72550a12..eb75faa4d4c5 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -243,6 +243,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
>  int hyperv_fill_flush_guest_mapping_list(
>  		struct hv_guest_mapping_flush_list *flush,
>  		u64 start_gfn, u64 end_gfn);
> +int hv_status_to_errno(int hv_status);
>  
>  extern bool hv_root_partition;
>  
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index dd385c6a71b5..445244192fa4 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -181,16 +181,28 @@ enum HV_GENERIC_SET_FORMAT {
>  #define HV_HYPERCALL_REP_START_MASK	GENMASK_ULL(59, 48)
>  
>  /* hypercall status code */
> -#define HV_STATUS_SUCCESS			0
> -#define HV_STATUS_INVALID_HYPERCALL_CODE	2
> -#define HV_STATUS_INVALID_HYPERCALL_INPUT	3
> -#define HV_STATUS_INVALID_ALIGNMENT		4
> -#define HV_STATUS_INVALID_PARAMETER		5
> -#define HV_STATUS_OPERATION_DENIED		8
> -#define HV_STATUS_INSUFFICIENT_MEMORY		11
> -#define HV_STATUS_INVALID_PORT_ID		17
> -#define HV_STATUS_INVALID_CONNECTION_ID		18
> -#define HV_STATUS_INSUFFICIENT_BUFFERS		19
> +#define HV_STATUS_SUCCESS			0x0
> +#define HV_STATUS_INVALID_HYPERCALL_CODE	0x2
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x3
> +#define HV_STATUS_INVALID_ALIGNMENT		0x4
> +#define HV_STATUS_INVALID_PARAMETER		0x5
> +#define HV_STATUS_ACCESS_DENIED			0x6
> +#define HV_STATUS_INVALID_PARTITION_STATE	0x7
> +#define HV_STATUS_OPERATION_DENIED		0x8
> +#define HV_STATUS_UNKNOWN_PROPERTY		0x9
> +#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0xA
> +#define HV_STATUS_INSUFFICIENT_MEMORY		0xB
> +#define HV_STATUS_INVALID_PARTITION_ID		0xD
> +#define HV_STATUS_INVALID_VP_INDEX		0xE
> +#define HV_STATUS_NOT_FOUND			0x10
> +#define HV_STATUS_INVALID_PORT_ID		0x11
> +#define HV_STATUS_INVALID_CONNECTION_ID		0x12
> +#define HV_STATUS_INSUFFICIENT_BUFFERS		0x13
> +#define HV_STATUS_NOT_ACKNOWLEDGED		0x14
> +#define HV_STATUS_INVALID_VP_STATE		0x15
> +#define HV_STATUS_NO_RESOURCES			0x1D
> +#define HV_STATUS_INVALID_LP_INDEX		0x41
> +#define HV_STATUS_INVALID_REGISTER_VALUE	0x50
>  
>  /*
>   * The Hyper-V TimeRefCount register and the TSC

-- 
Vitaly


* Re: [RFC PATCH 05/18] virt/mshv: create partition ioctl
       [not found] ` <1605918637-12192-6-git-send-email-nunodasneves@linux.microsoft.com>
@ 2021-02-09 13:15   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 17+ messages in thread
From: Vitaly Kuznetsov @ 2021-02-09 13:15 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, linux-kernel, mikelley, nunodasneves, sunilmut,
	virtualization, viremana, ligrassi

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> Add MSHV_CREATE_PARTITION, which creates an fd to track a new partition.
> Partition is not yet created in the hypervisor itself.
> Introduce header files for userspace-facing hyperv structures.
>
> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  Documentation/virt/mshv/api.rst         |  12 ++
>  arch/x86/include/asm/hyperv-tlfs.h      |   1 +
>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 124 ++++++++++++++++
>  include/asm-generic/hyperv-tlfs.h       |   1 +
>  include/linux/mshv.h                    |  16 +++
>  include/uapi/asm-generic/hyperv-tlfs.h  |  14 ++
>  include/uapi/linux/mshv.h               |   7 +
>  virt/mshv/mshv_main.c                   | 179 +++++++++++++++++++++---
>  8 files changed, 338 insertions(+), 16 deletions(-)
>  create mode 100644 arch/x86/include/uapi/asm/hyperv-tlfs.h
>  create mode 100644 include/uapi/asm-generic/hyperv-tlfs.h
>
> diff --git a/Documentation/virt/mshv/api.rst b/Documentation/virt/mshv/api.rst
> index 82e32de48d03..ce651a1738e0 100644
> --- a/Documentation/virt/mshv/api.rst
> +++ b/Documentation/virt/mshv/api.rst
> @@ -39,6 +39,9 @@ root partition can use mshv APIs to create guest partitions.
>  
>  The module is named mshv and can be configured with CONFIG_HYPERV_ROOT_API.
>  
> +The uapi header files you need are linux/mshv.h, asm/hyperv-tlfs.h, and
> +asm-generic/hyperv-tlfs.h.
> +
>  Mshv is file descriptor-based, following a similar pattern to KVM.
>  
>  To get a handle to the mshv driver, use open("/dev/mshv").
> @@ -60,3 +63,12 @@ if one of them matches.
>  This /dev/mshv file descriptor will remain 'locked' to that version as long as
>  it is open - this ioctl can only be called once per open.
>  
> +3.2 MSHV_CREATE_PARTITION
> +-------------------------
> +:Type: /dev/mshv ioctl
> +:Parameters: struct mshv_create_partition
> +:Returns: partition file descriptor, or -1 on failure
> +
> +This ioctl creates a guest partition, returning a file descriptor to use as a
> +handle for partition ioctls.
> +
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 592c75e51e0f..4cd44ae9bffb 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -11,6 +11,7 @@
>  
>  #include <linux/types.h>
>  #include <asm/page.h>
> +#include <uapi/asm/hyperv-tlfs.h>
>  /*
>   * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent
>   * is set by CPUID(HvCpuIdFunctionVersionAndFeatures).
> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> new file mode 100644
> index 000000000000..72150c25ffe6
> --- /dev/null
> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> @@ -0,0 +1,124 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_X86_HYPERV_TLFS_USER_H
> +#define _UAPI_ASM_X86_HYPERV_TLFS_USER_H
> +
> +#include <linux/types.h>
> +
> +#define HV_PARTITION_PROCESSOR_FEATURE_BANKS 2
> +
> +union hv_partition_processor_features {
> +	struct {
> +		__u64 sse3_support:1;
> +		__u64 lahf_sahf_support:1;
> +		__u64 ssse3_support:1;
> +		__u64 sse4_1_support:1;
> +		__u64 sse4_2_support:1;
> +		__u64 sse4a_support:1;
> +		__u64 xop_support:1;
> +		__u64 pop_cnt_support:1;
> +		__u64 cmpxchg16b_support:1;
> +		__u64 altmovcr8_support:1;
> +		__u64 lzcnt_support:1;
> +		__u64 mis_align_sse_support:1;
> +		__u64 mmx_ext_support:1;
> +		__u64 amd3dnow_support:1;
> +		__u64 extended_amd3dnow_support:1;
> +		__u64 page_1gb_support:1;
> +		__u64 aes_support:1;
> +		__u64 pclmulqdq_support:1;
> +		__u64 pcid_support:1;
> +		__u64 fma4_support:1;
> +		__u64 f16c_support:1;
> +		__u64 rd_rand_support:1;
> +		__u64 rd_wr_fs_gs_support:1;
> +		__u64 smep_support:1;
> +		__u64 enhanced_fast_string_support:1;
> +		__u64 bmi1_support:1;
> +		__u64 bmi2_support:1;
> +		__u64 hle_support_deprecated:1;
> +		__u64 rtm_support_deprecated:1;
> +		__u64 movbe_support:1;
> +		__u64 npiep1_support:1;
> +		__u64 dep_x87_fpu_save_support:1;
> +		__u64 rd_seed_support:1;
> +		__u64 adx_support:1;
> +		__u64 intel_prefetch_support:1;
> +		__u64 smap_support:1;
> +		__u64 hle_support:1;
> +		__u64 rtm_support:1;
> +		__u64 rdtscp_support:1;
> +		__u64 clflushopt_support:1;
> +		__u64 clwb_support:1;
> +		__u64 sha_support:1;
> +		__u64 x87_pointers_saved_support:1;
> +		__u64 invpcid_support:1;
> +		__u64 ibrs_support:1;
> +		__u64 stibp_support:1;
> +		__u64 ibpb_support: 1;
> +		__u64 unrestricted_guest_support:1;
> +		__u64 mdd_support:1;
> +		__u64 fast_short_rep_mov_support:1;
> +		__u64 l1dcache_flush_support:1;
> +		__u64 rdcl_no_support:1;
> +		__u64 ibrs_all_support:1;
> +		__u64 skip_l1df_support:1;
> +		__u64 ssb_no_support:1;
> +		__u64 rsb_a_no_support:1;
> +		__u64 virt_spec_ctrl_support:1;
> +		__u64 rd_pid_support:1;
> +		__u64 umip_support:1;
> +		__u64 mbs_no_support:1;
> +		__u64 mb_clear_support:1;
> +		__u64 taa_no_support:1;
> +		__u64 tsx_ctrl_support:1;
> +		/*
> +		 * N.B. The final processor feature bit in bank 0 is reserved to
> +		 * simplify potential downlevel backports.
> +		 */
> +		__u64 reserved_bank0:1;
> +
> +		/* N.B. Begin bank 1 processor features. */
> +		__u64 acount_mcount_support:1;
> +		__u64 tsc_invariant_support:1;
> +		__u64 cl_zero_support:1;
> +		__u64 rdpru_support:1;
> +		__u64 la57_support:1;
> +		__u64 mbec_support:1;
> +		__u64 nested_virt_support:1;
> +		__u64 psfd_support:1;
> +		__u64 cet_ss_support:1;
> +		__u64 cet_ibt_support:1;
> +		__u64 vmx_exception_inject_support:1;
> +		__u64 enqcmd_support:1;
> +		__u64 umwait_tpause_support:1;
> +		__u64 movdiri_support:1;
> +		__u64 movdir64b_support:1;
> +		__u64 cldemote_support:1;
> +		__u64 serialize_support:1;
> +		__u64 tsc_deadline_tmr_support:1;
> +		__u64 tsc_adjust_support:1;
> +		__u64 fzlrep_movsb:1;
> +		__u64 fsrep_stosb:1;
> +		__u64 fsrep_cmpsb:1;
> +		__u64 reserved_bank1:42;
> +	};
> +	__u64 as_uint64[HV_PARTITION_PROCESSOR_FEATURE_BANKS];
> +};
> +
> +union hv_partition_processor_xsave_features {
> +	struct {
> +		__u64 xsave_support : 1;
> +		__u64 xsaveopt_support : 1;
> +		__u64 avx_support : 1;
> +		__u64 reserved1 : 61;
> +	};
> +	__u64 as_uint64;
> +};
> +
> +struct hv_partition_creation_properties {
> +	union hv_partition_processor_features disabled_processor_features;
> +	union hv_partition_processor_xsave_features
> +		disabled_processor_xsave_features;
> +};
> +
> +#endif
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 05b9dc9896ab..2ff580780ce4 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -12,6 +12,7 @@
>  #include <linux/types.h>
>  #include <linux/bits.h>
>  #include <linux/time64.h>
> +#include <uapi/asm-generic/hyperv-tlfs.h>
>  
>  /*
>   * While not explicitly listed in the TLFS, Hyper-V always runs with a page size
> diff --git a/include/linux/mshv.h b/include/linux/mshv.h
> index a0982fe2c0b8..fc4f35089b2c 100644
> --- a/include/linux/mshv.h
> +++ b/include/linux/mshv.h
> @@ -6,6 +6,22 @@
>   * Microsoft Hypervisor root partition driver for /dev/mshv
>   */
>  
> +#include <linux/spinlock.h>
>  #include <uapi/linux/mshv.h>
>  
> +#define MSHV_MAX_PARTITIONS		128
> +
> +struct mshv_partition {
> +	u64 id;
> +	refcount_t ref_count;
> +};
> +
> +struct mshv {
> +	struct {
> +		spinlock_t lock;
> +		u64 count;
> +		struct mshv_partition *array[MSHV_MAX_PARTITIONS];
> +	} partitions;
> +};
> +
>  #endif
> diff --git a/include/uapi/asm-generic/hyperv-tlfs.h b/include/uapi/asm-generic/hyperv-tlfs.h
> new file mode 100644
> index 000000000000..140cc0b4f98f
> --- /dev/null
> +++ b/include/uapi/asm-generic/hyperv-tlfs.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
> +#define _UAPI_ASM_GENERIC_HYPERV_TLFS_USER_H
> +
> +#ifndef BIT
> +#define BIT(X)	(1ULL << (X))
> +#endif
> +
> +#define HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST                BIT(0)
> +#define HV_PARTITION_CREATION_FLAG_GPA_LARGE_PAGES_DISABLED         BIT(3)
> +#define HV_PARTITION_CREATION_FLAG_GPA_SUPER_PAGES_ENABLED          BIT(4)
> +#define HV_PARTITION_CREATION_FLAG_LAPIC_ENABLED                    BIT(13)
> +
> +#endif
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index dd30fc2f0a80..3788f8bc5caa 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -8,12 +8,19 @@
>   */
>  
>  #include <linux/types.h>
> +#include <asm/hyperv-tlfs.h>
>  
>  #define MSHV_VERSION	0x0
>  
> +struct mshv_create_partition {
> +	__u64 flags;
> +	struct hv_partition_creation_properties partition_creation_properties;
> +};
> +
>  #define MSHV_IOCTL 0xB8
>  
>  /* mshv device */
>  #define MSHV_REQUEST_VERSION	_IOW(MSHV_IOCTL, 0x00, __u32)
> +#define MSHV_CREATE_PARTITION	_IOW(MSHV_IOCTL, 0x01, struct mshv_create_partition)
>  
>  #endif
> diff --git a/virt/mshv/mshv_main.c b/virt/mshv/mshv_main.c
> index 62f631f85301..4dcbe4907430 100644
> --- a/virt/mshv/mshv_main.c
> +++ b/virt/mshv/mshv_main.c
> @@ -12,6 +12,8 @@
>  #include <linux/fs.h>
>  #include <linux/miscdevice.h>
>  #include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
>  #include <linux/mshv.h>
>  
>  MODULE_AUTHOR("Microsoft");
> @@ -24,6 +26,161 @@ static u32 supported_versions[] = {
>  	MSHV_CURRENT_VERSION,
>  };
>  
> +static struct mshv mshv = {};
> +
> +static void mshv_partition_put(struct mshv_partition *partition);
> +static int mshv_partition_release(struct inode *inode, struct file *filp);
> +static long mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +
> +static int mshv_dev_open(struct inode *inode, struct file *filp);
> +static int mshv_dev_release(struct inode *inode, struct file *filp);
> +static long mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg);
> +
> +static const struct file_operations mshv_partition_fops = {
> +	.release = mshv_partition_release,
> +	.unlocked_ioctl = mshv_partition_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static const struct file_operations mshv_dev_fops = {
> +	.owner = THIS_MODULE,
> +	.open = mshv_dev_open,
> +	.release = mshv_dev_release,
> +	.unlocked_ioctl = mshv_dev_ioctl,
> +	.llseek = noop_llseek,
> +};
> +
> +static struct miscdevice mshv_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mshv",
> +	.fops = &mshv_dev_fops,
> +	.mode = 600,
> +};
> +
> +static long
> +mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> +{
> +	return -ENOTTY;
> +}
> +
> +static void
> +destroy_partition(struct mshv_partition *partition)
> +{
> +	unsigned long flags;
> +	int i;
> +
> +	/* Remove from list of partitions */
> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
> +
> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
> +		if (mshv.partitions.array[i] == partition)
> +			break;
> +	}
> +
> +	if (i == MSHV_MAX_PARTITIONS) {
> +		pr_err("%s: failed to locate partition in array\n", __func__);
> +	} else {
> +		mshv.partitions.count--;
> +		mshv.partitions.array[i] = NULL;
> +	}
> +
> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> +
> +	kfree(partition);
> +}
> +
> +static void
> +mshv_partition_put(struct mshv_partition *partition)
> +{
> +	if (refcount_dec_and_test(&partition->ref_count))
> +		destroy_partition(partition);
> +}
> +
> +static int
> +mshv_partition_release(struct inode *inode, struct file *filp)
> +{
> +	struct mshv_partition *partition = filp->private_data;
> +
> +	mshv_partition_put(partition);
> +
> +	return 0;
> +}
> +
> +static int
> +add_partition(struct mshv_partition *partition)
> +{
> +	unsigned long flags;
> +	int i, ret = 0;
> +
> +	spin_lock_irqsave(&mshv.partitions.lock, flags);
> +
> +	if (mshv.partitions.count >= MSHV_MAX_PARTITIONS) {
> +		pr_err("%s: too many partitions\n", __func__);
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +
> +	for (i = 0; i < MSHV_MAX_PARTITIONS; ++i) {
> +		if (!mshv.partitions.array[i])
> +			break;
> +	}
> +
> +	mshv.partitions.count++;
> +	mshv.partitions.array[i] = partition;
> +
> +out_unlock:
> +	spin_unlock_irqrestore(&mshv.partitions.lock, flags);
> +
> +	return ret;
> +}
> +
> +static long
> +mshv_ioctl_create_partition(void __user *user_arg)
> +{
> +	struct mshv_create_partition args;
> +	struct mshv_partition *partition;
> +	struct file *file;
> +	int fd;
> +	long ret;
> +
> +	if (copy_from_user(&args, user_arg, sizeof(args)))
> +		return -EFAULT;
> +
> +	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
> +	if (!partition)
> +		return -ENOMEM;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		ret = fd;
> +		goto free_partition;
> +	}
> +
> +	file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
> +				  partition, O_RDWR);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto put_fd;
> +	}
> +	refcount_set(&partition->ref_count, 1);
> +
> +	ret = add_partition(partition);
> +	if (ret)
> +		goto release_file;
> +
> +	fd_install(fd, file);
> +
> +	return fd;
> +
> +release_file:
> +	file->f_op->release(file->f_inode, file);
> +put_fd:
> +	put_unused_fd(fd);
> +free_partition:
> +	kfree(partition);
> +	return ret;
> +}
> +
>  static long
>  mshv_ioctl_request_version(u32 *version, void __user *user_arg)
>  {
> @@ -59,7 +216,10 @@ mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  	if (*version == MSHV_INVALID_VERSION)
>  		return -EBADFD;
>  
> -	/* TODO other ioctls */
> +	switch (ioctl) {
> +	case MSHV_CREATE_PARTITION:
> +		return mshv_ioctl_create_partition((void __user *)arg);
> +	}
>  
>  	return -ENOTTY;
>  }
> @@ -82,21 +242,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> -static const struct file_operations mshv_dev_fops = {
> -	.owner = THIS_MODULE,
> -	.open = mshv_dev_open,
> -	.release = mshv_dev_release,
> -	.unlocked_ioctl = mshv_dev_ioctl,
> -	.llseek = noop_llseek,
> -};
> -
> -static struct miscdevice mshv_dev = {
> -	.minor = MISC_DYNAMIC_MINOR,
> -	.name = "mshv",
> -	.fops = &mshv_dev_fops,
> -	.mode = 600,
> -};
> -

This looks like unneeded code churn, as these structs were added just a
few patches ago. They could probably be put in the right place from the
very beginning, so that they don't need to be moved in this patch.

>  static int
>  __init mshv_init(void)
>  {
> @@ -106,6 +251,8 @@ __init mshv_init(void)
>  	if (r)
>  		pr_err("%s: misc device register failed\n", __func__);
>  
> +	spin_lock_init(&mshv.partitions.lock);
> +
>  	return r;
>  }

-- 
Vitaly


* RE: [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls
       [not found]     ` <e6cc796d-f9ee-5203-95a9-05906f95d3f8@linux.microsoft.com>
@ 2021-03-04 23:58       ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-03-04 23:58 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 4, 2021 3:49 PM
> 
> On 2/8/2021 11:42 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November 20, 2020 4:30 PM
> >>

[snip]

> >> +
> >> +static int
> >> +hv_call_create_partition(
> >> +		u64 flags,
> >> +		struct hv_partition_creation_properties creation_properties,
> >> +		u64 *partition_id)
> >> +{
> >> +	struct hv_create_partition_in *input;
> >> +	struct hv_create_partition_out *output;
> >> +	int status;
> >> +	int ret;
> >> +	unsigned long irq_flags;
> >> +	int i;
> >> +
> >> +	do {
> >> +		local_irq_save(irq_flags);
> >> +		input = (struct hv_create_partition_in *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +		output = (struct hv_create_partition_out *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_output_arg));
> >> +
> >> +		input->flags = flags;
> >> +		input->proximity_domain_info.as_uint64 = 0;
> >> +		input->compatibility_version = HV_COMPATIBILITY_MANGANESE;
> >> +		for (i = 0; i < HV_PARTITION_PROCESSOR_FEATURE_BANKS; ++i)
> >> +			input->partition_creation_properties
> >> +				.disabled_processor_features.as_uint64[i] = 0;
> >> +		input->partition_creation_properties
> >> +			.disabled_processor_xsave_features.as_uint64 = 0;
> >> +		input->isolation_properties.as_uint64 = 0;
> >> +
> >> +		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
> >> +					 input, output);
> >
> > hv_do_hypercall returns a u64, which should then be masked with
> > HV_HYPERCALL_RESULT_MASK before checking the result.
> >
> 
> Yes, I'll fix this everywhere.
> 
> >> +		if (status != HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			if (status == HV_STATUS_SUCCESS)
> >> +				*partition_id = output->partition_id;
> >> +			else
> >> +				pr_err("%s: %s\n",
> >> +				       __func__, hv_status_to_string(status));
> >> +			local_irq_restore(irq_flags);
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +		local_irq_restore(irq_flags);
> >> +		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +					    hv_current_partition_id, 1);
> >> +	} while (!ret);
> >> +
> >> +	return ret;
> >> +}
> >> +

I had a separate thread on the linux-hyperv mailing list about the
inconsistency in how we check hypercall status in current upstream
code, and proposed some helper functions to make it easier and
more consistent.  Joe Salisbury has started work on a patch to
provide those helper functions and to start using them in current
upstream code.  You could coordinate with Joe to get the helper
functions as well and use them as discussed in that thread.  Then
later on we won't have to come back and fix up the uses in this
patch series.
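
To give a rough idea of what such helpers might look like, here is a
standalone C sketch. The helper names (hv_result, hv_result_success,
hv_repcomp) and the numeric constants are assumptions made for this
illustration; the real helpers are whatever comes out of the thread
mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the 64-bit hypercall return value:
 * bits 15:0 hold the status code, bits 43:32 the completed rep count. */
#define HV_HYPERCALL_RESULT_MASK	0xFFFFULL
#define HV_HYPERCALL_REP_COMP_OFFSET	32
#define HV_HYPERCALL_REP_COMP_MASK	(0xFFFULL << HV_HYPERCALL_REP_COMP_OFFSET)
#define HV_STATUS_SUCCESS		0

/* Hypothetical helper: extract the status code from the raw return value. */
static inline uint16_t hv_result(uint64_t status)
{
	return (uint16_t)(status & HV_HYPERCALL_RESULT_MASK);
}

/* Hypothetical helper: true if the masked status code is "success". */
static inline bool hv_result_success(uint64_t status)
{
	return hv_result(status) == HV_STATUS_SUCCESS;
}

/* Hypothetical helper: number of reps completed by a rep hypercall. */
static inline unsigned int hv_repcomp(uint64_t status)
{
	return (unsigned int)((status & HV_HYPERCALL_REP_COMP_MASK) >>
			      HV_HYPERCALL_REP_COMP_OFFSET);
}
```

With helpers like these, callers keep the raw u64 and never compare it
directly against a status constant.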

Michael
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
       [not found]     ` <194e0dad-495e-ae94-3f51-d2c95da52139@linux.microsoft.com>
@ 2021-03-05  9:18       ` Vitaly Kuznetsov
       [not found]         ` <fc88ba72-83ab-025e-682d-4981762ed4f6@linux.microsoft.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Vitaly Kuznetsov @ 2021-03-05  9:18 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, linux-kernel, mikelley, sunilmut, virtualization,
	viremana, ligrassi

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>> 
...
>>> +
>>> +3.1 MSHV_REQUEST_VERSION
>>> +------------------------
>>> +:Type: /dev/mshv ioctl
>>> +:Parameters: pointer to a u32
>>> +:Returns: 0 on success
>>> +
>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>>> +establish the interface version with the kernel module.
>>> +
>>> +The caller should pass the MSHV_VERSION as an argument.
>>> +
>>> +The kernel module will check which interface versions it supports and return 0
>>> +if one of them matches.
>>> +
>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>> +it is open - this ioctl can only be called once per open.
>>> +
>> 
>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
>> instead.
>> 
>
> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
> When we add new features/ioctls beyond the core we can use an extension/capability
> approach like KVM.
>

Driver versioning is a very bad idea from a distribution/stable kernel
point of view, as it presumes that the history is linear. It is not.

Imagine you have the following history upstream:

MSHV_REQUEST_VERSION = 1
<100 commits with features/fixes>
MSHV_REQUEST_VERSION = 2
<another 100 commits with features/fixes>
MSHV_REQUEST_VERSION = 3

Now I'm a linux distribution / stable kernel maintainer. My kernel is at
MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
history now looks like

MSHV_REQUEST_VERSION = 1
<5 commits from between VER=1 and VER=2>
   Which version should I declare here???? 
<5 commits from between VER=2 and VER=3>
   Which version should I declare here???? 

If I keep VER=1 then userspace will think that I don't have any extra
features added and just won't use them. If I change VER to 2/3, it'll
think I have *all* features from between these versions.

The only reasonable way to manage this is to attach a "capability" to
every ABI change and expose this capability *in the same commit which
introduces the change to the ABI*. This way userspace will know exactly
which ioctls are available and what their interfaces are.
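
To make the difference concrete, here is a small standalone C sketch.
All names in it (EXAMPLE_CAP_*, struct example_kernel) are hypothetical
and are not part of the proposed mshv interface. It models a
distribution kernel that has backported feature A (from between VER=1
and VER=2) and feature C (from between VER=2 and VER=3), but not
feature B.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capabilities, one per ABI change. */
enum example_cap {
	EXAMPLE_CAP_FEATURE_A = 0,	/* upstream: between VER=1 and VER=2 */
	EXAMPLE_CAP_FEATURE_B = 1,	/* upstream: between VER=1 and VER=2 */
	EXAMPLE_CAP_FEATURE_C = 2,	/* upstream: between VER=2 and VER=3 */
};

struct example_kernel {
	unsigned int version;	/* single linear version number */
	uint64_t caps;		/* one bit per capability */
};

/* What userspace has to assume from a version number alone: version N
 * implies every feature merged upstream before N. */
static inline bool version_implies(unsigned int version, enum example_cap cap)
{
	switch (cap) {
	case EXAMPLE_CAP_FEATURE_A:
	case EXAMPLE_CAP_FEATURE_B:
		return version >= 2;
	case EXAMPLE_CAP_FEATURE_C:
		return version >= 3;
	}
	return false;
}

/* What userspace learns from a capability query: exactly one feature. */
static inline bool has_cap(const struct example_kernel *k, enum example_cap cap)
{
	return (k->caps >> cap) & 1;
}
```

With these definitions no value of 'version' describes the backported
kernel: 1 hides A and C, while 2 or 3 wrongly claims B. The capability
bits describe it exactly.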

Also, trying to define "core set" is hard but you don't really need
to.

-- 
Vitaly


* RE: [RFC PATCH 08/18] virt/mshv: map and unmap guest memory
       [not found]     ` <d63330fa-de83-85de-c8ec-74cc90d680e3@linux.microsoft.com>
@ 2021-03-08 19:30       ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-03-08 19:30 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, March 8, 2021 11:14 AM
> 
> On 2/8/2021 11:45 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:30 PM
> >>

[snip]

> >> @@ -245,16 +249,318 @@ hv_call_delete_partition(u64 partition_id)
> >>  	return -hv_status_to_errno(status);
> >>  }
> >>
> >> +static int
> >> +hv_call_map_gpa_pages(u64 partition_id,
> >> +		      u64 gpa_target,
> >> +		      u64 page_count, u32 flags,
> >> +		      struct page **pages)
> >> +{
> >> +	struct hv_map_gpa_pages *input_page;
> >> +	int status;
> >> +	int i;
> >> +	struct page **p;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +	int ret = 0;
> >> +
> >> +	while (remaining) {
> >> +
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +
> >> +		local_irq_save(irq_flags);
> >> +		input_page = (struct hv_map_gpa_pages *)(*this_cpu_ptr(
> >> +			hyperv_pcpu_input_arg));
> >> +
> >> +		input_page->target_partition_id = partition_id;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +		input_page->map_flags = flags;
> >> +
> >> +		for (i = 0, p = pages; i < rep_count; i++, p++)
> >> +			input_page->source_gpa_page_list[i] =
> >> +				page_to_pfn(*p) & HV_MAP_GPA_MASK;
> >
> > The masking seems a bit weird.  The mask allows for up to 64G page frames,
> > which is 256 Tbytes of total physical memory, which is probably the current
> > Hyper-V limit on memory size (48 bit physical address space, though 52 bit
> > physical address spaces are coming).  So the masking shouldn't ever be doing
> > anything.   And if it was doing something, that probably should be treated as
> > an error rather than simply dropping the high bits.
> 
> Good point - It looks like the mask isn't needed.
> 
> >
> > Note that this code does not handle the case where PAGE_SIZE !=
> > HV_HYP_PAGE_SIZE.  But maybe we'll never run the root partition with a
> > page size other than 4K.
> >
> 
> For now on x86 it won't happen, but maybe on ARM?
> It shouldn't be hard to support this case, especially since
> PAGE_SIZE >= HV_HYP_PAGE_SIZE. Do you think we need it in this patch set?

No, from my perspective, this case does not need to be handled in 
this patch set.

> 
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_MAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >> +		local_irq_restore(irq_flags);
> >> +
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +
> >> +		if (status == HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +						    partition_id, 256);
> >
> > Why adding 256 pages?  I'm just contrasting with other places that add
> > 1 page at a time.  Maybe a comment to explain ....
> >
> 
> Empirically determined. I'll add a #define and comment.
> 
> >> +			if (ret)
> >> +				break;
> >> +		} else if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		pages += completed;
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +	}
> >> +
> >> +	if (ret && completed) {
> >
> > Is the above the right test?  Completed could be zero from the most
> > recent iteration, but still could be partially succeeded based on a previous
> > successful iteration.   I think this needs to check whether remaining equals
> > page_count.
> >
> 
> You're right; I'll change it to (ret && remaining < page_count)
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +hv_call_unmap_gpa_pages(u64 partition_id,
> >> +			u64 gpa_target,
> >> +			u64 page_count, u32 flags)
> >> +{
> >> +	struct hv_unmap_gpa_pages *input_page;
> >> +	int status;
> >> +	int ret = 0;
> >> +	u32 completed = 0;
> >> +	u64 hypercall_status;
> >> +	unsigned long remaining = page_count;
> >> +	int rep_count;
> >> +	unsigned long irq_flags;
> >> +
> >> +	local_irq_save(irq_flags);
> >> +	input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
> >> +		hyperv_pcpu_input_arg));
> >> +
> >> +	input_page->target_partition_id = partition_id;
> >> +	input_page->target_gpa_base = gpa_target;
> >> +	input_page->unmap_flags = flags;
> >> +
> >> +	while (remaining) {
> >> +		rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
> >> +		hypercall_status = hv_do_rep_hypercall(
> >> +			HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
> >
> > Similarly, this code doesn't handle PAGE_SIZE != HV_HYP_PAGE_SIZE.
> >
> 
> As above - do we need this for this patch set? This won't happen on x86.

Again, not needed from my perspective.

> 
> >> +		status = hypercall_status & HV_HYPERCALL_RESULT_MASK;
> >> +		completed = (hypercall_status & HV_HYPERCALL_REP_COMP_MASK) >>
> >> +				HV_HYPERCALL_REP_COMP_OFFSET;
> >> +		if (status != HV_STATUS_SUCCESS) {
> >> +			pr_err("%s: completed %llu out of %llu, %s\n",
> >> +			       __func__,
> >> +			       page_count - remaining, page_count,
> >> +			       hv_status_to_string(status));
> >> +			ret = -hv_status_to_errno(status);
> >> +			break;
> >> +		}
> >> +
> >> +		remaining -= completed;
> >> +		gpa_target += completed;
> >> +		input_page->target_gpa_base = gpa_target;
> >> +	}
> >> +	local_irq_restore(irq_flags);
> >
> > I have some concern about holding interrupts disabled for this long.
> >
> 
> How about I move the interrupt enabling/disabling inside the loop? i.e.:
>         while (remaining) {
>                 local_irq_save(irq_flags);
>                 input_page = (struct hv_unmap_gpa_pages *)(*this_cpu_ptr(
>                         hyperv_pcpu_input_arg));
> 
>                 input_page->target_partition_id = partition_id;
>                 input_page->target_gpa_base = gpa_target;
>                 input_page->unmap_flags = flags;
>                 rep_count = min(remaining, HV_MAP_GPA_BATCH_SIZE);
>                 status = hv_do_rep_hypercall(
>                         HVCALL_UNMAP_GPA_PAGES, rep_count, 0, input_page, NULL);
>                 local_irq_restore(irq_flags);
> 
>                 completed = (status & HV_HYPERCALL_REP_COMP_MASK) >>
>                                 HV_HYPERCALL_REP_COMP_OFFSET;
>                 status &= HV_HYPERCALL_RESULT_MASK;
>                 if (status != HV_STATUS_SUCCESS) {
>                         pr_err("%s: completed %llu out of %llu, %s\n",
>                                __func__,
>                                page_count - remaining, page_count,
>                                hv_status_to_string(status));
>                         ret = -hv_status_to_errno(status);
>                         break;
>                 }
> 
>                 remaining -= completed;
>                 gpa_target += completed;
>         }
> 
> 

Yes, that would help.
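
For reference, the per-batch structure together with the corrected
partial-success test (remaining < page_count) can be sketched as
standalone C. The hypercall is replaced by a stub, fake_rep_hypercall(),
and the batch size and error values are made up, so this illustrates
only the control flow and not the real interface. The sketch also
subtracts the reps completed by the failing call before testing, which
is a choice made here and not something agreed above.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EBADFD
#define EBADFD 77	/* Linux value, for platforms that lack it */
#endif

#define EXAMPLE_BATCH_SIZE	4	/* made-up batch size */

/* Stub standing in for the rep hypercall: processes up to 'count' pages
 * but stops once 'budget' pages have been consumed in total.
 * Returns 0 on success, nonzero on failure; *completed is always set. */
static inline int fake_rep_hypercall(uint64_t *budget, uint64_t count,
				     uint64_t *completed)
{
	*completed = count < *budget ? count : *budget;
	*budget -= *completed;
	return *completed == count ? 0 : 1;
}

/* Unmap 'page_count' pages in batches. In the kernel, interrupts would
 * be disabled only around each individual hypercall inside the loop. */
static inline int example_unmap(uint64_t page_count, uint64_t budget)
{
	uint64_t remaining = page_count;
	uint64_t completed;
	int ret = 0;

	while (remaining) {
		uint64_t rep_count = remaining < EXAMPLE_BATCH_SIZE ?
				     remaining : EXAMPLE_BATCH_SIZE;
		int failed;

		/* local_irq_save() would go here */
		failed = fake_rep_hypercall(&budget, rep_count, &completed);
		/* local_irq_restore() would go here */

		remaining -= completed;
		if (failed) {
			ret = -EIO;
			break;
		}
	}

	/* Partial success: an error occurred after some pages were done. */
	if (ret && remaining < page_count)
		ret = -EBADFD;

	return ret;
}
```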

> >> +
> >> +	if (ret && completed) {
> >
> > Same comment as before.
> >
> 
> Ditto as above.
> 
> >> +		pr_err("%s: Partially succeeded; mapped regions may be in invalid state",
> >> +		       __func__);
> >> +		ret = -EBADFD;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static long
> >> +mshv_partition_ioctl_map_memory(struct mshv_partition *partition,
> >> +				struct mshv_user_mem_region __user *user_mem)
> >> +{
> >> +	struct mshv_user_mem_region mem;
> >> +	struct mshv_mem_region *region;
> >> +	int completed;
> >> +	unsigned long remaining, batch_size;
> >> +	int i;
> >> +	struct page **pages;
> >> +	u64 page_count, user_start, user_end, gpfn_start, gpfn_end;
> >> +	u64 region_page_count, region_user_start, region_user_end;
> >> +	u64 region_gpfn_start, region_gpfn_end;
> >> +	long ret = 0;
> >> +
> >> +	/* Check we have enough slots*/
> >> +	if (partition->regions.count == MSHV_MAX_MEM_REGIONS) {
> >> +		pr_err("%s: not enough memory region slots\n", __func__);
> >> +		return -ENOSPC;
> >> +	}
> >> +
> >> +	if (copy_from_user(&mem, user_mem, sizeof(mem)))
> >> +		return -EFAULT;
> >> +
> >> +	if (!mem.size ||
> >> +	    mem.size & (PAGE_SIZE - 1) ||
> >> +	    mem.userspace_addr & (PAGE_SIZE - 1) ||
> >
> > There's a PAGE_ALIGNED macro that expresses exactly what
> > each of the previous two tests is doing.
> >
> 
> Since these need to be HV_HYP_PAGE_SIZE aligned, I will add a
> HV_HYP_PAGE_ALIGNED macro for this.

I was thinking that PAGE_SIZE and PAGE_ALIGNED are correct.   If
this code were running on an ARM64 system with a 64K page
size, the 64K alignment would be fine and will make sense from
the user space perspective.   You don't want to be mapping part
of a user space page.  And 64K alignment will certainly satisfy
Hyper-V's requirement for 4K alignment.  The real requirement
from Hyper-V's standpoint is that the alignment not be smaller
than 4K.  But maybe I'm misunderstanding.
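
A small standalone C sketch of the alignment point. The macro
EXAMPLE_ALIGNED() and the function name are hypothetical; in the kernel
this would be PAGE_ALIGNED() with the real PAGE_SIZE. It checks a region
against the root partition's own page size and shows that anything
aligned to a 64K page is also aligned to the 4K hypervisor page.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_HV_HYP_PAGE_SIZE	4096ULL	/* Hyper-V page size: 4K */

/* True if 'x' is a multiple of 'size'; 'size' must be a power of two. */
#define EXAMPLE_ALIGNED(x, size)	(((x) & ((size) - 1)) == 0)

/* Validate a user memory region against the kernel's own page size. */
static inline bool example_region_ok(uint64_t userspace_addr, uint64_t size,
				     uint64_t page_size)
{
	return size != 0 &&
	       EXAMPLE_ALIGNED(size, page_size) &&
	       EXAMPLE_ALIGNED(userspace_addr, page_size);
}
```

With a 64K page size, a region that is only 4K aligned is rejected,
which avoids mapping part of a user space page.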

Michael

* RE: [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages
       [not found]     ` <9e06a119-880f-5199-903b-056675331d6f@linux.microsoft.com>
@ 2021-03-11 20:45       ` Michael Kelley via Virtualization
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Kelley via Virtualization @ 2021-03-11 20:45 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, Lillian Grassin-Drake, linux-kernel, virtualization,
	Sunil Muthuswamy, viremana

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, March 11, 2021 11:38 AM
> 
> On 2/8/2021 11:47 AM, Michael Kelley wrote:
> > From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Friday, November
> 20, 2020 4:31 PM
> >>
> >> Same idea as synic setup in drivers/hv/hv.c:hv_synic_enable_regs()
> >> and hv_synic_disable_regs().
> >> Setting up synic registers in both vmbus driver and mshv would clobber
> >> them, but the vmbus driver will not run in the root partition, so this
> >> is safe.
> >>
> >> Co-developed-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> >> Signed-off-by: Lillian Grassin-Drake <ligrassi@microsoft.com>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >>  arch/x86/include/asm/hyperv-tlfs.h      |  29 ---
> >>  arch/x86/include/uapi/asm/hyperv-tlfs.h | 264 ++++++++++++++++++++++++
> >>  include/asm-generic/hyperv-tlfs.h       |  46 +----
> >>  include/linux/mshv.h                    |   1 +
> >>  include/uapi/asm-generic/hyperv-tlfs.h  |  43 ++++
> >>  virt/mshv/mshv_main.c                   |  98 ++++++++-
> >>  6 files changed, 404 insertions(+), 77 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> >> index 4cd44ae9bffb..c34a6bb4f457 100644
> >> --- a/arch/x86/include/asm/hyperv-tlfs.h
> >> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> >> @@ -267,35 +267,6 @@ struct hv_tsc_emulation_status {
> >>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
> >>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
> >>
> >> -
> >> -/* Define hypervisor message types. */
> >> -enum hv_message_type {
> >> -	HVMSG_NONE			= 0x00000000,
> >> -
> >> -	/* Memory access messages. */
> >> -	HVMSG_UNMAPPED_GPA		= 0x80000000,
> >> -	HVMSG_GPA_INTERCEPT		= 0x80000001,
> >> -
> >> -	/* Timer notification messages. */
> >> -	HVMSG_TIMER_EXPIRED		= 0x80000010,
> >> -
> >> -	/* Error messages. */
> >> -	HVMSG_INVALID_VP_REGISTER_VALUE	= 0x80000020,
> >> -	HVMSG_UNRECOVERABLE_EXCEPTION	= 0x80000021,
> >> -	HVMSG_UNSUPPORTED_FEATURE	= 0x80000022,
> >> -
> >> -	/* Trace buffer complete messages. */
> >> -	HVMSG_EVENTLOG_BUFFERCOMPLETE	= 0x80000040,
> >> -
> >> -	/* Platform-specific processor intercept messages. */
> >> -	HVMSG_X64_IOPORT_INTERCEPT	= 0x80010000,
> >> -	HVMSG_X64_MSR_INTERCEPT		= 0x80010001,
> >> -	HVMSG_X64_CPUID_INTERCEPT	= 0x80010002,
> >> -	HVMSG_X64_EXCEPTION_INTERCEPT	= 0x80010003,
> >> -	HVMSG_X64_APIC_EOI		= 0x80010004,
> >> -	HVMSG_X64_LEGACY_FP_ERROR	= 0x80010005
> >> -};
> >> -
> >>  struct hv_nested_enlightenments_control {
> >>  	struct {
> >>  		__u32 directhypercall:1;
> >> diff --git a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> b/arch/x86/include/uapi/asm/hyperv-
> >> tlfs.h
> >> index 2ff655962738..c6a27053f791 100644
> >> --- a/arch/x86/include/uapi/asm/hyperv-tlfs.h
> >> +++ b/arch/x86/include/uapi/asm/hyperv-tlfs.h
> >> @@ -722,4 +722,268 @@ union hv_register_value {
> >>  		pending_virtualization_fault_event;
> >>  };
> >>
> >> +/* Define hypervisor message types. */
> >> +enum hv_message_type {
> >> +	HVMSG_NONE				= 0x00000000,
> >> +
> >> +	/* Memory access messages. */
> >> +	HVMSG_UNMAPPED_GPA			= 0x80000000,
> >> +	HVMSG_GPA_INTERCEPT			= 0x80000001,
> >> +
> >> +	/* Timer notification messages. */
> >> +	HVMSG_TIMER_EXPIRED			= 0x80000010,
> >> +
> >> +	/* Error messages. */
> >> +	HVMSG_INVALID_VP_REGISTER_VALUE		= 0x80000020,
> >> +	HVMSG_UNRECOVERABLE_EXCEPTION		= 0x80000021,
> >> +	HVMSG_UNSUPPORTED_FEATURE		= 0x80000022,
> >> +
> >> +	/* Trace buffer complete messages. */
> >> +	HVMSG_EVENTLOG_BUFFERCOMPLETE		= 0x80000040,
> >> +
> >> +	/* Platform-specific processor intercept messages. */
> >> +	HVMSG_X64_IO_PORT_INTERCEPT		= 0x80010000,
> >> +	HVMSG_X64_MSR_INTERCEPT			= 0x80010001,
> >> +	HVMSG_X64_CPUID_INTERCEPT		= 0x80010002,
> >> +	HVMSG_X64_EXCEPTION_INTERCEPT		= 0x80010003,
> >> +	HVMSG_X64_APIC_EOI			= 0x80010004,
> >> +	HVMSG_X64_LEGACY_FP_ERROR		= 0x80010005,
> >> +	HVMSG_X64_IOMMU_PRQ			= 0x80010006,
> >> +	HVMSG_X64_HALT				= 0x80010007,
> >> +	HVMSG_X64_INTERRUPTION_DELIVERABLE	= 0x80010008,
> >> +	HVMSG_X64_SIPI_INTERCEPT		= 0x80010009,
> >> +};
> >
> > I have a separate patch series that moves this enum to the
> > asm-generic portion of hyperv-tlfs.h because there's not a good way
> > to separate the arch neutral from arch dependent values.
> >
> 
> Ok, but it should also be changed to #define instead of an enum, right?
> I will do that in this patch.
> This requires a couple of changes in other files in drivers/hv
> where this enum is used.

Because of the other uses of the enum in places that don't depend
on exact structure layouts, I left it as an enum when I moved it.
When one of the enum values is passed to Hyper-V, the enum
is assigned to a u32 field, which I think is acceptable.  You could
do the same with the other enums you already have -- keep the
constant definitions as members of an enum, but assign to a u32
field in the structures that get passed to Hyper-V.  There may
actually be some benefit in that approach, particularly if the enum
is passed as an individual argument into some function(s). 
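
As a rough illustration of that pattern in standalone C: the constants
stay in an enum, while the structure shared with the hypervisor stores
the value in a fixed-width field. The enum, its values, and
struct example_msg_header are all made up for this sketch (the values
are kept small so they remain valid int enumerators in ISO C); they are
not the real hv_message_type definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message types; not the real HVMSG_* values. */
enum example_msg_type {
	EXAMPLE_MSG_NONE	= 0x0000,
	EXAMPLE_MSG_TIMER	= 0x0010,
	EXAMPLE_MSG_IO_PORT	= 0x0100,
};

/* Hypothetical structure whose layout is fixed by an external ABI.
 * The type is stored as uint32_t, so the layout does not depend on
 * how the compiler chooses to size the enum. */
struct example_msg_header {
	uint32_t message_type;
	uint8_t  payload_size;
	uint8_t  reserved[3];
};

_Static_assert(sizeof(struct example_msg_header) == 8,
	       "header layout must stay fixed");
_Static_assert(offsetof(struct example_msg_header, payload_size) == 4,
	       "message_type must occupy exactly 4 bytes");

/* Functions can still take the enum, which documents intent and lets
 * the compiler warn about unhandled cases in switch statements. */
static inline void example_set_type(struct example_msg_header *hdr,
				    enum example_msg_type type)
{
	hdr->message_type = (uint32_t)type;
}
```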

Others may have an opinion on this approach .....

Michael

* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
       [not found]         ` <fc88ba72-83ab-025e-682d-4981762ed4f6@linux.microsoft.com>
@ 2021-04-07  7:38           ` Vitaly Kuznetsov
       [not found]             ` <20210407134302.ng6n4el2km7sujfp@liuwe-devbox-debian-v2>
  0 siblings, 1 reply; 17+ messages in thread
From: Vitaly Kuznetsov @ 2021-04-07  7:38 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv
  Cc: wei.liu, linux-kernel, mikelley, sunilmut, virtualization,
	viremana, ligrassi

Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:

> On 3/5/2021 1:18 AM, Vitaly Kuznetsov wrote:
>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>> 
>>> On 2/9/2021 5:11 AM, Vitaly Kuznetsov wrote:
>>>> Nuno Das Neves <nunodasneves@linux.microsoft.com> writes:
>>>>
>> ...
>>>>> +
>>>>> +3.1 MSHV_REQUEST_VERSION
>>>>> +------------------------
>>>>> +:Type: /dev/mshv ioctl
>>>>> +:Parameters: pointer to a u32
>>>>> +:Returns: 0 on success
>>>>> +
>>>>> +Before issuing any other ioctls, a MSHV_REQUEST_VERSION ioctl must be called to
>>>>> +establish the interface version with the kernel module.
>>>>> +
>>>>> +The caller should pass the MSHV_VERSION as an argument.
>>>>> +
>>>>> +The kernel module will check which interface versions it supports and return 0
>>>>> +if one of them matches.
>>>>> +
>>>>> +This /dev/mshv file descriptor will remain 'locked' to that version as long as
>>>>> +it is open - this ioctl can only be called once per open.
>>>>> +
>>>>
>>>> KVM used to have KVM_GET_API_VERSION too but this turned out to be not
>>>> very convenient so we use capabilities (KVM_CHECK_EXTENSION/KVM_ENABLE_CAP)
>>>> instead.
>>>>
>>>
>>> The goal of MSHV_REQUEST_VERSION is to support changes to APIs in the core set.
>>> When we add new features/ioctls beyond the core we can use an extension/capability
>>> approach like KVM.
>>>
>> 
>> Driver versioning is a very bad idea from a distribution/stable kernel
>> point of view, as it presumes that the history is linear. It is not.
>> 
>> Imagine you have the following history upstream:
>> 
>> MSHV_REQUEST_VERSION = 1
>> <100 commits with features/fixes>
>> MSHV_REQUEST_VERSION = 2
>> <another 100 commits with features/fixes>
>> MSHV_REQUEST_VERSION = 3
>> 
>> Now I'm a linux distribution / stable kernel maintainer. My kernel is at
>> MSHV_REQUEST_VERSION = 1. Now I want to backport 1 feature from between
>> VER=1 and VER=2 and another feature from between VER=2 and VER=3. My
>> history now looks like
>> 
>> MSHV_REQUEST_VERSION = 1
>> <5 commits from between VER=1 and VER=2>
>>    Which version should I declare here???? 
>> <5 commits from between VER=2 and VER=3>
>>    Which version should I declare here???? 
>> 
>> If I keep VER=1 then userspace will think that I don't have any extra
>> features added and just won't use them. If I change VER to 2/3, it'll
>> think I have *all* features from between these versions.
>> 
>> The only reasonable way to manage this is to attach a "capability" to
>> every ABI change and expose this capability *in the same commit which
>> introduces the change to the ABI*. This way userspace will know exactly
>> which ioctls are available and what their interfaces are.
>> 
>> Also, trying to define "core set" is hard but you don't really need
>> to.
>> 
>
> We've had some internal discussion on this.
>
> There is bound to be some iteration before this ABI is stable, since even the
> underlying Microsoft hypervisor interfaces aren't stable just yet.
>
> It might make more sense to just have an IOCTL to check if the API is stable yet.
> This would be analogous to checking if KVM_GET_API_VERSION returns 12.
>
> How does this sound as a proposal?
> An MSHV_CHECK_EXTENSION ioctl to query extensions to the core /dev/mshv API.
>
> It takes a single argument, an integer named MSHV_CAP_* corresponding to
> the extension to check the existence of.
>
> The ioctl will return 0 if the extension is unsupported, or a positive integer
> if supported.
>
> We can initially include a capability called MSHV_CAP_CORE_API_STABLE.
> If supported, the core APIs are stable.

This sounds reasonable, I'd suggest you reserve MSHV_CAP_CORE_API_STABLE
right away but don't expose it yet so it's clear the API is not yet
stable. Test userspace you have may always assume it's running with the
latest kernel.

Also, please be clear about the fact that /dev/mshv doesn't
provide a stable API yet so nobody builds an application on top of
it.
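
Roughly, the semantics described above could look like the following
standalone C sketch. The capability numbers and the function
example_check_extension() are placeholders, not the eventual mshv
definitions; the only point is that MSHV_CAP_CORE_API_STABLE has a
reserved number but is reported as unsupported for now.

```c
/* Placeholder capability numbers, for illustration only. */
#define MSHV_CAP_CORE_API_STABLE	0	/* reserved, not exposed yet */
#define EXAMPLE_CAP_SOME_EXTENSION	1	/* hypothetical extension */

/* Model of the query: returns 0 if the extension is unsupported,
 * or a positive integer if it is supported. */
static inline int example_check_extension(unsigned long cap)
{
	switch (cap) {
	case EXAMPLE_CAP_SOME_EXTENSION:
		return 1;
	case MSHV_CAP_CORE_API_STABLE:
		/* Reserved so the number is never reused, but the core
		 * API is not stable yet, so report "unsupported". */
		return 0;
	default:
		return 0;	/* unknown capabilities are unsupported */
	}
}
```

Once the core API is declared stable, the reserved case would start
returning a positive value in the same commit that makes that promise.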

One more thought: it is probably a good idea to introduce selftests for
/dev/mshv (similar to KVM's selftests in
/tools/testing/selftests/kvm). Selftests don't really need a stable ABI
as they live in the same linux.git and can be updated in the same patch
series which changes /dev/mshv behavior. Selftests are very useful for
checking there are no regressions, especially in the situation when
there's no publicly available userspace for /dev/mshv.

-- 
Vitaly


* Re: [RFC PATCH 04/18] virt/mshv: request version ioctl
       [not found]             ` <20210407134302.ng6n4el2km7sujfp@liuwe-devbox-debian-v2>
@ 2021-04-07 14:02               ` Vitaly Kuznetsov
  0 siblings, 0 replies; 17+ messages in thread
From: Vitaly Kuznetsov @ 2021-04-07 14:02 UTC (permalink / raw)
  To: Wei Liu
  Cc: linux-hyperv, linux-kernel, mikelley, wei.liu, Nuno Das Neves,
	sunilmut, virtualization, viremana, ligrassi

Wei Liu <wei.liu@kernel.org> writes:

> On Wed, Apr 07, 2021 at 09:38:21AM +0200, Vitaly Kuznetsov wrote:
>
>> One more thought: it is probably a good idea to introduce selftests for
>> /dev/mshv (similar to KVM's selftests in
>> /tools/testing/selftests/kvm). Selftests don't really need a stable ABI
>> as they live in the same linux.git and can be updated in the same patch
>> series which changes /dev/mshv behavior. Selftests are very useful for
>> checking there are no regressions, especially in the situation when
>> there's no publicly available userspace for /dev/mshv.
>
> I think this can wait until we merge the first implementation in tree.
> There are still a lot of moving parts. Our (currently limited) internal
> test cases need more cleaning up before they are ready. I certainly
> don't want to distract Nuno from getting the foundation right.
>

I'm absolutely fine with this approach, selftests are a nice add-on, not
a requirement for the initial implementation. Also, to make them more
useful to mere mortals, a doc on how to run Linux as root Hyper-V
partition would come handy)

-- 
Vitaly


end of thread, other threads:[~2021-04-07 14:03 UTC | newest]

Thread overview: 17+ messages
     [not found] <1605918637-12192-1-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:40 ` [RFC PATCH 00/18] Microsoft Hypervisor root partition ioctl interface Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-5-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:41   ` [RFC PATCH 04/18] virt/mshv: request version ioctl Michael Kelley via Virtualization
     [not found]   ` <87y2fxmlmb.fsf@vitty.brq.redhat.com>
     [not found]     ` <194e0dad-495e-ae94-3f51-d2c95da52139@linux.microsoft.com>
2021-03-05  9:18       ` Vitaly Kuznetsov
     [not found]         ` <fc88ba72-83ab-025e-682d-4981762ed4f6@linux.microsoft.com>
2021-04-07  7:38           ` Vitaly Kuznetsov
     [not found]             ` <20210407134302.ng6n4el2km7sujfp@liuwe-devbox-debian-v2>
2021-04-07 14:02               ` Vitaly Kuznetsov
     [not found] ` <1605918637-12192-7-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:42   ` [RFC PATCH 06/18] virt/mshv: create, initialize, finalize, delete partition hypercalls Michael Kelley via Virtualization
     [not found]     ` <e6cc796d-f9ee-5203-95a9-05906f95d3f8@linux.microsoft.com>
2021-03-04 23:58       ` Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-8-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:44   ` [RFC PATCH 07/18] virt/mshv: withdraw memory hypercall Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-9-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:45   ` [RFC PATCH 08/18] virt/mshv: map and unmap guest memory Michael Kelley via Virtualization
     [not found]     ` <d63330fa-de83-85de-c8ec-74cc90d680e3@linux.microsoft.com>
2021-03-08 19:30       ` Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-11-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:47   ` [RFC PATCH 10/18] virt/mshv: get and set vcpu registers ioctls Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-12-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:47   ` [RFC PATCH 11/18] virt/mshv: set up synic pages for intercept messages Michael Kelley via Virtualization
     [not found]     ` <9e06a119-880f-5199-903b-056675331d6f@linux.microsoft.com>
2021-03-11 20:45       ` Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-16-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:48   ` [RFC PATCH 15/18] virt/mshv: get and set vp state ioctls Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-17-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-08 19:49   ` [RFC PATCH 16/18] virt/mshv: mmap vp register page Michael Kelley via Virtualization
     [not found] ` <1605918637-12192-2-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-09 13:04   ` [RFC PATCH 01/18] x86/hyperv: convert hyperv statuses to linux error codes Vitaly Kuznetsov
     [not found] ` <1605918637-12192-6-git-send-email-nunodasneves@linux.microsoft.com>
2021-02-09 13:15   ` [RFC PATCH 05/18] virt/mshv: create partition ioctl Vitaly Kuznetsov
