All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
@ 2021-08-13  2:13 Colin Xu
  2021-08-16 22:39 ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-08-13  2:13 UTC (permalink / raw)
  To: kvm, linux-kernel, alex.williamson
  Cc: colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contigious after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Thus it's impossible
to map a contigious range to hold both OpRegion and extended VBT as 2.1.

Since the only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1,
it's feasible to amend OpRegion support for these legacy system (before
upgrading the system firmware), by kazlloc a range to shadown OpRegion
from the beginning and stitch VBT after closely, patch the shadow
OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
if the extended VBT exists. vfio igd OpRegion r/w ops will return either
shadowed data (OpRegion 2.0) or directly from physical address
(OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
mechanism makes it possible to support legacy systems on the market.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 42 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..22b9436a3044 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	if (is_ioremap_addr(region->data))
+		memunmap(region->data);
+	else
+		kfree(region->data);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
 static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
-	u32 addr, size;
-	void *base;
+	u32 addr, size, rvds = 0;
+	void *base, *opregionvbt;
 	int ret;
 	u16 version;
+	u64 rvda = 0;
 
 	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
 	if (ret)
@@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
-	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
+	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
+	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
+	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
+	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
 	 */
 	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
-	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
 
+	if (version >= 0x0200) {
 		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
 		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
+			size += rvds;
+		}
+
+		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
+		if (rvda != size && version > 0x0200) {
+			memunmap(base);
+			pci_err(vdev->pdev,
+				"Extended VBT does not follow opregion on version 0x%04x\n",
+				version);
+			return -EINVAL;
+		}
+	}
+
+	if (size != OPREGION_SIZE) {
+		/* Allocate memory for OpRegion and extended VBT for 2.0 */
+		if (rvda && rvds && version == 0x0200) {
+			void *vbt_base;
+
+			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
+			if (!vbt_base) {
 				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
+				return -ENOMEM;
 			}
 
-			if (rvda != size) {
+			opregionvbt = kzalloc(size, GFP_KERNEL);
+			if (!opregionvbt) {
 				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+				memunmap(vbt_base);
+				return -ENOMEM;
 			}
 
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
+			/* Stitch VBT after OpRegion noncontigious */
+			memcpy(opregionvbt, base, OPREGION_SIZE);
+			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
+
+			/* Patch OpRegion 2.0 to 2.1 */
+			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
+			/* Patch RVDA to relative address after OpRegion */
+			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;
+
+			memunmap(vbt_base);
+			memunmap(base);
+
+			/* Register shadow instead of map as vfio_region */
+			base = opregionvbt;
+		/* Remap OpRegion + extended VBT for 2.1+ */
+		} else {
+			memunmap(base);
+			base = memremap(addr, size, MEMREMAP_WB);
+			if (!base)
+				return -ENOMEM;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
 		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
 	if (ret) {
-		memunmap(base);
+		if (is_ioremap_addr(base))
+			memunmap(base);
+		else
+			kfree(base);
 		return ret;
 	}
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-13  2:13 [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support Colin Xu
@ 2021-08-16 22:39 ` Alex Williamson
  2021-08-17  0:40   ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-08-16 22:39 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, linux-kernel, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 13 Aug 2021 10:13:29 +0800
Colin Xu <colin.xu@intel.com> wrote:

> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contigious after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Thus it's impossible
> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
> 
> Since the only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1,
> it's feasible to amend OpRegion support for these legacy system (before
> upgrading the system firmware), by kazlloc a range to shadown OpRegion
> from the beginning and stitch VBT after closely, patch the shadow
> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
> shadowed data (OpRegion 2.0) or directly from physical address
> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
> mechanism makes it possible to support legacy systems on the market.

Which systems does this enable?  There's a suggestion above that these
systems could update firmware to get OpRegion v2.1 support, why
shouldn't we ask users to do that instead?  When we added OpRegion v2.1
support we were told that v2.0 support was essentially non-existent,
why should we add code to support and old spec with few users for such
a niche use case?

> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>  1 file changed, 75 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> index 228df565e9bc..22b9436a3044 100644
> --- a/drivers/vfio/pci/vfio_pci_igd.c
> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>  				 struct vfio_pci_region *region)
>  {
> -	memunmap(region->data);
> +	if (is_ioremap_addr(region->data))
> +		memunmap(region->data);
> +	else
> +		kfree(region->data);
>  }
>  
>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  {
>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
> -	u32 addr, size;
> -	void *base;
> +	u32 addr, size, rvds = 0;
> +	void *base, *opregionvbt;
>  	int ret;
>  	u16 version;
> +	u64 rvda = 0;
>  
>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>  	if (ret)
> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  	size *= 1024; /* In KB */
>  
>  	/*
> -	 * Support opregion v2.1+
> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> -	 * defined before opregion 2.0.
> +	 * OpRegion and VBT:
> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> +	 * to hold the VBT data, the Extended VBT region is introduced since
> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> +	 * introduced to define the extended VBT data location and size.
> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> +	 *   extended VBT data, RVDS defines the VBT data size.
> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>  	 *
> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> -	 * opregion base, and should point to the end of opregion.
> -	 * otherwise, exposing to userspace to allow read access to everything between
> -	 * the OpRegion and VBT is not safe.
> -	 * RVDS is defined as size in bytes.
> -	 *
> -	 * opregion 2.0: rvda is the physical VBT address.
> -	 * Since rvda is HPA it cannot be directly used in guest.
> -	 * And it should not be practically available for end user,so it is not supported.
> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>  	 */
>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> -	if (version >= 0x0200) {
> -		u64 rvda;
> -		u32 rvds;
>  
> +	if (version >= 0x0200) {
>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> +
> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>  		if (rvda && rvds) {
> -			/* no support for opregion v2.0 with physical VBT address */
> -			if (version == 0x0200) {
> +			size += rvds;
> +		}
> +
> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
> +		if (rvda != size && version > 0x0200) {

But we already added rvds to size, this is not compatible with the
previous code that required rvda == size BEFORE adding rvds.

> +			memunmap(base);
> +			pci_err(vdev->pdev,
> +				"Extended VBT does not follow opregion on version 0x%04x\n",
> +				version);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (size != OPREGION_SIZE) {
> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
> +		if (rvda && rvds && version == 0x0200) {
> +			void *vbt_base;
> +
> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
> +			if (!vbt_base) {
>  				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
> -				return -EINVAL;
> +				return -ENOMEM;
>  			}
>  
> -			if (rvda != size) {
> +			opregionvbt = kzalloc(size, GFP_KERNEL);
> +			if (!opregionvbt) {
>  				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"Extended VBT does not follow opregion on version 0x%04x\n",
> -					version);
> -				return -EINVAL;
> +				memunmap(vbt_base);
> +				return -ENOMEM;
>  			}
>  
> -			/* region size for opregion v2.0+: opregion and VBT size. */
> -			size += rvds;
> +			/* Stitch VBT after OpRegion noncontigious */
> +			memcpy(opregionvbt, base, OPREGION_SIZE);
> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
> +
> +			/* Patch OpRegion 2.0 to 2.1 */
> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
> +			/* Patch RVDA to relative address after OpRegion */
> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;

AIUI, the OpRegion is a two-way channel between the IGD device/system
BIOS and the driver, numerous fields are writable by the driver.  Now
the driver writes to a shadow copy of the OpRegion table.  What
completes the write to the real OpRegion table for consumption by the
device/BIOS?  Likewise, what updates the fields that are written by the
device/BIOS for consumption by the driver?

If a shadow copy of the OpRegion detached from the physical table is
sufficient here, why wouldn't we always shadow the OpRegion and prevent
all userspace writes from touching the real version?  Thanks,

Alex

> +
> +			memunmap(vbt_base);
> +			memunmap(base);
> +
> +			/* Register shadow instead of map as vfio_region */
> +			base = opregionvbt;
> +		/* Remap OpRegion + extended VBT for 2.1+ */
> +		} else {
> +			memunmap(base);
> +			base = memremap(addr, size, MEMREMAP_WB);
> +			if (!base)
> +				return -ENOMEM;
>  		}
>  	}
>  
> -	if (size != OPREGION_SIZE) {
> -		memunmap(base);
> -		base = memremap(addr, size, MEMREMAP_WB);
> -		if (!base)
> -			return -ENOMEM;
> -	}
> -
>  	ret = vfio_pci_register_dev_region(vdev,
>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>  	if (ret) {
> -		memunmap(base);
> +		if (is_ioremap_addr(base))
> +			memunmap(base);
> +		else
> +			kfree(base);
>  		return ret;
>  	}
>  


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-16 22:39 ` Alex Williamson
@ 2021-08-17  0:40   ` Colin Xu
  2021-08-27  1:36     ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-08-17  0:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, linux-kernel, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Mon, 16 Aug 2021, Alex Williamson wrote:

> On Fri, 13 Aug 2021 10:13:29 +0800
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Due to historical reason, some legacy shipped system doesn't follow
>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>> VBT is not contigious after OpRegion in physical address, but any
>> location pointed by RVDA via absolute address. Thus it's impossible
>> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>
>> Since the only difference between OpRegion 2.0 and 2.1 is where extended
>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>> baes, and there is no other difference between OpRegion 2.0 and 2.1,
>> it's feasible to amend OpRegion support for these legacy system (before
>> upgrading the system firmware), by kazlloc a range to shadown OpRegion
>> from the beginning and stitch VBT after closely, patch the shadow
>> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>> shadowed data (OpRegion 2.0) or directly from physical address
>> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>> mechanism makes it possible to support legacy systems on the market.
>
> Which systems does this enable?  There's a suggestion above that these
> systems could update firmware to get OpRegion v2.1 support, why
> shouldn't we ask users to do that instead?  When we added OpRegion v2.1
> support we were told that v2.0 support was essentially non-existent,
> why should we add code to support and old spec with few users for such
> a niche use case?
Hi Alex, there was some mis-alignment with the BIOS owner that we were 
told the 2.0 system doesn't for retail but only for internal development. 
However in other projects we DO see the retail market has such systems, 
including NUC NUC6CAYB, some APL industrial PC used in RT system, and some 
customized APL motherboard by commercial virtualization solution. We 
immediately contact the BIOS owner to ask for a clarification and they 
admit it. These system won't get updated BIOS for OpRegion update but 
still under warranty. That's why the OpRegion 2.0 support is still needed.

>
>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>> Cc: Fred Gao <fred.gao@intel.com>
>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>  1 file changed, 75 insertions(+), 42 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>> index 228df565e9bc..22b9436a3044 100644
>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>  				 struct vfio_pci_region *region)
>>  {
>> -	memunmap(region->data);
>> +	if (is_ioremap_addr(region->data))
>> +		memunmap(region->data);
>> +	else
>> +		kfree(region->data);
>>  }
>>
>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  {
>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>> -	u32 addr, size;
>> -	void *base;
>> +	u32 addr, size, rvds = 0;
>> +	void *base, *opregionvbt;
>>  	int ret;
>>  	u16 version;
>> +	u64 rvda = 0;
>>
>>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>  	if (ret)
>> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  	size *= 1024; /* In KB */
>>
>>  	/*
>> -	 * Support opregion v2.1+
>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>> -	 * defined before opregion 2.0.
>> +	 * OpRegion and VBT:
>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>> +	 * introduced to define the extended VBT data location and size.
>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>> +	 *   extended VBT data, RVDS defines the VBT data size.
>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>  	 *
>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>> -	 * opregion base, and should point to the end of opregion.
>> -	 * otherwise, exposing to userspace to allow read access to everything between
>> -	 * the OpRegion and VBT is not safe.
>> -	 * RVDS is defined as size in bytes.
>> -	 *
>> -	 * opregion 2.0: rvda is the physical VBT address.
>> -	 * Since rvda is HPA it cannot be directly used in guest.
>> -	 * And it should not be practically available for end user,so it is not supported.
>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
>> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
>> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>>  	 */
>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>> -	if (version >= 0x0200) {
>> -		u64 rvda;
>> -		u32 rvds;
>>
>> +	if (version >= 0x0200) {
>>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>> +
>> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>>  		if (rvda && rvds) {
>> -			/* no support for opregion v2.0 with physical VBT address */
>> -			if (version == 0x0200) {
>> +			size += rvds;
>> +		}
>> +
>> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
>> +		if (rvda != size && version > 0x0200) {
>
> But we already added rvds to size, this is not compatible with the
> previous code that required rvda == size BEFORE adding rvds.
>
>> +			memunmap(base);
>> +			pci_err(vdev->pdev,
>> +				"Extended VBT does not follow opregion on version 0x%04x\n",
>> +				version);
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	if (size != OPREGION_SIZE) {
>> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
>> +		if (rvda && rvds && version == 0x0200) {
>> +			void *vbt_base;
>> +
>> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
>> +			if (!vbt_base) {
>>  				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
>> -				return -EINVAL;
>> +				return -ENOMEM;
>>  			}
>>
>> -			if (rvda != size) {
>> +			opregionvbt = kzalloc(size, GFP_KERNEL);
>> +			if (!opregionvbt) {
>>  				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"Extended VBT does not follow opregion on version 0x%04x\n",
>> -					version);
>> -				return -EINVAL;
>> +				memunmap(vbt_base);
>> +				return -ENOMEM;
>>  			}
>>
>> -			/* region size for opregion v2.0+: opregion and VBT size. */
>> -			size += rvds;
>> +			/* Stitch VBT after OpRegion noncontigious */
>> +			memcpy(opregionvbt, base, OPREGION_SIZE);
>> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
>> +
>> +			/* Patch OpRegion 2.0 to 2.1 */
>> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
>> +			/* Patch RVDA to relative address after OpRegion */
>> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;
>
> AIUI, the OpRegion is a two-way channel between the IGD device/system
> BIOS and the driver, numerous fields are writable by the driver.  Now
> the driver writes to a shadow copy of the OpRegion table.  What
> completes the write to the real OpRegion table for consumption by the
> device/BIOS?  Likewise, what updates the fields that are written by the
> device/BIOS for consumption by the driver?
>
> If a shadow copy of the OpRegion detached from the physical table is
> sufficient here, why wouldn't we always shadow the OpRegion and prevent
> all userspace writes from touching the real version?  Thanks,
>
> Alex
>
>> +
>> +			memunmap(vbt_base);
>> +			memunmap(base);
>> +
>> +			/* Register shadow instead of map as vfio_region */
>> +			base = opregionvbt;
>> +		/* Remap OpRegion + extended VBT for 2.1+ */
>> +		} else {
>> +			memunmap(base);
>> +			base = memremap(addr, size, MEMREMAP_WB);
>> +			if (!base)
>> +				return -ENOMEM;
>>  		}
>>  	}
>>
>> -	if (size != OPREGION_SIZE) {
>> -		memunmap(base);
>> -		base = memremap(addr, size, MEMREMAP_WB);
>> -		if (!base)
>> -			return -ENOMEM;
>> -	}
>> -
>>  	ret = vfio_pci_register_dev_region(vdev,
>>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>  	if (ret) {
>> -		memunmap(base);
>> +		if (is_ioremap_addr(base))
>> +			memunmap(base);
>> +		else
>> +			kfree(base);
>>  		return ret;
>>  	}
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-17  0:40   ` Colin Xu
@ 2021-08-27  1:36     ` Colin Xu
  2021-08-27  1:48       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-08-27  1:36 UTC (permalink / raw)
  To: Colin Xu
  Cc: Alex Williamson, kvm, linux-kernel, zhenyuw, hang.yuan,
	swee.yee.fonn, fred.gao

Hi Alex,

In addition to the background that devices on market may still need 
OpRegion 2.0 support in vfio-pci, do you have other comments to the patch 
body?

On Tue, 17 Aug 2021, Colin Xu wrote:

> On Mon, 16 Aug 2021, Alex Williamson wrote:
>
>>  On Fri, 13 Aug 2021 10:13:29 +0800
>>  Colin Xu <colin.xu@intel.com> wrote:
>>
>>>  Due to historical reason, some legacy shipped system doesn't follow
>>>  OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>>>  VBT is not contigious after OpRegion in physical address, but any
>>>  location pointed by RVDA via absolute address. Thus it's impossible
>>>  to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>>
>>>  Since the only difference between OpRegion 2.0 and 2.1 is where extended
>>>  VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>>>  while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>>>  baes, and there is no other difference between OpRegion 2.0 and 2.1,
>>>  it's feasible to amend OpRegion support for these legacy system (before
>>>  upgrading the system firmware), by kazlloc a range to shadown OpRegion
>>>  from the beginning and stitch VBT after closely, patch the shadow
>>>  OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>>>  address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>>>  2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>>>  if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>>>  shadowed data (OpRegion 2.0) or directly from physical address
>>>  (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>>>  mechanism makes it possible to support legacy systems on the market.
>>
>>  Which systems does this enable?  There's a suggestion above that these
>>  systems could update firmware to get OpRegion v2.1 support, why
>>  shouldn't we ask users to do that instead?  When we added OpRegion v2.1
>>  support we were told that v2.0 support was essentially non-existent,
>>  why should we add code to support and old spec with few users for such
>>  a niche use case?
> Hi Alex, there was some mis-alignment with the BIOS owner that we were told 
> the 2.0 system doesn't for retail but only for internal development. However 
> in other projects we DO see the retail market has such systems, including NUC 
> NUC6CAYB, some APL industrial PC used in RT system, and some customized APL 
> motherboard by commercial virtualization solution. We immediately contact the 
> BIOS owner to ask for a clarification and they admit it. These system won't 
> get updated BIOS for OpRegion update but still under warranty. That's why the 
> OpRegion 2.0 support is still needed.
>
>> 
>>> Cc:  Zhenyu Wang <zhenyuw@linux.intel.com>
>>> Cc:  Hang Yuan <hang.yuan@linux.intel.com>
>>> Cc:  Swee Yee Fonn <swee.yee.fonn@intel.com>
>>> Cc:  Fred Gao <fred.gao@intel.com>
>>>  Signed-off-by: Colin Xu <colin.xu@intel.com>
>>>  ---
>>>   drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>>   1 file changed, 75 insertions(+), 42 deletions(-)
>>>
>>>  diff --git a/drivers/vfio/pci/vfio_pci_igd.c
>>>  b/drivers/vfio/pci/vfio_pci_igd.c
>>>  index 228df565e9bc..22b9436a3044 100644
>>>  --- a/drivers/vfio/pci/vfio_pci_igd.c
>>>  +++ b/drivers/vfio/pci/vfio_pci_igd.c
>>>  @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device
>>>  *vdev, char __user *buf,
>>>   static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>>   				 struct vfio_pci_region *region)
>>>  {
>>>  -	memunmap(region->data);
>>>  +	if (is_ioremap_addr(region->data))
>>>  +		memunmap(region->data);
>>>  +	else
>>>  +		kfree(region->data);
>>>   }
>>>
>>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>  @@ -59,10 +62,11 @@ static const struct vfio_pci_regops
>>>  vfio_pci_igd_regops = {
>>>   static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>   {
>>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>>  -	u32 addr, size;
>>>  -	void *base;
>>>  +	u32 addr, size, rvds = 0;
>>>  +	void *base, *opregionvbt;
>>>    int ret;
>>>    u16 version;
>>>  +	u64 rvda = 0;
>>>
>>>    ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>>    if (ret)
>>>  @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct
>>>  vfio_pci_device *vdev)
>>>    size *= 1024; /* In KB */
>>>
>>>  	/*
>>>  -	 * Support opregion v2.1+
>>>  -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4,
>>>  then
>>>  -	 * the Extended VBT region next to opregion is used to hold the VBT
>>>  data.
>>>  -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>>>  -	 * (Raw VBT Data Size) from opregion structure member are used to
>>>  hold the
>>>  -	 * address from region base and size of VBT data. RVDA/RVDS are not
>>>  -	 * defined before opregion 2.0.
>>>  +	 * OpRegion and VBT:
>>>  +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>>>  +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large
>>>  enough
>>>  +	 * to hold the VBT data, the Extended VBT region is introduced since
>>>  +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS
>>>  are
>>>  +	 * introduced to define the extended VBT data location and size.
>>>  +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>>>  +	 *   extended VBT data, RVDS defines the VBT data size.
>>>  +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>>>  +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data
>>>  size.
>>>  	 *
>>>  -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>>>  -	 * opregion base, and should point to the end of opregion.
>>>  -	 * otherwise, exposing to userspace to allow read access to
>>>  everything between
>>>  -	 * the OpRegion and VBT is not safe.
>>>  -	 * RVDS is defined as size in bytes.
>>>  -	 *
>>>  -	 * opregion 2.0: rvda is the physical VBT address.
>>>  -	 * Since rvda is HPA it cannot be directly used in guest.
>>>  -	 * And it should not be practically available for end user,so it is
>>>  not supported.
>>>  +	 * Due to the RVDA difference in OpRegion VBT (also the only diff
>>>  between
>>>  +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to
>>>  map
>>>  +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>>>  +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>>>  +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view,
>>>  only
>>>  +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or
>>>  2.1.
>>>    */
>>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>>>  -	if (version >= 0x0200) {
>>>  -		u64 rvda;
>>>  -		u32 rvds;
>>>
>>>  +	if (version >= 0x0200) {
>>>     rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>>     rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>>>  +
>>>  +		/* The extended VBT is valid only when RVDA/RVDS are
>>>  non-zero. */
>>>  		if (rvda && rvds) {
>>>  -			/* no support for opregion v2.0 with physical VBT
>>>  address */
>>>  -			if (version == 0x0200) {
>>>  +			size += rvds;
>>>  +		}
>>>  +
>>>  +		/* The extended VBT must follows OpRegion for OpRegion 2.1+
>>>  */
>>>  +		if (rvda != size && version > 0x0200) {
>>
>>  But we already added rvds to size, this is not compatible with the
>>  previous code that required rvda == size BEFORE adding rvds.
>>
>>>  +			memunmap(base);
>>>  +			pci_err(vdev->pdev,
>>>  +				"Extended VBT does not follow opregion on
>>>  version 0x%04x\n",
>>>  +				version);
>>>  +			return -EINVAL;
>>>  +		}
>>>  +	}
>>>  +
>>>  +	if (size != OPREGION_SIZE) {
>>>  +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
>>>  +		if (rvda && rvds && version == 0x0200) {
>>>  +			void *vbt_base;
>>>  +
>>>  +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
>>>  +			if (!vbt_base) {
>>>  				memunmap(base);
>>>  -				pci_err(vdev->pdev,
>>>  -					"IGD assignment does not support
>>>  opregion v2.0 with an extended VBT region\n");
>>>  -				return -EINVAL;
>>>  +				return -ENOMEM;
>>>      }
>>>
>>>  -			if (rvda != size) {
>>>  +			opregionvbt = kzalloc(size, GFP_KERNEL);
>>>  +			if (!opregionvbt) {
>>>  				memunmap(base);
>>>  -				pci_err(vdev->pdev,
>>>  -					"Extended VBT does not follow
>>>  opregion on version 0x%04x\n",
>>>  -					version);
>>>  -				return -EINVAL;
>>>  +				memunmap(vbt_base);
>>>  +				return -ENOMEM;
>>>      }
>>>
>>>  -			/* region size for opregion v2.0+: opregion and VBT
>>>  size. */
>>>  -			size += rvds;
>>>  +			/* Stitch VBT after OpRegion noncontigious */
>>>  +			memcpy(opregionvbt, base, OPREGION_SIZE);
>>>  +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
>>>  +
>>>  +			/* Patch OpRegion 2.0 to 2.1 */
>>>  +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
>>>  +			/* Patch RVDA to relative address after OpRegion */
>>>  +			*(__le64 *)(opregionvbt + OPREGION_RVDA) =
>>>  OPREGION_SIZE;
>>
>>  AIUI, the OpRegion is a two-way channel between the IGD device/system
>>  BIOS and the driver, numerous fields are writable by the driver.  Now
>>  the driver writes to a shadow copy of the OpRegion table.  What
>>  completes the write to the real OpRegion table for consumption by the
>>  device/BIOS?  Likewise, what updates the fields that are written by the
>>  device/BIOS for consumption by the driver?
>>
>>  If a shadow copy of the OpRegion detached from the physical table is
>>  sufficient here, why wouldn't we always shadow the OpRegion and prevent
>>  all userspace writes from touching the real version?  Thanks,
>>
>>  Alex
>>
>>>  +
>>>  +			memunmap(vbt_base);
>>>  +			memunmap(base);
>>>  +
>>>  +			/* Register shadow instead of map as vfio_region */
>>>  +			base = opregionvbt;
>>>  +		/* Remap OpRegion + extended VBT for 2.1+ */
>>>  +		} else {
>>>  +			memunmap(base);
>>>  +			base = memremap(addr, size, MEMREMAP_WB);
>>>  +			if (!base)
>>>  +				return -ENOMEM;
>>>    	}
>>>    }
>>>
>>>  -	if (size != OPREGION_SIZE) {
>>>  -		memunmap(base);
>>>  -		base = memremap(addr, size, MEMREMAP_WB);
>>>  -		if (!base)
>>>  -			return -ENOMEM;
>>>  -	}
>>>  -
>>>    ret = vfio_pci_register_dev_region(vdev,
>>>     PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>>     VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>>     &vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>>  	if (ret) {
>>>  -		memunmap(base);
>>>  +		if (is_ioremap_addr(base))
>>>  +			memunmap(base);
>>>  +		else
>>>  +			kfree(base);
>>>    	return ret;
>>>    }
>>> 
>> 
>> 
>
> --
> Best Regards,
> Colin Xu
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-27  1:36     ` Colin Xu
@ 2021-08-27  1:48       ` Alex Williamson
  2021-08-27  2:24         ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-08-27  1:48 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, linux-kernel, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 27 Aug 2021 09:36:36 +0800 (CST)
Colin Xu <colin.xu@intel.com> wrote:

> Hi Alex,
> 
> In addition to the background that devices on market may still need 
> OpRegion 2.0 support in vfio-pci, do you have other comments to the patch 
> body?

Yes, there were further comments in my first reply below.  Thanks,

Alex


> On Tue, 17 Aug 2021, Colin Xu wrote:
> 
> > On Mon, 16 Aug 2021, Alex Williamson wrote:
> >  
> >>  On Fri, 13 Aug 2021 10:13:29 +0800
> >>  Colin Xu <colin.xu@intel.com> wrote:
> >>  
> >>>  Due to historical reason, some legacy shipped system doesn't follow
> >>>  OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> >>>  VBT is not contigious after OpRegion in physical address, but any
> >>>  location pointed by RVDA via absolute address. Thus it's impossible
> >>>  to map a contigious range to hold both OpRegion and extended VBT as 2.1.
> >>>
> >>>  Since the only difference between OpRegion 2.0 and 2.1 is where extended
> >>>  VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> >>>  while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> >>>  baes, and there is no other difference between OpRegion 2.0 and 2.1,
> >>>  it's feasible to amend OpRegion support for these legacy system (before
> >>>  upgrading the system firmware), by kazlloc a range to shadown OpRegion
> >>>  from the beginning and stitch VBT after closely, patch the shadow
> >>>  OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
> >>>  address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
> >>>  2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
> >>>  if the extended VBT exists. vfio igd OpRegion r/w ops will return either
> >>>  shadowed data (OpRegion 2.0) or directly from physical address
> >>>  (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
> >>>  mechanism makes it possible to support legacy systems on the market.  
> >>
> >>  Which systems does this enable?  There's a suggestion above that these
> >>  systems could update firmware to get OpRegion v2.1 support, why
> >>  shouldn't we ask users to do that instead?  When we added OpRegion v2.1
> >>  support we were told that v2.0 support was essentially non-existent,
> >>  why should we add code to support and old spec with few users for such
> >>  a niche use case?  
> > Hi Alex, there was some mis-alignment with the BIOS owner that we were told 
> > the 2.0 system doesn't for retail but only for internal development. However 
> > in other projects we DO see the retail market has such systems, including NUC 
> > NUC6CAYB, some APL industrial PC used in RT system, and some customized APL 
> > motherboard by commercial virtualization solution. We immediately contact the 
> > BIOS owner to ask for a clarification and they admit it. These system won't 
> > get updated BIOS for OpRegion update but still under warranty. That's why the 
> > OpRegion 2.0 support is still needed.
> >  
> >>   
> >>> Cc:  Zhenyu Wang <zhenyuw@linux.intel.com>
> >>> Cc:  Hang Yuan <hang.yuan@linux.intel.com>
> >>> Cc:  Swee Yee Fonn <swee.yee.fonn@intel.com>
> >>> Cc:  Fred Gao <fred.gao@intel.com>
> >>>  Signed-off-by: Colin Xu <colin.xu@intel.com>
> >>>  ---
> >>>   drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
> >>>   1 file changed, 75 insertions(+), 42 deletions(-)
> >>>
> >>>  diff --git a/drivers/vfio/pci/vfio_pci_igd.c
> >>>  b/drivers/vfio/pci/vfio_pci_igd.c
> >>>  index 228df565e9bc..22b9436a3044 100644
> >>>  --- a/drivers/vfio/pci/vfio_pci_igd.c
> >>>  +++ b/drivers/vfio/pci/vfio_pci_igd.c
> >>>  @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device
> >>>  *vdev, char __user *buf,
> >>>   static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
> >>>   				 struct vfio_pci_region *region)
> >>>  {
> >>>  -	memunmap(region->data);
> >>>  +	if (is_ioremap_addr(region->data))
> >>>  +		memunmap(region->data);
> >>>  +	else
> >>>  +		kfree(region->data);
> >>>   }
> >>>
> >>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> >>>  @@ -59,10 +62,11 @@ static const struct vfio_pci_regops
> >>>  vfio_pci_igd_regops = {
> >>>   static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
> >>>   {
> >>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
> >>>  -	u32 addr, size;
> >>>  -	void *base;
> >>>  +	u32 addr, size, rvds = 0;
> >>>  +	void *base, *opregionvbt;
> >>>    int ret;
> >>>    u16 version;
> >>>  +	u64 rvda = 0;
> >>>
> >>>    ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
> >>>    if (ret)
> >>>  @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct
> >>>  vfio_pci_device *vdev)
> >>>    size *= 1024; /* In KB */
> >>>
> >>>  	/*
> >>>  -	 * Support opregion v2.1+
> >>>  -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4,
> >>>  then
> >>>  -	 * the Extended VBT region next to opregion is used to hold the VBT
> >>>  data.
> >>>  -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> >>>  -	 * (Raw VBT Data Size) from opregion structure member are used to
> >>>  hold the
> >>>  -	 * address from region base and size of VBT data. RVDA/RVDS are not
> >>>  -	 * defined before opregion 2.0.
> >>>  +	 * OpRegion and VBT:
> >>>  +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> >>>  +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large
> >>>  enough
> >>>  +	 * to hold the VBT data, the Extended VBT region is introduced since
> >>>  +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS
> >>>  are
> >>>  +	 * introduced to define the extended VBT data location and size.
> >>>  +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> >>>  +	 *   extended VBT data, RVDS defines the VBT data size.
> >>>  +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> >>>  +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data
> >>>  size.
> >>>  	 *
> >>>  -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> >>>  -	 * opregion base, and should point to the end of opregion.
> >>>  -	 * otherwise, exposing to userspace to allow read access to
> >>>  everything between
> >>>  -	 * the OpRegion and VBT is not safe.
> >>>  -	 * RVDS is defined as size in bytes.
> >>>  -	 *
> >>>  -	 * opregion 2.0: rvda is the physical VBT address.
> >>>  -	 * Since rvda is HPA it cannot be directly used in guest.
> >>>  -	 * And it should not be practically available for end user,so it is
> >>>  not supported.
> >>>  +	 * Due to the RVDA difference in OpRegion VBT (also the only diff
> >>>  between
> >>>  +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to
> >>>  map
> >>>  +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
> >>>  +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
> >>>  +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view,
> >>>  only
> >>>  +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or
> >>>  2.1.
> >>>    */
> >>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> >>>  -	if (version >= 0x0200) {
> >>>  -		u64 rvda;
> >>>  -		u32 rvds;
> >>>
> >>>  +	if (version >= 0x0200) {
> >>>     rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
> >>>     rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> >>>  +
> >>>  +		/* The extended VBT is valid only when RVDA/RVDS are
> >>>  non-zero. */
> >>>  		if (rvda && rvds) {
> >>>  -			/* no support for opregion v2.0 with physical VBT
> >>>  address */
> >>>  -			if (version == 0x0200) {
> >>>  +			size += rvds;
> >>>  +		}
> >>>  +
> >>>  +		/* The extended VBT must follows OpRegion for OpRegion 2.1+
> >>>  */
> >>>  +		if (rvda != size && version > 0x0200) {  
> >>
> >>  But we already added rvds to size, this is not compatible with the
> >>  previous code that required rvda == size BEFORE adding rvds.
> >>  
> >>>  +			memunmap(base);
> >>>  +			pci_err(vdev->pdev,
> >>>  +				"Extended VBT does not follow opregion on
> >>>  version 0x%04x\n",
> >>>  +				version);
> >>>  +			return -EINVAL;
> >>>  +		}
> >>>  +	}
> >>>  +
> >>>  +	if (size != OPREGION_SIZE) {
> >>>  +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
> >>>  +		if (rvda && rvds && version == 0x0200) {
> >>>  +			void *vbt_base;
> >>>  +
> >>>  +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
> >>>  +			if (!vbt_base) {
> >>>  				memunmap(base);
> >>>  -				pci_err(vdev->pdev,
> >>>  -					"IGD assignment does not support
> >>>  opregion v2.0 with an extended VBT region\n");
> >>>  -				return -EINVAL;
> >>>  +				return -ENOMEM;
> >>>      }
> >>>
> >>>  -			if (rvda != size) {
> >>>  +			opregionvbt = kzalloc(size, GFP_KERNEL);
> >>>  +			if (!opregionvbt) {
> >>>  				memunmap(base);
> >>>  -				pci_err(vdev->pdev,
> >>>  -					"Extended VBT does not follow
> >>>  opregion on version 0x%04x\n",
> >>>  -					version);
> >>>  -				return -EINVAL;
> >>>  +				memunmap(vbt_base);
> >>>  +				return -ENOMEM;
> >>>      }
> >>>
> >>>  -			/* region size for opregion v2.0+: opregion and VBT
> >>>  size. */
> >>>  -			size += rvds;
> >>>  +			/* Stitch VBT after OpRegion noncontigious */
> >>>  +			memcpy(opregionvbt, base, OPREGION_SIZE);
> >>>  +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
> >>>  +
> >>>  +			/* Patch OpRegion 2.0 to 2.1 */
> >>>  +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
> >>>  +			/* Patch RVDA to relative address after OpRegion */
> >>>  +			*(__le64 *)(opregionvbt + OPREGION_RVDA) =
> >>>  OPREGION_SIZE;  
> >>
> >>  AIUI, the OpRegion is a two-way channel between the IGD device/system
> >>  BIOS and the driver, numerous fields are writable by the driver.  Now
> >>  the driver writes to a shadow copy of the OpRegion table.  What
> >>  completes the write to the real OpRegion table for consumption by the
> >>  device/BIOS?  Likewise, what updates the fields that are written by the
> >>  device/BIOS for consumption by the driver?
> >>
> >>  If a shadow copy of the OpRegion detached from the physical table is
> >>  sufficient here, why wouldn't we always shadow the OpRegion and prevent
> >>  all userspace writes from touching the real version?  Thanks,
> >>
> >>  Alex
> >>  
> >>>  +
> >>>  +			memunmap(vbt_base);
> >>>  +			memunmap(base);
> >>>  +
> >>>  +			/* Register shadow instead of map as vfio_region */
> >>>  +			base = opregionvbt;
> >>>  +		/* Remap OpRegion + extended VBT for 2.1+ */
> >>>  +		} else {
> >>>  +			memunmap(base);
> >>>  +			base = memremap(addr, size, MEMREMAP_WB);
> >>>  +			if (!base)
> >>>  +				return -ENOMEM;
> >>>    	}
> >>>    }
> >>>
> >>>  -	if (size != OPREGION_SIZE) {
> >>>  -		memunmap(base);
> >>>  -		base = memremap(addr, size, MEMREMAP_WB);
> >>>  -		if (!base)
> >>>  -			return -ENOMEM;
> >>>  -	}
> >>>  -
> >>>    ret = vfio_pci_register_dev_region(vdev,
> >>>     PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> >>>     VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
> >>>     &vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
> >>>  	if (ret) {
> >>>  -		memunmap(base);
> >>>  +		if (is_ioremap_addr(base))
> >>>  +			memunmap(base);
> >>>  +		else
> >>>  +			kfree(base);
> >>>    	return ret;
> >>>    }
> >>>   
> >> 
> >>   
> >
> > --
> > Best Regards,
> > Colin Xu
> >
> >  
> 
> --
> Best Regards,
> Colin Xu
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-27  1:48       ` Alex Williamson
@ 2021-08-27  2:24         ` Colin Xu
  2021-08-27  2:37           ` [PATCH v2] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-08-27  2:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, linux-kernel, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 27 Aug 2021, Alex Williamson wrote:

> On Fri, 27 Aug 2021 09:36:36 +0800 (CST)
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Hi Alex,
>>
>> In addition to the background that devices on market may still need
>> OpRegion 2.0 support in vfio-pci, do you have other comments to the patch
>> body?
>
> Yes, there were further comments in my first reply below.  Thanks,
>
> Alex
OOPS, missed that. Replied inline.
>
>
>> On Tue, 17 Aug 2021, Colin Xu wrote:
>>
>>> On Mon, 16 Aug 2021, Alex Williamson wrote:
>>>
>>>>  On Fri, 13 Aug 2021 10:13:29 +0800
>>>>  Colin Xu <colin.xu@intel.com> wrote:
>>>>
>>>>>  Due to historical reason, some legacy shipped system doesn't follow
>>>>>  OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>>>>>  VBT is not contigious after OpRegion in physical address, but any
>>>>>  location pointed by RVDA via absolute address. Thus it's impossible
>>>>>  to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>>>>
>>>>>  Since the only difference between OpRegion 2.0 and 2.1 is where extended
>>>>>  VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>>>>>  while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>>>>>  baes, and there is no other difference between OpRegion 2.0 and 2.1,
>>>>>  it's feasible to amend OpRegion support for these legacy system (before
>>>>>  upgrading the system firmware), by kazlloc a range to shadown OpRegion
>>>>>  from the beginning and stitch VBT after closely, patch the shadow
>>>>>  OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>>>>>  address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>>>>>  2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>>>>>  if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>>>>>  shadowed data (OpRegion 2.0) or directly from physical address
>>>>>  (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>>>>>  mechanism makes it possible to support legacy systems on the market.
>>>>
>>>>  Which systems does this enable?  There's a suggestion above that these
>>>>  systems could update firmware to get OpRegion v2.1 support, why
>>>>  shouldn't we ask users to do that instead?  When we added OpRegion v2.1
>>>>  support we were told that v2.0 support was essentially non-existent,
>>>>  why should we add code to support and old spec with few users for such
>>>>  a niche use case?
>>> Hi Alex, there was some mis-alignment with the BIOS owner that we were told
>>> the 2.0 system doesn't for retail but only for internal development. However
>>> in other projects we DO see the retail market has such systems, including NUC
>>> NUC6CAYB, some APL industrial PC used in RT system, and some customized APL
>>> motherboard by commercial virtualization solution. We immediately contact the
>>> BIOS owner to ask for a clarification and they admit it. These system won't
>>> get updated BIOS for OpRegion update but still under warranty. That's why the
>>> OpRegion 2.0 support is still needed.
>>>
>>>>
>>>>> Cc:  Zhenyu Wang <zhenyuw@linux.intel.com>
>>>>> Cc:  Hang Yuan <hang.yuan@linux.intel.com>
>>>>> Cc:  Swee Yee Fonn <swee.yee.fonn@intel.com>
>>>>> Cc:  Fred Gao <fred.gao@intel.com>
>>>>>  Signed-off-by: Colin Xu <colin.xu@intel.com>
>>>>>  ---
>>>>>   drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>>>>   1 file changed, 75 insertions(+), 42 deletions(-)
>>>>>
>>>>>  diff --git a/drivers/vfio/pci/vfio_pci_igd.c
>>>>>  b/drivers/vfio/pci/vfio_pci_igd.c
>>>>>  index 228df565e9bc..22b9436a3044 100644
>>>>>  --- a/drivers/vfio/pci/vfio_pci_igd.c
>>>>>  +++ b/drivers/vfio/pci/vfio_pci_igd.c
>>>>>  @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device
>>>>>  *vdev, char __user *buf,
>>>>>   static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>>>>   				 struct vfio_pci_region *region)
>>>>>  {
>>>>>  -	memunmap(region->data);
>>>>>  +	if (is_ioremap_addr(region->data))
>>>>>  +		memunmap(region->data);
>>>>>  +	else
>>>>>  +		kfree(region->data);
>>>>>   }
>>>>>
>>>>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>>>  @@ -59,10 +62,11 @@ static const struct vfio_pci_regops
>>>>>  vfio_pci_igd_regops = {
>>>>>   static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>>>   {
>>>>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>>>>  -	u32 addr, size;
>>>>>  -	void *base;
>>>>>  +	u32 addr, size, rvds = 0;
>>>>>  +	void *base, *opregionvbt;
>>>>>    int ret;
>>>>>    u16 version;
>>>>>  +	u64 rvda = 0;
>>>>>
>>>>>    ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>>>>    if (ret)
>>>>>  @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct
>>>>>  vfio_pci_device *vdev)
>>>>>    size *= 1024; /* In KB */
>>>>>
>>>>>  	/*
>>>>>  -	 * Support opregion v2.1+
>>>>>  -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4,
>>>>>  then
>>>>>  -	 * the Extended VBT region next to opregion is used to hold the VBT
>>>>>  data.
>>>>>  -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>>>>>  -	 * (Raw VBT Data Size) from opregion structure member are used to
>>>>>  hold the
>>>>>  -	 * address from region base and size of VBT data. RVDA/RVDS are not
>>>>>  -	 * defined before opregion 2.0.
>>>>>  +	 * OpRegion and VBT:
>>>>>  +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>>>>>  +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large
>>>>>  enough
>>>>>  +	 * to hold the VBT data, the Extended VBT region is introduced since
>>>>>  +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS
>>>>>  are
>>>>>  +	 * introduced to define the extended VBT data location and size.
>>>>>  +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>>>>>  +	 *   extended VBT data, RVDS defines the VBT data size.
>>>>>  +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>>>>>  +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data
>>>>>  size.
>>>>>  	 *
>>>>>  -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>>>>>  -	 * opregion base, and should point to the end of opregion.
>>>>>  -	 * otherwise, exposing to userspace to allow read access to
>>>>>  everything between
>>>>>  -	 * the OpRegion and VBT is not safe.
>>>>>  -	 * RVDS is defined as size in bytes.
>>>>>  -	 *
>>>>>  -	 * opregion 2.0: rvda is the physical VBT address.
>>>>>  -	 * Since rvda is HPA it cannot be directly used in guest.
>>>>>  -	 * And it should not be practically available for end user,so it is
>>>>>  not supported.
>>>>>  +	 * Due to the RVDA difference in OpRegion VBT (also the only diff
>>>>>  between
>>>>>  +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to
>>>>>  map
>>>>>  +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>>>>>  +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>>>>>  +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view,
>>>>>  only
>>>>>  +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or
>>>>>  2.1.
>>>>>    */
>>>>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>>>>>  -	if (version >= 0x0200) {
>>>>>  -		u64 rvda;
>>>>>  -		u32 rvds;
>>>>>
>>>>>  +	if (version >= 0x0200) {
>>>>>     rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>>>>     rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>>>>>  +
>>>>>  +		/* The extended VBT is valid only when RVDA/RVDS are
>>>>>  non-zero. */
>>>>>  		if (rvda && rvds) {
>>>>>  -			/* no support for opregion v2.0 with physical VBT
>>>>>  address */
>>>>>  -			if (version == 0x0200) {
>>>>>  +			size += rvds;
>>>>>  +		}
>>>>>  +
>>>>>  +		/* The extended VBT must follows OpRegion for OpRegion 2.1+
>>>>>  */
>>>>>  +		if (rvda != size && version > 0x0200) {
>>>>
>>>>  But we already added rvds to size, this is not compatible with the
>>>>  previous code that required rvda == size BEFORE adding rvds.
>>>>
Emmm this is wrong. Should move the size check before increasing the total 
size.
>>>>>  +			memunmap(base);
>>>>>  +			pci_err(vdev->pdev,
>>>>>  +				"Extended VBT does not follow opregion on
>>>>>  version 0x%04x\n",
>>>>>  +				version);
>>>>>  +			return -EINVAL;
>>>>>  +		}
>>>>>  +	}
>>>>>  +
>>>>>  +	if (size != OPREGION_SIZE) {
>>>>>  +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
>>>>>  +		if (rvda && rvds && version == 0x0200) {
>>>>>  +			void *vbt_base;
>>>>>  +
>>>>>  +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
>>>>>  +			if (!vbt_base) {
>>>>>  				memunmap(base);
>>>>>  -				pci_err(vdev->pdev,
>>>>>  -					"IGD assignment does not support
>>>>>  opregion v2.0 with an extended VBT region\n");
>>>>>  -				return -EINVAL;
>>>>>  +				return -ENOMEM;
>>>>>      }
>>>>>
>>>>>  -			if (rvda != size) {
>>>>>  +			opregionvbt = kzalloc(size, GFP_KERNEL);
>>>>>  +			if (!opregionvbt) {
>>>>>  				memunmap(base);
>>>>>  -				pci_err(vdev->pdev,
>>>>>  -					"Extended VBT does not follow
>>>>>  opregion on version 0x%04x\n",
>>>>>  -					version);
>>>>>  -				return -EINVAL;
>>>>>  +				memunmap(vbt_base);
>>>>>  +				return -ENOMEM;
>>>>>      }
>>>>>
>>>>>  -			/* region size for opregion v2.0+: opregion and VBT
>>>>>  size. */
>>>>>  -			size += rvds;
>>>>>  +			/* Stitch VBT after OpRegion noncontigious */
>>>>>  +			memcpy(opregionvbt, base, OPREGION_SIZE);
>>>>>  +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
>>>>>  +
>>>>>  +			/* Patch OpRegion 2.0 to 2.1 */
>>>>>  +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
>>>>>  +			/* Patch RVDA to relative address after OpRegion */
>>>>>  +			*(__le64 *)(opregionvbt + OPREGION_RVDA) =
>>>>>  OPREGION_SIZE;
>>>>
>>>>  AIUI, the OpRegion is a two-way channel between the IGD device/system
>>>>  BIOS and the driver, numerous fields are writable by the driver.  Now
>>>>  the driver writes to a shadow copy of the OpRegion table.  What
>>>>  completes the write to the real OpRegion table for consumption by the
>>>>  device/BIOS?  Likewise, what updates the fields that are written by the
>>>>  device/BIOS for consumption by the driver?
>>>>
>>>>  If a shadow copy of the OpRegion detached from the physical table is
>>>>  sufficient here, why wouldn't we always shadow the OpRegion and prevent
>>>>  all userspace writes from touching the real version?  Thanks,
>>>>
>>>>  Alex
Yes per spec, OpRegion allows driver write to notify BIOS as mailbox, thus 
BIOS could do some operations, like ACPI notification, or fill the result 
on query. However the write operation is always blocked on r/w ops so 
guest write will always return -EINVAL. If only consider this patch, which 
doesn't change the behaviour, that no matter shadow or not, write to 
OpRegion is always blocked. If consider from full functionality, this is a 
gap between IGD pass through and native. Simply allow the write to 
OpRegion may expose unguarded information from host, or trigger host 
BIOS/ACPI doing unmanaged operation. More discussions are necessary on how 
to handle OpRegion write in IGD pass through, like if we don't want those 
unmanaged behaviour triggered from guest, may need modify the OpRegion 
data exposed to guest. Or if those functionalities are still needed for 
guest, may need consider how to handle them in a more secure way.6~

>>>>
>>>>>  +
>>>>>  +			memunmap(vbt_base);
>>>>>  +			memunmap(base);
>>>>>  +
>>>>>  +			/* Register shadow instead of map as vfio_region */
>>>>>  +			base = opregionvbt;
>>>>>  +		/* Remap OpRegion + extended VBT for 2.1+ */
>>>>>  +		} else {
>>>>>  +			memunmap(base);
>>>>>  +			base = memremap(addr, size, MEMREMAP_WB);
>>>>>  +			if (!base)
>>>>>  +				return -ENOMEM;
>>>>>    	}
>>>>>    }
>>>>>
>>>>>  -	if (size != OPREGION_SIZE) {
>>>>>  -		memunmap(base);
>>>>>  -		base = memremap(addr, size, MEMREMAP_WB);
>>>>>  -		if (!base)
>>>>>  -			return -ENOMEM;
>>>>>  -	}
>>>>>  -
>>>>>    ret = vfio_pci_register_dev_region(vdev,
>>>>>     PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>>>>     VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>>>>     &vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>>>>  	if (ret) {
>>>>>  -		memunmap(base);
>>>>>  +		if (is_ioremap_addr(base))
>>>>>  +			memunmap(base);
>>>>>  +		else
>>>>>  +			kfree(base);
>>>>>    	return ret;
>>>>>    }
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best Regards,
>>> Colin Xu
>>>
>>>
>>
>> --
>> Best Regards,
>> Colin Xu
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-27  2:24         ` Colin Xu
@ 2021-08-27  2:37           ` Colin Xu
  2021-08-30 20:27             ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-08-27  2:37 UTC (permalink / raw)
  To: alex.williamson; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contigious after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Thus it's impossible
to map a contigious range to hold both OpRegion and extended VBT as 2.1.

Since the only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1,
it's feasible to amend OpRegion support for these legacy system (before
upgrading the system firmware), by kazlloc a range to shadown OpRegion
from the beginning and stitch VBT after closely, patch the shadow
OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
if the extended VBT exists. vfio igd OpRegion r/w ops will return either
shadowed data (OpRegion 2.0) or directly from physical address
(OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
mechanism makes it possible to support legacy systems on the market.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
 1 file changed, 75 insertions(+), 42 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..9cd44498b378 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	if (is_ioremap_addr(region->data))
+		memunmap(region->data);
+	else
+		kfree(region->data);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
 static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
-	u32 addr, size;
-	void *base;
+	u32 addr, size, rvds = 0;
+	void *base, *opregionvbt;
 	int ret;
 	u16 version;
+	u64 rvda = 0;
 
 	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
 	if (ret)
@@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
-	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
+	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
+	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
+	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
+	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
 	 */
 	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
-	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
 
+	if (version >= 0x0200) {
 		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
 		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+
+		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
+		if (rvda != size && version > 0x0200) {
+			memunmap(base);
+			pci_err(vdev->pdev,
+				"Extended VBT does not follow opregion on version 0x%04x\n",
+				version);
+			return -EINVAL;
+		}
+
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
+			size += rvds;
+		}
+	}
+
+	if (size != OPREGION_SIZE) {
+		/* Allocate memory for OpRegion and extended VBT for 2.0 */
+		if (rvda && rvds && version == 0x0200) {
+			void *vbt_base;
+
+			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
+			if (!vbt_base) {
 				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
+				return -ENOMEM;
 			}
 
-			if (rvda != size) {
+			opregionvbt = kzalloc(size, GFP_KERNEL);
+			if (!opregionvbt) {
 				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+				memunmap(vbt_base);
+				return -ENOMEM;
 			}
 
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
+			/* Stitch VBT after OpRegion noncontigious */
+			memcpy(opregionvbt, base, OPREGION_SIZE);
+			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
+
+			/* Patch OpRegion 2.0 to 2.1 */
+			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
+			/* Patch RVDA to relative address after OpRegion */
+			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;
+
+			memunmap(vbt_base);
+			memunmap(base);
+
+			/* Register shadow instead of map as vfio_region */
+			base = opregionvbt;
+		/* Remap OpRegion + extended VBT for 2.1+ */
+		} else {
+			memunmap(base);
+			base = memremap(addr, size, MEMREMAP_WB);
+			if (!base)
+				return -ENOMEM;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
 		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
 	if (ret) {
-		memunmap(base);
+		if (is_ioremap_addr(base))
+			memunmap(base);
+		else
+			kfree(base);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-27  2:37           ` [PATCH v2] " Colin Xu
@ 2021-08-30 20:27             ` Alex Williamson
  2021-09-02  7:11               ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-08-30 20:27 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 27 Aug 2021 10:37:16 +0800
Colin Xu <colin.xu@intel.com> wrote:

> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contigious after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Thus it's impossible
> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
> 
> Since the only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1,
> it's feasible to amend OpRegion support for these legacy system (before
> upgrading the system firmware), by kazlloc a range to shadown OpRegion
> from the beginning and stitch VBT after closely, patch the shadow
> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
> shadowed data (OpRegion 2.0) or directly from physical address
> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
> mechanism makes it possible to support legacy systems on the market.
> 
> V2:
> Validate RVDA for 2.1+ before increasing total size. (Alex)
> 
> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>  1 file changed, 75 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> index 228df565e9bc..9cd44498b378 100644
> --- a/drivers/vfio/pci/vfio_pci_igd.c
> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>  				 struct vfio_pci_region *region)
>  {
> -	memunmap(region->data);
> +	if (is_ioremap_addr(region->data))
> +		memunmap(region->data);
> +	else
> +		kfree(region->data);


Since we don't have write support to the OpRegion, should we always
allocate a shadow copy to simplify?  Or rather than a shadow copy,
since we don't support mmap of the region, our read handler could
virtualize version and rvda on the fly and shift accesses so that the
VBT appears contiguous.  That might also leave us better positioned for
handling dynamic changes (ex. does the data change when a monitor is
plugged/unplugged) and perhaps eventually write support.


>  }
>  
>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  {
>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
> -	u32 addr, size;
> -	void *base;
> +	u32 addr, size, rvds = 0;
> +	void *base, *opregionvbt;


opregionvbt could be scoped within the branch it's used.


>  	int ret;
>  	u16 version;
> +	u64 rvda = 0;
>  
>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>  	if (ret)
> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  	size *= 1024; /* In KB */
>  
>  	/*
> -	 * Support opregion v2.1+
> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> -	 * defined before opregion 2.0.
> +	 * OpRegion and VBT:
> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> +	 * to hold the VBT data, the Extended VBT region is introduced since
> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> +	 * introduced to define the extended VBT data location and size.
> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> +	 *   extended VBT data, RVDS defines the VBT data size.
> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>  	 *
> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> -	 * opregion base, and should point to the end of opregion.
> -	 * otherwise, exposing to userspace to allow read access to everything between
> -	 * the OpRegion and VBT is not safe.
> -	 * RVDS is defined as size in bytes.
> -	 *
> -	 * opregion 2.0: rvda is the physical VBT address.
> -	 * Since rvda is HPA it cannot be directly used in guest.
> -	 * And it should not be practically available for end user,so it is not supported.
> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>  	 */
>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> -	if (version >= 0x0200) {
> -		u64 rvda;
> -		u32 rvds;
>  
> +	if (version >= 0x0200) {
>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> +
> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */


Why?  If we're going to make our own OpRegion to account for v2.0, why
should it not apply to the same scenario for >2.0?


> +		if (rvda != size && version > 0x0200) {
> +			memunmap(base);
> +			pci_err(vdev->pdev,
> +				"Extended VBT does not follow opregion on version 0x%04x\n",
> +				version);
> +			return -EINVAL;
> +		}
> +
> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>  		if (rvda && rvds) {
> -			/* no support for opregion v2.0 with physical VBT address */
> -			if (version == 0x0200) {
> +			size += rvds;
> +		}
> +	}
> +
> +	if (size != OPREGION_SIZE) {


@size can only != OPREGION_SIZE due to the above branch, so the below
could all be scoped under the version test, or perhaps to a separate
function.


> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
> +		if (rvda && rvds && version == 0x0200) {


We go down this path even if the VBT was contiguous with the OpRegion.


> +			void *vbt_base;
> +
> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
> +			if (!vbt_base) {
>  				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
> -				return -EINVAL;
> +				return -ENOMEM;
>  			}
>  
> -			if (rvda != size) {
> +			opregionvbt = kzalloc(size, GFP_KERNEL);
> +			if (!opregionvbt) {
>  				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"Extended VBT does not follow opregion on version 0x%04x\n",
> -					version);
> -				return -EINVAL;
> +				memunmap(vbt_base);
> +				return -ENOMEM;
>  			}
>  
> -			/* region size for opregion v2.0+: opregion and VBT size. */
> -			size += rvds;
> +			/* Stitch VBT after OpRegion noncontigious */
> +			memcpy(opregionvbt, base, OPREGION_SIZE);
> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
> +
> +			/* Patch OpRegion 2.0 to 2.1 */
> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;


= cpu_to_le16(0x0201);


> +			/* Patch RVDA to relative address after OpRegion */
> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;


= cpu_to_le64(OPREGION_SIZE);


I think this is what triggered the sparse errors.  Thanks,

Alex

> +
> +			memunmap(vbt_base);
> +			memunmap(base);
> +
> +			/* Register shadow instead of map as vfio_region */
> +			base = opregionvbt;
> +		/* Remap OpRegion + extended VBT for 2.1+ */
> +		} else {
> +			memunmap(base);
> +			base = memremap(addr, size, MEMREMAP_WB);
> +			if (!base)
> +				return -ENOMEM;
>  		}
>  	}
>  
> -	if (size != OPREGION_SIZE) {
> -		memunmap(base);
> -		base = memremap(addr, size, MEMREMAP_WB);
> -		if (!base)
> -			return -ENOMEM;
> -	}
> -
>  	ret = vfio_pci_register_dev_region(vdev,
>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>  	if (ret) {
> -		memunmap(base);
> +		if (is_ioremap_addr(base))
> +			memunmap(base);
> +		else
> +			kfree(base);
>  		return ret;
>  	}
>  


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-08-30 20:27             ` Alex Williamson
@ 2021-09-02  7:11               ` Colin Xu
  2021-09-02 21:46                 ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-02  7:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Mon, 30 Aug 2021, Alex Williamson wrote:

Thanks Alex for your detailed comments. I replied them inline.

A general question after these replies is:
which way to handle the readonly OpRegion is preferred?
1) Shadow (modify the RVDA location and OpRegion version for some 
special version, 2.0).
2) On-the-fly modification for reading.

The former doesn't need add extra fields to avoid remap on every read, the
latter leaves flexibility for write operation.

> On Fri, 27 Aug 2021 10:37:16 +0800
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Due to historical reason, some legacy shipped system doesn't follow
>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>> VBT is not contigious after OpRegion in physical address, but any
>> location pointed by RVDA via absolute address. Thus it's impossible
>> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>
>> Since the only difference between OpRegion 2.0 and 2.1 is where extended
>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>> baes, and there is no other difference between OpRegion 2.0 and 2.1,
>> it's feasible to amend OpRegion support for these legacy system (before
>> upgrading the system firmware), by kazlloc a range to shadown OpRegion
>> from the beginning and stitch VBT after closely, patch the shadow
>> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>> shadowed data (OpRegion 2.0) or directly from physical address
>> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>> mechanism makes it possible to support legacy systems on the market.
>>
>> V2:
>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>
>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>> Cc: Fred Gao <fred.gao@intel.com>
>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>  1 file changed, 75 insertions(+), 42 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>> index 228df565e9bc..9cd44498b378 100644
>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>  				 struct vfio_pci_region *region)
>>  {
>> -	memunmap(region->data);
>> +	if (is_ioremap_addr(region->data))
>> +		memunmap(region->data);
>> +	else
>> +		kfree(region->data);
>
>
> Since we don't have write support to the OpRegion, should we always
> allocate a shadow copy to simplify?  Or rather than a shadow copy,
> since we don't support mmap of the region, our read handler could
> virtualize version and rvda on the fly and shift accesses so that the
> VBT appears contiguous.  That might also leave us better positioned for
> handling dynamic changes (ex. does the data change when a monitor is
> plugged/unplugged) and perhaps eventually write support.
>
Always shadow sounds a more simple solution. On-the-fly offset shifting 
may need some extra code:
- A fields to store remapped RVDA, otherwise have to remap on every read.
Should I remap everytime, or add the remapped RVDA in vfio_pci_device.
- Some fields to store extra information, like the old and modified 
opregion version. Current it's parsed in init since it's one time run. To 
support on-the-fly modification, need save them somewhere instead of parse 
on every read.
- Addr shift calculation. Read could called on any start with any size, 
will need add some addr shift code.

>
>>  }
>>
>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  {
>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>> -	u32 addr, size;
>> -	void *base;
>> +	u32 addr, size, rvds = 0;
>> +	void *base, *opregionvbt;
>
>
> opregionvbt could be scoped within the branch it's used.
>
Previous revision doesn't move it into the scope. I'll amend in next 
version.
>>  	int ret;
>>  	u16 version;
>> +	u64 rvda = 0;
>>
>>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>  	if (ret)
>> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  	size *= 1024; /* In KB */
>>
>>  	/*
>> -	 * Support opregion v2.1+
>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>> -	 * defined before opregion 2.0.
>> +	 * OpRegion and VBT:
>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>> +	 * introduced to define the extended VBT data location and size.
>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>> +	 *   extended VBT data, RVDS defines the VBT data size.
>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>  	 *
>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>> -	 * opregion base, and should point to the end of opregion.
>> -	 * otherwise, exposing to userspace to allow read access to everything between
>> -	 * the OpRegion and VBT is not safe.
>> -	 * RVDS is defined as size in bytes.
>> -	 *
>> -	 * opregion 2.0: rvda is the physical VBT address.
>> -	 * Since rvda is HPA it cannot be directly used in guest.
>> -	 * And it should not be practically available for end user,so it is not supported.
>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
>> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
>> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>>  	 */
>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>> -	if (version >= 0x0200) {
>> -		u64 rvda;
>> -		u32 rvds;
>>
>> +	if (version >= 0x0200) {
>>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>> +
>> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
>
>
> Why?  If we're going to make our own OpRegion to account for v2.0, why
> should it not apply to the same scenario for >2.0?
Below check is to validate the correctness for >2.0. Accroding to spec, 
RVDA must equal to OpRegion size. If RVDA doesn't follow spec, the 
OpRegion and VBT may already corrupted so returns error here.
For 2.0, RVDA is the absolute address, VBT may or may not follow OpRegion 
so these is no such check for 2.0.
If you mean "not apply to the same scenario for >2.0" by "only shadow for 
2.0 and return as 2.1, while not using shadow for >2.0", that's because I 
expect to keep the old logic as it is and only change the behavior for 
2.0. Both 2.0 and >2.0 can use shadow mechanism.

>
>
>> +		if (rvda != size && version > 0x0200) {
>> +			memunmap(base);
>> +			pci_err(vdev->pdev,
>> +				"Extended VBT does not follow opregion on version 0x%04x\n",
>> +				version);
>> +			return -EINVAL;
>> +		}
>> +
>> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>>  		if (rvda && rvds) {
>> -			/* no support for opregion v2.0 with physical VBT address */
>> -			if (version == 0x0200) {
>> +			size += rvds;
>> +		}
>> +	}
>> +
>> +	if (size != OPREGION_SIZE) {
>
>
> @size can only != OPREGION_SIZE due to the above branch, so the below
> could all be scoped under the version test, or perhaps to a separate
> function.
I'll move to a separate function, which does the stitch and amendment.
>
>
>> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
>> +		if (rvda && rvds && version == 0x0200) {
>
>
> We go down this path even if the VBT was contiguous with the OpRegion.
>
Yes for 2.0, if RVDA = (OpRegion addr + size) as absolute address, still 
remap the two regions separately. Seems like not necessary to check if 
contiguous and decide remap once or twice. Follow spec to remap is 
straightforward to understand the difference in RVDA (abs. vs rel.)
Did I miss some consideration here?

>
>> +			void *vbt_base;
>> +
>> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
>> +			if (!vbt_base) {
>>  				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
>> -				return -EINVAL;
>> +				return -ENOMEM;
>>  			}
>>
>> -			if (rvda != size) {
>> +			opregionvbt = kzalloc(size, GFP_KERNEL);
>> +			if (!opregionvbt) {
>>  				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"Extended VBT does not follow opregion on version 0x%04x\n",
>> -					version);
>> -				return -EINVAL;
>> +				memunmap(vbt_base);
>> +				return -ENOMEM;
>>  			}
>>
>> -			/* region size for opregion v2.0+: opregion and VBT size. */
>> -			size += rvds;
>> +			/* Stitch VBT after OpRegion noncontigious */
>> +			memcpy(opregionvbt, base, OPREGION_SIZE);
>> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
>> +
>> +			/* Patch OpRegion 2.0 to 2.1 */
>> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
>
>
> = cpu_to_le16(0x0201);
>
>
>> +			/* Patch RVDA to relative address after OpRegion */
>> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;
>
>
> = cpu_to_le64(OPREGION_SIZE);
>
>
> I think this is what triggered the sparse errors.  Thanks,
Thanks I got the sparse errors, will fix this in next version.

>
> Alex
>
>> +
>> +			memunmap(vbt_base);
>> +			memunmap(base);
>> +
>> +			/* Register shadow instead of map as vfio_region */
>> +			base = opregionvbt;
>> +		/* Remap OpRegion + extended VBT for 2.1+ */
>> +		} else {
>> +			memunmap(base);
>> +			base = memremap(addr, size, MEMREMAP_WB);
>> +			if (!base)
>> +				return -ENOMEM;
>>  		}
>>  	}
>>
>> -	if (size != OPREGION_SIZE) {
>> -		memunmap(base);
>> -		base = memremap(addr, size, MEMREMAP_WB);
>> -		if (!base)
>> -			return -ENOMEM;
>> -	}
>> -
>>  	ret = vfio_pci_register_dev_region(vdev,
>>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>  	if (ret) {
>> -		memunmap(base);
>> +		if (is_ioremap_addr(base))
>> +			memunmap(base);
>> +		else
>> +			kfree(base);
>>  		return ret;
>>  	}
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-09-02  7:11               ` Colin Xu
@ 2021-09-02 21:46                 ` Alex Williamson
  2021-09-03  2:23                   ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-09-02 21:46 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Thu, 2 Sep 2021 15:11:11 +0800 (CST)
Colin Xu <colin.xu@intel.com> wrote:

> On Mon, 30 Aug 2021, Alex Williamson wrote:
> 
> Thanks Alex for your detailed comments. I replied them inline.
> 
> A general question after these replies is:
> which way to handle the readonly OpRegion is preferred?
> 1) Shadow (modify the RVDA location and OpRegion version for some 
> special version, 2.0).
> 2) On-the-fly modification for reading.
> 
> The former doesn't need add extra fields to avoid remap on every read, the
> latter leaves flexibility for write operation.

I'm in favor of the simplest, most consistent solution.  In retrospect,
that probably should have been exposing the VBT as a separate device
specific region from the OpRegion and we'd just rely on userspace to do
any necessary virtualization before exposing it to a guest.  However,
v2.1 support already expanded the region to include the VBT, so we'd
have a compatibility problem changing that at this point.

Therefore, since we have no plans to enable write support, the simplest
solution is probably to shadow all versions.  There's only one instance
of this device and firmware tables on the host, so we can probably
afford to waste a few pages of memory to simplify.

> > On Fri, 27 Aug 2021 10:37:16 +0800
> > Colin Xu <colin.xu@intel.com> wrote:
> >  
> >> Due to historical reason, some legacy shipped system doesn't follow
> >> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> >> VBT is not contigious after OpRegion in physical address, but any
> >> location pointed by RVDA via absolute address. Thus it's impossible
> >> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
> >>
> >> Since the only difference between OpRegion 2.0 and 2.1 is where extended
> >> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> >> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> >> baes, and there is no other difference between OpRegion 2.0 and 2.1,
> >> it's feasible to amend OpRegion support for these legacy system (before
> >> upgrading the system firmware), by kazlloc a range to shadown OpRegion
> >> from the beginning and stitch VBT after closely, patch the shadow
> >> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
> >> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
> >> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
> >> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
> >> shadowed data (OpRegion 2.0) or directly from physical address
> >> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
> >> mechanism makes it possible to support legacy systems on the market.
> >>
> >> V2:
> >> Validate RVDA for 2.1+ before increasing total size. (Alex)
> >>
> >> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> >> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> >> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> >> Cc: Fred Gao <fred.gao@intel.com>
> >> Signed-off-by: Colin Xu <colin.xu@intel.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
> >>  1 file changed, 75 insertions(+), 42 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> >> index 228df565e9bc..9cd44498b378 100644
> >> --- a/drivers/vfio/pci/vfio_pci_igd.c
> >> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> >> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
> >>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
> >>  				 struct vfio_pci_region *region)
> >>  {
> >> -	memunmap(region->data);
> >> +	if (is_ioremap_addr(region->data))
> >> +		memunmap(region->data);
> >> +	else
> >> +		kfree(region->data);  
> >
> >
> > Since we don't have write support to the OpRegion, should we always
> > allocate a shadow copy to simplify?  Or rather than a shadow copy,
> > since we don't support mmap of the region, our read handler could
> > virtualize version and rvda on the fly and shift accesses so that the
> > VBT appears contiguous.  That might also leave us better positioned for
> > handling dynamic changes (ex. does the data change when a monitor is
> > plugged/unplugged) and perhaps eventually write support.
> >  
> Always shadow sounds a more simple solution. On-the-fly offset shifting 
> may need some extra code:
> - A fields to store remapped RVDA, otherwise have to remap on every read.
> Should I remap everytime, or add the remapped RVDA in vfio_pci_device.
> - Some fields to store extra information, like the old and modified 
> opregion version. Current it's parsed in init since it's one time run. To 
> support on-the-fly modification, need save them somewhere instead of parse 
> on every read.
> - Addr shift calculation. Read could called on any start with any size, 
> will need add some addr shift code.

I think it's a bit easier than made out here.  RVDA is either zero or
OPREGION_SIZE when it's virtualized, so the existence of a separate
mapping for the VBT is enough to know the value, where I think we'd
hold that mapping for the life of the region.  We also don't need to
store the version, the transformation is static, If the VBT mapping
exists and the read version is 2.0, it's replaced with 2.1, otherwise
we leave it alone.  I expect we can also chunk accesses to aligned
1/2/4 byte reads (QEMU is already doing this).  That simplifies both
the transition between OpRegion and VBT as well as the field
virtualization.

I could almost convince myself that this is viable, but I'd like to see
an answer to the question above, is any of the OpRegion or VBT volatile
such that we can't rely on a shadow copy exclusively?

> >>  }
> >>
> >>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> >> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
> >>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
> >>  {
> >>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
> >> -	u32 addr, size;
> >> -	void *base;
> >> +	u32 addr, size, rvds = 0;
> >> +	void *base, *opregionvbt;  
> >
> >
> > opregionvbt could be scoped within the branch it's used.
> >  
> Previous revision doesn't move it into the scope. I'll amend in next 
> version.
> >>  	int ret;
> >>  	u16 version;
> >> +	u64 rvda = 0;
> >>
> >>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
> >>  	if (ret)
> >> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
> >>  	size *= 1024; /* In KB */
> >>
> >>  	/*
> >> -	 * Support opregion v2.1+
> >> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> >> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> >> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> >> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> >> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> >> -	 * defined before opregion 2.0.
> >> +	 * OpRegion and VBT:
> >> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> >> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> >> +	 * to hold the VBT data, the Extended VBT region is introduced since
> >> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> >> +	 * introduced to define the extended VBT data location and size.
> >> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> >> +	 *   extended VBT data, RVDS defines the VBT data size.
> >> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> >> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
> >>  	 *
> >> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> >> -	 * opregion base, and should point to the end of opregion.
> >> -	 * otherwise, exposing to userspace to allow read access to everything between
> >> -	 * the OpRegion and VBT is not safe.
> >> -	 * RVDS is defined as size in bytes.
> >> -	 *
> >> -	 * opregion 2.0: rvda is the physical VBT address.
> >> -	 * Since rvda is HPA it cannot be directly used in guest.
> >> -	 * And it should not be practically available for end user,so it is not supported.
> >> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> >> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
> >> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
> >> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
> >> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
> >> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
> >>  	 */
> >>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> >> -	if (version >= 0x0200) {
> >> -		u64 rvda;
> >> -		u32 rvds;
> >>
> >> +	if (version >= 0x0200) {
> >>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
> >>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> >> +
> >> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */  
> >
> >
> > Why?  If we're going to make our own OpRegion to account for v2.0, why
> > should it not apply to the same scenario for >2.0?  
> Below check is to validate the correctness for >2.0. Accroding to spec, 
> RVDA must equal to OpRegion size. If RVDA doesn't follow spec, the 
> OpRegion and VBT may already corrupted so returns error here.
> For 2.0, RVDA is the absolute address, VBT may or may not follow OpRegion 
> so these is no such check for 2.0.
> If you mean "not apply to the same scenario for >2.0" by "only shadow for 
> 2.0 and return as 2.1, while not using shadow for >2.0", that's because I 
> expect to keep the old logic as it is and only change the behavior for 
> 2.0. Both 2.0 and >2.0 can use shadow mechanism.

I was under the impression that the difference in RVDA between 2.0 and
2.1 was simply the absolute versus relative addressing and we made a
conscious decisions here to only support implementations where the VBT
is contiguous with the OpRegion, but the spec supported that
possibility.  Of course I don't have access to the spec to verify, but
if my interpretation is correct then the v2.0 support here could easily
handle a non-contiguous v2.1+ VBT as well.

> >> +		if (rvda != size && version > 0x0200) {
> >> +			memunmap(base);
> >> +			pci_err(vdev->pdev,
> >> +				"Extended VBT does not follow opregion on version 0x%04x\n",
> >> +				version);
> >> +			return -EINVAL;
> >> +		}
> >> +
> >> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
> >>  		if (rvda && rvds) {
> >> -			/* no support for opregion v2.0 with physical VBT address */
> >> -			if (version == 0x0200) {
> >> +			size += rvds;
> >> +		}
> >> +	}
> >> +
> >> +	if (size != OPREGION_SIZE) {  
> >
> >
> > @size can only != OPREGION_SIZE due to the above branch, so the below
> > could all be scoped under the version test, or perhaps to a separate
> > function.  
> I'll move to a separate function, which does the stitch and amendment.
> >
> >  
> >> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
> >> +		if (rvda && rvds && version == 0x0200) {  
> >
> >
> > We go down this path even if the VBT was contiguous with the OpRegion.
> >  
> Yes for 2.0, if RVDA = (OpRegion addr + size) as absolute address, still 
> remap the two regions separately. Seems like not necessary to check if 
> contiguous and decide remap once or twice. Follow spec to remap is 
> straightforward to understand the difference in RVDA (abs. vs rel.)
> Did I miss some consideration here?

I probably forgot that RVDA needs to be modified regardless.  Thanks,

Alex

> >> +			void *vbt_base;
> >> +
> >> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
> >> +			if (!vbt_base) {
> >>  				memunmap(base);
> >> -				pci_err(vdev->pdev,
> >> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
> >> -				return -EINVAL;
> >> +				return -ENOMEM;
> >>  			}
> >>
> >> -			if (rvda != size) {
> >> +			opregionvbt = kzalloc(size, GFP_KERNEL);
> >> +			if (!opregionvbt) {
> >>  				memunmap(base);
> >> -				pci_err(vdev->pdev,
> >> -					"Extended VBT does not follow opregion on version 0x%04x\n",
> >> -					version);
> >> -				return -EINVAL;
> >> +				memunmap(vbt_base);
> >> +				return -ENOMEM;
> >>  			}
> >>
> >> -			/* region size for opregion v2.0+: opregion and VBT size. */
> >> -			size += rvds;
> >> +			/* Stitch VBT after OpRegion noncontigious */
> >> +			memcpy(opregionvbt, base, OPREGION_SIZE);
> >> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
> >> +
> >> +			/* Patch OpRegion 2.0 to 2.1 */
> >> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;  
> >
> >
> > = cpu_to_le16(0x0201);
> >
> >  
> >> +			/* Patch RVDA to relative address after OpRegion */
> >> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;  
> >
> >
> > = cpu_to_le64(OPREGION_SIZE);
> >
> >
> > I think this is what triggered the sparse errors.  Thanks,  
> Thanks I got the sparse errors, will fix this in next version.
> 
> >
> > Alex
> >  
> >> +
> >> +			memunmap(vbt_base);
> >> +			memunmap(base);
> >> +
> >> +			/* Register shadow instead of map as vfio_region */
> >> +			base = opregionvbt;
> >> +		/* Remap OpRegion + extended VBT for 2.1+ */
> >> +		} else {
> >> +			memunmap(base);
> >> +			base = memremap(addr, size, MEMREMAP_WB);
> >> +			if (!base)
> >> +				return -ENOMEM;
> >>  		}
> >>  	}
> >>
> >> -	if (size != OPREGION_SIZE) {
> >> -		memunmap(base);
> >> -		base = memremap(addr, size, MEMREMAP_WB);
> >> -		if (!base)
> >> -			return -ENOMEM;
> >> -	}
> >> -
> >>  	ret = vfio_pci_register_dev_region(vdev,
> >>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> >>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
> >>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
> >>  	if (ret) {
> >> -		memunmap(base);
> >> +		if (is_ioremap_addr(base))
> >> +			memunmap(base);
> >> +		else
> >> +			kfree(base);
> >>  		return ret;
> >>  	}
> >>  
> >
> >  
> 
> --
> Best Regards,
> Colin Xu
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-09-02 21:46                 ` Alex Williamson
@ 2021-09-03  2:23                   ` Colin Xu
  2021-09-03 22:36                     ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-03  2:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Thu, 2 Sep 2021, Alex Williamson wrote:

> On Thu, 2 Sep 2021 15:11:11 +0800 (CST)
> Colin Xu <colin.xu@intel.com> wrote:
>
>> On Mon, 30 Aug 2021, Alex Williamson wrote:
>>
>> Thanks Alex for your detailed comments. I replied them inline.
>>
>> A general question after these replies is:
>> which way to handle the readonly OpRegion is preferred?
>> 1) Shadow (modify the RVDA location and OpRegion version for some
>> special version, 2.0).
>> 2) On-the-fly modification for reading.
>>
>> The former doesn't need add extra fields to avoid remap on every read, the
>> latter leaves flexibility for write operation.
>
> I'm in favor of the simplest, most consistent solution.  In retrospect,
> that probably should have been exposing the VBT as a separate device
> specific region from the OpRegion and we'd just rely on userspace to do
> any necessary virtualization before exposing it to a guest.  However,
> v2.1 support already expanded the region to include the VBT, so we'd
> have a compatibility problem changing that at this point.
>
> Therefore, since we have no plans to enable write support, the simplest
> solution is probably to shadow all versions.  There's only one instance
> of this device and firmware tables on the host, so we can probably
> afford to waste a few pages of memory to simplify.
>


>>> On Fri, 27 Aug 2021 10:37:16 +0800
>>> Colin Xu <colin.xu@intel.com> wrote:
>>>
>>>> Due to historical reason, some legacy shipped system doesn't follow
>>>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>>>> VBT is not contigious after OpRegion in physical address, but any
>>>> location pointed by RVDA via absolute address. Thus it's impossible
>>>> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>>>
>>>> Since the only difference between OpRegion 2.0 and 2.1 is where extended
>>>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>>>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>>>> baes, and there is no other difference between OpRegion 2.0 and 2.1,
>>>> it's feasible to amend OpRegion support for these legacy system (before
>>>> upgrading the system firmware), by kazlloc a range to shadown OpRegion
>>>> from the beginning and stitch VBT after closely, patch the shadow
>>>> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>>>> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>>>> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>>>> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>>>> shadowed data (OpRegion 2.0) or directly from physical address
>>>> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>>>> mechanism makes it possible to support legacy systems on the market.
>>>>
>>>> V2:
>>>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>>>
>>>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>>>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>>>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>>>> Cc: Fred Gao <fred.gao@intel.com>
>>>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>>>  1 file changed, 75 insertions(+), 42 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>>>> index 228df565e9bc..9cd44498b378 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>>>> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>>>  				 struct vfio_pci_region *region)
>>>>  {
>>>> -	memunmap(region->data);
>>>> +	if (is_ioremap_addr(region->data))
>>>> +		memunmap(region->data);
>>>> +	else
>>>> +		kfree(region->data);
>>>
>>>
>>> Since we don't have write support to the OpRegion, should we always
>>> allocate a shadow copy to simplify?  Or rather than a shadow copy,
>>> since we don't support mmap of the region, our read handler could
>>> virtualize version and rvda on the fly and shift accesses so that the
>>> VBT appears contiguous.  That might also leave us better positioned for
>>> handling dynamic changes (ex. does the data change when a monitor is
>>> plugged/unplugged) and perhaps eventually write support.
>>>
>> Always shadow sounds a more simple solution. On-the-fly offset shifting
>> may need some extra code:
>> - A fields to store remapped RVDA, otherwise have to remap on every read.
>> Should I remap everytime, or add the remapped RVDA in vfio_pci_device.
>> - Some fields to store extra information, like the old and modified
>> opregion version. Current it's parsed in init since it's one time run. To
>> support on-the-fly modification, need save them somewhere instead of parse
>> on every read.
>> - Addr shift calculation. Read could called on any start with any size,
>> will need add some addr shift code.
>
> I think it's a bit easier than made out here.  RVDA is either zero or
> OPREGION_SIZE when it's virtualized, so the existence of a separate
> mapping for the VBT is enough to know the value, where I think we'd
> hold that mapping for the life of the region.  We also don't need to
> store the version, the transformation is static, If the VBT mapping
> exists and the read version is 2.0, it's replaced with 2.1, otherwise
> we leave it alone.  I expect we can also chunk accesses to aligned
> 1/2/4 byte reads (QEMU is already doing this).  That simplifies both
> the transition between OpRegion and VBT as well as the field
> virtualization.
>
emmm version doesn't need to be stored since the host version isn't be 
changed. But need a place to store the mapped virtual addr so that can 
unmap on release. In shadow case, we have the shadow addr, but don't save 
OpRegion and VBT virtual addr.
> I could almost convince myself that this is viable, but I'd like to see
> an answer to the question above, is any of the OpRegion or VBT volatile
> such that we can't rely on a shadow copy exclusively?
>
Most of the fields in OpRegion and VBT are written by BIOS and read only 
by driver as static information. Some fields are used for communication 
between BIOS and driver, either written by driver and read by BIOS or 
vice versa, like driver can notify BIOS that driver is ready to process 
ACPI video extension calls, or when panel backlight change and BIOS notify 
driver via ACPI, driver can read PWM duty cycle, etc.
So strictly speaking, there are some cases that the data is volatile and 
can't fully rely on the shadow copy. To handle them accurately, all the 
fields need to be processed according to the actual function the field 
supports. As you mentioned above, two separate regions for OpRegion 
and VBT could be better. However currently there is only one region. So 
the shadow makes it still use single region, but the read ops shouldn't 
fully rely on the shadow, but need always read host data. That could also 
make the write ops support in future easier. The read/write ops could 
parse and filter out some functions that host doesn't want to expose for 
virtualization.
This brings a small question: need save the mapped OpRegion and VBT 
virtual addr so that no need remap every time, and also for unmap on
release. Which structure is ok to added these saved virtual address?

>
>>>>  }
>>>>
>>>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>>  {
>>>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>>> -	u32 addr, size;
>>>> -	void *base;
>>>> +	u32 addr, size, rvds = 0;
>>>> +	void *base, *opregionvbt;
>>>
>>>
>>> opregionvbt could be scoped within the branch it's used.
>>>
>> Previous revision doesn't move it into the scope. I'll amend in next
>> version.
>>>>  	int ret;
>>>>  	u16 version;
>>>> +	u64 rvda = 0;
>>>>
>>>>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>>>  	if (ret)
>>>> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>>  	size *= 1024; /* In KB */
>>>>
>>>>  	/*
>>>> -	 * Support opregion v2.1+
>>>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>>>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>>>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>>>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>>>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>>>> -	 * defined before opregion 2.0.
>>>> +	 * OpRegion and VBT:
>>>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>>>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>>>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>>>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>>>> +	 * introduced to define the extended VBT data location and size.
>>>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>>>> +	 *   extended VBT data, RVDS defines the VBT data size.
>>>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>>>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>>>  	 *
>>>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>>>> -	 * opregion base, and should point to the end of opregion.
>>>> -	 * otherwise, exposing to userspace to allow read access to everything between
>>>> -	 * the OpRegion and VBT is not safe.
>>>> -	 * RVDS is defined as size in bytes.
>>>> -	 *
>>>> -	 * opregion 2.0: rvda is the physical VBT address.
>>>> -	 * Since rvda is HPA it cannot be directly used in guest.
>>>> -	 * And it should not be practically available for end user,so it is not supported.
>>>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>>>> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
>>>> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>>>> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>>>> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
>>>> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>>>>  	 */
>>>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>>>> -	if (version >= 0x0200) {
>>>> -		u64 rvda;
>>>> -		u32 rvds;
>>>>
>>>> +	if (version >= 0x0200) {
>>>>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>>>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>>>> +
>>>> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
>>>
>>>
>>> Why?  If we're going to make our own OpRegion to account for v2.0, why
>>> should it not apply to the same scenario for >2.0?
>> Below check is to validate the correctness for >2.0. Accroding to spec,
>> RVDA must equal to OpRegion size. If RVDA doesn't follow spec, the
>> OpRegion and VBT may already corrupted so returns error here.
>> For 2.0, RVDA is the absolute address, VBT may or may not follow OpRegion
>> so these is no such check for 2.0.
>> If you mean "not apply to the same scenario for >2.0" by "only shadow for
>> 2.0 and return as 2.1, while not using shadow for >2.0", that's because I
>> expect to keep the old logic as it is and only change the behavior for
>> 2.0. Both 2.0 and >2.0 can use shadow mechanism.
>
> I was under the impression that the difference in RVDA between 2.0 and
> 2.1 was simply the absolute versus relative addressing and we made a
> conscious decisions here to only support implementations where the VBT
> is contiguous with the OpRegion, but the spec supported that
> possibility.  Of course I don't have access to the spec to verify, but
> if my interpretation is correct then the v2.0 support here could easily
> handle a non-contiguous v2.1+ VBT as well.
>
The team hasn't release the spec to public so I can't paste it here. What 
it describes RVDA for 2.1+ is typically RVDA will be equal to OpRegion 
size only when VBT exceeds 6K (if <6K, mailbox 4 is large enough to hold
VBT then no need to use RVDA). Technically it's correct that even if it's 
non-contiguous v2.1+ VBT, it still can be handled.
Current i915 will handle it even if v2.1+ is not contiguous, so I guess 
probably it's better to deal with it as i915.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/i915/display/intel_opregion.c?h=v5.13.13#n935

Colin

>>>> +		if (rvda != size && version > 0x0200) {
>>>> +			memunmap(base);
>>>> +			pci_err(vdev->pdev,
>>>> +				"Extended VBT does not follow opregion on version 0x%04x\n",
>>>> +				version);
>>>> +			return -EINVAL;
>>>> +		}
>>>> +
>>>> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>>>>  		if (rvda && rvds) {
>>>> -			/* no support for opregion v2.0 with physical VBT address */
>>>> -			if (version == 0x0200) {
>>>> +			size += rvds;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (size != OPREGION_SIZE) {
>>>
>>>
>>> @size can only != OPREGION_SIZE due to the above branch, so the below
>>> could all be scoped under the version test, or perhaps to a separate
>>> function.
>> I'll move to a separate function, which does the stitch and amendment.
>>>
>>>
>>>> +		/* Allocate memory for OpRegion and extended VBT for 2.0 */
>>>> +		if (rvda && rvds && version == 0x0200) {
>>>
>>>
>>> We go down this path even if the VBT was contiguous with the OpRegion.
>>>
>> Yes for 2.0, if RVDA = (OpRegion addr + size) as absolute address, still
>> remap the two regions separately. Seems like not necessary to check if
>> contiguous and decide remap once or twice. Follow spec to remap is
>> straightforward to understand the difference in RVDA (abs. vs rel.)
>> Did I miss some consideration here?
>
> I probably forgot that RVDA needs to be modified regardless.  Thanks,
>
> Alex
>
>>>> +			void *vbt_base;
>>>> +
>>>> +			vbt_base = memremap(rvda, rvds, MEMREMAP_WB);
>>>> +			if (!vbt_base) {
>>>>  				memunmap(base);
>>>> -				pci_err(vdev->pdev,
>>>> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
>>>> -				return -EINVAL;
>>>> +				return -ENOMEM;
>>>>  			}
>>>>
>>>> -			if (rvda != size) {
>>>> +			opregionvbt = kzalloc(size, GFP_KERNEL);
>>>> +			if (!opregionvbt) {
>>>>  				memunmap(base);
>>>> -				pci_err(vdev->pdev,
>>>> -					"Extended VBT does not follow opregion on version 0x%04x\n",
>>>> -					version);
>>>> -				return -EINVAL;
>>>> +				memunmap(vbt_base);
>>>> +				return -ENOMEM;
>>>>  			}
>>>>
>>>> -			/* region size for opregion v2.0+: opregion and VBT size. */
>>>> -			size += rvds;
>>>> +			/* Stitch VBT after OpRegion noncontigious */
>>>> +			memcpy(opregionvbt, base, OPREGION_SIZE);
>>>> +			memcpy(opregionvbt + OPREGION_SIZE, vbt_base, rvds);
>>>> +
>>>> +			/* Patch OpRegion 2.0 to 2.1 */
>>>> +			*(__le16 *)(opregionvbt + OPREGION_VERSION) = 0x0201;
>>>
>>>
>>> = cpu_to_le16(0x0201);
>>>
>>>
>>>> +			/* Patch RVDA to relative address after OpRegion */
>>>> +			*(__le64 *)(opregionvbt + OPREGION_RVDA) = OPREGION_SIZE;
>>>
>>>
>>> = cpu_to_le64(OPREGION_SIZE);
>>>
>>>
>>> I think this is what triggered the sparse errors.  Thanks,
>> Thanks I got the sparse errors, will fix this in next version.
>>
>>>
>>> Alex
>>>
>>>> +
>>>> +			memunmap(vbt_base);
>>>> +			memunmap(base);
>>>> +
>>>> +			/* Register shadow instead of map as vfio_region */
>>>> +			base = opregionvbt;
>>>> +		/* Remap OpRegion + extended VBT for 2.1+ */
>>>> +		} else {
>>>> +			memunmap(base);
>>>> +			base = memremap(addr, size, MEMREMAP_WB);
>>>> +			if (!base)
>>>> +				return -ENOMEM;
>>>>  		}
>>>>  	}
>>>>
>>>> -	if (size != OPREGION_SIZE) {
>>>> -		memunmap(base);
>>>> -		base = memremap(addr, size, MEMREMAP_WB);
>>>> -		if (!base)
>>>> -			return -ENOMEM;
>>>> -	}
>>>> -
>>>>  	ret = vfio_pci_register_dev_region(vdev,
>>>>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>>>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>>>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>>>  	if (ret) {
>>>> -		memunmap(base);
>>>> +		if (is_ioremap_addr(base))
>>>> +			memunmap(base);
>>>> +		else
>>>> +			kfree(base);
>>>>  		return ret;
>>>>  	}
>>>>
>>>
>>>
>>
>> --
>> Best Regards,
>> Colin Xu
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-09-03  2:23                   ` Colin Xu
@ 2021-09-03 22:36                     ` Alex Williamson
  2021-09-07  6:14                       ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-09-03 22:36 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 3 Sep 2021 10:23:44 +0800 (CST)
Colin Xu <colin.xu@intel.com> wrote:

> On Thu, 2 Sep 2021, Alex Williamson wrote:
> 
> > On Thu, 2 Sep 2021 15:11:11 +0800 (CST)
> > Colin Xu <colin.xu@intel.com> wrote:
> >  
> >> On Mon, 30 Aug 2021, Alex Williamson wrote:
> >>
> >> Thanks Alex for your detailed comments. I replied them inline.
> >>
> >> A general question after these replies is:
> >> which way to handle the readonly OpRegion is preferred?
> >> 1) Shadow (modify the RVDA location and OpRegion version for some
> >> special version, 2.0).
> >> 2) On-the-fly modification for reading.
> >>
> >> The former doesn't need add extra fields to avoid remap on every read, the
> >> latter leaves flexibility for write operation.  
> >
> > I'm in favor of the simplest, most consistent solution.  In retrospect,
> > that probably should have been exposing the VBT as a separate device
> > specific region from the OpRegion and we'd just rely on userspace to do
> > any necessary virtualization before exposing it to a guest.  However,
> > v2.1 support already expanded the region to include the VBT, so we'd
> > have a compatibility problem changing that at this point.
> >
> > Therefore, since we have no plans to enable write support, the simplest
> > solution is probably to shadow all versions.  There's only one instance
> > of this device and firmware tables on the host, so we can probably
> > afford to waste a few pages of memory to simplify.
> >  
> 
> 
> >>> On Fri, 27 Aug 2021 10:37:16 +0800
> >>> Colin Xu <colin.xu@intel.com> wrote:
> >>>  
> >>>> Due to historical reason, some legacy shipped system doesn't follow
> >>>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> >>>> VBT is not contigious after OpRegion in physical address, but any
> >>>> location pointed by RVDA via absolute address. Thus it's impossible
> >>>> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
> >>>>
> >>>> Since the only difference between OpRegion 2.0 and 2.1 is where extended
> >>>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> >>>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> >>>> baes, and there is no other difference between OpRegion 2.0 and 2.1,
> >>>> it's feasible to amend OpRegion support for these legacy system (before
> >>>> upgrading the system firmware), by kazlloc a range to shadown OpRegion
> >>>> from the beginning and stitch VBT after closely, patch the shadow
> >>>> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
> >>>> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
> >>>> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
> >>>> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
> >>>> shadowed data (OpRegion 2.0) or directly from physical address
> >>>> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
> >>>> mechanism makes it possible to support legacy systems on the market.
> >>>>
> >>>> V2:
> >>>> Validate RVDA for 2.1+ before increasing total size. (Alex)
> >>>>
> >>>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> >>>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> >>>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> >>>> Cc: Fred Gao <fred.gao@intel.com>
> >>>> Signed-off-by: Colin Xu <colin.xu@intel.com>
> >>>> ---
> >>>>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
> >>>>  1 file changed, 75 insertions(+), 42 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> >>>> index 228df565e9bc..9cd44498b378 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_igd.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> >>>> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
> >>>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
> >>>>  				 struct vfio_pci_region *region)
> >>>>  {
> >>>> -	memunmap(region->data);
> >>>> +	if (is_ioremap_addr(region->data))
> >>>> +		memunmap(region->data);
> >>>> +	else
> >>>> +		kfree(region->data);  
> >>>
> >>>
> >>> Since we don't have write support to the OpRegion, should we always
> >>> allocate a shadow copy to simplify?  Or rather than a shadow copy,
> >>> since we don't support mmap of the region, our read handler could
> >>> virtualize version and rvda on the fly and shift accesses so that the
> >>> VBT appears contiguous.  That might also leave us better positioned for
> >>> handling dynamic changes (ex. does the data change when a monitor is
> >>> plugged/unplugged) and perhaps eventually write support.
> >>>  
> >> Always shadow sounds a more simple solution. On-the-fly offset shifting
> >> may need some extra code:
> >> - A fields to store remapped RVDA, otherwise have to remap on every read.
> >> Should I remap everytime, or add the remapped RVDA in vfio_pci_device.
> >> - Some fields to store extra information, like the old and modified
> >> opregion version. Current it's parsed in init since it's one time run. To
> >> support on-the-fly modification, need save them somewhere instead of parse
> >> on every read.
> >> - Addr shift calculation. Read could called on any start with any size,
> >> will need add some addr shift code.  
> >
> > I think it's a bit easier than made out here.  RVDA is either zero or
> > OPREGION_SIZE when it's virtualized, so the existence of a separate
> > mapping for the VBT is enough to know the value, where I think we'd
> > hold that mapping for the life of the region.  We also don't need to
> > store the version, the transformation is static, If the VBT mapping
> > exists and the read version is 2.0, it's replaced with 2.1, otherwise
> > we leave it alone.  I expect we can also chunk accesses to aligned
> > 1/2/4 byte reads (QEMU is already doing this).  That simplifies both
> > the transition between OpRegion and VBT as well as the field
> > virtualization.
> >  
> emmm version doesn't need to be stored since the host version isn't be 
> changed.

It'd be changed for the guest if we're virtualizing a v2.0 OpRegion
into v2.1 to make RVDA relative, but it's still not strictly necessary
to store the host or virtual version to do that.

> But need a place to store the mapped virtual addr so that can 
> unmap on release. In shadow case, we have the shadow addr, but don't save 
> OpRegion and VBT virtual addr.

Yes, we're using the void* data field of struct vfio_pci_region for
storing the opregion mapping, this could easily point to a structure
containing both the opregion and vbt mappings and size for each.

> > I could almost convince myself that this is viable, but I'd like to see
> > an answer to the question above, is any of the OpRegion or VBT volatile
> > such that we can't rely on a shadow copy exclusively?
> >  
> Most of the fields in OpRegion and VBT are written by BIOS and read only 
> by driver as static information. Some fields are used for communication 
> between BIOS and driver, either written by driver and read by BIOS or 
> vice versa, like driver can notify BIOS that driver is ready to process 
> ACPI video extension calls, or when panel backlight change and BIOS notify 
> driver via ACPI, driver can read PWM duty cycle, etc.
> So strictly speaking, there are some cases that the data is volatile and 
> can't fully rely on the shadow copy. To handle them accurately, all the 
> fields need to be processed according to the actual function the field 
> supports. As you mentioned above, two separate regions for OpRegion 
> and VBT could be better. However currently there is only one region. So 
> the shadow makes it still use single region, but the read ops shouldn't 
> fully rely on the shadow, but need always read host data. That could also 
> make the write ops support in future easier. The read/write ops could 
> parse and filter out some functions that host doesn't want to expose for 
> virtualization.
> This brings a small question: need save the mapped OpRegion and VBT 
> virtual addr so that no need remap every time, and also for unmap on
> release. Which structure is ok to added these saved virtual address?

See above, struct vfio_pci_region.data

> >>>>  }
> >>>>
> >>>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> >>>> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
> >>>>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
> >>>>  {
> >>>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
> >>>> -	u32 addr, size;
> >>>> -	void *base;
> >>>> +	u32 addr, size, rvds = 0;
> >>>> +	void *base, *opregionvbt;  
> >>>
> >>>
> >>> opregionvbt could be scoped within the branch it's used.
> >>>  
> >> Previous revision doesn't move it into the scope. I'll amend in next
> >> version.  
> >>>>  	int ret;
> >>>>  	u16 version;
> >>>> +	u64 rvda = 0;
> >>>>
> >>>>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
> >>>>  	if (ret)
> >>>> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
> >>>>  	size *= 1024; /* In KB */
> >>>>
> >>>>  	/*
> >>>> -	 * Support opregion v2.1+
> >>>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> >>>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> >>>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> >>>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> >>>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> >>>> -	 * defined before opregion 2.0.
> >>>> +	 * OpRegion and VBT:
> >>>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> >>>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> >>>> +	 * to hold the VBT data, the Extended VBT region is introduced since
> >>>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> >>>> +	 * introduced to define the extended VBT data location and size.
> >>>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> >>>> +	 *   extended VBT data, RVDS defines the VBT data size.
> >>>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> >>>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
> >>>>  	 *
> >>>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> >>>> -	 * opregion base, and should point to the end of opregion.
> >>>> -	 * otherwise, exposing to userspace to allow read access to everything between
> >>>> -	 * the OpRegion and VBT is not safe.
> >>>> -	 * RVDS is defined as size in bytes.
> >>>> -	 *
> >>>> -	 * opregion 2.0: rvda is the physical VBT address.
> >>>> -	 * Since rvda is HPA it cannot be directly used in guest.
> >>>> -	 * And it should not be practically available for end user,so it is not supported.
> >>>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> >>>> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
> >>>> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
> >>>> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
> >>>> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
> >>>> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
> >>>>  	 */
> >>>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> >>>> -	if (version >= 0x0200) {
> >>>> -		u64 rvda;
> >>>> -		u32 rvds;
> >>>>
> >>>> +	if (version >= 0x0200) {
> >>>>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
> >>>>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> >>>> +
> >>>> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */  
> >>>
> >>>
> >>> Why?  If we're going to make our own OpRegion to account for v2.0, why
> >>> should it not apply to the same scenario for >2.0?  
> >> Below check is to validate the correctness for >2.0. Accroding to spec,
> >> RVDA must equal to OpRegion size. If RVDA doesn't follow spec, the
> >> OpRegion and VBT may already corrupted so returns error here.
> >> For 2.0, RVDA is the absolute address, VBT may or may not follow OpRegion
> >> so these is no such check for 2.0.
> >> If you mean "not apply to the same scenario for >2.0" by "only shadow for
> >> 2.0 and return as 2.1, while not using shadow for >2.0", that's because I
> >> expect to keep the old logic as it is and only change the behavior for
> >> 2.0. Both 2.0 and >2.0 can use shadow mechanism.  
> >
> > I was under the impression that the difference in RVDA between 2.0 and
> > 2.1 was simply the absolute versus relative addressing and we made a
> > conscious decisions here to only support implementations where the VBT
> > is contiguous with the OpRegion, but the spec supported that
> > possibility.  Of course I don't have access to the spec to verify, but
> > if my interpretation is correct then the v2.0 support here could easily
> > handle a non-contiguous v2.1+ VBT as well.
> >  
> The team hasn't release the spec to public so I can't paste it here. What 
> it describes RVDA for 2.1+ is typically RVDA will be equal to OpRegion 
> size only when VBT exceeds 6K (if <6K, mailbox 4 is large enough to hold
> VBT then no need to use RVDA). Technically it's correct that even if it's 
> non-contiguous v2.1+ VBT, it still can be handled.
> Current i915 will handle it even if v2.1+ is not contiguous, so I guess 
> probably it's better to deal with it as i915.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/i915/display/intel_opregion.c?h=v5.13.13#n935

So even a v2.1+ OpRegion where (RVDA > OPREGION_SIZE) should be made
contiguous within this vendor region.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] vfio/pci: Add OpRegion 2.0 Extended VBT support.
  2021-09-03 22:36                     ` Alex Williamson
@ 2021-09-07  6:14                       ` Colin Xu
  2021-09-09  5:09                         ` [PATCH v3] vfio/pci: Add OpRegion 2.0+ " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-07  6:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Thanks Alex. Let me cook an updated version based on your suggestions and 
continue the discussion.

On Fri, 3 Sep 2021, Alex Williamson wrote:

> On Fri, 3 Sep 2021 10:23:44 +0800 (CST)
> Colin Xu <colin.xu@intel.com> wrote:
>
>> On Thu, 2 Sep 2021, Alex Williamson wrote:
>>
>>> On Thu, 2 Sep 2021 15:11:11 +0800 (CST)
>>> Colin Xu <colin.xu@intel.com> wrote:
>>>
>>>> On Mon, 30 Aug 2021, Alex Williamson wrote:
>>>>
>>>> Thanks Alex for your detailed comments. I replied them inline.
>>>>
>>>> A general question after these replies is:
>>>> which way to handle the readonly OpRegion is preferred?
>>>> 1) Shadow (modify the RVDA location and OpRegion version for some
>>>> special version, 2.0).
>>>> 2) On-the-fly modification for reading.
>>>>
>>>> The former doesn't need add extra fields to avoid remap on every read, the
>>>> latter leaves flexibility for write operation.
>>>
>>> I'm in favor of the simplest, most consistent solution.  In retrospect,
>>> that probably should have been exposing the VBT as a separate device
>>> specific region from the OpRegion and we'd just rely on userspace to do
>>> any necessary virtualization before exposing it to a guest.  However,
>>> v2.1 support already expanded the region to include the VBT, so we'd
>>> have a compatibility problem changing that at this point.
>>>
>>> Therefore, since we have no plans to enable write support, the simplest
>>> solution is probably to shadow all versions.  There's only one instance
>>> of this device and firmware tables on the host, so we can probably
>>> afford to waste a few pages of memory to simplify.
>>>
>>
>>
>>>>> On Fri, 27 Aug 2021 10:37:16 +0800
>>>>> Colin Xu <colin.xu@intel.com> wrote:
>>>>>
>>>>>> Due to historical reason, some legacy shipped system doesn't follow
>>>>>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>>>>>> VBT is not contigious after OpRegion in physical address, but any
>>>>>> location pointed by RVDA via absolute address. Thus it's impossible
>>>>>> to map a contigious range to hold both OpRegion and extended VBT as 2.1.
>>>>>>
>>>>>> Since the only difference between OpRegion 2.0 and 2.1 is where extended
>>>>>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>>>>>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>>>>>> baes, and there is no other difference between OpRegion 2.0 and 2.1,
>>>>>> it's feasible to amend OpRegion support for these legacy system (before
>>>>>> upgrading the system firmware), by kazlloc a range to shadown OpRegion
>>>>>> from the beginning and stitch VBT after closely, patch the shadow
>>>>>> OpRegion version from 2.0 to 2.1, and patch the shadow RVDA to relative
>>>>>> address. So that from the vfio igd OpRegion r/w ops view, only OpRegion
>>>>>> 2.1 is exposed regardless the underneath host OpRegion is 2.0 or 2.1
>>>>>> if the extended VBT exists. vfio igd OpRegion r/w ops will return either
>>>>>> shadowed data (OpRegion 2.0) or directly from physical address
>>>>>> (OpRegion 2.1+) based on host OpRegion version and RVDA/RVDS. The shadow
>>>>>> mechanism makes it possible to support legacy systems on the market.
>>>>>>
>>>>>> V2:
>>>>>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>>>>>
>>>>>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>>>>>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>>>>>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>>>>>> Cc: Fred Gao <fred.gao@intel.com>
>>>>>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>>>>>> ---
>>>>>>  drivers/vfio/pci/vfio_pci_igd.c | 117 ++++++++++++++++++++------------
>>>>>>  1 file changed, 75 insertions(+), 42 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>>>>>> index 228df565e9bc..9cd44498b378 100644
>>>>>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>>>>>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>>>>>> @@ -48,7 +48,10 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>>>>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>>>>>  				 struct vfio_pci_region *region)
>>>>>>  {
>>>>>> -	memunmap(region->data);
>>>>>> +	if (is_ioremap_addr(region->data))
>>>>>> +		memunmap(region->data);
>>>>>> +	else
>>>>>> +		kfree(region->data);
>>>>>
>>>>>
>>>>> Since we don't have write support to the OpRegion, should we always
>>>>> allocate a shadow copy to simplify?  Or rather than a shadow copy,
>>>>> since we don't support mmap of the region, our read handler could
>>>>> virtualize version and rvda on the fly and shift accesses so that the
>>>>> VBT appears contiguous.  That might also leave us better positioned for
>>>>> handling dynamic changes (ex. does the data change when a monitor is
>>>>> plugged/unplugged) and perhaps eventually write support.
>>>>>
>>>> Always shadow sounds a more simple solution. On-the-fly offset shifting
>>>> may need some extra code:
>>>> - A fields to store remapped RVDA, otherwise have to remap on every read.
>>>> Should I remap everytime, or add the remapped RVDA in vfio_pci_device.
>>>> - Some fields to store extra information, like the old and modified
>>>> opregion version. Current it's parsed in init since it's one time run. To
>>>> support on-the-fly modification, need save them somewhere instead of parse
>>>> on every read.
>>>> - Addr shift calculation. Read could called on any start with any size,
>>>> will need add some addr shift code.
>>>
>>> I think it's a bit easier than made out here.  RVDA is either zero or
>>> OPREGION_SIZE when it's virtualized, so the existence of a separate
>>> mapping for the VBT is enough to know the value, where I think we'd
>>> hold that mapping for the life of the region.  We also don't need to
>>> store the version, the transformation is static, If the VBT mapping
>>> exists and the read version is 2.0, it's replaced with 2.1, otherwise
>>> we leave it alone.  I expect we can also chunk accesses to aligned
>>> 1/2/4 byte reads (QEMU is already doing this).  That simplifies both
>>> the transition between OpRegion and VBT as well as the field
>>> virtualization.
>>>
>> emmm version doesn't need to be stored since the host version isn't be
>> changed.
>
> It'd be changed for the guest if we're virtualizing a v2.0 OpRegion
> into v2.1 to make RVDA relative, but it's still not strictly necessary
> to store the host or virtual version to do that.
>
>> But need a place to store the mapped virtual addr so that can
>> unmap on release. In shadow case, we have the shadow addr, but don't save
>> OpRegion and VBT virtual addr.
>
> Yes, we're using the void* data field of struct vfio_pci_region for
> storing the opregion mapping, this could easily point to a structure
> containing both the opregion and vbt mappings and size for each.
>
>>> I could almost convince myself that this is viable, but I'd like to see
>>> an answer to the question above, is any of the OpRegion or VBT volatile
>>> such that we can't rely on a shadow copy exclusively?
>>>
>> Most of the fields in OpRegion and VBT are written by BIOS and read only
>> by driver as static information. Some fields are used for communication
>> between BIOS and driver, either written by driver and read by BIOS or
>> vice versa, like driver can notify BIOS that driver is ready to process
>> ACPI video extension calls, or when panel backlight change and BIOS notify
>> driver via ACPI, driver can read PWM duty cycle, etc.
>> So strictly speaking, there are some cases that the data is volatile and
>> can't fully rely on the shadow copy. To handle them accurately, all the
>> fields need to be processed according to the actual function the field
>> supports. As you mentioned above, two separate regions for OpRegion
>> and VBT could be better. However currently there is only one region. So
>> the shadow makes it still use single region, but the read ops shouldn't
>> fully rely on the shadow, but need always read host data. That could also
>> make the write ops support in future easier. The read/write ops could
>> parse and filter out some functions that host doesn't want to expose for
>> virtualization.
>> This brings a small question: need save the mapped OpRegion and VBT
>> virtual addr so that no need remap every time, and also for unmap on
>> release. Which structure is ok to added these saved virtual address?
>
> See above, struct vfio_pci_region.data
>
>>>>>>  }
>>>>>>
>>>>>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>>>> @@ -59,10 +62,11 @@ static const struct vfio_pci_regops vfio_pci_igd_regops = {
>>>>>>  static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>>>>  {
>>>>>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>>>>> -	u32 addr, size;
>>>>>> -	void *base;
>>>>>> +	u32 addr, size, rvds = 0;
>>>>>> +	void *base, *opregionvbt;
>>>>>
>>>>>
>>>>> opregionvbt could be scoped within the branch it's used.
>>>>>
>>>> Previous revision doesn't move it into the scope. I'll amend in next
>>>> version.
>>>>>>  	int ret;
>>>>>>  	u16 version;
>>>>>> +	u64 rvda = 0;
>>>>>>
>>>>>>  	ret = pci_read_config_dword(vdev->pdev, OPREGION_PCI_ADDR, &addr);
>>>>>>  	if (ret)
>>>>>> @@ -89,66 +93,95 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>>>>>  	size *= 1024; /* In KB */
>>>>>>
>>>>>>  	/*
>>>>>> -	 * Support opregion v2.1+
>>>>>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>>>>>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>>>>>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>>>>>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>>>>>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>>>>>> -	 * defined before opregion 2.0.
>>>>>> +	 * OpRegion and VBT:
>>>>>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>>>>>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>>>>>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>>>>>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>>>>>> +	 * introduced to define the extended VBT data location and size.
>>>>>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>>>>>> +	 *   extended VBT data, RVDS defines the VBT data size.
>>>>>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>>>>>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>>>>>  	 *
>>>>>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>>>>>> -	 * opregion base, and should point to the end of opregion.
>>>>>> -	 * otherwise, exposing to userspace to allow read access to everything between
>>>>>> -	 * the OpRegion and VBT is not safe.
>>>>>> -	 * RVDS is defined as size in bytes.
>>>>>> -	 *
>>>>>> -	 * opregion 2.0: rvda is the physical VBT address.
>>>>>> -	 * Since rvda is HPA it cannot be directly used in guest.
>>>>>> -	 * And it should not be practically available for end user,so it is not supported.
>>>>>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>>>>>> +	 * 2.0 and 2.1), while for OpRegion 2.1 and above it's possible to map
>>>>>> +	 * a contigious memory to expose OpRegion and VBT r/w via the vfio
>>>>>> +	 * region, for OpRegion 2.0 shadow and amendment mechanism is used to
>>>>>> +	 * expose OpRegion and VBT r/w properly. So that from r/w ops view, only
>>>>>> +	 * OpRegion 2.1 is exposed regardless underneath Region is 2.0 or 2.1.
>>>>>>  	 */
>>>>>>  	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>>>>>> -	if (version >= 0x0200) {
>>>>>> -		u64 rvda;
>>>>>> -		u32 rvds;
>>>>>>
>>>>>> +	if (version >= 0x0200) {
>>>>>>  		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>>>>>>  		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>>>>>> +
>>>>>> +		/* The extended VBT must follows OpRegion for OpRegion 2.1+ */
>>>>>
>>>>>
>>>>> Why?  If we're going to make our own OpRegion to account for v2.0, why
>>>>> should it not apply to the same scenario for >2.0?
>>>> Below check is to validate the correctness for >2.0. Accroding to spec,
>>>> RVDA must equal to OpRegion size. If RVDA doesn't follow spec, the
>>>> OpRegion and VBT may already corrupted so returns error here.
>>>> For 2.0, RVDA is the absolute address, VBT may or may not follow OpRegion
>>>> so these is no such check for 2.0.
>>>> If you mean "not apply to the same scenario for >2.0" by "only shadow for
>>>> 2.0 and return as 2.1, while not using shadow for >2.0", that's because I
>>>> expect to keep the old logic as it is and only change the behavior for
>>>> 2.0. Both 2.0 and >2.0 can use shadow mechanism.
>>>
>>> I was under the impression that the difference in RVDA between 2.0 and
>>> 2.1 was simply the absolute versus relative addressing and we made a
>>> conscious decisions here to only support implementations where the VBT
>>> is contiguous with the OpRegion, but the spec supported that
>>> possibility.  Of course I don't have access to the spec to verify, but
>>> if my interpretation is correct then the v2.0 support here could easily
>>> handle a non-contiguous v2.1+ VBT as well.
>>>
>> The team hasn't release the spec to public so I can't paste it here. What
>> it describes RVDA for 2.1+ is typically RVDA will be equal to OpRegion
>> size only when VBT exceeds 6K (if <6K, mailbox 4 is large enough to hold
>> VBT then no need to use RVDA). Technically it's correct that even if it's
>> non-contiguous v2.1+ VBT, it still can be handled.
>> Current i915 will handle it even if v2.1+ is not contiguous, so I guess
>> probably it's better to deal with it as i915.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/i915/display/intel_opregion.c?h=v5.13.13#n935
>
> So even a v2.1+ OpRegion where (RVDA > OPREGION_SIZE) should be made
> contiguous within this vendor region.  Thanks,
>
> Alex
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-07  6:14                       ` Colin Xu
@ 2021-09-09  5:09                         ` Colin Xu
  2021-09-09 22:00                           ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-09  5:09 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.
Also split the write op with read ops to leave flexibility for OpRegion
write op support in future.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 229 +++++++++++++++++++++++---------
 1 file changed, 169 insertions(+), 60 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..fd6ad80f0c5f 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,30 +25,131 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
-static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
-			      size_t count, loff_t *ppos, bool iswrite)
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+};
+
+static size_t vfio_pci_igd_read(struct igd_opregion_vbt *opregionvbt,
+				char __user *buf, size_t count, loff_t *ppos)
 {
-	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
+	u16 version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion + OPREGION_VERSION));
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *base, *shadow = NULL;
 
-	if (pos >= vdev->region[i].size || iswrite)
-		return -EINVAL;
+	/* Shift into the range for reading the extended VBT only */
+	if (pos >= OPREGION_SIZE) {
+		base = opregionvbt->vbt_ex + pos - OPREGION_SIZE;
+		goto done;
+	}
 
-	count = min(count, (size_t)(vdev->region[i].size - pos));
+	/* Simply read from OpRegion if the extended VBT doesn't exist */
+	if (!opregionvbt->vbt_ex) {
+		base = opregionvbt->opregion + pos;
+		goto done;
+	} else {
+		shadow = kzalloc(count, GFP_KERNEL);
+
+		if (!shadow)
+			return -ENOMEM;
+	}
 
-	if (copy_to_user(buf, base + pos, count))
+	/*
+	 * If the extended VBT exist, need shift for non-contiguous reading and
+	 * may need patch OpRegion version (for 2.0) and RVDA (for 2.0 and above)
+	 * Use a temporary buffer to simplify the stitch and patch
+	 */
+
+	/* Either crossing OpRegion and VBT or in OpRegion range only */
+	if (pos < OPREGION_SIZE && (pos + count) > OPREGION_SIZE) {
+		memcpy(shadow, opregionvbt->opregion + pos, OPREGION_SIZE - pos);
+		memcpy(shadow + OPREGION_SIZE - pos, opregionvbt->vbt_ex,
+		       pos + count - OPREGION_SIZE);
+	} else {
+		memcpy(shadow, opregionvbt->opregion + pos, count);
+	}
+
+	/*
+	 * Patch OpRegion 2.0 to 2.1 if extended VBT exist and reading the version
+	 */
+	if (opregionvbt->vbt_ex && version == 0x0200 &&
+	    pos <= OPREGION_VERSION && pos + count > OPREGION_VERSION) {
+		/* May only read 1 byte minor version */
+		if (pos + count == OPREGION_VERSION + 1)
+			*(u8 *)(shadow + OPREGION_VERSION - pos) = (u8)0x01;
+		else
+			*(__le16 *)(shadow + OPREGION_VERSION - pos) = cpu_to_le16(0x0201);
+	}
+
+	/*
+	 * Patch RVDA for OpRegion 2.0 and above to make the region contiguous.
+	 * For 2.0, the requestor always see 2.1 with RVDA as relative.
+	 * For 2.1+, RVDA is already relative, but possibly non-contiguous
+	 *   after OpRegion.
+	 * In both cases, patch RVDA to OpRegion size to make the extended
+	 * VBT follows OpRegion and show the requestor a contiguous region.
+	 * Always fail partial RVDA reading to prevent malicious reading to offset
+	 *   of OpRegion by construct arbitrary offset.
+	 */
+	if (opregionvbt->vbt_ex) {
+		/* Full RVDA reading */
+		if (pos <= OPREGION_RVDA && pos + count >= OPREGION_RVDA + 8) {
+			*(__le64 *)(shadow + OPREGION_RVDA - pos) = cpu_to_le64(OPREGION_SIZE);
+		/* Fail partial reading to avoid construct arbitrary RVDA */
+		} else {
+			kfree(shadow);
+			pr_err("%s: partial RVDA reading!\n", __func__);
+			return -EFAULT;
+		}
+	}
+
+	base = shadow;
+
+done:
+	if (copy_to_user(buf, base, count))
 		return -EFAULT;
 
+	kfree(shadow);
+
 	*ppos += count;

 	return count;
 }
 
+static size_t vfio_pci_igd_write(struct igd_opregion_vbt *opregionvbt,
+				 char __user *buf, size_t count, loff_t *ppos)
+{
+	// Not supported yet.
+	return -EINVAL;
+}
+
+static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
+			      size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	return (iswrite ?
+		vfio_pci_igd_write(opregionvbt, buf, count, ppos) :
+		vfio_pci_igd_read(opregionvbt, buf, count, ppos));
+}
+
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -60,7 +161,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *base;
 	int ret;
 	u16 version;
 
@@ -71,84 +172,92 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	base = kzalloc(sizeof(*base), GFP_KERNEL);
 	if (!base)
 		return -ENOMEM;
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	base->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!base->opregion) {
+		kfree(base);
+		return -ENOMEM;
+	}
+
+	if (memcmp(base->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(base->opregion);
+		kfree(base);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(base->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(base->opregion);
+		kfree(base);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
-	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
+	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
+	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
+	 * after OpRegion with version 2.1+ is exposed regardless the underneath
+	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
+	 * the actural offset into VBT so that data at correct position can be
+	 * returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	version = le16_to_cpu(*(__le16 *)(base->opregion + OPREGION_VERSION));
+
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(base->opregion + OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(base->opregion + OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
-			}
+			size += rvds;
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			if (version == 0x0200) {
+				/* Absolute physical address for 2.0 */
+				base->vbt_ex = memremap(rvda, rvds, MEMREMAP_WB);
+				if (!base->vbt_ex) {
+					memunmap(base->opregion);
+					kfree(base);
+					return -ENOMEM;
+				}
+			} else {
+				/* Relative address to OpRegion header for 2.1+ */
+				base->vbt_ex = memremap(addr + rvda, rvds, MEMREMAP_WB);
+				if (!base->vbt_ex) {
+					memunmap(base->opregion);
+					kfree(base);
+					return -ENOMEM;
+				}
 			}
-
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
 		}
 	}

-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
 		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
 		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
 	if (ret) {
-		memunmap(base);
+		if (base->vbt_ex)
+			memunmap(base->vbt_ex);
+
+		memunmap(base->opregion);
+		kfree(base);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-09  5:09                         ` [PATCH v3] vfio/pci: Add OpRegion 2.0+ " Colin Xu
@ 2021-09-09 22:00                           ` Alex Williamson
  2021-09-13 12:39                             ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-09-09 22:00 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Thu,  9 Sep 2021 13:09:34 +0800
Colin Xu <colin.xu@intel.com> wrote:

> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contiguous after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Also although current
> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
> RVDA is the relative address to OpRegion head, the extended VBT location
> may change to non-contiguous to OpRegion. In both cases, it's impossible
> to map a contiguous range to hold both OpRegion and the extended VBT and
> expose via one vfio region.
> 
> The only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1.
> To support the non-contiguous region case as described, the updated read
> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
> after OpRegion is exposed, regardless the underneath host OpRegion is
> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
> where the extended VBT isn't contiguous after OpRegion.
> Also split the write op with read ops to leave flexibility for OpRegion
> write op support in future.
> 
> V2:
> Validate RVDA for 2.1+ before increasing total size. (Alex)
> 
> V3: (Alex)
> Split read and write ops.
> On-the-fly modify OpRegion version and RVDA.
> Fix sparse error on assign value to casted pointer.
> 
> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 229 +++++++++++++++++++++++---------
>  1 file changed, 169 insertions(+), 60 deletions(-)


BTW, this does not apply on current mainline.


> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> index 228df565e9bc..fd6ad80f0c5f 100644
> --- a/drivers/vfio/pci/vfio_pci_igd.c
> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> @@ -25,30 +25,131 @@
>  #define OPREGION_RVDS		0x3c2
>  #define OPREGION_VERSION	0x16
>  
> -static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
> -			      size_t count, loff_t *ppos, bool iswrite)
> +struct igd_opregion_vbt {
> +	void *opregion;
> +	void *vbt_ex;

	__le16 version; // see below

> +};
> +
> +static size_t vfio_pci_igd_read(struct igd_opregion_vbt *opregionvbt,
> +				char __user *buf, size_t count, loff_t *ppos)
>  {
> -	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> -	void *base = vdev->region[i].data;
> +	u16 version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion + OPREGION_VERSION));

80 column throughout please (I know we already have some violations in
this file).

>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	void *base, *shadow = NULL;
>  
> -	if (pos >= vdev->region[i].size || iswrite)
> -		return -EINVAL;
> +	/* Shift into the range for reading the extended VBT only */
> +	if (pos >= OPREGION_SIZE) {
> +		base = opregionvbt->vbt_ex + pos - OPREGION_SIZE;
> +		goto done;
> +	}
>  
> -	count = min(count, (size_t)(vdev->region[i].size - pos));
> +	/* Simply read from OpRegion if the extended VBT doesn't exist */
> +	if (!opregionvbt->vbt_ex) {
> +		base = opregionvbt->opregion + pos;
> +		goto done;
> +	} else {
> +		shadow = kzalloc(count, GFP_KERNEL);
> +

I don't really see any value in this shadow buffer, I don't think we
have any requirement to fulfill the read in a single copy_to_user().
Therefore we could do something like:

	size_t remaining = count;
	loff_t off = 0;

	if (remaining && pos < OPREGION_VERSION) {
		size_t bytes = min(remaining, OPREGION_VERSION - pos);

		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
			return -EFAULT;

		pos += bytes;
		off += bytes;
		remaining -= bytes;
	}

	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
		size_t bytes = min(remaining, OPREGION_VERSION + sizeof(__le16) - pos);

		/* reported version cached in struct igd_opregion_vbt.version */
		if (copy_to_user(buf + off, &opregionvbt->version + pos, bytes))
			return -EFAULT;

		pos += bytes;
		off += bytes;
		remaining -= bytes;
	}

	if (remaining && pos < OPREGION_RVDA) {
		size_t bytes = min(remaining, OPREGION_RVDA - pos);

		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
			return -EFAULT;

		pos += bytes;
		off += bytes;
		remaining -= bytes;
	}

	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
		size_t bytes = min(remaining, OPREGION_RVDA + sizeof(__le64) - pos);
		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ? OPREGION_SIZE : 0);

		if (copy_to_user(buf + off, &rvda + pos, bytes))
			return -EFAULT;

		pos += bytes;
		off += bytes;
		remaining -= bytes;
	}

	if (remaining && pos < OPREGION_SIZE) {
		size_t bytes = min(remaining, OPREGION_SIZE - pos);

		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
			return -EFAULT;

		pos += bytes;
		off += bytes;
		remaining -= bytes;
	}

	if (remaining) {
		if (copy_to_user(buf + off, opregionvbt->vbt_ex + pos, remaining))
			return -EFAULT;
	}

	*ppos += count;

	return count;
		
It's tedious, but extensible and simple (and avoids the partial read
problem below).  Maybe there's a macro or helper function that'd make
it less tedious.


> +		if (!shadow)
> +			return -ENOMEM;
> +	}
>  
> -	if (copy_to_user(buf, base + pos, count))
> +	/*
> +	 * If the extended VBT exist, need shift for non-contiguous reading and
> +	 * may need patch OpRegion version (for 2.0) and RVDA (for 2.0 and above)
> +	 * Use a temporary buffer to simplify the stitch and patch
> +	 */
> +
> +	/* Either crossing OpRegion and VBT or in OpRegion range only */
> +	if (pos < OPREGION_SIZE && (pos + count) > OPREGION_SIZE) {
> +		memcpy(shadow, opregionvbt->opregion + pos, OPREGION_SIZE - pos);
> +		memcpy(shadow + OPREGION_SIZE - pos, opregionvbt->vbt_ex,
> +		       pos + count - OPREGION_SIZE);
> +	} else {
> +		memcpy(shadow, opregionvbt->opregion + pos, count);
> +	}
> +
> +	/*
> +	 * Patch OpRegion 2.0 to 2.1 if extended VBT exist and reading the version
> +	 */
> +	if (opregionvbt->vbt_ex && version == 0x0200 &&
> +	    pos <= OPREGION_VERSION && pos + count > OPREGION_VERSION) {
> +		/* May only read 1 byte minor version */
> +		if (pos + count == OPREGION_VERSION + 1)
> +			*(u8 *)(shadow + OPREGION_VERSION - pos) = (u8)0x01;
> +		else
> +			*(__le16 *)(shadow + OPREGION_VERSION - pos) = cpu_to_le16(0x0201);
> +	}
> +
> +	/*
> +	 * Patch RVDA for OpRegion 2.0 and above to make the region contiguous.
> +	 * For 2.0, the requestor always see 2.1 with RVDA as relative.
> +	 * For 2.1+, RVDA is already relative, but possibly non-contiguous
> +	 *   after OpRegion.
> +	 * In both cases, patch RVDA to OpRegion size to make the extended
> +	 * VBT follows OpRegion and show the requestor a contiguous region.
> +	 * Always fail partial RVDA reading to prevent malicious reading to offset
> +	 *   of OpRegion by construct arbitrary offset.
> +	 */
> +	if (opregionvbt->vbt_ex) {
> +		/* Full RVDA reading */
> +		if (pos <= OPREGION_RVDA && pos + count >= OPREGION_RVDA + 8) {
> +			*(__le64 *)(shadow + OPREGION_RVDA - pos) = cpu_to_le64(OPREGION_SIZE);
> +		/* Fail partial reading to avoid construct arbitrary RVDA */
> +		} else {
> +			kfree(shadow);
> +			pr_err("%s: partial RVDA reading!\n", __func__);
> +			return -EFAULT;
> +		}
> +	}
> +
> +	base = shadow;
> +
> +done:
> +	if (copy_to_user(buf, base, count))
>  		return -EFAULT;
>  
> +	kfree(shadow);
> +
>  	*ppos += count;
> 
>  	return count;
>  }
>  
> +static size_t vfio_pci_igd_write(struct igd_opregion_vbt *opregionvbt,
> +				 char __user *buf, size_t count, loff_t *ppos)
> +{
> +	// Not supported yet.
> +	return -EINVAL;
> +}
> +
> +static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
> +			      size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	return (iswrite ?
> +		vfio_pci_igd_write(opregionvbt, buf, count, ppos) :
> +		vfio_pci_igd_read(opregionvbt, buf, count, ppos));
> +}

I don't think we need to go this far towards enabling write support,
I'd roll the range and iswrite check into your _read function (rename
back to _rw()) and call it good.

> +
>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>  				 struct vfio_pci_region *region)
>  {
> -	memunmap(region->data);
> +	struct igd_opregion_vbt *opregionvbt = region->data;
> +
> +	if (opregionvbt->vbt_ex)
> +		memunmap(opregionvbt->vbt_ex);
> +
> +	memunmap(opregionvbt->opregion);
> +	kfree(opregionvbt);
>  }
>  
>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> @@ -60,7 +161,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  {
>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>  	u32 addr, size;
> -	void *base;
> +	struct igd_opregion_vbt *base;


@base doesn't seem like an appropriate name for this, it was called
opregionvbt in the function above.


>  	int ret;
>  	u16 version;
>  
> @@ -71,84 +172,92 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  	if (!addr || !(~addr))
>  		return -ENODEV;
>  
> -	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
> +	base = kzalloc(sizeof(*base), GFP_KERNEL);
>  	if (!base)
>  		return -ENOMEM;
>  
> -	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
> -		memunmap(base);
> +	base->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
> +	if (!base->opregion) {
> +		kfree(base);
> +		return -ENOMEM;
> +	}
> +
> +	if (memcmp(base->opregion, OPREGION_SIGNATURE, 16)) {
> +		memunmap(base->opregion);
> +		kfree(base);
>  		return -EINVAL;
>  	}
>  
> -	size = le32_to_cpu(*(__le32 *)(base + 16));
> +	size = le32_to_cpu(*(__le32 *)(base->opregion + 16));
>  	if (!size) {
> -		memunmap(base);
> +		memunmap(base->opregion);
> +		kfree(base);
>  		return -EINVAL;
>  	}
>  
>  	size *= 1024; /* In KB */
>  
>  	/*
> -	 * Support opregion v2.1+
> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> -	 * defined before opregion 2.0.
> -	 *
> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> -	 * opregion base, and should point to the end of opregion.
> -	 * otherwise, exposing to userspace to allow read access to everything between
> -	 * the OpRegion and VBT is not safe.
> -	 * RVDS is defined as size in bytes.
> +	 * OpRegion and VBT:
> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> +	 * to hold the VBT data, the Extended VBT region is introduced since
> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> +	 * introduced to define the extended VBT data location and size.
> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> +	 *   extended VBT data, RVDS defines the VBT data size.
> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>  	 *
> -	 * opregion 2.0: rvda is the physical VBT address.
> -	 * Since rvda is HPA it cannot be directly used in guest.
> -	 * And it should not be practically available for end user,so it is not supported.
> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> +	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
> +	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
> +	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
> +	 * after OpRegion with version 2.1+ is exposed regardless the underneath
> +	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
> +	 * the actural offset into VBT so that data at correct position can be
> +	 * returned to the requester.
>  	 */
> -	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> +	version = le16_to_cpu(*(__le16 *)(base->opregion + OPREGION_VERSION));
> +

	opregionvbt->version = *(__le16 *)(base + OPREGION_VERSION)
	version = le16_to_cpu(opregionvbt->version);


>  	if (version >= 0x0200) {
> -		u64 rvda;
> -		u32 rvds;
> +		u64 rvda = le64_to_cpu(*(__le64 *)(base->opregion + OPREGION_RVDA));
> +		u32 rvds = le32_to_cpu(*(__le32 *)(base->opregion + OPREGION_RVDS));
>  
> -		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
> -		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>  		if (rvda && rvds) {
> -			/* no support for opregion v2.0 with physical VBT address */
> -			if (version == 0x0200) {
> -				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
> -				return -EINVAL;
> -			}
> +			size += rvds;
>  
> -			if (rvda != size) {
> -				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"Extended VBT does not follow opregion on version 0x%04x\n",
> -					version);
> -				return -EINVAL;
> +			if (version == 0x0200) {
> +				/* Absolute physical address for 2.0 */


			if (version == 0x0200) {
				opregionvbt->version = cpu_to_le16(0x0201);
				addr = rvda;
			} else {
				addr += rvda;
			}

			... single memremap and error path

Thanks,

Alex

> +				base->vbt_ex = memremap(rvda, rvds, MEMREMAP_WB);
> +				if (!base->vbt_ex) {
> +					memunmap(base->opregion);
> +					kfree(base);
> +					return -ENOMEM;
> +				}
> +			} else {
> +				/* Relative address to OpRegion header for 2.1+ */
> +				base->vbt_ex = memremap(addr + rvda, rvds, MEMREMAP_WB);
> +				if (!base->vbt_ex) {
> +					memunmap(base->opregion);
> +					kfree(base);
> +					return -ENOMEM;
> +				}
>  			}
> -
> -			/* region size for opregion v2.0+: opregion and VBT size. */
> -			size += rvds;
>  		}
>  	}
> 
> -	if (size != OPREGION_SIZE) {
> -		memunmap(base);
> -		base = memremap(addr, size, MEMREMAP_WB);
> -		if (!base)
> -			return -ENOMEM;
> -	}
> -
>  	ret = vfio_pci_register_dev_region(vdev,
>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>  	if (ret) {
> -		memunmap(base);
> +		if (base->vbt_ex)
> +			memunmap(base->vbt_ex);
> +
> +		memunmap(base->opregion);
> +		kfree(base);
>  		return ret;
>  	}
>  


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-09 22:00                           ` Alex Williamson
@ 2021-09-13 12:39                             ` Colin Xu
  2021-09-13 12:41                               ` [PATCH v4] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-13 12:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

yyyyyyyyyyOn Thu, 9 Sep 2021, Alex Williamson wrote:

> On Thu,  9 Sep 2021 13:09:34 +0800
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Due to historical reason, some legacy shipped system doesn't follow
>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>> VBT is not contiguous after OpRegion in physical address, but any
>> location pointed by RVDA via absolute address. Also although current
>> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
>> RVDA is the relative address to OpRegion head, the extended VBT location
>> may change to non-contiguous to OpRegion. In both cases, it's impossible
>> to map a contiguous range to hold both OpRegion and the extended VBT and
>> expose via one vfio region.
>>
>> The only difference between OpRegion 2.0 and 2.1 is where extended
>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>> baes, and there is no other difference between OpRegion 2.0 and 2.1.
>> To support the non-contiguous region case as described, the updated read
>> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
>> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
>> after OpRegion is exposed, regardless the underneath host OpRegion is
>> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
>> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
>> where the extended VBT isn't contiguous after OpRegion.
>> Also split the write op with read ops to leave flexibility for OpRegion
>> write op support in future.
>>
>> V2:
>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>
>> V3: (Alex)
>> Split read and write ops.
>> On-the-fly modify OpRegion version and RVDA.
>> Fix sparse error on assign value to casted pointer.
>>
>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>> Cc: Fred Gao <fred.gao@intel.com>
>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_igd.c | 229 +++++++++++++++++++++++---------
>>  1 file changed, 169 insertions(+), 60 deletions(-)
>
>
> BTW, this does not apply on current mainline.
Let me rebase to latest kvm mainline.
>
>
>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>> index 228df565e9bc..fd6ad80f0c5f 100644
>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>> @@ -25,30 +25,131 @@
>>  #define OPREGION_RVDS		0x3c2
>>  #define OPREGION_VERSION	0x16
>>
>> -static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>> -			      size_t count, loff_t *ppos, bool iswrite)
>> +struct igd_opregion_vbt {
>> +	void *opregion;
>> +	void *vbt_ex;
>
> 	__le16 version; // see below
>
Updated. Also add rvda here which is similarly handled.
>> +};
>> +
>> +static size_t vfio_pci_igd_read(struct igd_opregion_vbt *opregionvbt,
>> +				char __user *buf, size_t count, loff_t *ppos)
>>  {
>> -	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> -	void *base = vdev->region[i].data;
>> +	u16 version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion + OPREGION_VERSION));
>
> 80 column throughout please (I know we already have some violations in
> this file).
Done.
>
>>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	void *base, *shadow = NULL;
>>
>> -	if (pos >= vdev->region[i].size || iswrite)
>> -		return -EINVAL;
>> +	/* Shift into the range for reading the extended VBT only */
>> +	if (pos >= OPREGION_SIZE) {
>> +		base = opregionvbt->vbt_ex + pos - OPREGION_SIZE;
>> +		goto done;
>> +	}
>>
>> -	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +	/* Simply read from OpRegion if the extended VBT doesn't exist */
>> +	if (!opregionvbt->vbt_ex) {
>> +		base = opregionvbt->opregion + pos;
>> +		goto done;
>> +	} else {
>> +		shadow = kzalloc(count, GFP_KERNEL);
>> +
>
> I don't really see any value in this shadow buffer, I don't think we
> have any requirement to fulfill the read in a single copy_to_user().
> Therefore we could do something like:
>
Thanks. Update the logic based on below thought.

> 	size_t remaining = count;
> 	loff_t off = 0;
>
> 	if (remaining && pos < OPREGION_VERSION) {
> 		size_t bytes = min(remaining, OPREGION_VERSION - pos);
>
> 		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
> 			return -EFAULT;
>
> 		pos += bytes;
> 		off += bytes;
> 		remaining -= bytes;
> 	}
>
> 	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
> 		size_t bytes = min(remaining, OPREGION_VERSION + sizeof(__le16) - pos);
>
> 		/* reported version cached in struct igd_opregion_vbt.version */
> 		if (copy_to_user(buf + off, &opregionvbt->version + pos, bytes))
> 			return -EFAULT;
>
> 		pos += bytes;
> 		off += bytes;
> 		remaining -= bytes;
> 	}
>
> 	if (remaining && pos < OPREGION_RVDA) {
> 		size_t bytes = min(remaining, OPREGION_RVDA - pos);
>
> 		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
> 			return -EFAULT;
>
> 		pos += bytes;
> 		off += bytes;
> 		remaining -= bytes;
> 	}
>
> 	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
> 		size_t bytes = min(remaining, OPREGION_RVDA + sizeof(__le64) - pos);
> 		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ? OPREGION_SIZE : 0);
>
> 		if (copy_to_user(buf + off, &rvda + pos, bytes))
> 			return -EFAULT;
>
> 		pos += bytes;
> 		off += bytes;
> 		remaining -= bytes;
> 	}
>
> 	if (remaining && pos < OPREGION_SIZE) {
> 		size_t bytes = min(remaining, OPREGION_SIZE - pos);
>
> 		if (copy_to_user(buf + off, opregionvbt->opregion + pos, bytes))
> 			return -EFAULT;
>
> 		pos += bytes;
> 		off += bytes;
> 		remaining -= bytes;
> 	}
>
> 	if (remaining) {
> 		if (copy_to_user(buf + off, opregionvbt->vbt_ex + pos, remaining))
> 			return -EFAULT;
> 	}
>
> 	*ppos += count;
>
> 	return count;
>
> It's tedious, but extensible and simple (and avoids the partial read
> problem below).  Maybe there's a macro or helper function that'd make
> it less tedious.
Add a copy'n'shift helper to simplify above logic.
>
>
>> +		if (!shadow)
>> +			return -ENOMEM;
>> +	}
>>
>> -	if (copy_to_user(buf, base + pos, count))
>> +	/*
>> +	 * If the extended VBT exist, need shift for non-contiguous reading and
>> +	 * may need patch OpRegion version (for 2.0) and RVDA (for 2.0 and above)
>> +	 * Use a temporary buffer to simplify the stitch and patch
>> +	 */
>> +
>> +	/* Either crossing OpRegion and VBT or in OpRegion range only */
>> +	if (pos < OPREGION_SIZE && (pos + count) > OPREGION_SIZE) {
>> +		memcpy(shadow, opregionvbt->opregion + pos, OPREGION_SIZE - pos);
>> +		memcpy(shadow + OPREGION_SIZE - pos, opregionvbt->vbt_ex,
>> +		       pos + count - OPREGION_SIZE);
>> +	} else {
>> +		memcpy(shadow, opregionvbt->opregion + pos, count);
>> +	}
>> +
>> +	/*
>> +	 * Patch OpRegion 2.0 to 2.1 if extended VBT exist and reading the version
>> +	 */
>> +	if (opregionvbt->vbt_ex && version == 0x0200 &&
>> +	    pos <= OPREGION_VERSION && pos + count > OPREGION_VERSION) {
>> +		/* May only read 1 byte minor version */
>> +		if (pos + count == OPREGION_VERSION + 1)
>> +			*(u8 *)(shadow + OPREGION_VERSION - pos) = (u8)0x01;
>> +		else
>> +			*(__le16 *)(shadow + OPREGION_VERSION - pos) = cpu_to_le16(0x0201);
>> +	}
>> +
>> +	/*
>> +	 * Patch RVDA for OpRegion 2.0 and above to make the region contiguous.
>> +	 * For 2.0, the requestor always see 2.1 with RVDA as relative.
>> +	 * For 2.1+, RVDA is already relative, but possibly non-contiguous
>> +	 *   after OpRegion.
>> +	 * In both cases, patch RVDA to OpRegion size to make the extended
>> +	 * VBT follows OpRegion and show the requestor a contiguous region.
>> +	 * Always fail partial RVDA reading to prevent malicious reading to offset
>> +	 *   of OpRegion by construct arbitrary offset.
>> +	 */
>> +	if (opregionvbt->vbt_ex) {
>> +		/* Full RVDA reading */
>> +		if (pos <= OPREGION_RVDA && pos + count >= OPREGION_RVDA + 8) {
>> +			*(__le64 *)(shadow + OPREGION_RVDA - pos) = cpu_to_le64(OPREGION_SIZE);
>> +		/* Fail partial reading to avoid construct arbitrary RVDA */
>> +		} else {
>> +			kfree(shadow);
>> +			pr_err("%s: partial RVDA reading!\n", __func__);
>> +			return -EFAULT;
>> +		}
>> +	}
>> +
>> +	base = shadow;
>> +
>> +done:
>> +	if (copy_to_user(buf, base, count))
>>  		return -EFAULT;
>>
>> +	kfree(shadow);
>> +
>>  	*ppos += count;
>>
>>  	return count;
>>  }
>>
>> +static size_t vfio_pci_igd_write(struct igd_opregion_vbt *opregionvbt,
>> +				 char __user *buf, size_t count, loff_t *ppos)
>> +{
>> +	// Not supported yet.
>> +	return -EINVAL;
>> +}
>> +
>> +static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>> +			      size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos >= vdev->region[i].size)
>> +		return -EINVAL;
>> +
>> +	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +
>> +	return (iswrite ?
>> +		vfio_pci_igd_write(opregionvbt, buf, count, ppos) :
>> +		vfio_pci_igd_read(opregionvbt, buf, count, ppos));
>> +}
>
> I don't think we need to go this far towards enabling write support,
> I'd roll the range and iswrite check into your _read function (rename
> back to _rw()) and call it good.
Remove the write op, and re-implement _rw() as above.
>
>> +
>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>  				 struct vfio_pci_region *region)
>>  {
>> -	memunmap(region->data);
>> +	struct igd_opregion_vbt *opregionvbt = region->data;
>> +
>> +	if (opregionvbt->vbt_ex)
>> +		memunmap(opregionvbt->vbt_ex);
>> +
>> +	memunmap(opregionvbt->opregion);
>> +	kfree(opregionvbt);
>>  }
>>
>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>> @@ -60,7 +161,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  {
>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>  	u32 addr, size;
>> -	void *base;
>> +	struct igd_opregion_vbt *base;
>
>
> @base doesn't seem like an appropriate name for this, it was called
> opregionvbt in the function above.
opregionvbt is a little longer to keep within 80 column so re-use base.
Now rename to opregionvbt to make it meaningful and avoid confusion.
>
>
>>  	int ret;
>>  	u16 version;
>>
>> @@ -71,84 +172,92 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  	if (!addr || !(~addr))
>>  		return -ENODEV;
>>
>> -	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
>> +	base = kzalloc(sizeof(*base), GFP_KERNEL);
>>  	if (!base)
>>  		return -ENOMEM;
>>
>> -	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
>> -		memunmap(base);
>> +	base->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
>> +	if (!base->opregion) {
>> +		kfree(base);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (memcmp(base->opregion, OPREGION_SIGNATURE, 16)) {
>> +		memunmap(base->opregion);
>> +		kfree(base);
>>  		return -EINVAL;
>>  	}
>>
>> -	size = le32_to_cpu(*(__le32 *)(base + 16));
>> +	size = le32_to_cpu(*(__le32 *)(base->opregion + 16));
>>  	if (!size) {
>> -		memunmap(base);
>> +		memunmap(base->opregion);
>> +		kfree(base);
>>  		return -EINVAL;
>>  	}
>>
>>  	size *= 1024; /* In KB */
>>
>>  	/*
>> -	 * Support opregion v2.1+
>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>> -	 * defined before opregion 2.0.
>> -	 *
>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>> -	 * opregion base, and should point to the end of opregion.
>> -	 * otherwise, exposing to userspace to allow read access to everything between
>> -	 * the OpRegion and VBT is not safe.
>> -	 * RVDS is defined as size in bytes.
>> +	 * OpRegion and VBT:
>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>> +	 * introduced to define the extended VBT data location and size.
>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>> +	 *   extended VBT data, RVDS defines the VBT data size.
>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>  	 *
>> -	 * opregion 2.0: rvda is the physical VBT address.
>> -	 * Since rvda is HPA it cannot be directly used in guest.
>> -	 * And it should not be practically available for end user,so it is not supported.
>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>> +	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
>> +	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
>> +	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
>> +	 * after OpRegion with version 2.1+ is exposed regardless the underneath
>> +	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
>> +	 * the actural offset into VBT so that data at correct position can be
>> +	 * returned to the requester.
>>  	 */
>> -	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>> +	version = le16_to_cpu(*(__le16 *)(base->opregion + OPREGION_VERSION));
>> +
>
> 	opregionvbt->version = *(__le16 *)(base + OPREGION_VERSION)
> 	version = le16_to_cpu(opregionvbt->version);
>
>
>>  	if (version >= 0x0200) {
>> -		u64 rvda;
>> -		u32 rvds;
>> +		u64 rvda = le64_to_cpu(*(__le64 *)(base->opregion + OPREGION_RVDA));
>> +		u32 rvds = le32_to_cpu(*(__le32 *)(base->opregion + OPREGION_RVDS));
>>
>> -		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>> -		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero. */
>>  		if (rvda && rvds) {
>> -			/* no support for opregion v2.0 with physical VBT address */
>> -			if (version == 0x0200) {
>> -				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
>> -				return -EINVAL;
>> -			}
>> +			size += rvds;
>>
>> -			if (rvda != size) {
>> -				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"Extended VBT does not follow opregion on version 0x%04x\n",
>> -					version);
>> -				return -EINVAL;
>> +			if (version == 0x0200) {
>> +				/* Absolute physical address for 2.0 */
>
>
> 			if (version == 0x0200) {
> 				opregionvbt->version = cpu_to_le16(0x0201);
> 				addr = rvda;
> 			} else {
> 				addr += rvda;
> 			}
>
> 			... single memremap and error path
I add rvda here, either 0 or OPREGION_SIZE, so that the copy helper can 
directly copy from.

The updated version will be sent via v4 patch.

Thanks!
Colin
>
> Thanks,
>
> Alex
>
>> +				base->vbt_ex = memremap(rvda, rvds, MEMREMAP_WB);
>> +				if (!base->vbt_ex) {
>> +					memunmap(base->opregion);
>> +					kfree(base);
>> +					return -ENOMEM;
>> +				}
>> +			} else {
>> +				/* Relative address to OpRegion header for 2.1+ */
>> +				base->vbt_ex = memremap(addr + rvda, rvds, MEMREMAP_WB);
>> +				if (!base->vbt_ex) {
>> +					memunmap(base->opregion);
>> +					kfree(base);
>> +					return -ENOMEM;
>> +				}
>>  			}
>> -
>> -			/* region size for opregion v2.0+: opregion and VBT size. */
>> -			size += rvds;
>>  		}
>>  	}
>>
>> -	if (size != OPREGION_SIZE) {
>> -		memunmap(base);
>> -		base = memremap(addr, size, MEMREMAP_WB);
>> -		if (!base)
>> -			return -ENOMEM;
>> -	}
>> -
>>  	ret = vfio_pci_register_dev_region(vdev,
>>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>>  		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>>  		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>>  	if (ret) {
>> -		memunmap(base);
>> +		if (base->vbt_ex)
>> +			memunmap(base->vbt_ex);
>> +
>> +		memunmap(base->opregion);
>> +		kfree(base);
>>  		return ret;
>>  	}
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-13 12:39                             ` Colin Xu
@ 2021-09-13 12:41                               ` Colin Xu
  2021-09-13 15:14                                 ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-13 12:41 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.
Also split the write op with read ops to leave flexibility for OpRegion
write op support in future.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

V4: (Alex)
No need support write op.
Direct copy to user buffer with several shift instead of shadow.
Copy helper to copy to user buffer and shift offset.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 229 ++++++++++++++++++++++++--------
 1 file changed, 174 insertions(+), 55 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..14e958893be6 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,20 +25,119 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+	__le16 version;
+	__le64 rvda;
+};
+
+/**
+ * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
+ * @dst: User buffer ptr to copy to.
+ * @off: Offset to user buffer ptr. Increased by bytes_adv on return.
+ * @src: Source buffer to copy from.
+ * @pos: Increased by bytes_adv on return.
+ * @remaining: Decreased by bytes_adv on return.
+ * @bytes_cp: Bytes to copy.
+ * @bytes_adv: Bytes to adjust off, pos and remaining.
+ *
+ * Copy OpRegion to offset from specific source ptr and shift the offset.
+ *
+ * Return: 0 on success, -EFAULT otherwise.
+ *
+ */
+static inline unsigned long igd_opregion_shift_copy(char __user *dst,
+						    loff_t *off,
+						    void *src,
+						    loff_t *pos,
+						    loff_t *remaining,
+						    loff_t bytes_cp,
+						    loff_t bytes_adv)
+{
+	if (copy_to_user(dst + (*off), src, bytes_cp))
+		return -EFAULT;
+
+	*off += bytes_adv;
+	*pos += bytes_adv;
+	*remaining -= bytes_adv;
+
+	return 0;
+}
+
 static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 			      size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
 	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t remaining = count;
+	loff_t off = 0;
 
 	if (pos >= vdev->region[i].size || iswrite)
 		return -EINVAL;
 
 	count = min(count, (size_t)(vdev->region[i].size - pos));
 
-	if (copy_to_user(buf, base + pos, count))
-		return -EFAULT;
+	/* Copy until OpRegion version */
+	if (remaining && pos < OPREGION_VERSION) {
+		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy patched (if necessary) OpRegion version */
+	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
+		loff_t bytes = min(remaining,
+				   OPREGION_VERSION + (loff_t)sizeof(__le16) - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &opregionvbt->version, &pos,
+					    &remaining, bytes, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy until RVDA */
+	if (remaining && pos < OPREGION_RVDA) {
+		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy modified (if necessary) RVDA */
+	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
+		loff_t bytes = min(remaining, OPREGION_RVDA +
+					      (loff_t)sizeof(__le64) - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &opregionvbt->rvda, &pos,
+					    &remaining, bytes, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy the rest of OpRegion */
+	if (remaining && pos < OPREGION_SIZE) {
+		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy extended VBT if exists */
+	if (remaining) {
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->vbt_ex, &pos,
+					    &remaining, remaining, 0))
+			return -EFAULT;
+	}
 
 	*ppos += count;
 
@@ -48,7 +147,13 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -60,7 +165,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *opregionvbt;
 	int ret;
 	u16 version;
 
@@ -71,84 +176,98 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
-	if (!base)
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	if (!opregionvbt)
+		return -ENOMEM;
+
+	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!opregionvbt->opregion) {
+		kfree(opregionvbt);
 		return -ENOMEM;
+	}
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
-	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
+	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
+	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
+	 * after OpRegion with version 2.1+ is exposed regardless the underneath
+	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
+	 * the actural offset into VBT so that data at correct position can be
+	 * returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	opregionvbt->version = *(__le16 *)(opregionvbt->opregion +
+					   OPREGION_VERSION);
+	version = le16_to_cpu(opregionvbt->version);
+
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
+						   OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
+						   OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
+			size += rvds;
+
 			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
+				/* Patch to version 2.0 in read ops */
+				opregionvbt->version = cpu_to_le16(0x0201);
+				/* Absolute physical addr for 2.0 */
+				addr = rvda;
+			} else {
+				/* Relative addr to OpRegion header for 2.1+ */
+				addr += rvda;
 			}
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
+			if (!opregionvbt->vbt_ex) {
+				memunmap(opregionvbt->opregion);
+				kfree(opregionvbt);
+				return -ENOMEM;
 			}
 
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
+			/* Always set RVDA to make exVBT follows OpRegion */
+			opregionvbt->rvda = cpu_to_le64(OPREGION_SIZE);
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
-		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
+		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
+		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
 	if (ret) {
-		memunmap(base);
+		if (opregionvbt->vbt_ex)
+			memunmap(opregionvbt->vbt_ex);
+
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-13 12:41                               ` [PATCH v4] " Colin Xu
@ 2021-09-13 15:14                                 ` Alex Williamson
  2021-09-14  4:18                                   ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-09-13 15:14 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Mon, 13 Sep 2021 20:41:58 +0800
Colin Xu <colin.xu@intel.com> wrote:

> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contiguous after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Also although current
> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
> RVDA is the relative address to OpRegion head, the extended VBT location
> may change to non-contiguous to OpRegion. In both cases, it's impossible
> to map a contiguous range to hold both OpRegion and the extended VBT and
> expose via one vfio region.
> 
> The only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1.
> To support the non-contiguous region case as described, the updated read
> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
> after OpRegion is exposed, regardless the underneath host OpRegion is
> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
> where the extended VBT isn't contiguous after OpRegion.
> Also split the write op with read ops to leave flexibility for OpRegion
> write op support in future.
> 
> V2:
> Validate RVDA for 2.1+ before increasing total size. (Alex)
> 
> V3: (Alex)
> Split read and write ops.
> On-the-fly modify OpRegion version and RVDA.
> Fix sparse error on assign value to casted pointer.
> 
> V4: (Alex)
> No need support write op.
> Direct copy to user buffer with several shift instead of shadow.
> Copy helper to copy to user buffer and shift offset.
> 
> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 229 ++++++++++++++++++++++++--------
>  1 file changed, 174 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> index 228df565e9bc..14e958893be6 100644
> --- a/drivers/vfio/pci/vfio_pci_igd.c
> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> @@ -25,20 +25,119 @@
>  #define OPREGION_RVDS		0x3c2
>  #define OPREGION_VERSION	0x16
>  
> +struct igd_opregion_vbt {
> +	void *opregion;
> +	void *vbt_ex;
> +	__le16 version;
> +	__le64 rvda;

I thought storing version here was questionable because we're really
only saving ourselves a read from the opregion, test against 0x0200,
and conversion to 0x0201.  Storing rvda here feel gratuitous since it
can be calculated from readily available data in the rw function.

> +};
> +
> +/**
> + * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
> + * @dst: User buffer ptr to copy to.
> + * @off: Offset to user buffer ptr. Increased by bytes_adv on return.
> + * @src: Source buffer to copy from.
> + * @pos: Increased by bytes_adv on return.
> + * @remaining: Decreased by bytes_adv on return.
> + * @bytes_cp: Bytes to copy.
> + * @bytes_adv: Bytes to adjust off, pos and remaining.
> + *
> + * Copy OpRegion to offset from specific source ptr and shift the offset.
> + *
> + * Return: 0 on success, -EFAULT otherwise.
> + *
> + */
> +static inline unsigned long igd_opregion_shift_copy(char __user *dst,
> +						    loff_t *off,
> +						    void *src,
> +						    loff_t *pos,
> +						    loff_t *remaining,
> +						    loff_t bytes_cp,
> +						    loff_t bytes_adv)
> +{
> +	if (copy_to_user(dst + (*off), src, bytes_cp))
> +		return -EFAULT;
> +
> +	*off += bytes_adv;
> +	*pos += bytes_adv;
> +	*remaining -= bytes_adv;

@bytes_cp always equals @bytes_adv except for the last case, it's not
worth the special handling imo.

> +
> +	return 0;
> +}
> +
>  static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>  			      size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> -	void *base = vdev->region[i].data;
> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t remaining = count;
> +	loff_t off = 0;
>  
>  	if (pos >= vdev->region[i].size || iswrite)
>  		return -EINVAL;
>  
>  	count = min(count, (size_t)(vdev->region[i].size - pos));

We set @remaining before we bounds check @count here.  Thanks,

Alex

>  
> -	if (copy_to_user(buf, base + pos, count))
> -		return -EFAULT;
> +	/* Copy until OpRegion version */
> +	if (remaining && pos < OPREGION_VERSION) {
> +		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy patched (if necessary) OpRegion version */
> +	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
> +		loff_t bytes = min(remaining,
> +				   OPREGION_VERSION + (loff_t)sizeof(__le16) - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    &opregionvbt->version, &pos,
> +					    &remaining, bytes, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy until RVDA */
> +	if (remaining && pos < OPREGION_RVDA) {
> +		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy modified (if necessary) RVDA */
> +	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
> +		loff_t bytes = min(remaining, OPREGION_RVDA +
> +					      (loff_t)sizeof(__le64) - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    &opregionvbt->rvda, &pos,
> +					    &remaining, bytes, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy the rest of OpRegion */
> +	if (remaining && pos < OPREGION_SIZE) {
> +		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy extended VBT if exists */
> +	if (remaining) {
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->vbt_ex, &pos,
> +					    &remaining, remaining, 0))
> +			return -EFAULT;
> +	}
>  
>  	*ppos += count;
>  
> @@ -48,7 +147,13 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>  				 struct vfio_pci_region *region)
>  {
> -	memunmap(region->data);
> +	struct igd_opregion_vbt *opregionvbt = region->data;
> +
> +	if (opregionvbt->vbt_ex)
> +		memunmap(opregionvbt->vbt_ex);
> +
> +	memunmap(opregionvbt->opregion);
> +	kfree(opregionvbt);
>  }
>  
>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
> @@ -60,7 +165,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  {
>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>  	u32 addr, size;
> -	void *base;
> +	struct igd_opregion_vbt *opregionvbt;
>  	int ret;
>  	u16 version;
>  
> @@ -71,84 +176,98 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>  	if (!addr || !(~addr))
>  		return -ENODEV;
>  
> -	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
> -	if (!base)
> +	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
> +	if (!opregionvbt)
> +		return -ENOMEM;
> +
> +	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
> +	if (!opregionvbt->opregion) {
> +		kfree(opregionvbt);
>  		return -ENOMEM;
> +	}
>  
> -	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
> -		memunmap(base);
> +	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
> +		memunmap(opregionvbt->opregion);
> +		kfree(opregionvbt);
>  		return -EINVAL;
>  	}
>  
> -	size = le32_to_cpu(*(__le32 *)(base + 16));
> +	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
>  	if (!size) {
> -		memunmap(base);
> +		memunmap(opregionvbt->opregion);
> +		kfree(opregionvbt);
>  		return -EINVAL;
>  	}
>  
>  	size *= 1024; /* In KB */
>  
>  	/*
> -	 * Support opregion v2.1+
> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> -	 * defined before opregion 2.0.
> -	 *
> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> -	 * opregion base, and should point to the end of opregion.
> -	 * otherwise, exposing to userspace to allow read access to everything between
> -	 * the OpRegion and VBT is not safe.
> -	 * RVDS is defined as size in bytes.
> +	 * OpRegion and VBT:
> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> +	 * to hold the VBT data, the Extended VBT region is introduced since
> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> +	 * introduced to define the extended VBT data location and size.
> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> +	 *   extended VBT data, RVDS defines the VBT data size.
> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>  	 *
> -	 * opregion 2.0: rvda is the physical VBT address.
> -	 * Since rvda is HPA it cannot be directly used in guest.
> -	 * And it should not be practically available for end user,so it is not supported.
> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> +	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
> +	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
> +	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
> +	 * after OpRegion with version 2.1+ is exposed regardless the underneath
> +	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
> +	 * the actural offset into VBT so that data at correct position can be
> +	 * returned to the requester.
>  	 */
> -	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
> +	opregionvbt->version = *(__le16 *)(opregionvbt->opregion +
> +					   OPREGION_VERSION);
> +	version = le16_to_cpu(opregionvbt->version);
> +
>  	if (version >= 0x0200) {
> -		u64 rvda;
> -		u32 rvds;
> +		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
> +						   OPREGION_RVDA));
> +		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
> +						   OPREGION_RVDS));
>  
> -		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
> -		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
>  		if (rvda && rvds) {
> -			/* no support for opregion v2.0 with physical VBT address */
> +			size += rvds;
> +
>  			if (version == 0x0200) {
> -				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
> -				return -EINVAL;
> +				/* Patch to version 2.0 in read ops */
> +				opregionvbt->version = cpu_to_le16(0x0201);
> +				/* Absolute physical addr for 2.0 */
> +				addr = rvda;
> +			} else {
> +				/* Relative addr to OpRegion header for 2.1+ */
> +				addr += rvda;
>  			}
>  
> -			if (rvda != size) {
> -				memunmap(base);
> -				pci_err(vdev->pdev,
> -					"Extended VBT does not follow opregion on version 0x%04x\n",
> -					version);
> -				return -EINVAL;
> +			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
> +			if (!opregionvbt->vbt_ex) {
> +				memunmap(opregionvbt->opregion);
> +				kfree(opregionvbt);
> +				return -ENOMEM;
>  			}
>  
> -			/* region size for opregion v2.0+: opregion and VBT size. */
> -			size += rvds;
> +			/* Always set RVDA to make exVBT follows OpRegion */
> +			opregionvbt->rvda = cpu_to_le64(OPREGION_SIZE);
>  		}
>  	}
>  
> -	if (size != OPREGION_SIZE) {
> -		memunmap(base);
> -		base = memremap(addr, size, MEMREMAP_WB);
> -		if (!base)
> -			return -ENOMEM;
> -	}
> -
>  	ret = vfio_pci_register_dev_region(vdev,
>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> -		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
> -		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
> +		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
> +		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
>  	if (ret) {
> -		memunmap(base);
> +		if (opregionvbt->vbt_ex)
> +			memunmap(opregionvbt->vbt_ex);
> +
> +		memunmap(opregionvbt->opregion);
> +		kfree(opregionvbt);
>  		return ret;
>  	}
>  


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-13 15:14                                 ` Alex Williamson
@ 2021-09-14  4:18                                   ` Colin Xu
  2021-09-14  4:29                                     ` [PATCH v5] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-14  4:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Colin Xu, kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Mon, 13 Sep 2021, Alex Williamson wrote:

> On Mon, 13 Sep 2021 20:41:58 +0800
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Due to historical reason, some legacy shipped system doesn't follow
>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>> VBT is not contiguous after OpRegion in physical address, but any
>> location pointed by RVDA via absolute address. Also although current
>> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
>> RVDA is the relative address to OpRegion head, the extended VBT location
>> may change to non-contiguous to OpRegion. In both cases, it's impossible
>> to map a contiguous range to hold both OpRegion and the extended VBT and
>> expose via one vfio region.
>>
>> The only difference between OpRegion 2.0 and 2.1 is where extended
>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>> baes, and there is no other difference between OpRegion 2.0 and 2.1.
>> To support the non-contiguous region case as described, the updated read
>> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
>> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
>> after OpRegion is exposed, regardless the underneath host OpRegion is
>> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
>> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
>> where the extended VBT isn't contiguous after OpRegion.
>> Also split the write op with read ops to leave flexibility for OpRegion
>> write op support in future.
>>
>> V2:
>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>
>> V3: (Alex)
>> Split read and write ops.
>> On-the-fly modify OpRegion version and RVDA.
>> Fix sparse error on assign value to casted pointer.
>>
>> V4: (Alex)
>> No need support write op.
>> Direct copy to user buffer with several shift instead of shadow.
>> Copy helper to copy to user buffer and shift offset.
>>
>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>> Cc: Fred Gao <fred.gao@intel.com>
>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_igd.c | 229 ++++++++++++++++++++++++--------
>>  1 file changed, 174 insertions(+), 55 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>> index 228df565e9bc..14e958893be6 100644
>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>> @@ -25,20 +25,119 @@
>>  #define OPREGION_RVDS		0x3c2
>>  #define OPREGION_VERSION	0x16
>>
>> +struct igd_opregion_vbt {
>> +	void *opregion;
>> +	void *vbt_ex;
>> +	__le16 version;
>> +	__le64 rvda;
>
> I thought storing version here was questionable because we're really
> only saving ourselves a read from the opregion, test against 0x0200,
> and conversion to 0x0201.  Storing rvda here feel gratuitous since it
> can be calculated from readily available data in the rw function.
>
Let me move both to patch on copy.
>> +};
>> +
>> +/**
>> + * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
>> + * @dst: User buffer ptr to copy to.
>> + * @off: Offset to user buffer ptr. Increased by bytes_adv on return.
>> + * @src: Source buffer to copy from.
>> + * @pos: Increased by bytes_adv on return.
>> + * @remaining: Decreased by bytes_adv on return.
>> + * @bytes_cp: Bytes to copy.
>> + * @bytes_adv: Bytes to adjust off, pos and remaining.
>> + *
>> + * Copy OpRegion to offset from specific source ptr and shift the offset.
>> + *
>> + * Return: 0 on success, -EFAULT otherwise.
>> + *
>> + */
>> +static inline unsigned long igd_opregion_shift_copy(char __user *dst,
>> +						    loff_t *off,
>> +						    void *src,
>> +						    loff_t *pos,
>> +						    loff_t *remaining,
>> +						    loff_t bytes_cp,
>> +						    loff_t bytes_adv)
>> +{
>> +	if (copy_to_user(dst + (*off), src, bytes_cp))
>> +		return -EFAULT;
>> +
>> +	*off += bytes_adv;
>> +	*pos += bytes_adv;
>> +	*remaining -= bytes_adv;
>
> @bytes_cp always equals @bytes_adv except for the last case, it's not
> worth the special handling imo.
>
Understood. I'll remove the adv and directly use copy_to_user() for that.
>> +
>> +	return 0;
>> +}
>> +
>>  static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>  			      size_t count, loff_t *ppos, bool iswrite)
>>  {
>>  	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> -	void *base = vdev->region[i].data;
>> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
>>  	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	loff_t remaining = count;
>> +	loff_t off = 0;
>>
>>  	if (pos >= vdev->region[i].size || iswrite)
>>  		return -EINVAL;
>>
>>  	count = min(count, (size_t)(vdev->region[i].size - pos));
>
> We set @remaining before we bounds check @count here.  Thanks,
>
Oops. How careless I am. Fixed.
> Alex
>
>>
>> -	if (copy_to_user(buf, base + pos, count))
>> -		return -EFAULT;
>> +	/* Copy until OpRegion version */
>> +	if (remaining && pos < OPREGION_VERSION) {
>> +		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy patched (if necessary) OpRegion version */
>> +	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
>> +		loff_t bytes = min(remaining,
>> +				   OPREGION_VERSION + (loff_t)sizeof(__le16) - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    &opregionvbt->version, &pos,
>> +					    &remaining, bytes, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy until RVDA */
>> +	if (remaining && pos < OPREGION_RVDA) {
>> +		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy modified (if necessary) RVDA */
>> +	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
>> +		loff_t bytes = min(remaining, OPREGION_RVDA +
>> +					      (loff_t)sizeof(__le64) - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    &opregionvbt->rvda, &pos,
>> +					    &remaining, bytes, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy the rest of OpRegion */
>> +	if (remaining && pos < OPREGION_SIZE) {
>> +		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy extended VBT if exists */
>> +	if (remaining) {
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->vbt_ex, &pos,
>> +					    &remaining, remaining, 0))
>> +			return -EFAULT;
>> +	}
>>
>>  	*ppos += count;
>>
>> @@ -48,7 +147,13 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>  static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
>>  				 struct vfio_pci_region *region)
>>  {
>> -	memunmap(region->data);
>> +	struct igd_opregion_vbt *opregionvbt = region->data;
>> +
>> +	if (opregionvbt->vbt_ex)
>> +		memunmap(opregionvbt->vbt_ex);
>> +
>> +	memunmap(opregionvbt->opregion);
>> +	kfree(opregionvbt);
>>  }
>>
>>  static const struct vfio_pci_regops vfio_pci_igd_regops = {
>> @@ -60,7 +165,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  {
>>  	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
>>  	u32 addr, size;
>> -	void *base;
>> +	struct igd_opregion_vbt *opregionvbt;
>>  	int ret;
>>  	u16 version;
>>
>> @@ -71,84 +176,98 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
>>  	if (!addr || !(~addr))
>>  		return -ENODEV;
>>
>> -	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
>> -	if (!base)
>> +	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
>> +	if (!opregionvbt)
>> +		return -ENOMEM;
>> +
>> +	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
>> +	if (!opregionvbt->opregion) {
>> +		kfree(opregionvbt);
>>  		return -ENOMEM;
>> +	}
>>
>> -	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
>> -		memunmap(base);
>> +	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
>> +		memunmap(opregionvbt->opregion);
>> +		kfree(opregionvbt);
>>  		return -EINVAL;
>>  	}
>>
>> -	size = le32_to_cpu(*(__le32 *)(base + 16));
>> +	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
>>  	if (!size) {
>> -		memunmap(base);
>> +		memunmap(opregionvbt->opregion);
>> +		kfree(opregionvbt);
>>  		return -EINVAL;
>>  	}
>>
>>  	size *= 1024; /* In KB */
>>
>>  	/*
>> -	 * Support opregion v2.1+
>> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
>> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
>> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
>> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
>> -	 * address from region base and size of VBT data. RVDA/RVDS are not
>> -	 * defined before opregion 2.0.
>> -	 *
>> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
>> -	 * opregion base, and should point to the end of opregion.
>> -	 * otherwise, exposing to userspace to allow read access to everything between
>> -	 * the OpRegion and VBT is not safe.
>> -	 * RVDS is defined as size in bytes.
>> +	 * OpRegion and VBT:
>> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
>> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
>> +	 * to hold the VBT data, the Extended VBT region is introduced since
>> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
>> +	 * introduced to define the extended VBT data location and size.
>> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
>> +	 *   extended VBT data, RVDS defines the VBT data size.
>> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
>> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>>  	 *
>> -	 * opregion 2.0: rvda is the physical VBT address.
>> -	 * Since rvda is HPA it cannot be directly used in guest.
>> -	 * And it should not be practically available for end user,so it is not supported.
>> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
>> +	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
>> +	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
>> +	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
>> +	 * after OpRegion with version 2.1+ is exposed regardless the underneath
>> +	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
>> +	 * the actural offset into VBT so that data at correct position can be
>> +	 * returned to the requester.
>>  	 */
>> -	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
>> +	opregionvbt->version = *(__le16 *)(opregionvbt->opregion +
>> +					   OPREGION_VERSION);
>> +	version = le16_to_cpu(opregionvbt->version);
>> +
>>  	if (version >= 0x0200) {
>> -		u64 rvda;
>> -		u32 rvds;
>> +		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
>> +						   OPREGION_RVDA));
>> +		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
>> +						   OPREGION_RVDS));
>>
>> -		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
>> -		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
>> +		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
>>  		if (rvda && rvds) {
>> -			/* no support for opregion v2.0 with physical VBT address */
>> +			size += rvds;
>> +
>>  			if (version == 0x0200) {
>> -				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
>> -				return -EINVAL;
>> +				/* Patch to version 2.0 in read ops */
>> +				opregionvbt->version = cpu_to_le16(0x0201);
>> +				/* Absolute physical addr for 2.0 */
>> +				addr = rvda;
>> +			} else {
>> +				/* Relative addr to OpRegion header for 2.1+ */
>> +				addr += rvda;
>>  			}
>>
>> -			if (rvda != size) {
>> -				memunmap(base);
>> -				pci_err(vdev->pdev,
>> -					"Extended VBT does not follow opregion on version 0x%04x\n",
>> -					version);
>> -				return -EINVAL;
>> +			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
>> +			if (!opregionvbt->vbt_ex) {
>> +				memunmap(opregionvbt->opregion);
>> +				kfree(opregionvbt);
>> +				return -ENOMEM;
>>  			}
>>
>> -			/* region size for opregion v2.0+: opregion and VBT size. */
>> -			size += rvds;
>> +			/* Always set RVDA to make exVBT follows OpRegion */
>> +			opregionvbt->rvda = cpu_to_le64(OPREGION_SIZE);
>>  		}
>>  	}
>>
>> -	if (size != OPREGION_SIZE) {
>> -		memunmap(base);
>> -		base = memremap(addr, size, MEMREMAP_WB);
>> -		if (!base)
>> -			return -ENOMEM;
>> -	}
>> -
>>  	ret = vfio_pci_register_dev_region(vdev,
>>  		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
>> -		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
>> -		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
>> +		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
>> +		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
>>  	if (ret) {
>> -		memunmap(base);
>> +		if (opregionvbt->vbt_ex)
>> +			memunmap(opregionvbt->vbt_ex);
>> +
>> +		memunmap(opregionvbt->opregion);
>> +		kfree(opregionvbt);
>>  		return ret;
>>  	}
>>
>
>

--
Best Regards,
Colin Xu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-14  4:18                                   ` Colin Xu
@ 2021-09-14  4:29                                     ` Colin Xu
  2021-09-14  9:11                                       ` [PATCH v6] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-14  4:29 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

V4: (Alex)
No need support write op.
Direct copy to user buffer with several shift instead of shadow.
Copy helper to copy to user buffer and shift offset.

V5: (Alex)
Simplify copy help to only cover common shift case.
Don't cache patched version and rvda. Patch on copy if necessary.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 230 ++++++++++++++++++++++++--------
 1 file changed, 172 insertions(+), 58 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..3b74d04c706e 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,19 +25,118 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+};
+
+/**
+ * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
+ * @dst: User buffer ptr to copy to.
+ * @off: Offset to user buffer ptr. Increased by bytes_adv on return.
+ * @src: Source buffer to copy from.
+ * @pos: Increased by bytes_adv on return.
+ * @remaining: Decreased by bytes_adv on return.
+ * @bytes: Bytes to copy and adjust off, pos and remaining.
+ *
+ * Copy OpRegion to offset from specific source ptr and shift the offset.
+ *
+ * Return: 0 on success, -EFAULT otherwise.
+ *
+ */
+static inline unsigned long igd_opregion_shift_copy(char __user *dst,
+						    loff_t *off,
+						    void *src,
+						    loff_t *pos,
+						    loff_t *remaining,
+						    loff_t bytes)
+{
+	if (copy_to_user(dst + (*off), src, bytes))
+		return -EFAULT;
+
+	*off += bytes;
+	*pos += bytes;
+	*remaining -= bytes;
+
+	return 0;
+}
+
 static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 			      size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0, remaining;
 
 	if (pos >= vdev->region[i].size || iswrite)
 		return -EINVAL;
 
 	count = min(count, (size_t)(vdev->region[i].size - pos));
+	remaining = count;
+
+	/* Copy until OpRegion version */
+	if (remaining && pos < OPREGION_VERSION) {
+		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy patched (if necessary) OpRegion version */
+	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
+		loff_t bytes = min(remaining,
+				   OPREGION_VERSION + (loff_t)sizeof(__le16) - pos);
+		__le16 version = *(__le16 *)(opregionvbt->opregion +
+					     OPREGION_VERSION);
+
+		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
+		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
+			version = cpu_to_le16(0x0201);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &version, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy until RVDA */
+	if (remaining && pos < OPREGION_RVDA) {
+		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
 
-	if (copy_to_user(buf, base + pos, count))
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy modified (if necessary) RVDA */
+	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
+		loff_t bytes = min(remaining, OPREGION_RVDA +
+					      (loff_t)sizeof(__le64) - pos);
+		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
+					  OPREGION_SIZE : 0);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &rvda, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy the rest of OpRegion */
+	if (remaining && pos < OPREGION_SIZE) {
+		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy extended VBT if exists */
+	if (remaining &&
+	    copy_to_user(buf + off, opregionvbt->vbt_ex, remaining))
 		return -EFAULT;
 
 	*ppos += count;
@@ -48,7 +147,13 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -60,7 +165,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *opregionvbt;
 	int ret;
 	u16 version;
 
@@ -71,84 +176,93 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
-	if (!base)
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	if (!opregionvbt)
+		return -ENOMEM;
+
+	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!opregionvbt->opregion) {
+		kfree(opregionvbt);
 		return -ENOMEM;
+	}
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
-	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
+	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
+	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
+	 * after OpRegion with version 2.1+ is exposed regardless the underneath
+	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
+	 * the actural offset into VBT so that data at correct position can be
+	 * returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion +
+					  OPREGION_VERSION));
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
+						   OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
+						   OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
-			}
+			size += rvds;
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			/*
+			 * Extended VBT location by RVDA:
+			 * Absolute physical addr for 2.0.
+			 * Relative addr to OpRegion header for 2.1+.
+			 */
+			if (version == 0x0200)
+				addr = rvda;
+			else
+				addr += rvda;
+
+			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
+			if (!opregionvbt->vbt_ex) {
+				memunmap(opregionvbt->opregion);
+				kfree(opregionvbt);
+				return -ENOMEM;
 			}
-
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
-		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
+		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
+		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
 	if (ret) {
-		memunmap(base);
+		if (opregionvbt->vbt_ex)
+			memunmap(opregionvbt->vbt_ex);
+
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v6] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-14  4:29                                     ` [PATCH v5] " Colin Xu
@ 2021-09-14  9:11                                       ` Colin Xu
  2021-09-24 21:24                                         ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-09-14  9:11 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

V4: (Alex)
No need support write op.
Direct copy to user buffer with several shift instead of shadow.
Copy helper to copy to user buffer and shift offset.

V5: (Alex)
Simplify copy help to only cover common shift case.
Don't cache patched version and rvda. Patch on copy if necessary.

V6:
Fix comment typo and max line width.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 231 ++++++++++++++++++++++++--------
 1 file changed, 173 insertions(+), 58 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 228df565e9bc..081be59c7948 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,19 +25,119 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+};
+
+/**
+ * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
+ * @dst: User buffer ptr to copy to.
+ * @off: Offset to user buffer ptr. Increased by bytes on return.
+ * @src: Source buffer to copy from.
+ * @pos: Increased by bytes on return.
+ * @remaining: Decreased by bytes on return.
+ * @bytes: Bytes to copy and adjust off, pos and remaining.
+ *
+ * Copy OpRegion to offset from specific source ptr and shift the offset.
+ *
+ * Return: 0 on success, -EFAULT otherwise.
+ *
+ */
+static inline unsigned long igd_opregion_shift_copy(char __user *dst,
+						    loff_t *off,
+						    void *src,
+						    loff_t *pos,
+						    loff_t *remaining,
+						    loff_t bytes)
+{
+	if (copy_to_user(dst + (*off), src, bytes))
+		return -EFAULT;
+
+	*off += bytes;
+	*pos += bytes;
+	*remaining -= bytes;
+
+	return 0;
+}
+
 static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 			      size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0, remaining;
 
 	if (pos >= vdev->region[i].size || iswrite)
 		return -EINVAL;
 
 	count = min(count, (size_t)(vdev->region[i].size - pos));
+	remaining = count;
+
+	/* Copy until OpRegion version */
+	if (remaining && pos < OPREGION_VERSION) {
+		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy patched (if necessary) OpRegion version */
+	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
+		loff_t bytes = min(remaining,
+				   OPREGION_VERSION + (loff_t)sizeof(__le16) -
+				   pos);
+		__le16 version = *(__le16 *)(opregionvbt->opregion +
+					     OPREGION_VERSION);
+
+		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
+		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
+			version = cpu_to_le16(0x0201);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &version, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy until RVDA */
+	if (remaining && pos < OPREGION_RVDA) {
+		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy modified (if necessary) RVDA */
+	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
+		loff_t bytes = min(remaining, OPREGION_RVDA +
+					      (loff_t)sizeof(__le64) - pos);
+		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
+					  OPREGION_SIZE : 0);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &rvda, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
 
-	if (copy_to_user(buf, base + pos, count))
+	/* Copy the rest of OpRegion */
+	if (remaining && pos < OPREGION_SIZE) {
+		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy extended VBT if exists */
+	if (remaining &&
+	    copy_to_user(buf + off, opregionvbt->vbt_ex, remaining))
 		return -EFAULT;
 
 	*ppos += count;
@@ -48,7 +148,13 @@ static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
 static void vfio_pci_igd_release(struct vfio_pci_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }

 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -60,7 +166,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *opregionvbt;
 	int ret;
 	u16 version;
 
@@ -71,84 +177,93 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
-	if (!base)
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	if (!opregionvbt)
 		return -ENOMEM;
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!opregionvbt->opregion) {
+		kfree(opregionvbt);
+		return -ENOMEM;
+	}
+
+	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
-	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
+	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
+	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
+	 * after OpRegion with version 2.1+ is exposed regardless the underneath
+	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
+	 * the actural offset into VBT so that data at correct position can be
+	 * returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion +
+					  OPREGION_VERSION));
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
+						   OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
+						   OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
-			}
+			size += rvds;
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			/*
+			 * Extended VBT location by RVDA:
+			 * Absolute physical addr for 2.0.
+			 * Relative addr to OpRegion header for 2.1+.
+			 */
+			if (version == 0x0200)
+				addr = rvda;
+			else
+				addr += rvda;
+
+			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
+			if (!opregionvbt->vbt_ex) {
+				memunmap(opregionvbt->opregion);
+				kfree(opregionvbt);
+				return -ENOMEM;
 			}
-
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
-		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
+		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
+		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
 	if (ret) {
-		memunmap(base);
+		if (opregionvbt->vbt_ex)
+			memunmap(opregionvbt->vbt_ex);
+
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v6] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-14  9:11                                       ` [PATCH v6] " Colin Xu
@ 2021-09-24 21:24                                         ` Alex Williamson
  2021-10-03 15:46                                           ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-09-24 21:24 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Tue, 14 Sep 2021 17:11:55 +0800
Colin Xu <colin.xu@intel.com> wrote:

> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contiguous after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Also although current
> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
> RVDA is the relative address to OpRegion head, the extended VBT location
> may change to non-contiguous to OpRegion. In both cases, it's impossible
> to map a contiguous range to hold both OpRegion and the extended VBT and
> expose via one vfio region.
> 
> The only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1.
> To support the non-contiguous region case as described, the updated read
> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
> after OpRegion is exposed, regardless the underneath host OpRegion is
> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
> where the extended VBT isn't contiguous after OpRegion.
> 
> V2:
> Validate RVDA for 2.1+ before increasing total size. (Alex)
> 
> V3: (Alex)
> Split read and write ops.
> On-the-fly modify OpRegion version and RVDA.
> Fix sparse error on assign value to casted pointer.
> 
> V4: (Alex)
> No need support write op.
> Direct copy to user buffer with several shift instead of shadow.
> Copy helper to copy to user buffer and shift offset.
> 
> V5: (Alex)
> Simplify copy help to only cover common shift case.
> Don't cache patched version and rvda. Patch on copy if necessary.
> 
> V6:
> Fix comment typo and max line width.
> 
> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 231 ++++++++++++++++++++++++--------
>  1 file changed, 173 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
> index 228df565e9bc..081be59c7948 100644
> --- a/drivers/vfio/pci/vfio_pci_igd.c
> +++ b/drivers/vfio/pci/vfio_pci_igd.c
> @@ -25,19 +25,119 @@
>  #define OPREGION_RVDS		0x3c2
>  #define OPREGION_VERSION	0x16
>  
> +struct igd_opregion_vbt {
> +	void *opregion;
> +	void *vbt_ex;
> +};
> +
> +/**
> + * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
> + * @dst: User buffer ptr to copy to.
> + * @off: Offset to user buffer ptr. Increased by bytes on return.
> + * @src: Source buffer to copy from.
> + * @pos: Increased by bytes on return.
> + * @remaining: Decreased by bytes on return.
> + * @bytes: Bytes to copy and adjust off, pos and remaining.
> + *
> + * Copy OpRegion to offset from specific source ptr and shift the offset.
> + *
> + * Return: 0 on success, -EFAULT otherwise.
> + *
> + */
> +static inline unsigned long igd_opregion_shift_copy(char __user *dst,
> +						    loff_t *off,
> +						    void *src,
> +						    loff_t *pos,
> +						    loff_t *remaining,
> +						    loff_t bytes)

@bytes and @remaining should be size_t throughout.

> +{
> +	if (copy_to_user(dst + (*off), src, bytes))
> +		return -EFAULT;
> +
> +	*off += bytes;
> +	*pos += bytes;
> +	*remaining -= bytes;
> +
> +	return 0;
> +}
> +
>  static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>  			      size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> -	void *base = vdev->region[i].data;
> -	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0, remaining;
>  
>  	if (pos >= vdev->region[i].size || iswrite)
>  		return -EINVAL;
>  
>  	count = min(count, (size_t)(vdev->region[i].size - pos));
> +	remaining = count;
> +
> +	/* Copy until OpRegion version */
> +	if (remaining && pos < OPREGION_VERSION) {
> +		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy patched (if necessary) OpRegion version */
> +	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
> +		loff_t bytes = min(remaining,
> +				   OPREGION_VERSION + (loff_t)sizeof(__le16) -
> +				   pos);
> +		__le16 version = *(__le16 *)(opregionvbt->opregion +
> +					     OPREGION_VERSION);
> +
> +		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
> +		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
> +			version = cpu_to_le16(0x0201);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    &version, &pos,
> +					    &remaining, bytes))

This looks wrong, what if pos was (OPREGION_VERSION + 1)?  We'd copy
the first byte instead of the second.  We need to add (pos -
OPREGION_VERSION) to the source.


> +			return -EFAULT;
> +	}
> +
> +	/* Copy until RVDA */
> +	if (remaining && pos < OPREGION_RVDA) {
> +		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy modified (if necessary) RVDA */
> +	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
> +		loff_t bytes = min(remaining, OPREGION_RVDA +
> +					      (loff_t)sizeof(__le64) - pos);
> +		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
> +					  OPREGION_SIZE : 0);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    &rvda, &pos,
> +					    &remaining, bytes))

Same here, + (pos - OPREGION_RVDA)

> +			return -EFAULT;
> +	}
>  
> -	if (copy_to_user(buf, base + pos, count))
> +	/* Copy the rest of OpRegion */
> +	if (remaining && pos < OPREGION_SIZE) {
> +		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes))
> +			return -EFAULT;
> +	}
> +
> +	/* Copy extended VBT if exists */
> +	if (remaining &&
> +	    copy_to_user(buf + off, opregionvbt->vbt_ex, remaining))

And here, + (pos - OPREGION_SIZE)

Also this doesn't apply to mainline, please rebase to linux-next or at
least the latest rc kernel.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-09-24 21:24                                         ` Alex Williamson
@ 2021-10-03 15:46                                           ` Colin Xu
  2021-10-03 15:53                                             ` [PATCH v7] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-10-03 15:46 UTC (permalink / raw)
  To: alex.williamson
  Cc: colin.xu, kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Fri, 24 Sep 2021, Alex Williamson wrote:

> On Tue, 14 Sep 2021 17:11:55 +0800
> Colin Xu <colin.xu@intel.com> wrote:
>
>> Due to historical reason, some legacy shipped system doesn't follow
>> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
>> VBT is not contiguous after OpRegion in physical address, but any
>> location pointed by RVDA via absolute address. Also although current
>> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
>> RVDA is the relative address to OpRegion head, the extended VBT location
>> may change to non-contiguous to OpRegion. In both cases, it's impossible
>> to map a contiguous range to hold both OpRegion and the extended VBT and
>> expose via one vfio region.
>>
>> The only difference between OpRegion 2.0 and 2.1 is where extended
>> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
>> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
>> baes, and there is no other difference between OpRegion 2.0 and 2.1.
>> To support the non-contiguous region case as described, the updated read
>> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
>> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
>> after OpRegion is exposed, regardless the underneath host OpRegion is
>> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
>> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
>> where the extended VBT isn't contiguous after OpRegion.
>>
>> V2:
>> Validate RVDA for 2.1+ before increasing total size. (Alex)
>>
>> V3: (Alex)
>> Split read and write ops.
>> On-the-fly modify OpRegion version and RVDA.
>> Fix sparse error on assign value to casted pointer.
>>
>> V4: (Alex)
>> No need support write op.
>> Direct copy to user buffer with several shift instead of shadow.
>> Copy helper to copy to user buffer and shift offset.
>>
>> V5: (Alex)
>> Simplify copy help to only cover common shift case.
>> Don't cache patched version and rvda. Patch on copy if necessary.
>>
>> V6:
>> Fix comment typo and max line width.
>>
>> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
>> Cc: Hang Yuan <hang.yuan@linux.intel.com>
>> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
>> Cc: Fred Gao <fred.gao@intel.com>
>> Signed-off-by: Colin Xu <colin.xu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_igd.c | 231 ++++++++++++++++++++++++--------
>>  1 file changed, 173 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
>> index 228df565e9bc..081be59c7948 100644
>> --- a/drivers/vfio/pci/vfio_pci_igd.c
>> +++ b/drivers/vfio/pci/vfio_pci_igd.c
>> @@ -25,19 +25,119 @@
>>  #define OPREGION_RVDS		0x3c2
>>  #define OPREGION_VERSION	0x16
>>
>> +struct igd_opregion_vbt {
>> +	void *opregion;
>> +	void *vbt_ex;
>> +};
>> +
>> +/**
>> + * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
>> + * @dst: User buffer ptr to copy to.
>> + * @off: Offset to user buffer ptr. Increased by bytes on return.
>> + * @src: Source buffer to copy from.
>> + * @pos: Increased by bytes on return.
>> + * @remaining: Decreased by bytes on return.
>> + * @bytes: Bytes to copy and adjust off, pos and remaining.
>> + *
>> + * Copy OpRegion to offset from specific source ptr and shift the offset.
>> + *
>> + * Return: 0 on success, -EFAULT otherwise.
>> + *
>> + */
>> +static inline unsigned long igd_opregion_shift_copy(char __user *dst,
>> +						    loff_t *off,
>> +						    void *src,
>> +						    loff_t *pos,
>> +						    loff_t *remaining,
>> +						    loff_t bytes)
>
> @bytes and @remaining should be size_t throughout.
>
Fixed in v7.
At first min() reports __careful_cmp() warning so I change to loff_t.
But the proper is way to do a cast.
>> +{
>> +	if (copy_to_user(dst + (*off), src, bytes))
>> +		return -EFAULT;
>> +
>> +	*off += bytes;
>> +	*pos += bytes;
>> +	*remaining -= bytes;
>> +
>> +	return 0;
>> +}
>> +
>>  static size_t vfio_pci_igd_rw(struct vfio_pci_device *vdev, char __user *buf,
>>  			      size_t count, loff_t *ppos, bool iswrite)
>>  {
>>  	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>> -	void *base = vdev->region[i].data;
>> -	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
>> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0, remaining;
>>
>>  	if (pos >= vdev->region[i].size || iswrite)
>>  		return -EINVAL;
>>
>>  	count = min(count, (size_t)(vdev->region[i].size - pos));
>> +	remaining = count;
>> +
>> +	/* Copy until OpRegion version */
>> +	if (remaining && pos < OPREGION_VERSION) {
>> +		loff_t bytes = min(remaining, OPREGION_VERSION - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy patched (if necessary) OpRegion version */
>> +	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
>> +		loff_t bytes = min(remaining,
>> +				   OPREGION_VERSION + (loff_t)sizeof(__le16) -
>> +				   pos);
>> +		__le16 version = *(__le16 *)(opregionvbt->opregion +
>> +					     OPREGION_VERSION);
>> +
>> +		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
>> +		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
>> +			version = cpu_to_le16(0x0201);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    &version, &pos,
>> +					    &remaining, bytes))
>
> This looks wrong, what if pos was (OPREGION_VERSION + 1)?  We'd copy
> the first byte instead of the second.  We need to add (pos -
> OPREGION_VERSION) to the source.
>
>
Fixed in v7. Thanks for the catching.
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy until RVDA */
>> +	if (remaining && pos < OPREGION_RVDA) {
>> +		loff_t bytes = min((loff_t)remaining, OPREGION_RVDA - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy modified (if necessary) RVDA */
>> +	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
>> +		loff_t bytes = min(remaining, OPREGION_RVDA +
>> +					      (loff_t)sizeof(__le64) - pos);
>> +		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
>> +					  OPREGION_SIZE : 0);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    &rvda, &pos,
>> +					    &remaining, bytes))
>
> Same here, + (pos - OPREGION_RVDA)
>
>> +			return -EFAULT;
>> +	}
>>
>> -	if (copy_to_user(buf, base + pos, count))
>> +	/* Copy the rest of OpRegion */
>> +	if (remaining && pos < OPREGION_SIZE) {
>> +		loff_t bytes = min(remaining, OPREGION_SIZE - pos);
>> +
>> +		if (igd_opregion_shift_copy(buf, &off,
>> +					    opregionvbt->opregion + pos, &pos,
>> +					    &remaining, bytes))
>> +			return -EFAULT;
>> +	}
>> +
>> +	/* Copy extended VBT if exists */
>> +	if (remaining &&
>> +	    copy_to_user(buf + off, opregionvbt->vbt_ex, remaining))
>
> And here, + (pos - OPREGION_SIZE)
>
> Also this doesn't apply to mainline, please rebase to linux-next or at
> least the latest rc kernel.  Thanks,
>
I noticed there comes some patch on vfio-pci. Now rebased to linux-next.
> Alex
>
>
btw, I'm not working in Intel currently so continue the revise with
personal email.

--
Best Regards,
Colin Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v7] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-10-03 15:46                                           ` Colin Xu
@ 2021-10-03 15:53                                             ` Colin Xu
  2021-10-11 21:44                                               ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-10-03 15:53 UTC (permalink / raw)
  To: alex.williamson
  Cc: colin.xu, kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

From: Colin Xu <colin.xu@intel.com>

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

V4: (Alex)
No need support write op.
Direct copy to user buffer with several shift instead of shadow.
Copy helper to copy to user buffer and shift offset.

V5: (Alex)
Simplify copy help to only cover common shift case.
Don't cache patched version and rvda. Patch on copy if necessary.

V6:
Fix comment typo and max line width.

V7:
Keep bytes to copy/remain as size_t.
Proper shift byte address on copy source.
Rebase to linux-next.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 234 ++++++++++++++++++++++++--------
 1 file changed, 176 insertions(+), 58 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 7ca4109bba48..77a71eac56f0 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,20 +25,123 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+};
+
+/**
+ * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
+ * @dst: User buffer ptr to copy to.
+ * @off: Offset to user buffer ptr. Increased by bytes on return.
+ * @src: Source buffer to copy from.
+ * @pos: Increased by bytes on return.
+ * @remaining: Decreased by bytes on return.
+ * @bytes: Bytes to copy and adjust off, pos and remaining.
+ *
+ * Copy OpRegion to offset from specific source ptr and shift the offset.
+ *
+ * Return: 0 on success, -EFAULT otherwise.
+ *
+ */
+static inline unsigned long igd_opregion_shift_copy(char __user *dst,
+						    loff_t *off,
+						    void *src,
+						    loff_t *pos,
+						    size_t *remaining,
+						    size_t bytes)
+{
+	if (copy_to_user(dst + (*off), src, bytes))
+		return -EFAULT;
+
+	*off += bytes;
+	*pos += bytes;
+	*remaining -= bytes;
+
+	return 0;
+}
+
 static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
 			       char __user *buf, size_t count, loff_t *ppos,
 			       bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0;
+	size_t remaining;
 
 	if (pos >= vdev->region[i].size || iswrite)
 		return -EINVAL;
 
 	count = min(count, (size_t)(vdev->region[i].size - pos));
+	remaining = count;
+
+	/* Copy until OpRegion version */
+	if (remaining && pos < OPREGION_VERSION) {
+		size_t bytes = min(remaining, OPREGION_VERSION - (size_t)pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
 
-	if (copy_to_user(buf, base + pos, count))
+	/* Copy patched (if necessary) OpRegion version */
+	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
+		size_t bytes = min(remaining,
+				   OPREGION_VERSION + (size_t)sizeof(__le16) -
+				   (size_t)pos);
+		__le16 version = *(__le16 *)(opregionvbt->opregion +
+					     OPREGION_VERSION);
+
+		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
+		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
+			version = cpu_to_le16(0x0201);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &version + (pos - OPREGION_VERSION),
+					    &pos, &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy until RVDA */
+	if (remaining && pos < OPREGION_RVDA) {
+		size_t bytes = min(remaining, OPREGION_RVDA - (size_t)pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy modified (if necessary) RVDA */
+	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
+		size_t bytes = min(remaining,
+				   OPREGION_RVDA + (size_t)sizeof(__le64) -
+				   (size_t)pos);
+		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
+					  OPREGION_SIZE : 0);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &rvda + (pos - OPREGION_RVDA),
+					    &pos, &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy the rest of OpRegion */
+	if (remaining && pos < OPREGION_SIZE) {
+		size_t bytes = min(remaining, OPREGION_SIZE - (size_t)pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy extended VBT if exists */
+	if (remaining &&
+	    copy_to_user(buf + off, opregionvbt->vbt_ex + (pos - OPREGION_SIZE),
+			 remaining))
 		return -EFAULT;
 
 	*ppos += count;
@@ -49,7 +152,13 @@ static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
 static void vfio_pci_igd_release(struct vfio_pci_core_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -61,7 +170,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *opregionvbt;
 	int ret;
 	u16 version;
 
@@ -72,84 +181,93 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
-	if (!base)
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	if (!opregionvbt)
 		return -ENOMEM;
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!opregionvbt->opregion) {
+		kfree(opregionvbt);
+		return -ENOMEM;
+	}
+
+	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
-	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
+	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
+	 * OpRegion 2.0 and above makes it possible to support the non-contiguous
+	 * VBT via a single vfio region. From r/w ops view, only contiguous VBT
+	 * after OpRegion with version 2.1+ is exposed regardless the underneath
+	 * host is 2.0 or non-contiguous 2.1+. The r/w ops will on-the-fly shift
+	 * the actural offset into VBT so that data at correct position can be
+	 * returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion +
+					  OPREGION_VERSION));
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
+						   OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
+						   OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
-			}
+			size += rvds;
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			/*
+			 * Extended VBT location by RVDA:
+			 * Absolute physical addr for 2.0.
+			 * Relative addr to OpRegion header for 2.1+.
+			 */
+			if (version == 0x0200)
+				addr = rvda;
+			else
+				addr += rvda;
+
+			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
+			if (!opregionvbt->vbt_ex) {
+				memunmap(opregionvbt->opregion);
+				kfree(opregionvbt);
+				return -ENOMEM;
 			}
-
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
-		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
+		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
+		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
 	if (ret) {
-		memunmap(base);
+		if (opregionvbt->vbt_ex)
+			memunmap(opregionvbt->vbt_ex);
+
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v7] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-10-03 15:53                                             ` [PATCH v7] " Colin Xu
@ 2021-10-11 21:44                                               ` Alex Williamson
  2021-10-12 12:48                                                 ` [PATCH v8] " Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-10-11 21:44 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Sun,  3 Oct 2021 23:53:10 +0800
Colin Xu <colin.xu@gmail.com> wrote:
...
>  static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
>  			       char __user *buf, size_t count, loff_t *ppos,
>  			       bool iswrite)
>  {
>  	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> -	void *base = vdev->region[i].data;
> -	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0;
> +	size_t remaining;
>  
>  	if (pos >= vdev->region[i].size || iswrite)
>  		return -EINVAL;
>  
>  	count = min(count, (size_t)(vdev->region[i].size - pos));
> +	remaining = count;
> +
> +	/* Copy until OpRegion version */
> +	if (remaining && pos < OPREGION_VERSION) {
> +		size_t bytes = min(remaining, OPREGION_VERSION - (size_t)pos);


mint_t(size_t,...) is probably the better option than casting the
individual operands, especially when we're casting multiple operands as
below.


> +
> +		if (igd_opregion_shift_copy(buf, &off,
> +					    opregionvbt->opregion + pos, &pos,
> +					    &remaining, bytes))
> +			return -EFAULT;
> +	}
>  
...
>  	/*
> -	 * Support opregion v2.1+
> -	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
> -	 * the Extended VBT region next to opregion is used to hold the VBT data.
> -	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
> -	 * (Raw VBT Data Size) from opregion structure member are used to hold the
> -	 * address from region base and size of VBT data. RVDA/RVDS are not
> -	 * defined before opregion 2.0.
> +	 * OpRegion and VBT:
> +	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
> +	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
> +	 * to hold the VBT data, the Extended VBT region is introduced since
> +	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
> +	 * introduced to define the extended VBT data location and size.
> +	 * OpRegion 2.0: RVDA defines the absolute physical address of the
> +	 *   extended VBT data, RVDS defines the VBT data size.
> +	 * OpRegion 2.1 and above: RVDA defines the relative address of the
> +	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
>  	 *
> -	 * opregion 2.1+: RVDA is unsigned, relative offset from
> -	 * opregion base, and should point to the end of opregion.
> -	 * otherwise, exposing to userspace to allow read access to everything between
> -	 * the OpRegion and VBT is not safe.
> -	 * RVDS is defined as size in bytes.
> -	 *
> -	 * opregion 2.0: rvda is the physical VBT address.
> -	 * Since rvda is HPA it cannot be directly used in guest.
> -	 * And it should not be practically available for end user,so it is not supported.
> +	 * Due to the RVDA difference in OpRegion VBT (also the only diff between
> +	 * 2.0 and 2.1), expose OpRegion and VBT as a contiguous range for
> +	 * OpRegion 2.0 and above makes it possible to support the non-contiguous


The lines ending between$ and contiguous$ are still just over 80
columns.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v8] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-10-11 21:44                                               ` Alex Williamson
@ 2021-10-12 12:48                                                 ` Colin Xu
  2021-10-12 17:12                                                   ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Colin Xu @ 2021-10-12 12:48 UTC (permalink / raw)
  To: alex.williamson
  Cc: colin.xu, kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

From: Colin Xu <colin.xu@intel.com>

Due to historical reason, some legacy shipped system doesn't follow
OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
VBT is not contiguous after OpRegion in physical address, but any
location pointed by RVDA via absolute address. Also although current
OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
RVDA is the relative address to OpRegion head, the extended VBT location
may change to non-contiguous to OpRegion. In both cases, it's impossible
to map a contiguous range to hold both OpRegion and the extended VBT and
expose via one vfio region.

The only difference between OpRegion 2.0 and 2.1 is where extended
VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
while for 2.1, RVDA is the relative address of extended VBT to OpRegion
baes, and there is no other difference between OpRegion 2.0 and 2.1.
To support the non-contiguous region case as described, the updated read
op will patch OpRegion version and RVDA on-the-fly accordingly. So that
from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
after OpRegion is exposed, regardless the underneath host OpRegion is
2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
2.0 extended VBT systems with on the market, and support OpRegion 2.1+
where the extended VBT isn't contiguous after OpRegion.

V2:
Validate RVDA for 2.1+ before increasing total size. (Alex)

V3: (Alex)
Split read and write ops.
On-the-fly modify OpRegion version and RVDA.
Fix sparse error on assign value to casted pointer.

V4: (Alex)
No need support write op.
Direct copy to user buffer with several shift instead of shadow.
Copy helper to copy to user buffer and shift offset.

V5: (Alex)
Simplify copy help to only cover common shift case.
Don't cache patched version and rvda. Patch on copy if necessary.

V6:
Fix comment typo and max line width.

V7:
Keep bytes to copy/remain as size_t.
Proper shift byte address on copy source.
Rebase to linux-next.

V8:
Replace min() with min_t() to avoid type cast.
Wrap long lines.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Hang Yuan <hang.yuan@linux.intel.com>
Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
Cc: Fred Gao <fred.gao@intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
---
 drivers/vfio/pci/vfio_pci_igd.c | 234 ++++++++++++++++++++++++--------
 1 file changed, 175 insertions(+), 59 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_igd.c b/drivers/vfio/pci/vfio_pci_igd.c
index 7ca4109bba48..56cd551e0e04 100644
--- a/drivers/vfio/pci/vfio_pci_igd.c
+++ b/drivers/vfio/pci/vfio_pci_igd.c
@@ -25,20 +25,121 @@
 #define OPREGION_RVDS		0x3c2
 #define OPREGION_VERSION	0x16
 
+struct igd_opregion_vbt {
+	void *opregion;
+	void *vbt_ex;
+};
+
+/**
+ * igd_opregion_shift_copy() - Copy OpRegion to user buffer and shift position.
+ * @dst: User buffer ptr to copy to.
+ * @off: Offset to user buffer ptr. Increased by bytes on return.
+ * @src: Source buffer to copy from.
+ * @pos: Increased by bytes on return.
+ * @remaining: Decreased by bytes on return.
+ * @bytes: Bytes to copy and adjust off, pos and remaining.
+ *
+ * Copy OpRegion to offset from specific source ptr and shift the offset.
+ *
+ * Return: 0 on success, -EFAULT otherwise.
+ *
+ */
+static inline unsigned long igd_opregion_shift_copy(char __user *dst,
+						    loff_t *off,
+						    void *src,
+						    loff_t *pos,
+						    size_t *remaining,
+						    size_t bytes)
+{
+	if (copy_to_user(dst + (*off), src, bytes))
+		return -EFAULT;
+
+	*off += bytes;
+	*pos += bytes;
+	*remaining -= bytes;
+
+	return 0;
+}
+
 static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
 			       char __user *buf, size_t count, loff_t *ppos,
 			       bool iswrite)
 {
 	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
-	void *base = vdev->region[i].data;
-	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	struct igd_opregion_vbt *opregionvbt = vdev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK, off = 0;
+	size_t remaining;
 
 	if (pos >= vdev->region[i].size || iswrite)
 		return -EINVAL;
 
-	count = min(count, (size_t)(vdev->region[i].size - pos));
+	count = min_t(size_t, count, vdev->region[i].size - pos);
+	remaining = count;
+
+	/* Copy until OpRegion version */
+	if (remaining && pos < OPREGION_VERSION) {
+		size_t bytes = min_t(size_t, remaining, OPREGION_VERSION - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy patched (if necessary) OpRegion version */
+	if (remaining && pos < OPREGION_VERSION + sizeof(__le16)) {
+		size_t bytes = min_t(size_t, remaining,
+				     OPREGION_VERSION + sizeof(__le16) - pos);
+		__le16 version = *(__le16 *)(opregionvbt->opregion +
+					     OPREGION_VERSION);
+
+		/* Patch to 2.1 if OpRegion 2.0 has extended VBT */
+		if (le16_to_cpu(version) == 0x0200 && opregionvbt->vbt_ex)
+			version = cpu_to_le16(0x0201);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &version + (pos - OPREGION_VERSION),
+					    &pos, &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy until RVDA */
+	if (remaining && pos < OPREGION_RVDA) {
+		size_t bytes = min_t(size_t, remaining, OPREGION_RVDA - pos);
 
-	if (copy_to_user(buf, base + pos, count))
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy modified (if necessary) RVDA */
+	if (remaining && pos < OPREGION_RVDA + sizeof(__le64)) {
+		size_t bytes = min_t(size_t, remaining,
+				     OPREGION_RVDA + sizeof(__le64) - pos);
+		__le64 rvda = cpu_to_le64(opregionvbt->vbt_ex ?
+					  OPREGION_SIZE : 0);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    &rvda + (pos - OPREGION_RVDA),
+					    &pos, &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy the rest of OpRegion */
+	if (remaining && pos < OPREGION_SIZE) {
+		size_t bytes = min_t(size_t, remaining, OPREGION_SIZE - pos);
+
+		if (igd_opregion_shift_copy(buf, &off,
+					    opregionvbt->opregion + pos, &pos,
+					    &remaining, bytes))
+			return -EFAULT;
+	}
+
+	/* Copy extended VBT if exists */
+	if (remaining &&
+	    copy_to_user(buf + off, opregionvbt->vbt_ex + (pos - OPREGION_SIZE),
+			 remaining))
 		return -EFAULT;
 
 	*ppos += count;
@@ -49,7 +150,13 @@ static ssize_t vfio_pci_igd_rw(struct vfio_pci_core_device *vdev,
 static void vfio_pci_igd_release(struct vfio_pci_core_device *vdev,
 				 struct vfio_pci_region *region)
 {
-	memunmap(region->data);
+	struct igd_opregion_vbt *opregionvbt = region->data;
+
+	if (opregionvbt->vbt_ex)
+		memunmap(opregionvbt->vbt_ex);
+
+	memunmap(opregionvbt->opregion);
+	kfree(opregionvbt);
 }
 
 static const struct vfio_pci_regops vfio_pci_igd_regops = {
@@ -61,7 +168,7 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev)
 {
 	__le32 *dwordp = (__le32 *)(vdev->vconfig + OPREGION_PCI_ADDR);
 	u32 addr, size;
-	void *base;
+	struct igd_opregion_vbt *opregionvbt;
 	int ret;
 	u16 version;
 
@@ -72,84 +179,93 @@ static int vfio_pci_igd_opregion_init(struct vfio_pci_core_device *vdev)
 	if (!addr || !(~addr))
 		return -ENODEV;
 
-	base = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
-	if (!base)
+	opregionvbt = kzalloc(sizeof(*opregionvbt), GFP_KERNEL);
+	if (!opregionvbt)
+		return -ENOMEM;
+
+	opregionvbt->opregion = memremap(addr, OPREGION_SIZE, MEMREMAP_WB);
+	if (!opregionvbt->opregion) {
+		kfree(opregionvbt);
 		return -ENOMEM;
+	}
 
-	if (memcmp(base, OPREGION_SIGNATURE, 16)) {
-		memunmap(base);
+	if (memcmp(opregionvbt->opregion, OPREGION_SIGNATURE, 16)) {
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
-	size = le32_to_cpu(*(__le32 *)(base + 16));
+	size = le32_to_cpu(*(__le32 *)(opregionvbt->opregion + 16));
 	if (!size) {
-		memunmap(base);
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return -EINVAL;
 	}
 
 	size *= 1024; /* In KB */
 
 	/*
-	 * Support opregion v2.1+
-	 * When VBT data exceeds 6KB size and cannot be within mailbox #4, then
-	 * the Extended VBT region next to opregion is used to hold the VBT data.
-	 * RVDA (Relative Address of VBT Data from Opregion Base) and RVDS
-	 * (Raw VBT Data Size) from opregion structure member are used to hold the
-	 * address from region base and size of VBT data. RVDA/RVDS are not
-	 * defined before opregion 2.0.
-	 *
-	 * opregion 2.1+: RVDA is unsigned, relative offset from
-	 * opregion base, and should point to the end of opregion.
-	 * otherwise, exposing to userspace to allow read access to everything between
-	 * the OpRegion and VBT is not safe.
-	 * RVDS is defined as size in bytes.
+	 * OpRegion and VBT:
+	 * When VBT data doesn't exceed 6KB, it's stored in Mailbox #4.
+	 * When VBT data exceeds 6KB size, Mailbox #4 is no longer large enough
+	 * to hold the VBT data, the Extended VBT region is introduced since
+	 * OpRegion 2.0 to hold the VBT data. Since OpRegion 2.0, RVDA/RVDS are
+	 * introduced to define the extended VBT data location and size.
+	 * OpRegion 2.0: RVDA defines the absolute physical address of the
+	 *   extended VBT data, RVDS defines the VBT data size.
+	 * OpRegion 2.1 and above: RVDA defines the relative address of the
+	 *   extended VBT data to OpRegion base, RVDS defines the VBT data size.
 	 *
-	 * opregion 2.0: rvda is the physical VBT address.
-	 * Since rvda is HPA it cannot be directly used in guest.
-	 * And it should not be practically available for end user,so it is not supported.
+	 * Due to the RVDA definition diff in OpRegion VBT (also the only diff
+	 * between 2.0 and 2.1), exposing OpRegion and VBT as a contiguous range
+	 * for OpRegion 2.0 and above makes it possible to support the
+	 * non-contiguous VBT through a single vfio region. From r/w ops view,
+	 * only contiguous VBT after OpRegion with version 2.1+ is exposed,
+	 * regardless the host OpRegion is 2.0 or non-contiguous 2.1+. The r/w
+	 * ops will on-the-fly shift the actural offset into VBT so that data at
+	 * correct position can be returned to the requester.
 	 */
-	version = le16_to_cpu(*(__le16 *)(base + OPREGION_VERSION));
+	version = le16_to_cpu(*(__le16 *)(opregionvbt->opregion +
+					  OPREGION_VERSION));
 	if (version >= 0x0200) {
-		u64 rvda;
-		u32 rvds;
+		u64 rvda = le64_to_cpu(*(__le64 *)(opregionvbt->opregion +
+						   OPREGION_RVDA));
+		u32 rvds = le32_to_cpu(*(__le32 *)(opregionvbt->opregion +
+						   OPREGION_RVDS));
 
-		rvda = le64_to_cpu(*(__le64 *)(base + OPREGION_RVDA));
-		rvds = le32_to_cpu(*(__le32 *)(base + OPREGION_RVDS));
+		/* The extended VBT is valid only when RVDA/RVDS are non-zero */
 		if (rvda && rvds) {
-			/* no support for opregion v2.0 with physical VBT address */
-			if (version == 0x0200) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"IGD assignment does not support opregion v2.0 with an extended VBT region\n");
-				return -EINVAL;
-			}
+			size += rvds;
 
-			if (rvda != size) {
-				memunmap(base);
-				pci_err(vdev->pdev,
-					"Extended VBT does not follow opregion on version 0x%04x\n",
-					version);
-				return -EINVAL;
+			/*
+			 * Extended VBT location by RVDA:
+			 * Absolute physical addr for 2.0.
+			 * Relative addr to OpRegion header for 2.1+.
+			 */
+			if (version == 0x0200)
+				addr = rvda;
+			else
+				addr += rvda;
+
+			opregionvbt->vbt_ex = memremap(addr, rvds, MEMREMAP_WB);
+			if (!opregionvbt->vbt_ex) {
+				memunmap(opregionvbt->opregion);
+				kfree(opregionvbt);
+				return -ENOMEM;
 			}
-
-			/* region size for opregion v2.0+: opregion and VBT size. */
-			size += rvds;
 		}
 	}
 
-	if (size != OPREGION_SIZE) {
-		memunmap(base);
-		base = memremap(addr, size, MEMREMAP_WB);
-		if (!base)
-			return -ENOMEM;
-	}
-
 	ret = vfio_pci_register_dev_region(vdev,
 		PCI_VENDOR_ID_INTEL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
-		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION,
-		&vfio_pci_igd_regops, size, VFIO_REGION_INFO_FLAG_READ, base);
+		VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION, &vfio_pci_igd_regops,
+		size, VFIO_REGION_INFO_FLAG_READ, opregionvbt);
 	if (ret) {
-		memunmap(base);
+		if (opregionvbt->vbt_ex)
+			memunmap(opregionvbt->vbt_ex);
+
+		memunmap(opregionvbt->opregion);
+		kfree(opregionvbt);
 		return ret;
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v8] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-10-12 12:48                                                 ` [PATCH v8] " Colin Xu
@ 2021-10-12 17:12                                                   ` Alex Williamson
  2021-10-12 23:10                                                     ` Colin Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2021-10-12 17:12 UTC (permalink / raw)
  To: Colin Xu; +Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao

On Tue, 12 Oct 2021 20:48:55 +0800
Colin Xu <colin.xu@gmail.com> wrote:

> From: Colin Xu <colin.xu@intel.com>
> 
> Due to historical reason, some legacy shipped system doesn't follow
> OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> VBT is not contiguous after OpRegion in physical address, but any
> location pointed by RVDA via absolute address. Also although current
> OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
> RVDA is the relative address to OpRegion head, the extended VBT location
> may change to non-contiguous to OpRegion. In both cases, it's impossible
> to map a contiguous range to hold both OpRegion and the extended VBT and
> expose via one vfio region.
> 
> The only difference between OpRegion 2.0 and 2.1 is where extended
> VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> baes, and there is no other difference between OpRegion 2.0 and 2.1.
> To support the non-contiguous region case as described, the updated read
> op will patch OpRegion version and RVDA on-the-fly accordingly. So that
> from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
> after OpRegion is exposed, regardless the underneath host OpRegion is
> 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
> 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
> where the extended VBT isn't contiguous after OpRegion.
> 
> V2:
> Validate RVDA for 2.1+ before increasing total size. (Alex)
> 
> V3: (Alex)
> Split read and write ops.
> On-the-fly modify OpRegion version and RVDA.
> Fix sparse error on assign value to casted pointer.
> 
> V4: (Alex)
> No need support write op.
> Direct copy to user buffer with several shift instead of shadow.
> Copy helper to copy to user buffer and shift offset.
> 
> V5: (Alex)
> Simplify copy help to only cover common shift case.
> Don't cache patched version and rvda. Patch on copy if necessary.
> 
> V6:
> Fix comment typo and max line width.
> 
> V7:
> Keep bytes to copy/remain as size_t.
> Proper shift byte address on copy source.
> Rebase to linux-next.
> 
> V8:
> Replace min() with min_t() to avoid type cast.
> Wrap long lines.
> 
> Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> Cc: Hang Yuan <hang.yuan@linux.intel.com>
> Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> Cc: Fred Gao <fred.gao@intel.com>
> Signed-off-by: Colin Xu <colin.xu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_igd.c | 234 ++++++++++++++++++++++++--------
>  1 file changed, 175 insertions(+), 59 deletions(-)

Looks good, applied to vfio next branch for v5.16.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8] vfio/pci: Add OpRegion 2.0+ Extended VBT support.
  2021-10-12 17:12                                                   ` Alex Williamson
@ 2021-10-12 23:10                                                     ` Colin Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Colin Xu @ 2021-10-12 23:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, colin.xu, zhenyuw, hang.yuan, swee.yee.fonn, fred.gao, Colin Xu

On Wed, Oct 13, 2021 at 1:12 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Tue, 12 Oct 2021 20:48:55 +0800
> Colin Xu <colin.xu@gmail.com> wrote:
>
> > From: Colin Xu <colin.xu@intel.com>
> >
> > Due to historical reason, some legacy shipped system doesn't follow
> > OpRegion 2.1 spec but still stick to OpRegion 2.0, in which the extended
> > VBT is not contiguous after OpRegion in physical address, but any
> > location pointed by RVDA via absolute address. Also although current
> > OpRegion 2.1+ systems appears that the extended VBT follows OpRegion,
> > RVDA is the relative address to OpRegion head, the extended VBT location
> > may change to non-contiguous to OpRegion. In both cases, it's impossible
> > to map a contiguous range to hold both OpRegion and the extended VBT and
> > expose via one vfio region.
> >
> > The only difference between OpRegion 2.0 and 2.1 is where extended
> > VBT is stored: For 2.0, RVDA is the absolute address of extended VBT
> > while for 2.1, RVDA is the relative address of extended VBT to OpRegion
> > baes, and there is no other difference between OpRegion 2.0 and 2.1.
> > To support the non-contiguous region case as described, the updated read
> > op will patch OpRegion version and RVDA on-the-fly accordingly. So that
> > from vfio igd OpRegion view, only 2.1+ with contiguous extended VBT
> > after OpRegion is exposed, regardless the underneath host OpRegion is
> > 2.0 or 2.1+. The mechanism makes it possible to support legacy OpRegion
> > 2.0 extended VBT systems with on the market, and support OpRegion 2.1+
> > where the extended VBT isn't contiguous after OpRegion.
> >
> > V2:
> > Validate RVDA for 2.1+ before increasing total size. (Alex)
> >
> > V3: (Alex)
> > Split read and write ops.
> > On-the-fly modify OpRegion version and RVDA.
> > Fix sparse error on assign value to casted pointer.
> >
> > V4: (Alex)
> > No need support write op.
> > Direct copy to user buffer with several shift instead of shadow.
> > Copy helper to copy to user buffer and shift offset.
> >
> > V5: (Alex)
> > Simplify copy help to only cover common shift case.
> > Don't cache patched version and rvda. Patch on copy if necessary.
> >
> > V6:
> > Fix comment typo and max line width.
> >
> > V7:
> > Keep bytes to copy/remain as size_t.
> > Proper shift byte address on copy source.
> > Rebase to linux-next.
> >
> > V8:
> > Replace min() with min_t() to avoid type cast.
> > Wrap long lines.
> >
> > Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
> > Cc: Hang Yuan <hang.yuan@linux.intel.com>
> > Cc: Swee Yee Fonn <swee.yee.fonn@intel.com>
> > Cc: Fred Gao <fred.gao@intel.com>
> > Signed-off-by: Colin Xu <colin.xu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_igd.c | 234 ++++++++++++++++++++++++--------
> >  1 file changed, 175 insertions(+), 59 deletions(-)
>
> Looks good, applied to vfio next branch for v5.16.  Thanks,
>
> Alex
>
Thanks Alex for the help plugging the hole due to the special OpRegion 2.0 case.

Colin

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-10-12 23:10 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-13  2:13 [PATCH] vfio/pci: Add OpRegion 2.0 Extended VBT support Colin Xu
2021-08-16 22:39 ` Alex Williamson
2021-08-17  0:40   ` Colin Xu
2021-08-27  1:36     ` Colin Xu
2021-08-27  1:48       ` Alex Williamson
2021-08-27  2:24         ` Colin Xu
2021-08-27  2:37           ` [PATCH v2] " Colin Xu
2021-08-30 20:27             ` Alex Williamson
2021-09-02  7:11               ` Colin Xu
2021-09-02 21:46                 ` Alex Williamson
2021-09-03  2:23                   ` Colin Xu
2021-09-03 22:36                     ` Alex Williamson
2021-09-07  6:14                       ` Colin Xu
2021-09-09  5:09                         ` [PATCH v3] vfio/pci: Add OpRegion 2.0+ " Colin Xu
2021-09-09 22:00                           ` Alex Williamson
2021-09-13 12:39                             ` Colin Xu
2021-09-13 12:41                               ` [PATCH v4] " Colin Xu
2021-09-13 15:14                                 ` Alex Williamson
2021-09-14  4:18                                   ` Colin Xu
2021-09-14  4:29                                     ` [PATCH v5] " Colin Xu
2021-09-14  9:11                                       ` [PATCH v6] " Colin Xu
2021-09-24 21:24                                         ` Alex Williamson
2021-10-03 15:46                                           ` Colin Xu
2021-10-03 15:53                                             ` [PATCH v7] " Colin Xu
2021-10-11 21:44                                               ` Alex Williamson
2021-10-12 12:48                                                 ` [PATCH v8] " Colin Xu
2021-10-12 17:12                                                   ` Alex Williamson
2021-10-12 23:10                                                     ` Colin Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.