* [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0
@ 2023-03-12  7:54 Huang Rui
  2023-03-12  7:54 ` [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present Huang Rui
                   ` (7 more replies)
  0 siblings, 8 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

Hi all,

In the graphics world, 3D applications and games run on top of open
graphics libraries such as OpenGL and Vulkan, and Mesa is the Linux
implementation of OpenGL and Vulkan for multiple hardware platforms.
These libraries rely on the GPU for hardware acceleration. In the
virtualization world, virtio-gpu and passthrough-gpu are two of the
available GPU virtualization technologies.

Currently, Xen only supports OpenGL (virgl:
https://docs.mesa3d.org/drivers/virgl.html) for virtio-gpu, and GPU
passthrough based on a PV dom0, on the x86 platform. With this series,
we would like to introduce Vulkan (venus:
https://docs.mesa3d.org/drivers/venus.html) and OpenGL-on-Vulkan (zink:
https://docs.mesa3d.org/drivers/zink.html) support for VirtIO GPU on
Xen. These features are already supported on KVM, but so far they are
not supported on Xen. We also introduce PCIe passthrough (GPU) support
based on a PVH dom0 for the AMD x86 platform.

This work requires changes across multiple repositories: kernel, Xen,
QEMU, Mesa, and virglrenderer. Please check the branches below:

Kernel: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=upstream-fox-xen
Xen: https://gitlab.com/huangrui123/xen/-/commits/upstream-for-xen
QEMU: https://gitlab.com/huangrui123/qemu/-/commits/upstream-for-xen
Mesa: https://gitlab.freedesktop.org/rui/mesa/-/commits/upstream-for-xen
Virglrenderer: https://gitlab.freedesktop.org/rui/virglrenderer/-/commits/upstream-for-xen

On the Xen side, we mainly add PCIe passthrough support on PVH dom0.
QEMU is used to pass the GPU device through to a guest HVM domU, and
the main work is to route the device interrupt using the GSI, vector,
and pirq.
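
As a rough sketch of that flow (xc_physdev_gsi_from_irq() is added in
patch 5 of this series, xc_physdev_map_pirq() is the existing libxc
call; error handling is simplified):

    #include <xenctrl.h>

    /* Map the interrupt of a passed-through device for a guest:
     * translate the device's Linux IRQ to its GSI, then map that GSI
     * to a pirq the guest can be bound to. */
    static int map_device_irq(xc_interface *xch, uint32_t domid, int irq)
    {
        int gsi = xc_physdev_gsi_from_irq(xch, irq);  /* IRQ -> GSI */
        int pirq = -1;

        if (xc_physdev_map_pirq(xch, domid, gsi, &pirq) < 0)
            return -1;                                /* GSI -> pirq */
        return pirq;
    }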

Below are screenshots of these features, please take a look.

Venus:
https://drive.google.com/file/d/1_lPq6DMwHu1JQv7LUUVRx31dBj0HJYcL/view?usp=share_link

Zink:
https://drive.google.com/file/d/1FxLmKu6X7uJOxx1ZzwOm1yA6IL5WMGzd/view?usp=share_link

Passthrough GPU:
https://drive.google.com/file/d/17onr5gvDK8KM_LniHTSQEI2hGJZlI09L/view?usp=share_link

We are working on documentation for the Xen wiki describing how to
verify these features, and will add it in a future version of this
series.

Thanks,
Ray

Chen Jiqian (5):
  vpci: accept BAR writes if dom0 is PVH
  x86/pvh: shouldn't check pirq flag when map pirq in PVH
  x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  tools/libs/call: add linux os call to get gsi from irq
  tools/libs/light: pci: translate irq to gsi

Roger Pau Monne (1):
  x86/pvh: report ACPI VFCT table to dom0 if present

 tools/include/xen-sys/Linux/privcmd.h |  7 +++++++
 tools/include/xencall.h               |  2 ++
 tools/include/xenctrl.h               |  2 ++
 tools/libs/call/core.c                |  5 +++++
 tools/libs/call/libxencall.map        |  2 ++
 tools/libs/call/linux.c               | 14 ++++++++++++++
 tools/libs/call/private.h             |  9 +++++++++
 tools/libs/ctrl/xc_physdev.c          |  4 ++++
 tools/libs/light/libxl_pci.c          |  1 +
 xen/arch/x86/hvm/dom0_build.c         |  1 +
 xen/arch/x86/hvm/hypercall.c          |  3 +--
 xen/drivers/vpci/header.c             |  2 +-
 xen/include/acpi/actbl3.h             |  1 +
 13 files changed, 50 insertions(+), 3 deletions(-)

-- 
2.25.1




* [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-13 11:55   ` Andrew Cooper
  2023-03-12  7:54 ` [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH Huang Rui
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui, Henry Wang

From: Roger Pau Monne <roger.pau@citrix.com>

The VFCT ACPI table is used by AMD GPUs to expose the vbios ROM image
from the firmware instead of doing it on the PCI ROM on the physical
device.

As such, this needs to be available for PVH dom0 to access, or else
the GPU won't work.
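
(For background, Linux's amdgpu driver picks the VBIOS up from this
table via the standard ACPI table API.  A minimal sketch of that
lookup, modelled on amdgpu_acpi_vfct_bios(), with the image parsing
omitted:)

    struct acpi_table_header *hdr;

    /* Ask ACPI for the VFCT table; if it is absent, the driver falls
     * back to reading the ROM BAR of the PCI device itself. */
    if (!ACPI_SUCCESS(acpi_get_table("VFCT", 1, &hdr)))
        return false;
    /* The VBIOS image(s) are embedded in the table body after hdr. */

This is why the table has to be visible to a PVH dom0 in the first
place.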

Reported-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-and-Tested-by: Huang Rui <ray.huang@amd.com>
Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 xen/arch/x86/hvm/dom0_build.c | 1 +
 xen/include/acpi/actbl3.h     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 3ac6b7b423..d44de7f2b2 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -892,6 +892,7 @@ static bool __init pvh_acpi_table_allowed(const char *sig,
         ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_FACS, ACPI_SIG_PSDT,
         ACPI_SIG_SSDT, ACPI_SIG_SBST, ACPI_SIG_MCFG, ACPI_SIG_SLIC,
         ACPI_SIG_MSDM, ACPI_SIG_WDAT, ACPI_SIG_FPDT, ACPI_SIG_S3PT,
+        ACPI_SIG_VFCT,
     };
     unsigned int i;
 
diff --git a/xen/include/acpi/actbl3.h b/xen/include/acpi/actbl3.h
index 0a6778421f..6858d3e60f 100644
--- a/xen/include/acpi/actbl3.h
+++ b/xen/include/acpi/actbl3.h
@@ -79,6 +79,7 @@
 #define ACPI_SIG_MATR           "MATR"	/* Memory Address Translation Table */
 #define ACPI_SIG_MSDM           "MSDM"	/* Microsoft Data Management Table */
 #define ACPI_SIG_WPBT           "WPBT"	/* Windows Platform Binary Table */
+#define ACPI_SIG_VFCT           "VFCT"	/* AMD Video BIOS */
 
 /*
  * All tables must be byte-packed to match the ACPI specification, since
-- 
2.25.1




* [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
  2023-03-12  7:54 ` [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-13  7:23   ` Christian König
  2023-03-14 16:02   ` Jan Beulich
  2023-03-12  7:54 ` [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH Huang Rui
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

From: Chen Jiqian <Jiqian.Chen@amd.com>

When dom0 is PVH and we want to pass a GPU through to a guest,
we should allow BAR writes even though the BAR is mapped. If
not, the BAR values are not initialized when the guest first
starts.
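
For context, the writes in question come from the guest (or its
firmware) sizing and assigning the BARs while memory decoding is still
disabled in the command register.  A sketch of that classic sizing
sequence (cfg_read32()/cfg_write32() are hypothetical config-space
accessors used for illustration, not a real Xen or Linux API):

    #include <stdint.h>

    /* Hypothetical config-space accessors, for illustration only: */
    uint32_t cfg_read32(uint32_t sbdf, unsigned int off);
    void cfg_write32(uint32_t sbdf, unsigned int off, uint32_t val);

    /* Size a 32-bit memory BAR: exactly the kind of write that must
     * be let through while PCI_COMMAND_MEMORY is clear. */
    static uint32_t bar_size(uint32_t sbdf, unsigned int bar_off)
    {
        uint32_t old = cfg_read32(sbdf, bar_off);
        uint32_t mask;

        cfg_write32(sbdf, bar_off, 0xffffffff);   /* probe */
        mask = cfg_read32(sbdf, bar_off);
        cfg_write32(sbdf, bar_off, old);          /* restore */

        return mask ? ~(mask & 0xfffffff0u) + 1 : 0;
    }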

Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 xen/drivers/vpci/header.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ec2e978a4e..918d11fbce 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -392,7 +392,7 @@ static void cf_check bar_write(
      * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
      * writes as long as the BAR is not mapped into the p2m.
      */
-    if ( bar->enabled )
+    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
     {
         /* If the value written is the current one avoid printing a warning. */
         if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
-- 
2.25.1




* [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
  2023-03-12  7:54 ` [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present Huang Rui
  2023-03-12  7:54 ` [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-14 16:27   ` Jan Beulich
  2023-03-15 15:57   ` Roger Pau Monné
  2023-03-12  7:54 ` [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call Huang Rui
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

From: Chen Jiqian <Jiqian.Chen@amd.com>

PVH is also an HVM-type domain, but PVH doesn't have the
X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
PHYSDEVOP_map_pirq, it fails the has_pirq() check.
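
For reference, the failing call looks roughly like this from the dom0
kernel's side (struct physdev_map_pirq as declared in Xen's public
physdev.h; 'domid' and 'gsi' are placeholders, error handling omitted):

    #include <xen/interface/physdev.h>   /* Linux guest headers */

    int rc;
    struct physdev_map_pirq map = {
        .domid = domid,                 /* target domain */
        .type  = MAP_PIRQ_TYPE_GSI,
        .index = gsi,                   /* GSI of the device */
        .pirq  = -1,                    /* let Xen pick; returned here */
    };

    /* On a PVH dom0 this fails with -ENOSYS before this patch, because
     * hvm_physdev_op() rejects the sub-op when !has_pirq(currd). */
    rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);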

Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 xen/arch/x86/hvm/hypercall.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 405d0a95af..16a2f5c0b3 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
-        if ( !has_pirq(currd) )
-            return -ENOSYS;
         break;
 
     case PHYSDEVOP_pci_mmcfg_reserved:
-- 
2.25.1




* [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
                   ` (2 preceding siblings ...)
  2023-03-12  7:54 ` [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-14 16:30   ` Jan Beulich
  2023-03-12  7:54 ` [RFC XEN PATCH 5/6] tools/libs/call: add linux os call to get gsi from irq Huang Rui
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

From: Chen Jiqian <Jiqian.Chen@amd.com>

Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 xen/arch/x86/hvm/hypercall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 16a2f5c0b3..fce786618c 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -89,6 +89,7 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
+    case PHYSDEVOP_setup_gsi:
         break;
 
     case PHYSDEVOP_pci_mmcfg_reserved:
-- 
2.25.1
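
For reference, the sub-op allowed above carries the following payload
(as declared in Xen's public physdev.h), which dom0 uses to configure
an IO-APIC pin's trigger mode and polarity; the usage below is a
sketch, with illustrative level/active-low settings:

    struct physdev_setup_gsi {
        int gsi;               /* IN: the GSI to configure */
        uint8_t triggering;    /* IN: 0 = edge, 1 = level */
        uint8_t polarity;      /* IN: 0 = active high, 1 = active low */
    };

    struct physdev_setup_gsi setup = {
        .gsi = gsi, .triggering = 1, .polarity = 1, /* level, low */
    };
    rc = HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup);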




* [RFC XEN PATCH 5/6] tools/libs/call: add linux os call to get gsi from irq
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
                   ` (3 preceding siblings ...)
  2023-03-12  7:54 ` [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-14 16:36   ` Jan Beulich
  2023-03-12  7:54 ` [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi Huang Rui
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

From: Chen Jiqian <Jiqian.Chen@amd.com>

When passing a GPU through to a guest, userspace can only get
the IRQ instead of the GSI. But it should pass the GSI to the
guest so that the guest can receive the interrupt. So, provide
a function to get the GSI.
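
For illustration, this is how the toolstack ends up using the new call
(cf. patch 6/6): it reads the device's Linux IRQ from sysfs and
translates it before handing it to Xen.  'xch' is an open xc_interface
handle, the SBDF is an example, and error handling is trimmed:

    #include <stdio.h>
    #include <xenctrl.h>

    FILE *f = fopen("/sys/bus/pci/devices/0000:03:00.0/irq", "r");
    unsigned int irq;
    int gsi;

    if (f && fscanf(f, "%u", &irq) == 1) {
        /* New call: translate the Linux IRQ number into the GSI the
         * guest (and Xen's pirq machinery) actually needs. */
        gsi = xc_physdev_gsi_from_irq(xch, irq);
    }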

Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 tools/include/xen-sys/Linux/privcmd.h |  7 +++++++
 tools/include/xencall.h               |  2 ++
 tools/include/xenctrl.h               |  2 ++
 tools/libs/call/core.c                |  5 +++++
 tools/libs/call/libxencall.map        |  2 ++
 tools/libs/call/linux.c               | 14 ++++++++++++++
 tools/libs/call/private.h             |  9 +++++++++
 tools/libs/ctrl/xc_physdev.c          |  4 ++++
 8 files changed, 45 insertions(+)

diff --git a/tools/include/xen-sys/Linux/privcmd.h b/tools/include/xen-sys/Linux/privcmd.h
index bc60e8fd55..d72e785b5d 100644
--- a/tools/include/xen-sys/Linux/privcmd.h
+++ b/tools/include/xen-sys/Linux/privcmd.h
@@ -95,6 +95,11 @@ typedef struct privcmd_mmap_resource {
 	__u64 addr;
 } privcmd_mmap_resource_t;
 
+typedef struct privcmd_gsi_from_irq {
+	__u32 irq;
+	__u32 gsi;
+} privcmd_gsi_from_irq_t;
+
 /*
  * @cmd: IOCTL_PRIVCMD_HYPERCALL
  * @arg: &privcmd_hypercall_t
@@ -114,6 +119,8 @@ typedef struct privcmd_mmap_resource {
 	_IOC(_IOC_NONE, 'P', 6, sizeof(domid_t))
 #define IOCTL_PRIVCMD_MMAP_RESOURCE				\
 	_IOC(_IOC_NONE, 'P', 7, sizeof(privcmd_mmap_resource_t))
+#define IOCTL_PRIVCMD_GSI_FROM_IRQ				\
+	_IOC(_IOC_NONE, 'P', 8, sizeof(privcmd_gsi_from_irq_t))
 #define IOCTL_PRIVCMD_UNIMPLEMENTED				\
 	_IOC(_IOC_NONE, 'P', 0xFF, 0)
 
diff --git a/tools/include/xencall.h b/tools/include/xencall.h
index fc95ed0fe5..962cb45e1f 100644
--- a/tools/include/xencall.h
+++ b/tools/include/xencall.h
@@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
              uint64_t arg1, uint64_t arg2, uint64_t arg3,
              uint64_t arg4, uint64_t arg5);
 
+int xen_oscall_gsi_from_irq(xencall_handle *xcall, int irq);
+
 /* Variant(s) of the above, as needed, returning "long" instead of "int". */
 long xencall2L(xencall_handle *xcall, unsigned int op,
                uint64_t arg1, uint64_t arg2);
diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 23037874d3..3918be9e53 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1652,6 +1652,8 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
                           uint32_t domid,
                           int pirq);
 
+int xc_physdev_gsi_from_irq(xc_interface *xch, int irq);
+
 /*
  *  LOGGING AND ERROR REPORTING
  */
diff --git a/tools/libs/call/core.c b/tools/libs/call/core.c
index 02c4f8e1ae..6f79f3babd 100644
--- a/tools/libs/call/core.c
+++ b/tools/libs/call/core.c
@@ -173,6 +173,11 @@ int xencall5(xencall_handle *xcall, unsigned int op,
     return osdep_hypercall(xcall, &call);
 }
 
+int xen_oscall_gsi_from_irq(xencall_handle *xcall, int irq)
+{
+    return osdep_oscall(xcall, irq);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libs/call/libxencall.map b/tools/libs/call/libxencall.map
index d18a3174e9..6cde8eda05 100644
--- a/tools/libs/call/libxencall.map
+++ b/tools/libs/call/libxencall.map
@@ -10,6 +10,8 @@ VERS_1.0 {
 		xencall4;
 		xencall5;
 
+		xen_oscall_gsi_from_irq;
+
 		xencall_alloc_buffer;
 		xencall_free_buffer;
 		xencall_alloc_buffer_pages;
diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
index 6d588e6bea..5267bceabf 100644
--- a/tools/libs/call/linux.c
+++ b/tools/libs/call/linux.c
@@ -85,6 +85,20 @@ long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall)
     return ioctl(xcall->fd, IOCTL_PRIVCMD_HYPERCALL, hypercall);
 }
 
+long osdep_oscall(xencall_handle *xcall, int irq)
+{
+    privcmd_gsi_from_irq_t gsi_irq = {
+        .irq = irq,
+        .gsi = -1,
+    };
+
+    if (ioctl(xcall->fd, IOCTL_PRIVCMD_GSI_FROM_IRQ, &gsi_irq)) {
+        return gsi_irq.irq;
+    }
+
+    return gsi_irq.gsi;
+}
+
 static void *alloc_pages_bufdev(xencall_handle *xcall, size_t npages)
 {
     void *p;
diff --git a/tools/libs/call/private.h b/tools/libs/call/private.h
index 9c3aa432ef..01a1f5076a 100644
--- a/tools/libs/call/private.h
+++ b/tools/libs/call/private.h
@@ -57,6 +57,15 @@ int osdep_xencall_close(xencall_handle *xcall);
 
 long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall);
 
+#if defined(__linux__)
+long osdep_oscall(xencall_handle *xcall, int irq);
+#else
+static inline long osdep_oscall(xencall_handle *xcall, int irq)
+{
+    return irq;
+}
+#endif
+
 void *osdep_alloc_pages(xencall_handle *xcall, size_t nr_pages);
 void osdep_free_pages(xencall_handle *xcall, void *p, size_t nr_pages);
 
diff --git a/tools/libs/ctrl/xc_physdev.c b/tools/libs/ctrl/xc_physdev.c
index 460a8e779c..4d3b138ebd 100644
--- a/tools/libs/ctrl/xc_physdev.c
+++ b/tools/libs/ctrl/xc_physdev.c
@@ -111,3 +111,7 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
     return rc;
 }
 
+int xc_physdev_gsi_from_irq(xc_interface *xch, int irq)
+{
+    return xen_oscall_gsi_from_irq(xch->xcall, irq);
+}
-- 
2.25.1




* [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
                   ` (4 preceding siblings ...)
  2023-03-12  7:54 ` [RFC XEN PATCH 5/6] tools/libs/call: add linux os call to get gsi from irq Huang Rui
@ 2023-03-12  7:54 ` Huang Rui
  2023-03-14 16:39   ` Jan Beulich
  2023-03-15 16:35   ` Roger Pau Monné
  2023-03-13  7:24 ` [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Christian König
  2023-03-20 16:22 ` Huang Rui
  7 siblings, 2 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-12  7:54 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Huang Rui

From: Chen Jiqian <Jiqian.Chen@amd.com>

Use new xc_physdev_gsi_from_irq to get the GSI number

Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 tools/libs/light/libxl_pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index f4c4f17545..47cf2799bf 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
         goto out_no_irq;
     }
     if ((fscanf(f, "%u", &irq) == 1) && irq) {
+        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
         r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
         if (r < 0) {
             LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",
-- 
2.25.1




* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-12  7:54 ` [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH Huang Rui
@ 2023-03-13  7:23   ` Christian König
  2023-03-13  7:26     ` Christian König
  2023-03-14 16:02   ` Jan Beulich
  1 sibling, 1 reply; 75+ messages in thread
From: Christian König @ 2023-03-13  7:23 UTC (permalink / raw)
  To: Huang Rui, Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian



On 12.03.23 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
>
> When dom0 is PVH and we want to pass a GPU through to a guest,
> we should allow BAR writes even though the BAR is mapped. If
> not, the BAR values are not initialized when the guest first
> starts.
>
> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> ---
>   xen/drivers/vpci/header.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ec2e978a4e..918d11fbce 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>        * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
>        * writes as long as the BAR is not mapped into the p2m.
>        */
> -    if ( bar->enabled )
> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )

Checkpatch.pl reports here:

ERROR: space prohibited after that open parenthesis '('
#115: FILE: xen/drivers/vpci/header.c:395:
+    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )

Christian.


>       {
>           /* If the value written is the current one avoid printing a warning. */
>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )




* Re: [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
                   ` (5 preceding siblings ...)
  2023-03-12  7:54 ` [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi Huang Rui
@ 2023-03-13  7:24 ` Christian König
  2023-03-21 10:26   ` Huang Rui
  2023-03-20 16:22 ` Huang Rui
  7 siblings, 1 reply; 75+ messages in thread
From: Christian König @ 2023-03-13  7:24 UTC (permalink / raw)
  To: Huang Rui, Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

Hi Ray,

One nit comment on the style; apart from that it looks technically
correct.

But I'm *really* not an expert on all that stuff.

Regards,
Christian.

On 12.03.23 08:54, Huang Rui wrote:
> Hi all,
>
> In the graphics world, 3D applications and games run on top of open
> graphics libraries such as OpenGL and Vulkan, and Mesa is the Linux
> implementation of OpenGL and Vulkan for multiple hardware platforms.
> These libraries rely on the GPU for hardware acceleration. In the
> virtualization world, virtio-gpu and passthrough-gpu are two of the
> available GPU virtualization technologies.
>
> Currently, Xen only supports OpenGL (virgl:
> https://docs.mesa3d.org/drivers/virgl.html) for virtio-gpu, and GPU
> passthrough based on a PV dom0, on the x86 platform. With this series,
> we would like to introduce Vulkan (venus:
> https://docs.mesa3d.org/drivers/venus.html) and OpenGL-on-Vulkan (zink:
> https://docs.mesa3d.org/drivers/zink.html) support for VirtIO GPU on
> Xen. These features are already supported on KVM, but so far they are
> not supported on Xen. We also introduce PCIe passthrough (GPU) support
> based on a PVH dom0 for the AMD x86 platform.
>
> This work requires changes across multiple repositories: kernel, Xen,
> QEMU, Mesa, and virglrenderer. Please check the branches below:
>
> Kernel: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=upstream-fox-xen
> Xen: https://gitlab.com/huangrui123/xen/-/commits/upstream-for-xen
> QEMU: https://gitlab.com/huangrui123/qemu/-/commits/upstream-for-xen
> Mesa: https://gitlab.freedesktop.org/rui/mesa/-/commits/upstream-for-xen
> Virglrenderer: https://gitlab.freedesktop.org/rui/virglrenderer/-/commits/upstream-for-xen
>
> On the Xen side, we mainly add PCIe passthrough support on PVH dom0.
> QEMU is used to pass the GPU device through to a guest HVM domU, and
> the main work is to route the device interrupt using the GSI, vector,
> and pirq.
>
> Below are screenshots of these features, please take a look.
>
> Venus:
> https://drive.google.com/file/d/1_lPq6DMwHu1JQv7LUUVRx31dBj0HJYcL/view?usp=share_link
>
> Zink:
> https://drive.google.com/file/d/1FxLmKu6X7uJOxx1ZzwOm1yA6IL5WMGzd/view?usp=share_link
>
> Passthrough GPU:
> https://drive.google.com/file/d/17onr5gvDK8KM_LniHTSQEI2hGJZlI09L/view?usp=share_link
>
> We are working on documentation for the Xen wiki describing how to
> verify these features, and will add it in a future version of this
> series.
>
> Thanks,
> Ray
>
> Chen Jiqian (5):
>    vpci: accept BAR writes if dom0 is PVH
>    x86/pvh: shouldn't check pirq flag when map pirq in PVH
>    x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
>    tools/libs/call: add linux os call to get gsi from irq
>    tools/libs/light: pci: translate irq to gsi
>
> Roger Pau Monne (1):
>    x86/pvh: report ACPI VFCT table to dom0 if present
>
>   tools/include/xen-sys/Linux/privcmd.h |  7 +++++++
>   tools/include/xencall.h               |  2 ++
>   tools/include/xenctrl.h               |  2 ++
>   tools/libs/call/core.c                |  5 +++++
>   tools/libs/call/libxencall.map        |  2 ++
>   tools/libs/call/linux.c               | 14 ++++++++++++++
>   tools/libs/call/private.h             |  9 +++++++++
>   tools/libs/ctrl/xc_physdev.c          |  4 ++++
>   tools/libs/light/libxl_pci.c          |  1 +
>   xen/arch/x86/hvm/dom0_build.c         |  1 +
>   xen/arch/x86/hvm/hypercall.c          |  3 +--
>   xen/drivers/vpci/header.c             |  2 +-
>   xen/include/acpi/actbl3.h             |  1 +
>   13 files changed, 50 insertions(+), 3 deletions(-)
>




* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-13  7:23   ` Christian König
@ 2023-03-13  7:26     ` Christian König
  2023-03-13  8:46       ` Jan Beulich
                         ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Christian König @ 2023-03-13  7:26 UTC (permalink / raw)
  To: Huang Rui, Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On 13.03.23 08:23, Christian König wrote:
>
>
> On 12.03.23 08:54, Huang Rui wrote:
>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>>
>> When dom0 is PVH and we want to pass a GPU through to a guest,
>> we should allow BAR writes even though the BAR is mapped. If
>> not, the BAR values are not initialized when the guest first
>> starts.
>>
>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>> ---
>>   xen/drivers/vpci/header.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..918d11fbce 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>>        * Xen only cares whether the BAR is mapped into the p2m, so 
>> allow BAR
>>        * writes as long as the BAR is not mapped into the p2m.
>>        */
>> -    if ( bar->enabled )
>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & 
>> PCI_COMMAND_MEMORY )
>
> Checkpatch.pl reports here:
>
> ERROR: space prohibited after that open parenthesis '('
> #115: FILE: xen/drivers/vpci/header.c:395:
> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )

But I should probably mention that I'm not 100% sure if this code base 
uses kernel coding style!

Christian.

>
> Christian.
>
>
>>       {
>>           /* If the value written is the current one avoid printing a 
>> warning. */
>>           if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>




* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-13  7:26     ` Christian König
@ 2023-03-13  8:46       ` Jan Beulich
  2023-03-13  8:55       ` Huang Rui
  2023-03-14 23:42       ` Stefano Stabellini
  2 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-13  8:46 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Deucher, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian, Huang Rui,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 13.03.2023 08:26, Christian König wrote:
> On 13.03.23 08:23, Christian König wrote:
>> On 12.03.23 08:54, Huang Rui wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>>>        * Xen only cares whether the BAR is mapped into the p2m, so 
>>> allow BAR
>>>        * writes as long as the BAR is not mapped into the p2m.
>>>        */
>>> -    if ( bar->enabled )
>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & 
>>> PCI_COMMAND_MEMORY )
>>
>> Checkpatch.pl reports here:
>>
>> ERROR: space prohibited after that open parenthesis '('
>> #115: FILE: xen/drivers/vpci/header.c:395:
>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> 
> But I should probably mention that I'm not 100% sure if this code base 
> uses kernel coding style!

It doesn't - see ./CODING_STYLE.

Jan



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-13  7:26     ` Christian König
  2023-03-13  8:46       ` Jan Beulich
@ 2023-03-13  8:55       ` Huang Rui
  2023-03-14 23:42       ` Stefano Stabellini
  2 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-13  8:55 UTC (permalink / raw)
  To: Koenig, Christian
  Cc: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Deucher, Alexander, Hildebrand, Stewart, Xenia Ragiadakou, Huang,
	Honglei1, Zhang, Julia, Chen, Jiqian

On Mon, Mar 13, 2023 at 03:26:09PM +0800, Koenig, Christian wrote:
> On 13.03.23 08:23, Christian König wrote:
> >
> >
> > On 12.03.23 08:54, Huang Rui wrote:
> >> From: Chen Jiqian <Jiqian.Chen@amd.com>
> >>
> >> When dom0 is PVH and we want to pass a GPU through to a guest,
> >> we should allow BAR writes even though the BAR is mapped. If
> >> not, the BAR values are not initialized when the guest first
> >> starts.
> >>
> >> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> >> Signed-off-by: Huang Rui <ray.huang@amd.com>
> >> ---
> >>   xen/drivers/vpci/header.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> >> index ec2e978a4e..918d11fbce 100644
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -392,7 +392,7 @@ static void cf_check bar_write(
> >>        * Xen only cares whether the BAR is mapped into the p2m, so 
> >> allow BAR
> >>        * writes as long as the BAR is not mapped into the p2m.
> >>        */
> >> -    if ( bar->enabled )
> >> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & 
> >> PCI_COMMAND_MEMORY )
> >
> > Checkpatch.pl reports here:
> >
> > ERROR: space prohibited after that open parenthesis '('
> > #115: FILE: xen/drivers/vpci/header.c:395:
> > +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> 
> But I should probably mention that I'm not 100% sure if this code base 
> uses kernel coding style!
> 

I noticed that actually Xen's coding style is different from the Linux
kernel's.

Thanks,
Ray



* Re: [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present
  2023-03-12  7:54 ` [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present Huang Rui
@ 2023-03-13 11:55   ` Andrew Cooper
  2023-03-13 12:21     ` Roger Pau Monné
  0 siblings, 1 reply; 75+ messages in thread
From: Andrew Cooper @ 2023-03-13 11:55 UTC (permalink / raw)
  To: Huang Rui, Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Henry Wang

On 12/03/2023 7:54 am, Huang Rui wrote:
> From: Roger Pau Monne <roger.pau@citrix.com>
>
> The VFCT ACPI table is used by AMD GPUs to expose the vbios ROM image
> from the firmware instead of doing it on the PCI ROM on the physical
> device.
>
> As such, this needs to be available for PVH dom0 to access, or else
> the GPU won't work.
>
> Reported-by: Huang Rui <ray.huang@amd.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-and-Tested-by: Huang Rui <ray.huang@amd.com>
> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>

Huh...  Despite the release ack, this didn't get committed for 4.17.

Sorry for the oversight.  I've queued this now.

~Andrew



* Re: [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present
  2023-03-13 11:55   ` Andrew Cooper
@ 2023-03-13 12:21     ` Roger Pau Monné
  2023-03-13 12:27       ` Andrew Cooper
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-13 12:21 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Huang Rui, Jan Beulich, Stefano Stabellini, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian, Henry Wang

On Mon, Mar 13, 2023 at 11:55:56AM +0000, Andrew Cooper wrote:
> On 12/03/2023 7:54 am, Huang Rui wrote:
> > From: Roger Pau Monne <roger.pau@citrix.com>
> >
> > The VFCT ACPI table is used by AMD GPUs to expose the vbios ROM image
> > from the firmware instead of doing it on the PCI ROM on the physical
> > device.
> >
> > As such, this needs to be available for PVH dom0 to access, or else
> > the GPU won't work.
> >
> > Reported-by: Huang Rui <ray.huang@amd.com>
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > Acked-and-Tested-by: Huang Rui <ray.huang@amd.com>
> > Release-acked-by: Henry Wang <Henry.Wang@arm.com>
> > Signed-off-by: Huang Rui <ray.huang@amd.com>
> 
> Huh...  Despite the release ack, this didn't get committed for 4.17.

There was a pending query from Jan as to where this table signature
was documented or at least registered, as it's not in the ACPI spec
or any related files.

I don't oppose the change, as it's already used by Linux, so I think
it's impossible for the table signature to be reused, even if not
properly documented (reuse would cause havoc).

It's however not ideal to set this kind of precedent.

Thanks, Roger.



* Re: [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present
  2023-03-13 12:21     ` Roger Pau Monné
@ 2023-03-13 12:27       ` Andrew Cooper
  2023-03-21  6:26         ` Huang Rui
  0 siblings, 1 reply; 75+ messages in thread
From: Andrew Cooper @ 2023-03-13 12:27 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Huang Rui, Jan Beulich, Stefano Stabellini, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian, Henry Wang

On 13/03/2023 12:21 pm, Roger Pau Monné wrote:
> On Mon, Mar 13, 2023 at 11:55:56AM +0000, Andrew Cooper wrote:
>> On 12/03/2023 7:54 am, Huang Rui wrote:
>>> From: Roger Pau Monne <roger.pau@citrix.com>
>>>
>>> The VFCT ACPI table is used by AMD GPUs to expose the vbios ROM image
>>> from the firmware instead of doing it on the PCI ROM on the physical
>>> device.
>>>
>>> As such, this needs to be available for PVH dom0 to access, or else
>>> the GPU won't work.
>>>
>>> Reported-by: Huang Rui <ray.huang@amd.com>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> Acked-and-Tested-by: Huang Rui <ray.huang@amd.com>
>>> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>> Huh...  Despite the release ack, this didn't get committed for 4.17.
> There was a pending query from Jan as to where this table signature
> was documented or at least registered, as it's not in the ACPI spec
> or any related files.
>
> I don't oppose the change, as it's already used by Linux, so I think
> it's impossible for the table signature to be reused, even if not
> properly documented (reuse would cause havoc).
>
> It's however not ideal to set this kind of precedent.

It's not great, but this exists in real systems, for several generations
it seems.

Making things work for users trumps any idealistic beliefs about
firmware actually conforming to spec.

~Andrew



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-12  7:54 ` [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH Huang Rui
  2023-03-13  7:23   ` Christian König
@ 2023-03-14 16:02   ` Jan Beulich
  2023-03-21  9:36     ` Huang Rui
  1 sibling, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-14 16:02 UTC (permalink / raw)
  To: Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 12.03.2023 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> When dom0 is PVH and we want to pass a GPU through to a guest,
> we should allow BAR writes even though the BAR is mapped. If
> not, the BAR values are not initialized when the guest first
> starts.

From this it doesn't become clear why a GPU would be special in this
regard, or what (if any) prior bug there was. Are you suggesting ...

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
>       * writes as long as the BAR is not mapped into the p2m.
>       */
> -    if ( bar->enabled )
> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
>      {
>          /* If the value written is the current one avoid printing a warning. */
>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )

... bar->enabled doesn't properly reflect the necessary state? It
generally shouldn't be necessary to look at the physical device's
state here.

Furthermore when you make a change in a case like this, the
accompanying comment also needs updating (which might have clarified
what, if anything, has been wrong).

Jan



* Re: [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-12  7:54 ` [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH Huang Rui
@ 2023-03-14 16:27   ` Jan Beulich
  2023-03-15 15:57   ` Roger Pau Monné
  1 sibling, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-14 16:27 UTC (permalink / raw)
  To: Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 12.03.2023 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> PVH is also an HVM-type domain, but PVH doesn't have the
> X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
> PHYSDEVOP_map_pirq, it fails the has_pirq() check.
> 
> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>

Please see b96b50004804 ("x86: remove XENFEAT_hvm_pirqs for PVHv2 guests"),
which clearly says that these sub-ops shouldn't be used by PVH domains.
Plus if you're after just one sub-op (assuming that indeed needs making
available for a yet to be supplied reason), why ...

> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      case PHYSDEVOP_eoi:
>      case PHYSDEVOP_irq_status_query:
>      case PHYSDEVOP_get_free_pirq:
> -        if ( !has_pirq(currd) )
> -            return -ENOSYS;
>          break;

... do you enable several more by simply dropping code altogether?

Jan



* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-12  7:54 ` [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call Huang Rui
@ 2023-03-14 16:30   ` Jan Beulich
  2023-03-15 17:01     ` Andrew Cooper
  2023-03-21 12:22     ` Huang Rui
  0 siblings, 2 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-14 16:30 UTC (permalink / raw)
  To: Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 12.03.2023 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>

An empty description won't do here. First of all you need to address the Why?
As already hinted at in the reply to the earlier patch, it looks like you're
breaking the intended IRQ model for PVH.

Jan



* Re: [RFC XEN PATCH 5/6] tools/libs/call: add linux os call to get gsi from irq
  2023-03-12  7:54 ` [RFC XEN PATCH 5/6] tools/libs/call: add linux os call to get gsi from irq Huang Rui
@ 2023-03-14 16:36   ` Jan Beulich
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-14 16:36 UTC (permalink / raw)
  To: Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 12.03.2023 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> When passing a GPU through to a guest, userspace can only get
> the IRQ instead of the GSI. But it should pass the GSI to the
> guest so that the guest can receive the interrupt. So, provide
> a function to get the GSI.
> 
> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> ---
>  tools/include/xen-sys/Linux/privcmd.h |  7 +++++++

Assuming this information needs obtaining in the first place (which I
doubt), I don't think privcmd is the right vehicle to get at it. Can
one obtain such mapping information on baremetal Linux? If so, that
would want re-using in the same or a similar way. If not, there would
need to be a very good reason why the information is needed when
running on top of Xen.

Jan



* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-12  7:54 ` [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi Huang Rui
@ 2023-03-14 16:39   ` Jan Beulich
  2023-03-15 16:35   ` Roger Pau Monné
  1 sibling, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-14 16:39 UTC (permalink / raw)
  To: Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 12.03.2023 08:54, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> Use new xc_physdev_gsi_from_irq to get the GSI number

Apart from again the "Why?", ...

> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
>          goto out_no_irq;
>      }
>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>          if (r < 0) {
>              LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",

... aren't you breaking existing use cases this way?

Jan



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-13  7:26     ` Christian König
  2023-03-13  8:46       ` Jan Beulich
  2023-03-13  8:55       ` Huang Rui
@ 2023-03-14 23:42       ` Stefano Stabellini
  2 siblings, 0 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-14 23:42 UTC (permalink / raw)
  To: Christian König
  Cc: Huang Rui, Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Alex Deucher, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian


On Mon, 13 Mar 2023, Christian König wrote:
> On 13.03.23 08:23, Christian König wrote:
> > On 12.03.23 08:54, Huang Rui wrote:
> > > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > >
> > > When dom0 is PVH and we want to pass a GPU through to a guest,
> > > we should allow BAR writes even though the BAR is mapped. If
> > > not, the BAR values are not initialized when the guest first
> > > starts.
> > > 
> > > Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> > > Signed-off-by: Huang Rui <ray.huang@amd.com>
> > > ---
> > >   xen/drivers/vpci/header.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> > > index ec2e978a4e..918d11fbce 100644
> > > --- a/xen/drivers/vpci/header.c
> > > +++ b/xen/drivers/vpci/header.c
> > > @@ -392,7 +392,7 @@ static void cf_check bar_write(
> > >        * Xen only cares whether the BAR is mapped into the p2m, so allow
> > > BAR
> > >        * writes as long as the BAR is not mapped into the p2m.
> > >        */
> > > -    if ( bar->enabled )
> > > +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> > 
> > Checkpatch.pl reports here:
> > 
> > ERROR: space prohibited after that open parenthesis '('
> > #115: FILE: xen/drivers/vpci/header.c:395:
> > +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> 
> But I should probably mention that I'm not 100% sure if this code base uses
> kernel coding style!

Hi Christian,

Thanks for taking a look at these patches! For better or for worse Xen
follows a different coding style from the Linux kernel (see CODING_STYLE
under xen.git).  In Xen we use:

    if ( rc != 0 ) 
    {
        return rc;
    }


* Re: [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-12  7:54 ` [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH Huang Rui
  2023-03-14 16:27   ` Jan Beulich
@ 2023-03-15 15:57   ` Roger Pau Monné
  2023-03-16  0:22     ` Stefano Stabellini
  2023-03-21 10:09     ` Huang Rui
  1 sibling, 2 replies; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-15 15:57 UTC (permalink / raw)
  To: Huang Rui
  Cc: Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On Sun, Mar 12, 2023 at 03:54:52PM +0800, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> PVH is also an HVM-type domain, but PVH doesn't have the
> X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
> PHYSDEVOP_map_pirq, it fails the has_pirq() check.
> 
> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> ---
>  xen/arch/x86/hvm/hypercall.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
> index 405d0a95af..16a2f5c0b3 100644
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      case PHYSDEVOP_eoi:
>      case PHYSDEVOP_irq_status_query:
>      case PHYSDEVOP_get_free_pirq:
> -        if ( !has_pirq(currd) )
> -            return -ENOSYS;

Since I've taken a look at the Linux side of this, it seems like you
need PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi, but the latter is not
in this list because it has never been available to HVM-type guests.

I would like to better understand the usage by PVH dom0 for GSI
passthrough before deciding on what to do here.  IIRC QEMU also uses
PHYSDEVOP_{un,}map_pirq in order to allocate MSI(-X) interrupts.

Thanks, Roger.



* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-12  7:54 ` [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi Huang Rui
  2023-03-14 16:39   ` Jan Beulich
@ 2023-03-15 16:35   ` Roger Pau Monné
  2023-03-16  0:44     ` Stefano Stabellini
  1 sibling, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-15 16:35 UTC (permalink / raw)
  To: Huang Rui
  Cc: Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
> From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> Use new xc_physdev_gsi_from_irq to get the GSI number
> 
> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> Signed-off-by: Huang Rui <ray.huang@amd.com>
> ---
>  tools/libs/light/libxl_pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> index f4c4f17545..47cf2799bf 100644
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
>          goto out_no_irq;
>      }
>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);

This is just a shot in the dark, because I don't really have enough
context to understand what's going on here, but see below.

I've taken a look at this on my box, and it seems like on
dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
very consistent.

If devices are in use by a driver the irq sysfs node reports either
the GSI irq or the MSI IRQ (in case a single MSI interrupt is
set up).

It seems like pciback in Linux does something to report the correct
value:

root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
74
root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
16

As you can see, making the device assignable changed the value
reported by the irq node to be the GSI instead of the MSI IRQ, I would
think you are missing something similar in the PVH setup (some pciback
magic)?

That said, I have no idea why you would need to translate from IRQ to GSI
in the way you do in this and related patches, because I'm missing the
context.

Regards, Roger.



* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-14 16:30   ` Jan Beulich
@ 2023-03-15 17:01     ` Andrew Cooper
  2023-03-16  0:26       ` Stefano Stabellini
                         ` (2 more replies)
  2023-03-21 12:22     ` Huang Rui
  1 sibling, 3 replies; 75+ messages in thread
From: Andrew Cooper @ 2023-03-15 17:01 UTC (permalink / raw)
  To: Jan Beulich, Huang Rui
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 14/03/2023 4:30 pm, Jan Beulich wrote:
> On 12.03.2023 08:54, Huang Rui wrote:
>> From: Chen Jiqian <Jiqian.Chen@amd.com>
> An empty description won't do here. First of all you need to address the Why?
> As already hinted at in the reply to the earlier patch, it looks like you're
> breaking the intended IRQ model for PVH.

I think this is rather unfair.

Until you can point to the document which describes how IRQs are
intended to work in PVH, I'd say this series is a pretty damn good
attempt to make something that functions, in the absence of any
guidance.

~Andrew

P.S. If it isn't obvious, this is a giant hint that something should be
written down...



* Re: [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-15 15:57   ` Roger Pau Monné
@ 2023-03-16  0:22     ` Stefano Stabellini
  2023-03-21 10:09     ` Huang Rui
  1 sibling, 0 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-16  0:22 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Huang Rui, Jan Beulich, Stefano Stabellini, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian


On Wed, 15 Mar 2023, Roger Pau Monné wrote:
> On Sun, Mar 12, 2023 at 03:54:52PM +0800, Huang Rui wrote:
> > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > 
> > PVH is also an HVM-type domain, but PVH doesn't have the
> > X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
> > PHYSDEVOP_map_pirq, it fails the has_pirq() check.
> > 
> > Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> > Signed-off-by: Huang Rui <ray.huang@amd.com>
> > ---
> >  xen/arch/x86/hvm/hypercall.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
> > index 405d0a95af..16a2f5c0b3 100644
> > --- a/xen/arch/x86/hvm/hypercall.c
> > +++ b/xen/arch/x86/hvm/hypercall.c
> > @@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      case PHYSDEVOP_eoi:
> >      case PHYSDEVOP_irq_status_query:
> >      case PHYSDEVOP_get_free_pirq:
> > -        if ( !has_pirq(currd) )
> > -            return -ENOSYS;
> 
> Since I've taken a look at the Linux side of this, it seems like you
> need PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi, but the latter is not
> in this list because it has never been available to HVM-type guests.
> 
> I would like to better understand the usage by PVH dom0 for GSI
> passthrough before deciding on what to do here.  IIRC QEMU also uses
> PHYSDEVOP_{un,}map_pirq in order to allocate MSI(-X) interrupts.

I'll let Ray reply here, but I think you are right:
PHYSDEVOP_{un,}map_pirq are needed so that QEMU can run in PVH Dom0 to
support HVM guests.


* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-15 17:01     ` Andrew Cooper
@ 2023-03-16  0:26       ` Stefano Stabellini
  2023-03-16  0:39         ` Stefano Stabellini
  2023-03-16  8:51         ` Jan Beulich
  2023-03-16  7:05       ` Jan Beulich
  2023-03-21 12:42       ` Huang Rui
  2 siblings, 2 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-16  0:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Huang Rui, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian, Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Wed, 15 Mar 2023, Andrew Cooper wrote:
> On 14/03/2023 4:30 pm, Jan Beulich wrote:
> > On 12.03.2023 08:54, Huang Rui wrote:
> >> From: Chen Jiqian <Jiqian.Chen@amd.com>
> > An empty description won't do here. First of all you need to address the Why?
> > As already hinted at in the reply to the earlier patch, it looks like you're
> > breaking the intended IRQ model for PVH.
> 
> I think this is rather unfair.
> 
> Until you can point to the document which describes how IRQs are
> intended to work in PVH, I'd say this series is a pretty damn good
> attempt to make something that functions, in the absence of any
> guidance.

And to make things more confusing, those calls are not needed for PVH
itself; they are needed so that we can run QEMU to support regular HVM
guests on a PVH Dom0 (I'll let Ray confirm).

So technically, this is not breaking the PVH IRQ model.



* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-16  0:26       ` Stefano Stabellini
@ 2023-03-16  0:39         ` Stefano Stabellini
  2023-03-16  8:51         ` Jan Beulich
  1 sibling, 0 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-16  0:39 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Andrew Cooper, Jan Beulich, Huang Rui, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian, Roger Pau Monné,
	Anthony PERARD, xen-devel

On Wed, 15 Mar 2023, Stefano Stabellini wrote:
> On Wed, 15 Mar 2023, Andrew Cooper wrote:
> > On 14/03/2023 4:30 pm, Jan Beulich wrote:
> > > On 12.03.2023 08:54, Huang Rui wrote:
> > >> From: Chen Jiqian <Jiqian.Chen@amd.com>
> > > An empty description won't do here. First of all you need to address the Why?
> > > As already hinted at in the reply to the earlier patch, it looks like you're
> > > breaking the intended IRQ model for PVH.
> > 
> > I think this is rather unfair.
> > 
> > Until you can point to the document which describes how IRQs are
> > intended to work in PVH, I'd say this series is a pretty damn good attempt
> > to make something that functions, in the absence of any guidance.
> 
> And to make things more confusing, those calls are not needed for PVH
> itself; those calls are needed so that we can run QEMU to support
> regular HVM guests on PVH Dom0 (I'll let Ray confirm.)
> 
> So technically, this is not breaking the PVH IRQ model.

To add more info:

QEMU (hw/xen/xen_pt.c) calls xc_physdev_map_pirq and
xc_domain_bind_pt_pci_irq. Note that xc_domain_bind_pt_pci_irq is the
key hypercall here and it takes a pirq as parameter.

That is why QEMU calls xc_physdev_map_pirq, so that we can get the pirq
and use the pirq as parameter for xc_domain_bind_pt_pci_irq.

We need to get the above to work also with Dom0 PVH.
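
For illustration, a minimal C sketch of that QEMU-side sequence, assuming
the xc_physdev_map_pirq() and xc_domain_bind_pt_pci_irq() signatures from
tools/include/xenctrl.h; the helper name, the domid/BDF parameters and
the reduced error handling are placeholders, not code from the series:

    /* Hedged sketch: map the GSI to a pirq, then use that pirq as the
     * parameter for the key bind hypercall. */
    #include <xenctrl.h>

    static int passthrough_gsi(xc_interface *xch, uint32_t domid, int gsi,
                               uint8_t bus, uint8_t dev, uint8_t intx)
    {
        int pirq = -1;   /* ask Xen to allocate the pirq */
        int rc = xc_physdev_map_pirq(xch, domid, gsi, &pirq);

        if (rc)
            return rc;   /* fails when gsi is not a valid GSI number */

        /* the pirq just obtained is what the bind call consumes */
        return xc_domain_bind_pt_pci_irq(xch, domid, pirq, bus, dev, intx);
    }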


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-15 16:35   ` Roger Pau Monné
@ 2023-03-16  0:44     ` Stefano Stabellini
  2023-03-16  8:54       ` Roger Pau Monné
  2023-03-16  8:55       ` Jan Beulich
  0 siblings, 2 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-16  0:44 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Huang Rui, Jan Beulich, Stefano Stabellini, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian

On Wed, 15 Mar 2023, Roger Pau Monné wrote:
> On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
> > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > 
> > Use new xc_physdev_gsi_from_irq to get the GSI number
> > 
> > Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> > Signed-off-by: Huang Rui <ray.huang@amd.com>
> > ---
> >  tools/libs/light/libxl_pci.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> > index f4c4f17545..47cf2799bf 100644
> > --- a/tools/libs/light/libxl_pci.c
> > +++ b/tools/libs/light/libxl_pci.c
> > @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
> >          goto out_no_irq;
> >      }
> >      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> > +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
> 
> This is just a shot in the dark, because I don't really have enough
> context to understand what's going on here, but see below.
> 
> I've taken a look at this on my box, and it seems like on
> dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
> very consistent.
> 
> If devices are in use by a driver the irq sysfs node reports either
> the GSI irq or the MSI IRQ (in case a single MSI interrupt is
> setup).
> 
> It seems like pciback in Linux does something to report the correct
> value:
> 
> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> 74
> root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> 16
> 
> As you can see, making the device assignable changed the value
> reported by the irq node to be the GSI instead of the MSI IRQ, I would
> think you are missing something similar in the PVH setup (some pciback
> magic)?
> 
> Albeit I have no idea why you would need to translate from IRQ to GSI
> in the way you do in this and related patches, because I'm missing the
> context.

As I mention in another email, also keep in mind that we need QEMU to
work and QEMU calls:
1) xc_physdev_map_pirq (this is also called from libxl)
2) xc_domain_bind_pt_pci_irq


In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
the IRQ. If you look at the implementation of xc_physdev_map_pirq,
you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:

    if ( index < 0 || index >= nr_irqs_gsi )
    {
        dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
                index);
        return -EINVAL;
    }

nr_irqs_gsi < 112, and the check will fail.

So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
to discover the GSI number corresponding to the IRQ number.
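
To make the numbers concrete, a small standalone illustration of that
bounds check; nr_irqs_gsi == 56 is an assumption derived from the two
IO-APICs (GSIs 0-55) reported later in this thread:

    /* Hedged illustration only: the Linux IRQ (112) fails the check the
     * way allocate_and_map_gsi_pirq() does, while the GSI (28) passes. */
    #include <stdio.h>

    int main(void)
    {
        const int nr_irqs_gsi = 56;  /* GSIs 0-23 plus 24-55 */
        const int linux_irq = 112;   /* what sysfs hands to the toolstack */
        const int gsi = 28;          /* the actual IO-APIC pin */

        printf("index %d: %s\n", linux_irq,
               linux_irq < 0 || linux_irq >= nr_irqs_gsi ? "-EINVAL" : "ok");
        printf("index %d: %s\n", gsi,
               gsi < 0 || gsi >= nr_irqs_gsi ? "-EINVAL" : "ok");
        return 0;
    }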

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-15 17:01     ` Andrew Cooper
  2023-03-16  0:26       ` Stefano Stabellini
@ 2023-03-16  7:05       ` Jan Beulich
  2023-03-21 12:42       ` Huang Rui
  2 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-16  7:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel, Huang Rui

On 15.03.2023 18:01, Andrew Cooper wrote:
> On 14/03/2023 4:30 pm, Jan Beulich wrote:
>> On 12.03.2023 08:54, Huang Rui wrote:
>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>> An empty description won't do here. First of all you need to address the Why?
>> As already hinted at in the reply to the earlier patch, it looks like you're
>> breaking the intended IRQ model for PVH.
> 
> I think this is rather unfair.
> 
> Until you can point to the document which describes how IRQs are
> intended to work in PVH, I'd say this series is a pretty damn good attempt
> to make something that functions, in the absence of any guidance.

Are you advocating for patches which don't explain why they make a certain
change? Even in the absence of any documentation, the code itself can be
taken as reference, and hence it can be pointed out that either something
was wrong before, or something needs extending in a certain way to make
some use case work which can't be made to work by other means. In the case of
this series, without knowing the "Why?" for the various changes, it is
also impossible to suggest alternative approaches.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-16  0:26       ` Stefano Stabellini
  2023-03-16  0:39         ` Stefano Stabellini
@ 2023-03-16  8:51         ` Jan Beulich
  2023-03-16  9:18           ` Roger Pau Monné
  1 sibling, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-16  8:51 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Huang Rui, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian, Roger Pau Monné,
	Anthony PERARD, xen-devel, Andrew Cooper

On 16.03.2023 01:26, Stefano Stabellini wrote:
> On Wed, 15 Mar 2023, Andrew Cooper wrote:
>> On 14/03/2023 4:30 pm, Jan Beulich wrote:
>>> On 12.03.2023 08:54, Huang Rui wrote:
>>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>>> An empty description won't do here. First of all you need to address the Why?
>>> As already hinted at in the reply to the earlier patch, it looks like you're
>>> breaking the intended IRQ model for PVH.
>>
>> I think this is rather unfair.
>>
>> Until you can point to the document which describes how IRQs are
>> intended to work in PVH, I'd say this series is a pretty damn good attempt
>> to make something that functions, in the absence of any guidance.
> 
> And to make things more confusing, those calls are not needed for PVH
> itself; those calls are needed so that we can run QEMU to support
> regular HVM guests on PVH Dom0 (I'll let Ray confirm.)

Ah, but that wasn't said anywhere, was it? In which case ...

> So technically, this is not breaking the PVH IRQ model.

... I of course agree here. But then I guess we may want to reject
attempts for a domain to do any of this to itself.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16  0:44     ` Stefano Stabellini
@ 2023-03-16  8:54       ` Roger Pau Monné
  2023-03-16  8:55       ` Jan Beulich
  1 sibling, 0 replies; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-16  8:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Huang Rui, Jan Beulich, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On Wed, Mar 15, 2023 at 05:44:12PM -0700, Stefano Stabellini wrote:
> On Wed, 15 Mar 2023, Roger Pau Monné wrote:
> > On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
> > > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > > 
> > > Use new xc_physdev_gsi_from_irq to get the GSI number
> > > 
> > > Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> > > Signed-off-by: Huang Rui <ray.huang@amd.com>
> > > ---
> > >  tools/libs/light/libxl_pci.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> > > index f4c4f17545..47cf2799bf 100644
> > > --- a/tools/libs/light/libxl_pci.c
> > > +++ b/tools/libs/light/libxl_pci.c
> > > @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
> > >          goto out_no_irq;
> > >      }
> > >      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> > > +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
> > 
> > This is just a shot in the dark, because I don't really have enough
> > context to understand what's going on here, but see below.
> > 
> > I've taken a look at this on my box, and it seems like on
> > dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
> > very consistent.
> > 
> > If devices are in use by a driver the irq sysfs node reports either
> > the GSI irq or the MSI IRQ (in case a single MSI interrupt is
> > setup).
> > 
> > It seems like pciback in Linux does something to report the correct
> > value:
> > 
> > root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> > 74
> > root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
> > root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> > 16
> > 
> > As you can see, making the device assignable changed the value
> > reported by the irq node to be the GSI instead of the MSI IRQ, I would
> > think you are missing something similar in the PVH setup (some pciback
> > magic)?
> > 
> > Albeit I have no idea why you would need to translate from IRQ to GSI
> > in the way you do in this and related patches, because I'm missing the
> > context.
> 
> As I mention in another email, also keep in mind that we need QEMU to
> work and QEMU calls:
> 1) xc_physdev_map_pirq (this is also called from libxl)
> 2) xc_domain_bind_pt_pci_irq

Those would be fine, and don't need any translation since it's QEMU
the one that creates and maps the MSI(-X) interrupts, so it knows the
PIRQ without requiring any translation because it has been allocated
by QEMU itself.

GSI is kind of special because it's a fixed (legacy) interrupt mapped
to an IO-APIC pin and assigned to the device by the firmware.  The
setup in that case gets done by the toolstack (libxl) because the
mapping is immutable for the lifetime of the domain.
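
Condensed, that toolstack-side setup, with this series' IRQ-to-GSI
translation folded in, looks roughly like the hedged sketch below;
map_device_gsi() is an illustrative name for libxl-internal code, and the
sysfs read that produces 'irq' is omitted:

    /* Hedged sketch of the GSI setup done from pci_add_dm_done() in
     * tools/libs/light/libxl_pci.c; error handling trimmed. */
    static int map_device_gsi(libxl_ctx *ctx, uint32_t domid,
                              unsigned int irq)
    {
        int gsi = xc_physdev_gsi_from_irq(ctx->xch, irq); /* new in series */
        int pirq = gsi;  /* GSI-type mappings request the identity pirq */

        return xc_physdev_map_pirq(ctx->xch, domid, gsi, &pirq);
    }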

> In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
> in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
> the IRQ.

I think the real question here is why on this scenario IRQ != GSI for
GSI interrupts.  On one of my systems when booted as PVH dom0 with
pci=nomsi I get from /proc/interrupts:

  8:          0          0          0          0          0          0          0   IO-APIC   8-edge      rtc0
  9:          1          0          0          0          0          0          0   IO-APIC   9-fasteoi   acpi
 16:          0          0       8373          0          0          0          0   IO-APIC  16-fasteoi   i801_smbus, xhci-hcd:usb1, ahci[0000:00:17.0]
 17:          0          0          0        542          0          0          0   IO-APIC  17-fasteoi   eth0
 24:       4112          0          0          0          0          0          0  xen-percpu    -virq      timer0
 25:        352          0          0          0          0          0          0  xen-percpu    -ipi       resched0
 26:       6635          0          0          0          0          0          0  xen-percpu    -ipi       callfunc0

So GSI == IRQ, and non-GSI interrupts start past the last GSI, which
is 23 on this system because it has a single IO-APIC with 24 pins.

We need to figure out what causes GSIs to be mapped to IRQs != GSI on
your system, and then we can decide how to fix this.  I would expect
it could be fixed so that IRQ == GSI (like it's on PV dom0), and none
of this translation to be necessary.

Can you paste the output of /proc/interrupts on that system that has a
GSI not identity mapped to an IRQ?

> If you look at the implementation of xc_physdev_map_pirq,
> you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
> xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:
> 
>     if ( index < 0 || index >= nr_irqs_gsi )
>     {
>         dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
>                 index);
>         return -EINVAL;
>     }
> 
> nr_irqs_gsi < 112, and the check will fail.
> 
> So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
> to discover the GSI number corresponding to the IRQ number.

Right, see above, I think the real problem is that IRQ != GSI on your
Linux dom0 for some reason.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16  0:44     ` Stefano Stabellini
  2023-03-16  8:54       ` Roger Pau Monné
@ 2023-03-16  8:55       ` Jan Beulich
  2023-03-16  9:27         ` Roger Pau Monné
  1 sibling, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-16  8:55 UTC (permalink / raw)
  To: Stefano Stabellini, Roger Pau Monné
  Cc: Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On 16.03.2023 01:44, Stefano Stabellini wrote:
> On Wed, 15 Mar 2023, Roger Pau Monné wrote:
>> On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>>>
>>> Use new xc_physdev_gsi_from_irq to get the GSI number
>>>
>>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>> ---
>>>  tools/libs/light/libxl_pci.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>>> index f4c4f17545..47cf2799bf 100644
>>> --- a/tools/libs/light/libxl_pci.c
>>> +++ b/tools/libs/light/libxl_pci.c
>>> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>          goto out_no_irq;
>>>      }
>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
>>
>> This is just a shot in the dark, because I don't really have enough
>> context to understand what's going on here, but see below.
>>
>> I've taken a look at this on my box, and it seems like on
>> dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
>> very consistent.
>>
>> If devices are in use by a driver the irq sysfs node reports either
>> the GSI irq or the MSI IRQ (in case a single MSI interrupt is
>> setup).
>>
>> It seems like pciback in Linux does something to report the correct
>> value:
>>
>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
>> 74
>> root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
>> 16
>>
>> As you can see, making the device assignable changed the value
>> reported by the irq node to be the GSI instead of the MSI IRQ, I would
>> think you are missing something similar in the PVH setup (some pciback
>> magic)?
>>
>> Albeit I have no idea why you would need to translate from IRQ to GSI
>> in the way you do in this and related patches, because I'm missing the
>> context.
> 
> As I mention in another email, also keep in mind that we need QEMU to
> work and QEMU calls:
> 1) xc_physdev_map_pirq (this is also called from libxl)
> 2) xc_domain_bind_pt_pci_irq
> 
> 
> In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
> in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
> the IRQ. If you look at the implementation of xc_physdev_map_pirq,
> you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
> xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:
> 
>     if ( index < 0 || index >= nr_irqs_gsi )
>     {
>         dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
>                 index);
>         return -EINVAL;
>     }
> 
> nr_irqs_gsi < 112, and the check will fail.
> 
> So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
> to discover the GSI number corresponding to the IRQ number.

That's one possible approach. Another could be (making a lot of assumptions)
that a PVH Dom0 would pass in the IRQ it knows for this interrupt and Xen
then translates that to GSI, knowing that PVH doesn't have (host) GSIs
exposed to it.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-16  8:51         ` Jan Beulich
@ 2023-03-16  9:18           ` Roger Pau Monné
  0 siblings, 0 replies; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-16  9:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Huang Rui, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian, Anthony PERARD,
	xen-devel, Andrew Cooper

On Thu, Mar 16, 2023 at 09:51:20AM +0100, Jan Beulich wrote:
> On 16.03.2023 01:26, Stefano Stabellini wrote:
> > On Wed, 15 Mar 2023, Andrew Cooper wrote:
> >> On 14/03/2023 4:30 pm, Jan Beulich wrote:
> >>> On 12.03.2023 08:54, Huang Rui wrote:
> >>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
> >>> An empty description won't do here. First of all you need to address the Why?
> >>> As already hinted at in the reply to the earlier patch, it looks like you're
> >>> breaking the intended IRQ model for PVH.
> >>
> >> I think this is rather unfair.
> >>
> >> Until you can point to the document which describes how IRQs are
> >> intended to work in PVH, I'd say this series is a pretty damn good attempt
> >> to make something that functions, in the absence of any guidance.
> > 
> > And to make things more confusing, those calls are not needed for PVH
> > itself; those calls are needed so that we can run QEMU to support
> > regular HVM guests on PVH Dom0 (I'll let Ray confirm.)
> 
> Ah, but that wasn't said anywhere, was it? In which case ...
> 
> > So technically, this is not breaking the PVH IRQ model.
> 
> ... I of course agree here. But then I guess we may want to reject
> attempts for a domain to do any of this to itself.

For PCI passthrough we strictly need the PHYSDEVOP_{un,}map_pirq
because that's the only way QEMU currently has to allocate MSI(-X)
vectors from physical devices in order to assign to guests.  We could
see about moving those to DM ops maybe in the future, as I think it
would be clearer, but that shouldn't block the work here.

If we start allowing PVH domains to use PIRQs we must enforce that
PIRQ cannot be mapped to event channels, IOW, block
EVTCHNOP_bind_pirq.
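
For concreteness, one shape that restriction could take, as a hedged
sketch against Xen's event channel code; the predicate and its placement
inside evtchn_bind_pirq() are assumptions, not code from this series:

    /* Hedged sketch only: if PVH domains gain PHYSDEVOP_map_pirq for
     * passthrough setup, still refuse to bind those pirqs to event
     * channels. */
    static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind)
    {
        struct domain *d = current->domain;

        if ( is_hvm_domain(d) && !has_pirq(d) )
            return -EOPNOTSUPP;  /* PVH and pirq-less HVM: reject */

        /* ... existing pirq binding logic ... */
    }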

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16  8:55       ` Jan Beulich
@ 2023-03-16  9:27         ` Roger Pau Monné
  2023-03-16  9:42           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-16  9:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Huang Rui, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On Thu, Mar 16, 2023 at 09:55:03AM +0100, Jan Beulich wrote:
> On 16.03.2023 01:44, Stefano Stabellini wrote:
> > On Wed, 15 Mar 2023, Roger Pau Monné wrote:
> >> On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
> >>> From: Chen Jiqian <Jiqian.Chen@amd.com>
> >>>
> >>> Use new xc_physdev_gsi_from_irq to get the GSI number
> >>>
> >>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> >>> Signed-off-by: Huang Rui <ray.huang@amd.com>
> >>> ---
> >>>  tools/libs/light/libxl_pci.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> >>> index f4c4f17545..47cf2799bf 100644
> >>> --- a/tools/libs/light/libxl_pci.c
> >>> +++ b/tools/libs/light/libxl_pci.c
> >>> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
> >>>          goto out_no_irq;
> >>>      }
> >>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> >>> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
> >>
> >> This is just a shot in the dark, because I don't really have enough
> >> context to understand what's going on here, but see below.
> >>
> >> I've taken a look at this on my box, and it seems like on
> >> dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
> >> very consistent.
> >>
> >> If devices are in use by a driver the irq sysfs node reports either
> >> the GSI irq or the MSI IRQ (in case a single MSI interrupt is
> >> setup).
> >>
> >> It seems like pciback in Linux does something to report the correct
> >> value:
> >>
> >> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> >> 74
> >> root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
> >> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> >> 16
> >>
> >> As you can see, making the device assignable changed the value
> >> reported by the irq node to be the GSI instead of the MSI IRQ, I would
> >> think you are missing something similar in the PVH setup (some pciback
> >> magic)?
> >>
> >> Albeit I have no idea why you would need to translate from IRQ to GSI
> >> in the way you do in this and related patches, because I'm missing the
> >> context.
> > 
> > As I mention in another email, also keep in mind that we need QEMU to
> > work and QEMU calls:
> > 1) xc_physdev_map_pirq (this is also called from libxl)
> > 2) xc_domain_bind_pt_pci_irq
> > 
> > 
> > In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
> > in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
> > the IRQ. If you look at the implementation of xc_physdev_map_pirq,
> > you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
> > xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:
> > 
> >     if ( index < 0 || index >= nr_irqs_gsi )
> >     {
> >         dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
> >                 index);
> >         return -EINVAL;
> >     }
> > 
> > nr_irqs_gsi < 112, and the check will fail.
> > 
> > So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
> > to discover the GSI number corresponding to the IRQ number.
> 
> That's one possible approach. Another could be (making a lot of assumptions)
> that a PVH Dom0 would pass in the IRQ it knows for this interrupt and Xen
> then translates that to GSI, knowing that PVH doesn't have (host) GSIs
> exposed to it.

I don't think Xen can translate a Linux IRQ to a GSI, as that's a
Linux abstraction Xen has no part in.

The GSIs exposed to a PVH dom0 are the native (host) ones, as we
create an emulated IO-APIC topology that mimics the physical one.

Question here is why Linux ends up with an IRQ != GSI, as it's my
understanding that on Linux GSIs will always be identity mapped to IRQs, and
the IRQ space up to the last possible GSI is explicitly reserved for
this purpose.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16  9:27         ` Roger Pau Monné
@ 2023-03-16  9:42           ` Jan Beulich
  2023-03-16 23:19             ` Stefano Stabellini
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-16  9:42 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Huang Rui, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On 16.03.2023 10:27, Roger Pau Monné wrote:
> On Thu, Mar 16, 2023 at 09:55:03AM +0100, Jan Beulich wrote:
>> On 16.03.2023 01:44, Stefano Stabellini wrote:
>>> On Wed, 15 Mar 2023, Roger Pau Monné wrote:
>>>> On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
>>>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>>>>>
>>>>> Use new xc_physdev_gsi_from_irq to get the GSI number
>>>>>
>>>>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
>>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>>>> ---
>>>>>  tools/libs/light/libxl_pci.c | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>>>>> index f4c4f17545..47cf2799bf 100644
>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>          goto out_no_irq;
>>>>>      }
>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
>>>>
>>>> This is just a shot in the dark, because I don't really have enough
>>>> context to understand what's going on here, but see below.
>>>>
>>>> I've taken a look at this on my box, and it seems like on
>>>> dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
>>>> very consistent.
>>>>
>>>> If devices are in use by a driver the irq sysfs node reports either
>>>> the GSI irq or the MSI IRQ (in case a single MSI interrupt is
>>>> setup).
>>>>
>>>> It seems like pciback in Linux does something to report the correct
>>>> value:
>>>>
>>>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
>>>> 74
>>>> root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
>>>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
>>>> 16
>>>>
>>>> As you can see, making the device assignable changed the value
>>>> reported by the irq node to be the GSI instead of the MSI IRQ, I would
>>>> think you are missing something similar in the PVH setup (some pciback
>>>> magic)?
>>>>
>>>> Albeit I have no idea why you would need to translate from IRQ to GSI
>>>> in the way you do in this and related patches, because I'm missing the
>>>> context.
>>>
>>> As I mention in another email, also keep in mind that we need QEMU to
>>> work and QEMU calls:
>>> 1) xc_physdev_map_pirq (this is also called from libxl)
>>> 2) xc_domain_bind_pt_pci_irq
>>>
>>>
>>> In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
>>> in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
>>> the IRQ. If you look at the implementation of xc_physdev_map_pirq,
>>> you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
>>> xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:
>>>
>>>     if ( index < 0 || index >= nr_irqs_gsi )
>>>     {
>>>         dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
>>>                 index);
>>>         return -EINVAL;
>>>     }
>>>
>>> nr_irqs_gsi < 112, and the check will fail.
>>>
>>> So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
>>> to discover the GSI number corresponding to the IRQ number.
>>
>> That's one possible approach. Another could be (making a lot of assumptions)
>> that a PVH Dom0 would pass in the IRQ it knows for this interrupt and Xen
>> then translates that to GSI, knowing that PVH doesn't have (host) GSIs
>> exposed to it.
> 
> I don't think Xen can translate a Linux IRQ to a GSI, as that's a
> Linux abstraction Xen has no part in.

Well, I was talking about whatever Dom0 and Xen use to communicate. I.e.
if at all I might have meant pIRQ, but now that you mention ...

> The GSIs exposed to a PVH dom0 are the native (host) ones, as we
> create an emulated IO-APIC topology that mimics the physical one.
> 
> Question here is why Linux ends up with an IRQ != GSI, as it's my
> understanding that on Linux GSIs will always be identity mapped to IRQs, and
> the IRQ space up to the last possible GSI is explicitly reserved for
> this purpose.

... this I guess pIRQ was a PV-only concept, and it really ought to be
GSI in the PVH case. So yes, it then all boils down to that Linux-
internal question.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16  9:42           ` Jan Beulich
@ 2023-03-16 23:19             ` Stefano Stabellini
  2023-03-17  8:39               ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-16 23:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné,
	Stefano Stabellini, Huang Rui, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On Thu, 16 Mar 2023, Jan Beulich wrote:
> On 16.03.2023 10:27, Roger Pau Monné wrote:
> > On Thu, Mar 16, 2023 at 09:55:03AM +0100, Jan Beulich wrote:
> >> On 16.03.2023 01:44, Stefano Stabellini wrote:
> >>> On Wed, 15 Mar 2023, Roger Pau Monné wrote:
> >>>> On Sun, Mar 12, 2023 at 03:54:55PM +0800, Huang Rui wrote:
> >>>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
> >>>>>
> >>>>> Use new xc_physdev_gsi_from_irq to get the GSI number
> >>>>>
> >>>>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> >>>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
> >>>>> ---
> >>>>>  tools/libs/light/libxl_pci.c | 1 +
> >>>>>  1 file changed, 1 insertion(+)
> >>>>>
> >>>>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> >>>>> index f4c4f17545..47cf2799bf 100644
> >>>>> --- a/tools/libs/light/libxl_pci.c
> >>>>> +++ b/tools/libs/light/libxl_pci.c
> >>>>> @@ -1486,6 +1486,7 @@ static void pci_add_dm_done(libxl__egc *egc,
> >>>>>          goto out_no_irq;
> >>>>>      }
> >>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> >>>>> +        irq = xc_physdev_gsi_from_irq(ctx->xch, irq);
> >>>>
> >>>> This is just a shot in the dark, because I don't really have enough
> >>>> context to understand what's going on here, but see below.
> >>>>
> >>>> I've taken a look at this on my box, and it seems like on
> >>>> dom0 the value returned by /sys/bus/pci/devices/SBDF/irq is not
> >>>> very consistent.
> >>>>
> >>>> If devices are in use by a driver the irq sysfs node reports either
> >>>> the GSI irq or the MSI IRQ (in case a single MSI interrupt is
> >>>> setup).
> >>>>
> >>>> It seems like pciback in Linux does something to report the correct
> >>>> value:
> >>>>
> >>>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> >>>> 74
> >>>> root@lcy2-dt107:~# xl pci-assignable-add 00:14.0
> >>>> root@lcy2-dt107:~# cat /sys/bus/pci/devices/0000\:00\:14.0/irq
> >>>> 16
> >>>>
> >>>> As you can see, making the device assignable changed the value
> >>>> reported by the irq node to be the GSI instead of the MSI IRQ, I would
> >>>> think you are missing something similar in the PVH setup (some pciback
> >>>> magic)?
> >>>>
> >>>> Albeit I have no idea why you would need to translate from IRQ to GSI
> >>>> in the way you do in this and related patches, because I'm missing the
> >>>> context.
> >>>
> >>> As I mention in another email, also keep in mind that we need QEMU to
> >>> work and QEMU calls:
> >>> 1) xc_physdev_map_pirq (this is also called from libxl)
> >>> 2) xc_domain_bind_pt_pci_irq
> >>>
> >>>
> >>> In this case IRQ != GSI (IRQ == 112, GSI == 28). Sysfs returns the IRQ
> >>> in Linux (112), but actually xc_physdev_map_pirq expects the GSI, not
> >>> the IRQ. If you look at the implementation of xc_physdev_map_pirq,
> >>> you'll see the type is "MAP_PIRQ_TYPE_GSI" and also the check in Xen
> >>> xen/arch/x86/irq.c:allocate_and_map_gsi_pirq:
> >>>
> >>>     if ( index < 0 || index >= nr_irqs_gsi )
> >>>     {
> >>>         dprintk(XENLOG_G_ERR, "dom%d: map invalid irq %d\n", d->domain_id,
> >>>                 index);
> >>>         return -EINVAL;
> >>>     }
> >>>
> >>> nr_irqs_gsi < 112, and the check will fail.
> >>>
> >>> So we need to pass the GSI to xc_physdev_map_pirq. To do that, we need
> >>> to discover the GSI number corresponding to the IRQ number.
> >>
> >> That's one possible approach. Another could be (making a lot of assumptions)
> >> that a PVH Dom0 would pass in the IRQ it knows for this interrupt and Xen
> >> then translates that to GSI, knowing that PVH doesn't have (host) GSIs
> >> exposed to it.
> > 
> > I don't think Xen can translate a Linux IRQ to a GSI, as that's a
> > Linux abstraction Xen has no part in.
> 
> Well, I was talking about whatever Dom0 and Xen use to communicate. I.e.
> if at all I might have meant pIRQ, but now that you mention ...
> 
> > The GSIs exposed to a PVH dom0 are the native (host) ones, as we
> > create an emulated IO-APIC topology that mimics the physical one.
> > 
> > Question here is why Linux ends up with an IRQ != GSI, as it's my
> > understanding that on Linux GSIs will always be identity mapped to IRQs, and
> > the IRQ space up to the last possible GSI is explicitly reserved for
> > this purpose.
> 
> ... this I guess pIRQ was a PV-only concept, and it really ought to be
> GSI in the PVH case. So yes, it then all boils down to that Linux-
> internal question.

Excellent question but we'll have to wait for Ray as he is the one with
access to the hardware. But I have this data I can share in the
meantime:

[    1.260378] IRQ to pin mappings:
[    1.260387] IRQ1 -> 0:1
[    1.260395] IRQ2 -> 0:2
[    1.260403] IRQ3 -> 0:3
[    1.260410] IRQ4 -> 0:4
[    1.260418] IRQ5 -> 0:5
[    1.260425] IRQ6 -> 0:6
[    1.260432] IRQ7 -> 0:7
[    1.260440] IRQ8 -> 0:8
[    1.260447] IRQ9 -> 0:9
[    1.260455] IRQ10 -> 0:10
[    1.260462] IRQ11 -> 0:11
[    1.260470] IRQ12 -> 0:12
[    1.260478] IRQ13 -> 0:13
[    1.260485] IRQ14 -> 0:14
[    1.260493] IRQ15 -> 0:15
[    1.260505] IRQ106 -> 1:8
[    1.260513] IRQ112 -> 1:4
[    1.260521] IRQ116 -> 1:13
[    1.260529] IRQ117 -> 1:14
[    1.260537] IRQ118 -> 1:15
[    1.260544] .................................... done.


And I think Ray traced the point in Linux where Linux gives us an IRQ ==
112 (which is the one causing issues):

__acpi_register_gsi->
        acpi_register_gsi_ioapic->
                mp_map_gsi_to_irq->
                        mp_map_pin_to_irq->
                                __irq_resolve_mapping()

        if (likely(data)) {
                desc = irq_data_to_desc(data);
                if (irq)
                        *irq = data->irq;
                /* this IRQ is 112, IO-APIC-34 domain */
        }

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-16 23:19             ` Stefano Stabellini
@ 2023-03-17  8:39               ` Jan Beulich
  2023-03-17  9:51                 ` Roger Pau Monné
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-17  8:39 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Roger Pau Monné,
	Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On 17.03.2023 00:19, Stefano Stabellini wrote:
> On Thu, 16 Mar 2023, Jan Beulich wrote:
>> So yes, it then all boils down to that Linux-
>> internal question.
> 
> Excellent question but we'll have to wait for Ray as he is the one with
> access to the hardware. But I have this data I can share in the
> meantime:
> 
> [    1.260378] IRQ to pin mappings:
> [    1.260387] IRQ1 -> 0:1
> [    1.260395] IRQ2 -> 0:2
> [    1.260403] IRQ3 -> 0:3
> [    1.260410] IRQ4 -> 0:4
> [    1.260418] IRQ5 -> 0:5
> [    1.260425] IRQ6 -> 0:6
> [    1.260432] IRQ7 -> 0:7
> [    1.260440] IRQ8 -> 0:8
> [    1.260447] IRQ9 -> 0:9
> [    1.260455] IRQ10 -> 0:10
> [    1.260462] IRQ11 -> 0:11
> [    1.260470] IRQ12 -> 0:12
> [    1.260478] IRQ13 -> 0:13
> [    1.260485] IRQ14 -> 0:14
> [    1.260493] IRQ15 -> 0:15
> [    1.260505] IRQ106 -> 1:8
> [    1.260513] IRQ112 -> 1:4
> [    1.260521] IRQ116 -> 1:13
> [    1.260529] IRQ117 -> 1:14
> [    1.260537] IRQ118 -> 1:15
> [    1.260544] .................................... done.

And what does Linux think are IRQs 16 ... 105? Have you compared with
Linux running baremetal on the same hardware?

Jan

> And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> 112 (which is the one causing issues):
> 
> __acpi_register_gsi->
>         acpi_register_gsi_ioapic->
>                 mp_map_gsi_to_irq->
>                         mp_map_pin_to_irq->
>                                 __irq_resolve_mapping()
> 
>         if (likely(data)) {
>                 desc = irq_data_to_desc(data);
>                 if (irq)
>                         *irq = data->irq;
>                 /* this IRQ is 112, IO-APIC-34 domain */
>         }



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17  8:39               ` Jan Beulich
@ 2023-03-17  9:51                 ` Roger Pau Monné
  2023-03-17 18:15                   ` Stefano Stabellini
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-17  9:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Huang Rui, Anthony PERARD, xen-devel,
	Alex Deucher, Christian König, Stewart Hildebrand,
	Xenia Ragiadakou, Honglei Huang, Julia Zhang, Chen Jiqian

On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> On 17.03.2023 00:19, Stefano Stabellini wrote:
> > On Thu, 16 Mar 2023, Jan Beulich wrote:
> >> So yes, it then all boils down to that Linux-
> >> internal question.
> > 
> > Excellent question but we'll have to wait for Ray as he is the one with
> > access to the hardware. But I have this data I can share in the
> > meantime:
> > 
> > [    1.260378] IRQ to pin mappings:
> > [    1.260387] IRQ1 -> 0:1
> > [    1.260395] IRQ2 -> 0:2
> > [    1.260403] IRQ3 -> 0:3
> > [    1.260410] IRQ4 -> 0:4
> > [    1.260418] IRQ5 -> 0:5
> > [    1.260425] IRQ6 -> 0:6
> > [    1.260432] IRQ7 -> 0:7
> > [    1.260440] IRQ8 -> 0:8
> > [    1.260447] IRQ9 -> 0:9
> > [    1.260455] IRQ10 -> 0:10
> > [    1.260462] IRQ11 -> 0:11
> > [    1.260470] IRQ12 -> 0:12
> > [    1.260478] IRQ13 -> 0:13
> > [    1.260485] IRQ14 -> 0:14
> > [    1.260493] IRQ15 -> 0:15
> > [    1.260505] IRQ106 -> 1:8
> > [    1.260513] IRQ112 -> 1:4
> > [    1.260521] IRQ116 -> 1:13
> > [    1.260529] IRQ117 -> 1:14
> > [    1.260537] IRQ118 -> 1:15
> > [    1.260544] .................................... done.
> 
> And what does Linux think are IRQs 16 ... 105? Have you compared with
> Linux running baremetal on the same hardware?

So I have some emails from Ray from the time he was looking into this,
and on Linux dom0 PVH dmesg there is:

[    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
[    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55

So it seems the vIO-APIC data provided by Xen to dom0 is at least
consistent.
 
> > And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> > 112 (which is the one causing issues):
> > 
> > __acpi_register_gsi->
> >         acpi_register_gsi_ioapic->
> >                 mp_map_gsi_to_irq->
> >                         mp_map_pin_to_irq->
> >                                 __irq_resolve_mapping()
> > 
> >         if (likely(data)) {
> >                 desc = irq_data_to_desc(data);
> >                 if (irq)
> >                         *irq = data->irq;
> >                 /* this IRQ is 112, IO-APIC-34 domain */
> >         }


Could this all be a result of patch 4/5 in the Linux series ("[RFC
PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
__acpi_register_gsi hook is installed for PVH in order to setup GSIs
using PHYSDEV ops instead of doing it natively from the IO-APIC?

FWIW, the introduced function in that patch
(acpi_register_gsi_xen_pvh()) seems to unconditionally call
acpi_register_gsi_ioapic() without checking if the GSI is already
registered, which might lead to multiple IRQs being allocated for the
same underlying GSI?
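
For illustration, a hedged sketch of the kind of guard being suggested,
in the style of the Linux code under discussion; the _guarded name is
illustrative, and the acpi_gsi_to_irq() return convention relied on here
is an assumption, not taken from the RFC patch:

    /* Hedged sketch, not the actual RFC patch: have the PVH
     * __acpi_register_gsi hook reuse an existing mapping instead of
     * registering the GSI a second time. */
    static int acpi_register_gsi_xen_pvh_guarded(struct device *dev, u32 gsi,
                                                 int trigger, int polarity)
    {
        unsigned int irq;

        if (acpi_gsi_to_irq(gsi, &irq) == 0)
            return irq;   /* GSI already registered: reuse its IRQ */

        /* first registration: IO-APIC path first, then the Xen
         * PHYSDEVOP_setup_gsi/map_pirq calls would follow */
        return acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
    }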

As I commented there, I think that approach is wrong.  If the GSI has
not been mapped in Xen (because dom0 hasn't unmasked the respective
IO-APIC pin) we should add some logic in the toolstack to map it
before attempting to bind.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17  9:51                 ` Roger Pau Monné
@ 2023-03-17 18:15                   ` Stefano Stabellini
  2023-03-17 19:48                     ` Roger Pau Monné
  0 siblings, 1 reply; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-17 18:15 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Stefano Stabellini, Huang Rui, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian

On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> > On 17.03.2023 00:19, Stefano Stabellini wrote:
> > > On Thu, 16 Mar 2023, Jan Beulich wrote:
> > >> So yes, it then all boils down to that Linux-
> > >> internal question.
> > > 
> > > Excellent question but we'll have to wait for Ray as he is the one with
> > > access to the hardware. But I have this data I can share in the
> > > meantime:
> > > 
> > > [    1.260378] IRQ to pin mappings:
> > > [    1.260387] IRQ1 -> 0:1
> > > [    1.260395] IRQ2 -> 0:2
> > > [    1.260403] IRQ3 -> 0:3
> > > [    1.260410] IRQ4 -> 0:4
> > > [    1.260418] IRQ5 -> 0:5
> > > [    1.260425] IRQ6 -> 0:6
> > > [    1.260432] IRQ7 -> 0:7
> > > [    1.260440] IRQ8 -> 0:8
> > > [    1.260447] IRQ9 -> 0:9
> > > [    1.260455] IRQ10 -> 0:10
> > > [    1.260462] IRQ11 -> 0:11
> > > [    1.260470] IRQ12 -> 0:12
> > > [    1.260478] IRQ13 -> 0:13
> > > [    1.260485] IRQ14 -> 0:14
> > > [    1.260493] IRQ15 -> 0:15
> > > [    1.260505] IRQ106 -> 1:8
> > > [    1.260513] IRQ112 -> 1:4
> > > [    1.260521] IRQ116 -> 1:13
> > > [    1.260529] IRQ117 -> 1:14
> > > [    1.260537] IRQ118 -> 1:15
> > > [    1.260544] .................................... done.
> > 
> > And what does Linux think are IRQs 16 ... 105? Have you compared with
> > Linux running baremetal on the same hardware?
> 
> So I have some emails from Ray from the time he was looking into this,
> and on Linux dom0 PVH dmesg there is:
> 
> [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
> [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
> 
> So it seems the vIO-APIC data provided by Xen to dom0 is at least
> consistent.
>  
> > > And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> > > 112 (which is the one causing issues):
> > > 
> > > __acpi_register_gsi->
> > >         acpi_register_gsi_ioapic->
> > >                 mp_map_gsi_to_irq->
> > >                         mp_map_pin_to_irq->
> > >                                 __irq_resolve_mapping()
> > > 
> > >         if (likely(data)) {
> > >                 desc = irq_data_to_desc(data);
> > >                 if (irq)
> > >                         *irq = data->irq;
> > >                 /* this IRQ is 112, IO-APIC-34 domain */
> > >         }
> 
> 
> Could this all be a result of patch 4/5 in the Linux series ("[RFC
> PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
> __acpi_register_gsi hook is installed for PVH in order to setup GSIs
> using PHYSDEV ops instead of doing it natively from the IO-APIC?
> 
> FWIW, the introduced function in that patch
> (acpi_register_gsi_xen_pvh()) seems to unconditionally call
> acpi_register_gsi_ioapic() without checking if the GSI is already
> registered, which might lead to multiple IRQs being allocated for the
> same underlying GSI?

I understand this point and I think it needs investigating.


> As I commented there, I think that approach is wrong.  If the GSI has
> not been mapped in Xen (because dom0 hasn't unmasked the respective
> IO-APIC pin) we should add some logic in the toolstack to map it
> before attempting to bind.

But this statement confuses me. The toolstack doesn't get involved in
IRQ setup for PCI devices for HVM guests? Keep in mind that this is a
regular HVM guest creation on PVH Dom0, so normally the IRQ setup is
done by QEMU, and QEMU already calls xc_physdev_map_pirq and
xc_domain_bind_pt_pci_irq. So I don't follow your statement about "the
toolstack to map it before attempting to bind".

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17 18:15                   ` Stefano Stabellini
@ 2023-03-17 19:48                     ` Roger Pau Monné
  2023-03-17 20:55                       ` Stefano Stabellini
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-17 19:48 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Jan Beulich, Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> > On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> > > On 17.03.2023 00:19, Stefano Stabellini wrote:
> > > > On Thu, 16 Mar 2023, Jan Beulich wrote:
> > > >> So yes, it then all boils down to that Linux-
> > > >> internal question.
> > > > 
> > > > Excellent question but we'll have to wait for Ray as he is the one with
> > > > access to the hardware. But I have this data I can share in the
> > > > meantime:
> > > > 
> > > > [    1.260378] IRQ to pin mappings:
> > > > [    1.260387] IRQ1 -> 0:1
> > > > [    1.260395] IRQ2 -> 0:2
> > > > [    1.260403] IRQ3 -> 0:3
> > > > [    1.260410] IRQ4 -> 0:4
> > > > [    1.260418] IRQ5 -> 0:5
> > > > [    1.260425] IRQ6 -> 0:6
> > > > [    1.260432] IRQ7 -> 0:7
> > > > [    1.260440] IRQ8 -> 0:8
> > > > [    1.260447] IRQ9 -> 0:9
> > > > [    1.260455] IRQ10 -> 0:10
> > > > [    1.260462] IRQ11 -> 0:11
> > > > [    1.260470] IRQ12 -> 0:12
> > > > [    1.260478] IRQ13 -> 0:13
> > > > [    1.260485] IRQ14 -> 0:14
> > > > [    1.260493] IRQ15 -> 0:15
> > > > [    1.260505] IRQ106 -> 1:8
> > > > [    1.260513] IRQ112 -> 1:4
> > > > [    1.260521] IRQ116 -> 1:13
> > > > [    1.260529] IRQ117 -> 1:14
> > > > [    1.260537] IRQ118 -> 1:15
> > > > [    1.260544] .................................... done.
> > > 
> > > And what does Linux think are IRQs 16 ... 105? Have you compared with
> > > Linux running baremetal on the same hardware?
> > 
> > So I have some emails from Ray from the time he was looking into this,
> > and on Linux dom0 PVH dmesg there is:
> > 
> > [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
> > [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
> > 
> > So it seems the vIO-APIC data provided by Xen to dom0 is at least
> > consistent.
> >  
> > > > And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> > > > 112 (which is the one causing issues):
> > > > 
> > > > __acpi_register_gsi->
> > > >         acpi_register_gsi_ioapic->
> > > >                 mp_map_gsi_to_irq->
> > > >                         mp_map_pin_to_irq->
> > > >                                 __irq_resolve_mapping()
> > > > 
> > > >         if (likely(data)) {
> > > >                 desc = irq_data_to_desc(data);
> > > >                 if (irq)
> > > >                         *irq = data->irq;
> > > >                 /* this IRQ is 112, IO-APIC-34 domain */
> > > >         }
> > 
> > 
> > Could this all be a result of patch 4/5 in the Linux series ("[RFC
> > PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
> > __acpi_register_gsi hook is installed for PVH in order to setup GSIs
> > using PHYSDEV ops instead of doing it natively from the IO-APIC?
> > 
> > FWIW, the introduced function in that patch
> > (acpi_register_gsi_xen_pvh()) seems to unconditionally call
> > acpi_register_gsi_ioapic() without checking if the GSI is already
> > registered, which might lead to multiple IRQs being allocated for the
> > same underlying GSI?
> 
> I understand this point and I think it needs investigating.
> 
> 
> > As I commented there, I think that approach is wrong.  If the GSI has
> > not been mapped in Xen (because dom0 hasn't unmasked the respective
> > IO-APIC pin) we should add some logic in the toolstack to map it
> > before attempting to bind.
> 
> But this statement confuses me. The toolstack doesn't get involved in
> IRQ setup for PCI devices for HVM guests?

It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
could be removed (maybe for qemu-trad only?) or it's also required by
QEMU upstream, I would have to investigate more.  It's my
understanding it's in pci_add_dm_done() where Ray was getting the
mismatched IRQ vs GSI number.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17 19:48                     ` Roger Pau Monné
@ 2023-03-17 20:55                       ` Stefano Stabellini
  2023-03-20 15:16                         ` Roger Pau Monné
  2023-07-31 16:40                         ` Chen, Jiqian
  0 siblings, 2 replies; 75+ messages in thread
From: Stefano Stabellini @ 2023-03-17 20:55 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Jan Beulich, Huang Rui, Anthony PERARD,
	xen-devel, Alex Deucher, Christian König,
	Stewart Hildebrand, Xenia Ragiadakou, Honglei Huang, Julia Zhang,
	Chen Jiqian

On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
> > On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> > > On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> > > > On 17.03.2023 00:19, Stefano Stabellini wrote:
> > > > > On Thu, 16 Mar 2023, Jan Beulich wrote:
> > > > >> So yes, it then all boils down to that Linux-
> > > > >> internal question.
> > > > > 
> > > > > Excellent question but we'll have to wait for Ray as he is the one with
> > > > > access to the hardware. But I have this data I can share in the
> > > > > meantime:
> > > > > 
> > > > > [    1.260378] IRQ to pin mappings:
> > > > > [    1.260387] IRQ1 -> 0:1
> > > > > [    1.260395] IRQ2 -> 0:2
> > > > > [    1.260403] IRQ3 -> 0:3
> > > > > [    1.260410] IRQ4 -> 0:4
> > > > > [    1.260418] IRQ5 -> 0:5
> > > > > [    1.260425] IRQ6 -> 0:6
> > > > > [    1.260432] IRQ7 -> 0:7
> > > > > [    1.260440] IRQ8 -> 0:8
> > > > > [    1.260447] IRQ9 -> 0:9
> > > > > [    1.260455] IRQ10 -> 0:10
> > > > > [    1.260462] IRQ11 -> 0:11
> > > > > [    1.260470] IRQ12 -> 0:12
> > > > > [    1.260478] IRQ13 -> 0:13
> > > > > [    1.260485] IRQ14 -> 0:14
> > > > > [    1.260493] IRQ15 -> 0:15
> > > > > [    1.260505] IRQ106 -> 1:8
> > > > > [    1.260513] IRQ112 -> 1:4
> > > > > [    1.260521] IRQ116 -> 1:13
> > > > > [    1.260529] IRQ117 -> 1:14
> > > > > [    1.260537] IRQ118 -> 1:15
> > > > > [    1.260544] .................................... done.
> > > > 
> > > > And what does Linux think are IRQs 16 ... 105? Have you compared with
> > > > Linux running baremetal on the same hardware?
> > > 
> > > So I have some emails from Ray from the time he was looking into this,
> > > and on Linux dom0 PVH dmesg there is:
> > > 
> > > [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
> > > [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
> > > 
> > > So it seems the vIO-APIC data provided by Xen to dom0 is at least
> > > consistent.
> > >  
> > > > > And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> > > > > 112 (which is the one causing issues):
> > > > > 
> > > > > __acpi_register_gsi->
> > > > >         acpi_register_gsi_ioapic->
> > > > >                 mp_map_gsi_to_irq->
> > > > >                         mp_map_pin_to_irq->
> > > > >                                 __irq_resolve_mapping()
> > > > > 
> > > > >         if (likely(data)) {
> > > > >                 desc = irq_data_to_desc(data);
> > > > >                 if (irq)
> > > > >                         *irq = data->irq;
> > > > >                 /* this IRQ is 112, IO-APIC-34 domain */
> > > > >         }
> > > 
> > > 
> > > Could this all be a result of patch 4/5 in the Linux series ("[RFC
> > > PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
> > > __acpi_register_gsi hook is installed for PVH in order to setup GSIs
> > > using PHYSDEV ops instead of doing it natively from the IO-APIC?
> > > 
> > > FWIW, the introduced function in that patch
> > > (acpi_register_gsi_xen_pvh()) seems to unconditionally call
> > > acpi_register_gsi_ioapic() without checking if the GSI is already
> > > registered, which might lead to multiple IRQs being allocated for the
> > > same underlying GSI?
> > 
> > I understand this point and I think it needs investigating.
> > 
> > 
> > > As I commented there, I think that approach is wrong.  If the GSI has
> > > not been mapped in Xen (because dom0 hasn't unmasked the respective
> > > IO-APIC pin) we should add some logic in the toolstack to map it
> > > before attempting to bind.
> > 
> > But this statement confuses me. The toolstack doesn't get involved in
> > IRQ setup for PCI devices for HVM guests?
> 
> It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
> to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
> could be removed (maybe for qemu-trad only?) or it's also required by
> QEMU upstream, I would have to investigate more.

You are right. I am not certain, but it seems like a mistake in the
toolstack to me. In theory, pci_add_dm_done should only be needed for PV
guests, not for HVM guests. I am not sure. But I can see the call to
xc_physdev_map_pirq you were referring to now.


> It's my understanding it's in pci_add_dm_done() where Ray was getting
> the mismatched IRQ vs GSI number.

I think the mismatch was actually caused by the xc_physdev_map_pirq call
from QEMU, which makes sense because in any case it should happen before
the same call done by pci_add_dm_done (pci_add_dm_done is called after
sending the pci passthrough QMP command to QEMU). So the first to hit
the IRQ!=GSI problem would be QEMU.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17 20:55                       ` Stefano Stabellini
@ 2023-03-20 15:16                         ` Roger Pau Monné
  2023-03-20 15:29                           ` Jan Beulich
  2023-07-31 16:40                         ` Chen, Jiqian
  1 sibling, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-20 15:16 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Jan Beulich, Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian

On Fri, Mar 17, 2023 at 01:55:08PM -0700, Stefano Stabellini wrote:
> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> > On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
> > > On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> > > > On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> > > > > On 17.03.2023 00:19, Stefano Stabellini wrote:
> > > > > > On Thu, 16 Mar 2023, Jan Beulich wrote:
> > > > > >> So yes, it then all boils down to that Linux-
> > > > > >> internal question.
> > > > > > 
> > > > > > Excellent question but we'll have to wait for Ray as he is the one with
> > > > > > access to the hardware. But I have this data I can share in the
> > > > > > meantime:
> > > > > > 
> > > > > > [    1.260378] IRQ to pin mappings:
> > > > > > [    1.260387] IRQ1 -> 0:1
> > > > > > [    1.260395] IRQ2 -> 0:2
> > > > > > [    1.260403] IRQ3 -> 0:3
> > > > > > [    1.260410] IRQ4 -> 0:4
> > > > > > [    1.260418] IRQ5 -> 0:5
> > > > > > [    1.260425] IRQ6 -> 0:6
> > > > > > [    1.260432] IRQ7 -> 0:7
> > > > > > [    1.260440] IRQ8 -> 0:8
> > > > > > [    1.260447] IRQ9 -> 0:9
> > > > > > [    1.260455] IRQ10 -> 0:10
> > > > > > [    1.260462] IRQ11 -> 0:11
> > > > > > [    1.260470] IRQ12 -> 0:12
> > > > > > [    1.260478] IRQ13 -> 0:13
> > > > > > [    1.260485] IRQ14 -> 0:14
> > > > > > [    1.260493] IRQ15 -> 0:15
> > > > > > [    1.260505] IRQ106 -> 1:8
> > > > > > [    1.260513] IRQ112 -> 1:4
> > > > > > [    1.260521] IRQ116 -> 1:13
> > > > > > [    1.260529] IRQ117 -> 1:14
> > > > > > [    1.260537] IRQ118 -> 1:15
> > > > > > [    1.260544] .................................... done.
> > > > > 
> > > > > And what does Linux think are IRQs 16 ... 105? Have you compared with
> > > > > Linux running baremetal on the same hardware?
> > > > 
> > > > So I have some emails from Ray from the time he was looking into this,
> > > > and on Linux dom0 PVH dmesg there is:
> > > > 
> > > > [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
> > > > [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
> > > > 
> > > > So it seems the vIO-APIC data provided by Xen to dom0 is at least
> > > > consistent.
> > > >  
> > > > > > And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> > > > > > 112 (which is the one causing issues):
> > > > > > 
> > > > > > __acpi_register_gsi->
> > > > > >         acpi_register_gsi_ioapic->
> > > > > >                 mp_map_gsi_to_irq->
> > > > > >                         mp_map_pin_to_irq->
> > > > > >                                 __irq_resolve_mapping()
> > > > > > 
> > > > > >         if (likely(data)) {
> > > > > >                 desc = irq_data_to_desc(data);
> > > > > >                 if (irq)
> > > > > >                         *irq = data->irq;
> > > > > >                 /* this IRQ is 112, IO-APIC-34 domain */
> > > > > >         }
> > > > 
> > > > 
> > > > Could this all be a result of patch 4/5 in the Linux series ("[RFC
> > > > PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
> > > > __acpi_register_gsi hook is installed for PVH in order to setup GSIs
> > > > using PHYSDEV ops instead of doing it natively from the IO-APIC?
> > > > 
> > > > FWIW, the introduced function in that patch
> > > > (acpi_register_gsi_xen_pvh()) seems to unconditionally call
> > > > acpi_register_gsi_ioapic() without checking if the GSI is already
> > > > registered, which might lead to multiple IRQs being allocated for the
> > > > same underlying GSI?
> > > 
> > > I understand this point and I think it needs investigating.
> > > 
> > > 
> > > > As I commented there, I think that approach is wrong.  If the GSI has
> > > > not been mapped in Xen (because dom0 hasn't unmasked the respective
> > > > IO-APIC pin) we should add some logic in the toolstack to map it
> > > > before attempting to bind.
> > > 
> > > But this statement confuses me. The toolstack doesn't get involved in
> > > IRQ setup for PCI devices for HVM guests?
> > 
> > It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
> > to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
> > could be removed (maybe for qemu-trad only?) or it's also required by
> > QEMU upstream, I would have to investigate more.
> 
> You are right. I am not certain, but it seems like a mistake in the
> toolstack to me. In theory, pci_add_dm_done should only be needed for PV
> guests, not for HVM guests. I am not sure. But I can see the call to
> xc_physdev_map_pirq you were referring to now.
> 
> 
> > It's my understanding it's in pci_add_dm_done() where Ray was getting
> > the mismatched IRQ vs GSI number.
> 
> I think the mismatch was actually caused by the xc_physdev_map_pirq call
> from QEMU, which makes sense because in any case it should happen before
> the same call done by pci_add_dm_done (pci_add_dm_done is called after
> sending the pci passthrough QMP command to QEMU). So the first to hit
> the IRQ!=GSI problem would be QEMU.

I've been thinking about this a bit, and I think one of the possible
issues with the current handling of GSIs in a PVH dom0 is that GSIs
don't get registered until/unless they are unmasked.  I could see this
as a problem when doing passthrough: it's possible for a GSI (iow:
vIO-APIC pin) to never get unmasked on dom0, because the device
driver(s) are using MSI(-X) interrupts instead.  However, the IO-APIC
pin must be configured for it to be able to be mapped into a domU.

A possible solution is to propagate the vIO-APIC pin configuration
trigger/polarity when dom0 writes the low part of the redirection
table entry.
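
For context, the trigger and polarity live at architecturally defined
positions in the RTE's low half (per the 82093AA IO-APIC datasheet), so
they are indeed available at the time of such a write. A small sketch of
the relevant bit extraction:

#include <stdint.h>

/* IO-APIC redirection table entry, low 32 bits:
 *   bit 13: polarity     (0 = active high, 1 = active low)
 *   bit 15: trigger mode (0 = edge,        1 = level)
 *   bit 16: mask
 */
static inline unsigned int rte_polarity(uint32_t lo) { return (lo >> 13) & 1; }
static inline unsigned int rte_trigger(uint32_t lo)  { return (lo >> 15) & 1; }
static inline unsigned int rte_masked(uint32_t lo)   { return (lo >> 16) & 1; }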

The patch below enables the usage of PHYSDEVOP_{un,}map_pirq from PVH
domains (I need to assert this is secure even for domUs) and also
propagates the vIO-APIC pin trigger/polarity mode on writes to the
low part of the RTE.  Such propagation leads to the following
interrupt setup in Xen:

IRQ:   0 vec:f0 IO-APIC-edge    status=000 aff:{0}/{0} arch/x86/time.c#timer_interrupt()
IRQ:   1 vec:38 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:   2 vec:a8 IO-APIC-edge    status=000 aff:{0-7}/{0-7} no_action()
IRQ:   3 vec:f1 IO-APIC-edge    status=000 aff:{0-7}/{0-7} drivers/char/ns16550.c#ns16550_interrupt()
IRQ:   4 vec:40 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:   5 vec:48 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:   6 vec:50 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:   7 vec:58 IO-APIC-edge    status=006 aff:{0-7}/{0} mapped, unbound
IRQ:   8 vec:60 IO-APIC-edge    status=010 aff:{0}/{0} in-flight=0 d0:  8(-M-)
IRQ:   9 vec:68 IO-APIC-edge    status=010 aff:{0}/{0} in-flight=0 d0:  9(-M-)
IRQ:  10 vec:70 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  11 vec:78 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  12 vec:88 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  13 vec:90 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  14 vec:98 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  15 vec:a0 IO-APIC-edge    status=002 aff:{0-7}/{0} mapped, unbound
IRQ:  16 vec:b0 IO-APIC-edge    status=010 aff:{1}/{0-7} in-flight=0 d0: 16(-M-)
IRQ:  17 vec:b8 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound
IRQ:  18 vec:c0 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound
IRQ:  19 vec:c8 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound
IRQ:  20 vec:d0 IO-APIC-edge    status=010 aff:{1}/{0-7} in-flight=0 d0: 20(-M-)
IRQ:  21 vec:d8 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound
IRQ:  22 vec:e0 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound
IRQ:  23 vec:e8 IO-APIC-edge    status=002 aff:{0-7}/{0-7} mapped, unbound

Note how all GSIs on my box are now set up, even when no longer bound to
dom0.  The output without this patch looks like:

IRQ:   0 vec:f0 IO-APIC-edge    status=000 aff:{0}/{0} arch/x86/time.c#timer_interrupt()
IRQ:   1 vec:38 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:   3 vec:f1 IO-APIC-edge    status=000 aff:{0-7}/{0-7} drivers/char/ns16550.c#ns16550_interrupt()
IRQ:   4 vec:40 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:   5 vec:48 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:   6 vec:50 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:   7 vec:58 IO-APIC-edge    status=006 aff:{0}/{0} mapped, unbound
IRQ:   8 vec:d0 IO-APIC-edge    status=010 aff:{6}/{0-7} in-flight=0 d0:  8(-M-)
IRQ:   9 vec:a8 IO-APIC-level   status=010 aff:{2}/{0-7} in-flight=0 d0:  9(-M-)
IRQ:  10 vec:70 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  11 vec:78 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  12 vec:88 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  13 vec:90 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  14 vec:98 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  15 vec:a0 IO-APIC-edge    status=002 aff:{0}/{0} mapped, unbound
IRQ:  16 vec:e0 IO-APIC-level   status=010 aff:{6}/{0-7} in-flight=0 d0: 16(-M-),d1: 16(-M-)
IRQ:  20 vec:d8 IO-APIC-level   status=010 aff:{6}/{0-7} in-flight=0 d0: 20(-M-)

Legacy IRQs (below 16) are always registered.

With the patch above I seem to be able to do PCI passthrough to an HVM
domU from a PVH dom0.
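
For anyone wanting to reproduce this, a typical xl guest config fragment
for such a test might look as below; the BDF is whatever the GPU
enumerates as on the particular box:

# domU.cfg fragment (illustrative)
type = "hvm"
pci = [ "03:00.0" ]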

Regards, Roger.

---
diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 405d0a95af..cc53a3bd12 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -86,6 +86,8 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     {
     case PHYSDEVOP_map_pirq:
     case PHYSDEVOP_unmap_pirq:
+        break;
+
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 41e3c4d5e4..50e23a093c 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -180,9 +180,7 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
 
     /* Interrupt has been unmasked, bind it now. */
     ret = mp_register_gsi(gsi, trig, pol);
-    if ( ret == -EEXIST )
-        return 0;
-    if ( ret )
+    if ( ret && ret != -EEXIST )
     {
         gprintk(XENLOG_WARNING, "vioapic: error registering GSI %u: %d\n",
                  gsi, ret);
@@ -244,12 +242,18 @@ static void vioapic_write_redirent(
     }
     else
     {
+        int ret;
+
         unmasked = ent.fields.mask;
         /* Remote IRR and Delivery Status are read-only. */
         ent.bits = ((ent.bits >> 32) << 32) | val;
         ent.fields.delivery_status = 0;
         ent.fields.remote_irr = pent->fields.remote_irr;
         unmasked = unmasked && !ent.fields.mask;
+        ret = mp_register_gsi(gsi, ent.fields.trig_mode, ent.fields.polarity);
+        if ( ret && ret !=  -EEXIST )
+            gprintk(XENLOG_WARNING, "vioapic: error registering GSI %u: %d\n",
+                    gsi, ret);
     }
 
     *pent = ent;




* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-20 15:16                         ` Roger Pau Monné
@ 2023-03-20 15:29                           ` Jan Beulich
  2023-03-20 16:50                             ` Roger Pau Monné
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-20 15:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian, Stefano Stabellini

On 20.03.2023 16:16, Roger Pau Monné wrote:
> @@ -244,12 +242,18 @@ static void vioapic_write_redirent(
>      }
>      else
>      {
> +        int ret;
> +
>          unmasked = ent.fields.mask;
>          /* Remote IRR and Delivery Status are read-only. */
>          ent.bits = ((ent.bits >> 32) << 32) | val;
>          ent.fields.delivery_status = 0;
>          ent.fields.remote_irr = pent->fields.remote_irr;
>          unmasked = unmasked && !ent.fields.mask;
> +        ret = mp_register_gsi(gsi, ent.fields.trig_mode, ent.fields.polarity);
> +        if ( ret && ret !=  -EEXIST )
> +            gprintk(XENLOG_WARNING, "vioapic: error registering GSI %u: %d\n",
> +                    gsi, ret);
>      }

I assume this is only meant to be experimental, as I'm missing confinement
to Dom0 here. I also question this when the mask bit is set, as in that
case neither the trigger mode bit nor the polarity one can be relied upon.
At which point it would look to me as if it was necessary for Dom0 to use
a hypercall instead (which naturally would then be PHYSDEVOP_setup_gsi).
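
For reference, a rough sketch of what such a dom0-side registration could
look like, modelled on the existing Linux PV dom0 path in
arch/x86/pci/xen.c (function and variable names here are illustrative):

#include <asm/xen/hypercall.h>
#include <xen/interface/physdev.h>

/* Tell Xen a GSI's trigger/polarity explicitly, instead of relying on
 * the hypervisor snooping vIO-APIC RTE writes. */
static int register_gsi_with_xen(unsigned int gsi, int trigger, int polarity)
{
    struct physdev_setup_gsi setup_gsi = {
        .gsi = gsi,
        .triggering = trigger,   /* 0 = edge, 1 = level */
        .polarity = polarity,    /* 0 = active high, 1 = active low */
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
}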

Jan



* Re: [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0
  2023-03-12  7:54 [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Huang Rui
                   ` (6 preceding siblings ...)
  2023-03-13  7:24 ` [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Christian König
@ 2023-03-20 16:22 ` Huang Rui
  7 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-20 16:22 UTC (permalink / raw)
  To: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, Andrew Cooper,
	xen-devel
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian

Hi Jan, Roger, Stefano, Andrew,

Sorry for the late response; I was fully occupied by another problem last
week. I will reply to each mail one by one tomorrow. Thanks for your
patience. :-)

Thanks,
Ray

On Sun, Mar 12, 2023 at 03:54:49PM +0800, Huang, Ray wrote:
> Hi all,
> 
> In graphic world, the 3D applications/games are runing based on open
> graphic libraries such as OpenGL and Vulkan. Mesa is the Linux
> implemenatation of OpenGL and Vulkan for multiple hardware platforms.
> Because the graphic libraries would like to have the GPU hardware
> acceleration. In virtualization world, virtio-gpu and passthrough-gpu are
> two of gpu virtualization technologies.
> 
> Current Xen only supports OpenGL (virgl:
> https://docs.mesa3d.org/drivers/virgl.html) for virtio-gpu and passthrough
> gpu based on PV dom0 for x86 platform. Today, we would like to introduce
> Vulkan (venus: https://docs.mesa3d.org/drivers/venus.html) and another
> OpenGL on Vulkan (zink: https://docs.mesa3d.org/drivers/zink.html) support
> for VirtIO GPU on Xen. These functions are supported on KVM at this moment,
> but so far, they are not supported on Xen. And we also introduce the PCIe
> passthrough (GPU) function based on PVH dom0 for AMD x86 platform.
> 
> These supports required multiple repositories changes on kernel, xen, qemu,
> mesa, and virglrenderer. Please check below branches:
> 
> Kernel: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=upstream-fox-xen
> Xen: https://gitlab.com/huangrui123/xen/-/commits/upstream-for-xen
> QEMU: https://gitlab.com/huangrui123/qemu/-/commits/upstream-for-xen
> Mesa: https://gitlab.freedesktop.org/rui/mesa/-/commits/upstream-for-xen
> Virglrenderer: https://gitlab.freedesktop.org/rui/virglrenderer/-/commits/upstream-for-xen
> 
> In xen part, we mainly add the PCIe passthrough support on PVH dom0. It's
> using the QEMU to passthrough the GPU device into guest HVM domU. And
> mainly work is to transfer the interrupt by using gsi, vector, and pirq.
> 
> Below are the screenshot of these functions, please take a look.
> 
> Venus:
> https://drive.google.com/file/d/1_lPq6DMwHu1JQv7LUUVRx31dBj0HJYcL/view?usp=share_link
> 
> Zink:
> https://drive.google.com/file/d/1FxLmKu6X7uJOxx1ZzwOm1yA6IL5WMGzd/view?usp=share_link
> 
> Passthrough GPU:
> https://drive.google.com/file/d/17onr5gvDK8KM_LniHTSQEI2hGJZlI09L/view?usp=share_link
> 
> We are working to write the documentation that describe how to verify these
> functions in the xen wiki page. And will update it in the future version.
> 
> Thanks,
> Ray
> 
> Chen Jiqian (5):
>   vpci: accept BAR writes if dom0 is PVH
>   x86/pvh: shouldn't check pirq flag when map pirq in PVH
>   x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
>   tools/libs/call: add linux os call to get gsi from irq
>   tools/libs/light: pci: translate irq to gsi
> 
> Roger Pau Monne (1):
>   x86/pvh: report ACPI VFCT table to dom0 if present
> 
>  tools/include/xen-sys/Linux/privcmd.h |  7 +++++++
>  tools/include/xencall.h               |  2 ++
>  tools/include/xenctrl.h               |  2 ++
>  tools/libs/call/core.c                |  5 +++++
>  tools/libs/call/libxencall.map        |  2 ++
>  tools/libs/call/linux.c               | 14 ++++++++++++++
>  tools/libs/call/private.h             |  9 +++++++++
>  tools/libs/ctrl/xc_physdev.c          |  4 ++++
>  tools/libs/light/libxl_pci.c          |  1 +
>  xen/arch/x86/hvm/dom0_build.c         |  1 +
>  xen/arch/x86/hvm/hypercall.c          |  3 +--
>  xen/drivers/vpci/header.c             |  2 +-
>  xen/include/acpi/actbl3.h             |  1 +
>  13 files changed, 50 insertions(+), 3 deletions(-)
> 
> -- 
> 2.25.1
> 



* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-20 15:29                           ` Jan Beulich
@ 2023-03-20 16:50                             ` Roger Pau Monné
  0 siblings, 0 replies; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-20 16:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang Rui, Anthony PERARD, xen-devel, Alex Deucher,
	Christian König, Stewart Hildebrand, Xenia Ragiadakou,
	Honglei Huang, Julia Zhang, Chen Jiqian, Stefano Stabellini

On Mon, Mar 20, 2023 at 04:29:25PM +0100, Jan Beulich wrote:
> On 20.03.2023 16:16, Roger Pau Monné wrote:
> > @@ -244,12 +242,18 @@ static void vioapic_write_redirent(
> >      }
> >      else
> >      {
> > +        int ret;
> > +
> >          unmasked = ent.fields.mask;
> >          /* Remote IRR and Delivery Status are read-only. */
> >          ent.bits = ((ent.bits >> 32) << 32) | val;
> >          ent.fields.delivery_status = 0;
> >          ent.fields.remote_irr = pent->fields.remote_irr;
> >          unmasked = unmasked && !ent.fields.mask;
> > +        ret = mp_register_gsi(gsi, ent.fields.trig_mode, ent.fields.polarity);
> > +        if ( ret && ret !=  -EEXIST )
> > +            gprintk(XENLOG_WARNING, "vioapic: error registering GSI %u: %d\n",
> > +                    gsi, ret);
> >      }
> 
> I assume this is only meant to be experimental, as I'm missing confinement
> to Dom0 here.

Indeed.  I've attached a fixed version below, let's make sure this
doesn't influence testing.

> I also question this when the mask bit is set, as in that
> case neither the trigger mode bit nor the polarity one can be relied upon.
> At which point it would look to me as if it was necessary for Dom0 to use
> a hypercall instead (which naturally would then be PHYSDEVOP_setup_gsi).

AFAICT Linux does correctly set the trigger/polarity even when the
pins are masked, so this should be safe as a proof of concept. Let's
first figure out whether the issue is really with the lack of setup of
the IO-APIC pins.  In the end, without input from Ray this is just a
wild guess.

Regards, Roger.
----
diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index 405d0a95af..cc53a3bd12 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -86,6 +86,8 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     {
     case PHYSDEVOP_map_pirq:
     case PHYSDEVOP_unmap_pirq:
+        break;
+
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 41e3c4d5e4..64f7b5bcc5 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -180,9 +180,7 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
 
     /* Interrupt has been unmasked, bind it now. */
     ret = mp_register_gsi(gsi, trig, pol);
-    if ( ret == -EEXIST )
-        return 0;
-    if ( ret )
+    if ( ret && ret != -EEXIST )
     {
         gprintk(XENLOG_WARNING, "vioapic: error registering GSI %u: %d\n",
                  gsi, ret);
@@ -250,6 +248,16 @@ static void vioapic_write_redirent(
         ent.fields.delivery_status = 0;
         ent.fields.remote_irr = pent->fields.remote_irr;
         unmasked = unmasked && !ent.fields.mask;
+        if ( is_hardware_domain(d) )
+        {
+            int ret = mp_register_gsi(gsi, ent.fields.trig_mode,
+                                      ent.fields.polarity);
+
+            if ( ret && ret !=  -EEXIST )
+                gprintk(XENLOG_WARNING,
+                        "vioapic: error registering GSI %u: %d\n",
+                        gsi, ret);
+        }
     }
 
     *pent = ent;




* Re: [RFC XEN PATCH 1/6] x86/pvh: report ACPI VFCT table to dom0 if present
  2023-03-13 12:27       ` Andrew Cooper
@ 2023-03-21  6:26         ` Huang Rui
  0 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21  6:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Henry Wang

On Mon, Mar 13, 2023 at 08:27:02PM +0800, Andrew Cooper wrote:
> On 13/03/2023 12:21 pm, Roger Pau Monné wrote:
> > On Mon, Mar 13, 2023 at 11:55:56AM +0000, Andrew Cooper wrote:
> >> On 12/03/2023 7:54 am, Huang Rui wrote:
> >>> From: Roger Pau Monne <roger.pau@citrix.com>
> >>>
> >>> The VFCT ACPI table is used by AMD GPUs to expose the vbios ROM image
> >>> from the firmware instead of doing it on the PCI ROM on the physical
> >>> device.
> >>>
> >>> As such, this needs to be available for PVH dom0 to access, or else
> >>> the GPU won't work.
> >>>
> >>> Reported-by: Huang Rui <ray.huang@amd.com>
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>> Acked-and-Tested-by: Huang Rui <ray.huang@amd.com>
> >>> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
> >>> Signed-off-by: Huang Rui <ray.huang@amd.com>
> >> Huh...  Despite the release ack, this didn't get committed for 4.17.
> > There was a pending query from Jan as to where this table
> > signature was documented or at least registered, as it's not in the ACPI
> > spec or any related files.
> >
> > I don't oppose the change, as it's already used by Linux, so I
> > think it's impossible for the table signature to be reused, even if
> > not properly documented (it would cause havoc).
> >
> > It's however not ideal to set this kind of precedent.
> 
> It's not great, but this exists in real systems, for several generations
> it seems.
> 
> Making things work for users trumps any idealistic beliefs about
> firmware actually conforming to spec.
> 

Thanks, Andrew, for understanding! These tables have been there for more
than 10 years on all AMD GPU platforms.

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-14 16:02   ` Jan Beulich
@ 2023-03-21  9:36     ` Huang Rui
  2023-03-21  9:41       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-21  9:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

Hi Jan,

On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
> On 12.03.2023 08:54, Huang Rui wrote:
> > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > 
> > When dom0 is PVH and we want to pass through a GPU to the guest,
> > we should allow BAR writes even though the BAR is mapped. If
> > not, the values of the BARs are not initialized when the guest
> > first starts.
> 
> From this it doesn't become clear why a GPU would be special in this
> regard, or what (if any) prior bug there was. Are you suggesting ...
> 

You're right. This is in fact a bug we encountered while starting the
guest domU.

> > --- a/xen/drivers/vpci/header.c
> > +++ b/xen/drivers/vpci/header.c
> > @@ -392,7 +392,7 @@ static void cf_check bar_write(
> >       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
> >       * writes as long as the BAR is not mapped into the p2m.
> >       */
> > -    if ( bar->enabled )
> > +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> >      {
> >          /* If the value written is the current one avoid printing a warning. */
> >          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> 
> ... bar->enabled doesn't properly reflect the necessary state? It
> generally shouldn't be necessary to look at the physical device's
> state here.
> 
> Furthermore when you make a change in a case like this, the
> accompanying comment also needs updating (which might have clarified
> what, if anything, has been wrong).
> 

The problem is that when we start the domU for the first time, the enable
flag is already set while the passthrough device wants to write the real
PCIe BAR on the host. And yes, it's a temporary workaround; we should
figure out the root cause.

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21  9:36     ` Huang Rui
@ 2023-03-21  9:41       ` Jan Beulich
  2023-03-21 10:14         ` Huang Rui
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-21  9:41 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 21.03.2023 10:36, Huang Rui wrote:
> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
>> On 12.03.2023 08:54, Huang Rui wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
>>>       * writes as long as the BAR is not mapped into the p2m.
>>>       */
>>> -    if ( bar->enabled )
>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
>>>      {
>>>          /* If the value written is the current one avoid printing a warning. */
>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>
>> ... bar->enabled doesn't properly reflect the necessary state? It
>> generally shouldn't be necessary to look at the physical device's
>> state here.
>>
>> Furthermore when you make a change in a case like this, the
>> accompanying comment also needs updating (which might have clarified
>> what, if anything, has been wrong).
>>
> 
> The problem is that when we start the domU for the first time, the enable
> flag is already set while the passthrough device wants to write the real
> PCIe BAR on the host.

A pass-through device (i.e. one already owned by a DomU) should never
be allowed to write to the real BAR. But it's not clear whether I'm not
misinterpreting what you said ...

> And yes, it's a temporary workaround; we should
> figure out the root cause.

Right, that's the only way to approach this, imo.

Jan



* Re: [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-15 15:57   ` Roger Pau Monné
  2023-03-16  0:22     ` Stefano Stabellini
@ 2023-03-21 10:09     ` Huang Rui
  2023-03-21 10:17       ` Jan Beulich
  1 sibling, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-21 10:09 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian

On Wed, Mar 15, 2023 at 11:57:45PM +0800, Roger Pau Monné wrote:
> On Sun, Mar 12, 2023 at 03:54:52PM +0800, Huang Rui wrote:
> > From: Chen Jiqian <Jiqian.Chen@amd.com>
> > 
> > PVH is also an HVM-type domain, but PVH doesn't have the
> > X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
> > PHYSDEVOP_map_pirq, it will fail at the has_pirq() check.
> > 
> > Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
> > Signed-off-by: Huang Rui <ray.huang@amd.com>
> > ---
> >  xen/arch/x86/hvm/hypercall.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
> > index 405d0a95af..16a2f5c0b3 100644
> > --- a/xen/arch/x86/hvm/hypercall.c
> > +++ b/xen/arch/x86/hvm/hypercall.c
> > @@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      case PHYSDEVOP_eoi:
> >      case PHYSDEVOP_irq_status_query:
> >      case PHYSDEVOP_get_free_pirq:
> > -        if ( !has_pirq(currd) )
> > -            return -ENOSYS;
> 
> Since I've taken a look at the Linux side of this, it seems like you
> need PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi, but the latter is not
> in this list because it has never been available to HVM-type guests.

Do you mean HVM guests only support MSI(-X)?

> 
> I would like to better understand the usage by PVH dom0 for GSI
> passthrough before deciding on what to do here.  IIRC QEMU also uses
> PHYSDEVOP_{un,}map_pirq in order to allocate MSI(-X) interrupts.
> 

MSI(-X) interrupts don't work for the passthrough device in the domU even
when dom0 is a PV domain. It seems to be a common problem; I remember
Christian encountered a similar issue as well. So we fall back to using
GSI interrupts instead.

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21  9:41       ` Jan Beulich
@ 2023-03-21 10:14         ` Huang Rui
  2023-03-21 10:20           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-21 10:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
> On 21.03.2023 10:36, Huang Rui wrote:
> > On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
> >> On 12.03.2023 08:54, Huang Rui wrote:
> >>> --- a/xen/drivers/vpci/header.c
> >>> +++ b/xen/drivers/vpci/header.c
> >>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
> >>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
> >>>       * writes as long as the BAR is not mapped into the p2m.
> >>>       */
> >>> -    if ( bar->enabled )
> >>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> >>>      {
> >>>          /* If the value written is the current one avoid printing a warning. */
> >>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> >>
> >> ... bar->enabled doesn't properly reflect the necessary state? It
> >> generally shouldn't be necessary to look at the physical device's
> >> state here.
> >>
> >> Furthermore when you make a change in a case like this, the
> >> accompanying comment also needs updating (which might have clarified
> >> what, if anything, has been wrong).
> >>
> > 
> > The problem is that when we start the domU for the first time, the enable
> > flag is already set while the passthrough device wants to write the real
> > PCIe BAR on the host.
> 
> A pass-through device (i.e. one already owned by a DomU) should never
> be allowed to write to the real BAR. But it's not clear whether I'm not
> misinterpreting what you said ...
> 

OK, thanks for clarifying. May I ask how a passthrough device modifies a
PCI BAR correctly on Xen?

Thanks,
Ray

> > And yes, it's a temporary workaround; we should
> > figure out the root cause.
> 
> Right, that's the only way to approach this, imo.
> 
> Jan



* Re: [RFC XEN PATCH 3/6] x86/pvh: shouldn't check pirq flag when map pirq in PVH
  2023-03-21 10:09     ` Huang Rui
@ 2023-03-21 10:17       ` Jan Beulich
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-21 10:17 UTC (permalink / raw)
  To: Huang Rui
  Cc: Stefano Stabellini, Anthony PERARD, xen-devel, Deucher,
	Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné

On 21.03.2023 11:09, Huang Rui wrote:
> On Wed, Mar 15, 2023 at 11:57:45PM +0800, Roger Pau Monné wrote:
>> On Sun, Mar 12, 2023 at 03:54:52PM +0800, Huang Rui wrote:
>>> From: Chen Jiqian <Jiqian.Chen@amd.com>
>>>
>>> PVH is also an HVM-type domain, but PVH doesn't have the
>>> X86_EMU_USE_PIRQ flag. So, when dom0 is PVH and calls
>>> PHYSDEVOP_map_pirq, it will fail at the has_pirq() check.
>>>
>>> Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
>>> Signed-off-by: Huang Rui <ray.huang@amd.com>
>>> ---
>>>  xen/arch/x86/hvm/hypercall.c | 2 --
>>>  1 file changed, 2 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
>>> index 405d0a95af..16a2f5c0b3 100644
>>> --- a/xen/arch/x86/hvm/hypercall.c
>>> +++ b/xen/arch/x86/hvm/hypercall.c
>>> @@ -89,8 +89,6 @@ long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>      case PHYSDEVOP_eoi:
>>>      case PHYSDEVOP_irq_status_query:
>>>      case PHYSDEVOP_get_free_pirq:
>>> -        if ( !has_pirq(currd) )
>>> -            return -ENOSYS;
>>
>> Since I've taken a look at the Linux side of this, it seems like you
>> need PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi, but the latter is not
>> in this list because it has never been available to HVM-type guests.
> 
> Do you mean HVM guests only support MSI(-X)?

I don't think that was meant. Instead, as per discussion elsewhere, we
may need to make PHYSDEVOP_setup_gsi available to PVH Dom0. (DomU-s
wouldn't be allowed to use this sub-op, so the statement Roger made
simply doesn't apply to "HVM guest".)
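
A sketch of what such confinement might look like in hvm_physdev_op()
(illustrative only, not committed code):

    case PHYSDEVOP_setup_gsi:
        /* Only the (PVH) hardware domain may configure physical GSIs. */
        if ( !is_hardware_domain(currd) )
            return -ENOSYS;
        break;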

>> I would like to better understand the usage by PVH dom0 for GSI
>> passthrough before deciding on what to do here.  IIRC QEMU also uses
>> PHYSDEVOP_{un,}map_pirq in order to allocate MSI(-X) interrupts.
>>
> 
> MSI(-X) interrupts don't work for the passthrough device in the domU even
> when dom0 is a PV domain. It seems to be a common problem; I remember
> Christian encountered a similar issue as well. So we fall back to using
> GSI interrupts instead.

Looks like this wants figuring out properly as well then. MSI(-X)
generally works for pass-through devices, from all I know.
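
As a data point on the MSI usage Roger mentioned above, QEMU's MSI setup
goes through a dedicated libxenctrl helper rather than the GSI one. A
hedged sketch, with all identifiers illustrative:

#include <xenctrl.h>

/* Allocate a pirq for a device's MSI (not MSI-X), roughly as QEMU's
 * passthrough code does. */
static int map_msi_pirq(xc_interface *xch, uint32_t domid, int bus, int devfn)
{
    int pirq = -1;   /* -1: let Xen choose a free pirq */

    if ( xc_physdev_map_pirq_msi(xch, domid, -1, &pirq, devfn, bus,
                                 0 /* entry_nr */,
                                 0 /* MSI-X table base: unused for MSI */) )
        return -1;

    return pirq;
}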

Jan



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 10:14         ` Huang Rui
@ 2023-03-21 10:20           ` Jan Beulich
  2023-03-21 11:49             ` Huang Rui
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-21 10:20 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 21.03.2023 11:14, Huang Rui wrote:
> On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
>> On 21.03.2023 10:36, Huang Rui wrote:
>>> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
>>>> On 12.03.2023 08:54, Huang Rui wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>>>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
>>>>>       * writes as long as the BAR is not mapped into the p2m.
>>>>>       */
>>>>> -    if ( bar->enabled )
>>>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
>>>>>      {
>>>>>          /* If the value written is the current one avoid printing a warning. */
>>>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>>>
>>>> ... bar->enabled doesn't properly reflect the necessary state? It
>>>> generally shouldn't be necessary to look at the physical device's
>>>> state here.
>>>>
>>>> Furthermore when you make a change in a case like this, the
>>>> accompanying comment also needs updating (which might have clarified
>>>> what, if anything, has been wrong).
>>>>
>>>
>>> The problem is that when we start the domU for the first time, the enable
>>> flag is already set while the passthrough device wants to write the real
>>> PCIe BAR on the host.
>>
>> A pass-through device (i.e. one already owned by a DomU) should never
>> be allowed to write to the real BAR. But it's not clear whether I'm not
>> misinterpreting what you said ...
>>
> 
>>>> OK, thanks for clarifying. May I ask how a passthrough device modifies a
>>>> PCI BAR correctly on Xen?

A pass-through device may write to the virtual BAR, changing where in its
own memory space the MMIO range appears. But it cannot (and may not) alter
where in host memory space the (physical) MMIO range appears.
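
Conceptually (with invented field names, not the actual vPCI structures),
the model is:

#include <stdint.h>

/* Illustrative model: a guest BAR write moves the guest-visible MMIO
 * window; the host-side BAR stays wherever dom0/firmware placed it. */
struct model_bar {
    uint64_t guest_addr;   /* where the domU sees the MMIO window */
    uint64_t host_addr;    /* where it really is; never guest-writable */
};

static void model_bar_write_lo(struct model_bar *bar, uint32_t val)
{
    bar->guest_addr = (bar->guest_addr & ~0xffffffffULL)
                      | (val & ~0xfULL);  /* low 4 bits are flags */
    /* The p2m is then remapped so guest_addr points at host_addr. */
}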

Jan



* Re: [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0
  2023-03-13  7:24 ` [RFC XEN PATCH 0/6] Introduce VirtIO GPU and Passthrough GPU support on Xen PVH dom0 Christian König
@ 2023-03-21 10:26   ` Huang Rui
  0 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21 10:26 UTC (permalink / raw)
  To: Koenig, Christian
  Cc: Roger Pau Monné,
	Jan Beulich, Stefano Stabellini, Anthony PERARD, xen-devel,
	Deucher, Alexander, Hildebrand, Stewart, Xenia Ragiadakou, Huang,
	Honglei1, Zhang, Julia, Chen, Jiqian

On Mon, Mar 13, 2023 at 03:24:55PM +0800, Koenig, Christian wrote:
> Hi Ray,
> 
> one nit comment on the style; apart from that it looks technically correct.
> 
> But I'm *really* not an expert on all that stuff.

Christian, thanks anyway. :-)

Thanks,
Ray

> 
> Regards,
> Christian.
> 
> Am 12.03.23 um 08:54 schrieb Huang Rui:
> > Hi all,
> >
> > In graphic world, the 3D applications/games are runing based on open
> > graphic libraries such as OpenGL and Vulkan. Mesa is the Linux
> > implemenatation of OpenGL and Vulkan for multiple hardware platforms.
> > Because the graphic libraries would like to have the GPU hardware
> > acceleration. In virtualization world, virtio-gpu and passthrough-gpu are
> > two of gpu virtualization technologies.
> >
> > Current Xen only supports OpenGL (virgl:
> > https://docs.mesa3d.org/drivers/virgl.html) for virtio-gpu and passthrough
> > gpu based on PV dom0 for x86 platform. Today, we would like to introduce
> > Vulkan (venus: https://docs.mesa3d.org/drivers/venus.html) and another
> > OpenGL on Vulkan (zink: https://docs.mesa3d.org/drivers/zink.html) support
> > for VirtIO GPU on Xen. These functions are supported on KVM at this moment,
> > but so far, they are not supported on Xen. And we also introduce the PCIe
> > passthrough (GPU) function based on PVH dom0 for AMD x86 platform.
> >
> > These supports required multiple repositories changes on kernel, xen, qemu,
> > mesa, and virglrenderer. Please check below branches:
> >
> > Kernel: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=upstream-fox-xen
> > Xen: https://gitlab.com/huangrui123/xen/-/commits/upstream-for-xen
> > QEMU: https://gitlab.com/huangrui123/qemu/-/commits/upstream-for-xen
> > Mesa: https://gitlab.freedesktop.org/rui/mesa/-/commits/upstream-for-xen
> > Virglrenderer: https://gitlab.freedesktop.org/rui/virglrenderer/-/commits/upstream-for-xen
> >
> > In xen part, we mainly add the PCIe passthrough support on PVH dom0. It's
> > using the QEMU to passthrough the GPU device into guest HVM domU. And
> > mainly work is to transfer the interrupt by using gsi, vector, and pirq.
> >
> > Below are the screenshot of these functions, please take a look.
> >
> > Venus:
> > https://drive.google.com/file/d/1_lPq6DMwHu1JQv7LUUVRx31dBj0HJYcL/view?usp=share_link
> >
> > Zink:
> > https://drive.google.com/file/d/1FxLmKu6X7uJOxx1ZzwOm1yA6IL5WMGzd/view?usp=share_link
> >
> > Passthrough GPU:
> > https://drive.google.com/file/d/17onr5gvDK8KM_LniHTSQEI2hGJZlI09L/view?usp=share_link
> >
> > We are working to write the documentation that describe how to verify these
> > functions in the xen wiki page. And will update it in the future version.
> >
> > Thanks,
> > Ray
> >
> > Chen Jiqian (5):
> >    vpci: accept BAR writes if dom0 is PVH
> >    x86/pvh: shouldn't check pirq flag when map pirq in PVH
> >    x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
> >    tools/libs/call: add linux os call to get gsi from irq
> >    tools/libs/light: pci: translate irq to gsi
> >
> > Roger Pau Monne (1):
> >    x86/pvh: report ACPI VFCT table to dom0 if present
> >
> >   tools/include/xen-sys/Linux/privcmd.h |  7 +++++++
> >   tools/include/xencall.h               |  2 ++
> >   tools/include/xenctrl.h               |  2 ++
> >   tools/libs/call/core.c                |  5 +++++
> >   tools/libs/call/libxencall.map        |  2 ++
> >   tools/libs/call/linux.c               | 14 ++++++++++++++
> >   tools/libs/call/private.h             |  9 +++++++++
> >   tools/libs/ctrl/xc_physdev.c          |  4 ++++
> >   tools/libs/light/libxl_pci.c          |  1 +
> >   xen/arch/x86/hvm/dom0_build.c         |  1 +
> >   xen/arch/x86/hvm/hypercall.c          |  3 +--
> >   xen/drivers/vpci/header.c             |  2 +-
> >   xen/include/acpi/actbl3.h             |  1 +
> >   13 files changed, 50 insertions(+), 3 deletions(-)
> >
> 



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 10:20           ` Jan Beulich
@ 2023-03-21 11:49             ` Huang Rui
  2023-03-21 12:20               ` Roger Pau Monné
  2023-03-21 12:27               ` Jan Beulich
  0 siblings, 2 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21 11:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 06:20:03PM +0800, Jan Beulich wrote:
> On 21.03.2023 11:14, Huang Rui wrote:
> > On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
> >> On 21.03.2023 10:36, Huang Rui wrote:
> >>> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
> >>>> On 12.03.2023 08:54, Huang Rui wrote:
> >>>>> --- a/xen/drivers/vpci/header.c
> >>>>> +++ b/xen/drivers/vpci/header.c
> >>>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
> >>>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
> >>>>>       * writes as long as the BAR is not mapped into the p2m.
> >>>>>       */
> >>>>> -    if ( bar->enabled )
> >>>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> >>>>>      {
> >>>>>          /* If the value written is the current one avoid printing a warning. */
> >>>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> >>>>
> >>>> ... bar->enabled doesn't properly reflect the necessary state? It
> >>>> generally shouldn't be necessary to look at the physical device's
> >>>> state here.
> >>>>
> >>>> Furthermore when you make a change in a case like this, the
> >>>> accompanying comment also needs updating (which might have clarified
> >>>> what, if anything, has been wrong).
> >>>>
> >>>
> >>> The problem is that when we start the domU for the first time, the enable
> >>> flag is already set while the passthrough device wants to write the real
> >>> PCIe BAR on the host.
> >>
> >> A pass-through device (i.e. one already owned by a DomU) should never
> >> be allowed to write to the real BAR. But it's not clear whether I'm not
> >> misinterpreting what you said ...
> >>
> > 
> >>>> OK, thanks for clarifying. May I ask how a passthrough device modifies a
> >>>> PCI BAR correctly on Xen?
> 
> A pass-through device may write to the virtual BAR, changing where in its
> own memory space the MMIO range appears. But it cannot (and may not) alter
> where in host memory space the (physical) MMIO range appears.
> 

Thanks, but we found that if dom0 is a PV domain, the passthrough device
will reach this function and write the real BAR.

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 11:49             ` Huang Rui
@ 2023-03-21 12:20               ` Roger Pau Monné
  2023-03-21 12:25                 ` Jan Beulich
  2023-03-21 12:59                 ` Huang Rui
  2023-03-21 12:27               ` Jan Beulich
  1 sibling, 2 replies; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-21 12:20 UTC (permalink / raw)
  To: Huang Rui
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 07:49:26PM +0800, Huang Rui wrote:
> On Tue, Mar 21, 2023 at 06:20:03PM +0800, Jan Beulich wrote:
> > On 21.03.2023 11:14, Huang Rui wrote:
> > > On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
> > >> On 21.03.2023 10:36, Huang Rui wrote:
> > >>> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
> > >>>> On 12.03.2023 08:54, Huang Rui wrote:
> > >>>>> --- a/xen/drivers/vpci/header.c
> > >>>>> +++ b/xen/drivers/vpci/header.c
> > >>>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
> > >>>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
> > >>>>>       * writes as long as the BAR is not mapped into the p2m.
> > >>>>>       */
> > >>>>> -    if ( bar->enabled )
> > >>>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> > >>>>>      {
> > >>>>>          /* If the value written is the current one avoid printing a warning. */
> > >>>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> > >>>>
> > >>>> ... bar->enabled doesn't properly reflect the necessary state? It
> > >>>> generally shouldn't be necessary to look at the physical device's
> > >>>> state here.
> > >>>>
> > >>>> Furthermore when you make a change in a case like this, the
> > >>>> accompanying comment also needs updating (which might have clarified
> > >>>> what, if anything, has been wrong).
> > >>>>
> > >>>
> > >>> The problem is that when we start the domU for the first time, the enable
> > >>> flag is already set while the passthrough device wants to write the real
> > >>> PCIe BAR on the host.
> > >>
> > >> A pass-through device (i.e. one already owned by a DomU) should never
> > >> be allowed to write to the real BAR. But it's not clear whether I'm not
> > >> misinterpreting what you said ...
> > >>
> > > 
> > > > OK, thanks for clarifying. May I ask how a passthrough device modifies a
> > > > PCI BAR correctly on Xen?
> > 
> > A pass-through device may write to the virtual BAR, changing where in its
> > own memory space the MMIO range appears. But it cannot (and may not) alter
> > where in host memory space the (physical) MMIO range appears.
> > 
> 
> Thanks, but we found that if dom0 is a PV domain, the passthrough device
> will reach this function and write the real BAR.

I'm very confused now, are you trying to use vPCI with HVM domains?

As I understood it you are attempting to enable PCI passthrough for
HVM guests from a PVH dom0, but now you say your dom0 is PV?

Thanks, Roger.



* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-14 16:30   ` Jan Beulich
  2023-03-15 17:01     ` Andrew Cooper
@ 2023-03-21 12:22     ` Huang Rui
  1 sibling, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21 12:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Wed, Mar 15, 2023 at 12:30:21AM +0800, Jan Beulich wrote:
> On 12.03.2023 08:54, Huang Rui wrote:
> > From: Chen Jiqian <Jiqian.Chen@amd.com>
> 
> An empty description won't do here. First of all you need to address the Why?
> As already hinted at in the reply to the earlier patch, it looks like you're
> breaking the intended IRQ model for PVH.
> 

Sorry, I used a wrong patch without a commit message. Will fix it in the
next version.

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 12:20               ` Roger Pau Monné
@ 2023-03-21 12:25                 ` Jan Beulich
  2023-03-21 12:59                 ` Huang Rui
  1 sibling, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-21 12:25 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Stefano Stabellini, Anthony PERARD, xen-devel, Huang Rui

On 21.03.2023 13:20, Roger Pau Monné wrote:
> On Tue, Mar 21, 2023 at 07:49:26PM +0800, Huang Rui wrote:
>> On Tue, Mar 21, 2023 at 06:20:03PM +0800, Jan Beulich wrote:
>>> On 21.03.2023 11:14, Huang Rui wrote:
>>>> On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
>>>>> On 21.03.2023 10:36, Huang Rui wrote:
>>>>>> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
>>>>>>> On 12.03.2023 08:54, Huang Rui wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
>>>>>>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
>>>>>>>>       * writes as long as the BAR is not mapped into the p2m.
>>>>>>>>       */
>>>>>>>> -    if ( bar->enabled )
>>>>>>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
>>>>>>>>      {
>>>>>>>>          /* If the value written is the current one avoid printing a warning. */
>>>>>>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
>>>>>>>
>>>>>>> ... bar->enabled doesn't properly reflect the necessary state? It
>>>>>>> generally shouldn't be necessary to look at the physical device's
>>>>>>> state here.
>>>>>>>
>>>>>>> Furthermore when you make a change in a case like this, the
>>>>>>> accompanying comment also needs updating (which might have clarified
>>>>>>> what, if anything, has been wrong).
>>>>>>>
>>>>>>
>>>>>> The problem is that when we start the domU for the first time, the enable
>>>>>> flag is already set while the passthrough device wants to write the real
>>>>>> PCIe BAR on the host.
>>>>>
>>>>> A pass-through device (i.e. one already owned by a DomU) should never
>>>>> be allowed to write to the real BAR. But it's not clear whether I'm not
>>>>> misinterpreting what you said ...
>>>>>
>>>>
>>>> OK, thanks for clarifying. May I ask how a passthrough device modifies a
>>>> PCI BAR correctly on Xen?
>>>
>>> A pass-through device may write to the virtual BAR, changing where in its
>>> own memory space the MMIO range appears. But it cannot (and may not) alter
>>> where in host memory space the (physical) MMIO range appears.
>>>
>>
>> Thanks, but we found that if dom0 is a PV domain, the passthrough device
>> will reach this function and write the real BAR.
> 
> I'm very confused now, are you trying to use vPCI with HVM domains?
> 
> As I understood it you are attempting to enable PCI passthrough for
> HVM guests from a PVH dom0, but now you say your dom0 is PV?

I didn't read it like this. Instead my way of understanding the reply
is that they try to mimic on PVH Dom0 what they observe on PV Dom0.

Jan



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 11:49             ` Huang Rui
  2023-03-21 12:20               ` Roger Pau Monné
@ 2023-03-21 12:27               ` Jan Beulich
  2023-03-21 13:03                 ` Huang Rui
  1 sibling, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2023-03-21 12:27 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 21.03.2023 12:49, Huang Rui wrote:
> Thanks, but we found that if dom0 is a PV domain, the passthrough device
> will reach this function and write the real BAR.

Can you please be quite a bit more detailed about this? The specific code
paths taken (in upstream software) to result in such would be of interest.

Jan



* Re: [RFC XEN PATCH 4/6] x86/pvh: PVH dom0 also need PHYSDEVOP_setup_gsi call
  2023-03-15 17:01     ` Andrew Cooper
  2023-03-16  0:26       ` Stefano Stabellini
  2023-03-16  7:05       ` Jan Beulich
@ 2023-03-21 12:42       ` Huang Rui
  2 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21 12:42 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Thu, Mar 16, 2023 at 01:01:52AM +0800, Andrew Cooper wrote:
> On 14/03/2023 4:30 pm, Jan Beulich wrote:
> > On 12.03.2023 08:54, Huang Rui wrote:
> >> From: Chen Jiqian <Jiqian.Chen@amd.com>
> > An empty description won't do here. First of all you need to address the Why?
> > As already hinted at in the reply to the earlier patch, it looks like you're
> > breaking the intended IRQ model for PVH.
> 
> I think this is rather unfair.
> 
> Until you can point to the document which describes how IRQs are
> intended to work in PVH, I'd say this series is a pretty damn good attempt
> to make something that functions, in the absence of any guidance.
> 

Thank you, Andrew! This is the first time we have submitted Xen patches;
any comments are warmly appreciated. :-)

Thanks,
Ray



* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 12:20               ` Roger Pau Monné
  2023-03-21 12:25                 ` Jan Beulich
@ 2023-03-21 12:59                 ` Huang Rui
  1 sibling, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-21 12:59 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 08:20:53PM +0800, Roger Pau Monné wrote:
> On Tue, Mar 21, 2023 at 07:49:26PM +0800, Huang Rui wrote:
> > On Tue, Mar 21, 2023 at 06:20:03PM +0800, Jan Beulich wrote:
> > > On 21.03.2023 11:14, Huang Rui wrote:
> > > > On Tue, Mar 21, 2023 at 05:41:57PM +0800, Jan Beulich wrote:
> > > >> On 21.03.2023 10:36, Huang Rui wrote:
> > > >>> On Wed, Mar 15, 2023 at 12:02:35AM +0800, Jan Beulich wrote:
> > > >>>> On 12.03.2023 08:54, Huang Rui wrote:
> > > >>>>> --- a/xen/drivers/vpci/header.c
> > > >>>>> +++ b/xen/drivers/vpci/header.c
> > > >>>>> @@ -392,7 +392,7 @@ static void cf_check bar_write(
> > > >>>>>       * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
> > > >>>>>       * writes as long as the BAR is not mapped into the p2m.
> > > >>>>>       */
> > > >>>>> -    if ( bar->enabled )
> > > >>>>> +    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
> > > >>>>>      {
> > > >>>>>          /* If the value written is the current one avoid printing a warning. */
> > > >>>>>          if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
> > > >>>>
> > > >>>> ... bar->enabled doesn't properly reflect the necessary state? It
> > > >>>> generally shouldn't be necessary to look at the physical device's
> > > >>>> state here.
> > > >>>>
> > > >>>> Furthermore when you make a change in a case like this, the
> > > >>>> accompanying comment also needs updating (which might have clarified
> > > >>>> what, if anything, has been wrong).
> > > >>>>
> > > >>>
> > > >>> The problem is that when we start the domU for the first time, the enable
> > > >>> flag is already set while the passthrough device wants to write the real
> > > >>> PCIe BAR on the host.
> > > >>
> > > >> A pass-through device (i.e. one already owned by a DomU) should never
> > > >> be allowed to write to the real BAR. But it's not clear whether I'm not
> > > >> misinterpreting what you said ...
> > > >>
> > > > 
> > > > OK, thanks for clarifying. May I ask how a passthrough device modifies a
> > > > PCI BAR correctly on Xen?
> > > 
> > > A pass-through device may write to the virtual BAR, changing where in its
> > > own memory space the MMIO range appears. But it cannot (and may not) alter
> > > where in host memory space the (physical) MMIO range appears.
> > > 
> > 
> > Thanks, but we found that if dom0 is a PV domain, the passthrough device
> > will reach this function and write the real BAR.
> 
> I'm very confused now, are you trying to use vPCI with HVM domains?

We are using QEMU for passthrough at this moment.

> 
> As I understood it you are attempting to enable PCI passthrough for
> HVM guests from a PVH dom0, but now you say your dom0 is PV?
> 

Ah, sorry for the confusion; you're right. I am using PVH dom0 + HVM domU,
but we are comparing against the passthrough function on PV dom0 + HVM domU
as a reference.

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 12:27               ` Jan Beulich
@ 2023-03-21 13:03                 ` Huang Rui
  2023-03-22  7:28                   ` Huang Rui
  0 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-21 13:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 08:27:21PM +0800, Jan Beulich wrote:
> On 21.03.2023 12:49, Huang Rui wrote:
> > Thanks, but we found that if dom0 is a PV domain, the passthrough device
> > will reach this function and write the real BAR.
> 
> Can you please be quite a bit more detailed about this? The specific code
paths taken (in upstream software) to result in such would be of interest.
> 

Yes, please wait a moment; let me capture a trace dump on my side.

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-21 13:03                 ` Huang Rui
@ 2023-03-22  7:28                   ` Huang Rui
  2023-03-22  7:45                     ` Jan Beulich
  2023-03-22  9:34                     ` Roger Pau Monné
  0 siblings, 2 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-22  7:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Tue, Mar 21, 2023 at 09:03:58PM +0800, Huang Rui wrote:
> On Tue, Mar 21, 2023 at 08:27:21PM +0800, Jan Beulich wrote:
> > On 21.03.2023 12:49, Huang Rui wrote:
> > > Thanks, but we found that if dom0 is a PV domain, the passthrough device
> > > will reach this function and write the real BAR.
> > 
> > Can you please be quite a bit more detailed about this? The specific code
> > paths taken (in upstream software) to result in such would be of interest.
> > 
> 
> Yes, please wait a moment; let me capture a trace dump on my side.
> 

Sorry, we were wrong: with a Xen PV dom0, bar_write() won't be called.
Please ignore the information above.

While Xen initializes PVH dom0, it adds all PCI devices on the real bus,
including 0000:03:00.0 (VGA device: GPU) and 0000:03:00.1 (Audio device).

Audio is another function of the same PCIe device, but we won't use it here,
so we remove it afterwards.

Please see below xl dmesg:

(XEN) PCI add device 0000:03:00.0
(XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
(XEN) PCI add device 0000:03:00.1
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
(XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
(XEN) PCI add device 0000:04:00.0

...

(XEN) PCI add device 0000:07:00.7
(XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0010058 unimplemented
(XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0011020 unimplemented
(XEN) PCI remove device 0000:03:00.1

We run the command below to remove the audio function:

echo -n "1" > /sys/bus/pci/devices/0000:03:00.1/remove

(XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029b unimplemented
(XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029a unimplemented

Then we run "xl pci-assignable-add 03:00.0" to make the GPU assignable for
passthrough. At this point, a write to the real BAR is attempted.

(XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
(XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
(XEN) Xen WARN at drivers/vpci/header.c:408
(XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    8
(XEN) RIP:    e008:[<ffff82d040263cb9>] drivers/vpci/header.c#bar_write+0xc0/0x1ce
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v7)
(XEN) rax: ffff8303fc36d06c   rbx: ffff8303f90468b0   rcx: 0000000000000010
(XEN) rdx: 0000000000000002   rsi: ffff8303fc36a020   rdi: ffff8303fc36a018
(XEN) rbp: ffff8303fc367c18   rsp: ffff8303fc367be8   r8:  0000000000000001
(XEN) r9:  ffff8303fc36a010   r10: 0000000000000001   r11: 0000000000000001
(XEN) r12: 00000000d0700000   r13: ffff8303fc6d9230   r14: ffff8303fc6d9270
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003506e0
(XEN) cr3: 00000003fc3c4000   cr2: 00007f180f6371e8
(XEN) fsb: 00007fce655edbc0   gsb: ffff88822f3c0000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d040263cb9> (drivers/vpci/header.c#bar_write+0xc0/0x1ce):
(XEN)  b6 53 14 f6 c2 02 74 02 <0f> 0b 48 8b 03 45 84 ff 0f 85 ec 00 00 00 48 b9
(XEN) Xen stack trace from rsp=ffff8303fc367be8:
(XEN)    00000024fc367bf8 ffff8303f9046a50 0000000000000000 0000000000000004
(XEN)    0000000000000004 0000000000000024 ffff8303fc367ca0 ffff82d040263683
(XEN)    00000300fc367ca0 d070000003003501 00000024d0700000 ffff8303fc6d9230
(XEN)    0000000000000000 0000000000000000 0000002400000004 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000004 00000000d0700000
(XEN)    0000000000000024 0000000000000000 ffff82d040404bc0 ffff8303fc367cd0
(XEN)    ffff82d0402c60a8 0000030000000001 ffff8303fc367d88 0000000000000000
(XEN)    ffff8303fc610800 ffff8303fc367d30 ffff82d0402c54da ffff8303fc367ce0
(XEN)    ffff8303fc367fff 0000000000000004 ffff830300000004 00000000d0700000
(XEN)    ffff8303fc610800 ffff8303fc367d88 0000000000000001 0000000000000000
(XEN)    0000000000000000 ffff8303fc367d58 ffff82d0402c5570 0000000000000004
(XEN)    ffff8304065ea000 ffff8303fc367e28 ffff8303fc367dd0 ffff82d0402b5357
(XEN)    0000000000000cfc ffff8303fc621000 0000000000000000 0000000000000000
(XEN)    0000000000000cfc 00000000d0700000 0000000400000001 0001000000000000
(XEN)    0000000000000004 0000000000000004 0000000000000000 ffff8303fc367e44
(XEN)    ffff8304065ea000 ffff8303fc367e10 ffff82d0402b56d6 0000000000000000
(XEN)    ffff8303fc367e44 0000000000000004 0000000000000cfc ffff8304065e6000
(XEN)    0000000000000000 ffff8303fc367e30 ffff82d0402b6bcc ffff8303fc367e44
(XEN)    0000000000000001 ffff8303fc367e70 ffff82d0402c5e80 d070000040203490
(XEN)    000000000000007b ffff8303fc367ef8 ffff8304065e6000 ffff8304065ea000
(XEN) Xen call trace:
(XEN)    [<ffff82d040263cb9>] R drivers/vpci/header.c#bar_write+0xc0/0x1ce
(XEN)    [<ffff82d040263683>] F vpci_write+0x123/0x26c
(XEN)    [<ffff82d0402c60a8>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
(XEN)    [<ffff82d0402c54da>] F hvm_process_io_intercept+0x203/0x26f
(XEN)    [<ffff82d0402c5570>] F hvm_io_intercept+0x2a/0x4c
(XEN)    [<ffff82d0402b5357>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5eb
(XEN)    [<ffff82d0402b56d6>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
(XEN)    [<ffff82d0402b6bcc>] F hvmemul_do_pio_buffer+0x33/0x35
(XEN)    [<ffff82d0402c5e80>] F handle_pio+0x70/0x1b7
(XEN)    [<ffff82d04029dc7f>] F svm_vmexit_handler+0x10ba/0x18aa
(XEN)    [<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
(XEN)
(XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
(XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22  7:28                   ` Huang Rui
@ 2023-03-22  7:45                     ` Jan Beulich
  2023-03-22  9:34                     ` Roger Pau Monné
  1 sibling, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-22  7:45 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Roger Pau Monné,
	Stefano Stabellini, Anthony PERARD, xen-devel

On 22.03.2023 08:28, Huang Rui wrote:
> On Tue, Mar 21, 2023 at 09:03:58PM +0800, Huang Rui wrote:
>> On Tue, Mar 21, 2023 at 08:27:21PM +0800, Jan Beulich wrote:
>>> On 21.03.2023 12:49, Huang Rui wrote:
>>>> Thanks, but we found that if dom0 is a PV domain, the passthrough device
>>>> will reach this function and write the real BAR.
>>>
>>> Can you please be quite a bit more detailed about this? The specific code
>>> paths taken (in upstream software) to result in such would be of interest.
>>>
>>
>> Yes, please wait a moment; let me capture a trace dump on my side.
>>
> 
> Sorry, we were wrong: with a Xen PV dom0, bar_write() won't be called.
> Please ignore the information above.
> 
> While Xen initializes PVH dom0, it adds all PCI devices on the real bus,
> including 0000:03:00.0 (VGA device: GPU) and 0000:03:00.1 (Audio device).
> 
> Audio is another function of the same PCIe device, but we won't use it here,
> so we remove it afterwards.
> 
> Please see below xl dmesg:
> 
> (XEN) PCI add device 0000:03:00.0
> (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> (XEN) PCI add device 0000:03:00.1
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) PCI add device 0000:04:00.0
> 
> ...
> 
> (XEN) PCI add device 0000:07:00.7
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0010058 unimplemented
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0011020 unimplemented
> (XEN) PCI remove device 0000:03:00.1
> 
> We run the command below to remove the audio function:
> 
> echo -n "1" > /sys/bus/pci/devices/0000:03:00.1/remove

Why would you do that? Aiui this is a preparatory step to hot-unplug
the device, which surely you don't mean to do. (But this is largely
unrelated to the issue at hand; I'm merely curious.)

> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029b unimplemented
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029a unimplemented
> 
> Then we run "xl pci-assignable-add 03:00.0" to make the GPU assignable for
> passthrough. At this point, a write to the real BAR is attempted.

How do you conclude it's the "real" BAR? And where is this attempt
coming from? We refuse BAR updates for enabled BARs for a reason,
so possibly there's code elsewhere which needs adjusting.

> (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
> (XEN) Xen WARN at drivers/vpci/header.c:408

None of these exist in upstream code. Therefore, for the output you
supply to be meaningful, we also need to know what code changes you
made (which then tells us by how much line numbers have shifted, and
what e.g. the WARN_ON() condition is - it clearly isn't tied to
bar->enabled being true alone, or else there would have been a 2nd
instance at the bottom, unless of course you've stripped that).
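
For reference, the two guards being compared, as reconstructed from the hunk
quoted earlier in the thread (a sketch of just the relevant lines of
xen/drivers/vpci/header.c, not the full context):

    /* Upstream: refuse the write while the BAR is mapped into the p2m. */
    if ( bar->enabled )
        return;

    /* Patch under review: key off the physical command register instead. */
    if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
        return;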

Jan

> (XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    8
> (XEN) RIP:    e008:[<ffff82d040263cb9>] drivers/vpci/header.c#bar_write+0xc0/0x1ce
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v7)
> (XEN) rax: ffff8303fc36d06c   rbx: ffff8303f90468b0   rcx: 0000000000000010
> (XEN) rdx: 0000000000000002   rsi: ffff8303fc36a020   rdi: ffff8303fc36a018
> (XEN) rbp: ffff8303fc367c18   rsp: ffff8303fc367be8   r8:  0000000000000001
> (XEN) r9:  ffff8303fc36a010   r10: 0000000000000001   r11: 0000000000000001
> (XEN) r12: 00000000d0700000   r13: ffff8303fc6d9230   r14: ffff8303fc6d9270
> (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003506e0
> (XEN) cr3: 00000003fc3c4000   cr2: 00007f180f6371e8
> (XEN) fsb: 00007fce655edbc0   gsb: ffff88822f3c0000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d040263cb9> (drivers/vpci/header.c#bar_write+0xc0/0x1ce):
> (XEN)  b6 53 14 f6 c2 02 74 02 <0f> 0b 48 8b 03 45 84 ff 0f 85 ec 00 00 00 48 b9
> (XEN) Xen stack trace from rsp=ffff8303fc367be8:
> (XEN)    00000024fc367bf8 ffff8303f9046a50 0000000000000000 0000000000000004
> (XEN)    0000000000000004 0000000000000024 ffff8303fc367ca0 ffff82d040263683
> (XEN)    00000300fc367ca0 d070000003003501 00000024d0700000 ffff8303fc6d9230
> (XEN)    0000000000000000 0000000000000000 0000002400000004 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000004 00000000d0700000
> (XEN)    0000000000000024 0000000000000000 ffff82d040404bc0 ffff8303fc367cd0
> (XEN)    ffff82d0402c60a8 0000030000000001 ffff8303fc367d88 0000000000000000
> (XEN)    ffff8303fc610800 ffff8303fc367d30 ffff82d0402c54da ffff8303fc367ce0
> (XEN)    ffff8303fc367fff 0000000000000004 ffff830300000004 00000000d0700000
> (XEN)    ffff8303fc610800 ffff8303fc367d88 0000000000000001 0000000000000000
> (XEN)    0000000000000000 ffff8303fc367d58 ffff82d0402c5570 0000000000000004
> (XEN)    ffff8304065ea000 ffff8303fc367e28 ffff8303fc367dd0 ffff82d0402b5357
> (XEN)    0000000000000cfc ffff8303fc621000 0000000000000000 0000000000000000
> (XEN)    0000000000000cfc 00000000d0700000 0000000400000001 0001000000000000
> (XEN)    0000000000000004 0000000000000004 0000000000000000 ffff8303fc367e44
> (XEN)    ffff8304065ea000 ffff8303fc367e10 ffff82d0402b56d6 0000000000000000
> (XEN)    ffff8303fc367e44 0000000000000004 0000000000000cfc ffff8304065e6000
> (XEN)    0000000000000000 ffff8303fc367e30 ffff82d0402b6bcc ffff8303fc367e44
> (XEN)    0000000000000001 ffff8303fc367e70 ffff82d0402c5e80 d070000040203490
> (XEN)    000000000000007b ffff8303fc367ef8 ffff8304065e6000 ffff8304065ea000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040263cb9>] R drivers/vpci/header.c#bar_write+0xc0/0x1ce
> (XEN)    [<ffff82d040263683>] F vpci_write+0x123/0x26c
> (XEN)    [<ffff82d0402c60a8>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> (XEN)    [<ffff82d0402c54da>] F hvm_process_io_intercept+0x203/0x26f
> (XEN)    [<ffff82d0402c5570>] F hvm_io_intercept+0x2a/0x4c
> (XEN)    [<ffff82d0402b5357>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5eb
> (XEN)    [<ffff82d0402b56d6>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> (XEN)    [<ffff82d0402b6bcc>] F hvmemul_do_pio_buffer+0x33/0x35
> (XEN)    [<ffff82d0402c5e80>] F handle_pio+0x70/0x1b7
> (XEN)    [<ffff82d04029dc7f>] F svm_vmexit_handler+0x10ba/0x18aa
> (XEN)    [<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> (XEN)
> (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
> 
> Thanks,
> Ray



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22  7:28                   ` Huang Rui
  2023-03-22  7:45                     ` Jan Beulich
@ 2023-03-22  9:34                     ` Roger Pau Monné
  2023-03-22 12:33                       ` Huang Rui
  1 sibling, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-22  9:34 UTC (permalink / raw)
  To: Huang Rui
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Wed, Mar 22, 2023 at 03:28:58PM +0800, Huang Rui wrote:
> On Tue, Mar 21, 2023 at 09:03:58PM +0800, Huang Rui wrote:
> > On Tue, Mar 21, 2023 at 08:27:21PM +0800, Jan Beulich wrote:
> > > On 21.03.2023 12:49, Huang Rui wrote:
> > > > Thanks, but we found that if dom0 is a PV domain, the passthrough device
> > > > will reach this function and write the real BAR.
> > > 
> > > Can you please be quite a bit more detailed about this? The specific code
> > > paths taken (in upstream software) to result in such would be of interest.
> > > 
> > 
> > Yes, please wait a moment; let me capture a trace dump on my side.
> > 
> 
> Sorry, we were wrong: with a Xen PV dom0, bar_write() won't be called.
> Please ignore the information above.
> 
> While Xen initializes PVH dom0, it adds all PCI devices on the real bus,
> including 0000:03:00.0 (VGA device: GPU) and 0000:03:00.1 (Audio device).
> 
> Audio is another function of the same PCIe device, but we won't use it here,
> so we remove it afterwards.
> 
> Please see below xl dmesg:
> 
> (XEN) PCI add device 0000:03:00.0
> (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> (XEN) PCI add device 0000:03:00.1
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> (XEN) PCI add device 0000:04:00.0
> 
> ...
> 
> (XEN) PCI add device 0000:07:00.7
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0010058 unimplemented
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0011020 unimplemented
> (XEN) PCI remove device 0000:03:00.1
> 
> We run the command below to remove the audio function:
> 
> echo -n "1" > /sys/bus/pci/devices/0000:03:00.1/remove
> 
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029b unimplemented
> (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029a unimplemented
> 
> Then we run "xl pci-assignable-add 03:00.0" to make the GPU assignable for
> passthrough. At this point, a write to the real BAR is attempted.
> 
> (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
> (XEN) Xen WARN at drivers/vpci/header.c:408
> (XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    8
> (XEN) RIP:    e008:[<ffff82d040263cb9>] drivers/vpci/header.c#bar_write+0xc0/0x1ce
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v7)
> (XEN) rax: ffff8303fc36d06c   rbx: ffff8303f90468b0   rcx: 0000000000000010
> (XEN) rdx: 0000000000000002   rsi: ffff8303fc36a020   rdi: ffff8303fc36a018
> (XEN) rbp: ffff8303fc367c18   rsp: ffff8303fc367be8   r8:  0000000000000001
> (XEN) r9:  ffff8303fc36a010   r10: 0000000000000001   r11: 0000000000000001
> (XEN) r12: 00000000d0700000   r13: ffff8303fc6d9230   r14: ffff8303fc6d9270
> (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003506e0
> (XEN) cr3: 00000003fc3c4000   cr2: 00007f180f6371e8
> (XEN) fsb: 00007fce655edbc0   gsb: ffff88822f3c0000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d040263cb9> (drivers/vpci/header.c#bar_write+0xc0/0x1ce):
> (XEN)  b6 53 14 f6 c2 02 74 02 <0f> 0b 48 8b 03 45 84 ff 0f 85 ec 00 00 00 48 b9
> (XEN) Xen stack trace from rsp=ffff8303fc367be8:
> (XEN)    00000024fc367bf8 ffff8303f9046a50 0000000000000000 0000000000000004
> (XEN)    0000000000000004 0000000000000024 ffff8303fc367ca0 ffff82d040263683
> (XEN)    00000300fc367ca0 d070000003003501 00000024d0700000 ffff8303fc6d9230
> (XEN)    0000000000000000 0000000000000000 0000002400000004 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000004 00000000d0700000
> (XEN)    0000000000000024 0000000000000000 ffff82d040404bc0 ffff8303fc367cd0
> (XEN)    ffff82d0402c60a8 0000030000000001 ffff8303fc367d88 0000000000000000
> (XEN)    ffff8303fc610800 ffff8303fc367d30 ffff82d0402c54da ffff8303fc367ce0
> (XEN)    ffff8303fc367fff 0000000000000004 ffff830300000004 00000000d0700000
> (XEN)    ffff8303fc610800 ffff8303fc367d88 0000000000000001 0000000000000000
> (XEN)    0000000000000000 ffff8303fc367d58 ffff82d0402c5570 0000000000000004
> (XEN)    ffff8304065ea000 ffff8303fc367e28 ffff8303fc367dd0 ffff82d0402b5357
> (XEN)    0000000000000cfc ffff8303fc621000 0000000000000000 0000000000000000
> (XEN)    0000000000000cfc 00000000d0700000 0000000400000001 0001000000000000
> (XEN)    0000000000000004 0000000000000004 0000000000000000 ffff8303fc367e44
> (XEN)    ffff8304065ea000 ffff8303fc367e10 ffff82d0402b56d6 0000000000000000
> (XEN)    ffff8303fc367e44 0000000000000004 0000000000000cfc ffff8304065e6000
> (XEN)    0000000000000000 ffff8303fc367e30 ffff82d0402b6bcc ffff8303fc367e44
> (XEN)    0000000000000001 ffff8303fc367e70 ffff82d0402c5e80 d070000040203490
> (XEN)    000000000000007b ffff8303fc367ef8 ffff8304065e6000 ffff8304065ea000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040263cb9>] R drivers/vpci/header.c#bar_write+0xc0/0x1ce
> (XEN)    [<ffff82d040263683>] F vpci_write+0x123/0x26c
> (XEN)    [<ffff82d0402c60a8>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> (XEN)    [<ffff82d0402c54da>] F hvm_process_io_intercept+0x203/0x26f
> (XEN)    [<ffff82d0402c5570>] F hvm_io_intercept+0x2a/0x4c
> (XEN)    [<ffff82d0402b5357>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5eb
> (XEN)    [<ffff82d0402b56d6>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> (XEN)    [<ffff82d0402b6bcc>] F hvmemul_do_pio_buffer+0x33/0x35
> (XEN)    [<ffff82d0402c5e80>] F handle_pio+0x70/0x1b7
> (XEN)    [<ffff82d04029dc7f>] F svm_vmexit_handler+0x10ba/0x18aa
> (XEN)    [<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> (XEN)
> (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1

As Jan said, it's hard to figure out where the printks are placed without a
diff of your changes.

So far the above seems to be expected, as we currently don't handle BAR
register writes with memory decoding enabled.

Given the change proposed in this patch, can you check whether `bar->enabled ==
true` while the PCI command register has the memory decoding bit unset?

If so it would mean Xen state got out-of-sync with the hardware state, and we
would need to figure out where it happened.  Is there any backdoor in the AMD
GPU that allows disabling memory decoding without using the PCI command
register?
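
A minimal way to instrument that check, along the lines of the printks
already added (bar->enabled, pdev and pci_conf_read16() as used in
xen/drivers/vpci/header.c; the message text is illustrative only):

    uint16_t cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);

    /* vPCI thinks the BAR is mapped, yet the device has memory
     * decoding disabled: the cached state is out of sync with the
     * hardware. */
    if ( bar->enabled && !(cmd & PCI_COMMAND_MEMORY) )
        gprintk(XENLOG_WARNING, "%pp: bar->enabled set but decoding off\n",
                &pdev->sbdf);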

Regards, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22  9:34                     ` Roger Pau Monné
@ 2023-03-22 12:33                       ` Huang Rui
  2023-03-22 12:48                         ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-22 12:33 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Stefano Stabellini, Anthony PERARD, xen-devel

On Wed, Mar 22, 2023 at 05:34:41PM +0800, Roger Pau Monné wrote:
> On Wed, Mar 22, 2023 at 03:28:58PM +0800, Huang Rui wrote:
> > On Tue, Mar 21, 2023 at 09:03:58PM +0800, Huang Rui wrote:
> > > On Tue, Mar 21, 2023 at 08:27:21PM +0800, Jan Beulich wrote:
> > > > On 21.03.2023 12:49, Huang Rui wrote:
> > > > > Thanks, but we found that if dom0 is a PV domain, the passthrough device
> > > > > will reach this function and write the real BAR.
> > > > 
> > > > Can you please be quite a bit more detailed about this? The specific code
> > > > paths taken (in upstream software) to result in such would be of interest.
> > > > 
> > > 
> > > Yes, please wait a moment; let me capture a trace dump on my side.
> > > 
> > 
> > Sorry, we were wrong: with a Xen PV dom0, bar_write() won't be called.
> > Please ignore the information above.
> > 
> > While Xen initializes PVH dom0, it adds all PCI devices on the real bus,
> > including 0000:03:00.0 (VGA device: GPU) and 0000:03:00.1 (Audio device).
> > 
> > Audio is another function of the same PCIe device, but we won't use it here,
> > so we remove it afterwards.
> > 
> > Please see below xl dmesg:
> > 
> > (XEN) PCI add device 0000:03:00.0
> > (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:03:00.1 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:03:00.1 bar->enabled 0
> > (XEN) PCI add device 0000:03:00.1
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 391 0000:04:00.0 bar->enabled 0
> > (XEN) d0v0 bar_write Ray line 406 0000:04:00.0 bar->enabled 0
> > (XEN) PCI add device 0000:04:00.0
> > 
> > ...
> > 
> > (XEN) PCI add device 0000:07:00.7
> > (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0010058 unimplemented
> > (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc0011020 unimplemented
> > (XEN) PCI remove device 0000:03:00.1
> > 
> > We run the command below to remove the audio function:
> > 
> > echo -n "1" > /sys/bus/pci/devices/0000:03:00.1/remove
> > 
> > (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029b unimplemented
> > (XEN) arch/x86/hvm/svm/svm.c:2017:d0v0 RDMSR 0xc001029a unimplemented
> > 
> > Then we run "xl pci-assignable-add 03:00.0" to make the GPU assignable for
> > passthrough. At this point, a write to the real BAR is attempted.
> > 
> > (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> > (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
> > (XEN) Xen WARN at drivers/vpci/header.c:408
> > (XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Not tainted ]----
> > (XEN) CPU:    8
> > (XEN) RIP:    e008:[<ffff82d040263cb9>] drivers/vpci/header.c#bar_write+0xc0/0x1ce
> > (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v7)
> > (XEN) rax: ffff8303fc36d06c   rbx: ffff8303f90468b0   rcx: 0000000000000010
> > (XEN) rdx: 0000000000000002   rsi: ffff8303fc36a020   rdi: ffff8303fc36a018
> > (XEN) rbp: ffff8303fc367c18   rsp: ffff8303fc367be8   r8:  0000000000000001
> > (XEN) r9:  ffff8303fc36a010   r10: 0000000000000001   r11: 0000000000000001
> > (XEN) r12: 00000000d0700000   r13: ffff8303fc6d9230   r14: ffff8303fc6d9270
> > (XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003506e0
> > (XEN) cr3: 00000003fc3c4000   cr2: 00007f180f6371e8
> > (XEN) fsb: 00007fce655edbc0   gsb: ffff88822f3c0000   gss: 0000000000000000
> > (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> > (XEN) Xen code around <ffff82d040263cb9> (drivers/vpci/header.c#bar_write+0xc0/0x1ce):
> > (XEN)  b6 53 14 f6 c2 02 74 02 <0f> 0b 48 8b 03 45 84 ff 0f 85 ec 00 00 00 48 b9
> > (XEN) Xen stack trace from rsp=ffff8303fc367be8:
> > (XEN)    00000024fc367bf8 ffff8303f9046a50 0000000000000000 0000000000000004
> > (XEN)    0000000000000004 0000000000000024 ffff8303fc367ca0 ffff82d040263683
> > (XEN)    00000300fc367ca0 d070000003003501 00000024d0700000 ffff8303fc6d9230
> > (XEN)    0000000000000000 0000000000000000 0000002400000004 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000004 00000000d0700000
> > (XEN)    0000000000000024 0000000000000000 ffff82d040404bc0 ffff8303fc367cd0
> > (XEN)    ffff82d0402c60a8 0000030000000001 ffff8303fc367d88 0000000000000000
> > (XEN)    ffff8303fc610800 ffff8303fc367d30 ffff82d0402c54da ffff8303fc367ce0
> > (XEN)    ffff8303fc367fff 0000000000000004 ffff830300000004 00000000d0700000
> > (XEN)    ffff8303fc610800 ffff8303fc367d88 0000000000000001 0000000000000000
> > (XEN)    0000000000000000 ffff8303fc367d58 ffff82d0402c5570 0000000000000004
> > (XEN)    ffff8304065ea000 ffff8303fc367e28 ffff8303fc367dd0 ffff82d0402b5357
> > (XEN)    0000000000000cfc ffff8303fc621000 0000000000000000 0000000000000000
> > (XEN)    0000000000000cfc 00000000d0700000 0000000400000001 0001000000000000
> > (XEN)    0000000000000004 0000000000000004 0000000000000000 ffff8303fc367e44
> > (XEN)    ffff8304065ea000 ffff8303fc367e10 ffff82d0402b56d6 0000000000000000
> > (XEN)    ffff8303fc367e44 0000000000000004 0000000000000cfc ffff8304065e6000
> > (XEN)    0000000000000000 ffff8303fc367e30 ffff82d0402b6bcc ffff8303fc367e44
> > (XEN)    0000000000000001 ffff8303fc367e70 ffff82d0402c5e80 d070000040203490
> > (XEN)    000000000000007b ffff8303fc367ef8 ffff8304065e6000 ffff8304065ea000
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d040263cb9>] R drivers/vpci/header.c#bar_write+0xc0/0x1ce
> > (XEN)    [<ffff82d040263683>] F vpci_write+0x123/0x26c
> > (XEN)    [<ffff82d0402c60a8>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> > (XEN)    [<ffff82d0402c54da>] F hvm_process_io_intercept+0x203/0x26f
> > (XEN)    [<ffff82d0402c5570>] F hvm_io_intercept+0x2a/0x4c
> > (XEN)    [<ffff82d0402b5357>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5eb
> > (XEN)    [<ffff82d0402b56d6>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> > (XEN)    [<ffff82d0402b6bcc>] F hvmemul_do_pio_buffer+0x33/0x35
> > (XEN)    [<ffff82d0402c5e80>] F handle_pio+0x70/0x1b7
> > (XEN)    [<ffff82d04029dc7f>] F svm_vmexit_handler+0x10ba/0x18aa
> > (XEN)    [<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> > (XEN)
> > (XEN) d0v7 bar_write Ray line 391 0000:03:00.0 bar->enabled 1
> > (XEN) d0v7 bar_write Ray line 406 0000:03:00.0 bar->enabled 1
> 

Hi Jan, Roger,

> As Jan said, it's hard to figure out where the printks are placed without a
> diff of your changes.

I attached the diff of my printks below. I want to figure out why
bar_write() is called when we use pci-assignable-add to assign a
passthrough device on PVH dom0.


diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 918d11fbce..35447aff2a 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -388,12 +388,14 @@ static void cf_check bar_write(
     else
         val &= PCI_BASE_ADDRESS_MEM_MASK;
 
+    gprintk(XENLOG_WARNING, "%s Ray line %d %pp bar->enabled %d\n", __func__, __LINE__, &pdev->sbdf , bar->enabled);
     /*
      * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
      * writes as long as the BAR is not mapped into the p2m.
      */
     if ( pci_conf_read16(pdev->sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY )
     {
+        gprintk(XENLOG_WARNING, "%s Ray line %d %pp bar->enabled %d\n", __func__, __LINE__, &pdev->sbdf , bar->enabled);
         /* If the value written is the current one avoid printing a warning. */
         if ( val != (uint32_t)(bar->addr >> (hi ? 32 : 0)) )
             gprintk(XENLOG_WARNING,
@@ -401,7 +403,9 @@ static void cf_check bar_write(
                     &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
         return;
     }
-
+    gprintk(XENLOG_WARNING, "%s Ray line %d %pp bar->enabled %d\n", __func__, __LINE__, &pdev->sbdf , bar->enabled);
+    if (bar->enabled)
+	WARN_ON(1);
 
     /*
      * Update the cached address, so that when memory decoding is enabled

> 
> So far the above seems to be expected, as we currently don't handle BAR
> register writes with memory decoding enabled.
> 
> Given the change proposed in this patch, can you check whether `bar->enabled ==
> true` while the PCI command register has the memory decoding bit unset?


I traced that when we do pci-assignable-add, we follow the call chain below
to bind the passthrough device.

pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()

Then the kernel xen-pciback driver wants to add the virtual configuration
space. In this phase, bar_write() in the Xen hypervisor is called. I still
need a bit more time to figure out the exact reason. May I know where the
xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
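
(From the call trace captured earlier, the path appears to be handle_pio ->
hvm_io_intercept -> vpci_portio_write -> vpci_write -> bar_write, i.e. a
legacy CF8/CFC config-space access from the dom0 kernel. A sketch of such
an access follows; the values are illustrative only, taken from the trace
above:)

    /* Select 0000:03:00.0, offset 0x10 (BAR0) via the CF8 index port,
     * then write the dword via the CFC data port; Xen traps the port
     * access and routes it through vpci_write(). */
    outl(0x80000000 | (0x03 << 16) | (0x00 << 11) | (0x0 << 8) | 0x10, 0xcf8);
    outl(0xd0700000, 0xcfc);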

[  309.719049] xen_pciback: wants to seize 0000:03:00.0
[  462.911251] pciback 0000:03:00.0: xen_pciback: probing...
[  462.911256] pciback 0000:03:00.0: xen_pciback: seizing device
[  462.911257] pciback 0000:03:00.0: xen_pciback: pcistub_device_alloc
[  462.911261] pciback 0000:03:00.0: xen_pciback: initializing...
[  462.911263] pciback 0000:03:00.0: xen_pciback: initializing config
[  462.911265] pciback 0000:03:00.0: xen-pciback: initializing virtual configuration space
[  462.911268] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x00
[  462.911271] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x02
[  462.911284] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x04
[  462.911286] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3c
[  462.911289] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3d
[  462.911291] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0c
[  462.911294] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0d
[  462.911296] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0f
[  462.911301] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x10
[  462.911306] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x14
[  462.911309] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x18
[  462.911313] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x1c
[  462.911317] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x20
[  462.911321] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x24
[  462.911325] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x30
[  462.911358] pciback 0000:03:00.0: Found capability 0x1 at 0x50
[  462.911361] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x50
[  462.911363] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x52
[  462.911368] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x54
[  462.911371] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x56
[  462.911373] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x57
[  462.911386] pciback 0000:03:00.0: Found capability 0x5 at 0xa0
[  462.911388] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa0
[  462.911391] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa2
[  462.911405] pciback 0000:03:00.0: xen_pciback: enabling device
[  462.911412] pciback 0000:03:00.0: enabling device (0006 -> 0007)
[  462.911658] Already setup the GSI :28
[  462.911668] Already map the GSI :28 and IRQ: 115
[  462.911684] pciback 0000:03:00.0: xen_pciback: save state of device
[  462.912154] pciback 0000:03:00.0: xen_pciback: resetting (FLR, D3, etc) the device
[  463.954998] pciback 0000:03:00.0: xen_pciback: reset device


> 
> If so it would mean Xen state got out-of-sync with the hardware state, and we
> would need to figure out where it happened.  Is there any backdoor in the AMD
> GPU that allows disabling memory decoding without using the PCI command
> register?
> 

I don't think we have any backdoor.

Thanks,
Ray


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22 12:33                       ` Huang Rui
@ 2023-03-22 12:48                         ` Jan Beulich
  2023-03-23 10:26                           ` Huang Rui
  2023-03-23 10:43                           ` Roger Pau Monné
  0 siblings, 2 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-22 12:48 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Stefano Stabellini, Anthony PERARD, xen-devel,
	Roger Pau Monné

On 22.03.2023 13:33, Huang Rui wrote:
> I traced that when we do pci-assignable-add, we follow the call chain below
> to bind the passthrough device.
> 
> pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> 
> Then the kernel xen-pciback driver wants to add the virtual configuration
> space. In this phase, bar_write() in the Xen hypervisor is called. I still
> need a bit more time to figure out the exact reason. May I know where the
> xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?

Any config space access would. And I might guess ...

> [  309.719049] xen_pciback: wants to seize 0000:03:00.0
> [  462.911251] pciback 0000:03:00.0: xen_pciback: probing...
> [  462.911256] pciback 0000:03:00.0: xen_pciback: seizing device
> [  462.911257] pciback 0000:03:00.0: xen_pciback: pcistub_device_alloc
> [  462.911261] pciback 0000:03:00.0: xen_pciback: initializing...
> [  462.911263] pciback 0000:03:00.0: xen_pciback: initializing config
> [  462.911265] pciback 0000:03:00.0: xen-pciback: initializing virtual configuration space
> [  462.911268] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x00
> [  462.911271] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x02
> [  462.911284] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x04
> [  462.911286] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3c
> [  462.911289] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3d
> [  462.911291] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0c
> [  462.911294] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0d
> [  462.911296] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0f
> [  462.911301] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x10
> [  462.911306] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x14
> [  462.911309] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x18
> [  462.911313] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x1c
> [  462.911317] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x20
> [  462.911321] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x24
> [  462.911325] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x30
> [  462.911358] pciback 0000:03:00.0: Found capability 0x1 at 0x50
> [  462.911361] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x50
> [  462.911363] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x52
> [  462.911368] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x54
> [  462.911371] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x56
> [  462.911373] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x57
> [  462.911386] pciback 0000:03:00.0: Found capability 0x5 at 0xa0
> [  462.911388] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa0
> [  462.911391] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa2
> [  462.911405] pciback 0000:03:00.0: xen_pciback: enabling device
> [  462.911412] pciback 0000:03:00.0: enabling device (0006 -> 0007)
> [  462.911658] Already setup the GSI :28
> [  462.911668] Already map the GSI :28 and IRQ: 115
> [  462.911684] pciback 0000:03:00.0: xen_pciback: save state of device
> [  462.912154] pciback 0000:03:00.0: xen_pciback: resetting (FLR, D3, etc) the device
> [  463.954998] pciback 0000:03:00.0: xen_pciback: reset device

... it is actually the reset here, saving and then restoring config space.
If e.g. that restore was done "blindly" (i.e. simply writing fields low to
high), then memory decode would be re-enabled before the BARs are written.
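
To illustrate the hazard (a sketch of such a "blind" low-to-high restore,
not the actual Linux implementation; pci_write_config_dword() is the
standard kernel accessor and `saved` stands for the saved config space):

    unsigned int off;

    /* COMMAND sits at offset 0x04, the BARs at 0x10..0x24: writing low
     * to high re-enables memory decoding before the BARs are restored,
     * exactly the window in which vPCI refuses BAR writes. */
    for (off = 0; off < 256; off += 4)
        pci_write_config_dword(pdev, off, saved[off / 4]);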

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22 12:48                         ` Jan Beulich
@ 2023-03-23 10:26                           ` Huang Rui
  2023-03-23 14:16                             ` Jan Beulich
  2023-03-23 10:43                           ` Roger Pau Monné
  1 sibling, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-23 10:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Stefano Stabellini, Anthony PERARD, xen-devel,
	Roger Pau Monné

On Wed, Mar 22, 2023 at 08:48:30PM +0800, Jan Beulich wrote:
> On 22.03.2023 13:33, Huang Rui wrote:
> > I traced that when we do pci-assignable-add, we follow the call chain below
> > to bind the passthrough device.
> > 
> > pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> > 
> > Then the kernel xen-pciback driver wants to add the virtual configuration
> > space. In this phase, bar_write() in the Xen hypervisor is called. I still
> > need a bit more time to figure out the exact reason. May I know where the
> > xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
> 
> Any config space access would. And I might guess ...
> 
> > [  309.719049] xen_pciback: wants to seize 0000:03:00.0
> > [  462.911251] pciback 0000:03:00.0: xen_pciback: probing...
> > [  462.911256] pciback 0000:03:00.0: xen_pciback: seizing device
> > [  462.911257] pciback 0000:03:00.0: xen_pciback: pcistub_device_alloc
> > [  462.911261] pciback 0000:03:00.0: xen_pciback: initializing...
> > [  462.911263] pciback 0000:03:00.0: xen_pciback: initializing config
> > [  462.911265] pciback 0000:03:00.0: xen-pciback: initializing virtual configuration space
> > [  462.911268] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x00
> > [  462.911271] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x02
> > [  462.911284] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x04
> > [  462.911286] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3c
> > [  462.911289] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3d
> > [  462.911291] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0c
> > [  462.911294] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0d
> > [  462.911296] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0f
> > [  462.911301] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x10
> > [  462.911306] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x14
> > [  462.911309] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x18
> > [  462.911313] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x1c
> > [  462.911317] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x20
> > [  462.911321] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x24
> > [  462.911325] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x30
> > [  462.911358] pciback 0000:03:00.0: Found capability 0x1 at 0x50
> > [  462.911361] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x50
> > [  462.911363] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x52
> > [  462.911368] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x54
> > [  462.911371] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x56
> > [  462.911373] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x57
> > [  462.911386] pciback 0000:03:00.0: Found capability 0x5 at 0xa0
> > [  462.911388] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa0
> > [  462.911391] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa2
> > [  462.911405] pciback 0000:03:00.0: xen_pciback: enabling device
> > [  462.911412] pciback 0000:03:00.0: enabling device (0006 -> 0007)
> > [  462.911658] Already setup the GSI :28
> > [  462.911668] Already map the GSI :28 and IRQ: 115
> > [  462.911684] pciback 0000:03:00.0: xen_pciback: save state of device
> > [  462.912154] pciback 0000:03:00.0: xen_pciback: resetting (FLR, D3, etc) the device
> > [  463.954998] pciback 0000:03:00.0: xen_pciback: reset device
> 
> ... it is actually the reset here, saving and then restoring config space.
> If e.g. that restore was done "blindly" (i.e. simply writing fields low to
> high), then memory decode would be re-enabled before the BARs are written.
> 

Yes, we confirmed the problem: while the xen-pciback driver initializes the
passthrough device via pcistub_init_device() -> pci_restore_state() ->
pci_restore_config_space() -> pci_restore_config_space_range() ->
pci_restore_config_dword() -> pci_write_config_dword(), the PCI config
write triggers an I/O intercept into bar_write() in Xen; since bar->enabled
is set, the write is not actually allowed.

May I know whether this restore behavior is expected, or whether the device
should not be reset?

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-22 12:48                         ` Jan Beulich
  2023-03-23 10:26                           ` Huang Rui
@ 2023-03-23 10:43                           ` Roger Pau Monné
  2023-03-23 13:34                             ` Huang Rui
  1 sibling, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-23 10:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Huang Rui, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Wed, Mar 22, 2023 at 01:48:30PM +0100, Jan Beulich wrote:
> On 22.03.2023 13:33, Huang Rui wrote:
> > I traced that when we do pci-assignable-add, we follow the call chain below
> > to bind the passthrough device.
> > 
> > pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> > 
> > Then the kernel xen-pciback driver wants to add the virtual configuration
> > space. In this phase, bar_write() in the Xen hypervisor is called. I still
> > need a bit more time to figure out the exact reason. May I know where the
> > xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
> 
> Any config space access would. And I might guess ...
> 
> > [  309.719049] xen_pciback: wants to seize 0000:03:00.0
> > [  462.911251] pciback 0000:03:00.0: xen_pciback: probing...
> > [  462.911256] pciback 0000:03:00.0: xen_pciback: seizing device
> > [  462.911257] pciback 0000:03:00.0: xen_pciback: pcistub_device_alloc
> > [  462.911261] pciback 0000:03:00.0: xen_pciback: initializing...
> > [  462.911263] pciback 0000:03:00.0: xen_pciback: initializing config
> > [  462.911265] pciback 0000:03:00.0: xen-pciback: initializing virtual configuration space
> > [  462.911268] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x00
> > [  462.911271] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x02
> > [  462.911284] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x04
> > [  462.911286] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3c
> > [  462.911289] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3d
> > [  462.911291] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0c
> > [  462.911294] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0d
> > [  462.911296] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0f
> > [  462.911301] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x10
> > [  462.911306] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x14
> > [  462.911309] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x18
> > [  462.911313] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x1c
> > [  462.911317] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x20
> > [  462.911321] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x24
> > [  462.911325] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x30
> > [  462.911358] pciback 0000:03:00.0: Found capability 0x1 at 0x50
> > [  462.911361] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x50
> > [  462.911363] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x52
> > [  462.911368] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x54
> > [  462.911371] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x56
> > [  462.911373] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x57
> > [  462.911386] pciback 0000:03:00.0: Found capability 0x5 at 0xa0
> > [  462.911388] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa0
> > [  462.911391] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa2
> > [  462.911405] pciback 0000:03:00.0: xen_pciback: enabling device
> > [  462.911412] pciback 0000:03:00.0: enabling device (0006 -> 0007)
> > [  462.911658] Already setup the GSI :28
> > [  462.911668] Already map the GSI :28 and IRQ: 115
> > [  462.911684] pciback 0000:03:00.0: xen_pciback: save state of device
> > [  462.912154] pciback 0000:03:00.0: xen_pciback: resetting (FLR, D3, etc) the device
> > [  463.954998] pciback 0000:03:00.0: xen_pciback: reset device
> 
> ... it is actually the reset here, saving and then restoring config space.
> If e.g. that restore was done "blindly" (i.e. simply writing fields low to
> high), then memory decode would be re-enabled before the BARs are written.

The problem is also that we don't tell vPCI that the device has been
reset, so the current cached state in pdev->vpci is all out of date
with the real device state.

I didn't hit this on my test because the device I was using had no
reset support.

I don't think it's feasible for Xen to detect all the possible reset
methods dom0 might use, as some of those are device specific, for
example.

We would have to introduce a new hypercall that clears all vPCI device
state, PHYSDEVOP_pci_device_reset for example.  This will involve
adding proper cleanup functions, as the current code in
vpci_remove_device() only deals with allocated memory (because so far
devices were not deassigned) but we now also need to make sure
MSI(-X) interrupts are torn down and freed, and will also require
removing any mappings of BARs into the dom0 physmap.
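
Such a handler might look roughly like the sketch below (purely
hypothetical; vpci_remove_device() and vpci_add_handlers() are the existing
functions mentioned above, while the MSI(-X) teardown and BAR unmapping
they would need to grow are not shown):

    /* Hypothetical PHYSDEVOP_pci_device_reset backend: throw away the
     * stale vPCI state of a reset device and rebuild it from the (now
     * reset) hardware. */
    static int pci_device_reset(struct pci_dev *pdev)
    {
        if ( !pdev || !pdev->vpci )
            return -ENODEV;

        vpci_remove_device(pdev);       /* drop the stale cached state */
        return vpci_add_handlers(pdev); /* re-initialise from hardware */
    }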

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-23 10:43                           ` Roger Pau Monné
@ 2023-03-23 13:34                             ` Huang Rui
  2023-03-23 16:23                               ` Roger Pau Monné
  0 siblings, 1 reply; 75+ messages in thread
From: Huang Rui @ 2023-03-23 13:34 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Thu, Mar 23, 2023 at 06:43:53PM +0800, Roger Pau Monné wrote:
> On Wed, Mar 22, 2023 at 01:48:30PM +0100, Jan Beulich wrote:
> > On 22.03.2023 13:33, Huang Rui wrote:
> > > I traced that when we do pci-assignable-add, we follow the call chain below
> > > to bind the passthrough device.
> > > 
> > > pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> > > 
> > > Then the kernel xen-pciback driver wants to add the virtual configuration
> > > space. In this phase, bar_write() in the Xen hypervisor is called. I still
> > > need a bit more time to figure out the exact reason. May I know where the
> > > xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
> > 
> > Any config space access would. And I might guess ...
> > 
> > > [  309.719049] xen_pciback: wants to seize 0000:03:00.0
> > > [  462.911251] pciback 0000:03:00.0: xen_pciback: probing...
> > > [  462.911256] pciback 0000:03:00.0: xen_pciback: seizing device
> > > [  462.911257] pciback 0000:03:00.0: xen_pciback: pcistub_device_alloc
> > > [  462.911261] pciback 0000:03:00.0: xen_pciback: initializing...
> > > [  462.911263] pciback 0000:03:00.0: xen_pciback: initializing config
> > > [  462.911265] pciback 0000:03:00.0: xen-pciback: initializing virtual configuration space
> > > [  462.911268] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x00
> > > [  462.911271] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x02
> > > [  462.911284] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x04
> > > [  462.911286] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3c
> > > [  462.911289] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x3d
> > > [  462.911291] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0c
> > > [  462.911294] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0d
> > > [  462.911296] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x0f
> > > [  462.911301] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x10
> > > [  462.911306] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x14
> > > [  462.911309] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x18
> > > [  462.911313] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x1c
> > > [  462.911317] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x20
> > > [  462.911321] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x24
> > > [  462.911325] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x30
> > > [  462.911358] pciback 0000:03:00.0: Found capability 0x1 at 0x50
> > > [  462.911361] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x50
> > > [  462.911363] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x52
> > > [  462.911368] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x54
> > > [  462.911371] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x56
> > > [  462.911373] pciback 0000:03:00.0: xen-pciback: added config field at offset 0x57
> > > [  462.911386] pciback 0000:03:00.0: Found capability 0x5 at 0xa0
> > > [  462.911388] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa0
> > > [  462.911391] pciback 0000:03:00.0: xen-pciback: added config field at offset 0xa2
> > > [  462.911405] pciback 0000:03:00.0: xen_pciback: enabling device
> > > [  462.911412] pciback 0000:03:00.0: enabling device (0006 -> 0007)
> > > [  462.911658] Already setup the GSI :28
> > > [  462.911668] Already map the GSI :28 and IRQ: 115
> > > [  462.911684] pciback 0000:03:00.0: xen_pciback: save state of device
> > > [  462.912154] pciback 0000:03:00.0: xen_pciback: resetting (FLR, D3, etc) the device
> > > [  463.954998] pciback 0000:03:00.0: xen_pciback: reset device
> > 
> > ... it is actually the reset here, saving and then restoring config space.
> > If e.g. that restore was done "blindly" (i.e. simply writing fields low to
> > high), then memory decode would be re-enabled before the BARs are written.
> 
> The problem is also that we don't tell vPCI that the device has been
> reset, so the current cached state in pdev->vpci is all out of date
> with the real device state.
> 
> I didn't hit this on my test because the device I was using had no
> reset support.
> 
> I don't think it's feasible for Xen to detect all the possible reset
> methods dom0 might use, as some of those are device specific, for
> example.

OK.

> 
> We would have to introduce a new hypercall that clears all vPCI device
> state, PHYSDEVOP_pci_device_reset for example.  This will involve
> adding proper cleanup functions, as the current code in
> vpci_remove_device() only deals with allocated memory (because so far
> devices were not deassigned), but we now also need to make sure
> MSI(-X) interrupts are torn down and freed, and it will also require
> removing any mappings of BARs from the dom0 physmap.
> 

Thanks for the suggestion. Let me add the new PHYSDEVOP_pci_device_reset
in the next version instead of the current workaround.

MSI(-X) interrupts don't work on our platform, and I haven't figured out
the root cause yet. Could you please elaborate on where we should remove
any mappings of BARs from the dom0 physmap here?

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-23 10:26                           ` Huang Rui
@ 2023-03-23 14:16                             ` Jan Beulich
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2023-03-23 14:16 UTC (permalink / raw)
  To: Huang Rui
  Cc: Deucher, Alexander, Koenig, Christian, Hildebrand, Stewart,
	Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen, Jiqian,
	Stefano Stabellini, Anthony PERARD, xen-devel,
	Roger Pau Monné

On 23.03.2023 11:26, Huang Rui wrote:
> On Wed, Mar 22, 2023 at 08:48:30PM +0800, Jan Beulich wrote:
>> On 22.03.2023 13:33, Huang Rui wrote:
>>> I traced that when we do pci-assignable-add, we follow the trace below to
>>> bind the passthrough device:
>>>
>>> pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
>>>
>>> Then the kernel xen-pciback driver wants to add the virtual configuration
>>> space. In this phase, bar_write() in the Xen hypervisor is called. I still
>>> need a bit more time to figure out the exact reason. May I know where the
>>> xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
>>
>> Any config space access would. And I might guess ...
>>
>>> [pciback log snipped; quoted in full earlier in the thread]
>>
>> ... it is actually the reset here, saving and then restoring config space.
>> If e.g. that restore was done "blindly" (i.e. simply writing fields low to
>> high), then memory decode would be re-enabled before the BARs are written.
>>
> 
> Yes, we confirmed the problem: while the xen-pciback driver initializes the
> passthrough device via pcistub_init_device() -> pci_restore_state() ->
> pci_restore_config_space() -> pci_restore_config_space_range() ->
> pci_restore_config_dword() -> pci_write_config_dword(), the PCI config
> write triggers an I/O intercept into bar_write() in Xen; because bar->enable
> is set, the write is not actually allowed.
> 
> May I know whether this behavior (the restore) is expected? Or should it
> not reset the device?

The reset is expected. To expand slightly on Roger's reply: The reset we're
unaware of has likely indeed brought bar->enable and command register state
out of sync. For everything else see Roger's response.

Jan


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-23 13:34                             ` Huang Rui
@ 2023-03-23 16:23                               ` Roger Pau Monné
  2023-03-24  4:37                                 ` Huang Rui
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-03-23 16:23 UTC (permalink / raw)
  To: Huang Rui
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Thu, Mar 23, 2023 at 09:34:40PM +0800, Huang Rui wrote:
> On Thu, Mar 23, 2023 at 06:43:53PM +0800, Roger Pau Monné wrote:
> > On Wed, Mar 22, 2023 at 01:48:30PM +0100, Jan Beulich wrote:
> > > On 22.03.2023 13:33, Huang Rui wrote:
> > > > I traced that when we do pci-assignable-add, we follow the trace below to
> > > > bind the passthrough device:
> > > > 
> > > > pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> > > > 
> > > > Then the kernel xen-pciback driver wants to add the virtual configuration
> > > > space. In this phase, bar_write() in the Xen hypervisor is called. I still
> > > > need a bit more time to figure out the exact reason. May I know where the
> > > > xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
> > > 
> > > Any config space access would. And I might guess ...
> > > 
> > > > [pciback log snipped; quoted in full earlier in the thread]
> > > 
> > > ... it is actually the reset here, saving and then restoring config space.
> > > If e.g. that restore was done "blindly" (i.e. simply writing fields low to
> > > high), then memory decode would be re-enabled before the BARs are written.
> > 
> > The problem is also that we don't tell vPCI that the device has been
> > reset, so the current cached state in pdev->vpci is all out of date
> > with the real device state.
> > 
> > I didn't hit this on my test because the device I was using had no
> > reset support.
> > 
> > I don't think it's feasible for Xen to detect all the possible reset
> > methods dom0 might use, as some of those are device specific, for
> > example.
> 
> OK.
> 
> > 
> > We would have to introduce a new hypercall that clears all vPCI device
> > state, PHYSDEVOP_pci_device_reset for example.  This will involve
> > adding proper cleanup functions, as the current code in
> > vpci_remove_device() only deals with allocated memory (because so far
> > devices were not deassigned), but we now also need to make sure
> > MSI(-X) interrupts are torn down and freed, and it will also require
> > removing any mappings of BARs from the dom0 physmap.
> > 
> 
> Thanks for the suggestion. Let me add the new PHYSDEVOP_pci_device_reset
> in the next version instead of the current workaround.
> 
> MSI(-X) interrupts don't work on our platform, and I haven't figured out
> the root cause yet.

Do MSI-X interrupts work when the device is in use by dom0 (both PV
and PVH)?

> Could you please elaborate on where we should remove
> any mappings of BARs from the dom0 physmap here?

I think you can just use `modify_bars(pdev, 0, 0)`, as that will
effectively remove any BARs from the memory map.  That should also
take care of preemption, so you should be good to go.
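I.e. a minimal sketch, assuming modify_bars() gets exposed outside of
xen/drivers/vpci/header.c for this purpose:

/*
 * A zero command value makes modify_bars() treat memory decoding as
 * disabled, so every BAR mapping is removed; rom_only == false means
 * the whole set of BARs is acted on, not just the ROM BAR.
 */
static int vpci_unmap_bars(struct pci_dev *pdev)
{
    return modify_bars(pdev, 0, false);
}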

Regards, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 2/6] vpci: accept BAR writes if dom0 is PVH
  2023-03-23 16:23                               ` Roger Pau Monné
@ 2023-03-24  4:37                                 ` Huang Rui
  0 siblings, 0 replies; 75+ messages in thread
From: Huang Rui @ 2023-03-24  4:37 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian, Stefano Stabellini, Anthony PERARD, xen-devel

On Fri, Mar 24, 2023 at 12:23:39AM +0800, Roger Pau Monné wrote:
> On Thu, Mar 23, 2023 at 09:34:40PM +0800, Huang Rui wrote:
> > On Thu, Mar 23, 2023 at 06:43:53PM +0800, Roger Pau Monné wrote:
> > > On Wed, Mar 22, 2023 at 01:48:30PM +0100, Jan Beulich wrote:
> > > > On 22.03.2023 13:33, Huang Rui wrote:
> > > > > I traced that when we do pci-assignable-add, we follow the trace below to
> > > > > bind the passthrough device:
> > > > > 
> > > > > pciassignable_add()->libxl_device_pci_assignable_add()->libxl__device_pci_assignable_add()->pciback_dev_assign()
> > > > > 
> > > > > Then the kernel xen-pciback driver wants to add the virtual configuration
> > > > > space. In this phase, bar_write() in the Xen hypervisor is called. I still
> > > > > need a bit more time to figure out the exact reason. May I know where the
> > > > > xen-pciback driver would trigger an hvm_io_intercept into the Xen hypervisor?
> > > > 
> > > > Any config space access would. And I might guess ...
> > > > 
> > > > > [pciback log snipped; quoted in full earlier in the thread]
> > > > 
> > > > ... it is actually the reset here, saving and then restoring config space.
> > > > If e.g. that restore was done "blindly" (i.e. simply writing fields low to
> > > > high), then memory decode would be re-enabled before the BARs are written.
> > > 
> > > The problem is also that we don't tell vPCI that the device has been
> > > reset, so the current cached state in pdev->vpci is all out of date
> > > with the real device state.
> > > 
> > > I didn't hit this on my test because the device I was using had no
> > > reset support.
> > > 
> > > I don't think it's feasible for Xen to detect all the possible reset
> > > methods dom0 might use, as some of those are device specific, for
> > > example.
> > 
> > OK.
> > 
> > > 
> > > We would have to introduce a new hypercall that clears all vPCI device
> > > state, PHYSDEVOP_pci_device_reset for example.  This will involve
> > > adding proper cleanup functions, as the current code in
> > > vpci_remove_device() only deals with allocated memory (because so far
> > > devices were not deassigned), but we now also need to make sure
> > > MSI(-X) interrupts are torn down and freed, and it will also require
> > > removing any mappings of BARs from the dom0 physmap.
> > > 
> > 
> > Thanks for the suggestion. Let me add the new PHYSDEVOP_pci_device_reset
> > in the next version instead of the current workaround.
> > 
> > MSI(-X) interrupts don't work on our platform, and I haven't figured out
> > the root cause yet.
> 
> Do MSI-X interrupts work when the device is in use by dom0 (both PV
> and PVH)?

Yes, dom0 works well. But they don't work for passthrough devices in domU,
whether PV or PVH. So I would like to implement the GSI support first, and
then continue investigating the MSI(-X) issues.

> 
> > Could you please elaborate on where we should remove
> > any mappings of BARs from the dom0 physmap here?
> 
> I think you can just use `modify_bars(pdev, 0, 0)`, as that will
> effectively remove any BARs from the memory map.  That should also
> take care of preemption, so you should be good to go.
> 

Thanks,
Ray


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-03-17 20:55                       ` Stefano Stabellini
  2023-03-20 15:16                         ` Roger Pau Monné
@ 2023-07-31 16:40                         ` Chen, Jiqian
  2023-08-23  8:57                           ` Roger Pau Monné
  1 sibling, 1 reply; 75+ messages in thread
From: Chen, Jiqian @ 2023-07-31 16:40 UTC (permalink / raw)
  To: Stefano Stabellini, Roger Pau Monné, Jan Beulich
  Cc: Huang, Ray, Anthony PERARD, xen-devel, Deucher, Alexander,
	Koenig, Christian, Hildebrand, Stewart, Xenia Ragiadakou, Huang,
	Honglei1, Zhang, Julia, Chen, Jiqian

Hi,

On 2023/3/18 04:55, Stefano Stabellini wrote:
> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
>> On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
>>> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
>>>> On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
>>>>> On 17.03.2023 00:19, Stefano Stabellini wrote:
>>>>>> On Thu, 16 Mar 2023, Jan Beulich wrote:
>>>>>>> So yes, it then all boils down to that Linux-
>>>>>>> internal question.
>>>>>>
>>>>>> Excellent question but we'll have to wait for Ray as he is the one with
>>>>>> access to the hardware. But I have this data I can share in the
>>>>>> meantime:
>>>>>>
>>>>>> [    1.260378] IRQ to pin mappings:
>>>>>> [    1.260387] IRQ1 -> 0:1
>>>>>> [    1.260395] IRQ2 -> 0:2
>>>>>> [    1.260403] IRQ3 -> 0:3
>>>>>> [    1.260410] IRQ4 -> 0:4
>>>>>> [    1.260418] IRQ5 -> 0:5
>>>>>> [    1.260425] IRQ6 -> 0:6
>>>>>> [    1.260432] IRQ7 -> 0:7
>>>>>> [    1.260440] IRQ8 -> 0:8
>>>>>> [    1.260447] IRQ9 -> 0:9
>>>>>> [    1.260455] IRQ10 -> 0:10
>>>>>> [    1.260462] IRQ11 -> 0:11
>>>>>> [    1.260470] IRQ12 -> 0:12
>>>>>> [    1.260478] IRQ13 -> 0:13
>>>>>> [    1.260485] IRQ14 -> 0:14
>>>>>> [    1.260493] IRQ15 -> 0:15
>>>>>> [    1.260505] IRQ106 -> 1:8
>>>>>> [    1.260513] IRQ112 -> 1:4
>>>>>> [    1.260521] IRQ116 -> 1:13
>>>>>> [    1.260529] IRQ117 -> 1:14
>>>>>> [    1.260537] IRQ118 -> 1:15
>>>>>> [    1.260544] .................................... done.
>>>>>
>>>>> And what does Linux think are IRQs 16 ... 105? Have you compared with
>>>>> Linux running baremetal on the same hardware?
>>>>
>>>> So I have some emails from Ray from the time he was looking into this,
>>>> and on Linux dom0 PVH dmesg there is:
>>>>
>>>> [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
>>>> [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
>>>>
>>>> So it seems the vIO-APIC data provided by Xen to dom0 is at least
>>>> consistent.
>>>>  
>>>>>> And I think Ray traced the point in Linux where Linux gives us an IRQ ==
>>>>>> 112 (which is the one causing issues):
>>>>>>
>>>>>> __acpi_register_gsi->
>>>>>>         acpi_register_gsi_ioapic->
>>>>>>                 mp_map_gsi_to_irq->
>>>>>>                         mp_map_pin_to_irq->
>>>>>>                                 __irq_resolve_mapping()
>>>>>>
>>>>>>         if (likely(data)) {
>>>>>>                 desc = irq_data_to_desc(data);
>>>>>>                 if (irq)
>>>>>>                         *irq = data->irq;
>>>>>>                 /* this IRQ is 112, IO-APIC-34 domain */
>>>>>>         }
>>>>
>>>>
>>>> Could this all be a result of patch 4/5 in the Linux series ("[RFC
>>>> PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
>>>> __acpi_register_gsi hook is installed for PVH in order to set up GSIs
>>>> using PHYSDEV ops instead of doing it natively from the IO-APIC?
>>>>
>>>> FWIW, the introduced function in that patch
>>>> (acpi_register_gsi_xen_pvh()) seems to unconditionally call
>>>> acpi_register_gsi_ioapic() without checking if the GSI is already
>>>> registered, which might lead to multiple IRQs being allocated for the
>>>> same underlying GSI?
>>>
>>> I understand this point and I think it needs investigating.
>>>
>>>
>>>> As I commented there, I think that approach is wrong.  If the GSI has
>>>> not been mapped in Xen (because dom0 hasn't unmasked the respective
>>>> IO-APIC pin) we should add some logic in the toolstack to map it
>>>> before attempting to bind.
>>>
>>> But this statement confuses me. The toolstack doesn't get involved in
>>> IRQ setup for PCI devices for HVM guests?
>>
>> It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
>> to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
>> could be removed (maybe for qemu-trad only?) or it's also required by
>> QEMU upstream, I would have to investigate more.
> 
> You are right. I am not certain, but it seems like a mistake in the
> toolstack to me. In theory, pci_add_dm_done should only be needed for PV
> guests, not for HVM guests. I am not sure. But I can see the call to
> xc_physdev_map_pirq you were referring to now.
> 
> 
>> It's my understanding it's in pci_add_dm_done() where Ray was getting
>> the mismatched IRQ vs GSI number.
> 
> I think the mismatch was actually caused by the xc_physdev_map_pirq call
> from QEMU, which makes sense because in any case it should happen before
> the same call done by pci_add_dm_done (pci_add_dm_done is called after
> sending the pci passthrough QMP command to QEMU). So the first to hit
> the IRQ!=GSI problem would be QEMU.


Sorry for replying to you so late, and thank you all for the review. I realized that your questions mainly focus on the following points:
1. Why is the irq not equal to the gsi?
2. Why do I translate between irq and gsi?
3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()?
4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?
Please forgive me for giving a summary response first. I am looking forward to your comments.

1. Why is the irq not equal to the gsi?
As far as I know, the irq is dynamically allocated when the gsi gets registered; they are not necessarily equal.
When I run "sudo xl pci-assignable-add 03:00.0" to make the passthrough device assignable (taking the dGPU in my environment as an example, whose gsi is 28), it calls into acpi_register_gsi_ioapic() to get the irq. The call stack is:
acpi_register_gsi_ioapic
	mp_map_gsi_to_irq
		mp_map_pin_to_irq
			irq_find_mapping (if the gsi has been mapped to an irq before, the corresponding irq is returned here)
			alloc_irq_from_domain
				__irq_domain_alloc_irqs
					irq_domain_alloc_descs
						__irq_alloc_descs

If you add some debug prints like the ones below:
---------------------------------------------------------------------------------------------------------------------------------------------
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a868b76cd3d4..970fd461be7a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1067,6 +1067,8 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
                }
        }
        mutex_unlock(&ioapic_mutex);
+       printk("cjq_debug mp_map_pin_to_irq gsi: %u, irq: %d, idx: %d, ioapic: %d, pin: %d\n",
+                       gsi, irq, idx, ioapic, pin);

        return irq;
 }
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 5db0230aa6b5..4e9613abbe96 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -786,6 +786,8 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
        start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
                                           from, cnt, 0);
        ret = -EEXIST;
+       printk("cjq_debug __irq_alloc_descs irq: %d, from: %u, cnt: %u, node: %d, start: %d, nr_irqs: %d\n",
+                       irq, from, cnt, node, start, nr_irqs);
        if (irq >= 0 && start != irq)
                goto unlock;
---------------------------------------------------------------------------------------------------------------------------------------------
You will get the following output on PVH dom0:

[    0.181560] cjq_debug __irq_alloc_descs irq: 1, from: 1, cnt: 1, node: -1, start: 1, nr_irqs: 1096
[    0.181639] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
[    0.181641] cjq_debug __irq_alloc_descs irq: 2, from: 2, cnt: 1, node: -1, start: 2, nr_irqs: 1096
[    0.181682] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 2, idx: 0, ioapic: 0, pin: 2
[    0.181683] cjq_debug __irq_alloc_descs irq: 3, from: 3, cnt: 1, node: -1, start: 3, nr_irqs: 1096
[    0.181715] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
[    0.181716] cjq_debug __irq_alloc_descs irq: 4, from: 4, cnt: 1, node: -1, start: 4, nr_irqs: 1096
[    0.181751] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
[    0.181752] cjq_debug __irq_alloc_descs irq: 5, from: 5, cnt: 1, node: -1, start: 5, nr_irqs: 1096
[    0.181783] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
[    0.181784] cjq_debug __irq_alloc_descs irq: 6, from: 6, cnt: 1, node: -1, start: 6, nr_irqs: 1096
[    0.181813] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
[    0.181814] cjq_debug __irq_alloc_descs irq: 7, from: 7, cnt: 1, node: -1, start: 7, nr_irqs: 1096
[    0.181856] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
[    0.181857] cjq_debug __irq_alloc_descs irq: 8, from: 8, cnt: 1, node: -1, start: 8, nr_irqs: 1096
[    0.181888] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
[    0.181889] cjq_debug __irq_alloc_descs irq: 9, from: 9, cnt: 1, node: -1, start: 9, nr_irqs: 1096
[    0.181918] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
[    0.181919] cjq_debug __irq_alloc_descs irq: 10, from: 10, cnt: 1, node: -1, start: 10, nr_irqs: 1096
[    0.181950] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
[    0.181951] cjq_debug __irq_alloc_descs irq: 11, from: 11, cnt: 1, node: -1, start: 11, nr_irqs: 1096
[    0.181977] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
[    0.181979] cjq_debug __irq_alloc_descs irq: 12, from: 12, cnt: 1, node: -1, start: 12, nr_irqs: 1096
[    0.182006] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
[    0.182007] cjq_debug __irq_alloc_descs irq: 13, from: 13, cnt: 1, node: -1, start: 13, nr_irqs: 1096
[    0.182034] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
[    0.182035] cjq_debug __irq_alloc_descs irq: 14, from: 14, cnt: 1, node: -1, start: 14, nr_irqs: 1096
[    0.182066] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
[    0.182067] cjq_debug __irq_alloc_descs irq: 15, from: 15, cnt: 1, node: -1, start: 15, nr_irqs: 1096
[    0.182095] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
[    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
[    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
[    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
[    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
[    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
[    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
[    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
[    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
[    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
[    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
[    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
[    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
[    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
[    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
[    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
[    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
[    0.198199] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
[    0.198416] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
[    0.198460] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
[    0.198489] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
[    0.198523] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
[    0.201315] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
[    0.202174] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
[    0.202225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
[    0.202259] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
[    0.202291] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
[    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
[    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
[    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
[    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
[    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
[    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 63, nr_irqs: 1096
[    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 64, nr_irqs: 1096
[    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
[    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 66, nr_irqs: 1096
[    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 67, nr_irqs: 1096
[    0.210169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 68, nr_irqs: 1096
[    0.210322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 69, nr_irqs: 1096
[    0.210370] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 70, nr_irqs: 1096
[    0.210403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 71, nr_irqs: 1096
[    0.210436] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 72, nr_irqs: 1096
[    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 73, nr_irqs: 1096
[    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 74, nr_irqs: 1096
[    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 75, nr_irqs: 1096
[    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 76, nr_irqs: 1096
[    0.214151] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 77, nr_irqs: 1096
[    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 78, nr_irqs: 1096
[    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 79, nr_irqs: 1096
[    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 80, nr_irqs: 1096
[    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 81, nr_irqs: 1096
[    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
[    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
[    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
[    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
[    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
[    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
[    0.222215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
[    0.222366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096
[    0.222410] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 90, nr_irqs: 1096
[    0.222447] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 91, nr_irqs: 1096
[    0.222478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 92, nr_irqs: 1096
[    0.225490] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 93, nr_irqs: 1096
[    0.226225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 94, nr_irqs: 1096
[    0.226268] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 95, nr_irqs: 1096
[    0.226300] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 96, nr_irqs: 1096
[    0.226329] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 97, nr_irqs: 1096
[    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 98, nr_irqs: 1096
[    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 99, nr_irqs: 1096
[    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 100, nr_irqs: 1096
[    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 101, nr_irqs: 1096
[    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 102, nr_irqs: 1096
[    0.232399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 103, nr_irqs: 1096
[    0.248854] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 104, nr_irqs: 1096
[    0.250609] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 105, nr_irqs: 1096
[    0.372343] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
[    0.720950] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
[    0.721052] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
[    1.254825] cjq_debug mp_map_pin_to_irq gsi: 7, irq: -16, idx: 7, ioapic: 0, pin: 7
[    1.333081] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
[    1.375882] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 106, nr_irqs: 1096
[    1.375951] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 106, idx: -1, ioapic: 1, pin: 8
[    1.376072] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
[    1.376121] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
[    1.472551] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
[    1.472697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
[    1.472751] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
[    1.484290] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
[    1.768163] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
[    1.768627] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 108, nr_irqs: 1096
[    1.769059] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 109, nr_irqs: 1096
[    1.769694] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 110, nr_irqs: 1096
[    1.770169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 111, nr_irqs: 1096
[    1.770697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 112, nr_irqs: 1096
[    1.770738] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
[    1.770789] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 113, nr_irqs: 1096
[    1.771230] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
[    1.771278] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 114, nr_irqs: 1096
[    2.127884] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 115, nr_irqs: 1096
[    3.207419] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 116, nr_irqs: 1096
[    3.207730] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
[    3.208120] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 117, nr_irqs: 1096
[    3.208475] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
[    3.208478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 118, nr_irqs: 1096
[    3.208861] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
[    3.208933] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 119, nr_irqs: 1096
[    3.209127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 120, nr_irqs: 1096
[    3.209383] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 121, nr_irqs: 1096
[    3.209863] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 122, nr_irqs: 1096
[    3.211439] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 123, nr_irqs: 1096
[    3.211833] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 124, nr_irqs: 1096
[    3.212873] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 125, nr_irqs: 1096
[    3.243514] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 126, nr_irqs: 1096
[    3.243689] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 126, idx: -1, ioapic: 1, pin: 14
[    3.244293] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 127, nr_irqs: 1096
[    3.244534] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 128, nr_irqs: 1096
[    3.244714] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 129, nr_irqs: 1096
[    3.244911] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 130, nr_irqs: 1096
[    3.245096] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 131, nr_irqs: 1096
[    3.245633] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 132, nr_irqs: 1096
[    3.247890] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 133, nr_irqs: 1096
[    3.248192] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 134, nr_irqs: 1096
[    3.271093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 135, nr_irqs: 1096
[    3.307045] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 136, nr_irqs: 1096
[    3.307162] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 136, idx: -1, ioapic: 1, pin: 24
[    3.307223] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
[    3.331183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
[    3.331295] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 138, nr_irqs: 1096
[    3.331366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 139, nr_irqs: 1096
[    3.331438] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 140, nr_irqs: 1096
[    3.331511] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 141, nr_irqs: 1096
[    3.331579] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 142, nr_irqs: 1096
[    3.331646] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 143, nr_irqs: 1096
[    3.331713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 144, nr_irqs: 1096
[    3.331780] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 145, nr_irqs: 1096
[    3.331846] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 146, nr_irqs: 1096
[    3.331913] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 147, nr_irqs: 1096
[    3.331984] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 148, nr_irqs: 1096
[    3.332051] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 149, nr_irqs: 1096
[    3.332118] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 150, nr_irqs: 1096
[    3.332183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 151, nr_irqs: 1096
[    3.332252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 152, nr_irqs: 1096
[    3.332319] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 153, nr_irqs: 1096
[    8.010370] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
[    9.545439] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
[    9.545713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 154, nr_irqs: 1096
[    9.546034] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 155, nr_irqs: 1096
[    9.687796] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 156, nr_irqs: 1096
[    9.687979] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
[    9.688057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 157, nr_irqs: 1096
[    9.921038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 158, nr_irqs: 1096
[    9.921210] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 158, idx: -1, ioapic: 1, pin: 5
[    9.921403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 159, nr_irqs: 1096
[    9.926373] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
[    9.926747] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 160, nr_irqs: 1096
[    9.928201] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
[    9.928488] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 161, nr_irqs: 1096
[   10.653915] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 162, nr_irqs: 1096
[   10.656257] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 163, nr_irqs: 1096

You can see that the irq allocated is not always based on the value of the gsi; allocation is first come, first served. On PVH dom0, irq descriptors 24-105 had already been handed out to earlier (dynamic) requests, so gsi 32 got irq 106 and gsi 28 got irq 112. And it is not only acpi_register_gsi_ioapic() that calls into __irq_alloc_descs(); other functions call it as well, even earlier.
The output above follows the same allocation scheme as bare metal, so we can conclude that irq != gsi in general. See the output below from Linux running on bare metal:

[    0.105053] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
[    0.105061] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 0, idx: 0, ioapic: 0, pin: 2
[    0.105069] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
[    0.105078] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
[    0.105086] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
[    0.105094] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
[    0.105103] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
[    0.105111] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
[    0.105119] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
[    0.105127] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
[    0.105136] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
[    0.105144] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
[    0.105152] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
[    0.105160] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
[    0.105169] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
[    0.398134] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
[    1.169293] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
[    1.169394] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
[    1.323132] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
[    1.345425] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
[    1.375502] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
[    1.375575] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 24, idx: -1, ioapic: 1, pin: 8
[    1.375661] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
[    1.375705] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
[    1.442277] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
[    1.442393] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
[    1.442450] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
[    1.453893] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
[    1.456127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
[    1.734065] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
[    1.734165] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
[    1.734253] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
[    1.734344] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
[    1.734426] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
[    1.734512] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
[    1.734597] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
[    1.734643] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
[    1.734687] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
[    1.734728] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
[    1.735017] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
[    1.735252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
[    1.735467] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
[    1.735799] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
[    1.736024] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
[    1.736364] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
[    1.736406] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
[    1.736434] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
[    1.736701] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
[    1.736724] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
[    3.037123] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
[    3.037313] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
[    3.037515] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
[    3.037738] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
[    3.037959] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
[    3.038073] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
[    3.038154] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
[    3.038179] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
[    3.038277] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
[    3.038399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
[    3.038525] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
[    3.038657] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
[    3.038852] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
[    3.052377] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
[    3.052479] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 54, idx: -1, ioapic: 1, pin: 14
[    3.052730] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
[    3.052840] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
[    3.052918] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
[    3.052987] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
[    3.053069] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
[    3.053139] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
[    3.053201] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
[    3.053260] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
[    3.089128] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 63, nr_irqs: 1096
[    3.089310] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 63, idx: -1, ioapic: 1, pin: 24
[    3.089376] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
[    3.103435] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
[    3.114190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
[    3.114346] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 66, nr_irqs: 1096
[    3.121215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 67, nr_irqs: 1096
[    3.121350] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 68, nr_irqs: 1096
[    3.121479] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 69, nr_irqs: 1096
[    3.121612] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 70, nr_irqs: 1096
[    3.121726] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 71, nr_irqs: 1096
[    3.121841] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 72, nr_irqs: 1096
[    3.121955] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 73, nr_irqs: 1096
[    3.122025] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 74, nr_irqs: 1096
[    3.122093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 75, nr_irqs: 1096
[    3.122148] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 76, nr_irqs: 1096
[    3.122203] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 77, nr_irqs: 1096
[    3.122265] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 78, nr_irqs: 1096
[    3.122322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 79, nr_irqs: 1096
[    3.122378] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 80, nr_irqs: 1096
[    3.122433] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 81, nr_irqs: 1096
[    7.838753] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
[    9.619174] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
[    9.619556] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
[    9.622038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
[    9.634900] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
[    9.635316] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
[    9.635405] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
[   10.006686] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
[   10.006823] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 86, idx: -1, ioapic: 1, pin: 5
[   10.007009] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
[   10.008723] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
[   10.009853] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
[   10.010786] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
[   10.010858] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096

2. Why do I translate between irq and gsi?

From the answer to question 1, we know irq != gsi. I found that, in QEMU, pci_qdev_realize->xen_pt_realize->xen_host_pci_device_get->xen_host_pci_get_hex_value obtains the irq number, but later pci_qdev_realize->xen_pt_realize->xc_physdev_map_pirq requires a gsi: it calls into Xen's physdev_map_pirq->allocate_and_map_gsi_pirq to allocate a pirq for that gsi. Passing the irq where a gsi is expected is what triggered the error.
Not only that, the callback path pci_add_dm_done->xc_physdev_map_pirq also needs the gsi.

So, I added a new call that translates irq to gsi for QEMU to use.

I didn't find a similar function in the existing Linux code, and I think only "QEMU passthrough for Xen" needs this translation, so I added it into privcmd. If you know of a similar function or a more suitable place, please feel free to tell me.
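
As an illustration, a minimal sketch of how QEMU could consume such a helper (xc_physdev_gsi_from_irq is an assumed name used for illustration only, not necessarily the name in the patch; xc_physdev_map_pirq is the existing libxc call):

    /* Hypothetical sketch: resolve the host gsi before mapping a pirq. */
    int pirq = -1;
    int gsi = xc_physdev_gsi_from_irq(xch, irq); /* assumed helper name */

    if (gsi < 0)
        gsi = irq; /* fall back to the PV behaviour, where irq == gsi */

    if (xc_physdev_map_pirq(xch, domid, gsi, &pirq))
        return -1; /* gsi could not be mapped, passthrough cannot proceed */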

3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()?

Because if you want to map a gsi for domU, it must first have a mapping in dom0. See the toolstack (libxl) code:
pci_add_dm_done
	xc_physdev_map_pirq
	xc_domain_irq_permission
		XEN_DOMCTL_irq_permission
			pirq_access_permitted
xc_physdev_map_pirq gets the pirq mapped from the gsi, and xc_domain_irq_permission uses that pirq to call into Xen. If we don't do PHYSDEVOP_map_pirq for passthrough devices on PVH dom0, pirq_access_permitted finds a NULL irq for dom0 and fails.

So, I added PHYSDEVOP_map_pirq for PVH dom0. But I think it is only necessary for passthrough devices, not for all devices that call __acpi_register_gsi. In the next version of the patch, I will restrict PHYSDEVOP_map_pirq to passthrough devices only.
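
For reference, a minimal sketch in kernel C of what the dom0-side mapping amounts to, using the existing physdev interface (placement and error handling in the real patch may differ):

    struct physdev_map_pirq map_irq = {
        .domid = DOMID_SELF,
        .type  = MAP_PIRQ_TYPE_GSI,
        .index = gsi, /* gsi of the device being prepared for passthrough */
        .pirq  = gsi, /* request an identity gsi -> pirq mapping */
    };

    /* Establishes the dom0 mapping that pirq_access_permitted() relies on. */
    rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);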

4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?

As Roger commented, the gsi of the passthrough device never gets unmasked and registered (I added printk calls in vioapic_hwdom_map_gsi() and found that it is never called for the dGPU with gsi 28 in my environment).
So, I called PHYSDEVOP_setup_gsi to register the gsi.
But I agree with Roger's and Jan's opinion that it is wrong to do PHYSDEVOP_setup_gsi for all devices.
So, in the next version of the patch, I will also restrict PHYSDEVOP_setup_gsi to passthrough devices only.
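
Again only as a sketch (triggering/polarity use the physdev interface encoding; how the real patch derives them from the ACPI data is omitted here):

    struct physdev_setup_gsi setup_gsi = {
        .gsi        = gsi,
        .triggering = trigger,  /* 0 = edge triggered, 1 = level triggered */
        .polarity   = polarity, /* 0 = active high, 1 = active low */
    };

    /* Registers and unmasks the gsi in Xen, which would otherwise only
     * happen via vioapic_hwdom_map_gsi() when dom0 unmasks the pin. */
    rc = HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);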

-- 
Best regards,
Jiqian Chen.

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-07-31 16:40                         ` Chen, Jiqian
@ 2023-08-23  8:57                           ` Roger Pau Monné
  2023-08-31  8:56                             ` Chen, Jiqian
  0 siblings, 1 reply; 75+ messages in thread
From: Roger Pau Monné @ 2023-08-23  8:57 UTC (permalink / raw)
  To: Chen, Jiqian
  Cc: Stefano Stabellini, Jan Beulich, Huang, Ray, Anthony PERARD,
	xen-devel, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia

On Mon, Jul 31, 2023 at 04:40:35PM +0000, Chen, Jiqian wrote:
> Hi,
> 
> On 2023/3/18 04:55, Stefano Stabellini wrote:
> > On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> >> On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
> >>> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
> >>>> On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
> >>>>> On 17.03.2023 00:19, Stefano Stabellini wrote:
> >>>>>> On Thu, 16 Mar 2023, Jan Beulich wrote:
> >>>>>>> So yes, it then all boils down to that Linux-
> >>>>>>> internal question.
> >>>>>>
> >>>>>> Excellent question but we'll have to wait for Ray as he is the one with
> >>>>>> access to the hardware. But I have this data I can share in the
> >>>>>> meantime:
> >>>>>>
> >>>>>> [    1.260378] IRQ to pin mappings:
> >>>>>> [    1.260387] IRQ1 -> 0:1
> >>>>>> [    1.260395] IRQ2 -> 0:2
> >>>>>> [    1.260403] IRQ3 -> 0:3
> >>>>>> [    1.260410] IRQ4 -> 0:4
> >>>>>> [    1.260418] IRQ5 -> 0:5
> >>>>>> [    1.260425] IRQ6 -> 0:6
> >>>>>> [    1.260432] IRQ7 -> 0:7
> >>>>>> [    1.260440] IRQ8 -> 0:8
> >>>>>> [    1.260447] IRQ9 -> 0:9
> >>>>>> [    1.260455] IRQ10 -> 0:10
> >>>>>> [    1.260462] IRQ11 -> 0:11
> >>>>>> [    1.260470] IRQ12 -> 0:12
> >>>>>> [    1.260478] IRQ13 -> 0:13
> >>>>>> [    1.260485] IRQ14 -> 0:14
> >>>>>> [    1.260493] IRQ15 -> 0:15
> >>>>>> [    1.260505] IRQ106 -> 1:8
> >>>>>> [    1.260513] IRQ112 -> 1:4
> >>>>>> [    1.260521] IRQ116 -> 1:13
> >>>>>> [    1.260529] IRQ117 -> 1:14
> >>>>>> [    1.260537] IRQ118 -> 1:15
> >>>>>> [    1.260544] .................................... done.
> >>>>>
> >>>>> And what does Linux think are IRQs 16 ... 105? Have you compared with
> >>>>> Linux running baremetal on the same hardware?
> >>>>
> >>>> So I have some emails from Ray from the time he was looking into this,
> >>>> and on Linux dom0 PVH dmesg there is:
> >>>>
> >>>> [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
> >>>> [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
> >>>>
> >>>> So it seems the vIO-APIC data provided by Xen to dom0 is at least
> >>>> consistent.
> >>>>  
> >>>>>> And I think Ray traced the point in Linux where Linux gives us an IRQ ==
> >>>>>> 112 (which is the one causing issues):
> >>>>>>
> >>>>>> __acpi_register_gsi->
> >>>>>>         acpi_register_gsi_ioapic->
> >>>>>>                 mp_map_gsi_to_irq->
> >>>>>>                         mp_map_pin_to_irq->
> >>>>>>                                 __irq_resolve_mapping()
> >>>>>>
> >>>>>>         if (likely(data)) {
> >>>>>>                 desc = irq_data_to_desc(data);
> >>>>>>                 if (irq)
> >>>>>>                         *irq = data->irq;
> >>>>>>                 /* this IRQ is 112, IO-APIC-34 domain */
> >>>>>>         }
> >>>>
> >>>>
> >>>> Could this all be a result of patch 4/5 in the Linux series ("[RFC
> >>>> PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
> >>>> __acpi_register_gsi hook is installed for PVH in order to setup GSIs
> >>>> using PHYSDEV ops instead of doing it natively from the IO-APIC?
> >>>>
> >>>> FWIW, the introduced function in that patch
> >>>> (acpi_register_gsi_xen_pvh()) seems to unconditionally call
> >>>> acpi_register_gsi_ioapic() without checking if the GSI is already
> >>>> registered, which might lead to multiple IRQs being allocated for the
> >>>> same underlying GSI?
> >>>
> >>> I understand this point and I think it needs investigating.
> >>>
> >>>
> >>>> As I commented there, I think that approach is wrong.  If the GSI has
> >>>> not been mapped in Xen (because dom0 hasn't unmasked the respective
> >>>> IO-APIC pin) we should add some logic in the toolstack to map it
> >>>> before attempting to bind.
> >>>
> >>> But this statement confuses me. The toolstack doesn't get involved in
> >>> IRQ setup for PCI devices for HVM guests?
> >>
> >> It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
> >> to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
> >> could be removed (maybe for qemu-trad only?) or it's also required by
> >> QEMU upstream, I would have to investigate more.
> > 
> > You are right. I am not certain, but it seems like a mistake in the
> > toolstack to me. In theory, pci_add_dm_done should only be needed for PV
> > guests, not for HVM guests. I am not sure. But I can see the call to
> > xc_physdev_map_pirq you were referring to now.
> > 
> > 
> >> It's my understanding it's in pci_add_dm_done() where Ray was getting
> >> the mismatched IRQ vs GSI number.
> > 
> > I think the mismatch was actually caused by the xc_physdev_map_pirq call
> > from QEMU, which makes sense because in any case it should happen before
> > the same call done by pci_add_dm_done (pci_add_dm_done is called after
> > sending the pci passthrough QMP command to QEMU). So the first to hit
> > the IRQ!=GSI problem would be QEMU.
> 
> 
> Sorry for replying so late, and thank you all for the review. I realized that your questions mainly focus on the following points: 1. Why is irq not equal to gsi? 2. Why do I translate between irq and gsi? 3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()? 4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?
> Please forgive me for giving a summary response first. I am looking forward to your comments.

Sorry, it's been a bit since that conversation, so my recollection is
vague.

One of the questions was why acpi_register_gsi_xen_pvh() is needed.  I
think the patch that introduced it on Linux didn't have much of a
commit description.

> 1. Why is irq not equal to gsi?
> As far as I know, the irq is dynamically allocated according to the gsi; they are not necessarily equal.
> When I run "sudo xl pci-assignable-add 03:00.0" to assign a passthrough device (taking the dGPU in my environment as an example, whose gsi is 28), it calls into acpi_register_gsi_ioapic to get an irq. The call stack is:
> acpi_register_gsi_ioapic
> 	mp_map_gsi_to_irq
> 		mp_map_pin_to_irq
> 			irq_find_mapping(if gsi has been mapped to an irq before, it will return corresponding irq here)
> 			alloc_irq_from_domain
> 				__irq_domain_alloc_irqs
> 					irq_domain_alloc_descs
> 						__irq_alloc_descs

Won't you perform double GSI registrations with Xen if both
acpi_register_gsi_ioapic() and acpi_register_gsi_xen_pvh() are used?

> 
> If you add some printk calls like below:
> ---------------------------------------------------------------------------------------------------------------------------------------------
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index a868b76cd3d4..970fd461be7a 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -1067,6 +1067,8 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
>                 }
>         }
>         mutex_unlock(&ioapic_mutex);
> +       printk("cjq_debug mp_map_pin_to_irq gsi: %u, irq: %d, idx: %d, ioapic: %d, pin: %d\n",
> +                       gsi, irq, idx, ioapic, pin);
> 
>         return irq;
>  }
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index 5db0230aa6b5..4e9613abbe96 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -786,6 +786,8 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>         start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
>                                            from, cnt, 0);
>         ret = -EEXIST;
> +       printk("cjq_debug __irq_alloc_descs irq: %d, from: %u, cnt: %u, node: %d, start: %d, nr_irqs: %d\n",
> +                       irq, from, cnt, node, start, nr_irqs);
>         if (irq >=0 && start != irq)
>                 goto unlock;
> ---------------------------------------------------------------------------------------------------------------------------------------------
> You will get output on PVH dom0:
> 
> [    0.181560] cjq_debug __irq_alloc_descs irq: 1, from: 1, cnt: 1, node: -1, start: 1, nr_irqs: 1096
> [    0.181639] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
> [    0.181641] cjq_debug __irq_alloc_descs irq: 2, from: 2, cnt: 1, node: -1, start: 2, nr_irqs: 1096
> [    0.181682] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 2, idx: 0, ioapic: 0, pin: 2
> [    0.181683] cjq_debug __irq_alloc_descs irq: 3, from: 3, cnt: 1, node: -1, start: 3, nr_irqs: 1096
> [    0.181715] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
> [    0.181716] cjq_debug __irq_alloc_descs irq: 4, from: 4, cnt: 1, node: -1, start: 4, nr_irqs: 1096
> [    0.181751] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
> [    0.181752] cjq_debug __irq_alloc_descs irq: 5, from: 5, cnt: 1, node: -1, start: 5, nr_irqs: 1096
> [    0.181783] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
> [    0.181784] cjq_debug __irq_alloc_descs irq: 6, from: 6, cnt: 1, node: -1, start: 6, nr_irqs: 1096
> [    0.181813] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
> [    0.181814] cjq_debug __irq_alloc_descs irq: 7, from: 7, cnt: 1, node: -1, start: 7, nr_irqs: 1096
> [    0.181856] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
> [    0.181857] cjq_debug __irq_alloc_descs irq: 8, from: 8, cnt: 1, node: -1, start: 8, nr_irqs: 1096
> [    0.181888] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
> [    0.181889] cjq_debug __irq_alloc_descs irq: 9, from: 9, cnt: 1, node: -1, start: 9, nr_irqs: 1096
> [    0.181918] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
> [    0.181919] cjq_debug __irq_alloc_descs irq: 10, from: 10, cnt: 1, node: -1, start: 10, nr_irqs: 1096
> [    0.181950] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
> [    0.181951] cjq_debug __irq_alloc_descs irq: 11, from: 11, cnt: 1, node: -1, start: 11, nr_irqs: 1096
> [    0.181977] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
> [    0.181979] cjq_debug __irq_alloc_descs irq: 12, from: 12, cnt: 1, node: -1, start: 12, nr_irqs: 1096
> [    0.182006] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
> [    0.182007] cjq_debug __irq_alloc_descs irq: 13, from: 13, cnt: 1, node: -1, start: 13, nr_irqs: 1096
> [    0.182034] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
> [    0.182035] cjq_debug __irq_alloc_descs irq: 14, from: 14, cnt: 1, node: -1, start: 14, nr_irqs: 1096
> [    0.182066] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
> [    0.182067] cjq_debug __irq_alloc_descs irq: 15, from: 15, cnt: 1, node: -1, start: 15, nr_irqs: 1096
> [    0.182095] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
> [    0.198199] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
> [    0.198416] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
> [    0.198460] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
> [    0.198489] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
> [    0.198523] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
> [    0.201315] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
> [    0.202174] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
> [    0.202225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
> [    0.202259] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
> [    0.202291] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 63, nr_irqs: 1096
> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 64, nr_irqs: 1096
> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 66, nr_irqs: 1096
> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 67, nr_irqs: 1096
> [    0.210169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 68, nr_irqs: 1096
> [    0.210322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 69, nr_irqs: 1096
> [    0.210370] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 70, nr_irqs: 1096
> [    0.210403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 71, nr_irqs: 1096
> [    0.210436] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 72, nr_irqs: 1096
> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 73, nr_irqs: 1096
> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 74, nr_irqs: 1096
> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 75, nr_irqs: 1096
> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 76, nr_irqs: 1096
> [    0.214151] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 77, nr_irqs: 1096
> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 78, nr_irqs: 1096
> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 79, nr_irqs: 1096
> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 80, nr_irqs: 1096
> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 81, nr_irqs: 1096
> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
> [    0.222215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
> [    0.222366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096
> [    0.222410] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 90, nr_irqs: 1096
> [    0.222447] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 91, nr_irqs: 1096
> [    0.222478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 92, nr_irqs: 1096
> [    0.225490] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 93, nr_irqs: 1096
> [    0.226225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 94, nr_irqs: 1096
> [    0.226268] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 95, nr_irqs: 1096
> [    0.226300] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 96, nr_irqs: 1096
> [    0.226329] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 97, nr_irqs: 1096
> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 98, nr_irqs: 1096
> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 99, nr_irqs: 1096
> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 100, nr_irqs: 1096
> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 101, nr_irqs: 1096
> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 102, nr_irqs: 1096
> [    0.232399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 103, nr_irqs: 1096
> [    0.248854] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 104, nr_irqs: 1096
> [    0.250609] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 105, nr_irqs: 1096
> [    0.372343] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
> [    0.720950] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
> [    0.721052] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
> [    1.254825] cjq_debug mp_map_pin_to_irq gsi: 7, irq: -16, idx: 7, ioapic: 0, pin: 7
> [    1.333081] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
> [    1.375882] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 106, nr_irqs: 1096
> [    1.375951] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 106, idx: -1, ioapic: 1, pin: 8
> [    1.376072] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
> [    1.376121] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
> [    1.472551] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
> [    1.472697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
> [    1.472751] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
> [    1.484290] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
> [    1.768163] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
> [    1.768627] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 108, nr_irqs: 1096
> [    1.769059] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 109, nr_irqs: 1096
> [    1.769694] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 110, nr_irqs: 1096
> [    1.770169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 111, nr_irqs: 1096
> [    1.770697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 112, nr_irqs: 1096
> [    1.770738] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
> [    1.770789] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 113, nr_irqs: 1096
> [    1.771230] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
> [    1.771278] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 114, nr_irqs: 1096
> [    2.127884] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 115, nr_irqs: 1096
> [    3.207419] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 116, nr_irqs: 1096
> [    3.207730] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
> [    3.208120] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 117, nr_irqs: 1096
> [    3.208475] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
> [    3.208478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 118, nr_irqs: 1096
> [    3.208861] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
> [    3.208933] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 119, nr_irqs: 1096
> [    3.209127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 120, nr_irqs: 1096
> [    3.209383] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 121, nr_irqs: 1096
> [    3.209863] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 122, nr_irqs: 1096
> [    3.211439] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 123, nr_irqs: 1096
> [    3.211833] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 124, nr_irqs: 1096
> [    3.212873] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 125, nr_irqs: 1096
> [    3.243514] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 126, nr_irqs: 1096
> [    3.243689] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 126, idx: -1, ioapic: 1, pin: 14
> [    3.244293] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 127, nr_irqs: 1096
> [    3.244534] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 128, nr_irqs: 1096
> [    3.244714] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 129, nr_irqs: 1096
> [    3.244911] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 130, nr_irqs: 1096
> [    3.245096] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 131, nr_irqs: 1096
> [    3.245633] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 132, nr_irqs: 1096
> [    3.247890] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 133, nr_irqs: 1096
> [    3.248192] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 134, nr_irqs: 1096
> [    3.271093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 135, nr_irqs: 1096
> [    3.307045] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 136, nr_irqs: 1096
> [    3.307162] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 136, idx: -1, ioapic: 1, pin: 24
> [    3.307223] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
> [    3.331183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
> [    3.331295] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 138, nr_irqs: 1096
> [    3.331366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 139, nr_irqs: 1096
> [    3.331438] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 140, nr_irqs: 1096
> [    3.331511] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 141, nr_irqs: 1096
> [    3.331579] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 142, nr_irqs: 1096
> [    3.331646] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 143, nr_irqs: 1096
> [    3.331713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 144, nr_irqs: 1096
> [    3.331780] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 145, nr_irqs: 1096
> [    3.331846] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 146, nr_irqs: 1096
> [    3.331913] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 147, nr_irqs: 1096
> [    3.331984] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 148, nr_irqs: 1096
> [    3.332051] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 149, nr_irqs: 1096
> [    3.332118] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 150, nr_irqs: 1096
> [    3.332183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 151, nr_irqs: 1096
> [    3.332252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 152, nr_irqs: 1096
> [    3.332319] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 153, nr_irqs: 1096
> [    8.010370] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
> [    9.545439] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
> [    9.545713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 154, nr_irqs: 1096
> [    9.546034] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 155, nr_irqs: 1096
> [    9.687796] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 156, nr_irqs: 1096
> [    9.687979] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
> [    9.688057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 157, nr_irqs: 1096
> [    9.921038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 158, nr_irqs: 1096
> [    9.921210] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 158, idx: -1, ioapic: 1, pin: 5
> [    9.921403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 159, nr_irqs: 1096
> [    9.926373] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
> [    9.926747] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 160, nr_irqs: 1096
> [    9.928201] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
> [    9.928488] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 161, nr_irqs: 1096
> [   10.653915] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 162, nr_irqs: 1096
> [   10.656257] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 163, nr_irqs: 1096
> 
> You can see that the irq allocation is not always based on the value of the gsi. It follows a first-come, first-served principle: for example, gsi 32 gets irq 106 but gsi 28 gets irq 112. And acpi_register_gsi_ioapic() is not the only function that calls into __irq_alloc_descs; other functions call it too, even earlier.
> The output above behaves like baremetal. So we can conclude that irq != gsi. See the output below from Linux running on baremetal:

It does seem weird to me that legacy IRQs (<16) are identity mapped,
but then for GSI >= 16 IRQs start being assigned in the 100 range.

What uses the IRQ range [24, 105]?

Also IIRC on a PV dom0 GSIs are identity mapped to IRQs on Linux?  Or
maybe that's just a side effect of GSIs being identity mapped into
PIRQs by Xen?

> [    0.105053] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
> [    0.105061] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 0, idx: 0, ioapic: 0, pin: 2
> [    0.105069] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
> [    0.105078] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
> [    0.105086] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
> [    0.105094] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
> [    0.105103] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
> [    0.105111] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
> [    0.105119] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
> [    0.105127] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
> [    0.105136] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
> [    0.105144] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
> [    0.105152] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
> [    0.105160] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
> [    0.105169] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
> [    0.398134] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
> [    1.169293] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
> [    1.169394] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
> [    1.323132] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
> [    1.345425] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
> [    1.375502] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
> [    1.375575] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 24, idx: -1, ioapic: 1, pin: 8
> [    1.375661] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
> [    1.375705] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
> [    1.442277] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
> [    1.442393] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
> [    1.442450] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
> [    1.453893] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
> [    1.456127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
> [    1.734065] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
> [    1.734165] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
> [    1.734253] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
> [    1.734344] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
> [    1.734426] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
> [    1.734512] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
> [    1.734597] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
> [    1.734643] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
> [    1.734687] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
> [    1.734728] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
> [    1.735017] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
> [    1.735252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
> [    1.735467] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
> [    1.735799] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
> [    1.736024] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
> [    1.736364] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
> [    1.736406] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
> [    1.736434] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
> [    1.736701] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
> [    1.736724] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
> [    3.037123] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
> [    3.037313] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
> [    3.037515] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
> [    3.037738] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
> [    3.037959] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
> [    3.038073] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
> [    3.038154] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
> [    3.038179] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
> [    3.038277] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
> [    3.038399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
> [    3.038525] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
> [    3.038657] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
> [    3.038852] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
> [    3.052377] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
> [    3.052479] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 54, idx: -1, ioapic: 1, pin: 14
> [    3.052730] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
> [    3.052840] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
> [    3.052918] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
> [    3.052987] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
> [    3.053069] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
> [    3.053139] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
> [    3.053201] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
> [    3.053260] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
> [    3.089128] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 63, nr_irqs: 1096
> [    3.089310] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 63, idx: -1, ioapic: 1, pin: 24
> [    3.089376] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
> [    3.103435] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
> [    3.114190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
> [    3.114346] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 66, nr_irqs: 1096
> [    3.121215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 67, nr_irqs: 1096
> [    3.121350] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 68, nr_irqs: 1096
> [    3.121479] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 69, nr_irqs: 1096
> [    3.121612] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 70, nr_irqs: 1096
> [    3.121726] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 71, nr_irqs: 1096
> [    3.121841] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 72, nr_irqs: 1096
> [    3.121955] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 73, nr_irqs: 1096
> [    3.122025] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 74, nr_irqs: 1096
> [    3.122093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 75, nr_irqs: 1096
> [    3.122148] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 76, nr_irqs: 1096
> [    3.122203] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 77, nr_irqs: 1096
> [    3.122265] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 78, nr_irqs: 1096
> [    3.122322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 79, nr_irqs: 1096
> [    3.122378] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 80, nr_irqs: 1096
> [    3.122433] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 81, nr_irqs: 1096
> [    7.838753] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
> [    9.619174] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
> [    9.619556] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
> [    9.622038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
> [    9.634900] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
> [    9.635316] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
> [    9.635405] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
> [   10.006686] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
> [   10.006823] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 86, idx: -1, ioapic: 1, pin: 5
> [   10.007009] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
> [   10.008723] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
> [   10.009853] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
> [   10.010786] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
> [   10.010858] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096
> 
> 2. Why do I translate between irq and gsi?
> 
> From the answer to question 1, we know irq != gsi. I found that, in QEMU, pci_qdev_realize->xen_pt_realize->xen_host_pci_device_get->xen_host_pci_get_hex_value obtains the irq number, but later pci_qdev_realize->xen_pt_realize->xc_physdev_map_pirq requires a gsi:

So that's quite a difference.  For some reason on a PV dom0
xen_host_pci_get_hex_value will return the IRQ that's identity mapped
to the GSI.

Is that because a PV dom0 will use acpi_register_gsi_xen() instead of
acpi_register_gsi_ioapic()?

> it calls into Xen's physdev_map_pirq->allocate_and_map_gsi_pirq to allocate a pirq for that gsi. Passing the irq where a gsi is expected is what triggered the error.
> Not only that, the callback path pci_add_dm_done->xc_physdev_map_pirq also needs the gsi.
> 
> So, I added a new call that translates irq to gsi for QEMU to use.
> 
> I didn't find a similar function in the existing Linux code, and I think only "QEMU passthrough for Xen" needs this translation, so I added it into privcmd. If you know of a similar function or a more suitable place, please feel free to tell me.
> 
> 3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()?
> 
> Because if you want to map a gsi for domU, it must first have a mapping in dom0. See the toolstack (libxl) code:
> pci_add_dm_done
> 	xc_physdev_map_pirq
> 	xc_domain_irq_permission
> 		XEN_DOMCTL_irq_permission
> 			pirq_access_permitted
> xc_physdev_map_pirq gets the pirq mapped from the gsi, and xc_domain_irq_permission uses that pirq to call into Xen. If we don't do PHYSDEVOP_map_pirq for passthrough devices on PVH dom0, pirq_access_permitted finds a NULL irq for dom0 and fails.

I'm not sure of this specific case, but we shouldn't attempt to fit
the exact same PCI passthrough workflow that a PV dom0 uses into a
PVH dom0.  IOW: it might make sense to diverge some paths in order to
avoid importing PV-specific concepts into PVH without a reason.

> So, I added PHYSDEVOP_map_pirq for PVH dom0. But I think it is only necessary for passthrough devices, not for all devices that call __acpi_register_gsi. In the next version of the patch, I will restrict PHYSDEVOP_map_pirq to passthrough devices only.
> 
> 4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?
> 
> As Roger commented, the gsi of the passthrough device never gets unmasked and registered (I added printk calls in vioapic_hwdom_map_gsi() and found that it is never called for the dGPU with gsi 28 in my environment).
> So, I called PHYSDEVOP_setup_gsi to register the gsi.
> But I agree with Roger's and Jan's opinion that it is wrong to do PHYSDEVOP_setup_gsi for all devices.
> So, in the next version of the patch, I will also restrict PHYSDEVOP_setup_gsi to passthrough devices only.

Right, given how long it's been since the last series, I think we need
a new series posted in order to see how this looks now.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC XEN PATCH 6/6] tools/libs/light: pci: translate irq to gsi
  2023-08-23  8:57                           ` Roger Pau Monné
@ 2023-08-31  8:56                             ` Chen, Jiqian
  0 siblings, 0 replies; 75+ messages in thread
From: Chen, Jiqian @ 2023-08-31  8:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Jan Beulich, Huang, Ray, Anthony PERARD,
	xen-devel, Deucher, Alexander, Koenig, Christian, Hildebrand,
	Stewart, Xenia Ragiadakou, Huang, Honglei1, Zhang, Julia, Chen,
	Jiqian

Thanks Roger, we will send a new series after the code freeze for the Xen 4.18 release.

On 2023/8/23 16:57, Roger Pau Monné wrote:
> On Mon, Jul 31, 2023 at 04:40:35PM +0000, Chen, Jiqian wrote:
>> Hi,
>>
>> On 2023/3/18 04:55, Stefano Stabellini wrote:
>>> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
>>>> On Fri, Mar 17, 2023 at 11:15:37AM -0700, Stefano Stabellini wrote:
>>>>> On Fri, 17 Mar 2023, Roger Pau Monné wrote:
>>>>>> On Fri, Mar 17, 2023 at 09:39:52AM +0100, Jan Beulich wrote:
>>>>>>> On 17.03.2023 00:19, Stefano Stabellini wrote:
>>>>>>>> On Thu, 16 Mar 2023, Jan Beulich wrote:
>>>>>>>>> So yes, it then all boils down to that Linux-
>>>>>>>>> internal question.
>>>>>>>>
>>>>>>>> Excellent question but we'll have to wait for Ray as he is the one with
>>>>>>>> access to the hardware. But I have this data I can share in the
>>>>>>>> meantime:
>>>>>>>>
>>>>>>>> [    1.260378] IRQ to pin mappings:
>>>>>>>> [    1.260387] IRQ1 -> 0:1
>>>>>>>> [    1.260395] IRQ2 -> 0:2
>>>>>>>> [    1.260403] IRQ3 -> 0:3
>>>>>>>> [    1.260410] IRQ4 -> 0:4
>>>>>>>> [    1.260418] IRQ5 -> 0:5
>>>>>>>> [    1.260425] IRQ6 -> 0:6
>>>>>>>> [    1.260432] IRQ7 -> 0:7
>>>>>>>> [    1.260440] IRQ8 -> 0:8
>>>>>>>> [    1.260447] IRQ9 -> 0:9
>>>>>>>> [    1.260455] IRQ10 -> 0:10
>>>>>>>> [    1.260462] IRQ11 -> 0:11
>>>>>>>> [    1.260470] IRQ12 -> 0:12
>>>>>>>> [    1.260478] IRQ13 -> 0:13
>>>>>>>> [    1.260485] IRQ14 -> 0:14
>>>>>>>> [    1.260493] IRQ15 -> 0:15
>>>>>>>> [    1.260505] IRQ106 -> 1:8
>>>>>>>> [    1.260513] IRQ112 -> 1:4
>>>>>>>> [    1.260521] IRQ116 -> 1:13
>>>>>>>> [    1.260529] IRQ117 -> 1:14
>>>>>>>> [    1.260537] IRQ118 -> 1:15
>>>>>>>> [    1.260544] .................................... done.
>>>>>>>
>>>>>>> And what does Linux think are IRQs 16 ... 105? Have you compared with
>>>>>>> Linux running baremetal on the same hardware?
>>>>>>
>>>>>> So I have some emails from Ray from the time he was looking into this,
>>>>>> and on Linux dom0 PVH dmesg there is:
>>>>>>
>>>>>> [    0.065063] IOAPIC[0]: apic_id 33, version 17, address 0xfec00000, GSI 0-23
>>>>>> [    0.065096] IOAPIC[1]: apic_id 34, version 17, address 0xfec01000, GSI 24-55
>>>>>>
>>>>>> So it seems the vIO-APIC data provided by Xen to dom0 is at least
>>>>>> consistent.
>>>>>>  
>>>>>>>> And I think Ray traced the point in Linux where Linux gives us an IRQ ==
>>>>>>>> 112 (which is the one causing issues):
>>>>>>>>
>>>>>>>> __acpi_register_gsi->
>>>>>>>>         acpi_register_gsi_ioapic->
>>>>>>>>                 mp_map_gsi_to_irq->
>>>>>>>>                         mp_map_pin_to_irq->
>>>>>>>>                                 __irq_resolve_mapping()
>>>>>>>>
>>>>>>>>         if (likely(data)) {
>>>>>>>>                 desc = irq_data_to_desc(data);
>>>>>>>>                 if (irq)
>>>>>>>>                         *irq = data->irq;
>>>>>>>>                 /* this IRQ is 112, IO-APIC-34 domain */
>>>>>>>>         }
>>>>>>
>>>>>>
>>>>>> Could this all be a result of patch 4/5 in the Linux series ("[RFC
>>>>>> PATCH 4/5] x86/xen: acpi registers gsi for xen pvh"), where a different
>>>>>> __acpi_register_gsi hook is installed for PVH in order to setup GSIs
>>>>>> using PHYSDEV ops instead of doing it natively from the IO-APIC?
>>>>>>
>>>>>> FWIW, the introduced function in that patch
>>>>>> (acpi_register_gsi_xen_pvh()) seems to unconditionally call
>>>>>> acpi_register_gsi_ioapic() without checking if the GSI is already
>>>>>> registered, which might lead to multiple IRQs being allocated for the
>>>>>> same underlying GSI?
>>>>>
>>>>> I understand this point and I think it needs investigating.
>>>>>
>>>>>
>>>>>> As I commented there, I think that approach is wrong.  If the GSI has
>>>>>> not been mapped in Xen (because dom0 hasn't unmasked the respective
>>>>>> IO-APIC pin) we should add some logic in the toolstack to map it
>>>>>> before attempting to bind.
>>>>>
>>>>> But this statement confuses me. The toolstack doesn't get involved in
>>>>> IRQ setup for PCI devices for HVM guests?
>>>>
>>>> It does for GSI interrupts AFAICT, see pci_add_dm_done() and the call
>>>> to xc_physdev_map_pirq().  I'm not sure whether that's a remnant that
>>>> could be removed (maybe for qemu-trad only?) or it's also required by
>>>> QEMU upstream, I would have to investigate more.
>>>
>>> You are right. I am not certain, but it seems like a mistake in the
>>> toolstack to me. In theory, pci_add_dm_done should only be needed for PV
>>> guests, not for HVM guests. I am not sure. But I can see the call to
>>> xc_physdev_map_pirq you were referring to now.
>>>
>>>
>>>> It's my understanding it's in pci_add_dm_done() where Ray was getting
>>>> the mismatched IRQ vs GSI number.
>>>
>>> I think the mismatch was actually caused by the xc_physdev_map_pirq call
>>> from QEMU, which makes sense because in any case it should happen before
>>> the same call done by pci_add_dm_done (pci_add_dm_done is called after
>>> sending the pci passthrough QMP command to QEMU). So the first to hit
>>> the IRQ!=GSI problem would be QEMU.
>>
>>
>> Sorry for replying so late, and thank you all for the review. I realized that your questions mainly focus on the following points: 1. Why is irq not equal to gsi? 2. Why do I translate between irq and gsi? 3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()? 4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?
>> Please forgive me for giving a summary response first. I am looking forward to your comments.
> 
> Sorry, it's been a bit since that conversation, so my recollection is
> vague.
> 
> One of the questions was why acpi_register_gsi_xen_pvh() is needed.  I
> think the patch that introduced it on Linux didn't have much of a
> commit description.
PVH and baremetal both use acpi_register_gsi_ioapic to allocate an irq for a gsi. I added the function acpi_register_gsi_xen_pvh to replace acpi_register_gsi_ioapic on PVH, so that I can do PVH-specific things such as map_pirq and setup_gsi.
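
Roughly, the function has the following shape (a simplified sketch reconstructed from this discussion, not the literal patch body; error handling and the planned passthrough-only restriction are omitted):

static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
				     int trigger, int polarity)
{
	struct physdev_setup_gsi setup_gsi = {
		.gsi        = gsi,
		/* Xen encoding: 0 = edge/active high, 1 = level/active low */
		.triggering = (trigger == ACPI_EDGE_SENSITIVE) ? 0 : 1,
		.polarity   = (polarity == ACPI_ACTIVE_HIGH) ? 0 : 1,
	};
	struct physdev_map_pirq map_irq = {
		.domid = DOMID_SELF,
		.type  = MAP_PIRQ_TYPE_GSI,
		.index = gsi,
		.pirq  = gsi,
	};
	int irq;

	/* Keep the native IO-APIC registration... */
	irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
	if (irq < 0)
		return irq;

	/* ...then do the PVH-specific part: register the gsi with Xen and
	 * establish the dom0 gsi -> pirq mapping (see questions 3 and 4 in
	 * my earlier mail). */
	HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
	HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);

	return irq;
}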

> 
>> 1. Why is irq not equal to gsi?
>> As far as I know, the irq is dynamically allocated according to the gsi; they are not necessarily equal.
>> When I run "sudo xl pci-assignable-add 03:00.0" to assign a passthrough device (taking the dGPU in my environment as an example, whose gsi is 28), it calls into acpi_register_gsi_ioapic to get an irq. The call stack is:
>> acpi_register_gsi_ioapic
>> 	mp_map_gsi_to_irq
>> 		mp_map_pin_to_irq
>> 			irq_find_mapping(if gsi has been mapped to an irq before, it will return corresponding irq here)
>> 			alloc_irq_from_domain
>> 				__irq_domain_alloc_irqs
>> 					irq_domain_alloc_descs
>> 						__irq_alloc_descs
> 
> Won't you perform double GSI registrations with Xen if both
> acpi_register_gsi_ioapic() and acpi_register_gsi_xen_pvh() are used?
In the original PVH code, __acpi_register_gsi is set to acpi_register_gsi_ioapic in the call stack start_kernel->setup_arch->acpi_boot_init->acpi_process_madt->acpi_set_irq_model_ioapic.
In my code, acpi_register_gsi_xen_pvh replaces acpi_register_gsi_ioapic in the call stack start_kernel->init_IRQ->xen_init_IRQ->pci_xen_pvh_init.
So once pci_xen_pvh_init has run, only acpi_register_gsi_xen_pvh is called, and each GSI is registered just once.
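
A minimal sketch of what that hook replacement could look like, assuming the __acpi_register_gsi function pointer from arch/x86/kernel/acpi/boot.c; the body of acpi_register_gsi_xen_pvh() below is only illustrative, not the exact patch:

static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
                                     int trigger, int polarity)
{
        int irq;

        /* Reuse the native IO-APIC path to allocate the Linux IRQ. */
        irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
        if (irq < 0)
                return irq;

        /* PVH-specific work (map_pirq, setup_gsi, ...) would go here. */

        return irq;
}

void __init pci_xen_pvh_init(void)
{
        /* Take over GSI registration from acpi_register_gsi_ioapic(). */
        __acpi_register_gsi = acpi_register_gsi_xen_pvh;
}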

> 
>>
>> If you add some debug printks like the ones below:
>> ---------------------------------------------------------------------------------------------------------------------------------------------
>> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
>> index a868b76cd3d4..970fd461be7a 100644
>> --- a/arch/x86/kernel/apic/io_apic.c
>> +++ b/arch/x86/kernel/apic/io_apic.c
>> @@ -1067,6 +1067,8 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
>>                 }
>>         }
>>         mutex_unlock(&ioapic_mutex);
>> +       printk("cjq_debug mp_map_pin_to_irq gsi: %u, irq: %d, idx: %d, ioapic: %d, pin: %d\n",
>> +                       gsi, irq, idx, ioapic, pin);
>>
>>         return irq;
>>  }
>> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
>> index 5db0230aa6b5..4e9613abbe96 100644
>> --- a/kernel/irq/irqdesc.c
>> +++ b/kernel/irq/irqdesc.c
>> @@ -786,6 +786,8 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>>         start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
>>                                            from, cnt, 0);
>>         ret = -EEXIST;
>> +       printk("cjq_debug __irq_alloc_descs irq: %d, from: %u, cnt: %u, node: %d, start: %d, nr_irqs: %d\n",
>> +                       irq, from, cnt, node, start, nr_irqs);
>>         if (irq >=0 && start != irq)
>>                 goto unlock;
>> ---------------------------------------------------------------------------------------------------------------------------------------------
>> You will get output on PVH dom0:
>>
>> [    0.181560] cjq_debug __irq_alloc_descs irq: 1, from: 1, cnt: 1, node: -1, start: 1, nr_irqs: 1096
>> [    0.181639] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
>> [    0.181641] cjq_debug __irq_alloc_descs irq: 2, from: 2, cnt: 1, node: -1, start: 2, nr_irqs: 1096
>> [    0.181682] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 2, idx: 0, ioapic: 0, pin: 2
>> [    0.181683] cjq_debug __irq_alloc_descs irq: 3, from: 3, cnt: 1, node: -1, start: 3, nr_irqs: 1096
>> [    0.181715] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
>> [    0.181716] cjq_debug __irq_alloc_descs irq: 4, from: 4, cnt: 1, node: -1, start: 4, nr_irqs: 1096
>> [    0.181751] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
>> [    0.181752] cjq_debug __irq_alloc_descs irq: 5, from: 5, cnt: 1, node: -1, start: 5, nr_irqs: 1096
>> [    0.181783] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
>> [    0.181784] cjq_debug __irq_alloc_descs irq: 6, from: 6, cnt: 1, node: -1, start: 6, nr_irqs: 1096
>> [    0.181813] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
>> [    0.181814] cjq_debug __irq_alloc_descs irq: 7, from: 7, cnt: 1, node: -1, start: 7, nr_irqs: 1096
>> [    0.181856] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
>> [    0.181857] cjq_debug __irq_alloc_descs irq: 8, from: 8, cnt: 1, node: -1, start: 8, nr_irqs: 1096
>> [    0.181888] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
>> [    0.181889] cjq_debug __irq_alloc_descs irq: 9, from: 9, cnt: 1, node: -1, start: 9, nr_irqs: 1096
>> [    0.181918] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
>> [    0.181919] cjq_debug __irq_alloc_descs irq: 10, from: 10, cnt: 1, node: -1, start: 10, nr_irqs: 1096
>> [    0.181950] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
>> [    0.181951] cjq_debug __irq_alloc_descs irq: 11, from: 11, cnt: 1, node: -1, start: 11, nr_irqs: 1096
>> [    0.181977] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
>> [    0.181979] cjq_debug __irq_alloc_descs irq: 12, from: 12, cnt: 1, node: -1, start: 12, nr_irqs: 1096
>> [    0.182006] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
>> [    0.182007] cjq_debug __irq_alloc_descs irq: 13, from: 13, cnt: 1, node: -1, start: 13, nr_irqs: 1096
>> [    0.182034] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
>> [    0.182035] cjq_debug __irq_alloc_descs irq: 14, from: 14, cnt: 1, node: -1, start: 14, nr_irqs: 1096
>> [    0.182066] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
>> [    0.182067] cjq_debug __irq_alloc_descs irq: 15, from: 15, cnt: 1, node: -1, start: 15, nr_irqs: 1096
>> [    0.182095] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
>> [    0.186111] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
>> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
>> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
>> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
>> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
>> [    0.188491] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
>> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
>> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
>> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
>> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
>> [    0.192282] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
>> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
>> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
>> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
>> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
>> [    0.196208] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
>> [    0.198199] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
>> [    0.198416] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
>> [    0.198460] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
>> [    0.198489] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
>> [    0.198523] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
>> [    0.201315] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
>> [    0.202174] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
>> [    0.202225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
>> [    0.202259] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
>> [    0.202291] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
>> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
>> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
>> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
>> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
>> [    0.205239] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
>> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 63, nr_irqs: 1096
>> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 64, nr_irqs: 1096
>> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
>> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 66, nr_irqs: 1096
>> [    0.208653] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 67, nr_irqs: 1096
>> [    0.210169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 68, nr_irqs: 1096
>> [    0.210322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 69, nr_irqs: 1096
>> [    0.210370] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 70, nr_irqs: 1096
>> [    0.210403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 71, nr_irqs: 1096
>> [    0.210436] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 72, nr_irqs: 1096
>> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 73, nr_irqs: 1096
>> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 74, nr_irqs: 1096
>> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 75, nr_irqs: 1096
>> [    0.213190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 76, nr_irqs: 1096
>> [    0.214151] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 77, nr_irqs: 1096
>> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 78, nr_irqs: 1096
>> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 79, nr_irqs: 1096
>> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 80, nr_irqs: 1096
>> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 81, nr_irqs: 1096
>> [    0.217075] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
>> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
>> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
>> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
>> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
>> [    0.220389] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
>> [    0.222215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
>> [    0.222366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096
>> [    0.222410] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 90, nr_irqs: 1096
>> [    0.222447] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 91, nr_irqs: 1096
>> [    0.222478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 92, nr_irqs: 1096
>> [    0.225490] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 93, nr_irqs: 1096
>> [    0.226225] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 94, nr_irqs: 1096
>> [    0.226268] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 95, nr_irqs: 1096
>> [    0.226300] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 96, nr_irqs: 1096
>> [    0.226329] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 97, nr_irqs: 1096
>> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 98, nr_irqs: 1096
>> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 99, nr_irqs: 1096
>> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 100, nr_irqs: 1096
>> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 101, nr_irqs: 1096
>> [    0.229057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 102, nr_irqs: 1096
>> [    0.232399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 103, nr_irqs: 1096
>> [    0.248854] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 104, nr_irqs: 1096
>> [    0.250609] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 105, nr_irqs: 1096
>> [    0.372343] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
>> [    0.720950] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
>> [    0.721052] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
>> [    1.254825] cjq_debug mp_map_pin_to_irq gsi: 7, irq: -16, idx: 7, ioapic: 0, pin: 7
>> [    1.333081] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
>> [    1.375882] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 106, nr_irqs: 1096
>> [    1.375951] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 106, idx: -1, ioapic: 1, pin: 8
>> [    1.376072] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
>> [    1.376121] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
>> [    1.472551] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 107, idx: -1, ioapic: 1, pin: 13
>> [    1.472697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
>> [    1.472751] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
>> [    1.484290] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 107, idx: -1, ioapic: 1, pin: 14
>> [    1.768163] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 107, nr_irqs: 1096
>> [    1.768627] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 108, nr_irqs: 1096
>> [    1.769059] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 109, nr_irqs: 1096
>> [    1.769694] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 110, nr_irqs: 1096
>> [    1.770169] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 111, nr_irqs: 1096
>> [    1.770697] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 112, nr_irqs: 1096
>> [    1.770738] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
>> [    1.770789] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 113, nr_irqs: 1096
>> [    1.771230] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 112, idx: -1, ioapic: 1, pin: 4
>> [    1.771278] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 114, nr_irqs: 1096
>> [    2.127884] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 115, nr_irqs: 1096
>> [    3.207419] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 116, nr_irqs: 1096
>> [    3.207730] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
>> [    3.208120] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 117, nr_irqs: 1096
>> [    3.208475] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
>> [    3.208478] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 118, nr_irqs: 1096
>> [    3.208861] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
>> [    3.208933] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 119, nr_irqs: 1096
>> [    3.209127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 120, nr_irqs: 1096
>> [    3.209383] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 121, nr_irqs: 1096
>> [    3.209863] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 122, nr_irqs: 1096
>> [    3.211439] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 123, nr_irqs: 1096
>> [    3.211833] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 124, nr_irqs: 1096
>> [    3.212873] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 125, nr_irqs: 1096
>> [    3.243514] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 126, nr_irqs: 1096
>> [    3.243689] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 126, idx: -1, ioapic: 1, pin: 14
>> [    3.244293] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 127, nr_irqs: 1096
>> [    3.244534] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 128, nr_irqs: 1096
>> [    3.244714] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 129, nr_irqs: 1096
>> [    3.244911] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 130, nr_irqs: 1096
>> [    3.245096] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 131, nr_irqs: 1096
>> [    3.245633] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 132, nr_irqs: 1096
>> [    3.247890] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 133, nr_irqs: 1096
>> [    3.248192] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 134, nr_irqs: 1096
>> [    3.271093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 135, nr_irqs: 1096
>> [    3.307045] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 136, nr_irqs: 1096
>> [    3.307162] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 136, idx: -1, ioapic: 1, pin: 24
>> [    3.307223] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
>> [    3.331183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 137, nr_irqs: 1096
>> [    3.331295] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 138, nr_irqs: 1096
>> [    3.331366] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 139, nr_irqs: 1096
>> [    3.331438] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 140, nr_irqs: 1096
>> [    3.331511] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 141, nr_irqs: 1096
>> [    3.331579] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 142, nr_irqs: 1096
>> [    3.331646] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 143, nr_irqs: 1096
>> [    3.331713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 144, nr_irqs: 1096
>> [    3.331780] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 145, nr_irqs: 1096
>> [    3.331846] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 146, nr_irqs: 1096
>> [    3.331913] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 147, nr_irqs: 1096
>> [    3.331984] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 148, nr_irqs: 1096
>> [    3.332051] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 149, nr_irqs: 1096
>> [    3.332118] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 150, nr_irqs: 1096
>> [    3.332183] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 151, nr_irqs: 1096
>> [    3.332252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 152, nr_irqs: 1096
>> [    3.332319] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 153, nr_irqs: 1096
>> [    8.010370] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 116, idx: -1, ioapic: 1, pin: 13
>> [    9.545439] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
>> [    9.545713] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 154, nr_irqs: 1096
>> [    9.546034] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 155, nr_irqs: 1096
>> [    9.687796] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 156, nr_irqs: 1096
>> [    9.687979] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
>> [    9.688057] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 157, nr_irqs: 1096
>> [    9.921038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 158, nr_irqs: 1096
>> [    9.921210] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 158, idx: -1, ioapic: 1, pin: 5
>> [    9.921403] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 159, nr_irqs: 1096
>> [    9.926373] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 156, idx: -1, ioapic: 1, pin: 15
>> [    9.926747] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 160, nr_irqs: 1096
>> [    9.928201] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 117, idx: -1, ioapic: 1, pin: 12
>> [    9.928488] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 161, nr_irqs: 1096
>> [   10.653915] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 162, nr_irqs: 1096
>> [   10.656257] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 163, nr_irqs: 1096
>>
>> You can see that the IRQ allocation is not simply based on the GSI value: it is first-come, first-served, which is why gsi 32 gets irq 106 while gsi 28 gets irq 112. And acpi_register_gsi_ioapic() is not the only caller of __irq_alloc_descs(); other code allocates descriptors as well, even earlier.
>> In this respect the PVH dom0 output above behaves just like bare metal, so we can conclude that irq != gsi. For comparison, see the output on native Linux below (a toy model of the first-come, first-served allocation follows that log):
> 
> It does seem weird to me that it does identity map legacy IRQs (<16),
> but then for GSI >= 16 it starts assigning IRQs in the 100 range.
> 
> What uses the IRQ range [24, 105]?
They are allocated to IPIs, MSIs, and event channels, which call __irq_alloc_descs() before the PCI devices do. For example, see one IPI's call stack, and the sketch after it:
kernel_init
	kernel_init_freeable
		smp_prepare_cpus
			smp_ops.smp_prepare_cpus
				xen_hvm_smp_prepare_cpus
					xen_smp_intr_init
						bind_ipi_to_irqhandler
							bind_ipi_to_irq
								xen_allocate_irq_dynamic
									__irq_alloc_descs
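
Roughly what the dynamic allocation at the bottom of this call stack boils down to (a simplified sketch of drivers/xen/events, not the exact code): irq == -1 means "any free descriptor", and __irq_alloc_descs() then raises from to arch_dynirq_lower_bound(from) internally, which is why the log lines above all show "from: 24" on PVH:

static int xen_allocate_irq_dynamic(void)
{
        /* irq == -1: no fixed number requested; the core picks the
         * first free descriptor at or above the dynamic lower bound. */
        return irq_alloc_descs(-1, 0, 1, -1);
}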

> 
> Also IIRC on a PV dom0 GSIs are identity mapped to IRQs on Linux?  Or
> maybe that's just a side effect of GSIs being identity mapped into
> PIRQs by Xen?
PV is different. Although IPIs also come before PCI devices there, they don't occupy IRQs 24~56, because a PV dom0 doesn't call setup_IO_APIC during start_kernel. As a result, the variable "ioapic_initialized" in arch_dynirq_lower_bound() is never set, so that function returns gsi_top, whose value is 56, and dynamic IRQ allocation begins at 56 (PVH and bare metal do initialize "ioapic_initialized", so arch_dynirq_lower_bound() returns ioapic_dynirq_base, whose value is 24). What's more, when PV allocates an IRQ for a PCI device it calls acpi_register_gsi_xen->irq_alloc_desc_at->__irq_alloc_descs, and irq_alloc_desc_at() passes the GSI to __irq_alloc_descs() (PVH and bare metal pass -1). So in __irq_alloc_descs() the variable "from" equals the gsi, the gsi lies in 24~56, and those IRQ numbers have not been occupied yet; the function therefore returns an irq equal to the gsi.
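
For reference, the helper mentioned above looks roughly like this (simplified from arch/x86/kernel/apic/io_apic.c, not quoted verbatim):

unsigned int arch_dynirq_lower_bound(unsigned int from)
{
        /*
         * A PV dom0 never runs setup_IO_APIC(), so ioapic_initialized
         * stays false and dynamic IRQs start from gsi_top (56 here).
         */
        if (!ioapic_initialized)
                return gsi_top;

        /*
         * PVH and bare metal do initialize the IO-APIC, so dynamic
         * IRQs start from ioapic_dynirq_base (24 here).
         */
        return ioapic_dynirq_base ? : gsi_top;
}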

> 
>> [    0.105053] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
>> [    0.105061] cjq_debug mp_map_pin_to_irq gsi: 2, irq: 0, idx: 0, ioapic: 0, pin: 2
>> [    0.105069] cjq_debug mp_map_pin_to_irq gsi: 3, irq: 3, idx: 3, ioapic: 0, pin: 3
>> [    0.105078] cjq_debug mp_map_pin_to_irq gsi: 4, irq: 4, idx: 4, ioapic: 0, pin: 4
>> [    0.105086] cjq_debug mp_map_pin_to_irq gsi: 5, irq: 5, idx: 5, ioapic: 0, pin: 5
>> [    0.105094] cjq_debug mp_map_pin_to_irq gsi: 6, irq: 6, idx: 6, ioapic: 0, pin: 6
>> [    0.105103] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
>> [    0.105111] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
>> [    0.105119] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
>> [    0.105127] cjq_debug mp_map_pin_to_irq gsi: 10, irq: 10, idx: 9, ioapic: 0, pin: 10
>> [    0.105136] cjq_debug mp_map_pin_to_irq gsi: 11, irq: 11, idx: 10, ioapic: 0, pin: 11
>> [    0.105144] cjq_debug mp_map_pin_to_irq gsi: 12, irq: 12, idx: 11, ioapic: 0, pin: 12
>> [    0.105152] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
>> [    0.105160] cjq_debug mp_map_pin_to_irq gsi: 14, irq: 14, idx: 13, ioapic: 0, pin: 14
>> [    0.105169] cjq_debug mp_map_pin_to_irq gsi: 15, irq: 15, idx: 14, ioapic: 0, pin: 15
>> [    0.398134] cjq_debug mp_map_pin_to_irq gsi: 9, irq: 9, idx: 1, ioapic: 0, pin: 9
>> [    1.169293] cjq_debug mp_map_pin_to_irq gsi: 8, irq: 8, idx: 8, ioapic: 0, pin: 8
>> [    1.169394] cjq_debug mp_map_pin_to_irq gsi: 13, irq: 13, idx: 12, ioapic: 0, pin: 13
>> [    1.323132] cjq_debug mp_map_pin_to_irq gsi: 7, irq: 7, idx: 7, ioapic: 0, pin: 7
>> [    1.345425] cjq_debug mp_map_pin_to_irq gsi: 1, irq: 1, idx: 2, ioapic: 0, pin: 1
>> [    1.375502] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 24, nr_irqs: 1096
>> [    1.375575] cjq_debug mp_map_pin_to_irq gsi: 32, irq: 24, idx: -1, ioapic: 1, pin: 8
>> [    1.375661] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
>> [    1.375705] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
>> [    1.442277] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 25, idx: -1, ioapic: 1, pin: 13
>> [    1.442393] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
>> [    1.442450] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
>> [    1.453893] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 25, idx: -1, ioapic: 1, pin: 14
>> [    1.456127] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 25, nr_irqs: 1096
>> [    1.734065] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 26, nr_irqs: 1096
>> [    1.734165] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 27, nr_irqs: 1096
>> [    1.734253] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 28, nr_irqs: 1096
>> [    1.734344] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 29, nr_irqs: 1096
>> [    1.734426] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 30, nr_irqs: 1096
>> [    1.734512] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 31, nr_irqs: 1096
>> [    1.734597] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 32, nr_irqs: 1096
>> [    1.734643] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 33, nr_irqs: 1096
>> [    1.734687] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 34, nr_irqs: 1096
>> [    1.734728] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 35, nr_irqs: 1096
>> [    1.735017] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 36, nr_irqs: 1096
>> [    1.735252] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 37, nr_irqs: 1096
>> [    1.735467] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 38, nr_irqs: 1096
>> [    1.735799] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 39, nr_irqs: 1096
>> [    1.736024] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 40, nr_irqs: 1096
>> [    1.736364] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 41, nr_irqs: 1096
>> [    1.736406] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
>> [    1.736434] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 42, nr_irqs: 1096
>> [    1.736701] cjq_debug mp_map_pin_to_irq gsi: 28, irq: 41, idx: -1, ioapic: 1, pin: 4
>> [    1.736724] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 43, nr_irqs: 1096
>> [    3.037123] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 44, nr_irqs: 1096
>> [    3.037313] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
>> [    3.037515] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
>> [    3.037738] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 45, nr_irqs: 1096
>> [    3.037959] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 46, nr_irqs: 1096
>> [    3.038073] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 47, nr_irqs: 1096
>> [    3.038154] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 48, nr_irqs: 1096
>> [    3.038179] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
>> [    3.038277] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 49, nr_irqs: 1096
>> [    3.038399] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 50, nr_irqs: 1096
>> [    3.038525] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 51, nr_irqs: 1096
>> [    3.038657] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 52, nr_irqs: 1096
>> [    3.038852] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 53, nr_irqs: 1096
>> [    3.052377] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 54, nr_irqs: 1096
>> [    3.052479] cjq_debug mp_map_pin_to_irq gsi: 38, irq: 54, idx: -1, ioapic: 1, pin: 14
>> [    3.052730] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 55, nr_irqs: 1096
>> [    3.052840] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 56, nr_irqs: 1096
>> [    3.052918] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 57, nr_irqs: 1096
>> [    3.052987] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 58, nr_irqs: 1096
>> [    3.053069] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 59, nr_irqs: 1096
>> [    3.053139] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 60, nr_irqs: 1096
>> [    3.053201] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 61, nr_irqs: 1096
>> [    3.053260] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 62, nr_irqs: 1096
>> [    3.089128] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 63, nr_irqs: 1096
>> [    3.089310] cjq_debug mp_map_pin_to_irq gsi: 48, irq: 63, idx: -1, ioapic: 1, pin: 24
>> [    3.089376] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
>> [    3.103435] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 65, nr_irqs: 1096
>> [    3.114190] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 64, nr_irqs: 1096
>> [    3.114346] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 66, nr_irqs: 1096
>> [    3.121215] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 67, nr_irqs: 1096
>> [    3.121350] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 68, nr_irqs: 1096
>> [    3.121479] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 69, nr_irqs: 1096
>> [    3.121612] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 70, nr_irqs: 1096
>> [    3.121726] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 71, nr_irqs: 1096
>> [    3.121841] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 72, nr_irqs: 1096
>> [    3.121955] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 73, nr_irqs: 1096
>> [    3.122025] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 74, nr_irqs: 1096
>> [    3.122093] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 75, nr_irqs: 1096
>> [    3.122148] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 76, nr_irqs: 1096
>> [    3.122203] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 77, nr_irqs: 1096
>> [    3.122265] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 78, nr_irqs: 1096
>> [    3.122322] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 79, nr_irqs: 1096
>> [    3.122378] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 80, nr_irqs: 1096
>> [    3.122433] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: 0, start: 81, nr_irqs: 1096
>> [    7.838753] cjq_debug mp_map_pin_to_irq gsi: 37, irq: 44, idx: -1, ioapic: 1, pin: 13
>> [    9.619174] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
>> [    9.619556] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 82, nr_irqs: 1096
>> [    9.622038] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 83, nr_irqs: 1096
>> [    9.634900] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 84, nr_irqs: 1096
>> [    9.635316] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
>> [    9.635405] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 85, nr_irqs: 1096
>> [   10.006686] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 86, nr_irqs: 1096
>> [   10.006823] cjq_debug mp_map_pin_to_irq gsi: 29, irq: 86, idx: -1, ioapic: 1, pin: 5
>> [   10.007009] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 87, nr_irqs: 1096
>> [   10.008723] cjq_debug mp_map_pin_to_irq gsi: 39, irq: 84, idx: -1, ioapic: 1, pin: 15
>> [   10.009853] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 88, nr_irqs: 1096
>> [   10.010786] cjq_debug mp_map_pin_to_irq gsi: 36, irq: 47, idx: -1, ioapic: 1, pin: 12
>> [   10.010858] cjq_debug __irq_alloc_descs irq: -1, from: 24, cnt: 1, node: -1, start: 89, nr_irqs: 1096
>>
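>> A toy model (ordinary user-space C, not kernel code) of the first-come, first-served allocation visible in both logs: legacy GSIs below 16 are identity mapped at boot, and every later caller simply takes the next free descriptor at or above 24. The counts below are made up so the result resembles the PVH log:
>>
>> #include <stdio.h>
>>
>> #define NR_IRQS 256
>> static int used[NR_IRQS];
>>
>> static int alloc_irq_from(int from)
>> {
>>         for (int i = from; i < NR_IRQS; i++) {
>>                 if (!used[i]) {
>>                         used[i] = 1;
>>                         return i;
>>                 }
>>         }
>>         return -1;
>> }
>>
>> int main(void)
>> {
>>         for (int gsi = 0; gsi < 16; gsi++)
>>                 used[gsi] = 1;          /* legacy GSIs, identity mapped */
>>         for (int i = 0; i < 82; i++)
>>                 alloc_irq_from(24);     /* IPIs/MSIs/event channels first */
>>         printf("gsi 32 -> irq %d\n", alloc_irq_from(24)); /* prints 106 */
>>         printf("gsi 28 -> irq %d\n", alloc_irq_from(24)); /* prints 107 */
>>         return 0;
>> }
>>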
>> 2. Why do I translate between the irq and the gsi?
>>
>> From the answer to question 1, we know irq != gsi. And I found that in QEMU, pci_qdev_realize->xen_pt_realize->xen_host_pci_device_get->xen_host_pci_get_hex_value gets the IRQ number, but later pci_qdev_realize->xen_pt_realize->xc_physdev_map_pirq requires us to pass in the GSI,
> 
> So that's quite a difference.  For some reason on a PV dom0
> xen_host_pci_get_hex_value will return the IRQ that's identity mapped
> to the GSI.
> 
> Is that because a PV dom0 will use acpi_register_gsi_xen() instead of
> acpi_register_gsi_ioapic()?
Not exactly. PV gets the irq from /sys/bus/pci/devices/xxxx:xx:xx.x/irq (see xen_pt_realize-> xen_host_pci_device_get-> xen_host_pci_get_dec_value-> xen_host_pci_get_value-> open) and treats that irq as the gsi.
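
A minimal sketch of that sysfs read, modelled on what xen_host_pci_get_dec_value() ends up doing (error handling trimmed for brevity):

#include <stdio.h>

static int host_pci_get_irq(const char *sbdf)
{
        char path[256];
        int irq = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/irq", sbdf);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &irq) != 1)
                irq = -1;
        fclose(f);

        /* On a PV dom0 this value happens to equal the GSI; on a PVH
         * dom0 it is a dynamically allocated Linux IRQ and may differ. */
        return irq;
}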

> 
>> which calls into Xen's physdev_map_pirq-> allocate_and_map_gsi_pirq to allocate a pirq for the gsi. That is where the error occurred.
>> Not only that: the callback pci_add_dm_done-> xc_physdev_map_pirq also needs the gsi.
>>
>> So, for QEMU, I added a helper that translates the irq back to the gsi before xc_physdev_map_pirq() is called (patch 5, "tools/libs/call: add linux os call to get gsi from irq").
>>
>> I didn't find a similar function in the existing Linux code, and I think only "QEMU passthrough for Xen" needs this translation, so I added it to privcmd. If you know of a similar existing function or a more suitable place, please feel free to tell me. A hypothetical sketch of such an interface follows.
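>>
>> The struct and ioctl names below are made up for illustration and are not necessarily what the patch uses; only the general privcmd ioctl shape is assumed:
>>
>> struct privcmd_gsi_from_irq {
>>         __u32 irq;      /* IN: Linux IRQ number (e.g. from sysfs) */
>>         __u32 gsi;      /* OUT: GSI backing that IRQ */
>> };
>>
>> #define IOCTL_PRIVCMD_GSI_FROM_IRQ \
>>         _IOC(_IOC_NONE, 'P', 10, sizeof(struct privcmd_gsi_from_irq))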
>>
>> 3. Why do I call PHYSDEVOP_map_pirq in acpi_register_gsi_xen_pvh()?
>>
>> Because if you want to map a gsi into a domU, it must already have a mapping in dom0. See the toolstack (libxl) code path:
>> pci_add_dm_done
>> 	xc_physdev_map_pirq
>> 	xc_domain_irq_permission
>> 		XEN_DOMCTL_irq_permission
>> 			pirq_access_permitted
>> xc_physdev_map_pirq gets the pirq mapped from the gsi, and xc_domain_irq_permission uses that pirq to call into Xen. If we don't do PHYSDEVOP_map_pirq for passthrough devices on a PVH dom0, pirq_access_permitted finds no IRQ mapping in dom0 and fails.
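>>
>> A minimal sketch, using the standard physdev interface, of the PHYSDEVOP_map_pirq call the PVH dom0 kernel would issue so that pirq_access_permitted() can later find a mapping (requesting an identity-mapped pirq here is an illustrative choice, not a requirement):
>>
>> struct physdev_map_pirq map = {
>>         .domid = DOMID_SELF,
>>         .type  = MAP_PIRQ_TYPE_GSI,
>>         .index = gsi,
>>         .pirq  = gsi,   /* ask for pirq == gsi */
>> };
>> int rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);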
> 
> I'm not sure of this specific case, but we shouldn't attempt to fit
> the same exact PCI pass through workflow that a PV dom0 uses into a
> PVH dom0.  IOW: it might make sense to diverge some paths in order to
> avoid importing PV specific concepts into PVH without a reason.
Yes, I agree with you. I have also tried another method to solve this problem; I think we can discuss it in the new series.

> 
>> So, I added PHYSDEVOP_map_pirq for PVH dom0. But I think it is only necessary for passthrough devices, not for every device that calls __acpi_register_gsi. In the next version of the patch, I will restrict PHYSDEVOP_map_pirq to passthrough devices only.
>>
>> 4. Why do I call PHYSDEVOP_setup_gsi in acpi_register_gsi_xen_pvh()?
>>
>> As Roger commented, the gsi of a passthrough device is never unmasked and registered (I added printks in vioapic_hwdom_map_gsi() and found that it is never called for the dGPU with gsi 28 in my environment).
>> So, I call PHYSDEVOP_setup_gsi to register the gsi.
>> But I agree with Roger's and Jan's opinion that doing PHYSDEVOP_setup_gsi for all devices is wrong.
>> So, in the next version of the patch, I will also restrict PHYSDEVOP_setup_gsi to passthrough devices only.
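>>
>> A minimal sketch of the corresponding PHYSDEVOP_setup_gsi call, assuming the fields from Xen's public physdev.h; trigger and polarity would come from the ACPI data for the GSI:
>>
>> struct physdev_setup_gsi setup = {
>>         .gsi        = gsi,
>>         .triggering = trigger,  /* 0 = edge, 1 = level */
>>         .polarity   = polarity, /* 0 = active high, 1 = active low */
>> };
>> int rc = HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup);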
> 
> Right, given how long it's been since the last series, I think we need
> a new series posted in order to see how this looks now.
Agreed; I am looking forward to your comments on the new series.

> 
> Thanks, Roger.

-- 
Best regards,
Jiqian Chen.
