[PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping

linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping
@ 2020-10-05 15:28 David Woodhouse
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3253 bytes --]

Linux currently refuses to use >255 CPUs on x86 unless it has interrupt
remapping. This is a bit gratuitous because it could use those extra
CPUs just fine; it just can't target external interrupts at them.

The only problem is that our generic IRQ domain code cann't cope with
the concept of domains which can only target a subset of CPUs.

The hyperv-iommu IRQ remapping driver works around this — not by
actually doing any remapping, but just returning -EINVAL if the
affinity is ever set to an unreachable CPU. This almost works, but ends
up being a bit late because irq_set_affinity_locked() doesn't call into
the irqchip driver immediately; the error only happens later.

This patch series implements a per-domain "maximum affinity" set and
uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
allowing more CPUs to be used without interrupt remapping, this also
fixes the case where some IOAPICs or PCI devices aren't actually in
scope of any active IOMMU and are operating without remapping.

While we're at it, recognise that the 8-bit limit is a bit gratuitous
and a hypervisor could offer at least 15 bits of APIC ID in the IOAPIC
RTE and MSI address bits 11-5 without even needing to use remapping.

David Woodhouse (13):
      x86/apic: Use x2apic in guest kernels even with unusable CPUs.
      x86/msi: Only use high bits of MSI address for DMAR unit
      x86/ioapic: Handle Extended Destination ID field in RTE
      x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
      genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
      genirq: Add default_affinity argument to __irq_alloc_descs()
      irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
      genirq: Add irq_domain_set_affinity()
      x86/irq: Add x86_non_ir_cpumask
      x86/irq: Limit IOAPIC and MSI domains' affinity without IR
      x86/smp: Allow more than 255 CPUs even without interrupt remapping
      iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant
      x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

 Documentation/virt/kvm/cpuid.rst     |   4 +
 arch/x86/include/asm/apic.h          |   1 +
 arch/x86/include/asm/io_apic.h       |   3 +-
 arch/x86/include/asm/mpspec.h        |   2 +
 arch/x86/include/asm/x86_init.h      |   2 +
 arch/x86/include/uapi/asm/kvm_para.h |   1 +
 arch/x86/kernel/apic/apic.c          |  41 +++++++++-
 arch/x86/kernel/apic/io_apic.c       |  23 ++++--
 arch/x86/kernel/apic/msi.c           |  44 +++++++++--
 arch/x86/kernel/apic/x2apic_phys.c   |   9 +++
 arch/x86/kernel/kvm.c                |   6 ++
 arch/x86/kernel/x86_init.c           |   1 +
 drivers/iommu/hyperv-iommu.c         | 149 +----------------------------------
 include/linux/interrupt.h            |   2 +
 include/linux/irq.h                  |  10 ++-
 include/linux/irqdomain.h            |   7 +-
 kernel/irq/devres.c                  |   8 +-
 kernel/irq/ipi.c                     |   2 +-
 kernel/irq/irqdesc.c                 |  29 ++++---
 kernel/irq/irqdomain.c               |  69 ++++++++++++++--
 kernel/irq/manage.c                  |  19 ++++-
 21 files changed, 240 insertions(+), 192 deletions(-)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs.
  2020-10-05 15:28 [PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping David Woodhouse
@ 2020-10-05 15:28 ` David Woodhouse
  2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
                     ` (11 more replies)
  0 siblings, 12 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

If the BIOS has enabled x2apic mode, then leave it enabled and don't
fall back to xapic just because some CPUs can't be targeted.

Previously, if there are CPUs with APIC IDs > 255, the kernel will
disable x2apic and fall back to xapic. And then not use the additional
CPUs anyway, since xapic can't target them either.

In fact, xapic mode can target even *fewer* CPUs, since it can't use
the one with APIC ID 255 either. Using x2apic instead gives us at least
one more CPU without needing interrupt remapping in the guest.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/apic.h        |  1 +
 arch/x86/kernel/apic/apic.c        | 18 ++++++++++++++----
 arch/x86/kernel/apic/x2apic_phys.c |  9 +++++++++
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 1c129abb7f09..b0fd204e0023 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -259,6 +259,7 @@ static inline u64 native_x2apic_icr_read(void)
 
 extern int x2apic_mode;
 extern int x2apic_phys;
+extern void __init x2apic_set_max_apicid(u32 apicid);
 extern void __init check_x2apic(void);
 extern void x2apic_setup(void);
 static inline int x2apic_enabled(void)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b3eef1d5c903..a75767052a92 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1841,14 +1841,24 @@ static __init void try_to_enable_x2apic(int remap_mode)
 		return;
 
 	if (remap_mode != IRQ_REMAP_X2APIC_MODE) {
-		/* IR is required if there is APIC ID > 255 even when running
-		 * under KVM
+		/*
+		 * If there are APIC IDs above 255, they cannot be used
+		 * without interrupt remapping. In a virtual machine, but
+		 * not on bare metal, X2APIC can be used anyway. In the
+		 * case where BIOS has enabled X2APIC, don't disable it
+		 * just because there are APIC IDs that can't be used.
+		 * They couldn't be used if the kernel falls back to XAPIC
+		 * anyway.
 		 */
 		if (max_physical_apicid > 255 ||
 		    !x86_init.hyper.x2apic_available()) {
 			pr_info("x2apic: IRQ remapping doesn't support X2APIC mode\n");
-			x2apic_disable();
-			return;
+			if (!x2apic_mode) {
+				x2apic_disable();
+				return;
+			}
+
+			x2apic_set_max_apicid(255);
 		}
 
 		/*
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index bc9693841353..b4b4e89c1118 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -8,6 +8,12 @@
 int x2apic_phys;
 
 static struct apic apic_x2apic_phys;
+static u32 x2apic_max_apicid;
+
+void __init x2apic_set_max_apicid(u32 apicid)
+{
+	x2apic_max_apicid = apicid;
+}
 
 static int __init set_x2apic_phys_mode(char *arg)
 {
@@ -98,6 +104,9 @@ static int x2apic_phys_probe(void)
 /* Common x2apic functions, also used by x2apic_cluster */
 int x2apic_apic_id_valid(u32 apicid)
 {
+	if (x2apic_max_apicid && apicid > x2apic_max_apicid)
+		return 0;
+
 	return 1;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 20:45     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE David Woodhouse
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

The Intel IOMMU has an MSI-like configuration for its interrupt, but
it isn't really MSI. So it gets to abuse the high 32 bits of the address,
and puts the high 24 bits of the extended APIC ID there.

This isn't something that can be used in the general case for real MSIs,
since external devices using the high bits of the address would be
performing writes to actual memory space above 4GiB, not targeted at the
APIC.

Factor the hack out and allow it only to be used when appropriate,
adding a WARN_ON_ONCE() if other MSIs are targeted at an unreachable APIC
ID. That should never happen since the legacy MSI messages are not supposed
to be used with Interrupt Remapping enabled.

The x2apic_enabled() check isn't needed because we won't bring up CPUs
with higher APIC IDs unless x2apic is enabled anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/apic/msi.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 6313f0a05db7..356f8acf4927 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -23,13 +23,10 @@
 
 struct irq_domain *x86_pci_msi_default_domain __ro_after_init;
 
-static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg)
+static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar)
 {
 	msg->address_hi = MSI_ADDR_BASE_HI;
 
-	if (x2apic_enabled())
-		msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
-
 	msg->address_lo =
 		MSI_ADDR_BASE_LO |
 		((apic->irq_dest_mode == 0) ?
@@ -43,18 +40,42 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg)
 		MSI_DATA_LEVEL_ASSERT |
 		MSI_DATA_DELIVERY_FIXED |
 		MSI_DATA_VECTOR(cfg->vector);
+
+	/*
+	 * Only the IOMMU itself can use the trick of putting destination
+	 * APIC ID into the high bits of the address. Anything else would
+	 * just be writing to memory if it tried that, and needs IR to
+	 * address APICs above 255.
+	 */
+	if (dmar)
+		msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
+	else
+		WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid));
 }
 
 void x86_vector_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
 {
-	__irq_msi_compose_msg(irqd_cfg(data), msg);
+	__irq_msi_compose_msg(irqd_cfg(data), msg, 0);
 }
 
+/*
+ * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the
+ * high bits of the destination APIC ID. This can't be done in the general
+ * case for MSIs as it would be targeting real memory above 4GiB not the
+ * APIC.
+ */
+static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
+{
+	__irq_msi_compose_msg(irqd_cfg(data), msg, 1);
+
+
+
+}
 static void irq_msi_update_msg(struct irq_data *irqd, struct irq_cfg *cfg)
 {
 	struct msi_msg msg[2] = { [1] = { }, };
 
-	__irq_msi_compose_msg(cfg, msg);
+	__irq_msi_compose_msg(cfg, msg, 0);
 	irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
 }
 
@@ -288,6 +309,7 @@ static struct irq_chip dmar_msi_controller = {
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_set_affinity	= msi_domain_set_affinity,
 	.irq_retrigger		= irq_chip_retrigger_hierarchy,
+	.irq_compose_msi_msg	= dmar_msi_compose_msg,
 	.irq_write_msi_msg	= dmar_msi_write_msg,
 	.flags			= IRQCHIP_SKIP_SET_WAKE,
 };
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
  2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-05 15:28   ` [PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available David Woodhouse
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

The IOAPIC Redirection Table Entries contain an 8-bit Extended
Destination ID field which maps to bits 11-4 of the MSI address.

The lowest bit is used to indicate remappable format, when interrupt
remapping is in use. A hypervisor can use the other 7 bits to permit
guests to address up to 15 bits of APIC IDs, thus allowing 32768 vCPUs
before having to expose a vIOMMU and interrupt remapping to the guest.

No behavioural change in this patch, since nothing yet permits APIC IDs
above 255 to be used with the non-IR IOAPIC domain.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/io_apic.h |  3 ++-
 arch/x86/kernel/apic/io_apic.c | 19 +++++++++++++------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index a1a26f6d3aa4..e65a0b7379d0 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -78,7 +78,8 @@ struct IO_APIC_route_entry {
 		mask		:  1,	/* 0: enabled, 1: disabled */
 		__reserved_2	: 15;
 
-	__u32	__reserved_3	: 24,
+	__u32	__reserved_3	: 17,
+		ext_dest	:  7,
 		dest		:  8;
 } __attribute__ ((packed));
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a33380059db6..aa9a3b54a96c 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1239,10 +1239,10 @@ static void io_apic_print_entries(unsigned int apic, unsigned int nr_entries)
 			       buf, (ir_entry->index2 << 15) | ir_entry->index,
 			       ir_entry->zero);
 		else
-			printk(KERN_DEBUG "%s, %s, D(%02X), M(%1d)\n",
+			printk(KERN_DEBUG "%s, %s, D(%02X%02X), M(%1d)\n",
 			       buf,
 			       entry.dest_mode == IOAPIC_DEST_MODE_LOGICAL ?
-			       "logical " : "physical",
+			       "logical " : "physical", entry.ext_dest,
 			       entry.dest, entry.delivery_mode);
 	}
 }
@@ -1410,6 +1410,7 @@ void native_restore_boot_irq_mode(void)
 	 */
 	if (ioapic_i8259.pin != -1) {
 		struct IO_APIC_route_entry entry;
+		u32 apic_id = read_apic_id();
 
 		memset(&entry, 0, sizeof(entry));
 		entry.mask		= IOAPIC_UNMASKED;
@@ -1417,7 +1418,8 @@ void native_restore_boot_irq_mode(void)
 		entry.polarity		= IOAPIC_POL_HIGH;
 		entry.dest_mode		= IOAPIC_DEST_MODE_PHYSICAL;
 		entry.delivery_mode	= dest_ExtINT;
-		entry.dest		= read_apic_id();
+		entry.dest		= apic_id & 0xff;
+		entry.ext_dest		= apic_id >> 8;
 
 		/*
 		 * Add it to the IO-APIC irq-routing table:
@@ -1861,7 +1863,8 @@ static void ioapic_configure_entry(struct irq_data *irqd)
 	 * ioapic chip to verify that.
 	 */
 	if (irqd->chip == &ioapic_chip) {
-		mpd->entry.dest = cfg->dest_apicid;
+		mpd->entry.dest = cfg->dest_apicid & 0xff;
+		mpd->entry.ext_dest = cfg->dest_apicid >> 8;
 		mpd->entry.vector = cfg->vector;
 	}
 	for_each_irq_pin(entry, mpd->irq_2_pin)
@@ -2027,6 +2030,7 @@ static inline void __init unlock_ExtINT_logic(void)
 	int apic, pin, i;
 	struct IO_APIC_route_entry entry0, entry1;
 	unsigned char save_control, save_freq_select;
+	u32 apic_id;
 
 	pin  = find_isa_irq_pin(8, mp_INT);
 	if (pin == -1) {
@@ -2042,11 +2046,13 @@ static inline void __init unlock_ExtINT_logic(void)
 	entry0 = ioapic_read_entry(apic, pin);
 	clear_IO_APIC_pin(apic, pin);
 
+	apic_id = hard_smp_processor_id();
 	memset(&entry1, 0, sizeof(entry1));
 
 	entry1.dest_mode = IOAPIC_DEST_MODE_PHYSICAL;
 	entry1.mask = IOAPIC_UNMASKED;
-	entry1.dest = hard_smp_processor_id();
+	entry1.dest = apic_id & 0xff;
+	entry1.ext_dest = apic_id >> 8;
 	entry1.delivery_mode = dest_ExtINT;
 	entry1.polarity = entry0.polarity;
 	entry1.trigger = IOAPIC_EDGE;
@@ -2949,7 +2955,8 @@ static void mp_setup_entry(struct irq_cfg *cfg, struct mp_chip_data *data,
 	memset(entry, 0, sizeof(*entry));
 	entry->delivery_mode = apic->irq_delivery_mode;
 	entry->dest_mode     = apic->irq_dest_mode;
-	entry->dest	     = cfg->dest_apicid;
+	entry->dest	     = cfg->dest_apicid & 0xff;
+	entry->ext_dest	     = cfg->dest_apicid >> 8;
 	entry->vector	     = cfg->vector;
 	entry->trigger	     = data->trigger;
 	entry->polarity	     = data->polarity;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
  2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
  2020-10-05 15:28   ` [PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-05 15:28   ` [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs() David Woodhouse
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

Some hypervisors can allow the guest to use the Extended Destination ID
field in the IOAPIC RTE and MSI address to address up to 32768 CPUs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/mpspec.h   |  1 +
 arch/x86/include/asm/x86_init.h |  2 ++
 arch/x86/kernel/apic/apic.c     | 15 ++++++++++++++-
 arch/x86/kernel/apic/msi.c      |  9 ++++++++-
 arch/x86/kernel/x86_init.c      |  1 +
 5 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index e90ac7e9ae2c..25ee8ca0a1f2 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -42,6 +42,7 @@ extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 extern unsigned int boot_cpu_physical_apicid;
 extern u8 boot_cpu_apic_version;
 extern unsigned long mp_lapic_addr;
+extern int msi_ext_dest_id;
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern int smp_found_config;
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 397196fae24d..5af3fe9e38f3 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -114,6 +114,7 @@ struct x86_init_pci {
  * @init_platform:		platform setup
  * @guest_late_init:		guest late init
  * @x2apic_available:		X2APIC detection
+ * @msi_ext_dest_id:		MSI and IOAPIC support 15-bit APIC IDs
  * @init_mem_mapping:		setup early mappings during init_mem_mapping()
  * @init_after_bootmem:		guest init after boot allocator is finished
  */
@@ -121,6 +122,7 @@ struct x86_hyper_init {
 	void (*init_platform)(void);
 	void (*guest_late_init)(void);
 	bool (*x2apic_available)(void);
+	bool (*msi_ext_dest_id)(void);
 	void (*init_mem_mapping)(void);
 	void (*init_after_bootmem)(void);
 };
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a75767052a92..459c78558f36 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1837,6 +1837,8 @@ static __init void x2apic_enable(void)
 
 static __init void try_to_enable_x2apic(int remap_mode)
 {
+	u32 apic_limit = 0;
+
 	if (x2apic_state == X2APIC_DISABLED)
 		return;
 
@@ -1858,7 +1860,15 @@ static __init void try_to_enable_x2apic(int remap_mode)
 				return;
 			}
 
-			x2apic_set_max_apicid(255);
+			/*
+			 * If the hypervisor supports extended destination ID
+			 * in IOAPIC and MSI, we can support that many CPUs.
+			 */
+			if (x86_init.hyper.msi_ext_dest_id()) {
+				msi_ext_dest_id = 1;
+				apic_limit = 32767;
+			} else
+				apic_limit = 255;
 		}
 
 		/*
@@ -1867,6 +1877,9 @@ static __init void try_to_enable_x2apic(int remap_mode)
 		 */
 		x2apic_phys = 1;
 	}
+	if (apic_limit)
+		x2apic_set_max_apicid(apic_limit);
+
 	x2apic_enable();
 }
 
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 356f8acf4927..4d891967bea4 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -23,6 +23,8 @@
 
 struct irq_domain *x86_pci_msi_default_domain __ro_after_init;
 
+int msi_ext_dest_id __ro_after_init;
+
 static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar)
 {
 	msg->address_hi = MSI_ADDR_BASE_HI;
@@ -45,10 +47,15 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int
 	 * Only the IOMMU itself can use the trick of putting destination
 	 * APIC ID into the high bits of the address. Anything else would
 	 * just be writing to memory if it tried that, and needs IR to
-	 * address APICs above 255.
+	 * address APICs which can't be addressed in the normal 32-bit
+	 * address range at 0xFFExxxxx. That is typically just 8 bits, but
+	 * some hypervisors allow the extended destination ID field in bits
+	 * 11-5 to be used, giving support for 15 bits of APIC IDs in total.
 	 */
 	if (dmar)
 		msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
+	else if (msi_ext_dest_id && cfg->dest_apicid < 0x8000)
+		msg->address_lo |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid) >> 3;
 	else
 		WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid));
 }
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a3038d8deb6a..8b395821cb8d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -110,6 +110,7 @@ struct x86_init_ops x86_init __initdata = {
 		.init_platform		= x86_init_noop,
 		.guest_late_init	= x86_init_noop,
 		.x2apic_available	= bool_x86_init_noop,
+		.msi_ext_dest_id	= bool_x86_init_noop,
 		.init_mem_mapping	= x86_init_noop,
 		.init_after_bootmem	= x86_init_noop,
 	},
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (2 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:01     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 06/13] genirq: Add default_affinity argument " David Woodhouse
                     ` (7 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

Instead of plugging in irq_default_affinity at the lowest level in
desc_smp_init() if the caller didn't specify an affinity for this
specific IRQ in its array, allow the default affinity to be passed
down from __irq_alloc_descs() instead.

This is in preparation for irq domains having their own default (and
indeed maximum) affinities.

An alternative possibility would have been to make the caller allocate
an entire array of struct irq_affinity_desc for the allocation, but
that's a lot nastier than simply passing in an additional const cpumask.

This commit has no visible API change outside static functions in
irqdesc.c for now; the actual change will come later.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/interrupt.h |  2 ++
 kernel/irq/irqdesc.c      | 21 +++++++++++++--------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index f9aee3538461..cd0ff293486a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -364,6 +364,8 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
 
 #else /* CONFIG_SMP */
 
+#define irq_default_affinity (NULL)
+
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
 {
 	return -EINVAL;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..4ac91b9fc618 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -81,8 +81,6 @@ static int alloc_masks(struct irq_desc *desc, int node)
 static void desc_smp_init(struct irq_desc *desc, int node,
 			  const struct cpumask *affinity)
 {
-	if (!affinity)
-		affinity = irq_default_affinity;
 	cpumask_copy(desc->irq_common_data.affinity, affinity);
 
 #ifdef CONFIG_GENERIC_PENDING_IRQ
@@ -465,12 +463,16 @@ static void free_desc(unsigned int irq)
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		       const struct irq_affinity_desc *affinity,
+		       const struct cpumask *default_affinity,
 		       struct module *owner)
 {
 	struct irq_desc *desc;
 	int i;
 
 	/* Validate affinity mask(s) */
+	if (!default_affinity || cpumask_empty(default_affinity))
+		return -EINVAL;
+
 	if (affinity) {
 		for (i = 0; i < cnt; i++) {
 			if (cpumask_empty(&affinity[i].mask))
@@ -479,7 +481,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 	}
 
 	for (i = 0; i < cnt; i++) {
-		const struct cpumask *mask = NULL;
+		const struct cpumask *mask = default_affinity;
 		unsigned int flags = 0;
 
 		if (affinity) {
@@ -488,10 +490,10 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 					IRQD_MANAGED_SHUTDOWN;
 			}
 			mask = &affinity->mask;
-			node = cpu_to_node(cpumask_first(mask));
 			affinity++;
 		}
 
+		node = cpu_to_node(cpumask_first(mask));
 		desc = alloc_desc(start + i, node, flags, mask, owner);
 		if (!desc)
 			goto err;
@@ -538,7 +540,7 @@ int __init early_irq_init(void)
 		nr_irqs = initcnt;
 
 	for (i = 0; i < initcnt; i++) {
-		desc = alloc_desc(i, node, 0, NULL, NULL);
+		desc = alloc_desc(i, node, 0, irq_default_affinity, NULL);
 		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
@@ -573,7 +575,8 @@ int __init early_irq_init(void)
 		raw_spin_lock_init(&desc[i].lock);
 		lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
 		mutex_init(&desc[i].request_mutex);
-		desc_set_defaults(i, &desc[i], node, NULL, NULL);
+		desc_set_defaults(i, &desc[i], node, irq_default_affinity,
+				  NULL);
 	}
 	return arch_early_irq_init();
 }
@@ -590,12 +593,14 @@ static void free_desc(unsigned int irq)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
+	desc_set_defaults(irq, desc, irq_desc_get_node(desc),
+			  irq_default_affinity, NULL);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
 			      const struct irq_affinity_desc *affinity,
+			      const struct cpumask *default_affinity,
 			      struct module *owner)
 {
 	u32 i;
@@ -803,7 +808,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 		if (ret)
 			goto unlock;
 	}
-	ret = alloc_descs(start, cnt, node, affinity, owner);
+	ret = alloc_descs(start, cnt, node, affinity, irq_default_affinity, owner);
 unlock:
 	mutex_unlock(&sparse_irq_lock);
 	return ret;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 06/13] genirq: Add default_affinity argument to __irq_alloc_descs()
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (3 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs() David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:06     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs() David Woodhouse
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

It already takes an array of affinities for each individual irq being
allocated but that's awkward to allocate and populate in the case
where they're all the same and inherited from the irqdomain, so pass
the default in separately as a simple cpumask.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/irq.h    | 10 ++++++----
 kernel/irq/devres.c    |  8 ++++++--
 kernel/irq/irqdesc.c   | 10 ++++++++--
 kernel/irq/irqdomain.c |  6 +++---
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 1b7f4dfee35b..6e119594d35d 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -897,15 +897,17 @@ unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 		      struct module *owner,
-		      const struct irq_affinity_desc *affinity);
+		      const struct irq_affinity_desc *affinity,
+		      const struct cpumask *default_affinity);
 
 int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
 			   unsigned int cnt, int node, struct module *owner,
-			   const struct irq_affinity_desc *affinity);
+			   const struct irq_affinity_desc *affinity,
+			   const struct cpumask *default_affinity);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
-	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)
+	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL, NULL)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
@@ -920,7 +922,7 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
 	irq_alloc_descs(-1, from, cnt, node)
 
 #define devm_irq_alloc_descs(dev, irq, from, cnt, node)		\
-	__devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL)
+	__devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL, NULL)
 
 #define devm_irq_alloc_desc(dev, node)				\
 	devm_irq_alloc_descs(dev, -1, 0, 1, node)
diff --git a/kernel/irq/devres.c b/kernel/irq/devres.c
index f6e5515ee077..079339decc23 100644
--- a/kernel/irq/devres.c
+++ b/kernel/irq/devres.c
@@ -170,6 +170,8 @@ static void devm_irq_desc_release(struct device *dev, void *res)
  * @affinity:	Optional pointer to an irq_affinity_desc array of size @cnt
  *		which hints where the irq descriptors should be allocated
  *		and which default affinities to use
+ * @default_affinity: Optional pointer to a cpumask indicating the default
+ *              affinity to use where not specified by the @affinity array
  *
  * Returns the first irq number or error code.
  *
@@ -177,7 +179,8 @@ static void devm_irq_desc_release(struct device *dev, void *res)
  */
 int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
 			   unsigned int cnt, int node, struct module *owner,
-			   const struct irq_affinity_desc *affinity)
+			   const struct irq_affinity_desc *affinity,
+			   const struct cpumask *default_affinity)
 {
 	struct irq_desc_devres *dr;
 	int base;
@@ -186,7 +189,8 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
 	if (!dr)
 		return -ENOMEM;
 
-	base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity);
+	base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity,
+				 default_affinity);
 	if (base < 0) {
 		devres_free(dr);
 		return base;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 4ac91b9fc618..fcc3b8a1fe01 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -770,15 +770,21 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @affinity:	Optional pointer to an affinity mask array of size @cnt which
  *		hints where the irq descriptors should be allocated and which
  *		default affinities to use
+ * @default_affinity: Optional pointer to a cpumask indicating the default
+ *              affinity where not specified in the @affinity array
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
-		  struct module *owner, const struct irq_affinity_desc *affinity)
+		  struct module *owner, const struct irq_affinity_desc *affinity,
+		  const struct cpumask *default_affinity)
 {
 	int start, ret;
 
+	if (!default_affinity)
+		default_affinity = irq_default_affinity;
+
 	if (!cnt)
 		return -EINVAL;
 
@@ -808,7 +814,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 		if (ret)
 			goto unlock;
 	}
-	ret = alloc_descs(start, cnt, node, affinity, irq_default_affinity, owner);
+	ret = alloc_descs(start, cnt, node, affinity, default_affinity, owner);
 unlock:
 	mutex_unlock(&sparse_irq_lock);
 	return ret;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 76cd7ebd1178..c93e00ca11d8 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1017,16 +1017,16 @@ int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
 
 	if (virq >= 0) {
 		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
-					 affinity);
+					 affinity, NULL);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
 		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
-					 affinity);
+					 affinity, NULL);
 		if (virq <= 0 && hint > 1) {
 			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
-						 affinity);
+						 affinity, NULL);
 		}
 	}
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (4 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 06/13] genirq: Add default_affinity argument " David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:26     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 08/13] genirq: Add irq_domain_set_affinity() David Woodhouse
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

This is the maximum possible set of CPUs which can be used. Use it
to calculate the default affinity requested from __irq_alloc_descs()
by first attempting to find the intersection with irq_default_affinity,
or falling back to using just the max_affinity if the intersection
would be empty.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/irqdomain.h |  3 ++-
 kernel/irq/ipi.c          |  2 +-
 kernel/irq/irqdomain.c    | 45 +++++++++++++++++++++++++++++++++------
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 44445d9de881..6b5576da77f0 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -278,7 +278,8 @@ extern void irq_set_default_host(struct irq_domain *host);
 extern struct irq_domain *irq_get_default_host(void);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
 				  irq_hw_number_t hwirq, int node,
-				  const struct irq_affinity_desc *affinity);
+				  const struct irq_affinity_desc *affinity,
+				  const struct cpumask *max_affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 43e3d1be622c..13f56210eca9 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -75,7 +75,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL, dest);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index c93e00ca11d8..ffd41f21afca 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -660,7 +660,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+				      NULL, NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -1011,25 +1012,57 @@ int irq_domain_translate_twocell(struct irq_domain *d,
 EXPORT_SYMBOL_GPL(irq_domain_translate_twocell);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node, const struct irq_affinity_desc *affinity)
+			   int node, const struct irq_affinity_desc *affinity,
+			   const struct cpumask *max_affinity)
 {
+	cpumask_var_t default_affinity;
 	unsigned int hint;
+	int i;
+
+	/* Check requested per-IRQ affinities are in the possible range */
+	if (affinity && max_affinity) {
+		for (i = 0; i < cnt; i++)
+			if (!cpumask_subset(&affinity[i].mask, max_affinity))
+				return -EINVAL;
+	}
+
+	/*
+	 * Generate default affinity. Either the possible subset of
+	 * irq_default_affinity if such a subset is non-empty, or fall
+	 * back to the provided max_affinity if there is no intersection.
+	 * And just a copy of irq_default_affinity in the
+	 * !CONFIG_CPUMASK_OFFSTACK case.
+	 */
+	memset(&default_affinity, 0, sizeof(default_affinity));
+	if ((max_affinity &&
+	     !cpumask_subset(irq_default_affinity, max_affinity))) {
+		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
+			return -ENOMEM;
+		cpumask_and(default_affinity, max_affinity,
+			    irq_default_affinity);
+		if (cpumask_empty(default_affinity))
+			cpumask_copy(default_affinity, max_affinity);
+	} else if (cpumask_available(default_affinity))
+		cpumask_copy(default_affinity, irq_default_affinity);
 
 	if (virq >= 0) {
 		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
 		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 		if (virq <= 0 && hint > 1) {
 			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
-						 affinity, NULL);
+						 affinity, default_affinity);
 		}
 	}
 
+	if (cpumask_available(default_affinity))
+		free_cpumask_var(default_affinity);
+
 	return virq;
 }
 
@@ -1342,7 +1375,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 		virq = irq_base;
 	} else {
 		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
-					      affinity);
+					      affinity, NULL);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 08/13] genirq: Add irq_domain_set_affinity()
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (5 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs() David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:32     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask David Woodhouse
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

This allows a maximal affinity to be set, for IRQ domains which cannot
target all CPUs in the system.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/irqdomain.h |  4 ++++
 kernel/irq/irqdomain.c    | 28 ++++++++++++++++++++++++++--
 kernel/irq/manage.c       | 19 +++++++++++++++----
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 6b5576da77f0..cf686f18c1fa 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -151,6 +151,7 @@ struct irq_domain_chip_generic;
  *      drivers using the generic chip library which uses this pointer.
  * @parent: Pointer to parent irq_domain to support hierarchy irq_domains
  * @debugfs_file: dentry for the domain debugfs file
+ * @max_affinity: mask of CPUs targetable by this IRQ domain
  *
  * Revmap data, used internally by irq_domain
  * @revmap_direct_max_irq: The largest hwirq that can be set for controllers that
@@ -177,6 +178,7 @@ struct irq_domain {
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
 	struct dentry		*debugfs_file;
 #endif
+	const struct cpumask *max_affinity;
 
 	/* reverse map data. The linear map gets appended to the irq_domain */
 	irq_hw_number_t hwirq_max;
@@ -453,6 +455,8 @@ extern void irq_domain_set_info(struct irq_domain *domain, unsigned int virq,
 				void *chip_data, irq_flow_handler_t handler,
 				void *handler_data, const char *handler_name);
 extern void irq_domain_reset_irq_data(struct irq_data *irq_data);
+extern int irq_domain_set_affinity(struct irq_domain *domain,
+				   const struct cpumask *affinity);
 #ifdef	CONFIG_IRQ_DOMAIN_HIERARCHY
 extern struct irq_domain *irq_domain_create_hierarchy(struct irq_domain *parent,
 			unsigned int flags, unsigned int size,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index ffd41f21afca..0b298355be1b 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -661,7 +661,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 
 	/* Allocate a virtual interrupt number */
 	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
-				      NULL, NULL);
+				      NULL, domain->max_affinity);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -1078,6 +1078,27 @@ void irq_domain_reset_irq_data(struct irq_data *irq_data)
 }
 EXPORT_SYMBOL_GPL(irq_domain_reset_irq_data);
 
+/**
+ * irq_domain_set_affinity - Set maximum CPU affinity for domain
+ * @parent:	Domain to set affinity for
+ * @affinity:	Pointer to cpumask, consumed by domain
+ *
+ * Sets the maximal set of CPUs to which interrupts in this domain may
+ * be delivered. Must only be called after creation, before any interrupts
+ * have been in the domain.
+ *
+ * This function retains a pointer to the cpumask which is passed in.
+ */
+int irq_domain_set_affinity(struct irq_domain *domain,
+			    const struct cpumask *affinity)
+{
+	if (cpumask_empty(affinity))
+		return -EINVAL;
+	domain->max_affinity = affinity;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_domain_set_affinity);
+
 #ifdef	CONFIG_IRQ_DOMAIN_HIERARCHY
 /**
  * irq_domain_create_hierarchy - Add a irqdomain into the hierarchy
@@ -1110,6 +1131,9 @@ struct irq_domain *irq_domain_create_hierarchy(struct irq_domain *parent,
 	if (domain) {
 		domain->parent = parent;
 		domain->flags |= flags;
+		if (parent && parent->max_affinity)
+			irq_domain_set_affinity(domain,
+						parent->max_affinity);
 	}
 
 	return domain;
@@ -1375,7 +1399,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 		virq = irq_base;
 	} else {
 		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
-					      affinity, NULL);
+					      affinity, domain->max_affinity);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5df903fccb60..e8c5f8ecd306 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -345,6 +345,10 @@ int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	int ret = 0;
 
+	if (data->domain && data->domain->max_affinity &&
+	    !cpumask_subset(mask, data->domain->max_affinity))
+		return -EINVAL;
+
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
@@ -484,13 +488,20 @@ int irq_setup_affinity(struct irq_desc *desc)
 	struct cpumask *set = irq_default_affinity;
 	int ret, node = irq_desc_get_node(desc);
 	static DEFINE_RAW_SPINLOCK(mask_lock);
-	static struct cpumask mask;
+	static struct cpumask mask, max_mask;
 
 	/* Excludes PER_CPU and NO_BALANCE interrupts */
 	if (!__irq_can_set_affinity(desc))
 		return 0;
 
 	raw_spin_lock(&mask_lock);
+
+	if (desc->irq_data.domain && desc->irq_data.domain->max_affinity)
+		cpumask_and(&max_mask, cpu_online_mask,
+			    desc->irq_data.domain->max_affinity);
+	else
+		cpumask_copy(&max_mask, cpu_online_mask);
+
 	/*
 	 * Preserve the managed affinity setting and a userspace affinity
 	 * setup, but make sure that one of the targets is online.
@@ -498,15 +509,15 @@ int irq_setup_affinity(struct irq_desc *desc)
 	if (irqd_affinity_is_managed(&desc->irq_data) ||
 	    irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
 		if (cpumask_intersects(desc->irq_common_data.affinity,
-				       cpu_online_mask))
+				       &max_mask))
 			set = desc->irq_common_data.affinity;
 		else
 			irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
 	}
 
-	cpumask_and(&mask, cpu_online_mask, set);
+	cpumask_and(&mask, &max_mask, set);
 	if (cpumask_empty(&mask))
-		cpumask_copy(&mask, cpu_online_mask);
+		cpumask_copy(&mask, &max_mask);
 
 	if (node != NUMA_NO_NODE) {
 		const struct cpumask *nodemask = cpumask_of_node(node);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (6 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 08/13] genirq: Add irq_domain_set_affinity() David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:42     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR David Woodhouse
                     ` (3 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

This is the mask of CPUs to which IRQs can be delivered without interrupt
remapping.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/mpspec.h  |  1 +
 arch/x86/kernel/apic/apic.c    | 12 ++++++++++++
 arch/x86/kernel/apic/io_apic.c |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 25ee8ca0a1f2..b2090be5b444 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -141,5 +141,6 @@ static inline void physid_set_mask_of_physid(int physid, physid_mask_t *map)
 #define PHYSID_MASK_NONE	{ {[0 ... PHYSID_ARRAY_SIZE-1] = 0UL} }
 
 extern physid_mask_t phys_cpu_present_map;
+extern cpumask_t x86_non_ir_cpumask;
 
 #endif /* _ASM_X86_MPSPEC_H */
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 459c78558f36..069f5e9f1d28 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -103,6 +103,9 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_acpiid);
 
+/* Mask of CPUs which can be targeted by non-remapped interrupts. */
+cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL };
+
 #ifdef CONFIG_X86_32
 
 /*
@@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void)
 static __init void try_to_enable_x2apic(int remap_mode)
 {
 	u32 apic_limit = 0;
+	int i;
 
 	if (x2apic_state == X2APIC_DISABLED)
 		return;
@@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode)
 	if (apic_limit)
 		x2apic_set_max_apicid(apic_limit);
 
+	/* Build the affinity mask for interrupts that can't be remapped. */
+	cpumask_clear(&x86_non_ir_cpumask);
+	i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit);
+	for ( ; i >= 0; i--) {
+		if (cpu_physical_id(i) <= apic_limit)
+			cpumask_set_cpu(i, &x86_non_ir_cpumask);
+	}
+
 	x2apic_enable();
 }
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index aa9a3b54a96c..4d0ef46fedb9 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
 		struct irq_alloc_info info;
 
 		ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
+		if (domain->parent == x86_vector_domain)
+			info.mask = &x86_non_ir_cpumask;
 		info.devid = mpc_ioapic_id(ioapic);
 		info.ioapic.pin = pin;
 		mutex_lock(&ioapic_mutex);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (7 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-06 21:54     ` Thomas Gleixner
  2020-10-05 15:28   ` [PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping David Woodhouse
                     ` (2 subsequent siblings)
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

When interrupt remapping isn't enabled, only the first 255 CPUs can
receive external interrupts. Set the appropriate max affinity for
the IOAPIC and MSI IRQ domains accordingly.

This also fixes the case where interrupt remapping is enabled but some
devices are not within the scope of any active IOMMU.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/apic/io_apic.c | 2 ++
 arch/x86/kernel/apic/msi.c     | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 4d0ef46fedb9..1c8ce7bc098f 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2332,6 +2332,8 @@ static int mp_irqdomain_create(int ioapic)
 	}
 
 	ip->irqdomain->parent = parent;
+	if (parent == x86_vector_domain)
+		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
 
 	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
 	    cfg->type == IOAPIC_DOMAIN_STRICT)
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 4d891967bea4..af5ce5c4da02 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
 		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
 	} else {
 		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
+		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
 	}
 	return d;
 }
@@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
 		irq_domain_free_fwnode(fn);
 		kfree(domain_info);
 	}
+	if (parent == x86_vector_domain)
+		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
 	return d;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (8 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-05 15:28   ` [PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant David Woodhouse
  2020-10-05 15:28   ` [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID David Woodhouse
  11 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

Now that external interrupt affinity can be limited to the range of
CPUs that can be reached through legacy IOAPIC RTEs and MSI, it is
possible to use additional CPUs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/apic/apic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 069f5e9f1d28..750a92464bec 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1881,8 +1881,6 @@ static __init void try_to_enable_x2apic(int remap_mode)
 		 */
 		x2apic_phys = 1;
 	}
-	if (apic_limit)
-		x2apic_set_max_apicid(apic_limit);
 
 	/* Build the affinity mask for interrupts that can't be remapped. */
 	cpumask_clear(&x86_non_ir_cpumask);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (9 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-05 15:28   ` [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID David Woodhouse
  11 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

It took me a while to realise that this "IRQ remapping" driver exists
not to actually remap interrupts, but to return -EINVAL if anyone ever
tries to set the affinity to a set of CPUs which can't be reached
*without* remapping. Having fixed the core IRQ domain code to handle
such limited, all this hackery can now die.

I haven't deleted it entirely because its existence still causes the
kernel to use X2APIC in cluster mode not physical. I'm not sure we
*want* that, but it wants further thought and testing before ripping
it out too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/iommu/hyperv-iommu.c | 149 +----------------------------------
 1 file changed, 1 insertion(+), 148 deletions(-)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e09e2d734c57..46a794d34f57 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -24,156 +24,12 @@
 #include "irq_remapping.h"
 
 #ifdef CONFIG_IRQ_REMAP
-
-/*
- * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
- * Redirection Table. Hyper-V exposes one single IO-APIC and so define
- * 24 IO APIC remmapping entries.
- */
-#define IOAPIC_REMAPPING_ENTRY 24
-
-static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
-static struct irq_domain *ioapic_ir_domain;
-
-static int hyperv_ir_set_affinity(struct irq_data *data,
-		const struct cpumask *mask, bool force)
-{
-	struct irq_data *parent = data->parent_data;
-	struct irq_cfg *cfg = irqd_cfg(data);
-	struct IO_APIC_route_entry *entry;
-	int ret;
-
-	/* Return error If new irq affinity is out of ioapic_max_cpumask. */
-	if (!cpumask_subset(mask, &ioapic_max_cpumask))
-		return -EINVAL;
-
-	ret = parent->chip->irq_set_affinity(parent, mask, force);
-	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
-		return ret;
-
-	entry = data->chip_data;
-	entry->dest = cfg->dest_apicid;
-	entry->vector = cfg->vector;
-	send_cleanup_vector(cfg);
-
-	return 0;
-}
-
-static struct irq_chip hyperv_ir_chip = {
-	.name			= "HYPERV-IR",
-	.irq_ack		= apic_ack_irq,
-	.irq_set_affinity	= hyperv_ir_set_affinity,
-};
-
-static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
-				     unsigned int virq, unsigned int nr_irqs,
-				     void *arg)
-{
-	struct irq_alloc_info *info = arg;
-	struct irq_data *irq_data;
-	struct irq_desc *desc;
-	int ret = 0;
-
-	if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1)
-		return -EINVAL;
-
-	ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
-	if (ret < 0)
-		return ret;
-
-	irq_data = irq_domain_get_irq_data(domain, virq);
-	if (!irq_data) {
-		irq_domain_free_irqs_common(domain, virq, nr_irqs);
-		return -EINVAL;
-	}
-
-	irq_data->chip = &hyperv_ir_chip;
-
-	/*
-	 * If there is interrupt remapping function of IOMMU, setting irq
-	 * affinity only needs to change IRTE of IOMMU. But Hyper-V doesn't
-	 * support interrupt remapping function, setting irq affinity of IO-APIC
-	 * interrupts still needs to change IO-APIC registers. But ioapic_
-	 * configure_entry() will ignore value of cfg->vector and cfg->
-	 * dest_apicid when IO-APIC's parent irq domain is not the vector
-	 * domain.(See ioapic_configure_entry()) In order to setting vector
-	 * and dest_apicid to IO-APIC register, IO-APIC entry pointer is saved
-	 * in the chip_data and hyperv_irq_remapping_activate()/hyperv_ir_set_
-	 * affinity() set vector and dest_apicid directly into IO-APIC entry.
-	 */
-	irq_data->chip_data = info->ioapic.entry;
-
-	/*
-	 * Hypver-V IO APIC irq affinity should be in the scope of
-	 * ioapic_max_cpumask because no irq remapping support.
-	 */
-	desc = irq_data_to_desc(irq_data);
-	cpumask_copy(desc->irq_common_data.affinity, &ioapic_max_cpumask);
-
-	return 0;
-}
-
-static void hyperv_irq_remapping_free(struct irq_domain *domain,
-				 unsigned int virq, unsigned int nr_irqs)
-{
-	irq_domain_free_irqs_common(domain, virq, nr_irqs);
-}
-
-static int hyperv_irq_remapping_activate(struct irq_domain *domain,
-			  struct irq_data *irq_data, bool reserve)
-{
-	struct irq_cfg *cfg = irqd_cfg(irq_data);
-	struct IO_APIC_route_entry *entry = irq_data->chip_data;
-
-	entry->dest = cfg->dest_apicid;
-	entry->vector = cfg->vector;
-
-	return 0;
-}
-
-static const struct irq_domain_ops hyperv_ir_domain_ops = {
-	.alloc = hyperv_irq_remapping_alloc,
-	.free = hyperv_irq_remapping_free,
-	.activate = hyperv_irq_remapping_activate,
-};
-
 static int __init hyperv_prepare_irq_remapping(void)
 {
-	struct fwnode_handle *fn;
-	int i;
-
 	if (!hypervisor_is_type(X86_HYPER_MS_HYPERV) ||
 	    !x2apic_supported())
 		return -ENODEV;
 
-	fn = irq_domain_alloc_named_id_fwnode("HYPERV-IR", 0);
-	if (!fn)
-		return -ENOMEM;
-
-	ioapic_ir_domain =
-		irq_domain_create_hierarchy(arch_get_ir_parent_domain(),
-				0, IOAPIC_REMAPPING_ENTRY, fn,
-				&hyperv_ir_domain_ops, NULL);
-
-	if (!ioapic_ir_domain) {
-		irq_domain_free_fwnode(fn);
-		return -ENOMEM;
-	}
-
-	/*
-	 * Hyper-V doesn't provide irq remapping function for
-	 * IO-APIC and so IO-APIC only accepts 8-bit APIC ID.
-	 * Cpu's APIC ID is read from ACPI MADT table and APIC IDs
-	 * in the MADT table on Hyper-v are sorted monotonic increasingly.
-	 * APIC ID reflects cpu topology. There maybe some APIC ID
-	 * gaps when cpu number in a socket is not power of two. Prepare
-	 * max cpu affinity for IOAPIC irqs. Scan cpu 0-255 and set cpu
-	 * into ioapic_max_cpumask if its APIC ID is less than 256.
-	 */
-	for (i = min_t(unsigned int, num_possible_cpus() - 1, 255); i >= 0; i--)
-		if (cpu_physical_id(i) < 256)
-			cpumask_set_cpu(i, &ioapic_max_cpumask);
-
 	return 0;
 }
 
@@ -184,10 +40,7 @@ static int __init hyperv_enable_irq_remapping(void)
 
 static struct irq_domain *hyperv_get_irq_domain(struct irq_alloc_info *info)
 {
-	if (info->type == X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT)
-		return ioapic_ir_domain;
-	else
-		return NULL;
+	return NULL;
 }
 
 struct irq_remap_ops hyperv_irq_remap_ops = {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
  2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
                     ` (10 preceding siblings ...)
  2020-10-05 15:28   ` [PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant David Woodhouse
@ 2020-10-05 15:28   ` David Woodhouse
  2020-10-07  8:14     ` Paolo Bonzini
  11 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-05 15:28 UTC (permalink / raw)
  To: x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

From: David Woodhouse <dwmw@amazon.co.uk>

This allows the host to indicate that IOAPIC and MSI emulation supports
15-bit destination IDs, allowing up to 32Ki CPUs without remapping.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/virt/kvm/cpuid.rst     | 4 ++++
 arch/x86/include/uapi/asm/kvm_para.h | 1 +
 arch/x86/kernel/kvm.c                | 6 ++++++
 3 files changed, 11 insertions(+)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..1726b5925d2b 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
                                               async pf acknowledgment msr
                                               0x4b564d07.
 
+KVM_FEATURE_MSI_EXT_DEST_ID       15          guest checks this feature bit
+                                              before using extended destination
+                                              ID bits in MSI address bits 11-5.
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expeced in
                                               kvmclock
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..950afebfba88 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,6 +32,7 @@
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_MSI_EXT_DEST_ID	15
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1b51b727b140..4986b4399aef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -743,12 +743,18 @@ static void __init kvm_init_platform(void)
 	x86_platform.apic_post_init = kvm_apic_init;
 }
 
+static bool __init kvm_msi_ext_dest_id(void)
+{
+	return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name			= "KVM",
 	.detect			= kvm_detect,
 	.type			= X86_HYPER_KVM,
 	.init.guest_late_init	= kvm_guest_init,
 	.init.x2apic_available	= kvm_para_available,
+	.init.msi_ext_dest_id	= kvm_msi_ext_dest_id,
 	.init.init_platform	= kvm_init_platform,
 };
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit
  2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
@ 2020-10-06 20:45     ` Thomas Gleixner
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 20:45 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> -static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg)
> +static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar)

bool dmar?

> +/*
> + * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the
> + * high bits of the destination APIC ID. This can't be done in the general
> + * case for MSIs as it would be targeting real memory above 4GiB not the
> + * APIC.
> + */
> +static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
> +{
> +	__irq_msi_compose_msg(irqd_cfg(data), msg, 1);
> +
> +
> +

Lots of stray newlines...


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
  2020-10-05 15:28   ` [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs() David Woodhouse
@ 2020-10-06 21:01     ` Thomas Gleixner
  2020-10-06 21:07       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:01 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>  
>  #else /* CONFIG_SMP */
>  
> +#define irq_default_affinity (NULL)

...

>  static int alloc_descs(unsigned int start, unsigned int cnt, int node,
>  		       const struct irq_affinity_desc *affinity,
> +		       const struct cpumask *default_affinity,
>  		       struct module *owner)
>  {
>  	struct irq_desc *desc;
>  	int i;
>  
>  	/* Validate affinity mask(s) */
> +	if (!default_affinity || cpumask_empty(default_affinity))
> +		return -EINVAL;

How is that supposed to work on UP?

Aside of that I really hate to have yet another argument just
because.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/13] genirq: Add default_affinity argument to __irq_alloc_descs()
  2020-10-05 15:28   ` [PATCH 06/13] genirq: Add default_affinity argument " David Woodhouse
@ 2020-10-06 21:06     ` Thomas Gleixner
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:06 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> It already takes an array of affinities for each individual irq being
> allocated but that's awkward to allocate and populate in the case
> where they're all the same and inherited from the irqdomain, so pass
> the default in separately as a simple cpumask.

So we need another cpumask argument exposed to the world just because
it's so hard to extend struct irq_affinity_desc so it supports that use
case as well. It's not written in stone that this struct can only
support arrays.

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/irq.h    | 10 ++++++----
>  kernel/irq/devres.c    |  8 ++++++--
>  kernel/irq/irqdesc.c   | 10 ++++++++--
>  kernel/irq/irqdomain.c |  6 +++---

git grep __irq_alloc_descs() might help you to find _all_ instances ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
  2020-10-06 21:01     ` Thomas Gleixner
@ 2020-10-06 21:07       ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-06 21:07 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini



On 6 October 2020 22:01:18 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>>  
>>  #else /* CONFIG_SMP */
>>  
>> +#define irq_default_affinity (NULL)
>
>...
>
>>  static int alloc_descs(unsigned int start, unsigned int cnt, int
>node,
>>  		       const struct irq_affinity_desc *affinity,
>> +		       const struct cpumask *default_affinity,
>>  		       struct module *owner)
>>  {
>>  	struct irq_desc *desc;
>>  	int i;
>>  
>>  	/* Validate affinity mask(s) */
>> +	if (!default_affinity || cpumask_empty(default_affinity))
>> +		return -EINVAL;
>
>How is that supposed to work on UP?

Hm, good point.

>Aside of that I really hate to have yet another argument just
>because.

Yeah, I was trying to avoid having to allocate a whole array of irq_affinity_desc just to fill them all in with the same default.

But perhaps I need to treat the "affinity_max" like we do cpu_online_mask, and allow affinity to be set even to offline/unreachable CPUs at this point. Then we can be more relaxed about the default affinities.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-05 15:28   ` [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs() David Woodhouse
@ 2020-10-06 21:26     ` Thomas Gleixner
  2020-10-07  7:19       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:26 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This is the maximum possible set of CPUs which can be used. Use it
> to calculate the default affinity requested from __irq_alloc_descs()
> by first attempting to find the intersection with irq_default_affinity,
> or falling back to using just the max_affinity if the intersection
> would be empty.

And why do we need that as yet another argument?

This is an optional property of the irq domain, really and no caller has
any business with that. 

>  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> -			   int node, const struct irq_affinity_desc *affinity)
> +			   int node, const struct irq_affinity_desc *affinity,
> +			   const struct cpumask *max_affinity)
>  {
> +	cpumask_var_t default_affinity;
>  	unsigned int hint;
> +	int i;
> +
> +	/* Check requested per-IRQ affinities are in the possible range */
> +	if (affinity && max_affinity) {
> +		for (i = 0; i < cnt; i++)
> +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> +				return -EINVAL;

https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos

What is preventing the affinity spreading code from spreading the masks
out to unusable CPUs? The changelog is silent about that part.

> +	/*
> +	 * Generate default affinity. Either the possible subset of
> +	 * irq_default_affinity if such a subset is non-empty, or fall
> +	 * back to the provided max_affinity if there is no intersection.
..
> +	 * And just a copy of irq_default_affinity in the
> +	 * !CONFIG_CPUMASK_OFFSTACK case.

We know that already...

> +	 */
> +	memset(&default_affinity, 0, sizeof(default_affinity));

Right, memset() before allocating is useful.

> +	if ((max_affinity &&
> +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> +			return -ENOMEM;
> +		cpumask_and(default_affinity, max_affinity,
> +			    irq_default_affinity);
> +		if (cpumask_empty(default_affinity))
> +			cpumask_copy(default_affinity, max_affinity);
> +	} else if (cpumask_available(default_affinity))
> +		cpumask_copy(default_affinity, irq_default_affinity);

That's garbage and unreadable.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/13] genirq: Add irq_domain_set_affinity()
  2020-10-05 15:28   ` [PATCH 08/13] genirq: Add irq_domain_set_affinity() David Woodhouse
@ 2020-10-06 21:32     ` Thomas Gleixner
  2020-10-07  7:22       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:32 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> +/**
> + * irq_domain_set_affinity - Set maximum CPU affinity for domain
> + * @parent:	Domain to set affinity for
> + * @affinity:	Pointer to cpumask, consumed by domain
> + *
> + * Sets the maximal set of CPUs to which interrupts in this domain may
> + * be delivered. Must only be called after creation, before any interrupts
> + * have been in the domain.
> + *
> + * This function retains a pointer to the cpumask which is passed in.
> + */
> +int irq_domain_set_affinity(struct irq_domain *domain,
> +			    const struct cpumask *affinity)
> +{
> +	if (cpumask_empty(affinity))
> +		return -EINVAL;
> +	domain->max_affinity = affinity;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(irq_domain_set_affinity);

What the heck? Why does this need a setter function which is exported?
So that random driver writers can fiddle with it?

The affinity mask restriction of an irq domain is already known when the
domain is created.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask
  2020-10-05 15:28   ` [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask David Woodhouse
@ 2020-10-06 21:42     ` Thomas Gleixner
  2020-10-07  7:25       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:42 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This is the mask of CPUs to which IRQs can be delivered without interrupt
> remapping.
>  
> +/* Mask of CPUs which can be targeted by non-remapped interrupts. */
> +cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL };

What?

>  #ifdef CONFIG_X86_32
>  
>  /*
> @@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void)
>  static __init void try_to_enable_x2apic(int remap_mode)
>  {
>  	u32 apic_limit = 0;
> +	int i;
>  
>  	if (x2apic_state == X2APIC_DISABLED)
>  		return;
> @@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode)
>  	if (apic_limit)
>  		x2apic_set_max_apicid(apic_limit);
>  
> +	/* Build the affinity mask for interrupts that can't be remapped. */
> +	cpumask_clear(&x86_non_ir_cpumask);
> +	i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit);
> +	for ( ; i >= 0; i--) {
> +		if (cpu_physical_id(i) <= apic_limit)
> +			cpumask_set_cpu(i, &x86_non_ir_cpumask);
> +	}

Blink. If the APIC id is not linear with the cpu numbers then this
results in a reduced addressable set of CPUs. WHY?

> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index aa9a3b54a96c..4d0ef46fedb9 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
>  		struct irq_alloc_info info;
>  
>  		ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
> +		if (domain->parent == x86_vector_domain)
> +			info.mask = &x86_non_ir_cpumask;

We are not going to sprinkle such domain checks all over the
place. Again, the mask is a property of the interrupt domain.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-05 15:28   ` [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR David Woodhouse
@ 2020-10-06 21:54     ` Thomas Gleixner
  2020-10-07  7:48       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-06 21:54 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When interrupt remapping isn't enabled, only the first 255 CPUs can

No, only CPUs with an APICid < 255 ....

> receive external interrupts. Set the appropriate max affinity for
> the IOAPIC and MSI IRQ domains accordingly.
>
> This also fixes the case where interrupt remapping is enabled but some
> devices are not within the scope of any active IOMMU.

What? If this fixes an pre-existing problem then

      1) Explain the problem proper
      2) Have a patch at the beginning of the series which fixes it
         independently of this pile

If it's fixing a problem in your pile, then you got the ordering wrong.

You didn't start kernel programming as of yesterday, so you really know
how that works.

>  	ip->irqdomain->parent = parent;
> +	if (parent == x86_vector_domain)
> +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);

OMG

>  	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
>  	    cfg->type == IOAPIC_DOMAIN_STRICT)
> diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> index 4d891967bea4..af5ce5c4da02 100644
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
>  		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
>  	} else {
>  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);

So here it's unconditional

>  	}
>  	return d;
>  }
> @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  		irq_domain_free_fwnode(fn);
>  		kfree(domain_info);
>  	}
> +	if (parent == x86_vector_domain)
> +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);

And here we need a condition again. Completely obvious and reviewable - NOT.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-06 21:26     ` Thomas Gleixner
@ 2020-10-07  7:19       ` David Woodhouse
  2020-10-07 13:37         ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07  7:19 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3512 bytes --]

On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This is the maximum possible set of CPUs which can be used. Use it
> > to calculate the default affinity requested from __irq_alloc_descs()
> > by first attempting to find the intersection with irq_default_affinity,
> > or falling back to using just the max_affinity if the intersection
> > would be empty.
> 
> And why do we need that as yet another argument?
> 
> This is an optional property of the irq domain, really and no caller has
> any business with that. 

Because irq_domain_alloc_descs() doesn't actually *take* the domain as
an argument. It's more of an internal function, which is only non-
static because it's used from kernel/irq/ipi.c too for some reason. If
we convert the IPI code to just call __irq_alloc_descs() directly,
perhaps that we can actually make irq_domain_alloc_decs() static.

> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> > -			   int node, const struct irq_affinity_desc *affinity)
> > +			   int node, const struct irq_affinity_desc *affinity,
> > +			   const struct cpumask *max_affinity)
> >  {
> > +	cpumask_var_t default_affinity;
> >  	unsigned int hint;
> > +	int i;
> > +
> > +	/* Check requested per-IRQ affinities are in the possible range */
> > +	if (affinity && max_affinity) {
> > +		for (i = 0; i < cnt; i++)
> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> > +				return -EINVAL;
> 
> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
> 
> What is preventing the affinity spreading code from spreading the masks
> out to unusable CPUs? The changelog is silent about that part.

I'm coming to the conclusion that we should allow unusable CPUs to be
specified at this point, just as we do offline CPUs. That's largely
driven by the realisation that our x86_non_ir_cpumask is only going to
contain online CPUs anyway, and hotplugged CPUs only get added to it as
they are brought online.

> > +	/*
> > +	 * Generate default affinity. Either the possible subset of
> > +	 * irq_default_affinity if such a subset is non-empty, or fall
> > +	 * back to the provided max_affinity if there is no intersection.
> 
> ..
> > +	 * And just a copy of irq_default_affinity in the
> > +	 * !CONFIG_CPUMASK_OFFSTACK case.
> 
> We know that already...
> 
> > +	 */
> > +	memset(&default_affinity, 0, sizeof(default_affinity));
> 
> Right, memset() before allocating is useful.

The memset is because there's no cpumask_var_t initialiser that I can
see. So cpumask_available() on the uninitialised 'default_affinity'
variable might be true even in the OFFSTACK case.

> > +	if ((max_affinity &&
> > +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> > +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> > +			return -ENOMEM;
> > +		cpumask_and(default_affinity, max_affinity,
> > +			    irq_default_affinity);
> > +		if (cpumask_empty(default_affinity))
> > +			cpumask_copy(default_affinity, max_affinity);
> > +	} else if (cpumask_available(default_affinity))
> > +		cpumask_copy(default_affinity, irq_default_affinity);
> 
> That's garbage and unreadable.

That's why there was a comment explaining it... at which point you
claimed to already know :)

Clearly the comment didn't do the job it was supposed to do. I'll
rework.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/13] genirq: Add irq_domain_set_affinity()
  2020-10-06 21:32     ` Thomas Gleixner
@ 2020-10-07  7:22       ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-07  7:22 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 758 bytes --]

On Tue, 2020-10-06 at 23:32 +0200, Thomas Gleixner wrote:
> What the heck? Why does this need a setter function which is exported?
> So that random driver writers can fiddle with it?
> 
> The affinity mask restriction of an irq domain is already known when the
> domain is created.

It's exported because __irq_domain_add is exported. I didn't want to
just add a rarely-used extra argument to __irq_domain_add and the other
public APIs which call into it, and figured a separate setter function
was the simplest option.

I can rework to add the argument to __irq_domain_add if you prefer.

Or we can declare that restricted affinity *isn't* something that IRQ
domains provided by modules can have, and just not export the setter
function?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask
  2020-10-06 21:42     ` Thomas Gleixner
@ 2020-10-07  7:25       ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-07  7:25 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]

On Tue, 2020-10-06 at 23:42 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This is the mask of CPUs to which IRQs can be delivered without
> > interrupt
> > remapping.
> >  
> > +/* Mask of CPUs which can be targeted by non-remapped interrupts.
> > */
> > +cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL };
> 
> What?

By default, if we didn't hit any limits, all CPUs can be targeted by
external interrupts. It's the default today.

Or at least we pretend it is, modulo the bugs :)

> >  #ifdef CONFIG_X86_32
> >  
> >  /*
> > @@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void)
> >  static __init void try_to_enable_x2apic(int remap_mode)
> >  {
> >  	u32 apic_limit = 0;
> > +	int i;
> >  
> >  	if (x2apic_state == X2APIC_DISABLED)
> >  		return;
> > @@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode)
> >  	if (apic_limit)
> >  		x2apic_set_max_apicid(apic_limit);
> >  
> > +	/* Build the affinity mask for interrupts that can't be remapped. */
> > +	cpumask_clear(&x86_non_ir_cpumask);
> > +	i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit);
> > +	for ( ; i >= 0; i--) {
> > +		if (cpu_physical_id(i) <= apic_limit)
> > +			cpumask_set_cpu(i, &x86_non_ir_cpumask);
> > +	}
> 
> Blink. If the APIC id is not linear with the cpu numbers then this
> results in a reduced addressable set of CPUs. WHY?

Hm, good question. That loop was cargo-culted from hyperv-iommu.c;
perhaps it makes more sense there because Hyper-V really does promise
that linearity, or perhaps it was already buggy. Will fix.

In fact, in apic.c I could probably just use the cpuid_to_apicid array
which is right there in the file.

> > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > index aa9a3b54a96c..4d0ef46fedb9 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
> >  		struct irq_alloc_info info;
> >  
> >  		ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
> > +		if (domain->parent == x86_vector_domain)
> > +			info.mask = &x86_non_ir_cpumask;
> 
> We are not going to sprinkle such domain checks all over the
> place. Again, the mask is a property of the interrupt domain.

Yeah, that's a hangover from the first attempts which I forgot to
delete.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-06 21:54     ` Thomas Gleixner
@ 2020-10-07  7:48       ` David Woodhouse
  2020-10-07 12:59         ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07  7:48 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > When interrupt remapping isn't enabled, only the first 255 CPUs can
> 
> No, only CPUs with an APICid < 255 ....

Ack.

> > receive external interrupts. Set the appropriate max affinity for
> > the IOAPIC and MSI IRQ domains accordingly.
> > 
> > This also fixes the case where interrupt remapping is enabled but some
> > devices are not within the scope of any active IOMMU.
> 
> What? If this fixes an pre-existing problem then
> 
>       1) Explain the problem proper
>       2) Have a patch at the beginning of the series which fixes it
>          independently of this pile
> 
> If it's fixing a problem in your pile, then you got the ordering wrong.

It's not that simple; there's not a single patch which fixes that and
which can go first. I can, and do, fix the "no IR" case in a simple
patch that goes first, simply by restricting the kernel to the APIC IDs
which can be targeted.

This is the case I called out in the cover letter:

    This patch series implements a per-domain "maximum affinity" set and
    uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
    allowing more CPUs to be used without interrupt remapping, this also
    fixes the case where some IOAPICs or PCI devices aren't actually in
    scope of any active IOMMU and are operating without remapping.

To fix *that* case, we really do need the whole series giving us per-
domain restricted affinity, and to use it for those MSIs/IOAPICs that
the IRQ remapping doesn't cover.

> You didn't start kernel programming as of yesterday, so you really know
> how that works.
> 
> >  	ip->irqdomain->parent = parent;
> > +	if (parent == x86_vector_domain)
> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
> 
> OMG

This IOAPIC function may or may not be behind remapping.

> 
> >  	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
> >  	    cfg->type == IOAPIC_DOMAIN_STRICT)
> > diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> > index 4d891967bea4..af5ce5c4da02 100644
> > --- a/arch/x86/kernel/apic/msi.c
> > +++ b/arch/x86/kernel/apic/msi.c
> > @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
> >  		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
> >  	} else {
> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> 
> So here it's unconditional

Yes, this code is only for the non-remapped case and there's a separate
arch_create_remap_msi_irq_domain() function a few lines further down
which creates the IR-PCI-MSI IRQ domain. So no need for a condition
here.

> >  	}
> >  	return d;
> >  }
> > @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
> >  		irq_domain_free_fwnode(fn);
> >  		kfree(domain_info);
> >  	}
> > +	if (parent == x86_vector_domain)
> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> 
> And here we need a condition again. Completely obvious and reviewable - NOT.

And HPET may or may not be behind remapping so again needs the
condition. I had figured that part was fairly obvious but can note it
in the commit comment.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
  2020-10-05 15:28   ` [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID David Woodhouse
@ 2020-10-07  8:14     ` Paolo Bonzini
  2020-10-07  8:59       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Paolo Bonzini @ 2020-10-07  8:14 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv

On 05/10/20 17:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This allows the host to indicate that IOAPIC and MSI emulation supports
> 15-bit destination IDs, allowing up to 32Ki CPUs without remapping.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  Documentation/virt/kvm/cpuid.rst     | 4 ++++
>  arch/x86/include/uapi/asm/kvm_para.h | 1 +
>  arch/x86/kernel/kvm.c                | 6 ++++++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> index a7dff9186bed..1726b5925d2b 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
>                                                async pf acknowledgment msr
>                                                0x4b564d07.
>  
> +KVM_FEATURE_MSI_EXT_DEST_ID       15          guest checks this feature bit
> +                                              before using extended destination
> +                                              ID bits in MSI address bits 11-5.
> +
>  KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
>                                                per-cpu warps are expeced in
>                                                kvmclock
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 812e9b4c1114..950afebfba88 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -32,6 +32,7 @@
>  #define KVM_FEATURE_POLL_CONTROL	12
>  #define KVM_FEATURE_PV_SCHED_YIELD	13
>  #define KVM_FEATURE_ASYNC_PF_INT	14
> +#define KVM_FEATURE_MSI_EXT_DEST_ID	15
>  
>  #define KVM_HINTS_REALTIME      0
>  
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 1b51b727b140..4986b4399aef 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -743,12 +743,18 @@ static void __init kvm_init_platform(void)
>  	x86_platform.apic_post_init = kvm_apic_init;
>  }
>  
> +static bool __init kvm_msi_ext_dest_id(void)
> +{
> +	return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
> +}
> +
>  const __initconst struct hypervisor_x86 x86_hyper_kvm = {
>  	.name			= "KVM",
>  	.detect			= kvm_detect,
>  	.type			= X86_HYPER_KVM,
>  	.init.guest_late_init	= kvm_guest_init,
>  	.init.x2apic_available	= kvm_para_available,
> +	.init.msi_ext_dest_id	= kvm_msi_ext_dest_id,
>  	.init.init_platform	= kvm_init_platform,
>  };
>  
> 

Looks like the rest of the series needs some more work, but anyway:

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
  2020-10-07  8:14     ` Paolo Bonzini
@ 2020-10-07  8:59       ` David Woodhouse
  2020-10-07 11:15         ` Paolo Bonzini
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07  8:59 UTC (permalink / raw)
  To: Paolo Bonzini, x86; +Cc: iommu, kvm, linux-hyperv

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Wed, 2020-10-07 at 10:14 +0200, Paolo Bonzini wrote:
> Looks like the rest of the series needs some more work, but anyway:
> 
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks.

Yeah, I was expecting the per-irqdomain affinity support to take a few
iterations. But this part, still sticking with the current behaviour of
only allowing CPUs to come online at all if they can be reached by all
interrupts, can probably go in first.

It's presumably (hopefully!) a blocker for the qemu patch which exposes
the same feature bit defined in this patch.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
  2020-10-07  8:59       ` David Woodhouse
@ 2020-10-07 11:15         ` Paolo Bonzini
  2020-10-07 12:04           ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Paolo Bonzini @ 2020-10-07 11:15 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv

On 07/10/20 10:59, David Woodhouse wrote:
> Yeah, I was expecting the per-irqdomain affinity support to take a few
> iterations. But this part, still sticking with the current behaviour of
> only allowing CPUs to come online at all if they can be reached by all
> interrupts, can probably go in first.
> 
> It's presumably (hopefully!) a blocker for the qemu patch which exposes
> the same feature bit defined in this patch.

Yeah, though we could split it further and get the documentation part in
first.  That would let the QEMU part go through.

Paolo


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
  2020-10-07 11:15         ` Paolo Bonzini
@ 2020-10-07 12:04           ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 12:04 UTC (permalink / raw)
  To: Paolo Bonzini, x86; +Cc: iommu, kvm, linux-hyperv

[-- Attachment #1: Type: text/plain, Size: 1356 bytes --]

On Wed, 2020-10-07 at 13:15 +0200, Paolo Bonzini wrote:
> On 07/10/20 10:59, David Woodhouse wrote:
> > Yeah, I was expecting the per-irqdomain affinity support to take a few
> > iterations. But this part, still sticking with the current behaviour of
> > only allowing CPUs to come online at all if they can be reached by all
> > interrupts, can probably go in first.
> > 
> > It's presumably (hopefully!) a blocker for the qemu patch which exposes
> > the same feature bit defined in this patch.
> 
> Yeah, though we could split it further and get the documentation part in
> first.  That would let the QEMU part go through.

Potentially. Although I've worked out that the first patch in my
series, adding x2apic_set_max_apicid(), is actually a bug fix because
it fixes the behaviour if you only *hotplug* CPUs with APIC IDs > 255
and there were none of them present at boot time.

So I'll post this set on its own to start with, and then focus on the
per-irqdomain affinity support after that.

David Woodhouse (5):
      x86/apic: Fix x2apic enablement without interrupt remapping
      x86/msi: Only use high bits of MSI address for DMAR unit
      x86/ioapic: Handle Extended Destination ID field in RTE
      x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
      x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07  7:48       ` David Woodhouse
@ 2020-10-07 12:59         ` Thomas Gleixner
  2020-10-07 13:08           ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 12:59 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> This is the case I called out in the cover letter:
>
>     This patch series implements a per-domain "maximum affinity" set and
>     uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
>     allowing more CPUs to be used without interrupt remapping, this also
>     fixes the case where some IOAPICs or PCI devices aren't actually in
>     scope of any active IOMMU and are operating without remapping.
>
> To fix *that* case, we really do need the whole series giving us per-
> domain restricted affinity, and to use it for those MSIs/IOAPICs that
> the IRQ remapping doesn't cover.

Which do not exist today.

>> >  	ip->irqdomain->parent = parent;
>> > +	if (parent == x86_vector_domain)
>> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>> 
>> OMG
>
> This IOAPIC function may or may not be behind remapping.

>> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>> 
>> So here it's unconditional
>
> Yes, this code is only for the non-remapped case and there's a separate
> arch_create_remap_msi_irq_domain() function a few lines further down
> which creates the IR-PCI-MSI IRQ domain. So no need for a condition
> here.
>
>> > +	if (parent == x86_vector_domain)
>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>> 
>> And here we need a condition again. Completely obvious and reviewable - NOT.
>
> And HPET may or may not be behind remapping so again needs the
> condition. I had figured that part was fairly obvious but can note it
> in the commit comment.

And all of this is completely wrong to begin with.

The information has to property of the relevant irq domains and the
hierarchy allows you nicely to retrieve it from there instead of
sprinkling this all over the place.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 12:59         ` Thomas Gleixner
@ 2020-10-07 13:08           ` David Woodhouse
  2020-10-07 14:05             ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 13:08 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini



On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>> On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
>>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> This is the case I called out in the cover letter:
>>
>>     This patch series implements a per-domain "maximum affinity" set
>and
>>     uses it for the non-remapped IOAPIC and MSI domains on x86. As
>well as
>>     allowing more CPUs to be used without interrupt remapping, this
>also
>>     fixes the case where some IOAPICs or PCI devices aren't actually
>in
>>     scope of any active IOMMU and are operating without remapping.
>>
>> To fix *that* case, we really do need the whole series giving us per-
>> domain restricted affinity, and to use it for those MSIs/IOAPICs that
>> the IRQ remapping doesn't cover.
>
>Which do not exist today.

Sure. But at patch 10/13 into this particular patch series, it *does* exist.

(Ignoring, for the moment, that I'm about to rewrite half the preceding patches to take a different approach)


>>> >  	ip->irqdomain->parent = parent;
>>> > +	if (parent == x86_vector_domain)
>>> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>>> 
>>> OMG
>>
>> This IOAPIC function may or may not be behind remapping.
>
>>> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
>>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>> 
>>> So here it's unconditional
>>
>> Yes, this code is only for the non-remapped case and there's a
>separate
>> arch_create_remap_msi_irq_domain() function a few lines further down
>> which creates the IR-PCI-MSI IRQ domain. So no need for a condition
>> here.
>>
>>> > +	if (parent == x86_vector_domain)
>>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>> 
>>> And here we need a condition again. Completely obvious and
>reviewable - NOT.
>>
>> And HPET may or may not be behind remapping so again needs the
>> condition. I had figured that part was fairly obvious but can note it
>> in the commit comment.
>
>And all of this is completely wrong to begin with.
>
>The information has to property of the relevant irq domains and the
>hierarchy allows you nicely to retrieve it from there instead of
>sprinkling this all over the place.

No. This is not a property of the parent domain per se. Especially if you're thinking that we could inherit the affinity mask from the parent, then twice no.

This is a property of the MSI domain itself, and how many bits of destination ID the hardware at *this* level can interpret and pass on to the parent domain.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07  7:19       ` David Woodhouse
@ 2020-10-07 13:37         ` Thomas Gleixner
  2020-10-07 14:10           ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 13:37 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> > From: David Woodhouse <dwmw@amazon.co.uk>
>> > 
>> > This is the maximum possible set of CPUs which can be used. Use it
>> > to calculate the default affinity requested from __irq_alloc_descs()
>> > by first attempting to find the intersection with irq_default_affinity,
>> > or falling back to using just the max_affinity if the intersection
>> > would be empty.
>> 
>> And why do we need that as yet another argument?
>> 
>> This is an optional property of the irq domain, really and no caller has
>> any business with that. 
>
> Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> an argument. It's more of an internal function, which is only non-
> static because it's used from kernel/irq/ipi.c too for some reason. If
> we convert the IPI code to just call __irq_alloc_descs() directly,
> perhaps that we can actually make irq_domain_alloc_decs() static.

What is preventing you to change the function signature? But handing
down irqdomain here is not cutting it. The right thing to do is to
replace 'struct irq_affinity_desc *affinity' with something more
flexible.

>> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
>> > -			   int node, const struct irq_affinity_desc *affinity)
>> > +			   int node, const struct irq_affinity_desc *affinity,
>> > +			   const struct cpumask *max_affinity)
>> >  {
>> > +	cpumask_var_t default_affinity;
>> >  	unsigned int hint;
>> > +	int i;
>> > +
>> > +	/* Check requested per-IRQ affinities are in the possible range */
>> > +	if (affinity && max_affinity) {
>> > +		for (i = 0; i < cnt; i++)
>> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
>> > +				return -EINVAL;
>> 
>> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>> 
>> What is preventing the affinity spreading code from spreading the masks
>> out to unusable CPUs? The changelog is silent about that part.
>
> I'm coming to the conclusion that we should allow unusable CPUs to be
> specified at this point, just as we do offline CPUs. That's largely
> driven by the realisation that our x86_non_ir_cpumask is only going to
> contain online CPUs anyway, and hotplugged CPUs only get added to it as
> they are brought online.

Can you please stop looking at this from a x86 only perspective. It's
largely irrelevant what particular needs x86 or virt or whatever has.

Fact is, that if there are CPUs which cannot be targeted by device
interrupts then the multiqueue affinity mechanism has to be fixed to
handle this. Right now it's just broken.

Passing yet more cpumasks and random pointers around through device
drivers and whatever is just not going to happen. Neither are we going
to have

        arch_can_be_used_for_device_interrupts_mask

or whatever you come up with and claim it to be 'generic'.

The whole affinity control mechanism needs to be refactored from ground
up and the information about CPUs which can be targeted has to be
retrievable through the irqdomain hierarchy.

Anything else is just tinkering and I have zero interest in mopping up
after you.

It's pretty obvious that the irq domains are the right place to store
that information:

const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
{
        while (d) {
        	if (d->get_possible_affinity)
                	return d->get_possible_affinity(d);
                d = d->parent;
        }
        return cpu_possible_mask;
}

So if you look at X86 then you have either:

   [VECTOR] ----------------- [IO/APIC]
                          |-- [MSI]
                          |-- [WHATEVER]

or

   [VECTOR] ---[REMAP]------- [IO/APIC]
             |            |-- [MSI]
             |----------------[WHATEVER]

So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
then irqdomain_get_possible_affinity() will return the correct result
independent whether remapping is enabled or not.

This allows to use that for other things like per node restrictions or
whatever people come up with, without sprinkling more insanities through
the tree.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 13:08           ` David Woodhouse
@ 2020-10-07 14:05             ` Thomas Gleixner
  2020-10-07 14:23               ` David Woodhouse
  2020-10-07 15:05               ` David Woodhouse
  0 siblings, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 14:05 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> To fix *that* case, we really do need the whole series giving us per-
>>> domain restricted affinity, and to use it for those MSIs/IOAPICs that
>>> the IRQ remapping doesn't cover.
>>
>>Which do not exist today.
>
> Sure. But at patch 10/13 into this particular patch series, it *does*
> exist.

As I told you before: Your ordering is wrong. We do not introduce bugs
first and then fix them later ....

>>And all of this is completely wrong to begin with.
>>
>>The information has to property of the relevant irq domains and the
>>hierarchy allows you nicely to retrieve it from there instead of
>>sprinkling this all over the place.
>
> No. This is not a property of the parent domain per se. Especially if
> you're thinking that we could inherit the affinity mask from the
> parent, then twice no.
>
> This is a property of the MSI domain itself, and how many bits of
> destination ID the hardware at *this* level can interpret and pass on
> to the parent domain.

Errm what?

The MSI domain does not know anything about what the underlying domain
can handle and it shouldn't.

If MSI is on top of remapping then the remapping domain defines what the
MSI domain can do and not the other way round. Ditto for the non
remapped case in which the vector domain decides.

The top most MSI irq chip does not even have a compose function, neither
for the remap nor for the vector case. The composition is done by the
parent domain from the data which the parent domain constructed. Same
for the IO/APIC just less clearly separated.

The top most chip just takes what the underlying domain constructed and
writes it to the message store, because that's what the top most chip
controls. It does not control the content.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07 13:37         ` Thomas Gleixner
@ 2020-10-07 14:10           ` David Woodhouse
  2020-10-07 15:57             ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 14:10 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 5425 bytes --]

On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > This is the maximum possible set of CPUs which can be used. Use it
> > > > to calculate the default affinity requested from __irq_alloc_descs()
> > > > by first attempting to find the intersection with irq_default_affinity,
> > > > or falling back to using just the max_affinity if the intersection
> > > > would be empty.
> > > 
> > > And why do we need that as yet another argument?
> > > 
> > > This is an optional property of the irq domain, really and no caller has
> > > any business with that. 
> > 
> > Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> > an argument. It's more of an internal function, which is only non-
> > static because it's used from kernel/irq/ipi.c too for some reason. If
> > we convert the IPI code to just call __irq_alloc_descs() directly,
> > perhaps that we can actually make irq_domain_alloc_decs() static.
> 
> What is preventing you to change the function signature? But handing
> down irqdomain here is not cutting it. The right thing to do is to
> replace 'struct irq_affinity_desc *affinity' with something more
> flexible.

Yeah, although I do think I want to ditch this part completely, and
treat the "possible" mask for the domain very much more like we do
cpu_online_mask. In that we can allow affinities to be *requested*
which are outside it, and they can just never be *effective* while
those CPUs aren't present and reachable.

That way a lot of the nastiness about preparing an "acceptable" mask in
advance can go away.

It's *also* driven, as I noted, by the realisation that on x86, the
x86_non_ir_cpumask will only ever contain those CPUs which have been
brought online so far and meet the criteria... but won't that be true
on *any* system where CPU numbers are virtual and not 1:1 mapped with
whatever determines reachability by the IRQ domain? It isn't *such* an
x86ism, surely?

> Fact is, that if there are CPUs which cannot be targeted by device
> interrupts then the multiqueue affinity mechanism has to be fixed to
> handle this. Right now it's just broken.

I think part of the problem there is that I don't really understand how
this part is *supposed* to work. I was focusing on getting the simple
case working first, in the expectation that we'd come back to that part
ansd you'd keep me honest. Is there some decent documentation on this
that I'm missing?

> It's pretty obvious that the irq domains are the right place to store
> that information:
> 
> const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
> {
>         while (d) {
>         	if (d->get_possible_affinity)
>                 	return d->get_possible_affinity(d);
>                 d = d->parent;
>         }
>         return cpu_possible_mask;
> }

Sure.

> So if you look at X86 then you have either:
> 
>    [VECTOR] ----------------- [IO/APIC]
>                           |-- [MSI]
>                           |-- [WHATEVER]
> 
> or
> 
>    [VECTOR] ---[REMAP]------- [IO/APIC]
>              |            |-- [MSI]
>              |----------------[WHATEVER]

Hierarchically, theoretically the IOAPIC and HPET really ought to be
children of the MSI domain. It's the Compatibility MSI which has the
restriction on destination ID, because of how many bits it interprets
from the MSI address. HPET and IOAPIC are just generating MSIs that hit
that upstream limit.

> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
> then irqdomain_get_possible_affinity() will return the correct result
> independent whether remapping is enabled or not.

Sure. Although VECTOR doesn't have the restriction. As noted, it's the
Compatibility MSI that does. So the diagram looks something like this:

 [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
              |                             |---[ IR-PCI-MSI ]
              |                             |---[ IR-IOAPIC ]
              |
              |--[ COMPAT-MSI ] ---- [ HPET ]
                                 |-- [ PCI-MSI ]
                                 |-- [ IOAPIC ]

In this diagram, it's COMPAT-MSI that has the affinity restriction,
inherited by its children.

Now, I understand that you're not keen on IOAPIC actually being a child
of the MSI domain, and that's fine. In Linux right now, those generic
'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
same 8-bit affinity limit for themselves, as their parent is the VECTOR
domain.

I suppose it *might* not hurt to pretend that VECTOR does indeed have
the limit, and to *remove* it in the remapping domain. And then the
affinity limit could be removed in one place by the REMAP domain
because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
MSI and IR-IOAPIC domains *are* children of that.

But honestly, I'd rather iterate and get the generic irqdomain support
for affinity limits into decent shape before I worry *too* much about
precisely which was round it gets used by x86.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 14:05             ` Thomas Gleixner
@ 2020-10-07 14:23               ` David Woodhouse
  2020-10-07 16:02                 ` Thomas Gleixner
  2020-10-07 15:05               ` David Woodhouse
  1 sibling, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 14:23 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1445 bytes --]

On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> > > > To fix *that* case, we really do need the whole series giving us per-
> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
> > > > the IRQ remapping doesn't cover.
> > > 
> > > Which do not exist today.
> > 
> > Sure. But at patch 10/13 into this particular patch series, it *does*
> > exist.
> 
> As I told you before: Your ordering is wrong. We do not introduce bugs
> first and then fix them later ....

I didn't introduce that bug; it's been there for years. Fixing it
properly requires per-irqdomain affinity limits.

There's a cute little TODO at least in the Intel irq-remapping driver,
noting that we should probably check if there are any IOAPICs that
aren't in the scope of any DRHD at all. But that's all.

If it happens, I think we'll end up doing the right thing and
instantiating a non-IR IOAPIC domain for it, and if we *don't* have any
CPU with an APIC ID above... (checks notes)... 7 ... then it'll just
about work out. Over 7 and we're screwed (because logical mode; see
also https://git.infradead.org/users/dwmw2/linux.git/commit/429f0c33f
for a straw man but that's getting even further down the rabbit hole)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 14:05             ` Thomas Gleixner
  2020-10-07 14:23               ` David Woodhouse
@ 2020-10-07 15:05               ` David Woodhouse
  2020-10-07 15:25                 ` Thomas Gleixner
  1 sibling, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 15:05 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1857 bytes --]

On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> > > The information has to property of the relevant irq domains and the
> > > hierarchy allows you nicely to retrieve it from there instead of
> > > sprinkling this all over the place.
> > 
> > No. This is not a property of the parent domain per se. Especially if
> > you're thinking that we could inherit the affinity mask from the
> > parent, then twice no.
> > 
> > This is a property of the MSI domain itself, and how many bits of
> > destination ID the hardware at *this* level can interpret and pass on
> > to the parent domain.
> 
> Errm what?
> 
> The MSI domain does not know anything about what the underlying domain
> can handle and it shouldn't.
> 
> If MSI is on top of remapping then the remapping domain defines what the
> MSI domain can do and not the other way round. Ditto for the non
> remapped case in which the vector domain decides.
> 
> The top most MSI irq chip does not even have a compose function, neither
> for the remap nor for the vector case. The composition is done by the
> parent domain from the data which the parent domain constructed. Same
> for the IO/APIC just less clearly separated.
> 
> The top most chip just takes what the underlying domain constructed and
> writes it to the message store, because that's what the top most chip
> controls. It does not control the content.

Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
Linux doesn't really match my understanding of the real hardware, or
how qemu emulates it either. But it is at least internally self-
consistent, and in this model it probably does make some sense to claim
the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
the remapping irqdomain.

Not really the important part to deal with right now, either way.


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 15:05               ` David Woodhouse
@ 2020-10-07 15:25                 ` Thomas Gleixner
  2020-10-07 15:46                   ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 15:25 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 16:05, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> The top most MSI irq chip does not even have a compose function, neither
>> for the remap nor for the vector case. The composition is done by the
>> parent domain from the data which the parent domain constructed. Same
>> for the IO/APIC just less clearly separated.
>> 
>> The top most chip just takes what the underlying domain constructed and
>> writes it to the message store, because that's what the top most chip
>> controls. It does not control the content.
>
> Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
> Linux doesn't really match my understanding of the real hardware, or
> how qemu emulates it either. But it is at least internally self-
> consistent, and in this model it probably does make some sense to claim
> the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
> the remapping irqdomain.

It's clearly how the hardware works. MSI has a message store of some
sorts and if the entry is enabled then the MSI chip (in PCI or
elsewhere) will send exactly the message which is in that message
store. It knows absolutely nothing about what the message means and how
it is composed. The only things which MSI knows about is whether the
message address is 64bit wide or not and whether the entries are
maskable or not and how many entries it can store.

Software allocates a message target at the underlying irq domain (vector
or remap) and that underlying irq domain defines the properties.

If qemu emulates it differently then it's qemu's problem, but that does
not make it in anyway something which influences the irq domain
abstractions which are correctly modeled after how the hardware works.

> Not really the important part to deal with right now, either way.

Oh yes it is. We define that upfront and not after the fact.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 15:25                 ` Thomas Gleixner
@ 2020-10-07 15:46                   ` David Woodhouse
  2020-10-07 17:23                     ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 15:46 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2879 bytes --]

On Wed, 2020-10-07 at 17:25 +0200, Thomas Gleixner wrote:
> It's clearly how the hardware works. MSI has a message store of some
> sorts and if the entry is enabled then the MSI chip (in PCI or
> elsewhere) will send exactly the message which is in that message
> store. It knows absolutely nothing about what the message means and how
> it is composed. The only things which MSI knows about is whether the
> message address is 64bit wide or not and whether the entries are
> maskable or not and how many entries it can store.
> 
> Software allocates a message target at the underlying irq domain (vector
> or remap) and that underlying irq domain defines the properties.
> 
> If qemu emulates it differently then it's qemu's problem, but that does
> not make it in anyway something which influences the irq domain
> abstractions which are correctly modeled after how the hardware works.
> 
> > Not really the important part to deal with right now, either way.
> 
> Oh yes it is. We define that upfront and not after the fact.

The way the hardware works is that something handles physical address
cycles to addresses in the (on x86) 0xFEExxxxx range, and turns them
into actual interrupts on the appropriate CPU — where the APIC ID and
vector (etc.) are directly encoded in the bits of the address and the
data written. That compatibility x86 APIC MSI format is where the 8-bit 
(or 15-bit) limit comes from.

Then interrupt remapping comes along, and now those physical address
cycles are actually handled by the IOMMU — which can either handle the
compatibility format as before, or use a different format of
address/data bits and perform a lookup in its IRTE table.

The PCI MSI domain, HPET, and even the IOAPIC are just the things out
there on the bus which might perform those physical address cycles. And
yes, as you say they're just a message store sending exactly the
message that was composed for them. They know absolutely nothing about
what the message means and how it is composed.

It so happens that in Linux, we don't really architect the software
like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
*own* message composer which has the same limits and composes basically
the same messages as if it was *their* format, not dictated to them by
the APIC upstream. And that's what we're both getting our panties in a
knot about, I think.

It really doesn't matter that much to the underlying generic irqdomain
support for limited affinities. Except that you want to make the
generic code support the concept of a child domain supporting *more*
CPUs than its parent, which really doesn't make much sense if you think
about it. But it isn't that hard to do, and if it means we don't have
to argue any more about the x86 hierarchy not matching the hardware
then it's a price worth paying. 

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07 14:10           ` David Woodhouse
@ 2020-10-07 15:57             ` Thomas Gleixner
  2020-10-07 16:11               ` David Woodhouse
  2020-10-08  7:21               ` David Woodhouse
  0 siblings, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 15:57 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>> What is preventing you to change the function signature? But handing
>> down irqdomain here is not cutting it. The right thing to do is to
>> replace 'struct irq_affinity_desc *affinity' with something more
>> flexible.
>
> Yeah, although I do think I want to ditch this part completely, and
> treat the "possible" mask for the domain very much more like we do
> cpu_online_mask. In that we can allow affinities to be *requested*
> which are outside it, and they can just never be *effective* while
> those CPUs aren't present and reachable.

Huch?

> That way a lot of the nastiness about preparing an "acceptable" mask in
> advance can go away.

There is not lot's of nastiness.

> It's *also* driven, as I noted, by the realisation that on x86, the
> x86_non_ir_cpumask will only ever contain those CPUs which have been
> brought online so far and meet the criteria... but won't that be true
> on *any* system where CPU numbers are virtual and not 1:1 mapped with
> whatever determines reachability by the IRQ domain? It isn't *such* an
> x86ism, surely?

Yes, but that's exactly the reason why I don't want to have whatever
mask name you chose to be directly exposed and want it to be part of the
irq domains because then you can express arbitrary per domain limits.

>> Fact is, that if there are CPUs which cannot be targeted by device
>> interrupts then the multiqueue affinity mechanism has to be fixed to
>> handle this. Right now it's just broken.
>
> I think part of the problem there is that I don't really understand how
> this part is *supposed* to work. I was focusing on getting the simple
> case working first, in the expectation that we'd come back to that part
> ansd you'd keep me honest. Is there some decent documentation on this
> that I'm missing?

TLDR & HTF;

Multiqueue devices want to have at max 1 queue per CPU or if the device
has less queues than CPUs they want the queues to have a fixed
association to a set of CPUs.

At setup time this is established considering possible CPUs to handle
'physical' hotplug correctly.

If a queue has no online CPUs it cannot be started. If it's active and
the last CPU goes down then it's quiesced and stopped and the core code
shuts down the interrupt and does not move it to a still online CPU.

So with your hackery, we end up in a situation where we have a large
possible mask, but not all CPUs in that mask can be reached, which means
in a 1 queue per CPU scenario all unreachable CPUs would have
disfunctional queues.

So that spreading algorithm needs to know about this limitation.

>> So if you look at X86 then you have either:
>> 
>>    [VECTOR] ----------------- [IO/APIC]
>>                           |-- [MSI]
>>                           |-- [WHATEVER]
>> 
>> or
>> 
>>    [VECTOR] ---[REMAP]------- [IO/APIC]
>>              |            |-- [MSI]
>>              |----------------[WHATEVER]
>
> Hierarchically, theoretically the IOAPIC and HPET really ought to be
> children of the MSI domain. It's the Compatibility MSI which has the
> restriction on destination ID, because of how many bits it interprets
> from the MSI address. HPET and IOAPIC are just generating MSIs that hit
> that upstream limit.

We kinda have that, but not nicely abstracted. But we surely can and
should fix that.

>> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
>> then irqdomain_get_possible_affinity() will return the correct result
>> independent whether remapping is enabled or not.
>
w> Sure. Although VECTOR doesn't have the restriction. As noted, it's the
> Compatibility MSI that does. So the diagram looks something like this:
>
>  [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
>               |                             |---[ IR-PCI-MSI ]
>               |                             |---[ IR-IOAPIC ]
>               |
>               |--[ COMPAT-MSI ] ---- [ HPET ]
>                                  |-- [ PCI-MSI ]
>                                  |-- [ IOAPIC ]
>
>
> In this diagram, it's COMPAT-MSI that has the affinity restriction,
> inherited by its children.
>
> Now, I understand that you're not keen on IOAPIC actually being a child
> of the MSI domain, and that's fine. In Linux right now, those generic
> 'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
> compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
> same 8-bit affinity limit for themselves, as their parent is the VECTOR
> domain.

No. We fix it proper and not by hacking around it.

> I suppose it *might* not hurt to pretend that VECTOR does indeed have
> the limit, and to *remove* it in the remapping domain. And then the
> affinity limit could be removed in one place by the REMAP domain
> because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
> MSI and IR-IOAPIC domains *are* children of that.

It's not rocket science to fix that as the irq remapping code already
differentiates between the device types.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 14:23               ` David Woodhouse
@ 2020-10-07 16:02                 ` Thomas Gleixner
  2020-10-07 16:15                   ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 16:02 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>> > > > To fix *that* case, we really do need the whole series giving us per-
>> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
>> > > > the IRQ remapping doesn't cover.
>> > > 
>> > > Which do not exist today.
>> > 
>> > Sure. But at patch 10/13 into this particular patch series, it *does*
>> > exist.
>> 
>> As I told you before: Your ordering is wrong. We do not introduce bugs
>> first and then fix them later ....
>
> I didn't introduce that bug; it's been there for years. Fixing it
> properly requires per-irqdomain affinity limits.
>
> There's a cute little TODO at least in the Intel irq-remapping driver,
> noting that we should probably check if there are any IOAPICs that
> aren't in the scope of any DRHD at all. But that's all.

So someone forgot to remove the cute little TODO when this was added:

       if (parse_ioapics_under_ir()) {
                pr_info("Not enabling interrupt remapping\n");
                goto error;
        }

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07 15:57             ` Thomas Gleixner
@ 2020-10-07 16:11               ` David Woodhouse
  2020-10-07 20:53                 ` Thomas Gleixner
  2020-10-08  7:21               ` David Woodhouse
  1 sibling, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 16:11 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini



On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>>> What is preventing you to change the function signature? But handing
>>> down irqdomain here is not cutting it. The right thing to do is to
>>> replace 'struct irq_affinity_desc *affinity' with something more
>>> flexible.
>>
>> Yeah, although I do think I want to ditch this part completely, and
>> treat the "possible" mask for the domain very much more like we do
>> cpu_online_mask. In that we can allow affinities to be *requested*
>> which are outside it, and they can just never be *effective* while
>> those CPUs aren't present and reachable.
>
>Huch?
>
>> That way a lot of the nastiness about preparing an "acceptable" mask
>in
>> advance can go away.
>
>There is not lot's of nastiness.

OK, but I think we do have to cope with the fact that the limit is dynamic, and a CPU might be added which widens the mask. I think that's fundamental and not x86-specific.

>> It's *also* driven, as I noted, by the realisation that on x86, the
>> x86_non_ir_cpumask will only ever contain those CPUs which have been
>> brought online so far and meet the criteria... but won't that be true
>> on *any* system where CPU numbers are virtual and not 1:1 mapped with
>> whatever determines reachability by the IRQ domain? It isn't *such*
>an
>> x86ism, surely?
>
>Yes, but that's exactly the reason why I don't want to have whatever
>mask name you chose to be directly exposed and want it to be part of
>the
>irq domains because then you can express arbitrary per domain limits.

The x86_non_ir_mask isn't intended to be directly exposed to any generic IRQ code.

It's set up by the x86 APIC code so that those x86 IRQ domains which are affected by it, can set it with irqdomain_set_max_affinity() for the generic code to see. Without each having to create the cpumask from scratch for themselves.

> ... reading for once the kids are elsewhere...

Thanks.

>It's not rocket science to fix that as the irq remapping code already
>differentiates between the device types.

I don't actually like that very much. The IRQ remapping code could just compose the MSI messages that it desires without really having to care so much about the child device. The bit where IRQ remapping actually composes IOAPIC RTEs makes me particularly sad.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 16:02                 ` Thomas Gleixner
@ 2020-10-07 16:15                   ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 16:15 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini



On 7 October 2020 17:02:59 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner
><tglx@linutronix.de> wrote:
>>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> > > > To fix *that* case, we really do need the whole series giving
>us per-
>>> > > > domain restricted affinity, and to use it for those
>MSIs/IOAPICs that
>>> > > > the IRQ remapping doesn't cover.
>>> > > 
>>> > > Which do not exist today.
>>> > 
>>> > Sure. But at patch 10/13 into this particular patch series, it
>*does*
>>> > exist.
>>> 
>>> As I told you before: Your ordering is wrong. We do not introduce
>bugs
>>> first and then fix them later ....
>>
>> I didn't introduce that bug; it's been there for years. Fixing it
>> properly requires per-irqdomain affinity limits.
>>
>> There's a cute little TODO at least in the Intel irq-remapping
>driver,
>> noting that we should probably check if there are any IOAPICs that
>> aren't in the scope of any DRHD at all. But that's all.
>
>So someone forgot to remove the cute little TODO when this was added:
>
>       if (parse_ioapics_under_ir()) {
>                pr_info("Not enabling interrupt remapping\n");
>                goto error;
>        }

And HPET, and PCI devices including those that might be hotplugged in future and not be covered by any extant IOMMU's scope?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 15:46                   ` David Woodhouse
@ 2020-10-07 17:23                     ` Thomas Gleixner
  2020-10-07 17:34                       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 17:23 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 16:46, David Woodhouse wrote:
> The PCI MSI domain, HPET, and even the IOAPIC are just the things out
> there on the bus which might perform those physical address cycles. And
> yes, as you say they're just a message store sending exactly the
> message that was composed for them. They know absolutely nothing about
> what the message means and how it is composed.

That's what I said.

> It so happens that in Linux, we don't really architect the software
> like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> *own* message composer which has the same limits and composes basically
> the same messages as if it was *their* format, not dictated to them by
> the APIC upstream. And that's what we're both getting our panties in a
> knot about, I think.

Are you actually reading what I write and caring to look at the code?

PCI-MSI does not have a compose message callback in the irq chip. The
message is composed by the underlying parent domain. Same for HPET.

The only dogdy part is the IO/APIC for hysterical raisins and because
I did not come around yet to sort that out.

> It really doesn't matter that much to the underlying generic irqdomain
> support for limited affinities. Except that you want to make the
> generic code support the concept of a child domain supporting *more*
> CPUs than its parent, which really doesn't make much sense if you think
> about it.

Right. So we really want to stick the restriction into a compat-MSI
domain to make stuff match reality and to avoid banging the head against
the wall sooner than later.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  2020-10-07 17:23                     ` Thomas Gleixner
@ 2020-10-07 17:34                       ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-07 17:34 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1790 bytes --]

On Wed, 2020-10-07 at 19:23 +0200, Thomas Gleixner wrote:
> > It so happens that in Linux, we don't really architect the software
> > like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> > *own* message composer which has the same limits and composes basically
> > the same messages as if it was *their* format, not dictated to them by
> > the APIC upstream. And that's what we're both getting our panties in a
> > knot about, I think.
> 
> Are you actually reading what I write and caring to look at the code?
> 
> PCI-MSI does not have a compose message callback in the irq chip. The
> message is composed by the underlying parent domain. Same for HPET.
> 
> The only dogdy part is the IO/APIC for hysterical raisins and because
> I did not come around yet to sort that out.

Right. And the problem is that the "underlying parent domain" in this
case is actually x86_vector_domain. Whereas it probably makes more
sense, at least in theory, for there to be an *intermediate* domain
responsible for composing the Compat MSI messages.

Both the Compat-MSI and the IR-MSI domains would be children of the
generic x86_vector_domain, and then any given HPET, IOAPIC or PCI-MSI
domain would be a child of either one of those generic MSI domains.

> > It really doesn't matter that much to the underlying generic irqdomain
> > support for limited affinities. Except that you want to make the
> > generic code support the concept of a child domain supporting *more*
> > CPUs than its parent, which really doesn't make much sense if you think
> > about it.
> 
> Right. So we really want to stick the restriction into a compat-MSI
> domain to make stuff match reality and to avoid banging the head against
> the wall sooner than later.

Right.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07 16:11               ` David Woodhouse
@ 2020-10-07 20:53                 ` Thomas Gleixner
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-07 20:53 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Wed, Oct 07 2020 at 17:11, David Woodhouse wrote:
> On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>There is not lot's of nastiness.
>
> OK, but I think we do have to cope with the fact that the limit is
> dynamic, and a CPU might be added which widens the mask. I think
> that's fundamental and not x86-specific.

Yes, but it needs some thoughts vs. serialization against CPU hotplug.

The stability of that cpumask is architecture specific if the calling
context cannot serialize against CPU hotplug. The hot-unplug case is
less interesting because the mask does not shrink, it only get's wider.

That's why I want a callback instead of a pointer assignment and that
callback cannot return a pointer to something which can be modified
concurrently, it needs to update a caller provided mask so that the
result is at least consistent in itself.

It just occured to me that there is something which needs some more
thought vs. CPU hotunplug as well:

  Right now we just check whether there are enough vectors available on
  the remaining online CPUs so that the interrupts which have an
  effective affinity directed to the outgoing CPU can be migrated away
  (modulo the managed ones).

  That check obviously needs to take the target CPU restrictions into
  account to prevent that devices end up with no target at the end.

I bet there are some other interesting bits and pieces which need
some care...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-07 15:57             ` Thomas Gleixner
  2020-10-07 16:11               ` David Woodhouse
@ 2020-10-08  7:21               ` David Woodhouse
  2020-10-08  9:34                 ` Thomas Gleixner
  1 sibling, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-08  7:21 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 2997 bytes --]

On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
> 
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
> 
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
> 
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
> 
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> disfunctional queues.
> 
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned to point to an
offline CPU(s), but it cannot actually be used until/unless at least
one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.

The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want carefully create the affinity masks at setup time so that
they don't include any unreachable CPUs... that basically means we
don't include any non-present CPUs at all (unless they've been added
once and then removed).

That's because — at least on x86 — we don't assign CPU numbers to CPUs
which haven't yet been added. Theoretically we *could* but if there are
more than NR_CPUS listed in (e.g.) the MADT then we could run out of
CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't*
have space for, we'd not be able to handle it.

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so some not-yet-assigned CPU#s are for low APICIDs and some
are for high APICIDs but it doesn't actually have to be a 1:1
predetermined mapping.

But those really do seem like hacks which might only apply on x86,
while the generic approach of treating "reachable" like "online" seems
like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but does include absent CPUs in a mask that *if* they are ever
added, wouldn't be able to receive the IRQ. Which does mean we'd have
to refrain from bringing up the corresponding queue. 

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-08  7:21               ` David Woodhouse
@ 2020-10-08  9:34                 ` Thomas Gleixner
  2020-10-08 11:10                   ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-08  9:34 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>> 
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>> 
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>> 
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> disfunctional queues.
>> 
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want carefully create the affinity masks at setup time so that
> they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue. 

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queues assigned affinity mask. As
of today it also guarantees that these CPUs are reachable once they come
online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not it has to redirect any requests to
    some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
    will shutdown because the outgoing CPU is the last reachable in the
    group and if there are still online but unreachable CPUs then use
    the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

      - added support in multi-queue devices which utilize managed
        interrupts
        
      - audited the whole tree for other assumptions related to the
        reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires to have either the domain affinity limitation for HPET
    in place or just to force disable HPET or at least HPET-MSI which is
    a reasonable tradeoff.

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-08  9:34                 ` Thomas Gleixner
@ 2020-10-08 11:10                   ` David Woodhouse
  2020-10-08 12:40                     ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: David Woodhouse @ 2020-10-08 11:10 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1776 bytes --]

On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
> The overall conclusion for this is:
> 
>  1) X2APIC support on bare metal w/o irq remapping is not going to
>     happen unless you:
> 
>       - added support in multi-queue devices which utilize managed
>         interrupts
>         
>       - audited the whole tree for other assumptions related to the
>         reachability of possible CPUs.
> 
>     I'm not expecting you to be done with that before I retire so for
>     me it's just not going to happen :)

Makes sense. It probably does mean we should a BUG_ON for the case
where IRQ remapping *is* enabled but any device is found which isn't
behind it. But that's OK.

>  2) X2APIC support on VIRT is possible if the extended ID magic is
>     supported by the hypervisor because that does not make any CPU
>     unreachable for MSI and therefore the multi-queue muck and
>     everything else just works.
> 
>     This requires to have either the domain affinity limitation for HPET
>     in place or just to force disable HPET or at least HPET-MSI which is
>     a reasonable tradeoff.
> 
>     HPET is not required for guests which have kvmclock and
>     APIC/deadline timer and known (hypervisor provided) frequencies.

HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
*actual* MSI domain. The address/data in the MSI message are completely
opaque to it, and if the parent domain happens to put meaningful
information into bits 11-5 of the MSI address, the HPET won't even
notice.

The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
address; it's not doing bit-swizzling like the IOAPIC does, which might
potentially *not* have been able to set certain bits in the MSI.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-08 11:10                   ` David Woodhouse
@ 2020-10-08 12:40                     ` Thomas Gleixner
  2020-10-09  7:54                       ` David Woodhouse
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2020-10-08 12:40 UTC (permalink / raw)
  To: David Woodhouse, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote:
> On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
>> The overall conclusion for this is:
>> 
>>  1) X2APIC support on bare metal w/o irq remapping is not going to
>>     happen unless you:
>> 
>>       - added support in multi-queue devices which utilize managed
>>         interrupts
>>         
>>       - audited the whole tree for other assumptions related to the
>>         reachability of possible CPUs.
>> 
>>     I'm not expecting you to be done with that before I retire so for
>>     me it's just not going to happen :)
>
> Makes sense. It probably does mean we should a BUG_ON for the case
> where IRQ remapping *is* enabled but any device is found which isn't
> behind it. But that's OK.

We can kinda gracefully handle that. See the completely untested and
incomplete patch below.

>>  2) X2APIC support on VIRT is possible if the extended ID magic is
>>     supported by the hypervisor because that does not make any CPU
>>     unreachable for MSI and therefore the multi-queue muck and
>>     everything else just works.
>> 
>>     This requires to have either the domain affinity limitation for HPET
>>     in place or just to force disable HPET or at least HPET-MSI which is
>>     a reasonable tradeoff.
>> 
>>     HPET is not required for guests which have kvmclock and
>>     APIC/deadline timer and known (hypervisor provided) frequencies.
>
> HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
> *actual* MSI domain. The address/data in the MSI message are completely
> opaque to it, and if the parent domain happens to put meaningful
> information into bits 11-5 of the MSI address, the HPET won't even
> notice.
>
> The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
> address; it's not doing bit-swizzling like the IOAPIC does, which might
> potentially *not* have been able to set certain bits in the MSI.

Indeed. I thought it was crippled in some way, but you're right it has
all the bits.

Thanks,

        tglx
---
Subject: x86/iommu: Make interrupt remapping more robust
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 08 Oct 2020 14:09:44 +0200

Needs to be split into pieces and cover PCI proper. Right now PCI gets a
NULL pointer assigned which makes it explode at the wrong place
later. Also hyperv iommu wants some love.

NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/io_apic.c      |    4 +++-
 arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
 drivers/iommu/amd/iommu.c           |    6 +++---
 drivers/iommu/intel/irq_remapping.c |    4 ++--
 4 files changed, 22 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
 	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
 	info.devid = mpc_ioapic_id(ioapic);
 	parent = irq_remapping_get_irq_domain(&info);
-	if (!parent)
+	if (IS_ERR(parent))
+		return PTR_ERR(parent);
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		name = "IO-APIC-IR";
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
 struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
 	struct msi_domain_info *domain_info;
+	struct fwnode_handle *fn = NULL;
 	struct irq_domain *parent, *d;
 	struct irq_alloc_info info;
-	struct fwnode_handle *fn;
 
 	if (x86_vector_domain == NULL)
 		return NULL;
@@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
 	init_irq_alloc_info(&info, NULL);
 	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
 	info.devid = hpet_id;
+
 	parent = irq_remapping_get_irq_domain(&info);
-	if (parent == NULL)
+	if (IS_ERR(parent))
+		goto fail;
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		hpet_msi_controller.name = "IR-HPET-MSI";
 
 	fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
 					      hpet_id);
-	if (!fn) {
-		kfree(domain_info);
-		return NULL;
-	}
+	if (!fn)
+		goto fail;
 
 	d = msi_create_irq_domain(fn, domain_info, parent);
-	if (!d) {
-		irq_domain_free_fwnode(fn);
-		kfree(domain_info);
-	}
+	if (!d)
+		goto fail;
 	return d;
+
+fail:
+	irq_domain_free_fwnode(fn);
+	kfree(domain_info);
+	return NULL;
 }
 
 int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain
 	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
 
 	if (!iommu)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
@@ -3565,7 +3565,7 @@ static struct irq_domain *get_irq_domain
 		return iommu->ir_domain;
 	default:
 		WARN_ON_ONCE(1);
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	}
 }
 
@@ -3578,7 +3578,7 @@ static struct irq_domain *get_irq_domain
 
 	devid = get_devid(info);
 	if (devid < 0)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	return get_irq_domain_for_devid(info, devid);
 }
 
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -212,7 +212,7 @@ static struct irq_domain *map_hpet_to_ir
 		if (ir_hpet[i].id == hpet_id && ir_hpet[i].iommu)
 			return ir_hpet[i].iommu->ir_domain;
 	}
-	return NULL;
+	return ERR_PTR(-ENODEV);
 }
 
 static struct intel_iommu *map_ioapic_to_iommu(int apic)
@@ -230,7 +230,7 @@ static struct irq_domain *map_ioapic_to_
 {
 	struct intel_iommu *iommu = map_ioapic_to_iommu(apic);
 
-	return iommu ? iommu->ir_domain : NULL;
+	return iommu ? iommu->ir_domain : ERR_PTR(-ENODEV);
 }
 
 static struct irq_domain *map_dev_to_ir(struct pci_dev *dev)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  2020-10-08 12:40                     ` Thomas Gleixner
@ 2020-10-09  7:54                       ` David Woodhouse
  0 siblings, 0 replies; 51+ messages in thread
From: David Woodhouse @ 2020-10-09  7:54 UTC (permalink / raw)
  To: Thomas Gleixner, x86; +Cc: iommu, kvm, linux-hyperv, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 3735 bytes --]

On Thu, 2020-10-08 at 14:40 +0200, Thomas Gleixner wrote:
> Subject: x86/iommu: Make interrupt remapping more robust
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 08 Oct 2020 14:09:44 +0200
> 
> Needs to be split into pieces and cover PCI proper. Right now PCI gets a
> NULL pointer assigned which makes it explode at the wrong place
> later. Also hyperv iommu wants some love.
> 
> NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/apic/io_apic.c      |    4 +++-
>  arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
>  drivers/iommu/amd/iommu.c           |    6 +++---
>  drivers/iommu/intel/irq_remapping.c |    4 ++--
>  4 files changed, 22 insertions(+), 16 deletions(-)
> 
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
>         info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
>         info.devid = mpc_ioapic_id(ioapic);
>         parent = irq_remapping_get_irq_domain(&info);
> -       if (!parent)
> +       if (IS_ERR(parent))
> +               return PTR_ERR(parent);
> +       else if (!parent)
>                 parent = x86_vector_domain;
>         else
>                 name = "IO-APIC-IR";
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
>  struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  {
>         struct msi_domain_info *domain_info;
> +       struct fwnode_handle *fn = NULL;
>         struct irq_domain *parent, *d;
>         struct irq_alloc_info info;
> -       struct fwnode_handle *fn;
>  
>         if (x86_vector_domain == NULL)
>                 return NULL;
> @@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
>         init_irq_alloc_info(&info, NULL);
>         info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
>         info.devid = hpet_id;
> +
>         parent = irq_remapping_get_irq_domain(&info);
> -       if (parent == NULL)
> +       if (IS_ERR(parent))
> +               goto fail;
> +       else if (!parent)
>                 parent = x86_vector_domain;
>         else
>                 hpet_msi_controller.name = "IR-HPET-MSI";

Hrm... this is the '-ENODEV changes' that you wanted me to pick up,
right?

I confess I don't like it very much.

I'd much prefer to be able to use a generic foo_get_parent_irq_domain()
function which just returns an answer or an error.

Definitely none of this "if it returns NULL, caller fills it in for
themselves with some magic global because we're special".

And I don't much like abusing the irq_alloc_info for this either. I
didn't like the irq_alloc_info very much in its *original* use cases,
for actually allocating IRQs. Abusing it for this is worse.

I'm more than happy to start digging into this, but once I do I fear
I'm not going to stop until HPET and IOAPIC don't know *anything* about
whether they're behind IR or not.

And yes, I know that IOAPIC has a different EOI method for IR but
AFAICT that's *already* hosed for Hyper-V because it doesn't *really*
do remapping and so the RTEs *can* change there to move interrupts
around. There's more differences between ioapic_ack_level() and
ioapic_ir_ack_level() than the erratum workaround and whether we use
data->entry.vector... aren't there?

For the specific case you were targeting with this patch, you already
successfully persuaded me it should never happen, and there's a WARN_ON
which would trigger if it ever does (with too many CPUs in the system).

Let's come back and tackle this can of worms in a separate cleanup
series.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-10-09  7:54 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-05 15:28 [PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping David Woodhouse
2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
2020-10-06 20:45     ` Thomas Gleixner
2020-10-05 15:28   ` [PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE David Woodhouse
2020-10-05 15:28   ` [PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available David Woodhouse
2020-10-05 15:28   ` [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs() David Woodhouse
2020-10-06 21:01     ` Thomas Gleixner
2020-10-06 21:07       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 06/13] genirq: Add default_affinity argument " David Woodhouse
2020-10-06 21:06     ` Thomas Gleixner
2020-10-05 15:28   ` [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs() David Woodhouse
2020-10-06 21:26     ` Thomas Gleixner
2020-10-07  7:19       ` David Woodhouse
2020-10-07 13:37         ` Thomas Gleixner
2020-10-07 14:10           ` David Woodhouse
2020-10-07 15:57             ` Thomas Gleixner
2020-10-07 16:11               ` David Woodhouse
2020-10-07 20:53                 ` Thomas Gleixner
2020-10-08  7:21               ` David Woodhouse
2020-10-08  9:34                 ` Thomas Gleixner
2020-10-08 11:10                   ` David Woodhouse
2020-10-08 12:40                     ` Thomas Gleixner
2020-10-09  7:54                       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 08/13] genirq: Add irq_domain_set_affinity() David Woodhouse
2020-10-06 21:32     ` Thomas Gleixner
2020-10-07  7:22       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask David Woodhouse
2020-10-06 21:42     ` Thomas Gleixner
2020-10-07  7:25       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR David Woodhouse
2020-10-06 21:54     ` Thomas Gleixner
2020-10-07  7:48       ` David Woodhouse
2020-10-07 12:59         ` Thomas Gleixner
2020-10-07 13:08           ` David Woodhouse
2020-10-07 14:05             ` Thomas Gleixner
2020-10-07 14:23               ` David Woodhouse
2020-10-07 16:02                 ` Thomas Gleixner
2020-10-07 16:15                   ` David Woodhouse
2020-10-07 15:05               ` David Woodhouse
2020-10-07 15:25                 ` Thomas Gleixner
2020-10-07 15:46                   ` David Woodhouse
2020-10-07 17:23                     ` Thomas Gleixner
2020-10-07 17:34                       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping David Woodhouse
2020-10-05 15:28   ` [PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant David Woodhouse
2020-10-05 15:28   ` [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID David Woodhouse
2020-10-07  8:14     ` Paolo Bonzini
2020-10-07  8:59       ` David Woodhouse
2020-10-07 11:15         ` Paolo Bonzini
2020-10-07 12:04           ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).