Linux currently refuses to use more than 255 CPUs on x86 unless it has interrupt remapping. This is somewhat gratuitous, because it could use those extra CPUs just fine; it simply can't target external interrupts at them. The only real problem is that the generic IRQ domain code can't cope with the concept of domains which can only target a subset of CPUs.

The hyperv-iommu IRQ remapping driver works around this, not by actually doing any remapping, but by returning -EINVAL if the affinity is ever set to an unreachable CPU. This almost works, but the check happens too late: irq_set_affinity_locked() doesn't call into the irqchip driver immediately, so the error only surfaces later.

This patch series implements a per-domain "maximum affinity" set and uses it for the non-remapped IOAPIC and MSI domains on x86. As well as allowing more CPUs to be used without interrupt remapping, this also fixes the case where some IOAPICs or PCI devices aren't actually in scope of any active IOMMU and are operating without remapping.

While we're at it, recognise that the 8-bit limit is itself somewhat arbitrary: a hypervisor could offer at least 15 bits of APIC ID in the IOAPIC RTE and in MSI address bits 11-5 without even needing to use remapping.

David Woodhouse (13):
  x86/apic: Use x2apic in guest kernels even with unusable CPUs.
  x86/msi: Only use high bits of MSI address for DMAR unit
  x86/ioapic: Handle Extended Destination ID field in RTE
  x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
  genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
  genirq: Add default_affinity argument to __irq_alloc_descs()
  irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  genirq: Add irq_domain_set_affinity()
  x86/irq: Add x86_non_ir_cpumask
  x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  x86/smp: Allow more than 255 CPUs even without interrupt remapping
  iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant
  x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

 Documentation/virt/kvm/cpuid.rst | 4 +
 arch/x86/include/asm/apic.h | 1 +
 arch/x86/include/asm/io_apic.h | 3 +-
 arch/x86/include/asm/mpspec.h | 2 +
 arch/x86/include/asm/x86_init.h | 2 +
 arch/x86/include/uapi/asm/kvm_para.h | 1 +
 arch/x86/kernel/apic/apic.c | 41 +++++++++-
 arch/x86/kernel/apic/io_apic.c | 23 ++++--
 arch/x86/kernel/apic/msi.c | 44 +++++++++--
 arch/x86/kernel/apic/x2apic_phys.c | 9 +++
 arch/x86/kernel/kvm.c | 6 ++
 arch/x86/kernel/x86_init.c | 1 +
 drivers/iommu/hyperv-iommu.c | 149 +----------------------------------
 include/linux/interrupt.h | 2 +
 include/linux/irq.h | 10 ++-
 include/linux/irqdomain.h | 7 +-
 kernel/irq/devres.c | 8 +-
 kernel/irq/ipi.c | 2 +-
 kernel/irq/irqdesc.c | 29 ++++---
 kernel/irq/irqdomain.c | 69 ++++++++++++++--
 kernel/irq/manage.c | 19 ++++-
 21 files changed, 240 insertions(+), 192 deletions(-)
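For readers following along, the 15-bit destination split described in the cover letter can be sketched in a few lines of standalone C. This is an illustration, not code from the series; the helper names are invented here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch (not from the series) of the 15-bit destination
 * encoding: APIC ID bits 7:0 travel in the classic 8-bit destination
 * field, and bits 14:8 in the 7 usable Extended Destination ID bits
 * (the lowest EDID bit being reserved to indicate remappable format).
 */
static inline uint8_t dest_low8(uint32_t apicid)
{
	return apicid & 0xff;
}

static inline uint8_t ext_dest_high7(uint32_t apicid)
{
	return (apicid >> 8) & 0x7f;
}

static inline uint32_t dest_recombine(uint8_t dest, uint8_t ext_dest)
{
	return ((uint32_t)ext_dest << 8) | dest;
}
```

With 7 extra bits, 2^15 = 32768 distinct APIC IDs become addressable, which is where the vCPU count quoted in the series comes from.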
From: David Woodhouse <dwmw@amazon.co.uk>

If the BIOS has enabled x2apic mode, then leave it enabled and don't fall back to xapic just because some CPUs can't be targeted.

Previously, if there were CPUs with APIC IDs > 255, the kernel would disable x2apic and fall back to xapic, and then not use the additional CPUs anyway, since xapic can't target them either. In fact, xapic mode can target even *fewer* CPUs, since it can't use the one with APIC ID 255 either. Using x2apic instead gives us at least one more CPU without needing interrupt remapping in the guest.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/asm/apic.h | 1 + arch/x86/kernel/apic/apic.c | 18 ++++++++++++++---- arch/x86/kernel/apic/x2apic_phys.c | 9 +++++++++ 3 files changed, 24 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 1c129abb7f09..b0fd204e0023 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -259,6 +259,7 @@ static inline u64 native_x2apic_icr_read(void) extern int x2apic_mode; extern int x2apic_phys; +extern void __init x2apic_set_max_apicid(u32 apicid); extern void __init check_x2apic(void); extern void x2apic_setup(void); static inline int x2apic_enabled(void) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index b3eef1d5c903..a75767052a92 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -1841,14 +1841,24 @@ static __init void try_to_enable_x2apic(int remap_mode) return; if (remap_mode != IRQ_REMAP_X2APIC_MODE) { - /* IR is required if there is APIC ID > 255 even when running - * under KVM + /* + * If there are APIC IDs above 255, they cannot be used + * without interrupt remapping. In a virtual machine, but + * not on bare metal, X2APIC can be used anyway. In the + * case where BIOS has enabled X2APIC, don't disable it + * just because there are APIC IDs that can't be used.
+ * They couldn't be used if the kernel falls back to XAPIC + * anyway. */ if (max_physical_apicid > 255 || !x86_init.hyper.x2apic_available()) { pr_info("x2apic: IRQ remapping doesn't support X2APIC mode\n"); - x2apic_disable(); - return; + if (!x2apic_mode) { + x2apic_disable(); + return; + } + + x2apic_set_max_apicid(255); } /* diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c index bc9693841353..b4b4e89c1118 100644 --- a/arch/x86/kernel/apic/x2apic_phys.c +++ b/arch/x86/kernel/apic/x2apic_phys.c @@ -8,6 +8,12 @@ int x2apic_phys; static struct apic apic_x2apic_phys; +static u32 x2apic_max_apicid; + +void __init x2apic_set_max_apicid(u32 apicid) +{ + x2apic_max_apicid = apicid; +} static int __init set_x2apic_phys_mode(char *arg) { @@ -98,6 +104,9 @@ static int x2apic_phys_probe(void) /* Common x2apic functions, also used by x2apic_cluster */ int x2apic_apic_id_valid(u32 apicid) { + if (x2apic_max_apicid && apicid > x2apic_max_apicid) + return 0; + return 1; } -- 2.26.2
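The net effect of this patch is easiest to see as a standalone sketch of the validity check. This is a toy model, not the kernel function: the limit is taken as an explicit parameter here, where the patch keeps it in a static variable set once at boot.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the patched x2apic_apic_id_valid() logic: a limit of 0
 * means "no limit"; otherwise APIC IDs above the limit are rejected,
 * e.g. IDs above 255 when x2apic stays enabled without usable IR.
 */
static int apic_id_valid(uint32_t apicid, uint32_t max_apicid)
{
	if (max_apicid && apicid > max_apicid)
		return 0;

	return 1;
}
```

So with the limit set to 255, CPUs 0-255 remain usable (one more than xapic, which cannot use ID 255), and higher IDs are simply not brought up.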
From: David Woodhouse <dwmw@amazon.co.uk> The Intel IOMMU has an MSI-like configuration for its interrupt, but it isn't really MSI. So it gets to abuse the high 32 bits of the address, and puts the high 24 bits of the extended APIC ID there. This isn't something that can be used in the general case for real MSIs, since external devices using the high bits of the address would be performing writes to actual memory space above 4GiB, not targeted at the APIC. Factor the hack out and allow it only to be used when appropriate, adding a WARN_ON_ONCE() if other MSIs are targeted at an unreachable APIC ID. That should never happen since the legacy MSI messages are not supposed to be used with Interrupt Remapping enabled. The x2apic_enabled() check isn't needed because we won't bring up CPUs with higher APIC IDs unless x2apic is enabled anyway. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/kernel/apic/msi.c | 34 ++++++++++++++++++++++++++++------ 1 file changed, 28 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c index 6313f0a05db7..356f8acf4927 100644 --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -23,13 +23,10 @@ struct irq_domain *x86_pci_msi_default_domain __ro_after_init; -static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg) +static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar) { msg->address_hi = MSI_ADDR_BASE_HI; - if (x2apic_enabled()) - msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid); - msg->address_lo = MSI_ADDR_BASE_LO | ((apic->irq_dest_mode == 0) ? @@ -43,18 +40,42 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg) MSI_DATA_LEVEL_ASSERT | MSI_DATA_DELIVERY_FIXED | MSI_DATA_VECTOR(cfg->vector); + + /* + * Only the IOMMU itself can use the trick of putting destination + * APIC ID into the high bits of the address. 
Anything else would + just be writing to memory if it tried that, and needs IR to + address APICs above 255. + */ + if (dmar) + msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid); + else + WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid)); } void x86_vector_msi_compose_msg(struct irq_data *data, struct msi_msg *msg) { - __irq_msi_compose_msg(irqd_cfg(data), msg); + __irq_msi_compose_msg(irqd_cfg(data), msg, 0); } +/* + * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the + * high bits of the destination APIC ID. This can't be done in the general + * case for MSIs as it would be targeting real memory above 4GiB not the + * APIC. + */ +static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg) +{ + __irq_msi_compose_msg(irqd_cfg(data), msg, 1); +} static void irq_msi_update_msg(struct irq_data *irqd, struct irq_cfg *cfg) { struct msi_msg msg[2] = { [1] = { }, }; - __irq_msi_compose_msg(cfg, msg); + __irq_msi_compose_msg(cfg, msg, 0); irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg); } @@ -288,6 +309,7 @@ static struct irq_chip dmar_msi_controller = { .irq_ack = irq_chip_ack_parent, .irq_set_affinity = msi_domain_set_affinity, .irq_retrigger = irq_chip_retrigger_hierarchy, + .irq_compose_msi_msg = dmar_msi_compose_msg, .irq_write_msi_msg = dmar_msi_write_msg, .flags = IRQCHIP_SKIP_SET_WAKE, }; -- 2.26.2
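As a standalone illustration of the trick being fenced off here: the MSI_ADDR_EXT_DEST_ID macro below matches the x86 msidef.h definition at the time, but the wrapper function is a toy model, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE_HI		0
#define MSI_ADDR_EXT_DEST_ID(dest)	((dest) & 0xffffff00)

/*
 * Toy model: only the DMAR unit may carry APIC ID bits 31:8 in the
 * high dword of the MSI "address". For a real device, a non-zero
 * address_hi would be a plain DMA write above 4GiB, not an interrupt.
 */
static uint32_t compose_address_hi(uint32_t dest_apicid, int dmar)
{
	uint32_t address_hi = MSI_ADDR_BASE_HI;

	if (dmar)
		address_hi |= MSI_ADDR_EXT_DEST_ID(dest_apicid);
	/* else the patch does WARN_ON_ONCE() if any high bits are set */

	return address_hi;
}
```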
From: David Woodhouse <dwmw@amazon.co.uk> The IOAPIC Redirection Table Entries contain an 8-bit Extended Destination ID field which maps to bits 11-4 of the MSI address. The lowest bit is used to indicate remappable format, when interrupt remapping is in use. A hypervisor can use the other 7 bits to permit guests to address up to 15 bits of APIC IDs, thus allowing 32768 vCPUs before having to expose a vIOMMU and interrupt remapping to the guest. No behavioural change in this patch, since nothing yet permits APIC IDs above 255 to be used with the non-IR IOAPIC domain. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/include/asm/io_apic.h | 3 ++- arch/x86/kernel/apic/io_apic.c | 19 +++++++++++++------ 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h index a1a26f6d3aa4..e65a0b7379d0 100644 --- a/arch/x86/include/asm/io_apic.h +++ b/arch/x86/include/asm/io_apic.h @@ -78,7 +78,8 @@ struct IO_APIC_route_entry { mask : 1, /* 0: enabled, 1: disabled */ __reserved_2 : 15; - __u32 __reserved_3 : 24, + __u32 __reserved_3 : 17, + ext_dest : 7, dest : 8; } __attribute__ ((packed)); diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index a33380059db6..aa9a3b54a96c 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -1239,10 +1239,10 @@ static void io_apic_print_entries(unsigned int apic, unsigned int nr_entries) buf, (ir_entry->index2 << 15) | ir_entry->index, ir_entry->zero); else - printk(KERN_DEBUG "%s, %s, D(%02X), M(%1d)\n", + printk(KERN_DEBUG "%s, %s, D(%02X%02X), M(%1d)\n", buf, entry.dest_mode == IOAPIC_DEST_MODE_LOGICAL ? 
- "logical " : "physical", + "logical " : "physical", entry.ext_dest, entry.dest, entry.delivery_mode); } } @@ -1410,6 +1410,7 @@ void native_restore_boot_irq_mode(void) */ if (ioapic_i8259.pin != -1) { struct IO_APIC_route_entry entry; + u32 apic_id = read_apic_id(); memset(&entry, 0, sizeof(entry)); entry.mask = IOAPIC_UNMASKED; @@ -1417,7 +1418,8 @@ void native_restore_boot_irq_mode(void) entry.polarity = IOAPIC_POL_HIGH; entry.dest_mode = IOAPIC_DEST_MODE_PHYSICAL; entry.delivery_mode = dest_ExtINT; - entry.dest = read_apic_id(); + entry.dest = apic_id & 0xff; + entry.ext_dest = apic_id >> 8; /* * Add it to the IO-APIC irq-routing table: @@ -1861,7 +1863,8 @@ static void ioapic_configure_entry(struct irq_data *irqd) * ioapic chip to verify that. */ if (irqd->chip == &ioapic_chip) { - mpd->entry.dest = cfg->dest_apicid; + mpd->entry.dest = cfg->dest_apicid & 0xff; + mpd->entry.ext_dest = cfg->dest_apicid >> 8; mpd->entry.vector = cfg->vector; } for_each_irq_pin(entry, mpd->irq_2_pin) @@ -2027,6 +2030,7 @@ static inline void __init unlock_ExtINT_logic(void) int apic, pin, i; struct IO_APIC_route_entry entry0, entry1; unsigned char save_control, save_freq_select; + u32 apic_id; pin = find_isa_irq_pin(8, mp_INT); if (pin == -1) { @@ -2042,11 +2046,13 @@ static inline void __init unlock_ExtINT_logic(void) entry0 = ioapic_read_entry(apic, pin); clear_IO_APIC_pin(apic, pin); + apic_id = hard_smp_processor_id(); memset(&entry1, 0, sizeof(entry1)); entry1.dest_mode = IOAPIC_DEST_MODE_PHYSICAL; entry1.mask = IOAPIC_UNMASKED; - entry1.dest = hard_smp_processor_id(); + entry1.dest = apic_id & 0xff; + entry1.ext_dest = apic_id >> 8; entry1.delivery_mode = dest_ExtINT; entry1.polarity = entry0.polarity; entry1.trigger = IOAPIC_EDGE; @@ -2949,7 +2955,8 @@ static void mp_setup_entry(struct irq_cfg *cfg, struct mp_chip_data *data, memset(entry, 0, sizeof(*entry)); entry->delivery_mode = apic->irq_delivery_mode; entry->dest_mode = apic->irq_dest_mode; - entry->dest = 
cfg->dest_apicid; + entry->dest = cfg->dest_apicid & 0xff; + entry->ext_dest = cfg->dest_apicid >> 8; entry->vector = cfg->vector; entry->trigger = data->trigger; entry->polarity = data->polarity; -- 2.26.2
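The dest/ext_dest split repeated at each site in the patch can be sketched as follows. The struct is illustrative only; it mirrors the RTE bitfield layout from the patch rather than reproducing it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the RTE destination split added by the patch:
 * dest holds APIC ID bits 7:0 (RTE bits 63:56) and ext_dest holds
 * bits 14:8 (the 7-bit Extended Destination ID, RTE bits 55:49).
 */
struct rte_dest {
	uint8_t dest;
	uint8_t ext_dest;
};

static struct rte_dest rte_encode_dest(uint32_t dest_apicid)
{
	struct rte_dest d = {
		.dest = dest_apicid & 0xff,
		.ext_dest = dest_apicid >> 8,	/* <= 0x7f for IDs below 32768 */
	};

	return d;
}
```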
From: David Woodhouse <dwmw@amazon.co.uk> Some hypervisors can allow the guest to use the Extended Destination ID field in the IOAPIC RTE and MSI address to address up to 32768 CPUs. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/include/asm/mpspec.h | 1 + arch/x86/include/asm/x86_init.h | 2 ++ arch/x86/kernel/apic/apic.c | 15 ++++++++++++++- arch/x86/kernel/apic/msi.c | 9 ++++++++- arch/x86/kernel/x86_init.c | 1 + 5 files changed, 26 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h index e90ac7e9ae2c..25ee8ca0a1f2 100644 --- a/arch/x86/include/asm/mpspec.h +++ b/arch/x86/include/asm/mpspec.h @@ -42,6 +42,7 @@ extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES); extern unsigned int boot_cpu_physical_apicid; extern u8 boot_cpu_apic_version; extern unsigned long mp_lapic_addr; +extern int msi_ext_dest_id; #ifdef CONFIG_X86_LOCAL_APIC extern int smp_found_config; diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 397196fae24d..5af3fe9e38f3 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -114,6 +114,7 @@ struct x86_init_pci { * @init_platform: platform setup * @guest_late_init: guest late init * @x2apic_available: X2APIC detection + * @msi_ext_dest_id: MSI and IOAPIC support 15-bit APIC IDs * @init_mem_mapping: setup early mappings during init_mem_mapping() * @init_after_bootmem: guest init after boot allocator is finished */ @@ -121,6 +122,7 @@ struct x86_hyper_init { void (*init_platform)(void); void (*guest_late_init)(void); bool (*x2apic_available)(void); + bool (*msi_ext_dest_id)(void); void (*init_mem_mapping)(void); void (*init_after_bootmem)(void); }; diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index a75767052a92..459c78558f36 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -1837,6 +1837,8 @@ static __init void x2apic_enable(void) static __init void 
try_to_enable_x2apic(int remap_mode) { + u32 apic_limit = 0; + if (x2apic_state == X2APIC_DISABLED) return; @@ -1858,7 +1860,15 @@ static __init void try_to_enable_x2apic(int remap_mode) return; } - x2apic_set_max_apicid(255); + /* + * If the hypervisor supports extended destination ID + * in IOAPIC and MSI, we can support that many CPUs. + */ + if (x86_init.hyper.msi_ext_dest_id()) { + msi_ext_dest_id = 1; + apic_limit = 32767; + } else + apic_limit = 255; } /* @@ -1867,6 +1877,9 @@ static __init void try_to_enable_x2apic(int remap_mode) */ x2apic_phys = 1; } + if (apic_limit) + x2apic_set_max_apicid(apic_limit); + x2apic_enable(); } diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c index 356f8acf4927..4d891967bea4 100644 --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -23,6 +23,8 @@ struct irq_domain *x86_pci_msi_default_domain __ro_after_init; +int msi_ext_dest_id __ro_after_init; + static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar) { msg->address_hi = MSI_ADDR_BASE_HI; @@ -45,10 +47,15 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int * Only the IOMMU itself can use the trick of putting destination * APIC ID into the high bits of the address. Anything else would * just be writing to memory if it tried that, and needs IR to - * address APICs above 255. + * address APICs which can't be addressed in the normal 32-bit + * address range at 0xFFExxxxx. That is typically just 8 bits, but + * some hypervisors allow the extended destination ID field in bits + * 11-5 to be used, giving support for 15 bits of APIC IDs in total. 
*/ if (dmar) msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid); + else if (msi_ext_dest_id && cfg->dest_apicid < 0x8000) + msg->address_lo |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid) >> 3; else WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid)); } diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index a3038d8deb6a..8b395821cb8d 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -110,6 +110,7 @@ struct x86_init_ops x86_init __initdata = { .init_platform = x86_init_noop, .guest_late_init = x86_init_noop, .x2apic_available = bool_x86_init_noop, + .msi_ext_dest_id = bool_x86_init_noop, .init_mem_mapping = x86_init_noop, .init_after_bootmem = x86_init_noop, }, -- 2.26.2
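The resulting address_lo layout can be illustrated standalone. A sketch only: the macros match the x86 msidef.h values, but the classic destination-ID bits (address bits 19:12) are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE_LO		0xfee00000
#define MSI_ADDR_EXT_DEST_ID(dest)	((dest) & 0xffffff00)

/*
 * Sketch: with msi_ext_dest_id enabled, APIC ID bits 14:8 are shifted
 * into MSI address bits 11:5, so the write still lands inside the
 * APIC's 0xFEExxxxx window. (Destination ID bits 19:12 omitted here.)
 */
static uint32_t compose_address_lo(uint32_t dest_apicid, int msi_ext_dest_id)
{
	uint32_t address_lo = MSI_ADDR_BASE_LO;

	if (msi_ext_dest_id && dest_apicid < 0x8000)
		address_lo |= MSI_ADDR_EXT_DEST_ID(dest_apicid) >> 3;

	return address_lo;
}
```

For example, APIC ID 0x7fff contributes 0x7f00 >> 3 = 0xfe0, filling address bits 11:5 while leaving the address inside the APIC window.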
From: David Woodhouse <dwmw@amazon.co.uk> Instead of plugging in irq_default_affinity at the lowest level in desc_smp_init() if the caller didn't specify an affinity for this specific IRQ in its array, allow the default affinity to be passed down from __irq_alloc_descs() instead. This is in preparation for irq domains having their own default (and indeed maximum) affinities. An alternative possibility would have been to make the caller allocate an entire array of struct irq_affinity_desc for the allocation, but that's a lot nastier than simply passing in an additional const cpumask. This commit has no visible API change outside static functions in irqdesc.c for now; the actual change will come later. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- include/linux/interrupt.h | 2 ++ kernel/irq/irqdesc.c | 21 +++++++++++++-------- 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index f9aee3538461..cd0ff293486a 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -364,6 +364,8 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec, #else /* CONFIG_SMP */ +#define irq_default_affinity (NULL) + static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m) { return -EINVAL; diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 1a7723604399..4ac91b9fc618 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -81,8 +81,6 @@ static int alloc_masks(struct irq_desc *desc, int node) static void desc_smp_init(struct irq_desc *desc, int node, const struct cpumask *affinity) { - if (!affinity) - affinity = irq_default_affinity; cpumask_copy(desc->irq_common_data.affinity, affinity); #ifdef CONFIG_GENERIC_PENDING_IRQ @@ -465,12 +463,16 @@ static void free_desc(unsigned int irq) static int alloc_descs(unsigned int start, unsigned int cnt, int node, const struct irq_affinity_desc *affinity, + const struct cpumask 
*default_affinity, struct module *owner) { struct irq_desc *desc; int i; /* Validate affinity mask(s) */ + if (!default_affinity || cpumask_empty(default_affinity)) + return -EINVAL; + if (affinity) { for (i = 0; i < cnt; i++) { if (cpumask_empty(&affinity[i].mask)) @@ -479,7 +481,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, } for (i = 0; i < cnt; i++) { - const struct cpumask *mask = NULL; + const struct cpumask *mask = default_affinity; unsigned int flags = 0; if (affinity) { @@ -488,10 +490,10 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node, IRQD_MANAGED_SHUTDOWN; } mask = &affinity->mask; - node = cpu_to_node(cpumask_first(mask)); affinity++; } + node = cpu_to_node(cpumask_first(mask)); desc = alloc_desc(start + i, node, flags, mask, owner); if (!desc) goto err; @@ -538,7 +540,7 @@ int __init early_irq_init(void) nr_irqs = initcnt; for (i = 0; i < initcnt; i++) { - desc = alloc_desc(i, node, 0, NULL, NULL); + desc = alloc_desc(i, node, 0, irq_default_affinity, NULL); set_bit(i, allocated_irqs); irq_insert_desc(i, desc); } @@ -573,7 +575,8 @@ int __init early_irq_init(void) raw_spin_lock_init(&desc[i].lock); lockdep_set_class(&desc[i].lock, &irq_desc_lock_class); mutex_init(&desc[i].request_mutex); - desc_set_defaults(i, &desc[i], node, NULL, NULL); + desc_set_defaults(i, &desc[i], node, irq_default_affinity, + NULL); } return arch_early_irq_init(); } @@ -590,12 +593,14 @@ static void free_desc(unsigned int irq) unsigned long flags; raw_spin_lock_irqsave(&desc->lock, flags); - desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL); + desc_set_defaults(irq, desc, irq_desc_get_node(desc), + irq_default_affinity, NULL); raw_spin_unlock_irqrestore(&desc->lock, flags); } static inline int alloc_descs(unsigned int start, unsigned int cnt, int node, const struct irq_affinity_desc *affinity, + const struct cpumask *default_affinity, struct module *owner) { u32 i; @@ -803,7 +808,7 @@ __irq_alloc_descs(int 
irq, unsigned int from, unsigned int cnt, int node, if (ret) goto unlock; } - ret = alloc_descs(start, cnt, node, affinity, owner); + ret = alloc_descs(start, cnt, node, affinity, irq_default_affinity, owner); unlock: mutex_unlock(&sparse_irq_lock); return ret; -- 2.26.2
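The descriptor-allocation change boils down to a per-IRQ choice that can be sketched like this, with 64-bit words standing in for cpumasks. This is a toy model of the control flow, not the kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the alloc_descs() change: each descriptor takes its
 * per-IRQ affinity if the caller supplied an array, otherwise the
 * default passed down from __irq_alloc_descs(). A cpumask is modelled
 * as a plain 64-bit word here.
 */
static uint64_t desc_affinity(const uint64_t *per_irq, unsigned int i,
			      uint64_t default_affinity)
{
	if (per_irq)
		return per_irq[i];

	return default_affinity;
}
```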
From: David Woodhouse <dwmw@amazon.co.uk> It already takes an array of affinities for each individual irq being allocated but that's awkward to allocate and populate in the case where they're all the same and inherited from the irqdomain, so pass the default in separately as a simple cpumask. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- include/linux/irq.h | 10 ++++++---- kernel/irq/devres.c | 8 ++++++-- kernel/irq/irqdesc.c | 10 ++++++++-- kernel/irq/irqdomain.c | 6 +++--- 4 files changed, 23 insertions(+), 11 deletions(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index 1b7f4dfee35b..6e119594d35d 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -897,15 +897,17 @@ unsigned int arch_dynirq_lower_bound(unsigned int from); int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, struct module *owner, - const struct irq_affinity_desc *affinity); + const struct irq_affinity_desc *affinity, + const struct cpumask *default_affinity); int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, unsigned int cnt, int node, struct module *owner, - const struct irq_affinity_desc *affinity); + const struct irq_affinity_desc *affinity, + const struct cpumask *default_affinity); /* use macros to avoid needing export.h for THIS_MODULE */ #define irq_alloc_descs(irq, from, cnt, node) \ - __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL) + __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL, NULL) #define irq_alloc_desc(node) \ irq_alloc_descs(-1, 0, 1, node) @@ -920,7 +922,7 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, irq_alloc_descs(-1, from, cnt, node) #define devm_irq_alloc_descs(dev, irq, from, cnt, node) \ - __devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL) + __devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL, NULL) #define devm_irq_alloc_desc(dev, node) \ devm_irq_alloc_descs(dev, -1, 0, 1, node) diff --git 
a/kernel/irq/devres.c b/kernel/irq/devres.c index f6e5515ee077..079339decc23 100644 --- a/kernel/irq/devres.c +++ b/kernel/irq/devres.c @@ -170,6 +170,8 @@ static void devm_irq_desc_release(struct device *dev, void *res) * @affinity: Optional pointer to an irq_affinity_desc array of size @cnt * which hints where the irq descriptors should be allocated * and which default affinities to use + * @default_affinity: Optional pointer to a cpumask indicating the default + * affinity to use where not specified by the @affinity array * * Returns the first irq number or error code. * @@ -177,7 +179,8 @@ static void devm_irq_desc_release(struct device *dev, void *res) */ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, unsigned int cnt, int node, struct module *owner, - const struct irq_affinity_desc *affinity) + const struct irq_affinity_desc *affinity, + const struct cpumask *default_affinity) { struct irq_desc_devres *dr; int base; @@ -186,7 +189,8 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from, if (!dr) return -ENOMEM; - base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity); + base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity, + default_affinity); if (base < 0) { devres_free(dr); return base; diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 4ac91b9fc618..fcc3b8a1fe01 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -770,15 +770,21 @@ EXPORT_SYMBOL_GPL(irq_free_descs); * @affinity: Optional pointer to an affinity mask array of size @cnt which * hints where the irq descriptors should be allocated and which * default affinities to use + * @default_affinity: Optional pointer to a cpumask indicating the default + * affinity where not specified in the @affinity array * * Returns the first irq number or error code */ int __ref __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, - struct module *owner, const struct irq_affinity_desc *affinity) + struct 
module *owner, const struct irq_affinity_desc *affinity, + const struct cpumask *default_affinity) { int start, ret; + if (!default_affinity) + default_affinity = irq_default_affinity; + if (!cnt) return -EINVAL; @@ -808,7 +814,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node, if (ret) goto unlock; } - ret = alloc_descs(start, cnt, node, affinity, irq_default_affinity, owner); + ret = alloc_descs(start, cnt, node, affinity, default_affinity, owner); unlock: mutex_unlock(&sparse_irq_lock); return ret; diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 76cd7ebd1178..c93e00ca11d8 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1017,16 +1017,16 @@ int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, if (virq >= 0) { virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE, - affinity); + affinity, NULL); } else { hint = hwirq % nr_irqs; if (hint == 0) hint++; virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE, - affinity); + affinity, NULL); if (virq <= 0 && hint > 1) { virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE, - affinity); + affinity, NULL); } } -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> This is the maximum possible set of CPUs which can be used. Use it to calculate the default affinity requested from __irq_alloc_descs() by first attempting to find the intersection with irq_default_affinity, or falling back to using just the max_affinity if the intersection would be empty. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- include/linux/irqdomain.h | 3 ++- kernel/irq/ipi.c | 2 +- kernel/irq/irqdomain.c | 45 +++++++++++++++++++++++++++++++++------ 3 files changed, 42 insertions(+), 8 deletions(-) diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 44445d9de881..6b5576da77f0 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -278,7 +278,8 @@ extern void irq_set_default_host(struct irq_domain *host); extern struct irq_domain *irq_get_default_host(void); extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs, irq_hw_number_t hwirq, int node, - const struct irq_affinity_desc *affinity); + const struct irq_affinity_desc *affinity, + const struct cpumask *max_affinity); static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node) { diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c index 43e3d1be622c..13f56210eca9 100644 --- a/kernel/irq/ipi.c +++ b/kernel/irq/ipi.c @@ -75,7 +75,7 @@ int irq_reserve_ipi(struct irq_domain *domain, } } - virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL); + virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL, dest); if (virq <= 0) { pr_warn("Can't reserve IPI, failed to alloc descs\n"); return -ENOMEM; diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index c93e00ca11d8..ffd41f21afca 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -660,7 +660,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain, } /* Allocate a virtual interrupt number */ - virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL); + virq = 
irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), + NULL, NULL); if (virq <= 0) { pr_debug("-> virq allocation failed\n"); return 0; @@ -1011,25 +1012,57 @@ int irq_domain_translate_twocell(struct irq_domain *d, EXPORT_SYMBOL_GPL(irq_domain_translate_twocell); int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, - int node, const struct irq_affinity_desc *affinity) + int node, const struct irq_affinity_desc *affinity, + const struct cpumask *max_affinity) { + cpumask_var_t default_affinity; unsigned int hint; + int i; + + /* Check requested per-IRQ affinities are in the possible range */ + if (affinity && max_affinity) { + for (i = 0; i < cnt; i++) + if (!cpumask_subset(&affinity[i].mask, max_affinity)) + return -EINVAL; + } + + /* + * Generate default affinity. Either the possible subset of + * irq_default_affinity if such a subset is non-empty, or fall + * back to the provided max_affinity if there is no intersection. + * And just a copy of irq_default_affinity in the + * !CONFIG_CPUMASK_OFFSTACK case. 
+ */ + memset(&default_affinity, 0, sizeof(default_affinity)); + if ((max_affinity && + !cpumask_subset(irq_default_affinity, max_affinity))) { + if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL)) + return -ENOMEM; + cpumask_and(default_affinity, max_affinity, + irq_default_affinity); + if (cpumask_empty(default_affinity)) + cpumask_copy(default_affinity, max_affinity); + } else if (cpumask_available(default_affinity)) + cpumask_copy(default_affinity, irq_default_affinity); if (virq >= 0) { virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE, - affinity, NULL); + affinity, default_affinity); } else { hint = hwirq % nr_irqs; if (hint == 0) hint++; virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE, - affinity, NULL); + affinity, default_affinity); if (virq <= 0 && hint > 1) { virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE, - affinity, NULL); + affinity, default_affinity); } } + if (cpumask_available(default_affinity)) + free_cpumask_var(default_affinity); + return virq; } @@ -1342,7 +1375,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, virq = irq_base; } else { virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node, - affinity); + affinity, NULL); if (virq < 0) { pr_debug("cannot allocate IRQ(base %d, count %d)\n", irq_base, nr_irqs); -- 2.26.2
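The default-affinity computation above reduces to a small decision that can be sketched with toy masks, again using 64-bit words in place of cpumasks; this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the choice in irq_domain_alloc_descs(): prefer the
 * intersection of irq_default_affinity with the domain's max_affinity,
 * and fall back to max_affinity alone if that intersection is empty.
 */
static uint64_t pick_default_affinity(uint64_t irq_default, uint64_t max_affinity)
{
	uint64_t mask = irq_default & max_affinity;

	return mask ? mask : max_affinity;
}
```

The fallback matters for domains whose reachable CPUs are entirely outside the system default affinity; without it, the descriptor would start with an empty mask.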
From: David Woodhouse <dwmw@amazon.co.uk> This allows a maximal affinity to be set, for IRQ domains which cannot target all CPUs in the system. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- include/linux/irqdomain.h | 4 ++++ kernel/irq/irqdomain.c | 28 ++++++++++++++++++++++++++-- kernel/irq/manage.c | 19 +++++++++++++++---- 3 files changed, 45 insertions(+), 6 deletions(-) diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 6b5576da77f0..cf686f18c1fa 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -151,6 +151,7 @@ struct irq_domain_chip_generic; * drivers using the generic chip library which uses this pointer. * @parent: Pointer to parent irq_domain to support hierarchy irq_domains * @debugfs_file: dentry for the domain debugfs file + * @max_affinity: mask of CPUs targetable by this IRQ domain * * Revmap data, used internally by irq_domain * @revmap_direct_max_irq: The largest hwirq that can be set for controllers that @@ -177,6 +178,7 @@ struct irq_domain { #ifdef CONFIG_GENERIC_IRQ_DEBUGFS struct dentry *debugfs_file; #endif + const struct cpumask *max_affinity; /* reverse map data. 
The linear map gets appended to the irq_domain */ irq_hw_number_t hwirq_max; @@ -453,6 +455,8 @@ extern void irq_domain_set_info(struct irq_domain *domain, unsigned int virq, void *chip_data, irq_flow_handler_t handler, void *handler_data, const char *handler_name); extern void irq_domain_reset_irq_data(struct irq_data *irq_data); +extern int irq_domain_set_affinity(struct irq_domain *domain, + const struct cpumask *affinity); #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY extern struct irq_domain *irq_domain_create_hierarchy(struct irq_domain *parent, unsigned int flags, unsigned int size, diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index ffd41f21afca..0b298355be1b 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -661,7 +661,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain, /* Allocate a virtual interrupt number */ virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), - NULL, NULL); + NULL, domain->max_affinity); if (virq <= 0) { pr_debug("-> virq allocation failed\n"); return 0; @@ -1078,6 +1078,27 @@ void irq_domain_reset_irq_data(struct irq_data *irq_data) } EXPORT_SYMBOL_GPL(irq_domain_reset_irq_data); +/** + * irq_domain_set_affinity - Set maximum CPU affinity for domain + * @domain: Domain to set affinity for + * @affinity: Pointer to cpumask, consumed by domain + * + * Sets the maximal set of CPUs to which interrupts in this domain may + * be delivered. Must only be called after creation, before any interrupts + * have been created in the domain. + * + * This function retains a pointer to the cpumask which is passed in.
+ */ +int irq_domain_set_affinity(struct irq_domain *domain, + const struct cpumask *affinity) +{ + if (cpumask_empty(affinity)) + return -EINVAL; + domain->max_affinity = affinity; + return 0; +} +EXPORT_SYMBOL_GPL(irq_domain_set_affinity); + #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY /** * irq_domain_create_hierarchy - Add a irqdomain into the hierarchy @@ -1110,6 +1131,9 @@ struct irq_domain *irq_domain_create_hierarchy(struct irq_domain *parent, if (domain) { domain->parent = parent; domain->flags |= flags; + if (parent && parent->max_affinity) + irq_domain_set_affinity(domain, + parent->max_affinity); } return domain; @@ -1375,7 +1399,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, virq = irq_base; } else { virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node, - affinity, NULL); + affinity, domain->max_affinity); if (virq < 0) { pr_debug("cannot allocate IRQ(base %d, count %d)\n", irq_base, nr_irqs); diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 5df903fccb60..e8c5f8ecd306 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -345,6 +345,10 @@ int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask, struct irq_desc *desc = irq_data_to_desc(data); int ret = 0; + if (data->domain && data->domain->max_affinity && + !cpumask_subset(mask, data->domain->max_affinity)) + return -EINVAL; + if (!chip || !chip->irq_set_affinity) return -EINVAL; @@ -484,13 +488,20 @@ int irq_setup_affinity(struct irq_desc *desc) struct cpumask *set = irq_default_affinity; int ret, node = irq_desc_get_node(desc); static DEFINE_RAW_SPINLOCK(mask_lock); - static struct cpumask mask; + static struct cpumask mask, max_mask; /* Excludes PER_CPU and NO_BALANCE interrupts */ if (!__irq_can_set_affinity(desc)) return 0; raw_spin_lock(&mask_lock); + + if (desc->irq_data.domain && desc->irq_data.domain->max_affinity) + cpumask_and(&max_mask, cpu_online_mask, + desc->irq_data.domain->max_affinity); + else + cpumask_copy(&max_mask, 
cpu_online_mask); + /* * Preserve the managed affinity setting and a userspace affinity * setup, but make sure that one of the targets is online. @@ -498,15 +509,15 @@ int irq_setup_affinity(struct irq_desc *desc) if (irqd_affinity_is_managed(&desc->irq_data) || irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) { if (cpumask_intersects(desc->irq_common_data.affinity, - cpu_online_mask)) + &max_mask)) set = desc->irq_common_data.affinity; else irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET); } - cpumask_and(&mask, cpu_online_mask, set); + cpumask_and(&mask, &max_mask, set); if (cpumask_empty(&mask)) - cpumask_copy(&mask, cpu_online_mask); + cpumask_copy(&mask, &max_mask); if (node != NUMA_NO_NODE) { const struct cpumask *nodemask = cpumask_of_node(node); -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> This is the mask of CPUs to which IRQs can be delivered without interrupt remapping. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/include/asm/mpspec.h | 1 + arch/x86/kernel/apic/apic.c | 12 ++++++++++++ arch/x86/kernel/apic/io_apic.c | 2 ++ 3 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h index 25ee8ca0a1f2..b2090be5b444 100644 --- a/arch/x86/include/asm/mpspec.h +++ b/arch/x86/include/asm/mpspec.h @@ -141,5 +141,6 @@ static inline void physid_set_mask_of_physid(int physid, physid_mask_t *map) #define PHYSID_MASK_NONE { {[0 ... PHYSID_ARRAY_SIZE-1] = 0UL} } extern physid_mask_t phys_cpu_present_map; +extern cpumask_t x86_non_ir_cpumask; #endif /* _ASM_X86_MPSPEC_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 459c78558f36..069f5e9f1d28 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -103,6 +103,9 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid); EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid); EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_acpiid); +/* Mask of CPUs which can be targeted by non-remapped interrupts. */ +cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL }; + #ifdef CONFIG_X86_32 /* @@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void) static __init void try_to_enable_x2apic(int remap_mode) { u32 apic_limit = 0; + int i; if (x2apic_state == X2APIC_DISABLED) return; @@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode) if (apic_limit) x2apic_set_max_apicid(apic_limit); + /* Build the affinity mask for interrupts that can't be remapped. 
*/ + cpumask_clear(&x86_non_ir_cpumask); + i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit); + for ( ; i >= 0; i--) { + if (cpu_physical_id(i) <= apic_limit) + cpumask_set_cpu(i, &x86_non_ir_cpumask); + } + x2apic_enable(); } diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index aa9a3b54a96c..4d0ef46fedb9 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin) struct irq_alloc_info info; ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0); + if (domain->parent == x86_vector_domain) + info.mask = &x86_non_ir_cpumask; info.devid = mpc_ioapic_id(ioapic); info.ioapic.pin = pin; mutex_lock(&ioapic_mutex); -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> When interrupt remapping isn't enabled, only the first 255 CPUs can receive external interrupts. Set the appropriate max affinity for the IOAPIC and MSI IRQ domains accordingly. This also fixes the case where interrupt remapping is enabled but some devices are not within the scope of any active IOMMU. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/kernel/apic/io_apic.c | 2 ++ arch/x86/kernel/apic/msi.c | 3 +++ 2 files changed, 5 insertions(+) diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 4d0ef46fedb9..1c8ce7bc098f 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -2332,6 +2332,8 @@ static int mp_irqdomain_create(int ioapic) } ip->irqdomain->parent = parent; + if (parent == x86_vector_domain) + irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask); if (cfg->type == IOAPIC_DOMAIN_LEGACY || cfg->type == IOAPIC_DOMAIN_STRICT) diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c index 4d891967bea4..af5ce5c4da02 100644 --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void) pr_warn("Failed to initialize PCI-MSI irqdomain.\n"); } else { d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK; + irq_domain_set_affinity(d, &x86_non_ir_cpumask); } return d; } @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id) irq_domain_free_fwnode(fn); kfree(domain_info); } + if (parent == x86_vector_domain) + irq_domain_set_affinity(d, &x86_non_ir_cpumask); return d; } -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> Now that external interrupt affinity can be limited to the range of CPUs that can be reached through legacy IOAPIC RTEs and MSI, it is possible to use additional CPUs. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- arch/x86/kernel/apic/apic.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 069f5e9f1d28..750a92464bec 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -1881,8 +1881,6 @@ static __init void try_to_enable_x2apic(int remap_mode) */ x2apic_phys = 1; } - if (apic_limit) - x2apic_set_max_apicid(apic_limit); /* Build the affinity mask for interrupts that can't be remapped. */ cpumask_clear(&x86_non_ir_cpumask); -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> It took me a while to realise that this "IRQ remapping" driver exists not to actually remap interrupts, but to return -EINVAL if anyone ever tries to set the affinity to a set of CPUs which can't be reached *without* remapping. Having fixed the core IRQ domain code to handle such limited affinity, all this hackery can now die. I haven't deleted it entirely because its existence still causes the kernel to use X2APIC in cluster mode not physical. I'm not sure we *want* that, but it wants further thought and testing before ripping it out too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- drivers/iommu/hyperv-iommu.c | 149 +---------------------------------- 1 file changed, 1 insertion(+), 148 deletions(-) diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c index e09e2d734c57..46a794d34f57 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv-iommu.c @@ -24,156 +24,12 @@ #include "irq_remapping.h" #ifdef CONFIG_IRQ_REMAP - -/* - * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt - * Redirection Table. Hyper-V exposes one single IO-APIC and so define - * 24 IO APIC remmapping entries. - */ -#define IOAPIC_REMAPPING_ENTRY 24 - -static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE }; -static struct irq_domain *ioapic_ir_domain; - -static int hyperv_ir_set_affinity(struct irq_data *data, - const struct cpumask *mask, bool force) -{ - struct irq_data *parent = data->parent_data; - struct irq_cfg *cfg = irqd_cfg(data); - struct IO_APIC_route_entry *entry; - int ret; - - /* Return error If new irq affinity is out of ioapic_max_cpumask.
*/ - if (!cpumask_subset(mask, &ioapic_max_cpumask)) - return -EINVAL; - - ret = parent->chip->irq_set_affinity(parent, mask, force); - if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE) - return ret; - - entry = data->chip_data; - entry->dest = cfg->dest_apicid; - entry->vector = cfg->vector; - send_cleanup_vector(cfg); - - return 0; -} - -static struct irq_chip hyperv_ir_chip = { - .name = "HYPERV-IR", - .irq_ack = apic_ack_irq, - .irq_set_affinity = hyperv_ir_set_affinity, -}; - -static int hyperv_irq_remapping_alloc(struct irq_domain *domain, - unsigned int virq, unsigned int nr_irqs, - void *arg) -{ - struct irq_alloc_info *info = arg; - struct irq_data *irq_data; - struct irq_desc *desc; - int ret = 0; - - if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1) - return -EINVAL; - - ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg); - if (ret < 0) - return ret; - - irq_data = irq_domain_get_irq_data(domain, virq); - if (!irq_data) { - irq_domain_free_irqs_common(domain, virq, nr_irqs); - return -EINVAL; - } - - irq_data->chip = &hyperv_ir_chip; - - /* - * If there is interrupt remapping function of IOMMU, setting irq - * affinity only needs to change IRTE of IOMMU. But Hyper-V doesn't - * support interrupt remapping function, setting irq affinity of IO-APIC - * interrupts still needs to change IO-APIC registers. But ioapic_ - * configure_entry() will ignore value of cfg->vector and cfg-> - * dest_apicid when IO-APIC's parent irq domain is not the vector - * domain.(See ioapic_configure_entry()) In order to setting vector - * and dest_apicid to IO-APIC register, IO-APIC entry pointer is saved - * in the chip_data and hyperv_irq_remapping_activate()/hyperv_ir_set_ - * affinity() set vector and dest_apicid directly into IO-APIC entry. - */ - irq_data->chip_data = info->ioapic.entry; - - /* - * Hypver-V IO APIC irq affinity should be in the scope of - * ioapic_max_cpumask because no irq remapping support. 
- */ - desc = irq_data_to_desc(irq_data); - cpumask_copy(desc->irq_common_data.affinity, &ioapic_max_cpumask); - - return 0; -} - -static void hyperv_irq_remapping_free(struct irq_domain *domain, - unsigned int virq, unsigned int nr_irqs) -{ - irq_domain_free_irqs_common(domain, virq, nr_irqs); -} - -static int hyperv_irq_remapping_activate(struct irq_domain *domain, - struct irq_data *irq_data, bool reserve) -{ - struct irq_cfg *cfg = irqd_cfg(irq_data); - struct IO_APIC_route_entry *entry = irq_data->chip_data; - - entry->dest = cfg->dest_apicid; - entry->vector = cfg->vector; - - return 0; -} - -static const struct irq_domain_ops hyperv_ir_domain_ops = { - .alloc = hyperv_irq_remapping_alloc, - .free = hyperv_irq_remapping_free, - .activate = hyperv_irq_remapping_activate, -}; - static int __init hyperv_prepare_irq_remapping(void) { - struct fwnode_handle *fn; - int i; - if (!hypervisor_is_type(X86_HYPER_MS_HYPERV) || !x2apic_supported()) return -ENODEV; - fn = irq_domain_alloc_named_id_fwnode("HYPERV-IR", 0); - if (!fn) - return -ENOMEM; - - ioapic_ir_domain = - irq_domain_create_hierarchy(arch_get_ir_parent_domain(), - 0, IOAPIC_REMAPPING_ENTRY, fn, - &hyperv_ir_domain_ops, NULL); - - if (!ioapic_ir_domain) { - irq_domain_free_fwnode(fn); - return -ENOMEM; - } - - /* - * Hyper-V doesn't provide irq remapping function for - * IO-APIC and so IO-APIC only accepts 8-bit APIC ID. - * Cpu's APIC ID is read from ACPI MADT table and APIC IDs - * in the MADT table on Hyper-v are sorted monotonic increasingly. - * APIC ID reflects cpu topology. There maybe some APIC ID - * gaps when cpu number in a socket is not power of two. Prepare - * max cpu affinity for IOAPIC irqs. Scan cpu 0-255 and set cpu - * into ioapic_max_cpumask if its APIC ID is less than 256. 
- */ - for (i = min_t(unsigned int, num_possible_cpus() - 1, 255); i >= 0; i--) - if (cpu_physical_id(i) < 256) - cpumask_set_cpu(i, &ioapic_max_cpumask); - return 0; } @@ -184,10 +40,7 @@ static int __init hyperv_enable_irq_remapping(void) static struct irq_domain *hyperv_get_irq_domain(struct irq_alloc_info *info) { - if (info->type == X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT) - return ioapic_ir_domain; - else - return NULL; + return NULL; } struct irq_remap_ops hyperv_irq_remap_ops = { -- 2.26.2
From: David Woodhouse <dwmw@amazon.co.uk> This allows the host to indicate that IOAPIC and MSI emulation supports 15-bit destination IDs, allowing up to 32Ki CPUs without remapping. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- Documentation/virt/kvm/cpuid.rst | 4 ++++ arch/x86/include/uapi/asm/kvm_para.h | 1 + arch/x86/kernel/kvm.c | 6 ++++++ 3 files changed, 11 insertions(+) diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst index a7dff9186bed..1726b5925d2b 100644 --- a/Documentation/virt/kvm/cpuid.rst +++ b/Documentation/virt/kvm/cpuid.rst @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit async pf acknowledgment msr 0x4b564d07. +KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit + before using extended destination + ID bits in MSI address bits 11-5. + KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side per-cpu warps are expeced in kvmclock diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 812e9b4c1114..950afebfba88 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -32,6 +32,7 @@ #define KVM_FEATURE_POLL_CONTROL 12 #define KVM_FEATURE_PV_SCHED_YIELD 13 #define KVM_FEATURE_ASYNC_PF_INT 14 +#define KVM_FEATURE_MSI_EXT_DEST_ID 15 #define KVM_HINTS_REALTIME 0 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 1b51b727b140..4986b4399aef 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -743,12 +743,18 @@ static void __init kvm_init_platform(void) x86_platform.apic_post_init = kvm_apic_init; } +static bool __init kvm_msi_ext_dest_id(void) +{ + return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID); +} + const __initconst struct hypervisor_x86 x86_hyper_kvm = { .name = "KVM", .detect = kvm_detect, .type = X86_HYPER_KVM, .init.guest_late_init = kvm_guest_init, .init.x2apic_available = kvm_para_available, + .init.msi_ext_dest_id = kvm_msi_ext_dest_id, 
.init.init_platform = kvm_init_platform, }; -- 2.26.2
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > -static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg) > +static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, int dmar) bool dmar? > +/* > + * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the > + * high bits of the destination APIC ID. This can't be done in the general > + * case for MSIs as it would be targeting real memory above 4GiB not the > + * APIC. > + */ > +static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg) > +{ > + __irq_msi_compose_msg(irqd_cfg(data), msg, 1); > + > + > + Lots of stray newlines...
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > > #else /* CONFIG_SMP */ > > +#define irq_default_affinity (NULL) ... > static int alloc_descs(unsigned int start, unsigned int cnt, int node, > const struct irq_affinity_desc *affinity, > + const struct cpumask *default_affinity, > struct module *owner) > { > struct irq_desc *desc; > int i; > > /* Validate affinity mask(s) */ > + if (!default_affinity || cpumask_empty(default_affinity)) > + return -EINVAL; How is that supposed to work on UP? Aside of that I really hate to have yet another argument just because. Thanks, tglx
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > It already takes an array of affinities for each individual irq being > allocated but that's awkward to allocate and populate in the case > where they're all the same and inherited from the irqdomain, so pass > the default in separately as a simple cpumask. So we need another cpumask argument exposed to the world just because it's so hard to extend struct irq_affinity_desc so it supports that use case as well. It's not written in stone that this struct can only support arrays. > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > include/linux/irq.h | 10 ++++++---- > kernel/irq/devres.c | 8 ++++++-- > kernel/irq/irqdesc.c | 10 ++++++++-- > kernel/irq/irqdomain.c | 6 +++--- git grep __irq_alloc_descs() might help you to find _all_ instances ... Thanks, tglx
On 6 October 2020 22:01:18 BST, Thomas Gleixner <tglx@linutronix.de> wrote: >On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: >> >> #else /* CONFIG_SMP */ >> >> +#define irq_default_affinity (NULL) > >... > >> static int alloc_descs(unsigned int start, unsigned int cnt, int >node, >> const struct irq_affinity_desc *affinity, >> + const struct cpumask *default_affinity, >> struct module *owner) >> { >> struct irq_desc *desc; >> int i; >> >> /* Validate affinity mask(s) */ >> + if (!default_affinity || cpumask_empty(default_affinity)) >> + return -EINVAL; > >How is that supposed to work on UP? Hm, good point. >Aside of that I really hate to have yet another argument just >because. Yeah, I was trying to avoid having to allocate a whole array of irq_affinity_desc just to fill them all in with the same default. But perhaps I need to treat the "affinity_max" like we do cpu_online_mask, and allow affinity to be set even to offline/unreachable CPUs at this point. Then we can be more relaxed about the default affinities. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > This is the maximum possible set of CPUs which can be used. Use it > to calculate the default affinity requested from __irq_alloc_descs() > by first attempting to find the intersection with irq_default_affinity, > or falling back to using just the max_affinity if the intersection > would be empty. And why do we need that as yet another argument? This is an optional property of the irq domain, really and no caller has any business with that. > int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, > - int node, const struct irq_affinity_desc *affinity) > + int node, const struct irq_affinity_desc *affinity, > + const struct cpumask *max_affinity) > { > + cpumask_var_t default_affinity; > unsigned int hint; > + int i; > + > + /* Check requested per-IRQ affinities are in the possible range */ > + if (affinity && max_affinity) { > + for (i = 0; i < cnt; i++) > + if (!cpumask_subset(&affinity[i].mask, max_affinity)) > + return -EINVAL; https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos What is preventing the affinity spreading code from spreading the masks out to unusable CPUs? The changelog is silent about that part. > + /* > + * Generate default affinity. Either the possible subset of > + * irq_default_affinity if such a subset is non-empty, or fall > + * back to the provided max_affinity if there is no intersection. .. > + * And just a copy of irq_default_affinity in the > + * !CONFIG_CPUMASK_OFFSTACK case. We know that already... > + */ > + memset(&default_affinity, 0, sizeof(default_affinity)); Right, memset() before allocating is useful. 
> + if ((max_affinity && > + !cpumask_subset(irq_default_affinity, max_affinity))) { > + if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL)) > + return -ENOMEM; > + cpumask_and(default_affinity, max_affinity, > + irq_default_affinity); > + if (cpumask_empty(default_affinity)) > + cpumask_copy(default_affinity, max_affinity); > + } else if (cpumask_available(default_affinity)) > + cpumask_copy(default_affinity, irq_default_affinity); That's garbage and unreadable. Thanks, tglx
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> +/**
> + * irq_domain_set_affinity - Set maximum CPU affinity for domain
> + * @domain: Domain to set affinity for
> + * @affinity: Pointer to cpumask, consumed by domain
> + *
> + * Sets the maximal set of CPUs to which interrupts in this domain may
> + * be delivered. Must only be called after creation, before any interrupts
> + * have been created in the domain.
> + *
> + * This function retains a pointer to the cpumask which is passed in.
> + */
> +int irq_domain_set_affinity(struct irq_domain *domain,
> + const struct cpumask *affinity)
> +{
> + if (cpumask_empty(affinity))
> + return -EINVAL;
> + domain->max_affinity = affinity;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(irq_domain_set_affinity);
What the heck? Why does this need a setter function which is exported?
So that random driver writers can fiddle with it?
The affinity mask restriction of an irq domain is already known when the
domain is created.
Thanks,
tglx
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > This is the mask of CPUs to which IRQs can be delivered without interrupt > remapping. > > +/* Mask of CPUs which can be targeted by non-remapped interrupts. */ > +cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL }; What? > #ifdef CONFIG_X86_32 > > /* > @@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void) > static __init void try_to_enable_x2apic(int remap_mode) > { > u32 apic_limit = 0; > + int i; > > if (x2apic_state == X2APIC_DISABLED) > return; > @@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode) > if (apic_limit) > x2apic_set_max_apicid(apic_limit); > > + /* Build the affinity mask for interrupts that can't be remapped. */ > + cpumask_clear(&x86_non_ir_cpumask); > + i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit); > + for ( ; i >= 0; i--) { > + if (cpu_physical_id(i) <= apic_limit) > + cpumask_set_cpu(i, &x86_non_ir_cpumask); > + } Blink. If the APIC id is not linear with the cpu numbers then this results in a reduced addressable set of CPUs. WHY? > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c > index aa9a3b54a96c..4d0ef46fedb9 100644 > --- a/arch/x86/kernel/apic/io_apic.c > +++ b/arch/x86/kernel/apic/io_apic.c > @@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin) > struct irq_alloc_info info; > > ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0); > + if (domain->parent == x86_vector_domain) > + info.mask = &x86_non_ir_cpumask; We are not going to sprinkle such domain checks all over the place. Again, the mask is a property of the interrupt domain. Thanks, tglx
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > When interrupt remapping isn't enabled, only the first 255 CPUs can No, only CPUs with an APICid < 255 .... > receive external interrupts. Set the appropriate max affinity for > the IOAPIC and MSI IRQ domains accordingly. > > This also fixes the case where interrupt remapping is enabled but some > devices are not within the scope of any active IOMMU. What? If this fixes an pre-existing problem then 1) Explain the problem proper 2) Have a patch at the beginning of the series which fixes it independently of this pile If it's fixing a problem in your pile, then you got the ordering wrong. You didn't start kernel programming as of yesterday, so you really know how that works. > ip->irqdomain->parent = parent; > + if (parent == x86_vector_domain) > + irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask); OMG > if (cfg->type == IOAPIC_DOMAIN_LEGACY || > cfg->type == IOAPIC_DOMAIN_STRICT) > diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c > index 4d891967bea4..af5ce5c4da02 100644 > --- a/arch/x86/kernel/apic/msi.c > +++ b/arch/x86/kernel/apic/msi.c > @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void) > pr_warn("Failed to initialize PCI-MSI irqdomain.\n"); > } else { > d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK; > + irq_domain_set_affinity(d, &x86_non_ir_cpumask); So here it's unconditional > } > return d; > } > @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id) > irq_domain_free_fwnode(fn); > kfree(domain_info); > } > + if (parent == x86_vector_domain) > + irq_domain_set_affinity(d, &x86_non_ir_cpumask); And here we need a condition again. Completely obvious and reviewable - NOT. Thanks, tglx
[-- Attachment #1: Type: text/plain, Size: 3512 bytes --] On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote: > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > This is the maximum possible set of CPUs which can be used. Use it > > to calculate the default affinity requested from __irq_alloc_descs() > > by first attempting to find the intersection with irq_default_affinity, > > or falling back to using just the max_affinity if the intersection > > would be empty. > > And why do we need that as yet another argument? > > This is an optional property of the irq domain, really and no caller has > any business with that. Because irq_domain_alloc_descs() doesn't actually *take* the domain as an argument. It's more of an internal function, which is only non-static because it's used from kernel/irq/ipi.c too for some reason. If we convert the IPI code to just call __irq_alloc_descs() directly, perhaps we can actually make irq_domain_alloc_descs() static. > > int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq, > > - int node, const struct irq_affinity_desc *affinity) > > + int node, const struct irq_affinity_desc *affinity, > > + const struct cpumask *max_affinity) > > { > > + cpumask_var_t default_affinity; > > unsigned int hint; > > + int i; > > + > > + /* Check requested per-IRQ affinities are in the possible range */ > > + if (affinity && max_affinity) { > > + for (i = 0; i < cnt; i++) > > + if (!cpumask_subset(&affinity[i].mask, max_affinity)) > > + return -EINVAL; > > https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos > > What is preventing the affinity spreading code from spreading the masks > out to unusable CPUs? The changelog is silent about that part. I'm coming to the conclusion that we should allow unusable CPUs to be specified at this point, just as we do offline CPUs.
That's largely driven by the realisation that our x86_non_ir_cpumask is only going to contain online CPUs anyway, and hotplugged CPUs only get added to it as they are brought online. > > + /* > > + * Generate default affinity. Either the possible subset of > > + * irq_default_affinity if such a subset is non-empty, or fall > > + * back to the provided max_affinity if there is no intersection. > > .. > > + * And just a copy of irq_default_affinity in the > > + * !CONFIG_CPUMASK_OFFSTACK case. > > We know that already... > > > + */ > > + memset(&default_affinity, 0, sizeof(default_affinity)); > > Right, memset() before allocating is useful. The memset is because there's no cpumask_var_t initialiser that I can see. So cpumask_available() on the uninitialised 'default_affinity' variable might be true even in the OFFSTACK case. > > + if ((max_affinity && > > + !cpumask_subset(irq_default_affinity, max_affinity))) { > > + if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL)) > > + return -ENOMEM; > > + cpumask_and(default_affinity, max_affinity, > > + irq_default_affinity); > > + if (cpumask_empty(default_affinity)) > > + cpumask_copy(default_affinity, max_affinity); > > + } else if (cpumask_available(default_affinity)) > > + cpumask_copy(default_affinity, irq_default_affinity); > > That's garbage and unreadable. That's why there was a comment explaining it... at which point you claimed to already know :) Clearly the comment didn't do the job it was supposed to do. I'll rework. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]
[-- Attachment #1: Type: text/plain, Size: 758 bytes --] On Tue, 2020-10-06 at 23:32 +0200, Thomas Gleixner wrote: > What the heck? Why does this need a setter function which is exported? > So that random driver writers can fiddle with it? > > The affinity mask restriction of an irq domain is already known when the > domain is created. It's exported because __irq_domain_add is exported. I didn't want to just add a rarely-used extra argument to __irq_domain_add and the other public APIs which call into it, and figured a separate setter function was the simplest option. I can rework to add the argument to __irq_domain_add if you prefer. Or we can declare that restricted affinity *isn't* something that IRQ domains provided by modules can have, and just not export the setter function? [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]
On Tue, 2020-10-06 at 23:42 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > This is the mask of CPUs to which IRQs can be delivered without
> > interrupt remapping.
> >
> > +/* Mask of CPUs which can be targeted by non-remapped interrupts. */
> > +cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL };
>
> What?

By default, if we didn't hit any limits, all CPUs can be targeted by
external interrupts. It's the default today. Or at least we pretend it
is, modulo the bugs :)

> > #ifdef CONFIG_X86_32
> >
> > /*
> > @@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void)
> > static __init void try_to_enable_x2apic(int remap_mode)
> > {
> >         u32 apic_limit = 0;
> > +       int i;
> >
> >         if (x2apic_state == X2APIC_DISABLED)
> >                 return;
> > @@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode)
> >         if (apic_limit)
> >                 x2apic_set_max_apicid(apic_limit);
> >
> > +       /* Build the affinity mask for interrupts that can't be remapped. */
> > +       cpumask_clear(&x86_non_ir_cpumask);
> > +       i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit);
> > +       for ( ; i >= 0; i--) {
> > +               if (cpu_physical_id(i) <= apic_limit)
> > +                       cpumask_set_cpu(i, &x86_non_ir_cpumask);
> > +       }
>
> Blink. If the APIC id is not linear with the cpu numbers then this
> results in a reduced addressable set of CPUs. WHY?

Hm, good question. That loop was cargo-culted from hyperv-iommu.c;
perhaps it makes more sense there because Hyper-V really does promise
that linearity, or perhaps it was already buggy. Will fix.

In fact, in apic.c I could probably just use the cpuid_to_apicid array
which is right there in the file.

> > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > index aa9a3b54a96c..4d0ef46fedb9 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
> >         struct irq_alloc_info info;
> >
> >         ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
> > +       if (domain->parent == x86_vector_domain)
> > +               info.mask = &x86_non_ir_cpumask;
>
> We are not going to sprinkle such domain checks all over the
> place. Again, the mask is a property of the interrupt domain.

Yeah, that's a hangover from the first attempts which I forgot to
delete.
On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > When interrupt remapping isn't enabled, only the first 255 CPUs can
>
> No, only CPUs with an APICid < 255 ....

Ack.

> > receive external interrupts. Set the appropriate max affinity for
> > the IOAPIC and MSI IRQ domains accordingly.
> >
> > This also fixes the case where interrupt remapping is enabled but some
> > devices are not within the scope of any active IOMMU.
>
> What? If this fixes a pre-existing problem then
>
>   1) Explain the problem proper
>   2) Have a patch at the beginning of the series which fixes it
>      independently of this pile
>
> If it's fixing a problem in your pile, then you got the ordering wrong.

It's not that simple; there's not a single patch which fixes that and
which can go first.

I can, and do, fix the "no IR" case in a simple patch that goes first,
simply by restricting the kernel to the APIC IDs which can be targeted.

This is the case I called out in the cover letter:

   This patch series implements a per-domain "maximum affinity" set and
   uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
   allowing more CPUs to be used without interrupt remapping, this also
   fixes the case where some IOAPICs or PCI devices aren't actually in
   scope of any active IOMMU and are operating without remapping.

To fix *that* case, we really do need the whole series giving us per-
domain restricted affinity, and to use it for those MSIs/IOAPICs that
the IRQ remapping doesn't cover.

> You didn't start kernel programming as of yesterday, so you really know
> how that works.
>
> >         ip->irqdomain->parent = parent;
> > +       if (parent == x86_vector_domain)
> > +               irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>
> OMG

This IOAPIC function may or may not be behind remapping.

> >         if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
> >             cfg->type == IOAPIC_DOMAIN_STRICT)
> > diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> > index 4d891967bea4..af5ce5c4da02 100644
> > --- a/arch/x86/kernel/apic/msi.c
> > +++ b/arch/x86/kernel/apic/msi.c
> > @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
> >                 pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
> >         } else {
> >                 d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>
> So here it's unconditional

Yes, this code is only for the non-remapped case and there's a separate
arch_create_remap_msi_irq_domain() function a few lines further down
which creates the IR-PCI-MSI IRQ domain. So no need for a condition
here.

> >         }
> >         return d;
> > }
> > @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
> >                 irq_domain_free_fwnode(fn);
> >                 kfree(domain_info);
> >         }
> > +       if (parent == x86_vector_domain)
> > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>
> And here we need a condition again. Completely obvious and reviewable - NOT.

And HPET may or may not be behind remapping, so again it needs the
condition. I had figured that part was fairly obvious but can note it
in the commit comment.
On 05/10/20 17:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This allows the host to indicate that IOAPIC and MSI emulation supports
> 15-bit destination IDs, allowing up to 32Ki CPUs without remapping.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> Documentation/virt/kvm/cpuid.rst | 4 ++++
> arch/x86/include/uapi/asm/kvm_para.h | 1 +
> arch/x86/kernel/kvm.c | 6 ++++++
> 3 files changed, 11 insertions(+)
>
> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> index a7dff9186bed..1726b5925d2b 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
> async pf acknowledgment msr
> 0x4b564d07.
>
> +KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
> + before using extended destination
> + ID bits in MSI address bits 11-5.
> +
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
> per-cpu warps are expected in
> kvmclock
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 812e9b4c1114..950afebfba88 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -32,6 +32,7 @@
> #define KVM_FEATURE_POLL_CONTROL 12
> #define KVM_FEATURE_PV_SCHED_YIELD 13
> #define KVM_FEATURE_ASYNC_PF_INT 14
> +#define KVM_FEATURE_MSI_EXT_DEST_ID 15
>
> #define KVM_HINTS_REALTIME 0
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 1b51b727b140..4986b4399aef 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -743,12 +743,18 @@ static void __init kvm_init_platform(void)
> x86_platform.apic_post_init = kvm_apic_init;
> }
>
> +static bool __init kvm_msi_ext_dest_id(void)
> +{
> + return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
> +}
> +
> const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> .name = "KVM",
> .detect = kvm_detect,
> .type = X86_HYPER_KVM,
> .init.guest_late_init = kvm_guest_init,
> .init.x2apic_available = kvm_para_available,
> + .init.msi_ext_dest_id = kvm_msi_ext_dest_id,
> .init.init_platform = kvm_init_platform,
> };
>
>
Looks like the rest of the series needs some more work, but anyway:
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo
On Wed, 2020-10-07 at 10:14 +0200, Paolo Bonzini wrote:
> Looks like the rest of the series needs some more work, but anyway:
>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks. Yeah, I was expecting the per-irqdomain affinity support to
take a few iterations. But this part, still sticking with the current
behaviour of only allowing CPUs to come online at all if they can be
reached by all interrupts, can probably go in first.

It's presumably (hopefully!) a blocker for the qemu patch which exposes
the same feature bit defined in this patch.
On 07/10/20 10:59, David Woodhouse wrote:
> Yeah, I was expecting the per-irqdomain affinity support to take a few
> iterations. But this part, still sticking with the current behaviour of
> only allowing CPUs to come online at all if they can be reached by all
> interrupts, can probably go in first.
>
> It's presumably (hopefully!) a blocker for the qemu patch which exposes
> the same feature bit defined in this patch.
Yeah, though we could split it further and get the documentation part in
first. That would let the QEMU part go through.
Paolo
On Wed, 2020-10-07 at 13:15 +0200, Paolo Bonzini wrote:
> On 07/10/20 10:59, David Woodhouse wrote:
> > Yeah, I was expecting the per-irqdomain affinity support to take a few
> > iterations. But this part, still sticking with the current behaviour of
> > only allowing CPUs to come online at all if they can be reached by all
> > interrupts, can probably go in first.
> >
> > It's presumably (hopefully!) a blocker for the qemu patch which exposes
> > the same feature bit defined in this patch.
>
> Yeah, though we could split it further and get the documentation part in
> first. That would let the QEMU part go through.

Potentially. Although I've worked out that the first patch in my
series, adding x2apic_set_max_apicid(), is actually a bug fix because
it fixes the behaviour if you only *hotplug* CPUs with APIC IDs > 255
and there were none of them present at boot time.

So I'll post this set on its own to start with, and then focus on the
per-irqdomain affinity support after that.

David Woodhouse (5):
  x86/apic: Fix x2apic enablement without interrupt remapping
  x86/msi: Only use high bits of MSI address for DMAR unit
  x86/ioapic: Handle Extended Destination ID field in RTE
  x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
  x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID
On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> This is the case I called out in the cover letter:
>
>    This patch series implements a per-domain "maximum affinity" set and
>    uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
>    allowing more CPUs to be used without interrupt remapping, this also
>    fixes the case where some IOAPICs or PCI devices aren't actually in
>    scope of any active IOMMU and are operating without remapping.
>
> To fix *that* case, we really do need the whole series giving us per-
> domain restricted affinity, and to use it for those MSIs/IOAPICs that
> the IRQ remapping doesn't cover.

Which do not exist today.

>> >         ip->irqdomain->parent = parent;
>> > +       if (parent == x86_vector_domain)
>> > +               irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>>
>> OMG
>
> This IOAPIC function may or may not be behind remapping.

>> >                 d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
>> > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>
>> So here it's unconditional
>
> Yes, this code is only for the non-remapped case and there's a separate
> arch_create_remap_msi_irq_domain() function a few lines further down
> which creates the IR-PCI-MSI IRQ domain. So no need for a condition
> here.

>> > +       if (parent == x86_vector_domain)
>> > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>
>> And here we need a condition again. Completely obvious and reviewable - NOT.
>
> And HPET may or may not be behind remapping so again needs the
> condition. I had figured that part was fairly obvious but can note it
> in the commit comment.

And all of this is completely wrong to begin with.

The information has to be a property of the relevant irq domains, and
the hierarchy allows you nicely to retrieve it from there instead of
sprinkling this all over the place.

Thanks,

        tglx
On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > This is the case I called out in the cover letter:
> >
> >    This patch series implements a per-domain "maximum affinity" set and
> >    uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
> >    allowing more CPUs to be used without interrupt remapping, this also
> >    fixes the case where some IOAPICs or PCI devices aren't actually in
> >    scope of any active IOMMU and are operating without remapping.
> >
> > To fix *that* case, we really do need the whole series giving us per-
> > domain restricted affinity, and to use it for those MSIs/IOAPICs that
> > the IRQ remapping doesn't cover.
>
> Which do not exist today.

Sure. But at patch 10/13 into this particular patch series, it *does*
exist.

(Ignoring, for the moment, that I'm about to rewrite half the preceding
patches to take a different approach)

> > > >         ip->irqdomain->parent = parent;
> > > > +       if (parent == x86_vector_domain)
> > > > +               irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
> > >
> > > OMG
> >
> > This IOAPIC function may or may not be behind remapping.
>
> > > >                 d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> > > > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> > >
> > > So here it's unconditional
> >
> > Yes, this code is only for the non-remapped case and there's a separate
> > arch_create_remap_msi_irq_domain() function a few lines further down
> > which creates the IR-PCI-MSI IRQ domain. So no need for a condition
> > here.
> >
> > > > +       if (parent == x86_vector_domain)
> > > > +               irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> > >
> > > And here we need a condition again. Completely obvious and reviewable - NOT.
> >
> > And HPET may or may not be behind remapping so again needs the
> > condition. I had figured that part was fairly obvious but can note it
> > in the commit comment.
>
> And all of this is completely wrong to begin with.
>
> The information has to be a property of the relevant irq domains, and
> the hierarchy allows you nicely to retrieve it from there instead of
> sprinkling this all over the place.

No. This is not a property of the parent domain per se. Especially if
you're thinking that we could inherit the affinity mask from the
parent, then twice no.

This is a property of the MSI domain itself, and how many bits of
destination ID the hardware at *this* level can interpret and pass on
to the parent domain.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> > From: David Woodhouse <dwmw@amazon.co.uk>
>> >
>> > This is the maximum possible set of CPUs which can be used. Use it
>> > to calculate the default affinity requested from __irq_alloc_descs()
>> > by first attempting to find the intersection with irq_default_affinity,
>> > or falling back to using just the max_affinity if the intersection
>> > would be empty.
>>
>> And why do we need that as yet another argument?
>>
>> This is an optional property of the irq domain, really and no caller has
>> any business with that.
>
> Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> an argument. It's more of an internal function, which is only non-
> static because it's used from kernel/irq/ipi.c too for some reason. If
> we convert the IPI code to just call __irq_alloc_descs() directly,
> perhaps then we can actually make irq_domain_alloc_descs() static.

What is preventing you from changing the function signature? But
handing down irqdomain here is not cutting it. The right thing to do is
to replace 'struct irq_affinity_desc *affinity' with something more
flexible.

>> > int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
>> > -                          int node, const struct irq_affinity_desc *affinity)
>> > +                          int node, const struct irq_affinity_desc *affinity,
>> > +                          const struct cpumask *max_affinity)
>> > {
>> > +       cpumask_var_t default_affinity;
>> >         unsigned int hint;
>> > +       int i;
>> > +
>> > +       /* Check requested per-IRQ affinities are in the possible range */
>> > +       if (affinity && max_affinity) {
>> > +               for (i = 0; i < cnt; i++)
>> > +                       if (!cpumask_subset(&affinity[i].mask, max_affinity))
>> > +                               return -EINVAL;
>>
>> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>>
>> What is preventing the affinity spreading code from spreading the masks
>> out to unusable CPUs? The changelog is silent about that part.
>
> I'm coming to the conclusion that we should allow unusable CPUs to be
> specified at this point, just as we do offline CPUs. That's largely
> driven by the realisation that our x86_non_ir_cpumask is only going to
> contain online CPUs anyway, and hotplugged CPUs only get added to it as
> they are brought online.

Can you please stop looking at this from an x86-only perspective. It's
largely irrelevant what particular needs x86 or virt or whatever has.

Fact is, that if there are CPUs which cannot be targeted by device
interrupts then the multiqueue affinity mechanism has to be fixed to
handle this. Right now it's just broken.

Passing yet more cpumasks and random pointers around through device
drivers and whatever is just not going to happen. Neither are we going
to have arch_can_be_used_for_device_interrupts_mask or whatever you
come up with and claim it to be 'generic'.

The whole affinity control mechanism needs to be refactored from the
ground up, and the information about CPUs which can be targeted has to
be retrievable through the irqdomain hierarchy. Anything else is just
tinkering and I have zero interest in mopping up after you.

It's pretty obvious that the irq domains are the right place to store
that information:

  const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
  {
          while (d) {
                  if (d->get_possible_affinity)
                          return d->get_possible_affinity(d);
                  d = d->parent;
          }
          return cpu_possible_mask;
  }

So if you look at X86 then you have either:

   [VECTOR] ----------------- [IO/APIC]
                          |-- [MSI]
                          |-- [WHATEVER]

or

   [VECTOR] ---[REMAP]------- [IO/APIC]
            |             |-- [MSI]
            |----------------[WHATEVER]

So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
then irqdomain_get_possible_affinity() will return the correct result
independent of whether remapping is enabled or not.

This allows using that for other things like per-node restrictions or
whatever people come up with, without sprinkling more insanities
through the tree.

Thanks,

        tglx
On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> To fix *that* case, we really do need the whole series giving us per-
>>> domain restricted affinity, and to use it for those MSIs/IOAPICs that
>>> the IRQ remapping doesn't cover.
>>
>> Which do not exist today.
>
> Sure. But at patch 10/13 into this particular patch series, it *does*
> exist.

As I told you before: Your ordering is wrong. We do not introduce bugs
first and then fix them later ....

>> And all of this is completely wrong to begin with.
>>
>> The information has to be a property of the relevant irq domains, and
>> the hierarchy allows you nicely to retrieve it from there instead of
>> sprinkling this all over the place.
>
> No. This is not a property of the parent domain per se. Especially if
> you're thinking that we could inherit the affinity mask from the
> parent, then twice no.
>
> This is a property of the MSI domain itself, and how many bits of
> destination ID the hardware at *this* level can interpret and pass on
> to the parent domain.

Errm, what?

The MSI domain does not know anything about what the underlying domain
can handle, and it shouldn't.

If MSI is on top of remapping then the remapping domain defines what
the MSI domain can do, and not the other way round. Ditto for the non-
remapped case, in which the vector domain decides.

The topmost MSI irq chip does not even have a compose function, neither
for the remap nor for the vector case. The composition is done by the
parent domain from the data which the parent domain constructed. Same
for the IO/APIC, just less clearly separated.

The topmost chip just takes what the underlying domain constructed and
writes it to the message store, because that's what the topmost chip
controls. It does not control the content.

Thanks,

        tglx
On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > This is the maximum possible set of CPUs which can be used. Use it
> > > > to calculate the default affinity requested from __irq_alloc_descs()
> > > > by first attempting to find the intersection with irq_default_affinity,
> > > > or falling back to using just the max_affinity if the intersection
> > > > would be empty.
> > >
> > > And why do we need that as yet another argument?
> > >
> > > This is an optional property of the irq domain, really and no caller has
> > > any business with that.
> >
> > Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> > an argument. It's more of an internal function, which is only non-
> > static because it's used from kernel/irq/ipi.c too for some reason. If
> > we convert the IPI code to just call __irq_alloc_descs() directly,
> > perhaps then we can actually make irq_domain_alloc_descs() static.
>
> What is preventing you from changing the function signature? But handing
> down irqdomain here is not cutting it. The right thing to do is to
> replace 'struct irq_affinity_desc *affinity' with something more
> flexible.

Yeah, although I do think I want to ditch this part completely, and
treat the "possible" mask for the domain very much more like we do
cpu_online_mask. In that we can allow affinities to be *requested*
which are outside it, and they can just never be *effective* while
those CPUs aren't present and reachable.

That way a lot of the nastiness about preparing an "acceptable" mask in
advance can go away.

It's *also* driven, as I noted, by the realisation that on x86, the
x86_non_ir_cpumask will only ever contain those CPUs which have been
brought online so far and meet the criteria... but won't that be true
on *any* system where CPU numbers are virtual and not 1:1 mapped with
whatever determines reachability by the IRQ domain? It isn't *such* an
x86ism, surely?

> Fact is, that if there are CPUs which cannot be targeted by device
> interrupts then the multiqueue affinity mechanism has to be fixed to
> handle this. Right now it's just broken.

I think part of the problem there is that I don't really understand how
this part is *supposed* to work. I was focusing on getting the simple
case working first, in the expectation that we'd come back to that part
and you'd keep me honest. Is there some decent documentation on this
that I'm missing?

> It's pretty obvious that the irq domains are the right place to store
> that information:
>
>   const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
>   {
>           while (d) {
>                   if (d->get_possible_affinity)
>                           return d->get_possible_affinity(d);
>                   d = d->parent;
>           }
>           return cpu_possible_mask;
>   }

Sure.

> So if you look at X86 then you have either:
>
>    [VECTOR] ----------------- [IO/APIC]
>                           |-- [MSI]
>                           |-- [WHATEVER]
>
> or
>
>    [VECTOR] ---[REMAP]------- [IO/APIC]
>             |             |-- [MSI]
>             |----------------[WHATEVER]

Hierarchically, theoretically the IOAPIC and HPET really ought to be
children of the MSI domain. It's the Compatibility MSI which has the
restriction on destination ID, because of how many bits it interprets
from the MSI address. HPET and IOAPIC are just generating MSIs that hit
that upstream limit.

> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
> then irqdomain_get_possible_affinity() will return the correct result
> independent of whether remapping is enabled or not.

Sure. Although VECTOR doesn't have the restriction. As noted, it's the
Compatibility MSI that does. So the diagram looks something like this:

   [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
        |                             |---------- [ IR-PCI-MSI ]
        |                             |---------- [ IR-IOAPIC ]
        |
        |-------- [ COMPAT-MSI ] ---- [ HPET ]
                        |------------ [ PCI-MSI ]
                        |------------ [ IOAPIC ]

In this diagram, it's COMPAT-MSI that has the affinity restriction,
inherited by its children.

Now, I understand that you're not keen on IOAPIC actually being a child
of the MSI domain, and that's fine. In Linux right now, those generic
'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
same 8-bit affinity limit for themselves, as their parent is the VECTOR
domain.

I suppose it *might* not hurt to pretend that VECTOR does indeed have
the limit, and to *remove* it in the remapping domain. And then the
affinity limit could be removed in one place by the REMAP domain
because even now in Linux's imprecise hierarchy, the IR-HPET,
IR-PCI-MSI and IR-IOAPIC domains *are* children of that.

But honestly, I'd rather iterate and get the generic irqdomain support
for affinity limits into decent shape before I worry *too* much about
precisely which way round it gets used by x86.
On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> > > > To fix *that* case, we really do need the whole series giving us per-
> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
> > > > the IRQ remapping doesn't cover.
> > >
> > > Which do not exist today.
> >
> > Sure. But at patch 10/13 into this particular patch series, it *does*
> > exist.
>
> As I told you before: Your ordering is wrong. We do not introduce bugs
> first and then fix them later ....

I didn't introduce that bug; it's been there for years. Fixing it
properly requires per-irqdomain affinity limits.

There's a cute little TODO at least in the Intel irq-remapping driver,
noting that we should probably check if there are any IOAPICs that
aren't in the scope of any DRHD at all. But that's all. If it happens,
I think we'll end up doing the right thing and instantiating a non-IR
IOAPIC domain for it, and if we *don't* have any CPU with an APIC ID
above... (checks notes)... 7 ... then it'll just about work out.

Over 7 and we're screwed (because logical mode; see also
https://git.infradead.org/users/dwmw2/linux.git/commit/429f0c33f for a
straw man, but that's getting even further down the rabbit hole).
On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> > > The information has to be a property of the relevant irq domains, and
> > > the hierarchy allows you nicely to retrieve it from there instead of
> > > sprinkling this all over the place.
> >
> > No. This is not a property of the parent domain per se. Especially if
> > you're thinking that we could inherit the affinity mask from the
> > parent, then twice no.
> >
> > This is a property of the MSI domain itself, and how many bits of
> > destination ID the hardware at *this* level can interpret and pass on
> > to the parent domain.
>
> Errm, what?
>
> The MSI domain does not know anything about what the underlying domain
> can handle, and it shouldn't.
>
> If MSI is on top of remapping then the remapping domain defines what the
> MSI domain can do, and not the other way round. Ditto for the non-
> remapped case, in which the vector domain decides.
>
> The topmost MSI irq chip does not even have a compose function, neither
> for the remap nor for the vector case. The composition is done by the
> parent domain from the data which the parent domain constructed. Same
> for the IO/APIC, just less clearly separated.
>
> The topmost chip just takes what the underlying domain constructed and
> writes it to the message store, because that's what the topmost chip
> controls. It does not control the content.

Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
Linux doesn't really match my understanding of the real hardware, or
how qemu emulates it either. But it is at least internally self-
consistent, and in this model it probably does make some sense to claim
the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
the remapping irqdomain.

Not really the important part to deal with right now, either way.
On Wed, Oct 07 2020 at 16:05, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> The topmost MSI irq chip does not even have a compose function, neither
>> for the remap nor for the vector case. The composition is done by the
>> parent domain from the data which the parent domain constructed. Same
>> for the IO/APIC, just less clearly separated.
>>
>> The topmost chip just takes what the underlying domain constructed and
>> writes it to the message store, because that's what the topmost chip
>> controls. It does not control the content.
>
> Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
> Linux doesn't really match my understanding of the real hardware, or
> how qemu emulates it either. But it is at least internally self-
> consistent, and in this model it probably does make some sense to claim
> the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
> the remapping irqdomain.

It's clearly how the hardware works. MSI has a message store of some
sort, and if the entry is enabled then the MSI chip (in PCI or
elsewhere) will send exactly the message which is in that message
store. It knows absolutely nothing about what the message means and how
it is composed. The only things which MSI knows about are whether the
message address is 64bit wide or not, whether the entries are maskable
or not, and how many entries it can store.

Software allocates a message target at the underlying irq domain
(vector or remap) and that underlying irq domain defines the
properties.

If qemu emulates it differently then it's qemu's problem, but that does
not make it in any way something which influences the irq domain
abstractions, which are correctly modeled after how the hardware works.

> Not really the important part to deal with right now, either way.

Oh yes it is. We define that upfront and not after the fact.

Thanks,

        tglx
On Wed, 2020-10-07 at 17:25 +0200, Thomas Gleixner wrote:
> It's clearly how the hardware works. MSI has a message store of some
> sort, and if the entry is enabled then the MSI chip (in PCI or
> elsewhere) will send exactly the message which is in that message
> store. It knows absolutely nothing about what the message means and how
> it is composed. The only things which MSI knows about are whether the
> message address is 64bit wide or not, whether the entries are maskable
> or not, and how many entries it can store.
>
> Software allocates a message target at the underlying irq domain (vector
> or remap) and that underlying irq domain defines the properties.
>
> If qemu emulates it differently then it's qemu's problem, but that does
> not make it in any way something which influences the irq domain
> abstractions, which are correctly modeled after how the hardware works.
>
> > Not really the important part to deal with right now, either way.
>
> Oh yes it is. We define that upfront and not after the fact.

The way the hardware works is that something handles physical address
cycles to addresses in the (on x86) 0xFEExxxxx range, and turns them
into actual interrupts on the appropriate CPU — where the APIC ID and
vector (etc.) are directly encoded in the bits of the address and the
data written. That compatibility x86 APIC MSI format is where the 8-bit
(or 15-bit) limit comes from.

Then interrupt remapping comes along, and now those physical address
cycles are actually handled by the IOMMU — which can either handle the
compatibility format as before, or use a different format of
address/data bits and perform a lookup in its IRTE table.

The PCI MSI domain, HPET, and even the IOAPIC are just the things out
there on the bus which might perform those physical address cycles. And
yes, as you say, they're just a message store sending exactly the
message that was composed for them. They know absolutely nothing about
what the message means and how it is composed.

It so happens that in Linux, we don't really architect the software
like that. So each of the PCI MSI domain, HPET, and IOAPIC has its
*own* message composer, which has the same limits and composes
basically the same messages as if it was *their* format, not dictated
to them by the APIC upstream. And that's what we're both getting our
panties in a knot about, I think.

It really doesn't matter that much to the underlying generic irqdomain
support for limited affinities. Except that you want to make the
generic code support the concept of a child domain supporting *more*
CPUs than its parent, which really doesn't make much sense if you think
about it.

But it isn't that hard to do, and if it means we don't have to argue
any more about the x86 hierarchy not matching the hardware then it's a
price worth paying.
On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>> What is preventing you to change the function signature? But handing
>> down irqdomain here is not cutting it. The right thing to do is to
>> replace 'struct irq_affinity_desc *affinity' with something more
>> flexible.
>
> Yeah, although I do think I want to ditch this part completely, and
> treat the "possible" mask for the domain very much more like we do
> cpu_online_mask. In that we can allow affinities to be *requested*
> which are outside it, and they can just never be *effective* while
> those CPUs aren't present and reachable.

Huch?

> That way a lot of the nastiness about preparing an "acceptable" mask in
> advance can go away.

There is not lots of nastiness.

> It's *also* driven, as I noted, by the realisation that on x86, the
> x86_non_ir_cpumask will only ever contain those CPUs which have been
> brought online so far and meet the criteria... but won't that be true
> on *any* system where CPU numbers are virtual and not 1:1 mapped with
> whatever determines reachability by the IRQ domain? It isn't *such* an
> x86ism, surely?

Yes, but that's exactly the reason why I don't want to have whatever
mask name you chose to be directly exposed and want it to be part of
the irq domains because then you can express arbitrary per domain
limits.

>> Fact is, that if there are CPUs which cannot be targeted by device
>> interrupts then the multiqueue affinity mechanism has to be fixed to
>> handle this. Right now it's just broken.
>
> I think part of the problem there is that I don't really understand how
> this part is *supposed* to work. I was focusing on getting the simple
> case working first, in the expectation that we'd come back to that part
> and you'd keep me honest. Is there some decent documentation on this
> that I'm missing?

TLDR & HTF;

Multiqueue devices want to have at max 1 queue per CPU or if the device
has less queues than CPUs they want the queues to have a fixed
association to a set of CPUs.

At setup time this is established considering possible CPUs to handle
'physical' hotplug correctly.

If a queue has no online CPUs it cannot be started. If it's active and
the last CPU goes down then it's quiesced and stopped and the core code
shuts down the interrupt and does not move it to a still online CPU.

So with your hackery, we end up in a situation where we have a large
possible mask, but not all CPUs in that mask can be reached, which
means in a 1 queue per CPU scenario all unreachable CPUs would have
dysfunctional queues.

So that spreading algorithm needs to know about this limitation.

>> So if you look at X86 then you have either:
>>
>>    [VECTOR] ----------------- [IO/APIC]
>>           |------------------ [MSI]
>>           |------------------ [WHATEVER]
>>
>> or
>>
>>    [VECTOR] --- [REMAP] ----- [IO/APIC]
>>           |           |------ [MSI]
>>           |------------------ [WHATEVER]
>
> Hierarchically, theoretically the IOAPIC and HPET really ought to be
> children of the MSI domain. It's the Compatibility MSI which has the
> restriction on destination ID, because of how many bits it interprets
> from the MSI address. HPET and IOAPIC are just generating MSIs that hit
> that upstream limit.

We kinda have that, but not nicely abstracted. But we surely can and
should fix that.

>> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
>> then irqdomain_get_possible_affinity() will return the correct result
>> independent whether remapping is enabled or not.
>
> Sure. Although VECTOR doesn't have the restriction. As noted, it's the
> Compatibility MSI that does. So the diagram looks something like this:
>
>  [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
>           |              |                |---- [ IR-PCI-MSI ]
>           |              |                |---- [ IR-IOAPIC ]
>           |
>           |---- [ COMPAT-MSI ] ---- [ HPET ]
>                              |----- [ PCI-MSI ]
>                              |----- [ IOAPIC ]
>
> In this diagram, it's COMPAT-MSI that has the affinity restriction,
> inherited by its children.
>
> Now, I understand that you're not keen on IOAPIC actually being a child
> of the MSI domain, and that's fine. In Linux right now, those generic
> 'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
> compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
> same 8-bit affinity limit for themselves, as their parent is the VECTOR
> domain.

No. We fix it proper and not by hacking around it.

> I suppose it *might* not hurt to pretend that VECTOR does indeed have
> the limit, and to *remove* it in the remapping domain. And then the
> affinity limit could be removed in one place by the REMAP domain
> because even now in Linux's imprecise hierarchy, the IR-HPET,
> IR-PCI-MSI and IR-IOAPIC domains *are* children of that.

It's not rocket science to fix that as the irq remapping code already
differentiates between the device types.

Thanks,

        tglx
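[Editor's note: the inheritance argument above, where a restriction is claimed once (e.g. by a COMPAT-MSI domain) and inherited by every child, amounts to intersecting masks up the parent chain. A toy userspace sketch of that idea, with invented struct and function names and cpumasks modelled as plain 64-bit words:]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model: a hierarchy of irq domains, each optionally limiting
 * which CPUs interrupts can target. 0 means "no limit of its own". */
struct domain {
    const char *name;
    struct domain *parent;
    uint64_t max_affinity;   /* bitmask of reachable CPUs, 0 = unlimited */
};

/* Effective affinity is the intersection of every limit on the
 * path from the domain up to the root. */
static uint64_t effective_affinity(const struct domain *d)
{
    uint64_t mask = ~UINT64_C(0);

    for (; d; d = d->parent) {
        if (d->max_affinity)
            mask &= d->max_affinity;
    }
    return mask;
}
```

With a COMPAT-MSI domain limited to CPUs 0-7 (mask 0xFF), an IOAPIC child inherits 0xFF without declaring anything itself, while an IR-MSI domain under REMAP stays unrestricted; that is the "claim the limit in one place" property being argued for.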
On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>> > > > To fix *that* case, we really do need the whole series giving us per-
>> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
>> > > > the IRQ remapping doesn't cover.
>> > >
>> > > Which do not exist today.
>> >
>> > Sure. But at patch 10/13 into this particular patch series, it *does*
>> > exist.
>>
>> As I told you before: Your ordering is wrong. We do not introduce bugs
>> first and then fix them later ....
>
> I didn't introduce that bug; it's been there for years. Fixing it
> properly requires per-irqdomain affinity limits.
>
> There's a cute little TODO at least in the Intel irq-remapping driver,
> noting that we should probably check if there are any IOAPICs that
> aren't in the scope of any DRHD at all. But that's all.
So someone forgot to remove the cute little TODO when this was added:
	if (parse_ioapics_under_ir()) {
		pr_info("Not enabling interrupt remapping\n");
		goto error;
	}
Thanks,
tglx
On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>>> What is preventing you to change the function signature? But handing
>>> down irqdomain here is not cutting it. The right thing to do is to
>>> replace 'struct irq_affinity_desc *affinity' with something more
>>> flexible.
>>
>> Yeah, although I do think I want to ditch this part completely, and
>> treat the "possible" mask for the domain very much more like we do
>> cpu_online_mask. In that we can allow affinities to be *requested*
>> which are outside it, and they can just never be *effective* while
>> those CPUs aren't present and reachable.
>
>Huch?
>
>> That way a lot of the nastiness about preparing an "acceptable" mask in
>> advance can go away.
>
>There is not lots of nastiness.

OK, but I think we do have to cope with the fact that the limit is
dynamic, and a CPU might be added which widens the mask. I think that's
fundamental and not x86-specific.

>> It's *also* driven, as I noted, by the realisation that on x86, the
>> x86_non_ir_cpumask will only ever contain those CPUs which have been
>> brought online so far and meet the criteria... but won't that be true
>> on *any* system where CPU numbers are virtual and not 1:1 mapped with
>> whatever determines reachability by the IRQ domain? It isn't *such* an
>> x86ism, surely?
>
>Yes, but that's exactly the reason why I don't want to have whatever
>mask name you chose to be directly exposed and want it to be part of
>the irq domains because then you can express arbitrary per domain
>limits.

The x86_non_ir_cpumask isn't intended to be directly exposed to any
generic IRQ code. It's set up by the x86 APIC code so that those x86
IRQ domains which are affected by it, can set it with
irqdomain_set_max_affinity() for the generic code to see. Without each
having to create the cpumask from scratch for themselves.

> ... reading for once the kids are elsewhere...

Thanks.

>It's not rocket science to fix that as the irq remapping code already
>differentiates between the device types.

I don't actually like that very much. The IRQ remapping code could just
compose the MSI messages that it desires without really having to care
so much about the child device.

The bit where IRQ remapping actually composes IOAPIC RTEs makes me
particularly sad.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 7 October 2020 17:02:59 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner
><tglx@linutronix.de> wrote:
>>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> > > > To fix *that* case, we really do need the whole series giving
>us per-
>>> > > > domain restricted affinity, and to use it for those
>MSIs/IOAPICs that
>>> > > > the IRQ remapping doesn't cover.
>>> > >
>>> > > Which do not exist today.
>>> >
>>> > Sure. But at patch 10/13 into this particular patch series, it
>*does*
>>> > exist.
>>>
>>> As I told you before: Your ordering is wrong. We do not introduce
>bugs
>>> first and then fix them later ....
>>
>> I didn't introduce that bug; it's been there for years. Fixing it
>> properly requires per-irqdomain affinity limits.
>>
>> There's a cute little TODO at least in the Intel irq-remapping
>driver,
>> noting that we should probably check if there are any IOAPICs that
>> aren't in the scope of any DRHD at all. But that's all.
>
>So someone forgot to remove the cute little TODO when this was added:
>
> if (parse_ioapics_under_ir()) {
> pr_info("Not enabling interrupt remapping\n");
> goto error;
> }
And HPET, and PCI devices including those that might be hotplugged in future and not be covered by any extant IOMMU's scope?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 07 2020 at 16:46, David Woodhouse wrote:
> The PCI MSI domain, HPET, and even the IOAPIC are just the things out
> there on the bus which might perform those physical address cycles. And
> yes, as you say they're just a message store sending exactly the
> message that was composed for them. They know absolutely nothing about
> what the message means and how it is composed.

That's what I said.

> It so happens that in Linux, we don't really architect the software
> like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> *own* message composer which has the same limits and composes basically
> the same messages as if it was *their* format, not dictated to them by
> the APIC upstream. And that's what we're both getting our panties in a
> knot about, I think.

Are you actually reading what I write and caring to look at the code?

PCI-MSI does not have a compose message callback in the irq chip. The
message is composed by the underlying parent domain. Same for HPET.

The only dodgy part is the IO/APIC for hysterical raisins and because I
did not come around yet to sort that out.

> It really doesn't matter that much to the underlying generic irqdomain
> support for limited affinities. Except that you want to make the
> generic code support the concept of a child domain supporting *more*
> CPUs than its parent, which really doesn't make much sense if you think
> about it.

Right. So we really want to stick the restriction into a compat-MSI
domain to make stuff match reality and to avoid banging the head
against the wall sooner than later.

Thanks,

        tglx
On Wed, 2020-10-07 at 19:23 +0200, Thomas Gleixner wrote:
> > It so happens that in Linux, we don't really architect the software
> > like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> > *own* message composer which has the same limits and composes basically
> > the same messages as if it was *their* format, not dictated to them by
> > the APIC upstream. And that's what we're both getting our panties in a
> > knot about, I think.
>
> Are you actually reading what I write and caring to look at the code?
>
> PCI-MSI does not have a compose message callback in the irq chip. The
> message is composed by the underlying parent domain. Same for HPET.
>
> The only dodgy part is the IO/APIC for hysterical raisins and because
> I did not come around yet to sort that out.

Right. And the problem is that the "underlying parent domain" in this
case is actually x86_vector_domain.

Whereas it probably makes more sense, at least in theory, for there to
be an *intermediate* domain responsible for composing the Compat MSI
messages. Both the Compat-MSI and the IR-MSI domains would be children
of the generic x86_vector_domain, and then any given HPET, IOAPIC or
PCI-MSI domain would be a child of either one of those generic MSI
domains.

> > It really doesn't matter that much to the underlying generic irqdomain
> > support for limited affinities. Except that you want to make the
> > generic code support the concept of a child domain supporting *more*
> > CPUs than its parent, which really doesn't make much sense if you think
> > about it.
>
> Right. So we really want to stick the restriction into a compat-MSI
> domain to make stuff match reality and to avoid banging the head against
> the wall sooner than later.

Right.
On Wed, Oct 07 2020 at 17:11, David Woodhouse wrote:
> On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>There is not lot's of nastiness.
>
> OK, but I think we do have to cope with the fact that the limit is
> dynamic, and a CPU might be added which widens the mask. I think
> that's fundamental and not x86-specific.
Yes, but it needs some thoughts vs. serialization against CPU hotplug.
The stability of that cpumask is architecture specific if the calling
context cannot serialize against CPU hotplug. The hot-unplug case is
less interesting because the mask does not shrink, it only gets wider.
That's why I want a callback instead of a pointer assignment and that
callback cannot return a pointer to something which can be modified
concurrently, it needs to update a caller provided mask so that the
result is at least consistent in itself.
It just occurred to me that there is something which needs some more
thought vs. CPU hotunplug as well:
Right now we just check whether there are enough vectors available on
the remaining online CPUs so that the interrupts which have an
effective affinity directed to the outgoing CPU can be migrated away
(modulo the managed ones).
That check obviously needs to take the target CPU restrictions into
account to prevent that devices end up with no target at the end.
I bet there are some other interesting bits and pieces which need
some care...
Thanks,
tglx
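[Editor's note: the "update a caller provided mask" callback style described above can be sketched in a userspace toy model. All names here are invented for illustration; in the kernel both the update and the snapshot would run under CPU-hotplug serialization, and the mask would be a real cpumask rather than a 64-bit word.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a domain whose reachable-CPU mask can only widen as
 * CPUs are registered, never shrink on unplug. */
struct toy_domain {
    uint64_t reachable;   /* bitmask of CPUs this domain can target */
};

/* The callback style argued for above: copy into a caller-provided
 * mask, so the caller holds a self-consistent snapshot instead of a
 * pointer into data that hotplug may modify concurrently. In the
 * kernel this copy would happen under the hotplug serialization. */
static void toy_get_possible_affinity(const struct toy_domain *d,
                                      uint64_t *mask)
{
    *mask = d->reachable;
}

/* CPU registration (boot or 'physical' hotplug): widen the mask. */
static void toy_cpu_register(struct toy_domain *d, unsigned int cpu)
{
    d->reachable |= UINT64_C(1) << cpu;
}
```

The design point is that a snapshot taken before a hot-add stays internally consistent even though it is stale, which is exactly what returning a bare pointer to the live mask cannot guarantee.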
On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
>
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
>
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
>
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
>
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> dysfunctional queues.
>
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned to point to an
offline CPU(s), but it cannot actually be used until/unless at least
one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.

The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want to carefully create the affinity masks at setup time so
that they don't include any unreachable CPUs... that basically means we
don't include any non-present CPUs at all (unless they've been added
once and then removed). That's because — at least on x86 — we don't
assign CPU numbers to CPUs which haven't yet been added. Theoretically
we *could* but if there are more than NR_CPUS listed in (e.g.) the MADT
then we could run out of CPU numbers. Then if the admin hotplugs one of
the CPUs we *didn't* have space for, we'd not be able to handle it.

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so some not-yet-assigned CPU#s are for low APICIDs and some
are for high APICIDs but it doesn't actually have to be a 1:1
predetermined mapping. But those really do seem like hacks which might
only apply on x86, while the generic approach of treating "reachable"
like "online" seems like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but does include absent CPUs in a mask that *if* they are ever
added, wouldn't be able to receive the IRQ. Which does mean we'd have
to refrain from bringing up the corresponding queue.
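[Editor's note: the "filter the known-unreachable CPUs out of the prebuilt affinities" idea sketched above reduces to simple mask arithmetic, the userspace analogue of what cpumask_andnot() does in the kernel. Function names are invented and masks are modelled as 64-bit words:]

```c
#include <assert.h>
#include <stdint.h>

/* Subtract the CPUs *known* to be unreachable from a prebuilt
 * queue affinity mask. */
static uint64_t filter_reachable(uint64_t prebuilt,
                                 uint64_t known_unreachable)
{
    return prebuilt & ~known_unreachable;
}

/* Per the discussion above, a queue could only be brought up if at
 * least one CPU in its filtered mask is actually online. */
static int queue_can_start(uint64_t prebuilt, uint64_t known_unreachable,
                           uint64_t online)
{
    return (filter_reachable(prebuilt, known_unreachable) & online) != 0;
}
```

Note the caveat from the email: CPUs that are merely "not yet known" stay in the prebuilt mask, so a later hot-add of an unreachable CPU is what forces the refrain-from-starting check.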
On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>>
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>>
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>>
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> dysfunctional queues.
>>
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want to carefully create the affinity masks at setup time so
> that they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue.

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queues assigned affinity mask.
As of today it also guarantees that these CPUs are reachable once they
come online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not it has to redirect any requests to
    some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
    will shutdown because the outgoing CPU is the last reachable in the
    group and if there are still online but unreachable CPUs then use
    the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

    - added support in multi-queue devices which utilize managed
      interrupts

    - audited the whole tree for other assumptions related to the
      reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires to have either the domain affinity limitation for
    HPET in place or just to force disable HPET or at least HPET-MSI
    which is a reasonable tradeoff.

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

        tglx
On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
> The overall conclusion for this is:
>
>  1) X2APIC support on bare metal w/o irq remapping is not going to
>     happen unless you:
>
>     - added support in multi-queue devices which utilize managed
>       interrupts
>
>     - audited the whole tree for other assumptions related to the
>       reachability of possible CPUs.
>
>     I'm not expecting you to be done with that before I retire so for
>     me it's just not going to happen :)

Makes sense. It probably does mean we should add a BUG_ON for the case
where IRQ remapping *is* enabled but any device is found which isn't
behind it. But that's OK.

>  2) X2APIC support on VIRT is possible if the extended ID magic is
>     supported by the hypervisor because that does not make any CPU
>     unreachable for MSI and therefore the multi-queue muck and
>     everything else just works.
>
>     This requires to have either the domain affinity limitation for
>     HPET in place or just to force disable HPET or at least HPET-MSI
>     which is a reasonable tradeoff.
>
>     HPET is not required for guests which have kvmclock and
>     APIC/deadline timer and known (hypervisor provided) frequencies.

HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
*actual* MSI domain. The address/data in the MSI message are completely
opaque to it, and if the parent domain happens to put meaningful
information into bits 11-5 of the MSI address, the HPET won't even
notice.

The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
address; it's not doing bit-swizzling like the IOAPIC does, which might
potentially *not* have been able to set certain bits in the MSI.
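[Editor's note: the "bits 11-5" scheme referred to here, and in the cover letter's 15-bit extended destination ID, can be illustrated with the bit arithmetic. The sketch below assumes the layout discussed in this thread: APIC ID bits 7-0 in MSI address bits 19-12 as in the classic compatibility format, APIC ID bits 14-8 in address bits 11-5. Helper names are invented; this is not the kernel's actual interface.]

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_BASE 0xfee00000u

/* Compose a compatibility-format MSI address carrying a 15-bit
 * destination APIC ID, per the extended-destination-ID layout
 * discussed above. */
static uint32_t compose_msi_addr(uint32_t apicid)
{
    return MSI_ADDR_BASE |
           ((apicid & 0xff) << 12) |        /* destination ID [7:0]  */
           (((apicid >> 8) & 0x7f) << 5);   /* destination ID [14:8] */
}

/* Recover the APIC ID from such an address, as the platform (or a
 * hypervisor advertising the feature) would interpret it. */
static uint32_t msi_addr_to_apicid(uint32_t addr)
{
    return ((addr >> 12) & 0xff) | (((addr >> 5) & 0x7f) << 8);
}
```

Since the HPET's Tn_FSB_INT_ADDR register stores all 32 address bits verbatim, an address composed this way passes through it untouched, which is why HPET-MSI is unaffected by the extension.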
On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote:
> On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
>> The overall conclusion for this is:
>>
>>  1) X2APIC support on bare metal w/o irq remapping is not going to
>>     happen unless you:
>>
>>     - added support in multi-queue devices which utilize managed
>>       interrupts
>>
>>     - audited the whole tree for other assumptions related to the
>>       reachability of possible CPUs.
>>
>>     I'm not expecting you to be done with that before I retire so for
>>     me it's just not going to happen :)
>
> Makes sense. It probably does mean we should add a BUG_ON for the case
> where IRQ remapping *is* enabled but any device is found which isn't
> behind it. But that's OK.

We can kinda gracefully handle that. See the completely untested and
incomplete patch below.

>>  2) X2APIC support on VIRT is possible if the extended ID magic is
>>     supported by the hypervisor because that does not make any CPU
>>     unreachable for MSI and therefore the multi-queue muck and
>>     everything else just works.
>>
>>     This requires to have either the domain affinity limitation for
>>     HPET in place or just to force disable HPET or at least HPET-MSI
>>     which is a reasonable tradeoff.
>>
>>     HPET is not required for guests which have kvmclock and
>>     APIC/deadline timer and known (hypervisor provided) frequencies.
>
> HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
> *actual* MSI domain. The address/data in the MSI message are completely
> opaque to it, and if the parent domain happens to put meaningful
> information into bits 11-5 of the MSI address, the HPET won't even
> notice.
>
> The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
> address; it's not doing bit-swizzling like the IOAPIC does, which might
> potentially *not* have been able to set certain bits in the MSI.

Indeed. I thought it was crippled in some way, but you're right it has
all the bits.
Thanks,

        tglx
---
Subject: x86/iommu: Make interrupt remapping more robust
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 08 Oct 2020 14:09:44 +0200

Needs to be split into pieces and cover PCI proper. Right now PCI gets a
NULL pointer assigned which makes it explode at the wrong place
later. Also hyperv iommu wants some love.

NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/io_apic.c      |    4 +++-
 arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
 drivers/iommu/amd/iommu.c           |    6 +++---
 drivers/iommu/intel/irq_remapping.c |    4 ++--
 4 files changed, 22 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
 	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
 	info.devid = mpc_ioapic_id(ioapic);
 	parent = irq_remapping_get_irq_domain(&info);
-	if (!parent)
+	if (IS_ERR(parent))
+		return PTR_ERR(parent);
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		name = "IO-APIC-IR";
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
 struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
 	struct msi_domain_info *domain_info;
+	struct fwnode_handle *fn = NULL;
 	struct irq_domain *parent, *d;
 	struct irq_alloc_info info;
-	struct fwnode_handle *fn;
 
 	if (x86_vector_domain == NULL)
 		return NULL;
@@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
 	init_irq_alloc_info(&info, NULL);
 	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
 	info.devid = hpet_id;
+
 	parent = irq_remapping_get_irq_domain(&info);
-	if (parent == NULL)
+	if (IS_ERR(parent))
+		goto fail;
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		hpet_msi_controller.name = "IR-HPET-MSI";
 
 	fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
 					      hpet_id);
-	if (!fn) {
-		kfree(domain_info);
-		return NULL;
-	}
+	if (!fn)
+		goto fail;
 
 	d = msi_create_irq_domain(fn, domain_info, parent);
-	if (!d) {
-		irq_domain_free_fwnode(fn);
-		kfree(domain_info);
-	}
+	if (!d)
+		goto fail;
 	return d;
+
+fail:
+	irq_domain_free_fwnode(fn);
+	kfree(domain_info);
+	return NULL;
 }
 
 int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain
 	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
 
 	if (!iommu)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
@@ -3565,7 +3565,7 @@ static struct irq_domain *get_irq_domain
 		return iommu->ir_domain;
 	default:
 		WARN_ON_ONCE(1);
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	}
 }
 
@@ -3578,7 +3578,7 @@ static struct irq_domain *get_irq_domain
 	devid = get_devid(info);
 
 	if (devid < 0)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 
 	return get_irq_domain_for_devid(info, devid);
 }
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -212,7 +212,7 @@ static struct irq_domain *map_hpet_to_ir
 		if (ir_hpet[i].id == hpet_id && ir_hpet[i].iommu)
 			return ir_hpet[i].iommu->ir_domain;
 	}
-	return NULL;
+	return ERR_PTR(-ENODEV);
 }
 
 static struct intel_iommu *map_ioapic_to_iommu(int apic)
@@ -230,7 +230,7 @@ static struct irq_domain *map_ioapic_to_
 {
 	struct intel_iommu *iommu = map_ioapic_to_iommu(apic);
 
-	return iommu ? iommu->ir_domain : NULL;
+	return iommu ? iommu->ir_domain : ERR_PTR(-ENODEV);
 }
 
 static struct irq_domain *map_dev_to_ir(struct pci_dev *dev)
On Thu, 2020-10-08 at 14:40 +0200, Thomas Gleixner wrote:
> Subject: x86/iommu: Make interrupt remapping more robust
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 08 Oct 2020 14:09:44 +0200
>
> Needs to be split into pieces and cover PCI proper. Right now PCI gets a
> NULL pointer assigned which makes it explode at the wrong place
> later. Also hyperv iommu wants some love.
>
> NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/apic/io_apic.c      |    4 +++-
>  arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
>  drivers/iommu/amd/iommu.c           |    6 +++---
>  drivers/iommu/intel/irq_remapping.c |    4 ++--
>  4 files changed, 22 insertions(+), 16 deletions(-)
>
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
>  	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
>  	info.devid = mpc_ioapic_id(ioapic);
>  	parent = irq_remapping_get_irq_domain(&info);
> -	if (!parent)
> +	if (IS_ERR(parent))
> +		return PTR_ERR(parent);
> +	else if (!parent)
>  		parent = x86_vector_domain;
>  	else
>  		name = "IO-APIC-IR";
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
>  struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  {
>  	struct msi_domain_info *domain_info;
> +	struct fwnode_handle *fn = NULL;
>  	struct irq_domain *parent, *d;
>  	struct irq_alloc_info info;
> -	struct fwnode_handle *fn;
>
>  	if (x86_vector_domain == NULL)
>  		return NULL;
> @@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
>  	init_irq_alloc_info(&info, NULL);
>  	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
>  	info.devid = hpet_id;
> +
>  	parent = irq_remapping_get_irq_domain(&info);
> -	if (parent == NULL)
> +	if (IS_ERR(parent))
> +		goto fail;
> +	else if (!parent)
>  		parent = x86_vector_domain;
>  	else
>  		hpet_msi_controller.name = "IR-HPET-MSI";

Hrm... this is the '-ENODEV changes' that you wanted me to pick up,
right?

I confess I don't like it very much. I'd much prefer to be able to use
a generic foo_get_parent_irq_domain() function which just returns an
answer or an error. Definitely none of this "if it returns NULL, caller
fills it in for themselves with some magic global because we're
special". And I don't much like abusing the irq_alloc_info for this
either. I didn't like the irq_alloc_info very much in its *original*
use cases, for actually allocating IRQs. Abusing it for this is worse.

I'm more than happy to start digging into this, but once I do I fear
I'm not going to stop until HPET and IOAPIC don't know *anything* about
whether they're behind IR or not.

And yes, I know that IOAPIC has a different EOI method for IR but
AFAICT that's *already* hosed for Hyper-V because it doesn't *really*
do remapping and so the RTEs *can* change there to move interrupts
around. There's more differences between ioapic_ack_level() and
ioapic_ir_ack_level() than the erratum workaround and whether we use
data->entry.vector... aren't there?

For the specific case you were targeting with this patch, you already
successfully persuaded me it should never happen, and there's a WARN_ON
which would trigger if it ever does (with too many CPUs in the system).
Let's come back and tackle this can of worms in a separate cleanup
series.