iommu.lists.linux-foundation.org archive mirror
* [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI
@ 2024-04-05 22:30 Jacob Pan
  2024-04-05 22:30 ` [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
                   ` (12 more replies)
  0 siblings, 13 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:30 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Hi Thomas and all,

This patch set aims to improve IRQ throughput on Intel Xeon by making use of
posted interrupts.

I presented this topic in a session at the LPC2023 IOMMU/VFIO/PCI MC:

https://lpc.events/event/17/sessions/172/#20231115

Background
==========
On modern x86 server SoCs, interrupt remapping (IR) is required and turned
on by default to support x2APIC. IOMMU/VT-d supports two interrupt remapping
modes:

- Remappable 	(host)
- Posted	(guest only so far)

In remappable mode, delivering a device MSI to a CPU is a pure HW flow with no
system software touch points; it roughly goes as follows:

1.	Devices issue interrupt requests with writes to 0xFEEx_xxxx
2.	The system agent accepts and remaps/translates the IRQ
3.	Upon receiving the translation response, the system agent notifies the
	destination CPU with the translated MSI
4.	CPU's local APIC accepts interrupts into its IRR/ISR registers
5.	Interrupt delivered through IDT (MSI vector)
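For illustration, the 0xFEEx_xxxx write in step 1 encodes the destination in
the address bits. The sketch below is a user-space toy decoder (the struct and
function names are our own), following the compatibility-format MSI address
layout documented in the Intel SDM:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: decode the fields of a compatibility-format x86 MSI
 * address (a write into the 0xFEEx_xxxx interrupt window). Field positions
 * follow the Intel SDM; the names here are invented for this sketch. */
struct msi_addr {
	uint8_t dest_id;	/* bits 19:12 - destination APIC ID */
	int redir_hint;		/* bit 3 - redirection hint */
	int dest_mode;		/* bit 2 - 0 = physical, 1 = logical */
};

static int msi_addr_decode(uint32_t addr, struct msi_addr *out)
{
	if ((addr >> 20) != 0xFEE)
		return -1;	/* not an interrupt-window write */
	out->dest_id = (addr >> 12) & 0xFF;
	out->redir_hint = (addr >> 3) & 1;
	out->dest_mode = (addr >> 2) & 1;
	return 0;
}
```

With interrupt remapping enabled, the address instead carries an index into
the IOMMU's interrupt remapping table, but the 0xFEEx_xxxx window is the same.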

The above process can be inefficient under high IRQ rates. The notifications in
step #3 are often unnecessary when the destination CPU is already overwhelmed
by bursts of IRQs. On some architectures, such as Intel Xeon, step #3 is also
expensive and requires strong ordering w.r.t. DMA. As a result, slower IRQ
rates can become a limiting factor for DMA I/O performance.

For example, on Intel Xeon Sapphire Rapids SoC, as more NVMe disks are attached
to the same socket, FIO (libaio engine) 4K block random read performance
per-disk drops quickly.

# of disks  	2  	4  	8
-------------------------------------
IOPS(million)  	1.991	1.136  	0.834
(NVMe Gen 5 Samsung PM174x)

With posted mode enabled in interrupt remapping, the interrupt flow is divided
into two parts: posting (storing pending IRQ vector information in memory) and
CPU notification.

The above remappable IRQ flow becomes the following (steps 1 and 2 unchanged):
3.	The IOMMU atomically swaps/stores the IRQ status into the
	memory-resident posted interrupt descriptor (PID), then notifies the
	destination CPU with a dedicated notification vector
	- the CPU notification is suppressed while one is already outstanding
4.	CPU's local APIC accepts the notification interrupt into its IRR/ISR
	registers
5.	Interrupt delivered through IDT (notification vector handler)
	System SW allows new notifications by clearing the outstanding
	notification (ON) bit in the PID.
(The above flow is not in Linux today since posted mode is only used for VMs)

Note that system software can now suppress CPU notifications at runtime as
needed. This allows it to coalesce the expensive CPU notifications and, in
turn, improve IRQ throughput and DMA performance.
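The posting/coalescing rule can be modeled in a few lines of user-space C (a
toy sketch with invented names, not kernel code): a CPU notification is sent
only when the ON bit makes a 0 -> 1 transition, which is exactly the N vs. N'
distinction in the burst diagram below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (names are our own) of the posting step: record the vector in
 * the PIR, then send a CPU notification only if the Outstanding
 * Notification (ON) bit was previously clear. Posts that find ON already
 * set are coalesced into the pending notification. */
struct toy_pid {
	_Atomic uint64_t pir[4];	/* 256 pending-vector bits */
	atomic_bool on;			/* outstanding notification bit */
};

/* Returns true when a CPU notification would be sent. */
static bool toy_post_vector(struct toy_pid *pid, unsigned vec)
{
	atomic_fetch_or(&pid->pir[vec / 64], 1ull << (vec % 64));
	/* exchange returns the previous ON value; notify on 0 -> 1 only */
	return !atomic_exchange(&pid->on, true);
}
```

In the real flow, the notification handler clears ON again before EOI, which
re-arms the next notification.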

Consider the following scenario when MSIs arrive at a CPU in high-frequency
bursts:

Time ----------------------------------------------------------------------->
    	^ ^ ^       	^ ^ ^ ^     	^   	^
MSIs	A B C       	D E F G     	H   	I

RI  	N  N'  N'     	N  N'  N'  N'  	N   	N

PI  	N           	N           	N   	N

RI: remappable interrupt;  PI:  posted interrupt;
N: interrupt notification, N': superfluous interrupt notification

With remappable interrupts (row titled RI), every MSI generates a notification
event to the CPU.

With posted interrupts enabled by this patch set (row titled PI), CPU
notifications are coalesced during IRQ bursts: the N' notifications in the flow
above are eliminated. We refer to this mechanism as Coalesced Interrupt
Delivery (CID).

Posted interrupts have existed for a long time; they have been used for
virtualization, where MSIs from directly assigned devices can be delivered to
the guest kernel without VMM intervention. On Intel x86 platforms, posted
interrupts can be used on the host as well; only the host physical address of
the posted interrupt descriptor (PID) is used.

This patch set enables a new usage of posted interrupts on existing (and new)
hardware for host kernel device MSIs. It is referred to as Posted MSIs
throughout this patch set.

Performance (with this patch set):
==================================
Test #1. NVMe FIO

FIO libaio (million IOPS/sec/disk) Gen 5 NVMe Samsung PM174x disks on a single
socket, Intel Xeon Sapphire Rapids. Random read with 4k block size. NVMe IRQ
affinity is managed by the kernel with one vector per CPU.

#disks	Before		After		%Gain
---------------------------------------------
8	0.834		1.943		132%
4	1.136		2.023		78%

Other observations:
- Increasing the block size shows diminishing benefits, e.g. with 4 NVMe disks
on one x16 PCIe slot, the combined IOPS look like:

    Block Size	Baseline	PostedMSI
    -------------------------------------
    4K		6475		8778
    8K		5727		5896
    16k		2864		2900
    32k		1546		1520
    128k	397		398

- Submission/completion latency (usec) also improved, at 4K block size only
  FIO reported SLAT
  ---------------------------------------
  Block Size	Baseline	postedMSI
  4k		2177		2282
  8k		4416		3967
  16k		2950		3053
  32k		3453		3505
  128k		5911		5801

  FIO report CLAT
  ---------------------------------------
  Block Size	Baseline	postedMSI
  4k		313		230
  8k		352		343
  16k		711		702
  32k		1320		1343
  128k		5146		5137


Test #2. Intel Data Streaming Accelerator

Two dedicated workqueues from two PCI root complex integrated endpoint
(RCIEP) devices, with the IRQ affinity of the two interrupts pinned to a
single CPU.

				Before		After		%Gain
				-------------------------------------
DSA memfill (mil IRQs/sec)	5.157		8.987		74%

DMA throughput has similar improvements.

At lower IRQ rates (< 1 million/second), neither performance benefits nor
regressions have been observed so far.

No-harm tests were also performed to ensure there is no performance regression
on workloads that do not have a high interrupt rate. These tests include:
- kernel compile time
- file copy
- FIO NVMe random writes

Implementation choices:
======================
- Transparent to the device drivers

- System-wide option instead of per-device or per-IRQ opt-in, i.e. once
  enabled, all device MSIs are posted. The benefit is that we only need to
  change the IR irq_chip and domain layer; there is no change to PCI MSI.
  Exceptions are: IOAPIC, HPET, and VT-d's own IRQs

- Limit the number of polling/demuxing loops per CPU notification event

- Only change Intel-IR in the IRQ domain hierarchy VECTOR->INTEL-IR->PCI-MSI

- x86 Intel only so far; can be extended to other architectures with posted
  interrupt support (ARM and AMD), RFC.

- Bare metal only, no posted interrupt capable virtual IOMMU.


Changes and implications (moving from remappable to posted mode)
================================================================
1. All MSI vectors are multiplexed into a single notification vector for each
CPU. MSI vectors are then de-multiplexed by SW; there is no IDT delivery for
individual MSIs.

2. The following features are lost compared to remappable mode (AFAIK, none of
them matters for device MSIs):
- Control of the delivery mode, e.g. NMI for MSIs
- Logical destinations; the posted interrupt destination is the x2APIC
  physical APIC ID
- Per-vector stacks, since all MSI vectors are multiplexed into one


Runtime changes
===============
The IRQ runtime behavior has changed with this patch set; here is a pseudo
trace comparison for 3 MSIs with different vectors arriving in a burst on the
same CPU. A system vector interrupt (e.g. timer) arrives randomly.

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(timer)

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()


AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
interrupt(timer)
        process_softirq()
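The demux step in the AFTER trace can be sketched in user-space C as follows
(a toy model with our own names, not the actual patch code): the handler
drains the PIR with xchg, so vectors posted concurrently are never lost, and
then services every pending vector before a single EOI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy user-space model of the notification handler's demux step. */
#define TOY_PIR_WORDS 4			/* 4 x 64 bits = 256 vectors */

static unsigned toy_handled_count;

static void toy_handler(unsigned vec)
{
	(void)vec;			/* a real handler dispatches by vector */
	toy_handled_count++;
}

static unsigned toy_demux(_Atomic uint64_t pir[TOY_PIR_WORDS],
			  void (*handler)(unsigned))
{
	unsigned handled = 0;

	for (unsigned i = 0; i < TOY_PIR_WORDS; i++) {
		/* xchg claims all vectors in this word in one shot, so a
		 * concurrent poster never loses a bit. */
		uint64_t pending = atomic_exchange(&pir[i], 0);

		while (pending) {
			unsigned bit = (unsigned)__builtin_ctzll(pending);

			handler(i * 64 + bit);
			pending &= pending - 1;	/* clear lowest set bit */
			handled++;
		}
	}
	return handled;
}
```

This is why the trace shows three handler() calls under one irq_enter()/
apic_eoi() pair.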

With posted MSI (as pointed out by Thomas Gleixner), both high-priority
interrupts (system interrupt vectors) and softIRQs are blocked during the MSI
vector demux loop. Some of them can be timing sensitive.

Here are the options I have attempted or am still working on:

1. Use a self-IPI to invoke the MSI vector handler, but that took away the
majority of the performance benefits.

2. Limit the number of demuxing loops; this is implemented in this patch set.
Note that today we already allow one low-priority MSI to block system
interrupts. A system vector can preempt MSI vectors without waiting for EOI,
but we have IRQs disabled in the ISR.

Performance data (on DSA with MEMFILL) also shows that coalescing more than 3
loops yields diminishing benefits. Therefore, the maximum number of coalescing
loops is set to 3 in this patch set.
	MaxLoop		IRQ/sec		bandwidth Mbps
-------------------------------------------------------------------------
	2		6157107 		25219
	3		6226611 		25504
	4		6557081 		26857
	5		6629683 		27155
	6		6662425 		27289
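A bounded version of the demux loop, corresponding to option 2 above, might
look like the sketch below (invented names, user-space toy code; the limit of
3 comes from the text above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the bounded coalescing loop: keep re-draining a PIR word while
 * new interrupts arrive, but give up after TOY_MAX_LOOP passes so that
 * system vectors and softirqs are not starved indefinitely. */
#define TOY_MAX_LOOP 3

static unsigned toy_bd_count;

static void toy_bd_handler(unsigned vec)
{
	(void)vec;
	toy_bd_count++;
}

static unsigned toy_bounded_demux(_Atomic uint64_t *pir_word,
				  void (*handler)(unsigned))
{
	unsigned handled = 0;

	for (int loop = 0; loop < TOY_MAX_LOOP; loop++) {
		uint64_t pending = atomic_exchange(pir_word, 0);

		if (!pending)
			break;		/* fully drained, stop early */

		while (pending) {
			unsigned bit = (unsigned)__builtin_ctzll(pending);

			handler(bit);
			pending &= pending - 1;
			handled++;
		}
	}
	/* Anything posted after the last pass is picked up by the next
	 * notification, since ON is cleared before EOI in the real flow. */
	return handled;
}
```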

3. Limit the time that system interrupts can be blocked (WIP).

In addition, posted MSI uses atomic xchg from both the CPU and the IOMMU.
Compared to remappable mode, there may be additional cache line ownership
contention over the PID. However, we have not observed performance regressions
at lower IRQ rates. At high interrupt rates, posted mode always wins.

Testing:
========

The following tests have been performed and continue to be evaluated.
- IRQ affinity change, migration
- CPU offlining
- Multi vector coalescing
- Low IRQ rate, general no-harm test
- VM device assignment via VFIO
- General no-harm test; performance regressions have not been observed for low
IRQ rate workloads


With this patch set, a new entry is added to /proc/interrupts:
cat /proc/interrupts | grep PMN
PMN:         13868907 Posted MSI notification event

No change to the device MSI accounting.

A new INTEL-IR-POST irq_chip is visible in IRQ debugfs, e.g.
domain:  IR-PCI-MSIX-0000:6f:01.0-12
 hwirq:   0x8
 chip:    IR-PCI-MSIX-0000:6f:01.0
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-12-13
     hwirq:   0x90000
     chip:    INTEL-IR-POST /* For posted MSIs */
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x65
         chip:    APIC


Acknowledgment
==============

- Rajesh Sankaran and Ashok Raj for the original idea

- Thomas Gleixner for reviewing and guiding the upstream direction of the PoC
  patches, and for correcting my many misunderstandings of the IRQ subsystem

- Jie J Yan (Jeff), Sebastien Lemarie, and Dan Liang for performance
  evaluation with NVMe and network workloads

- Bernice Zhang and Scott Morris for functional validation

- Michael Prinke for helping me understand how the VT-d HW works

- Sanjay Kumar for providing the DSA IRQ test suite


Changelogs:

V2:
- Code change logs are in individual patches.
- Use "Originally-by" and "Suggested-by" tags to clarify
  credits/responsibilities.
- More performance evaluation done with a FIO 4K random read test: four
  Samsung PM174x NVMe drives on a single x16 PCIe Gen 5 slot, with the CPU
  frequency fixed at 2.7GHz (P1, the highest non-turbo frequency).

      	    	IOPS*	CPU%	sys%	 user%	Ints/sec	IOPS/CPU	LAT**
AIO (before)	6231	55.5	39.7	 15.8	5714721		112.2702703	328
AIO (after)	8936	71.5	51.5	 20	7397543		124.979021	229

IOURING(before)	6880	43.7	30.3	13.4	6512402		157.4370709	149
IOURING(after)	8688	58.3	41.3	17	7625158		149.0222985	118

IOURING POLLEDQ	13100	100	85.1	14.9	8000		131		156

* x1000 4 drives combined
** 95% usec.

This patch set improves IOPS and IRQ throughput, and reduces latency for
non-polled queues.

V1 (since RFC)
   - Removed mentioning of wishful features, IRQ preemption, separate and
     full MSI vector space
   - Refined MSI handler de-multiplexing loop based on suggestions from
     Peter and Thomas. Reduced xchg() usage and code duplication
   - Assign the new posted IR irq_chip only to device MSI/x, avoid changing
     IO-APIC code
   - Extract and use common code for preventing lost interrupt during
     affinity change
   - Added more test results to the cover letter



Thanks,

Jacob



Jacob Pan (13):
  x86/irq: Move posted interrupt descriptor out of vmx code
  x86/irq: Unionize PID.PIR for 64bit access w/o casting
  x86/irq: Remove bitfields in posted interrupt descriptor
  x86/irq: Add a Kconfig option for posted MSI
  x86/irq: Reserve a per CPU IDT vector for posted MSIs
  x86/irq: Set up per host CPU posted interrupt descriptors
  x86/irq: Factor out calling ISR from common_interrupt
  x86/irq: Install posted MSI notification handler
  x86/irq: Factor out common code for checking pending interrupts
  x86/irq: Extend checks for pending vectors to posted interrupts
  iommu/vt-d: Make posted MSI an opt-in cmdline option
  iommu/vt-d: Add an irq_chip for posted MSIs
  iommu/vt-d: Enable posted mode for device MSIs

 .../admin-guide/kernel-parameters.txt         |   1 +
 arch/x86/Kconfig                              |  11 ++
 arch/x86/include/asm/apic.h                   |  12 ++
 arch/x86/include/asm/hardirq.h                |   6 +
 arch/x86/include/asm/idtentry.h               |   3 +
 arch/x86/include/asm/irq_remapping.h          |  11 ++
 arch/x86/include/asm/irq_vectors.h            |   9 +-
 arch/x86/include/asm/posted_intr.h            | 108 ++++++++++++
 arch/x86/kernel/apic/vector.c                 |   5 +-
 arch/x86/kernel/cpu/common.c                  |   3 +
 arch/x86/kernel/idt.c                         |   3 +
 arch/x86/kernel/irq.c                         | 164 ++++++++++++++++--
 arch/x86/kvm/vmx/posted_intr.c                |   4 +-
 arch/x86/kvm/vmx/posted_intr.h                |  93 +---------
 arch/x86/kvm/vmx/vmx.c                        |   3 +-
 arch/x86/kvm/vmx/vmx.h                        |   2 +-
 drivers/iommu/intel/irq_remapping.c           | 115 +++++++++++-
 drivers/iommu/irq_remapping.c                 |  13 +-
 tools/arch/x86/include/asm/irq_vectors.h      |   9 +-
 19 files changed, 457 insertions(+), 118 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
@ 2024-04-05 22:30 ` Jacob Pan
  2024-04-17  0:34   ` Sean Christopherson
  2024-04-05 22:30 ` [PATCH v2 02/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:30 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

To prepare for native usage of posted interrupts, move the posted interrupt
descriptor (PID) declarations out of VMX code so that they can be shared.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 88 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/posted_intr.h     | 93 +-----------------------------
 arch/x86/kvm/vmx/vmx.c             |  1 +
 arch/x86/kvm/vmx/vmx.h             |  2 +-
 4 files changed, 91 insertions(+), 93 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
new file mode 100644
index 000000000000..f0324c56f7af
--- /dev/null
+++ b/arch/x86/include/asm/posted_intr.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_POSTED_INTR_H
+#define _X86_POSTED_INTR_H
+
+#define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
+#define PID_TABLE_ENTRY_VALID 1
+
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		struct {
+				/* bit 256 - Outstanding Notification */
+			u16	on	: 1,
+				/* bit 257 - Suppress Notification */
+				sn	: 1,
+				/* bit 271:258 - Reserved */
+				rsvd_1	: 14;
+				/* bit 279:272 - Notification Vector */
+			u8	nv;
+				/* bit 287:280 - Reserved */
+			u8	rsvd_2;
+				/* bit 319:288 - Notification Destination */
+			u32	ndst;
+		};
+		u64 control;
+	};
+	u32 rsvd[6];
+} __aligned(64);
+
+static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
+{
+	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_set_on(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..6b2a0226257e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -1,98 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __KVM_X86_VMX_POSTED_INTR_H
 #define __KVM_X86_VMX_POSTED_INTR_H
-
-#define POSTED_INTR_ON  0
-#define POSTED_INTR_SN  1
-
-#define PID_TABLE_ENTRY_VALID 1
-
-/* Posted-Interrupt Descriptor */
-struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
-	union {
-		struct {
-				/* bit 256 - Outstanding Notification */
-			u16	on	: 1,
-				/* bit 257 - Suppress Notification */
-				sn	: 1,
-				/* bit 271:258 - Reserved */
-				rsvd_1	: 14;
-				/* bit 279:272 - Notification Vector */
-			u8	nv;
-				/* bit 287:280 - Reserved */
-			u8	rsvd_2;
-				/* bit 319:288 - Notification Destination */
-			u32	ndst;
-		};
-		u64 control;
-	};
-	u32 rsvd[6];
-} __aligned(64);
-
-static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
-}
-
-static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
-{
-	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
-}
-
-static inline void pi_set_sn(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_set_on(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_on(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_sn(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_on(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_sn(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
+#include <asm/posted_intr.h>
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c37a89eda90f..d94bb069bac9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -70,6 +70,7 @@
 #include "x86.h"
 #include "smm.h"
 #include "vmx_onhyperv.h"
+#include "posted_intr.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 65786dbe7d60..e133e8077e6d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -7,10 +7,10 @@
 #include <asm/kvm.h>
 #include <asm/intel_pt.h>
 #include <asm/perf_event.h>
+#include <asm/posted_intr.h>
 
 #include "capabilities.h"
 #include "../kvm_cache_regs.h"
-#include "posted_intr.h"
 #include "vmcs.h"
 #include "vmx_ops.h"
 #include "../cpuid.h"
-- 
2.25.1



* [PATCH v2 02/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
  2024-04-05 22:30 ` [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
@ 2024-04-05 22:30 ` Jacob Pan
  2024-04-05 22:31 ` [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor Jacob Pan
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:30 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Make the PIR fields accessible as u64 so that atomic xchg64 can be used
without ugly casting.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index f0324c56f7af..acf237b2882e 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -9,7 +9,10 @@
 
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		u32 pir[8];     /* Posted interrupt requested */
+		u64 pir64[4];
+	};
 	union {
 		struct {
 				/* bit 256 - Outstanding Notification */
-- 
2.25.1



* [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
  2024-04-05 22:30 ` [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
  2024-04-05 22:30 ` [PATCH v2 02/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-17  0:39   ` Sean Christopherson
  2024-04-05 22:31 ` [PATCH v2 04/13] x86/irq: Add a Kconfig option for posted MSI Jacob Pan
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

The mixture of bitfields and typed fields is weird and really not intuitive;
remove the bitfields and use typed data exclusively.

Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2:
	- Replace bitfields, no more mix.
---
 arch/x86/include/asm/posted_intr.h | 10 +---------
 arch/x86/kvm/vmx/posted_intr.c     |  4 ++--
 arch/x86/kvm/vmx/vmx.c             |  2 +-
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index acf237b2882e..c682c41d4e44 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -15,17 +15,9 @@ struct pi_desc {
 	};
 	union {
 		struct {
-				/* bit 256 - Outstanding Notification */
-			u16	on	: 1,
-				/* bit 257 - Suppress Notification */
-				sn	: 1,
-				/* bit 271:258 - Reserved */
-				rsvd_1	: 14;
-				/* bit 279:272 - Notification Vector */
+			u16	notif_ctrl; /* Suppress and outstanding bits */
 			u8	nv;
-				/* bit 287:280 - Reserved */
 			u8	rsvd_2;
-				/* bit 319:288 - Notification Destination */
 			u32	ndst;
 		};
 		u64 control;
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index af662312fd07..592dbb765675 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 		 * handle task migration (@cpu != vcpu->cpu).
 		 */
 		new.ndst = dest;
-		new.sn = 0;
+		new.notif_ctrl &= ~POSTED_INTR_SN;
 
 		/*
 		 * Restore the notification vector; in the blocking case, the
@@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 
-	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
+	WARN(pi_desc->notif_ctrl & POSTED_INTR_SN, "PI descriptor SN field set before blocking");
 
 	old.control = READ_ONCE(pi_desc->control);
 	do {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d94bb069bac9..50580bbfba5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4843,7 +4843,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	 * or POSTED_INTR_WAKEUP_VECTOR.
 	 */
 	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
-	vmx->pi_desc.sn = 1;
+	vmx->pi_desc.notif_ctrl |= POSTED_INTR_SN;
 }
 
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
-- 
2.25.1



* [PATCH v2 04/13] x86/irq: Add a Kconfig option for posted MSI
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (2 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-05 22:31 ` [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

This option will be used to support delivering MSIs as posted
interrupts. Interrupt remapping is required.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2: Remove x2apic dependency
---
 arch/x86/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 39886bab943a..f5688f4e299b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -464,6 +464,17 @@ config X86_X2APIC
 
 	  If you don't know what to do here, say N.
 
+config X86_POSTED_MSI
+	bool "Enable MSI and MSI-x delivery by posted interrupts"
+	depends on X86_64 && IRQ_REMAP
+	help
+	  This enables MSIs that are under interrupt remapping to be delivered as
+	  posted interrupts to the host kernel. Interrupt throughput can
+	  potentially be improved by coalescing CPU notifications during high
+	  frequency bursts.
+
+	  If you don't know what to do here, say N.
+
 config X86_MPPARSE
 	bool "Enable MPS table" if ACPI
 	default y
-- 
2.25.1



* [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (3 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 04/13] x86/irq: Add a Kconfig option for posted MSI Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-11 16:51   ` Thomas Gleixner
  2024-04-12  9:14   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

When posted MSI is enabled, all device MSIs are multiplexed into a single
notification vector. MSI handlers are then de-multiplexed at run time by
system software, without IDT delivery.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2:
 - Add missing CONFIG_ in #ifdef
 - Extend changes to x86 tools
---
 arch/x86/include/asm/irq_vectors.h       | 9 ++++++++-
 tools/arch/x86/include/asm/irq_vectors.h | 9 ++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index d18bfb238f66..1ee00be8218d 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -97,9 +97,16 @@
 
 #define LOCAL_TIMER_VECTOR		0xec
 
+/*
+ * Posted interrupt notification vector for all device MSIs delivered to
+ * the host kernel.
+ */
+#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
 #define NR_VECTORS			 256
 
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_X86_POSTED_MSI
+#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
+#elif defined(CONFIG_X86_LOCAL_APIC)
 #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
 #else
 #define FIRST_SYSTEM_VECTOR		NR_VECTORS
diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
index 3f73ac3ed3a0..989816ca7c9e 100644
--- a/tools/arch/x86/include/asm/irq_vectors.h
+++ b/tools/arch/x86/include/asm/irq_vectors.h
@@ -99,9 +99,16 @@
 
 #define LOCAL_TIMER_VECTOR		0xec
 
+/*
+ * Posted interrupt notification vector for all device MSIs delivered to
+ * the host kernel.
+ */
+#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
 #define NR_VECTORS			 256
 
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_X86_POSTED_MSI
+#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
+#elif defined(CONFIG_X86_LOCAL_APIC)
 #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
 #else
 #define FIRST_SYSTEM_VECTOR		NR_VECTORS
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (4 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-12  9:16   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

To support posted MSIs, create a posted interrupt descriptor (PID) for each
host CPU. Later on, when setting up IRQ CPU affinity, IOMMU's interrupt
remapping table entry (IRTE) will point to the physical address of the
matching CPU's PID.

Each PID is initialized with the owner CPU's physical APICID as the
destination.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2: Fix xAPIC destination ID assignment; Oliver Sang reported a failure on
a system with the x2APIC opt-out BIOS option.
---
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/posted_intr.h |  7 +++++++
 arch/x86/kernel/cpu/common.c       |  3 +++
 arch/x86/kernel/irq.c              | 23 +++++++++++++++++++++++
 4 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index fbc7722b87d1..01ea52859462 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -48,6 +48,9 @@ typedef struct {
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
+#ifdef CONFIG_X86_POSTED_MSI
+DECLARE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+#endif
 #define __ARCH_IRQ_STAT
 
 #define inc_irq_stat(member)	this_cpu_inc(irq_stat.member)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index c682c41d4e44..4e5eea2d20e0 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -80,4 +80,11 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 	return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern void intel_posted_msi_init(void);
+
+#else
+static inline void intel_posted_msi_init(void) {};
+
+#endif /* X86_POSTED_MSI */
 #endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5c1e6d6be267..1a5fb657e1b6 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -68,6 +68,7 @@
 #include <asm/traps.h>
 #include <asm/sev.h>
 #include <asm/tdx.h>
+#include <asm/posted_intr.h>
 
 #include "cpu.h"
 
@@ -2219,6 +2220,8 @@ void cpu_init(void)
 		barrier();
 
 		x2apic_setup();
+
+		intel_posted_msi_init();
 	}
 
 	mmgrab(&init_mm);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 35fde0107901..f39f6147104c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -22,6 +22,8 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/thermal.h>
+#include <asm/posted_intr.h>
+#include <asm/irq_remapping.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -334,6 +336,27 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 }
 #endif
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
+DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+
+void intel_posted_msi_init(void)
+{
+	u32 destination;
+	u32 apic_id;
+
+	this_cpu_write(posted_interrupt_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
+
+	/*
+	 * APIC destination ID is stored in bit 8:15 while in XAPIC mode.
+	 * VT-d spec. CH 9.11
+	 */
+	apic_id = this_cpu_read(x86_cpu_to_apicid);
+	destination = x2apic_enabled() ? apic_id : apic_id << 8;
+	this_cpu_write(posted_interrupt_desc.ndst, destination);
+}
+#endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
-- 
2.25.1



* [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (5 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-12  9:21   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Jacob Pan
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Prepare for calling external IRQ handlers directly from the posted MSI
demultiplexing loop. Extract the common code from common_interrupt() to
avoid code duplication.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/irq.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f39f6147104c..c54de9378943 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct irq_desc *desc,
 		__handle_irq(desc, regs);
 }
 
-/*
- * common_interrupt() handles all normal device IRQ's (the special SMP
- * cross-CPU interrupts have their own entry points).
- */
-DEFINE_IDTENTRY_IRQ(common_interrupt)
+static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 
-	/* entry code tells RCU that we're not quiescent.  Check it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
-
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
@@ -268,7 +260,20 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
 		}
 	}
+}
+
+/*
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
+ */
+DEFINE_IDTENTRY_IRQ(common_interrupt)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	/* entry code tells RCU that we're not quiescent.  Check it. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
+	call_irq_handler(vector, regs);
 	set_irq_regs(old_regs);
 }
 
-- 
2.25.1



* [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (6 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-11  7:52   ` Tian, Kevin
  2024-04-11 16:54   ` Thomas Gleixner
  2024-04-05 22:31 ` [PATCH v2 09/13] x86/irq: Factor out common code for checking pending interrupts Jacob Pan
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

All MSI vectors are multiplexed into a single notification vector when
posted MSI is enabled. It is the responsibility of the notification
vector handler to demultiplex MSI vectors. In this handler, for each
pending bit, MSI vector handlers are dispatched without IDT delivery.

For example, the interrupt flow will change as follows:
(3 MSIs of different vectors arrive in a high-frequency burst)

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

Except for the leading MSI, CPU notifications are skipped/coalesced.

For MSIs arriving at a low frequency, the demultiplexing loop does not
wait for more interrupts to coalesce. Therefore, there is no additional
latency beyond the processing time.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2:
   - Delete extra inline attribute
   - Fix pir pointer in xchg (Zeng Guang)
---
 arch/x86/include/asm/hardirq.h  |   3 +
 arch/x86/include/asm/idtentry.h |   3 +
 arch/x86/kernel/idt.c           |   3 +
 arch/x86/kernel/irq.c           | 113 ++++++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 01ea52859462..ee0c05366dec 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#ifdef CONFIG_X86_POSTED_MSI
+	unsigned int posted_msi_notification_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 749c7411d2f1..5a6fc7593cfc 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -701,6 +701,9 @@ DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR,		sysvec_error_interrupt);
 DECLARE_IDTENTRY_SYSVEC(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt);
 DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
 DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
+# ifdef CONFIG_X86_POSTED_MSI
+DECLARE_IDTENTRY_SYSVEC(POSTED_MSI_NOTIFICATION_VECTOR,	sysvec_posted_msi_notification);
+# endif
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index fc37c8d83daf..f445bec516a0 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -163,6 +163,9 @@ static const __initconst struct idt_data apic_idts[] = {
 # endif
 	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
 	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_POSTED_MSI
+	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
+# endif
 #endif
 };
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c54de9378943..8772b0a3abf6 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_POSTED_MSI
+	seq_printf(p, "%*s: ", prec, "PMN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->posted_msi_notification_count);
+	seq_puts(p, "  Posted MSI notification event\n");
 #endif
 	return 0;
 }
@@ -361,6 +368,112 @@ void intel_posted_msi_init(void)
 	destination = x2apic_enabled() ? apic_id : apic_id << 8;
 	this_cpu_write(posted_interrupt_desc.ndst, destination);
 }
+
+/*
+ * De-multiplexing posted interrupts is on the performance path, the code
+ * below is written to optimize the cache performance based on the following
+ * considerations:
+ * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
+ *   accessed by both CPU and IOMMU.
+ * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
+ *   for checking and clearing posted interrupt request (PIR), a 256 bit field
+ *   within the PID.
+ * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
+ *   line when posting interrupts and setting control bits.
+ * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
+ * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
+ *   cache line. The cache line states after each operation are as follows:
+ *   CPU		IOMMU			PID Cache line state
+ *   ---------------------------------------------------------------
+ *...read64					exclusive
+ *...lock xchg64				modified
+ *...			post/atomic swap	invalid
+ *...-------------------------------------------------------------
+ *
+ * To reduce L1 data cache miss, it is important to avoid contention with
+ * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
+ * to dispatch interrupt handlers.
+ *
+ * In addition, the code is trying to keep the cache line state consistent
+ * as much as possible. e.g. when making a copy and clearing the PIR
+ * (assuming non-zero PIR bits are present in the entire PIR), it does:
+ *		read, read, read, read, xchg, xchg, xchg, xchg
+ * instead of:
+ *		read, xchg, read, xchg, read, xchg, read, xchg
+ */
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+{
+	int i, vec = FIRST_EXTERNAL_VECTOR;
+	unsigned long pir_copy[4];
+	bool handled = false;
+
+	for (i = 0; i < 4; i++)
+		pir_copy[i] = pir[i];
+
+	for (i = 0; i < 4; i++) {
+		if (!pir_copy[i])
+			continue;
+
+		pir_copy[i] = arch_xchg(&pir[i], 0);
+		handled = true;
+	}
+
+	if (handled) {
+		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
+			call_irq_handler(vec, regs);
+	}
+
+	return handled;
+}
+
+/*
+ * Performance data shows that 3 is good enough to harvest 90+% of the benefit
+ * on high IRQ rate workload.
+ */
+#define MAX_POSTED_MSI_COALESCING_LOOP 3
+
+/*
+ * For MSIs that are delivered as posted interrupts, the CPU notifications
+ * can be coalesced if the MSIs arrive in high frequency bursts.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pi_desc *pid;
+	int i = 0;
+
+	pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	inc_irq_stat(posted_msi_notification_count);
+	irq_enter();
+
+	/*
+	 * Max coalescing count includes the extra round of handle_pending_pir
+	 * after clearing the outstanding notification bit. Hence, at most
+	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
+	 */
+	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
+		if (!handle_pending_pir(pid->pir64, regs))
+			break;
+	}
+
+	/*
+	 * Clear outstanding notification bit to allow new IRQ notifications,
+	 * do this last to maximize the window of interrupt coalescing.
+	 */
+	pi_clear_on(pid);
+
+	/*
+	 * There could be a race of PI notification and the clearing of ON bit,
+	 * process PIR bits one last time such that handling the new interrupts
+	 * are not delayed until the next IRQ.
+	 */
+	handle_pending_pir(pid->pir64, regs);
+
+	apic_eoi();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
 #endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU
-- 
2.25.1



* [PATCH v2 09/13] x86/irq: Factor out common code for checking pending interrupts
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (7 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-05 22:31 ` [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts Jacob Pan
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Use a common function for checking whether an interrupt vector is pending
in the APIC IRR instead of open-coding the check in multiple places.

Additional checks for posted MSI vectors can then be contained in this
function.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/apic.h   | 11 +++++++++++
 arch/x86/kernel/apic/vector.c |  5 ++---
 arch/x86/kernel/irq.c         |  5 ++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 94ce0f7c9d3a..ebf80718912d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -499,6 +499,17 @@ static inline bool lapic_vector_set_in_irr(unsigned int vector)
 	return !!(irr & (1U << (vector % 32)));
 }
 
+static inline bool is_vector_pending(unsigned int vector)
+{
+	unsigned int irr;
+
+	irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
+	if (irr  & (1 << (vector % 32)))
+		return true;
+
+	return false;
+}
+
 /*
  * Warm reset vector position:
  */
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..9eec52925fa3 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -965,7 +965,7 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 	lockdep_assert_held(&vector_lock);
 
 	hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
-		unsigned int irr, vector = apicd->prev_vector;
+		unsigned int vector = apicd->prev_vector;
 
 		/*
 		 * Paranoia: Check if the vector that needs to be cleaned
@@ -979,8 +979,7 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 		 * fixup_irqs() was just called to scan IRR for set bits and
 		 * forward them to new destination CPUs via IPIs.
 		 */
-		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
-		if (irr & (1U << (vector % 32))) {
+		if (check_irr && is_vector_pending(vector)) {
 			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
 			rearm = true;
 			continue;
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 8772b0a3abf6..3206d48f380d 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -480,7 +480,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
 void fixup_irqs(void)
 {
-	unsigned int irr, vector;
+	unsigned int vector;
 	struct irq_desc *desc;
 	struct irq_data *data;
 	struct irq_chip *chip;
@@ -507,8 +507,7 @@ void fixup_irqs(void)
 		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
 			continue;
 
-		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
-		if (irr  & (1 << (vector % 32))) {
+		if (is_vector_pending(vector)) {
 			desc = __this_cpu_read(vector_irq[vector]);
 
 			raw_spin_lock(&desc->lock);
-- 
2.25.1



* [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (8 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 09/13] x86/irq: Factor out common code for checking pending interrupts Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-12  9:25   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

During an interrupt affinity change, it is possible for interrupts to be
delivered to the old CPU after the affinity has changed to the new one. To
prevent lost interrupts, the local APIC IRR is checked on the old CPU. The
same checks must be done for posted MSIs, for the same reason.

Consider the following scenario:
	Device		system agent		iommu		memory 		CPU/LAPIC
1	FEEX_XXXX
2			Interrupt request
3						Fetch IRTE	->
4						->Atomic Swap PID.PIR(vec)
						Push to Global Observable(GO)
5						if (ON*)
	i						done;*
						else
6							send a notification ->

* ON: outstanding notification, 1 will suppress new notifications

If the affinity change happens between steps 3 and 5 in the IOMMU, the old
CPU's posted interrupt request (PIR) could have a pending bit set for the
vector being moved.

Add a helper function to check an individual vector's status, then use it
to check for pending interrupts in the source CPU's PID.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2: Fold in helper function patch.
---
 arch/x86/include/asm/apic.h        |  3 ++-
 arch/x86/include/asm/posted_intr.h | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index ebf80718912d..5bf0d6c2523b 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -13,6 +13,7 @@
 #include <asm/mpspec.h>
 #include <asm/msr.h>
 #include <asm/hardirq.h>
+#include <asm/posted_intr.h>
 
 #define ARCH_APICTIMER_STOPS_ON_C3	1
 
@@ -507,7 +508,7 @@ static inline bool is_vector_pending(unsigned int vector)
 	if (irr  & (1 << (vector % 32)))
 		return true;
 
-	return false;
+	return pi_pending_this_cpu(vector);
 }
 
 /*
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 4e5eea2d20e0..8aaa15515490 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _X86_POSTED_INTR_H
 #define _X86_POSTED_INTR_H
+#include <asm/irq_vectors.h>
 
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1
@@ -81,9 +82,26 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 }
 
 #ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
+ * own interrupts. Here we do not distinguish them since those vector bits in
+ * PIR will always be zero.
+ */
+static inline bool pi_pending_this_cpu(unsigned int vector)
+{
+	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	if (WARN_ON_ONCE(vector > NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
+		return false;
+
+	return test_bit(vector, (unsigned long *)pid->pir);
+}
+
 extern void intel_posted_msi_init(void);
 
 #else
+static inline bool pi_pending_this_cpu(unsigned int vector) { return false; }
+
 static inline void intel_posted_msi_init(void) {};
 
 #endif /* X86_POSTED_MSI */
-- 
2.25.1



* [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (9 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-06  4:31   ` Robert Hoo
  2024-04-12  9:31   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
  2024-04-05 22:31 ` [PATCH v2 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
  12 siblings, 2 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y.

Also introduce a helper function for testing if posted MSI is supported on
the platform.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  1 +
 arch/x86/include/asm/irq_remapping.h            | 11 +++++++++++
 drivers/iommu/irq_remapping.c                   | 13 ++++++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bb884c14b2f6..e5fd02423c4c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2251,6 +2251,7 @@
 			no_x2apic_optout
 				BIOS x2APIC opt-out request will be ignored
 			nopost	disable Interrupt Posting
+			posted_msi enable MSIs delivered as posted interrupts
 
 	iomem=		Disable strict checking of access to MMIO memory
 		strict	regions from userspace.
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 7a2ed154a5e1..e46bde61029b 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -50,6 +50,17 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
 	return x86_vector_domain;
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern int enable_posted_msi;
+
+static inline bool posted_msi_supported(void)
+{
+	return enable_posted_msi && irq_remapping_cap(IRQ_POSTING_CAP);
+}
+#else
+static inline bool posted_msi_supported(void) { return false; };
+#endif
+
 #else  /* CONFIG_IRQ_REMAP */
 
 static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index ee59647c2050..5672783c9f1f 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -24,6 +24,10 @@ int no_x2apic_optout;
 
 int disable_irq_post = 0;
 
+#ifdef CONFIG_X86_POSTED_MSI
+int enable_posted_msi;
+#endif
+
 static int disable_irq_remap;
 static struct irq_remap_ops *remap_ops;
 
@@ -70,7 +74,14 @@ static __init int setup_irqremap(char *str)
 			no_x2apic_optout = 1;
 		else if (!strncmp(str, "nopost", 6))
 			disable_irq_post = 1;
-
+#ifdef CONFIG_X86_POSTED_MSI
+		else if (!strncmp(str, "posted_msi", 10)) {
+			if (disable_irq_post || disable_irq_remap)
+				pr_warn("Posted MSI not enabled due to conflicting options!");
+			else
+				enable_posted_msi = 1;
+		}
+#endif
 		str += strcspn(str, ",");
 		while (*str == ',')
 			str++;
-- 
2.25.1



* [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (10 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  2024-04-12  9:36   ` Tian, Kevin
  2024-04-05 22:31 ` [PATCH v2 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
  12 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

Introduce a new irq_chip for posted MSIs. The key difference is in
irq_ack(), where the EOI is performed by the posted MSI notification
handler instead.

When posted MSI is enabled, the MSI domain/chip hierarchy will look like
this example:

domain:  IR-PCI-MSIX-0000:50:00.0-12
 hwirq:   0x29
 chip:    IR-PCI-MSIX-0000:50:00.0
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-10-13
     hwirq:   0x2d0000
     chip:    INTEL-IR-POST
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x77
         chip:    APIC

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/irq_remapping.c | 46 +++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 566297bc87dd..fa719936b44e 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1233,6 +1233,52 @@ static struct irq_chip intel_ir_chip = {
 	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
 };
 
+static void dummy(struct irq_data *d)
+{
+}
+
+/*
+ * With posted MSIs, all vectors are multiplexed into a single notification
+ * vector. Devices MSIs are then dispatched in a demux loop where
+ * EOIs can be coalesced as well.
+ *
+ * "INTEL-IR-POST" IRQ chip does not do EOI on ACK, thus the dummy irq_ack()
+ * function. Instead EOI is performed by the posted interrupt notification
+ * handler.
+ *
+ * For the example below, 3 MSIs are coalesced into one CPU notification. Only
+ * one apic_eoi() is needed.
+ *
+ * __sysvec_posted_msi_notification()
+ *	irq_enter();
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *	irq_enter();
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *	irq_enter();
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *	apic_eoi()
+ * irq_exit()
+ */
+static struct irq_chip intel_ir_chip_post_msi = {
+	.name			= "INTEL-IR-POST",
+	.irq_ack		= dummy,
+	.irq_set_affinity	= intel_ir_set_affinity,
+	.irq_compose_msi_msg	= intel_ir_compose_msi_msg,
+	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
+};
+
 static void fill_msi_msg(struct msi_msg *msg, u32 index, u32 subhandle)
 {
 	memset(msg, 0, sizeof(*msg));
-- 
2.25.1



* [PATCH v2 13/13] iommu/vt-d: Enable posted mode for device MSIs
  2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (11 preceding siblings ...)
  2024-04-05 22:31 ` [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
@ 2024-04-05 22:31 ` Jacob Pan
  12 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-05 22:31 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

With the posted MSI feature enabled on the CPU side, IOMMU interrupt
remapping table entries (IRTEs) for device MSI/MSI-X can be allocated,
activated, and programmed in posted mode. This means that IRTEs are
linked with the posted interrupt descriptors (PIDs) of their target CPUs.

Handlers for the posted MSI notification vector de-multiplex into the
device MSI handlers. CPU notifications are coalesced when interrupts
arrive at a high frequency.

The following are excluded:
- legacy devices IOAPIC and HPET (may be needed for booting, not a source
of high-rate MSIs)
- VT-d's own IRQs (not remappable).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2: Fold in helper function for retrieving PID address
v1: Added a warning if the effective affinity mask is not set up
---
 drivers/iommu/intel/irq_remapping.c | 69 +++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index fa719936b44e..ac5f9e83943b 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -19,6 +19,7 @@
 #include <asm/cpu.h>
 #include <asm/irq_remapping.h>
 #include <asm/pci-direct.h>
+#include <asm/posted_intr.h>
 
 #include "iommu.h"
 #include "../irq_remapping.h"
@@ -49,6 +50,7 @@ struct irq_2_iommu {
 	u16 sub_handle;
 	u8  irte_mask;
 	enum irq_mode mode;
+	bool posted_msi;
 };
 
 struct intel_ir_data {
@@ -1118,6 +1120,14 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	irte->redir_hint = 1;
 }
 
+static void prepare_irte_posted(struct irte *irte)
+{
+	memset(irte, 0, sizeof(*irte));
+
+	irte->present = 1;
+	irte->p_pst = 1;
+}
+
 struct irq_remap_ops intel_irq_remap_ops = {
 	.prepare		= intel_prepare_irq_remapping,
 	.enable			= intel_enable_irq_remapping,
@@ -1126,6 +1136,47 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.enable_faulting	= enable_drhd_fault_handling,
 };
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+static phys_addr_t get_pi_desc_addr(struct irq_data *irqd)
+{
+	int cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
+
+	if (WARN_ON(cpu >= nr_cpu_ids))
+		return 0;
+
+	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
+}
+
+static void intel_ir_reconfigure_irte_posted(struct irq_data *irqd)
+{
+	struct intel_ir_data *ir_data = irqd->chip_data;
+	struct irte *irte = &ir_data->irte_entry;
+	struct irte irte_pi;
+	u64 pid_addr;
+
+	pid_addr = get_pi_desc_addr(irqd);
+
+	if (!pid_addr) {
+		pr_warn("Failed to set up IRQ %d for posted mode\n", irqd->irq);
+		return;
+	}
+
+	memset(&irte_pi, 0, sizeof(irte_pi));
+
+	/* The shared IRTE is already set up as posted during alloc_irte() */
+	dmar_copy_shared_irte(&irte_pi, irte);
+
+	irte_pi.pda_l = (pid_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
+	irte_pi.pda_h = (pid_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);
+
+	modify_irte(&ir_data->irq_2_iommu, &irte_pi);
+}
+
+#else
+static inline void intel_ir_reconfigure_irte_posted(struct irq_data *irqd) {}
+#endif
+
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 {
 	struct intel_ir_data *ir_data = irqd->chip_data;
@@ -1139,8 +1190,9 @@ static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
 
-	/* Update the hardware only if the interrupt is in remapped mode. */
-	if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
+	if (ir_data->irq_2_iommu.posted_msi)
+		intel_ir_reconfigure_irte_posted(irqd);
+	else if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
 		modify_irte(&ir_data->irq_2_iommu, irte);
 }
 
@@ -1194,7 +1246,7 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct vcpu_data *vcpu_pi_info = info;
 
-	/* stop posting interrupts, back to remapping mode */
+	/* stop posting interrupts, back to the default mode */
 	if (!vcpu_pi_info) {
 		modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
 	} else {
@@ -1320,6 +1372,11 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 		break;
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
+		if (posted_msi_supported()) {
+			prepare_irte_posted(irte);
+			data->irq_2_iommu.posted_msi = 1;
+		}
+
 		set_msi_sid(irte,
 			    pci_real_dma_dev(msi_desc_to_pci_dev(info->desc)));
 		break;
@@ -1407,7 +1464,11 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
 
 		irq_data->hwirq = (index << 16) + i;
 		irq_data->chip_data = ird;
-		irq_data->chip = &intel_ir_chip;
+		if (posted_msi_supported() &&
+			((info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) || (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSIX)))
+			irq_data->chip = &intel_ir_chip_post_msi;
+		else
+			irq_data->chip = &intel_ir_chip;
 		intel_irq_remapping_prepare_irte(ird, irq_cfg, info, index, i);
 		irq_set_status_flags(virq + i, IRQ_MOVE_PCNTXT);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-05 22:31 ` [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
@ 2024-04-06  4:31   ` Robert Hoo
  2024-04-08 23:33     ` Jacob Pan
  2024-04-12  9:31   ` Tian, Kevin
  1 sibling, 1 reply; 47+ messages in thread
From: Robert Hoo @ 2024-04-06  4:31 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng

On 4/6/2024 6:31 AM, Jacob Pan wrote:
> Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y.
> 
> Also introduce a helper function for testing if posted MSI is supported on
> the platform.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>   Documentation/admin-guide/kernel-parameters.txt |  1 +
>   arch/x86/include/asm/irq_remapping.h            | 11 +++++++++++
>   drivers/iommu/irq_remapping.c                   | 13 ++++++++++++-
>   3 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bb884c14b2f6..e5fd02423c4c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2251,6 +2251,7 @@
>   			no_x2apic_optout
>   				BIOS x2APIC opt-out request will be ignored
>   			nopost	disable Interrupt Posting
> +			posted_msi enable MSIs delivered as posted interrupts
>   
>   	iomem=		Disable strict checking of access to MMIO memory
>   		strict	regions from userspace.
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index 7a2ed154a5e1..e46bde61029b 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -50,6 +50,17 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
>   	return x86_vector_domain;
>   }
>   
> +#ifdef CONFIG_X86_POSTED_MSI
> +extern int enable_posted_msi;
> +
> +static inline bool posted_msi_supported(void)
> +{
> +	return enable_posted_msi && irq_remapping_cap(IRQ_POSTING_CAP);
> +}

Out of this patch set's scope, but since it drops into irq_remapping_cap(), I'd
like to bring this change up for discussion:

diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 4047ac396728..ef2de9034897 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -98,7 +98,7 @@ void set_irq_remapping_broken(void)

  bool irq_remapping_cap(enum irq_remap_cap cap)
  {
-       if (!remap_ops || disable_irq_post)
+       if (!remap_ops || disable_irq_remap)
                 return false;

         return (remap_ops->capability & (1 << cap));


1. irq_remapping_cap() is meant to examine a capability. Though at present it
has only one cap, i.e. IRQ_POSTING_CAP, simply returning false because of
disable_irq_post isn't good. Instead, IRQ_REMAP is the foundation of all
remapping caps.
2. disable_irq_post is used by the Intel IOMMU code only, while
irq_remapping_cap() is common code; e.g. the AMD IOMMU code doesn't use it to
decide whether to set the irq_post cap.

> +#else
> +static inline bool posted_msi_supported(void) { return false; };
> +#endif


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-06  4:31   ` Robert Hoo
@ 2024-04-08 23:33     ` Jacob Pan
  2024-04-13 10:59       ` Robert Hoo
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-08 23:33 UTC (permalink / raw)
  To: Robert Hoo
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jim.harris, a.manzanares, Bjorn Helgaas, guang.zeng,
	jacob.jun.pan

Hi Robert,

On Sat, 6 Apr 2024 12:31:14 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
wrote:

> On 4/6/2024 6:31 AM, Jacob Pan wrote:
> > Add a command line opt-in option for posted MSI if
> > CONFIG_X86_POSTED_MSI=y.
> > 
> > Also introduce a helper function for testing if posted MSI is supported
> > on the platform.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >   Documentation/admin-guide/kernel-parameters.txt |  1 +
> >   arch/x86/include/asm/irq_remapping.h            | 11 +++++++++++
> >   drivers/iommu/irq_remapping.c                   | 13 ++++++++++++-
> >   3 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt
> > b/Documentation/admin-guide/kernel-parameters.txt index
> > bb884c14b2f6..e5fd02423c4c 100644 ---
> > a/Documentation/admin-guide/kernel-parameters.txt +++
> > b/Documentation/admin-guide/kernel-parameters.txt @@ -2251,6 +2251,7 @@
> >   			no_x2apic_optout
> >   				BIOS x2APIC opt-out request will be
> > ignored nopost	disable Interrupt Posting
> > +			posted_msi enable MSIs delivered as posted
> > interrupts 
> >   	iomem=		Disable strict checking of access to
> > MMIO memory strict	regions from userspace.
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> > b/arch/x86/include/asm/irq_remapping.h index 7a2ed154a5e1..e46bde61029b
> > 100644 --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -50,6 +50,17 @@ static inline struct irq_domain
> > *arch_get_ir_parent_domain(void) return x86_vector_domain;
> >   }
> >   
> > +#ifdef CONFIG_X86_POSTED_MSI
> > +extern int enable_posted_msi;
> > +
> > +static inline bool posted_msi_supported(void)
> > +{
> > +	return enable_posted_msi && irq_remapping_cap(IRQ_POSTING_CAP);
> > +}  
> 
> Out of this patch set's scope, but since it drops into irq_remapping_cap(),
> I'd like to bring this change up for discussion:
> 
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 4047ac396728..ef2de9034897 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -98,7 +98,7 @@ void set_irq_remapping_broken(void)
> 
>   bool irq_remapping_cap(enum irq_remap_cap cap)
>   {
> -       if (!remap_ops || disable_irq_post)
> +       if (!remap_ops || disable_irq_remap)
>                  return false;
> 
>          return (remap_ops->capability & (1 << cap));
> 
> 
> 1. irq_remapping_cap() is meant to examine a capability. Though at present
> it has only one cap, i.e. IRQ_POSTING_CAP, simply returning false because
> of disable_irq_post isn't good. Instead, IRQ_REMAP is the foundation of all
> remapping caps. 2. disable_irq_post is used by the Intel IOMMU code only,
> while irq_remapping_cap() is common code; e.g. the AMD IOMMU code doesn't
> use it to decide whether to set the irq_post cap.
I agree, posting should be treated as a sub-capability of remapping.
IRQ_POSTING_CAP is only set when remapping is on.

We need to delete this check so that posting is always off when remapping is
off:

--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1038,11 +1038,7 @@ static void disable_irq_remapping(void)
                iommu_disable_irq_remapping(iommu);
        }
 
-       /*
-        * Clear Posted-Interrupts capability.
-        */
-       if (!disable_irq_post)
-               intel_irq_remap_ops.capability &= ~(1 << IRQ_POSTING_CAP);
+       intel_irq_remap_ops.capability &= ~(1 << IRQ_POSTING_CAP);
 } 

> > +#else
> > +static inline bool posted_msi_supported(void) { return false; };
> > +#endif  
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
  2024-04-05 22:31 ` [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Jacob Pan
@ 2024-04-11  7:52   ` Tian, Kevin
  2024-04-11 17:38     ` Jacob Pan
  2024-04-11 16:54   ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-11  7:52 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> +
> +/*
> + * De-multiplexing posted interrupts is on the performance path, the code
> + * below is written to optimize the cache performance based on the
> following
> + * considerations:
> + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> + *   accessed by both CPU and IOMMU.
> + * 2.During posted MSI processing, the CPU needs to do 64-bit read and
> xchg
> + *   for checking and clearing posted interrupt request (PIR), a 256 bit field
> + *   within the PID.
> + * 3.On the other side, the IOMMU does atomic swaps of the entire PID
> cache
> + *   line when posting interrupts and setting control bits.
> + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the
> PID
> + *   cache line. The cache line states after each operation are as follows:
> + *   CPU		IOMMU			PID Cache line state
> + *   ---------------------------------------------------------------
> + *...read64					exclusive
> + *...lock xchg64				modified
> + *...			post/atomic swap	invalid
> + *...-------------------------------------------------------------
> + *

According to VT-d spec: 5.2.3 Interrupt-Posting Hardware Operation:

"
- Read contents of the Posted Interrupt Descriptor, claiming exclusive
  ownership of its hosting cache-line.
  ...
- Modify the following descriptor field values atomically:
  ...
- Promote the cache-line to be globally observable, so that the modifications
  are visible to other caching agents. Hardware may write-back the cache-line
  anytime after this step.
"

It sounds like the PID cache line is not evicted after IOMMU posting?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-05 22:31 ` [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
@ 2024-04-11 16:51   ` Thomas Gleixner
  2024-04-15 18:53     ` Jacob Pan
  2024-04-12  9:14   ` Tian, Kevin
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2024-04-11 16:51 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu,
	kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index d18bfb238f66..1ee00be8218d 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -97,9 +97,16 @@
>  
>  #define LOCAL_TIMER_VECTOR		0xec
>  
> +/*
> + * Posted interrupt notification vector for all device MSIs delivered to
> + * the host kernel.
> + */
> +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
>  #define NR_VECTORS			 256
>  
> -#ifdef CONFIG_X86_LOCAL_APIC
> +#ifdef CONFIG_X86_POSTED_MSI
> +#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
> +#elif defined(CONFIG_X86_LOCAL_APIC)
>  #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
>  #else
>  #define FIRST_SYSTEM_VECTOR		NR_VECTORS

This is horrible and we had attempts before to make the system vector
space dense. They all did not work and making an exception for this is
not what we want.

If we really care then we do it proper for _all_ of them. Something like
the uncompiled below. There is certainly a smarter way to do the build
thing, but my kbuild foo is rusty.

Thanks,

        tglx
---
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -245,6 +245,7 @@ archscripts: scripts_basic
 
 archheaders:
 	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
+	$(Q)$(MAKE) $(build)=arch/x86/kernel/irqvectors all
 
 ###
 # Kernel objects
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -43,59 +43,46 @@
  */
 #define ISA_IRQ_VECTOR(irq)		(((FIRST_EXTERNAL_VECTOR + 16) & ~15) + irq)
 
+#ifndef __ASSEMBLY__
 /*
- * Special IRQ vectors used by the SMP architecture, 0xf0-0xff
- *
- *  some of the following vectors are 'rare', they are merged
- *  into a single vector (CALL_FUNCTION_VECTOR) to save vector space.
- *  TLB, reschedule and local APIC vectors are performance-critical.
+ * Special IRQ vectors used by the SMP architecture, 0xff and downwards
  */
+enum {
+	__SPURIOUS_APIC_VECTOR,
+	__ERROR_APIC_VECTOR,
+	__RESCHEDULE_VECTOR,
+	__CALL_FUNCTION_VECTOR,
+	__CALL_FUNCTION_SINGLE_VECTOR,
+	__THERMAL_APIC_VECTOR,
+	__THRESHOLD_APIC_VECTOR,
+	__REBOOT_VECTOR,
+	__X86_PLATFORM_IPI_VECTOR,
+	__IRQ_WORK_VECTOR,
+	__DEFERRED_ERROR_VECTOR,
 
-#define SPURIOUS_APIC_VECTOR		0xff
-/*
- * Sanity check
- */
-#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F)
-# error SPURIOUS_APIC_VECTOR definition error
+#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
+	__HYPERVISOR_CALLBACK_VECTOR,
 #endif
 
-#define ERROR_APIC_VECTOR		0xfe
-#define RESCHEDULE_VECTOR		0xfd
-#define CALL_FUNCTION_VECTOR		0xfc
-#define CALL_FUNCTION_SINGLE_VECTOR	0xfb
-#define THERMAL_APIC_VECTOR		0xfa
-#define THRESHOLD_APIC_VECTOR		0xf9
-#define REBOOT_VECTOR			0xf8
-
-/*
- * Generic system vector for platform specific use
- */
-#define X86_PLATFORM_IPI_VECTOR		0xf7
-
-/*
- * IRQ work vector:
- */
-#define IRQ_WORK_VECTOR			0xf6
-
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
-#define DEFERRED_ERROR_VECTOR		0xf4
-
-/* Vector on which hypervisor callbacks will be delivered */
-#define HYPERVISOR_CALLBACK_VECTOR	0xf3
-
-/* Vector for KVM to deliver posted interrupt IPI */
-#define POSTED_INTR_VECTOR		0xf2
-#define POSTED_INTR_WAKEUP_VECTOR	0xf1
-#define POSTED_INTR_NESTED_VECTOR	0xf0
-
-#define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
+#if IS_ENABLED(CONFIG_KVM)
+	/* Vector for KVM to deliver posted interrupt IPI */
+	__POSTED_INTR_VECTOR,
+	__POSTED_INTR_WAKEUP_VECTOR,
+	__POSTED_INTR_NESTED_VECTOR,
+#endif
+	__MANAGED_IRQ_SHUTDOWN_VECTOR,
 
 #if IS_ENABLED(CONFIG_HYPERV)
-#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
-#define HYPERV_STIMER0_VECTOR		0xed
+	__HYPERV_REENLIGHTENMENT_VECTOR,
+	__HYPERV_STIMER0_VECTOR,
 #endif
+	__LOCAL_TIMER_VECTOR,
+};
+#endif /* !__ASSEMBLY__ */
 
-#define LOCAL_TIMER_VECTOR		0xec
+#ifndef COMPILE_OFFSETS
+#include <asm/irqvectors.h>
+#endif
 
 #define NR_VECTORS			 256
 
--- /dev/null
+++ b/arch/x86/kernel/irqvectors/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+irqvectors-file	:= arch/$(SRCARCH)/include/generated/asm/irqvectors.h
+targets 	+= arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s
+
+$(irqvectors-file): arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s FORCE
+	$(call filechk,offsets,__ASM_IRQVECTORS_H__)
+
+PHONY += all
+all: $(irqvectors-file)
+	@:
--- /dev/null
+++ b/arch/x86/kernel/irqvectors/irqvectors.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#define COMPILE_OFFSETS
+
+#include <linux/kbuild.h>
+#include <asm/irq_vectors.h>
+
+#define VECNR(v)	(0xFF - __##v)
+#define VECTOR(v)	DEFINE(v, VECNR(v))
+
+static void __used common(void)
+{
+	VECTOR(SPURIOUS_APIC_VECTOR);
+	VECTOR(ERROR_APIC_VECTOR);
+	VECTOR(RESCHEDULE_VECTOR);
+	VECTOR(CALL_FUNCTION_VECTOR);
+	VECTOR(CALL_FUNCTION_SINGLE_VECTOR);
+	VECTOR(THERMAL_APIC_VECTOR);
+	VECTOR(THRESHOLD_APIC_VECTOR);
+	VECTOR(REBOOT_VECTOR);
+	VECTOR(X86_PLATFORM_IPI_VECTOR);
+	VECTOR(IRQ_WORK_VECTOR);
+	VECTOR(DEFERRED_ERROR_VECTOR);
+
+#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
+	VECTOR(HYPERVISOR_CALLBACK_VECTOR);
+#endif
+
+#if IS_ENABLED(CONFIG_KVM)
+	/* Vector for KVM to deliver posted interrupt IPI */
+	VECTOR(POSTED_INTR_VECTOR);
+	VECTOR(POSTED_INTR_WAKEUP_VECTOR);
+	VECTOR(POSTED_INTR_NESTED_VECTOR);
+#endif
+	VECTOR(MANAGED_IRQ_SHUTDOWN_VECTOR);
+
+#if IS_ENABLED(CONFIG_HYPERV)
+	VECTOR(HYPERV_REENLIGHTENMENT_VECTOR);
+	VECTOR(HYPERV_STIMER0_VECTOR);
+#endif
+	VECTOR(LOCAL_TIMER_VECTOR);
+}
+




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
  2024-04-05 22:31 ` [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Jacob Pan
  2024-04-11  7:52   ` Tian, Kevin
@ 2024-04-11 16:54   ` Thomas Gleixner
  2024-04-11 18:29     ` Jacob Pan
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2024-04-11 16:54 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu,
	kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	guang.zeng, robert.hoo.linux, Jacob Pan

On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
>  
>  #ifdef CONFIG_SMP
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index fc37c8d83daf..f445bec516a0 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -163,6 +163,9 @@ static const __initconst struct idt_data apic_idts[] = {
>  # endif
>  	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
>  	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
> +# ifdef CONFIG_X86_POSTED_MSI
> +	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
> +# endif
>  #endif
>  };

Obviously lacks FRED support...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
  2024-04-11  7:52   ` Tian, Kevin
@ 2024-04-11 17:38     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-11 17:38 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Thu, 11 Apr 2024 07:52:28 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > +
> > +/*
> > + * De-multiplexing posted interrupts is on the performance path, the
> > code
> > + * below is written to optimize the cache performance based on the
> > following
> > + * considerations:
> > + * 1.Posted interrupt descriptor (PID) fits in a cache line that is
> > frequently
> > + *   accessed by both CPU and IOMMU.
> > + * 2.During posted MSI processing, the CPU needs to do 64-bit read and
> > xchg
> > + *   for checking and clearing posted interrupt request (PIR), a 256
> > bit field
> > + *   within the PID.
> > + * 3.On the other side, the IOMMU does atomic swaps of the entire PID
> > cache
> > + *   line when posting interrupts and setting control bits.
> > + * 4.The CPU can access the cache line a magnitude faster than the
> > IOMMU.
> > + * 5.Each time the IOMMU does interrupt posting to the PIR will evict
> > the PID
> > + *   cache line. The cache line states after each operation are as
> > follows:
> > + *   CPU		IOMMU			PID Cache line
> > state
> > + *   ---------------------------------------------------------------
> > + *...read64					exclusive
> > + *...lock xchg64				modified
> > + *...			post/atomic swap	invalid
> > + *...-------------------------------------------------------------
> > + *  
> 
> According to VT-d spec: 5.2.3 Interrupt-Posting Hardware Operation:
> 
> "
> - Read contents of the Posted Interrupt Descriptor, claiming exclusive
>   ownership of its hosting cache-line.
>   ...
> - Modify the following descriptor field values atomically:
>   ...
> - Promote the cache-line to be globally observable, so that the
> modifications are visible to other caching agents. Hardware may
> write-back the cache-line anytime after this step.
> "
> 
> sounds that the PID cache line is not evicted after IOMMU posting?
The IOMMU follows the same MESI protocol defined in the SDM. The VT-d spec
also says:
"This atomic read-modify-write operation will always snoop processor caches"

So if the PID cache line is in the modified state (caused by writing the ON
bit, clearing the PIR, etc.), the IOMMU's request for ownership will evict the
cache line.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
  2024-04-11 16:54   ` Thomas Gleixner
@ 2024-04-11 18:29     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-11 18:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, jacob.jun.pan,
	xin3.li

Hi Thomas,

On Thu, 11 Apr 2024 18:54:29 +0200, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
> >  
> >  #ifdef CONFIG_SMP
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index fc37c8d83daf..f445bec516a0 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -163,6 +163,9 @@ static const __initconst struct idt_data
> > apic_idts[] = { # endif
> >  	INTG(SPURIOUS_APIC_VECTOR,
> > asm_sysvec_spurious_apic_interrupt), INTG(ERROR_APIC_VECTOR,
> > 		asm_sysvec_error_interrupt), +# ifdef
> > CONFIG_X86_POSTED_MSI
> > +	INTG(POSTED_MSI_NOTIFICATION_VECTOR,
> > asm_sysvec_posted_msi_notification), +# endif
> >  #endif
> >  };  
> 
> Obviously lacks FRED support...
Good point, forgot FRED is merged :)

Will add an entry to entry_fred.c. I would not be able to test performance
though.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-05 22:31 ` [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
  2024-04-11 16:51   ` Thomas Gleixner
@ 2024-04-12  9:14   ` Tian, Kevin
  2024-04-12 14:27     ` Sean Christopherson
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:14 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> +/*
> + * Posted interrupt notification vector for all device MSIs delivered to
> + * the host kernel.
> + */
> +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
>  #define NR_VECTORS			 256
> 

Every interrupt is kind of a notification.

Just call it POSTED_MSI_VECTOR

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors
  2024-04-05 22:31 ` [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
@ 2024-04-12  9:16   ` Tian, Kevin
  2024-04-12 17:54     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:16 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> +#ifdef CONFIG_X86_POSTED_MSI
> +
> +/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
> +DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);

'posted_msi_desc' to be more accurate?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt
  2024-04-05 22:31 ` [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
@ 2024-04-12  9:21   ` Tian, Kevin
  2024-04-12 16:50     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:21 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> Prepare for calling external IRQ handlers directly from the posted MSI
> demultiplexing loop. Extract the common code with common interrupt to
> avoid code duplication.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/kernel/irq.c | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index f39f6147104c..c54de9378943 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct
> irq_desc *desc,
>  		__handle_irq(desc, regs);
>  }
> 
> -/*
> - * common_interrupt() handles all normal device IRQ's (the special SMP
> - * cross-CPU interrupts have their own entry points).
> - */
> -DEFINE_IDTENTRY_IRQ(common_interrupt)
> +static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
>  {
> -	struct pt_regs *old_regs = set_irq_regs(regs);
>  	struct irq_desc *desc;
> 
> -	/* entry code tells RCU that we're not quiescent.  Check it. */
> -	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up
> RCU");
> -
>  	desc = __this_cpu_read(vector_irq[vector]);
>  	if (likely(!IS_ERR_OR_NULL(desc))) {
>  		handle_irq(desc, regs);

the hidden lines have one problem:

	} else {
		apic_eoi();

		if (desc == VECTOR_UNUSED) {
			...

there will be two EOIs for unused vectors, counting the one in
sysvec_posted_msi_notification().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts
  2024-04-05 22:31 ` [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts Jacob Pan
@ 2024-04-12  9:25   ` Tian, Kevin
  2024-04-12 18:23     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:25 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> During interrupt affinity change, it is possible to have interrupts delivered
> to the old CPU after the affinity has changed to the new one. To prevent lost
> interrupts, local APIC IRR is checked on the old CPU. Similar checks must be
> done for posted MSIs given the same reason.
> 
> Consider the following scenario:
> 	Device		system agent		iommu		memory		CPU/LAPIC
> 1	FEEX_XXXX
> 2			Interrupt request
> 3						Fetch IRTE	->
> 4						->Atomic Swap PID.PIR(vec)
> 						Push to Global Observable(GO)
> 5						if (ON*)
> 	i						done;*

there is a stray 'i'

> 						else
> 6							send a notification ->
> 
> * ON: outstanding notification, 1 will suppress new notifications
> 
> If the affinity change happens between 3 and 5 in IOMMU, the old CPU's posted
> interrupt request (PIR) could have pending bit set for the vector being moved.

how could affinity change be possible in 3/4 when the cache line is
locked by IOMMU? Strictly speaking it's about a change after 4 and
before 6.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-05 22:31 ` [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
  2024-04-06  4:31   ` Robert Hoo
@ 2024-04-12  9:31   ` Tian, Kevin
  2024-04-15 23:20     ` Jacob Pan
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:31 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> +#ifdef CONFIG_X86_POSTED_MSI
> +		else if (!strncmp(str, "posted_msi", 10)) {
> +			if (disable_irq_post || disable_irq_remap)
> +				pr_warn("Posted MSI not enabled due to conflicting options!");
> +			else
> +				enable_posted_msi = 1;
> +		}
> +#endif

the check of disable_irq_remap is unnecessary. A configuration with
disable_irq_post=0 while disable_irq_remap=1 is unlikely, given the
latter has the bigger scope.

But thinking about it more, do we really need a check here? There is no
ordering guarantee that "posted_msi" is parsed after the parameters that
decide the values of the two disable variables.

It probably makes more sense to just set enable_posted_msi here
and then do all the required checks when picking up the irqchip in
intel_irq_remapping_alloc().

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2024-04-05 22:31 ` [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
@ 2024-04-12  9:36   ` Tian, Kevin
  2024-04-16 22:15     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2024-04-12  9:36 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, maz,
	seanjc, Robin Murphy, jim.harris, a.manzanares, Bjorn Helgaas,
	Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> + *
> + * For the example below, 3 MSIs are coalesced into one CPU notification.
> + * Only one apic_eoi() is needed.
> + *
> + * __sysvec_posted_msi_notification()
> + *	irq_enter();
> + *		handle_edge_irq()
> + *			irq_chip_ack_parent()
> + *				dummy(); // No EOI
> + *			handle_irq_event()
> + *				driver_handler()
> + *	irq_enter();
> + *		handle_edge_irq()
> + *			irq_chip_ack_parent()
> + *				dummy(); // No EOI
> + *			handle_irq_event()
> + *				driver_handler()
> + *	irq_enter();
> + *		handle_edge_irq()
> + *			irq_chip_ack_parent()
> + *				dummy(); // No EOI
> + *			handle_irq_event()
> + *				driver_handler()

typo: you added three irq_enter()'s here

> + *	apic_eoi()
> + * irq_exit()
> + */
> +static struct irq_chip intel_ir_chip_post_msi = {
> +	.name			= "INTEL-IR-POST",
> +	.irq_ack		= dummy,
> +	.irq_set_affinity	= intel_ir_set_affinity,
> +	.irq_compose_msi_msg	= intel_ir_compose_msi_msg,
> +	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
> +};

What about putting this patch at end of the series (combining the
change in intel_irq_remapping_alloc()) to finally enable this
feature?

It reads slightly better to me to first get those callbacks extended
to deal with the new mechanism (i.e. most changes in patch13)
before using them in the new irqchip. 😊

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-12  9:14   ` Tian, Kevin
@ 2024-04-12 14:27     ` Sean Christopherson
  2024-04-16  3:45       ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2024-04-12 14:27 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	Peter Anvin, Borislav Petkov, Ingo Molnar, Paul E Luse,
	Dan Williams, Jens Axboe, Ashok Raj, maz, Robin Murphy,
	jim.harris, a.manzanares, Bjorn Helgaas, Guang Zeng,
	robert.hoo.linux

On Fri, Apr 12, 2024, Kevin Tian wrote:
> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > +/*
> > + * Posted interrupt notification vector for all device MSIs delivered to
> > + * the host kernel.
> > + */
> > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
> >  #define NR_VECTORS			 256
> > 
> 
> Every interrupt is kind of a notification.

FWIW, I find value in having "notification" in the name to differentiate between
the IRQ that is notifying the CPU that there's a posted IRQ to be processed, and
the posted IRQ itself.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt
  2024-04-12  9:21   ` Tian, Kevin
@ 2024-04-12 16:50     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-12 16:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Fri, 12 Apr 2024 09:21:45 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > Prepare for calling external IRQ handlers directly from the posted MSI
> > demultiplexing loop. Extract the common code with common interrupt to
> > avoid code duplication.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  arch/x86/kernel/irq.c | 23 ++++++++++++++---------
> >  1 file changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> > index f39f6147104c..c54de9378943 100644
> > --- a/arch/x86/kernel/irq.c
> > +++ b/arch/x86/kernel/irq.c
> > @@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct irq_desc *desc,
> >  		__handle_irq(desc, regs);
> >  }
> > 
> > -/*
> > - * common_interrupt() handles all normal device IRQ's (the special SMP
> > - * cross-CPU interrupts have their own entry points).
> > - */
> > -DEFINE_IDTENTRY_IRQ(common_interrupt)
> > +static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
> >  {
> > -	struct pt_regs *old_regs = set_irq_regs(regs);
> >  	struct irq_desc *desc;
> > 
> > -	/* entry code tells RCU that we're not quiescent.  Check it. */
> > -	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
> > -
> >  	desc = __this_cpu_read(vector_irq[vector]);
> >  	if (likely(!IS_ERR_OR_NULL(desc))) {
> >  		handle_irq(desc, regs);  
> 
> the hidden lines have one problem:
> 
> 	} else {
> 		apic_eoi();
> 
> 		if (desc == VECTOR_UNUSED) {
> 			...
> 
> there will be two EOI's for unused vectors, adding the one
> in sysvec_posted_msi_notification().

Indeed, this unlikely case could cause a lost interrupt. Imagine we have:

- IDT vectors N (MSI notification), O, and P (other high-priority
system vectors).
- Device MSI vector A which triggers N.

Action 			APIC IRR		APIC ISR
---------------------------------------------------------
Device MSI A		N
APIC accepts N		-			N
New IRQs arrive		O,P			N
handle_irq(A)
eoi() due to A's fault	-			O,P
eoi in post_msi		-			P
----------------------------------------------------------
The second EOI clears the ISR bit for vector O, which was never processed.


See Intel SDM Section 11.8.4 for background:
"The IRR contains the active interrupt requests that have been accepted,
but not yet dispatched to the processor for servicing. When the local APIC
accepts an interrupt, it sets the bit in the IRR that corresponds the
vector of the accepted interrupt. When the processor core is ready to
handle the next interrupt, the local APIC clears the highest priority IRR
bit that is set and sets the corresponding ISR bit. The vector for the
highest priority bit set in the ISR is then dispatched to the processor
core for servicing.

While the processor is servicing the highest priority interrupt, the local
APIC can send additional fixed interrupts by setting bits in the IRR. When
the interrupt service routine issues a write to the EOI register (see
Section 11.8.5, Signaling Interrupt Servicing Completion), the local APIC
responds by clearing the highest priority ISR bit that is set. It then
repeats the process of clearing the highest priority bit in the IRR and
setting the corresponding bit in the ISR. The processor core then begins
executing the service routine for the highest priority bit set in the ISR
"

I need to avoid the duplicated EOI in this case, at minimum cost to the
hot path.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors
  2024-04-12  9:16   ` Tian, Kevin
@ 2024-04-12 17:54     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-12 17:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Fri, 12 Apr 2024 09:16:25 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > +#ifdef CONFIG_X86_POSTED_MSI
> > +
> > +/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
> > +DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);  
> 
> 'posted_msi_desc' to be more accurate?
makes sense, will do.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts
  2024-04-12  9:25   ` Tian, Kevin
@ 2024-04-12 18:23     ` Jacob Pan
  2024-04-16  3:47       ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-12 18:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Fri, 12 Apr 2024 09:25:57 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > During interrupt affinity change, it is possible to have interrupts
> > delivered to the old CPU after the affinity has changed to the new one.
> > To prevent lost interrupts, local APIC IRR is checked on the old CPU.
> > Similar checks must be done for posted MSIs given the same reason.
> > 
> > Consider the following scenario:
> > 	Device		system agent		iommu		memory		CPU/LAPIC
> > 1	FEEX_XXXX
> > 2			Interrupt request
> > 3						Fetch IRTE	->
> > 4						->Atomic Swap PID.PIR(vec)
> > 						Push to Global Observable(GO)
> > 5						if (ON*)
> > 	i						done;*
> 
> there is a stray 'i'
will fix, thanks

> 
> > 						else
> > 6							send a notification ->
> > 
> > * ON: outstanding notification, 1 will suppress new notifications
> > 
> > If the affinity change happens between 3 and 5 in IOMMU, the old CPU's posted
> > interrupt request (PIR) could have pending bit set for the vector being moved.
> 
> how could affinity change be possible in 3/4 when the cache line is
> locked by IOMMU? Strictly speaking it's about a change after 4 and
> before 6.
SW can still perform an affinity change on the IRTE and flush the IR
cache after the IOMMU fetched it (step 3). They are async events.

In step 4, the atomic swap is on the PID cacheline, not the IRTE.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-08 23:33     ` Jacob Pan
@ 2024-04-13 10:59       ` Robert Hoo
  0 siblings, 0 replies; 47+ messages in thread
From: Robert Hoo @ 2024-04-13 10:59 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jim.harris, a.manzanares, Bjorn Helgaas, guang.zeng

On 4/9/2024 7:33 AM, Jacob Pan wrote:
> Hi Robert,
> 
> On Sat, 6 Apr 2024 12:31:14 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
> wrote:
> 
>> On 4/6/2024 6:31 AM, Jacob Pan wrote:
>>> Add a command line opt-in option for posted MSI if
>>> CONFIG_X86_POSTED_MSI=y.
>>>
>>> Also introduce a helper function for testing if posted MSI is supported
>>> on the platform.
>>>
>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> ---
>>>    Documentation/admin-guide/kernel-parameters.txt |  1 +
>>>    arch/x86/include/asm/irq_remapping.h            | 11 +++++++++++
>>>    drivers/iommu/irq_remapping.c                   | 13 ++++++++++++-
>>>    3 files changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>>> b/Documentation/admin-guide/kernel-parameters.txt index
>>> bb884c14b2f6..e5fd02423c4c 100644 ---
>>> a/Documentation/admin-guide/kernel-parameters.txt +++
>>> b/Documentation/admin-guide/kernel-parameters.txt @@ -2251,6 +2251,7 @@
>>>    			no_x2apic_optout
>>>    				BIOS x2APIC opt-out request will be
>>> ignored nopost	disable Interrupt Posting
>>> +			posted_msi enable MSIs delivered as posted
>>> interrupts
>>>    	iomem=		Disable strict checking of access to
>>> MMIO memory strict	regions from userspace.
>>> diff --git a/arch/x86/include/asm/irq_remapping.h
>>> b/arch/x86/include/asm/irq_remapping.h index 7a2ed154a5e1..e46bde61029b
>>> 100644 --- a/arch/x86/include/asm/irq_remapping.h
>>> +++ b/arch/x86/include/asm/irq_remapping.h
>>> @@ -50,6 +50,17 @@ static inline struct irq_domain
>>> *arch_get_ir_parent_domain(void) return x86_vector_domain;
>>>    }
>>>    
>>> +#ifdef CONFIG_X86_POSTED_MSI
>>> +extern int enable_posted_msi;
>>> +
>>> +static inline bool posted_msi_supported(void)
>>> +{
>>> +	return enable_posted_msi && irq_remapping_cap(IRQ_POSTING_CAP);
>>> +}
>>
>> Out of this patch set's scope, but, dropping into irq_remappping_cap(),
>> I'd like to bring this change for discussion:
>>
>> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
>> index 4047ac396728..ef2de9034897 100644
>> --- a/drivers/iommu/irq_remapping.c
>> +++ b/drivers/iommu/irq_remapping.c
>> @@ -98,7 +98,7 @@ void set_irq_remapping_broken(void)
>>
>>    bool irq_remapping_cap(enum irq_remap_cap cap)
>>    {
>> -       if (!remap_ops || disable_irq_post)
>> +       if (!remap_ops || disable_irq_remap)
>>                   return false;
>>
>>           return (remap_ops->capability & (1 << cap));
>>
>>
>> 1. irq_remapping_cap() is to exam some cap, though at present it has only
>> 1 cap, i.e. IRQ_POSTING_CAP, simply return false just because of
>> disable_irq_post isn't good. Instead, IRQ_REMAP is the foundation of all
>> remapping caps. 2. disable_irq_post is used by Intel iommu code only,
>> here irq_remapping_cap() is common code. e.g. AMD iommu code doesn't use
>> it to judge set cap of irq_post or not.
> I agree, posting should be treated as a sub-capability of remapping.
> IRQ_POSTING_CAP is only set when remapping is on.
> 
> We need to delete the check below so that posting is always off when remapping
> is off.
> 
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -1038,11 +1038,7 @@ static void disable_irq_remapping(void)
>                  iommu_disable_irq_remapping(iommu);
>          }
>   
> -       /*
> -        * Clear Posted-Interrupts capability.
> -        */
> -       if (!disable_irq_post)
> -               intel_irq_remap_ops.capability &= ~(1 << IRQ_POSTING_CAP);
> +       intel_irq_remap_ops.capability &= ~(1 << IRQ_POSTING_CAP);
>   }
> 
Right.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-11 16:51   ` Thomas Gleixner
@ 2024-04-15 18:53     ` Jacob Pan
  2024-04-15 20:43       ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-15 18:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, jacob.jun.pan

Hi Thomas,

On Thu, 11 Apr 2024 18:51:14 +0200, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
> > diff --git a/arch/x86/include/asm/irq_vectors.h
> > b/arch/x86/include/asm/irq_vectors.h index d18bfb238f66..1ee00be8218d
> > 100644 --- a/arch/x86/include/asm/irq_vectors.h
> > +++ b/arch/x86/include/asm/irq_vectors.h
> > @@ -97,9 +97,16 @@
> >  
> >  #define LOCAL_TIMER_VECTOR		0xec
> >  
> > +/*
> > + * Posted interrupt notification vector for all device MSIs delivered to
> > + * the host kernel.
> > + */
> > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
> >  #define NR_VECTORS			 256
> >  
> > -#ifdef CONFIG_X86_LOCAL_APIC
> > +#ifdef CONFIG_X86_POSTED_MSI
> > +#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
> > +#elif defined(CONFIG_X86_LOCAL_APIC)
> >  #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
> >  #else
> >  #define FIRST_SYSTEM_VECTOR		NR_VECTORS  
> 
> This is horrible and we had attempts before to make the system vector
> space dense. They all did not work and making an exception for this is
> not what we want.
> 
> If we really care then we do it proper for _all_ of them. Something like
> the uncompiled below. There is certainly a smarter way to do the build
> thing, but my kbuild foo is rusty.
I too had the concern about wasting system vectors, but did not know how to
fix it. Now your code below works well. Tested without KVM in .config
to show the gaps:

In VECTOR IRQ domain.

BEFORE:
System: 46: 0-31,50,235-236,244,246-255

AFTER:
System: 46: 0-31,50,241-242,245-255

The only gap is MANAGED_IRQ_SHUTDOWN_VECTOR(243), which is expected on a
running system.

Verified in irqvectors.s: .ascii "->MANAGED_IRQ_SHUTDOWN_VECTOR $243

POSTED MSI/first system vector moved up from 235 to 241 for this case.

Will try to let tools/arch/x86/include/asm/irq_vectors.h also use it
instead of manually copying it over each time. Any suggestions are greatly
appreciated.

> ---
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -245,6 +245,7 @@ archscripts: scripts_basic
>  
>  archheaders:
>  	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
> +	$(Q)$(MAKE) $(build)=arch/x86/kernel/irqvectors all
>  
>  ###
>  # Kernel objects
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -43,59 +43,46 @@
>   */
>  #define ISA_IRQ_VECTOR(irq)		(((FIRST_EXTERNAL_VECTOR + 16) & ~15) + irq)
> +#ifndef __ASSEMBLY__
>  /*
> - * Special IRQ vectors used by the SMP architecture, 0xf0-0xff
> - *
> - *  some of the following vectors are 'rare', they are merged
> - *  into a single vector (CALL_FUNCTION_VECTOR) to save vector space.
> - *  TLB, reschedule and local APIC vectors are performance-critical.
> + * Special IRQ vectors used by the SMP architecture, 0xff and downwards
>   */
> +enum {
> +	__SPURIOUS_APIC_VECTOR,
> +	__ERROR_APIC_VECTOR,
> +	__RESCHEDULE_VECTOR,
> +	__CALL_FUNCTION_VECTOR,
> +	__CALL_FUNCTION_SINGLE_VECTOR,
> +	__THERMAL_APIC_VECTOR,
> +	__THRESHOLD_APIC_VECTOR,
> +	__REBOOT_VECTOR,
> +	__X86_PLATFORM_IPI_VECTOR,
> +	__IRQ_WORK_VECTOR,
> +	__DEFERRED_ERROR_VECTOR,
>  
> -#define SPURIOUS_APIC_VECTOR		0xff
> -/*
> - * Sanity check
> - */
> -#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F)
> -# error SPURIOUS_APIC_VECTOR definition error
> +#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
> +	__HYPERVISOR_CALLBACK_VECTOR,
>  #endif
>  
> -#define ERROR_APIC_VECTOR		0xfe
> -#define RESCHEDULE_VECTOR		0xfd
> -#define CALL_FUNCTION_VECTOR		0xfc
> -#define CALL_FUNCTION_SINGLE_VECTOR	0xfb
> -#define THERMAL_APIC_VECTOR		0xfa
> -#define THRESHOLD_APIC_VECTOR		0xf9
> -#define REBOOT_VECTOR			0xf8
> -
> -/*
> - * Generic system vector for platform specific use
> - */
> -#define X86_PLATFORM_IPI_VECTOR		0xf7
> -
> -/*
> - * IRQ work vector:
> - */
> -#define IRQ_WORK_VECTOR			0xf6
> -
> -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> -#define DEFERRED_ERROR_VECTOR		0xf4
> -
> -/* Vector on which hypervisor callbacks will be delivered */
> -#define HYPERVISOR_CALLBACK_VECTOR	0xf3
> -
> -/* Vector for KVM to deliver posted interrupt IPI */
> -#define POSTED_INTR_VECTOR		0xf2
> -#define POSTED_INTR_WAKEUP_VECTOR	0xf1
> -#define POSTED_INTR_NESTED_VECTOR	0xf0
> -
> -#define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
> +#if IS_ENABLED(CONFIG_KVM)
> +	/* Vector for KVM to deliver posted interrupt IPI */
> +	__POSTED_INTR_VECTOR,
> +	__POSTED_INTR_WAKEUP_VECTOR,
> +	__POSTED_INTR_NESTED_VECTOR,
> +#endif
> +	__MANAGED_IRQ_SHUTDOWN_VECTOR,
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
> -#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
> -#define HYPERV_STIMER0_VECTOR		0xed
> +	__HYPERV_REENLIGHTENMENT_VECTOR,
> +	__HYPERV_STIMER0_VECTOR,
>  #endif
> +	__LOCAL_TIMER_VECTOR,
> +};
> +#endif /* !__ASSEMBLY__ */
>  
> -#define LOCAL_TIMER_VECTOR		0xec
> +#ifndef COMPILE_OFFSETS
> +#include <asm/irqvectors.h>
> +#endif
>  
>  #define NR_VECTORS			 256
>  
> --- /dev/null
> +++ b/arch/x86/kernel/irqvectors/Makefile
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +irqvectors-file	:= arch/$(SRCARCH)/include/generated/asm/irqvectors.h
> +targets 	+= arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s
> +
> +$(irqvectors-file): arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s FORCE
> +	$(call filechk,offsets,__ASM_IRQVECTORS_H__)
> +
> +PHONY += all
> +all: $(irqvectors-file)
> +	@:
> --- /dev/null
> +++ b/arch/x86/kernel/irqvectors/irqvectors.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define COMPILE_OFFSETS
> +
> +#include <linux/kbuild.h>
> +#include <asm/irq_vectors.h>
> +
> +#define VECNR(v)	(0xFF - __##v)
> +#define VECTOR(v)	DEFINE(v, VECNR(v))
> +
> +static void __used common(void)
> +{
> +	VECTOR(SPURIOUS_APIC_VECTOR);
> +	VECTOR(ERROR_APIC_VECTOR);
> +	VECTOR(RESCHEDULE_VECTOR);
> +	VECTOR(CALL_FUNCTION_VECTOR);
> +	VECTOR(CALL_FUNCTION_SINGLE_VECTOR);
> +	VECTOR(THERMAL_APIC_VECTOR);
> +	VECTOR(THRESHOLD_APIC_VECTOR);
> +	VECTOR(REBOOT_VECTOR);
> +	VECTOR(X86_PLATFORM_IPI_VECTOR);
> +	VECTOR(IRQ_WORK_VECTOR);
> +	VECTOR(DEFERRED_ERROR_VECTOR);
> +
> +#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
> +	VECTOR(HYPERVISOR_CALLBACK_VECTOR);
> +#endif
> +
> +#if IS_ENABLED(CONFIG_KVM)
> +	/* Vector for KVM to deliver posted interrupt IPI */
> +	VECTOR(POSTED_INTR_VECTOR);
> +	VECTOR(POSTED_INTR_WAKEUP_VECTOR);
> +	VECTOR(POSTED_INTR_NESTED_VECTOR);
> +#endif
> +	VECTOR(MANAGED_IRQ_SHUTDOWN_VECTOR);
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +	VECTOR(HYPERV_REENLIGHTENMENT_VECTOR);
> +	VECTOR(HYPERV_STIMER0_VECTOR);
> +#endif
> +	VECTOR(LOCAL_TIMER_VECTOR);
> +}
> +
> 
> 
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-15 18:53     ` Jacob Pan
@ 2024-04-15 20:43       ` Jacob Pan
  2024-04-19  4:00         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-15 20:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, jacob.jun.pan, acme,
	kan.liang, Kleen, Andi


On Mon, 15 Apr 2024 11:53:58 -0700, Jacob Pan
<jacob.jun.pan@linux.intel.com> wrote:

> Hi Thomas,
> 
> On Thu, 11 Apr 2024 18:51:14 +0200, Thomas Gleixner <tglx@linutronix.de>
> wrote:
> 
> > On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:  
> > > diff --git a/arch/x86/include/asm/irq_vectors.h
> > > b/arch/x86/include/asm/irq_vectors.h index d18bfb238f66..1ee00be8218d
> > > 100644 --- a/arch/x86/include/asm/irq_vectors.h
> > > +++ b/arch/x86/include/asm/irq_vectors.h
> > > @@ -97,9 +97,16 @@
> > >  
> > >  #define LOCAL_TIMER_VECTOR		0xec
> > >  
> > > +/*
> > > + * Posted interrupt notification vector for all device MSIs delivered to
> > > + * the host kernel.
> > > + */
> > > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
> > >  #define NR_VECTORS			 256
> > >  
> > > -#ifdef CONFIG_X86_LOCAL_APIC
> > > +#ifdef CONFIG_X86_POSTED_MSI
> > > +#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
> > > +#elif defined(CONFIG_X86_LOCAL_APIC)
> > >  #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
> > >  #else
> > >  #define FIRST_SYSTEM_VECTOR		NR_VECTORS    
> > 
> > This is horrible and we had attempts before to make the system vector
> > space dense. They all did not work and making an exception for this is
> > not what we want.
> > 
> > If we really care then we do it proper for _all_ of them. Something like
> > the uncompiled below. There is certainly a smarter way to do the build
> > thing, but my kbuild foo is rusty.  
> I too had the concern about wasting system vectors, but did not know how
> to fix it. Now your code below works well. Tested without KVM in
> .config to show the gaps:
> 
> In VECTOR IRQ domain.
> 
> BEFORE:
> System: 46: 0-31,50,235-236,244,246-255
> 
> AFTER:
> System: 46: 0-31,50,241-242,245-255
> 
> The only gap is MANAGED_IRQ_SHUTDOWN_VECTOR(243), which is expected on a
> running system.
> 
> Verified in irqvectors.s: .ascii "->MANAGED_IRQ_SHUTDOWN_VECTOR $243
> 
> POSTED MSI/first system vector moved up from 235 to 241 for this case.
> 
> Will try to let tools/arch/x86/include/asm/irq_vectors.h also use it
> instead of manually copying it over each time. Any suggestions are greatly
> appreciated.
> 
On second thought, if we make system IRQ vectors determined at compile
time based on different CONFIG options, will it break userspace tools such
as perf? More importantly, there is the rule of not breaking userspace.

+Arnaldo

> > ---
> > --- a/arch/x86/Makefile
> > +++ b/arch/x86/Makefile
> > @@ -245,6 +245,7 @@ archscripts: scripts_basic
> >  
> >  archheaders:
> >  	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
> > +	$(Q)$(MAKE) $(build)=arch/x86/kernel/irqvectors all
> >  
> >  ###
> >  # Kernel objects
> > --- a/arch/x86/include/asm/irq_vectors.h
> > +++ b/arch/x86/include/asm/irq_vectors.h
> > @@ -43,59 +43,46 @@
> >   */
> >  #define ISA_IRQ_VECTOR(irq)		(((FIRST_EXTERNAL_VECTOR + 16) & ~15) + irq)
> > +#ifndef __ASSEMBLY__
> >  /*
> > - * Special IRQ vectors used by the SMP architecture, 0xf0-0xff
> > - *
> > - *  some of the following vectors are 'rare', they are merged
> > - *  into a single vector (CALL_FUNCTION_VECTOR) to save vector space.
> > - *  TLB, reschedule and local APIC vectors are performance-critical.
> > + * Special IRQ vectors used by the SMP architecture, 0xff and downwards
> >   */
> > +enum {
> > +	__SPURIOUS_APIC_VECTOR,
> > +	__ERROR_APIC_VECTOR,
> > +	__RESCHEDULE_VECTOR,
> > +	__CALL_FUNCTION_VECTOR,
> > +	__CALL_FUNCTION_SINGLE_VECTOR,
> > +	__THERMAL_APIC_VECTOR,
> > +	__THRESHOLD_APIC_VECTOR,
> > +	__REBOOT_VECTOR,
> > +	__X86_PLATFORM_IPI_VECTOR,
> > +	__IRQ_WORK_VECTOR,
> > +	__DEFERRED_ERROR_VECTOR,
> >  
> > -#define SPURIOUS_APIC_VECTOR		0xff
> > -/*
> > - * Sanity check
> > - */
> > -#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F)
> > -# error SPURIOUS_APIC_VECTOR definition error
> > +#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
> > +	__HYPERVISOR_CALLBACK_VECTOR,
> >  #endif
> >  
> > -#define ERROR_APIC_VECTOR		0xfe
> > -#define RESCHEDULE_VECTOR		0xfd
> > -#define CALL_FUNCTION_VECTOR		0xfc
> > -#define CALL_FUNCTION_SINGLE_VECTOR	0xfb
> > -#define THERMAL_APIC_VECTOR		0xfa
> > -#define THRESHOLD_APIC_VECTOR		0xf9
> > -#define REBOOT_VECTOR			0xf8
> > -
> > -/*
> > - * Generic system vector for platform specific use
> > - */
> > -#define X86_PLATFORM_IPI_VECTOR		0xf7
> > -
> > -/*
> > - * IRQ work vector:
> > - */
> > -#define IRQ_WORK_VECTOR			0xf6
> > -
> > -/* 0xf5 - unused, was UV_BAU_MESSAGE */
> > -#define DEFERRED_ERROR_VECTOR		0xf4
> > -
> > -/* Vector on which hypervisor callbacks will be delivered */
> > -#define HYPERVISOR_CALLBACK_VECTOR	0xf3
> > -
> > -/* Vector for KVM to deliver posted interrupt IPI */
> > -#define POSTED_INTR_VECTOR		0xf2
> > -#define POSTED_INTR_WAKEUP_VECTOR	0xf1
> > -#define POSTED_INTR_NESTED_VECTOR	0xf0
> > -
> > -#define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
> > +#if IS_ENABLED(CONFIG_KVM)
> > +	/* Vector for KVM to deliver posted interrupt IPI */
> > +	__POSTED_INTR_VECTOR,
> > +	__POSTED_INTR_WAKEUP_VECTOR,
> > +	__POSTED_INTR_NESTED_VECTOR,
> > +#endif
> > +	__MANAGED_IRQ_SHUTDOWN_VECTOR,
> >  
> >  #if IS_ENABLED(CONFIG_HYPERV)
> > -#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
> > -#define HYPERV_STIMER0_VECTOR		0xed
> > +	__HYPERV_REENLIGHTENMENT_VECTOR,
> > +	__HYPERV_STIMER0_VECTOR,
> >  #endif
> > +	__LOCAL_TIMER_VECTOR,
> > +};
> > +#endif /* !__ASSEMBLY__ */
> >  
> > -#define LOCAL_TIMER_VECTOR		0xec
> > +#ifndef COMPILE_OFFSETS
> > +#include <asm/irqvectors.h>
> > +#endif
> >  
> >  #define NR_VECTORS			 256
> >  
> > --- /dev/null
> > +++ b/arch/x86/kernel/irqvectors/Makefile
> > @@ -0,0 +1,11 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +irqvectors-file	:= arch/$(SRCARCH)/include/generated/asm/irqvectors.h
> > +targets 	+= arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s
> > +
> > +$(irqvectors-file): arch/$(SRCARCH)/kernel/irqvectors/irqvectors.s FORCE
> > +	$(call filechk,offsets,__ASM_IRQVECTORS_H__)
> > +
> > +PHONY += all
> > +all: $(irqvectors-file)
> > +	@:
> > --- /dev/null
> > +++ b/arch/x86/kernel/irqvectors/irqvectors.c
> > @@ -0,0 +1,42 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#define COMPILE_OFFSETS
> > +
> > +#include <linux/kbuild.h>
> > +#include <asm/irq_vectors.h>
> > +
> > +#define VECNR(v)	(0xFF - __##v)
> > +#define VECTOR(v)	DEFINE(v, VECNR(v))
> > +
> > +static void __used common(void)
> > +{
> > +	VECTOR(SPURIOUS_APIC_VECTOR);
> > +	VECTOR(ERROR_APIC_VECTOR);
> > +	VECTOR(RESCHEDULE_VECTOR);
> > +	VECTOR(CALL_FUNCTION_VECTOR);
> > +	VECTOR(CALL_FUNCTION_SINGLE_VECTOR);
> > +	VECTOR(THERMAL_APIC_VECTOR);
> > +	VECTOR(THRESHOLD_APIC_VECTOR);
> > +	VECTOR(REBOOT_VECTOR);
> > +	VECTOR(X86_PLATFORM_IPI_VECTOR);
> > +	VECTOR(IRQ_WORK_VECTOR);
> > +	VECTOR(DEFERRED_ERROR_VECTOR);
> > +
> > +#if IS_ENABLED(CONFIG_HYPERVISOR_GUEST)
> > +	VECTOR(HYPERVISOR_CALLBACK_VECTOR);
> > +#endif
> > +
> > +#if IS_ENABLED(CONFIG_KVM)
> > +	/* Vector for KVM to deliver posted interrupt IPI */
> > +	VECTOR(POSTED_INTR_VECTOR);
> > +	VECTOR(POSTED_INTR_WAKEUP_VECTOR);
> > +	VECTOR(POSTED_INTR_NESTED_VECTOR);
> > +#endif
> > +	VECTOR(MANAGED_IRQ_SHUTDOWN_VECTOR);
> > +
> > +#if IS_ENABLED(CONFIG_HYPERV)
> > +	VECTOR(HYPERV_REENLIGHTENMENT_VECTOR);
> > +	VECTOR(HYPERV_STIMER0_VECTOR);
> > +#endif
> > +	VECTOR(LOCAL_TIMER_VECTOR);
> > +}
> > +
> > 
> > 
> >   
> 
> 
> Thanks,
> 
> Jacob


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option
  2024-04-12  9:31   ` Tian, Kevin
@ 2024-04-15 23:20     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-15 23:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Fri, 12 Apr 2024 09:31:32 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > +#ifdef CONFIG_X86_POSTED_MSI
> > +		else if (!strncmp(str, "posted_msi", 10)) {
> > +			if (disable_irq_post || disable_irq_remap)
> > +				pr_warn("Posted MSI not enabled due to conflicting options!");
> > +			else
> > +				enable_posted_msi = 1;
> > +		}
> > +#endif  
> 
> the check of disable_irq_remap is unnecessary. It's unlikely to have
> a configuration with disable_irq_post=0 while disable_irq_remap=1
> given the latter has bigger scope.
> 
> but thinking more do we really need a check here? there is no order
> guarantee that "posted_msi" is parsed after the parameters deciding
> the value of two disable variables.
> 
> it probably makes more sense to just set enable_posted_msi here
> and then do all required checks when picking up the irqchip in
> intel_irq_remapping_alloc().

Makes sense, I have a helper function posted_msi_supported() called in
intel_irq_remapping_alloc() already.

My intention was to alert negligent users, but it is not really necessary,
as you said.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-12 14:27     ` Sean Christopherson
@ 2024-04-16  3:45       ` Tian, Kevin
  0 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2024-04-16  3:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	Peter Anvin, Borislav Petkov, Ingo Molnar, Luse, Paul E,
	Williams, Dan J, Jens Axboe, Raj, Ashok, maz, Robin Murphy,
	jim.harris, a.manzanares, Bjorn Helgaas, Zeng, Guang,
	robert.hoo.linux

> From: Sean Christopherson <seanjc@google.com>
> Sent: Friday, April 12, 2024 10:28 PM
> 
> On Fri, Apr 12, 2024, Kevin Tian wrote:
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Saturday, April 6, 2024 6:31 AM
> > >
> > > +/*
> > > + * Posted interrupt notification vector for all device MSIs delivered to
> > > + * the host kernel.
> > > + */
> > > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
> > >  #define NR_VECTORS			 256
> > >
> >
> > Every interrupt is kind of a notification.
> 
> FWIW, I find value in having "notification" in the name to differentiate
> between
> the IRQ that is notifying the CPU that there's a posted IRQ to be processed,
> and
> the posted IRQ itself.

IMHO one who knows posted MSI doesn't need the extra
'notification' in the name to differentiate.

one who doesn't know what posted MSI is needs to look at
the surrounding code, including the above comment, anyway;
having 'notification' in the name alone doesn't really help.

but I'd not hold strong on this... 😊

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts
  2024-04-12 18:23     ` Jacob Pan
@ 2024-04-16  3:47       ` Tian, Kevin
  0 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2024-04-16  3:47 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 13, 2024 2:24 AM
> 
> Hi Kevin,
> 
> On Fri, 12 Apr 2024 09:25:57 +0000, "Tian, Kevin" <kevin.tian@intel.com>
> wrote:
> 
> > > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Sent: Saturday, April 6, 2024 6:31 AM
> > >
> > > During interrupt affinity change, it is possible to have interrupts
> > > delivered to the old CPU after the affinity has changed to the new one.
> > > To prevent lost interrupts, local APIC IRR is checked on the old CPU.
> > > Similar checks must be done for posted MSIs given the same reason.
> > >
> > > Consider the following scenario:
> > > 	Device		system agent		iommu		memory		CPU/LAPIC
> > > 1	FEEX_XXXX
> > > 2			Interrupt request
> > > 3						Fetch IRTE	->
> > > 4						->Atomic Swap	PID.PIR(vec)
> > > 							Push to Global Observable(GO)
> > > 5						if (ON*)
> > > 	i						done;*
> >
> > there is a stray 'i'
> will fix, thanks
> 
> >
> > > 						else
> > > 6							send a notification ->
> > >
> > > * ON: outstanding notification, 1 will suppress new notifications
> > >
> > > If the affinity change happens between 3 and 5 in IOMMU, the old CPU's
> > > posted
> > > interrupt request (PIR) could have pending bit set for the vector being
> > > moved.
> >
> > how could affinity change be possible in 3/4 when the cache line is
> > locked by IOMMU? Strictly speaking it's about a change after 4 and
> > before 6.
> SW can still perform an affinity change on the IRTE and flush the IR
> cache after the IOMMU fetched it (step 3). They are async events.
> 
> In step 4, the atomic swap is on the PID cacheline, not IRTE.
> 

yeah, I mixed IRTE with PID.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2024-04-12  9:36   ` Tian, Kevin
@ 2024-04-16 22:15     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-16 22:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, maz, seanjc, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, Zeng, Guang, robert.hoo.linux,
	jacob.jun.pan

Hi Kevin,

On Fri, 12 Apr 2024 09:36:10 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > + *
> > + * For the example below, 3 MSIs are coalesced into one CPU
> > notification. Only
> > + * one apic_eoi() is needed.
> > + *
> > + * __sysvec_posted_msi_notification()
> > + *	irq_enter();
> > + *		handle_edge_irq()
> > + *			irq_chip_ack_parent()
> > + *				dummy(); // No EOI
> > + *			handle_irq_event()
> > + *				driver_handler()
> > + *	irq_enter();
> > + *		handle_edge_irq()
> > + *			irq_chip_ack_parent()
> > + *				dummy(); // No EOI
> > + *			handle_irq_event()
> > + *				driver_handler()
> > + *	irq_enter();
> > + *		handle_edge_irq()
> > + *			irq_chip_ack_parent()
> > + *				dummy(); // No EOI
> > + *			handle_irq_event()
> > + *				driver_handler()  
> 
> typo: you added three irq_enter()'s here
right, will remove the middle two.

> 
> > + *	apic_eoi()
> > + * irq_exit()
> > + */
> > +static struct irq_chip intel_ir_chip_post_msi = {
> > +	.name			= "INTEL-IR-POST",
> > +	.irq_ack		= dummy,
> > +	.irq_set_affinity	= intel_ir_set_affinity,
> > +	.irq_compose_msi_msg	= intel_ir_compose_msi_msg,
> > +	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
> > +};  
> 
> What about putting this patch at end of the series (combining the
> change in intel_irq_remapping_alloc()) to finally enable this
> feature?
> 
> It reads slightly better to me to first get those callbacks extended
> to deal with the new mechanism (i.e. most changes in patch13)
> before using them in the new irqchip. 😊

makes sense, will do.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code
  2024-04-05 22:30 ` [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
@ 2024-04-17  0:34   ` Sean Christopherson
  2024-04-17 18:33     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2024-04-17  0:34 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, guang.zeng, robert.hoo.linux

"KVM" in the scope would be nice.

On Fri, Apr 05, 2024, Jacob Pan wrote:
> To prepare native usage of posted interrupt, move PID declaration out of
> VMX code such that they can be shared.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/include/asm/posted_intr.h | 88 ++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/posted_intr.h     | 93 +-----------------------------
>  arch/x86/kvm/vmx/vmx.c             |  1 +
>  arch/x86/kvm/vmx/vmx.h             |  2 +-
>  4 files changed, 91 insertions(+), 93 deletions(-)
>  create mode 100644 arch/x86/include/asm/posted_intr.h

Acked-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor
  2024-04-05 22:31 ` [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor Jacob Pan
@ 2024-04-17  0:39   ` Sean Christopherson
  2024-04-17 18:01     ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2024-04-17  0:39 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, guang.zeng, robert.hoo.linux

"KVM" here would be nice too.

On Fri, Apr 05, 2024, Jacob Pan wrote:
> Mixture of bitfields and types is weird and really not intuitive, remove
> bitfields and use typed data exclusively.
> 
> Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> 
> ---
> v2:
> 	- Replace bitfields, no more mix.
> ---
>  arch/x86/include/asm/posted_intr.h | 10 +---------
>  arch/x86/kvm/vmx/posted_intr.c     |  4 ++--
>  arch/x86/kvm/vmx/vmx.c             |  2 +-
>  3 files changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
> index acf237b2882e..c682c41d4e44 100644
> --- a/arch/x86/include/asm/posted_intr.h
> +++ b/arch/x86/include/asm/posted_intr.h
> @@ -15,17 +15,9 @@ struct pi_desc {
>  	};
>  	union {
>  		struct {
> -				/* bit 256 - Outstanding Notification */
> -			u16	on	: 1,
> -				/* bit 257 - Suppress Notification */
> -				sn	: 1,
> -				/* bit 271:258 - Reserved */
> -				rsvd_1	: 14;
> -				/* bit 279:272 - Notification Vector */
> +			u16	notif_ctrl; /* Suppress and outstanding bits */
>  			u8	nv;
> -				/* bit 287:280 - Reserved */
>  			u8	rsvd_2;
> -				/* bit 319:288 - Notification Destination */
>  			u32	ndst;
>  		};
>  		u64 control;
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index af662312fd07..592dbb765675 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  		 * handle task migration (@cpu != vcpu->cpu).
>  		 */
>  		new.ndst = dest;
> -		new.sn = 0;
> +		new.notif_ctrl &= ~POSTED_INTR_SN;

At the risk of creating confusion, would it make sense to add double-underscore,
non-atomic versions of the set/clear helpers for ON and SN?

I can't tell if that's a net positive versus open coding clear() and set() here
and below.

>  		/*
>  		 * Restore the notification vector; in the blocking case, the
> @@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>  	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>  
> -	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
> +	WARN(pi_desc->notif_ctrl & POSTED_INTR_SN, "PI descriptor SN field set before blocking");

This can use pi_test_sn(), as test_bit() isn't atomic, i.e. doesn't incur a LOCK.

>  
>  	old.control = READ_ONCE(pi_desc->control);
>  	do {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d94bb069bac9..50580bbfba5d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4843,7 +4843,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>  	 * or POSTED_INTR_WAKEUP_VECTOR.
>  	 */
>  	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> -	vmx->pi_desc.sn = 1;
> +	vmx->pi_desc.notif_ctrl |= POSTED_INTR_SN;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor
  2024-04-17  0:39   ` Sean Christopherson
@ 2024-04-17 18:01     ` Jacob Pan
  2024-04-18 17:30       ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Jacob Pan @ 2024-04-17 18:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, guang.zeng, robert.hoo.linux,
	jacob.jun.pan, oliver.sang

Hi Sean,

On Tue, 16 Apr 2024 17:39:42 -0700, Sean Christopherson <seanjc@google.com>
wrote:

> "KVM" here would be nice too.
> 
> On Fri, Apr 05, 2024, Jacob Pan wrote:
> > Mixture of bitfields and types is weird and really not intuitive, remove
> > bitfields and use typed data exclusively.
> > 
> > Link:
> > https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
> > Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by:
> > Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan
> > <jacob.jun.pan@linux.intel.com>
> > 
> > ---
> > v2:
> > 	- Replace bitfields, no more mix.
> > ---
> >  arch/x86/include/asm/posted_intr.h | 10 +---------
> >  arch/x86/kvm/vmx/posted_intr.c     |  4 ++--
> >  arch/x86/kvm/vmx/vmx.c             |  2 +-
> >  3 files changed, 4 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
> > index acf237b2882e..c682c41d4e44 100644
> > --- a/arch/x86/include/asm/posted_intr.h
> > +++ b/arch/x86/include/asm/posted_intr.h
> > @@ -15,17 +15,9 @@ struct pi_desc {
> >  	};
> >  	union {
> >  		struct {
> > -				/* bit 256 - Outstanding Notification */
> > -			u16	on	: 1,
> > -				/* bit 257 - Suppress Notification */
> > -				sn	: 1,
> > -				/* bit 271:258 - Reserved */
> > -				rsvd_1	: 14;
> > -				/* bit 279:272 - Notification Vector */
> > +			u16	notif_ctrl; /* Suppress and outstanding bits */
> >  			u8	nv;
> > -				/* bit 287:280 - Reserved */
> >  			u8	rsvd_2;
> > -				/* bit 319:288 - Notification Destination */
> >  			u32	ndst;
> >  		};
> >  		u64 control;
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index af662312fd07..592dbb765675 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> >  		 * handle task migration (@cpu != vcpu->cpu).
> >  		 */
> >  		new.ndst = dest;
> > -		new.sn = 0;
> > +		new.notif_ctrl &= ~POSTED_INTR_SN;  
> 
> At the risk of creating confusing, would it make sense to add
> double-underscore, non-atomic versions of the set/clear helpers for ON
> and SN?
> 
> I can't tell if that's a net positive versus open coding clear() and
> set() here and below.
IMHO, we can add non-atomic helpers when we have more than one user for
each operation.

I do have a stupid bug here, it should be:
-               new.notif_ctrl &= ~POSTED_INTR_SN;
+               new.notif_ctrl &= ~BIT(POSTED_INTR_SN);
Same as below.

Thanks to Oliver (LKP KVM self test). I didn't catch that in my VFIO device
assignment test.

> 
> >  		/*
> >  		 * Restore the notification vector; in the blocking case, the
> > @@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> >  		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> >  	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > 
> > -	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
> > +	WARN(pi_desc->notif_ctrl & POSTED_INTR_SN, "PI descriptor SN field set before blocking");
> 
> This can use pi_test_sn(), as test_bit() isn't atomic, i.e. doesn't incur
> a LOCK.
make sense. will do.

> >  
> >  	old.control = READ_ONCE(pi_desc->control);
> >  	do {
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index d94bb069bac9..50580bbfba5d 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -4843,7 +4843,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> >  	 * or POSTED_INTR_WAKEUP_VECTOR.
> >  	 */
> >  	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> > -	vmx->pi_desc.sn = 1;
> > +	vmx->pi_desc.notif_ctrl |= POSTED_INTR_SN;  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code
  2024-04-17  0:34   ` Sean Christopherson
@ 2024-04-17 18:33     ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-17 18:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, guang.zeng, robert.hoo.linux,
	jacob.jun.pan

Hi Sean,

On Tue, 16 Apr 2024 17:34:36 -0700, Sean Christopherson <seanjc@google.com>
wrote:

> "KVM" in the scope would be nice.
will change to "KVM: VMX:"

> 
> On Fri, Apr 05, 2024, Jacob Pan wrote:
> > To prepare native usage of posted interrupt, move PID declaration out of
> > VMX code such that they can be shared.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  arch/x86/include/asm/posted_intr.h | 88 ++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/posted_intr.h     | 93 +-----------------------------
> >  arch/x86/kvm/vmx/vmx.c             |  1 +
> >  arch/x86/kvm/vmx/vmx.h             |  2 +-
> >  4 files changed, 91 insertions(+), 93 deletions(-)
> >  create mode 100644 arch/x86/include/asm/posted_intr.h  
> 
> Acked-by: Sean Christopherson <seanjc@google.com>


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor
  2024-04-17 18:01     ` Jacob Pan
@ 2024-04-18 17:30       ` Thomas Gleixner
  2024-04-18 18:10         ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2024-04-18 17:30 UTC (permalink / raw)
  To: Jacob Pan, Sean Christopherson
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Kevin Tian, maz, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, jacob.jun.pan,
	oliver.sang

On Wed, Apr 17 2024 at 11:01, Jacob Pan wrote:
> On Tue, 16 Apr 2024 17:39:42 -0700, Sean Christopherson <seanjc@google.com>
> wrote:
>> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
>> > index af662312fd07..592dbb765675 100644
>> > --- a/arch/x86/kvm/vmx/posted_intr.c
>> > +++ b/arch/x86/kvm/vmx/posted_intr.c
>> > @@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>> >  		 * handle task migration (@cpu != vcpu->cpu).
>> >  		 */
>> >  		new.ndst = dest;
>> > -		new.sn = 0;
>> > +		new.notif_ctrl &= ~POSTED_INTR_SN;  
>> 
>> At the risk of creating confusing, would it make sense to add
>> double-underscore, non-atomic versions of the set/clear helpers for ON
>> and SN?
>> 
>> I can't tell if that's a net positive versus open coding clear() and
>> set() here and below.
> IMHO, we can add non-atomic helpers when we have more than one user for
> each operation.
>
> I do have a stupid bug here, it should be:
> -               new.notif_ctrl &= ~POSTED_INTR_SN;
> +               new.notif_ctrl &= ~BIT(POSTED_INTR_SN);

That's a perfect reason to use a proper helper.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor
  2024-04-18 17:30       ` Thomas Gleixner
@ 2024-04-18 18:10         ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-18 18:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sean Christopherson, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy, jim.harris,
	a.manzanares, Bjorn Helgaas, guang.zeng, robert.hoo.linux,
	oliver.sang, jacob.jun.pan

Hi Thomas,

On Thu, 18 Apr 2024 19:30:52 +0200, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Wed, Apr 17 2024 at 11:01, Jacob Pan wrote:
> > On Tue, 16 Apr 2024 17:39:42 -0700, Sean Christopherson
> > <seanjc@google.com> wrote:  
> >> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> >> > index af662312fd07..592dbb765675 100644
> >> > --- a/arch/x86/kvm/vmx/posted_intr.c
> >> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> >> > @@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> >> >  		 * handle task migration (@cpu != vcpu->cpu).
> >> >  		 */
> >> >  		new.ndst = dest;
> >> > -		new.sn = 0;
> >> > +		new.notif_ctrl &= ~POSTED_INTR_SN;    
> >> 
> >> At the risk of creating confusing, would it make sense to add
> >> double-underscore, non-atomic versions of the set/clear helpers for ON
> >> and SN?
> >> 
> >> I can't tell if that's a net positive versus open coding clear() and
> >> set() here and below.  
> > IMHO, we can add non-atomic helpers when we have more than one user for
> > each operation.
> >
> > I do have a stupid bug here, it should be:
> > -               new.notif_ctrl &= ~POSTED_INTR_SN;
> > +               new.notif_ctrl &= ~BIT(POSTED_INTR_SN);  
> 
> That's a perfect reason to use a proper helper.
I just proved I was wrong :) will do.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-15 20:43       ` Jacob Pan
@ 2024-04-19  4:00         ` Thomas Gleixner
  2024-04-19 20:07           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2024-04-19  4:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, jacob.jun.pan, acme,
	kan.liang, Kleen, Andi

On Mon, Apr 15 2024 at 13:43, Jacob Pan wrote:
> On Mon, 15 Apr 2024 11:53:58 -0700, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
>> On Thu, 11 Apr 2024 18:51:14 +0200, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > If we really care then we do it proper for _all_ of them. Something like
>> > the uncompiled below. There is certainly a smarter way to do the build
>> > thing, but my kbuild foo is rusty.  
>> I too had the concern of the wasting system vectors, but did not know how
>> to fix it. But now your code below works well. Tested without KVM in
>> .config to show the gaps:
>> 
>> In VECTOR IRQ domain.
>> 
>> BEFORE:
>> System: 46: 0-31,50,235-236,244,246-255
>> 
>> AFTER:
>> System: 46: 0-31,50,241-242,245-255
>> 
>> The only gap is MANAGED_IRQ_SHUTDOWN_VECTOR(243), which is expected on a
>> running system.
>> 
>> Verified in irqvectors.s: .ascii "->MANAGED_IRQ_SHUTDOWN_VECTOR $243
>> 
>> POSTED MSI/first system vector moved up from 235 to 241 for this case.
>> 
>> Will try to let tools/arch/x86/include/asm/irq_vectors.h also use it
>> instead of manually copy over each time. Any suggestions greatly
>> appreciated.
>>
> On a second thought, if we make system IRQ vector determined at compile
> time based on different CONFIG options, will it break userspace tools such
> as perf? More importantly the rule of not breaking userspace.

tools/arch/x86/include/asm/irq_vectors.h is only used to generate the
list of system vectors for pretty output. And your change already broke
that.

The obvious solution to that is to expose that list in sysfs for
consumption by perf.

But we don't have to do any of that right away. It's an orthogonal
issue. Just waste the extra system vector to start with and then we can
add the compile time dependend change on top if we really care about
gaining back the vectors.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-19  4:00         ` Thomas Gleixner
@ 2024-04-19 20:07           ` Arnaldo Carvalho de Melo
  2024-04-22 22:32             ` Jacob Pan
  0 siblings, 1 reply; 47+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-04-19 20:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu, Lu Baolu,
	kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Paul Luse, Dan Williams, Jens Axboe, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jim.harris, a.manzanares,
	Bjorn Helgaas, guang.zeng, robert.hoo.linux, kan.liang, Kleen,
	Andi

On Fri, Apr 19, 2024 at 06:00:24AM +0200, Thomas Gleixner wrote:
> On Mon, Apr 15 2024 at 13:43, Jacob Pan wrote:
> > On Mon, 15 Apr 2024 11:53:58 -0700, Jacob Pan <jacob.jun.pan@linux.intel.com> wrote:
> >> On Thu, 11 Apr 2024 18:51:14 +0200, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > If we really care then we do it proper for _all_ of them. Something like
> >> > the uncompiled below. There is certainly a smarter way to do the build
> >> > thing, but my kbuild foo is rusty.  
> >> I too had the concern of the wasting system vectors, but did not know how
> >> to fix it. But now your code below works well. Tested without KVM in
> >> .config to show the gaps:
> >> 
> >> In VECTOR IRQ domain.
> >> 
> >> BEFORE:
> >> System: 46: 0-31,50,235-236,244,246-255
> >> 
> >> AFTER:
> >> System: 46: 0-31,50,241-242,245-255
> >> 
> >> The only gap is MANAGED_IRQ_SHUTDOWN_VECTOR(243), which is expected on a
> >> running system.
> >> 
> >> Verified in irqvectors.s: .ascii "->MANAGED_IRQ_SHUTDOWN_VECTOR $243
> >> 
> >> POSTED MSI/first system vector moved up from 235 to 241 for this case.
> >> 
> >> Will try to let tools/arch/x86/include/asm/irq_vectors.h also use it
> >> instead of manually copy over each time. Any suggestions greatly
> >> appreciated.
> >>
> > On a second thought, if we make system IRQ vector determined at compile
> > time based on different CONFIG options, will it break userspace tools such
> > as perf? More importantly the rule of not breaking userspace.

The rule for tools/perf is "don't impose _any requirement_ on the kernel
developers, they don't have to test if any change they do outside of
tools/ will break something inside tools/."
 
> tools/arch/x86/include/asm/irq_vectors.h is only used to generate the
> list of system vectors for pretty output. And your change already broke
> that.

Yeah, I even moved that from tools/arch/x86/include/asm/irq_vectors.h
to tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h (for next
merge window).

Having it in tools/arch/x86/include/asm/irq_vectors.h was a bad decision
as it, as you mentioned, is only used to generate string tables:

⬢[acme@toolbox perf-tools-next]$ tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh 
static const char *x86_irq_vectors[] = {
	[0x02] = "NMI",
	[0x80] = "IA32_SYSCALL",
	[0xec] = "LOCAL_TIMER",
	[0xed] = "HYPERV_STIMER0",
	[0xee] = "HYPERV_REENLIGHTENMENT",
	[0xef] = "MANAGED_IRQ_SHUTDOWN",
	[0xf0] = "POSTED_INTR_NESTED",
	[0xf1] = "POSTED_INTR_WAKEUP",
	[0xf2] = "POSTED_INTR",
	[0xf3] = "HYPERVISOR_CALLBACK",
	[0xf4] = "DEFERRED_ERROR",
	[0xf6] = "IRQ_WORK",
	[0xf7] = "X86_PLATFORM_IPI",
	[0xf8] = "REBOOT",
	[0xf9] = "THRESHOLD_APIC",
	[0xfa] = "THERMAL_APIC",
	[0xfb] = "CALL_FUNCTION_SINGLE",
	[0xfc] = "CALL_FUNCTION",
	[0xfd] = "RESCHEDULE",
	[0xfe] = "ERROR_APIC",
	[0xff] = "SPURIOUS_APIC",
};

⬢[acme@toolbox perf-tools-next]$

Used in:

root@number:~# perf trace -a -e irq_vectors:irq_work_entry/max-stack=32/ --max-events=1
     0.000 kworker/u57:0-/9912 irq_vectors:irq_work_entry(vector: IRQ_WORK)
                                       __sysvec_irq_work ([kernel.kallsyms])
                                       __sysvec_irq_work ([kernel.kallsyms])
                                       sysvec_irq_work ([kernel.kallsyms])
                                       asm_sysvec_irq_work ([kernel.kallsyms])
                                       _raw_spin_unlock_irqrestore ([kernel.kallsyms])
                                       dma_fence_wait_timeout ([kernel.kallsyms])
                                       intel_atomic_commit_tail ([kernel.kallsyms])
                                       process_one_work ([kernel.kallsyms])
                                       worker_thread ([kernel.kallsyms])
                                       kthread ([kernel.kallsyms])
                                       ret_from_fork ([kernel.kallsyms])
                                       ret_from_fork_asm ([kernel.kallsyms])
root@number:~#

But as the original cset introducing this explains, these irq_vectors:
tracepoints each operate on just one of the vectors, hence
irq_work_entry(vector: IRQ_WORK), irq_vectors:reschedule_exit(vector:
RESCHEDULE), etc.

> The obvious solution to that is to expose that list in sysfs for
> consumption by perf.

nah, the best thing these days is to stop using 'int' for vector and use
'enum irq_vector', then since we have BTF we can use that to do the enum
-> string translation, like with (using /sys/kernel/btf/vmlinux, which is
pretty much available everywhere these days):

root@number:~# pahole clocksource_ids
enum clocksource_ids {
	CSID_GENERIC          = 0,
	CSID_ARM_ARCH_COUNTER = 1,
	CSID_MAX              = 2,
};

root@number:~# pahole skb_drop_reason | head
enum skb_drop_reason {
	SKB_NOT_DROPPED_YET                     = 0,
	SKB_CONSUMED                            = 1,
	SKB_DROP_REASON_NOT_SPECIFIED           = 2,
	SKB_DROP_REASON_NO_SOCKET               = 3,
	SKB_DROP_REASON_PKT_TOO_SMALL           = 4,
	SKB_DROP_REASON_TCP_CSUM                = 5,
	SKB_DROP_REASON_SOCKET_FILTER           = 6,
	SKB_DROP_REASON_UDP_CSUM                = 7,
	SKB_DROP_REASON_NETFILTER_DROP          = 8,
root@number:~#

Then it's easy to go from 0 to CSID_GENERIC, etc.
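Just to sketch the idea (this is not perf code, only a throwaway helper I
made up): the pahole enum dumps above can be turned into a value -> name
table with a few lines of parsing, which is all the tracepoint
pretty-printer really needs:

```python
import re

def parse_enum(pahole_output):
    """Turn a pahole enum dump into a value -> name dict.

    Matches lines of the form 'NAME = VALUE,' as printed by pahole.
    """
    return {int(v): k
            for k, v in re.findall(r"(\w+)\s*=\s*(\d+)", pahole_output)}

# Sample taken from the pahole output above.
sample = """enum clocksource_ids {
	CSID_GENERIC          = 0,
	CSID_ARM_ARCH_COUNTER = 1,
	CSID_MAX              = 2,
};"""

print(parse_enum(sample)[0])  # CSID_GENERIC
```

With BTF the dump itself comes straight from /sys/kernel/btf/vmlinux, so
the table always matches the running kernel instead of a copied header.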

⬢[acme@toolbox pahole]$ perf stat -e cycles pahole skb_drop_reason > /dev/null

 Performance counter stats for 'pahole skb_drop_reason':

         6,095,427      cpu_atom/cycles:u/                                                      (2.82%)
       103,694,633      cpu_core/cycles:u/                                                      (97.18%)

       0.039031759 seconds time elapsed

       0.016028000 seconds user
       0.023007000 seconds sys


⬢[acme@toolbox pahole]$

- Arnaldo
 
> But we don't have to do any of that right away. It's an orthogonal
> issue. Just waste the extra system vector to start with and then we can
> add the compile-time dependent change on top if we really care about
> gaining back the vectors.
> 
> Thanks,
> 
>         tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-19 20:07           ` Arnaldo Carvalho de Melo
@ 2024-04-22 22:32             ` Jacob Pan
  0 siblings, 0 replies; 47+ messages in thread
From: Jacob Pan @ 2024-04-22 22:32 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jim.harris, a.manzanares, Bjorn Helgaas, guang.zeng,
	robert.hoo.linux, kan.liang, Kleen, Andi, jacob.jun.pan

Hi Arnaldo,

On Fri, 19 Apr 2024 17:07:17 -0300, Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:

> > > On a second thought, if we make system IRQ vector determined at
> > > compile time based on different CONFIG options, will it break
> > > userspace tools such as perf? More importantly the rule of not
> > > breaking userspace.  
> 
> The rule for tools/perf is "don't impose _any requirement_ on the kernel
> developers, they don't have to test if any change they do outside of
> tools/ will break something inside tools/."
>  
> > tools/arch/x86/include/asm/irq_vectors.h is only used to generate the
> > list of system vectors for pretty output. And your change already broke
> > that.  
> 
> Yeah, I even moved that from tools/arch/x86/include/asm/irq_vectors.h
> to tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h (for next
> merge window).

So I will not add anything to the tools directory for my next version.
Just a heads-up for adding this new vector.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-04-22 22:28 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-05 22:30 [PATCH v2 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
2024-04-05 22:30 ` [PATCH v2 01/13] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
2024-04-17  0:34   ` Sean Christopherson
2024-04-17 18:33     ` Jacob Pan
2024-04-05 22:30 ` [PATCH v2 02/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
2024-04-05 22:31 ` [PATCH v2 03/13] x86/irq: Remove bitfields in posted interrupt descriptor Jacob Pan
2024-04-17  0:39   ` Sean Christopherson
2024-04-17 18:01     ` Jacob Pan
2024-04-18 17:30       ` Thomas Gleixner
2024-04-18 18:10         ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 04/13] x86/irq: Add a Kconfig option for posted MSI Jacob Pan
2024-04-05 22:31 ` [PATCH v2 05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
2024-04-11 16:51   ` Thomas Gleixner
2024-04-15 18:53     ` Jacob Pan
2024-04-15 20:43       ` Jacob Pan
2024-04-19  4:00         ` Thomas Gleixner
2024-04-19 20:07           ` Arnaldo Carvalho de Melo
2024-04-22 22:32             ` Jacob Pan
2024-04-12  9:14   ` Tian, Kevin
2024-04-12 14:27     ` Sean Christopherson
2024-04-16  3:45       ` Tian, Kevin
2024-04-05 22:31 ` [PATCH v2 06/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
2024-04-12  9:16   ` Tian, Kevin
2024-04-12 17:54     ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 07/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
2024-04-12  9:21   ` Tian, Kevin
2024-04-12 16:50     ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Jacob Pan
2024-04-11  7:52   ` Tian, Kevin
2024-04-11 17:38     ` Jacob Pan
2024-04-11 16:54   ` Thomas Gleixner
2024-04-11 18:29     ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 09/13] x86/irq: Factor out common code for checking pending interrupts Jacob Pan
2024-04-05 22:31 ` [PATCH v2 10/13] x86/irq: Extend checks for pending vectors to posted interrupts Jacob Pan
2024-04-12  9:25   ` Tian, Kevin
2024-04-12 18:23     ` Jacob Pan
2024-04-16  3:47       ` Tian, Kevin
2024-04-05 22:31 ` [PATCH v2 11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
2024-04-06  4:31   ` Robert Hoo
2024-04-08 23:33     ` Jacob Pan
2024-04-13 10:59       ` Robert Hoo
2024-04-12  9:31   ` Tian, Kevin
2024-04-15 23:20     ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 12/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
2024-04-12  9:36   ` Tian, Kevin
2024-04-16 22:15     ` Jacob Pan
2024-04-05 22:31 ` [PATCH v2 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).