* [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Hi Thomas and all,

This patch set aims to improve IRQ throughput on Intel Xeon by making use of
posted interrupts.

I presented this topic at the LPC2023 IOMMU/VFIO/PCI MC:

https://lpc.events/event/17/sessions/172/#20231115

Background
==========
On modern x86 server SoCs, interrupt remapping (IR) is required and turned
on by default to support x2APIC. Two interrupt remapping modes can be supported
by the IOMMU/VT-d:

- Remappable 	(host)
- Posted	(guest only so far)

With remappable mode, delivering a device MSI to a CPU is a HW flow without
system software touch points. It roughly goes as follows:

1.	Devices issue interrupt requests with writes to 0xFEEx_xxxx
2.	The system agent accepts and remaps/translates the IRQ
3.	Upon receiving the translation response, the system agent notifies the
	destination CPU with the translated MSI
4.	CPU's local APIC accepts interrupts into its IRR/ISR registers
5.	Interrupt delivered through IDT (MSI vector)

The above process can be inefficient under high IRQ rates. The notifications in
step #3 are often unnecessary when the destination CPU is already overwhelmed
with handling bursts of IRQs. On some architectures, such as Intel Xeon, step #3
is also expensive and requires strong ordering w.r.t. DMA. As a result, the
achievable IRQ rate can become a limiting factor for DMA I/O performance.

For example, on an Intel Xeon Sapphire Rapids SoC, as more NVMe disks are
attached to the same socket, per-disk FIO (libaio engine) 4K block random read
performance drops quickly.

# of disks  	2  	4  	8
-------------------------------------
IOPS(million)  	1.991	1.136  	0.834
(NVMe Gen 5 Samsung PM174x)

With posted mode enabled in interrupt remapping, the interrupt flow is divided
into two parts: posting (storing pending IRQ vector information in memory) and
CPU notification.

The above remappable IRQ flow becomes the following (steps 1 and 2 unchanged):
3.	The system agent notifies the destination CPU with a notification
	vector
	- IOMMU suppresses further CPU notifications while one is outstanding
	- IOMMU atomically swaps/stores the IRQ status into the memory-resident
	  posted interrupt descriptor (PID)
4.	CPU's local APIC accepts the notification interrupt into its IRR/ISR
	registers
5.	Interrupt delivered through IDT (notification vector handler);
	system SW re-allows notifications by clearing the outstanding
	notification (ON) bit in the PID
(The above flow is not in Linux today since posted mode is only used for VMs.)

Note that the system software can now suppress CPU notifications at runtime as
needed. This allows the system software to coalesce the expensive CPU
notifications and in turn, improve IRQ throughput and DMA performance.
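
As a rough sketch, the software side of the posted flow boils down to the
handler below (condensed from the actual notification vector handler added in
patch 09/15; stats accounting, the loop cap, and set_irq_regs() are omitted
here):

	DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
	{
		/* This CPU's posted interrupt descriptor (PID) */
		struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);

		irq_enter();
		/* Demux: xchg-clear PIR, then call each pending MSI handler */
		while (handle_pending_pir(pid->pir64, regs))
			;
		/* Re-allow CPU notifications; done last to maximize coalescing */
		pi_clear_on(pid);
		/* Close the race between posting and clearing ON */
		handle_pending_pir(pid->pir64, regs);
		apic_eoi();	/* a single EOI for all coalesced MSIs */
		irq_exit();
	}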

Consider the following scenario when MSIs arrive at a CPU in high-frequency
bursts:

Time ----------------------------------------------------------------------->
    	^ ^ ^       	^ ^ ^ ^     	^   	^
MSIs	A B C       	D E F G     	H   	I

RI  	N  N'  N'     	N  N'  N'  N'  	N   	N

PI  	N           	N           	N   	N

RI: remappable interrupt;  PI:  posted interrupt;
N: interrupt notification, N': superfluous interrupt notification

With remappable interrupts (row titled RI), every MSI generates a notification
event to the CPU.

With posted interrupts enabled in this patch set (row titled PI), CPU
notifications are coalesced during IRQ bursts; the N' notifications in the flow
above are eliminated. We refer to this mechanism as Coalesced Interrupt
Delivery (CID).

Posted interrupts have existed for a long time; they have been used for
virtualization, where MSIs from directly assigned devices can be delivered to
the guest kernel without VMM intervention. On Intel x86 platforms, posted
interrupts can be used on the host as well, in which case only the host
physical address of the posted interrupt descriptor (PID) is used.

This patch set enables a new usage of posted interrupts on existing (and new)
hardware for host kernel device MSIs. It is referred to as Posted MSIs
throughout this patch set.

Performance (with this patch set):
==================================
Test #1. NVMe FIO

FIO libaio (million IOPS per disk), Gen 5 NVMe Samsung PM174x disks on a single
socket, Intel Xeon Sapphire Rapids. Random read with 4K block size. NVMe IRQ
affinity is managed by the kernel with one vector per CPU.

#disks	Before		After		%Gain
---------------------------------------------
8	0.834		1.943		132%
4	1.136		2.023		78%

Other observations:
- Increased block sizes show diminishing benefits, e.g. with 4 NVMe disks on
one x16 PCIe slot, the combined IOPS looks like:

    Block Size	Baseline	PostedMSI
    -------------------------------------
    4K		6475		8778
    8K		5727		5896
    16k		2864		2900
    32k		1546		1520
    128k	397		398

- Submission/completion latency (usec) also improved, at 4K block size only
  FIO-reported SLAT
  ---------------------------------------
  Block Size	Baseline	postedMSI
  4k		2177		2282
  8k		4416		3967
  16k		2950		3053
  32k		3453		3505
  128k		5911		5801

  FIO-reported CLAT
  ---------------------------------------
  Block Size	Baseline	postedMSI
  4k		313		230
  8k		352		343
  16k		711		702
  32k		1320		1343
  128k		5146		5137


Test #2. Intel Data Streaming Accelerator (DSA)

Two dedicated workqueues from two PCI root complex integrated endpoint
(RCIEP) devices, with the IRQ affinity of the two interrupts pinned to a
single CPU.

				Before		After		%Gain
				-------------------------------------
DSA memfill (mil IRQs/sec)	5.157		8.987		74%

DMA throughput has similar improvements.

At lower IRQ rates (< 1 million/second), neither performance benefits nor
regressions have been observed so far.

No-harm tests were also performed to ensure there is no performance regression
on workloads that do not have a high interrupt rate. These tests include:
- kernel compile time
- file copy
- FIO NVMe random writes

Implementation choices:
=======================
- Transparent to the device drivers

- System-wide option instead of per-device or per-IRQ opt-in, i.e. once
  enabled, all device MSIs are posted. The benefit is that only the IR
  irq_chip and domain layer need to change; there is no change to PCI MSI.
  Exceptions are: IOAPIC, HPET, and VT-d's own IRQs.

- Limit the number of polling/demuxing loops per CPU notification event

- Only change Intel-IR in the IRQ domain hierarchy VECTOR->INTEL-IR->PCI-MSI.

- x86 Intel only so far; can be extended to other architectures with posted
  interrupt support (ARM and AMD), RFC.

- Bare metal only; there is no posted interrupt capable virtual IOMMU yet.


Changes and implications (moving from remappable to posted mode)
=================================================================
1. All MSI vectors are multiplexed into a single notification vector for each
CPU. MSI vectors are then de-multiplexed by SW; there is no IDT delivery for
MSIs.

2. The following features are lost compared to remappable mode (AFAIK, none of
them matters for device MSIs):
- No control of delivery mode, e.g. no NMI delivery for MSIs
- No logical destinations; the posted interrupt destination is the x2APIC
  physical APIC ID
- No per-vector stack, since all MSI vectors are multiplexed into one


Runtime changes
===============
The IRQ runtime behavior changes with this patch set. Here is a pseudo-trace
comparison for 3 MSIs of different vectors arriving in a burst on the same CPU,
with a system vector interrupt (e.g. timer) arriving at a random point.

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(timer)

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()


AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
interrupt(timer)
        process_softirq()

With posted MSI (as pointed out by Thomas Gleixner), both high-priority
interrupts (system interrupt vectors) and softIRQs are blocked during the MSI
vector demux loop. Some of them can be timing sensitive.

Here are the options I have attempted or am still working on:

1. Use a self-IPI to invoke the MSI vector handler, but that took away the
majority of the performance benefits.

2. Limit the number of demuxing loops; this is implemented in this patch set.
Note that today we already allow one low-priority MSI to block system
interrupts. A system vector can preempt MSI vectors without waiting for EOI,
but IRQs are disabled in the ISR.

Performance data (on DSA with MEMFILL) also shows that coalescing more than 3
loops yields diminishing benefits. Therefore, the max number of loops for
coalescing is set to 3 in this patch set (see the snippet after the table).
	MaxLoop		IRQ/sec		bandwidth Mbps
-------------------------------------------------------------------------
	2		6157107 		25219
	3		6226611 		25504
	4		6557081 		26857
	5		6629683 		27155
	6		6662425 		27289
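
For reference, this is how the cap is wired up in the notification handler of
patch 09/15 (pid, regs, and the loop counter i are as in the handler there):

	#define MAX_POSTED_MSI_COALESCING_LOOP 3

	/*
	 * The max coalescing count includes the extra handle_pending_pir()
	 * pass after clearing the ON bit, so at most MAX - 1 passes run here.
	 */
	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
		if (!handle_pending_pir(pid->pir64, regs))
			break;
	}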

3. Limit the time that system interrupts can be blocked (WIP).

In addition, posted MSI uses atomic xchg from both the CPU and the IOMMU.
Compared to remappable mode, there may be additional cache line ownership
contention over the PID. However, we have not observed performance regressions
at lower IRQ rates; at high interrupt rates, posted mode always wins.

Testing:
========

The following tests have been performed and continue to be evaluated:
- IRQ affinity change, migration
- CPU offlining
- Multi-vector coalescing
- VM device assignment via VFIO
- General no-harm test at low IRQ rates; performance regressions have not
  been observed for low-IRQ-rate workloads


With this patch set, a new entry is added to /proc/interrupts:
cat /proc/interrupts | grep PMN
PMN:         13868907 Posted MSI notification event

No change to the device MSI accounting.

A new INTEL-IR-POST irq_chip is visible in IRQ debugfs, e.g.:
domain:  IR-PCI-MSIX-0000:6f:01.0-12
 hwirq:   0x8
 chip:    IR-PCI-MSIX-0000:6f:01.0
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-12-13
     hwirq:   0x90000
     chip:    INTEL-IR-POST /* For posted MSIs */
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x65
         chip:    APIC


Acknowledgment
==============

- Rajesh Sankaran and Ashok Raj for the original idea

- Thomas Gleixner for reviewing and guiding the upstream direction of the PoC
patches, and for correcting my many misunderstandings of the IRQ subsystem

- Jie J Yan (Jeff), Sebastien Lemarie, and Dan Liang for performance
evaluation with NVMe and network workloads

- Bernice Zhang and Scott Morris for functional validation

- Michael Prinke for helping me understand how VT-d HW works

- Sanjay Kumar for providing the DSA IRQ test suite



Thanks,

Jacob

Change log:
V1 (since RFC)
   - Removed mention of wishful features: IRQ preemption, separate and
     full MSI vector space
   - Refined the MSI handler de-multiplexing loop based on suggestions from
     Peter and Thomas; reduced xchg() usage and code duplication
   - Assigned the new posted IR irq_chip only to device MSI/MSI-X, avoiding
     changes to the IO-APIC code
   - Extracted and used common code for preventing lost interrupts during
     affinity changes
   - Added more test results to the cover letter

Jacob Pan (12):
  x86/irq: Move posted interrupt descriptor out of vmx code
  x86/irq: Unionize PID.PIR for 64bit access w/o casting
  x86/irq: Add a Kconfig option for posted MSI
  x86/irq: Reserve a per CPU IDT vector for posted MSIs
  x86/irq: Add accessors for posted interrupt descriptors
  x86/irq: Factor out calling ISR from common_interrupt
  x86/irq: Install posted MSI notification handler
  x86/irq: Factor out common code for checking pending interrupts
  x86/irq: Extend checks for pending vectors to posted interrupts
  iommu/vt-d: Make posted MSI an opt-in cmdline option
  iommu/vt-d: Add an irq_chip for posted MSIs
  iommu/vt-d: Enable posted mode for device MSIs

Thomas Gleixner (3):
  x86/irq: Use bitfields exclusively in posted interrupt descriptor
  x86/irq: Set up per host CPU posted interrupt descriptors
  iommu/vt-d: Add a helper to retrieve PID address

 .../admin-guide/kernel-parameters.txt         |   1 +
 arch/x86/Kconfig                              |  11 ++
 arch/x86/include/asm/apic.h                   |  12 ++
 arch/x86/include/asm/hardirq.h                |   6 +
 arch/x86/include/asm/idtentry.h               |   3 +
 arch/x86/include/asm/irq_remapping.h          |  11 ++
 arch/x86/include/asm/irq_vectors.h            |   9 +-
 arch/x86/include/asm/posted_intr.h            | 116 +++++++++++++
 arch/x86/kernel/apic/vector.c                 |   5 +-
 arch/x86/kernel/cpu/common.c                  |   3 +
 arch/x86/kernel/idt.c                         |   3 +
 arch/x86/kernel/irq.c                         | 156 ++++++++++++++++--
 arch/x86/kvm/vmx/posted_intr.h                |  93 +----------
 arch/x86/kvm/vmx/vmx.c                        |   1 +
 arch/x86/kvm/vmx/vmx.h                        |   2 +-
 drivers/iommu/intel/irq_remapping.c           | 115 ++++++++++++-
 drivers/iommu/irq_remapping.c                 |  13 +-
 17 files changed, 446 insertions(+), 114 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

-- 
2.25.1



* [PATCH 01/15] x86/irq: Move posted interrupt descriptor out of vmx code
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

To prepare for native usage of posted interrupts, move the posted interrupt
descriptor (PID) declarations out of VMX code such that they can be shared.
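
With the header split out, host-side code outside of KVM can share the
descriptor and its accessors, e.g. (hypothetical user; patch 06/15 adds a
real per-CPU descriptor along these lines):

	#include <asm/posted_intr.h>

	static DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);

	static bool example_on_bit_set(void)
	{
		return pi_test_on(this_cpu_ptr(&posted_interrupt_desc));
	}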

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 88 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/posted_intr.h     | 93 +-----------------------------
 arch/x86/kvm/vmx/vmx.c             |  1 +
 arch/x86/kvm/vmx/vmx.h             |  2 +-
 4 files changed, 91 insertions(+), 93 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
new file mode 100644
index 000000000000..f0324c56f7af
--- /dev/null
+++ b/arch/x86/include/asm/posted_intr.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_POSTED_INTR_H
+#define _X86_POSTED_INTR_H
+
+#define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
+#define PID_TABLE_ENTRY_VALID 1
+
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		struct {
+				/* bit 256 - Outstanding Notification */
+			u16	on	: 1,
+				/* bit 257 - Suppress Notification */
+				sn	: 1,
+				/* bit 271:258 - Reserved */
+				rsvd_1	: 14;
+				/* bit 279:272 - Notification Vector */
+			u8	nv;
+				/* bit 287:280 - Reserved */
+			u8	rsvd_2;
+				/* bit 319:288 - Notification Destination */
+			u32	ndst;
+		};
+		u64 control;
+	};
+	u32 rsvd[6];
+} __aligned(64);
+
+static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
+{
+	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_set_on(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..6b2a0226257e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -1,98 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __KVM_X86_VMX_POSTED_INTR_H
 #define __KVM_X86_VMX_POSTED_INTR_H
-
-#define POSTED_INTR_ON  0
-#define POSTED_INTR_SN  1
-
-#define PID_TABLE_ENTRY_VALID 1
-
-/* Posted-Interrupt Descriptor */
-struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
-	union {
-		struct {
-				/* bit 256 - Outstanding Notification */
-			u16	on	: 1,
-				/* bit 257 - Suppress Notification */
-				sn	: 1,
-				/* bit 271:258 - Reserved */
-				rsvd_1	: 14;
-				/* bit 279:272 - Notification Vector */
-			u8	nv;
-				/* bit 287:280 - Reserved */
-			u8	rsvd_2;
-				/* bit 319:288 - Notification Destination */
-			u32	ndst;
-		};
-		u64 control;
-	};
-	u32 rsvd[6];
-} __aligned(64);
-
-static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
-}
-
-static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
-{
-	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
-}
-
-static inline void pi_set_sn(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_set_on(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_on(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_sn(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_on(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_sn(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
+#include <asm/posted_intr.h>
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e262bc2ba4e5..d4b6072344ed 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -67,6 +67,7 @@
 #include "x86.h"
 #include "smm.h"
 #include "vmx_onhyperv.h"
+#include "posted_intr.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..5ca8ab15b4f8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -7,10 +7,10 @@
 #include <asm/kvm.h>
 #include <asm/intel_pt.h>
 #include <asm/perf_event.h>
+#include <asm/posted_intr.h>
 
 #include "capabilities.h"
 #include "../kvm_cache_regs.h"
-#include "posted_intr.h"
 #include "vmcs.h"
 #include "vmx_ops.h"
 #include "../cpuid.h"
-- 
2.25.1



* [PATCH 02/15] x86/irq: Unionize PID.PIR for 64bit access w/o casting
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Unionize the PIR field with a u64 array such that 64-bit atomic xchg can be
used without ugly casting.
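
For illustration, given a hypothetical PID pointer pid, a 64-bit exchange of
one PIR chunk changes from:

	/* Before: pir is u32[8], so a 64-bit exchange needs a cast */
	u64 pending = arch_xchg((u64 *)&pid->pir[0], 0);

to the cast-free:

	/* After: the union exposes the same 256 bits as u64[4] */
	u64 pending = arch_xchg(&pid->pir64[0], 0);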

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index f0324c56f7af..acf237b2882e 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -9,7 +9,10 @@
 
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		u32 pir[8];     /* Posted interrupt requested */
+		u64 pir64[4];
+	};
 	union {
 		struct {
 				/* bit 256 - Outstanding Notification */
-- 
2.25.1



* [PATCH 03/15] x86/irq: Use bitfields exclusively in posted interrupt descriptor
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

From: Thomas Gleixner <tglx@linutronix.de>

The mixture of bitfields and types is weird and really not intuitive; remove
the types and use bitfields exclusively.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index acf237b2882e..896b3462f3dd 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -16,17 +16,17 @@ struct pi_desc {
 	union {
 		struct {
 				/* bit 256 - Outstanding Notification */
-			u16	on	: 1,
+			u64	on	:  1,
 				/* bit 257 - Suppress Notification */
-				sn	: 1,
+				sn	:  1,
 				/* bit 271:258 - Reserved */
-				rsvd_1	: 14;
+					: 14,
 				/* bit 279:272 - Notification Vector */
-			u8	nv;
+				nv	:  8,
 				/* bit 287:280 - Reserved */
-			u8	rsvd_2;
+					:  8,
 				/* bit 319:288 - Notification Destination */
-			u32	ndst;
+				ndst	: 32;
 		};
 		u64 control;
 	};
-- 
2.25.1



* [PATCH 04/15] x86/irq: Add a Kconfig option for posted MSI
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

This option will be used to support delivering MSIs as posted
interrupts. Interrupt remapping is required.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..79f04ee2b91c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -463,6 +463,17 @@ config X86_X2APIC
 
 	  If you don't know what to do here, say N.
 
+config X86_POSTED_MSI
+	bool "Enable MSI and MSI-x delivery by posted interrupts"
+	depends on X86_X2APIC && X86_64 && IRQ_REMAP
+	help
+	  This enables MSIs that are under interrupt remapping to be delivered as
+	  posted interrupts to the host kernel. Interrupt throughput can
+	  potentially be improved by coalescing CPU notifications during high
+	  frequency bursts.
+
+	  If you don't know what to do here, say N.
+
 config X86_MPPARSE
 	bool "Enable MPS table" if ACPI
 	default y
-- 
2.25.1



* [PATCH 05/15] x86/irq: Reserve a per CPU IDT vector for posted MSIs
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

When posted MSI is enabled, all device MSIs are multiplexed into a single
notification vector. MSI handlers will be de-multiplexed at run-time by
system software without IDT delivery.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/irq_vectors.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 3a19904c2db6..08329bef5b1d 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -99,9 +99,16 @@
 
 #define LOCAL_TIMER_VECTOR		0xec
 
+/*
+ * Posted interrupt notification vector for all device MSIs delivered to
+ * the host kernel.
+ */
+#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
 #define NR_VECTORS			 256
 
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_X86_POSTED_MSI
+#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
+#elif defined(CONFIG_X86_LOCAL_APIC)
 #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
 #else
 #define FIRST_SYSTEM_VECTOR		NR_VECTORS
-- 
2.25.1



* [PATCH 06/15] x86/irq: Set up per host CPU posted interrupt descriptors
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

From: Thomas Gleixner <tglx@linutronix.de>

To support posted MSIs, create a posted interrupt descriptor (PID) for each
host CPU. Later on, when setting up IRQ CPU affinity, IOMMU's interrupt
remapping table entry (IRTE) will point to the physical address of the
matching CPU's PID.

Each PID is initialized with the owner CPU's physical APICID as the
destination.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/posted_intr.h |  7 +++++++
 arch/x86/kernel/cpu/common.c       |  3 +++
 arch/x86/kernel/irq.c              | 15 +++++++++++++++
 4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 66837b8c67f1..72c6a084dba3 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -48,6 +48,9 @@ typedef struct {
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
+#ifdef CONFIG_X86_POSTED_MSI
+DECLARE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+#endif
 #define __ARCH_IRQ_STAT
 
 #define inc_irq_stat(member)	this_cpu_inc(irq_stat.member)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 896b3462f3dd..a36cc971ea13 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -88,4 +88,11 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 	return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern void intel_posted_msi_init(void);
+
+#else
+static inline void intel_posted_msi_init(void) {};
+
+#endif /* X86_POSTED_MSI */
 #endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0b97bcde70c6..9b6248e7c073 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -67,6 +67,7 @@
 #include <asm/traps.h>
 #include <asm/sev.h>
 #include <asm/tdx.h>
+#include <asm/posted_intr.h>
 
 #include "cpu.h"
 
@@ -2253,6 +2254,8 @@ void cpu_init(void)
 		barrier();
 
 		x2apic_setup();
+
+		intel_posted_msi_init();
 	}
 
 	mmgrab(&init_mm);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 11761c124545..f6546f83d616 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -22,6 +22,8 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/thermal.h>
+#include <asm/posted_intr.h>
+#include <asm/irq_remapping.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -334,6 +336,19 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 }
 #endif
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
+DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+
+void intel_posted_msi_init(void)
+{
+	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
+	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
+}
+#endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
-- 
2.25.1



* [PATCH 07/15] x86/irq: Add accessors for posted interrupt descriptors
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Posted interrupts are controlled by, and pending interrupts are marked in,
the posted interrupt descriptor. The upcoming support for host-side posted
interrupts requires accessors to check for pending vectors.

This patch adds a helper function to check individual vector status.
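
For context, patches 10 and 11 of this set fold the new check into a common
pending-vector test, which ends up looking like:

	static inline bool is_vector_pending(unsigned int vector)
	{
		unsigned int irr;

		/* Pending in the local APIC IRR? */
		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
		if (irr & (1 << (vector % 32)))
			return true;

		/* Otherwise, pending in this CPU's posted PIR? */
		return pi_pending_this_cpu(vector);
	}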

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index a36cc971ea13..eb939f630b02 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _X86_POSTED_INTR_H
 #define _X86_POSTED_INTR_H
+#include <asm/irq_vectors.h>
 
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1
@@ -89,9 +90,26 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 }
 
 #ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
+ * own interrupts. Here we do not distinguish them since those vector bits in
+ * PIR will always be zero.
+ */
+static inline bool pi_pending_this_cpu(unsigned int vector)
+{
+	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	if (WARN_ON_ONCE(vector >= NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
+		return false;
+
+	return test_bit(vector, (unsigned long *)pid->pir);
+}
+
 extern void intel_posted_msi_init(void);
 
 #else
+static inline bool pi_pending_this_cpu(unsigned int vector) { return false; }
+
 static inline void intel_posted_msi_init(void) {};
 
 #endif /* X86_POSTED_MSI */
-- 
2.25.1



* [PATCH 08/15] x86/irq: Factor out calling ISR from common_interrupt
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Prepare for calling external IRQ handlers directly from the posted MSI
demultiplexing loop. Extract the code shared with common_interrupt() to
avoid code duplication.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/irq.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f6546f83d616..1a1762baf85f 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct irq_desc *desc,
 		__handle_irq(desc, regs);
 }
 
-/*
- * common_interrupt() handles all normal device IRQ's (the special SMP
- * cross-CPU interrupts have their own entry points).
- */
-DEFINE_IDTENTRY_IRQ(common_interrupt)
+static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 
-	/* entry code tells RCU that we're not quiescent.  Check it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
-
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
@@ -268,7 +260,20 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
 		}
 	}
+}
+
+/*
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
+ */
+DEFINE_IDTENTRY_IRQ(common_interrupt)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	/* entry code tells RCU that we're not quiescent.  Check it. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
+	call_irq_handler(vector, regs);
 	set_irq_regs(old_regs);
 }
 
-- 
2.25.1



* [PATCH 09/15] x86/irq: Install posted MSI notification handler
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

All MSI vectors are multiplexed into a single notification vector when
posted MSI is enabled. It is the responsibility of the notification
vector handler to demultiplex MSI vectors. In this handler, for each
pending bit, MSI vector handlers are dispatched without IDT delivery.

For example, the interrupt flow will change as follows:
(3 MSIs of different vectors arrive in a high-frequency burst)

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

Except for the leading MSI, CPU notifications are skipped/coalesced.

For MSIs that arrive at a low frequency, the demultiplexing loop does not
wait for more interrupts to coalesce. Therefore, there is no additional
latency other than the processing time.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/hardirq.h  |   3 +
 arch/x86/include/asm/idtentry.h |   3 +
 arch/x86/kernel/idt.c           |   3 +
 arch/x86/kernel/irq.c           | 113 ++++++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 72c6a084dba3..6c8daa7518eb 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#ifdef CONFIG_X86_POSTED_MSI
+	unsigned int posted_msi_notification_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 13639e57e1f8..83d4de75df34 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -648,6 +648,9 @@ DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR,		sysvec_error_interrupt);
 DECLARE_IDTENTRY_SYSVEC(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt);
 DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
 DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
+# ifdef CONFIG_X86_POSTED_MSI
+DECLARE_IDTENTRY_SYSVEC(POSTED_MSI_NOTIFICATION_VECTOR,	sysvec_posted_msi_notification);
+# endif
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 660b601f1d6c..061a927367ec 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -163,6 +163,9 @@ static const __initconst struct idt_data apic_idts[] = {
 # endif
 	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
 	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_POSTED_MSI
+	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
+# endif
 #endif
 };
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 1a1762baf85f..54ddf148f1ed 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_POSTED_MSI
+	seq_printf(p, "%*s: ", prec, "PMN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->posted_msi_notification_count);
+	seq_puts(p, "  Posted MSI notification event\n");
 #endif
 	return 0;
 }
@@ -353,5 +360,111 @@ void intel_posted_msi_init(void)
 	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
 	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
 }
+
+/*
+ * De-multiplexing posted interrupts is on the performance path, the code
+ * below is written to optimize the cache performance based on the following
+ * considerations:
+ * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
+ *   accessed by both CPU and IOMMU.
+ * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
+ *   for checking and clearing posted interrupt request (PIR), a 256 bit field
+ *   within the PID.
+ * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
+ *   line when posting interrupts and setting control bits.
+ * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
+ * 5.Each time the IOMMU posts an interrupt to the PIR, it will evict the PID
+ *   cache line. The cache line states after each operation are as follows:
+ *   CPU		IOMMU			PID Cache line state
+ *   ---------------------------------------------------------------
+ *   read64					exclusive
+ *   lock xchg64				modified
+ *			post/atomic swap	invalid
+ *   -------------------------------------------------------------
+ *
+ * To reduce L1 data cache miss, it is important to avoid contention with
+ * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
+ * to dispatch interrupt handlers.
+ *
+ * In addition, the code is trying to keep the cache line state consistent
+ * as much as possible. e.g. when making a copy and clearing the PIR
+ * (assuming non-zero PIR bits are present in the entire PIR), it does:
+ *		read, read, read, read, xchg, xchg, xchg, xchg
+ * instead of:
+ *		read, xchg, read, xchg, read, xchg, read, xchg
+ */
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+{
+	int i, vec = FIRST_EXTERNAL_VECTOR;
+	unsigned long pir_copy[4];
+	bool handled = false;
+
+	for (i = 0; i < 4; i++)
+		pir_copy[i] = pir[i];
+
+	for (i = 0; i < 4; i++) {
+		if (!pir_copy[i])
+			continue;
+
+		pir_copy[i] = arch_xchg(&pir[i], 0);
+		handled = true;
+	}
+
+	if (handled) {
+		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
+			call_irq_handler(vec, regs);
+	}
+
+	return handled;
+}
+
+/*
+ * Performance data shows that 3 is good enough to harvest 90+% of the benefit
+ * on high IRQ rate workload.
+ */
+#define MAX_POSTED_MSI_COALESCING_LOOP 3
+
+/*
+ * For MSIs that are delivered as posted interrupts, the CPU notifications
+ * can be coalesced if the MSIs arrive in high frequency bursts.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pi_desc *pid;
+	int i = 0;
+
+	pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	inc_irq_stat(posted_msi_notification_count);
+	irq_enter();
+
+	/*
+	 * Max coalescing count includes the extra round of handle_pending_pir
+	 * after clearing the outstanding notification bit. Hence, at most
+	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
+	 */
+	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
+		if (!handle_pending_pir(pid->pir64, regs))
+			break;
+	}
+
+	/*
+	 * Clear outstanding notification bit to allow new IRQ notifications,
+	 * do this last to maximize the window of interrupt coalescing.
+	 */
+	pi_clear_on(pid);
+
+	/*
+	 * There could be a race of PI notification and the clearing of ON bit,
+	 * process PIR bits one last time such that handling the new interrupts
+	 * are not delayed until the next IRQ.
+	 */
+	handle_pending_pir(pid->pir64, regs);
+
+	apic_eoi();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
 #endif /* X86_POSTED_MSI */
 
-- 
2.25.1



* [PATCH 10/15] x86/irq: Factor out common code for checking pending interrupts
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Use a common function for checking pending interrupt vectors in the APIC IRR
instead of open coding it in multiple places.

Additional checks for posted MSI vectors can then be contained in this
function.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/apic.h   | 11 +++++++++++
 arch/x86/kernel/apic/vector.c |  5 ++---
 arch/x86/kernel/irq.c         |  5 ++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9d159b771dc8..e9d8e554765c 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -492,6 +492,17 @@ static inline bool lapic_vector_set_in_irr(unsigned int vector)
 	return !!(irr & (1U << (vector % 32)));
 }
 
+static inline bool is_vector_pending(unsigned int vector)
+{
+	unsigned int irr;
+
+	irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
+	if (irr  & (1 << (vector % 32)))
+		return true;
+
+	return false;
+}
+
 /*
  * Warm reset vector position:
  */
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..9eec52925fa3 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -965,7 +965,7 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 	lockdep_assert_held(&vector_lock);
 
 	hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
-		unsigned int irr, vector = apicd->prev_vector;
+		unsigned int vector = apicd->prev_vector;
 
 		/*
 		 * Paranoia: Check if the vector that needs to be cleaned
@@ -979,8 +979,7 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 		 * fixup_irqs() was just called to scan IRR for set bits and
 		 * forward them to new destination CPUs via IPIs.
 		 */
-		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
-		if (irr & (1U << (vector % 32))) {
+		if (check_irr && is_vector_pending(vector)) {
 			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
 			rearm = true;
 			continue;
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 54ddf148f1ed..8e09d40ea928 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -472,7 +472,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
 void fixup_irqs(void)
 {
-	unsigned int irr, vector;
+	unsigned int vector;
 	struct irq_desc *desc;
 	struct irq_data *data;
 	struct irq_chip *chip;
@@ -499,8 +499,7 @@ void fixup_irqs(void)
 		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
 			continue;
 
-		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
-		if (irr  & (1 << (vector % 32))) {
+		if (is_vector_pending(vector)) {
 			desc = __this_cpu_read(vector_irq[vector]);
 
 			raw_spin_lock(&desc->lock);
-- 
2.25.1



* [PATCH 11/15] x86/irq: Extend checks for pending vectors to posted interrupts
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

During an interrupt affinity change, it is possible for interrupts to be
delivered to the old CPU after the affinity has changed to the new one. To
prevent lost interrupts, the local APIC IRR is checked on the old CPU. Similar
checks must be done for posted MSIs, for the same reason.

Consider the following scenario:
	Device		system agent		iommu		memory 		CPU/LAPIC
1	FEEX_XXXX
2			Interrupt request
3						Fetch IRTE	->
4						->Atomic Swap PID.PIR(vec)
						Push to Global Observable(GO)
5						if (ON*)
							done;*
						else
6							send a notification ->

* ON: outstanding notification, 1 will suppress new notifications

If the affinity change happens between steps 3 and 5 in the IOMMU, the old
CPU's posted interrupt request (PIR) could have the pending bit set for the
vector being moved.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/apic.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index e9d8e554765c..6661fefdd49a 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -13,6 +13,7 @@
 #include <asm/mpspec.h>
 #include <asm/msr.h>
 #include <asm/hardirq.h>
+#include <asm/posted_intr.h>
 
 #define ARCH_APICTIMER_STOPS_ON_C3	1
 
@@ -500,7 +501,7 @@ static inline bool is_vector_pending(unsigned int vector)
 	if (irr  & (1 << (vector % 32)))
 		return true;
 
-	return false;
+	return pi_pending_this_cpu(vector);
 }
 
 /*
-- 
2.25.1



* [PATCH 12/15] iommu/vt-d: Make posted MSI an opt-in cmdline option
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y.

Also introduce a helper function for testing if posted MSI is supported on
the platform.
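
With CONFIG_X86_POSTED_MSI=y, posted MSI would then be enabled by adding the
new option to the existing intremap= parameter on the kernel command line,
e.g.:

	intremap=posted_msi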

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  1 +
 arch/x86/include/asm/irq_remapping.h            | 11 +++++++++++
 drivers/iommu/irq_remapping.c                   | 13 ++++++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31b3a25680d0..35feefc0bdb0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2212,6 +2212,7 @@
 			no_x2apic_optout
 				BIOS x2APIC opt-out request will be ignored
 			nopost	disable Interrupt Posting
+			posted_msi enable MSIs delivered as posted interrupts
 
 	iomem=		Disable strict checking of access to MMIO memory
 		strict	regions from userspace.
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 7a2ed154a5e1..e46bde61029b 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -50,6 +50,17 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
 	return x86_vector_domain;
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern int enable_posted_msi;
+
+static inline bool posted_msi_supported(void)
+{
+	return enable_posted_msi && irq_remapping_cap(IRQ_POSTING_CAP);
+}
+#else
+static inline bool posted_msi_supported(void) { return false; };
+#endif
+
 #else  /* CONFIG_IRQ_REMAP */
 
 static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 83314b9d8f38..4047ac396728 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -24,6 +24,10 @@ int no_x2apic_optout;
 
 int disable_irq_post = 0;
 
+#ifdef CONFIG_X86_POSTED_MSI
+int enable_posted_msi;
+#endif
+
 static int disable_irq_remap;
 static struct irq_remap_ops *remap_ops;
 
@@ -70,7 +74,14 @@ static __init int setup_irqremap(char *str)
 			no_x2apic_optout = 1;
 		else if (!strncmp(str, "nopost", 6))
 			disable_irq_post = 1;
-
+#ifdef CONFIG_X86_POSTED_MSI
+		else if (!strncmp(str, "posted_msi", 10)) {
+			if (disable_irq_post || disable_irq_remap)
+				pr_warn("Posted MSI not enabled due to conflicting options!\n");
+			else
+				enable_posted_msi = 1;
+		}
+#endif
 		str += strcspn(str, ",");
 		while (*str == ',')
 			str++;
-- 
2.25.1



* [PATCH 13/15] iommu/vt-d: Add an irq_chip for posted MSIs
From: Jacob Pan @ 2024-01-26 23:42 UTC
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

Introduce a new irq_chip for posted MSIs. The key difference is in
irq_ack, where EOI is performed by the posted MSI notification handler
instead.

When posted MSI is enabled, MSI domain/chip hierarchy will look like
this example:

domain:  IR-PCI-MSIX-0000:50:00.0-12
 hwirq:   0x29
 chip:    IR-PCI-MSIX-0000:50:00.0
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-10-13
     hwirq:   0x2d0000
     chip:    INTEL-IR-POST
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x77
         chip:    APIC

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/irq_remapping.c | 44 ++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 566297bc87dd..fa719936b44e 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1233,6 +1233,50 @@ static struct irq_chip intel_ir_chip = {
 	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
 };
 
+static void dummy(struct irq_data *d)
+{
+}
+
+/*
+ * With posted MSIs, all vectors are multiplexed into a single notification
+ * vector. Device MSIs are then dispatched in a demux loop where
+ * EOIs can be coalesced as well.
+ *
+ * "INTEL-IR-POST" IRQ chip does not do EOI on ACK, thus the dummy irq_ack()
+ * function. Instead EOI is performed by the posted interrupt notification
+ * handler.
+ *
+ * For the example below, 3 MSIs are coalesced into one CPU notification. Only
+ * one apic_eoi() is needed.
+ *
+ * __sysvec_posted_msi_notification()
+ *	irq_enter();
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *		handle_edge_irq()
+ *			irq_chip_ack_parent()
+ *				dummy(); // No EOI
+ *			handle_irq_event()
+ *				driver_handler()
+ *	apic_eoi()
+ *	irq_exit()
+ */
+static struct irq_chip intel_ir_chip_post_msi = {
+	.name			= "INTEL-IR-POST",
+	.irq_ack		= dummy,
+	.irq_set_affinity	= intel_ir_set_affinity,
+	.irq_compose_msi_msg	= intel_ir_compose_msi_msg,
+	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
+};
+
 static void fill_msi_msg(struct msi_msg *msg, u32 index, u32 subhandle)
 {
 	memset(msg, 0, sizeof(*msg));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 14/15] iommu/vt-d: Add a helper to retrieve PID address
  2024-01-26 23:42 [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (12 preceding siblings ...)
  2024-01-26 23:42 ` [PATCH 13/15] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
@ 2024-01-26 23:42 ` Jacob Pan
  2024-01-26 23:42 ` [PATCH 15/15] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-01-26 23:42 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

From: Thomas Gleixner <tglx@linutronix.de>

The physical address of the Intel posted interrupt descriptor (PID) is
needed when programming the interrupt remapping table entry (IRTE) for
posted mode.

The PID is per-CPU, so add a helper function to retrieve the target CPU's
PID address based on the effective affinity mask of the interrupt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v1: Added a warning if the effective affinity mask is not set up
---
 drivers/iommu/intel/irq_remapping.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index fa719936b44e..01df65dca1d5 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -19,6 +19,7 @@
 #include <asm/cpu.h>
 #include <asm/irq_remapping.h>
 #include <asm/pci-direct.h>
+#include <asm/posted_intr.h>
 
 #include "iommu.h"
 #include "../irq_remapping.h"
@@ -1126,6 +1127,19 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.enable_faulting	= enable_drhd_fault_handling,
 };
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+static phys_addr_t get_pi_desc_addr(struct irq_data *irqd)
+{
+	int cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
+
+	if (WARN_ON(cpu >= nr_cpu_ids))
+		return 0;
+
+	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
+}
+#endif
+
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 {
 	struct intel_ir_data *ir_data = irqd->chip_data;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 15/15] iommu/vt-d: Enable posted mode for device MSIs
  2024-01-26 23:42 [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (13 preceding siblings ...)
  2024-01-26 23:42 ` [PATCH 14/15] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
@ 2024-01-26 23:42 ` Jacob Pan
  2024-02-08 15:34 ` [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jens Axboe
  2024-04-04 13:45 ` Robert Hoo
  16 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-01-26 23:42 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, Jacob Pan

With the posted MSI feature enabled on the CPU side, IOMMU interrupt
remapping table entries (IRTEs) for device MSI/MSI-X can be allocated,
activated, and programmed in posted mode. This means that each IRTE is
linked with the posted interrupt descriptor (PID) of its target CPU.

The posted MSI notification vector handler then demultiplexes and invokes
the device MSI handlers. CPU notifications are coalesced if interrupts
arrive at a high frequency.

The following are excluded:
- legacy devices IOAPIC and HPET (may be needed for booting, not a source
  of high-rate MSIs)
- VT-d's own IRQs (not remappable).
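
As an illustration of the PDA split performed by
intel_ir_reconfigure_irte_posted() in the hunk below (a sketch only,
assuming PDA_LOW_BIT is 26 and PDA_HIGH_BIT is 32, so that pda_l carries
bits 31:6 of the 64-byte aligned PID address and pda_h carries bits 63:32):

	/* Sketch, not part of the patch */
	u64 pid_addr = get_pi_desc_addr(irqd);		/* 64-byte aligned    */
	u32 pda_l = (pid_addr >> 6) & GENMASK(25, 0);	/* address bits 31:6  */
	u32 pda_h = pid_addr >> 32;			/* address bits 63:32 */

This matches the pda_l/pda_h expressions in the patch with those constants
plugged in.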

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/irq_remapping.c | 55 ++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 01df65dca1d5..ac5f9e83943b 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -50,6 +50,7 @@ struct irq_2_iommu {
 	u16 sub_handle;
 	u8  irte_mask;
 	enum irq_mode mode;
+	bool posted_msi;
 };
 
 struct intel_ir_data {
@@ -1119,6 +1120,14 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	irte->redir_hint = 1;
 }
 
+static void prepare_irte_posted(struct irte *irte)
+{
+	memset(irte, 0, sizeof(*irte));
+
+	irte->present = 1;
+	irte->p_pst = 1;
+}
+
 struct irq_remap_ops intel_irq_remap_ops = {
 	.prepare		= intel_prepare_irq_remapping,
 	.enable			= intel_enable_irq_remapping,
@@ -1138,6 +1147,34 @@ static phys_addr_t get_pi_desc_addr(struct irq_data *irqd)
 
 	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
 }
+
+static void intel_ir_reconfigure_irte_posted(struct irq_data *irqd)
+{
+	struct intel_ir_data *ir_data = irqd->chip_data;
+	struct irte *irte = &ir_data->irte_entry;
+	struct irte irte_pi;
+	u64 pid_addr;
+
+	pid_addr = get_pi_desc_addr(irqd);
+
+	if (!pid_addr) {
+		pr_warn("Failed to setup IRQ %d for posted mode\n", irqd->irq);
+		return;
+	}
+
+	memset(&irte_pi, 0, sizeof(irte_pi));
+
+	/* The shared IRTE is already set up as posted during alloc_irte() */
+	dmar_copy_shared_irte(&irte_pi, irte);
+
+	irte_pi.pda_l = (pid_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
+	irte_pi.pda_h = (pid_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);
+
+	modify_irte(&ir_data->irq_2_iommu, &irte_pi);
+}
+
+#else
+static inline void intel_ir_reconfigure_irte_posted(struct irq_data *irqd) {}
 #endif
 
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
@@ -1153,8 +1190,9 @@ static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
 
-	/* Update the hardware only if the interrupt is in remapped mode. */
-	if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
+	if (ir_data->irq_2_iommu.posted_msi)
+		intel_ir_reconfigure_irte_posted(irqd);
+	else if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
 		modify_irte(&ir_data->irq_2_iommu, irte);
 }
 
@@ -1208,7 +1246,7 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct vcpu_data *vcpu_pi_info = info;
 
-	/* stop posting interrupts, back to remapping mode */
+	/* stop posting interrupts, back to the default mode */
 	if (!vcpu_pi_info) {
 		modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
 	} else {
@@ -1334,6 +1372,11 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 		break;
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
+		if (posted_msi_supported()) {
+			prepare_irte_posted(irte);
+			data->irq_2_iommu.posted_msi = 1;
+		}
+
 		set_msi_sid(irte,
 			    pci_real_dma_dev(msi_desc_to_pci_dev(info->desc)));
 		break;
@@ -1421,7 +1464,11 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
 
 		irq_data->hwirq = (index << 16) + i;
 		irq_data->chip_data = ird;
-		irq_data->chip = &intel_ir_chip;
+		if (posted_msi_supported() &&
+			((info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) || (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSIX)))
+			irq_data->chip = &intel_ir_chip_post_msi;
+		else
+			irq_data->chip = &intel_ir_chip;
 		intel_irq_remapping_prepare_irte(ird, irq_cfg, info, index, i);
 		irq_set_status_flags(virq + i, IRQ_MOVE_PCNTXT);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/15] x86/irq: Use bitfields exclusively in posted interrupt descriptor
  2024-01-26 23:42 ` [PATCH 03/15] x86/irq: Use bitfields exclusively in posted interrupt descriptor Jacob Pan
@ 2024-01-31  1:48   ` Sean Christopherson
  2024-02-06  0:40     ` Jacob Pan
  0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2024-01-31  1:48 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy

On Fri, Jan 26, 2024, Jacob Pan wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Mixture of bitfields and types is weird and really not intuitive, remove
> types and use bitfields exclusively.

I agree it's weird, and maybe not immediately intuitive, but that doesn't
mean there's not a good reason for the code being the way it is, i.e.
"it's weird" isn't sufficient justification for touching this type of code.

Bitfields almost always generate inferior code when accessing a subset of the
overall thing.  And even worse, there are subtle side effects that I really don't
want to find out whether or not they are benign.

E.g. before this change, setting the notification vector is:

	movb   $0xf2,0x62(%rsp)

whereas after this change it becomes:

	mov    %eax,%edx
   	and    $0xff00fffd,%edx
	or     $0xf20000,%edx
   	mov    %edx,0x60(%rsp)

Writing extra bytes _shouldn't_ be a problem, as KVM needs to atomically write
the entire control chunk no matter what, but changing this without very good cause
scares me.

If we really want to clean things up, my very strong vote is to remove the
bitfields entirely.  SN is the only bit that's accessed without going through an
accessor, and those should be easy enough to fixup one by one (and we can add
more non-atomic accessors/mutators if it makes sense to do so).

E.g. end up with

/* Posted-Interrupt Descriptor */
struct pi_desc {
	u32 pir[8];     /* Posted interrupt requested */
	union {
		struct {
			u16	notification_bits;
			u8	nv;
			u8	rsvd_2;
			u32	ndst;
		};
		u64 control;
	};
	u32 rsvd[6];
} __aligned(64);
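
For instance, a non-atomic mutator in that style could look like this
(a sketch with a made-up name, not existing kernel code):

static inline void pi_set_nv(struct pi_desc *pid, u8 nv)
{
	/* Plain byte store, no read-modify-write of the control word */
	WRITE_ONCE(pid->nv, nv);
}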

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 03/15] x86/irq: Use bitfields exclusively in posted interrupt descriptor
  2024-01-31  1:48   ` Sean Christopherson
@ 2024-02-06  0:40     ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-02-06  0:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Kevin Tian, maz, Robin Murphy,
	jacob.jun.pan

Hi Sean,

On Tue, 30 Jan 2024 17:48:04 -0800, Sean Christopherson <seanjc@google.com>
wrote:

> On Fri, Jan 26, 2024, Jacob Pan wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > Mixture of bitfields and types is weird and really not intuitive, remove
> > types and use bitfields exclusively.  
> 
> I agree it's weird, and maybe not immediately intuitive, but that doesn't
> mean there's not a good reason for the code being the way it is, i.e.
> "it's weird" isn't sufficient justification for touching this type of code.
> 
> Bitfields almost always generate inferior code when accessing a subset of
> the overall thing.  And even worse, there are subtle side effects that I
> really don't want to find out whether or not they are benign.
> 
> E.g. before this change, setting the notification vector is:
> 
> 	movb   $0xf2,0x62(%rsp)
> 
> whereas after this change it becomes:
> 
> 	mov    %eax,%edx
>    	and    $0xff00fffd,%edx
> 	or     $0xf20000,%edx
>    	mov    %edx,0x60(%rsp)
> 
hmm, that is weird. However, my kernel build with the patch does not
exhibit such code. I am getting the same as before for setting up NV:
 112:   75 06                   jne    11a <vmx_vcpu_pi_load+0xaa>
...
 135:   c6 44 24 22 f2          movb   $0xf2,0x22(%rsp)

However, I do agree having types is more robust; we can also use
this_cpu_write() and friends if needed.

> Writing extra bytes _shouldn't_ be a problem, as KVM needs to atomically
> write the entire control chunk no matter what, but changing this without
> very good cause scares me.
> 
> If we really want to clean things up, my very strong vote is to remove the
> bitfields entirely.  SN is the only bit that's accessed without going
> through an accessor, and those should be easy enough to fixup one by one
> (and we can add more non-atomic accessors/mutators if it makes sense to
> do so).
> 
> E.g. end up with
> 
> /* Posted-Interrupt Descriptor */
> struct pi_desc {
> 	u32 pir[8];     /* Posted interrupt requested */
> 	union {
> 		struct {
> 			u16	notification_bits;
> 			u8	nv;
> 			u8	rsvd_2;
> 			u32	ndst;
> 		};
> 		u64 control;
> 	};
> 	u32 rsvd[6];
> } __aligned(64);

Sounds good to me.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-01-26 23:42 [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (14 preceding siblings ...)
  2024-01-26 23:42 ` [PATCH 15/15] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
@ 2024-02-08 15:34 ` Jens Axboe
  2024-02-09 17:43   ` Jacob Pan
  2024-04-04 13:45 ` Robert Hoo
  16 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2024-02-08 15:34 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Raj Ashok, Tian, Kevin, maz, seanjc,
	Robin Murphy

Hi Jacob,

I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
IOPS bound on the drive, and using 1 thread per drive for IO. Random
reads, using io_uring.

For reference, using polled IO:

IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31

which is about 5.1M/drive, which is what they can deliver.

Before your patches, I see:

IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32

at 2.82M ints/sec. With the patches, I see:

IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32

at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
quite to the extent I expected. Booted with 'posted_msi' and I do see
posted interrupts increasing in the PMN line in /proc/interrupts.

Probably want to fold this one in:
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 8e09d40ea928..a289282f1cf9 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
  * instead of:
  *		read, xchg, read, xchg, read, xchg, read, xchg
  */
-static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
 {
 	int i, vec = FIRST_EXTERNAL_VECTOR;
 	unsigned long pir_copy[4];

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-08 15:34 ` [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jens Axboe
@ 2024-02-09 17:43   ` Jacob Pan
  2024-02-09 20:31     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jacob Pan @ 2024-02-09 17:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jacob.jun.pan

Hi Jens,

On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:

> Hi Jacob,
> 
> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> reads, using io_uring.
> 
> For reference, using polled IO:
> 
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> 
> which is about 5.1M/drive, which is what they can deliver.
> 
> Before your patches, I see:
> 
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> 
> at 2.82M ints/sec. With the patches, I see:
> 
> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> 
> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> quite to the extent I expected. Booted with 'posted_msi' and I do see
> posted interrupts increasing in the PMN line in /proc/interrupts.
> 
The ints/sec reduction is not as high as I expected either, especially
at this high rate, which means not enough coalescing is going on to get
the performance benefits.

The opportunity for IRQ coalescing is also dependent on how long the
driver's hardirq handler executes. In the posted MSI demux loop, it does
not wait for more MSIs to come before exiting the pending IRQ polling
loop. So if the hardirq handler finishes very quickly, it may not coalesce
as much. Perhaps we need to find more "useful" work to do to maximize the
window for coalescing.

I am not familiar with the optane driver, need to look into how its
hardirq handler works. I have only tested NVMe gen5 in terms of storage
IO, where I saw 30-50% ints/sec reduction at an even lower IRQ rate
(200k/sec).
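
Purely as a hypothetical sketch of that idea (pir_has_pending() and
MAX_LINGER are made up), the demux loop could linger briefly before
giving up:

	/* Hypothetical: widen the coalescing window before exiting */
	for (i = 0; i < MAX_LINGER; i++) {
		if (pir_has_pending(pid))
			goto rescan;
		cpu_relax();
	}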

> Probably want to fold this one in:
>  
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 8e09d40ea928..a289282f1cf9 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
>   * instead of:
>   *		read, xchg, read, xchg, read, xchg, read, xchg
>   */
> -static __always_inline inline bool handle_pending_pir(u64 *pir, struct
> pt_regs *regs) +static __always_inline bool handle_pending_pir(u64 *pir,
> struct pt_regs *regs) {
>  	int i, vec = FIRST_EXTERNAL_VECTOR;
>  	unsigned long pir_copy[4];
> 
Good catch! Will do.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-09 17:43   ` Jacob Pan
@ 2024-02-09 20:31     ` Jens Axboe
  2024-02-12 18:27       ` Jacob Pan
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2024-02-09 20:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy

On 2/9/24 10:43 AM, Jacob Pan wrote:
> Hi Jens,
> 
> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> 
>> Hi Jacob,
>>
>> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
>> IOPS bound on the drive, and using 1 thread per drive for IO. Random
>> reads, using io_uring.
>>
>> For reference, using polled IO:
>>
>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>>
>> which is about 5.1M/drive, which is what they can deliver.
>>
>> Before your patches, I see:
>>
>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>
>> at 2.82M ints/sec. With the patches, I see:
>>
>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>>
>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
>> quite to the extent I expected. Booted with 'posted_msi' and I do see
>> posted interrupts increasing in the PMN line in /proc/interrupts.
>>
> The ints/sec reduction is not as high as I expected either, especially
> at this high rate, which means not enough coalescing is going on to get
> the performance benefits.

Right, it means that we're getting pretty decent commands-per-int
coalescing already. I added another drive and repeated, here's that one:

IOPS w/polled: 25.7M IOPS

Stock kernel:

IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32

at ~3.7M ints/sec, or about 5.8 IOPS / int on average.

Patched kernel:

IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32

at the same interrupt rate. So not a reduction, but slightly higher
perf. Maybe we're reaping more commands on average per interrupt.

Anyway, not a lot of interesting data there, just figured I'd re-run it
with the added drive.

> The opportunity for IRQ coalescing is also dependent on how long the
> driver's hardirq handler executes. In the posted MSI demux loop, it does
> not wait for more MSIs to come before exiting the pending IRQ polling
> loop. So if the hardirq handler finishes very quickly, it may not coalesce
> as much. Perhaps we need to find more "useful" work to do to maximize the
> window for coalescing.
> 
> I am not familiar with the optane driver, need to look into how its
> hardirq handler works. I have only tested NVMe gen5 in terms of storage
> IO, where I saw 30-50% ints/sec reduction at an even lower IRQ rate
> (200k/sec).

It's just an nvme device, so it's the nvme driver. The IRQ side is very
cheap - for as long as there are CQEs in the completion ring, it'll reap
them and complete them. That does mean that if we get an IRQ and there's
more than one entry to complete, we will do all of them. No IRQ
coalescing is configured (nvme kind of sucks for that...), but optane
media is much faster than flash, so that may be a difference.
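
Conceptually, the IRQ-side reaping is just this (a simplified sketch of
the nvme completion path, not the literal driver code):

	/* Consume CQEs until the phase tag shows the CQ ring is empty */
	while (nvme_cqe_pending(nvmeq)) {
		nvme_handle_cqe(nvmeq, nvmeq->cq_head);
		nvme_update_cq_head(nvmeq);	/* phase flips on wrap */
	}
	nvme_ring_cq_doorbell(nvmeq);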

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-09 20:31     ` Jens Axboe
@ 2024-02-12 18:27       ` Jacob Pan
  2024-02-12 18:36         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jacob Pan @ 2024-02-12 18:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jacob.jun.pan

Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:

> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> > 
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >   
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN line in /proc/interrupts.
> >>  
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate, which means not enough coalescing is going on to get
> > the performance benefits.
> 
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
> 
> IOPS w/polled: 25.7M IOPS
> 
> Stock kernel:
> 
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> 
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> 
> Patched kernel:
> 
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> 
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
> 
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
> 
> > The opportunity for IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before exiting the pending IRQ
> > polling loop. So if the hardirq handler finishes very quickly, it may
> > not coalesce as much. Perhaps we need to find more "useful" work to do to
> > maximize the window for coalescing.
> > 
> > I am not familiar with the optane driver, need to look into how its
> > hardirq handler works. I have only tested NVMe gen5 in terms of storage
> > IO, where I saw 30-50% ints/sec reduction at an even lower IRQ rate
> > (200k/sec).
> 
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
> 
Yeah, I also checked the driver code; it seems to just wake up the
threaded handler.

For the record, here is my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
IRQ throughput improved rather than being reduced with this patch on my
setup, e.g. BEFORE: 185545/sec/vector, AFTER: 220128.

CPU: (highest non-turbo freq, maybe different on yours).
echo "Set CPU frequency P1 2.7GHz"                                                                      
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done

PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme                                                   
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express]) 
        Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]                                                                            
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+                                                                     
                LnkSta: Speed 32GT/s (ok), Width x4(ok)                                                                                    
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVME setup:                                            
nvme5n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme6n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme3n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme4n1       SAMSUNG MZWLO1T9HCJR-00A07                    

FIO:
[global]                      
bs=4k                         
direct=1                      
norandommap                   
ioengine=libaio               
randrepeat=0                  
readwrite=randread            
group_reporting               
time_based                    
iodepth=64                    
exitall                       
random_generator=tausworthe64 
runtime=30                    
ramp_time=3                   
numjobs=8                     
group_reporting=1             
                              
#cpus_allowed_policy=shared   
cpus_allowed_policy=split     
[disk_nvme6n1_thread_1]       
filename=/dev/nvme6n1         
cpus_allowed=0-7       
[disk_nvme6n1_thread_1]
filename=/dev/nvme5n1  
cpus_allowed=8-15      
[disk_nvme5n1_thread_2]
filename=/dev/nvme4n1  
cpus_allowed=16-23     
[disk_nvme5n1_thread_3]
filename=/dev/nvme3n1  
cpus_allowed=24-31     

iostat w/o posted MSI patch, v6.8-rc1:						
nvme3c3n1     1615525.00   6462100.00         0.00         0.00    6462100						
nvme4c4n1     1615471.00   6461884.00         0.00         0.00    6461884						
nvme5c5n1     1615602.00   6462408.00         0.00         0.00    6462408						
nvme6c6n1     1614637.00   6458544.00         0.00         0.00    6458544	

irqtop (delta 1 sec.)					
           IRQ           TOTAL          DELTA NAME                                      							
           800         6290026         185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76							
           797         6279554         185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73							
           799         6281627         185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75							
           802         6285742         185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78							
	... ... similar irq rate for all 32 vectors

iostat w/ posted MSI patch:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd						
nvme3c3n1     2184313.00   8737256.00         0.00         0.00    8737256          0          0						
nvme4c4n1     2184241.00   8736972.00         0.00         0.00    8736972          0          0						
nvme5c5n1     2184269.00   8737080.00         0.00         0.00    8737080          0          0						
nvme6c6n1     2184003.00   8736012.00         0.00         0.00    8736012          0          0						
						
irqtop w/ posted MSI patch:
           IRQ           TOTAL           DELTA NAME                                     							
           PMN      5230078416         5502657 Posted MSI notification event            							
           423       138068935          220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80							
           425       138057654          219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82							
           426       138101745          219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83							
	... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted interrupt notifications (PMN) / total MSIs
= 5502657 / (220128 * 32) ≈ 0.78, i.e. ~78%


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-12 18:27       ` Jacob Pan
@ 2024-02-12 18:36         ` Jens Axboe
  2024-02-12 20:13           ` Jacob Pan
  2024-02-13  1:10           ` Jacob Pan
  0 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2024-02-12 18:36 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy

On 2/12/24 11:27 AM, Jacob Pan wrote:
> Hi Jens,
> 
> On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 2/9/24 10:43 AM, Jacob Pan wrote:
>>> Hi Jens,
>>>
>>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
>>>   
>>>> Hi Jacob,
>>>>
>>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
>>>> IOPS bound on the drive, and using 1 thread per drive for IO. Random
>>>> reads, using io_uring.
>>>>
>>>> For reference, using polled IO:
>>>>
>>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
>>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>>>>
>>>> which is about 5.1M/drive, which is what they can deliver.
>>>>
>>>> Before your patches, I see:
>>>>
>>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
>>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>>>>
>>>> at 2.82M ints/sec. With the patches, I see:
>>>>
>>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
>>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
>>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>>>>
>>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
>>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
>>>> posted interrupts increasing in the PMN line in /proc/interrupts.
>>>>  
>>> The ints/sec reduction is not as high as I expected either, especially
>>> at this high rate, which means not enough coalescing is going on to get
>>> the performance benefits.
>>
>> Right, it means that we're getting pretty decent commands-per-int
>> coalescing already. I added another drive and repeated, here's that one:
>>
>> IOPS w/polled: 25.7M IOPS
>>
>> Stock kernel:
>>
>> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
>> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>>
>> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>>
>> Patched kernel:
>>
>> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
>> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
>> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>>
>> at the same interrupt rate. So not a reduction, but slightly higher
>> perf. Maybe we're reaping more commands on average per interrupt.
>>
>> Anyway, not a lot of interesting data there, just figured I'd re-run it
>> with the added drive.
>>
>>> The opportunity for IRQ coalescing is also dependent on how long the
>>> driver's hardirq handler executes. In the posted MSI demux loop, it
>>> does not wait for more MSIs to come before exiting the pending IRQ
>>> polling loop. So if the hardirq handler finishes very quickly, it may
>>> not coalesce as much. Perhaps we need to find more "useful" work to
>>> do to maximize the window for coalescing.
>>>
>>> I am not familiar with the optane driver, need to look into how its
>>> hardirq handler works. I have only tested NVMe gen5 in terms of
>>> storage IO, where I saw 30-50% ints/sec reduction at an even lower
>>> IRQ rate (200k/sec).
>>
>> It's just an nvme device, so it's the nvme driver. The IRQ side is very
>> cheap - for as long as there are CQEs in the completion ring, it'll reap
>> them and complete them. That does mean that if we get an IRQ and there's
>> more than one entry to complete, we will do all of them. No IRQ
>> coalescing is configured (nvme kind of sucks for that...), but optane
>> media is much faster than flash, so that may be a difference.
>>
> Yeah, I also checked the driver code; it seems to just wake up the
> threaded handler.

That only happens if you're using threaded interrupts, which is not the
default as it's much slower. What happens for the normal case is that we
init a batch, and then poll the CQ ring for completions, and then add
them to the completion batch. Once no more are found, we complete the
batch.

You're not using threaded interrupts, are you?

> For the record, here is my setup and performance data for 4 Samsung disks.
> IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
> IRQ throughput improved rather than being reduced with this patch on my
> setup, e.g. BEFORE: 185545/sec/vector, AFTER: 220128.

I'm surprised at the rates being that low, and if so, why the posted MSI
makes a difference? Usually what I've seen for IRQ being slower than
poll is if interrupt delivery is unreasonably slow on that architecture
of machine. But ~200k/sec isn't that high at all.

> [global]                      
> bs=4k                         
> direct=1                      
> norandommap                   
> ioengine=libaio               
> randrepeat=0                  
> readwrite=randread            
> group_reporting               
> time_based                    
> iodepth=64                    
> exitall                       
> random_generator=tausworthe64 
> runtime=30                    
> ramp_time=3                   
> numjobs=8                     
> group_reporting=1             
>                               
> #cpus_allowed_policy=shared   
> cpus_allowed_policy=split     
> [disk_nvme6n1_thread_1]       
> filename=/dev/nvme6n1         
> cpus_allowed=0-7       
> [disk_nvme6n1_thread_1]
> filename=/dev/nvme5n1  
> cpus_allowed=8-15      
> [disk_nvme5n1_thread_2]
> filename=/dev/nvme4n1  
> cpus_allowed=16-23     
> [disk_nvme5n1_thread_3]
> filename=/dev/nvme3n1  
> cpus_allowed=24-31     

For better performance, I'd change that engine=libaio to:

ioengine=io_uring
fixedbufs=1
registerfiles=1

Particularly fixedbufs makes a big difference, as a big cycle consumer
is mapping/unmapping pages from the application space into the kernel
for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
buffers. At least for my runs, this is ~15% of the systime for doing IO.
It also removes the page referencing, which isn't as big a consumer, but
still noticeable.

Anyway, side quest, but I think you'll find this considerably reduces
overhead / improves performance. Also makes it so that you can compare
with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
option for that (with a side note that you need to configure nvme poll
queues, see the poll_queues parameter).

On my box, all the NVMe devices seem to be on node1, not node0 which
looks like it's the CPUs you are using. Might be worth checking and
adjusting your CPU domains for each drive? I also tend to get better
performance by removing the CPU scheduler, eg just pin each job to a
single CPU rather than many. It's just one process/thread anyway, so
really no point in giving it options here. It'll help reduce variability
too, which can be a pain in the butt to deal with.
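
Putting those suggestions together, each job section might end up looking
something like this (illustrative values):

[disk_nvme6n1]
filename=/dev/nvme6n1
ioengine=io_uring
fixedbufs=1
registerfiles=1
; hipri=1 requires nvme poll queues to be configured
hipri=1
cpus_allowed=0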

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-12 18:36         ` Jens Axboe
@ 2024-02-12 20:13           ` Jacob Pan
  2024-02-13  1:10           ` Jacob Pan
  1 sibling, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-02-12 20:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jacob.jun.pan

Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:

> On 2/12/24 11:27 AM, Jacob Pan wrote:
> > Hi Jens,
> > 
> > On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >   
> >> On 2/9/24 10:43 AM, Jacob Pan wrote:  
> >>> Hi Jens,
> >>>
> >>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >>>     
> >>>> Hi Jacob,
> >>>>
> >>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test,
> >>>> just IOPS bound on the drive, and using 1 thread per drive for IO.
> >>>> Random reads, using io_uring.
> >>>>
> >>>> For reference, using polled IO:
> >>>>
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>>>
> >>>> which is about 5.1M/drive, which is what they can deliver.
> >>>>
> >>>> Before your patches, I see:
> >>>>
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>>
> >>>> at 2.82M ints/sec. With the patches, I see:
> >>>>
> >>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>>>
> >>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >>>> posted interrupts increasing in the PMN line in /proc/interrupts.
> >>>>    
> >>> The ints/sec reduction is not as high as I expected either, especially
> >>> at this high rate, which means not enough coalescing is going on to
> >>> get the performance benefits.
> >>
> >> Right, it means that we're getting pretty decent commands-per-int
> >> coalescing already. I added another drive and repeated, here's that
> >> one:
> >>
> >> IOPS w/polled: 25.7M IOPS
> >>
> >> Stock kernel:
> >>
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >>
> >> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> >>
> >> Patched kernel:
> >>
> >> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> >>
> >> at the same interrupt rate. So not a reduction, but slightly higher
> >> perf. Maybe we're reaping more commands on average per interrupt.
> >>
> >> Anyway, not a lot of interesting data there, just figured I'd re-run it
> >> with the added drive.
> >>  
> >>> The opportunity for IRQ coalescing is also dependent on how long the
> >>> driver's hardirq handler executes. In the posted MSI demux loop, it
> >>> does not wait for more MSIs to come before exiting the pending IRQ
> >>> polling loop. So if the hardirq handler finishes very quickly, it may
> >>> not coalesce as much. Perhaps we need to find more "useful" work to
> >>> do to maximize the window for coalescing.
> >>>
> >>> I am not familiar with the optane driver, need to look into how its
> >>> hardirq handler works. I have only tested NVMe gen5 in terms of
> >>> storage IO, where I saw 30-50% ints/sec reduction at an even lower
> >>> IRQ rate (200k/sec).
> >>
> >> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> >> cheap - for as long as there are CQEs in the completion ring, it'll
> >> reap them and complete them. That does mean that if we get an IRQ and
> >> there's more than one entry to complete, we will do all of them. No IRQ
> >> coalescing is configured (nvme kind of sucks for that...), but optane
> >> media is much faster than flash, so that may be a difference.
> >>  
> > Yeah, I also checked the driver code; it seems to just wake up the
> > threaded handler.
> 
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
> 
Thanks for the explanation.

> You're not using threaded interrupts, are you?
No, I didn't add the module parameter "use_threaded_interrupts".

> 
> > For the record, here is my setup and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput improved rather than being reduced with
> > this patch on my setup, e.g. BEFORE: 185545/sec/vector, AFTER: 220128.
> 
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.
> 


> > [global]                      
> > bs=4k                         
> > direct=1                      
> > norandommap                   
> > ioengine=libaio               
> > randrepeat=0                  
> > readwrite=randread            
> > group_reporting               
> > time_based                    
> > iodepth=64                    
> > exitall                       
> > random_generator=tausworthe64 
> > runtime=30                    
> > ramp_time=3                   
> > numjobs=8                     
> > group_reporting=1             
> >                               
> > #cpus_allowed_policy=shared   
> > cpus_allowed_policy=split     
> > [disk_nvme6n1_thread_1]       
> > filename=/dev/nvme6n1         
> > cpus_allowed=0-7       
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme5n1  
> > cpus_allowed=8-15      
> > [disk_nvme5n1_thread_2]
> > filename=/dev/nvme4n1  
> > cpus_allowed=16-23     
> > [disk_nvme5n1_thread_3]
> > filename=/dev/nvme3n1  
> > cpus_allowed=24-31       
> 
> For better performance, I'd change that engine=libaio to:
> 
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
> 
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
> 
Indeed, the CPU utilization system time goes down significantly. I got the
following with posted MSI patch applied:
Before (aio):
  read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
	user    3m25.156s                                                                                                           	
	sys     11m16.785s                                                                                                          	
						
After (fixedbufs, iouring engine):
  read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
  	user    2m56.255s                                                                                                  	
	sys     8m56.378s                                                                                                  	
				
It seems to have no gain in IOPS, just CPU utilization reduction.

Both are an improvement over libaio without the posted MSI patch.

> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
> 
> On my box, all the NVMe devices seem to be on node1, not node0 which
> looks like it's the CPUs you are using. Might be worth checking and
> adjusting your CPU domains for each drive? I also tend to get better
> performance by removing the CPU scheduler, eg just pin each job to a
> single CPU rather than many. It's just one process/thread anyway, so
> really no point in giving it options here. It'll help reduce variability
> too, which can be a pain in the butt to deal with.
> 
Much faster with poll_queues=32 (32 jobs):
  read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
     	user    2m29.177s
	sys     15m7.022s
				
Observed no IRQ counts from NVMe.
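
(That is with the nvme module parameter, e.g. nvme.poll_queues=32 on the
kernel command line, plus hipri=1 in the fio job, per your suggestion.)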

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-02-12 18:36         ` Jens Axboe
  2024-02-12 20:13           ` Jacob Pan
@ 2024-02-13  1:10           ` Jacob Pan
  1 sibling, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-02-13  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy, jacob.jun.pan, Lantz,
	Philip

Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:

> > For the record, here is my setup and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput improved rather than being reduced with
> > this patch on my setup, e.g. BEFORE: 185545/sec/vector, AFTER: 220128.
> 
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.

Even at ~200k/sec, I am seeing around a 75% ratio between posted interrupt
notifications and MSIs, i.e. for every 4 MSIs we save one CPU notification.
That might be where the savings come from.
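
(Worked out from the irqtop numbers earlier in the thread: 5502657 PMN/sec
vs. 220128 MSIs/sec/vector * 32 vectors = 7044096 MSIs/sec, and
5502657 / 7044096 ≈ 0.78, i.e. roughly one notification suppressed for
every four to five MSIs.)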

I was expecting an even or reduced number of CPU notifications but more
MSI throughput. Instead, Optane gets fewer MSIs/sec, as your data shows.

Is it possible to get the interrupt coalescing ratio on your setup? I.e.,
the PMN count in /proc/interrupts divided by total NVMe MSIs.

Here is a summary of my testing on 4 Samsung Gen 5 drives:
test case		IOPS (x1000)	ints/sec (MSI)
=================================================
aio			6348		182218
io_uring		6895		207932
aio w/ posted MSI	8295		185545
io_uring w/ posted MSI	8811		220128
io_uring poll_queues	13000		0
=================================================


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/15] x86/irq: Set up per host CPU posted interrupt descriptors
  2024-01-26 23:42 ` [PATCH 06/15] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
@ 2024-02-13 19:44   ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-02-13 19:44 UTC (permalink / raw)
  To: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy, jacob.jun.pan, oliver.sang


On Fri, 26 Jan 2024 15:42:28 -0800, Jacob Pan
<jacob.jun.pan@linux.intel.com> wrote:
I have made a couple of mistakes here, caught by LKP testing and reported
by Oliver Sang.

>  
> +#ifdef CONFIG_X86_POSTED_MSI
> +
> +/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
> +DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
> +
> +void intel_posted_msi_init(void)
> +{
> +	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
> +
> +	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
> +	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
Based on VT-d specification section 9.11, the middle portion of the 32-bit
field is used in xAPIC mode instead of the lowest 8 bits. Not sure why it
was designed this way; perhaps to make sure people read the spec carefully :)

xAPIC Mode (Physical): 
 319:304 - Reserved (0)
 303:296 - APIC DestinationID[7:0]
 295:288 - Reserved (0)
x2APIC Mode (Physical):
 319:288 - APIC DestinationID[31:0]

So it should be something like:
	pid->ndst = x2apic_enabled() ? apic_id : apic_id << 8;

> +}
> +}
A patch partitioning mistake; will fix.

> +#endif /* X86_POSTED_MSI */

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/15] x86/irq: Install posted MSI notification handler
  2024-01-26 23:42 ` [PATCH 09/15] x86/irq: Install posted MSI notification handler Jacob Pan
@ 2024-03-29  7:32   ` Zeng Guang
  2024-04-03  2:43     ` Jacob Pan
  0 siblings, 1 reply; 34+ messages in thread
From: Zeng Guang @ 2024-03-29  7:32 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Hansen, Dave, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Luse, Paul E, Williams, Dan J, Jens Axboe, Raj, Ashok, Tian,
	Kevin, maz, seanjc, Robin Murphy


On 1/27/2024 7:42 AM, Jacob Pan wrote:
> @@ -353,6 +360,111 @@ void intel_posted_msi_init(void)
>   	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
>   	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
>   }
> +
> +/*
> + * De-multiplexing posted interrupts is on the performance path, the code
> + * below is written to optimize the cache performance based on the following
> + * considerations:
> + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> + *   accessed by both CPU and IOMMU.
> + * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
> + *   for checking and clearing posted interrupt request (PIR), a 256 bit field
> + *   within the PID.
> + * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
> + *   line when posting interrupts and setting control bits.
> + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
> + *   cache line. The cache line states after each operation are as follows:
> + *   CPU		IOMMU			PID Cache line state
> + *   ---------------------------------------------------------------
> + *...read64					exclusive
> + *...lock xchg64				modified
> + *...			post/atomic swap	invalid
> + *...-------------------------------------------------------------
> + *
> + * To reduce L1 data cache miss, it is important to avoid contention with
> + * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
> + * to dispatch interrupt handlers.
> + *
> + * In addition, the code is trying to keep the cache line state consistent
> + * as much as possible. e.g. when making a copy and clearing the PIR
> + * (assuming non-zero PIR bits are present in the entire PIR), it does:
> + *		read, read, read, read, xchg, xchg, xchg, xchg
> + * instead of:
> + *		read, xchg, read, xchg, read, xchg, read, xchg
> + */
> +static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> +{
> +	int i, vec = FIRST_EXTERNAL_VECTOR;
> +	unsigned long pir_copy[4];
> +	bool handled = false;
> +
> +	for (i = 0; i < 4; i++)
> +		pir_copy[i] = pir[i];
> +
> +	for (i = 0; i < 4; i++) {
> +		if (!pir_copy[i])
> +			continue;
> +
> +		pir_copy[i] = arch_xchg(pir, 0);

There is a problem here: pir_copy[i] will always be written with the value
of pir[0], since the xchg always targets &pir[0]. This leads to handling
spurious posted MSIs later.
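
Presumably it should exchange the i-th element instead, i.e. something like:

		pir_copy[i] = arch_xchg(&pir[i], 0);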

> +		handled = true;
> +	}
> +
> +	if (handled) {
> +		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
> +			call_irq_handler(vec, regs);
> +	}
> +
> +	return handled;
> +}
> +
> +/*
> + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> + * on high IRQ rate workload.
> + */
> +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> +
> +/*
> + * For MSIs that are delivered as posted interrupts, the CPU notifications
> + * can be coalesced if the MSIs arrive in high frequency bursts.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> +{
> +	struct pt_regs *old_regs = set_irq_regs(regs);
> +	struct pi_desc *pid;
> +	int i = 0;
> +
> +	pid = this_cpu_ptr(&posted_interrupt_desc);
> +
> +	inc_irq_stat(posted_msi_notification_count);
> +	irq_enter();
> +
> +	/*
> +	 * Max coalescing count includes the extra round of handle_pending_pir
> +	 * after clearing the outstanding notification bit. Hence, at most
> +	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
> +	 */
> +	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
> +		if (!handle_pending_pir(pid->pir64, regs))
> +			break;
> +	}
> +
> +	/*
> +	 * Clear outstanding notification bit to allow new IRQ notifications,
> +	 * do this last to maximize the window of interrupt coalescing.
> +	 */
> +	pi_clear_on(pid);
> +
> +	/*
> +	 * There could be a race of PI notification and the clearing of ON bit,
> +	 * process PIR bits one last time such that handling the new interrupts
> +	 * are not delayed until the next IRQ.
> +	 */
> +	handle_pending_pir(pid->pir64, regs);
> +
> +	apic_eoi();
> +	irq_exit();
> +	set_irq_regs(old_regs);
>   }
>   #endif /* X86_POSTED_MSI */
>   

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 09/15] x86/irq: Install posted MSI notification handler
  2024-03-29  7:32   ` Zeng Guang
@ 2024-04-03  2:43     ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-04-03  2:43 UTC (permalink / raw)
  To: Zeng Guang
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Hansen, Dave, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Luse, Paul E, Williams, Dan J,
	Jens Axboe, Raj, Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Zeng,

On Fri, 29 Mar 2024 15:32:00 +0800, Zeng Guang <guang.zeng@intel.com> wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > @@ -353,6 +360,111 @@ void intel_posted_msi_init(void)
> >   	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
> >   	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
> >   }
> > +
> > +/*
> > + * De-multiplexing posted interrupts is on the performance path, the code
> > + * below is written to optimize the cache performance based on the following
> > + * considerations:
> > + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> > + *   accessed by both CPU and IOMMU.
> > + * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
> > + *   for checking and clearing posted interrupt request (PIR), a 256 bit field
> > + *   within the PID.
> > + * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
> > + *   line when posting interrupts and setting control bits.
> > + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> > + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
> > + *   cache line. The cache line states after each operation are as follows:
> > + *   CPU		IOMMU			PID Cache line state
> > + *   ---------------------------------------------------------------
> > + *...read64					exclusive
> > + *...lock xchg64				modified
> > + *...			post/atomic swap	invalid
> > + *...-------------------------------------------------------------
> > + *
> > + * To reduce L1 data cache miss, it is important to avoid contention with
> > + * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
> > + * to dispatch interrupt handlers.
> > + *
> > + * In addition, the code is trying to keep the cache line state consistent
> > + * as much as possible. e.g. when making a copy and clearing the PIR
> > + * (assuming non-zero PIR bits are present in the entire PIR), it does:
> > + *		read, read, read, read, xchg, xchg, xchg, xchg
> > + * instead of:
> > + *		read, xchg, read, xchg, read, xchg, read, xchg
> > + */
> > +static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> > +{
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	unsigned long pir_copy[4];
> > +	bool handled = false;
> > +
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = pir[i];
> > +
> > +	for (i = 0; i < 4; i++) {
> > +		if (!pir_copy[i])
> > +			continue;
> > +
> > +		pir_copy[i] = arch_xchg(pir, 0);  
> 
> There is a problem here: pir_copy[i] will always be written with the value
> of pir[0]. This leads to handling spurious posted MSIs later.
Yes, you are right. It should be:
	pir_copy[i] = arch_xchg(&pir[i], 0);

Will fix in v2, much appreciated.
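
With that change folded in, the copy-and-clear loop would read as follows
(a sketch of the intended v2, shown here for clarity):

	for (i = 0; i < 4; i++) {
		if (!pir_copy[i])
			continue;

		/* Atomically fetch and clear each nonzero 64-bit PIR chunk. */
		pir_copy[i] = arch_xchg(&pir[i], 0);
		handled = true;
	}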

> > +		handled = true;
> > +	}
> > +
> > +	if (handled) {
> > +		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
> > +			call_irq_handler(vec, regs);
> > +	}
> > +
> > +	return handled;
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> > + * on high IRQ rate workload.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */
> > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> > +{
> > +	struct pt_regs *old_regs = set_irq_regs(regs);
> > +	struct pi_desc *pid;
> > +	int i = 0;
> > +
> > +	pid = this_cpu_ptr(&posted_interrupt_desc);
> > +
> > +	inc_irq_stat(posted_msi_notification_count);
> > +	irq_enter();
> > +
> > +	/*
> > +	 * Max coalescing count includes the extra round of handle_pending_pir
> > +	 * after clearing the outstanding notification bit. Hence, at most
> > +	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
> > +	 */
> > +	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
> > +		if (!handle_pending_pir(pid->pir64, regs))
> > +			break;
> > +	}
> > +
> > +	/*
> > +	 * Clear outstanding notification bit to allow new IRQ notifications,
> > +	 * do this last to maximize the window of interrupt coalescing.
> > +	 */
> > +	pi_clear_on(pid);
> > +
> > +	/*
> > +	 * There could be a race of PI notification and the clearing of ON bit,
> > +	 * process PIR bits one last time such that handling the new interrupts
> > +	 * are not delayed until the next IRQ.
> > +	 */
> > +	handle_pending_pir(pid->pir64, regs);
> > +
> > +	apic_eoi();
> > +	irq_exit();
> > +	set_irq_regs(old_regs);
> >   }
> >   #endif /* X86_POSTED_MSI */
> >     


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/15] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-01-26 23:42 ` [PATCH 05/15] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
@ 2024-04-04 13:38   ` Robert Hoo
  2024-04-04 17:17     ` Jacob Pan
  0 siblings, 1 reply; 34+ messages in thread
From: Robert Hoo @ 2024-04-04 13:38 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy

On 1/27/2024 7:42 AM, Jacob Pan wrote:
> When posted MSI is enabled, all device MSIs are multiplexed into a single
> notification vector. MSI handlers will be de-multiplexed at run-time by
> system software without IDT delivery.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>   arch/x86/include/asm/irq_vectors.h | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 3a19904c2db6..08329bef5b1d 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -99,9 +99,16 @@
>   
>   #define LOCAL_TIMER_VECTOR		0xec
>   
> +/*
> + * Posted interrupt notification vector for all device MSIs delivered to
> + * the host kernel.
> + */
> +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
>   #define NR_VECTORS			 256
>   
> -#ifdef CONFIG_X86_LOCAL_APIC
> +#ifdef X86_POSTED_MSI

X86_POSTED_MSI --> CONFIG_X86_POSTED_MSI?

> +#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
> +#elif defined(CONFIG_X86_LOCAL_APIC)
>   #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
>   #else
>   #define FIRST_SYSTEM_VECTOR		NR_VECTORS


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-01-26 23:42 [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (15 preceding siblings ...)
  2024-02-08 15:34 ` [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jens Axboe
@ 2024-04-04 13:45 ` Robert Hoo
  2024-04-04 17:37   ` Jacob Pan
  16 siblings, 1 reply; 34+ messages in thread
From: Robert Hoo @ 2024-04-04 13:45 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy

On 1/27/2024 7:42 AM, Jacob Pan wrote:
> Hi Thomas and all,
> 
> This patch set is aimed to improve IRQ throughput on Intel Xeon by making use of
> posted interrupts.
> 
> There is a session at LPC2023 IOMMU/VFIO/PCI MC where I have presented this
> topic.
> 
> https://lpc.events/event/17/sessions/172/#20231115
> 
> Background
> ==========
> On modern x86 server SoCs, interrupt remapping (IR) is required and turned
> on by default to support X2APIC. Two interrupt remapping modes can be supported
> by IOMMU/VT-d:
> 
> - Remappable 	(host)
> - Posted	(guest only so far)
> 
> With remappable mode, the device MSI to CPU process is a HW flow without system
> software touch points, it roughly goes as follows:
> 
> 1.	Devices issue interrupt requests with writes to 0xFEEx_xxxx
> 2.	The system agent accepts and remaps/translates the IRQ
> 3.	Upon receiving the translation response, the system agent notifies the
> 	destination CPU with the translated MSI
> 4.	CPU's local APIC accepts interrupts into its IRR/ISR registers
> 5.	Interrupt delivered through IDT (MSI vector)
> 
> The above process can be inefficient under high IRQ rates. The notifications in
> step #3 are often unnecessary when the destination CPU is already overwhelmed
> with handling bursts of IRQs. On some architectures, such as Intel Xeon, step #3
> is also expensive and requires strong ordering w.r.t DMA. 

Can you say more about this "step #3 requires strong ordering w.r.t. DMA"?

> As a result, slower
> IRQ rates can become a limiting factor for DMA I/O performance.
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 05/15] x86/irq: Reserve a per CPU IDT vector for posted MSIs
  2024-04-04 13:38   ` Robert Hoo
@ 2024-04-04 17:17     ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-04-04 17:17 UTC (permalink / raw)
  To: Robert Hoo
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Robert,

On Thu, 4 Apr 2024 21:38:34 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > When posted MSI is enabled, all device MSIs are multiplexed into a
> > single notification vector. MSI handlers will be de-multiplexed at
> > run-time by system software without IDT delivery.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >   arch/x86/include/asm/irq_vectors.h | 9 ++++++++-
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> > index 3a19904c2db6..08329bef5b1d 100644
> > --- a/arch/x86/include/asm/irq_vectors.h
> > +++ b/arch/x86/include/asm/irq_vectors.h
> > @@ -99,9 +99,16 @@
> >   
> >   #define LOCAL_TIMER_VECTOR		0xec
> >   
> > +/*
> > + * Posted interrupt notification vector for all device MSIs delivered to
> > + * the host kernel.
> > + */
> > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xeb
> >   #define NR_VECTORS			 256
> >   
> > -#ifdef CONFIG_X86_LOCAL_APIC
> > +#ifdef X86_POSTED_MSI  
> 
> X86_POSTED_MSI --> CONFIG_X86_POSTED_MSI?
Indeed, thanks for catching that!

> > +#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
> > +#elif defined(CONFIG_X86_LOCAL_APIC)
> >   #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
> >   #else
> >   #define FIRST_SYSTEM_VECTOR		NR_VECTORS  
> 
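
With the config symbol spelled correctly, the block should presumably end
up as the following (a sketch of the expected v2, not the actual posted
fix):

	#ifdef CONFIG_X86_POSTED_MSI
	#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
	#elif defined(CONFIG_X86_LOCAL_APIC)
	#define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
	#else
	#define FIRST_SYSTEM_VECTOR		NR_VECTORS
	#endif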


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
  2024-04-04 13:45 ` Robert Hoo
@ 2024-04-04 17:37   ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-04-04 17:37 UTC (permalink / raw)
  To: Robert Hoo
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan, Bjorn Helgaas

Hi Robert,

On Thu, 4 Apr 2024 21:45:05 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > Hi Thomas and all,
> > 
> > This patch set is aimed to improve IRQ throughput on Intel Xeon by
> > making use of posted interrupts.
> > 
> > There is a session at LPC2023 IOMMU/VFIO/PCI MC where I have presented
> > this topic.
> > 
> > https://lpc.events/event/17/sessions/172/#20231115
> > 
> > Background
> > ==========
> > On modern x86 server SoCs, interrupt remapping (IR) is required and
> > turned on by default to support X2APIC. Two interrupt remapping modes
> > can be supported by IOMMU/VT-d:
> > 
> > - Remappable 	(host)
> > - Posted	(guest only so far)
> > 
> > With remappable mode, the device MSI to CPU process is a HW flow
> > without system software touch points, it roughly goes as follows:
> > 
> > 1.	Devices issue interrupt requests with writes to 0xFEEx_xxxx
> > 2.	The system agent accepts and remaps/translates the IRQ
> > 3.	Upon receiving the translation response, the system agent
> > notifies the destination CPU with the translated MSI
> > 4.	CPU's local APIC accepts interrupts into its IRR/ISR registers
> > 5.	Interrupt delivered through IDT (MSI vector)
> > 
> > The above process can be inefficient under high IRQ rates. The
> > notifications in step #3 are often unnecessary when the destination CPU
> > is already overwhelmed with handling bursts of IRQs. On some
> > architectures, such as Intel Xeon, step #3 is also expensive and
> > requires strong ordering w.r.t DMA.   
> 
> Can you say more about this "step #3 requires strong ordering w.r.t. DMA"?
> 
I am not sure how much microarchitectural detail I can disclose, but the
point is that there are ordering rules between DMA reads/writes and
posted MSI writes. I am not a hardware expert.

From the PCIe point of view, my understanding is that the upstream writes
tested here (the result of the 4K random reads on the NVMe drives) are
relaxed ordered. I can see lspci showing RlxdOrd+ on my Samsung drives:

DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 512 bytes, MaxReadReq 4096 bytes

But MSI writes are strictly ordered, AFAIK.
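
For reference, the RlxdOrd bit can be checked per device with something
like the following (the BDF is a placeholder, not from this setup):

	lspci -vvv -s 0000:01:00.0 | grep RlxdOrd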

> > As a result, slower
> > IRQ rates can become a limiting factor for DMA I/O performance.
> >   
> 
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/15] x86/irq: Add a Kconfig option for posted MSI
  2024-01-26 23:42 ` [PATCH 04/15] x86/irq: Add a Kconfig option for posted MSI Jacob Pan
@ 2024-04-05  2:28   ` Robert Hoo
  2024-04-05 15:54     ` Jacob Pan
  0 siblings, 1 reply; 34+ messages in thread
From: Robert Hoo @ 2024-04-05  2:28 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, Peter Zijlstra, iommu,
	Thomas Gleixner, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Paul Luse, Dan Williams, Jens Axboe, Raj Ashok, Tian, Kevin, maz,
	seanjc, Robin Murphy

On 1/27/2024 7:42 AM, Jacob Pan wrote:
> This option will be used to support delivering MSIs as posted
> interrupts. Interrupt remapping is required.
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>   arch/x86/Kconfig | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..79f04ee2b91c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -463,6 +463,17 @@ config X86_X2APIC
>   
>   	  If you don't know what to do here, say N.
>   
> +config X86_POSTED_MSI
> +	bool "Enable MSI and MSI-x delivery by posted interrupts"
> +	depends on X86_X2APIC && X86_64 && IRQ_REMAP

Does posted_msi really depend on x2APIC? PID.NDST encoding supports both xAPIC 
and x2APIC.
If posted_msi imposes a more stringent requirement, I think it deserves an
explanation in this patch's description.

And, X86_X2APIC already depends on IRQ_REMAP, can we just list one of them here?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 04/15] x86/irq: Add a Kconfig option for posted MSI
  2024-04-05  2:28   ` Robert Hoo
@ 2024-04-05 15:54     ` Jacob Pan
  0 siblings, 0 replies; 34+ messages in thread
From: Jacob Pan @ 2024-04-05 15:54 UTC (permalink / raw)
  To: Robert Hoo
  Cc: LKML, X86 Kernel, Peter Zijlstra, iommu, Thomas Gleixner,
	Lu Baolu, kvm, Dave Hansen, Joerg Roedel, H. Peter Anvin,
	Borislav Petkov, Ingo Molnar, Paul Luse, Dan Williams,
	Jens Axboe, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Robert,

On Fri, 5 Apr 2024 10:28:59 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > This option will be used to support delivering MSIs as posted
> > interrupts. Interrupt remapping is required.
> > 
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >   arch/x86/Kconfig | 11 +++++++++++
> >   1 file changed, 11 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 5edec175b9bf..79f04ee2b91c 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -463,6 +463,17 @@ config X86_X2APIC
> >   
> >   	  If you don't know what to do here, say N.
> >   
> > +config X86_POSTED_MSI
> > +	bool "Enable MSI and MSI-x delivery by posted interrupts"
> > +	depends on X86_X2APIC && X86_64 && IRQ_REMAP  
> 
> Does posted_msi really depend on x2APIC? PID.NDST encoding supports both
> xAPIC and x2APIC.
No, posted_msi works with xAPIC as well. I just fixed a bug in the NDST
xAPIC encoding; the fix will be in v2.

I was thinking of the performance advantage of x2APIC, but you are right
that they are orthogonal.

> If posted_msi imposes a more stringent requirement, I think it deserves an
> explanation in this patch's description.
> And, X86_X2APIC already depends on IRQ_REMAP, can we just list one of
> them here?
Will drop the X2APIC dependency.
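
So the depends line would presumably become something like this (a sketch
of the v2 change, keeping the 64-bit and remapping requirements):

	config X86_POSTED_MSI
		bool "Enable MSI and MSI-x delivery by posted interrupts"
		depends on X86_64 && IRQ_REMAP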

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-04-05 15:50 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-26 23:42 [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jacob Pan
2024-01-26 23:42 ` [PATCH 01/15] x86/irq: Move posted interrupt descriptor out of vmx code Jacob Pan
2024-01-26 23:42 ` [PATCH 02/15] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
2024-01-26 23:42 ` [PATCH 03/15] x86/irq: Use bitfields exclusively in posted interrupt descriptor Jacob Pan
2024-01-31  1:48   ` Sean Christopherson
2024-02-06  0:40     ` Jacob Pan
2024-01-26 23:42 ` [PATCH 04/15] x86/irq: Add a Kconfig option for posted MSI Jacob Pan
2024-04-05  2:28   ` Robert Hoo
2024-04-05 15:54     ` Jacob Pan
2024-01-26 23:42 ` [PATCH 05/15] x86/irq: Reserve a per CPU IDT vector for posted MSIs Jacob Pan
2024-04-04 13:38   ` Robert Hoo
2024-04-04 17:17     ` Jacob Pan
2024-01-26 23:42 ` [PATCH 06/15] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
2024-02-13 19:44   ` Jacob Pan
2024-01-26 23:42 ` [PATCH 07/15] x86/irq: Add accessors for " Jacob Pan
2024-01-26 23:42 ` [PATCH 08/15] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
2024-01-26 23:42 ` [PATCH 09/15] x86/irq: Install posted MSI notification handler Jacob Pan
2024-03-29  7:32   ` Zeng Guang
2024-04-03  2:43     ` Jacob Pan
2024-01-26 23:42 ` [PATCH 10/15] x86/irq: Factor out common code for checking pending interrupts Jacob Pan
2024-01-26 23:42 ` [PATCH 11/15] x86/irq: Extend checks for pending vectors to posted interrupts Jacob Pan
2024-01-26 23:42 ` [PATCH 12/15] iommu/vt-d: Make posted MSI an opt-in cmdline option Jacob Pan
2024-01-26 23:42 ` [PATCH 13/15] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
2024-01-26 23:42 ` [PATCH 14/15] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
2024-01-26 23:42 ` [PATCH 15/15] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
2024-02-08 15:34 ` [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI Jens Axboe
2024-02-09 17:43   ` Jacob Pan
2024-02-09 20:31     ` Jens Axboe
2024-02-12 18:27       ` Jacob Pan
2024-02-12 18:36         ` Jens Axboe
2024-02-12 20:13           ` Jacob Pan
2024-02-13  1:10           ` Jacob Pan
2024-04-04 13:45 ` Robert Hoo
2024-04-04 17:37   ` Jacob Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).