* [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
@ 2022-12-22 7:26 korantwork
2022-12-22 9:15 ` Jonathan Derrick
0 siblings, 1 reply; 18+ messages in thread
From: korantwork @ 2022-12-22 7:26 UTC (permalink / raw)
To: nirmal.patel, jonathan.derrick, lpieralisi
Cc: linux-pci, linux-kernel, Xinghui Li
From: Xinghui Li <korantli@tencent.com>
Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
disabled VMD MSI-X remapping to optimize PCI performance. However,
this feature severely degrades performance in multi-disk
situations.
In an FIO 4K random read test, we tested 1 disk on 1 CPU.
With MSI-X remapping disabled:
read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
io=1354GiB (1454GB), run=300001-300001msec
With MSI-X remapping enabled:
read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
io=1340GiB (1438GB), run=300001-300001msec
However, bypass mode can increase the interrupt handling cost on the CPU.
We tested 12 disks on 6 CPUs.
With MSI-X remapping disabled:
read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
io=644GiB (691GB), run=300001-300001msec
With MSI-X remapping enabled:
read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
io=1310GiB (1406GB), run=300005-300005msec
Signed-off-by: Xinghui Li <korantli@tencent.com>
---
drivers/pci/controller/vmd.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index e06e9f4fc50f..9f6e9324d67d 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
- VMD_FEAT_HAS_BUS_RESTRICTIONS |
- VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
+ VMD_FEAT_HAS_BUS_RESTRICTIONS,},
{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
VMD_FEAT_HAS_BUS_RESTRICTIONS |
--
2.39.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-22 7:26 [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller korantwork
@ 2022-12-22 9:15 ` Jonathan Derrick
2022-12-22 21:56 ` Keith Busch
2022-12-23 7:53 ` Xinghui Li
0 siblings, 2 replies; 18+ messages in thread
From: Jonathan Derrick @ 2022-12-22 9:15 UTC (permalink / raw)
To: korantwork, nirmal.patel, lpieralisi; +Cc: linux-pci, linux-kernel, Xinghui Li
On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> From: Xinghui Li <korantli@tencent.com>
>
> Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
> disabled VMD MSI-X remapping to optimize PCI performance. However,
> this feature severely degrades performance in multi-disk
> situations.
>
> In an FIO 4K random read test, we tested 1 disk on 1 CPU.
>
> With MSI-X remapping disabled:
> read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> io=1354GiB (1454GB), run=300001-300001msec
>
> With MSI-X remapping enabled:
> read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> io=1340GiB (1438GB), run=300001-300001msec
>
> However, bypass mode can increase the interrupt handling cost on the CPU.
> We tested 12 disks on 6 CPUs.
Well the bypass mode was made to improve performance where you have >4
drives so this is pretty surprising. With bypass mode disabled, VMD will
intercept and forward interrupts, increasing costs.
I think Nirmal would want to understand if there's some other factor
going on here.
>
> With MSI-X remapping disabled:
> read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
> READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
> io=644GiB (691GB), run=300001-300001msec
>
> With MSI-X remapping enabled:
> read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
> READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
> io=1310GiB (1406GB), run=300005-300005msec
>
> Signed-off-by: Xinghui Li <korantli@tencent.com>
> ---
> drivers/pci/controller/vmd.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index e06e9f4fc50f..9f6e9324d67d 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
> {PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
> - VMD_FEAT_HAS_BUS_RESTRICTIONS |
> - VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
> + VMD_FEAT_HAS_BUS_RESTRICTIONS,},
> {PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
> VMD_FEAT_HAS_BUS_RESTRICTIONS |
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-22 9:15 ` Jonathan Derrick
@ 2022-12-22 21:56 ` Keith Busch
2022-12-23 8:02 ` Xinghui Li
2022-12-23 7:53 ` Xinghui Li
1 sibling, 1 reply; 18+ messages in thread
From: Keith Busch @ 2022-12-22 21:56 UTC (permalink / raw)
To: Jonathan Derrick
Cc: korantwork, nirmal.patel, lpieralisi, linux-pci, linux-kernel,
Xinghui Li
On Thu, Dec 22, 2022 at 02:15:20AM -0700, Jonathan Derrick wrote:
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> >
> > However, the bypass mode could increase the interrupts costs in CPU.
> > We test 12 disks in the 6 CPU,
>
> Well the bypass mode was made to improve performance where you have >4
> drives so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.
>
> I think Nirmal would want to understand if there's some other factor
> going on here.
With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
context switching. Sounds like the non-bypass mode is aggregating and
spreading interrupts across the cores better, but there's probably some
cpu:drive count tipping point where performance favors the other way.
The fio jobs could also probably set their cpus_allowed differently to
get better performance in the bypass mode.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-22 21:56 ` Keith Busch
@ 2022-12-23 8:02 ` Xinghui Li
2022-12-27 22:32 ` Jonathan Derrick
0 siblings, 1 reply; 18+ messages in thread
From: Xinghui Li @ 2022-12-23 8:02 UTC (permalink / raw)
To: Keith Busch
Cc: Jonathan Derrick, nirmal.patel, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
Keith Busch <kbusch@kernel.org> 于2022年12月23日周五 05:56写道:
>
> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
> context switching. Sounds like the non-bypass mode is aggregating and
> spreading interrupts across the cores better, but there's probably some
> cpu:drive count tipping point where performance favors the other way.
We found that tuning the interrupt aggregation can also bring the
drive performance back to normal.
> The fio jobs could also probably set their cpus_allowed differently to
> get better performance in the bypass mode.
We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.
By the way, sorry for emailing twice; the previous one had a formatting problem.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-23 8:02 ` Xinghui Li
@ 2022-12-27 22:32 ` Jonathan Derrick
2022-12-28 2:19 ` Xinghui Li
0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Derrick @ 2022-12-27 22:32 UTC (permalink / raw)
To: Xinghui Li, Keith Busch
Cc: nirmal.patel, lpieralisi, linux-pci, linux-kernel, Xinghui Li
On 12/23/2022 2:02 AM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> 于2022年12月23日周五 05:56写道:
>>
>> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
>> context switching. Sounds like the non-bypass mode is aggregating and
>> spreading interrupts across the cores better, but there's probably some
>> cpu:drive count tipping point where performance favors the other way.
>
> We found that tuning the interrupt aggregation can also bring the
> drive performance back to normal.
>
>> The fio jobs could also probably set their cpus_allowed differently to
>> get better performance in the bypass mode.
>
> We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.
>
> By the way, sorry for emailing twice, the last one had the format problem.
The bypass mode should help in cases where drive irqs (e.g., nproc) exceed
VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
very few cpus for a Skylake system with that many drives, unless you mean you
are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
is effectively VMD-disabled, which points to other issues. Though I have also seen
much smaller interrupt aggregation benefits.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-27 22:32 ` Jonathan Derrick
@ 2022-12-28 2:19 ` Xinghui Li
2023-01-09 21:00 ` Jonathan Derrick
0 siblings, 1 reply; 18+ messages in thread
From: Xinghui Li @ 2022-12-28 2:19 UTC (permalink / raw)
To: Jonathan Derrick
Cc: Keith Busch, nirmal.patel, lpieralisi, linux-pci, linux-kernel,
Xinghui Li
Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月28日周三 06:32写道:
>
> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
> very few cpus for a Skylake system with that many drives, unless you mean you
> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
> is effectively VMD-disabled, which points to other issues. Though I have also seen
> much smaller interrupt aggregation benefits.
Firstly, I am sorry for my words misleading you. We tested 12 drives in total,
and each drive ran on 6 CPU cores with 8 jobs.
Secondly, I tried testing the drives with VMD disabled, and found the results
to be largely consistent with bypass mode. I suppose bypass mode just
"bypasses" the VMD controller.
Lastly, we found that in bypass mode the CPU is 91% idle, but in remapping
mode it is 78% idle, and bypass mode's context switches are much fewer than
remapping mode's. It seems the system is waiting for something in bypass mode.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-28 2:19 ` Xinghui Li
@ 2023-01-09 21:00 ` Jonathan Derrick
2023-01-10 12:28 ` Xinghui Li
0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Derrick @ 2023-01-09 21:00 UTC (permalink / raw)
To: Xinghui Li
Cc: Keith Busch, nirmal.patel, lpieralisi, linux-pci, linux-kernel,
Xinghui Li
As bypass mode seems to affect performance greatly depending on the specific
configuration, it may make sense to use a module parameter to control it.
I'd vote for it being in VMD mode (non-bypass) by default.
On 12/27/2022 7:19 PM, Xinghui Li wrote:
> Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月28日周三 06:32写道:
>>
>> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
>> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
>> very few cpus for a Skylake system with that many drives, unless you mean you
>> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
>> is effectively VMD-disabled, which points to other issues. Though I have also seen
>> much smaller interrupt aggregation benefits.
>
> Firstly, I am sorry for my words misleading you. We tested 12 drives in total,
> and each drive ran on 6 CPU cores with 8 jobs.
>
> Secondly, I tried testing the drives with VMD disabled, and found the results
> to be largely consistent with bypass mode. I suppose bypass mode just
> "bypasses" the VMD controller.
>
> Lastly, we found that in bypass mode the CPU is 91% idle, but in remapping
> mode it is 78% idle, and bypass mode's context switches are much fewer than
> remapping mode's. It seems the system is waiting for something in bypass mode.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-01-09 21:00 ` Jonathan Derrick
@ 2023-01-10 12:28 ` Xinghui Li
2023-02-06 12:45 ` Xinghui Li
0 siblings, 1 reply; 18+ messages in thread
From: Xinghui Li @ 2023-01-10 12:28 UTC (permalink / raw)
To: Jonathan Derrick
Cc: Keith Busch, nirmal.patel, lpieralisi, linux-pci, linux-kernel,
Xinghui Li
Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
>
> As the bypass mode seems to affect performance greatly depending on the specific configuration,
> it may make sense to use a moduleparam to control it
>
We found that each PCIe port can mount four drives. If we only test 1 or 2
drives on one PCIe port, drive performance is normal. Also, we
observed the interrupts in the different modes.
bypass:
.....
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
192, ACTIVE CPU : 192
disable:
......
2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
2022-12-28-12-05-56: LOC 163697 Local timer interrupts
2022-12-28-12-05-56: TLB 5465 TLB shootdowns
2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
192, ACTIVE CPU : 192
remapping:
2022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
......
2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
192, ACTIVE CPU : 192
From the results it is not difficult to see that in remapping mode the
interrupts come from the VMD, while in the other modes interrupts come from
the NVMe devices. Besides, we found that a port mounting 4 drives receives
far fewer total interrupts than a port mounting 1 or 2 drives.
NVMe 8 and 9 are mounted on one port; the other ports each mount 4 drives.
2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
......
2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> I'd vote for it being in VMD mode (non-bypass) by default.
I speculate that the VMD controller equalizes the interrupt load and acts
like a buffer, which improves NVMe performance. I am not sure about my
analysis, so I'd like to discuss it with the community.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-01-10 12:28 ` Xinghui Li
@ 2023-02-06 12:45 ` Xinghui Li
2023-02-06 18:11 ` Patel, Nirmal
0 siblings, 1 reply; 18+ messages in thread
From: Xinghui Li @ 2023-02-06 12:45 UTC (permalink / raw)
To: Jonathan Derrick
Cc: Keith Busch, nirmal.patel, lpieralisi, linux-pci, linux-kernel,
Xinghui Li
Friendly ping~
Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道:
>
> Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
> >
> > As the bypass mode seems to affect performance greatly depending on the specific configuration,
> > it may make sense to use a moduleparam to control it
> >
> We found that each pcie port can mount four drives. If we only test 2
> or 1 dirve of one pcie port,
> the performance of the drive performance will be normal. Also, we
> observed the interruptions in different modes.
> bypass:
> .....
> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
> 192, ACTIVE CPU : 192
> disable:
> ......
> 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
> 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
> 2022-12-28-12-05-56: LOC 163697 Local timer interrupts
> 2022-12-28-12-05-56: TLB 5465 TLB shootdowns
> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
> 192, ACTIVE CPU : 192
> remapping:
> 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
> 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
> 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
> ......
> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
> 192, ACTIVE CPU : 192
>
> From the result it is not difficult to find, in remapping mode the
> interruptions come from vmd.
> While in other modes, interrupts come from nvme devices. Besides, we
> found the port mounting
> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>
> 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
> 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
> 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
> 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
> 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
> 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
> 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
> 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
> ......
> 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
> 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
> 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
> 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> > I'd vote for it being in VMD mode (non-bypass) by default.
> I speculate that the vmd controller equalizes the interrupt load and
> acts like a buffer,
> which improves the performance of nvme. I am not sure about my
> analysis. So, I'd like
> to discuss it with the community.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-06 12:45 ` Xinghui Li
@ 2023-02-06 18:11 ` Patel, Nirmal
2023-02-06 18:28 ` Keith Busch
0 siblings, 1 reply; 18+ messages in thread
From: Patel, Nirmal @ 2023-02-06 18:11 UTC (permalink / raw)
To: Xinghui Li, Jonathan Derrick
Cc: Keith Busch, lpieralisi, linux-pci, linux-kernel, Xinghui Li
On 2/6/2023 5:45 AM, Xinghui Li wrote:
> Friendly ping~
>
> Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道:
>> Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
>>> As the bypass mode seems to affect performance greatly depending on the specific configuration,
>>> it may make sense to use a moduleparam to control it
>>>
>> We found that each pcie port can mount four drives. If we only test 2
>> or 1 dirve of one pcie port,
>> the performance of the drive performance will be normal. Also, we
>> observed the interruptions in different modes.
>> bypass:
>> .....
>> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
>> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
>> 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
>> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
>> 192, ACTIVE CPU : 192
>> disable:
>> ......
>> 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
>> 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
>> 2022-12-28-12-05-56: LOC 163697 Local timer interrupts
>> 2022-12-28-12-05-56: TLB 5465 TLB shootdowns
>> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
>> 192, ACTIVE CPU : 192
>> remapping:
>> 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
>> 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
>> 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
>> ......
>> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
>> 192, ACTIVE CPU : 192
>>
>> From the result it is not difficult to find, in remapping mode the
>> interruptions come from vmd.
>> While in other modes, interrupts come from nvme devices. Besides, we
>> found the port mounting
>> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
>> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>>
>> 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
>> 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
>> 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
>> 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
>> 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
>> 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
>> 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
>> 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
>> ......
>> 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
>> 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
>> 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
>> 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
>> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
>> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
>>> I'd vote for it being in VMD mode (non-bypass) by default.
>> I speculate that the vmd controller equalizes the interrupt load and
>> acts like a buffer,
>> which improves the performance of nvme. I am not sure about my
>> analysis. So, I'd like
>> to discuss it with the community.
I like the idea of a module parameter to allow switching between the modes,
but keep MSI-X remapping enabled (non-bypass) by default.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-06 18:11 ` Patel, Nirmal
@ 2023-02-06 18:28 ` Keith Busch
2023-02-07 3:18 ` Xinghui Li
0 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2023-02-06 18:28 UTC (permalink / raw)
To: Patel, Nirmal
Cc: Xinghui Li, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> I like the idea of module parameter to allow switching between the modes
> but keep MSI remapping enabled (non-bypass) by default.
Isn't there a more programmatic way to go about selecting the best option at
runtime? I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-06 18:28 ` Keith Busch
@ 2023-02-07 3:18 ` Xinghui Li
2023-02-07 20:32 ` Patel, Nirmal
0 siblings, 1 reply; 18+ messages in thread
From: Xinghui Li @ 2023-02-07 3:18 UTC (permalink / raw)
To: Keith Busch
Cc: Patel, Nirmal, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>
> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> > I like the idea of module parameter to allow switching between the modes
> > but keep MSI remapping enabled (non-bypass) by default.
>
> Isn't there a more programatic way to go about selecting the best option at
> runtime?
Do you mean that the operating mode is automatically selected by
detecting the number of devices and CPUs instead of being set
manually?
> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
For this situation, my speculation is that the PCIe ports are
over-mounted, and that it is not just because of the CPU-to-drive ratio.
We considered designing an online node, because we were concerned that
I/O of different chunk sizes would suit different MSI-X modes.
Personally, I think the logic may get complicated if the judgment is made
programmatically.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-07 3:18 ` Xinghui Li
@ 2023-02-07 20:32 ` Patel, Nirmal
2023-02-09 12:05 ` Xinghui Li
2023-02-09 23:05 ` Keith Busch
0 siblings, 2 replies; 18+ messages in thread
From: Patel, Nirmal @ 2023-02-07 20:32 UTC (permalink / raw)
To: Xinghui Li, Keith Busch
Cc: Jonathan Derrick, lpieralisi, linux-pci, linux-kernel, Xinghui Li
On 2/6/2023 8:18 PM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
>>> I like the idea of module parameter to allow switching between the modes
>>> but keep MSI remapping enabled (non-bypass) by default.
>> Isn't there a more programatic way to go about selecting the best option at
>> runtime?
> Do you mean that the operating mode is automatically selected by
> detecting the number of devices and CPUs instead of being set
> manually?
>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> For this situation, My speculation is that the PCIE nodes are
> over-mounted and not just because of the CPU to Drive ratio.
> We considered designing online nodes, because we were concerned that
> the IO of different chunk sizes would adapt to different MSI-X modes.
> I privately think that it may be logically complicated if programmatic
> judgments are made.
Also, newer CPUs have more MSI-X vectors (128), which means we can still have
better performance without bypass. It would be better if users
can choose the module parameter based on their requirements. Thanks.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-07 20:32 ` Patel, Nirmal
@ 2023-02-09 12:05 ` Xinghui Li
2023-02-09 23:05 ` Keith Busch
1 sibling, 0 replies; 18+ messages in thread
From: Xinghui Li @ 2023-02-09 12:05 UTC (permalink / raw)
To: Patel, Nirmal
Cc: Keith Busch, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
Patel, Nirmal <nirmal.patel@linux.intel.com> 于2023年2月8日周三 04:32写道:
>
> Also, newer CPUs have more MSI-X vectors (128), which means we can still have
> better performance without bypass. It would be better if users
> can choose the module parameter based on their requirements. Thanks.
>
All right, I will respin the patch as v2 with the online-node version later.
Thanks
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-07 20:32 ` Patel, Nirmal
2023-02-09 12:05 ` Xinghui Li
@ 2023-02-09 23:05 ` Keith Busch
2023-02-09 23:57 ` Patel, Nirmal
1 sibling, 1 reply; 18+ messages in thread
From: Keith Busch @ 2023-02-09 23:05 UTC (permalink / raw)
To: Patel, Nirmal
Cc: Xinghui Li, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> > Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
> >> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> > For this situation, My speculation is that the PCIE nodes are
> > over-mounted and not just because of the CPU to Drive ratio.
> > We considered designing online nodes, because we were concerned that
> > the IO of different chunk sizes would adapt to different MSI-X modes.
> > I privately think that it may be logically complicated if programmatic
> > judgments are made.
>
> Also newer CPUs have more MSIx (128) which means we can still have
> better performance without bypass. It would be better if user have
> can chose module parameter based on their requirements. Thanks.
So what? More vectors just pushes the threshold to when bypass becomes
relevant, which is exactly why I suggested it. There has to be an empirical
answer to when bypass beats muxing. Why do you want a user tunable if there's a
verifiable and automated better choice?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-09 23:05 ` Keith Busch
@ 2023-02-09 23:57 ` Patel, Nirmal
2023-02-10 0:47 ` Keith Busch
0 siblings, 1 reply; 18+ messages in thread
From: Patel, Nirmal @ 2023-02-09 23:57 UTC (permalink / raw)
To: Keith Busch
Cc: Xinghui Li, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
On 2/9/2023 4:05 PM, Keith Busch wrote:
> On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
>> On 2/6/2023 8:18 PM, Xinghui Li wrote:
>>> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
>>> For this situation, My speculation is that the PCIE nodes are
>>> over-mounted and not just because of the CPU to Drive ratio.
>>> We considered designing online nodes, because we were concerned that
>>> the IO of different chunk sizes would adapt to different MSI-X modes.
>>> I privately think that it may be logically complicated if programmatic
>>> judgments are made.
>> Also newer CPUs have more MSIx (128) which means we can still have
>> better performance without bypass. It would be better if user have
>> can chose module parameter based on their requirements. Thanks.
> So what? More vectors just pushes the threshold to when bypass becomes
> relevant, which is exactly why I suggested it. There has to be an empirical
> answer to when bypass beats muxing. Why do you want a user tunable if there's a
> verifiable and automated better choice?
Makes sense about the automated choice. I am not sure what the exact
tipping point is. The commit message includes only two cases: one with 1 drive
and 1 CPU, and a second with 12 drives and 6 CPUs. Also, performance gets
worse from 8 drives to 12 drives.
One of the previous comments also mentioned FIO changing
cpus_allowed; will there be an issue when the VMD driver decides to bypass
the remapping during boot, but the FIO job changes cpus_allowed?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2023-02-09 23:57 ` Patel, Nirmal
@ 2023-02-10 0:47 ` Keith Busch
0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2023-02-10 0:47 UTC (permalink / raw)
To: Patel, Nirmal
Cc: Xinghui Li, Jonathan Derrick, lpieralisi, linux-pci,
linux-kernel, Xinghui Li
On Thu, Feb 09, 2023 at 04:57:59PM -0700, Patel, Nirmal wrote:
> On 2/9/2023 4:05 PM, Keith Busch wrote:
> > On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> >> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> >>> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
> >>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> >>> For this situation, my speculation is that the PCIe nodes are
> >>> over-mounted, and not just because of the CPU-to-drive ratio.
> >>> We considered making the mode switchable online, because we were
> >>> concerned that I/O of different chunk sizes would suit different
> >>> MSI-X modes. Privately, I think it may get logically complicated
> >>> if the judgment is made programmatically.
> >> Also, newer CPUs have more MSI-X vectors (128), which means we can
> >> still have better performance without bypass. It would be better if
> >> users could choose a module parameter based on their requirements. Thanks.
> > So what? More vectors just pushes the threshold to when bypass becomes
> > relevant, which is exactly why I suggested it. There has to be an empirical
> > answer to when bypass beats muxing. Why do you want a user tunable if there's a
> > verifiable and automated better choice?
>
> Makes sense about the automated choice. I am not sure what the exact
> tipping point is. The commit message includes only two cases: one with
> 1 drive and 1 CPU, and a second with 12 drives and 6 CPUs. Also,
> performance gets worse from 8 drives to 12 drives.
That configuration's storage performance overwhelms the CPU with interrupt
context switching. That problem probably inverts when your active CPU count
exceeds your VMD vectors because you'll be funnelling more interrupts into
fewer CPUs, leaving other CPUs idle.
> One of the previous comments also mentioned something about FIO changing
> cpus_allowed; will there be an issue if the VMD driver decides to bypass
> the remapping during boot-up, but the FIO job then changes cpus_allowed?
No. Bypass mode uses managed interrupts for your nvme child devices, which sets
the best possible affinity.
* Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
2022-12-22 9:15 ` Jonathan Derrick
2022-12-22 21:56 ` Keith Busch
@ 2022-12-23 7:53 ` Xinghui Li
1 sibling, 0 replies; 18+ messages in thread
From: Xinghui Li @ 2022-12-23 7:53 UTC (permalink / raw)
To: Jonathan Derrick
Cc: nirmal.patel, lpieralisi, linux-pci, linux-kernel, Xinghui Li
Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月22日周四 17:15写道:
>
>
>
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> > From: Xinghui Li <korantli@tencent.com>
> >
> > Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
> > disabled VMD MSI-X remapping to optimize PCI performance. However,
> > this feature severely degraded performance in multi-disk situations.
> >
> > In an FIO 4K random test, we tested 1 disk on 1 CPU.
> >
> > When MSI-X remapping is disabled:
> > read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> > READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> > io=1354GiB (1454GB), run=300001-300001msec
> >
> > When MSI-X remapping is not disabled:
> > read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> > READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> > io=1340GiB (1438GB), run=300001-300001msec
> >
> > However, bypass mode can increase interrupt costs on the CPU.
> > We tested 12 disks on 6 CPUs.
> Well, the bypass mode was made to improve performance where you have >4
> drives, so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.
We also found that the more drives we tested, the more severe the
performance degradation. When we tested 8 drives on 6 CPUs, there was
about a 30% drop.
> I think Nirmal would want to understand if there's some other factor
> going on here.
I also agree with this. The tested server uses no I/O scheduler. We
tested on the same server; the tested drives are Samsung Gen-4 NVMe.
Is there anything else you are worried might affect the test results?
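For context, the 4K random-read workload described in this thread might correspond to an fio job along these lines. This is an assumed reconstruction for illustration only; the device path, I/O engine, queue depth, and CPU mask are guesses, not taken from the thread.

```ini
; hypothetical fio job: 4K random read, time-based, pinned to 6 CPUs
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
runtime=300
time_based=1
group_reporting=1
cpus_allowed=0-5

[nvme0]
filename=/dev/nvme0n1
; additional [nvmeN] sections would be added per tested drive
```

With `cpus_allowed` constraining the job to fewer CPUs than drives, this is the regime where the thread reports bypass mode losing to remapped (muxed) interrupts.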
end of thread, other threads:[~2023-02-10 0:47 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-22 7:26 [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller korantwork
2022-12-22 9:15 ` Jonathan Derrick
2022-12-22 21:56 ` Keith Busch
2022-12-23 8:02 ` Xinghui Li
2022-12-27 22:32 ` Jonathan Derrick
2022-12-28 2:19 ` Xinghui Li
2023-01-09 21:00 ` Jonathan Derrick
2023-01-10 12:28 ` Xinghui Li
2023-02-06 12:45 ` Xinghui Li
2023-02-06 18:11 ` Patel, Nirmal
2023-02-06 18:28 ` Keith Busch
2023-02-07 3:18 ` Xinghui Li
2023-02-07 20:32 ` Patel, Nirmal
2023-02-09 12:05 ` Xinghui Li
2023-02-09 23:05 ` Keith Busch
2023-02-09 23:57 ` Patel, Nirmal
2023-02-10 0:47 ` Keith Busch
2022-12-23 7:53 ` Xinghui Li