linux-pci.vger.kernel.org archive mirror
* How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
@ 2021-11-05 11:56 Paul Menzel
  2021-11-05 12:04 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)s Paul Menzel
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Paul Menzel @ 2021-11-05 11:56 UTC (permalink / raw)
  To: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas
  Cc: iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Dear Linux folks,


On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes 
almost five seconds to initialize PCI. According to the timestamps, 1.5 
s are from assigning the PCI devices to the 142 IOMMU groups.

```
$ lspci | wc -l
281
$ dmesg
[…]
[    2.918411] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
[    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
[    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME PCIeCapability]
[    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[    3.010511] PCI host bridge to bus 0000:00
[…]
[    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
[    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
[    6.239906] pnp: PnP ACPI: found 6 devices
[…]
[    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
[    6.996063] PCI: CLS 0 bytes, default 64
[    7.000008] Trying to unpack rootfs image as initramfs...
[    7.065281] Freeing initrd memory: 5136K
[…]
[    7.079098] DMAR: dmar7: Using Queued invalidation
[    7.083983] pci 0000:00:00.0: Adding to iommu group 0
[…]
[    8.537808] pci 0000:d7:17.1: Adding to iommu group 141
[    8.571191] DMAR: Intel(R) Virtualization Technology for Directed I/O
[    8.577618] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[…]
```

Is there anything that could be done to reduce the time?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)s
  2021-11-05 11:56 How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Paul Menzel
@ 2021-11-05 12:04 ` Paul Menzel
  2021-11-05 18:53 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Bjorn Helgaas
  2021-11-08 17:18 ` Krzysztof Wilczyński
  2 siblings, 0 replies; 9+ messages in thread
From: Paul Menzel @ 2021-11-05 12:04 UTC (permalink / raw)
  To: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas
  Cc: iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Dear Linux folks,


Am 05.11.21 um 12:56 schrieb Paul Menzel:

> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes 
> almost five seconds to initialize PCI. According to the timestamps, 1.5 
> s are from assigning the PCI devices to the 142 IOMMU groups.
> 
> ```
> $ lspci | wc -l
> 281
> $ dmesg
> […]
> [    2.918411] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
> [    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
> [    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
> [    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> [    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
> [    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME PCIeCapability]
> [    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, 
> using BIOS configuration
> [    3.010511] PCI host bridge to bus 0000:00
> […]
> [    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
> [    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
> [    6.239906] pnp: PnP ACPI: found 6 devices
> […]
> [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
> [    6.996063] PCI: CLS 0 bytes, default 64
> [    7.000008] Trying to unpack rootfs image as initramfs...
> [    7.065281] Freeing initrd memory: 5136K
> […]
> [    7.079098] DMAR: dmar7: Using Queued invalidation
> [    7.083983] pci 0000:00:00.0: Adding to iommu group 0
> […]
> [    8.537808] pci 0000:d7:17.1: Adding to iommu group 141
> [    8.571191] DMAR: Intel(R) Virtualization Technology for Directed I/O
> [    8.577618] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> […]
> ```
> 
> Is there anything that could be done to reduce the time?

I created an issue at the Kernel.org Bugzilla, and attached the output 
of `dmesg` there [1].


Kind regards,

Paul


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=214953

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-05 11:56 How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Paul Menzel
  2021-11-05 12:04 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)s Paul Menzel
@ 2021-11-05 18:53 ` Bjorn Helgaas
  2021-11-06 10:42   ` Paul Menzel
  2021-11-08 17:18 ` Krzysztof Wilczyński
  2 siblings, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2021-11-05 18:53 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas,
	iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

On Fri, Nov 05, 2021 at 12:56:09PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
> almost five seconds to initialize PCI. According to the timestamps, 1.5 s
> are from assigning the PCI devices to the 142 IOMMU groups.
> 
> ```
> $ lspci | wc -l
> 281
> $ dmesg
> […]
> [    2.918411] PCI: Using host bridge windows from ACPI; if necessary, use
> "pci=nocrs" and report a bug
> [    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
> [    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
> [    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
> ClockPM Segments MSI HPX-Type3]
> [    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
> [    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME PCIeCapability]
> [    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using
> BIOS configuration
> [    3.010511] PCI host bridge to bus 0000:00
> […]
> [    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
> [    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
> [    6.239906] pnp: PnP ACPI: found 6 devices

For ~280 PCI devices, (6.24-2.92)/280 = 0.012 s/dev.  On my laptop I
have about (.66-.37)/36 = 0.008 s/dev (on v5.4), so about the same
ballpark.
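
For reference, that per-device figure can be recomputed from any log. A
minimal sketch, assuming printk timestamps are enabled and that the "PCI
host bridge" and "pnp: PnP ACPI" lines bound the enumeration window as
they do above:

```
# Average PCI enumeration time per device, estimated from dmesg.
# Assumes CONFIG_PRINTK_TIME=y and that both marker lines exist in the log.
start=$(dmesg | grep -m1 'PCI host bridge to bus' | sed 's/^\[ *\([0-9.]*\)\].*/\1/')
end=$(dmesg | grep -m1 'pnp: PnP ACPI: found' | sed 's/^\[ *\([0-9.]*\)\].*/\1/')
ndev=$(lspci | wc -l)
awk -v s="$start" -v e="$end" -v n="$ndev" 'BEGIN { printf "%.4f s/dev\n", (e - s) / n }'
```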

Faster would always be better, of course.  I assume this is not really
a regression?

> [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device
> [8086:2034]
> [    6.996063] PCI: CLS 0 bytes, default 64
> [    7.000008] Trying to unpack rootfs image as initramfs...
> [    7.065281] Freeing initrd memory: 5136K
> […]
> [    7.079098] DMAR: dmar7: Using Queued invalidation
> [    7.083983] pci 0000:00:00.0: Adding to iommu group 0
> […]
> [    8.537808] pci 0000:d7:17.1: Adding to iommu group 141

I don't have this iommu stuff turned on and don't know what's
happening here.

> Is there anything that could be done to reduce the time?
> 
> 
> Kind regards,
> 
> Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-05 18:53 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Bjorn Helgaas
@ 2021-11-06 10:42   ` Paul Menzel
  2021-11-09 15:31     ` Robin Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Menzel @ 2021-11-06 10:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas,
	iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Dear Bjorn,


Thank you for your quick reply.


Am 05.11.21 um 19:53 schrieb Bjorn Helgaas:
> On Fri, Nov 05, 2021 at 12:56:09PM +0100, Paul Menzel wrote:

>> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
>> almost five seconds to initialize PCI. According to the timestamps, 1.5 s
>> are from assigning the PCI devices to the 142 IOMMU groups.
>>
>> ```
>> $ lspci | wc -l
>> 281
>> $ dmesg
>> […]
>> [    2.918411] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
>> [    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
>> [    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
>> [    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
>> [    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
>> [    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME PCIeCapability]
>> [    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
>> [    3.010511] PCI host bridge to bus 0000:00
>> […]
>> [    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
>> [    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
>> [    6.239906] pnp: PnP ACPI: found 6 devices
> 
> For ~280 PCI devices, (6.24-2.92)/280 = 0.012 s/dev.  On my laptop I
> have about (.66-.37)/36 = 0.008 s/dev (on v5.4), so about the same
> ballpark.

Though if it was on average 0.008 s/dev here, around a second could be 
saved.

The integrated Matrox G200eW3 graphics controller (102b:0536) and the 
two Broadcom NetXtreme BCM5720 2-port Gigabit Ethernet PCIe cards 
(14e4:165f) take 150 ms to be initialized.

     [    3.454409] pci 0000:03:00.0: [102b:0536] type 00 class 0x030000
     [    3.460411] pci 0000:03:00.0: reg 0x10: [mem 0x91000000-0x91ffffff pref]
     [    3.467403] pci 0000:03:00.0: reg 0x14: [mem 0x92808000-0x9280bfff]
     [    3.473402] pci 0000:03:00.0: reg 0x18: [mem 0x92000000-0x927fffff]
     [    3.479437] pci 0000:03:00.0: BAR 0: assigned to efifb

The timestamps of consecutive lines differ by around 6 ms. Could printing 
the messages to the console (VGA) hold this up? (Log line 373 to line 911 
gives (6.24 s - 2.92 s)/(538 lines) = (3.32 s)/(538 lines) ≈ 6 ms per line.)

     [    3.484480] pci 0000:02:00.0: PCI bridge to [bus 03]
     [    3.489401] pci 0000:02:00.0:   bridge window [mem 0x92000000-0x928fffff]
     [    3.496398] pci 0000:02:00.0:   bridge window [mem 0x91000000-0x91ffffff 64bit pref]
     [    3.504446] pci 0000:04:00.0: [14e4:165f] type 00 class 0x020000
     [    3.510415] pci 0000:04:00.0: reg 0x10: [mem 0x92e30000-0x92e3ffff 64bit pref]
     [    3.517408] pci 0000:04:00.0: reg 0x18: [mem 0x92e40000-0x92e4ffff 64bit pref]
     [    3.524407] pci 0000:04:00.0: reg 0x20: [mem 0x92e50000-0x92e5ffff 64bit pref]
     [    3.532402] pci 0000:04:00.0: reg 0x30: [mem 0xfffc0000-0xffffffff pref]
     [    3.538483] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
     [    3.544437] pci 0000:04:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:1c.5 (capable of 8.000 Gb/s with 5.0 GT/s PCIe x2 link)
     [    3.559493] pci 0000:04:00.1: [14e4:165f] type 00 class 0x020000
     [    3.559493] pci 0000:04:00.1: [14e4:165f] type 00 class 0x020000

Here is a 15 ms delay.

     [    3.565415] pci 0000:04:00.1: reg 0x10: [mem 0x92e00000-0x92e0ffff 64bit pref]
     [    3.573407] pci 0000:04:00.1: reg 0x18: [mem 0x92e10000-0x92e1ffff 64bit pref]
     [    3.580407] pci 0000:04:00.1: reg 0x20: [mem 0x92e20000-0x92e2ffff 64bit pref]
     [    3.587402] pci 0000:04:00.1: reg 0x30: [mem 0xfffc0000-0xffffffff pref]
     [    3.594483] pci 0000:04:00.1: PME# supported from D0 D3hot D3cold
     [    3.600502] pci 0000:00:1c.5: PCI bridge to [bus 04]

Can the 6 ms – also seen on your system – be explained by the PCI 
specification? Seeing how fast PCI is nowadays, 6 ms sounds like a long 
time. ;-)
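
A minimal sketch of how that per-line spacing could be measured rather than
estimated, assuming a util-linux `dmesg` for `--show-delta` (the awk variant
works on any saved log):

```
# Print the time elapsed since the previous message next to each line.
dmesg --show-delta | grep 'pci 0000:'

# Alternatively, list the largest gaps between consecutive PCI lines.
dmesg | awk -F'[][]' '/\] pci 0000:/ {
        t = $2 + 0
        if (prev) printf "%8.3f ms  %s\n", (t - prev) * 1000, $0
        prev = t
}' | sort -rn | head
```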

> Faster would always be better, of course.  I assume this is not really
> a regression?

Correct, as far as I know, this is not a regression.

>> [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
>> [    6.996063] PCI: CLS 0 bytes, default 64
>> [    7.000008] Trying to unpack rootfs image as initramfs...
>> [    7.065281] Freeing initrd memory: 5136K

The PCI resource assignment(?) also seems to take 670 ms:

     [    6.319656] pci 0000:04:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
     […]
     [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]

>> […]
>> [    7.079098] DMAR: dmar7: Using Queued invalidation
>> [    7.083983] pci 0000:00:00.0: Adding to iommu group 0
>> […]
>> [    8.537808] pci 0000:d7:17.1: Adding to iommu group 141
> 
> I don't have this iommu stuff turned on and don't know what's
> happening here.

There is a lock in `iommu_group_add_device()` in `drivers/iommu/iommu.c`:

         mutex_lock(&group->mutex);
         list_add_tail(&device->list, &group->devices);
         if (group->domain && !iommu_is_attach_deferred(group->domain, dev))
                 ret = __iommu_attach_device(group->domain, dev);
         mutex_unlock(&group->mutex);

No idea if it’s related. Unfortunately, it’s a production system, so I 
can’t do any debugging. (Maybe `initcall_debug` could give some insight.) 
Maybe the IOMMU developers can explain it even without that. Could the 
IOMMU group assignment be done in parallel?
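
A minimal sketch of the `initcall_debug` route mentioned above, assuming the
machine can be booted once with that parameter; the kernel then logs one
"initcall ... returned ... after N usecs" line per initcall, which can be
sorted to see where the time actually goes:

```
# After one boot with initcall_debug on the kernel command line,
# list the 20 slowest initcalls by their reported duration in usecs.
dmesg | awk '/initcall .* usecs/ { print $(NF-1), $0 }' | sort -rn | head -20
```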

>> Is there anything that could be done to reduce the time?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-05 11:56 How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Paul Menzel
  2021-11-05 12:04 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)s Paul Menzel
  2021-11-05 18:53 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Bjorn Helgaas
@ 2021-11-08 17:18 ` Krzysztof Wilczyński
  2021-11-09 20:25   ` Paul Menzel
  2 siblings, 1 reply; 9+ messages in thread
From: Krzysztof Wilczyński @ 2021-11-08 17:18 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas,
	iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Hi Paul,

> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
> almost five seconds to initialize PCI. According to the timestamps, 1.5 s
> are from assigning the PCI devices to the 142 IOMMU groups.
[...]
> Is there anything that could be done to reduce the time?

I am curious - why is this a problem?  Are you power-cycling your servers
so often that the cumulative time spent enumerating PCI devices and later
adding them to IOMMU groups becomes a problem?

I am simply wondering why you singled out the PCI enumeration as slow in
particular, especially given that large server hardware tends to have
(most of the time, as per my experience) a rather long initialisation
time, either from being powered off or after being power cycled.  It can
take a while before the actual operating system itself starts.

We talked about this briefly with Bjorn, and there might be an option to
add some caching, as we suspect that the culprit here is doing a PCI
configuration space read for each device, which can be slow on some
platforms.

However, we would need to profile this to get some quantitative data to see
whether doing anything would even be worthwhile.  It would definitely help
us understand better where the bottlenecks really are and of what magnitude.

I personally don't have access to hardware as large as the system you have
access to, thus I was wondering whether you would have some time, and be
willing, to profile this for us on the hardware you have.

Let me know what you think.

	Krzysztof

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-06 10:42   ` Paul Menzel
@ 2021-11-09 15:31     ` Robin Murphy
  2021-11-09 20:32       ` Paul Menzel
  0 siblings, 1 reply; 9+ messages in thread
From: Robin Murphy @ 2021-11-09 15:31 UTC (permalink / raw)
  To: Paul Menzel, Bjorn Helgaas
  Cc: linux-pci, x86, LKML, iommu, Ingo Molnar, Borislav Petkov,
	Bjorn Helgaas, Thomas Gleixner

On 2021-11-06 10:42, Paul Menzel wrote:
> Dear Bjorn,
> 
> 
> Thank you for your quick reply.
> 
> 
> Am 05.11.21 um 19:53 schrieb Bjorn Helgaas:
>> On Fri, Nov 05, 2021 at 12:56:09PM +0100, Paul Menzel wrote:
> 
>>> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
>>> almost five seconds to initialize PCI. According to the timestamps, 
>>> 1.5 s
>>> are from assigning the PCI devices to the 142 IOMMU groups.
>>>
>>> ```
>>> $ lspci | wc -l
>>> 281
>>> $ dmesg
>>> […]
>>> [    2.918411] PCI: Using host bridge windows from ACPI; if 
>>> necessary, use "pci=nocrs" and report a bug
>>> [    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
>>> [    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
>>> [    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig 
>>> ASPM ClockPM Segments MSI HPX-Type3]
>>> [    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
>>> [    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME 
>>> PCIeCapability]
>>> [    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, 
>>> using BIOS configuration
>>> [    3.010511] PCI host bridge to bus 0000:00
>>> […]
>>> [    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
>>> [    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 
>>> (active)
>>> [    6.239906] pnp: PnP ACPI: found 6 devices
>>
>> For ~280 PCI devices, (6.24-2.92)/280 = 0.012 s/dev.  On my laptop I
>> have about (.66-.37)/36 = 0.008 s/dev (on v5.4), so about the same
>> ballpark.
> 
> Though if it was on average 0.008 s/dev here, around a second could be 
> saved.
> 
> The integrated Matrox G200eW3 graphics controller (102b:0536) and the 
> two Broadcom NetXtreme BCM5720 2-port Gigabit Ethernet PCIe cards 
> (14e4:165f) take 150 ms to be initialized.
> 
>      [    3.454409] pci 0000:03:00.0: [102b:0536] type 00 class 0x030000
>      [    3.460411] pci 0000:03:00.0: reg 0x10: [mem 
> 0x91000000-0x91ffffff pref]
>      [    3.467403] pci 0000:03:00.0: reg 0x14: [mem 0x92808000-0x9280bfff]
>      [    3.473402] pci 0000:03:00.0: reg 0x18: [mem 0x92000000-0x927fffff]
>      [    3.479437] pci 0000:03:00.0: BAR 0: assigned to efifb
> 
> The timestamp in each line differs by around 6 ms. Could printing the 
> messages to the console (VGA) hold this up (line 373 to line 911 makes 
> (6.24 s-2.92 s)/(538 lines) = (3.32 s)/(538 lines) = 6 ms)?
> 
>      [    3.484480] pci 0000:02:00.0: PCI bridge to [bus 03]
>      [    3.489401] pci 0000:02:00.0:   bridge window [mem 
> 0x92000000-0x928fffff]
>      [    3.496398] pci 0000:02:00.0:   bridge window [mem 
> 0x91000000-0x91ffffff 64bit pref]
>      [    3.504446] pci 0000:04:00.0: [14e4:165f] type 00 class 0x020000
>      [    3.510415] pci 0000:04:00.0: reg 0x10: [mem 
> 0x92e30000-0x92e3ffff 64bit pref]
>      [    3.517408] pci 0000:04:00.0: reg 0x18: [mem 
> 0x92e40000-0x92e4ffff 64bit pref]
>      [    3.524407] pci 0000:04:00.0: reg 0x20: [mem 
> 0x92e50000-0x92e5ffff 64bit pref]
>      [    3.532402] pci 0000:04:00.0: reg 0x30: [mem 
> 0xfffc0000-0xffffffff pref]
>      [    3.538483] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
>      [    3.544437] pci 0000:04:00.0: 4.000 Gb/s available PCIe 
> bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:1c.5 (capable of 
> 8.000 Gb/s with 5.0 GT/s PCIe x2 link)
>      [    3.559493] pci 0000:04:00.1: [14e4:165f] type 00 class 0x020000
> 
> Here is a 15 ms delay.
> 
>      [    3.565415] pci 0000:04:00.1: reg 0x10: [mem 
> 0x92e00000-0x92e0ffff 64bit pref]
>      [    3.573407] pci 0000:04:00.1: reg 0x18: [mem 
> 0x92e10000-0x92e1ffff 64bit pref]
>      [    3.580407] pci 0000:04:00.1: reg 0x20: [mem 
> 0x92e20000-0x92e2ffff 64bit pref]
>      [    3.587402] pci 0000:04:00.1: reg 0x30: [mem 
> 0xfffc0000-0xffffffff pref]
>      [    3.594483] pci 0000:04:00.1: PME# supported from D0 D3hot D3cold
>      [    3.600502] pci 0000:00:1c.5: PCI bridge to [bus 04]
> 
> Can the 6 ms – also from your system – be explained by the PCI 
> specification? Seeing how fast PCI nowadays is, 6 ms sounds like a long 
> time. ;-)
> 
>> Faster would always be better, of course.  I assume this is not really
>> a regression?
> 
>> Correct, as far as I know, this is not a regression.
> 
>>> [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device 
>>> [8086:2034]
>>> [    6.996063] PCI: CLS 0 bytes, default 64
>>> [    7.000008] Trying to unpack rootfs image as initramfs...
>>> [    7.065281] Freeing initrd memory: 5136K
> 
> The PCI resource assignment(?) also seems to take 670 ms:
> 
>      [    6.319656] pci 0000:04:00.0: can't claim BAR 6 [mem 
> 0xfffc0000-0xffffffff pref]: no compatible bridge window
>      […]
>      [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device 
> [8086:2034]
> 
>>> […]
>>> [    7.079098] DMAR: dmar7: Using Queued invalidation
>>> [    7.083983] pci 0000:00:00.0: Adding to iommu group 0
>>> […]
>>> [    8.537808] pci 0000:d7:17.1: Adding to iommu group 141
>>
>> I don't have this iommu stuff turned on and don't know what's
>> happening here.
> 
> There is a lock in `iommu_group_add_device()` in `drivers/iommu/iommu.c`:
> 
>          mutex_lock(&group->mutex);
>          list_add_tail(&device->list, &group->devices);
>          if (group->domain  && !iommu_is_attach_deferred(group->domain, 
> dev))
>                  ret = __iommu_attach_device(group->domain, dev);
>          mutex_unlock(&group->mutex);
> 
> No idea, if it’s related. Unfortunately, it’s a production system, so I 
> can’t do any debugging. (Maybe `initcall_debug` could give some 
> insight.) Maybe the IOMMU developers can explain it without it. Could 
> the IOMMU group assignment be done in parallel?

FWIW I'd expect that locking to be pretty much immaterial - many devices 
are getting their own uncontended groups, and callers of this tend to be 
serialised at a higher level anyway. iommu_probe_device() usually runs 
off the back of the device_add() notifier (where it could be that it's 
the only thing making noise in between something *else* being slow), but 
there is the special case where it gets replayed for all existing 
devices when the IOMMU driver registers itself - at a guess it seems 
like it may well be the latter case you're seeing, but either way 
there's not much to say without figuring out where the time is actually 
being spent (I don't suppose that machine has dynamic ftrace enabled?).

That said, setting up a new group isn't a completely insignificant 
amount of work, and 142 groups seems a lot - I'd have assumed that a 
system of that scale would be the kind of big server kit that takes 
several minutes to boot to the point of even starting the kernel anyway.

Robin.

>>> Is there anything that could be done to reduce the time?
> 
> 
> Kind regards,
> 
> Paul
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-08 17:18 ` Krzysztof Wilczyński
@ 2021-11-09 20:25   ` Paul Menzel
  2021-11-09 23:10     ` Krzysztof Wilczyński
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Menzel @ 2021-11-09 20:25 UTC (permalink / raw)
  To: Krzysztof Wilczyński
  Cc: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas,
	iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Dear Krzysztof,


Thank you for your reply.


Am 08.11.21 um 18:18 schrieb Krzysztof Wilczyński:

>> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
>> almost five seconds to initialize PCI. According to the timestamps, 1.5 s
>> are from assigning the PCI devices to the 142 IOMMU groups.
> [...]
>> Is there anything that could be done to reduce the time?
> 
> I am curious - why is this a problem?  Are you power-cycling your servers
> so often that the cumulative time spent enumerating PCI devices and later
> adding them to IOMMU groups becomes a problem?
> 
> I am simply wondering why you singled out the PCI enumeration as slow in
> particular, especially given that large server hardware tends to have
> (most of the time, as per my experience) a rather long initialisation
> time, either from being powered off or after being power cycled.  It can
> take a while before the actual operating system itself starts.

It’s not a problem per se, and more a pet peeve of mine. Systems get 
faster and faster, and boot times slower and slower. On desktop systems, 
it matters much more, with firmware like coreboot taking less than one 
second to initialize the hardware and pass control to the 
payload/operating system. If we are lucky, we are going to have servers 
with FLOSS firmware.

But already now, using kexec to reboot a system avoids the problems you 
pointed out on servers, and being able to reboot a system as quickly as 
possible lowers the bar for people to reboot systems more often, for 
example so that updates take effect.
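
For reference, a minimal kexec sketch; the kernel and initrd paths below are
placeholders for whatever is installed, and many distributions wrap the same
steps as `systemctl kexec`:

```
# Load the target kernel and jump straight into it, skipping firmware/POST.
kexec -l /boot/vmlinuz-5.10.70 \
      --initrd=/boot/initramfs-5.10.70.img \
      --reuse-cmdline
kexec -e
```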

> We talked about this briefly with Bjorn, and there might be an option to
> add some caching, as we suspect that the culprit here is doing a PCI
> configuration space read for each device, which can be slow on some
> platforms.
> 
> However, we would need to profile this to get some quantitative data to see
> whether doing anything would even be worthwhile.  It would definitely help
> us understand better where the bottlenecks really are and of what magnitude.
> 
> I personally don't have access to hardware as large as the system you have
> access to, thus I was wondering whether you would have some time, and be
> willing, to profile this for us on the hardware you have.
> 
> Let me know what you think.

Sounds good. I’d be willing to help. Note that I won’t have time before 
Wednesday next week, though.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-09 15:31     ` Robin Murphy
@ 2021-11-09 20:32       ` Paul Menzel
  0 siblings, 0 replies; 9+ messages in thread
From: Paul Menzel @ 2021-11-09 20:32 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-pci, x86, LKML, iommu, Ingo Molnar, Borislav Petkov,
	Bjorn Helgaas, Thomas Gleixner, Bjorn Helgaas

Dear Robin,


Thank you for your reply.

Am 09.11.21 um 16:31 schrieb Robin Murphy:
> On 2021-11-06 10:42, Paul Menzel wrote:

>> Am 05.11.21 um 19:53 schrieb Bjorn Helgaas:
>>> On Fri, Nov 05, 2021 at 12:56:09PM +0100, Paul Menzel wrote:
>>
>>>> On a PowerEdge T440/021KCD, BIOS 2.11.2 04/22/2021, Linux 5.10.70 takes
>>>> almost five seconds to initialize PCI. According to the timestamps, 
>>>> 1.5 s
>>>> are from assigning the PCI devices to the 142 IOMMU groups.
>>>>
>>>> ```
>>>> $ lspci | wc -l
>>>> 281
>>>> $ dmesg
>>>> […]
>>>> [    2.918411] PCI: Using host bridge windows from ACPI; if 
>>>> necessary, use "pci=nocrs" and report a bug
>>>> [    2.933841] ACPI: Enabled 5 GPEs in block 00 to 7F
>>>> [    2.973739] ACPI: PCI Root Bridge [PC00] (domain 0000 [bus 00-16])
>>>> [    2.980398] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
>>>> [    2.989457] acpi PNP0A08:00: _OSC: platform does not support [LTR]
>>>> [    2.995451] acpi PNP0A08:00: _OSC: OS now controls [PME 
>>>> PCIeCapability]
>>>> [    3.001394] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
>>>> [    3.010511] PCI host bridge to bus 0000:00
>>>> […]
>>>> [    6.233508] system 00:05: [io  0x1000-0x10fe] has been reserved
>>>> [    6.239420] system 00:05: Plug and Play ACPI device, IDs PNP0c02 (active)
>>>> [    6.239906] pnp: PnP ACPI: found 6 devices
>>>
>>> For ~280 PCI devices, (6.24-2.92)/280 = 0.012 s/dev.  On my laptop I
>>> have about (.66-.37)/36 = 0.008 s/dev (on v5.4), so about the same
>>> ballpark.
>>
>> Though if it was on average 0.008 s/dev here, around a second could be 
>> saved.
>>
>> The integrated Matrox G200eW3 graphics controller (102b:0536) and the 
>> two Broadcom NetXtreme BCM5720 2-port Gigabit Ethernet PCIe cards 
>> (14e4:165f) take 150 ms to be initialized.
>>
>>      [    3.454409] pci 0000:03:00.0: [102b:0536] type 00 class 0x030000
>>      [    3.460411] pci 0000:03:00.0: reg 0x10: [mem 0x91000000-0x91ffffff pref]
>>      [    3.467403] pci 0000:03:00.0: reg 0x14: [mem 0x92808000-0x9280bfff]
>>      [    3.473402] pci 0000:03:00.0: reg 0x18: [mem 0x92000000-0x927fffff]
>>      [    3.479437] pci 0000:03:00.0: BAR 0: assigned to efifb
>>
>> The timestamp in each line differs by around 6 ms. Could printing the 
>> messages to the console (VGA) hold this up (line 373 to line 911 makes 
>> (6.24 s-2.92 s)/(538 lines) = (3.32 s)/(538 lines) = 6 ms)?
>>
>>      [    3.484480] pci 0000:02:00.0: PCI bridge to [bus 03]
>>      [    3.489401] pci 0000:02:00.0:   bridge window [mem 0x92000000-0x928fffff]
>>      [    3.496398] pci 0000:02:00.0:   bridge window [mem 0x91000000-0x91ffffff 64bit pref]
>>      [    3.504446] pci 0000:04:00.0: [14e4:165f] type 00 class 0x020000
>>      [    3.510415] pci 0000:04:00.0: reg 0x10: [mem 0x92e30000-0x92e3ffff 64bit pref]
>>      [    3.517408] pci 0000:04:00.0: reg 0x18: [mem 0x92e40000-0x92e4ffff 64bit pref]
>>      [    3.524407] pci 0000:04:00.0: reg 0x20: [mem 0x92e50000-0x92e5ffff 64bit pref]
>>      [    3.532402] pci 0000:04:00.0: reg 0x30: [mem 0xfffc0000-0xffffffff pref]
>>      [    3.538483] pci 0000:04:00.0: PME# supported from D0 D3hot D3cold
>>      [    3.544437] pci 0000:04:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:1c.5 (capable of 8.000 Gb/s with 5.0 GT/s PCIe x2 link)
>>      [    3.559493] pci 0000:04:00.1: [14e4:165f] type 00 class 0x020000
>>
>> Here is a 15 ms delay.
>>
>>      [    3.565415] pci 0000:04:00.1: reg 0x10: [mem 0x92e00000-0x92e0ffff 64bit pref]
>>      [    3.573407] pci 0000:04:00.1: reg 0x18: [mem 0x92e10000-0x92e1ffff 64bit pref]
>>      [    3.580407] pci 0000:04:00.1: reg 0x20: [mem 0x92e20000-0x92e2ffff 64bit pref]
>>      [    3.587402] pci 0000:04:00.1: reg 0x30: [mem 0xfffc0000-0xffffffff pref]
>>      [    3.594483] pci 0000:04:00.1: PME# supported from D0 D3hot D3cold
>>      [    3.600502] pci 0000:00:1c.5: PCI bridge to [bus 04]
>>
>> Can the 6 ms – also from your system – be explained by the PCI 
>> specification? Seeing how fast PCI nowadays is, 6 ms sounds like a 
>> long time. ;-)
>>
>>> Faster would always be better, of course.  I assume this is not really
>>> a regression?
>>
>> Correct, as far as I know, this is not a regression.
>>
>>>> [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
>>>> [    6.996063] PCI: CLS 0 bytes, default 64
>>>> [    7.000008] Trying to unpack rootfs image as initramfs...
>>>> [    7.065281] Freeing initrd memory: 5136K
>>
>> The PCI resource assignment(?) also seems to take 670 ms:
>>
>>      [    6.319656] pci 0000:04:00.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
>>      […]
>>      [    6.989016] pci 0000:d7:05.0: disabled boot interrupts on device [8086:2034]
>>
>>>> […]
>>>> [    7.079098] DMAR: dmar7: Using Queued invalidation
>>>> [    7.083983] pci 0000:00:00.0: Adding to iommu group 0
>>>> […]
>>>> [    8.537808] pci 0000:d7:17.1: Adding to iommu group 141
>>>
>>> I don't have this iommu stuff turned on and don't know what's
>>> happening here.
>>
>> There is a lock in `iommu_group_add_device()` in `drivers/iommu/iommu.c`:
>>
>>          mutex_lock(&group->mutex);
>>          list_add_tail(&device->list, &group->devices);
>>          if (group->domain  && !iommu_is_attach_deferred(group->domain, dev))
>>                  ret = __iommu_attach_device(group->domain, dev);
>>          mutex_unlock(&group->mutex);
>>
>> No idea, if it’s related. Unfortunately, it’s a production system, so 
>> I can’t do any debugging. (Maybe `initcall_debug` could give some 
>> insight.) Maybe the IOMMU developers can explain it without it. Could 
>> the IOMMU group assignment be done in parallel?
> 
> FWIW I'd expect that locking to be pretty much immaterial - many devices 
> are getting their own uncontended groups, and callers of this tend to be 
> serialised at a higher level anyway. iommu_probe_device() usually runs 
> off the back of the device_add() notifier (where it could be that it's 
> the only thing making noise in between something *else* being slow), but 
> there is the special case where it gets replayed for all existing 
> devices when the IOMMU driver registers itself - at a guess it seems 
> like it may well be the latter case you're seeing, but either way 
> there's not much to say without figuring out where the time is actually 
> being spent (I don't suppose that machine has dynamic ftrace enabled?).

Our Linux kernel has dynamic ftrace enabled.

     $ grep CONFIG_DYNAMIC_FTRACE /boot/config-5.10.70.mx64.403
     CONFIG_DYNAMIC_FTRACE=y
     CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
     CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
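
A minimal sketch of how that could be used here, assuming the functions below
are traceable in this build (the names are taken from the discussion above
and may need adjusting):

```
# One measurement boot with these kernel command-line additions:
#   ftrace=function_graph ftrace_graph_filter=iommu_probe_device,iommu_group_add_device
# Afterwards, the per-call durations are in the trace buffer:
grep -E 'iommu_probe_device|iommu_group_add_device' /sys/kernel/tracing/trace | head
# (On older setups the trace file may be under /sys/kernel/debug/tracing/ instead.)
```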

> That said, setting up a new group isn't a completely insignificant 
> amount of work, and 142 groups seems a lot - I'd have assumed that a 
> system of that scale would be the kind of big server kit that takes 
> several minutes to boot to the point of even starting the kernel anyway.

You are right. As noted in my reply to Krzysztof, it’s more like a pet 
peeve, but it is also relevant when kexec is used.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)
  2021-11-09 20:25   ` Paul Menzel
@ 2021-11-09 23:10     ` Krzysztof Wilczyński
  0 siblings, 0 replies; 9+ messages in thread
From: Krzysztof Wilczyński @ 2021-11-09 23:10 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Jörg Rödel, Suravee Suthikulpanit, Bjorn Helgaas,
	iommu, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86, LKML,
	linux-pci

Hi Paul,

> Thank you for your reply.

Thank you for getting back to us with such good insight.

[...]
> > I am curious - why is this a problem?  Are you power-cycling your servers
> > so often that the cumulative time spent enumerating PCI devices and later
> > adding them to IOMMU groups becomes a problem?
> > 
> > I am simply wondering why you singled out the PCI enumeration as slow in
> > particular, especially given that large server hardware tends to have
> > (most of the time, as per my experience) a rather long initialisation
> > time, either from being powered off or after being power cycled.  It can
> > take a while before the actual operating system itself starts.
> 
> It’s not a problem per se, and more a pet peeve of mine. Systems get
> faster and faster, and boot times slower and slower. On desktop systems,
> it matters much more, with firmware like coreboot taking less than one
> second to initialize the hardware and pass control to the
> payload/operating system. If we are lucky, we are going to have servers
> with FLOSS firmware.
> 
> But already now, using kexec to reboot a system avoids the problems you
> pointed out on servers, and being able to reboot a system as quickly as
> possible lowers the bar for people to reboot systems more often, for
> example so that updates take effect.

A very good point about the kexec usage.

This is definitely often invaluable for getting security updates out of the
door quickly, updating the kernel version, or switching the operating system
quickly (a trick that companies like Equinix Metal use when offering their
bare metal as a service).

> > We talked about this briefly with Bjorn, and there might be an option to
> > add some caching, as we suspect that the culprit here is doing a PCI
> > configuration space read for each device, which can be slow on some
> > platforms.
> > 
> > However, we would need to profile this to get some quantitative data to see
> > whether doing anything would even be worthwhile.  It would definitely help
> > us understand better where the bottlenecks really are and of what magnitude.
> > 
> > I personally don't have access to hardware as large as the system you have
> > access to, thus I was wondering whether you would have some time, and be
> > willing, to profile this for us on the hardware you have.
> > 
> > Let me know what you think.
> 
> Sounds good. I’d be willing to help. Note that I won’t have time before
> Wednesday next week, though.

Not a problem!  I am very grateful you are willing to devote some of your
time to help with this.

I only have access to a few systems, such as some commodity hardware like
a desktop PC and notebooks, and some assorted SoCs.  These are sadly not
even close to proper server platforms, and trying to measure anything on
these does not really yield any useful data, as the delays related to PCI
enumeration on startup are quite insignificant in comparison - there is
just not enough hardware there, so to speak.

I am really looking forward to the data you can gather for us and what
insight it might provide us with.

Thank you again!

	Krzysztof

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-11-09 23:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-05 11:56 How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Paul Menzel
2021-11-05 12:04 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups)s Paul Menzel
2021-11-05 18:53 ` How to reduce PCI initialization from 5 s (1.5 s adding them to IOMMU groups) Bjorn Helgaas
2021-11-06 10:42   ` Paul Menzel
2021-11-09 15:31     ` Robin Murphy
2021-11-09 20:32       ` Paul Menzel
2021-11-08 17:18 ` Krzysztof Wilczyński
2021-11-09 20:25   ` Paul Menzel
2021-11-09 23:10     ` Krzysztof Wilczyński
