Dear Krzysztof,


On 10.11.21 at 00:10, Krzysztof Wilczyński wrote:

> [...]
>>> I am curious - why is this a problem? Are you power-cycling your servers
>>> so often to the point where the cumulative time spent in enumerating PCI
>>> devices and adding them later to IOMMU groups is a problem?
>>>
>>> I am simply wondering why you decided to single out the PCI enumeration as
>>> slow in particular, especially given that large server hardware tends to
>>> have (most of the time, as per my experience) a rather long initialisation
>>> time either from being powered off or after being power cycled. It can
>>> take a while before the actual operating system itself will start.
>>
>> It's not a problem per se, more a pet peeve of mine. Systems get faster
>> and faster, and boot times slower and slower. On desktop systems it
>> matters much more, with firmware like coreboot taking less than one
>> second to initialize the hardware and pass control to the
>> payload/operating system. If we are lucky, we are going to have servers
>> with FLOSS firmware.
>>
>> But already now, using kexec to reboot a system avoids the problems you
>> pointed out on servers, and being able to reboot a system as quickly as
>> possible lowers the bar for people to reboot systems more often, for
>> example so that updates take effect.
>
> A very good point about the kexec usage.
>
> This is definitely often invaluable to get security updates out of the
> door quickly, update the kernel version, or when you want to switch
> operating systems quickly (a trick that companies like Equinix Metal use
> when offering their bare metal as a service).
>
>>> We talked about this briefly with Bjorn, and there might be an option to
>>> perhaps add some caching, as we suspect that the culprit here is doing
>>> PCI configuration space reads for each device, which can be slow on some
>>> platforms.
>>>
>>> However, we would need to profile this to get some quantitative data to
>>> see whether doing anything would even be worthwhile. It would definitely
>>> help us understand better where the bottlenecks really are and of what
>>> magnitude.
>>>
>>> I personally don't have access to such large hardware as the one you
>>> have access to, thus I was wondering whether you would have some time,
>>> and be willing, to profile this for us on the hardware you have.
>>>
>>> Let me know what you think.
>>
>> Sounds good. I'd be willing to help. Note that I won't have time before
>> Wednesday next week, though.
>
> Not a problem! I am very grateful you are willing to devote some of your
> time to help with this.
>
> I only have access to a few systems such as some commodity hardware like
> a desktop PC and notebooks, and some assorted SoCs. These are sadly not
> even close to proper server platforms, and trying to measure anything on
> these does not really yield any useful data, as the delays related to PCI
> enumeration on startup are quite insignificant in comparison - there is
> just not enough hardware there, so to speak.
>
> I am really looking forward to the data you can gather for us and what
> insight it might provide us with.

So, kexec seems to work, aside from some DMAR-IR warnings [1].

`initcall_debug` increases the Linux boot time by over 50 %, from 7.7 s to
12 s, which I didn't expect.
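For context, `initcall_debug` makes the kernel log every initcall and its
duration to dmesg, which is where the numbers below come from. Picking out
the slow ones can be done with a small helper script along these lines
(entirely my own rough sketch, not anything from the kernel tree; the
200 ms threshold is just the cut-off I chose for the listing below):

#!/usr/bin/env python3
# Rough helper (my own, not from the kernel tree): list all initcalls
# that took longer than 200 ms, based on the lines initcall_debug writes
# to the kernel log, e.g.
#   initcall acpi_init+0x0/0x349 returned 0 after 7291015 usecs
import re
import subprocess

THRESHOLD_US = 200_000  # 200 ms

# Needs enough privileges to read the kernel log.
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
pattern = re.compile(r"initcall (\S+) returned \S+ after (\d+) usecs")

slow = []
for line in dmesg.splitlines():
    match = pattern.search(line)
    if match and int(match.group(2)) >= THRESHOLD_US:
        slow.append((int(match.group(2)), match.group(1)))

# Slowest entries last, like the listing below.
for usecs, name in sorted(slow):
    print(f"initcall {name} took {usecs / 1_000_000:.3f} s")

The regular expression only matches the "initcall ... returned ... after
... usecs" lines that `initcall_debug` emits, so it picks up exactly the
entries quoted below.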
Here are the functions taking more than 200 ms:

initcall pci_apply_final_quirks+0x0/0x132 returned 0 after 228433 usecs
initcall raid6_select_algo+0x0/0x2d6 returned 0 after 383789 usecs
initcall pcibios_assign_resources+0x0/0xc0 returned 0 after 610757 usecs
initcall _mpt3sas_init+0x0/0x1c0 returned 0 after 721257 usecs
initcall ahci_pci_driver_init+0x0/0x1a returned 0 after 945094 usecs
initcall pci_iommu_init+0x0/0x3f returned 0 after 1487134 usecs
initcall acpi_init+0x0/0x349 returned 0 after 7291015 usecs

Some of them are run later in the boot process, but `acpi_init` sticks out
with 7.3 s.


Kind regards,

Paul


[1]: https://lore.kernel.org/linux-iommu/40a7581d-985b-f12b-0bb2-99c586a9f968@molgen.mpg.de/T/#u