From: Thomas Gleixner <tglx@linutronix.de>
To: Chen Yu <yu.c.chen@intel.com>
Cc: x86@kernel.org, Ingo Molnar <mingo@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Rui Zhang <rui.zhang@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: RFD: x86: Sanitize the vector allocator
Date: Sun, 3 Sep 2017 21:18:04 +0200 (CEST)
Message-ID: <alpine.DEB.2.20.1709032025250.2351@nanos>
In-Reply-To: <alpine.DEB.2.20.1709031642030.2351@nanos>


The vector allocator of x86 is a pretty stupid linear search algorithm with
a worst case of

    nr_vectors * nr_online_cpus * nr_cpus_in_affinity_mask

It has some other magic properties and really wants to be replaced by
something smarter.
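
For illustration, the search boils down to something like the sketch
below. This is not the actual __assign_irq_vector() code; the function
name and the vector_is_free() helper are invented:

        /*
         * Made-up sketch of the current search pattern: worst case it
         * probes every vector on every online CPU in the affinity mask.
         */
        static int sketch_alloc_vector(const struct cpumask *affinity)
        {
                unsigned int cpu, vector;

                for_each_cpu_and(cpu, affinity, cpu_online_mask) {
                        for (vector = FIRST_EXTERNAL_VECTOR;
                             vector < NR_VECTORS; vector++) {
                                if (vector_is_free(cpu, vector))
                                        return vector;
                        }
                }
                return -ENOSPC;
        }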

That needs quite some cleanup of the vector management code outside of the
allocator, which I started to work on with the cleanup of the IDT
management which is headed for 4.14. I have some other things in the
pipeline which eliminate quite some duct tape in that area, but I ran into
a couple of interesting things:

1) Multi CPU affinities

   This is only available when the APIC is using logical destination
   mode. With physical destination mode there is already a restriction to a
   single CPU target.

   The multi CPU affinity is biased towards the CPU with the lowest APIC ID
   in the destination bitfield. Only if that APIC is busy (ISR not empty)
   does the next APIC get it (see the sketch at the end of this item).

   A full kernel build on a SKL 4 CPU desktop machine with affinity set to
   CPU0-3 shows that more than 90 percent of the AHCI interrupts end up on
   CPU0.

   Aside from that, the same cold build (right after boot) is about 2% faster
   when the AHCI interrupt is only affine to CPU0.

   I did some experiments on all my machines which have logical destination
   mode, with various workloads, and the results are similar. The
   distribution of interrupts over the CPUs varies with the workloads, but
   the vast majority always ends up on CPU0.

   I've not found a case where the multi CPU affinity is superior. I might
   have the wrong workloads and the wrong machines, but it would be
   extremely helpful just to get rid of this and use single CPU affinities
   only. That'd simplify the allocator along with the various APIC
   implementations.
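
   For reference, the arbitration bias described above can be modeled
   roughly like the sketch below; apic_busy() is an invented stand-in for
   'ISR not empty', nothing real:

        /*
         * Rough model: scan the logical destination bitfield from the
         * lowest APIC ID upwards and deliver to the first APIC whose ISR
         * is empty. Purely illustrative, not kernel or silicon code.
         */
        static unsigned int pick_target(unsigned long dest,
                                        bool (*apic_busy)(unsigned int))
        {
                unsigned int apicid;

                for_each_set_bit(apicid, &dest, BITS_PER_LONG) {
                        if (!apic_busy(apicid))
                                return apicid;
                }
                /* All busy: the lowest APIC ID wins anyway */
                return find_first_bit(&dest, BITS_PER_LONG);
        }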


2) The 'priority level' spreading magic

   The comment in __assign_irq_vector() says:

      * NOTE! The local APIC isn't very good at handling
      * multiple interrupts at the same interrupt level.
      * As the interrupt level is determined by taking the
      * vector number and shifting that right by 4, we
      * want to spread these out a bit so that they don't
      * all fall in the same interrupt level.                         

   After doing some palaeontological research I found the following in the
   PPro Developer Manual Volume 3:

     "7.4.2. Valid Interrupts

     The local and I/O APICs support 240 distinct vectors in the range of 16
     to 255. Interrupt priority is implied by its vector, according to the
     following relationship: priority = vector / 16

     One is the lowest priority and 15 is the highest. Vectors 16 through
     31 are reserved for exclusive use by the processor. The remaining
     vectors are for general use. The processor’s local APIC includes an
     in-service entry and a holding entry for each priority level. To avoid
     losing interrupts, software should allocate no more than 2 interrupt
     vectors per priority."

   The current SDM says nothing about that; instead it states:

     "If more than one interrupt is generated with the same vector number,
      the local APIC can set the bit for the vector both in the IRR and the
      ISR. This means that for the Pentium 4 and Intel Xeon processors, the
      IRR and ISR can queue two interrupts for each interrupt vector: one
      in the IRR and one in the ISR. Any additional interrupts issued for
      the same interrupt vector are collapsed into the single bit in the
      IRR.

      For the P6 family and Pentium processors, the IRR and ISR registers
      can queue no more than two interrupts per interrupt vector and will
      reject other interrupts that are received within the same vector."

   Which means that on P6/Pentium the APIC will reject a new message and
   tell the sender to retry, which increases the load on the APIC bus and
   nothing more.

   There is no affirmative answer from Intel on that, but I think it's sane
   to remove that:

    1) I've looked through a bunch of other operating systems and none of
       them bothers to implement this or even mentions it at all.

    2) The current allocator has no enforcement for this and especially the
       legacy interrupts, which are the main source of interrupts on these
       P6 and older systems, are allocated linearly in the same priority
       level and just work.

    3) The current machines have no problem with that at all as I verified
       with some experiments.

    4) AMD at least confirmed that such an issue is unknown.

    5) P6 and older are dinosaurs, almost 20 years EOL, so we really should
       not worry about that anymore.

   So this can be eliminated, which makes the allocation mechanism way
   simpler.
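
   For reference, the 'spreading' which that comment asks for boils down
   to stepping the search by 16, so consecutive allocations land in
   different priority levels; roughly (simplified, variable names from
   memory):

        /*
         * priority = vector >> 4, so stepping by 16 moves to the next
         * priority level: vector 0x41 is priority 4, the next candidate
         * 0x51 is priority 5. Simplified from the current allocator.
         */
        vector += 16;
        if (vector >= first_system_vector) {
                offset = (offset + 1) % 16;
                vector = FIRST_EXTERNAL_VECTOR + offset;
        }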


Some other issues which are not in the way of cleanups and replacements,
but need to be looked at as well:

1) Automated affinity assignment

    This only helps when the underlying device requests it and has the
    matching queues per CPU. That's what the managed interrupt affinity
    mechanism was made for.

    In other cases the automated assignment can have really bad effects.

    On the same SKL as above I made the AHCI interrupt affine to CPU3 only,
    which makes the kernel build a whopping 10% slower than having it
    affine to CPU0. Interestingly enough, irqbalanced ends up with the wrong
    decision as well.

    So we need to be very careful about that. How well 'random' placement
    works depends on the device and the driver.

    That means we need hinting from the drivers about their preferred
    allocation scheme. If we don't have that, then we should for now default
    to the current scheme, which puts the interrupt on the node to which the
    device is attached.
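
    The managed mechanism already lets a multiqueue driver express its
    preference; roughly like this, where pdev and nr_queues stand in for
    the driver's own data (error handling omitted):

        /*
         * Sketch of a driver opting into managed affinity: the core
         * spreads one vector per queue over the CPUs and keeps the
         * affinities fixed.
         */
        struct irq_affinity affd = {
                .pre_vectors = 1,       /* e.g. one admin/config interrupt */
        };
        int nvecs;

        nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
                                               PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                               &affd);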


2) Vector waste

   All 16 legacy interrupt vectors are populated at boot and stay there
   forever whether they are used or not. On most modern machines that's 10+
   vectors wasted for nothing. If the APIC uses logical destination mode,
   that means these vectors are by default allocated on up to 8 CPUs, or in
   the case of clustered X2APIC on all CPUs in a cluster.

   It'd be worthwhile to allocate these legacy vectors dynamically when
   they are actually used. That might fail, but that's the same on devices
   which use MSI etc. For legacy systems this is a non-issue as there are
   plenty of vectors available. On modern machines the 4-5 actually used
   legacy vectors are requested early during the boot process and should
   not end up in a fully exhausted vector space.

   Nothing urgent, but worthwhile to fix I think.
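
   To make the idea concrete, a purely hypothetical shape for the dynamic
   path; legacy_vector_alloc() does not exist, it's invented here:

        /*
         * Hypothetical: allocate the vector when the legacy interrupt is
         * actually requested instead of statically at boot. Can fail the
         * same way MSI allocation can.
         */
        static int legacy_irq_startup(struct irq_data *irqd)
        {
                int vector = legacy_vector_alloc(irqd); /* invented */

                if (vector < 0)
                        return vector;

                irq_data_get_irq_chip(irqd)->irq_unmask(irqd);
                return 0;
        }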

Thoughts?

Thanks,

     tglx

