From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932131AbdIGJp1 (ORCPT ); Thu, 7 Sep 2017 05:45:27 -0400
Received: from Galois.linutronix.de ([146.0.238.70]:50935 "EHLO
        Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S932069AbdIGJpZ (ORCPT ); Thu, 7 Sep 2017 05:45:25 -0400
Date: Thu, 7 Sep 2017 11:45:19 +0200 (CEST)
From: Thomas Gleixner
To: Yu Chen
cc: x86@kernel.org, Ingo Molnar , "H. Peter Anvin" , Rui Zhang , LKML ,
        "Rafael J. Wysocki" , Len Brown , Dan Williams , Christoph Hellwig ,
        Peter Zijlstra , Jeff Kirsher
Subject: Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing
        the idlest CPU
In-Reply-To: <20170907083405.GA24450@localhost.localdomain>
Message-ID:
References: <20170906043454.GD23250@localhost.localdomain>
        <20170907025212.GA18130@localhost.localdomain>
        <20170907083405.GA24450@localhost.localdomain>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 7 Sep 2017, Yu Chen wrote:
> On Thu, Sep 07, 2017 at 07:54:09AM +0200, Thomas Gleixner wrote:
> > Please switch it over to managed interrupts so the affinity spreading
> > happens in a sane way and the interrupts are properly managed on CPU
> > hotplug.
> Ok, I think currently in i40e driver the reservation of vectors
> leverages pci_enable_msix_range() and did not provide the affinity
> hit to low level IRQ system thus the managed interrupts is not enabled
> there(although later in i40e driver we use irq_set_affinity_hint() to
> spread the IRQs)

The affinity hint has nothing to do with that. It's a hint which tells user
space irqbalanced what the desired placement of the interrupt should
be. That was never used for spreading the affinity automatically in the
kernel and will never be used to do so.
It was a design failure from the very beginning and should be eliminated
ASAP.

The general problem here is the way the whole MSI(X) machinery works in
the kernel.

    pci_enable_msix()
      allocate_interrupts()
      allocate_irqdescs()
      allocate_resources()
      allocate_DMAR_entries()
      allocate_vectors()
      initialize_MSI_entries()

The reason for this is historical. Drivers expect that request_irq() works
once they have allocated the required resources upfront.

Of course this could be changed, but there are issues with that:

  1) The driver must ensure that it does not enable any of the internal
     interrupt delivery mechanisms in the device before request_irq() has
     succeeded.

     That needs auditing drivers all over the place, or we just ignore that
     and leave everyone puzzled why things suddenly stop working.

  2) Reservation accounting

     When no vectors are allocated, we still need to make reservations so
     we can tell a driver that the vector space is exhausted when it
     invokes pci_enable_msix(). But how do we size the reservation space?
     Based on nr_possible_cpus(), nr_online_cpus() or some other
     heuristics?

     Sure, we can just ignore that and resort to overcommitment and fail
     request_irq() when resources are not available, which brings us back
     to #1.

But resorting to overcommitment does not make the CPU hotplug problem
magically go away. If queues and interrupts are used, then the non-managed
variants are going to break affinities and move stuff to the still online
CPUs, which is going to fail.

Managed irqs just work because the driver stops the queue and the interrupt
(which can even stay requested) is shut down and 'kept' on the outgoing
CPU. If the CPU comes back, then the vector is reestablished and the
interrupt is started up on the fly. Stuff just works.....

Thanks,

        tglx