From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932131AbdIGJp1 (ORCPT <rfc822;w@1wt.eu>);
        Thu, 7 Sep 2017 05:45:27 -0400
Received: from Galois.linutronix.de ([146.0.238.70]:50935 "EHLO
        Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S932069AbdIGJpZ (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 7 Sep 2017 05:45:25 -0400
Date: Thu, 7 Sep 2017 11:45:19 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: Yu Chen <yu.c.chen@intel.com>
cc: x86@kernel.org, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Rui Zhang <rui.zhang@intel.com>,
        LKML <linux-kernel@vger.kernel.org>,
        "Rafael J. Wysocki" <rjw@rjwysocki.net>, Len Brown <lenb@kernel.org>,
        Dan Williams <dan.j.williams@intel.com>,
        Christoph Hellwig <hch@lst.de>, Peter Zijlstra <peterz@infradead.org>,
        Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Subject: Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing
 the idlest CPU
In-Reply-To: <20170907083405.GA24450@localhost.localdomain>
Message-ID: <alpine.DEB.2.20.1709071045440.1827@nanos>
References: <cover.1504235838.git.yu.c.chen@intel.com> <cfb50b09b26a40609c86ef9813a8cb7f43f90def.1504235838.git.yu.c.chen@intel.com> <alpine.DEB.2.20.1709031642030.2351@nanos> <alpine.DEB.2.20.1709060025470.2393@nanos> <20170906043454.GD23250@localhost.localdomain>
 <alpine.DEB.2.20.1709060801480.2144@nanos> <20170907025212.GA18130@localhost.localdomain> <alpine.DEB.2.20.1709070731110.2433@nanos> <20170907083405.GA24450@localhost.localdomain>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 7 Sep 2017, Yu Chen wrote:
> On Thu, Sep 07, 2017 at 07:54:09AM +0200, Thomas Gleixner wrote:
> > Please switch it over to managed interrupts so the affinity spreading
> > happens in a sane way and the interrupts are properly managed on CPU
> > hotplug.
> Ok, I think currently in the i40e driver the reservation of vectors
> leverages pci_enable_msix_range() and does not provide the affinity
> hint to the low level IRQ system, thus managed interrupts are not enabled
> there (although later in the i40e driver we use irq_set_affinity_hint()
> to spread the IRQs).

The affinity hint has nothing to do with that. It's a hint which tells the
user space irqbalance daemon what the desired placement of the interrupt
should be. It was never used for spreading the affinity automatically in
the kernel and never will be. It was a design failure from the very
beginning and should be eliminated ASAP.

The general problem here is the way the whole MSI(X) machinery works in
the kernel.

pci_enable_msix()

   allocate_interrupts()
     allocate_irqdescs()
       allocate_resources()
         allocate_DMAR_entries()
           allocate_vectors()
             initialize_MSI_entries()

The reason for this is historical. Drivers expect that request_irq()
works once they have allocated the required resources upfront.
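
To make that expectation concrete, here is a toy user space model. It is
only a sketch: the toy_* names, the struct and the single shared counter
are made up for illustration and are not kernel API. It assumes that the
enable step takes all vectors at once, so the later request step has
nothing left that could run out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IRQS 8

/* Toy device state. All names here are hypothetical, not kernel API. */
struct toy_msix {
	int free_vectors;              /* vectors left in the toy vector space */
	int nvec;                      /* vectors owned by this device */
	bool requested[TOY_MAX_IRQS];  /* handler installed for entry i? */
};

/* Models pci_enable_msix(): take every vector now or fail now. */
static int toy_enable_msix(struct toy_msix *m, int nvec)
{
	if (nvec <= 0 || nvec > TOY_MAX_IRQS || nvec > m->free_vectors)
		return -1;	/* exhaustion is reported at enable time */
	m->free_vectors -= nvec;
	m->nvec = nvec;
	return 0;
}

/* Models request_irq(): the vector is already held, only bookkeeping is left. */
static int toy_request_irq(struct toy_msix *m, int idx)
{
	if (idx < 0 || idx >= m->nvec || m->requested[idx])
		return -1;	/* caller error, not resource exhaustion */
	m->requested[idx] = true;
	return 0;
}
```

In this model a driver that got 0 back from toy_enable_msix() can call
toy_request_irq() for each of its entries without ever seeing a resource
failure, which is the property drivers rely on.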

Of course this could be changed, but there are issues with that:

  1) The driver must ensure that it does not enable any of the internal
     interrupt delivery mechanisms in the device before request_irq() has
     succeeded.

     That needs auditing drivers all over the place, or we just ignore that
     and leave everyone puzzled why things suddenly stop working.

  2) Reservation accounting

     When no vectors are allocated, we still need to make reservations so
     we can tell a driver that the vector space is exhausted when it
     invokes pci_enable_msix(). But how do we size the reservation space?
     Based on num_possible_cpus(), num_online_cpus() or some other
     heuristic?

     Sure, we can just ignore that and resort to overcommitment and fail
     request_irq() when resources are not available, which brings us back
     to #1.
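
The trade-off in #2 can be sketched with a toy user space model. The
toy_* names, the vectors-per-CPU figure and the overcommit switch are all
assumptions made for illustration, nothing here exists in the kernel. The
capacity is simply vectors per CPU times whatever CPU count is picked for
sizing.

```c
#include <stdbool.h>

/* Toy vector space. All names and numbers are hypothetical. */
struct toy_space {
	int capacity;    /* vectors the chosen CPU set can provide */
	int reserved;    /* promised to drivers at enable time */
	int allocated;   /* actually bound at request_irq() time */
	bool overcommit; /* hand out reservations without checking */
};

/* The sizing question from the text: which CPU count goes in here? */
static struct toy_space toy_space_init(int vectors_per_cpu, int ncpus,
				       bool overcommit)
{
	struct toy_space s = { vectors_per_cpu * ncpus, 0, 0, overcommit };
	return s;
}

/* Enable time: only a reservation is made, no vector is allocated. */
static int toy_reserve(struct toy_space *s, int nvec)
{
	if (!s->overcommit && s->reserved + nvec > s->capacity)
		return -1;	/* strict accounting: the driver learns early */
	s->reserved += nvec;
	return 0;
}

/* request_irq() time: the vector is allocated for real. */
static int toy_request(struct toy_space *s)
{
	if (s->allocated >= s->capacity)
		return -1;	/* with overcommit the failure shows up here, late */
	s->allocated++;
	return 0;
}
```

With strict accounting the answer depends entirely on the CPU count used
for sizing: the same two drivers fit when the space is sized for four CPUs
and do not fit when it is sized for two. With overcommit both reservations
succeed and one of the later toy_request() calls fails instead.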

But resorting to overcommitment does not make the CPU hotplug problem
magically go away. If queues and interrupts are in use, then the
non-managed variants are going to break affinities and move stuff to the
still online CPUs, which is going to fail.

Managed irqs just work because the driver stops the queue and the interrupt
(which can even stay requested) is shut down and 'kept' on the outgoing
CPU. If the CPU comes back, the vector is reestablished and the interrupt
is started up on the fly. Stuff just works.
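
The difference in hotplug behaviour can be sketched as a toy user space
model. All toy_* names and fields are hypothetical and the model is
deliberately crude: it assumes a fixed set of four CPUs, and it does not
model vector exhaustion on the target CPU, which is what makes the
non-managed move fail in practice.

```c
#include <stdbool.h>

#define TOY_NCPUS 4

/* Toy interrupt. Field and function names are hypothetical. */
struct toy_irq {
	bool managed;	/* affinity owned by the core, not by the driver */
	int home_cpu;	/* CPU the queue was set up for */
	int cpu;	/* CPU currently targeted by the vector */
	bool active;	/* vector live and able to deliver */
};

static bool toy_online[TOY_NCPUS] = { true, true, true, true };

/* Returns the lowest online CPU, or -1 if none is online. */
static int toy_first_online(void)
{
	for (int c = 0; c < TOY_NCPUS; c++)
		if (toy_online[c])
			return c;
	return -1;
}

static void toy_cpu_down(int cpu, struct toy_irq *irqs, int n)
{
	toy_online[cpu] = false;
	for (int i = 0; i < n; i++) {
		if (irqs[i].cpu != cpu)
			continue;
		if (irqs[i].managed)
			/* shut down and kept on the outgoing CPU */
			irqs[i].active = false;
		else
			/* affinity broken, needs a vector somewhere else */
			irqs[i].cpu = toy_first_online();
	}
}

static void toy_cpu_up(int cpu, struct toy_irq *irqs, int n)
{
	toy_online[cpu] = true;
	for (int i = 0; i < n; i++)
		if (irqs[i].managed && irqs[i].home_cpu == cpu)
			/* vector reestablished, interrupt started up again */
			irqs[i].active = true;
}
```

After a down/up cycle of its home CPU the managed interrupt is back where
it started, while the non-managed one stays wherever it was moved to and
consumes a vector there.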

Thanks,

	tglx