From: Chen Yu
To: x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Rui Zhang, linux-kernel@vger.kernel.org, Chen Yu, "Rafael J. Wysocki", Len Brown, Dan Williams
Subject: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
Date: Fri, 1 Sep 2017 13:04:43 +0800

This is the major logic to spread the vectors across different CPUs. The
main idea is to choose the 'idlest' CPU, i.e. the one with the least
number of vectors assigned, as the candidate/hint for the vector
allocation domain, in the hope that the vector allocation domain can
leverage this hint to generate the corresponding cpumask.

One of the motivations for this vector-spreading work is the following
hibernation problem found on a 16-core server:

  CPU 31 disable failed: CPU has 62 vectors assigned and there are only 0 available.

Two issues were found after investigation:

1. The network driver has declared many vector resources via
   pci_enable_msix_range(); say the driver wants to reserve 6 per
   logical CPU, so there would be 192 of them.

2. Besides, most of the vectors reserved by this driver are assigned
   on CPU0 due to the current allocation strategy, so there are
   insufficient slots left on CPU0 to receive any IRQs migrated away
   from a CPU going offline.
In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle managed
IRQs on CPU hotplug"), issue 1 has been solved by the managed interrupt
mechanism for multi-queue devices, which no longer migrates these
interrupts: it simply shuts them down and restarts them when the CPU
comes back online. However, given the implementation of this network
driver, the commit above does not fully fix issue 1. Here is the
framework of the network driver:

step 1. Reserve enough IRQ vectors and the corresponding IRQs.
step 2. If the network is activated, invoke request_irq() to register
        the handler.
step 3. Invoke set_affinity() to spread the IRQs onto different CPUs,
        thus spreading the vectors too.

The problem is that if the network cable is not connected, steps 2 and 3
are never invoked, so the IRQ vectors are not spread across CPUs but
remain allocated on CPU0. As a result, CPU offline still fails.

There was some previous discussion of vector spreading in thread [1];
here are the suggestions from Thomas, with responses:

Q1: "Rewrite that allocation code from scratch, use per cpu bitmaps, so
    we can do fast search over masks and keep track of the vectors which
    are used/free per cpu."

A1: A per-CPU bitmap was once considered, but it might not be as fast as
    the solution proposed here. With a per-CPU bitmap, we would need to
    count the number of vectors assigned on each CPU via
    cpumask_weight() and sort the results to find the CPUs with the
    fewest vectors assigned. By contrast, with one bitmap per vector
    count, we can find the 'idlest' CPU at a cost of nearly O(1).

    One scenario of this patch can be illustrated below:

    a) After initialization, vector_mask[0] is set to {CPU0...CPU31} and
       every other vector_mask[i] remains zero, which means that
       CPU0...CPU31 all have zero vectors assigned.

    b) During driver probe, CPU0 is chosen to be assigned one vector.
       Thus vector_mask[0] becomes {CPU1...CPU31}, and vector_mask[1]
       becomes {CPU0}.

    c) The next time the system tries to assign a vector, it searches
       from vector_mask[0]; since it is non-empty, CPU1 is chosen to
       receive the vector. Thus vector_mask[0] becomes {CPU2...CPU31},
       and vector_mask[1] becomes {CPU0, CPU1}.

    d) And so on: vector assignment spreads across the CPUs.

Q2: "respect the allocation domains"

A2: This solution provides a CPU as a parameter to the vector allocation
    domain, in the hope that the domain can leverage this hint to
    generate a cpumask with fewer vectors assigned. Admittedly, the CPU
    we select upfront gets fed into apic->vector_allocation_domain(),
    which might just say no because the CPU does not belong to it; even
    then this should still spread the vectors somewhat, although I
    cannot find a graceful solution for that case. BTW, the vector
    allocation domain is default_vector_allocation_domain() in our case,
    so it works there.

Q3: "set a preference for the node on which the interrupt was allocated"

A3: This can be achieved by further optimization in the following way:
    a) also record the number of vectors used per node;
    b) each time __assign_irq_vector() is invoked, adjust the input mask
       with cpumask_and(mask, mask, node_mask_with_least_vectors()).

[1] https://marc.info/?l=linux-kernel&m=149867663002202 or
[1] https://patchwork.kernel.org/patch/9725227/

Thanks for your time.

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: "Rafael J. Wysocki"
Cc: Len Brown
Cc: Dan Williams
Signed-off-by: Chen Yu
---
 arch/x86/kernel/apic/vector.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index b60cc66..c7f0e8b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -145,6 +145,28 @@ static void free_apic_chip_data(struct apic_chip_data *data)
 	}
 }
 
+static int pick_next_cpu_hint(const struct cpumask *mask)
+{
+	int i;
+	struct cpumask search, tmp;
+
+	cpumask_and(&search, mask, cpu_online_mask);
+	/*
+	 * Search in order from vectors_alloc_mask[0] to
+	 * vectors_alloc_mask[255] for an intersection between
+	 * the mask and vectors_alloc_mask[], and return the
+	 * first CPU in that intersection: this CPU has the
+	 * least number of vectors allocated.
+	 */
+	for (i = 0; i < NR_VECTORS; i++) {
+		cpumask_and(&tmp, &search, vectors_alloc_mask[i]);
+		if (!cpumask_empty(&tmp))
+			return cpumask_first(&tmp);
+	}
+	/* This should not happen... */
+	return cpumask_first_and(mask, cpu_online_mask);
+}
+
 static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 			       const struct cpumask *mask,
 			       struct irq_data *irqdata)
@@ -175,7 +197,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	/* Only try and allocate irqs on cpus that are present */
 	cpumask_clear(d->old_domain);
 	cpumask_clear(searched_cpumask);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = pick_next_cpu_hint(mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, offset;
@@ -247,7 +269,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	 */
 		cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
 		cpumask_andnot(vector_cpumask, mask, searched_cpumask);
-		cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
+		cpu = pick_next_cpu_hint(vector_cpumask);
 		continue;
 	}
 	return -ENOSPC;
-- 
2.7.4