From: Chen Yu
To: x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Rui Zhang, linux-kernel@vger.kernel.org, Chen Yu, "Rafael J. Wysocki", Len Brown, Dan Williams
Subject: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
Date: Fri, 1 Sep 2017 13:04:43 +0800

This is the major logic to spread the vectors across different CPUs. The
main idea is to choose the 'idlest' CPU, i.e. the one with the least
number of vectors assigned, as the candidate/hint for the vector
allocation domain, in the hope that the vector allocation domain can
leverage this hint to generate the corresponding cpumask.

One of the motivations for this vector-spreading work is the following
hibernation problem found on a 16-core server:

  CPU 31 disable failed: CPU has 62 vectors assigned and there are only 0 available.

Two issues were found after investigation:

1. The network driver has declared many vector resources via
   pci_enable_msix_range(); say the driver wants to reserve 6 per
   logical CPU, so there would be 192 of them.

2. Besides, most of the vectors reserved by this driver are assigned
   on CPU0 due to the current allocation strategy, so there are
   insufficient slots left on CPU0 to receive any IRQs migrated away
   from a CPU going offline.
In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle managed
IRQs on CPU hotplug"), issue 1 has been solved by the managed interrupt
mechanism for multi-queue devices, which no longer migrates these
interrupts: it simply shuts them down and restarts them when the CPU
comes back online. However, given the implementation of this network
driver, the commit above does not fully fix issue 1. Here is the
framework of the network driver:

step 1. Reserve enough IRQ vectors and the corresponding IRQs.
step 2. If the network is activated, invoke request_irq() to register
        the handler.
step 3. Invoke set_affinity() to spread the IRQs onto different CPUs,
        thus spreading the vectors too.

The problem is that if the network cable is not connected, steps 2 and 3
are never invoked, so the IRQ vectors are not spread across CPUs but
remain allocated on CPU0. As a result, CPU offline still fails.

There was some previous discussion of vector spreading in thread [1];
here are the suggestions from Thomas, with responses:

Q1: "Rewrite that allocation code from scratch, use per cpu bitmaps, so
    we can do fast search over masks and keep track of the vectors which
    are used/free per cpu."

A1: A per-CPU bitmap was once considered, but it might not be as fast as
    the solution proposed here. With a per-CPU bitmap, we would need to
    count the number of vectors assigned on each CPU via
    cpumask_weight() and sort the results to find the CPUs with the
    fewest vectors assigned. By contrast, with one bitmap per vector
    count, we can find the 'idlest' CPU at a cost of nearly O(1).

    One scenario of this patch can be illustrated below:

    a) After initialization, vector_mask[0] is set to {CPU0...CPU31} and
       every other vector_mask[i] remains zero, which means that
       CPU0...CPU31 all have zero vectors assigned.

    b) During driver probe, CPU0 is chosen to be assigned one vector.
       Thus vector_mask[0] becomes {CPU1...CPU31}, and vector_mask[1]
       becomes {CPU0}.

    c) The next time the system tries to assign a vector, it searches
       from vector_mask[0]; since it is non-empty, CPU1 is chosen to
       receive the vector. Thus vector_mask[0] becomes {CPU2...CPU31},
       and vector_mask[1] becomes {CPU0, CPU1}.

    d) And so on: vector assignment spreads across the CPUs.

Q2: "respect the allocation domains"

A2: This solution provides a CPU as a parameter to the vector allocation
    domain, in the hope that the domain can leverage this hint to
    generate a cpumask with fewer vectors assigned. Admittedly, the CPU
    we select upfront gets fed into apic->vector_allocation_domain(),
    which might just say no because the CPU does not belong to it; even
    then this should still spread the vectors somewhat, although I
    cannot find a graceful solution for that case. BTW, the vector
    allocation domain is default_vector_allocation_domain() in our case,
    so it works there.

Q3: "set a preference for the node on which the interrupt was allocated"

A3: This can be achieved by further optimization in the following way:
    a) also record the number of vectors used per node;
    b) each time __assign_irq_vector() is invoked, adjust the input mask
       with cpumask_and(mask, mask, node_mask_with_least_vectors()).

[1] https://marc.info/?l=linux-kernel&m=149867663002202 or
[1] https://patchwork.kernel.org/patch/9725227/

Thanks for your time.

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: "Rafael J. Wysocki"
Cc: Len Brown
Cc: Dan Williams
Signed-off-by: Chen Yu
---
 arch/x86/kernel/apic/vector.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index b60cc66..c7f0e8b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -145,6 +145,28 @@ static void free_apic_chip_data(struct apic_chip_data *data)
 	}
 }
 
+static int pick_next_cpu_hint(const struct cpumask *mask)
+{
+	int i;
+	struct cpumask search, tmp;
+
+	cpumask_and(&search, mask, cpu_online_mask);
+	/*
+	 * Search in order from vectors_alloc_mask[0] to
+	 * vectors_alloc_mask[255] for an intersection between
+	 * the mask and vectors_alloc_mask[], and return the
+	 * first CPU in that intersection: this CPU has the
+	 * least number of vectors allocated.
+	 */
+	for (i = 0; i < NR_VECTORS; i++) {
+		cpumask_and(&tmp, &search, vectors_alloc_mask[i]);
+		if (!cpumask_empty(&tmp))
+			return cpumask_first(&tmp);
+	}
+	/* This should not happen... */
+	return cpumask_first_and(mask, cpu_online_mask);
+}
+
 static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 			       const struct cpumask *mask,
 			       struct irq_data *irqdata)
@@ -175,7 +197,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	/* Only try and allocate irqs on cpus that are present */
 	cpumask_clear(d->old_domain);
 	cpumask_clear(searched_cpumask);
-	cpu = cpumask_first_and(mask, cpu_online_mask);
+	cpu = pick_next_cpu_hint(mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, offset;
@@ -247,7 +269,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	 */
 		cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
 		cpumask_andnot(vector_cpumask, mask, searched_cpumask);
-		cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
+		cpu = pick_next_cpu_hint(vector_cpumask);
 		continue;
 	}
 	return -ENOSPC;
-- 
2.7.4