Date: Wed, 6 Sep 2017 12:13:38 +0800
From: Yu Chen
To: Thomas Gleixner
Cc: x86@kernel.org, Ingo Molnar, "H. Peter Anvin", Rui Zhang, LKML,
    "Rafael J. Wysocki", Len Brown, Dan Williams, Christoph Hellwig,
    Peter Zijlstra
Subject: Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
Message-ID: <20170906041337.GC23250@localhost.localdomain>

Thanks for looking at this (and please bear with my slow response, as I
need to check the related code before replying).

On Sun, Sep 03, 2017 at 08:17:04PM +0200, Thomas Gleixner wrote:
> On Fri, 1 Sep 2017, Chen Yu wrote:
>
> > This is the major logic to spread the vectors on different CPUs.
> > The main idea is to choose the 'idlest' CPU, i.e. the one with the
> > smallest number of vectors assigned, as the candidate/hint for the
> > vector allocation domain, in the hope that the vector allocation
> > domain can leverage this hint to generate the corresponding cpumask.
> >
> > One of the requirements for this vector spreading work comes from the
> > following hibernation problem found on a 16-core server:
> >
> >   CPU 31 disable failed: CPU has 62 vectors assigned and there
> >   are only 0 available.
> >
> > Two issues were found after investigation:
> >
> > 1. The network driver has declared many vector resources via
> >    pci_enable_msix_range(); say, it wants to reserve 6 per logical
> >    CPU, so there would be 192 of them.
> > 2. Most of the vectors reserved by this driver are assigned to CPU0
> >    due to the current allocation strategy, so there are not enough
> >    slots left on CPU0 to receive IRQs migrated during CPU offline.
> >
> > In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle
> > managed IRQs on CPU hotplug") issue 1 is solved by the managed
> > interrupt mechanism for multi-queue devices, which no longer migrates
> > these interrupts; it simply shuts them down and restarts them when
> > the CPU comes back online.
> >
> > However, given the implementation of this network driver, the commit
> > mentioned above does not fully fix issue 1. Here is the framework of
> > the network driver:
> >
> > step 1. Reserve enough irq vectors and the corresponding IRQs.
> > step 2. If the network is activated, invoke request_irq() to
> >         register the handler.
> > step 3. Invoke set_affinity() to spread the IRQs onto different
> >         CPUs, thus spreading the vectors too.
> >
> > The problem is that if the network cable is not connected, steps 2
> > and 3 are never invoked, so the IRQ vectors are not spread across
> > CPUs and all remain allocated on CPU0. As a result, the CPU offline
> > failure still occurs.
>
> So the proper solution for this network driver is to switch to managed
> interrupts instead of trying to work around it in some other place. It's
> using the wrong mechanism - end of story.
>
> Why are you insisting on implementing a band aid for this particular
> driver instead of fixing the underlying problem of that driver, which
> requires having 32 queues and interrupts open even if only a single CPU
> is online?
>
I agree, the driver could be rewritten, but that might take some time,
so meanwhile I'm also looking at other possible optimizations.
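Just to make sure I understand the direction you suggest, here is a
minimal sketch of what such a managed-interrupt conversion could look
like (the foo_* names, FOO_MAX_QUEUES and the handler below are made up
for illustration, not taken from the actual driver):

#include <linux/interrupt.h>
#include <linux/pci.h>

#define FOO_MAX_QUEUES	32	/* illustrative queue count */

struct foo_priv {
	struct pci_dev *pdev;
	/* a real driver would keep per-queue state here */
};

static irqreturn_t foo_queue_handler(int irq, void *data)
{
	/* per-queue RX/TX handling would go here */
	return IRQ_HANDLED;
}

static int foo_setup_irqs(struct foo_priv *priv)
{
	int i, ret, nvec;

	/*
	 * PCI_IRQ_AFFINITY makes these managed interrupts: the core
	 * spreads them over the CPUs at allocation time and, on CPU
	 * hotplug, shuts them down and restarts them instead of
	 * migrating them, so they no longer pile up on CPU0.
	 */
	nvec = pci_alloc_irq_vectors(priv->pdev, 1, FOO_MAX_QUEUES,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		ret = request_irq(pci_irq_vector(priv->pdev, i),
				  foo_queue_handler, 0, "foo-queue", priv);
		if (ret)
			return ret;	/* error unwinding omitted */
	}

	return 0;
}

With that, the vectors get spread regardless of whether the cable is
plugged in, since the spreading happens at allocation time rather than
in a later set_affinity() step.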
> > Previously there was some discussion in thread [1] about vector
> > spreading, and here are the suggestions from Thomas:
> >
> > Q1:
> >    Rewrite that allocation code from scratch, use per-cpu bitmaps,
> >    so we can do fast searches over the masks and keep track of
> >    which vectors are used/free per CPU.
> > A1:
> >    A per-cpu bitmap was once considered, but it might not be as fast
> >    as the solution proposed here. That is, with a per-cpu bitmap we
> >    would need to count the number of vectors assigned on each CPU
> >    with cpumask_weight() and sort the CPUs to find the ones with the
> >    fewest vectors assigned. By contrast, if we keep a cpumask per
> >    vector count, we can get the 'idlest' CPU at a cost of nearly
> >    O(1).
>
> The weight accounting with the cpumasks is an orthogonal issue to the
> per cpu vector bitmaps.
>
> Once you have selected a CPU, the current code still loops in circles
> instead of just searching a bitmap.

Yes, I agree.

> And if the required affinity mask needs more than one CPU then this is
> still not giving you the right answer. You perhaps assign a vector to
> the least busy CPU, but the other CPUs which happen to be in the mask
> are going to be randomly selected.

Yes, for a multi-CPU mask it might depend on how the vector allocation
domain behaves.

> I'm not against making the vector allocation better, but certainly not
> by adding yet more duct tape to something which is well known as one of
> the dumbest search algorithms on the planet, with a worst case of
>
>    nvectors * nr_online_cpus * nr_online_cpus_in_affinity_mask
>
> and your mechanism nests another loop of potentially NR_VECTORS into
> that. Which is pointless, as the actual assignable vector space is
> smaller. While x86 has 256 vectors, the assignable vector space is way
> smaller and not contiguous:
>
>    Vectors   0 -  31 are reserved for CPU traps and exceptions
>    Vectors 239 - 255 are reserved by the kernel for IPIs, the local
>                      timer etc.
>    Vector   32       can be reserved by the kernel for the reboot
>                      interrupt
>    Vector   96       can be reserved by the kernel for INT80
>
> So the actually usable array size is between 205 and 207.
>
> Aside from that, the code is incorrect vs. the percpu accounting.
> Remove the limit in the update function and see what happens. While
> the limit is necessary in general, it should at least warn and yell
> whenever the accounting goes out of bounds. And if that happens, then
> the whole thing is completely useless, simply because the numbers are
> just wrong.
>
Ok.

> > One scenario of this patch can be illustrated as follows:
> >
> > a) After initialization, vector_mask[0] is set to {CPU0...CPU31} and
> >    every other vector_mask[i] remains zero, which means that
> >    CPU0...CPU31 have zero vectors assigned.
> > b) During driver probe, CPU0 is chosen to be assigned one vector.
> >    Thus vector_mask[0] becomes {CPU1...CPU31} and vector_mask[1]
> >    becomes {CPU0}.
> > c) The next time the system tries to assign another vector, it
> >    searches from vector_mask[0]; because it is non-empty, CPU1 is
> >    chosen to place the vector. Thus vector_mask[0] becomes
> >    {CPU2...CPU31} and vector_mask[1] becomes {CPU0, CPU1}.
> > d) And so on; the vector assignments spread across different CPUs.
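To make the bookkeeping in a) through d) above more concrete, here is a
stripped-down sketch of the idea (pick_idlest_cpu() and
update_vector_count() are illustrative names, not the code from the
actual patch):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <asm/irq_vectors.h>

/*
 * vector_mask[i] holds the CPUs that currently have exactly i vectors
 * assigned, so the least loaded CPU is the first CPU in the lowest
 * non-empty mask.
 */
static struct cpumask vector_mask[NR_VECTORS];

static int pick_idlest_cpu(void)
{
	unsigned int i, cpu;

	for (i = 0; i < NR_VECTORS; i++) {
		cpu = cpumask_first(&vector_mask[i]);
		if (cpu < nr_cpu_ids)
			return cpu;
	}

	return -ENODEV;
}

/* Move @cpu from the bucket for @old assigned vectors to that for @new */
static void update_vector_count(unsigned int cpu, unsigned int old,
				unsigned int new)
{
	cpumask_clear_cpu(cpu, &vector_mask[old]);
	cpumask_set_cpu(cpu, &vector_mask[new]);
}

This also shows why the lookup is only "nearly" O(1): the search still
walks the buckets until it hits a non-empty one.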
> So this works for your particular network driver scenario, which is the
> wrong example, as this driver just needs to be reworked to not expose
> that issue.
>
> The reality of affinity requirements is not that simple. It's very much
> device dependent, and making it mandatory can have negative side
> effects.
>
Yes, this patch just 'cures' a special case.

> > Q2:
> >    "respect the allocation domains"
> > A2:
> >    This solution provides a CPU as the parameter for the vector
> >    allocation domains,
>
> Solution? This is a crude hack which "solves" one particular problem in
> the wrong way.
>
> > in the hope that the domains can leverage this hint to generate the
> > cpumask
>
> Hope is never a good engineering principle.
>
> > with fewer vectors assigned. But yes, the CPU we select upfront and
> > feed into apic->vector_allocation_domain() might just be rejected
> > because the CPU does not belong to that domain; still, this should
> > somehow spread the vectors, although I cannot find a graceful
> > solution to this. BTW, the vector allocation domain is
> > default_vector_allocation_domain() in our case, thus it works.
>
> Now try that with the logical delivery modes which allow multi-CPU
> assignments and the cluster mode thereof.

Yes, I find it a little hard to reason about the vector allocation
domain, as it looks independent of the current vector allocation
process.

> > Q3:
> >    "set a preference for the node on which the interrupt was
> >    allocated"
> > A3:
> >    This can be achieved by further optimization in the following way:
> >    a) Also record the number of vectors used per node.
> >    b) Each time __assign_irq_vector() is invoked, adjust the input
> >       mask with cpumask_and(mask, mask, node_mask_with_least_vectors()).
>
> I'm not looking forward to that heuristic and the next loop inside the
> loops of loops.
>
> So no, this is not going to happen. The current implementation sucks
> and needs a rewrite.
>
> I already started to look into that, but started at the low level end
> of it. See the IDT cleanup series in tip:x86/apic. There is much more
> voodoo in the vector management code to clean up before we can actually
> talk about making the assignment algorithm smarter.

Ok, I'm pulling from tip:x86/apic.

> I'll send out a separate mail raising a few points to discuss before we
> actually start to look into magic spreading functionality.
>
For the vector calculation question you asked, I will reply in that
thread.

Thanks,
Yu

> Thanks,
>
> 	tglx