Date: Sat, 1 Sep 2018 00:48:46 +0200 (CEST)
From: Thomas Gleixner
To: Kashyap Desai
cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block
Subject: RE: Affinity managed interrupts vs non-managed interrupts
In-Reply-To: <486f94a563d63c4779498fe8829a546c@mail.gmail.com>
References: <20180829084618.GA24765@ming.t460p>
 <300d6fef733ca76ced581f8c6304bac6@mail.gmail.com>
 <615d78004495aebc53807156d04d988c@mail.gmail.com>
 <486f94a563d63c4779498fe8829a546c@mail.gmail.com>

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > shost_busy etc.
> > > We want to use 16 special reply queues for IO acceleration (these
> > > queues work in interrupt coalescing mode; this is a h/w feature).
> >
> > TBH, this does not make any sense whatsoever. Why are you trying to
> > have extra interrupts for coalescing instead of doing the following:
>
> Thomas,
>
> We are using this feature mainly for performance, not for CPU hotplug
> issues. I read your points #1 to #4 below as mostly addressing CPU
> hotplug. Right? If we use all 72 reply queues (all in interrupt
> coalescing mode) without any extra reply queues, we don't have any
> issue with cpu-msix mapping and cpu hotplug. Our major problem with
> that method is that latency is very bad at low QD and/or in the single
> worker case.
>
> To solve that problem we have added 16 extra reply queues (a special
> h/w feature for performance only) which work in interrupt coalescing
> mode, while the existing 72 reply queues work without any interrupt
> coalescing. The best way to map the additional 16 reply queues is to
> map them to the local numa node.

Ok. I misunderstood the whole thing a bit. So your real issue is that
you want to have reply queues which are instantaneous, the per cpu
ones, and then the extra 16 which do batching and are shared over a set
of CPUs, right?

> I understand that it is a unique requirement, but at the same time we
> may be able to do it gracefully (in the irq subsystem), as you
> mentioned that irq_set_affinity_hint() should be avoided in low level
> drivers. Is it possible to have a similar mapping in the managed
> interrupt case, as below?
>
>	for (i = 0; i < 16; i++)
>		irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
>				      cpumask_of_node(local_numa_node));
>
> Currently we always see that the affinity of the managed pre-vectors
> is 0-71 and the effective cpu is always 0.

The pre-vectors are not affinity managed. They get the default affinity
assigned and at request_irq() the vectors are dynamically spread over
CPUs to avoid that the bulk of interrupts ends up on CPU0. That's
handled that way since:

  a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")
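For illustration, such a pre-vector setup typically comes from an
allocation along these lines. This is a minimal sketch, not the actual
megaraid_sas code; the 16/72 counts and the pci_dev come from the
snippet above:

	#include <linux/interrupt.h>
	#include <linux/pci.h>

	static int alloc_reply_queue_vectors(struct pci_dev *pdev)
	{
		/*
		 * Sketch only: reserve the 16 extra reply queues as
		 * pre-vectors.  Pre-vectors are excluded from managed
		 * affinity spreading; they get the default affinity and
		 * are only spread dynamically at request_irq() time, as
		 * described above.
		 */
		struct irq_affinity desc = { .pre_vectors = 16 };

		/* At least the pre-vectors plus one per-CPU queue. */
		return pci_alloc_irq_vectors_affinity(pdev, 16 + 1, 16 + 72,
				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);
	}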
> We want some changes in the current API which allow us to pass flags
> (like *local numa affinity*) so that the cpu-msix mapping comes from
> the local numa node and the effective cpus are spread across the local
> numa node.

What you really want is to split the vector space for your device into
two blocks: one for the regular per cpu queues and the other (16 or
however many) which are managed separately, i.e. spread out evenly.
That needs some extensions to the core allocation/management code, but
that shouldn't be a huge problem.
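To make that concrete, a purely illustrative sketch of what such a
split could look like from the driver side. The nr_sets/sets fields
below are hypothetical; as said, the core allocation/management code
would need extensions before anything like this can work:

	#include <linux/interrupt.h>
	#include <linux/pci.h>

	/* Two blocks: 72 per-CPU queues, 16 extra coalescing queues. */
	static int sets[2] = { 72, 16 };

	static int alloc_split_vectors(struct pci_dev *pdev)
	{
		struct irq_affinity desc = {
			.nr_sets = 2,	/* hypothetical field */
			.sets	 = sets,	/* hypothetical field */
		};

		/*
		 * Each set would be managed and spread out evenly over
		 * the CPUs independently of the other set.
		 */
		return pci_alloc_irq_vectors_affinity(pdev, 2, 72 + 16,
				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);
	}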
Thanks,

	tglx