From: shan.hai@oracle.com (Shan Hai)
Subject: [PATCH V2 3/3] nvme pci: introduce module parameter of 'default_queues'
Date: Fri, 4 Jan 2019 10:53:18 +0800	[thread overview]
Message-ID: <27d80908-922b-cb19-4661-892d617095c6@oracle.com> (raw)
In-Reply-To: <20190103103455.GB29693@ming.t460p>



On 2019/1/3 6:34 PM, Ming Lei wrote:
> On Thu, Jan 03, 2019 at 12:36:42PM +0800, Shan Hai wrote:
>>
>>
>> On 2019/1/3 11:31 AM, Ming Lei wrote:
>>> On Thu, Jan 03, 2019 at 11:11:07AM +0800, Shan Hai wrote:
>>>>
>>>>
>>>> On 2019/1/3 10:52 AM, Shan Hai wrote:
>>>>>
>>>>>
>>>>> On 2019/1/3 10:12 AM, Ming Lei wrote:
>>>>>> On Wed, Jan 02, 2019 at 02:11:22PM -0600, Bjorn Helgaas wrote:
>>>>>>> [Sorry about the quote corruption below.  I'm responding with gmail in
>>>>>>> plain text mode, but seems like it corrupted some of the quoting when
>>>>>>> saving as a draft]
>>>>>>>
>>>>>>> On Mon, Dec 31, 2018 at 11:47 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>> >
>>>>>>> > On Mon, Dec 31, 2018 at 03:24:55PM -0600, Bjorn Helgaas wrote:
>>>>>>> > > On Fri, Dec 28, 2018 at 9:27 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>> > > >
>>>>>>> > > > On big system with lots of CPU cores, it is easy to consume up irq
>>>>>>> > > > vectors by assigning defaut queue with num_possible_cpus() irq vectors.
>>>>>>> > > > Meantime it is often not necessary to allocate so many vectors for
>>>>>>> > > > reaching NVMe's top performance under that situation.
>>>>>>> > >
>>>>>>> > > s/defaut/default/
>>>>>>> > >
>>>>>>> > > > This patch introduces module parameter of 'default_queues' to try
>>>>>>> > > > to address this issue reported by Shan Hai.
>>>>>>> > >
>>>>>>> > > Is there a URL to this report by Shan?
>>>>>>> >
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021863.html
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021862.html
>>>>>>> >
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021872.html
>>>>>>>
>>>>>>> It'd be good to include this.  I think the first is the interesting
>>>>>>> one.  It'd be nicer to have an https://lore.kernel.org/... URL, but it
>>>>>>> doesn't look like lore hosts linux-nvme yet.  (Is anybody working on
>>>>>>> that?  I have some archives I could contribute, but other folks
>>>>>>> probably have more.)
>>>>>>>
>>>>>>>>>
>>>>>>>>> Is there some way you can figure this out automatically instead of
>>>>>>>>> forcing the user to use a module parameter?
>>>>>>>>
>>>>>>>> Not yet, otherwise, I won't post this patch out.
>>>>>>>>
>>>>>>>>> If not, can you provide some guidance in the changelog for how a user
>>>>>>>>> is supposed to figure out when it's needed and what the value should
>>>>>>>>> be?  If you add the parameter, I assume that will eventually have to
>>>>>>>>> be mentioned in a release note, and it would be nice to have something
>>>>>>>>> to start from.
>>>>>>>>
>>>>>>>> Ok, that is a good suggestion, how about documenting it via the
>>>>>>>> following words:
>>>>>>>>
>>>>>>>> The number of IRQ vectors is a system-wide resource, and usually it is
>>>>>>>> big enough for every device. However, we allocate num_possible_cpus() + 1
>>>>>>>> irq vectors for each NVMe PCI controller. On a system with lots of CPU
>>>>>>>> cores, or with more than one NVMe controller, IRQ vectors can easily be
>>>>>>>> used up by NVMe. When this issue is triggered, please try passing a
>>>>>>>> smaller number of default queues via the module parameter
>>>>>>>> 'default_queues'; usually it has to be >= the number of NUMA nodes,
>>>>>>>> while remaining big enough to reach NVMe's top performance, which is
>>>>>>>> often less than num_possible_cpus() + 1.
>>>>>>>
>>>>>>> You say "when this issue is triggered."  How does the user know when
>>>>>>> this issue is triggered?
>>>>>>
>>>>>> Any PCI IRQ vector allocation fails.
>>>>>>
>>>>>>>
>>>>>>> The failure in Shan's email (021863.html) is a pretty ugly hotplug
>>>>>>> failure and it would take me personally a long time to connect it with
>>>>>>> an IRQ exhaustion issue and even longer to dig out this module
>>>>>>> parameter to work around it.  I suppose if we run out of IRQ numbers,
>>>>>>> NVMe itself might work fine, but some other random driver might be
>>>>>>> broken?
>>>>>>
>>>>>> Yeah, seems that is true in Shan's report.
>>>>>>
>>>>>> However, Shan mentioned that the issue is only triggered in case of
>>>>>> CPU hotplug, especially "The allocation is caused by IRQ migration of
>>>>>> non-managed interrupts from dying to online CPUs."
>>>>>>
>>>>>> I still don't understand why new IRQ vector allocation is involved
>>>>>> under CPU hotplug, since Shan mentioned that there was no IRQ
>>>>>> exhaustion issue during booting.
>>>>>>
>>>>>
>>>>> Yes, the bug can be reproduced easily by CPU hotplug.
>>>>> First we have to distinguish PCI IRQ vectors from CPU interrupt vectors.
>>>>> MSI-X permits up to 2048 interrupts per device, but a CPU, x86 for
>>>>> example, provides at most 255 interrupt vectors, and the sad fact is that
>>>>> not all of these are available to peripheral devices.
>>>>>
>>>>> So even though the controllers are rich in PCI IRQ space and have
>>>>> thousands of vectors to use, the heavy lifting is done by the precious
>>>>> CPU interrupt vectors.
>>>>>
>>>>> CPU hotplug causes the vector exhaustion problem because the interrupt
>>>>> handlers of the controllers are migrated from the dying CPU to online
>>>>> CPUs whenever the driver's IRQ affinity is not managed by the kernel;
>>>>> drivers whose smp_affinity can be set through the procfs interface
>>>>> belong to this class.
>>>>>
>>>>> The IRQ migration does not free and reallocate the IRQs; instead, the
>>>>> IRQs of a controller are moved to the target CPU cores according to
>>>>> their affinity hints, and each one consumes a vector on the target core.
>>>>>
>>>>> If we try to offline 360 cores out of 384 total on a NUMA system
>>>>> attached to 6 NVMe controllers and 6 NICs, we are out of luck and
>>>>> observe a kernel panic caused by I/O failures.
>>>>
>>>> Put simply, we ran out of CPU interrupt vectors on CPU hotplug rather
>>>> than MSI-X vectors. Adding this knob to the NVMe driver lets it be a good
>>>> citizen, considering that there are drivers out there whose IRQs are
>>>> still not managed by the kernel and get migrated between CPU cores on
>>>> hot-unplugging.
>>>
>>> Yeah, we all think this approach might more or less address the issue.
>>>
>>> But in reality this kind of workaround can be hard to use, given that
>>> people may not easily conclude that this kind of failure should be
>>> addressed by reducing 'nvme.default_queues'. At the very least, we should
>>> give the user a hint about this solution when the failure is triggered,
>>> as Bjorn mentioned.
>>>
>>>>
>>>> If all drivers' IRQ affinities were managed by the kernel, I guess we
>>>> would not be bitten by this bug, but we are not so lucky as of today.
>>>
>>> I am still not sure why changing affinities may introduce extra irq
>>> vector allocation.
>>>
>>
>> Below is some simple math to illustrate the problem:
>>
>> CPUs = 384, NVMe = 6, NICs = 6
>> 2 * 6 * 384 per-CPU interrupt vectors are assigned to the controllers' IRQs.
>>
>> Offline 364 CPUs: 6 * 364 NIC IRQs are migrated to the 20 remaining online
>> CPUs, while the IRQs of the NVMe controllers are not, which means an extra
>> 6 * 364 per-CPU vectors on the 20 online CPUs need to be assigned to these
>> migrated interrupt handlers.
> 
> But those 6 * 364 Linux IRQs had already been allocated and assigned
> before, so why is there IRQ exhaustion?
> 
> 

IRQ#1 is bound to CPU#5:
cat /proc/irq/1/smp_affinity
20

Kick IRQ#1 off CPU#5:
echo 0 > /sys/devices/system/cpu/cpu5/online

cat trace
# tracer: function
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           <...>-41    [005] d..1   122.220040: irq_matrix_alloc <-assign_vector_locked
           <...>-41    [005] d..1   122.220061: <stack trace>
 => irq_matrix_alloc
 => assign_vector_locked
 => apic_set_affinity
 => ioapic_set_affinity
 => irq_do_set_affinity
 => irq_migrate_all_off_this_cpu
 => fixup_irqs
 => cpu_disable_common
 => native_cpu_disable
 => take_cpu_down
 => multi_cpu_stop
 => cpu_stopper_thread
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

IRQ#1 has been migrated to CPU#6, with a new vector assigned:
cat /proc/irq/1/smp_affinity
40
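As a side note, the smp_affinity values shown above are hexadecimal CPU
bitmasks. A tiny Python sketch (not part of the original report; the helper
name is made up) decodes them:

```python
def mask_to_cpus(hex_mask: str):
    """Decode a /proc/irq/N/smp_affinity hex bitmask into a list of CPU numbers."""
    # Masks wider than 32 bits are printed as comma-separated 32-bit words.
    mask = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

print(mask_to_cpus("20"))  # 0x20 -> [5]: IRQ#1 bound to CPU#5 before the offline
print(mask_to_cpus("40"))  # 0x40 -> [6]: IRQ#1 migrated to CPU#6 afterwards
```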

Probably I have misunderstood something; please feel free to correct me if so.
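To put rough numbers on the exhaustion scenario discussed earlier in the
thread, here is a back-of-the-envelope sketch; the ~200 usable vectors per
x86 CPU is an assumed figure, and treating every NIC vector as migratable is
a simplification:

```python
# Scenario from the thread: 384 CPUs, 6 NVMe controllers, 6 NICs,
# roughly one interrupt vector per controller per CPU.
total_cpus = 384
offlined = 364
online = total_cpus - offlined     # 20 CPUs stay online

nics = 6                           # non-managed affinity: IRQs migrate on hotplug
migrated = nics * offlined         # 6 * 364 = 2184 handlers to re-place

extra_per_cpu = migrated / online  # additional vectors each online CPU must absorb
print(migrated, extra_per_cpu)     # -> 2184 109.2

# x86 has 256 IDT entries per CPU; after exceptions, IPIs, timers and other
# system vectors, only somewhere around ~200 remain usable for devices
# (assumed figure).  Each online CPU already carries its own NVMe/NIC vectors,
# so an extra ~109 per CPU pushes usage toward that ceiling, and
# irq_matrix_alloc() starts failing once a target CPU runs out.
```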

Thanks
Shan Hai

> Thanks,
> Ming
> 


Thread overview: 32+ messages
2018-12-29  3:26 [PATCH V2 0/3] nvme pci: two fixes on nvme_setup_irqs Ming Lei
2018-12-29  3:26 ` [PATCH V2 1/3] PCI/MSI: preference to returning -ENOSPC from pci_alloc_irq_vectors_affinity Ming Lei
2018-12-31 22:00   ` Bjorn Helgaas
2018-12-31 22:41     ` Keith Busch
2019-01-01  5:24     ` Ming Lei
2019-01-02 21:02       ` Bjorn Helgaas
2019-01-02 22:46         ` Keith Busch
2018-12-29  3:26 ` [PATCH V2 2/3] nvme pci: fix nvme_setup_irqs() Ming Lei
2018-12-29  3:26 ` [PATCH V2 3/3] nvme pci: introduce module parameter of 'default_queues' Ming Lei
2018-12-31 21:24   ` Bjorn Helgaas
2019-01-01  5:47     ` Ming Lei
2019-01-02  2:14       ` Shan Hai
     [not found]         ` <20190102073607.GA25590@ming.t460p>
     [not found]           ` <d59007c6-af13-318c-5c9d-438ad7d9149d@oracle.com>
     [not found]             ` <20190102083901.GA26881@ming.t460p>
2019-01-03  2:04               ` Shan Hai
2019-01-02 20:11       ` Bjorn Helgaas
2019-01-03  2:12         ` Ming Lei
2019-01-03  2:52           ` Shan Hai
2019-01-03  3:11             ` Shan Hai
2019-01-03  3:31               ` Ming Lei
2019-01-03  4:36                 ` Shan Hai
2019-01-03 10:34                   ` Ming Lei
2019-01-04  2:53                     ` Shan Hai [this message]
2019-01-03  4:51                 ` Shan Hai
2019-01-03  3:21             ` Ming Lei
2019-01-14 13:13 ` [PATCH V2 0/3] nvme pci: two fixes on nvme_setup_irqs John Garry
