From mboxrd@z Thu Jan  1 00:00:00 1970
From: shan.hai@oracle.com (Shan Hai)
Date: Fri, 4 Jan 2019 10:53:18 +0800
Subject: [PATCH V2 3/3] nvme pci: introduce module parameter of 'default_queues'
In-Reply-To: <20190103103455.GB29693@ming.t460p>
References: <20181229032650.27256-1-ming.lei@redhat.com>
 <20181229032650.27256-4-ming.lei@redhat.com>
 <20190101054735.GB17588@ming.t460p>
 <20190103021237.GA25044@ming.t460p>
 <4d8f963e-df7d-b2d9-3bf8-4852dfe6808e@oracle.com>
 <9d1a0052-85c9-9cbd-f824-7812eceb11bf@oracle.com>
 <20190103033131.GI25044@ming.t460p>
 <20190103103455.GB29693@ming.t460p>
Message-ID: <27d80908-922b-cb19-4661-892d617095c6@oracle.com>

On 2019/1/3 6:34 PM, Ming Lei wrote:
> On Thu, Jan 03, 2019 at 12:36:42PM +0800, Shan Hai wrote:
>>
>>
>> On 2019/1/3 11:31 AM, Ming Lei wrote:
>>> On Thu, Jan 03, 2019 at 11:11:07AM +0800, Shan Hai wrote:
>>>>
>>>>
>>>> On 2019/1/3 10:52 AM, Shan Hai wrote:
>>>>>
>>>>>
>>>>> On 2019/1/3 10:12 AM, Ming Lei wrote:
>>>>>> On Wed, Jan 02, 2019 at 02:11:22PM -0600, Bjorn Helgaas wrote:
>>>>>>> [Sorry about the quote corruption below. I'm responding with gmail in
>>>>>>> plain text mode, but seems like it corrupted some of the quoting when
>>>>>>> saving as a draft]
>>>>>>>
>>>>>>> On Mon, Dec 31, 2018 at 11:47 PM Ming Lei wrote:
>>>>>>> >
>>>>>>> > On Mon, Dec 31, 2018 at 03:24:55PM -0600, Bjorn Helgaas wrote:
>>>>>>> > > On Fri, Dec 28, 2018 at 9:27 PM Ming Lei wrote:
>>>>>>> > > >
>>>>>>> > > > On big system with lots of CPU cores, it is easy to
>>>>>>> consume up irq
>>>>>>> > > > vectors by assigning defaut queue with
>>>>>>> num_possible_cpus() irq vectors.
>>>>>>> > > > Meantime it is often not necessary to allocate so many
>>>>>>> vectors for
>>>>>>> > > > reaching NVMe's top performance under that situation.
>>>>>>> > >
>>>>>>> > > s/defaut/default/
>>>>>>> > >
>>>>>>> > > > This patch introduces module parameter of 'default_queues' to try
>>>>>>> > > > to address this issue reported by Shan Hai.
>>>>>>> > >
>>>>>>> > > Is there a URL to this report by Shan?
>>>>>>> >
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021863.html
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021862.html
>>>>>>> >
>>>>>>> > http://lists.infradead.org/pipermail/linux-nvme/2018-December/021872.html
>>>>>>>
>>>>>>> It'd be good to include this. I think the first is the interesting
>>>>>>> one. It'd be nicer to have an https://lore.kernel.org/... URL, but it
>>>>>>> doesn't look like lore hosts linux-nvme yet. (Is anybody working on
>>>>>>> that? I have some archives I could contribute, but other folks
>>>>>>> probably have more.)
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> Is there some way you can figure this out automatically instead of
>>>>>>>>> forcing the user to use a module parameter?
>>>>>>>>
>>>>>>>> Not yet, otherwise, I won't post this patch out.
>>>>>>>>
>>>>>>>>> If not, can you provide some guidance in the changelog for how a user
>>>>>>>>> is supposed to figure out when it's needed and what the value should
>>>>>>>>> be? If you add the parameter, I assume that will eventually have to
>>>>>>>>> be mentioned in a release note, and it would be nice to have something
>>>>>>>>> to start from.
>>>>>>>>
>>>>>>>> Ok, that is a good suggestion, how about documenting it via the
>>>>>>>> following words:
>>>>>>>>
>>>>>>>> Number of IRQ vectors is system-wide resource, and usually it is big enough
>>>>>>>> for each device. However, we allocate num_possible_cpus() + 1 irq vectors for
>>>>>>>> each NVMe PCI controller.
>>>>>>>> In case that system has lots of CPU cores, or there
>>>>>>>> are more than one NVMe controller, IRQ vectors can be consumed up
>>>>>>>> easily by NVMe. When this issue is triggered, please try to pass smaller
>>>>>>>> default queues via the module parameter of 'default_queues', usually
>>>>>>>> it have to be >= number of NUMA nodes, meantime it needs be big enough
>>>>>>>> to reach NVMe's top performance, which is often less than num_possible_cpus()
>>>>>>>> + 1.
>>>>>>>
>>>>>>> You say "when this issue is triggered." How does the user know when
>>>>>>> this issue triggered?
>>>>>>
>>>>>> Any PCI IRQ vector allocation fails.
>>>>>>
>>>>>>>
>>>>>>> The failure in Shan's email (021863.html) is a pretty ugly hotplug
>>>>>>> failure and it would take me personally a long time to connect it with
>>>>>>> an IRQ exhaustion issue and even longer to dig out this module
>>>>>>> parameter to work around it. I suppose if we run out of IRQ numbers,
>>>>>>> NVMe itself might work fine, but some other random driver might be
>>>>>>> broken?
>>>>>>
>>>>>> Yeah, seems that is true in Shan's report.
>>>>>>
>>>>>> However, Shan mentioned that the issue is only triggered in case of
>>>>>> CPU hotplug, especially "The allocation is caused by IRQ migration of
>>>>>> non-managed interrupts from dying to online CPUs."
>>>>>>
>>>>>> I still don't understand why new IRQ vector allocation is involved
>>>>>> under CPU hotplug since Shan mentioned that no IRQ exhaustion issue
>>>>>> during booting.
>>>>>>
>>>>>
>>>>> Yes, the bug can be reproduced easily by CPU-hotplug.
>>>>> We have to separate the PCI IRQ and CPU IRQ vectors first of all. We know that
>>>>> the MSI-X permits up to 2048 interrupts allocation per device, but the CPU,
>>>>> X86 as an example, could provide maximum 255 interrupt vectors, and the sad fact
>>>>> is that these vectors are not all available for peripheral devices.
>>>>>
>>>>> So even though the controllers are luxury in PCI IRQ space and have got
>>>>> thousands of vectors to use but the heavy lifting is done by the precious CPU
>>>>> irq vectors.
>>>>>
>>>>> The CPU-hotplug causes IRQ vectors exhaustion problem because the interrupt
>>>>> handlers of the controllers will be migrated from dying cpu to the online cpu
>>>>> as long as the driver's irq affinity is not managed by the kernel, the drivers
>>>>> smp_affinity of which can be set by procfs interface belong to this class.
>>>>>
>>>>> And the irq migration does not do irq free/realloc stuff, so the irqs of a
>>>>> controller will be migrated to the target CPU cores according to its irq
>>>>> affinity hint value and will consume a irq vector on the target core.
>>>>>
>>>>> If we try to offline 360 cores out of total 384 cores on a NUMA system attached
>>>>> with 6 NVMe and 6 NICs we are out of luck and observe a kernel panic due to the
>>>>> failure of I/O.
>>>>>
>>>>
>>>> Put it simply we ran out of CPU irq vectors on CPU-hotplug rather than MSI-X
>>>> vectors, adding this knob to the NVMe driver is for let it to be a good citizen
>>>> considering the drivers out there irqs of which are still not managed by the
>>>> kernel and be migrated between CPU cores on hot-plugging.
>>>
>>> Yeah, look we all think this way might address this issue sort of.
>>>
>>> But in reality, it can be hard to use this kind of workaround, given
>>> people may not conclude easily this kind of failure should be addressed
>>> by reducing 'nvme.default_queues'.
>>> At least, we should provide hint to
>>> user about this solution when the failure is triggered, as mentioned by
>>> Bjorn.
>>>
>>>>
>>>> If all driver's irq affinities are managed by the kernel I guess we will not
>>>> be bitten by this bug, but we are not so lucky till today.
>>>
>>> I am still not sure why changing affinities may introduce extra irq
>>> vector allocation.
>>>
>>
>> Below is a simple math to illustrate the problem:
>>
>> CPU = 384, NVMe = 6, NIC = 6
>> 2 * 6 * 384 local irq vectors are assigned to the controllers irqs
>>
>> offline 364 cpu, 6 * 364 NIC irqs are migrated to 20 remaining online CPUs,
>> while the irqs of the NVMe controllers are not, which means extra 6 * 364
>> local irq vectors of 20 online CPUs need to be assigned to these migrated
>> interrupt handlers.
>
> But 6 * 364 Linux IRQs have been allocated/assigned already before, then why
> is there IRQ exhaustion?
>
>

The irq#1 is bound to the CPU#5:

cat /proc/irq/1/smp_affinity
20

Kick the irq#1 out of CPU#5:

echo 0 > /sys/devices/system/cpu/cpu5/online

cat trace
# tracer: function
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           <...>-41    [005] d..1   122.220040: irq_matrix_alloc <-assign_vector_locked
           <...>-41    [005] d..1   122.220061:
 => irq_matrix_alloc
 => assign_vector_locked
 => apic_set_affinity
 => ioapic_set_affinity
 => irq_do_set_affinity
 => irq_migrate_all_off_this_cpu
 => fixup_irqs
 => cpu_disable_common
 => native_cpu_disable
 => take_cpu_down
 => multi_cpu_stop
 => cpu_stopper_thread
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

The irq#1 is migrated to the CPU#6 with a new vector assigned:

cat /proc/irq/1/smp_affinity
40

I may have misunderstood something, please feel free to correct me if so.

Thanks
Shan Hai

> Thanks,
> Ming
>
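For readers reaching this thread later, a minimal sketch of the kind of knob
being discussed may help. The listing below is only an illustration, not the
code from Ming's patch: the helper name example_calc_nr_io_queues() is made
up here, and the clamp to the NUMA node count merely mirrors the guidance
quoted above ("usually it has to be >= number of NUMA nodes"). With such a
parameter in place, the cap would typically be set at boot time, e.g. via
nvme.default_queues=<N> on the kernel command line, as already mentioned in
the thread.

/*
 * Illustrative sketch only -- NOT the actual nvme-pci change.  It shows the
 * general shape of a capped "default_queues" module parameter in a driver.
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/cpumask.h>
#include <linux/nodemask.h>

static unsigned int default_queues;
module_param(default_queues, uint, 0644);
MODULE_PARM_DESC(default_queues,
		 "number of default queues, 0 means num_possible_cpus()");

/*
 * Decide how many I/O queues (and therefore IRQ vectors) to request.
 * Without the parameter, ask for one queue per possible CPU; with it,
 * cap the request, but never go below one queue per NUMA node.
 * (Hypothetical helper name, for illustration only.)
 */
static unsigned int example_calc_nr_io_queues(void)
{
	unsigned int nr = num_possible_cpus();

	if (default_queues)
		nr = min(nr, max_t(unsigned int, default_queues,
				   num_possible_nodes()));

	return nr;
}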