From: John Garry <john.garry@huawei.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>, <tglx@linutronix.de>,
	"chenxiang (M)" <chenxiang66@hisilicon.com>,
	<bigeasy@linutronix.de>, <linux-kernel@vger.kernel.org>,
	<hare@suse.com>, <hch@lst.de>, <axboe@kernel.dk>,
	<bvanassche@acm.org>, <peterz@infradead.org>, <mingo@redhat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Mon, 16 Dec 2019 18:50:55 +0000	[thread overview]
Message-ID: <ac5b5a25-df2e-18e9-6b0f-60af8c7cec3b@huawei.com> (raw)
In-Reply-To: <68058fd28c939b8e065524715494de95@www.loen.fr>

Hi Marc,

>>
>>>>
>>>> I'm just wondering if non-managed interrupts should be included in
>>>> the load balancing calculation? Couldn't irqbalance (if active) start
>>>> moving non-managed interrupts around anyway?
>>> But they are, aren't they? See what we do in irq_set_affinity:
>>> +        atomic_inc(per_cpu_ptr(&cpu_lpi_count, cpu));
>>> +        atomic_dec(per_cpu_ptr(&cpu_lpi_count,
>>> +                       its_dev->event_map.col_map[id]));
>>> We don't try to "rebalance" anything based on that though, not that
>>> I think we should.
>>
>> Ah sorry, I meant whether they should not be included. In
>> its_irq_domain_activate(), we increment the per-cpu lpi count and also
>> use its_pick_target_cpu() to find the least loaded cpu. I am asking
>> whether we should just stick with the old policy for non-managed
>> interrupts here.
>>
>> After checking D05, I see a very significant performance hit for SAS
>> controller performance - a ~40% throughput drop.
> 
> -ETOOMANYMOVINGPARTS.

Understood.

> 
>> With this patch, now we have effective affinity targeted at seemingly
>> "random" CPUs, as opposed to all just using CPU0. This affects
>> performance.
> 
> And piling all interrupts on the same CPU does help?

Apparently... I need to check this more.

> 
>> The difference is that when we use managed interrupts - like for NVMe
>> or the D06 SAS controller - the irq CPU affinity mask matches the CPUs
>> which enqueue requests to the queue associated with the interrupt.
>> So there is an efficiency in enqueuing and dequeuing on the same CPU
>> group - all related to blk multi-queue. And this is not the case for
>> non-managed interrupts.
> 
> So you enqueue requests from CPU0 only? It seems a bit odd...

No, but maybe I wasn't clear enough. I'll give an overview:

For the D06 SAS controller - which is a multi-queue PCI device - we use 
managed interrupts. The HW has 16 submission/completion queues, so for 
96 cores we have an even spread of 6 CPUs assigned per queue; and this 
per-queue CPU mask is the interrupt affinity mask. So CPU0-5 would 
submit any IO on queue0, CPU6-11 on queue1, and so on. PCI NVMe is 
essentially the same.
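As a rough sketch of the layout described above (illustrative names only, not the actual driver or blk-mq code), the even spread amounts to dividing the CPU space by the number of queues:

```c
#include <assert.h>

/* Hypothetical helper: map a CPU to its submission/completion queue
 * by even division, e.g. 96 CPUs across 16 queues gives 6 CPUs per
 * queue, so CPU0-5 -> queue0, CPU6-11 -> queue1, and so on. The
 * per-queue CPU group doubles as the interrupt affinity mask. */
unsigned int queue_for_cpu(unsigned int cpu, unsigned int nr_cpus,
			   unsigned int nr_queues)
{
	unsigned int cpus_per_queue = nr_cpus / nr_queues; /* 96/16 = 6 */

	return cpu / cpus_per_queue;
}
```

With this mapping, the CPU that submits a request is always in the affinity mask of the interrupt that completes it.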

These are the environments in which we're trying to promote performance.

Then for the D05 SAS controller - which is a multi-queue platform device 
(mbigen) - we don't use managed interrupts. We still submit IO from any 
CPU, but we choose the queue to submit IO on in a round-robin fashion to 
promote some isolation, i.e. to reduce inter-queue lock contention, so the 
queue chosen has nothing to do with the CPU.
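The round-robin selection above can be sketched as follows (again illustrative names, not the actual driver code) - a shared counter advanced on each submission, independent of the submitting CPU:

```c
#include <stdatomic.h>

/* Hypothetical round-robin queue selector for the non-managed case:
 * each submission takes the next queue in turn, regardless of which
 * CPU is submitting. This spreads lock contention across queues but
 * gives no CPU locality between submission and completion. */
atomic_uint next_queue;

unsigned int pick_queue(unsigned int nr_queues)
{
	return atomic_fetch_add(&next_queue, 1) % nr_queues;
}
```

This is why, once the interrupt lands on some arbitrary CPU in the affinity mask, there is no relationship between the submitting and servicing CPUs.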

And with your change we may submit on cpu4 but service the interrupt on 
cpu30, as an example, while previously we would always service on cpu0. 
The old way still isn't ideal, I'll admit.

For this env, we would just like to maintain the same performance. And 
it's here that we see the performance drop.

> 
>>>>> Please give this new patch a shot on your system (my D05 doesn't have
>>>>> any managed devices):
>>>>
>>>> We could consider supporting platform msi managed interrupts, but I
>>>> doubt the value.
>>> It shouldn't be hard to do, and most of the existing code could be
>>> moved to the generic level. As for the value, I'm not convinced
>>> either. For example D05 uses the MBIGEN as an intermediate interrupt
>>> controller, so MSIs are from the PoV of MBIGEN, and not the SAS device
>>> attached to it. Not the best design...
>>
>> JFYI, I did raise this following topic before, but that's as far as I 
>> got:
>>
>> https://marc.info/?l=linux-block&m=150722088314310&w=2
> 
> Yes. And that's probably not very hard, but the problem in your case is
> that the D05 HW is not using MSIs...

Right.

> You'd have to provide an abstraction
> for wired interrupts (please don't).
> 
> You'd be better off directly setting the affinity of the interrupts from
> the driver, but I somehow can't believe that you're only submitting 
> requests
> from the same CPU,

Maybe...

> always. There must be something I'm missing.
> 

Thanks,
John

