linux-kernel.vger.kernel.org archive mirror
From: Thomas Gleixner <tglx@linutronix.de>
To: Peter Xu <peterx@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>, Jason Wang <jasowang@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: Virtio-scsi multiqueue irq affinity
Date: Sat, 23 Mar 2019 18:15:59 +0100 (CET)
Message-ID: <alpine.DEB.2.21.1903231805310.1798@nanos.tec.linutronix.de>
In-Reply-To: <20190318062150.GC6654@xz-x1>

Peter,

On Mon, 18 Mar 2019, Peter Xu wrote:
> I noticed that starting from commit 0d9f0a52c8b9 ("virtio_scsi: use
> virtio IRQ affinity", 2017-02-27) the virtio-scsi driver uses a new
> way (via irq_create_affinity_masks()) to automatically initialize
> IRQ affinities for the multi-queues, which differs from all the other
> virtio devices (like virtio-net, which still uses
> virtqueue_set_affinity(), i.e. effectively irq_set_affinity_hint()).
> 
> Firstly, it will definitely break some userspace programs: scripts
> that want to set the bindings explicitly, as before, now simply fail
> with -EIO every time they echo to /proc/irq/N/smp_affinity of any of
> the multi-queues (see write_irq_affinity()).

Did it break anything? I did not see a report so far. Assumptions about
potential breakage are not really useful.
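
For completeness, the -EIO mentioned above comes from the managed-affinity
check in the procfs write handler. A paraphrased and simplified sketch, not
a verbatim copy of kernel/irq/proc.c / kernel/irq/manage.c:

/*
 * Paraphrased sketch: managed interrupts are excluded from userspace
 * affinity changes, so the write bails out with -EIO.
 */
static ssize_t write_irq_affinity(int type, struct file *file,
				  const char __user *buffer, size_t count,
				  loff_t *pos)
{
	unsigned int irq = (unsigned int)(long)PDE_DATA(file_inode(file));

	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
		return -EIO;

	/* ... parse the cpumask from the buffer and apply it ... */
	return count;
}

bool irq_can_set_affinity_usr(unsigned int irq)
{
	struct irq_desc *desc = irq_to_desc(irq);

	/* Managed (IRQD_AFFINITY_MANAGED) interrupts are rejected here */
	return __irq_can_set_affinity(desc) &&
	       !irqd_affinity_is_managed(&desc->irq_data);
}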

> Is there any specific reason to do it the new way?  Since AFAIU we
> should still allow system admins to decide what to do for such
> configurations, e.g., what if we only want to provision half of the
> CPU resources to handle IRQs for a specific virtio-scsi controller?
> We won't be able to achieve that with the current policy.  Or could
> this be a question for the IRQ subsystem (irq_create_affinity_masks())
> in general?  Any special considerations behind the big picture?

That has nothing to do with the irq subsystem. That merely provides the
mechanisms.

The reason behind this is that multi-queue devices set up queues per CPU,
or, if not enough queues are available, per group of CPUs. So it does not
make sense to move the interrupt away from the CPU or the CPU group its
queue serves.
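
Drivers opt into that by handing an irq_affinity descriptor to the core
when allocating their vectors (virtio-scsi gets there via virtio's
find_vqs, IIRC). A rough sketch of the generic PCI-side variant; the
vector counts and the helper name are made up for illustration:

#include <linux/pci.h>
#include <linux/interrupt.h>

static int example_alloc_queue_vectors(struct pci_dev *pdev,
					unsigned int nr_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* config/control vector, not spread */
	};
	int nvecs;

	/*
	 * PCI_IRQ_AFFINITY lets the core spread the remaining vectors over
	 * the CPUs (irq_create_affinity_masks()) and mark them managed,
	 * which is exactly why the procfs affinity write fails afterwards.
	 */
	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	return nvecs < 0 ? nvecs : 0;
}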

Aside from that, in the CPU hotunplug case interrupts used to be moved to
the still-online CPUs, which caused problems for e.g. hibernation, because
on large systems moving all interrupts to the boot CPU does not work due to
vector space exhaustion. CPU hotunplug is also used for power management
purposes, and there it does not make sense either to move the per-CPU
queues of the offlined CPUs to the still-online CPUs, which would then end
up servicing several queues each.

The new way to deal with this is to strictly bind the per-CPU (per CPU
group) queues. If the CPU, or the last CPU in the group, goes offline the
following happens (a rough sketch of the whole sequence follows the second
list below):

 1) The queue is disabled, i.e. no new requests can be queued

 2) Wait for the outstanding requests to complete

 3) Shut down the interrupt

 This avoids having multiple queues moved to the still-online CPUs and also
 prevents vector space exhaustion, because the shut-down interrupt does not
 have to be migrated.

When the CPU (or the first in the group) comes online again:

 1) Reenable the interrupt

 2) Reenable the queue
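
Purely as an illustration of the two sequences above; the type and helper
names below are made up, not the actual blk-mq/genirq functions:

/*
 * Illustration only: hypothetical helpers and types. The real
 * implementation lives in the blk-mq and genirq layers.
 */
static void queue_cpu_offline(struct hw_queue *q)
{
	quiesce_queue(q);		/* 1) no new requests can be queued */
	wait_for_inflight(q);		/* 2) drain outstanding requests    */
	disable_irq(q->irq);		/* 3) shut down the interrupt; no   */
					/*    migration to another CPU      */
}

static void queue_cpu_online(struct hw_queue *q)
{
	enable_irq(q->irq);		/* 1) reenable the interrupt */
	unquiesce_queue(q);		/* 2) reenable the queue     */
}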

Hope that helps.

Thanks,

	tglx

Thread overview: 16+ messages
2019-03-18  6:21 Virtio-scsi multiqueue irq affinity Peter Xu
2019-03-23 17:15 ` Thomas Gleixner [this message]
2019-03-25  5:02   ` Peter Xu
2019-03-25  7:06     ` Ming Lei
2019-03-25  8:53       ` Thomas Gleixner
2019-03-25  9:43         ` Peter Xu
2019-03-25 13:27           ` Thomas Gleixner
2019-03-25  9:50         ` Ming Lei
2021-05-08  7:52           ` xuyihang
2021-05-08 12:26             ` Thomas Gleixner
2021-05-10  3:19               ` liaochang (A)
2021-05-10  7:54                 ` Thomas Gleixner
2021-05-18  1:37                   ` liaochang (A)
2021-05-10  8:48               ` xuyihang
2021-05-10 19:56                 ` Thomas Gleixner
2021-05-11 12:38                   ` xuyihang
