From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Gleixner
Subject: Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
Date: Mon, 13 Nov 2017 22:33:49 +0100 (CET)
Message-ID:
References: <2187e555-2c4e-ef55-1c3a-17f5af54d762@fb.com>
 <573060a9-19a7-9133-ef52-fa947088dabb@fb.com>
 <3af0c164-8dde-b6f0-45e1-edbbb28e7f73@mellanox.com>
 <83d3944f-8a31-eb31-93db-294906630b0e@grimberg.me>
 <556f3ff5-c1d4-25c6-7bfc-9866c0d9b323@fb.com>
 <9384acdc-a5d8-872c-0034-9a3869f4fc8b@grimberg.me>
 <1d2e9304-089a-a769-9f38-a742dc066baf@grimberg.me>
 <67a1157e-06e7-479c-993e-bdf42fd613c6@fb.com>
 <69a46009-184f-d925-289c-6036f0bf2554@grimberg.me>
 <7f7aab77-b0b6-f8fa-0b57-3e3c1755eeaa@fb.com>
 <4888ac89-22d5-0700-daa1-d604e4c54970@fb.com>
 <4bae1d1d-d401-115d-91cc-4b7df88b02c5@grimberg.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
 Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig
To: Sagi Grimberg
Return-path:
Received: from Galois.linutronix.de ([146.0.238.70]:35191 "EHLO
 Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1751234AbdKMVdz (ORCPT );
 Mon, 13 Nov 2017 16:33:55 -0500
In-Reply-To: <4bae1d1d-d401-115d-91cc-4b7df88b02c5@grimberg.me>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Mon, 13 Nov 2017, Sagi Grimberg wrote:

> > > > #1 Before the core tries to move the interrupt so it can veto the
> > > >    move if it cannot allocate new resources or whatever is required
> > > >    to operate after the move.
> > >
> > > What would the core do if a driver vetoes a move?
> >
> > Return the error code from write_affinity() as it does with any other
> > error which fails to set the affinity.
>
> OK, so this would mean that the driver queue no longer has a vector,
> correct? So is the semantics that it needs to clean up its resources, or
> should it expect another callout for that?

The driver queue still has the old vector, i.e.
   echo XXX > /proc/irq/YYY/affinity

   write_irq_affinity(newaffinity)
	newvec = reserve_new_vector();
	ret = subsys_pre_move_callback(...., newaffinity);
	if (ret) {
		drop_new_vector(newvec);
		return ret;
	}
	shutdown(oldvec);
	install(newvec);
	subsys_post_move_callback(....)
	startup(newvec);

   subsys_pre_move_callback()
	ret = do_whatever();
	if (ret)
		return ret;
	/*
	 * Make sure nothing is queued anymore and outstanding
	 * requests are completed. Same as for managed CPU hotplug.
	 */
	stop_and_drain_queue();
	return 0;

   subsys_post_move_callback()
	install_new_data();
	/* Reenable queue. Same as for managed CPU hotplug */
	reenable_queue();
	free_old_data();
	return;

Does that clarify the mechanism?

> > > This looks like it can work to me, but I'm probably not familiar
> > > enough to see the full picture here.
> >
> > On the interrupt core side this is workable, I just need the input
> > from the driver^Wsubsystem side if this can be implemented sanely.
>
> Can you explain what you mean by "subsystem"? I thought that the
> subsystem would be the irq subsystem (which means you are the one to
> provide the needed input :) ) and the driver would pass in something
> like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> requirements that you listed, and NULL to tell the core to leave it
> alone and do what it sees fit (or pass msi_irq_ops with a flag that
> means that).
>
> An ops structure is a very common way for drivers to communicate with
> a subsystem core.

So if you look at the above pseudo code, then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking. Those
subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver. That's certainly out of the
scope of what the generic interrupt code can do :)

Thanks,

	tglx