From: Saeed Mahameed
Subject: Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
Date: Fri, 10 Nov 2017 14:56:30 +0900
Message-ID: <1510293390.5405.9.camel@mellanox.com>
Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
 Leon Romanovsky, Kernel Team, Christoph Hellwig
To: Thomas Gleixner, Jens Axboe
In-Reply-To:

On Thu, 2017-11-09 at 22:42 +0100, Thomas Gleixner wrote:
> Find below a summary of the technical details, implications and
> options.
>
> What can be done for 4.14?
>
> We basically have two options: revert at the driver level or ship
> as is.

I think we all came to the consensus that this is the only immediate
action to solve the mlx5 regression, so I am going to revert the
driver-level change.

> Even if we come up with a quick and dirty hack, it will be too late
> for proper testing before Sunday.
>
> What can be done with some time to work on it?
>
> The managed mechanism consists of 3 pieces:
>
> 1) Vector spreading
>
> 2) Managed vector allocation, which becomes a guaranteed reservation
>    in 4.15 due to the big rework of the vector management code.
>
>    Non-managed interrupts get a best-effort reservation to handle the
>    CPU unplug vector pressure problem in a sane way.
>
> 3) CPU hotplug management
>
>    If the last CPU in the affinity set goes offline, then the
>    interrupt is shut down and restarted when the first CPU in the
>    affinity set comes online again. The driver code needs to ensure
>    that the queue associated with that interrupt is drained before
>    shutdown and that nothing is queued there after this point.

Well, I can speak for the mlx5 case and most network drivers, where all
of the queues associated with an interrupt move with it, so I don't
think our current driver has this issue. I don't believe there are
network drivers with fixed per-CPU resources, but it is worth double
checking.

Regarding the solutions below, any one that guarantees the initial
managed spreading and still allows the user to modify affinity via
/proc/irq/xyz/smp_affinity will be acceptable, since many tools and
users (e.g. irqbalance) rely on this procfs entry.
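Just to make concrete what that entry is used for: below is a minimal
userspace sketch of what irqbalance and admin scripts effectively do
today, i.e. write a hex CPU mask into the procfs file. The IRQ number
and the mask are made up for illustration; it is exactly this write
that is rejected while the interrupt is in managed mode.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/proc/irq/42/smp_affinity";	/* made-up IRQ number */
	const char *mask = "f\n";			/* hex cpumask: CPUs 0-3 */
	int fd;
	ssize_t n;

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	n = write(fd, mask, strlen(mask));
	if (n < 0)
		perror("write");	/* fails while the IRQ is managed */

	close(fd);
	return n < 0 ? 1 : 0;
}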
Thank you Thomas for handling this and for all the detailed
information.

-Saeed.

> So we have options:
>
> 1) Initial vector spreading
>
>    Let the driver use the initial vector spreading. That does only the
>    initial affinity setup, but otherwise the interrupts are handled
>    like any other non-managed interrupt, i.e. best-effort reservation,
>    affinity settings enabled, and CPU unplug breaks affinity and moves
>    them to some random other online CPU.
>
>    The simplest solution of all.
>
> 2) Allowing a driver-supplied mask
>
>    Certainly simple to do, but as you said it's not really a solution.
>    I'm not sure whether we want to go there, as this is going to be
>    replaced fast enough and then create another breakage/frustration
>    level.
>
> 3) Affinity override in managed mode
>
>    Doable, but there are a couple of things to think about:
>
>    * How is this enabled?
>
>      - Opt-in by driver
>
>      - Extra sysfs/procfs knob
>
>      We definitely should not enable it by default, because that would
>      surprise users/drivers which work with the current managed
>      devices and rely on the affinity files being non-writable in
>      managed mode.
>
>    * Is it allowed to set the affinity to offline, but present, CPUs?
>
>      In principle yes, because the core management code can do that as
>      well at setup time.
>
>    * The affinity setting must fail when it cannot do a guaranteed
>      reservation on the new target CPU(s).
>
>      This is not much of a question. That's a matter of fact, because
>      otherwise the association cannot be guaranteed and things fall
>      apart all over the place.
>
>    * When and how is the driver informed about the change?
>
>      When:
>
>      #1 Before the core tries to move the interrupt, so it can veto
>         the move if it cannot allocate new resources or whatever is
>         required to operate after the move.
>
>      #2 After the core made the move effective, because:
>
>         - The interrupt might be moved from an offline set to an
>           online set and needs to be started up, so the related queue
>           must be enabled as well.
>
>         - The interrupt might be moved from an online set to an
>           offline set, so the queue needs to be drained and disabled.
>
>         - Resources which have been allocated in the first step must
>           be made effective and old resources freed.
>
>      How:
>
>      The existing affinity notification mechanism does not work for
>      this, and it's a horrible piece of crap which should go away
>      sooner rather than later.
>
>      So we need some sensible way to provide callbacks. Emphasis on
>      callbacks, as one multiplexing callback is not a good idea.
>
>    * How can the change be made effective?
>
>      When the preliminaries (vector reservation on the new set and
>      possibly resource allocation in the subsystem) have been done,
>      then the actual move can be made.
>
>      But there is a caveat. x86 is not good at reassociating
>      interrupts on the fly except when it sits behind an interrupt
>      remapping unit, but we cannot rely on that.
>
>      So the change flow which works for everything would be:
>
>        if (reserve_vectors() < 0)
>                return FAIL;
>
>        if (subsys_prep_callback() < 0) {
>                release_vectors();
>                return FAIL;
>        }
>
>        shutdown(irq);
>
>        if (!online(newset))
>                return SUCCESS;
>
>        startup(irq);
>
>        subsys_post_callback();
>        return SUCCESS;
>
>      subsys_prep_callback() must basically work the same way as the
>      CPU offline mechanism: drain the queue and prevent queueing
>      before the irq is restarted. If the move results in keeping it
>      shut down because the new set is offline, then the irq will be
>      restarted via the CPU hotplug code and the subsystem will be
>      informed about that via the hotplug mechanism as well.
>
>      subsys_post_callback() is more or less the same as the hotplug
>      callback and restarts the queue. The only difference to the
>      hotplug code as of today is that it might need to make previously
>      allocated resources effective and free the old ones.
>
>      I named these subsys_*_callback() on purpose because this should
>      be handled in a generic way for multiqueue devices and not done
>      at the driver level.
>
>      There are some very interesting locking problems to solve,
>      especially vs. CPU hotplug, but that should be solvable.
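To make the prep/post split concrete from the subsystem/driver side,
here is a minimal sketch of what such callbacks could look like for a
generic multiqueue driver. All names, signatures and helpers below are
purely illustrative assumptions, not an existing kernel API:

#include <linux/cpumask.h>
#include <linux/types.h>

struct my_queue;

/* Hypothetical driver helpers -- stand-ins for real queue management. */
int  my_alloc_resources_on(struct my_queue *q, const struct cpumask *newset);
void my_drain_queue(struct my_queue *q);
void my_switch_to_new_resources(struct my_queue *q);
void my_free_old_resources(struct my_queue *q);

struct my_queue {
	bool	stopped;
	/* rings, completion vectors, per-node buffers, ... */
};

/* Prep: called before the core moves the interrupt; may veto the move. */
static int my_irq_prep_move(struct my_queue *q, const struct cpumask *newset)
{
	int err;

	/* Allocate resources close to the new CPU set first ... */
	err = my_alloc_resources_on(q, newset);
	if (err)
		return err;	/* veto: the core keeps the old affinity */

	/* ... then quiesce: stop queueing and wait for in-flight work. */
	q->stopped = true;
	my_drain_queue(q);
	return 0;
}

/* Post: called after the move is effective and the irq was restarted. */
static void my_irq_post_move(struct my_queue *q)
{
	my_switch_to_new_resources(q);	/* make the step-1 allocations live */
	my_free_old_resources(q);
	q->stopped = false;		/* allow queueing again */
}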
> 4) Break managed mode when affinity is changed by the user
>
>    I'm not going to describe that, because this is going to require at
>    least as much effort as #2 plus a few extra interesting twists
>    versus vector management and CPU hotplug.
>
> 5) Other options:
>
>    Maybe ponies, but I have no clue how to implement them.
>
> Thoughts?
>
> Thanks,
>
>        tglx