* mlx5 broken affinity
@ 2017-11-01 16:19 Jes Sorensen
  2017-11-01 17:21 ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Jes Sorensen @ 2017-11-01 16:19 UTC (permalink / raw)
  To: sagi; +Cc: Networking, leonro, Saeed Mahameed, Kernel Team

Hi,

The below patch seems to have broken PCI IRQ affinity assignments for mlx5.

Prior to this patch I could echo a value to /proc/irq/<x>/smp_affinity
and it would get assigned. With this patch applied I get -EIO
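
For reference, a minimal userspace sketch of the failing write (the IRQ
number and mask match the ones quoted below; the code itself is
illustrative, not from this report):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* IRQ 50 is one of the mlx5 completion vectors on this box. */
	int fd = open("/proc/irq/50/smp_affinity", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Request CPU 9 only (mask 0x200); on a kernel that marks the
	 * mlx5 vectors as managed this write fails with EIO. */
	if (write(fd, "200", 3) < 0)
		fprintf(stderr, "write: %s\n", strerror(errno));
	close(fd);
	return 0;
}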

The actual affinity assignments seem to have changed too, but I assume
this is a result of the generic allocation?

With this applied I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
000000,00000010
[root@testbox /proc/irq]# cat 100/smp_affinity
100000,00000000

Without I get:
[root@testbox /proc/irq]# cat 50/smp_affinity
000000,00000200
[root@testbox /proc/irq]# cat 100/smp_affinity
000100,00000000
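
(Decoded, assuming the usual format of comma-separated 32-bit hex words
with the most significant word first, that means IRQ 50 moved from CPU 9
to CPU 4, and IRQ 100 from CPU 40 to CPU 52. A small helper to decode
such masks, capped at 64 CPUs for brevity:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	unsigned long long mask = 0;
	char *tok;

	if (argc < 2)
		return 1;
	/* e.g. ./decode 000000,00000010 prints "CPU 4" */
	for (tok = strtok(argv[1], ","); tok; tok = strtok(NULL, ","))
		mask = (mask << 32) | strtoull(tok, NULL, 16);
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1ULL << cpu))
			printf("CPU %d\n", cpu);
	return 0;
}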

I am not wildly familiar with the affinity assignment code. Is there
something obvious I am missing here?

Cheers,
Jes


commit a435393acafbf0ecff4deb3e3cb554b34f0d0664
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Thu Jul 13 11:09:40 2017 +0300

    mlx5: move affinity hints assignments to generic code

    generic api takes care of spreading affinity similar to
    what mlx5 open coded (and even handles better asymmetric
    configurations). Ask the generic API to spread affinity
    for us, and feed him pre_vectors that do not participate
    in affinity settings (which is an improvement to what we
    had before).

    The affinity assignments should match what mlx5 tried to
    do earlier but now we do not set affinity to async, cmd
    and pages dedicated vectors.

    Also, remove mlx5e_get_cpu and introduce mlx5e_get_node
    (used for allocation purposes) and mlx5_get_vector_affinity
    (for indirection table construction) as they provide the needed
    information. Luckily, we have generic helpers to get cpumask
    and node given a irq vector. mlx5_get_vector_affinity will
    be used by mlx5_ib in a subsequent patch.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Leon Romanovsky <leonro@mellanox.com>
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Doug Ledford <dledford@redhat.com>
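
For context, a hedged sketch of the generic API this commit moves mlx5
to; MLX5_EQ_VEC_COMP_BASE stands for the dedicated async/cmd/pages
vectors excluded from the spread, and the actual call site in the driver
may differ:

#include <linux/interrupt.h>
#include <linux/pci.h>

static int example_alloc_comp_vectors(struct pci_dev *pdev, int ncomp)
{
	struct irq_affinity affd = {
		/* vectors 0..pre_vectors-1 get no affinity spreading */
		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
	};

	/* The core spreads the remaining completion vectors across CPUs. */
	return pci_alloc_irq_vectors_affinity(pdev,
			MLX5_EQ_VEC_COMP_BASE + 1,
			MLX5_EQ_VEC_COMP_BASE + ncomp,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
}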


* Re: mlx5 broken affinity
  2017-11-01 16:19 mlx5 broken affinity Jes Sorensen
@ 2017-11-01 17:21 ` Sagi Grimberg
  2017-11-01 18:20   ` Jes Sorensen
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-01 17:21 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Networking, leonro, Saeed Mahameed, Kernel Team, Christoph Hellwig

> Hi,

Hi Jes,

> The below patch seems to have broken PCI IRQ affinity assignments for mlx5.

I wouldn't call it breaking IRQ affinity assignments. It just makes
them automatic.

> Prior to this patch I could echo a value to /proc/irq/<x>/smp_affinity
> and it would get assigned. With this patch applied I get -EIO

Adding Christoph,

Ideally the affinity assignments would be better left alone, but
I wasn't aware that they are now immutable. Christoph?

> The actual affinity assignments seem to have changed too, but I assume
> this is a result of the generic allocation?

The affinity assignments should be giving you perfect linear assignment
of the rx/tx rings completion vectors to CPU cores:
first  -> 0
second -> 1
third  -> 2
...

Before, mlx5 spread affinity starting from the local NUMA node because it
relied on that when constructing the RSS indirection table, which spanned
only the local NUMA node (as a side effect this hurt applications running
on the far node, since their traffic was guaranteed to arrive on rx rings
located in the "wrong" node).

Now the RSS indirection table is linear across both NUMA nodes, just like
the irq affinity settings. Another thing that was improved is the
pre/post vectors mechanism, which excludes any non-rx/tx completion
vectors from the affinity assignments (port event completion vectors,
for example).
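
A hedged sketch of how the driver can query the spread the core produced
(e.g. when building the indirection table); the helper name and vector
offset are illustrative:

#include <linux/pci.h>

/* comp_vector is the 0-based completion vector index; the dedicated
 * async/cmd/pages vectors sit before it, hence the offset. */
static const struct cpumask *
example_vector_affinity(struct pci_dev *pdev, int comp_vector)
{
	return pci_irq_get_affinity(pdev, MLX5_EQ_VEC_COMP_BASE + comp_vector);
}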

> With this applied I get:
> [root@testbox /proc/irq]# cat 50/smp_affinity
> 000000,00000010
> [root@testbox /proc/irq]# cat 100/smp_affinity
> 100000,00000000
> 
> Without I get:
> [root@testbox /proc/irq]# cat 50/smp_affinity
> 000000,00000200
> [root@testbox /proc/irq]# cat 100/smp_affinity
> 000100,00000000
> 
> I am not wildly familiar with the affinity assignment code. Is there
> something obvious I am missing here?

The affinity assignment has indeed changed, but it should be better.
If you do not see a linear (diagonal) affinity assignment then
it's a bug that needs to be fixed (probably in the msi.c code).
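
A quick hedged way to eyeball that, assuming the completion vectors
occupy a contiguous IRQ range (50..57 here is purely illustrative):

#include <stdio.h>

int main(void)
{
	char path[64], buf[128];

	for (int irq = 50; irq <= 57; irq++) {
		snprintf(path, sizeof(path),
			 "/proc/irq/%d/smp_affinity_list", irq);
		FILE *f = fopen(path, "r");

		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("irq %d -> cpus %s", irq, buf);
		fclose(f);
	}
	return 0;
}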


* Re: mlx5 broken affinity
  2017-11-01 17:21 ` Sagi Grimberg
@ 2017-11-01 18:20   ` Jes Sorensen
  2017-11-01 22:41     ` Saeed Mahameed
  2017-11-02  7:57     ` Sagi Grimberg
  0 siblings, 2 replies; 47+ messages in thread
From: Jes Sorensen @ 2017-11-01 18:20 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Networking, leonro, Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> Hi,
> 
> Hi Jes,
> 
>> The below patch seems to have broken PCI IRQ affinity assignments for
>> mlx5.
> 
> I wouldn't call it breaking IRQ affinity assignments. It just makes
> them automatic.

Well, it explicitly breaks the option for an admin to assign them
manually, which contradicts what is written in
Documentation/IRQ-affinity.txt

>> Prior to this patch I could echo a value to /proc/irq/<x>/smp_affinity
>> and it would get assigned. With this patch applied I get -EIO
> 
> Adding Christoph,
> 
> Ideally the affinity assignments would be better left alone, but
> I wasn't aware that they are now immutable. Christoph?
> 
>> The actual affinity assignments seem to have changed too, but I assume
>> this is a result of the generic allocation?
> 
> The affinity assignments should be giving you perfect linear assignment
> of the rx/tx rings completion vectors to CPU cores:
> first  -> 0
> second -> 1
> third  -> 2
> ...
> 
> Before, mlx5 spread affinity starting from the local NUMA node because it
> relied on that when constructing the RSS indirection table, which spanned
> only the local NUMA node (as a side effect this hurt applications running
> on the far node, since their traffic was guaranteed to arrive on rx rings
> located in the "wrong" node).
> 
> Now the RSS indirection table is linear across both NUMA nodes, just like
> the irq affinity settings. Another thing that was improved is the
> pre/post vectors mechanism, which excludes any non-rx/tx completion
> vectors from the affinity assignments (port event completion vectors,
> for example).

I am all in favor of making the automatic setup better, but assuming an
automatic setup is always right seems problematic. There could be
workloads where you may want to assign affinity explicitly.

Jes


* Re: mlx5 broken affinity
  2017-11-01 18:20   ` Jes Sorensen
@ 2017-11-01 22:41     ` Saeed Mahameed
  2017-11-01 23:02       ` Jes Sorensen
  2017-11-02  7:57     ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Saeed Mahameed @ 2017-11-01 22:41 UTC (permalink / raw)
  To: Jes Sorensen, Tariq Toukan
  Cc: Sagi Grimberg, Networking, Leon Romanovsky, Saeed Mahameed,
	Kernel Team, Christoph Hellwig

On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen <jsorensen@fb.com> wrote:
> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>>> Hi,
>>
>> Hi Jes,
>>
>>> The below patch seems to have broken PCI IRQ affinity assignments for
>>> mlx5.
>>
>> I wouldn't call it breaking IRQ affinity assignments. It just makes
>> them automatic.
>
> Well, it explicitly breaks the option for an admin to assign them
> manually, which contradicts what is written in
> Documentation/IRQ-affinity.txt
>
>>> Prior to this patch I could echo a value to /proc/irq/<x>/smp_affinity
>>> and it would get assigned. With this patch applied I get -EIO
>>
>> Adding Christoph,
>>
>> Ideally the affinity assignments would be better left alone, but
>> I wasn't aware that they are now immutable. Christoph?
>>
>>> The actual affinity assignments seem to have changed too, but I assume
>>> this is a result of the generic allocation?
>>
>> The affinity assignments should be giving you perfect linear assignment
>> of the rx/tx rings completion vectors to CPU cores:
>> first  -> 0
>> second -> 1
>> third  -> 2
>> ...
>>
>> Before, mlx5 spread affinity starting from the local NUMA node because it
>> relied on that when constructing the RSS indirection table, which spanned
>> only the local NUMA node (as a side effect this hurt applications running
>> on the far node, since their traffic was guaranteed to arrive on rx rings
>> located in the "wrong" node).
>>
>> Now the RSS indirection table is linear across both NUMA nodes, just like
>> the irq affinity settings. Another thing that was improved is the
>> pre/post vectors mechanism, which excludes any non-rx/tx completion
>> vectors from the affinity assignments (port event completion vectors,
>> for example).
>
> I am all in favor of making the automatic setup better, but assuming an
> automatic setup is always right seems problematic. There could be
> workloads where you may want to assign affinity explicitly.
>
> Jes

I vaguely remember Nacking Sagi's patch as we knew it would break
mlx5e netdev affinity assumptions.
Anyway, we already submitted the mlx5e patch that removed that assumption:

https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
table among all RX rings")
Jes, please confirm you have it.
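
For reference, a hedged sketch of the round-robin distribution that
patch title describes; the function and parameter names are
illustrative:

#include <linux/types.h>

/* Spread indirection-table entries across all channels round-robin,
 * instead of only channels whose IRQs sit on the local NUMA node. */
static void example_fill_indir_rqt(u32 *indirection_rqt, int len,
				   int num_channels)
{
	int i;

	for (i = 0; i < len; i++)
		indirection_rqt[i] = i % num_channels;
}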

And I agree here that the user should be able to read
/proc/irq/x/smp_affinity and even modify it if required.


* Re: mlx5 broken affinity
  2017-11-01 22:41     ` Saeed Mahameed
@ 2017-11-01 23:02       ` Jes Sorensen
  2017-11-02  8:28         ` Tariq Toukan
  0 siblings, 1 reply; 47+ messages in thread
From: Jes Sorensen @ 2017-11-01 23:02 UTC (permalink / raw)
  To: Saeed Mahameed, Tariq Toukan
  Cc: Sagi Grimberg, Networking, Leon Romanovsky, Saeed Mahameed,
	Kernel Team, Christoph Hellwig

On 11/01/2017 06:41 PM, Saeed Mahameed wrote:
> On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen <jsorensen@fb.com> wrote:
>> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>> I am all in favor of making the automatic setup better, but assuming an
>> automatic setup is always right seems problematic. There could be
>> workloads where you may want to assign affinity explicitly.
>>
>> Jes
> 
> I vaguely remember Nacking Sagi's patch as we knew it would break
> mlx5e netdev affinity assumptions.
> Anyway, we already submitted the mlx5e patch that removed that assumption:
> 
> https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
> table among all RX rings")
> Jes, please confirm you have it.
> 
> And I agree here that the user should be able to read
> /proc/irq/x/smp_affinity and even modify it if required.

Hi Saeed,

I can confirm I have that patch - the problem is also present in Linus'
current tree.

I can read smp_affinity, but I cannot write to it.

Cheers,
Jes


* Re: mlx5 broken affinity
  2017-11-01 18:20   ` Jes Sorensen
  2017-11-01 22:41     ` Saeed Mahameed
@ 2017-11-02  7:57     ` Sagi Grimberg
  1 sibling, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-02  7:57 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Networking, leonro, Saeed Mahameed, Kernel Team,
	Christoph Hellwig, Thomas Gleixner

Jes,

> I am all in favor of making the automatic setup better, but assuming an
> automatic setup is always right seems problematic. There could be
> workloads where you may want to assign affinity explicitly.

Adding Thomas to the thread.

My understanding is that the thought is to prevent user-space from
messing up the nice linear assignment, but maybe this might be too
conservative?

The patch that prevented user-space from touching managed irq affinity
assignments:

--
commit 45ddcecbfa947f1dd8e8019bad9e90d6c9f2665c
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jul 4 17:39:25 2016 +0900

     genirq: Use affinity hint in irqdesc allocation

     Use the affinity hint in the irqdesc allocator. The hint is used to
     determine the node for the allocation and to set the affinity of the
     interrupt.

     If multiple interrupts are allocated (multi-MSI) then the allocator
     iterates over the cpumask and for each set cpu it allocates on their
     node and sets the initial affinity to that cpu.

     If a single interrupt is allocated (MSI-X) then the allocator uses
     the first cpu in the mask to compute the allocation node and uses
     the mask for the initial affinity setting.

     Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
     prevent userspace from messing with their affinity settings.

     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
     Cc: Christoph Hellwig <hch@lst.de>
     Cc: linux-block@vger.kernel.org
     Cc: linux-pci@vger.kernel.org
     Cc: linux-nvme@lists.infradead.org
     Cc: axboe@fb.com
     Cc: agordeev@redhat.com
     Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


* Re: mlx5 broken affinity
  2017-11-01 23:02       ` Jes Sorensen
@ 2017-11-02  8:28         ` Tariq Toukan
  2017-11-02 10:08           ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Tariq Toukan @ 2017-11-02  8:28 UTC (permalink / raw)
  To: Jes Sorensen, Saeed Mahameed, Tariq Toukan
  Cc: Sagi Grimberg, Networking, Leon Romanovsky, Saeed Mahameed,
	Kernel Team, Christoph Hellwig



On 02/11/2017 1:02 AM, Jes Sorensen wrote:
> On 11/01/2017 06:41 PM, Saeed Mahameed wrote:
>> On Wed, Nov 1, 2017 at 11:20 AM, Jes Sorensen <jsorensen@fb.com> wrote:
>>> On 11/01/2017 01:21 PM, Sagi Grimberg wrote:
>>> I am all in favor of making the automatic setup better, but assuming an
>>> automatic setup is always right seems problematic. There could be
>>> workloads where you may want to assign affinity explicitly.
>>>
>>> Jes
>>
>> I vaguely remember Nacking Sagi's patch as we knew it would break
>> mlx5e netdev affinity assumptions.
I remember that argument. Still, the series found its way in.
That series moves affinity decisions to the kernel's responsibility.
AFAICS, what the kernel does is assign IRQs to the NUMA nodes one by one
in increasing index order (starting with the cores of NUMA #0), no matter
which NUMA node is closer to the NIC.
This means that if your NIC is on NUMA #1, and you reduce the number of
channels, you might end up working only with the cores on the far NUMA
node. Not good!
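
To check whether you are hitting this, one can compare the NIC's home
node with the assigned masks; a minimal userspace sketch, with the
interface name as an assumption:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
	int node = -1;

	if (f && fscanf(f, "%d", &node) == 1)
		printf("NIC home NUMA node: %d\n", node);
	if (f)
		fclose(f);
	return 0;
}
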
>> Anyway, we already submitted the mlx5e patch that removed that assumption:
>>
>> https://patchwork.ozlabs.org/patch/809196/ ("net/mlx5e: Distribute RSS
>> table among all RX rings")
>> Jes, please confirm you have it.
>>
>> And I agree here that the user should be able to read
>> /proc/irq/x/smp_affinity and even modify it if required.
Totally agree. We should fix that ASAP.
User must have write access.
> 
> Hi Saeed,
> 
> I can confirm I have that patch - the problem is also present in Linus'
> current tree.
> 
> I can read smp_affinity, but I cannot write to it.
Indeed. Should be fixed.
> 
> Cheers,
> Jes
> 
Thanks,
Tariq


* Re: mlx5 broken affinity
  2017-11-02  8:28         ` Tariq Toukan
@ 2017-11-02 10:08           ` Sagi Grimberg
  2017-11-02 12:13             ` Andrew Lunn
  2017-11-02 14:48             ` Jes Sorensen
  0 siblings, 2 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-02 10:08 UTC (permalink / raw)
  To: Tariq Toukan, Jes Sorensen, Saeed Mahameed
  Cc: Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig, Thomas Gleixner


>>> I vaguely remember Nacking Sagi's patch as we knew it would break
>>> mlx5e netdev affinity assumptions.
> I remember that argument. Still, the series found its way in.

Of course it made its way in; it was acked by three different
maintainers, and I addressed all of Saeed's comments.

> That series moves affinity decisions to the kernel's responsibility.
> AFAICS, what the kernel does is assign IRQs to the NUMA nodes one by one
> in increasing index order (starting with the cores of NUMA #0), no matter
> which NUMA node is closer to the NIC.

Well, as we said before, if there is a good argument to do the home node
first, we can change the generic code (as it should be, given that this is
absolutely not device specific).

> This means that if your NIC is on NUMA #1, and you reduce the number of 
> channels, you might end up working only with the cores on the far NUMA
> node. Not good!
We deliberated on this before, and concluded that application affinity
and device affinity are equally important. If you have a real use case
that shows otherwise, it's perfectly doable to start from the device home
node.

>>> And I agree here that user should be able to read
>>> /proc/irq/x/smp_affinity and even modify it if required.
> Totally agree. We should fix that ASAP.
> User must have write access.

I'll let Thomas reply here; I do not fully understand why
pci_alloc_irq_vectors() makes the affinity assignments immutable.


* Re: mlx5 broken affinity
  2017-11-02 10:08           ` Sagi Grimberg
@ 2017-11-02 12:13             ` Andrew Lunn
  2017-11-02 14:48             ` Jes Sorensen
  1 sibling, 0 replies; 47+ messages in thread
From: Andrew Lunn @ 2017-11-02 12:13 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Tariq Toukan, Jes Sorensen, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig,
	Thomas Gleixner

> >This means that if your NIC is on NUMA #1, and you reduce the number of
> >channels, you might end up working only with the cores on the far NUMA
> >node. Not good!

> We deliberated on this before, and concluded that application affinity
> and device affinity are equally important. If you have a real use case
> that shows otherwise, it's perfectly doable to start from the device home
> node.

Hi Sagi

You might find this talk at NetDev interesting:

https://www.netdevconf.org/2.2/session.html?shochat-devicemgmt-talk

	Andrew


* Re: mlx5 broken affinity
  2017-11-02 10:08           ` Sagi Grimberg
  2017-11-02 12:13             ` Andrew Lunn
@ 2017-11-02 14:48             ` Jes Sorensen
  2017-11-02 16:14               ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Jes Sorensen @ 2017-11-02 14:48 UTC (permalink / raw)
  To: Sagi Grimberg, Tariq Toukan, Saeed Mahameed
  Cc: Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig, Thomas Gleixner

On 11/02/2017 06:08 AM, Sagi Grimberg wrote:
> 
>>>> I vaguely remember Nacking Sagi's patch as we knew it would break
>>>> mlx5e netdev affinity assumptions.
>> I remember that argument. Still, the series found its way in.
> 
> Of course it made its way in; it was acked by three different
> maintainers, and I addressed all of Saeed's comments.
> 
>> That series moves affinity decisions to the kernel's responsibility.
>> AFAICS, what the kernel does is assign IRQs to the NUMA nodes one by
>> one in increasing index order (starting with the cores of NUMA #0), no
>> matter which NUMA node is closer to the NIC.
> 
> Well, as we said before, if there is a good argument to do the home node
> first, we can change the generic code (as it should be, given that this is
> absolutely not device specific).
> 
>> This means that if your NIC is on NUMA #1, and you reduce the number
>> of channels, you might end up working only with the cores on the far
>> NUMA node. Not good!
> We deliberated on this before, and concluded that application affinity
> and device affinity are equally important. If you have a real use case
> that shows otherwise, it's perfectly doable to start from the device home
> node.

This wasn't to start a debate about which allocation method is the
perfect solution. I am perfectly happy with the new default; the part
that is broken is to take away the user's option to reassign the
affinity. That is a bug and it needs to be fixed!

Jes


* Re: mlx5 broken affinity
  2017-11-02 14:48             ` Jes Sorensen
@ 2017-11-02 16:14               ` Sagi Grimberg
  2017-11-02 17:13                 ` Jes Sorensen
  2017-11-02 18:10                 ` Thomas Gleixner
  0 siblings, 2 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-02 16:14 UTC (permalink / raw)
  To: Jes Sorensen, Tariq Toukan, Saeed Mahameed
  Cc: Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig, Thomas Gleixner


> This wasn't to start a debate about which allocation method is the
> perfect solution. I am perfectly happy with the new default; the part
> that is broken is to take away the user's option to reassign the
> affinity. That is a bug and it needs to be fixed!

Well,

I would really want to wait for Thomas/Christoph to reply, but this
simple change fixed it for me:
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 573dc52b0806..eccd06be5e44 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
  {
         struct irq_desc *desc = irq_to_desc(irq);

-       return __irq_can_set_affinity(desc) &&
-               !irqd_affinity_is_managed(&desc->irq_data);
+       return __irq_can_set_affinity(desc);
  }

  /**


* Re: mlx5 broken affinity
  2017-11-02 16:14               ` Sagi Grimberg
@ 2017-11-02 17:13                 ` Jes Sorensen
  2017-11-02 18:10                 ` Thomas Gleixner
  1 sibling, 0 replies; 47+ messages in thread
From: Jes Sorensen @ 2017-11-02 17:13 UTC (permalink / raw)
  To: Sagi Grimberg, Tariq Toukan, Saeed Mahameed
  Cc: Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig, Thomas Gleixner

On 11/02/2017 12:14 PM, Sagi Grimberg wrote:
> 
>> This wasn't to start a debate about which allocation method is the
>> perfect solution. I am perfectly happy with the new default; the part
>> that is broken is to take away the user's option to reassign the
>> affinity. That is a bug and it needs to be fixed!
> 
> Well,
> 
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> -- 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
>         struct irq_desc *desc = irq_to_desc(irq);
> 
> -       return __irq_can_set_affinity(desc) &&
> -               !irqd_affinity_is_managed(&desc->irq_data);
> +       return __irq_can_set_affinity(desc);
>  }

The part I don't get here is why the mlx5 driver change was allowed to
set the irqs to 'managed'. One thing is to use the generic system for
initial allocation, which is all great, but the change shouldn't mark
the irq affinity mapping as managed.

Jes


* Re: mlx5 broken affinity
  2017-11-02 16:14               ` Sagi Grimberg
  2017-11-02 17:13                 ` Jes Sorensen
@ 2017-11-02 18:10                 ` Thomas Gleixner
  2017-11-05  8:36                   ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-02 18:10 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On Thu, 2 Nov 2017, Sagi Grimberg wrote:

> 
> > This wasn't to start a debate about which allocation method is the
> > perfect solution. I am perfectly happy with the new default; the part
> > that is broken is to take away the user's option to reassign the
> > affinity. That is a bug and it needs to be fixed!
> 
> Well,
> 
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> --
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
>         struct irq_desc *desc = irq_to_desc(irq);
> 
> -       return __irq_can_set_affinity(desc) &&
> -               !irqd_affinity_is_managed(&desc->irq_data);
> +       return __irq_can_set_affinity(desc);

Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christoph's opinion on
that.

Thanks,

	tglx


* Re: mlx5 broken affinity
  2017-11-02 18:10                 ` Thomas Gleixner
@ 2017-11-05  8:36                   ` Sagi Grimberg
  2017-11-07 15:07                     ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-05  8:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig


>>> This wasn't to start a debate about which allocation method is the
>>> perfect solution. I am perfectly happy with the new default; the part
>>> that is broken is to take away the user's option to reassign the
>>> affinity. That is a bug and it needs to be fixed!
>>
>> Well,
>>
>> I would really want to wait for Thomas/Christoph to reply, but this
>> simple change fixed it for me:
>> --
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 573dc52b0806..eccd06be5e44 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>>   {
>>          struct irq_desc *desc = irq_to_desc(irq);
>>
>> -       return __irq_can_set_affinity(desc) &&
>> -               !irqd_affinity_is_managed(&desc->irq_data);
>> +       return __irq_can_set_affinity(desc);
> 
> Which defeats the whole purpose of the managed facility, which is _not_ to
> break the affinities on cpu offline and bring the interrupt back on the CPU
> when it comes online again.
> 
> What I can do is to have a separate flag, which only uses the initial
> distribution mechanism, but I really want to have Christoph's opinion on
> that.

I do agree that the user would lose better cpu online/offline behavior,
but it seems that users want to still have some control over the IRQ
affinity assignments even if they lose this functionality.

Would it be possible to keep the managed facility until a user overrides
an affinity assignment? This way if the user didn't touch it, we keep
all the perks, and in case the user overrides it, we log the implication
so the user is aware?


* Re: mlx5 broken affinity
  2017-11-05  8:36                   ` Sagi Grimberg
@ 2017-11-07 15:07                     ` Thomas Gleixner
  2017-11-08  7:27                       ` Sagi Grimberg
  2017-11-08 16:19                       ` Jes Sorensen
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-07 15:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> > > > This wasn't to start a debate about which allocation method is the
> > > > perfect solution. I am perfectly happy with the new default; the part
> > > > that is broken is to take away the user's option to reassign the
> > > > affinity. That is a bug and it needs to be fixed!
> > > 
> > > Well,
> > > 
> > > I would really want to wait for Thomas/Christoph to reply, but this
> > > simple change fixed it for me:
> > > --
> > > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > > index 573dc52b0806..eccd06be5e44 100644
> > > --- a/kernel/irq/manage.c
> > > +++ b/kernel/irq/manage.c
> > > @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
> > >   {
> > >          struct irq_desc *desc = irq_to_desc(irq);
> > > 
> > > -       return __irq_can_set_affinity(desc) &&
> > > -               !irqd_affinity_is_managed(&desc->irq_data);
> > > +       return __irq_can_set_affinity(desc);
> > 
> > Which defeats the whole purpose of the managed facility, which is _not_ to
> > break the affinities on cpu offline and bring the interrupt back on the CPU
> > when it comes online again.
> > 
> > What I can do is to have a separate flag, which only uses the initial
> > distribution mechanism, but I really want to have Christoph's opinion on
> > that.
> 
> I do agree that the user would lose better cpu online/offline behavior,
> but it seems that users want to still have some control over the IRQ
> affinity assignments even if they lose this functionality.

Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.

> Would it be possible to keep the managed facility until a user overrides
> an affinity assignment? This way if the user didn't touch it, we keep
> all the perks, and in case the user overrides it, we log the implication
> so the user is aware?

A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.

Why would you give the user a knob to destroy what you carefully optimized?

Just because we can and just because users love those knobs or is there any
real technical reason?

Thanks,

	tglx


* Re: mlx5 broken affinity
  2017-11-07 15:07                     ` Thomas Gleixner
@ 2017-11-08  7:27                       ` Sagi Grimberg
  2017-11-08 12:21                         ` David Laight
  2017-11-09 15:08                         ` Saeed Mahameed
  2017-11-08 16:19                       ` Jes Sorensen
  1 sibling, 2 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-08  7:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig


> Depending on the machine and the number of queues this might even result in
> completely losing the ability to suspend/hibernate because the number of
> > available vectors on CPU0 is not sufficient to accommodate all queue
> interrupts.
> 
>> Would it be possible to keep the managed facility until a user overrides
>> an affinity assignment? This way if the user didn't touch it, we keep
>> all the perks, and in case the user overrides it, we log the implication
>> so the user is aware?
> 
> A lot of things are possible, the question is whether it makes sense. The
> whole point is to have resources (queues, interrupts etc.) per CPU and have
> them strictly associated.

Not arguing here.

> Why would you give the user a knob to destroy what you carefully optimized?

Well, looks like someone relies on this knob, the question is if he is
doing something better for his workload. I don't know, it's really up to
the user to say.

> Just because we can and just because users love those knobs or is there any
> real technical reason?

Again, I think Jes or others can provide more information.


* RE: mlx5 broken affinity
  2017-11-08  7:27                       ` Sagi Grimberg
@ 2017-11-08 12:21                         ` David Laight
  2017-11-08 16:13                           ` Jens Axboe
  2017-11-09 15:08                         ` Saeed Mahameed
  1 sibling, 1 reply; 47+ messages in thread
From: David Laight @ 2017-11-08 12:21 UTC (permalink / raw)
  To: 'Sagi Grimberg', Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

From: Sagi Grimberg
> Sent: 08 November 2017 07:28
...
> > Why would you give the user a knob to destroy what you carefully optimized?
> 
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, it's really up to
> the user to say.

Maybe the user wants to ensure that nothing except some very specific
processing happens on some (or most) of the cpu cores.

If the expected overall ethernet data rate isn't exceptionally large,
is there any reason to allocate a queue (etc.) for every cpu?

	David



* Re: mlx5 broken affinity
  2017-11-08 12:21                         ` David Laight
@ 2017-11-08 16:13                           ` Jens Axboe
  2017-11-09 10:09                             ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2017-11-08 16:13 UTC (permalink / raw)
  To: David Laight, 'Sagi Grimberg', Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/08/2017 05:21 AM, David Laight wrote:
> From: Sagi Grimberg
>> Sent: 08 November 2017 07:28
> ...
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>
>> Well, looks like someone relies on this knob, the question is if he is
>> doing something better for his workload. I don't know, it's really up to
>> the user to say.
> 
> Maybe the user wants to ensure that nothing except some very specific
> processing happens on some (or most) of the cpu cores.
> 
> If the expected overall ethernet data rate isn't exceptionally large,
> is there any reason to allocate a queue (etc.) for every cpu?

There are numerous valid reasons to be able to set the affinity, for
both nics and block drivers. It's great that the kernel has a predefined
layout that works well, but users do need the flexibility to be able to
reconfigure affinities, to suit their needs.

For the particular mlx5 case, I'm actually not sure how the FB
configuration differs from the in-kernel stuff. I'll take a look at
that. It may just be that the configuration exists because the code used
to be driver private and frequently changed; setting it at bootup to a
known good configuration helped eliminate problems when upgrading
kernels. I also remember some cases of removing CPU0 from the mask.

But that particular case is completely orthogonal to whether or not we
should allow the user to reconfigure. The answer to that is clearly YES,
and we should ensure that this is possible.

-- 
Jens Axboe


* Re: mlx5 broken affinity
  2017-11-07 15:07                     ` Thomas Gleixner
  2017-11-08  7:27                       ` Sagi Grimberg
@ 2017-11-08 16:19                       ` Jes Sorensen
  2017-11-08 17:33                         ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Jes Sorensen @ 2017-11-08 16:19 UTC (permalink / raw)
  To: Thomas Gleixner, Sagi Grimberg
  Cc: Tariq Toukan, Saeed Mahameed, Networking, Leon Romanovsky,
	Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> On Sun, 5 Nov 2017, Sagi Grimberg wrote:
>> I do agree that the user would lose better cpu online/offline behavior,
>> but it seems that users want to still have some control over the IRQ
>> affinity assignments even if they lose this functionality.
> 
> Depending on the machine and the number of queues this might even result in
> completely losing the ability to suspend/hibernate because the number of
> available vectors on CPU0 is not sufficient to accommodate all queue
> interrupts.

Depending on the system, suspend/resume is really the less interesting
issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
it will be in some sort of rack and is unlikely to ever want to
suspend/resume.

>> Would it be possible to keep the managed facility until a user overrides
>> an affinity assignment? This way if the user didn't touch it, we keep
>> all the perks, and in case the user overrides it, we log the implication
>> so the user is aware?
> 
> A lot of things are possible, the question is whether it makes sense. The
> whole point is to have resources (queues, interrupts etc.) per CPU and have
> them strictly associated.
> 
> Why would you give the user a knob to destroy what you carefully optimized?
> 
> Just because we can and just because users love those knobs or is there any
> real technical reason?

Because the user sometimes knows better based on statically assigned
loads, or the user wants consistency across kernels. It's great that the
system is better at allocating this now, but we also need to allow for a
user to change it. Like anything on Linux, a user wanting to blow off
his/her own foot should be allowed to do so.

Cheers,
Jes


* Re: mlx5 broken affinity
  2017-11-08 16:19                       ` Jes Sorensen
@ 2017-11-08 17:33                         ` Thomas Gleixner
  2017-11-09 10:50                           ` Sagi Grimberg
  2017-11-09 22:03                           ` Jes Sorensen
  0 siblings, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-08 17:33 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: Sagi Grimberg, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On Wed, 8 Nov 2017, Jes Sorensen wrote:
> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> > On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> >> I do agree that the user would lose better cpu online/offline behavior,
> >> but it seems that users want to still have some control over the IRQ
> >> affinity assignments even if they lose this functionality.
> > 
> > Depending on the machine and the number of queues this might even result in
> > completely losing the ability to suspend/hibernate because the number of
> > available vectors on CPU0 is not sufficient to accommodate all queue
> > interrupts.
> 
> Depending on the system, suspend/resume is really the less interesting
> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
> it will be in some sort of rack and is unlikely to ever want to
> suspend/resume.

The discussions with Intel about that tell a different story and cpu
online/offline for power management purposes is - while debatable - widely
used.

That's where the whole idea for managed affinities originated, along
with avoiding the affinity hint and affinity notifier machinery, which
creates more problems than it solves.

> >> Would it be possible to keep the managed facility until a user overrides
> >> an affinity assignment? This way if the user didn't touch it, we keep
> >> all the perks, and in case the user overrides it, we log the implication
> >> so the user is aware?
> > 
> > A lot of things are possible, the question is whether it makes sense. The
> > whole point is to have resources (queues, interrupts etc.) per CPU and have
> > them strictly associated.
> > 
> > Why would you give the user a knob to destroy what you carefully optimized?
> > 
> > Just because we can and just because users love those knobs or is there any
> > real technical reason?
> 
> Because the user sometimes knows better based on statically assigned
> loads, or the user wants consistency across kernels. It's great that the
> system is better at allocating this now, but we also need to allow for a
> user to change it. Like anything on Linux, a user wanting to blow off
> his/her own foot should be allowed to do so.

That's fine, but that's not what the managed affinity facility provides. If
you want to leverage the spread mechanism, but avoid the managed part, then
this is a different story and we need to figure out how to provide that
without breaking the managed side of it.

As I said it's possible, but I vehemently disagree that this is a bug in
the core code, as it was claimed several times in this thread.

The real issue is that the driver was converted to something which was
expected to behave differently. That's hardly a bug in the core code, at
most it's a documentation problem.

Thanks,

	tglx


* Re: mlx5 broken affinity
  2017-11-08 16:13                           ` Jens Axboe
@ 2017-11-09 10:09                             ` Christoph Hellwig
  2017-11-09 15:18                               ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2017-11-09 10:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Laight, 'Sagi Grimberg',
	Thomas Gleixner, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
> There are numerous valid reasons to be able to set the affinity, for
> both nics and block drivers. It's great that the kernel has a predefined
> layout that works well, but users do need the flexibility to be able to
> reconfigure affinities, to suit their needs.

Managed interrupts are about more than just setting affinities - they
bind a vector (which means a queue) to a specific cpu or set of cpus.
This is extremely useful to deal with hotplug issues and also makes
life a lot easier in general.
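
As a hedged illustration of that binding (not any specific driver):
build a cpu -> queue map once from the managed masks, so a queue is
always processed on the CPU its vector is bound to:

#include <linux/cpumask.h>
#include <linux/pci.h>

static void example_build_cpu_to_queue(struct pci_dev *pdev, int nr_queues,
				       unsigned int *cpu_to_queue)
{
	int q, cpu;

	for (q = 0; q < nr_queues; q++) {
		const struct cpumask *mask = pci_irq_get_affinity(pdev, q);

		if (!mask)
			continue;
		for_each_cpu(cpu, mask)
			cpu_to_queue[cpu] = q;
	}
}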

> But that particular case is completely orthogonal to whether or not we
> should allow the user to reconfigure. The answer to that is clearly YES,
> and we should ensure that this is possible.

And why is the answer yes? If the answer is YES it means we need explicit
boilerplate code to deal with cpu hotplug in every driver not using
managed interrupts (even if optionally), so it is a non-trivial tradeoff. 


* Re: mlx5 broken affinity
  2017-11-08 17:33                         ` Thomas Gleixner
@ 2017-11-09 10:50                           ` Sagi Grimberg
  2017-11-09 14:19                             ` Thomas Gleixner
  2017-11-09 15:19                             ` Jens Axboe
  2017-11-09 22:03                           ` Jes Sorensen
  1 sibling, 2 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-09 10:50 UTC (permalink / raw)
  To: Thomas Gleixner, Jes Sorensen
  Cc: Tariq Toukan, Saeed Mahameed, Networking, Leon Romanovsky,
	Saeed Mahameed, Kernel Team, Christoph Hellwig

Thomas,

>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot should be allowed to do so.
> 
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
> 
> As I said it's possible, but I vehemently disagree that this is a bug in
> the core code, as it was claimed several times in this thread.
> 
> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

I disagree here; this is not a device-specific discussion. The question
of exposing an IRQ affinity assignment knob to user-space holds for every
device (NICs, HBAs and the like). The same issue could have been raised on
any other device using managed affinitization (and we have quite a few
of those now). I can't see any reason why it's OK for device X to use it
while it's not OK for device Y to use it.

If the resolution is "YES we must expose a knob to user-space" then the
managed facility is unusable in its current form for *any* device. If
the answer is "NO, user-space can't handle all the stuff the kernel can"
then we should document it. This is really device independent.


* Re: mlx5 broken affinity
  2017-11-09 10:50                           ` Sagi Grimberg
@ 2017-11-09 14:19                             ` Thomas Gleixner
  2017-11-09 15:21                               ` Jens Axboe
  2017-11-09 16:01                               ` mlx5 broken affinity Sagi Grimberg
  2017-11-09 15:19                             ` Jens Axboe
  1 sibling, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 14:19 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On Thu, 9 Nov 2017, Sagi Grimberg wrote:
> Thomas,
> 
> > > Because the user sometimes knows better based on statically assigned
> > > loads, or the user wants consistency across kernels. It's great that the
> > > system is better at allocating this now, but we also need to allow for a
> > > user to change it. Like anything on Linux, a user wanting to blow off
> > > his/her own foot should be allowed to do so.
> > 
> > That's fine, but that's not what the managed affinity facility provides. If
> > you want to leverage the spread mechanism, but avoid the managed part, then
> > this is a different story and we need to figure out how to provide that
> > without breaking the managed side of it.
> > 
> > As I said it's possible, but I vehemently disagree that this is a bug in
> > the core code, as it was claimed several times in this thread.
> > 
> > The real issue is that the driver was converted to something which was
> > expected to behave differently. That's hardly a bug in the core code, at
> > most it's a documentation problem.
> 
> I disagree here; this is not a device-specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and the like). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
>
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If the
> answer is "NO, user-space can't handle all the stuff the kernel can" then
> we should document it. This is really device independent.

The early discussion of the managed facility came to the conclusion that it
will manage this stuff completely to allow fixed association of 'queue /
interrupt / corresponding memory' to a single CPU or a set of CPUs. That
removes a lot of 'affinity' handling magic from the driver and utilizes the
hardware in a sensible way. That was not my decision, really. It surely
made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt /
corresponding memory' association and get rid of all the 'affinity' dance
in your driver or not.

If you answer that question with 'yes' then the consequence is that there
is no knob.

If you answer that question with 'no' then you should not use
the managed facility in the first place and if you need parts of that
functionality then this needs to be added to the core code _before_ a
driver gets converted and not afterwards.

It's not my problem if people decide to use this and then trip over the
limitations after the changes hit the tree. This could have been figured
out before even a single patch was posted.

Now you try to blame the people who implemented the managed affinity stuff
for the wreckage, which was created by people who changed drivers to use
it. Nice try.

Thanks,

	tglx


* Re: mlx5 broken affinity
  2017-11-08  7:27                       ` Sagi Grimberg
  2017-11-08 12:21                         ` David Laight
@ 2017-11-09 15:08                         ` Saeed Mahameed
  2017-11-09 15:40                           ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Saeed Mahameed @ 2017-11-09 15:08 UTC (permalink / raw)
  To: Sagi Grimberg, Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Kernel Team, Christoph Hellwig

On Wed, 2017-11-08 at 09:27 +0200, Sagi Grimberg wrote:
> > Depending on the machine and the number of queues this might even
> > result in completely losing the ability to suspend/hibernate because
> > the number of available vectors on CPU0 is not sufficient to
> > accommodate all queue interrupts.
> > 
> > > Would it be possible to keep the managed facility until a user
> > > overrides an affinity assignment? This way if the user didn't touch
> > > it, we keep all the perks, and in case the user overrides it, we log
> > > the implication so the user is aware?
> > 
> > A lot of things are possible, the question is whether it makes sense.
> > The whole point is to have resources (queues, interrupts etc.) per CPU
> > and have them strictly associated.
> 
> Not arguing here.
> 
> > Why would you give the user a knob to destroy what you carefully
> > optimized?
> 
> Well, looks like someone relies on this knob, the question is if he is
> doing something better for his workload. I don't know, it's really up
> to the user to say.
> 
> > Just because we can and just because users love those knobs or is
> > there any real technical reason?
> 
> Again, I think Jes or others can provide more information.

Sagi, I believe Jes is not trying to argue about what initial affinity
values you give to the driver. We have a very critical regression that
is afflicting live systems today and common tools that already exist
in various distros, such as irqbalance, which solely depends on the
smp_affinity sysfs entry, which is now not writable due to this
regression. Please see
https://github.com/Irqbalance/irqbalance/blob/master/activate.c

Some users would like to have their network traffic handled on some
cores and free up other cores for other purposes; you just can't take
that away from them.

If reverting the mlx5 patches would solve the issue, I am afraid that is
the solution I am going to go with, until the regression is fixed.


* Re: mlx5 broken affinity
  2017-11-09 10:09                             ` Christoph Hellwig
@ 2017-11-09 15:18                               ` Jens Axboe
  0 siblings, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 15:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Laight, 'Sagi Grimberg',
	Thomas Gleixner, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team

On 11/09/2017 03:09 AM, Christoph Hellwig wrote:
> On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
>> There are numerous valid reasons to be able to set the affinity, for
>> both nics and block drivers. It's great that the kernel has a predefined
>> layout that works well, but users do need the flexibility to be able to
>> reconfigure affinities, to suit their needs.
> 
> Managed interrupts are about more than just setting affinities - they
> bind a vector (which means a queue) to a specific cpu or set of cpus.
> This is extremely useful to deal with hotplug issues and also makes
> life a lot easier in general.

I know why it was done.

>> But that particular case is completely orthogonal to whether or not we
>> should allow the user to reconfigure. The answer to that is clearly YES,
>> and we should ensure that this is possible.
> 
> And why is the answer yes? If the answer is YES it means we need explicit
> boilerplate code to deal with cpu hotplug in every driver not using
> managed interrupts (even if optionally), so it is a non-trivial tradeoff. 

The answer is yes because, as with any other kernel-level policy, there
are situations where it isn't the optimal solution. I ran into this myself
with nvme recently, and I started cursing the managed setup for that
very reason. You can even argue this is a regression, since we used
to be able to move things around as we saw fit. If the answer is
boilerplate code to make it feasible, then we need that boilerplate
code.

-- 
Jens Axboe


* Re: mlx5 broken affinity
  2017-11-09 10:50                           ` Sagi Grimberg
  2017-11-09 14:19                             ` Thomas Gleixner
@ 2017-11-09 15:19                             ` Jens Axboe
  1 sibling, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 15:19 UTC (permalink / raw)
  To: Sagi Grimberg, Thomas Gleixner, Jes Sorensen
  Cc: Tariq Toukan, Saeed Mahameed, Networking, Leon Romanovsky,
	Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/09/2017 03:50 AM, Sagi Grimberg wrote:
> Thomas,
> 
>>> Because the user sometimes knows better based on statically assigned
>>> loads, or the user wants consistency across kernels. It's great that the
>>> system is better at allocating this now, but we also need to allow for a
>>> user to change it. Like anything on Linux, a user wanting to blow off
>>> his/her own foot should be allowed to do so.
>>
>> That's fine, but that's not what the managed affinity facility provides. If
>> you want to leverage the spread mechanism, but avoid the managed part, then
>> this is a different story and we need to figure out how to provide that
>> without breaking the managed side of it.
>>
>> As I said it's possible, but I vehemently disagree that this is a bug in
>> the core code, as it was claimed several times in this thread.
>>
>> The real issue is that the driver was converted to something which was
>> expected to behave differently. That's hardly a bug in the core code, at
>> most it's a documentation problem.
> 
> I disagree here; this is not a device-specific discussion. The question
> of exposing an IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and the like). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why it's OK for device X to use it
> while it's not OK for device Y to use it.
> 
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If
> the answer is "NO, user-space can't handle all the stuff the kernel can"
> then we should document it. This is really device independent.

Completely agree, and honestly I'm pretty baffled we're even having
to argue this point.

-- 
Jens Axboe


* Re: mlx5 broken affinity
  2017-11-09 14:19                             ` Thomas Gleixner
@ 2017-11-09 15:21                               ` Jens Axboe
  2017-11-09 17:03                                 ` Thomas Gleixner
  2017-11-09 16:01                               ` mlx5 broken affinity Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 15:21 UTC (permalink / raw)
  To: Thomas Gleixner, Sagi Grimberg
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Sagi Grimberg wrote:
>> Thomas,
>>
>>>> Because the user sometimes knows better based on statically assigned
>>>> loads, or the user wants consistency across kernels. It's great that the
>>>> system is better at allocating this now, but we also need to allow for a
>>>> user to change it. Like anything on Linux, a user wanting to blow off
>>>> his/her own foot should be allowed to do so.
>>>
>>> That's fine, but that's not what the managed affinity facility provides. If
>>> you want to leverage the spread mechanism, but avoid the managed part, then
>>> this is a different story and we need to figure out how to provide that
>>> without breaking the managed side of it.
>>>
>>> As I said it's possible, but I vehemently disagree, that this is a bug in
>>> the core code, as it was claimed several times in this thread.
>>>
>>> The real issue is that the driver was converted to something which was
>>> expected to behave differently. That's hardly a bug in the core code, at
>>> most it's a documentation problem.
>>
>> I disagree here; this is not a device-specific discussion. The question
>> of exposing an IRQ affinity assignment knob to user-space holds for
>> every device (NICs, HBAs and the like). The same issue could have been
>> raised on any other device using managed affinitization (and we have
>> quite a few of those now). I can't see any reason why it's OK for
>> device X to use it while it's not OK for device Y to use it.
>>
>> If the resolution is "YES we must expose a knob to user-space" then the
>> managed facility is unusable in its current form for *any* device. If the
>> answer is "NO, user-space can't handle all the stuff the kernel can" then
>> we should document it. This is really device independent.
> 
> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
> 
> The real question is whether you want to have the fixed 'queue / interrupt/
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
> 
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
> 
> If you answer that question with 'no' then you should not use
> the managed facility in the first place and if you need parts of that
> functionality then this needs to be added to the core code _before_ a
> driver gets converted and not afterwards.
> 
> It's not my problem if people decide to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.
> 
> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

If that's the attitude at your end, then I do suggest we just revert the
driver changes. Clearly this isn't going to be productive going forward.

The better solution was to make the managed setup more flexible, but
it doesn't sound like that is going to be viable at all.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 15:08                         ` Saeed Mahameed
@ 2017-11-09 15:40                           ` Sagi Grimberg
  0 siblings, 0 replies; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-09 15:40 UTC (permalink / raw)
  To: Saeed Mahameed, Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Kernel Team, Christoph Hellwig


>> Again, I think Jes or others can provide more information.
> 
> Sagi, I believe Jes is not trying to argue about what initial affinity
> values you give to the driver. We have a very critical regression that
> is afflicting live systems today and common tools that already exist
> in various distros, such as irqbalance, which solely depends on the
> smp_affinity procfs entry, which is now not writable due to this
> regression. Please see
> https://github.com/Irqbalance/irqbalance/blob/master/activate.c
> 
> Some users would like to have their network traffic handled on some
> cores and free up other cores for other purposes; you just can't take
> that away from them.
> 
> If reverting the mlx5 patches would solve the issue, I am afraid that
> is the solution I am going to go with, until the regression is fixed.

Well, I completely agree with you, Saeed. If this is causing a
regression for mlx5 users and we can't find a way for the managed
interface to expose this knob to userspace, I don't see any other
choice either.
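
For reference, the userspace pattern that now fails is essentially the
following (a minimal sketch of what tools like irqbalance boil down to,
not the actual irqbalance code; the helper name is made up):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Write a CPU mask string to /proc/irq/<irq>/smp_affinity. On a
     * managed interrupt this write now fails with -EIO instead of
     * applying the mask.
     */
    static int set_irq_affinity(int irq, const char *mask)
    {
            char path[64];
            int fd, ret = 0;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            fd = open(path, O_WRONLY);
            if (fd < 0)
                    return -errno;
            if (write(fd, mask, strlen(mask)) < 0)
                    ret = -errno;   /* now -EIO on a managed interrupt */
            close(fd);
            return ret;
    }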

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 14:19                             ` Thomas Gleixner
  2017-11-09 15:21                               ` Jens Axboe
@ 2017-11-09 16:01                               ` Sagi Grimberg
  2017-11-09 16:09                                 ` Jens Axboe
  1 sibling, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-09 16:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig


> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
> 
> The real question is whether you want to have the fixed 'queue / interrupt/
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
> 
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
> 
> If you answer that question with 'no' then you should not use
> the managed facility in the first place and if you need parts of that
> functionality then this needs to be added to the core code _before_ a
> driver gets converted and not afterwards.

Point taken.

> It's not my problem if people decide to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.

That's correct, I could have known that, but I didn't, and from your
reply, I understand there is really only a single way forward...

> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

I'm not trying to blame anyone, really. I was just trying to understand
how to move forward with making users happy and still enjoy subsystem
services instead of doing lots of similar things inside the mlx5 driver.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 16:01                               ` mlx5 broken affinity Sagi Grimberg
@ 2017-11-09 16:09                                 ` Jens Axboe
  2017-11-09 17:07                                   ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 16:09 UTC (permalink / raw)
  To: Sagi Grimberg, Thomas Gleixner
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>> Now you try to blame the people who implemented the managed affinity stuff
>> for the wreckage, which was created by people who changed drivers to use
>> it. Nice try.
> 
> I'm not trying to blame anyone, really. I was just trying to understand
> how to move forward with making users happy and still enjoy subsystem
> services instead of doing lots of similar things inside the mlx5 driver.

Exactly. The key here is how we make it work for both cases. But there
has to be a willingness to make the infrastructure work for that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 15:21                               ` Jens Axboe
@ 2017-11-09 17:03                                 ` Thomas Gleixner
  2017-11-09 20:11                                   ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 17:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> If that's the attitude at your end, then I do suggest we just revert the
> driver changes. Clearly this isn't going to be productive going forward.
> 
> The better solution was to make the managed setup more flexible, but
> it doesn't sound like that is going to be viable at all.

That's not true. I indicated several times that we can do that, but not
just by breaking the managed facility.

What I'm arguing against is that the blame is focused on those who
implemented the managed facility with the existing semantics.

I'm still waiting for a proper description of what needs to be changed in
order to make these drivers work again. All I have seen so far is to break
managed interrupts completely and that's not going to happen.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 16:09                                 ` Jens Axboe
@ 2017-11-09 17:07                                   ` Thomas Gleixner
  2017-11-09 20:12                                     ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 17:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Thu, 9 Nov 2017, Jens Axboe wrote:

> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
> >> Now you try to blame the people who implemented the managed affinity stuff
> >> for the wreckage, which was created by people who changed drivers to use
> >> it. Nice try.
> > 
> > I'm not trying to blame anyone, really. I was just trying to understand
> > how to move forward with making users happy and still enjoy subsystem
> > services instead of doing lots of similar things inside the mlx5 driver.
> 
> Exactly. The key here is how we make it work for both cases. But there
> has to be a willingness to make the infrastructure work for that.

I say it one last time: It can be done and I'm willing to help.

Please tell me what exactly you expect and I can have a look at what needs to
be done w/o creating an utter mess.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 17:03                                 ` Thomas Gleixner
@ 2017-11-09 20:11                                   ` Jens Axboe
  2017-11-09 21:23                                     ` Thomas Gleixner
  2017-11-09 21:42                                     ` [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ] Thomas Gleixner
  0 siblings, 2 replies; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 20:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>> If that's the attitude at your end, then I do suggest we just revert the
>> driver changes. Clearly this isn't going to be productive going forward.
>>
>> The better solution was to make the managed setup more flexible, but
>> it doesn't sound like that is going to be viable at all.
> 
> That's not true. I indicated several times that we can do that, but not
> just by breaking the managed facility.
> 
> What I'm arguing against is that the blame is focused on those who
> implemented the managed facility with the existing semantics.
> 
> I'm still waiting for a proper description of what needs to be changed in
> order to make these drivers work again. All I have seen so far is to break
> managed interrupts completely and that's not going to happen.

There's no blame as far as I'm concerned, just frustration that we broke
this and folks apparently not acknowledging that it's a concern.

What used to work was that you could move IRQs around as you wanted to.
That was very useful for custom setups, for tuning, or for isolation
purposes. None of that works now, which is unfortunate.

As a bare minimum, being able to specify a static mask at load time
would be useful. It's still not nearly as useful as it was before
where you could move them at runtime, but at least it's better than
what we have right now.
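
Something along these lines is what I mean (a rough sketch, assuming a
simple string module parameter; nothing like this exists in mlx5 today
and the names are made up):

    #include <linux/module.h>
    #include <linux/cpumask.h>

    /* Hypothetical: let the admin pass a static mask at module load
     * time and parse it into a cpumask which the driver then uses for
     * its initial queue IRQ setup instead of the generic spread.
     */
    static char *irq_affinity_str;
    module_param(irq_affinity_str, charp, 0444);
    MODULE_PARM_DESC(irq_affinity_str,
                     "Static CPU mask applied to queue IRQs at probe");

    static int get_static_affinity(struct cpumask *mask)
    {
            if (!irq_affinity_str)
                    return -ENOENT; /* fall back to generic spreading */
            return cpumask_parse(irq_affinity_str, mask);
    }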

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 17:07                                   ` Thomas Gleixner
@ 2017-11-09 20:12                                     ` Jens Axboe
  2017-11-09 21:25                                       ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 20:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
> 
>> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>>>> Now you try to blame the people who implemented the managed affinity stuff
>>>> for the wreckage, which was created by people who changed drivers to use
>>>> it. Nice try.
>>>
>>> I'm not trying to blame anyone, really. I was just trying to understand
>>> how to move forward with making users happy and still enjoy subsystem
>>> services instead of doing lots of similar things inside the mlx5 driver.
>>
>> Exactly. The key here is how we make it work for both cases. But there
>> has to be a willingness to make the infrastructure work for that.
> 
> I say it one last time: It can be done and I'm willing to help.

It didn't sound like it earlier, but that's good news.

> Please tell me what exactly you expect and I can have a look at what needs to
> be done w/o creating an utter mess.

See the previous email, it should have all the details.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 20:11                                   ` Jens Axboe
@ 2017-11-09 21:23                                     ` Thomas Gleixner
  2017-11-09 21:30                                       ` Jens Axboe
  2017-11-09 21:42                                     ` [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ] Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 21:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> > On Thu, 9 Nov 2017, Jens Axboe wrote:
> >> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> >> If that's the attitude at your end, then I do suggest we just revert the
> >> driver changes. Clearly this isn't going to be productive going forward.
> >>
> >> The better solution was to make the managed setup more flexible, but
> >> it doesn't sound like that is going to be viable at all.
> > 
> > That's not true. I indicated several times that we can do that, but not
> > just by breaking the managed facility.
> > 
> > What I'm arguing against is that the blame is focused on those who
> > implemented the managed facility with the existing semantics.
> > 
> > I'm still waiting for a proper description of what needs to be changed in
> > order to make these drivers work again. All I have seen so far is to break
> > managed interrupts completely and that's not going to happen.
> 
> There's no blame as far as I'm concerned, just frustration that we broke
> this and folks apparently not acknowledging that it's a concern.
> 
> What used to work was that you could move IRQs around as you wanted to.
> That was very useful for custom setups, for tuning, or for isolation
> purposes. None of that works now, which is unfortunate.

Well, it's unfortunate and I can understand your frustration, but I
really have a hard time understanding why these concerns were not
brought up when the whole thing was discussed and in the early stages
of implementation, way before it got merged.

It was not my decision to make it the way it is and I merely try to
prevent hasty and damaging hackery right now.

I'll answer to the technical details in a separate mail.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 20:12                                     ` Jens Axboe
@ 2017-11-09 21:25                                       ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 21:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> > I say it one last time: It can be done and I'm willing to help.
> 
> It didn't sound like it earlier, but that's good news.

Well, I'm equally frustrated by this whole thing, but I certainly never
said that I'm not willing to change anything.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-09 21:23                                     ` Thomas Gleixner
@ 2017-11-09 21:30                                       ` Jens Axboe
  0 siblings, 0 replies; 47+ messages in thread
From: Jens Axboe @ 2017-11-09 21:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On 11/09/2017 02:23 PM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
>>> On Thu, 9 Nov 2017, Jens Axboe wrote:
>>>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>>>> If that's the attitude at your end, then I do suggest we just revert the
>>>> driver changes. Clearly this isn't going to be productive going forward.
>>>>
>>>> The better solution was to make the managed setup more flexible, but
>>>> it doesn't sound like that is going to be viable at all.
>>>
>>> That's not true. I indicated several times that we can do that, but not
>>> just by breaking the managed facility.
>>>
>>> What I'm arguing against is that the blame is focused on those who
>>> implemented the managed facility with the existing semantics.
>>>
>>> I'm still waiting for a proper description of what needs to be changed in
>>> order to make these drivers work again. All I have seen so far is to break
>>> managed interrupts completely and that's not going to happen.
>>
>> There's no blame as far as I'm concerned, just frustration that we broke
>> this and folks apparently not acknowledging that it's a concern.
>>
>> What used to work was that you could move IRQs around as you wanted to.
>> That was very useful for custom setups, for tuning, or for isolation
>> purposes. None of that works now, which is unfortunate.
> 
> Well, it's unfortunate and I can understand your frustration, but I
> really have a hard time understanding why these concerns were not
> brought up when the whole thing was discussed and in the early stages
> of implementation, way before it got merged.
> 
> It was not my decision to make it the way it is and I merely try to
> prevent hasty and damaging hackery right now.

There's no rush, we can take our time to get it right. We won't get this
fixed for 4.14 anyway.

On my end, I don't think the limitation of removing user tweakability was
made clear enough. I'm all for moving common code into helpers and
frameworks, but this currently has limitations that aren't acceptable
imho. This is partly my fault, I should have realized this was the case.

As long as we can move forward in a productive fashion, then that's fine
with me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-09 20:11                                   ` Jens Axboe
  2017-11-09 21:23                                     ` Thomas Gleixner
@ 2017-11-09 21:42                                     ` Thomas Gleixner
  2017-11-10  5:56                                       ` Saeed Mahameed
  2017-11-13 19:20                                       ` Sagi Grimberg
  1 sibling, 2 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-09 21:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

Find below a summary of the technical details, implications and options

What can be done for 4.14?

  We basically have two options: Revert at the driver level or ship as
  is.

  Even if we come up with a quick and dirty hack then it will be too late
  for proper testing before Sunday.


What can be done with some time to work on?

The managed mechanism consists of 3 pieces:

 1) Vector spreading

 2) Managed vector allocation, which becomes a guaranteed reservation in
    4.15 due to the big rework of the vector management code.

    Non-managed interrupts get a best effort reservation to handle the
    CPU unplug vector pressure problem in a sane way.

 3) CPU hotplug management

    If the last CPU in the affinity set goes offline, then the interrupt is
    shut down and restarted when the first CPU in the affinity set comes
    online again. The driver code needs to ensure that the queue associated
    to that interrupt is drained before shutdown and nothing is queued
    there after this point.

So we have options:

1) Initial vector spreading 

 Let the driver use the initial vector spreading. That does only the
 initial affinity setup, but otherwise the interrupts are handled like any
 other non managed interrupt, i.e. best effort reservation, affinity
 settings enabled and CPU unplug breaks affinity and moves them to some
 random other online CPU.

 The simplest solution of all.

2) Allowing a driver supplied mask

 Certainly simple to do, but as you said it's not really a solution. I'm
 not sure whether we want to go there as this is going to be replaced fast
 enough and then create another breakage/frustration level.


3) Affinity override in managed mode

 Doable, but there are a couple of things to think about:

  * How is this enabled?

    - Opt-in by driver
	     
    - Extra sysfs/procfs knob

    We definitely should not enable it by default because that would
    surprise users/drivers which work with the current managed devices and
    rely on the affinity files to be non-writable in managed mode.

  * Is it allowed to set the affinity to offline, but present CPUs?

     In principle yes, because the core management code can do that as well
     at setup time.

  * The affinity setting must fail when it cannot do a guaranteed
    reservation on the new target CPU(s).

     This is not much of a question. That's a matter of fact because
     otherwise the association cannot be guaranteed and things fall apart
     all over the place.

  * When and how is the driver informed about the change?

     When:

       #1 Before the core tries to move the interrupt so it can veto the
	  move if it cannot allocate new resources or whatever is required
	  to operate after the move.
	  
       #2 After the core made the move effective because:

          - The interrupt might be moved from an offline set to an online
            set and needs to be started up, so the related queue must be
            enabled as well.

          - The interrupt might be moved from an online set to an offline
            set, so the queue needs to be drained and disabled.

	  - Resources which have been allocated in the first step must be
            made effective and old resources freed.

     How:

       The existing affinity notification mechanism does not work for this
       and it's a horrible piece of crap which should go away sooner than
       later.

       So we need some sensible way to provide callbacks. Emphasis on
       callbacks as one multiplexing callback is not a good idea.

  * How can the change be made effective?

    When the preliminaries (vector reservation on the new set and, if
    needed, resource allocation in the subsystem) have been done, then
    the actual move can be made.

    But, there is a caveat. x86 is not good at reassociating interrupts on
    the fly except when it sits behind an interrupt remapping unit, but we
    cannot rely on that.

    So the change flow which works for everything would be:

    if (reserve_vectors() < 0)
       return FAIL;

    if (subsys_prep_callback() < 0) {
       release_vectors();
       return FAIL;
    }

    shutdown(irq);

    if (!online(newset))
       return SUCCESS;

    startup(irq);

    subsys_post_callback();
    return SUCCESS;

    subsys_prep_callback() must basically work the same way as the CPU
    offline mechanism and drain the queue and prevent queueing before the
    irq is restarted. If the move results in keeping it shutdown because
    the new set is offline, then the irq will be restarted via the CPU
    hotplug code and the subsystem will be informed about that via the
    hotplug mechanism as well.

    subsys_post_callback() is more or less the same as the hotplug callback
    and restarts the queue. The only difference to the hotplug code as of
    today is that it might need to make previously allocated resources
    effective and free the old ones.

    I named that subsys_*_callback() on purpose because this should be
    handled in a generic way for multiqueue devices and not done at the
    driver level.

  There are some very interesting locking problems to solve, especially
  vs. CPU hotplug, but that should be solvable.
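
  To make the callback idea above concrete, the shape could be something
  like this (purely illustrative; no such interface exists today and all
  names are made up):

    /* Hypothetical per-interrupt move callbacks, invoked by the core
     * around the actual vector move as in the flow above.
     */
    struct irq_affinity_move_ops {
            /* #1: may veto the move, e.g. when resources for the new
             * target CPU(s) cannot be allocated.
             */
            int (*prep_move)(unsigned int irq,
                             const struct cpumask *newmask, void *cookie);
            /* #2: the move took effect; reenable the queue and switch
             * to the newly allocated resources.
             */
            void (*post_move)(unsigned int irq,
                              const struct cpumask *newmask, void *cookie);
    };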

4) Break managed mode when affinity is changed by user

  I'm not going to describe that because this is going to require at least
  as much effort as #2 plus a few extra interesting twists versus vector
  management and CPU hotplug.

5) Other options:

   Maybe ponies, but I have no clue how to implement them.
 

Thoughts?

Thanks,

	tglx


	    

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: mlx5 broken affinity
  2017-11-08 17:33                         ` Thomas Gleixner
  2017-11-09 10:50                           ` Sagi Grimberg
@ 2017-11-09 22:03                           ` Jes Sorensen
  1 sibling, 0 replies; 47+ messages in thread
From: Jes Sorensen @ 2017-11-09 22:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sagi Grimberg, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

On 11/08/2017 12:33 PM, Thomas Gleixner wrote:
> On Wed, 8 Nov 2017, Jes Sorensen wrote:
>> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
>>> Depending on the machine and the number of queues this might even result in
>>> completely losing the ability to suspend/hibernate because the number of
>>> available vectors on CPU0 is not sufficient to accommodate all queue
>>> interrupts.
>>
>> Depending on the system, suspend/resume is really the less interesting
>> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
>> it will be in some sort of rack and is unlikely to ever want to
>> suspend/resume.
> 
> The discussions with Intel about that tell a different story and cpu
> online/offline for power management purposes is - while debatable - widely
> used.

I certainly do not want to dispute that; it just underlines that
different users have different priorities.

>>> A lot of things are possible, the question is whether it makes sense. The
>>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>>> them strictly associated.
>>>
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>>
>>> Just because we can and just because users love those knobs or is there any
>>> real technical reason?
>>
>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot should be allowed to do so.
> 
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
> 
> As I said it's possible, but I vehemently disagree, that this is a bug in
> the core code, as it was claimed several times in this thread.

So it may be that my original message was confusing on this. Currently
IRQ-affinity.txt describes how a user can change that, but it no longer
works if an IRQ is marked managed. That is what I qualified as a bug in
my original posting. If an IRQ is not meant to have its affinity
modified because it is managed, then I think it should also result in
the permissions on the /proc file changing to r--r--r-- rather than
getting EIO when trying to write to it, but that is a minor detail IMHO.
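
For illustration, that would be a small change in kernel/irq/proc.c,
roughly like this (an untested sketch against the code of that era,
elided to the relevant lines):

    /* Sketch: create smp_affinity read-only for managed interrupts so
     * userspace sees r--r--r-- instead of -EIO on write.
     */
    void register_irq_proc(unsigned int irq, struct irq_desc *desc)
    {
            umode_t mode = irqd_affinity_is_managed(&desc->irq_data) ?
                           0444 : 0644;
            ...
            proc_create_data("smp_affinity", mode, desc->dir,
                             &irq_affinity_proc_fops, (void *)(long)irq);
            ...
    }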

None of this is a major showstopper for the next kernel release, as long
as we can work on a longer term fix that satisfies everyone.

What I would ideally like to see is the option for drivers to benefit
from the new allocation scheme without being locked in and also have the
option to go managed if that is right for the given devices. Basically a
best of both worlds situation.

> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

When I hit this issue I read the driver commit stating it was taking
advantage of the new allocation. Not having been part of the discussion
and design of the new core code, I just caught what was described in the
commit message for mlx5.

For now I have just reverted the offending mlx5 patch in our kernel
while we figure out how to resolve it long term.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-09 21:42                                     ` [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ] Thomas Gleixner
@ 2017-11-10  5:56                                       ` Saeed Mahameed
  2017-11-10 13:03                                         ` Thomas Gleixner
  2017-11-13 19:20                                       ` Sagi Grimberg
  1 sibling, 1 reply; 47+ messages in thread
From: Saeed Mahameed @ 2017-11-10  5:56 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Sagi Grimberg, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Kernel Team, Christoph Hellwig

On Thu, 2017-11-09 at 22:42 +0100, Thomas Gleixner wrote:
> Find below a summary of the technical details, implications and
> options
> 
> What can be done for 4.14?
> 
>   We basically have two options: Revert at the driver level or ship
>   as is.
> 

I think we all came to the consensus that this is the only immediate
action to solve the mlx5 regression, so I am going to revert the
driver-level change.

>   Even if we come up with a quick and dirty hack then it will be too
>   late for proper testing before Sunday.
> 
> 
> What can be done with some time to work on?
> 
> The managed mechanism consists of 3 pieces:
> 
>  1) Vector spreading
> 
>  2) Managed vector allocation, which becomes a guaranteed reservation
>     in 4.15 due to the big rework of the vector management code.
> 
>     Non-managed interrupts get a best effort reservation to handle the
>     CPU unplug vector pressure problem in a sane way.
> 
>  3) CPU hotplug management
> 
>     If the last CPU in the affinity set goes offline, then the
>     interrupt is shut down and restarted when the first CPU in the
>     affinity set comes online again. The driver code needs to ensure
>     that the queue associated to that interrupt is drained before
>     shutdown and nothing is queued there after this point.
> 

Well, I can speak for the mlx5 case and most network drivers, where
all of the queues associated with an interrupt move with it, so I
don't think our current driver has this issue. I don't believe there
are network drivers with fixed per-CPU resources, but it is worth
double checking.

Regarding the proposed solutions, any one that will guarantee the
initial managed spreading and still allow the user to modify affinity
via /proc/irq/xyz/smp_affinity will be acceptable, since many tools
and users rely on this /proc entry, e.g. irqbalance.

Thank you Thomas for handling this and for all the detailed information.
-Saeed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-10  5:56                                       ` Saeed Mahameed
@ 2017-11-10 13:03                                         ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-10 13:03 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jens Axboe, Sagi Grimberg, Jes Sorensen, Tariq Toukan,
	Saeed Mahameed, Networking, Leon Romanovsky, Kernel Team,
	Christoph Hellwig

On Fri, 10 Nov 2017, Saeed Mahameed wrote:

> Well, I can speak for the mlx5 case and most network drivers, where
> all of the queues associated with an interrupt move with it, so I
> don't think our current driver has this issue. I don't believe there
> are network drivers with fixed per-CPU resources, but it is worth
> double checking.

At least the intel i40e[vf] drivers use the affinity notifier. Same for
infiniband drivers and scsi. So there seems to be requirements.

> Regarding the proposed solutions, any one that will guarantee the
> initial managed spreading and still allow the user to modify affinity
> via /proc/irq/xyz/smp_affinity will be acceptable, since many tools
> and users rely on this /proc entry, e.g. irqbalance.

Well, any one is not very specific.

This needs to be discussed and agreed on in detail, otherwise we end up
with the same situation 3 months after the changes hit mainline and drivers
got converted over.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-09 21:42                                     ` [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ] Thomas Gleixner
  2017-11-10  5:56                                       ` Saeed Mahameed
@ 2017-11-13 19:20                                       ` Sagi Grimberg
  2017-11-13 20:51                                         ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-13 19:20 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe
  Cc: Jes Sorensen, Tariq Toukan, Saeed Mahameed, Networking,
	Leon Romanovsky, Saeed Mahameed, Kernel Team, Christoph Hellwig

Hi Thomas,

> What can be done with some time to work on?
> 
> The managed mechanism consists of 3 pieces:
> 
>   1) Vector spreading
> 
>   2) Managed vector allocation, which becomes a guaranteed reservation in
>      4.15 due to the big rework of the vector management code.
> 
>      Non-managed interrupts get a best effort reservation to handle the
>      CPU unplug vector pressure problem in a sane way.
> 
>   3) CPU hotplug management
> 
>      If the last CPU in the affinity set goes offline, then the interrupt is
>      shut down and restarted when the first CPU in the affinity set comes
>      online again. The driver code needs to ensure that the queue associated
>      to that interrupt is drained before shutdown and nothing is queued
>      there after this point.
> 
> So we have options:
> 
> 1) Initial vector spreading
> 
>   Let the driver use the initial vector spreading. That does only the
>   initial affinity setup, but otherwise the interrupts are handled like any
>   other non-managed interrupt, i.e. best effort reservation, affinity
>   settings enabled and CPU unplug breaks affinity and moves them to some
>   random other online CPU.
> 
>   The simplest solution of all.
> 
> 2) Allowing a driver supplied mask
> 
>   Certainly simple to do, but as you said it's not really a solution. I'm
>   not sure whether we want to go there as this is going to be replaced fast
>   enough and then create another breakage/frustration level.
> 
> 
> 3) Affinity override in managed mode
> 
>   Doable, but there are a couple of things to think about:

I think that it will be good to shoot for (3). Given that there are
driver requirements, I'd say that the driver will expose up front
whether it can handle it, and if not we fall back to (1).

>    * How is this enabled?
> 
>      - Opt-in by driver
> 	
>      - Extra sysfs/procfs knob
> 
>      We definitely should not enable it by default because that would
>      surprise users/drivers which work with the current managed devices and
>      rely on the affinity files to be non-writable in managed mode.

Do you know if any exist? Would it make sense to have a survey to
understand if anyone relies on it?

From what I've seen so far, drivers that were converted simply worked
with the non-managed facility and didn't have any special code for it.
Perhaps Christoph can comment as he converted most of them.

But if there aren't any drivers that absolutely rely on it, maybe it's
not a bad idea to allow it by default?


>    * When and how is the driver informed about the change?
> 
>       When:
> 
>         #1 Before the core tries to move the interrupt so it can veto the
> 	  move if it cannot allocate new resources or whatever is required
> 	  to operate after the move.

What would the core do if a driver vetoes a move? I'm wondering under
what conditions a driver would be unable to allocate resources for a
move to CPU X but able to allocate for a move to CPU Y.

> 	
>         #2 After the core made the move effective because:
> 
>            - The interrupt might be moved from an offline set to an online
>              set and needs to be started up, so the related queue must be
>              enabled as well.
> 
>            - The interrupt might be moved from an online set to an offline
>              set, so the queue needs to be drained and disabled.
> 
> 	  - Resources which have been allocated in the first step must be
>              made effective and old resources freed.
> 
>       How:
> 
>         The existing affinity notification mechanism does not work for this
>         and it's a horrible piece of crap which should go away sooner than
>         later.
> 
>         So we need some sensible way to provide callbacks. Emphasis on
>         callbacks as one multiplexing callback is not a good idea.
> 
>    * How can the change be made effective?
> 
>      When the preliminaries (vector reservation on the new set and, if
>      needed, resource allocation in the subsystem) have been done, then
>      the actual move can be made.
> 
>      But, there is a caveat. x86 is not good at reassociating interrupts on
>      the fly except when it sits behind an interrupt remapping unit, but we
>      cannot rely on that.
> 
>      So the change flow which works for everything would be:
> 
>      if (reserve_vectors() < 0)
>         return FAIL;
> 
>      if (subsys_prep_callback() < 0) {
>         release_vectors();
>         return FAIL;
>      }
> 
>      shutdown(irq);
> 
>      if (!online(newset))
>         return SUCCESS;
> 
>      startup(irq);
> 
>      subsys_post_callback();
>      return SUCCESS;
> 
>      subsys_prep_callback() must basically work the same way as the CPU
>      offline mechanism and drain the queue and prevent queueing before the
>      irq is restarted. If the move results in keeping it shutdown because
>      the new set is offline, then the irq will be restarted via the CPU
>      hotplug code and the subsystem will be informed about that via the
>      hotplug mechanism as well.
> 
>      subsys_post_callback() is more or less the same as the hotplug callback
>      and restarts the queue. The only difference to the hotplug code as of
>      today is that it might need to make previously allocated resources
>      effective and free the old ones.
> 
>      I named that subsys_*_callback() on purpose because this should be
>      handled in a generic way for multiqueue devices and not done at the
>      driver level.
> 
>    There are some very interesting locking problems to solve, especially
>    vs. CPU hotplug, but that should be solvable.

This looks like it can work to me, but I'm probably not familiar enough
to see the full picture here.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-13 19:20                                       ` Sagi Grimberg
@ 2017-11-13 20:51                                         ` Thomas Gleixner
  2017-11-13 21:13                                           ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-13 20:51 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > 3) Affinity override in managed mode
> > 
> >   Doable, but there are a couple of things to think about:
> 
> I think that it will be good to shoot for (3). Given that there are
> driver requirements, I'd say that the driver will expose up front
> whether it can handle it, and if not we fall back to (1).
> 
> >    * How is this enabled?
> > 
> >      - Opt-in by driver
> > 	
> >      - Extra sysfs/procfs knob
> > 
> >      We definitely should not enable it by default because that would
> >      surprise users/drivers which work with the current managed devices and
> >      rely on the affinity files to be non-writable in managed mode.
> 
> Do you know if any exist? Would it make sense to have a survey to
> understand if anyone relies on it?
> 
> From what I've seen so far, drivers that were converted simply worked
> with the non-managed facility and didn't have any special code for it.
> Perhaps Christoph can comment as he converted most of them.
> 
> But if there aren't any drivers that absolutely rely on it, maybe it's
> not a bad idea to allow it by default?

Sure, I was just cautious and I have to admit that I have no insight into
the driver side details.

> >    * When and how is the driver informed about the change?
> > 
> >       When:
> > 
> >         #1 Before the core tries to move the interrupt so it can veto the
> > 	  move if it cannot allocate new resources or whatever is required
> > 	  to operate after the move.
> 
> What would the core do if a driver vetoes a move?

Return the error code from write_affinity() as it does with any other
failure to set the affinity.

> I'm wondering under what conditions a driver would be unable to allocate
> resources for a move to CPU X but able to allocate for a move to CPU Y.

Node-affine memory allocation is the only thing which comes to my mind,
or some decision not to have a gazillion queues on a single CPU.

> This looks like it can work to me, but I'm probably not familiar enough
> to see the full picture here.

On the interrupt core side this is workable, I just need the input from the
driver^Wsubsystem side if this can be implemented sanely.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-13 20:51                                         ` Thomas Gleixner
@ 2017-11-13 21:13                                           ` Sagi Grimberg
  2017-11-13 21:33                                             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-13 21:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig


>> Do you know if any exist? Would it make sense to have a survey to
>> understand if anyone relies on it?
>>
>> From what I've seen so far, drivers that were converted simply worked
>> with the non-managed facility and didn't have any special code for it.
>> Perhaps Christoph can comment as he converted most of them.
>>
>> But if there aren't any drivers that absolutely rely on it, maybe it's
>> not a bad idea to allow it by default?
> 
> Sure, I was just cautious and I have to admit that I have no insight into
> the driver side details.

Christoph, feel free to chime in :)

Should I construct an email list of the driver maintainers of the
converted drivers?

>>>     * When and how is the driver informed about the change?
>>>
>>>        When:
>>>
>>>          #1 Before the core tries to move the interrupt so it can veto the
>>> 	  move if it cannot allocate new resources or whatever is required
>>> 	  to operate after the move.
>>
>> What would the core do if a driver vetoes a move?
> 
> Return the error code from write_affinity() as it does with any other
> failure to set the affinity.

OK, so this would mean that the driver queue no longer has a vector,
correct? So are the semantics that it needs to clean up its resources,
or should it expect another callout for that?

>> I'm wondering under what conditions a driver would be unable to allocate
>> resources for a move to CPU X but able to allocate for a move to CPU Y.
> 
> Node-affine memory allocation is the only thing which comes to my mind,
> or some decision not to have a gazillion queues on a single CPU.

Yea, makes sense.

>> This looks like it can work to me, but I'm probably not familiar enough
>> to see the full picture here.
> 
> On the interrupt core side this is workable, I just need the input from the
> driver^Wsubsystem side if this can be implemented sanely.

Can you explain what you mean by "subsystem"? I thought that the
subsystem would be the irq subsystem (which means you are the one to
provide the needed input :) ) and the driver would pass in something
like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
requirements that you listed, and NULL to tell the core to leave it
alone and do what it sees fit (or pass msi_irq_ops with a flag that
means that).

An ops structure is a very common way for drivers to communicate with a
subsystem core.
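
To sketch what I mean (purely hypothetical; msi_irq_ops and this
allocation variant do not exist, the names are made up for
illustration):

    /* Hypothetical driver-supplied callbacks for managed vector moves */
    struct msi_irq_ops {
            /* may veto the move, e.g. if new resources can't be had */
            int (*pre_move)(void *queue_cookie,
                            const struct cpumask *newmask);
            /* move took effect; switch to new resources and reenable */
            void (*post_move)(void *queue_cookie,
                              const struct cpumask *newmask);
    };

    /* Hypothetical variant of pci_alloc_irq_vectors_affinity() which
     * takes the ops; passing NULL keeps today's immutable behaviour.
     */
    int pci_alloc_irq_vectors_ops(struct pci_dev *dev,
                                  unsigned int min_vecs,
                                  unsigned int max_vecs,
                                  unsigned int flags,
                                  const struct irq_affinity *affd,
                                  const struct msi_irq_ops *ops,
                                  void *queue_cookie);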

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-13 21:13                                           ` Sagi Grimberg
@ 2017-11-13 21:33                                             ` Thomas Gleixner
  2017-11-13 21:49                                               ` Sagi Grimberg
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-13 21:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > > >          #1 Before the core tries to move the interrupt so it can veto
> > > > the
> > > > 	  move if it cannot allocate new resources or whatever is required
> > > > 	  to operate after the move.
> > > 
> > > What would the core do if a driver vetoes a move?
> > 
> > Return the error code from write_affinity() as it does with any other
> > failure to set the affinity.
> 
> OK, so this would mean that the driver queue no longer has a vector,
> correct? So are the semantics that it needs to clean up its resources,
> or should it expect another callout for that?

The driver queue still has the old vector, i.e.

echo XXX > /proc/irq/YYY/smp_affinity

     write_irq_affinity(newaffinity)

	newvec = reserve_new_vector();

	ret = subsys_pre_move_callback(...., newaffinity);

	if (ret) {
		drop_new_vector(newvec);
		return ret;
	}

	shutdown(oldvec);
	install(newvec);

	susbsys_post_move_callback(....)

	startup(newvec);

subsys_pre_move_callback()

	ret = do_whatever();
	if (ret)
		return ret;

	/*
	 * Make sure nothing is queued anymore and outstanding
	 * requests are completed. Same as for managed CPU hotplug.
	 */
	stop_and_drain_queue();
	return 0;

subsys_post_move_callback()

	install_new_data();

	/* Reenable queue. Same as for managed CPU hotplug */
	reenable_queue();

	free_old_data();
	return;

Does that clarify the mechanism?

> > > This looks like it can work to me, but I'm probably not familiar enough
> > > to see the full picture here.
> > 
> > On the interrupt core side this is workable, I just need the input from the
> > driver^Wsubsystem side if this can be implemented sanely.
> 
> Can you explain what you mean by "subsystem"? I thought that the
> subsystem would be the irq subsystem (which means you are the one to
> provide the needed input :) ) and the driver would pass in something
> like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> requirements that you listed, and NULL to tell the core to leave it
> alone and do what it sees fit (or pass msi_irq_ops with a flag that
> means that).
> 
> An ops structure is a very common way for drivers to communicate with
> a subsystem core.

So if you look at the above pseudo code then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking.

Those subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver.

That's certainly out of the scope what the generic interrupt code can do :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-13 21:33                                             ` Thomas Gleixner
@ 2017-11-13 21:49                                               ` Sagi Grimberg
  2017-11-14 10:15                                                 ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Sagi Grimberg @ 2017-11-13 21:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig


>> Can you explain what you mean by "subsystem"? I thought that the
>> subsystem would be the irq subsystem (which means you are the one to
>> provide the needed input :) ) and the driver would pass in something
>> like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
>> requirements that you listed, and NULL to tell the core to leave it
>> alone and do what it sees fit (or pass msi_irq_ops with a flag that
>> means that).
>>
>> An ops structure is a very common way for drivers to communicate with
>> a subsystem core.
> 
> So if you look at the above pseudo code then the subsys_*_move_callbacks
> are probably subsystem specific, i.e. block or networking.
> 
> Those subsystem callbacks might either handle it at the subsystem level
> directly or call into the particular driver.

Personally I do not think that integrating this into the
networking/block stacks at that level is going to work. Drivers don't
communicate any information on what they do with MSI(-X) vectors (and
I'm not sure they should).

I think that a driver using the managed facilities is at its own
discretion here: it interacts with the irq subsystem to allocate the
vectors, so it makes sense to me that it should pass in the ops directly
and handle the callouts itself.
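
Just to illustrate the shape I have in mind (all names here are made up
for the example, this is not an existing API):

	struct msi_irq_ops {
		/*
		 * Quiesce the queue(s) behind this vector: no new
		 * submissions, outstanding requests completed. A non-zero
		 * return vetoes the affinity change.
		 */
		int (*pre_move)(void *cookie, const struct cpumask *newmask);
		/* The new vector is installed, re-enable the queue(s) */
		void (*post_move)(void *cookie);
	};

	/* hypothetical ops-aware variant of pci_alloc_irq_vectors() */
	nvec = pci_alloc_irq_vectors_ops(pdev, min_vecs, max_vecs,
					 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					 &mlx5_msi_irq_ops, priv);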

> That's certainly out of the scope of what the generic interrupt code can do :)

Don't get me wrong, I'm all for adding useful helpers in net/ and block/
if drivers need some services from the core subsystem or if drivers end
up sharing lots of logic.

For example, drivers already take care of draining queues when device
hot unplug is triggered, so they must be able to get it right for
IRQ vector migration (at least I assume).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]
  2017-11-13 21:49                                               ` Sagi Grimberg
@ 2017-11-14 10:15                                                 ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2017-11-14 10:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Jes Sorensen, Tariq Toukan, Saeed Mahameed,
	Networking, Leon Romanovsky, Saeed Mahameed, Kernel Team,
	Christoph Hellwig

On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > > Can you explain what you mean by "subsystem"? I thought that the
> > > subsystem would be the irq subsystem (which means you are the one to
> > > provide the needed input :) ) and the driver would pass in something
> > > like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> > > requirements that you listed, and NULL to tell the core to leave it alone
> > > and do what it sees fit (or pass msi_irq_ops with a flag that means that).
> > > 
> > > An ops structure is a very common way for drivers to communicate with a
> > > subsystem core.
> > 
> > So if you look at the above pseudocode, the subsys_*_move_callbacks
> > are probably subsystem-specific, i.e. block or networking.
> > 
> > Those subsystem callbacks might either handle it at the subsystem level
> > directly or call into the particular driver.
> 
> Personally I do not think that integrating this to networking/block
> stacks in that level is going to work. drivers don't communicate any
> information on what they do with msi(x) vectors (and I'm not sure they
> should).

The callback does not need to transport any information about the interrupt
itself. The driver associates the interrupt with a queue, and that queue _is_
known at the subsystem level. So a queue cookie can be handed in along with
the callback information.

I think that the subsystem, e.g. block, should handle parts of this, at
least the generic, driver-independent management of the queue, i.e.

    - Marking the queue as closed so no new requests can be queued
      anymore. Having this at the driver level seems to be a layering
      violation.
    
    - Making sure that the outstanding requests are completed. That needs
      driver support for sure, but OTOH the subsystem should know about
      outstanding requests.

    - Marking the queue as open again. Again that's mostly subsystem level
      (a rough sketch follows below).
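
A rough sketch of what that could look like on the block side (the
wrapper names are invented; blk_mq_freeze_queue()/blk_mq_unfreeze_queue()
are the existing primitives I have in mind, with the request queue handed
in as the cookie):

	static int blk_mq_irq_pre_move(void *cookie, const struct cpumask *newmask)
	{
		struct request_queue *q = cookie;

		/*
		 * Close the queue and wait until all outstanding
		 * requests have completed.
		 */
		blk_mq_freeze_queue(q);
		return 0;
	}

	static void blk_mq_irq_post_move(void *cookie)
	{
		struct request_queue *q = cookie;

		/* The vector has moved, open the queue again */
		blk_mq_unfreeze_queue(q);
	}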

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread

Thread overview: 47+ messages
2017-11-01 16:19 mlx5 broken affinity Jes Sorensen
2017-11-01 17:21 ` Sagi Grimberg
2017-11-01 18:20   ` Jes Sorensen
2017-11-01 22:41     ` Saeed Mahameed
2017-11-01 23:02       ` Jes Sorensen
2017-11-02  8:28         ` Tariq Toukan
2017-11-02 10:08           ` Sagi Grimberg
2017-11-02 12:13             ` Andrew Lunn
2017-11-02 14:48             ` Jes Sorensen
2017-11-02 16:14               ` Sagi Grimberg
2017-11-02 17:13                 ` Jes Sorensen
2017-11-02 18:10                 ` Thomas Gleixner
2017-11-05  8:36                   ` Sagi Grimberg
2017-11-07 15:07                     ` Thomas Gleixner
2017-11-08  7:27                       ` Sagi Grimberg
2017-11-08 12:21                         ` David Laight
2017-11-08 16:13                           ` Jens Axboe
2017-11-09 10:09                             ` Christoph Hellwig
2017-11-09 15:18                               ` Jens Axboe
2017-11-09 15:08                         ` Saeed Mahameed
2017-11-09 15:40                           ` Sagi Grimberg
2017-11-08 16:19                       ` Jes Sorensen
2017-11-08 17:33                         ` Thomas Gleixner
2017-11-09 10:50                           ` Sagi Grimberg
2017-11-09 14:19                             ` Thomas Gleixner
2017-11-09 15:21                               ` Jens Axboe
2017-11-09 17:03                                 ` Thomas Gleixner
2017-11-09 20:11                                   ` Jens Axboe
2017-11-09 21:23                                     ` Thomas Gleixner
2017-11-09 21:30                                       ` Jens Axboe
2017-11-09 21:42                                     ` [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ] Thomas Gleixner
2017-11-10  5:56                                       ` Saeed Mahameed
2017-11-10 13:03                                         ` Thomas Gleixner
2017-11-13 19:20                                       ` Sagi Grimberg
2017-11-13 20:51                                         ` Thomas Gleixner
2017-11-13 21:13                                           ` Sagi Grimberg
2017-11-13 21:33                                             ` Thomas Gleixner
2017-11-13 21:49                                               ` Sagi Grimberg
2017-11-14 10:15                                                 ` Thomas Gleixner
2017-11-09 16:01                               ` mlx5 broken affinity Sagi Grimberg
2017-11-09 16:09                                 ` Jens Axboe
2017-11-09 17:07                                   ` Thomas Gleixner
2017-11-09 20:12                                     ` Jens Axboe
2017-11-09 21:25                                       ` Thomas Gleixner
2017-11-09 15:19                             ` Jens Axboe
2017-11-09 22:03                           ` Jes Sorensen
2017-11-02  7:57     ` Sagi Grimberg
