* [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
@ 2018-07-16  8:30 Leon Romanovsky
  2018-07-16 10:23 ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Leon Romanovsky @ 2018-07-16  8:30 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Saeed Mahameed, Steve Wise,
	linux-netdev

From: Leon Romanovsky <leonro@mellanox.com>

The IRQ affinity mask is managed by mlx5_core; however, user-triggered
updates through /proc/irq/<irq#>/smp_affinity were not reflected in
mlx5_ib_get_vector_affinity().

Drop the attempt to use a cached version of the affinity mask in favour
of the value managed by the PCI core.

Fixes: e3ca34880652 ("net/mlx5: Fix build break when CONFIG_SMP=n")
Reported-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 4 +++-
 include/linux/mlx5/driver.h       | 7 -------
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d0834525afe3..1c3584024acb 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5304,8 +5304,10 @@ static const struct cpumask *
 mlx5_ib_get_vector_affinity(struct ib_device *ibdev, int comp_vector)
 {
 	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+	int irq = pci_irq_vector(dev->mdev->pdev,
+				 MLX5_EQ_VEC_COMP_BASE + comp_vector);

-	return mlx5_get_vector_affinity_hint(dev->mdev, comp_vector);
+	return irq_get_affinity_mask(irq);
 }

 /* The mlx5_ib_multiport_mutex should be held when calling this function */
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 0b7daa4a8f84..d3581cd5d517 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1287,11 +1287,4 @@ static inline int mlx5_core_native_port_num(struct mlx5_core_dev *dev)
 enum {
 	MLX5_TRIGGERED_CMD_COMP = (u64)1 << 32,
 };
-
-static inline const struct cpumask *
-mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
-{
-	return dev->priv.irq_info[vector].mask;
-}
-
 #endif /* MLX5_DRIVER_H */


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16  8:30 [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask Leon Romanovsky
@ 2018-07-16 10:23 ` Sagi Grimberg
  2018-07-16 10:30   ` Leon Romanovsky
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 10:23 UTC (permalink / raw)
  To: Leon Romanovsky, Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Saeed Mahameed, Steve Wise,
	linux-netdev

Leon, I'd like to see a tested-by tag for this (at least
until I get some time to test it).

The patch itself looks fine to me.


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 10:23 ` Sagi Grimberg
@ 2018-07-16 10:30   ` Leon Romanovsky
  2018-07-16 14:54     ` Max Gurtovoy
  0 siblings, 1 reply; 35+ messages in thread
From: Leon Romanovsky @ 2018-07-16 10:30 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	Steve Wise, linux-netdev


On Mon, Jul 16, 2018 at 01:23:24PM +0300, Sagi Grimberg wrote:
> Leon, I'd like to see a tested-by tag for this (at least
> until I get some time to test it).

Of course.

Thanks

>
> The patch itself looks fine to me.



* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 10:30   ` Leon Romanovsky
@ 2018-07-16 14:54     ` Max Gurtovoy
  2018-07-16 14:59       ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-16 14:54 UTC (permalink / raw)
  To: Leon Romanovsky, Sagi Grimberg
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	Steve Wise, linux-netdev

Hi,
I've tested this patch and seems problematic at this moment.
maybe this is because of the bug that Steve mentioned in the NVMe 
mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
initiator and I'll run his suggestion as well.
BTW, when I run the blk_mq_map_queues it works for every irq affinity.

On 7/16/2018 1:30 PM, Leon Romanovsky wrote:
> On Mon, Jul 16, 2018 at 01:23:24PM +0300, Sagi Grimberg wrote:
>> Leon, I'd like to see a tested-by tag for this (at least
>> until I get some time to test it).
> 
> Of course.
> 
> Thanks
> 
>>
>> The patch itself looks fine to me.


-Max.


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 14:54     ` Max Gurtovoy
@ 2018-07-16 14:59       ` Sagi Grimberg
  2018-07-16 16:46         ` Max Gurtovoy
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 14:59 UTC (permalink / raw)
  To: Max Gurtovoy, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	Steve Wise, linux-netdev


> Hi,
> I've tested this patch and seems problematic at this moment.

Problematic how? What are you seeing?

> maybe this is because of the bug that Steve mentioned in the NVMe 
> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
> initiator and I'll run his suggestion as well.

Is your device irq affinity linear?

> BTW, when I run the blk_mq_map_queues it works for every irq affinity.

But it's probably not aligned to the device vector affinity.


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 14:59       ` Sagi Grimberg
@ 2018-07-16 16:46         ` Max Gurtovoy
  2018-07-16 17:08           ` Steve Wise
  0 siblings, 1 reply; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-16 16:46 UTC (permalink / raw)
  To: Sagi Grimberg, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	Steve Wise, linux-netdev



On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> 
>> Hi,
>> I've tested this patch and seems problematic at this moment.
> 
> Problematic how? what are you seeing?

Connection failures and same error Steve saw:

[Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error 
wo/DNR bit: -16402
[Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18


> 
>> maybe this is because of the bug that Steve mentioned in the NVMe 
>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA 
>> initiator and I'll run his suggestion as well.
> 
> Is your device irq affinity linear?

When it's linear and the balancer is stopped the patch works.

> 
>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
> 
> But its probably not aligned to the device vector affinity.

but I guess it's better in some cases.

I've checked the situation before Leon's patch and set all the vectors 
to CPU 0. In this case (I think that this was the initial report by 
Steve), we use the affinity_hint (Israel's and Saeed's patches where we 
use dev->priv.irq_info[vector].mask) and it worked fine.

Steve,
Can you share your configuration (kernel, HCA, affinity map, connect 
command, lscpu)?
I want to repro it in my lab.

-Max.


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 16:46         ` Max Gurtovoy
@ 2018-07-16 17:08           ` Steve Wise
  2018-07-17  8:46             ` Max Gurtovoy
  0 siblings, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-16 17:08 UTC (permalink / raw)
  To: Max Gurtovoy, Sagi Grimberg, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	linux-netdev

Hey Max:


On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>
>
> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>
>>> Hi,
>>> I've tested this patch and seems problematic at this moment.
>>
>> Problematic how? what are you seeing?
>
> Connection failures and same error Steve saw:
>
> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> wo/DNR bit: -16402
> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>
>
>>
>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>> initiator and I'll run his suggestion as well.
>>
>> Is your device irq affinity linear?
>
> When it's linear and the balancer is stopped the patch works.
>
>>
>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>
>> But its probably not aligned to the device vector affinity.
>
> but I guess it's better in some cases.
>
> I've checked the situation before Leon's patch and set all the vetcors
> to CPU 0. In this case (I think that this was the initial report by
> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> use dev->priv.irq_info[vector].mask) and it worked fine.
>
> Steve,
> Can you share your configuration (kernel, HCA, affinity map, connect
> command, lscpu) ?
> I want to repro it in my lab.
>

- linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
patch so I can change the affinity via procfs. 

- mlx5 MT27700 RoCE card, cxgb4 T62100-CR iWARP card

- The system has 2 numa nodes with 8 real cpus in each == 16 cpus all
online.  HT disabled.

- I'm testing over HW loopback for simplicity, so the node is both the
nvme target and host.  Connecting one device like this: nvme connect -t
rdma -a 172.16.2.1 -n nvme-nullb0

- to reproduce the nvme-rdma bug, just map any two hca cq comp vectors
to the same cpu. 

- lscpu output:

[root@stevo1 linux]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
Stepping:              7
CPU MHz:               3400.057
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              6200.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti
tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

Steve


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-16 17:08           ` Steve Wise
@ 2018-07-17  8:46             ` Max Gurtovoy
  2018-07-17  8:58               ` Leon Romanovsky
  2018-07-17 13:03               ` Steve Wise
  0 siblings, 2 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-17  8:46 UTC (permalink / raw)
  To: Steve Wise, Sagi Grimberg, Leon Romanovsky
  Cc: Doug Ledford, Jason Gunthorpe, RDMA mailing list, Saeed Mahameed,
	linux-netdev



On 7/16/2018 8:08 PM, Steve Wise wrote:
> Hey Max:
> 
> 

Hey,

> On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>>
>>
>> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>>
>>>> Hi,
>>>> I've tested this patch and seems problematic at this moment.
>>>
>>> Problematic how? what are you seeing?
>>
>> Connection failures and same error Steve saw:
>>
>> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
>> wo/DNR bit: -16402
>> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>>
>>
>>>
>>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>>> initiator and I'll run his suggestion as well.
>>>
>>> Is your device irq affinity linear?
>>
>> When it's linear and the balancer is stopped the patch works.
>>
>>>
>>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>>
>>> But its probably not aligned to the device vector affinity.
>>
>> but I guess it's better in some cases.
>>
>> I've checked the situation before Leon's patch and set all the vetcors
>> to CPU 0. In this case (I think that this was the initial report by
>> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
>> use dev->priv.irq_info[vector].mask) and it worked fine.
>>
>> Steve,
>> Can you share your configuration (kernel, HCA, affinity map, connect
>> command, lscpu) ?
>> I want to repro it in my lab.
>>
> 
> - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> patch so I can change the affinity via procfs.

Ohh, now I understand that you were complaining about the affinity 
change not being reflected in mlx5_ib_get_vector_affinity, and not about 
the connect failures when the affinity overlaps (that was working well 
before Leon's patch).
So this is a known issue, since we used a static hint that never changes
from dev->priv.irq_info[vector].mask.

IMO we must fulfil the user's wish to connect to N queues and not reduce 
it because of affinity overlaps. So in order to push Leon's patch we 
must also fix blk_mq_rdma_map_queues to do a best-effort mapping 
according to the affinity and map the rest in a naive way (that way we 
will *always* map all the queues).
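
As a toy illustration of that two-pass idea (plain int arrays and
made-up affinities standing in for cpumasks; a sketch of the intent
only, not a kernel patch):

#include <stdio.h>

#define NR_CPUS    4
#define NR_QUEUES  4
#define UNSET     -1

int main(void)
{
	/* made-up affinities: queues 2 and 3 both only cover cpu 3 */
	const int affinity[NR_QUEUES][NR_CPUS] = {
		{ 1, 0, 0, 0 },
		{ 0, 1, 0, 0 },
		{ 0, 0, 0, 1 },
		{ 0, 0, 0, 1 },
	};
	int mq_map[NR_CPUS] = { UNSET, UNSET, UNSET, UNSET };
	int queue, cpu, mapped;

	/* pass 1: best effort -- each queue takes one free cpu from its
	 * mask, never stealing a cpu already claimed by another queue */
	for (queue = 0; queue < NR_QUEUES; queue++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue][cpu] && mq_map[cpu] == UNSET) {
				mq_map[cpu] = queue;
				break;
			}

	/* pass 2: naive fallback -- a queue still without a cpu takes
	 * any unset cpu, so every queue stays connectable */
	for (queue = 0; queue < NR_QUEUES; queue++) {
		mapped = 0;
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			mapped |= (mq_map[cpu] == queue);
		for (cpu = 0; !mapped && cpu < NR_CPUS; cpu++)
			if (mq_map[cpu] == UNSET) {
				mq_map[cpu] = queue;
				mapped = 1;
			}
	}

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> queue %d\n", cpu, mq_map[cpu]);
	/* all four queues end up with a cpu (cpu 2 picks up queue 3) */
	return 0;
}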

-Max.


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-17  8:46             ` Max Gurtovoy
@ 2018-07-17  8:58               ` Leon Romanovsky
  2018-07-17 10:05                 ` Max Gurtovoy
  2018-07-17 13:03               ` Steve Wise
  1 sibling, 1 reply; 35+ messages in thread
From: Leon Romanovsky @ 2018-07-17  8:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Steve Wise, Sagi Grimberg, Doug Ledford, Jason Gunthorpe,
	RDMA mailing list, Saeed Mahameed, linux-netdev


On Tue, Jul 17, 2018 at 11:46:40AM +0300, Max Gurtovoy wrote:
>
>
> On 7/16/2018 8:08 PM, Steve Wise wrote:
> > Hey Max:
> >
> >
>
> Hey,
>
> > On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
> > >
> > >
> > > On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> > > >
> > > > > Hi,
> > > > > I've tested this patch and seems problematic at this moment.
> > > >
> > > > Problematic how? what are you seeing?
> > >
> > > Connection failures and same error Steve saw:
> > >
> > > [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> > > wo/DNR bit: -16402
> > > [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
> > >
> > >
> > > >
> > > > > maybe this is because of the bug that Steve mentioned in the NVMe
> > > > > mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
> > > > > initiator and I'll run his suggestion as well.
> > > >
> > > > Is your device irq affinity linear?
> > >
> > > When it's linear and the balancer is stopped the patch works.
> > >
> > > >
> > > > > BTW, when I run the blk_mq_map_queues it works for every irq affinity.
> > > >
> > > > But its probably not aligned to the device vector affinity.
> > >
> > > but I guess it's better in some cases.
> > >
> > > I've checked the situation before Leon's patch and set all the vetcors
> > > to CPU 0. In this case (I think that this was the initial report by
> > > Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> > > use dev->priv.irq_info[vector].mask) and it worked fine.
> > >
> > > Steve,
> > > Can you share your configuration (kernel, HCA, affinity map, connect
> > > command, lscpu) ?
> > > I want to repro it in my lab.
> > >
> >
> > - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> > enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> > patch so I can change the affinity via procfs.
>
> ohh, now I understand that you where complaining regarding the affinity
> change reflection to mlx5_ib_get_vector_affinity and not regarding the
> failures on connecting while the affinity overlaps (that is working good
> before Leon's patch).
> So this is a known issue since we used a static hint that never changes
> from dev->priv.irq_info[vector].mask.
>
> IMO we must fulfil the user wish to connect to N queues and not reduce it
> because of affinity overlaps. So in order to push Leon's patch we must
> also fix the blk_mq_rdma_map_queues to do a best effort mapping according
> the affinity and map the rest in naive way (in that way we will *always*
> map all the queues).

Max,

I have no clue what needs to be done in blk_mq*, but my patch only gave
users the ability to reconfigure their affinity mask after the driver is
loaded.

Thanks

>
> -Max.
>



* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-17  8:58               ` Leon Romanovsky
@ 2018-07-17 10:05                 ` Max Gurtovoy
  0 siblings, 0 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-17 10:05 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve Wise, Sagi Grimberg, Doug Ledford, Jason Gunthorpe,
	RDMA mailing list, Saeed Mahameed, linux-netdev



On 7/17/2018 11:58 AM, Leon Romanovsky wrote:
> On Tue, Jul 17, 2018 at 11:46:40AM +0300, Max Gurtovoy wrote:
>>
>>
>> On 7/16/2018 8:08 PM, Steve Wise wrote:
>>> Hey Max:
>>>
>>>
>>
>> Hey,
>>
>>> On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
>>>>
>>>>
>>>> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've tested this patch and seems problematic at this moment.
>>>>>
>>>>> Problematic how? what are you seeing?
>>>>
>>>> Connection failures and same error Steve saw:
>>>>
>>>> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
>>>> wo/DNR bit: -16402
>>>> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-18
>>>>
>>>>
>>>>>
>>>>>> maybe this is because of the bug that Steve mentioned in the NVMe
>>>>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
>>>>>> initiator and I'll run his suggestion as well.
>>>>>
>>>>> Is your device irq affinity linear?
>>>>
>>>> When it's linear and the balancer is stopped the patch works.
>>>>
>>>>>
>>>>>> BTW, when I run the blk_mq_map_queues it works for every irq affinity.
>>>>>
>>>>> But its probably not aligned to the device vector affinity.
>>>>
>>>> but I guess it's better in some cases.
>>>>
>>>> I've checked the situation before Leon's patch and set all the vetcors
>>>> to CPU 0. In this case (I think that this was the initial report by
>>>> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
>>>> use dev->priv.irq_info[vector].mask) and it worked fine.
>>>>
>>>> Steve,
>>>> Can you share your configuration (kernel, HCA, affinity map, connect
>>>> command, lscpu) ?
>>>> I want to repro it in my lab.
>>>>
>>>
>>> - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
>>> enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
>>> patch so I can change the affinity via procfs.
>>
>> ohh, now I understand that you where complaining regarding the affinity
>> change reflection to mlx5_ib_get_vector_affinity and not regarding the
>> failures on connecting while the affinity overlaps (that is working good
>> before Leon's patch).
>> So this is a known issue since we used a static hint that never changes
>> from dev->priv.irq_info[vector].mask.
>>
>> IMO we must fulfil the user wish to connect to N queues and not reduce it
>> because of affinity overlaps. So in order to push Leon's patch we must
>> also fix the blk_mq_rdma_map_queues to do a best effort mapping according
>> the affinity and map the rest in naive way (in that way we will *always*
>> map all the queues).
> 
> Max,
> 
> I have no clue what is needed to do int blq_mq*, but my patch only gave
> to users ability reconfigure their affinity mask after driver is loaded.

Yes, I know, but since the only user of this API is blk-mq-rdma (via 
the nvme_rdma driver), which can't establish a connection with your 
patch, I suggest holding off on pushing it until the mapping in 
blk_mq_rdma_map_queues is fixed.

> 
> Thanks
> 
>>
>> -Max.
>>


* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-17  8:46             ` Max Gurtovoy
  2018-07-17  8:58               ` Leon Romanovsky
@ 2018-07-17 13:03               ` Steve Wise
  2018-07-18 11:38                 ` Sagi Grimberg
  1 sibling, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-17 13:03 UTC (permalink / raw)
  To: 'Max Gurtovoy', 'Sagi Grimberg',
	'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'


> On 7/16/2018 8:08 PM, Steve Wise wrote:
> > Hey Max:
> >
> >
> 
> Hey,
> 
> > On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
> >>
> >>
> >> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> >>>
> >>>> Hi,
> >>>> I've tested this patch and seems problematic at this moment.
> >>>
> >>> Problematic how? what are you seeing?
> >>
> >> Connection failures and same error Steve saw:
> >>
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> >> wo/DNR bit: -16402
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-
> 18
> >>
> >>
> >>>
> >>>> maybe this is because of the bug that Steve mentioned in the NVMe
> >>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
> >>>> initiator and I'll run his suggestion as well.
> >>>
> >>> Is your device irq affinity linear?
> >>
> >> When it's linear and the balancer is stopped the patch works.
> >>
> >>>
> >>>> BTW, when I run the blk_mq_map_queues it works for every irq
> affinity.
> >>>
> >>> But its probably not aligned to the device vector affinity.
> >>
> >> but I guess it's better in some cases.
> >>
> >> I've checked the situation before Leon's patch and set all the vetcors
> >> to CPU 0. In this case (I think that this was the initial report by
> >> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> >> use dev->priv.irq_info[vector].mask) and it worked fine.
> >>
> >> Steve,
> >> Can you share your configuration (kernel, HCA, affinity map, connect
> >> command, lscpu) ?
> >> I want to repro it in my lab.
> >>
> >
> > - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> > enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> > patch so I can change the affinity via procfs.
> 
> ohh, now I understand that you where complaining regarding the affinity
> change reflection to mlx5_ib_get_vector_affinity and not regarding the
> failures on connecting while the affinity overlaps (that is working good
> before Leon's patch).
> So this is a known issue since we used a static hint that never changes
> from dev->priv.irq_info[vector].mask.
> 
> IMO we must fulfil the user wish to connect to N queues and not reduce
> it because of affinity overlaps. So in order to push Leon's patch we
> must also fix the blk_mq_rdma_map_queues to do a best effort mapping
> according the affinity and map the rest in naive way (in that way we
> will *always* map all the queues).

That is what I would expect also.   For example, in my node, where there are
16 cpus, and 2 numa nodes, I observe much better nvmf IOPS performance by
setting up my 16 driver completion event queues such that each is bound to a
node-local cpu.  So I end up with each node-local cpu having 2 queues bound
to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), this
works fine.   I assumed adding ib_get_vector_affinity() would allow this to
all "just work" by default, but I'm running into this connection failure
issue. 

I don't understand exactly what the blk_mq layer is trying to do, but I
assume it has ingress event queues and processing that it is trying to align
with the driver's ingress CQ event handling, so everybody stays on the same
cpu (or at least node).   But something else is going on.  Is there
documentation on how this works somewhere?

Thanks,

Steve


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-17 13:03               ` Steve Wise
@ 2018-07-18 11:38                 ` Sagi Grimberg
  2018-07-18 14:14                   ` Max Gurtovoy
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-18 11:38 UTC (permalink / raw)
  To: Steve Wise, 'Max Gurtovoy', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'


>> IMO we must fulfil the user wish to connect to N queues and not reduce
>> it because of affinity overlaps. So in order to push Leon's patch we
>> must also fix the blk_mq_rdma_map_queues to do a best effort mapping
>> according the affinity and map the rest in naive way (in that way we
>> will *always* map all the queues).
> 
> That is what I would expect also.   For example, in my node, where there are
> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS performance by
> setting up my 16 driver completion event queues such that each is bound to a
> node-local cpu.  So I end up with each nodel-local cpu having 2 queues bound
> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), this
> works fine.   I assumed adding ib_get_vector_affinity() would allow this to
> all "just work" by default, but I'm running into this connection failure
> issue.
> 
> I don't understand exactly what the blk_mq layer is trying to do, but I
> assume it has ingress event queues and processing that it trying to align
> with the drivers ingress cq event handling, so everybody stays on the same
> cpu (or at least node).   But something else is going on.  Is there
> documentation on how this works somewhere?
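
Roughly: blk-mq picks the hw queue for a request out of set->mq_map[],
indexed by the submitting cpu, and blk_mq_rdma_map_queues() fills that
map from the device's per-vector irq affinity, so submission and
completion stay on the same cpu. A sketch of the driver-side wiring
(the controller struct and field names below are made up for
illustration, not the exact nvme-rdma source):

/* Illustrative only: how an RDMA block driver hooks its tag set into
 * blk_mq_rdma_map_queues().  'struct my_ctrl' and its fields are
 * invented for this sketch.
 */
#include <linux/blk-mq.h>
#include <linux/blk-mq-rdma.h>

struct ib_device;

struct my_ctrl {
	struct ib_device *ib_dev;
	/* ... */
};

static int my_map_queues(struct blk_mq_tag_set *set)
{
	struct my_ctrl *ctrl = set->driver_data;

	/* hw queue i inherits the affinity of completion vector 0 + i */
	return blk_mq_rdma_map_queues(set, ctrl->ib_dev, 0);
}

static const struct blk_mq_ops my_mq_ops = {
	/* .queue_rq, .init_hctx, ... omitted */
	.map_queues	= my_map_queues,
};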

Does this (untested) patch help?
--
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..dbe962cb537d 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,34 @@ static int get_first_sibling(unsigned int cpu)
         return cpu;
  }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu)
  {
         unsigned int *map = set->mq_map;
         unsigned int nr_queues = set->nr_hw_queues;
-       unsigned int cpu, first_sibling;
+       unsigned int first_sibling;

-       for_each_possible_cpu(cpu) {
-               /*
-                * First do sequential mapping between CPUs and queues.
-                * In case we still have CPUs to map, and we have some number of
-                * threads per cores then map sibling threads to the same queue for
-                * performace optimizations.
-                */
-               if (cpu < nr_queues) {
+       /*
+        * First do sequential mapping between CPUs and queues.
+        * In case we still have CPUs to map, and we have some number of
+        * threads per cores then map sibling threads to the same queue for
+        * performace optimizations.
+        */
+       if (cpu < nr_queues) {
+               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+       } else {
+               first_sibling = get_first_sibling(cpu);
+               if (first_sibling == cpu)
                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-               } else {
-                       first_sibling = get_first_sibling(cpu);
-                       if (first_sibling == cpu)
-                               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                       else
-                               map[cpu] = map[first_sibling];
-               }
+               else
+                       map[cpu] = map[first_sibling];
         }
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+       for_each_possible_cpu(cpu)
+                blk_mq_map_queue(set, cpu);

         return 0;
  }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..5e91789bea5b 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
         const struct cpumask *mask;
         unsigned int queue, cpu;

+       /* reset all to UINT_MAX */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
                 mask = ib_get_vector_affinity(dev, first_vec + queue);
                 if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
                         set->mq_map[cpu] = queue;
         }

+       for_each_possible_cpu(cpu) {
+               if (set->mq_map[cpu] == UINT_MAX)
+                       blk_mq_map_queue(set, cpu);
+       }
+
         return 0;

  fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..7a9848a82475 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
                                      unsigned long timeout);

  int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu);
  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);

  void blk_mq_quiesce_queue_nowait(struct request_queue *q);


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-18 11:38                 ` Sagi Grimberg
@ 2018-07-18 14:14                   ` Max Gurtovoy
  2018-07-18 14:25                     ` Steve Wise
  2018-07-18 19:29                     ` Steve Wise
  0 siblings, 2 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-18 14:14 UTC (permalink / raw)
  To: Sagi Grimberg, Steve Wise, 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
> 
>>> IMO we must fulfil the user wish to connect to N queues and not reduce
>>> it because of affinity overlaps. So in order to push Leon's patch we
>>> must also fix the blk_mq_rdma_map_queues to do a best effort mapping
>>> according the affinity and map the rest in naive way (in that way we
>>> will *always* map all the queues).
>>
>> That is what I would expect also.   For example, in my node, where 
>> there are
>> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS performance by
>> setting up my 16 driver completion event queues such that each is 
>> bound to a
>> node-local cpu.  So I end up with each nodel-local cpu having 2 queues 
>> bound
>> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), 
>> this
>> works fine.   I assumed adding ib_get_vector_affinity() would allow 
>> this to
>> all "just work" by default, but I'm running into this connection failure
>> issue.
>>
>> I don't understand exactly what the blk_mq layer is trying to do, but I
>> assume it has ingress event queues and processing that it trying to align
>> with the drivers ingress cq event handling, so everybody stays on the 
>> same
>> cpu (or at least node).   But something else is going on.  Is there
>> documentation on how this works somewhere?
> 
> Does this (untested) patch help?

I'm not sure (I'll test it tomorrow) because the issue is the unmapped 
queues and not the cpus.
For example, if the affinity of q=6 and q=12 returned the same cpumask, 
then q=6 will not be mapped and will fail to connect.
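
To make the failure mode concrete, here is a tiny userspace model of
that mapping loop (same overwrite structure as the current
blk_mq_rdma_map_queues(), with made-up masks and plain int arrays
standing in for cpumasks):

#include <stdio.h>

#define NR_CPUS    4
#define NR_QUEUES  4
#define UNMAPPED  -1

int main(void)
{
	/* made-up affinities: queues 2 and 3 share the same cpumask {2,3} */
	const int affinity[NR_QUEUES][NR_CPUS] = {
		{ 1, 0, 0, 0 },
		{ 0, 1, 0, 0 },
		{ 0, 0, 1, 1 },
		{ 0, 0, 1, 1 },
	};
	int mq_map[NR_CPUS] = { UNMAPPED, UNMAPPED, UNMAPPED, UNMAPPED };
	int queue, cpu;

	/* same shape as the current loop: a later queue silently
	 * overwrites every cpu already claimed by an earlier one */
	for (queue = 0; queue < NR_QUEUES; queue++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue][cpu])
				mq_map[cpu] = queue;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> queue %d\n", cpu, mq_map[cpu]);

	/* prints 0, 1, 3, 3: no cpu points at queue 2, so its hctx is
	 * never used and the connect for that queue fails */
	return 0;
}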

> -- 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 3eb169f15842..dbe962cb537d 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -30,29 +30,34 @@ static int get_first_sibling(unsigned int cpu)
>          return cpu;
>   }
> 
> -int blk_mq_map_queues(struct blk_mq_tag_set *set)
> +void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu)
>   {
>          unsigned int *map = set->mq_map;
>          unsigned int nr_queues = set->nr_hw_queues;
> -       unsigned int cpu, first_sibling;
> +       unsigned int first_sibling;
> 
> -       for_each_possible_cpu(cpu) {
> -               /*
> -                * First do sequential mapping between CPUs and queues.
> -                * In case we still have CPUs to map, and we have some 
> number of
> -                * threads per cores then map sibling threads to the 
> same queue for
> -                * performace optimizations.
> -                */
> -               if (cpu < nr_queues) {
> +       /*
> +        * First do sequential mapping between CPUs and queues.
> +        * In case we still have CPUs to map, and we have some number of
> +        * threads per cores then map sibling threads to the same queue for
> +        * performace optimizations.
> +        */
> +       if (cpu < nr_queues) {
> +               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> +       } else {
> +               first_sibling = get_first_sibling(cpu);
> +               if (first_sibling == cpu)
>                          map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> -               } else {
> -                       first_sibling = get_first_sibling(cpu);
> -                       if (first_sibling == cpu)
> -                               map[cpu] = cpu_to_queue_index(nr_queues, 
> cpu);
> -                       else
> -                               map[cpu] = map[first_sibling];
> -               }
> +               else
> +                       map[cpu] = map[first_sibling];
>          }
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_map_queue);
> +
> +int blk_mq_map_queues(struct blk_mq_tag_set *set)
> +{
> +       for_each_possible_cpu(cpu)
> +                blk_mq_map_queue(set, cpu);
> 
>          return 0;
>   }
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> index 996167f1de18..5e91789bea5b 100644
> --- a/block/blk-mq-rdma.c
> +++ b/block/blk-mq-rdma.c
> @@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
>          const struct cpumask *mask;
>          unsigned int queue, cpu;
> 
> +       /* reset all to  */
> +       for_each_possible_cpu(cpu)
> +               set->mq_map[cpu] = UINT_MAX;
> +
>          for (queue = 0; queue < set->nr_hw_queues; queue++) {
>                  mask = ib_get_vector_affinity(dev, first_vec + queue);
>                  if (!mask)
> @@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
>                          set->mq_map[cpu] = queue;
>          }
> 
> +       for_each_possible_cpu(cpu) {
> +               if (set->mq_map[cpu] == UINT_MAX)
> +                       blk_mq_map_queue(set, cpu);
> +       }
> +
>          return 0;
> 
>   fallback:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index e3147eb74222..7a9848a82475 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct 
> request_queue *q,
>                                       unsigned long timeout);
> 
>   int blk_mq_map_queues(struct blk_mq_tag_set *set);
> +void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu);
>   void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int 
> nr_hw_queues);
> 
>   void blk_mq_quiesce_queue_nowait(struct request_queue *q);
> -- 
> 
> It really is still a best effort thing...


* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-18 14:14                   ` Max Gurtovoy
@ 2018-07-18 14:25                     ` Steve Wise
  2018-07-18 19:29                     ` Steve Wise
  1 sibling, 0 replies; 35+ messages in thread
From: Steve Wise @ 2018-07-18 14:25 UTC (permalink / raw)
  To: 'Max Gurtovoy', 'Sagi Grimberg',
	'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



> -----Original Message-----
> From: Max Gurtovoy <maxg@mellanox.com>
> Sent: Wednesday, July 18, 2018 9:14 AM
> To: Sagi Grimberg <sagi@grimberg.me>; Steve Wise
> <swise@opengridcomputing.com>; 'Leon Romanovsky' <leon@kernel.org>
> Cc: 'Doug Ledford' <dledford@redhat.com>; 'Jason Gunthorpe'
> <jgg@mellanox.com>; 'RDMA mailing list' <linux-rdma@vger.kernel.org>;
> 'Saeed Mahameed' <saeedm@mellanox.com>; 'linux-netdev'
> <netdev@vger.kernel.org>
> Subject: Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity
> mask
> 
> 
> 
> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
> >
> >>> IMO we must fulfil the user wish to connect to N queues and not reduce
> >>> it because of affinity overlaps. So in order to push Leon's patch we
> >>> must also fix the blk_mq_rdma_map_queues to do a best effort
> mapping
> >>> according the affinity and map the rest in naive way (in that way we
> >>> will *always* map all the queues).
> >>
> >> That is what I would expect also.   For example, in my node, where
> >> there are
> >> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
> performance by
> >> setting up my 16 driver completion event queues such that each is
> >> bound to a
> >> node-local cpu.  So I end up with each nodel-local cpu having 2 queues
> >> bound
> >> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
> >> this
> >> works fine.   I assumed adding ib_get_vector_affinity() would allow
> >> this to
> >> all "just work" by default, but I'm running into this connection failure
> >> issue.
> >>
> >> I don't understand exactly what the blk_mq layer is trying to do, but I
> >> assume it has ingress event queues and processing that it trying to align
> >> with the drivers ingress cq event handling, so everybody stays on the
> >> same
> >> cpu (or at least node).   But something else is going on.  Is there
> >> documentation on how this works somewhere?
> >
> > Does this (untested) patch help?
> 
> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
> queues and not the cpus.
> for example, if the affinity of q=6 and q=12 returned the same cpumask
> than q=6 will not be mapped and will fail to connect.
> 
> > --
> > diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> > index 3eb169f15842..dbe962cb537d 100644
> > --- a/block/blk-mq-cpumap.c
> > +++ b/block/blk-mq-cpumap.c
> > @@ -30,29 +30,34 @@ static int get_first_sibling(unsigned int cpu)
> >          return cpu;
> >   }
> >
> > -int blk_mq_map_queues(struct blk_mq_tag_set *set)
> > +void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu)
> >   {

There is already a static inline function named blk_mq_map_queue() in block/blk-mq.h.  Did you mean to replace that?  Or is this just a function name conflict?
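
For reference, the existing helper looks roughly like this (quoted from
memory, so treat it as approximate); it maps a cpu to a hardware context
for one request_queue, which is a different job from the per-tag-set
mapping being introduced above:

/* block/blk-mq.h, approximately (4.18 era) */
static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
						     int cpu)
{
	return q->queue_hw_ctx[q->mq_map[cpu]];
}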


> >          unsigned int *map = set->mq_map;
> >          unsigned int nr_queues = set->nr_hw_queues;
> > -       unsigned int cpu, first_sibling;
> > +       unsigned int first_sibling;
> >
> > -       for_each_possible_cpu(cpu) {
> > -               /*
> > -                * First do sequential mapping between CPUs and queues.
> > -                * In case we still have CPUs to map, and we have some
> > number of
> > -                * threads per cores then map sibling threads to the
> > same queue for
> > -                * performace optimizations.
> > -                */
> > -               if (cpu < nr_queues) {
> > +       /*
> > +        * First do sequential mapping between CPUs and queues.
> > +        * In case we still have CPUs to map, and we have some number of
> > +        * threads per cores then map sibling threads to the same queue for
> > +        * performace optimizations.
> > +        */
> > +       if (cpu < nr_queues) {
> > +               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> > +       } else {
> > +               first_sibling = get_first_sibling(cpu);
> > +               if (first_sibling == cpu)
> >                          map[cpu] = cpu_to_queue_index(nr_queues, cpu);
> > -               } else {
> > -                       first_sibling = get_first_sibling(cpu);
> > -                       if (first_sibling == cpu)
> > -                               map[cpu] = cpu_to_queue_index(nr_queues,
> > cpu);
> > -                       else
> > -                               map[cpu] = map[first_sibling];
> > -               }
> > +               else
> > +                       map[cpu] = map[first_sibling];
> >          }
> > +}
> > +EXPORT_SYMBOL_GPL(blk_mq_map_queue);
> > +
> > +int blk_mq_map_queues(struct blk_mq_tag_set *set)
> > +{
> > +       for_each_possible_cpu(cpu)
> > +                blk_mq_map_queue(set, cpu);
> >
> >          return 0;
> >   }
> > diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> > index 996167f1de18..5e91789bea5b 100644
> > --- a/block/blk-mq-rdma.c
> > +++ b/block/blk-mq-rdma.c
> > @@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct
> blk_mq_tag_set *set,
> >          const struct cpumask *mask;
> >          unsigned int queue, cpu;
> >
> > +       /* reset all to  */
> > +       for_each_possible_cpu(cpu)
> > +               set->mq_map[cpu] = UINT_MAX;
> > +
> >          for (queue = 0; queue < set->nr_hw_queues; queue++) {
> >                  mask = ib_get_vector_affinity(dev, first_vec + queue);
> >                  if (!mask)
> > @@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct
> blk_mq_tag_set *set,
> >                          set->mq_map[cpu] = queue;
> >          }
> >
> > +       for_each_possible_cpu(cpu) {
> > +               if (set->mq_map[cpu] == UINT_MAX)
> > +                       blk_mq_map_queue(set, cpu);
> > +       }
> > +
> >          return 0;
> >
> >   fallback:
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index e3147eb74222..7a9848a82475 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct
> > request_queue *q,
> >                                       unsigned long timeout);
> >
> >   int blk_mq_map_queues(struct blk_mq_tag_set *set);
> > +void blk_mq_map_queue(struct blk_mq_tag_set *set, unsigned int cpu);
> >   void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int
> > nr_hw_queues);
> >
> >   void blk_mq_quiesce_queue_nowait(struct request_queue *q);
> > --
> >
> > It really is still a best effort thing...


* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-18 14:14                   ` Max Gurtovoy
  2018-07-18 14:25                     ` Steve Wise
@ 2018-07-18 19:29                     ` Steve Wise
  2018-07-19 14:50                       ` Max Gurtovoy
  1 sibling, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-18 19:29 UTC (permalink / raw)
  To: 'Max Gurtovoy', 'Sagi Grimberg',
	'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

[-- Attachment #1: Type: text/plain, Size: 6432 bytes --]


> 
> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
> >
> >>> IMO we must fulfil the user wish to connect to N queues and not reduce
> >>> it because of affinity overlaps. So in order to push Leon's patch we
> >>> must also fix the blk_mq_rdma_map_queues to do a best effort
> mapping
> >>> according the affinity and map the rest in naive way (in that way we
> >>> will *always* map all the queues).
> >>
> >> That is what I would expect also.   For example, in my node, where
> >> there are
> >> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
> performance by
> >> setting up my 16 driver completion event queues such that each is
> >> bound to a
> >> node-local cpu.  So I end up with each nodel-local cpu having 2 queues
> >> bound
> >> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
> >> this
> >> works fine.   I assumed adding ib_get_vector_affinity() would allow
> >> this to
> >> all "just work" by default, but I'm running into this connection failure
> >> issue.
> >>
> >> I don't understand exactly what the blk_mq layer is trying to do, but I
> >> assume it has ingress event queues and processing that it trying to align
> >> with the drivers ingress cq event handling, so everybody stays on the
> >> same
> >> cpu (or at least node).   But something else is going on.  Is there
> >> documentation on how this works somewhere?
> >
> > Does this (untested) patch help?
> 
> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
> queues and not the cpus.
> for example, if the affinity of q=6 and q=12 returned the same cpumask
> than q=6 will not be mapped and will fail to connect.
>

Attached is a patch that applies cleanly for me.  It has problems if vectors have affinity to more than 1 cpu:

[ 2031.988881] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
[ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
[ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15 vector 15
[ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15 vector 15
[ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15 vector 15
[ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15 vector 15
[ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15 vector 15
[ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
 
But if I set each vector's affinity to a single cpu, all within the same numa node, it now works:

[ 2311.884397] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 2311.890103] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 2311.895659] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 2311.901211] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 2311.906758] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[ 2311.912390] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[ 2311.918014] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[ 2311.923627] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[ 2311.929213] iw_cxgb4: comp_vector 8, irq 211 mask 0x100
[ 2311.934694] iw_cxgb4: comp_vector 9, irq 212 mask 0x200
[ 2311.940163] iw_cxgb4: comp_vector 10, irq 213 mask 0x400
[ 2311.945716] iw_cxgb4: comp_vector 11, irq 214 mask 0x800
[ 2311.951272] iw_cxgb4: comp_vector 12, irq 215 mask 0x1000
[ 2311.956914] iw_cxgb4: comp_vector 13, irq 216 mask 0x2000
[ 2311.962558] iw_cxgb4: comp_vector 14, irq 217 mask 0x4000
[ 2311.968201] iw_cxgb4: comp_vector 15, irq 218 mask 0x8000
[ 2311.973845] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2311.980367] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2311.986885] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2311.993402] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2311.999919] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2312.006436] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2312.012956] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2312.019473] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2312.025991] blk_mq_rdma_map_queues: set->mq_map[8] queue 8 vector 8
[ 2312.032511] blk_mq_rdma_map_queues: set->mq_map[9] queue 9 vector 9
[ 2312.039030] blk_mq_rdma_map_queues: set->mq_map[10] queue 10 vector 10
[ 2312.045801] blk_mq_rdma_map_queues: set->mq_map[11] queue 11 vector 11
[ 2312.052572] blk_mq_rdma_map_queues: set->mq_map[12] queue 12 vector 12
[ 2312.059341] blk_mq_rdma_map_queues: set->mq_map[13] queue 13 vector 13
[ 2312.066111] blk_mq_rdma_map_queues: set->mq_map[14] queue 14 vector 14
[ 2312.072879] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2312.081926] nvme nvme0: new ctrl: NQN "nvme-nullb0", addr 172.16.2.1:4420


[-- Attachment #2: sagi2.patch --]
[-- Type: application/octet-stream, Size: 3258 bytes --]

commit 5128aa16a366c78aaa7a96f3e5760f993e9edb3e
Author: Steve Wise <swise@opengridcomputing.com>
Date:   Wed Jul 18 06:53:32 2018 -0700

    sagi's patch2

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..02b888ff3c10 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue_to_cpu(set, cpu);
 
 	return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..27210105a882 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/* reset all to UINT_MAX */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 			set->mq_map[cpu] = queue;
 	}
 
+	for_each_possible_cpu(cpu) {
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue_to_cpu(set, cpu);
+	}
+
 	return 0;
 
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..19e83d93a1d4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);


* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-18 19:29                     ` Steve Wise
@ 2018-07-19 14:50                       ` Max Gurtovoy
  2018-07-19 18:45                         ` Steve Wise
  0 siblings, 1 reply; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-19 14:50 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 7/18/2018 10:29 PM, Steve Wise wrote:
> 
>>
>> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
>>>
>>>>> IMO we must fulfil the user wish to connect to N queues and not reduce
>>>>> it because of affinity overlaps. So in order to push Leon's patch we
>>>>> must also fix the blk_mq_rdma_map_queues to do a best effort
>> mapping
>>>>> according the affinity and map the rest in naive way (in that way we
>>>>> will *always* map all the queues).
>>>>
>>>> That is what I would expect also.   For example, in my node, where
>>>> there are
>>>> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
>> performance by
>>>> setting up my 16 driver completion event queues such that each is
>>>> bound to a
>>>> node-local cpu.  So I end up with each nodel-local cpu having 2 queues
>>>> bound
>>>> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
>>>> this
>>>> works fine.   I assumed adding ib_get_vector_affinity() would allow
>>>> this to
>>>> all "just work" by default, but I'm running into this connection failure
>>>> issue.
>>>>
>>>> I don't understand exactly what the blk_mq layer is trying to do, but I
>>>> assume it has ingress event queues and processing that it trying to align
>>>> with the drivers ingress cq event handling, so everybody stays on the
>>>> same
>>>> cpu (or at least node).   But something else is going on.  Is there
>>>> documentation on how this works somewhere?
>>>
>>> Does this (untested) patch help?
>>
>> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
>> queues and not the cpus.
>> for example, if the affinity of q=6 and q=12 returned the same cpumask
>> than q=6 will not be mapped and will fail to connect.
>>
> 
> Attached is a patch that applies cleanly for me.  It has problems if vectors have affinity to more than 1 cpu:
> 
> [ 2031.988881] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
> [ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
> [ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
> [ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
> [ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
> [ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
> [ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
> [ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
> [ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
> [ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
> [ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
> [ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
> [ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
> [ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
> [ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
> [ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
> [ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
> [ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
> [ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
> [ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
> [ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
> [ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
> [ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
> [ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
> [ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
> [ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
> [ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15 vector 15
> [ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15 vector 15
> [ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15 vector 15
> [ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15 vector 15
> [ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15 vector 15
> [ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
> [ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18

queue 9 is not mapped (overlap).
please try the below:

diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f..a91d611 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -34,14 +34,55 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
  {
         const struct cpumask *mask;
         unsigned int queue, cpu;
+       bool mapped;

+       /* reset all CPUs mapping */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
+       /* Try to map the queues according to affinity */
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
                 mask = ib_get_vector_affinity(dev, first_vec + queue);
                 if (!mask)
                         goto fallback;

-               for_each_cpu(cpu, mask)
-                       set->mq_map[cpu] = queue;
+               for_each_cpu(cpu, mask) {
+                       if (set->mq_map[cpu] == UINT_MAX) {
+                               set->mq_map[cpu] = queue;
+                               /* Initially each queue is mapped to 1 cpu */
+                               break;
+                       }
+               }
+       }
+
+       /* Map the unmapped queues in a naive way */
+       for (queue = 0; queue < set->nr_hw_queues; queue++) {
+               mapped = false;
+               for_each_possible_cpu(cpu) {
+                       if (set->mq_map[cpu] == queue) {
+                               mapped = true;
+                               break;
+                       }
+               }
+               if (!mapped) {
+                       for_each_possible_cpu(cpu) {
+                               if (set->mq_map[cpu] == UINT_MAX) {
+                                       set->mq_map[cpu] = queue;
+                                       mapped = true;
+                                       break;
+                               }
+                       }
+               }
+               /* This case should never happen */
+               if (WARN_ON_ONCE(!mapped))
+                       goto fallback;
+       }
+
+       /* set all the rest of the CPUs */
+       queue = 0;
+       for_each_possible_cpu(cpu) {
+               if (set->mq_map[cpu] == UINT_MAX)
+                       set->mq_map[cpu] = queue++ % set->nr_hw_queues;
         }

         return 0;

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-19 14:50                       ` Max Gurtovoy
@ 2018-07-19 18:45                         ` Steve Wise
  2018-07-20  1:25                           ` Max Gurtovoy
  0 siblings, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-19 18:45 UTC (permalink / raw)
  To: Max Gurtovoy, 'Sagi Grimberg', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 7/19/2018 9:50 AM, Max Gurtovoy wrote:
>
>
> On 7/18/2018 10:29 PM, Steve Wise wrote:
>>
>>>
>>> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
>>>>
>>>>>> IMO we must fulfil the user wish to connect to N queues and not
>>>>>> reduce
>>>>>> it because of affinity overlaps. So in order to push Leon's patch we
>>>>>> must also fix the blk_mq_rdma_map_queues to do a best effort
>>> mapping
>>>>>> according the affinity and map the rest in naive way (in that way we
>>>>>> will *always* map all the queues).
>>>>>
>>>>> That is what I would expect also.   For example, in my node, where
>>>>> there are
>>>>> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
>>> performance by
>>>>> setting up my 16 driver completion event queues such that each is
>>>>> bound to a
>>>>> node-local cpu.  So I end up with each nodel-local cpu having 2
>>>>> queues
>>>>> bound
>>>>> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
>>>>> this
>>>>> works fine.   I assumed adding ib_get_vector_affinity() would allow
>>>>> this to
>>>>> all "just work" by default, but I'm running into this connection
>>>>> failure
>>>>> issue.
>>>>>
>>>>> I don't understand exactly what the blk_mq layer is trying to do,
>>>>> but I
>>>>> assume it has ingress event queues and processing that it trying
>>>>> to align
>>>>> with the drivers ingress cq event handling, so everybody stays on the
>>>>> same
>>>>> cpu (or at least node).   But something else is going on.  Is there
>>>>> documentation on how this works somewhere?
>>>>
>>>> Does this (untested) patch help?
>>>
>>> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
>>> queues and not the cpus.
>>> for example, if the affinity of q=6 and q=12 returned the same cpumask
>>> than q=6 will not be mapped and will fail to connect.
>>>
>>
>> Attached is a patch that applies cleanly for me.  It has problems if
>> vectors have affinity to more than 1 cpu:
>>
>> [ 2031.988881] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
>> [ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
>> [ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
>> [ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
>> [ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
>> [ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
>> [ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
>> [ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
>> [ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
>> [ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
>> [ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
>> [ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
>> [ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
>> [ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
>> [ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
>> [ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
>> [ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
>> [ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
>> [ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
>> [ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
>> [ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
>> [ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
>> [ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
>> [ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
>> [ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
>> [ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
>> [ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15
>> vector 15
>> [ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15
>> vector 15
>> [ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15
>> vector 15
>> [ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15
>> vector 15
>> [ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15
>> vector 15
>> [ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15
>> vector 15
>> [ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit:
>> -16402
>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>
> queue 9 is not mapped (overlap).
> please try the bellow:
>

This seems to work.  Here are three mapping cases:  each vector on its
own cpu, each vector on 1 cpu within the local numa node, and each
vector having all cpus in its numa node.  The 2nd mapping looks kinda
funny, but I think it achieved what you wanted?  And all the cases
resulted in successful connections.

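(For reference, output like the per-vector mask and mq_map lines below
can be produced with ad-hoc debug prints along these lines -- a rough
sketch for illustration only, not the exact instrumentation used here:)

#include <linux/kernel.h>
#include <linux/blk-mq.h>
#include <rdma/ib_verbs.h>

/*
 * Hypothetical helpers: dump each completion vector's affinity mask and
 * the resulting blk-mq cpu -> queue map after the tag set was mapped.
 */
static void dump_vector_affinity(struct ib_device *ibdev, int nr_vecs)
{
	const struct cpumask *mask;
	int vec;

	for (vec = 0; vec < nr_vecs; vec++) {
		mask = ib_get_vector_affinity(ibdev, vec);
		if (mask)
			pr_info("comp_vector %d mask %*pb\n", vec,
				cpumask_pr_args(mask));
	}
}

static void dump_mq_map(struct blk_mq_tag_set *set, int first_vec)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		pr_info("set->mq_map[%u] queue %u vector %u\n", cpu,
			set->mq_map[cpu], first_vec + set->mq_map[cpu]);
}
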
#### each vector on its own cpu:

[ 3844.756229] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 3844.762104] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 3844.767896] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 3844.773663] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 3844.779405] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[ 3844.785231] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[ 3844.791043] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[ 3844.796835] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[ 3844.802619] iw_cxgb4: comp_vector 8, irq 211 mask 0x1
[ 3844.808133] iw_cxgb4: comp_vector 9, irq 212 mask 0x2
[ 3844.813643] iw_cxgb4: comp_vector 10, irq 213 mask 0x4
[ 3844.819235] iw_cxgb4: comp_vector 11, irq 214 mask 0x8
[ 3844.824817] iw_cxgb4: comp_vector 12, irq 215 mask 0x10
[ 3844.830486] iw_cxgb4: comp_vector 13, irq 216 mask 0x20
[ 3844.836148] iw_cxgb4: comp_vector 14, irq 217 mask 0x40
[ 3844.841804] iw_cxgb4: comp_vector 15, irq 218 mask 0x80
[ 3844.847456] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[ 3844.847457] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[ 3844.847457] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[ 3844.847458] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[ 3844.847459] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[ 3844.847459] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[ 3844.847460] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[ 3844.847461] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[ 3844.847462] blk_mq_rdma_map_queues: set->mq_map[8] queue 0 vector 0
[ 3844.847462] blk_mq_rdma_map_queues: set->mq_map[9] queue 1 vector 1
[ 3844.847463] blk_mq_rdma_map_queues: set->mq_map[10] queue 2 vector 2
[ 3844.847463] blk_mq_rdma_map_queues: set->mq_map[11] queue 3 vector 3
[ 3844.847464] blk_mq_rdma_map_queues: set->mq_map[12] queue 4 vector 4
[ 3844.847465] blk_mq_rdma_map_queues: set->mq_map[13] queue 5 vector 5
[ 3844.847465] blk_mq_rdma_map_queues: set->mq_map[14] queue 6 vector 6
[ 3844.847466] blk_mq_rdma_map_queues: set->mq_map[15] queue 7 vector 7

#### each vector on 1 cpu in its numa node

[ 3932.840244] iw_cxgb4: comp_vector 0, irq 203 mask 0x400
[ 3932.846018] iw_cxgb4: comp_vector 1, irq 204 mask 0x800
[ 3932.851687] iw_cxgb4: comp_vector 2, irq 205 mask 0x1000
[ 3932.857428] iw_cxgb4: comp_vector 3, irq 206 mask 0x2000
[ 3932.863160] iw_cxgb4: comp_vector 4, irq 207 mask 0x4000
[ 3932.868882] iw_cxgb4: comp_vector 5, irq 208 mask 0x8000
[ 3932.874594] iw_cxgb4: comp_vector 6, irq 209 mask 0x100
[ 3932.880213] iw_cxgb4: comp_vector 7, irq 210 mask 0x200
[ 3932.885831] iw_cxgb4: comp_vector 8, irq 211 mask 0x400
[ 3932.891440] iw_cxgb4: comp_vector 9, irq 212 mask 0x800
[ 3932.897043] iw_cxgb4: comp_vector 10, irq 213 mask 0x1000
[ 3932.902812] iw_cxgb4: comp_vector 11, irq 214 mask 0x2000
[ 3932.908580] iw_cxgb4: comp_vector 12, irq 215 mask 0x4000
[ 3932.914338] iw_cxgb4: comp_vector 13, irq 216 mask 0x8000
[ 3932.920096] iw_cxgb4: comp_vector 14, irq 217 mask 0x100
[ 3932.925756] iw_cxgb4: comp_vector 15, irq 218 mask 0x200
[ 3932.931413] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[ 3932.931414] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[ 3932.931415] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[ 3932.931416] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[ 3932.931416] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[ 3932.931417] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[ 3932.931418] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[ 3932.931418] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[ 3932.931419] blk_mq_rdma_map_queues: set->mq_map[8] queue 6 vector 6
[ 3932.931420] blk_mq_rdma_map_queues: set->mq_map[9] queue 7 vector 7
[ 3932.931420] blk_mq_rdma_map_queues: set->mq_map[10] queue 0 vector 0
[ 3932.931421] blk_mq_rdma_map_queues: set->mq_map[11] queue 1 vector 1
[ 3932.931422] blk_mq_rdma_map_queues: set->mq_map[12] queue 2 vector 2
[ 3932.931423] blk_mq_rdma_map_queues: set->mq_map[13] queue 3 vector 3
[ 3932.931423] blk_mq_rdma_map_queues: set->mq_map[14] queue 4 vector 4
[ 3932.931425] blk_mq_rdma_map_queues: set->mq_map[15] queue 5 vector 5

### each vector having all cpus in its numa node

[ 4047.308234] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 4047.314042] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 4047.319745] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 4047.325417] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 4047.331062] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 4047.336703] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 4047.342329] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 4047.347953] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 4047.353578] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 4047.359204] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 4047.364829] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 4047.370544] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 4047.376264] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 4047.381983] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 4047.387695] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 4047.393406] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 4047.399118] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[ 4047.399119] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[ 4047.399120] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[ 4047.399121] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[ 4047.399121] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[ 4047.399122] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[ 4047.399123] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[ 4047.399123] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[ 4047.399124] blk_mq_rdma_map_queues: set->mq_map[8] queue 0 vector 0
[ 4047.399125] blk_mq_rdma_map_queues: set->mq_map[9] queue 1 vector 1
[ 4047.399125] blk_mq_rdma_map_queues: set->mq_map[10] queue 2 vector 2
[ 4047.399126] blk_mq_rdma_map_queues: set->mq_map[11] queue 3 vector 3
[ 4047.399127] blk_mq_rdma_map_queues: set->mq_map[12] queue 4 vector 4
[ 4047.399127] blk_mq_rdma_map_queues: set->mq_map[13] queue 5 vector 5
[ 4047.399128] blk_mq_rdma_map_queues: set->mq_map[14] queue 6 vector 6
[ 4047.399128] blk_mq_rdma_map_queues: set->mq_map[15] queue 7 vector 7

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-19 18:45                         ` Steve Wise
@ 2018-07-20  1:25                           ` Max Gurtovoy
  2018-07-23 16:49                             ` Jason Gunthorpe
  2018-07-24 15:24                             ` Steve Wise
  0 siblings, 2 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-20  1:25 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]


>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>
>> queue 9 is not mapped (overlap).
>> please try the bellow:
>>
> 
> This seems to work.  Here are three mapping cases:  each vector on its
> own cpu, each vector on 1 cpu within the local numa node, and each
> vector having all cpus in its numa node.  The 2nd mapping looks kinda
> funny, but I think it achieved what you wanted?  And all the cases
> resulted in successful connections.
> 

Thanks for testing this.
I slightly improved the handling of the leftover CPUs and actually used 
Sagi's initial proposal.

Sagi,
please review the attached patch and let me know if I should add your 
signature on it.
I'll run some perf tests on it early next week (meanwhile I have run 
login/logout with different num_queues and irq settings successfully).

Steve,
It will be great if you can apply the attached in your system and send 
your findings.

Regards,
Max,

[-- Attachment #2: 0001-blk-mq-fix-RDMA-queue-cpu-mappings-assignments-for-m.patch --]
[-- Type: text/plain, Size: 4759 bytes --]

From 6f7b98f1c43252f459772390c178fc3ad043fc82 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <maxg@mellanox.com>
Date: Thu, 19 Jul 2018 12:42:00 +0000
Subject: [PATCH 1/1] blk-mq: fix RDMA queue/cpu mappings assignments for mq

In order to fulfil the block layer cpu <-> queue mapping, all the
allocated queues and all the possible CPUs should be mapped. First,
try to map the queues according to the affinity hint from the underlying
RDMA device. Second, map all the unmapped queues in a naive way to unmapped
CPUs. In case we still have unmapped CPUs, use the default blk-mq mappings
to map the rest. This way we guarantee that no matter what the underlying
affinity is, all the possible CPUs and all the allocated block queues will be
mapped.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 block/blk-mq-cpumap.c  | 41 ++++++++++++++++++++++++-----------------
 block/blk-mq-rdma.c    | 44 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  1 +
 3 files changed, 67 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f..02b888f 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue_to_cpu(set, cpu);
 
 	return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f..10e4f8a 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -34,14 +34,54 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 {
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
+	bool mapped;
 
+	/* reset all CPUs mapping */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
+	/* Try to map the queues according to affinity */
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
 			goto fallback;
 
-		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+		for_each_cpu(cpu, mask) {
+			if (set->mq_map[cpu] == UINT_MAX) {
+				set->mq_map[cpu] = queue;
+				/* Initially each queue is mapped to 1 cpu */
+				break;
+			}
+		}
+	}
+
+	/* Map the unmapped queues in a naive way */
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		mapped = false;
+		for_each_possible_cpu(cpu) {
+			if (set->mq_map[cpu] == queue) {
+				mapped = true;
+				break;
+			}
+		}
+		if (!mapped) {
+			for_each_possible_cpu(cpu) {
+				if (set->mq_map[cpu] == UINT_MAX) {
+					set->mq_map[cpu] = queue;
+					mapped = true;
+					break;
+				}
+			}
+		}
+		/* This case should never happen */
+		if (WARN_ON_ONCE(!mapped))
+			goto fallback;
+	}
+
+	/* set all the rest of the CPUs */
+	for_each_possible_cpu(cpu) {
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue_to_cpu(set, cpu);
 	}
 
 	return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb..d6cd114 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -282,6 +282,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-20  1:25                           ` Max Gurtovoy
@ 2018-07-23 16:49                             ` Jason Gunthorpe
  2018-07-23 16:53                               ` Max Gurtovoy
  2018-07-24 15:24                             ` Steve Wise
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-07-23 16:49 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Steve Wise, 'Sagi Grimberg', 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'

On Fri, Jul 20, 2018 at 04:25:32AM +0300, Max Gurtovoy wrote:
> 
> >>>[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
> >>
> >>queue 9 is not mapped (overlap).
> >>please try the bellow:
> >>
> >
> >This seems to work.  Here are three mapping cases:  each vector on its
> >own cpu, each vector on 1 cpu within the local numa node, and each
> >vector having all cpus in its numa node.  The 2nd mapping looks kinda
> >funny, but I think it achieved what you wanted?  And all the cases
> >resulted in successful connections.
> >
> 
> Thanks for testing this.
> I slightly improved the setting of the left CPUs and actually used Sagi's
> initial proposal.
> 
> Sagi,
> please review the attached patch and let me know if I should add your
> signature on it.
> I'll run some perf test early next week on it (meanwhile I run login/logout
> with different num_queues successfully and irq settings).
> 
> Steve,
> It will be great if you can apply the attached in your system and send your
> findings.
> 
> Regards,
> Max,

So the conclusion to this thread is that Leon's mlx5 patch needs to wait
until this block-mq patch is accepted?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-23 16:49                             ` Jason Gunthorpe
@ 2018-07-23 16:53                               ` Max Gurtovoy
  2018-07-30 15:47                                 ` Steve Wise
  2018-08-01  5:12                                 ` Sagi Grimberg
  0 siblings, 2 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-23 16:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, 'Sagi Grimberg', 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'



On 7/23/2018 7:49 PM, Jason Gunthorpe wrote:
> On Fri, Jul 20, 2018 at 04:25:32AM +0300, Max Gurtovoy wrote:
>>
>>>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>>>
>>>> queue 9 is not mapped (overlap).
>>>> please try the bellow:
>>>>
>>>
>>> This seems to work.  Here are three mapping cases:  each vector on its
>>> own cpu, each vector on 1 cpu within the local numa node, and each
>>> vector having all cpus in its numa node.  The 2nd mapping looks kinda
>>> funny, but I think it achieved what you wanted?  And all the cases
>>> resulted in successful connections.
>>>
>>
>> Thanks for testing this.
>> I slightly improved the setting of the left CPUs and actually used Sagi's
>> initial proposal.
>>
>> Sagi,
>> please review the attached patch and let me know if I should add your
>> signature on it.
>> I'll run some perf test early next week on it (meanwhile I run login/logout
>> with different num_queues successfully and irq settings).
>>
>> Steve,
>> It will be great if you can apply the attached in your system and send your
>> findings.
>>
>> Regards,
>> Max,
> 
> So the conlusion to this thread is that Leon's mlx5 patch needs to wait
> until this block-mq patch is accepted?

Yes, since nvmf is the only user of this function.
Still waiting for comments on the suggested patch :)

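(For context, nvme-rdma is that user: it wires the helper in through its
blk-mq ->map_queues callback, roughly like the snippet below. Quoted from
memory as a sketch, so double-check it against the actual nvme-rdma source.)

static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
{
	struct nvme_rdma_ctrl *ctrl = set->driver_data;

	/* Delegate the cpu <-> hw queue mapping to the RDMA helper */
	return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
}
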
> 
> Thanks,
> Jason
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-20  1:25                           ` Max Gurtovoy
  2018-07-23 16:49                             ` Jason Gunthorpe
@ 2018-07-24 15:24                             ` Steve Wise
  2018-07-24 20:52                               ` Steve Wise
  1 sibling, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-24 15:24 UTC (permalink / raw)
  To: Max Gurtovoy, 'Sagi Grimberg', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 7/19/2018 8:25 PM, Max Gurtovoy wrote:
>
>>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>>
>>> queue 9 is not mapped (overlap).
>>> please try the bellow:
>>>
>>
>> This seems to work.  Here are three mapping cases:  each vector on its
>> own cpu, each vector on 1 cpu within the local numa node, and each
>> vector having all cpus in its numa node.  The 2nd mapping looks kinda
>> funny, but I think it achieved what you wanted?  And all the cases
>> resulted in successful connections.
>>
>
> Thanks for testing this.
> I slightly improved the setting of the left CPUs and actually used
> Sagi's initial proposal.
>
> Sagi,
> please review the attached patch and let me know if I should add your
> signature on it.
> I'll run some perf test early next week on it (meanwhile I run
> login/logout with different num_queues successfully and irq settings).
>
> Steve,
> It will be great if you can apply the attached in your system and send
> your findings.

Sorry, I got sidetracked.  I'll try to test this today and report back.

Steve.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-24 15:24                             ` Steve Wise
@ 2018-07-24 20:52                               ` Steve Wise
  0 siblings, 0 replies; 35+ messages in thread
From: Steve Wise @ 2018-07-24 20:52 UTC (permalink / raw)
  To: Max Gurtovoy, 'Sagi Grimberg', 'Leon Romanovsky'
  Cc: 'Doug Ledford', 'Jason Gunthorpe',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'


On 7/24/2018 10:24 AM, Steve Wise wrote:
>
> On 7/19/2018 8:25 PM, Max Gurtovoy wrote:
>>>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>>> queue 9 is not mapped (overlap).
>>>> please try the bellow:
>>>>
>>> This seems to work.  Here are three mapping cases:  each vector on its
>>> own cpu, each vector on 1 cpu within the local numa node, and each
>>> vector having all cpus in its numa node.  The 2nd mapping looks kinda
>>> funny, but I think it achieved what you wanted?  And all the cases
>>> resulted in successful connections.
>>>
>> Thanks for testing this.
>> I slightly improved the setting of the left CPUs and actually used
>> Sagi's initial proposal.
>>
>> Sagi,
>> please review the attached patch and let me know if I should add your
>> signature on it.
>> I'll run some perf test early next week on it (meanwhile I run
>> login/logout with different num_queues successfully and irq settings).
>>
>> Steve,
>> It will be great if you can apply the attached in your system and send
>> your findings.
> Sorry, I got side tracked.  I'll try and test this today and report back.
>
> Steve.


###  each vector gets a unique cpu, starting with node-local:

[  754.976577] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[  754.982378] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[  754.988167] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[  754.993935] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[  754.999686] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[  755.005509] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[  755.011318] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[  755.017124] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[  755.022915] iw_cxgb4: comp_vector 8, irq 211 mask 0x1
[  755.028437] iw_cxgb4: comp_vector 9, irq 212 mask 0x2
[  755.033948] iw_cxgb4: comp_vector 10, irq 213 mask 0x4
[  755.039543] iw_cxgb4: comp_vector 11, irq 214 mask 0x8
[  755.045135] iw_cxgb4: comp_vector 12, irq 215 mask 0x10
[  755.050801] iw_cxgb4: comp_vector 13, irq 216 mask 0x20
[  755.056464] iw_cxgb4: comp_vector 14, irq 217 mask 0x40
[  755.062117] iw_cxgb4: comp_vector 15, irq 218 mask 0x80
[  755.067767] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[  755.067767] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[  755.067768] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[  755.067769] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[  755.067769] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[  755.067770] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[  755.067771] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[  755.067772] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[  755.067772] blk_mq_rdma_map_queues: set->mq_map[8] queue 0 vector 0
[  755.067773] blk_mq_rdma_map_queues: set->mq_map[9] queue 1 vector 1
[  755.067774] blk_mq_rdma_map_queues: set->mq_map[10] queue 2 vector 2
[  755.067774] blk_mq_rdma_map_queues: set->mq_map[11] queue 3 vector 3
[  755.067775] blk_mq_rdma_map_queues: set->mq_map[12] queue 4 vector 4
[  755.067775] blk_mq_rdma_map_queues: set->mq_map[13] queue 5 vector 5
[  755.067776] blk_mq_rdma_map_queues: set->mq_map[14] queue 6 vector 6
[  755.067777] blk_mq_rdma_map_queues: set->mq_map[15] queue 7 vector 7

###  each vector gets one cpu within the local node:

[  777.590913] iw_cxgb4: comp_vector 0, irq 203 mask 0x400
[  777.596588] iw_cxgb4: comp_vector 1, irq 204 mask 0x800
[  777.602249] iw_cxgb4: comp_vector 2, irq 205 mask 0x1000
[  777.607984] iw_cxgb4: comp_vector 3, irq 206 mask 0x2000
[  777.613708] iw_cxgb4: comp_vector 4, irq 207 mask 0x4000
[  777.619431] iw_cxgb4: comp_vector 5, irq 208 mask 0x8000
[  777.625142] iw_cxgb4: comp_vector 6, irq 209 mask 0x100
[  777.630762] iw_cxgb4: comp_vector 7, irq 210 mask 0x200
[  777.636373] iw_cxgb4: comp_vector 8, irq 211 mask 0x400
[  777.641982] iw_cxgb4: comp_vector 9, irq 212 mask 0x800
[  777.647583] iw_cxgb4: comp_vector 10, irq 213 mask 0x1000
[  777.653353] iw_cxgb4: comp_vector 11, irq 214 mask 0x2000
[  777.659119] iw_cxgb4: comp_vector 12, irq 215 mask 0x4000
[  777.664877] iw_cxgb4: comp_vector 13, irq 216 mask 0x8000
[  777.670628] iw_cxgb4: comp_vector 14, irq 217 mask 0x100
[  777.676289] iw_cxgb4: comp_vector 15, irq 218 mask 0x200
[  777.681946] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[  777.681947] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[  777.681947] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[  777.681948] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[  777.681948] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[  777.681949] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[  777.681950] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[  777.681950] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[  777.681951] blk_mq_rdma_map_queues: set->mq_map[8] queue 6 vector 6
[  777.681952] blk_mq_rdma_map_queues: set->mq_map[9] queue 7 vector 7
[  777.681952] blk_mq_rdma_map_queues: set->mq_map[10] queue 0 vector 0
[  777.681953] blk_mq_rdma_map_queues: set->mq_map[11] queue 1 vector 1
[  777.681953] blk_mq_rdma_map_queues: set->mq_map[12] queue 2 vector 2
[  777.681954] blk_mq_rdma_map_queues: set->mq_map[13] queue 3 vector 3
[  777.681955] blk_mq_rdma_map_queues: set->mq_map[14] queue 4 vector 4
[  777.681955] blk_mq_rdma_map_queues: set->mq_map[15] queue 5 vector 5


###  each vector gets all cpus within the local node:

[  838.251643] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[  838.257346] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[  838.263038] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[  838.268710] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[  838.274351] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[  838.279985] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[  838.285610] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[  838.291234] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[  838.296865] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[  838.302484] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[  838.308109] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[  838.313827] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[  838.319539] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[  838.325250] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[  838.330963] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[  838.336674] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[  838.342385] blk_mq_rdma_map_queues: set->mq_map[0] queue 8 vector 8
[  838.342385] blk_mq_rdma_map_queues: set->mq_map[1] queue 9 vector 9
[  838.342386] blk_mq_rdma_map_queues: set->mq_map[2] queue 10 vector 10
[  838.342387] blk_mq_rdma_map_queues: set->mq_map[3] queue 11 vector 11
[  838.342387] blk_mq_rdma_map_queues: set->mq_map[4] queue 12 vector 12
[  838.342388] blk_mq_rdma_map_queues: set->mq_map[5] queue 13 vector 13
[  838.342389] blk_mq_rdma_map_queues: set->mq_map[6] queue 14 vector 14
[  838.342390] blk_mq_rdma_map_queues: set->mq_map[7] queue 15 vector 15
[  838.342391] blk_mq_rdma_map_queues: set->mq_map[8] queue 0 vector 0
[  838.342391] blk_mq_rdma_map_queues: set->mq_map[9] queue 1 vector 1
[  838.342392] blk_mq_rdma_map_queues: set->mq_map[10] queue 2 vector 2
[  838.342392] blk_mq_rdma_map_queues: set->mq_map[11] queue 3 vector 3
[  838.342393] blk_mq_rdma_map_queues: set->mq_map[12] queue 4 vector 4
[  838.342394] blk_mq_rdma_map_queues: set->mq_map[13] queue 5 vector 5
[  838.342394] blk_mq_rdma_map_queues: set->mq_map[14] queue 6 vector 6
[  838.342395] blk_mq_rdma_map_queues: set->mq_map[15] queue 7 vector 7

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-23 16:53                               ` Max Gurtovoy
@ 2018-07-30 15:47                                 ` Steve Wise
  2018-07-31 10:00                                   ` Max Gurtovoy
  2018-08-01  5:12                                 ` Sagi Grimberg
  1 sibling, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-07-30 15:47 UTC (permalink / raw)
  To: 'Sagi Grimberg'
  Cc: Max Gurtovoy, Jason Gunthorpe, 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'



On 7/23/2018 11:53 AM, Max Gurtovoy wrote:
>
>
> On 7/23/2018 7:49 PM, Jason Gunthorpe wrote:
>> On Fri, Jul 20, 2018 at 04:25:32AM +0300, Max Gurtovoy wrote:
>>>
>>>>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>>>>
>>>>> queue 9 is not mapped (overlap).
>>>>> please try the bellow:
>>>>>
>>>>
>>>> This seems to work.  Here are three mapping cases:  each vector on its
>>>> own cpu, each vector on 1 cpu within the local numa node, and each
>>>> vector having all cpus in its numa node.  The 2nd mapping looks kinda
>>>> funny, but I think it achieved what you wanted?  And all the cases
>>>> resulted in successful connections.
>>>>
>>>
>>> Thanks for testing this.
>>> I slightly improved the setting of the left CPUs and actually used
>>> Sagi's
>>> initial proposal.
>>>
>>> Sagi,
>>> please review the attached patch and let me know if I should add your
>>> signature on it.
>>> I'll run some perf test early next week on it (meanwhile I run
>>> login/logout
>>> with different num_queues successfully and irq settings).
>>>
>>> Steve,
>>> It will be great if you can apply the attached in your system and
>>> send your
>>> findings.
>>>
>>> Regards,
>>> Max,
>>
>> So the conlusion to this thread is that Leon's mlx5 patch needs to wait
>> until this block-mq patch is accepted?
>
> Yes, since nvmf is the only user of this function.
> Still waiting for comments on the suggested patch :)

Hey Sagi, what do you think of Max's patch?

Max, should you resend this in a form suitable for merging?

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-30 15:47                                 ` Steve Wise
@ 2018-07-31 10:00                                   ` Max Gurtovoy
  0 siblings, 0 replies; 35+ messages in thread
From: Max Gurtovoy @ 2018-07-31 10:00 UTC (permalink / raw)
  To: Steve Wise, 'Sagi Grimberg'
  Cc: Jason Gunthorpe, 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'



On 7/30/2018 6:47 PM, Steve Wise wrote:
> 
> 
> On 7/23/2018 11:53 AM, Max Gurtovoy wrote:
>>
>>
>> On 7/23/2018 7:49 PM, Jason Gunthorpe wrote:
>>> On Fri, Jul 20, 2018 at 04:25:32AM +0300, Max Gurtovoy wrote:
>>>>
>>>>>>> [ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
>>>>>>
>>>>>> queue 9 is not mapped (overlap).
>>>>>> please try the bellow:
>>>>>>
>>>>>
>>>>> This seems to work.  Here are three mapping cases:  each vector on its
>>>>> own cpu, each vector on 1 cpu within the local numa node, and each
>>>>> vector having all cpus in its numa node.  The 2nd mapping looks kinda
>>>>> funny, but I think it achieved what you wanted?  And all the cases
>>>>> resulted in successful connections.
>>>>>
>>>>
>>>> Thanks for testing this.
>>>> I slightly improved the setting of the left CPUs and actually used
>>>> Sagi's
>>>> initial proposal.
>>>>
>>>> Sagi,
>>>> please review the attached patch and let me know if I should add your
>>>> signature on it.
>>>> I'll run some perf test early next week on it (meanwhile I run
>>>> login/logout
>>>> with different num_queues successfully and irq settings).
>>>>
>>>> Steve,
>>>> It will be great if you can apply the attached in your system and
>>>> send your
>>>> findings.
>>>>
>>>> Regards,
>>>> Max,
>>>
>>> So the conlusion to this thread is that Leon's mlx5 patch needs to wait
>>> until this block-mq patch is accepted?
>>
>> Yes, since nvmf is the only user of this function.
>> Still waiting for comments on the suggested patch :)
> 
> Hey Sagi, what do you think of Max's patch?
> 
> Max, should you resend this in a form suitable for merging?

Yes, we already have a small improvement of the naive step,
but first I want to see some feedback from other maintainers as well.

> 
> Thanks,
> 
> Steve.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-07-23 16:53                               ` Max Gurtovoy
  2018-07-30 15:47                                 ` Steve Wise
@ 2018-08-01  5:12                                 ` Sagi Grimberg
  2018-08-01 14:27                                   ` Max Gurtovoy
  1 sibling, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-08-01  5:12 UTC (permalink / raw)
  To: Max Gurtovoy, Jason Gunthorpe
  Cc: Steve Wise, 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

Hi Max,

> Yes, since nvmf is the only user of this function.
> Still waiting for comments on the suggested patch :)
> 

Sorry for the late response (but I'm on vacation so I have
an excuse ;))

I'm thinking that we should avoid trying to find an assignment
when stuff like irqbalance daemon is running and changing
the affinitization.

This extension was made to apply optimal affinity assignment
when the device irq affinity is lined up in a vector per
core.

I'm thinking that when we identify this is not the case, we immediately
fallback to the default mapping.

1. when we get a mask, if its weight != 1, we fallback.
2. if a queue was left unmapped, we fallback.

Maybe something like the following:
--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..1ada6211c55e 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,17 +35,26 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
         const struct cpumask *mask;
         unsigned int queue, cpu;

+       /* reset all CPUs mapping */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
                 mask = ib_get_vector_affinity(dev, first_vec + queue);
                 if (!mask)
                         goto fallback;

-               for_each_cpu(cpu, mask)
-                       set->mq_map[cpu] = queue;
+               if (cpumask_weight(mask) != 1)
+                       goto fallback;
+
+               cpu = cpumask_first(mask);
+               if (set->mq_map[cpu] != UINT_MAX)
+                       goto fallback;
+
+               set->mq_map[cpu] = queue;
         }

         return 0;
-
  fallback:
         return blk_mq_map_queues(set);
  }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-01  5:12                                 ` Sagi Grimberg
@ 2018-08-01 14:27                                   ` Max Gurtovoy
  2018-08-06 19:20                                     ` Steve Wise
  0 siblings, 1 reply; 35+ messages in thread
From: Max Gurtovoy @ 2018-08-01 14:27 UTC (permalink / raw)
  To: Sagi Grimberg, Jason Gunthorpe
  Cc: Steve Wise, 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

[-- Attachment #1: Type: text/plain, Size: 2814 bytes --]



On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
> Hi Max,

Hi,

> 
>> Yes, since nvmf is the only user of this function.
>> Still waiting for comments on the suggested patch :)
>>
> 
> Sorry for the late response (but I'm on vacation so I have
> an excuse ;))

NP :) currently the code works..

> 
> I'm thinking that we should avoid trying to find an assignment
> when stuff like irqbalance daemon is running and changing
> the affinitization.

but this is exactly what Steve complained about and what Leon tried to fix 
(and the fix broke the connection establishment).
If this is the case and we all agree then we're good without Leon's 
patch and without our suggestions.

> 
> This extension was made to apply optimal affinity assignment
> when the device irq affinity is lined up in a vector per
> core.
> 
> I'm thinking that when we identify this is not the case, we immediately
> fallback to the default mapping.
> 
> 1. when we get a mask, if its weight != 1, we fallback.
> 2. if a queue was left unmapped, we fallback.
> 
> Maybe something like the following:

Did you test it? I think it will not work since you need to map all the 
queues and all the CPUs.

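(To make the constraint concrete: blk-mq looks up mq_map[] for every
possible CPU when it picks a hardware queue, so any entry left at
UINT_MAX -- and any hw queue with no CPU assigned at all -- breaks the
mapping. A hypothetical sanity check, just for illustration:)

/* Hypothetical helper: verify every possible CPU ended up on a valid queue. */
static bool blk_mq_rdma_map_is_complete(struct blk_mq_tag_set *set)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		if (set->mq_map[cpu] >= set->nr_hw_queues)
			return false;
	return true;
}
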
> -- 
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> index 996167f1de18..1ada6211c55e 100644
> --- a/block/blk-mq-rdma.c
> +++ b/block/blk-mq-rdma.c
> @@ -35,17 +35,26 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
>          const struct cpumask *mask;
>          unsigned int queue, cpu;
> 
> +       /* reset all CPUs mapping */
> +       for_each_possible_cpu(cpu)
> +               set->mq_map[cpu] = UINT_MAX;
> +
>          for (queue = 0; queue < set->nr_hw_queues; queue++) {
>                  mask = ib_get_vector_affinity(dev, first_vec + queue);
>                  if (!mask)
>                          goto fallback;
> 
> -               for_each_cpu(cpu, mask)
> -                       set->mq_map[cpu] = queue;
> +               if (cpumask_weight(mask) != 1)
> +                       goto fallback;
> +
> +               cpu = cpumask_first(mask);
> +               if (set->mq_map[cpu] != UINT_MAX)
> +                       goto fallback;
> +
> +               set->mq_map[cpu] = queue;
>          }
> 
>          return 0;
> -
>   fallback:
>          return blk_mq_map_queues(set);
>   }
> -- 

see attached another algorithm that can improve the mapping (although 
it's not a short one)...

it will try to map according to the affinity mask, and in case the mask 
weight > 1 it will try to do better than the naive mapping I suggested in 
the previous email.


[-- Attachment #2: 0001-blk-mq-fix-RDMA-queue-cpu-mappings-assignments-for-m.patch --]
[-- Type: text/plain, Size: 5577 bytes --]

From 007d773af7b65a1f1ca543f031ca58b3afa5b7d9 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <maxg@mellanox.com>
Date: Thu, 19 Jul 2018 12:42:00 +0000
Subject: [PATCH 1/1] blk-mq: fix RDMA queue/cpu mappings assignments for mq

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
---
 block/blk-mq-cpumap.c  | 41 ++++++++++++++----------
 block/blk-mq-rdma.c    | 84 ++++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/blk-mq.h |  1 +
 3 files changed, 103 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f..02b888f 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue_to_cpu(set, cpu);
 
 	return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f..621d5f0 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,33 @@
 #include <linux/blk-mq-rdma.h>
 #include <rdma/ib_verbs.h>
 
+static int blk_mq_rdma_map_queues_by_affinity(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec, int max_mapping)
+{
+	const struct cpumask *mask;
+	unsigned int queue, cpu;
+	int num_of_mapping = 0;
+
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		if (num_of_mapping == max_mapping)
+			return num_of_mapping;
+
+		mask = ib_get_vector_affinity(dev, first_vec + queue);
+		if (!mask)
+			return -1;
+
+		for_each_cpu(cpu, mask) {
+			if (set->mq_map[cpu] == UINT_MAX) {
+				set->mq_map[cpu] = queue;
+				num_of_mapping++;
+				/* Each queue mapped to 1 cpu */
+				break;
+			}
+		}
+	}
+	return num_of_mapping;
+}
+
 /**
  * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
  * @set:	tagset to provide the mapping for
@@ -32,18 +59,63 @@
 int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 		struct ib_device *dev, int first_vec)
 {
-	const struct cpumask *mask;
 	unsigned int queue, cpu;
+	bool mapped;
+	int remaining_cpus = num_possible_cpus();
+	int unmapped_queues = set->nr_hw_queues;
+	int ret;
 
-	for (queue = 0; queue < set->nr_hw_queues; queue++) {
-		mask = ib_get_vector_affinity(dev, first_vec + queue);
-		if (!mask)
+	/* reset all CPUs mapping */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
+	/* Try to map the queues according to affinity */
+	ret = blk_mq_rdma_map_queues_by_affinity(set, dev, first_vec,
+						 remaining_cpus);
+	if (ret == -1)
+		goto fallback;
+	remaining_cpus -= ret;
+	unmapped_queues -= ret;
+
+	/* Map queues with more than one cpu according to affinity */
+	while (remaining_cpus > unmapped_queues) {
+		ret = blk_mq_rdma_map_queues_by_affinity(set, dev, first_vec,
+					remaining_cpus - unmapped_queues);
+		if (ret == -1)
 			goto fallback;
+		if (!ret)
+			break;
+		remaining_cpus -= ret;
+	}
 
-		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+	/* Map the unmapped queues in a naive way */
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		mapped = false;
+		for_each_possible_cpu(cpu) {
+			if (set->mq_map[cpu] == queue) {
+				mapped = true;
+				break;
+			}
+		}
+		if (!mapped) {
+			for_each_possible_cpu(cpu) {
+				if (set->mq_map[cpu] == UINT_MAX) {
+					set->mq_map[cpu] = queue;
+					mapped = true;
+					break;
+				}
+			}
+		}
+		/* This case should never happen */
+		if (WARN_ON_ONCE(!mapped))
+			goto fallback;
 	}
 
+	/* set all the rest of the CPUs */
+	for_each_possible_cpu(cpu)
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue_to_cpu(set, cpu);
+
 	return 0;
 
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb..d6cd114 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -282,6 +282,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-01 14:27                                   ` Max Gurtovoy
@ 2018-08-06 19:20                                     ` Steve Wise
  2018-08-15  6:37                                       ` Leon Romanovsky
  2018-08-16 18:26                                       ` Sagi Grimberg
  0 siblings, 2 replies; 35+ messages in thread
From: Steve Wise @ 2018-08-06 19:20 UTC (permalink / raw)
  To: Max Gurtovoy, Sagi Grimberg, Jason Gunthorpe
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
>
>
> On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
>> Hi Max,
>
> Hi,
>
>>
>>> Yes, since nvmf is the only user of this function.
>>> Still waiting for comments on the suggested patch :)
>>>
>>
>> Sorry for the late response (but I'm on vacation so I have
>> an excuse ;))
>
> NP :) currently the code works..
>
>>
>> I'm thinking that we should avoid trying to find an assignment
>> when stuff like irqbalance daemon is running and changing
>> the affinitization.
>
> but this is exactly what Steve complained and Leon try to fix (and
> break the connection establishment).
> If this is the case and we all agree then we're good without Leon's
> patch and without our suggestions.
>

I don't agree.  Currently setting certain affinity mappings breaks nvme
connectivity.  I don't think that is desirable.  And mlx5 is broken in
that it doesn't allow changing the affinity but silently ignores the
change, which misleads the admin or irqbalance...
 


>>
>> This extension was made to apply optimal affinity assignment
>> when the device irq affinity is lined up in a vector per
>> core.
>>
>> I'm thinking that when we identify this is not the case, we immediately
>> fallback to the default mapping.
>>
>> 1. when we get a mask, if its weight != 1, we fallback.
>> 2. if a queue was left unmapped, we fallback.
>>
>> Maybe something like the following:
>
> did you test it ? I think it will not work since you need to map all
> the queues and all the CPUs.
>
>> -- 
>> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
>> index 996167f1de18..1ada6211c55e 100644
>> --- a/block/blk-mq-rdma.c
>> +++ b/block/blk-mq-rdma.c
>> @@ -35,17 +35,26 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set
>> *set,
>>          const struct cpumask *mask;
>>          unsigned int queue, cpu;
>>
>> +       /* reset all CPUs mapping */
>> +       for_each_possible_cpu(cpu)
>> +               set->mq_map[cpu] = UINT_MAX;
>> +
>>          for (queue = 0; queue < set->nr_hw_queues; queue++) {
>>                  mask = ib_get_vector_affinity(dev, first_vec + queue);
>>                  if (!mask)
>>                          goto fallback;
>>
>> -               for_each_cpu(cpu, mask)
>> -                       set->mq_map[cpu] = queue;
>> +               if (cpumask_weight(mask) != 1)
>> +                       goto fallback;
>> +
>> +               cpu = cpumask_first(mask);
>> +               if (set->mq_map[cpu] != UINT_MAX)
>> +                       goto fallback;
>> +
>> +               set->mq_map[cpu] = queue;
>>          }
>>
>>          return 0;
>> -
>>   fallback:
>>          return blk_mq_map_queues(set);
>>   }
>> -- 
>
> see attached another algorithem that can improve the mapping (although
> it's not a short one)...
>
> it will try to map according to affinity mask, and also in case the
> mask weight > 1 it will try to be better than the naive mapping I
> suggest in the previous email.
>

Let me know if you want me to try this or any particular fix.

Steve.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-06 19:20                                     ` Steve Wise
@ 2018-08-15  6:37                                       ` Leon Romanovsky
  2018-08-16 18:26                                       ` Sagi Grimberg
  1 sibling, 0 replies; 35+ messages in thread
From: Leon Romanovsky @ 2018-08-15  6:37 UTC (permalink / raw)
  To: Steve Wise
  Cc: Max Gurtovoy, Sagi Grimberg, Jason Gunthorpe,
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On Mon, Aug 06, 2018 at 02:20:37PM -0500, Steve Wise wrote:
>
>
> On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
> >
> >
> > On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
> >> Hi Max,
> >
> > Hi,
> >
> >>
> >>> Yes, since nvmf is the only user of this function.
> >>> Still waiting for comments on the suggested patch :)
> >>>
> >>
> >> Sorry for the late response (but I'm on vacation so I have
> >> an excuse ;))
> >
> > NP :) currently the code works..
> >
> >>
> >> I'm thinking that we should avoid trying to find an assignment
> >> when stuff like irqbalance daemon is running and changing
> >> the affinitization.
> >
> > but this is exactly what Steve complained and Leon try to fix (and
> > break the connection establishment).
> > If this is the case and we all agree then we're good without Leon's
> > patch and without our suggestions.
> >
>
> I don't agree.  Currently setting certain affinity mappings breaks nvme
> connectivity.  I don't think that is desirable.  And mlx5 is broken in
> that it doesn't allow changing the affinity but silently ignores the
> change, which misleads the admin or irqbalance...

Exactly, I completely agree with Steve and don't understand the
rationale in the comments above. As a summary from my side:
NVMeOF is broken, but instead of fixing it we prohibit one
specific driver from changing affinity on the fly.

Nice.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-06 19:20                                     ` Steve Wise
  2018-08-15  6:37                                       ` Leon Romanovsky
@ 2018-08-16 18:26                                       ` Sagi Grimberg
  2018-08-16 18:32                                         ` Steve Wise
  1 sibling, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-08-16 18:26 UTC (permalink / raw)
  To: Steve Wise, Max Gurtovoy, Jason Gunthorpe
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'


> Let me know if you want me to try this or any particular fix.

Steve, can you test this one?
--
[PATCH rfc] block: fix rdma queue mapping

nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user, which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map queues according to the completion vector IRQ affinity,
taking the first cpu in the vector affinity map. If the current irq
affinity is arranged such that a vector is not assigned to any
distinct cpu, we map it to a cpu that is on the same node. If numa
affinity cannot be satisfied, we map it to any unmapped cpu we can
find. Then, map the remaining cpus in the possible cpumap naively.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Steve, can you test out this patch?
  block/blk-mq-cpumap.c  | 39 +++++++++++++-----------
  block/blk-mq-rdma.c    | 80 +++++++++++++++++++++++++++++++++++++++++++-------
  include/linux/blk-mq.h |  1 +
  3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
         return cpu;
  }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
  {
         unsigned int *map = set->mq_map;
         unsigned int nr_queues = set->nr_hw_queues;
-       unsigned int cpu, first_sibling;
+       unsigned int first_sibling;

-       for_each_possible_cpu(cpu) {
-               /*
-                * First do sequential mapping between CPUs and queues.
-                * In case we still have CPUs to map, and we have some number of
-                * threads per cores then map sibling threads to the same queue for
-                * performace optimizations.
-                */
-               if (cpu < nr_queues) {
+       /*
+        * First do sequential mapping between CPUs and queues.
+        * In case we still have CPUs to map, and we have some number of
+        * threads per cores then map sibling threads to the same queue for
+        * performace optimizations.
+        */
+       if (cpu < nr_queues) {
+               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+       } else {
+               first_sibling = get_first_sibling(cpu);
+               if (first_sibling == cpu)
                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-               } else {
-                       first_sibling = get_first_sibling(cpu);
-                       if (first_sibling == cpu)
-                               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                       else
-                               map[cpu] = map[first_sibling];
-               }
+               else
+                       map[cpu] = map[first_sibling];
         }
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+       unsigned int cpu;

+       for_each_possible_cpu(cpu)
+               blk_mq_map_queue_cpu(set, cpu);
         return 0;
  }
  EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..d04cbb1925f5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
  #include <linux/blk-mq-rdma.h>
  #include <rdma/ib_verbs.h>

+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+               struct ib_device *dev, int first_vec, unsigned int queue)
+{
+       const struct cpumask *mask;
+       unsigned int cpu;
+       bool mapped = false;
+
+       mask = ib_get_vector_affinity(dev, first_vec + queue);
+       if (!mask)
+               return -ENOTSUPP;
+
+       /* map with an unmapped cpu according to affinity mask */
+       for_each_cpu(cpu, mask) {
+               if (set->mq_map[cpu] == UINT_MAX) {
+                       set->mq_map[cpu] = queue;
+                       mapped = true;
+                       break;
+               }
+       }
+
+       if (!mapped) {
+               int n;
+
+               /* map with an unmapped cpu in the same numa node */
+               for_each_node(n) {
+                       const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+                       if (!cpumask_intersects(mask, node_cpumask))
+                               continue;
+
+                       for_each_cpu(cpu, node_cpumask) {
+                               if (set->mq_map[cpu] == UINT_MAX) {
+                                       set->mq_map[cpu] = queue;
+                                       mapped = true;
+                                       break;
+                               }
+                       }
+               }
+       }
+
+       if (!mapped) {
+               /* map with any unmapped cpu we can find */
+               for_each_possible_cpu(cpu) {
+                       if (set->mq_map[cpu] == UINT_MAX) {
+                               set->mq_map[cpu] = queue;
+                               mapped = true;
+                               break;
+                       }
+               }
+       }
+
+       WARN_ON_ONCE(!mapped);
+       return 0;
+}
+
  /**
   * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
   * @set:       tagset to provide the mapping for
@@ -21,31 +76,36 @@
   * @first_vec: first interrupt vectors to use for queues (usually 0)
   *
   * This function assumes the rdma device @dev has at least as many available
 - * interrupt vetors as @set has queues.  It will then query it's affinity mask
 - * and built queue mapping that maps a queue to the CPUs that have irq affinity
 - * for the corresponding vector.
 + * interrupt vetors as @set has queues.  It will then query vector affinity mask
 + * and attempt to build irq affinity aware queue mappings. If optimal affinity
 + * aware mapping cannot be acheived for a given queue, we look for any unmapped
 + * cpu to map it. Lastly, we map naively all other unmapped cpus in the mq_map.
   *
   * In case either the driver passed a @dev with less vectors than
   * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
   * vector, we fallback to the naive mapping.
   */
  int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
-               struct ib_device *dev, int first_vec)
+                struct ib_device *dev, int first_vec)
  {
-       const struct cpumask *mask;
         unsigned int queue, cpu;

+       /* reset cpu mapping */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
-               mask = ib_get_vector_affinity(dev, first_vec + queue);
-               if (!mask)
+               if (blk_mq_rdma_map_queue(set, dev, first_vec, queue))
                         goto fallback;
+       }

-               for_each_cpu(cpu, mask)
-                       set->mq_map[cpu] = queue;
+       /* map any remaining unmapped cpus */
+       for_each_possible_cpu(cpu) {
+               if (set->mq_map[cpu] == UINT_MAX)
+                       blk_mq_map_queue_cpu(set, cpu);
         }

         return 0;
-
  fallback:
         return blk_mq_map_queues(set);
  }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d710e92874cc..6eb09c4de34f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -285,6 +285,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
                                      unsigned long timeout);

  int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);

  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
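
For context, and not part of this patch: the only in-tree caller of
blk_mq_rdma_map_queues here is nvme-rdma, whose ->map_queues callback
is essentially the following one-liner (reproduced approximately, as a
sketch rather than the exact source):

static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
{
	struct nvme_rdma_ctrl *ctrl = set->driver_data;

	/* hand the tagset and the device's first completion vector to blk-mq */
	return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
}

So whatever cpu-to-queue assignment this function builds is exactly the
mapping nvme-rdma ends up submitting I/O with.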

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-16 18:26                                       ` Sagi Grimberg
@ 2018-08-16 18:32                                         ` Steve Wise
  2018-08-17 16:17                                           ` Steve Wise
  0 siblings, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-08-16 18:32 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, Jason Gunthorpe
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'



On 8/16/2018 1:26 PM, Sagi Grimberg wrote:
>
>> Let me know if you want me to try this or any particular fix.
>
> Steve, can you test this one?

Yes!  I'll try it out tomorrow. 

Stevo

> [full patch quoted above snipped]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-16 18:32                                         ` Steve Wise
@ 2018-08-17 16:17                                           ` Steve Wise
  2018-08-17 20:03                                             ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Steve Wise @ 2018-08-17 16:17 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Max Gurtovoy',
	'Jason Gunthorpe'
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

> On 8/16/2018 1:26 PM, Sagi Grimberg wrote:
> >
> >> Let me know if you want me to try this or any particular fix.
> >
> > Steve, can you test this one?
> 
> Yes!  I'll try it out tomorrow.
> 
> Stevo
> 

Hey Sagi,

The patch works allowing connections for the various affinity mappings below:

One comp_vector per core across all cores, starting with numa-local cores:

[ 3798.494963] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 3798.500717] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 3798.506396] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 3798.512043] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 3798.517675] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[ 3798.523382] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[ 3798.529075] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[ 3798.534754] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[ 3798.540425] iw_cxgb4: comp_vector 8, irq 211 mask 0x1
[ 3798.545825] iw_cxgb4: comp_vector 9, irq 212 mask 0x2
[ 3798.551231] iw_cxgb4: comp_vector 10, irq 213 mask 0x4
[ 3798.556713] iw_cxgb4: comp_vector 11, irq 214 mask 0x8
[ 3798.562189] iw_cxgb4: comp_vector 12, irq 215 mask 0x10
[ 3798.567755] iw_cxgb4: comp_vector 13, irq 216 mask 0x20
[ 3798.573312] iw_cxgb4: comp_vector 14, irq 217 mask 0x40
[ 3798.578855] iw_cxgb4: comp_vector 15, irq 218 mask 0x80
[ 3798.584384] set->mq_map[0] queue 8 vector 8
[ 3798.588879] set->mq_map[1] queue 9 vector 9
[ 3798.593358] set->mq_map[2] queue 10 vector 10
[ 3798.598008] set->mq_map[3] queue 11 vector 11
[ 3798.602633] set->mq_map[4] queue 12 vector 12
[ 3798.607260] set->mq_map[5] queue 13 vector 13
[ 3798.611872] set->mq_map[6] queue 14 vector 14
[ 3798.616470] set->mq_map[7] queue 15 vector 15
[ 3798.621059] set->mq_map[8] queue 0 vector 0
[ 3798.625460] set->mq_map[9] queue 1 vector 1
[ 3798.629852] set->mq_map[10] queue 2 vector 2
[ 3798.634331] set->mq_map[11] queue 3 vector 3
[ 3798.638796] set->mq_map[12] queue 4 vector 4
[ 3798.643263] set->mq_map[13] queue 5 vector 5
[ 3798.647727] set->mq_map[14] queue 6 vector 6
[ 3798.652197] set->mq_map[15] queue 7 vector 7

One comp_vector per core, but only numa-local cores:

[ 3855.406027] iw_cxgb4: comp_vector 0, irq 203 mask 0x400
[ 3855.411577] iw_cxgb4: comp_vector 1, irq 204 mask 0x800
[ 3855.417057] iw_cxgb4: comp_vector 2, irq 205 mask 0x1000
[ 3855.422618] iw_cxgb4: comp_vector 3, irq 206 mask 0x2000
[ 3855.428176] iw_cxgb4: comp_vector 4, irq 207 mask 0x4000
[ 3855.433731] iw_cxgb4: comp_vector 5, irq 208 mask 0x8000
[ 3855.439293] iw_cxgb4: comp_vector 6, irq 209 mask 0x100
[ 3855.444770] iw_cxgb4: comp_vector 7, irq 210 mask 0x200
[ 3855.450231] iw_cxgb4: comp_vector 8, irq 211 mask 0x400
[ 3855.455691] iw_cxgb4: comp_vector 9, irq 212 mask 0x800
[ 3855.461144] iw_cxgb4: comp_vector 10, irq 213 mask 0x1000
[ 3855.466768] iw_cxgb4: comp_vector 11, irq 214 mask 0x2000
[ 3855.472379] iw_cxgb4: comp_vector 12, irq 215 mask 0x4000
[ 3855.477992] iw_cxgb4: comp_vector 13, irq 216 mask 0x8000
[ 3855.483599] iw_cxgb4: comp_vector 14, irq 217 mask 0x100
[ 3855.489116] iw_cxgb4: comp_vector 15, irq 218 mask 0x200
[ 3855.494644] set->mq_map[0] queue 8 vector 8
[ 3855.499046] set->mq_map[1] queue 9 vector 9
[ 3855.503445] set->mq_map[2] queue 10 vector 10
[ 3855.508025] set->mq_map[3] queue 11 vector 11
[ 3855.512600] set->mq_map[4] queue 12 vector 12
[ 3855.517176] set->mq_map[5] queue 13 vector 13
[ 3855.521750] set->mq_map[6] queue 14 vector 14
[ 3855.526325] set->mq_map[7] queue 15 vector 15
[ 3855.530902] set->mq_map[8] queue 6 vector 6
[ 3855.535306] set->mq_map[9] queue 7 vector 7
[ 3855.539703] set->mq_map[10] queue 0 vector 0
[ 3855.544197] set->mq_map[11] queue 1 vector 1
[ 3855.548670] set->mq_map[12] queue 2 vector 2
[ 3855.553144] set->mq_map[13] queue 3 vector 3
[ 3855.557630] set->mq_map[14] queue 4 vector 4
[ 3855.562105] set->mq_map[15] queue 5 vector 5

Each comp_vector has affinity to all numa-local cores:

[ 4010.002954] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 4010.008606] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 4010.014179] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 4010.019741] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 4010.025310] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 4010.030881] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 4010.036448] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 4010.042012] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 4010.047562] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 4010.053103] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 4010.058632] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 4010.064248] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 4010.069863] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 4010.075462] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 4010.081066] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 4010.086676] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 4010.092283] set->mq_map[0] queue 8 vector 8
[ 4010.096683] set->mq_map[1] queue 9 vector 9
[ 4010.101085] set->mq_map[2] queue 10 vector 10
[ 4010.105663] set->mq_map[3] queue 11 vector 11
[ 4010.110239] set->mq_map[4] queue 12 vector 12
[ 4010.114810] set->mq_map[5] queue 13 vector 13
[ 4010.119379] set->mq_map[6] queue 14 vector 14
[ 4010.123942] set->mq_map[7] queue 15 vector 15
[ 4010.128502] set->mq_map[8] queue 0 vector 0
[ 4010.132903] set->mq_map[9] queue 1 vector 1
[ 4010.137299] set->mq_map[10] queue 2 vector 2
[ 4010.141792] set->mq_map[11] queue 3 vector 3
[ 4010.146274] set->mq_map[12] queue 4 vector 4
[ 4010.150754] set->mq_map[13] queue 5 vector 5
[ 4010.155235] set->mq_map[14] queue 6 vector 6
[ 4010.159716] set->mq_map[15] queue 7 vector 7
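
(The set->mq_map dump lines above come from local debug instrumentation
rather than anything in the patch; a guess at the kind of loop that
would print them, dropped at the end of blk_mq_rdma_map_queues, is:)

	/* debug only: dump the final cpu -> queue -> vector assignment */
	for_each_possible_cpu(cpu)
		pr_info("set->mq_map[%u] queue %u vector %u\n", cpu,
			set->mq_map[cpu], first_vec + set->mq_map[cpu]);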

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-17 16:17                                           ` Steve Wise
@ 2018-08-17 20:03                                             ` Sagi Grimberg
  2018-08-17 20:17                                               ` Jason Gunthorpe
  2018-08-17 21:28                                               ` Steve Wise
  0 siblings, 2 replies; 35+ messages in thread
From: Sagi Grimberg @ 2018-08-17 20:03 UTC (permalink / raw)
  To: Steve Wise, 'Max Gurtovoy', 'Jason Gunthorpe'
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'


> Hey Sagi,
> 
> The patch works allowing connections for the various affinity mappings below:
> 
> One comp_vector per core across all cores, starting with numa-local cores:

Thanks Steve, is this your "Tested by:" tag?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-17 20:03                                             ` Sagi Grimberg
@ 2018-08-17 20:17                                               ` Jason Gunthorpe
  2018-08-17 20:26                                                 ` Sagi Grimberg
  2018-08-17 21:28                                               ` Steve Wise
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-08-17 20:17 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Steve Wise, 'Max Gurtovoy', 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'

On Fri, Aug 17, 2018 at 01:03:20PM -0700, Sagi Grimberg wrote:
> 
> > Hey Sagi,
> > 
> > The patch works allowing connections for the various affinity mappings below:
> > 
> > One comp_vector per core across all cores, starting with numa-local cores:
> 
> Thanks Steve, is this your "Tested by:" tag?

The new patchworks doesn't grab patches inlined in messages, so you
will need to resend it.

Also, can someone remind me what the outcome is here? Does it
supersede Leon's patch:

https://patchwork.kernel.org/patch/10526167/

?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-17 20:17                                               ` Jason Gunthorpe
@ 2018-08-17 20:26                                                 ` Sagi Grimberg
  0 siblings, 0 replies; 35+ messages in thread
From: Sagi Grimberg @ 2018-08-17 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, 'Max Gurtovoy', 'Leon Romanovsky',
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'

Hi Jason,

> The new patchworks doesn't grab patches inlined in messages, so you
> will need to resend it.

Yes, just wanted to to add Steve's tested by as its going to
lists that did not follow this thread.

> Also, can someone remind me what the outcome is here? Does it
> supersede Leon's patch:
> 
> https://patchwork.kernel.org/patch/10526167/

Leon's patch exposes the breakage, so I think it would be
wise to have it go in after this lands mainline.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
  2018-08-17 20:03                                             ` Sagi Grimberg
  2018-08-17 20:17                                               ` Jason Gunthorpe
@ 2018-08-17 21:28                                               ` Steve Wise
  1 sibling, 0 replies; 35+ messages in thread
From: Steve Wise @ 2018-08-17 21:28 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Max Gurtovoy',
	'Jason Gunthorpe'
  Cc: 'Leon Romanovsky', 'Doug Ledford',
	'RDMA mailing list', 'Saeed Mahameed',
	'linux-netdev'

> 
> 
> > Hey Sagi,
> >
> > The patch works allowing connections for the various affinity mappings below:
> >
> > One comp_vector per core across all cores, starting with numa-local cores:
> 
> Thanks Steve, is this your "Tested by:" tag?

Sure:

Tested-by: Steve Wise <swise@opengridcomputing.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2018-07-16  8:30 [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask Leon Romanovsky
2018-07-16 10:23 ` Sagi Grimberg
2018-07-16 10:30   ` Leon Romanovsky
2018-07-16 14:54     ` Max Gurtovoy
2018-07-16 14:59       ` Sagi Grimberg
2018-07-16 16:46         ` Max Gurtovoy
2018-07-16 17:08           ` Steve Wise
2018-07-17  8:46             ` Max Gurtovoy
2018-07-17  8:58               ` Leon Romanovsky
2018-07-17 10:05                 ` Max Gurtovoy
2018-07-17 13:03               ` Steve Wise
2018-07-18 11:38                 ` Sagi Grimberg
2018-07-18 14:14                   ` Max Gurtovoy
2018-07-18 14:25                     ` Steve Wise
2018-07-18 19:29                     ` Steve Wise
2018-07-19 14:50                       ` Max Gurtovoy
2018-07-19 18:45                         ` Steve Wise
2018-07-20  1:25                           ` Max Gurtovoy
2018-07-23 16:49                             ` Jason Gunthorpe
2018-07-23 16:53                               ` Max Gurtovoy
2018-07-30 15:47                                 ` Steve Wise
2018-07-31 10:00                                   ` Max Gurtovoy
2018-08-01  5:12                                 ` Sagi Grimberg
2018-08-01 14:27                                   ` Max Gurtovoy
2018-08-06 19:20                                     ` Steve Wise
2018-08-15  6:37                                       ` Leon Romanovsky
2018-08-16 18:26                                       ` Sagi Grimberg
2018-08-16 18:32                                         ` Steve Wise
2018-08-17 16:17                                           ` Steve Wise
2018-08-17 20:03                                             ` Sagi Grimberg
2018-08-17 20:17                                               ` Jason Gunthorpe
2018-08-17 20:26                                                 ` Sagi Grimberg
2018-08-17 21:28                                               ` Steve Wise
2018-07-24 15:24                             ` Steve Wise
2018-07-24 20:52                               ` Steve Wise
