* Issue with MLX5 IB driver
@ 2017-05-31 15:59 Joao Pinto
       [not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-05-31 15:59 UTC (permalink / raw)
  To: matanb-VPRAkNaXOzVWk0Htik3J/w, leonro-VPRAkNaXOzVWk0Htik3J/w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Dear Matan and Leon,

I am trying to bring up a ConnectX-5 Ex endpoint on a setup composed of a
32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
driver initializes well, but once the MLX5 IB driver starts, it consumes all the
available memory on my board (400 MB). Does this driver need more than 400 MB to
work?

Kernel used:

Latest 4.12.

Kernel log:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
INPUT
mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
doorbell
mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
OUTPUT
mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
ret 0x0, delivery status no errors(0x0)
mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
(...)
mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
(...)
mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
doorbell
mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
MLX5_EVENT_TYPE_CMD
(...)
mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
npages 4096
mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
CREATE_MKEY(0x200) INPUT
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
(...)
kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
nodemask=(null),  order=0, oom_score_adj=0
CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
Workqueue: mlx5_page_allocator pages_work_handler

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  dump_header.isra.6+0x82/0x1a8
  out_of_memory+0x2fc/0x368
  __alloc_pages_nodemask+0x22ee/0x24e4
  give_pages+0x1fc/0x664
  pages_work_handler+0x2a/0x88
  process_one_work+0x1c8/0x390
  worker_thread+0x120/0x540
  kthread+0x116/0x13c
  ret_from_fork+0x18/0x1c
Mem-Info:
active_anon:2083 inactive_anon:7261 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:94 slab_unreclaimable:709
 mapped:0 shmem:9344 pagetables:0 bounce:0
 free:311 free_pcp:57 free_cma:0
Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
writepending:0kB present:442368kB managed:407104kB mlocked:0kB
slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
lowmem_reserve[]: 0 0
Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
9344 total pagecache pages
55296 pages RAM
0 pages HighMem/MovableOnly
4408 pages reserved
[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Kernel panic - not syncing: Out of memory and no killable processes...

---[ end Kernel panic - not syncing: Out of memory and no killable processes...


Thank you and best regards,

Joao Pinto
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Issue with MLX5 IB driver
       [not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-31 16:18   ` Leon Romanovsky
       [not found]     ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-05-31 16:18 UTC (permalink / raw)
  To: Joao Pinto
  Cc: matanb-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Wed, May 31, 2017 at 04:59:45PM +0100, Joao Pinto wrote:
> Dear Matan and Leon,
>
> I am trying to bring up a ConnectX-5 Ex endpoint on a setup composed of a
> 32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
> driver initializes well, but once the MLX5 IB driver starts, it consumes all the
> available memory on my board (400 MB). Does this driver need more than 400 MB to
> work?

I think that you are hitting the side effect of these commits
7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
81713d3788d2 ("IB/mlx5: Add implicit MR support")

Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
for the test?
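A minimal way to flip that option for a test build is to toggle it in the build's `.config` and rebuild; this is only a sketch, and the `.config` path (current directory) plus the rebuild step are assumptions about your tree:

```shell
# Sketch: turn CONFIG_INFINIBAND_ON_DEMAND_PAGING off in a kernel .config.
# CFG points at your build tree's .config; the default path is an assumption.
CFG=${CFG:-.config}
[ -f "$CFG" ] || printf 'CONFIG_INFINIBAND_ON_DEMAND_PAGING=y\n' > "$CFG"  # demo file if none exists
sed -i 's/^CONFIG_INFINIBAND_ON_DEMAND_PAGING=y$/# CONFIG_INFINIBAND_ON_DEMAND_PAGING is not set/' "$CFG"
grep INFINIBAND_ON_DEMAND_PAGING "$CFG"
# Then rebuild, e.g.: make olddefconfig && make -j"$(nproc)"
# (scripts/config --disable INFINIBAND_ON_DEMAND_PAGING does the same edit.)
```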

Thanks

>
> Kernel used:
>
> Latest 4.12.
>
> Kernel log:
>
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
> INPUT
> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
> doorbell
> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
> OUTPUT
> mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
> ret 0x0, delivery status no errors(0x0)
> mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
> (...)
> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
> (...)
> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
> QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
> doorbell
> mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
> MLX5_EVENT_TYPE_CMD
> (...)
> mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
> npages 4096
> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
> CREATE_MKEY(0x200) INPUT
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> (...)
> kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
> nodemask=(null),  order=0, oom_score_adj=0
> CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
> Workqueue: mlx5_page_allocator pages_work_handler
>
> Stack Trace:
>   arc_unwind_core.constprop.2+0xb4/0x100
>   dump_header.isra.6+0x82/0x1a8
>   out_of_memory+0x2fc/0x368
>   __alloc_pages_nodemask+0x22ee/0x24e4
>   give_pages+0x1fc/0x664
>   pages_work_handler+0x2a/0x88
>   process_one_work+0x1c8/0x390
>   worker_thread+0x120/0x540
>   kthread+0x116/0x13c
>   ret_from_fork+0x18/0x1c
> Mem-Info:
> active_anon:2083 inactive_anon:7261 isolated_anon:0
>  active_file:0 inactive_file:0 isolated_file:0
>  unevictable:0 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:94 slab_unreclaimable:709
>  mapped:0 shmem:9344 pagetables:0 bounce:0
>  free:311 free_pcp:57 free_cma:0
> Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
> inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
> writepending:0kB present:442368kB managed:407104kB mlocked:0kB
> slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
> pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
> lowmem_reserve[]: 0 0
> Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
> 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
> 9344 total pagecache pages
> 55296 pages RAM
> 0 pages HighMem/MovableOnly
> 4408 pages reserved
> [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> Kernel panic - not syncing: Out of memory and no killable processes...
>
> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
>
>
> Thank you and best regards,
>
> Joao Pinto



* Re: Issue with MLX5 IB driver
       [not found]     ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-05-31 16:39       ` Majd Dibbiny
  2017-05-31 19:44       ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Majd Dibbiny @ 2017-05-31 16:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Joao Pinto, Matan Barak, linux-rdma-u79uwXL29TY76Z2rM5mHXA



> On May 31, 2017, at 7:18 PM, Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> 
> On Wed, May 31, 2017 at 04:59:45PM +0100, Joao Pinto wrote:
>> Dear Matan and Leon,
>> 
>> I am trying to bring up a ConnectX-5 Ex endpoint on a setup composed of a
>> 32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
>> driver initializes well, but once the MLX5 IB driver starts, it consumes all the
>> available memory on my board (400 MB). Does this driver need more than 400 MB to
>> work?
> 
> I think that you are hitting the side effect of these commits
> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
> 
> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> for the test?
> 
> Thanks
Hi Joao,

As Leon mentioned, those commits increased the driver's memory consumption.
In your case, to work in a low-memory environment, I would suggest setting the profile selector (prof_sel) module parameter of mlx5_core to 0 (instead of the default 2). This will have some side effects on performance, but that's the trade-off.
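For reference, a sketch of how that parameter could be applied; confirm it exists on your build first with `modinfo mlx5_core | grep prof_sel`, and note the modprobe.d path below is the usual convention, not something mandated by the driver:

```shell
# Sketch: pin mlx5_core to its minimal resource profile (prof_sel=0).
# CONF follows the common modprobe.d convention; adjust as needed (needs root for the real path).
CONF=${CONF:-/etc/modprobe.d/mlx5.conf}
echo "options mlx5_core prof_sel=0" > "$CONF"
cat "$CONF"
# Reload so it takes effect (needs root):
#   modprobe -r mlx5_ib mlx5_core && modprobe mlx5_core prof_sel=0
```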


> 
>> 
>> Kernel used:
>> 
>> Latest 4.12.
>> 
>> Kernel log:
>> 
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
>> INPUT
>> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
>> doorbell
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
>> OUTPUT
>> mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
>> ret 0x0, delivery status no errors(0x0)
>> mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
>> (...)
>> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
>> (...)
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
>> QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
>> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
>> doorbell
>> mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
>> MLX5_EVENT_TYPE_CMD
>> (...)
>> mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
>> npages 4096
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
>> CREATE_MKEY(0x200) INPUT
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> (...)
>> kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
>> nodemask=(null),  order=0, oom_score_adj=0
>> CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
>> Workqueue: mlx5_page_allocator pages_work_handler
>> 
>> Stack Trace:
>>  arc_unwind_core.constprop.2+0xb4/0x100
>>  dump_header.isra.6+0x82/0x1a8
>>  out_of_memory+0x2fc/0x368
>>  __alloc_pages_nodemask+0x22ee/0x24e4
>>  give_pages+0x1fc/0x664
>>  pages_work_handler+0x2a/0x88
>>  process_one_work+0x1c8/0x390
>>  worker_thread+0x120/0x540
>>  kthread+0x116/0x13c
>>  ret_from_fork+0x18/0x1c
>> Mem-Info:
>> active_anon:2083 inactive_anon:7261 isolated_anon:0
>> active_file:0 inactive_file:0 isolated_file:0
>> unevictable:0 dirty:0 writeback:0 unstable:0
>> slab_reclaimable:94 slab_unreclaimable:709
>> mapped:0 shmem:9344 pagetables:0 bounce:0
>> free:311 free_pcp:57 free_cma:0
>> Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
>> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
>> all_unreclaimable? yes
>> Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
>> inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
>> writepending:0kB present:442368kB managed:407104kB mlocked:0kB
>> slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
>> pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
>> lowmem_reserve[]: 0 0
>> Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
>> 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
>> 9344 total pagecache pages
>> 55296 pages RAM
>> 0 pages HighMem/MovableOnly
>> 4408 pages reserved
>> [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
>> Kernel panic - not syncing: Out of memory and no killable processes...
>> 
>> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
>> 
>> 
>> Thank you and best regards,
>> 
>> Joao Pinto


* Re: Issue with MLX5 IB driver
       [not found]     ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2017-05-31 16:39       ` Majd Dibbiny
@ 2017-05-31 19:44       ` Christoph Hellwig
       [not found]         ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2017-05-31 19:44 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Joao Pinto, matanb-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
> I think that you are hitting the side effect of these commits
> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
> 
> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> for the test?

Eww.  Please make sure mlx5 gracefully handles cases where it can't use a
crazy amount of memory, including disabling features like the above at
runtime when the required resources aren't available.


* Re: Issue with MLX5 IB driver
       [not found]         ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2017-06-01  4:30           ` Leon Romanovsky
       [not found]             ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-06-01  4:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Joao Pinto, matanb-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
> > I think that you are hitting the side effect of these commits
> > 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> > 81713d3788d2 ("IB/mlx5: Add implicit MR support")
> >
> > Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> > for the test?
>
> Eww.  Please make sure mlx5 gracefully handles cases where it can't use
> crazy amount of memory, including disabling features like the above
> at runtime when the required resources aren't available.

Right, the real consumer of memory in mlx5_ib is the mr_cache, so the
question is how we can check in advance whether we have enough memory,
short of simply attempting the allocations with the GFP_NOWARN flag.
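For a sense of scale, a back-of-the-envelope sketch (assuming the 4 KB firmware page size; the npages value is taken from the log earlier in the thread):

```shell
# One firmware page request from the log: "page request for func 0x0, npages 4096".
page_size=4096   # bytes per firmware page (assumed 4 KB)
npages=4096      # pages per request, from the log
echo "one page request: $((page_size * npages / 1024 / 1024)) MB"
# prints: one page request: 16 MB
# On a board with ~400 MB of managed RAM, a few dozen such requests
# plus the mr_cache preallocation are enough to trigger the OOM killer.
```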




* Re: Issue with MLX5 IB driver
       [not found]             ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-06-01 10:05               ` Joao Pinto
       [not found]                 ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 10:05 UTC (permalink / raw)
  To: Leon Romanovsky, matanb-VPRAkNaXOzVWk0Htik3J/w
  Cc: Christoph Hellwig, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA


Hello,

At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>> I think that you are hitting the side effect of these commits
>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>
>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>> for the test?
>>
>> Eww.  Please make sure mlx5 gracefully handles cases where it can't use
>> crazy amount of memory, including disabling features like the above
>> at runtime when the required resources aren't available.
> 
> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
> question is how can we check in advance if we have enough memory
> without calling allocations with GFP_NOWARN flag.
> 

With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
Crashes the same way.

With MLX5_DEFAULT_PROF defined as 0:

There is no crash.

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
(...)
mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
(...)
mlx5_core 0000:01:00.0: device's health compromised - reached miss count
mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
mlx5_core 0000:01:00.0: assert_callra 0x00699680
mlx5_core 0000:01:00.0: fw_ver 16.19.21102
mlx5_core 0000:01:00.0: hw_id 0x0000020d
mlx5_core 0000:01:00.0: irisc_index 0
mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
mlx5_core 0000:01:00.0: ext_synd 0x11c5
mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e

lspci -v result:

01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
        Subsystem: Mellanox Technologies Device 0002
        Flags: bus master, fast devsel, latency 0
        Memory at d2000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [1c0] #19
        Capabilities: [320] #27
        Kernel driver in use: mlx5_core

Interrupts:

 45:          0   PCI-MSI   0  aerdrv
 46:          2   PCI-MSI 524288  mlx5_pages_eq@pci:0000:01:00.0
 47:        347   PCI-MSI 524289  mlx5_cmd_eq@pci:0000:01:00.0
 48:          0   PCI-MSI 524290  mlx5_async_eq@pci:0000:01:00.0
 50:          0   PCI-MSI 524292  mlx5_comp0@pci:0000:01:00.0

List of devices:

# ls /dev/infiniband/
issm0    rdma_cm  ucm0     umad0    uverbs0

Shouldn't I be getting some Mellanox devices?

Thanks,
Joao



* Re: Issue with MLX5 IB driver
       [not found]                 ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:18                   ` Joao Pinto
       [not found]                     ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 11:18 UTC (permalink / raw)
  To: Leon Romanovsky, matanb-VPRAkNaXOzVWk0Htik3J/w
  Cc: Christoph Hellwig, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA

At 11:05 AM on 6/1/2017, Joao Pinto wrote:
> 
> Hello,
> 
> At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
>> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>>> I think that you are hitting the side effect of these commits
>>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>>
>>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>>> for the test?
>>>
>>> Eww.  Please make sure mlx5 gracefully handles cases where it can't use
>>> crazy amount of memory, including disabling features like the above
>>> at runtime when the required resources aren't available.
>>
>> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
>> question is how can we check in advance if we have enough memory
>> without calling allocations with GFP_NOWARN flag.
>>
> 
> With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
> Crashes the same way.
> 
> With MLX5_DEFAULT_PROF defined as 0:
> 
> There is no crash.
> 
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> (...)
> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
> (...)
> mlx5_core 0000:01:00.0: device's health compromised - reached miss count
> mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
> mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
> mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
> mlx5_core 0000:01:00.0: assert_callra 0x00699680
> mlx5_core 0000:01:00.0: fw_ver 16.19.21102
> mlx5_core 0000:01:00.0: hw_id 0x0000020d
> mlx5_core 0000:01:00.0: irisc_index 0
> mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
> mlx5_core 0000:01:00.0: ext_synd 0x11c5
> mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e
> 
> lspci -v result:
> 
> 01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
>         Subsystem: Mellanox Technologies Device 0002
>         Flags: bus master, fast devsel, latency 0
>         Memory at d2000000 (64-bit, prefetchable) [size=32M]
>         Capabilities: [60] Express Endpoint, MSI 00
>         Capabilities: [48] Vital Product Data
>         Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
>         Capabilities: [c0] Vendor Specific Information: Len=18 <?>
>         Capabilities: [40] Power Management version 3
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>         Capabilities: [1c0] #19
>         Capabilities: [320] #27
>         Kernel driver in use: mlx5_core
> 
> Interrupts:
> 
>  45:          0   PCI-MSI   0  aerdrv
>  46:          2   PCI-MSI 524288  mlx5_pages_eq@pci:0000:01:00.0
>  47:        347   PCI-MSI 524289  mlx5_cmd_eq@pci:0000:01:00.0
>  48:          0   PCI-MSI 524290  mlx5_async_eq@pci:0000:01:00.0
>  50:          0   PCI-MSI 524292  mlx5_comp0@pci:0000:01:00.0
> 
> List of devices:
> 
> # ls /dev/infiniband/
> issm0    rdma_cm  ucm0     umad0    uverbs0
> 
> Shouldn't I be getting some mellanox devices?
> 
> Thanks,
> Joao
> 

After searching in /sys, I found the Mellanox device mlx5_0
(/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:

#  ibstat mlx5_0
CA 'mlx5_0'
        CA type: MT4121
        Number of ports: 1
        Firmware version: 16.19.21102
        Hardware version: 0
        Node GUID: 0x248a070300aa8466
        System image GUID: 0x248a070300aa8466
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0x248a070300aa8466
                Link layer: InfiniBand

Shouldn't the device be visible in /dev?

Thanks.



* Re: Issue with MLX5 IB driver
       [not found]                     ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:57                       ` Majd Dibbiny
       [not found]                         ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Majd Dibbiny @ 2017-06-01 11:57 UTC (permalink / raw)
  To: Joao Pinto
  Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



> On Jun 1, 2017, at 2:19 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
> 
> At 11:05 AM on 6/1/2017, Joao Pinto wrote:
>> 
>> Hello,
>> 
>> At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
>>> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>>>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>>>> I think that you are hitting the side effect of these commits
>>>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>>> 
>>>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>>>> for the test?
>>>> 
>>>> Eww.  Please make sure mlx5 gracefully handles cases where it can't use
>>>> crazy amount of memory, including disabling features like the above
>>>> at runtime when the required resources aren't available.
>>> 
>>> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
>>> question is how can we check in advance if we have enough memory
>>> without calling allocations with GFP_NOWARN flag.
>>> 
>> 
>> With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
>> Crashes the same way.
>> 
>> With MLX5_DEFAULT_PROF defined as 0:
>> 
>> There is no crash.
>> 
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> (...)
>> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
>> (...)
>> mlx5_core 0000:01:00.0: device's health compromised - reached miss count
>> mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
>> mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
>> mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
>> mlx5_core 0000:01:00.0: assert_callra 0x00699680
>> mlx5_core 0000:01:00.0: fw_ver 16.19.21102
>> mlx5_core 0000:01:00.0: hw_id 0x0000020d
>> mlx5_core 0000:01:00.0: irisc_index 0
>> mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
>> mlx5_core 0000:01:00.0: ext_synd 0x11c5
>> mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e
>> 
>> lspci -v result:
>> 
>> 01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
>>        Subsystem: Mellanox Technologies Device 0002
>>        Flags: bus master, fast devsel, latency 0
>>        Memory at d2000000 (64-bit, prefetchable) [size=32M]
>>        Capabilities: [60] Express Endpoint, MSI 00
>>        Capabilities: [48] Vital Product Data
>>        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
>>        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
>>        Capabilities: [40] Power Management version 3
>>        Capabilities: [100] Advanced Error Reporting
>>        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>>        Capabilities: [1c0] #19
>>        Capabilities: [320] #27
>>        Kernel driver in use: mlx5_core
>> 
>> Interrupts:
>> 
>> 45:          0   PCI-MSI   0  aerdrv
>> 46:          2   PCI-MSI 524288  mlx5_pages_eq@pci:0000:01:00.0
>> 47:        347   PCI-MSI 524289  mlx5_cmd_eq@pci:0000:01:00.0
>> 48:          0   PCI-MSI 524290  mlx5_async_eq@pci:0000:01:00.0
>> 50:          0   PCI-MSI 524292  mlx5_comp0@pci:0000:01:00.0
>> 
>> List of devices:
>> 
>> # ls /dev/infiniband/
>> issm0    rdma_cm  ucm0     umad0    uverbs0
>> 
>> Shouldn't I be getting some Mellanox devices?
>> 
>> Thanks,
>> Joao
>> 
> 
> After searching in /sys I found the Mellanox device mlx5_0
> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
> 
> #  ibstat mlx5_0
> CA 'mlx5_0'
>        CA type: MT4121
>        Number of ports: 1
>        Firmware version: 16.19.21102
>        Hardware version: 0
>        Node GUID: 0x248a070300aa8466
>        System image GUID: 0x248a070300aa8466
>        Port 1:
>                State: Down
>                Physical state: Disabled
>                Rate: 10
>                Base lid: 65535
>                LMC: 0
>                SM lid: 0
>                Capability mask: 0x2651e848
>                Port GUID: 0x248a070300aa8466
>                Link layer: InfiniBand
> #
> #
> # pwd
> 
> Shouldn't the device be visible in /dev?
Hi Joao,

I'm glad this solved your issue.
You will not see Mellanox devices under /dev; they are visible only under the sysfs path you found.
In /dev you might see the mst devices if you have mst running.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Issue with MLX5 IB driver
       [not found]                         ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:59                           ` Joao Pinto
       [not found]                             ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 11:59 UTC (permalink / raw)
  To: Majd Dibbiny, Joao Pinto
  Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi,

>> After search in /sys I found the mellanox device mlx5_0
>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>> (...)
>> Shouldn't the device be visible in /dev?
>
> Hi Joao,
>
> I'm glad this solved your issue.
> Under dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
> In /dev you might see the mst devices if you have mst running..

Thanks for the help, Majd! I suggest the driver use profile 0 by default to
avoid this kind of memory problem. What do you think? I can prepare a patch for
it if you agree.

Thanks.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Issue with MLX5 IB driver
       [not found]                             ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 12:07                               ` Majd Dibbiny
       [not found]                                 ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Majd Dibbiny @ 2017-06-01 12:07 UTC (permalink / raw)
  To: Joao Pinto
  Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


> On Jun 1, 2017, at 2:59 PM, Joao Pinto <Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org> wrote:
> 
> Hi,
> 
>>> After search in /sys I found the mellanox device mlx5_0
>>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>>> (...)
>>> Shouldn't the device be visible in /dev?
>> 
>> Hi Joao,
>> 
>> I'm glad this solved your issue.
>> Under dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
>> In /dev you might see the mst devices if you have mst running..
> 
> Thanks for the help Majd! I suggest the driver uses the 0 profile by default to
> avoid this kind of meomory problems. What do you think? I can make a patch for
> it if you agree.

I prefer to keep 2 as the default, since it gives the best performance and covers the most common use case.
We will try to make the driver more resilient and fall back to a lower profile if it can't load with the default one.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Issue with MLX5 IB driver
       [not found]                                 ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 12:08                                   ` Joao Pinto
       [not found]                                     ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 12:08 UTC (permalink / raw)
  To: Majd Dibbiny, Joao Pinto
  Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 6/1/2017 1:07 PM, Majd Dibbiny wrote:
> 
>> On Jun 1, 2017, at 2:59 PM, Joao Pinto <Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Hi,
>>
>>>> After search in /sys I found the mellanox device mlx5_0
>>>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>>>> (...)
>>>> Shouldn't the device be visible in /dev?
>>>
>>> Hi Joao,
>>>
>>> I'm glad this solved your issue.
>>> Under dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
>>> In /dev you might see the mst devices if you have mst running..
>>
>> Thanks for the help Majd! I suggest the driver uses the 0 profile by default to
>> avoid this kind of meomory problems. What do you think? I can make a patch for
>> it if you agree.
> 
> I prefer to keep the default 2 which gives the best performance and is the more common use case.
> We will try to make the driver more resilient and fallback to a lower profile if it can't load with the default one.

Ok, makes sense. Thanks.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Issue with Infiniband / MLX5 IB driver when running opensm
       [not found]                                     ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 18:40                                       ` Joao Pinto
  0 siblings, 0 replies; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 18:40 UTC (permalink / raw)
  To: Majd Dibbiny, Joao Pinto, Leon Romanovsky, Matan Barak
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hello,

I am trying to bring up a ConnectX-5 Ex and I am hitting an issue when executing
opensm with the InfiniBand cables connected (looped from one port to the other).
Could you please give me a hint about what might be happening?

# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.19.2244
        Hardware version: 0
        Node GUID: 0x248a0703009ad906
        System image GUID: 0x248a0703009ad906
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 56
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0x248a0703009ad906
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.19.2244
        Hardware version: 0
        Node GUID: 0x248a0703009ad907
        System image GUID: 0x248a0703009ad906
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 56
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0x248a0703009ad907
                Link layer: InfiniBand
#
#
# which opensm
/usr/sbin/opensm
# opensm -g 0x248a0703009ad906 &
# -------------------------------------------------
OpenSM 3.3.20
Command Line Arguments:
 Guid <0x248a0703009ad906>
 Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.20

Entering DISCOVERING state

------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at drivers/infiniband/hw/mlx5/mad.c:263
mlx5_ib_process_mad+0x1a6/0x64c
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Not tainted
4.12.0-MLNX20170524-ge176cc5-dirty #22
Workqueue: ib-comp-wq ib_cq_poll_work

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  warn_slowpath_null+0x48/0xe4
  mlx5_ib_process_mad+0x1a6/0x64c
  ib_mad_recv_done+0x352/0xa7c
  ib_cq_poll_work+0x72/0x130
  process_one_work+0x1c8/0x390
  worker_thread+0x120/0x540
  kthread+0x116/0x13c
  ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3b ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at mm/page_alloc.c:3689
__alloc_pages_nodemask+0x18ec/0x24e4
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G        W
4.12.0-MLNX20170524-ge176cc5-dirty #22
Workqueue: ib-comp-wq ib_cq_poll_work

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  warn_slowpath_null+0x48/0xe4
  __alloc_pages_nodemask+0x18ec/0x24e4
  kmalloc_order+0x16/0x28
  alloc_mad_private+0x12/0x20
  ib_mad_recv_done+0x2bc/0xa7c
  ib_cq_poll_work+0x72/0x130
  process_one_work+0x1c8/0x390
  worker_thread+0x120/0x540
  kthread+0x116/0x13c
  ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3c ]---
BUG: Bad rss-counter state mm:9672c000 idx:1 val:11
BUG: Bad rss-counter state mm:9672c000 idx:3 val:84
BUG: non-zero nr_ptes on freeing mm: 3
Path: /bin/busybox
CPU: 0 PID: 82 Comm: klogd Tainted: G        W
4.12.0-MLNX20170524-ge176cc5-dirty #22
task: 8fe0e3c0 task.stack: 8fe02000

[ECR   ]: 0x00220100 => Invalid Read @ 0x00008088 by insn @ 0x8124babc
[EFA   ]: 0x00008088
[BLINK ]: __d_alloc+0x2c/0x1cc
[ERET  ]: kmem_cache_alloc+0x4c/0xe8
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1080 worker_thread+0x120/0x540
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G        W
4.12.0-MLNX20170524-ge176cc5-dirty #22
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1436 __queue_work+0x3e2/0x3e8
workqueue: per-cpu pwq for ib-comp-wq on cpu0 has 0 refcnt
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G        W
4.12.0-MLNX20170524-ge176cc5-dirty #22

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  warn_slowpath_fmt+0x6c/0x110
  __queue_work+0x3e2/0x3e8
  queue_work_on+0x40/0x48
  mlx5_cq_completion+0x62/0xd8
  mlx5_eq_int+0x2dc/0x3a8
  __handle_irq_event_percpu+0xb8/0x150
  handle_irq_event+0x44/0x8c
  handle_simple_irq+0x5c/0xa4
  generic_handle_irq+0x1c/0x2c
  dw_handle_msi_irq+0x5a/0xd4
  dw_chained_msi_isr+0x26/0x78
  generic_handle_irq+0x1c/0x2c
  dw_apb_ictl_handler+0x7e/0xf8
  __handle_domain_irq+0x56/0x98
  handle_interrupt_level1+0xcc/0xd8
---[ end trace 942bc9d60690df3d ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1064 __queue_work+0x31c/0x3e8
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G        W
4.12.0-MLNX20170524-ge176cc5-dirty #22

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  warn_slowpath_null+0x48/0xe4
  __queue_work+0x31c/0x3e8
  queue_work_on+0x40/0x48
  mlx5_cq_completion+0x62/0xd8
  mlx5_eq_int+0x2dc/0x3a8
  __handle_irq_event_percpu+0xb8/0x150
  handle_irq_event+0x44/0x8c
  handle_simple_irq+0x5c/0xa4
  generic_handle_irq+0x1c/0x2c
  dw_handle_msi_irq+0x5a/0xd4
  dw_chained_msi_isr+0x26/0x78
  generic_handle_irq+0x1c/0x2c
  dw_apb_ictl_handler+0x7e/0xf8
  __handle_domain_irq+0x56/0x98
  handle_interrupt_level1+0xcc/0xd8
---[ end trace 942bc9d60690df3e ]---

Stack Trace:
  arc_unwind_core.constprop.2+0xb4/0x100
  warn_slowpath_null+0x48/0xe4
  worker_thread+0x120/0x540
  kthread+0x116/0x13c
  ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3f ]---
[STAT32]: 0x00000406 : K         E2 E1
BTA: 0x8124ba86  SP: 0x8fe03dec  FP: 0x00000000
LPS: 0x81274348 LPE: 0x81274354 LPC: 0x00000000
r00: 0x00008088 r01: 0x014000c0 r02: 0x00008088
r03: 0x00001b1a r04: 0x00000000 r05: 0x00000806
r06: 0x9a19cea0 r07: 0x00000005 r08: 0x00000054
r09: 0x00000000 r10: 0x00000000 r11: 0x2000a038
r12: 0x00000000

Stack Trace:
  kmem_cache_alloc+0x4c/0xe8
  __d_alloc+0x2c/0x1cc
  d_alloc_parallel+0x46/0x3f8
  path_openat+0xd48/0x132c
  do_filp_open+0x44/0xc0
  SyS_openat+0x144/0x1d4
  EV_Trap+0x11c/0x120


Thank you and best regards,

Joao Pinto


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-06-01 18:40 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-31 15:59 Issue with MLX5 IB driver Joao Pinto
     [not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-05-31 16:18   ` Leon Romanovsky
     [not found]     ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-31 16:39       ` Majd Dibbiny
2017-05-31 19:44       ` Christoph Hellwig
     [not found]         ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-06-01  4:30           ` Leon Romanovsky
     [not found]             ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-01 10:05               ` Joao Pinto
     [not found]                 ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 11:18                   ` Joao Pinto
     [not found]                     ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 11:57                       ` Majd Dibbiny
     [not found]                         ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-06-01 11:59                           ` Joao Pinto
     [not found]                             ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 12:07                               ` Majd Dibbiny
     [not found]                                 ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-06-01 12:08                                   ` Joao Pinto
     [not found]                                     ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 18:40                                       ` Issue with Infiniband / MLX5 IB driver when running opensm Joao Pinto
