* Issue with MLX5 IB driver
@ 2017-05-31 15:59 Joao Pinto
[not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-05-31 15:59 UTC (permalink / raw)
To: matanb-VPRAkNaXOzVWk0Htik3J/w, leonro-VPRAkNaXOzVWk0Htik3J/w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Dear Matan and Leon,
I am trying to bring up a ConnectX-5 Ex endpoint, using a setup composed of a
32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
driver initializes well, but after the MLX5 IB driver initializes, it consumes all the
available memory on my board (400 MB). Does this driver need more than 400 MB to
work?
Kernel used:
Latest 4.12.
Kernel log:
mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
INPUT
mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
doorbell
mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
OUTPUT
mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
ret 0x0, delivery status no errors(0x0)
mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
(...)
mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
(...)
mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
doorbell
mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
MLX5_EVENT_TYPE_CMD
(...)
mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
npages 4096
mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
CREATE_MKEY(0x200) INPUT
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
(...)
kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
nodemask=(null), order=0, oom_score_adj=0
CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
Workqueue: mlx5_page_allocator pages_work_handler
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
dump_header.isra.6+0x82/0x1a8
out_of_memory+0x2fc/0x368
__alloc_pages_nodemask+0x22ee/0x24e4
give_pages+0x1fc/0x664
pages_work_handler+0x2a/0x88
process_one_work+0x1c8/0x390
worker_thread+0x120/0x540
kthread+0x116/0x13c
ret_from_fork+0x18/0x1c
Mem-Info:
active_anon:2083 inactive_anon:7261 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:94 slab_unreclaimable:709
mapped:0 shmem:9344 pagetables:0 bounce:0
free:311 free_pcp:57 free_cma:0
Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
writepending:0kB present:442368kB managed:407104kB mlocked:0kB
slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
lowmem_reserve[]: 0 0
Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
9344 total pagecache pages
55296 pages RAM
0 pages HighMem/MovableOnly
4408 pages reserved
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Kernel panic - not syncing: Out of memory and no killable processes...
---[ end Kernel panic - not syncing: Out of memory and no killable processes...
Thank you and best regards,
Joao Pinto
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Issue with MLX5 IB driver
[not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-31 16:18 ` Leon Romanovsky
[not found] ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-05-31 16:18 UTC (permalink / raw)
To: Joao Pinto
Cc: matanb-VPRAkNaXOzVWk0Htik3J/w, linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, May 31, 2017 at 04:59:45PM +0100, Joao Pinto wrote:
> Dear Matan and Leon,
>
> I am trying to bring up a ConnectX-5 Ex endpoint, using a setup composed of a
> 32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
> driver initializes well, but after the MLX5 IB driver initializes, it consumes all the
> available memory on my board (400 MB). Does this driver need more than 400 MB to
> work?
I think that you are hitting the side effect of these commits
7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
81713d3788d2 ("IB/mlx5: Add implicit MR support")
Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
for the test?
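For reference, one way to check the option in the running kernel and then disable it for a rebuild could look like this (a sketch assuming a standard kernel source tree; `scripts/config` ships with the kernel sources, and the config paths depend on the distribution):

```shell
# Check whether on-demand paging (ODP) is enabled in the running kernel;
# the config may live in /boot or, if CONFIG_IKCONFIG is set, in /proc
grep CONFIG_INFINIBAND_ON_DEMAND_PAGING "/boot/config-$(uname -r)" 2>/dev/null \
    || zgrep CONFIG_INFINIBAND_ON_DEMAND_PAGING /proc/config.gz

# Disable it in the kernel source tree, then refresh the config before rebuilding
./scripts/config --disable INFINIBAND_ON_DEMAND_PAGING
make olddefconfig
```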
Thanks
>
> Kernel used:
>
> Latest 4.12.
>
> Kernel log:
>
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
> INPUT
> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
> doorbell
> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
> OUTPUT
> mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
> ret 0x0, delivery status no errors(0x0)
> mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
> (...)
> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
> (...)
> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
> QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
> doorbell
> mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
> MLX5_EVENT_TYPE_CMD
> (...)
> mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
> npages 4096
> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
> CREATE_MKEY(0x200) INPUT
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
> (...)
> kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
> nodemask=(null), order=0, oom_score_adj=0
> CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
> Workqueue: mlx5_page_allocator pages_work_handler
>
> Stack Trace:
> arc_unwind_core.constprop.2+0xb4/0x100
> dump_header.isra.6+0x82/0x1a8
> out_of_memory+0x2fc/0x368
> __alloc_pages_nodemask+0x22ee/0x24e4
> give_pages+0x1fc/0x664
> pages_work_handler+0x2a/0x88
> process_one_work+0x1c8/0x390
> worker_thread+0x120/0x540
> kthread+0x116/0x13c
> ret_from_fork+0x18/0x1c
> Mem-Info:
> active_anon:2083 inactive_anon:7261 isolated_anon:0
> active_file:0 inactive_file:0 isolated_file:0
> unevictable:0 dirty:0 writeback:0 unstable:0
> slab_reclaimable:94 slab_unreclaimable:709
> mapped:0 shmem:9344 pagetables:0 bounce:0
> free:311 free_pcp:57 free_cma:0
> Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
> inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
> writepending:0kB present:442368kB managed:407104kB mlocked:0kB
> slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
> pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
> lowmem_reserve[]: 0 0
> Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
> 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
> 9344 total pagecache pages
> 55296 pages RAM
> 0 pages HighMem/MovableOnly
> 4408 pages reserved
> [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
> Kernel panic - not syncing: Out of memory and no killable processes...
>
> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
>
>
> Thank you and best regards,
>
> Joao Pinto
* Re: Issue with MLX5 IB driver
[not found] ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-05-31 16:39 ` Majd Dibbiny
2017-05-31 19:44 ` Christoph Hellwig
1 sibling, 0 replies; 12+ messages in thread
From: Majd Dibbiny @ 2017-05-31 16:39 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Joao Pinto, Matan Barak, linux-rdma-u79uwXL29TY76Z2rM5mHXA
> On May 31, 2017, at 7:18 PM, Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> On Wed, May 31, 2017 at 04:59:45PM +0100, Joao Pinto wrote:
>> Dear Matan and Leon,
>>
>> I am trying to bring up a ConnectX-5 Ex endpoint, using a setup composed of a
>> 32-bit CPU and 512 MB of RAM (PCIe prototyping platform). The MLX5 Ethernet
>> driver initializes well, but after the MLX5 IB driver initializes, it consumes all the
>> available memory on my board (400 MB). Does this driver need more than 400 MB to
>> work?
>
> I think that you are hitting the side effect of these commits
> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>
> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> for the test?
>
> Thanks
Hi Joao,
As Leon mentioned, those commits increased the driver's memory consumption.
In your case, to work in a low-memory environment I would suggest setting the profile selector (prof_sel) module parameter of mlx5_core to 0 (instead of the default 2). This will have some impact on performance, but that's the trade-off.
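Concretely, setting the parameter could look like the following sketch (mlx5_ib must be unloaded before mlx5_core can be reloaded; the modprobe.d file name is arbitrary):

```shell
# Reload mlx5_core with the minimal resource profile
modprobe -r mlx5_ib mlx5_core
modprobe mlx5_core prof_sel=0

# Make the setting persistent across reboots
echo "options mlx5_core prof_sel=0" > /etc/modprobe.d/mlx5.conf
```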
>
>>
>> Kernel used:
>>
>> Latest 4.12.
>>
>> Kernel log:
>>
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> mlx5_core 0000:01:00.0: mlx5_cmd_init:1765:(pid 1): descriptor at dma 0x9a25a000
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
>> INPUT
>> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 5): writing 0x1 to command
>> doorbell
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 5): dump command ENABLE_HCA(0x104)
>> OUTPUT
>> mlx5_core 0000:01:00.0: mlx5_cmd_comp_handler:1418:(pid 5): command completed.
>> ret 0x0, delivery status no errors(0x0)
>> mlx5_core 0000:01:00.0: wait_func:893:(pid 1): err 0, delivery status no errors(0)
>> (...)
>> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
>> (...)
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
>> QUERY_HCA_VPORT_CONTEXT(0x762) INPUT
>> mlx5_core 0000:01:00.0: cmd_work_handler:829:(pid 40): writing 0x1 to command
>> doorbell
>> mlx5_core 0000:01:00.0: mlx5_eq_int:394:(pid 5): eqn 16, eqe type
>> MLX5_EVENT_TYPE_CMD
>> (...)
>> mlx5_core 0000:01:00.0: mlx5_eq_int:460:(pid 0): page request for func 0x0,
>> npages 4096
>> mlx5_core 0000:01:00.0: dump_command:726:(pid 40): dump command
>> CREATE_MKEY(0x200) INPUT
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> mlx5_core 0000:01:00.0: cmd_exec:1558:(pid 61): err 0, status 0
>> (...)
>> kworker/u2:3 invoked oom-killer: gfp_mask=0x14200c2(GFP_HIGHUSER),
>> nodemask=(null), order=0, oom_score_adj=0
>> CPU: 0 PID: 61 Comm: kworker/u2:3 Not tainted 4.12.0-MLNX20170524 #46
>> Workqueue: mlx5_page_allocator pages_work_handler
>>
>> Stack Trace:
>> arc_unwind_core.constprop.2+0xb4/0x100
>> dump_header.isra.6+0x82/0x1a8
>> out_of_memory+0x2fc/0x368
>> __alloc_pages_nodemask+0x22ee/0x24e4
>> give_pages+0x1fc/0x664
>> pages_work_handler+0x2a/0x88
>> process_one_work+0x1c8/0x390
>> worker_thread+0x120/0x540
>> kthread+0x116/0x13c
>> ret_from_fork+0x18/0x1c
>> Mem-Info:
>> active_anon:2083 inactive_anon:7261 isolated_anon:0
>> active_file:0 inactive_file:0 isolated_file:0
>> unevictable:0 dirty:0 writeback:0 unstable:0
>> slab_reclaimable:94 slab_unreclaimable:709
>> mapped:0 shmem:9344 pagetables:0 bounce:0
>> free:311 free_pcp:57 free_cma:0
>> Node 0 active_anon:16664kB inactive_anon:58088kB active_file:0kB
>> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
>> mapped:0kB dirty:0kB writeback:0kB shmem:74752kB writeback_tmp:0kB unstable:0kB
>> all_unreclaimable? yes
>> Normal free:2488kB min:2552kB low:3184kB high:3816kB active_anon:16664kB
>> inactive_anon:58088kB active_file:0kB inactive_file:0kB unevictable:0kB
>> writepending:0kB present:442368kB managed:407104kB mlocked:0kB
>> slab_reclaimable:752kB slab_unreclaimable:5672kB kernel_stack:424kB
>> pagetables:0kB bounce:0kB free_pcp:456kB local_pcp:456kB free_cma:0kB
>> lowmem_reserve[]: 0 0
>> Normal: 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
>> 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB = 2488kB
>> 9344 total pagecache pages
>> 55296 pages RAM
>> 0 pages HighMem/MovableOnly
>> 4408 pages reserved
>> [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
>> Kernel panic - not syncing: Out of memory and no killable processes...
>>
>> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
>>
>>
>> Thank you and best regards,
>>
>> Joao Pinto
* Re: Issue with MLX5 IB driver
[not found] ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-31 16:39 ` Majd Dibbiny
@ 2017-05-31 19:44 ` Christoph Hellwig
[not found] ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2017-05-31 19:44 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Joao Pinto, matanb-VPRAkNaXOzVWk0Htik3J/w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
> I think that you are hitting the side effect of these commits
> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>
> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> for the test?
Eww. Please make sure mlx5 gracefully handles cases where it can't use
crazy amounts of memory, including disabling features like the above
at runtime when the required resources aren't available.
* Re: Issue with MLX5 IB driver
[not found] ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2017-06-01 4:30 ` Leon Romanovsky
[not found] ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-06-01 4:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Joao Pinto, matanb-VPRAkNaXOzVWk0Htik3J/w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
> > I think that you are hitting the side effect of these commits
> > 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
> > 81713d3788d2 ("IB/mlx5: Add implicit MR support")
> >
> > Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
> > for the test?
>
> Eww. Please make sure mlx5 gracefully handles cases where it can't use
> crazy amount of memory, including disabling features like the above
> at runtime when the required resources aren't available.
Right, the real consumer of memory in mlx5_ib is the MR cache, so the
question is how we can check in advance whether we have enough memory
without issuing allocations with the GFP_NOWARN flag.
* Re: Issue with MLX5 IB driver
[not found] ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-06-01 10:05 ` Joao Pinto
[not found] ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 10:05 UTC (permalink / raw)
To: Leon Romanovsky, matanb-VPRAkNaXOzVWk0Htik3J/w
Cc: Christoph Hellwig, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello,
At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>> I think that you are hitting the side effect of these commits
>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>
>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>> for the test?
>>
>> Eww. Please make sure mlx5 gracefully handles cases where it can't use
>> crazy amount of memory, including disabling features like the above
>> at runtime when the required resources aren't available.
>
> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
> question is how can we check in advance if we have enough memory
> without calling allocations with GFP_NOWARN flag.
>
With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
It crashes the same way.
With MLX5_DEFAULT_PROF defined as 0:
There is no crash.
mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
(...)
mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
(...)
mlx5_core 0000:01:00.0: device's health compromised - reached miss count
mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
mlx5_core 0000:01:00.0: assert_callra 0x00699680
mlx5_core 0000:01:00.0: fw_ver 16.19.21102
mlx5_core 0000:01:00.0: hw_id 0x0000020d
mlx5_core 0000:01:00.0: irisc_index 0
mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
mlx5_core 0000:01:00.0: ext_synd 0x11c5
mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e
lspci -v result:
01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Subsystem: Mellanox Technologies Device 0002
Flags: bus master, fast devsel, latency 0
Memory at d2000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [1c0] #19
Capabilities: [320] #27
Kernel driver in use: mlx5_core
Interrupts:
45: 0 PCI-MSI 0 aerdrv
46: 2 PCI-MSI 524288 mlx5_pages_eq@pci:0000:01:00.0
47: 347 PCI-MSI 524289 mlx5_cmd_eq@pci:0000:01:00.0
48: 0 PCI-MSI 524290 mlx5_async_eq@pci:0000:01:00.0
50: 0 PCI-MSI 524292 mlx5_comp0@pci:0000:01:00.0
List of devices:
# ls /dev/infiniband/
issm0 rdma_cm ucm0 umad0 uverbs0
Shouldn't I be getting some Mellanox devices?
Thanks,
Joao
* Re: Issue with MLX5 IB driver
[not found] ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:18 ` Joao Pinto
[not found] ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 11:18 UTC (permalink / raw)
To: Leon Romanovsky, matanb-VPRAkNaXOzVWk0Htik3J/w
Cc: Christoph Hellwig, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA
At 11:05 AM on 6/1/2017, Joao Pinto wrote:
>
> Hello,
>
>> At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
>> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>>> I think that you are hitting the side effect of these commits
>>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>>
>>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>>> for the test?
>>>
>>> Eww. Please make sure mlx5 gracefully handles cases where it can't use
>>> crazy amount of memory, including disabling features like the above
>>> at runtime when the required resources aren't available.
>>
>> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
>> question is how can we check in advance if we have enough memory
>> without calling allocations with GFP_NOWARN flag.
>>
>
> With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
> Crashes the same way.
>
> With MLX5_DEFAULT_PROF defined as 0:
>
> There is no crash.
>
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> (...)
> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
> (...)
> mlx5_core 0000:01:00.0: device's health compromised - reached miss count
> mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
> mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
> mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
> mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
> mlx5_core 0000:01:00.0: assert_callra 0x00699680
> mlx5_core 0000:01:00.0: fw_ver 16.19.21102
> mlx5_core 0000:01:00.0: hw_id 0x0000020d
> mlx5_core 0000:01:00.0: irisc_index 0
> mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
> mlx5_core 0000:01:00.0: ext_synd 0x11c5
> mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e
>
> lspci -v result:
>
> 01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
> Subsystem: Mellanox Technologies Device 0002
> Flags: bus master, fast devsel, latency 0
> Memory at d2000000 (64-bit, prefetchable) [size=32M]
> Capabilities: [60] Express Endpoint, MSI 00
> Capabilities: [48] Vital Product Data
> Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
> Capabilities: [c0] Vendor Specific Information: Len=18 <?>
> Capabilities: [40] Power Management version 3
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
> Capabilities: [1c0] #19
> Capabilities: [320] #27
> Kernel driver in use: mlx5_core
>
> Interrupts:
>
> 45: 0 PCI-MSI 0 aerdrv
> 46: 2 PCI-MSI 524288 mlx5_pages_eq@pci:0000:01:00.0
> 47: 347 PCI-MSI 524289 mlx5_cmd_eq@pci:0000:01:00.0
> 48: 0 PCI-MSI 524290 mlx5_async_eq@pci:0000:01:00.0
> 50: 0 PCI-MSI 524292 mlx5_comp0@pci:0000:01:00.0
>
> List of devices:
>
> # ls /dev/infiniband/
> issm0 rdma_cm ucm0 umad0 uverbs0
>
> Shouldn't I be getting some mellanox devices?
>
> Thanks,
> Joao
>
After searching in /sys I found the Mellanox device mlx5_0
(/sys/class/infiniband/mlx5_0/) and was able to run ibstat on it:
# ibstat mlx5_0
CA 'mlx5_0'
CA type: MT4121
Number of ports: 1
Firmware version: 16.19.21102
Hardware version: 0
Node GUID: 0x248a070300aa8466
System image GUID: 0x248a070300aa8466
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x248a070300aa8466
Link layer: InfiniBand
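The same information can also be read directly from sysfs; a small sketch of enumerating the devices and port states (attribute paths as exposed by the standard /sys/class/infiniband ABI):

```shell
# Enumerate RDMA devices and their port states via sysfs
for dev in /sys/class/infiniband/*; do
    echo "$(basename "$dev"): fw=$(cat "$dev/fw_ver") guid=$(cat "$dev/node_guid")"
    for port in "$dev"/ports/*; do
        echo "  port $(basename "$port"): $(cat "$port/state")"
    done
done
```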
Shouldn't the device be visible in /dev?
Thanks.
* Re: Issue with MLX5 IB driver
[not found] ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:57 ` Majd Dibbiny
[not found] ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Majd Dibbiny @ 2017-06-01 11:57 UTC (permalink / raw)
To: Joao Pinto
Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 5506 bytes --]
> On Jun 1, 2017, at 2:19 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>
> At 11:05 AM on 6/1/2017, Joao Pinto wrote:
>>
>> Hello,
>>
>> At 5:30 AM on 6/1/2017, Leon Romanovsky wrote:
>>>> On Wed, May 31, 2017 at 12:44:26PM -0700, Christoph Hellwig wrote:
>>>>> On Wed, May 31, 2017 at 07:18:19PM +0300, Leon Romanovsky wrote:
>>>>> I think that you are hitting the side effect of these commits
>>>>> 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") and
>>>>> 81713d3788d2 ("IB/mlx5: Add implicit MR support")
>>>>>
>>>>> Do you have CONFIG_INFINIBAND_ON_DEMAND_PAGING on? Can you disable it
>>>>> for the test?
>>>>
>>>> Eww. Please make sure mlx5 gracefully handles cases where it can't use
>>>> crazy amount of memory, including disabling features like the above
>>>> at runtime when the required resources aren't available.
>>>
>>> Right, the real consumer of memory in mlx5_ib is mr_cache, so the
>>> question is how can we check in advance if we have enough memory
>>> without calling allocations with GFP_NOWARN flag.
>>>
>>
>> With CONFIG_INFINIBAND_ON_DEMAND_PAGING disabled:
>> Crashes the same way.
>>
>> With MLX5_DEFAULT_PROF defined as 0:
>>
>> There is no crash.
>>
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> (...)
>> mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)
>> (...)
>> mlx5_core 0000:01:00.0: device's health compromised - reached miss count
>> mlx5_core 0000:01:00.0: assert_var[0] 0x00000001
>> mlx5_core 0000:01:00.0: assert_var[1] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[2] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[3] 0x00000000
>> mlx5_core 0000:01:00.0: assert_var[4] 0x00000000
>> mlx5_core 0000:01:00.0: assert_exit_ptr 0x006994c0
>> mlx5_core 0000:01:00.0: assert_callra 0x00699680
>> mlx5_core 0000:01:00.0: fw_ver 16.19.21102
>> mlx5_core 0000:01:00.0: hw_id 0x0000020d
>> mlx5_core 0000:01:00.0: irisc_index 0
>> mlx5_core 0000:01:00.0: synd 0x1: firmware internal error
>> mlx5_core 0000:01:00.0: ext_synd 0x11c5
>> mlx5_core 0000:01:00.0: raw fw_ver 0x1013526e
>>
>> lspci -v result:
>>
>> 01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
>> Subsystem: Mellanox Technologies Device 0002
>> Flags: bus master, fast devsel, latency 0
>> Memory at d2000000 (64-bit, prefetchable) [size=32M]
>> Capabilities: [60] Express Endpoint, MSI 00
>> Capabilities: [48] Vital Product Data
>> Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
>> Capabilities: [c0] Vendor Specific Information: Len=18 <?>
>> Capabilities: [40] Power Management version 3
>> Capabilities: [100] Advanced Error Reporting
>> Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>> Capabilities: [1c0] #19
>> Capabilities: [320] #27
>> Kernel driver in use: mlx5_core
>>
>> Interrupts:
>>
>> 45: 0 PCI-MSI 0 aerdrv
>> 46: 2 PCI-MSI 524288 mlx5_pages_eq@pci:0000:01:00.0
>> 47: 347 PCI-MSI 524289 mlx5_cmd_eq@pci:0000:01:00.0
>> 48: 0 PCI-MSI 524290 mlx5_async_eq@pci:0000:01:00.0
>> 50: 0 PCI-MSI 524292 mlx5_comp0@pci:0000:01:00.0
>>
>> List of devices:
>>
>> # ls /dev/infiniband/
>> issm0 rdma_cm ucm0 umad0 uverbs0
>>
>> Shouldn't I be getting some mellanox devices?
>>
>> Thanks,
>> Joao
>>
>
> After searching /sys I found the Mellanox device
> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>
> # ibstat mlx5_0
> CA 'mlx5_0'
> CA type: MT4121
> Number of ports: 1
> Firmware version: 16.19.21102
> Hardware version: 0
> Node GUID: 0x248a070300aa8466
> System image GUID: 0x248a070300aa8466
> Port 1:
> State: Down
> Physical state: Disabled
> Rate: 10
> Base lid: 65535
> LMC: 0
> SM lid: 0
> Capability mask: 0x2651e848
> Port GUID: 0x248a070300aa8466
> Link layer: InfiniBand
> #
> #
> # pwd
>
> Shouldn't the device be visible in /dev?
Hi Joao,
I'm glad this solved your issue.
Under /dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
In /dev you might see the mst devices if you have mst running.
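To make the split concrete, here is a small, hypothetical sketch that recreates the layout under a throwaway directory (paths mirror the listings earlier in the thread; no hardware needed, so the names here are illustrative only):

```python
import os
import tempfile

# Recreate the split described above under a temp root so the listing
# runs anywhere: the HCA (mlx5_0) appears only under sysfs, while
# /dev/infiniband carries just the generic character nodes.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sys/class/infiniband/mlx5_0"))
os.makedirs(os.path.join(root, "dev/infiniband"))
for node in ("issm0", "rdma_cm", "ucm0", "umad0", "uverbs0"):
    open(os.path.join(root, "dev/infiniband", node), "w").close()

print(sorted(os.listdir(os.path.join(root, "sys/class/infiniband"))))
# -> ['mlx5_0']
print(sorted(os.listdir(os.path.join(root, "dev/infiniband"))))
# -> ['issm0', 'rdma_cm', 'ucm0', 'umad0', 'uverbs0']
```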
>
> Thanks.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Issue with MLX5 IB driver
[not found] ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 11:59 ` Joao Pinto
[not found] ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 11:59 UTC (permalink / raw)
To: Majd Dibbiny, Joao Pinto
Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
>
>> After searching /sys I found the Mellanox device
>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>>
>> # ibstat mlx5_0
>> CA 'mlx5_0'
>> CA type: MT4121
>> Number of ports: 1
>> Firmware version: 16.19.21102
>> Hardware version: 0
>> Node GUID: 0x248a070300aa8466
>> System image GUID: 0x248a070300aa8466
>> Port 1:
>> State: Down
>> Physical state: Disabled
>> Rate: 10
>> Base lid: 65535
>> LMC: 0
>> SM lid: 0
>> Capability mask: 0x2651e848
>> Port GUID: 0x248a070300aa8466
>> Link layer: InfiniBand
>> #
>> #
>> # pwd
>>
>> Shouldn't the device be visible in /dev?
>
> Hi Joao,
>
> I'm glad this solved your issue.
> Under /dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
> In /dev you might see the mst devices if you have mst running.
Thanks for the help, Majd! I suggest the driver use profile 0 by default to
avoid this kind of memory problem. What do you think? I can make a patch for
it if you agree.
Thanks.
* Re: Issue with MLX5 IB driver
[not found] ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 12:07 ` Majd Dibbiny
[not found] ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Majd Dibbiny @ 2017-06-01 12:07 UTC (permalink / raw)
To: Joao Pinto
Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
> On Jun 1, 2017, at 2:59 PM, Joao Pinto <Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org> wrote:
>
> Hi,
>
>>
>>> After searching /sys I found the Mellanox device
>>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>>>
>>> # ibstat mlx5_0
>>> CA 'mlx5_0'
>>> CA type: MT4121
>>> Number of ports: 1
>>> Firmware version: 16.19.21102
>>> Hardware version: 0
>>> Node GUID: 0x248a070300aa8466
>>> System image GUID: 0x248a070300aa8466
>>> Port 1:
>>> State: Down
>>> Physical state: Disabled
>>> Rate: 10
>>> Base lid: 65535
>>> LMC: 0
>>> SM lid: 0
>>> Capability mask: 0x2651e848
>>> Port GUID: 0x248a070300aa8466
>>> Link layer: InfiniBand
>>> #
>>> #
>>> # pwd
>>>
>>> Shouldn't the device be visible in /dev?
>>
>> Hi Joao,
>>
>> I'm glad this solved your issue.
>> Under /dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
>> In /dev you might see the mst devices if you have mst running.
>
> Thanks for the help, Majd! I suggest the driver use profile 0 by default to
> avoid this kind of memory problem. What do you think? I can make a patch for
> it if you agree.
I prefer to keep the default profile 2, which gives the best performance and covers the most common use case.
We will try to make the driver more resilient and fall back to a lower profile if it can't load with the default one.
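For anyone who does need the low-memory profile today, mlx5_core exposes a prof_sel module parameter (present in 4.12-era sources; verify against your tree), so a modprobe.d fragment can pin it without patching the driver:

```
# /etc/modprobe.d/mlx5.conf -- select the minimal resource profile (0)
# instead of the default performance profile (2). Parameter name as in
# drivers/net/ethernet/mellanox/mlx5/core/main.c; check your kernel tree.
options mlx5_core prof_sel=0
```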
>
> Thanks.
>
* Re: Issue with MLX5 IB driver
[not found] ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 12:08 ` Joao Pinto
[not found] ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 12:08 UTC (permalink / raw)
To: Majd Dibbiny, Joao Pinto
Cc: Leon Romanovsky, Matan Barak, Christoph Hellwig,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
At 1:07 PM on 6/1/2017, Majd Dibbiny wrote:
>
>> On Jun 1, 2017, at 2:59 PM, Joao Pinto <Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Hi,
>>
>>>
>>>> After searching /sys I found the Mellanox device
>>>> (/sys/class/infiniband/mlx5_0/) and was able to execute ibstat on it:
>>>>
>>>> # ibstat mlx5_0
>>>> CA 'mlx5_0'
>>>> CA type: MT4121
>>>> Number of ports: 1
>>>> Firmware version: 16.19.21102
>>>> Hardware version: 0
>>>> Node GUID: 0x248a070300aa8466
>>>> System image GUID: 0x248a070300aa8466
>>>> Port 1:
>>>> State: Down
>>>> Physical state: Disabled
>>>> Rate: 10
>>>> Base lid: 65535
>>>> LMC: 0
>>>> SM lid: 0
>>>> Capability mask: 0x2651e848
>>>> Port GUID: 0x248a070300aa8466
>>>> Link layer: InfiniBand
>>>> #
>>>> #
>>>> # pwd
>>>>
>>>> Shouldn't the device be visible in /dev?
>>>
>>> Hi Joao,
>>>
>>> I'm glad this solved your issue.
>>> Under /dev you will not see Mellanox devices. They are visible only under the sysfs path you found.
>>> In /dev you might see the mst devices if you have mst running.
>>
>> Thanks for the help, Majd! I suggest the driver use profile 0 by default to
>> avoid this kind of memory problem. What do you think? I can make a patch for
>> it if you agree.
>
> I prefer to keep the default profile 2, which gives the best performance and covers the most common use case.
> We will try to make the driver more resilient and fall back to a lower profile if it can't load with the default one.
Ok, makes sense. Thanks.
>>
>> Thanks.
>>
* Issue with Infiniband / MLX5 IB driver when running opensm
[not found] ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-06-01 18:40 ` Joao Pinto
0 siblings, 0 replies; 12+ messages in thread
From: Joao Pinto @ 2017-06-01 18:40 UTC (permalink / raw)
To: Majd Dibbiny, Joao Pinto, Leon Romanovsky, Matan Barak
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello,
I am trying to bring up a ConnectX-5 Ex and I am hitting an issue when
executing opensm with the InfiniBand cables connected (looped from one
port back to the other). Could you please give me a hint of what might be happening?
# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.19.2244
Hardware version: 0
Node GUID: 0x248a0703009ad906
System image GUID: 0x248a0703009ad906
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 56
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x248a0703009ad906
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.19.2244
Hardware version: 0
Node GUID: 0x248a0703009ad907
System image GUID: 0x248a0703009ad906
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 56
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x248a0703009ad907
Link layer: InfiniBand
#
#
# which opensm
/usr/sbin/opensm
# opensm -g 0x248a0703009ad906 &
# -------------------------------------------------
OpenSM 3.3.20
Command Line Arguments:
Guid <0x248a0703009ad906>
Log File: /var/log/opensm.log
-------------------------------------------------
OpenSM 3.3.20
Entering DISCOVERING state
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at drivers/infiniband/hw/mlx5/mad.c:263
mlx5_ib_process_mad+0x1a6/0x64c
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Not tainted
4.12.0-MLNX20170524-ge176cc5-dirty #22
Workqueue: ib-comp-wq ib_cq_poll_work
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
warn_slowpath_null+0x48/0xe4
mlx5_ib_process_mad+0x1a6/0x64c
ib_mad_recv_done+0x352/0xa7c
ib_cq_poll_work+0x72/0x130
process_one_work+0x1c8/0x390
worker_thread+0x120/0x540
kthread+0x116/0x13c
ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3b ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at mm/page_alloc.c:3689
__alloc_pages_nodemask+0x18ec/0x24e4
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G W
4.12.0-MLNX20170524-ge176cc5-dirty #22
Workqueue: ib-comp-wq ib_cq_poll_work
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
warn_slowpath_null+0x48/0xe4
__alloc_pages_nodemask+0x18ec/0x24e4
kmalloc_order+0x16/0x28
alloc_mad_private+0x12/0x20
ib_mad_recv_done+0x2bc/0xa7c
ib_cq_poll_work+0x72/0x130
process_one_work+0x1c8/0x390
worker_thread+0x120/0x540
kthread+0x116/0x13c
ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3c ]---
BUG: Bad rss-counter state mm:9672c000 idx:1 val:11
BUG: Bad rss-counter state mm:9672c000 idx:3 val:84
BUG: non-zero nr_ptes on freeing mm: 3
Path: /bin/busybox
CPU: 0 PID: 82 Comm: klogd Tainted: G W
4.12.0-MLNX20170524-ge176cc5-dirty #22
task: 8fe0e3c0 task.stack: 8fe02000
[ECR ]: 0x00220100 => Invalid Read @ 0x00008088 by insn @ 0x8124babc
[EFA ]: 0x00008088
[BLINK ]: __d_alloc+0x2c/0x1cc
[ERET ]: kmem_cache_alloc+0x4c/0xe8
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1080 worker_thread+0x120/0x540
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G W
4.12.0-MLNX20170524-ge176cc5-dirty #22
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1436 __queue_work+0x3e2/0x3e8
workqueue: per-cpu pwq for ib-comp-wq on cpu0 has 0 refcnt
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G W
4.12.0-MLNX20170524-ge176cc5-dirty #22
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
warn_slowpath_fmt+0x6c/0x110
__queue_work+0x3e2/0x3e8
queue_work_on+0x40/0x48
mlx5_cq_completion+0x62/0xd8
mlx5_eq_int+0x2dc/0x3a8
__handle_irq_event_percpu+0xb8/0x150
handle_irq_event+0x44/0x8c
handle_simple_irq+0x5c/0xa4
generic_handle_irq+0x1c/0x2c
dw_handle_msi_irq+0x5a/0xd4
dw_chained_msi_isr+0x26/0x78
generic_handle_irq+0x1c/0x2c
dw_apb_ictl_handler+0x7e/0xf8
__handle_domain_irq+0x56/0x98
handle_interrupt_level1+0xcc/0xd8
---[ end trace 942bc9d60690df3d ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 128 at kernel/workqueue.c:1064 __queue_work+0x31c/0x3e8
Modules linked in:
CPU: 0 PID: 128 Comm: kworker/0:1H Tainted: G W
4.12.0-MLNX20170524-ge176cc5-dirty #22
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
warn_slowpath_null+0x48/0xe4
__queue_work+0x31c/0x3e8
queue_work_on+0x40/0x48
mlx5_cq_completion+0x62/0xd8
mlx5_eq_int+0x2dc/0x3a8
__handle_irq_event_percpu+0xb8/0x150
handle_irq_event+0x44/0x8c
handle_simple_irq+0x5c/0xa4
generic_handle_irq+0x1c/0x2c
dw_handle_msi_irq+0x5a/0xd4
dw_chained_msi_isr+0x26/0x78
generic_handle_irq+0x1c/0x2c
dw_apb_ictl_handler+0x7e/0xf8
__handle_domain_irq+0x56/0x98
handle_interrupt_level1+0xcc/0xd8
---[ end trace 942bc9d60690df3e ]---
Stack Trace:
arc_unwind_core.constprop.2+0xb4/0x100
warn_slowpath_null+0x48/0xe4
worker_thread+0x120/0x540
kthread+0x116/0x13c
ret_from_fork+0x18/0x1c
---[ end trace 942bc9d60690df3f ]---
[STAT32]: 0x00000406 : K E2 E1
BTA: 0x8124ba86 SP: 0x8fe03dec FP: 0x00000000
LPS: 0x81274348 LPE: 0x81274354 LPC: 0x00000000
r00: 0x00008088 r01: 0x014000c0 r02: 0x00008088
r03: 0x00001b1a r04: 0x00000000 r05: 0x00000806
r06: 0x9a19cea0 r07: 0x00000005 r08: 0x00000054
r09: 0x00000000 r10: 0x00000000 r11: 0x2000a038
r12: 0x00000000
Stack Trace:
kmem_cache_alloc+0x4c/0xe8
__d_alloc+0x2c/0x1cc
d_alloc_parallel+0x46/0x3f8
path_openat+0xd48/0x132c
do_filp_open+0x44/0xc0
SyS_openat+0x144/0x1d4
EV_Trap+0x11c/0x120
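For reference, the ibstat text at the top of this message can be checked mechanically; a small parser sketch (assuming the stock ibstat output format; sample abridged from the transcript above):

```python
# Minimal parser for ibstat output: collect each CA's key/value fields.
# Sample abridged from the transcript above; one port per CA, so port
# fields are folded into the CA's dict for simplicity.
SAMPLE = """\
CA 'mlx5_0'
        CA type: MT4119
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Base lid: 65535
CA 'mlx5_1'
        CA type: MT4119
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Base lid: 65535
"""

def parse_ibstat(text):
    cas, ca = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("CA '"):
            ca = stripped.split("'")[1]
            cas[ca] = {}
        elif ":" in stripped and ca is not None:
            key, _, val = stripped.partition(":")
            cas[ca][key.strip()] = val.strip()
    return cas

cas = parse_ibstat(SAMPLE)
# Both ports show Physical state LinkUp but State Initializing with base
# lid 65535 (0xffff, "no LID assigned"): the link trained, but no SM has
# configured the fabric yet -- consistent with opensm dying in DISCOVERING.
print(sorted(cas))  # -> ['mlx5_0', 'mlx5_1']
```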
Thank you and best regards,
Joao Pinto
end of thread, other threads:[~2017-06-01 18:40 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-31 15:59 Issue with MLX5 IB driver Joao Pinto
[not found] ` <ae8a8bbf-edb5-1909-824c-f98384f506b0-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-05-31 16:18 ` Leon Romanovsky
[not found] ` <20170531161819.GK5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-31 16:39 ` Majd Dibbiny
2017-05-31 19:44 ` Christoph Hellwig
[not found] ` <20170531194426.GA23120-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-06-01 4:30 ` Leon Romanovsky
[not found] ` <20170601043013.GN5406-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-01 10:05 ` Joao Pinto
[not found] ` <09d8f6bc-5994-82d1-9a0f-59540b6c525f-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 11:18 ` Joao Pinto
[not found] ` <fbb4b7cb-e3e4-b540-22e4-5d920857e8fe-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 11:57 ` Majd Dibbiny
[not found] ` <52727D4A-F647-4924-8DF0-4D7F248626AA-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-06-01 11:59 ` Joao Pinto
[not found] ` <7a4e8dce-f1af-d664-bb0b-062f84b45b60-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 12:07 ` Majd Dibbiny
[not found] ` <E798E910-E897-4C14-9161-BE1220D412DF-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-06-01 12:08 ` Joao Pinto
[not found] ` <455d9539-8284-7e8d-fe8b-17035b511e9d-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-06-01 18:40 ` Issue with Infiniband / MLX5 IB driver when running opensm Joao Pinto