* mlx5 endpoint driver problem
@ 2017-05-09 16:25 Joao Pinto
       [not found] ` <f0b8881d-9aa3-8816-7ea6-daccc0e91262-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  2017-05-09 17:35 ` Saeed Mahameed
  0 siblings, 2 replies; 10+ messages in thread
From: Joao Pinto @ 2017-05-09 16:25 UTC (permalink / raw)
  To: saeedm; +Cc: netdev

Hello,

I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
when trying to enable the HCA:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
INFO: task swapper:1 blocked for more than 10 seconds.
      Not tainted 4.11.0-BETAMSIX1 #51
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper         D    0     1      0 0x00000000

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60

(...)

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60
mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
cause a leak of a command resource
mlx5_core 0000:01:00.0: enable hca failed
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
mlx5_core: probe of 0000:01:00.0 failed with error -110

Could you give me a clue about what might be happening?

Thanks,
Joao


* mlx5 endpoint driver problem
       [not found] ` <f0b8881d-9aa3-8816-7ea6-daccc0e91262-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-09 17:13   ` Joao Pinto
       [not found]     ` <18d3cc2e-e235-6ae4-cd69-b5a11d607ee4-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Joao Pinto @ 2017-05-09 17:13 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA


Hello,

I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
when trying to enable the HCA:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
INFO: task swapper:1 blocked for more than 10 seconds.
      Not tainted 4.11.0-BETAMSIX1 #51
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper         D    0     1      0 0x00000000

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60

(...)

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60
mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
cause a leak of a command resource
mlx5_core 0000:01:00.0: enable hca failed
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
mlx5_core: probe of 0000:01:00.0 failed with error -110

Could you give me a clue about what might be happening?

Thanks,
Joao


* Re: mlx5 endpoint driver problem
  2017-05-09 16:25 mlx5 endpoint driver problem Joao Pinto
       [not found] ` <f0b8881d-9aa3-8816-7ea6-daccc0e91262-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-09 17:35 ` Saeed Mahameed
  2017-05-09 17:38   ` Joao Pinto
  1 sibling, 1 reply; 10+ messages in thread
From: Saeed Mahameed @ 2017-05-09 17:35 UTC (permalink / raw)
  To: Joao Pinto; +Cc: Saeed Mahameed, netdev

On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
> Hello,
>
> I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
> when trying to enable the HCA:
>
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
> INFO: task swapper:1 blocked for more than 10 seconds.
>       Not tainted 4.11.0-BETAMSIX1 #51
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> swapper         D    0     1      0 0x00000000
>
> Stack Trace:
>   __switch_to+0x0/0x94
>   __schedule+0x1da/0x8b0
>   schedule+0x26/0x6c
>   schedule_timeout+0x2da/0x380
>   wait_for_completion+0x92/0x104
>   mlx5_cmd_exec+0x70e/0xd60
>   mlx5_load_one+0x1b4/0xad8
>   init_one+0x404/0x600
>   pci_device_probe+0x122/0x1f0
>   really_probe+0x1ac/0x348
>   __driver_attach+0xa8/0xd0
>   bus_for_each_dev+0x3c/0x74
>   bus_add_driver+0xc2/0x184
>   driver_register+0x50/0xec
>   init+0x40/0x60
>
> (...)
>
> Stack Trace:
>   __switch_to+0x0/0x94
>   __schedule+0x1da/0x8b0
>   schedule+0x26/0x6c
>   schedule_timeout+0x2da/0x380
>   wait_for_completion+0x92/0x104
>   mlx5_cmd_exec+0x70e/0xd60
>   mlx5_load_one+0x1b4/0xad8
>   init_one+0x404/0x600
>   pci_device_probe+0x122/0x1f0
>   really_probe+0x1ac/0x348
>   __driver_attach+0xa8/0xd0
>   bus_for_each_dev+0x3c/0x74
>   bus_add_driver+0xc2/0x184
>   driver_register+0x50/0xec
>   init+0x40/0x60
> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
> cause a leak of a command resource
> mlx5_core 0000:01:00.0: enable hca failed
> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
> mlx5_core: probe of 0000:01:00.0 failed with error -110
>
> Could you give me a clue about what might be happening?
>

Hi Joao,

It looks like the FW is not responding, most likely due to the DMA mask
setting warnings. Which architecture is this?
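
For context, mlx5_cmd_exec() posts the command to the device and then blocks
on a completion with a fixed timeout. If the firmware never signals completion
(or signals it somewhere the CPU cannot see), the wait expires and the command
fails with -110, i.e. -ETIMEDOUT. A rough sketch of the shape of that wait
(hypothetical helper, not the exact mlx5 code):

#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

/* Sketch: block until the firmware signals command completion or the
 * timeout expires.  -ETIMEDOUT is 110, matching the "failed with error
 * code -110" lines in the log above. */
static int cmd_wait_sketch(struct completion *done, unsigned int timeout_ms)
{
        if (!wait_for_completion_timeout(done, msecs_to_jiffies(timeout_ms)))
                return -ETIMEDOUT;

        return 0;
}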

> Thanks,
> Joao


* Re: mlx5 endpoint driver problem
  2017-05-09 17:35 ` Saeed Mahameed
@ 2017-05-09 17:38   ` Joao Pinto
  2017-05-09 17:44     ` Saeed Mahameed
  0 siblings, 1 reply; 10+ messages in thread
From: Joao Pinto @ 2017-05-09 17:38 UTC (permalink / raw)
  To: Saeed Mahameed, Joao Pinto; +Cc: Saeed Mahameed, netdev

Hi Saeed,

At 6:35 PM on 5/9/2017, Saeed Mahameed wrote:
> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>> Hello,
>>
>> I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
>> when trying to enable the HCA:
>>
>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>> INFO: task swapper:1 blocked for more than 10 seconds.
>>       Not tainted 4.11.0-BETAMSIX1 #51
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> swapper         D    0     1      0 0x00000000
>>
>> Stack Trace:
>>   __switch_to+0x0/0x94
>>   __schedule+0x1da/0x8b0
>>   schedule+0x26/0x6c
>>   schedule_timeout+0x2da/0x380
>>   wait_for_completion+0x92/0x104
>>   mlx5_cmd_exec+0x70e/0xd60
>>   mlx5_load_one+0x1b4/0xad8
>>   init_one+0x404/0x600
>>   pci_device_probe+0x122/0x1f0
>>   really_probe+0x1ac/0x348
>>   __driver_attach+0xa8/0xd0
>>   bus_for_each_dev+0x3c/0x74
>>   bus_add_driver+0xc2/0x184
>>   driver_register+0x50/0xec
>>   init+0x40/0x60
>>
>> (...)
>>
>> Stack Trace:
>>   __switch_to+0x0/0x94
>>   __schedule+0x1da/0x8b0
>>   schedule+0x26/0x6c
>>   schedule_timeout+0x2da/0x380
>>   wait_for_completion+0x92/0x104
>>   mlx5_cmd_exec+0x70e/0xd60
>>   mlx5_load_one+0x1b4/0xad8
>>   init_one+0x404/0x600
>>   pci_device_probe+0x122/0x1f0
>>   really_probe+0x1ac/0x348
>>   __driver_attach+0xa8/0xd0
>>   bus_for_each_dev+0x3c/0x74
>>   bus_add_driver+0xc2/0x184
>>   driver_register+0x50/0xec
>>   init+0x40/0x60
>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>> cause a leak of a command resource
>> mlx5_core 0000:01:00.0: enable hca failed
>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>
>> Could you give me a clue about what might be happening?
>>
> 
> Hi Joao,
> 
> It looks like the FW is not responding, most likely due to the DMA mask
> setting warnings. Which architecture is this?
> 
>> Thanks,
>> Joao

I am working with a 32-bit ARC processor-based board, connected to a prototype
Gen4 PCIe RC.

Thanks,
Joao


* Re: mlx5 endpoint driver problem
  2017-05-09 17:38   ` Joao Pinto
@ 2017-05-09 17:44     ` Saeed Mahameed
  2017-05-09 17:57       ` Joao Pinto
  0 siblings, 1 reply; 10+ messages in thread
From: Saeed Mahameed @ 2017-05-09 17:44 UTC (permalink / raw)
  To: Joao Pinto; +Cc: Saeed Mahameed, netdev

On Tue, May 9, 2017 at 8:38 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
> Hi Saeed,
>
> At 6:35 PM on 5/9/2017, Saeed Mahameed wrote:
>> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>>> Hello,
>>>
>>> I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
>>> when trying to enable the HCA:
>>>
>>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>>> INFO: task swapper:1 blocked for more than 10 seconds.
>>>       Not tainted 4.11.0-BETAMSIX1 #51
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> swapper         D    0     1      0 0x00000000
>>>
>>> Stack Trace:
>>>   __switch_to+0x0/0x94
>>>   __schedule+0x1da/0x8b0
>>>   schedule+0x26/0x6c
>>>   schedule_timeout+0x2da/0x380
>>>   wait_for_completion+0x92/0x104
>>>   mlx5_cmd_exec+0x70e/0xd60
>>>   mlx5_load_one+0x1b4/0xad8
>>>   init_one+0x404/0x600
>>>   pci_device_probe+0x122/0x1f0
>>>   really_probe+0x1ac/0x348
>>>   __driver_attach+0xa8/0xd0
>>>   bus_for_each_dev+0x3c/0x74
>>>   bus_add_driver+0xc2/0x184
>>>   driver_register+0x50/0xec
>>>   init+0x40/0x60
>>>
>>> (...)
>>>
>>> Stack Trace:
>>>   __switch_to+0x0/0x94
>>>   __schedule+0x1da/0x8b0
>>>   schedule+0x26/0x6c
>>>   schedule_timeout+0x2da/0x380
>>>   wait_for_completion+0x92/0x104
>>>   mlx5_cmd_exec+0x70e/0xd60
>>>   mlx5_load_one+0x1b4/0xad8
>>>   init_one+0x404/0x600
>>>   pci_device_probe+0x122/0x1f0
>>>   really_probe+0x1ac/0x348
>>>   __driver_attach+0xa8/0xd0
>>>   bus_for_each_dev+0x3c/0x74
>>>   bus_add_driver+0xc2/0x184
>>>   driver_register+0x50/0xec
>>>   init+0x40/0x60
>>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>>> cause a leak of a command resource
>>> mlx5_core 0000:01:00.0: enable hca failed
>>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>>
>>> Could you give me a clue about what might be happening?
>>>
>>
>> Hi Joao,
>>
>> It looks like the FW is not responding, most likely due to the DMA mask
>> setting warnings. Which architecture is this?
>>
>>> Thanks,
>>> Joao
>
> I am working with a 32-bit ARC processor-based board, connected to a prototype
> Gen4 PCIe RC.
>

OK, I will consult with our PCI and FW experts and get back to you.

Please note that the current mlx5 driver was never tested on a 32-bit
architecture and might not work properly for you.
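
One concrete 32-bit hazard, purely as an illustration (an assumption about
what could go wrong, not a known mlx5 bug): on a 32-bit kernel without
CONFIG_ARCH_DMA_ADDR_T_64BIT, dma_addr_t is only 32 bits wide, so every place
that writes a bus address into a 64-bit descriptor field must widen it
explicitly, e.g.:

#include <linux/types.h>
#include <asm/byteorder.h>

/* Illustrative only: widen a (possibly 32-bit) dma_addr_t into the
 * 64-bit big-endian address field a device expects.  The high 32
 * bits are zero on a 32-bit build. */
static inline __be64 to_hw_addr(dma_addr_t addr)
{
        return cpu_to_be64((u64)addr);
}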

> Thanks,
> Joao
>
>
>


* Re: mlx5 endpoint driver problem
  2017-05-09 17:44     ` Saeed Mahameed
@ 2017-05-09 17:57       ` Joao Pinto
  0 siblings, 0 replies; 10+ messages in thread
From: Joao Pinto @ 2017-05-09 17:57 UTC (permalink / raw)
  To: Saeed Mahameed, Joao Pinto; +Cc: Saeed Mahameed, netdev


Hi again Saeed,

At 6:44 PM on 5/9/2017, Saeed Mahameed wrote:
> On Tue, May 9, 2017 at 8:38 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>> Hi Saeed,
>>
>> At 6:35 PM on 5/9/2017, Saeed Mahameed wrote:
>>> On Tue, May 9, 2017 at 7:25 PM, Joao Pinto <Joao.Pinto@synopsys.com> wrote:
>>>> Hello,
>>>>
>>>> I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs
>>>> when trying to enable the HCA:
>>>>
>>>> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
>>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
>>>> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
>>>> mlx5_core 0000:01:00.0: firmware version: 16.19.21102
>>>> INFO: task swapper:1 blocked for more than 10 seconds.
>>>>       Not tainted 4.11.0-BETAMSIX1 #51
>>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> swapper         D    0     1      0 0x00000000
>>>>
>>>> Stack Trace:
>>>>   __switch_to+0x0/0x94
>>>>   __schedule+0x1da/0x8b0
>>>>   schedule+0x26/0x6c
>>>>   schedule_timeout+0x2da/0x380
>>>>   wait_for_completion+0x92/0x104
>>>>   mlx5_cmd_exec+0x70e/0xd60
>>>>   mlx5_load_one+0x1b4/0xad8
>>>>   init_one+0x404/0x600
>>>>   pci_device_probe+0x122/0x1f0
>>>>   really_probe+0x1ac/0x348
>>>>   __driver_attach+0xa8/0xd0
>>>>   bus_for_each_dev+0x3c/0x74
>>>>   bus_add_driver+0xc2/0x184
>>>>   driver_register+0x50/0xec
>>>>   init+0x40/0x60
>>>>
>>>> (...)
>>>>
>>>> Stack Trace:
>>>>   __switch_to+0x0/0x94
>>>>   __schedule+0x1da/0x8b0
>>>>   schedule+0x26/0x6c
>>>>   schedule_timeout+0x2da/0x380
>>>>   wait_for_completion+0x92/0x104
>>>>   mlx5_cmd_exec+0x70e/0xd60
>>>>   mlx5_load_one+0x1b4/0xad8
>>>>   init_one+0x404/0x600
>>>>   pci_device_probe+0x122/0x1f0
>>>>   really_probe+0x1ac/0x348
>>>>   __driver_attach+0xa8/0xd0
>>>>   bus_for_each_dev+0x3c/0x74
>>>>   bus_add_driver+0xc2/0x184
>>>>   driver_register+0x50/0xec
>>>>   init+0x40/0x60
>>>> mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
>>>> cause a leak of a command resource
>>>> mlx5_core 0000:01:00.0: enable hca failed
>>>> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
>>>> mlx5_core: probe of 0000:01:00.0 failed with error -110
>>>>
>>>> Could you give me a clue about what might be happening?
>>>>
>>>
>>> Hi Joao,
>>>
>>> It looks like the FW is not responding, most likely due to the DMA mask
>>> setting warnings. Which architecture is this?
>>>
>>>> Thanks,
>>>> Joao
>>
>> I am working with a 32-bit ARC processor-based board, connected to a prototype
>> Gen4 PCIe RC.
>>
> 
> OK, I will consult with our PCI and FW experts and get back to you.
> 
> Please note that the current mlx5 driver was never tested on a 32-bit
> architecture and might not work properly for you.

I have new data for you. My colleague is using a Mellanox MT27800 Family
(ConnectX-5) card with firmware version 16.19.148, and it does not hang, but it
fails in the CPU mask setup:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.148
mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
mlx5_core 0000:01:00.0: mlx5_irq_set_affinity_hint:628:(pid 1):
irq_set_affinity_hint failed,irq 0x0032
mlx5_core 0000:01:00.0: Failed to alloc affinity hint cpumask
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -22
mlx5_core: probe of 0000:01:00.0 failed with error -22

Mine is a Mellanox MT28800 Family (ConnectX-5) card with firmware version 16.19.21102.

I hope this gives you more data for the analysis.

Thanks,
Joao

> 
>> Thanks,
>> Joao
>>
>>
>>


* RE: mlx5 endpoint driver problem
       [not found]     ` <18d3cc2e-e235-6ae4-cd69-b5a11d607ee4-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-10 14:56       ` Eli Cohen
       [not found]         ` <AM4PR0501MB278757CE912385F3B3324CE7C5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Cohen @ 2017-05-10 14:56 UTC (permalink / raw)
  To: Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Joao,

Since mlx5-supported devices can do DMA with 64-bit addresses, we start like this. That fails on your system, since it is not capable of handling 64-bit addresses, so we fall back to 32-bit addresses, which then succeed. However, what you are experiencing is that the driver executed a command and the firmware supposedly did not respond. Most likely the firmware did respond, but the driver could not see the response due to problems related to DMA addresses in your system.

Long story short, there is a problem in your system. To investigate it further you might need heavy tools such as a PCIe analyzer.
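
For reference, the fallback described above happens early in probe; a minimal
sketch of the pattern (modeled on set_dma_caps() in
drivers/net/ethernet/mellanox/mlx5/core/main.c around v4.11, so exact details
may differ):

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Sketch: try a 64-bit DMA mask first and fall back to 32-bit.  This
 * is the path that prints the "couldn't set 64-bit PCI DMA mask"
 * warnings seen in the log below. */
static int set_dma_caps_sketch(struct pci_dev *pdev)
{
        int err;

        err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
        if (err) {
                dev_warn(&pdev->dev, "couldn't set 64-bit PCI DMA mask\n");
                err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
                if (err)
                        return err;     /* no usable DMA mask at all */
        }

        err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
        if (err) {
                dev_warn(&pdev->dev,
                         "couldn't set 64-bit consistent PCI DMA mask\n");
                err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
                if (err)
                        return err;
        }

        return 0;
}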

-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Joao Pinto
Sent: Tuesday, May 9, 2017 12:13 PM
To: linux-rdma@vger.kernel.org
Subject: mlx5 endpoint driver problem


Hello,

I am running tests with a Mellanox MLX5 endpoint, and I am getting kernel hangs when trying to enable the HCA:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.21102
INFO: task swapper:1 blocked for more than 10 seconds.
      Not tainted 4.11.0-BETAMSIX1 #51
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper         D    0     1      0 0x00000000

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60

(...)

Stack Trace:
  __switch_to+0x0/0x94
  __schedule+0x1da/0x8b0
  schedule+0x26/0x6c
  schedule_timeout+0x2da/0x380
  wait_for_completion+0x92/0x104
  mlx5_cmd_exec+0x70e/0xd60
  mlx5_load_one+0x1b4/0xad8
  init_one+0x404/0x600
  pci_device_probe+0x122/0x1f0
  really_probe+0x1ac/0x348
  __driver_attach+0xa8/0xd0
  bus_for_each_dev+0x3c/0x74
  bus_add_driver+0xc2/0x184
  driver_register+0x50/0xec
  init+0x40/0x60
mlx5_core 0000:01:00.0: wait_func:882:(pid 1): ENABLE_HCA(0x104) timeout. Will
cause a leak of a command resource
mlx5_core 0000:01:00.0: enable hca failed
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -110
mlx5_core: probe of 0000:01:00.0 failed with error -110

Could you give me a clue about what might be happening?

Thanks,
Joao


* Re: mlx5 endpoint driver problem
       [not found]         ` <AM4PR0501MB278757CE912385F3B3324CE7C5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-05-10 15:00           ` Joao Pinto
       [not found]             ` <a3f77d2c-086b-91bf-8dc8-e3d60dcce791-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Joao Pinto @ 2017-05-10 15:00 UTC (permalink / raw)
  To: Eli Cohen, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA


Hi Eli,

At 3:56 PM on 5/10/2017, Eli Cohen wrote:
> Hi Joao,
> 
> 
> 
> Since mlx5-supported devices can do DMA with 64-bit addresses, we start like this. That fails on your system, since it is not capable of handling 64-bit addresses, so we fall back to 32-bit addresses, which then succeed. However, what you are experiencing is that the driver executed a command and the firmware supposedly did not respond. Most likely the firmware did respond, but the driver could not see the response due to problems related to DMA addresses in your system.
> 
> Long story short, there is a problem in your system. To investigate it further you might need heavy tools such as a PCIe analyzer.

I have new data for you. My colleague is using a Mellanox MT27800 Family
(ConnectX-5) card with firmware version 16.19.148, and it does not hang, but it
fails in the CPU mask setup:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.148
mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
mlx5_core 0000:01:00.0: mlx5_irq_set_affinity_hint:628:(pid 1):
irq_set_affinity_hint failed,irq 0x0032
mlx5_core 0000:01:00.0: Failed to alloc affinity hint cpumask
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -22
mlx5_core: probe of 0000:01:00.0 failed with error -22

Mine is a Mellanox MT28800 Family (ConnectX-5) card with firmware version
16.19.21102. I think I have some firmware problem.

The affinity problem might be due to my Root Complex driver not supporting
affinity at the moment.
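
For illustration, the hint is registered per completion vector with
irq_set_affinity_hint(); a rough sketch of the pattern (hypothetical helper,
not the exact mlx5 code):

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/numa.h>
#include <linux/slab.h>

/* Sketch: allocate a cpumask, pick a CPU for this vector, and register
 * it as an affinity hint.  irq_set_affinity_hint() returns -EINVAL
 * (-22) when the IRQ has no descriptor, which would match the -22
 * failure in the log above.  On success the mask must stay allocated
 * until the hint is cleared again. */
static int set_affinity_hint_sketch(int irq, int vec_index)
{
        cpumask_var_t mask;
        int err;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;

        cpumask_set_cpu(cpumask_local_spread(vec_index, NUMA_NO_NODE), mask);

        err = irq_set_affinity_hint(irq, mask);
        if (err)
                free_cpumask_var(mask);

        return err;
}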

Thanks,
Joao


* RE: mlx5 endpoint driver problem
       [not found]             ` <a3f77d2c-086b-91bf-8dc8-e3d60dcce791-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
@ 2017-05-10 15:14               ` Eli Cohen
       [not found]                 ` <AM4PR0501MB27873B51842616191FEEFE3BC5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Cohen @ 2017-05-10 15:14 UTC (permalink / raw)
  To: Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA

How many CPU cores do you have?

-----Original Message-----
From: Joao Pinto [mailto:Joao.Pinto@synopsys.com] 
Sent: Wednesday, May 10, 2017 10:01 AM
To: Eli Cohen <eli@mellanox.com>; Joao Pinto <Joao.Pinto@synopsys.com>; linux-rdma@vger.kernel.org
Subject: Re: mlx5 endpoint driver problem


Hi Eli,

At 3:56 PM on 5/10/2017, Eli Cohen wrote:
> Hi Joao,
> 
> 
> 
> Since mlx5-supported devices can do DMA with 64-bit addresses, we start like this. That fails on your system, since it is not capable of handling 64-bit addresses, so we fall back to 32-bit addresses, which then succeed. However, what you are experiencing is that the driver executed a command and the firmware supposedly did not respond. Most likely the firmware did respond, but the driver could not see the response due to problems related to DMA addresses in your system.
> 
> Long story short, there is a problem in your system. To investigate it further you might need heavy tools such as a PCIe analyzer.

I have new data for you. My colleague is using a Mellanox MT27800 Family
(ConnectX-5) card with firmware version 16.19.148, and it does not hang, but it
fails in the CPU mask setup:

mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
mlx5_core 0000:01:00.0: firmware version: 16.19.148
mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
mlx5_core 0000:01:00.0: mlx5_irq_set_affinity_hint:628:(pid 1):
irq_set_affinity_hint failed,irq 0x0032
mlx5_core 0000:01:00.0: Failed to alloc affinity hint cpumask
mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -22
mlx5_core: probe of 0000:01:00.0 failed with error -22

Mine is a Mellanox MT28800 Family (ConnectX-5) card with firmware version 16.19.21102. I think I have some firmware problem.

The affinity problem might be due to my Root Complex driver not supporting affinity at the moment.

Thanks,
Joao


* Re: mlx5 endpoint driver problem
       [not found]                 ` <AM4PR0501MB27873B51842616191FEEFE3BC5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-05-10 15:16                   ` Joao Pinto
  0 siblings, 0 replies; 10+ messages in thread
From: Joao Pinto @ 2017-05-10 15:16 UTC (permalink / raw)
  To: Eli Cohen, Joao Pinto, linux-rdma-u79uwXL29TY76Z2rM5mHXA

At 4:14 PM on 5/10/2017, Eli Cohen wrote:
> How many CPU cores do you have?

I just have 1 core. I commented out the affinity functions in mlx5/main.c and
asked my colleague who has the endpoint with the good firmware to check whether
the device initializes correctly. In your opinion it should work, right?

Joao

> 
> -----Original Message-----
> From: Joao Pinto [mailto:Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org] 
> Sent: Wednesday, May 10, 2017 10:01 AM
> To: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Joao Pinto <Joao.Pinto-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: mlx5 endpoint driver problem
> 
> 
> Hi Eli,
> 
> At 3:56 PM on 5/10/2017, Eli Cohen wrote:
>> Hi Joao,
>>
>>
>>
>> Since mlx5-supported devices can do DMA with 64-bit addresses, we start like this. That fails on your system, since it is not capable of handling 64-bit addresses, so we fall back to 32-bit addresses, which then succeed. However, what you are experiencing is that the driver executed a command and the firmware supposedly did not respond. Most likely the firmware did respond, but the driver could not see the response due to problems related to DMA addresses in your system.
>> 
>> Long story short, there is a problem in your system. To investigate it further you might need heavy tools such as a PCIe analyzer.
> 
> I have new data for you. My colleague is using a Mellanox MT27800 Family
> (ConnectX-5) card with firmware version 16.19.148, and it does not hang, but it
> fails in the CPU mask setup:
> 
> mlx5_core 0000:01:00.0: enabling device (0000 -> 0002)
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit PCI DMA mask
> mlx5_core 0000:01:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask
> mlx5_core 0000:01:00.0: firmware version: 16.19.148
> mlx5_core 0000:01:00.0: Port module event: module 0, Cable unplugged
> mlx5_core 0000:01:00.0: mlx5_irq_set_affinity_hint:628:(pid 1):
> irq_set_affinity_hint failed,irq 0x0032
> mlx5_core 0000:01:00.0: Failed to alloc affinity hint cpumask
> mlx5_core 0000:01:00.0: mlx5_load_one failed with error code -22
> mlx5_core: probe of 0000:01:00.0 failed with error -22
> 
> Mine is a Mellanox MT28800 Family (ConnectX-5) card with firmware version 16.19.21102. I think I have some firmware problem.
> 
> The affinity problem might be due to my Root Complex driver not supporting affinity at the moment.
> 
> Thanks,
> Joao
> 



Thread overview: 10+ messages
2017-05-09 16:25 mlx5 endpoint driver problem Joao Pinto
     [not found] ` <f0b8881d-9aa3-8816-7ea6-daccc0e91262-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-05-09 17:13   ` Joao Pinto
     [not found]     ` <18d3cc2e-e235-6ae4-cd69-b5a11d607ee4-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-05-10 14:56       ` Eli Cohen
     [not found]         ` <AM4PR0501MB278757CE912385F3B3324CE7C5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-05-10 15:00           ` Joao Pinto
     [not found]             ` <a3f77d2c-086b-91bf-8dc8-e3d60dcce791-HKixBCOQz3hWk0Htik3J/w@public.gmane.org>
2017-05-10 15:14               ` Eli Cohen
     [not found]                 ` <AM4PR0501MB27873B51842616191FEEFE3BC5EC0-dp/nxUn679jFcPxmzbbP+MDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-05-10 15:16                   ` Joao Pinto
2017-05-09 17:35 ` Saeed Mahameed
2017-05-09 17:38   ` Joao Pinto
2017-05-09 17:44     ` Saeed Mahameed
2017-05-09 17:57       ` Joao Pinto
