netdev.vger.kernel.org archive mirror
* Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
@ 2023-01-06  7:55 Jaroslav Pulchart
  2023-01-06 11:11 ` Linux kernel regression tracking (#adding)
  2023-01-06 16:15 ` Jason Gunthorpe
  0 siblings, 2 replies; 13+ messages in thread
From: Jaroslav Pulchart @ 2023-01-06  7:55 UTC (permalink / raw)
  To: kamalheib1, shiraz.saleem, leon, sashal, linux-rdma, netdev; +Cc: Igor Raits

Hello,

I would like to report a network regression in kernels >= 6.1.2 (it
looks like the NIC is not UP) on our Dell R7525 servers with E810 NICs.
The issue was observed after updating from 6.1.0 to 6.1.2 or newer
(tested up to the newest 6.1.4-rc1). The system is not accessible and
all services are in D state after each reboot.

[  257.625207]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
[  257.631911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  257.639740] task:kworker/u192:1  state:D stack:0     pid:11
ppid:2      flags:0x00004000
[  257.648095] Workqueue: netns cleanup_net
[  257.652029] Call Trace:
[  257.654481]  <TASK>
[  257.656589]  __schedule+0x1eb/0x630
[  257.660087]  schedule+0x5a/0xd0
[  257.663233]  schedule_preempt_disabled+0x11/0x20
[  257.667851]  __mutex_lock.constprop.0+0x372/0x6c0
[  257.672561]  rdma_dev_change_netns+0x25/0x120 [ib_core]
[  257.677821]  rdma_dev_exit_net+0x139/0x1e0 [ib_core]
[  257.682804]  ops_exit_list+0x30/0x70
[  257.686382]  cleanup_net+0x213/0x3b0
[  257.689964]  process_one_work+0x1e2/0x3b0
[  257.693984]  ? rescuer_thread+0x390/0x390
[  257.697995]  worker_thread+0x50/0x3a0
[  257.701661]  ? rescuer_thread+0x390/0x390
[  257.705674]  kthread+0xd6/0x100
[  257.708819]  ? kthread_complete_and_exit+0x20/0x20
[  257.713613]  ret_from_fork+0x1f/0x30
[  257.717192]  </TASK>
[  257.719496] INFO: task kworker/87:0:470 blocked for more than 122 seconds.
[  257.726423]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
[  257.733123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  257.740949] task:kworker/87:0    state:D stack:0     pid:470
ppid:2      flags:0x00004000
[  257.749307] Workqueue: events linkwatch_event
[  257.753672] Call Trace:
[  257.756124]  <TASK>
[  257.758228]  __schedule+0x1eb/0x630
[  257.761723]  schedule+0x5a/0xd0
[  257.764867]  schedule_preempt_disabled+0x11/0x20
[  257.769487]  __mutex_lock.constprop.0+0x372/0x6c0
[  257.774196]  ? pick_next_task+0x57/0x9b0
[  257.778127]  ? finish_task_switch.isra.0+0x8f/0x2a0
[  257.783007]  linkwatch_event+0xa/0x30
[  257.786674]  process_one_work+0x1e2/0x3b0
[  257.790687]  worker_thread+0x50/0x3a0
[  257.794352]  ? rescuer_thread+0x390/0x390
[  257.798365]  kthread+0xd6/0x100
[  257.801513]  ? kthread_complete_and_exit+0x20/0x20
[  257.806303]  ret_from_fork+0x1f/0x30
[  257.809885]  </TASK>
[  257.812109] INFO: task kworker/39:1:614 blocked for more than 123 seconds.
[  257.818984]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
[  257.825686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  257.833519] task:kworker/39:1    state:D stack:0     pid:614
ppid:2      flags:0x00004000
[  257.841869] Workqueue: infiniband ib_cache_event_task [ib_core]
[  257.847802] Call Trace:
[  257.850252]  <TASK>
[  257.852360]  __schedule+0x1eb/0x630
[  257.855851]  schedule+0x5a/0xd0
[  257.858998]  schedule_preempt_disabled+0x11/0x20
[  257.863617]  __mutex_lock.constprop.0+0x372/0x6c0
[  257.868325]  ib_get_eth_speed+0x65/0x190 [ib_core]
[  257.873127]  ? ib_cache_update.part.0+0x4b/0x2b0 [ib_core]
[  257.878619]  ? __kmem_cache_alloc_node+0x18c/0x2b0
[  257.883417]  irdma_query_port+0xb3/0x110 [irdma]
[  257.888051]  ib_query_port+0xaa/0x100 [ib_core]
[  257.892601]  ib_cache_update.part.0+0x65/0x2b0 [ib_core]
[  257.897924]  ? pick_next_task+0x57/0x9b0
[  257.901855]  ? dequeue_task_fair+0xb6/0x3c0
[  257.906043]  ? finish_task_switch.isra.0+0x8f/0x2a0
[  257.910920]  ib_cache_event_task+0x58/0x80 [ib_core]
[  257.915906]  process_one_work+0x1e2/0x3b0
[  257.919918]  ? rescuer_thread+0x390/0x390
[  257.923931]  worker_thread+0x50/0x3a0
[  257.927595]  ? rescuer_thread+0x390/0x390
[  257.931609]  kthread+0xd6/0x100
[  257.934755]  ? kthread_complete_and_exit+0x20/0x20
[  257.939549]  ret_from_fork+0x1f/0x30
[  257.943128]  </TASK>
[  257.945438] INFO: task NetworkManager:3387 blocked for more than 123 seconds.
[  257.952577]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
[  257.959274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  257.967099] task:NetworkManager  state:D stack:0     pid:3387
ppid:1      flags:0x00004002
[  257.975446] Call Trace:
[  257.977901]  <TASK>
[  257.980004]  __schedule+0x1eb/0x630
[  257.983498]  schedule+0x5a/0xd0
[  257.986641]  schedule_timeout+0x11d/0x160
[  257.990654]  __wait_for_common+0x90/0x1e0
[  257.994666]  ? usleep_range_state+0x90/0x90
[  257.998854]  __flush_workqueue+0x13a/0x3f0
[  258.002955]  ? __kernfs_remove.part.0+0x11e/0x1e0
[  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
[  258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core]
[  258.017959]  ib_unregister_device+0x22/0x30 [ib_core]
[  258.023024]  irdma_remove+0x1a/0x60 [irdma]
[  258.027223]  auxiliary_bus_remove+0x18/0x30
[  258.031414]  device_release_driver_internal+0x1aa/0x230
[  258.036643]  bus_remove_device+0xd8/0x150
[  258.040654]  device_del+0x18b/0x3f0
[  258.044149]  ice_unplug_aux_dev+0x42/0x60 [ice]
[  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice]
[  258.054038]  ice_lag_event_handler+0x51/0x130 [ice]
[  258.058930]  raw_notifier_call_chain+0x41/0x60
[  258.063381]  __netdev_upper_dev_link+0x1a0/0x370
[  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
[  258.072886]  bond_enslave+0xd16/0x16f0 [bonding]
[  258.077517]  ? nla_put+0x28/0x40
[  258.080756]  do_setlink+0x26c/0xc10
[  258.084249]  ? avc_alloc_node+0x27/0x180
[  258.088173]  ? __nla_validate_parse+0x141/0x190
[  258.092708]  __rtnl_newlink+0x53a/0x620
[  258.096549]  rtnl_newlink+0x44/0x70
[  258.100040]  rtnetlink_rcv_msg+0x159/0x3d0
[  258.104140]  ? rtnl_calcit.isra.0+0x140/0x140
[  258.108496]  netlink_rcv_skb+0x4e/0x100
[  258.112338]  netlink_unicast+0x23b/0x360
[  258.116264]  netlink_sendmsg+0x24e/0x4b0
[  258.120191]  sock_sendmsg+0x5f/0x70
[  258.123684]  ____sys_sendmsg+0x241/0x2c0
[  258.127609]  ? copy_msghdr_from_user+0x6d/0xa0
[  258.132054]  ___sys_sendmsg+0x88/0xd0
[  258.135722]  ? ___sys_recvmsg+0x88/0xd0
[  258.139559]  ? wake_up_q+0x4a/0x90
[  258.142967]  ? rseq_get_rseq_cs.isra.0+0x16/0x220
[  258.147673]  ? __fget_light+0xa4/0x130
[  258.151434]  __sys_sendmsg+0x59/0xa0
[  258.155012]  do_syscall_64+0x38/0x90
[  258.158591]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  258.163645] RIP: 0033:0x7ff23714fa7d
[  258.167226] RSP: 002b:00007ffdddfc8c70 EFLAGS: 00000293 ORIG_RAX:
000000000000002e
[  258.174798] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff23714fa7d
[  258.181933] RDX: 0000000000000000 RSI: 00007ffdddfc8cb0 RDI: 000000000000000d
[  258.189063] RBP: 00005572f5d77040 R08: 0000000000000000 R09: 0000000000000000
[  258.196197] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdddfc8e1c
[  258.203332] R13: 00007ffdddfc8e20 R14: 0000000000000000 R15: 00007ffdddfc8e28
[  258.210464]  </TASK>
...

I bisected the issue to commit
425c9bd06b7a70796d880828d15c11321bdfb76d ("RDMA/irdma: Report the
correct link speed"). Reverting this commit in my kernel build fixes
the issue and the server has a working network again.

Best regards,
--
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData


* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-06  7:55 Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed) Jaroslav Pulchart
@ 2023-01-06 11:11 ` Linux kernel regression tracking (#adding)
  2023-02-28 10:09   ` Linux regression tracking (Thorsten Leemhuis)
  2023-01-06 16:15 ` Jason Gunthorpe
  1 sibling, 1 reply; 13+ messages in thread
From: Linux kernel regression tracking (#adding) @ 2023-01-06 11:11 UTC (permalink / raw)
  To: Jaroslav Pulchart, kamalheib1, shiraz.saleem, leon, sashal,
	linux-rdma, netdev
  Cc: Igor Raits, regressions

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; all text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See the link in the footer if these mails annoy you.]

On 06.01.23 08:55, Jaroslav Pulchart wrote:
> Hello,
> 
> I would like to report a network regression in kernels >= 6.1.2 (it
> looks like the NIC is not UP) on our Dell R7525 servers with E810 NICs.
> The issue was observed after updating from 6.1.0 to 6.1.2 or newer
> (tested up to the newest 6.1.4-rc1). The system is not accessible and
> all services are in D state after each reboot.
> 
> [  257.625207]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
> [  257.631911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  257.639740] task:kworker/u192:1  state:D stack:0     pid:11
> ppid:2      flags:0x00004000
> [  257.648095] Workqueue: netns cleanup_net
> [  257.652029] Call Trace:
> [  257.654481]  <TASK>
> [  257.656589]  __schedule+0x1eb/0x630
> [  257.660087]  schedule+0x5a/0xd0
> [  257.663233]  schedule_preempt_disabled+0x11/0x20
> [  257.667851]  __mutex_lock.constprop.0+0x372/0x6c0
> [  257.672561]  rdma_dev_change_netns+0x25/0x120 [ib_core]
> [  257.677821]  rdma_dev_exit_net+0x139/0x1e0 [ib_core]
> [  257.682804]  ops_exit_list+0x30/0x70
> [  257.686382]  cleanup_net+0x213/0x3b0
> [  257.689964]  process_one_work+0x1e2/0x3b0
> [  257.693984]  ? rescuer_thread+0x390/0x390
> [  257.697995]  worker_thread+0x50/0x3a0
> [  257.701661]  ? rescuer_thread+0x390/0x390
> [  257.705674]  kthread+0xd6/0x100
> [  257.708819]  ? kthread_complete_and_exit+0x20/0x20
> [  257.713613]  ret_from_fork+0x1f/0x30
> [  257.717192]  </TASK>
> [  257.719496] INFO: task kworker/87:0:470 blocked for more than 122 seconds.
> [  257.726423]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
> [  257.733123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  257.740949] task:kworker/87:0    state:D stack:0     pid:470
> ppid:2      flags:0x00004000
> [  257.749307] Workqueue: events linkwatch_event
> [  257.753672] Call Trace:
> [  257.756124]  <TASK>
> [  257.758228]  __schedule+0x1eb/0x630
> [  257.761723]  schedule+0x5a/0xd0
> [  257.764867]  schedule_preempt_disabled+0x11/0x20
> [  257.769487]  __mutex_lock.constprop.0+0x372/0x6c0
> [  257.774196]  ? pick_next_task+0x57/0x9b0
> [  257.778127]  ? finish_task_switch.isra.0+0x8f/0x2a0
> [  257.783007]  linkwatch_event+0xa/0x30
> [  257.786674]  process_one_work+0x1e2/0x3b0
> [  257.790687]  worker_thread+0x50/0x3a0
> [  257.794352]  ? rescuer_thread+0x390/0x390
> [  257.798365]  kthread+0xd6/0x100
> [  257.801513]  ? kthread_complete_and_exit+0x20/0x20
> [  257.806303]  ret_from_fork+0x1f/0x30
> [  257.809885]  </TASK>
> [  257.812109] INFO: task kworker/39:1:614 blocked for more than 123 seconds.
> [  257.818984]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
> [  257.825686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  257.833519] task:kworker/39:1    state:D stack:0     pid:614
> ppid:2      flags:0x00004000
> [  257.841869] Workqueue: infiniband ib_cache_event_task [ib_core]
> [  257.847802] Call Trace:
> [  257.850252]  <TASK>
> [  257.852360]  __schedule+0x1eb/0x630
> [  257.855851]  schedule+0x5a/0xd0
> [  257.858998]  schedule_preempt_disabled+0x11/0x20
> [  257.863617]  __mutex_lock.constprop.0+0x372/0x6c0
> [  257.868325]  ib_get_eth_speed+0x65/0x190 [ib_core]
> [  257.873127]  ? ib_cache_update.part.0+0x4b/0x2b0 [ib_core]
> [  257.878619]  ? __kmem_cache_alloc_node+0x18c/0x2b0
> [  257.883417]  irdma_query_port+0xb3/0x110 [irdma]
> [  257.888051]  ib_query_port+0xaa/0x100 [ib_core]
> [  257.892601]  ib_cache_update.part.0+0x65/0x2b0 [ib_core]
> [  257.897924]  ? pick_next_task+0x57/0x9b0
> [  257.901855]  ? dequeue_task_fair+0xb6/0x3c0
> [  257.906043]  ? finish_task_switch.isra.0+0x8f/0x2a0
> [  257.910920]  ib_cache_event_task+0x58/0x80 [ib_core]
> [  257.915906]  process_one_work+0x1e2/0x3b0
> [  257.919918]  ? rescuer_thread+0x390/0x390
> [  257.923931]  worker_thread+0x50/0x3a0
> [  257.927595]  ? rescuer_thread+0x390/0x390
> [  257.931609]  kthread+0xd6/0x100
> [  257.934755]  ? kthread_complete_and_exit+0x20/0x20
> [  257.939549]  ret_from_fork+0x1f/0x30
> [  257.943128]  </TASK>
> [  257.945438] INFO: task NetworkManager:3387 blocked for more than 123 seconds.
> [  257.952577]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
> [  257.959274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
> ppid:1      flags:0x00004002
> [  257.975446] Call Trace:
> [  257.977901]  <TASK>
> [  257.980004]  __schedule+0x1eb/0x630
> [  257.983498]  schedule+0x5a/0xd0
> [  257.986641]  schedule_timeout+0x11d/0x160
> [  257.990654]  __wait_for_common+0x90/0x1e0
> [  257.994666]  ? usleep_range_state+0x90/0x90
> [  257.998854]  __flush_workqueue+0x13a/0x3f0
> [  258.002955]  ? __kernfs_remove.part.0+0x11e/0x1e0
> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
> [  258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core]
> [  258.017959]  ib_unregister_device+0x22/0x30 [ib_core]
> [  258.023024]  irdma_remove+0x1a/0x60 [irdma]
> [  258.027223]  auxiliary_bus_remove+0x18/0x30
> [  258.031414]  device_release_driver_internal+0x1aa/0x230
> [  258.036643]  bus_remove_device+0xd8/0x150
> [  258.040654]  device_del+0x18b/0x3f0
> [  258.044149]  ice_unplug_aux_dev+0x42/0x60 [ice]
> [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice]
> [  258.054038]  ice_lag_event_handler+0x51/0x130 [ice]
> [  258.058930]  raw_notifier_call_chain+0x41/0x60
> [  258.063381]  __netdev_upper_dev_link+0x1a0/0x370
> [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
> [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding]
> [  258.077517]  ? nla_put+0x28/0x40
> [  258.080756]  do_setlink+0x26c/0xc10
> [  258.084249]  ? avc_alloc_node+0x27/0x180
> [  258.088173]  ? __nla_validate_parse+0x141/0x190
> [  258.092708]  __rtnl_newlink+0x53a/0x620
> [  258.096549]  rtnl_newlink+0x44/0x70
> [  258.100040]  rtnetlink_rcv_msg+0x159/0x3d0
> [  258.104140]  ? rtnl_calcit.isra.0+0x140/0x140
> [  258.108496]  netlink_rcv_skb+0x4e/0x100
> [  258.112338]  netlink_unicast+0x23b/0x360
> [  258.116264]  netlink_sendmsg+0x24e/0x4b0
> [  258.120191]  sock_sendmsg+0x5f/0x70
> [  258.123684]  ____sys_sendmsg+0x241/0x2c0
> [  258.127609]  ? copy_msghdr_from_user+0x6d/0xa0
> [  258.132054]  ___sys_sendmsg+0x88/0xd0
> [  258.135722]  ? ___sys_recvmsg+0x88/0xd0
> [  258.139559]  ? wake_up_q+0x4a/0x90
> [  258.142967]  ? rseq_get_rseq_cs.isra.0+0x16/0x220
> [  258.147673]  ? __fget_light+0xa4/0x130
> [  258.151434]  __sys_sendmsg+0x59/0xa0
> [  258.155012]  do_syscall_64+0x38/0x90
> [  258.158591]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [  258.163645] RIP: 0033:0x7ff23714fa7d
> [  258.167226] RSP: 002b:00007ffdddfc8c70 EFLAGS: 00000293 ORIG_RAX:
> 000000000000002e
> [  258.174798] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff23714fa7d
> [  258.181933] RDX: 0000000000000000 RSI: 00007ffdddfc8cb0 RDI: 000000000000000d
> [  258.189063] RBP: 00005572f5d77040 R08: 0000000000000000 R09: 0000000000000000
> [  258.196197] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdddfc8e1c
> [  258.203332] R13: 00007ffdddfc8e20 R14: 0000000000000000 R15: 00007ffdddfc8e28
> [  258.210464]  </TASK>
> ...
> 
> I bisected the issue to commit
> 425c9bd06b7a70796d880828d15c11321bdfb76d ("RDMA/irdma: Report the
> correct link speed"). Reverting this commit in my kernel build fixes
> the issue and the server has a working network again.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 425c9bd06b7a7079
#regzbot title RDMA/irdma: network stopped working
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was already fixed? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (see page linked in footer for details).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-06  7:55 Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed) Jaroslav Pulchart
  2023-01-06 11:11 ` Linux kernel regression tracking (#adding)
@ 2023-01-06 16:15 ` Jason Gunthorpe
  2023-01-09  9:17   ` Jaroslav Pulchart
  2023-01-09 19:36   ` Saleem, Shiraz
  1 sibling, 2 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2023-01-06 16:15 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: kamalheib1, shiraz.saleem, leon, sashal, linux-rdma, netdev, Igor Raits

On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
> ppid:1      flags:0x00004002
> [  257.975446] Call Trace:
> [  257.977901]  <TASK>
> [  257.980004]  __schedule+0x1eb/0x630
> [  257.983498]  schedule+0x5a/0xd0
> [  257.986641]  schedule_timeout+0x11d/0x160
> [  257.990654]  __wait_for_common+0x90/0x1e0
> [  257.994666]  ? usleep_range_state+0x90/0x90
> [  257.998854]  __flush_workqueue+0x13a/0x3f0
> [  258.002955]  ? __kernfs_remove.part.0+0x11e/0x1e0
> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
> [  258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core]
> [  258.017959]  ib_unregister_device+0x22/0x30 [ib_core]
> [  258.023024]  irdma_remove+0x1a/0x60 [irdma]
> [  258.027223]  auxiliary_bus_remove+0x18/0x30
> [  258.031414]  device_release_driver_internal+0x1aa/0x230
> [  258.036643]  bus_remove_device+0xd8/0x150
> [  258.040654]  device_del+0x18b/0x3f0
> [  258.044149]  ice_unplug_aux_dev+0x42/0x60 [ice]

We talked about this already - wasn't it on this series?

Don't hold locks when removing aux devices.

> [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice]
> [  258.054038]  ice_lag_event_handler+0x51/0x130 [ice]
> [  258.058930]  raw_notifier_call_chain+0x41/0x60
> [  258.063381]  __netdev_upper_dev_link+0x1a0/0x370
> [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
> [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding]
> [  258.077517]  ? nla_put+0x28/0x40
> [  258.080756]  do_setlink+0x26c/0xc10
> [  258.084249]  ? avc_alloc_node+0x27/0x180
> [  258.088173]  ? __nla_validate_parse+0x141/0x190
> [  258.092708]  __rtnl_newlink+0x53a/0x620
> [  258.096549]  rtnl_newlink+0x44/0x70

Especially not the rtnl.

Jason


* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-06 16:15 ` Jason Gunthorpe
@ 2023-01-09  9:17   ` Jaroslav Pulchart
  2023-01-09 19:36   ` Saleem, Shiraz
  1 sibling, 0 replies; 13+ messages in thread
From: Jaroslav Pulchart @ 2023-01-09  9:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kamalheib1, shiraz.saleem, leon, sashal, linux-rdma, netdev, Igor Raits

Hello Jason

about:

> We talked about this already - wasn't it on this series?

No, we did not talk about this. Shouldn't someone else be in the
email "To" section?

Best,
Jaroslav P.


On Fri, Jan 6, 2023 at 17:15, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
> > [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
> > ppid:1      flags:0x00004002
> > [  257.975446] Call Trace:
> > [  257.977901]  <TASK>
> > [  257.980004]  __schedule+0x1eb/0x630
> > [  257.983498]  schedule+0x5a/0xd0
> > [  257.986641]  schedule_timeout+0x11d/0x160
> > [  257.990654]  __wait_for_common+0x90/0x1e0
> > [  257.994666]  ? usleep_range_state+0x90/0x90
> > [  257.998854]  __flush_workqueue+0x13a/0x3f0
> > [  258.002955]  ? __kernfs_remove.part.0+0x11e/0x1e0
> > [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
> > [  258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core]
> > [  258.017959]  ib_unregister_device+0x22/0x30 [ib_core]
> > [  258.023024]  irdma_remove+0x1a/0x60 [irdma]
> > [  258.027223]  auxiliary_bus_remove+0x18/0x30
> > [  258.031414]  device_release_driver_internal+0x1aa/0x230
> > [  258.036643]  bus_remove_device+0xd8/0x150
> > [  258.040654]  device_del+0x18b/0x3f0
> > [  258.044149]  ice_unplug_aux_dev+0x42/0x60 [ice]
>
> We talked about this already - wasn't it on this series?
>
> Don't hold locks when removing aux devices.
>
> > [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice]
> > [  258.054038]  ice_lag_event_handler+0x51/0x130 [ice]
> > [  258.058930]  raw_notifier_call_chain+0x41/0x60
> > [  258.063381]  __netdev_upper_dev_link+0x1a0/0x370
> > [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
> > [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding]
> > [  258.077517]  ? nla_put+0x28/0x40
> > [  258.080756]  do_setlink+0x26c/0xc10
> > [  258.084249]  ? avc_alloc_node+0x27/0x180
> > [  258.088173]  ? __nla_validate_parse+0x141/0x190
> > [  258.092708]  __rtnl_newlink+0x53a/0x620
> > [  258.096549]  rtnl_newlink+0x44/0x70
>
> Especially not the rtnl.
>
> Jason


* RE: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-06 16:15 ` Jason Gunthorpe
  2023-01-09  9:17   ` Jaroslav Pulchart
@ 2023-01-09 19:36   ` Saleem, Shiraz
  2023-01-10  6:02     ` Zhu Yanjun
  1 sibling, 1 reply; 13+ messages in thread
From: Saleem, Shiraz @ 2023-01-09 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Jaroslav Pulchart, Ertman, David M, Wesierski, DawidX
  Cc: kamalheib1, leon, sashal, linux-rdma, netdev, Igor Raits

> Subject: Re: Network do not works with linux >= 6.1.2. Issue bisected to
> "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the
> correct link speed)
> 
> On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
> > [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
> > ppid:1      flags:0x00004002
> > [  257.975446] Call Trace:
> > [  257.977901]  <TASK>
> > [  257.980004]  __schedule+0x1eb/0x630 [  257.983498]
> > schedule+0x5a/0xd0 [  257.986641]  schedule_timeout+0x11d/0x160 [
> > 257.990654]  __wait_for_common+0x90/0x1e0 [  257.994666]  ?
> > usleep_range_state+0x90/0x90 [  257.998854]
> > __flush_workqueue+0x13a/0x3f0 [  258.002955]  ?
> > __kernfs_remove.part.0+0x11e/0x1e0
> > [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core] [
> > 258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core] [  258.017959]
> > ib_unregister_device+0x22/0x30 [ib_core] [  258.023024]
> > irdma_remove+0x1a/0x60 [irdma] [  258.027223]
> > auxiliary_bus_remove+0x18/0x30 [  258.031414]
> > device_release_driver_internal+0x1aa/0x230
> > [  258.036643]  bus_remove_device+0xd8/0x150 [  258.040654]
> > device_del+0x18b/0x3f0 [  258.044149]  ice_unplug_aux_dev+0x42/0x60
> > [ice]
> 
> We talked about this already - wasn't it on this series?

I believe this is yet another path (when ice ports are added to a bond) where the RDMA aux device
is removed while the RTNL lock is held. It is being exposed now by the recent irdma patch 425c9bd06b7a,
causing a deadlock.

ice_lag_event_handler [rtnl_lock]
 ->ice_lag_changeupper_event
     ->ice_unplug_aux_dev
        ->irdma_remove
            ->ib_unregister_device
               ->ib_cache_cleanup_one
                  ->flush_workqueue(ib)
                     ->irdma_query_port
                         -> ib_get_eth_speed [rtnl_lock]
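
To make the inversion concrete, here is a minimal, hypothetical C sketch of
the pattern (this is not the actual ice/irdma/ib_core code; rtnl_lock(),
rtnl_unlock() and flush_workqueue() are real kernel APIs, everything else is
made up for illustration):

#include <linux/rtnetlink.h>
#include <linux/workqueue.h>

static struct workqueue_struct *cache_wq;  /* stands in for ib_core's cache workqueue */

/* Work item queued earlier via queue_work(cache_wq, ...), like
 * ib_cache_event_task() -> irdma_query_port(). */
static void cache_update_work(struct work_struct *work)
{
        rtnl_lock();            /* like ib_get_eth_speed(): blocks forever,
                                 * RTNL is already held by the path below */
        /* ... read the link speed ... */
        rtnl_unlock();
}

/* Removal path, like bond_enslave() -> ice_lag_changeupper_event() -> irdma_remove(). */
static void removal_under_rtnl(void)
{
        rtnl_lock();                    /* the notifier chain runs under RTNL */
        flush_workqueue(cache_wq);      /* like ib_cache_cleanup_one(): waits for
                                         * cache_update_work(), which waits for RTNL */
        rtnl_unlock();
}

Neither side can make progress, which matches the hung NetworkManager and
ib_cache_event_task tasks in the report above.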

The previous discussion was about an ethtool channel config change, https://lore.kernel.org/linux-rdma/Y5ES3kmYSINlAQhz@x130/,
which David E. is taking care of.

We are working on a patch for this issue.

> 
> Don't hold locks when removing aux devices.
> 
> > [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice] [
> > 258.054038]  ice_lag_event_handler+0x51/0x130 [ice] [  258.058930]
> > raw_notifier_call_chain+0x41/0x60 [  258.063381]
> > __netdev_upper_dev_link+0x1a0/0x370
> > [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
> > [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding] [  258.077517]  ?
> > nla_put+0x28/0x40 [  258.080756]  do_setlink+0x26c/0xc10 [
> > 258.084249]  ? avc_alloc_node+0x27/0x180 [  258.088173]  ?
> > __nla_validate_parse+0x141/0x190 [  258.092708]
> > __rtnl_newlink+0x53a/0x620 [  258.096549]  rtnl_newlink+0x44/0x70
> 
> Especially not the rtnl.
> 
> Jason


* Re: RE: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-09 19:36   ` Saleem, Shiraz
@ 2023-01-10  6:02     ` Zhu Yanjun
  2023-01-12 16:25       ` Linus Heckemann
  0 siblings, 1 reply; 13+ messages in thread
From: Zhu Yanjun @ 2023-01-10  6:02 UTC (permalink / raw)
  To: Saleem, Shiraz, Jason Gunthorpe, Jaroslav Pulchart, Ertman,
	David M, Wesierski, DawidX
  Cc: kamalheib1, leon, sashal, linux-rdma, netdev, Igor Raits

On 2023/1/10 3:36, Saleem, Shiraz wrote:
>> Subject: Re: Network do not works with linux >= 6.1.2. Issue bisected to
>> "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the
>> correct link speed)
>>
>> On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
>>> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
>>> ppid:1      flags:0x00004002
>>> [  257.975446] Call Trace:
>>> [  257.977901]  <TASK>
>>> [  257.980004]  __schedule+0x1eb/0x630 [  257.983498]
>>> schedule+0x5a/0xd0 [  257.986641]  schedule_timeout+0x11d/0x160 [
>>> 257.990654]  __wait_for_common+0x90/0x1e0 [  257.994666]  ?
>>> usleep_range_state+0x90/0x90 [  257.998854]
>>> __flush_workqueue+0x13a/0x3f0 [  258.002955]  ?
>>> __kernfs_remove.part.0+0x11e/0x1e0
>>> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core] [
>>> 258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core] [  258.017959]
>>> ib_unregister_device+0x22/0x30 [ib_core] [  258.023024]
>>> irdma_remove+0x1a/0x60 [irdma] [  258.027223]
>>> auxiliary_bus_remove+0x18/0x30 [  258.031414]
>>> device_release_driver_internal+0x1aa/0x230
>>> [  258.036643]  bus_remove_device+0xd8/0x150 [  258.040654]
>>> device_del+0x18b/0x3f0 [  258.044149]  ice_unplug_aux_dev+0x42/0x60
>>> [ice]
>>
>> We talked about this already - wasn't it on this series?
> 
> I believe this is yet another path (when ice ports are added to a bond) where the RDMA aux device
> is removed while the RTNL lock is held. It is being exposed now by the recent irdma patch 425c9bd06b7a,
> causing a deadlock.
> 
> ice_lag_event_handler [rtnl_lock]
>   ->ice_lag_changeupper_event
>       ->ice_unplug_aux_dev
>          ->irdma_remove
>              ->ib_unregister_device
>                 ->ib_cache_cleanup_one
>                    ->flush_workqueue(ib)
>                       ->irdma_query_port
>                           -> ib_get_eth_speed [rtnl_lock]

I agree with the above analysis.
Maybe a quick and direct fix would look like this.

@@ -74,6 +74,7 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
  {
         struct irdma_device *iwdev = to_iwdev(ibdev);
         struct net_device *netdev = iwdev->netdev;
+       bool unlock_rtnl = false;

         /* no need to zero out pros here. done by caller */

@@ -91,9 +92,16 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
                 props->phys_state = IB_PORT_PHYS_STATE_DISABLED;
         }

+       if (rtnl_is_locked()) {
+               rtnl_unlock();
+               unlock_rtnl = true;
+       }
         ib_get_eth_speed(ibdev, port, &props->active_speed,
                          &props->active_width);

+       if (unlock_rtnl) {
+               rtnl_lock();
+       }
         if (rdma_protocol_roce(ibdev, 1)) {
                 props->gid_tbl_len = 32;
                 props->ip_gids = true;

Zhu Yanjun

> 
> The previous discussion was about an ethtool channel config change, https://lore.kernel.org/linux-rdma/Y5ES3kmYSINlAQhz@x130/,
> which David E. is taking care of.
> 
> We are working on a patch for this issue.
> 
>>
>> Don't hold locks when removing aux devices.
>>
>>> [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice] [
>>> 258.054038]  ice_lag_event_handler+0x51/0x130 [ice] [  258.058930]
>>> raw_notifier_call_chain+0x41/0x60 [  258.063381]
>>> __netdev_upper_dev_link+0x1a0/0x370
>>> [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
>>> [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding] [  258.077517]  ?
>>> nla_put+0x28/0x40 [  258.080756]  do_setlink+0x26c/0xc10 [
>>> 258.084249]  ? avc_alloc_node+0x27/0x180 [  258.088173]  ?
>>> __nla_validate_parse+0x141/0x190 [  258.092708]
>>> __rtnl_newlink+0x53a/0x620 [  258.096549]  rtnl_newlink+0x44/0x70
>>
>> Especially not the rtnl.
>>
>> Jason



* Re: RE: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-10  6:02     ` Zhu Yanjun
@ 2023-01-12 16:25       ` Linus Heckemann
  2023-01-12 16:46         ` Saleem, Shiraz
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Heckemann @ 2023-01-12 16:25 UTC (permalink / raw)
  To: Zhu Yanjun, Saleem, Shiraz, Jason Gunthorpe, Jaroslav Pulchart,
	Ertman, David M, Wesierski, DawidX
  Cc: kamalheib1, leon, sashal, linux-rdma, netdev, Igor Raits

Hi,

Zhu Yanjun <yanjun.zhu@linux.dev> writes:

> On 2023/1/10 3:36, Saleem, Shiraz wrote:
>>> Subject: Re: Network do not works with linux >= 6.1.2. Issue bisected to
>>> "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the
>>> correct link speed)
>>>
>>> On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
>>>> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
>>>> ppid:1      flags:0x00004002
>>>> [  257.975446] Call Trace:
>>>> [  257.977901]  <TASK>
>>>> [  257.980004]  __schedule+0x1eb/0x630 [  257.983498]
>>>> schedule+0x5a/0xd0 [  257.986641]  schedule_timeout+0x11d/0x160 [
>>>> 257.990654]  __wait_for_common+0x90/0x1e0 [  257.994666]  ?
>>>> usleep_range_state+0x90/0x90 [  257.998854]
>>>> __flush_workqueue+0x13a/0x3f0 [  258.002955]  ?
>>>> __kernfs_remove.part.0+0x11e/0x1e0
>>>> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core] [
>>>> 258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core] [  258.017959]
>>>> ib_unregister_device+0x22/0x30 [ib_core] [  258.023024]
>>>> irdma_remove+0x1a/0x60 [irdma] [  258.027223]
>>>> auxiliary_bus_remove+0x18/0x30 [  258.031414]
>>>> device_release_driver_internal+0x1aa/0x230
>>>> [  258.036643]  bus_remove_device+0xd8/0x150 [  258.040654]
>>>> device_del+0x18b/0x3f0 [  258.044149]  ice_unplug_aux_dev+0x42/0x60
>>>> [ice]
>>>
>>> We talked about this already - wasn't it on this series?
>> 
>> I believe this is yet another path (when ice ports are added to a bond) where the RDMA aux device
>> is removed while the RTNL lock is held. It is being exposed now by the recent irdma patch 425c9bd06b7a,
>> causing a deadlock.
>> 
>> ice_lag_event_handler [rtnl_lock]
>>   ->ice_lag_changeupper_event
>>       ->ice_unplug_aux_dev
>>          ->irdma_remove
>>              ->ib_unregister_device
>>                 ->ib_cache_cleanup_one
>>                    ->flush_workqueue(ib)
>>                       ->irdma_query_port
>>                           -> ib_get_eth_speed [rtnl_lock]
>
> I agree with the above analysis.
> Maybe a quick and direct fix would look like this.
>
> @@ -74,6 +74,7 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
>   {
>          struct irdma_device *iwdev = to_iwdev(ibdev);
>          struct net_device *netdev = iwdev->netdev;
> +       bool unlock_rtnl = false;
>
>          /* no need to zero out pros here. done by caller */
>
> @@ -91,9 +92,16 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
>                  props->phys_state = IB_PORT_PHYS_STATE_DISABLED;
>          }
>
> +       if (rtnl_is_locked()) {
> +               rtnl_unlock();
> +               unlock_rtnl = true;
> +       }

Does this not introduce a window where something else could acquire
the lock, when something further up in the call chain could be holding
the lock for a good reason?
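
Concretely, rtnl_is_locked() only reports that the RTNL mutex is currently
held by somebody, not that the current task holds it, so (as a hypothetical
sketch, not code taken from the patch) the pattern boils down to:

#include <linux/rtnetlink.h>
#include <linux/types.h>

/* Sketch of the two ways the rtnl_is_locked() check can go wrong. */
static void rtnl_drop_and_retake_sketch(void)
{
        bool unlock_rtnl = false;

        if (rtnl_is_locked()) {         /* true if *any* task holds RTNL,
                                         * not only the current one */
                rtnl_unlock();          /* if another task owns RTNL this
                                         * unlock is a bug in itself; if we
                                         * own it, the caller loses RTNL
                                         * protection until the re-lock */
                unlock_rtnl = true;
        }

        /* ... ib_get_eth_speed() would run here without RTNL ... */

        if (unlock_rtnl)
                rtnl_lock();
}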

>          ib_get_eth_speed(ibdev, port, &props->active_speed,
>                           &props->active_width);
>
> +       if (unlock_rtnl) {
> +               rtnl_lock();
> +       }
>          if (rdma_protocol_roce(ibdev, 1)) {
>                  props->gid_tbl_len = 32;
>                  props->ip_gids = true;
>

That said, I've tested this anyway and it does appear to solve the
reported deadlock, but seems to introduce other hangs (this is rarer --
I've encountered the issue on 3 boots out of 30). Maybe this is the
race condition which I suspect? Here are the hung-task traces from one
such occurrence.

[  247.237227] INFO: task kworker/10:1:415 blocked for more than 122 seconds.
[  247.244183]       Not tainted 5.15.86 #1-NixOS
[  247.248626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.256453] task:kworker/10:1    state:D stack:    0 pid:  415 ppid:     2 flags:0x00004000
[  247.256457] Workqueue: ipv6_addrconf addrconf_dad_work
[  247.256462] Call Trace:
[  247.256463]  <TASK>
[  247.256466]  __schedule+0x2e6/0x13a0
[  247.256471]  ? ip6_finish_output2+0x1f5/0x6d0
[  247.256473]  schedule+0x5b/0xd0
[  247.256475]  schedule_preempt_disabled+0xa/0x10
[  247.256477]  __mutex_lock.constprop.0+0x216/0x400
[  247.256479]  addrconf_dad_work+0x3e/0x570
[  247.256481]  ? mld_ifc_work+0x1b2/0x450
[  247.256482]  process_one_work+0x1f1/0x390
[  247.256485]  worker_thread+0x53/0x3e0
[  247.256486]  ? process_one_work+0x390/0x390
[  247.256487]  kthread+0x127/0x150
[  247.256489]  ? set_kthread_struct+0x50/0x50
[  247.256491]  ret_from_fork+0x22/0x30
[  247.256494]  </TASK>
[  247.256499] INFO: task kworker/34:1:460 blocked for more than 122 seconds.
[  247.263367]       Not tainted 5.15.86 #1-NixOS
[  247.267813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.275640] task:kworker/34:1    state:D stack:    0 pid:  460 ppid:     2 flags:0x00004000
[  247.275643] Workqueue: events_power_efficient reg_check_chans_work [cfg80211]
[  247.275661] Call Trace:
[  247.275662]  <TASK>
[  247.275663]  __schedule+0x2e6/0x13a0
[  247.275665]  ? select_task_rq_fair+0x1010/0x1090
[  247.275668]  schedule+0x5b/0xd0
[  247.275669]  schedule_preempt_disabled+0xa/0x10
[  247.275671]  __mutex_lock.constprop.0+0x216/0x400
[  247.275673]  reg_check_chans_work+0x2d/0x390 [cfg80211]
[  247.275687]  process_one_work+0x1f1/0x390
[  247.275688]  worker_thread+0x53/0x3e0
[  247.275689]  ? process_one_work+0x390/0x390
[  247.275690]  kthread+0x127/0x150
[  247.275692]  ? set_kthread_struct+0x50/0x50
[  247.275693]  ret_from_fork+0x22/0x30
[  247.275695]  </TASK>
[  247.275698] INFO: task kworker/46:1:484 blocked for more than 122 seconds.
[  247.282572]       Not tainted 5.15.86 #1-NixOS
[  247.287017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.294842] task:kworker/46:1    state:D stack:    0 pid:  484 ppid:     2 flags:0x00004000
[  247.294844] Workqueue: ipv6_addrconf addrconf_verify_work
[  247.294846] Call Trace:
[  247.294847]  <TASK>
[  247.294847]  __schedule+0x2e6/0x13a0
[  247.294849]  ? raw_spin_rq_lock_nested+0x12/0x70
[  247.294851]  ? newidle_balance+0x2ee/0x400
[  247.294852]  schedule+0x5b/0xd0
[  247.294854]  schedule_preempt_disabled+0xa/0x10
[  247.294856]  __mutex_lock.constprop.0+0x216/0x400
[  247.294857]  addrconf_verify_work+0xa/0x20
[  247.294858]  process_one_work+0x1f1/0x390
[  247.294859]  worker_thread+0x53/0x3e0
[  247.294860]  ? process_one_work+0x390/0x390
[  247.294861]  kthread+0x127/0x150
[  247.294862]  ? set_kthread_struct+0x50/0x50
[  247.294864]  ret_from_fork+0x22/0x30
[  247.294866]  </TASK>
[  247.294888] INFO: task systemd-udevd:1658 blocked for more than 122 seconds.
[  247.301934]       Not tainted 5.15.86 #1-NixOS
[  247.306379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.314206] task:systemd-udevd   state:D stack:    0 pid: 1658 ppid:  1506 flags:0x00004226
[  247.314208] Call Trace:
[  247.314209]  <TASK>
[  247.314209]  __schedule+0x2e6/0x13a0
[  247.314212]  schedule+0x5b/0xd0
[  247.314214]  schedule_preempt_disabled+0xa/0x10
[  247.314216]  __mutex_lock.constprop.0+0x216/0x400
[  247.314217]  ? kobject_uevent_env+0x11f/0x680
[  247.314220]  ? bus_add_driver+0x1a8/0x200
[  247.314225]  register_netdevice_notifier+0x21/0x110
[  247.314228]  ? 0xffffffffc181f000
[  247.314247]  irdma_init_module+0xa6/0x1000 [irdma]
[  247.314258]  do_one_initcall+0x44/0x1d0
[  247.314260]  ? __cond_resched+0x16/0x50
[  247.314262]  ? kmem_cache_alloc_trace+0x44/0x3d0
[  247.314265]  do_init_module+0x4c/0x250
[  247.314268]  __do_sys_init_module+0x12e/0x1b0
[  247.314271]  do_syscall_64+0x3b/0x90
[  247.314274]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  247.314276] RIP: 0033:0x7f7a2089a0fe
[  247.314279] RSP: 002b:00007ffd449a33e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[  247.314281] RAX: ffffffffffffffda RBX: 0000556ce7b5f6c0 RCX: 00007f7a2089a0fe
[  247.314282] RDX: 00007f7a209feb1d RSI: 00000000000cef90 RDI: 0000556ce7f21430
[  247.314283] RBP: 0000556ce7f21430 R08: 0000000000000007 R09: 0000556ce7b607e0
[  247.314283] R10: 0000000000000005 R11: 0000000000000246 R12: 00007f7a209feb1d
[  247.314284] R13: 0000556ce7b5f6c0 R14: 0000556ce7b472b0 R15: 0000556ce7b5f6c0
[  247.314285]  </TASK>

Thanks for looking into this!

Linus Heckemann


* RE: RE: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-12 16:25       ` Linus Heckemann
@ 2023-01-12 16:46         ` Saleem, Shiraz
  0 siblings, 0 replies; 13+ messages in thread
From: Saleem, Shiraz @ 2023-01-12 16:46 UTC (permalink / raw)
  To: Linus Heckemann, Zhu Yanjun, Jason Gunthorpe, Jaroslav Pulchart,
	Ertman, David M, Wesierski, DawidX
  Cc: kamalheib1, leon, sashal, linux-rdma, netdev, Igor Raits

> Subject: Re: RE: Network do not works with linux >= 6.1.2. Issue bisected to
> "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the
> correct link speed)
> 
> Hi,
> 
> Zhu Yanjun <yanjun.zhu@linux.dev> writes:
> 
> > On 2023/1/10 3:36, Saleem, Shiraz wrote:
> >>> Subject: Re: Network do not works with linux >= 6.1.2. Issue
> >>> bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma:
> >>> Report the correct link speed)
> >>>
> >>> On Fri, Jan 06, 2023 at 08:55:29AM +0100, Jaroslav Pulchart wrote:
> >>>> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
> >>>> ppid:1      flags:0x00004002
> >>>> [  257.975446] Call Trace:
> >>>> [  257.977901]  <TASK>
> >>>> [  257.980004]  __schedule+0x1eb/0x630 [  257.983498]
> >>>> schedule+0x5a/0xd0 [  257.986641]  schedule_timeout+0x11d/0x160 [
> >>>> 257.990654]  __wait_for_common+0x90/0x1e0 [  257.994666]  ?
> >>>> usleep_range_state+0x90/0x90 [  257.998854]
> >>>> __flush_workqueue+0x13a/0x3f0 [  258.002955]  ?
> >>>> __kernfs_remove.part.0+0x11e/0x1e0
> >>>> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core] [
> >>>> 258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core] [
> >>>> 258.017959]
> >>>> ib_unregister_device+0x22/0x30 [ib_core] [  258.023024]
> >>>> irdma_remove+0x1a/0x60 [irdma] [  258.027223]
> >>>> auxiliary_bus_remove+0x18/0x30 [  258.031414]
> >>>> device_release_driver_internal+0x1aa/0x230
> >>>> [  258.036643]  bus_remove_device+0xd8/0x150 [  258.040654]
> >>>> device_del+0x18b/0x3f0 [  258.044149]  ice_unplug_aux_dev+0x42/0x60
> >>>> [ice]
> >>>
> >>> We talked about this already - wasn't it on this series?
> >>
> >> This is yet another path (when ice ports are added to a bond) I
> >> believe where the RDMA aux device is removed holding the RTNL lock.
> >> It's being exposed now with this recent irdma patch - 425c9bd06b7a, causing a
> deadlock.
> >>
> >> ice_lag_event_handler [rtnl_lock]
> >>   ->ice_lag_changeupper_event
> >>       ->ice_unplug_aux_dev
> >>          ->irdma_remove
> >>              ->ib_unregister_device
> >>                 ->ib_cache_cleanup_one
> >>                    ->flush_workqueue(ib)
> >>                       ->irdma_query_port
> >>                           -> ib_get_eth_speed [rtnl_lock]
> >
> > I agree with the above analysis.
> > Maybe a quick and direct fix would look like this.
> >
> > @@ -74,6 +74,7 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
> >   {
> >          struct irdma_device *iwdev = to_iwdev(ibdev);
> >          struct net_device *netdev = iwdev->netdev;
> > +       bool unlock_rtnl = false;
> >
> >          /* no need to zero out pros here. done by caller */
> >
> > @@ -91,9 +92,16 @@ static int irdma_query_port(struct ib_device *ibdev, u32 port,
> >                  props->phys_state = IB_PORT_PHYS_STATE_DISABLED;
> >          }
> >
> > +       if (rtnl_is_locked()) {
> > +               rtnl_unlock();
> > +               unlock_rtnl = true;
> > +       }
> 
> Does this not introduce a window where something else could acquire the lock,
> when something further up in the call chain could be holding the lock for a good
> reason?
> 
> >          ib_get_eth_speed(ibdev, port, &props->active_speed,
> >                           &props->active_width);
> >
> > +       if (unlock_rtnl) {
> > +               rtnl_lock();
> > +       }
> >          if (rdma_protocol_roce(ibdev, 1)) {
> >                  props->gid_tbl_len = 32;
> >                  props->ip_gids = true;
> >
> 
> That said, I've tested this anyway and it does appear to solve the reported
> deadlock, but seems to introduce other hangs (this is rarer -- I've encountered the
> issue on 3 boots out of 30). Maybe this is the race condition which I suspect? Here
> are the hung-task traces from one such occurrence.

David E. is fixing this in ice, by deferring the remove/add of the auxiliary device outside the RTNL context during an ice bonding event.

A patch for it will be posted when ready.
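
As a rough illustration of the direction (not the actual patch; struct ice_pf
and ice_unplug_aux_dev() exist in the ice driver, but the deferral wrapper and
its names below are invented for this sketch), the teardown could be queued to
a work item so it runs without RTNL:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct ice_pf;                                  /* real definition lives in the ice driver */
void ice_unplug_aux_dev(struct ice_pf *pf);     /* existing ice helper */

struct ice_aux_unplug_work {
        struct work_struct work;
        struct ice_pf *pf;
};

static void ice_aux_unplug_worker(struct work_struct *work)
{
        struct ice_aux_unplug_work *w =
                container_of(work, struct ice_aux_unplug_work, work);

        /* No RTNL held here, so ib_unregister_device() can flush the IB cache
         * workqueue even though irdma_query_port() takes RTNL. */
        ice_unplug_aux_dev(w->pf);
        kfree(w);
}

/* Hypothetical call site: the changeupper notifier, which still holds RTNL. */
static int ice_defer_aux_unplug(struct ice_pf *pf)
{
        struct ice_aux_unplug_work *w = kzalloc(sizeof(*w), GFP_KERNEL);

        if (!w)
                return -ENOMEM;

        w->pf = pf;
        INIT_WORK(&w->work, ice_aux_unplug_worker);
        schedule_work(&w->work);
        return 0;
}

A real patch also has to serialize against a concurrent plug of the aux device
and handle the lifetime of pf; the sketch above ignores both.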


> 
> [  247.237227] INFO: task kworker/10:1:415 blocked for more than 122 seconds.
> [  247.244183]       Not tainted 5.15.86 #1-NixOS
> [  247.248626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> [  247.256453] task:kworker/10:1    state:D stack:    0 pid:  415 ppid:     2
> flags:0x00004000
> [  247.256457] Workqueue: ipv6_addrconf addrconf_dad_work [  247.256462] Call
> Trace:
> [  247.256463]  <TASK>
> [  247.256466]  __schedule+0x2e6/0x13a0
> [  247.256471]  ? ip6_finish_output2+0x1f5/0x6d0 [  247.256473]
> schedule+0x5b/0xd0 [  247.256475]  schedule_preempt_disabled+0xa/0x10
> [  247.256477]  __mutex_lock.constprop.0+0x216/0x400
> [  247.256479]  addrconf_dad_work+0x3e/0x570 [  247.256481]  ?
> mld_ifc_work+0x1b2/0x450 [  247.256482]  process_one_work+0x1f1/0x390 [
> 247.256485]  worker_thread+0x53/0x3e0 [  247.256486]  ?
> process_one_work+0x390/0x390 [  247.256487]  kthread+0x127/0x150 [
> 247.256489]  ? set_kthread_struct+0x50/0x50 [  247.256491]
> ret_from_fork+0x22/0x30 [  247.256494]  </TASK> [  247.256499] INFO: task
> kworker/34:1:460 blocked for more than 122 seconds.
> [  247.263367]       Not tainted 5.15.86 #1-NixOS
> [  247.267813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> [  247.275640] task:kworker/34:1    state:D stack:    0 pid:  460 ppid:     2
> flags:0x00004000
> [  247.275643] Workqueue: events_power_efficient reg_check_chans_work
> [cfg80211] [  247.275661] Call Trace:
> [  247.275662]  <TASK>
> [  247.275663]  __schedule+0x2e6/0x13a0
> [  247.275665]  ? select_task_rq_fair+0x1010/0x1090 [  247.275668]
> schedule+0x5b/0xd0 [  247.275669]  schedule_preempt_disabled+0xa/0x10
> [  247.275671]  __mutex_lock.constprop.0+0x216/0x400
> [  247.275673]  reg_check_chans_work+0x2d/0x390 [cfg80211] [  247.275687]
> process_one_work+0x1f1/0x390 [  247.275688]  worker_thread+0x53/0x3e0 [
> 247.275689]  ? process_one_work+0x390/0x390 [  247.275690]
> kthread+0x127/0x150 [  247.275692]  ? set_kthread_struct+0x50/0x50 [
> 247.275693]  ret_from_fork+0x22/0x30 [  247.275695]  </TASK> [  247.275698]
> INFO: task kworker/46:1:484 blocked for more than 122 seconds.
> [  247.282572]       Not tainted 5.15.86 #1-NixOS
> [  247.287017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> [  247.294842] task:kworker/46:1    state:D stack:    0 pid:  484 ppid:     2
> flags:0x00004000
> [  247.294844] Workqueue: ipv6_addrconf addrconf_verify_work [  247.294846]
> Call Trace:
> [  247.294847]  <TASK>
> [  247.294847]  __schedule+0x2e6/0x13a0
> [  247.294849]  ? raw_spin_rq_lock_nested+0x12/0x70 [  247.294851]  ?
> newidle_balance+0x2ee/0x400 [  247.294852]  schedule+0x5b/0xd0 [  247.294854]
> schedule_preempt_disabled+0xa/0x10
> [  247.294856]  __mutex_lock.constprop.0+0x216/0x400
> [  247.294857]  addrconf_verify_work+0xa/0x20 [  247.294858]
> process_one_work+0x1f1/0x390 [  247.294859]  worker_thread+0x53/0x3e0 [
> 247.294860]  ? process_one_work+0x390/0x390 [  247.294861]
> kthread+0x127/0x150 [  247.294862]  ? set_kthread_struct+0x50/0x50 [
> 247.294864]  ret_from_fork+0x22/0x30 [  247.294866]  </TASK> [  247.294888]
> INFO: task systemd-udevd:1658 blocked for more than 122 seconds.
> [  247.301934]       Not tainted 5.15.86 #1-NixOS
> [  247.306379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> [  247.314206] task:systemd-udevd   state:D stack:    0 pid: 1658 ppid:  1506
> flags:0x00004226
> [  247.314208] Call Trace:
> [  247.314209]  <TASK>
> [  247.314209]  __schedule+0x2e6/0x13a0
> [  247.314212]  schedule+0x5b/0xd0
> [  247.314214]  schedule_preempt_disabled+0xa/0x10
> [  247.314216]  __mutex_lock.constprop.0+0x216/0x400
> [  247.314217]  ? kobject_uevent_env+0x11f/0x680 [  247.314220]  ?
> bus_add_driver+0x1a8/0x200 [  247.314225]
> register_netdevice_notifier+0x21/0x110
> [  247.314228]  ? 0xffffffffc181f000
> [  247.314247]  irdma_init_module+0xa6/0x1000 [irdma] [  247.314258]
> do_one_initcall+0x44/0x1d0 [  247.314260]  ? __cond_resched+0x16/0x50 [
> 247.314262]  ? kmem_cache_alloc_trace+0x44/0x3d0 [  247.314265]
> do_init_module+0x4c/0x250 [  247.314268]  __do_sys_init_module+0x12e/0x1b0 [
> 247.314271]  do_syscall_64+0x3b/0x90 [  247.314274]
> entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [  247.314276] RIP: 0033:0x7f7a2089a0fe
> [  247.314279] RSP: 002b:00007ffd449a33e8 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000af [  247.314281] RAX: ffffffffffffffda RBX: 0000556ce7b5f6c0
> RCX: 00007f7a2089a0fe [  247.314282] RDX: 00007f7a209feb1d RSI:
> 00000000000cef90 RDI: 0000556ce7f21430 [  247.314283] RBP: 0000556ce7f21430
> R08: 0000000000000007 R09: 0000556ce7b607e0 [  247.314283] R10:
> 0000000000000005 R11: 0000000000000246 R12: 00007f7a209feb1d [  247.314284]
> R13: 0000556ce7b5f6c0 R14: 0000556ce7b472b0 R15: 0000556ce7b5f6c0 [
> 247.314285]  </TASK>
> 
> Thanks for looking into this!
> 
> Linus Heckemann


* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-01-06 11:11 ` Linux kernel regression tracking (#adding)
@ 2023-02-28 10:09   ` Linux regression tracking (Thorsten Leemhuis)
  2023-02-28 12:33     ` Jaroslav Pulchart
  0 siblings, 1 reply; 13+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-02-28 10:09 UTC (permalink / raw)
  To: Jaroslav Pulchart, kamalheib1, shiraz.saleem, leon, sashal,
	linux-rdma, netdev, Tony Nguyen
  Cc: Igor Raits, regressions

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

On 06.01.23 12:11, Linux kernel regression tracking (#adding) wrote:
> On 06.01.23 08:55, Jaroslav Pulchart wrote:
>> Hello,
>>
>> I would like to report a network regression in kernels >= 6.1.2 (it
>> looks like the NIC is not UP) on our Dell R7525 servers with E810 NICs.
>> The issue was observed after updating from 6.1.0 to 6.1.2 or newer
>> (tested up to the newest 6.1.4-rc1). The system is not accessible and
>> all services are in D state after each reboot.

Can anyone please provide a status on this? It seems to take quite a
while to get this regression fixed, which is unfortunate. Or was
progress made somewhere and I just missed it?

I noticed Tony tried to address this in mainline, but the last thing I'm
aware of is "Please ignore/drop this. Just saw that this change doesn't
solve the issue." here:

https://lore.kernel.org/all/b944d1d4-7f90-dcef-231c-91bb031a4275@intel.com/#t

Should the backport to 6.1.y (425c9bd06b7a) maybe be dropped to at
least resolve the issue there until this is fixed in mainline? Or would
that cause a regression as well?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

>> [  257.625207]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
>> [  257.631911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  257.639740] task:kworker/u192:1  state:D stack:0     pid:11
>> ppid:2      flags:0x00004000
>> [  257.648095] Workqueue: netns cleanup_net
>> [  257.652029] Call Trace:
>> [  257.654481]  <TASK>
>> [  257.656589]  __schedule+0x1eb/0x630
>> [  257.660087]  schedule+0x5a/0xd0
>> [  257.663233]  schedule_preempt_disabled+0x11/0x20
>> [  257.667851]  __mutex_lock.constprop.0+0x372/0x6c0
>> [  257.672561]  rdma_dev_change_netns+0x25/0x120 [ib_core]
>> [  257.677821]  rdma_dev_exit_net+0x139/0x1e0 [ib_core]
>> [  257.682804]  ops_exit_list+0x30/0x70
>> [  257.686382]  cleanup_net+0x213/0x3b0
>> [  257.689964]  process_one_work+0x1e2/0x3b0
>> [  257.693984]  ? rescuer_thread+0x390/0x390
>> [  257.697995]  worker_thread+0x50/0x3a0
>> [  257.701661]  ? rescuer_thread+0x390/0x390
>> [  257.705674]  kthread+0xd6/0x100
>> [  257.708819]  ? kthread_complete_and_exit+0x20/0x20
>> [  257.713613]  ret_from_fork+0x1f/0x30
>> [  257.717192]  </TASK>
>> [  257.719496] INFO: task kworker/87:0:470 blocked for more than 122 seconds.
>> [  257.726423]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
>> [  257.733123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  257.740949] task:kworker/87:0    state:D stack:0     pid:470
>> ppid:2      flags:0x00004000
>> [  257.749307] Workqueue: events linkwatch_event
>> [  257.753672] Call Trace:
>> [  257.756124]  <TASK>
>> [  257.758228]  __schedule+0x1eb/0x630
>> [  257.761723]  schedule+0x5a/0xd0
>> [  257.764867]  schedule_preempt_disabled+0x11/0x20
>> [  257.769487]  __mutex_lock.constprop.0+0x372/0x6c0
>> [  257.774196]  ? pick_next_task+0x57/0x9b0
>> [  257.778127]  ? finish_task_switch.isra.0+0x8f/0x2a0
>> [  257.783007]  linkwatch_event+0xa/0x30
>> [  257.786674]  process_one_work+0x1e2/0x3b0
>> [  257.790687]  worker_thread+0x50/0x3a0
>> [  257.794352]  ? rescuer_thread+0x390/0x390
>> [  257.798365]  kthread+0xd6/0x100
>> [  257.801513]  ? kthread_complete_and_exit+0x20/0x20
>> [  257.806303]  ret_from_fork+0x1f/0x30
>> [  257.809885]  </TASK>
>> [  257.812109] INFO: task kworker/39:1:614 blocked for more than 123 seconds.
>> [  257.818984]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
>> [  257.825686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  257.833519] task:kworker/39:1    state:D stack:0     pid:614
>> ppid:2      flags:0x00004000
>> [  257.841869] Workqueue: infiniband ib_cache_event_task [ib_core]
>> [  257.847802] Call Trace:
>> [  257.850252]  <TASK>
>> [  257.852360]  __schedule+0x1eb/0x630
>> [  257.855851]  schedule+0x5a/0xd0
>> [  257.858998]  schedule_preempt_disabled+0x11/0x20
>> [  257.863617]  __mutex_lock.constprop.0+0x372/0x6c0
>> [  257.868325]  ib_get_eth_speed+0x65/0x190 [ib_core]
>> [  257.873127]  ? ib_cache_update.part.0+0x4b/0x2b0 [ib_core]
>> [  257.878619]  ? __kmem_cache_alloc_node+0x18c/0x2b0
>> [  257.883417]  irdma_query_port+0xb3/0x110 [irdma]
>> [  257.888051]  ib_query_port+0xaa/0x100 [ib_core]
>> [  257.892601]  ib_cache_update.part.0+0x65/0x2b0 [ib_core]
>> [  257.897924]  ? pick_next_task+0x57/0x9b0
>> [  257.901855]  ? dequeue_task_fair+0xb6/0x3c0
>> [  257.906043]  ? finish_task_switch.isra.0+0x8f/0x2a0
>> [  257.910920]  ib_cache_event_task+0x58/0x80 [ib_core]
>> [  257.915906]  process_one_work+0x1e2/0x3b0
>> [  257.919918]  ? rescuer_thread+0x390/0x390
>> [  257.923931]  worker_thread+0x50/0x3a0
>> [  257.927595]  ? rescuer_thread+0x390/0x390
>> [  257.931609]  kthread+0xd6/0x100
>> [  257.934755]  ? kthread_complete_and_exit+0x20/0x20
>> [  257.939549]  ret_from_fork+0x1f/0x30
>> [  257.943128]  </TASK>
>> [  257.945438] INFO: task NetworkManager:3387 blocked for more than 123 seconds.
>> [  257.952577]       Tainted: G            E      6.1.4-0.gdc.el9.x86_64 #1
>> [  257.959274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  257.967099] task:NetworkManager  state:D stack:0     pid:3387
>> ppid:1      flags:0x00004002
>> [  257.975446] Call Trace:
>> [  257.977901]  <TASK>
>> [  257.980004]  __schedule+0x1eb/0x630
>> [  257.983498]  schedule+0x5a/0xd0
>> [  257.986641]  schedule_timeout+0x11d/0x160
>> [  257.990654]  __wait_for_common+0x90/0x1e0
>> [  257.994666]  ? usleep_range_state+0x90/0x90
>> [  257.998854]  __flush_workqueue+0x13a/0x3f0
>> [  258.002955]  ? __kernfs_remove.part.0+0x11e/0x1e0
>> [  258.007661]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
>> [  258.012721]  __ib_unregister_device+0x62/0xa0 [ib_core]
>> [  258.017959]  ib_unregister_device+0x22/0x30 [ib_core]
>> [  258.023024]  irdma_remove+0x1a/0x60 [irdma]
>> [  258.027223]  auxiliary_bus_remove+0x18/0x30
>> [  258.031414]  device_release_driver_internal+0x1aa/0x230
>> [  258.036643]  bus_remove_device+0xd8/0x150
>> [  258.040654]  device_del+0x18b/0x3f0
>> [  258.044149]  ice_unplug_aux_dev+0x42/0x60 [ice]
>> [  258.048707]  ice_lag_changeupper_event+0x287/0x2a0 [ice]
>> [  258.054038]  ice_lag_event_handler+0x51/0x130 [ice]
>> [  258.058930]  raw_notifier_call_chain+0x41/0x60
>> [  258.063381]  __netdev_upper_dev_link+0x1a0/0x370
>> [  258.068008]  netdev_master_upper_dev_link+0x3d/0x60
>> [  258.072886]  bond_enslave+0xd16/0x16f0 [bonding]
>> [  258.077517]  ? nla_put+0x28/0x40
>> [  258.080756]  do_setlink+0x26c/0xc10
>> [  258.084249]  ? avc_alloc_node+0x27/0x180
>> [  258.088173]  ? __nla_validate_parse+0x141/0x190
>> [  258.092708]  __rtnl_newlink+0x53a/0x620
>> [  258.096549]  rtnl_newlink+0x44/0x70
>> [  258.100040]  rtnetlink_rcv_msg+0x159/0x3d0
>> [  258.104140]  ? rtnl_calcit.isra.0+0x140/0x140
>> [  258.108496]  netlink_rcv_skb+0x4e/0x100
>> [  258.112338]  netlink_unicast+0x23b/0x360
>> [  258.116264]  netlink_sendmsg+0x24e/0x4b0
>> [  258.120191]  sock_sendmsg+0x5f/0x70
>> [  258.123684]  ____sys_sendmsg+0x241/0x2c0
>> [  258.127609]  ? copy_msghdr_from_user+0x6d/0xa0
>> [  258.132054]  ___sys_sendmsg+0x88/0xd0
>> [  258.135722]  ? ___sys_recvmsg+0x88/0xd0
>> [  258.139559]  ? wake_up_q+0x4a/0x90
>> [  258.142967]  ? rseq_get_rseq_cs.isra.0+0x16/0x220
>> [  258.147673]  ? __fget_light+0xa4/0x130
>> [  258.151434]  __sys_sendmsg+0x59/0xa0
>> [  258.155012]  do_syscall_64+0x38/0x90
>> [  258.158591]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [  258.163645] RIP: 0033:0x7ff23714fa7d
>> [  258.167226] RSP: 002b:00007ffdddfc8c70 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
>> [  258.174798] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff23714fa7d
>> [  258.181933] RDX: 0000000000000000 RSI: 00007ffdddfc8cb0 RDI: 000000000000000d
>> [  258.189063] RBP: 00005572f5d77040 R08: 0000000000000000 R09: 0000000000000000
>> [  258.196197] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdddfc8e1c
>> [  258.203332] R13: 00007ffdddfc8e20 R14: 0000000000000000 R15: 00007ffdddfc8e28
>> [  258.210464]  </TASK>
>> ...
>>
>> I bisected the issue to a commit
>> "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the
>> correct link speed). Reverting this commit in my kernel build "fix"
>> the issue and the server has a working network again.
> 
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
> 
> #regzbot ^introduced 425c9bd06b7a7079
> #regzbot title RDMA/irdma: network stopped working
> #regzbot ignore-activity
> 
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
> 
> Reminder for developers: When fixing the issue, add 'Link:' tags
> pointing to the report (see page linked in footer for details).
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-02-28 10:09   ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-02-28 12:33     ` Jaroslav Pulchart
  2023-02-28 22:10       ` Tony Nguyen
  0 siblings, 1 reply; 13+ messages in thread
From: Jaroslav Pulchart @ 2023-02-28 12:33 UTC (permalink / raw)
  To: Thorsten Leemhuis, Linux regressions mailing list
  Cc: kamalheib1, shiraz.saleem, leon, sashal, linux-rdma, netdev,
	Tony Nguyen, Igor Raits

>
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
>
> On 06.01.23 12:11, Linux kernel regression tracking (#adding) wrote:
> > On 06.01.23 08:55, Jaroslav Pulchart wrote:
> >> Hello,
> >>
> >> I would like to report a >= 6.1.2 some network regression (looks like
> >> NIC us not UP) on our Dell R7525 servers with E810 NICs. The issue was
> >> observed after I updated 6.1.0 to 6.1.2 or newer (tested up to newest
> >> 6.1.4-rc1). The system is not accesible and all services are in D
> >> state after each reboot.
>
> Can anyone please provide a status on this? It seems to take quite a
> while to get this regression fixed, which is unfortunate. Or was
> progress made somewhere and I just missed it?
>
> I noticed Tony tried to address this in mainline, but the last thing I'm
> aware of is "Please ignore/drop this. Just saw that this change doesn't
> solve the issue." here:
>

FYI: We are building 6.1.y with the provided patch to fix the
regression in our environment.
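In case it is useful to anyone else carrying the same workaround, our build
step is roughly the following (the shallow clone, reuse of the running
kernel's config and the plain "make install" at the end are just how we do
it here, adjust as needed; the lore URL is the v1 patch linked below):

  # stable 6.1.y tree, a shallow clone is enough for a build
  git clone --depth 1 --branch linux-6.1.y \
      https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
  cd linux

  # apply the proposed fix on top (raw mbox straight from lore)
  curl -sL 'https://lore.kernel.org/netdev/20230131213703.1347761-2-anthony.l.nguyen@intel.com/raw' | git am

  # reuse the running kernel's config and build/install as usual
  cp /boot/config-"$(uname -r)" .config
  make olddefconfig
  make -j"$(nproc)"
  sudo make modules_install install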

> https://lore.kernel.org/all/b944d1d4-7f90-dcef-231c-91bb031a4275@intel.com/#t
>
> Should the backport to 6.1.y (425c9bd06b7a ) maybe be dropped to at
> least resolve the issue there until this is fixed in mainline? Or would
> that cause a regression as well?
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>
> [snip]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-02-28 12:33     ` Jaroslav Pulchart
@ 2023-02-28 22:10       ` Tony Nguyen
  2023-03-01 13:27         ` Jaroslav Pulchart
  0 siblings, 1 reply; 13+ messages in thread
From: Tony Nguyen @ 2023-02-28 22:10 UTC (permalink / raw)
  To: Jaroslav Pulchart, Thorsten Leemhuis, Linux regressions mailing list
  Cc: kamalheib1, shiraz.saleem, leon, sashal, linux-rdma, netdev, Igor Raits

On 2/28/2023 4:33 AM, Jaroslav Pulchart wrote:
>>
>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>> to make this easily accessible to everyone.
>>
>> On 06.01.23 12:11, Linux kernel regression tracking (#adding) wrote:
>>> On 06.01.23 08:55, Jaroslav Pulchart wrote:
>>>> Hello,
>>>>
>>>> I would like to report a >= 6.1.2 some network regression (looks like
>>>> NIC us not UP) on our Dell R7525 servers with E810 NICs. The issue was
>>>> observed after I updated 6.1.0 to 6.1.2 or newer (tested up to newest
>>>> 6.1.4-rc1). The system is not accesible and all services are in D
>>>> state after each reboot.
>>
>> Can anyone please provide a status on this? It seems to take quite a
>> while to get this regression fixed, which is unfortunate. Or was
>> progress made somewhere and I just missed it?
>>
>> I noticed Tony tried to address this in mainline, but the last thing I'm
>> aware of is "Please ignore/drop this. Just saw that this change doesn't
>> solve the issue." here:
>>
> 
> FYI: We are building 6.1.y with the provided patch to fix the
> regression in our environment.

Thanks for the input Jaroslav; just to be clear, are you using the v1 
[1] or v2 [2] of the patch?

We're doing more testing on v2, but I was going to reach out to you 
afterwards to see if you would mind testing the v2 as we haven't heard 
from the other reporter who said v2 didn't work for him.

Thanks,
Tony

[1] 
https://lore.kernel.org/netdev/20230131213703.1347761-2-anthony.l.nguyen@intel.com/
[2] 
https://lore.kernel.org/netdev/20230217004201.2895321-1-anthony.l.nguyen@intel.com/
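(If it saves you a step: b4 should be able to pull the v2 patch straight
off lore by message-id, something along the lines of

  b4 am 20230217004201.2895321-1-anthony.l.nguyen@intel.com
  git am ./*.mbx

with the id taken from [2] above.)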

>> [snip]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-02-28 22:10       ` Tony Nguyen
@ 2023-03-01 13:27         ` Jaroslav Pulchart
  2023-03-01 15:05           ` Jaroslav Pulchart
  0 siblings, 1 reply; 13+ messages in thread
From: Jaroslav Pulchart @ 2023-03-01 13:27 UTC (permalink / raw)
  To: Tony Nguyen
  Cc: Thorsten Leemhuis, Linux regressions mailing list, kamalheib1,
	shiraz.saleem, leon, sashal, linux-rdma, netdev, Igor Raits

>
> On 2/28/2023 4:33 AM, Jaroslav Pulchart wrote:
> >>
> >> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> >> to make this easily accessible to everyone.
> >>
> >> On 06.01.23 12:11, Linux kernel regression tracking (#adding) wrote:
> >>> On 06.01.23 08:55, Jaroslav Pulchart wrote:
> >>>> Hello,
> >>>>
> >>>> I would like to report a >= 6.1.2 some network regression (looks like
> >>>> NIC us not UP) on our Dell R7525 servers with E810 NICs. The issue was
> >>>> observed after I updated 6.1.0 to 6.1.2 or newer (tested up to newest
> >>>> 6.1.4-rc1). The system is not accesible and all services are in D
> >>>> state after each reboot.
> >>
> >> Can anyone please provide a status on this? It seems to take quite a
> >> while to get this regression fixed, which is unfortunate. Or was
> >> progress made somewhere and I just missed it?
> >>
> >> I noticed Tony tried to address this in mainline, but the last thing I'm
> >> aware of is "Please ignore/drop this. Just saw that this change doesn't
> >> solve the issue." here:
> >>
> >
> > FYI: We are building 6.1.y with the provided patch to fix the
> > regression in our environment.
>
> Thanks for the input Jaroslav; just to be clear, are you using the v1
> [1] or v2 [2] of the patch?

We are using v1 [1] of the patch.


>
> We're doing more testing on v2, but I was going to reach out to you
> afterwards to see if you would mind testing the v2 as we haven't heard
> from the other reporter who said v2 didn't work for him.
>
> Thanks,
> Tony
>
> [1]
> https://lore.kernel.org/netdev/20230131213703.1347761-2-anthony.l.nguyen@intel.com/
> [2]
> https://lore.kernel.org/netdev/20230217004201.2895321-1-anthony.l.nguyen@intel.com/
>
> > [snip]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed)
  2023-03-01 13:27         ` Jaroslav Pulchart
@ 2023-03-01 15:05           ` Jaroslav Pulchart
  0 siblings, 0 replies; 13+ messages in thread
From: Jaroslav Pulchart @ 2023-03-01 15:05 UTC (permalink / raw)
  To: Tony Nguyen
  Cc: Thorsten Leemhuis, Linux regressions mailing list, kamalheib1,
	shiraz.saleem, leon, sashal, linux-rdma, netdev, Igor Raits

>
> >
> > On 2/28/2023 4:33 AM, Jaroslav Pulchart wrote:
> > >>
> > >> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> > >> to make this easily accessible to everyone.
> > >>
> > >> On 06.01.23 12:11, Linux kernel regression tracking (#adding) wrote:
> > >>> On 06.01.23 08:55, Jaroslav Pulchart wrote:
> > >>>> Hello,
> > >>>>
> > >>>> I would like to report a >= 6.1.2 some network regression (looks like
> > >>>> NIC us not UP) on our Dell R7525 servers with E810 NICs. The issue was
> > >>>> observed after I updated 6.1.0 to 6.1.2 or newer (tested up to newest
> > >>>> 6.1.4-rc1). The system is not accesible and all services are in D
> > >>>> state after each reboot.
> > >>
> > >> Can anyone please provide a status on this? It seems to take quite a
> > >> while to get this regression fixed, which is unfortunate. Or was
> > >> progress made somewhere and I just missed it?
> > >>
> > >> I noticed Tony tried to address this in mainline, but the last thing I'm
> > >> aware of is "Please ignore/drop this. Just saw that this change doesn't
> > >> solve the issue." here:
> > >>
> > >
> > > FYI: We are building 6.1.y with the provided patch to fix the
> > > regression in our environment.
> >
> > Thanks for the input Jaroslav; just to be clear, are you using the v1
> > [1] or v2 [2] of the patch?
>
> We are using v1 [1] of the patch.
>
>
> >
> > We're doing more testing on v2, but I was going to reach out to you
> > afterwards to see if you would mind testing the v2 as we haven't heard
> > from the other reporter who said v2 didn't work for him.
> >

I rebuilt 6.1.14 with the v2 [2] patch, installed it, and rebooted the
server a few times, and it works.
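
The smoke test after each reboot is nothing fancy, roughly the following
(interface names like bond0 are specific to our hosts):

  # link and bonding state came up
  ip -br link show
  grep -E 'Bonding Mode|MII Status|Slave Interface' /proc/net/bonding/bond0

  # no hung-task splats, no tasks stuck in D state
  dmesg | grep -c 'blocked for more than' || true
  ps -eo state,pid,comm | awk '$1 == "D"'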

> > Thanks,
> > Tony
> >
> > [1]
> > https://lore.kernel.org/netdev/20230131213703.1347761-2-anthony.l.nguyen@intel.com/
> > [2]
> > https://lore.kernel.org/netdev/20230217004201.2895321-1-anthony.l.nguyen@intel.com/
> >
> > [snip]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-03-01 15:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-06  7:55 Network do not works with linux >= 6.1.2. Issue bisected to "425c9bd06b7a70796d880828d15c11321bdfb76d" (RDMA/irdma: Report the correct link speed) Jaroslav Pulchart
2023-01-06 11:11 ` Linux kernel regression tracking (#adding)
2023-02-28 10:09   ` Linux regression tracking (Thorsten Leemhuis)
2023-02-28 12:33     ` Jaroslav Pulchart
2023-02-28 22:10       ` Tony Nguyen
2023-03-01 13:27         ` Jaroslav Pulchart
2023-03-01 15:05           ` Jaroslav Pulchart
2023-01-06 16:15 ` Jason Gunthorpe
2023-01-09  9:17   ` Jaroslav Pulchart
2023-01-09 19:36   ` Saleem, Shiraz
2023-01-10  6:02     ` Zhu Yanjun
2023-01-12 16:25       ` Linus Heckemann
2023-01-12 16:46         ` Saleem, Shiraz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).