netdev.vger.kernel.org archive mirror
* [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
@ 2020-04-12  6:08 Leon Romanovsky
  2020-04-12 18:59 ` Jakub Kicinski
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-12  6:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: Leon Romanovsky, Arjan van de Ven, Cong Wang, Jamal Hadi Salim,
	Jiri Pirko, netdev

From: Leon Romanovsky <leonro@mellanox.com>

In the event of a transmission timeout, drivers are given an opportunity
to recover and continue working after some in-house cleanup.

Such an event can be caused by HW bugs, wrong congestion configuration,
and many other scenarios. In such a case, users want a simple
"NETDEV WATCHDOG ... " print that points to the netdevice in trouble.

The stack dump printed after it was added in commit b4192bbd85d2
("net: Add a WARN_ON_ONCE() to the transmit timeout function") to give
extra information, such as the list of modules and which driver is involved.

While the latter is already printed in "NETDEV WATCHDOG ... ", the list
of modules is rarely needed and can be collected later.

So let's remove the WARN_ONCE() and make dmesg look more user-friendly in
large cluster setups.

[  281.170584] ------------[ cut here ]------------
[  281.197120] NETDEV WATCHDOG: ib1 (mlx4_core): transmit queue 0 timed out
[  281.198521] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x232/0x240
[  281.200259] Modules linked in: bonding ipip tunnel4 geneve ip6_udp_tunnel udp_tunnel ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel mlx4_en ptp pps_core mlx4_ib mlx4_core rdma_ucm ib_uverbs ib_ipoib ib_umad openvswitch nsh xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter overlay ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlx4_core]
[  281.208290] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc5-J14907-G268960df60ee #1
[  281.209954] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  281.212281] RIP: 0010:dev_watchdog+0x232/0x240
[  281.213260] Code: 85 c0 75 e8 eb a5 4c 89 ef c6 05 dd 9c c4 00 01 e8 d3 b6 fb ff 44 89 e1 4c 89 ee 48 c7 c7 40 54 0b 82 48 89 c2 e8 10 f1 a0 ff <0f> 0b eb 86 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 c7 47
[  281.217078] RSP: 0018:ffffc90000003e70 EFLAGS: 00010282
[  281.218210] RAX: 0000000000000000 RBX: ffff8884521c3ce8 RCX: 0000000000000007
[  281.219709] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff88846fc18230
[  281.221206] RBP: ffff88846daad440 R08: 0000000000000000 R09: 0000000000000249
[  281.222697] R10: 0000000000000774 R11: ffffc90000003d25 R12: 0000000000000000
[  281.224202] R13: ffff88846daad000 R14: ffff88846daad440 R15: 0000000000000082
[  281.225733] FS:  0000000000000000(0000) GS:ffff88846fc00000(0000) knlGS:0000000000000000
[  281.227472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  281.228713] CR2: 00007efd12565000 CR3: 000000043cd3a002 CR4: 0000000000160eb0
[  281.230241] Call Trace:
[  281.230900]  <IRQ>
[  281.231469]  ? qdisc_put_unlocked+0x30/0x30
[  281.232437]  call_timer_fn+0x30/0x130
[  281.233300]  run_timer_softirq+0x18b/0x490
[  281.234229]  ? timerqueue_add+0x96/0xb0
[  281.235119]  ? enqueue_hrtimer+0x3d/0x90
[  281.236029]  __do_softirq+0xdf/0x2e5
[  281.236864]  irq_exit+0xa0/0xb0
[  281.237621]  smp_apic_timer_interrupt+0x72/0x120
[  281.238652]  apic_timer_interrupt+0xf/0x20
[  281.239581]  </IRQ>
[  281.240147] RIP: 0010:default_idle+0x2d/0x150
[  281.241133] Code: 00 00 8b 05 3d 75 a7 00 41 54 55 65 8b 2d 6b e0 71 7e 53 85 c0 7f 29 8b 05 c8 97 f7 00 85 c0 7e 07 0f 00 2d 37 56 52 00 fb f4 <8b> 05 15 75 a7 00 65 8b 2d 46 e0 71 7e 85 c0 7f 7f 5b 5d 41 5c c3
[  281.244935] RSP: 0018:ffffffff82203ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  281.246584] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  281.248082] RDX: 000000000010db42 RSI: ffffffff82203e40 RDI: 000000416d8a7440
[  281.249581] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000041770da407
[  281.251069] R10: 0000000000000264 R11: 0000000000000000 R12: ffffffff82211840
[  281.252545] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff82211840
[  281.254036]  do_idle+0x1ee/0x210
[  281.254809]  cpu_startup_entry+0x19/0x20
[  281.255713]  start_kernel+0x490/0x4af
[  281.257577]  secondary_startup_64+0xa4/0xb0
[  281.259147] ---[ end trace 78f566c0214a2cb0 ]---
[  281.260866] ib1: transmit timeout: latency 1120 msecs
[  281.262730] ib1: queue stopped 1, tx_head 167838, tx_tail 167710

Fixes: b4192bbd85d2 ("net: Add a WARN_ON_ONCE() to the transmit timeout function")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
Hi Dave,

This is a new version of the previously sent v0 [1], with the print level
changed as suggested by Jakub and Cong. I'm asking you to reevaluate
your previous decision [2], given that this is a user-triggered bug and
that a very similar change was committed by Linus, "fs/filesystems.c:
downgrade user-reachable WARN_ONCE() to pr_warn_once()", a couple of days
ago [3].

[1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org
[2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net
[3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539
---
 net/sched/sch_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6c9595f1048a..185f03db3d55 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -439,8 +439,9 @@ static void dev_watchdog(struct timer_list *t)

 			if (some_queue_timedout) {
 				trace_net_dev_xmit_timeout(dev, i);
-				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
-				       dev->name, netdev_drivername(dev), i);
+				pr_err_once("NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
+					    dev->name,
+					    netdev_drivername(dev), i);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 			}
 			if (!mod_timer(&dev->watchdog_timer,
--
2.25.2
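
To illustrate the "once" behaviour that WARN_ONCE() and pr_err_once() have
in common, here is a minimal userspace sketch. It is not kernel code:
report_tx_timeout() is a hypothetical name, and the static flag only
approximates what the kernel helpers do, as far as I understand them, with
one flag per call site. The point it tries to show is that "once" means
once per call site, so a later timeout on another queue or another device
prints nothing.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for a "print once" call site.
 * Returns true if the line was actually printed, false if it was
 * suppressed because an earlier call already printed it.
 */
static bool report_tx_timeout(const char *dev, const char *drv,
			      unsigned int queue)
{
	static bool printed;	/* one flag for this call site */

	if (printed)
		return false;
	printed = true;
	fprintf(stderr,
		"NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
		dev, drv, queue);
	return true;
}
```

With this model, only the first call prints; the difference between the
two kernel helpers is what else comes with that first line (a full stack
dump versus a single message).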



* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-12  6:08 [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout Leon Romanovsky
@ 2020-04-12 18:59 ` Jakub Kicinski
  2020-04-12 19:23   ` Leon Romanovsky
  2020-04-13  4:19 ` David Miller
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2020-04-12 18:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Leon Romanovsky, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

On Sun, 12 Apr 2020 09:08:54 +0300 Leon Romanovsky wrote:
> Hi Dave,
> 
> This is a new version of the previously sent v0 [1], with the print level
> changed as suggested by Jakub and Cong. I'm asking you to reevaluate
> your previous decision [2], given that this is a user-triggered bug and
> that a very similar change was committed by Linus, "fs/filesystems.c:
> downgrade user-reachable WARN_ONCE() to pr_warn_once()", a couple of days
> ago [3].
> 
> [1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org
> [2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539

How is it user-triggerable? If there's an IB-specific reason, maybe the IB
netdev should stop implementing ndo_tx_timeout.


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-12 18:59 ` Jakub Kicinski
@ 2020-04-12 19:23   ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-12 19:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Arjan van de Ven, Cong Wang, Jamal Hadi Salim,
	Jiri Pirko, netdev

On Sun, Apr 12, 2020 at 11:59:13AM -0700, Jakub Kicinski wrote:
> On Sun, 12 Apr 2020 09:08:54 +0300 Leon Romanovsky wrote:
> > Hi Dave,
> >
> > This is a new version of the previously sent v0 [1], with the print level
> > changed as suggested by Jakub and Cong. I'm asking you to reevaluate
> > your previous decision [2], given that this is a user-triggered bug and
> > that a very similar change was committed by Linus, "fs/filesystems.c:
> > downgrade user-reachable WARN_ONCE() to pr_warn_once()", a couple of days
> > ago [3].
> >
> > [1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org
> > [2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539
>
> How is it user-triggerable? If there's an IB-specific reason, maybe the IB
> netdev should stop implementing ndo_tx_timeout.

It happens when the device is extremely overloaded with traffic:
internally the HW decreases its performance (a HW bug), which causes
the TX timeouts and the WARN_ON splat.

We don't want to stop implementing ndo_tx_timeout, because it works
right most (if not all) of the time.

If it is very important, I will dig into the internal bug reports to see
the possible reproduction scenarios, but from what I have seen so far,
it is a statistical failure.

And it is not IB specific, but mlx4 specific.

Thanks


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-12  6:08 [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout Leon Romanovsky
  2020-04-12 18:59 ` Jakub Kicinski
@ 2020-04-13  4:19 ` David Miller
  2020-04-13  5:03   ` Leon Romanovsky
  2020-04-13  9:01 ` Jose Abreu
  2020-04-13 17:22 ` Cong Wang
  3 siblings, 1 reply; 12+ messages in thread
From: David Miller @ 2020-04-13  4:19 UTC (permalink / raw)
  To: leon; +Cc: kuba, leonro, arjan, xiyou.wangcong, jhs, jiri, netdev


This is caused by a device "overwhelmed with traffic"?  Sounds like
normal operation to me.

That's a bug, and the driver handling the device with this problem
should adjust how it implements TX timeouts to accommodate this.


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13  4:19 ` David Miller
@ 2020-04-13  5:03   ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-13  5:03 UTC (permalink / raw)
  To: David Miller; +Cc: kuba, arjan, xiyou.wangcong, jhs, jiri, netdev

On Sun, Apr 12, 2020 at 09:19:25PM -0700, David Miller wrote:
>
> This is caused by a device "overwhelmed with traffic"?  Sounds like
> normal operation to me.
>
> That's a bug, and the driver handling the device with this problem
> should adjust how it implements TX timeouts to accommodate this.

Here is the internal bug description; I hope it makes sense.

-----
A timeout may occur if the amount of reported bytes is higher than the queue limit.
In this case, the kernel closes the queue and reopens it only after getting a
completion.

During debugging we saw that in some situations the driver gets a **delayed completion**:
completions arrive after **1 min**, and therefore the amount of queued bytes exceeds the
DQL max size.

As a result, after watchdog_timeo the kernel calls the driver's timeout function,
which prints the timeout to dmesg.

After debugging the issue with FW to understand the root cause of the delayed completions,
we understood that since the IB and the TCP traffic run at the same service level (SL),
the same scheduling queue schedules between all the QPs. In this case, if one of the IB QPs
gets stuck because of congestion, all other QPs (including the TCP QPs) are stuck until the
stuck QP is released.
-----

User separates traffic to different SLs.

Thanks
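
The sequence in the bug description above (bytes exceed the limit, the
queue is closed, a completion reopens it, and the watchdog fires if that
takes longer than watchdog_timeo) can be sketched in plain userspace C.
This is only a rough model under my own assumptions: struct txq_model and
the txq_*() helpers are hypothetical names, time is a plain millisecond
counter, and the real DQL and dev_watchdog() logic is more involved.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of one TX queue guarded by a byte limit. */
struct txq_model {
	unsigned int limit;	/* byte limit (stands in for the DQL max size) */
	unsigned int inflight;	/* bytes handed to HW and not yet completed */
	unsigned long last_tx;	/* time of the last accepted send, in ms */
	bool stopped;		/* true while the stack keeps the queue closed */
};

/* Accept a packet unless the queue is closed; close it once over the limit. */
static bool txq_send(struct txq_model *q, unsigned int bytes, unsigned long now)
{
	if (q->stopped)
		return false;
	q->inflight += bytes;
	q->last_tx = now;
	if (q->inflight > q->limit)
		q->stopped = true;
	return true;
}

/* A completion releases bytes and reopens the queue once within the limit. */
static void txq_complete(struct txq_model *q, unsigned int bytes)
{
	q->inflight -= bytes < q->inflight ? bytes : q->inflight;
	if (q->inflight <= q->limit)
		q->stopped = false;
}

/* Watchdog check: the queue is closed and nothing was sent for too long. */
static bool txq_timed_out(const struct txq_model *q, unsigned long now,
			  unsigned long watchdog_timeo)
{
	return q->stopped && now - q->last_tx > watchdog_timeo;
}
```

In this model a completion that arrives within watchdog_timeo keeps the
watchdog quiet, while one delayed by a minute does not, regardless of how
healthy the queue itself is.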


* RE: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-12  6:08 [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout Leon Romanovsky
  2020-04-12 18:59 ` Jakub Kicinski
  2020-04-13  4:19 ` David Miller
@ 2020-04-13  9:01 ` Jose Abreu
  2020-04-13 10:20   ` Leon Romanovsky
  2020-04-13 17:22 ` Cong Wang
  3 siblings, 1 reply; 12+ messages in thread
From: Jose Abreu @ 2020-04-13  9:01 UTC (permalink / raw)
  To: Leon Romanovsky, David S. Miller, Jakub Kicinski
  Cc: Leon Romanovsky, Arjan van de Ven, Cong Wang, Jamal Hadi Salim,
	Jiri Pirko, netdev

From: Leon Romanovsky <leon@kernel.org>
Date: Apr/12/2020, 07:08:54 (UTC+00:00)

> [  281.170584] ------------[ cut here ]------------

Not objecting to the patch itself (because the stack trace is usually
useless), but just FYI we use this marker in our CI to track timeouts
or crashes. I'm not sure if anyone else is using it.

And actually, can you please explain why BQL is not suppressing your 
timeouts ?

---
Thanks,
Jose Miguel Abreu


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13  9:01 ` Jose Abreu
@ 2020-04-13 10:20   ` Leon Romanovsky
  2020-04-13 10:37     ` Jose Abreu
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-13 10:20 UTC (permalink / raw)
  To: Jose Abreu
  Cc: David S. Miller, Jakub Kicinski, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote:
> From: Leon Romanovsky <leon@kernel.org>
> Date: Apr/12/2020, 07:08:54 (UTC+00:00)
>
> > [  281.170584] ------------[ cut here ]------------
>
> Not objecting to the patch itself (because the stack trace is usually
> useless), but just FYI we use this marker in our CI to track timeouts
> or crashes. I'm not sure if anyone else is using it.

I didn't delete the "NETDEV WATCHDOG .." message, and it will still be
visible as a marker.
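
As a rough illustration of what that means for a log scanner, here is a
userspace sketch of two line matchers. How any particular CI actually
parses dmesg is not described in this thread, so has_warn_marker() and
has_watchdog_marker() are hypothetical helpers; the marker strings are
taken from the log in the patch description.

```c
#include <stdbool.h>
#include <string.h>

/* Matches the generic WARN banner, which goes away with this patch. */
static bool has_warn_marker(const char *line)
{
	return strstr(line, "------------[ cut here ]------------") != NULL;
}

/* Matches the watchdog message itself, which this patch keeps. */
static bool has_watchdog_marker(const char *line)
{
	return strstr(line, "NETDEV WATCHDOG: ") != NULL;
}
```

A scanner keyed on the second matcher should keep working unchanged,
whereas one keyed only on the "cut here" banner would stop seeing TX
timeouts.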

>
> And actually, can you please explain why BQL is not suppressing your
> timeouts ?

The driver can't distinguish between a "real" timeout and a "mixed traffic" timeout,
so we don't want to completely disable the "dev->netdev_ops->ndo_tx_timeout(dev, i);"
call in the watchdog [1]. The goal is to leave the functionality in place and
simply remove the stack trace, to be similar to the other BUG prints in that file [2].

[1] https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c#L444
[2] https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c#L328

>
> ---
> Thanks,
> Jose Miguel Abreu


* RE: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13 10:20   ` Leon Romanovsky
@ 2020-04-13 10:37     ` Jose Abreu
  2020-04-13 10:54       ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Jose Abreu @ 2020-04-13 10:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Jakub Kicinski, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

From: Leon Romanovsky <leon@kernel.org>
Date: Apr/13/2020, 11:20:53 (UTC+00:00)

> On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Date: Apr/12/2020, 07:08:54 (UTC+00:00)
> >
> > > [  281.170584] ------------[ cut here ]------------
> >
> > Not objecting to the patch itself (because the stack trace is usually
> > useless), but just FYI we use this marker in our CI to track timeouts
> > or crashes. I'm not sure if anyone else is using it.
> 
> I didn't delete the "NETDEV WATCHDOG .." message and it will be still
> visible as a marker.
> 
> >
> > And actually, can you please explain why BQL is not suppressing your
> > timeouts ?
> 
> Driver can't distinguish between "real" timeout and "mixed traffic" timeout,

The point is that you should not get any "mixed traffic" timeout if the
driver uses BQL, because the queue will be disabled long before the timeout
happens, based on queue size usage ...

---
Thanks,
Jose Miguel Abreu


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13 10:37     ` Jose Abreu
@ 2020-04-13 10:54       ` Leon Romanovsky
  2020-04-13 11:01         ` Jose Abreu
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-13 10:54 UTC (permalink / raw)
  To: Jose Abreu
  Cc: David S. Miller, Jakub Kicinski, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

On Mon, Apr 13, 2020 at 10:37:24AM +0000, Jose Abreu wrote:
> From: Leon Romanovsky <leon@kernel.org>
> Date: Apr/13/2020, 11:20:53 (UTC+00:00)
>
> > On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Date: Apr/12/2020, 07:08:54 (UTC+00:00)
> > >
> > > > [  281.170584] ------------[ cut here ]------------
> > >
> > > Not objecting to the patch itself (because the stack trace is usually
> > > useless), but just FYI we use this marker in our CI to track timeouts
> > > or crashes. I'm not sure if anyone else is using it.
> >
> > I didn't delete the "NETDEV WATCHDOG .." message and it will be still
> > visible as a marker.
> >
> > >
> > > And actually, can you please explain why BQL is not suppressing your
> > > timeouts ?
> >
> > Driver can't distinguish between "real" timeout and "mixed traffic" timeout,
>
> The point is that you should not get any "mixed traffic" timeout if the
> driver uses BQL, because the queue will be disabled long before the timeout
> happens, based on queue size usage ...

Sorry if I misunderstood you, but you are proposing to count traffic, right?

If so, RDMA traffic bypasses the SW stack and is not visible to the kernel, hence
BQL will count only the ETH portion of that mixed traffic, while the RDMA traffic
is what "blocked" the transmission channel (QP in RDMA terminology).

Thanks
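
To make the accounting gap concrete, here is a small userspace sketch. It
is a toy model under my own assumptions, not a description of the mlx4
hardware: struct shared_sq and both helpers are hypothetical, and the HW
scheduling queue is reduced to two byte counters and a "stuck" flag.

```c
#include <stdbool.h>

/* Hypothetical model of one HW scheduling queue shared by ETH and RDMA. */
struct shared_sq {
	unsigned int eth_pending;	/* bytes sent via the netdev path */
	unsigned int rdma_pending;	/* bytes posted by RDMA QPs (kernel bypass) */
	bool rdma_qp_stuck;		/* one congested RDMA QP blocks the queue */
};

/* What a kernel-side byte counter could know about: only the ETH share. */
static unsigned int kernel_visible_bytes(const struct shared_sq *sq)
{
	return sq->eth_pending;
}

/* Deliver ETH completions; nothing completes while the RDMA QP is stuck. */
static unsigned int eth_complete(struct shared_sq *sq)
{
	unsigned int done;

	if (sq->rdma_qp_stuck)
		return 0;
	done = sq->eth_pending;
	sq->eth_pending = 0;
	return done;
}
```

In this model the kernel-visible count can stay small and well within any
byte limit while completions are still held back, because the bytes that
block the queue are never counted.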

>
> ---
> Thanks,
> Jose Miguel Abreu


* RE: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13 10:54       ` Leon Romanovsky
@ 2020-04-13 11:01         ` Jose Abreu
  2020-04-13 11:25           ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Jose Abreu @ 2020-04-13 11:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Jakub Kicinski, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

From: Leon Romanovsky <leon@kernel.org>
Date: Apr/13/2020, 11:54:08 (UTC+00:00)

> Sorry if I misunderstood you, but you are proposing to count traffic, right?
> 
> If so, RDMA traffic bypasses the SW stack and is not visible to the kernel, hence
> BQL will count only the ETH portion of that mixed traffic, while the RDMA traffic
> is what "blocked" the transmission channel (QP in RDMA terminology).

Sorry, but you don't mention in your commit message that this is RDMA
specific, so that's why I brought up the topic of BQL. Apologies for the
misunderstanding.

---
Thanks,
Jose Miguel Abreu


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-13 11:01         ` Jose Abreu
@ 2020-04-13 11:25           ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2020-04-13 11:25 UTC (permalink / raw)
  To: Jose Abreu
  Cc: David S. Miller, Jakub Kicinski, Arjan van de Ven, Cong Wang,
	Jamal Hadi Salim, Jiri Pirko, netdev

On Mon, Apr 13, 2020 at 11:01:42AM +0000, Jose Abreu wrote:
> From: Leon Romanovsky <leon@kernel.org>
> Date: Apr/13/2020, 11:54:08 (UTC+00:00)
>
> > Sorry if I misunderstood you, but you are proposing to count traffic, right?
> >
> > If so, RDMA traffic bypasses the SW stack and is not visible to the kernel, hence
> > BQL will count only the ETH portion of that mixed traffic, while the RDMA traffic
> > is what "blocked" the transmission channel (QP in RDMA terminology).
>
> Sorry, but you don't mention in your commit message that this is RDMA
> specific, so that's why I brought up the topic of BQL. Apologies for the
> misunderstanding.

No problem, I'm glad that you asked those questions. I hope that my
answers clarify the rationale behind the change from WARN_ONCE() to
pr_err_once().

Thanks

>
> ---
> Thanks,
> Jose Miguel Abreu


* Re: [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout
  2020-04-12  6:08 [PATCH net v1] net/sched: Don't print dump stack in event of transmission timeout Leon Romanovsky
                   ` (2 preceding siblings ...)
  2020-04-13  9:01 ` Jose Abreu
@ 2020-04-13 17:22 ` Cong Wang
  3 siblings, 0 replies; 12+ messages in thread
From: Cong Wang @ 2020-04-13 17:22 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Jakub Kicinski, Leon Romanovsky,
	Arjan van de Ven, Jamal Hadi Salim, Jiri Pirko,
	Linux Kernel Network Developers

On Sat, Apr 11, 2020 at 11:09 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> From: Leon Romanovsky <leonro@mellanox.com>
>
> In the event of a transmission timeout, drivers are given an opportunity
> to recover and continue working after some in-house cleanup.
>
> Such an event can be caused by HW bugs, wrong congestion configuration,
> and many other scenarios. In such a case, users want a simple
> "NETDEV WATCHDOG ... " print that points to the netdevice in trouble.
>
> The stack dump printed after it was added in commit b4192bbd85d2
> ("net: Add a WARN_ON_ONCE() to the transmit timeout function") to give
> extra information, such as the list of modules and which driver is involved.
>
> While the latter is already printed in "NETDEV WATCHDOG ... ", the list
> of modules is rarely needed and can be collected later.
>
> So let's remove the WARN_ONCE() and make dmesg look more user-friendly in
> large cluster setups.
...
>
> Fixes: b4192bbd85d2 ("net: Add a WARN_ON_ONCE() to the transmit timeout function")
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

Thanks.

