All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch net-next] net: add a generic tracepoint for TX queue timeout
@ 2019-04-30 18:50 Cong Wang
  2019-05-01 13:11 ` Eran Ben Elisha
  0 siblings, 1 reply; 4+ messages in thread
From: Cong Wang @ 2019-04-30 18:50 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eran Ben Elisha, Jiri Pirko

Although devlink health report does a nice job on reporting TX
timeout and other NIC errors, unfortunately it requires drivers
to support it but currently only mlx5 has implemented it.
Before other drivers could catch up, it is useful to have a
generic tracepoint to monitor this kind of TX timeout. We have
been suffering TX timeout with different drivers, we plan to
start to monitor it with rasdaemon which just needs a new tracepoint.

Sample output:

  ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0

Cc: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/trace/events/net.h | 23 +++++++++++++++++++++++
 net/sched/sch_generic.c    |  2 ++
 2 files changed, 25 insertions(+)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 1efd7d9b25fe..002d6f04b9e5 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -303,6 +303,29 @@ DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_list_exit,
 	TP_ARGS(ret)
 );
 
+TRACE_EVENT(net_dev_xmit_timeout,
+
+	TP_PROTO(struct net_device *dev,
+		 int queue_index),
+
+	TP_ARGS(dev, queue_index),
+
+	TP_STRUCT__entry(
+		__string(	name,		dev->name	)
+		__string(	driver,		netdev_drivername(dev))
+		__field(	int,		queue_index	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, dev->name);
+		__assign_str(driver, netdev_drivername(dev));
+		__entry->queue_index = queue_index;
+	),
+
+	TP_printk("dev=%s driver=%s queue=%d",
+		__get_str(name), __get_str(driver), __entry->queue_index)
+);
+
 #endif /* _TRACE_NET_H */
 
 /* This part must be outside protection */
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 848aab3693bd..cce1e9ee85af 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -32,6 +32,7 @@
 #include <net/pkt_sched.h>
 #include <net/dst.h>
 #include <trace/events/qdisc.h>
+#include <trace/events/net.h>
 #include <net/xfrm.h>
 
 /* Qdisc to use by default */
@@ -441,6 +442,7 @@ static void dev_watchdog(struct timer_list *t)
 			}
 
 			if (some_queue_timedout) {
+				trace_net_dev_xmit_timeout(dev, i);
 				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
 				       dev->name, netdev_drivername(dev), i);
 				dev->netdev_ops->ndo_tx_timeout(dev);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [Patch net-next] net: add a generic tracepoint for TX queue timeout
  2019-04-30 18:50 [Patch net-next] net: add a generic tracepoint for TX queue timeout Cong Wang
@ 2019-05-01 13:11 ` Eran Ben Elisha
  2019-05-01 20:55   ` Saeed Mahameed
  2019-05-02  0:50   ` Cong Wang
  0 siblings, 2 replies; 4+ messages in thread
From: Eran Ben Elisha @ 2019-05-01 13:11 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Jiri Pirko



On 4/30/2019 9:50 PM, Cong Wang wrote:
> Although devlink health report does a nice job on reporting TX
> timeout and other NIC errors, unfortunately it requires drivers
> to support it but currently only mlx5 has implemented it.

The devlink health was never intended to be the generic mechanism for 
monitoring all driver's TX timeouts notifications. mlx5e driver chose to 
handle TX timeout notification by reporting it to the newly devlink 
health mechanism.

> Before other drivers could catch up, it is useful to have a
> generic tracepoint to monitor this kind of TX timeout. We have
> been suffering TX timeout with different drivers, we plan to
> start to monitor it with rasdaemon which just needs a new tracepoint.

Great idea to suggest a generic trace message that can be monitored over 
all drivers.

> 
> Sample output:
> 
>    ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0
> 
> Cc: Eran Ben Elisha <eranbe@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>   include/trace/events/net.h | 23 +++++++++++++++++++++++
>   net/sched/sch_generic.c    |  2 ++
>   2 files changed, 25 insertions(+)
> 
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> index 1efd7d9b25fe..002d6f04b9e5 100644
> --- a/include/trace/events/net.h
> +++ b/include/trace/events/net.h
> @@ -303,6 +303,29 @@ DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_list_exit,
>   	TP_ARGS(ret)
>   );
>   

I would have put this next to net_dev_xmit trace event declaration.

> +TRACE_EVENT(net_dev_xmit_timeout,
> +
> +	TP_PROTO(struct net_device *dev,
> +		 int queue_index),
> +
> +	TP_ARGS(dev, queue_index),
> +
> +	TP_STRUCT__entry(
> +		__string(	name,		dev->name	)
> +		__string(	driver,		netdev_drivername(dev))
> +		__field(	int,		queue_index	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(name, dev->name);
> +		__assign_str(driver, netdev_drivername(dev));
> +		__entry->queue_index = queue_index;
> +	),
> +
> +	TP_printk("dev=%s driver=%s queue=%d",
> +		__get_str(name), __get_str(driver), __entry->queue_index)
> +);
> +
>   #endif /* _TRACE_NET_H */
>   
>   /* This part must be outside protection */
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 848aab3693bd..cce1e9ee85af 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -32,6 +32,7 @@
>   #include <net/pkt_sched.h>
>   #include <net/dst.h>
>   #include <trace/events/qdisc.h>
> +#include <trace/events/net.h>
>   #include <net/xfrm.h>
>   
>   /* Qdisc to use by default */
> @@ -441,6 +442,7 @@ static void dev_watchdog(struct timer_list *t)
>   			}
>   
>   			if (some_queue_timedout) {
> +				trace_net_dev_xmit_timeout(dev, i);
>   				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>   				       dev->name, netdev_drivername(dev), i);
>   				dev->netdev_ops->ndo_tx_timeout(dev);
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Patch net-next] net: add a generic tracepoint for TX queue timeout
  2019-05-01 13:11 ` Eran Ben Elisha
@ 2019-05-01 20:55   ` Saeed Mahameed
  2019-05-02  0:50   ` Cong Wang
  1 sibling, 0 replies; 4+ messages in thread
From: Saeed Mahameed @ 2019-05-01 20:55 UTC (permalink / raw)
  To: Eran Ben Elisha, netdev, xiyou.wangcong; +Cc: Jiri Pirko

On Wed, 2019-05-01 at 13:11 +0000, Eran Ben Elisha wrote:
> 
> On 4/30/2019 9:50 PM, Cong Wang wrote:
> > Although devlink health report does a nice job on reporting TX
> > timeout and other NIC errors, unfortunately it requires drivers
> > to support it but currently only mlx5 has implemented it.
> 
> The devlink health was never intended to be the generic mechanism
> for 
> monitoring all driver's TX timeouts notifications. mlx5e driver chose
> to 
> handle TX timeout notification by reporting it to the newly devlink 
> health mechanism.
> 
> > Before other drivers could catch up, it is useful to have a
> > generic tracepoint to monitor this kind of TX timeout. We have
> > been suffering TX timeout with different drivers, we plan to
> > start to monitor it with rasdaemon which just needs a new
> > tracepoint.
> 
> Great idea to suggest a generic trace message that can be monitored
> over 
> all drivers.
> 

+1

I would also add skb->xmit_more indication to net_dev_start_xmit
tracepoint (in a separate patch), many tx timeout debugs revealed a
buggy usage of skb->xmit_more, where xmit_more was set but no further
SKBs arrived.

> > Sample output:
> > 
> >    ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout:
> > dev=ens3 driver=e1000 queue=0
> > 
> > Cc: Eran Ben Elisha <eranbe@mellanox.com>
> > Cc: Jiri Pirko <jiri@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >   include/trace/events/net.h | 23 +++++++++++++++++++++++
> >   net/sched/sch_generic.c    |  2 ++
> >   2 files changed, 25 insertions(+)
> > 
> > diff --git a/include/trace/events/net.h
> > b/include/trace/events/net.h
> > index 1efd7d9b25fe..002d6f04b9e5 100644
> > --- a/include/trace/events/net.h
> > +++ b/include/trace/events/net.h
> > @@ -303,6 +303,29 @@ DEFINE_EVENT(net_dev_rx_exit_template,
> > netif_receive_skb_list_exit,
> >   	TP_ARGS(ret)
> >   );
> >   
> 
> I would have put this next to net_dev_xmit trace event declaration.
> 
> > +TRACE_EVENT(net_dev_xmit_timeout,
> > +
> > +	TP_PROTO(struct net_device *dev,
> > +		 int queue_index),
> > +
> > +	TP_ARGS(dev, queue_index),
> > +
> > +	TP_STRUCT__entry(
> > +		__string(	name,		dev->name	)
> > +		__string(	driver,		netdev_drivername(dev))
> > +		__field(	int,		queue_index	)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__assign_str(name, dev->name);
> > +		__assign_str(driver, netdev_drivername(dev));
> > +		__entry->queue_index = queue_index;
> > +	),
> > +
> > +	TP_printk("dev=%s driver=%s queue=%d",
> > +		__get_str(name), __get_str(driver), __entry-
> > >queue_index)
> > +);
> > +
> >   #endif /* _TRACE_NET_H */
> >   
> >   /* This part must be outside protection */
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 848aab3693bd..cce1e9ee85af 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -32,6 +32,7 @@
> >   #include <net/pkt_sched.h>
> >   #include <net/dst.h>
> >   #include <trace/events/qdisc.h>
> > +#include <trace/events/net.h>
> >   #include <net/xfrm.h>
> >   
> >   /* Qdisc to use by default */
> > @@ -441,6 +442,7 @@ static void dev_watchdog(struct timer_list *t)
> >   			}
> >   
> >   			if (some_queue_timedout) {
> > +				trace_net_dev_xmit_timeout(dev, i);
> >   				WARN_ONCE(1, KERN_INFO "NETDEV
> > WATCHDOG: %s (%s): transmit queue %u timed out\n",
> >   				       dev->name,
> > netdev_drivername(dev), i);
> >   				dev->netdev_ops->ndo_tx_timeout(dev);
> > 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Patch net-next] net: add a generic tracepoint for TX queue timeout
  2019-05-01 13:11 ` Eran Ben Elisha
  2019-05-01 20:55   ` Saeed Mahameed
@ 2019-05-02  0:50   ` Cong Wang
  1 sibling, 0 replies; 4+ messages in thread
From: Cong Wang @ 2019-05-02  0:50 UTC (permalink / raw)
  To: Eran Ben Elisha; +Cc: netdev, Jiri Pirko

On Wed, May 1, 2019 at 6:11 AM Eran Ben Elisha <eranbe@mellanox.com> wrote:
>
>
>
> On 4/30/2019 9:50 PM, Cong Wang wrote:
> > Although devlink health report does a nice job on reporting TX
> > timeout and other NIC errors, unfortunately it requires drivers
> > to support it but currently only mlx5 has implemented it.
>
> The devlink health was never intended to be the generic mechanism for
> monitoring all driver's TX timeouts notifications. mlx5e driver chose to
> handle TX timeout notification by reporting it to the newly devlink
> health mechanism.

Understood.

>
> > Before other drivers could catch up, it is useful to have a
> > generic tracepoint to monitor this kind of TX timeout. We have
> > been suffering TX timeout with different drivers, we plan to
> > start to monitor it with rasdaemon which just needs a new tracepoint.
>
> Great idea to suggest a generic trace message that can be monitored over
> all drivers.
>
> >
> > Sample output:
> >
> >    ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0
> >
> > Cc: Eran Ben Elisha <eranbe@mellanox.com>
> > Cc: Jiri Pirko <jiri@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >   include/trace/events/net.h | 23 +++++++++++++++++++++++
> >   net/sched/sch_generic.c    |  2 ++
> >   2 files changed, 25 insertions(+)
> >
> > diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> > index 1efd7d9b25fe..002d6f04b9e5 100644
> > --- a/include/trace/events/net.h
> > +++ b/include/trace/events/net.h
> > @@ -303,6 +303,29 @@ DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_list_exit,
> >       TP_ARGS(ret)
> >   );
> >
>
> I would have put this next to net_dev_xmit trace event declaration.
>

Sounds reasonable, it would be slightly easier to find it.
I will send v2.

Thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-05-02  0:50 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-30 18:50 [Patch net-next] net: add a generic tracepoint for TX queue timeout Cong Wang
2019-05-01 13:11 ` Eran Ben Elisha
2019-05-01 20:55   ` Saeed Mahameed
2019-05-02  0:50   ` Cong Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.