All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] ipoib: Add ratelimited version of ipoib_warn
@ 2016-07-26 14:38 Nikolay Borisov
       [not found] ` <1469543929-17659-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2016-07-26 14:38 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	Nikolay Borisov

In certain cases it's possible to be flooded by warning messages. To
cope with such situation add ratelimited version of ipoib_warn -
ipoib_warn_rl. Current implementation is done in such a way so that
various messages do not skew the flood detection of one another e.g.
each is being governed by a separate ratelimit structure.

Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3ede10309754..dc1bcb32b7a1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -763,8 +763,19 @@ static inline void ipoib_unregister_debugfs(void) { }
 
 #define ipoib_printk(level, priv, format, arg...)	\
 	printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg)
+#define ipoib_printk_warn_ratelimit(priv, fmt, args...)		\
+do {								\
+	static DEFINE_RATELIMIT_STATE(_rs,			\
+		DEFAULT_RATELIMIT_INTERVAL,			\
+		DEFAULT_RATELIMIT_BURST);			\
+	if (__ratelimit(&_rs))					\
+		ipoib_printk(KERN_WARNING, priv, fmt, ##args);	\
+} while (0)
+
 #define ipoib_warn(priv, format, arg...)		\
 	ipoib_printk(KERN_WARNING, priv, format , ## arg)
+#define ipoib_warn_rl(priv, fmt, args...)		\
+	ipoib_printk_warn_ratelimit(priv, fmt , args...)
 
 extern int ipoib_sendq_size;
 extern int ipoib_recvq_size;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found] ` <1469543929-17659-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
@ 2016-07-26 14:38   ` Nikolay Borisov
       [not found]     ` <1469543929-17659-2-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2016-07-26 14:38 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	Nikolay Borisov

In case an infiniband outtage occurs the messages modified
in this patchset can flood the host with a rate of 1 message per
ms which is a lot. Using the ratelimited version of ipoib_warn
fixes the issue by limiting at most 10 messages in 5 seconds.

Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7d3281866ffc..dd5b4afbc00b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
+	ipoib_warn_rl(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));
-	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
+	ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
 		   netif_queue_stopped(dev),
 		   priv->tx_head, priv->tx_tail);
 	/* XXX reset QP, etc. */
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]     ` <1469543929-17659-2-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
@ 2016-07-26 18:00       ` Yuval Shaia
       [not found]         ` <20160726180021.GA6144-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Yuval Shaia @ 2016-07-26 18:00 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
> In case an infiniband outtage occurs the messages modified
> in this patchset can flood the host with a rate of 1 message per

s/patchset/patch

> ms which is a lot. Using the ratelimited version of ipoib_warn
> fixes the issue by limiting at most 10 messages in 5 seconds.
> 
> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 7d3281866ffc..dd5b4afbc00b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  
> -	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
> +	ipoib_warn_rl(priv, "transmit timeout: latency %d msecs\n",
>  		   jiffies_to_msecs(jiffies - dev->trans_start));
> -	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
> +	ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",

So this is just to hide the hide the problem, the real question is what is
causing this to happen a lot.

>  		   netif_queue_stopped(dev),
>  		   priv->tx_head, priv->tx_tail);
>  	/* XXX reset QP, etc. */
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]         ` <20160726180021.GA6144-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>
@ 2016-07-26 22:02           ` Nikolay Borisov
       [not found]             ` <CAJFSNy5RA_PVxz0oPdKamF7Kc+1LFDZ9P0LbQH6EHuvqJO18xQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-08-03 14:26           ` Doug Ledford
  1 sibling, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2016-07-26 22:02 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: Nikolay Borisov, Doug Ledford, Sean Hefty,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

On Tue, Jul 26, 2016 at 9:00 PM, Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
>> In case an infiniband outtage occurs the messages modified
>> in this patchset can flood the host with a rate of 1 message per
>
> s/patchset/patch
>
>> ms which is a lot. Using the ratelimited version of ipoib_warn
>> fixes the issue by limiting at most 10 messages in 5 seconds.
>>
>> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
>> ---
>>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> index 7d3281866ffc..dd5b4afbc00b 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device *dev)
>>  {
>>       struct ipoib_dev_priv *priv = netdev_priv(dev);
>>
>> -     ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
>> +     ipoib_warn_rl(priv, "transmit timeout: latency %d msecs\n",
>>                  jiffies_to_msecs(jiffies - dev->trans_start));
>> -     ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
>> +     ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
>
> So this is just to hide the hide the problem, the real question is what is
> causing this to happen a lot.

So the answer to that should be "yes and no". So it's true that the
underlying bug that's causing those symptoms is not fixed, but it
seems no one has an idea how to debug it based on past reports I have
made:

http://www.spinics.net/lists/linux-rdma/msg34577.html
http://www.spinics.net/lists/linux-rdma/msg37011.html (Looking back it
seems you have also replied to one of the threads:
http://www.spinics.net/lists/linux-rdma/msg37022.html)
http://thread.gmane.org/gmane.linux.drivers.rdma/38899

I'm currently have 10s of servers experiencing this and I have no idea
what's causing, the excessive print is just one of the symptoms. While
I will be totally glad to find the root cause and troubleshoot it so
far it seems to elude me. In any case I believe those patches on their
own do make sense.

>
>>                  netif_queue_stopped(dev),
>>                  priv->tx_head, priv->tx_tail);
>>       /* XXX reset QP, etc. */
>> --
>> 2.5.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]             ` <CAJFSNy5RA_PVxz0oPdKamF7Kc+1LFDZ9P0LbQH6EHuvqJO18xQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-08-01  8:11               ` Erez Shitrit
  0 siblings, 0 replies; 10+ messages in thread
From: Erez Shitrit @ 2016-08-01  8:11 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Yuval Shaia, Nikolay Borisov, Doug Ledford, Sean Hefty,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Hal Rosenstock,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	Jason Gunthorpe

On Wed, Jul 27, 2016 at 1:02 AM, Nikolay Borisov
<n.borisov-/eCPMmvKun9pLGFMi4vTTA@public.gmane.org> wrote:
> On Tue, Jul 26, 2016 at 9:00 PM, Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
>>> In case an infiniband outtage occurs the messages modified
>>> in this patchset can flood the host with a rate of 1 message per
>>
>> s/patchset/patch
>>
>>> ms which is a lot. Using the ratelimited version of ipoib_warn
>>> fixes the issue by limiting at most 10 messages in 5 seconds.
>>>
>>> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
>>> ---
>>>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> index 7d3281866ffc..dd5b4afbc00b 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device *dev)
>>>  {
>>>       struct ipoib_dev_priv *priv = netdev_priv(dev);
>>>
>>> -     ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
>>> +     ipoib_warn_rl(priv, "transmit timeout: latency %d msecs\n",
>>>                  jiffies_to_msecs(jiffies - dev->trans_start));
>>> -     ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
>>> +     ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
>>
>> So this is just to hide the hide the problem, the real question is what is
>> causing this to happen a lot.
>
> So the answer to that should be "yes and no". So it's true that the
> underlying bug that's causing those symptoms is not fixed, but it
> seems no one has an idea how to debug it based on past reports I have
> made:
>
> http://www.spinics.net/lists/linux-rdma/msg34577.html
> http://www.spinics.net/lists/linux-rdma/msg37011.html (Looking back it
> seems you have also replied to one of the threads:
> http://www.spinics.net/lists/linux-rdma/msg37022.html)
> http://thread.gmane.org/gmane.linux.drivers.rdma/38899
>
> I'm currently have 10s of servers experiencing this and I have no idea
> what's causing, the excessive print is just one of the symptoms. While
> I will be totally glad to find the root cause and troubleshoot it so
> far it seems to elude me. In any case I believe those patches on their
> own do make sense.

As i wrote you in previous mail, there is more to debug here, and not
only the SW.

>
>>
>>>                  netif_queue_stopped(dev),
>>>                  priv->tx_head, priv->tx_tail);
>>>       /* XXX reset QP, etc. */
>>> --
>>> 2.5.0
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]         ` <20160726180021.GA6144-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>
  2016-07-26 22:02           ` Nikolay Borisov
@ 2016-08-03 14:26           ` Doug Ledford
       [not found]             ` <1470234370.18081.63.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Doug Ledford @ 2016-08-03 14:26 UTC (permalink / raw)
  To: Yuval Shaia, Nikolay Borisov
  Cc: sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

[-- Attachment #1: Type: text/plain, Size: 3658 bytes --]

On Tue, 2016-07-26 at 21:00 +0300, Yuval Shaia wrote:
> On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
> > 
> > In case an infiniband outtage occurs the messages modified
> > in this patchset can flood the host with a rate of 1 message per
> 
> s/patchset/patch
> 
> > 
> > ms which is a lot. Using the ratelimited version of ipoib_warn
> > fixes the issue by limiting at most 10 messages in 5 seconds.
> > 
> > Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
> > ---
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > index 7d3281866ffc..dd5b4afbc00b 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device
> > *dev)
> >  {
> >  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> >  
> > -	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
> > +	ipoib_warn_rl(priv, "transmit timeout: latency %d
> > msecs\n",
> >  		   jiffies_to_msecs(jiffies - dev->trans_start));
> > -	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail
> > %u\n",
> > +	ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail
> > %u\n",
> 
> So this is just to hide the hide the problem,

No, that's not true.  To hide the problem, he would have to remove the
printk entirely.  Linus absolutely despises when programmers use
BUG_ON() for situations where it is possible for the system to
continue, albeit with reduced capability or what not.  Similarly, it is
entirely unacceptable that a simple issue, such as a hardware queue
stoppage on a network device, can render a system mostly or entirely or
even just somewhat unusable due to a flood of kernel printk messages.

Just so we're clear, I absolutely despise printk floods.  There is
never a valid reason for them.  If someone in the kernel community
suggested that we make the default printk implementation be flood
proof, I would support it.  In the meantime, I'm all for these patches.
 And if there are other places in the ipoib code that can cause floods
too, I would like to see them switched over to use the rate limited
version too.

Now that I think about it, I'm just as likely to be receptive to a
patch that makes the default ipoib_warn() always be rate limited, and
if there are exceptions in the ipoib stack where we truly want every
message, then change just those places to a different
ipoib_warn_no_rate_limit() implementation.  I would probably modify the
default rate though...10 in 5 seconds is probably too low for some of
the messages, as we can legitimately get a burst of 50 in less than a
second when doing things like joining lots of multicast groups and
there is an issue, or something like that.  Maybe 100 in 60 to allow
for bursts but to still prevent runaway printks like you are having?

>  the real question is what is
> causing this to happen a lot.
> 
> > 
> >  		   netif_queue_stopped(dev),
> >  		   priv->tx_head, priv->tx_tail);
> >  	/* XXX reset QP, etc. */
> > -- 
> > 2.5.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-
> > rdma" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]             ` <1470234370.18081.63.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-08-03 14:32               ` Nikolay Borisov
       [not found]                 ` <57A2006B.5070304-6AxghH7DbtA@public.gmane.org>
  2016-08-08 14:14               ` [PATCH] ipoib: Make ipoib_warn ratelimited Nikolay Borisov
  1 sibling, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2016-08-03 14:32 UTC (permalink / raw)
  To: Doug Ledford, Yuval Shaia
  Cc: sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/



On 08/03/2016 05:26 PM, Doug Ledford wrote:
> On Tue, 2016-07-26 at 21:00 +0300, Yuval Shaia wrote:
>> On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
>>>
>>> In case an infiniband outtage occurs the messages modified
>>> in this patchset can flood the host with a rate of 1 message per
>>
>> s/patchset/patch
>>
>>>
>>> ms which is a lot. Using the ratelimited version of ipoib_warn
>>> fixes the issue by limiting at most 10 messages in 5 seconds.
>>>
>>> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
>>> ---
>>>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> index 7d3281866ffc..dd5b4afbc00b 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>>> @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct net_device
>>> *dev)
>>>  {
>>>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>>  
>>> -	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
>>> +	ipoib_warn_rl(priv, "transmit timeout: latency %d
>>> msecs\n",
>>>  		   jiffies_to_msecs(jiffies - dev->trans_start));
>>> -	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail
>>> %u\n",
>>> +	ipoib_warn_rl(priv, "queue stopped %d, tx_head %u, tx_tail
>>> %u\n",
>>
>> So this is just to hide the hide the problem,
> 
> No, that's not true.  To hide the problem, he would have to remove the
> printk entirely.  Linus absolutely despises when programmers use
> BUG_ON() for situations where it is possible for the system to
> continue, albeit with reduced capability or what not.  Similarly, it is
> entirely unacceptable that a simple issue, such as a hardware queue
> stoppage on a network device, can render a system mostly or entirely or
> even just somewhat unusable due to a flood of kernel printk messages.
> 
> Just so we're clear, I absolutely despise printk floods.  There is
> never a valid reason for them.  If someone in the kernel community
> suggested that we make the default printk implementation be flood
> proof, I would support it.  In the meantime, I'm all for these patches.
>  And if there are other places in the ipoib code that can cause floods
> too, I would like to see them switched over to use the rate limited
> version too.
> 
> Now that I think about it, I'm just as likely to be receptive to a
> patch that makes the default ipoib_warn() always be rate limited, and
> if there are exceptions in the ipoib stack where we truly want every

The thing with this approach is that it's entirely possible to miss
messages if there is one particular ratelimit state.

> message, then change just those places to a different
> ipoib_warn_no_rate_limit() implementation.  I would probably modify the
> default rate though...10 in 5 seconds is probably too low for some of
> the messages, as we can legitimately get a burst of 50 in less than a
> second when doing things like joining lots of multicast groups and
> there is an issue, or something like that.  Maybe 100 in 60 to allow
> for bursts but to still prevent runaway printks like you are having?
> 
If you take a closer look at the definition of ipoib_warn_rl you would
see that it's enclosed in it's own {} block, meaning its ratelimit state
is going to apply to this particular place. In essence only the
"transmit timeout: latency msec" message would be rate limited to 10
messages in 5 seconds, which I believe is fine. In a flood situation
you'd get 10 identical messages and then there will be a line saying how
many were suppressed, indicating a flood.


>>  the real question is what is
>> causing this to happen a lot.
>>
>>>
>>>  		   netif_queue_stopped(dev),
>>>  		   priv->tx_head, priv->tx_tail);
>>>  	/* XXX reset QP, etc. */
>>> -- 
>>> 2.5.0
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-
>>> rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] ipoib: Ratelimit messages which can flood a host.
       [not found]                 ` <57A2006B.5070304-6AxghH7DbtA@public.gmane.org>
@ 2016-08-03 16:25                   ` Doug Ledford
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Ledford @ 2016-08-03 16:25 UTC (permalink / raw)
  To: Nikolay Borisov, Yuval Shaia
  Cc: sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, guysh-VPRAkNaXOzVWk0Htik3J/w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/

[-- Attachment #1: Type: text/plain, Size: 5432 bytes --]

On Wed, 2016-08-03 at 17:32 +0300, Nikolay Borisov wrote:
> 
> On 08/03/2016 05:26 PM, Doug Ledford wrote:
> > 
> > On Tue, 2016-07-26 at 21:00 +0300, Yuval Shaia wrote:
> > > 
> > > On Tue, Jul 26, 2016 at 05:38:49PM +0300, Nikolay Borisov wrote:
> > > > 
> > > > 
> > > > In case an infiniband outtage occurs the messages modified
> > > > in this patchset can flood the host with a rate of 1 message
> > > > per
> > > 
> > > s/patchset/patch
> > > 
> > > > 
> > > > 
> > > > ms which is a lot. Using the ratelimited version of ipoib_warn
> > > > fixes the issue by limiting at most 10 messages in 5 seconds.
> > > > 
> > > > Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
> > > > ---
> > > >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > index 7d3281866ffc..dd5b4afbc00b 100644
> > > > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > > @@ -1034,9 +1034,9 @@ static void ipoib_timeout(struct
> > > > net_device
> > > > *dev)
> > > >  {
> > > >  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > > >  
> > > > -	ipoib_warn(priv, "transmit timeout: latency %d
> > > > msecs\n",
> > > > +	ipoib_warn_rl(priv, "transmit timeout: latency %d
> > > > msecs\n",
> > > >  		   jiffies_to_msecs(jiffies - dev-
> > > > >trans_start));
> > > > -	ipoib_warn(priv, "queue stopped %d, tx_head %u,
> > > > tx_tail
> > > > %u\n",
> > > > +	ipoib_warn_rl(priv, "queue stopped %d, tx_head %u,
> > > > tx_tail
> > > > %u\n",
> > > 
> > > So this is just to hide the hide the problem,
> > 
> > No, that's not true.  To hide the problem, he would have to remove
> > the
> > printk entirely.  Linus absolutely despises when programmers use
> > BUG_ON() for situations where it is possible for the system to
> > continue, albeit with reduced capability or what not.  Similarly,
> > it is
> > entirely unacceptable that a simple issue, such as a hardware queue
> > stoppage on a network device, can render a system mostly or
> > entirely or
> > even just somewhat unusable due to a flood of kernel printk
> > messages.
> > 
> > Just so we're clear, I absolutely despise printk floods.  There is
> > never a valid reason for them.  If someone in the kernel community
> > suggested that we make the default printk implementation be flood
> > proof, I would support it.  In the meantime, I'm all for these
> > patches.
> >  And if there are other places in the ipoib code that can cause
> > floods
> > too, I would like to see them switched over to use the rate limited
> > version too.
> > 
> > Now that I think about it, I'm just as likely to be receptive to a
> > patch that makes the default ipoib_warn() always be rate limited,
> > and
> > if there are exceptions in the ipoib stack where we truly want
> > every
> 
> The thing with this approach is that it's entirely possible to miss
> messages if there is one particular ratelimit state.
> 
> > 
> > message, then change just those places to a different
> > ipoib_warn_no_rate_limit() implementation.  I would probably modify
> > the
> > default rate though...10 in 5 seconds is probably too low for some
> > of
> > the messages, as we can legitimately get a burst of 50 in less than
> > a
> > second when doing things like joining lots of multicast groups and
> > there is an issue, or something like that.  Maybe 100 in 60 to
> > allow
> > for bursts but to still prevent runaway printks like you are
> > having?
> > 
> If you take a closer look at the definition of ipoib_warn_rl you
> would
> see that it's enclosed in it's own {} block, meaning its ratelimit
> state
> is going to apply to this particular place. In essence only the
> "transmit timeout: latency msec" message would be rate limited to 10
> messages in 5 seconds, which I believe is fine. In a flood situation
> you'd get 10 identical messages and then there will be a line saying
> how
> many were suppressed, indicating a flood.

I saw that.  You're missing my point.  I was referring to keeping the
individual {} block in the define of ipoib_warn() instead of creating a
new ipoib_warn_rl() define, but I was commenting that some of the
messages in the ipoib file really need a higher burst, and since even
though each of these messages would have their own rate limit state,
they still all share a common rate limit config, the config would need
to be bumped up for some of the spammier message spots in ipoib.

> 
> > 
> > > 
> > >  the real question is what is
> > > causing this to happen a lot.
> > > 
> > > > 
> > > > 
> > > >  		   netif_queue_stopped(dev),
> > > >  		   priv->tx_head, priv->tx_tail);
> > > >  	/* XXX reset QP, etc. */
> > > > -- 
> > > > 2.5.0
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > > linux-
> > > > rdma" in
> > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.h
> > > > tml
> > 

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] ipoib: Make ipoib_warn ratelimited
       [not found]             ` <1470234370.18081.63.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-08-03 14:32               ` Nikolay Borisov
@ 2016-08-08 14:14               ` Nikolay Borisov
       [not found]                 ` <1470665662-24028-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2016-08-08 14:14 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Nikolay Borisov

In certain cases it's possible to be flooded by warning messages. To
cope with such situations make the ipoib_warn macro be ratelimited.
To prevent accidental limiting of legitimate, bursty messages make
the limit fairly liberal by allowing up to 100 messages in 10 seconds.

Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3ede10309754..1f02ccbdccd0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -764,7 +764,13 @@ static inline void ipoib_unregister_debugfs(void) { }
 #define ipoib_printk(level, priv, format, arg...)	\
 	printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg)
 #define ipoib_warn(priv, format, arg...)		\
-	ipoib_printk(KERN_WARNING, priv, format , ## arg)
+do {							\
+	static DEFINE_RATELIMIT_STATE(_rs,		\
+		10 * HZ /*10 seconds */,		\
+		100);		\
+	if (__ratelimit(&_rs))				\
+		ipoib_printk(KERN_WARNING, priv, format , ## arg);\
+} while (0)
 
 extern int ipoib_sendq_size;
 extern int ipoib_recvq_size;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] ipoib: Make ipoib_warn ratelimited
       [not found]                 ` <1470665662-24028-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
@ 2016-08-25 14:24                   ` Doug Ledford
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Ledford @ 2016-08-25 14:24 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA


[-- Attachment #1.1: Type: text/plain, Size: 551 bytes --]

On 8/8/2016 10:14 AM, Nikolay Borisov wrote:
> In certain cases it's possible to be flooded by warning messages. To
> cope with such situations make the ipoib_warn macro be ratelimited.
> To prevent accidental limiting of legitimate, bursty messages make
> the limit fairly liberal by allowing up to 100 messages in 10 seconds.
> 
> Signed-off-by: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>

Thanks, applied to my for-next area.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-08-25 14:24 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-26 14:38 [PATCH 1/2] ipoib: Add ratelimited version of ipoib_warn Nikolay Borisov
     [not found] ` <1469543929-17659-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
2016-07-26 14:38   ` [PATCH 2/2] ipoib: Ratelimit messages which can flood a host Nikolay Borisov
     [not found]     ` <1469543929-17659-2-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
2016-07-26 18:00       ` Yuval Shaia
     [not found]         ` <20160726180021.GA6144-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>
2016-07-26 22:02           ` Nikolay Borisov
     [not found]             ` <CAJFSNy5RA_PVxz0oPdKamF7Kc+1LFDZ9P0LbQH6EHuvqJO18xQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-01  8:11               ` Erez Shitrit
2016-08-03 14:26           ` Doug Ledford
     [not found]             ` <1470234370.18081.63.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-08-03 14:32               ` Nikolay Borisov
     [not found]                 ` <57A2006B.5070304-6AxghH7DbtA@public.gmane.org>
2016-08-03 16:25                   ` Doug Ledford
2016-08-08 14:14               ` [PATCH] ipoib: Make ipoib_warn ratelimited Nikolay Borisov
     [not found]                 ` <1470665662-24028-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
2016-08-25 14:24                   ` Doug Ledford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.