* 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed
@ 2024-04-17 17:37 Tom, Deepak Abraham
  2024-04-17 22:33 ` Stephen Hemminger
  0 siblings, 1 reply; 4+ messages in thread
From: Tom, Deepak Abraham @ 2024-04-17 17:37 UTC
  To: netdev

Hi!

I have a system configured with two physical Ethernet interfaces connected to a switch.
When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart, although both links actually go down almost simultaneously!
The subsequent RTM_NEWLINK notifications when the switch comes back up are, however, only a few microseconds apart, which is as expected.
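
A minimal sketch of one way to observe these timestamps from user space (the rtnetlink socket setup is standard; the program itself is only illustrative and keys off IFF_RUNNING rather than decoding the full IFLA_OPERSTATE attribute):

/* Illustrative sketch: timestamp RTM_NEWLINK notifications.
 * Build (hypothetical name): cc -o linkmon linkmon.c
 * Subscribes to the standard rtnetlink link multicast group; for
 * brevity it reports IFF_RUNNING instead of decoding IFLA_OPERSTATE. */
#include <stdio.h>
#include <time.h>
#include <net/if.h>              /* IFF_RUNNING */
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
        struct sockaddr_nl addr = {
                .nl_family = AF_NETLINK,
                .nl_groups = RTMGRP_LINK,  /* link state notifications */
        };
        char buf[8192];
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("netlink");
                return 1;
        }

        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);
                struct nlmsghdr *nh;
                struct timespec ts;

                if (len < 0)
                        break;
                clock_gettime(CLOCK_MONOTONIC, &ts);
                for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
                     nh = NLMSG_NEXT(nh, len)) {
                        struct ifinfomsg *ifi = NLMSG_DATA(nh);

                        if (nh->nlmsg_type != RTM_NEWLINK)
                                continue;
                        printf("[%ld.%06ld] RTM_NEWLINK ifindex=%d %s\n",
                               (long)ts.tv_sec, ts.tv_nsec / 1000,
                               ifi->ifi_index,
                               (ifi->ifi_flags & IFF_RUNNING) ?
                                        "running" : "not running");
                }
        }
        return 0;
}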

It turns out this delay is intentionally introduced by the Linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
         /*
          * Limit the number of linkwatch events to one
          * per second so that a runaway driver does not
          * cause a storm of messages on the netlink
          * socket.  This limit does not apply to up events
          * while the device qdisc is down.
          */


On modern high-performance systems, limiting the number of down events to just one per second has far-reaching consequences.
I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms plus scheduling delay in practice):
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -130,8 +130,8 @@ static void linkwatch_schedule_work(int urgent)
                delay = 0;
        }

-       /* If we wrap around we'll delay it by at most HZ. */
-       if (delay > HZ)
+       /* If we wrap around we'll delay it by at most HZ/200. */
+       if (delay > (HZ/200))
                delay = 0;

        /*
@@ -187,15 +187,15 @@ static void __linkwatch_run_queue(int urgent_only)

        /*
         * Limit the number of linkwatch events to one
-        * per second so that a runaway driver does not
+        * per 5 milliseconds so that a runaway driver does not
         * cause a storm of messages on the netlink
         * socket.  This limit does not apply to up events
         * while the device qdisc is down.
         */
        if (!urgent_only)
-               linkwatch_nextevent = jiffies + HZ;
+               linkwatch_nextevent = jiffies + (HZ/200);
        /* Limit wrap-around effect on delay. */
-       else if (time_after(linkwatch_nextevent, jiffies + HZ))
+       else if (time_after(linkwatch_nextevent, jiffies + (HZ/200)))
                linkwatch_nextevent = jiffies;

        clear_bit(LW_URGENT, &linkwatch_flags);
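

As a quick sanity check on the arithmetic (plain user-space C, purely illustrative, not kernel code): HZ/200 is integer division, so the effective delay depends on CONFIG_HZ. With HZ=1000 it is exactly 5ms, but with HZ=100 it rounds down to zero jiffies, i.e. no limit at all; the kernel's msecs_to_jiffies(5) would avoid that rounding.

/* Illustrative check of the HZ/200 rounding (ordinary C, not
 * kernel code): for HZ below 200, integer division yields zero
 * jiffies, i.e. no rate limit at all. */
#include <stdio.h>

int main(void)
{
        const int hz_values[] = { 100, 250, 300, 1000 }; /* common CONFIG_HZ */
        unsigned int i;

        for (i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++) {
                int hz = hz_values[i];
                int jiffies = hz / 200;

                printf("HZ=%4d: HZ/200 = %d jiffies = %d ms\n",
                       hz, jiffies, jiffies * 1000 / hz);
        }
        return 0;
}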


I have tested this change in my environment, and it works as expected. I don't see any new issues popping up because of this.

Are there any concerns with making this change today? Hoping to get some feedback.


Thank You,
Deepak Abraham Tom


* Re: 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed
  2024-04-17 17:37 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed Tom, Deepak Abraham
@ 2024-04-17 22:33 ` Stephen Hemminger
  2024-04-18 19:26   ` Tom, Deepak Abraham
  0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2024-04-17 22:33 UTC
  To: Tom, Deepak Abraham; +Cc: netdev

On Wed, 17 Apr 2024 17:37:40 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Hi!
> 
> I have a system configured with two physical Ethernet interfaces connected to a switch.
> When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart, although both links actually go down almost simultaneously!
> The subsequent RTM_NEWLINK notifications when the switch comes back up are, however, only a few microseconds apart, which is as expected.
> 
> It turns out this delay is intentionally introduced by the Linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
>          /*
>           * Limit the number of linkwatch events to one
>           * per second so that a runaway driver does not
>           * cause a storm of messages on the netlink
>           * socket.  This limit does not apply to up events
>           * while the device qdisc is down.
>           */
> 
> 
> On modern high-performance systems, limiting the number of down events to just one per second has far-reaching consequences.
> I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms plus scheduling delay in practice):

The reason is that for systems connected to the Internet with routing daemons,
the impact of a link state change is huge. A single link transition may keep FRR (née Quagga)
busy for several seconds as it linearly evaluates 3 million route entries. Maybe more recent
versions of FRR have gotten smarter. This is also to avoid the routing daemon propagating
lots of changes, a.k.a. route flap.


* RE: 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed
  2024-04-17 22:33 ` Stephen Hemminger
@ 2024-04-18 19:26   ` Tom, Deepak Abraham
  2024-04-19 16:31     ` Stephen Hemminger
  0 siblings, 1 reply; 4+ messages in thread
From: Tom, Deepak Abraham @ 2024-04-18 19:26 UTC
  To: Stephen Hemminger; +Cc: netdev

Maybe I'm missing something, but could you please explain how this really helps avoid keeping FRR busy?
If I understood this right, the link watch code does not ignore events but merely delays them, so any link transition will be propagated whether it is scheduled urgently or not.
So FRR will still have to deal with each transition, keeping it busy with or without this change, unless FRR dampens flaps on its own?

Also, from a design perspective, would it be better if FRR's issues with route flaps were dealt with directly in FRR code itself? That way, in use cases where FRR does not come into play, such a delay would not cause other consequences. Are there other situations where such a delay is absolutely required?
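
To make that concrete, a hypothetical sketch of user-space damping (illustrative only; none of this is actual FRR code, and the names and holddown value are made up): accept at most one link event per interface per holddown window, so a flapping port triggers one recomputation instead of many. A real daemon would also re-read the final link state once the window expires.

/* Hypothetical flap-damping helper (illustrative, not FRR code):
 * accept at most one link event per interface per holddown window.
 * A real daemon would also re-read the final link state when the
 * window expires. */
#include <stdbool.h>
#include <time.h>

#define HOLDDOWN_MS 1000
#define MAX_IFINDEX 256

static long long last_event_ms[MAX_IFINDEX];

static long long now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

/* Return true if the caller should act on this link event now. */
bool link_event_accept(int ifindex)
{
        long long now = now_ms();

        if (ifindex < 0 || ifindex >= MAX_IFINDEX)
                return true;    /* no state tracked: fail open */
        if (now - last_event_ms[ifindex] < HOLDDOWN_MS)
                return false;   /* within holddown: dampen */
        last_event_ms[ifindex] = now;
        return true;
}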

Thank You,
Deepak Abraham Tom

-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: Wednesday, April 17, 2024 4:34 PM
To: Tom, Deepak Abraham <deepak-abraham.tom@hpe.com>
Cc: netdev@vger.kernel.org
Subject: Re: 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed

On Wed, 17 Apr 2024 17:37:40 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Hi!
> 
> I have a system configured with two physical Ethernet interfaces connected to a switch.
> When I reboot the switch, I see that the userspace RTM_NEWLINK notifications for the interfaces are always 1 second apart, although both links actually go down almost simultaneously!
> The subsequent RTM_NEWLINK notifications when the switch comes back up are, however, only a few microseconds apart, which is as expected.
> 
> It turns out this delay is intentionally introduced by the Linux kernel networking code in net/core/link_watch.c, last modified 17 years ago in commit 294cc44:
>          /*
>           * Limit the number of linkwatch events to one
>           * per second so that a runaway driver does not
>           * cause a storm of messages on the netlink
>           * socket.  This limit does not apply to up events
>           * while the device qdisc is down.
>           */
> 
> 
> On modern high-performance systems, limiting the number of down events to just one per second has far-reaching consequences.
> I was wondering if it would be advisable to reduce this delay to something smaller, say 5ms (so 5ms plus scheduling delay in practice):

The reason is that for systems connected to the Internet with routing daemons, the impact of a link state change is huge. A single link transition may keep FRR (née Quagga) busy for several seconds as it linearly evaluates 3 million route entries. Maybe more recent versions of FRR have gotten smarter. This is also to avoid the routing daemon propagating lots of changes, a.k.a. route flap.


* Re: 2nd RTM_NEWLINK notification with operstate down is always 1 second delayed
  2024-04-18 19:26   ` Tom, Deepak Abraham
@ 2024-04-19 16:31     ` Stephen Hemminger
  0 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2024-04-19 16:31 UTC
  To: Tom, Deepak Abraham; +Cc: netdev

On Thu, 18 Apr 2024 19:26:51 +0000
"Tom, Deepak Abraham" <deepak-abraham.tom@hpe.com> wrote:

> Maybe I'm missing something, but could you please explain how this really helps avoid keeping FRR busy?
> If I understood this right, the link watch code does not ignore events but merely delays them, so any link transition will be propagated whether it is scheduled urgently or not.
> So FRR will still have to deal with each transition, keeping it busy with or without this change, unless FRR dampens flaps on its own?
>

A poor connection to a switch can cause repeated link down/up transitions. I haven't seen it firsthand,
but I have had to deal with user reports of poor router connections.

> Also, from a design perspective, would it be better if FRR's issues with route flaps were dealt with directly in FRR code itself? That way, in use cases where FRR does not come into play, such a delay would not cause other consequences. Are there other situations where such a delay is absolutely required?

Too late now. Can't change Linux semantics without breaking many things, and it impacts not just FRR.

