netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* UDP receive performance drop since 3.10
@ 2022-05-16 12:58 David Laight
  2022-05-16 14:29 ` Paolo Abeni
  0 siblings, 1 reply; 3+ messages in thread
From: David Laight @ 2022-05-16 12:58 UTC (permalink / raw)
  To: netdev

I've noticed a doubling in the cpu cost of udp processing
between a RHEL 3.10 kernel and a 5.18-rc6 one.

This is (probably) all within ip_rcv().

I'm testing very high rate UDP receive of RTP audio.
(The target is 500000 udp/sec.)
I've enable RPS so that ip_rcv() runs on different multiple
cpus from the ethernet code.
(RSS on the BCM5720 (tg3) doesn't seem to work very well.)

On the 3.10 kernel the 'RPS' cpu show about 5% 'soft int' time.
With 5.10 this has doubled to 10% for much the same test.

The ftrace for a single packet shows a lot of extra code.
With a RHEL 3.10 kernel the trace is quite short:

                 /* netif_receive_skb: dev=em2 skbaddr=ffff99c3ee5ae000 len=200 */
                 ip_rcv() {
                   ip_rcv_finish() {
    0.483 us         udp_v4_early_demux();
                     ip_route_input_noref() {
                       ip_route_input_slow() {
    2.036 us             fib_table_lookup();
                         fib_validate_source() {
                           __fib_validate_source.isra.13() {
    0.646 us                 fib_table_lookup();
    1.589 us               }
    2.610 us             }
    6.820 us           }
    7.755 us         }
                     ip_local_deliver() {
                       ip_local_deliver_finish() {
    0.250 us             raw_local_deliver();
                         udp_rcv() {
                           __udp4_lib_rcv() {
                             __udp4_lib_lookup() {
    0.063 us                   compute_score();
    0.097 us                   compute_score();
    1.496 us                 }
                             udp_queue_rcv_skb() {
                               sk_filter_trim_cap() {
                                 security_sock_rcv_skb() {
    0.066 us                       cap_socket_sock_rcv_skb();
    1.024 us                     }
    1.836 us                   }
    0.093 us                   ipv4_pktinfo_prepare();
                               __udp_enqueue_schedule_skb() {
    0.066 us                     _raw_spin_lock();
                                 sock_def_readable() {
                                   __wake_up_sync_key() {
                                     __wake_up_common_lock() {
    0.194 us                           _raw_spin_lock_irqsave();
                                       __wake_up_common() {
                                         ep_poll_callback() {
    0.184 us                               _raw_spin_lock_irqsave();
    0.084 us                               _raw_spin_unlock_irqrestore();
    2.009 us                             }
    3.264 us                           }
    0.087 us                           _raw_spin_unlock_irqrestore();
    5.579 us                         }
    6.311 us                       }
    7.241 us                     }
    8.833 us                   }
  + 12.948 us                }
  + 16.365 us              }
  + 17.280 us            }
  + 19.900 us          }
  + 20.673 us        }
  + 31.519 us      }
  + 32.534 us    }

Whereas 5.18 has a much longer trace:
                ip_rcv() {
   0.668 us       ip_rcv_core();
                  ip_rcv_finish_core.constprop.0() {
   1.155 us         udp_v4_early_demux();
                    ip_route_input_noref() {
   0.306 us           __rcu_read_lock();
                      ip_route_input_rcu() {
                        ip_route_input_slow() {
                          make_kuid() {
   0.441 us                 map_id_range_down();
   1.231 us               }
   0.307 us               __rcu_read_lock();
                          fib_table_lookup() {
   1.268 us                 fib_lookup_good_nhc();
   2.736 us               }
   0.307 us               __rcu_read_unlock();
                          fib_validate_source() {
                            __fib_validate_source() {
                              make_kuid() {
   0.304 us                     map_id_range_down();
   0.931 us                   }
   0.304 us                   __rcu_read_lock();
                              fib_table_lookup() {
   0.493 us                     fib_lookup_good_nhc();
   1.405 us                   }
   0.393 us                   __rcu_read_unlock();
   0.390 us                   fib_info_nh_uses_dev();
   5.457 us                 }
   6.327 us               }
 + 13.726 us            }
 + 14.727 us          }
   0.407 us           __rcu_read_unlock();
 + 16.673 us        }
 + 19.389 us      }
                  ip_local_deliver() {
                    ip_local_deliver_finish() {
   0.376 us           __rcu_read_lock();
                      ip_protocol_deliver_rcu() {
   0.434 us             raw_local_deliver();
                        udp_rcv() {
                          __udp4_lib_rcv() {
                            __udp4_lib_lookup() {
   0.326 us                   udp4_lib_lookup2.isra.0();
   0.928 us                   udp4_lib_lookup2.isra.0();
   2.413 us                 }
                            udp_unicast_rcv_skb() {
                              udp_queue_rcv_skb() {
                                udp_queue_rcv_one_skb() {
                                  sk_filter_trim_cap() {
   0.440 us                         security_sock_rcv_skb();
                                    sk_filter_trim_cap.part.0() {
   0.297 us                           __rcu_read_lock();
   0.310 us                           __rcu_read_unlock();
   1.531 us                         }
   3.277 us                       }
   0.334 us                       skb_pull_rcsum();
   0.310 us                       ipv4_pktinfo_prepare();
                                  __udp_enqueue_schedule_skb() {
                                    _raw_spin_lock() {
   0.334 us                           preempt_count_add();
   0.938 us                         }
                                    _raw_spin_unlock() {
   0.303 us                           preempt_count_sub();
   0.908 us                         }
                                    sock_def_readable() {
   0.307 us                           __rcu_read_lock();
                                      __wake_up_sync_key() {
                                        __wake_up_common_lock() {
                                          _raw_spin_lock_irqsave() {
   0.326 us                                 preempt_count_add();
   0.951 us                               }
                                          __wake_up_common() {
                                            ep_poll_callback() {
                                              _raw_read_lock_irqsave() {
   0.330 us                                     preempt_count_add();
   1.152 us                                   }
   0.614 us                                   __rcu_read_lock();
   0.323 us                                   __rcu_read_unlock();
                                              _raw_read_unlock_irqrestore() {
   0.380 us                                     preempt_count_sub();
   0.995 us                                   }
   5.410 us                                 }
   6.741 us                               }
                                          _raw_spin_unlock_irqrestore() {
   0.317 us                                 preempt_count_sub();
   1.094 us                               }
   9.994 us                             }
 + 10.806 us                          }
   0.327 us                           __rcu_read_unlock();
 + 12.809 us                        }
 + 16.182 us                      }
 + 22.153 us                    }
 + 22.769 us                  }
 + 23.528 us                }
 + 27.878 us              }
 + 28.646 us            }
 + 30.476 us          }
   0.324 us           __rcu_read_unlock();
 + 32.398 us        }
 + 33.168 us      }
 + 54.976 us    }

Now I know the cost of ftrace is significant (and seems to be
higher in 5.18) but there also seems to be a lot more code.
As well as the extra rcu locks (which are probably mostly ftrace
overhead, a few other things stick out:

1) The sock_net_uid(net, NULL) calls.
   These are make_kuid(net->user_ns, 0) - so pretty much constant.
   They seem to end up in a loop in map_id_range_down_base().
   All looks expensive in the default network namespace where
   0 maps to 0.

2) Extra code in fib_lookup().

3) A lot more locking in ep_poll_callback().

The 5.18 kernel also seems to have CONFIG_DEBUG_PREEMPT set.
I can't find the Kconfig entry for it.
It doesn't exist in the old .config at all.
So I'm not sure why 'make oldconfig' picked it up.

The other possibility is that the extra code is tick_nohz_idle_exit().
The 3.10 trace is from a non-RPS config so I can't compare it.

I'm going to disable CONFIG_DEBUG_PREEMPT to see how much
difference it makes.
Any idea if any other debug options will have got picked up?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: UDP receive performance drop since 3.10
  2022-05-16 12:58 UDP receive performance drop since 3.10 David Laight
@ 2022-05-16 14:29 ` Paolo Abeni
  2022-05-17  9:12   ` David Laight
  0 siblings, 1 reply; 3+ messages in thread
From: Paolo Abeni @ 2022-05-16 14:29 UTC (permalink / raw)
  To: David Laight, netdev

On Mon, 2022-05-16 at 12:58 +0000, David Laight wrote:
> I've noticed a doubling in the cpu cost of udp processing
> between a RHEL 3.10 kernel and a 5.18-rc6 one.
> 
> This is (probably) all within ip_rcv().
> 
> I'm testing very high rate UDP receive of RTP audio.
> (The target is 500000 udp/sec.)
> I've enable RPS so that ip_rcv() runs on different multiple
> cpus from the ethernet code.
> (RSS on the BCM5720 (tg3) doesn't seem to work very well.)
> 
> On the 3.10 kernel the 'RPS' cpu show about 5% 'soft int' time.
> With 5.10 this has doubled to 10% for much the same test.
> 
> The ftrace for a single packet shows a lot of extra code.
> With a RHEL 3.10 kernel the trace is quite short:
> 
>                  /* netif_receive_skb: dev=em2 skbaddr=ffff99c3ee5ae000 len=200 */
>                  ip_rcv() {
>                    ip_rcv_finish() {
>     0.483 us         udp_v4_early_demux();
>                      ip_route_input_noref() {
>                        ip_route_input_slow() {
>     2.036 us             fib_table_lookup();
>                          fib_validate_source() {
>                            __fib_validate_source.isra.13() {
>     0.646 us                 fib_table_lookup();
>     1.589 us               }
>     2.610 us             }
>     6.820 us           }
>     7.755 us         }
>                      ip_local_deliver() {
>                        ip_local_deliver_finish() {
>     0.250 us             raw_local_deliver();
>                          udp_rcv() {
>                            __udp4_lib_rcv() {
>                              __udp4_lib_lookup() {
>     0.063 us                   compute_score();
>     0.097 us                   compute_score();
>     1.496 us                 }
>                              udp_queue_rcv_skb() {
>                                sk_filter_trim_cap() {
>                                  security_sock_rcv_skb() {
>     0.066 us                       cap_socket_sock_rcv_skb();
>     1.024 us                     }
>     1.836 us                   }
>     0.093 us                   ipv4_pktinfo_prepare();
>                                __udp_enqueue_schedule_skb() {
>     0.066 us                     _raw_spin_lock();
>                                  sock_def_readable() {
>                                    __wake_up_sync_key() {
>                                      __wake_up_common_lock() {
>     0.194 us                           _raw_spin_lock_irqsave();
>                                        __wake_up_common() {
>                                          ep_poll_callback() {
>     0.184 us                               _raw_spin_lock_irqsave();
>     0.084 us                               _raw_spin_unlock_irqrestore();
>     2.009 us                             }
>     3.264 us                           }
>     0.087 us                           _raw_spin_unlock_irqrestore();
>     5.579 us                         }
>     6.311 us                       }
>     7.241 us                     }
>     8.833 us                   }
>   + 12.948 us                }
>   + 16.365 us              }
>   + 17.280 us            }
>   + 19.900 us          }
>   + 20.673 us        }
>   + 31.519 us      }
>   + 32.534 us    }
> 
> Whereas 5.18 has a much longer trace:
>                 ip_rcv() {
>    0.668 us       ip_rcv_core();
>                   ip_rcv_finish_core.constprop.0() {
>    1.155 us         udp_v4_early_demux();
>                     ip_route_input_noref() {
>    0.306 us           __rcu_read_lock();
>                       ip_route_input_rcu() {
>                         ip_route_input_slow() {
>                           make_kuid() {
>    0.441 us                 map_id_range_down();
>    1.231 us               }
>    0.307 us               __rcu_read_lock();
>                           fib_table_lookup() {
>    1.268 us                 fib_lookup_good_nhc();
>    2.736 us               }
>    0.307 us               __rcu_read_unlock();
>                           fib_validate_source() {
>                             __fib_validate_source() {
>                               make_kuid() {
>    0.304 us                     map_id_range_down();
>    0.931 us                   }
>    0.304 us                   __rcu_read_lock();
>                               fib_table_lookup() {
>    0.493 us                     fib_lookup_good_nhc();
>    1.405 us                   }
>    0.393 us                   __rcu_read_unlock();
>    0.390 us                   fib_info_nh_uses_dev();
>    5.457 us                 }
>    6.327 us               }
>  + 13.726 us            }
>  + 14.727 us          }
>    0.407 us           __rcu_read_unlock();
>  + 16.673 us        }
>  + 19.389 us      }
>                   ip_local_deliver() {
>                     ip_local_deliver_finish() {
>    0.376 us           __rcu_read_lock();
>                       ip_protocol_deliver_rcu() {
>    0.434 us             raw_local_deliver();
>                         udp_rcv() {
>                           __udp4_lib_rcv() {
>                             __udp4_lib_lookup() {
>    0.326 us                   udp4_lib_lookup2.isra.0();
>    0.928 us                   udp4_lib_lookup2.isra.0();
>    2.413 us                 }
>                             udp_unicast_rcv_skb() {
>                               udp_queue_rcv_skb() {
>                                 udp_queue_rcv_one_skb() {
>                                   sk_filter_trim_cap() {
>    0.440 us                         security_sock_rcv_skb();
>                                     sk_filter_trim_cap.part.0() {
>    0.297 us                           __rcu_read_lock();
>    0.310 us                           __rcu_read_unlock();
>    1.531 us                         }
>    3.277 us                       }
>    0.334 us                       skb_pull_rcsum();
>    0.310 us                       ipv4_pktinfo_prepare();
>                                   __udp_enqueue_schedule_skb() {
>                                     _raw_spin_lock() {
>    0.334 us                           preempt_count_add();
>    0.938 us                         }
>                                     _raw_spin_unlock() {
>    0.303 us                           preempt_count_sub();
>    0.908 us                         }
>                                     sock_def_readable() {
>    0.307 us                           __rcu_read_lock();
>                                       __wake_up_sync_key() {
>                                         __wake_up_common_lock() {
>                                           _raw_spin_lock_irqsave() {
>    0.326 us                                 preempt_count_add();
>    0.951 us                               }
>                                           __wake_up_common() {
>                                             ep_poll_callback() {
>                                               _raw_read_lock_irqsave() {
>    0.330 us                                     preempt_count_add();
>    1.152 us                                   }
>    0.614 us                                   __rcu_read_lock();
>    0.323 us                                   __rcu_read_unlock();
>                                               _raw_read_unlock_irqrestore() {
>    0.380 us                                     preempt_count_sub();
>    0.995 us                                   }
>    5.410 us                                 }
>    6.741 us                               }
>                                           _raw_spin_unlock_irqrestore() {
>    0.317 us                                 preempt_count_sub();
>    1.094 us                               }
>    9.994 us                             }
>  + 10.806 us                          }
>    0.327 us                           __rcu_read_unlock();
>  + 12.809 us                        }
>  + 16.182 us                      }
>  + 22.153 us                    }
>  + 22.769 us                  }
>  + 23.528 us                }
>  + 27.878 us              }
>  + 28.646 us            }
>  + 30.476 us          }
>    0.324 us           __rcu_read_unlock();
>  + 32.398 us        }
>  + 33.168 us      }
>  + 54.976 us    }
> 
> Now I know the cost of ftrace is significant (and seems to be
> higher in 5.18) but there also seems to be a lot more code.
> As well as the extra rcu locks (which are probably mostly ftrace
> overhead, a few other things stick out:
> 
> 1) The sock_net_uid(net, NULL) calls.
>    These are make_kuid(net->user_ns, 0) - so pretty much constant.
>    They seem to end up in a loop in map_id_range_down_base().
>    All looks expensive in the default network namespace where
>    0 maps to 0.
> 
> 2) Extra code in fib_lookup().
> 
> 3) A lot more locking in ep_poll_callback().
> 
> The 5.18 kernel also seems to have CONFIG_DEBUG_PREEMPT set.
> I can't find the Kconfig entry for it.
> It doesn't exist in the old .config at all.
> So I'm not sure why 'make oldconfig' picked it up.
> 
> The other possibility is that the extra code is tick_nohz_idle_exit().
> The 3.10 trace is from a non-RPS config so I can't compare it.
> 
> I'm going to disable CONFIG_DEBUG_PREEMPT to see how much
> difference it makes.
> Any idea if any other debug options will have got picked up?

Do you have CONFIG_PREEMPT_DYNAMIC in your config? That was not
available in 3.10 and apparently it pulls quite a bit of stuff, which
in the end should be quite measurable. The preempt count alone adds
~7us to the above sample.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: UDP receive performance drop since 3.10
  2022-05-16 14:29 ` Paolo Abeni
@ 2022-05-17  9:12   ` David Laight
  0 siblings, 0 replies; 3+ messages in thread
From: David Laight @ 2022-05-17  9:12 UTC (permalink / raw)
  To: 'Paolo Abeni', netdev

From: Paolo Abeni
> Sent: 16 May 2022 15:29
> 
> On Mon, 2022-05-16 at 12:58 +0000, David Laight wrote:
> > I've noticed a doubling in the cpu cost of udp processing
> > between a RHEL 3.10 kernel and a 5.18-rc6 one.
> >
> > This is (probably) all within ip_rcv().
> >
> > I'm testing very high rate UDP receive of RTP audio.
> > (The target is 500000 udp/sec.)
> > I've enable RPS so that ip_rcv() runs on different multiple
> > cpus from the ethernet code.
> > (RSS on the BCM5720 (tg3) doesn't seem to work very well.)
> >
> > On the 3.10 kernel the 'RPS' cpu show about 5% 'soft int' time.
> > With 5.10 this has doubled to 10% for much the same test.
> >
...
> >
> > Now I know the cost of ftrace is significant (and seems to be
> > higher in 5.18) but there also seems to be a lot more code.
> > As well as the extra rcu locks (which are probably mostly ftrace
> > overhead, a few other things stick out:
> >
> > 1) The sock_net_uid(net, NULL) calls.
> >    These are make_kuid(net->user_ns, 0) - so pretty much constant.
> >    They seem to end up in a loop in map_id_range_down_base().
> >    All looks expensive in the default network namespace where
> >    0 maps to 0.
> >
> > 2) Extra code in fib_lookup().
> >
> > 3) A lot more locking in ep_poll_callback().
> >
> > The 5.18 kernel also seems to have CONFIG_DEBUG_PREEMPT set.
> > I can't find the Kconfig entry for it.
> > It doesn't exist in the old .config at all.
> > So I'm not sure why 'make oldconfig' picked it up.
> >
> > The other possibility is that the extra code is tick_nohz_idle_exit().
> > The 3.10 trace is from a non-RPS config so I can't compare it.
> >
> > I'm going to disable CONFIG_DEBUG_PREEMPT to see how much
> > difference it makes.
> > Any idea if any other debug options will have got picked up?
> 
> Do you have CONFIG_PREEMPT_DYNAMIC in your config? That was not
> available in 3.10 and apparently it pulls quite a bit of stuff, which
> in the end should be quite measurable. The preempt count alone adds
> ~7us to the above sample.

That was enabled, I've now turned it off.
But the timings from a full ftrace are dominated by ftrace itself.
So the ~7us is a massive overstatement.

I will do some timings for just ip_rcv() and probably ep_poll_callback()
to see if I can isolate the increase.
I'll need to go into the office to boot the old kernel.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-05-17  9:12 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-16 12:58 UDP receive performance drop since 3.10 David Laight
2022-05-16 14:29 ` Paolo Abeni
2022-05-17  9:12   ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).