[PATCH RFC net-next 00/18] net: Improve route scalability via support for nexthop objects

From: dsahern @ 2018-09-01  0:49 UTC
  To: netdev; +Cc: roopa, sharpd, idosch, davem, David Ahern

From: David Ahern <dsahern@gmail.com>

As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the kernel.

This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes and re-uses some
of the data structures like fib_nh and fib6_nh to more easily align with
the existing code. One difference with nexthop objects is that their
behavior better aligns with the target users: routing daemons and switch
ASICs. Specifically, with the exception of the blackhole nexthop, all
nexthops must reference a netdevice (or have a gateway that resolves to a
device) and the device must be admin up with carrier.

Prefixes are then installed pointing to the nexthop by id:
  { prefix } --> { nexthop }  --> { gateway, device }

The nexthop object contains the gateway and device reference.
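
As a rough illustration of this indirection (a toy model, not the data
structures in the series), the lookup can be sketched in Python. The
Nexthop class, the resolve() helper and the second prefix 10.1.4.0/24 are
made up for the example:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Nexthop:
    gateway: Optional[str]  # None models a device-only nexthop
    device: str


# nexthop id -> nexthop object (hypothetical in-memory table)
nexthops: Dict[int, Nexthop] = {1: Nexthop("10.99.1.2", "veth1")}

# prefix -> nexthop id; the prefix itself carries no gateway or device
routes: Dict[str, int] = {"10.1.1.0/24": 1, "10.1.4.0/24": 1}


def resolve(prefix: str) -> Nexthop:
    """Follow { prefix } --> { nexthop } --> { gateway, device }."""
    return nexthops[routes[prefix]]


print(resolve("10.1.1.0/24"))
```

Both prefixes resolve to the same object, so the gateway and device are
stored once no matter how many prefixes reference them.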

Benchmarks
The following data shows the route insert time for 720,022 routes (a full
IPv4 internet feed from August 28th). "current" means the current code
where a route insert specifies the device and gateway inline with the
prefix; the "nexthop" columns mean use of the nexthop objects.

         1-hop          1-hop     |    2-hops       2-hops
        current        nexthop    |   current      nexthop
        --------------------------|-------------------------
real    0m21.872s      0m12.982s  |   0m28.723s    0m12.406s
user    0m2.929s       0m1.816s   |   0m3.966s     0m1.935s
sys     0m13.469s      0m6.010s   |   0m18.992s    0m5.913s

With nexthop objects the time to insert the routes is reduced by about
40% with the kernel (sys) time cut by more than half. The current model
has a route insertion rate of about 33,000 prefixes/second and with
nexthop objects that increases to a little over 55,000 prefixes/second.

For routes with multiple nexthops the install time is cut by more than
half with system time reduced by a factor of 3. Further, with nexthop
objects insert times for multipath routes drop to the same as single
path routes since the multipath spec is given once (i.e., with the
current model, the time to insert routes increases with the number of
paths in the route, whereas with nexthop objects the number of paths is
handled once and the prefixes referencing it are installed in constant
time).
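
The rates and reductions quoted above can be rechecked from the table.
The short Python script below only redoes that arithmetic; the numbers
are copied from the table and the helper names are made up for the
example:

```python
ROUTES = 720_022  # prefixes in the feed

# (real, user, sys) in seconds, copied from the benchmark table
timings = {
    ("1-hop", "current"): (21.872, 2.929, 13.469),
    ("1-hop", "nexthop"): (12.982, 1.816, 6.010),
    ("2-hops", "current"): (28.723, 3.966, 18.992),
    ("2-hops", "nexthop"): (12.406, 1.935, 5.913),
}
REAL, USER, SYS = 0, 1, 2


def rate(hops, model):
    """Prefixes inserted per second of wall-clock (real) time."""
    return ROUTES / timings[(hops, model)][REAL]


def reduction(hops, column):
    """Fractional time saved by nexthop objects for one table row."""
    current = timings[(hops, "current")][column]
    nexthop = timings[(hops, "nexthop")][column]
    return 1 - nexthop / current


for hops in ("1-hop", "2-hops"):
    print(f"{hops}: {rate(hops, 'current'):,.0f} -> "
          f"{rate(hops, 'nexthop'):,.0f} prefixes/s, "
          f"real -{reduction(hops, REAL):.0%}, "
          f"sys -{reduction(hops, SYS):.0%}")
```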

The difference between real and system times shows there is room for
improvement with the trie implementation. As an example, increasing the
sync_pages from 128 to 1024 delays the call to synchronize_rcu increasing
the insert rate to more than 78,000 prefixes/sec!

Some key features:
1. Allows atomic replace of any nexthop object - a nexthop or a group.
   This allows existing route entries to have their nexthop updated
   without the overhead of removing and re-inserting (or replacing)
   them. Instead, one update of the nexthop object implicitly updates
   all routes referencing it.

   One limitation with the atomic replace is that a nexthop group can
   only be replaced with a new group spec and similarly a nexthop can
   only be replaced by a nexthop spec. Specifically, a nexthop id can
   not move between a single nexthop and a group nexthop.

2. Blackhole nexthop: a nexthop object can be designated a blackhole,
   which means packets for any lookup that resolves to it are dropped as
   if the lookup failed with the result RTN_BLACKHOLE. Blackhole nexthops
   can not be used with nexthop groups. Combined with atomic replace
   this allows routes to be installed pointing to a blackhole nexthop
   and then switched to an actual gateway with a single nexthop replace
   command (or vice versa, a gateway nexthop is flipped to a blackhole).

3. Nexthop groups for multipath routes. A nexthop group is a nexthop
   that references other nexthops. A multipath group can not be used
   as a nexthop in another nexthop group (i.e., groups can not be nested).

4. Multipath routes for IPv6 with device only nexthops. There is a
   demonstrated need for this feature and the existing route semantics
   do not allow it. This series provides a means for that end - create a
   nexthop that has a device only specification.

5. Admin and carrier up are required. If the device goes down (admin or
   carrier) the nexthop is removed in which case routes referencing the
   nexthop are evicted and any nexthop groups referencing it are adjusted.

6. Follow on patches will allow IPv6 nexthops with IPv4 routes for users
   wanting support of RFC 5549.

7. Future extensions: active / backup nexthop. The nexthop groups are
   structured to allow a new group type to be added. One example is a
   group where a nexthop has a preferred device and gateway, but should
   the device go down or the gateway not resolve, the backup nexthop is
   used.
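
The rules in items 1, 2, 3 and 5 can be sketched with a toy Python model.
This is only an illustration of the semantics described above, not the
kernel implementation; the NexthopTable class and its method names are
made up, and what happens to a group whose last member is removed is not
modeled:

```python
class NexthopTable:
    """Toy model of nexthop objects; not the kernel data structures."""

    def __init__(self):
        self.nexthops = {}  # id -> spec dict (single nexthop or group)
        self.routes = {}    # prefix -> nexthop id

    def _build(self, gateway=None, device=None, blackhole=False, group=None):
        if group is not None:
            for member in group:
                spec = self.nexthops[member]
                if "group" in spec:
                    raise ValueError("groups can not be nested")
                if spec.get("blackhole"):
                    raise ValueError("blackhole nexthop can not be in a group")
            return {"group": list(group)}
        if blackhole:
            return {"blackhole": True}
        if gateway is None and device is None:
            raise ValueError("nexthop needs a device or a gateway")
        return {"gateway": gateway, "device": device}

    def add(self, nhid, **spec):
        if nhid in self.nexthops:
            raise ValueError("nexthop id already exists")
        self.nexthops[nhid] = self._build(**spec)

    def replace(self, nhid, **spec):
        new = self._build(**spec)
        if ("group" in self.nexthops[nhid]) != ("group" in new):
            raise ValueError("id can not move between nexthop and group")
        self.nexthops[nhid] = new  # every route using nhid sees the change

    def device_down(self, device):
        gone = {i for i, s in self.nexthops.items() if s.get("device") == device}
        for nhid in gone:
            del self.nexthops[nhid]
        # Evict routes that referenced a removed nexthop.
        self.routes = {p: i for p, i in self.routes.items() if i not in gone}
        # Drop removed members from the remaining groups.
        for spec in self.nexthops.values():
            if "group" in spec:
                spec["group"] = [m for m in spec["group"] if m not in gone]


table = NexthopTable()
table.add(1, blackhole=True)
table.routes["10.1.1.0/24"] = 1
table.replace(1, gateway="10.99.1.2", device="veth1")  # blackhole -> gateway
table.add(2, gateway="10.99.3.2", device="veth3")
table.add(1001, group=[1, 2])
table.routes["10.1.2.0/24"] = 1001
table.device_down("veth1")  # removes id 1, its route, and its group slot
```

After the last call only the route via group 1001 remains, and the group
is left with nexthop 2 as its single member.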

Additional Benefits
- smaller route notifications - messages contain a single nexthop id versus
  the detailed nexthop specification. This is especially noticeable as the
  number of paths increases. Smaller messages have a reduced load on
  userspace as well.

- smaller memory footprint for IPv6 routes.

Examples
1. Single path
    $ ip nexthop add id 1 via 10.99.1.2 dev veth1
    $ ip route add 10.1.1.0/24 nhid 1

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    ...

2. ECMP
    $ ip nexthop add id 2 via 10.99.3.2 dev veth3
    $ ip nexthop add id 1001 group 1/2
      --> creates a nexthop group with 2 component nexthops:
          id 1 and id 2, both with the same weight

    $ ip route add 10.1.2.0/24 nhid 1001

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    ...

3. Weighted multipath
    $ ip nexthop add id 1002 group 1,10/2,20
      --> creates a nexthop group with 2 component nexthops:
          id 1 with a weight of 10 and id 2 with a weight of 20

    $ ip route add 10.1.3.0/24 nhid 1002

    $ ip next ls
    id 1 via 10.99.1.2 src 10.99.1.1 dev veth1 scope link
    id 2 via 10.99.3.2 src 10.99.3.1 dev veth3 scope link
    id 1001 group 1/2
    id 1002 group 1,10/2,20

    $ ip ro ls
    10.1.1.0/24 nhid 1 scope link
    10.1.2.0/24 nhid 1001 scope link
    10.1.3.0/24 nhid 1002 scope link
    ...
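
To make the group syntax in examples 2 and 3 concrete, here is a small
Python sketch that parses a group spec and spreads flows across the
members in proportion to their weights. It is an illustration only: the
function names are made up, a default weight of 1 for members given
without one is an assumption, and the modulo-based selection is not the
kernel's multipath selection code:

```python
from collections import Counter


def parse_group(spec):
    """Parse 'id[,weight]/id[,weight]...' into [(id, weight), ...]."""
    members = []
    for part in spec.split("/"):
        nhid, _, weight = part.partition(",")
        # Assumption: a member without an explicit weight gets weight 1.
        members.append((int(nhid), int(weight) if weight else 1))
    return members


def select(members, flow_hash):
    """Pick a member id; each id owns as many hash slots as its weight."""
    slot = flow_hash % sum(weight for _, weight in members)
    for nhid, weight in members:
        if slot < weight:
            return nhid
        slot -= weight


# 3000 evenly spread hash values stand in for distinct flows.
picks = Counter(select(parse_group("1,10/2,20"), h) for h in range(3000))
print(picks)
```

With weights 10 and 20, nexthop 2 is selected for twice as many of the
evenly spread hash values as nexthop 1.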

Open Items
There is a long to-do list before this is ready (e.g., IPv6 multipath, lwt
encap, and updating mlxsw). The point of this RFC is to get comments on
the API and overall idea. Specifically, any interested parties should
think about the API, the objects, the workflow, how it fits and the
possibility for future extensions.

David Ahern (18):
  net: Rename net/nexthop.h net/rtnh.h
  net: ipv4: export fib_good_nh and fib_flush
  net/ipv4: export fib_info_update_nh_saddr
  net/ipv4: export fib_check_nh
  net/ipv4: Define fib_get_nhs when CONFIG_IP_ROUTE_MULTIPATH is
    disabled
  net/ipv4: Create init and release helpers for fib_nh
  net: ipv4: Add fib_nh to fib_result
  net/ipv4: Move device validation to helper
  net/ipv6: Create init and release helpers for fib6_nh
  net/ipv6: Make fib6_nh optional at the end of fib6_info
  net: Initial nexthop code
  net/ipv4: Add nexthop helpers for ipv4 integration
  net/ipv4: Convert existing use of fib_info to new helpers
  net/ipv4: Allow routes to use nexthop objects
  net/ipv6: Use helpers to access fib6_nh data
  net/ipv6: Allow routes to use nexthop objects
  net: Add support for nexthop groups
  net/ipv4: Optimization for fib_info lookup

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |    4 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c         |   20 +-
 include/net/addrconf.h                             |    5 +
 include/net/ip6_fib.h                              |   22 +-
 include/net/ip6_route.h                            |   12 +-
 include/net/ip_fib.h                               |   39 +-
 include/net/net_namespace.h                        |    2 +
 include/net/netns/nexthop.h                        |   18 +
 include/net/nexthop.h                              |  253 +++-
 include/net/rtnh.h                                 |   34 +
 include/trace/events/fib6.h                        |   15 +-
 include/uapi/linux/nexthop.h                       |   56 +
 include/uapi/linux/rtnetlink.h                     |    8 +
 net/core/filter.c                                  |   13 +-
 net/core/lwtunnel.c                                |    2 +-
 net/decnet/dn_fib.c                                |    2 +-
 net/ipv4/Makefile                                  |    2 +-
 net/ipv4/fib_frontend.c                            |   60 +-
 net/ipv4/fib_rules.c                               |    3 +-
 net/ipv4/fib_semantics.c                           |  433 ++++--
 net/ipv4/fib_trie.c                                |   54 +-
 net/ipv4/ipmr.c                                    |    2 +-
 net/ipv4/nexthop.c                                 | 1541 ++++++++++++++++++++
 net/ipv4/route.c                                   |   34 +-
 net/ipv6/addrconf.c                                |    5 +-
 net/ipv6/addrconf_core.c                           |    9 +
 net/ipv6/af_inet6.c                                |    1 +
 net/ipv6/ip6_fib.c                                 |   27 +-
 net/ipv6/ndisc.c                                   |   15 +-
 net/ipv6/route.c                                   |  474 +++---
 net/mpls/af_mpls.c                                 |    2 +-
 security/selinux/nlmsgtab.c                        |    5 +-
 32 files changed, 2690 insertions(+), 482 deletions(-)
 create mode 100644 include/net/netns/nexthop.h
 create mode 100644 include/net/rtnh.h
 create mode 100644 include/uapi/linux/nexthop.h
 create mode 100644 net/ipv4/nexthop.c

-- 
2.11.0
