"Forwarding" from TC classifier

From: Lorenz Bauer @ 2020-05-13 16:40 UTC
  To: bpf, Networking, David Ahern, Martynas Pumputis, kernel-team

We've recently open sourced a key component of our L4 load balancer:
cls_redirect [1].
In the commit description, I call out the following caveat:

    cls_redirect relies on receiving encapsulated packets directly from a
    router. This is because we don't have access to the neighbour tables
    from BPF, yet.

The code in question lives in forward_to_next_hop() [2], and does the following:
1. Swap the source and destination MAC addresses of the packet
2. Update the source and destination IP addresses
3. Transmit the packet via bpf_redirect(skb->ifindex, 0)
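
For illustration, step 1 amounts to something like the following. This is a
minimal userspace sketch, not the code from forward_to_next_hop():
struct eth_hdr_sketch, MAC_LEN and swap_mac() are hypothetical names standing
in for the kernel's struct ethhdr and the in-place rewrite done on the skb.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6 /* length of an Ethernet address in bytes */

/* Hypothetical stand-in for the kernel's struct ethhdr. */
struct eth_hdr_sketch {
    uint8_t  dst[MAC_LEN];
    uint8_t  src[MAC_LEN];
    uint16_t proto;
};

/* Step 1: swap the MAC addresses so the frame goes back to the router
 * that delivered it. No neighbour table lookup is involved. */
static void swap_mac(struct eth_hdr_sketch *eth)
{
    uint8_t tmp[MAC_LEN];

    memcpy(tmp, eth->dst, MAC_LEN);
    memcpy(eth->dst, eth->src, MAC_LEN);
    memcpy(eth->src, tmp, MAC_LEN);
}
```

The swap only produces a valid next hop when the previous hop is a router,
which is where the caveat quoted above comes from.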

Really, I'd like to get rid of step 1, and instead rely on the network
stack to switch or route the packet for me. The bpf_fib_lookup helper is
very close to what I need. I've hacked around a bit, and come up with the
following replacement for step 1:

    switch (bpf_fib_lookup(skb, &fib, sizeof(fib), 0)) {
    case BPF_FIB_LKUP_RET_SUCCESS:
        /* There is a cached neighbour, bpf_redirect without going
         * through the stack. */
        return bpf_redirect(...);

    case BPF_FIB_LKUP_RET_NO_NEIGH:
        /* We have no information about this target. Let the stack handle it. */
        return TC_ACT_OK;

    case BPF_FIB_LKUP_RET_FWD_DISABLED:
        return TC_ACT_SHOT;

    default:
        return TC_ACT_SHOT;
    }
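
In the SUCCESS case, the lookup result carries the egress ifindex plus the
source and destination MAC addresses, so the swap from step 1 would become a
copy from that result. A rough userspace sketch of the rewrite, assuming
stand-in types: struct l2_hdr_sketch, struct fib_result_sketch and
apply_fib_result() are hypothetical names, and the result struct mirrors only
the ifindex, smac and dmac fields of struct bpf_fib_lookup.

```c
#include <stdint.h>
#include <string.h>

#define L2_ADDR_LEN 6 /* length of an Ethernet address in bytes */

/* Hypothetical stand-in for the kernel's struct ethhdr. */
struct l2_hdr_sketch {
    uint8_t  dst[L2_ADDR_LEN];
    uint8_t  src[L2_ADDR_LEN];
    uint16_t proto;
};

/* Hypothetical stand-in holding only the output fields of
 * struct bpf_fib_lookup that this sketch uses. */
struct fib_result_sketch {
    uint32_t ifindex;
    uint8_t  smac[L2_ADDR_LEN];
    uint8_t  dmac[L2_ADDR_LEN];
};

/* Rewrite the Ethernet header from a successful lookup and return the
 * interface index that the packet would be redirected to. */
static uint32_t apply_fib_result(struct l2_hdr_sketch *eth,
                                 const struct fib_result_sketch *fib)
{
    memcpy(eth->src, fib->smac, L2_ADDR_LEN);
    memcpy(eth->dst, fib->dmac, L2_ADDR_LEN);
    return fib->ifindex;
}
```

In the real program the returned index would be what gets passed to
bpf_redirect() in place of skb->ifindex.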

I have a couple of questions:

First, I think I can get BPF_FIB_LKUP_RET_NO_NEIGH if the packet needs to
be routed, but there is no neighbour entry for the default gateway. Is that
correct?

Second, is it possible to originate the packet from the local machine,
instead of keeping the original source address when passing the packet to
the stack on NO_NEIGH?
This is what I get with my current approach:

  IP (tos 0x0, ttl 64, id 25769, offset 0, flags [DF], proto UDP (17), length 124)
      10.42.0.2.37074 > 10.42.0.4.2483: [bad udp cksum 0x14d3 -> 0x3c0d!] UDP, length 96
  IP (tos 0x0, ttl 63, id 25769, offset 0, flags [DF], proto UDP (17), length 124)
      10.42.0.2.37074 > 10.42.0.3.2483: [no cksum] UDP, length 96
  IP (tos 0x0, ttl 64, id 51342, offset 0, flags [none], proto ICMP (1), length 84)
      10.42.0.3 > 10.42.0.2: ICMP echo reply, id 33779, seq 0, length 64

The first and second packets use our custom GUE header and contain an
ICMP echo request. Packet three contains the reply to that request. As you
can see, the second packet keeps the 10.42.0.2 source address instead of
using 10.42.0.4.

Third, what effect does BPF_FIB_LOOKUP_OUTPUT have? It seems like I should
set it, but I get somewhat sensible results without it as well. The same
goes for BPF_FIB_LOOKUP_DIRECT.

1: https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/
2: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/test_cls_redirect.c#n509

--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
