* [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
@ 2020-05-11 18:52 ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Overview
========

This series proposes a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP,
or BPF sk_lookup for short.

A BPF sk_lookup program runs when the transport layer is looking up a socket
for a received packet. When called, the sk_lookup program can select a socket
that will receive the packet.

This serves as a mechanism to overcome the limits of what the bind() API
allows one to express. Two use cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a single socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a single socket

     198.51.100.1, any port -> L7 proxy socket

In its context, the program receives information about the packet that
triggered the socket lookup: namely the IP version, the L4 protocol
identifier, and the address 4-tuple.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, calls the bpf_sk_assign(ctx, sk, ...)
helper to record the selection, and returns the BPF_REDIRECT code. The
transport layer then uses the selected socket as the result of the lookup.

Alternatively, the program can fail the lookup (BPF_DROP) or let the lookup
continue as usual (BPF_OK).
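
To make the flow concrete, below is a minimal sketch of such a program. It
uses only pieces described in this series: the bpf_sk_lookup context, the
bpf_sk_assign and bpf_sk_release helpers, and the BPF_OK, BPF_DROP, and
BPF_REDIRECT return codes. The ELF section name, the map definition, the
fetch via bpf_map_lookup_elem, and the port constant are illustrative
assumptions, not taken verbatim from these patches:

    #include <linux/bpf.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>

    /* Sockets to steer matching packets to, populated from user space. */
    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } redir_map SEC(".maps");

    SEC("sk_lookup")
    int steer_http(struct bpf_sk_lookup *ctx)
    {
            const __u32 key = 0;
            struct bpf_sock *sk;
            long err;

            /* Only touch TCP lookups for local port 80. */
            if (ctx->protocol != IPPROTO_TCP || ctx->local_port != 80)
                    return BPF_OK;          /* lookup continues as usual */

            sk = bpf_map_lookup_elem(&redir_map, &key);
            if (!sk)
                    return BPF_OK;

            err = bpf_sk_assign(ctx, sk, 0);  /* record the selection */
            bpf_sk_release(sk);

            return err ? BPF_DROP : BPF_REDIRECT;
    }

    char _license[] SEC("license") = "GPL";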

This lets the user match packets with listening (TCP) or receiving (UDP)
sockets freely at the last possible point on the receive path, where we
know that packets are destined for local delivery after undergoing
policing, filtering, and routing.

The program is attached to a network namespace, similar to the BPF
flow_dissector. We add a new attach type, BPF_SK_LOOKUP, for this.

Patches are organized as follows:

 1: prepares ground for attaching/detaching programs to netns
 2: introduces sk_lookup program type
 3-5: hook up the program to run on ipv4/tcp socket lookup
 6-7: hook up the program to run on ipv6/tcp socket lookup
 8-10: hook up the program to run on ipv4/udp socket lookup
 11-12: hook up the program to run on ipv6/udp socket lookup
 13-14: add libbpf support for sk_lookup
 15-17: verifier and selftests for sk_lookup

Performance considerations
==========================

The patch set adds new code to the receive hot path. This comes with a cost,
especially under a SYN flood or a small-packet UDP flood.

Measuring the performance penalty turned out to be harder than expected
because socket lookup is fast. For CPUs to spend >= 1% of their time in
socket lookup, we had to modify our setup by unloading iptables and
reducing the number of routes.

The receiver machine is a Cloudflare Gen 9 server covered in detail at [0].
In short:

 - 24-core Intel custom off-roadmap 1.9 GHz 150 W (Skylake) CPU
 - dual-port 25G Mellanox ConnectX-4 NIC
 - 256 GB DDR4 2666 MHz RAM

Flood traffic pattern:

 - source: 1 IP, 10k ports
 - destination: 1 IP, 1 port
 - TCP - SYN packet
 - UDP - Len=0 packet

Receiver setup:

 - ingress traffic spread over 4 RX queues,
 - RX/TX pause and autoneg disabled,
 - Intel Turbo Boost disabled,
 - TCP SYN cookies always on.

For the TCP test there is a receiver process with a single listening socket
open. The receiver is not accept()'ing connections.

For the UDP test the receiver process has a single UDP socket with a filter
installed that drops the packets.

With this setup in place, we record RX pps and cpu-cycles events under
flood for 60 seconds in 3 configurations:

 1. 5.6.3 kernel w/o this patch series (baseline),
 2. 5.6.3 kernel with patches applied, but no SK_LOOKUP program attached,
 3. 5.6.3 kernel with patches applied, and SK_LOOKUP program attached;
    BPF program [1] does a lookup in an LPM_TRIE map with 200 entries.

RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.

| tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
| no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
| with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |

| tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
| no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
| with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |

| udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
| no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
| with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |

| udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
| no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
| with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |

Also visualized in the bpf-sk-lookup-v1-rx-pps.png chart [2].

cpu-cycles measured with `perf record -F 999 --cpu 1-4 -g -- sleep 60`.

|                              |      cpu-cycles events |          |
| tcp4 SYN flood               | __inet_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     |                  1.12% |        - |
| no SK_LOOKUP prog attached   |                  1.31% |    0.19% |
| with SK_LOOKUP prog attached |                  3.05% |    1.93% |

|                              |      cpu-cycles events |          |
| tcp6 SYN flood               |  inet6_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     |                  1.05% |        - |
| no SK_LOOKUP prog attached   |                  1.68% |    0.63% |
| with SK_LOOKUP prog attached |                  3.15% |    2.10% |

|                              |      cpu-cycles events |          |
| udp4 0-len flood             |      __udp4_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     |                  3.81% |        - |
| no SK_LOOKUP prog attached   |                  5.22% |    1.41% |
| with SK_LOOKUP prog attached |                  8.20% |    4.39% |

|                              |      cpu-cycles events |          |
| udp6 0-len flood             |      __udp6_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.6.3 vanilla (baseline)     |                  5.51% |        - |
| no SK_LOOKUP prog attached   |                  6.51% |    1.00% |
| with SK_LOOKUP prog attached |                 10.14% |    4.63% |

Also visualized in the bpf-sk-lookup-v1-cpu-cycles.png chart [3].

Further work
============

- timeout on accept() in tests

  In the end accept_timeout didn't land in network_helpers. I want to
  extract it and adapt existing tests to use it, but in a separate
  series. This one is already uncomfortably long.

- Documentation/bpf/prog_sk_lookup.rst

  In progress. It will contain the same information as the cover letter and
  the description for patch 2. I will include it in the next iteration or
  post it as a follow-up.

Changelog
=========

v1 -> v2:
- Changes called out in patches 2, 13-15, 17
- Rebase to recent bpf-next (b4563facdcae)

RFCv2 -> v1:

- Switch to fetching a socket from a map and selecting a socket with
  bpf_sk_assign, instead of having a dedicated helper that does both.

- Run reuseport logic on sockets selected by BPF sk_lookup.

- Allow BPF sk_lookup to fail the lookup with no match.

- Go back to having just 2 hash table lookups in UDP.

RFCv1 -> RFCv2:

- Make socket lookup redirection map-based. The BPF program now uses a
  dedicated helper and a SOCKARRAY map to select the socket to redirect to.
  A consequence of this change is that the bpf_inet_lookup context is now
  read-only.

- Look for connected UDP sockets before allowing redirection from BPF.
  This makes connected UDP sockets work as expected in the presence of an
  inet_lookup prog.

- Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
  the only other per-netns BPF prog type.

[0] https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/
[1] https://github.com/majek/inet-tool/blob/master/ebpf/inet-kern.c
[2] https://drive.google.com/file/d/1HrrjWhQoVlqiqT73_eLtWMPhuGPKhGFX/
[3] https://drive.google.com/file/d/1cYPPOlGg7M-bkzI4RW1SOm49goI4LYbb/
[RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
[RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/

Jakub Sitnicki (17):
  flow_dissector: Extract attach/detach/query helpers
  bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  inet: Store layer 4 protocol in inet_hashinfo
  inet: Extract helper for selecting socket from reuseport group
  inet: Run SK_LOOKUP BPF program on socket lookup
  inet6: Extract helper for selecting socket from reuseport group
  inet6: Run SK_LOOKUP BPF program on socket lookup
  udp: Store layer 4 protocol in udp_table
  udp: Extract helper for selecting socket from reuseport group
  udp: Run SK_LOOKUP BPF program on socket lookup
  udp6: Extract helper for selecting socket from reuseport group
  udp6: Run SK_LOOKUP BPF program on socket lookup
  bpf: Sync linux/bpf.h to tools/
  libbpf: Add support for SK_LOOKUP program type
  selftests/bpf: Add verifier tests for bpf_sk_lookup context access
  selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c
  selftests/bpf: Tests for BPF_SK_LOOKUP attach point

 include/linux/bpf.h                           |   8 +
 include/linux/bpf_types.h                     |   2 +
 include/linux/filter.h                        |  42 +
 include/net/inet6_hashtables.h                |  20 +
 include/net/inet_hashtables.h                 |  39 +
 include/net/net_namespace.h                   |   1 +
 include/net/udp.h                             |  10 +-
 include/uapi/linux/bpf.h                      |  52 +
 kernel/bpf/syscall.c                          |  14 +
 net/core/filter.c                             | 315 ++++++
 net/core/flow_dissector.c                     |  61 +-
 net/dccp/proto.c                              |   2 +-
 net/ipv4/inet_hashtables.c                    |  44 +-
 net/ipv4/tcp_ipv4.c                           |   2 +-
 net/ipv4/udp.c                                |  85 +-
 net/ipv4/udp_impl.h                           |   2 +-
 net/ipv4/udplite.c                            |   4 +-
 net/ipv6/inet6_hashtables.c                   |  46 +-
 net/ipv6/udp.c                                |  86 +-
 net/ipv6/udp_impl.h                           |   2 +-
 net/ipv6/udplite.c                            |   2 +-
 scripts/bpf_helpers_doc.py                    |   9 +-
 tools/include/uapi/linux/bpf.h                |  52 +
 tools/lib/bpf/libbpf.c                        |   3 +
 tools/lib/bpf/libbpf.h                        |   2 +
 tools/lib/bpf/libbpf.map                      |   2 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 .../bpf/prog_tests/reference_tracking.c       |   2 +-
 .../selftests/bpf/prog_tests/sk_lookup.c      | 999 ++++++++++++++++++
 .../selftests/bpf/progs/test_ref_track_kern.c | 180 ++++
 .../selftests/bpf/progs/test_sk_lookup_kern.c | 258 +++--
 .../selftests/bpf/verifier/ctx_sk_lookup.c    | 694 ++++++++++++
 32 files changed, 2769 insertions(+), 272 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ref_track_kern.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c

-- 
2.25.3


* [PATCH bpf-next v2 01/17] flow_dissector: Extract attach/detach/query helpers
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Lorenz Bauer

Move the generic parts of the callbacks for querying, attaching, and
detaching a single BPF program, so that they can be reused by other BPF
program types.

A subsequent patch makes use of the extracted routines.
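
For illustration, here is a minimal sketch of how another per-netns program
type could use the extracted helpers; the net->foo_prog field, the mutex,
and the function name are hypothetical placeholders:

    static DEFINE_MUTEX(foo_prog_mutex);

    int foo_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
    {
            struct net *net = current->nsproxy->net_ns;
            int ret;

            mutex_lock(&foo_prog_mutex);
            ret = bpf_prog_attach_one(&net->foo_prog, &foo_prog_mutex,
                                      prog, attr->attach_flags);
            mutex_unlock(&foo_prog_mutex);

            return ret;
    }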

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf.h       |  8 +++++
 net/core/filter.c         | 68 +++++++++++++++++++++++++++++++++++++++
 net/core/flow_dissector.c | 61 +++++++----------------------------
 3 files changed, 88 insertions(+), 49 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cf4b6e44f2bc..1cf4fae7987d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -32,6 +32,7 @@ struct btf;
 struct btf_type;
 struct exception_table_entry;
 struct seq_operations;
+struct mutex;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1696,4 +1697,11 @@ enum bpf_text_poke_type {
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
+int bpf_prog_query_one(struct bpf_prog __rcu **pprog,
+		       const union bpf_attr *attr,
+		       union bpf_attr __user *uattr);
+int bpf_prog_attach_one(struct bpf_prog __rcu **pprog, struct mutex *lock,
+			struct bpf_prog *prog, u32 flags);
+int bpf_prog_detach_one(struct bpf_prog __rcu **pprog, struct mutex *lock);
+
 #endif /* _LINUX_BPF_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index da0634979f53..48ed970f4ae1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8738,6 +8738,74 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 	return ret;
 }
 
+int bpf_prog_query_one(struct bpf_prog __rcu **pprog,
+		       const union bpf_attr *attr,
+		       union bpf_attr __user *uattr)
+{
+	__u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids);
+	u32 prog_id, prog_cnt = 0, flags = 0;
+	struct bpf_prog *attached;
+
+	if (attr->query.query_flags)
+		return -EINVAL;
+
+	rcu_read_lock();
+	attached = rcu_dereference(*pprog);
+	if (attached) {
+		prog_cnt = 1;
+		prog_id = attached->aux->id;
+	}
+	rcu_read_unlock();
+
+	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
+		return -EFAULT;
+	if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt)))
+		return -EFAULT;
+
+	if (!attr->query.prog_cnt || !prog_ids || !prog_cnt)
+		return 0;
+
+	if (copy_to_user(prog_ids, &prog_id, sizeof(u32)))
+		return -EFAULT;
+
+	return 0;
+}
+
+int bpf_prog_attach_one(struct bpf_prog __rcu **pprog, struct mutex *lock,
+			struct bpf_prog *prog, u32 flags)
+{
+	struct bpf_prog *attached;
+
+	if (flags)
+		return -EINVAL;
+
+	attached = rcu_dereference_protected(*pprog,
+					     lockdep_is_held(lock));
+	if (attached == prog) {
+		/* The same program cannot be attached twice */
+		return -EINVAL;
+	}
+	rcu_assign_pointer(*pprog, prog);
+	if (attached)
+		bpf_prog_put(attached);
+
+	return 0;
+}
+
+int bpf_prog_detach_one(struct bpf_prog __rcu **pprog, struct mutex *lock)
+{
+	struct bpf_prog *attached;
+
+	attached = rcu_dereference_protected(*pprog,
+					     lockdep_is_held(lock));
+	if (!attached)
+		return -ENOENT;
+	RCU_INIT_POINTER(*pprog, NULL);
+	bpf_prog_put(attached);
+
+	return 0;
+}
+
 #ifdef CONFIG_INET
 static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
 				    struct sock_reuseport *reuse,
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 3eff84824c8b..5ff99ed175bd 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -73,46 +73,22 @@ EXPORT_SYMBOL(skb_flow_dissector_init);
 int skb_flow_dissector_prog_query(const union bpf_attr *attr,
 				  union bpf_attr __user *uattr)
 {
-	__u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids);
-	u32 prog_id, prog_cnt = 0, flags = 0;
-	struct bpf_prog *attached;
 	struct net *net;
-
-	if (attr->query.query_flags)
-		return -EINVAL;
+	int ret;
 
 	net = get_net_ns_by_fd(attr->query.target_fd);
 	if (IS_ERR(net))
 		return PTR_ERR(net);
 
-	rcu_read_lock();
-	attached = rcu_dereference(net->flow_dissector_prog);
-	if (attached) {
-		prog_cnt = 1;
-		prog_id = attached->aux->id;
-	}
-	rcu_read_unlock();
+	ret = bpf_prog_query_one(&net->flow_dissector_prog, attr, uattr);
 
 	put_net(net);
-
-	if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
-		return -EFAULT;
-	if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt)))
-		return -EFAULT;
-
-	if (!attr->query.prog_cnt || !prog_ids || !prog_cnt)
-		return 0;
-
-	if (copy_to_user(prog_ids, &prog_id, sizeof(u32)))
-		return -EFAULT;
-
-	return 0;
+	return ret;
 }
 
 int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
 				       struct bpf_prog *prog)
 {
-	struct bpf_prog *attached;
 	struct net *net;
 	int ret = 0;
 
@@ -145,16 +121,9 @@ int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
 		}
 	}
 
-	attached = rcu_dereference_protected(net->flow_dissector_prog,
-					     lockdep_is_held(&flow_dissector_mutex));
-	if (attached == prog) {
-		/* The same program cannot be attached twice */
-		ret = -EINVAL;
-		goto out;
-	}
-	rcu_assign_pointer(net->flow_dissector_prog, prog);
-	if (attached)
-		bpf_prog_put(attached);
+	ret = bpf_prog_attach_one(&net->flow_dissector_prog,
+				  &flow_dissector_mutex, prog,
+				  attr->attach_flags);
 out:
 	mutex_unlock(&flow_dissector_mutex);
 	return ret;
@@ -162,21 +131,15 @@ int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
 
 int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 {
-	struct bpf_prog *attached;
-	struct net *net;
+	struct net *net = current->nsproxy->net_ns;
+	int ret;
 
-	net = current->nsproxy->net_ns;
 	mutex_lock(&flow_dissector_mutex);
-	attached = rcu_dereference_protected(net->flow_dissector_prog,
-					     lockdep_is_held(&flow_dissector_mutex));
-	if (!attached) {
-		mutex_unlock(&flow_dissector_mutex);
-		return -ENOENT;
-	}
-	RCU_INIT_POINTER(net->flow_dissector_prog, NULL);
-	bpf_prog_put(attached);
+	ret =  bpf_prog_detach_one(&net->flow_dissector_prog,
+				   &flow_dissector_mutex);
 	mutex_unlock(&flow_dissector_mutex);
-	return 0;
+
+	return ret;
 }
 
 /**
-- 
2.25.3


* [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
called BPF_SK_LOOKUP. The new program kind is to be invoked by the
transport layer when looking up a socket for a received packet.

When called, the SK_LOOKUP program can select a socket that will receive the
packet. This serves as a mechanism to overcome the limits of what the bind()
API allows one to express. Two use cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context, the program receives information about the packet
that triggered the socket lookup: namely the IP version, the L4 protocol
identifier, and the address 4-tuple. The context can be further extended to
include the ingress interface identifier.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, and calls the
bpf_sk_assign(ctx, sk, ...) helper to record the selection. The transport
layer then uses the selected socket as the result of the socket lookup.

This patch only enables the user to attach an SK_LOOKUP program to a network
namespace. Subsequent patches hook it up to run on the local delivery path
in the ipv4 and ipv6 stacks.
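
As a usage sketch (not part of this patch), attaching an already loaded
SK_LOOKUP program to the caller's network namespace boils down to a
BPF_PROG_ATTACH command with attach_type set to BPF_SK_LOOKUP; here prog_fd
is assumed to refer to a program loaded as BPF_PROG_TYPE_SK_LOOKUP with
expected_attach_type BPF_SK_LOOKUP:

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Attach prog_fd to the current network namespace; returns 0 or -1. */
    static int attach_sk_lookup(int prog_fd)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.attach_bpf_fd = prog_fd;
            attr.attach_type = BPF_SK_LOOKUP;
            attr.attach_flags = 0;  /* must be zero */

            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }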

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
    - Make bpf_sk_assign reject sockets that don't use RCU freeing.
      Update bpf_sk_assign docs accordingly. (Martin)
    - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
    - Fix broken build when CONFIG_INET is not selected. (Martin)
    - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)

 include/linux/bpf_types.h   |   2 +
 include/linux/filter.h      |  42 ++++++
 include/net/net_namespace.h |   1 +
 include/uapi/linux/bpf.h    |  52 ++++++++
 kernel/bpf/syscall.c        |  14 ++
 net/core/filter.c           | 247 ++++++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py  |   9 +-
 7 files changed, 366 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 29d22752fc87..d238b8393616 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
+	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
 #endif
 #if defined(CONFIG_BPF_JIT)
 BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73d06a39e2d6..95bcdfd602d3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1278,4 +1278,46 @@ struct bpf_sockopt_kern {
 	s32		retval;
 };
 
+struct bpf_sk_lookup_kern {
+	unsigned short	family;
+	u16		protocol;
+	union {
+		struct {
+			__be32 saddr;
+			__be32 daddr;
+		} v4;
+		struct {
+			struct in6_addr saddr;
+			struct in6_addr daddr;
+		} v6;
+	};
+	__be16		sport;
+	u16		dport;
+	struct sock	*selected_sk;
+};
+
+#ifdef CONFIG_INET
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int sk_lookup_prog_detach(const union bpf_attr *attr);
+int sk_lookup_prog_query(const union bpf_attr *attr,
+			 union bpf_attr __user *uattr);
+#else
+static inline int sk_lookup_prog_attach(const union bpf_attr *attr,
+					struct bpf_prog *prog)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int sk_lookup_prog_detach(const union bpf_attr *attr)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int sk_lookup_prog_query(const union bpf_attr *attr,
+				       union bpf_attr __user *uattr)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_INET */
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index ab96fb59131c..70bf4888c94d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -163,6 +163,7 @@ struct net {
 	struct net_generic __rcu	*gen;
 
 	struct bpf_prog __rcu	*flow_dissector_prog;
+	struct bpf_prog __rcu	*sk_lookup_prog;
 
 	/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9d1932e23cec..03edf4ec7b7e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -188,6 +188,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -220,6 +221,7 @@ enum bpf_attach_type {
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
 	BPF_TRACE_ITER,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3050,6 +3052,10 @@ union bpf_attr {
  *
  * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3070,6 +3076,38 @@ union bpf_attr {
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed passed socket must be compatible
+ *		with the packet description provided by the *ctx* object.
+ *
+ *		L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an exact
+ *		match. While IP family (*AF_INET* or *AF_INET6*) must be
+ *		compatible, that is IPv6 sockets that are not v6-only can be
+ *		selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP sockets, that is sockets which have
+ *		*SOCK_RCU_FREE* flag set, can be selected.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		**-EAFNOSUPPORT** if socket family (*sk->family*) is not
+ *		compatible with packet family (*ctx->family*).
+ *
+ *		**-EINVAL** if unsupported flags were specified.
+ *
+ *		**-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
+ *		match packet protocol (*ctx->protocol*).
+ *
+ *		**-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -4058,4 +4096,18 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	/* IP addresses allow 1,2,4-byte read and are in network byte order. */
+	__u32 remote_ip4;
+	__u32 remote_ip6[4];
+	__u32 remote_port;	/* network byte order */
+	__u32 local_ip4;
+	__u32 local_ip6[4];
+	__u32 local_port;	/* host byte order */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index de2a75500233..e2478f4270af 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2000,6 +2000,10 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		if (expected_attach_type == BPF_SK_LOOKUP)
+			return 0;
+		return -EINVAL;
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
@@ -2680,6 +2684,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
 		return prog->enforce_expected_attach_type &&
@@ -2731,6 +2736,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
 	case BPF_TRACE_ITER:
 		return BPF_PROG_TYPE_TRACING;
+	case BPF_SK_LOOKUP:
+		return BPF_PROG_TYPE_SK_LOOKUP;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -2780,6 +2787,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		ret = sk_lookup_prog_attach(attr, prog);
+		break;
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2820,6 +2830,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return lirc_prog_detach(attr);
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return skb_flow_dissector_bpf_prog_detach(attr);
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		return sk_lookup_prog_detach(attr);
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2869,6 +2881,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 		return lirc_prog_query(attr, uattr);
 	case BPF_FLOW_DISSECTOR:
 		return skb_flow_dissector_prog_query(attr, uattr);
+	case BPF_SK_LOOKUP:
+		return sk_lookup_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 48ed970f4ae1..8ea17eda6ff2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9052,6 +9052,253 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+static DEFINE_MUTEX(sk_lookup_prog_mutex);
+
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ret;
+
+	if (unlikely(attr->attach_flags))
+		return -EINVAL;
+
+	mutex_lock(&sk_lookup_prog_mutex);
+	ret = bpf_prog_attach_one(&net->sk_lookup_prog,
+				  &sk_lookup_prog_mutex, prog,
+				  attr->attach_flags);
+	mutex_unlock(&sk_lookup_prog_mutex);
+
+	return ret;
+}
+
+int sk_lookup_prog_detach(const union bpf_attr *attr)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ret;
+
+	if (unlikely(attr->attach_flags))
+		return -EINVAL;
+
+	mutex_lock(&sk_lookup_prog_mutex);
+	ret = bpf_prog_detach_one(&net->sk_lookup_prog,
+				  &sk_lookup_prog_mutex);
+	mutex_unlock(&sk_lookup_prog_mutex);
+
+	return ret;
+}
+
+int sk_lookup_prog_query(const union bpf_attr *attr,
+			 union bpf_attr __user *uattr)
+{
+	struct net *net;
+	int ret;
+
+	net = get_net_ns_by_fd(attr->query.target_fd);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	ret = bpf_prog_query_one(&net->sk_lookup_prog, attr, uattr);
+
+	put_net(net);
+	return ret;
+}
+
+BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
+	   struct sock *, sk, u64, flags)
+{
+	if (unlikely(flags != 0))
+		return -EINVAL;
+	if (unlikely(sk_is_refcounted(sk)))
+		return -ESOCKTNOSUPPORT;
+
+	/* Check if socket is suitable for packet L3/L4 protocol */
+	if (sk->sk_protocol != ctx->protocol)
+		return -EPROTOTYPE;
+	if (sk->sk_family != ctx->family &&
+	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
+		return -EAFNOSUPPORT;
+
+	/* Select socket as lookup result */
+	ctx->selected_sk = sk;
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
+	.func		= bpf_sk_lookup_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_SOCKET,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_lookup_assign_proto;
+	case BPF_FUNC_sk_release:
+		return &bpf_sk_release_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static bool sk_lookup_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+		if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+			return false;
+		bpf_ctx_record_field_size(info, size_default);
+		break;
+
+	case bpf_ctx_range(struct bpf_sk_lookup, family):
+	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
+		if (size != size_default)
+			return false;
+		break;
+
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+#define CHECK_FIELD_SIZE(BPF_TYPE, BPF_FIELD, KERN_TYPE, KERN_FIELD)	\
+	BUILD_BUG_ON(sizeof_field(BPF_TYPE, BPF_FIELD) <		\
+		     sizeof_field(KERN_TYPE, KERN_FIELD))
+
+#define LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, OFF)			\
+	BPF_LDX_MEM(SIZE, si->dst_reg, si->src_reg,			\
+		    bpf_target_off(TYPE, FIELD,				\
+				   sizeof_field(TYPE, FIELD),		\
+				   target_size) + (OFF))
+
+#define LOAD_FIELD_SIZE(TYPE, FIELD, SIZE) \
+	LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, 0)
+
+#define LOAD_FIELD(TYPE, FIELD) \
+	LOAD_FIELD_SIZE(TYPE, FIELD, BPF_FIELD_SIZEOF(TYPE, FIELD))
+
+static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog,
+					u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sk_lookup, family):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, family,
+				 struct bpf_sk_lookup_kern, family);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, family);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, protocol):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, protocol,
+				 struct bpf_sk_lookup_kern, protocol);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, protocol);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_ip4):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_ip4,
+				 struct bpf_sk_lookup_kern, v4.saddr);
+		*insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.saddr,
+					  BPF_SIZE(si->code));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_ip4):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_ip4,
+				 struct bpf_sk_lookup_kern, v4.daddr);
+		*insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.daddr,
+					  BPF_SIZE(si->code));
+
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_ip6[0],
+				 struct bpf_sk_lookup_kern,
+				 v6.saddr.s6_addr32[0]);
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
+		*insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+					      v6.saddr.s6_addr32[0],
+					      BPF_SIZE(si->code), off);
+#else
+		(void)off;
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_ip6[0],
+				 struct bpf_sk_lookup_kern,
+				 v6.daddr.s6_addr32[0]);
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
+		*insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+					      v6.daddr.s6_addr32[0],
+					      BPF_SIZE(si->code), off);
+#else
+		(void)off;
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_port):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_port,
+				 struct bpf_sk_lookup_kern, sport);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, sport);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_port):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_port,
+				 struct bpf_sk_lookup_kern, dport);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, dport);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_prog_ops sk_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+	.get_func_proto		= sk_lookup_func_proto,
+	.is_valid_access	= sk_lookup_is_valid_access,
+	.convert_ctx_access	= sk_lookup_convert_ctx_access,
+};
+
 #endif /* CONFIG_INET */
 
 DEFINE_BPF_DISPATCHER(xdp)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index ded304c96a05..4a6653c64210 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -398,6 +398,7 @@ class PrinterHelpers(Printer):
 
     type_fwds = [
             'struct bpf_fib_lookup',
+            'struct bpf_sk_lookup',
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
@@ -438,6 +439,7 @@ class PrinterHelpers(Printer):
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
+            'struct bpf_sk_lookup',
             'struct bpf_sock',
             'struct bpf_sock_addr',
             'struct bpf_sock_ops',
@@ -469,6 +471,11 @@ class PrinterHelpers(Printer):
             'struct sk_msg_buff': 'struct sk_msg_md',
             'struct xdp_buff': 'struct xdp_md',
     }
+    # Helpers overloaded for different context types.
+    overloaded_helpers = [
+        'bpf_get_socket_cookie',
+        'bpf_sk_assign',
+    ]
 
     def print_header(self):
         header = '''\
@@ -525,7 +532,7 @@ class PrinterHelpers(Printer):
         for i, a in enumerate(proto['args']):
             t = a['type']
             n = a['name']
-            if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
+            if proto['name'] in self.overloaded_helpers and i == 0:
                     t = 'void'
                     n = 'ctx'
             one_arg = '{}{}'.format(comma, self.map_type(t))
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
called BPF_SK_LOOKUP. The new program kind is to be invoked by the
transport layer when looking up a socket for a received packet.

When called, the SK_LOOKUP program can select a socket that will receive
the packet. This serves as a mechanism to overcome the limits of what the
bind() API can express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context, the program receives information about the packet
that triggered the socket lookup, namely the IP version, L4 protocol
identifier, and address 4-tuple. The context can be further extended to
include the ingress interface identifier.

To select a socket, the BPF program fetches it from a map holding socket
references, like SOCKMAP or SOCKHASH, and calls the bpf_sk_assign(ctx, sk,
...) helper to record the selection. The transport layer then uses the
selected socket as the result of socket lookup.

This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on the local
delivery path in the ipv4 and ipv6 stacks.
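
For illustration, a minimal sketch of such a program covering use-case (1)
could look as follows. It is not part of this patch; the map name, section
name, program name, and constants are made up, and it assumes a SOCKMAP
that user space has populated with the target socket at key 0:

  #include <linux/bpf.h>
  #include <linux/in.h>
  #include <sys/socket.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKMAP);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } redir_map SEC(".maps");

  SEC("sk_lookup")
  int steer_http(struct bpf_sk_lookup *ctx)
  {
          const __u32 subnet = bpf_htonl(0xc0000200); /* 192.0.2.0 */
          const __u32 mask = bpf_htonl(0xffffff00);   /* /24 */
          struct bpf_sock *sk;
          __u32 key = 0;
          int err;

          /* Only steer TCP packets aimed at 192.0.2.0/24, port 80 */
          if (ctx->protocol != IPPROTO_TCP || ctx->local_port != 80)
                  return BPF_OK;
          if (ctx->family != AF_INET || (ctx->local_ip4 & mask) != subnet)
                  return BPF_OK;

          sk = bpf_map_lookup_elem(&redir_map, &key);
          if (!sk)
                  return BPF_OK;

          /* Record the selection; release the map's socket reference */
          err = bpf_sk_assign(ctx, sk, 0);
          bpf_sk_release(sk);

          /* Fall back to the regular lookup if the assignment failed */
          return err ? BPF_OK : BPF_REDIRECT;
  }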

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
    - Make bpf_sk_assign reject sockets that don't use RCU freeing.
      Update bpf_sk_assign docs accordingly. (Martin)
    - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
    - Fix broken build when CONFIG_INET is not selected. (Martin)
    - Rename bpf_sk_lookup{} src_/dst_* fields to remote_/local_*. (Martin)

 include/linux/bpf_types.h   |   2 +
 include/linux/filter.h      |  42 ++++++
 include/net/net_namespace.h |   1 +
 include/uapi/linux/bpf.h    |  52 ++++++++
 kernel/bpf/syscall.c        |  14 ++
 net/core/filter.c           | 247 ++++++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py  |   9 +-
 7 files changed, 366 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 29d22752fc87..d238b8393616 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
+	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
 #endif
 #if defined(CONFIG_BPF_JIT)
 BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 73d06a39e2d6..95bcdfd602d3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1278,4 +1278,46 @@ struct bpf_sockopt_kern {
 	s32		retval;
 };
 
+struct bpf_sk_lookup_kern {
+	unsigned short	family;
+	u16		protocol;
+	union {
+		struct {
+			__be32 saddr;
+			__be32 daddr;
+		} v4;
+		struct {
+			struct in6_addr saddr;
+			struct in6_addr daddr;
+		} v6;
+	};
+	__be16		sport;
+	u16		dport;
+	struct sock	*selected_sk;
+};
+
+#ifdef CONFIG_INET
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int sk_lookup_prog_detach(const union bpf_attr *attr);
+int sk_lookup_prog_query(const union bpf_attr *attr,
+			 union bpf_attr __user *uattr);
+#else
+static inline int sk_lookup_prog_attach(const union bpf_attr *attr,
+					struct bpf_prog *prog)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int sk_lookup_prog_detach(const union bpf_attr *attr)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int sk_lookup_prog_query(const union bpf_attr *attr,
+				       union bpf_attr __user *uattr)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_INET */
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index ab96fb59131c..70bf4888c94d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -163,6 +163,7 @@ struct net {
 	struct net_generic __rcu	*gen;
 
 	struct bpf_prog __rcu	*flow_dissector_prog;
+	struct bpf_prog __rcu	*sk_lookup_prog;
 
 	/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9d1932e23cec..03edf4ec7b7e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -188,6 +188,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -220,6 +221,7 @@ enum bpf_attach_type {
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
 	BPF_TRACE_ITER,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3050,6 +3052,10 @@ union bpf_attr {
  *
  * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3070,6 +3076,38 @@ union bpf_attr {
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed, the passed socket must be
+ *		compatible with the packet description provided by *ctx*.
+ *
+ *		L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an exact
+ *		match, while the IP family (*AF_INET* or *AF_INET6*) must be
+ *		compatible, that is, IPv6 sockets that are not v6-only can be
+ *		selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP sockets, that is, sockets which
+ *		have the *SOCK_RCU_FREE* flag set, can be selected.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		**-EAFNOSUPPORT** if socket family (*sk->family*) is not
+ *		compatible with packet family (*ctx->family*).
+ *
+ *		**-EINVAL** if unsupported flags were specified.
+ *
+ *		**-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
+ *		match packet protocol (*ctx->protocol*).
+ *
+ *		**-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -4058,4 +4096,18 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	/* IP addresses allow 1,2,4-byte read and are in network byte order. */
+	__u32 remote_ip4;
+	__u32 remote_ip6[4];
+	__u32 remote_port;	/* network byte order */
+	__u32 local_ip4;
+	__u32 local_ip6[4];
+	__u32 local_port;	/* host byte order */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index de2a75500233..e2478f4270af 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2000,6 +2000,10 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		if (expected_attach_type == BPF_SK_LOOKUP)
+			return 0;
+		return -EINVAL;
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
@@ -2680,6 +2684,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
 		return prog->enforce_expected_attach_type &&
@@ -2731,6 +2736,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
 	case BPF_TRACE_ITER:
 		return BPF_PROG_TYPE_TRACING;
+	case BPF_SK_LOOKUP:
+		return BPF_PROG_TYPE_SK_LOOKUP;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -2780,6 +2787,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		ret = sk_lookup_prog_attach(attr, prog);
+		break;
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2820,6 +2830,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return lirc_prog_detach(attr);
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return skb_flow_dissector_bpf_prog_detach(attr);
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		return sk_lookup_prog_detach(attr);
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2869,6 +2881,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 		return lirc_prog_query(attr, uattr);
 	case BPF_FLOW_DISSECTOR:
 		return skb_flow_dissector_prog_query(attr, uattr);
+	case BPF_SK_LOOKUP:
+		return sk_lookup_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 48ed970f4ae1..8ea17eda6ff2 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9052,6 +9052,253 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+static DEFINE_MUTEX(sk_lookup_prog_mutex);
+
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ret;
+
+	if (unlikely(attr->attach_flags))
+		return -EINVAL;
+
+	mutex_lock(&sk_lookup_prog_mutex);
+	ret = bpf_prog_attach_one(&net->sk_lookup_prog,
+				  &sk_lookup_prog_mutex, prog,
+				  attr->attach_flags);
+	mutex_unlock(&sk_lookup_prog_mutex);
+
+	return ret;
+}
+
+int sk_lookup_prog_detach(const union bpf_attr *attr)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ret;
+
+	if (unlikely(attr->attach_flags))
+		return -EINVAL;
+
+	mutex_lock(&sk_lookup_prog_mutex);
+	ret = bpf_prog_detach_one(&net->sk_lookup_prog,
+				  &sk_lookup_prog_mutex);
+	mutex_unlock(&sk_lookup_prog_mutex);
+
+	return ret;
+}
+
+int sk_lookup_prog_query(const union bpf_attr *attr,
+			 union bpf_attr __user *uattr)
+{
+	struct net *net;
+	int ret;
+
+	net = get_net_ns_by_fd(attr->query.target_fd);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	ret = bpf_prog_query_one(&net->sk_lookup_prog, attr, uattr);
+
+	put_net(net);
+	return ret;
+}
+
+BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
+	   struct sock *, sk, u64, flags)
+{
+	if (unlikely(flags != 0))
+		return -EINVAL;
+	if (unlikely(sk_is_refcounted(sk)))
+		return -ESOCKTNOSUPPORT;
+
+	/* Check if socket is suitable for packet L3/L4 protocol */
+	if (sk->sk_protocol != ctx->protocol)
+		return -EPROTOTYPE;
+	if (sk->sk_family != ctx->family &&
+	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
+		return -EAFNOSUPPORT;
+
+	/* Select socket as lookup result */
+	ctx->selected_sk = sk;
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
+	.func		= bpf_sk_lookup_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_SOCKET,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_lookup_assign_proto;
+	case BPF_FUNC_sk_release:
+		return &bpf_sk_release_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static bool sk_lookup_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+		if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+			return false;
+		bpf_ctx_record_field_size(info, size_default);
+		break;
+
+	case bpf_ctx_range(struct bpf_sk_lookup, family):
+	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
+		if (size != size_default)
+			return false;
+		break;
+
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+#define CHECK_FIELD_SIZE(BPF_TYPE, BPF_FIELD, KERN_TYPE, KERN_FIELD)	\
+	BUILD_BUG_ON(sizeof_field(BPF_TYPE, BPF_FIELD) <		\
+		     sizeof_field(KERN_TYPE, KERN_FIELD))
+
+#define LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, OFF)			\
+	BPF_LDX_MEM(SIZE, si->dst_reg, si->src_reg,			\
+		    bpf_target_off(TYPE, FIELD,				\
+				   sizeof_field(TYPE, FIELD),		\
+				   target_size) + (OFF))
+
+#define LOAD_FIELD_SIZE(TYPE, FIELD, SIZE) \
+	LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, 0)
+
+#define LOAD_FIELD(TYPE, FIELD) \
+	LOAD_FIELD_SIZE(TYPE, FIELD, BPF_FIELD_SIZEOF(TYPE, FIELD))
+
+static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog,
+					u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sk_lookup, family):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, family,
+				 struct bpf_sk_lookup_kern, family);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, family);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, protocol):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, protocol,
+				 struct bpf_sk_lookup_kern, protocol);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, protocol);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_ip4):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_ip4,
+				 struct bpf_sk_lookup_kern, v4.saddr);
+		*insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.saddr,
+					  BPF_SIZE(si->code));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_ip4):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_ip4,
+				 struct bpf_sk_lookup_kern, v4.daddr);
+		*insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.daddr,
+					  BPF_SIZE(si->code));
+
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_ip6[0],
+				 struct bpf_sk_lookup_kern,
+				 v6.saddr.s6_addr32[0]);
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
+		*insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+					      v6.saddr.s6_addr32[0],
+					      BPF_SIZE(si->code), off);
+#else
+		(void)off;
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_ip6[0],
+				 struct bpf_sk_lookup_kern,
+				 v6.daddr.s6_addr32[0]);
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
+		*insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+					      v6.daddr.s6_addr32[0],
+					      BPF_SIZE(si->code), off);
+#else
+		(void)off;
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_port):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, remote_port,
+				 struct bpf_sk_lookup_kern, sport);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, sport);
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_port):
+		CHECK_FIELD_SIZE(struct bpf_sk_lookup, local_port,
+				 struct bpf_sk_lookup_kern, dport);
+		*insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, dport);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_prog_ops sk_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+	.get_func_proto		= sk_lookup_func_proto,
+	.is_valid_access	= sk_lookup_is_valid_access,
+	.convert_ctx_access	= sk_lookup_convert_ctx_access,
+};
+
 #endif /* CONFIG_INET */
 
 DEFINE_BPF_DISPATCHER(xdp)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index ded304c96a05..4a6653c64210 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -398,6 +398,7 @@ class PrinterHelpers(Printer):
 
     type_fwds = [
             'struct bpf_fib_lookup',
+            'struct bpf_sk_lookup',
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
@@ -438,6 +439,7 @@ class PrinterHelpers(Printer):
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
+            'struct bpf_sk_lookup',
             'struct bpf_sock',
             'struct bpf_sock_addr',
             'struct bpf_sock_ops',
@@ -469,6 +471,11 @@ class PrinterHelpers(Printer):
             'struct sk_msg_buff': 'struct sk_msg_md',
             'struct xdp_buff': 'struct xdp_md',
     }
+    # Helpers overloaded for different context types.
+    overloaded_helpers = [
+        'bpf_get_socket_cookie',
+        'bpf_sk_assign',
+    ]
 
     def print_header(self):
         header = '''\
@@ -525,7 +532,7 @@ class PrinterHelpers(Printer):
         for i, a in enumerate(proto['args']):
             t = a['type']
             n = a['name']
-            if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
+            if proto['name'] in self.overloaded_helpers and i == 0:
                     t = 'void'
                     n = 'ctx'
             one_arg = '{}{}'.format(comma, self.map_type(t))
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 03/17] inet: Store layer 4 protocol in inet_hashinfo
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Lorenz Bauer

Make it possible to identify the protocol of sockets stored in hashinfo
without looking up a socket.

Subsequent patches make use of the new field at socket lookup time to
ensure that the BPF program selects only sockets with a matching protocol.

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_hashtables.h | 3 +++
 net/dccp/proto.c              | 2 +-
 net/ipv4/tcp_ipv4.c           | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ad64ba6a057f..6072dfbd1078 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -144,6 +144,9 @@ struct inet_hashinfo {
 	unsigned int			lhash2_mask;
 	struct inet_listen_hashbucket	*lhash2;
 
+	/* Layer 4 protocol of the stored sockets */
+	int				protocol;
+
 	/* All the above members are written once at bootup and
 	 * never written again _or_ are predominantly read-access.
 	 *
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 4af8a98fe784..c826419e68e6 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(dccp_statistics);
 struct percpu_counter dccp_orphan_count;
 EXPORT_SYMBOL_GPL(dccp_orphan_count);
 
-struct inet_hashinfo dccp_hashinfo;
+struct inet_hashinfo dccp_hashinfo = { .protocol = IPPROTO_DCCP };
 EXPORT_SYMBOL_GPL(dccp_hashinfo);
 
 /* the maximum queue length for tx in packets. 0 is no limit */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6c05f1ceb538..77e4f4e4c73c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -87,7 +87,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 			       __be32 daddr, __be32 saddr, const struct tcphdr *th);
 #endif
 
-struct inet_hashinfo tcp_hashinfo;
+struct inet_hashinfo tcp_hashinfo = { .protocol = IPPROTO_TCP };
 EXPORT_SYMBOL(tcp_hashinfo);
 
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 03/17] inet: Store layer 4 protocol in inet_hashinfo
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Make it possible to identify the protocol of sockets stored in hashinfo
without looking up a socket.

Subsequent patches make use of the new field at socket lookup time to
ensure that the BPF program selects only sockets with a matching protocol.

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_hashtables.h | 3 +++
 net/dccp/proto.c              | 2 +-
 net/ipv4/tcp_ipv4.c           | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ad64ba6a057f..6072dfbd1078 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -144,6 +144,9 @@ struct inet_hashinfo {
 	unsigned int			lhash2_mask;
 	struct inet_listen_hashbucket	*lhash2;
 
+	/* Layer 4 protocol of the stored sockets */
+	int				protocol;
+
 	/* All the above members are written once at bootup and
 	 * never written again _or_ are predominantly read-access.
 	 *
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 4af8a98fe784..c826419e68e6 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(dccp_statistics);
 struct percpu_counter dccp_orphan_count;
 EXPORT_SYMBOL_GPL(dccp_orphan_count);
 
-struct inet_hashinfo dccp_hashinfo;
+struct inet_hashinfo dccp_hashinfo = { .protocol = IPPROTO_DCCP };
 EXPORT_SYMBOL_GPL(dccp_hashinfo);
 
 /* the maximum queue length for tx in packets. 0 is no limit */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6c05f1ceb538..77e4f4e4c73c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -87,7 +87,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 			       __be32 daddr, __be32 saddr, const struct tcphdr *th);
 #endif
 
-struct inet_hashinfo tcp_hashinfo;
+struct inet_hashinfo tcp_hashinfo = { .protocol = IPPROTO_TCP };
 EXPORT_SYMBOL(tcp_hashinfo);
 
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 04/17] inet: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Prepare for calling into reuseport from __inet_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/inet_hashtables.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2bbaaf0c7176..ab64834837c8 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -246,6 +246,21 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /*
  * Here are some nice properties to exploit here. The BSD API
  * does not allow a listening sock to specify the remote port nor the
@@ -265,21 +280,17 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr,
 				      dif, sdif, exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet_ehashfn(net, daddr, hnum,
-						     saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 04/17] inet: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Prepare for calling into reuseport from __inet_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/inet_hashtables.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2bbaaf0c7176..ab64834837c8 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -246,6 +246,21 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /*
  * Here are some nice properties to exploit here. The BSD API
  * does not allow a listening sock to specify the remote port nor the
@@ -265,21 +280,17 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr,
 				      dif, sdif, exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet_ehashfn(net, daddr, hnum,
-						     saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

Run a BPF program before looking up a listening socket on the receive path.
The program selects a listening socket to yield as the result of socket
lookup by calling the bpf_sk_assign() helper and returning the BPF_REDIRECT
code.

Alternatively, the program can fail the lookup by returning BPF_DROP, or
let the lookup continue as usual by returning BPF_OK.

This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.

With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.
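
As a rough sketch of how a program expresses the three verdicts (purely
illustrative; the program and section names are made up, and a redirecting
program would first select a socket with bpf_sk_assign() as in patch 2):

  #include <linux/bpf.h>
  #include <linux/in.h>
  #include <bpf/bpf_helpers.h>

  SEC("sk_lookup")
  int block_telnet(struct bpf_sk_lookup *ctx)
  {
          /* Refuse TCP connections to port 23; the lookup then fails
           * as if no listener existed.
           */
          if (ctx->protocol == IPPROTO_TCP && ctx->local_port == 23)
                  return BPF_DROP;

          /* BPF_OK falls back to the regular hashtable lookup.
           * Returning BPF_REDIRECT instead would use the socket
           * previously recorded with bpf_sk_assign().
           */
          return BPF_OK;
  }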

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 6072dfbd1078..3fcbc8f66f88 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk);
+
+static inline struct sock *bpf_sk_lookup_run(struct net *net,
+					     struct bpf_sk_lookup_kern *ctx)
+{
+	struct bpf_prog *prog;
+	int ret = BPF_OK;
+
+	rcu_read_lock();
+	prog = rcu_dereference(net->sk_lookup_prog);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, ctx);
+	rcu_read_unlock();
+
+	if (ret == BPF_DROP)
+		return ERR_PTR(-ECONNREFUSED);
+	if (ret == BPF_REDIRECT)
+		return ctx->selected_sk;
+	return NULL;
+}
+
+static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
+					       __be32 saddr, __be16 sport,
+					       __be32 daddr, u16 dport)
+{
+	struct bpf_sk_lookup_kern ctx = {
+		.family		= AF_INET,
+		.protocol	= protocol,
+		.v4.saddr	= saddr,
+		.v4.daddr	= daddr,
+		.sport		= sport,
+		.dport		= dport,
+	};
+
+	return bpf_sk_lookup_run(net, &ctx);
+}
+
 #endif /* _INET_HASHTABLES_H */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ab64834837c8..f4d07285591a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
 				    const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
-	struct sock *result = NULL;
+	struct sock *result, *reuse_sk;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	result = inet_lookup_run_bpf(net, hashinfo->protocol,
+				     saddr, sport, daddr, hnum);
+	if (IS_ERR(result))
+		return NULL;
+	if (result) {
+		reuse_sk = lookup_reuseport(net, result, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			result = reuse_sk;
+		goto done;
+	}
+
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Run a BPF program before looking up a listening socket on the receive path.
The program selects a listening socket to yield as the result of socket
lookup by calling the bpf_sk_assign() helper and returning the BPF_REDIRECT
code.

Alternatively, the program can fail the lookup by returning BPF_DROP, or
let the lookup continue as usual by returning BPF_OK.

This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.

With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 6072dfbd1078..3fcbc8f66f88 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk);
+
+static inline struct sock *bpf_sk_lookup_run(struct net *net,
+					     struct bpf_sk_lookup_kern *ctx)
+{
+	struct bpf_prog *prog;
+	int ret = BPF_OK;
+
+	rcu_read_lock();
+	prog = rcu_dereference(net->sk_lookup_prog);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, ctx);
+	rcu_read_unlock();
+
+	if (ret == BPF_DROP)
+		return ERR_PTR(-ECONNREFUSED);
+	if (ret == BPF_REDIRECT)
+		return ctx->selected_sk;
+	return NULL;
+}
+
+static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
+					       __be32 saddr, __be16 sport,
+					       __be32 daddr, u16 dport)
+{
+	struct bpf_sk_lookup_kern ctx = {
+		.family		= AF_INET,
+		.protocol	= protocol,
+		.v4.saddr	= saddr,
+		.v4.daddr	= daddr,
+		.sport		= sport,
+		.dport		= dport,
+	};
+
+	return bpf_sk_lookup_run(net, &ctx);
+}
+
 #endif /* _INET_HASHTABLES_H */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ab64834837c8..f4d07285591a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
 				    const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
-	struct sock *result = NULL;
+	struct sock *result, *reuse_sk;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	result = inet_lookup_run_bpf(net, hashinfo->protocol,
+				     saddr, sport, daddr, hnum);
+	if (IS_ERR(result))
+		return NULL;
+	if (result) {
+		reuse_sk = lookup_reuseport(net, result, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			result = reuse_sk;
+		goto done;
+	}
+
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 06/17] inet6: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Prepare for calling into reuseport from inet6_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/inet6_hashtables.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index fbe9d4295eac..03942eef8ab6 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -111,6 +111,23 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *inet6_lhash2_lookup(struct net *net,
 		struct inet_listen_hashbucket *ilb2,
@@ -123,21 +140,17 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr, dif, sdif,
 				      exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet6_ehashfn(net, daddr, hnum,
-						      saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 06/17] inet6: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Prepare for calling into reuseport from inet6_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/inet6_hashtables.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index fbe9d4295eac..03942eef8ab6 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -111,6 +111,23 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *inet6_lhash2_lookup(struct net *net,
 		struct inet_listen_hashbucket *ilb2,
@@ -123,21 +140,17 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr, dif, sdif,
 				      exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet6_ehashfn(net, daddr, hnum,
-						      saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 07/17] inet6: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

Following the ipv4 stack changes, run a BPF program attached to the netns
before looking up a listening socket. The program can return a listening
socket to use as the result of socket lookup, fail the lookup, or take no
action.
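
With this hook in place, IPv6 packets reach the program with each address
exposed as four 32-bit words in network byte order, per the context
introduced in patch 2. A purely illustrative match, with made-up names and
address:

  #include <linux/bpf.h>
  #include <sys/socket.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  SEC("sk_lookup")
  int match_v6(struct bpf_sk_lookup *ctx)
  {
          /* 2001:db8::1 as four network-order 32-bit words */
          if (ctx->family != AF_INET6 ||
              ctx->local_ip6[0] != bpf_htonl(0x20010db8) ||
              ctx->local_ip6[1] != 0 || ctx->local_ip6[2] != 0 ||
              ctx->local_ip6[3] != bpf_htonl(1))
                  return BPF_OK;

          /* Matched. A complete program would fetch a socket from a
           * map, record it with bpf_sk_assign(), and return
           * BPF_REDIRECT here.
           */
          return BPF_OK;
  }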

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet6_hashtables.h | 20 ++++++++++++++++++++
 net/ipv6/inet6_hashtables.c    | 15 ++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 81b965953036..8b8c0cb92ea8 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -21,6 +21,7 @@
 
 #include <net/ipv6.h>
 #include <net/netns/hash.h>
+#include <net/inet_hashtables.h>
 
 struct inet_hashinfo;
 
@@ -103,6 +104,25 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 			  const int dif);
 
 int inet6_hash(struct sock *sk);
+
+static inline struct sock *inet6_lookup_run_bpf(struct net *net, u8 protocol,
+						const struct in6_addr *saddr,
+						__be16 sport,
+						const struct in6_addr *daddr,
+						u16 dport)
+{
+	struct bpf_sk_lookup_kern ctx = {
+		.family		= AF_INET6,
+		.protocol	= protocol,
+		.v6.saddr	= *saddr,
+		.v6.daddr	= *daddr,
+		.sport		= sport,
+		.dport		= dport,
+	};
+
+	return bpf_sk_lookup_run(net, &ctx);
+}
+
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
 #define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif, __sdif) \
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 03942eef8ab6..6d91de89fd2b 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -167,9 +167,22 @@ struct sock *inet6_lookup_listener(struct net *net,
 		const unsigned short hnum, const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
-	struct sock *result = NULL;
+	struct sock *result, *reuse_sk;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	result = inet6_lookup_run_bpf(net, hashinfo->protocol,
+				      saddr, sport, daddr, hnum);
+	if (IS_ERR(result))
+		return NULL;
+	if (result) {
+		reuse_sk = lookup_reuseport(net, result, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			result = reuse_sk;
+		goto done;
+	}
+
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 07/17] inet6: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Following the ipv4 stack changes, run a BPF program attached to the netns
before looking up a listening socket. The program can return a listening
socket to use as the result of socket lookup, fail the lookup, or take no
action.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet6_hashtables.h | 20 ++++++++++++++++++++
 net/ipv6/inet6_hashtables.c    | 15 ++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 81b965953036..8b8c0cb92ea8 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -21,6 +21,7 @@
 
 #include <net/ipv6.h>
 #include <net/netns/hash.h>
+#include <net/inet_hashtables.h>
 
 struct inet_hashinfo;
 
@@ -103,6 +104,25 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 			  const int dif);
 
 int inet6_hash(struct sock *sk);
+
+static inline struct sock *inet6_lookup_run_bpf(struct net *net, u8 protocol,
+						const struct in6_addr *saddr,
+						__be16 sport,
+						const struct in6_addr *daddr,
+						u16 dport)
+{
+	struct bpf_sk_lookup_kern ctx = {
+		.family		= AF_INET6,
+		.protocol	= protocol,
+		.v6.saddr	= *saddr,
+		.v6.daddr	= *daddr,
+		.sport		= sport,
+		.dport		= dport,
+	};
+
+	return bpf_sk_lookup_run(net, &ctx);
+}
+
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
 #define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif, __sdif) \
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 03942eef8ab6..6d91de89fd2b 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -167,9 +167,22 @@ struct sock *inet6_lookup_listener(struct net *net,
 		const unsigned short hnum, const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
-	struct sock *result = NULL;
+	struct sock *result, *reuse_sk;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	result = inet6_lookup_run_bpf(net, hashinfo->protocol,
+				      saddr, sport, daddr, hnum);
+	if (IS_ERR(result))
+		return NULL;
+	if (result) {
+		reuse_sk = lookup_reuseport(net, result, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			result = reuse_sk;
+		goto done;
+	}
+
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 08/17] udp: Store layer 4 protocol in udp_table
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Lorenz Bauer

Because UDP and UDP-Lite share code, we pass the L4 protocol identifier
alongside the UDP socket table to functions which need to distinguish
between the two protocols.

Put the protocol identifier in the UDP table itself, so that the protocol
is known to any function in the call chain that operates on the socket table.

Subsequent patches make use of the new udp_table field at socket lookup
time to ensure that the BPF program selects only sockets with a matching
protocol.
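
As a sketch of the intended consumption (purely illustrative; the actual
UDP hook-up lands later in the series, and the function below merely reuses
the IPv4 context builder added in patch 5):

  /* The protocol now travels with the table, so a shared UDP/UDP-Lite
   * lookup path can describe the packet to BPF without an extra proto
   * argument.
   */
  static struct sock *udp_lookup_run_bpf_sketch(struct net *net,
                                                struct udp_table *udptable,
                                                __be32 saddr, __be16 sport,
                                                __be32 daddr, u16 dport)
  {
          return inet_lookup_run_bpf(net, udptable->protocol,
                                     saddr, sport, daddr, dport);
  }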

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/udp.h   | 10 ++++++----
 net/ipv4/udp.c      | 15 +++++++--------
 net/ipv4/udp_impl.h |  2 +-
 net/ipv4/udplite.c  |  4 ++--
 net/ipv6/udp.c      | 12 ++++++------
 net/ipv6/udp_impl.h |  2 +-
 net/ipv6/udplite.c  |  2 +-
 7 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index a8fa6c0c6ded..f81c46c71fee 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -63,16 +63,18 @@ struct udp_hslot {
 /**
  *	struct udp_table - UDP table
  *
- *	@hash:	hash table, sockets are hashed on (local port)
- *	@hash2:	hash table, sockets are hashed on (local port, local address)
- *	@mask:	number of slots in hash tables, minus 1
- *	@log:	log2(number of slots in hash table)
+ *	@hash:		hash table, sockets are hashed on (local port)
+ *	@hash2:		hash table, sockets are hashed on local (port, address)
+ *	@mask:		number of slots in hash tables, minus 1
+ *	@log:		log2(number of slots in hash table)
+ *	@protocol:	layer 4 protocol of the stored sockets
  */
 struct udp_table {
 	struct udp_hslot	*hash;
 	struct udp_hslot	*hash2;
 	unsigned int		mask;
 	unsigned int		log;
+	int			protocol;
 };
 extern struct udp_table udp_table;
 void udp_table_init(struct udp_table *, const char *);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 32564b350823..ce96b1746ddf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -113,7 +113,7 @@
 #include <net/addrconf.h>
 #include <net/udp_tunnel.h>
 
-struct udp_table udp_table __read_mostly;
+struct udp_table udp_table __read_mostly = { .protocol = IPPROTO_UDP };
 EXPORT_SYMBOL(udp_table);
 
 long sysctl_udp_mem[3] __read_mostly;
@@ -2145,8 +2145,7 @@ EXPORT_SYMBOL(udp_sk_rx_dst_set);
 static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 				    struct udphdr  *uh,
 				    __be32 saddr, __be32 daddr,
-				    struct udp_table *udptable,
-				    int proto)
+				    struct udp_table *udptable)
 {
 	struct sock *sk, *first = NULL;
 	unsigned short hnum = ntohs(uh->dest);
@@ -2202,7 +2201,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	} else {
 		kfree_skb(skb);
 		__UDP_INC_STATS(net, UDP_MIB_IGNOREDMULTI,
-				proto == IPPROTO_UDPLITE);
+				udptable->protocol == IPPROTO_UDPLITE);
 	}
 	return 0;
 }
@@ -2279,8 +2278,7 @@ static int udp_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
  *	All we need to do is get the socket, and then do a checksum.
  */
 
-int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
-		   int proto)
+int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable)
 {
 	struct sock *sk;
 	struct udphdr *uh;
@@ -2288,6 +2286,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
 	struct net *net = dev_net(skb->dev);
+	int proto = udptable->protocol;
 	bool refcounted;
 
 	/*
@@ -2330,7 +2329,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 
 	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
 		return __udp4_lib_mcast_deliver(net, skb, uh,
-						saddr, daddr, udptable, proto);
+						saddr, daddr, udptable);
 
 	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 	if (sk)
@@ -2504,7 +2503,7 @@ int udp_v4_early_demux(struct sk_buff *skb)
 
 int udp_rcv(struct sk_buff *skb)
 {
-	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp4_lib_rcv(skb, &udp_table);
 }
 
 void udp_destroy_sock(struct sock *sk)
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 6b2fa77eeb1c..7013535f9084 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -6,7 +6,7 @@
 #include <net/protocol.h>
 #include <net/inet_common.h>
 
-int __udp4_lib_rcv(struct sk_buff *, struct udp_table *, int);
+int __udp4_lib_rcv(struct sk_buff *, struct udp_table *);
 int __udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
 
 int udp_v4_get_port(struct sock *sk, unsigned short snum);
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 5936d66d1ce2..4e4e85de95b2 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -14,12 +14,12 @@
 #include <linux/proc_fs.h>
 #include "udp_impl.h"
 
-struct udp_table 	udplite_table __read_mostly;
+struct udp_table udplite_table __read_mostly = { .protocol = IPPROTO_UDPLITE };
 EXPORT_SYMBOL(udplite_table);
 
 static int udplite_rcv(struct sk_buff *skb)
 {
-	return __udp4_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
+	return __udp4_lib_rcv(skb, &udplite_table);
 }
 
 static int udplite_err(struct sk_buff *skb, u32 info)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7d4151747340..f7866fded418 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -741,7 +741,7 @@ static void udp6_csum_zero_error(struct sk_buff *skb)
  */
 static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		const struct in6_addr *saddr, const struct in6_addr *daddr,
-		struct udp_table *udptable, int proto)
+		struct udp_table *udptable)
 {
 	struct sock *sk, *first = NULL;
 	const struct udphdr *uh = udp_hdr(skb);
@@ -803,7 +803,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	} else {
 		kfree_skb(skb);
 		__UDP6_INC_STATS(net, UDP_MIB_IGNOREDMULTI,
-				 proto == IPPROTO_UDPLITE);
+				 udptable->protocol == IPPROTO_UDPLITE);
 	}
 	return 0;
 }
@@ -836,11 +836,11 @@ static int udp6_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
 	return 0;
 }
 
-int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
-		   int proto)
+int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable)
 {
 	const struct in6_addr *saddr, *daddr;
 	struct net *net = dev_net(skb->dev);
+	int proto = udptable->protocol;
 	struct udphdr *uh;
 	struct sock *sk;
 	bool refcounted;
@@ -905,7 +905,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	 */
 	if (ipv6_addr_is_multicast(daddr))
 		return __udp6_lib_mcast_deliver(net, skb,
-				saddr, daddr, udptable, proto);
+				saddr, daddr, udptable);
 
 	/* Unicast */
 	sk = __udp6_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
@@ -1014,7 +1014,7 @@ INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
 
 INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
 {
-	return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp6_lib_rcv(skb, &udp_table);
 }
 
 /*
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index 20e324b6f358..acd5a942c633 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -8,7 +8,7 @@
 #include <net/inet_common.h>
 #include <net/transp_v6.h>
 
-int __udp6_lib_rcv(struct sk_buff *, struct udp_table *, int);
+int __udp6_lib_rcv(struct sk_buff *, struct udp_table *);
 int __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
 		   __be32, struct udp_table *);
 
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index bf7a7acd39b1..f442ed595e6f 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -14,7 +14,7 @@
 
 static int udplitev6_rcv(struct sk_buff *skb)
 {
-	return __udp6_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
+	return __udp6_lib_rcv(skb, &udplite_table);
 }
 
 static int udplitev6_err(struct sk_buff *skb,
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 08/17] udp: Store layer 4 protocol in udp_table
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Because UDP and UDP-Lite share code, we pass the L4 protocol identifier
alongside the UDP socket table to functions which need to distinguish
between the two protocols.

Put the protocol identifier in the UDP table itself, so that the protocol
is known to any function in the call chain that operates on the socket table.

Subsequent patches make use of the new udp_table field at socket lookup
time to ensure that the BPF program selects only sockets with a matching
protocol.

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/udp.h   | 10 ++++++----
 net/ipv4/udp.c      | 15 +++++++--------
 net/ipv4/udp_impl.h |  2 +-
 net/ipv4/udplite.c  |  4 ++--
 net/ipv6/udp.c      | 12 ++++++------
 net/ipv6/udp_impl.h |  2 +-
 net/ipv6/udplite.c  |  2 +-
 7 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index a8fa6c0c6ded..f81c46c71fee 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -63,16 +63,18 @@ struct udp_hslot {
 /**
  *	struct udp_table - UDP table
  *
- *	@hash:	hash table, sockets are hashed on (local port)
- *	@hash2:	hash table, sockets are hashed on (local port, local address)
- *	@mask:	number of slots in hash tables, minus 1
- *	@log:	log2(number of slots in hash table)
+ *	@hash:		hash table, sockets are hashed on (local port)
+ *	@hash2:		hash table, sockets are hashed on local (port, address)
+ *	@mask:		number of slots in hash tables, minus 1
+ *	@log:		log2(number of slots in hash table)
+ *	@protocol:	layer 4 protocol of the stored sockets
  */
 struct udp_table {
 	struct udp_hslot	*hash;
 	struct udp_hslot	*hash2;
 	unsigned int		mask;
 	unsigned int		log;
+	int			protocol;
 };
 extern struct udp_table udp_table;
 void udp_table_init(struct udp_table *, const char *);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 32564b350823..ce96b1746ddf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -113,7 +113,7 @@
 #include <net/addrconf.h>
 #include <net/udp_tunnel.h>
 
-struct udp_table udp_table __read_mostly;
+struct udp_table udp_table __read_mostly = { .protocol = IPPROTO_UDP };
 EXPORT_SYMBOL(udp_table);
 
 long sysctl_udp_mem[3] __read_mostly;
@@ -2145,8 +2145,7 @@ EXPORT_SYMBOL(udp_sk_rx_dst_set);
 static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 				    struct udphdr  *uh,
 				    __be32 saddr, __be32 daddr,
-				    struct udp_table *udptable,
-				    int proto)
+				    struct udp_table *udptable)
 {
 	struct sock *sk, *first = NULL;
 	unsigned short hnum = ntohs(uh->dest);
@@ -2202,7 +2201,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	} else {
 		kfree_skb(skb);
 		__UDP_INC_STATS(net, UDP_MIB_IGNOREDMULTI,
-				proto == IPPROTO_UDPLITE);
+				udptable->protocol == IPPROTO_UDPLITE);
 	}
 	return 0;
 }
@@ -2279,8 +2278,7 @@ static int udp_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
  *	All we need to do is get the socket, and then do a checksum.
  */
 
-int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
-		   int proto)
+int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable)
 {
 	struct sock *sk;
 	struct udphdr *uh;
@@ -2288,6 +2286,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct rtable *rt = skb_rtable(skb);
 	__be32 saddr, daddr;
 	struct net *net = dev_net(skb->dev);
+	int proto = udptable->protocol;
 	bool refcounted;
 
 	/*
@@ -2330,7 +2329,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 
 	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
 		return __udp4_lib_mcast_deliver(net, skb, uh,
-						saddr, daddr, udptable, proto);
+						saddr, daddr, udptable);
 
 	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 	if (sk)
@@ -2504,7 +2503,7 @@ int udp_v4_early_demux(struct sk_buff *skb)
 
 int udp_rcv(struct sk_buff *skb)
 {
-	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp4_lib_rcv(skb, &udp_table);
 }
 
 void udp_destroy_sock(struct sock *sk)
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 6b2fa77eeb1c..7013535f9084 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -6,7 +6,7 @@
 #include <net/protocol.h>
 #include <net/inet_common.h>
 
-int __udp4_lib_rcv(struct sk_buff *, struct udp_table *, int);
+int __udp4_lib_rcv(struct sk_buff *, struct udp_table *);
 int __udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
 
 int udp_v4_get_port(struct sock *sk, unsigned short snum);
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 5936d66d1ce2..4e4e85de95b2 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -14,12 +14,12 @@
 #include <linux/proc_fs.h>
 #include "udp_impl.h"
 
-struct udp_table 	udplite_table __read_mostly;
+struct udp_table udplite_table __read_mostly = { .protocol = IPPROTO_UDPLITE };
 EXPORT_SYMBOL(udplite_table);
 
 static int udplite_rcv(struct sk_buff *skb)
 {
-	return __udp4_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
+	return __udp4_lib_rcv(skb, &udplite_table);
 }
 
 static int udplite_err(struct sk_buff *skb, u32 info)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7d4151747340..f7866fded418 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -741,7 +741,7 @@ static void udp6_csum_zero_error(struct sk_buff *skb)
  */
 static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		const struct in6_addr *saddr, const struct in6_addr *daddr,
-		struct udp_table *udptable, int proto)
+		struct udp_table *udptable)
 {
 	struct sock *sk, *first = NULL;
 	const struct udphdr *uh = udp_hdr(skb);
@@ -803,7 +803,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	} else {
 		kfree_skb(skb);
 		__UDP6_INC_STATS(net, UDP_MIB_IGNOREDMULTI,
-				 proto == IPPROTO_UDPLITE);
+				 udptable->protocol == IPPROTO_UDPLITE);
 	}
 	return 0;
 }
@@ -836,11 +836,11 @@ static int udp6_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
 	return 0;
 }
 
-int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
-		   int proto)
+int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable)
 {
 	const struct in6_addr *saddr, *daddr;
 	struct net *net = dev_net(skb->dev);
+	int proto = udptable->protocol;
 	struct udphdr *uh;
 	struct sock *sk;
 	bool refcounted;
@@ -905,7 +905,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	 */
 	if (ipv6_addr_is_multicast(daddr))
 		return __udp6_lib_mcast_deliver(net, skb,
-				saddr, daddr, udptable, proto);
+				saddr, daddr, udptable);
 
 	/* Unicast */
 	sk = __udp6_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
@@ -1014,7 +1014,7 @@ INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
 
 INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
 {
-	return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp6_lib_rcv(skb, &udp_table);
 }
 
 /*
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index 20e324b6f358..acd5a942c633 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -8,7 +8,7 @@
 #include <net/inet_common.h>
 #include <net/transp_v6.h>
 
-int __udp6_lib_rcv(struct sk_buff *, struct udp_table *, int);
+int __udp6_lib_rcv(struct sk_buff *, struct udp_table *);
 int __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
 		   __be32, struct udp_table *);
 
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index bf7a7acd39b1..f442ed595e6f 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -14,7 +14,7 @@
 
 static int udplitev6_rcv(struct sk_buff *skb)
 {
-	return __udp6_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
+	return __udp6_lib_rcv(skb, &udplite_table);
 }
 
 static int udplitev6_err(struct sk_buff *skb,
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 09/17] udp: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Prepare for calling into reuseport from __udp4_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/udp.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ce96b1746ddf..d4842f29294a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -405,6 +405,25 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
 			      udp_ehash_secret + net_hash_mix(net));
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
 				     __be32 saddr, __be16 sport,
@@ -415,7 +434,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = 0;
@@ -423,15 +441,11 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp_ehashfn(net, daddr, hnum,
-						   saddr, sport);
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			badness = score;
 			result = sk;
 		}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 09/17] udp: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Prepare for calling into reuseport from __udp4_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/udp.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ce96b1746ddf..d4842f29294a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -405,6 +405,25 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
 			      udp_ehash_secret + net_hash_mix(net));
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
 				     __be32 saddr, __be16 sport,
@@ -415,7 +434,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = 0;
@@ -423,15 +441,11 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp_ehashfn(net, daddr, hnum,
-						   saddr, sport);
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			badness = score;
 			result = sk;
 		}
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 10/17] udp: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

Following INET/TCP socket lookup changes, modify UDP socket lookup to let
BPF program select a receiving socket before searching for a socket by
destination address and port as usual.

Lookup of connected sockets that match packet 4-tuple is unaffected by this
change. BPF program runs, and potentially overrides the lookup result, only
if a 4-tuple match was not found.
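
Purely as an illustration of the resulting control flow (plain userspace
C, all names invented, not kernel code), the lookup order after this
patch can be pictured as:

#include <stdio.h>
#include <stdbool.h>

struct fake_sock { const char *name; bool connected; };

/* Stand-ins for the real lookup steps, for the sketch only. */
static struct fake_sock *lookup_exact(void)    { return NULL; }
static struct fake_sock *lookup_bpf(void)
{
	static struct fake_sock s = { "bpf-selected", false };
	return &s;
}
static struct fake_sock *lookup_wildcard(void)
{
	static struct fake_sock s = { "wildcard", false };
	return &s;
}

static struct fake_sock *udp_lookup(void)
{
	struct fake_sock *result, *sk;

	/* 1. A connected (4-tuple) match wins outright. */
	result = lookup_exact();
	if (result && result->connected)
		return result;

	/* 2. Otherwise BPF may redirect to a socket of its choice. */
	sk = lookup_bpf();
	if (sk)
		return sk;

	/* 3. A non-wildcard hit from step 1, if there was one. */
	if (result)
		return result;

	/* 4. Finally, wildcard (INADDR_ANY) sockets. */
	return lookup_wildcard();
}

int main(void)
{
	printf("selected: %s\n", udp_lookup()->name);
	return 0;
}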

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/udp.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d4842f29294a..18d8432f6551 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -460,7 +460,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		__be16 sport, __be32 daddr, __be16 dport, int dif,
 		int sdif, struct udp_table *udptable, struct sk_buff *skb)
 {
-	struct sock *result;
+	struct sock *result, *sk, *reuse_sk;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
@@ -469,18 +469,38 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard socket */
 	result = udp4_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
-		slot2 = hash2 & udptable->mask;
-		hslot2 = &udptable->hash2[slot2];
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
 
-		result = udp4_lib_lookup2(net, saddr, sport,
-					  htonl(INADDR_ANY), hnum, dif, sdif,
-					  hslot2, skb);
+	/* Lookup redirect from BPF */
+	sk = inet_lookup_run_bpf(net, udptable->protocol,
+				 saddr, sport, daddr, hnum);
+	if (IS_ERR(sk))
+		return NULL;
+	if (sk) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		result = reuse_sk ? : sk;
+		goto done;
 	}
+
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
+
+	/* Lookup wildcard sockets */
+	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp4_lib_lookup2(net, saddr, sport,
+				  htonl(INADDR_ANY), hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 10/17] udp: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Following INET/TCP socket lookup changes, modify UDP socket lookup to let
BPF program select a receiving socket before searching for a socket by
destination address and port as usual.

Lookup of connected sockets that match packet 4-tuple is unaffected by this
change. BPF program runs, and potentially overrides the lookup result, only
if a 4-tuple match was not found.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/udp.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d4842f29294a..18d8432f6551 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -460,7 +460,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		__be16 sport, __be32 daddr, __be16 dport, int dif,
 		int sdif, struct udp_table *udptable, struct sk_buff *skb)
 {
-	struct sock *result;
+	struct sock *result, *sk, *reuse_sk;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
@@ -469,18 +469,38 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard socket */
 	result = udp4_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
-		slot2 = hash2 & udptable->mask;
-		hslot2 = &udptable->hash2[slot2];
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
 
-		result = udp4_lib_lookup2(net, saddr, sport,
-					  htonl(INADDR_ANY), hnum, dif, sdif,
-					  hslot2, skb);
+	/* Lookup redirect from BPF */
+	sk = inet_lookup_run_bpf(net, udptable->protocol,
+				 saddr, sport, daddr, hnum);
+	if (IS_ERR(sk))
+		return NULL;
+	if (sk) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		result = reuse_sk ? : sk;
+		goto done;
 	}
+
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
+
+	/* Lookup wildcard sockets */
+	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp4_lib_lookup2(net, saddr, sport,
+				  htonl(INADDR_ANY), hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 11/17] udp6: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Prepare for calling into reuseport from __udp6_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/udp.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f7866fded418..ee2073329d25 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -141,6 +141,27 @@ static int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned int hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp6_lib_lookup2(struct net *net,
 		const struct in6_addr *saddr, __be16 sport,
@@ -150,7 +171,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = -1;
@@ -158,16 +178,11 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp6_ehashfn(net, daddr, hnum,
-						    saddr, sport);
-
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			badness = score;
 		}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 11/17] udp6: Extract helper for selecting socket from reuseport group
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Prepare for calling into reuseport from __udp6_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/udp.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f7866fded418..ee2073329d25 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -141,6 +141,27 @@ static int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned int hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp6_lib_lookup2(struct net *net,
 		const struct in6_addr *saddr, __be16 sport,
@@ -150,7 +171,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = -1;
@@ -158,16 +178,11 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp6_ehashfn(net, daddr, hnum,
-						    saddr, sport);
-
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			badness = score;
 		}
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 12/17] udp6: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

Same as for udp4, let BPF program override the socket lookup result by
selecting a receiving socket of its choice or failing the lookup, if no
connected UDP socket matched the packet 4-tuple.
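
An aside for this write-up (not part of the patch): with the series'
return codes, failing the lookup for some traffic is just a matter of
returning BPF_DROP. The port number below is an invented example:

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

SEC("sk_lookup")
int drop_blocked_udp(struct bpf_sk_lookup *ctx)
{
	const __u16 blocked_port = 1900;	/* example value */

	/* local_port is in host byte order per the context definition. */
	if (ctx->protocol == IPPROTO_UDP && ctx->local_port == blocked_port)
		return BPF_DROP;	/* lookup fails, packet gets no socket */

	return BPF_OK;			/* fall through to regular lookup */
}

char _license[] SEC("license") = "GPL";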

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/udp.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ee2073329d25..934f41a5e6ca 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -197,28 +197,47 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       int dif, int sdif, struct udp_table *udptable,
 			       struct sk_buff *skb)
 {
+	struct sock *result, *sk, *reuse_sk;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
-	struct sock *result;
 
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard sockets */
 	result = udp6_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
-		slot2 = hash2 & udptable->mask;
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
 
-		hslot2 = &udptable->hash2[slot2];
-
-		result = udp6_lib_lookup2(net, saddr, sport,
-					  &in6addr_any, hnum, dif, sdif,
-					  hslot2, skb);
+	/* Lookup redirect from BPF */
+	sk = inet6_lookup_run_bpf(net, udptable->protocol,
+				  saddr, sport, daddr, hnum);
+	if (IS_ERR(sk))
+		return NULL;
+	if (sk) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		result = reuse_sk ? : sk;
+		goto done;
 	}
+
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
+
+	/* Lookup wildcard sockets */
+	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp6_lib_lookup2(net, saddr, sport,
+				  &in6addr_any, hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 12/17] udp6: Run SK_LOOKUP BPF program on socket lookup
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Same as for udp4, let BPF program override the socket lookup result by
selecting a receiving socket of its choice or failing the lookup, if no
connected UDP socket matched the packet 4-tuple.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/udp.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ee2073329d25..934f41a5e6ca 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -197,28 +197,47 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       int dif, int sdif, struct udp_table *udptable,
 			       struct sk_buff *skb)
 {
+	struct sock *result, *sk, *reuse_sk;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
-	struct sock *result;
 
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard sockets */
 	result = udp6_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
-		slot2 = hash2 & udptable->mask;
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
 
-		hslot2 = &udptable->hash2[slot2];
-
-		result = udp6_lib_lookup2(net, saddr, sport,
-					  &in6addr_any, hnum, dif, sdif,
-					  hslot2, skb);
+	/* Lookup redirect from BPF */
+	sk = inet6_lookup_run_bpf(net, udptable->protocol,
+				  saddr, sport, daddr, hnum);
+	if (IS_ERR(sk))
+		return NULL;
+	if (sk) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		result = reuse_sk ? : sk;
+		goto done;
 	}
+
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
+
+	/* Lookup wildcard sockets */
+	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp6_lib_lookup2(net, saddr, sport,
+				  &in6addr_any, hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 13/17] bpf: Sync linux/bpf.h to tools/
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Lorenz Bauer

The newly added program type, context type and helper are used by tests in
a subsequent patch. Synchronize the header file.
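
Purely for illustration (not part of this header sync), a sketch of a
program using the new context type and the bpf_sk_assign() overload. The
map name, the port number, the use of a SOCKHASH map from this program
type, the bpf_sk_release() call, and the BPF_OK/BPF_DROP/BPF_REDIRECT
return codes are assumptions made for the example:

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

/* One slot holding the socket that should receive matching packets,
 * assumed to be populated from user space.
 */
struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} redir_map SEC(".maps");

SEC("sk_lookup")
int select_http_sock(struct bpf_sk_lookup *ctx)
{
	const __u16 http_port = 80;	/* example value */
	__u32 key = 0;
	struct bpf_sock *sk;
	int err;

	/* local_port is in host byte order per the context definition. */
	if (ctx->protocol != IPPROTO_TCP || ctx->local_port != http_port)
		return BPF_OK;			/* continue regular lookup */

	sk = bpf_map_lookup_elem(&redir_map, &key);
	if (!sk)
		return BPF_OK;

	err = bpf_sk_assign(ctx, sk, 0);	/* record the selection */
	bpf_sk_release(sk);			/* drop the map lookup reference */
	return err ? BPF_DROP : BPF_REDIRECT;
}

char _license[] SEC("license") = "GPL";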

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
    - Update after changes to bpf.h in earlier patch.

 tools/include/uapi/linux/bpf.h | 52 ++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d1932e23cec..03edf4ec7b7e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -188,6 +188,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -220,6 +221,7 @@ enum bpf_attach_type {
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
 	BPF_TRACE_ITER,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3050,6 +3052,10 @@ union bpf_attr {
  *
  * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3070,6 +3076,38 @@ union bpf_attr {
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed, the passed socket must be compatible
+ *		with the packet description provided by the *ctx* object.
+ *
+ *		The L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an
+ *		exact match, while the IP family (*AF_INET* or *AF_INET6*) only
+ *		needs to be compatible, that is, IPv6 sockets that are not
+ *		v6-only can also be selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP sockets, that is, sockets which have
+ *		the *SOCK_RCU_FREE* flag set, can be selected.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		**-EAFNOSUPPORT** if socket family (*sk->family*) is not
+ *		compatible with packet family (*ctx->family*).
+ *
+ *		**-EINVAL** if unsupported flags were specified.
+ *
+ *		**-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
+ *		match packet protocol (*ctx->protocol*).
+ *
+ *		**-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -4058,4 +4096,18 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	/* IP addresses allow 1,2,4-byte read and are in network byte order. */
+	__u32 remote_ip4;
+	__u32 remote_ip6[4];
+	__u32 remote_port;	/* network byte order */
+	__u32 local_ip4;
+	__u32 local_ip6[4];
+	__u32 local_port;	/* host byte order */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 13/17] bpf: Sync linux/bpf.h to tools/
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

The newly added program type, context type and helper are used by tests in
a subsequent patch. Synchronize the header file.

Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
    - Update after changes to bpf.h in earlier patch.

 tools/include/uapi/linux/bpf.h | 52 ++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d1932e23cec..03edf4ec7b7e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -188,6 +188,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -220,6 +221,7 @@ enum bpf_attach_type {
 	BPF_MODIFY_RETURN,
 	BPF_LSM_MAC,
 	BPF_TRACE_ITER,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3050,6 +3052,10 @@ union bpf_attr {
  *
  * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3070,6 +3076,38 @@ union bpf_attr {
  *					call from outside of TC ingress.
  *		* **-ESOCKTNOSUPPORT**	Socket type not supported (reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed, the passed socket must be compatible
+ *		with the packet description provided by the *ctx* object.
+ *
+ *		The L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an
+ *		exact match, while the IP family (*AF_INET* or *AF_INET6*) only
+ *		needs to be compatible, that is, IPv6 sockets that are not
+ *		v6-only can also be selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP sockets, that is, sockets which have
+ *		the *SOCK_RCU_FREE* flag set, can be selected.
+ *
+ *		The *flags* argument must be zero.
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		**-EAFNOSUPPORT** if socket family (*sk->family*) is not
+ *		compatible with packet family (*ctx->family*).
+ *
+ *		**-EINVAL** if unsupported flags were specified.
+ *
+ *		**-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
+ *		match packet protocol (*ctx->protocol*).
+ *
+ *		**-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -4058,4 +4096,18 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	/* IP addresses allow 1,2,4-byte read and are in network byte order. */
+	__u32 remote_ip4;
+	__u32 remote_ip6[4];
+	__u32 remote_port;	/* network byte order */
+	__u32 local_ip4;
+	__u32 local_ip6[4];
+	__u32 local_port;	/* host byte order */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 14/17] libbpf: Add support for SK_LOOKUP program type
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Make libbpf aware of the newly added program type, and assign it a
section name.
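
For illustration (not part of the patch), a sketch of how a program built
from the new "sk_lookup" section might be loaded with libbpf; the object
file name is a placeholder:

#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int err;

	/* "sk_lookup_prog.o" stands in for an object built from a program
	 * placed in the new "sk_lookup" ELF section.
	 */
	obj = bpf_object__open_file("sk_lookup_prog.o", NULL);
	if (libbpf_get_error(obj))
		return 1;

	bpf_object__for_each_program(prog, obj) {
		/* Normally inferred from the "sk_lookup" section name;
		 * shown here only to illustrate the new setter.
		 */
		bpf_program__set_sk_lookup(prog);
		bpf_program__set_expected_attach_type(prog, BPF_SK_LOOKUP);
	}

	err = bpf_object__load(obj);
	if (err)
		fprintf(stderr, "load failed: %d\n", err);

	/* Attaching to a network namespace uses the new BPF_SK_LOOKUP
	 * attach type (not shown; the attach path is added by earlier
	 * patches in the series).
	 */
	bpf_object__close(obj);
	return err ? 1 : 0;
}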

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
     - Add new libbpf symbols to version 0.0.9. (Andrii)

 tools/lib/bpf/libbpf.c        | 3 +++
 tools/lib/bpf/libbpf.h        | 2 ++
 tools/lib/bpf/libbpf.map      | 2 ++
 tools/lib/bpf/libbpf_probes.c | 1 +
 4 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6c2f46908f4d..ccded6cd310a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6524,6 +6524,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
 BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
 BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
 BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
+BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
 
 enum bpf_attach_type
 bpf_program__get_expected_attach_type(struct bpf_program *prog)
@@ -6690,6 +6691,8 @@ static const struct bpf_sec_def section_defs[] = {
 	BPF_EAPROG_SEC("cgroup/setsockopt",	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 						BPF_CGROUP_SETSOCKOPT),
 	BPF_PROG_SEC("struct_ops",		BPF_PROG_TYPE_STRUCT_OPS),
+	BPF_EAPROG_SEC("sk_lookup",		BPF_PROG_TYPE_SK_LOOKUP,
+						BPF_SK_LOOKUP),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8ea69558f0a8..7bb5a4f22740 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -346,6 +346,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
+LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
 
 LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
 LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
@@ -373,6 +374,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
+LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 0133d469d30b..2490c5e34297 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -262,4 +262,6 @@ LIBBPF_0.0.9 {
 		bpf_link_get_fd_by_id;
 		bpf_link_get_next_id;
 		bpf_program__attach_iter;
+		bpf_program__is_sk_lookup;
+		bpf_program__set_sk_lookup;
 } LIBBPF_0.0.8;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 2c92059c0c90..5c6d3e49f254 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -109,6 +109,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_STRUCT_OPS:
 	case BPF_PROG_TYPE_EXT:
 	case BPF_PROG_TYPE_LSM:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 	default:
 		break;
 	}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 14/17] libbpf: Add support for SK_LOOKUP program type
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: dccp

Make libbpf aware of the newly added program type, and assign it a
section name.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
     - Add new libbpf symbols to version 0.0.9. (Andrii)

 tools/lib/bpf/libbpf.c        | 3 +++
 tools/lib/bpf/libbpf.h        | 2 ++
 tools/lib/bpf/libbpf.map      | 2 ++
 tools/lib/bpf/libbpf_probes.c | 1 +
 4 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6c2f46908f4d..ccded6cd310a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6524,6 +6524,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
 BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
 BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
 BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
+BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
 
 enum bpf_attach_type
 bpf_program__get_expected_attach_type(struct bpf_program *prog)
@@ -6690,6 +6691,8 @@ static const struct bpf_sec_def section_defs[] = {
 	BPF_EAPROG_SEC("cgroup/setsockopt",	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 						BPF_CGROUP_SETSOCKOPT),
 	BPF_PROG_SEC("struct_ops",		BPF_PROG_TYPE_STRUCT_OPS),
+	BPF_EAPROG_SEC("sk_lookup",		BPF_PROG_TYPE_SK_LOOKUP,
+						BPF_SK_LOOKUP),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8ea69558f0a8..7bb5a4f22740 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -346,6 +346,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
+LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
 
 LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
 LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
@@ -373,6 +374,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
+LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 0133d469d30b..2490c5e34297 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -262,4 +262,6 @@ LIBBPF_0.0.9 {
 		bpf_link_get_fd_by_id;
 		bpf_link_get_next_id;
 		bpf_program__attach_iter;
+		bpf_program__is_sk_lookup;
+		bpf_program__set_sk_lookup;
 } LIBBPF_0.0.8;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 2c92059c0c90..5c6d3e49f254 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -109,6 +109,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_STRUCT_OPS:
 	case BPF_PROG_TYPE_EXT:
 	case BPF_PROG_TYPE_LSM:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 	default:
 		break;
 	}
-- 
2.25.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 15/17] selftests/bpf: Add verifier tests for bpf_sk_lookup context access
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Exercise verifier access checks for bpf_sk_lookup context fields.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
     - Adjust for fields renames in struct bpf_sk_lookup.

 .../selftests/bpf/verifier/ctx_sk_lookup.c    | 694 ++++++++++++++++++
 1 file changed, 694 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c

diff --git a/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c
new file mode 100644
index 000000000000..223163172fa9
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c
@@ -0,0 +1,694 @@
+{
+	"valid 1,2,4-byte read bpf_sk_lookup remote_ip4",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup remote_ip4",
+	.insns = {
+		/* 8-byte read */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 8-byte write */
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 4-byte write */
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 2-byte write */
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 1-byte write */
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_sk_lookup local_ip4",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4) + 3),
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup local_ip4",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_sk_lookup remote_ip6",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[3])),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[3]) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[3]) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup remote_ip6",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_sk_lookup local_ip6",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[3])),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[3]) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[3]) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup local_ip6",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_sk_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_sk_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_sk_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_sk_lookup family",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup family",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_sk_lookup family",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_sk_lookup family",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup family",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup family",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup family",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup family",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_sk_lookup protocol",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_sk_lookup protocol",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_sk_lookup protocol",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_sk_lookup protocol",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_sk_lookup protocol",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_sk_lookup protocol",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_sk_lookup protocol",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_sk_lookup protocol",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+},
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 16/17] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Name the BPF C file after the test case that uses it.

This frees up "test_sk_lookup" namespace for BPF sk_lookup program tests
introduced by the following patch.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/prog_tests/reference_tracking.c     | 2 +-
 .../bpf/progs/{test_sk_lookup_kern.c => test_ref_track_kern.c}  | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename tools/testing/selftests/bpf/progs/{test_sk_lookup_kern.c => test_ref_track_kern.c} (100%)

diff --git a/tools/testing/selftests/bpf/prog_tests/reference_tracking.c b/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
index fc0d7f4f02cf..106ca8bb2a8f 100644
--- a/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
+++ b/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
@@ -3,7 +3,7 @@
 
 void test_reference_tracking(void)
 {
-	const char *file = "test_sk_lookup_kern.o";
+	const char *file = "test_ref_track_kern.o";
 	const char *obj_name = "ref_track";
 	DECLARE_LIBBPF_OPTS(bpf_object_open_opts, open_opts,
 		.object_name = obj_name,
diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_ref_track_kern.c
similarity index 100%
rename from tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
rename to tools/testing/selftests/bpf/progs/test_ref_track_kern.c
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH bpf-next v2 17/17] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
@ 2020-05-11 18:52   ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 18:52 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau

Add tests to test_progs that exercise:

 - attaching/detaching/querying sk_lookup program,
 - overriding socket lookup result for TCP/UDP with BPF sk_lookup by
   a) selecting a socket fetched from a SOCKMAP, or
   b) failing the lookup with no match.

Tests cover two special cases:

 - selecting an IPv6 socket (non v6-only) to receive an IPv4 packet,
 - using BPF sk_lookup together with BPF sk_reuseport program.
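
For reference, the attach/detach/query flow listed above boils down to the
following userspace calls. This is a minimal sketch assuming libbpf, with
prog_fd being the fd of an already loaded sk_lookup program and error
handling omitted; it mirrors the attach_lookup_prog(), query_lookup_prog()
and detach_lookup_prog() helpers in the test below:

	int net_fd = open("/proc/self/ns/net", O_RDONLY);
	__u32 attach_flags = 0, prog_ids[1] = { 0 }, prog_cnt = 1;

	/* attach to the current network namespace */
	bpf_prog_attach(prog_fd, -1 /* target fd */, BPF_SK_LOOKUP, 0);

	/* query what is attached to the netns */
	bpf_prog_query(net_fd, BPF_SK_LOOKUP, 0 /* query flags */,
		       &attach_flags, prog_ids, &prog_cnt);

	/* detach again */
	bpf_prog_detach2(prog_fd, -1 /* attachable fd */, BPF_SK_LOOKUP);
	close(net_fd);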

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v2:
     - Adjust for fields renames in struct bpf_sk_lookup.

 .../selftests/bpf/prog_tests/sk_lookup.c      | 999 ++++++++++++++++++
 .../selftests/bpf/progs/test_sk_lookup_kern.c | 162 +++
 2 files changed, 1161 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
new file mode 100644
index 000000000000..96765b156f6f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
@@ -0,0 +1,999 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2020 Cloudflare
+/*
+ * Test BPF attach point for INET socket lookup (BPF_SK_LOOKUP).
+ *
+ * Tests exercise:
+ *
+ * 1. attaching/detaching/querying BPF sk_lookup program,
+ * 2. overriding socket lookup result by:
+ *    a) selecting a listening (TCP) or receiving (UDP) socket,
+ *    b) failing the lookup with no match.
+ *
+ * Special cases covered are:
+ * - selecting an IPv6 socket (non v6-only) to receive an IPv4 packet,
+ * - using BPF sk_lookup together with BPF sk_reuseport program.
+ *
+ * Tests run in a dedicated network namespace.
+ */
+
+#define _GNU_SOURCE
+#include <arpa/inet.h>
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+#include "test_sk_lookup_kern.skel.h"
+#include "test_progs.h"
+
+/* External (address, port) pairs the client sends packets to. */
+#define EXT_IP4		"127.0.0.1"
+#define EXT_IP6		"fd00::1"
+#define EXT_PORT	7007
+
+/* Internal (address, port) pairs the server listens/receives at. */
+#define INT_IP4		"127.0.0.2"
+#define INT_IP4_V6	"::ffff:127.0.0.2"
+#define INT_IP6		"fd00::2"
+#define INT_PORT	8008
+
+#define IO_TIMEOUT_SEC	3
+
+enum {
+	SERVER_A = 0,
+	SERVER_B = 1,
+	MAX_SERVERS,
+};
+
+struct inet_addr {
+	const char *ip;
+	unsigned short port;
+};
+
+struct test {
+	const char *desc;
+	struct bpf_program *lookup_prog;
+	struct bpf_program *reuseport_prog;
+	struct bpf_map *sock_map;
+	int sotype;
+	struct inet_addr send_to;
+	struct inet_addr recv_at;
+};
+
+static bool is_ipv6(const char *ip)
+{
+	return !!strchr(ip, ':');
+}
+
+static int make_addr(const char *ip, int port, struct sockaddr_storage *addr)
+{
+	struct sockaddr_in6 *addr6 = (void *)addr;
+	struct sockaddr_in *addr4 = (void *)addr;
+	int ret;
+
+	errno = 0;
+	if (is_ipv6(ip)) {
+		ret = inet_pton(AF_INET6, ip, &addr6->sin6_addr);
+		if (CHECK_FAIL(ret <= 0)) {
+			log_err("failed to convert IPv6 address '%s'", ip);
+			return -1;
+		}
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(port);
+	} else {
+		ret = inet_pton(AF_INET, ip, &addr4->sin_addr);
+		if (CHECK_FAIL(ret <= 0)) {
+			log_err("failed to convert IPv4 address '%s'", ip);
+			return -1;
+		}
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(port);
+	}
+	return 0;
+}
+
+static int setup_reuseport_prog(int sock_fd, struct bpf_program *reuseport_prog)
+{
+	int err, prog_fd;
+
+	prog_fd = bpf_program__fd(reuseport_prog);
+	if (prog_fd < 0) {
+		errno = -prog_fd;
+		log_err("failed to get fd for program '%s'",
+			bpf_program__name(reuseport_prog));
+		return -1;
+	}
+
+	err = setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+			 &prog_fd, sizeof(prog_fd));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to ATTACH_REUSEPORT_EBPF");
+		return -1;
+	}
+
+	return 0;
+}
+
+static socklen_t inetaddr_len(const struct sockaddr_storage *addr)
+{
+	return (addr->ss_family == AF_INET ? sizeof(struct sockaddr_in) :
+		addr->ss_family == AF_INET6 ? sizeof(struct sockaddr_in6) : 0);
+}
+
+static int make_socket_with_addr(int sotype, const char *ip, int port,
+				 struct sockaddr_storage *addr)
+{
+	struct timeval timeo = { .tv_sec = IO_TIMEOUT_SEC };
+	int err, fd;
+
+	err = make_addr(ip, port, addr);
+	if (err)
+		return -1;
+
+	fd = socket(addr->ss_family, sotype, 0);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to create listen socket");
+		return -1;
+	}
+
+	err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to set SO_SNDTIMEO");
+		return -1;
+	}
+
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo, sizeof(timeo));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to set SO_RCVTIMEO");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int make_server(int sotype, const char *ip, int port,
+		       struct bpf_program *reuseport_prog)
+{
+	struct sockaddr_storage addr = {0};
+	const int one = 1;
+	int err, fd = -1;
+
+	fd = make_socket_with_addr(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+
+	/* Enabled for UDPv6 sockets for IPv4-mapped IPv6 to work. */
+	if (sotype == SOCK_DGRAM) {
+		err = setsockopt(fd, SOL_IP, IP_RECVORIGDSTADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable IP_RECVORIGDSTADDR");
+			goto fail;
+		}
+	}
+
+	if (sotype == SOCK_DGRAM && addr.ss_family == AF_INET6) {
+		err = setsockopt(fd, SOL_IPV6, IPV6_RECVORIGDSTADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable IPV6_RECVORIGDSTADDR");
+			goto fail;
+		}
+	}
+
+	if (sotype == SOCK_STREAM) {
+		err = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable SO_REUSEADDR");
+			goto fail;
+		}
+	}
+
+	if (reuseport_prog) {
+		err = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable SO_REUSEPORT");
+			goto fail;
+		}
+	}
+
+	err = bind(fd, (void *)&addr, inetaddr_len(&addr));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to bind listen socket");
+		goto fail;
+	}
+
+	if (sotype == SOCK_STREAM) {
+		err = listen(fd, SOMAXCONN);
+		if (CHECK_FAIL(err)) {
+			log_err("failed to listen on port %d", port);
+			goto fail;
+		}
+	}
+
+	/* Late attach reuseport prog so we can have one init path */
+	if (reuseport_prog) {
+		err = setup_reuseport_prog(fd, reuseport_prog);
+		if (err)
+			goto fail;
+	}
+
+	return fd;
+fail:
+	close(fd);
+	return -1;
+}
+
+static int make_client(int sotype, const char *ip, int port)
+{
+	struct sockaddr_storage addr = {0};
+	int err, fd;
+
+	fd = make_socket_with_addr(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+
+	err = connect(fd, (void *)&addr, inetaddr_len(&addr));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to connect client socket");
+		goto fail;
+	}
+
+	return fd;
+fail:
+	close(fd);
+	return -1;
+}
+
+static int send_byte(int fd)
+{
+	ssize_t n;
+
+	errno = 0;
+	n = send(fd, "a", 1, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial send");
+		return -1;
+	}
+	return 0;
+}
+
+static int recv_byte(int fd)
+{
+	char buf[1];
+	ssize_t n;
+
+	n = recv(fd, buf, sizeof(buf), 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial recv");
+		return -1;
+	}
+	return 0;
+}
+
+static int tcp_recv_send(int server_fd)
+{
+	char buf[1];
+	int ret, fd;
+	ssize_t n;
+
+	fd = accept(server_fd, NULL, NULL);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to accept");
+		return -1;
+	}
+
+	n = recv(fd, buf, sizeof(buf), 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial recv");
+		ret = -1;
+		goto close;
+	}
+
+	n = send(fd, buf, n, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial send");
+		ret = -1;
+		goto close;
+	}
+
+	ret = 0;
+close:
+	close(fd);
+	return ret;
+}
+
+static void v4_to_v6(struct sockaddr_storage *ss)
+{
+	struct sockaddr_in6 *v6 = (struct sockaddr_in6 *)ss;
+	struct sockaddr_in v4 = *(struct sockaddr_in *)ss;
+
+	v6->sin6_family = AF_INET6;
+	v6->sin6_port = v4.sin_port;
+	v6->sin6_addr.s6_addr[10] = 0xff;
+	v6->sin6_addr.s6_addr[11] = 0xff;
+	memcpy(&v6->sin6_addr.s6_addr[12], &v4.sin_addr.s_addr, 4);
+}
+
+static int udp_recv_send(int server_fd)
+{
+	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
+	struct sockaddr_storage _src_addr = { 0 };
+	struct sockaddr_storage *src_addr = &_src_addr;
+	struct sockaddr_storage *dst_addr = NULL;
+	struct msghdr msg = { 0 };
+	struct iovec iov = { 0 };
+	struct cmsghdr *cm;
+	char buf[1];
+	int ret, fd;
+	ssize_t n;
+
+	iov.iov_base = buf;
+	iov.iov_len = sizeof(buf);
+
+	msg.msg_name = src_addr;
+	msg.msg_namelen = sizeof(*src_addr);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsg_buf;
+	msg.msg_controllen = sizeof(cmsg_buf);
+
+	errno = 0;
+	n = recvmsg(server_fd, &msg, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed to receive");
+		return -1;
+	}
+	if (CHECK_FAIL(msg.msg_flags & MSG_CTRUNC)) {
+		log_err("truncated cmsg");
+		return -1;
+	}
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if ((cm->cmsg_level == SOL_IP &&
+		     cm->cmsg_type == IP_ORIGDSTADDR) ||
+		    (cm->cmsg_level == SOL_IPV6 &&
+		     cm->cmsg_type == IPV6_ORIGDSTADDR)) {
+			dst_addr = (struct sockaddr_storage *)CMSG_DATA(cm);
+			break;
+		}
+		log_err("warning: ignored cmsg at level %d type %d",
+			cm->cmsg_level, cm->cmsg_type);
+	}
+	if (CHECK_FAIL(!dst_addr)) {
+		log_err("failed to get destination address");
+		return -1;
+	}
+
+	/* Server socket bound to IPv4-mapped IPv6 address */
+	if (src_addr->ss_family == AF_INET6 &&
+	    dst_addr->ss_family == AF_INET) {
+		v4_to_v6(dst_addr);
+	}
+
+	/* Reply from original destination address. */
+	fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to create tx socket");
+		return -1;
+	}
+
+	ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
+	if (CHECK_FAIL(ret)) {
+		log_err("failed to bind tx socket");
+		goto out;
+	}
+
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	n = sendmsg(fd, &msg, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed to send echo reply");
+		ret = -1;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	close(fd);
+	return ret;
+}
+
+static int tcp_echo_test(int client_fd, int server_fd)
+{
+	int err;
+
+	err = send_byte(client_fd);
+	if (err)
+		return -1;
+	err = tcp_recv_send(server_fd);
+	if (err)
+		return -1;
+	err = recv_byte(client_fd);
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static int udp_echo_test(int client_fd, int server_fd)
+{
+	int err;
+
+	err = send_byte(client_fd);
+	if (err)
+		return -1;
+	err = udp_recv_send(server_fd);
+	if (err)
+		return -1;
+	err = recv_byte(client_fd);
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static int attach_lookup_prog(struct bpf_program *prog)
+{
+	const char *prog_name = bpf_program__name(prog);
+	enum bpf_attach_type attach_type;
+	int err, prog_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(prog_fd < 0)) {
+		errno = -prog_fd;
+		log_err("failed to get fd for program '%s'", prog_name);
+		return -1;
+	}
+
+	attach_type = bpf_program__get_expected_attach_type(prog);
+	err = bpf_prog_attach(prog_fd, -1 /* target fd */, attach_type, 0);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to attach program '%s'", prog_name);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int detach_lookup_prog(struct bpf_program *prog)
+{
+	const char *prog_name = bpf_program__name(prog);
+	enum bpf_attach_type attach_type;
+	int err, prog_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(prog_fd < 0)) {
+		errno = -prog_fd;
+		log_err("failed to get fd for program '%s'", prog_name);
+		return -1;
+	}
+
+	attach_type = bpf_program__get_expected_attach_type(prog);
+	err = bpf_prog_detach2(prog_fd, -1 /* attachable fd */, attach_type);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to detach program '%s'", prog_name);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int update_lookup_map(struct bpf_map *map, int index, int sock_fd)
+{
+	int err, map_fd;
+	uint64_t value;
+
+	map_fd = bpf_map__fd(map);
+	if (CHECK_FAIL(map_fd < 0)) {
+		errno = -map_fd;
+		log_err("failed to get map FD");
+		return -1;
+	}
+
+	value = (uint64_t)sock_fd;
+	err = bpf_map_update_elem(map_fd, &index, &value, BPF_NOEXIST);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to update redir_map @ %d", index);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void query_lookup_prog(struct test_sk_lookup_kern *skel)
+{
+	struct bpf_program *lookup_prog = skel->progs.lookup_pass;
+	enum bpf_attach_type attach_type;
+	__u32 attach_flags = 0;
+	__u32 prog_ids[1] = { 0 };
+	__u32 prog_cnt = 1;
+	int net_fd = -1;
+	int err;
+
+	net_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (CHECK_FAIL(net_fd < 0)) {
+		log_err("failed to open /proc/self/ns/net");
+		return;
+	}
+
+	err = attach_lookup_prog(lookup_prog);
+	if (err)
+		goto close;
+
+	attach_type = bpf_program__get_expected_attach_type(lookup_prog);
+	err = bpf_prog_query(net_fd, attach_type, 0 /* query flags */,
+			     &attach_flags, prog_ids, &prog_cnt);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to query lookup prog");
+		goto detach;
+	}
+
+	errno = 0;
+	if (CHECK_FAIL(attach_flags != 0)) {
+		log_err("wrong attach_flags on query: %u", attach_flags);
+		goto detach;
+	}
+	if (CHECK_FAIL(prog_cnt != 1)) {
+		log_err("wrong program count on query: %u", prog_cnt);
+		goto detach;
+	}
+	if (CHECK_FAIL(prog_ids[0] == 0)) {
+		log_err("invalid program id on query: %u", prog_ids[0]);
+		goto detach;
+	}
+
+detach:
+	detach_lookup_prog(lookup_prog);
+close:
+	close(net_fd);
+}
+
+static void run_lookup_prog(const struct test *t)
+{
+	int client_fd, server_fds[MAX_SERVERS] = { -1 };
+	int i, err, server_idx;
+
+	err = attach_lookup_prog(t->lookup_prog);
+	if (err)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
+		server_fds[i] = make_server(t->sotype, t->recv_at.ip,
+					    t->recv_at.port, t->reuseport_prog);
+		if (server_fds[i] < 0)
+			goto close;
+
+		err = update_lookup_map(t->sock_map, i, server_fds[i]);
+		if (err)
+			goto detach;
+
+		/* want just one server for non-reuseport test */
+		if (!t->reuseport_prog)
+			break;
+	}
+
+	client_fd = make_client(t->sotype, t->send_to.ip, t->send_to.port);
+	if (client_fd < 0)
+		goto close;
+
+	/* reuseport prog always selects server B */
+	server_idx = t->reuseport_prog ? SERVER_B : SERVER_A;
+
+	if (t->sotype == SOCK_STREAM)
+		tcp_echo_test(client_fd, server_fds[server_idx]);
+	else
+		udp_echo_test(client_fd, server_fds[server_idx]);
+
+	close(client_fd);
+close:
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++)
+		close(server_fds[i]);
+detach:
+	detach_lookup_prog(t->lookup_prog);
+}
+
+static void test_override_lookup(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { EXT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4 redir addr",
+			.lookup_prog	= skel->progs.redir_ip4,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4 redir and reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, INT_PORT },
+			.reuseport_prog	= skel->progs.select_sock_b,
+		},
+		{
+			.desc		= "TCP IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { EXT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 redir addr",
+			.lookup_prog	= skel->progs.redir_ip6,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4->IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.recv_at	= { INT_IP4_V6, INT_PORT },
+			.send_to	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 redir and reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, INT_PORT },
+			.reuseport_prog	= skel->progs.select_sock_b,
+		},
+		{
+			.desc		= "UDP IPv4 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { EXT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 redir addr",
+			.lookup_prog	= skel->progs.redir_ip4,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 redir and reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, INT_PORT },
+			.reuseport_prog	= skel->progs.select_sock_b,
+		},
+		{
+			.desc		= "UDP IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { EXT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 redir addr",
+			.lookup_prog	= skel->progs.redir_ip6,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4->IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.recv_at	= { INT_IP4_V6, INT_PORT },
+			.send_to	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 redir and reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, INT_PORT },
+			.reuseport_prog	= skel->progs.select_sock_b,
+		},
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			run_lookup_prog(t);
+	}
+}
+
+static void drop_on_lookup(const struct test *t)
+{
+	struct sockaddr_storage dst = { 0 };
+	int client_fd, server_fd, err;
+	ssize_t n;
+
+	if (attach_lookup_prog(t->lookup_prog))
+		return;
+
+	server_fd = make_server(t->sotype, t->recv_at.ip, t->recv_at.port,
+				t->reuseport_prog);
+	if (server_fd < 0)
+		goto detach;
+
+	client_fd = make_socket_with_addr(t->sotype, t->send_to.ip,
+					  t->send_to.port, &dst);
+	if (client_fd < 0)
+		goto close_srv;
+
+	err = connect(client_fd, (void *)&dst, inetaddr_len(&dst));
+	if (t->sotype == SOCK_DGRAM) {
+		err = send_byte(client_fd);
+		if (err)
+			goto close_all;
+
+		/* Read out asynchronous error */
+		n = recv(client_fd, NULL, 0, 0);
+		err = n == -1;
+	}
+	if (CHECK_FAIL(!err || errno != ECONNREFUSED))
+		log_err("expected ECONNREFUSED on connect");
+
+close_all:
+	close(client_fd);
+close_srv:
+	close(server_fd);
+detach:
+	detach_lookup_prog(t->lookup_prog);
+}
+
+static void test_drop_on_lookup(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { EXT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { EXT_IP6, EXT_PORT },
+		},
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			drop_on_lookup(t);
+	}
+}
+
+static void drop_on_reuseport(const struct test *t)
+{
+	struct sockaddr_storage dst = { 0 };
+	int client, server1, server2, err;
+	ssize_t n;
+
+	if (attach_lookup_prog(t->lookup_prog))
+		return;
+
+	server1 = make_server(t->sotype, t->recv_at.ip, t->recv_at.port,
+			      t->reuseport_prog);
+	if (server1 < 0)
+		goto detach;
+
+	err = update_lookup_map(t->sock_map, SERVER_A, server1);
+	if (err)
+		goto detach;
+
+	/* second server on destination address we should never reach */
+	server2 = make_server(t->sotype, t->send_to.ip, t->send_to.port,
+			      NULL /* reuseport prog */);
+	if (server2 < 0)
+		goto close_srv1;
+
+	client = make_socket_with_addr(t->sotype, t->send_to.ip,
+				       t->send_to.port, &dst);
+	if (client < 0)
+		goto close_srv2;
+
+	err = connect(client, (void *)&dst, inetaddr_len(&dst));
+	if (t->sotype == SOCK_DGRAM) {
+		err = send_byte(client);
+		if (err)
+			goto close_all;
+
+		/* Read out asynchronous error */
+		n = recv(client, NULL, 0, 0);
+		err = n == -1;
+	}
+	if (CHECK_FAIL(!err || errno != ECONNREFUSED))
+		log_err("expected ECONNREFUSED on connect");
+
+close_all:
+	close(client);
+close_srv2:
+	close(server2);
+close_srv1:
+	close(server1);
+detach:
+	detach_lookup_prog(t->lookup_prog);
+}
+
+static void test_drop_on_reuseport(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP4, EXT_PORT },
+			.recv_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.send_to	= { EXT_IP6, EXT_PORT },
+			.recv_at	= { INT_IP6, INT_PORT },
+		},
+
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			drop_on_reuseport(t);
+	}
+}
+
+static void run_tests(struct test_sk_lookup_kern *skel)
+{
+	if (test__start_subtest("query lookup prog"))
+		query_lookup_prog(skel);
+	test_override_lookup(skel);
+	test_drop_on_lookup(skel);
+	test_drop_on_reuseport(skel);
+}
+
+static int switch_netns(int *saved_net)
+{
+	static const char * const setup_script[] = {
+		"ip -6 addr add dev lo " EXT_IP6 "/128 nodad",
+		"ip -6 addr add dev lo " INT_IP6 "/128 nodad",
+		"ip link set dev lo up",
+		NULL,
+	};
+	const char * const *cmd;
+	int net_fd, err;
+
+	net_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (CHECK_FAIL(net_fd < 0)) {
+		log_err("open(/proc/self/ns/net)");
+		return -1;
+	}
+
+	err = unshare(CLONE_NEWNET);
+	if (CHECK_FAIL(err)) {
+		log_err("unshare(CLONE_NEWNET)");
+		goto close;
+	}
+
+	for (cmd = setup_script; *cmd; cmd++) {
+		err = system(*cmd);
+		if (CHECK_FAIL(err)) {
+			log_err("system(%s)", *cmd);
+			goto close;
+		}
+	}
+
+	*saved_net = net_fd;
+	return 0;
+
+close:
+	close(net_fd);
+	return -1;
+}
+
+static void restore_netns(int saved_net)
+{
+	int err;
+
+	err = setns(saved_net, CLONE_NEWNET);
+	if (CHECK_FAIL(err))
+		log_err("setns(CLONE_NEWNET)");
+
+	close(saved_net);
+}
+
+void test_sk_lookup(void)
+{
+	struct test_sk_lookup_kern *skel;
+	int err, saved_net;
+
+	err = switch_netns(&saved_net);
+	if (err)
+		return;
+
+	skel = test_sk_lookup_kern__open_and_load();
+	if (CHECK_FAIL(!skel)) {
+		errno = 0;
+		log_err("failed to open and load BPF skeleton");
+		goto restore_netns;
+	}
+
+	run_tests(skel);
+
+	test_sk_lookup_kern__destroy(skel);
+restore_netns:
+	restore_netns(saved_net);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
new file mode 100644
index 000000000000..fc3ad9a69484
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2020 Cloudflare
+
+#include <linux/bpf.h>
+#include <sys/socket.h>
+
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+
+#define IP4(a, b, c, d)					\
+	bpf_htonl((((__u32)(a) & 0xffU) << 24) |	\
+		  (((__u32)(b) & 0xffU) << 16) |	\
+		  (((__u32)(c) & 0xffU) <<  8) |	\
+		  (((__u32)(d) & 0xffU) <<  0))
+#define IP6(aaaa, bbbb, cccc, dddd)			\
+	{ bpf_htonl(aaaa), bpf_htonl(bbbb), bpf_htonl(cccc), bpf_htonl(dddd) }
+
+#define MAX_SOCKS 32
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, MAX_SOCKS);
+	__type(key, __u32);
+	__type(value, __u64);
+} redir_map SEC(".maps");
+
+enum {
+	SERVER_A = 0,
+	SERVER_B = 1,
+};
+
+enum {
+	NO_FLAGS = 0,
+};
+
+static const __u32 DST_PORT = 7007;
+static const __u32 DST_IP4 = IP4(127, 0, 0, 1);
+static const __u32 DST_IP6[] = IP6(0xfd000000, 0x0, 0x0, 0x00000001);
+
+SEC("sk_lookup/lookup_pass")
+int lookup_pass(struct bpf_sk_lookup *ctx)
+{
+	return BPF_OK;
+}
+
+SEC("sk_lookup/lookup_drop")
+int lookup_drop(struct bpf_sk_lookup *ctx)
+{
+	return BPF_DROP;
+}
+
+SEC("sk_reuseport/reuse_pass")
+int reuseport_pass(struct sk_reuseport_md *ctx)
+{
+	return SK_PASS;
+}
+
+SEC("sk_reuseport/reuse_drop")
+int reuseport_drop(struct sk_reuseport_md *ctx)
+{
+	return SK_DROP;
+}
+
+/* Redirect packets destined for port DST_PORT to socket at redir_map[0]. */
+SEC("sk_lookup/redir_port")
+int redir_port(struct bpf_sk_lookup *ctx)
+{
+	__u32 key = SERVER_A;
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &key);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, NO_FLAGS);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+/* Redirect packets destined for DST_IP4 address to socket at redir_map[0]. */
+SEC("sk_lookup/redir_ip4")
+int redir_ip4(struct bpf_sk_lookup *ctx)
+{
+	__u32 key = SERVER_A;
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->family != AF_INET)
+		return BPF_OK;
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+	if (ctx->local_ip4 != DST_IP4)
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &key);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, NO_FLAGS);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+/* Redirect packets destined for DST_IP6 address to socket at redir_map[0]. */
+SEC("sk_lookup/redir_ip6")
+int redir_ip6(struct bpf_sk_lookup *ctx)
+{
+	__u32 key = SERVER_A;
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->family != AF_INET6)
+		return BPF_OK;
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+	if (ctx->local_ip6[0] != DST_IP6[0] ||
+	    ctx->local_ip6[1] != DST_IP6[1] ||
+	    ctx->local_ip6[2] != DST_IP6[2] ||
+	    ctx->local_ip6[3] != DST_IP6[3])
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &key);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, NO_FLAGS);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+SEC("sk_lookup/select_sock_a")
+int select_sock_a(struct bpf_sk_lookup *ctx)
+{
+	__u32 key = SERVER_A;
+	struct bpf_sock *sk;
+	int err;
+
+	sk = bpf_map_lookup_elem(&redir_map, &key);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, NO_FLAGS);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+SEC("sk_reuseport/select_sock_b")
+int select_sock_b(struct sk_reuseport_md *ctx)
+{
+	__u32 key = SERVER_B;
+	int err;
+
+	err = bpf_sk_select_reuseport(ctx, &redir_map, &key, NO_FLAGS);
+	return err ? SK_DROP : SK_PASS;
+}
+
+char _license[] SEC("license") = "Dual BSD/GPL";
+__u32 _version SEC("version") = 1;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-11 19:06     ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-11 19:06 UTC (permalink / raw)
  To: netdev, bpf
  Cc: dccp, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Eric Dumazet, Gerrit Renker, Jakub Kicinski,
	Andrii Nakryiko, Martin KaFai Lau, Marek Majkowski, Lorenz Bauer

On Mon, May 11, 2020 at 08:52 PM CEST, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
> called BPF_SK_LOOKUP. The new program kind is to be invoked by the
> transport layer when looking up a socket for a received packet.
>
> When called, SK_LOOKUP program can select a socket that will receive the
> packet. This serves as a mechanism to overcome the limits of what bind()
> API allows to express. Two use-cases driving this work are:
>
>  (1) steer packets destined to an IP range, fixed port to a socket
>
>      192.0.2.0/24, port 80 -> NGINX socket
>
>  (2) steer packets destined to an IP address, any port to a socket
>
>      198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context, program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v2:
>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>       Update bpf_sk_assign docs accordingly. (Martin)
>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)

I forgot to call out one more change in v2 to this patch:

      - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)

[...]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
  2020-05-11 18:52 ` Jakub Sitnicki
@ 2020-05-11 19:45   ` Martin KaFai Lau
  -1 siblings, 0 replies; 68+ messages in thread
From: Martin KaFai Lau @ 2020-05-11 19:45 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko

On Mon, May 11, 2020 at 08:52:01PM +0200, Jakub Sitnicki wrote:

[ ... ]

> Performance considerations
> ==========================
> 
> Patch set adds new code on receive hot path. This comes with a cost,
> especially in a scenario of a SYN flood or small UDP packet flood.
> 
> Measuring the performance penalty turned out to be harder than expected
> because socket lookup is fast. For CPUs to spend >= 1% of time in socket
> lookup we had to modify our setup by unloading iptables and reducing the
> number of routes.
> 
> The receiver machine is a Cloudflare Gen 9 server covered in detail at [0].
> In short:
> 
>  - 24 core Intel custom off-roadmap 1.9Ghz 150W (Skylake) CPU
>  - dual-port 25G Mellanox ConnectX-4 NIC
>  - 256G DDR4 2666Mhz RAM
> 
> Flood traffic pattern:
> 
>  - source: 1 IP, 10k ports
>  - destination: 1 IP, 1 port
>  - TCP - SYN packet
>  - UDP - Len=0 packet
> 
> Receiver setup:
> 
>  - ingress traffic spread over 4 RX queues,
>  - RX/TX pause and autoneg disabled,
>  - Intel Turbo Boost disabled,
>  - TCP SYN cookies always on.
> 
> For TCP test there is a receiver process with single listening socket
> open. Receiver is not accept()'ing connections.
> 
> For UDP the receiver process has a single UDP socket with a filter
> installed, dropping the packets.
> 
> With such setup in place, we record RX pps and cpu-cycles events under
> flood for 60 seconds in 3 configurations:
> 
>  1. 5.6.3 kernel w/o this patch series (baseline),
>  2. 5.6.3 kernel with patches applied, but no SK_LOOKUP program attached,
>  3. 5.6.3 kernel with patches applied, and SK_LOOKUP program attached;
>     BPF program [1] is doing a lookup LPM_TRIE map with 200 entries.
Is the link in [1] up-to-date?  I don't see it calling bpf_sk_assign().

> 
> RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.
> 
> | tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
> | no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
> | with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |
> 
> | tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
> | no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
> | with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |
> 
> | udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
> | no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
> | with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |
> 
> | udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
> | no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
What is causing this regression?

> | with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |
This also looks very different from udp4.

> 
> Also visualized on bpf-sk-lookup-v1-rx-pps.png chart [2].
> 
> cpu-cycles measured with `perf record -F 999 --cpu 1-4 -g -- sleep 60`.
> 
> |                              |      cpu-cycles events |          |
> | tcp4 SYN flood               | __inet_lookup_listener | Δ events |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     |                  1.12% |        - |
> | no SK_LOOKUP prog attached   |                  1.31% |    0.19% |
> | with SK_LOOKUP prog attached |                  3.05% |    1.93% |
> 
> |                              |      cpu-cycles events |          |
> | tcp6 SYN flood               |  inet6_lookup_listener | Δ events |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     |                  1.05% |        - |
> | no SK_LOOKUP prog attached   |                  1.68% |    0.63% |
> | with SK_LOOKUP prog attached |                  3.15% |    2.10% |
> 
> |                              |      cpu-cycles events |          |
> | udp4 0-len flood             |      __udp4_lib_lookup | Δ events |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     |                  3.81% |        - |
> | no SK_LOOKUP prog attached   |                  5.22% |    1.41% |
> | with SK_LOOKUP prog attached |                  8.20% |    4.39% |
> 
> |                              |      cpu-cycles events |          |
> | udp6 0-len flood             |      __udp6_lib_lookup | Δ events |
> |------------------------------+------------------------+----------|
> | 5.6.3 vanilla (baseline)     |                  5.51% |        - |
> | no SK_LOOKUP prog attached   |                  6.51% |    1.00% |
> | with SK_LOOKUP prog attached |                 10.14% |    4.63% |
> 
> Also visualized on bpf-sk-lookup-v1-cpu-cycles.png chart [3].
> 

[ ... ]

> 
> [0] https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/
> [1] https://github.com/majek/inet-tool/blob/master/ebpf/inet-kern.c
> [2] https://drive.google.com/file/d/1HrrjWhQoVlqiqT73_eLtWMPhuGPKhGFX/
> [3] https://drive.google.com/file/d/1cYPPOlGg7M-bkzI4RW1SOm49goI4LYbb/
> [RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
> [RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-11 20:44     ` Alexei Starovoitov
  -1 siblings, 0 replies; 68+ messages in thread
From: Alexei Starovoitov @ 2020-05-11 20:44 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
> Run a BPF program before looking up a listening socket on the receive path.
> Program selects a listening socket to yield as result of socket lookup by
> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
> 
> Alternatively, program can also fail the lookup by returning with BPF_DROP,
> or let the lookup continue as usual with BPF_OK on return.
> 
> This lets the user match packets with listening sockets freely at the last
> possible point on the receive path, where we know that packets are destined
> for local delivery after undergoing policing, filtering, and routing.
> 
> With BPF code selecting the socket, directing packets destined to an IP
> range or to a port range to a single socket becomes possible.
> 
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
>  net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
>  2 files changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 6072dfbd1078..3fcbc8f66f88 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  
>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
>  		      struct sock *sk);
> +
> +static inline struct sock *bpf_sk_lookup_run(struct net *net,
> +					     struct bpf_sk_lookup_kern *ctx)
> +{
> +	struct bpf_prog *prog;
> +	int ret = BPF_OK;
> +
> +	rcu_read_lock();
> +	prog = rcu_dereference(net->sk_lookup_prog);
> +	if (prog)
> +		ret = BPF_PROG_RUN(prog, ctx);
> +	rcu_read_unlock();
> +
> +	if (ret == BPF_DROP)
> +		return ERR_PTR(-ECONNREFUSED);
> +	if (ret == BPF_REDIRECT)
> +		return ctx->selected_sk;
> +	return NULL;
> +}
> +
> +static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
> +					       __be32 saddr, __be16 sport,
> +					       __be32 daddr, u16 dport)
> +{
> +	struct bpf_sk_lookup_kern ctx = {
> +		.family		= AF_INET,
> +		.protocol	= protocol,
> +		.v4.saddr	= saddr,
> +		.v4.daddr	= daddr,
> +		.sport		= sport,
> +		.dport		= dport,
> +	};
> +
> +	return bpf_sk_lookup_run(net, &ctx);
> +}
> +
>  #endif /* _INET_HASHTABLES_H */
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ab64834837c8..f4d07285591a 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
>  				    const int dif, const int sdif)
>  {
>  	struct inet_listen_hashbucket *ilb2;
> -	struct sock *result = NULL;
> +	struct sock *result, *reuse_sk;
>  	unsigned int hash2;
>  
> +	/* Lookup redirect from BPF */
> +	result = inet_lookup_run_bpf(net, hashinfo->protocol,
> +				     saddr, sport, daddr, hnum);
> +	if (IS_ERR(result))
> +		return NULL;
> +	if (result) {
> +		reuse_sk = lookup_reuseport(net, result, skb, doff,
> +					    saddr, sport, daddr, hnum);
> +		if (reuse_sk)
> +			result = reuse_sk;
> +		goto done;
> +	}
> +

The overhead is too high to do this all the time.
The feature has to be static_key-ed.
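
Something along these lines on top of bpf_sk_lookup_run() would keep the
no-prog case down to a patched-out branch (sketch only; the key name is
made up here, and the attach/detach paths would static_branch_inc()/dec()
it):

	DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);

	static inline struct sock *bpf_sk_lookup_run(struct net *net,
						     struct bpf_sk_lookup_kern *ctx)
	{
		struct bpf_prog *prog;
		int ret = BPF_OK;

		/* Jump is patched out while no program is attached. */
		if (!static_branch_unlikely(&bpf_sk_lookup_enabled))
			return NULL;

		rcu_read_lock();
		prog = rcu_dereference(net->sk_lookup_prog);
		if (prog)
			ret = BPF_PROG_RUN(prog, ctx);
		rcu_read_unlock();

		if (ret == BPF_DROP)
			return ERR_PTR(-ECONNREFUSED);
		if (ret == BPF_REDIRECT)
			return ctx->selected_sk;
		return NULL;
	}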

Also please add multi-prog support. Adding it later will cause
all sorts of compatibility issues. The semantics of multi-prog
needs to be thought through right now.
For example BPF_DROP or BPF_REDIRECT could terminate the prog_run_array
sequence of progs while BPF_OK could continue.
It's not ideal, but better than nothing.
Another option could be to execute all attached progs regardless
of return code, but don't let second prog override selected_sk blindly.
bpf_sk_assign() could get smarter.
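
The first option would look roughly like this (hand-waving the run_array
plumbing; BPF_DROP and BPF_REDIRECT cut the walk short, BPF_OK moves on
to the next program):

	const struct bpf_prog_array_item *item;
	const struct bpf_prog *prog;
	int ret = BPF_OK;

	item = &run_array->items[0];
	while ((prog = READ_ONCE(item->prog))) {
		ret = BPF_PROG_RUN(prog, ctx);
		if (ret != BPF_OK)	/* DROP or REDIRECT ends the walk */
			break;
		item++;
	}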

Also please switch to bpf_link way of attaching. All system wide attachments
should be visible and easily debuggable via 'bpftool link show'.
Currently we're converting tc and xdp hooks to bpf_link. This new hook
should have it from the beginning.
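
On the user side that would be a BPF_LINK_CREATE with the netns fd as the
target, something like (sketch, assuming the netns target gets wired up
to the existing bpf_link_create() wrapper):

	int netns_fd, link_fd;

	netns_fd = open("/proc/self/ns/net", O_RDONLY);
	link_fd = bpf_link_create(bpf_program__fd(skel->progs.redir_port),
				  netns_fd, BPF_SK_LOOKUP, NULL);
	if (link_fd < 0)
		error(1, errno, "failed to create sk_lookup link");
	/* attachment is now visible in 'bpftool link show' and detaches
	 * when the last link fd (or pin) goes away
	 */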

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
  2020-05-11 18:52 ` Jakub Sitnicki
@ 2020-05-12 11:57     ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-12 11:57 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko

On Mon, May 11, 2020 at 09:45 PM CEST, Martin KaFai Lau wrote:
> On Mon, May 11, 2020 at 08:52:01PM +0200, Jakub Sitnicki wrote:
>
> [ ... ]
>
>> Performance considerations
>> ==========================
>>
>> Patch set adds new code on receive hot path. This comes with a cost,
>> especially in a scenario of a SYN flood or small UDP packet flood.
>>
>> Measuring the performance penalty turned out to be harder than expected
>> because socket lookup is fast. For CPUs to spend >= 1% of time in socket
>> lookup we had to modify our setup by unloading iptables and reducing the
>> number of routes.
>>
>> The receiver machine is a Cloudflare Gen 9 server covered in detail at [0].
>> In short:
>>
>>  - 24 core Intel custom off-roadmap 1.9Ghz 150W (Skylake) CPU
>>  - dual-port 25G Mellanox ConnectX-4 NIC
>>  - 256G DDR4 2666Mhz RAM
>>
>> Flood traffic pattern:
>>
>>  - source: 1 IP, 10k ports
>>  - destination: 1 IP, 1 port
>>  - TCP - SYN packet
>>  - UDP - Len=0 packet
>>
>> Receiver setup:
>>
>>  - ingress traffic spread over 4 RX queues,
>>  - RX/TX pause and autoneg disabled,
>>  - Intel Turbo Boost disabled,
>>  - TCP SYN cookies always on.
>>
>> For TCP test there is a receiver process with single listening socket
>> open. Receiver is not accept()'ing connections.
>>
>> For UDP the receiver process has a single UDP socket with a filter
>> installed, dropping the packets.
>>
>> With such setup in place, we record RX pps and cpu-cycles events under
>> flood for 60 seconds in 3 configurations:
>>
>>  1. 5.6.3 kernel w/o this patch series (baseline),
>>  2. 5.6.3 kernel with patches applied, but no SK_LOOKUP program attached,
>>  3. 5.6.3 kernel with patches applied, and SK_LOOKUP program attached;
>>     BPF program [1] is doing a lookup LPM_TRIE map with 200 entries.
> Is the link in [1] up-to-date?  I don't see it calling bpf_sk_assign().

Yes, it is, or rather was.

The reason why the inet-tool version you reviewed was not using
bpf_sk_assign(), but the "old way" from RFCv2, is that the switch to
map_lookup+sk_assign was done late in development, after changes to
SOCKMAP landed in bpf-next.

By that time performance tests were already in progress; since they take a
bit of time to set up, and the change affected only the scenario with a
program attached, I tested without this bit.

Sorry, I should have explained that in the cover letter. The next round
of benchmarks will be done against the now updated version of inet-tool
that uses bpf_sk_assign:

https://github.com/majek/inet-tool/commit/6a619c3743aaae6d4882cbbf11b616e1e468b436
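
For reference, the redirect logic in the updated program boils down to roughly
the following (simplified sketch; map names and the LPM_TRIE classification
step are illustrative, and the return codes follow this series):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
  	__uint(type, BPF_MAP_TYPE_SOCKMAP);
  	__uint(max_entries, 1);
  	__type(key, __u32);
  	__type(value, __u64);
  } redir_map SEC(".maps");

  SEC("sk_lookup")
  int redir_prog(struct bpf_sk_lookup *ctx)
  {
  	const __u32 key = 0;
  	struct bpf_sock *sk;
  	int err;

  	/* LPM_TRIE lookup on the destination address omitted here;
  	 * on a miss the program returns BPF_OK right away.
  	 */
  	sk = bpf_map_lookup_elem(&redir_map, &key);
  	if (!sk)
  		return BPF_OK;		/* fall back to regular lookup */

  	err = bpf_sk_assign(ctx, sk, 0);
  	bpf_sk_release(sk);
  	return err ? BPF_DROP : BPF_REDIRECT;
  }

  char _license[] SEC("license") = "GPL";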

>
>>
>> RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.
>>
>> | tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> |------------------------------+------------------------+----------|
>> | 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
>> | no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
>> | with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |
>>
>> | tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> |------------------------------+------------------------+----------|
>> | 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
>> | no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
>> | with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |
>>
>> | udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> |------------------------------+------------------------+----------|
>> | 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
>> | no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
>> | with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |
>>
>> | udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> |------------------------------+------------------------+----------|
>> | 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
>> | no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
> What is causing this regression?
>

I need to go back to archived perf.data and see if perf-annotate or
perf-diff provide any clues that will help me tell where CPU cycles are
going. Will get back to you on that.

Wild guess is that for udp6 we're loading and copying more data to
populate v6 addresses in program context. See inet6_lookup_run_bpf
(patch 7).

This makes me realize the copy is unnecessary, I could just store the
pointer to in6_addr{}. Will make this change in v3.
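
I.e., something along these lines for the kernel-side context (just a sketch
of the intended v3 change):

  struct bpf_sk_lookup_kern {
  	u16		family;
  	u16		protocol;
  	union {
  		struct {
  			__be32 saddr;
  			__be32 daddr;
  		} v4;
  		struct {
  			const struct in6_addr *saddr;	/* point into the header */
  			const struct in6_addr *daddr;	/* instead of copying 2x16B */
  		} v6;
  	};
  	__be16		sport;
  	u16		dport;
  	struct sock	*selected_sk;
  };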

As to why udp6 is taking a bigger hit than udp4 - comparing top 10 in
`perf report --no-children` shows that in our test setup, socket lookup
contributes less to CPU cycles on receive for udp4 than for udp6.

* udp4 baseline (no children)

# Overhead       Samples  Symbol
# ........  ............  ......................................
#
     8.11%         19429  [k] fib_table_lookup
     4.31%         10333  [k] udp_queue_rcv_one_skb
     3.75%          8991  [k] fib4_rule_action
     3.66%          8763  [k] __netif_receive_skb_core
     3.42%          8198  [k] fib_rules_lookup
     3.05%          7314  [k] fib4_rule_match
     2.71%          6507  [k] mlx5e_skb_from_cqe_linear
     2.58%          6192  [k] inet_gro_receive
     2.49%          5981  [k] __x86_indirect_thunk_rax
     2.36%          5656  [k] udp4_lib_lookup2

* udp6 baseline (no children)

# Overhead       Samples  Symbol
# ........  ............  ......................................
#
     4.63%         11100  [k] udpv6_queue_rcv_one_skb
     3.88%          9308  [k] __netif_receive_skb_core
     3.54%          8480  [k] udp6_lib_lookup2
     2.69%          6442  [k] mlx5e_skb_from_cqe_linear
     2.56%          6137  [k] ipv6_gro_receive
     2.31%          5540  [k] dev_gro_receive
     2.20%          5264  [k] do_csum
     2.02%          4835  [k] ip6_pol_route
     1.94%          4639  [k] __udp6_lib_lookup
     1.89%          4540  [k] selinux_socket_sock_rcv_skb

Notice that __udp4_lib_lookup didn't even make the cut. That could
explain why adding instructions to __udp6_lib_lookup has more effect on
RX PPS.

Frankly, that is something that surprised us, but we haven't had time to
investigate it further yet.

>> | with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |
> This also looks very different from udp4.
>

Thanks for the questions,
Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-12 13:52       ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-12 13:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
> On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
>> Run a BPF program before looking up a listening socket on the receive path.
>> Program selects a listening socket to yield as result of socket lookup by
>> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
>>
>> Alternatively, program can also fail the lookup by returning with BPF_DROP,
>> or let the lookup continue as usual with BPF_OK on return.
>>
>> This lets the user match packets with listening sockets freely at the last
>> possible point on the receive path, where we know that packets are destined
>> for local delivery after undergoing policing, filtering, and routing.
>>
>> With BPF code selecting the socket, directing packets destined to an IP
>> range or to a port range to a single socket becomes possible.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
>>  net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
>>  2 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>> index 6072dfbd1078..3fcbc8f66f88 100644
>> --- a/include/net/inet_hashtables.h
>> +++ b/include/net/inet_hashtables.h
>> @@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>>
>>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
>>  		      struct sock *sk);
>> +
>> +static inline struct sock *bpf_sk_lookup_run(struct net *net,
>> +					     struct bpf_sk_lookup_kern *ctx)
>> +{
>> +	struct bpf_prog *prog;
>> +	int ret = BPF_OK;
>> +
>> +	rcu_read_lock();
>> +	prog = rcu_dereference(net->sk_lookup_prog);
>> +	if (prog)
>> +		ret = BPF_PROG_RUN(prog, ctx);
>> +	rcu_read_unlock();
>> +
>> +	if (ret == BPF_DROP)
>> +		return ERR_PTR(-ECONNREFUSED);
>> +	if (ret == BPF_REDIRECT)
>> +		return ctx->selected_sk;
>> +	return NULL;
>> +}
>> +
>> +static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
>> +					       __be32 saddr, __be16 sport,
>> +					       __be32 daddr, u16 dport)
>> +{
>> +	struct bpf_sk_lookup_kern ctx = {
>> +		.family		= AF_INET,
>> +		.protocol	= protocol,
>> +		.v4.saddr	= saddr,
>> +		.v4.daddr	= daddr,
>> +		.sport		= sport,
>> +		.dport		= dport,
>> +	};
>> +
>> +	return bpf_sk_lookup_run(net, &ctx);
>> +}
>> +
>>  #endif /* _INET_HASHTABLES_H */
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index ab64834837c8..f4d07285591a 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
>>  				    const int dif, const int sdif)
>>  {
>>  	struct inet_listen_hashbucket *ilb2;
>> -	struct sock *result = NULL;
>> +	struct sock *result, *reuse_sk;
>>  	unsigned int hash2;
>>
>> +	/* Lookup redirect from BPF */
>> +	result = inet_lookup_run_bpf(net, hashinfo->protocol,
>> +				     saddr, sport, daddr, hnum);
>> +	if (IS_ERR(result))
>> +		return NULL;
>> +	if (result) {
>> +		reuse_sk = lookup_reuseport(net, result, skb, doff,
>> +					    saddr, sport, daddr, hnum);
>> +		if (reuse_sk)
>> +			result = reuse_sk;
>> +		goto done;
>> +	}
>> +
>
> The overhead is too high to do this all the time.
> The feature has to be static_key-ed.

Static keys are something that Lorenz has also suggested internally, but
we wanted to keep it simple at first.

Introduction of static keys forces us to decide when non-init_net netns
are allowed to attach to SK_LOOKUP, as enabling SK_LOOKUP in an isolated
netns will affect the rx path in init_net.

I see two options, which seem sensible:

1) limit SK_LOOKUP to init_net, which makes testing setup harder, or

2) allow non-init_net netns to attach to SK_LOOKUP only if static key
   has been already enabled (via sysctl?).
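
Either way, the hot path guard itself would be something like (sketch):

  DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);

  static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
  						 __be32 saddr, __be16 sport,
  						 __be32 daddr, u16 dport)
  {
  	struct bpf_sk_lookup_kern ctx;

  	if (!static_branch_unlikely(&bpf_sk_lookup_enabled))
  		return NULL;	/* no prog attached anywhere, near-zero cost */

  	ctx = (struct bpf_sk_lookup_kern) {
  		.family		= AF_INET,
  		.protocol	= protocol,
  		.v4.saddr	= saddr,
  		.v4.daddr	= daddr,
  		.sport		= sport,
  		.dport		= dport,
  	};
  	return bpf_sk_lookup_run(net, &ctx);
  }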

>
> Also please add multi-prog support. Adding it later will cause
> all sorts of compatibility issues. The semantics of multi-prog
> needs to be thought through right now.
> For example BPF_DROP or BPF_REDIRECT could terminate the prog_run_array
> sequence of progs while BPF_OK could continue.
> It's not ideal, but better than nothing.

I must say this approach is quite appealing because it's simple to
explain. I would need a custom BPF_PROG_RUN_ARRAY, though.

I'm curious what downside you see here.
Is overriding an earlier DROP/REDIRECT verdict useful?

> Another option could be to execute all attached progs regardless
> of return code, but don't let second prog override selected_sk blindly.
> bpf_sk_assign() could get smarter.

So if IIUC the rough idea here would be like below?

- 1st program calls

  bpf_sk_assign(ctx, sk1, 0 /*flags*/) -> 0 (OK)

- 2nd program calls

  bpf_sk_assign(ctx, sk2, 0) -> -EBUSY (already selected)
  bpf_sk_assign(ctx, sk2, BPF_EXIST) -> 0 (OK, replace existing)

In this case the last program to run has the final say, as opposed to
the semantics where DROP/REDIRECT terminates.

Also, 2nd and subsequent programs would probably need to know if and
which socket has been already selected. I think the selection could be
exposed in context as bpf_sock pointer.
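
The helper would then grow roughly into (sketch; flag handling only, the
protocol/family checks stay as they are in v2):

  BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
  	   struct sock *, sk, u64, flags)
  {
  	if (unlikely(flags & ~(u64)BPF_EXIST))
  		return -EINVAL;
  	if (unlikely(sk_is_refcounted(sk)))
  		return -ESOCKTNOSUPPORT;

  	/* Refuse to silently override an earlier program's selection
  	 * unless the caller explicitly asked for it with BPF_EXIST.
  	 */
  	if (ctx->selected_sk && !(flags & BPF_EXIST))
  		return -EBUSY;

  	ctx->selected_sk = sk;
  	return 0;
  }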

I admit, I can't quite see the benefit of running through all programs in
the array, so I'm tempted to go with terminating on DROP/REDIRECT in v3.

>
> Also please switch to bpf_link way of attaching. All system wide attachments
> should be visible and easily debuggable via 'bpftool link show'.
> Currently we're converting tc and xdp hooks to bpf_link. This new hook
> should have it from the beginning.

Will do in v3.

Thanks for feedback,
Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
  2020-05-11 18:52 ` Jakub Sitnicki
@ 2020-05-12 16:34       ` Martin KaFai Lau
  -1 siblings, 0 replies; 68+ messages in thread
From: Martin KaFai Lau @ 2020-05-12 16:34 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko

On Tue, May 12, 2020 at 01:57:45PM +0200, Jakub Sitnicki wrote:
> On Mon, May 11, 2020 at 09:45 PM CEST, Martin KaFai Lau wrote:
> > On Mon, May 11, 2020 at 08:52:01PM +0200, Jakub Sitnicki wrote:
> >
> > [ ... ]
> >
> >> Performance considerations
> >> ==========================
> >>
> >> Patch set adds new code on receive hot path. This comes with a cost,
> >> especially in a scenario of a SYN flood or small UDP packet flood.
> >>
> >> Measuring the performance penalty turned out to be harder than expected
> >> because socket lookup is fast. For CPUs to spend >= 1% of time in socket
> >> lookup we had to modify our setup by unloading iptables and reducing the
> >> number of routes.
> >>
> >> The receiver machine is a Cloudflare Gen 9 server covered in detail at [0].
> >> In short:
> >>
> >>  - 24 core Intel custom off-roadmap 1.9Ghz 150W (Skylake) CPU
> >>  - dual-port 25G Mellanox ConnectX-4 NIC
> >>  - 256G DDR4 2666Mhz RAM
> >>
> >> Flood traffic pattern:
> >>
> >>  - source: 1 IP, 10k ports
> >>  - destination: 1 IP, 1 port
> >>  - TCP - SYN packet
> >>  - UDP - Len=0 packet
> >>
> >> Receiver setup:
> >>
> >>  - ingress traffic spread over 4 RX queues,
> >>  - RX/TX pause and autoneg disabled,
> >>  - Intel Turbo Boost disabled,
> >>  - TCP SYN cookies always on.
> >>
> >> For TCP test there is a receiver process with single listening socket
> >> open. Receiver is not accept()'ing connections.
> >>
> >> For UDP the receiver process has a single UDP socket with a filter
> >> installed, dropping the packets.
> >>
> >> With such setup in place, we record RX pps and cpu-cycles events under
> >> flood for 60 seconds in 3 configurations:
> >>
> >>  1. 5.6.3 kernel w/o this patch series (baseline),
> >>  2. 5.6.3 kernel with patches applied, but no SK_LOOKUP program attached,
> >>  3. 5.6.3 kernel with patches applied, and SK_LOOKUP program attached;
> >>     BPF program [1] is doing a lookup LPM_TRIE map with 200 entries.
> > Is the link in [1] up-to-date?  I don't see it calling bpf_sk_assign().
> 
> Yes, it is, or rather was.
> 
> The reason why the inet-tool version you reviewed was not using
> bpf_sk_assign(), but the "old way" from RFCv2, is that the switch to
> map_lookup+sk_assign was done late in development, after changes to
> SOCKMAP landed in bpf-next.
> 
> By that time performance tests were already in progress, and since they
> take a bit of time to set up, and the change affected just the scenario
> with program attached, I tested without this bit.
> 
> Sorry, I should have explained that in the cover letter. The next round
> of benchmarks will be done against the now updated version of inet-tool
> that uses bpf_sk_assign:
> 
> https://github.com/majek/inet-tool/commit/6a619c3743aaae6d4882cbbf11b616e1e468b436
> 
> >
> >>
> >> RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.
> >>
> >> | tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
> >> |------------------------------+------------------------+----------|
> >> | 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
> >> | no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
> >> | with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |
> >>
> >> | tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
> >> |------------------------------+------------------------+----------|
> >> | 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
> >> | no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
> >> | with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |
> >>
> >> | udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
> >> |------------------------------+------------------------+----------|
> >> | 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
> >> | no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
> >> | with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |
> >>
> >> | udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
> >> |------------------------------+------------------------+----------|
> >> | 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
> >> | no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
> > What is causing this regression?
> >
> 
> I need to go back to archived perf.data and see if perf-annotate or
> perf-diff provide any clues that will help me tell where CPU cycles are
> going. Will get back to you on that.
> 
> Wild guess is that for udp6 we're loading and coping more data to
> populate v6 addresses in program context. See inet6_lookup_run_bpf
> (patch 7).
If that is the case,
rcu_access_pointer(net->sk_lookup_prog) should be tested first before
doing ctx initialization.
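
I.e. roughly (sketch; assumes the v6 addresses end up stored as pointers, as
suggested in the quoted reply):

  static inline struct sock *inet6_lookup_run_bpf(struct net *net, u8 protocol,
  						  const struct in6_addr *saddr,
  						  __be16 sport,
  						  const struct in6_addr *daddr,
  						  u16 dport)
  {
  	struct bpf_sk_lookup_kern ctx;

  	/* Cheap pointer check first; only build the context when a
  	 * prog is actually attached to this netns.
  	 */
  	if (!rcu_access_pointer(net->sk_lookup_prog))
  		return NULL;

  	ctx = (struct bpf_sk_lookup_kern) {
  		.family		= AF_INET6,
  		.protocol	= protocol,
  		.v6.saddr	= saddr,
  		.v6.daddr	= daddr,
  		.sport		= sport,
  		.dport		= dport,
  	};
  	return bpf_sk_lookup_run(net, &ctx);
  }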

> 
> This makes me realize the copy is unnecessary, I could just store the
> pointer to in6_addr{}. Will make this change in v3.
> 
> As to why udp6 is taking a bigger hit than udp4 - comparing top 10 in
> `perf report --no-children` shows that in our test setup, socket lookup
> contributes less to CPU cycles on receive for udp4 than for udp6.
> 
> * udp4 baseline (no children)
> 
> # Overhead       Samples  Symbol
> # ........  ............  ......................................
> #
>      8.11%         19429  [k] fib_table_lookup
>      4.31%         10333  [k] udp_queue_rcv_one_skb
>      3.75%          8991  [k] fib4_rule_action
>      3.66%          8763  [k] __netif_receive_skb_core
>      3.42%          8198  [k] fib_rules_lookup
>      3.05%          7314  [k] fib4_rule_match
>      2.71%          6507  [k] mlx5e_skb_from_cqe_linear
>      2.58%          6192  [k] inet_gro_receive
>      2.49%          5981  [k] __x86_indirect_thunk_rax
>      2.36%          5656  [k] udp4_lib_lookup2
> 
> * udp6 baseline (no children)
> 
> # Overhead       Samples  Symbol
> # ........  ............  ......................................
> #
>      4.63%         11100  [k] udpv6_queue_rcv_one_skb
>      3.88%          9308  [k] __netif_receive_skb_core
>      3.54%          8480  [k] udp6_lib_lookup2
>      2.69%          6442  [k] mlx5e_skb_from_cqe_linear
>      2.56%          6137  [k] ipv6_gro_receive
>      2.31%          5540  [k] dev_gro_receive
>      2.20%          5264  [k] do_csum
>      2.02%          4835  [k] ip6_pol_route
>      1.94%          4639  [k] __udp6_lib_lookup
>      1.89%          4540  [k] selinux_socket_sock_rcv_skb
> 
> Notice that __udp4_lib_lookup didn't even make the cut. That could
> explain why adding instructions to __udp6_lib_lookup has more effect on
> RX PPS.
> 
> Frankly, that is something that suprised us, but we didn't have time to
> investigate further, yet.
The perf report should be able to annotate the bpf prog also,
e.g. maybe part of it is because the bpf_prog itself is also dealing
with longer addresses?

> 
> >> | with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |
> > This also looks very different from udp4.
> >
> 
> Thanks for the questions,
> Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-12 23:58         ` Alexei Starovoitov
  -1 siblings, 0 replies; 68+ messages in thread
From: Alexei Starovoitov @ 2020-05-12 23:58 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Tue, May 12, 2020 at 03:52:52PM +0200, Jakub Sitnicki wrote:
> On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
> > On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
> >> Run a BPF program before looking up a listening socket on the receive path.
> >> Program selects a listening socket to yield as result of socket lookup by
> >> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
> >>
> >> Alternatively, program can also fail the lookup by returning with BPF_DROP,
> >> or let the lookup continue as usual with BPF_OK on return.
> >>
> >> This lets the user match packets with listening sockets freely at the last
> >> possible point on the receive path, where we know that packets are destined
> >> for local delivery after undergoing policing, filtering, and routing.
> >>
> >> With BPF code selecting the socket, directing packets destined to an IP
> >> range or to a port range to a single socket becomes possible.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>  include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
> >>  net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
> >>  2 files changed, 50 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> >> index 6072dfbd1078..3fcbc8f66f88 100644
> >> --- a/include/net/inet_hashtables.h
> >> +++ b/include/net/inet_hashtables.h
> >> @@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >>
> >>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> >>  		      struct sock *sk);
> >> +
> >> +static inline struct sock *bpf_sk_lookup_run(struct net *net,
> >> +					     struct bpf_sk_lookup_kern *ctx)
> >> +{
> >> +	struct bpf_prog *prog;
> >> +	int ret = BPF_OK;
> >> +
> >> +	rcu_read_lock();
> >> +	prog = rcu_dereference(net->sk_lookup_prog);
> >> +	if (prog)
> >> +		ret = BPF_PROG_RUN(prog, ctx);
> >> +	rcu_read_unlock();
> >> +
> >> +	if (ret == BPF_DROP)
> >> +		return ERR_PTR(-ECONNREFUSED);
> >> +	if (ret == BPF_REDIRECT)
> >> +		return ctx->selected_sk;
> >> +	return NULL;
> >> +}
> >> +
> >> +static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
> >> +					       __be32 saddr, __be16 sport,
> >> +					       __be32 daddr, u16 dport)
> >> +{
> >> +	struct bpf_sk_lookup_kern ctx = {
> >> +		.family		= AF_INET,
> >> +		.protocol	= protocol,
> >> +		.v4.saddr	= saddr,
> >> +		.v4.daddr	= daddr,
> >> +		.sport		= sport,
> >> +		.dport		= dport,
> >> +	};
> >> +
> >> +	return bpf_sk_lookup_run(net, &ctx);
> >> +}
> >> +
> >>  #endif /* _INET_HASHTABLES_H */
> >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> >> index ab64834837c8..f4d07285591a 100644
> >> --- a/net/ipv4/inet_hashtables.c
> >> +++ b/net/ipv4/inet_hashtables.c
> >> @@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
> >>  				    const int dif, const int sdif)
> >>  {
> >>  	struct inet_listen_hashbucket *ilb2;
> >> -	struct sock *result = NULL;
> >> +	struct sock *result, *reuse_sk;
> >>  	unsigned int hash2;
> >>
> >> +	/* Lookup redirect from BPF */
> >> +	result = inet_lookup_run_bpf(net, hashinfo->protocol,
> >> +				     saddr, sport, daddr, hnum);
> >> +	if (IS_ERR(result))
> >> +		return NULL;
> >> +	if (result) {
> >> +		reuse_sk = lookup_reuseport(net, result, skb, doff,
> >> +					    saddr, sport, daddr, hnum);
> >> +		if (reuse_sk)
> >> +			result = reuse_sk;
> >> +		goto done;
> >> +	}
> >> +
> >
> > The overhead is too high to do this all the time.
> > The feature has to be static_key-ed.
> 
> Static keys is something that Lorenz has also suggested internally, but
> we wanted to keep it simple at first.
> 
> Introduction of static keys forces us to decide when non-init_net netns
> are allowed to attach to SK_LOOKUP, as attaching enabling SK_LOOKUP in
> isolated netns will affect the rx path in init_net.
> 
> I see two options, which seem sensible:
> 
> 1) limit SK_LOOKUP to init_net, which makes testing setup harder, or
> 
> 2) allow non-init_net netns to attach to SK_LOOKUP only if static key
>    has been already enabled (via sysctl?).

I think both are overkill.
Just enable that static_key if any netns has progs.
Loading this prog type will be a privileged operation even after cap_bpf.
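
I.e. the attach path would just bump one global key (sketch; the mutex and
function names below are placeholders):

  static DEFINE_MUTEX(sk_lookup_prog_mutex);
  DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);

  static int sk_lookup_prog_attach(struct net *net, struct bpf_prog *prog)
  {
  	struct bpf_prog *old;

  	mutex_lock(&sk_lookup_prog_mutex);
  	old = rcu_dereference_protected(net->sk_lookup_prog,
  					lockdep_is_held(&sk_lookup_prog_mutex));
  	rcu_assign_pointer(net->sk_lookup_prog, prog);
  	if (!old)
  		static_branch_inc(&bpf_sk_lookup_enabled);	/* first prog in netns */
  	else
  		bpf_prog_put(old);
  	mutex_unlock(&sk_lookup_prog_mutex);
  	return 0;
  }

Detach does the reverse: static_branch_dec() when the last prog in a netns
goes away, so the hot path check stays patched out whenever nothing is
attached anywhere.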

> >
> > Also please add multi-prog support. Adding it later will cause
> > all sorts of compatibility issues. The semantics of multi-prog
> > needs to be thought through right now.
> > For example BPF_DROP or BPF_REDIRECT could terminate the prog_run_array
> > sequence of progs while BPF_OK could continue.
> > It's not ideal, but better than nothing.
> 
> I must say this approach is quite appealing because it's simple to
> explain. I would need a custom BPF_PROG_RUN_ARRAY, though.

of course.

> I'm curious what downside do you see here?
> Is overriding an earlier DROP/REDIRECT verdict useful?
> 
> > Another option could be to execute all attached progs regardless
> > of return code, but don't let second prog override selected_sk blindly.
> > bpf_sk_assign() could get smarter.
> 
> So if IIUC the rough idea here would be like below?
> 
> - 1st program calls
> 
>   bpf_sk_assign(ctx, sk1, 0 /*flags*/) -> 0 (OK)
> 
> - 2nd program calls
> 
>   bpf_sk_assign(ctx, sk2, 0) -> -EBUSY (already selected)
>   bpf_sk_assign(ctx, sk2, BPF_EXIST) -> 0 (OK, replace existing)
> 
> In this case the last program to run has the final say, as opposed to
> the semantics where DROP/REDIRECT terminates.
> 
> Also, 2nd and subsequent programs would probably need to know if and
> which socket has been already selected. I think the selection could be
> exposed in context as bpf_sock pointer.

I think running all is better.
The main downside of terminating early is the loss of predictability.
Imagine the first prog is doing the sock selection based on some map configuration.
Then a second prog gets loaded and does its own selection.
These two progs are managed by different user space processes.
Now the first map gets changed and the second prog stops seeing the packets.
No warning. Nothing. With "bpf_sk_assign(ctx, sk2, 0) -> -EBUSY"
the second prog at least will see errors and will be able to log
and alert humans to do something about it.
The question of ordering comes up, of course. But we have had those ordering
concerns for some time with the cgroup-bpf run array and it wasn't horrible.
We're still trying to solve it on the cgroup-bpf side in a generic way,
but simple first-to-attach -> first-to-run was good enough there
and I think it will be here as well. The whole dispatcher project,
and managing policy, priority, and ordering in user space, is better suited
to solve it generically for all cases. But the kernel should do the simple basics.
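
The dispatch could then look roughly like this (sketch of a dedicated run-array
loop; the net->sk_lookup_run_array field and the exact precedence between DROP
and REDIRECT are placeholders for discussion):

  static inline struct sock *bpf_sk_lookup_run_array(struct net *net,
  						     struct bpf_sk_lookup_kern *ctx)
  {
  	struct bpf_prog_array_item *item;
  	struct bpf_prog_array *run_array;
  	u32 ret, verdict = BPF_OK;

  	rcu_read_lock();
  	run_array = rcu_dereference(net->sk_lookup_run_array);
  	if (run_array) {
  		for (item = &run_array->items[0]; item->prog; item++) {
  			ret = BPF_PROG_RUN(item->prog, ctx);
  			/* Run every prog; a DROP sticks, REDIRECT wins
  			 * only if nobody asked for a DROP.
  			 */
  			if (ret == BPF_DROP)
  				verdict = BPF_DROP;
  			else if (ret == BPF_REDIRECT && verdict == BPF_OK)
  				verdict = BPF_REDIRECT;
  		}
  	}
  	rcu_read_unlock();

  	if (verdict == BPF_DROP)
  		return ERR_PTR(-ECONNREFUSED);
  	if (verdict == BPF_REDIRECT)
  		return ctx->selected_sk;
  	return NULL;
  }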

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13  5:41     ` Martin KaFai Lau
  -1 siblings, 0 replies; 68+ messages in thread
From: Martin KaFai Lau @ 2020-05-13  5:41 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Marek Majkowski, Lorenz Bauer

On Mon, May 11, 2020 at 08:52:03PM +0200, Jakub Sitnicki wrote:

[ ... ]

> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> +	   struct sock *, sk, u64, flags)
The SK_LOOKUP bpf_prog may have already selected the proper reuseport sk.
It is possible by looking up sk from sock_map.

Thus, it is not always desired to do lookup_reuseport() after sk_assign()
in patch 5.  e.g. reuseport_select_sock() just uses a normal hash if
there is no reuse->prog.

A flag (e.g. "BPF_F_REUSEPORT_SELECT") can be added here to
specifically do the reuseport_select_sock() after sk_assign().
If not set, reuseport_select_sock() should not be called.

> +{
> +	if (unlikely(flags != 0))
> +		return -EINVAL;
> +	if (unlikely(sk_is_refcounted(sk)))
> +		return -ESOCKTNOSUPPORT;
> +
> +	/* Check if socket is suitable for packet L3/L4 protocol */
> +	if (sk->sk_protocol != ctx->protocol)
> +		return -EPROTOTYPE;
> +	if (sk->sk_family != ctx->family &&
> +	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> +		return -EAFNOSUPPORT;
> +
> +	/* Select socket as lookup result */
> +	ctx->selected_sk = sk;
> +	return 0;
> +}
> +
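
A rough sketch of how the suggested flag could be wired up (the flag
name and the extra context field are assumptions from this discussion,
not the posted code):

    BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
               struct sock *, sk, u64, flags)
    {
            if (unlikely(flags & ~BPF_F_REUSEPORT_SELECT))
                    return -EINVAL;
            if (unlikely(sk_is_refcounted(sk)))
                    return -ESOCKTNOSUPPORT;

            /* Same L3/L4 compatibility checks as in the patch. */
            if (sk->sk_protocol != ctx->protocol)
                    return -EPROTOTYPE;
            if (sk->sk_family != ctx->family &&
                (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
                    return -EAFNOSUPPORT;

            ctx->selected_sk = sk;
            /* Caller runs reuseport_select_sock() only when asked to. */
            ctx->select_reuseport = !!(flags & BPF_F_REUSEPORT_SELECT);
            return 0;
    }

The caller in patch 5 would then gate lookup_reuseport() on
ctx.select_reuseport instead of calling it unconditionally.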

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13 13:55           ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-13 13:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Wed, May 13, 2020 at 01:58 AM CEST, Alexei Starovoitov wrote:
> On Tue, May 12, 2020 at 03:52:52PM +0200, Jakub Sitnicki wrote:
>> On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
>> > On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
>> >> Run a BPF program before looking up a listening socket on the receive path.
>> >> Program selects a listening socket to yield as result of socket lookup by
>> >> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
>> >>
>> >> Alternatively, program can also fail the lookup by returning with BPF_DROP,
>> >> or let the lookup continue as usual with BPF_OK on return.
>> >>
>> >> This lets the user match packets with listening sockets freely at the last
>> >> possible point on the receive path, where we know that packets are destined
>> >> for local delivery after undergoing policing, filtering, and routing.
>> >>
>> >> With BPF code selecting the socket, directing packets destined to an IP
>> >> range or to a port range to a single socket becomes possible.
>> >>
>> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> >> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>  include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
>> >>  net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
>> >>  2 files changed, 50 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>> >> index 6072dfbd1078..3fcbc8f66f88 100644
>> >> --- a/include/net/inet_hashtables.h
>> >> +++ b/include/net/inet_hashtables.h
>> >> @@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>> >>
>> >>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
>> >>  		      struct sock *sk);
>> >> +
>> >> +static inline struct sock *bpf_sk_lookup_run(struct net *net,
>> >> +					     struct bpf_sk_lookup_kern *ctx)
>> >> +{
>> >> +	struct bpf_prog *prog;
>> >> +	int ret = BPF_OK;
>> >> +
>> >> +	rcu_read_lock();
>> >> +	prog = rcu_dereference(net->sk_lookup_prog);
>> >> +	if (prog)
>> >> +		ret = BPF_PROG_RUN(prog, ctx);
>> >> +	rcu_read_unlock();
>> >> +
>> >> +	if (ret == BPF_DROP)
>> >> +		return ERR_PTR(-ECONNREFUSED);
>> >> +	if (ret == BPF_REDIRECT)
>> >> +		return ctx->selected_sk;
>> >> +	return NULL;
>> >> +}
>> >> +
>> >> +static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
>> >> +					       __be32 saddr, __be16 sport,
>> >> +					       __be32 daddr, u16 dport)
>> >> +{
>> >> +	struct bpf_sk_lookup_kern ctx = {
>> >> +		.family		= AF_INET,
>> >> +		.protocol	= protocol,
>> >> +		.v4.saddr	= saddr,
>> >> +		.v4.daddr	= daddr,
>> >> +		.sport		= sport,
>> >> +		.dport		= dport,
>> >> +	};
>> >> +
>> >> +	return bpf_sk_lookup_run(net, &ctx);
>> >> +}
>> >> +
>> >>  #endif /* _INET_HASHTABLES_H */
>> >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> >> index ab64834837c8..f4d07285591a 100644
>> >> --- a/net/ipv4/inet_hashtables.c
>> >> +++ b/net/ipv4/inet_hashtables.c
>> >> @@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
>> >>  				    const int dif, const int sdif)
>> >>  {
>> >>  	struct inet_listen_hashbucket *ilb2;
>> >> -	struct sock *result = NULL;
>> >> +	struct sock *result, *reuse_sk;
>> >>  	unsigned int hash2;
>> >>
>> >> +	/* Lookup redirect from BPF */
>> >> +	result = inet_lookup_run_bpf(net, hashinfo->protocol,
>> >> +				     saddr, sport, daddr, hnum);
>> >> +	if (IS_ERR(result))
>> >> +		return NULL;
>> >> +	if (result) {
>> >> +		reuse_sk = lookup_reuseport(net, result, skb, doff,
>> >> +					    saddr, sport, daddr, hnum);
>> >> +		if (reuse_sk)
>> >> +			result = reuse_sk;
>> >> +		goto done;
>> >> +	}
>> >> +
>> >
>> > The overhead is too high to do this all the time.
>> > The feature has to be static_key-ed.
>>
>> Static keys are something that Lorenz has also suggested internally, but
>> we wanted to keep it simple at first.
>>
>> Introduction of static keys forces us to decide when non-init_net netns
>> are allowed to attach to SK_LOOKUP, as enabling SK_LOOKUP in an
>> isolated netns will affect the rx path in init_net.
>>
>> I see two options, which seem sensible:
>>
>> 1) limit SK_LOOKUP to init_net, which makes testing setup harder, or
>>
>> 2) allow non-init_net netns to attach to SK_LOOKUP only if static key
>>    has been already enabled (via sysctl?).
>
> I think both are overkill.
> Just enable that static_key if any netns has progs.
> Loading this prog type will be privileged operation even after cap_bpf.
>

OK, right. In the new model caps are checked at load time. And the
CAP_BPF+CAP_NET_ADMIN check on load is done against init_user_ns.
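
A minimal sketch of such a load-time gate, assuming the cap_bpf series
lands (capable() checks against init_user_ns):

    /* Sketch only: SK_LOOKUP progs affect lookups once the static key
     * is on, so require capabilities in the initial user namespace at
     * load time.
     */
    static int sk_lookup_prog_load_check(void)
    {
            if (!bpf_capable() || !capable(CAP_NET_ADMIN))
                    return -EPERM;
            return 0;
    }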

[...]

>> I'm curious what downside do you see here?
>> Is overriding an earlier DROP/REDIRECT verdict useful?
>>
>> > Another option could be to execute all attached progs regardless
>> > of return code, but don't let second prog override selected_sk blindly.
>> > bpf_sk_assign() could get smarter.
>>
>> So if IIUC the rough idea here would be like below?
>>
>> - 1st program calls
>>
>>   bpf_sk_assign(ctx, sk1, 0 /*flags*/) -> 0 (OK)
>>
>> - 2nd program calls
>>
>>   bpf_sk_assign(ctx, sk2, 0) -> -EBUSY (already selected)
>>   bpf_sk_assign(ctx, sk2, BPF_EXIST) -> 0 (OK, replace existing)
>>
>> In this case the last program to run has the final say, as opposed to
>> the semantics where DROP/REDIRECT terminates.
>>
>> Also, 2nd and subsequent programs would probably need to know if and
>> which socket has been already selected. I think the selection could be
>> exposed in context as bpf_sock pointer.
>
> I think running all is better.
> The main downside of terminating early is predictability.
> Imagine the first prog is doing the sock selection based on some map configuration.
> Then a second prog gets loaded and does its own selection.
> These two progs are managed by different user space processes.
> Now the first map gets changed and the second prog stops seeing the packets.
> No warning. Nothing. With "bpf_sk_assign(ctx, sk2, 0) -> -EBUSY"
> the second prog will at least see errors and will be able to log
> and alert humans to do something about it.
> The question of ordering comes up, of course. But we have had those
> ordering concerns for some time with the cgroup-bpf run array and it
> wasn't horrible. We're still trying to solve it on the cgroup-bpf side
> in a generic way, but simple first-to-attach -> first-to-run was good
> enough there and I think it will be here as well. The whole dispatcher
> project, and managing policy, priority, and ordering in user space, is
> better suited to solve this generically for all cases. But the kernel
> should do the simple basics.

That makes sense. Thanks for guidance.

-Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13 14:21         ` Lorenz Bauer
  -1 siblings, 0 replies; 68+ messages in thread
From: Lorenz Bauer @ 2020-05-13 14:21 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Alexei Starovoitov, Networking, bpf, dccp, kernel-team,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Eric Dumazet, Gerrit Renker, Jakub Kicinski, Andrii Nakryiko,
	Martin KaFai Lau, Marek Majkowski

On Tue, 12 May 2020 at 14:52, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
> > On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
> >> Run a BPF program before looking up a listening socket on the receive path.
> >> Program selects a listening socket to yield as result of socket lookup by
> >> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
> >>
> >> Alternatively, program can also fail the lookup by returning with BPF_DROP,
> >> or let the lookup continue as usual with BPF_OK on return.
> >>
> >> This lets the user match packets with listening sockets freely at the last
> >> possible point on the receive path, where we know that packets are destined
> >> for local delivery after undergoing policing, filtering, and routing.
> >>
> >> With BPF code selecting the socket, directing packets destined to an IP
> >> range or to a port range to a single socket becomes possible.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>  include/net/inet_hashtables.h | 36 +++++++++++++++++++++++++++++++++++
> >>  net/ipv4/inet_hashtables.c    | 15 ++++++++++++++-
> >>  2 files changed, 50 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> >> index 6072dfbd1078..3fcbc8f66f88 100644
> >> --- a/include/net/inet_hashtables.h
> >> +++ b/include/net/inet_hashtables.h
> >> @@ -422,4 +422,40 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >>
> >>  int inet_hash_connect(struct inet_timewait_death_row *death_row,
> >>                    struct sock *sk);
> >> +
> >> +static inline struct sock *bpf_sk_lookup_run(struct net *net,
> >> +                                         struct bpf_sk_lookup_kern *ctx)
> >> +{
> >> +    struct bpf_prog *prog;
> >> +    int ret = BPF_OK;
> >> +
> >> +    rcu_read_lock();
> >> +    prog = rcu_dereference(net->sk_lookup_prog);
> >> +    if (prog)
> >> +            ret = BPF_PROG_RUN(prog, ctx);
> >> +    rcu_read_unlock();
> >> +
> >> +    if (ret == BPF_DROP)
> >> +            return ERR_PTR(-ECONNREFUSED);
> >> +    if (ret == BPF_REDIRECT)
> >> +            return ctx->selected_sk;
> >> +    return NULL;
> >> +}
> >> +
> >> +static inline struct sock *inet_lookup_run_bpf(struct net *net, u8 protocol,
> >> +                                           __be32 saddr, __be16 sport,
> >> +                                           __be32 daddr, u16 dport)
> >> +{
> >> +    struct bpf_sk_lookup_kern ctx = {
> >> +            .family         = AF_INET,
> >> +            .protocol       = protocol,
> >> +            .v4.saddr       = saddr,
> >> +            .v4.daddr       = daddr,
> >> +            .sport          = sport,
> >> +            .dport          = dport,
> >> +    };
> >> +
> >> +    return bpf_sk_lookup_run(net, &ctx);
> >> +}
> >> +
> >>  #endif /* _INET_HASHTABLES_H */
> >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> >> index ab64834837c8..f4d07285591a 100644
> >> --- a/net/ipv4/inet_hashtables.c
> >> +++ b/net/ipv4/inet_hashtables.c
> >> @@ -307,9 +307,22 @@ struct sock *__inet_lookup_listener(struct net *net,
> >>                                  const int dif, const int sdif)
> >>  {
> >>      struct inet_listen_hashbucket *ilb2;
> >> -    struct sock *result = NULL;
> >> +    struct sock *result, *reuse_sk;
> >>      unsigned int hash2;
> >>
> >> +    /* Lookup redirect from BPF */
> >> +    result = inet_lookup_run_bpf(net, hashinfo->protocol,
> >> +                                 saddr, sport, daddr, hnum);
> >> +    if (IS_ERR(result))
> >> +            return NULL;
> >> +    if (result) {
> >> +            reuse_sk = lookup_reuseport(net, result, skb, doff,
> >> +                                        saddr, sport, daddr, hnum);
> >> +            if (reuse_sk)
> >> +                    result = reuse_sk;
> >> +            goto done;
> >> +    }
> >> +
> >
> > The overhead is too high to do this all the time.
> > The feature has to be static_key-ed.
>
> Static keys are something that Lorenz has also suggested internally, but
> we wanted to keep it simple at first.
>
> Introduction of static keys forces us to decide when non-init_net netns
> are allowed to attach to SK_LOOKUP, as enabling SK_LOOKUP in an
> isolated netns will affect the rx path in init_net.
>
> I see two options, which seem sensible:
>
> 1) limit SK_LOOKUP to init_net, which makes testing setup harder, or
>
> 2) allow non-init_net netns to attach to SK_LOOKUP only if static key
>    has been already enabled (via sysctl?).
>
> >
> > Also please add multi-prog support. Adding it later will cause
> > all sorts of compatibility issues. The semantics of multi-prog
> > needs to be thought through right now.
> > For example BPF_DROP or BPF_REDIRECT could terminate the prog_run_array
> > sequence of progs while BPF_OK could continue.
> > It's not ideal, but better than nothing.
>
> I must say this approach is quite appealing because it's simple to
> explain. I would need a custom BPF_PROG_RUN_ARRAY, though.
>
> I'm curious what downside do you see here?
> Is overriding an earlier DROP/REDIRECT verdict useful?
>
> > Another option could be to execute all attached progs regardless
> > of return code, but don't let second prog override selected_sk blindly.
> > bpf_sk_assign() could get smarter.
>
> So if IIUC the rough idea here would be like below?
>
> - 1st program calls
>
>   bpf_sk_assign(ctx, sk1, 0 /*flags*/) -> 0 (OK)
>
> - 2nd program calls
>
>   bpf_sk_assign(ctx, sk2, 0) -> -EBUSY (already selected)
>   bpf_sk_assign(ctx, sk2, BPF_EXIST) -> 0 (OK, replace existing)
>
> In this case the last program to run has the final say, as opposed to
> the semantics where DROP/REDIRECT terminates.

Does sk_assign from TC also gain BPF_EXIST semantics? As you know,
I'm a bit concerned that TC and sk_lookup sk_assign are actually two completely
separate helpers. This is a good way to figure out if it's a good idea to
overload the name, imo.

>
> Also, 2nd and subsequent programs would probably need to know if and
> which socket has been already selected. I think the selection could be
> exposed in context as bpf_sock pointer.
>
> I admit, I can't quite see the benefit of running thru all programs in
> array, so I'm tempted to go with terminate of DROP/REDIRECT in v3.
>
> >
> > Also please switch to bpf_link way of attaching. All system wide attachments
> > should be visible and easily debuggable via 'bpftool link show'.
> > Currently we're converting tc and xdp hooks to bpf_link. This new hook
> > should have it from the beginning.
>
> Will do in v3.
>
> Thanks for feedback,
> Jakub



-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13 14:34       ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-13 14:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Marek Majkowski, Lorenz Bauer

On Wed, May 13, 2020 at 07:41 AM CEST, Martin KaFai Lau wrote:
> On Mon, May 11, 2020 at 08:52:03PM +0200, Jakub Sitnicki wrote:
>
> [ ... ]
>
>> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>> +	   struct sock *, sk, u64, flags)
> The SK_LOOKUP bpf_prog may have already selected the proper reuseport sk.
> It is possible by looking up sk from sock_map.
>
> Thus, it is not always desired to do lookup_reuseport() after sk_assign()
> in patch 5.  e.g. reuseport_select_sock() just uses a normal hash if
> there is no reuse->prog.
>
> A flag (e.g. "BPF_F_REUSEPORT_SELECT") can be added here to
> specifically do the reuseport_select_sock() after sk_assign().
> If not set, reuseport_select_sock() should not be called.

It's true that, in addition to steering connections to different
services with SK_LOOKUP, you could also, in the same program,
load-balance among sockets belonging to one service.

So skipping the reuseport socket selection, if sk_lookup already did the
load-balancing, sounds useful.

Thinking about our use-case, I think we would always pass
BPF_F_REUSEPORT_SELECT to sk_assign() because we either (i) know that
the application is using reuseport and want it to manage the
load-balancing socket group by itself, or (ii) don't know if the
application is using reuseport and don't want to break expected behavior.

IOW, we'd like reuseport selection to run by default because the
application expects it to happen if it was set up. OTOH, the application
doesn't have to be aware that there is an sk_lookup program attached (we
can put one of its sockets in the sk_lookup SOCKMAP when systemd
activates it).

Because of that I'd be in favor of having a flag for sk_assign() that
disables reuseport selection on demand.

WDYT?
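
For illustration, roughly how a SK_LOOKUP program would use the helper
under that model (the section, map, and key names below are made up for
this sketch; only a hypothetical flag would change the reuseport
default):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } service_socks SEC(".maps");

    SEC("sk_lookup")
    int select_sock(struct bpf_sk_lookup *ctx)
    {
            const __u32 key = 0;
            struct bpf_sock *sk;
            long err;

            sk = bpf_map_lookup_elem(&service_socks, &key);
            if (!sk)
                    return BPF_OK;  /* fall back to regular lookup */

            /* flags == 0: reuseport selection still runs afterwards;
             * a hypothetical "skip reuseport" flag would turn it off.
             */
            err = bpf_sk_assign(ctx, sk, 0);
            bpf_sk_release(sk);
            return err ? BPF_DROP : BPF_REDIRECT;
    }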

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13 14:50           ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-13 14:50 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Alexei Starovoitov, Networking, bpf, dccp, kernel-team,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Eric Dumazet, Gerrit Renker, Jakub Kicinski, Andrii Nakryiko,
	Martin KaFai Lau, Marek Majkowski

On Wed, May 13, 2020 at 04:21 PM CEST, Lorenz Bauer wrote:
> On Tue, 12 May 2020 at 14:52, Jakub Sitnicki <jakub@cloudflare.com> wrote:

[...]

>> So if IIUC the rough idea here would be like below?
>>
>> - 1st program calls
>>
>>   bpf_sk_assign(ctx, sk1, 0 /*flags*/) -> 0 (OK)
>>
>> - 2nd program calls
>>
>>   bpf_sk_assign(ctx, sk2, 0) -> -EBUSY (already selected)
>>   bpf_sk_assign(ctx, sk2, BPF_EXIST) -> 0 (OK, replace existing)
>>
>> In this case the last program to run has the final say, as opposed to
>> the semantics where DROP/REDIRECT terminates.
>
> Does sk_assign from TC also gain BPF_EXIST semantics? As you know,
> I'm a bit concerned that TC and sk_lookup sk_assign are actually two completely
> separate helpers. This is a good way to figure out if it's a good idea to
> overload the name, imo.

I don't have a strong opinion here. We could have a dedicated helper.

Personally I'm not finding it confusing. As a BPF user you know what
program type you're working with (TC vs SK_LOOKUP), and both helper
variants will be documented separately in the bpf-helpers man-page, like
so:

       int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64
       flags)

              Description
                     Helper is overloaded depending on BPF program
                     type. This description applies to
                     BPF_PROG_TYPE_SCHED_CLS and
                     BPF_PROG_TYPE_SCHED_ACT programs.

                     Assign the sk to the skb. When combined with
                     appropriate routing configuration to receive the
                     packet towards the socket, will cause skb to be
                     delivered to the specified socket. Subsequent
                     redirection of skb via bpf_redirect(),
                     bpf_clone_redirect() or other methods outside of
                     BPF may interfere with successful delivery to the
                     socket.

                     This operation is only valid from TC ingress
                     path.

                     The flags argument must be zero.

              Return 0 on success, or a negative errno in case of
                     failure.

                     · -EINVAL            Unsupported flags specified.

                     · -ENOENT            Socket is unavailable for
                       assignment.

                     · -ENETUNREACH       Socket is unreachable (wrong
                       netns).

                     · -EOPNOTSUPP        Unsupported operation, for
                       example a call from outside of TC ingress.

                     · -ESOCKTNOSUPPORT   Socket type not supported
                       (reuseport).

       int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock
       *sk, u64 flags)

              Description
                     Helper is overloaded depending on BPF program
                     type. This description applies to
                     BPF_PROG_TYPE_SK_LOOKUP programs.

                     Select the sk as a result of a socket lookup.

                     For the operation to succeed the passed socket
                     must be compatible with the packet description
                     provided by the ctx object.

                     L4 protocol (IPPROTO_TCP or IPPROTO_UDP) must be
                     an exact match. While IP family (AF_INET or
                     AF_INET6) must be compatible, that is IPv6
                     sockets that are not v6-only can be selected for
                     IPv4 packets.

                     Only TCP listeners and UDP sockets, that is
                     sockets which have the SOCK_RCU_FREE flag set,
                     can be selected.

                     The flags argument must be zero.

              Return 0 on success, or a negative errno in case of
                     failure.

                     -EAFNOSUPPORT if socket family (sk->family) is
                     not compatible with packet family (ctx->family).

                     -EINVAL if unsupported flags were specified.

                     -EPROTOTYPE if socket L4 protocol (sk->protocol)
                     doesn't match packet protocol (ctx->protocol).

                     -ESOCKTNOSUPPORT if socket does not use RCU
                     freeing.

But it would be helpful to hear what others think about it.
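
For comparison, a rough usage sketch of the TC ingress variant described
above (tuple values and return codes are illustrative, not taken from
the man-page text):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("classifier")
    int steer_to_proxy(struct __sk_buff *skb)
    {
            struct bpf_sock_tuple tuple = {
                    .ipv4.daddr = bpf_htonl(0x7f000001), /* 127.0.0.1, example */
                    .ipv4.dport = bpf_htons(4321),       /* example proxy port */
            };
            struct bpf_sock *sk;
            long err;

            sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                   BPF_F_CURRENT_NETNS, 0);
            if (!sk)
                    return TC_ACT_OK;
            /* Bind the skb to the proxy's listening socket. */
            err = bpf_sk_assign(skb, sk, 0);
            bpf_sk_release(sk);
            return err ? TC_ACT_SHOT : TC_ACT_OK;
    }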

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
  2020-05-11 18:52 ` Jakub Sitnicki
@ 2020-05-13 17:54         ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-13 17:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko

On Tue, May 12, 2020 at 06:34 PM CEST, Martin KaFai Lau wrote:
> On Tue, May 12, 2020 at 01:57:45PM +0200, Jakub Sitnicki wrote:
>> On Mon, May 11, 2020 at 09:45 PM CEST, Martin KaFai Lau wrote:
>> > On Mon, May 11, 2020 at 08:52:01PM +0200, Jakub Sitnicki wrote:

[...]

>> >> RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.
>> >>
>> >> | tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
>> >> | no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
>> >> | with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |
>> >>
>> >> | tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
>> >> | no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
>> >> | with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |
>> >>
>> >> | udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
>> >> | no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
>> >> | with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |
>> >>
>> >> | udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
>> >> | no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
>> > What is causing this regression?
>> >
>>
>> I need to go back to archived perf.data and see if perf-annotate or
>> perf-diff provide any clues that will help me tell where CPU cycles are
>> going. Will get back to you on that.
>>
>> Wild guess is that for udp6 we're loading and copying more data to
>> populate v6 addresses in program context. See inet6_lookup_run_bpf
>> (patch 7).
> If that is the case,
> rcu_access_pointer(net->sk_lookup_prog) should be tested first before
> doing ctx initialization.

Coming back after looking for more hints in recorded perf.data.

`perf diff` between baseline and no-prog-attached shows:

# Event 'cycles'
#
# Baseline    Delta  Symbol
# ........  .......  ......................................
#
     4.63%   +0.07%  [k] udpv6_queue_rcv_one_skb
     3.88%   -0.38%  [k] __netif_receive_skb_core
     3.54%   +0.21%  [k] udp6_lib_lookup2
     3.01%   -0.42%  [k] 0xffffffffc04926cc
     2.69%   -0.10%  [k] mlx5e_skb_from_cqe_linear
     2.56%   -0.20%  [k] ipv6_gro_receive
     2.31%   -0.15%  [k] dev_gro_receive
     2.20%   -0.13%  [k] do_csum
     2.02%   -0.68%  [k] ip6_pol_route
     1.94%   +0.79%  [k] __udp6_lib_lookup

So __udp6_lib_lookup is where to look, as expected.

`perf annotate __udp6_lib_lookup` for no-prog-attached has a hot spot
right where we populate the context object:


         :                      /* Lookup redirect from BPF */
         :                      result = inet6_lookup_run_bpf(net, udptable->protocol,
    0.00 :   ffffffff818530db:       mov    0x18(%r14),%edx
         :                      inet6_lookup_run_bpf():
         :                      const struct in6_addr *saddr,
         :                      __be16 sport,
         :                      const struct in6_addr *daddr,
         :                      unsigned short hnum)
         :                      {
         :                      struct bpf_inet_lookup_kern ctx = {
 inet6_hashtables.h:115    1.27 :   ffffffff818530df:       lea    -0x78(%rbp),%r9
    0.00 :   ffffffff818530e3:       xor    %eax,%eax
    0.00 :   ffffffff818530e5:       mov    $0x8,%ecx
    0.00 :   ffffffff818530ea:       mov    %r9,%rdi
 inet6_hashtables.h:115   26.09 :   ffffffff818530ed:       rep stos %rax,%es:(%rdi)
 inet6_hashtables.h:115    1.35 :   ffffffff818530f0:       mov    $0xa,%eax
    0.00 :   ffffffff818530f5:       mov    %bx,-0x60(%rbp)
    0.00 :   ffffffff818530f9:       mov    %ax,-0x78(%rbp)
    0.00 :   ffffffff818530fd:       mov    (%r15),%rax
    1.42 :   ffffffff81853100:       mov    %dl,-0x76(%rbp)
    0.00 :   ffffffff81853103:       mov    0x8(%r15),%rdx
    0.00 :   ffffffff81853107:       mov    %r11w,-0x48(%rbp)
    0.00 :   ffffffff8185310c:       mov    %rax,-0x70(%rbp)
    1.27 :   ffffffff81853110:       mov    (%r12),%rax
    0.02 :   ffffffff81853114:       mov    %rdx,-0x68(%rbp)
    0.00 :   ffffffff81853118:       mov    0x8(%r12),%rdx
    0.02 :   ffffffff8185311d:       mov    %rax,-0x58(%rbp)
    1.24 :   ffffffff81853121:       mov    %rdx,-0x50(%rbp)
         :                      __read_once_size():
         :                      })
         :
         :                      static __always_inline
         :                      void __read_once_size(const volatile void *p, void *res, int size)
         :                      {
         :                      __READ_ONCE_SIZE;
    0.05 :   ffffffff81853125:       mov    0xd28(%r13),%rdx

Note, struct bpf_inet_lookup has been renamed to bpf_sk_lookup since
then.

I'll switch to copying just the pointer to in6_addr{} and push the
context initialization after test for net->sk_lookup_prog, like you
suggested.
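
A sketch of that rework (illustrative only; it assumes the v6 context
fields become pointers rather than copies):

    static inline struct sock *inet6_lookup_run_bpf(struct net *net, u8 protocol,
                                                    const struct in6_addr *saddr,
                                                    __be16 sport,
                                                    const struct in6_addr *daddr,
                                                    unsigned short hnum)
    {
            struct sock *selected_sk = NULL;
            struct bpf_prog *prog;
            int ret = BPF_OK;

            rcu_read_lock();
            prog = rcu_dereference(net->sk_lookup_prog);
            if (prog) {
                    /* Build ctx only when a prog is attached, and keep
                     * pointers to the v6 addresses instead of copying them.
                     */
                    struct bpf_sk_lookup_kern ctx = {
                            .family         = AF_INET6,
                            .protocol       = protocol,
                            .v6.saddr       = saddr,
                            .v6.daddr       = daddr,
                            .sport          = sport,
                            .dport          = hnum,
                    };

                    ret = BPF_PROG_RUN(prog, &ctx);
                    selected_sk = ctx.selected_sk;
            }
            rcu_read_unlock();

            if (ret == BPF_DROP)
                    return ERR_PTR(-ECONNREFUSED);
            if (ret == BPF_REDIRECT)
                    return selected_sk;
            return NULL;
    }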

I can post full output from perf-diff/annotate to some pastebin if you
would like to take a deeper look.

Thanks,
Jakub

>
>>
>> This makes me realize the copy is unnecessary, I could just store the
>> pointer to in6_addr{}. Will make this change in v3.
>>
>> As to why udp6 is taking a bigger hit than udp4 - comparing top 10 in
>> `perf report --no-children` shows that in our test setup, socket lookup
>> contributes less to CPU cycles on receive for udp4 than for udp6.
>>
>> * udp4 baseline (no children)
>>
>> # Overhead       Samples  Symbol
>> # ........  ............  ......................................
>> #
>>      8.11%         19429  [k] fib_table_lookup
>>      4.31%         10333  [k] udp_queue_rcv_one_skb
>>      3.75%          8991  [k] fib4_rule_action
>>      3.66%          8763  [k] __netif_receive_skb_core
>>      3.42%          8198  [k] fib_rules_lookup
>>      3.05%          7314  [k] fib4_rule_match
>>      2.71%          6507  [k] mlx5e_skb_from_cqe_linear
>>      2.58%          6192  [k] inet_gro_receive
>>      2.49%          5981  [k] __x86_indirect_thunk_rax
>>      2.36%          5656  [k] udp4_lib_lookup2
>>
>> * udp6 baseline (no children)
>>
>> # Overhead       Samples  Symbol
>> # ........  ............  ......................................
>> #
>>      4.63%         11100  [k] udpv6_queue_rcv_one_skb
>>      3.88%          9308  [k] __netif_receive_skb_core
>>      3.54%          8480  [k] udp6_lib_lookup2
>>      2.69%          6442  [k] mlx5e_skb_from_cqe_linear
>>      2.56%          6137  [k] ipv6_gro_receive
>>      2.31%          5540  [k] dev_gro_receive
>>      2.20%          5264  [k] do_csum
>>      2.02%          4835  [k] ip6_pol_route
>>      1.94%          4639  [k] __udp6_lib_lookup
>>      1.89%          4540  [k] selinux_socket_sock_rcv_skb
>>
>> Notice that __udp4_lib_lookup didn't even make the cut. That could
>> explain why adding instructions to __udp6_lib_lookup has more effect on
>> RX PPS.
>>
>> Frankly, that is something that surprised us, but we didn't have time to
>> investigate further, yet.
> The perf report should be able to annotate bpf prog also.
> e.g. maybe part of it is because the bpf_prog itself is also dealing
> with a longer address?
>
>>
>> >> | with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |
>> > This also looks very different from udp4.
>> >
>>
>> Thanks for the questions,
>> Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup
@ 2020-05-13 17:54         ` Jakub Sitnicki
  0 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-13 17:54 UTC (permalink / raw)
  To: dccp

On Tue, May 12, 2020 at 06:34 PM CEST, Martin KaFai Lau wrote:
> On Tue, May 12, 2020 at 01:57:45PM +0200, Jakub Sitnicki wrote:
>> On Mon, May 11, 2020 at 09:45 PM CEST, Martin KaFai Lau wrote:
>> > On Mon, May 11, 2020 at 08:52:01PM +0200, Jakub Sitnicki wrote:

[...]

>> >> RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 seconds.
>> >>
>> >> | tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 939,616 ± 0.5%         |        - |
>> >> | no SK_LOOKUP prog attached   | 929,275 ± 1.2%         |    -1.1% |
>> >> | with SK_LOOKUP prog attached | 918,582 ± 0.4%         |    -2.2% |
>> >>
>> >> | tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 875,838 ± 0.5%         |        - |
>> >> | no SK_LOOKUP prog attached   | 872,005 ± 0.3%         |    -0.4% |
>> >> | with SK_LOOKUP prog attached | 856,250 ± 0.5%         |    -2.2% |
>> >>
>> >> | udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 2,738,662 ± 1.5%       |        - |
>> >> | no SK_LOOKUP prog attached   | 2,576,893 ± 1.0%       |    -5.9% |
>> >> | with SK_LOOKUP prog attached | 2,530,698 ± 1.0%       |    -7.6% |
>> >>
>> >> | udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
>> >> |------------------------------+------------------------+----------|
>> >> | 5.6.3 vanilla (baseline)     | 2,867,885 ± 1.4%       |        - |
>> >> | no SK_LOOKUP prog attached   | 2,646,875 ± 1.0%       |    -7.7% |
>> > What is causing this regression?
>> >
>>
>> I need to go back to archived perf.data and see if perf-annotate or
>> perf-diff provide any clues that will help me tell where CPU cycles are
>> going. Will get back to you on that.
>>
>> A wild guess is that for udp6 we're loading and copying more data to
>> populate the v6 addresses in the program context. See inet6_lookup_run_bpf
>> (patch 7).
> If that is the case,
> rcu_access_pointer(net->sk_lookup_prog) should be tested first before
> doing ctx initialization.

Coming back after looking for more hints in recorded perf.data.

`perf diff` between baseline and no-prog-attached shows:

# Event 'cycles'
#
# Baseline    Delta  Symbol
# ........  .......  ......................................
#
     4.63%   +0.07%  [k] udpv6_queue_rcv_one_skb
     3.88%   -0.38%  [k] __netif_receive_skb_core
     3.54%   +0.21%  [k] udp6_lib_lookup2
     3.01%   -0.42%  [k] 0xffffffffc04926cc
     2.69%   -0.10%  [k] mlx5e_skb_from_cqe_linear
     2.56%   -0.20%  [k] ipv6_gro_receive
     2.31%   -0.15%  [k] dev_gro_receive
     2.20%   -0.13%  [k] do_csum
     2.02%   -0.68%  [k] ip6_pol_route
     1.94%   +0.79%  [k] __udp6_lib_lookup

So __udp6_lib_lookup is where to look, as expected.

`perf annotate __udp6_lib_lookup` for no-prog-attached has a hot spot
right where we populate the context object; the rep stos that
zero-initializes the on-stack ctx accounts for ~26% of the samples in
this function:


         :                      /* Lookup redirect from BPF */
         :                      result = inet6_lookup_run_bpf(net, udptable->protocol,
    0.00 :   ffffffff818530db:       mov    0x18(%r14),%edx
         :                      inet6_lookup_run_bpf():
         :                      const struct in6_addr *saddr,
         :                      __be16 sport,
         :                      const struct in6_addr *daddr,
         :                      unsigned short hnum)
         :                      {
         :                      struct bpf_inet_lookup_kern ctx = {
 inet6_hashtables.h:115    1.27 :   ffffffff818530df:       lea    -0x78(%rbp),%r9
    0.00 :   ffffffff818530e3:       xor    %eax,%eax
    0.00 :   ffffffff818530e5:       mov    $0x8,%ecx
    0.00 :   ffffffff818530ea:       mov    %r9,%rdi
 inet6_hashtables.h:115   26.09 :   ffffffff818530ed:       rep stos %rax,%es:(%rdi)
 inet6_hashtables.h:115    1.35 :   ffffffff818530f0:       mov    $0xa,%eax
    0.00 :   ffffffff818530f5:       mov    %bx,-0x60(%rbp)
    0.00 :   ffffffff818530f9:       mov    %ax,-0x78(%rbp)
    0.00 :   ffffffff818530fd:       mov    (%r15),%rax
    1.42 :   ffffffff81853100:       mov    %dl,-0x76(%rbp)
    0.00 :   ffffffff81853103:       mov    0x8(%r15),%rdx
    0.00 :   ffffffff81853107:       mov    %r11w,-0x48(%rbp)
    0.00 :   ffffffff8185310c:       mov    %rax,-0x70(%rbp)
    1.27 :   ffffffff81853110:       mov    (%r12),%rax
    0.02 :   ffffffff81853114:       mov    %rdx,-0x68(%rbp)
    0.00 :   ffffffff81853118:       mov    0x8(%r12),%rdx
    0.02 :   ffffffff8185311d:       mov    %rax,-0x58(%rbp)
    1.24 :   ffffffff81853121:       mov    %rdx,-0x50(%rbp)
         :                      __read_once_size():
         :                      })
         :
         :                      static __always_inline
         :                      void __read_once_size(const volatile void *p, void *res, int size)
         :                      {
         :                      __READ_ONCE_SIZE;
    0.05 :   ffffffff81853125:       mov    0xd28(%r13),%rdx

Note, struct bpf_inet_lookup has been renamed to bpf_sk_lookup since
then.

I'll switch to storing just a pointer to in6_addr{} and move the
context initialization after the test for net->sk_lookup_prog, like you
suggested.
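
Roughly, the reordered helper could look like the sketch below. This is
not the v3 patch; the struct and field names (bpf_sk_lookup_kern,
v6.saddr/daddr, selected_sk) are approximations based on this series,
and the BPF_OK/BPF_DROP handling is left out:

static struct sock *inet6_lookup_run_bpf(struct net *net, u8 protocol,
					 const struct in6_addr *saddr,
					 __be16 sport,
					 const struct in6_addr *daddr,
					 unsigned short hnum)
{
	struct bpf_sk_lookup_kern ctx;
	struct bpf_prog *prog;

	/* Caller holds rcu_read_lock(). Bail out before touching the
	 * on-stack ctx when no program is attached to the netns.
	 */
	prog = rcu_dereference(net->sk_lookup_prog);
	if (!prog)
		return NULL;

	memset(&ctx, 0, sizeof(ctx));
	ctx.family   = AF_INET6;
	ctx.protocol = protocol;
	ctx.v6.saddr = saddr;	/* store pointers, not 16-byte copies */
	ctx.v6.daddr = daddr;
	ctx.sport    = sport;
	ctx.dport    = hnum;

	if (BPF_PROG_RUN(prog, &ctx) == BPF_REDIRECT)
		return ctx.selected_sk;

	return NULL;
}

With the attached-program check done first, the memset and the field
stores disappear entirely from the no-prog-attached path, which is the
case the baseline comparison above measures.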

I can post full output from perf-diff/annotate to some pastebin if you
would like to take a deeper look.

Thanks,
Jakub

>
>>
>> This makes me realize the copy is unnecessary; I could just store a
>> pointer to in6_addr{}. Will make this change in v3.
>>
>> As to why udp6 is taking a bigger hit than udp4 - comparing top 10 in
>> `perf report --no-children` shows that in our test setup, socket lookup
>> contributes less to CPU cycles on receive for udp4 than for udp6.
>>
>> * udp4 baseline (no children)
>>
>> # Overhead       Samples  Symbol
>> # ........  ............  ......................................
>> #
>>      8.11%         19429  [k] fib_table_lookup
>>      4.31%         10333  [k] udp_queue_rcv_one_skb
>>      3.75%          8991  [k] fib4_rule_action
>>      3.66%          8763  [k] __netif_receive_skb_core
>>      3.42%          8198  [k] fib_rules_lookup
>>      3.05%          7314  [k] fib4_rule_match
>>      2.71%          6507  [k] mlx5e_skb_from_cqe_linear
>>      2.58%          6192  [k] inet_gro_receive
>>      2.49%          5981  [k] __x86_indirect_thunk_rax
>>      2.36%          5656  [k] udp4_lib_lookup2
>>
>> * udp6 baseline (no children)
>>
>> # Overhead       Samples  Symbol
>> # ........  ............  ......................................
>> #
>>      4.63%         11100  [k] udpv6_queue_rcv_one_skb
>>      3.88%          9308  [k] __netif_receive_skb_core
>>      3.54%          8480  [k] udp6_lib_lookup2
>>      2.69%          6442  [k] mlx5e_skb_from_cqe_linear
>>      2.56%          6137  [k] ipv6_gro_receive
>>      2.31%          5540  [k] dev_gro_receive
>>      2.20%          5264  [k] do_csum
>>      2.02%          4835  [k] ip6_pol_route
>>      1.94%          4639  [k] __udp6_lib_lookup
>>      1.89%          4540  [k] selinux_socket_sock_rcv_skb
>>
>> Notice that __udp4_lib_lookup didn't even make the cut. That could
>> explain why adding instructions to __udp6_lib_lookup has more effect on
>> RX PPS.
>>
>> Frankly, that is something that surprised us, but we haven't had time
>> to investigate it further yet.
> The perf report should be able to annotate bpf prog also.
> e.g. may be part of it is because the bpf_prog itself is also dealing
> with a longer address?
>
>>
>> >> | with SK_LOOKUP prog attached | 2,520,474 ± 0.7%       |   -12.1% |
>> > This also looks very different from udp4.
>> >
>>
>> Thanks for the questions,
>> Jakub

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-13 18:10         ` Martin KaFai Lau
  -1 siblings, 0 replies; 68+ messages in thread
From: Martin KaFai Lau @ 2020-05-13 18:10 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Marek Majkowski, Lorenz Bauer

On Wed, May 13, 2020 at 04:34:13PM +0200, Jakub Sitnicki wrote:
> On Wed, May 13, 2020 at 07:41 AM CEST, Martin KaFai Lau wrote:
> > On Mon, May 11, 2020 at 08:52:03PM +0200, Jakub Sitnicki wrote:
> >
> > [ ... ]
> >
> >> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> >> +	   struct sock *, sk, u64, flags)
> > The SK_LOOKUP bpf_prog may have already selected the proper reuseport sk.
> > It is possible by looking up sk from sock_map.
> >
> > Thus, it is not always desired to do lookup_reuseport() after sk_assign()
> > in patch 5.  e.g. reuseport_select_sock() just uses a normal hash if
> > there is no reuse->prog.
> >
> > A flag (e.g. "BPF_F_REUSEPORT_SELECT") can be added here to
> > specifically do the reuseport_select_sock() after sk_assign().
> > If not set, reuseport_select_sock() should not be called.
> 
> It's true that, in addition to steering connections to different
> services with SK_LOOKUP, you could also, in the same program,
> load-balance among sockets belonging to one service.
> 
> So skipping the reuseport socket selection when sk_lookup has already
> done the load-balancing sounds useful.
> 
> Thinking about our use-case, I think we would always pass
> BPF_F_REUSEPORT_SELECT to sk_assign() because we either (i) know that
> the application is using reuseport and want it to manage the
> load-balancing socket group by itself, or (ii) don't know if the
> application is using reuseport and don't want to break expected behavior.
Thanks for the explanation.

> 
> IOW, we'd like reuseport selection to run by default because the
> application expects it to happen if it was set up. OTOH, the application
> doesn't have to be aware that there is sk_lookup attached (we can put one
> of its sockets in the sk_lookup SOCKMAP when systemd activates it).
> 
> Because of that I'd be in favor of having a flag for sk_assign() that
> disables reuseport selection on demand.
> 
> WDYT?
Sure, it is hard to say which use case is more common and should take
the default ;)
I think there are use cases for both, so no strong opinion on this ;)
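
For illustration, a minimal sk_lookup program sketch in the spirit of
this series is below. The section name, the context field names, and
the SOCKHASH-based selection follow the cover letter and this thread,
but the exact v2 API may differ, and the reuseport-related flag is only
the idea discussed above, not an existing interface:

/* Sketch: steer port 80 traffic to a socket taken from a SOCKHASH. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} dest_socket SEC(".maps");

SEC("sk_lookup")
int steer_http(struct bpf_sk_lookup *ctx)
{
	struct bpf_sock *sk;
	__u32 key = 0;
	int err;

	if (ctx->local_port != 80)
		return BPF_OK;		/* not ours, regular lookup continues */

	sk = bpf_map_lookup_elem(&dest_socket, &key);
	if (!sk)
		return BPF_DROP;	/* fail the lookup */

	/* With the default argued for above, flags == 0 would still let
	 * the kernel run reuseport selection on sk's group; a hypothetical
	 * opt-out flag would skip it.
	 */
	err = bpf_sk_assign(ctx, sk, 0);
	bpf_sk_release(sk);
	return err ? BPF_DROP : BPF_REDIRECT;
}

char _license[] SEC("license") = "GPL";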

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-15 12:28       ` Jakub Sitnicki
  -1 siblings, 0 replies; 68+ messages in thread
From: Jakub Sitnicki @ 2020-05-15 12:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
> On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
>> Run a BPF program before looking up a listening socket on the receive path.
>> Program selects a listening socket to yield as result of socket lookup by
>> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
>>
>> Alternatively, program can also fail the lookup by returning with BPF_DROP,
>> or let the lookup continue as usual with BPF_OK on return.
>>
>> This lets the user match packets with listening sockets freely at the last
>> possible point on the receive path, where we know that packets are destined
>> for local delivery after undergoing policing, filtering, and routing.
>>
>> With BPF code selecting the socket, directing packets destined to an IP
>> range or to a port range to a single socket becomes possible.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---

[...]

> Also please switch to bpf_link way of attaching. All system wide attachments
> should be visible and easily debuggable via 'bpftool link show'.
> Currently we're converting tc and xdp hooks to bpf_link. This new hook
> should have it from the beginning.

Just to clarify, I understood that bpf(BPF_PROG_ATTACH/DETACH) doesn't
have to be supported for new hooks.

Please correct me if I misunderstood.
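
For reference, bpf_link-style attachment could look roughly like the
userspace sketch below. It assumes a network namespace fd as the attach
target, bpf_link_create() support for the new BPF_SK_LOOKUP attach
type, and a "sk_lookup" section name; none of that is part of v2 as
posted:

#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Load a BPF object and attach its sk_lookup program to the current
 * network namespace through a bpf_link, so that the attachment shows
 * up in `bpftool link show`. Returns the link fd, or -1 on error.
 */
int attach_sk_lookup(const char *obj_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int netns_fd, link_fd;

	obj = bpf_object__open_file(obj_path, NULL);
	if (libbpf_get_error(obj))
		return -1;
	if (bpf_object__load(obj))
		return -1;

	prog = bpf_object__find_program_by_title(obj, "sk_lookup");
	if (!prog)
		return -1;

	netns_fd = open("/proc/self/ns/net", O_RDONLY);
	if (netns_fd < 0)
		return -1;

	link_fd = bpf_link_create(bpf_program__fd(prog), netns_fd,
				  BPF_SK_LOOKUP, NULL);
	close(netns_fd);
	return link_fd;
}

Detaching then means closing (or unpinning) the link rather than
issuing a separate PROG_DETACH command.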

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-05-11 18:52   ` Jakub Sitnicki
@ 2020-05-15 15:07         ` Alexei Starovoitov
  -1 siblings, 0 replies; 68+ messages in thread
From: Alexei Starovoitov @ 2020-05-15 15:07 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: netdev, bpf, dccp, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Gerrit Renker,
	Jakub Kicinski, Andrii Nakryiko, Martin KaFai Lau,
	Marek Majkowski, Lorenz Bauer

On Fri, May 15, 2020 at 02:28:30PM +0200, Jakub Sitnicki wrote:
> On Mon, May 11, 2020 at 10:44 PM CEST, Alexei Starovoitov wrote:
> > On Mon, May 11, 2020 at 08:52:06PM +0200, Jakub Sitnicki wrote:
> >> Run a BPF program before looking up a listening socket on the receive path.
> >> Program selects a listening socket to yield as result of socket lookup by
> >> calling bpf_sk_assign() helper and returning BPF_REDIRECT code.
> >>
> >> Alternatively, program can also fail the lookup by returning with BPF_DROP,
> >> or let the lookup continue as usual with BPF_OK on return.
> >>
> >> This lets the user match packets with listening sockets freely at the last
> >> possible point on the receive path, where we know that packets are destined
> >> for local delivery after undergoing policing, filtering, and routing.
> >>
> >> With BPF code selecting the socket, directing packets destined to an IP
> >> range or to a port range to a single socket becomes possible.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> 
> [...]
> 
> > Also please switch to bpf_link way of attaching. All system wide attachments
> > should be visible and easily debuggable via 'bpftool link show'.
> > Currently we're converting tc and xdp hooks to bpf_link. This new hook
> > should have it from the beginning.
> 
> Just to clarify, I understood that bpf(BPF_PROG_ATTACH/DETACH) doesn't
> have to be supported for new hooks.

Yes. Not only is it not needed, I don't think attach/detach fits here.

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2020-05-15 15:07 UTC | newest]

Thread overview: 68+ messages
-- links below jump to the message on this page --
2020-05-11 18:52 [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup Jakub Sitnicki
2020-05-11 18:52 ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 01/17] flow_dissector: Extract attach/detach/query helpers Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 19:06   ` Jakub Sitnicki
2020-05-11 19:06     ` Jakub Sitnicki
2020-05-13  5:41   ` Martin KaFai Lau
2020-05-13  5:41     ` Martin KaFai Lau
2020-05-13 14:34     ` Jakub Sitnicki
2020-05-13 14:34       ` Jakub Sitnicki
2020-05-13 18:10       ` Martin KaFai Lau
2020-05-13 18:10         ` Martin KaFai Lau
2020-05-11 18:52 ` [PATCH bpf-next v2 03/17] inet: Store layer 4 protocol in inet_hashinfo Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 04/17] inet: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 05/17] inet: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 20:44   ` Alexei Starovoitov
2020-05-11 20:44     ` Alexei Starovoitov
2020-05-12 13:52     ` Jakub Sitnicki
2020-05-12 13:52       ` Jakub Sitnicki
2020-05-12 23:58       ` Alexei Starovoitov
2020-05-12 23:58         ` Alexei Starovoitov
2020-05-13 13:55         ` Jakub Sitnicki
2020-05-13 13:55           ` Jakub Sitnicki
2020-05-13 14:21       ` Lorenz Bauer
2020-05-13 14:21         ` Lorenz Bauer
2020-05-13 14:50         ` Jakub Sitnicki
2020-05-13 14:50           ` Jakub Sitnicki
2020-05-15 12:28     ` Jakub Sitnicki
2020-05-15 12:28       ` Jakub Sitnicki
2020-05-15 15:07       ` Alexei Starovoitov
2020-05-15 15:07         ` Alexei Starovoitov
2020-05-11 18:52 ` [PATCH bpf-next v2 06/17] inet6: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 07/17] inet6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 08/17] udp: Store layer 4 protocol in udp_table Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 09/17] udp: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 10/17] udp: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 11/17] udp6: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 12/17] udp6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 13/17] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 14/17] libbpf: Add support for SK_LOOKUP program type Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 15/17] selftests/bpf: Add verifier tests for bpf_sk_lookup context access Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 16/17] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 18:52 ` [PATCH bpf-next v2 17/17] selftests/bpf: Tests for BPF_SK_LOOKUP attach point Jakub Sitnicki
2020-05-11 18:52   ` Jakub Sitnicki
2020-05-11 19:45 ` [PATCH bpf-next v2 00/17] Run a BPF program on socket lookup Martin KaFai Lau
2020-05-11 19:45   ` Martin KaFai Lau
2020-05-12 11:57   ` Jakub Sitnicki
2020-05-12 11:57     ` Jakub Sitnicki
2020-05-12 16:34     ` Martin KaFai Lau
2020-05-12 16:34       ` Martin KaFai Lau
2020-05-13 17:54       ` Jakub Sitnicki
2020-05-13 17:54         ` Jakub Sitnicki
