bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v3 00/12] xdp: hints via kfuncs
@ 2022-12-06  2:45 Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata Stanislav Fomichev
                   ` (12 more replies)
  0 siblings, 13 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Please see the first patch in the series for the overall
design and use-cases.

Changes since v3:

- Rework prog->bound_netdev refcounting (Jakub/Martin)

  Now it's based on the offload.c framework. It mostly fits, except
  I had to automatically insert a HT entry for the netdev. In the
  offloaded case, the netdev is added via a call to
  bpf_offload_dev_netdev_register from the driver init path; with
  dev-bound programs, we have to manually add (and remove) the entry.

  As suggested by Toke, I'm also prohibiting putting dev-bound programs
  into prog-array maps, essentially prohibiting tail calls into them.
  I'm also disabling freplace of dev-bound programs. Both of those
  restrictions can be loosened up eventually.
  Note that we don't require maps to be dev-bound when the program is
  dev-bound.

  Confirmed with test_offload.py that the existing parts are still
  operational.

- Fix compile issues with CONFIG_NET=n and mlx5 driver (lkp@intel.com)

Changes since v2:

- Rework bpf_prog_aux->xdp_netdev refcnt (Martin)

  Switched to dropping the count early, after loading / verification is
  done. At attach time, the pointer value is used only for comparing
  the actual netdev at attach vs netdev at load.

  (Potentially a problem if the same slub slot is reused
  for another netdev later on?)

- Use correct RX queue number in xdp_hw_metadata (Toke / Jakub)

- Fix wrongly placed '*cnt=0' in fixup_kfunc_call after merge (Toke)

- Fix sorted BTF_SET8_START (Toke)

  Introduce old-school unsorted BTF_ID_LIST for lookup purposes.

- Zero-initialize mlx4_xdp_buff (Tariq)

- Separate common timestamp handling into mlx4_en_get_hwtstamp (Tariq)

- mlx5 patches (Toke)

  Note, I've renamed the following for consistency with the rest:
  - s/mlx5_xdp_ctx/mlx5_xdp_buff/
  - s/mctx/mxbuf/

Changes since v1:

- Drop xdp->skb metadata path (Jakub)

  No consensus yet on exposing xdp_skb_metadata in UAPI. Exploring
  whether everyone would be OK with a kfunc to access that part;
  will follow up separately.

- Drop kfunc unrolling (Alexei)

  Starting with simple code to resolve per-device ndo kfuncs.
  We can always go back to unrolling while keeping the same kfunc
  interface in the future.

- Add rx hash metadata (Toke)

  Not adding the rest (csum/hash_type/etc) yet; I'd like us to agree on
  the framework first.

- use dev_get_by_index and add proper refcnt (Toke)

Changes since last RFC:

- drop ice/bnxt example implementation (Alexander)

  -ENOHARDWARE to test

- fix/test mlx4 implementation

  Confirmed that I get a reasonable-looking timestamp.
  The last patch in the series is the small xsk program that can
  be used to dump incoming metadata.

- bpf_push64/bpf_pop64 (Alexei)

  x86_64+arm64(untested)+disassembler

- struct xdp_to_skb_metadata -> struct xdp_skb_metadata (Toke)

  s/xdp_to_skb/xdp_skb/

- Documentation/bpf/xdp-rx-metadata.rst

  Documents functionality, assumptions and limitations.

- bpf_xdp_metadata_export_to_skb returns true/false (Martin)

  Plus xdp_md->skb_metadata field to access it.

- BPF_F_XDP_HAS_METADATA flag (Toke/Martin)

  Drop magic, use the flag instead.

- drop __randomize_layout

  Not sure it's possible to sanely expose it via UAPI. Because every
  .o potentially gets its own randomized layout, test_progs
  refuses to link.

- remove __net_timestamp in veth driver (John/Jesper)

  Instead, calling ktime_get from the kfunc; enough for the selftests.

Future work on RX side:

- Support more devices besides veth and mlx4
- Support more metadata besides RX timestamp.
- Convert skb_metadata_set() callers to xdp_convert_skb_metadata(),
  which handles the extra xdp_skb_metadata

Prior art (to record pros/cons for different approaches):

- Stable UAPI approach:
  https://lore.kernel.org/bpf/20220628194812.1453059-1-alexandr.lobakin@intel.com/
- Metadata+BTF_ID approach:
  https://lore.kernel.org/bpf/166256538687.1434226.15760041133601409770.stgit@firesoul/
- v1:
  https://lore.kernel.org/bpf/20221115030210.3159213-1-sdf@google.com/T/#t
- kfuncs v2 RFC:
  https://lore.kernel.org/bpf/20221027200019.4106375-1-sdf@google.com/
- kfuncs v1 RFC:
  https://lore.kernel.org/bpf/20221104032532.1615099-1-sdf@google.com/

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (9):
  bpf: Document XDP RX metadata
  bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  bpf: XDP metadata RX kfuncs
  veth: Introduce veth_xdp_buff wrapper for xdp_buff
  veth: Support RX XDP metadata
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  mlx4: Support RX XDP metadata
  selftests/bpf: Simple program to dump XDP RX metadata

Toke Høiland-Jørgensen (3):
  xsk: Add cb area to struct xdp_buff_xsk
  mlx5: Introduce mlx5_xdp_buff wrapper for xdp_buff
  mlx5: Support RX XDP metadata

 Documentation/bpf/xdp-rx-metadata.rst         |  90 ++++
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |  13 +-
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  10 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  68 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  11 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  32 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  13 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  35 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   4 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  98 ++---
 drivers/net/veth.c                            |  88 ++--
 include/linux/bpf.h                           |  26 +-
 include/linux/mlx4/device.h                   |   7 +
 include/linux/netdevice.h                     |   5 +
 include/net/xdp.h                             |  29 ++
 include/net/xsk_buff_pool.h                   |   5 +
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/arraymap.c                         |  17 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/offload.c                          | 162 +++++--
 kernel/bpf/syscall.c                          |  25 +-
 kernel/bpf/verifier.c                         |  42 +-
 net/core/dev.c                                |   7 +-
 net/core/filter.c                             |   2 +-
 net/core/xdp.c                                |  58 +++
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 394 +++++++++++++++++
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  93 ++++
 .../selftests/bpf/progs/xdp_metadata.c        |  70 +++
 .../selftests/bpf/progs/xdp_metadata2.c       |  15 +
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 ++++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |   7 +
 36 files changed, 1688 insertions(+), 167 deletions(-)
 create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08  4:25   ` Jakub Kicinski
  2022-12-06  2:45 ` [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Document all current use-cases and assumptions.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/xdp-rx-metadata.rst | 90 +++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 Documentation/bpf/xdp-rx-metadata.rst

diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..498eae718275
--- /dev/null
+++ b/Documentation/bpf/xdp-rx-metadata.rst
@@ -0,0 +1,90 @@
+===============
+XDP RX Metadata
+===============
+
+XDP programs support creating and passing custom metadata via
+``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
+entities:
+
+1. ``AF_XDP`` consumer.
+2. Kernel core stack via ``XDP_PASS``.
+3. Another device via ``bpf_redirect_map``.
+4. Other BPF programs via ``bpf_tail_call``.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata. Every
+device driver implements these kfuncs. The set of kfuncs is
+declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
+  indicate whether the device supports RX timestamps
+- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp
+- ``bpf_xdp_metadata_rx_hash_supported`` returns true/false to
+  indicate whether the device supports RX hash
+- ``bpf_xdp_metadata_rx_hash`` returns packet RX hash
+
+Within the XDP frame, the metadata layout is as follows::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+             ^                 ^
+             |                 |
+   xdp_buff->data_meta   xdp_buff->data
+
+AF_XDP
+======
+
+``AF_XDP`` use-case implies that there is a contract between the BPF program
+that redirects XDP frames into the ``XSK`` and the final consumer.
+Thus the BPF program manually allocates a fixed number of
+metadata bytes via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate them. The user-space ``XSK`` consumer looks
+at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+                               ^
+                               |
+                        rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
+Currently, every driver has custom kernel code to parse the descriptors and
+populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
+In the future, we'd like to support a case where an XDP program can override
+some of that metadata.
+
+The plan of record is to make this path similar to ``bpf_redirect_map``
+so the program can control which metadata is passed to the skb layer.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+In this case we don't know ahead of time whether that final consumer
+will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
+Additionally, the final consumer doesn't have access to the original
+hardware descriptor and can't access any of the original metadata.
+
+For this use-case, only custom metadata is currently supported. If
+the frame is eventually passed to the kernel, the skb created from such
+a frame won't have any skb metadata. The ``XSK`` consumer will only
+have access to the custom metadata.
+
+bpf_tail_call
+=============
+
+No special handling here. A tail-called program operates on the same context
+as the original one.
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08  4:26   ` Jakub Kicinski
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h   |  8 ++++----
 kernel/bpf/core.c     |  4 ++--
 kernel/bpf/offload.c  |  4 ++--
 kernel/bpf/syscall.c  | 22 +++++++++++-----------
 kernel/bpf/verifier.c | 18 +++++++++---------
 net/core/dev.c        |  2 +-
 net/core/filter.c     |  2 +-
 7 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4920ac252754..d5d479dae118 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2481,12 +2481,12 @@ void unpriv_ebpf_notify(int new_state);
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
 
-static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
 	return aux->offload_requested;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return unlikely(map->ops == &bpf_map_offload_ops);
 }
@@ -2513,12 +2513,12 @@ static inline int bpf_prog_offload_init(struct bpf_prog *prog,
 	return -EOPNOTSUPP;
 }
 
-static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return false;
 }
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2e57fc839a5c..641ab412ad7e 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2183,7 +2183,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * valid program, which in this case would simply not
 	 * be JITed, but falls back to the interpreter.
 	 */
-	if (!bpf_prog_is_dev_bound(fp->aux)) {
+	if (!bpf_prog_is_offloaded(fp->aux)) {
 		*err = bpf_prog_alloc_jited_linfo(fp);
 		if (*err)
 			return fp;
@@ -2554,7 +2554,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_dev_bound(aux))
+	if (bpf_prog_is_offloaded(aux))
 		bpf_prog_offload_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 13e4efc971e6..f5769a8ecbee 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -549,7 +549,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -581,7 +581,7 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 	struct bpf_offloaded_map *offmap;
 	bool ret;
 
-	if (!bpf_map_is_dev_bound(map))
+	if (!bpf_map_is_offloaded(map))
 		return bpf_map_offload_neutral(map);
 	offmap = map_to_offmap(map);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35972afb6850..13bc96035116 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -181,7 +181,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	int err;
 
 	/* Need to create a kthread, thus must support schedule */
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		return bpf_map_offload_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
 		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
@@ -238,7 +238,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	void *ptr;
 	int err;
 
-	if (bpf_map_is_dev_bound(map))
+	if (bpf_map_is_offloaded(map))
 		return bpf_map_offload_lookup_elem(map, key, value);
 
 	bpf_disable_instrumentation();
@@ -1483,7 +1483,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
 	} else if (IS_FD_PROG_ARRAY(map) ||
@@ -1547,7 +1547,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (!next_key)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_get_next_key(map, key, next_key);
 		goto out;
 	}
@@ -1605,7 +1605,7 @@ int generic_map_delete_batch(struct bpf_map *map,
 				   map->key_size))
 			break;
 
-		if (bpf_map_is_dev_bound(map)) {
+		if (bpf_map_is_offloaded(map)) {
 			err = bpf_map_offload_delete_elem(map, key);
 			break;
 		}
@@ -1851,7 +1851,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		   map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		if (!bpf_map_is_dev_bound(map)) {
+		if (!bpf_map_is_offloaded(map)) {
 			bpf_disable_instrumentation();
 			rcu_read_lock();
 			err = map->ops->map_lookup_and_delete_elem(map, key, value, attr->flags);
@@ -1944,7 +1944,7 @@ static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog)
 	if (!ops)
 		return -EINVAL;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		prog->aux->ops = ops;
 	else
 		prog->aux->ops = &bpf_offload_prog_ops;
@@ -2255,7 +2255,7 @@ bool bpf_prog_get_ok(struct bpf_prog *prog,
 
 	if (prog->type != *attach_type)
 		return false;
-	if (bpf_prog_is_dev_bound(prog->aux) && !attach_drv)
+	if (bpf_prog_is_offloaded(prog->aux) && !attach_drv)
 		return false;
 
 	return true;
@@ -2598,7 +2598,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
@@ -3997,7 +3997,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 			return -EFAULT;
 	}
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_info_fill(&info, prog);
 		if (err)
 			return err;
@@ -4225,7 +4225,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
 			return err;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d51bd9596da..fc4e313a4d2e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13648,7 +13648,7 @@ static int do_check(struct bpf_verifier_env *env)
 			env->prev_log_len = env->log.len_used;
 		}
 
-		if (bpf_prog_is_dev_bound(env->prog->aux)) {
+		if (bpf_prog_is_offloaded(env->prog->aux)) {
 			err = bpf_prog_offload_verify_insn(env, env->insn_idx,
 							   env->prev_insn_idx);
 			if (err)
@@ -14128,7 +14128,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		}
 	}
 
-	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
+	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
 	    !bpf_offload_prog_map_match(prog, map)) {
 		verbose(env, "offload device mismatch between prog and map\n");
 		return -EINVAL;
@@ -14609,7 +14609,7 @@ static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	unsigned int orig_prog_len = env->prog->len;
 	int err;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		bpf_prog_offload_remove_insns(env, off, cnt);
 
 	err = bpf_remove_insns(env->prog, off, cnt);
@@ -14690,7 +14690,7 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
 		else
 			continue;
 
-		if (bpf_prog_is_dev_bound(env->prog->aux))
+		if (bpf_prog_is_offloaded(env->prog->aux))
 			bpf_prog_offload_replace_insn(env, i, &ja);
 
 		memcpy(insn, &ja, sizeof(ja));
@@ -14873,7 +14873,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		}
 	}
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		return 0;
 
 	insn = env->prog->insnsi + delta;
@@ -15273,7 +15273,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	int err = 0;
 
 	if (env->prog->jit_requested &&
-	    !bpf_prog_is_dev_bound(env->prog->aux)) {
+	    !bpf_prog_is_offloaded(env->prog->aux)) {
 		err = jit_subprogs(env);
 		if (err == 0)
 			return 0;
@@ -16747,7 +16747,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	if (ret < 0)
 		goto skip_full_check;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = bpf_prog_offload_verifier_prep(env->prog);
 		if (ret)
 			goto skip_full_check;
@@ -16760,7 +16760,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	ret = do_check_subprogs(env);
 	ret = ret ?: do_check_main(env);
 
-	if (ret == 0 && bpf_prog_is_dev_bound(env->prog->aux))
+	if (ret == 0 && bpf_prog_is_offloaded(env->prog->aux))
 		ret = bpf_prog_offload_finalize(env);
 
 skip_full_check:
@@ -16795,7 +16795,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	/* do 32-bit optimization after insn patching has done so those patched
 	 * insns could be handled correctly.
 	 */
-	if (ret == 0 && !bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (ret == 0 && !bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = opt_subreg_zext_lo32_rnd_hi32(env, attr);
 		env->prog->aux->verifier_zext = bpf_jit_needs_zext() ? !ret
 								     : false;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7627c475d991..5b221568dfd4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9224,7 +9224,7 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
-		if (!offload && bpf_prog_is_dev_bound(new_prog->aux)) {
+		if (!offload && bpf_prog_is_offloaded(new_prog->aux)) {
 			NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
diff --git a/net/core/filter.c b/net/core/filter.c
index 8607136b6e2c..d89c30ba2623 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8719,7 +8719,7 @@ static bool xdp_is_valid_access(int off, int size,
 	}
 
 	if (type == BPF_WRITE) {
-		if (bpf_prog_is_dev_bound(prog->aux)) {
+		if (bpf_prog_is_offloaded(prog->aux)) {
 			switch (off) {
 			case offsetof(struct xdp_md, rx_queue_index):
 				return __is_valid_xdp_access(off, size);
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-07  4:29   ` Alexei Starovoitov
                     ` (4 more replies)
  2022-12-06  2:45 ` [PATCH bpf-next v3 04/12] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (9 subsequent siblings)
  12 siblings, 5 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

There is one ndo handler per kfunc; the verifier replaces a call to the
generic kfunc with a call to the per-device one.

For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
implements all possible metadata kfuncs. Not all devices have to
implement them. If a kfunc is not supported by the target device,
the default implementation is called instead.

Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
we treat prog_ifindex as the target device for kfunc resolution.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h            |  20 +++-
 include/linux/netdevice.h      |   5 +
 include/net/xdp.h              |  29 ++++++
 include/uapi/linux/bpf.h       |   5 +
 kernel/bpf/arraymap.c          |  17 +++-
 kernel/bpf/core.c              |   2 +-
 kernel/bpf/offload.c           | 162 ++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c           |   7 +-
 kernel/bpf/verifier.c          |  24 ++++-
 net/core/dev.c                 |   5 +
 net/core/xdp.c                 |  58 ++++++++++++
 tools/include/uapi/linux/bpf.h |   5 +
 12 files changed, 304 insertions(+), 35 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d5d479dae118..b46b60f4eae1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1261,7 +1261,8 @@ struct bpf_prog_aux {
 	enum bpf_prog_type saved_dst_prog_type;
 	enum bpf_attach_type saved_dst_attach_type;
 	bool verifier_zext; /* Zero extensions has been inserted by verifier. */
-	bool offload_requested;
+	bool dev_bound; /* Program is bound to the netdev. */
+	bool offload_requested; /* Program is bound and offloaded to the netdev. */
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
 	bool func_proto_unreliable;
 	bool sleepable;
@@ -2476,10 +2477,18 @@ void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 				       struct net_device *netdev);
 bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
 
+void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
+
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
+void bpf_offload_bound_netdev_unregister(struct net_device *dev);
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return aux->dev_bound;
+}
 
 static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
@@ -2513,6 +2522,15 @@ static inline int bpf_prog_offload_init(struct bpf_prog *prog,
 	return -EOPNOTSUPP;
 }
 
+static inline void bpf_offload_bound_netdev_unregister(struct net_device *dev)
+{
+}
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return false;
+}
+
 static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5aa35c58c342..2eabb9157767 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
 struct xdp_buff;
+struct xdp_md;
 
 void synchronize_net(void);
 void netdev_set_default_ethtool_ops(struct net_device *dev,
@@ -1611,6 +1612,10 @@ struct net_device_ops {
 	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
 						  const struct skb_shared_hwtstamps *hwtstamps,
 						  bool cycles);
+	bool			(*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
+	u64			(*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
+	bool			(*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
+	u32			(*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..c24aba5c363b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -409,4 +409,33 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
+			   bpf_xdp_metadata_rx_timestamp_supported) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED, \
+			   bpf_xdp_metadata_rx_hash_supported) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			   bpf_xdp_metadata_rx_hash) \
+
+enum {
+#define XDP_METADATA_KFUNC(name, str) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 xdp_metadata_kfunc_id(int id);
+#else
+static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
+#endif
+
+struct xdp_md;
+bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx);
+u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx);
+bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx);
+u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx);
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f89de51a45db..790650a81f2b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 484706959556..9b190b72ffce 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -881,12 +881,21 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
 	if (IS_ERR(prog))
 		return prog;
 
-	if (!bpf_prog_map_compatible(map, prog)) {
-		bpf_prog_put(prog);
-		return ERR_PTR(-EINVAL);
-	}
+	/* When tail-calling from a non-dev-bound program to a dev-bound one,
+	 * XDP metadata helpers should be disabled. Until it's implemented,
+	 * prohibit adding dev-bound programs to tail-call maps.
+	 */
+	if (bpf_prog_is_dev_bound(prog->aux))
+		goto err;
+
+	if (!bpf_prog_map_compatible(map, prog))
+		goto err;
 
 	return prog;
+
+err:
+	bpf_prog_put(prog);
+	return ERR_PTR(-EINVAL);
 }
 
 static void prog_fd_array_put_ptr(void *ptr)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 641ab412ad7e..71c6dc081f62 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2554,7 +2554,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_offloaded(aux))
+	if (bpf_prog_is_dev_bound(aux))
 		bpf_prog_offload_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index f5769a8ecbee..bad8bab916eb 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -41,7 +41,7 @@ struct bpf_offload_dev {
 struct bpf_offload_netdev {
 	struct rhash_head l;
 	struct net_device *netdev;
-	struct bpf_offload_dev *offdev;
+	struct bpf_offload_dev *offdev; /* NULL when bound-only */
 	struct list_head progs;
 	struct list_head maps;
 	struct list_head offdev_netdevs;
@@ -58,6 +58,12 @@ static const struct rhashtable_params offdevs_params = {
 static struct rhashtable offdevs;
 static bool offdevs_inited;
 
+static int __bpf_offload_init(void);
+static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
+					     struct net_device *netdev);
+static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
+						struct net_device *netdev);
+
 static int bpf_dev_offload_check(struct net_device *netdev)
 {
 	if (!netdev)
@@ -87,13 +93,17 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
-	if (attr->prog_flags)
+	if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
 		return -EINVAL;
 
+	err = __bpf_offload_init();
+	if (err)
+		return err;
+
 	offload = kzalloc(sizeof(*offload), GFP_USER);
 	if (!offload)
 		return -ENOMEM;
 
 	offload->prog = prog;
 
 	offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
@@ -102,11 +112,25 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	if (err)
 		goto err_maybe_put;
 
+	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
+
 	down_write(&bpf_devs_lock);
 	ondev = bpf_offload_find_netdev(offload->netdev);
 	if (!ondev) {
-		err = -EINVAL;
-		goto err_unlock;
+		if (!prog->aux->offload_requested) {
+			/* When only binding to the device, explicitly
+			 * create an entry in the hashtable. See related
+			 * maybe_remove_bound_netdev.
+			 */
+			err = __bpf_offload_dev_netdev_register(NULL, offload->netdev);
+			if (err)
+				goto err_unlock;
+			ondev = bpf_offload_find_netdev(offload->netdev);
+		}
+		if (!ondev) {
+			err = -EINVAL;
+			goto err_unlock;
+		}
 	}
 	offload->offdev = ondev->offdev;
 	prog->aux->offload = offload;
@@ -209,6 +233,19 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	up_read(&bpf_devs_lock);
 }
 
+static void maybe_remove_bound_netdev(struct net_device *dev)
+{
+	struct bpf_offload_netdev *ondev;
+
+	rtnl_lock();
+	down_write(&bpf_devs_lock);
+	ondev = bpf_offload_find_netdev(dev);
+	if (ondev && !ondev->offdev && list_empty(&ondev->progs))
+		__bpf_offload_dev_netdev_unregister(NULL, dev);
+	up_write(&bpf_devs_lock);
+	rtnl_unlock();
+}
+
 static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
 	struct bpf_prog_offload *offload = prog->aux->offload;
@@ -226,10 +263,17 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
 
 void bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
+	struct net_device *netdev = NULL;
+
 	down_write(&bpf_devs_lock);
-	if (prog->aux->offload)
+	if (prog->aux->offload) {
+		netdev = prog->aux->offload->netdev;
 		__bpf_prog_offload_destroy(prog);
+	}
 	up_write(&bpf_devs_lock);
+
+	if (netdev)
+		maybe_remove_bound_netdev(netdev);
 }
 
 static int bpf_prog_offload_translate(struct bpf_prog *prog)
@@ -549,7 +593,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_offloaded(prog->aux))
+	if (!bpf_prog_is_dev_bound(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -592,8 +636,8 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 	return ret;
 }
 
-int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
-				    struct net_device *netdev)
+static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
+					     struct net_device *netdev)
 {
 	struct bpf_offload_netdev *ondev;
 	int err;
@@ -607,15 +651,14 @@ int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 	INIT_LIST_HEAD(&ondev->progs);
 	INIT_LIST_HEAD(&ondev->maps);
 
-	down_write(&bpf_devs_lock);
 	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
 	if (err) {
 		netdev_warn(netdev, "failed to register for BPF offload\n");
 		goto err_unlock_free;
 	}
 
-	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
-	up_write(&bpf_devs_lock);
+	if (offdev)
+		list_add(&ondev->offdev_netdevs, &offdev->netdevs);
 	return 0;
 
 err_unlock_free:
@@ -623,29 +666,42 @@ int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 	kfree(ondev);
 	return err;
 }
+
+int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
+				    struct net_device *netdev)
+{
+	int err;
+
+	down_write(&bpf_devs_lock);
+	err = __bpf_offload_dev_netdev_register(offdev, netdev);
+	up_write(&bpf_devs_lock);
+	return err;
+}
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register);
 
-void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
-				       struct net_device *netdev)
+static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
+						struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev, *altdev;
+	struct bpf_offload_netdev *ondev, *altdev = NULL;
 	struct bpf_offloaded_map *offmap, *mtmp;
 	struct bpf_prog_offload *offload, *ptmp;
 
 	ASSERT_RTNL();
 
-	down_write(&bpf_devs_lock);
 	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 	if (WARN_ON(!ondev))
-		goto unlock;
+		return;
 
 	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
-	list_del(&ondev->offdev_netdevs);
 
 	/* Try to move the objects to another netdev of the device */
-	altdev = list_first_entry_or_null(&offdev->netdevs,
-					  struct bpf_offload_netdev,
-					  offdev_netdevs);
+	if (offdev) {
+		list_del(&ondev->offdev_netdevs);
+		altdev = list_first_entry_or_null(&offdev->netdevs,
+						  struct bpf_offload_netdev,
+						  offdev_netdevs);
+	}
+
 	if (altdev) {
 		list_for_each_entry(offload, &ondev->progs, offloads)
 			offload->netdev = altdev->netdev;
@@ -664,15 +720,19 @@ void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 	WARN_ON(!list_empty(&ondev->progs));
 	WARN_ON(!list_empty(&ondev->maps));
 	kfree(ondev);
-unlock:
+}
+
+void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
+				       struct net_device *netdev)
+{
+	down_write(&bpf_devs_lock);
+	__bpf_offload_dev_netdev_unregister(offdev, netdev);
 	up_write(&bpf_devs_lock);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister);
 
-struct bpf_offload_dev *
-bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
+static int __bpf_offload_init(void)
 {
-	struct bpf_offload_dev *offdev;
 	int err;
 
 	down_write(&bpf_devs_lock);
@@ -680,12 +740,25 @@ bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
 		err = rhashtable_init(&offdevs, &offdevs_params);
 		if (err) {
 			up_write(&bpf_devs_lock);
-			return ERR_PTR(err);
+			return err;
 		}
 		offdevs_inited = true;
 	}
 	up_write(&bpf_devs_lock);
 
+	return 0;
+}
+
+struct bpf_offload_dev *
+bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
+{
+	struct bpf_offload_dev *offdev;
+	int err;
+
+	err = __bpf_offload_init();
+	if (err)
+		return ERR_PTR(err);
+
 	offdev = kzalloc(sizeof(*offdev), GFP_KERNEL);
 	if (!offdev)
 		return ERR_PTR(-ENOMEM);
@@ -710,3 +783,42 @@ void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
 	return offdev->priv;
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_priv);
+
+void bpf_offload_bound_netdev_unregister(struct net_device *dev)
+{
+	struct bpf_offload_netdev *ondev;
+
+	ASSERT_RTNL();
+
+	down_write(&bpf_devs_lock);
+	ondev = bpf_offload_find_netdev(dev);
+	if (ondev && !ondev->offdev)
+		__bpf_offload_dev_netdev_unregister(NULL, ondev->netdev);
+	up_write(&bpf_devs_lock);
+}
+
+void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
+{
+	const struct net_device_ops *netdev_ops;
+	void *p = NULL;
+
+	down_read(&bpf_devs_lock);
+	if (!prog->aux->offload || !prog->aux->offload->netdev)
+		goto out;
+
+	netdev_ops = prog->aux->offload->netdev->netdev_ops;
+
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED))
+		p = netdev_ops->ndo_xdp_rx_timestamp_supported;
+	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+		p = netdev_ops->ndo_xdp_rx_timestamp;
+	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED))
+		p = netdev_ops->ndo_xdp_rx_hash_supported;
+	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+		p = netdev_ops->ndo_xdp_rx_hash;
+	/* fallback to default kfunc when not supported by netdev */
+out:
+	up_read(&bpf_devs_lock);
+
+	return p;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 13bc96035116..b345a273f7d0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2491,7 +2491,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_HAS_METADATA))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2575,7 +2576,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->attach_btf = attach_btf;
 	prog->aux->attach_btf_id = attr->attach_btf_id;
 	prog->aux->dst_prog = dst_prog;
-	prog->aux->offload_requested = !!attr->prog_ifindex;
+	prog->aux->dev_bound = !!attr->prog_ifindex;
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
@@ -2598,7 +2599,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_offloaded(prog->aux)) {
+	if (bpf_prog_is_dev_bound(prog->aux)) {
 		err = bpf_prog_offload_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index fc4e313a4d2e..00951a59ee26 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		return -EINVAL;
 	}
 
+	*cnt = 0;
+
+	if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
+		if (bpf_prog_is_offloaded(env->prog->aux)) {
+			verbose(env, "no metadata kfuncs offload\n");
+			return -EINVAL;
+		}
+
+		if (bpf_prog_is_dev_bound(env->prog->aux)) {
+			void *p = bpf_offload_resolve_kfunc(env->prog, insn->imm);
+
+			if (p) {
+				insn->imm = BPF_CALL_IMM(p);
+				return 0;
+			}
+		}
+	}
+
 	/* insn->imm has the btf func_id. Replace it with
 	 * an address (relative to __bpf_base_call).
 	 */
@@ -15333,7 +15351,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		return -EFAULT;
 	}
 
-	*cnt = 0;
 	insn->imm = desc->imm;
 	if (insn->off)
 		return 0;
@@ -16340,6 +16357,11 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 	if (tgt_prog) {
 		struct bpf_prog_aux *aux = tgt_prog->aux;
 
+		if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
+			bpf_log(log, "Replacing device-bound programs not supported\n");
+			return -EINVAL;
+		}
+
 		for (i = 0; i < aux->func_info_cnt; i++)
 			if (aux->func_info[i].type_id == btf_id) {
 				subprog = i;
diff --git a/net/core/dev.c b/net/core/dev.c
index 5b221568dfd4..862e03fcffa6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9228,6 +9228,10 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
+		if (bpf_prog_is_dev_bound(new_prog->aux) && !bpf_offload_dev_match(new_prog, dev)) {
+			NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
+			return -EINVAL;
+		}
 		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
 			NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
 			return -EINVAL;
@@ -10813,6 +10817,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 		/* Shutdown queueing discipline. */
 		dev_shutdown(dev);
 
+		bpf_offload_bound_netdev_unregister(dev);
 		dev_xdp_uninstall(dev);
 
 		netdev_offload_xstats_disable_all(dev);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..8240805bfdb7 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +710,60 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+noinline bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	return false;
+}
+
+noinline u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx)
+{
+	return 0;
+}
+
+noinline bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx)
+{
+	return false;
+}
+
+noinline u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx)
+{
+	return 0;
+}
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+BTF_SET8_START(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID_FLAGS(func, str, 0)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+
+u32 xdp_metadata_kfunc_id(int id)
+{
+	/* xdp_metadata_kfunc_ids is sorted by BTF ID, so it can't be indexed by our enum */
+	return xdp_metadata_kfunc_ids_unsorted[id];
+}
+EXPORT_SYMBOL_GPL(xdp_metadata_kfunc_id);
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
+#else /* CONFIG_DEBUG_INFO_BTF */
+u32 xdp_metadata_kfunc_id(int id)
+{
+	return -1;
+}
+EXPORT_SYMBOL_GPL(xdp_metadata_kfunc_id);
+#endif /* CONFIG_DEBUG_INFO_BTF */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f89de51a45db..790650a81f2b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v3 04/12] veth: Introduce veth_xdp_buff wrapper for xdp_buff
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 05/12] veth: Support RX XDP metadata Stanislav Fomichev
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 56 +++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ac7c0653695f..04ffd8cb2945 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -116,6 +116,10 @@ static struct {
 	{ "peer_ifindex" },
 };
 
+struct veth_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -592,23 +596,24 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
-		struct xdp_buff xdp;
+		struct veth_xdp_buff vxbuf;
+		struct xdp_buff *xdp = &vxbuf.xdp;
 		u32 act;
 
-		xdp_convert_frame_to_buff(frame, &xdp);
-		xdp.rxq = &rq->xdp_rxq;
+		xdp_convert_frame_to_buff(frame, xdp);
+		xdp->rxq = &rq->xdp_rxq;
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 		switch (act) {
 		case XDP_PASS:
-			if (xdp_update_frame_from_buff(&xdp, frame))
+			if (xdp_update_frame_from_buff(xdp, frame))
 				goto err_xdp;
 			break;
 		case XDP_TX:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+			xdp->rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				stats->rx_drops++;
@@ -619,8 +624,8 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			goto xdp_xmit;
 		case XDP_REDIRECT:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+			xdp->rxq->mem = frame->mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 				frame = &orig_frame;
 				stats->rx_drops++;
 				goto err_xdp;
@@ -801,7 +806,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 {
 	void *orig_data, *orig_data_end;
 	struct bpf_prog *xdp_prog;
-	struct xdp_buff xdp;
+	struct veth_xdp_buff vxbuf;
+	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
 	int off;
 
@@ -815,22 +821,22 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	}
 
 	__skb_push(skb, skb->data - skb_mac_header(skb));
-	if (veth_convert_skb_to_xdp_buff(rq, &xdp, &skb))
+	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
 
-	orig_data = xdp.data;
-	orig_data_end = xdp.data_end;
+	orig_data = xdp->data;
+	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			stats->rx_drops++;
 			goto err_xdp;
@@ -839,10 +845,10 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 			stats->rx_drops++;
 			goto err_xdp;
 		}
@@ -862,7 +868,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
-	off = orig_data - xdp.data;
+	off = orig_data - xdp->data;
 	if (off > 0)
 		__skb_push(skb, off);
 	else if (off < 0)
@@ -871,21 +877,21 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	skb_reset_mac_header(skb);
 
 	/* check if bpf_xdp_adjust_tail was used */
-	off = xdp.data_end - orig_data_end;
+	off = xdp->data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off); /* positive on grow, negative on shrink */
 
 	/* XDP frag metadata (e.g. nr_frags) are updated in eBPF helpers
 	 * (e.g. bpf_xdp_adjust_tail), we need to update data_len here.
 	 */
-	if (xdp_buff_has_frags(&xdp))
+	if (xdp_buff_has_frags(xdp))
 		skb->data_len = skb_shinfo(skb)->xdp_frags_size;
 	else
 		skb->data_len = 0;
 
 	skb->protocol = eth_type_trans(skb, rq->dev);
 
-	metalen = xdp.data - xdp.data_meta;
+	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
 out:
@@ -898,7 +904,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 err_xdp:
 	rcu_read_unlock();
-	xdp_return_buff(&xdp);
+	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
-- 
2.39.0.rc0.267.gcb52ba06e7-goog




* [PATCH bpf-next v3 05/12] veth: Support RX XDP metadata
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 04/12] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 06/12] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

The goal is to enable end-to-end testing of the metadata for AF_XDP.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 04ffd8cb2945..d15302672493 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -118,6 +118,7 @@ static struct {
 
 struct veth_xdp_buff {
 	struct xdp_buff xdp;
+	struct sk_buff *skb;
 };
 
 static int veth_get_link_ksettings(struct net_device *dev,
@@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 
 		xdp_convert_frame_to_buff(frame, xdp);
 		xdp->rxq = &rq->xdp_rxq;
+		vxbuf.skb = NULL;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	__skb_push(skb, skb->data - skb_mac_header(skb));
 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
+	vxbuf.skb = skb;
 
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
@@ -1601,6 +1604,30 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static bool veth_xdp_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	return true;
+}
+
+static u64 veth_xdp_rx_timestamp(const struct xdp_md *ctx)
+{
+	return ktime_get_mono_fast_ns();
+}
+
+static bool veth_xdp_rx_hash_supported(const struct xdp_md *ctx)
+{
+	return true;
+}
+
+static u32 veth_xdp_rx_hash(const struct xdp_md *ctx)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+
+	if (_ctx->skb)
+		return skb_get_hash(_ctx->skb);
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1620,6 +1647,11 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+
+	.ndo_xdp_rx_timestamp_supported = veth_xdp_rx_timestamp_supported,
+	.ndo_xdp_rx_timestamp	= veth_xdp_rx_timestamp,
+	.ndo_xdp_rx_hash_supported = veth_xdp_rx_hash_supported,
+	.ndo_xdp_rx_hash	= veth_xdp_rx_hash,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 06/12] selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 05/12] veth: Support RX XDP metadata Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- create new netns
- create veth pair (veTX+veRX)
- setup AF_XDP socket for both interfaces
- attach bpf to veRX
- send packet via veTX
- verify the packet has expected metadata at veRX

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 394 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  70 ++++
 .../selftests/bpf/progs/xdp_metadata2.c       |  15 +
 tools/testing/selftests/bpf/xdp_metadata.h    |   7 +
 5 files changed, 487 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 6a0f043dc410..4eed22fa3681 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -527,7 +527,7 @@ TRUNNER_BPF_PROGS_DIR := progs
 TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c
+			 cap_helpers.c xsk.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
new file mode 100644
index 000000000000..0303dc2a43f0
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -0,0 +1,394 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_metadata.skel.h"
+#include "xdp_metadata2.skel.h"
+#include "xdp_metadata.h"
+#include "xsk.h"
+
+#include <bpf/btf.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#define TX_NAME "veTX"
+#define RX_NAME "veRX"
+
+#define UDP_PAYLOAD_BYTES 4
+
+#define AF_XDP_SOURCE_PORT 1234
+#define AF_XDP_CONSUMER_PORT 8080
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS XDP_FLAGS_DRV_MODE
+#define QUEUE_ID 0
+
+#define TX_ADDR "10.0.0.1"
+#define RX_ADDR "10.0.0.2"
+#define PREFIX_LEN "8"
+#define FAMILY AF_INET
+
+#define SYS(cmd) ({ \
+	if (!ASSERT_OK(system(cmd), (cmd))) \
+		goto out; \
+})
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+static int open_xsk(const char *ifname, struct xsk *xsk)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	__u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (!ASSERT_NEQ(xsk->umem_area, MAP_FAILED, "mmap"))
+		return -1;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (!ASSERT_OK(ret, "xsk_umem__create"))
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, QUEUE_ID,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (!ASSERT_OK(ret, "xsk_socket__create"))
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %llx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	if (!ASSERT_EQ(UMEM_NUM / 2, ret, "xsk_ring_prod__reserve"))
+		return ret;
+	if (!ASSERT_EQ(idx, 0, "fill idx != 0"))
+		return -1;
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %llx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void ip_csum(struct iphdr *iph)
+{
+	__u32 sum = 0;
+	__u16 *p;
+	int i;
+
+	iph->check = 0;
+	p = (void *)iph;
+	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
+		sum += p[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	iph->check = ~sum;
+}
+
+static int generate_packet(struct xsk *xsk, __u16 dst_port)
+{
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
+		return -1;
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+	iph = (void *)(eth + 1);
+	udph = (void *)(iph + 1);
+
+	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
+	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IP);
+
+	iph->version = 0x4;
+	iph->ihl = 0x5;
+	iph->tos = 0x9;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	iph->id = 0;
+	iph->frag_off = 0;
+	iph->ttl = 0;
+	iph->protocol = IPPROTO_UDP;
+	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
+	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
+	ip_csum(iph);
+
+	udph->source = htons(AF_XDP_SOURCE_PORT);
+	udph->dest = htons(dst_port);
+	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	udph->check = 0;
+
+	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+
+	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	xsk_ring_prod__submit(&xsk->tx, 1);
+
+	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!ASSERT_GE(ret, 0, "sendto"))
+		return ret;
+
+	return 0;
+}
+
+static void complete_tx(struct xsk *xsk)
+{
+	__u32 idx;
+	__u64 addr;
+
+	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: refill idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (ASSERT_EQ(xsk_ring_prod__reserve(&xsk->fill, 1, &idx), 1, "xsk_ring_prod__reserve")) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static int verify_xsk_metadata(struct xsk *xsk)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds = {};
+	struct xdp_meta *meta;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	__u64 comp_addr;
+	void *data;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+
+	ret = recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+	if (!ASSERT_EQ(ret, 0, "recvfrom"))
+		return -1;
+
+	fds.fd = xsk_socket__fd(xsk->socket);
+	fds.events = POLLIN;
+
+	ret = poll(&fds, 1, 1000);
+	if (!ASSERT_GT(ret, 0, "poll"))
+		return -1;
+
+	ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_cons__peek"))
+		return -2;
+
+	rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+	comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+	addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+	printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+	       xsk, idx, rx_desc->addr, addr, comp_addr);
+	data = xsk_umem__get_data(xsk->umem_area, addr);
+
+	/* Make sure we got the packet offset correctly. */
+
+	eth = data;
+	ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
+	iph = (void *)(eth + 1);
+	ASSERT_EQ((int)iph->version, 4, "iph->version");
+
+	/* custom metadata */
+
+	meta = data - sizeof(struct xdp_meta);
+
+	if (!ASSERT_NEQ(meta->rx_timestamp, 0, "rx_timestamp"))
+		return -1;
+
+	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
+		return -1;
+
+	xsk_ring_cons__release(&xsk->rx, 1);
+	refill_rx(xsk, comp_addr);
+
+	return 0;
+}
+
+void test_xdp_metadata(void)
+{
+	struct xdp_metadata2 *bpf_obj2 = NULL;
+	struct xdp_metadata *bpf_obj = NULL;
+	struct bpf_program *new_prog, *prog;
+	struct nstoken *tok = NULL;
+	__u32 queue_id = QUEUE_ID;
+	struct bpf_map *prog_arr;
+	struct xsk tx_xsk = {};
+	struct xsk rx_xsk = {};
+	__u32 val, key = 0;
+	int rx_ifindex;
+	int sock_fd;
+	int ret;
+
+	/* Setup new networking namespace, with a veth pair. */
+
+	SYS("ip netns add xdp_metadata");
+	tok = open_netns("xdp_metadata");
+	SYS("ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
+	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
+	SYS("ip link set dev " TX_NAME " address 00:00:00:00:00:01");
+	SYS("ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS("ip link set dev " TX_NAME " up");
+	SYS("ip link set dev " RX_NAME " up");
+	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+
+	rx_ifindex = if_nametoindex(RX_NAME);
+
+	/* Setup separate AF_XDP for TX and RX interfaces. */
+
+	ret = open_xsk(TX_NAME, &tx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
+		goto out;
+
+	ret = open_xsk(RX_NAME, &rx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
+		goto out;
+
+	bpf_obj = xdp_metadata__open();
+	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, rx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
+
+	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
+		goto out;
+
+	/* Make sure we can't add dev-bound programs to prog maps. */
+	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
+	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
+		goto out;
+
+	val = bpf_program__fd(prog);
+	if (!ASSERT_ERR(bpf_map__update_elem(prog_arr, &key, sizeof(key),
+					     &val, sizeof(val), BPF_ANY),
+			"update prog_arr"))
+		goto out;
+
+	/* Make sure we can't replace dev-bound program with a non-dev-bound one. */
+
+	bpf_obj2 = xdp_metadata2__open();
+	if (!ASSERT_OK_PTR(bpf_obj2, "open skeleton"))
+		goto out;
+
+	new_prog = bpf_object__find_program_by_name(bpf_obj2->obj, "freplace_rx");
+	bpf_program__set_attach_target(new_prog, bpf_program__fd(prog), "rx");
+
+	if (!ASSERT_ERR(xdp_metadata2__load(bpf_obj2), "load freplace skeleton"))
+		goto out;
+
+	/* Attach BPF program to RX interface. */
+
+	ret = bpf_xdp_attach(rx_ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto out;
+
+	sock_fd = xsk_socket__fd(rx_xsk.socket);
+	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	/* Send packet destined to RX AF_XDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate AF_XDP_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify AF_XDP RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
+		       "verify_xsk_metadata"))
+		goto out;
+
+	complete_tx(&tx_xsk);
+
+out:
+	close_xsk(&rx_xsk);
+	close_xsk(&tx_xsk);
+	if (bpf_obj2)
+		xdp_metadata2__destroy(bpf_obj2);
+	if (bpf_obj)
+		xdp_metadata__destroy(bpf_obj);
+	system("ip netns del xdp_metadata");
+	if (tok)
+		close_netns(tok);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
new file mode 100644
index 000000000000..7a3a72d2ba66
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
+
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u32);
+} prog_arr SEC(".maps");
+
+extern bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
+extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
+extern bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx) __ksym;
+extern __u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta;
+	struct xdp_meta *meta;
+	int ret;
+
+	/* Reserve enough for all custom metadata. */
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0)
+		return XDP_DROP;
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+
+	if (data_meta + sizeof(struct xdp_meta) > data)
+		return XDP_DROP;
+
+	meta = data_meta;
+
+	/* Export metadata. */
+
+	if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
+		meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+
+	if (bpf_xdp_metadata_rx_hash_supported(ctx))
+		meta->rx_hash = bpf_xdp_metadata_rx_hash(ctx);
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+SEC("?freplace/rx")
+int freplace_rx(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata2.c b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
new file mode 100644
index 000000000000..bec1732ade62
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+SEC("freplace/rx")
+int freplace_rx(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
new file mode 100644
index 000000000000..c4892d122b7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#pragma once
+
+struct xdp_meta {
+	__u64 rx_timestamp;
+	__u32 rx_hash;
+};
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 06/12] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08  6:11   ` Tariq Toukan
  2022-12-06  2:45 ` [PATCH bpf-next v3 08/12] mlx4: Support RX XDP metadata Stanislav Fomichev
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

No functional changes. This is boilerplate that allows stuffing more
data after the xdp_buff.

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8f762fc170b3..9c114fc723e3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -661,9 +661,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+struct mlx4_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_xdp_buff mxbuf = {};
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
@@ -671,7 +676,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];
 
 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;
 
@@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			orig_data = mxbuf.xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}
 
 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 08/12] mlx4: Support RX XDP metadata
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08  6:09   ` Tariq Toukan
  2022-12-06  2:45 ` [PATCH bpf-next v3 09/12] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Support RX timestamp and hash for now. Tested using the program from
the next patch.

Also enable XDP metadata support; it's not clear why it was disabled,
as there is enough headroom.

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++--
 .../net/ethernet/mellanox/mlx4/en_netdev.c    | 10 +++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 38 ++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  1 +
 include/linux/mlx4/device.h                   |  7 ++++
 5 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
index 98b5ffb4d729..9e3b76182088 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
@@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
 	return hi | lo;
 }
 
-void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
-			    struct skb_shared_hwtstamps *hwts,
-			    u64 timestamp)
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
 {
 	unsigned int seq;
 	u64 nsec;
@@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
 	} while (read_seqretry(&mdev->clock_lock, seq));
 
+	return ns_to_ktime(nsec);
+}
+
+void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
+			    struct skb_shared_hwtstamps *hwts,
+			    u64 timestamp)
+{
 	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
-	hwts->hwtstamp = ns_to_ktime(nsec);
+	hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
 }
 
 /**
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 8800d3f1f55c..1cb63746a851 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2855,6 +2855,11 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
 	.ndo_bpf		= mlx4_xdp,
+
+	.ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
+	.ndo_xdp_rx_timestamp	= mlx4_xdp_rx_timestamp,
+	.ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
+	.ndo_xdp_rx_hash	= mlx4_xdp_rx_hash,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2887,6 +2892,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
 	.ndo_bpf		= mlx4_xdp,
+
+	.ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
+	.ndo_xdp_rx_timestamp	= mlx4_xdp_rx_timestamp,
+	.ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
+	.ndo_xdp_rx_hash	= mlx4_xdp_rx_hash,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9c114fc723e3..1b8e1b2d8729 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -663,8 +663,40 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 
 struct mlx4_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
+	struct mlx4_en_rx_ring *ring;
+	struct net_device *dev;
 };
 
+bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	struct mlx4_xdp_buff *_ctx = (void *)ctx;
+
+	return _ctx->ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL;
+}
+
+u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx)
+{
+	struct mlx4_xdp_buff *_ctx = (void *)ctx;
+
+	return mlx4_en_get_hwtstamp(_ctx->mdev, mlx4_en_get_cqe_ts(_ctx->cqe));
+}
+
+bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx)
+{
+	struct mlx4_xdp_buff *_ctx = (void *)ctx;
+
+	return _ctx->dev->features & NETIF_F_RXHASH;
+}
+
+u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx)
+{
+	struct mlx4_xdp_buff *_ctx = (void *)ctx;
+
+	return be32_to_cpu(_ctx->cqe->immed_rss_invalid);
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -781,8 +813,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						DMA_FROM_DEVICE);
 
 			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
-					 frags[0].page_offset, length, false);
+					 frags[0].page_offset, length, true);
 			orig_data = mxbuf.xdp.data;
+			mxbuf.cqe = cqe;
+			mxbuf.mdev = priv->mdev;
+			mxbuf.ring = ring;
+			mxbuf.dev = dev;
 
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index e132ff4c82f2..b7c0d4899ad7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -792,6 +792,7 @@ int mlx4_en_netdev_event(struct notifier_block *this,
  * Functions for time stamping
  */
 u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe);
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp);
 void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 			    struct skb_shared_hwtstamps *hwts,
 			    u64 timestamp);
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6646634a0b9d..d5904da1d490 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -1585,4 +1585,11 @@ static inline int mlx4_get_num_reserved_uar(struct mlx4_dev *dev)
 	/* The first 128 UARs are used for EQ doorbells */
 	return (128 >> (PAGE_SHIFT - dev->uar_page_shift));
 }
+
+struct xdp_md;
+bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx);
+u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx);
+bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx);
+u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx);
+
 #endif /* MLX4_DEVICE_H */
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 09/12] xsk: Add cb area to struct xdp_buff_xsk
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 08/12] mlx4: Support RX XDP metadata Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 10/12] mlx5: Introduce mlx5_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Add an area after the xdp_buff in struct xdp_buff_xsk that drivers can use
to stash extra information to use in metadata kfuncs. The maximum size of
24 bytes means the full xdp_buff_xsk structure will take up exactly two
cache lines (with the cb field spanning both). Also add a macro drivers can
use to check their own wrapping structs against the available size.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xsk_buff_pool.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index f787c3f524b0..3e952e569418 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -19,8 +19,11 @@ struct xdp_sock;
 struct device;
 struct page;
 
+#define XSK_PRIV_MAX 24
+
 struct xdp_buff_xsk {
 	struct xdp_buff xdp;
+	u8 cb[XSK_PRIV_MAX];
 	dma_addr_t dma;
 	dma_addr_t frame_dma;
 	struct xsk_buff_pool *pool;
@@ -28,6 +31,8 @@ struct xdp_buff_xsk {
 	struct list_head free_list_node;
 };
 
+#define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+
 struct xsk_dma_map {
 	dma_addr_t *dma_pages;
 	struct device *dev;
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 10/12] mlx5: Introduce mlx5_xdp_buff wrapper for xdp_buff
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 09/12] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-06  2:45 ` [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata Stanislav Fomichev
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Preparation for implementing HW metadata kfuncs. No functional change.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 56 +++++++++----------
 5 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ff5b302531d5..cdbaac5f6d25 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -469,6 +469,7 @@ struct mlx5e_txqsq {
 union mlx5e_alloc_unit {
 	struct page *page;
 	struct xdp_buff *xsk;
+	struct mlx5_xdp_buff *mxbuf;
 };
 
 /* XDP packets can be transmitted in different ways. On completion, we need to
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 20507ef2f956..db49b813bcb5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -158,8 +158,9 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp)
+		      struct bpf_prog *prog, struct mlx5_xdp_buff *mxbuf)
 {
+	struct xdp_buff *xdp = &mxbuf->xdp;
 	u32 act;
 	int err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index bc2d9034af5b..a33b448d542d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -44,10 +44,14 @@
 	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
 	 sizeof(struct mlx5_wqe_inline_seg))
 
+struct mlx5_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 struct mlx5e_xsk_param;
 int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp);
+		      struct bpf_prog *prog, struct mlx5_xdp_buff *mlctx);
 void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
 bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
 void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index c91b54d9ff27..5e88dc61824e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -22,6 +22,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		goto err;
 
 	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
+	XSK_CHECK_PRIV_TYPE(struct mlx5_xdp_buff);
 	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
 				     rq->mpwqe.pages_per_wqe);
 
@@ -233,7 +234,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    u32 head_offset,
 						    u32 page_idx)
 {
-	struct xdp_buff *xdp = wi->alloc_units[page_idx].xsk;
+	struct mlx5_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
 	struct bpf_prog *prog;
 
 	/* Check packet size. Note LRO doesn't use linear SKB */
@@ -249,9 +250,9 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	/* Possible flows:
 	 * - XDP_REDIRECT to XSKMAP:
@@ -269,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp))) {
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf))) {
 		if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
 			__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 		return NULL; /* page/packet was consumed by XDP */
@@ -278,14 +279,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	/* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
 	 * frame. On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
 					      u32 cqe_bcnt)
 {
-	struct xdp_buff *xdp = wi->au->xsk;
+	struct mlx5_xdp_buff *mxbuf = wi->au->mxbuf;
 	struct bpf_prog *prog;
 
 	/* wi->offset is not used in this function, because xdp->data and the
@@ -295,17 +296,17 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp)))
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf)))
 		return NULL; /* page/packet was consumed by XDP */
 
 	/* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
 	 * will be handled by mlx5e_free_rx_wqe.
 	 * On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b1ea0b995d9c..434025703e50 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1565,10 +1565,10 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 }
 
 static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
-				u32 len, struct xdp_buff *xdp)
+				u32 len, struct mlx5_xdp_buff *mxbuf)
 {
-	xdp_init_buff(xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
-	xdp_prepare_buff(xdp, va, headroom, len, true);
+	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
 }
 
 static struct sk_buff *
@@ -1595,16 +1595,16 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp))
+		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1626,9 +1626,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
 	struct skb_shared_info *sinfo;
+	struct mlx5_xdp_buff mxbuf;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
-	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	dma_addr_t addr;
 	u32 truesize;
@@ -1643,8 +1643,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &xdp);
-	sinfo = xdp_get_shared_info_from_buff(&xdp);
+	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
 	cqe_bcnt -= frag_consumed_bytes;
@@ -1662,13 +1662,13 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		dma_sync_single_for_cpu(rq->pdev, addr + wi->offset,
 					frag_consumed_bytes, rq->buff.map_dir);
 
-		if (!xdp_buff_has_frags(&xdp)) {
+		if (!xdp_buff_has_frags(&mxbuf.xdp)) {
 			/* Init on the first fragment to avoid cold cache access
 			 * when possible.
 			 */
 			sinfo->nr_frags = 0;
 			sinfo->xdp_frags_size = 0;
-			xdp_buff_set_frags_flag(&xdp);
+			xdp_buff_set_frags_flag(&mxbuf.xdp);
 		}
 
 		frag = &sinfo->frags[sinfo->nr_frags++];
@@ -1677,7 +1677,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		skb_frag_size_set(frag, frag_consumed_bytes);
 
 		if (page_is_pfmemalloc(au->page))
-			xdp_buff_set_frag_pfmemalloc(&xdp);
+			xdp_buff_set_frag_pfmemalloc(&mxbuf.xdp);
 
 		sinfo->xdp_frags_size += frag_consumed_bytes;
 		truesize += frag_info->frag_stride;
@@ -1690,7 +1690,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	au = head_wi->au;
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 		if (test_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			int i;
 
@@ -1700,22 +1700,22 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
-	skb = mlx5e_build_linear_skb(rq, xdp.data_hard_start, rq->buff.frame0_sz,
-				     xdp.data - xdp.data_hard_start,
-				     xdp.data_end - xdp.data,
-				     xdp.data - xdp.data_meta);
+	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
+				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
+				     mxbuf.xdp.data_end - mxbuf.xdp.data,
+				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
 	if (unlikely(!skb))
 		return NULL;
 
 	page_ref_inc(au->page);
 
-	if (unlikely(xdp_buff_has_frags(&xdp))) {
+	if (unlikely(xdp_buff_has_frags(&mxbuf.xdp))) {
 		int i;
 
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
 		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
 					   sinfo->xdp_frags_size, truesize,
-					   xdp_buff_is_frag_pfmemalloc(&xdp));
+					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
 
 		for (i = 0; i < sinfo->nr_frags; i++) {
 			skb_frag_t *frag = &sinfo->frags[i];
@@ -1996,19 +1996,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 10/12] mlx5: Introduce mlx5_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08 22:59   ` Toke Høiland-Jørgensen
  2022-12-06  2:45 ` [PATCH bpf-next v3 12/12] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
  2022-12-08 22:28 ` [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Toke Høiland-Jørgensen
  12 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Support RX hash and timestamp metadata kfuncs. To implement them, pass
the cqe pointer into the mlx5e_skb_from_* functions so it can be
retrieved from the XDP ctx.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 29 +++++++++++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  7 +++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 ++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 48 ++++++++++---------
 7 files changed, 85 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cdbaac5f6d25..8337ff0cacd5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -627,10 +627,11 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
+			       u32 head_offset, u32 page_idx);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			 u32 cqe_bcnt);
+			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
 typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
 typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
 typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
@@ -1036,6 +1037,11 @@ int mlx5e_vlan_rx_kill_vid(struct net_device *dev, __always_unused __be16 proto,
 			   u16 vid);
 void mlx5e_timestamp_init(struct mlx5e_priv *priv);
 
+static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
+{
+	return config->rx_filter == HWTSTAMP_FILTER_ALL;
+}
+
 struct mlx5e_xsk_param;
 
 struct mlx5e_rq_param;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index db49b813bcb5..2a4700b3695a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -156,6 +156,35 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	return true;
 }
 
+bool mlx5e_xdp_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
+
+	return mlx5e_rx_hw_stamp(_ctx->rq->tstamp);
+}
+
+u64 mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx)
+{
+	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
+
+	return mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
+				  _ctx->rq->clock, get_cqe_ts(_ctx->cqe));
+}
+
+bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx)
+{
+	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
+
+	return _ctx->xdp.rxq->dev->features & NETIF_F_RXHASH;
+}
+
+u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
+{
+	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
+
+	return be32_to_cpu(_ctx->cqe->rss_hash_result);
+}
+
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
 		      struct bpf_prog *prog, struct mlx5_xdp_buff *mxbuf)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index a33b448d542d..a5fc30b07617 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -46,6 +46,8 @@
 
 struct mlx5_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_rq *rq;
 };
 
 struct mlx5e_xsk_param;
@@ -60,6 +62,11 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
+bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx);
+u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx);
+bool mlx5e_xdp_rx_timestamp_supported(const struct xdp_md *ctx);
+u64 mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx);
+
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 							  struct mlx5e_xmit_data *xdptxd,
 							  struct skb_shared_info *sinfo,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 5e88dc61824e..05cf7987585a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
 				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
 		for (i = 0; i < batch; i++) {
@@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
 		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
@@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else {
 		__be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
@@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 				.bcount = pad_size,
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	}
 
@@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx)
@@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
@@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt)
 {
 	struct mlx5_xdp_buff *mxbuf = wi->au->mxbuf;
@@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index 087c943bd8e9..cefc0ef6105d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt);
 
 #endif /* __MLX5_EN_XSK_RX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 217c8a478977..967a82bf34b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4891,6 +4891,10 @@ const struct net_device_ops mlx5e_netdev_ops = {
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
 	.ndo_bpf		 = mlx5e_xdp,
 	.ndo_xdp_xmit            = mlx5e_xdp_xmit,
+	.ndo_xdp_rx_timestamp_supported = mlx5e_xdp_rx_timestamp_supported,
+	.ndo_xdp_rx_timestamp    = mlx5e_xdp_rx_timestamp,
+	.ndo_xdp_rx_hash_supported = mlx5e_xdp_rx_hash_supported,
+	.ndo_xdp_rx_hash         = mlx5e_xdp_rx_hash,
 	.ndo_xsk_wakeup          = mlx5e_xsk_wakeup,
 #ifdef CONFIG_MLX5_EN_ARFS
 	.ndo_rx_flow_steer	 = mlx5e_rx_flow_steer,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 434025703e50..1bc631abe24c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -62,10 +62,12 @@
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
 	.handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
 };
 
-static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
-{
-	return config->rx_filter == HWTSTAMP_FILTER_ALL;
-}
-
 static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
 				       u32 cqcc, void *data)
 {
@@ -1564,16 +1561,18 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 	return skb;
 }
 
-static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
+static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe, void *va, u16 headroom,
 				u32 len, struct mlx5_xdp_buff *mxbuf)
 {
 	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
 	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
+	mxbuf->cqe = cqe;
+	mxbuf->rq = rq;
 }
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			  u32 cqe_bcnt)
+			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
@@ -1598,7 +1597,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 		struct mlx5_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
@@ -1619,7 +1618,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			     u32 cqe_bcnt)
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
@@ -1643,7 +1642,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
 	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
@@ -1766,7 +1765,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_linear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1819,7 +1818,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1878,7 +1877,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1929,7 +1928,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
@@ -1968,7 +1968,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1999,7 +2000,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		struct mlx5_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
@@ -2163,8 +2164,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 		if (likely(head_size))
 			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
 		else
-			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
-								  page_idx);
+			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
+								  data_offset, page_idx);
 		if (unlikely(!*skb))
 			goto free_hd_entry;
 
@@ -2238,7 +2239,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_mpwrq_linear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset,
+			      page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -2483,7 +2485,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto wq_free_wqe;
 
@@ -2575,7 +2577,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
 		goto free_wqe;
 	}
 
-	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
+	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto free_wqe;
 
-- 
2.39.0.rc0.267.gcb52ba06e7-goog



* [PATCH bpf-next v3 12/12] selftests/bpf: Simple program to dump XDP RX metadata
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata Stanislav Fomichev
@ 2022-12-06  2:45 ` Stanislav Fomichev
  2022-12-08 22:28 ` [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Toke Høiland-Jørgensen
  12 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-06  2:45 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

To be used for verification of driver implementations. Note that
the skb path is gone from the series, but I'm keeping the
implementation around for possible future work.

$ xdp_hw_metadata <ifname>

On the other machine:

$ echo -n xdp | nc -u -q1 <target> 9091 # for AF_XDP
$ echo -n skb | nc -u -q1 <target> 9092 # for skb

Sample output:

  # xdp
  xsk_ring_cons__peek: 1
  0x19f9090: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
  rx_timestamp_supported: 1
  rx_timestamp: 1667850075063948829
  0x19f9090: complete idx=8 addr=8000

  # skb
  found skb hwtstamp = 1668314052.854274681

Decoding:
  # xdp
  rx_timestamp=1667850075.063948829

  $ date -d @1667850075
  Mon Nov  7 11:41:15 AM PST 2022
  $ date
  Mon Nov  7 11:42:05 AM PST 2022

  # skb
  $ date -d @1668314052
  Sat Nov 12 08:34:12 PM PST 2022
  $ date
  Sat Nov 12 08:37:06 PM PST 2022

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  93 ++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 ++++++++++++++++++
 4 files changed, 504 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 07d2d0a8c5cb..01e3baeefd4f 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -46,3 +46,4 @@ test_cpp
 xskxceiver
 xdp_redirect_multi
 xdp_synproxy
+xdp_hw_metadata
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4eed22fa3681..189b39b0e5d0 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
-	xskxceiver xdp_redirect_multi xdp_synproxy veristat
+	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata
 
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file
 TEST_GEN_FILES += liburandom_read.so
@@ -241,6 +241,9 @@ $(OUTPUT)/test_maps: $(TESTING_HELPERS)
 $(OUTPUT)/test_verifier: $(TESTING_HELPERS) $(CAP_HELPERS)
 $(OUTPUT)/xsk.o: $(BPFOBJ)
 $(OUTPUT)/xskxceiver: $(OUTPUT)/xsk.o
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/xsk.o $(OUTPUT)/xdp_hw_metadata.skel.h
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/network_helpers.o
+$(OUTPUT)/xdp_hw_metadata: LDFLAGS += -static
 
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)    \
@@ -383,6 +386,7 @@ linked_maps.skel.h-deps := linked_maps1.bpf.o linked_maps2.bpf.o
 test_subskeleton.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o test_subskeleton.bpf.o
 test_subskeleton_lib.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o
 test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
+xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
new file mode 100644
index 000000000000..0ae409094883
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+#include <stdbool.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "xdp_metadata.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
+extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
+extern bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx) __ksym;
+extern __u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta, *data_end;
+	struct ipv6hdr *ip6h = NULL;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_meta *meta;
+	int ret;
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+			ip6h = (void *)(eth + 1);
+			if (ip6h + 1 < data_end && ip6h->nexthdr == IPPROTO_UDP)
+				udp = (void *)(ip6h + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+
+	if (!udp)
+		return XDP_PASS;
+
+	if (udp->dest != bpf_htons(9091))
+		return XDP_PASS;
+
+	bpf_printk("forwarding UDP:9091 to AF_XDP");
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0) {
+		bpf_printk("bpf_xdp_adjust_meta returned %d", ret);
+		return XDP_PASS;
+	}
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+	meta = data_meta;
+
+	if (meta + 1 > data) {
+		bpf_printk("bpf_xdp_adjust_meta doesn't appear to work");
+		return XDP_PASS;
+	}
+
+	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
+		meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+		bpf_printk("populated rx_timestamp with %llu", meta->rx_timestamp);
+	}
+
+	if (bpf_xdp_metadata_rx_hash_supported(ctx)) {
+		meta->rx_hash = bpf_xdp_metadata_rx_hash(ctx);
+		bpf_printk("populated rx_hash with %u", meta->rx_hash);
+	}
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
new file mode 100644
index 000000000000..29f9d01c1da1
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -0,0 +1,405 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Reference program for verifying XDP metadata on real HW. Functional test
+ * only, doesn't test performance.
+ *
+ * RX:
+ * - UDP 9091 packets are diverted into AF_XDP
+ * - Metadata verified:
+ *   - rx_timestamp
+ *   - rx_hash
+ *
+ * TX:
+ * - TBD
+ */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_hw_metadata.skel.h"
+#include "xsk.h"
+
+#include <error.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <linux/sockios.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#include "xdp_metadata.h"
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+struct xdp_hw_metadata *bpf_obj;
+struct xsk *rx_xsk;
+const char *ifname;
+int ifindex;
+int rxq;
+
+void test__fail(void) { /* for network_helpers.c */ }
+
+static int open_xsk(const char *ifname, struct xsk *xsk, __u32 queue_id)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (xsk->umem_area == MAP_FAILED)
+		return -ENOMEM;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (ret)
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, queue_id,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (ret)
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void verify_xdp_metadata(void *data)
+{
+	struct xdp_meta *meta;
+
+	meta = data - sizeof(*meta);
+
+	printf("rx_timestamp: %llu\n", meta->rx_timestamp);
+	printf("rx_hash: %u\n", meta->rx_hash);
+}
+
+static void verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (recvmsg(fd, &hdr, 0) < 0)
+		error(-1, errno, "recvmsg");
+
+	for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			continue;
+
+		switch (cmsg->cmsg_type) {
+		case SCM_TIMESTAMPING:
+			ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+			if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+				printf("found skb hwtstamp = %lu.%lu\n",
+				       ts->ts[2].tv_sec, ts->ts[2].tv_nsec);
+				return;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	printf("skb hwtstamp is not found!\n");
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds[rxq + 1];
+	__u64 comp_addr;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+	int i;
+
+	for (i = 0; i < rxq; i++) {
+		fds[i].fd = xsk_socket__fd(rx_xsk[i].socket);
+		fds[i].events = POLLIN;
+		fds[i].revents = 0;
+	}
+
+	fds[rxq].fd = server_fd;
+	fds[rxq].events = POLLIN;
+	fds[rxq].revents = 0;
+
+	while (true) {
+		errno = 0;
+		ret = poll(fds, rxq + 1, 1000);
+		printf("poll: %d (%d)\n", ret, errno);
+		if (ret < 0)
+			break;
+		if (ret == 0)
+			continue;
+
+		if (fds[rxq].revents)
+			verify_skb_metadata(server_fd);
+
+		for (i = 0; i < rxq; i++) {
+			if (fds[i].revents == 0)
+				continue;
+
+			struct xsk *xsk = &rx_xsk[i];
+
+			ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+			printf("xsk_ring_cons__peek: %d\n", ret);
+			if (ret != 1)
+				continue;
+
+			rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+			comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+			addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+			printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+			       xsk, idx, rx_desc->addr, addr, comp_addr);
+			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr));
+			xsk_ring_cons__release(&xsk->rx, 1);
+			refill_rx(xsk, comp_addr);
+		}
+	}
+
+	return 0;
+}
+
+struct ethtool_channels {
+	__u32	cmd;
+	__u32	max_rx;
+	__u32	max_tx;
+	__u32	max_other;
+	__u32	max_combined;
+	__u32	rx_count;
+	__u32	tx_count;
+	__u32	other_count;
+	__u32	combined_count;
+};
+
+#define ETHTOOL_GCHANNELS	0x0000003c /* Get no of channels */
+
+static int rxq_num(const char *ifname)
+{
+	struct ethtool_channels ch = {
+		.cmd = ETHTOOL_GCHANNELS,
+	};
+
+	struct ifreq ifr = {
+		.ifr_data = (void *)&ch,
+	};
+	int fd, ret;
+	strcpy(ifr.ifr_name, ifname);
+
+	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	if (fd < 0)
+		error(-1, errno, "socket");
+
+	ret = ioctl(fd, SIOCETHTOOL, &ifr);
+	if (ret < 0)
+		error(-1, errno, "ioctl(SIOCETHTOOL)");
+
+	close(fd);
+
+	return ch.rx_count + ch.combined_count;
+}
+
+static void cleanup(void)
+{
+	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
+	int ret;
+	int i;
+
+	if (bpf_obj) {
+		opts.old_prog_fd = bpf_program__fd(bpf_obj->progs.rx);
+		if (opts.old_prog_fd >= 0) {
+			printf("detaching bpf program....\n");
+			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
+			if (ret)
+				printf("failed to detach XDP program: %d\n", ret);
+		}
+	}
+
+	for (i = 0; i < rxq; i++)
+		close_xsk(&rx_xsk[i]);
+
+	if (bpf_obj)
+		xdp_hw_metadata__destroy(bpf_obj);
+}
+
+static void handle_signal(int sig)
+{
+	/* interrupting poll() is all we need */
+}
+
+static void timestamping_enable(int fd, int val)
+{
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	if (ret < 0)
+		error(-1, errno, "setsockopt(SO_TIMESTAMPING)");
+}
+
+int main(int argc, char *argv[])
+{
+	int server_fd = -1;
+	int ret;
+	int i;
+
+	struct bpf_program *prog;
+
+	if (argc != 2) {
+		fprintf(stderr, "pass device name\n");
+		return -1;
+	}
+
+	ifname = argv[1];
+	ifindex = if_nametoindex(ifname);
+	rxq = rxq_num(ifname);
+
+	printf("rxq: %d\n", rxq);
+
+	rx_xsk = malloc(sizeof(struct xsk) * rxq);
+	if (!rx_xsk)
+		error(-1, ENOMEM, "malloc");
+
+	for (i = 0; i < rxq; i++) {
+		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
+		ret = open_xsk(ifname, &rx_xsk[i], i);
+		if (ret)
+			error(-1, -ret, "open_xsk");
+
+		printf("xsk_socket__fd() -> %d\n", xsk_socket__fd(rx_xsk[i].socket));
+	}
+
+	printf("open bpf program...\n");
+	bpf_obj = xdp_hw_metadata__open();
+	if (libbpf_get_error(bpf_obj))
+		error(-1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
+
+	printf("load bpf program...\n");
+	ret = xdp_hw_metadata__load(bpf_obj);
+	if (ret)
+		error(-1, -ret, "xdp_hw_metadata__load");
+
+	printf("prepare skb endpoint...\n");
+	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
+	if (server_fd < 0)
+		error(-1, errno, "start_server");
+	timestamping_enable(server_fd,
+			    SOF_TIMESTAMPING_SOFTWARE |
+			    SOF_TIMESTAMPING_RAW_HARDWARE);
+
+	printf("prepare xsk map...\n");
+	for (i = 0; i < rxq; i++) {
+		int sock_fd = xsk_socket__fd(rx_xsk[i].socket);
+		__u32 queue_id = i;
+
+		printf("map[%d] = %d\n", queue_id, sock_fd);
+		ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+		if (ret)
+			error(-1, -ret, "bpf_map_update_elem");
+	}
+
+	printf("attach bpf program...\n");
+	ret = bpf_xdp_attach(ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (ret)
+		error(-1, -ret, "bpf_xdp_attach");
+
+	signal(SIGINT, handle_signal);
+	ret = verify_metadata(rx_xsk, rxq, server_fd);
+	close(server_fd);
+	cleanup();
+	if (ret)
+		error(-1, -ret, "verify_metadata");
+}
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
@ 2022-12-07  4:29   ` Alexei Starovoitov
  2022-12-07  4:52     ` Stanislav Fomichev
  2022-12-08  2:47   ` Martin KaFai Lau
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2022-12-07  4:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Mon, Dec 05, 2022 at 06:45:45PM -0800, Stanislav Fomichev wrote:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index fc4e313a4d2e..00951a59ee26 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  		return -EINVAL;
>  	}
>  
> +	*cnt = 0;
> +
> +	if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
> +		if (bpf_prog_is_offloaded(env->prog->aux)) {
> +			verbose(env, "no metadata kfuncs offload\n");
> +			return -EINVAL;
> +		}

If I'm reading this correctly then this error will trigger
for any XDP prog trying to use a kfunc?

I was hoping that BPF CI can prove my point, but it failed to
build your newly added xdp_hw_metadata.c test.


* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-07  4:29   ` Alexei Starovoitov
@ 2022-12-07  4:52     ` Stanislav Fomichev
  2022-12-07  7:23       ` Martin KaFai Lau
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-07  4:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Dec 6, 2022 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Dec 05, 2022 at 06:45:45PM -0800, Stanislav Fomichev wrote:
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index fc4e313a4d2e..00951a59ee26 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >               return -EINVAL;
> >       }
> >
> > +     *cnt = 0;
> > +
> > +     if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
> > +             if (bpf_prog_is_offloaded(env->prog->aux)) {
> > +                     verbose(env, "no metadata kfuncs offload\n");
> > +                     return -EINVAL;
> > +             }
>
> If I'm reading this correctly then this error will trigger
> for any XDP prog trying to use a kfunc?

bpf_prog_is_offloaded() should return true only when the program is
fully offloaded to the device (like nfp). So here the intent is to
reject kfunc programs because nfp should somehow implement them first.
Unless I'm not setting offload_requested somewhere, not sure I see the
problem. LMK if I missed something.

> I was hoping that BPF CI can prove my point, but it failed to
> build your newly added xdp_hw_metadata.c test.

Ugh, will take a look, thank you!


* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-07  4:52     ` Stanislav Fomichev
@ 2022-12-07  7:23       ` Martin KaFai Lau
  2022-12-07 18:05         ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Martin KaFai Lau @ 2022-12-07  7:23 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Alexei Starovoitov

On 12/6/22 8:52 PM, Stanislav Fomichev wrote:
> On Tue, Dec 6, 2022 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Mon, Dec 05, 2022 at 06:45:45PM -0800, Stanislav Fomichev wrote:
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index fc4e313a4d2e..00951a59ee26 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>>                return -EINVAL;
>>>        }
>>>
>>> +     *cnt = 0;
>>> +
>>> +     if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
>>> +             if (bpf_prog_is_offloaded(env->prog->aux)) {
>>> +                     verbose(env, "no metadata kfuncs offload\n");
>>> +                     return -EINVAL;
>>> +             }
>>
>> If I'm reading this correctly then this error will trigger
>> for any XDP prog trying to use a kfunc?
> 
> bpf_prog_is_offloaded() should return true only when the program is
> fully offloaded to the device (like nfp). So here the intent is to
> reject kfunc programs because nfp should somehow implement them first.
> Unless I'm not setting offload_requested somewhere, not sure I see the
> problem. LMK if I missed something.

It errors out for all kfuncs here though. Or is it meant to error out for the
XDP_METADATA_KFUNC_* only?



* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-07  7:23       ` Martin KaFai Lau
@ 2022-12-07 18:05         ` Stanislav Fomichev
  0 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-07 18:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Alexei Starovoitov

On Tue, Dec 6, 2022 at 11:24 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/6/22 8:52 PM, Stanislav Fomichev wrote:
> > On Tue, Dec 6, 2022 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Mon, Dec 05, 2022 at 06:45:45PM -0800, Stanislav Fomichev wrote:
> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>> index fc4e313a4d2e..00951a59ee26 100644
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >>>                return -EINVAL;
> >>>        }
> >>>
> >>> +     *cnt = 0;
> >>> +
> >>> +     if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
> >>> +             if (bpf_prog_is_offloaded(env->prog->aux)) {
> >>> +                     verbose(env, "no metadata kfuncs offload\n");
> >>> +                     return -EINVAL;
> >>> +             }
> >>
> >> If I'm reading this correctly then this error will trigger
> >> for any XDP prog trying to use a kfunc?
> >
> > bpf_prog_is_offloaded() should return true only when the program is
> > fully offloaded to the device (like nfp). So here the intent is to
> > reject kfunc programs because nfp should somehow implement them first.
> > Unless I'm not setting offload_requested somewhere, not sure I see the
> > problem. LMK if I missed something.
>
> It errors out for all kfuncs here though. Or is it meant to error out for the
> XDP_METADATA_KFUNC_* only?

Ah, good point, I was somewhat assuming that xdp doesn't use kfuncs
right now and I can just assume that kfunc == metadata_kfunc.
Will make this more selective, thanks!


* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
  2022-12-07  4:29   ` Alexei Starovoitov
@ 2022-12-08  2:47   ` Martin KaFai Lau
  2022-12-08 19:07     ` Stanislav Fomichev
  2022-12-08  5:00   ` Jakub Kicinski
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 61+ messages in thread
From: Martin KaFai Lau @ 2022-12-08  2:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 12/5/22 6:45 PM, Stanislav Fomichev wrote:
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 55dbc68bfffc..c24aba5c363b 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -409,4 +409,33 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>   
>   #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
>   
> +#define XDP_METADATA_KFUNC_xxx	\
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> +			   bpf_xdp_metadata_rx_timestamp_supported) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> +			   bpf_xdp_metadata_rx_timestamp) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED, \
> +			   bpf_xdp_metadata_rx_hash_supported) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
> +			   bpf_xdp_metadata_rx_hash) \
> +
> +enum {
> +#define XDP_METADATA_KFUNC(name, str) name,
> +XDP_METADATA_KFUNC_xxx
> +#undef XDP_METADATA_KFUNC
> +MAX_XDP_METADATA_KFUNC,
> +};
> +
> +#ifdef CONFIG_NET

I think this is no longer needed because xdp_metadata_kfunc_id() is only used in 
offload.c which should be CONFIG_NET only.

> +u32 xdp_metadata_kfunc_id(int id);
> +#else
> +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> +#endif
> +
> +struct xdp_md;
> +bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx);
> +u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx);
> +bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx);
> +u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx);
> +
>   #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f89de51a45db..790650a81f2b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1156,6 +1156,11 @@ enum bpf_link_type {
>    */
>   #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
>   
> +/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
> + * program becomes device-bound but can access it's XDP metadata.
> + */
> +#define BPF_F_XDP_HAS_METADATA	(1U << 6)
> +

[ ... ]

> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> index f5769a8ecbee..bad8bab916eb 100644
> --- a/kernel/bpf/offload.c
> +++ b/kernel/bpf/offload.c
> @@ -41,7 +41,7 @@ struct bpf_offload_dev {
>   struct bpf_offload_netdev {
>   	struct rhash_head l;
>   	struct net_device *netdev;
> -	struct bpf_offload_dev *offdev;
> +	struct bpf_offload_dev *offdev; /* NULL when bound-only */
>   	struct list_head progs;
>   	struct list_head maps;
>   	struct list_head offdev_netdevs;
> @@ -58,6 +58,12 @@ static const struct rhashtable_params offdevs_params = {
>   static struct rhashtable offdevs;
>   static bool offdevs_inited;
>   
> +static int __bpf_offload_init(void);
> +static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
> +					     struct net_device *netdev);
> +static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
> +						struct net_device *netdev);
> +
>   static int bpf_dev_offload_check(struct net_device *netdev)
>   {
>   	if (!netdev)
> @@ -87,13 +93,17 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
>   	    attr->prog_type != BPF_PROG_TYPE_XDP)
>   		return -EINVAL;
>   
> -	if (attr->prog_flags)
> +	if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
>   		return -EINVAL;
>   
>   	offload = kzalloc(sizeof(*offload), GFP_USER);
>   	if (!offload)
>   		return -ENOMEM;
>   
> +	err = __bpf_offload_init();
> +	if (err)
> +		return err;
> +
>   	offload->prog = prog;
>   
>   	offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
> @@ -102,11 +112,25 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
>   	if (err)
>   		goto err_maybe_put;
>   
> +	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
> +

If I read the set correctly, bpf prog can either use metadata kfunc or offload 
but not both. It is fine to start with only supporting metadata kfunc when there 
is no offload, but it would be useful to understand the reason. I assume an offloaded 
bpf prog should still be able to call the bpf helpers like adjust_head/tail and 
the same should go for any kfunc?

Also, the BPF_F_XDP_HAS_METADATA feels like it is acting more like 
BPF_F_XDP_DEV_BOUND_ONLY.

>   	down_write(&bpf_devs_lock);
>   	ondev = bpf_offload_find_netdev(offload->netdev);
>   	if (!ondev) {
> -		err = -EINVAL;
> -		goto err_unlock;
> +		if (!prog->aux->offload_requested) {

nit. bpf_prog_is_offloaded(prog->aux)

> +			/* When only binding to the device, explicitly
> +			 * create an entry in the hashtable. See related
> +			 * maybe_remove_bound_netdev.
> +			 */
> +			err = __bpf_offload_dev_netdev_register(NULL, offload->netdev);
> +			if (err)
> +				goto err_unlock;
> +			ondev = bpf_offload_find_netdev(offload->netdev);
> +		}
> +		if (!ondev) {
> +			err = -EINVAL;
> +			goto err_unlock;
> +		}
>   	}
>   	offload->offdev = ondev->offdev;
>   	prog->aux->offload = offload;
> @@ -209,6 +233,19 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
>   	up_read(&bpf_devs_lock);
>   }
>   
> +static void maybe_remove_bound_netdev(struct net_device *dev)
> +{
> +	struct bpf_offload_netdev *ondev;
> +
> +	rtnl_lock();
> +	down_write(&bpf_devs_lock);
> +	ondev = bpf_offload_find_netdev(dev);
> +	if (ondev && !ondev->offdev && list_empty(&ondev->progs))
> +		__bpf_offload_dev_netdev_unregister(NULL, dev);
> +	up_write(&bpf_devs_lock);
> +	rtnl_unlock();
> +}
> +
>   static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
>   {
>   	struct bpf_prog_offload *offload = prog->aux->offload;
> @@ -226,10 +263,17 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
>   
>   void bpf_prog_offload_destroy(struct bpf_prog *prog)
>   {
> +	struct net_device *netdev = NULL;
> +
>   	down_write(&bpf_devs_lock);
> -	if (prog->aux->offload)
> +	if (prog->aux->offload) {
> +		netdev = prog->aux->offload->netdev;
>   		__bpf_prog_offload_destroy(prog);
> +	}
>   	up_write(&bpf_devs_lock);
> +
> +	if (netdev)

Maybe I have missed a refcnt or lock somewhere. Is it possible that netdev may 
have been freed?

> +		maybe_remove_bound_netdev(netdev);
>   }
>   

[ ... ]

> +void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> +{
> +	const struct net_device_ops *netdev_ops;
> +	void *p = NULL;
> +
> +	down_read(&bpf_devs_lock);
> +	if (!prog->aux->offload || !prog->aux->offload->netdev)
> +		goto out;
> +
> +	netdev_ops = prog->aux->offload->netdev->netdev_ops;
> +
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED))
> +		p = netdev_ops->ndo_xdp_rx_timestamp_supported;
> +	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
> +		p = netdev_ops->ndo_xdp_rx_timestamp;
> +	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED))
> +		p = netdev_ops->ndo_xdp_rx_hash_supported;
> +	else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
> +		p = netdev_ops->ndo_xdp_rx_hash;
> +	/* fallback to default kfunc when not supported by netdev */
> +out:
> +	up_read(&bpf_devs_lock);
> +
> +	return p;
> +}
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 13bc96035116..b345a273f7d0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2491,7 +2491,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>   				 BPF_F_TEST_STATE_FREQ |
>   				 BPF_F_SLEEPABLE |
>   				 BPF_F_TEST_RND_HI32 |
> -				 BPF_F_XDP_HAS_FRAGS))
> +				 BPF_F_XDP_HAS_FRAGS |
> +				 BPF_F_XDP_HAS_METADATA))
>   		return -EINVAL;
>   
>   	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> @@ -2575,7 +2576,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>   	prog->aux->attach_btf = attach_btf;
>   	prog->aux->attach_btf_id = attr->attach_btf_id;
>   	prog->aux->dst_prog = dst_prog;
> -	prog->aux->offload_requested = !!attr->prog_ifindex;
> +	prog->aux->dev_bound = !!attr->prog_ifindex;
>   	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
>   	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
>   
> @@ -2598,7 +2599,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>   	atomic64_set(&prog->aux->refcnt, 1);
>   	prog->gpl_compatible = is_gpl ? 1 : 0;
>   
> -	if (bpf_prog_is_offloaded(prog->aux)) {
> +	if (bpf_prog_is_dev_bound(prog->aux)) {
>   		err = bpf_prog_offload_init(prog, attr);
>   		if (err)
>   			goto free_prog_sec;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index fc4e313a4d2e..00951a59ee26 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>   		return -EINVAL;
>   	}
>   
> +	*cnt = 0;
> +
> +	if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {

hmmm... does it need the BPF_PROG_TYPE_XDP check? Is the below 
bpf_prog_is_dev_bound() and the earlier 
'register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set)' good enough?

> +		if (bpf_prog_is_offloaded(env->prog->aux)) {
> +			verbose(env, "no metadata kfuncs offload\n");
> +			return -EINVAL;
> +		}
> +
> +		if (bpf_prog_is_dev_bound(env->prog->aux)) {
> +			void *p = bpf_offload_resolve_kfunc(env->prog, insn->imm);
> +
> +			if (p) {
> +				insn->imm = BPF_CALL_IMM(p);
> +				return 0;
> +			}
> +		}
> +	}
> +




* Re: [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata
  2022-12-06  2:45 ` [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2022-12-08  4:25   ` Jakub Kicinski
  2022-12-08 19:06     ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-08  4:25 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon,  5 Dec 2022 18:45:43 -0800 Stanislav Fomichev wrote:
> +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> +  indicate whether the device supports RX timestamps
> +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp
> +- ``bpf_xdp_metadata_rx_hash_supported`` returns true/false to
> +  indicate whether the device supports RX hash
> +- ``bpf_xdp_metadata_rx_hash`` returns packet RX hash

Would you mind pointing to the discussion about the separate
_supported() kfuncs? I recall folks had concerns about the function
call overhead, and now we have 2 calls per field? :S


* Re: [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  2022-12-06  2:45 ` [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
@ 2022-12-08  4:26   ` Jakub Kicinski
  0 siblings, 0 replies; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-08  4:26 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon,  5 Dec 2022 18:45:44 -0800 Stanislav Fomichev wrote:
> BPF offloading infra will be reused to implement
> bound-but-not-offloaded bpf programs. Rename existing
> helpers for clarity. No functional changes.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Reviewed-by: Jakub Kicinski <kuba@kernel.org>


* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
  2022-12-07  4:29   ` Alexei Starovoitov
  2022-12-08  2:47   ` Martin KaFai Lau
@ 2022-12-08  5:00   ` Jakub Kicinski
  2022-12-08 19:07     ` Stanislav Fomichev
  2022-12-08 22:39   ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-12-09 11:10   ` Jesper Dangaard Brouer
  4 siblings, 1 reply; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-08  5:00 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

The offload tests still pass after this, right?
TBH I don't remember this code well enough to spot major issues.

On Mon,  5 Dec 2022 18:45:45 -0800 Stanislav Fomichev wrote:
> There is an ndo handler per kfunc, the verifier replaces a call to the
> generic kfunc with a call to the per-device one.
> 
> For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> implements all possible metadata kfuncs. Not all devices have to
> implement them. If a kfunc is not supported by the target device,
> the default implementation is called instead.
> 
> Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> we treat prog_index as target device for kfunc resolution.

> @@ -2476,10 +2477,18 @@ void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
>  				       struct net_device *netdev);
>  bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
>  
> +void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id);

There seems to be some mis-naming going on. I expected:

  offloaded =~ nfp
  dev_bound == XDP w/ funcs

*_offload_resolve_kfunc looks misnamed? Unless you want to resolve 
for HW offload?

>  void unpriv_ebpf_notify(int new_state);
>  
>  #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
>  int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
> +void bpf_offload_bound_netdev_unregister(struct net_device *dev);

ditto: offload_bound is a mix of terms no?

> @@ -1611,6 +1612,10 @@ struct net_device_ops {
>  	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
>  						  const struct skb_shared_hwtstamps *hwtstamps,
>  						  bool cycles);
> +	bool			(*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
> +	u64			(*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
> +	bool			(*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
> +	u32			(*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
>  };

Is this on the fast path? Can we do an indirection?
Put these ops in their own struct and add a pointer to that struct 
in net_device_ops? Purely for grouping reasons because the netdev
ops are getting orders of magnitude past the size where you can
actually find stuff in this struct.

>  	bpf_free_used_maps(aux);
>  	bpf_free_used_btfs(aux);
> -	if (bpf_prog_is_offloaded(aux))
> +	if (bpf_prog_is_dev_bound(aux))
>  		bpf_prog_offload_destroy(aux->prog);

This also looks a touch like a mix of terms (condition vs function
called).

> +static int __bpf_offload_init(void);
> +static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
> +					     struct net_device *netdev);
> +static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
> +						struct net_device *netdev);

fwd declarations are yuck

>  static int bpf_dev_offload_check(struct net_device *netdev)
>  {
>  	if (!netdev)
> @@ -87,13 +93,17 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
>  	    attr->prog_type != BPF_PROG_TYPE_XDP)
>  		return -EINVAL;
>  
> -	if (attr->prog_flags)
> +	if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
>  		return -EINVAL;
>  
>  	offload = kzalloc(sizeof(*offload), GFP_USER);
>  	if (!offload)
>  		return -ENOMEM;
>  
> +	err = __bpf_offload_init();
> +	if (err)
> +		return err;

leaks offload

> @@ -209,6 +233,19 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
>  	up_read(&bpf_devs_lock);
>  }
>  
> +static void maybe_remove_bound_netdev(struct net_device *dev)
> +{

func name prefix ?

> -struct bpf_offload_dev *
> -bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
> +static int __bpf_offload_init(void)
>  {
> -	struct bpf_offload_dev *offdev;
>  	int err;
>  
>  	down_write(&bpf_devs_lock);
> @@ -680,12 +740,25 @@ bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
>  		err = rhashtable_init(&offdevs, &offdevs_params);
>  		if (err) {
>  			up_write(&bpf_devs_lock);
> -			return ERR_PTR(err);
> +			return err;
>  		}
>  		offdevs_inited = true;
>  	}
>  	up_write(&bpf_devs_lock);
>  
> +	return 0;
> +}

Would late_initcall() or some such not work for this?

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5b221568dfd4..862e03fcffa6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9228,6 +9228,10 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
>  			NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");

extack should get updated here, I reckon, maybe in previous patch

>  			return -EINVAL;
>  		}
> +		if (bpf_prog_is_dev_bound(new_prog->aux) && !bpf_offload_dev_match(new_prog, dev)) {

bound_dev_match() ?

> +			NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");

different than.. ?

> +			return -EINVAL;
> +		}
>  		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
>  			NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
>  			return -EINVAL;


* Re: [PATCH bpf-next v3 08/12] mxl4: Support RX XDP metadata
  2022-12-06  2:45 ` [PATCH bpf-next v3 08/12] mxl4: Support RX XDP metadata Stanislav Fomichev
@ 2022-12-08  6:09   ` Tariq Toukan
  2022-12-08 19:07     ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Tariq Toukan @ 2022-12-08  6:09 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Typo in title mxl4 -> mlx4.
Preferably: net/mlx4_en.

On 12/6/2022 4:45 AM, Stanislav Fomichev wrote:
> RX timestamp and hash for now. Tested using the prog from the next
> patch.
> 
> Also enabling xdp metadata support; don't see why it's disabled,
> there is enough headroom..
> 
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++--
>   .../net/ethernet/mellanox/mlx4/en_netdev.c    | 10 +++++
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 38 ++++++++++++++++++-
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  1 +
>   include/linux/mlx4/device.h                   |  7 ++++
>   5 files changed, 64 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> index 98b5ffb4d729..9e3b76182088 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> @@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
>   	return hi | lo;
>   }
>   
> -void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> -			    struct skb_shared_hwtstamps *hwts,
> -			    u64 timestamp)
> +u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
>   {
>   	unsigned int seq;
>   	u64 nsec;
> @@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>   		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
>   	} while (read_seqretry(&mdev->clock_lock, seq));
>   
> +	return ns_to_ktime(nsec);
> +}
> +
> +void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> +			    struct skb_shared_hwtstamps *hwts,
> +			    u64 timestamp)
> +{
>   	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> -	hwts->hwtstamp = ns_to_ktime(nsec);
> +	hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
>   }
>   
>   /**
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 8800d3f1f55c..1cb63746a851 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2855,6 +2855,11 @@ static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
>   	.ndo_bpf		= mlx4_xdp,
> +
> +	.ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
> +	.ndo_xdp_rx_timestamp	= mlx4_xdp_rx_timestamp,
> +	.ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
> +	.ndo_xdp_rx_hash	= mlx4_xdp_rx_hash,
>   };
>   
>   static const struct net_device_ops mlx4_netdev_ops_master = {
> @@ -2887,6 +2892,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
>   	.ndo_bpf		= mlx4_xdp,
> +
> +	.ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
> +	.ndo_xdp_rx_timestamp	= mlx4_xdp_rx_timestamp,
> +	.ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
> +	.ndo_xdp_rx_hash	= mlx4_xdp_rx_hash,
>   };
>   
>   struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 9c114fc723e3..1b8e1b2d8729 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -663,8 +663,40 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
>   
>   struct mlx4_xdp_buff {
>   	struct xdp_buff xdp;
> +	struct mlx4_cqe *cqe;
> +	struct mlx4_en_dev *mdev;
> +	struct mlx4_en_rx_ring *ring;
> +	struct net_device *dev;
>   };
>   
> +bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx)
> +{
> +	struct mlx4_xdp_buff *_ctx = (void *)ctx;
> +
> +	return _ctx->ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL;
> +}
> +
> +u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx)
> +{
> +	struct mlx4_xdp_buff *_ctx = (void *)ctx;
> +
> +	return mlx4_en_get_hwtstamp(_ctx->mdev, mlx4_en_get_cqe_ts(_ctx->cqe));
> +}
> +
> +bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx)
> +{
> +	struct mlx4_xdp_buff *_ctx = (void *)ctx;
> +
> +	return _ctx->dev->features & NETIF_F_RXHASH;
> +}
> +
> +u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx)
> +{
> +	struct mlx4_xdp_buff *_ctx = (void *)ctx;
> +
> +	return be32_to_cpu(_ctx->cqe->immed_rss_invalid);
> +}
> +
>   int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>   {
>   	struct mlx4_en_priv *priv = netdev_priv(dev);
> @@ -781,8 +813,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   						DMA_FROM_DEVICE);
>   
>   			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
> -					 frags[0].page_offset, length, false);
> +					 frags[0].page_offset, length, true);
>   			orig_data = mxbuf.xdp.data;
> +			mxbuf.cqe = cqe;
> +			mxbuf.mdev = priv->mdev;
> +			mxbuf.ring = ring;
> +			mxbuf.dev = dev;
>   
>   			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index e132ff4c82f2..b7c0d4899ad7 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -792,6 +792,7 @@ int mlx4_en_netdev_event(struct notifier_block *this,
>    * Functions for time stamping
>    */
>   u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe);
> +u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp);
>   void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>   			    struct skb_shared_hwtstamps *hwts,
>   			    u64 timestamp);
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index 6646634a0b9d..d5904da1d490 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -1585,4 +1585,11 @@ static inline int mlx4_get_num_reserved_uar(struct mlx4_dev *dev)
>   	/* The first 128 UARs are used for EQ doorbells */
>   	return (128 >> (PAGE_SHIFT - dev->uar_page_shift));
>   }
> +
> +struct xdp_md;
> +bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx);
> +u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx);
> +bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx);
> +u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx);
> +

These are ethernet only functions, not known to the mlx4 core driver.
Please move to mlx4_en.h, and use mlx4_en_xdp_*() prefix.

>   #endif /* MLX4_DEVICE_H */


* Re: [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  2022-12-06  2:45 ` [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-12-08  6:11   ` Tariq Toukan
  2022-12-08 19:07     ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Tariq Toukan @ 2022-12-08  6:11 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev



On 12/6/2022 4:45 AM, Stanislav Fomichev wrote:
> No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
> 
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
>   1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 8f762fc170b3..9c114fc723e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -661,9 +661,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
>   #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
>   #endif
>   
> +struct mlx4_xdp_buff {
> +	struct xdp_buff xdp;
> +};
> +

Prefer name with 'en', struct mlx4_en_xdp_buff.

>   int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>   {
>   	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct mlx4_xdp_buff mxbuf = {};
>   	int factor = priv->cqe_factor;
>   	struct mlx4_en_rx_ring *ring;
>   	struct bpf_prog *xdp_prog;
> @@ -671,7 +676,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	bool doorbell_pending;
>   	bool xdp_redir_flush;
>   	struct mlx4_cqe *cqe;
> -	struct xdp_buff xdp;
>   	int polled = 0;
>   	int index;
>   
> @@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	ring = priv->rx_ring[cq_ring];
>   
>   	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
> -	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
> +	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
>   	doorbell_pending = false;
>   	xdp_redir_flush = false;
>   
> @@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   						priv->frag_info[0].frag_size,
>   						DMA_FROM_DEVICE);
>   
> -			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
> +			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
>   					 frags[0].page_offset, length, false);
> -			orig_data = xdp.data;
> +			orig_data = mxbuf.xdp.data;
>   
> -			act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
>   
> -			length = xdp.data_end - xdp.data;
> -			if (xdp.data != orig_data) {
> -				frags[0].page_offset = xdp.data -
> -					xdp.data_hard_start;
> -				va = xdp.data;
> +			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
> +			if (mxbuf.xdp.data != orig_data) {
> +				frags[0].page_offset = mxbuf.xdp.data -
> +					mxbuf.xdp.data_hard_start;
> +				va = mxbuf.xdp.data;
>   			}
>   
>   			switch (act) {
>   			case XDP_PASS:
>   				break;
>   			case XDP_REDIRECT:
> -				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
> +				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
>   					ring->xdp_redirect++;
>   					xdp_redir_flush = true;
>   					frags[0].page = NULL;


* Re: [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata
  2022-12-08  4:25   ` Jakub Kicinski
@ 2022-12-08 19:06     ` Stanislav Fomichev
  0 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 19:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Dec 7, 2022 at 8:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  5 Dec 2022 18:45:43 -0800 Stanislav Fomichev wrote:
> > +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> > +  indicate whether the device supports RX timestamps
> > +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp
> > +- ``bpf_xdp_metadata_rx_hash_supported`` returns true/false to
> > +  indicate whether the device supports RX hash
> > +- ``bpf_xdp_metadata_rx_hash`` returns packet RX hash
>
> Would you mind pointing to the discussion about the separate
> _supported() kfuncs? I recall folks had concerns about the function
> call overhead, and now we have 2 calls per field? :S

Take a look at [0] and [1]. I'm still assuming that we might support
some restricted set of kfuncs that can be unrolled, so I'm keeping
these APIs simple/separate.

0: https://lore.kernel.org/bpf/CAADnVQJMvPjXCtKNH+WCryPmukgbWTrJyHqxrnO=2YraZEukPg@mail.gmail.com
1: https://lore.kernel.org/bpf/Y4XZkZJHVvLgTIk9@lavr/


* Re: [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  2022-12-08  6:11   ` Tariq Toukan
@ 2022-12-08 19:07     ` Stanislav Fomichev
  0 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 19:07 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Wed, Dec 7, 2022 at 10:11 PM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 12/6/2022 4:45 AM, Stanislav Fomichev wrote:
> > No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
> >
> > Cc: Tariq Toukan <tariqt@nvidia.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
> >   1 file changed, 15 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 8f762fc170b3..9c114fc723e3 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -661,9 +661,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
> >   #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
> >   #endif
> >
> > +struct mlx4_xdp_buff {
> > +     struct xdp_buff xdp;
> > +};
> > +
>
> Prefer name with 'en', struct mlx4_en_xdp_buff.

Sure, will rename!


> >   int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
> >   {
> >       struct mlx4_en_priv *priv = netdev_priv(dev);
> > +     struct mlx4_xdp_buff mxbuf = {};
> >       int factor = priv->cqe_factor;
> >       struct mlx4_en_rx_ring *ring;
> >       struct bpf_prog *xdp_prog;
> > @@ -671,7 +676,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >       bool doorbell_pending;
> >       bool xdp_redir_flush;
> >       struct mlx4_cqe *cqe;
> > -     struct xdp_buff xdp;
> >       int polled = 0;
> >       int index;
> >
> > @@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >       ring = priv->rx_ring[cq_ring];
> >
> >       xdp_prog = rcu_dereference_bh(ring->xdp_prog);
> > -     xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
> > +     xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
> >       doorbell_pending = false;
> >       xdp_redir_flush = false;
> >
> > @@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >                                               priv->frag_info[0].frag_size,
> >                                               DMA_FROM_DEVICE);
> >
> > -                     xdp_prepare_buff(&xdp, va - frags[0].page_offset,
> > +                     xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
> >                                        frags[0].page_offset, length, false);
> > -                     orig_data = xdp.data;
> > +                     orig_data = mxbuf.xdp.data;
> >
> > -                     act = bpf_prog_run_xdp(xdp_prog, &xdp);
> > +                     act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
> >
> > -                     length = xdp.data_end - xdp.data;
> > -                     if (xdp.data != orig_data) {
> > -                             frags[0].page_offset = xdp.data -
> > -                                     xdp.data_hard_start;
> > -                             va = xdp.data;
> > +                     length = mxbuf.xdp.data_end - mxbuf.xdp.data;
> > +                     if (mxbuf.xdp.data != orig_data) {
> > +                             frags[0].page_offset = mxbuf.xdp.data -
> > +                                     mxbuf.xdp.data_hard_start;
> > +                             va = mxbuf.xdp.data;
> >                       }
> >
> >                       switch (act) {
> >                       case XDP_PASS:
> >                               break;
> >                       case XDP_REDIRECT:
> > -                             if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
> > +                             if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
> >                                       ring->xdp_redirect++;
> >                                       xdp_redir_flush = true;
> >                                       frags[0].page = NULL;


* Re: [PATCH bpf-next v3 08/12] mxl4: Support RX XDP metadata
  2022-12-08  6:09   ` Tariq Toukan
@ 2022-12-08 19:07     ` Stanislav Fomichev
  2022-12-08 20:23       ` Tariq Toukan
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 19:07 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Wed, Dec 7, 2022 at 10:09 PM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
> Typo in title mxl4 -> mlx4.
> Preferably: net/mlx4_en.

Oh, I always have to fight with this. Somehow mxl feels more natural
:-) Thanks for spotting, will use net/mlx4_en instead. (presumably
the same applies to mlx5?)

> On 12/6/2022 4:45 AM, Stanislav Fomichev wrote:
> > RX timestamp and hash for now. Tested using the prog from the next
> > patch.
> >
> > Also enabling xdp metadata support; don't see why it's disabled,
> > there is enough headroom..
> >
> > Cc: Tariq Toukan <tariqt@nvidia.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++--
> >   .../net/ethernet/mellanox/mlx4/en_netdev.c    | 10 +++++
> >   drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 38 ++++++++++++++++++-
> >   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  1 +
> >   include/linux/mlx4/device.h                   |  7 ++++
> >   5 files changed, 64 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> > index 98b5ffb4d729..9e3b76182088 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
> > @@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> >       return hi | lo;
> >   }
> >
> > -void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > -                         struct skb_shared_hwtstamps *hwts,
> > -                         u64 timestamp)
> > +u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
> >   {
> >       unsigned int seq;
> >       u64 nsec;
> > @@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> >               nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> >       } while (read_seqretry(&mdev->clock_lock, seq));
> >
> > +     return ns_to_ktime(nsec);
> > +}
> > +
> > +void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > +                         struct skb_shared_hwtstamps *hwts,
> > +                         u64 timestamp)
> > +{
> >       memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > -     hwts->hwtstamp = ns_to_ktime(nsec);
> > +     hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
> >   }
> >
> >   /**
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > index 8800d3f1f55c..1cb63746a851 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> > @@ -2855,6 +2855,11 @@ static const struct net_device_ops mlx4_netdev_ops = {
> >       .ndo_features_check     = mlx4_en_features_check,
> >       .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
> >       .ndo_bpf                = mlx4_xdp,
> > +
> > +     .ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
> > +     .ndo_xdp_rx_timestamp   = mlx4_xdp_rx_timestamp,
> > +     .ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
> > +     .ndo_xdp_rx_hash        = mlx4_xdp_rx_hash,
> >   };
> >
> >   static const struct net_device_ops mlx4_netdev_ops_master = {
> > @@ -2887,6 +2892,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
> >       .ndo_features_check     = mlx4_en_features_check,
> >       .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
> >       .ndo_bpf                = mlx4_xdp,
> > +
> > +     .ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
> > +     .ndo_xdp_rx_timestamp   = mlx4_xdp_rx_timestamp,
> > +     .ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
> > +     .ndo_xdp_rx_hash        = mlx4_xdp_rx_hash,
> >   };
> >
> >   struct mlx4_en_bond {
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 9c114fc723e3..1b8e1b2d8729 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -663,8 +663,40 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
> >
> >   struct mlx4_xdp_buff {
> >       struct xdp_buff xdp;
> > +     struct mlx4_cqe *cqe;
> > +     struct mlx4_en_dev *mdev;
> > +     struct mlx4_en_rx_ring *ring;
> > +     struct net_device *dev;
> >   };
> >
> > +bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx)
> > +{
> > +     struct mlx4_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     return _ctx->ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL;
> > +}
> > +
> > +u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx)
> > +{
> > +     struct mlx4_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     return mlx4_en_get_hwtstamp(_ctx->mdev, mlx4_en_get_cqe_ts(_ctx->cqe));
> > +}
> > +
> > +bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx)
> > +{
> > +     struct mlx4_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     return _ctx->dev->features & NETIF_F_RXHASH;
> > +}
> > +
> > +u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx)
> > +{
> > +     struct mlx4_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     return be32_to_cpu(_ctx->cqe->immed_rss_invalid);
> > +}
> > +
> >   int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
> >   {
> >       struct mlx4_en_priv *priv = netdev_priv(dev);
> > @@ -781,8 +813,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >                                               DMA_FROM_DEVICE);
> >
> >                       xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
> > -                                      frags[0].page_offset, length, false);
> > +                                      frags[0].page_offset, length, true);
> >                       orig_data = mxbuf.xdp.data;
> > +                     mxbuf.cqe = cqe;
> > +                     mxbuf.mdev = priv->mdev;
> > +                     mxbuf.ring = ring;
> > +                     mxbuf.dev = dev;
> >
> >                       act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > index e132ff4c82f2..b7c0d4899ad7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> > @@ -792,6 +792,7 @@ int mlx4_en_netdev_event(struct notifier_block *this,
> >    * Functions for time stamping
> >    */
> >   u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe);
> > +u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp);
> >   void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> >                           struct skb_shared_hwtstamps *hwts,
> >                           u64 timestamp);
> > diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> > index 6646634a0b9d..d5904da1d490 100644
> > --- a/include/linux/mlx4/device.h
> > +++ b/include/linux/mlx4/device.h
> > @@ -1585,4 +1585,11 @@ static inline int mlx4_get_num_reserved_uar(struct mlx4_dev *dev)
> >       /* The first 128 UARs are used for EQ doorbells */
> >       return (128 >> (PAGE_SHIFT - dev->uar_page_shift));
> >   }
> > +
> > +struct xdp_md;
> > +bool mlx4_xdp_rx_timestamp_supported(const struct xdp_md *ctx);
> > +u64 mlx4_xdp_rx_timestamp(const struct xdp_md *ctx);
> > +bool mlx4_xdp_rx_hash_supported(const struct xdp_md *ctx);
> > +u32 mlx4_xdp_rx_hash(const struct xdp_md *ctx);
> > +
>
> These are ethernet only functions, not known to the mlx4 core driver.
> Please move to mlx4_en.h, and use mlx4_en_xdp_*() prefix.

For sure, thanks for the review!

> >   #endif /* MLX4_DEVICE_H */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08  2:47   ` Martin KaFai Lau
@ 2022-12-08 19:07     ` Stanislav Fomichev
  2022-12-08 22:53       ` Martin KaFai Lau
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 19:07 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Wed, Dec 7, 2022 at 6:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/5/22 6:45 PM, Stanislav Fomichev wrote:
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 55dbc68bfffc..c24aba5c363b 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -409,4 +409,33 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >
> >   #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
> >
> > +#define XDP_METADATA_KFUNC_xxx       \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> > +                        bpf_xdp_metadata_rx_timestamp_supported) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> > +                        bpf_xdp_metadata_rx_timestamp) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED, \
> > +                        bpf_xdp_metadata_rx_hash_supported) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
> > +                        bpf_xdp_metadata_rx_hash) \
> > +
> > +enum {
> > +#define XDP_METADATA_KFUNC(name, str) name,
> > +XDP_METADATA_KFUNC_xxx
> > +#undef XDP_METADATA_KFUNC
> > +MAX_XDP_METADATA_KFUNC,
> > +};
> > +
> > +#ifdef CONFIG_NET
>
> I think this is no longer needed because xdp_metadata_kfunc_id() is only used in
> offload.c which should be CONFIG_NET only.

Seems to be the case. At least my build tests with weird configs work,
thank you!

> > +u32 xdp_metadata_kfunc_id(int id);
> > +#else
> > +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> > +#endif
> > +
> > +struct xdp_md;
> > +bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx);
> > +u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx);
> > +bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx);
> > +u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx);
> > +
> >   #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index f89de51a45db..790650a81f2b 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1156,6 +1156,11 @@ enum bpf_link_type {
> >    */
> >   #define BPF_F_XDP_HAS_FRAGS (1U << 5)
> >
> > +/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
> > + * program becomes device-bound but can access its XDP metadata.
> > + */
> > +#define BPF_F_XDP_HAS_METADATA       (1U << 6)
> > +
>
> [ ... ]
>
> > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > index f5769a8ecbee..bad8bab916eb 100644
> > --- a/kernel/bpf/offload.c
> > +++ b/kernel/bpf/offload.c
> > @@ -41,7 +41,7 @@ struct bpf_offload_dev {
> >   struct bpf_offload_netdev {
> >       struct rhash_head l;
> >       struct net_device *netdev;
> > -     struct bpf_offload_dev *offdev;
> > +     struct bpf_offload_dev *offdev; /* NULL when bound-only */
> >       struct list_head progs;
> >       struct list_head maps;
> >       struct list_head offdev_netdevs;
> > @@ -58,6 +58,12 @@ static const struct rhashtable_params offdevs_params = {
> >   static struct rhashtable offdevs;
> >   static bool offdevs_inited;
> >
> > +static int __bpf_offload_init(void);
> > +static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
> > +                                          struct net_device *netdev);
> > +static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
> > +                                             struct net_device *netdev);
> > +
> >   static int bpf_dev_offload_check(struct net_device *netdev)
> >   {
> >       if (!netdev)
> > @@ -87,13 +93,17 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> >           attr->prog_type != BPF_PROG_TYPE_XDP)
> >               return -EINVAL;
> >
> > -     if (attr->prog_flags)
> > +     if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
> >               return -EINVAL;
> >
> >       offload = kzalloc(sizeof(*offload), GFP_USER);
> >       if (!offload)
> >               return -ENOMEM;
> >
> > +     err = __bpf_offload_init();
> > +     if (err)
> > +             return err;
> > +
> >       offload->prog = prog;
> >
> >       offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
> > @@ -102,11 +112,25 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> >       if (err)
> >               goto err_maybe_put;
> >
> > +     prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
> > +
>
> If I read the set correctly, bpf prog can either use metadata kfunc or offload
> but not both. It is fine to start with only supporting metadata kfunc when there
> is no offload but will be useful to understand the reason. I assume an offloaded
> bpf prog should still be able to call the bpf helpers like adjust_head/tail and
> the same should go for any kfunc?

Yes, I'm assuming there should be some work on the offloaded device
drivers to support metadata kfuncs.
Offloaded kfuncs, in general, seem hard (how do we call kernel func
from the device-offloaded prog?); so refusing kfuncs early for the
offloaded case seems fair for now?

> Also, the BPF_F_XDP_HAS_METADATA feels like it is acting more like
> BPF_F_XDP_DEV_BOUND_ONLY.

SG. Seems like a better option in case, in the future, binding to
devices gives some other nice perks besides the metadata.

> >       down_write(&bpf_devs_lock);
> >       ondev = bpf_offload_find_netdev(offload->netdev);
> >       if (!ondev) {
> > -             err = -EINVAL;
> > -             goto err_unlock;
> > +             if (!prog->aux->offload_requested) {
>
> nit. bpf_prog_is_offloaded(prog->aux)

Thx!

> > +                     /* When only binding to the device, explicitly
> > +                      * create an entry in the hashtable. See related
> > +                      * maybe_remove_bound_netdev.
> > +                      */
> > +                     err = __bpf_offload_dev_netdev_register(NULL, offload->netdev);
> > +                     if (err)
> > +                             goto err_unlock;
> > +                     ondev = bpf_offload_find_netdev(offload->netdev);
> > +             }
> > +             if (!ondev) {
> > +                     err = -EINVAL;
> > +                     goto err_unlock;
> > +             }
> >       }
> >       offload->offdev = ondev->offdev;
> >       prog->aux->offload = offload;
> > @@ -209,6 +233,19 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
> >       up_read(&bpf_devs_lock);
> >   }
> >
> > +static void maybe_remove_bound_netdev(struct net_device *dev)
> > +{
> > +     struct bpf_offload_netdev *ondev;
> > +
> > +     rtnl_lock();
> > +     down_write(&bpf_devs_lock);
> > +     ondev = bpf_offload_find_netdev(dev);
> > +     if (ondev && !ondev->offdev && list_empty(&ondev->progs))
> > +             __bpf_offload_dev_netdev_unregister(NULL, dev);
> > +     up_write(&bpf_devs_lock);
> > +     rtnl_unlock();
> > +}
> > +
> >   static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
> >   {
> >       struct bpf_prog_offload *offload = prog->aux->offload;
> > @@ -226,10 +263,17 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
> >
> >   void bpf_prog_offload_destroy(struct bpf_prog *prog)
> >   {
> > +     struct net_device *netdev = NULL;
> > +
> >       down_write(&bpf_devs_lock);
> > -     if (prog->aux->offload)
> > +     if (prog->aux->offload) {
> > +             netdev = prog->aux->offload->netdev;
> >               __bpf_prog_offload_destroy(prog);
> > +     }
> >       up_write(&bpf_devs_lock);
> > +
> > +     if (netdev)
>
> May be I have missed a refcnt or lock somewhere.  Is it possible that netdev may
> have been freed?

Yeah, with the offload framework, there are no refcnts. We put an
"offloaded" device into a separate hashtable (protected by
rtnl/semaphore).
maybe_remove_bound_netdev will re-grab the locks (due to ordering:
rtnl->bpf_devs_lock) and remove the device from the hashtable if it's
still there.
At least this is how, I think, it should work; LMK if something is
still fishy here...

Or is the concern here that somebody might allocate new netdev reusing
the same address? I think I have enough checks in
maybe_remove_bound_netdev to guard against that. Or, at least, to make
it safe :-)

> > +             maybe_remove_bound_netdev(netdev);
> >   }
> >
>
> [ ... ]
>
> > +void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> > +{
> > +     const struct net_device_ops *netdev_ops;
> > +     void *p = NULL;
> > +
> > +     down_read(&bpf_devs_lock);
> > +     if (!prog->aux->offload || !prog->aux->offload->netdev)
> > +             goto out;
> > +
> > +     netdev_ops = prog->aux->offload->netdev->netdev_ops;
> > +
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED))
> > +             p = netdev_ops->ndo_xdp_rx_timestamp_supported;
> > +     else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
> > +             p = netdev_ops->ndo_xdp_rx_timestamp;
> > +     else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED))
> > +             p = netdev_ops->ndo_xdp_rx_hash_supported;
> > +     else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
> > +             p = netdev_ops->ndo_xdp_rx_hash;
> > +     /* fallback to default kfunc when not supported by netdev */
> > +out:
> > +     up_read(&bpf_devs_lock);
> > +
> > +     return p;
> > +}
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 13bc96035116..b345a273f7d0 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2491,7 +2491,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >                                BPF_F_TEST_STATE_FREQ |
> >                                BPF_F_SLEEPABLE |
> >                                BPF_F_TEST_RND_HI32 |
> > -                              BPF_F_XDP_HAS_FRAGS))
> > +                              BPF_F_XDP_HAS_FRAGS |
> > +                              BPF_F_XDP_HAS_METADATA))
> >               return -EINVAL;
> >
> >       if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> > @@ -2575,7 +2576,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >       prog->aux->attach_btf = attach_btf;
> >       prog->aux->attach_btf_id = attr->attach_btf_id;
> >       prog->aux->dst_prog = dst_prog;
> > -     prog->aux->offload_requested = !!attr->prog_ifindex;
> > +     prog->aux->dev_bound = !!attr->prog_ifindex;
> >       prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
> >       prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
> >
> > @@ -2598,7 +2599,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >       atomic64_set(&prog->aux->refcnt, 1);
> >       prog->gpl_compatible = is_gpl ? 1 : 0;
> >
> > -     if (bpf_prog_is_offloaded(prog->aux)) {
> > +     if (bpf_prog_is_dev_bound(prog->aux)) {
> >               err = bpf_prog_offload_init(prog, attr);
> >               if (err)
> >                       goto free_prog_sec;
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index fc4e313a4d2e..00951a59ee26 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -15323,6 +15323,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >               return -EINVAL;
> >       }
> >
> > +     *cnt = 0;
> > +
> > +     if (resolve_prog_type(env->prog) == BPF_PROG_TYPE_XDP) {
>
> hmmm...does it need BPF_PROG_TYPE_XDP check? Is the below
> bpf_prog_is_dev_bound() and the eariler
> 'register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set)' good enough?

Should be enough, yeah, will drop. I was being a bit defensive here in
case we have non-xdp device-bound programs in the future.


> > +             if (bpf_prog_is_offloaded(env->prog->aux)) {
> > +                     verbose(env, "no metadata kfuncs offload\n");
> > +                     return -EINVAL;
> > +             }
> > +
> > +             if (bpf_prog_is_dev_bound(env->prog->aux)) {
> > +                     void *p = bpf_offload_resolve_kfunc(env->prog, insn->imm);
> > +
> > +                     if (p) {
> > +                             insn->imm = BPF_CALL_IMM(p);
> > +                             return 0;
> > +                     }
> > +             }
> > +     }
> > +
>
>


* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08  5:00   ` Jakub Kicinski
@ 2022-12-08 19:07     ` Stanislav Fomichev
  2022-12-09  1:30       ` Jakub Kicinski
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 19:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Dec 7, 2022 at 9:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> The offload tests still pass after this, right?

Yeah, had to bring them back in shape just for the purpose of making
sure they're still happy:
https://lore.kernel.org/bpf/20221206232739.2504890-1-sdf@google.com/

> TBH I don't remember this code well enough to spot major issues.

No worries! Appreciate the review and the comments on consistency; I'm
also mostly unaware how this whole offloading works :-)

> On Mon,  5 Dec 2022 18:45:45 -0800 Stanislav Fomichev wrote:
> > There is an ndo handler per kfunc, the verifier replaces a call to the
> > generic kfunc with a call to the per-device one.
> >
> > For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> > implements all possible metadata kfuncs. Not all devices have to
> > implement them. If kfunc is not supported by the target device,
> > the default implementation is called instead.
> >
> > Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> > we treat prog_ifindex as target device for kfunc resolution.
>
> > @@ -2476,10 +2477,18 @@ void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
> >                                      struct net_device *netdev);
> >  bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
> >
> > +void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
>
> There seems to be some mis-naming going on. I expected:
>
>   offloaded =~ nfp
>   dev_bound == XDP w/ funcs
>
> *_offload_resolve_kfunc looks misnamed? Unless you want to resolve
> for HW offload?

Yeah, I had the same expectations, but I was also assuming that this
bpf_offload_resolve_kfunc might at some point handle offloaded
metadata kfuncs.
But looking at it again, agree that the following looks a bit off:

if (bpf_prog_is_dev_bound()) {
   xxx = bpf_offload_resolve_kfunc()
}

Let me use the dev_bound prefix more consistently here and in the
other places you've pointed out.

> >  void unpriv_ebpf_notify(int new_state);
> >
> >  #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
> >  int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
> > +void bpf_offload_bound_netdev_unregister(struct net_device *dev);
>
> ditto: offload_bound is a mix of terms no?

Ack, will do bpf_dev_bound_netdev_unregister here, thanks!

> > @@ -1611,6 +1612,10 @@ struct net_device_ops {
> >       ktime_t                 (*ndo_get_tstamp)(struct net_device *dev,
> >                                                 const struct skb_shared_hwtstamps *hwtstamps,
> >                                                 bool cycles);
> > +     bool                    (*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
> > +     u64                     (*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
> > +     bool                    (*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
> > +     u32                     (*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
> >  };
>
> Is this on the fast path? Can we do an indirection?

No, we resolve them at load time from "generic"
bpf_xdp_metadata_rx_<xxx> to ndo_xdp_rx_<xxx>.

> Put these ops in their own struct and add a pointer to that struct
> in net_device_ops? Purely for grouping reasons because the netdev
> ops are getting orders of magnitude past the size where you can
> actually find stuff in this struct.

Oh, great idea, will do!

> >       bpf_free_used_maps(aux);
> >       bpf_free_used_btfs(aux);
> > -     if (bpf_prog_is_offloaded(aux))
> > +     if (bpf_prog_is_dev_bound(aux))
> >               bpf_prog_offload_destroy(aux->prog);
>
> This also looks a touch like a mix of terms (condition vs function
> called).

Here, not sure, open to suggestions. These
bpf_prog_offload_init/bpf_prog_offload_destroy are generic enough
(now) that I'm calling them for both dev_bound/offloaded.

The following paths trigger for both offloaded/dev_bound cases:

if (bpf_prog_is_dev_bound()) bpf_prog_offload_init();
if (bpf_prog_is_dev_bound()) bpf_prog_offload_destroy();

Do you think it's worth having completely separate
dev_bound/offloaded paths? Or, alternatively, can rename to
bpf_prog_dev_bound_{init,destroy} but still handle both cases?

> > +static int __bpf_offload_init(void);
> > +static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
> > +                                          struct net_device *netdev);
> > +static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
> > +                                             struct net_device *netdev);
>
> fwd declarations are yuck

SG, will move them here instead.

> >  static int bpf_dev_offload_check(struct net_device *netdev)
> >  {
> >       if (!netdev)
> > @@ -87,13 +93,17 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> >           attr->prog_type != BPF_PROG_TYPE_XDP)
> >               return -EINVAL;
> >
> > -     if (attr->prog_flags)
> > +     if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
> >               return -EINVAL;
> >
> >       offload = kzalloc(sizeof(*offload), GFP_USER);
> >       if (!offload)
> >               return -ENOMEM;
> >
> > +     err = __bpf_offload_init();
> > +     if (err)
> > +             return err;
>
> leaks offload

Oops, let me actually move this to late_initcall as you suggest below.

> > @@ -209,6 +233,19 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
> >       up_read(&bpf_devs_lock);
> >  }
> >
> > +static void maybe_remove_bound_netdev(struct net_device *dev)
> > +{
>
> func name prefix ?

Good point, will rename to bpf_dev_bound_try_remove_netdev.

> > -struct bpf_offload_dev *
> > -bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
> > +static int __bpf_offload_init(void)
> >  {
> > -     struct bpf_offload_dev *offdev;
> >       int err;
> >
> >       down_write(&bpf_devs_lock);
> > @@ -680,12 +740,25 @@ bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
> >               err = rhashtable_init(&offdevs, &offdevs_params);
> >               if (err) {
> >                       up_write(&bpf_devs_lock);
> > -                     return ERR_PTR(err);
> > +                     return err;
> >               }
> >               offdevs_inited = true;
> >       }
> >       up_write(&bpf_devs_lock);
> >
> > +     return 0;
> > +}
>
> Would late_initcall() or some such not work for this?

Agreed, let's move it to the initcall instead.

> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 5b221568dfd4..862e03fcffa6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9228,6 +9228,10 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
> >                       NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
>
> extack should get updated here, I reckon, maybe in previous patch

Oh, thanks for spotting, will fix.

> >                       return -EINVAL;
> >               }
> > +             if (bpf_prog_is_dev_bound(new_prog->aux) && !bpf_offload_dev_match(new_prog, dev)) {
>
> bound_dev_match() ?

Right, so this is another case where it works for both cases. Maybe
rename to bpf_dev_bound_match and use for both offloaded/dev_bound? Or
do you prefer completely separate paths?

> > +                     NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
>
> different than.. ?

Borrowing from netdevsim, lmk if the following won't work here:

"Program bound to different device"

> > +                     return -EINVAL;
> > +             }
> >               if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
> >                       NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
> >                       return -EINVAL;


* Re: [PATCH bpf-next v3 08/12] mxl4: Support RX XDP metadata
  2022-12-08 19:07     ` Stanislav Fomichev
@ 2022-12-08 20:23       ` Tariq Toukan
  0 siblings, 0 replies; 61+ messages in thread
From: Tariq Toukan @ 2022-12-08 20:23 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev



On 12/8/2022 9:07 PM, Stanislav Fomichev wrote:
> On Wed, Dec 7, 2022 at 10:09 PM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>
>> Typo in title mxl4 -> mlx4.
>> Preferably: net/mlx4_en.
> 
> Oh, I always have to fight with this. Somehow mxl feels more natural
> :-) Thanks for spotting, will use net/mlx4_en instead. (presumably the
> same should be for mlx5?)
> 

For the newer mlx5 driver we use a shorter form, net/mlx5e.

>> On 12/6/2022 4:45 AM, Stanislav Fomichev wrote:
>>> RX timestamp and hash for now. Tested using the prog from the next
>>> patch.
>>>
>>> Also enabling xdp metadata support; don't see why it's disabled,
>>> there is enough headroom..
>>>
>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Willem de Bruijn <willemb@google.com>
>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>> Cc: xdp-hints@xdp-project.net
>>> Cc: netdev@vger.kernel.org
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> ---
>>>    drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++--
>>>    .../net/ethernet/mellanox/mlx4/en_netdev.c    | 10 +++++
>>>    drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 38 ++++++++++++++++++-
>>>    drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  1 +
>>>    include/linux/mlx4/device.h                   |  7 ++++
>>>    5 files changed, 64 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
>>> index 98b5ffb4d729..9e3b76182088 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
>>> @@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
>>>        return hi | lo;
>>>    }
>>>
>>> -void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>>> -                         struct skb_shared_hwtstamps *hwts,
>>> -                         u64 timestamp)
>>> +u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
>>>    {
>>>        unsigned int seq;
>>>        u64 nsec;
>>> @@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>>>                nsec = timecounter_cyc2time(&mdev->clock, timestamp);
>>>        } while (read_seqretry(&mdev->clock_lock, seq));
>>>
>>> +     return ns_to_ktime(nsec);
>>> +}
>>> +
>>> +void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>>> +                         struct skb_shared_hwtstamps *hwts,
>>> +                         u64 timestamp)
>>> +{
>>>        memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
>>> -     hwts->hwtstamp = ns_to_ktime(nsec);
>>> +     hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
>>>    }
>>>
>>>    /**
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> index 8800d3f1f55c..1cb63746a851 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>>> @@ -2855,6 +2855,11 @@ static const struct net_device_ops mlx4_netdev_ops = {
>>>        .ndo_features_check     = mlx4_en_features_check,
>>>        .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
>>>        .ndo_bpf                = mlx4_xdp,
>>> +
>>> +     .ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
>>> +     .ndo_xdp_rx_timestamp   = mlx4_xdp_rx_timestamp,
>>> +     .ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
>>> +     .ndo_xdp_rx_hash        = mlx4_xdp_rx_hash,
>>>    };
>>>
>>>    static const struct net_device_ops mlx4_netdev_ops_master = {
>>> @@ -2887,6 +2892,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>>>        .ndo_features_check     = mlx4_en_features_check,
>>>        .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
>>>        .ndo_bpf                = mlx4_xdp,
>>> +
>>> +     .ndo_xdp_rx_timestamp_supported = mlx4_xdp_rx_timestamp_supported,
>>> +     .ndo_xdp_rx_timestamp   = mlx4_xdp_rx_timestamp,
>>> +     .ndo_xdp_rx_hash_supported = mlx4_xdp_rx_hash_supported,
>>> +     .ndo_xdp_rx_hash        = mlx4_xdp_rx_hash,
>>>    };
>>>
>>>    struct mlx4_en_bond {
>>> [... en_rx.c, mlx4_en.h and include/linux/mlx4/device.h hunks trimmed; identical to the quote earlier in the thread ...]
>>
>> These are ethernet only functions, not known to the mlx4 core driver.
>> Please move to mlx4_en.h, and use mlx4_en_xdp_*() prefix.
> 
> For sure, thanks for the review!
> 
>>>    #endif /* MLX4_DEVICE_H */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs
  2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
                   ` (11 preceding siblings ...)
  2022-12-06  2:45 ` [PATCH bpf-next v3 12/12] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
@ 2022-12-08 22:28 ` Toke Høiland-Jørgensen
  2022-12-08 23:47   ` Stanislav Fomichev
  12 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-08 22:28 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> Please see the first patch in the series for the overall
> design and use-cases.
>
> Changes since v3:
>
> - Rework prog->bound_netdev refcounting (Jakub/Martin)
>
>   Now it's based on the offload.c framework. It mostly fits, except
>   I had to automatically insert a HT entry for the netdev. In the
>   offloaded case, the netdev is added via a call to
>   bpf_offload_dev_netdev_register from the driver init path; with
>   dev-bound programs, we have to manually add (and remove) the entry.
>
>   As suggested by Toke, I'm also prohibiting putting dev-bound programs
>   into prog-array map; essentially prohibiting tail calling into it.
>   I'm also disabling freplace of the dev-bound programs. Both of those
>   restrictions can be loosened up eventually.

I thought it would be a shame that we don't support at least freplace
programs from the get-go (as that would exclude libxdp from taking
advantage of this). So see below for a patch implementing this :)

-Toke




commit 3abb333e5fd2e8a0920b77013499bdae0ee3db43
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Thu Dec 8 23:10:54 2022 +0100

    bpf: Support consuming XDP HW metadata from fext programs
    
    Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP
    programs that consume HW metadata, implement support for propagating the
    offload information. The extension program doesn't need to set a flag or
    ifindex; these will just be propagated from the target by the verifier.
    We need to create a separate offload object for the extension program,
    though, since it can be reattached to a different program later (which
    means we can't just inherit the offload information from the target).
    
    An additional check is added on attach that the new target is compatible
    with the offload information in the extension prog.
    
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b46b60f4eae1..cfa5c847cf2c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2482,6 +2482,7 @@ void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+int __bpf_prog_offload_init(struct bpf_prog *prog, struct net_device *netdev);
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
 void bpf_offload_bound_netdev_unregister(struct net_device *dev);
 
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index bad8bab916eb..b059a7b53457 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -83,36 +83,25 @@ bpf_offload_find_netdev(struct net_device *netdev)
 	return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 }
 
-int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
+int __bpf_prog_offload_init(struct bpf_prog *prog, struct net_device *netdev)
 {
 	struct bpf_offload_netdev *ondev;
 	struct bpf_prog_offload *offload;
 	int err;
 
-	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
-	    attr->prog_type != BPF_PROG_TYPE_XDP)
+	if (!netdev)
 		return -EINVAL;
 
-	if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
-		return -EINVAL;
+	err = __bpf_offload_init();
+	if (err)
+		return err;
 
 	offload = kzalloc(sizeof(*offload), GFP_USER);
 	if (!offload)
 		return -ENOMEM;
 
-	err = __bpf_offload_init();
-	if (err)
-		return err;
-
 	offload->prog = prog;
-
-	offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
-					   attr->prog_ifindex);
-	err = bpf_dev_offload_check(offload->netdev);
-	if (err)
-		goto err_maybe_put;
-
-	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
+	offload->netdev = netdev;
 
 	down_write(&bpf_devs_lock);
 	ondev = bpf_offload_find_netdev(offload->netdev);
@@ -135,19 +124,46 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	offload->offdev = ondev->offdev;
 	prog->aux->offload = offload;
 	list_add_tail(&offload->offloads, &ondev->progs);
-	dev_put(offload->netdev);
 	up_write(&bpf_devs_lock);
 
 	return 0;
 err_unlock:
 	up_write(&bpf_devs_lock);
-err_maybe_put:
-	if (offload->netdev)
-		dev_put(offload->netdev);
 	kfree(offload);
 	return err;
 }
 
+int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
+{
+	struct net_device *netdev;
+	int err;
+
+	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
+	    attr->prog_type != BPF_PROG_TYPE_XDP)
+		return -EINVAL;
+
+	if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
+		return -EINVAL;
+
+	netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex);
+	if (!netdev)
+		return -EINVAL;
+
+	err = bpf_dev_offload_check(netdev);
+	if (err)
+		goto out;
+
+	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
+
+	err = __bpf_prog_offload_init(prog, netdev);
+	if (err)
+		goto out;
+
+out:
+	dev_put(netdev);
+	return err;
+}
+
 int bpf_prog_offload_verifier_prep(struct bpf_prog *prog)
 {
 	struct bpf_prog_offload *offload;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b345a273f7d0..606e6de5f716 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3021,6 +3021,14 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 			goto out_put_prog;
 		}
 
+		if (bpf_prog_is_dev_bound(tgt_prog->aux) &&
+		    (bpf_prog_is_offloaded(tgt_prog->aux) ||
+		     !bpf_prog_is_dev_bound(prog->aux) ||
+		     !bpf_offload_dev_match(prog, tgt_prog->aux->offload->netdev))) {
+			err = -EINVAL;
+			goto out_put_prog;
+		}
+
 		key = bpf_trampoline_compute_key(tgt_prog, NULL, btf_id);
 	}
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bc8d9b8d4f47..d92e28dd220e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16379,11 +16379,6 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 	if (tgt_prog) {
 		struct bpf_prog_aux *aux = tgt_prog->aux;
 
-		if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
-			bpf_log(log, "Replacing device-bound programs not supported\n");
-			return -EINVAL;
-		}
-
 		for (i = 0; i < aux->func_info_cnt; i++)
 			if (aux->func_info[i].type_id == btf_id) {
 				subprog = i;
@@ -16644,10 +16639,22 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	if (tgt_prog && prog->type == BPF_PROG_TYPE_EXT) {
 		/* to make freplace equivalent to their targets, they need to
 		 * inherit env->ops and expected_attach_type for the rest of the
-		 * verification
+		 * verification; we also need to propagate the prog offload data
+		 * for resolving kfuncs.
 		 */
 		env->ops = bpf_verifier_ops[tgt_prog->type];
 		prog->expected_attach_type = tgt_prog->expected_attach_type;
+
+		if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
+			if (bpf_prog_is_offloaded(tgt_prog->aux))
+				return -EINVAL;
+
+			prog->aux->dev_bound = true;
+			ret = __bpf_prog_offload_init(prog,
+						      tgt_prog->aux->offload->netdev);
+			if (ret)
+				return ret;
+		}
 	}
 
 	/* store info about the attachment target that will be used later */


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
                     ` (2 preceding siblings ...)
  2022-12-08  5:00   ` Jakub Kicinski
@ 2022-12-08 22:39   ` Toke Høiland-Jørgensen
  2022-12-08 23:46     ` Stanislav Fomichev
  2022-12-09 11:10   ` Jesper Dangaard Brouer
  4 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-08 22:39 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> There is an ndo handler per kfunc, the verifier replaces a call to the
> generic kfunc with a call to the per-device one.
>
> For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> implements all possible metadata kfuncs. Not all devices have to
> implement them. If kfunc is not supported by the target device,
> the default implementation is called instead.

So one unfortunate consequence of this "fallback to the default
implementation" is that it's really easy to get a step wrong and end up
with something that doesn't work. Specifically, if you load an XDP
program that calls the metadata kfuncs, but don't set the ifindex and
flag on load, the kfunc resolution will work just fine, but you'll end
up calling the default kfunc implementations (and get no data). I ran
into this multiple times just now when playing around with it and
implementing the freplace support.

So I really think it would be a better user experience if we completely
block (with a nice error message!) the calling of the metadata kfuncs if
the program is not device-bound...

Another UX thing I ran into is that libbpf will bail out if it can't
find the kfunc in the kernel vmlinux, even if the code calling the
function is behind an always-false if statement (which would be
eliminated as dead code from the verifier). This makes it a bit hard to
conditionally use them. Should libbpf just allow the load without
performing the relocation (and let the verifier worry about it), or
should we have a bpf_core_kfunc_exists() macro to use for checking?
Maybe both?

> Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> we treat prog_ifindex as target device for kfunc resolution.

[...]

> -	if (!bpf_prog_map_compatible(map, prog)) {
> -		bpf_prog_put(prog);
> -		return ERR_PTR(-EINVAL);
> -	}
> +	/* When tail-calling from a non-dev-bound program to a dev-bound one,
> +	 * XDP metadata helpers should be disabled. Until it's implemented,
> +	 * prohibit adding dev-bound programs to tail-call maps.
> +	 */
> +	if (bpf_prog_is_dev_bound(prog->aux))
> +		goto err;
> +
> +	if (!bpf_prog_map_compatible(map, prog))
> +		goto err;

I think it's better to move the new check into bpf_prog_map_compatible()
itself; that way it'll cover cpumaps and devmaps as well :)

-Toke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08 19:07     ` Stanislav Fomichev
@ 2022-12-08 22:53       ` Martin KaFai Lau
  2022-12-08 23:45         ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Martin KaFai Lau @ 2022-12-08 22:53 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 12/8/22 11:07 AM, Stanislav Fomichev wrote:
>>> @@ -102,11 +112,25 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
>>>        if (err)
>>>                goto err_maybe_put;
>>>
>>> +     prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
>>> +
>>
>> If I read the set correctly, bpf prog can either use metadata kfunc or offload
>> but not both. It is fine to start with only supporting metadata kfunc when there
>> is no offload but will be useful to understand the reason. I assume an offloaded
>> bpf prog should still be able to call the bpf helpers like adjust_head/tail and
>> the same should go for any kfunc?
> 
> Yes, I'm assuming there should be some work on the offloaded device
> drivers to support metadata kfuncs.
> Offloaded kfuncs, in general, seem hard (how do we call kernel func
> from the device-offloaded prog?); so refusing kfuncs early for the
> offloaded case seems fair for now?

Ah, ok.  I was actually thinking the HW offloaded prog can just use the software 
ndo_* kfunc (like other bpf-helpers).  From skimming some 
bpf_prog_offload_ops:prepare implementation, I think you are right and it seems 
BPF_PSEUDO_KFUNC_CALL has not been recognized yet.

[ ... ]

>>> @@ -226,10 +263,17 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
>>>
>>>    void bpf_prog_offload_destroy(struct bpf_prog *prog)
>>>    {
>>> +     struct net_device *netdev = NULL;
>>> +
>>>        down_write(&bpf_devs_lock);
>>> -     if (prog->aux->offload)
>>> +     if (prog->aux->offload) {
>>> +             netdev = prog->aux->offload->netdev;
>>>                __bpf_prog_offload_destroy(prog);
>>> +     }
>>>        up_write(&bpf_devs_lock);
>>> +
>>> +     if (netdev)
>>
>> May be I have missed a refcnt or lock somewhere.  Is it possible that netdev may
>> have been freed?
> 
> Yeah, with the offload framework, there are no refcnts. We put an
> "offloaded" device into a separate hashtable (protected by
> rtnl/semaphore).
> maybe_remove_bound_netdev will re-grab the locks (due to ordering:
> rtnl->bpf_devs_lock) and remove the device from the hashtable if it's
> still there.
> At least this is how, I think, it should work; LMK if something is
> still fishy here...
> 
> Or is the concern here that somebody might allocate new netdev reusing
> the same address? I think I have enough checks in
> maybe_remove_bound_netdev to guard against that. Or, at least, to make
> it safe :-)

Race is ok because ondev needs to be removed anyway when '!ondev->offdev && 
list_empty(&ondev->progs)'?  hmmm... tricky, please add a comment. :)

Why can't it be done together under the bpf_devs_lock above?  Can't the above
take an extra rtnl_lock before bpf_devs_lock?


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-06  2:45 ` [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata Stanislav Fomichev
@ 2022-12-08 22:59   ` Toke Høiland-Jørgensen
  2022-12-08 23:45     ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-08 22:59 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev <sdf@google.com> writes:

> From: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> XDP ctx to do this.

So I finally managed to get enough ducks in row to actually benchmark
this. With the caveat that I suddenly can't get the timestamp support to
work (it was working in an earlier version, but now
timestamp_supported() just returns false). I'm not sure if this is an
issue with the enablement patch, or if I just haven't gotten the
hardware configured properly. I'll investigate some more, but figured
I'd post these results now:

Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
Overhead:                   1,754,153 pps /  2.86 ns/pkt

As per the above, this is with calling three kfuncs/pkt
(metadata_supported(), rx_hash_supported() and rx_hash()). So that's
~0.95 ns per function call, which is a bit less, but not far off from
the ~1.2 ns that I'm used to. The tests where I accidentally called the
default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
definitely in that ballpark.

I'm not doing anything with the data, just reading it into an on-stack
buffer, so this is the smallest possible delta from just getting the
data out of the driver. I did confirm that the call instructions are
still in the BPF program bytecode when it's dumped back out from the
kernel.

-Toke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08 22:53       ` Martin KaFai Lau
@ 2022-12-08 23:45         ` Stanislav Fomichev
  0 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 23:45 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Thu, Dec 8, 2022 at 2:53 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/8/22 11:07 AM, Stanislav Fomichev wrote:
> >>> @@ -102,11 +112,25 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> >>>        if (err)
> >>>                goto err_maybe_put;
> >>>
> >>> +     prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
> >>> +
> >>
> >> If I read the set correctly, bpf prog can either use metadata kfunc or offload
> >> but not both. It is fine to start with only supporting metadata kfunc when there
> >> is no offload but will be useful to understand the reason. I assume an offloaded
> >> bpf prog should still be able to call the bpf helpers like adjust_head/tail and
> >> the same should go for any kfunc?
> >
> > Yes, I'm assuming there should be some work on the offloaded device
> > drivers to support metadata kfuncs.
> > Offloaded kfuncs, in general, seem hard (how do we call kernel func
> > from the device-offloaded prog?); so refusing kfuncs early for the
> > offloaded case seems fair for now?
>
> Ah, ok.  I was actually thinking the HW offloaded prog can just use the software
> ndo_* kfunc (like other bpf-helpers).  From skimming some
> bpf_prog_offload_ops:prepare implementation, I think you are right and it seems
> BPF_PSEUDO_KFUNC_CALL has not been recognized yet.
>
> [ ... ]
>
> >>> @@ -226,10 +263,17 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
> >>>
> >>>    void bpf_prog_offload_destroy(struct bpf_prog *prog)
> >>>    {
> >>> +     struct net_device *netdev = NULL;
> >>> +
> >>>        down_write(&bpf_devs_lock);
> >>> -     if (prog->aux->offload)
> >>> +     if (prog->aux->offload) {
> >>> +             netdev = prog->aux->offload->netdev;
> >>>                __bpf_prog_offload_destroy(prog);
> >>> +     }
> >>>        up_write(&bpf_devs_lock);
> >>> +
> >>> +     if (netdev)
> >>
> >> May be I have missed a refcnt or lock somewhere.  Is it possible that netdev may
> >> have been freed?
> >
> > Yeah, with the offload framework, there are no refcnts. We put an
> > "offloaded" device into a separate hashtable (protected by
> > rtnl/semaphore).
> > maybe_remove_bound_netdev will re-grab the locks (due to ordering:
> > rtnl->bpf_devs_lock) and remove the device from the hashtable if it's
> > still there.
> > At least this is how, I think, it should work; LMK if something is
> > still fishy here...
> >
> > Or is the concern here that somebody might allocate new netdev reusing
> > the same address? I think I have enough checks in
> > maybe_remove_bound_netdev to guard against that. Or, at least, to make
> > it safe :-)
>
> Race is ok because ondev needs to be removed anyway when '!ondev->offdev &&
> list_empty(&ondev->progs)'?  hmmm... tricky, please add a comment. :)
>
> Why it cannot be done together in the bpf_devs_lock above?  The above cannot
> take an extra rtnl_lock before bpf_devs_lock?

Hm, let's take an extra rtnl to avoid this complexity, agreed. I guess
I was trying to avoid taking it, but this path is still 'dev_bound ==
true' protected, so shouldn't affect the rest of the progs.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-08 22:59   ` Toke Høiland-Jørgensen
@ 2022-12-08 23:45     ` Stanislav Fomichev
  2022-12-09  0:02       ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 23:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >
> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> > XDP ctx to do this.
>
> So I finally managed to get enough ducks in row to actually benchmark
> this. With the caveat that I suddenly can't get the timestamp support to
> work (it was working in an earlier version, but now
> timestamp_supported() just returns false). I'm not sure if this is an
> issue with the enablement patch, or if I just haven't gotten the
> hardware configured properly. I'll investigate some more, but figured
> I'd post these results now:
>
> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>
> As per the above, this is with calling three kfuncs/pkt
> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> ~0.95 ns per function call, which is a bit less, but not far off from
> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> definitely in that ballpark.
>
> I'm not doing anything with the data, just reading it into an on-stack
> buffer, so this is the smallest possible delta from just getting the
> data out of the driver. I did confirm that the call instructions are
> still in the BPF program bytecode when it's dumped back out from the
> kernel.
>
> -Toke
>

Oh, that's great, thanks for running the numbers! Will definitely
reference them in v4!
Presumably, we should be able to at least unroll most of the
_supported callbacks if we want, they should be relatively easy; but
the numbers look fine as is?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08 22:39   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-12-08 23:46     ` Stanislav Fomichev
  2022-12-09  0:07       ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 23:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Dec 8, 2022 at 2:39 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > There is an ndo handler per kfunc, the verifier replaces a call to the
> > generic kfunc with a call to the per-device one.
> >
> > For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> > implements all possible metadata kfuncs. Not all devices have to
> > implement them. If kfunc is not supported by the target device,
> > the default implementation is called instead.
>
> So one unfortunate consequence of this "fallback to the default
> implementation" is that it's really easy to get a step wrong and end up
> with something that doesn't work. Specifically, if you load an XDP
> program that calls the metadata kfuncs, but don't set the ifindex and
> flag on load, the kfunc resolution will work just fine, but you'll end
> up calling the default kfunc implementations (and get no data). I ran
> into this multiple times just now when playing around with it and
> implementing the freplace support.
>
> So I really think it would be a better user experience if we completely
> block (with a nice error message!) the calling of the metadata kfuncs if
> the program is not device-bound...

Oh, right, that's a good point. Having defaults for dev-bound only
makes total sense.

> Another UX thing I ran into is that libbpf will bail out if it can't
> find the kfunc in the kernel vmlinux, even if the code calling the
> function is behind an always-false if statement (which would be
> eliminated as dead code from the verifier). This makes it a bit hard to
> conditionally use them. Should libbpf just allow the load without
> performing the relocation (and let the verifier worry about it), or
> should we have a bpf_core_kfunc_exists() macro to use for checking?
> Maybe both?

I'm not sure how libbpf can allow the load without performing the
relocation; maybe I'm missing something.
IIUC, libbpf uses the kfunc name (from the relocation?) and replaces
it with the kfunc id, right?

Having bpf_core_kfunc_exists would help, but this probably needs
compiler work first to preserve some of the kfunc traces in vmlinux.h?

So yeah, I don't have any good ideas/suggestions here on how to make
it all magically work :-(

> > Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> > we treat prog_ifindex as target device for kfunc resolution.
>
> [...]
>
> > -     if (!bpf_prog_map_compatible(map, prog)) {
> > -             bpf_prog_put(prog);
> > -             return ERR_PTR(-EINVAL);
> > -     }
> > +     /* When tail-calling from a non-dev-bound program to a dev-bound one,
> > +      * XDP metadata helpers should be disabled. Until it's implemented,
> > +      * prohibit adding dev-bound programs to tail-call maps.
> > +      */
> > +     if (bpf_prog_is_dev_bound(prog->aux))
> > +             goto err;
> > +
> > +     if (!bpf_prog_map_compatible(map, prog))
> > +             goto err;
>
> I think it's better to move the new check into bpf_prog_map_compatible()
> itself; that way it'll cover cpumaps and devmaps as well :)

Will do, thanks!

> -Toke
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs
  2022-12-08 22:28 ` [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Toke Høiland-Jørgensen
@ 2022-12-08 23:47   ` Stanislav Fomichev
  2022-12-09  0:14     ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-08 23:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Dec 8, 2022 at 2:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > Please see the first patch in the series for the overall
> > design and use-cases.
> >
> > Changes since v3:
> >
> > - Rework prog->bound_netdev refcounting (Jakub/Martin)
> >
> >   Now it's based on the offload.c framework. It mostly fits, except
> >   I had to automatically insert a HT entry for the netdev. In the
> >   offloaded case, the netdev is added via a call to
> >   bpf_offload_dev_netdev_register from the driver init path; with
> >   dev-bound programs, we have to manually add (and remove) the entry.
> >
> >   As suggested by Toke, I'm also prohibiting putting dev-bound programs
> >   into prog-array map; essentially prohibiting tail calling into it.
> >   I'm also disabling freplace of the dev-bound programs. Both of those
> >   restrictions can be loosened up eventually.
>
> I thought it would be a shame that we don't support at least freplace
> programs from the get-go (as that would exclude libxdp from taking
> advantage of this). So see below for a patch implementing this :)
>
> -Toke

Damn, now I need to write a selftest :-)
But seriously, thank you for taking care of this; will try to include it,
preserving your SoB!


> commit 3abb333e5fd2e8a0920b77013499bdae0ee3db43
> Author: Toke Høiland-Jørgensen <toke@redhat.com>
> Date:   Thu Dec 8 23:10:54 2022 +0100
>
>     bpf: Support consuming XDP HW metadata from fext programs
>
>     Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP
>     programs that consume HW metadata, implement support for propagating the
>     offload information. The extension program doesn't need to set a flag or
>     ifindex, it these will just be propagated from the target by the verifier.
>     We need to create a separate offload object for the extension program,
>     though, since it can be reattached to a different program later (which
>     means we can't just inhering the offload information from the target).
>
>     An additional check is added on attach that the new target is compatible
>     with the offload information in the extension prog.
>
>     Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index b46b60f4eae1..cfa5c847cf2c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2482,6 +2482,7 @@ void *bpf_offload_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
>  void unpriv_ebpf_notify(int new_state);
>
>  #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
> +int __bpf_prog_offload_init(struct bpf_prog *prog, struct net_device *netdev);
>  int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
>  void bpf_offload_bound_netdev_unregister(struct net_device *dev);
>
> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> index bad8bab916eb..b059a7b53457 100644
> --- a/kernel/bpf/offload.c
> +++ b/kernel/bpf/offload.c
> @@ -83,36 +83,25 @@ bpf_offload_find_netdev(struct net_device *netdev)
>         return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
>  }
>
> -int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> +int __bpf_prog_offload_init(struct bpf_prog *prog, struct net_device *netdev)
>  {
>         struct bpf_offload_netdev *ondev;
>         struct bpf_prog_offload *offload;
>         int err;
>
> -       if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
> -           attr->prog_type != BPF_PROG_TYPE_XDP)
> +       if (!netdev)
>                 return -EINVAL;
>
> -       if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
> -               return -EINVAL;
> +       err = __bpf_offload_init();
> +       if (err)
> +               return err;
>
>         offload = kzalloc(sizeof(*offload), GFP_USER);
>         if (!offload)
>                 return -ENOMEM;
>
> -       err = __bpf_offload_init();
> -       if (err)
> -               return err;
> -
>         offload->prog = prog;
> -
> -       offload->netdev = dev_get_by_index(current->nsproxy->net_ns,
> -                                          attr->prog_ifindex);
> -       err = bpf_dev_offload_check(offload->netdev);
> -       if (err)
> -               goto err_maybe_put;
> -
> -       prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
> +       offload->netdev = netdev;
>
>         down_write(&bpf_devs_lock);
>         ondev = bpf_offload_find_netdev(offload->netdev);
> @@ -135,19 +124,46 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
>         offload->offdev = ondev->offdev;
>         prog->aux->offload = offload;
>         list_add_tail(&offload->offloads, &ondev->progs);
> -       dev_put(offload->netdev);
>         up_write(&bpf_devs_lock);
>
>         return 0;
>  err_unlock:
>         up_write(&bpf_devs_lock);
> -err_maybe_put:
> -       if (offload->netdev)
> -               dev_put(offload->netdev);
>         kfree(offload);
>         return err;
>  }
>
> +int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
> +{
> +       struct net_device *netdev;
> +       int err;
> +
> +       if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
> +           attr->prog_type != BPF_PROG_TYPE_XDP)
> +               return -EINVAL;
> +
> +       if (attr->prog_flags & ~BPF_F_XDP_HAS_METADATA)
> +               return -EINVAL;
> +
> +       netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex);
> +       if (!netdev)
> +               return -EINVAL;
> +
> +       err = bpf_dev_offload_check(netdev);
> +       if (err)
> +               goto out;
> +
> +       prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_HAS_METADATA);
> +
> +       err = __bpf_prog_offload_init(prog, netdev);
> +       if (err)
> +               goto out;
> +
> +out:
> +       dev_put(netdev);
> +       return err;
> +}
> +
>  int bpf_prog_offload_verifier_prep(struct bpf_prog *prog)
>  {
>         struct bpf_prog_offload *offload;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index b345a273f7d0..606e6de5f716 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3021,6 +3021,14 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
>                         goto out_put_prog;
>                 }
>
> +               if (bpf_prog_is_dev_bound(tgt_prog->aux) &&
> +                   (bpf_prog_is_offloaded(tgt_prog->aux) ||
> +                    !bpf_prog_is_dev_bound(prog->aux) ||
> +                    !bpf_offload_dev_match(prog, tgt_prog->aux->offload->netdev))) {
> +                       err = -EINVAL;
> +                       goto out_put_prog;
> +               }
> +
>                 key = bpf_trampoline_compute_key(tgt_prog, NULL, btf_id);
>         }
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index bc8d9b8d4f47..d92e28dd220e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -16379,11 +16379,6 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
>         if (tgt_prog) {
>                 struct bpf_prog_aux *aux = tgt_prog->aux;
>
> -               if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
> -                       bpf_log(log, "Replacing device-bound programs not supported\n");
> -                       return -EINVAL;
> -               }
> -
>                 for (i = 0; i < aux->func_info_cnt; i++)
>                         if (aux->func_info[i].type_id == btf_id) {
>                                 subprog = i;
> @@ -16644,10 +16639,22 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>         if (tgt_prog && prog->type == BPF_PROG_TYPE_EXT) {
>                 /* to make freplace equivalent to their targets, they need to
>                  * inherit env->ops and expected_attach_type for the rest of the
> -                * verification
> +                * verification; we also need to propagate the prog offload data
> +                * for resolving kfuncs.
>                  */
>                 env->ops = bpf_verifier_ops[tgt_prog->type];
>                 prog->expected_attach_type = tgt_prog->expected_attach_type;
> +
> +               if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
> +                       if (bpf_prog_is_offloaded(tgt_prog->aux))
> +                               return -EINVAL;
> +
> +                       prog->aux->dev_bound = true;
> +                       ret = __bpf_prog_offload_init(prog,
> +                                                     tgt_prog->aux->offload->netdev);
> +                       if (ret)
> +                               return ret;
> +               }
>         }
>
>         /* store info about the attachment target that will be used later */
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-08 23:45     ` Stanislav Fomichev
@ 2022-12-09  0:02       ` Toke Høiland-Jørgensen
  2022-12-09  0:07         ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09  0:02 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >
>> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> > XDP ctx to do this.
>>
>> So I finally managed to get enough ducks in a row to actually benchmark
>> this. With the caveat that I suddenly can't get the timestamp support to
>> work (it was working in an earlier version, but now
>> timestamp_supported() just returns false). I'm not sure if this is an
>> issue with the enablement patch, or if I just haven't gotten the
>> hardware configured properly. I'll investigate some more, but figured
>> I'd post these results now:
>>
>> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>>
>> As per the above, this is with calling three kfuncs/pkt
>> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> ~0.95 ns per function call, which is a bit less, but not far off from
>> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> definitely in that ballpark.
>>
>> I'm not doing anything with the data, just reading it into an on-stack
>> buffer, so this is the smallest possible delta from just getting the
>> data out of the driver. I did confirm that the call instructions are
>> still in the BPF program bytecode when it's dumped back out from the
>> kernel.
>>
>> -Toke
>>
>
> Oh, that's great, thanks for running the numbers! Will definitely
> reference them in v4!
> Presumably, we should be able to at least unroll most of the
> _supported callbacks if we want, they should be relatively easy; but
> the numbers look fine as is?

Well, this is for one (and a half) piece of metadata. If we extrapolate,
it adds up quickly. Say we add csum and vlan tags, and maybe
another callback to get the type of hash (l3/l4). Those would probably
be relevant for most packets in a fairly common setup. Extrapolating
from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
baseline of 39 ns.

So in that sense I still think unrolling makes sense. At least for the
_supported() calls, as eating a whole function call just for that is
probably a bit much (which I think was also Jakub's point in a sibling
thread somewhere).

-Toke


* Re: [xdp-hints] Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08 23:46     ` Stanislav Fomichev
@ 2022-12-09  0:07       ` Toke Høiland-Jørgensen
  2022-12-09  2:57         ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09  0:07 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

>> Another UX thing I ran into is that libbpf will bail out if it can't
>> find the kfunc in the kernel vmlinux, even if the code calling the
>> function is behind an always-false if statement (which would be
>> eliminated as dead code from the verifier). This makes it a bit hard to
>> conditionally use them. Should libbpf just allow the load without
>> performing the relocation (and let the verifier worry about it), or
>> should we have a bpf_core_kfunc_exists() macro to use for checking?
>> Maybe both?
>
> I'm not sure how libbpf can allow the load without performing the
> relocation; maybe I'm missing something.
> IIUC, libbpf uses the kfunc name (from the relocation?) and replaces
> it with the kfunc id, right?

Yeah, so if it can't find the kfunc in vmlinux, just write an id of 0.
This will trip the check at the top of fixup_kfunc_call() in the
verifier, but if the code is hidden behind an always-false branch (an
rodata variable set to zero, say) the instructions should get eliminated
before they reach that point. That way you can at least turn it off at
runtime (after having done some kind of feature detection) without
having to compile it out of your program entirely.

> Having bpf_core_kfunc_exists would help, but this probably needs
> compiler work first to preserve some of the kfunc traces in vmlinux.h?

I am not sure how the existing macros work, TBH. Hopefully someone else
can chime in :)

-Toke
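
[ The pattern under discussion — gating the kfunc call behind an always-false
rodata branch so the verifier eliminates it as dead code — would look roughly
like this on the BPF side. The kfunc declaration and flag name are
illustrative, and this would only load if libbpf were taught to write a zero
id for unresolved kfuncs as suggested above: ]

```c
#include <linux/bpf.h>
#include <stdbool.h>
#include <bpf/bpf_helpers.h>	/* SEC(), __ksym */

/* Stand-in for whatever kfunc the kernel may or may not export;
 * the exact signature is an assumption here. */
extern __u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx) __ksym;

/* Userspace sets this via the skeleton's ->rodata before load,
 * after probing vmlinux BTF for the kfunc; defaults to false. */
const volatile bool have_rx_hash = false;

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	__u32 hash = 0;

	if (have_rx_hash)
		/* Dead code when have_rx_hash == false: the verifier
		 * prunes this branch, so a zero kfunc id here never
		 * trips the check in fixup_kfunc_call(). */
		hash = bpf_xdp_metadata_rx_hash(ctx);

	return XDP_PASS;
}
```

[ The BTF probe that flips have_rx_hash is exactly where a
bpf_core_kfunc_exists()-style helper would slot in. ]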


* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  0:02       ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-12-09  0:07         ` Alexei Starovoitov
  2022-12-09  0:29           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2022-12-09  0:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >
> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >> > XDP ctx to do this.
> >>
> >> So I finally managed to get enough ducks in a row to actually benchmark
> >> this. With the caveat that I suddenly can't get the timestamp support to
> >> work (it was working in an earlier version, but now
> >> timestamp_supported() just returns false). I'm not sure if this is an
> >> issue with the enablement patch, or if I just haven't gotten the
> >> hardware configured properly. I'll investigate some more, but figured
> >> I'd post these results now:
> >>
> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >>
> >> As per the above, this is with calling three kfuncs/pkt
> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >> definitely in that ballpark.
> >>
> >> I'm not doing anything with the data, just reading it into an on-stack
> >> buffer, so this is the smallest possible delta from just getting the
> >> data out of the driver. I did confirm that the call instructions are
> >> still in the BPF program bytecode when it's dumped back out from the
> >> kernel.
> >>
> >> -Toke
> >>
> >
> > Oh, that's great, thanks for running the numbers! Will definitely
> > reference them in v4!
> > Presumably, we should be able to at least unroll most of the
> > _supported callbacks if we want, they should be relatively easy; but
> > the numbers look fine as is?
>
> >> Well, this is for one (and a half) piece of metadata. If we extrapolate,
> >> it adds up quickly. Say we add csum and vlan tags, and maybe
> another callback to get the type of hash (l3/l4). Those would probably
> be relevant for most packets in a fairly common setup. Extrapolating
> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> baseline of 39 ns.
>
> So in that sense I still think unrolling makes sense. At least for the
> _supported() calls, as eating a whole function call just for that is
> probably a bit much (which I think was also Jakub's point in a sibling
> thread somewhere).

imo the overhead is tiny enough that we can wait until
generic 'kfunc inlining' infra is ready.

We're planning to dual-compile some_kernel_file.c
into native arch and into bpf arch.
Then the verifier will automatically inline bpf asm
of corresponding kfunc.


* Re: [xdp-hints] Re: [PATCH bpf-next v3 00/12] xdp: hints via kfuncs
  2022-12-08 23:47   ` Stanislav Fomichev
@ 2022-12-09  0:14     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09  0:14 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Dec 8, 2022 at 2:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > Please see the first patch in the series for the overall
>> > design and use-cases.
>> >
>> > Changes since v3:
>> >
>> > - Rework prog->bound_netdev refcounting (Jakub/Martin)
>> >
>> >   Now it's based on the offload.c framework. It mostly fits, except
>> >   I had to automatically insert a HT entry for the netdev. In the
>> >   offloaded case, the netdev is added via a call to
>> >   bpf_offload_dev_netdev_register from the driver init path; with
>> >   dev-bound programs, we have to manually add (and remove) the entry.
>> >
>> >   As suggested by Toke, I'm also prohibiting putting dev-bound programs
>> >   into prog-array map; essentially prohibiting tail calling into it.
>> >   I'm also disabling freplace of the dev-bound programs. Both of those
>> >   restrictions can be loosened up eventually.
>>
>> I thought it would be a shame that we don't support at least freplace
>> programs from the get-go (as that would exclude libxdp from taking
>> advantage of this). So see below for a patch implementing this :)
>>
>> -Toke
>
> Damn, now I need to write a selftest :-)
> But seriously, thank you for taking care of this; will try to include
> it, preserving your SoB!

Cool, thanks! I just realised I made one mistake in the attach check,
though:

[...]

>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index b345a273f7d0..606e6de5f716 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -3021,6 +3021,14 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog,
>>                         goto out_put_prog;
>>                 }
>>
>> +               if (bpf_prog_is_dev_bound(tgt_prog->aux) &&
>> +                   (bpf_prog_is_offloaded(tgt_prog->aux) ||
>> +                    !bpf_prog_is_dev_bound(prog->aux) ||
>> +                    !bpf_offload_dev_match(prog, tgt_prog->aux->offload->netdev))) {

This should switch the order of the is_dev_bound() checks, like:

+               if (bpf_prog_is_dev_bound(prog->aux) &&
+                   (bpf_prog_is_offloaded(tgt_prog->aux) ||
+                    !bpf_prog_is_dev_bound(tgt_prog->aux) ||
+                    !bpf_offload_dev_match(prog, tgt_prog->aux->offload->netdev))) {

I.e., first check bpf_prog_is_dev_bound(prog->aux) (the program being
attached), and only perform the other checks if we're attaching
something that has been verified as being dev-bound. It should be fine
to attach a non-devbound function to a devbound parent program (since
that non-devbound function can't call any of the kfuncs).

-Toke



* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  0:07         ` Alexei Starovoitov
@ 2022-12-09  0:29           ` Toke Høiland-Jørgensen
  2022-12-09  0:32             ` Alexei Starovoitov
  0 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09  0:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> >
>> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> >> > XDP ctx to do this.
>> >>
>> >> So I finally managed to get enough ducks in a row to actually benchmark
>> >> this. With the caveat that I suddenly can't get the timestamp support to
>> >> work (it was working in an earlier version, but now
>> >> timestamp_supported() just returns false). I'm not sure if this is an
>> >> issue with the enablement patch, or if I just haven't gotten the
>> >> hardware configured properly. I'll investigate some more, but figured
>> >> I'd post these results now:
>> >>
>> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>> >>
>> >> As per the above, this is with calling three kfuncs/pkt
>> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> >> ~0.95 ns per function call, which is a bit less, but not far off from
>> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> >> definitely in that ballpark.
>> >>
>> >> I'm not doing anything with the data, just reading it into an on-stack
>> >> buffer, so this is the smallest possible delta from just getting the
>> >> data out of the driver. I did confirm that the call instructions are
>> >> still in the BPF program bytecode when it's dumped back out from the
>> >> kernel.
>> >>
>> >> -Toke
>> >>
>> >
>> > Oh, that's great, thanks for running the numbers! Will definitely
>> > reference them in v4!
>> > Presumably, we should be able to at least unroll most of the
>> > _supported callbacks if we want, they should be relatively easy; but
>> > the numbers look fine as is?
>>
>> >> Well, this is for one (and a half) piece of metadata. If we extrapolate,
>> >> it adds up quickly. Say we add csum and vlan tags, and maybe
>> another callback to get the type of hash (l3/l4). Those would probably
>> be relevant for most packets in a fairly common setup. Extrapolating
>> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>> baseline of 39 ns.
>>
>> So in that sense I still think unrolling makes sense. At least for the
>> _supported() calls, as eating a whole function call just for that is
>> probably a bit much (which I think was also Jakub's point in a sibling
>> thread somewhere).
>
> imo the overhead is tiny enough that we can wait until
> generic 'kfunc inlining' infra is ready.
>
> We're planning to dual-compile some_kernel_file.c
> into native arch and into bpf arch.
> Then the verifier will automatically inline bpf asm
> of corresponding kfunc.

Is that "planning" or "actively working on"? Just trying to get a sense
of the time frames here, as this sounds neat, but also something that
could potentially require quite a bit of fiddling with the build system
to get to work? :)

-Toke



* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  0:29           ` Toke Høiland-Jørgensen
@ 2022-12-09  0:32             ` Alexei Starovoitov
  2022-12-09  0:53               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2022-12-09  0:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >>
> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >> >
> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >> >> > XDP ctx to do this.
> >> >>
> >> >> So I finally managed to get enough ducks in a row to actually benchmark
> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >> >> work (it was working in an earlier version, but now
> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >> >> issue with the enablement patch, or if I just haven't gotten the
> >> >> hardware configured properly. I'll investigate some more, but figured
> >> >> I'd post these results now:
> >> >>
> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >> >>
> >> >> As per the above, this is with calling three kfuncs/pkt
> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >> >> definitely in that ballpark.
> >> >>
> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >> >> buffer, so this is the smallest possible delta from just getting the
> >> >> data out of the driver. I did confirm that the call instructions are
> >> >> still in the BPF program bytecode when it's dumped back out from the
> >> >> kernel.
> >> >>
> >> >> -Toke
> >> >>
> >> >
> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >> > reference them in v4!
> >> > Presumably, we should be able to at least unroll most of the
> >> > _supported callbacks if we want, they should be relatively easy; but
> >> > the numbers look fine as is?
> >>
> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate,
> >> >> it adds up quickly. Say we add csum and vlan tags, and maybe
> >> another callback to get the type of hash (l3/l4). Those would probably
> >> be relevant for most packets in a fairly common setup. Extrapolating
> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >> baseline of 39 ns.
> >>
> >> So in that sense I still think unrolling makes sense. At least for the
> >> _supported() calls, as eating a whole function call just for that is
> >> probably a bit much (which I think was also Jakub's point in a sibling
> >> thread somewhere).
> >
> > imo the overhead is tiny enough that we can wait until
> > generic 'kfunc inlining' infra is ready.
> >
> > We're planning to dual-compile some_kernel_file.c
> > into native arch and into bpf arch.
> > Then the verifier will automatically inline bpf asm
> > of corresponding kfunc.
>
> Is that "planning" or "actively working on"? Just trying to get a sense
> of the time frames here, as this sounds neat, but also something that
> could potentially require quite a bit of fiddling with the build system
> to get to work? :)

"planning", but regardless how long it takes I'd rather not
add any more tech debt in the form of manual bpf asm generation.
We have too much of it already: gen_lookup, convert_ctx_access, etc.


* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  0:32             ` Alexei Starovoitov
@ 2022-12-09  0:53               ` Toke Høiland-Jørgensen
  2022-12-09  2:57                 ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09  0:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >>
>> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> >> >
>> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> >> >> > XDP ctx to do this.
>> >> >>
>> >> >> So I finally managed to get enough ducks in a row to actually benchmark
>> >> >> this. With the caveat that I suddenly can't get the timestamp support to
>> >> >> work (it was working in an earlier version, but now
>> >> >> timestamp_supported() just returns false). I'm not sure if this is an
>> >> >> issue with the enablement patch, or if I just haven't gotten the
>> >> >> hardware configured properly. I'll investigate some more, but figured
>> >> >> I'd post these results now:
>> >> >>
>> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>> >> >>
>> >> >> As per the above, this is with calling three kfuncs/pkt
>> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
>> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> >> >> definitely in that ballpark.
>> >> >>
>> >> >> I'm not doing anything with the data, just reading it into an on-stack
>> >> >> buffer, so this is the smallest possible delta from just getting the
>> >> >> data out of the driver. I did confirm that the call instructions are
>> >> >> still in the BPF program bytecode when it's dumped back out from the
>> >> >> kernel.
>> >> >>
>> >> >> -Toke
>> >> >>
>> >> >
>> >> > Oh, that's great, thanks for running the numbers! Will definitely
>> >> > reference them in v4!
>> >> > Presumably, we should be able to at least unroll most of the
>> >> > _supported callbacks if we want, they should be relatively easy; but
>> >> > the numbers look fine as is?
>> >>
>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate,
>> >> >> it adds up quickly. Say we add csum and vlan tags, and maybe
>> >> another callback to get the type of hash (l3/l4). Those would probably
>> >> be relevant for most packets in a fairly common setup. Extrapolating
>> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>> >> baseline of 39 ns.
>> >>
>> >> So in that sense I still think unrolling makes sense. At least for the
>> >> _supported() calls, as eating a whole function call just for that is
>> >> probably a bit much (which I think was also Jakub's point in a sibling
>> >> thread somewhere).
>> >
>> > imo the overhead is tiny enough that we can wait until
>> > generic 'kfunc inlining' infra is ready.
>> >
>> > We're planning to dual-compile some_kernel_file.c
>> > into native arch and into bpf arch.
>> > Then the verifier will automatically inline bpf asm
>> > of corresponding kfunc.
>>
>> Is that "planning" or "actively working on"? Just trying to get a sense
>> of the time frames here, as this sounds neat, but also something that
>> could potentially require quite a bit of fiddling with the build system
>> to get to work? :)
>
> "planning", but regardless how long it takes I'd rather not
> add any more tech debt in the form of manual bpf asm generation.
> We have too much of it already: gen_lookup, convert_ctx_access, etc.

Right, I'm no fan of the manual ASM stuff either. However, if we're
stuck with the function call overhead for the foreseeable future, maybe
we should think about other ways of cutting down the number of function
calls needed?

One thing I can think of is to get rid of the individual _supported()
kfuncs and instead have a single one that lets you query multiple
features at once, like:

__u64 features_supported, features_wanted = XDP_META_RX_HASH | XDP_META_TIMESTAMP;

features_supported = bpf_xdp_metadata_query_features(ctx, features_wanted);

if (features_supported & XDP_META_RX_HASH)
  hash = bpf_xdp_metadata_rx_hash(ctx);

...etc


-Toke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-08 19:07     ` Stanislav Fomichev
@ 2022-12-09  1:30       ` Jakub Kicinski
  2022-12-09  2:57         ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-09  1:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, 8 Dec 2022 11:07:43 -0800 Stanislav Fomichev wrote:
> > >       bpf_free_used_maps(aux);
> > >       bpf_free_used_btfs(aux);
> > > -     if (bpf_prog_is_offloaded(aux))
> > > +     if (bpf_prog_is_dev_bound(aux))
> > >               bpf_prog_offload_destroy(aux->prog);  
> >
> > This also looks a touch like a mix of terms (condition vs function
> > called).  
> 
> Here, not sure, open to suggestions. These
> bpf_prog_offload_init/bpf_prog_offload_destroy are generic enough
> (now) that I'm calling them for both dev_bound/offloaded.
> 
> The following paths trigger for both offloaded/dev_bound cases:
> 
> if (bpf_prog_is_dev_bound()) bpf_prog_offload_init();
> if (bpf_prog_is_dev_bound()) bpf_prog_offload_destroy();
> 
> Do you think it's worth it having completely separate
> dev_bound/offloaded paths? Or, alternatively, can rename to
> bpf_prog_dev_bound_{init,destroy} but still handle both cases?

Any offload should be bound, right? So I think functions which handle
both can use the bound naming scheme, only the offload-specific ones 
should explicitly use offload?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-09  0:07       ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-12-09  2:57         ` Stanislav Fomichev
  2022-12-10  0:42           ` Martin KaFai Lau
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-09  2:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Dec 8, 2022 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> >> Another UX thing I ran into is that libbpf will bail out if it can't
> >> find the kfunc in the kernel vmlinux, even if the code calling the
> >> function is behind an always-false if statement (which would be
> >> eliminated as dead code from the verifier). This makes it a bit hard to
> >> conditionally use them. Should libbpf just allow the load without
> >> performing the relocation (and let the verifier worry about it), or
> >> should we have a bpf_core_kfunc_exists() macro to use for checking?
> >> Maybe both?
> >
> > I'm not sure how libbpf can allow the load without performing the
> > relocation; maybe I'm missing something.
> > IIUC, libbpf uses the kfunc name (from the relocation?) and replaces
> > it with the kfunc id, right?
>
> Yeah, so if it can't find the kfunc in vmlinux, just write an id of 0.
> This will trip the check at the top of fixup_kfunc_call() in the
> verifier, but if the code is hidden behind an always-false branch (an
> rodata variable set to zero, say) the instructions should get eliminated
> before they reach that point. That way you can at least turn it off at
> runtime (after having done some kind of feature detection) without
> having to compile it out of your program entirely.
>
> > Having bpf_core_kfunc_exists would help, but this probably needs
> > compiler work first to preserve some of the kfunc traces in vmlinux.h?
>
> I am not sure how the existing macros work, TBH. Hopefully someone else
> can chime in :)

+1

I think we need to poke Andrii as a follow up :-)

> -Toke
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-09  1:30       ` Jakub Kicinski
@ 2022-12-09  2:57         ` Stanislav Fomichev
  0 siblings, 0 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-09  2:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, Dec 8, 2022 at 5:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 8 Dec 2022 11:07:43 -0800 Stanislav Fomichev wrote:
> > > >       bpf_free_used_maps(aux);
> > > >       bpf_free_used_btfs(aux);
> > > > -     if (bpf_prog_is_offloaded(aux))
> > > > +     if (bpf_prog_is_dev_bound(aux))
> > > >               bpf_prog_offload_destroy(aux->prog);
> > >
> > > This also looks a touch like a mix of terms (condition vs function
> > > called).
> >
> > Here, not sure, open to suggestions. These
> > bpf_prog_offload_init/bpf_prog_offload_destroy are generic enough
> > (now) that I'm calling them for both dev_bound/offloaded.
> >
> > The following paths trigger for both offloaded/dev_bound cases:
> >
> > if (bpf_prog_is_dev_bound()) bpf_prog_offload_init();
> > if (bpf_prog_is_dev_bound()) bpf_prog_offload_destroy();
> >
> > Do you think it's worth it having completely separate
> > dev_bound/offloaded paths? Or, alternatively, can rename to
> > bpf_prog_dev_bound_{init,destroy} but still handle both cases?
>
> Any offload should be bound, right? So I think functions which handle
> both can use the bound naming scheme, only the offload-specific ones
> should explicitly use offload?

Agreed. Will rename the common ones to dev_bound!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  0:53               ` Toke Høiland-Jørgensen
@ 2022-12-09  2:57                 ` Stanislav Fomichev
  2022-12-09  5:24                   ` Saeed Mahameed
  2022-12-09 14:42                   ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-09  2:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >>
> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >>
> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> >>
> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >> >>
> >> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >> >> >
> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >> >> >> > XDP ctx to do this.
> >> >> >>
> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >> >> >> work (it was working in an earlier version, but now
> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >> >> >> issue with the enablement patch, or if I just haven't gotten the
> >> >> >> hardware configured properly. I'll investigate some more, but figured
> >> >> >> I'd post these results now:
> >> >> >>
> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >> >> >>
> >> >> >> As per the above, this is with calling three kfuncs/pkt
> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >> >> >> definitely in that ballpark.
> >> >> >>
> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >> >> >> buffer, so this is the smallest possible delta from just getting the
> >> >> >> data out of the driver. I did confirm that the call instructions are
> >> >> >> still in the BPF program bytecode when it's dumped back out from the
> >> >> >> kernel.
> >> >> >>
> >> >> >> -Toke
> >> >> >>
> >> >> >
> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >> >> > reference them in v4!
> >> >> > Presumably, we should be able to at least unroll most of the
> >> >> > _supported callbacks if we want, they should be relatively easy; but
> >> >> > the numbers look fine as is?
> >> >>
> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >> >> another callback to get the type of hash (l3/l4). Those would probably
> >> >> be relevant for most packets in a fairly common setup. Extrapolating
> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >> >> baseline of 39 ns.
> >> >>
> >> >> So in that sense I still think unrolling makes sense. At least for the
> >> >> _supported() calls, as eating a whole function call just for that is
> >> >> probably a bit much (which I think was also Jakub's point in a sibling
> >> >> thread somewhere).
> >> >
> >> > imo the overhead is tiny enough that we can wait until
> >> > generic 'kfunc inlining' infra is ready.
> >> >
> >> > We're planning to dual-compile some_kernel_file.c
> >> > into native arch and into bpf arch.
> >> > Then the verifier will automatically inline bpf asm
> >> > of corresponding kfunc.
> >>
> >> Is that "planning" or "actively working on"? Just trying to get a sense
> >> of the time frames here, as this sounds neat, but also something that
> >> could potentially require quite a bit of fiddling with the build system
> >> to get to work? :)
> >
> > "planning", but regardless how long it takes I'd rather not
> > add any more tech debt in the form of manual bpf asm generation.
> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
>
> Right, I'm no fan of the manual ASM stuff either. However, if we're
> stuck with the function call overhead for the foreseeable future, maybe
> we should think about other ways of cutting down the number of function
> calls needed?
>
> One thing I can think of is to get rid of the individual _supported()
> kfuncs and instead have a single one that lets you query multiple
> features at once, like:
>
> __u64 features_supported, features_wanted = XDP_META_RX_HASH | XDP_META_TIMESTAMP;
>
> features_supported = bpf_xdp_metadata_query_features(ctx, features_wanted);
>
> if (features_supported & XDP_META_RX_HASH)
>   hash = bpf_xdp_metadata_rx_hash(ctx);
>
> ...etc

I'm not too happy about having the bitmasks tbh :-(
If we want to get rid of the cost of those _supported calls, maybe we
can do some kind of libbpf-like probing? That would require loading a
program + waiting for some packet though :-(

Or maybe they can just be cached for now?

if (unlikely(!got_first_packet)) {
  have_hash = bpf_xdp_metadata_rx_hash_supported();
  have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
  got_first_packet = true;
}

if (have_hash) {}
if (have_timestamp) {}

That should hopefully work until generic inlining infra?

> -Toke
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  2:57                 ` Stanislav Fomichev
@ 2022-12-09  5:24                   ` Saeed Mahameed
  2022-12-09 12:59                     ` Jesper Dangaard Brouer
  2022-12-09 14:42                   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 61+ messages in thread
From: Saeed Mahameed @ 2022-12-09  5:24 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, Network Development

On 08 Dec 18:57, Stanislav Fomichev wrote:
>On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> >>
>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >>
>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >>
>> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >> >>
>> >> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> >> >> >
>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> >> >> >> > XDP ctx to do this.
>> >> >> >>
>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
>> >> >> >> work (it was working in an earlier version, but now
>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
>> >> >> >> hardware configured properly. I'll investigate some more, but figured
>> >> >> >> I'd post these results now:
>> >> >> >>
>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>> >> >> >>
>> >> >> >> As per the above, this is with calling three kfuncs/pkt
>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> >> >> >> definitely in that ballpark.
>> >> >> >>
>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
>> >> >> >> buffer, so this is the smallest possible delta from just getting the
>> >> >> >> data out of the driver. I did confirm that the call instructions are
>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
>> >> >> >> kernel.
>> >> >> >>
>> >> >> >> -Toke
>> >> >> >>
>> >> >> >
>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
>> >> >> > reference them in v4!
>> >> >> > Presumably, we should be able to at least unroll most of the
>> >> >> > _supported callbacks if we want, they should be relatively easy; but
>> >> >> > the numbers look fine as is?
>> >> >>
>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
>> >> >> another callback to get the type of hash (l3/l4). Those would probably
>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>> >> >> baseline of 39 ns.
>> >> >>
>> >> >> So in that sense I still think unrolling makes sense. At least for the
>> >> >> _supported() calls, as eating a whole function call just for that is
>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
>> >> >> thread somewhere).
>> >> >
>> >> > imo the overhead is tiny enough that we can wait until
>> >> > generic 'kfunc inlining' infra is ready.
>> >> >
>> >> > We're planning to dual-compile some_kernel_file.c
>> >> > into native arch and into bpf arch.
>> >> > Then the verifier will automatically inline bpf asm
>> >> > of corresponding kfunc.
>> >>
>> >> Is that "planning" or "actively working on"? Just trying to get a sense
>> >> of the time frames here, as this sounds neat, but also something that
>> >> could potentially require quite a bit of fiddling with the build system
>> >> to get to work? :)
>> >
>> > "planning", but regardless how long it takes I'd rather not
>> > add any more tech debt in the form of manual bpf asm generation.
>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
>>
>> Right, I'm no fan of the manual ASM stuff either. However, if we're
>> stuck with the function call overhead for the foreseeable future, maybe
>> we should think about other ways of cutting down the number of function
>> calls needed?
>>
>> One thing I can think of is to get rid of the individual _supported()
>> kfuncs and instead have a single one that lets you query multiple
>> features at once, like:
>>
>> __u64 features_supported, features_wanted = XDP_META_RX_HASH | XDP_META_TIMESTAMP;
>>
>> features_supported = bpf_xdp_metadata_query_features(ctx, features_wanted);
>>
>> if (features_supported & XDP_META_RX_HASH)
>>   hash = bpf_xdp_metadata_rx_hash(ctx);
>>
>> ...etc
>
>I'm not too happy about having the bitmasks tbh :-(
>If we want to get rid of the cost of those _supported calls, maybe we
>can do some kind of libbpf-like probing? That would require loading a
>program + waiting for some packet though :-(
>
>Or maybe they can just be cached for now?
>
>if (unlikely(!got_first_packet)) {
>  have_hash = bpf_xdp_metadata_rx_hash_supported();
>  have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
>  got_first_packet = true;
>}

hash/timestamp/csum is per packet .. vlan as well, depending how you look
at it..

Sorry I haven't been following the progress of xdp metadata, but why did
we drop the idea of BTF and the driver copying metadata in front of the
xdp frame?

hopefully future HW generations will do that for free .. 

if BTF is the problem then each vendor can provide bpf func(s) that would
parse the metadata inside the xdp/bpf prog domain to help programs
extract the vendor-specific data..


>
>if (have_hash) {}
>if (have_timestamp) {}
>
>That should hopefully work until generic inlining infra?
>
>> -Toke
>>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
                     ` (3 preceding siblings ...)
  2022-12-08 22:39   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-12-09 11:10   ` Jesper Dangaard Brouer
  2022-12-09 17:47     ` Stanislav Fomichev
  4 siblings, 1 reply; 61+ messages in thread
From: Jesper Dangaard Brouer @ 2022-12-09 11:10 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 06/12/2022 03.45, Stanislav Fomichev wrote:
> There is an ndo handler per kfunc, the verifier replaces a call to the
> generic kfunc with a call to the per-device one.
> 
> For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> implements all possible metatada kfuncs. Not all devices have to
> implement them. If kfunc is not supported by the target device,
> the default implementation is called instead.
> 
> Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> we treat prog_index as target device for kfunc resolution.
> 

[...cut...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5aa35c58c342..2eabb9157767 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
>   struct udp_tunnel_nic;
>   struct bpf_prog;
>   struct xdp_buff;
> +struct xdp_md;
>   
>   void synchronize_net(void);
>   void netdev_set_default_ethtool_ops(struct net_device *dev,
> @@ -1611,6 +1612,10 @@ struct net_device_ops {
>   	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
>   						  const struct skb_shared_hwtstamps *hwtstamps,
>   						  bool cycles);
> +	bool			(*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
> +	u64			(*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
> +	bool			(*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
> +	u32			(*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
>   };
>   

Would it make sense to add a 'flags' parameter to ndo_xdp_rx_timestamp
and ndo_xdp_rx_hash ?

E.g. we could have a "STORE" flag that asks the kernel to store this
information for later. This will be helpful for both the SKB and
redirect use-cases.
For a redirect, e.g. into a veth, the BPF-prog can then use the same
function bpf_xdp_metadata_rx_hash() to retrieve the RX-hash, as it can
obtain the "stored" value (from the BPF-prog that did the redirect).

(p.s. Hopefully a const 'flags' variable can be optimized when unrolling
to eliminate store instructions when flags==0)

>   /**
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 55dbc68bfffc..c24aba5c363b 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -409,4 +409,33 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>   
>   #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
>   
> +#define XDP_METADATA_KFUNC_xxx	\
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> +			   bpf_xdp_metadata_rx_timestamp_supported) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> +			   bpf_xdp_metadata_rx_timestamp) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED, \
> +			   bpf_xdp_metadata_rx_hash_supported) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
> +			   bpf_xdp_metadata_rx_hash) \
> +
> +enum {
> +#define XDP_METADATA_KFUNC(name, str) name,
> +XDP_METADATA_KFUNC_xxx
> +#undef XDP_METADATA_KFUNC
> +MAX_XDP_METADATA_KFUNC,
> +};
> +
> +#ifdef CONFIG_NET
> +u32 xdp_metadata_kfunc_id(int id);
> +#else
> +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> +#endif
> +
> +struct xdp_md;
> +bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx);
> +u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx);
> +bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx);
> +u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx);
> +


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  5:24                   ` Saeed Mahameed
@ 2022-12-09 12:59                     ` Jesper Dangaard Brouer
  2022-12-09 14:37                       ` Toke Høiland-Jørgensen
  2022-12-09 15:19                       ` Dave Taht
  0 siblings, 2 replies; 61+ messages in thread
From: Jesper Dangaard Brouer @ 2022-12-09 12:59 UTC (permalink / raw)
  To: Saeed Mahameed, Stanislav Fomichev
  Cc: brouer, Toke Høiland-Jørgensen, Alexei Starovoitov,
	bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development


On 09/12/2022 06.24, Saeed Mahameed wrote:
> On 08 Dec 18:57, Stanislav Fomichev wrote:
>> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen 
>> <toke@redhat.com> wrote:
>>>
>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>
>>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> >>
>>> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>> >>
>>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> >> >>
>>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>>> >> >>
>>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> >> >> >>
>>> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
>>> >> >> >>
>>> >> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>>> >> >> >> >
>>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>>> >> >> >> > XDP ctx to do this.
>>> >> >> >>
>>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
>>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
>>> >> >> >> work (it was working in an earlier version, but now
>>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
>>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
>>> >> >> >> hardware configured properly. I'll investigate some more, but figured
>>> >> >> >> I'd post these results now:
>>> >> >> >>
>>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>>> >> >> >>
>>> >> >> >> As per the above, this is with calling three kfuncs/pkt
>>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
>>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>>> >> >> >> definitely in that ballpark.
>>> >> >> >>
>>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
>>> >> >> >> buffer, so this is the smallest possible delta from just getting the
>>> >> >> >> data out of the driver. I did confirm that the call instructions are
>>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
>>> >> >> >> kernel.
>>> >> >> >>
>>> >> >> >> -Toke
>>> >> >> >>
>>> >> >> >
>>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
>>> >> >> > reference them in v4!
>>> >> >> > Presumably, we should be able to at least unroll most of the
>>> >> >> > _supported callbacks if we want, they should be relatively easy; but
>>> >> >> > the numbers look fine as is?
>>> >> >>
>>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
>>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
>>> >> >> another callback to get the type of hash (l3/l4). Those would probably
>>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
>>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>>> >> >> baseline of 39 ns.
>>> >> >>
>>> >> >> So in that sense I still think unrolling makes sense. At least for the
>>> >> >> _supported() calls, as eating a whole function call just for that is
>>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
>>> >> >> thread somewhere).
>>> >> >
>>> >> > imo the overhead is tiny enough that we can wait until
>>> >> > generic 'kfunc inlining' infra is ready.
>>> >> >
>>> >> > We're planning to dual-compile some_kernel_file.c
>>> >> > into native arch and into bpf arch.
>>> >> > Then the verifier will automatically inline bpf asm
>>> >> > of corresponding kfunc.
>>> >>
>>> >> Is that "planning" or "actively working on"? Just trying to get a sense
>>> >> of the time frames here, as this sounds neat, but also something that
>>> >> could potentially require quite a bit of fiddling with the build system
>>> >> to get to work? :)
>>> >
>>> > "planning", but regardless how long it takes I'd rather not
>>> > add any more tech debt in the form of manual bpf asm generation.
>>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
>>>
>>> Right, I'm no fan of the manual ASM stuff either. However, if we're
>>> stuck with the function call overhead for the foreseeable future, maybe
>>> we should think about other ways of cutting down the number of function
>>> calls needed?
>>>
>>> One thing I can think of is to get rid of the individual _supported()
>>> kfuncs and instead have a single one that lets you query multiple
>>> features at once, like:
>>>
>>> __u64 features_supported, features_wanted = XDP_META_RX_HASH | 
>>> XDP_META_TIMESTAMP;
>>>
>>> features_supported = bpf_xdp_metadata_query_features(ctx, 
>>> features_wanted);
>>>
>>> if (features_supported & XDP_META_RX_HASH)
>>>   hash = bpf_xdp_metadata_rx_hash(ctx);
>>>
>>> ...etc
>>
>> I'm not too happy about having the bitmasks tbh :-(
>> If we want to get rid of the cost of those _supported calls, maybe we
>> can do some kind of libbpf-like probing? That would require loading a
>> program + waiting for some packet though :-(
>>
>> Or maybe they can just be cached for now?
>>
>> if (unlikely(!got_first_packet)) {
>>  have_hash = bpf_xdp_metadata_rx_hash_supported();
>>  have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
>>  got_first_packet = true;
>> }
> 
> hash/timestamp/csum is per packet .. vlan as well, depending how you look
> at it..

True, we cannot cache this as it is *per packet* info.

> Sorry I haven't been following the progress of xdp metadata, but why did
> we drop the idea of BTF and the driver copying metadata in front of the
> xdp frame?
> 

It took me some time to understand this new approach, and why it makes
sense.  This is my understanding of the design direction change:

This approach gives more control to the XDP BPF-prog to pick and choose
which XDP hints are relevant for the specific use-case.  BPF-prog can
also skip storing hints anywhere and just read+react on value (that e.g.
comes from RX-desc).

For the redirect, AF_XDP, chained BPF-progs, XDP-to-TC and SKB-creation
use-cases, we *do* need to store hints somewhere, as the RX-desc will be
out of scope.  I think this patchset hand-waves and says the BPF-prog
can just manually store this in a prog-custom layout in the metadata
area.  I'm not super happy with ignoring/hand-waving all these
use-cases, but I hope/think we can later extend this with some more
structure to support these use-cases better (with this patchset as a
foundation).

I actually like this kfunc design, because the BPF-progs get an
intuitive API, and on the driver side we can hide the details of how to
extract the HW hints.


> hopefully future HW generations will do that for free ..

True.  I think it is worth repeating that the approach of storing HW
hints in the metadata area (in front of the packet data) was to allow
future HW generations to write this.  That would eliminate the 6 ns
(that I showed it costs), and then it would be up to the XDP BPF-prog to
pick and choose which hints to read, like this patchset already offers.

This patchset isn't incompatible with future HW generations doing this,
as the kfunc would hide the details and point to this area instead of
the RX-desc.  While we would get the "store for free" from hardware, I
do worry that reading this memory area (which will be part of the DMA
area) is going to be slower than reading from the RX-desc.

> if btf is the problem then each vendor can provide a bpf func(s) that would
> parse the metdata inside of the xdp/bpf prog domain to help programs
> extract the vendor specific data..
> 

In some sense, if unrolling becomes a thing, then this patchset is
partly doing this.

I did imagine that, as a follow-up on the XDP-hints with BTF patchset,
we would allow drivers to load a BPF-prog that changed/selected which HW
hints were relevant, to reduce the 6 ns overhead we introduced.

--Jesper


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09 12:59                     ` Jesper Dangaard Brouer
@ 2022-12-09 14:37                       ` Toke Høiland-Jørgensen
  2022-12-09 15:19                       ` Dave Taht
  1 sibling, 0 replies; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09 14:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed, Stanislav Fomichev
  Cc: brouer, Alexei Starovoitov, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, Network Development

Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

>> hash/timestap/csum is per packet .. vlan as well depending how you look at
>> it..
>
> True, we cannot cache this as it is *per packet* info.
>
>> Sorry I haven't been following the progress of xdp meta data, but why did
>> we drop the idea of btf and driver copying metdata in front of the xdp
>> frame ?
>> 
>
> It took me some time to understand this new approach, and why it makes
> sense.  This is my understanding of the design direction change:
>
> This approach gives more control to the XDP BPF-prog to pick and choose
> which XDP hints are relevant for the specific use-case.  BPF-prog can
> also skip storing hints anywhere and just read+react on value (that e.g.
> comes from RX-desc).
>
> For the use-cases redirect, AF_XDP, chained BPF-progs, XDP-to-TC,
> SKB-creation, we *do* need to store hints somewhere, as RX-desc will be
> out-of-scope.  I this patchset hand-waves and says BPF-prog can just
> manually store this in a prog custom layout in metadata area.  I'm not
> super happy with ignoring/hand-waving all these use-case, but I
> hope/think we later can extend this some more structure to support these
> use-cases better (with this patchset as a foundation).

I don't think this approach "hand-waves" the need to store the metadata,
it just declares it out of scope :)

Which makes sense, because "accessing the metadata" and "storing it for
later use" are two different problems, where the second one builds on
top of the first one. I.e., once we have a way to access the data, we
can build upon that to have a way to store it somewhere.

> I actually like this kfunc design, because the BPF-prog's get an
> intuitive API, and on driver side we can hide the details of howto
> extract the HW hints.

+1

>> hopefully future HW generations will do that for free ..
>
> True.  I think it is worth repeating, that the approach of storing HW
> hints in metadata area (in-front of packet data) was to allow future HW
> generations to write this.  Thus, eliminating the 6 ns (that I showed it
> cost), and then it would be up-to XDP BPF-prog to pick and choose which
> to read, like this patchset already offers.
>
> This patchset isn't incompatible with future HW generations doing this,
> as the kfunc would hide the details and point to this area instead of
> the RX-desc.  While we get the "store for free" from hardware, I do
> worry that reading this memory area (which will part of DMA area) is
> going to be slower than reading from RX-desc.

Agreed (choked on the "isn't incompatible" double negative at first). If
the hardware stores the data next to the packet data, the kfuncs can
just read them from there. If it turns out that we can even make the
layout for some fields the same across drivers, we could even have the
generic kfunc implementations just read this area (which also nicely
solves the "storage" problem).
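A plain-C sketch of that idea (the struct name, field set, and fixed
placement are all speculative, not from the patchset): if drivers agreed
on a common layout written right in front of the packet data, one
generic kfunc body could serve every conforming driver:

```c
#include <stdint.h>
#include <string.h>

/* speculative common layout, written by the driver (or future HW)
 * immediately in front of the packet data */
struct xdp_hints_common {
	uint32_t rx_hash;
	uint64_t rx_timestamp;
} __attribute__((packed));

/* what a driver's RX path (or the NIC via DMA) would do per packet */
static void driver_write_hints(uint8_t *pkt_data, uint32_t hash, uint64_t ts)
{
	struct xdp_hints_common h = { .rx_hash = hash, .rx_timestamp = ts };

	memcpy(pkt_data - sizeof(h), &h, sizeof(h));
}

/* driver-agnostic kfunc body: reads from the shared layout instead of
 * touching any driver-specific RX descriptor */
static uint32_t generic_xdp_rx_hash(const uint8_t *pkt_data)
{
	struct xdp_hints_common h;

	/* the hints sit at a fixed negative offset from the packet start */
	memcpy(&h, pkt_data - sizeof(h), sizeof(h));
	return h.rx_hash;
}
```

The same area doubling as the "storage" for redirect/skb use-cases is
what makes this layout attractive, per the discussion above.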

-Toke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09  2:57                 ` Stanislav Fomichev
  2022-12-09  5:24                   ` Saeed Mahameed
@ 2022-12-09 14:42                   ` Toke Høiland-Jørgensen
  2022-12-09 16:45                     ` Jakub Kicinski
  1 sibling, 1 reply; 61+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-09 14:42 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>> >>
>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >>
>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >>
>> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >> >>
>> >> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> >> >> >
>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> >> >> >> > XDP ctx to do this.
>> >> >> >>
>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
>> >> >> >> work (it was working in an earlier version, but now
>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
>> >> >> >> hardware configured properly. I'll investigate some more, but figured
>> >> >> >> I'd post these results now:
>> >> >> >>
>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>> >> >> >>
>> >> >> >> As per the above, this is with calling three kfuncs/pkt
>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> >> >> >> definitely in that ballpark.
>> >> >> >>
>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
>> >> >> >> buffer, so this is the smallest possible delta from just getting the
>> >> >> >> data out of the driver. I did confirm that the call instructions are
>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
>> >> >> >> kernel.
>> >> >> >>
>> >> >> >> -Toke
>> >> >> >>
>> >> >> >
>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
>> >> >> > reference them in v4!
>> >> >> > Presumably, we should be able to at least unroll most of the
>> >> >> > _supported callbacks if we want, they should be relatively easy; but
>> >> >> > the numbers look fine as is?
>> >> >>
>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
>> >> >> another callback to get the type of hash (l3/l4). Those would probably
>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>> >> >> baseline of 39 ns.
>> >> >>
>> >> >> So in that sense I still think unrolling makes sense. At least for the
>> >> >> _supported() calls, as eating a whole function call just for that is
>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
>> >> >> thread somewhere).
>> >> >
>> >> > imo the overhead is tiny enough that we can wait until
>> >> > generic 'kfunc inlining' infra is ready.
>> >> >
>> >> > We're planning to dual-compile some_kernel_file.c
>> >> > into native arch and into bpf arch.
>> >> > Then the verifier will automatically inline bpf asm
>> >> > of corresponding kfunc.
>> >>
>> >> Is that "planning" or "actively working on"? Just trying to get a sense
>> >> of the time frames here, as this sounds neat, but also something that
>> >> could potentially require quite a bit of fiddling with the build system
>> >> to get to work? :)
>> >
>> > "planning", but regardless how long it takes I'd rather not
>> > add any more tech debt in the form of manual bpf asm generation.
>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
>>
>> Right, I'm no fan of the manual ASM stuff either. However, if we're
>> stuck with the function call overhead for the foreseeable future, maybe
>> we should think about other ways of cutting down the number of function
>> calls needed?
>>
>> One thing I can think of is to get rid of the individual _supported()
>> kfuncs and instead have a single one that lets you query multiple
>> features at once, like:
>>
>> __u64 features_supported, features_wanted = XDP_META_RX_HASH | XDP_META_TIMESTAMP;
>>
>> features_supported = bpf_xdp_metadata_query_features(ctx, features_wanted);
>>
>> if (features_supported & XDP_META_RX_HASH)
>>   hash = bpf_xdp_metadata_rx_hash(ctx);
>>
>> ...etc
>
> I'm not too happy about having the bitmasks tbh :-(
> If we want to get rid of the cost of those _supported calls, maybe we
> can do some kind of libbpf-like probing? That would require loading a
> program + waiting for some packet though :-(

If we expect the program to do out of band probing, we could just get
rid of the _supported() functions entirely?

I mean, to me, the whole point of having the separate _supported()
function for each item was to have a lower-overhead way of checking if
the metadata item was supported. But if the overhead is not actually
lower (because both incur a function call), why have them at all? Then
we could just change the implementation from this:

bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	return _ctx->xdp.rxq->dev->features & NETIF_F_RXHASH;
}

u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	return be32_to_cpu(_ctx->cqe->rss_hash_result);
}

to this:

u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	if (!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH))
                return 0;

	return be32_to_cpu(_ctx->cqe->rss_hash_result);
}

-Toke


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09 12:59                     ` Jesper Dangaard Brouer
  2022-12-09 14:37                       ` Toke Høiland-Jørgensen
@ 2022-12-09 15:19                       ` Dave Taht
  1 sibling, 0 replies; 61+ messages in thread
From: Dave Taht @ 2022-12-09 15:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Stanislav Fomichev, brouer,
	Toke Høiland-Jørgensen, Alexei Starovoitov, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

On Fri, Dec 9, 2022 at 5:29 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 09/12/2022 06.24, Saeed Mahameed wrote:
> > On 08 Dec 18:57, Stanislav Fomichev wrote:
> >> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen
> >> <toke@redhat.com> wrote:
> >>>
> >>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >>>
> >>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>> >>
> >>> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >>> >>
> >>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>> >> >>
> >>> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >>> >> >>
> >>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>> >> >> >>
> >>> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >>> >> >> >>
> >>> >> >> >> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> >> >> >> >
> >>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >>> >> >> >> > XDP ctx to do this.
> >>> >> >> >>
> >>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
> >>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >>> >> >> >> work (it was working in an earlier version, but now
> >>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
> >>> >> >> >> hardware configured properly. I'll investigate some more, but figured
> >>> >> >> >> I'd post these results now:
> >>> >> >> >>
> >>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >>> >> >> >>
> >>> >> >> >> As per the above, this is with calling three kfuncs/pkt
> >>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >>> >> >> >> definitely in that ballpark.
> >>> >> >> >>
> >>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >>> >> >> >> buffer, so this is the smallest possible delta from just getting the
> >>> >> >> >> data out of the driver. I did confirm that the call instructions are
> >>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
> >>> >> >> >> kernel.
> >>> >> >> >>
> >>> >> >> >> -Toke
> >>> >> >> >>
> >>> >> >> >
> >>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >>> >> >> > reference them in v4!
> >>> >> >> > Presumably, we should be able to at least unroll most of the
> >>> >> >> > _supported callbacks if we want, they should be relatively easy; but
> >>> >> >> > the numbers look fine as is?
> >>> >> >>
> >>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >>> >> >> another callback to get the type of hash (l3/l4). Those would probably
> >>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
> >>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >>> >> >> baseline of 39 ns.
> >>> >> >>
> >>> >> >> So in that sense I still think unrolling makes sense. At least for the
> >>> >> >> _supported() calls, as eating a whole function call just for that is
> >>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
> >>> >> >> thread somewhere).
> >>> >> >
> >>> >> > imo the overhead is tiny enough that we can wait until
> >>> >> > generic 'kfunc inlining' infra is ready.
> >>> >> >
> >>> >> > We're planning to dual-compile some_kernel_file.c
> >>> >> > into native arch and into bpf arch.
> >>> >> > Then the verifier will automatically inline bpf asm
> >>> >> > of corresponding kfunc.
> >>> >>
> >>> >> Is that "planning" or "actively working on"? Just trying to get a sense
> >>> >> of the time frames here, as this sounds neat, but also something that
> >>> >> could potentially require quite a bit of fiddling with the build system
> >>> >> to get to work? :)
> >>> >
> >>> > "planning", but regardless how long it takes I'd rather not
> >>> > add any more tech debt in the form of manual bpf asm generation.
> >>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
> >>>
> >>> Right, I'm no fan of the manual ASM stuff either. However, if we're
> >>> stuck with the function call overhead for the foreseeable future, maybe
> >>> we should think about other ways of cutting down the number of function
> >>> calls needed?
> >>>
> >>> One thing I can think of is to get rid of the individual _supported()
> >>> kfuncs and instead have a single one that lets you query multiple
> >>> features at once, like:
> >>>
> >>> __u64 features_supported, features_wanted = XDP_META_RX_HASH |
> >>> XDP_META_TIMESTAMP;
> >>>
> >>> features_supported = bpf_xdp_metadata_query_features(ctx,
> >>> features_wanted);
> >>>
> >>> if (features_supported & XDP_META_RX_HASH)
> >>>   hash = bpf_xdp_metadata_rx_hash(ctx);
> >>>
> >>> ...etc
> >>
> >> I'm not too happy about having the bitmasks tbh :-(
> >> If we want to get rid of the cost of those _supported calls, maybe we
> >> can do some kind of libbpf-like probing? That would require loading a
> >> program + waiting for some packet though :-(
> >>
> >> Or maybe they can just be cached for now?
> >>
> >> if (unlikely(!got_first_packet)) {
> >>  have_hash = bpf_xdp_metadata_rx_hash_supported();
> >>  have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
> >>  got_first_packet = true;
> >> }
> >
> > hash/timestap/csum is per packet .. vlan as well depending how you look at
> > it..
>
> True, we cannot cache this as it is *per packet* info.
>
> > Sorry I haven't been following the progress of xdp meta data, but why did
> > we drop the idea of btf and driver copying metdata in front of the xdp
> > frame ?
> >
>
> It took me some time to understand this new approach, and why it makes
> sense.  This is my understanding of the design direction change:
>
> This approach gives more control to the XDP BPF-prog to pick and choose
> which XDP hints are relevant for the specific use-case.  BPF-prog can
> also skip storing hints anywhere and just read+react on value (that e.g.
> comes from RX-desc).
>
> For the use-cases redirect, AF_XDP, chained BPF-progs, XDP-to-TC,
> SKB-creation, we *do* need to store hints somewhere, as RX-desc will be
> out-of-scope.  I this patchset hand-waves and says BPF-prog can just
> manually store this in a prog custom layout in metadata area.  I'm not
> super happy with ignoring/hand-waving all these use-case, but I
> hope/think we later can extend this some more structure to support these
> use-cases better (with this patchset as a foundation).
>
> I actually like this kfunc design, because the BPF-prog's get an
> intuitive API, and on driver side we can hide the details of howto
> extract the HW hints.
>
>
> > hopefully future HW generations will do that for free ..
>
> True.  I think it is worth repeating, that the approach of storing HW
> hints in metadata area (in-front of packet data) was to allow future HW
> generations to write this.  Thus, eliminating the 6 ns (that I showed it
> cost), and then it would be up-to XDP BPF-prog to pick and choose which
> to read, like this patchset already offers.

As a hope for future generations of hw, being able to choose a cpu to
interrupt from an LPM table would be great. I keep hoping to find a card
that can do this already...

Also I would like to thank everyone working on this project so far for
what you've accomplished. We're now pushing 20Gbit (through a vlan even)
through libreqos.io for thousands of ISP subscribers using all this
great stuff, on 16 cores at only 24% CPU through CAKE, and also
successfully monitoring TCP RTTs at this scale via ebpf pping.

( https://www.yahoo.com/now/libreqoe-releases-version-1-3-214700756.html )
"Our hat is off to the creators of CAKE and the new Linux XDP and eBPF
subsystems!"

In our case, the timestamp and *3* hashes are needed for CAKE, and
interrupting the right cpu would be great...

>
> This patchset isn't incompatible with future HW generations doing this,
> as the kfunc would hide the details and point to this area instead of
> the RX-desc.  While we get the "store for free" from hardware, I do
> worry that reading this memory area (which will part of DMA area) is
> going to be slower than reading from RX-desc.
>
> > if btf is the problem then each vendor can provide a bpf func(s) that would
> > parse the metdata inside of the xdp/bpf prog domain to help programs
> > extract the vendor specific data..
> >
>
> In some sense, if unroll will becomes a thing, then this patchset is
> partly doing this.
>
> I did imagine that after/followup on XDP-hints with BTF patchset, we
> would allow drivers to load an BPF-prog that changed/selected which HW
> hints were relevant, to reduce those 6 ns overhead we introduced.
>
> --Jesper
>


-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09 14:42                   ` Toke Høiland-Jørgensen
@ 2022-12-09 16:45                     ` Jakub Kicinski
  2022-12-09 17:46                       ` Stanislav Fomichev
  0 siblings, 1 reply; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-09 16:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Alexei Starovoitov, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Saeed Mahameed, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Fri, 09 Dec 2022 15:42:37 +0100 Toke Høiland-Jørgensen wrote:
> If we expect the program to do out of band probing, we could just get
> rid of the _supported() functions entirely?
> 
> I mean, to me, the whole point of having the separate _supported()
> function for each item was to have a lower-overhead way of checking if
> the metadata item was supported. But if the overhead is not actually
> lower (because both incur a function call), why have them at all? Then
> we could just change the implementation from this:
> 
> bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx)
> {
> 	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> 
> 	return _ctx->xdp.rxq->dev->features & NETIF_F_RXHASH;
> }
> 
> u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
> {
> 	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> 
> 	return be32_to_cpu(_ctx->cqe->rss_hash_result);
> }
> 
> to this:
> 
> u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
> {
> 	const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> 
> 	if (!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH))
>                 return 0;
> 
> 	return be32_to_cpu(_ctx->cqe->rss_hash_result);
> }

Are there no corner cases? E.g. in case of an L2 frame you'd then
expect a hash of 0? Rather than no hash? 

If I understand we went for the _supported() thing to make inlining 
the check easier than inlining the actual read of the field.
But we're told inlining is a bit of a wait.. so isn't the motivation
for the _supported() pretty much gone? And should we go back to
returning an error from the actual read?

Is partial inlining hard? (inline just the check and generate a full
call for the read, ending up with the same code as with _supported())
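A plain-C illustration of that split (all names and the flag value are
stand-ins, not kernel code): a `static inline` check in front of an
out-of-line read is functionally what partial inlining would produce:

```c
#include <stdint.h>

#define NETIF_F_RXHASH (1u << 0)   /* illustrative flag value, not the real one */

struct fake_ctx {                  /* stand-in for the driver's xdp ctx */
	uint32_t dev_features;
	uint32_t rss_hash_result;
};

/* the out-of-line part: a real call, only reached when the feature exists */
static uint32_t rx_hash_read(const struct fake_ctx *ctx)
{
	return ctx->rss_hash_result;
}

/* the part the verifier would inline at the call site: just the check */
static inline int rx_hash(const struct fake_ctx *ctx, uint32_t *hash)
{
	if (!(ctx->dev_features & NETIF_F_RXHASH))
		return -95;                /* -EOPNOTSUPP */
	*hash = rx_hash_read(ctx);
	return 0;
}
```

The fast path for unsupported metadata then costs only the inlined
branch, matching what the separate _supported() call was meant to buy.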

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09 16:45                     ` Jakub Kicinski
@ 2022-12-09 17:46                       ` Stanislav Fomichev
  2022-12-09 22:13                         ` Jakub Kicinski
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-09 17:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed, David Ahern,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

On Fri, Dec 9, 2022 at 8:45 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 09 Dec 2022 15:42:37 +0100 Toke Høiland-Jørgensen wrote:
> > If we expect the program to do out of band probing, we could just get
> > rid of the _supported() functions entirely?
> >
> > I mean, to me, the whole point of having the separate _supported()
> > function for each item was to have a lower-overhead way of checking if
> > the metadata item was supported. But if the overhead is not actually
> > lower (because both incur a function call), why have them at all? Then
> > we could just change the implementation from this:
> >
> > bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx)
> > {
> >       const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> >
> >       return _ctx->xdp.rxq->dev->features & NETIF_F_RXHASH;
> > }
> >
> > u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
> > {
> >       const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> >
> >       return be32_to_cpu(_ctx->cqe->rss_hash_result);
> > }
> >
> > to this:
> >
> > u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
> > {
> >       const struct mlx5_xdp_buff *_ctx = (void *)ctx;
> >
> >       if (!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH))
> >                 return 0;
> >
> >       return be32_to_cpu(_ctx->cqe->rss_hash_result);
> > }
>
> Are there no corner cases? E.g. in case of an L2 frame you'd then
> expect a hash of 0? Rather than no hash?
>
> If I understand we went for the _supported() thing to make inlining
> the check easier than inlining the actual read of the field.
> But we're told inlining is a bit of a wait.. so isn't the motivation
> for the _supported() pretty much gone? And we should we go back to
> returning an error from the actual read?

Seems fair, we can always bring those _supported() calls back in the
future when the inlining is available and having those separate calls
shows clear benefit.
Then let's go back to a more conventional form below?

int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	if (!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH))
		return -EOPNOTSUPP;

	*hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
	return 0;
}

> Is partial inlining hard? (inline just the check and generate a full
> call for the read, ending up with the same code as with _supported())

I'm assuming you're suggesting to do this partial inlining manually
(as in, writing the code to output this bytecode)?
This probably also falls into the "manual bpf asm generation tech debt" bucket?
LMK if I missed your point.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-09 11:10   ` Jesper Dangaard Brouer
@ 2022-12-09 17:47     ` Stanislav Fomichev
  2022-12-11 11:09       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 61+ messages in thread
From: Stanislav Fomichev @ 2022-12-09 17:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Fri, Dec 9, 2022 at 3:11 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 06/12/2022 03.45, Stanislav Fomichev wrote:
> > There is an ndo handler per kfunc, the verifier replaces a call to the
> > generic kfunc with a call to the per-device one.
> >
> > For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> > implements all possible metatada kfuncs. Not all devices have to
> > implement them. If kfunc is not supported by the target device,
> > the default implementation is called instead.
> >
> > Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> > we treat prog_index as target device for kfunc resolution.
> >
>
> [...cut...]
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5aa35c58c342..2eabb9157767 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
> >   struct udp_tunnel_nic;
> >   struct bpf_prog;
> >   struct xdp_buff;
> > +struct xdp_md;
> >
> >   void synchronize_net(void);
> >   void netdev_set_default_ethtool_ops(struct net_device *dev,
> > @@ -1611,6 +1612,10 @@ struct net_device_ops {
> >       ktime_t                 (*ndo_get_tstamp)(struct net_device *dev,
> >                                                 const struct skb_shared_hwtstamps *hwtstamps,
> >                                                 bool cycles);
> > +     bool                    (*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
> > +     u64                     (*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
> > +     bool                    (*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
> > +     u32                     (*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
> >   };
> >
>
> Would it make sense to add a 'flags' parameter to ndo_xdp_rx_timestamp
> and ndo_xdp_rx_hash ?
>
> E.g. we could have a "STORE" flag that asks the kernel to store this
> information for later. This will be helpful for both the SKB and
> redirect use-cases.
> For redirect e.g into a veth, then BPF-prog can use the same function
> bpf_xdp_metadata_rx_hash() to receive the RX-hash, as it can obtain the
> "stored" value (from the BPF-prog that did the redirect).
>
> (p.s. Hopefully a const 'flags' variable can be optimized when unrolling
> to eliminate store instructions when flags==0)

Are we concerned that doing this without a flag and with another
function call will be expensive?
For the xdp->skb path, I was hoping we would be able to do something like:

hash = bpf_xdp_metadata_rx_hash(ctx);
bpf_xdp_metadata_export_rx_hash_to_skb(ctx, hash);

This should also let the users adjust the metadata before storing it.
Am I missing something here? Why would the flag be preferable?
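A plain-C simulation of that read-then-export idea (the export helper
and the conversion-context struct are hypothetical; nothing like them
exists in the patchset yet):

```c
#include <stdint.h>

/* stand-in for the per-packet state the kernel would keep while
 * converting an xdp_buff into an skb; names are made up */
struct fake_conv_ctx {
	uint32_t cqe_hash;     /* what the RX descriptor holds */
	uint32_t skb_hash;     /* what the later-built skb would get */
	int      skb_hash_set;
};

/* the read kfunc, as in the patchset */
static uint32_t rx_hash(const struct fake_conv_ctx *ctx)
{
	return ctx->cqe_hash;
}

/* hypothetical export kfunc: the prog hands back a (possibly adjusted)
 * value for the kernel to use when it builds the skb */
static void export_rx_hash_to_skb(struct fake_conv_ctx *ctx, uint32_t hash)
{
	ctx->skb_hash = hash;
	ctx->skb_hash_set = 1;
}
```

Because the program sits between the read and the export, it can rewrite
the value in flight, which is the "adjust before storing" property the
flag-based variant would not give for free.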


> >   /**
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 55dbc68bfffc..c24aba5c363b 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -409,4 +409,33 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >
> >   #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
> >
> > +#define XDP_METADATA_KFUNC_xxx       \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> > +                        bpf_xdp_metadata_rx_timestamp_supported) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> > +                        bpf_xdp_metadata_rx_timestamp) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH_SUPPORTED, \
> > +                        bpf_xdp_metadata_rx_hash_supported) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
> > +                        bpf_xdp_metadata_rx_hash) \
> > +
> > +enum {
> > +#define XDP_METADATA_KFUNC(name, str) name,
> > +XDP_METADATA_KFUNC_xxx
> > +#undef XDP_METADATA_KFUNC
> > +MAX_XDP_METADATA_KFUNC,
> > +};
> > +
> > +#ifdef CONFIG_NET
> > +u32 xdp_metadata_kfunc_id(int id);
> > +#else
> > +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> > +#endif
> > +
> > +struct xdp_md;
> > +bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx);
> > +u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx);
> > +bool bpf_xdp_metadata_rx_hash_supported(const struct xdp_md *ctx);
> > +u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx);
> > +
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
  2022-12-09 17:46                       ` Stanislav Fomichev
@ 2022-12-09 22:13                         ` Jakub Kicinski
  0 siblings, 0 replies; 61+ messages in thread
From: Jakub Kicinski @ 2022-12-09 22:13 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Saeed Mahameed, David Ahern,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

On Fri, 9 Dec 2022 09:46:20 -0800 Stanislav Fomichev wrote:
> > Is partial inlining hard? (inline just the check and generate a full
> > call for the read, ending up with the same code as with _supported())  
> 
> I'm assuming you're suggesting to do this partial inlining manually
> (as in, writing the code to output this bytecode)?
> This probably also falls into the "manual bpf asm generation tech debt" bucket?
> LMK if I missed your point.

Maybe just ignore that, I'm not sure how the unrolling of 
the _supported() calls was expected to work in the first place.


* Re: [xdp-hints] Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-09  2:57         ` Stanislav Fomichev
@ 2022-12-10  0:42           ` Martin KaFai Lau
  2022-12-10  1:12             ` Martin KaFai Lau
  0 siblings, 1 reply; 61+ messages in thread
From: Martin KaFai Lau @ 2022-12-10  0:42 UTC (permalink / raw)
  To: Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On 12/8/22 6:57 PM, Stanislav Fomichev wrote:
> On Thu, Dec 8, 2022 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>>>> Another UX thing I ran into is that libbpf will bail out if it can't
>>>> find the kfunc in the kernel vmlinux, even if the code calling the
>>>> function is behind an always-false if statement (which would be
>>>> eliminated as dead code from the verifier). This makes it a bit hard to
>>>> conditionally use them. Should libbpf just allow the load without
>>>> performing the relocation (and let the verifier worry about it), or
>>>> should we have a bpf_core_kfunc_exists() macro to use for checking?
>>>> Maybe both?
>>>
>>> I'm not sure how libbpf can allow the load without performing the
>>> relocation; maybe I'm missing something.
>>> IIUC, libbpf uses the kfunc name (from the relocation?) and replaces
>>> it with the kfunc id, right?
>>
>> Yeah, so if it can't find the kfunc in vmlinux, just write an id of 0.
>> This will trip the check at the top of fixup_kfunc_call() in the
>> verifier, but if the code is hidden behind an always-false branch (an
>> rodata variable set to zero, say) the instructions should get eliminated
>> before they reach that point. That way you can at least turn it off at
>> runtime (after having done some kind of feature detection) without
>> having to compile it out of your program entirely.
>>
>>> Having bpf_core_kfunc_exists would help, but this probably needs
>>> compiler work first to preserve some of the kfunc traces in vmlinux.h?

hmm.... if I follow correctly, the ask is for libbpf to accept a bpf prog
using a kfunc that does not exist in the running kernel?

Have you tried "__weak":

extern void dummy_kfunc(void) __ksym __weak;

SEC("tc")
int load(struct __sk_buff *skb)
{
	if (dummy_kfunc) {
		dummy_kfunc();
		return TC_ACT_SHOT;
	}
	return TC_ACT_UNSPEC;
}



* Re: [xdp-hints] Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-10  0:42           ` Martin KaFai Lau
@ 2022-12-10  1:12             ` Martin KaFai Lau
  0 siblings, 0 replies; 61+ messages in thread
From: Martin KaFai Lau @ 2022-12-10  1:12 UTC (permalink / raw)
  To: Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On 12/9/22 4:42 PM, Martin KaFai Lau wrote:
> On 12/8/22 6:57 PM, Stanislav Fomichev wrote:
>> On Thu, Dec 8, 2022 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> Stanislav Fomichev <sdf@google.com> writes:
>>>
>>>>> Another UX thing I ran into is that libbpf will bail out if it can't
>>>>> find the kfunc in the kernel vmlinux, even if the code calling the
>>>>> function is behind an always-false if statement (which would be
>>>>> eliminated as dead code from the verifier). This makes it a bit hard to
>>>>> conditionally use them. Should libbpf just allow the load without
>>>>> performing the relocation (and let the verifier worry about it), or
>>>>> should we have a bpf_core_kfunc_exists() macro to use for checking?
>>>>> Maybe both?
>>>>
>>>> I'm not sure how libbpf can allow the load without performing the
>>>> relocation; maybe I'm missing something.
>>>> IIUC, libbpf uses the kfunc name (from the relocation?) and replaces
>>>> it with the kfunc id, right?
>>>
>>> Yeah, so if it can't find the kfunc in vmlinux, just write an id of 0.
>>> This will trip the check at the top of fixup_kfunc_call() in the
>>> verifier, but if the code is hidden behind an always-false branch (an
>>> rodata variable set to zero, say) the instructions should get eliminated
>>> before they reach that point. That way you can at least turn it off at
>>> runtime (after having done some kind of feature detection) without
>>> having to compile it out of your program entirely.
>>>
>>>> Having bpf_core_kfunc_exists would help, but this probably needs
>>>> compiler work first to preserve some of the kfunc traces in vmlinux.h?
> 
> hmm.... if I follow correctly, the ask is for libbpf to accept a bpf prog
> using a kfunc that does not exist in the running kernel?
> 
> Have you tried "__weak":
> 
> extern void dummy_kfunc(void) __ksym __weak;
> 
> SEC("tc")
> int load(struct __sk_buff *skb)
> {
>      if (dummy_kfunc) {

Sadly, that won't work: only VAR ksyms are supported for the address
load (ld), not functions.

>          dummy_kfunc();
>          return TC_ACT_SHOT;
>      }
>      return TC_ACT_UNSPEC;
> }
> 



* Re: [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs
  2022-12-09 17:47     ` Stanislav Fomichev
@ 2022-12-11 11:09       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 61+ messages in thread
From: Jesper Dangaard Brouer @ 2022-12-11 11:09 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 09/12/2022 18.47, Stanislav Fomichev wrote:
> On Fri, Dec 9, 2022 at 3:11 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 06/12/2022 03.45, Stanislav Fomichev wrote:
>>> There is an ndo handler per kfunc, the verifier replaces a call to the
>>> generic kfunc with a call to the per-device one.
>>>
>>> For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
>>> implements all possible metatada kfuncs. Not all devices have to
>>> implement them. If kfunc is not supported by the target device,
>>> the default implementation is called instead.
>>>
>>> Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
>>> we treat prog_index as target device for kfunc resolution.
>>>
>>
>> [...cut...]
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5aa35c58c342..2eabb9157767 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
>>>    struct udp_tunnel_nic;
>>>    struct bpf_prog;
>>>    struct xdp_buff;
>>> +struct xdp_md;
>>>
>>>    void synchronize_net(void);
>>>    void netdev_set_default_ethtool_ops(struct net_device *dev,
>>> @@ -1611,6 +1612,10 @@ struct net_device_ops {
>>>        ktime_t                 (*ndo_get_tstamp)(struct net_device *dev,
>>>                                                  const struct skb_shared_hwtstamps *hwtstamps,
>>>                                                  bool cycles);
>>> +     bool                    (*ndo_xdp_rx_timestamp_supported)(const struct xdp_md *ctx);
>>> +     u64                     (*ndo_xdp_rx_timestamp)(const struct xdp_md *ctx);
>>> +     bool                    (*ndo_xdp_rx_hash_supported)(const struct xdp_md *ctx);
>>> +     u32                     (*ndo_xdp_rx_hash)(const struct xdp_md *ctx);
>>>    };
>>>
>>
>> Would it make sense to add a 'flags' parameter to ndo_xdp_rx_timestamp
>> and ndo_xdp_rx_hash ?
>>
>> E.g. we could have a "STORE" flag that asks the kernel to store this
>> information for later. This will be helpful for both the SKB and
>> redirect use-cases.
>> For redirect e.g into a veth, then BPF-prog can use the same function
>> bpf_xdp_metadata_rx_hash() to receive the RX-hash, as it can obtain the
>> "stored" value (from the BPF-prog that did the redirect).
>>
>> (p.s. Hopefully a const 'flags' variable can be optimized when unrolling
>> to eliminate store instructions when flags==0)
> 
> Are we concerned that doing this without a flag and with another
> function call will be expensive?

Yes, but if we can unroll (to avoid the function calls), the explicit
API below would be the more flexible option.

> For the xdp->skb path, I was hoping we would be able to do something like:
> 
> hash = bpf_xdp_metadata_rx_hash(ctx);
> bpf_xdp_metadata_export_rx_hash_to_skb(ctx, hash);
> 
> This should also let the users adjust the metadata before storing it.
> Am I missing something here? Why would the flag be preferable?

I do like this ability to let users adjust the metadata before storing
it; that makes for a more flexible API for the BPF programmer, and I
like your "export" suggestion.  My main concern was the performance
overhead of the extra function call, which I guess can be removed via
unrolling later.
Unrolling these 'export' functions might also be easier to accept from
a maintainer perspective: since they are not device-driver specific, we
can place them in the core BPF code.

--Jesper



end of thread, other threads:[~2022-12-11 11:10 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
2022-12-06  2:45 [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 01/12] bpf: Document XDP RX metadata Stanislav Fomichev
2022-12-08  4:25   ` Jakub Kicinski
2022-12-08 19:06     ` Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 02/12] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
2022-12-08  4:26   ` Jakub Kicinski
2022-12-06  2:45 ` [PATCH bpf-next v3 03/12] bpf: XDP metadata RX kfuncs Stanislav Fomichev
2022-12-07  4:29   ` Alexei Starovoitov
2022-12-07  4:52     ` Stanislav Fomichev
2022-12-07  7:23       ` Martin KaFai Lau
2022-12-07 18:05         ` Stanislav Fomichev
2022-12-08  2:47   ` Martin KaFai Lau
2022-12-08 19:07     ` Stanislav Fomichev
2022-12-08 22:53       ` Martin KaFai Lau
2022-12-08 23:45         ` Stanislav Fomichev
2022-12-08  5:00   ` Jakub Kicinski
2022-12-08 19:07     ` Stanislav Fomichev
2022-12-09  1:30       ` Jakub Kicinski
2022-12-09  2:57         ` Stanislav Fomichev
2022-12-08 22:39   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-12-08 23:46     ` Stanislav Fomichev
2022-12-09  0:07       ` [xdp-hints] " Toke Høiland-Jørgensen
2022-12-09  2:57         ` Stanislav Fomichev
2022-12-10  0:42           ` Martin KaFai Lau
2022-12-10  1:12             ` Martin KaFai Lau
2022-12-09 11:10   ` Jesper Dangaard Brouer
2022-12-09 17:47     ` Stanislav Fomichev
2022-12-11 11:09       ` Jesper Dangaard Brouer
2022-12-06  2:45 ` [PATCH bpf-next v3 04/12] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 05/12] veth: Support RX XDP metadata Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 06/12] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 07/12] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-12-08  6:11   ` Tariq Toukan
2022-12-08 19:07     ` Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 08/12] mxl4: Support RX XDP metadata Stanislav Fomichev
2022-12-08  6:09   ` Tariq Toukan
2022-12-08 19:07     ` Stanislav Fomichev
2022-12-08 20:23       ` Tariq Toukan
2022-12-06  2:45 ` [PATCH bpf-next v3 09/12] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 10/12] mlx5: Introduce mlx5_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-12-06  2:45 ` [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata Stanislav Fomichev
2022-12-08 22:59   ` Toke Høiland-Jørgensen
2022-12-08 23:45     ` Stanislav Fomichev
2022-12-09  0:02       ` [xdp-hints] " Toke Høiland-Jørgensen
2022-12-09  0:07         ` Alexei Starovoitov
2022-12-09  0:29           ` Toke Høiland-Jørgensen
2022-12-09  0:32             ` Alexei Starovoitov
2022-12-09  0:53               ` Toke Høiland-Jørgensen
2022-12-09  2:57                 ` Stanislav Fomichev
2022-12-09  5:24                   ` Saeed Mahameed
2022-12-09 12:59                     ` Jesper Dangaard Brouer
2022-12-09 14:37                       ` Toke Høiland-Jørgensen
2022-12-09 15:19                       ` Dave Taht
2022-12-09 14:42                   ` Toke Høiland-Jørgensen
2022-12-09 16:45                     ` Jakub Kicinski
2022-12-09 17:46                       ` Stanislav Fomichev
2022-12-09 22:13                         ` Jakub Kicinski
2022-12-06  2:45 ` [PATCH bpf-next v3 12/12] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
2022-12-08 22:28 ` [xdp-hints] [PATCH bpf-next v3 00/12] xdp: hints via kfuncs Toke Høiland-Jørgensen
2022-12-08 23:47   ` Stanislav Fomichev
2022-12-09  0:14     ` [xdp-hints] " Toke Høiland-Jørgensen
