* [PATCH bpf-next 00/11] xdp: hints via kfuncs
@ 2022-11-15  3:01 Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 01/11] bpf: Document XDP RX metadata Stanislav Fomichev
                   ` (11 more replies)
  0 siblings, 12 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:01 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Please see the first patch in the series for the overall
design and use-cases.

Changes since last RFC:

- drop ice/bnxt example implementation (Alexander)

  -ENOHARDWARE to test

- fix/test mlx4 implementation

  Confirmed that I get reasonable looking timestamp.
  The last patch in the series is the small xsk program that can
  be used to dump incoming metadata.

- bpf_push64/bpf_pop64 (Alexei)

  x86_64+arm64(untested)+disassembler

- struct xdp_to_skb_metadata -> struct xdp_skb_metadata (Toke)

  s/xdp_to_skb/xdp_skb/

- Documentation/bpf/xdp-rx-metadata.rst

  Documents functionality, assumptions and limitations.

- bpf_xdp_metadata_export_to_skb returns true/false (Martin)

  Plus xdp_md->skb_metadata field to access it.

- BPF_F_XDP_HAS_METADATA flag (Toke/Martin)

  Drop magic, use the flag instead.

- drop __randomize_layout

  Not sure it's possible to sanely expose it via UAPI. Because every
  .o potentially gets its own randomized layout, test_progs
  refuses to link.

- remove __net_timestamp in veth driver (John/Jesper)

  Instead, calling ktime_get from the kfunc; enough for the selftests.

Future work on RX side:

- Support more devices besides veth and mlx4
- Support more metadata besides RX timestamp.
- Convert skb_metadata_set() callers to xdp_convert_skb_metadata()
  which handles extra xdp_skb_metadata

Prior art (to record pros/cons for different approaches):

- Stable UAPI approach:
  https://lore.kernel.org/bpf/20220628194812.1453059-1-alexandr.lobakin@intel.com/
- Metadata+BTF_ID approach:
  https://lore.kernel.org/bpf/166256538687.1434226.15760041133601409770.stgit@firesoul/
- kfuncs v2 RFC:
  https://lore.kernel.org/bpf/20221104032532.1615099-1-sdf@google.com/
- kfuncs v1 RFC:
  https://lore.kernel.org/bpf/20221027200019.4106375-1-sdf@google.com/

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (11):
  bpf: Document XDP RX metadata
  bpf: Introduce bpf_patch
  bpf: Support inlined/unrolled kfuncs for xdp metadata
  bpf: Implement hidden BPF_PUSH64 and BPF_POP64 instructions
  veth: Support rx timestamp metadata for xdp
  xdp: Carry over xdp metadata into skb context
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  selftests/bpf: Verify xdp_metadata xdp->skb path
  mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  mlx4: Support rx timestamp metadata for xdp
  selftests/bpf: Simple program to dump XDP RX metadata

 Documentation/bpf/kfuncs.rst                  |   8 +
 Documentation/bpf/xdp-rx-metadata.rst         | 109 +++++
 arch/arm64/net/bpf_jit_comp.c                 |   8 +
 arch/x86/net/bpf_jit_comp.c                   |   8 +
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |   2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  68 ++-
 drivers/net/veth.c                            |  22 +-
 include/linux/bpf.h                           |   1 +
 include/linux/bpf_patch.h                     |  29 ++
 include/linux/btf.h                           |   1 +
 include/linux/btf_ids.h                       |   4 +
 include/linux/filter.h                        |  23 +
 include/linux/mlx4/device.h                   |   7 +
 include/linux/netdevice.h                     |   5 +
 include/linux/skbuff.h                        |   4 +
 include/net/xdp.h                             |  41 ++
 include/uapi/linux/bpf.h                      |  12 +
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_patch.c                        |  77 +++
 kernel/bpf/disasm.c                           |   6 +
 kernel/bpf/syscall.c                          |  28 +-
 kernel/bpf/verifier.c                         |  80 ++++
 net/core/dev.c                                |   7 +
 net/core/filter.c                             |  40 ++
 net/core/skbuff.c                             |  20 +
 net/core/xdp.c                                | 184 +++++++-
 tools/include/uapi/linux/bpf.h                |  12 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 440 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  99 ++++
 .../selftests/bpf/progs/xdp_metadata.c        | 114 +++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 404 ++++++++++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.h |   6 +
 35 files changed, 1856 insertions(+), 25 deletions(-)
 create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
 create mode 100644 include/linux/bpf_patch.h
 create mode 100644 kernel/bpf/bpf_patch.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.h

-- 
2.38.1.431.g37b22c650d-goog
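
For illustration only (not part of the series): a minimal sketch of an XDP
program consuming the kfuncs proposed here, reading the RX timestamp and
prepending it as custom metadata before redirecting into an XSK. The kfunc
names match patch 3; the XSK map and the metadata layout are made up for
the example.

  /* SPDX-License-Identifier: GPL-2.0 */
  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  /* kfuncs from this series, resolved against kernel BTF at load time */
  extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
  extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

  /* illustrative layout, shared by convention with the AF_XDP consumer */
  struct meta {
  	__u64 rx_timestamp;
  };

  struct {
  	__uint(type, BPF_MAP_TYPE_XSKMAP);
  	__uint(max_entries, 64);
  	__type(key, __u32);
  	__type(value, __u32);
  } xsk SEC(".maps");

  SEC("xdp")
  int rx(struct xdp_md *ctx)
  {
  	struct meta *meta;
  	void *data;

  	/* reserve space for custom metadata right in front of the data */
  	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
  		return XDP_PASS;

  	data = (void *)(long)ctx->data;
  	meta = (void *)(long)ctx->data_meta;
  	if ((void *)(meta + 1) > data)
  		return XDP_PASS;

  	if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
  		meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
  	else
  		meta->rx_timestamp = 0;

  	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";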



* [PATCH bpf-next 01/11] bpf: Document XDP RX metadata
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15 22:31   ` Zvi Effron
  2022-11-15  3:02 ` [PATCH bpf-next 02/11] bpf: Introduce bpf_patch Stanislav Fomichev
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa

Document all current use-cases and assumptions.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/xdp-rx-metadata.rst | 109 ++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/bpf/xdp-rx-metadata.rst

diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..5ddaaab8de31
--- /dev/null
+++ b/Documentation/bpf/xdp-rx-metadata.rst
@@ -0,0 +1,109 @@
+===============
+XDP RX Metadata
+===============
+
+XDP programs support creating and passing custom metadata via
+``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
+entities:
+
+1. ``AF_XDP`` consumer.
+2. Kernel core stack via ``XDP_PASS``.
+3. Another device via ``bpf_redirect_map``.
+
+General Design
+==============
+
+XDP programs have access to a set of kfuncs to manipulate the metadata. Every
+device driver implements these kfuncs by generating BPF bytecode
+to parse the metadata out of the hardware descriptors. The set of kfuncs is
+declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
+  indicate whether the device supports RX timestamps in general
+- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp or 0
+- ``bpf_xdp_metadata_export_to_skb`` prepares metadata layout that
+  the kernel will be able to consume. See ``bpf_redirect_map`` section
+  below for more details.
+
+Within the XDP frame, the metadata layout is as follows::
+
+  +----------+------------------+-----------------+------+
+  | headroom | xdp_skb_metadata | custom metadata | data |
+  +----------+------------------+-----------------+------+
+                                ^                 ^
+                                |                 |
+                      xdp_buff->data_meta   xdp_buff->data
+
+Where ``xdp_skb_metadata`` is the metadata prepared by
+``bpf_xdp_metadata_export_to_skb``. And ``custom metadata``
+is prepared by the BPF program via calls to ``bpf_xdp_adjust_meta``.
+
+Note that ``bpf_xdp_metadata_export_to_skb`` doesn't adjust the
+``xdp->data_meta`` pointer. To access the metadata generated
+by ``bpf_xdp_metadata_export_to_skb``, use ``xdp_md->skb_metadata``.
+
+AF_XDP
+======
+
+The ``AF_XDP`` use-case implies that there is a contract between the BPF program
+that redirects XDP frames into the ``XSK`` and the final consumer.
+Thus the BPF program manually allocates a fixed number of
+bytes of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The user-space ``XSK`` consumer looks
+at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+  +----------+------------------+-----------------+------+
+  | headroom | xdp_skb_metadata | custom metadata | data |
+  +----------+------------------+-----------------+------+
+                                                  ^
+                                                  |
+                                           rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
+Currently, every driver has custom kernel code to parse the descriptors and
+populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
+In the future, we'd like to support a case where the XDP program can override
+some of that metadata.
+
+The plan of record is to make this path similar to ``bpf_redirect_map``
+below where the program would call ``bpf_xdp_metadata_export_to_skb``,
+override the metadata and return ``XDP_PASS``. Additional work in
+the drivers will be required to enable this (for example, to skip
+populating ``skb`` metadata from the descriptors when
+``bpf_xdp_metadata_export_to_skb`` has been called).
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+In this case we don't know ahead of time whether that final consumer
+will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
+Additionally, the final consumer doesn't have access to the original
+hardware descriptor and can't access any of the original metadata.
+
+To support passing metadata via ``bpf_redirect_map``, there is a
+``bpf_xdp_metadata_export_to_skb`` kfunc that populates a subset
+of metadata into ``xdp_buff``. The layout is defined in
+``struct xdp_skb_metadata``.
+
+Mixing custom metadata and xdp_skb_metadata
+===========================================
+
+For the ``bpf_redirect_map`` cases, where the final consumer isn't
+known ahead of time, the program can store both custom metadata
+and ``xdp_skb_metadata`` for kernel consumption.
+
+The current limitation is that the program cannot adjust ``data_meta`` (via
+``bpf_xdp_adjust_meta``) after a call to ``bpf_xdp_metadata_export_to_skb``.
+So it has to prepare its custom metadata layout first and only then,
+optionally, store ``xdp_skb_metadata`` via a call to
+``bpf_xdp_metadata_export_to_skb``.
-- 
2.38.1.431.g37b22c650d-goog
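
To complement the AF_XDP section above, a rough user-space sketch (not part
of the patch) locating the custom metadata immediately before the packet
data, as described. struct meta is whatever layout the XDP program wrote;
the xsk_* helpers are libbpf's/libxdp's AF_XDP API.

  #include <stdio.h>
  #include <linux/if_xdp.h>
  #include "xsk.h"	/* libbpf's (or libxdp's) AF_XDP helpers */

  /* must match the layout the XDP program wrote via bpf_xdp_adjust_meta() */
  struct meta {
  	__u64 rx_timestamp;
  };

  static void dump_rx(struct xsk_ring_cons *rx, void *umem_area)
  {
  	__u32 idx = 0;
  	unsigned int i, n;

  	n = xsk_ring_cons__peek(rx, 64, &idx);
  	for (i = 0; i < n; i++) {
  		const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx + i);
  		char *data = xsk_umem__get_data(umem_area, desc->addr);
  		/* custom metadata sits right before rx_desc->addr */
  		struct meta *meta = (struct meta *)(data - sizeof(*meta));

  		printf("rx_timestamp=%llu\n",
  		       (unsigned long long)meta->rx_timestamp);
  	}
  	xsk_ring_cons__release(rx, n);
  }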



* [PATCH bpf-next 02/11] bpf: Introduce bpf_patch
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 01/11] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

A simple abstraction around a series of instructions that transparently
handles resizing.

Currently, we have insn_buf[16] in convert_ctx_accesses which might
not be enough for xdp kfuncs.

If we find this abstraction helpful, we might convert existing
insn_buf[16] to it in the future.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf_patch.h | 29 +++++++++++++++
 kernel/bpf/Makefile       |  2 +-
 kernel/bpf/bpf_patch.c    | 77 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_patch.h
 create mode 100644 kernel/bpf/bpf_patch.c

diff --git a/include/linux/bpf_patch.h b/include/linux/bpf_patch.h
new file mode 100644
index 000000000000..359c165ad68b
--- /dev/null
+++ b/include/linux/bpf_patch.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_BPF_PATCH_H
+#define _LINUX_BPF_PATCH_H 1
+
+#include <linux/bpf.h>
+
+struct bpf_patch {
+	struct bpf_insn *insn;
+	size_t capacity;
+	size_t len;
+	int err;
+};
+
+void bpf_patch_free(struct bpf_patch *patch);
+size_t bpf_patch_len(const struct bpf_patch *patch);
+int bpf_patch_err(const struct bpf_patch *patch);
+void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn);
+struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch);
+void bpf_patch_resolve_jmp(struct bpf_patch *patch);
+u32 bpf_patch_mangles_registers(const struct bpf_patch *patch);
+
+#define bpf_patch_append(patch, ...) ({ \
+	struct bpf_insn insn[] = { __VA_ARGS__ }; \
+	int i; \
+	for (i = 0; i < ARRAY_SIZE(insn); i++) \
+		__bpf_patch_append(patch, insn[i]); \
+})
+
+#endif
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 3a12e6b400a2..5724f36292a5 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o bpf_patch.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/bpf_patch.c b/kernel/bpf/bpf_patch.c
new file mode 100644
index 000000000000..eb768398fd8f
--- /dev/null
+++ b/kernel/bpf/bpf_patch.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bpf_patch.h>
+
+void bpf_patch_free(struct bpf_patch *patch)
+{
+	kfree(patch->insn);
+}
+
+size_t bpf_patch_len(const struct bpf_patch *patch)
+{
+	return patch->len;
+}
+
+int bpf_patch_err(const struct bpf_patch *patch)
+{
+	return patch->err;
+}
+
+void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn)
+{
+	void *arr;
+
+	if (patch->err)
+		return;
+
+	if (patch->len + 1 > patch->capacity) {
+		if (!patch->capacity)
+			patch->capacity = 16;
+		else
+			patch->capacity *= 2;
+
+		arr = krealloc_array(patch->insn, patch->capacity, sizeof(insn), GFP_KERNEL);
+		if (!arr) {
+			patch->err = -ENOMEM;
+			kfree(patch->insn);
+			patch->insn = NULL;
+			return;
+		}
+
+		patch->insn = arr;
+	}
+
+	patch->insn[patch->len++] = insn;
+}
+EXPORT_SYMBOL(__bpf_patch_append);
+
+struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch)
+{
+	return patch->insn;
+}
+
+void bpf_patch_resolve_jmp(struct bpf_patch *patch)
+{
+	int i;
+
+	for (i = 0; i < patch->len; i++) {
+		if (BPF_CLASS(patch->insn[i].code) != BPF_JMP)
+			continue;
+
+		if (patch->insn[i].off != S16_MAX)
+			continue;
+
+		patch->insn[i].off = patch->len - i - 1;
+	}
+}
+
+u32 bpf_patch_mangles_registers(const struct bpf_patch *patch)
+{
+	u32 mask = 0;
+	int i;
+
+	for (i = 0; i < patch->len; i++)
+		mask |= 1 << patch->insn[i].dst_reg;
+
+	return mask;
+}
-- 
2.38.1.431.g37b22c650d-goog
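
For context, a hypothetical driver-side user of this API (the kfunc IDs come
from patch 3, and veth wires up essentially the same callback in patch 5):

  #include <linux/bpf_patch.h>
  #include <linux/filter.h>
  #include <linux/ktime.h>
  #include <net/xdp.h>

  /* Hypothetical ndo_unroll_kfunc implementation: expand the rx_timestamp
   * kfuncs into a few inline instructions instead of a real call.
   */
  static void mydrv_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
  			       struct bpf_patch *patch)
  {
  	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
  		/* return true; */
  		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
  	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
  		/* real hardware would read its RX descriptor here; a kernel
  		 * helper call is the simplest stand-in
  		 */
  		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
  	}
  	/* leaving the patch empty makes the verifier fall back to the
  	 * default "return 0" expansion (patch 3)
  	 */
  }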



* [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 01/11] bpf: Document XDP RX metadata Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 02/11] bpf: Introduce bpf_patch Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15 16:16   ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-16 20:42   ` Jakub Kicinski
  2022-11-15  3:02 ` [PATCH bpf-next 04/11] bpf: Implement hidden BPF_PUSH64 and BPF_POP64 instructions Stanislav Fomichev
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Kfuncs have to be defined with KF_UNROLL for an attempted unroll.
For now, only XDP programs can have their kfuncs unrolled, but
we can extend this later on if more programs would like to use it.

For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
implements all possible metadata kfuncs. Not all devices have to
implement them. If unrolling is not supported by the target device,
the default implementation is called instead. The default
implementation is unconditionally unrolled to 'return false/0/NULL'
for now.

Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
we treat prog_ifindex as the target device for kfunc unrolling.
net_device_ops gains new ndo_unroll_kfunc which does the actual
dirty work per device.

The kfunc unrolling itself largely follows the existing map_gen_lookup
unrolling example, so there is nothing new here.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/kfuncs.rst   |  8 +++++
 include/linux/bpf.h            |  1 +
 include/linux/btf.h            |  1 +
 include/linux/btf_ids.h        |  4 +++
 include/linux/netdevice.h      |  5 +++
 include/net/xdp.h              | 24 +++++++++++++
 include/uapi/linux/bpf.h       |  5 +++
 kernel/bpf/syscall.c           | 28 ++++++++++++++-
 kernel/bpf/verifier.c          | 65 ++++++++++++++++++++++++++++++++++
 net/core/dev.c                 |  7 ++++
 net/core/xdp.c                 | 39 ++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++
 12 files changed, 191 insertions(+), 1 deletion(-)

diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 0f858156371d..1723de2720bb 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -169,6 +169,14 @@ rebooting or panicking. Due to this additional restrictions apply to these
 calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
 added later.
 
+2.4.8 KF_UNROLL flag
+-----------------------
+
+The KF_UNROLL flag is used for kfuncs that the verifier can attempt to unroll.
+Unrolling is currently implemented only for XDP programs' metadata kfuncs.
+The main motivation behind unrolling is to remove function call overhead
+and allow efficient inlined kfuncs to be generated.
+
 2.5 Registering the kfuncs
 --------------------------
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 798aec816970..bf8936522dd9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1240,6 +1240,7 @@ struct bpf_prog_aux {
 		struct work_struct work;
 		struct rcu_head	rcu;
 	};
+	const struct net_device_ops *xdp_kfunc_ndo;
 };
 
 struct bpf_prog {
diff --git a/include/linux/btf.h b/include/linux/btf.h
index d80345fa566b..950cca997a5a 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -51,6 +51,7 @@
 #define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
 #define KF_SLEEPABLE    (1 << 5) /* kfunc may sleep */
 #define KF_DESTRUCTIVE  (1 << 6) /* kfunc performs destructive actions */
+#define KF_UNROLL       (1 << 7) /* kfunc unrolling can be attempted */
 
 /*
  * Return the name of the passed struct, if exists, or halt the build if for
diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index c9744efd202f..eb448e9c79bb 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -195,6 +195,10 @@ asm(							\
 __BTF_ID_LIST(name, local)				\
 __BTF_SET8_START(name, local)
 
+#define BTF_SET8_START_GLOBAL(name)			\
+__BTF_ID_LIST(name, global)				\
+__BTF_SET8_START(name, global)
+
 #define BTF_SET8_END(name)				\
 asm(							\
 ".pushsection " BTF_IDS_SECTION ",\"a\";      \n"	\
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 02a2318da7c7..2096b4f00e4b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -73,6 +73,8 @@ struct udp_tunnel_info;
 struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
+struct bpf_insn;
+struct bpf_patch;
 struct xdp_buff;
 
 void synchronize_net(void);
@@ -1604,6 +1606,9 @@ struct net_device_ops {
 	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
 						  const struct skb_shared_hwtstamps *hwtstamps,
 						  bool cycles);
+	void			(*ndo_unroll_kfunc)(const struct bpf_prog *prog,
+						    u32 func_id,
+						    struct bpf_patch *patch);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..2a82a98f2f9f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -7,6 +7,7 @@
 #define __LINUX_NET_XDP_H__
 
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <linux/btf_ids.h> /* btf_id_set8 */
 
 /**
  * DOC: XDP RX-queue information
@@ -409,4 +410,27 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
+			   bpf_xdp_metadata_rx_timestamp_supported) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+
+enum {
+#define XDP_METADATA_KFUNC(name, str) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+extern struct btf_id_set8 xdp_metadata_kfunc_ids;
+static inline u32 xdp_metadata_kfunc_id(int id)
+{
+	return xdp_metadata_kfunc_ids.pairs[id].id;
+}
+#else
+static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
+#endif
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fb4c911d2a03..b444b1118c4f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 85532d301124..597c41949910 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2426,6 +2426,20 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD core_relo_rec_size
 
+static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev;
+
+	for_each_netdev(net, dev) {
+		if (dev->ifindex == ifindex) {
+			prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
+			return 0;
+		}
+	}
+	return -EINVAL;
+}
+
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 {
 	enum bpf_prog_type type = attr->prog_type;
@@ -2443,7 +2457,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_HAS_METADATA))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2531,6 +2546,17 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	if (attr->prog_flags & BPF_F_XDP_HAS_METADATA) {
+		/* Reuse prog_ifindex to carry request to unroll
+		 * metadata kfuncs.
+		 */
+		prog->aux->offload_requested = false;
+
+		err = xdp_resolve_netdev(prog, attr->prog_ifindex);
+		if (err < 0)
+			goto free_prog;
+	}
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 07c0259dfc1a..b657ed6eb277 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/bpf.h>
+#include <linux/bpf_patch.h>
 #include <linux/btf.h>
 #include <linux/bpf_verifier.h>
 #include <linux/filter.h>
@@ -14015,6 +14016,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	return err;
 }
 
+static int unroll_kfunc_call(struct bpf_verifier_env *env,
+			     struct bpf_insn *insn,
+			     struct bpf_patch *patch)
+{
+	enum bpf_prog_type prog_type;
+	struct bpf_prog_aux *aux;
+	struct btf *desc_btf;
+	u32 *kfunc_flags;
+	u32 func_id;
+
+	desc_btf = find_kfunc_desc_btf(env, insn->off);
+	if (IS_ERR(desc_btf))
+		return PTR_ERR(desc_btf);
+
+	prog_type = resolve_prog_type(env->prog);
+	func_id = insn->imm;
+
+	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, prog_type, func_id);
+	if (!kfunc_flags)
+		return 0;
+	if (!(*kfunc_flags & KF_UNROLL))
+		return 0;
+	if (prog_type != BPF_PROG_TYPE_XDP)
+		return 0;
+
+	aux = env->prog->aux;
+	if (aux->xdp_kfunc_ndo && aux->xdp_kfunc_ndo->ndo_unroll_kfunc)
+		aux->xdp_kfunc_ndo->ndo_unroll_kfunc(env->prog, func_id, patch);
+	if (bpf_patch_len(patch) == 0) {
+		/* Default optimized kfunc implementation that
+		 * returns NULL/0/false.
+		 */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
+	}
+	return bpf_patch_err(patch);
+}
+
 static int fixup_kfunc_call(struct bpf_verifier_env *env,
 			    struct bpf_insn *insn)
 {
@@ -14178,6 +14216,33 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 		if (insn->src_reg == BPF_PSEUDO_CALL)
 			continue;
 		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
+			struct bpf_patch patch = {};
+
+			if (bpf_prog_is_dev_bound(env->prog->aux)) {
+				verbose(env, "no metadata kfuncs offload\n");
+				return -EINVAL;
+			}
+
+			ret = unroll_kfunc_call(env, insn, &patch);
+			if (ret < 0) {
+				verbose(env, "failed to unroll kfunc with func_id=%d\n", insn->imm);
+				return ret;
+			}
+			cnt = bpf_patch_len(&patch);
+			if (cnt) {
+				new_prog = bpf_patch_insn_data(env, i + delta,
+							       bpf_patch_data(&patch),
+							       bpf_patch_len(&patch));
+				bpf_patch_free(&patch);
+				if (!new_prog)
+					return -ENOMEM;
+
+				delta    += cnt - 1;
+				env->prog = prog = new_prog;
+				insn      = new_prog->insnsi + i + delta;
+				continue;
+			}
+
 			ret = fixup_kfunc_call(env, insn);
 			if (ret)
 				return ret;
diff --git a/net/core/dev.c b/net/core/dev.c
index 117e830cabb0..a2227f4f4a0b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9258,6 +9258,13 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			return -EOPNOTSUPP;
 		}
 
+		if (new_prog &&
+		    new_prog->aux->xdp_kfunc_ndo &&
+		    new_prog->aux->xdp_kfunc_ndo != dev->netdev_ops) {
+			NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
+			return -EINVAL;
+		}
+
 		err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
 		if (err)
 			return err;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..22f1e44700eb 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,8 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/bpf_patch.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +711,40 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+/* Indicates whether particular device supports rx_timestamp metadata.
+ * This is an optional helper to support marking some branches as
+ * "dead code" in the BPF programs.
+ */
+noinline int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	/* payload is ignored, see default case in unroll_kfunc_call */
+	return false;
+}
+
+/* Returns rx_timestamp metadata or 0 when the frame doesn't have it.
+ */
+noinline const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx)
+{
+	/* payload is ignored, see default case in unroll_kfunc_call */
+	return 0;
+}
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID_FLAGS(func, str, KF_RET_NULL | KF_UNROLL)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
+#endif
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fb4c911d2a03..b444b1118c4f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.38.1.431.g37b22c650d-goog
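
From the loader side, opting into unrolling is just a prog flag plus an
ifindex set before load. A sketch with libbpf (function and object names are
made up; the selftests later in the series do the equivalent):

  #include <errno.h>
  #include <bpf/libbpf.h>

  #ifndef BPF_F_XDP_HAS_METADATA
  #define BPF_F_XDP_HAS_METADATA	(1U << 6)	/* added by this patch */
  #endif

  static int load_with_metadata(struct bpf_object *obj, const char *prog_name,
  			      int ifindex)
  {
  	struct bpf_program *prog;

  	prog = bpf_object__find_program_by_name(obj, prog_name);
  	if (!prog)
  		return -ENOENT;

  	/* ask the verifier to unroll metadata kfuncs against this netdev */
  	bpf_program__set_flags(prog,
  			       bpf_program__flags(prog) | BPF_F_XDP_HAS_METADATA);
  	bpf_program__set_ifindex(prog, ifindex);

  	return bpf_object__load(obj);
  }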



* [PATCH bpf-next 04/11] bpf: Implement hidden BPF_PUSH64 and BPF_POP64 instructions
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Zi Shen Lim

Implemented for:
- x86_64 jit (tested)
- arm64 jit (untested)

Interpreter is not implemented because push/pop are currently
used only with xdp kfunc and jit is required to use kfuncs.

Fundamentally:
  BPF_ST | BPF_STACK + src_reg == store into the stack
  BPF_LD | BPF_STACK + dst_reg == load from the stack
  off/imm are unused

Updated disasm code to properly dump these new instructions:

  31: (e2) push r1
  32: (79) r5 = *(u64 *)(r1 +56)
  33: (55) if r5 != 0x0 goto pc+2
  34: (b7) r0 = 0
  35: (05) goto pc+1
  36: (79) r0 = *(u64 *)(r5 +32)
  37: (e0) pop r1

Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 arch/arm64/net/bpf_jit_comp.c |  8 ++++++++
 arch/x86/net/bpf_jit_comp.c   |  8 ++++++++
 include/linux/filter.h        | 23 +++++++++++++++++++++++
 kernel/bpf/disasm.c           |  6 ++++++
 4 files changed, 45 insertions(+)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 62f805f427b7..4c0e70e6572a 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -1185,6 +1185,14 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 		 */
 		break;
 
+		/* kernel hidden stack operations */
+	case BPF_ST | BPF_STACK:
+		emit(A64_PUSH(src, src, A64_SP), ctx);
+		break;
+	case BPF_LD | BPF_STACK:
+		emit(A64_POP(dst, dst, A64_SP), ctx);
+		break;
+
 	/* ST: *(size *)(dst + off) = imm */
 	case BPF_ST | BPF_MEM | BPF_W:
 	case BPF_ST | BPF_MEM | BPF_H:
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index cec5195602bc..528bece87ca4 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1324,6 +1324,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 			EMIT_LFENCE();
 			break;
 
+			/* kernel hidden stack operations */
+		case BPF_ST | BPF_STACK:
+			EMIT1(add_1reg(0x50, src_reg)); /* pushq  */
+			break;
+		case BPF_LD | BPF_STACK:
+			EMIT1(add_1reg(0x58, dst_reg)); /* popq */
+			break;
+
 			/* ST: *(u8*)(dst_reg + off) = imm */
 		case BPF_ST | BPF_MEM | BPF_B:
 			if (is_ereg(dst_reg))
diff --git a/include/linux/filter.h b/include/linux/filter.h
index efc42a6e3aed..42c61ec8f895 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -76,6 +76,9 @@ struct ctl_table_header;
  */
 #define BPF_NOSPEC	0xc0
 
+/* unused opcode for kernel hidden stack operations */
+#define BPF_STACK	0xe0
+
 /* As per nm, we expose JITed images as text (code) section for
  * kallsyms. That way, tools like perf can find it to match
  * addresses.
@@ -402,6 +405,26 @@ static inline bool insn_is_zext(const struct bpf_insn *insn)
 		.off   = 0,					\
 		.imm   = 0 })
 
+/* Push SRC register value onto the stack */
+
+#define BPF_PUSH64(SRC)						\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_STACK,			\
+		.dst_reg = 0,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Pop stack value into DST register */
+
+#define BPF_POP64(DST)						\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_STACK,			\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
 /* Internal classic blocks for direct assignment */
 
 #define __BPF_STMT(CODE, K)					\
diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index 7b4afb7d96db..9cd22f3591de 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -214,6 +214,9 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 				insn->off, insn->imm);
 		} else if (BPF_MODE(insn->code) == 0xc0 /* BPF_NOSPEC, no UAPI */) {
 			verbose(cbs->private_data, "(%02x) nospec\n", insn->code);
+		} else if (BPF_MODE(insn->code) == 0xe0 /* BPF_STACK, no UAPI */) {
+			verbose(cbs->private_data, "(%02x) push r%d\n",
+				insn->code, insn->src_reg);
 		} else {
 			verbose(cbs->private_data, "BUG_st_%02x\n", insn->code);
 		}
@@ -254,6 +257,9 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 				insn->code, insn->dst_reg,
 				__func_imm_name(cbs, insn, imm,
 						tmp, sizeof(tmp)));
+		} else if (BPF_MODE(insn->code) == 0xe0 /* BPF_STACK, no UAPI */) {
+			verbose(cbs->private_data, "(%02x) pop r%d\n",
+				insn->code, insn->dst_reg);
 		} else {
 			verbose(cbs->private_data, "BUG_ld_%02x\n", insn->code);
 			return;
-- 
2.38.1.431.g37b22c650d-goog
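
As a quick illustration of what these are for: an unroll callback can bracket
an emitted kernel call (which clobbers R1-R5) with push/pop to keep the ctx
pointer alive; patch 6 uses exactly this pattern around the rx_timestamp
expansion. The helper choice below is illustrative.

  #include <linux/bpf_patch.h>
  #include <linux/filter.h>
  #include <linux/ktime.h>

  static void emit_timestamp_call(struct bpf_patch *patch)
  {
  	bpf_patch_append(patch,
  		/* save ctx: the call below clobbers r1-r5 */
  		BPF_PUSH64(BPF_REG_1),
  		/* r0 = ktime_get_mono_fast_ns() */
  		BPF_EMIT_CALL(ktime_get_mono_fast_ns),
  		/* restore ctx */
  		BPF_POP64(BPF_REG_1));
  }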



* [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 04/11] bpf: Implement hidden BPF_PUSH64 and BPF_POP64 instructions Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15 16:17   ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

The goal is to enable end-to-end testing of the metadata
for AF_XDP. Current rx_timestamp kfunc returns current
time which should be enough to exercise this new functionality.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2a4592780141..c626580a2294 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -25,6 +25,7 @@
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/net_tstamp.h>
 
 #define DRV_NAME	"veth"
@@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+			      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		/* return ktime_get_mono_fast_ns(); */
+		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1678,6 +1691,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_unroll_kfunc       = veth_unroll_kfunc,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.38.1.431.g37b22c650d-goog



* [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15 23:20   ` [xdp-hints] " Toke Høiland-Jørgensen
                     ` (3 more replies)
  2022-11-15  3:02 ` [PATCH bpf-next 07/11] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
                   ` (5 subsequent siblings)
  11 siblings, 4 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Implement new bpf_xdp_metadata_export_to_skb kfunc which
prepares compatible xdp metadata for kernel consumption.
This kfunc should be called prior to bpf_redirect
or when XDP_PASS'ing the frame into the kernel (note, the drivers
have to be updated to enable consuming XDP_PASS'ed metadata).

veth driver is amended to consume this metadata when converting to skb.

Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
whether the frame has skb metadata. The metadata is currently
stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
to work after a call to bpf_xdp_metadata_export_to_skb (can lift
this requirement later on if needed, we'd have to memmove
xdp_skb_metadata).

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c             |  10 +--
 include/linux/skbuff.h         |   4 +
 include/net/xdp.h              |  17 ++++
 include/uapi/linux/bpf.h       |   7 ++
 kernel/bpf/verifier.c          |  15 ++++
 net/core/filter.c              |  40 +++++++++
 net/core/skbuff.c              |  20 +++++
 net/core/xdp.c                 | 145 +++++++++++++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |   7 ++
 9 files changed, 255 insertions(+), 10 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index c626580a2294..35349a232209 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -803,7 +803,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	void *orig_data, *orig_data_end;
 	struct bpf_prog *xdp_prog;
 	struct xdp_buff xdp;
-	u32 act, metalen;
+	u32 act;
 	int off;
 
 	skb_prepare_for_gro(skb);
@@ -886,9 +886,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 	skb->protocol = eth_type_trans(skb, rq->dev);
 
-	metalen = xdp.data - xdp.data_meta;
-	if (metalen)
-		skb_metadata_set(skb, metalen);
+	xdp_convert_skb_metadata(&xdp, skb);
 out:
 	return skb;
 drop:
@@ -1663,7 +1661,9 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
 			      struct bpf_patch *patch)
 {
-	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		return xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
 		/* return true; */
 		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
 	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4e464a27adaf..be6a9559dbf1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4219,6 +4219,10 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
 	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
 }
 
+struct xdp_skb_metadata;
+bool skb_metadata_import_from_xdp(struct sk_buff *skb,
+				  struct xdp_skb_metadata *meta);
+
 static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
 {
 	skb_shinfo(skb)->meta_len = meta_len;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2a82a98f2f9f..547a6a0e99f8 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -73,6 +73,7 @@ enum xdp_buff_flags {
 	XDP_FLAGS_FRAGS_PF_MEMALLOC	= BIT(1), /* xdp paged memory is under
 						   * pressure
 						   */
+	XDP_FLAGS_HAS_SKB_METADATA	= BIT(2), /* xdp_skb_metadata */
 };
 
 struct xdp_buff {
@@ -91,6 +92,11 @@ static __always_inline bool xdp_buff_has_frags(struct xdp_buff *xdp)
 	return !!(xdp->flags & XDP_FLAGS_HAS_FRAGS);
 }
 
+static __always_inline bool xdp_buff_has_skb_metadata(struct xdp_buff *xdp)
+{
+	return !!(xdp->flags & XDP_FLAGS_HAS_SKB_METADATA);
+}
+
 static __always_inline void xdp_buff_set_frags_flag(struct xdp_buff *xdp)
 {
 	xdp->flags |= XDP_FLAGS_HAS_FRAGS;
@@ -306,6 +312,8 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
+bool xdp_convert_skb_metadata(struct xdp_buff *xdp, struct sk_buff *skb);
+
 void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		  struct xdp_buff *xdp);
 void xdp_return_frame(struct xdp_frame *xdpf);
@@ -411,6 +419,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
 #define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_EXPORT_TO_SKB, \
+			   bpf_xdp_metadata_export_to_skb) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
 			   bpf_xdp_metadata_rx_timestamp_supported) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
@@ -423,14 +433,21 @@ XDP_METADATA_KFUNC_xxx
 MAX_XDP_METADATA_KFUNC,
 };
 
+struct bpf_patch;
+
 #ifdef CONFIG_DEBUG_INFO_BTF
 extern struct btf_id_set8 xdp_metadata_kfunc_ids;
 static inline u32 xdp_metadata_kfunc_id(int id)
 {
 	return xdp_metadata_kfunc_ids.pairs[id].id;
 }
+void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
 #else
 static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
+static inline void xdp_metadata_export_to_skb(const struct bpf_prog *prog,
+					       struct bpf_patch *patch)
+{
+}
 #endif
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b444b1118c4f..71e3bc7ad839 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6116,6 +6116,12 @@ enum xdp_action {
 	XDP_REDIRECT,
 };
 
+/* Subset of XDP metadata exported to skb context.
+ */
+struct xdp_skb_metadata {
+	__u64 rx_timestamp;
+};
+
 /* user accessible metadata for XDP packet hook
  * new fields must be added to the end of this structure
  */
@@ -6128,6 +6134,7 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
+	__bpf_md_ptr(struct xdp_skb_metadata *, skb_metadata);
 };
 
 /* DEVMAP map-value layout
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b657ed6eb277..6879ad3a6026 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14023,6 +14023,7 @@ static int unroll_kfunc_call(struct bpf_verifier_env *env,
 	enum bpf_prog_type prog_type;
 	struct bpf_prog_aux *aux;
 	struct btf *desc_btf;
+	u32 allowed, mangled;
 	u32 *kfunc_flags;
 	u32 func_id;
 
@@ -14050,6 +14051,20 @@ static int unroll_kfunc_call(struct bpf_verifier_env *env,
 		 */
 		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
 	}
+
+	allowed = 1 << BPF_REG_0;
+	allowed |= 1 << BPF_REG_1;
+	allowed |= 1 << BPF_REG_2;
+	allowed |= 1 << BPF_REG_3;
+	allowed |= 1 << BPF_REG_4;
+	allowed |= 1 << BPF_REG_5;
+	mangled = bpf_patch_mangles_registers(patch);
+	if (WARN_ON_ONCE(mangled & ~allowed)) {
+		bpf_patch_free(patch);
+		verbose(env, "bpf verifier is misconfigured\n");
+		return -EINVAL;
+	}
+
 	return bpf_patch_err(patch);
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 6dd2baf5eeb2..2497144e4216 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4094,6 +4094,8 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
 		return -EINVAL;
 	if (unlikely(xdp_metalen_invalid(metalen)))
 		return -EACCES;
+	if (unlikely(xdp_buff_has_skb_metadata(xdp)))
+		return -EACCES;
 
 	xdp->data_meta = meta;
 
@@ -8690,6 +8692,8 @@ static bool __is_valid_xdp_access(int off, int size)
 	return true;
 }
 
+BTF_ID_LIST_SINGLE(xdp_to_skb_metadata_btf_ids, struct, xdp_skb_metadata);
+
 static bool xdp_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -8722,6 +8726,18 @@ static bool xdp_is_valid_access(int off, int size,
 	case offsetof(struct xdp_md, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
+	case offsetof(struct xdp_md, skb_metadata):
+		info->btf = bpf_get_btf_vmlinux();
+		if (IS_ERR(info->btf))
+			return false;
+		if (!info->btf)
+			return false;
+
+		info->reg_type = PTR_TO_BTF_ID_OR_NULL;
+		info->btf_id = xdp_to_skb_metadata_btf_ids[0];
+		if (size == sizeof(__u64))
+			return true;
+		return false;
 	}
 
 	return __is_valid_xdp_access(off, size);
@@ -9814,6 +9830,30 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
 				      offsetof(struct net_device, ifindex));
 		break;
+	case offsetof(struct xdp_md, skb_metadata):
+		/* dst_reg = xdp_buff->flags; */
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, flags),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, flags));
+		/* dst_reg &= XDP_FLAGS_HAS_SKB_METADATA; */
+		*insn++ = BPF_ALU64_IMM(BPF_AND, si->dst_reg,
+					XDP_FLAGS_HAS_SKB_METADATA);
+
+		/* if (dst_reg != 0) { */
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 3);
+		/*	dst_reg = xdp_buff->data_meta; */
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_meta),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, data_meta));
+		/*	dst_reg -= sizeof(struct xdp_skb_metadata); */
+		*insn++ = BPF_ALU64_IMM(BPF_SUB, si->dst_reg,
+					sizeof(struct xdp_skb_metadata));
+		*insn++ = BPF_JMP_A(1);
+		/* } else { */
+		/*	return 0; */
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+		/* } */
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 90d085290d49..0cc24ca20e4d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -72,6 +72,7 @@
 #include <net/mptcp.h>
 #include <net/mctp.h>
 #include <net/page_pool.h>
+#include <net/xdp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -6675,3 +6676,22 @@ nodefer:	__kfree_skb(skb);
 	if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
 		smp_call_function_single_async(cpu, &sd->defer_csd);
 }
+
+bool skb_metadata_import_from_xdp(struct sk_buff *skb,
+				  struct xdp_skb_metadata *meta)
+{
+	/* Optional SKB info, currently missing:
+	 * - HW checksum info		(skb->ip_summed)
+	 * - HW RX hash			(skb_set_hash)
+	 * - RX ring dev queue index	(skb_record_rx_queue)
+	 */
+
+	if (meta->rx_timestamp) {
+		*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
+			.hwtstamp = ns_to_ktime(meta->rx_timestamp),
+		};
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(skb_metadata_import_from_xdp);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 22f1e44700eb..ede9b1b987d9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -368,6 +368,22 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
+bool xdp_convert_skb_metadata(struct xdp_buff *xdp, struct sk_buff *skb)
+{
+	struct xdp_skb_metadata *meta;
+	u32 metalen;
+
+	metalen = xdp->data - xdp->data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+	if (xdp_buff_has_skb_metadata(xdp)) {
+		meta = xdp->data_meta - sizeof(*meta);
+		return skb_metadata_import_from_xdp(skb, meta);
+	}
+	return false;
+}
+EXPORT_SYMBOL(xdp_convert_skb_metadata);
+
 /* XDP RX runs under NAPI protection, and in different delivery error
  * scenarios (e.g. queue full), it is possible to return the xdp_frame
  * while still leveraging this protection.  The @napi_direct boolean
@@ -619,6 +635,7 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
 	unsigned int headroom, frame_size;
+	struct xdp_skb_metadata *meta;
 	void *hard_start;
 	u8 nr_frags;
 
@@ -653,11 +670,10 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	/* Essential SKB info: protocol and skb->dev */
 	skb->protocol = eth_type_trans(skb, dev);
 
-	/* Optional SKB info, currently missing:
-	 * - HW checksum info		(skb->ip_summed)
-	 * - HW RX hash			(skb_set_hash)
-	 * - RX ring dev queue index	(skb_record_rx_queue)
-	 */
+	if (xdpf->flags & XDP_FLAGS_HAS_SKB_METADATA) {
+		meta = xdpf->data - xdpf->metasize - sizeof(*meta);
+		skb_metadata_import_from_xdp(skb, meta);
+	}
 
 	/* Until page_pool get SKB return path, release DMA here */
 	xdp_release_frame(xdpf);
@@ -712,6 +728,14 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 	return nxdpf;
 }
 
+/* For the packets directed to the kernel, this kfunc exports XDP metadata
+ * into skb context.
+ */
+noinline int bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
+{
+	return 0;
+}
+
 /* Indicates whether particular device supports rx_timestamp metadata.
  * This is an optional helper to support marking some branches as
  * "dead code" in the BPF programs.
@@ -736,15 +760,126 @@ BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
 XDP_METADATA_KFUNC_xxx
 #undef XDP_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
+EXPORT_SYMBOL(xdp_metadata_kfunc_ids);
 
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
 	.owner = THIS_MODULE,
 	.set   = &xdp_metadata_kfunc_ids,
 };
 
+/* Since we're not actually doing a call but instead rewriting
+ * in place, we can only afford to use R0-R5 scratch registers
+ * and hidden BPF_PUSH64/BPF_POP64 opcodes to spill to the stack.
+ */
+void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
+{
+	u32 func_id;
+
+	/* The code below generates the following:
+	 *
+	 * int bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
+	 * {
+	 *	struct xdp_skb_metadata *meta = ctx->data_meta - sizeof(*meta);
+	 *	int ret;
+	 *
+	 *	if (ctx->flags & XDP_FLAGS_HAS_SKB_METADATA)
+	 *		return -1;
+	 *
+	 *	if (meta < ctx->data_hard_start + sizeof(struct xdp_frame))
+	 *		return -1;
+	 *
+	 *	meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+	 *	ctx->flags |= XDP_FLAGS_HAS_SKB_METADATA;
+	 *
+	 *	return 0;
+	 * }
+	 */
+
+	bpf_patch_append(patch,
+		BPF_MOV64_IMM(BPF_REG_0, -1),
+
+		/* r2 = ((struct xdp_buff *)r1)->flags; */
+		BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, flags),
+			    BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, flags)),
+
+		/* r2 &= XDP_FLAGS_HAS_SKB_METADATA; */
+		BPF_ALU64_IMM(BPF_AND, BPF_REG_2, XDP_FLAGS_HAS_SKB_METADATA),
+
+		/* if (xdp_buff->flags & XDP_FLAGS_HAS_SKB_METADATA) return -1; */
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, S16_MAX),
+
+		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_meta)),
+		/* r2 -= sizeof(struct xdp_skb_metadata); */
+		BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
+			      sizeof(struct xdp_skb_metadata)),
+		/* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_hard_start)),
+		/* r3 += sizeof(struct xdp_frame) */
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
+			      sizeof(struct xdp_frame)),
+		/* if (data_meta-sizeof(struct xdp_skb_metadata) <
+		 *     data_hard_start+sizeof(struct xdp_frame)) return -1;
+		 */
+		BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
+
+		/* r2 = ((struct xdp_buff *)r1)->flags; */
+		BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, flags),
+			    BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, flags)),
+
+		/* r2 |= XDP_FLAGS_HAS_SKB_METADATA; */
+		BPF_ALU64_IMM(BPF_OR, BPF_REG_2, XDP_FLAGS_HAS_SKB_METADATA),
+
+		/* ((struct xdp_buff *)r1)->flags = r2; */
+		BPF_STX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, flags),
+			    BPF_REG_1, BPF_REG_2,
+			    offsetof(struct xdp_buff, flags)),
+
+		/* push r1 */
+		BPF_PUSH64(BPF_REG_1),
+	);
+
+	/*	r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
+	func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
+	prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
+
+	bpf_patch_append(patch,
+		/* pop r1 */
+		BPF_POP64(BPF_REG_1),
+
+		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_meta)),
+		/* r2 -= sizeof(struct xdp_skb_metadata); */
+		BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
+			      sizeof(struct xdp_skb_metadata)),
+
+		/* *((struct xdp_skb_metadata *)r2)->rx_timestamp = r0; */
+		BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
+			    offsetof(struct xdp_skb_metadata, rx_timestamp)),
+
+		/* return 0; */
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+	);
+
+	bpf_patch_resolve_jmp(patch);
+}
+EXPORT_SYMBOL(xdp_metadata_export_to_skb);
+
 static int __init xdp_metadata_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
 }
 late_initcall(xdp_metadata_init);
+#else
+struct btf_id_set8 xdp_metadata_kfunc_ids = {};
+EXPORT_SYMBOL(xdp_metadata_kfunc_ids);
+void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
+{
+}
+EXPORT_SYMBOL(xdp_metadata_export_to_skb);
 #endif
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b444b1118c4f..71e3bc7ad839 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6116,6 +6116,12 @@ enum xdp_action {
 	XDP_REDIRECT,
 };
 
+/* Subset of XDP metadata exported to skb context.
+ */
+struct xdp_skb_metadata {
+	__u64 rx_timestamp;
+};
+
 /* user accessible metadata for XDP packet hook
  * new fields must be added to the end of this structure
  */
@@ -6128,6 +6134,7 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
+	__bpf_md_ptr(struct xdp_skb_metadata *, skb_metadata);
 };
 
 /* DEVMAP map-value layout
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH bpf-next 07/11] selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 08/11] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- create a new netns
- create a veth pair (veTX+veRX)
- set up an AF_XDP socket for both interfaces
- attach the BPF program to veRX
- send a packet via veTX
- verify the packet has the expected metadata at veRX (see the sketch below)
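
The handoff being verified boils down to the following (a minimal sketch,
not part of the patch; it assumes the 8-byte metadata layout used by the
program below, where the timestamp is prepended via bpf_xdp_adjust_meta()):

  /* userspace (AF_XDP) side: the timestamp sits right in front of the frame */
  static __u64 read_prepended_rx_timestamp(const void *data)
  {
  	return *(const __u64 *)(data - sizeof(__u64));
  }

A zero value is treated as "no timestamp was provided".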

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 359 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  50 +++
 3 files changed, 410 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f3cd17026ee5..b645cf5a5021 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -523,7 +523,7 @@ TRUNNER_BPF_PROGS_DIR := progs
 TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c
+			 cap_helpers.c xsk.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
new file mode 100644
index 000000000000..c3321d8c7cd4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -0,0 +1,359 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_metadata.skel.h"
+#include "xsk.h"
+
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#define TX_NAME "veTX"
+#define RX_NAME "veRX"
+
+#define UDP_PAYLOAD_BYTES 4
+
+#define AF_XDP_SOURCE_PORT 1234
+#define AF_XDP_CONSUMER_PORT 8080
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define METADATA_SIZE 8
+#define XDP_FLAGS XDP_FLAGS_DRV_MODE
+#define QUEUE_ID 0
+
+#define TX_ADDR "10.0.0.1"
+#define RX_ADDR "10.0.0.2"
+#define PREFIX_LEN "8"
+#define FAMILY AF_INET
+
+#define SYS(cmd) ({ \
+	if (!ASSERT_OK(system(cmd), (cmd))) \
+		goto out; \
+})
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+static int open_xsk(const char *ifname, struct xsk *xsk)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (!ASSERT_NEQ(xsk->umem_area, MAP_FAILED, "mmap"))
+		return -1;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (!ASSERT_OK(ret, "xsk_umem__create"))
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, QUEUE_ID,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (!ASSERT_OK(ret, "xsk_socket__create"))
+		return ret;
+
+	/* First half of umem is for TX. This way the address matches 1-to-1
+	 * with the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	if (!ASSERT_EQ(UMEM_NUM / 2, ret, "xsk_ring_prod__reserve"))
+		return ret;
+	if (!ASSERT_EQ(idx, 0, "fill idx != 0"))
+		return -1;
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void ip_csum(struct iphdr *iph)
+{
+	__u32 sum = 0;
+	__u16 *p;
+	int i;
+
+	iph->check = 0;
+	p = (void *)iph;
+	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
+		sum += p[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	iph->check = ~sum;
+}
+
+static int generate_packet(struct xsk *xsk, __u16 dst_port)
+{
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
+		return -1;
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+	iph = (void *)(eth + 1);
+	udph = (void *)(iph + 1);
+
+	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
+	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IP);
+
+	iph->version = 0x4;
+	iph->ihl = 0x5;
+	iph->tos = 0x9;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	iph->id = 0;
+	iph->frag_off = 0;
+	iph->ttl = 0;
+	iph->protocol = IPPROTO_UDP;
+	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
+	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
+	ip_csum(iph);
+
+	udph->source = htons(AF_XDP_SOURCE_PORT);
+	udph->dest = htons(dst_port);
+	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	udph->check = 0;
+
+	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+
+	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	xsk_ring_prod__submit(&xsk->tx, 1);
+
+	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!ASSERT_GE(ret, 0, "sendto"))
+		return ret;
+
+	return 0;
+}
+
+static void complete_tx(struct xsk *xsk)
+{
+	__u32 idx;
+	__u64 addr;
+
+	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: refill idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (ASSERT_EQ(xsk_ring_prod__reserve(&xsk->fill, 1, &idx), 1, "xsk_ring_prod__reserve")) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static int verify_xsk_metadata(struct xsk *xsk)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds = {};
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	__u64 comp_addr;
+	void *data_meta;
+	void *data;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+
+	ret = recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+	if (!ASSERT_EQ(ret, 0, "recvfrom"))
+		return -1;
+
+	fds.fd = xsk_socket__fd(xsk->socket);
+	fds.events = POLLIN;
+
+	ret = poll(&fds, 1, 1000);
+	if (!ASSERT_GT(ret, 0, "poll"))
+		return -1;
+
+	ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_cons__peek"))
+		return -2;
+
+	rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+	comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+	addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+	printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+	       xsk, idx, rx_desc->addr, addr, comp_addr);
+	data = xsk_umem__get_data(xsk->umem_area, addr);
+
+	/* Make sure we got the packet offset correctly. */
+
+	eth = data;
+	ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
+	iph = (void *)(eth + 1);
+	ASSERT_EQ((int)iph->version, 4, "iph->version");
+
+	data_meta = data - METADATA_SIZE;
+
+	if (*(__u64 *)data_meta == 0)
+		return -1;
+
+	xsk_ring_cons__release(&xsk->rx, 1);
+	refill_rx(xsk, comp_addr);
+
+	return 0;
+}
+
+void test_xdp_metadata(void)
+{
+	struct xdp_metadata *bpf_obj = NULL;
+	struct nstoken *tok = NULL;
+	__u32 queue_id = QUEUE_ID;
+	struct bpf_program *prog;
+	struct xsk tx_xsk = {};
+	struct xsk rx_xsk = {};
+	int rx_ifindex;
+	int sock_fd;
+	int ret;
+
+	/* Setup new networking namespace, with a veth pair. */
+
+	SYS("ip netns add xdp_metadata");
+	tok = open_netns("xdp_metadata");
+	SYS("ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
+	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
+	SYS("ip link set dev " TX_NAME " address 00:00:00:00:00:01");
+	SYS("ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS("ip link set dev " TX_NAME " up");
+	SYS("ip link set dev " RX_NAME " up");
+	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+
+	rx_ifindex = if_nametoindex(RX_NAME);
+
+	/* Setup separate AF_XDP for TX and RX interfaces. */
+
+	ret = open_xsk(TX_NAME, &tx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
+		goto out;
+
+	ret = open_xsk(RX_NAME, &rx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
+		goto out;
+
+	/* Attach BPF program to RX interface. */
+
+	bpf_obj = xdp_metadata__open();
+	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, rx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
+
+	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
+		goto out;
+
+	ret = bpf_xdp_attach(rx_ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto out;
+
+	sock_fd = xsk_socket__fd(rx_xsk.socket);
+	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	/* Send packet destined to RX AF_XDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate AF_XDP_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify AF_XDP RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
+		       "verify_xsk_metadata"))
+		goto out;
+
+	complete_tx(&tx_xsk);
+
+out:
+	close_xsk(&rx_xsk);
+	close_xsk(&tx_xsk);
+	if (bpf_obj)
+		xdp_metadata__destroy(bpf_obj);
+	system("ip netns del xdp_metadata");
+	if (tok)
+		close_netns(tok);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
new file mode 100644
index 000000000000..bdde17961ab6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
+extern const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta;
+	int ret;
+
+	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
+		__u64 rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+
+		if (rx_timestamp) {
+			ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(rx_timestamp));
+			if (ret != 0)
+				return XDP_DROP;
+
+			data = (void *)(long)ctx->data;
+			data_meta = (void *)(long)ctx->data_meta;
+
+			if (data_meta + sizeof(rx_timestamp) > data)
+				return XDP_DROP;
+
+			*(__u64 *)data_meta = rx_timestamp;
+		}
+	}
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH bpf-next 08/11] selftests/bpf: Verify xdp_metadata xdp->skb path
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 07/11] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 09/11] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- divert UDP traffic destined to port 9081 to the kernel
- call bpf_xdp_metadata_export_to_skb for such packets
- the kernel should fill in the skb hwtstamp from the exported metadata
  (sketched below)
- verify that the received packet has a non-zero hwtstamp
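
For reference, the step that makes the hwtstamp visible to the receiving
socket is roughly the following (a simplified sketch of what the
skb_metadata_import_from_xdp() helper from the earlier patch does for
rx_timestamp; the import_rx_timestamp name below is illustrative only):

  static void import_rx_timestamp(struct sk_buff *skb,
  				  const struct xdp_skb_metadata *meta)
  {
  	/* exported XDP rx_timestamp becomes the skb hardware timestamp */
  	skb_hwtstamps(skb)->hwtstamp = meta->rx_timestamp;
  }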

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/DENYLIST.s390x    |  1 +
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 81 +++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        | 64 +++++++++++++++
 3 files changed, 146 insertions(+)

diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index be4e3d47ea3e..2fa350c8ab42 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -78,4 +78,5 @@ xdp_adjust_tail                          # case-128 err 0 errno 28 retval 1 size
 xdp_bonding                              # failed to auto-attach program 'trace_on_entry': -524                        (trampoline)
 xdp_bpf2bpf                              # failed to auto-attach program 'trace_on_entry': -524                        (trampoline)
 xdp_do_redirect                          # prog_run_max_size unexpected error: -22 (errno 22)
+xdp_metadata                             # JIT does not support push/pop opcodes                                       (jit)
 xdp_synproxy                             # JIT does not support calling kernel function                                (kfunc)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index c3321d8c7cd4..b67a4dcfca6e 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -19,6 +19,7 @@
 
 #define AF_XDP_SOURCE_PORT 1234
 #define AF_XDP_CONSUMER_PORT 8080
+#define SOCKET_CONSUMER_PORT 9081
 
 #define UMEM_NUM 16
 #define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
@@ -275,6 +276,61 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	return 0;
 }
 
+static void timestamping_enable(int fd, int val)
+{
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	ASSERT_OK(ret, "setsockopt(SO_TIMESTAMPING)");
+}
+
+static int verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+	bool found_hwtstamp = false;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (ASSERT_GE(recvmsg(fd, &hdr, 0), 0, "recvmsg")) {
+		for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+		     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+			if (cmsg->cmsg_level != SOL_SOCKET)
+				continue;
+
+			switch (cmsg->cmsg_type) {
+			case SCM_TIMESTAMPING:
+				ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+				if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+					found_hwtstamp = true;
+					break;
+				}
+				break;
+			default:
+				break;
+			}
+		}
+	}
+
+	if (!ASSERT_EQ(found_hwtstamp, true, "no hwtstamp!"))
+		return -1;
+
+	return 0;
+}
+
 void test_xdp_metadata(void)
 {
 	struct xdp_metadata *bpf_obj = NULL;
@@ -283,6 +339,7 @@ void test_xdp_metadata(void)
 	struct bpf_program *prog;
 	struct xsk tx_xsk = {};
 	struct xsk rx_xsk = {};
+	int rx_udp_fd = -1;
 	int rx_ifindex;
 	int sock_fd;
 	int ret;
@@ -299,6 +356,8 @@ void test_xdp_metadata(void)
 	SYS("ip link set dev " RX_NAME " up");
 	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
 	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+	SYS("sysctl -q net.ipv4.ip_forward=1");
+	SYS("sysctl -q net.ipv4.conf.all.accept_local=1");
 
 	rx_ifindex = if_nametoindex(RX_NAME);
 
@@ -312,6 +371,15 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
 		goto out;
 
+	/* Set up a UDP listener on the RX interface. */
+
+	rx_udp_fd = start_server(FAMILY, SOCK_DGRAM, NULL, SOCKET_CONSUMER_PORT, 1000);
+	if (!ASSERT_GE(rx_udp_fd, 0, "start_server"))
+		goto out;
+	timestamping_enable(rx_udp_fd,
+			    SOF_TIMESTAMPING_SOFTWARE |
+			    SOF_TIMESTAMPING_RAW_HARDWARE);
+
 	/* Attach BPF program to RX interface. */
 
 	bpf_obj = xdp_metadata__open();
@@ -348,9 +416,22 @@ void test_xdp_metadata(void)
 
 	complete_tx(&tx_xsk);
 
+	/* Send packet destined to RX UDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, SOCKET_CONSUMER_PORT), 0,
+		       "generate SOCKET_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify SKB RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_skb_metadata(rx_udp_fd), 0,
+		       "verify_skb_metadata"))
+		goto out;
+
+	complete_tx(&tx_xsk);
+
 out:
 	close_xsk(&rx_xsk);
 	close_xsk(&tx_xsk);
+	close(rx_udp_fd);
 	if (bpf_obj)
 		xdp_metadata__destroy(bpf_obj);
 	system("ip netns del xdp_metadata");
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index bdde17961ab6..805178f55743 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -17,15 +17,79 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+extern int bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx) __ksym;
 extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
 extern const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
 {
+	struct xdp_skb_metadata *skb_metadata;
 	void *data, *data_meta;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct iphdr *iph = NULL;
+	void *data_end;
 	int ret;
 
+	/* Exercise xdp -> skb metadata path by diverting some traffic
+	 * into the kernel (UDP destination port 9081).
+	 */
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+	if (udp && udp->dest == bpf_htons(9081)) {
+		bpf_printk("exporting metadata to skb for UDP port 9081");
+
+		if (bpf_xdp_metadata_export_to_skb(ctx) < 0) {
+			bpf_printk("bpf_xdp_metadata_export_to_skb failed");
+			return XDP_DROP;
+		}
+
+		/* Make sure metadata can't be adjusted after a call
+		 * to bpf_xdp_metadata_export_to_skb().
+		 */
+
+		ret = bpf_xdp_adjust_meta(ctx, -4);
+		if (ret == 0) {
+			bpf_printk("bpf_xdp_adjust_meta -4 after bpf_xdp_metadata_export_to_skb succeeded");
+			return XDP_DROP;
+		}
+
+		/* Make sure calling bpf_xdp_metadata_export_to_skb()
+		 * a second time is a no-op.
+		 */
+
+		if (bpf_xdp_metadata_export_to_skb(ctx) == 0) {
+			bpf_printk("bpf_xdp_metadata_export_to_skb succeeded 2nd time");
+			return XDP_DROP;
+		}
+
+		skb_metadata = ctx->skb_metadata;
+		if (!skb_metadata) {
+			bpf_printk("no ctx->skb_metadata");
+			return XDP_DROP;
+		}
+
+		if (!skb_metadata->rx_timestamp) {
+			bpf_printk("no skb_metadata->rx_timestamp");
+			return XDP_DROP;
+		}
+
+		/*return bpf_redirect(ifindex, BPF_F_INGRESS);*/
+		return XDP_PASS;
+	}
+
 	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
 		__u64 rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
 
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH bpf-next 09/11] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 08/11] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 10/11] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
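
The point of the wrapper is that, once the driver passes &mxbuf.xdp to
bpf_prog_run_xdp(), code that only sees the xdp_buff pointer can get back
to the surrounding driver context, roughly like this (sketch only; the
helper name is illustrative and not part of this patch):

  static struct mlx4_xdp_buff *to_mlx4_xdp_buff(struct xdp_buff *xdp)
  {
  	return container_of(xdp, struct mlx4_xdp_buff, xdp);
  }

The next patch relies on this to reach the CQE and mdev from the unrolled
rx timestamp kfunc.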

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8f762fc170b3..467356633172 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -661,17 +661,21 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+struct mlx4_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
+	struct mlx4_xdp_buff mxbuf;
 	struct bpf_prog *xdp_prog;
 	int cq_ring = cq->ring;
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];
 
 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;
 
@@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			orig_data = mxbuf.xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}
 
 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH bpf-next 10/11] mxl4: Support rx timestamp metadata for xdp
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 09/11] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15  3:02 ` [PATCH bpf-next 11/11] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
  2022-11-15 15:54 ` [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs Toke Høiland-Jørgensen
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Support rx timestamp metadata. Also use the exported xdp_skb metadata upon
XDP_PASS when available, to avoid reading the timestamp twice; note that
only rx_timestamp is supported for now.
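
Condensed from the diff below, the XDP_PASS path becomes (pseudocode
summary only):

  if (xdp_convert_skb_metadata(&mxbuf.xdp, skb))
  	goto skip_metadata;	/* timestamp already taken from XDP metadata */
  /* otherwise fall back to mlx4_en_get_cqe_ts(cqe), as before */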

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 42 ++++++++++++++++++-
 include/linux/mlx4/device.h                   |  7 ++++
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 8800d3f1f55c..9489476bab8f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2855,6 +2855,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
 	.ndo_bpf		= mlx4_xdp,
+	.ndo_unroll_kfunc	= mlx4_unroll_kfunc,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2887,6 +2888,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
 	.ndo_bpf		= mlx4_xdp,
+	.ndo_unroll_kfunc	= mlx4_unroll_kfunc,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 467356633172..722a4d56e0b0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -33,6 +33,7 @@
 
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/mlx4/cq.h>
 #include <linux/slab.h>
 #include <linux/mlx4/qp.h>
@@ -663,8 +664,39 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 
 struct mlx4_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
 };
 
+u64 mxl4_xdp_rx_timestamp(struct mlx4_xdp_buff *ctx)
+{
+	unsigned int seq;
+	u64 timestamp;
+	u64 nsec;
+
+	timestamp = mlx4_en_get_cqe_ts(ctx->cqe);
+
+	do {
+		seq = read_seqbegin(&ctx->mdev->clock_lock);
+		nsec = timecounter_cyc2time(&ctx->mdev->clock, timestamp);
+	} while (read_seqretry(&ctx->mdev->clock_lock, seq));
+
+	return ns_to_ktime(nsec);
+}
+
+void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		       struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		return xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		bpf_patch_append(patch, BPF_EMIT_CALL(mxl4_xdp_rx_timestamp));
+	}
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -781,8 +813,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						DMA_FROM_DEVICE);
 
 			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
-					 frags[0].page_offset, length, false);
+					 frags[0].page_offset, length, true);
 			orig_data = mxbuf.xdp.data;
+			if (unlikely(ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL)) {
+				mxbuf.cqe = cqe;
+				mxbuf.mdev = priv->mdev;
+			}
 
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
@@ -835,6 +871,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		if (unlikely(!skb))
 			goto next;
 
+		if (xdp_convert_skb_metadata(&mxbuf.xdp, skb))
+			goto skip_metadata;
+
 		if (unlikely(ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL)) {
 			u64 timestamp = mlx4_en_get_cqe_ts(cqe);
 
@@ -895,6 +934,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD),
 					       be16_to_cpu(cqe->sl_vid));
 
+skip_metadata:
 		nr = mlx4_en_complete_rx_desc(priv, frags, skb, length);
 		if (likely(nr)) {
 			skb_shinfo(skb)->nr_frags = nr;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6646634a0b9d..a0e4d490b2fb 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -1585,4 +1585,11 @@ static inline int mlx4_get_num_reserved_uar(struct mlx4_dev *dev)
 	/* The first 128 UARs are used for EQ doorbells */
 	return (128 >> (PAGE_SHIFT - dev->uar_page_shift));
 }
+
+struct bpf_prog;
+struct bpf_insn;
+struct bpf_patch;
+
+void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		       struct bpf_patch *patch);
 #endif /* MLX4_DEVICE_H */
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH bpf-next 11/11] selftests/bpf: Simple program to dump XDP RX metadata
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 10/11] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-15  3:02 ` Stanislav Fomichev
  2022-11-15 15:54 ` [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs Toke Høiland-Jørgensen
  11 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15  3:02 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa

To be used for verification of driver implementations:

$ xdp_hw_metadata <ifname>

On the other machine:

$ echo -n xdp | nc -u -q1 <target> 9091 # for AF_XDP
$ echo -n skb | nc -u -q1 <target> 9092 # for skb

Sample output:

  # xdp
  xsk_ring_cons__peek: 1
  0x19f9090: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
  rx_timestamp_supported: 1
  rx_timestamp: 1667850075063948829
  0x19f9090: complete idx=8 addr=8000

  # skb
  found skb hwtstamp = 1668314052.854274681

Decoding:
  # xdp
  rx_timestamp=1667850075.063948829

  $ date -d @1667850075
  Mon Nov  7 11:41:15 AM PST 2022
  $ date
  Mon Nov  7 11:42:05 AM PST 2022

  # skb
  $ date -d @1668314052
  Sat Nov 12 08:34:12 PM PST 2022
  $ date
  Sat Nov 12 08:37:06 PM PST 2022
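
The decoding above is a plain nanosecond split; for example (illustrative
only):

  __u64 ns = 1667850075063948829ULL;	/* rx_timestamp from the sample */
  printf("%llu.%09llu\n", ns / 1000000000ULL, ns % 1000000000ULL);
  /* -> 1667850075.063948829; the integer part feeds `date -d @...` */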

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  99 +++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 404 ++++++++++++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.h |   6 +
 5 files changed, 515 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.h

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 07d2d0a8c5cb..01e3baeefd4f 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -46,3 +46,4 @@ test_cpp
 xskxceiver
 xdp_redirect_multi
 xdp_synproxy
+xdp_hw_metadata
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b645cf5a5021..74d6ed307157 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
-	xskxceiver xdp_redirect_multi xdp_synproxy veristat
+	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata
 
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file
 TEST_GEN_FILES += liburandom_read.so
@@ -241,6 +241,9 @@ $(OUTPUT)/test_maps: $(TESTING_HELPERS)
 $(OUTPUT)/test_verifier: $(TESTING_HELPERS) $(CAP_HELPERS)
 $(OUTPUT)/xsk.o: $(BPFOBJ)
 $(OUTPUT)/xskxceiver: $(OUTPUT)/xsk.o
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/xsk.o $(OUTPUT)/xdp_hw_metadata.skel.h
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/network_helpers.o
+$(OUTPUT)/xdp_hw_metadata: LDFLAGS += -static
 
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)    \
@@ -379,6 +382,7 @@ linked_maps.skel.h-deps := linked_maps1.bpf.o linked_maps2.bpf.o
 test_subskeleton.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o test_subskeleton.bpf.o
 test_subskeleton_lib.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o
 test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
+xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
new file mode 100644
index 000000000000..549ec3b1f3a0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "xdp_hw_metadata.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern int bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx) __ksym;
+extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
+extern const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta, *data_end;
+	struct ipv6hdr *ip6h = NULL;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct xsk_metadata *meta;
+	struct iphdr *iph = NULL;
+	int ret;
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+			ip6h = (void *)(eth + 1);
+			if (ip6h + 1 < data_end && ip6h->nexthdr == IPPROTO_UDP)
+				udp = (void *)(ip6h + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+
+	if (!udp)
+		return XDP_PASS;
+
+	if (udp->dest == bpf_htons(9092)) {
+		bpf_printk("forwarding UDP:9092 to socket listener");
+
+		if (bpf_xdp_metadata_export_to_skb(ctx) < 0) {
+			bpf_printk("bpf_xdp_metadata_export_to_skb failed");
+			return XDP_DROP;
+		}
+
+		return XDP_PASS;
+	}
+
+	if (udp->dest != bpf_htons(9091))
+		return XDP_PASS;
+
+	bpf_printk("forwarding UDP:9091 to AF_XDP");
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xsk_metadata));
+	if (ret != 0) {
+		bpf_printk("bpf_xdp_adjust_meta returned %d", ret);
+		return XDP_PASS;
+	}
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+	meta = data_meta;
+
+	if (meta + 1 > data) {
+		bpf_printk("bpf_xdp_adjust_meta doesn't appear to work");
+		return XDP_PASS;
+	}
+
+
+	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
+		meta->rx_timestamp_supported = 1;
+		meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+		bpf_printk("populated rx_timestamp with %u", meta->rx_timestamp);
+	}
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
new file mode 100644
index 000000000000..a043e9ef5691
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Reference program for verifying XDP metadata on real HW. Functional test
+ * only, doesn't test the performance.
+ *
+ * RX:
+ * - UDP 9091 packets are diverted into AF_XDP
+ * - Metadata verified:
+ *   - rx_timestamp
+ *
+ * TX:
+ * - TBD
+ */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_hw_metadata.skel.h"
+#include "xsk.h"
+
+#include <error.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <linux/sockios.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#include "xdp_hw_metadata.h"
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+struct xdp_hw_metadata *bpf_obj;
+struct xsk *rx_xsk;
+const char *ifname;
+int ifindex;
+int rxq;
+
+void test__fail(void) { /* for network_helpers.c */ }
+
+static int open_xsk(const char *ifname, struct xsk *xsk, __u32 queue_id)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (xsk->umem_area == MAP_FAILED)
+		return -ENOMEM;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (ret)
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, queue_id,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (ret)
+		return ret;
+
+	/* First half of umem is for TX. This way the address matches 1-to-1
+	 * with the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void verify_xdp_metadata(void *data)
+{
+	struct xsk_metadata *meta;
+
+	meta = data - sizeof(*meta);
+
+	printf("rx_timestamp_supported: %u\n", meta->rx_timestamp_supported);
+	printf("rx_timestamp: %llu\n", meta->rx_timestamp);
+}
+
+static void verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (recvmsg(fd, &hdr, 0) < 0)
+		error(-1, errno, "recvmsg");
+
+	for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			continue;
+
+		switch (cmsg->cmsg_type) {
+		case SCM_TIMESTAMPING:
+			ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+			if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+				printf("found skb hwtstamp = %lu.%lu\n",
+				       ts->ts[2].tv_sec, ts->ts[2].tv_nsec);
+				return;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	printf("skb hwtstamp is not found!\n");
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds[rxq + 1];
+	__u64 comp_addr;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+	int i;
+
+	for (i = 0; i < rxq; i++) {
+		fds[i].fd = xsk_socket__fd(rx_xsk[i].socket);
+		fds[i].events = POLLIN;
+		fds[i].revents = 0;
+	}
+
+	fds[rxq].fd = server_fd;
+	fds[rxq].events = POLLIN;
+	fds[rxq].revents = 0;
+
+	while (true) {
+		errno = 0;
+		ret = poll(fds, rxq + 1, 1000);
+		printf("poll: %d (%d)\n", ret, errno);
+		if (ret < 0)
+			break;
+		if (ret == 0)
+			continue;
+
+		if (fds[rxq].revents)
+			verify_skb_metadata(server_fd);
+
+		for (i = 0; i < rxq; i++) {
+			if (fds[i].revents == 0)
+				continue;
+
+			struct xsk *xsk = &rx_xsk[i];
+
+			ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+			printf("xsk_ring_cons__peek: %d\n", ret);
+			if (ret != 1)
+				continue;
+
+			rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+			comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+			addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+			printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+			       xsk, idx, rx_desc->addr, addr, comp_addr);
+			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr));
+			xsk_ring_cons__release(&xsk->rx, 1);
+			refill_rx(xsk, comp_addr);
+		}
+	}
+
+	return 0;
+}
+
+struct ethtool_channels {
+	__u32	cmd;
+	__u32	max_rx;
+	__u32	max_tx;
+	__u32	max_other;
+	__u32	max_combined;
+	__u32	rx_count;
+	__u32	tx_count;
+	__u32	other_count;
+	__u32	combined_count;
+};
+
+#define ETHTOOL_GCHANNELS	0x0000003c /* Get no of channels */
+
+static int rxq_num(const char *ifname)
+{
+	struct ethtool_channels ch = {
+		.cmd = ETHTOOL_GCHANNELS,
+	};
+
+	struct ifreq ifr = {
+		.ifr_data = (void *)&ch,
+	};
+	strcpy(ifr.ifr_name, ifname);
+	int fd, ret;
+
+	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	if (fd < 0)
+		error(-1, errno, "socket");
+
+	ret = ioctl(fd, SIOCETHTOOL, &ifr);
+	if (ret < 0)
+		error(-1, errno, "socket");
+
+	close(fd);
+
+	return ch.rx_count;
+}
+
+static void cleanup(void)
+{
+	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
+	int ret;
+	int i;
+
+	if (bpf_obj) {
+		opts.old_prog_fd = bpf_program__fd(bpf_obj->progs.rx);
+		if (opts.old_prog_fd >= 0) {
+			printf("detaching bpf program....\n");
+			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
+			if (ret)
+				printf("failed to detach XDP program: %d\n", ret);
+		}
+	}
+
+	for (i = 0; i < rxq; i++)
+		close_xsk(&rx_xsk[i]);
+
+	if (bpf_obj)
+		xdp_hw_metadata__destroy(bpf_obj);
+}
+
+static void handle_signal(int sig)
+{
+	/* interrupting poll() is all we need */
+}
+
+static void timestamping_enable(int fd, int val)
+{
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	if (ret < 0)
+		error(-1, errno, "setsockopt(SO_TIMESTAMPING)");
+}
+
+int main(int argc, char *argv[])
+{
+	int server_fd = -1;
+	int ret;
+	int i;
+
+	struct bpf_program *prog;
+
+	if (argc != 2) {
+		fprintf(stderr, "pass device name\n");
+		return -1;
+	}
+
+	ifname = argv[1];
+	ifindex = if_nametoindex(ifname);
+	rxq = rxq_num(ifname);
+
+	printf("rxq: %d\n", rxq);
+
+	rx_xsk = malloc(sizeof(struct xsk) * rxq);
+	if (!rx_xsk)
+		error(-1, ENOMEM, "malloc");
+
+	for (i = 0; i < rxq; i++) {
+		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
+		ret = open_xsk(ifname, &rx_xsk[i], i);
+		if (ret)
+			error(-1, -ret, "open_xsk");
+
+		printf("xsk_socket__fd() -> %d\n", xsk_socket__fd(rx_xsk[i].socket));
+	}
+
+	printf("open bpf program...\n");
+	bpf_obj = xdp_hw_metadata__open();
+	if (libbpf_get_error(bpf_obj))
+		error(-1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
+
+	printf("load bpf program...\n");
+	ret = xdp_hw_metadata__load(bpf_obj);
+	if (ret)
+		error(-1, -ret, "xdp_hw_metadata__load");
+
+	printf("prepare skb endpoint...\n");
+	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
+	if (server_fd < 0)
+		error(-1, errno, "start_server");
+	timestamping_enable(server_fd,
+			    SOF_TIMESTAMPING_SOFTWARE |
+			    SOF_TIMESTAMPING_RAW_HARDWARE);
+
+	printf("prepare xsk map...\n");
+	for (i = 0; i < rxq; i++) {
+		int sock_fd = xsk_socket__fd(rx_xsk[i].socket);
+		__u32 queue_id = i;
+
+		printf("map[%d] = %d\n", queue_id, sock_fd);
+		ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+		if (ret)
+			error(-1, -ret, "bpf_map_update_elem");
+	}
+
+	printf("attach bpf program...\n");
+	ret = bpf_xdp_attach(ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (ret)
+		error(-1, -ret, "bpf_xdp_attach");
+
+	signal(SIGINT, handle_signal);
+	ret = verify_metadata(rx_xsk, rxq, server_fd);
+	close(server_fd);
+	cleanup();
+	if (ret)
+		error(-1, -ret, "verify_metadata");
+}
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.h b/tools/testing/selftests/bpf/xdp_hw_metadata.h
new file mode 100644
index 000000000000..b4580015ee93
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+struct xsk_metadata {
+	__u32 rx_timestamp_supported:1;
+	__u64 rx_timestamp;
+};
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs
  2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2022-11-15  3:02 ` [PATCH bpf-next 11/11] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
@ 2022-11-15 15:54 ` Toke Høiland-Jørgensen
  2022-11-15 18:37   ` Stanislav Fomichev
  11 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 15:54 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> - drop __randomize_layout
>
>   Not sure it's possible to sanely expose it via UAPI. Because every
>   .o potentially gets its own randomized layout, test_progs
>   refuses to link.

So this won't work if the struct is in a kernel-supplied UAPI header
(which would include the __randomize_layout tag). But if it's *not* in a
UAPI header it should still be included in a stable form (i.e., without
the randomize tag) in vmlinux.h, right? Which would be the point:
consumers would be forced to read it from there and do CO-RE on it...
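
For example, a consumer reading it out of vmlinux.h with CO-RE could look
something like this (sketch only; assumes the struct/field names from this
series end up in BTF):

  #include "vmlinux.h"
  #include <bpf/bpf_core_read.h>

  static __u64 read_skb_metadata_rx_timestamp(struct xdp_skb_metadata *meta)
  {
  	if (!bpf_core_field_exists(meta->rx_timestamp))
  		return 0;	/* running kernel doesn't expose the field */
  	return BPF_CORE_READ(meta, rx_timestamp);
  }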

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-15  3:02 ` [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
@ 2022-11-15 16:16   ` Toke Høiland-Jørgensen
  2022-11-15 18:37     ` Stanislav Fomichev
  2022-11-16 20:42   ` Jakub Kicinski
  1 sibling, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 16:16 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> Kfuncs have to be defined with KF_UNROLL for an attempted unroll.
> For now, only XDP programs can have their kfuncs unrolled, but
> we can extend this later on if more programs would like to use it.
>
> For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> implements all possible metatada kfuncs. Not all devices have to
> implement them. If unrolling is not supported by the target device,
> the default implementation is called instead. The default
> implementation is unconditionally unrolled to 'return false/0/NULL'
> for now.
>
> Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> we treat prog_index as target device for kfunc unrolling.
> net_device_ops gains new ndo_unroll_kfunc which does the actual
> dirty work per device.
>
> The kfunc unrolling itself largely follows the existing map_gen_lookup
> unrolling example, so there is nothing new here.
>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  Documentation/bpf/kfuncs.rst   |  8 +++++
>  include/linux/bpf.h            |  1 +
>  include/linux/btf.h            |  1 +
>  include/linux/btf_ids.h        |  4 +++
>  include/linux/netdevice.h      |  5 +++
>  include/net/xdp.h              | 24 +++++++++++++
>  include/uapi/linux/bpf.h       |  5 +++
>  kernel/bpf/syscall.c           | 28 ++++++++++++++-
>  kernel/bpf/verifier.c          | 65 ++++++++++++++++++++++++++++++++++
>  net/core/dev.c                 |  7 ++++
>  net/core/xdp.c                 | 39 ++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  5 +++
>  12 files changed, 191 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 0f858156371d..1723de2720bb 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
> @@ -169,6 +169,14 @@ rebooting or panicking. Due to this additional restrictions apply to these
>  calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
>  added later.
>  
> +2.4.8 KF_UNROLL flag
> +-----------------------
> +
> +The KF_UNROLL flag is used for kfuncs that the verifier can attempt to unroll.
> +Unrolling is currently implemented only for XDP programs' metadata kfuncs.
> +The main motivation behind unrolling is to remove function call overhead
> +and allow efficient inlined kfuncs to be generated.
> +
>  2.5 Registering the kfuncs
>  --------------------------
>  
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 798aec816970..bf8936522dd9 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1240,6 +1240,7 @@ struct bpf_prog_aux {
>  		struct work_struct work;
>  		struct rcu_head	rcu;
>  	};
> +	const struct net_device_ops *xdp_kfunc_ndo;
>  };
>  
>  struct bpf_prog {
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index d80345fa566b..950cca997a5a 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -51,6 +51,7 @@
>  #define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
>  #define KF_SLEEPABLE    (1 << 5) /* kfunc may sleep */
>  #define KF_DESTRUCTIVE  (1 << 6) /* kfunc performs destructive actions */
> +#define KF_UNROLL       (1 << 7) /* kfunc unrolling can be attempted */
>  
>  /*
>   * Return the name of the passed struct, if exists, or halt the build if for
> diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> index c9744efd202f..eb448e9c79bb 100644
> --- a/include/linux/btf_ids.h
> +++ b/include/linux/btf_ids.h
> @@ -195,6 +195,10 @@ asm(							\
>  __BTF_ID_LIST(name, local)				\
>  __BTF_SET8_START(name, local)
>  
> +#define BTF_SET8_START_GLOBAL(name)			\
> +__BTF_ID_LIST(name, global)				\
> +__BTF_SET8_START(name, global)
> +
>  #define BTF_SET8_END(name)				\
>  asm(							\
>  ".pushsection " BTF_IDS_SECTION ",\"a\";      \n"	\
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 02a2318da7c7..2096b4f00e4b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -73,6 +73,8 @@ struct udp_tunnel_info;
>  struct udp_tunnel_nic_info;
>  struct udp_tunnel_nic;
>  struct bpf_prog;
> +struct bpf_insn;
> +struct bpf_patch;
>  struct xdp_buff;
>  
>  void synchronize_net(void);
> @@ -1604,6 +1606,9 @@ struct net_device_ops {
>  	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
>  						  const struct skb_shared_hwtstamps *hwtstamps,
>  						  bool cycles);
> +	void			(*ndo_unroll_kfunc)(const struct bpf_prog *prog,
> +						    u32 func_id,
> +						    struct bpf_patch *patch);
>  };
>  
>  /**
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 55dbc68bfffc..2a82a98f2f9f 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -7,6 +7,7 @@
>  #define __LINUX_NET_XDP_H__
>  
>  #include <linux/skbuff.h> /* skb_shared_info */
> +#include <linux/btf_ids.h> /* btf_id_set8 */
>  
>  /**
>   * DOC: XDP RX-queue information
> @@ -409,4 +410,27 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  
>  #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
>  
> +#define XDP_METADATA_KFUNC_xxx	\
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> +			   bpf_xdp_metadata_rx_timestamp_supported) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> +			   bpf_xdp_metadata_rx_timestamp) \
> +
> +enum {
> +#define XDP_METADATA_KFUNC(name, str) name,
> +XDP_METADATA_KFUNC_xxx
> +#undef XDP_METADATA_KFUNC
> +MAX_XDP_METADATA_KFUNC,
> +};
> +
> +#ifdef CONFIG_DEBUG_INFO_BTF
> +extern struct btf_id_set8 xdp_metadata_kfunc_ids;
> +static inline u32 xdp_metadata_kfunc_id(int id)
> +{
> +	return xdp_metadata_kfunc_ids.pairs[id].id;
> +}
> +#else
> +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> +#endif
> +
>  #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fb4c911d2a03..b444b1118c4f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1156,6 +1156,11 @@ enum bpf_link_type {
>   */
>  #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
>  
> +/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
> + * program becomes device-bound but can access its XDP metadata.
> + */
> +#define BPF_F_XDP_HAS_METADATA	(1U << 6)
> +
>  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
>   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
>   */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 85532d301124..597c41949910 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2426,6 +2426,20 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
>  /* last field in 'union bpf_attr' used by this command */
>  #define	BPF_PROG_LOAD_LAST_FIELD core_relo_rec_size
>  
> +static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct net_device *dev;
> +
> +	for_each_netdev(net, dev) {
> +		if (dev->ifindex == ifindex) {

So this is basically dev_get_by_index(), except you're not doing
dev_hold()? Which also means there's no protection against the netdev
going away?

> +			prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
> +			return 0;
> +		}
> +	}

> +	return -EINVAL;
> +}
> +
>  static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>  {
>  	enum bpf_prog_type type = attr->prog_type;
> @@ -2443,7 +2457,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>  				 BPF_F_TEST_STATE_FREQ |
>  				 BPF_F_SLEEPABLE |
>  				 BPF_F_TEST_RND_HI32 |
> -				 BPF_F_XDP_HAS_FRAGS))
> +				 BPF_F_XDP_HAS_FRAGS |
> +				 BPF_F_XDP_HAS_METADATA))
>  		return -EINVAL;
>  
>  	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> @@ -2531,6 +2546,17 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
>  	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
>  	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
>  
> +	if (attr->prog_flags & BPF_F_XDP_HAS_METADATA) {
> +		/* Reuse prog_ifindex to carry request to unroll
> +		 * metadata kfuncs.
> +		 */
> +		prog->aux->offload_requested = false;
> +
> +		err = xdp_resolve_netdev(prog, attr->prog_ifindex);
> +		if (err < 0)
> +			goto free_prog;
> +	}
> +
>  	err = security_bpf_prog_alloc(prog->aux);
>  	if (err)
>  		goto free_prog;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 07c0259dfc1a..b657ed6eb277 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9,6 +9,7 @@
>  #include <linux/types.h>
>  #include <linux/slab.h>
>  #include <linux/bpf.h>
> +#include <linux/bpf_patch.h>
>  #include <linux/btf.h>
>  #include <linux/bpf_verifier.h>
>  #include <linux/filter.h>
> @@ -14015,6 +14016,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
>  	return err;
>  }
>  
> +static int unroll_kfunc_call(struct bpf_verifier_env *env,
> +			     struct bpf_insn *insn,
> +			     struct bpf_patch *patch)
> +{
> +	enum bpf_prog_type prog_type;
> +	struct bpf_prog_aux *aux;
> +	struct btf *desc_btf;
> +	u32 *kfunc_flags;
> +	u32 func_id;
> +
> +	desc_btf = find_kfunc_desc_btf(env, insn->off);
> +	if (IS_ERR(desc_btf))
> +		return PTR_ERR(desc_btf);
> +
> +	prog_type = resolve_prog_type(env->prog);
> +	func_id = insn->imm;
> +
> +	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, prog_type, func_id);
> +	if (!kfunc_flags)
> +		return 0;
> +	if (!(*kfunc_flags & KF_UNROLL))
> +		return 0;
> +	if (prog_type != BPF_PROG_TYPE_XDP)
> +		return 0;

Should this just handle XDP_METADATA_KFUNC_EXPORT_TO_SKB instead of
passing that into the driver (to avoid every driver having to
reimplement the same call to xdp_metadata_export_to_skb())?
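
I.e., roughly something like this in unroll_kfunc_call() (sketch; assumes an
XDP_METADATA_KFUNC_EXPORT_TO_SKB id and that the generic implementation is
just a call to xdp_metadata_export_to_skb()):

	/* same for every device, so no need to bounce through the ndo */
	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
		bpf_patch_append(patch,
				 BPF_EMIT_CALL(xdp_metadata_export_to_skb));
		return 0;
	}

	/* only the truly device-specific kfuncs fall through to
	 * aux->xdp_kfunc_ndo->ndo_unroll_kfunc() */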

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-15  3:02 ` [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-15 16:17   ` Toke Høiland-Jørgensen
  2022-11-15 18:37     ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 16:17 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> The goal is to enable end-to-end testing of the metadata
> for AF_XDP. Current rx_timestamp kfunc returns current
> time which should be enough to exercise this new functionality.
>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/veth.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 2a4592780141..c626580a2294 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -25,6 +25,7 @@
>  #include <linux/filter.h>
>  #include <linux/ptr_ring.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_patch.h>
>  #include <linux/net_tstamp.h>
>  
>  #define DRV_NAME	"veth"
> @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>  
> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +			      struct bpf_patch *patch)
> +{
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +		/* return true; */
> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> +		/* return ktime_get_mono_fast_ns(); */
> +		bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> +	}
> +}

So these look reasonable enough, but would be good to see some examples
of kfunc implementations that don't just BPF_CALL to a kernel function
(with those helper wrappers we were discussing before).
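
To be concrete, something along these lines is what I'd like to see, i.e. an
unroll that emits a direct load instead of a call (the wrapper struct and
field names below are completely made up, just to illustrate):

	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
		/* r1 holds the ctx argument; assuming the driver wraps
		 * xdp_buff and stashes the hw timestamp next to it:
		 * return ((struct veth_xdp_buff *)ctx)->hw_rx_timestamp;
		 */
		bpf_patch_append(patch,
				 BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
					     offsetof(struct veth_xdp_buff,
						      hw_rx_timestamp)));
	}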

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-15 16:17   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-15 18:37     ` Stanislav Fomichev
  2022-11-15 22:46       ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15 18:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > The goal is to enable end-to-end testing of the metadata
> > for AF_XDP. Current rx_timestamp kfunc returns current
> > time which should be enough to exercise this new functionality.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  drivers/net/veth.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 2a4592780141..c626580a2294 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/filter.h>
> >  #include <linux/ptr_ring.h>
> >  #include <linux/bpf_trace.h>
> > +#include <linux/bpf_patch.h>
> >  #include <linux/net_tstamp.h>
> >
> >  #define DRV_NAME     "veth"
> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       }
> >  }
> >
> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > +                           struct bpf_patch *patch)
> > +{
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > +             /* return true; */
> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > +             /* return ktime_get_mono_fast_ns(); */
> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > +     }
> > +}
>
> So these look reasonable enough, but would be good to see some examples
> of kfunc implementations that don't just BPF_CALL to a kernel function
> (with those helper wrappers we were discussing before).

Let's maybe add them if/when needed as we add more metadata support?
xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
examples, so it shouldn't be a problem to resurrect them at some
point?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-15 16:16   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-15 18:37     ` Stanislav Fomichev
  0 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15 18:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Nov 15, 2022 at 8:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > Kfuncs have to be defined with KF_UNROLL for an attempted unroll.
> > For now, only XDP programs can have their kfuncs unrolled, but
> > we can extend this later on if more programs would like to use it.
> >
> > For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
> > implements all possible metadata kfuncs. Not all devices have to
> > implement them. If unrolling is not supported by the target device,
> > the default implementation is called instead. The default
> > implementation is unconditionally unrolled to 'return false/0/NULL'
> > for now.
> >
> > Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
> > we treat prog_ifindex as the target device for kfunc unrolling.
> > net_device_ops gains new ndo_unroll_kfunc which does the actual
> > dirty work per device.
> >
> > The kfunc unrolling itself largely follows the existing map_gen_lookup
> > unrolling example, so there is nothing new here.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  Documentation/bpf/kfuncs.rst   |  8 +++++
> >  include/linux/bpf.h            |  1 +
> >  include/linux/btf.h            |  1 +
> >  include/linux/btf_ids.h        |  4 +++
> >  include/linux/netdevice.h      |  5 +++
> >  include/net/xdp.h              | 24 +++++++++++++
> >  include/uapi/linux/bpf.h       |  5 +++
> >  kernel/bpf/syscall.c           | 28 ++++++++++++++-
> >  kernel/bpf/verifier.c          | 65 ++++++++++++++++++++++++++++++++++
> >  net/core/dev.c                 |  7 ++++
> >  net/core/xdp.c                 | 39 ++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  5 +++
> >  12 files changed, 191 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> > index 0f858156371d..1723de2720bb 100644
> > --- a/Documentation/bpf/kfuncs.rst
> > +++ b/Documentation/bpf/kfuncs.rst
> > @@ -169,6 +169,14 @@ rebooting or panicking. Due to this additional restrictions apply to these
> >  calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
> >  added later.
> >
> > +2.4.8 KF_UNROLL flag
> > +-----------------------
> > +
> > +The KF_UNROLL flag is used for kfuncs that the verifier can attempt to unroll.
> > +Unrolling is currently implemented only for XDP programs' metadata kfuncs.
> > +The main motivation behind unrolling is to remove function call overhead
> > +and allow efficient inlined kfuncs to be generated.
> > +
> >  2.5 Registering the kfuncs
> >  --------------------------
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 798aec816970..bf8936522dd9 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1240,6 +1240,7 @@ struct bpf_prog_aux {
> >               struct work_struct work;
> >               struct rcu_head rcu;
> >       };
> > +     const struct net_device_ops *xdp_kfunc_ndo;
> >  };
> >
> >  struct bpf_prog {
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index d80345fa566b..950cca997a5a 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -51,6 +51,7 @@
> >  #define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
> >  #define KF_SLEEPABLE    (1 << 5) /* kfunc may sleep */
> >  #define KF_DESTRUCTIVE  (1 << 6) /* kfunc performs destructive actions */
> > +#define KF_UNROLL       (1 << 7) /* kfunc unrolling can be attempted */
> >
> >  /*
> >   * Return the name of the passed struct, if exists, or halt the build if for
> > diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> > index c9744efd202f..eb448e9c79bb 100644
> > --- a/include/linux/btf_ids.h
> > +++ b/include/linux/btf_ids.h
> > @@ -195,6 +195,10 @@ asm(                                                     \
> >  __BTF_ID_LIST(name, local)                           \
> >  __BTF_SET8_START(name, local)
> >
> > +#define BTF_SET8_START_GLOBAL(name)                  \
> > +__BTF_ID_LIST(name, global)                          \
> > +__BTF_SET8_START(name, global)
> > +
> >  #define BTF_SET8_END(name)                           \
> >  asm(                                                 \
> >  ".pushsection " BTF_IDS_SECTION ",\"a\";      \n"    \
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 02a2318da7c7..2096b4f00e4b 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -73,6 +73,8 @@ struct udp_tunnel_info;
> >  struct udp_tunnel_nic_info;
> >  struct udp_tunnel_nic;
> >  struct bpf_prog;
> > +struct bpf_insn;
> > +struct bpf_patch;
> >  struct xdp_buff;
> >
> >  void synchronize_net(void);
> > @@ -1604,6 +1606,9 @@ struct net_device_ops {
> >       ktime_t                 (*ndo_get_tstamp)(struct net_device *dev,
> >                                                 const struct skb_shared_hwtstamps *hwtstamps,
> >                                                 bool cycles);
> > +     void                    (*ndo_unroll_kfunc)(const struct bpf_prog *prog,
> > +                                                 u32 func_id,
> > +                                                 struct bpf_patch *patch);
> >  };
> >
> >  /**
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 55dbc68bfffc..2a82a98f2f9f 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -7,6 +7,7 @@
> >  #define __LINUX_NET_XDP_H__
> >
> >  #include <linux/skbuff.h> /* skb_shared_info */
> > +#include <linux/btf_ids.h> /* btf_id_set8 */
> >
> >  /**
> >   * DOC: XDP RX-queue information
> > @@ -409,4 +410,27 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >
> >  #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
> >
> > +#define XDP_METADATA_KFUNC_xxx       \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> > +                        bpf_xdp_metadata_rx_timestamp_supported) \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> > +                        bpf_xdp_metadata_rx_timestamp) \
> > +
> > +enum {
> > +#define XDP_METADATA_KFUNC(name, str) name,
> > +XDP_METADATA_KFUNC_xxx
> > +#undef XDP_METADATA_KFUNC
> > +MAX_XDP_METADATA_KFUNC,
> > +};
> > +
> > +#ifdef CONFIG_DEBUG_INFO_BTF
> > +extern struct btf_id_set8 xdp_metadata_kfunc_ids;
> > +static inline u32 xdp_metadata_kfunc_id(int id)
> > +{
> > +     return xdp_metadata_kfunc_ids.pairs[id].id;
> > +}
> > +#else
> > +static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> > +#endif
> > +
> >  #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fb4c911d2a03..b444b1118c4f 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1156,6 +1156,11 @@ enum bpf_link_type {
> >   */
> >  #define BPF_F_XDP_HAS_FRAGS  (1U << 5)
> >
> > +/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
> > + * program becomes device-bound but can access its XDP metadata.
> > + */
> > +#define BPF_F_XDP_HAS_METADATA       (1U << 6)
> > +
> >  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
> >   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
> >   */
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 85532d301124..597c41949910 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2426,6 +2426,20 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
> >  /* last field in 'union bpf_attr' used by this command */
> >  #define      BPF_PROG_LOAD_LAST_FIELD core_relo_rec_size
> >
> > +static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
> > +{
> > +     struct net *net = current->nsproxy->net_ns;
> > +     struct net_device *dev;
> > +
> > +     for_each_netdev(net, dev) {
> > +             if (dev->ifindex == ifindex) {
>
> So this is basically dev_get_by_index(), except you're not doing
> dev_hold()? Which also means there's no protection against the netdev
> going away?

Yeah, good point, will use dev_get_by_index here instead with proper refcnt..
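
Something like this (untested; xdp_netdev would be a new bpf_prog_aux field
so we can dev_put() it when the prog is freed):

static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
{
	struct net *net = current->nsproxy->net_ns;
	struct net_device *dev;

	dev = dev_get_by_index(net, ifindex);
	if (!dev)
		return -EINVAL;

	prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
	prog->aux->xdp_netdev = dev; /* reference dropped on prog free */
	return 0;
}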

> > +                     prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
> > +                     return 0;
> > +             }
> > +     }
>
> > +     return -EINVAL;
> > +}
> > +
> >  static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >  {
> >       enum bpf_prog_type type = attr->prog_type;
> > @@ -2443,7 +2457,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >                                BPF_F_TEST_STATE_FREQ |
> >                                BPF_F_SLEEPABLE |
> >                                BPF_F_TEST_RND_HI32 |
> > -                              BPF_F_XDP_HAS_FRAGS))
> > +                              BPF_F_XDP_HAS_FRAGS |
> > +                              BPF_F_XDP_HAS_METADATA))
> >               return -EINVAL;
> >
> >       if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> > @@ -2531,6 +2546,17 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
> >       prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
> >       prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
> >
> > +     if (attr->prog_flags & BPF_F_XDP_HAS_METADATA) {
> > +             /* Reuse prog_ifindex to carry request to unroll
> > +              * metadata kfuncs.
> > +              */
> > +             prog->aux->offload_requested = false;
> > +
> > +             err = xdp_resolve_netdev(prog, attr->prog_ifindex);
> > +             if (err < 0)
> > +                     goto free_prog;
> > +     }
> > +
> >       err = security_bpf_prog_alloc(prog->aux);
> >       if (err)
> >               goto free_prog;
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 07c0259dfc1a..b657ed6eb277 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/types.h>
> >  #include <linux/slab.h>
> >  #include <linux/bpf.h>
> > +#include <linux/bpf_patch.h>
> >  #include <linux/btf.h>
> >  #include <linux/bpf_verifier.h>
> >  #include <linux/filter.h>
> > @@ -14015,6 +14016,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
> >       return err;
> >  }
> >
> > +static int unroll_kfunc_call(struct bpf_verifier_env *env,
> > +                          struct bpf_insn *insn,
> > +                          struct bpf_patch *patch)
> > +{
> > +     enum bpf_prog_type prog_type;
> > +     struct bpf_prog_aux *aux;
> > +     struct btf *desc_btf;
> > +     u32 *kfunc_flags;
> > +     u32 func_id;
> > +
> > +     desc_btf = find_kfunc_desc_btf(env, insn->off);
> > +     if (IS_ERR(desc_btf))
> > +             return PTR_ERR(desc_btf);
> > +
> > +     prog_type = resolve_prog_type(env->prog);
> > +     func_id = insn->imm;
> > +
> > +     kfunc_flags = btf_kfunc_id_set_contains(desc_btf, prog_type, func_id);
> > +     if (!kfunc_flags)
> > +             return 0;
> > +     if (!(*kfunc_flags & KF_UNROLL))
> > +             return 0;
> > +     if (prog_type != BPF_PROG_TYPE_XDP)
> > +             return 0;
>
> Should this just handle XDP_METADATA_KFUNC_EXPORT_TO_SKB instead of
> passing that into the driver (to avoid every driver having to
> reimplement the same call to xdp_metadata_export_to_skb())?

Good idea, will try to move it here.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs
  2022-11-15 15:54 ` [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs Toke Høiland-Jørgensen
@ 2022-11-15 18:37   ` Stanislav Fomichev
  2022-11-15 22:31     ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-15 22:54     ` [xdp-hints] " Alexei Starovoitov
  0 siblings, 2 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15 18:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Nov 15, 2022 at 7:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > - drop __randomize_layout
> >
> >   Not sure it's possible to sanely expose it via UAPI. Because every
> >   .o potentially gets its own randomized layout, test_progs
> >   refuses to link.
>
> So this won't work if the struct is in a kernel-supplied UAPI header
> (which would include the __randomize_layout tag). But if it's *not* in a
> UAPI header it should still be included in a stable form (i.e., without
> the randomize tag) in vmlinux.h, right? Which would be the point:
> consumers would be forced to read it from there and do CO-RE on it...

So you're suggesting something like the following in the uapi header?

#ifndef __KERNEL__
#define __randomize_layout
#endif

?

Let me try to add some padding arguments to xdp_skb_metadata plus the
above to see how it goes.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 01/11] bpf: Document XDP RX metadata
  2022-11-15  3:02 ` [PATCH bpf-next 01/11] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2022-11-15 22:31   ` Zvi Effron
  2022-11-15 22:43     ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Zvi Effron @ 2022-11-15 22:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Mon, Nov 14, 2022 at 7:04 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Document all current use-cases and assumptions.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  Documentation/bpf/xdp-rx-metadata.rst | 109 ++++++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
>
> diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
> new file mode 100644
> index 000000000000..5ddaaab8de31
> --- /dev/null
> +++ b/Documentation/bpf/xdp-rx-metadata.rst
> @@ -0,0 +1,109 @@
> +===============
> +XDP RX Metadata
> +===============
> +
> +XDP programs support creating and passing custom metadata via
> +``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
> +entities:
> +
> +1. ``AF_XDP`` consumer.
> +2. Kernel core stack via ``XDP_PASS``.
> +3. Another device via ``bpf_redirect_map``.

4. Other eBPF programs via eBPF tail calls.

> +
> +General Design
> +==============
> +
> +XDP has access to a set of kfuncs to manipulate the metadata. Every
> +device driver implements these kfuncs by generating BPF bytecode
> +to parse it out from the hardware descriptors. The set of kfuncs is
> +declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
> +
> +Currently, the following kfuncs are supported. In the future, as more
> +metadata is supported, this set will grow:
> +
> +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> +  indicate whether the device supports RX timestamps in general
> +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp or 0
> +- ``bpf_xdp_metadata_export_to_skb`` prepares metadata layout that
> +  the kernel will be able to consume. See ``bpf_redirect_map`` section
> +  below for more details.
> +
> +Within the XDP frame, the metadata layout is as follows::
> +
> +  +----------+------------------+-----------------+------+
> +  | headroom | xdp_skb_metadata | custom metadata | data |
> +  +----------+------------------+-----------------+------+
> +                                ^                 ^
> +                                |                 |
> +                      xdp_buff->data_meta   xdp_buff->data
> +
> +Where ``xdp_skb_metadata`` is the metadata prepared by
> +``bpf_xdp_metadata_export_to_skb``. And ``custom metadata``
> +is prepared by the BPF program via calls to ``bpf_xdp_adjust_meta``.
> +
> +Note that ``bpf_xdp_metadata_export_to_skb`` doesn't adjust
> +``xdp->data_meta`` pointer. To access the metadata generated
> +by ``bpf_xdp_metadata_export_to_skb`` use ``xdp_buff->skb_metadata``.
> +
> +AF_XDP
> +======
> +
> +``AF_XDP`` use-case implies that there is a contract between the BPF program
> +that redirects XDP frames into the ``XSK`` and the final consumer.
> +Thus the BPF program manually allocates a fixed number of
> +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> +of kfuncs to populate it. User-space ``XSK`` consumer looks
> +at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
> +
> +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
> +
> +  +----------+------------------+-----------------+------+
> +  | headroom | xdp_skb_metadata | custom metadata | data |
> +  +----------+------------------+-----------------+------+
> +                                                  ^
> +                                                  |
> +                                           rx_desc->address
> +
> +XDP_PASS
> +========
> +
> +This is the path where the packets processed by the XDP program are passed
> +into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
> +Currently, every driver has custom kernel code to parse the descriptors and
> +populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
> +In the future, we'd like to support a case where XDP program can override
> +some of that metadata.
> +
> +The plan of record is to make this path similar to ``bpf_redirect_map``
> +below where the program would call ``bpf_xdp_metadata_export_to_skb``,
> +override the metadata and return ``XDP_PASS``. Additional work in
> +the drivers will be required to enable this (for example, to skip
> +populating ``skb`` metadata from the descriptors when
> +``bpf_xdp_metadata_export_to_skb`` has been called).
> +
> +bpf_redirect_map
> +================
> +
> +``bpf_redirect_map`` can redirect the frame to a different device.
> +In this case we don't know ahead of time whether that final consumer
> +will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
> +Additionally, the final consumer doesn't have access to the original
> +hardware descriptor and can't access any of the original metadata.
> +
> +To support passing metadata via ``bpf_redirect_map``, there is a
> +``bpf_xdp_metadata_export_to_skb`` kfunc that populates a subset
> +of metadata into ``xdp_buff``. The layout is defined in
> +``struct xdp_skb_metadata``.
> +
> +Mixing custom metadata and xdp_skb_metadata
> +===========================================
> +
> +For the cases of ``bpf_redirect_map``, where the final consumer isn't
> +known ahead of time, the program can store both custom metadata
> +and ``xdp_skb_metadata`` for kernel consumption.
> +
> +Current limitation is that the program cannot adjust ``data_meta`` (via
> +``bpf_xdp_adjust_meta``) after a call to ``bpf_xdp_metadata_export_to_skb``.
> +So it has to, first, prepare its custom metadata layout and only then,
> +optionally, store ``xdp_skb_metadata`` via a call to
> +``bpf_xdp_metadata_export_to_skb``.
> --
> 2.38.1.431.g37b22c650d-goog
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 00/11] xdp: hints via kfuncs
  2022-11-15 18:37   ` Stanislav Fomichev
@ 2022-11-15 22:31     ` Toke Høiland-Jørgensen
  2022-11-15 22:54     ` [xdp-hints] " Alexei Starovoitov
  1 sibling, 0 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 22:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Nov 15, 2022 at 7:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > - drop __randomize_layout
>> >
>> >   Not sure it's possible to sanely expose it via UAPI. Because every
>> >   .o potentially gets its own randomized layout, test_progs
>> >   refuses to link.
>>
>> So this won't work if the struct is in a kernel-supplied UAPI header
>> (which would include the __randomize_layout tag). But if it's *not* in a
>> UAPI header it should still be included in a stable form (i.e., without
>> the randomize tag) in vmlinux.h, right? Which would be the point:
>> consumers would be forced to read it from there and do CO-RE on it...
>
> So you're suggesting something like the following in the uapi header?
>
> #ifndef __KERNEL__
> #define __randomize_layout
> #endif
>
> ?

I actually just meant "don't put struct xdp_metadata in a UAPI header
file at all". However, I can see how that complicates having the
skb_metadata pointer in struct xdp_md, so if the above works, that's
fine with me as well :)

> Let me try to add some padding arguments to xdp_skb_metadata plus the
> above to see how it goes.

Cool!

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 01/11] bpf: Document XDP RX metadata
  2022-11-15 22:31   ` Zvi Effron
@ 2022-11-15 22:43     ` Stanislav Fomichev
  2022-11-15 23:34       ` Zvi Effron
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-15 22:43 UTC (permalink / raw)
  To: Zvi Effron
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Tue, Nov 15, 2022 at 2:31 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> On Mon, Nov 14, 2022 at 7:04 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Document all current use-cases and assumptions.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  Documentation/bpf/xdp-rx-metadata.rst | 109 ++++++++++++++++++++++++++
> >  1 file changed, 109 insertions(+)
> >  create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
> >
> > diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
> > new file mode 100644
> > index 000000000000..5ddaaab8de31
> > --- /dev/null
> > +++ b/Documentation/bpf/xdp-rx-metadata.rst
> > @@ -0,0 +1,109 @@
> > +===============
> > +XDP RX Metadata
> > +===============
> > +
> > +XDP programs support creating and passing custom metadata via
> > +``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
> > +entities:
> > +
> > +1. ``AF_XDP`` consumer.
> > +2. Kernel core stack via ``XDP_PASS``.
> > +3. Another device via ``bpf_redirect_map``.
>
> 4. Other eBPF programs via eBPF tail calls.

Don't think a tail call is a special case here?
Correct me if I'm wrong, but with a tail call, we retain the original
xdp_buff ctx, so the tail call can still use the same kfuncs as if the
original bpf prog was running.
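
E.g. something like this should keep working (sketch; kfunc name and return
value as described in this series, exact prototype assumed):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} jmp_table SEC(".maps");

SEC("xdp")
int tail_callee(struct xdp_md *ctx)
{
	/* same ctx as the entry program, so the (unrolled) kfunc still works */
	__u64 ts = bpf_xdp_metadata_rx_timestamp(ctx);

	return ts ? XDP_PASS : XDP_DROP;
}

SEC("xdp")
int entry(struct xdp_md *ctx)
{
	bpf_tail_call(ctx, &jmp_table, 0);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";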

> > +
> > +General Design
> > +==============
> > +
> > +XDP has access to a set of kfuncs to manipulate the metadata. Every
> > +device driver implements these kfuncs by generating BPF bytecode
> > +to parse it out from the hardware descriptors. The set of kfuncs is
> > +declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
> > +
> > +Currently, the following kfuncs are supported. In the future, as more
> > +metadata is supported, this set will grow:
> > +
> > +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> > +  indicate whether the device supports RX timestamps in general
> > +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp or 0
> > +- ``bpf_xdp_metadata_export_to_skb`` prepares metadata layout that
> > +  the kernel will be able to consume. See ``bpf_redirect_map`` section
> > +  below for more details.
> > +
> > +Within the XDP frame, the metadata layout is as follows::
> > +
> > +  +----------+------------------+-----------------+------+
> > +  | headroom | xdp_skb_metadata | custom metadata | data |
> > +  +----------+------------------+-----------------+------+
> > +                                ^                 ^
> > +                                |                 |
> > +                      xdp_buff->data_meta   xdp_buff->data
> > +
> > +Where ``xdp_skb_metadata`` is the metadata prepared by
> > +``bpf_xdp_metadata_export_to_skb``. And ``custom metadata``
> > +is prepared by the BPF program via calls to ``bpf_xdp_adjust_meta``.
> > +
> > +Note that ``bpf_xdp_metadata_export_to_skb`` doesn't adjust
> > +``xdp->data_meta`` pointer. To access the metadata generated
> > +by ``bpf_xdp_metadata_export_to_skb`` use ``xdp_buff->skb_metadata``.
> > +
> > +AF_XDP
> > +======
> > +
> > +``AF_XDP`` use-case implies that there is a contract between the BPF program
> > +that redirects XDP frames into the ``XSK`` and the final consumer.
> > +Thus the BPF program manually allocates a fixed number of
> > +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> > +of kfuncs to populate it. User-space ``XSK`` consumer looks
> > +at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
> > +
> > +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
> > +
> > +  +----------+------------------+-----------------+------+
> > +  | headroom | xdp_skb_metadata | custom metadata | data |
> > +  +----------+------------------+-----------------+------+
> > +                                                  ^
> > +                                                  |
> > +                                           rx_desc->address
> > +
> > +XDP_PASS
> > +========
> > +
> > +This is the path where the packets processed by the XDP program are passed
> > +into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
> > +Currently, every driver has custom kernel code to parse the descriptors and
> > +populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
> > +In the future, we'd like to support a case where XDP program can override
> > +some of that metadata.
> > +
> > +The plan of record is to make this path similar to ``bpf_redirect_map``
> > +below where the program would call ``bpf_xdp_metadata_export_to_skb``,
> > +override the metadata and return ``XDP_PASS``. Additional work in
> > +the drivers will be required to enable this (for example, to skip
> > +populating ``skb`` metadata from the descriptors when
> > +``bpf_xdp_metadata_export_to_skb`` has been called).
> > +
> > +bpf_redirect_map
> > +================
> > +
> > +``bpf_redirect_map`` can redirect the frame to a different device.
> > +In this case we don't know ahead of time whether that final consumer
> > +will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
> > +Additionally, the final consumer doesn't have access to the original
> > +hardware descriptor and can't access any of the original metadata.
> > +
> > +To support passing metadata via ``bpf_redirect_map``, there is a
> > +``bpf_xdp_metadata_export_to_skb`` kfunc that populates a subset
> > +of metadata into ``xdp_buff``. The layout is defined in
> > +``struct xdp_skb_metadata``.
> > +
> > +Mixing custom metadata and xdp_skb_metadata
> > +===========================================
> > +
> > +For the cases of ``bpf_redirect_map``, where the final consumer isn't
> > +known ahead of time, the program can store both custom metadata
> > +and ``xdp_skb_metadata`` for kernel consumption.
> > +
> > +Current limitation is that the program cannot adjust ``data_meta`` (via
> > +``bpf_xdp_adjust_meta``) after a call to ``bpf_xdp_metadata_export_to_skb``.
> > +So it has to, first, prepare its custom metadata layout and only then,
> > +optionally, store ``xdp_skb_metadata`` via a call to
> > +``bpf_xdp_metadata_export_to_skb``.
> > --
> > 2.38.1.431.g37b22c650d-goog
> >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-15 18:37     ` Stanislav Fomichev
@ 2022-11-15 22:46       ` Toke Høiland-Jørgensen
  2022-11-16  4:09         ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 22:46 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > The goal is to enable end-to-end testing of the metadata
>> > for AF_XDP. Current rx_timestamp kfunc returns current
>> > time which should be enough to exercise this new functionality.
>> >
>> > Cc: John Fastabend <john.fastabend@gmail.com>
>> > Cc: David Ahern <dsahern@gmail.com>
>> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
>> > Cc: Jakub Kicinski <kuba@kernel.org>
>> > Cc: Willem de Bruijn <willemb@google.com>
>> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>> > Cc: Maryam Tahhan <mtahhan@redhat.com>
>> > Cc: xdp-hints@xdp-project.net
>> > Cc: netdev@vger.kernel.org
>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> > ---
>> >  drivers/net/veth.c | 14 ++++++++++++++
>> >  1 file changed, 14 insertions(+)
>> >
>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> > index 2a4592780141..c626580a2294 100644
>> > --- a/drivers/net/veth.c
>> > +++ b/drivers/net/veth.c
>> > @@ -25,6 +25,7 @@
>> >  #include <linux/filter.h>
>> >  #include <linux/ptr_ring.h>
>> >  #include <linux/bpf_trace.h>
>> > +#include <linux/bpf_patch.h>
>> >  #include <linux/net_tstamp.h>
>> >
>> >  #define DRV_NAME     "veth"
>> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>> >       }
>> >  }
>> >
>> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>> > +                           struct bpf_patch *patch)
>> > +{
>> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>> > +             /* return true; */
>> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>> > +             /* return ktime_get_mono_fast_ns(); */
>> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>> > +     }
>> > +}
>>
>> So these look reasonable enough, but would be good to see some examples
>> of kfunc implementations that don't just BPF_CALL to a kernel function
>> (with those helper wrappers we were discussing before).
>
> Let's maybe add them if/when needed as we add more metadata support?
> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> examples, so it shouldn't be a problem to resurrect them at some
> point?

Well, the reason I asked for them is that I think having to maintain the
BPF code generation in the drivers is probably the biggest drawback of
the kfunc approach, so it would be good to be relatively sure that we
can manage that complexity (via helpers) before we commit to this :)
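
What I have in mind is a small set of wrappers on top of the bpf_patch API so
a driver describes *where* a field lives rather than hand-writing
instructions; e.g. (hypothetical helper, not part of this series):

/* emit "return *(u64 *)(ctx + offset)"; r1 holds the ctx argument */
static inline void bpf_patch_ld_ctx_u64(struct bpf_patch *patch, u32 offset)
{
	bpf_patch_append(patch,
			 BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, offset));
}

With something like that, the driver side of an rx_timestamp unroll shrinks
to a one-liner pointing at its descriptor/wrapper field, which seems
manageable.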

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs
  2022-11-15 18:37   ` Stanislav Fomichev
  2022-11-15 22:31     ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-15 22:54     ` Alexei Starovoitov
  2022-11-15 23:13       ` [xdp-hints] " Toke Høiland-Jørgensen
  1 sibling, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-15 22:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Tue, Nov 15, 2022 at 10:38 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Nov 15, 2022 at 7:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > - drop __randomize_layout
> > >
> > >   Not sure it's possible to sanely expose it via UAPI. Because every
> > >   .o potentially gets its own randomized layout, test_progs
> > >   refuses to link.
> >
> > So this won't work if the struct is in a kernel-supplied UAPI header
> > (which would include the __randomize_layout tag). But if it's *not* in a
> > UAPI header it should still be included in a stable form (i.e., without
> > the randomize tag) in vmlinux.h, right? Which would be the point:
> > consumers would be forced to read it from there and do CO-RE on it...
>
> So you're suggesting something like the following in the uapi header?
>
> #ifndef __KERNEL__
> #define __randomize_layout
> #endif
>

1.
__randomize_layout in uapi header makes no sense.

2.
It's supported by a gcc plugin, and afaik that plugin is broken
vs debug info, so dwarf is broken, hence BTF is broken too,
and CO-RE doesn't work on kernels compiled with that gcc plugin.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 00/11] xdp: hints via kfuncs
  2022-11-15 22:54     ` [xdp-hints] " Alexei Starovoitov
@ 2022-11-15 23:13       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 23:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Nov 15, 2022 at 10:38 AM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Tue, Nov 15, 2022 at 7:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> > > - drop __randomize_layout
>> > >
>> > >   Not sure it's possible to sanely expose it via UAPI. Because every
>> > >   .o potentially gets its own randomized layout, test_progs
>> > >   refuses to link.
>> >
>> > So this won't work if the struct is in a kernel-supplied UAPI header
>> > (which would include the __randomize_layout tag). But if it's *not* in a
>> > UAPI header it should still be included in a stable form (i.e., without
>> > the randomize tag) in vmlinux.h, right? Which would be the point:
>> > consumers would be forced to read it from there and do CO-RE on it...
>>
>> So you're suggesting something like the following in the uapi header?
>>
>> #ifndef __KERNEL__
>> #define __randomize_layout
>> #endif
>>
>
> 1.
> __randomize_layout in uapi header makes no sense.

I agree, which is why I wanted it to be only in vmlinux.h...

> 2.
> It's supported by a gcc plugin, and afaik that plugin is broken
> vs debug info, so dwarf is broken, hence BTF is broken too,
> and CO-RE doesn't work on kernels compiled with that gcc plugin.

...however this one seems a deal breaker. Ah well, too bad, seemed like
a neat trick to enforce CO-RE :(

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
@ 2022-11-15 23:20   ` Toke Høiland-Jørgensen
  2022-11-16  3:49     ` Stanislav Fomichev
  2022-11-16  7:04   ` Martin KaFai Lau
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-15 23:20 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b444b1118c4f..71e3bc7ad839 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6116,6 +6116,12 @@ enum xdp_action {
>  	XDP_REDIRECT,
>  };
>  
> +/* Subset of XDP metadata exported to skb context.
> + */
> +struct xdp_skb_metadata {
> +	__u64 rx_timestamp;
> +};

Okay, so given Alexei's comment about __randomize_layout not actually
working, I think we need to come up with something else for this. Just
sticking this in a regular UAPI header seems like a bad idea; we'd just
be inviting people to use it as-is.

Do we actually need the full definition here? It's just a pointer
declaration below, so is an opaque forward declaration enough? Then we
could keep the full definition in an internal header, so it goes back to
being exposed via vmlinux.h only?
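
I.e. something like (sketch):

/* include/uapi/linux/bpf.h: opaque, no layout exposed */
struct xdp_skb_metadata;

/* internal header, so the layout only shows up in vmlinux.h via BTF and
 * consumers have to go through CO-RE to read it */
struct xdp_skb_metadata {
	__u64 rx_timestamp;
};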

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 01/11] bpf: Document XDP RX metadata
  2022-11-15 22:43     ` Stanislav Fomichev
@ 2022-11-15 23:34       ` Zvi Effron
  2022-11-16  3:50         ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Zvi Effron @ 2022-11-15 23:34 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Tue, Nov 15, 2022 at 2:44 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Nov 15, 2022 at 2:31 PM Zvi Effron <zeffron@riotgames.com> wrote:
> >
> > On Mon, Nov 14, 2022 at 7:04 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > Document all current use-cases and assumptions.
> > >
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > > Documentation/bpf/xdp-rx-metadata.rst | 109 ++++++++++++++++++++++++++
> > > 1 file changed, 109 insertions(+)
> > > create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
> > >
> > > diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
> > > new file mode 100644
> > > index 000000000000..5ddaaab8de31
> > > --- /dev/null
> > > +++ b/Documentation/bpf/xdp-rx-metadata.rst
> > > @@ -0,0 +1,109 @@
> > > +===============
> > > +XDP RX Metadata
> > > +===============
> > > +
> > > +XDP programs support creating and passing custom metadata via
> > > +``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
> > > +entities:
> > > +
> > > +1. ``AF_XDP`` consumer.
> > > +2. Kernel core stack via ``XDP_PASS``.
> > > +3. Another device via ``bpf_redirect_map``.
> >
> > 4. Other eBPF programs via eBPF tail calls.
>
> Don't think a tail call is a special case here?
> Correct me if I'm wrong, but with a tail call, we retain the original
> xdp_buff ctx, so the tail call can still use the same kfuncs as if the
> original bpf prog was running.
>

That's correct, but it's still a separate program that consumes the metadata,
unrelated to kfuncs. Prior to the existence of kfuncs and AF_XDP, this
was (to my knowledge) the primary consumer (outside of the original program, of
course) of the metadata.

From the name of the file and commit message, it sounds like this is the
documentation for XDP metadata, not the documentation for XDP metadata as used
by kfuncs to implement xdp-hints. Is that correct?

> > > +
> > > +General Design
> > > +==============
> > > +
> > > +XDP has access to a set of kfuncs to manipulate the metadata. Every
> > > +device driver implements these kfuncs by generating BPF bytecode
> > > +to parse it out from the hardware descriptors. The set of kfuncs is
> > > +declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
> > > +
> > > +Currently, the following kfuncs are supported. In the future, as more
> > > +metadata is supported, this set will grow:
> > > +
> > > +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> > > + indicate whether the device supports RX timestamps in general
> > > +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp or 0
> > > +- ``bpf_xdp_metadata_export_to_skb`` prepares metadata layout that
> > > + the kernel will be able to consume. See ``bpf_redirect_map`` section
> > > + below for more details.
> > > +
> > > +Within the XDP frame, the metadata layout is as follows::
> > > +
> > > +  +----------+------------------+-----------------+------+
> > > +  | headroom | xdp_skb_metadata | custom metadata | data |
> > > +  +----------+------------------+-----------------+------+
> > > +                                ^                 ^
> > > +                                |                 |
> > > +                      xdp_buff->data_meta   xdp_buff->data
> > > +
> > > +Where ``xdp_skb_metadata`` is the metadata prepared by
> > > +``bpf_xdp_metadata_export_to_skb``. And ``custom metadata``
> > > +is prepared by the BPF program via calls to ``bpf_xdp_adjust_meta``.
> > > +
> > > +Note that ``bpf_xdp_metadata_export_to_skb`` doesn't adjust
> > > +``xdp->data_meta`` pointer. To access the metadata generated
> > > +by ``bpf_xdp_metadata_export_to_skb`` use ``xdp_buf->skb_metadata``.
> > > +
> > > +AF_XDP
> > > +======
> > > +
> > > +``AF_XDP`` use-case implies that there is a contract between the BPF program
> > > +that redirects XDP frames into the ``XSK`` and the final consumer.
> > > +Thus the BPF program manually allocates a fixed number of
> > > +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> > > +of kfuncs to populate it. User-space ``XSK`` consumer, looks
> > > +at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
> > > +
> > > +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
> > > +
> > > + +----------+------------------+-----------------+------+
> > > + | headroom | xdp_skb_metadata | custom metadata | data |
> > > + +----------+------------------+-----------------+------+
> > > +                                                 ^
> > > +                                                 |
> > > +                                                 rx_desc->address
> > > +
> > > +XDP_PASS
> > > +========
> > > +
> > > +This is the path where the packets processed by the XDP program are passed
> > > +into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
> > > +Currently, every driver has a custom kernel code to parse the descriptors and
> > > +populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
> > > +In the future, we'd like to support a case where XDP program can override
> > > +some of that metadata.
> > > +
> > > +The plan of record is to make this path similar to ``bpf_redirect_map``
> > > +below where the program would call ``bpf_xdp_metadata_export_to_skb``,
> > > +override the metadata and return ``XDP_PASS``. Additional work in
> > > +the drivers will be required to enable this (for example, to skip
> > > +populating ``skb`` metadata from the descriptors when
> > > +``bpf_xdp_metadata_export_to_skb`` has been called).
> > > +
> > > +bpf_redirect_map
> > > +================
> > > +
> > > +``bpf_redirect_map`` can redirect the frame to a different device.
> > > +In this case we don't know ahead of time whether that final consumer
> > > +will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
> > > +Additionally, the final consumer doesn't have access to the original
> > > +hardware descriptor and can't access any of the original metadata.
> > > +
> > > +To support passing metadata via ``bpf_redirect_map``, there is a
> > > +``bpf_xdp_metadata_export_to_skb`` kfunc that populates a subset
> > > +of metadata into ``xdp_buff``. The layout is defined in
> > > +``struct xdp_skb_metadata``.
> > > +
> > > +Mixing custom metadata and xdp_skb_metadata
> > > +===========================================
> > > +
> > > +For the cases of ``bpf_redirect_map``, where the final consumer isn't
> > > +known ahead of time, the program can store both, custom metadata
> > > +and ``xdp_skb_metadata`` for the kernel consumption.
> > > +
> > > +Current limitation is that the program cannot adjust ``data_meta`` (via
> > > +``bpf_xdp_adjust_meta``) after a call to ``bpf_xdp_metadata_export_to_skb``.
> > > +So it has to, first, prepare its custom metadata layout and only then,
> > > +optionally, store ``xdp_skb_metadata`` via a call to
> > > +``bpf_xdp_metadata_export_to_skb``.
> > > --
> > > 2.38.1.431.g37b22c650d-goog
> > >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15 23:20   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16  3:49     ` Stanislav Fomichev
  2022-11-16  9:30       ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16  3:49 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Nov 15, 2022 at 3:20 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b444b1118c4f..71e3bc7ad839 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6116,6 +6116,12 @@ enum xdp_action {
> >       XDP_REDIRECT,
> >  };
> >
> > +/* Subset of XDP metadata exported to skb context.
> > + */
> > +struct xdp_skb_metadata {
> > +     __u64 rx_timestamp;
> > +};
>
> Okay, so given Alexei's comment about __randomize_struct not actually
> working, I think we need to come up with something else for this. Just
> sticking this in a regular UAPI header seems like a bad idea; we'd just
> be inviting people to use it as-is.
>
> Do we actually need the full definition here? It's just a pointer
> declaration below, so is an opaque forward-definition enough? Then we
> could have the full definition in an internal header, moving the full
> definition back to being in vmlinux.h only?

Looks like having a uapi-declaration only (and moving the definition
into the kernel headers) might work. At least it does in my limited
testing :-) So let's go with that for now. Alexei, thanks for the
context on __randomize_struct!
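
Roughly what that split could look like (a sketch; the exact internal header
is an assumption here, the field itself is the one from this patch):

/* include/uapi/linux/bpf.h: forward declaration only, no layout exposed */
struct xdp_skb_metadata;

struct xdp_md {
	/* ... existing fields ... */
	__bpf_md_ptr(struct xdp_skb_metadata *, skb_metadata);
};

/* include/net/xdp.h (or another internal header): full definition, visible
 * to BPF programs via BTF/vmlinux.h rather than the stable UAPI
 */
struct xdp_skb_metadata {
	__u64 rx_timestamp;
};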

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 01/11] bpf: Document XDP RX metadata
  2022-11-15 23:34       ` Zvi Effron
@ 2022-11-16  3:50         ` Stanislav Fomichev
  0 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16  3:50 UTC (permalink / raw)
  To: Zvi Effron
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa

On Tue, Nov 15, 2022 at 3:34 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> On Tue, Nov 15, 2022 at 2:44 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 2:31 PM Zvi Effron <zeffron@riotgames.com> wrote:
> > >
> > > On Mon, Nov 14, 2022 at 7:04 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > Document all current use-cases and assumptions.
> > > >
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > > Documentation/bpf/xdp-rx-metadata.rst | 109 ++++++++++++++++++++++++++
> > > > 1 file changed, 109 insertions(+)
> > > > create mode 100644 Documentation/bpf/xdp-rx-metadata.rst
> > > >
> > > > diff --git a/Documentation/bpf/xdp-rx-metadata.rst b/Documentation/bpf/xdp-rx-metadata.rst
> > > > new file mode 100644
> > > > index 000000000000..5ddaaab8de31
> > > > --- /dev/null
> > > > +++ b/Documentation/bpf/xdp-rx-metadata.rst
> > > > @@ -0,0 +1,109 @@
> > > > +===============
> > > > +XDP RX Metadata
> > > > +===============
> > > > +
> > > > +XDP programs support creating and passing custom metadata via
> > > > +``bpf_xdp_adjust_meta``. This metadata can be consumed by the following
> > > > +entities:
> > > > +
> > > > +1. ``AF_XDP`` consumer.
> > > > +2. Kernel core stack via ``XDP_PASS``.
> > > > +3. Another device via ``bpf_redirect_map``.
> > >
> > > 4. Other eBPF programs via eBPF tail calls.
> >
> > Don't think a tail call is a special case here?
> > Correct me if I'm wrong, but with a tail call, we retain the original
> > xdp_buff ctx, so the tail call can still use the same kfuncs as if the
> > original bpf prog was running.
> >
>
> That's correct, but it's still a separate program that consumes the metadata,
> unrelated to anything kfuncs. Prior to the existence of kfuncs and AF_XDP, this
> was (to my knowledge) the primary consumer (outside of the original program, of
> course) of the metadata.

SG. I'll add this #4 in the respin and will add a short note that the
tail call operates on the same ctx.

> From the name of the file and commit message, it sounds like this is the
> documentation for XDP metadata, not the documentation for XDP metadata as used
> by kfuncs to implement xdp-hints. Is that correct?

I'm mostly focused on the kfunc-related details for now.



> > > > +
> > > > +General Design
> > > > +==============
> > > > +
> > > > +XDP has access to a set of kfuncs to manipulate the metadata. Every
> > > > +device driver implements these kfuncs by generating BPF bytecode
> > > > +to parse it out from the hardware descriptors. The set of kfuncs is
> > > > +declared in ``include/net/xdp.h`` via ``XDP_METADATA_KFUNC_xxx``.
> > > > +
> > > > +Currently, the following kfuncs are supported. In the future, as more
> > > > +metadata is supported, this set will grow:
> > > > +
> > > > +- ``bpf_xdp_metadata_rx_timestamp_supported`` returns true/false to
> > > > + indicate whether the device supports RX timestamps in general
> > > > +- ``bpf_xdp_metadata_rx_timestamp`` returns packet RX timestamp or 0
> > > > +- ``bpf_xdp_metadata_export_to_skb`` prepares metadata layout that
> > > > + the kernel will be able to consume. See ``bpf_redirect_map`` section
> > > > + below for more details.
> > > > +
> > > > +Within the XDP frame, the metadata layout is as follows::
> > > > +
> > > > + +----------+------------------+-----------------+------+
> > > > + | headroom | xdp_skb_metadata | custom metadata | data |
> > > > + +----------+------------------+-----------------+------+
> > > > +                               ^                 ^
> > > > +                               |                 |
> > > > +             xdp_buff->data_meta                 xdp_buff->data
> > > > +
> > > > +Where ``xdp_skb_metadata`` is the metadata prepared by
> > > > +``bpf_xdp_metadata_export_to_skb``. And ``custom metadata``
> > > > +is prepared by the BPF program via calls to ``bpf_xdp_adjust_meta``.
> > > > +
> > > > +Note that ``bpf_xdp_metadata_export_to_skb`` doesn't adjust
> > > > +``xdp->data_meta`` pointer. To access the metadata generated
> > > > +by ``bpf_xdp_metadata_export_to_skb`` use ``xdp_buf->skb_metadata``.
> > > > +
> > > > +AF_XDP
> > > > +======
> > > > +
> > > > +``AF_XDP`` use-case implies that there is a contract between the BPF program
> > > > +that redirects XDP frames into the ``XSK`` and the final consumer.
> > > > +Thus the BPF program manually allocates a fixed number of
> > > > +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> > > > +of kfuncs to populate it. User-space ``XSK`` consumer, looks
> > > > +at ``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
> > > > +
> > > > +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
> > > > +
> > > > + +----------+------------------+-----------------+------+
> > > > + | headroom | xdp_skb_metadata | custom metadata | data |
> > > > + +----------+------------------+-----------------+------+
> > > > +                                                 ^
> > > > +                                                 |
> > > > +                                                 rx_desc->address
> > > > +
> > > > +XDP_PASS
> > > > +========
> > > > +
> > > > +This is the path where the packets processed by the XDP program are passed
> > > > +into the kernel. The kernel creates ``skb`` out of the ``xdp_buff`` contents.
> > > > +Currently, every driver has a custom kernel code to parse the descriptors and
> > > > +populate ``skb`` metadata when doing this ``xdp_buff->skb`` conversion.
> > > > +In the future, we'd like to support a case where XDP program can override
> > > > +some of that metadata.
> > > > +
> > > > +The plan of record is to make this path similar to ``bpf_redirect_map``
> > > > +below where the program would call ``bpf_xdp_metadata_export_to_skb``,
> > > > +override the metadata and return ``XDP_PASS``. Additional work in
> > > > +the drivers will be required to enable this (for example, to skip
> > > > +populating ``skb`` metadata from the descriptors when
> > > > +``bpf_xdp_metadata_export_to_skb`` has been called).
> > > > +
> > > > +bpf_redirect_map
> > > > +================
> > > > +
> > > > +``bpf_redirect_map`` can redirect the frame to a different device.
> > > > +In this case we don't know ahead of time whether that final consumer
> > > > +will further redirect to an ``XSK`` or pass it to the kernel via ``XDP_PASS``.
> > > > +Additionally, the final consumer doesn't have access to the original
> > > > +hardware descriptor and can't access any of the original metadata.
> > > > +
> > > > +To support passing metadata via ``bpf_redirect_map``, there is a
> > > > +``bpf_xdp_metadata_export_to_skb`` kfunc that populates a subset
> > > > +of metadata into ``xdp_buff``. The layout is defined in
> > > > +``struct xdp_skb_metadata``.
> > > > +
> > > > +Mixing custom metadata and xdp_skb_metadata
> > > > +===========================================
> > > > +
> > > > +For the cases of ``bpf_redirect_map``, where the final consumer isn't
> > > > +known ahead of time, the program can store both, custom metadata
> > > > +and ``xdp_skb_metadata`` for the kernel consumption.
> > > > +
> > > > +Current limitation is that the program cannot adjust ``data_meta`` (via
> > > > +``bpf_xdp_adjust_meta``) after a call to ``bpf_xdp_metadata_export_to_skb``.
> > > > +So it has to, first, prepare its custom metadata layout and only then,
> > > > +optionally, store ``xdp_skb_metadata`` via a call to
> > > > +``bpf_xdp_metadata_export_to_skb``.
> > > > --
> > > > 2.38.1.431.g37b22c650d-goog
> > > >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-15 22:46       ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16  4:09         ` Stanislav Fomichev
  2022-11-16  6:38           ` John Fastabend
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16  4:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > The goal is to enable end-to-end testing of the metadata
> >> > for AF_XDP. Current rx_timestamp kfunc returns current
> >> > time which should be enough to exercise this new functionality.
> >> >
> >> > Cc: John Fastabend <john.fastabend@gmail.com>
> >> > Cc: David Ahern <dsahern@gmail.com>
> >> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >> > Cc: Jakub Kicinski <kuba@kernel.org>
> >> > Cc: Willem de Bruijn <willemb@google.com>
> >> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> >> > Cc: xdp-hints@xdp-project.net
> >> > Cc: netdev@vger.kernel.org
> >> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >> > ---
> >> >  drivers/net/veth.c | 14 ++++++++++++++
> >> >  1 file changed, 14 insertions(+)
> >> >
> >> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >> > index 2a4592780141..c626580a2294 100644
> >> > --- a/drivers/net/veth.c
> >> > +++ b/drivers/net/veth.c
> >> > @@ -25,6 +25,7 @@
> >> >  #include <linux/filter.h>
> >> >  #include <linux/ptr_ring.h>
> >> >  #include <linux/bpf_trace.h>
> >> > +#include <linux/bpf_patch.h>
> >> >  #include <linux/net_tstamp.h>
> >> >
> >> >  #define DRV_NAME     "veth"
> >> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >> >       }
> >> >  }
> >> >
> >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> >> > +                           struct bpf_patch *patch)
> >> > +{
> >> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> >> > +             /* return true; */
> >> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> >> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> >> > +             /* return ktime_get_mono_fast_ns(); */
> >> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> >> > +     }
> >> > +}
> >>
> >> So these look reasonable enough, but would be good to see some examples
> >> of kfunc implementations that don't just BPF_CALL to a kernel function
> >> (with those helper wrappers we were discussing before).
> >
> > Let's maybe add them if/when needed as we add more metadata support?
> > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > examples, so it shouldn't be a problem to resurrect them back at some
> > point?
>
> Well, the reason I asked for them is that I think having to maintain the
> BPF code generation in the drivers is probably the biggest drawback of
> the kfunc approach, so it would be good to be relatively sure that we
> can manage that complexity (via helpers) before we commit to this :)

Right, and I've added a bunch of examples in v2 rfc so we can judge
whether that complexity is manageable or not :-)
Do you want me to add those wrappers of yours back without any real users?
Because I had to remove my veth tstamp accessors due to John/Jesper
objections; I can maybe bring some of this back gated by some
static_branch to avoid the fastpath cost?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16  4:09         ` Stanislav Fomichev
@ 2022-11-16  6:38           ` John Fastabend
  2022-11-16  7:47             ` Martin KaFai Lau
  0 siblings, 1 reply; 67+ messages in thread
From: John Fastabend @ 2022-11-16  6:38 UTC (permalink / raw)
  To: Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev wrote:
> On Tue, Nov 15, 2022 at 2:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > On Tue, Nov 15, 2022 at 8:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Stanislav Fomichev <sdf@google.com> writes:
> > >>
> > >> > The goal is to enable end-to-end testing of the metadata
> > >> > for AF_XDP. Current rx_timestamp kfunc returns current
> > >> > time which should be enough to exercise this new functionality.
> > >> >
> > >> > Cc: John Fastabend <john.fastabend@gmail.com>
> > >> > Cc: David Ahern <dsahern@gmail.com>
> > >> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > >> > Cc: Jakub Kicinski <kuba@kernel.org>
> > >> > Cc: Willem de Bruijn <willemb@google.com>
> > >> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > >> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > >> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > >> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > >> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > >> > Cc: xdp-hints@xdp-project.net
> > >> > Cc: netdev@vger.kernel.org
> > >> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > >> > ---
> > >> >  drivers/net/veth.c | 14 ++++++++++++++
> > >> >  1 file changed, 14 insertions(+)
> > >> >
> > >> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > >> > index 2a4592780141..c626580a2294 100644
> > >> > --- a/drivers/net/veth.c
> > >> > +++ b/drivers/net/veth.c
> > >> > @@ -25,6 +25,7 @@
> > >> >  #include <linux/filter.h>
> > >> >  #include <linux/ptr_ring.h>
> > >> >  #include <linux/bpf_trace.h>
> > >> > +#include <linux/bpf_patch.h>
> > >> >  #include <linux/net_tstamp.h>
> > >> >
> > >> >  #define DRV_NAME     "veth"
> > >> > @@ -1659,6 +1660,18 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> > >> >       }
> > >> >  }
> > >> >
> > >> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > >> > +                           struct bpf_patch *patch)
> > >> > +{
> > >> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > >> > +             /* return true; */
> > >> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > >> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > >> > +             /* return ktime_get_mono_fast_ns(); */
> > >> > +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > >> > +     }
> > >> > +}
> > >>
> > >> So these look reasonable enough, but would be good to see some examples
> > >> of kfunc implementations that don't just BPF_CALL to a kernel function
> > >> (with those helper wrappers we were discussing before).
> > >
> > > Let's maybe add them if/when needed as we add more metadata support?
> > > xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > examples, so it shouldn't be a problem to resurrect them back at some
> > > point?
> >
> > Well, the reason I asked for them is that I think having to maintain the
> > BPF code generation in the drivers is probably the biggest drawback of
> > the kfunc approach, so it would be good to be relatively sure that we
> > can manage that complexity (via helpers) before we commit to this :)
> 
> Right, and I've added a bunch of examples in v2 rfc so we can judge
> whether that complexity is manageable or not :-)
> Do you want me to add those wrappers you've back without any real users?
> Because I had to remove my veth tstamp accessors due to John/Jesper
> objections; I can maybe bring some of this back gated by some
> static_branch to avoid the fastpath cost?

I missed the context a bit. What did you mean by "would be good to see some
examples of kfunc implementations that don't just BPF_CALL to a kernel
function"? In this case do you mean BPF code directly, without the call?

Early on I thought we should just expose the rx_descriptor, which would
be roughly the same, right? (The difference being code embedded in the driver
vs a lib.) The trouble I ran into is driver code using seqlock_t and mutexes,
which wasn't as straightforward as simply reading it from
the descriptor. For example, in mlx getting the ts would be easy from
BPF with the mlx4_cqe struct exposed:

u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
{
        u64 hi, lo;
        struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;

        lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
        hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;

        return hi | lo;
}

but converting that to nsec is a bit annoying,

void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
                            struct skb_shared_hwtstamps *hwts,
                            u64 timestamp)
{
        unsigned int seq;
        u64 nsec;

        do {
                seq = read_seqbegin(&mdev->clock_lock);
                nsec = timecounter_cyc2time(&mdev->clock, timestamp);
        } while (read_seqretry(&mdev->clock_lock, seq));

        memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
        hwts->hwtstamp = ns_to_ktime(nsec);
}

I think the nsec is what you really want.

With all the drivers doing slightly different ops, we would have
to create read_seqbegin, read_seqretry, mutex_lock, ... to cover at
least the mlx and ice drivers; it looks like we would need some
more BPF primitives/helpers. Looks like some more work is needed
on the ice driver though to get rx tstamps on all packets.

Anyways this convinced me real devices will probably use BPF_CALL
and not BPF insns directly.
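
To make that concrete, roughly what the BPF_CALL flavor could look like for
mlx4 (a sketch; the mlx4_xdp_buff field names and the helper are assumptions
for illustration, not the implementation in this series):

static u64 mlx4_xdp_rx_timestamp_ns(const struct xdp_md *ctx)
{
	/* the patched call site still has the xdp ctx in R1, so the helper
	 * can take it directly; ctx is assumed to be backed by the
	 * mlx4_xdp_buff wrapper from patch 9
	 */
	const struct mlx4_xdp_buff *mxbuf = (const void *)ctx;
	struct skb_shared_hwtstamps hwts;

	/* reuse the existing seqlock-protected conversion code */
	mlx4_en_fill_hwtstamps(mxbuf->mdev, &hwts,
			       mlx4_en_get_cqe_ts(mxbuf->cqe));
	return ktime_to_ns(hwts.hwtstamp);
}

static void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
			      struct bpf_patch *patch)
{
	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
		/* same shape as the veth example: a single BPF_CALL */
		bpf_patch_append(patch, BPF_EMIT_CALL(mlx4_xdp_rx_timestamp_ns));
}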

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
  2022-11-15 23:20   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16  7:04   ` Martin KaFai Lau
  2022-11-16  9:48     ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-16 20:51     ` Stanislav Fomichev
  2022-11-16 21:12   ` Jakub Kicinski
  2022-11-18 14:05   ` Jesper Dangaard Brouer
  3 siblings, 2 replies; 67+ messages in thread
From: Martin KaFai Lau @ 2022-11-16  7:04 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/14/22 7:02 PM, Stanislav Fomichev wrote:
> Implement new bpf_xdp_metadata_export_to_skb kfunc which
> prepares compatible xdp metadata for kernel consumption.
> This kfunc should be called prior to bpf_redirect
> or when XDP_PASS'ing the frame into the kernel (note, the drivers
> have to be updated to enable consuming XDP_PASS'ed metadata).
> 
> veth driver is amended to consume this metadata when converting to skb.
> 
> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> whether the frame has skb metadata. The metadata is currently
> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> this requirement later on if needed, we'd have to memmove
> xdp_skb_metadata).

It is ok to refuse bpf_xdp_adjust_meta() after bpf_xdp_metadata_export_to_skb() 
for now.  However, it will also need to refuse bpf_xdp_adjust_head().
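
Roughly, the same guard that bpf_xdp_adjust_meta() grows would then go at the
top of bpf_xdp_adjust_head() as well (sketch only; assuming the flag ends up
in xdp_buff->flags):

	if (unlikely(xdp->flags & XDP_FLAGS_HAS_SKB_METADATA))
		return -EINVAL;	/* moving data would strand xdp_skb_metadata */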

[ ... ]

> +/* For the packets directed to the kernel, this kfunc exports XDP metadata
> + * into skb context.
> + */
> +noinline int bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
> +{
> +	return 0;
> +}
> +

I think it is still better to return 'struct xdp_skb_metadata *' instead of 
true/false.  Like:

noinline struct xdp_skb_metadata *
bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
{
	return NULL;
}

The KF_RET_NULL has already been set in 
BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids).  There is 
"xdp_btf_struct_access()" that can allow write access to 'struct xdp_skb_metata' 
What else is missing? We can try to solve it.

Then there is no need for this double check in patch 8 selftest which is not 
easy to use:

+               if (bpf_xdp_metadata_export_to_skb(ctx) < 0) {
+                       bpf_printk("bpf_xdp_metadata_export_to_skb failed");
+                       return XDP_DROP;
+               }

[ ... ]

+               skb_metadata = ctx->skb_metadata;
+               if (!skb_metadata) {
+                       bpf_printk("no ctx->skb_metadata");
+                       return XDP_DROP;
+               }

[ ... ]


> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b444b1118c4f..71e3bc7ad839 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6116,6 +6116,12 @@ enum xdp_action {
>   	XDP_REDIRECT,
>   };
>   
> +/* Subset of XDP metadata exported to skb context.
> + */
> +struct xdp_skb_metadata {
> +	__u64 rx_timestamp;
> +};
> +
>   /* user accessible metadata for XDP packet hook
>    * new fields must be added to the end of this structure
>    */
> @@ -6128,6 +6134,7 @@ struct xdp_md {
>   	__u32 rx_queue_index;  /* rxq->queue_index  */
>   
>   	__u32 egress_ifindex;  /* txq->dev->ifindex */
> +	__bpf_md_ptr(struct xdp_skb_metadata *, skb_metadata);

Once the above bpf_xdp_metadata_export_to_skb() returning a pointer works, there 
can be another kfunc 'struct xdp_skb_metadata *bpf_xdp_get_skb_metadata(const 
struct xdp_md *ctx)' to return the skb_metadata, which is similar to a point 
discussed in the previous RFC.
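
For reference, a sketch of the prog side with the pointer-returning variant
(kfunc signature as suggested above, not an existing API); the NULL check
replaces both the return-value check and the ctx->skb_metadata check from the
selftest:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern struct xdp_skb_metadata *
bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	struct xdp_skb_metadata *meta;

	meta = bpf_xdp_metadata_export_to_skb(ctx);
	if (!meta) {
		bpf_printk("bpf_xdp_metadata_export_to_skb failed");
		return XDP_DROP;
	}

	bpf_printk("rx_timestamp %llu", meta->rx_timestamp);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";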


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16  6:38           ` John Fastabend
@ 2022-11-16  7:47             ` Martin KaFai Lau
  2022-11-16 10:08               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Martin KaFai Lau @ 2022-11-16  7:47 UTC (permalink / raw)
  To: John Fastabend, Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, song, yhs, kpsingh, haoluo, jolsa,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>> +                           struct bpf_patch *patch)
>>>>>> +{
>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>> +             /* return true; */
>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>> +     }
>>>>>> +}
>>>>>
>>>>> So these look reasonable enough, but would be good to see some examples
>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
>>>>> (with those helper wrappers we were discussing before).
>>>>
>>>> Let's maybe add them if/when needed as we add more metadata support?
>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
>>>> examples, so it shouldn't be a problem to resurrect them back at some
>>>> point?
>>>
>>> Well, the reason I asked for them is that I think having to maintain the
>>> BPF code generation in the drivers is probably the biggest drawback of
>>> the kfunc approach, so it would be good to be relatively sure that we
>>> can manage that complexity (via helpers) before we commit to this :)
>>
>> Right, and I've added a bunch of examples in v2 rfc so we can judge
>> whether that complexity is manageable or not :-)
>> Do you want me to add those wrappers you've back without any real users?
>> Because I had to remove my veth tstamp accessors due to John/Jesper
>> objections; I can maybe bring some of this back gated by some
>> static_branch to avoid the fastpath cost?
> 
> I missed the context a bit what did you mean "would be good to see some
> examples of kfunc implementations that don't just BPF_CALL to a kernel
> function"? In this case do you mean BPF code directly without the call?
> 
> Early on I thought we should just expose the rx_descriptor which would
> be roughly the same right? (difference being code embedded in driver vs
> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> which wasn't as straight forward as the simpler just read it from
> the descriptor. For example in mlx getting the ts would be easy from
> BPF with the mlx4_cqe struct exposed
> 
> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> {
>          u64 hi, lo;
>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> 
>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> 
>          return hi | lo;
> }
> 
> but converting that to nsec is a bit annoying,
> 
> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>                              struct skb_shared_hwtstamps *hwts,
>                              u64 timestamp)
> {
>          unsigned int seq;
>          u64 nsec;
> 
>          do {
>                  seq = read_seqbegin(&mdev->clock_lock);
>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
>          } while (read_seqretry(&mdev->clock_lock, seq));
> 
>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
>          hwts->hwtstamp = ns_to_ktime(nsec);
> }
> 
> I think the nsec is what you really want.
> 
> With all the drivers doing slightly different ops we would have
> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> at least the mlx and ice drivers it looks like we would need some
> more BPF primitives/helpers. Looks like some more work is needed
> on ice driver though to get rx tstamps on all packets.
> 
> Anyways this convinced me real devices will probably use BPF_CALL
> and not BPF insns directly.

Some of the mlx5 path looks like this:

#define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))

static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
                                               u64 timestamp)
{
         u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);

         return ns_to_ktime(time);
}

If some hints are harder to get, then just doing a kfunc call is better.

csum may have a better chance to inline?

Regardless, BPF in-lining is a well solved problem and used in many bpf helpers 
already, so there are many examples in the kernel.  I don't think it is 
necessary to block this series because some helper wrappers for inlining are 
missing.  The driver can always start with the simpler kfunc call first and 
optimize later if some hints from the drivers allow it.
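
As a rough illustration of the inline option for one of the simpler
conversions, the REAL_TIME_TO_NS() math above could be emitted as plain BPF
ALU instructions (sketch only; it assumes earlier patched instructions already
left the raw 64-bit timestamp in R0, and that R1/R2 are free to clobber at the
patched call site):

	bpf_patch_append(patch, BPF_MOV64_REG(BPF_REG_1, BPF_REG_0));   /* r1 = ts */
	bpf_patch_append(patch, BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 32)); /* r1 = hi */
	/* 32-bit mov zero-extends, leaving only the low 32 bits in r0 */
	bpf_patch_append(patch, BPF_MOV32_REG(BPF_REG_0, BPF_REG_0));
	bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_2, NSEC_PER_SEC));
	bpf_patch_append(patch, BPF_ALU64_REG(BPF_MUL, BPF_REG_1, BPF_REG_2)); /* r1 = hi * 1e9 */
	bpf_patch_append(patch, BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1)); /* r0 = nsec */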

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-16  3:49     ` Stanislav Fomichev
@ 2022-11-16  9:30       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-16  9:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Tue, Nov 15, 2022 at 3:20 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index b444b1118c4f..71e3bc7ad839 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -6116,6 +6116,12 @@ enum xdp_action {
>> >       XDP_REDIRECT,
>> >  };
>> >
>> > +/* Subset of XDP metadata exported to skb context.
>> > + */
>> > +struct xdp_skb_metadata {
>> > +     __u64 rx_timestamp;
>> > +};
>>
>> Okay, so given Alexei's comment about __randomize_struct not actually
>> working, I think we need to come up with something else for this. Just
>> sticking this in a regular UAPI header seems like a bad idea; we'd just
>> be inviting people to use it as-is.
>>
>> Do we actually need the full definition here? It's just a pointer
>> declaration below, so is an opaque forward-definition enough? Then we
>> could have the full definition in an internal header, moving the full
>> definition back to being in vmlinux.h only?
>
> Looks like having a uapi-declaration only (and moving the definition
> into the kernel headers) might work. At least it does in my limited
> testing :-) So let's go with that for now. Alexei, thanks for the
> context on __randomize_struct!

Sounds good! :)

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-16  7:04   ` Martin KaFai Lau
@ 2022-11-16  9:48     ` Toke Høiland-Jørgensen
  2022-11-16 20:51       ` Stanislav Fomichev
  2022-11-16 20:51     ` Stanislav Fomichev
  1 sibling, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-16  9:48 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/14/22 7:02 PM, Stanislav Fomichev wrote:
>> Implement new bpf_xdp_metadata_export_to_skb kfunc which
>> prepares compatible xdp metadata for kernel consumption.
>> This kfunc should be called prior to bpf_redirect
>> or when XDP_PASS'ing the frame into the kernel (note, the drivers
>> have to be updated to enable consuming XDP_PASS'ed metadata).
>> 
>> veth driver is amended to consume this metadata when converting to skb.
>> 
>> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
>> whether the frame has skb metadata. The metadata is currently
>> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
>> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
>> this requirement later on if needed, we'd have to memmove
>> xdp_skb_metadata).
>
> It is ok to refuse bpf_xdp_adjust_meta() after bpf_xdp_metadata_export_to_skb() 
> for now.  However, it will also need to refuse bpf_xdp_adjust_head().

I'm also OK with deferring this, although I'm wondering if it isn't just
as easy to add the memmove() straight away? :)

-Toke
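
For what it's worth, the memmove itself looks small; a sketch (assumption, not
code from the series; a direct xdp->skb_metadata pointer is assumed for
brevity) of what bpf_xdp_adjust_meta() would do before moving data_meta:

static void xdp_move_skb_metadata(struct xdp_buff *xdp, void *new_data_meta)
{
	struct xdp_skb_metadata *new_meta =
		(struct xdp_skb_metadata *)new_data_meta - 1;

	/* the exported block sits right before data_meta, so it has to
	 * follow data_meta when it moves; regions may overlap, hence memmove
	 */
	memmove(new_meta, xdp->skb_metadata, sizeof(*new_meta));
	xdp->skb_metadata = new_meta;
}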


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16  7:47             ` Martin KaFai Lau
@ 2022-11-16 10:08               ` Toke Høiland-Jørgensen
  2022-11-16 18:20                 ` Martin KaFai Lau
  2022-11-16 19:03                 ` John Fastabend
  0 siblings, 2 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-16 10:08 UTC (permalink / raw)
  To: Martin KaFai Lau, John Fastabend, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, song, yhs, kpsingh, haoluo, jolsa,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>>> +                           struct bpf_patch *patch)
>>>>>>> +{
>>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>>> +             /* return true; */
>>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>>> +     }
>>>>>>> +}
>>>>>>
>>>>>> So these look reasonable enough, but would be good to see some examples
>>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
>>>>>> (with those helper wrappers we were discussing before).
>>>>>
>>>>> Let's maybe add them if/when needed as we add more metadata support?
>>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
>>>>> examples, so it shouldn't be a problem to resurrect them back at some
>>>>> point?
>>>>
>>>> Well, the reason I asked for them is that I think having to maintain the
>>>> BPF code generation in the drivers is probably the biggest drawback of
>>>> the kfunc approach, so it would be good to be relatively sure that we
>>>> can manage that complexity (via helpers) before we commit to this :)
>>>
>>> Right, and I've added a bunch of examples in v2 rfc so we can judge
>>> whether that complexity is manageable or not :-)
>>> Do you want me to add those wrappers you've back without any real users?
>>> Because I had to remove my veth tstamp accessors due to John/Jesper
>>> objections; I can maybe bring some of this back gated by some
>>> static_branch to avoid the fastpath cost?
>> 
>> I missed the context a bit what did you mean "would be good to see some
>> examples of kfunc implementations that don't just BPF_CALL to a kernel
>> function"? In this case do you mean BPF code directly without the call?
>> 
>> Early on I thought we should just expose the rx_descriptor which would
>> be roughly the same right? (difference being code embedded in driver vs
>> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
>> which wasn't as straight forward as the simpler just read it from
>> the descriptor. For example in mlx getting the ts would be easy from
>> BPF with the mlx4_cqe struct exposed
>> 
>> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
>> {
>>          u64 hi, lo;
>>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
>> 
>>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
>>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
>> 
>>          return hi | lo;
>> }
>> 
>> but converting that to nsec is a bit annoying,
>> 
>> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>>                              struct skb_shared_hwtstamps *hwts,
>>                              u64 timestamp)
>> {
>>          unsigned int seq;
>>          u64 nsec;
>> 
>>          do {
>>                  seq = read_seqbegin(&mdev->clock_lock);
>>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
>>          } while (read_seqretry(&mdev->clock_lock, seq));
>> 
>>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
>>          hwts->hwtstamp = ns_to_ktime(nsec);
>> }
>> 
>> I think the nsec is what you really want.
>> 
>> With all the drivers doing slightly different ops we would have
>> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
>> at least the mlx and ice drivers it looks like we would need some
>> more BPF primitives/helpers. Looks like some more work is needed
>> on ice driver though to get rx tstamps on all packets.
>> 
>> Anyways this convinced me real devices will probably use BPF_CALL
>> and not BPF insns directly.
>
> Some of the mlx5 path looks like this:
>
> #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
>
> static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
>                                                u64 timestamp)
> {
>          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
>
>          return ns_to_ktime(time);
> }
>
> If some hints are harder to get, then just doing a kfunc call is better.

Sure, but if we end up having a full function call for every field in
the metadata, that will end up having a significant performance impact
on the XDP data path (thinking mostly about the skb metadata case here,
which will collect several bits of metadata).

> csum may have a better chance to inline?

Yup, I agree. Including that also makes it possible to benchmark this
series against Jesper's; which I think we should definitely be doing
before merging this.

> Regardless, BPF in-lining is a well solved problem and used in many
> bpf helpers already, so there are many examples in the kernel. I don't
> think it is necessary to block this series because of missing some
> helper wrappers for inlining. The driver can always start with the
> simpler kfunc call first and optimize later if some hints from the
> drivers allow it.

Well, "solved" in the sense of "there are a few handfuls of core BPF
people who know how to do it". My concern is that we'll end up with
either the BPF devs having to maintain all these bits of BPF byte code
in all the drivers; or drivers just punting to regular function calls
because the inlining is too complicated, with sub-par performance as per
the above. I don't think we should just hand-wave this away as "solved",
but rather treat this as an integral part of the initial series.

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 10:08               ` Toke Høiland-Jørgensen
@ 2022-11-16 18:20                 ` Martin KaFai Lau
  2022-11-16 19:03                 ` John Fastabend
  1 sibling, 0 replies; 67+ messages in thread
From: Martin KaFai Lau @ 2022-11-16 18:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, song, yhs, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	John Fastabend

On 11/16/22 2:08 AM, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 11/15/22 10:38 PM, John Fastabend wrote:
>>>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>>>>>>>> +                           struct bpf_patch *patch)
>>>>>>>> +{
>>>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>>>>>>>> +             /* return true; */
>>>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>>>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>>>>>>>> +             /* return ktime_get_mono_fast_ns(); */
>>>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
>>>>>>>> +     }
>>>>>>>> +}
>>>>>>>
>>>>>>> So these look reasonable enough, but would be good to see some examples
>>>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
>>>>>>> (with those helper wrappers we were discussing before).
>>>>>>
>>>>>> Let's maybe add them if/when needed as we add more metadata support?
>>>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
>>>>>> examples, so it shouldn't be a problem to resurrect them back at some
>>>>>> point?
>>>>>
>>>>> Well, the reason I asked for them is that I think having to maintain the
>>>>> BPF code generation in the drivers is probably the biggest drawback of
>>>>> the kfunc approach, so it would be good to be relatively sure that we
>>>>> can manage that complexity (via helpers) before we commit to this :)
>>>>
>>>> Right, and I've added a bunch of examples in v2 rfc so we can judge
>>>> whether that complexity is manageable or not :-)
>>>> Do you want me to add those wrappers you've back without any real users?
>>>> Because I had to remove my veth tstamp accessors due to John/Jesper
>>>> objections; I can maybe bring some of this back gated by some
>>>> static_branch to avoid the fastpath cost?
>>>
>>> I missed the context a bit what did you mean "would be good to see some
>>> examples of kfunc implementations that don't just BPF_CALL to a kernel
>>> function"? In this case do you mean BPF code directly without the call?
>>>
>>> Early on I thought we should just expose the rx_descriptor which would
>>> be roughly the same right? (difference being code embedded in driver vs
>>> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
>>> which wasn't as straight forward as the simpler just read it from
>>> the descriptor. For example in mlx getting the ts would be easy from
>>> BPF with the mlx4_cqe struct exposed
>>>
>>> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
>>> {
>>>           u64 hi, lo;
>>>           struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
>>>
>>>           lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
>>>           hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
>>>
>>>           return hi | lo;
>>> }
>>>
>>> but converting that to nsec is a bit annoying,
>>>
>>> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
>>>                               struct skb_shared_hwtstamps *hwts,
>>>                               u64 timestamp)
>>> {
>>>           unsigned int seq;
>>>           u64 nsec;
>>>
>>>           do {
>>>                   seq = read_seqbegin(&mdev->clock_lock);
>>>                   nsec = timecounter_cyc2time(&mdev->clock, timestamp);
>>>           } while (read_seqretry(&mdev->clock_lock, seq));
>>>
>>>           memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
>>>           hwts->hwtstamp = ns_to_ktime(nsec);
>>> }
>>>
>>> I think the nsec is what you really want.
>>>
>>> With all the drivers doing slightly different ops we would have
>>> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
>>> at least the mlx and ice drivers it looks like we would need some
>>> more BPF primitives/helpers. Looks like some more work is needed
>>> on ice driver though to get rx tstamps on all packets.
>>>
>>> Anyways this convinced me real devices will probably use BPF_CALL
>>> and not BPF insns directly.
>>
>> Some of the mlx5 path looks like this:
>>
>> #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
>>
>> static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
>>                                                 u64 timestamp)
>> {
>>           u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
>>
>>           return ns_to_ktime(time);
>> }
>>
>> If some hints are harder to get, then just doing a kfunc call is better.
> 
> Sure, but if we end up having a full function call for every field in
> the metadata, that will end up having a significant performance impact
> on the XDP data path (thinking mostly about the skb metadata case here,
> which will collect several bits of metadata).
> 
>> csum may have a better chance to inline?

It will be useful if the skb_metadata can get at least one more field like csum 
or vlan.
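
Purely illustrative (not something this series adds), the kind of extension
being suggested:

struct xdp_skb_metadata {
	__u64 rx_timestamp;
	__u32 csum;		/* hypothetical */
	__u16 vlan_tci;		/* hypothetical */
};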

> 
> Yup, I agree. Including that also makes it possible to benchmark this
> series against Jesper's; which I think we should definitely be doing
> before merging this.

If the hint needs a lock to obtain, it is not cheap to begin with, regardless 
of whether it is behind a kfunc or not.  Also, there is 
bpf_xdp_metadata_rx_timestamp_supported() before doing the 
bpf_xdp_metadata_rx_timestamp().

This set gives the xdp prog a flexible way to avoid getting all hints (some of 
which require a lock) if all the xdp prog needs is a csum.  IMO, giving this 
flexibility to the xdp prog is the important thing for this set in terms of 
performance.
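
A sketch of that pattern from the prog side (kfunc signatures assumed to take
the xdp ctx; names are the ones from the documentation patch):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern bool bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_timestamp_only(struct xdp_md *ctx)
{
	__u64 ts = 0;

	/* only pay for the timestamp if the device can provide it */
	if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
		ts = bpf_xdp_metadata_rx_timestamp(ctx);

	bpf_printk("rx_timestamp %llu", ts);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";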

> 
>> Regardless, BPF in-lining is a well solved problem and used in many
>> bpf helpers already, so there are many examples in the kernel. I don't
>> think it is necessary to block this series because of missing some
>> helper wrappers for inlining. The driver can always start with the
>> simpler kfunc call first and optimize later if some hints from the
>> drivers allow it.
> 
> Well, "solved" in the sense of "there are a few handfuls of core BPF
> people who know how to do it". My concern is that we'll end up with
> either the BPF devs having to maintain all these bits of BPF byte code
> in all the drivers; or drivers just punting to regular function calls
> because the inlining is too complicated, with sub-par performance as per
> the above. I don't think we should just hand-wave this away as "solved",
> but rather treat this as an integral part of the initial series.

In terms of complexity/maintainability, I don't think the driver needs to inline 
hundreds of bpf insns for a hint; needing that many is already a good signal 
that it should just call the kfunc.  For a simple hint, 
xdp_metadata_export_to_skb() is a good example to start with, and I am not sure 
how much a helper wrapper can do to simplify things further since each driver's 
inline code is going to be different.

The community's review will definitely be useful for the first few drivers when 
the driver changes are posted to the list, and later drivers will have an easier 
time following, much like how xdp was initially introduced to only a few drivers 
first.  When there are things to refactor after enough drivers support it, that 
will be a better time to revisit.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 10:08               ` Toke Høiland-Jørgensen
  2022-11-16 18:20                 ` Martin KaFai Lau
@ 2022-11-16 19:03                 ` John Fastabend
  2022-11-16 20:50                   ` Stanislav Fomichev
  1 sibling, 1 reply; 67+ messages in thread
From: John Fastabend @ 2022-11-16 19:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Martin KaFai Lau,
	John Fastabend, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, song, yhs, kpsingh, haoluo, jolsa,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
> > On 11/15/22 10:38 PM, John Fastabend wrote:
> >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> >>>>>>> +                           struct bpf_patch *patch)
> >>>>>>> +{
> >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> >>>>>>> +             /* return true; */
> >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> >>>>>>> +     }
> >>>>>>> +}
> >>>>>>
> >>>>>> So these look reasonable enough, but would be good to see some examples
> >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> >>>>>> (with those helper wrappers we were discussing before).
> >>>>>
> >>>>> Let's maybe add them if/when needed as we add more metadata support?
> >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> >>>>> point?
> >>>>
> >>>> Well, the reason I asked for them is that I think having to maintain the
> >>>> BPF code generation in the drivers is probably the biggest drawback of
> >>>> the kfunc approach, so it would be good to be relatively sure that we
> >>>> can manage that complexity (via helpers) before we commit to this :)
> >>>
> >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> >>> whether that complexity is manageable or not :-)
> >>> Do you want me to add those wrappers you've back without any real users?
> >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> >>> objections; I can maybe bring some of this back gated by some
> >>> static_branch to avoid the fastpath cost?
> >> 
> >> I missed the context a bit what did you mean "would be good to see some
> >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> >> function"? In this case do you mean BPF code directly without the call?
> >> 
> >> Early on I thought we should just expose the rx_descriptor which would
> >> be roughly the same right? (difference being code embedded in driver vs
> >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> >> which wasn't as straight forward as the simpler just read it from
> >> the descriptor. For example in mlx getting the ts would be easy from
> >> BPF with the mlx4_cqe struct exposed
> >> 
> >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> >> {
> >>          u64 hi, lo;
> >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> >> 
> >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> >> 
> >>          return hi | lo;
> >> }
> >> 
> >> but converting that to nsec is a bit annoying,
> >> 
> >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> >>                              struct skb_shared_hwtstamps *hwts,
> >>                              u64 timestamp)
> >> {
> >>          unsigned int seq;
> >>          u64 nsec;
> >> 
> >>          do {
> >>                  seq = read_seqbegin(&mdev->clock_lock);
> >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> >>          } while (read_seqretry(&mdev->clock_lock, seq));
> >> 
> >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> >>          hwts->hwtstamp = ns_to_ktime(nsec);
> >> }
> >> 
> >> I think the nsec is what you really want.
> >> 
> >> With all the drivers doing slightly different ops we would have
> >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> >> at least the mlx and ice drivers it looks like we would need some
> >> more BPF primitives/helpers. Looks like some more work is needed
> >> on ice driver though to get rx tstamps on all packets.
> >> 
> >> Anyways this convinced me real devices will probably use BPF_CALL
> >> and not BPF insns directly.
> >
> > Some of the mlx5 path looks like this:
> >
> > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> >
> > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> >                                                u64 timestamp)
> > {
> >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> >
> >          return ns_to_ktime(time);
> > }
> >
> > If some hints are harder to get, then just doing a kfunc call is better.
> 
> Sure, but if we end up having a full function call for every field in
> the metadata, that will end up having a significant performance impact
> on the XDP data path (thinking mostly about the skb metadata case here,
> which will collect several bits of metadata).
> 
> > csum may have a better chance to inline?
> 
> Yup, I agree. Including that also makes it possible to benchmark this
> series against Jesper's; which I think we should definitely be doing
> before merging this.

Good point, I got sort of singularly focused on timestamp because I have
a use case for it now.

Also hash is often sitting in the rx descriptor.

> 
> > Regardless, BPF in-lining is a well solved problem and used in many
> > bpf helpers already, so there are many examples in the kernel. I don't
> > think it is necessary to block this series because of missing some
> > helper wrappers for inlining. The driver can always start with the
> > simpler kfunc call first and optimize later if some hints from the
> > drivers allow it.
> 
> Well, "solved" in the sense of "there are a few handfuls of core BPF
> people who know how to do it". My concern is that we'll end up with
> either the BPF devs having to maintain all these bits of BPF byte code
> in all the drivers; or drivers just punting to regular function calls
> because the inlining is too complicated, with sub-par performance as per
> the above. I don't think we should just hand-wave this away as "solved",
> but rather treat this as an integral part of the initial series.

This was my motivation for pushing the rx_descriptor into the xdp_buff.
At this point, if I'm going to have a kfunc call into the driver and
have the driver rewrite the code into some BPF instructions, I would
just as soon maintain this as library code where I can hook it
into my BPF program directly from user space. Maybe a few drivers
will support all the things I want to read, but we run on lots of
hardware (mlx, intel, eks, azure, gke, etc.) and it's been a lot of work
to just get basic feature parity. I also don't want to run around
writing driver code for each vendor if I can avoid it. Having raw
access to the rx descriptor gives me the escape hatch where I can
just do it myself. And the last piece, from my point of view
(Tetragon, Cilium): I can run whatever libs I want and freely upgrade
libbpf and cilium/ebpf, but I have a lot less ability to get users
to upgrade kernels outside the LTS they picked. Meaning I can
add new things much more easily if it's lifted into BPF code placed
by user space.

I appreciate that it means I import the problem of hardware detection
and BTF CO-RE onto networking code, but we've already solved these
problems for other reasons. For example, just configuring the timestamp
is a bit of an exercise in "does my hardware support timestamping,
and does it support timestamping the packets I care about", e.g.
all pkts, just ptp pkts, etc.

I don't think they are mutually exclusive with this series though,
because I can't see how to write this timestamping logic directly
in BPF. But for rxhash and csum it seems doable. My preference
is to have both the kfuncs and expose the descriptor directly.
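
To make that "both" option concrete, the sketch below is roughly what I
have in mind from the BPF side. The ctx->rx_descriptor field and the
bpf_xdp_metadata_rx_timestamp() kfunc are assumptions purely for
illustration, not something this series defines:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

/* hypothetical kfunc, name/signature assumed */
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_hints(struct xdp_md *ctx)
{
	/* assumed escape hatch: raw pointer to the rx descriptor */
	struct mlx4_cqe *cqe = (void *)(long)ctx->rx_descriptor;
	/* kfunc for the hint that needs driver/clock state */
	__u64 ts = bpf_xdp_metadata_rx_timestamp(ctx);
	/* CO-RE read straight out of the descriptor for the easy one */
	__u32 hash = bpf_ntohl(BPF_CORE_READ(cqe, immed_rss_invalid));

	bpf_printk("ts=%llu hash=%x", ts, hash);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";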

.John

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-15  3:02 ` [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
  2022-11-15 16:16   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16 20:42   ` Jakub Kicinski
  2022-11-16 20:53     ` Stanislav Fomichev
  1 sibling, 1 reply; 67+ messages in thread
From: Jakub Kicinski @ 2022-11-16 20:42 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, 14 Nov 2022 19:02:02 -0800 Stanislav Fomichev wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 117e830cabb0..a2227f4f4a0b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9258,6 +9258,13 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
>  			return -EOPNOTSUPP;
>  		}
>  
> +		if (new_prog &&
> +		    new_prog->aux->xdp_kfunc_ndo &&
> +		    new_prog->aux->xdp_kfunc_ndo != dev->netdev_ops) {
> +			NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
> +			return -EINVAL;
> +		}

This chunk can go up into the large

	if (new_prog) {
		...

list of checks?

nit: aux->xdp_kfunc_ndo sounds like you're storing the kfunc NDO,
     not all ndos. Throw in an 's' at the end, or some such?
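
Roughly this placement is what I mean, sketch only (with the trailing
's' added per the nit):

	if (new_prog) {
		/* ... existing new_prog checks ... */

		if (new_prog->aux->xdp_kfunc_ndos &&
		    new_prog->aux->xdp_kfunc_ndos != dev->netdev_ops) {
			NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
			return -EINVAL;
		}
	}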

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 19:03                 ` John Fastabend
@ 2022-11-16 20:50                   ` Stanislav Fomichev
  2022-11-16 23:47                     ` John Fastabend
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16 20:50 UTC (permalink / raw)
  To: John Fastabend
  Cc: Toke Høiland-Jørgensen, Martin KaFai Lau, bpf, ast,
	daniel, andrii, song, yhs, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Toke Høiland-Jørgensen wrote:
> > Martin KaFai Lau <martin.lau@linux.dev> writes:
> >
> > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > >>>>>>> +                           struct bpf_patch *patch)
> > >>>>>>> +{
> > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > >>>>>>> +             /* return true; */
> > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > >>>>>>> +     }
> > >>>>>>> +}
> > >>>>>>
> > >>>>>> So these look reasonable enough, but would be good to see some examples
> > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > >>>>>> (with those helper wrappers we were discussing before).
> > >>>>>
> > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > >>>>> point?
> > >>>>
> > >>>> Well, the reason I asked for them is that I think having to maintain the
> > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > >>>> can manage that complexity (via helpers) before we commit to this :)
> > >>>
> > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > >>> whether that complexity is manageable or not :-)
> > >>> Do you want me to add those wrappers you've back without any real users?
> > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > >>> objections; I can maybe bring some of this back gated by some
> > >>> static_branch to avoid the fastpath cost?
> > >>
> > >> I missed the context a bit what did you mean "would be good to see some
> > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > >> function"? In this case do you mean BPF code directly without the call?
> > >>
> > >> Early on I thought we should just expose the rx_descriptor which would
> > >> be roughly the same right? (difference being code embedded in driver vs
> > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > >> which wasn't as straight forward as the simpler just read it from
> > >> the descriptor. For example in mlx getting the ts would be easy from
> > >> BPF with the mlx4_cqe struct exposed
> > >>
> > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > >> {
> > >>          u64 hi, lo;
> > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > >>
> > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > >>
> > >>          return hi | lo;
> > >> }
> > >>
> > >> but converting that to nsec is a bit annoying,
> > >>
> > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > >>                              struct skb_shared_hwtstamps *hwts,
> > >>                              u64 timestamp)
> > >> {
> > >>          unsigned int seq;
> > >>          u64 nsec;
> > >>
> > >>          do {
> > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > >>
> > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > >> }
> > >>
> > >> I think the nsec is what you really want.
> > >>
> > >> With all the drivers doing slightly different ops we would have
> > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > >> at least the mlx and ice drivers it looks like we would need some
> > >> more BPF primitives/helpers. Looks like some more work is needed
> > >> on ice driver though to get rx tstamps on all packets.
> > >>
> > >> Anyways this convinced me real devices will probably use BPF_CALL
> > >> and not BPF insns directly.
> > >
> > > Some of the mlx5 path looks like this:
> > >
> > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > >
> > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > >                                                u64 timestamp)
> > > {
> > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > >
> > >          return ns_to_ktime(time);
> > > }
> > >
> > > If some hints are harder to get, then just doing a kfunc call is better.
> >
> > Sure, but if we end up having a full function call for every field in
> > the metadata, that will end up having a significant performance impact
> > on the XDP data path (thinking mostly about the skb metadata case here,
> > which will collect several bits of metadata).
> >
> > > csum may have a better chance to inline?
> >
> > Yup, I agree. Including that also makes it possible to benchmark this
> > series against Jesper's; which I think we should definitely be doing
> > before merging this.
>
> Good point I got sort of singularly focused on timestamp because I have
> a use case for it now.
>
> Also hash is often sitting in the rx descriptor.

Ack, let me try to add something else (that's more inline-able) on the
rx side for a v2.
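
For example, something hash-flavored for mlx4 could in principle unroll
like the sketch below. XDP_METADATA_KFUNC_RX_HASH doesn't exist in this
series, and I'm assuming the mlx4_xdp_buff wrapper from patch 9 carries
the cqe pointer with xdp_buff as its first member, so treat it as
illustration only:

static void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
			      struct bpf_patch *patch)
{
	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH)) {
		/* r0 = ((struct mlx4_xdp_buff *)r1)->cqe */
		bpf_patch_append(patch,
				 BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
					     offsetof(struct mlx4_xdp_buff, cqe)));
		/* r0 = ((struct mlx4_cqe *)r0)->immed_rss_invalid */
		bpf_patch_append(patch,
				 BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
					     offsetof(struct mlx4_cqe, immed_rss_invalid)));
		/* the hash is big-endian in the cqe, swap to host order */
		bpf_patch_append(patch, BPF_ENDIAN(BPF_FROM_BE, BPF_REG_0, 32));
	}
}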

> >
> > > Regardless, BPF in-lining is a well solved problem and used in many
> > > bpf helpers already, so there are many examples in the kernel. I don't
> > > think it is necessary to block this series because of missing some
> > > helper wrappers for inlining. The driver can always start with the
> > > simpler kfunc call first and optimize later if some hints from the
> > > drivers allow it.
> >
> > Well, "solved" in the sense of "there are a few handfuls of core BPF
> > people who know how to do it". My concern is that we'll end up with
> > either the BPF devs having to maintain all these bits of BPF byte code
> > in all the drivers; or drivers just punting to regular function calls
> > because the inlining is too complicated, with sub-par performance as per
> > the above. I don't think we should just hand-wave this away as "solved",
> > but rather treat this as an integral part of the initial series.
>
> This was my motivation for pushing the rx_descriptor into the xdp_buff.
> At this point if I'm going to have a kfunc call into the driver and
> have the driver rewrite the code into some BPF instructions I would
> just assume maintain this as a library code where I can hook it
> into my BPF program directly from user space. Maybe a few drivers
> will support all the things I want to read, but we run on lots of
> hardware (mlx, intel, eks, azure, gke, etc) and its been a lot of work
> to just get the basic feature parity. I also don't want to run around
> writing driver code for each vendor if I can avoid it. Having raw
> access to the rx descriptor gets me the escape hatch where I can
> just do it myself. And the last piece of info from my point of view
> (Tetragon, Cilium) I can run whatever libs I want and freely upgrade
> libbpf and cilium/ebpf but have a lot less ability to get users
> to upgrade kernels outside the LTS they picked. Meaning I can
> add new things much easier if its lifted into BPF code placed
> by user space.
>
> I appreciate that it means I import the problem of hardware detection
> and BTF CO-RE on networking codes, but we've already solved these
> problems for other reasons. For example just configuring the timestamp
> is a bit of an exercise in does my hardware support timestamping
> and does it support timestamping the packets I care about,  e.g.
> all pkts, just ptp pkts, etc.
>
> I don't think they are mutual exclusive with this series though
> because I can't see how to write these timestamping logic directly
> in BPF. But for rxhash and csum it seems doable. My preference
> is to have both the kfuncs and expose the descriptor directly.
>
> .John

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-16  9:48     ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16 20:51       ` Stanislav Fomichev
  0 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16 20:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf

On Wed, Nov 16, 2022 at 1:48 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Martin KaFai Lau <martin.lau@linux.dev> writes:
>
> > On 11/14/22 7:02 PM, Stanislav Fomichev wrote:
> >> Implement new bpf_xdp_metadata_export_to_skb kfunc which
> >> prepares compatible xdp metadata for kernel consumption.
> >> This kfunc should be called prior to bpf_redirect
> >> or when XDP_PASS'ing the frame into the kernel (note, the drivers
> >> have to be updated to enable consuming XDP_PASS'ed metadata).
> >>
> >> veth driver is amended to consume this metadata when converting to skb.
> >>
> >> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> >> whether the frame has skb metadata. The metadata is currently
> >> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> >> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> >> this requirement later on if needed, we'd have to memmove
> >> xdp_skb_metadata).
> >
> > It is ok to refuse bpf_xdp_adjust_meta() after bpf_xdp_metadata_export_to_skb()
> > for now.  However, it will also need to refuse bpf_xdp_adjust_head().
>
> I'm also OK with deferring this, although I'm wondering if it isn't just
> as easy to just add the memmove() straight away? :)

SG, let me try that! Martin also mentioned bpf_xdp_adjust_head needs
to be taken care of...
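
For the memmove I'm picturing something along these lines inside
bpf_xdp_adjust_meta(), right before data_meta is updated (sketch; the
exact flag check and placement are assumptions at this point):

	/* the exported struct sits right below data_meta and has to
	 * travel with it when the meta area grows or shrinks
	 */
	if (xdp->flags & XDP_FLAGS_HAS_SKB_METADATA)
		memmove(meta - sizeof(struct xdp_skb_metadata),
			xdp->data_meta - sizeof(struct xdp_skb_metadata),
			sizeof(struct xdp_skb_metadata));

	xdp->data_meta = meta;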

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-16  7:04   ` Martin KaFai Lau
  2022-11-16  9:48     ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-16 20:51     ` Stanislav Fomichev
  1 sibling, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16 20:51 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Tue, Nov 15, 2022 at 11:04 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/14/22 7:02 PM, Stanislav Fomichev wrote:
> > Implement new bpf_xdp_metadata_export_to_skb kfunc which
> > prepares compatible xdp metadata for kernel consumption.
> > This kfunc should be called prior to bpf_redirect
> > or when XDP_PASS'ing the frame into the kernel (note, the drivers
> > have to be updated to enable consuming XDP_PASS'ed metadata).
> >
> > veth driver is amended to consume this metadata when converting to skb.
> >
> > Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> > whether the frame has skb metadata. The metadata is currently
> > stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> > to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> > this requirement later on if needed, we'd have to memmove
> > xdp_skb_metadata).
>
> It is ok to refuse bpf_xdp_adjust_meta() after bpf_xdp_metadata_export_to_skb()
> for now.  However, it will also need to refuse bpf_xdp_adjust_head().

Good catch!

> [ ... ]
>
> > +/* For the packets directed to the kernel, this kfunc exports XDP metadata
> > + * into skb context.
> > + */
> > +noinline int bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
> > +{
> > +     return 0;
> > +}
> > +
>
> I think it is still better to return 'struct xdp_skb_metata *' instead of
> true/false.  Like:
>
> noinline struct xdp_skb_metata *bpf_xdp_metadata_export_to_skb(const struct
> xdp_md *ctx)
> {
>         return 0;
> }
>
> The KF_RET_NULL has already been set in
> BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids).  There is
> "xdp_btf_struct_access()" that can allow write access to 'struct xdp_skb_metata'
> What else is missing? We can try to solve it.

Ah, that KF_RET_NULL is a left-over from my previous attempt to return
pointers :-)
I can retry with returning a pointer; I don't have a preference, but I
felt like returning a simple -errno might be simpler api-wise.
(with bpf_xdp_metadata_export_to_skb returning NULL vs returning a loggable
errno - I'd prefer the latter, but not strongly)

> Then there is no need for this double check in patch 8 selftest which is not
> easy to use:
>
> +               if (bpf_xdp_metadata_export_to_skb(ctx) < 0) {
> +                       bpf_printk("bpf_xdp_metadata_export_to_skb failed");
> +                       return XDP_DROP;
> +               }
>
> [ ... ]
>
> +               skb_metadata = ctx->skb_metadata;
> +               if (!skb_metadata) {
> +                       bpf_printk("no ctx->skb_metadata");
> +                       return XDP_DROP;
> +               }
>
> [ ... ]
>
>
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index b444b1118c4f..71e3bc7ad839 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -6116,6 +6116,12 @@ enum xdp_action {
> >       XDP_REDIRECT,
> >   };
> >
> > +/* Subset of XDP metadata exported to skb context.
> > + */
> > +struct xdp_skb_metadata {
> > +     __u64 rx_timestamp;
> > +};
> > +
> >   /* user accessible metadata for XDP packet hook
> >    * new fields must be added to the end of this structure
> >    */
> > @@ -6128,6 +6134,7 @@ struct xdp_md {
> >       __u32 rx_queue_index;  /* rxq->queue_index  */
> >
> >       __u32 egress_ifindex;  /* txq->dev->ifindex */
> > +     __bpf_md_ptr(struct xdp_skb_metadata *, skb_metadata);
>
> Once the above bpf_xdp_metadata_export_to_skb() returning a pointer works, then
> it can be another kfunc 'struct xdp_skb_metata * bpf_xdp_get_skb_metadata(const
> struct xdp_md *ctx)' to return the skb_metadata which was a similar point
> discussed in the previous RFC.

I see. I think you've previously mentioned somewhere having a
kfunc accessor vs a uapi field, and I totally forgot. Will try to address
it, ty!
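
FWIW, with the pointer-returning variant the selftest side would
collapse to something like this (sketch, assuming KF_RET_NULL stays set
on the kfunc):

	struct xdp_skb_metadata *skb_metadata;

	skb_metadata = bpf_xdp_metadata_export_to_skb(ctx);
	if (!skb_metadata) {
		bpf_printk("bpf_xdp_metadata_export_to_skb failed");
		return XDP_DROP;
	}
	/* e.g. skb_metadata->rx_timestamp is directly readable here */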

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-16 20:42   ` Jakub Kicinski
@ 2022-11-16 20:53     ` Stanislav Fomichev
  0 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-16 20:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 16, 2022 at 12:43 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 14 Nov 2022 19:02:02 -0800 Stanislav Fomichev wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 117e830cabb0..a2227f4f4a0b 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9258,6 +9258,13 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
> >                       return -EOPNOTSUPP;
> >               }
> >
> > +             if (new_prog &&
> > +                 new_prog->aux->xdp_kfunc_ndo &&
> > +                 new_prog->aux->xdp_kfunc_ndo != dev->netdev_ops) {
> > +                     NL_SET_ERR_MSG(extack, "Cannot attach to a different target device");
> > +                     return -EINVAL;
> > +             }
>
> This chunk can go up into the large
>
>         if (new_prog) {
>                 ...
>
> list of checks?

Agreed, will move!

> nit: aux->xdp_kfunc_ndo sounds like you're storing the kfunc NDO,
>      not all ndos. Throw in an 's' at the end, or some such?

SG. But I'll most likely replace this xdp_kfunc_ndo with something
like xdp_netdev and add proper refcounting.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
  2022-11-15 23:20   ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-16  7:04   ` Martin KaFai Lau
@ 2022-11-16 21:12   ` Jakub Kicinski
  2022-11-16 21:49     ` Martin KaFai Lau
  2022-11-18 14:05   ` Jesper Dangaard Brouer
  3 siblings, 1 reply; 67+ messages in thread
From: Jakub Kicinski @ 2022-11-16 21:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, 14 Nov 2022 19:02:05 -0800 Stanislav Fomichev wrote:
> Implement new bpf_xdp_metadata_export_to_skb kfunc which
> prepares compatible xdp metadata for kernel consumption.
> This kfunc should be called prior to bpf_redirect
> or when XDP_PASS'ing the frame into the kernel (note, the drivers
> have to be updated to enable consuming XDP_PASS'ed metadata).
> 
> veth driver is amended to consume this metadata when converting to skb.
> 
> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> whether the frame has skb metadata. The metadata is currently
> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> this requirement later on if needed, we'd have to memmove
> xdp_skb_metadata).

IMO we should split the xdp -> skb work from the pure HW data access 
in XDP. We have a proof point here that there is a way of building 
on top of the first 5 patches to achieve the objective, and that's
sufficient to let the prior work go in.

..because some of us may not agree that we should be pushing in a
fixed-format structure that's also listed in uAPI. And prefer to,
say, let the user define their format and add a call point for a
BPF prog to populate the skb from whatever data they decided to stash...

That's how I interpreted some of John's comments as well, but I may be
wrong.

Either way, there is no technical reason for the xdp -> skb field
propagation to hold up the HW descriptor access, right? If anything 
we may be able to make quicker progress if prior work is in tree
and multiple people can hack...
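
To sketch the user-defined-format idea (the kfunc name and the later
skb-side hook are assumptions, purely for illustration): the XDP prog
stashes whatever struct it likes in the metadata area, and a separate
BPF call point on the skb side decides how, or whether, to map it onto
the skb.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical kfunc, assumed for the example */
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

/* format chosen entirely by the user, not by uAPI */
struct my_meta {
	__u64 rx_timestamp;
	__u32 rx_hash;
};

SEC("xdp")
int stash_meta(struct xdp_md *ctx)
{
	struct my_meta *meta;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
	meta->rx_hash = 0; /* filled from whatever source the prog picks */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";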

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-16 21:12   ` Jakub Kicinski
@ 2022-11-16 21:49     ` Martin KaFai Lau
  0 siblings, 0 replies; 67+ messages in thread
From: Martin KaFai Lau @ 2022-11-16 21:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 11/16/22 1:12 PM, Jakub Kicinski wrote:
> On Mon, 14 Nov 2022 19:02:05 -0800 Stanislav Fomichev wrote:
>> Implement new bpf_xdp_metadata_export_to_skb kfunc which
>> prepares compatible xdp metadata for kernel consumption.
>> This kfunc should be called prior to bpf_redirect
>> or when XDP_PASS'ing the frame into the kernel (note, the drivers
>> have to be updated to enable consuming XDP_PASS'ed metadata).
>>
>> veth driver is amended to consume this metadata when converting to skb.
>>
>> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
>> whether the frame has skb metadata. The metadata is currently
>> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
>> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
>> this requirement later on if needed, we'd have to memmove
>> xdp_skb_metadata).
> 
> IMO we should split the xdp -> skb work from the pure HW data access
> in XDP. We have a proof point here that there is a way of building
> on top of the first 5 patches to achieve the objective, and that's
> sufficient to let the prior work going in.

+1

Good idea.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 20:50                   ` Stanislav Fomichev
@ 2022-11-16 23:47                     ` John Fastabend
  2022-11-17  0:19                       ` Stanislav Fomichev
  2022-11-17 10:27                       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 67+ messages in thread
From: John Fastabend @ 2022-11-16 23:47 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: Toke Høiland-Jørgensen, Martin KaFai Lau, bpf, ast,
	daniel, andrii, song, yhs, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Toke Høiland-Jørgensen wrote:
> > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > >
> > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > >>>>>>> +                           struct bpf_patch *patch)
> > > >>>>>>> +{
> > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > >>>>>>> +             /* return true; */
> > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > >>>>>>> +     }
> > > >>>>>>> +}
> > > >>>>>>
> > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > >>>>>> (with those helper wrappers we were discussing before).
> > > >>>>>
> > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > >>>>> point?
> > > >>>>
> > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > >>>
> > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > >>> whether that complexity is manageable or not :-)
> > > >>> Do you want me to add those wrappers you've back without any real users?
> > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > >>> objections; I can maybe bring some of this back gated by some
> > > >>> static_branch to avoid the fastpath cost?
> > > >>
> > > >> I missed the context a bit what did you mean "would be good to see some
> > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > >> function"? In this case do you mean BPF code directly without the call?
> > > >>
> > > >> Early on I thought we should just expose the rx_descriptor which would
> > > >> be roughly the same right? (difference being code embedded in driver vs
> > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > >> which wasn't as straight forward as the simpler just read it from
> > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > >> BPF with the mlx4_cqe struct exposed
> > > >>
> > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > >> {
> > > >>          u64 hi, lo;
> > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > >>
> > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > >>
> > > >>          return hi | lo;
> > > >> }
> > > >>
> > > >> but converting that to nsec is a bit annoying,
> > > >>
> > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > >>                              struct skb_shared_hwtstamps *hwts,
> > > >>                              u64 timestamp)
> > > >> {
> > > >>          unsigned int seq;
> > > >>          u64 nsec;
> > > >>
> > > >>          do {
> > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > >>
> > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > >> }
> > > >>
> > > >> I think the nsec is what you really want.
> > > >>
> > > >> With all the drivers doing slightly different ops we would have
> > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > >> at least the mlx and ice drivers it looks like we would need some
> > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > >> on ice driver though to get rx tstamps on all packets.
> > > >>
> > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > >> and not BPF insns directly.
> > > >
> > > > Some of the mlx5 path looks like this:
> > > >
> > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > >
> > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > >                                                u64 timestamp)
> > > > {
> > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > >
> > > >          return ns_to_ktime(time);
> > > > }
> > > >
> > > > If some hints are harder to get, then just doing a kfunc call is better.
> > >
> > > Sure, but if we end up having a full function call for every field in
> > > the metadata, that will end up having a significant performance impact
> > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > which will collect several bits of metadata).
> > >
> > > > csum may have a better chance to inline?
> > >
> > > Yup, I agree. Including that also makes it possible to benchmark this
> > > series against Jesper's; which I think we should definitely be doing
> > > before merging this.
> >
> > Good point I got sort of singularly focused on timestamp because I have
> > a use case for it now.
> >
> > Also hash is often sitting in the rx descriptor.
> 
> Ack, let me try to add something else (that's more inline-able) on the
> rx side for a v2.

If you go with the in-kernel BPF kfunc approach (vs the user space side) I
think you also need to add CO-RE to be friendly to driver developers?
Otherwise they have to keep that read in sync with the descriptors? You
also need to handle versioning of descriptors, where, depending on the
specific options, firmware and chip being enabled, the descriptor might
be moving around. Of course we can push this all to the developer, but it
seems not so nice when we have the machinery to do this and we handle it
for all other structures.

With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
expect CO-RE to sort it out based on the btf_id of the currently running
descriptor. If you go through the normal path you get this for free, of
course.
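
For example, a minimal sketch of that CO-RE read, assuming the prog can
get at the descriptor pointer somehow (mlx4_cqe field name just for
illustration):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

/* the offset of immed_rss_invalid is relocated at load time against the
 * BTF of whatever kernel/driver is actually running, so the prog never
 * has to chase descriptor layout changes by hand
 */
static __always_inline __u32 mlx4_rx_hash(struct mlx4_cqe *cqe)
{
	return bpf_ntohl(BPF_CORE_READ(cqe, immed_rss_invalid));
}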

.John

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 23:47                     ` John Fastabend
@ 2022-11-17  0:19                       ` Stanislav Fomichev
  2022-11-17  2:17                         ` Alexei Starovoitov
  2022-11-17 10:27                       ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-17  0:19 UTC (permalink / raw)
  To: John Fastabend
  Cc: Toke Høiland-Jørgensen, Martin KaFai Lau, bpf, ast,
	daniel, andrii, song, yhs, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > <john.fastabend@gmail.com> wrote:
> > >
> > > Toke Høiland-Jørgensen wrote:
> > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > >
> > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > >>>>>>> +{
> > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > >>>>>>> +             /* return true; */
> > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > >>>>>>> +     }
> > > > >>>>>>> +}
> > > > >>>>>>
> > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > >>>>>
> > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > >>>>> point?
> > > > >>>>
> > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > >>>
> > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > >>> whether that complexity is manageable or not :-)
> > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > >>> objections; I can maybe bring some of this back gated by some
> > > > >>> static_branch to avoid the fastpath cost?
> > > > >>
> > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > >>
> > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > >> which wasn't as straight forward as the simpler just read it from
> > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > >> BPF with the mlx4_cqe struct exposed
> > > > >>
> > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > >> {
> > > > >>          u64 hi, lo;
> > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > >>
> > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > >>
> > > > >>          return hi | lo;
> > > > >> }
> > > > >>
> > > > >> but converting that to nsec is a bit annoying,
> > > > >>
> > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > >>                              u64 timestamp)
> > > > >> {
> > > > >>          unsigned int seq;
> > > > >>          u64 nsec;
> > > > >>
> > > > >>          do {
> > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > >>
> > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > >> }
> > > > >>
> > > > >> I think the nsec is what you really want.
> > > > >>
> > > > >> With all the drivers doing slightly different ops we would have
> > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > >> on ice driver though to get rx tstamps on all packets.
> > > > >>
> > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > >> and not BPF insns directly.
> > > > >
> > > > > Some of the mlx5 path looks like this:
> > > > >
> > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > >
> > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > >                                                u64 timestamp)
> > > > > {
> > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > >
> > > > >          return ns_to_ktime(time);
> > > > > }
> > > > >
> > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > >
> > > > Sure, but if we end up having a full function call for every field in
> > > > the metadata, that will end up having a significant performance impact
> > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > which will collect several bits of metadata).
> > > >
> > > > > csum may have a better chance to inline?
> > > >
> > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > series against Jesper's; which I think we should definitely be doing
> > > > before merging this.
> > >
> > > Good point I got sort of singularly focused on timestamp because I have
> > > a use case for it now.
> > >
> > > Also hash is often sitting in the rx descriptor.
> >
> > Ack, let me try to add something else (that's more inline-able) on the
> > rx side for a v2.
>
> If you go with in-kernel BPF kfunc approach (vs user space side) I think
> you also need to add CO-RE to be friendly for driver developers? Otherwise
> they have to keep that read in sync with the descriptors? Also need to
> handle versioning of descriptors where depending on specific options
> and firmware and chip being enabled the descriptor might be moving
> around. Of course can push this all to developer, but seems not so
> nice when we have the machinery to do this and we handle it for all
> other structures.
>
> With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> expect CO-RE sorts it out based on currently running btf_id of the
> descriptor. If you go through normal path you get this for free of
> course.

Doesn't look like the descriptors are as nice as you're trying to
paint them (with clear hash/csum fields) :-) So not sure how much
CO-RE would help.
At least looking at mlx4 rx_csum, the driver consults three different
sets of flags to figure out the hash_type. Or am I just unlucky with
mlx4?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  0:19                       ` Stanislav Fomichev
@ 2022-11-17  2:17                         ` Alexei Starovoitov
  2022-11-17  2:53                           ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-17  2:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Stanislav Fomichev wrote:
> > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > <john.fastabend@gmail.com> wrote:
> > > >
> > > > Toke Høiland-Jørgensen wrote:
> > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > >
> > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > >>>>>>> +{
> > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > >>>>>>> +             /* return true; */
> > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > >>>>>>> +     }
> > > > > >>>>>>> +}
> > > > > >>>>>>
> > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > >>>>>
> > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > >>>>> point?
> > > > > >>>>
> > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > >>>
> > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > >>> whether that complexity is manageable or not :-)
> > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > >>> static_branch to avoid the fastpath cost?
> > > > > >>
> > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > >>
> > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > >>
> > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > >> {
> > > > > >>          u64 hi, lo;
> > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > >>
> > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > >>
> > > > > >>          return hi | lo;
> > > > > >> }
> > > > > >>
> > > > > >> but converting that to nsec is a bit annoying,
> > > > > >>
> > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > >>                              u64 timestamp)
> > > > > >> {
> > > > > >>          unsigned int seq;
> > > > > >>          u64 nsec;
> > > > > >>
> > > > > >>          do {
> > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > >>
> > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > >> }
> > > > > >>
> > > > > >> I think the nsec is what you really want.
> > > > > >>
> > > > > >> With all the drivers doing slightly different ops we would have
> > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > >>
> > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > >> and not BPF insns directly.
> > > > > >
> > > > > > Some of the mlx5 path looks like this:
> > > > > >
> > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > >
> > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > >                                                u64 timestamp)
> > > > > > {
> > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > >
> > > > > >          return ns_to_ktime(time);
> > > > > > }
> > > > > >
> > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > >
> > > > > Sure, but if we end up having a full function call for every field in
> > > > > the metadata, that will end up having a significant performance impact
> > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > which will collect several bits of metadata).
> > > > >
> > > > > > csum may have a better chance to inline?
> > > > >
> > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > series against Jesper's; which I think we should definitely be doing
> > > > > before merging this.
> > > >
> > > > Good point I got sort of singularly focused on timestamp because I have
> > > > a use case for it now.
> > > >
> > > > Also hash is often sitting in the rx descriptor.
> > >
> > > Ack, let me try to add something else (that's more inline-able) on the
> > > rx side for a v2.
> >
> > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > they have to keep that read in sync with the descriptors? Also need to
> > handle versioning of descriptors where depending on specific options
> > and firmware and chip being enabled the descriptor might be moving
> > around. Of course can push this all to developer, but seems not so
> > nice when we have the machinery to do this and we handle it for all
> > other structures.
> >
> > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > expect CO-RE sorts it out based on currently running btf_id of the
> > descriptor. If you go through normal path you get this for free of
> > course.
>
> Doesn't look like the descriptors are as nice as you're trying to
> paint them (with clear hash/csum fields) :-) So not sure how much
> CO-RE would help.
> At least looking at mlx4 rx_csum, the driver consults three different
> sets of flags to figure out the hash_type. Or am I just unlucky with
> mlx4?

Which part are you talking about?
        hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
is trivial enough for a bpf prog to do if it has access to the 'cqe' pointer,
which is what John is proposing (I think).
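
Something like the sketch below. csum_unfold() only widens the folded
16-bit sum, so the BPF side is a plain read (again assuming a
CO-RE-readable cqe pointer):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* equivalent of: hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
 * no byte swap happens here; any ipv6/pseudo-header fixups are still the
 * prog's job
 */
static __always_inline __u32 mlx4_hw_checksum(struct mlx4_cqe *cqe)
{
	return BPF_CORE_READ(cqe, checksum);
}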

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  2:17                         ` Alexei Starovoitov
@ 2022-11-17  2:53                           ` Stanislav Fomichev
  2022-11-17  2:59                             ` Alexei Starovoitov
  2022-11-17 11:32                             ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-17  2:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Stanislav Fomichev wrote:
> > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Toke Høiland-Jørgensen wrote:
> > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > >
> > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > >>>>>>> +{
> > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > >>>>>>> +             /* return true; */
> > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > >>>>>>> +     }
> > > > > > >>>>>>> +}
> > > > > > >>>>>>
> > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > >>>>>
> > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > >>>>> point?
> > > > > > >>>>
> > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > >>>
> > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > >>
> > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > >>
> > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > >>
> > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > >> {
> > > > > > >>          u64 hi, lo;
> > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > >>
> > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > >>
> > > > > > >>          return hi | lo;
> > > > > > >> }
> > > > > > >>
> > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > >>
> > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > >>                              u64 timestamp)
> > > > > > >> {
> > > > > > >>          unsigned int seq;
> > > > > > >>          u64 nsec;
> > > > > > >>
> > > > > > >>          do {
> > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > >>
> > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > >> }
> > > > > > >>
> > > > > > >> I think the nsec is what you really want.
> > > > > > >>
> > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > >>
> > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > >> and not BPF insns directly.
> > > > > > >
> > > > > > > Some of the mlx5 path looks like this:
> > > > > > >
> > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > >
> > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > >                                                u64 timestamp)
> > > > > > > {
> > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > >
> > > > > > >          return ns_to_ktime(time);
> > > > > > > }
> > > > > > >
> > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > >
> > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > the metadata, that will end up having a significant performance impact
> > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > which will collect several bits of metadata).
> > > > > >
> > > > > > > csum may have a better chance to inline?
> > > > > >
> > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > before merging this.
> > > > >
> > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > a use case for it now.
> > > > >
> > > > > Also hash is often sitting in the rx descriptor.
> > > >
> > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > rx side for a v2.
> > >
> > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > they have to keep that read in sync with the descriptors? Also need to
> > > handle versioning of descriptors where depending on specific options
> > > and firmware and chip being enabled the descriptor might be moving
> > > around. Of course can push this all to developer, but seems not so
> > > nice when we have the machinery to do this and we handle it for all
> > > other structures.
> > >
> > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > expect CO-RE sorts it out based on currently running btf_id of the
> > > descriptor. If you go through normal path you get this for free of
> > > course.
> >
> > Doesn't look like the descriptors are as nice as you're trying to
> > paint them (with clear hash/csum fields) :-) So not sure how much
> > CO-RE would help.
> > At least looking at mlx4 rx_csum, the driver consults three different
> > sets of flags to figure out the hash_type. Or am I just unlucky with
> > mlx4?
>
> Which part are you talking about ?
>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> which is what John is proposing (I think).

I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
I'm assuming we want to have hash_type available to the progs?

But also, check_csum handles other corner cases:
- short_frame: we simply force all those small frames to skip checksum complete
- get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
IPv6 header
- get_fixed_ipv4_csum: Although the stack expects checksum which
doesn't include the pseudo header, the HW adds it

So it doesn't look like we can just unconditionally use cqe->checksum?
The driver does a lot of massaging around that field to make it
palatable.
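
For reference, the flow around that field goes roughly like this (my
paraphrase of the driver logic, not the verbatim code; helper names as
in the driver):

if (short_frame(len)) {
	/* tiny frames skip checksum complete entirely */
	ip_summed = CHECKSUM_UNNECESSARY;
} else {
	hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
	if (ipv6)
		/* add back the bytes the HW left out of the IPv6 header */
		err = get_fixed_ipv6_csum(hw_checksum, skb, hdr);
	else
		/* subtract the pseudo header the HW folded in */
		err = get_fixed_ipv4_csum(hw_checksum, skb, hdr);
	ip_summed = err ? CHECKSUM_NONE : CHECKSUM_COMPLETE;
}

So anything that wants CHECKSUM_COMPLETE semantics out of cqe->checksum
has to replicate at least that much.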

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  2:53                           ` Stanislav Fomichev
@ 2022-11-17  2:59                             ` Alexei Starovoitov
  2022-11-17  4:18                               ` Stanislav Fomichev
  2022-11-17 11:32                             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-17  2:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >
> > > > Stanislav Fomichev wrote:
> > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > <john.fastabend@gmail.com> wrote:
> > > > > >
> > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > >
> > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > >>>>>>> +{
> > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > >>>>>>> +             /* return true; */
> > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > >>>>>>> +     }
> > > > > > > >>>>>>> +}
> > > > > > > >>>>>>
> > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > >>>>>
> > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > >>>>> point?
> > > > > > > >>>>
> > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > >>>
> > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > >>
> > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > >>
> > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > >>
> > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > >> {
> > > > > > > >>          u64 hi, lo;
> > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > >>
> > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > >>
> > > > > > > >>          return hi | lo;
> > > > > > > >> }
> > > > > > > >>
> > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > >>
> > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > >>                              u64 timestamp)
> > > > > > > >> {
> > > > > > > >>          unsigned int seq;
> > > > > > > >>          u64 nsec;
> > > > > > > >>
> > > > > > > >>          do {
> > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > >>
> > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > >> }
> > > > > > > >>
> > > > > > > >> I think the nsec is what you really want.
> > > > > > > >>
> > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > >>
> > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > >> and not BPF insns directly.
> > > > > > > >
> > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > >
> > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > >
> > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > >                                                u64 timestamp)
> > > > > > > > {
> > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > >
> > > > > > > >          return ns_to_ktime(time);
> > > > > > > > }
> > > > > > > >
> > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > >
> > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > which will collect several bits of metadata).
> > > > > > >
> > > > > > > > csum may have a better chance to inline?
> > > > > > >
> > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > before merging this.
> > > > > >
> > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > a use case for it now.
> > > > > >
> > > > > > Also hash is often sitting in the rx descriptor.
> > > > >
> > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > rx side for a v2.
> > > >
> > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > they have to keep that read in sync with the descriptors? Also need to
> > > > handle versioning of descriptors where depending on specific options
> > > > and firmware and chip being enabled the descriptor might be moving
> > > > around. Of course can push this all to developer, but seems not so
> > > > nice when we have the machinery to do this and we handle it for all
> > > > other structures.
> > > >
> > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > descriptor. If you go through normal path you get this for free of
> > > > course.
> > >
> > > Doesn't look like the descriptors are as nice as you're trying to
> > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > CO-RE would help.
> > > At least looking at mlx4 rx_csum, the driver consults three different
> > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > mlx4?
> >
> > Which part are you talking about ?
> >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > which is what John is proposing (I think).
>
> I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> I'm assuming we want to have hash_type available to the progs?
>
> But also, check_csum handles other corner cases:
> - short_frame: we simply force all those small frames to skip checksum complete
> - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> IPv6 header
> - get_fixed_ipv4_csum: Although the stack expects checksum which
> doesn't include the pseudo header, the HW adds it

Of course, but kfunc won't be doing them either.
We're talking XDP fast path.
The mlx4 hw is old and incapable.
No amount of sw can help.

> So it doesn't look like we can just unconditionally use cqe->checksum?
> The driver does a lot of massaging around that field to make it
> palatable.

Of course we can. cqe->checksum is still usable. the bpf prog
would need to know what it's reading.
There should be no attempt to present a unified state of hw bits.
That's what skb is for. XDP layer should not hide such hw details.
Otherwise it will become a mini skb layer with all that overhead.
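
For example, a prog that knows it is running on mlx4 can interpret the
raw bits itself, roughly like this (sketch; assumes the raw cqe pointer
is exposed to the prog somehow, constants as in the driver):

/* "hw verified the L4 csum", mirroring what the driver checks */
__u16 status = bpf_ntohs(cqe->status);
bool l4_ok = (status & (MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
	     (status & MLX4_CQE_STATUS_IPOK) &&
	     cqe->checksum == bpf_htons(0xffff);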

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  2:59                             ` Alexei Starovoitov
@ 2022-11-17  4:18                               ` Stanislav Fomichev
  2022-11-17  6:55                                 ` John Fastabend
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-17  4:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Stanislav Fomichev wrote:
> > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > >
> > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > >
> > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > >>>>>>> +{
> > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > >>>>>>> +     }
> > > > > > > > >>>>>>> +}
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > >>>>>
> > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > >>>>> point?
> > > > > > > > >>>>
> > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > >>>
> > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > >>
> > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > >>
> > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > >>
> > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > >> {
> > > > > > > > >>          u64 hi, lo;
> > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > >>
> > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > >>
> > > > > > > > >>          return hi | lo;
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > >>
> > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > >>                              u64 timestamp)
> > > > > > > > >> {
> > > > > > > > >>          unsigned int seq;
> > > > > > > > >>          u64 nsec;
> > > > > > > > >>
> > > > > > > > >>          do {
> > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > >>
> > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > >>
> > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > >>
> > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > >> and not BPF insns directly.
> > > > > > > > >
> > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > >
> > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > >
> > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > >                                                u64 timestamp)
> > > > > > > > > {
> > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > >
> > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > >
> > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > which will collect several bits of metadata).
> > > > > > > >
> > > > > > > > > csum may have a better chance to inline?
> > > > > > > >
> > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > before merging this.
> > > > > > >
> > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > a use case for it now.
> > > > > > >
> > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > >
> > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > rx side for a v2.
> > > > >
> > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > handle versioning of descriptors where depending on specific options
> > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > around. Of course can push this all to developer, but seems not so
> > > > > nice when we have the machinery to do this and we handle it for all
> > > > > other structures.
> > > > >
> > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > descriptor. If you go through normal path you get this for free of
> > > > > course.
> > > >
> > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > CO-RE would help.
> > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > mlx4?
> > >
> > > Which part are you talking about ?
> > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > which is what John is proposing (I think).
> >
> > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > I'm assuming we want to have hash_type available to the progs?
> >
> > But also, check_csum handles other corner cases:
> > - short_frame: we simply force all those small frames to skip checksum complete
> > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > IPv6 header
> > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > doesn't include the pseudo header, the HW adds it
>
> Of course, but kfunc won't be doing them either.
> We're talking XDP fast path.
> The mlx4 hw is old and incapable.
> No amount of sw can help.
>
> > So it doesn't look like we can just unconditionally use cqe->checksum?
> > The driver does a lot of massaging around that field to make it
> > palatable.
>
> Of course we can. cqe->checksum is still usable. the bpf prog
> would need to know what it's reading.
> There should be no attempt to present a unified state of hw bits.
> That's what skb is for. XDP layer should not hide such hw details.
> Otherwise it will become a mini skb layer with all that overhead.

I was hoping the kfunc could at least parse the flags and return some
pkt_hash_types-like enum to indicate what this csum covers.
So the users won't have to find the hardware manuals (not sure they
are even available?) to decipher what numbers they've got.
Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
that much different :-(

But going back a bit: I'm probably missing what John has been
suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
a CO-RE-like functionality by rewriting some "public api" (kfunc) into
the bytecode to access the relevant data.
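
I.e. conceptually the "unroll" can turn the call into a plain load at
whatever offset the driver knows for its own descriptor layout,
something like (illustrative only, register assignment hand-waved):

/* replace the kfunc call with "r0 = cqe->checksum" */
bpf_patch_append(patch,
	BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
		    offsetof(struct mlx4_cqe, checksum)));

which is the same "adjust the access to the running layout" job that
CO-RE does, just done in the kernel by the driver instead of by libbpf
at load time.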

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  4:18                               ` Stanislav Fomichev
@ 2022-11-17  6:55                                 ` John Fastabend
  2022-11-17 17:51                                   ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: John Fastabend @ 2022-11-17  6:55 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > >
> > > > > > Stanislav Fomichev wrote:
> > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > > >
> > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > >>>>>>> +{
> > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > >>>>>>> +     }
> > > > > > > > > >>>>>>> +}
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > >>>>> point?
> > > > > > > > > >>>>
> > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > >>>
> > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > >>
> > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > >>
> > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > >>
> > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > >> {
> > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > >>
> > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > >>
> > > > > > > > > >>          return hi | lo;
> > > > > > > > > >> }
> > > > > > > > > >>
> > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > >>
> > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > >> {
> > > > > > > > > >>          unsigned int seq;
> > > > > > > > > >>          u64 nsec;
> > > > > > > > > >>
> > > > > > > > > >>          do {
> > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > >>
> > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > >> }
> > > > > > > > > >>
> > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > >>
> > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > >>
> > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > >
> > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > >
> > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > >
> > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > {
> > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > >
> > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > >
> > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > which will collect several bits of metadata).
> > > > > > > > >
> > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > >
> > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > before merging this.
> > > > > > > >
> > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > a use case for it now.
> > > > > > > >
> > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > >
> > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > rx side for a v2.
> > > > > >
> > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > handle versioning of descriptors where depending on specific options
> > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > other structures.
> > > > > >
> > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > course.
> > > > >
> > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > CO-RE would help.
> > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > mlx4?
> > > >
> > > > Which part are you talking about ?
> > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > which is what John is proposing (I think).

Yeah, this is what I've been considering. If you just get the 'cqe' pointer,
walking the check_sum and rxhash out of it should be straightforward.

There seems to be a real difference between timestamps and most other
fields IMO. Timestamps require some extra logic to turn the raw counter
into ns when using the NIC hw clock. And the hw clock requires some
coordination to keep it in sync and stop it from overflowing, which may
be done through other protocols like PTP in my use case. In some cases I
think the clock is also shared amongst multiple phys. It seems mlx has a
seqlock_t to protect it, and (I'm less sure about this) the Intel NICs
may use a sideband control channel.

Then there is everything else I can think up that is per-packet data
and requires no coordination with the driver. It's just reading fields in
the completion queue: the csum, check_sum, vlan_header and so on. Sure, we
could kfunc each one of those things, but we could also just write that
directly in BPF and remove some abstractions and kernel dependency by
doing it directly in the BPF program. Whether you like that abstraction
seems to be the point of contention; my opinion is that the cost of the
kernel dependency is high, and I can abstract it with a user library
anyways, so burying it in the kernel only causes me support issues and
backwards-compat problems.
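
To make that concrete: with the descriptor exposed, the program side for
those per-packet fields could be as simple as (rough sketch; how the
'cqe' pointer actually gets surfaced to the prog is the open question):

/* plain reads of completion-queue fields, no driver coordination */
__u16 csum = bpf_ntohs(cqe->checksum);
__u32 hash = bpf_ntohl(cqe->immed_rss_invalid);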

Hopefully, my position is more clear.

> > >
> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > I'm assuming we want to have hash_type available to the progs?
> > >
> > > But also, check_csum handles other corner cases:
> > > - short_frame: we simply force all those small frames to skip checksum complete
> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > IPv6 header
> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > doesn't include the pseudo header, the HW adds it
> >
> > Of course, but kfunc won't be doing them either.
> > We're talking XDP fast path.
> > The mlx4 hw is old and incapable.
> > No amount of sw can help.

Doesn't this lend itself to letting the XDP BPF program write the
BPF code to read it out? Maybe someone cares about these details
for some cpumap thing, but the rest of us won't care; we might just
want to read check_csum. Maybe we have an IPv6-only or IPv4-only
network so we can take further shortcuts. If a driver dev does
this, they will be forced to do the catch-all version because
they have no way to know such details.

> >
> > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > The driver does a lot of massaging around that field to make it
> > > palatable.
> >
> > Of course we can. cqe->checksum is still usable. the bpf prog
> > would need to know what it's reading.
> > There should be no attempt to present a unified state of hw bits.
> > That's what skb is for. XDP layer should not hide such hw details.
> > Otherwise it will become a mini skb layer with all that overhead.
> 
> I was hoping the kfunc could at least parse the flags and return some
> pkt_hash_types-like enum to indicate what this csum covers.
> So the users won't have to find the hardware manuals (not sure they
> are even available?) to decipher what numbers they've got.
> Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
> that much different :-(

The driver developers could still provide and ship the BPF libs
with their drivers. I think if someone is going to use their NICs
(and lots of them) and requires XDP, it will get done. We could put
them next to the driver code, mlx4.bpf or something.

> 
> But going back a bit: I'm probably missing what John has been
> suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
> a CO-RE-like functionality by rewriting some "public api" (kfunc) into
> the bytecode to access the relevant data.

This was maybe a bit of an aside. What I was pondering out loud is
my recollection that there may be a few different descriptor layouts
depending on the features enabled, the exact device loaded and such.
So in this case, if I were a driver writer, I might not want to
hardcode the offset of the check_sum field. If I could use CO-RE,
then I wouldn't have to care whether in one version it is the Nth
field and later on someone makes it the Mth field, just like with any
normal kernel struct. But through the kfunc interface I couldn't see
how to get that. So instead of having a bunch of kfunc implementations
you could just have one for all your device classes, because you
always name the field the same thing.
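
Roughly what I'm picturing on the BPF side (sketch; assumes struct
mlx4_cqe is visible via vmlinux.h and that a probe-read of the exposed
pointer is acceptable here):

/* assumes struct mlx4_cqe comes in via vmlinux.h */
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_helpers.h>

static __always_inline __u16 rx_csum_raw(struct mlx4_cqe *cqe)
{
	/* the field offset is CO-RE-relocated against the running
	 * kernel's BTF, so this doesn't hardcode the layout */
	return BPF_CORE_READ(cqe, checksum);
}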

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-16 23:47                     ` John Fastabend
  2022-11-17  0:19                       ` Stanislav Fomichev
@ 2022-11-17 10:27                       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-17 10:27 UTC (permalink / raw)
  To: John Fastabend, Stanislav Fomichev, John Fastabend
  Cc: Martin KaFai Lau, bpf, ast, daniel, andrii, song, yhs, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Just a separate comment on this bit:

John Fastabend <john.fastabend@gmail.com> writes:
> If you go with in-kernel BPF kfunc approach (vs user space side) I think
> you also need to add CO-RE to be friendly for driver developers? Otherwise
> they have to keep that read in sync with the descriptors?

CO-RE is for doing relocations of field offsets without having to
recompile. That's not really relevant for the kernel itself, which gets
recompiled whenever the layout changes. So the field offsets are just
kept in sync with offsetof(), like in Stanislav's RFCv2 where he had
this snippet:

+			/*	return ((struct sk_buff *)r5)->tstamp; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5,
+				    offsetof(struct sk_buff, tstamp)),

So I definitely don't think this is an argument against the kfunc
approach?

> Also need to handle versioning of descriptors where depending on
> specific options and firmware and chip being enabled the descriptor
> might be moving around.

This is exactly the kind of thing the driver is supposed to take care
of; it knows the hardware configuration and can pick the right
descriptor format. Either by just picking an entirely different kfunc
unroll depending on the config (if it's static), or by adding the right
checks to the unroll. You'd have to replicate all this logic in BPF
anyway, and while I'm not doubting *you* are capable of doing this, I
don't think we should be forcing every XDP developer to deal with all
this.

Or to put it another way, a proper hardware abstraction and high-quality
drivers is one of the main selling points of XDP over DPDK and other
kernel bypass solutions; we should not jettison this when enabling
metadata support!
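
Just to illustrate what I mean by "different unroll depending on the
config" (sketch only, reusing the bpf_patch helpers from this series;
the callback shape and the emit helper are hypothetical):

/* hypothetical driver callback shape; the point is only that the
 * driver branches on its own config when emitting the BPF */
static void mlx4_unroll_rx_timestamp(struct net_device *dev,
				     struct bpf_patch *patch)
{
	struct mlx4_en_priv *priv = netdev_priv(dev);

	if (priv->hwtstamp_config.rx_filter == HWTSTAMP_FILTER_NONE) {
		/* rx timestamping off in this config: return 0 */
		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
	} else {
		/* emit the cqe read + cycles->ns conversion matching
		 * the active descriptor layout */
		mlx4_emit_cqe_ts_read(patch);	/* hypothetical */
	}
}

The XDP program just calls the kfunc; it never has to know which mode
the hardware is in.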

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  2:53                           ` Stanislav Fomichev
  2022-11-17  2:59                             ` Alexei Starovoitov
@ 2022-11-17 11:32                             ` Toke Høiland-Jørgensen
  2022-11-17 16:59                               ` Alexei Starovoitov
  1 sibling, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-17 11:32 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov
  Cc: John Fastabend, Martin KaFai Lau, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

Stanislav Fomichev <sdf@google.com> writes:

>> > Doesn't look like the descriptors are as nice as you're trying to
>> > paint them (with clear hash/csum fields) :-) So not sure how much
>> > CO-RE would help.
>> > At least looking at mlx4 rx_csum, the driver consults three different
>> > sets of flags to figure out the hash_type. Or am I just unlucky with
>> > mlx4?
>>
>> Which part are you talking about ?
>>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
>> which is what John is proposing (I think).
>
> I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> I'm assuming we want to have hash_type available to the progs?

I agree we should expose the hash_type, but that doesn't actually look
to be that complicated, see below.

> But also, check_csum handles other corner cases:
> - short_frame: we simply force all those small frames to skip checksum complete
> - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> IPv6 header
> - get_fixed_ipv4_csum: Although the stack expects checksum which
> doesn't include the pseudo header, the HW adds it
>
> So it doesn't look like we can just unconditionally use cqe->checksum?
> The driver does a lot of massaging around that field to make it
> palatable.

Poking around a bit in the other drivers, AFAICT it's only a subset of
drivers that support CSUM_COMPLETE at all; for instance, the Intel
drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
propagate?

AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
to other data structures than the rx descriptor to determine the status
of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
just exposing the rx descriptor to BPF as John is suggesting does not
actually give the XDP program enough information to act on the checksum
field on its own. We could still have a separate kfunc to just expose
the hw checksum value (see below), but I think it probably needs to be
paired with other kfuncs to be useful.

Looking at the mlx4 code, I think the following mapping to kfuncs (in
pseudo-C) would give the flexibility for XDP to access all the bits it
needs, while inlining everything except getting the full checksum for
non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
drivers (mlx5, ice, i40e) indicates that this would work for those
drivers as well.


bpf_xdp_metadata_rx_hash_supported() {
  return dev->features & NETIF_F_RXHASH;
}

bpf_xdp_metadata_rx_hash() {
  return be32_to_cpu(cqe->immed_rss_invalid);
}

bpf_xdp_metadata_rx_hash_type() {
  if (likely(dev->features & NETIF_F_RXCSUM) &&
      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
	(cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
	  cqe->checksum == cpu_to_be16(0xffff))
     return PKT_HASH_TYPE_L4;

   return PKT_HASH_TYPE_L3;
}

bpf_xdp_metadata_rx_csum_supported() {
  return dev->features & NETIF_F_RXCSUM;
}

bpf_xdp_metadata_rx_csum_level() {
	if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
				       MLX4_CQE_STATUS_UDP)) &&
	    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
	    cqe->checksum == cpu_to_be16(0xffff))
            return CHECKSUM_UNNECESSARY;
            
	if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
	      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
              !short_frame(len))
            return CHECKSUM_COMPLETE; /* we could also omit this case entirely */

        return CHECKSUM_NONE;
}

/* this one could be called by the metadata_to_skb code */
bpf_xdp_metadata_rx_csum_full() {
  return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
}

/* this one would be for people like John who want to re-implement
 * check_csum() themselves */
bpf_xdp_metadata_rx_csum_raw() {
  return cqe->checksum;
}
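
And the program side would then be something like this (sketch; exact
signatures and how the ctx/descriptor gets plumbed through are TBD, so
the 'struct xdp_md *ctx' argument here is just an assumption):

extern bool bpf_xdp_metadata_rx_hash_supported(struct xdp_md *ctx) __ksym;
extern __u32 bpf_xdp_metadata_rx_hash(struct xdp_md *ctx) __ksym;
extern __u32 bpf_xdp_metadata_rx_hash_type(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	__u32 hash = 0, type = 0;

	if (bpf_xdp_metadata_rx_hash_supported(ctx)) {
		hash = bpf_xdp_metadata_rx_hash(ctx);
		type = bpf_xdp_metadata_rx_hash_type(ctx);
	}

	/* use hash/type, e.g. for load balancing before a redirect */
	return XDP_PASS;
}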


-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 11:32                             ` Toke Høiland-Jørgensen
@ 2022-11-17 16:59                               ` Alexei Starovoitov
  2022-11-17 17:52                                 ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-17 16:59 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, John Fastabend, Martin KaFai Lau, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, Network Development

On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> >> > Doesn't look like the descriptors are as nice as you're trying to
> >> > paint them (with clear hash/csum fields) :-) So not sure how much
> >> > CO-RE would help.
> >> > At least looking at mlx4 rx_csum, the driver consults three different
> >> > sets of flags to figure out the hash_type. Or am I just unlucky with
> >> > mlx4?
> >>
> >> Which part are you talking about ?
> >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> >> which is what John is proposing (I think).
> >
> > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > I'm assuming we want to have hash_type available to the progs?
>
> I agree we should expose the hash_type, but that doesn't actually look
> to be that complicated, see below.
>
> > But also, check_csum handles other corner cases:
> > - short_frame: we simply force all those small frames to skip checksum complete
> > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > IPv6 header
> > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > doesn't include the pseudo header, the HW adds it
> >
> > So it doesn't look like we can just unconditionally use cqe->checksum?
> > The driver does a lot of massaging around that field to make it
> > palatable.
>
> Poking around a bit in the other drivers, AFAICT it's only a subset of
> drivers that support CSUM_COMPLETE at all; for instance, the Intel
> drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
> CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
> propagate?
>
> AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
> to other data structures than the rx descriptor to determine the status
> of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
> just exposing the rx descriptor to BPF as John is suggesting does not
> actually give the XDP program enough information to act on the checksum
> field on its own. We could still have a separate kfunc to just expose
> the hw checksum value (see below), but I think it probably needs to be
> paired with other kfuncs to be useful.
>
> Looking at the mlx4 code, I think the following mapping to kfuncs (in
> pseudo-C) would give the flexibility for XDP to access all the bits it
> needs, while inlining everything except getting the full checksum for
> non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
> drivers (mlx5, ice, i40e) indicates that this would work for those
> drivers as well.
>
>
> bpf_xdp_metadata_rx_hash_supported() {
>   return dev->features & NETIF_F_RXHASH;
> }
>
> bpf_xdp_metadata_rx_hash() {
>   return be32_to_cpu(cqe->immed_rss_invalid);
> }
>
> bpf_xdp_metadata_rx_hash_type() {
>   if (likely(dev->features & NETIF_F_RXCSUM) &&
>       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
>         (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>           cqe->checksum == cpu_to_be16(0xffff))
>      return PKT_HASH_TYPE_L4;
>
>    return PKT_HASH_TYPE_L3;
> }
>
> bpf_xdp_metadata_rx_csum_supported() {
>   return dev->features & NETIF_F_RXCSUM;
> }
>
> bpf_xdp_metadata_rx_csum_level() {
>         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>                                        MLX4_CQE_STATUS_UDP)) &&
>             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>             cqe->checksum == cpu_to_be16(0xffff))
>             return CHECKSUM_UNNECESSARY;
>
>         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
>               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
>               !short_frame(len))
>             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
>
>         return CHECKSUM_NONE;
> }
>
> /* this one could be called by the metadata_to_skb code */
> bpf_xdp_metadata_rx_csum_full() {
>   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
> }
>
> /* this one would be for people like John who want to re-implement
>  * check_csum() themselves */
> bpf_xdp_metadata_rx_csum_raw() {
>   return cqe->checksum;
> }

Are you proposing a bunch of per-driver kfuncs that the bpf prog will call?
If so that works, but the bpf prog needs to pass dev and cqe pointers
into these kfuncs, so they need to be exposed to the prog somehow.
Probably through xdp_md?
This way we can have both: the bpf prog reading cqe fields directly
and using kfuncs to access things.
Inlining of kfuncs should be done generically.
It's not the driver's job to convert native asm into bpf asm.
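
E.g. something like this on the kernel side (sketch; the wrapper struct
around xdp_buff is hypothetical, the point is just that the kfunc can
recover the descriptor from the ctx the prog already has):

/* the prog passes its ctx; the driver knows what's wrapped around it */
u32 bpf_xdp_metadata_rx_hash(struct xdp_md *ctx)
{
	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
	struct mlx4_xdp_buff *mxbuf =
		container_of(xdp, struct mlx4_xdp_buff, xdp);

	return be32_to_cpu(mxbuf->cqe->immed_rss_invalid);
}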

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17  6:55                                 ` John Fastabend
@ 2022-11-17 17:51                                   ` Stanislav Fomichev
  2022-11-17 19:47                                     ` John Fastabend
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-17 17:51 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > > >
> > > > > > > Stanislav Fomichev wrote:
> > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > > > >
> > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > > >>>>>>> +{
> > > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > >>>>>>> +     }
> > > > > > > > > > >>>>>>> +}
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > >>>>> point?
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > >>>
> > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > >>
> > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > >>
> > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > >>
> > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > >> {
> > > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > >>
> > > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > >>
> > > > > > > > > > >>          return hi | lo;
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > >>
> > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > > >> {
> > > > > > > > > > >>          unsigned int seq;
> > > > > > > > > > >>          u64 nsec;
> > > > > > > > > > >>
> > > > > > > > > > >>          do {
> > > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > >>
> > > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > >> }
> > > > > > > > > > >>
> > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > >>
> > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > >>
> > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > >
> > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > >
> > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > >
> > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > > {
> > > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > >
> > > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > >
> > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > >
> > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > >
> > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > before merging this.
> > > > > > > > >
> > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > a use case for it now.
> > > > > > > > >
> > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > >
> > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > rx side for a v2.
> > > > > > >
> > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > > handle versioning of descriptors where depending on specific options
> > > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > > other structures.
> > > > > > >
> > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > course.
> > > > > >
> > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > CO-RE would help.
> > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > mlx4?
> > > > >
> > > > > Which part are you talking about ?
> > > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > which is what John is proposing (I think).
>
> Yeah this is what I've been considering. If you just get the 'cqe' pointer
> walking the check_sum and rxhash should be straightforward.
>
> There seems to be a real difference between timestamps and most other
> fields IMO. Timestamps require some extra logic to turn into ns when
> using the NIC hw clock. And the hw clock requires some coordination to
> keep in sync and stop from overflowing and may be done through other
> protocols like PTP in my use case. In some cases I think the clock is
> also shared amongst multiple phys. Seems mlx has a seqlock_t to protect
> it and I'm less sure about this but seems intel nic maybe has a sideband
> control channel.
>
> Then there is everything else that I can think up that is per-packet
> data and requires no coordination with the driver. It's just reading
> fields in the completion queue. This would be the csum, check_sum,
> vlan_header and so on. Sure, we could kfunc each one of those things,
> but we could also just write that directly in BPF and remove some
> abstractions and kernel dependency by doing it directly in the BPF
> program. Whether you like that abstraction seems to be the point of
> contention; my opinion is that the cost of the kernel dependency is
> high and I can abstract it with a user library anyway, so burying it
> in the kernel only causes me support issues and backwards compat
> problems.
>
> Hopefully, my position is more clear.

Yeah, I see your point; I'm somewhat in the same position here wrt
legacy kernels :-)
Exposing raw descriptors seems fine, but imo it shouldn't be the go-to
mechanism for the metadata; rather a fallback whenever the driver
implementation is missing/buggy. Unless, as you mention below, there
are some libraries in the future to abstract that.
But at least it seems that we agree that there needs to be some other
non-raw-descriptor way to access spinlocked things like the timestamp?


> > > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > > I'm assuming we want to have hash_type available to the progs?
> > > >
> > > > But also, check_csum handles other corner cases:
> > > > - short_frame: we simply force all those small frames to skip checksum complete
> > > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > > IPv6 header
> > > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > > doesn't include the pseudo header, the HW adds it
> > >
> > > Of course, but kfunc won't be doing them either.
> > > We're talking XDP fast path.
> > > The mlx4 hw is old and incapable.
> > > No amount of sw can help.
>
> Doesn't this lend itself to letting the XDP BPF program write the
> BPF code to read it out? Maybe someone cares about these details
> for some cpumap thing, but the rest of us won't care; we might just
> want to read check_csum. Maybe we have an IPv6-only network or an
> IPv4-only network so we can make further shortcuts. If a driver dev does
> this they will be forced to do the catch-all version because
> they have no way to know such details.
>
> > > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > > The driver does a lot of massaging around that field to make it
> > > > palatable.
> > >
> > > Of course we can. cqe->checksum is still usable. the bpf prog
> > > would need to know what it's reading.
> > > There should be no attempt to present a unified state of hw bits.
> > > That's what skb is for. XDP layer should not hide such hw details.
> > > Otherwise it will become a mini skb layer with all that overhead.
> >
> > I was hoping the kfunc could at least parse the flags and return some
> > pkt_hash_types-like enum to indicate what this csum covers.
> > So the users won't have to find the hardware manuals (not sure they
> > are even available?) to decipher what numbers they've got.
> > Regarding old mlx4: true, but mlx5's mlx5e_handle_csum doesn't look
> > that much different :-(
>
> The driver developers could still provide and ship the BPF libs
> with their drivers. I think if someone is going to use their NIC
> and lots of them and requires XDP it will get done. We could put
> them by the driver code mlx4.bpf or something.
>
> >
> > But going back a bit: I'm probably missing what John has been
> > suggesting. How is CO-RE relevant for kfuncs? kfuncs are already doing
> > a CO-RE-like functionality by rewriting some "public api" (kfunc) into
> > the bytecode to access the relevant data.
>
> This was maybe a bit of an aside. What I was pondering out loud is my
> recollection that there may be a few different descriptor layouts
> depending on the features enabled, the exact device loaded and such.
> So in this case, if I were a driver writer I might not want to
> hardcode the offset of the check_sum field. If I could use CO-RE
> then I wouldn't have to care if in one version it's the Nth field and
> later on someone makes it the Mth field, just like any normal kernel struct.
> But through the kfunc interface I couldn't see how to get that.
> So instead of having a bunch of kfunc implementations you could just
> have one for all your device classes because you always name the
> field the same thing.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 16:59                               ` Alexei Starovoitov
@ 2022-11-17 17:52                                 ` Stanislav Fomichev
  2022-11-17 23:46                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-17 17:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, John Fastabend,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > >> > Doesn't look like the descriptors are as nice as you're trying to
> > >> > paint them (with clear hash/csum fields) :-) So not sure how much
> > >> > CO-RE would help.
> > >> > At least looking at mlx4 rx_csum, the driver consults three different
> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with
> > >> > mlx4?
> > >>
> > >> Which part are you talking about ?
> > >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > >> which is what John is proposing (I think).
> > >
> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
> > > I'm assuming we want to have hash_type available to the progs?
> >
> > I agree we should expose the hash_type, but that doesn't actually look
> > to be that complicated, see below.
> >
> > > But also, check_csum handles other corner cases:
> > > - short_frame: we simply force all those small frames to skip checksum complete
> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
> > > IPv6 header
> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
> > > doesn't include the pseudo header, the HW adds it
> > >
> > > So it doesn't look like we can just unconditionally use cqe->checksum?
> > > The driver does a lot of massaging around that field to make it
> > > palatable.
> >
> > Poking around a bit in the other drivers, AFAICT it's only a subset of
> > drivers that support CSUM_COMPLETE at all; for instance, the Intel
> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
> > propagate?
> >
> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
> > to other data structures than the rx descriptor to determine the status
> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
> > just exposing the rx descriptor to BPF as John is suggesting does not
> > actually give the XDP program enough information to act on the checksum
> > field on its own. We could still have a separate kfunc to just expose
> > the hw checksum value (see below), but I think it probably needs to be
> > paired with other kfuncs to be useful.
> >
> > Looking at the mlx4 code, I think the following mapping to kfuncs (in
> > pseudo-C) would give the flexibility for XDP to access all the bits it
> > needs, while inlining everything except getting the full checksum for
> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
> > drivers (mlx5, ice, i40e) indicates that this would work for those
> > drivers as well.
> >
> >
> > bpf_xdp_metadata_rx_hash_supported() {
> >   return dev->features & NETIF_F_RXHASH;
> > }
> >
> > bpf_xdp_metadata_rx_hash() {
> >   return be32_to_cpu(cqe->immed_rss_invalid);
> > }
> >
> > bpf_xdp_metdata_rx_hash_type() {
> >   if (likely(dev->features & NETIF_F_RXCSUM) &&
> >       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
> >         (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> >           cqe->checksum == cpu_to_be16(0xffff))
> >      return PKT_HASH_TYPE_L4;
> >
> >    return PKT_HASH_TYPE_L3;
> > }
> >
> > bpf_xdp_metadata_rx_csum_supported() {
> >   return dev->features & NETIF_F_RXCSUM;
> > }
> >
> > bpf_xdp_metadata_rx_csum_level() {
> >         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> >                                        MLX4_CQE_STATUS_UDP)) &&
> >             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> >             cqe->checksum == cpu_to_be16(0xffff))
> >             return CHECKSUM_UNNECESSARY;
> >
> >         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
> >               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
> >               !short_frame(len))
> >             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
> >
> >         return CHECKSUM_NONE;
> > }
> >
> > /* this one could be called by the metadata_to_skb code */
> > bpf_xdp_metadata_rx_csum_full() {
> >   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
> > }
> >
> > /* this one would be for people like John who want to re-implement
> >  * check_csum() themselves */
> > bpf_xdp_metdata_rx_csum_raw() {
> >   return cqe->checksum;
> > }
>
> Are you proposing a bunch of per-driver kfuncs that the bpf prog will call?
> If so that works, but the bpf prog needs to pass dev and cqe pointers
> into these kfuncs, so they need to be exposed to the prog somehow.
> Probably through xdp_md?

So far I'm doing:

struct mlx4_xdp_buff {
  struct xdp_buff xdp;
  struct mlx4_cqe *cqe;
  struct mlx4_en_dev *mdev;
};

And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can
find cqe/mdev via container_of.
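
For the timestamp, the kfunc body would then look something like this
(rough sketch, not the actual patch; the function name is made up and
the seqlock-protected cyc2time conversion quoted earlier is elided):

static u64 mlx4_xdp_rx_timestamp(struct xdp_buff *xdp)
{
	struct mlx4_xdp_buff *_ctx;

	_ctx = container_of(xdp, struct mlx4_xdp_buff, xdp);
	/* raw cqe timestamp; still needs the mdev->clock conversion */
	return mlx4_en_get_cqe_ts(_ctx->cqe);
}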

If we really need these to be exposed to the program, can we use
Yonghong's approach from [0]?

0: https://lore.kernel.org/bpf/20221114162328.622665-1-yhs@fb.com/

> This way we can have both: bpf prog reading cqe fields directly
> and using kfuncs to access things.
> Inlining of kfuncs should be done generically.
> It's not a driver job to convert native asm into bpf asm.

Ack. I can replace the unrolling with something that just resolves
"generic" kfuncs to the per-driver implementation maybe? That would at
least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
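
Roughly what I'm picturing for that resolution step (hand-wavy sketch;
the ndo name is invented):

static void xdp_fixup_metadata_kfunc(struct bpf_insn *insn,
				     struct net_device *netdev)
{
	/* before fixup, insn->imm holds the kfunc BTF id; re-point the
	 * call at the attached device's implementation so no indirect
	 * ndo_* call is left in the generated code at runtime
	 */
	if (insn->imm == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP) &&
	    netdev->netdev_ops->ndo_xdp_rx_timestamp)
		insn->imm = BPF_CALL_IMM(netdev->netdev_ops->ndo_xdp_rx_timestamp);
}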

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 17:51                                   ` Stanislav Fomichev
@ 2022-11-17 19:47                                     ` John Fastabend
  2022-11-17 20:17                                       ` Alexei Starovoitov
  0 siblings, 1 reply; 67+ messages in thread
From: John Fastabend @ 2022-11-17 19:47 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

Stanislav Fomichev wrote:
> On Wed, Nov 16, 2022 at 10:55 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Stanislav Fomichev wrote:
> > > On Wed, Nov 16, 2022 at 6:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 16, 2022 at 6:53 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On Wed, Nov 16, 2022 at 6:17 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 16, 2022 at 4:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Nov 16, 2022 at 3:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Stanislav Fomichev wrote:
> > > > > > > > > On Wed, Nov 16, 2022 at 11:03 AM John Fastabend
> > > > > > > > > <john.fastabend@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Toke Høiland-Jørgensen wrote:
> > > > > > > > > > > Martin KaFai Lau <martin.lau@linux.dev> writes:
> > > > > > > > > > >
> > > > > > > > > > > > On 11/15/22 10:38 PM, John Fastabend wrote:
> > > > > > > > > > > >>>>>>> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > > > > > > > > > > >>>>>>> +                           struct bpf_patch *patch)
> > > > > > > > > > > >>>>>>> +{
> > > > > > > > > > > >>>>>>> +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > > > > > > > > >>>>>>> +             /* return true; */
> > > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > > > > > > > > >>>>>>> +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > > > > > > > > > > >>>>>>> +             /* return ktime_get_mono_fast_ns(); */
> > > > > > > > > > > >>>>>>> +             bpf_patch_append(patch, BPF_EMIT_CALL(ktime_get_mono_fast_ns));
> > > > > > > > > > > >>>>>>> +     }
> > > > > > > > > > > >>>>>>> +}
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> So these look reasonable enough, but would be good to see some examples
> > > > > > > > > > > >>>>>> of kfunc implementations that don't just BPF_CALL to a kernel function
> > > > > > > > > > > >>>>>> (with those helper wrappers we were discussing before).
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Let's maybe add them if/when needed as we add more metadata support?
> > > > > > > > > > > >>>>> xdp_metadata_export_to_skb has an example, and rfc 1/2 have more
> > > > > > > > > > > >>>>> examples, so it shouldn't be a problem to resurrect them back at some
> > > > > > > > > > > >>>>> point?
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Well, the reason I asked for them is that I think having to maintain the
> > > > > > > > > > > >>>> BPF code generation in the drivers is probably the biggest drawback of
> > > > > > > > > > > >>>> the kfunc approach, so it would be good to be relatively sure that we
> > > > > > > > > > > >>>> can manage that complexity (via helpers) before we commit to this :)
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Right, and I've added a bunch of examples in v2 rfc so we can judge
> > > > > > > > > > > >>> whether that complexity is manageable or not :-)
> > > > > > > > > > > >>> Do you want me to add those wrappers you've back without any real users?
> > > > > > > > > > > >>> Because I had to remove my veth tstamp accessors due to John/Jesper
> > > > > > > > > > > >>> objections; I can maybe bring some of this back gated by some
> > > > > > > > > > > >>> static_branch to avoid the fastpath cost?
> > > > > > > > > > > >>
> > > > > > > > > > > >> I missed the context a bit what did you mean "would be good to see some
> > > > > > > > > > > >> examples of kfunc implementations that don't just BPF_CALL to a kernel
> > > > > > > > > > > >> function"? In this case do you mean BPF code directly without the call?
> > > > > > > > > > > >>
> > > > > > > > > > > >> Early on I thought we should just expose the rx_descriptor which would
> > > > > > > > > > > >> be roughly the same right? (difference being code embedded in driver vs
> > > > > > > > > > > >> a lib) Trouble I ran into is driver code using seqlock_t and mutexs
> > > > > > > > > > > >> which wasn't as straight forward as the simpler just read it from
> > > > > > > > > > > >> the descriptor. For example in mlx getting the ts would be easy from
> > > > > > > > > > > >> BPF with the mlx4_cqe struct exposed
> > > > > > > > > > > >>
> > > > > > > > > > > >> u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
> > > > > > > > > > > >> {
> > > > > > > > > > > >>          u64 hi, lo;
> > > > > > > > > > > >>          struct mlx4_ts_cqe *ts_cqe = (struct mlx4_ts_cqe *)cqe;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          lo = (u64)be16_to_cpu(ts_cqe->timestamp_lo);
> > > > > > > > > > > >>          hi = ((u64)be32_to_cpu(ts_cqe->timestamp_hi) + !lo) << 16;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          return hi | lo;
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> but converting that to nsec is a bit annoying,
> > > > > > > > > > > >>
> > > > > > > > > > > >> void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
> > > > > > > > > > > >>                              struct skb_shared_hwtstamps *hwts,
> > > > > > > > > > > >>                              u64 timestamp)
> > > > > > > > > > > >> {
> > > > > > > > > > > >>          unsigned int seq;
> > > > > > > > > > > >>          u64 nsec;
> > > > > > > > > > > >>
> > > > > > > > > > > >>          do {
> > > > > > > > > > > >>                  seq = read_seqbegin(&mdev->clock_lock);
> > > > > > > > > > > >>                  nsec = timecounter_cyc2time(&mdev->clock, timestamp);
> > > > > > > > > > > >>          } while (read_seqretry(&mdev->clock_lock, seq));
> > > > > > > > > > > >>
> > > > > > > > > > > >>          memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
> > > > > > > > > > > >>          hwts->hwtstamp = ns_to_ktime(nsec);
> > > > > > > > > > > >> }
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think the nsec is what you really want.
> > > > > > > > > > > >>
> > > > > > > > > > > >> With all the drivers doing slightly different ops we would have
> > > > > > > > > > > >> to create read_seqbegin, read_seqretry, mutex_lock, ... to get
> > > > > > > > > > > >> at least the mlx and ice drivers it looks like we would need some
> > > > > > > > > > > >> more BPF primitives/helpers. Looks like some more work is needed
> > > > > > > > > > > >> on ice driver though to get rx tstamps on all packets.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Anyways this convinced me real devices will probably use BPF_CALL
> > > > > > > > > > > >> and not BPF insns directly.
> > > > > > > > > > > >
> > > > > > > > > > > > Some of the mlx5 path looks like this:
> > > > > > > > > > > >
> > > > > > > > > > > > #define REAL_TIME_TO_NS(hi, low) (((u64)hi) * NSEC_PER_SEC + ((u64)low))
> > > > > > > > > > > >
> > > > > > > > > > > > static inline ktime_t mlx5_real_time_cyc2time(struct mlx5_clock *clock,
> > > > > > > > > > > >                                                u64 timestamp)
> > > > > > > > > > > > {
> > > > > > > > > > > >          u64 time = REAL_TIME_TO_NS(timestamp >> 32, timestamp & 0xFFFFFFFF);
> > > > > > > > > > > >
> > > > > > > > > > > >          return ns_to_ktime(time);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > If some hints are harder to get, then just doing a kfunc call is better.
> > > > > > > > > > >
> > > > > > > > > > > Sure, but if we end up having a full function call for every field in
> > > > > > > > > > > the metadata, that will end up having a significant performance impact
> > > > > > > > > > > on the XDP data path (thinking mostly about the skb metadata case here,
> > > > > > > > > > > which will collect several bits of metadata).
> > > > > > > > > > >
> > > > > > > > > > > > csum may have a better chance to inline?
> > > > > > > > > > >
> > > > > > > > > > > Yup, I agree. Including that also makes it possible to benchmark this
> > > > > > > > > > > series against Jesper's; which I think we should definitely be doing
> > > > > > > > > > > before merging this.
> > > > > > > > > >
> > > > > > > > > > Good point I got sort of singularly focused on timestamp because I have
> > > > > > > > > > a use case for it now.
> > > > > > > > > >
> > > > > > > > > > Also hash is often sitting in the rx descriptor.
> > > > > > > > >
> > > > > > > > > Ack, let me try to add something else (that's more inline-able) on the
> > > > > > > > > rx side for a v2.
> > > > > > > >
> > > > > > > > If you go with in-kernel BPF kfunc approach (vs user space side) I think
> > > > > > > > you also need to add CO-RE to be friendly for driver developers? Otherwise
> > > > > > > > they have to keep that read in sync with the descriptors? Also need to
> > > > > > > > handle versioning of descriptors where depending on specific options
> > > > > > > > and firmware and chip being enabled the descriptor might be moving
> > > > > > > > around. Of course can push this all to developer, but seems not so
> > > > > > > > nice when we have the machinery to do this and we handle it for all
> > > > > > > > other structures.
> > > > > > > >
> > > > > > > > With CO-RE you can simply do the rx_desc->hash and rx_desc->csum and
> > > > > > > > expect CO-RE sorts it out based on currently running btf_id of the
> > > > > > > > descriptor. If you go through normal path you get this for free of
> > > > > > > > course.
> > > > > > >
> > > > > > > Doesn't look like the descriptors are as nice as you're trying to
> > > > > > > paint them (with clear hash/csum fields) :-) So not sure how much
> > > > > > > CO-RE would help.
> > > > > > > At least looking at mlx4 rx_csum, the driver consults three different
> > > > > > > sets of flags to figure out the hash_type. Or am I just unlucky with
> > > > > > > mlx4?
> > > > > >
> > > > > > Which part are you talking about ?
> > > > > >         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > > > > > is trivial enough for bpf prog to do if it has access to 'cqe' pointer
> > > > > > which is what John is proposing (I think).
> >
> > Yeah this is what I've been considering. If you just get the 'cqe' pointer
> > walking the check_sum and rxhash should be straightforward.
> >
> > There seems to be a real difference between timestamps and most other
> > fields IMO. Timestamps require some extra logic to turn into ns when
> > using the NIC hw clock. And the hw clock requires some coordination to
> > keep in sync and stop from overflowing and may be done through other
> > protocols like PTP in my use case. In some cases I think the clock is
> > also shared amongst multiple phys. Seems mlx has a seqlock_t to protect
> > it and I'm less sure about this but seems intel nic maybe has a sideband
> > control channel.
> >
> > Then there is everything else that I can think up that is per-packet
> > data and requires no coordination with the driver. It's just reading
> > fields in the completion queue. This would be the csum, check_sum,
> > vlan_header and so on. Sure, we could kfunc each one of those things,
> > but we could also just write that directly in BPF and remove some
> > abstractions and kernel dependency by doing it directly in the BPF
> > program. Whether you like that abstraction seems to be the point of
> > contention; my opinion is that the cost of the kernel dependency is
> > high and I can abstract it with a user library anyway, so burying it
> > in the kernel only causes me support issues and backwards compat
> > problems.
> >
> > Hopefully, my position is more clear.
> 
> Yeah, I see your point; I'm somewhat in the same position here wrt
> legacy kernels :-)
> Exposing raw descriptors seems fine, but imo it shouldn't be the go-to
> mechanism for the metadata; rather a fallback whenever the driver
> implementation is missing/buggy. Unless, as you mention below, there
> are some libraries in the future to abstract that.
> But at least it seems that we agree that there needs to be some other
> non-raw-descriptor way to access spinlocked things like the timestamp?
> 

Yeah, for timestamps I think either a kfunc to get the timestamp or a
kfunc to read the hw clock would work. Either way it seems hard to do
that in BPF code directly, so a kfunc feels right to me here.

By the way, I think if you have the completion queue (rx descriptor) in
the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type,
then we should be able to also directly read all the fields. I see
you noted this in the response to Alexei, so let's see what he thinks.
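
I.e. something like this on the BPF side (rough sketch, assuming kfuncs
along the lines of Yonghong's series and that mlx4_xdp_buff shows up in
kernel BTF; names approximate):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>

extern void *bpf_cast_to_kern_ctx(void *ctx) __ksym;
extern void *bpf_rdonly_cast(void *obj, __u32 btf_id) __ksym;

SEC("xdp")
int read_cqe(struct xdp_md *ctx)
{
	struct xdp_buff *xdp = bpf_cast_to_kern_ctx(ctx);
	struct mlx4_xdp_buff *mctx;

	/* xdp is the first member of the wrapper, so a plain cast works */
	mctx = bpf_rdonly_cast(xdp, bpf_core_type_id_kernel(struct mlx4_xdp_buff));

	bpf_printk("cqe status %x", bpf_ntohs(BPF_CORE_READ(mctx, cqe, status)));
	return XDP_PASS;
}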

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 19:47                                     ` John Fastabend
@ 2022-11-17 20:17                                       ` Alexei Starovoitov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-17 20:17 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stanislav Fomichev, Toke Høiland-Jørgensen,
	Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, Network Development

On Thu, Nov 17, 2022 at 11:47 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Yeah, for timestamps I think either a kfunc to get the timestamp or a
> kfunc to read the hw clock would work. Either way it seems hard to do
> that in BPF code directly, so a kfunc feels right to me here.
>
> By the way, I think if you have the completion queue (rx descriptor) in
> the xdp_buff and we use Yonghong's patch to cast the ctx as a BTF type,
> then we should be able to also directly read all the fields. I see
> you noted this in the response to Alexei, so let's see what he thinks.

Fine with me.
Let's land something that is not uapi and then iterate on top.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 17:52                                 ` Stanislav Fomichev
@ 2022-11-17 23:46                                   ` Toke Høiland-Jørgensen
  2022-11-18  0:02                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-17 23:46 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov
  Cc: John Fastabend, Martin KaFai Lau, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	KP Singh, Hao Luo, Jiri Olsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	Network Development

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Thu, Nov 17, 2022 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> > >> > Doesn't look like the descriptors are as nice as you're trying to
>> > >> > paint them (with clear hash/csum fields) :-) So not sure how much
>> > >> > CO-RE would help.
>> > >> > At least looking at mlx4 rx_csum, the driver consults three different
>> > >> > sets of flags to figure out the hash_type. Or am I just unlucky with
>> > >> > mlx4?
>> > >>
>> > >> Which part are you talking about ?
>> > >>         hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>> > >> is trivial enough for bpf prog to do if it has access to 'cqe' pointer
>> > >> which is what John is proposing (I think).
>> > >
>> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_csum.
>> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch
>> > > I'm assuming we want to have hash_type available to the progs?
>> >
>> > I agree we should expose the hash_type, but that doesn't actually look
>> > to be that complicated, see below.
>> >
>> > > But also, check_csum handles other corner cases:
>> > > - short_frame: we simply force all those small frames to skip checksum complete
>> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes from
>> > > IPv6 header
>> > > - get_fixed_ipv4_csum: Although the stack expects checksum which
>> > > doesn't include the pseudo header, the HW adds it
>> > >
>> > > So it doesn't look like we can just unconditionally use cqe->checksum?
>> > > The driver does a lot of massaging around that field to make it
>> > > palatable.
>> >
>> > Poking around a bit in the other drivers, AFAICT it's only a subset of
>> > drivers that support CSUM_COMPLETE at all; for instance, the Intel
>> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the
>> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to
>> > propagate?
>> >
>> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access
>> > to other data structures than the rx descriptor to determine the status
>> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so
>> > just exposing the rx descriptor to BPF as John is suggesting does not
>> > actually give the XDP program enough information to act on the checksum
>> > field on its own. We could still have a separate kfunc to just expose
>> > the hw checksum value (see below), but I think it probably needs to be
>> > paired with other kfuncs to be useful.
>> >
>> > Looking at the mlx4 code, I think the following mapping to kfuncs (in
>> > pseudo-C) would give the flexibility for XDP to access all the bits it
>> > needs, while inlining everything except getting the full checksum for
>> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the other
>> > drivers (mlx5, ice, i40e) indicates that this would work for those
>> > drivers as well.
>> >
>> >
>> > bpf_xdp_metadata_rx_hash_supported() {
>> >   return dev->features & NETIF_F_RXHASH;
>> > }
>> >
>> > bpf_xdp_metadata_rx_hash() {
>> >   return be32_to_cpu(cqe->immed_rss_invalid);
>> > }
>> >
>> > bpf_xdp_metdata_rx_hash_type() {
>> >   if (likely(dev->features & NETIF_F_RXCSUM) &&
>> >       (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS_UDP)) &&
>> >         (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>> >           cqe->checksum == cpu_to_be16(0xffff))
>> >      return PKT_HASH_TYPE_L4;
>> >
>> >    return PKT_HASH_TYPE_L3;
>> > }
>> >
>> > bpf_xdp_metadata_rx_csum_supported() {
>> >   return dev->features & NETIF_F_RXCSUM;
>> > }
>> >
>> > bpf_xdp_metadata_rx_csum_level() {
>> >         if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>> >                                        MLX4_CQE_STATUS_UDP)) &&
>> >             (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>> >             cqe->checksum == cpu_to_be16(0xffff))
>> >             return CHECKSUM_UNNECESSARY;
>> >
>> >         if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
>> >               (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) &&
>> >               !short_frame(len))
>> >             return CHECKSUM_COMPLETE; /* we could also omit this case entirely */
>> >
>> >         return CHECKSUM_NONE;
>> > }
>> >
>> > /* this one could be called by the metadata_to_skb code */
>> > bpf_xdp_metadata_rx_csum_full() {
>> >   return check_csum() /* BPF_CALL this after refactoring so it is skb-agnostic */
>> > }
>> >
>> > /* this one would be for people like John who want to re-implement
>> >  * check_csum() themselves */
>> > bpf_xdp_metdata_rx_csum_raw() {
>> >   return cqe->checksum;
>> > }
>>
>> Are you proposing a bunch of per-driver kfuncs that the bpf prog will call?
>> If so that works, but the bpf prog needs to pass dev and cqe pointers
>> into these kfuncs, so they need to be exposed to the prog somehow.
>> Probably through xdp_md?

No, I didn't mean we should call per-driver kfuncs; the examples above
were meant to be examples of what the mlx4 driver would unroll those
kfuncs to. Sorry that that wasn't clear.

> So far I'm doing:
>
> struct mlx4_xdp_buff {
>   struct xdp_buff xdp;
>   struct mlx4_cqe *cqe;
>   struct mlx4_en_dev *mdev;
> };
>
> And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can
> find cqe/mdev via container_of.
>
> If we really need these to be exposed to the program, can we use
> Yonghong's approach from [0]?

I don't think we should expose them to the BPF program; I like your
approach of stuffing them next to the CTX pointer and de-referencing
that. This makes it up to the driver which extra objects it needs, and
the caller doesn't have to know/care.

I'm not vehemently opposed to *also* having the rx-desc pointer directly
accessible (in which case Yonghong's kfunc approach is probably fine).
However, as mentioned in my previous email, I doubt how useful that
descriptor itself will be...

>> This way we can have both: bpf prog reading cqe fields directly
>> and using kfuncs to access things.
>> Inlining of kfuncs should be done generically.
>> It's not a driver job to convert native asm into bpf asm.
>
> Ack. I can replace the unrolling with something that just resolves
> "generic" kfuncs to the per-driver implementation maybe? That would at
> least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..

As stated above, I think we should keep the unrolling. If we end up with
an actual CALL instruction for every piece of metadata that's going to
suck performance-wise; unrolling is how we keep this fast enough! :)

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-17 23:46                                   ` Toke Høiland-Jørgensen
@ 2022-11-18  0:02                                     ` Alexei Starovoitov
  2022-11-18  0:29                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2022-11-18  0:02 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, John Fastabend, Martin KaFai Lau, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, Network Development

On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> >
> > Ack. I can replace the unrolling with something that just resolves
> > "generic" kfuncs to the per-driver implementation maybe? That would at
> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>
> As stated above, I think we should keep the unrolling. If we end up with
> an actual CALL instruction for every piece of metadata that's going to
> suck performance-wise; unrolling is how we keep this fast enough! :)

Let's start with pure kfuncs without requiring drivers
to provide corresponding bpf asm.
If pure kfuncs will indeed turn out to be perf limiting
then we'll decide on how to optimize them.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp
  2022-11-18  0:02                                     ` Alexei Starovoitov
@ 2022-11-18  0:29                                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-18  0:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, John Fastabend, Martin KaFai Lau, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, Network Development

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Nov 17, 2022 at 3:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> >
>> > Ack. I can replace the unrolling with something that just resolves
>> > "generic" kfuncs to the per-driver implementation maybe? That would at
>> > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime..
>>
>> As stated above, I think we should keep the unrolling. If we end up with
>> an actual CALL instruction for every piece of metadata that's going to
>> suck performance-wise; unrolling is how we keep this fast enough! :)
>
> Let's start with pure kfuncs without requiring drivers
> to provide corresponding bpf asm.
> If pure kfuncs will indeed turn out to be perf limiting
> then we'll decide on how to optimize them.

I'm ~90% sure we'll need that optimisation, but OK, we can start from a
baseline of having them be function calls...
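
I.e. the baseline would just be plain kernel functions registered as
XDP kfuncs, something like this (sketch only; the body and the
registration details are approximate, not from the series):

u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx)
{
	return ktime_get_mono_fast_ns(); /* generic placeholder body */
}

BTF_SET8_START(xdp_metadata_kfunc_ids)
BTF_ID_FLAGS(func, bpf_xdp_metadata_rx_timestamp)
BTF_SET8_END(xdp_metadata_kfunc_ids)

static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &xdp_metadata_kfunc_ids,
};

static int __init xdp_metadata_init(void)
{
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
					 &xdp_metadata_kfunc_set);
}
late_initcall(xdp_metadata_init);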

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
                     ` (2 preceding siblings ...)
  2022-11-16 21:12   ` Jakub Kicinski
@ 2022-11-18 14:05   ` Jesper Dangaard Brouer
  2022-11-18 18:18     ` Stanislav Fomichev
  3 siblings, 1 reply; 67+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-18 14:05 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 15/11/2022 04.02, Stanislav Fomichev wrote:
> Implement new bpf_xdp_metadata_export_to_skb kfunc which
> prepares compatible xdp metadata for kernel consumption.
> This kfunc should be called prior to bpf_redirect
> or when XDP_PASS'ing the frame into the kernel (note, the drivers
> have to be updated to enable consuming XDP_PASS'ed metadata).
> 
> veth driver is amended to consume this metadata when converting to skb.
> 
> Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> whether the frame has skb metadata. The metadata is currently
> stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> this requirement later on if needed, we'd have to memmove
> xdp_skb_metadata).
> 

I think it is wrong to refuse use of the metadata area (bpf_xdp_adjust_meta)
when the function bpf_xdp_metadata_export_to_skb() has been called.
In my design they were supposed to co-exist, and the BPF-prog was expected
to access this directly itself.

With this current design, I think it is better to place the struct
xdp_skb_metadata (maybe call it xdp_skb_hints) after xdp_frame (in the
top of the frame).  This way we don't conflict with metadata and
headroom use-cases.  Plus, verifier will keep BPF-prog from accessing
this area directly (which seems to be one of the new design goals).
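
To illustrate the layout I have in mind (illustration only; the struct
and field names here are made up):

struct xdp_frame_with_hints {		/* hypothetical, for illustration */
	struct xdp_frame frame;
	struct xdp_skb_metadata hints;	/* i.e. xdp_skb_hints */
	/*
	 * ... remaining headroom, the data_meta area and the packet data
	 * follow further down, untouched by the hints area ...
	 */
};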

By placing it after xdp_frame, I think it would be possible to let veth 
unroll functions seamlessly access this info for XDP_REDIRECT'ed 
xdp_frame's.

WDYT?

--Jesper


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-18 14:05   ` Jesper Dangaard Brouer
@ 2022-11-18 18:18     ` Stanislav Fomichev
  2022-11-19 12:31       ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-18 18:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Fri, Nov 18, 2022 at 6:05 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 15/11/2022 04.02, Stanislav Fomichev wrote:
> > Implement new bpf_xdp_metadata_export_to_skb kfunc which
> > prepares compatible xdp metadata for kernel consumption.
> > This kfunc should be called prior to bpf_redirect
> > or when XDP_PASS'ing the frame into the kernel (note, the drivers
> > have to be updated to enable consuming XDP_PASS'ed metadata).
> >
> > veth driver is amended to consume this metadata when converting to skb.
> >
> > Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> > whether the frame has skb metadata. The metadata is currently
> > stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> > to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> > this requirement later on if needed, we'd have to memmove
> > xdp_skb_metadata).
> >
>
> I think it is wrong to refuse use of the metadata area (bpf_xdp_adjust_meta)
> when the function bpf_xdp_metadata_export_to_skb() has been called.
> In my design they were supposed to co-exist, and the BPF-prog was expected
> to access this directly itself.
>
> With this current design, I think it is better to place the struct
> xdp_skb_metadata (maybe call it xdp_skb_hints) after xdp_frame (in the
> top of the frame).  This way we don't conflict with metadata and
> headroom use-cases.  Plus, verifier will keep BPF-prog from accessing
> this area directly (which seems to be one of the new design goals).
>
> By placing it after xdp_frame, I think it would be possible to let veth
> unroll functions seamlessly access this info for XDP_REDIRECT'ed
> xdp_frame's.
>
> WDYT?

Not everyone seems to be happy with exposing this xdp_skb_metadata via
uapi though :-(
So I'll drop this part in the v2 for now. But let's definitely keep
talking about the future approach.

Putting it after xdp_frame SGTM; with this we seem to avoid the need
to memmove it on adjust_{head,meta}.

But going back to the uapi part, what if we add separate kfunc
accessors for skb exported metadata?

To export:
bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx, rx_timestamp)
bpf_xdp_metadata_export_rx_hash_to_skb(ctx, rx_hash)
// ^^ these prepare xdp_skb_metadata after xdp_frame, but not expose
it via uapi/af_xdp/etc

Then bpf_xdp_metadata_export_to_skb can be a 'static inline' define in
the headers:

void bpf_xdp_metadata_export_to_skb(ctx)
{
  if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
    bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx,
bpf_xdp_metadata_rx_timestamp(ctx));
  if (bpf_xdp_metadata_rx_hash_supported(ctx))
    bpf_xdp_metadata_export_rx_hash_to_skb(ctx, bpf_xdp_metadata_rx_hash(ctx));
}

We can also do the accessors:
u64 bpf_xdp_metadata_skb_rx_timestamp(ctx)
u32 bpf_xdp_metadata_skb_rx_hash(ctx)

Hopefully we can unroll at least these, since they are not part of the
drivers, it should be easier to argue...

The only issue, it seems, is that if the final bpf program would like
to export this metadata to af_xdp, it has to manually adj_meta and use
bpf_xdp_metadata_skb_rx_xxx to prepare a custom layout. Not sure
whether performance would suffer with this extra copy; but we can at
least try and see..
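
For reference, that manual adj_meta + custom layout path would look
roughly like this on the BPF side (sketch; the skb accessors above are
hypothetical, and the map and layout here are made up):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical accessors from above, not in any kernel */
extern __u64 bpf_xdp_metadata_skb_rx_timestamp(struct xdp_md *ctx) __ksym;
extern __u32 bpf_xdp_metadata_skb_rx_hash(struct xdp_md *ctx) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsk_map SEC(".maps");

struct my_meta {		/* program-defined custom layout */
	__u64 rx_timestamp;
	__u32 rx_hash;
};

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	struct my_meta *meta;
	void *data;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	meta->rx_timestamp = bpf_xdp_metadata_skb_rx_timestamp(ctx);
	meta->rx_hash = bpf_xdp_metadata_skb_rx_hash(ctx);

	return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, XDP_PASS);
}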

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-18 18:18     ` Stanislav Fomichev
@ 2022-11-19 12:31       ` Toke Høiland-Jørgensen
  2022-11-21 17:53         ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-19 12:31 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Fri, Nov 18, 2022 at 6:05 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 15/11/2022 04.02, Stanislav Fomichev wrote:
>> > Implement new bpf_xdp_metadata_export_to_skb kfunc which
>> > prepares compatible xdp metadata for kernel consumption.
>> > This kfunc should be called prior to bpf_redirect
>> > or when XDP_PASS'ing the frame into the kernel (note, the drivers
>> > have to be updated to enable consuming XDP_PASS'ed metadata).
>> >
>> > veth driver is amended to consume this metadata when converting to skb.
>> >
>> > Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
>> > whether the frame has skb metadata. The metadata is currently
>> > stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
>> > to work after a call to bpf_xdp_metadata_export_to_skb (can lift
>> > this requirement later on if needed, we'd have to memmove
>> > xdp_skb_metadata).
>> >
>>
>> I think it is wrong to refuse use of the metadata area (bpf_xdp_adjust_meta)
>> when the function bpf_xdp_metadata_export_to_skb() has been called.
>> In my design they were supposed to co-exist, and the BPF-prog was expected
>> to access this directly itself.
>>
>> With this current design, I think it is better to place the struct
>> xdp_skb_metadata (maybe call it xdp_skb_hints) after xdp_frame (in the
>> top of the frame).  This way we don't conflict with metadata and
>> headroom use-cases.  Plus, verifier will keep BPF-prog from accessing
>> this area directly (which seems to be one of the new design goals).
>>
>> By placing it after xdp_frame, I think it would be possible to let veth
>> unroll functions seamlessly access this info for XDP_REDIRECT'ed
>> xdp_frame's.
>>
>> WDYT?
>
> Not everyone seems to be happy with exposing this xdp_skb_metadata via
> uapi though :-(
> So I'll drop this part in the v2 for now. But let's definitely keep
> talking about the future approach.

Jakub was objecting to putting it in the UAPI header, but didn't we
already agree that this wasn't necessary?

I.e., if we just define

struct xdp_skb_metadata *bpf_xdp_metadata_export_to_skb()

as a kfunc, the xdp_skb_metadata struct won't appear in any UAPI headers
and will only be accessible via BTF? And we can put the actual data
wherever we choose, since that bit is nicely hidden behind the kfunc,
while the returned pointer still allows programs to access it.
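
Concretely, the BPF side would then just be something like this
(sketch; the struct layout comes from kernel BTF / vmlinux.h rather
than any uapi header, and the field name is assumed):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern struct xdp_skb_metadata *
bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx) __ksym;

SEC("xdp")
int prog(struct xdp_md *ctx)
{
	struct xdp_skb_metadata *skb_meta;

	skb_meta = bpf_xdp_metadata_export_to_skb(ctx);
	if (skb_meta)
		bpf_printk("rx_timestamp: %llu", skb_meta->rx_timestamp);

	return XDP_PASS;
}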

We could even make that kfunc smart enough that it checks if the field
is already populated and just returns the pointer to the existing data
instead of re-populating it in this case (with a flag to override,
maybe?).

> Putting it after xdp_frame SGTM; with this we seem to avoid the need
> to memmove it on adjust_{head,meta}.
>
> But going back to the uapi part, what if we add separate kfunc
> accessors for skb exported metadata?
>
> To export:
> bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx, rx_timestamp)
> bpf_xdp_metadata_export_rx_hash_to_skb(ctx, rx_hash)
> // ^^ these prepare xdp_skb_metadata after xdp_frame, but not expose
> it via uapi/af_xdp/etc
>
> Then bpf_xdp_metadata_export_to_skb can be a 'static inline' define in
> the headers:
>
> void bpf_xdp_metadata_export_to_skb(ctx)
> {
>   if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
>     bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx,
> bpf_xdp_metadata_rx_timestamp(ctx));
>   if (bpf_xdp_metadata_rx_hash_supported(ctx))
>     bpf_xdp_metadata_export_rx_hash_to_skb(ctx, bpf_xdp_metadata_rx_hash(ctx));
> }

The problem with this is that the BPF programs then have to keep up with
the kernel. I.e., if the kernel later adds support for a new field that
is used in the SKB, old XDP programs won't be populating it, which seems
suboptimal. I think rather the kernel should be in control of the SKB
metadata, and just allow XDP to consume it (and change individual fields
as needed).

> The only issue, it seems, is that if the final bpf program would like
> to export this metadata to af_xdp, it has to manually adj_meta and use
> bpf_xdp_metadata_skb_rx_xxx to prepare a custom layout. Not sure
> whether performance would suffer with this extra copy; but we can at
> least try and see..

If we write the metadata after the packet data, that could still be
transferred to AF_XDP, couldn't it? Userspace would just have to know
how to find and read it, like it would  if it's before the metadata.

-Toke


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-19 12:31       ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-21 17:53         ` Stanislav Fomichev
  2022-11-21 18:47           ` Jakub Kicinski
  0 siblings, 1 reply; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-21 17:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, bpf, brouer, ast, daniel, andrii,
	martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	David Ahern, Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Sat, Nov 19, 2022 at 4:31 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Fri, Nov 18, 2022 at 6:05 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >> On 15/11/2022 04.02, Stanislav Fomichev wrote:
> >> > Implement new bpf_xdp_metadata_export_to_skb kfunc which
> >> > prepares compatible xdp metadata for kernel consumption.
> >> > This kfunc should be called prior to bpf_redirect
> >> > or when XDP_PASS'ing the frame into the kernel (note, the drivers
> >> > have to be updated to enable consuming XDP_PASS'ed metadata).
> >> >
> >> > veth driver is amended to consume this metadata when converting to skb.
> >> >
> >> > Internally, XDP_FLAGS_HAS_SKB_METADATA flag is used to indicate
> >> > whether the frame has skb metadata. The metadata is currently
> >> > stored prior to xdp->data_meta. bpf_xdp_adjust_meta refuses
> >> > to work after a call to bpf_xdp_metadata_export_to_skb (can lift
> >> > this requirement later on if needed, we'd have to memmove
> >> > xdp_skb_metadata).
> >> >
> >>
> >> I think it is wrong to refuse use of the metadata area (bpf_xdp_adjust_meta)
> >> when the function bpf_xdp_metadata_export_to_skb() has been called.
> >> In my design they were supposed to co-exist, and the BPF-prog was expected to
> >> access this directly itself.
> >>
> >> With this current design, I think it is better to place the struct
> >> xdp_skb_metadata (maybe call it xdp_skb_hints) after xdp_frame (in the
> >> top of the frame).  This way we don't conflict with metadata and
> >> headroom use-cases.  Plus, verifier will keep BPF-prog from accessing
> >> this area directly (which seems to be one of the new design goals).
> >>
> >> By placing it after xdp_frame, I think it would be possible to let veth
> >> unroll functions seamlessly access this info for XDP_REDIRECT'ed
> >> xdp_frame's.
> >>
> >> WDYT?
> >
> > Not everyone seems to be happy with exposing this xdp_skb_metadata via
> > uapi though :-(
> > So I'll drop this part in the v2 for now. But let's definitely keep
> > talking about the future approach.
>
> Jakub was objecting to putting it in the UAPI header, but didn't we
> already agree that this wasn't necessary?
>
> I.e., if we just define
>
> struct xdp_skb_metadata *bpf_xdp_metadata_export_to_skb()
>
> as a kfunc, the xdp_skb_metadata struct won't appear in any UAPI headers
> and will only be accessible via BTF? And we can put the actual data
> wherever we choose, since that bit is nicely hidden behind the kfunc,
> while the returned pointer still allows programs to access it.
>
> We could even make that kfunc smart enough that it checks if the field
> is already populated and just returns the pointer to the existing data
> instead of re-populating it in this case (with a flag to override,
> maybe?).

Even if we only expose it via btf, I think the fact that we still
expose a somewhat fixed layout is the problem?
I'm not sure the fact that we're not technically putting it in the uapi
header is the issue here, but maybe I'm wrong?
Jakub?

> > Putting it after xdp_frame SGTM; with this we seem to avoid the need
> > to memmove it on adjust_{head,meta}.
> >
> > But going back to the uapi part, what if we add separate kfunc
> > accessors for skb exported metadata?
> >
> > To export:
> > bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx, rx_timestamp)
> > bpf_xdp_metadata_export_rx_hash_to_skb(ctx, rx_hash)
> > // ^^ these prepare xdp_skb_metadata after xdp_frame, but do not expose
> > // it via uapi/af_xdp/etc
> >
> > Then bpf_xdp_metadata_export_to_skb can be 'static inline' define in
> > the headers:
> >
> > void bpf_xdp_metadata_export_to_skb(ctx)
> > {
> >   if (bpf_xdp_metadata_rx_timestamp_supported(ctx))
> >     bpf_xdp_metadata_export_rx_timestamp_to_skb(ctx, bpf_xdp_metadata_rx_timestamp(ctx));
> >   if (bpf_xdp_metadata_rx_hash_supported(ctx))
> >     bpf_xdp_metadata_export_rx_hash_to_skb(ctx, bpf_xdp_metadata_rx_hash(ctx));
> > }
>
> The problem with this is that the BPF programs then have to keep up with
> the kernel. I.e., if the kernel later adds support for a new field that
> is used in the SKB, old XDP programs won't be populating it, which seems
> suboptimal. I think rather the kernel should be in control of the SKB
> metadata, and just allow XDP to consume it (and change individual fields
> as needed).

Good point, although it doesn't sound like a huge drawback to me? If that
bpf_xdp_metadata_export_to_skb helper is part of libbpf/libxdp, the new
fields will get populated after a library update.

> > The only issue, it seems, is that if the final bpf program would like
> > to export this metadata to af_xdp, it has to manually adj_meta and use
> > bpf_xdp_metadata_skb_rx_xxx to prepare a custom layout. Not sure
> > whether performance would suffer with this extra copy; but we can at
> > least try and see..
>
> If we write the metadata after the packet data, that could still be
> transferred to AF_XDP, couldn't it? Userspace would just have to know
> how to find and read it, just as it would if the metadata were placed in
> front of the packet data.

Right, but here we again bump into the fact that we need to somehow
communicate that layout to userspace (via BTF IDs), which doesn't
make everybody excited :-)
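
For what that could look like in practice, a rough userspace sketch that
resolves the struct layout from vmlinux BTF with libbpf (the
"xdp_skb_metadata" type name is an assumption):

  #include <bpf/btf.h>
  #include <stdio.h>

  /* Print the member offsets of the kernel's metadata struct so a
   * consumer can locate fields without any UAPI header. */
  int dump_metadata_layout(void)
  {
          struct btf *btf = btf__load_vmlinux_btf();
          const struct btf_type *t;
          const struct btf_member *m;
          int id, i;

          if (!btf)
                  return -1;

          id = btf__find_by_name_kind(btf, "xdp_skb_metadata", BTF_KIND_STRUCT);
          if (id < 0) {
                  btf__free(btf);
                  return -1;
          }

          t = btf__type_by_id(btf, id);
          m = btf_members(t);
          for (i = 0; i < btf_vlen(t); i++, m++)
                  printf("%s at bit offset %u\n",
                         btf__name_by_offset(btf, m->name_off),
                         btf_member_bit_offset(t, i));

          btf__free(btf);
          return 0;
  }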


* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-21 17:53         ` Stanislav Fomichev
@ 2022-11-21 18:47           ` Jakub Kicinski
  2022-11-21 19:41             ` Stanislav Fomichev
  0 siblings, 1 reply; 67+ messages in thread
From: Jakub Kicinski @ 2022-11-21 18:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, bpf,
	brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, 21 Nov 2022 09:53:02 -0800 Stanislav Fomichev wrote:
> > Jakub was objecting to putting it in the UAPI header, but didn't we
> > already agree that this wasn't necessary?
> >
> > I.e., if we just define
> >
> > struct xdp_skb_metadata *bpf_xdp_metadata_export_to_skb()
> >
> > as a kfunc, the xdp_skb_metadata struct won't appear in any UAPI headers
> > and will only be accessible via BTF? And we can put the actual data
> > wherever we choose, since that bit is nicely hidden behind the kfunc,
> > while the returned pointer still allows programs to access it.
> >
> > We could even make that kfunc smart enough that it checks if the field
> > is already populated and just returns the pointer to the existing data
> > instead of re-populating it in this case (with a flag to override,
> > maybe?).  
> 
> Even if we only expose it via btf, I think the fact that we still
> expose a somewhat fixed layout is the problem?
> > I'm not sure the fact that we're not technically putting it in the uapi
> header is the issue here, but maybe I'm wrong?
> Jakub?

Until the device metadata access from BPF is in bpf-next the only
opinion I have on this is something along the lines of "not right now".

I may be missing some concerns / perspectives, in which case - when
is the next "BPF office hours" meeting?


* Re: [xdp-hints] Re: [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context
  2022-11-21 18:47           ` Jakub Kicinski
@ 2022-11-21 19:41             ` Stanislav Fomichev
  0 siblings, 0 replies; 67+ messages in thread
From: Stanislav Fomichev @ 2022-11-21 19:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, bpf,
	brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, Nov 21, 2022 at 10:47 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 21 Nov 2022 09:53:02 -0800 Stanislav Fomichev wrote:
> > > Jakub was objecting to putting it in the UAPI header, but didn't we
> > > already agree that this wasn't necessary?
> > >
> > > I.e., if we just define
> > >
> > > struct xdp_skb_metadata *bpf_xdp_metadata_export_to_skb()
> > >
> > > as a kfunc, the xdp_skb_metadata struct won't appear in any UAPI headers
> > > and will only be accessible via BTF? And we can put the actual data
> > > wherever we choose, since that bit is nicely hidden behind the kfunc,
> > > while the returned pointer still allows programs to access it.
> > >
> > > We could even make that kfunc smart enough that it checks if the field
> > > is already populated and just returns the pointer to the existing data
> > > instead of re-populating it in this case (with a flag to override,
> > > maybe?).
> >
> > Even if we only expose it via btf, I think the fact that we still
> > expose a somewhat fixed layout is the problem?
> > I'm not sure the fact that we're not technically putting it in the uapi
> > header is the issue here, but maybe I'm wrong?
> > Jakub?
>
> Until the device metadata access from BPF is in bpf-next the only
> opinion I have on this is something along the lines of "not right now".
>
> I may be missing some concerns / perspectives, in which case - when
> is the next "BPF office hours" meeting?

SG! Let's get back to it once we get the basic rx metadata sorted out.
I'll probably look at the tx part next, though; the xdp->skb path is
the lowest priority for me.


Thread overview: 67+ messages
2022-11-15  3:01 [PATCH bpf-next 00/11] xdp: hints via kfuncs Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 01/11] bpf: Document XDP RX metadata Stanislav Fomichev
2022-11-15 22:31   ` Zvi Effron
2022-11-15 22:43     ` Stanislav Fomichev
2022-11-15 23:34       ` Zvi Effron
2022-11-16  3:50         ` Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 02/11] bpf: Introduce bpf_patch Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 03/11] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
2022-11-15 16:16   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-15 18:37     ` Stanislav Fomichev
2022-11-16 20:42   ` Jakub Kicinski
2022-11-16 20:53     ` Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 04/11] bpf: Implement hidden BPF_PUSH64 and BPF_POP64 instructions Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-15 16:17   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-15 18:37     ` Stanislav Fomichev
2022-11-15 22:46       ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-16  4:09         ` Stanislav Fomichev
2022-11-16  6:38           ` John Fastabend
2022-11-16  7:47             ` Martin KaFai Lau
2022-11-16 10:08               ` Toke Høiland-Jørgensen
2022-11-16 18:20                 ` Martin KaFai Lau
2022-11-16 19:03                 ` John Fastabend
2022-11-16 20:50                   ` Stanislav Fomichev
2022-11-16 23:47                     ` John Fastabend
2022-11-17  0:19                       ` Stanislav Fomichev
2022-11-17  2:17                         ` Alexei Starovoitov
2022-11-17  2:53                           ` Stanislav Fomichev
2022-11-17  2:59                             ` Alexei Starovoitov
2022-11-17  4:18                               ` Stanislav Fomichev
2022-11-17  6:55                                 ` John Fastabend
2022-11-17 17:51                                   ` Stanislav Fomichev
2022-11-17 19:47                                     ` John Fastabend
2022-11-17 20:17                                       ` Alexei Starovoitov
2022-11-17 11:32                             ` Toke Høiland-Jørgensen
2022-11-17 16:59                               ` Alexei Starovoitov
2022-11-17 17:52                                 ` Stanislav Fomichev
2022-11-17 23:46                                   ` Toke Høiland-Jørgensen
2022-11-18  0:02                                     ` Alexei Starovoitov
2022-11-18  0:29                                       ` Toke Høiland-Jørgensen
2022-11-17 10:27                       ` Toke Høiland-Jørgensen
2022-11-15  3:02 ` [PATCH bpf-next 06/11] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
2022-11-15 23:20   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-16  3:49     ` Stanislav Fomichev
2022-11-16  9:30       ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-16  7:04   ` Martin KaFai Lau
2022-11-16  9:48     ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-16 20:51       ` Stanislav Fomichev
2022-11-16 20:51     ` Stanislav Fomichev
2022-11-16 21:12   ` Jakub Kicinski
2022-11-16 21:49     ` Martin KaFai Lau
2022-11-18 14:05   ` Jesper Dangaard Brouer
2022-11-18 18:18     ` Stanislav Fomichev
2022-11-19 12:31       ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-21 17:53         ` Stanislav Fomichev
2022-11-21 18:47           ` Jakub Kicinski
2022-11-21 19:41             ` Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 07/11] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 08/11] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 09/11] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 10/11] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-15  3:02 ` [PATCH bpf-next 11/11] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
2022-11-15 15:54 ` [xdp-hints] [PATCH bpf-next 00/11] xdp: hints via kfuncs Toke Høiland-Jørgensen
2022-11-15 18:37   ` Stanislav Fomichev
2022-11-15 22:31     ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-15 22:54     ` [xdp-hints] " Alexei Starovoitov
2022-11-15 23:13       ` [xdp-hints] " Toke Høiland-Jørgensen
