bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint
@ 2023-10-12 17:05 Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 01/18] ice: make RX hash reading code more reusable Larysa Zaremba
                   ` (17 more replies)
  0 siblings, 18 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

This series introduces XDP hints via kfuncs [0] to the ice driver.

Series brings the following existing hints to the ice driver:
 - HW timestamp
 - RX hash with type

Series also introduces a VLAN tag with protocol XDP hint, which can now be
accessed by XDP and userspace (AF_XDP) programs. The hints can also be checked
with the xdp_metadata test and the xdp_hw_metadata program.
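
For readers unfamiliar with the tag layout, here is a minimal userspace sketch
of how a consumer might split a 16-bit VLAN TCI into its 802.1Q fields. The
macro and struct names are illustrative stand-ins, not identifiers from this
series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the 802.1Q TCI layout; names are illustrative. */
#define TCI_VID_MASK	0x0fffu	/* bits 11..0: VLAN ID */
#define TCI_DEI_MASK	0x1000u	/* bit 12: drop eligible indicator */
#define TCI_PCP_SHIFT	13	/* bits 15..13: priority code point */

struct vlan_fields {
	uint16_t vid;
	uint8_t pcp;
	bool dei;
};

/* Split a host-endian TCI value into VID, PCP and DEI. */
static struct vlan_fields decode_tci(uint16_t tci)
{
	struct vlan_fields f = {
		.vid = (uint16_t)(tci & TCI_VID_MASK),
		.pcp = (uint8_t)(tci >> TCI_PCP_SHIFT),
		.dei = (tci & TCI_DEI_MASK) != 0,
	};

	return f;
}
```

For example, a TCI of 0xB064 decodes to VID 100, PCP 5 and DEI set.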

At Maciej's request, here are some numbers on the impact of these patches
on ice performance.
ZC:
* Full hints implementation before addition of the static key decreases
  pps in ZC mode by 6%
* Adding a static key eliminates this drop. The overall performance
  difference compared to a clean tree is inconsequential.

skb (packets with invalid IP, dropped by stack):
* Overall, patchset improves peak performance in skb mode by about 0.5%

[0] https://patchwork.kernel.org/project/netdevbpf/cover/20230119221536.3349901-1-sdf@google.com/

Intermediate RFC v2:
https://lore.kernel.org/bpf/20230927075124.23941-1-larysa.zaremba@intel.com/
Intermediate RFC v1:
https://lore.kernel.org/bpf/20230824192703.712881-1-larysa.zaremba@intel.com/
v5:
https://lore.kernel.org/bpf/20230811161509.19722-1-larysa.zaremba@intel.com/
v4:
https://lore.kernel.org/bpf/20230728173923.1318596-1-larysa.zaremba@intel.com/
v3:
https://lore.kernel.org/bpf/20230719183734.21681-1-larysa.zaremba@intel.com/
v2:
https://lore.kernel.org/bpf/20230703181226.19380-1-larysa.zaremba@intel.com/
v1:
https://lore.kernel.org/all/20230512152607.992209-1-larysa.zaremba@intel.com/

Changes since v5:
* drop checksum hint from the patchset entirely
* Alex's patch that lifts the data_meta size limitation is no longer
  required in this patchset, so will be sent separately
* new patch: hide some ice hints code behind a static key
* fix several bugs in ZC mode (ice)
* change argument order in VLAN hint kfunc (tci, proto -> proto, tci)
* cosmetic changes
* analyze performance impact

Changes since v4:
* Drop the concept of partial checksum from the hint design
* Drop the concept of checksum level from the hint design

Changes since v3:
* use XDP_CHECKSUM_VALID_LVL0 + csum_level instead of csum_level + 1
* fix spelling mistakes
* read XDP timestamp unconditionally
* add TO_STR() macro

Changes since v2:
* redesign checksum hint, so now it gives full status
* rename vlan_tag -> vlan_tci, where applicable
* use open_netns() and close_netns() in xdp_metadata
* improve VLAN hint documentation
* replace CFI with DEI
* use VLAN_VID_MASK in xdp_metadata
* make vlan_get_tag() return -ENODATA
* remove unused rx_ptype in ice_xsk.c
* fix ice timestamp code division between patches

Changes since v1:
* directly return RX hash, RX timestamp and RX checksum status
  in skb-common functions
* use intermediate enum value for checksum status in ice
* get rid of ring structure dependency in ice kfunc implementation
* make variables const, when possible, in ice implementation
* use -ENODATA instead of -EOPNOTSUPP for driver implementation
* instead of having 2 separate functions for c-tag and s-tag,
  use 1 function that outputs both VLAN tag and protocol ID
* improve documentation for introduced hints
* update xdp_metadata selftest to test new hints
* implement new hints in veth, so they can be tested in xdp_metadata
* parse VLAN tag in xdp_hw_metadata

Larysa Zaremba (18):
  ice: make RX hash reading code more reusable
  ice: make RX HW timestamp reading code more reusable
  ice: Make ptype internal to descriptor info processing
  ice: Introduce ice_xdp_buff
  ice: Support HW timestamp hint
  ice: Support RX hash XDP hint
  ice: Support XDP hints in AF_XDP ZC mode
  xdp: Add VLAN tag hint
  ice: Implement VLAN tag hint
  ice: use VLAN proto from ring packet context in skb path
  ice: put XDP meta sources assignment under a static key condition
  veth: Implement VLAN tag XDP hint
  net: make vlan_get_tag() return -ENODATA instead of -EINVAL
  mlx5: implement VLAN tag XDP hint
  selftests/bpf: Allow VLAN packets in xdp_hw_metadata
  selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata
  selftests/bpf: Use AF_INET for TX in xdp_metadata
  selftests/bpf: Check VLAN tag and proto in xdp_metadata

 Documentation/networking/xdp-rx-metadata.rst  |   8 +-
 drivers/net/ethernet/intel/ice/ice.h          |   3 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |   2 +-
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 412 +++++++++---------
 drivers/net/ethernet/intel/ice/ice_lib.c      |   2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |  35 ++
 drivers/net/ethernet/intel/ice/ice_ptp.c      |  25 +-
 drivers/net/ethernet/intel/ice/ice_ptp.h      |  16 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  20 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  29 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 209 ++++++++-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  18 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  49 ++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  15 +
 drivers/net/veth.c                            |  19 +
 include/linux/if_vlan.h                       |   4 +-
 include/linux/mlx5/device.h                   |   2 +-
 include/net/xdp.h                             |   9 +
 include/uapi/linux/netdev.h                   |   5 +-
 net/core/xdp.c                                |  33 ++
 tools/include/uapi/linux/netdev.h             |   5 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 184 ++++----
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  38 +-
 .../selftests/bpf/progs/xdp_metadata.c        |   5 +
 tools/testing/selftests/bpf/testing_helpers.h |   3 +
 tools/testing/selftests/bpf/xdp_hw_metadata.c |  38 +-
 tools/testing/selftests/bpf/xdp_metadata.h    |  34 +-
 27 files changed, 816 insertions(+), 406 deletions(-)

-- 
2.41.0



* [PATCH bpf-next v6 01/18] ice: make RX hash reading code more reusable
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 02/18] ice: make RX HW timestamp " Larysa Zaremba
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Previously, we only needed RX hash in skb path,
hence all related code was written with skb in mind.
But with the addition of XDP hints via kfuncs to the ice driver,
the same logic will be needed in .xmo_() callbacks.

Move the generic process of reading RX hash from a descriptor
into a separate function.
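
The split described above can be sketched in userspace C roughly as follows.
All mock_* types, the MOCK_RXDID_FLEX_NIC value and the rxhash_enabled flag
are hypothetical stand-ins for the driver's descriptor, skb and netdev feature
check; only the shape (a generic getter that returns 0 for "no hash", wrapped
by an skb-specific setter) mirrors this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_RXDID_FLEX_NIC	2	/* arbitrary value for this sketch */

struct mock_rx_desc {
	uint8_t rxdid;
	uint32_t rss_hash;	/* host-endian here, for simplicity */
};

struct mock_skb {
	uint32_t hash;
	bool hash_set;
};

/* Generic part: read the hash from the descriptor, 0 means "not present". */
static uint32_t mock_get_rx_hash(const struct mock_rx_desc *desc)
{
	if (desc->rxdid != MOCK_RXDID_FLEX_NIC)
		return 0;

	return desc->rss_hash;
}

/* skb-specific part: only touches the skb when a hash is available. */
static void mock_rx_hash_to_skb(const struct mock_rx_desc *desc,
				struct mock_skb *skb, bool rxhash_enabled)
{
	uint32_t hash;

	if (!rxhash_enabled)
		return;

	hash = mock_get_rx_hash(desc);
	if (hash) {
		skb->hash = hash;
		skb->hash_set = true;
	}
}
```

A non-skb consumer can call the getter directly and decide on its own how to
report a missing hash.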

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 36 +++++++++++++------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index c8322fb6f2b3..987050dacead 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -63,28 +63,42 @@ static enum pkt_hash_types ice_ptype_to_htype(u16 ptype)
 }
 
 /**
- * ice_rx_hash - set the hash value in the skb
+ * ice_get_rx_hash - get RX hash value from descriptor
+ * @rx_desc: specific descriptor
+ *
+ * Returns hash, if present, 0 otherwise.
+ */
+static u32 ice_get_rx_hash(const union ice_32b_rx_flex_desc *rx_desc)
+{
+	const struct ice_32b_rx_flex_desc_nic *nic_mdid;
+
+	if (unlikely(rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC))
+		return 0;
+
+	nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
+	return le32_to_cpu(nic_mdid->rss_hash);
+}
+
+/**
+ * ice_rx_hash_to_skb - set the hash value in the skb
  * @rx_ring: descriptor ring
  * @rx_desc: specific descriptor
  * @skb: pointer to current skb
  * @rx_ptype: the ptype value from the descriptor
  */
 static void
-ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
-	    struct sk_buff *skb, u16 rx_ptype)
+ice_rx_hash_to_skb(const struct ice_rx_ring *rx_ring,
+		   const union ice_32b_rx_flex_desc *rx_desc,
+		   struct sk_buff *skb, u16 rx_ptype)
 {
-	struct ice_32b_rx_flex_desc_nic *nic_mdid;
 	u32 hash;
 
 	if (!(rx_ring->netdev->features & NETIF_F_RXHASH))
 		return;
 
-	if (rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC)
-		return;
-
-	nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
-	hash = le32_to_cpu(nic_mdid->rss_hash);
-	skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
+	hash = ice_get_rx_hash(rx_desc);
+	if (likely(hash))
+		skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
 }
 
 /**
@@ -186,7 +200,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
 		       struct sk_buff *skb, u16 ptype)
 {
-	ice_rx_hash(rx_ring, rx_desc, skb, ptype);
+	ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);
 
 	/* modifies the skb - consumes the enet header */
 	skb->protocol = eth_type_trans(skb, rx_ring->netdev);
-- 
2.41.0



* [PATCH bpf-next v6 02/18] ice: make RX HW timestamp reading code more reusable
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 01/18] ice: make RX hash reading code more reusable Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 03/18] ice: Make ptype internal to descriptor info processing Larysa Zaremba
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Previously, we only needed RX HW timestamp in skb path,
hence all related code was written with skb in mind.
But with the addition of XDP hints via kfuncs to the ice driver,
the same logic will be needed in .xmo_() callbacks.

Put generic process of reading RX HW timestamp from a descriptor
into a separate function.
Move skb-related code into another source file.
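
As a rough illustration, a getter of this kind can be modelled in userspace as
below. The body of ice_ptp_extend_32b_ts() is not part of this patch, so the
extension logic here is an assumption about how a 32-bit hardware timestamp
can be widened against a recent 64-bit reference. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen a 32-bit timestamp to 64 bits, assuming it lies
 * within half of the 32-bit range around the cached reference time.
 */
static uint64_t mock_extend_32b_ts(uint64_t cached_ns, uint32_t ts_lo)
{
	uint32_t ref_lo = (uint32_t)cached_ns;
	uint32_t delta = ts_lo - ref_lo;	/* modular difference */

	if (delta > UINT32_MAX / 2) {
		/* timestamp was taken shortly before the reference */
		delta = ref_lo - ts_lo;
		return cached_ns - delta;
	}

	return cached_ns + delta;
}

/* Returns the timestamp in ns, or 0 when none can be reported. */
static uint64_t mock_get_rx_hwts(bool ts_valid, uint64_t cached_ns,
				 uint32_t ts_high)
{
	if (!ts_valid || !cached_ns)
		return 0;

	return mock_extend_32b_ts(cached_ns, ts_high);
}
```

Returning a plain value keeps the getter independent of skb, so the skb
wrapper and any other caller can each store the result in their own way.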

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ptp.c      | 20 ++++++-----------
 drivers/net/ethernet/intel/ice/ice_ptp.h      | 16 +++++++++-----
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 22 ++++++++++++++++++-
 3 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index 05f922d3b316..e24c17789cf5 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -2168,30 +2168,26 @@ int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr)
 }
 
 /**
- * ice_ptp_rx_hwtstamp - Check for an Rx timestamp
- * @rx_ring: Ring to get the VSI info
+ * ice_ptp_get_rx_hwts - Get packet Rx timestamp in ns
  * @rx_desc: Receive descriptor
- * @skb: Particular skb to send timestamp with
+ * @rx_ring: Ring to get the cached time
  *
  * The driver receives a notification in the receive descriptor with timestamp.
- * The timestamp is in ns, so we must convert the result first.
  */
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb)
+u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
+			struct ice_rx_ring *rx_ring)
 {
-	struct skb_shared_hwtstamps *hwtstamps;
 	u64 ts_ns, cached_time;
 	u32 ts_high;
 
 	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
-		return;
+		return 0;
 
 	cached_time = READ_ONCE(rx_ring->cached_phctime);
 
 	/* Do not report a timestamp if we don't have a cached PHC time */
 	if (!cached_time)
-		return;
+		return 0;
 
 	/* Use ice_ptp_extend_32b_ts directly, using the ring-specific cached
 	 * PHC value, rather than accessing the PF. This also allows us to
@@ -2202,9 +2198,7 @@ ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
 	ts_high = le32_to_cpu(rx_desc->wb.flex_ts.ts_high);
 	ts_ns = ice_ptp_extend_32b_ts(cached_time, ts_high);
 
-	hwtstamps = skb_hwtstamps(skb);
-	memset(hwtstamps, 0, sizeof(*hwtstamps));
-	hwtstamps->hwtstamp = ns_to_ktime(ts_ns);
+	return ts_ns;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
index 995a57019ba7..8ebdf422752a 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
@@ -268,9 +268,8 @@ void ice_ptp_extts_event(struct ice_pf *pf);
 s8 ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb);
 enum ice_tx_tstamp_work ice_ptp_process_ts(struct ice_pf *pf);
 
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb);
+u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
+			struct ice_rx_ring *rx_ring);
 void ice_ptp_reset(struct ice_pf *pf);
 void ice_ptp_prepare_for_reset(struct ice_pf *pf);
 void ice_ptp_init(struct ice_pf *pf);
@@ -304,9 +303,14 @@ static inline bool ice_ptp_process_ts(struct ice_pf *pf)
 {
 	return true;
 }
-static inline void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb) { }
+
+static inline u64
+ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
+		    struct ice_rx_ring *rx_ring)
+{
+	return 0;
+}
+
 static inline void ice_ptp_reset(struct ice_pf *pf) { }
 static inline void ice_ptp_prepare_for_reset(struct ice_pf *pf) { }
 static inline void ice_ptp_init(struct ice_pf *pf) { }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 987050dacead..95c29181301b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -184,6 +184,26 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 	ring->vsi->back->hw_csum_rx_error++;
 }
 
+/**
+ * ice_ptp_rx_hwts_to_skb - Put RX timestamp into skb
+ * @rx_ring: Ring to get the VSI info
+ * @rx_desc: Receive descriptor
+ * @skb: Particular skb to send timestamp with
+ *
+ * The timestamp is in ns, so we must convert the result first.
+ */
+static void
+ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
+		       const union ice_32b_rx_flex_desc *rx_desc,
+		       struct sk_buff *skb)
+{
+	u64 ts_ns = ice_ptp_get_rx_hwts(rx_desc, rx_ring);
+
+	*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
+		.hwtstamp	= ns_to_ktime(ts_ns),
+	};
+}
+
 /**
  * ice_process_skb_fields - Populate skb header fields from Rx descriptor
  * @rx_ring: Rx descriptor ring packet is being transacted on
@@ -208,7 +228,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 	ice_rx_csum(rx_ring, skb, rx_desc, ptype);
 
 	if (rx_ring->ptp_rx)
-		ice_ptp_rx_hwtstamp(rx_ring, rx_desc, skb);
+		ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
 }
 
 /**
-- 
2.41.0



* [PATCH bpf-next v6 03/18] ice: Make ptype internal to descriptor info processing
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 01/18] ice: make RX hash reading code more reusable Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 02/18] ice: make RX HW timestamp " Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 04/18] ice: Introduce ice_xdp_buff Larysa Zaremba
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Currently, rx_ptype variable is used only as an argument
to ice_process_skb_fields() and is computed
just before the function call.

Therefore, there is no reason to pass this value as an argument.
Instead, remove this argument and compute the value directly inside
ice_process_skb_fields() function.

Also, separate its calculation into a short function, so the code
can later be reused in .xmo_() callbacks.
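
A standalone sketch of such a helper, for illustration only: it decodes a
little-endian 16-bit descriptor word byte by byte and masks out the packet
type. The 10-bit mask width is an assumption made for this example, and the
mock_* names are hypothetical.

```c
#include <stdint.h>

#define MOCK_PTYPE_MASK	0x3ffu	/* assumed 10-bit ptype field */

/* Portable little-endian 16-bit load, independent of host byte order. */
static uint16_t mock_le16_to_cpu(const uint8_t b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Extract the packet type from the raw descriptor word. */
static uint16_t mock_get_ptype(const uint8_t ptype_flex_flags0[2])
{
	return mock_le16_to_cpu(ptype_flex_flags0) & MOCK_PTYPE_MASK;
}
```

With the computation behind one small function, the skb path and other
callers share a single definition of the mask.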

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 +-----
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 15 +++++++++++++--
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  6 +-----
 4 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 52d0a126eb61..40f2f6dabb81 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1181,7 +1181,6 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		unsigned int size;
 		u16 stat_err_bits;
 		u16 vlan_tag = 0;
-		u16 rx_ptype;
 
 		/* get the Rx desc from Rx ring based on 'next_to_clean' */
 		rx_desc = ICE_RX_DESC(rx_ring, ntc);
@@ -1286,10 +1285,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		total_rx_bytes += skb->len;
 
 		/* populate checksum, VLAN, and protocol */
-		rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
-			ICE_RX_FLEX_DESC_PTYPE_M;
-
-		ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+		ice_process_skb_fields(rx_ring, rx_desc, skb);
 
 		ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
 		/* send completed skb up the stack */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 95c29181301b..8b5cee0429d3 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -204,12 +204,21 @@ ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
 	};
 }
 
+/**
+ * ice_get_ptype - Read HW packet type from the descriptor
+ * @rx_desc: RX descriptor
+ */
+static u16 ice_get_ptype(const union ice_32b_rx_flex_desc *rx_desc)
+{
+	return le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
+	       ICE_RX_FLEX_DESC_PTYPE_M;
+}
+
 /**
  * ice_process_skb_fields - Populate skb header fields from Rx descriptor
  * @rx_ring: Rx descriptor ring packet is being transacted on
  * @rx_desc: pointer to the EOP Rx descriptor
  * @skb: pointer to current skb being populated
- * @ptype: the packet type decoded by hardware
  *
  * This function checks the ring, descriptor, and packet information in
  * order to populate the hash, checksum, VLAN, protocol, and
@@ -218,8 +227,10 @@ ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
 void
 ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
-		       struct sk_buff *skb, u16 ptype)
+		       struct sk_buff *skb)
 {
+	u16 ptype = ice_get_ptype(rx_desc);
+
 	ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);
 
 	/* modifies the skb - consumes the enet header */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 115969ecdf7b..e1d49e1235b3 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -148,7 +148,7 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val);
 void
 ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
-		       struct sk_buff *skb, u16 ptype);
+		       struct sk_buff *skb);
 void
 ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
 #endif /* !_ICE_TXRX_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 2a3f0834e139..ef778b8e6d1b 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -870,7 +870,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		struct sk_buff *skb;
 		u16 stat_err_bits;
 		u16 vlan_tag = 0;
-		u16 rx_ptype;
 
 		rx_desc = ICE_RX_DESC(rx_ring, ntc);
 
@@ -950,10 +949,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 
 		vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
 
-		rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
-				       ICE_RX_FLEX_DESC_PTYPE_M;
-
-		ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+		ice_process_skb_fields(rx_ring, rx_desc, skb);
 		ice_receive_skb(rx_ring, skb, vlan_tag);
 	}
 
-- 
2.41.0



* [PATCH bpf-next v6 04/18] ice: Introduce ice_xdp_buff
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (2 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 03/18] ice: Make ptype internal to descriptor info processing Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 05/18] ice: Support HW timestamp hint Larysa Zaremba
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

In order to use XDP hints via kfuncs, we need to put the RX descriptor
and ring pointers right next to xdp_buff.
As in hints implementations in other drivers, we achieve
this by placing xdp_buff into a child structure.

Currently, xdp_buff is stored in the ring structure,
so replace it with a union that includes the child structure.
This way enough memory is available while existing XDP code
remains isolated from hints.

Minimum size of the new child structure (ice_xdp_buff) is exactly
64 bytes (single cache line). To place it at the start of a cache line,
move 'next' field from CL1 to CL3, as it isn't used often. This still
leaves 128 bits available in CL3 for packet context extensions.
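
The embedding technique can be sketched in plain C as follows. The mock_*
types are simplified stand-ins for xdp_buff and the new structures, and
mock_container_of() is a userspace approximation of the kernel's
container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace approximation of the kernel's container_of(). */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_xdp_buff {
	void *data;
	void *data_end;
};

struct mock_pkt_ctx {
	const void *eop_desc;
};

struct mock_ice_xdp_buff {
	struct mock_xdp_buff xdp_buff;	/* must stay the first member */
	struct mock_pkt_ctx pkt_ctx;
};

/* A pointer to the wrapper and to the embedded buffer are interchangeable. */
static_assert(offsetof(struct mock_ice_xdp_buff, xdp_buff) == 0,
	      "xdp_buff must be the first member");

/* Code that only sees the inner buffer can still reach the context. */
static void mock_meta_set_desc(struct mock_xdp_buff *xdp, const void *eop_desc)
{
	struct mock_ice_xdp_buff *ext =
		mock_container_of(xdp, struct mock_ice_xdp_buff, xdp_buff);

	ext->pkt_ctx.eop_desc = eop_desc;
}
```

Because the inner buffer sits at offset 0, existing code that takes a plain
buffer pointer keeps working unchanged.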

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  7 +++--
 drivers/net/ethernet/intel/ice/ice_txrx.h     | 26 ++++++++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 10 +++++++
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 40f2f6dabb81..4e6546d9cf85 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -557,13 +557,14 @@ ice_rx_frame_truesize(struct ice_rx_ring *rx_ring, const unsigned int size)
  * @xdp_prog: XDP program to run
  * @xdp_ring: ring to be used for XDP_TX action
  * @rx_buf: Rx buffer to store the XDP action
+ * @eop_desc: Last descriptor in packet to read metadata from
  *
  * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
  */
 static void
 ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	    struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
-	    struct ice_rx_buf *rx_buf)
+	    struct ice_rx_buf *rx_buf, union ice_32b_rx_flex_desc *eop_desc)
 {
 	unsigned int ret = ICE_XDP_PASS;
 	u32 act;
@@ -571,6 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	if (!xdp_prog)
 		goto exit;
 
+	ice_xdp_meta_set_desc(xdp, eop_desc);
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -1240,7 +1243,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		if (ice_is_non_eop(rx_ring, rx_desc))
 			continue;
 
-		ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_buf);
+		ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_buf, rx_desc);
 		if (rx_buf->act == ICE_XDP_PASS)
 			goto construct_skb;
 		total_rx_bytes += xdp_get_buff_len(xdp);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 166413fc33f4..d0ab2c4c0c91 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -257,6 +257,18 @@ enum ice_rx_dtype {
 	ICE_RX_DTYPE_SPLIT_ALWAYS	= 2,
 };
 
+struct ice_pkt_ctx {
+	const union ice_32b_rx_flex_desc *eop_desc;
+};
+
+struct ice_xdp_buff {
+	struct xdp_buff xdp_buff;
+	struct ice_pkt_ctx pkt_ctx;
+};
+
+/* Required for compatibility with xdp_buffs from xsk_pool */
+static_assert(offsetof(struct ice_xdp_buff, xdp_buff) == 0);
+
 /* indices into GLINT_ITR registers */
 #define ICE_RX_ITR	ICE_IDX_ITR0
 #define ICE_TX_ITR	ICE_IDX_ITR1
@@ -298,7 +310,6 @@ enum ice_dynamic_itr {
 /* descriptor ring, associated with a VSI */
 struct ice_rx_ring {
 	/* CL1 - 1st cacheline starts here */
-	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
 	void *desc;			/* Descriptor ring memory */
 	struct device *dev;		/* Used for DMA mapping */
 	struct net_device *netdev;	/* netdev ring maps to */
@@ -310,12 +321,19 @@ struct ice_rx_ring {
 	u16 count;			/* Number of descriptors */
 	u16 reg_idx;			/* HW register index of the ring */
 	u16 next_to_alloc;
-	/* CL2 - 2nd cacheline starts here */
+
 	union {
 		struct ice_rx_buf *rx_buf;
 		struct xdp_buff **xdp_buf;
 	};
-	struct xdp_buff xdp;
+	/* CL2 - 2nd cacheline starts here */
+	union {
+		struct ice_xdp_buff xdp_ext;
+		struct {
+			struct xdp_buff xdp;
+			struct ice_pkt_ctx pkt_ctx;
+		};
+	};
 	/* CL3 - 3rd cacheline starts here */
 	struct bpf_prog *xdp_prog;
 	u16 rx_offset;
@@ -325,6 +343,8 @@ struct ice_rx_ring {
 	u16 next_to_clean;
 	u16 first_desc;
 
+	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
+
 	/* stats structs */
 	struct ice_ring_stats *ring_stats;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index e1d49e1235b3..145883eec129 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -151,4 +151,14 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       struct sk_buff *skb);
 void
 ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
+
+static inline void
+ice_xdp_meta_set_desc(struct xdp_buff *xdp,
+		      union ice_32b_rx_flex_desc *eop_desc)
+{
+	struct ice_xdp_buff *xdp_ext = container_of(xdp, struct ice_xdp_buff,
+						    xdp_buff);
+
+	xdp_ext->pkt_ctx.eop_desc = eop_desc;
+}
 #endif /* !_ICE_TXRX_LIB_H_ */
-- 
2.41.0



* [PATCH bpf-next v6 05/18] ice: Support HW timestamp hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (3 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 04/18] ice: Introduce ice_xdp_buff Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint Larysa Zaremba
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Use previously refactored code and create a function
that allows XDP code to read HW timestamp.

Also, move cached_phctime into the packet context. This way the data still
stays in the ring structure, just at a different address.

HW timestamp is the first supported hint in the driver,
so also add xdp_metadata_ops.
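
A minimal model of an ops table with one timestamp callback is sketched below.
The mock_* names and the choice to report a missing timestamp as -ENODATA are
assumptions for this example, not the code added by this patch.

```c
#include <errno.h>
#include <stdint.h>

struct mock_pkt_ctx {
	uint64_t hwts_ns;	/* 0 means "no timestamp available" */
};

/* Table of per-driver metadata callbacks; only one hint in this sketch. */
struct mock_metadata_ops {
	int (*rx_timestamp)(const struct mock_pkt_ctx *ctx, uint64_t *ts_ns);
};

/* Callback: write the timestamp on success, report its absence otherwise. */
static int mock_rx_hw_ts(const struct mock_pkt_ctx *ctx, uint64_t *ts_ns)
{
	if (!ctx->hwts_ns)
		return -ENODATA;

	*ts_ns = ctx->hwts_ns;
	return 0;
}

static const struct mock_metadata_ops mock_md_ops = {
	.rx_timestamp = mock_rx_hw_ts,
};
```

Further hints would be added as extra function pointers in the same table.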

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |  2 ++
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
 drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
 drivers/net/ethernet/intel/ice/ice_ptp.c      |  9 ++++---
 drivers/net/ethernet/intel/ice/ice_ptp.h      |  4 +--
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 25 ++++++++++++++++++-
 8 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index d30ae39c19f0..3d0f15f8b2b8 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -974,4 +974,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
 	set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
 	clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
 }
+
+extern const struct xdp_metadata_ops ice_xdp_md_ops;
 #endif /* _ICE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index ad4d4702129f..f740e0ad0e3c 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2846,7 +2846,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* clone ring and setup updated count */
 		rx_rings[i] = *vsi->rx_rings[i];
 		rx_rings[i].count = new_rx_cnt;
-		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
+		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
 		rx_rings[i].desc = NULL;
 		rx_rings[i].rx_buf = NULL;
 		/* this is to allow wr32 to have something to write to
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 1549890a3cbf..b4cbd2f01a39 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1456,7 +1456,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
 		ring->netdev = vsi->netdev;
 		ring->dev = dev;
 		ring->count = vsi->num_rx_desc;
-		ring->cached_phctime = pf->ptp.cached_phc_time;
+		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
 		WRITE_ONCE(vsi->rx_rings[i], ring);
 	}
 
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index e22f41fea8db..2153e27642eb 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3396,6 +3396,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
 
 	netdev->netdev_ops = &ice_netdev_ops;
 	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
+	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
 	ice_set_ethtool_ops(netdev);
 
 	if (vsi->type != ICE_VSI_PF)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index e24c17789cf5..8aad4aff6b30 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -1038,7 +1038,8 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
 		ice_for_each_rxq(vsi, j) {
 			if (!vsi->rx_rings[j])
 				continue;
-			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
+			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime,
+				   systime);
 		}
 	}
 	clear_bit(ICE_CFG_BUSY, pf->state);
@@ -2170,12 +2171,12 @@ int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr)
 /**
  * ice_ptp_get_rx_hwts - Get packet Rx timestamp in ns
  * @rx_desc: Receive descriptor
- * @rx_ring: Ring to get the cached time
+ * @pkt_ctx: Packet context to get the cached time
  *
  * The driver receives a notification in the receive descriptor with timestamp.
  */
 u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
-			struct ice_rx_ring *rx_ring)
+			const struct ice_pkt_ctx *pkt_ctx)
 {
 	u64 ts_ns, cached_time;
 	u32 ts_high;
@@ -2183,7 +2184,7 @@ u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
 	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
 		return 0;
 
-	cached_time = READ_ONCE(rx_ring->cached_phctime);
+	cached_time = READ_ONCE(pkt_ctx->cached_phctime);
 
 	/* Do not report a timestamp if we don't have a cached PHC time */
 	if (!cached_time)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
index 8ebdf422752a..5e6240920821 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
@@ -269,7 +269,7 @@ s8 ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb);
 enum ice_tx_tstamp_work ice_ptp_process_ts(struct ice_pf *pf);
 
 u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
-			struct ice_rx_ring *rx_ring);
+			const struct ice_pkt_ctx *pkt_ctx);
 void ice_ptp_reset(struct ice_pf *pf);
 void ice_ptp_prepare_for_reset(struct ice_pf *pf);
 void ice_ptp_init(struct ice_pf *pf);
@@ -306,7 +306,7 @@ static inline bool ice_ptp_process_ts(struct ice_pf *pf)
 
 static inline u64
 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
-		    struct ice_rx_ring *rx_ring)
+		    const struct ice_pkt_ctx *pkt_ctx)
 {
 	return 0;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index d0ab2c4c0c91..4237702a58a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -259,6 +259,7 @@ enum ice_rx_dtype {
 
 struct ice_pkt_ctx {
 	const union ice_32b_rx_flex_desc *eop_desc;
+	u64 cached_phctime;
 };
 
 struct ice_xdp_buff {
@@ -354,7 +355,6 @@ struct ice_rx_ring {
 	struct ice_tx_ring *xdp_ring;
 	struct xsk_buff_pool *xsk_pool;
 	dma_addr_t dma;			/* physical address of ring */
-	u64 cached_phctime;
 	u16 rx_buf_len;
 	u8 dcb_tc;			/* Traffic class of ring */
 	u8 ptp_rx;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 8b5cee0429d3..7e9f3528d6b5 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -197,7 +197,7 @@ ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
 		       const union ice_32b_rx_flex_desc *rx_desc,
 		       struct sk_buff *skb)
 {
-	u64 ts_ns = ice_ptp_get_rx_hwts(rx_desc, rx_ring);
+	u64 ts_ns = ice_ptp_get_rx_hwts(rx_desc, &rx_ring->pkt_ctx);
 
 	*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
 		.hwtstamp	= ns_to_ktime(ts_ns),
@@ -509,3 +509,26 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
 			spin_unlock(&xdp_ring->tx_lock);
 	}
 }
+
+/**
+ * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
+ * @ctx: XDP buff pointer
+ * @ts_ns: destination address
+ *
+ * Copy HW timestamp (if available) to the destination address.
+ */
+static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+
+	*ts_ns = ice_ptp_get_rx_hwts(xdp_ext->pkt_ctx.eop_desc,
+				     &xdp_ext->pkt_ctx);
+	if (!*ts_ns)
+		return -ENODATA;
+
+	return 0;
+}
+
+const struct xdp_metadata_ops ice_xdp_md_ops = {
+	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
+};
-- 
2.41.0



* [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (4 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 05/18] ice: Support HW timestamp hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-20 15:27   ` Maciej Fijalkowski
  2023-10-12 17:05 ` [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

The RX hash XDP hint requests both the hash value and its type.
The type is XDP-specific, so hardware ptypes need a separate mapping
to these values. Create a lookup table for that.

Instead of creating a new long list, reuse the contents
of ice_ptype_lkup[] through the preprocessor.

The current hash type enum does not contain an ICMP packet type,
but ice devices support it, so also add a new type to the core code.

Then use the previously refactored code to create a function
that allows XDP code to read the RX hash.
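
For readers unfamiliar with the trick, the list reuse described above can be
sketched in standalone C roughly as follows. This is only an illustration
under simplifying assumptions: every DEMO_* and demo_* name is made up for
the example, and the list has 4 entries instead of 154.

```c
#include <stdint.h>

/* Hypothetical, shortened stand-in for the ICE_PTYPES list. */
#define DEMO_PTYPES			\
	DEMO_PTT_UNUSED_ENTRY(0),	\
	DEMO_PTT(1, NONE, NONE),	\
	DEMO_PTT(2, IPV4, UDP),		\
	DEMO_PTT(3, IPV6, TCP),

#define DEMO_NUM_DEFINED_PTYPES	4

/* Hypothetical hash type bits, loosely modelled on enum xdp_rss_hash_type. */
enum demo_hash_type {
	DEMO_HASH_NONE		= 0,
	DEMO_HASH_L3_IPV4	= 1 << 0,
	DEMO_HASH_L3_IPV6	= 1 << 1,
	DEMO_HASH_L4_TCP	= 1 << 2,
	DEMO_HASH_L4_UDP	= 1 << 3,
};

struct demo_ptype_decoded {
	uint8_t known;
};

/* First expansion: a table that only records whether a ptype is defined. */
#define DEMO_PTT(PTYPE, L3, L4)		[PTYPE] = { 1 }
#define DEMO_PTT_UNUSED_ENTRY(PTYPE)	[PTYPE] = { 0 }

static const struct demo_ptype_decoded
demo_ptype_lkup[DEMO_NUM_DEFINED_PTYPES] = {
	DEMO_PTYPES
};

#undef DEMO_PTT
#undef DEMO_PTT_UNUSED_ENTRY

/* Second expansion: the same list now yields hash type bits. */
#define DEMO_PTT(PTYPE, L3, L4)		\
	[PTYPE] = DEMO_HASH_L3_##L3 | DEMO_HASH_L4_##L4
#define DEMO_PTT_UNUSED_ENTRY(PTYPE)	[PTYPE] = 0

/* Names produced by concatenation that have no enum value of their own. */
#define DEMO_HASH_L3_NONE	DEMO_HASH_NONE
#define DEMO_HASH_L4_NONE	DEMO_HASH_NONE

static const uint32_t demo_ptype_to_hash[DEMO_NUM_DEFINED_PTYPES] = {
	DEMO_PTYPES
};

#undef DEMO_HASH_L3_NONE
#undef DEMO_HASH_L4_NONE
#undef DEMO_PTT
#undef DEMO_PTT_UNUSED_ENTRY

/* Returns 1 if the ptype has a defined entry, 0 otherwise. */
int demo_ptype_known(uint16_t ptype)
{
	return ptype < DEMO_NUM_DEFINED_PTYPES && demo_ptype_lkup[ptype].known;
}

/* Bounds-checked lookup; out-of-range ptypes map to "no hash type". */
uint32_t demo_hash_type(uint16_t ptype)
{
	if (ptype >= DEMO_NUM_DEFINED_PTYPES)
		return DEMO_HASH_NONE;

	return demo_ptype_to_hash[ptype];
}
```

Because both tables expand the same list, an entry added to it cannot end up
in one table and be forgotten in the other.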

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 412 +++++++++---------
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  73 ++++
 include/net/xdp.h                             |   3 +
 3 files changed, 284 insertions(+), 204 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
index 89f986a75cc8..d384ddfcb83e 100644
--- a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
+++ b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
@@ -673,6 +673,212 @@ struct ice_tlan_ctx {
  *      Use the enum ice_rx_l2_ptype to decode the packet type
  * ENDIF
  */
+#define ICE_PTYPES								\
+	/* L2 Packet types */							\
+	ICE_PTT_UNUSED_ENTRY(0),						\
+	ICE_PTT(1, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),			\
+	ICE_PTT_UNUSED_ENTRY(2),						\
+	ICE_PTT_UNUSED_ENTRY(3),						\
+	ICE_PTT_UNUSED_ENTRY(4),						\
+	ICE_PTT_UNUSED_ENTRY(5),						\
+	ICE_PTT(6, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),			\
+	ICE_PTT(7, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),			\
+	ICE_PTT_UNUSED_ENTRY(8),						\
+	ICE_PTT_UNUSED_ENTRY(9),						\
+	ICE_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),		\
+	ICE_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),		\
+	ICE_PTT_UNUSED_ENTRY(12),						\
+	ICE_PTT_UNUSED_ENTRY(13),						\
+	ICE_PTT_UNUSED_ENTRY(14),						\
+	ICE_PTT_UNUSED_ENTRY(15),						\
+	ICE_PTT_UNUSED_ENTRY(16),						\
+	ICE_PTT_UNUSED_ENTRY(17),						\
+	ICE_PTT_UNUSED_ENTRY(18),						\
+	ICE_PTT_UNUSED_ENTRY(19),						\
+	ICE_PTT_UNUSED_ENTRY(20),						\
+	ICE_PTT_UNUSED_ENTRY(21),						\
+										\
+	/* Non Tunneled IPv4 */							\
+	ICE_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(25),						\
+	ICE_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),		\
+	ICE_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),		\
+	ICE_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> IPv4 */							\
+	ICE_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(32),						\
+	ICE_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> IPv6 */							\
+	ICE_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(39),						\
+	ICE_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT */							\
+	ICE_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),		\
+										\
+	/* IPv4 --> GRE/NAT --> IPv4 */						\
+	ICE_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(47),						\
+	ICE_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT --> IPv6 */						\
+	ICE_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(54),						\
+	ICE_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT --> MAC */						\
+	ICE_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */					\
+	ICE_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(62),						\
+	ICE_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */					\
+	ICE_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(69),						\
+	ICE_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 --> GRE/NAT --> MAC/VLAN */					\
+	ICE_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */				\
+	ICE_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(77),						\
+	ICE_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */				\
+	ICE_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(84),						\
+	ICE_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* Non Tunneled IPv6 */							\
+	ICE_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(91),						\
+	ICE_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),		\
+	ICE_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),		\
+	ICE_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> IPv4 */							\
+	ICE_PTT(95, IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(96, IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(97, IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(98),						\
+	ICE_PTT(99, IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> IPv6 */							\
+	ICE_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(105),						\
+	ICE_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT */							\
+	ICE_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),		\
+										\
+	/* IPv6 --> GRE/NAT -> IPv4 */						\
+	ICE_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(113),						\
+	ICE_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT -> IPv6 */						\
+	ICE_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(120),						\
+	ICE_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT -> MAC */						\
+	ICE_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */					\
+	ICE_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(128),						\
+	ICE_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */					\
+	ICE_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(135),						\
+	ICE_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN */					\
+	ICE_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */				\
+	ICE_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(143),						\
+	ICE_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */				\
+	ICE_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(150),						\
+	ICE_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
+
+#define ICE_NUM_DEFINED_PTYPES	154
 
 /* macro to make the table lines short, use explicit indexing with [PTYPE] */
 #define ICE_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
@@ -695,212 +901,10 @@ struct ice_tlan_ctx {
 
 /* Lookup table mapping in the 10-bit HW PTYPE to the bit field for decoding */
 static const struct ice_rx_ptype_decoded ice_ptype_lkup[BIT(10)] = {
-	/* L2 Packet types */
-	ICE_PTT_UNUSED_ENTRY(0),
-	ICE_PTT(1, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	ICE_PTT_UNUSED_ENTRY(2),
-	ICE_PTT_UNUSED_ENTRY(3),
-	ICE_PTT_UNUSED_ENTRY(4),
-	ICE_PTT_UNUSED_ENTRY(5),
-	ICE_PTT(6, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(7, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(8),
-	ICE_PTT_UNUSED_ENTRY(9),
-	ICE_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(12),
-	ICE_PTT_UNUSED_ENTRY(13),
-	ICE_PTT_UNUSED_ENTRY(14),
-	ICE_PTT_UNUSED_ENTRY(15),
-	ICE_PTT_UNUSED_ENTRY(16),
-	ICE_PTT_UNUSED_ENTRY(17),
-	ICE_PTT_UNUSED_ENTRY(18),
-	ICE_PTT_UNUSED_ENTRY(19),
-	ICE_PTT_UNUSED_ENTRY(20),
-	ICE_PTT_UNUSED_ENTRY(21),
-
-	/* Non Tunneled IPv4 */
-	ICE_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(25),
-	ICE_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv4 */
-	ICE_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(32),
-	ICE_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv6 */
-	ICE_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(39),
-	ICE_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT */
-	ICE_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> IPv4 */
-	ICE_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(47),
-	ICE_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> IPv6 */
-	ICE_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(54),
-	ICE_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC */
-	ICE_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */
-	ICE_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(62),
-	ICE_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */
-	ICE_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(69),
-	ICE_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC/VLAN */
-	ICE_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(77),
-	ICE_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(84),
-	ICE_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* Non Tunneled IPv6 */
-	ICE_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(91),
-	ICE_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv4 */
-	ICE_PTT(95, IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(96, IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(97, IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(98),
-	ICE_PTT(99, IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv6 */
-	ICE_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(105),
-	ICE_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT */
-	ICE_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> IPv4 */
-	ICE_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(113),
-	ICE_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> IPv6 */
-	ICE_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(120),
-	ICE_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC */
-	ICE_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */
-	ICE_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(128),
-	ICE_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */
-	ICE_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(135),
-	ICE_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN */
-	ICE_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(143),
-	ICE_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(150),
-	ICE_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
+	ICE_PTYPES
 
 	/* unused entries */
-	[154 ... 1023] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
+	[ICE_NUM_DEFINED_PTYPES ... 1023] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
 };
 
 static inline struct ice_rx_ptype_decoded ice_decode_rx_desc_ptype(u16 ptype)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 7e9f3528d6b5..079b0c18047b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -529,6 +529,79 @@ static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
 	return 0;
 }
 
+/* Define a ptype index -> XDP hash type lookup table.
+ * It uses the same ptype definitions as ice_ptype_lkup[],
+ * avoiding possible copy-paste errors.
+ */
+#undef ICE_PTT
+#undef ICE_PTT_UNUSED_ENTRY
+
+#define ICE_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
+	[PTYPE] = XDP_RSS_L3_##OUTER_IP_VER | XDP_RSS_L4_##I | XDP_RSS_TYPE_##PL
+
+#define ICE_PTT_UNUSED_ENTRY(PTYPE) [PTYPE] = 0
+
+/* A few supplementary definitions for when XDP hash types do not coincide
+ * with what can be generated from ptype definitions
+ * by means of preprocessor concatenation.
+ */
+#define XDP_RSS_L3_NONE		XDP_RSS_TYPE_NONE
+#define XDP_RSS_L4_NONE		XDP_RSS_TYPE_NONE
+#define XDP_RSS_TYPE_PAY2	XDP_RSS_TYPE_L2
+#define XDP_RSS_TYPE_PAY3	XDP_RSS_TYPE_NONE
+#define XDP_RSS_TYPE_PAY4	XDP_RSS_L4
+
+static const enum xdp_rss_hash_type
+ice_ptype_to_xdp_hash[ICE_NUM_DEFINED_PTYPES] = {
+	ICE_PTYPES
+};
+
+#undef XDP_RSS_L3_NONE
+#undef XDP_RSS_L4_NONE
+#undef XDP_RSS_TYPE_PAY2
+#undef XDP_RSS_TYPE_PAY3
+#undef XDP_RSS_TYPE_PAY4
+
+#undef ICE_PTT
+#undef ICE_PTT_UNUSED_ENTRY
+
+/**
+ * ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
+ * @eop_desc: End of Packet descriptor
+ */
+static enum xdp_rss_hash_type
+ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
+{
+	u16 ptype = ice_get_ptype(eop_desc);
+
+	if (unlikely(ptype >= ICE_NUM_DEFINED_PTYPES))
+		return 0;
+
+	return ice_ptype_to_xdp_hash[ptype];
+}
+
+/**
+ * ice_xdp_rx_hash - RX hash XDP hint handler
+ * @ctx: XDP buff pointer
+ * @hash: hash destination address
+ * @rss_type: XDP hash type destination address
+ *
+ * Copy RX hash (if available) and its type to the destination address.
+ */
+static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
+			   enum xdp_rss_hash_type *rss_type)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+
+	*hash = ice_get_rx_hash(xdp_ext->pkt_ctx.eop_desc);
+	*rss_type = ice_xdp_rx_hash_type(xdp_ext->pkt_ctx.eop_desc);
+	if (unlikely(!*hash))
+		return -ENODATA;
+
+	return 0;
+}
+
 const struct xdp_metadata_ops ice_xdp_md_ops = {
 	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
+	.xmo_rx_hash			= ice_xdp_rx_hash,
 };
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 349c36fb5fd8..eb77040b4825 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -427,6 +427,7 @@ enum xdp_rss_hash_type {
 	XDP_RSS_L4_UDP		= BIT(5),
 	XDP_RSS_L4_SCTP		= BIT(6),
 	XDP_RSS_L4_IPSEC	= BIT(7), /* L4 based hash include IPSEC SPI */
+	XDP_RSS_L4_ICMP		= BIT(8),
 
 	/* Second part: RSS hash type combinations used for driver HW mapping */
 	XDP_RSS_TYPE_NONE            = 0,
@@ -442,11 +443,13 @@ enum xdp_rss_hash_type {
 	XDP_RSS_TYPE_L4_IPV4_UDP     = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
 	XDP_RSS_TYPE_L4_IPV4_SCTP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
 	XDP_RSS_TYPE_L4_IPV4_IPSEC   = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
+	XDP_RSS_TYPE_L4_IPV4_ICMP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
 
 	XDP_RSS_TYPE_L4_IPV6_TCP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_TCP,
 	XDP_RSS_TYPE_L4_IPV6_UDP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
 	XDP_RSS_TYPE_L4_IPV6_SCTP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
 	XDP_RSS_TYPE_L4_IPV6_IPSEC   = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
+	XDP_RSS_TYPE_L4_IPV6_ICMP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
 
 	XDP_RSS_TYPE_L4_IPV6_TCP_EX  = XDP_RSS_TYPE_L4_IPV6_TCP  | XDP_RSS_L3_DYNHDR,
 	XDP_RSS_TYPE_L4_IPV6_UDP_EX  = XDP_RSS_TYPE_L4_IPV6_UDP  | XDP_RSS_L3_DYNHDR,
-- 
2.41.0


* [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (5 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-17 16:13   ` Maciej Fijalkowski
  2023-10-20 15:29   ` Maciej Fijalkowski
  2023-10-12 17:05 ` [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint Larysa Zaremba
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

In AF_XDP ZC mode, xdp_buff is not stored on the ring; instead, it is
provided by xsk_buff_pool.
Space for metadata sources right after such buffers was already reserved
in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
This makes the implementation rather straightforward.

Update AF_XDP ZC packet processing to support XDP hints.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
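For readers unfamiliar with the reserved area, the idea can be sketched in
plain user-space C. This is only an illustration: every demo_* name and
DEMO_PRIV_MAX are hypothetical stand-ins rather than kernel definitions, and
memcpy() is used where the driver casts the buffer to struct ice_xdp_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size of the reserved private area (stand-in for XSK_PRIV_MAX). */
#define DEMO_PRIV_MAX 24

/* Stand-in for the generic xdp_buff. */
struct demo_buff {
	void *data;
	void *data_end;
};

/* Stand-in for a pool-owned buffer: generic part, then the reserved area. */
struct demo_pool_buff {
	struct demo_buff xdp;
	uint8_t cb[DEMO_PRIV_MAX];
	uint64_t pool_private;	/* pool bookkeeping that must stay intact */
};

/* Stand-in for the per-packet context the hint handlers read. */
struct demo_pkt_ctx {
	const void *eop_desc;
	uint64_t cached_time;
	uint16_t vlan_proto;
};

/* Compile-time size check, in the spirit of XSK_CHECK_PRIV_TYPE(). */
static_assert(sizeof(struct demo_pkt_ctx) <= DEMO_PRIV_MAX,
	      "packet context must fit in the reserved area");

/* Refresh the context for every packet: ring-wide values plus descriptor. */
static void demo_prepare_ctx(struct demo_pool_buff *pb,
			     const struct demo_pkt_ctx *ring_ctx,
			     const void *eop_desc)
{
	struct demo_pkt_ctx ctx = *ring_ctx;

	ctx.eop_desc = eop_desc;
	memcpy(pb->cb, &ctx, sizeof(ctx));
}

/* Read the context back, as a hint handler would. */
static struct demo_pkt_ctx demo_read_ctx(const struct demo_pool_buff *pb)
{
	struct demo_pkt_ctx ctx;

	memcpy(&ctx, pb->cb, sizeof(ctx));
	return ctx;
}
```

The compile-time check is what makes the per-packet copy safe: if the context
ever outgrew the reserved area, the build would fail instead of the pool's own
fields being overwritten.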
 drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index ef778b8e6d1b..6ca620b2fbdd 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
 	return ICE_XDP_CONSUMED;
 }
 
+/**
+ * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
+ * @xdp: xdp_buff used as input to the XDP program
+ * @eop_desc: End of packet descriptor
+ * @rx_ring: Rx ring with packet context
+ *
+ * In regular XDP, xdp_buff is placed inside the ring structure,
+ * just before the packet context, so the latter can be accessed
+ * with xdp_buff address only at all times, but in ZC mode,
+ * xdp_buffs come from the pool, so we need to reinitialize
+ * context for every packet.
+ *
+ * We can safely convert xdp_buff_xsk to ice_xdp_buff,
+ * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
+ * right after xdp_buff, for our private use.
+ * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
+ */
+static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
+				   union ice_32b_rx_flex_desc *eop_desc,
+				   struct ice_rx_ring *rx_ring)
+{
+	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
+	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
+	ice_xdp_meta_set_desc(xdp, eop_desc);
+}
+
 /**
  * ice_run_xdp_zc - Executes an XDP program in zero-copy path
  * @rx_ring: Rx ring
  * @xdp: xdp_buff used as input to the XDP program
  * @xdp_prog: XDP program to run
  * @xdp_ring: ring to be used for XDP_TX action
+ * @rx_desc: packet descriptor
  *
  * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
  */
 static int
 ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
+	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
+	       union ice_32b_rx_flex_desc *rx_desc)
 {
 	int err, result = ICE_XDP_PASS;
 	u32 act;
 
+	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	if (likely(act == XDP_REDIRECT)) {
@@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		if (ice_is_non_eop(rx_ring, rx_desc))
 			continue;
 
-		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
+		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
+					 rx_desc);
 		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
 			xdp_xmit |= xdp_res;
 		} else if (xdp_res == ICE_XDP_EXIT) {
-- 
2.41.0


* [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (6 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-18 23:59   ` Jakub Kicinski
  2023-10-12 17:05 ` [PATCH bpf-next v6 09/18] ice: Implement " Larysa Zaremba
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Implement functionality that enables drivers to expose the VLAN tag
to XDP code.

The VLAN tag is represented by 2 variables:
- protocol ID, which is passed to bpf code in network byte order (BE)
- VLAN TCI, in host byte order

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
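To illustrate how a consumer might interpret the two values, here is a small
user-space sketch. The demo_* names and DEMO_* constants are hypothetical; the
constants mirror the well-known 802.1Q/802.1ad TPIDs and TCI layout (PCP in
bits 15-13, DEI in bit 12, VID in bits 11-0). The protocol is compared in
network byte order, while the TCI is used without any byte swap.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local copies of the usual TPID values and TCI masks. */
#define DEMO_ETH_P_8021Q	0x8100
#define DEMO_ETH_P_8021AD	0x88A8
#define DEMO_VID_MASK		0x0fff
#define DEMO_DEI_MASK		0x1000
#define DEMO_PCP_SHIFT		13

struct demo_vlan_info {
	uint16_t vid;	/* VLAN identifier, 12 bits */
	uint8_t dei;	/* drop eligible indicator, 1 bit */
	uint8_t pcp;	/* priority code point, 3 bits */
	bool is_ctag;	/* TPID is 802.1Q */
	bool is_stag;	/* TPID is 802.1ad */
};

/*
 * vlan_proto_be is in network byte order, so it is compared against
 * htons(constant). vlan_tci is in host byte order and is masked directly.
 */
static struct demo_vlan_info demo_decode_vlan(uint16_t vlan_proto_be,
					      uint16_t vlan_tci)
{
	struct demo_vlan_info info = {
		.vid = vlan_tci & DEMO_VID_MASK,
		.dei = (vlan_tci & DEMO_DEI_MASK) ? 1 : 0,
		.pcp = vlan_tci >> DEMO_PCP_SHIFT,
		.is_ctag = vlan_proto_be == htons(DEMO_ETH_P_8021Q),
		.is_stag = vlan_proto_be == htons(DEMO_ETH_P_8021AD),
	};

	return info;
}
```

For example, a TCI of 0xA064 with an 802.1Q TPID decodes to VID 100, DEI 0 and
PCP 5.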
 Documentation/networking/xdp-rx-metadata.rst |  8 ++++-
 include/net/xdp.h                            |  6 ++++
 include/uapi/linux/netdev.h                  |  5 ++-
 net/core/xdp.c                               | 33 ++++++++++++++++++++
 tools/include/uapi/linux/netdev.h            |  5 ++-
 5 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index 205696780b78..bb53b00d1b2c 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
 metadata is supported, this set will grow:
 
 .. kernel-doc:: net/core/xdp.c
-   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
+   :identifiers: bpf_xdp_metadata_rx_timestamp
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_vlan_tag
 
 An XDP program can use these kfuncs to read the metadata into stack
 variables for its own consumption. Or, to pass the metadata on to other
diff --git a/include/net/xdp.h b/include/net/xdp.h
index eb77040b4825..ef79f124dbcf 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -399,6 +399,10 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 			   NETDEV_XDP_RX_METADATA_HASH, \
 			   bpf_xdp_metadata_rx_hash, \
 			   xmo_rx_hash) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
+			   NETDEV_XDP_RX_METADATA_VLAN_TAG, \
+			   bpf_xdp_metadata_rx_vlan_tag, \
+			   xmo_rx_vlan_tag) \
 
 enum xdp_rx_metadata {
 #define XDP_METADATA_KFUNC(name, _, __, ___) name,
@@ -460,6 +464,8 @@ struct xdp_metadata_ops {
 	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
 	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
 			       enum xdp_rss_hash_type *rss_type);
+	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, __be16 *vlan_proto,
+				   u16 *vlan_tci);
 };
 
 #ifdef CONFIG_NET
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 2943a151d4f1..661f603e3e43 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -44,13 +44,16 @@ enum netdev_xdp_act {
  *   timestamp via bpf_xdp_metadata_rx_timestamp().
  * @NETDEV_XDP_RX_METADATA_HASH: Device is capable of exposing receive packet
  *   hash via bpf_xdp_metadata_rx_hash().
+ * @NETDEV_XDP_RX_METADATA_VLAN_TAG: Device is capable of exposing stripped
+ *   receive VLAN tag (proto and TCI) via bpf_xdp_metadata_rx_vlan_tag().
  */
 enum netdev_xdp_rx_metadata {
 	NETDEV_XDP_RX_METADATA_TIMESTAMP = 1,
 	NETDEV_XDP_RX_METADATA_HASH = 2,
+	NETDEV_XDP_RX_METADATA_VLAN_TAG = 4,
 
 	/* private: */
-	NETDEV_XDP_RX_METADATA_MASK = 3,
+	NETDEV_XDP_RX_METADATA_MASK = 7,
 };
 
 enum {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index df4789ab512d..fb87925b3dc3 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -738,6 +738,39 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag
+ * @ctx: XDP context pointer.
+ * @vlan_proto: Destination pointer for VLAN Tag protocol identifier (TPID).
+ * @vlan_tci: Destination pointer for VLAN TCI (VID + DEI + PCP)
+ *
+ * In case of success, ``vlan_proto`` contains *Tag protocol identifier (TPID)*,
+ * usually ``ETH_P_8021Q`` or ``ETH_P_8021AD``, but some networks can use
+ * custom TPIDs. ``vlan_proto`` is stored in **network byte order (BE)**
+ * and should be used as follows:
+ * ``if (vlan_proto == bpf_htons(ETH_P_8021Q)) do_something();``
+ *
+ * ``vlan_tci`` contains the remaining 16 bits of a VLAN tag.
+ * Driver is expected to provide those in **host byte order (usually LE)**,
+ * so the bpf program should not perform byte conversion.
+ * According to 802.1Q standard, *VLAN TCI (Tag control information)*
+ * is a bit field that contains:
+ * *VLAN identifier (VID)* that can be read with ``vlan_tci & 0xfff``,
+ * *Drop eligible indicator (DEI)* - 1 bit,
+ * *Priority code point (PCP)* - 3 bits.
+ * For detailed meaning of DEI and PCP, please refer to other sources.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
+ * * ``-ENODATA``    : VLAN tag was not stripped or is not available
+ */
+__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
+					     __be16 *vlan_proto, u16 *vlan_tci)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 2943a151d4f1..661f603e3e43 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -44,13 +44,16 @@ enum netdev_xdp_act {
  *   timestamp via bpf_xdp_metadata_rx_timestamp().
  * @NETDEV_XDP_RX_METADATA_HASH: Device is capable of exposing receive packet
  *   hash via bpf_xdp_metadata_rx_hash().
+ * @NETDEV_XDP_RX_METADATA_VLAN_TAG: Device is capable of exposing stripped
+ *   receive VLAN tag (proto and TCI) via bpf_xdp_metadata_rx_vlan_tag().
  */
 enum netdev_xdp_rx_metadata {
 	NETDEV_XDP_RX_METADATA_TIMESTAMP = 1,
 	NETDEV_XDP_RX_METADATA_HASH = 2,
+	NETDEV_XDP_RX_METADATA_VLAN_TAG = 4,
 
 	/* private: */
-	NETDEV_XDP_RX_METADATA_MASK = 3,
+	NETDEV_XDP_RX_METADATA_MASK = 7,
 };
 
 enum {
-- 
2.41.0


* [PATCH bpf-next v6 09/18] ice: Implement VLAN tag hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (7 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 10/18] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Implement the .xmo_rx_vlan_tag callback to allow XDP code to read the
packet's VLAN tag.

At the same time, use vlan_tci instead of vlan_tag in touched code,
because "VLAN tag" often refers to VLAN proto and VLAN TCI combined,
while the code clearly stores only the VLAN TCI.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c     | 20 ++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 ++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 26 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  4 +--
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  6 ++---
 6 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 2153e27642eb..47e8920e1727 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -6014,6 +6014,23 @@ ice_fix_features(struct net_device *netdev, netdev_features_t features)
 	return features;
 }
 
+/**
+ * ice_set_rx_rings_vlan_proto - update rings with new stripped VLAN proto
+ * @vsi: PF's VSI
+ * @vlan_ethertype: VLAN ethertype (802.1Q or 802.1ad) in network byte order
+ *
+ * Store current stripped VLAN proto in ring packet context,
+ * so it can be accessed more efficiently by packet processing code.
+ */
+static void
+ice_set_rx_rings_vlan_proto(struct ice_vsi *vsi, __be16 vlan_ethertype)
+{
+	u16 i;
+
+	ice_for_each_alloc_rxq(vsi, i)
+		vsi->rx_rings[i]->pkt_ctx.vlan_proto = vlan_ethertype;
+}
+
 /**
  * ice_set_vlan_offload_features - set VLAN offload features for the PF VSI
  * @vsi: PF's VSI
@@ -6056,6 +6073,9 @@ ice_set_vlan_offload_features(struct ice_vsi *vsi, netdev_features_t features)
 	if (strip_err || insert_err)
 		return -EIO;
 
+	ice_set_rx_rings_vlan_proto(vsi, enable_stripping ?
+				    htons(vlan_ethertype) : 0);
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4e6546d9cf85..4fd7614f243d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1183,7 +1183,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		struct sk_buff *skb;
 		unsigned int size;
 		u16 stat_err_bits;
-		u16 vlan_tag = 0;
+		u16 vlan_tci;
 
 		/* get the Rx desc from Rx ring based on 'next_to_clean' */
 		rx_desc = ICE_RX_DESC(rx_ring, ntc);
@@ -1278,7 +1278,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 			continue;
 		}
 
-		vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
+		vlan_tci = ice_get_vlan_tci(rx_desc);
 
 		/* pad the skb if needed, to make a valid ethernet frame */
 		if (eth_skb_pad(skb))
@@ -1292,7 +1292,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 
 		ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
 		/* send completed skb up the stack */
-		ice_receive_skb(rx_ring, skb, vlan_tag);
+		ice_receive_skb(rx_ring, skb, vlan_tci);
 
 		/* update budget accounting */
 		total_rx_pkts++;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 4237702a58a9..41e0b14e6643 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -260,6 +260,7 @@ enum ice_rx_dtype {
 struct ice_pkt_ctx {
 	const union ice_32b_rx_flex_desc *eop_desc;
 	u64 cached_phctime;
+	__be16 vlan_proto;
 };
 
 struct ice_xdp_buff {
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 079b0c18047b..b72d0a943e6f 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -601,7 +601,33 @@ static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+/**
+ * ice_xdp_rx_vlan_tag - VLAN tag XDP hint handler
+ * @ctx: XDP buff pointer
+ * @vlan_proto: destination address for VLAN protocol
+ * @vlan_tci: destination address for VLAN TCI
+ *
+ * Copy VLAN tag (if it was stripped) and corresponding protocol
+ * to the destination address.
+ */
+static int ice_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
+			       u16 *vlan_tci)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+
+	*vlan_proto = xdp_ext->pkt_ctx.vlan_proto;
+	if (!*vlan_proto)
+		return -ENODATA;
+
+	*vlan_tci = ice_get_vlan_tci(xdp_ext->pkt_ctx.eop_desc);
+	if (!*vlan_tci)
+		return -ENODATA;
+
+	return 0;
+}
+
 const struct xdp_metadata_ops ice_xdp_md_ops = {
 	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
 	.xmo_rx_hash			= ice_xdp_rx_hash,
+	.xmo_rx_vlan_tag		= ice_xdp_rx_vlan_tag,
 };
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 145883eec129..b7205826fea8 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -84,7 +84,7 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
 }
 
 /**
- * ice_get_vlan_tag_from_rx_desc - get VLAN from Rx flex descriptor
+ * ice_get_vlan_tci - get VLAN TCI from Rx flex descriptor
  * @rx_desc: Rx 32b flex descriptor with RXDID=2
  *
  * The OS and current PF implementation only support stripping a single VLAN tag
@@ -92,7 +92,7 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
  * one is found return the tag, else return 0 to mean no VLAN tag was found.
  */
 static inline u16
-ice_get_vlan_tag_from_rx_desc(union ice_32b_rx_flex_desc *rx_desc)
+ice_get_vlan_tci(const union ice_32b_rx_flex_desc *rx_desc)
 {
 	u16 stat_err_bits;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 6ca620b2fbdd..39775bb6cec1 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -898,7 +898,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		struct xdp_buff *xdp;
 		struct sk_buff *skb;
 		u16 stat_err_bits;
-		u16 vlan_tag = 0;
+		u16 vlan_tci;
 
 		rx_desc = ICE_RX_DESC(rx_ring, ntc);
 
@@ -977,10 +977,10 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
+		vlan_tci = ice_get_vlan_tci(rx_desc);
 
 		ice_process_skb_fields(rx_ring, rx_desc, skb);
-		ice_receive_skb(rx_ring, skb, vlan_tag);
+		ice_receive_skb(rx_ring, skb, vlan_tci);
 	}
 
 	rx_ring->next_to_clean = ntc;
-- 
2.41.0


* [PATCH bpf-next v6 10/18] ice: use VLAN proto from ring packet context in skb path
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (8 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 09/18] ice: Implement " Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition Larysa Zaremba
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

VLAN proto, used in the ice XDP hints implementation, is stored in the ring
packet context. Use this value in skb VLAN processing too, instead of
checking netdev features.

At the same time, use vlan_tci instead of vlan_tag in touched code,
because vlan_tag is misleading.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
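The resulting decision can be summarized with a short user-space sketch. All
demo_* names are hypothetical. It assumes, as in the previous patch, that the
ring stores the stripped protocol in network byte order, or 0 when stripping
is disabled.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VLAN_VID_MASK 0x0fff

/* Stand-in for the ring packet context. */
struct demo_ring {
	uint16_t vlan_proto_be;	/* 0 means stripping is disabled */
};

/* Mirrors the update done when VLAN offload features change. */
static void demo_set_ring_vlan_proto(struct demo_ring *ring, bool strip,
				     uint16_t ethertype)
{
	ring->vlan_proto_be = strip ? htons(ethertype) : 0;
}

/* A tag is attached only for a non-zero VID and a non-zero ring proto. */
static bool demo_should_put_tag(const struct demo_ring *ring, uint16_t tci)
{
	return (tci & DEMO_VLAN_VID_MASK) && ring->vlan_proto_be;
}
```

A single non-zero test on the cached protocol replaces the two feature-flag
branches, and the same cached value is passed on as the tag's protocol.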
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 14 +++++---------
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  2 +-
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index b72d0a943e6f..f297395f54ab 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -246,21 +246,17 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
  * ice_receive_skb - Send a completed packet up the stack
  * @rx_ring: Rx ring in play
  * @skb: packet to send up
- * @vlan_tag: VLAN tag for packet
+ * @vlan_tci: VLAN TCI for packet
  *
  * This function sends the completed packet (via. skb) up the stack using
  * gro receive functions (with/without VLAN tag)
  */
 void
-ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
+ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tci)
 {
-	netdev_features_t features = rx_ring->netdev->features;
-	bool non_zero_vlan = !!(vlan_tag & VLAN_VID_MASK);
-
-	if ((features & NETIF_F_HW_VLAN_CTAG_RX) && non_zero_vlan)
-		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
-	else if ((features & NETIF_F_HW_VLAN_STAG_RX) && non_zero_vlan)
-		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD), vlan_tag);
+	if ((vlan_tci & VLAN_VID_MASK) && rx_ring->pkt_ctx.vlan_proto)
+		__vlan_hwaccel_put_tag(skb, rx_ring->pkt_ctx.vlan_proto,
+				       vlan_tci);
 
 	napi_gro_receive(&rx_ring->q_vector->napi, skb);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index b7205826fea8..8487884bf5c4 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -150,7 +150,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
 		       struct sk_buff *skb);
 void
-ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
+ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tci);
 
 static inline void
 ice_xdp_meta_set_desc(struct xdp_buff *xdp,
-- 
2.41.0


* [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (9 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 10/18] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-20 16:32   ` Maciej Fijalkowski
  2023-10-12 17:05 ` [PATCH bpf-next v6 12/18] veth: Implement VLAN tag XDP hint Larysa Zaremba
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Usage of XDP hints requires putting additional information after the
xdp_buff. In the basic case, only the descriptor has to be copied on a
per-packet basis, because xdp_buff permanently resides before per-ring
metadata (cached time and VLAN protocol ID).

However, in ZC mode, xdp_buffs come from a pool, so the memory after such
a buffer does not contain any reliable information. Everything has to be
copied for every packet, which hurts performance.

Introduce a static key to enable meta sources assignment only when the
attached XDP program is device-bound.

This patch eliminates a 6% performance drop in ZC mode, which was a result
of the addition of XDP hints to the driver.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
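The enable/disable ordering can be modelled in user space as a simple
reference count. This is a rough sketch only: the demo_* names are
hypothetical, a plain int stands in for the static key, and no atomicity or
code patching is modelled. The count is shared by all VSIs and is raised
before the new program becomes visible and lowered only after the old one has
been replaced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_prog {
	bool dev_bound;	/* only device-bound programs may call hint kfuncs */
};

struct demo_vsi {
	struct demo_prog *prog;
};

/* Stand-in for the static key: number of attached device-bound programs. */
static int demo_meta_users;

/* Fast-path test, analogous to the static branch check. */
static bool demo_meta_enabled(void)
{
	return demo_meta_users > 0;
}

static bool demo_prog_has_meta(const struct demo_prog *prog)
{
	return prog && prog->dev_bound;
}

/* Swap the program on a VSI and return the old one. */
static struct demo_prog *demo_assign_prog(struct demo_vsi *vsi,
					  struct demo_prog *prog)
{
	struct demo_prog *old;

	/* Enable first, so the new program never runs with stale context. */
	if (demo_prog_has_meta(prog))
		demo_meta_users++;

	old = vsi->prog;
	vsi->prog = prog;

	/* Disable last, once the old program is no longer installed. */
	if (demo_prog_has_meta(old)) {
		assert(demo_meta_users > 0);
		demo_meta_users--;
	}

	return old;
}
```

Because the count is global, a device-bound program on any one VSI keeps the
metadata path enabled for all of them.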
 drivers/net/ethernet/intel/ice/ice.h      |  1 +
 drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
 drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 3d0f15f8b2b8..76d22be878a4 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -210,6 +210,7 @@ enum ice_feature {
 };
 
 DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
+DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
 
 struct ice_channel {
 	struct list_head list;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 47e8920e1727..ee0df86d34b7 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
 DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
 EXPORT_SYMBOL(ice_xdp_locking_key);
 
+DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
+EXPORT_SYMBOL(ice_xdp_meta_key);
+
 /**
  * ice_hw_to_dev - Get device pointer from the hardware structure
  * @hw: pointer to the device HW structure
@@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
 	return -ENOMEM;
 }
 
+static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
+{
+	return prog && prog->aux->dev_bound;
+}
+
 /**
  * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
  * @vsi: VSI to set the bpf prog on
@@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
 	struct bpf_prog *old_prog;
 	int i;
 
+	if (ice_xdp_prog_has_meta(prog))
+		static_branch_inc(&ice_xdp_meta_key);
+
 	old_prog = xchg(&vsi->xdp_prog, prog);
 	ice_for_each_rxq(vsi, i)
 		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
 
+	if (ice_xdp_prog_has_meta(old_prog))
+		static_branch_dec(&ice_xdp_meta_key);
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4fd7614f243d..19fc182d1f4c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	if (!xdp_prog)
 		goto exit;
 
-	ice_xdp_meta_set_desc(xdp, eop_desc);
+	if (static_branch_unlikely(&ice_xdp_meta_key))
+		ice_xdp_meta_set_desc(xdp, eop_desc);
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 39775bb6cec1..f92d7d33fde6 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
 				   union ice_32b_rx_flex_desc *eop_desc,
 				   struct ice_rx_ring *rx_ring)
 {
+	if (!static_branch_unlikely(&ice_xdp_meta_key))
+		return;
+
 	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
 	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
 	ice_xdp_meta_set_desc(xdp, eop_desc);
-- 
2.41.0


* [PATCH bpf-next v6 12/18] veth: Implement VLAN tag XDP hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (10 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 13/18] net: make vlan_get_tag() return -ENODATA instead of -EINVAL Larysa Zaremba
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

In order to test the VLAN tag hint in hardware-independent selftests,
implement the newly added hint in the veth driver.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/veth.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0deefd1573cf..8a501972670a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1736,6 +1736,24 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
+				u16 *vlan_tci)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+	struct sk_buff *skb = _ctx->skb;
+	int err;
+
+	if (!skb)
+		return -ENODATA;
+
+	err = __vlan_hwaccel_get_tag(skb, vlan_tci);
+	if (err)
+		return err;
+
+	*vlan_proto = skb->vlan_proto;
+	return err;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1760,6 +1778,7 @@ static const struct net_device_ops veth_netdev_ops = {
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
 	.xmo_rx_hash			= veth_xdp_rx_hash,
+	.xmo_rx_vlan_tag		= veth_xdp_rx_vlan_tag,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.41.0


* [PATCH bpf-next v6 13/18] net: make vlan_get_tag() return -ENODATA instead of -EINVAL
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (11 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 12/18] veth: Implement VLAN tag XDP hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint Larysa Zaremba
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski, Jesper Dangaard Brouer

__vlan_hwaccel_get_tag() is used in the veth XDP hints implementation, and
its return value (-EINVAL if the skb is not VLAN tagged) is passed to bpf
code. However, the XDP hints specification requires drivers to return
-ENODATA if a hint cannot be provided for a particular packet.

Solve this inconsistency by changing the error return value of
__vlan_hwaccel_get_tag() from -EINVAL to -ENODATA. Do the same for
__vlan_get_tag(), because this function is supposed to follow the same
convention. This, in turn, makes -ENODATA the only non-zero value
vlan_get_tag() can return. We can do this with no side effects, because
none of the users of the 3 above-mentioned functions rely on the exact
value.

Suggested-by: Jesper Dangaard Brouer <jbrouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
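The effect on a hint handler that forwards the helper's return value can be
sketched in user space as follows. The demo_* names and the simplified skb
structure are hypothetical; only the error-code convention is taken from this
patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the VLAN-related skb fields. */
struct demo_skb {
	bool vlan_present;
	uint16_t vlan_tci;
	uint16_t vlan_proto_be;
};

/* Follows the new convention: -ENODATA when no tag is present. */
static int demo_hwaccel_get_tag(const struct demo_skb *skb, uint16_t *vlan_tci)
{
	if (skb->vlan_present) {
		*vlan_tci = skb->vlan_tci;
		return 0;
	}

	*vlan_tci = 0;
	return -ENODATA;
}

/* Hint handler that forwards the helper's error unchanged. */
static int demo_rx_vlan_tag(const struct demo_skb *skb, uint16_t *vlan_proto_be,
			    uint16_t *vlan_tci)
{
	int err;

	if (!skb)
		return -ENODATA;

	err = demo_hwaccel_get_tag(skb, vlan_tci);
	if (err)
		return err;

	*vlan_proto_be = skb->vlan_proto_be;
	return 0;
}
```

With the helper returning -ENODATA, the handler needs no error translation to
match what the hints documentation asks for.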
 include/linux/if_vlan.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 3028af87716e..c1645c86eed9 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -540,7 +540,7 @@ static inline int __vlan_get_tag(const struct sk_buff *skb, u16 *vlan_tci)
 	struct vlan_ethhdr *veth = skb_vlan_eth_hdr(skb);
 
 	if (!eth_type_vlan(veth->h_vlan_proto))
-		return -EINVAL;
+		return -ENODATA;
 
 	*vlan_tci = ntohs(veth->h_vlan_TCI);
 	return 0;
@@ -561,7 +561,7 @@ static inline int __vlan_hwaccel_get_tag(const struct sk_buff *skb,
 		return 0;
 	} else {
 		*vlan_tci = 0;
-		return -EINVAL;
+		return -ENODATA;
 	}
 }
 
-- 
2.41.0



* [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (12 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 13/18] net: make vlan_get_tag() return -ENODATA instead of -EINVAL Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-17 12:40   ` Tariq Toukan
  2023-10-12 17:05 ` [PATCH bpf-next v6 15/18] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Implement the newly added .xmo_rx_vlan_tag() hint function.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
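A minimal userspace sketch, not part of this patch, of the byte-order
handling in the new callback. struct mock_cqe is a hypothetical, reduced
stand-in for struct mlx5_cqe64, and mock_rx_vlan_tag() only mimics the
shape of mlx5e_xdp_rx_vlan_tag().

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>

#define MOCK_ETH_P_8021Q 0x8100

/* Hypothetical, reduced completion entry; field names follow the patch. */
struct mock_cqe {
	uint8_t l4_l3_hdr_type;	/* bit 0 is what cqe_has_vlan() tests */
	uint16_t vlan_info;	/* TCI, stored big-endian */
};

static int mock_rx_vlan_tag(const struct mock_cqe *cqe, uint16_t *vlan_proto,
			    uint16_t *vlan_tci)
{
	if (!(cqe->l4_l3_hdr_type & 0x1))
		return -ENODATA;

	/* The protocol is reported in network byte order ... */
	*vlan_proto = htons(MOCK_ETH_P_8021Q);
	/* ... while the TCI is converted to host byte order. */
	*vlan_tci = ntohs(cqe->vlan_info);
	return 0;
}
```

Like the patch, the sketch always reports 802.1Q as the protocol.
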
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 15 +++++++++++++++
 include/linux/mlx5/device.h                      |  2 +-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 12f56d0db0af..d7cd14687ce8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -256,9 +256,24 @@ static int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int mlx5e_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
+				 u16 *vlan_tci)
+{
+	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
+	const struct mlx5_cqe64 *cqe = _ctx->cqe;
+
+	if (!cqe_has_vlan(cqe))
+		return -ENODATA;
+
+	*vlan_proto = htons(ETH_P_8021Q);
+	*vlan_tci = be16_to_cpu(cqe->vlan_info);
+	return 0;
+}
+
 const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
 	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+	.xmo_rx_vlan_tag		= mlx5e_xdp_rx_vlan_tag,
 };
 
 /* returns true if packet was consumed by xdp */
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 8fbe22de16ef..0805f8231452 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -916,7 +916,7 @@ static inline u8 get_cqe_tls_offload(struct mlx5_cqe64 *cqe)
 	return (cqe->tls_outer_l3_tunneled >> 3) & 0x3;
 }
 
-static inline bool cqe_has_vlan(struct mlx5_cqe64 *cqe)
+static inline bool cqe_has_vlan(const struct mlx5_cqe64 *cqe)
 {
 	return cqe->l4_l3_hdr_type & 0x1;
 }
-- 
2.41.0



* [PATCH bpf-next v6 15/18] selftests/bpf: Allow VLAN packets in xdp_hw_metadata
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (13 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 16/18] selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata Larysa Zaremba
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Make VLAN c-tag and s-tag XDP hint testing more convenient
by not skipping VLAN-tagged packets.

Allow both 802.1ad and 802.1Q headers.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
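A minimal userspace sketch, not part of this patch, of the header
skipping added to the BPF program: at most one outer tag (802.1ad or
802.1Q) followed by at most one inner 802.1Q tag. It works on a plain
byte buffer instead of xdp_md, and all names ending in _SKETCH as well as
read_proto() and l3_offset() are hypothetical.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN_SKETCH		14	/* dst MAC + src MAC + EtherType */
#define VLAN_HLEN_SKETCH	4	/* TCI + encapsulated EtherType */
#define ETH_P_8021Q_SKETCH	0x8100
#define ETH_P_8021AD_SKETCH	0x88A8

/* Read the EtherType of a (possibly shifted) Ethernet header at 'off'. */
static int read_proto(const uint8_t *pkt, size_t len, size_t off,
		      uint16_t *proto)
{
	uint16_t raw;

	if (off + ETH_HLEN_SKETCH > len)	/* bounds check */
		return -1;

	memcpy(&raw, pkt + off + 12, sizeof(raw));
	*proto = ntohs(raw);
	return 0;
}

/* Returns the offset of the L3 header, or -1 if the frame is too short.
 * Shifting the header start by one VLAN header makes the EtherType slot
 * line up with the encapsulated protocol, as in the BPF program.
 */
static long l3_offset(const uint8_t *pkt, size_t len, uint16_t *l3_proto)
{
	size_t off = 0;
	uint16_t proto;

	if (read_proto(pkt, len, off, &proto))
		return -1;

	/* Outer tag: either 802.1ad or 802.1Q. */
	if (proto == ETH_P_8021AD_SKETCH || proto == ETH_P_8021Q_SKETCH) {
		off += VLAN_HLEN_SKETCH;
		if (read_proto(pkt, len, off, &proto))
			return -1;

		/* Inner tag: only 802.1Q is accepted. */
		if (proto == ETH_P_8021Q_SKETCH) {
			off += VLAN_HLEN_SKETCH;
			if (read_proto(pkt, len, off, &proto))
				return -1;
		}
	}

	*l3_proto = proto;
	return (long)(off + ETH_HLEN_SKETCH);
}
```

For an untagged frame this yields offset 14, for a single tag 18, and for
a double-tagged frame 22.
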
 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c | 10 +++++++++-
 tools/testing/selftests/bpf/xdp_metadata.h          |  8 ++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index b2dfd7066c6e..63d7de6c6bbb 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -26,15 +26,23 @@ int rx(struct xdp_md *ctx)
 {
 	void *data, *data_meta, *data_end;
 	struct ipv6hdr *ip6h = NULL;
-	struct ethhdr *eth = NULL;
 	struct udphdr *udp = NULL;
 	struct iphdr *iph = NULL;
 	struct xdp_meta *meta;
+	struct ethhdr *eth;
 	int err;
 
 	data = (void *)(long)ctx->data;
 	data_end = (void *)(long)ctx->data_end;
 	eth = data;
+
+	if (eth + 1 < data_end && (eth->h_proto == bpf_htons(ETH_P_8021AD) ||
+				   eth->h_proto == bpf_htons(ETH_P_8021Q)))
+		eth = (void *)eth + sizeof(struct vlan_hdr);
+
+	if (eth + 1 < data_end && eth->h_proto == bpf_htons(ETH_P_8021Q))
+		eth = (void *)eth + sizeof(struct vlan_hdr);
+
 	if (eth + 1 < data_end) {
 		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
 			iph = (void *)(eth + 1);
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 938a729bd307..6664893c2c77 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -9,6 +9,14 @@
 #define ETH_P_IPV6 0x86DD
 #endif
 
+#ifndef ETH_P_8021Q
+#define ETH_P_8021Q 0x8100
+#endif
+
+#ifndef ETH_P_8021AD
+#define ETH_P_8021AD 0x88A8
+#endif
+
 struct xdp_meta {
 	__u64 rx_timestamp;
 	__u64 xdp_timestamp;
-- 
2.41.0



* [PATCH bpf-next v6 16/18] selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (14 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 15/18] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 17/18] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 18/18] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Add VLAN hint to the xdp_hw_metadata program.

Also, to make the metadata layout more straightforward, add a flags field
that reports the validity of each hint separately.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
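A minimal userspace sketch, not part of this patch, of the layout idea:
a value and its error code share storage, and a separate flags word says
which one is valid. It also shows the TCI split that print_vlan_tci()
performs, written with plain shifts instead of FIELD_GET(). All sketch_*
and SKETCH_* names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

enum sketch_meta_field {
	SKETCH_FIELD_TS		= 1u << 0,
	SKETCH_FIELD_RSS	= 1u << 1,
	SKETCH_FIELD_VLAN_TAG	= 1u << 2,
};

/* Reduced metadata struct: only the VLAN hint and the validity flags. */
struct sketch_meta {
	union {
		struct {
			uint16_t rx_vlan_proto;
			uint16_t rx_vlan_tci;
		};
		int32_t rx_vlan_tag_err;
	};
	uint32_t hint_valid;
};

/* Producer side: store either the value or the error, never both. */
static void store_vlan_hint(struct sketch_meta *meta, int err,
			    uint16_t proto, uint16_t tci)
{
	if (err) {
		meta->rx_vlan_tag_err = err;
		return;
	}

	meta->rx_vlan_proto = proto;
	meta->rx_vlan_tci = tci;
	meta->hint_valid |= SKETCH_FIELD_VLAN_TAG;
}

/* Consumer side: split a host-order TCI into its three fields. */
static void decode_tci(uint16_t tci, unsigned int *pcp, unsigned int *dei,
		       unsigned int *vid)
{
	*pcp = (tci >> 13) & 0x7;	/* bits 15..13 */
	*dei = (tci >> 12) & 0x1;	/* bit 12 */
	*vid = tci & 0x0fff;		/* bits 11..0 */
}
```

The consumer must check the flag before reading the union; without it, an
error code and a valid proto/TCI pair are indistinguishable.
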
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 28 +++++++++++---
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 38 ++++++++++++++++---
 tools/testing/selftests/bpf/xdp_metadata.h    | 26 ++++++++++++-
 3 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index 63d7de6c6bbb..7b7e93486768 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -20,6 +20,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
+					__be16 *vlan_proto,
+					__u16 *vlan_tci) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -84,15 +87,28 @@ int rx(struct xdp_md *ctx)
 		return XDP_PASS;
 	}
 
+	meta->hint_valid = 0;
+
+	meta->xdp_timestamp = bpf_ktime_get_tai_ns();
 	err = bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp);
-	if (!err)
-		meta->xdp_timestamp = bpf_ktime_get_tai_ns();
+	if (err)
+		meta->rx_timestamp_err = err;
+	else
+		meta->hint_valid |= XDP_META_FIELD_TS;
+
+	err = bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash,
+				       &meta->rx_hash_type);
+	if (err)
+		meta->rx_hash_err = err;
 	else
-		meta->rx_timestamp = 0; /* Used by AF_XDP as not avail signal */
+		meta->hint_valid |= XDP_META_FIELD_RSS;
 
-	err = bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
-	if (err < 0)
-		meta->rx_hash_err = err; /* Used by AF_XDP as no hash signal */
+	err = bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_proto,
+					   &meta->rx_vlan_tci);
+	if (err)
+		meta->rx_vlan_tag_err = err;
+	else
+		meta->hint_valid |= XDP_META_FIELD_VLAN_TAG;
 
 	__sync_add_and_fetch(&pkts_redir, 1);
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 17c980138796..dd9231390790 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -19,6 +19,9 @@
 #include "xsk.h"
 
 #include <error.h>
+#include <linux/kernel.h>
+#include <linux/bits.h>
+#include <linux/bitfield.h>
 #include <linux/errqueue.h>
 #include <linux/if_link.h>
 #include <linux/net_tstamp.h>
@@ -150,21 +153,34 @@ static __u64 gettime(clockid_t clock_id)
 	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
 }
 
+#define VLAN_PRIO_MASK		GENMASK(15, 13) /* Priority Code Point */
+#define VLAN_DEI_MASK		GENMASK(12, 12) /* Drop Eligible Indicator */
+#define VLAN_VID_MASK		GENMASK(11, 0)	/* VLAN Identifier */
+static void print_vlan_tci(__u16 tag)
+{
+	__u16 vlan_id = FIELD_GET(VLAN_VID_MASK, tag);
+	__u8 pcp = FIELD_GET(VLAN_PRIO_MASK, tag);
+	bool dei = FIELD_GET(VLAN_DEI_MASK, tag);
+
+	printf("PCP=%u, DEI=%d, VID=0x%X\n", pcp, dei, vlan_id);
+}
+
 static void verify_xdp_metadata(void *data, clockid_t clock_id)
 {
 	struct xdp_meta *meta;
 
 	meta = data - sizeof(*meta);
 
-	if (meta->rx_hash_err < 0)
-		printf("No rx_hash err=%d\n", meta->rx_hash_err);
-	else
+	if (meta->hint_valid & XDP_META_FIELD_RSS)
 		printf("rx_hash: 0x%X with RSS type:0x%X\n",
 		       meta->rx_hash, meta->rx_hash_type);
+	else
+		printf("No rx_hash, err=%d\n", meta->rx_hash_err);
+
+	if (meta->hint_valid & XDP_META_FIELD_TS) {
+		printf("rx_timestamp:  %llu (sec:%0.4f)\n", meta->rx_timestamp,
+		       (double)meta->rx_timestamp / NANOSEC_PER_SEC);
 
-	printf("rx_timestamp:  %llu (sec:%0.4f)\n", meta->rx_timestamp,
-	       (double)meta->rx_timestamp / NANOSEC_PER_SEC);
-	if (meta->rx_timestamp) {
 		__u64 usr_clock = gettime(clock_id);
 		__u64 xdp_clock = meta->xdp_timestamp;
 		__s64 delta_X = xdp_clock - meta->rx_timestamp;
@@ -179,8 +195,18 @@ static void verify_xdp_metadata(void *data, clockid_t clock_id)
 		       usr_clock, (double)usr_clock / NANOSEC_PER_SEC,
 		       (double)delta_X2U / NANOSEC_PER_SEC,
 		       (double)delta_X2U / 1000);
+	} else {
+		printf("No rx_timestamp, err=%d\n", meta->rx_timestamp_err);
 	}
 
+	if (meta->hint_valid & XDP_META_FIELD_VLAN_TAG) {
+		printf("rx_vlan_proto: 0x%X\n", ntohs(meta->rx_vlan_proto));
+		printf("rx_vlan_tci: ");
+		print_vlan_tci(meta->rx_vlan_tci);
+	} else {
+		printf("No rx_vlan_tci or rx_vlan_proto, err=%d\n",
+		       meta->rx_vlan_tag_err);
+	}
 }
 
 static void verify_skb_metadata(int fd)
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 6664893c2c77..87318ad1117a 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -17,12 +17,36 @@
 #define ETH_P_8021AD 0x88A8
 #endif
 
+#ifndef BIT
+#define BIT(nr)			(1 << (nr))
+#endif
+
+/* Non-existent checksum status */
+#define XDP_CHECKSUM_MAGIC	BIT(2)
+
+enum xdp_meta_field {
+	XDP_META_FIELD_TS	= BIT(0),
+	XDP_META_FIELD_RSS	= BIT(1),
+	XDP_META_FIELD_VLAN_TAG	= BIT(2),
+};
+
 struct xdp_meta {
-	__u64 rx_timestamp;
+	union {
+		__u64 rx_timestamp;
+		__s32 rx_timestamp_err;
+	};
 	__u64 xdp_timestamp;
 	__u32 rx_hash;
 	union {
 		__u32 rx_hash_type;
 		__s32 rx_hash_err;
 	};
+	union {
+		struct {
+			__be16 rx_vlan_proto;
+			__u16 rx_vlan_tci;
+		};
+		__s32 rx_vlan_tag_err;
+	};
+	enum xdp_meta_field hint_valid;
 };
-- 
2.41.0



* [PATCH bpf-next v6 17/18] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (15 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 16/18] selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  2023-10-12 17:05 ` [PATCH bpf-next v6 18/18] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

The easiest way to simulate a stripped VLAN tag in veth is to send a packet
from a VLAN interface attached to veth. Unfortunately, this approach is
incompatible with AF_XDP on the TX side, because VLAN interfaces do not
support it.

Replace AF_XDP packet generation with sending the same datagram via
AF_INET socket.

This does not change the packet contents or hint values, with one notable
exception: rx_hash_type, which was previously expected to be 0, is now
expected to be at least XDP_RSS_TYPE_L4.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
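A minimal userspace sketch, not part of this patch, of sending a datagram
through an AF_INET socket so that the kernel builds the Ethernet, IP and
UDP headers. fill_dest() and send_udp() are hypothetical helpers, and the
address and port are supplied by the caller rather than taken from the
test.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an IPv4 destination; 0 on success, -1 on a malformed address. */
static int fill_dest(struct sockaddr_in *dst, const char *ip, uint16_t port)
{
	memset(dst, 0, sizeof(*dst));
	dst->sin_family = AF_INET;
	dst->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &dst->sin_addr) == 1 ? 0 : -1;
}

/* Send one UDP datagram; 0 if the whole payload was queued, -1 otherwise. */
static int send_udp(const char *ip, uint16_t port, const void *payload,
		    size_t len)
{
	struct sockaddr_in dst;
	ssize_t sent;
	int fd;

	if (fill_dest(&dst, ip, port))
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (fd < 0)
		return -1;

	sent = sendto(fd, payload, len, MSG_DONTWAIT,
		      (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return sent == (ssize_t)len ? 0 : -1;
}
```

Because the packet goes through the regular stack, it can leave via any
interface the routing table selects, including a VLAN interface.
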
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 167 +++++++-----------
 1 file changed, 59 insertions(+), 108 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 4439ba9392f8..4fcdda02c75e 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -20,7 +20,7 @@
 
 #define UDP_PAYLOAD_BYTES 4
 
-#define AF_XDP_SOURCE_PORT 1234
+#define UDP_SOURCE_PORT 1234
 #define AF_XDP_CONSUMER_PORT 8080
 
 #define UMEM_NUM 16
@@ -33,6 +33,12 @@
 #define RX_ADDR "10.0.0.2"
 #define PREFIX_LEN "8"
 #define FAMILY AF_INET
+#define TX_NETNS_NAME "xdp_metadata_tx"
+#define RX_NETNS_NAME "xdp_metadata_rx"
+#define TX_MAC "00:00:00:00:00:01"
+#define RX_MAC "00:00:00:00:00:02"
+
+#define XDP_RSS_TYPE_L4 BIT(3)
 
 struct xsk {
 	void *umem_area;
@@ -119,90 +125,28 @@ static void close_xsk(struct xsk *xsk)
 	munmap(xsk->umem_area, UMEM_SIZE);
 }
 
-static void ip_csum(struct iphdr *iph)
+static int generate_packet_udp(void)
 {
-	__u32 sum = 0;
-	__u16 *p;
-	int i;
-
-	iph->check = 0;
-	p = (void *)iph;
-	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
-		sum += p[i];
-
-	while (sum >> 16)
-		sum = (sum & 0xffff) + (sum >> 16);
-
-	iph->check = ~sum;
-}
-
-static int generate_packet(struct xsk *xsk, __u16 dst_port)
-{
-	struct xdp_desc *tx_desc;
-	struct udphdr *udph;
-	struct ethhdr *eth;
-	struct iphdr *iph;
-	void *data;
-	__u32 idx;
-	int ret;
-
-	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
-	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
-		return -1;
-
-	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
-	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
-	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
-	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
-
-	eth = data;
-	iph = (void *)(eth + 1);
-	udph = (void *)(iph + 1);
-
-	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
-	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
-	eth->h_proto = htons(ETH_P_IP);
-
-	iph->version = 0x4;
-	iph->ihl = 0x5;
-	iph->tos = 0x9;
-	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	iph->id = 0;
-	iph->frag_off = 0;
-	iph->ttl = 0;
-	iph->protocol = IPPROTO_UDP;
-	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
-	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
-	ip_csum(iph);
-
-	udph->source = htons(AF_XDP_SOURCE_PORT);
-	udph->dest = htons(dst_port);
-	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	udph->check = 0;
-
-	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
-
-	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
-	xsk_ring_prod__submit(&xsk->tx, 1);
-
-	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
-	if (!ASSERT_GE(ret, 0, "sendto"))
-		return ret;
-
-	return 0;
-}
-
-static void complete_tx(struct xsk *xsk)
-{
-	__u32 idx;
-	__u64 addr;
-
-	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
-		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
-
-		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
-		xsk_ring_cons__release(&xsk->comp, 1);
-	}
+	char udp_payload[UDP_PAYLOAD_BYTES];
+	struct sockaddr_in rx_addr;
+	int sock_fd, err = 0;
+
+	/* Build a packet */
+	memset(udp_payload, 0xAA, UDP_PAYLOAD_BYTES);
+	rx_addr.sin_addr.s_addr = inet_addr(RX_ADDR);
+	rx_addr.sin_family = AF_INET;
+	rx_addr.sin_port = htons(UDP_SOURCE_PORT);
+
+	sock_fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+	if (!ASSERT_GE(sock_fd, 0, "socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)"))
+		return sock_fd;
+
+	err = sendto(sock_fd, udp_payload, UDP_PAYLOAD_BYTES, MSG_DONTWAIT,
+		     (void *)&rx_addr, sizeof(rx_addr));
+	ASSERT_GE(err, 0, "sendto");
+
+	close(sock_fd);
+	return err;
 }
 
 static void refill_rx(struct xsk *xsk, __u64 addr)
@@ -268,7 +212,8 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
 		return -1;
 
-	ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
+	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
+		return -1;
 
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
@@ -284,36 +229,38 @@ void test_xdp_metadata(void)
 	struct nstoken *tok = NULL;
 	__u32 queue_id = QUEUE_ID;
 	struct bpf_map *prog_arr;
-	struct xsk tx_xsk = {};
 	struct xsk rx_xsk = {};
 	__u32 val, key = 0;
 	int retries = 10;
 	int rx_ifindex;
-	int tx_ifindex;
 	int sock_fd;
 	int ret;
 
-	/* Setup new networking namespace, with a veth pair. */
+	/* Setup new networking namespaces, with a veth pair. */
 
-	SYS(out, "ip netns add xdp_metadata");
-	tok = open_netns("xdp_metadata");
+	SYS(out, "ip netns add " TX_NETNS_NAME);
+	SYS(out, "ip netns add " RX_NETNS_NAME);
+
+	tok = open_netns(TX_NETNS_NAME);
 	SYS(out, "ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
 	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
-	SYS(out, "ip link set dev " TX_NAME " address 00:00:00:00:00:01");
-	SYS(out, "ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS(out, "ip link set " RX_NAME " netns " RX_NETNS_NAME);
+
+	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
 	SYS(out, "ip link set dev " TX_NAME " up");
-	SYS(out, "ip link set dev " RX_NAME " up");
 	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
-	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
 
-	rx_ifindex = if_nametoindex(RX_NAME);
-	tx_ifindex = if_nametoindex(TX_NAME);
+	/* Avoid ARP calls */
+	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
+	close_netns(tok);
 
-	/* Setup separate AF_XDP for TX and RX interfaces. */
+	tok = open_netns(RX_NETNS_NAME);
+	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
+	SYS(out, "ip link set dev " RX_NAME " up");
+	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+	rx_ifindex = if_nametoindex(RX_NAME);
 
-	ret = open_xsk(tx_ifindex, &tx_xsk);
-	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
-		goto out;
+	/* Setup AF_XDP for RX interface. */
 
 	ret = open_xsk(rx_ifindex, &rx_xsk);
 	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
@@ -353,19 +300,20 @@ void test_xdp_metadata(void)
 	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
 	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
 		goto out;
+	close_netns(tok);
 
 	/* Send packet destined to RX AF_XDP socket. */
-	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
-		       "generate AF_XDP_CONSUMER_PORT"))
+	tok = open_netns(TX_NETNS_NAME);
+	if (!ASSERT_GE(generate_packet_udp(), 0, "generate UDP packet"))
 		goto out;
+	close_netns(tok);
 
 	/* Verify AF_XDP RX packet has proper metadata. */
+	tok = open_netns(RX_NETNS_NAME);
 	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
 		       "verify_xsk_metadata"))
 		goto out;
 
-	complete_tx(&tx_xsk);
-
 	/* Make sure freplace correctly picks up original bound device
 	 * and doesn't crash.
 	 */
@@ -382,12 +330,15 @@ void test_xdp_metadata(void)
 
 	if (!ASSERT_OK(xdp_metadata2__attach(bpf_obj2), "attach freplace"))
 		goto out;
+	close_netns(tok);
 
 	/* Send packet to trigger . */
-	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
-		       "generate freplace packet"))
+	tok = open_netns(TX_NETNS_NAME);
+	if (!ASSERT_GE(generate_packet_udp(), 0, "generate freplace packet"))
 		goto out;
+	close_netns(tok);
 
+	tok = open_netns(RX_NETNS_NAME);
 	while (!retries--) {
 		if (bpf_obj2->bss->called)
 			break;
@@ -397,10 +348,10 @@ void test_xdp_metadata(void)
 
 out:
 	close_xsk(&rx_xsk);
-	close_xsk(&tx_xsk);
 	xdp_metadata2__destroy(bpf_obj2);
 	xdp_metadata__destroy(bpf_obj);
 	if (tok)
 		close_netns(tok);
-	SYS_NOFAIL("ip netns del xdp_metadata");
+	SYS_NOFAIL("ip netns del " RX_NETNS_NAME);
+	SYS_NOFAIL("ip netns del " TX_NETNS_NAME);
 }
-- 
2.41.0



* [PATCH bpf-next v6 18/18] selftests/bpf: Check VLAN tag and proto in xdp_metadata
  2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
                   ` (16 preceding siblings ...)
  2023-10-12 17:05 ` [PATCH bpf-next v6 17/18] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
@ 2023-10-12 17:05 ` Larysa Zaremba
  17 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-12 17:05 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Willem de Bruijn,
	Alexei Starovoitov, Simon Horman, Tariq Toukan, Saeed Mahameed,
	Maciej Fijalkowski

Verify that the VLAN tag and proto are set correctly.

To simulate a "stripped" VLAN tag on veth, send the test packet from a VLAN
interface.

Also, add TO_STR() macro for convenience.

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
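A minimal sketch, not part of this patch, of why TO_STR() needs two
levels: the inner macro stringifies its argument as written, while the
outer one lets the argument expand first. TX_NAME_SKETCH and its value
are hypothetical; the helper functions exist only to expose the strings.

```c
#define __TO_STR(x) #x		/* stringifies the argument as written */
#define TO_STR(x) __TO_STR(x)	/* expands the argument, then stringifies */

#define VLAN_ID 59
#define TX_NAME_SKETCH "veTX"	/* hypothetical interface name */
#define TX_NAME_VLAN_SKETCH TX_NAME_SKETCH "." TO_STR(VLAN_ID)

/* One level only: the macro name itself becomes the string. */
static const char *direct(void)
{
	return __TO_STR(VLAN_ID);
}

/* Two levels: VLAN_ID is replaced by 59 before stringification. */
static const char *expanded(void)
{
	return TO_STR(VLAN_ID);
}

/* Adjacent string literals are concatenated by the compiler. */
static const char *vlan_ifname(void)
{
	return TX_NAME_VLAN_SKETCH;
}
```

This is what allows a numeric VLAN_ID to be reused both in comparisons and
inside the "ip link" command strings.
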
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 21 +++++++++++++++++--
 .../selftests/bpf/progs/xdp_metadata.c        |  5 +++++
 tools/testing/selftests/bpf/testing_helpers.h |  3 +++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 4fcdda02c75e..21b8e0e95019 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -38,7 +38,14 @@
 #define TX_MAC "00:00:00:00:00:01"
 #define RX_MAC "00:00:00:00:00:02"
 
+#define VLAN_ID 59
+#define VLAN_PROTO "802.1Q"
+#define VLAN_PID htons(ETH_P_8021Q)
+#define TX_NAME_VLAN TX_NAME "." TO_STR(VLAN_ID)
+#define RX_NAME_VLAN RX_NAME "." TO_STR(VLAN_ID)
+
 #define XDP_RSS_TYPE_L4 BIT(3)
+#define VLAN_VID_MASK 0xfff
 
 struct xsk {
 	void *umem_area;
@@ -215,6 +222,12 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
 		return -1;
 
+	if (!ASSERT_EQ(meta->rx_vlan_tci & VLAN_VID_MASK, VLAN_ID, "rx_vlan_tci"))
+		return -1;
+
+	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
+		return -1;
+
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
 
@@ -248,10 +261,14 @@ void test_xdp_metadata(void)
 
 	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
 	SYS(out, "ip link set dev " TX_NAME " up");
-	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+
+	SYS(out, "ip link add link " TX_NAME " " TX_NAME_VLAN
+		 " type vlan proto " VLAN_PROTO " id " TO_STR(VLAN_ID));
+	SYS(out, "ip link set dev " TX_NAME_VLAN " up");
+	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME_VLAN);
 
 	/* Avoid ARP calls */
-	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
+	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME_VLAN);
 	close_netns(tok);
 
 	tok = open_netns(RX_NETNS_NAME);
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index d151d406a123..2c6c46ef8502 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -23,6 +23,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
+					__be16 *vlan_proto,
+					__u16 *vlan_tci) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -57,6 +60,8 @@ int rx(struct xdp_md *ctx)
 		meta->rx_timestamp = 1;
 
 	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
+	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_proto,
+				     &meta->rx_vlan_tci);
 
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
diff --git a/tools/testing/selftests/bpf/testing_helpers.h b/tools/testing/selftests/bpf/testing_helpers.h
index 5b7a55136741..35284faff4f2 100644
--- a/tools/testing/selftests/bpf/testing_helpers.h
+++ b/tools/testing/selftests/bpf/testing_helpers.h
@@ -9,6 +9,9 @@
 #include <bpf/libbpf.h>
 #include <time.h>
 
+#define __TO_STR(x) #x
+#define TO_STR(x) __TO_STR(x)
+
 int parse_num_list(const char *s, bool **set, int *set_len);
 __u32 link_info_prog_id(const struct bpf_link *link, struct bpf_link_info *info);
 int bpf_prog_test_load(const char *file, enum bpf_prog_type type,
-- 
2.41.0



* Re: [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint
  2023-10-12 17:05 ` [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint Larysa Zaremba
@ 2023-10-17 12:40   ` Tariq Toukan
  0 siblings, 0 replies; 38+ messages in thread
From: Tariq Toukan @ 2023-10-17 12:40 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, Maciej Fijalkowski



On 12/10/2023 20:05, Larysa Zaremba wrote:
> Implement the newly added .xmo_rx_vlan_tag() hint function.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 15 +++++++++++++++
>   include/linux/mlx5/device.h                      |  2 +-
>   2 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 12f56d0db0af..d7cd14687ce8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -256,9 +256,24 @@ static int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
>   	return 0;
>   }
>   
> +static int mlx5e_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
> +				 u16 *vlan_tci)
> +{
> +	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
> +	const struct mlx5_cqe64 *cqe = _ctx->cqe;
> +

I see inconsistency in using/not using "const" between the different 
drivers.

Other than that, patch LGTM.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

> +	if (!cqe_has_vlan(cqe))
> +		return -ENODATA;
> +
> +	*vlan_proto = htons(ETH_P_8021Q);
> +	*vlan_tci = be16_to_cpu(cqe->vlan_info);
> +	return 0;
> +}
> +
>   const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
>   	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
>   	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
> +	.xmo_rx_vlan_tag		= mlx5e_xdp_rx_vlan_tag,
>   };
>   
>   /* returns true if packet was consumed by xdp */
> diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
> index 8fbe22de16ef..0805f8231452 100644
> --- a/include/linux/mlx5/device.h
> +++ b/include/linux/mlx5/device.h
> @@ -916,7 +916,7 @@ static inline u8 get_cqe_tls_offload(struct mlx5_cqe64 *cqe)
>   	return (cqe->tls_outer_l3_tunneled >> 3) & 0x3;
>   }
>   
> -static inline bool cqe_has_vlan(struct mlx5_cqe64 *cqe)
> +static inline bool cqe_has_vlan(const struct mlx5_cqe64 *cqe)
>   {
>   	return cqe->l4_l3_hdr_type & 0x1;
>   }


* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-12 17:05 ` [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
@ 2023-10-17 16:13   ` Maciej Fijalkowski
  2023-10-17 16:37     ` Magnus Karlsson
  2023-10-20 15:29   ` Maciej Fijalkowski
  1 sibling, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-17 16:13 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, magnus.karlsson

On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> In AF_XDP ZC, xdp_buff is not stored on ring,
> instead it is provided by xsk_buff_pool.
> Space for metadata sources right after such buffers was already reserved
> in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> This makes the implementation rather straightforward.
> 
> Update AF_XDP ZC packet processing to support XDP hints.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index ef778b8e6d1b..6ca620b2fbdd 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
>  	return ICE_XDP_CONSUMED;
>  }
>  
> +/**
> + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> + * @xdp: xdp_buff used as input to the XDP program
> + * @eop_desc: End of packet descriptor
> + * @rx_ring: Rx ring with packet context
> + *
> + * In regular XDP, xdp_buff is placed inside the ring structure,
> + * just before the packet context, so the latter can be accessed
> + * with xdp_buff address only at all times, but in ZC mode,
> + * xdp_buffs come from the pool, so we need to reinitialize
> + * context for every packet.
> + *
> + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> + * right after xdp_buff, for our private use.
> + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> + */
> +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> +				   union ice_32b_rx_flex_desc *eop_desc,
> +				   struct ice_rx_ring *rx_ring)
> +{
> +	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> +	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;

I am thinking out loud here, but this could be set in
ice_fill_rx_descs(), while grabbing xdp_buffs from xsk_pool, which should
minimize the performance overhead.

But then again, you address that with a static branch in a later patch.

OTOH, I was thinking that we could come up with an xsk_buff_pool API that
would let drivers assign this at setup time, similar to what is being done
with DMA mappings.

Magnus, do you think it is worth the hassle? Thoughts?

Or should we advise any other driver that supports hints to mimic the
static branch solution?

> +	ice_xdp_meta_set_desc(xdp, eop_desc);
> +}
> +
>  /**
>   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
>   * @rx_ring: Rx ring
>   * @xdp: xdp_buff used as input to the XDP program
>   * @xdp_prog: XDP program to run
>   * @xdp_ring: ring to be used for XDP_TX action
> + * @rx_desc: packet descriptor
>   *
>   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
>   */
>  static int
>  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> -	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> +	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> +	       union ice_32b_rx_flex_desc *rx_desc)
>  {
>  	int err, result = ICE_XDP_PASS;
>  	u32 act;
>  
> +	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
>  	if (likely(act == XDP_REDIRECT)) {
> @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>  		if (ice_is_non_eop(rx_ring, rx_desc))
>  			continue;
>  
> -		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> +		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> +					 rx_desc);
>  		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
>  			xdp_xmit |= xdp_res;
>  		} else if (xdp_res == ICE_XDP_EXIT) {
> -- 
> 2.41.0
> 
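
For readers following the thread, the idea in the quoted kernel-doc comment
(driver-private data living in the area reserved right after the generic
buffer, with a compile-time size check) can be sketched in plain userspace C.
This is only an illustration under stated assumptions: every name below
(PRIV_MAX, struct pool_buff, struct drv_priv, prepare_pkt_ctx(), read_priv())
is hypothetical and is not the ice or xsk_buff_pool code. The patch casts the
xdp_buff pointer directly to struct ice_xdp_buff; the sketch copies into the
reserved area with memcpy() instead, so that it stays within portable C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PRIV_MAX 24 /* hypothetical stand-in for XSK_PRIV_MAX */

/* Hypothetical stand-in for struct xdp_buff. */
struct generic_buff {
	void *data;
	void *data_end;
};

/* Pool-owned buffer: generic part first, then reserved scratch space. */
struct pool_buff {
	struct generic_buff xdp;
	uint8_t cb[PRIV_MAX];
};

/* Hypothetical packet context: VLAN protocol and cached time. */
struct pkt_ctx {
	uint64_t cached_time;
	uint16_t vlan_proto;
};

/* Hypothetical driver-private data kept in the reserved area. */
struct drv_priv {
	const void *eop_desc;
	struct pkt_ctx ctx;
};

/* Compile-time guard, in the spirit of XSK_CHECK_PRIV_TYPE(). */
static_assert(sizeof(struct drv_priv) <= PRIV_MAX,
	      "driver-private data must fit in the reserved area");

/* Refresh the private area for one packet before running the program. */
static void prepare_pkt_ctx(struct pool_buff *buf, const void *eop_desc,
			    const struct pkt_ctx *ring_ctx)
{
	struct drv_priv priv = { .eop_desc = eop_desc, .ctx = *ring_ctx };

	memcpy(buf->cb, &priv, sizeof(priv));
}

/* Read the private area back, as a hint handler would. */
static struct drv_priv read_priv(const struct pool_buff *buf)
{
	struct drv_priv priv;

	memcpy(&priv, buf->cb, sizeof(priv));
	return priv;
}
```

The per-packet copy in prepare_pkt_ctx() corresponds to the assignment being
discussed above; it is the part whose cost the static branch in a later patch
is meant to avoid when hints are not in use.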

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-17 16:13   ` Maciej Fijalkowski
@ 2023-10-17 16:37     ` Magnus Karlsson
  2023-10-17 16:45       ` Maciej Fijalkowski
  0 siblings, 1 reply; 38+ messages in thread
From: Magnus Karlsson @ 2023-10-17 16:37 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Larysa Zaremba, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, magnus.karlsson

On Tue, 17 Oct 2023 at 18:13, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> > In AF_XDP ZC, xdp_buff is not stored on ring,
> > instead it is provided by xsk_buff_pool.
> > Space for metadata sources right after such buffers was already reserved
> > in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> > This makes the implementation rather straightforward.
> >
> > Update AF_XDP ZC packet processing to support XDP hints.
> >
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
> >  1 file changed, 32 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index ef778b8e6d1b..6ca620b2fbdd 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
> >       return ICE_XDP_CONSUMED;
> >  }
> >
> > +/**
> > + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> > + * @xdp: xdp_buff used as input to the XDP program
> > + * @eop_desc: End of packet descriptor
> > + * @rx_ring: Rx ring with packet context
> > + *
> > + * In regular XDP, xdp_buff is placed inside the ring structure,
> > + * just before the packet context, so the latter can be accessed
> > + * with xdp_buff address only at all times, but in ZC mode,
> > + * xdp_buffs come from the pool, so we need to reinitialize
> > + * context for every packet.
> > + *
> > + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> > + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> > + * right after xdp_buff, for our private use.
> > + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> > + */
> > +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > +                                union ice_32b_rx_flex_desc *eop_desc,
> > +                                struct ice_rx_ring *rx_ring)
> > +{
> > +     XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > +     ((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
>
> I will be loud thinking over here, but this could be set in
> ice_fill_rx_descs(), while grabbing xdp_buffs from xsk_pool, should
> minimize the performance overhead.
>
> But then again you address that with static branch in later patch.
>
> OTOH, I was thinking that we could come with xsk_buff_pool API that would
> let drivers assign this at setup time. Similar what is being done with dma
> mappings.
>
> Magnus, do you think it is worth the hassle? Thoughts?

I would measure the overhead of the current assignment and if it is
significant (incurs a cache miss for example), then why not try out
your idea. Usually good not to have to touch things when not needed.

> Or should we advise any other driver that support hints to mimic static
> branch solution?
>
> > +     ice_xdp_meta_set_desc(xdp, eop_desc);
> > +}
> > +
> >  /**
> >   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
> >   * @rx_ring: Rx ring
> >   * @xdp: xdp_buff used as input to the XDP program
> >   * @xdp_prog: XDP program to run
> >   * @xdp_ring: ring to be used for XDP_TX action
> > + * @rx_desc: packet descriptor
> >   *
> >   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
> >   */
> >  static int
> >  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > -            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> > +            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> > +            union ice_32b_rx_flex_desc *rx_desc)
> >  {
> >       int err, result = ICE_XDP_PASS;
> >       u32 act;
> >
> > +     ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
> >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> >
> >       if (likely(act == XDP_REDIRECT)) {
> > @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> >               if (ice_is_non_eop(rx_ring, rx_desc))
> >                       continue;
> >
> > -             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> > +             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> > +                                      rx_desc);
> >               if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
> >                       xdp_xmit |= xdp_res;
> >               } else if (xdp_res == ICE_XDP_EXIT) {
> > --
> > 2.41.0
> >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-17 16:37     ` Magnus Karlsson
@ 2023-10-17 16:45       ` Maciej Fijalkowski
  2023-10-17 17:03         ` Larysa Zaremba
  0 siblings, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-17 16:45 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Larysa Zaremba, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, magnus.karlsson

On Tue, Oct 17, 2023 at 06:37:07PM +0200, Magnus Karlsson wrote:
> On Tue, 17 Oct 2023 at 18:13, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> > > In AF_XDP ZC, xdp_buff is not stored on ring,
> > > instead it is provided by xsk_buff_pool.
> > > Space for metadata sources right after such buffers was already reserved
> > > in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> > > This makes the implementation rather straightforward.
> > >
> > > Update AF_XDP ZC packet processing to support XDP hints.
> > >
> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
> > >  1 file changed, 32 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > index ef778b8e6d1b..6ca620b2fbdd 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
> > >       return ICE_XDP_CONSUMED;
> > >  }
> > >
> > > +/**
> > > + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> > > + * @xdp: xdp_buff used as input to the XDP program
> > > + * @eop_desc: End of packet descriptor
> > > + * @rx_ring: Rx ring with packet context
> > > + *
> > > + * In regular XDP, xdp_buff is placed inside the ring structure,
> > > + * just before the packet context, so the latter can be accessed
> > > + * with xdp_buff address only at all times, but in ZC mode,
> > > + * xdp_buffs come from the pool, so we need to reinitialize
> > > + * context for every packet.
> > > + *
> > > + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> > > + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> > > + * right after xdp_buff, for our private use.
> > > + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> > > + */
> > > +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > +                                union ice_32b_rx_flex_desc *eop_desc,
> > > +                                struct ice_rx_ring *rx_ring)
> > > +{
> > > +     XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > +     ((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> >
> > I will be loud thinking over here, but this could be set in
> > ice_fill_rx_descs(), while grabbing xdp_buffs from xsk_pool, should
> > minimize the performance overhead.
> >
> > But then again you address that with static branch in later patch.
> >
> > OTOH, I was thinking that we could come with xsk_buff_pool API that would
> > let drivers assign this at setup time. Similar what is being done with dma
> > mappings.
> >
> > Magnus, do you think it is worth the hassle? Thoughts?
> 
> I would measure the overhead of the current assignment and if it is
> significant (incurs a cache miss for example), then why not try out
> your idea. Usually good not to have to touch things when not needed.

Larysa measured that because I asked for it previously, and the impact was
around 6%. Then look at patch 11/18 to see how this was addressed.

Other ZC drivers didn't report the impact, but I am rather sure they were
also affected. So I was thinking about whether we should have some generic
solution within the pool, or whether every ZC driver should handle that on
its own.

> 
> > Or should we advise any other driver that support hints to mimic static
> > branch solution?
> >
> > > +     ice_xdp_meta_set_desc(xdp, eop_desc);
> > > +}
> > > +
> > >  /**
> > >   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
> > >   * @rx_ring: Rx ring
> > >   * @xdp: xdp_buff used as input to the XDP program
> > >   * @xdp_prog: XDP program to run
> > >   * @xdp_ring: ring to be used for XDP_TX action
> > > + * @rx_desc: packet descriptor
> > >   *
> > >   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
> > >   */
> > >  static int
> > >  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > > -            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> > > +            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> > > +            union ice_32b_rx_flex_desc *rx_desc)
> > >  {
> > >       int err, result = ICE_XDP_PASS;
> > >       u32 act;
> > >
> > > +     ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
> > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > >
> > >       if (likely(act == XDP_REDIRECT)) {
> > > @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> > >               if (ice_is_non_eop(rx_ring, rx_desc))
> > >                       continue;
> > >
> > > -             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> > > +             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> > > +                                      rx_desc);
> > >               if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
> > >                       xdp_xmit |= xdp_res;
> > >               } else if (xdp_res == ICE_XDP_EXIT) {
> > > --
> > > 2.41.0
> > >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-17 16:45       ` Maciej Fijalkowski
@ 2023-10-17 17:03         ` Larysa Zaremba
  2023-10-18 10:43           ` Maciej Fijalkowski
  0 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-17 17:03 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Magnus Karlsson, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, magnus.karlsson

On Tue, Oct 17, 2023 at 06:45:02PM +0200, Maciej Fijalkowski wrote:
> On Tue, Oct 17, 2023 at 06:37:07PM +0200, Magnus Karlsson wrote:
> > On Tue, 17 Oct 2023 at 18:13, Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> > > > In AF_XDP ZC, xdp_buff is not stored on ring,
> > > > instead it is provided by xsk_buff_pool.
> > > > Space for metadata sources right after such buffers was already reserved
> > > > in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> > > > This makes the implementation rather straightforward.
> > > >
> > > > Update AF_XDP ZC packet processing to support XDP hints.
> > > >
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
> > > >  1 file changed, 32 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > index ef778b8e6d1b..6ca620b2fbdd 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
> > > >       return ICE_XDP_CONSUMED;
> > > >  }
> > > >
> > > > +/**
> > > > + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> > > > + * @xdp: xdp_buff used as input to the XDP program
> > > > + * @eop_desc: End of packet descriptor
> > > > + * @rx_ring: Rx ring with packet context
> > > > + *
> > > > + * In regular XDP, xdp_buff is placed inside the ring structure,
> > > > + * just before the packet context, so the latter can be accessed
> > > > + * with xdp_buff address only at all times, but in ZC mode,
> > > > + * xdp_buffs come from the pool, so we need to reinitialize
> > > > + * context for every packet.
> > > > + *
> > > > + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> > > > + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> > > > + * right after xdp_buff, for our private use.
> > > > + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> > > > + */
> > > > +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > > +                                union ice_32b_rx_flex_desc *eop_desc,
> > > > +                                struct ice_rx_ring *rx_ring)
> > > > +{
> > > > +     XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > > +     ((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > >
> > > I will be loud thinking over here, but this could be set in
> > > ice_fill_rx_descs(), while grabbing xdp_buffs from xsk_pool, should
> > > minimize the performance overhead.

I am not sure about that. Packet context consists of:
* VLAN protocol
* cached time

Both of those can be updated without stopping traffic, so we cannot set this 
at setup time.

> > >
> > > But then again you address that with static branch in later patch.
> > >
> > > OTOH, I was thinking that we could come with xsk_buff_pool API that would
> > > let drivers assign this at setup time. Similar what is being done with dma
> > > mappings.
> > >
> > > Magnus, do you think it is worth the hassle? Thoughts?
> > 
> > I would measure the overhead of the current assignment and if it is
> > significant (incurs a cache miss for example), then why not try out
> > your idea. Usually good not to have to touch things when not needed.
> 
> Larysa measured that because I asked for that previously and impact was
> around 6%. Then look at patch 11/18 how this was addressed.
> 
> Other ZC drivers didn't report the impact but i am rather sure they were also
> affected. So i was thinking whether we should have some generic solution
> within pool or every ZC driver handles that on its own.
> 
> > 
> > > Or should we advise any other driver that support hints to mimic static
> > > branch solution?
> > >
> > > > +     ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > +}
> > > > +
> > > >  /**
> > > >   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
> > > >   * @rx_ring: Rx ring
> > > >   * @xdp: xdp_buff used as input to the XDP program
> > > >   * @xdp_prog: XDP program to run
> > > >   * @xdp_ring: ring to be used for XDP_TX action
> > > > + * @rx_desc: packet descriptor
> > > >   *
> > > >   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
> > > >   */
> > > >  static int
> > > >  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > > > -            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> > > > +            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> > > > +            union ice_32b_rx_flex_desc *rx_desc)
> > > >  {
> > > >       int err, result = ICE_XDP_PASS;
> > > >       u32 act;
> > > >
> > > > +     ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
> > > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >
> > > >       if (likely(act == XDP_REDIRECT)) {
> > > > @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> > > >               if (ice_is_non_eop(rx_ring, rx_desc))
> > > >                       continue;
> > > >
> > > > -             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> > > > +             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> > > > +                                      rx_desc);
> > > >               if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
> > > >                       xdp_xmit |= xdp_res;
> > > >               } else if (xdp_res == ICE_XDP_EXIT) {
> > > > --
> > > > 2.41.0
> > > >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-17 17:03         ` Larysa Zaremba
@ 2023-10-18 10:43           ` Maciej Fijalkowski
  0 siblings, 0 replies; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-18 10:43 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: Magnus Karlsson, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed, magnus.karlsson

On Tue, Oct 17, 2023 at 07:03:57PM +0200, Larysa Zaremba wrote:
> On Tue, Oct 17, 2023 at 06:45:02PM +0200, Maciej Fijalkowski wrote:
> > On Tue, Oct 17, 2023 at 06:37:07PM +0200, Magnus Karlsson wrote:
> > > On Tue, 17 Oct 2023 at 18:13, Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> > > > > In AF_XDP ZC, xdp_buff is not stored on ring,
> > > > > instead it is provided by xsk_buff_pool.
> > > > > Space for metadata sources right after such buffers was already reserved
> > > > > in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> > > > > This makes the implementation rather straightforward.
> > > > >
> > > > > Update AF_XDP ZC packet processing to support XDP hints.
> > > > >
> > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > ---
> > > > >  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
> > > > >  1 file changed, 32 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > > index ef778b8e6d1b..6ca620b2fbdd 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > > @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
> > > > >       return ICE_XDP_CONSUMED;
> > > > >  }
> > > > >
> > > > > +/**
> > > > > + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> > > > > + * @xdp: xdp_buff used as input to the XDP program
> > > > > + * @eop_desc: End of packet descriptor
> > > > > + * @rx_ring: Rx ring with packet context
> > > > > + *
> > > > > + * In regular XDP, xdp_buff is placed inside the ring structure,
> > > > > + * just before the packet context, so the latter can be accessed
> > > > > + * with xdp_buff address only at all times, but in ZC mode,
> > > > > + * xdp_buffs come from the pool, so we need to reinitialize
> > > > > + * context for every packet.
> > > > > + *
> > > > > + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> > > > > + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> > > > > + * right after xdp_buff, for our private use.
> > > > > + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> > > > > + */
> > > > > +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > > > +                                union ice_32b_rx_flex_desc *eop_desc,
> > > > > +                                struct ice_rx_ring *rx_ring)
> > > > > +{
> > > > > +     XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > > > +     ((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > > >
> > > > I will be loud thinking over here, but this could be set in
> > > > ice_fill_rx_descs(), while grabbing xdp_buffs from xsk_pool, should
> > > > minimize the performance overhead.
> 
> I am not sure about that. Packet context consists of:
> * VLAN protocol
> * cached time
> 
> Both of those can be updated without stopping traffic, so we cannot set this 
> at setup time.

I was referring to setting the pointer to pkt_ctx. Similarly, mlx5 sets the
pointer to rq during alloc, but the cqe ptr is set per packet.

Regardless, let us proceed with what you have and later on maybe address
what I was bringing up here.
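
The distinction drawn above (a pointer to the context assigned once at
allocation time, with only the descriptor pointer assigned per packet) can be
illustrated with a small userspace sketch. All names here (struct ring,
struct zc_buff, buff_alloc_init(), buff_set_desc(), buff_vlan_proto()) are
hypothetical and do not come from ice, mlx5 or xsk_buff_pool. The sketch
assumes the ring outlives the buffers that point into it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet context: VLAN protocol and cached time. */
struct pkt_ctx {
	uint64_t cached_time;
	uint16_t vlan_proto;
};

/* Hypothetical ring that owns the context; it may change at runtime. */
struct ring {
	struct pkt_ctx pkt_ctx;
};

/* Hypothetical zero-copy buffer with driver-private pointers. */
struct zc_buff {
	const struct pkt_ctx *ctx; /* set once, when taken from the pool */
	const void *desc;          /* set for every packet */
};

/* Done once per buffer, e.g. while refilling the ring. */
static void buff_alloc_init(struct zc_buff *b, const struct ring *r)
{
	b->ctx = &r->pkt_ctx;
	b->desc = NULL;
}

/* Done for every received packet. */
static void buff_set_desc(struct zc_buff *b, const void *desc)
{
	b->desc = desc;
}

/* A hint handler reads through the pointer, so it sees current values. */
static uint16_t buff_vlan_proto(const struct zc_buff *b)
{
	return b->ctx->vlan_proto;
}
```

Because the buffer stores a pointer rather than a copy, a later change to the
VLAN protocol or cached time in the ring is visible without touching the
buffer again, which is how this variant would sidestep the concern about
values changing while traffic is running.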

> 
> > > >
> > > > But then again you address that with static branch in later patch.
> > > >
> > > > OTOH, I was thinking that we could come with xsk_buff_pool API that would
> > > > let drivers assign this at setup time. Similar what is being done with dma
> > > > mappings.
> > > >
> > > > Magnus, do you think it is worth the hassle? Thoughts?
> > > 
> > > I would measure the overhead of the current assignment and if it is
> > > significant (incurs a cache miss for example), then why not try out
> > > your idea. Usually good not to have to touch things when not needed.
> > 
> > Larysa measured that because I asked for that previously and impact was
> > around 6%. Then look at patch 11/18 how this was addressed.
> > 
> > Other ZC drivers didn't report the impact but i am rather sure they were also
> > affected. So i was thinking whether we should have some generic solution
> > within pool or every ZC driver handles that on its own.
> > 
> > > 
> > > > Or should we advise any other driver that support hints to mimic static
> > > > branch solution?
> > > >
> > > > > +     ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
> > > > >   * @rx_ring: Rx ring
> > > > >   * @xdp: xdp_buff used as input to the XDP program
> > > > >   * @xdp_prog: XDP program to run
> > > > >   * @xdp_ring: ring to be used for XDP_TX action
> > > > > + * @rx_desc: packet descriptor
> > > > >   *
> > > > >   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
> > > > >   */
> > > > >  static int
> > > > >  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > > > > -            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> > > > > +            struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> > > > > +            union ice_32b_rx_flex_desc *rx_desc)
> > > > >  {
> > > > >       int err, result = ICE_XDP_PASS;
> > > > >       u32 act;
> > > > >
> > > > > +     ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
> > > > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > > >
> > > > >       if (likely(act == XDP_REDIRECT)) {
> > > > > @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> > > > >               if (ice_is_non_eop(rx_ring, rx_desc))
> > > > >                       continue;
> > > > >
> > > > > -             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> > > > > +             xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> > > > > +                                      rx_desc);
> > > > >               if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
> > > > >                       xdp_xmit |= xdp_res;
> > > > >               } else if (xdp_res == ICE_XDP_EXIT) {
> > > > > --
> > > > > 2.41.0
> > > > >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint
  2023-10-12 17:05 ` [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint Larysa Zaremba
@ 2023-10-18 23:59   ` Jakub Kicinski
  2023-10-19  8:05     ` Larysa Zaremba
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Kicinski @ 2023-10-18 23:59 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Willem de Bruijn, Alexei Starovoitov, Simon Horman, Tariq Toukan,
	Saeed Mahameed, Maciej Fijalkowski

On Thu, 12 Oct 2023 19:05:14 +0200 Larysa Zaremba wrote:
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 2943a151d4f1..661f603e3e43 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -44,13 +44,16 @@ enum netdev_xdp_act {
>   *   timestamp via bpf_xdp_metadata_rx_timestamp().
>   * @NETDEV_XDP_RX_METADATA_HASH: Device is capable of exposing receive packet
>   *   hash via bpf_xdp_metadata_rx_hash().
> + * @NETDEV_XDP_RX_METADATA_VLAN_TAG: Device is capable of exposing stripped
> + *   receive VLAN tag (proto and TCI) via bpf_xdp_metadata_rx_vlan_tag().
>   */
>  enum netdev_xdp_rx_metadata {
>  	NETDEV_XDP_RX_METADATA_TIMESTAMP = 1,
>  	NETDEV_XDP_RX_METADATA_HASH = 2,
> +	NETDEV_XDP_RX_METADATA_VLAN_TAG = 4,
>  
>  	/* private: */
> -	NETDEV_XDP_RX_METADATA_MASK = 3,
> +	NETDEV_XDP_RX_METADATA_MASK = 7,
>  };
>  
>  enum {

Top of this file says:

/* Do not edit directly, auto-generated from: */
/*	Documentation/netlink/specs/netdev.yaml */
/* YNL-GEN uapi header */

Please add your new value in Documentation/netlink/specs/netdev.yaml
and then run ./tools/net/ynl/ynl-regen.sh
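
As a side note on the quoted diff itself: the new flag takes the next free bit
(4) and the private mask grows from 3 to 7 so that it still covers every
flag. A minimal sketch of that relationship follows, using shortened,
hypothetical names (enum rx_metadata, rx_metadata_valid(), rx_metadata_has())
rather than the generated uapi header, and with the numeric values copied from
the diff above.

```c
#include <assert.h>

/* Values mirror the quoted diff; the names are shortened for the sketch. */
enum rx_metadata {
	RX_METADATA_TIMESTAMP = 1,
	RX_METADATA_HASH = 2,
	RX_METADATA_VLAN_TAG = 4,

	RX_METADATA_MASK = 7,
};

/* The mask must be the OR of every defined flag: 1 | 2 | 4 == 7. */
static_assert(RX_METADATA_MASK == (RX_METADATA_TIMESTAMP |
				   RX_METADATA_HASH |
				   RX_METADATA_VLAN_TAG),
	      "mask must cover every flag");

/* Returns 1 if no bits outside the known flags are set, 0 otherwise. */
static int rx_metadata_valid(unsigned int features)
{
	return (features & ~(unsigned int)RX_METADATA_MASK) == 0;
}

/* Returns 1 if the given flag is set in features, 0 otherwise. */
static int rx_metadata_has(unsigned int features, enum rx_metadata flag)
{
	return (features & (unsigned int)flag) != 0;
}
```

Keeping the flag list and the mask in one generated place is what prevents the
two from drifting apart, which is the practical reason for editing the spec
and regenerating instead of patching the header by hand.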

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint
  2023-10-18 23:59   ` Jakub Kicinski
@ 2023-10-19  8:05     ` Larysa Zaremba
  0 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-19  8:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Willem de Bruijn, Alexei Starovoitov, Simon Horman, Tariq Toukan,
	Saeed Mahameed, Maciej Fijalkowski

On Wed, Oct 18, 2023 at 04:59:31PM -0700, Jakub Kicinski wrote:
> On Thu, 12 Oct 2023 19:05:14 +0200 Larysa Zaremba wrote:
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index 2943a151d4f1..661f603e3e43 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -44,13 +44,16 @@ enum netdev_xdp_act {
> >   *   timestamp via bpf_xdp_metadata_rx_timestamp().
> >   * @NETDEV_XDP_RX_METADATA_HASH: Device is capable of exposing receive packet
> >   *   hash via bpf_xdp_metadata_rx_hash().
> > + * @NETDEV_XDP_RX_METADATA_VLAN_TAG: Device is capable of exposing stripped
> > + *   receive VLAN tag (proto and TCI) via bpf_xdp_metadata_rx_vlan_tag().
> >   */
> >  enum netdev_xdp_rx_metadata {
> >  	NETDEV_XDP_RX_METADATA_TIMESTAMP = 1,
> >  	NETDEV_XDP_RX_METADATA_HASH = 2,
> > +	NETDEV_XDP_RX_METADATA_VLAN_TAG = 4,
> >  
> >  	/* private: */
> > -	NETDEV_XDP_RX_METADATA_MASK = 3,
> > +	NETDEV_XDP_RX_METADATA_MASK = 7,
> >  };
> >  
> >  enum {
> 
> Top of this file says:
> 
> /* Do not edit directly, auto-generated from: */
> /*	Documentation/netlink/specs/netdev.yaml */
> /* YNL-GEN uapi header */
> 
> Please add your new value in Documentation/netlink/specs/netdev.yaml
> and then run ./tools/net/ynl/ynl-regen.sh

Thanks! Will do this.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint
  2023-10-12 17:05 ` [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint Larysa Zaremba
@ 2023-10-20 15:27   ` Maciej Fijalkowski
  2023-10-23 10:04     ` Larysa Zaremba
  0 siblings, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-20 15:27 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Thu, Oct 12, 2023 at 07:05:12PM +0200, Larysa Zaremba wrote:
> RX hash XDP hint requests both hash value and type.
> Type is XDP-specific, so we need a separate way to map
> these values to the hardware ptypes, so create a lookup table.
> 
> Instead of creating a new long list, reuse contents
> of ice_decode_rx_desc_ptype[] through preprocessor.
> 
> Current hash type enum does not contain ICMP packet type,
> but ice devices support it, so also add a new type into core code.
> 
> Then use previously refactored code and create a function
> that allows XDP code to read RX hash.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---

(...)

> +
> +/**
> + * ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
> + * @eop_desc: End of Packet descriptor
> + */
> +static enum xdp_rss_hash_type
> +ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
> +{
> +	u16 ptype = ice_get_ptype(eop_desc);
> +
> +	if (unlikely(ptype >= ICE_NUM_DEFINED_PTYPES))
> +		return 0;
> +
> +	return ice_ptype_to_xdp_hash[ptype];
> +}
> +
> +/**
> + * ice_xdp_rx_hash - RX hash XDP hint handler
> + * @ctx: XDP buff pointer
> + * @hash: hash destination address
> + * @rss_type: XDP hash type destination address
> + *
> + * Copy RX hash (if available) and its type to the destination address.
> + */
> +static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
> +			   enum xdp_rss_hash_type *rss_type)
> +{
> +	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> +
> +	*hash = ice_get_rx_hash(xdp_ext->pkt_ctx.eop_desc);
> +	*rss_type = ice_xdp_rx_hash_type(xdp_ext->pkt_ctx.eop_desc);
> +	if (!likely(*hash))
> +		return -ENODATA;

Maybe I have missed previous discussions, but why are hash/rss_type copied
regardless of whether they are available? If I am missing something, can you
elaborate on that?

Also, the !likely() construct looks tricky to me; I am not sure what the
intent behind it was. Other callbacks return -ENODATA when NETIF_F_RXHASH
is missing from dev->features.

> +
> +	return 0;
> +}
> +
>  const struct xdp_metadata_ops ice_xdp_md_ops = {
>  	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
> +	.xmo_rx_hash			= ice_xdp_rx_hash,
>  };
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 349c36fb5fd8..eb77040b4825 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -427,6 +427,7 @@ enum xdp_rss_hash_type {
>  	XDP_RSS_L4_UDP		= BIT(5),
>  	XDP_RSS_L4_SCTP		= BIT(6),
>  	XDP_RSS_L4_IPSEC	= BIT(7), /* L4 based hash include IPSEC SPI */
> +	XDP_RSS_L4_ICMP		= BIT(8),
>  
>  	/* Second part: RSS hash type combinations used for driver HW mapping */
>  	XDP_RSS_TYPE_NONE            = 0,
> @@ -442,11 +443,13 @@ enum xdp_rss_hash_type {
>  	XDP_RSS_TYPE_L4_IPV4_UDP     = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
>  	XDP_RSS_TYPE_L4_IPV4_SCTP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
>  	XDP_RSS_TYPE_L4_IPV4_IPSEC   = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
> +	XDP_RSS_TYPE_L4_IPV4_ICMP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
>  
>  	XDP_RSS_TYPE_L4_IPV6_TCP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_TCP,
>  	XDP_RSS_TYPE_L4_IPV6_UDP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
>  	XDP_RSS_TYPE_L4_IPV6_SCTP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
>  	XDP_RSS_TYPE_L4_IPV6_IPSEC   = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
> +	XDP_RSS_TYPE_L4_IPV6_ICMP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
>  
>  	XDP_RSS_TYPE_L4_IPV6_TCP_EX  = XDP_RSS_TYPE_L4_IPV6_TCP  | XDP_RSS_L3_DYNHDR,
>  	XDP_RSS_TYPE_L4_IPV6_UDP_EX  = XDP_RSS_TYPE_L4_IPV6_UDP  | XDP_RSS_L3_DYNHDR,
> -- 
> 2.41.0
> 

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-12 17:05 ` [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
  2023-10-17 16:13   ` Maciej Fijalkowski
@ 2023-10-20 15:29   ` Maciej Fijalkowski
  2023-10-23  9:37     ` Larysa Zaremba
  1 sibling, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-20 15:29 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> In AF_XDP ZC, xdp_buff is not stored on ring,
> instead it is provided by xsk_buff_pool.
> Space for metadata sources right after such buffers was already reserved
> in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> This makes the implementation rather straightforward.
> 
> Update AF_XDP ZC packet processing to support XDP hints.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index ef778b8e6d1b..6ca620b2fbdd 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
>  	return ICE_XDP_CONSUMED;
>  }
>  
> +/**
> + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> + * @xdp: xdp_buff used as input to the XDP program
> + * @eop_desc: End of packet descriptor
> + * @rx_ring: Rx ring with packet context
> + *
> + * In regular XDP, xdp_buff is placed inside the ring structure,
> + * just before the packet context, so the latter can be accessed
> + * with xdp_buff address only at all times, but in ZC mode,

s/only// ?

> + * xdp_buffs come from the pool, so we need to reinitialize
> + * context for every packet.
> + *
> + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> + * right after xdp_buff, for our private use.
> + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> + */
> +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> +				   union ice_32b_rx_flex_desc *eop_desc,
> +				   struct ice_rx_ring *rx_ring)
> +{
> +	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> +	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> +	ice_xdp_meta_set_desc(xdp, eop_desc);
> +}
> +
>  /**
>   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
>   * @rx_ring: Rx ring
>   * @xdp: xdp_buff used as input to the XDP program
>   * @xdp_prog: XDP program to run
>   * @xdp_ring: ring to be used for XDP_TX action
> + * @rx_desc: packet descriptor
>   *
>   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
>   */
>  static int
>  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> -	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> +	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> +	       union ice_32b_rx_flex_desc *rx_desc)
>  {
>  	int err, result = ICE_XDP_PASS;
>  	u32 act;
>  
> +	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
>  	if (likely(act == XDP_REDIRECT)) {
> @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>  		if (ice_is_non_eop(rx_ring, rx_desc))
>  			continue;
>  
> -		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> +		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> +					 rx_desc);
>  		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
>  			xdp_xmit |= xdp_res;
>  		} else if (xdp_res == ICE_XDP_EXIT) {
> -- 
> 2.41.0
> 

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-12 17:05 ` [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition Larysa Zaremba
@ 2023-10-20 16:32   ` Maciej Fijalkowski
  2023-10-23  9:35     ` Larysa Zaremba
  0 siblings, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-20 16:32 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> Usage of XDP hints requires putting additional information after the
> xdp_buff. In basic case, only the descriptor has to be copied on a
> per-packet basis, because xdp_buff permanently resides before per-ring
> metadata (cached time and VLAN protocol ID).
> 
> However, in ZC mode, xdp_buffs come from a pool, so memory after such
> buffer does not contain any reliable information, so everything has to be
> copied, damaging the performance.
> 
> Introduce a static key to enable meta sources assignment only when attached
> XDP program is device-bound.
> 
> This patch eliminates a 6% performance drop in ZC mode, which was a result
> of addition of XDP hints to the driver.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice.h      |  1 +
>  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
>  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
>  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
>  4 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> index 3d0f15f8b2b8..76d22be878a4 100644
> --- a/drivers/net/ethernet/intel/ice/ice.h
> +++ b/drivers/net/ethernet/intel/ice/ice.h
> @@ -210,6 +210,7 @@ enum ice_feature {
>  };
>  
>  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
>  
>  struct ice_channel {
>  	struct list_head list;
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 47e8920e1727..ee0df86d34b7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
>  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
>  EXPORT_SYMBOL(ice_xdp_locking_key);
>  
> +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> +EXPORT_SYMBOL(ice_xdp_meta_key);
> +
>  /**
>   * ice_hw_to_dev - Get device pointer from the hardware structure
>   * @hw: pointer to the device HW structure
> @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
>  	return -ENOMEM;
>  }
>  
> +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> +{
> +	return prog && prog->aux->dev_bound;
> +}
> +
>  /**
>   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
>   * @vsi: VSI to set the bpf prog on
> @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
>  	struct bpf_prog *old_prog;
>  	int i;
>  
> +	if (ice_xdp_prog_has_meta(prog))
> +		static_branch_inc(&ice_xdp_meta_key);

I thought a boolean key would be enough, but inc/dec should properly serve,
for example, prog hotswap cases.

> +
>  	old_prog = xchg(&vsi->xdp_prog, prog);
>  	ice_for_each_rxq(vsi, i)
>  		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
>  
> +	if (ice_xdp_prog_has_meta(old_prog))
> +		static_branch_dec(&ice_xdp_meta_key);
> +
>  	if (old_prog)
>  		bpf_prog_put(old_prog);
>  }
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 4fd7614f243d..19fc182d1f4c 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>  	if (!xdp_prog)
>  		goto exit;
>  
> -	ice_xdp_meta_set_desc(xdp, eop_desc);
> +	if (static_branch_unlikely(&ice_xdp_meta_key))

My only concern is that we might now be hurting the hints path in a minor
way, no?

> +		ice_xdp_meta_set_desc(xdp, eop_desc);
>  
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  	switch (act) {
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 39775bb6cec1..f92d7d33fde6 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
>  				   union ice_32b_rx_flex_desc *eop_desc,
>  				   struct ice_rx_ring *rx_ring)
>  {
> +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> +		return;

wouldn't it be better to pull it out and avoid calling
ice_prepare_pkt_ctx_zc() unnecessarily?

> +
>  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
>  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
>  	ice_xdp_meta_set_desc(xdp, eop_desc);
> -- 
> 2.41.0
> 

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-20 16:32   ` Maciej Fijalkowski
@ 2023-10-23  9:35     ` Larysa Zaremba
  2023-10-28 19:55       ` Maciej Fijalkowski
  0 siblings, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-23  9:35 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > Usage of XDP hints requires putting additional information after the
> > xdp_buff. In basic case, only the descriptor has to be copied on a
> > per-packet basis, because xdp_buff permanently resides before per-ring
> > metadata (cached time and VLAN protocol ID).
> > 
> > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > buffer does not contain any reliable information, so everything has to be
> > copied, damaging the performance.
> > 
> > Introduce a static key to enable meta sources assignment only when attached
> > XDP program is device-bound.
> > 
> > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > of addition of XDP hints to the driver.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> >  4 files changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > index 3d0f15f8b2b8..76d22be878a4 100644
> > --- a/drivers/net/ethernet/intel/ice/ice.h
> > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > @@ -210,6 +210,7 @@ enum ice_feature {
> >  };
> >  
> >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> >  
> >  struct ice_channel {
> >  	struct list_head list;
> > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > index 47e8920e1727..ee0df86d34b7 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> >  EXPORT_SYMBOL(ice_xdp_locking_key);
> >  
> > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > +
> >  /**
> >   * ice_hw_to_dev - Get device pointer from the hardware structure
> >   * @hw: pointer to the device HW structure
> > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> >  	return -ENOMEM;
> >  }
> >  
> > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > +{
> > +	return prog && prog->aux->dev_bound;
> > +}
> > +
> >  /**
> >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> >   * @vsi: VSI to set the bpf prog on
> > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> >  	struct bpf_prog *old_prog;
> >  	int i;
> >  
> > +	if (ice_xdp_prog_has_meta(prog))
> > +		static_branch_inc(&ice_xdp_meta_key);
> 
> i thought boolean key would be enough but inc/dec should serve properly
> for example prog hotswap cases.
>

My thought process on using counting instead of a boolean was: there can be
several PFs that use the same driver, so we need to keep track of how many
of them use hints. And yes, this also looks better for hot-swapping,
because the conditions become more straightforward (we do not need to compare
the old and new programs).
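
To illustrate the counting argument, here is a minimal user-space sketch in C.
It is only a model: struct prog, struct vsi, meta_key_count and the helper
names are hypothetical stand-ins for bpf_prog, ice_vsi and the static key's
enable count, and a plain integer replaces static_branch_inc() and
static_branch_dec().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of this is kernel API. */
struct prog {
	bool dev_bound;		/* models prog->aux->dev_bound */
};

struct vsi {
	struct prog *xdp_prog;	/* program currently attached to this PF */
};

/* Models the enable count of a static key shared by all PFs. */
static int meta_key_count;

static bool prog_has_meta(const struct prog *prog)
{
	return prog && prog->dev_bound;
}

static bool meta_key_enabled(void)
{
	return meta_key_count > 0;
}

/*
 * Same ordering as in the patch: take a reference for the new program
 * before dropping the one held for the old program, so the count never
 * touches zero while a hints program replaces another hints program.
 */
static void vsi_assign_prog(struct vsi *vsi, struct prog *prog)
{
	struct prog *old_prog = vsi->xdp_prog;

	if (prog_has_meta(prog))
		meta_key_count++;

	vsi->xdp_prog = prog;

	if (prog_has_meta(old_prog))
		meta_key_count--;
}
```

With two PFs both running a device-bound program, the modelled key stays
enabled until the last of them detaches. No comparison of the old and new
programs is needed on swap.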

> > +
> >  	old_prog = xchg(&vsi->xdp_prog, prog);
> >  	ice_for_each_rxq(vsi, i)
> >  		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> >  
> > +	if (ice_xdp_prog_has_meta(old_prog))
> > +		static_branch_dec(&ice_xdp_meta_key);
> > +
> >  	if (old_prog)
> >  		bpf_prog_put(old_prog);
> >  }
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > index 4fd7614f243d..19fc182d1f4c 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > @@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> >  	if (!xdp_prog)
> >  		goto exit;
> >  
> > -	ice_xdp_meta_set_desc(xdp, eop_desc);
> > +	if (static_branch_unlikely(&ice_xdp_meta_key))
> 
> My only concern is that we might be hurting in a minor way hints path now,
> no?

I thought "unlikely" refers to the default state the code is compiled with,
and that after the static key is incremented this gets patched to "likely".
Isn't this how static keys work?

> 
> > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> >  
> >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> >  	switch (act) {
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index 39775bb6cec1..f92d7d33fde6 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> >  				   union ice_32b_rx_flex_desc *eop_desc,
> >  				   struct ice_rx_ring *rx_ring)
> >  {
> > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > +		return;
> 
> wouldn't it be better to pull it out and avoid calling
> ice_prepare_pkt_ctx_zc() unnecessarily?
> 
> > +
> >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > -- 
> > 2.41.0
> > 

* Re: [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode
  2023-10-20 15:29   ` Maciej Fijalkowski
@ 2023-10-23  9:37     ` Larysa Zaremba
  0 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-23  9:37 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Fri, Oct 20, 2023 at 05:29:38PM +0200, Maciej Fijalkowski wrote:
> On Thu, Oct 12, 2023 at 07:05:13PM +0200, Larysa Zaremba wrote:
> > In AF_XDP ZC, xdp_buff is not stored on ring,
> > instead it is provided by xsk_buff_pool.
> > Space for metadata sources right after such buffers was already reserved
> > in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
> > This makes the implementation rather straightforward.
> > 
> > Update AF_XDP ZC packet processing to support XDP hints.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_xsk.c | 34 ++++++++++++++++++++++--
> >  1 file changed, 32 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index ef778b8e6d1b..6ca620b2fbdd 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -752,22 +752,51 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
> >  	return ICE_XDP_CONSUMED;
> >  }
> >  
> > +/**
> > + * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> > + * @xdp: xdp_buff used as input to the XDP program
> > + * @eop_desc: End of packet descriptor
> > + * @rx_ring: Rx ring with packet context
> > + *
> > + * In regular XDP, xdp_buff is placed inside the ring structure,
> > + * just before the packet context, so the latter can be accessed
> > + * with xdp_buff address only at all times, but in ZC mode,
> 
> s/only// ?
>

Yes :D
 
> > + * xdp_buffs come from the pool, so we need to reinitialize
> > + * context for every packet.
> > + *
> > + * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> > + * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> > + * right after xdp_buff, for our private use.
> > + * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> > + */
> > +static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > +				   union ice_32b_rx_flex_desc *eop_desc,
> > +				   struct ice_rx_ring *rx_ring)
> > +{
> > +	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > +	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > +	ice_xdp_meta_set_desc(xdp, eop_desc);
> > +}
> > +
> >  /**
> >   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
> >   * @rx_ring: Rx ring
> >   * @xdp: xdp_buff used as input to the XDP program
> >   * @xdp_prog: XDP program to run
> >   * @xdp_ring: ring to be used for XDP_TX action
> > + * @rx_desc: packet descriptor
> >   *
> >   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
> >   */
> >  static int
> >  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > -	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
> > +	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> > +	       union ice_32b_rx_flex_desc *rx_desc)
> >  {
> >  	int err, result = ICE_XDP_PASS;
> >  	u32 act;
> >  
> > +	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
> >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> >  
> >  	if (likely(act == XDP_REDIRECT)) {
> > @@ -907,7 +936,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> >  		if (ice_is_non_eop(rx_ring, rx_desc))
> >  			continue;
> >  
> > -		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
> > +		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> > +					 rx_desc);
> >  		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
> >  			xdp_xmit |= xdp_res;
> >  		} else if (xdp_res == ICE_XDP_EXIT) {
> > -- 
> > 2.41.0
> > 

* Re: [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint
  2023-10-20 15:27   ` Maciej Fijalkowski
@ 2023-10-23 10:04     ` Larysa Zaremba
  0 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-23 10:04 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Simon Horman,
	Tariq Toukan, Saeed Mahameed

On Fri, Oct 20, 2023 at 05:27:34PM +0200, Maciej Fijalkowski wrote:
> On Thu, Oct 12, 2023 at 07:05:12PM +0200, Larysa Zaremba wrote:
> > RX hash XDP hint requests both hash value and type.
> > Type is XDP-specific, so we need a separate way to map
> > these values to the hardware ptypes, so create a lookup table.
> > 
> > Instead of creating a new long list, reuse contents
> > of ice_decode_rx_desc_ptype[] through preprocessor.
> > 
> > Current hash type enum does not contain ICMP packet type,
> > but ice devices support it, so also add a new type into core code.
> > 
> > Then use previously refactored code and create a function
> > that allows XDP code to read RX hash.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> 
> (...)
> 
> > +
> > +/**
> > + * ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
> > + * @eop_desc: End of Packet descriptor
> > + */
> > +static enum xdp_rss_hash_type
> > +ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
> > +{
> > +	u16 ptype = ice_get_ptype(eop_desc);
> > +
> > +	if (unlikely(ptype >= ICE_NUM_DEFINED_PTYPES))
> > +		return 0;
> > +
> > +	return ice_ptype_to_xdp_hash[ptype];
> > +}
> > +
> > +/**
> > + * ice_xdp_rx_hash - RX hash XDP hint handler
> > + * @ctx: XDP buff pointer
> > + * @hash: hash destination address
> > + * @rss_type: XDP hash type destination address
> > + *
> > + * Copy RX hash (if available) and its type to the destination address.
> > + */
> > +static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
> > +			   enum xdp_rss_hash_type *rss_type)
> > +{
> > +	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> > +
> > +	*hash = ice_get_rx_hash(xdp_ext->pkt_ctx.eop_desc);
> > +	*rss_type = ice_xdp_rx_hash_type(xdp_ext->pkt_ctx.eop_desc);
> > +	if (!likely(*hash))
> > +		return -ENODATA;
> 
> maybe i have missed previous discussions, but why hash/rss_type are copied
> regardless of them being available? if i am missing something can you
> elaborate on that?
> 
> also, !likely() construct looks tricky to me, I am not sure what was the
> intent behind it. other callbacks return -ENODATA in case NETIF_F_RXHASH
> is missing from dev->features.
>

Well, we get RX hash in the descriptor regardless of whether NETIF_F_RXHASH is 
enabled (I have tested this), so no point in checking this in the hints 
functions.

Regarding `!likely(*hash)`: we have already discussed that valid `hash == 0` is 
very improbable, so there is no harm in treating it as a failure for the sake of 
consistency with other hints functions. Basically I treat `hash == 0` here as 
"no hash in the descriptor".

But there is an error in this code snippet that I see now: there must also be a
check that `rss_type != 0`, otherwise the packet has an unhashable type, which
must result in `-ENODATA`.
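
A minimal sketch of the check described above, written as self-contained
user-space C. It is an assumption about how the fix could look, not the code
of the next revision: struct fake_desc and sketch_rx_hash() are hypothetical
stand-ins for the RX descriptor and ice_xdp_rx_hash(), and the hash and type
are read from plain fields instead of ice_get_rx_hash() and
ice_xdp_rx_hash_type().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for what the driver extracts from the descriptor. */
struct fake_desc {
	uint32_t hash;		/* RSS hash value, 0 treated as "no hash" */
	uint32_t rss_type;	/* XDP hash type, 0 means unhashable ptype */
};

/*
 * Copy the hash and its type, then report -ENODATA if either of them is
 * zero: a zero hash is treated as "no hash in the descriptor", and a zero
 * type means the packet type cannot be hashed.
 */
static int sketch_rx_hash(const struct fake_desc *desc, uint32_t *hash,
			  uint32_t *rss_type)
{
	*hash = desc->hash;
	*rss_type = desc->rss_type;

	if (*hash == 0 || *rss_type == 0)
		return -ENODATA;

	return 0;
}
```

The sketch keeps the copy-then-check order of the patch; whether the
destinations should be written at all on failure is the separate question
raised in the review.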
 
> > +
> > +	return 0;
> > +}
> > +
> >  const struct xdp_metadata_ops ice_xdp_md_ops = {
> >  	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
> > +	.xmo_rx_hash			= ice_xdp_rx_hash,
> >  };
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 349c36fb5fd8..eb77040b4825 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -427,6 +427,7 @@ enum xdp_rss_hash_type {
> >  	XDP_RSS_L4_UDP		= BIT(5),
> >  	XDP_RSS_L4_SCTP		= BIT(6),
> >  	XDP_RSS_L4_IPSEC	= BIT(7), /* L4 based hash include IPSEC SPI */
> > +	XDP_RSS_L4_ICMP		= BIT(8),
> >  
> >  	/* Second part: RSS hash type combinations used for driver HW mapping */
> >  	XDP_RSS_TYPE_NONE            = 0,
> > @@ -442,11 +443,13 @@ enum xdp_rss_hash_type {
> >  	XDP_RSS_TYPE_L4_IPV4_UDP     = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
> >  	XDP_RSS_TYPE_L4_IPV4_SCTP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
> >  	XDP_RSS_TYPE_L4_IPV4_IPSEC   = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
> > +	XDP_RSS_TYPE_L4_IPV4_ICMP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
> >  
> >  	XDP_RSS_TYPE_L4_IPV6_TCP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_TCP,
> >  	XDP_RSS_TYPE_L4_IPV6_UDP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
> >  	XDP_RSS_TYPE_L4_IPV6_SCTP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
> >  	XDP_RSS_TYPE_L4_IPV6_IPSEC   = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
> > +	XDP_RSS_TYPE_L4_IPV6_ICMP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
> >  
> >  	XDP_RSS_TYPE_L4_IPV6_TCP_EX  = XDP_RSS_TYPE_L4_IPV6_TCP  | XDP_RSS_L3_DYNHDR,
> >  	XDP_RSS_TYPE_L4_IPV6_UDP_EX  = XDP_RSS_TYPE_L4_IPV6_UDP  | XDP_RSS_L3_DYNHDR,
> > -- 
> > 2.41.0
> > 

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-23  9:35     ` Larysa Zaremba
@ 2023-10-28 19:55       ` Maciej Fijalkowski
  2023-10-31 14:22         ` Larysa Zaremba
  2023-10-31 17:32         ` Larysa Zaremba
  0 siblings, 2 replies; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-10-28 19:55 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke

On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > Usage of XDP hints requires putting additional information after the
> > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > metadata (cached time and VLAN protocol ID).
> > > 
> > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > buffer does not contain any reliable information, so everything has to be
> > > copied, damaging the performance.
> > > 
> > > Introduce a static key to enable meta sources assignment only when attached
> > > XDP program is device-bound.
> > > 
> > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > of addition of XDP hints to the driver.
> > > 
> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > @@ -210,6 +210,7 @@ enum ice_feature {
> > >  };
> > >  
> > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > >  
> > >  struct ice_channel {
> > >  	struct list_head list;
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > index 47e8920e1727..ee0df86d34b7 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > >  
> > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > +
> > >  /**
> > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > >   * @hw: pointer to the device HW structure
> > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > >  	return -ENOMEM;
> > >  }
> > >  
> > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > +{
> > > +	return prog && prog->aux->dev_bound;
> > > +}
> > > +
> > >  /**
> > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > >   * @vsi: VSI to set the bpf prog on
> > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > >  	struct bpf_prog *old_prog;
> > >  	int i;
> > >  
> > > +	if (ice_xdp_prog_has_meta(prog))
> > > +		static_branch_inc(&ice_xdp_meta_key);
> > 
> > i thought boolean key would be enough but inc/dec should serve properly
> > for example prog hotswap cases.
> >
> 
> My thought process on using counting instead of boolean was: there can be 
> several PFs that use the same driver, so therefore we need to keep track of how 
> > many of them use hints.

Very good point. This implies that if PF0 has hints-enabled prog loaded,
PF1 with non-hints prog will "suffer" from it.
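
To make the cross-PF effect concrete, here is a minimal user-space sketch,
not driver code. It models the key as a plain counter shared by all PFs
and mirrors the inc-before-swap, dec-after-swap order of
ice_vsi_assign_bpf_prog() from the patch. All model_* names are
hypothetical, and a real static key patches code instead of testing a
counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct model_prog {
	bool dev_bound;		/* program may use metadata kfuncs */
};

struct model_pf {
	const struct model_prog *prog;
};

/* One counter for the whole "driver", like a module-global static key. */
static int model_meta_users;

static bool model_prog_has_meta(const struct model_prog *prog)
{
	return prog && prog->dev_bound;
}

/* Same order as in the patch: inc for the new prog, swap, dec for the old. */
static void model_assign_prog(struct model_pf *pf,
			      const struct model_prog *prog)
{
	const struct model_prog *old;

	if (model_prog_has_meta(prog))
		model_meta_users++;

	old = pf->prog;
	pf->prog = prog;

	if (model_prog_has_meta(old))
		model_meta_users--;

	assert(model_meta_users >= 0);
}

/* The condition every PF evaluates, whatever program it runs itself. */
static bool model_meta_enabled(void)
{
	return model_meta_users > 0;
}
```

With a hints prog on PF0 and a plain prog on PF1, model_meta_enabled() is
true for both, which is the "suffering" mentioned above.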

Sorry for such long delays in responding, but I was having a hard time
making up my mind about it. In the end I have come to some conclusions.
I know the timing for sending this response is not ideal, but I need to
get this off my chest and bring the discussion back to life :)

IMHO having static keys to eliminate ZC overhead does not scale. I assume
every other driver would have to follow that.

XSK pool allows us to avoid initializing various things per packet.
Instead, taking xdp_rxq_info as an example, each xdp_buff from the pool has
xdp_rxq_info assigned at init time. With this in mind, we should have some
mechanism to set hints-specific things in xdp_buff_xsk::cb at init time
as well. Such a mechanism should not require us to expose the driver's
private xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.

Right now you moved phctime down to ice_pkt_ctx and to me that's the main
reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
the cached_phctime at original offset in ring but ice_pkt_ctx would get a
pointer to that?

This would allow us to init the pointer in each xdp_buff from XSK pool at
init time. I have come up with a way to program that via so-called XSK
meta descriptors. Each desc would carry the data to write onto cb, the
offset within cb and the number of bytes to write/copy.
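
As a rough illustration of the descriptor idea, here is a user-space
sketch that stamps one value into the cb area of every buffer in a pool.
The model_* names and MODEL_CB_SIZE are made up for this example and are
not the kernel definitions. Copying only the low bytes of val assumes a
little-endian host when bytes is smaller than sizeof(val).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_CB_SIZE 24	/* hypothetical size of the private area */

/* One write instruction: put 'bytes' bytes of 'val' at cb + 'off'. */
struct model_meta_desc {
	uint64_t val;
	uint8_t off;
	uint8_t bytes;
};

struct model_buff {
	uint8_t cb[MODEL_CB_SIZE];
};

/* Apply one descriptor to every buffer head, once, at init time. */
static void model_pool_set_meta(struct model_buff *heads, size_t cnt,
				const struct model_meta_desc *desc)
{
	size_t i;

	/* Refuse writes that would not fit in val or would overrun cb. */
	assert(desc->bytes <= sizeof(desc->val));
	assert((size_t)desc->off + desc->bytes <= MODEL_CB_SIZE);

	for (i = 0; i < cnt; i++)
		memcpy(heads[i].cb + desc->off, &desc->val, desc->bytes);
}
```

A pointer to per-ring data written this way stays valid for the lifetime
of the pool, so the per-packet path no longer has to copy it.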

I'll share the diff below, but note that I didn't measure how much the
performance is degraded. My Ice Lake machine, which I used for measuring
performance-sensitive code, broke. For now we can't escape initing
eop_desc for each xdp_buff, but I moved it to the alloc side, as we mangle
descs there anyway.

I think mlx5 could benefit from that approach as well with initing the rq
ptr at init time.

Diff does mostly these things:
- move cached_phctime to old place in ice_rx_ring and add ptr to that in
  ice_pkt_ctx
- introduce xsk_pool_set_meta()
- use it from ice side.

I consider this a discussion trigger rather than ready code. Any
feedback will be appreciated.

---------------------------------8<---------------------------------

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 7fa43827a3f0..c192e84bee55 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -519,6 +519,23 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
 	return 0;
 }
 
+static void ice_xsk_pool_set_meta(struct ice_rx_ring *ring)
+{
+	struct xsk_meta_desc desc = {};
+
+	desc.val = (uintptr_t)&ring->cached_phctime;
+	desc.off = offsetof(struct ice_pkt_ctx, cached_phctime);
+	desc.bytes = sizeof_field(struct ice_pkt_ctx, cached_phctime);
+	xsk_pool_set_meta(ring->xsk_pool, &desc);
+
+	memset(&desc, 0, sizeof(struct xsk_meta_desc));
+
+	desc.val = ring->pkt_ctx.vlan_proto;
+	desc.off = offsetof(struct ice_pkt_ctx, vlan_proto);
+	desc.bytes = sizeof_field(struct ice_pkt_ctx, vlan_proto);
+	xsk_pool_set_meta(ring->xsk_pool, &desc);
+}
+
 /**
  * ice_vsi_cfg_rxq - Configure an Rx queue
  * @ring: the ring being configured
@@ -553,6 +570,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 			if (err)
 				return err;
 			xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq);
+			ice_xsk_pool_set_meta(ring);
 
 			dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
 				 ring->q_index);
@@ -575,6 +593,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 
 	xdp_init_buff(&ring->xdp, ice_rx_pg_size(ring) / 2, &ring->xdp_rxq);
 	ring->xdp.data = NULL;
+	ring->pkt_ctx.cached_phctime = &ring->cached_phctime;
 	err = ice_setup_rx_ctx(ring);
 	if (err) {
 		dev_err(dev, "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index cf5c91ada94c..d3cb08e66dcb 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2846,7 +2846,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* clone ring and setup updated count */
 		rx_rings[i] = *vsi->rx_rings[i];
 		rx_rings[i].count = new_rx_cnt;
-		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
+		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
 		rx_rings[i].desc = NULL;
 		rx_rings[i].rx_buf = NULL;
 		/* this is to allow wr32 to have something to write to
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 2fc97eafd1f6..1f45f0c3963d 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1456,7 +1456,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
 		ring->netdev = vsi->netdev;
 		ring->dev = dev;
 		ring->count = vsi->num_rx_desc;
-		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
+		ring->cached_phctime = pf->ptp.cached_phc_time;
 		WRITE_ONCE(vsi->rx_rings[i], ring);
 	}
 
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index f6444890f0ef..e2fa979830cd 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -955,8 +955,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
 		ice_for_each_rxq(vsi, j) {
 			if (!vsi->rx_rings[j])
 				continue;
-			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime,
-				   systime);
+			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
 		}
 	}
 	clear_bit(ICE_CFG_BUSY, pf->state);
@@ -2119,7 +2118,7 @@ u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
 	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
 		return 0;
 
-	cached_time = READ_ONCE(pkt_ctx->cached_phctime);
+	cached_time = READ_ONCE(*pkt_ctx->cached_phctime);
 
 	/* Do not report a timestamp if we don't have a cached PHC time */
 	if (!cached_time)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 41e0b14e6643..94594cc0d3ee 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -259,7 +259,7 @@ enum ice_rx_dtype {
 
 struct ice_pkt_ctx {
 	const union ice_32b_rx_flex_desc *eop_desc;
-	u64 cached_phctime;
+	u64 *cached_phctime;
 	__be16 vlan_proto;
 };
 
@@ -356,6 +356,7 @@ struct ice_rx_ring {
 	struct ice_tx_ring *xdp_ring;
 	struct xsk_buff_pool *xsk_pool;
 	dma_addr_t dma;			/* physical address of ring */
+	u64 cached_phctime;
 	u16 rx_buf_len;
 	u8 dcb_tc;			/* Traffic class of ring */
 	u8 ptp_rx;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 49a64bfdd1f6..6fa7a86152d0 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -431,9 +431,18 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 	return ret;
 }
 
+static struct ice_xdp_buff *xsk_buff_to_ice_ctx(struct xdp_buff *xdp)
+{
+	/* xdp_buff pointer used by ZC code path is allocated as xdp_buff_xsk.
+	 * The ice_xdp_buff shares its layout with xdp_buff_xsk and private
+	 * ice_xdp_buff fields fall into xdp_buff_xsk->cb
+	 */
+	return (struct ice_xdp_buff *)xdp;
+}
+
 /**
  * ice_fill_rx_descs - pick buffers from XSK buffer pool and use it
- * @pool: XSK Buffer pool to pull the buffers from
+ * @rx_ring: rx ring
  * @xdp: SW ring of xdp_buff that will hold the buffers
  * @rx_desc: Pointer to Rx descriptors that will be filled
  * @count: The number of buffers to allocate
@@ -445,18 +454,21 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
  *
  * Returns the amount of allocated Rx descriptors
  */
-static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
+static u16 ice_fill_rx_descs(struct ice_rx_ring *rx_ring, struct xdp_buff **xdp,
 			     union ice_32b_rx_flex_desc *rx_desc, u16 count)
 {
+	struct ice_xdp_buff *ctx;
 	dma_addr_t dma;
 	u16 buffs;
 	int i;
 
-	buffs = xsk_buff_alloc_batch(pool, xdp, count);
+	buffs = xsk_buff_alloc_batch(rx_ring->xsk_pool, xdp, count);
 	for (i = 0; i < buffs; i++) {
 		dma = xsk_buff_xdp_get_dma(*xdp);
 		rx_desc->read.pkt_addr = cpu_to_le64(dma);
 		rx_desc->wb.status_error0 = 0;
+		ctx = xsk_buff_to_ice_ctx(*xdp);
+		ctx->pkt_ctx.eop_desc = rx_desc;
 
 		rx_desc++;
 		xdp++;
@@ -488,8 +500,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
 	xdp = ice_xdp_buf(rx_ring, ntu);
 
 	if (ntu + count >= rx_ring->count) {
-		nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
-						   rx_desc,
+		nb_buffs_extra = ice_fill_rx_descs(rx_ring, xdp, rx_desc,
 						   rx_ring->count - ntu);
 		if (nb_buffs_extra != rx_ring->count - ntu) {
 			ntu += nb_buffs_extra;
@@ -502,7 +513,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
 		ice_release_rx_desc(rx_ring, 0);
 	}
 
-	nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
+	nb_buffs = ice_fill_rx_descs(rx_ring, xdp, rx_desc, count);
 
 	ntu += nb_buffs;
 	if (ntu == rx_ring->count)
@@ -746,32 +757,6 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
 	return ICE_XDP_CONSUMED;
 }
 
-/**
- * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
- * @xdp: xdp_buff used as input to the XDP program
- * @eop_desc: End of packet descriptor
- * @rx_ring: Rx ring with packet context
- *
- * In regular XDP, xdp_buff is placed inside the ring structure,
- * just before the packet context, so the latter can be accessed
- * with xdp_buff address only at all times, but in ZC mode,
- * xdp_buffs come from the pool, so we need to reinitialize
- * context for every packet.
- *
- * We can safely convert xdp_buff_xsk to ice_xdp_buff,
- * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
- * right after xdp_buff, for our private use.
- * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
- */
-static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
-				   union ice_32b_rx_flex_desc *eop_desc,
-				   struct ice_rx_ring *rx_ring)
-{
-	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
-	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
-	ice_xdp_meta_set_desc(xdp, eop_desc);
-}
-
 /**
  * ice_run_xdp_zc - Executes an XDP program in zero-copy path
  * @rx_ring: Rx ring
@@ -784,13 +769,11 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
  */
 static int
 ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
-	       union ice_32b_rx_flex_desc *rx_desc)
+	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
 {
 	int err, result = ICE_XDP_PASS;
 	u32 act;
 
-	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	if (likely(act == XDP_REDIRECT)) {
@@ -930,8 +913,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		if (ice_is_non_eop(rx_ring, rx_desc))
 			continue;
 
-		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
-					 rx_desc);
+		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
 		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
 			xdp_xmit |= xdp_res;
 		} else if (xdp_res == ICE_XDP_EXIT) {
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..91fa74a14841 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -14,6 +14,13 @@
 
 #ifdef CONFIG_XDP_SOCKETS
 
+struct xsk_meta_desc {
+	u64 val;
+	u8 off;
+	u8 bytes;
+};
+
+
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
@@ -47,6 +54,12 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
 	xp_set_rxq_info(pool, rxq);
 }
 
+static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
+				     struct xsk_meta_desc *desc)
+{
+	xp_set_meta(pool, desc);
+}
+
 static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
 {
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -250,6 +263,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
 {
 }
 
+static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
+				     struct xsk_meta_desc *desc)
+{
+}
+
 static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
 {
 	return 0;
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..354b1c702a82 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -12,6 +12,7 @@
 
 struct xsk_buff_pool;
 struct xdp_rxq_info;
+struct xsk_meta_desc;
 struct xsk_queue;
 struct xdp_desc;
 struct xdp_umem;
@@ -132,6 +133,7 @@ static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
 
 /* AF_XDP ZC drivers, via xdp_sock_buff.h */
 void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
+void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc);
 int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 	       unsigned long attrs, struct page **pages, u32 nr_pages);
 void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 49cb9f9a09be..632fdc247862 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -123,6 +123,18 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq)
 }
 EXPORT_SYMBOL(xp_set_rxq_info);
 
+void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc)
+{
+	u32 i;
+
+	for (i = 0; i < pool->heads_cnt; i++) {
+		struct xdp_buff_xsk *xskb = &pool->heads[i];
+
+		memcpy(xskb->cb + desc->off, &desc->val, desc->bytes);
+	}
+}
+EXPORT_SYMBOL(xp_set_meta);
+
 static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 {
 	struct netdev_bpf bpf;

--------------------------------->8---------------------------------

> And yes, this also looks better for hot-swapping, 
> because conditions become more straightforward (we do not need to compare old 
> and new programs).
> 
> > > +
> > >  	old_prog = xchg(&vsi->xdp_prog, prog);
> > >  	ice_for_each_rxq(vsi, i)
> > >  		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> > >  
> > > +	if (ice_xdp_prog_has_meta(old_prog))
> > > +		static_branch_dec(&ice_xdp_meta_key);
> > > +
> > >  	if (old_prog)
> > >  		bpf_prog_put(old_prog);
> > >  }
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > index 4fd7614f243d..19fc182d1f4c 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > @@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > >  	if (!xdp_prog)
> > >  		goto exit;
> > >  
> > > -	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > +	if (static_branch_unlikely(&ice_xdp_meta_key))
> > 
> > My only concern is that we might be hurting in a minor way hints path now,
> > no?
> 
> I have thought "unlikely" refers to the default state the code is compiled with 
> and after static key incrementation this should be patched to "likely". Isn't 
> this how static keys work?

I was only referring to the fact that it ends with a compiler hint:
#define unlikely_notrace(x)	__builtin_expect(!!(x), 0)

see include/linux/jump_label.h
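
For reference, a tiny sketch of what such a hint does on its own,
assuming GCC or Clang. It models only the __builtin_expect() part quoted
above and not the jump-label patching; the model_* names are made up.
The hint can influence how the compiler lays out the two paths, but it
never changes which path is taken.

```c
#include <stdbool.h>

/* Same shape as the quoted macro, under a hypothetical name. */
#define model_unlikely(x)	__builtin_expect(!!(x), 0)

static int model_meta_writes;

/* Stand-in for the per-packet metadata setup. */
static void model_set_desc(void)
{
	model_meta_writes++;
}

/* The hinted-cold branch still runs whenever the condition is true. */
static void model_run(bool key_enabled)
{
	if (model_unlikely(key_enabled))
		model_set_desc();
}
```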

> 
> > 
> > > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> > >  
> > >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> > >  	switch (act) {
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > index 39775bb6cec1..f92d7d33fde6 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > >  				   union ice_32b_rx_flex_desc *eop_desc,
> > >  				   struct ice_rx_ring *rx_ring)
> > >  {
> > > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > > +		return;
> > 
> > wouldn't it be better to pull it out and avoid calling
> > ice_prepare_pkt_ctx_zc() unnecessarily?
> > 
> > > +
> > >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > -- 
> > > 2.41.0
> > > 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-28 19:55       ` Maciej Fijalkowski
@ 2023-10-31 14:22         ` Larysa Zaremba
  2023-11-02 13:23           ` Maciej Fijalkowski
  2023-10-31 17:32         ` Larysa Zaremba
  1 sibling, 1 reply; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-31 14:22 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke

On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > Usage of XDP hints requires putting additional information after the
> > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > metadata (cached time and VLAN protocol ID).
> > > > 
> > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > buffer does not contain any reliable information, so everything has to be
> > > > copied, damaging the performance.
> > > > 
> > > > Introduce a static key to enable meta sources assignment only when attached
> > > > XDP program is device-bound.
> > > > 
> > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > of addition of XDP hints to the driver.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > >  };
> > > >  
> > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > >  
> > > >  struct ice_channel {
> > > >  	struct list_head list;
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > >  
> > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > +
> > > >  /**
> > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > >   * @hw: pointer to the device HW structure
> > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > >  	return -ENOMEM;
> > > >  }
> > > >  
> > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > +{
> > > > +	return prog && prog->aux->dev_bound;
> > > > +}
> > > > +
> > > >  /**
> > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > >   * @vsi: VSI to set the bpf prog on
> > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > >  	struct bpf_prog *old_prog;
> > > >  	int i;
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > 
> > > i thought boolean key would be enough but inc/dec should serve properly
> > > for example prog hotswap cases.
> > >
> > 
> > My thought process on using counting instead of boolean was: there can be 
> > several PFs that use the same driver, so therefore we need to keep track of how 
> > many of them use hints.
> 
> Very good point. This implies that if PF0 has hints-enabled prog loaded,
> PF1 with non-hints prog will "suffer" from it.
> 
> Sorry for such long delays in responding, but I was having a hard time
> making up my mind about it. In the end I have come to some conclusions.
> I know the timing for sending this response is not ideal, but I need to
> get this off my chest and bring the discussion back to life :)
> 
> IMHO having static keys to eliminate ZC overhead does not scale. I assume
> every other driver would have to follow that.
> 
> XSK pool allows us to avoid initializing various things per each packet.
> Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> xdp_rxq_info assigned at init time. With this in mind, we should have some
> mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> as well. Such mechanism should not require us to expose driver's private
> xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> 
> Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> pointer to that?
> 
> This would allow us to init the pointer in each xdp_buff from XSK pool at
> init time. I have come up with a way to program that via so called XSK
> meta descriptors. Each desc would have data to write onto cb, offset
> within cb and amount of bytes to write/copy.
> 
> I'll share the diff below, but note that I didn't measure how much the
> performance is degraded. My Ice Lake machine, which I used for measuring
> performance-sensitive code, broke. For now we can't escape initing
> eop_desc for each xdp_buff, but I moved it to the alloc side, as we mangle
> descs there anyway.
> 
> I think mlx5 could benefit from that approach as well with initing the rq
> ptr at init time.
> 
> Diff does mostly these things:
> - move cached_phctime to old place in ice_rx_ring and add ptr to that in
>   ice_pkt_ctx
> - introduce xsk_pool_set_meta()
> - use it from ice side.
> 

Thank you for the code! I will probably send v7 with such changes. Are you
OK with the patch containing the core changes going in with you as the
author?

But I also see a minor problem: switching the VLAN protocol does not
trigger buffer allocation, so we have to point to that too. This probably
means moving the cached time back and finding 16 extra bits in CL3. A
single pointer to {cached time, vlan_proto} would then be copied to sit
right after xdp_buff.
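
A rough user-space sketch of what I mean, not driver code: the ring owns
one {cached time, vlan_proto} block and every buffer context only stores
a pointer to it, set once. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-ring data that may change at runtime without any reallocation. */
struct model_ring_meta {
	uint64_t cached_phctime;
	uint16_t vlan_proto;
};

/* What would sit right after xdp_buff in this model. */
struct model_pkt_ctx {
	const void *eop_desc;
	const struct model_ring_meta *ring_meta;
};

/* Done once, at pool init time, for every buffer context. */
static void model_init_ctx(struct model_pkt_ctx *ctx, size_t cnt,
			   const struct model_ring_meta *meta)
{
	size_t i;

	assert(meta);
	for (i = 0; i < cnt; i++) {
		ctx[i].eop_desc = NULL;
		ctx[i].ring_meta = meta;
	}
}

/* Readers always see the current per-ring values through the pointer. */
static uint16_t model_ctx_vlan_proto(const struct model_pkt_ctx *ctx)
{
	return ctx->ring_meta->vlan_proto;
}

static uint64_t model_ctx_phctime(const struct model_pkt_ctx *ctx)
{
	return ctx->ring_meta->cached_phctime;
}
```

Changing vlan_proto in the ring block is then visible from every buffer
without touching the buffers again.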

> I consider this a discussion trigger rather than ready code. Any
> feedback will be appreciated.
> 
> ---------------------------------8<---------------------------------
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 7fa43827a3f0..c192e84bee55 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -519,6 +519,23 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	return 0;
>  }
>  
> +static void ice_xsk_pool_set_meta(struct ice_rx_ring *ring)
> +{
> +	struct xsk_meta_desc desc = {};
> +
> +	desc.val = (uintptr_t)&ring->cached_phctime;
> +	desc.off = offsetof(struct ice_pkt_ctx, cached_phctime);
> +	desc.bytes = sizeof_field(struct ice_pkt_ctx, cached_phctime);
> +	xsk_pool_set_meta(ring->xsk_pool, &desc);
> +
> +	memset(&desc, 0, sizeof(struct xsk_meta_desc));
> +
> +	desc.val = ring->pkt_ctx.vlan_proto;
> +	desc.off = offsetof(struct ice_pkt_ctx, vlan_proto);
> +	desc.bytes = sizeof_field(struct ice_pkt_ctx, vlan_proto);
> +	xsk_pool_set_meta(ring->xsk_pool, &desc);
> +}
> +
>  /**
>   * ice_vsi_cfg_rxq - Configure an Rx queue
>   * @ring: the ring being configured
> @@ -553,6 +570,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
>  			if (err)
>  				return err;
>  			xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq);
> +			ice_xsk_pool_set_meta(ring);
>  
>  			dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
>  				 ring->q_index);
> @@ -575,6 +593,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
>  
>  	xdp_init_buff(&ring->xdp, ice_rx_pg_size(ring) / 2, &ring->xdp_rxq);
>  	ring->xdp.data = NULL;
> +	ring->pkt_ctx.cached_phctime = &ring->cached_phctime;
>  	err = ice_setup_rx_ctx(ring);
>  	if (err) {
>  		dev_err(dev, "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index cf5c91ada94c..d3cb08e66dcb 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -2846,7 +2846,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
>  		/* clone ring and setup updated count */
>  		rx_rings[i] = *vsi->rx_rings[i];
>  		rx_rings[i].count = new_rx_cnt;
> -		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> +		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
>  		rx_rings[i].desc = NULL;
>  		rx_rings[i].rx_buf = NULL;
>  		/* this is to allow wr32 to have something to write to
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index 2fc97eafd1f6..1f45f0c3963d 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -1456,7 +1456,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
>  		ring->netdev = vsi->netdev;
>  		ring->dev = dev;
>  		ring->count = vsi->num_rx_desc;
> -		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> +		ring->cached_phctime = pf->ptp.cached_phc_time;
>  		WRITE_ONCE(vsi->rx_rings[i], ring);
>  	}
>  
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index f6444890f0ef..e2fa979830cd 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -955,8 +955,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
>  		ice_for_each_rxq(vsi, j) {
>  			if (!vsi->rx_rings[j])
>  				continue;
> -			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime,
> -				   systime);
> +			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
>  		}
>  	}
>  	clear_bit(ICE_CFG_BUSY, pf->state);
> @@ -2119,7 +2118,7 @@ u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
>  	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
>  		return 0;
>  
> -	cached_time = READ_ONCE(pkt_ctx->cached_phctime);
> +	cached_time = READ_ONCE(*pkt_ctx->cached_phctime);
>  
>  	/* Do not report a timestamp if we don't have a cached PHC time */
>  	if (!cached_time)
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index 41e0b14e6643..94594cc0d3ee 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -259,7 +259,7 @@ enum ice_rx_dtype {
>  
>  struct ice_pkt_ctx {
>  	const union ice_32b_rx_flex_desc *eop_desc;
> -	u64 cached_phctime;
> +	u64 *cached_phctime;
>  	__be16 vlan_proto;
>  };
>  
> @@ -356,6 +356,7 @@ struct ice_rx_ring {
>  	struct ice_tx_ring *xdp_ring;
>  	struct xsk_buff_pool *xsk_pool;
>  	dma_addr_t dma;			/* physical address of ring */
> +	u64 cached_phctime;
>  	u16 rx_buf_len;
>  	u8 dcb_tc;			/* Traffic class of ring */
>  	u8 ptp_rx;
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 49a64bfdd1f6..6fa7a86152d0 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -431,9 +431,18 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
>  	return ret;
>  }
>  
> +static struct ice_xdp_buff *xsk_buff_to_ice_ctx(struct xdp_buff *xdp)
> +{
> +	/* xdp_buff pointer used by ZC code path is allocated as xdp_buff_xsk.
> +	 * The ice_xdp_buff shares its layout with xdp_buff_xsk and private
> +	 * ice_xdp_buff fields fall into xdp_buff_xsk->cb
> +	 */
> +	return (struct ice_xdp_buff *)xdp;
> +}
> +
>  /**
>   * ice_fill_rx_descs - pick buffers from XSK buffer pool and use it
> - * @pool: XSK Buffer pool to pull the buffers from
> + * @rx_ring: rx ring
>   * @xdp: SW ring of xdp_buff that will hold the buffers
>   * @rx_desc: Pointer to Rx descriptors that will be filled
>   * @count: The number of buffers to allocate
> @@ -445,18 +454,21 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
>   *
>   * Returns the amount of allocated Rx descriptors
>   */
> -static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
> +static u16 ice_fill_rx_descs(struct ice_rx_ring *rx_ring, struct xdp_buff **xdp,
>  			     union ice_32b_rx_flex_desc *rx_desc, u16 count)
>  {
> +	struct ice_xdp_buff *ctx;
>  	dma_addr_t dma;
>  	u16 buffs;
>  	int i;
>  
> -	buffs = xsk_buff_alloc_batch(pool, xdp, count);
> +	buffs = xsk_buff_alloc_batch(rx_ring->xsk_pool, xdp, count);
>  	for (i = 0; i < buffs; i++) {
>  		dma = xsk_buff_xdp_get_dma(*xdp);
>  		rx_desc->read.pkt_addr = cpu_to_le64(dma);
>  		rx_desc->wb.status_error0 = 0;
> +		ctx = xsk_buff_to_ice_ctx(*xdp);
> +		ctx->pkt_ctx.eop_desc = rx_desc;
>  
>  		rx_desc++;
>  		xdp++;
> @@ -488,8 +500,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
>  	xdp = ice_xdp_buf(rx_ring, ntu);
>  
>  	if (ntu + count >= rx_ring->count) {
> -		nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
> -						   rx_desc,
> +		nb_buffs_extra = ice_fill_rx_descs(rx_ring, xdp, rx_desc,
>  						   rx_ring->count - ntu);
>  		if (nb_buffs_extra != rx_ring->count - ntu) {
>  			ntu += nb_buffs_extra;
> @@ -502,7 +513,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
>  		ice_release_rx_desc(rx_ring, 0);
>  	}
>  
> -	nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
> +	nb_buffs = ice_fill_rx_descs(rx_ring, xdp, rx_desc, count);
>  
>  	ntu += nb_buffs;
>  	if (ntu == rx_ring->count)
> @@ -746,32 +757,6 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
>  	return ICE_XDP_CONSUMED;
>  }
>  
> -/**
> - * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> - * @xdp: xdp_buff used as input to the XDP program
> - * @eop_desc: End of packet descriptor
> - * @rx_ring: Rx ring with packet context
> - *
> - * In regular XDP, xdp_buff is placed inside the ring structure,
> - * just before the packet context, so the latter can be accessed
> - * with xdp_buff address only at all times, but in ZC mode,
> - * xdp_buffs come from the pool, so we need to reinitialize
> - * context for every packet.
> - *
> - * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> - * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> - * right after xdp_buff, for our private use.
> - * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> - */
> -static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> -				   union ice_32b_rx_flex_desc *eop_desc,
> -				   struct ice_rx_ring *rx_ring)
> -{
> -	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> -	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> -	ice_xdp_meta_set_desc(xdp, eop_desc);
> -}
> -
>  /**
>   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
>   * @rx_ring: Rx ring
> @@ -784,13 +769,11 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
>   */
>  static int
>  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> -	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> -	       union ice_32b_rx_flex_desc *rx_desc)
> +	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
>  {
>  	int err, result = ICE_XDP_PASS;
>  	u32 act;
>  
> -	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
>  	if (likely(act == XDP_REDIRECT)) {
> @@ -930,8 +913,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>  		if (ice_is_non_eop(rx_ring, rx_desc))
>  			continue;
>  
> -		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> -					 rx_desc);
> +		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
>  		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
>  			xdp_xmit |= xdp_res;
>  		} else if (xdp_res == ICE_XDP_EXIT) {
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 1f6fc8c7a84c..91fa74a14841 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -14,6 +14,13 @@
>  
>  #ifdef CONFIG_XDP_SOCKETS
>  
> +struct xsk_meta_desc {
> +	u64 val;
> +	u8 off;
> +	u8 bytes;
> +};
> +
> +
>  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
>  bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
>  u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
> @@ -47,6 +54,12 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
>  	xp_set_rxq_info(pool, rxq);
>  }
>  
> +static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
> +				     struct xsk_meta_desc *desc)
> +{
> +	xp_set_meta(pool, desc);
> +}
> +
>  static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
>  {
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> @@ -250,6 +263,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
>  {
>  }
>  
> +static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
> +				     struct xsk_meta_desc *desc)
> +{
> +}
> +
>  static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
>  {
>  	return 0;
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index b0bdff26fc88..354b1c702a82 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -12,6 +12,7 @@
>  
>  struct xsk_buff_pool;
>  struct xdp_rxq_info;
> +struct xsk_meta_desc;
>  struct xsk_queue;
>  struct xdp_desc;
>  struct xdp_umem;
> @@ -132,6 +133,7 @@ static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
>  
>  /* AF_XDP ZC drivers, via xdp_sock_buff.h */
>  void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
> +void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc);
>  int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
>  	       unsigned long attrs, struct page **pages, u32 nr_pages);
>  void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 49cb9f9a09be..632fdc247862 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -123,6 +123,18 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq)
>  }
>  EXPORT_SYMBOL(xp_set_rxq_info);
>  
> +void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc)
> +{
> +	u32 i;
> +
> +	for (i = 0; i < pool->heads_cnt; i++) {
> +		struct xdp_buff_xsk *xskb = &pool->heads[i];
> +
> +		memcpy(xskb->cb + desc->off, desc->buf, desc->bytes);
> +	}
> +}
> +EXPORT_SYMBOL(xp_set_meta);
> +
>  static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  {
>  	struct netdev_bpf bpf;
> 
> --------------------------------->8---------------------------------
> 
> > And yes, this also looks better for hot-swapping, 
> > because conditions become more straightforward (we do not need to compare old 
> > and new programs).
> > 
> > > > +
> > > >  	old_prog = xchg(&vsi->xdp_prog, prog);
> > > >  	ice_for_each_rxq(vsi, i)
> > > >  		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(old_prog))
> > > > +		static_branch_dec(&ice_xdp_meta_key);
> > > > +
> > > >  	if (old_prog)
> > > >  		bpf_prog_put(old_prog);
> > > >  }
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > index 4fd7614f243d..19fc182d1f4c 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > @@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > > >  	if (!xdp_prog)
> > > >  		goto exit;
> > > >  
> > > > -	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > +	if (static_branch_unlikely(&ice_xdp_meta_key))
> > > 
> > > My only concern is that we might be hurting in a minor way hints path now,
> > > no?
> > 
> > I have thought "unlikely" refers to the default state the code is compiled with 
> > and after static key incrementation this should be patched to "likely". Isn't 
> > this how static keys work?
> 
> I was only referring to that it ends with compiler hint:
> #define unlikely_notrace(x)	__builtin_expect(!!(x), 0)
> 
> see include/linux/jump_label.h
> 
> > 
> > > 
> > > > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> > > >  
> > > >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >  	switch (act) {
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > index 39775bb6cec1..f92d7d33fde6 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > >  				   union ice_32b_rx_flex_desc *eop_desc,
> > > >  				   struct ice_rx_ring *rx_ring)
> > > >  {
> > > > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > > > +		return;
> > > 
> > > wouldn't it be better to pull it out and avoid calling
> > > ice_prepare_pkt_ctx_zc() unnecessarily?
> > > 
> > > > +
> > > >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > > >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > -- 
> > > > 2.41.0
> > > > 


* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-28 19:55       ` Maciej Fijalkowski
  2023-10-31 14:22         ` Larysa Zaremba
@ 2023-10-31 17:32         ` Larysa Zaremba
  1 sibling, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-10-31 17:32 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke

On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > Usage of XDP hints requires putting additional information after the
> > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > metadata (cached time and VLAN protocol ID).
> > > > 
> > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > buffer does not contain any reliable information, so everything has to be
> > > > copied, damaging the performance.
> > > > 
> > > > Introduce a static key to enable meta sources assignment only when attached
> > > > XDP program is device-bound.
> > > > 
> > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > of addition of XDP hints to the driver.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > >  };
> > > >  
> > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > >  
> > > >  struct ice_channel {
> > > >  	struct list_head list;
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > >  
> > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > +
> > > >  /**
> > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > >   * @hw: pointer to the device HW structure
> > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > >  	return -ENOMEM;
> > > >  }
> > > >  
> > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > +{
> > > > +	return prog && prog->aux->dev_bound;
> > > > +}
> > > > +
> > > >  /**
> > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > >   * @vsi: VSI to set the bpf prog on
> > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > >  	struct bpf_prog *old_prog;
> > > >  	int i;
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > 
> > > i thought boolean key would be enough but inc/dec should serve properly
> > > for example prog hotswap cases.
> > >
> > 
> > My thought process on using counting instead of boolean was: there can be 
> > several PFs that use the same driver, so therefore we need to keep track of how 
> > > many of them use hints.
> 
> Very good point. This implies that if PF0 has hints-enabled prog loaded,
> PF1 with non-hints prog will "suffer" from it.
> 
> Sorry for such long delays in responding, but I was having a hard time
> making up my mind about it. In the end I have come to some conclusions.
> I know the timing for sending this response is not ideal, but I need to
> get this off my chest and bring the discussion back to life :)
> 
> IMHO having static keys to eliminate ZC overhead does not scale. I assume
> every other driver would have to follow that.
> 
> XSK pool allows us to avoid initializing various things per each packet.
> Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> xdp_rxq_info assigned at init time. With this in mind, we should have some
> mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> as well. Such mechanism should not require us to expose driver's private
> xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> 
> Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> pointer to that?
> 
> This would allow us to init the pointer in each xdp_buff from XSK pool at
> init time. I have come up with a way to program that via so called XSK
> meta descriptors. Each desc would have data to write onto cb, offset
> within cb and amount of bytes to write/copy.
> 
> I'll share the diff below, but note that I didn't measure how much the
> performance degrades. The Ice Lake machine I used for measuring
> performance-sensitive code broke. For now we can't escape initing
> eop_desc for each xdp_buff, but I moved it to the alloc side, as we mangle
> descs there anyway.
> 
> I think mlx5 could benefit from that approach as well with initing the rq
> ptr at init time.
> 
> Diff does mostly these things:
> - move cached_phctime to old place in ice_rx_ring and add ptr to that in
>   ice_pkt_ctx
> - introduce xsk_pool_set_meta()
> - use it from ice side.
> 
> I consider this as a discussion trigger rather than ready code. Any
> feedback will be appreciated.
> 
> ---------------------------------8<---------------------------------
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 7fa43827a3f0..c192e84bee55 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -519,6 +519,23 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	return 0;
>  }
>  

[...]

> > > 
> > > My only concern is that we might be hurting in a minor way hints path now,
> > > no?
> > 
> > I have thought "unlikely" refers to the default state the code is compiled with 
> > and after static key incrementation this should be patched to "likely". Isn't 
> > this how static keys work?
> 
> I was only referring to that it ends with compiler hint:
> #define unlikely_notrace(x)	__builtin_expect(!!(x), 0)
> 
> see include/linux/jump_label.h
> 

You are right, I have misunderstood the concept a little bit.
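
To spell out the point for the archive: the "unlikely" part is a compile-time
layout hint and it stays in place after the key is enabled; only the branch
itself gets patched. A minimal userspace sketch of the hint half, assuming a
plain bool as a stand-in for the static key (all names here are made up for
illustration, this is not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the kernel macro quoted above: it only tells the compiler
 * which outcome to lay out as the fall-through path. It never changes the
 * result of the condition.
 */
#define unlikely_hint(x) __builtin_expect(!!(x), 0)

/* Assumption: a plain flag stands in for the static key. A real static key
 * is patched into the instruction stream instead of being loaded and tested.
 */
static bool meta_enabled;

/* Counts how often the "hints" path was taken. */
static int meta_calls;

static void set_desc(void)
{
	meta_calls++;
}

/* Hypothetical per-packet path: the hinted branch is still taken whenever
 * the flag is set, it is just placed out of line by the compiler.
 */
static int run_packet(void)
{
	if (unlikely_hint(meta_enabled))
		set_desc();
	return meta_calls;
}
```

So with the flag set, every call still reaches set_desc(); the hint only
decides which side is the straight-line code, which is why the hints path
may end up slightly worse off.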

> > 
> > > 
> > > > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> > > >  
> > > >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >  	switch (act) {
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > index 39775bb6cec1..f92d7d33fde6 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > >  				   union ice_32b_rx_flex_desc *eop_desc,
> > > >  				   struct ice_rx_ring *rx_ring)
> > > >  {
> > > > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > > > +		return;
> > > 
> > > wouldn't it be better to pull it out and avoid calling
> > > ice_prepare_pkt_ctx_zc() unnecessarily?
> > > 
> > > > +
> > > >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > > >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > -- 
> > > > 2.41.0
> > > > 


* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-10-31 14:22         ` Larysa Zaremba
@ 2023-11-02 13:23           ` Maciej Fijalkowski
  2023-11-02 13:48             ` Larysa Zaremba
  0 siblings, 1 reply; 38+ messages in thread
From: Maciej Fijalkowski @ 2023-11-02 13:23 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke

On Tue, Oct 31, 2023 at 03:22:31PM +0100, Larysa Zaremba wrote:
> On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> > On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > > Usage of XDP hints requires putting additional information after the
> > > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > > metadata (cached time and VLAN protocol ID).
> > > > > 
> > > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > > buffer does not contain any reliable information, so everything has to be
> > > > > copied, damaging the performance.
> > > > > 
> > > > > Introduce a static key to enable meta sources assignment only when attached
> > > > > XDP program is device-bound.
> > > > > 
> > > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > > of addition of XDP hints to the driver.
> > > > > 
> > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > ---
> > > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > > >  };
> > > > >  
> > > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > >  
> > > > >  struct ice_channel {
> > > > >  	struct list_head list;
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > > >  
> > > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > > +
> > > > >  /**
> > > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > > >   * @hw: pointer to the device HW structure
> > > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > > >  	return -ENOMEM;
> > > > >  }
> > > > >  
> > > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > > +{
> > > > > +	return prog && prog->aux->dev_bound;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > > >   * @vsi: VSI to set the bpf prog on
> > > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > > >  	struct bpf_prog *old_prog;
> > > > >  	int i;
> > > > >  
> > > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > > 
> > > > i thought boolean key would be enough but inc/dec should serve properly
> > > > for example prog hotswap cases.
> > > >
> > > 
> > > My thought process on using counting instead of boolean was: there can be 
> > > several PFs that use the same driver, so therefore we need to keep track of how 
> > > > many of them use hints.
> > 
> > Very good point. This implies that if PF0 has hints-enabled prog loaded,
> > PF1 with non-hints prog will "suffer" from it.
> > 
> > Sorry for such long delays in responding, but I was having a hard time
> > making up my mind about it. In the end I have come to some conclusions.
> > I know the timing for sending this response is not ideal, but I need to
> > get this off my chest and bring the discussion back to life :)
> > 
> > IMHO having static keys to eliminate ZC overhead does not scale. I assume
> > every other driver would have to follow that.
> > 
> > XSK pool allows us to avoid initializing various things per each packet.
> > Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> > xdp_rxq_info assigned at init time. With this in mind, we should have some
> > mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> > as well. Such mechanism should not require us to expose driver's private
> > xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> > 
> > Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> > reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> > the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> > pointer to that?
> > 
> > This would allow us to init the pointer in each xdp_buff from XSK pool at
> > init time. I have come up with a way to program that via so called XSK
> > meta descriptors. Each desc would have data to write onto cb, offset
> > within cb and amount of bytes to write/copy.
> > 
> > I'll share the diff below, but note that I didn't measure how much the
> > performance degrades. The Ice Lake machine I used for measuring
> > performance-sensitive code broke. For now we can't escape initing
> > eop_desc for each xdp_buff, but I moved it to the alloc side, as we mangle
> > descs there anyway.
> > 
> > I think mlx5 could benefit from that approach as well with initing the rq
> > ptr at init time.
> > 
> > Diff does mostly these things:
> > - move cached_phctime to old place in ice_rx_ring and add ptr to that in
> >   ice_pkt_ctx
> > - introduce xsk_pool_set_meta()
> > - use it from ice side.
> > 
> 
> Thank you for the code! I will probably send v7 with such changes. Are you OK, 
> if patch with core changes would go with you as an author?

Yes, or I can produce a patch and share it, up to you.

> 
> But also, I see a minor problem with that switching VLAN protocol does not 
> trigger buffer allocation, so we have to point to that too, this probably means 
> moving cached time back and finding 16 extra bits in CL3. Single pointer to 
> {cached time, vlan_proto} would be copied to be after xdp_buff.

It's not that it has to trigger buffer allocation. We could stop the
interface if a pool is present and update the VLAN proto on the pool's
xdp_buffs (from a quick glance I don't see that we're stopping the iface
for setting VLAN features), but that sounds like more of a hassle...

So yeah, maybe let's just have a ptr in ice_pkt_ctx as well.
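
To illustrate why a pointer sidesteps the problem, here is a rough userspace
sketch, assuming one per-ring struct that holds both the cached time and the
VLAN proto. The struct and function names are hypothetical and only model the
idea, they are not the ice definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-ring data that changes rarely (assumption: both fields
 * live next to each other in the ring).
 */
struct ring_meta {
	uint64_t cached_time;
	uint16_t vlan_proto;
};

/* Hypothetical per-buffer context: the ring pointer is written once at pool
 * init time, only the descriptor pointer is updated per packet.
 */
struct pkt_ctx {
	const struct ring_meta *ring;
	const void *eop_desc;
};

/* Done once for every buffer in the pool. */
static void pool_init_ctx(struct pkt_ctx *bufs, int n,
			  const struct ring_meta *ring)
{
	for (int i = 0; i < n; i++)
		bufs[i].ring = ring;
}

/* Readers always go through the pointer, so they see the current value. */
static uint16_t ctx_vlan_proto(const struct pkt_ctx *ctx)
{
	return ctx->ring->vlan_proto;
}

static uint64_t ctx_cached_time(const struct pkt_ctx *ctx)
{
	return ctx->ring->cached_time;
}
```

Changing vlan_proto or cached_time in the ring is then visible through every
buffer without touching the pool again, at the cost of one extra dereference
when a hint is read.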

[...]


* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
  2023-11-02 13:23           ` Maciej Fijalkowski
@ 2023-11-02 13:48             ` Larysa Zaremba
  0 siblings, 0 replies; 38+ messages in thread
From: Larysa Zaremba @ 2023-11-02 13:48 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke

On Thu, Nov 02, 2023 at 02:23:02PM +0100, Maciej Fijalkowski wrote:
> On Tue, Oct 31, 2023 at 03:22:31PM +0100, Larysa Zaremba wrote:
> > On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> > > On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > > > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > > > Usage of XDP hints requires putting additional information after the
> > > > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > > > metadata (cached time and VLAN protocol ID).
> > > > > > 
> > > > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > > > buffer does not contain any reliable information, so everything has to be
> > > > > > copied, damaging the performance.
> > > > > > 
> > > > > > Introduce a static key to enable meta sources assignment only when attached
> > > > > > XDP program is device-bound.
> > > > > > 
> > > > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > > > of addition of XDP hints to the driver.
> > > > > > 
> > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > ---
> > > > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > > > >  };
> > > > > >  
> > > > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > > >  
> > > > > >  struct ice_channel {
> > > > > >  	struct list_head list;
> > > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > > > >  
> > > > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > > > +
> > > > > >  /**
> > > > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > > > >   * @hw: pointer to the device HW structure
> > > > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > > > >  	return -ENOMEM;
> > > > > >  }
> > > > > >  
> > > > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > > > +{
> > > > > > +	return prog && prog->aux->dev_bound;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > > > >   * @vsi: VSI to set the bpf prog on
> > > > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > > > >  	struct bpf_prog *old_prog;
> > > > > >  	int i;
> > > > > >  
> > > > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > > > 
> > > > > i thought boolean key would be enough but inc/dec should serve properly
> > > > > for example prog hotswap cases.
> > > > >
> > > > 
> > > > My thought process on using counting instead of boolean was: there can be 
> > > > several PFs that use the same driver, so therefore we need to keep track of how 
> > > > > many of them use hints.
> > > 
> > > Very good point. This implies that if PF0 has hints-enabled prog loaded,
> > > PF1 with non-hints prog will "suffer" from it.
> > > 
> > > Sorry for such long delays in responding, but I was having a hard time
> > > making up my mind about it. In the end I have come to some conclusions.
> > > I know the timing for sending this response is not ideal, but I need to
> > > get this off my chest and bring the discussion back to life :)
> > > 
> > > IMHO having static keys to eliminate ZC overhead does not scale. I assume
> > > every other driver would have to follow that.
> > > 
> > > XSK pool allows us to avoid initializing various things per each packet.
> > > Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> > > xdp_rxq_info assigned at init time. With this in mind, we should have some
> > > mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> > > as well. Such mechanism should not require us to expose driver's private
> > > xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> > > 
> > > Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> > > reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> > > the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> > > pointer to that?
> > > 
> > > This would allow us to init the pointer in each xdp_buff from XSK pool at
> > > init time. I have come up with a way to program that via so called XSK
> > > meta descriptors. Each desc would have data to write onto cb, offset
> > > within cb and amount of bytes to write/copy.
> > > 
> > > I'll share the diff below, but note that I didn't measure how much the
> > > performance degrades. The Ice Lake machine I used for measuring
> > > performance-sensitive code broke. For now we can't escape initing
> > > eop_desc for each xdp_buff, but I moved it to the alloc side, as we mangle
> > > descs there anyway.
> > > 
> > > I think mlx5 could benefit from that approach as well with initing the rq
> > > ptr at init time.
> > > 
> > > Diff does mostly these things:
> > > - move cached_phctime to old place in ice_rx_ring and add ptr to that in
> > >   ice_pkt_ctx
> > > - introduce xsk_pool_set_meta()
> > > - use it from ice side.
> > > 
> > 
> > Thank you for the code! I will probably send v7 with such changes. Are you OK, 
> > if patch with core changes would go with you as an author?
> 
> Yes, or I can produce a patch and share it, up to you.
>

I have already started. Your diff does not compile, so I took some creative
liberties. I will send you the patches for verification this week.
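
For reference, the diff as posted copies from desc->buf, while struct
xsk_meta_desc only has val, off and bytes. A minimal userspace sketch of what
a compiling variant might look like, assuming the data is copied from val and
that out-of-range writes are rejected. The names and CB_SIZE below are made
up for illustration, this is not the patch that will be sent:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed size of the per-buffer private area, illustrative only. */
#define CB_SIZE 24

/* Mirrors the fields of the posted xsk_meta_desc. */
struct meta_desc {
	uint64_t val;	/* data to write, e.g. a pointer value */
	uint8_t off;	/* offset within cb */
	uint8_t bytes;	/* number of bytes to copy from val */
};

/* Hypothetical stand-in for a pool buffer head with its cb area. */
struct buf_head {
	uint8_t cb[CB_SIZE];
};

/* Write the first desc->bytes bytes of desc->val (in host byte order) into
 * every buffer's cb at desc->off.
 * Returns 0 on success, -1 if the write would not fit.
 */
static int set_meta(struct buf_head *heads, uint32_t cnt,
		    const struct meta_desc *desc)
{
	if (desc->bytes > sizeof(desc->val) ||
	    (uint32_t)desc->off + desc->bytes > CB_SIZE)
		return -1;

	for (uint32_t i = 0; i < cnt; i++)
		memcpy(heads[i].cb + desc->off, &desc->val, desc->bytes);

	return 0;
}
```

The bounds check is an addition that the posted diff does not have; whether
the core helper should validate or trust the driver is open for discussion.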
 
> > 
> > But also, I see a minor problem: switching the VLAN protocol does not
> > trigger buffer allocation, so we have to point to that too. This probably
> > means moving cached time back and finding 16 extra bits in CL3. A single
> > pointer to {cached time, vlan_proto} would be copied to be after xdp_buff.
> 
> It's not that it has to trigger buffer allocation. We could stop the
> interface if a pool is present and update the VLAN proto on the pool's
> xdp_buffs (from a quick glance I don't see that we're stopping the iface for
> setting VLAN features), but that sounds like more of a hassle to do...
> 
> So yeah maybe let's just have a ptr in ice_pkt_ctx as well.
> 
> [...]
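
For reference, the "meta descriptor" idea quoted above (data to write onto cb,
an offset within cb, and a byte count, applied to every pool buffer at init
time) can be modelled in plain userspace C. This is only a sketch under
assumptions: struct meta_desc, struct pool_buf, pool_set_meta() and CB_SIZE
are hypothetical stand-ins, not the kernel's xsk_pool_set_meta() or
xdp_buff_xsk, whose real definitions are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CB_SIZE 24 /* assumed size of the per-buffer cb area */

/* Hypothetical descriptor: what to copy, where in cb, and how many bytes. */
struct meta_desc {
	const void *src;
	size_t off;
	size_t bytes;
};

/* Simplified stand-in for one pool buffer and its driver-private cb. */
struct pool_buf {
	uint8_t cb[CB_SIZE];
};

/* Stand-in for the ring that keeps owning the cached PHC time. */
struct ring {
	uint64_t cached_phctime;
};

/* Write the described bytes into the cb of every buffer, once at init. */
static void pool_set_meta(struct pool_buf *bufs, size_t n,
			  const struct meta_desc *desc)
{
	assert(desc->off + desc->bytes <= CB_SIZE);
	for (size_t i = 0; i < n; i++)
		memcpy(bufs[i].cb + desc->off, desc->src, desc->bytes);
}

/* Read back the pointer that was stored at offset 'off' in a buffer's cb. */
static const uint64_t *buf_phctime_ptr(const struct pool_buf *buf, size_t off)
{
	const uint64_t *p;

	memcpy(&p, buf->cb + off, sizeof(p));
	return p;
}
```

Because each cb holds a pointer to the ring field rather than a copy of its
value, a later update of cached_phctime in the ring is visible through every
buffer without touching the buffers again, which is the point of keeping the
value at its original place in the ring.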

end of thread, other threads:[~2023-11-02 13:49 UTC | newest]

Thread overview: 38+ messages
2023-10-12 17:05 [PATCH bpf-next v6 00/18] XDP metadata via kfuncs for ice + VLAN hint Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 01/18] ice: make RX hash reading code more reusable Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 02/18] ice: make RX HW timestamp " Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 03/18] ice: Make ptype internal to descriptor info processing Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 04/18] ice: Introduce ice_xdp_buff Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 05/18] ice: Support HW timestamp hint Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 06/18] ice: Support RX hash XDP hint Larysa Zaremba
2023-10-20 15:27   ` Maciej Fijalkowski
2023-10-23 10:04     ` Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 07/18] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
2023-10-17 16:13   ` Maciej Fijalkowski
2023-10-17 16:37     ` Magnus Karlsson
2023-10-17 16:45       ` Maciej Fijalkowski
2023-10-17 17:03         ` Larysa Zaremba
2023-10-18 10:43           ` Maciej Fijalkowski
2023-10-20 15:29   ` Maciej Fijalkowski
2023-10-23  9:37     ` Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 08/18] xdp: Add VLAN tag hint Larysa Zaremba
2023-10-18 23:59   ` Jakub Kicinski
2023-10-19  8:05     ` Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 09/18] ice: Implement " Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 10/18] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition Larysa Zaremba
2023-10-20 16:32   ` Maciej Fijalkowski
2023-10-23  9:35     ` Larysa Zaremba
2023-10-28 19:55       ` Maciej Fijalkowski
2023-10-31 14:22         ` Larysa Zaremba
2023-11-02 13:23           ` Maciej Fijalkowski
2023-11-02 13:48             ` Larysa Zaremba
2023-10-31 17:32         ` Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 12/18] veth: Implement VLAN tag XDP hint Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 13/18] net: make vlan_get_tag() return -ENODATA instead of -EINVAL Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 14/18] mlx5: implement VLAN tag XDP hint Larysa Zaremba
2023-10-17 12:40   ` Tariq Toukan
2023-10-12 17:05 ` [PATCH bpf-next v6 15/18] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 16/18] selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 17/18] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
2023-10-12 17:05 ` [PATCH bpf-next v6 18/18] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
