* [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice
@ 2023-07-03 18:12 Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 01/20] ice: make RX hash reading code more reusable Larysa Zaremba
                   ` (19 more replies)
  0 siblings, 20 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

This series introduces XDP hints via kfuncs [0] to the ice driver.

The series brings the following existing hints to the ice driver:
 - HW timestamp
 - RX hash with type

The series also introduces new hints and adds their implementation
to ice and veth:
 - VLAN tag with protocol
 - Checksum level

The data above can now be accessed by XDP and userspace (AF_XDP) programs.
It can also be checked with the xdp_metadata test and the xdp_hw_metadata
program.

[0] https://patchwork.kernel.org/project/netdevbpf/cover/20230119221536.3349901-1-sdf@google.com/

v1:
https://lore.kernel.org/all/20230512152607.992209-1-larysa.zaremba@intel.com/

Changes since v1:
- directly return RX hash, RX timestamp and RX checksum status
  in skb-common functions
- use intermediate enum value for checksum status in ice
- get rid of ring structure dependency in ice kfunc implementation
- make variables const, when possible, in ice implementation
- use -ENODATA instead of -EOPNOTSUPP for driver implementation
- instead of having 2 separate functions for c-tag and s-tag,
  use 1 function that outputs both VLAN tag and protocol ID
- improve documentation for introduced hints
- update xdp_metadata selftest to test new hints
- implement new hints in veth, so they can be tested in xdp_metadata
- parse VLAN tag in xdp_hw_metadata

Aleksander Lobakin (1):
  net, xdp: allow metadata > 32

Larysa Zaremba (19):
  ice: make RX hash reading code more reusable
  ice: make RX HW timestamp reading code more reusable
  ice: make RX checksum checking code more reusable
  ice: Make ptype internal to descriptor info processing
  ice: Introduce ice_xdp_buff
  ice: Support HW timestamp hint
  ice: Support RX hash XDP hint
  ice: Support XDP hints in AF_XDP ZC mode
  xdp: Add VLAN tag hint
  ice: Implement VLAN tag hint
  ice: use VLAN proto from ring packet context in skb path
  xdp: Add checksum level hint
  ice: Implement checksum level hint
  selftests/bpf: Allow VLAN packets in xdp_hw_metadata
  selftests/bpf: Add flags and new hints to xdp_hw_metadata
  veth: Implement VLAN tag and checksum level XDP hint
  selftests/bpf: Use AF_INET for TX in xdp_metadata
  selftests/bpf: Check VLAN tag and proto in xdp_metadata
  selftests/bpf: check checksum level in xdp_metadata

 Documentation/networking/xdp-rx-metadata.rst  |  11 +-
 drivers/net/ethernet/intel/ice/ice.h          |   2 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |   2 +-
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 412 +++++++++---------
 drivers/net/ethernet/intel/ice/ice_lib.c      |   2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |  23 +
 drivers/net/ethernet/intel/ice/ice_ptp.c      |  26 +-
 drivers/net/ethernet/intel/ice/ice_ptp.h      |  15 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  15 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  29 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 339 +++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  16 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  18 +-
 drivers/net/veth.c                            |  40 ++
 include/linux/netdevice.h                     |   3 +
 include/linux/skbuff.h                        |  13 +-
 include/net/xdp.h                             |  14 +-
 kernel/bpf/offload.c                          |   4 +
 net/core/xdp.c                                |  41 ++
 tools/testing/selftests/bpf/network_helpers.c |  37 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 195 ++++-----
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  45 +-
 .../selftests/bpf/progs/xdp_metadata.c        |  11 +
 tools/testing/selftests/bpf/xdp_hw_metadata.c |  42 +-
 tools/testing/selftests/bpf/xdp_metadata.h    |  36 +-
 26 files changed, 953 insertions(+), 441 deletions(-)

-- 
2.41.0



* [PATCH bpf-next v2 01/20] ice: make RX hash reading code more reusable
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 02/20] ice: make RX HW timestamp " Larysa Zaremba
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Previously, we only needed RX hash in skb path,
hence all related code was written with skb in mind.
But with the addition of XDP hints via kfuncs to the ice driver,
the same logic will be needed in .xmo_() callbacks.

Move the generic process of reading the RX hash from a descriptor
into a separate function.
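
A minimal userspace sketch of this split, assuming simplified stand-in
types: demo_hash_desc, demo_skb and DEMO_RXDID_FLEX_NIC are hypothetical
and are not the ice or skb definitions. It only illustrates the shape of
the refactoring, where a descriptor-only getter returns 0 for "no hash"
and the skb path becomes a thin wrapper around it:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real ice descriptor or sk_buff. */
#define DEMO_RXDID_FLEX_NIC 2

struct demo_hash_desc {
	uint8_t rxdid;		/* descriptor profile ID */
	uint32_t rss_hash;	/* assumed to be in CPU byte order here */
};

struct demo_skb {
	uint32_t hash;
	bool hash_valid;
};

/* Generic part: depends only on the descriptor; 0 means "no hash". */
uint32_t demo_get_rx_hash(const struct demo_hash_desc *desc)
{
	if (desc->rxdid != DEMO_RXDID_FLEX_NIC)
		return 0;

	return desc->rss_hash;
}

/* skb-specific part: feature check plus a call to the generic getter. */
void demo_rx_hash_to_skb(const struct demo_hash_desc *desc,
			 struct demo_skb *skb, bool rxhash_enabled)
{
	uint32_t hash;

	if (!rxhash_enabled)
		return;

	hash = demo_get_rx_hash(desc);
	if (hash) {
		skb->hash = hash;
		skb->hash_valid = true;
	}
}
```

With 0 used as the "no hash" value, a descriptor that carries a hash of
exactly 0 is indistinguishable from one without a hash in this sketch.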

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 37 +++++++++++++------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index c8322fb6f2b3..8f7f6d78f7bf 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -63,28 +63,43 @@ static enum pkt_hash_types ice_ptype_to_htype(u16 ptype)
 }
 
 /**
- * ice_rx_hash - set the hash value in the skb
+ * ice_get_rx_hash - get RX hash value from descriptor
+ * @rx_desc: specific descriptor
+ *
+ * Returns hash, if present, 0 otherwise.
+ */
+static u32
+ice_get_rx_hash(const union ice_32b_rx_flex_desc *rx_desc)
+{
+	const struct ice_32b_rx_flex_desc_nic *nic_mdid;
+
+	if (rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC)
+		return 0;
+
+	nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
+	return le32_to_cpu(nic_mdid->rss_hash);
+}
+
+/**
+ * ice_rx_hash_to_skb - set the hash value in the skb
  * @rx_ring: descriptor ring
  * @rx_desc: specific descriptor
  * @skb: pointer to current skb
  * @rx_ptype: the ptype value from the descriptor
  */
 static void
-ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
-	    struct sk_buff *skb, u16 rx_ptype)
+ice_rx_hash_to_skb(const struct ice_rx_ring *rx_ring,
+		   const union ice_32b_rx_flex_desc *rx_desc,
+		   struct sk_buff *skb, u16 rx_ptype)
 {
-	struct ice_32b_rx_flex_desc_nic *nic_mdid;
 	u32 hash;
 
 	if (!(rx_ring->netdev->features & NETIF_F_RXHASH))
 		return;
 
-	if (rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC)
-		return;
-
-	nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
-	hash = le32_to_cpu(nic_mdid->rss_hash);
-	skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
+	hash = ice_get_rx_hash(rx_desc);
+	if (likely(hash))
+		skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
 }
 
 /**
@@ -186,7 +201,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
 		       struct sk_buff *skb, u16 ptype)
 {
-	ice_rx_hash(rx_ring, rx_desc, skb, ptype);
+	ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);
 
 	/* modifies the skb - consumes the enet header */
 	skb->protocol = eth_type_trans(skb, rx_ring->netdev);
-- 
2.41.0



* [PATCH bpf-next v2 02/20] ice: make RX HW timestamp reading code more reusable
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 01/20] ice: make RX hash reading code more reusable Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-04 10:04   ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 03/20] ice: make RX checksum checking " Larysa Zaremba
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Previously, we only needed RX HW timestamp in skb path,
hence all related code was written with skb in mind.
But with the addition of XDP hints via kfuncs to the ice driver,
the same logic will be needed in .xmo_() callbacks.

Put generic process of reading RX HW timestamp from a descriptor
into a separate function.
Move skb-related code into another source file.
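
A rough userspace sketch of a getter with this contract, where 0 means
"no timestamp". Everything named demo_* or DEMO_* is a hypothetical
stand-in, and demo_extend_32b_ts() is only an assumption about how a
32-bit timestamp could be extended around a cached 64-bit clock value;
it is not a copy of ice_ptp_extend_32b_ts():

```c
#include <stdint.h>

#define DEMO_TS_VALID 0x1u	/* hypothetical "timestamp valid" bit */

struct demo_ts_desc {
	uint8_t ts_low;		/* carries the valid bit */
	uint32_t ts_high;	/* 32-bit HW timestamp, in ns */
};

/* Extend a 32-bit timestamp to 64 bits around a cached 64-bit time. */
uint64_t demo_extend_32b_ts(uint64_t cached_time, uint32_t ts)
{
	uint32_t cached_lo = (uint32_t)cached_time;
	uint32_t delta = ts - cached_lo;

	/* A huge forward delta means the timestamp predates the cache. */
	if (delta > UINT32_MAX / 2)
		return cached_time - (uint32_t)(cached_lo - ts);

	return cached_time + delta;
}

/* Descriptor-only getter: returns 0 when no timestamp can be reported. */
uint64_t demo_get_rx_hwts(const struct demo_ts_desc *desc,
			  uint64_t cached_time)
{
	if (!(desc->ts_low & DEMO_TS_VALID))
		return 0;

	/* Do not report a timestamp without a cached clock value. */
	if (!cached_time)
		return 0;

	return demo_extend_32b_ts(cached_time, desc->ts_high);
}
```

Passing cached_time as a plain argument is what removes the dependency
on the ring structure: the caller decides where the cached value comes
from.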

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ptp.c      | 24 ++++++------------
 drivers/net/ethernet/intel/ice/ice_ptp.h      | 15 ++++++-----
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 25 ++++++++++++++++++-
 3 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index 81d96a40d5a7..a31333972c68 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -2147,30 +2147,24 @@ int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr)
 }
 
 /**
- * ice_ptp_rx_hwtstamp - Check for an Rx timestamp
- * @rx_ring: Ring to get the VSI info
+ * ice_ptp_get_rx_hwts - Get packet Rx timestamp
  * @rx_desc: Receive descriptor
- * @skb: Particular skb to send timestamp with
+ * @cached_time: Cached PHC time
  *
  * The driver receives a notification in the receive descriptor with timestamp.
- * The timestamp is in ns, so we must convert the result first.
  */
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb)
+u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
+			u64 cached_time)
 {
-	struct skb_shared_hwtstamps *hwtstamps;
-	u64 ts_ns, cached_time;
 	u32 ts_high;
+	u64 ts_ns;
 
 	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
-		return;
-
-	cached_time = READ_ONCE(rx_ring->cached_phctime);
+		return 0;
 
 	/* Do not report a timestamp if we don't have a cached PHC time */
 	if (!cached_time)
-		return;
+		return 0;
 
 	/* Use ice_ptp_extend_32b_ts directly, using the ring-specific cached
 	 * PHC value, rather than accessing the PF. This also allows us to
@@ -2181,9 +2175,7 @@ ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
 	ts_high = le32_to_cpu(rx_desc->wb.flex_ts.ts_high);
 	ts_ns = ice_ptp_extend_32b_ts(cached_time, ts_high);
 
-	hwtstamps = skb_hwtstamps(skb);
-	memset(hwtstamps, 0, sizeof(*hwtstamps));
-	hwtstamps->hwtstamp = ns_to_ktime(ts_ns);
+	return ts_ns;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
index 995a57019ba7..523eefbfdf95 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
@@ -268,9 +268,8 @@ void ice_ptp_extts_event(struct ice_pf *pf);
 s8 ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb);
 enum ice_tx_tstamp_work ice_ptp_process_ts(struct ice_pf *pf);
 
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb);
+u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
+			u64 cached_time);
 void ice_ptp_reset(struct ice_pf *pf);
 void ice_ptp_prepare_for_reset(struct ice_pf *pf);
 void ice_ptp_init(struct ice_pf *pf);
@@ -304,9 +303,13 @@ static inline bool ice_ptp_process_ts(struct ice_pf *pf)
 {
 	return true;
 }
-static inline void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
-		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb) { }
+
+static inline u64
+ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc, u64 cached_time)
+{
+	return 0;
+}
+
 static inline void ice_ptp_reset(struct ice_pf *pf) { }
 static inline void ice_ptp_prepare_for_reset(struct ice_pf *pf) { }
 static inline void ice_ptp_init(struct ice_pf *pf) { }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 8f7f6d78f7bf..d4d27057d17b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -185,6 +185,29 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 	ring->vsi->back->hw_csum_rx_error++;
 }
 
+/**
+ * ice_ptp_rx_hwts_to_skb - Put RX timestamp into skb
+ * @rx_ring: Ring to get the VSI info
+ * @rx_desc: Receive descriptor
+ * @skb: Particular skb to send timestamp with
+ *
+ * The timestamp is in ns, so we must convert the result first.
+ */
+static void
+ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
+		       const union ice_32b_rx_flex_desc *rx_desc,
+		       struct sk_buff *skb)
+{
+	u64 ts_ns, cached_time;
+
+	cached_time = READ_ONCE(rx_ring->pkt_ctx.cached_phctime);
+	ts_ns = ice_ptp_get_rx_hwts(rx_desc, cached_time);
+
+	*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
+		.hwtstamp	= ns_to_ktime(ts_ns),
+	};
+}
+
 /**
  * ice_process_skb_fields - Populate skb header fields from Rx descriptor
  * @rx_ring: Rx descriptor ring packet is being transacted on
@@ -209,7 +232,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 	ice_rx_csum(rx_ring, skb, rx_desc, ptype);
 
 	if (rx_ring->ptp_rx)
-		ice_ptp_rx_hwtstamp(rx_ring, rx_desc, skb);
+		ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
 }
 
 /**
-- 
2.41.0



* [PATCH bpf-next v2 03/20] ice: make RX checksum checking code more reusable
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 01/20] ice: make RX hash reading code more reusable Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 02/20] ice: make RX HW timestamp " Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 04/20] ice: Make ptype internal to descriptor info processing Larysa Zaremba
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Previously, we only needed RX checksum flags in skb path,
hence all related code was written with skb in mind.
But with the addition of XDP hints via kfuncs to the ice driver,
the same logic will be needed in .xmo_() callbacks.

Put generic process of determining checksum status into
a separate function.

When deducing checksum status, we can no longer operate directly on
the skb, therefore introduce an intermediate enum for checksum status.
Fortunately, in ice, we have only 4 possibilities: checksum validated
at level 0, validated at level 1, no checksum, checksum error.
Use 3 bits for more convenient conversion.
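
To illustrate why 3 bits make the conversion convenient, here is a
standalone sketch that mirrors the enum values from the patch under
hypothetical demo_* names. DEMO_CHECKSUM_NONE and
DEMO_CHECKSUM_UNNECESSARY are stand-ins for the skb ip_summed values,
and demo_csum_is_error() is an illustrative helper that the patch itself
does not define:

```c
#include <stdint.h>

#define DEMO_BIT(n) (1u << (n))

/* Same bit layout as the driver enum, under demo names. */
enum demo_rx_csum_status {
	DEMO_RX_CSUM_LVL_0	= 0,
	DEMO_RX_CSUM_LVL_1	= DEMO_BIT(0),
	DEMO_RX_CSUM_NONE	= DEMO_BIT(1),
	DEMO_RX_CSUM_ERROR	= DEMO_BIT(2),
	DEMO_RX_CSUM_FAIL	= DEMO_RX_CSUM_NONE | DEMO_RX_CSUM_ERROR,
};

/* Stand-ins for the generic ip_summed values. */
#define DEMO_CHECKSUM_NONE		0
#define DEMO_CHECKSUM_UNNECESSARY	1

/* Bit 0 is the checksum level itself, so a mask is enough. */
uint8_t demo_csum_lvl(enum demo_rx_csum_status status)
{
	return status & DEMO_RX_CSUM_LVL_1;
}

/* Bit 1 set (NONE or FAIL) means the stack must verify the checksum. */
uint8_t demo_csum_ip_summed(enum demo_rx_csum_status status)
{
	return status & DEMO_RX_CSUM_NONE ? DEMO_CHECKSUM_NONE :
					    DEMO_CHECKSUM_UNNECESSARY;
}

/* Bit 2 is only set for FAIL, so it can drive the error counter. */
int demo_csum_is_error(enum demo_rx_csum_status status)
{
	return !!(status & DEMO_RX_CSUM_ERROR);
}
```

Because FAIL includes the NONE bit, a failed checksum maps to
CHECKSUM_NONE without a separate branch, while the ERROR bit alone
decides whether the error counter is bumped.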

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 105 ++++++++++++------
 1 file changed, 69 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index d4d27057d17b..bc6158873d6b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -102,18 +102,41 @@ ice_rx_hash_to_skb(const struct ice_rx_ring *rx_ring,
 		skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
 }
 
+enum ice_rx_csum_status {
+	ICE_RX_CSUM_LVL_0	= 0,
+	ICE_RX_CSUM_LVL_1	= BIT(0),
+	ICE_RX_CSUM_NONE	= BIT(1),
+	ICE_RX_CSUM_ERROR	= BIT(2),
+	ICE_RX_CSUM_FAIL	= ICE_RX_CSUM_NONE | ICE_RX_CSUM_ERROR,
+};
+
 /**
- * ice_rx_csum - Indicate in skb if checksum is good
- * @ring: the ring we care about
- * @skb: skb currently being received and modified
+ * ice_rx_csum_lvl - Get checksum level from status
+ * @status: driver-specific checksum status
+ */
+static u8 ice_rx_csum_lvl(enum ice_rx_csum_status status)
+{
+	return status & ICE_RX_CSUM_LVL_1;
+}
+
+/**
+ * ice_rx_csum_ip_summed - Checksum status from driver-specific to generic
+ * @status: driver-specific checksum status
+ */
+static u8 ice_rx_csum_ip_summed(enum ice_rx_csum_status status)
+{
+	return status & ICE_RX_CSUM_NONE ? CHECKSUM_NONE : CHECKSUM_UNNECESSARY;
+}
+
+/**
+ * ice_get_rx_csum_status - Deduce checksum status from descriptor
  * @rx_desc: the receive descriptor
  * @ptype: the packet type decoded by hardware
  *
- * skb->protocol must be set before this function is called
+ * Returns driver-specific checksum status
  */
-static void
-ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
-	    union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
+static enum ice_rx_csum_status
+ice_get_rx_csum_status(const union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
 {
 	struct ice_rx_ptype_decoded decoded;
 	u16 rx_status0, rx_status1;
@@ -124,20 +147,12 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 
 	decoded = ice_decode_rx_desc_ptype(ptype);
 
-	/* Start with CHECKSUM_NONE and by default csum_level = 0 */
-	skb->ip_summed = CHECKSUM_NONE;
-	skb_checksum_none_assert(skb);
-
-	/* check if Rx checksum is enabled */
-	if (!(ring->netdev->features & NETIF_F_RXCSUM))
-		return;
-
 	/* check if HW has decoded the packet and checksum */
 	if (!(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_L3L4P_S)))
-		return;
+		return ICE_RX_CSUM_NONE;
 
 	if (!(decoded.known && decoded.outer_ip))
-		return;
+		return ICE_RX_CSUM_NONE;
 
 	ipv4 = (decoded.outer_ip == ICE_RX_PTYPE_OUTER_IP) &&
 	       (decoded.outer_ip_ver == ICE_RX_PTYPE_OUTER_IPV4);
@@ -146,43 +161,61 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 
 	if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_IPE_S) |
 				   BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S))))
-		goto checksum_fail;
+		return ICE_RX_CSUM_FAIL;
 
 	if (ipv6 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_IPV6EXADD_S))))
-		goto checksum_fail;
+		return ICE_RX_CSUM_FAIL;
 
 	/* check for L4 errors and handle packets that were not able to be
 	 * checksummed due to arrival speed
 	 */
 	if (rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_L4E_S))
-		goto checksum_fail;
+		return ICE_RX_CSUM_FAIL;
 
 	/* check for outer UDP checksum error in tunneled packets */
 	if ((rx_status1 & BIT(ICE_RX_FLEX_DESC_STATUS1_NAT_S)) &&
 	    (rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EUDPE_S)))
-		goto checksum_fail;
-
-	/* If there is an outer header present that might contain a checksum
-	 * we need to bump the checksum level by 1 to reflect the fact that
-	 * we are indicating we validated the inner checksum.
-	 */
-	if (decoded.tunnel_type >= ICE_RX_PTYPE_TUNNEL_IP_GRENAT)
-		skb->csum_level = 1;
+		return ICE_RX_CSUM_FAIL;
 
 	/* Only report checksum unnecessary for TCP, UDP, or SCTP */
 	switch (decoded.inner_prot) {
 	case ICE_RX_PTYPE_INNER_PROT_TCP:
 	case ICE_RX_PTYPE_INNER_PROT_UDP:
 	case ICE_RX_PTYPE_INNER_PROT_SCTP:
-		skb->ip_summed = CHECKSUM_UNNECESSARY;
-		break;
-	default:
-		break;
+		/* If there is an outer header present that might contain
+		 * a checksum we need to bump the checksum level by 1 to reflect
+		 * the fact that we have validated the inner checksum.
+		 */
+		return decoded.tunnel_type >= ICE_RX_PTYPE_TUNNEL_IP_GRENAT ?
+		       ICE_RX_CSUM_LVL_1 : ICE_RX_CSUM_LVL_0;
 	}
-	return;
 
-checksum_fail:
-	ring->vsi->back->hw_csum_rx_error++;
+	return ICE_RX_CSUM_NONE;
+}
+
+/**
+ * ice_rx_csum_into_skb - Indicate in skb if checksum is good
+ * @ring: the ring we care about
+ * @skb: skb currently being received and modified
+ * @rx_desc: the receive descriptor
+ * @ptype: the packet type decoded by hardware
+ */
+static void
+ice_rx_csum_into_skb(struct ice_rx_ring *ring, struct sk_buff *skb,
+		     const union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
+{
+	enum ice_rx_csum_status csum_status;
+
+	/* check if Rx checksum is enabled */
+	if (!(ring->netdev->features & NETIF_F_RXCSUM))
+		return;
+
+	csum_status = ice_get_rx_csum_status(rx_desc, ptype);
+	if (csum_status & ICE_RX_CSUM_ERROR)
+		ring->vsi->back->hw_csum_rx_error++;
+
+	skb->ip_summed = ice_rx_csum_ip_summed(csum_status);
+	skb->csum_level = ice_rx_csum_lvl(csum_status);
 }
 
 /**
@@ -229,7 +262,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 	/* modifies the skb - consumes the enet header */
 	skb->protocol = eth_type_trans(skb, rx_ring->netdev);
 
-	ice_rx_csum(rx_ring, skb, rx_desc, ptype);
+	ice_rx_csum_into_skb(rx_ring, skb, rx_desc, ptype);
 
 	if (rx_ring->ptp_rx)
 		ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
-- 
2.41.0



* [PATCH bpf-next v2 04/20] ice: Make ptype internal to descriptor info processing
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (2 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 03/20] ice: make RX checksum checking " Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 05/20] ice: Introduce ice_xdp_buff Larysa Zaremba
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Currently, rx_ptype variable is used only as an argument
to ice_process_skb_fields() and is computed
just before the function call.

Therefore, there is no reason to pass this value as an argument.
Instead, remove this argument and compute the value directly inside
ice_process_skb_fields() function.

Also, separate its calculation into a short function, so the code
can later be reused in .xmo_() callbacks.
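
A small userspace sketch of such a helper, for illustration only.
The descriptor layout, the byte-wise little-endian read and the 10-bit
DEMO_PTYPE_M mask are assumptions made for this example and are not
taken from the ice headers:

```c
#include <stdint.h>

#define DEMO_PTYPE_M 0x3FFu	/* assumed width of the packet type field */

struct demo_ptype_desc {
	uint8_t ptype_flex_flags0[2];	/* little-endian 16-bit field */
};

/* Portable little-endian to CPU conversion for a 16-bit value. */
uint16_t demo_le16_to_cpu(const uint8_t b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Read the packet type; the upper bits hold unrelated flags. */
uint16_t demo_get_ptype(const struct demo_ptype_desc *desc)
{
	return demo_le16_to_cpu(desc->ptype_flex_flags0) & DEMO_PTYPE_M;
}
```

Since the helper needs nothing but the descriptor, any caller holding
a descriptor pointer can use it, with or without an skb.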

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 +-----
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 15 +++++++++++++--
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  2 +-
 4 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 52d0a126eb61..40f2f6dabb81 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1181,7 +1181,6 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		unsigned int size;
 		u16 stat_err_bits;
 		u16 vlan_tag = 0;
-		u16 rx_ptype;
 
 		/* get the Rx desc from Rx ring based on 'next_to_clean' */
 		rx_desc = ICE_RX_DESC(rx_ring, ntc);
@@ -1286,10 +1285,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		total_rx_bytes += skb->len;
 
 		/* populate checksum, VLAN, and protocol */
-		rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
-			ICE_RX_FLEX_DESC_PTYPE_M;
-
-		ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+		ice_process_skb_fields(rx_ring, rx_desc, skb);
 
 		ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
 		/* send completed skb up the stack */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index bc6158873d6b..beb1c5bb392a 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -241,12 +241,21 @@ ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
 	};
 }
 
+/**
+ * ice_get_ptype - Read HW packet type from the descriptor
+ * @rx_desc: RX descriptor
+ */
+static u16 ice_get_ptype(const union ice_32b_rx_flex_desc *rx_desc)
+{
+	return le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
+	       ICE_RX_FLEX_DESC_PTYPE_M;
+}
+
 /**
  * ice_process_skb_fields - Populate skb header fields from Rx descriptor
  * @rx_ring: Rx descriptor ring packet is being transacted on
  * @rx_desc: pointer to the EOP Rx descriptor
  * @skb: pointer to current skb being populated
- * @ptype: the packet type decoded by hardware
  *
  * This function checks the ring, descriptor, and packet information in
  * order to populate the hash, checksum, VLAN, protocol, and
@@ -255,8 +264,10 @@ ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
 void
 ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
-		       struct sk_buff *skb, u16 ptype)
+		       struct sk_buff *skb)
 {
+	u16 ptype = ice_get_ptype(rx_desc);
+
 	ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);
 
 	/* modifies the skb - consumes the enet header */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 115969ecdf7b..e1d49e1235b3 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -148,7 +148,7 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val);
 void
 ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       union ice_32b_rx_flex_desc *rx_desc,
-		       struct sk_buff *skb, u16 ptype);
+		       struct sk_buff *skb);
 void
 ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
 #endif /* !_ICE_TXRX_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index a7fe2b4ce655..730b059e6759 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -854,7 +854,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
 				       ICE_RX_FLEX_DESC_PTYPE_M;
 
-		ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+		ice_process_skb_fields(rx_ring, rx_desc, skb);
 		ice_receive_skb(rx_ring, skb, vlan_tag);
 	}
 
-- 
2.41.0



* [PATCH bpf-next v2 05/20] ice: Introduce ice_xdp_buff
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (3 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 04/20] ice: Make ptype internal to descriptor info processing Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint Larysa Zaremba
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

In order to use XDP hints via kfuncs we need to put
RX descriptor and ring pointers just next to xdp_buff.
Same as in hints implementations in other drivers, we achieve
this through putting xdp_buff into a child structure.

Currently, xdp_buff is stored in the ring structure,
so replace it with a union that includes the child structure.
This way enough memory is available while existing XDP code
remains isolated from hints.

Minimum size of the new child structure (ice_xdp_buff) is exactly
64 bytes (single cache line). To place it at the start of a cache line,
move 'next' field from CL1 to CL3, as it isn't used often. This still
leaves 128 bits available in CL3 for packet context extensions.
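
The child-structure approach can be sketched in userspace C as follows.
All demo_* types and the demo_container_of() macro are hypothetical
simplifications of xdp_buff, ice_xdp_buff and the kernel's
container_of(); demo_meta_get_desc() is an illustrative counterpart to
the setter and is not part of this patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct demo_xdp_buff { void *data; void *data_end; };
struct demo_eop_desc { unsigned int status; };
struct demo_pkt_ctx { const struct demo_eop_desc *eop_desc; };

struct demo_xdp_ext {
	struct demo_xdp_buff xdp_buff;	/* must stay the first member */
	struct demo_pkt_ctx pkt_ctx;	/* driver-private packet context */
};

/* Code that only knows demo_xdp_buff may be handed the extended type. */
static_assert(offsetof(struct demo_xdp_ext, xdp_buff) == 0,
	      "xdp_buff must be the first member");

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Store the descriptor next to the buffer before running the program. */
void demo_meta_set_desc(struct demo_xdp_buff *xdp,
			const struct demo_eop_desc *eop_desc)
{
	struct demo_xdp_ext *ext =
		demo_container_of(xdp, struct demo_xdp_ext, xdp_buff);

	ext->pkt_ctx.eop_desc = eop_desc;
}

/* A hint callback would recover the context from the buffer pointer. */
const struct demo_eop_desc *
demo_meta_get_desc(const struct demo_xdp_buff *xdp)
{
	const struct demo_xdp_ext *ext =
		demo_container_of(xdp, const struct demo_xdp_ext, xdp_buff);

	return ext->pkt_ctx.eop_desc;
}
```

Generic XDP code keeps passing a plain buffer pointer around, and only
the driver-specific helpers know that more data sits right behind it.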

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  7 +++--
 drivers/net/ethernet/intel/ice/ice_txrx.h     | 26 ++++++++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 10 +++++++
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 40f2f6dabb81..4e6546d9cf85 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -557,13 +557,14 @@ ice_rx_frame_truesize(struct ice_rx_ring *rx_ring, const unsigned int size)
  * @xdp_prog: XDP program to run
  * @xdp_ring: ring to be used for XDP_TX action
  * @rx_buf: Rx buffer to store the XDP action
+ * @eop_desc: Last descriptor in packet to read metadata from
  *
  * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
  */
 static void
 ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	    struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
-	    struct ice_rx_buf *rx_buf)
+	    struct ice_rx_buf *rx_buf, union ice_32b_rx_flex_desc *eop_desc)
 {
 	unsigned int ret = ICE_XDP_PASS;
 	u32 act;
@@ -571,6 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	if (!xdp_prog)
 		goto exit;
 
+	ice_xdp_meta_set_desc(xdp, eop_desc);
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -1240,7 +1243,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		if (ice_is_non_eop(rx_ring, rx_desc))
 			continue;
 
-		ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_buf);
+		ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_buf, rx_desc);
 		if (rx_buf->act == ICE_XDP_PASS)
 			goto construct_skb;
 		total_rx_bytes += xdp_get_buff_len(xdp);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 166413fc33f4..d0ab2c4c0c91 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -257,6 +257,18 @@ enum ice_rx_dtype {
 	ICE_RX_DTYPE_SPLIT_ALWAYS	= 2,
 };
 
+struct ice_pkt_ctx {
+	const union ice_32b_rx_flex_desc *eop_desc;
+};
+
+struct ice_xdp_buff {
+	struct xdp_buff xdp_buff;
+	struct ice_pkt_ctx pkt_ctx;
+};
+
+/* Required for compatibility with xdp_buffs from xsk_pool */
+static_assert(offsetof(struct ice_xdp_buff, xdp_buff) == 0);
+
 /* indices into GLINT_ITR registers */
 #define ICE_RX_ITR	ICE_IDX_ITR0
 #define ICE_TX_ITR	ICE_IDX_ITR1
@@ -298,7 +310,6 @@ enum ice_dynamic_itr {
 /* descriptor ring, associated with a VSI */
 struct ice_rx_ring {
 	/* CL1 - 1st cacheline starts here */
-	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
 	void *desc;			/* Descriptor ring memory */
 	struct device *dev;		/* Used for DMA mapping */
 	struct net_device *netdev;	/* netdev ring maps to */
@@ -310,12 +321,19 @@ struct ice_rx_ring {
 	u16 count;			/* Number of descriptors */
 	u16 reg_idx;			/* HW register index of the ring */
 	u16 next_to_alloc;
-	/* CL2 - 2nd cacheline starts here */
+
 	union {
 		struct ice_rx_buf *rx_buf;
 		struct xdp_buff **xdp_buf;
 	};
-	struct xdp_buff xdp;
+	/* CL2 - 2nd cacheline starts here */
+	union {
+		struct ice_xdp_buff xdp_ext;
+		struct {
+			struct xdp_buff xdp;
+			struct ice_pkt_ctx pkt_ctx;
+		};
+	};
 	/* CL3 - 3rd cacheline starts here */
 	struct bpf_prog *xdp_prog;
 	u16 rx_offset;
@@ -325,6 +343,8 @@ struct ice_rx_ring {
 	u16 next_to_clean;
 	u16 first_desc;
 
+	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
+
 	/* stats structs */
 	struct ice_ring_stats *ring_stats;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index e1d49e1235b3..145883eec129 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -151,4 +151,14 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 		       struct sk_buff *skb);
 void
 ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
+
+static inline void
+ice_xdp_meta_set_desc(struct xdp_buff *xdp,
+		      union ice_32b_rx_flex_desc *eop_desc)
+{
+	struct ice_xdp_buff *xdp_ext = container_of(xdp, struct ice_xdp_buff,
+						    xdp_buff);
+
+	xdp_ext->pkt_ctx.eop_desc = eop_desc;
+}
 #endif /* !_ICE_TXRX_LIB_H_ */
-- 
2.41.0
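
A note on the layout trick used by ice_xdp_buff above: the driver context can be reached from a plain xdp_buff pointer only because xdp_buff is the first member of the wrapper, which the static_assert enforces. Below is a minimal userspace sketch of that pattern. All demo_* names are hypothetical stand-ins, not kernel or driver identifiers, and demo_container_of() is a simplified equivalent of the kernel's container_of() without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic buffer type (hypothetical). */
struct demo_buff {
	void *data;
	void *data_end;
};

/* Stand-in for the per-packet driver context (hypothetical). */
struct demo_pkt_ctx {
	const void *eop_desc;	/* last descriptor of the packet */
};

struct demo_ext_buff {
	struct demo_buff base;	/* must stay the first member */
	struct demo_pkt_ctx pkt_ctx;
};

/* A base pointer may be reinterpreted as the wrapper only at offset 0. */
static_assert(offsetof(struct demo_ext_buff, base) == 0,
	      "base must be the first member");

/* Simplified userspace equivalent of container_of() */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Writer side: recover the wrapper explicitly and store the descriptor. */
static void demo_set_desc(struct demo_buff *buff, const void *eop_desc)
{
	struct demo_ext_buff *ext =
		demo_container_of(buff, struct demo_ext_buff, base);

	ext->pkt_ctx.eop_desc = eop_desc;
}

/* Reader side: a plain cast, valid only because base sits at offset 0. */
static const void *demo_get_desc(const struct demo_buff *buff)
{
	const struct demo_ext_buff *ext = (const void *)buff;

	return ext->pkt_ctx.eop_desc;
}
```

The caller must guarantee that every demo_buff passed to these helpers is really embedded in a demo_ext_buff. Neither the cast nor the macro can verify that at run time.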



* [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (4 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 05/20] ice: Introduce ice_xdp_buff Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:30   ` Stanislav Fomichev
  2023-07-03 18:12 ` [PATCH bpf-next v2 07/20] ice: Support RX hash XDP hint Larysa Zaremba
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Use the previously refactored code to create a function
that allows XDP code to read the HW timestamp.

Also, move cached_phctime into the packet context. This way the data
still stays in the ring structure, just at a different address.

HW timestamp is the first supported hint in the driver,
so also add xdp_metadata_ops.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |  2 ++
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
 drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
 drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
 7 files changed, 31 insertions(+), 4 deletions(-)
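
For readers new to the metadata ops, here is a rough userspace sketch of the return-code convention this patch follows. It assumes the split mentioned in the series changelog: -EOPNOTSUPP when no handler is registered at all, and -ENODATA when the handler exists but the packet carries no value. All demo_* names are hypothetical, and the dispatch function is only a model, not the kernel's actual kfunc plumbing. Unlike the handler in this patch, which stores the value before checking it, the sketch writes the destination only on success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of an ops table with one optional handler. */
struct demo_md_ops {
	int (*rx_timestamp)(const void *ctx, uint64_t *ts_ns);
};

/* Hypothetical per-packet context. */
struct demo_ctx {
	uint64_t hw_ts_ns;	/* 0 models "no timestamp in this descriptor" */
};

/* Driver-side handler: hint is supported, but may be absent per packet. */
static int demo_rx_hw_ts(const void *ctx, uint64_t *ts_ns)
{
	const struct demo_ctx *c = ctx;

	if (!c->hw_ts_ns)
		return -ENODATA;

	*ts_ns = c->hw_ts_ns;
	return 0;
}

/* Dispatch model: a missing handler means the hint is unsupported. */
static int demo_read_timestamp(const struct demo_md_ops *ops,
			       const void *ctx, uint64_t *ts_ns)
{
	if (!ops || !ops->rx_timestamp)
		return -EOPNOTSUPP;

	return ops->rx_timestamp(ctx, ts_ns);
}

/* A "driver" that registers the timestamp handler. */
static const struct demo_md_ops demo_ops = {
	.rx_timestamp = demo_rx_hw_ts,
};
```

A consumer should check the return value before trusting the output, since a zero return is the only case in which the sketch guarantees the destination was written.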

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 4ba3d99439a0..7a973a2229f1 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
 	set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
 	clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
 }
+
+extern const struct xdp_metadata_ops ice_xdp_md_ops;
 #endif /* _ICE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* clone ring and setup updated count */
 		rx_rings[i] = *vsi->rx_rings[i];
 		rx_rings[i].count = new_rx_cnt;
-		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
+		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
 		rx_rings[i].desc = NULL;
 		rx_rings[i].rx_buf = NULL;
 		/* this is to allow wr32 to have something to write to
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 00e3afd507a4..eb69b0ac7956 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
 		ring->netdev = vsi->netdev;
 		ring->dev = dev;
 		ring->count = vsi->num_rx_desc;
-		ring->cached_phctime = pf->ptp.cached_phc_time;
+		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
 		WRITE_ONCE(vsi->rx_rings[i], ring);
 	}
 
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 93979ab18bc1..f21996b812ea 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
 
 	netdev->netdev_ops = &ice_netdev_ops;
 	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
+	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
 	ice_set_ethtool_ops(netdev);
 
 	if (vsi->type != ICE_VSI_PF)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index a31333972c68..70697e4829dd 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
 		ice_for_each_rxq(vsi, j) {
 			if (!vsi->rx_rings[j])
 				continue;
-			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
+			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
 		}
 	}
 	clear_bit(ICE_CFG_BUSY, pf->state);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index d0ab2c4c0c91..4237702a58a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -259,6 +259,7 @@ enum ice_rx_dtype {
 
 struct ice_pkt_ctx {
 	const union ice_32b_rx_flex_desc *eop_desc;
+	u64 cached_phctime;
 };
 
 struct ice_xdp_buff {
@@ -354,7 +355,6 @@ struct ice_rx_ring {
 	struct ice_tx_ring *xdp_ring;
 	struct xsk_buff_pool *xsk_pool;
 	dma_addr_t dma;			/* physical address of ring */
-	u64 cached_phctime;
 	u16 rx_buf_len;
 	u8 dcb_tc;			/* Traffic class of ring */
 	u8 ptp_rx;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index beb1c5bb392a..463d9e5cbe05 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
 			spin_unlock(&xdp_ring->tx_lock);
 	}
 }
+
+/**
+ * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
+ * @ctx: XDP buff pointer
+ * @ts_ns: destination address
+ *
+ * Copy HW timestamp (if available) to the destination address.
+ */
+static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+	u64 cached_time;
+
+	cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);
+	*ts_ns = ice_ptp_get_rx_hwts(xdp_ext->pkt_ctx.eop_desc, cached_time);
+	if (!*ts_ns)
+		return -ENODATA;
+
+	return 0;
+}
+
+const struct xdp_metadata_ops ice_xdp_md_ops = {
+	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
+};
-- 
2.41.0



* [PATCH bpf-next v2 07/20] ice: Support RX hash XDP hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (5 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 08/20] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

The RX hash XDP hint requests both the hash value and its type.
The type is XDP-specific, so the hardware ptypes need to be mapped
to XDP hash types; create a lookup table for this.

Instead of creating a new long list, reuse the contents
of ice_ptype_lkup[] through the preprocessor.

The current hash type enum does not contain an ICMP packet type,
but ice devices support it, so also add a new type to the core code.

Then use the previously refactored code to create a function
that allows XDP code to read the RX hash.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 412 +++++++++---------
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  73 ++++
 include/net/xdp.h                             |   3 +
 3 files changed, 284 insertions(+), 204 deletions(-)
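
The preprocessor reuse described above is the classic X-macro technique: one list of rows, expanded twice with different definitions of the row macro. A minimal, self-contained sketch follows. The DEMO_* and demo_* names and the four-row list are hypothetical and far smaller than the real ptype table; the sketch only illustrates the mechanism, not the driver's actual flag values.

```c
#include <assert.h>

/* Hypothetical packet-type list; each row is expanded by DEMO_PTT(). */
#define DEMO_PTYPES			\
	DEMO_PTT(0, NONE, NONE),	\
	DEMO_PTT(1, IPV4, UDP),		\
	DEMO_PTT(2, IPV4, TCP),		\
	DEMO_PTT(3, IPV6, TCP),

#define DEMO_NUM_PTYPES	4

enum { DEMO_L3_NONE = 0, DEMO_L3_IPV4 = 1 << 0, DEMO_L3_IPV6 = 1 << 1 };
enum { DEMO_L4_NONE = 0, DEMO_L4_UDP = 1 << 2, DEMO_L4_TCP = 1 << 3 };

/* First expansion: a table of decoded fields. */
struct demo_decoded {
	int l3;
	int l4;
};

#define DEMO_PTT(PTYPE, L3, L4) \
	[PTYPE] = { DEMO_L3_##L3, DEMO_L4_##L4 }

static const struct demo_decoded demo_lkup[DEMO_NUM_PTYPES] = {
	DEMO_PTYPES
};

/* Second expansion: the same rows, now folded into one flag word. */
#undef DEMO_PTT
#define DEMO_PTT(PTYPE, L3, L4) \
	[PTYPE] = DEMO_L3_##L3 | DEMO_L4_##L4

static const int demo_hash_type[DEMO_NUM_PTYPES] = {
	DEMO_PTYPES
};

#undef DEMO_PTT

/* Bounds-checked lookup; unknown ptypes map to 0 ("no type"). */
static int demo_hash_type_of(unsigned int ptype)
{
	if (ptype >= DEMO_NUM_PTYPES)
		return 0;

	return demo_hash_type[ptype];
}
```

Because both tables come from the same rows, adding or changing a ptype in the list updates both tables at once, which is the copy-paste safety the patch is after.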

diff --git a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
index 89f986a75cc8..d384ddfcb83e 100644
--- a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
+++ b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
@@ -673,6 +673,212 @@ struct ice_tlan_ctx {
  *      Use the enum ice_rx_l2_ptype to decode the packet type
  * ENDIF
  */
+#define ICE_PTYPES								\
+	/* L2 Packet types */							\
+	ICE_PTT_UNUSED_ENTRY(0),						\
+	ICE_PTT(1, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),			\
+	ICE_PTT_UNUSED_ENTRY(2),						\
+	ICE_PTT_UNUSED_ENTRY(3),						\
+	ICE_PTT_UNUSED_ENTRY(4),						\
+	ICE_PTT_UNUSED_ENTRY(5),						\
+	ICE_PTT(6, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),			\
+	ICE_PTT(7, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),			\
+	ICE_PTT_UNUSED_ENTRY(8),						\
+	ICE_PTT_UNUSED_ENTRY(9),						\
+	ICE_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),		\
+	ICE_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),		\
+	ICE_PTT_UNUSED_ENTRY(12),						\
+	ICE_PTT_UNUSED_ENTRY(13),						\
+	ICE_PTT_UNUSED_ENTRY(14),						\
+	ICE_PTT_UNUSED_ENTRY(15),						\
+	ICE_PTT_UNUSED_ENTRY(16),						\
+	ICE_PTT_UNUSED_ENTRY(17),						\
+	ICE_PTT_UNUSED_ENTRY(18),						\
+	ICE_PTT_UNUSED_ENTRY(19),						\
+	ICE_PTT_UNUSED_ENTRY(20),						\
+	ICE_PTT_UNUSED_ENTRY(21),						\
+										\
+	/* Non Tunneled IPv4 */							\
+	ICE_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(25),						\
+	ICE_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),		\
+	ICE_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),		\
+	ICE_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> IPv4 */							\
+	ICE_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(32),						\
+	ICE_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> IPv6 */							\
+	ICE_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(39),						\
+	ICE_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT */							\
+	ICE_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),		\
+										\
+	/* IPv4 --> GRE/NAT --> IPv4 */						\
+	ICE_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(47),						\
+	ICE_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT --> IPv6 */						\
+	ICE_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(54),						\
+	ICE_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv4 --> GRE/NAT --> MAC */						\
+	ICE_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */					\
+	ICE_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(62),						\
+	ICE_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */					\
+	ICE_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(69),						\
+	ICE_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 --> GRE/NAT --> MAC/VLAN */					\
+	ICE_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */				\
+	ICE_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(77),						\
+	ICE_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */				\
+	ICE_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(84),						\
+	ICE_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* Non Tunneled IPv6 */							\
+	ICE_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),		\
+	ICE_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(91),						\
+	ICE_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),		\
+	ICE_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),		\
+	ICE_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> IPv4 */							\
+	ICE_PTT(95, IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(96, IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(97, IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(98),						\
+	ICE_PTT(99, IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> IPv6 */							\
+	ICE_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(105),						\
+	ICE_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT */							\
+	ICE_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),		\
+										\
+	/* IPv6 --> GRE/NAT -> IPv4 */						\
+	ICE_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),		\
+	ICE_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),		\
+	ICE_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(113),						\
+	ICE_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),		\
+	ICE_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),		\
+	ICE_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT -> IPv6 */						\
+	ICE_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),		\
+	ICE_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),		\
+	ICE_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),		\
+	ICE_PTT_UNUSED_ENTRY(120),						\
+	ICE_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),		\
+	ICE_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),		\
+	ICE_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),		\
+										\
+	/* IPv6 --> GRE/NAT -> MAC */						\
+	ICE_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */					\
+	ICE_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(128),						\
+	ICE_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */					\
+	ICE_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(135),						\
+	ICE_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN */					\
+	ICE_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */				\
+	ICE_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),	\
+	ICE_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),	\
+	ICE_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(143),						\
+	ICE_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),	\
+	ICE_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),	\
+	ICE_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),	\
+										\
+	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */				\
+	ICE_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),	\
+	ICE_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),	\
+	ICE_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),	\
+	ICE_PTT_UNUSED_ENTRY(150),						\
+	ICE_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),	\
+	ICE_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),	\
+	ICE_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
+
+#define ICE_NUM_DEFINED_PTYPES	154
 
 /* macro to make the table lines short, use explicit indexing with [PTYPE] */
 #define ICE_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
@@ -695,212 +901,10 @@ struct ice_tlan_ctx {
 
 /* Lookup table mapping in the 10-bit HW PTYPE to the bit field for decoding */
 static const struct ice_rx_ptype_decoded ice_ptype_lkup[BIT(10)] = {
-	/* L2 Packet types */
-	ICE_PTT_UNUSED_ENTRY(0),
-	ICE_PTT(1, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	ICE_PTT_UNUSED_ENTRY(2),
-	ICE_PTT_UNUSED_ENTRY(3),
-	ICE_PTT_UNUSED_ENTRY(4),
-	ICE_PTT_UNUSED_ENTRY(5),
-	ICE_PTT(6, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(7, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(8),
-	ICE_PTT_UNUSED_ENTRY(9),
-	ICE_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(12),
-	ICE_PTT_UNUSED_ENTRY(13),
-	ICE_PTT_UNUSED_ENTRY(14),
-	ICE_PTT_UNUSED_ENTRY(15),
-	ICE_PTT_UNUSED_ENTRY(16),
-	ICE_PTT_UNUSED_ENTRY(17),
-	ICE_PTT_UNUSED_ENTRY(18),
-	ICE_PTT_UNUSED_ENTRY(19),
-	ICE_PTT_UNUSED_ENTRY(20),
-	ICE_PTT_UNUSED_ENTRY(21),
-
-	/* Non Tunneled IPv4 */
-	ICE_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(25),
-	ICE_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv4 */
-	ICE_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(32),
-	ICE_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv6 */
-	ICE_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(39),
-	ICE_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT */
-	ICE_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> IPv4 */
-	ICE_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(47),
-	ICE_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> IPv6 */
-	ICE_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(54),
-	ICE_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC */
-	ICE_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */
-	ICE_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(62),
-	ICE_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */
-	ICE_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(69),
-	ICE_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC/VLAN */
-	ICE_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(77),
-	ICE_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(84),
-	ICE_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* Non Tunneled IPv6 */
-	ICE_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(91),
-	ICE_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv4 */
-	ICE_PTT(95, IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(96, IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(97, IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(98),
-	ICE_PTT(99, IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv6 */
-	ICE_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(105),
-	ICE_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT */
-	ICE_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> IPv4 */
-	ICE_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(113),
-	ICE_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> IPv6 */
-	ICE_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(120),
-	ICE_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC */
-	ICE_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */
-	ICE_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(128),
-	ICE_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */
-	ICE_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(135),
-	ICE_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN */
-	ICE_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(143),
-	ICE_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(150),
-	ICE_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
+	ICE_PTYPES
 
 	/* unused entries */
-	[154 ... 1023] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
+	[ICE_NUM_DEFINED_PTYPES ... 1023] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
 };
 
 static inline struct ice_rx_ptype_decoded ice_decode_rx_desc_ptype(u16 ptype)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 463d9e5cbe05..b11cfaedb81c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -567,6 +567,79 @@ static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
 	return 0;
 }
 
+/* Define a ptype index -> XDP hash type lookup table.
+ * It uses the same ptype definitions as ice_ptype_lkup[],
+ * avoiding possible copy-paste errors.
+ */
+#undef ICE_PTT
+#undef ICE_PTT_UNUSED_ENTRY
+
+#define ICE_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
+	[PTYPE] = XDP_RSS_L3_##OUTER_IP_VER | XDP_RSS_L4_##I | XDP_RSS_TYPE_##PL
+
+#define ICE_PTT_UNUSED_ENTRY(PTYPE) [PTYPE] = 0
+
+/* A few supplementary definitions for when XDP hash types do not coincide
+ * with what can be generated from ptype definitions
+ * by means of preprocessor concatenation.
+ */
+#define XDP_RSS_L3_NONE		XDP_RSS_TYPE_NONE
+#define XDP_RSS_L4_NONE		XDP_RSS_TYPE_NONE
+#define XDP_RSS_TYPE_PAY2	XDP_RSS_TYPE_L2
+#define XDP_RSS_TYPE_PAY3	XDP_RSS_TYPE_NONE
+#define XDP_RSS_TYPE_PAY4	XDP_RSS_L4
+
+static const enum xdp_rss_hash_type
+ice_ptype_to_xdp_hash[ICE_NUM_DEFINED_PTYPES] = {
+	ICE_PTYPES
+};
+
+#undef XDP_RSS_L3_NONE
+#undef XDP_RSS_L4_NONE
+#undef XDP_RSS_TYPE_PAY2
+#undef XDP_RSS_TYPE_PAY3
+#undef XDP_RSS_TYPE_PAY4
+
+#undef ICE_PTT
+#undef ICE_PTT_UNUSED_ENTRY
+
+/**
+ * ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
+ * @eop_desc: End of Packet descriptor
+ */
+static enum xdp_rss_hash_type
+ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
+{
+	u16 ptype = ice_get_ptype(eop_desc);
+
+	if (unlikely(ptype >= ICE_NUM_DEFINED_PTYPES))
+		return 0;
+
+	return ice_ptype_to_xdp_hash[ptype];
+}
+
+/**
+ * ice_xdp_rx_hash - RX hash XDP hint handler
+ * @ctx: XDP buff pointer
+ * @hash: hash destination address
+ * @rss_type: XDP hash type destination address
+ *
+ * Copy RX hash (if available) and its type to the destination address.
+ */
+static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
+			   enum xdp_rss_hash_type *rss_type)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+
+	*hash = ice_get_rx_hash(xdp_ext->pkt_ctx.eop_desc);
+	*rss_type = ice_xdp_rx_hash_type(xdp_ext->pkt_ctx.eop_desc);
+	if (unlikely(!*hash))
+		return -ENODATA;
+
+	return 0;
+}
+
 const struct xdp_metadata_ops ice_xdp_md_ops = {
 	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
+	.xmo_rx_hash			= ice_xdp_rx_hash,
 };
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..6381560efae2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -417,6 +417,7 @@ enum xdp_rss_hash_type {
 	XDP_RSS_L4_UDP		= BIT(5),
 	XDP_RSS_L4_SCTP		= BIT(6),
 	XDP_RSS_L4_IPSEC	= BIT(7), /* L4 based hash include IPSEC SPI */
+	XDP_RSS_L4_ICMP		= BIT(8),
 
 	/* Second part: RSS hash type combinations used for driver HW mapping */
 	XDP_RSS_TYPE_NONE            = 0,
@@ -432,11 +433,13 @@ enum xdp_rss_hash_type {
 	XDP_RSS_TYPE_L4_IPV4_UDP     = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
 	XDP_RSS_TYPE_L4_IPV4_SCTP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
 	XDP_RSS_TYPE_L4_IPV4_IPSEC   = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
+	XDP_RSS_TYPE_L4_IPV4_ICMP    = XDP_RSS_L3_IPV4 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
 
 	XDP_RSS_TYPE_L4_IPV6_TCP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_TCP,
 	XDP_RSS_TYPE_L4_IPV6_UDP     = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_UDP,
 	XDP_RSS_TYPE_L4_IPV6_SCTP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_SCTP,
 	XDP_RSS_TYPE_L4_IPV6_IPSEC   = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_IPSEC,
+	XDP_RSS_TYPE_L4_IPV6_ICMP    = XDP_RSS_L3_IPV6 | XDP_RSS_L4 | XDP_RSS_L4_ICMP,
 
 	XDP_RSS_TYPE_L4_IPV6_TCP_EX  = XDP_RSS_TYPE_L4_IPV6_TCP  | XDP_RSS_L3_DYNHDR,
 	XDP_RSS_TYPE_L4_IPV6_UDP_EX  = XDP_RSS_TYPE_L4_IPV6_UDP  | XDP_RSS_L3_DYNHDR,
-- 
2.41.0
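
The ptype lookup table in the patch above is generated by redefining a macro over a shared list, so the same ptype definitions can produce more than one table. Below is a minimal, self-contained sketch of that pattern. All names (DEMO_PTYPES, DEMO_PTT, the DEMO_HASH_* bits) are hypothetical and much simpler than the driver's ICE_PTYPES list; only the technique is the same.

```c
#include <stdint.h>

/* Hypothetical hash-type bits, standing in for the XDP_RSS_* values. */
enum demo_hash_type {
	DEMO_HASH_NONE    = 0,
	DEMO_HASH_L3_IPV4 = 1 << 0,
	DEMO_HASH_L3_IPV6 = 1 << 1,
	DEMO_HASH_L4_TCP  = 1 << 2,
	DEMO_HASH_L4_UDP  = 1 << 3,
};

/* One list of packet types; each user supplies its own DEMO_PTT(). */
#define DEMO_PTYPES			\
	DEMO_PTT(0, NONE, NONE),	\
	DEMO_PTT(1, IPV4, TCP),		\
	DEMO_PTT(2, IPV4, UDP),		\
	DEMO_PTT(3, IPV6, TCP),

#define DEMO_NUM_PTYPES 4

/* Aliases for list entries that have no directly matching enum name. */
#define DEMO_HASH_L3_NONE DEMO_HASH_NONE
#define DEMO_HASH_L4_NONE DEMO_HASH_NONE

/* Expand every list entry into a designated initializer. */
#define DEMO_PTT(PTYPE, L3, L4) \
	[PTYPE] = DEMO_HASH_L3_##L3 | DEMO_HASH_L4_##L4

static const uint32_t demo_ptype_to_hash[DEMO_NUM_PTYPES] = {
	DEMO_PTYPES
};

/* Clean up so the list can be reused with a different expansion. */
#undef DEMO_PTT
#undef DEMO_HASH_L3_NONE
#undef DEMO_HASH_L4_NONE

/* Bounds-checked lookup: unknown ptypes map to "no hash type". */
static uint32_t demo_hash_type(unsigned int ptype)
{
	if (ptype >= DEMO_NUM_PTYPES)
		return DEMO_HASH_NONE;

	return demo_ptype_to_hash[ptype];
}
```

Because the list is the single source of truth, adding a ptype means touching one place only. The temporary aliases cover list tokens that cannot be pasted into a valid enum name.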



* [PATCH bpf-next v2 08/20] ice: Support XDP hints in AF_XDP ZC mode
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (6 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 07/20] ice: Support RX hash XDP hint Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint Larysa Zaremba
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

In AF_XDP ZC, xdp_buff is not stored on the ring;
instead, it is provided by xsk_pool.
Space for metadata sources right after such buffers was already reserved
in commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk").
This makes the implementation rather straightforward.

Update AF_XDP ZC packet processing to support XDP hints.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
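
For readers unfamiliar with the reserved area mentioned above, here is a minimal userspace sketch of the idea: a generic buffer is followed by a fixed number of bytes for driver use, and a compile-time check rejects private data that would not fit. All names (demo_buff, demo_pool_buff, demo_priv, DEMO_PRIV_MAX) are hypothetical. The driver casts the buffer pointer directly, whereas this sketch copies through memcpy() to stay within standard C.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_PRIV_MAX 24	/* hypothetical size of the reserved area */

struct demo_buff {		/* stands in for struct xdp_buff */
	void *data;
	void *data_end;
};

struct demo_pool_buff {		/* stands in for struct xdp_buff_xsk */
	struct demo_buff xdp;
	uint8_t cb[DEMO_PRIV_MAX];	/* reserved for the driver */
};

struct demo_priv {		/* driver-private metadata sources */
	const void *eop_desc;
	uint64_t cached_time;
};

/* Compile-time check: private data must fit in the reserved area. */
_Static_assert(sizeof(struct demo_priv) <= DEMO_PRIV_MAX,
	       "driver-private data does not fit in the reserved area");

/* Remember the descriptor before the XDP program runs. */
static void demo_set_desc(struct demo_pool_buff *pb, const void *desc)
{
	struct demo_priv priv = { .eop_desc = desc, .cached_time = 0 };

	memcpy(pb->cb, &priv, sizeof(priv));
}

/* Read the descriptor back, as a hint handler would. */
static const void *demo_get_desc(const struct demo_pool_buff *pb)
{
	struct demo_priv priv;

	memcpy(&priv, pb->cb, sizeof(priv));
	return priv.eop_desc;
}
```

The build fails if demo_priv grows beyond the reserved bytes, so an oversized private struct cannot silently overwrite whatever follows the buffer.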
 drivers/net/ethernet/intel/ice/ice_xsk.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 730b059e6759..197ebefc6307 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -705,16 +705,25 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
  * @xdp: xdp_buff used as input to the XDP program
  * @xdp_prog: XDP program to run
  * @xdp_ring: ring to be used for XDP_TX action
+ * @rx_desc: packet descriptor
  *
  * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
  */
 static int
 ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
+	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
+	       union ice_32b_rx_flex_desc *rx_desc)
 {
 	int err, result = ICE_XDP_PASS;
 	u32 act;
 
+	/* We can safely convert xdp_buff_xsk to ice_xdp_buff,
+	 * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
+	 * right after xdp_buff, for our private use.
+	 * The macro ensures we do not go above the limit.
+	 */
+	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
+	ice_xdp_meta_set_desc(xdp, rx_desc);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	if (likely(act == XDP_REDIRECT)) {
@@ -813,7 +822,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		xsk_buff_set_size(xdp, size);
 		xsk_buff_dma_sync_for_cpu(xdp, rx_ring->xsk_pool);
 
-		xdp_res = ice_run_xdp_zc(rx_ring, xdp, xdp_prog, xdp_ring);
+		xdp_res = ice_run_xdp_zc(rx_ring, xdp, xdp_prog, xdp_ring,
+					 rx_desc);
 		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
 			xdp_xmit |= xdp_res;
 		} else if (xdp_res == ICE_XDP_EXIT) {
-- 
2.41.0



* [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (7 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 08/20] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 20:15   ` John Fastabend
  2023-07-03 18:12 ` [PATCH bpf-next v2 10/20] ice: Implement " Larysa Zaremba
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Implement functionality that enables drivers to expose VLAN tag
to XDP code.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
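
A consumer of this hint usually needs the individual fields of the 16-bit tag. The sketch below shows how the value could be split according to the 802.1Q tag control layout (3 bits of priority, 1 bit of drop eligibility, 12 bits of VLAN ID). The DEMO_* names and helpers are hypothetical and not part of this series.

```c
#include <stdint.h>

#define DEMO_VLAN_VID_MASK   0x0fffu	/* low 12 bits: VLAN ID */
#define DEMO_VLAN_DEI_MASK   0x1000u	/* bit 12: drop eligible indicator */
#define DEMO_VLAN_PRIO_SHIFT 13		/* top 3 bits: priority code point */

/* VLAN ID, 0..4095. */
static uint16_t demo_vlan_vid(uint16_t tci)
{
	return tci & DEMO_VLAN_VID_MASK;
}

/* Priority code point, 0..7. */
static uint8_t demo_vlan_pcp(uint16_t tci)
{
	return (uint8_t)(tci >> DEMO_VLAN_PRIO_SHIFT);
}

/* Drop eligible indicator, 0 or 1. */
static int demo_vlan_dei(uint16_t tci)
{
	return !!(tci & DEMO_VLAN_DEI_MASK);
}
```

For a tag value of 0xA064 these helpers give priority 5, drop eligibility 0 and VLAN ID 100.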
 Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
 include/linux/netdevice.h                    |  2 ++
 include/net/xdp.h                            |  2 ++
 kernel/bpf/offload.c                         |  2 ++
 net/core/xdp.c                               | 20 ++++++++++++++++++++
 5 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index 25ce72af81c2..ea6dd79a21d3 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
 metadata is supported, this set will grow:
 
 .. kernel-doc:: net/core/xdp.c
-   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
+   :identifiers: bpf_xdp_metadata_rx_timestamp
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_vlan_tag
 
 An XDP program can use these kfuncs to read the metadata into stack
 variables for its own consumption. Or, to pass the metadata on to other
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..4fa4380e6d89 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1658,6 +1658,8 @@ struct xdp_metadata_ops {
 	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
 	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
 			       enum xdp_rss_hash_type *rss_type);
+	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
+				   __be16 *vlan_proto);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 6381560efae2..89c58f56ffc6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -389,6 +389,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 			   bpf_xdp_metadata_rx_timestamp) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
 			   bpf_xdp_metadata_rx_hash) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
+			   bpf_xdp_metadata_rx_vlan_tag) \
 
 enum {
 #define XDP_METADATA_KFUNC(name, _) name,
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8a26cd8814c1..986e7becfd42 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -848,6 +848,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_timestamp;
 	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
+		p = ops->xmo_rx_vlan_tag;
 out:
 	up_read(&bpf_devs_lock);
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 41e5ca8643ec..f6262c90e45f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
+ * @ctx: XDP context pointer.
+ * @vlan_tag: Destination pointer for VLAN tag
+ * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
+ *
+ * In case of success, vlan_tag contains the VLAN tag, including the 12 least significant bits
+ * containing the VLAN ID; vlan_proto contains the protocol identifier.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
+ * * ``-ENODATA``    : VLAN tag was not stripped or is not available
+ */
+__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
+					     __be16 *vlan_proto)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-- 
2.41.0



* [PATCH bpf-next v2 10/20] ice: Implement VLAN tag hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (8 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 11/20] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Implement .xmo_rx_vlan_tag callback to allow XDP code to read
packet's VLAN tag.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c     | 22 ++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txrx.c     |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 26 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  4 +--
 drivers/net/ethernet/intel/ice/ice_xsk.c      |  2 +-
 6 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index f21996b812ea..ab7129b0dc67 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -5930,6 +5930,23 @@ ice_fix_features(struct net_device *netdev, netdev_features_t features)
 	return features;
 }
 
+/**
+ * ice_set_rx_rings_vlan_proto - update rings with new stripped VLAN proto
+ * @vsi: PF's VSI
+ * @vlan_ethertype: VLAN ethertype (802.1Q or 802.1ad) in network byte order
+ *
+ * Store current stripped VLAN proto in ring packet context,
+ * so it can be accessed more efficiently by packet processing code.
+ */
+static void
+ice_set_rx_rings_vlan_proto(struct ice_vsi *vsi, __be16 vlan_ethertype)
+{
+	u16 i;
+
+	ice_for_each_alloc_rxq(vsi, i)
+		vsi->rx_rings[i]->pkt_ctx.vlan_proto = vlan_ethertype;
+}
+
 /**
  * ice_set_vlan_offload_features - set VLAN offload features for the PF VSI
  * @vsi: PF's VSI
@@ -5972,6 +5989,11 @@ ice_set_vlan_offload_features(struct ice_vsi *vsi, netdev_features_t features)
 	if (strip_err || insert_err)
 		return -EIO;
 
+	if (enable_stripping)
+		ice_set_rx_rings_vlan_proto(vsi, htons(vlan_ethertype));
+	else
+		ice_set_rx_rings_vlan_proto(vsi, 0);
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4e6546d9cf85..1eee7d98f92c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1278,7 +1278,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 			continue;
 		}
 
-		vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
+		vlan_tag = ice_get_vlan_tag(rx_desc);
 
 		/* pad the skb if needed, to make a valid ethernet frame */
 		if (eth_skb_pad(skb))
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 4237702a58a9..41e0b14e6643 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -260,6 +260,7 @@ enum ice_rx_dtype {
 struct ice_pkt_ctx {
 	const union ice_32b_rx_flex_desc *eop_desc;
 	u64 cached_phctime;
+	__be16 vlan_proto;
 };
 
 struct ice_xdp_buff {
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index b11cfaedb81c..c290c9d20c5c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -639,7 +639,33 @@ static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+/**
+ * ice_xdp_rx_vlan_tag - VLAN tag XDP hint handler
+ * @ctx: XDP buff pointer
+ * @vlan_tag: destination address for VLAN tag
+ * @vlan_proto: destination address for VLAN protocol
+ *
+ * Copy the VLAN tag (if it was stripped) and the corresponding protocol
+ * to the destination addresses.
+ */
+static int ice_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
+			       __be16 *vlan_proto)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+
+	*vlan_proto = xdp_ext->pkt_ctx.vlan_proto;
+	if (!*vlan_proto)
+		return -ENODATA;
+
+	*vlan_tag = ice_get_vlan_tag(xdp_ext->pkt_ctx.eop_desc);
+	if (!*vlan_tag)
+		return -ENODATA;
+
+	return 0;
+}
+
 const struct xdp_metadata_ops ice_xdp_md_ops = {
 	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
 	.xmo_rx_hash			= ice_xdp_rx_hash,
+	.xmo_rx_vlan_tag		= ice_xdp_rx_vlan_tag,
 };
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 145883eec129..d0af716c1497 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -84,7 +84,7 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
 }
 
 /**
- * ice_get_vlan_tag_from_rx_desc - get VLAN from Rx flex descriptor
+ * ice_get_vlan_tag - get VLAN from Rx flex descriptor
  * @rx_desc: Rx 32b flex descriptor with RXDID=2
  *
  * The OS and current PF implementation only support stripping a single VLAN tag
@@ -92,7 +92,7 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
  * one is found return the tag, else return 0 to mean no VLAN tag was found.
  */
 static inline u16
-ice_get_vlan_tag_from_rx_desc(union ice_32b_rx_flex_desc *rx_desc)
+ice_get_vlan_tag(const union ice_32b_rx_flex_desc *rx_desc)
 {
 	u16 stat_err_bits;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 197ebefc6307..cf205ea177fb 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -859,7 +859,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
+		vlan_tag = ice_get_vlan_tag(rx_desc);
 
 		rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
 				       ICE_RX_FLEX_DESC_PTYPE_M;
-- 
2.41.0



* [PATCH bpf-next v2 11/20] ice: use VLAN proto from ring packet context in skb path
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (9 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 10/20] ice: Implement " Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 12/20] xdp: Add checksum level hint Larysa Zaremba
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

The VLAN proto used in the ice XDP hints implementation is stored in the ring
packet context. Use this value in skb VLAN processing too, instead of checking
netdev features.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index c290c9d20c5c..e9f334fecdf1 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -291,13 +291,9 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
 void
 ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
 {
-	netdev_features_t features = rx_ring->netdev->features;
-	bool non_zero_vlan = !!(vlan_tag & VLAN_VID_MASK);
-
-	if ((features & NETIF_F_HW_VLAN_CTAG_RX) && non_zero_vlan)
-		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
-	else if ((features & NETIF_F_HW_VLAN_STAG_RX) && non_zero_vlan)
-		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD), vlan_tag);
+	if (vlan_tag & VLAN_VID_MASK && rx_ring->pkt_ctx.vlan_proto)
+		__vlan_hwaccel_put_tag(skb, rx_ring->pkt_ctx.vlan_proto,
+				       vlan_tag);
 
 	napi_gro_receive(&rx_ring->q_vector->napi, skb);
 }
-- 
2.41.0



* [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (10 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 11/20] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 20:38   ` John Fastabend
  2023-07-03 18:12 ` [PATCH bpf-next v2 13/20] ice: Implement " Larysa Zaremba
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Implement functionality that enables drivers to tell XDP code
whether checksums were checked and on what level.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
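
To make the level semantics concrete, here is a small sketch of how a consumer might turn the kfunc result into "how many checksums did the hardware vouch for". The helper name demo_csums_verified() is hypothetical. The sketch assumes only what the kfunc documentation in this patch states: 0 on success, a negative errno otherwise, and level 0 meaning that only the outermost checksum was verified.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Number of checksums, counted from the outermost header inwards,
 * that were reported as verified. Any error means none can be trusted.
 */
static unsigned int demo_csums_verified(int ret, uint8_t csum_level)
{
	if (ret)	/* -EOPNOTSUPP, -ENODATA or any other error */
		return 0;

	return (unsigned int)csum_level + 1;
}
```

A program that handles an encapsulated flow could, for example, skip its own inner checksum validation only when this helper returns at least 2.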
 Documentation/networking/xdp-rx-metadata.rst |  3 +++
 include/linux/netdevice.h                    |  1 +
 include/net/xdp.h                            |  2 ++
 kernel/bpf/offload.c                         |  2 ++
 net/core/xdp.c                               | 21 ++++++++++++++++++++
 5 files changed, 29 insertions(+)

diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index ea6dd79a21d3..4ec6ddfd2a52 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -26,6 +26,9 @@ metadata is supported, this set will grow:
 .. kernel-doc:: net/core/xdp.c
    :identifiers: bpf_xdp_metadata_rx_vlan_tag
 
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_csum_lvl
+
 An XDP program can use these kfuncs to read the metadata into stack
 variables for its own consumption. Or, to pass the metadata on to other
 consumers, an XDP program can store it into the metadata area carried
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4fa4380e6d89..569563687172 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
 			       enum xdp_rss_hash_type *rss_type);
 	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
 				   __be16 *vlan_proto);
+	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 89c58f56ffc6..61ed38fa79d1 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 			   bpf_xdp_metadata_rx_hash) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
 			   bpf_xdp_metadata_rx_vlan_tag) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
+			   bpf_xdp_metadata_rx_csum_lvl) \
 
 enum {
 #define XDP_METADATA_KFUNC(name, _) name,
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 986e7becfd42..a133fb775f49 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 		p = ops->xmo_rx_hash;
 	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
 		p = ops->xmo_rx_vlan_tag;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
+		p = ops->xmo_rx_csum_lvl;
 out:
 	up_read(&bpf_devs_lock);
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f6262c90e45f..c666d3e0a26c 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
+ * @ctx: XDP context pointer.
+ * @csum_level: Return value pointer.
+ *
+ * In case of success, csum_level contains depth of the last verified checksum.
+ * If only the outermost checksum was verified, csum_level is 0, if both
+ * encapsulation and inner transport checksums were verified, csum_level is 1,
+ * and so on.
+ * For more details, refer to csum_level field in sk_buff.
+ *
+ * Return:
+ * * Returns 0 on success or ``-errno`` on error.
+ * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
+ * * ``-ENODATA``    : Checksum was not validated
+ */
+__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-- 
2.41.0



* [PATCH bpf-next v2 13/20] ice: Implement checksum level hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (11 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 12/20] xdp: Add checksum level hint Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Implement .xmo_rx_csum_lvl callback to allow XDP code to determine
whether the checksum was checked by hardware and on what level.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index e9f334fecdf1..41ab52b6990d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -660,8 +660,34 @@ static int ice_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
 	return 0;
 }
 
+/**
+ * ice_xdp_rx_csum_lvl - Get level at which HW has checked the checksum
+ * @ctx: XDP buff pointer
+ * @csum_lvl: destination address
+ *
+ * Copy HW checksum level (if it was checked) to the destination address.
+ */
+static int ice_xdp_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_lvl)
+{
+	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
+	const union ice_32b_rx_flex_desc *eop_desc;
+	enum ice_rx_csum_status status;
+	u16 ptype;
+
+	eop_desc = xdp_ext->pkt_ctx.eop_desc;
+	ptype = ice_get_ptype(eop_desc);
+
+	status = ice_get_rx_csum_status(eop_desc, ptype);
+	if (status & ICE_RX_CSUM_NONE)
+		return -ENODATA;
+
+	*csum_lvl = ice_rx_csum_lvl(status);
+	return 0;
+}
+
 const struct xdp_metadata_ops ice_xdp_md_ops = {
 	.xmo_rx_timestamp		= ice_xdp_rx_hw_ts,
 	.xmo_rx_hash			= ice_xdp_rx_hash,
 	.xmo_rx_vlan_tag		= ice_xdp_rx_vlan_tag,
+	.xmo_rx_csum_lvl		= ice_xdp_rx_csum_lvl,
 };
-- 
2.41.0



* [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (12 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 13/20] ice: Implement " Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:31   ` Stanislav Fomichev
  2023-07-03 18:12 ` [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32 Larysa Zaremba
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Make VLAN c-tag and s-tag XDP hint testing more convenient
by not skipping VLAN-ed packets.

Allow both 802.1ad and 802.1Q headers.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
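
The parsing added below can be modelled in plain userspace C. The sketch skips at most one outer tag (802.1ad or 802.1Q) followed by at most one inner 802.1Q tag, with a bounds check before every read. The function and macro names are hypothetical, and the sketch works on a byte array instead of the XDP context.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_ALEN     6		/* bytes per MAC address */
#define DEMO_VLAN_HLEN    4		/* TPID + TCI */
#define DEMO_ETH_P_8021Q  0x8100
#define DEMO_ETH_P_8021AD 0x88A8

/* Read a big-endian 16-bit value without alignment assumptions. */
static uint16_t demo_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the offset of the first EtherType field that is not one of the
 * skipped VLAN tags, or -1 if the frame is too short.
 */
static int demo_skip_vlan(const uint8_t *pkt, size_t len)
{
	size_t off = 2 * DEMO_ETH_ALEN;	/* EtherType follows both MACs */
	uint16_t proto;

	if (len < off + 2)
		return -1;
	proto = demo_be16(pkt + off);

	/* Outer tag: either S-tag (802.1ad) or C-tag (802.1Q). */
	if (proto == DEMO_ETH_P_8021AD || proto == DEMO_ETH_P_8021Q) {
		off += DEMO_VLAN_HLEN;
		if (len < off + 2)
			return -1;
		proto = demo_be16(pkt + off);
	}

	/* Inner tag: only 802.1Q is expected here. */
	if (proto == DEMO_ETH_P_8021Q) {
		off += DEMO_VLAN_HLEN;
		if (len < off + 2)
			return -1;
	}

	return (int)off;
}
```

For an untagged frame the returned offset is 12. A frame carrying an 802.1ad tag followed by an 802.1Q tag yields 20.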
 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c | 10 +++++++++-
 tools/testing/selftests/bpf/xdp_metadata.h          |  8 ++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index b2dfd7066c6e..63d7de6c6bbb 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -26,15 +26,23 @@ int rx(struct xdp_md *ctx)
 {
 	void *data, *data_meta, *data_end;
 	struct ipv6hdr *ip6h = NULL;
-	struct ethhdr *eth = NULL;
 	struct udphdr *udp = NULL;
 	struct iphdr *iph = NULL;
 	struct xdp_meta *meta;
+	struct ethhdr *eth;
 	int err;
 
 	data = (void *)(long)ctx->data;
 	data_end = (void *)(long)ctx->data_end;
 	eth = data;
+
+	if (eth + 1 < data_end && (eth->h_proto == bpf_htons(ETH_P_8021AD) ||
+				   eth->h_proto == bpf_htons(ETH_P_8021Q)))
+		eth = (void *)eth + sizeof(struct vlan_hdr);
+
+	if (eth + 1 < data_end && eth->h_proto == bpf_htons(ETH_P_8021Q))
+		eth = (void *)eth + sizeof(struct vlan_hdr);
+
 	if (eth + 1 < data_end) {
 		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
 			iph = (void *)(eth + 1);
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 938a729bd307..6664893c2c77 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -9,6 +9,14 @@
 #define ETH_P_IPV6 0x86DD
 #endif
 
+#ifndef ETH_P_8021Q
+#define ETH_P_8021Q 0x8100
+#endif
+
+#ifndef ETH_P_8021AD
+#define ETH_P_8021AD 0x88A8
+#endif
+
 struct xdp_meta {
 	__u64 rx_timestamp;
 	__u64 xdp_timestamp;
-- 
2.41.0



* [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (13 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-03 21:06   ` John Fastabend
  2023-07-03 18:12 ` [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata Larysa Zaremba
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Aleksander Lobakin

From: Aleksander Lobakin <aleksander.lobakin@intel.com>

When using XDP hints, metadata sometimes has to be much bigger
than 32 bytes. Relax the restriction, allow metadata larger than 32 bytes
and make __skb_metadata_differs() work with bigger lengths.

The size of metadata is now limited only by the fact that it is stored as u8
in skb_shared_info, so the maximum possible value is 255. Other important
conditions, such as having enough space for xdp_frame building, are already
checked in bpf_xdp_adjust_meta().

The requirement of having its length aligned to 4 bytes is still
valid.

Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
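
The relaxed length rule can be tried out in userspace with the sketch below. It mirrors the two conditions described above (a multiple of 4 bytes, and no larger than a u8 can hold). The names demo_metalen_invalid() and DEMO_META_MAX are hypothetical stand-ins, and the limit is hard-coded here rather than derived from skb_shared_info.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_META_MAX UINT8_MAX	/* the length is stored in a u8 field */

/* True if the metadata length must be rejected. */
static bool demo_metalen_invalid(unsigned long metalen)
{
	return (metalen & (sizeof(uint32_t) - 1)) || metalen > DEMO_META_MAX;
}
```

Since both conditions apply together, the largest length this check accepts is 252: 255 fits in a u8 but is not a multiple of 4.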
 include/linux/skbuff.h | 13 ++++++++-----
 include/net/xdp.h      |  7 ++++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..cd49cdd71019 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4209,10 +4209,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
 {
 	const void *a = skb_metadata_end(skb_a);
 	const void *b = skb_metadata_end(skb_b);
-	/* Using more efficient varaiant than plain call to memcmp(). */
-#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
 	u64 diffs = 0;
 
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+	    BITS_PER_LONG != 64)
+		goto slow;
+
+	/* Using more efficient variant than plain call to memcmp(). */
 	switch (meta_len) {
 #define __it(x, op) (x -= sizeof(u##op))
 #define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
@@ -4232,11 +4235,11 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
 		fallthrough;
 	case  4: diffs |= __it_diff(a, b, 32);
 		break;
+	default:
+slow:
+		return memcmp(a - meta_len, b - meta_len, meta_len);
 	}
 	return diffs;
-#else
-	return memcmp(a - meta_len, b - meta_len, meta_len);
-#endif
 }
 
 static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 61ed38fa79d1..3008042a00e3 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -370,7 +370,12 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 
 static inline bool xdp_metalen_invalid(unsigned long metalen)
 {
-	return (metalen & (sizeof(__u32) - 1)) || (metalen > 32);
+	typeof(metalen) meta_max;
+
+	meta_max = type_max(typeof_member(struct skb_shared_info, meta_len));
+	BUILD_BUG_ON(!__builtin_constant_p(meta_max));
+
+	return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
 }
 
 struct xdp_attachment_info {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (14 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32 Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-04 11:03   ` Jesper Dangaard Brouer
  2023-07-03 18:12 ` [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint Larysa Zaremba
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Add the hints introduced in the previous patches (VLAN tag and checksum
level) to the xdp_hw_metadata program.
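
For readers unfamiliar with the layout of the 16-bit VLAN tag, the
sketch below splits it into its three fields using plain shifts and
masks. The struct and function names are made up for this example; the
bit positions follow the masks used by print_vlan_tag() in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields of a VLAN tag. */
struct vlan_tci_fields {
	uint8_t pcp;	/* Priority Code Point, bits 15..13 */
	bool dei;	/* CFI / Drop Eligible Indicator, bit 12 */
	uint16_t vid;	/* VLAN Identifier, bits 11..0 */
};

static struct vlan_tci_fields vlan_tci_decode(uint16_t tci)
{
	struct vlan_tci_fields f = {
		.pcp = (uint8_t)(tci >> 13),
		.dei = (tci >> 12) & 0x1,
		.vid = tci & 0x0FFF,
	};

	return f;
}
```

When the priority and DEI bits are all zero, the whole tag equals the
VLAN ID.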

Also, to make the metadata layout more straightforward, add a flags field
that reports the validity of each hint separately.
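
The idea can be sketched in isolation as follows. This is a simplified
model, not the selftest's struct xdp_meta: the type and helper names
are invented here, and only two hints are shown. Each hint shares
storage with its error code, and a bit in hint_valid says which of the
two the consumer should read.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity bits, one per hint. */
enum meta_field {
	META_FIELD_TS	= 1u << 0,
	META_FIELD_VLAN	= 1u << 2,
};

/* Simplified model: value and error code overlap in a union. */
struct meta {
	union {
		uint64_t rx_timestamp;
		int32_t rx_timestamp_err;
	};
	union {
		uint16_t rx_vlan_tag;
		int32_t rx_vlan_tag_err;
	};
	uint32_t hint_valid;
};

/* Record the outcome of one hint read; err == 0 means tag is valid. */
static void meta_set_vlan(struct meta *m, int err, uint16_t tag)
{
	if (err) {
		m->rx_vlan_tag_err = err;
	} else {
		m->rx_vlan_tag = tag;
		m->hint_valid |= META_FIELD_VLAN;
	}
}

static bool meta_has(const struct meta *m, uint32_t field)
{
	return (m->hint_valid & field) != 0;
}
```

With an explicit bit per hint, a value of zero no longer has to double
as a "not available" signal.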

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 35 +++++++++++++---
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 42 ++++++++++++++++---
 tools/testing/selftests/bpf/xdp_metadata.h    | 28 ++++++++++++-
 3 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index 63d7de6c6bbb..f46f75db21b4 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -20,6 +20,11 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
+					__u16 *vlan_tag,
+					__be16 *vlan_proto) __ksym;
+extern int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx,
+					__u8 *csum_level) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -84,15 +89,35 @@ int rx(struct xdp_md *ctx)
 		return XDP_PASS;
 	}
 
+	meta->hint_valid = 0;
+
 	err = bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp);
-	if (!err)
+	if (err) {
+		meta->rx_timestamp_err = err;
+	} else {
+		meta->hint_valid |= XDP_META_FIELD_TS;
 		meta->xdp_timestamp = bpf_ktime_get_tai_ns();
+	}
+
+	err = bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash,
+				       &meta->rx_hash_type);
+	if (err)
+		meta->rx_hash_err = err;
 	else
-		meta->rx_timestamp = 0; /* Used by AF_XDP as not avail signal */
+		meta->hint_valid |= XDP_META_FIELD_RSS;
 
-	err = bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
-	if (err < 0)
-		meta->rx_hash_err = err; /* Used by AF_XDP as no hash signal */
+	err = bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag,
+					   &meta->rx_vlan_proto);
+	if (err)
+		meta->rx_vlan_tag_err = err;
+	else
+		meta->hint_valid |= XDP_META_FIELD_VLAN_TAG;
+
+	err = bpf_xdp_metadata_rx_csum_lvl(ctx, &meta->rx_csum_lvl);
+	if (err)
+		meta->rx_csum_err = err;
+	else
+		meta->hint_valid |= XDP_META_FIELD_CSUM_LVL;
 
 	__sync_add_and_fetch(&pkts_redir, 1);
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..d234cbcc9103 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -19,6 +19,9 @@
 #include "xsk.h"
 
 #include <error.h>
+#include <linux/kernel.h>
+#include <linux/bits.h>
+#include <linux/bitfield.h>
 #include <linux/errqueue.h>
 #include <linux/if_link.h>
 #include <linux/net_tstamp.h>
@@ -150,21 +153,34 @@ static __u64 gettime(clockid_t clock_id)
 	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
 }
 
+#define VLAN_PRIO_MASK		GENMASK(15, 13) /* Priority Code Point */
+#define VLAN_CFI_MASK		GENMASK(12, 12) /* Canonical Format / Drop Eligible Indicator */
+#define VLAN_VID_MASK		GENMASK(11, 0)	/* VLAN Identifier */
+static void print_vlan_tag(__u16 tag)
+{
+	__u16 vlan_id = FIELD_GET(VLAN_VID_MASK, tag);
+	__u8 pcp = FIELD_GET(VLAN_PRIO_MASK, tag);
+	bool cfi = FIELD_GET(VLAN_CFI_MASK, tag);
+
+	printf("PCP=%u, CFI=%d, VID=0x%X\n", pcp, cfi, vlan_id);
+}
+
 static void verify_xdp_metadata(void *data, clockid_t clock_id)
 {
 	struct xdp_meta *meta;
 
 	meta = data - sizeof(*meta);
 
-	if (meta->rx_hash_err < 0)
-		printf("No rx_hash err=%d\n", meta->rx_hash_err);
-	else
+	if (meta->hint_valid & XDP_META_FIELD_RSS)
 		printf("rx_hash: 0x%X with RSS type:0x%X\n",
 		       meta->rx_hash, meta->rx_hash_type);
+	else
+		printf("No rx_hash, err=%d\n", meta->rx_hash_err);
+
+	if (meta->hint_valid & XDP_META_FIELD_TS) {
+		printf("rx_timestamp:  %llu (sec:%0.4f)\n", meta->rx_timestamp,
+		       (double)meta->rx_timestamp / NANOSEC_PER_SEC);
 
-	printf("rx_timestamp:  %llu (sec:%0.4f)\n", meta->rx_timestamp,
-	       (double)meta->rx_timestamp / NANOSEC_PER_SEC);
-	if (meta->rx_timestamp) {
 		__u64 usr_clock = gettime(clock_id);
 		__u64 xdp_clock = meta->xdp_timestamp;
 		__s64 delta_X = xdp_clock - meta->rx_timestamp;
@@ -179,8 +195,22 @@ static void verify_xdp_metadata(void *data, clockid_t clock_id)
 		       usr_clock, (double)usr_clock / NANOSEC_PER_SEC,
 		       (double)delta_X2U / NANOSEC_PER_SEC,
 		       (double)delta_X2U / 1000);
+	} else {
+		printf("No rx_timestamp, err=%d\n", meta->rx_timestamp_err);
 	}
 
+	if (meta->hint_valid & XDP_META_FIELD_VLAN_TAG) {
+		printf("rx_vlan_proto: 0x%X\n", ntohs(meta->rx_vlan_proto));
+		printf("rx_vlan_tag: ");
+		print_vlan_tag(meta->rx_vlan_tag);
+	} else {
+		printf("No rx_vlan_tag or rx_vlan_proto, err=%d\n", meta->rx_vlan_tag_err);
+	}
+
+	if (meta->hint_valid & XDP_META_FIELD_CSUM_LVL)
+		printf("Checksum was checked at level %u\n", meta->rx_csum_lvl);
+	else
+		printf("Checksum was not checked, err=%d\n", meta->rx_csum_err);
 }
 
 static void verify_skb_metadata(int fd)
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 6664893c2c77..ff1372244d34 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -17,12 +17,38 @@
 #define ETH_P_8021AD 0x88A8
 #endif
 
+#ifndef BIT
+#define BIT(nr)			(1 << (nr))
+#endif
+
+enum xdp_meta_field {
+	XDP_META_FIELD_TS	= BIT(0),
+	XDP_META_FIELD_RSS	= BIT(1),
+	XDP_META_FIELD_VLAN_TAG	= BIT(2),
+	XDP_META_FIELD_CSUM_LVL	= BIT(3),
+};
+
 struct xdp_meta {
-	__u64 rx_timestamp;
+	union {
+		__u64 rx_timestamp;
+		__s32 rx_timestamp_err;
+	};
 	__u64 xdp_timestamp;
 	__u32 rx_hash;
 	union {
 		__u32 rx_hash_type;
 		__s32 rx_hash_err;
 	};
+	union {
+		struct {
+			__u16 rx_vlan_tag;
+			__be16 rx_vlan_proto;
+		};
+		__s32 rx_vlan_tag_err;
+	};
+	union {
+		__u8 rx_csum_lvl;
+		__s32 rx_csum_err;
+	};
+	enum xdp_meta_field hint_valid;
 };
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (15 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:25   ` Stanislav Fomichev
  2023-07-03 18:12 ` [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

In order to test VLAN tag and checksum level XDP hints in
hardware-independent selftests, implement the newly added XDP hints in
the veth driver.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..a7f2b679551d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1732,6 +1732,44 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
+				__be16 *vlan_proto)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+	struct sk_buff *skb = _ctx->skb;
+	int err;
+
+	if (!skb)
+		return -ENODATA;
+
+	err = __vlan_hwaccel_get_tag(skb, vlan_tag);
+	if (err)
+		return err;
+
+	*vlan_proto = skb->vlan_proto;
+	return err;
+}
+
+static int veth_xdp_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+	struct sk_buff *skb = _ctx->skb;
+
+	if (!skb)
+		return -ENODATA;
+
+	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
+		*csum_level = skb->csum_level;
+	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
+		 skb_checksum_start_offset(skb) == skb_transport_offset(skb) ||
+		 skb->csum_valid)
+		*csum_level = 0;
+	else
+		return -ENODATA;
+
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1756,6 +1794,8 @@ static const struct net_device_ops veth_netdev_ops = {
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
 	.xmo_rx_hash			= veth_xdp_rx_hash,
+	.xmo_rx_vlan_tag		= veth_xdp_rx_vlan_tag,
+	.xmo_rx_csum_lvl		= veth_xdp_rx_csum_lvl,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (16 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:39   ` Stanislav Fomichev
  2023-07-03 18:12 ` [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
  2023-07-03 18:12 ` [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level " Larysa Zaremba
  19 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

The easiest way to simulate a stripped VLAN tag in veth is to send a
packet from a VLAN interface attached to veth. Unfortunately, this
approach is incompatible with AF_XDP on the TX side, because VLAN
interfaces do not support it.

Replace AF_XDP packet generation with sending the same datagram via an
AF_INET socket.
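
A minimal standalone sketch of sending one UDP datagram through an
AF_INET socket is shown below. The function name and its parameters are
illustrative only and do not match the helper added by this patch; the
socket calls are the standard POSIX ones.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send one UDP datagram to dst_ip:dst_port.
 * Returns the number of bytes sent, or -1 on any error.
 */
static ssize_t send_udp_datagram(const char *dst_ip, unsigned short dst_port,
				 const void *payload, size_t len)
{
	struct sockaddr_in dst;
	ssize_t sent;
	int fd;

	/* Zero the whole address so no field is left uninitialized. */
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(dst_port);
	if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (fd < 0)
		return -1;

	/* The kernel builds the Ethernet, IP and UDP headers itself. */
	sent = sendto(fd, payload, len, MSG_DONTWAIT,
		      (const struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return sent;
}
```

Because the stack builds the headers, the hand-written IP checksum and
header setup needed for raw AF_XDP transmission are no longer required.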

This does not change the packet contents or hint values, with one notable
exception: rx_hash_type, which was previously expected to be 0, is now
expected to have at least the XDP_RSS_TYPE_L4 bit set.

Also, using AF_INET requires a slightly more complicated namespace setup,
so the open_netns() helper function is split into smaller reusable
pieces.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 tools/testing/selftests/bpf/network_helpers.c |  37 +++-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
 3 files changed, 98 insertions(+), 117 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..19463230ece5 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -386,28 +386,51 @@ char *ping_command(int family)
 	return "ping";
 }
 
+int get_cur_netns(void)
+{
+	int nsfd;
+
+	nsfd = open("/proc/self/ns/net", O_RDONLY);
+	ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
+	return nsfd;
+}
+
+int get_netns(const char *name)
+{
+	char nspath[PATH_MAX];
+	int nsfd;
+
+	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
+	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
+	ASSERT_GE(nsfd, 0, "open netns fd");
+	return nsfd;
+}
+
+int set_netns(int netns_fd)
+{
+	return setns(netns_fd, CLONE_NEWNET);
+}
+
 struct nstoken {
 	int orig_netns_fd;
 };
 
 struct nstoken *open_netns(const char *name)
 {
+	struct nstoken *token;
 	int nsfd;
-	char nspath[PATH_MAX];
 	int err;
-	struct nstoken *token;
 
 	token = calloc(1, sizeof(struct nstoken));
 	if (!ASSERT_OK_PTR(token, "malloc token"))
 		return NULL;
 
-	token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
-	if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
+	token->orig_netns_fd = get_cur_netns();
+	if (token->orig_netns_fd < 0)
 		goto fail;
 
-	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
-	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
-	if (!ASSERT_GE(nsfd, 0, "open netns fd"))
+	nsfd = get_netns(name);
+	if (nsfd < 0)
 		goto fail;
 
 	err = setns(nsfd, CLONE_NEWNET);
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..b18b9619595c 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -58,6 +58,8 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
 char *ping_command(int family);
 int get_socket_local_port(int sock_fd);
 
+int get_cur_netns(void);
+int get_netns(const char *name);
 struct nstoken;
 /**
  * open_netns() - Switch to specified network namespace by name.
@@ -67,4 +69,5 @@ struct nstoken;
  */
 struct nstoken *open_netns(const char *name);
 void close_netns(struct nstoken *token);
+int set_netns(int netns_fd);
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..53b32a641e8e 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -20,7 +20,7 @@
 
 #define UDP_PAYLOAD_BYTES 4
 
-#define AF_XDP_SOURCE_PORT 1234
+#define UDP_SOURCE_PORT 1234
 #define AF_XDP_CONSUMER_PORT 8080
 
 #define UMEM_NUM 16
@@ -33,6 +33,12 @@
 #define RX_ADDR "10.0.0.2"
 #define PREFIX_LEN "8"
 #define FAMILY AF_INET
+#define TX_NETNS_NAME "xdp_metadata_tx"
+#define RX_NETNS_NAME "xdp_metadata_rx"
+#define TX_MAC "00:00:00:00:00:01"
+#define RX_MAC "00:00:00:00:00:02"
+
+#define XDP_RSS_TYPE_L4 BIT(3)
 
 struct xsk {
 	void *umem_area;
@@ -119,90 +125,28 @@ static void close_xsk(struct xsk *xsk)
 	munmap(xsk->umem_area, UMEM_SIZE);
 }
 
-static void ip_csum(struct iphdr *iph)
+static int generate_packet_udp(void)
 {
-	__u32 sum = 0;
-	__u16 *p;
-	int i;
-
-	iph->check = 0;
-	p = (void *)iph;
-	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
-		sum += p[i];
-
-	while (sum >> 16)
-		sum = (sum & 0xffff) + (sum >> 16);
-
-	iph->check = ~sum;
-}
-
-static int generate_packet(struct xsk *xsk, __u16 dst_port)
-{
-	struct xdp_desc *tx_desc;
-	struct udphdr *udph;
-	struct ethhdr *eth;
-	struct iphdr *iph;
-	void *data;
-	__u32 idx;
-	int ret;
-
-	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
-	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
-		return -1;
-
-	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
-	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
-	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
-	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
-
-	eth = data;
-	iph = (void *)(eth + 1);
-	udph = (void *)(iph + 1);
-
-	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
-	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
-	eth->h_proto = htons(ETH_P_IP);
-
-	iph->version = 0x4;
-	iph->ihl = 0x5;
-	iph->tos = 0x9;
-	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	iph->id = 0;
-	iph->frag_off = 0;
-	iph->ttl = 0;
-	iph->protocol = IPPROTO_UDP;
-	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
-	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
-	ip_csum(iph);
-
-	udph->source = htons(AF_XDP_SOURCE_PORT);
-	udph->dest = htons(dst_port);
-	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	udph->check = 0;
-
-	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
-
-	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
-	xsk_ring_prod__submit(&xsk->tx, 1);
-
-	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
-	if (!ASSERT_GE(ret, 0, "sendto"))
-		return ret;
-
-	return 0;
-}
-
-static void complete_tx(struct xsk *xsk)
-{
-	__u32 idx;
-	__u64 addr;
-
-	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
-		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
-
-		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
-		xsk_ring_cons__release(&xsk->comp, 1);
-	}
+	char udp_payload[UDP_PAYLOAD_BYTES];
+	struct sockaddr_in rx_addr;
+	int sock_fd, err = 0;
+
+	/* Build a packet */
+	memset(udp_payload, 0xAA, UDP_PAYLOAD_BYTES);
+	rx_addr.sin_addr.s_addr = inet_addr(RX_ADDR);
+	rx_addr.sin_family = AF_INET;
+	rx_addr.sin_port = htons(UDP_SOURCE_PORT);
+
+	sock_fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+	if (!ASSERT_GE(sock_fd, 0, "socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)"))
+		return sock_fd;
+
+	err = sendto(sock_fd, udp_payload, UDP_PAYLOAD_BYTES, MSG_DONTWAIT,
+		     (void *)&rx_addr, sizeof(rx_addr));
+	ASSERT_GE(err, 0, "sendto");
+
+	close(sock_fd);
+	return err;
 }
 
 static void refill_rx(struct xsk *xsk, __u64 addr)
@@ -268,7 +212,8 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
 		return -1;
 
-	ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
+	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
+		return -1;
 
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
@@ -281,40 +226,46 @@ void test_xdp_metadata(void)
 	struct xdp_metadata2 *bpf_obj2 = NULL;
 	struct xdp_metadata *bpf_obj = NULL;
 	struct bpf_program *new_prog, *prog;
-	struct nstoken *tok = NULL;
+	int prev_netns, rx_netns, tx_netns;
 	__u32 queue_id = QUEUE_ID;
 	struct bpf_map *prog_arr;
-	struct xsk tx_xsk = {};
 	struct xsk rx_xsk = {};
 	__u32 val, key = 0;
 	int retries = 10;
 	int rx_ifindex;
-	int tx_ifindex;
 	int sock_fd;
 	int ret;
 
-	/* Setup new networking namespace, with a veth pair. */
+	/* Setup new networking namespaces, with a veth pair. */
 
-	SYS(out, "ip netns add xdp_metadata");
-	tok = open_netns("xdp_metadata");
+	SYS(out, "ip netns add " TX_NETNS_NAME);
+	SYS(out, "ip netns add " RX_NETNS_NAME);
+	prev_netns = get_cur_netns();
+	tx_netns = get_netns(TX_NETNS_NAME);
+	rx_netns = get_netns(RX_NETNS_NAME);
+	if (prev_netns < 0 || tx_netns < 0 || rx_netns < 0)
+		goto close_ns;
+
+	set_netns(tx_netns);
 	SYS(out, "ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
 	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
-	SYS(out, "ip link set dev " TX_NAME " address 00:00:00:00:00:01");
-	SYS(out, "ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS(out, "ip link set " RX_NAME " netns " RX_NETNS_NAME);
+
+	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
 	SYS(out, "ip link set dev " TX_NAME " up");
-	SYS(out, "ip link set dev " RX_NAME " up");
 	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
-	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
 
+	/* Avoid ARP calls */
+	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
+
+	set_netns(rx_netns);
+	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
+	SYS(out, "ip link set dev " RX_NAME " up");
+	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
 	rx_ifindex = if_nametoindex(RX_NAME);
-	tx_ifindex = if_nametoindex(TX_NAME);
 
 	/* Setup separate AF_XDP for TX and RX interfaces. */
 
-	ret = open_xsk(tx_ifindex, &tx_xsk);
-	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
-		goto out;
-
 	ret = open_xsk(rx_ifindex, &rx_xsk);
 	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
 		goto out;
@@ -355,17 +306,16 @@ void test_xdp_metadata(void)
 		goto out;
 
 	/* Send packet destined to RX AF_XDP socket. */
-	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
-		       "generate AF_XDP_CONSUMER_PORT"))
+	set_netns(tx_netns);
+	if (!ASSERT_GE(generate_packet_udp(), 0, "generate UDP packet"))
 		goto out;
 
 	/* Verify AF_XDP RX packet has proper metadata. */
+	set_netns(rx_netns);
 	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
 		       "verify_xsk_metadata"))
 		goto out;
 
-	complete_tx(&tx_xsk);
-
 	/* Make sure freplace correctly picks up original bound device
 	 * and doesn't crash.
 	 */
@@ -384,10 +334,11 @@ void test_xdp_metadata(void)
 		goto out;
 
 	/* Send packet to trigger . */
-	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
-		       "generate freplace packet"))
+	set_netns(tx_netns);
+	if (!ASSERT_GE(generate_packet_udp(), 0, "generate freplace packet"))
 		goto out;
 
+	set_netns(rx_netns);
 	while (!retries--) {
 		if (bpf_obj2->bss->called)
 			break;
@@ -397,10 +348,14 @@ void test_xdp_metadata(void)
 
 out:
 	close_xsk(&rx_xsk);
-	close_xsk(&tx_xsk);
 	xdp_metadata2__destroy(bpf_obj2);
 	xdp_metadata__destroy(bpf_obj);
-	if (tok)
-		close_netns(tok);
-	SYS_NOFAIL("ip netns del xdp_metadata");
+	set_netns(prev_netns);
+close_ns:
+	close(prev_netns);
+	close(tx_netns);
+	close(rx_netns);
+
+	SYS_NOFAIL("ip netns del " RX_NETNS_NAME);
+	SYS_NOFAIL("ip netns del " TX_NETNS_NAME);
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto in xdp_metadata
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (17 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:41   ` Stanislav Fomichev
  2023-07-06 10:10   ` Jesper Dangaard Brouer
  2023-07-03 18:12 ` [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level " Larysa Zaremba
  19 siblings, 2 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Verify that the VLAN tag and protocol are set correctly.

To simulate a "stripped" VLAN tag on veth, send the test packet from a
VLAN interface.

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 21 +++++++++++++++++--
 .../selftests/bpf/progs/xdp_metadata.c        |  4 ++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 53b32a641e8e..50ac9f570bc5 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -38,6 +38,13 @@
 #define TX_MAC "00:00:00:00:00:01"
 #define RX_MAC "00:00:00:00:00:02"
 
+#define VLAN_ID 59
+#define VLAN_ID_STR "59"
+#define VLAN_PROTO "802.1Q"
+#define VLAN_PID htons(ETH_P_8021Q)
+#define TX_NAME_VLAN TX_NAME "." VLAN_ID_STR
+#define RX_NAME_VLAN RX_NAME "." VLAN_ID_STR
+
 #define XDP_RSS_TYPE_L4 BIT(3)
 
 struct xsk {
@@ -215,6 +222,12 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
 		return -1;
 
+	if (!ASSERT_EQ(meta->rx_vlan_tag, VLAN_ID, "rx_vlan_tag"))
+		return -1;
+
+	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
+		return -1;
+
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
 
@@ -253,10 +266,14 @@ void test_xdp_metadata(void)
 
 	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
 	SYS(out, "ip link set dev " TX_NAME " up");
-	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+
+	SYS(out, "ip link add link " TX_NAME " " TX_NAME_VLAN
+		 " type vlan proto " VLAN_PROTO " id " VLAN_ID_STR);
+	SYS(out, "ip link set dev " TX_NAME_VLAN " up");
+	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME_VLAN);
 
 	/* Avoid ARP calls */
-	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
+	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME_VLAN);
 
 	set_netns(rx_netns);
 	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index d151d406a123..382984a5d1c9 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -23,6 +23,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
+					__u16 *vlan_tag,
+					__be16 *vlan_proto) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -57,6 +60,7 @@ int rx(struct xdp_md *ctx)
 		meta->rx_timestamp = 1;
 
 	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
+	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
 
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level in xdp_metadata
  2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
                   ` (18 preceding siblings ...)
  2023-07-03 18:12 ` [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
@ 2023-07-03 18:12 ` Larysa Zaremba
  2023-07-05 17:41   ` Stanislav Fomichev
  2023-07-06 10:25   ` Jesper Dangaard Brouer
  19 siblings, 2 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-03 18:12 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Verify that the kfunc in the xdp_metadata test correctly returns a
checksum level of zero.
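
Since zero is the expected value, the test program stores 1 when the
kfunc succeeds with level 0, and the userspace side then asserts a
non-zero value. A rough model of that remapping is sketched below. The
helper name is invented, and the sketch assumes the metadata slot reads
as 0 when the kfunc never wrote it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper mirroring the remapping in the test program:
 * a successful read of level 0 is stored as 1, so the consumer can
 * tell it apart from a slot that was never written (assumed 0).
 */
static uint8_t csum_lvl_marker(int ret, uint8_t lvl)
{
	if (!ret && lvl == 0)
		return 1;

	return lvl;
}
```

The marker cannot distinguish a remapped level 0 from a genuine level
1, which is acceptable here only because the test expects level 0.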

Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 +++
 tools/testing/selftests/bpf/progs/xdp_metadata.c      | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 50ac9f570bc5..6c71d712932e 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -228,6 +228,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
 		return -1;
 
+	if (!ASSERT_NEQ(meta->rx_csum_lvl, 0, "rx_csum_lvl"))
+		return -1;
+
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index 382984a5d1c9..6f7223d581b7 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -26,6 +26,8 @@ extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
 					__u16 *vlan_tag,
 					__be16 *vlan_proto) __ksym;
+extern int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx,
+					__u8 *csum_level) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -62,6 +64,11 @@ int rx(struct xdp_md *ctx)
 	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
 	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
 
+	/* Same as with timestamp, zero is expected */
+	ret = bpf_xdp_metadata_rx_csum_lvl(ctx, &meta->rx_csum_lvl);
+	if (!ret && meta->rx_csum_lvl == 0)
+		meta->rx_csum_lvl = 1;
+
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* RE: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-03 18:12 ` [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint Larysa Zaremba
@ 2023-07-03 20:15   ` John Fastabend
  2023-07-04  8:23     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2023-07-03 20:15 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Larysa Zaremba wrote:
> Implement functionality that enables drivers to expose VLAN tag
> to XDP code.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
>  include/linux/netdevice.h                    |  2 ++
>  include/net/xdp.h                            |  2 ++
>  kernel/bpf/offload.c                         |  2 ++
>  net/core/xdp.c                               | 20 ++++++++++++++++++++
>  5 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> index 25ce72af81c2..ea6dd79a21d3 100644
> --- a/Documentation/networking/xdp-rx-metadata.rst
> +++ b/Documentation/networking/xdp-rx-metadata.rst
> @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
>  metadata is supported, this set will grow:
>  
>  .. kernel-doc:: net/core/xdp.c
> -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> +   :identifiers: bpf_xdp_metadata_rx_timestamp
> +
> +.. kernel-doc:: net/core/xdp.c
> +   :identifiers: bpf_xdp_metadata_rx_hash
> +
> +.. kernel-doc:: net/core/xdp.c
> +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
>  
>  An XDP program can use these kfuncs to read the metadata into stack
>  variables for its own consumption. Or, to pass the metadata on to other
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b828c7a75be2..4fa4380e6d89 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1658,6 +1658,8 @@ struct xdp_metadata_ops {
>  	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
>  	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
>  			       enum xdp_rss_hash_type *rss_type);
> +	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> +				   __be16 *vlan_proto);
>  };
>  
>  /**
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 6381560efae2..89c58f56ffc6 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -389,6 +389,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  			   bpf_xdp_metadata_rx_timestamp) \
>  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
>  			   bpf_xdp_metadata_rx_hash) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> +			   bpf_xdp_metadata_rx_vlan_tag) \
>  
>  enum {
>  #define XDP_METADATA_KFUNC(name, _) name,
> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> index 8a26cd8814c1..986e7becfd42 100644
> --- a/kernel/bpf/offload.c
> +++ b/kernel/bpf/offload.c
> @@ -848,6 +848,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
>  		p = ops->xmo_rx_timestamp;
>  	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
>  		p = ops->xmo_rx_hash;
> +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> +		p = ops->xmo_rx_vlan_tag;
>  out:
>  	up_read(&bpf_devs_lock);
>  
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 41e5ca8643ec..f6262c90e45f 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
>  	return -EOPNOTSUPP;
>  }
>  
> +/**
> + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
> + * @ctx: XDP context pointer.
> + * @vlan_tag: Destination pointer for VLAN tag
> + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
> + *
> + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
> + * containing VLAN ID, vlan_proto contains protocol identifier.

Above is a bit confusing to me at least.

The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
are to be included here? The VlanID or the full 16bit TCI meaning the
PCP+DEI+VID? I think by "including 12 least significant bytes" you
mean bits, but also not clear about those 4 other bits.

I can likely figure it out in next patches from implementation but
would be nice to clean up docs.

> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> + * * ``-ENODATA``    : VLAN tag was not stripped or is not available
> + */
> +__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
> +					     __be16 *vlan_proto)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  __diag_pop();
>  
>  BTF_SET8_START(xdp_metadata_kfunc_ids)
> -- 
> 2.41.0
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-03 18:12 ` [PATCH bpf-next v2 12/20] xdp: Add checksum level hint Larysa Zaremba
@ 2023-07-03 20:38   ` John Fastabend
  2023-07-04  9:24     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2023-07-03 20:38 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Larysa Zaremba wrote:
> Implement functionality that enables drivers to expose to XDP code,
> whether checksums was checked and on what level.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  Documentation/networking/xdp-rx-metadata.rst |  3 +++
>  include/linux/netdevice.h                    |  1 +
>  include/net/xdp.h                            |  2 ++
>  kernel/bpf/offload.c                         |  2 ++
>  net/core/xdp.c                               | 21 ++++++++++++++++++++
>  5 files changed, 29 insertions(+)
> 
> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> index ea6dd79a21d3..4ec6ddfd2a52 100644
> --- a/Documentation/networking/xdp-rx-metadata.rst
> +++ b/Documentation/networking/xdp-rx-metadata.rst
> @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
>  .. kernel-doc:: net/core/xdp.c
>     :identifiers: bpf_xdp_metadata_rx_vlan_tag
>  
> +.. kernel-doc:: net/core/xdp.c
> +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> +
>  An XDP program can use these kfuncs to read the metadata into stack
>  variables for its own consumption. Or, to pass the metadata on to other
>  consumers, an XDP program can store it into the metadata area carried
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4fa4380e6d89..569563687172 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
>  			       enum xdp_rss_hash_type *rss_type);
>  	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
>  				   __be16 *vlan_proto);
> +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
>  };
>  
>  /**
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 89c58f56ffc6..61ed38fa79d1 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  			   bpf_xdp_metadata_rx_hash) \
>  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
>  			   bpf_xdp_metadata_rx_vlan_tag) \
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> +			   bpf_xdp_metadata_rx_csum_lvl) \
>  
>  enum {
>  #define XDP_METADATA_KFUNC(name, _) name,
> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> index 986e7becfd42..a133fb775f49 100644
> --- a/kernel/bpf/offload.c
> +++ b/kernel/bpf/offload.c
> @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
>  		p = ops->xmo_rx_hash;
>  	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
>  		p = ops->xmo_rx_vlan_tag;
> +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> +		p = ops->xmo_rx_csum_lvl;
>  out:
>  	up_read(&bpf_devs_lock);
>  
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index f6262c90e45f..c666d3e0a26c 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
>  	return -EOPNOTSUPP;
>  }
>  
> +/**
> + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> + * @ctx: XDP context pointer.
> + * @csum_level: Return value pointer.
> + *
> + * In case of success, csum_level contains depth of the last verified checksum.
> + * If only the outermost checksum was verified, csum_level is 0, if both
> + * encapsulation and inner transport checksums were verified, csum_level is 1,
> + * and so on.
> + * For more details, refer to csum_level field in sk_buff.
> + *
> + * Return:
> + * * Returns 0 on success or ``-errno`` on error.
> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> + * * ``-ENODATA``    : Checksum was not validated
> + */
> +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)

Instead of ENODATA, should we return what would be put in the ip_summed field,
CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then the signature would be,

 bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);

or something like that? Or is the thought that it's not really necessary?
I don't have a strong preference but figured it was worth asking.

> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  __diag_pop();
>  
>  BTF_SET8_START(xdp_metadata_kfunc_ids)
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32
  2023-07-03 18:12 ` [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32 Larysa Zaremba
@ 2023-07-03 21:06   ` John Fastabend
  2023-07-06 14:51     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2023-07-03 21:06 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: Larysa Zaremba, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, Aleksander Lobakin

Larysa Zaremba wrote:
> From: Aleksander Lobakin <aleksander.lobakin@intel.com>
> 
> When using XDP hints, metadata sometimes has to be much bigger
> than 32 bytes. Relax the restriction, allow metadata larger than 32 bytes
> and make __skb_metadata_differs() work with bigger lengths.
> 
> Now size of metadata is only limited by the fact it is stored as u8
> in skb_shared_info, so maximum possible value is 255. Other important
> conditions, such as having enough space for xdp_frame building, are already
> checked in bpf_xdp_adjust_meta().
> 
> The requirement of having its length aligned to 4 bytes is still
> valid.
> 
> Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  include/linux/skbuff.h | 13 ++++++++-----
>  include/net/xdp.h      |  7 ++++++-
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..cd49cdd71019 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4209,10 +4209,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
>  {
>  	const void *a = skb_metadata_end(skb_a);
>  	const void *b = skb_metadata_end(skb_b);
> -	/* Using more efficient varaiant than plain call to memcmp(). */
> -#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64

Why are we removing the ifdef here? It's adding a runtime 'if' when it's not
necessary. I would keep the ifdef and simply add the default case
in the switch.
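
For illustration, a minimal userspace sketch of that idea: the compile-time
guard stays, and the switch gains a default case that falls back to memcmp()
for lengths the unrolled path does not cover. This is not the kernel code.
SKETCH_FAST_COMPARE, sketch_xor64() and sketch_meta_differs() are hypothetical
names, the macro stands in for the kernel's config test, and only lengths that
are multiples of 8 up to 32 are unrolled here to keep it short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical build-time switch standing in for the kernel's
 * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS && BITS_PER_LONG == 64 test.
 */
#ifndef SKETCH_FAST_COMPARE
#define SKETCH_FAST_COMPARE 1
#endif

/* XOR of two 8-byte words; memcpy() keeps unaligned reads well-defined. */
static uint64_t sketch_xor64(const uint8_t *a, const uint8_t *b)
{
	uint64_t x, y;

	memcpy(&x, a, sizeof(x));
	memcpy(&y, b, sizeof(y));
	return x ^ y;
}

/* a_end and b_end point one byte past each metadata area, so the metadata
 * occupies the meta_len bytes immediately before them.
 */
static bool sketch_meta_differs(const void *a_end, const void *b_end,
				uint8_t meta_len)
{
	const uint8_t *a = a_end;
	const uint8_t *b = b_end;
#if SKETCH_FAST_COMPARE
	uint64_t diffs = 0;

	switch (meta_len) {
	case 32:
		diffs |= sketch_xor64(a - 32, b - 32);
		/* fall through */
	case 24:
		diffs |= sketch_xor64(a - 24, b - 24);
		/* fall through */
	case 16:
		diffs |= sketch_xor64(a - 16, b - 16);
		/* fall through */
	case 8:
		diffs |= sketch_xor64(a - 8, b - 8);
		break;
	default:
		/* Any other length, including lengths above 32. */
		return memcmp(a - meta_len, b - meta_len, meta_len) != 0;
	}
	return diffs != 0;
#else
	return memcmp(a - meta_len, b - meta_len, meta_len) != 0;
#endif
}
```

With this shape, the common small lengths keep the branch-free compare on
architectures where it pays off, and no runtime check of the configuration is
needed.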

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-03 20:15   ` John Fastabend
@ 2023-07-04  8:23     ` Larysa Zaremba
  2023-07-04 10:23       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04  8:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
> Larysa Zaremba wrote:
> > Implement functionality that enables drivers to expose VLAN tag
> > to XDP code.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
> >  include/linux/netdevice.h                    |  2 ++
> >  include/net/xdp.h                            |  2 ++
> >  kernel/bpf/offload.c                         |  2 ++
> >  net/core/xdp.c                               | 20 ++++++++++++++++++++
> >  5 files changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > index 25ce72af81c2..ea6dd79a21d3 100644
> > --- a/Documentation/networking/xdp-rx-metadata.rst
> > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
> >  metadata is supported, this set will grow:
> >  
> >  .. kernel-doc:: net/core/xdp.c
> > -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> > +   :identifiers: bpf_xdp_metadata_rx_timestamp
> > +
> > +.. kernel-doc:: net/core/xdp.c
> > +   :identifiers: bpf_xdp_metadata_rx_hash
> > +
> > +.. kernel-doc:: net/core/xdp.c
> > +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
> >  
> >  An XDP program can use these kfuncs to read the metadata into stack
> >  variables for its own consumption. Or, to pass the metadata on to other
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b828c7a75be2..4fa4380e6d89 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1658,6 +1658,8 @@ struct xdp_metadata_ops {
> >  	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
> >  	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
> >  			       enum xdp_rss_hash_type *rss_type);
> > +	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> > +				   __be16 *vlan_proto);
> >  };
> >  
> >  /**
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 6381560efae2..89c58f56ffc6 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -389,6 +389,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >  			   bpf_xdp_metadata_rx_timestamp) \
> >  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
> >  			   bpf_xdp_metadata_rx_hash) \
> > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> > +			   bpf_xdp_metadata_rx_vlan_tag) \
> >  
> >  enum {
> >  #define XDP_METADATA_KFUNC(name, _) name,
> > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > index 8a26cd8814c1..986e7becfd42 100644
> > --- a/kernel/bpf/offload.c
> > +++ b/kernel/bpf/offload.c
> > @@ -848,6 +848,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> >  		p = ops->xmo_rx_timestamp;
> >  	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
> >  		p = ops->xmo_rx_hash;
> > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> > +		p = ops->xmo_rx_vlan_tag;
> >  out:
> >  	up_read(&bpf_devs_lock);
> >  
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 41e5ca8643ec..f6262c90e45f 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
> >  	return -EOPNOTSUPP;
> >  }
> >  
> > +/**
> > + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
> > + * @ctx: XDP context pointer.
> > + * @vlan_tag: Destination pointer for VLAN tag
> > + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
> > + *
> > + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
> > + * containing VLAN ID, vlan_proto contains protocol identifier.
> 
> Above is a bit confusing to me at least.
> 
> The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
> are to be included here? The VlanID or the full 16bit TCI meaning the
> PCP+DEI+VID?

It contains PCP+DEI+VID. This is clearer in patch 16 ("selftests/bpf: Add flags
and new hints to xdp_hw_metadata"), because the tag is parsed there.

What about rephrasing it this way:

In case of success, vlan_proto contains the VLAN protocol identifier (TPID),
vlan_tag contains the remaining 16 bits of an 802.1Q tag (PCP+DEI+VID).

> I think by "including 12 least significant bytes" you
> mean bits,

Yes, my bad.

> but also not clear about those 4 other bits.
> 
> I can likely figure it out in next patches from implementation but
> would be nice to clean up docs.
> 
> > + *
> > + * Return:
> > + * * Returns 0 on success or ``-errno`` on error.
> > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > + * * ``-ENODATA``    : VLAN tag was not stripped or is not available
> > + */
> > +__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
> > +					     __be16 *vlan_proto)
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> > +
> >  __diag_pop();
> >  
> >  BTF_SET8_START(xdp_metadata_kfunc_ids)
> > -- 
> > 2.41.0
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-03 20:38   ` John Fastabend
@ 2023-07-04  9:24     ` Larysa Zaremba
  2023-07-04 10:39       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04  9:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
> Larysa Zaremba wrote:
> > Implement functionality that enables drivers to expose to XDP code,
> > whether checksums was checked and on what level.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  Documentation/networking/xdp-rx-metadata.rst |  3 +++
> >  include/linux/netdevice.h                    |  1 +
> >  include/net/xdp.h                            |  2 ++
> >  kernel/bpf/offload.c                         |  2 ++
> >  net/core/xdp.c                               | 21 ++++++++++++++++++++
> >  5 files changed, 29 insertions(+)
> > 
> > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > index ea6dd79a21d3..4ec6ddfd2a52 100644
> > --- a/Documentation/networking/xdp-rx-metadata.rst
> > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
> >  .. kernel-doc:: net/core/xdp.c
> >     :identifiers: bpf_xdp_metadata_rx_vlan_tag
> >  
> > +.. kernel-doc:: net/core/xdp.c
> > +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> > +
> >  An XDP program can use these kfuncs to read the metadata into stack
> >  variables for its own consumption. Or, to pass the metadata on to other
> >  consumers, an XDP program can store it into the metadata area carried
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 4fa4380e6d89..569563687172 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
> >  			       enum xdp_rss_hash_type *rss_type);
> >  	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> >  				   __be16 *vlan_proto);
> > +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
> >  };
> >  
> >  /**
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 89c58f56ffc6..61ed38fa79d1 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >  			   bpf_xdp_metadata_rx_hash) \
> >  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> >  			   bpf_xdp_metadata_rx_vlan_tag) \
> > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> > +			   bpf_xdp_metadata_rx_csum_lvl) \
> >  
> >  enum {
> >  #define XDP_METADATA_KFUNC(name, _) name,
> > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > index 986e7becfd42..a133fb775f49 100644
> > --- a/kernel/bpf/offload.c
> > +++ b/kernel/bpf/offload.c
> > @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> >  		p = ops->xmo_rx_hash;
> >  	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> >  		p = ops->xmo_rx_vlan_tag;
> > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> > +		p = ops->xmo_rx_csum_lvl;
> >  out:
> >  	up_read(&bpf_devs_lock);
> >  
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index f6262c90e45f..c666d3e0a26c 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
> >  	return -EOPNOTSUPP;
> >  }
> >  
> > +/**
> > + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> > + * @ctx: XDP context pointer.
> > + * @csum_level: Return value pointer.
> > + *
> > + * In case of success, csum_level contains depth of the last verified checksum.
> > + * If only the outermost checksum was verified, csum_level is 0, if both
> > + * encapsulation and inner transport checksums were verified, csum_level is 1,
> > + * and so on.
> > + * For more details, refer to csum_level field in sk_buff.
> > + *
> > + * Return:
> > + * * Returns 0 on success or ``-errno`` on error.
> > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > + * * ``-ENODATA``    : Checksum was not validated
> > + */
> > +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> 
> Instead of ENODATA, should we return what would be put in the ip_summed field,
> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then the signature would be,
> 
>  bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
> 
> or something like that? Or is the thought that it's not really necessary?
> I don't have a strong preference but figured it was worth asking.
>

I see no value in returning CHECKSUM_COMPLETE without the actual checksum value. 
Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would 
overcomplicate the function signature.
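
To make the single-output variant concrete, here is a rough userspace sketch
of how a caller might interpret the return code together with csum_level, as
described in the kernel-doc quoted below. The enum and the function
sketch_classify_csum() are hypothetical helpers for illustration only, and the
"level + 1 layers" reading is my assumption based on the quoted doc text.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical classification used only in this sketch. */
enum sketch_csum_state {
	SKETCH_CSUM_UNKNOWN,		/* hint unavailable, e.g. no driver callback */
	SKETCH_CSUM_NOT_VALIDATED,	/* HW did not validate any checksum */
	SKETCH_CSUM_VALIDATED,		/* at least the outermost checksum is good */
};

/* ret is the kfunc return value; csum_level is only meaningful when ret == 0.
 * On success, *verified_layers receives the number of verified checksums,
 * assuming level 0 means "outermost only" and each further level adds one.
 */
static enum sketch_csum_state
sketch_classify_csum(int ret, uint8_t csum_level, unsigned int *verified_layers)
{
	*verified_layers = 0;

	if (ret == -ENODATA)
		return SKETCH_CSUM_NOT_VALIDATED;
	if (ret)
		return SKETCH_CSUM_UNKNOWN;	/* -EOPNOTSUPP or any other error */

	*verified_layers = (unsigned int)csum_level + 1;
	return SKETCH_CSUM_VALIDATED;
}
```

Under this reading, -ENODATA already covers what CHECKSUM_NONE would convey,
and a successful return covers the CHECKSUM_UNNECESSARY case.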
 
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> > +
> >  __diag_pop();
> >  
> >  BTF_SET8_START(xdp_metadata_kfunc_ids)
> > -- 
> > 2.41.0
> > 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 02/20] ice: make RX HW timestamp reading code more reusable
  2023-07-03 18:12 ` [PATCH bpf-next v2 02/20] ice: make RX HW timestamp " Larysa Zaremba
@ 2023-07-04 10:04   ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04 10:04 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Mon, Jul 03, 2023 at 08:12:08PM +0200, Larysa Zaremba wrote:
> Previously, we only needed RX HW timestamp in skb path,
> hence all related code was written with skb in mind.
> But with the addition of XDP hints via kfuncs to the ice driver,
> the same logic will be needed in .xmo_() callbacks.
> 
> Put generic process of reading RX HW timestamp from a descriptor
> into a separate function.
> Move skb-related code into another source file.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ptp.c      | 24 ++++++------------
>  drivers/net/ethernet/intel/ice/ice_ptp.h      | 15 ++++++-----
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 25 ++++++++++++++++++-
>  3 files changed, 41 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index 81d96a40d5a7..a31333972c68 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -2147,30 +2147,24 @@ int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr)
>  }
>  
>  /**
> - * ice_ptp_rx_hwtstamp - Check for an Rx timestamp
> - * @rx_ring: Ring to get the VSI info
> + * ice_ptp_get_rx_hwts - Get packet Rx timestamp
>   * @rx_desc: Receive descriptor
> - * @skb: Particular skb to send timestamp with
> + * @cached_time: Cached PHC time
>   *
>   * The driver receives a notification in the receive descriptor with timestamp.
> - * The timestamp is in ns, so we must convert the result first.
>   */
> -void
> -ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
> -		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb)
> +u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
> +			u64 cached_time)
>  {
> -	struct skb_shared_hwtstamps *hwtstamps;
> -	u64 ts_ns, cached_time;
>  	u32 ts_high;
> +	u64 ts_ns;
>  
>  	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
> -		return;
> -
> -	cached_time = READ_ONCE(rx_ring->cached_phctime);
> +		return 0;
>  
>  	/* Do not report a timestamp if we don't have a cached PHC time */
>  	if (!cached_time)
> -		return;
> +		return 0;
>  
>  	/* Use ice_ptp_extend_32b_ts directly, using the ring-specific cached
>  	 * PHC value, rather than accessing the PF. This also allows us to
> @@ -2181,9 +2175,7 @@ ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
>  	ts_high = le32_to_cpu(rx_desc->wb.flex_ts.ts_high);
>  	ts_ns = ice_ptp_extend_32b_ts(cached_time, ts_high);
>  
> -	hwtstamps = skb_hwtstamps(skb);
> -	memset(hwtstamps, 0, sizeof(*hwtstamps));
> -	hwtstamps->hwtstamp = ns_to_ktime(ts_ns);
> +	return ts_ns;
>  }
>  
>  /**
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
> index 995a57019ba7..523eefbfdf95 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.h
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
> @@ -268,9 +268,8 @@ void ice_ptp_extts_event(struct ice_pf *pf);
>  s8 ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb);
>  enum ice_tx_tstamp_work ice_ptp_process_ts(struct ice_pf *pf);
>  
> -void
> -ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
> -		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb);
> +u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
> +			u64 cached_time);
>  void ice_ptp_reset(struct ice_pf *pf);
>  void ice_ptp_prepare_for_reset(struct ice_pf *pf);
>  void ice_ptp_init(struct ice_pf *pf);
> @@ -304,9 +303,13 @@ static inline bool ice_ptp_process_ts(struct ice_pf *pf)
>  {
>  	return true;
>  }
> -static inline void
> -ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
> -		    union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb) { }
> +
> +static inline u64
> +ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc, u64 cached_time)
> +{
> +	return 0;
> +}
> +
>  static inline void ice_ptp_reset(struct ice_pf *pf) { }
>  static inline void ice_ptp_prepare_for_reset(struct ice_pf *pf) { }
>  static inline void ice_ptp_init(struct ice_pf *pf) { }
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> index 8f7f6d78f7bf..d4d27057d17b 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> @@ -185,6 +185,29 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
>  	ring->vsi->back->hw_csum_rx_error++;
>  }
>  
> +/**
> + * ice_ptp_rx_hwts_to_skb - Put RX timestamp into skb
> + * @rx_ring: Ring to get the VSI info
> + * @rx_desc: Receive descriptor
> + * @skb: Particular skb to send timestamp with
> + *
> + * The timestamp is in ns, so we must convert the result first.
> + */
> +static void
> +ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
> +		       const union ice_32b_rx_flex_desc *rx_desc,
> +		       struct sk_buff *skb)
> +{
> +	u64 ts_ns, cached_time;
> +
> +	cached_time = READ_ONCE(rx_ring->pkt_ctx.cached_phctime);

CI has pointed out that this line is wrong, and it is right: this is a mistake
made while separating patches. It should be 'READ_ONCE(rx_ring->cached_phctime)'
in this patch. Will fix in v3.

> +	ts_ns = ice_ptp_get_rx_hwts(rx_desc, cached_time);
> +
> +	*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
> +		.hwtstamp	= ns_to_ktime(ts_ns),
> +	};
> +}
> +
>  /**
>   * ice_process_skb_fields - Populate skb header fields from Rx descriptor
>   * @rx_ring: Rx descriptor ring packet is being transacted on
> @@ -209,7 +232,7 @@ ice_process_skb_fields(struct ice_rx_ring *rx_ring,
>  	ice_rx_csum(rx_ring, skb, rx_desc, ptype);
>  
>  	if (rx_ring->ptp_rx)
> -		ice_ptp_rx_hwtstamp(rx_ring, rx_desc, skb);
> +		ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
>  }
>  
>  /**
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-04  8:23     ` Larysa Zaremba
@ 2023-07-04 10:23       ` Jesper Dangaard Brouer
  2023-07-04 11:02         ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-04 10:23 UTC (permalink / raw)
  To: Larysa Zaremba, John Fastabend
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, Andrew Lunn



On 04/07/2023 10.23, Larysa Zaremba wrote:
> On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
>> Larysa Zaremba wrote:
>>> Implement functionality that enables drivers to expose VLAN tag
>>> to XDP code.
>>>
>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>> ---
>>>   Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
>>>   include/linux/netdevice.h                    |  2 ++
>>>   include/net/xdp.h                            |  2 ++
>>>   kernel/bpf/offload.c                         |  2 ++
>>>   net/core/xdp.c                               | 20 ++++++++++++++++++++
>>>   5 files changed, 33 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>> index 25ce72af81c2..ea6dd79a21d3 100644
>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>> @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
>>>   metadata is supported, this set will grow:
>>>   
>>>   .. kernel-doc:: net/core/xdp.c
>>> -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
>>> +   :identifiers: bpf_xdp_metadata_rx_timestamp
>>> +
>>> +.. kernel-doc:: net/core/xdp.c
>>> +   :identifiers: bpf_xdp_metadata_rx_hash
>>> +
>>> +.. kernel-doc:: net/core/xdp.c
>>> +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
>>>   
>>>   An XDP program can use these kfuncs to read the metadata into stack
>>>   variables for its own consumption. Or, to pass the metadata on to other
[...]
>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>> index 41e5ca8643ec..f6262c90e45f 100644
>>> --- a/net/core/xdp.c
>>> +++ b/net/core/xdp.c
>>> @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
>>>   	return -EOPNOTSUPP;
>>>   }
>>>   
>>> +/**
>>> + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
>>> + * @ctx: XDP context pointer.
>>> + * @vlan_tag: Destination pointer for VLAN tag
>>> + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
>>> + *
>>> + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
>>> + * containing VLAN ID, vlan_proto contains protocol identifier.
>>
>> Above is a bit confusing to me at least.
>>
>> The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
>> are to be included here? The VlanID or the full 16bit TCI meaning the
>> PCP+DEI+VID?
> 
> It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
> xdp_hw_metadata") this is more clear, because the tag is parsed.
> 

Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
Protocol IDentifier)?
I mean, it can basically only have two values[1], and we just wanted to
know if it is a VLAN (that hardware offloaded/removed for us):

  static __always_inline int proto_is_vlan(__u16 h_proto)
  {
	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
		  h_proto == bpf_htons(ETH_P_8021AD));
  }

[1] 
https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79

Cc. Andrew Lunn, as I notice DSA has a fake VLAN define ETH_P_DSA_8021Q
(in file include/uapi/linux/if_ether.h)
Is this actually in use?
Maybe some hardware can "VLAN" offload this?


> What about rephrasing it this way:
> 
> In case of success, vlan_proto contains VLAN protocol identifier (TPID),
> vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
> 

Hmm, I think we can improve this further. This text becomes part of the
documentation for end-users (target audience).  Thus, I think it is
worth being more verbose and even mention the existing defines that we
are expecting end-users to take advantage of.

What about:

On success, the VLAN EtherType is stored in vlan_proto (usually either
ETH_P_8021Q or ETH_P_8021AD), also known as TPID (Tag Protocol
IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12 bits,
commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
(DEI) (formerly CFI), please refer to other documentation. Remember
these 16-bit fields are stored in network byte order. Thus, transformation
with byte-order helper functions like bpf_ntohs() is needed.



>> I think by "including 12 least significant bytes" you
>> mean bits,
> 
> Yes, my bad.
> 
>> but also not clear about those 4 other bits.
>>
>> I can likely figure it out in next patches from implementation but
>> would be nice to clean up docs.
>>
>>> + *
>>> + * Return:
>>> + * * Returns 0 on success or ``-errno`` on error.
>>> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
>>> + * * ``-ENODATA``    : VLAN tag was not stripped or is not available
>>> + */
>>> +__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
>>> +					     __be16 *vlan_proto)
>>> +{
>>> +	return -EOPNOTSUPP;
>>> +}
>>> +


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-04  9:24     ` Larysa Zaremba
@ 2023-07-04 10:39       ` Jesper Dangaard Brouer
  2023-07-04 11:19         ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-04 10:39 UTC (permalink / raw)
  To: Larysa Zaremba, John Fastabend
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	David S. Miller, Alexander Duyck

Cc. DaveM+Alex Duyck, as I value your insights on checksums.

On 04/07/2023 11.24, Larysa Zaremba wrote:
> On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
>> Larysa Zaremba wrote:
>>> Implement functionality that enables drivers to expose to XDP code,
>>> whether checksums was checked and on what level.
>>>
>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>> ---
>>>   Documentation/networking/xdp-rx-metadata.rst |  3 +++
>>>   include/linux/netdevice.h                    |  1 +
>>>   include/net/xdp.h                            |  2 ++
>>>   kernel/bpf/offload.c                         |  2 ++
>>>   net/core/xdp.c                               | 21 ++++++++++++++++++++
>>>   5 files changed, 29 insertions(+)
>>>
>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>> index ea6dd79a21d3..4ec6ddfd2a52 100644
>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>> @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
>>>   .. kernel-doc:: net/core/xdp.c
>>>      :identifiers: bpf_xdp_metadata_rx_vlan_tag
>>>   
>>> +.. kernel-doc:: net/core/xdp.c
>>> +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
>>> +
>>>   An XDP program can use these kfuncs to read the metadata into stack
>>>   variables for its own consumption. Or, to pass the metadata on to other
>>>   consumers, an XDP program can store it into the metadata area carried
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 4fa4380e6d89..569563687172 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
>>>   			       enum xdp_rss_hash_type *rss_type);
>>>   	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
>>>   				   __be16 *vlan_proto);
>>> +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
>>>   };
>>>   
>>>   /**
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 89c58f56ffc6..61ed38fa79d1 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>>>   			   bpf_xdp_metadata_rx_hash) \
>>>   	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
>>>   			   bpf_xdp_metadata_rx_vlan_tag) \
>>> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
>>> +			   bpf_xdp_metadata_rx_csum_lvl) \
>>>   
>>>   enum {
>>>   #define XDP_METADATA_KFUNC(name, _) name,
>>> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
>>> index 986e7becfd42..a133fb775f49 100644
>>> --- a/kernel/bpf/offload.c
>>> +++ b/kernel/bpf/offload.c
>>> @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
>>>   		p = ops->xmo_rx_hash;
>>>   	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
>>>   		p = ops->xmo_rx_vlan_tag;
>>> +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
>>> +		p = ops->xmo_rx_csum_lvl;
>>>   out:
>>>   	up_read(&bpf_devs_lock);
>>>   
>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>> index f6262c90e45f..c666d3e0a26c 100644
>>> --- a/net/core/xdp.c
>>> +++ b/net/core/xdp.c
>>> @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
>>>   	return -EOPNOTSUPP;
>>>   }
>>>   
>>> +/**
>>> + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
>>> + * @ctx: XDP context pointer.
>>> + * @csum_level: Return value pointer.
>>> + *
>>> + * In case of success, csum_level contains depth of the last verified checksum.
>>> + * If only the outermost checksum was verified, csum_level is 0, if both
>>> + * encapsulation and inner transport checksums were verified, csum_level is 1,
>>> + * and so on.
>>> + * For more details, refer to csum_level field in sk_buff.
>>> + *
>>> + * Return:
>>> + * * Returns 0 on success or ``-errno`` on error.
>>> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
>>> + * * ``-ENODATA``    : Checksum was not validated
>>> + */
>>> +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
>>
>> Instead of ENODATA should we return what would be put in the ip_summed field
>> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,

I was thinking the same, what about checksum "type".

>>
>>   bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
>>
>> or something like that? Or is the thought that its not really necessary?
>> I don't have a strong preference but figured it was worth asking.
>>
> 
> I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
> Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
> overcomplicate the function signature.
>   

So, is success of this kfunc bpf_xdp_metadata_rx_csum_lvl() equivalent
to CHECKSUM_UNNECESSARY?

Looking at documentation[1] (generated from skbuff.h):
  [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device

Is the idea that we can add another kfunc (new signature) that can deal
with the other types of checksums (in a later kernel release)?


>>> +{
>>> +	return -EOPNOTSUPP;
>>> +}
>>> +
>>>   __diag_pop();
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-04 10:23       ` Jesper Dangaard Brouer
@ 2023-07-04 11:02         ` Larysa Zaremba
  2023-07-04 14:18           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04 11:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Andrew Lunn

On Tue, Jul 04, 2023 at 12:23:45PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 04/07/2023 10.23, Larysa Zaremba wrote:
> > On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
> > > Larysa Zaremba wrote:
> > > > Implement functionality that enables drivers to expose VLAN tag
> > > > to XDP code.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >   Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
> > > >   include/linux/netdevice.h                    |  2 ++
> > > >   include/net/xdp.h                            |  2 ++
> > > >   kernel/bpf/offload.c                         |  2 ++
> > > >   net/core/xdp.c                               | 20 ++++++++++++++++++++
> > > >   5 files changed, 33 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > index 25ce72af81c2..ea6dd79a21d3 100644
> > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
> > > >   metadata is supported, this set will grow:
> > > >   .. kernel-doc:: net/core/xdp.c
> > > > -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> > > > +   :identifiers: bpf_xdp_metadata_rx_timestamp
> > > > +
> > > > +.. kernel-doc:: net/core/xdp.c
> > > > +   :identifiers: bpf_xdp_metadata_rx_hash
> > > > +
> > > > +.. kernel-doc:: net/core/xdp.c
> > > > +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > >   An XDP program can use these kfuncs to read the metadata into stack
> > > >   variables for its own consumption. Or, to pass the metadata on to other
> [...]
> > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > index 41e5ca8643ec..f6262c90e45f 100644
> > > > --- a/net/core/xdp.c
> > > > +++ b/net/core/xdp.c
> > > > @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
> > > >   	return -EOPNOTSUPP;
> > > >   }
> > > > +/**
> > > > + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
> > > > + * @ctx: XDP context pointer.
> > > > + * @vlan_tag: Destination pointer for VLAN tag
> > > > + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
> > > > + *
> > > > + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
> > > > + * containing VLAN ID, vlan_proto contains protocol identifier.
> > > 
> > > Above is a bit confusing to me at least.
> > > 
> > > The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
> > > are to be included here? The VlanID or the full 16bit TCI meaning the
> > > PCP+DEI+VID?
> > 
> > It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
> > xdp_hw_metadata") this is more clear, because the tag is parsed.
> > 
> 
> Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
> Protocol IDentifier)?
> I mean, it can basically only have two values[1], and we just wanted to
> know if it is a VLAN (that hardware offloaded/removed for us):

If we assume everyone follows the standard, this would be correct.
But apparently, some applications use some ambiguous value as a TPID [0].

So it is not hard to imagine that some NICs could allow you to configure a
custom TPID. I am not sure if any in-tree drivers actually do this, but I think
it's nice to provide some flexibility at the XDP level, especially considering
that the network stack stores the full vlan_proto.

[0] https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1938a_l2-lan_cg/content/495503472.htm

> 
>  static __always_inline int proto_is_vlan(__u16 h_proto)
>  {
> 	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
> 		  h_proto == bpf_htons(ETH_P_8021AD));
>  }
> 
> [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79
> 
> Cc. Andrew Lunn, as I notice DSA has a fake VLAN define ETH_P_DSA_8021Q
> (in file include/uapi/linux/if_ether.h)
> Is this actually in use?
> Maybe some hardware can "VLAN" offload this?
> 
> 
> > What about rephrasing it this way:
> > 
> > In case of success, vlan_proto contains VLAN protocol identifier (TPID),
> > vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
> > 
> 
> Hmm, I think we can improve this further. This text becomes part of the
> documentation for end-users (target audience).  Thus, I think it is
> worth being more verbose and even mention the existing defines that we
> are expecting end-users to take advantage of.
> 
> What about:
> 
> On success, the VLAN EtherType is stored in vlan_proto (usually either
> ETH_P_8021Q or ETH_P_8021AD), also known as TPID (Tag Protocol
> IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
> containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12 bits,
> commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
> of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
> (DEI) (formerly CFI), please refer to other documentation. Remember
> these 16-bit fields are stored in network byte order. Thus, transformation
> with byte-order helper functions like bpf_ntohs() is needed.
> 

AFAIK, vlan_tag is stored in host byte order; this is how it is in skb.
In ice, we receive the VLAN tag in the descriptor already in LE.
Only protocol is BE (network byte order). So I would replace the last 2 
sentences with the following:

vlan_tag is stored in host byte order, so no byte order conversion is needed.
vlan_proto is stored in network byte order, the suggested way to use this value:

vlan_proto == bpf_htons(ETH_P_8021Q)
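
The byte-order rules described above can be sketched in plain C as follows.
This is only an illustration under stated assumptions: the helper name
hint_matches_cvlan() and its parameters are hypothetical, the two constants
are defined locally rather than pulled from kernel headers, and htons()
stands in for bpf_htons(), which a BPF program would use instead.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100	/* 802.1Q C-VLAN TPID */
#define VLAN_VID_MASK	0x0fff	/* low 12 bits of the TCI */

/*
 * Hypothetical helper: check whether a (vlan_tag, vlan_proto) pair, as it
 * would be filled in by the hint, describes an 802.1Q tag with the wanted
 * VID.
 *
 * vlan_tag is assumed to be in host byte order, so it is masked directly.
 * vlan_proto_be is assumed to be in network byte order, so the constant is
 * converted with htons() before comparing.
 */
static bool hint_matches_cvlan(uint16_t vlan_tag, uint16_t vlan_proto_be,
			       uint16_t want_vid)
{
	if (vlan_proto_be != htons(ETH_P_8021Q))
		return false;

	return (vlan_tag & VLAN_VID_MASK) == want_vid;
}
```

Only vlan_proto goes through a byte-order helper here; vlan_tag is used as is.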

> 
> 
> > > I think by "including 12 least significant bytes" you
> > > mean bits,
> > 
> > Yes, my bad.
> > 
> > > but also not clear about those 4 other bits.
> > > 
> > > I can likely figure it out in next patches from implementation but
> > > would be nice to clean up docs.
> > > 
> > > > + *
> > > > + * Return:
> > > > + * * Returns 0 on success or ``-errno`` on error.
> > > > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > > > + * * ``-ENODATA``    : VLAN tag was not stripped or is not available
> > > > + */
> > > > +__bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
> > > > +					     __be16 *vlan_proto)
> > > > +{
> > > > +	return -EOPNOTSUPP;
> > > > +}
> > > > +
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata Larysa Zaremba
@ 2023-07-04 11:03   ` Jesper Dangaard Brouer
  2023-07-04 11:04     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-04 11:03 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 03/07/2023 20.12, Larysa Zaremba wrote:
> diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> index 613321eb84c1..d234cbcc9103 100644
> --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> @@ -19,6 +19,9 @@
>   #include "xsk.h"
>   
>   #include <error.h>
> +#include <linux/kernel.h>
> +#include <linux/bits.h>
> +#include <linux/bitfield.h>
>   #include <linux/errqueue.h>
>   #include <linux/if_link.h>
>   #include <linux/net_tstamp.h>
> @@ -150,21 +153,34 @@ static __u64 gettime(clockid_t clock_id)
>   	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
>   }
>   
> +#define VLAN_PRIO_MASK		GENMASK(15, 13) /* Priority Code Point */
> +#define VLAN_CFI_MASK		GENMASK(12, 12) /* Canonical Format / Drop Eligible Indicator */
> +#define VLAN_VID_MASK		GENMASK(11, 0)	/* VLAN Identifier */
> +static void print_vlan_tag(__u16 tag)
> +{
> +	__u16 vlan_id = FIELD_GET(VLAN_VID_MASK, tag);
> +	__u8 pcp = FIELD_GET(VLAN_PRIO_MASK, tag);
> +	bool cfi = FIELD_GET(VLAN_CFI_MASK, tag);
> +
> +	printf("PCP=%u, CFI=%d, VID=0x%X\n", pcp, cfi, vlan_id);
> +}
> +

Shouldn't we use DEI instead of CFI ?

This is new code, and CFI has been deprecated (it was only relevant for
IEEE 802.5 Token Ring LAN).
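
As a rough sketch of the naming suggested here, the tag could be decoded and
printed with a DEI label along these lines. The struct vlan_tci_fields and
both helper names are hypothetical, the masks are written out as plain
constants instead of GENMASK()/FIELD_GET() so the snippet builds outside the
kernel tree, and the tag is assumed to be in host byte order.

```c
#include <stdint.h>
#include <stdio.h>

/* Sub-field layout of the 16-bit TCI (assumed to be in host byte order). */
#define VLAN_PRIO_MASK	0xe000u	/* Priority Code Point, bits 15..13 */
#define VLAN_PRIO_SHIFT	13
#define VLAN_DEI_MASK	0x1000u	/* Drop Eligible Indicator, bit 12 */
#define VLAN_VID_MASK	0x0fffu	/* VLAN Identifier, bits 11..0 */

/* Hypothetical container for the decoded sub-fields. */
struct vlan_tci_fields {
	uint8_t pcp;
	uint8_t dei;
	uint16_t vid;
};

/* Split the TCI into PCP, DEI and VID. */
static struct vlan_tci_fields vlan_tci_decode(uint16_t tag)
{
	struct vlan_tci_fields f = {
		.pcp = (uint8_t)((tag & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT),
		.dei = (uint8_t)!!(tag & VLAN_DEI_MASK),
		.vid = (uint16_t)(tag & VLAN_VID_MASK),
	};

	return f;
}

/* Format the tag in the same style as print_vlan_tag(), labelled DEI. */
static int vlan_tci_format(char *buf, size_t len, uint16_t tag)
{
	const struct vlan_tci_fields f = vlan_tci_decode(tag);

	return snprintf(buf, len, "PCP=%u, DEI=%u, VID=0x%X",
			(unsigned int)f.pcp, (unsigned int)f.dei,
			(unsigned int)f.vid);
}
```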

--Jesper


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata
  2023-07-04 11:03   ` Jesper Dangaard Brouer
@ 2023-07-04 11:04     ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04 11:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Tue, Jul 04, 2023 at 01:03:37PM +0200, Jesper Dangaard Brouer wrote:
> 
> On 03/07/2023 20.12, Larysa Zaremba wrote:
> > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > index 613321eb84c1..d234cbcc9103 100644
> > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > @@ -19,6 +19,9 @@
> >   #include "xsk.h"
> >   #include <error.h>
> > +#include <linux/kernel.h>
> > +#include <linux/bits.h>
> > +#include <linux/bitfield.h>
> >   #include <linux/errqueue.h>
> >   #include <linux/if_link.h>
> >   #include <linux/net_tstamp.h>
> > @@ -150,21 +153,34 @@ static __u64 gettime(clockid_t clock_id)
> >   	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
> >   }
> > +#define VLAN_PRIO_MASK		GENMASK(15, 13) /* Priority Code Point */
> > +#define VLAN_CFI_MASK		GENMASK(12, 12) /* Canonical Format / Drop Eligible Indicator */
> > +#define VLAN_VID_MASK		GENMASK(11, 0)	/* VLAN Identifier */
> > +static void print_vlan_tag(__u16 tag)
> > +{
> > +	__u16 vlan_id = FIELD_GET(VLAN_VID_MASK, tag);
> > +	__u8 pcp = FIELD_GET(VLAN_PRIO_MASK, tag);
> > +	bool cfi = FIELD_GET(VLAN_CFI_MASK, tag);
> > +
> > +	printf("PCP=%u, CFI=%d, VID=0x%X\n", pcp, cfi, vlan_id);
> > +}
> > +
> 
> Shouldn't we use DEI instead of CFI ?
> 
> This is new code, and CFI has been deprecated (it was only relevant for
> IEEE 802.5 Token Ring LAN).

You are right, should be DEI.

> 
> --Jesper
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-04 10:39       ` Jesper Dangaard Brouer
@ 2023-07-04 11:19         ` Larysa Zaremba
  2023-07-06  5:50           ` John Fastabend
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-04 11:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David S. Miller, Alexander Duyck

On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
> Cc. DaveM+Alex Duyck, as I value your insights on checksums.
> 
> On 04/07/2023 11.24, Larysa Zaremba wrote:
> > On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
> > > Larysa Zaremba wrote:
> > > > Implement functionality that enables drivers to expose to XDP code,
> > > > whether checksums was checked and on what level.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >   Documentation/networking/xdp-rx-metadata.rst |  3 +++
> > > >   include/linux/netdevice.h                    |  1 +
> > > >   include/net/xdp.h                            |  2 ++
> > > >   kernel/bpf/offload.c                         |  2 ++
> > > >   net/core/xdp.c                               | 21 ++++++++++++++++++++
> > > >   5 files changed, 29 insertions(+)
> > > > 
> > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > index ea6dd79a21d3..4ec6ddfd2a52 100644
> > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
> > > >   .. kernel-doc:: net/core/xdp.c
> > > >      :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > +.. kernel-doc:: net/core/xdp.c
> > > > +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> > > > +
> > > >   An XDP program can use these kfuncs to read the metadata into stack
> > > >   variables for its own consumption. Or, to pass the metadata on to other
> > > >   consumers, an XDP program can store it into the metadata area carried
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index 4fa4380e6d89..569563687172 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
> > > >   			       enum xdp_rss_hash_type *rss_type);
> > > >   	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> > > >   				   __be16 *vlan_proto);
> > > > +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
> > > >   };
> > > >   /**
> > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > index 89c58f56ffc6..61ed38fa79d1 100644
> > > > --- a/include/net/xdp.h
> > > > +++ b/include/net/xdp.h
> > > > @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> > > >   			   bpf_xdp_metadata_rx_hash) \
> > > >   	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> > > >   			   bpf_xdp_metadata_rx_vlan_tag) \
> > > > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> > > > +			   bpf_xdp_metadata_rx_csum_lvl) \
> > > >   enum {
> > > >   #define XDP_METADATA_KFUNC(name, _) name,
> > > > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > > > index 986e7becfd42..a133fb775f49 100644
> > > > --- a/kernel/bpf/offload.c
> > > > +++ b/kernel/bpf/offload.c
> > > > @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> > > >   		p = ops->xmo_rx_hash;
> > > >   	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> > > >   		p = ops->xmo_rx_vlan_tag;
> > > > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> > > > +		p = ops->xmo_rx_csum_lvl;
> > > >   out:
> > > >   	up_read(&bpf_devs_lock);
> > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > index f6262c90e45f..c666d3e0a26c 100644
> > > > --- a/net/core/xdp.c
> > > > +++ b/net/core/xdp.c
> > > > @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
> > > >   	return -EOPNOTSUPP;
> > > >   }
> > > > +/**
> > > > + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> > > > + * @ctx: XDP context pointer.
> > > > + * @csum_level: Return value pointer.
> > > > + *
> > > > + * In case of success, csum_level contains depth of the last verified checksum.
> > > > + * If only the outermost checksum was verified, csum_level is 0, if both
> > > > + * encapsulation and inner transport checksums were verified, csum_level is 1,
> > > > + * and so on.
> > > > + * For more details, refer to csum_level field in sk_buff.
> > > > + *
> > > > + * Return:
> > > > + * * Returns 0 on success or ``-errno`` on error.
> > > > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > > > + * * ``-ENODATA``    : Checksum was not validated
> > > > + */
> > > > +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> > > 
> > > > Instead of ENODATA should we return what would be put in the ip_summed field
> > > CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,
> 
> I was thinking the same, what about checksum "type".
> 
> > > 
> > >   bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
> > > 
> > > or something like that? Or is the thought that its not really necessary?
> > > I don't have a strong preference but figured it was worth asking.
> > > 
> > 
> > I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
> > Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
> > overcomplicate the function signature.
> 
> So, is success of this kfunc bpf_xdp_metadata_rx_csum_lvl() equivalent to
> CHECKSUM_UNNECESSARY?

This is 100% true for physical NICs. It's more complicated for veth, because it
often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
treated by the network stack as a validated checksum, because there is no way
an internally generated packet could be messed up. I would be grateful if you could
look at the veth patch and share your opinion about this.
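
To make the return convention concrete, here is a small sketch of how a
consumer might interpret the kfunc result. The enum and both helper names are
hypothetical; the "csum_level + 1 verified checksums" rule follows the
csum_level description in sk_buff and the kernel-doc text in this patch. The
sketch does not try to model the veth CHECKSUM_PARTIAL case.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical classification of the kfunc return value. */
enum csum_hint_state {
	CSUM_HINT_VERIFIED,	/* ret == 0, csum_level is valid */
	CSUM_HINT_NOT_VALIDATED,	/* -ENODATA: checksum was not validated */
	CSUM_HINT_UNSUPPORTED,	/* -EOPNOTSUPP: driver lacks the kfunc */
	CSUM_HINT_OTHER_ERROR,	/* any other -errno */
};

static enum csum_hint_state csum_hint_classify(int ret)
{
	switch (ret) {
	case 0:
		return CSUM_HINT_VERIFIED;
	case -ENODATA:
		return CSUM_HINT_NOT_VALIDATED;
	case -EOPNOTSUPP:
		return CSUM_HINT_UNSUPPORTED;
	default:
		return CSUM_HINT_OTHER_ERROR;
	}
}

/*
 * Number of consecutive checksums verified by hardware, counted from the
 * outermost one. csum_level 0 means one checksum, 1 means two, and so on.
 * Returns 0 when the hint is not available.
 */
static unsigned int csum_hint_verified_count(int ret, uint8_t csum_level)
{
	if (ret != 0)
		return 0;

	return (unsigned int)csum_level + 1;
}
```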

> 
> Looking at documentation[1] (generated from skbuff.h):
>  [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
> 
> Is the idea that we can add another kfunc (new signature) that can deal
> with the other types of checksums (in a later kernel release)?
>

Yes, that is the idea.
 
> 
> > > > +{
> > > > +	return -EOPNOTSUPP;
> > > > +}
> > > > +
> > > >   __diag_pop();
> > 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-04 11:02         ` Larysa Zaremba
@ 2023-07-04 14:18           ` Jesper Dangaard Brouer
  2023-07-06 14:46             ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-04 14:18 UTC (permalink / raw)
  To: Larysa Zaremba, Jesper Dangaard Brouer
  Cc: brouer, John Fastabend, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Andrew Lunn



On 04/07/2023 13.02, Larysa Zaremba wrote:
> On Tue, Jul 04, 2023 at 12:23:45PM +0200, Jesper Dangaard Brouer wrote:
>>
>> On 04/07/2023 10.23, Larysa Zaremba wrote:
>>> On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
>>>> Larysa Zaremba wrote:
>>>>> Implement functionality that enables drivers to expose VLAN tag
>>>>> to XDP code.
>>>>>
>>>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>>>> ---
>>>>>    Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
>>>>>    include/linux/netdevice.h                    |  2 ++
>>>>>    include/net/xdp.h                            |  2 ++
>>>>>    kernel/bpf/offload.c                         |  2 ++
>>>>>    net/core/xdp.c                               | 20 ++++++++++++++++++++
>>>>>    5 files changed, 33 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>>>> index 25ce72af81c2..ea6dd79a21d3 100644
>>>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>>>> @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
>>>>>    metadata is supported, this set will grow:
>>>>>    .. kernel-doc:: net/core/xdp.c
>>>>> -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
>>>>> +   :identifiers: bpf_xdp_metadata_rx_timestamp
>>>>> +
>>>>> +.. kernel-doc:: net/core/xdp.c
>>>>> +   :identifiers: bpf_xdp_metadata_rx_hash
>>>>> +
>>>>> +.. kernel-doc:: net/core/xdp.c
>>>>> +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
>>>>>    An XDP program can use these kfuncs to read the metadata into stack
>>>>>    variables for its own consumption. Or, to pass the metadata on to other
>> [...]
>>>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>>>> index 41e5ca8643ec..f6262c90e45f 100644
>>>>> --- a/net/core/xdp.c
>>>>> +++ b/net/core/xdp.c
>>>>> @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
>>>>>    	return -EOPNOTSUPP;
>>>>>    }
>>>>> +/**
>>>>> + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
>>>>> + * @ctx: XDP context pointer.
>>>>> + * @vlan_tag: Destination pointer for VLAN tag
>>>>> + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
>>>>> + *
>>>>> + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
>>>>> + * containing VLAN ID, vlan_proto contains protocol identifier.
>>>>
>>>> Above is a bit confusing to me at least.
>>>>
>>>> The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
>>>> are to be included here? The VlanID or the full 16bit TCI meaning the
>>>> PCP+DEI+VID?
>>>
>>> It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
>>> xdp_hw_metadata") this is more clear, because the tag is parsed.
>>>
>>
>> Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
>> Protocol IDentifier)?
>> I mean, it can basically only have two values[1], and we just wanted to
>> know if it is a VLAN (that hardware offloaded/removed for us):
> 
> If we assume everyone follows the standard, this would be correct.
> But apparently, some applications use some ambiguous value as a TPID [0].
> 
> So it is not hard to imagine that some NICs could allow you to configure a
> custom TPID. I am not sure if any in-tree drivers actually do this, but I think
> it's nice to provide some flexibility at the XDP level, especially considering
> that the network stack stores the full vlan_proto.
>

I'm buying your argument, and agree it makes sense to provide TPID in
the call signature, given that weird hardware exists that allows people to
configure a custom TPID.

Looking through kernel defines (in uapi/linux/if_ether.h) I see evidence
that funky QinQ EtherTypes have been used in the past:

  #define ETH_P_QINQ1	0x9100		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
  #define ETH_P_QINQ2	0x9200		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
  #define ETH_P_QINQ3	0x9300		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
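
One way a consumer could make use of the full TPID is sketched below. The
enum and the helper name tpid_classify() are hypothetical, the constants are
redefined locally with the values from uapi/linux/if_ether.h, and ntohs()
stands in for bpf_ntohs(). vlan_proto is assumed to arrive in network byte
order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Values as in uapi/linux/if_ether.h, redefined to keep this standalone. */
#define ETH_P_8021Q	0x8100
#define ETH_P_8021AD	0x88A8
#define ETH_P_QINQ1	0x9100
#define ETH_P_QINQ2	0x9200
#define ETH_P_QINQ3	0x9300

/* Hypothetical classification of a TPID reported by the hint. */
enum tpid_class {
	TPID_STANDARD,		/* 802.1Q or 802.1ad */
	TPID_DEPRECATED_QINQ,	/* unofficial QinQ EtherTypes */
	TPID_OTHER,		/* anything else, e.g. a custom TPID */
};

static enum tpid_class tpid_classify(uint16_t vlan_proto_be)
{
	/* Convert once to host byte order, then compare plain constants. */
	switch (ntohs(vlan_proto_be)) {
	case ETH_P_8021Q:
	case ETH_P_8021AD:
		return TPID_STANDARD;
	case ETH_P_QINQ1:
	case ETH_P_QINQ2:
	case ETH_P_QINQ3:
		return TPID_DEPRECATED_QINQ;
	default:
		return TPID_OTHER;
	}
}
```

A boolean "is VLAN" hint could not distinguish these cases, which is the
flexibility argued for above.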


> [0]
> https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1938a_l2-lan_cg/content/495503472.htm
> 
>>
>>   static __always_inline int proto_is_vlan(__u16 h_proto)
>>   {
>> 	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
>> 		  h_proto == bpf_htons(ETH_P_8021AD));
>>   }
>>
>> [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79
>>
>> Cc. Andrew Lunn, as I notice DSA has a fake VLAN define ETH_P_DSA_8021Q
>> (in file include/uapi/linux/if_ether.h)
>> Is this actually in use?
>> Maybe some hardware can "VLAN" offload this?
>>
>>
>>> What about rephrasing it this way:
>>>
>>> In case of success, vlan_proto contains VLAN protocol identifier (TPID),
>>> vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
>>>
>>
>> Hmm, I think we can improve this further. This text becomes part of the
>> documentation for end-users (target audience).  Thus, I think it is
>> worth being more verbose and even mention the existing defines that we
>> are expecting end-users to take advantage of.
>>
>> What about:
>>
>> In case of success. The VLAN EtherType is stored in vlan_proto (usually
>> either ETH_P_8021Q or ETH_P_8021AD) also known as TPID (Tag Protocol
>> IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
>> containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12-bits
>> commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
>> of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
>> (DEI) (formerly CFI) please reference other documentation. Remember
>> these 16-bit fields are stored in network-byte. Thus, transformation
>> with byte-order helper functions like bpf_ntohs() are needed.
>>
> 
> AFAIK, vlan_tag is stored in host byte order, this is how it is in skb.

I'm not sure we should follow SKB storage scheme for XDP.

> In ice, we receive VLAN tag in descriptor already in LE.
> Only protocol is BE (network byte order). So I would replace the last 2
> sentences with the following:
> 
> vlan_tag is stored in host byte order, so no byte order conversion is needed.

Yikes, that was unexpected.  This needs to be heavily documented in docs.

When parsing packets, it is in network byte order, else my code is wrong
here [1]:

   [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L122

I'm accessing the skb->vlan_tci here [2], and I notice I don't do any
byte-order conversions, so fortunately I didn't make a code mistake.

   [2] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c#L215

> vlan_proto is stored in network byte order, the suggested way to use this value:
> 
> vlan_proto == bpf_htons(ETH_P_8021Q)
> 
>>
>>
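
To make the byte-order point concrete, here is a small userspace sketch,
assuming the semantics proposed in the quoted text above (vlan_tag in host
byte order, vlan_proto in network byte order). struct vlan_fields,
decode_vlan_hint() and VLAN_DEI_MASK are names invented for this example;
the mask values are the usual 802.1Q layout (PCP 3 bits, DEI 1 bit, VID
12 bits):

```c
#include <arpa/inet.h>	/* htons() */
#include <stdbool.h>
#include <stdint.h>

#ifndef ETH_P_8021Q
#define ETH_P_8021Q	0x8100
#endif

#define VLAN_VID_MASK	0x0fff	/* VLAN Identifier */
#define VLAN_DEI_MASK	0x1000	/* Drop Eligible Indicator (formerly CFI) */
#define VLAN_PRIO_MASK	0xe000	/* Priority Code Point */
#define VLAN_PRIO_SHIFT	13

/* Hypothetical container for the decoded sub-fields. */
struct vlan_fields {
	uint16_t vid;
	uint8_t pcp;
	bool dei;
	bool is_ctag;	/* true when the TPID is 802.1Q */
};

static struct vlan_fields decode_vlan_hint(uint16_t vlan_tag,
					   uint16_t vlan_proto_be)
{
	struct vlan_fields f;

	/* vlan_tag is assumed to be host byte order: no ntohs() here. */
	f.vid = vlan_tag & VLAN_VID_MASK;
	f.pcp = (vlan_tag & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
	f.dei = !!(vlan_tag & VLAN_DEI_MASK);

	/* vlan_proto stays in network byte order: convert the constant. */
	f.is_ctag = vlan_proto_be == htons(ETH_P_8021Q);

	return f;
}
```

If the tag ended up in network byte order instead, the only change would
be an ntohs() on vlan_tag before masking.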

--Jesper



* Re: [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint
  2023-07-03 18:12 ` [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint Larysa Zaremba
@ 2023-07-05 17:25   ` Stanislav Fomichev
  2023-07-06  9:57     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:25 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> In order to test VLAN tag and checksum level XDP hints in
> hardware-independent selftests, implement newly added XDP hints in veth
> driver.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 614f3e3efab0..a7f2b679551d 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1732,6 +1732,44 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
>  	return 0;
>  }
>  
> +static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
> +				__be16 *vlan_proto)
> +{
> +	struct veth_xdp_buff *_ctx = (void *)ctx;
> +	struct sk_buff *skb = _ctx->skb;
> +	int err;
> +
> +	if (!skb)
> +		return -ENODATA;
> +

[..]

> +	err = __vlan_hwaccel_get_tag(skb, vlan_tag);

We probably need to open code __vlan_hwaccel_get_tag here, because it
returns -EINVAL on !skb_vlan_tag_present, whereas the expectation for us,
I'm assuming, is -ENODATA?
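
Roughly what I have in mind, as a userspace model rather than actual veth
code. struct mock_skb and mock_xdp_rx_vlan_tag() are stand-ins invented
for this sketch; in the driver the presence check and the read would
presumably go through skb_vlan_tag_present() and skb_vlan_tag_get():

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the sk_buff fields involved; not the kernel struct. */
struct mock_skb {
	bool vlan_present;	/* models skb_vlan_tag_present() */
	uint16_t vlan_tci;	/* host byte order */
	uint16_t vlan_proto;	/* network byte order */
};

/*
 * Open-coded tag read: a missing skb or a missing tag both mean
 * "no data for this hint", so both paths return -ENODATA, never -EINVAL.
 */
static int mock_xdp_rx_vlan_tag(const struct mock_skb *skb,
				uint16_t *vlan_tag, uint16_t *vlan_proto)
{
	if (!skb || !skb->vlan_present)
		return -ENODATA;

	*vlan_tag = skb->vlan_tci;
	*vlan_proto = skb->vlan_proto;
	return 0;
}
```
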

> +	if (err)
> +		return err;
> +
> +	*vlan_proto = skb->vlan_proto;
> +	return err;
> +}
> +
> +static int veth_xdp_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> +{
> +	struct veth_xdp_buff *_ctx = (void *)ctx;
> +	struct sk_buff *skb = _ctx->skb;
> +
> +	if (!skb)
> +		return -ENODATA;
> +
> +	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
> +		*csum_level = skb->csum_level;
> +	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
> +		 skb_checksum_start_offset(skb) == skb_transport_offset(skb) ||
> +		 skb->csum_valid)
> +		*csum_level = 0;
> +	else
> +		return -ENODATA;
> +
> +	return 0;
> +}
> +
>  static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_init            = veth_dev_init,
>  	.ndo_open            = veth_open,
> @@ -1756,6 +1794,8 @@ static const struct net_device_ops veth_netdev_ops = {
>  static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
>  	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
>  	.xmo_rx_hash			= veth_xdp_rx_hash,
> +	.xmo_rx_vlan_tag		= veth_xdp_rx_vlan_tag,
> +	.xmo_rx_csum_lvl		= veth_xdp_rx_csum_lvl,
>  };
>  
>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> -- 
> 2.41.0
> 


* Re: [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-03 18:12 ` [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint Larysa Zaremba
@ 2023-07-05 17:30   ` Stanislav Fomichev
  2023-07-06 14:22     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:30 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> Use previously refactored code and create a function
> that allows XDP code to read HW timestamp.
> 
> Also, move cached_phctime into packet context, this way this data still
> stays in the ring structure, just at the different address.
> 
> HW timestamp is the first supported hint in the driver,
> so also add xdp_metadata_ops.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice.h          |  2 ++
>  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
>  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
>  drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
>  drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
>  7 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> index 4ba3d99439a0..7a973a2229f1 100644
> --- a/drivers/net/ethernet/intel/ice/ice.h
> +++ b/drivers/net/ethernet/intel/ice/ice.h
> @@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
>  	set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
>  	clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
>  }
> +
> +extern const struct xdp_metadata_ops ice_xdp_md_ops;
>  #endif /* _ICE_H_ */
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
>  		/* clone ring and setup updated count */
>  		rx_rings[i] = *vsi->rx_rings[i];
>  		rx_rings[i].count = new_rx_cnt;
> -		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> +		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
>  		rx_rings[i].desc = NULL;
>  		rx_rings[i].rx_buf = NULL;
>  		/* this is to allow wr32 to have something to write to
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index 00e3afd507a4..eb69b0ac7956 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
>  		ring->netdev = vsi->netdev;
>  		ring->dev = dev;
>  		ring->count = vsi->num_rx_desc;
> -		ring->cached_phctime = pf->ptp.cached_phc_time;
> +		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
>  		WRITE_ONCE(vsi->rx_rings[i], ring);
>  	}
>  
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 93979ab18bc1..f21996b812ea 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
>  
>  	netdev->netdev_ops = &ice_netdev_ops;
>  	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
> +	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
>  	ice_set_ethtool_ops(netdev);
>  
>  	if (vsi->type != ICE_VSI_PF)
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index a31333972c68..70697e4829dd 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
>  		ice_for_each_rxq(vsi, j) {
>  			if (!vsi->rx_rings[j])
>  				continue;
> -			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
> +			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
>  		}
>  	}
>  	clear_bit(ICE_CFG_BUSY, pf->state);
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index d0ab2c4c0c91..4237702a58a9 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -259,6 +259,7 @@ enum ice_rx_dtype {
>  
>  struct ice_pkt_ctx {
>  	const union ice_32b_rx_flex_desc *eop_desc;
> +	u64 cached_phctime;
>  };
>  
>  struct ice_xdp_buff {
> @@ -354,7 +355,6 @@ struct ice_rx_ring {
>  	struct ice_tx_ring *xdp_ring;
>  	struct xsk_buff_pool *xsk_pool;
>  	dma_addr_t dma;			/* physical address of ring */
> -	u64 cached_phctime;
>  	u16 rx_buf_len;
>  	u8 dcb_tc;			/* Traffic class of ring */
>  	u8 ptp_rx;
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> index beb1c5bb392a..463d9e5cbe05 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> @@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
>  			spin_unlock(&xdp_ring->tx_lock);
>  	}
>  }
> +
> +/**
> + * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
> + * @ctx: XDP buff pointer
> + * @ts_ns: destination address
> + *
> + * Copy HW timestamp (if available) to the destination address.
> + */
> +static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
> +{
> +	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> +	u64 cached_time;
> +
> +	cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);

I believe we have to have something like the following here:

if (!ts_ns)
	return -EINVAL;

IOW, I don't think verifier guarantees that those pointer args are
non-NULL. Same for the other ice kfunc you're adding and veth changes.

Can you also fix it for the existing veth kfuncs? (or lmk if you prefer me
to fix it).


* Re: [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
@ 2023-07-05 17:31   ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:31 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> Make VLAN c-tag and s-tag XDP hint testing more convenient
> by not skipping VLAN-ed packets.
> 
> Allow both 802.1ad and 802.1Q headers.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>

Acked-by: Stanislav Fomichev <sdf@google.com>
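
For anyone following along, the header-skipping logic in the quoted hunk
can be modelled in plain userspace C as below. This is only a sketch:
skip_vlan_headers() and rd_be16() are invented names, it works on a byte
buffer instead of xdp_md pointers, and its bounds checks only require
that the (shifted) Ethernet header fits in the buffer:

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN	14	/* dst MAC + src MAC + EtherType */
#define VLAN_HLEN	4	/* TCI + encapsulated EtherType */
#define ETH_P_8021Q	0x8100
#define ETH_P_8021AD	0x88A8

/* Read a big-endian 16-bit value from the wire. */
static uint16_t rd_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Skip an optional outer tag (802.1ad or 802.1Q) and an optional inner
 * 802.1Q tag. On success, store the final EtherType (host byte order) in
 * *proto and return the offset of the L3 header; return -1 if truncated.
 */
static int skip_vlan_headers(const uint8_t *data, size_t len, uint16_t *proto)
{
	size_t off = 0;	/* how far the Ethernet header view is shifted */
	uint16_t p;

	if (len < ETH_HLEN)
		return -1;
	p = rd_be16(data + 12);

	if (p == ETH_P_8021AD || p == ETH_P_8021Q) {
		off += VLAN_HLEN;
		if (len < off + ETH_HLEN)
			return -1;
		p = rd_be16(data + off + 12);

		/* Only a C-tag is accepted as the inner header. */
		if (p == ETH_P_8021Q) {
			off += VLAN_HLEN;
			if (len < off + ETH_HLEN)
				return -1;
			p = rd_be16(data + off + 12);
		}
	}

	*proto = p;
	return (int)(off + ETH_HLEN);
}
```
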

> ---
>  tools/testing/selftests/bpf/progs/xdp_hw_metadata.c | 10 +++++++++-
>  tools/testing/selftests/bpf/xdp_metadata.h          |  8 ++++++++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
> index b2dfd7066c6e..63d7de6c6bbb 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
> @@ -26,15 +26,23 @@ int rx(struct xdp_md *ctx)
>  {
>  	void *data, *data_meta, *data_end;
>  	struct ipv6hdr *ip6h = NULL;
> -	struct ethhdr *eth = NULL;
>  	struct udphdr *udp = NULL;
>  	struct iphdr *iph = NULL;
>  	struct xdp_meta *meta;
> +	struct ethhdr *eth;
>  	int err;
>  
>  	data = (void *)(long)ctx->data;
>  	data_end = (void *)(long)ctx->data_end;
>  	eth = data;
> +
> +	if (eth + 1 < data_end && (eth->h_proto == bpf_htons(ETH_P_8021AD) ||
> +				   eth->h_proto == bpf_htons(ETH_P_8021Q)))
> +		eth = (void *)eth + sizeof(struct vlan_hdr);
> +
> +	if (eth + 1 < data_end && eth->h_proto == bpf_htons(ETH_P_8021Q))
> +		eth = (void *)eth + sizeof(struct vlan_hdr);
> +
>  	if (eth + 1 < data_end) {
>  		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
>  			iph = (void *)(eth + 1);
> diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
> index 938a729bd307..6664893c2c77 100644
> --- a/tools/testing/selftests/bpf/xdp_metadata.h
> +++ b/tools/testing/selftests/bpf/xdp_metadata.h
> @@ -9,6 +9,14 @@
>  #define ETH_P_IPV6 0x86DD
>  #endif
>  
> +#ifndef ETH_P_8021Q
> +#define ETH_P_8021Q 0x8100
> +#endif
> +
> +#ifndef ETH_P_8021AD
> +#define ETH_P_8021AD 0x88A8
> +#endif
> +
>  struct xdp_meta {
>  	__u64 rx_timestamp;
>  	__u64 xdp_timestamp;
> -- 
> 2.41.0
> 


* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
@ 2023-07-05 17:39   ` Stanislav Fomichev
  2023-07-06 14:11     ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:39 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> The easiest way to simulate stripped VLAN tag in veth is to send a packet
> from VLAN interface, attached to veth. Unfortunately, this approach is
> incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> such feature.
> 
> Replace AF_XDP packet generation with sending the same datagram via
> AF_INET socket.
> 
> This does not change the packet contents or hints values with one notable
> exception: rx_hash_type, which previously was expected to be 0, now is
> expected to be at least XDP_RSS_TYPE_L4.
> 
> Also, usage of AF_INET requires a little more complicated namespace setup,
> therefore open_netns() helper function is divided into smaller reusable
> pieces.

Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
with TX metadata:
https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/

So probably long-term I'll switch it back to AF_XDP but will add
support for requesting vlan TX "offload" from the veth.
 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
>  tools/testing/selftests/bpf/network_helpers.h |   3 +
>  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
>  3 files changed, 98 insertions(+), 117 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> index a105c0cd008a..19463230ece5 100644
> --- a/tools/testing/selftests/bpf/network_helpers.c
> +++ b/tools/testing/selftests/bpf/network_helpers.c
> @@ -386,28 +386,51 @@ char *ping_command(int family)
>  	return "ping";
>  }
>  
> +int get_cur_netns(void)
> +{
> +	int nsfd;
> +
> +	nsfd = open("/proc/self/ns/net", O_RDONLY);
> +	ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> +	return nsfd;
> +}
> +
> +int get_netns(const char *name)
> +{
> +	char nspath[PATH_MAX];
> +	int nsfd;
> +
> +	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> +	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> +	ASSERT_GE(nsfd, 0, "open netns fd");
> +	return nsfd;
> +}
> +
> +int set_netns(int netns_fd)
> +{
> +	return setns(netns_fd, CLONE_NEWNET);
> +}

We have open_netns/close_netns in network_helpers.h that provide similar
functionality, let's use them instead?

> +
>  struct nstoken {
>  	int orig_netns_fd;
>  };
>  
>  struct nstoken *open_netns(const char *name)
>  {
> +	struct nstoken *token;
>  	int nsfd;
> -	char nspath[PATH_MAX];
>  	int err;
> -	struct nstoken *token;
>  
>  	token = calloc(1, sizeof(struct nstoken));
>  	if (!ASSERT_OK_PTR(token, "malloc token"))
>  		return NULL;
>  
> -	token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> -	if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
> +	token->orig_netns_fd = get_cur_netns();
> +	if (token->orig_netns_fd < 0)
>  		goto fail;
>  
> -	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> -	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> -	if (!ASSERT_GE(nsfd, 0, "open netns fd"))
> +	nsfd = get_netns(name);
> +	if (nsfd < 0)
>  		goto fail;
>  
>  	err = setns(nsfd, CLONE_NEWNET);
> diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
> index 694185644da6..b18b9619595c 100644
> --- a/tools/testing/selftests/bpf/network_helpers.h
> +++ b/tools/testing/selftests/bpf/network_helpers.h
> @@ -58,6 +58,8 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
>  char *ping_command(int family);
>  int get_socket_local_port(int sock_fd);
>  
> +int get_cur_netns(void);
> +int get_netns(const char *name);
>  struct nstoken;
>  /**
>   * open_netns() - Switch to specified network namespace by name.
> @@ -67,4 +69,5 @@ struct nstoken;
>   */
>  struct nstoken *open_netns(const char *name);
>  void close_netns(struct nstoken *token);
> +int set_netns(int netns_fd);
>  #endif
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 626c461fa34d..53b32a641e8e 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -20,7 +20,7 @@
>  
>  #define UDP_PAYLOAD_BYTES 4
>  
> -#define AF_XDP_SOURCE_PORT 1234
> +#define UDP_SOURCE_PORT 1234
>  #define AF_XDP_CONSUMER_PORT 8080
>  
>  #define UMEM_NUM 16
> @@ -33,6 +33,12 @@
>  #define RX_ADDR "10.0.0.2"
>  #define PREFIX_LEN "8"
>  #define FAMILY AF_INET
> +#define TX_NETNS_NAME "xdp_metadata_tx"
> +#define RX_NETNS_NAME "xdp_metadata_rx"
> +#define TX_MAC "00:00:00:00:00:01"
> +#define RX_MAC "00:00:00:00:00:02"
> +
> +#define XDP_RSS_TYPE_L4 BIT(3)
>  
>  struct xsk {
>  	void *umem_area;
> @@ -119,90 +125,28 @@ static void close_xsk(struct xsk *xsk)
>  	munmap(xsk->umem_area, UMEM_SIZE);
>  }
>  
> -static void ip_csum(struct iphdr *iph)
> +static int generate_packet_udp(void)
>  {
> -	__u32 sum = 0;
> -	__u16 *p;
> -	int i;
> -
> -	iph->check = 0;
> -	p = (void *)iph;
> -	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
> -		sum += p[i];
> -
> -	while (sum >> 16)
> -		sum = (sum & 0xffff) + (sum >> 16);
> -
> -	iph->check = ~sum;
> -}
> -
> -static int generate_packet(struct xsk *xsk, __u16 dst_port)
> -{
> -	struct xdp_desc *tx_desc;
> -	struct udphdr *udph;
> -	struct ethhdr *eth;
> -	struct iphdr *iph;
> -	void *data;
> -	__u32 idx;
> -	int ret;
> -
> -	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> -	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
> -		return -1;
> -
> -	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> -	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
> -	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
> -	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> -
> -	eth = data;
> -	iph = (void *)(eth + 1);
> -	udph = (void *)(iph + 1);
> -
> -	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
> -	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
> -	eth->h_proto = htons(ETH_P_IP);
> -
> -	iph->version = 0x4;
> -	iph->ihl = 0x5;
> -	iph->tos = 0x9;
> -	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
> -	iph->id = 0;
> -	iph->frag_off = 0;
> -	iph->ttl = 0;
> -	iph->protocol = IPPROTO_UDP;
> -	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
> -	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
> -	ip_csum(iph);
> -
> -	udph->source = htons(AF_XDP_SOURCE_PORT);
> -	udph->dest = htons(dst_port);
> -	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
> -	udph->check = 0;
> -
> -	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
> -
> -	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
> -	xsk_ring_prod__submit(&xsk->tx, 1);
> -
> -	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
> -	if (!ASSERT_GE(ret, 0, "sendto"))
> -		return ret;
> -
> -	return 0;
> -}
> -
> -static void complete_tx(struct xsk *xsk)
> -{
> -	__u32 idx;
> -	__u64 addr;
> -
> -	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
> -		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> -
> -		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> -		xsk_ring_cons__release(&xsk->comp, 1);
> -	}
> +	char udp_payload[UDP_PAYLOAD_BYTES];
> +	struct sockaddr_in rx_addr;
> +	int sock_fd, err = 0;
> +
> +	/* Build a packet */
> +	memset(udp_payload, 0xAA, UDP_PAYLOAD_BYTES);
> +	rx_addr.sin_addr.s_addr = inet_addr(RX_ADDR);
> +	rx_addr.sin_family = AF_INET;
> +	rx_addr.sin_port = htons(UDP_SOURCE_PORT);
> +
> +	sock_fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> +	if (!ASSERT_GE(sock_fd, 0, "socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)"))
> +		return sock_fd;
> +
> +	err = sendto(sock_fd, udp_payload, UDP_PAYLOAD_BYTES, MSG_DONTWAIT,
> +		     (void *)&rx_addr, sizeof(rx_addr));
> +	ASSERT_GE(err, 0, "sendto");
> +
> +	close(sock_fd);
> +	return err;
>  }
>  
>  static void refill_rx(struct xsk *xsk, __u64 addr)
> @@ -268,7 +212,8 @@ static int verify_xsk_metadata(struct xsk *xsk)
>  	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
>  		return -1;
>  
> -	ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
> +	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
> +		return -1;
>  
>  	xsk_ring_cons__release(&xsk->rx, 1);
>  	refill_rx(xsk, comp_addr);
> @@ -281,40 +226,46 @@ void test_xdp_metadata(void)
>  	struct xdp_metadata2 *bpf_obj2 = NULL;
>  	struct xdp_metadata *bpf_obj = NULL;
>  	struct bpf_program *new_prog, *prog;
> -	struct nstoken *tok = NULL;
> +	int prev_netns, rx_netns, tx_netns;
>  	__u32 queue_id = QUEUE_ID;
>  	struct bpf_map *prog_arr;
> -	struct xsk tx_xsk = {};
>  	struct xsk rx_xsk = {};
>  	__u32 val, key = 0;
>  	int retries = 10;
>  	int rx_ifindex;
> -	int tx_ifindex;
>  	int sock_fd;
>  	int ret;
>  
> -	/* Setup new networking namespace, with a veth pair. */
> +	/* Setup new networking namespaces, with a veth pair. */
>  
> -	SYS(out, "ip netns add xdp_metadata");
> -	tok = open_netns("xdp_metadata");
> +	SYS(out, "ip netns add " TX_NETNS_NAME);
> +	SYS(out, "ip netns add " RX_NETNS_NAME);
> +	prev_netns = get_cur_netns();
> +	tx_netns = get_netns(TX_NETNS_NAME);
> +	rx_netns = get_netns(RX_NETNS_NAME);
> +	if (prev_netns < 0 || tx_netns < 0 || rx_netns < 0)
> +		goto close_ns;
> +
> +	set_netns(tx_netns);
>  	SYS(out, "ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
>  	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
> -	SYS(out, "ip link set dev " TX_NAME " address 00:00:00:00:00:01");
> -	SYS(out, "ip link set dev " RX_NAME " address 00:00:00:00:00:02");
> +	SYS(out, "ip link set " RX_NAME " netns " RX_NETNS_NAME);
> +
> +	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
>  	SYS(out, "ip link set dev " TX_NAME " up");
> -	SYS(out, "ip link set dev " RX_NAME " up");
>  	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> -	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
>  
> +	/* Avoid ARP calls */
> +	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> +
> +	set_netns(rx_netns);
> +	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> +	SYS(out, "ip link set dev " RX_NAME " up");
> +	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
>  	rx_ifindex = if_nametoindex(RX_NAME);
> -	tx_ifindex = if_nametoindex(TX_NAME);
>  
>  	/* Setup separate AF_XDP for TX and RX interfaces. */
>  
> -	ret = open_xsk(tx_ifindex, &tx_xsk);
> -	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
> -		goto out;
> -
>  	ret = open_xsk(rx_ifindex, &rx_xsk);
>  	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
>  		goto out;
> @@ -355,17 +306,16 @@ void test_xdp_metadata(void)
>  		goto out;
>  
>  	/* Send packet destined to RX AF_XDP socket. */
> -	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> -		       "generate AF_XDP_CONSUMER_PORT"))
> +	set_netns(tx_netns);
> +	if (!ASSERT_GE(generate_packet_udp(), 0, "generate UDP packet"))
>  		goto out;
>  
>  	/* Verify AF_XDP RX packet has proper metadata. */
> +	set_netns(rx_netns);
>  	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
>  		       "verify_xsk_metadata"))
>  		goto out;
>  
> -	complete_tx(&tx_xsk);
> -
>  	/* Make sure freplace correctly picks up original bound device
>  	 * and doesn't crash.
>  	 */
> @@ -384,10 +334,11 @@ void test_xdp_metadata(void)
>  		goto out;
>  
>  	/* Send packet to trigger . */
> -	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> -		       "generate freplace packet"))
> +	set_netns(tx_netns);
> +	if (!ASSERT_GE(generate_packet_udp(), 0, "generate freplace packet"))
>  		goto out;
>  
> +	set_netns(rx_netns);
>  	while (!retries--) {
>  		if (bpf_obj2->bss->called)
>  			break;
> @@ -397,10 +348,14 @@ void test_xdp_metadata(void)
>  
>  out:
>  	close_xsk(&rx_xsk);
> -	close_xsk(&tx_xsk);
>  	xdp_metadata2__destroy(bpf_obj2);
>  	xdp_metadata__destroy(bpf_obj);
> -	if (tok)
> -		close_netns(tok);
> -	SYS_NOFAIL("ip netns del xdp_metadata");
> +	set_netns(prev_netns);
> +close_ns:
> +	close(prev_netns);
> +	close(tx_netns);
> +	close(rx_netns);
> +
> +	SYS_NOFAIL("ip netns del " RX_NETNS_NAME);
> +	SYS_NOFAIL("ip netns del " TX_NETNS_NAME);
>  }
> -- 
> 2.41.0
> 


* Re: [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto in xdp_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
@ 2023-07-05 17:41   ` Stanislav Fomichev
  2023-07-06 10:10   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:41 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> Verify, whether VLAN tag and proto are set correctly.
> 
> To simulate "stripped" VLAN tag on veth, send test packet from VLAN
> interface.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>

Acked-by: Stanislav Fomichev <sdf@google.com>

> ---
>  .../selftests/bpf/prog_tests/xdp_metadata.c   | 21 +++++++++++++++++--
>  .../selftests/bpf/progs/xdp_metadata.c        |  4 ++++
>  2 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 53b32a641e8e..50ac9f570bc5 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -38,6 +38,13 @@
>  #define TX_MAC "00:00:00:00:00:01"
>  #define RX_MAC "00:00:00:00:00:02"
>  
> +#define VLAN_ID 59
> +#define VLAN_ID_STR "59"
> +#define VLAN_PROTO "802.1Q"
> +#define VLAN_PID htons(ETH_P_8021Q)
> +#define TX_NAME_VLAN TX_NAME "." VLAN_ID_STR
> +#define RX_NAME_VLAN RX_NAME "." VLAN_ID_STR
> +
>  #define XDP_RSS_TYPE_L4 BIT(3)
>  
>  struct xsk {
> @@ -215,6 +222,12 @@ static int verify_xsk_metadata(struct xsk *xsk)
>  	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
>  		return -1;
>  
> +	if (!ASSERT_EQ(meta->rx_vlan_tag, VLAN_ID, "rx_vlan_tag"))
> +		return -1;
> +
> +	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
> +		return -1;
> +
>  	xsk_ring_cons__release(&xsk->rx, 1);
>  	refill_rx(xsk, comp_addr);
>  
> @@ -253,10 +266,14 @@ void test_xdp_metadata(void)
>  
>  	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
>  	SYS(out, "ip link set dev " TX_NAME " up");
> -	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> +
> +	SYS(out, "ip link add link " TX_NAME " " TX_NAME_VLAN
> +		 " type vlan proto " VLAN_PROTO " id " VLAN_ID_STR);
> +	SYS(out, "ip link set dev " TX_NAME_VLAN " up");
> +	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME_VLAN);
>  
>  	/* Avoid ARP calls */
> -	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> +	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME_VLAN);
>  
>  	set_netns(rx_netns);
>  	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index d151d406a123..382984a5d1c9 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> @@ -23,6 +23,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
>  					 __u64 *timestamp) __ksym;
>  extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
>  				    enum xdp_rss_hash_type *rss_type) __ksym;
> +extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
> +					__u16 *vlan_tag,
> +					__be16 *vlan_proto) __ksym;
>  
>  SEC("xdp")
>  int rx(struct xdp_md *ctx)
> @@ -57,6 +60,7 @@ int rx(struct xdp_md *ctx)
>  		meta->rx_timestamp = 1;
>  
>  	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
> +	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
>  
>  	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>  }
> -- 
> 2.41.0
> 


* Re: [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level in xdp_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level " Larysa Zaremba
@ 2023-07-05 17:41   ` Stanislav Fomichev
  2023-07-06 10:25   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-05 17:41 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/03, Larysa Zaremba wrote:
> Verify, whether kfunc in xdp_metadata test correctly returns checksum level
> of zero.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>

Acked-by: Stanislav Fomichev <sdf@google.com>

> ---
>  tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 +++
>  tools/testing/selftests/bpf/progs/xdp_metadata.c      | 7 +++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 50ac9f570bc5..6c71d712932e 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -228,6 +228,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
>  	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
>  		return -1;
>  
> +	if (!ASSERT_NEQ(meta->rx_csum_lvl, 0, "rx_csum_lvl"))
> +		return -1;
> +
>  	xsk_ring_cons__release(&xsk->rx, 1);
>  	refill_rx(xsk, comp_addr);
>  
> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index 382984a5d1c9..6f7223d581b7 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> @@ -26,6 +26,8 @@ extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
>  extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
>  					__u16 *vlan_tag,
>  					__be16 *vlan_proto) __ksym;
> +extern int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx,
> +					__u8 *csum_level) __ksym;
>  
>  SEC("xdp")
>  int rx(struct xdp_md *ctx)
> @@ -62,6 +64,11 @@ int rx(struct xdp_md *ctx)
>  	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
>  	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
>  
> +	/* Same as with timestamp, zero is expected */
> +	ret = bpf_xdp_metadata_rx_csum_lvl(ctx, &meta->rx_csum_lvl);
> +	if (!ret && meta->rx_csum_lvl == 0)
> +		meta->rx_csum_lvl = 1;
> +
>  	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>  }
>  
> -- 
> 2.41.0
> 


* Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-04 11:19         ` Larysa Zaremba
@ 2023-07-06  5:50           ` John Fastabend
  2023-07-06  9:04             ` [xdp-hints] " Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2023-07-06  5:50 UTC (permalink / raw)
  To: Larysa Zaremba, Jesper Dangaard Brouer
  Cc: John Fastabend, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David S. Miller, Alexander Duyck

Larysa Zaremba wrote:
> On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
> > Cc. DaveM+Alex Duyck, as I value your insights on checksums.
> > 
> > On 04/07/2023 11.24, Larysa Zaremba wrote:
> > > On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
> > > > Larysa Zaremba wrote:
> > > > > Implement functionality that enables drivers to expose to XDP code,
> > > > > whether checksums were checked and on what level.
> > > > > 
> > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > ---
> > > > >   Documentation/networking/xdp-rx-metadata.rst |  3 +++
> > > > >   include/linux/netdevice.h                    |  1 +
> > > > >   include/net/xdp.h                            |  2 ++
> > > > >   kernel/bpf/offload.c                         |  2 ++
> > > > >   net/core/xdp.c                               | 21 ++++++++++++++++++++
> > > > >   5 files changed, 29 insertions(+)
> > > > > 
> > > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > > index ea6dd79a21d3..4ec6ddfd2a52 100644
> > > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > > @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
> > > > >   .. kernel-doc:: net/core/xdp.c
> > > > >      :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> > > > > +
> > > > >   An XDP program can use these kfuncs to read the metadata into stack
> > > > >   variables for its own consumption. Or, to pass the metadata on to other
> > > > >   consumers, an XDP program can store it into the metadata area carried
> > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > index 4fa4380e6d89..569563687172 100644
> > > > > --- a/include/linux/netdevice.h
> > > > > +++ b/include/linux/netdevice.h
> > > > > @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
> > > > >   			       enum xdp_rss_hash_type *rss_type);
> > > > >   	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> > > > >   				   __be16 *vlan_proto);
> > > > > +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
> > > > >   };
> > > > >   /**
> > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > > index 89c58f56ffc6..61ed38fa79d1 100644
> > > > > --- a/include/net/xdp.h
> > > > > +++ b/include/net/xdp.h
> > > > > @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> > > > >   			   bpf_xdp_metadata_rx_hash) \
> > > > >   	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> > > > >   			   bpf_xdp_metadata_rx_vlan_tag) \
> > > > > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> > > > > +			   bpf_xdp_metadata_rx_csum_lvl) \
> > > > >   enum {
> > > > >   #define XDP_METADATA_KFUNC(name, _) name,
> > > > > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > > > > index 986e7becfd42..a133fb775f49 100644
> > > > > --- a/kernel/bpf/offload.c
> > > > > +++ b/kernel/bpf/offload.c
> > > > > @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> > > > >   		p = ops->xmo_rx_hash;
> > > > >   	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> > > > >   		p = ops->xmo_rx_vlan_tag;
> > > > > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> > > > > +		p = ops->xmo_rx_csum_lvl;
> > > > >   out:
> > > > >   	up_read(&bpf_devs_lock);
> > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > index f6262c90e45f..c666d3e0a26c 100644
> > > > > --- a/net/core/xdp.c
> > > > > +++ b/net/core/xdp.c
> > > > > @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
> > > > >   	return -EOPNOTSUPP;
> > > > >   }
> > > > > +/**
> > > > > + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> > > > > + * @ctx: XDP context pointer.
> > > > > + * @csum_level: Return value pointer.
> > > > > + *
> > > > > + * In case of success, csum_level contains depth of the last verified checksum.
> > > > > + * If only the outermost checksum was verified, csum_level is 0, if both
> > > > > + * encapsulation and inner transport checksums were verified, csum_level is 1,
> > > > > + * and so on.
> > > > > + * For more details, refer to csum_level field in sk_buff.
> > > > > + *
> > > > > + * Return:
> > > > > + * * Returns 0 on success or ``-errno`` on error.
> > > > > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > > > > + * * ``-ENODATA``    : Checksum was not validated
> > > > > + */
> > > > > +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> > > > 
> > > > Instead of ENODATA should we return what would be put in the ip_summed field
> > > > CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,
> > 
> > I was thinking the same, what about checksum "type".
> > 
> > > > 
> > > >   bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
> > > > 
> > > > or something like that? Or is the thought that it's not really necessary?
> > > > I don't have a strong preference but figured it was worth asking.
> > > > 
> > > 
> > > I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
> > > Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
> > > overcomplicate the function signature.
> > 
> > So, this kfunc bpf_xdp_metadata_rx_csum_lvl() success is it equivalent to
> > CHECKSUM_UNNECESSARY?
> 
> This is 100% true for physical NICs, it's more complicated for veth, because it
> often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
> treated by the network stack as a validated checksum, because there is no way 
> internally generated packet could be messed up. I would be grateful if you could 
> look at the veth patch and share your opinion about this.
> 
> > 
> > Looking at documentation[1] (generated from skbuff.h):
> >  [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
> > 
> > Is the idea that we can add another kfunc (new signature) that can deal
> > with the other types of checksums (in a later kernel release)?
> >
> 
> Yes, that is the idea.

If we think there is a chance we might need another kfunc we should add it
in the same kfunc. It would be unfortunate to have to do two kfuncs when
one would work. It shouldn't cost much/anything(?) to hardcode the type for
most cases? I think if we need it later I would advocate for updating this
kfunc to support it. Of course then userspace will have to swivel on the
kfunc signature.


* Re: [xdp-hints] Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-06  5:50           ` John Fastabend
@ 2023-07-06  9:04             ` Jesper Dangaard Brouer
  2023-07-06 12:38               ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-06  9:04 UTC (permalink / raw)
  To: John Fastabend, Larysa Zaremba, Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	David S. Miller, Alexander Duyck



On 06/07/2023 07.50, John Fastabend wrote:
> Larysa Zaremba wrote:
>> On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
>>> Cc. DaveM+Alex Duyck, as I value your insights on checksums.
>>>
>>> On 04/07/2023 11.24, Larysa Zaremba wrote:
>>>> On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
>>>>> Larysa Zaremba wrote:
>>>>>> Implement functionality that enables drivers to expose to XDP code,
>>>>>> whether checksums were checked and on what level.
>>>>>>
>>>>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>>>>> ---
>>>>>>    Documentation/networking/xdp-rx-metadata.rst |  3 +++
>>>>>>    include/linux/netdevice.h                    |  1 +
>>>>>>    include/net/xdp.h                            |  2 ++
>>>>>>    kernel/bpf/offload.c                         |  2 ++
>>>>>>    net/core/xdp.c                               | 21 ++++++++++++++++++++
>>>>>>    5 files changed, 29 insertions(+)
>>>>>>
>>>>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>>>>> index ea6dd79a21d3..4ec6ddfd2a52 100644
>>>>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>>>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>>>>> @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
>>>>>>    .. kernel-doc:: net/core/xdp.c
>>>>>>       :identifiers: bpf_xdp_metadata_rx_vlan_tag
>>>>>> +.. kernel-doc:: net/core/xdp.c
>>>>>> +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
>>>>>> +
>>>>>>    An XDP program can use these kfuncs to read the metadata into stack
>>>>>>    variables for its own consumption. Or, to pass the metadata on to other
>>>>>>    consumers, an XDP program can store it into the metadata area carried
>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>> index 4fa4380e6d89..569563687172 100644
>>>>>> --- a/include/linux/netdevice.h
>>>>>> +++ b/include/linux/netdevice.h
>>>>>> @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
>>>>>>    			       enum xdp_rss_hash_type *rss_type);
>>>>>>    	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
>>>>>>    				   __be16 *vlan_proto);
>>>>>> +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
>>>>>>    };
>>>>>>    /**
>>>>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>>>>> index 89c58f56ffc6..61ed38fa79d1 100644
>>>>>> --- a/include/net/xdp.h
>>>>>> +++ b/include/net/xdp.h
>>>>>> @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>>>>>>    			   bpf_xdp_metadata_rx_hash) \
>>>>>>    	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
>>>>>>    			   bpf_xdp_metadata_rx_vlan_tag) \
>>>>>> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
>>>>>> +			   bpf_xdp_metadata_rx_csum_lvl) \
>>>>>>    enum {
>>>>>>    #define XDP_METADATA_KFUNC(name, _) name,
>>>>>> diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
>>>>>> index 986e7becfd42..a133fb775f49 100644
>>>>>> --- a/kernel/bpf/offload.c
>>>>>> +++ b/kernel/bpf/offload.c
>>>>>> @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
>>>>>>    		p = ops->xmo_rx_hash;
>>>>>>    	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
>>>>>>    		p = ops->xmo_rx_vlan_tag;
>>>>>> +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
>>>>>> +		p = ops->xmo_rx_csum_lvl;
>>>>>>    out:
>>>>>>    	up_read(&bpf_devs_lock);
>>>>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>>>>> index f6262c90e45f..c666d3e0a26c 100644
>>>>>> --- a/net/core/xdp.c
>>>>>> +++ b/net/core/xdp.c
>>>>>> @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
>>>>>>    	return -EOPNOTSUPP;
>>>>>>    }
>>>>>> +/**
>>>>>> + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
>>>>>> + * @ctx: XDP context pointer.
>>>>>> + * @csum_level: Return value pointer.
>>>>>> + *
>>>>>> + * In case of success, csum_level contains depth of the last verified checksum.
>>>>>> + * If only the outermost checksum was verified, csum_level is 0, if both
>>>>>> + * encapsulation and inner transport checksums were verified, csum_level is 1,
>>>>>> + * and so on.
>>>>>> + * For more details, refer to csum_level field in sk_buff.
>>>>>> + *
>>>>>> + * Return:
>>>>>> + * * Returns 0 on success or ``-errno`` on error.
>>>>>> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
>>>>>> + * * ``-ENODATA``    : Checksum was not validated
>>>>>> + */
>>>>>> +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
>>>>>
>>>>> Instead of ENODATA should we return what would be put in the ip_summed field
>>>>> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,
>>>
>>> I was thinking the same, what about checksum "type".
>>>
>>>>>
>>>>>    bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
>>>>>
>>>>> or something like that? Or is the thought that it's not really necessary?
>>>>> I don't have a strong preference but figured it was worth asking.
>>>>>
>>>>
>>>> I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
>>>> Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
>>>> overcomplicate the function signature.
>>>
>>> So, this kfunc bpf_xdp_metadata_rx_csum_lvl() success is it equivalent to
>>> CHECKSUM_UNNECESSARY?
>>
>> This is 100% true for physical NICs, it's more complicated for veth, because it
>> often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
>> treated by the network stack as a validated checksum, because there is no way
>> internally generated packet could be messed up. I would be grateful if you could
>> look at the veth patch and share your opinion about this.
>>
>>>
>>> Looking at documentation[1] (generated from skbuff.h):
>>>   [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
>>>
>>> Is the idea that we can add another kfunc (new signature) that can deal
>>> with the other types of checksums (in a later kernel release)?
>>>
>>
>> Yes, that is the idea.
> 
> If we think there is a chance we might need another kfunc we should add it
> in the same kfunc. It would be unfortunate to have to do two kfuncs when
> one would work. It shouldn't cost much/anything(?) to hardcode the type for
> most cases? I think if we need it later I would advocate for updating this
> kfunc to support it. Of course then userspace will have to swivel on the
> kfunc signature.
> 

I think it might make sense to have 3 kfuncs for checksumming, as this
would allow a BPF-prog to focus on CHECKSUM_UNNECESSARY, and then only
call an additional kfunc for extracting e.g. csum_start + csum_offset
when the type is CHECKSUM_PARTIAL.

We could extend bpf_xdp_metadata_rx_csum_lvl() to give the csum_type
CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}.

  int bpf_xdp_metadata_rx_csum_lvl(*ctx, u8 *csum_level, u8 *csum_type)

And then add two kfuncs, e.g.
  (1) bpf_xdp_metadata_rx_csum_partial(ctx, start, offset)
  (2) bpf_xdp_metadata_rx_csum_complete(ctx, csum)

Pseudo BPF-prog code:

  err = bpf_xdp_metadata_rx_csum_lvl(ctx, &level, &type);
  if (!err && type != CHECKSUM_UNNECESSARY) {
      if (type == CHECKSUM_PARTIAL)
          err = bpf_xdp_metadata_rx_csum_partial(ctx, &start, &offset);
      else if (type == CHECKSUM_COMPLETE)
          err = bpf_xdp_metadata_rx_csum_complete(ctx, &csum);
  }

Looking at code, I feel we could rename [...]_csum_lvl to csum_type.
E.g. bpf_xdp_metadata_rx_csum_type.
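
The dispatch in the pseudo code above can be tried out in userspace with
the minimal sketch below. Everything prefixed with mock_ is a hypothetical
stand-in: none of the three kfuncs exist with these signatures, struct
mock_ctx only imitates what a driver might expose, and read_csum() is an
illustrative helper. Only the CHECKSUM_* values are taken from
include/linux/skbuff.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values as defined in include/linux/skbuff.h. */
#define CHECKSUM_NONE        0
#define CHECKSUM_UNNECESSARY 1
#define CHECKSUM_COMPLETE    2
#define CHECKSUM_PARTIAL     3

/* Hypothetical per-packet state that a driver would hold. */
struct mock_ctx {
	uint8_t  type;
	uint8_t  level;  /* meaningful for CHECKSUM_UNNECESSARY */
	uint16_t start;  /* meaningful for CHECKSUM_PARTIAL */
	uint16_t offset; /* meaningful for CHECKSUM_PARTIAL */
	uint32_t csum;   /* meaningful for CHECKSUM_COMPLETE */
};

/* Result gathered by the caller. */
struct csum_info {
	uint8_t  type;
	uint8_t  level;
	uint16_t start;
	uint16_t offset;
	uint32_t csum;
};

/* Mock of the extended kfunc: always reports the type, level only if valid. */
static int mock_rx_csum_lvl(const struct mock_ctx *ctx, uint8_t *level,
			    uint8_t *type)
{
	*type = ctx->type;
	*level = ctx->type == CHECKSUM_UNNECESSARY ? ctx->level : 0;
	return 0;
}

/* Mock of proposed kfunc (1): only valid for CHECKSUM_PARTIAL. */
static int mock_rx_csum_partial(const struct mock_ctx *ctx, uint16_t *start,
				uint16_t *offset)
{
	if (ctx->type != CHECKSUM_PARTIAL)
		return -ENODATA;
	*start = ctx->start;
	*offset = ctx->offset;
	return 0;
}

/* Mock of proposed kfunc (2): only valid for CHECKSUM_COMPLETE. */
static int mock_rx_csum_complete(const struct mock_ctx *ctx, uint32_t *csum)
{
	if (ctx->type != CHECKSUM_COMPLETE)
		return -ENODATA;
	*csum = ctx->csum;
	return 0;
}

/* Ask for the type first, then call the extra kfunc only when needed. */
static int read_csum(const struct mock_ctx *ctx, struct csum_info *out)
{
	int err;

	assert(ctx && out);

	err = mock_rx_csum_lvl(ctx, &out->level, &out->type);
	if (err)
		return err;

	if (out->type == CHECKSUM_PARTIAL)
		return mock_rx_csum_partial(ctx, &out->start, &out->offset);
	if (out->type == CHECKSUM_COMPLETE)
		return mock_rx_csum_complete(ctx, &out->csum);

	return 0;
}
```

In this shape the common CHECKSUM_UNNECESSARY case costs a single call,
and the extra kfuncs are only reached for the types that carry more data.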

Feel free to disagree,
--Jesper



* Re: [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint
  2023-07-05 17:25   ` Stanislav Fomichev
@ 2023-07-06  9:57     ` Jesper Dangaard Brouer
  2023-07-06 10:15       ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-06  9:57 UTC (permalink / raw)
  To: Stanislav Fomichev, Larysa Zaremba
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 05/07/2023 19.25, Stanislav Fomichev wrote:
> On 07/03, Larysa Zaremba wrote:
>> In order to test VLAN tag and checksum level XDP hints in
>> hardware-independent selftests, implement newly added XDP hints in veth
>> driver.
>>
>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>> ---
>>   drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 614f3e3efab0..a7f2b679551d 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -1732,6 +1732,44 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
>>   	return 0;
>>   }
>>   
>> +static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
>> +				__be16 *vlan_proto)
>> +{
>> +	struct veth_xdp_buff *_ctx = (void *)ctx;
>> +	struct sk_buff *skb = _ctx->skb;
>> +	int err;
>> +
>> +	if (!skb)
>> +		return -ENODATA;
>> +
> 
> [..]
> 
>> +	err = __vlan_hwaccel_get_tag(skb, vlan_tag);
> 
> We probably need to open code __vlan_hwaccel_get_tag here. Because it
> returns -EINVAL on !skb_vlan_tag_present where the expectation, for us,
> I'm assuming is -ENODATA?
> 

Looking at in-tree users of __vlan_hwaccel_get_tag(), they don't use the
err value for anything.  Thus, we can just change
__vlan_hwaccel_get_tag() to return -ENODATA instead of -EINVAL.  (And
also remember __vlan_get_tag() adjustment).


$ git diff
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 6ba71957851e..fb35d7dd77a2 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -540,7 +540,7 @@ static inline int __vlan_get_tag(const struct sk_buff *skb, u16 *vlan_tci)
         struct vlan_ethhdr *veth = skb_vlan_eth_hdr(skb);

         if (!eth_type_vlan(veth->h_vlan_proto))
-               return -EINVAL;
+               return -ENODATA;

         *vlan_tci = ntohs(veth->h_vlan_TCI);
         return 0;
@@ -561,7 +561,7 @@ static inline int __vlan_hwaccel_get_tag(const struct sk_buff *skb,
                 return 0;
         } else {
                 *vlan_tci = 0;
-               return -EINVAL;
+               return -ENODATA;
         }
  }
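
To illustrate the effect on a caller such as the veth callback, here is a
minimal userspace sketch, not kernel code. struct mock_skb and both mock_
functions are hypothetical stand-ins that only mirror the shape of
__vlan_hwaccel_get_tag() and veth_xdp_rx_vlan_tag(); the point is that
with -ENODATA coming from the helper, the error can be propagated as-is.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields used here. */
struct mock_skb {
	bool     vlan_present;
	uint16_t vlan_tci;
	uint16_t vlan_proto; /* network byte order in the kernel */
};

/* Mirrors __vlan_hwaccel_get_tag() with the proposed -ENODATA return. */
static int mock_vlan_hwaccel_get_tag(const struct mock_skb *skb,
				     uint16_t *vlan_tci)
{
	if (skb->vlan_present) {
		*vlan_tci = skb->vlan_tci;
		return 0;
	}

	*vlan_tci = 0;
	return -ENODATA;
}

/* Shape of the veth callback: no errno translation is needed anymore. */
static int mock_rx_vlan_tag(const struct mock_skb *skb, uint16_t *vlan_tag,
			    uint16_t *vlan_proto)
{
	int err;

	assert(vlan_tag && vlan_proto);

	if (!skb)
		return -ENODATA;

	err = mock_vlan_hwaccel_get_tag(skb, vlan_tag);
	if (err)
		return err;

	*vlan_proto = skb->vlan_proto;
	return 0;
}
```

A BPF program calling the kfunc would then see -ENODATA for an untagged
packet, which matches what the other hints in this series return when
the data is absent.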



>> +	if (err)
>> +		return err;
>> +
>> +	*vlan_proto = skb->vlan_proto;
>> +	return err;
>> +}
>> +
>> +static int veth_xdp_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
>> +{
>> +	struct veth_xdp_buff *_ctx = (void *)ctx;
>> +	struct sk_buff *skb = _ctx->skb;
>> +
>> +	if (!skb)
>> +		return -ENODATA;
>> +
>> +	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
>> +		*csum_level = skb->csum_level;
>> +	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
>> +		 skb_checksum_start_offset(skb) == skb_transport_offset(skb) ||
>> +		 skb->csum_valid)
>> +		*csum_level = 0;
>> +	else
>> +		return -ENODATA;
>> +
>> +	return 0;
>> +}
>> +
[...]



* Re: [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto in xdp_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
  2023-07-05 17:41   ` Stanislav Fomichev
@ 2023-07-06 10:10   ` Jesper Dangaard Brouer
  2023-07-06 10:13     ` Larysa Zaremba
  1 sibling, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-06 10:10 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 03/07/2023 20.12, Larysa Zaremba wrote:
> Verify, whether VLAN tag and proto are set correctly.
> 
> To simulate "stripped" VLAN tag on veth, send test packet from VLAN
> interface.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>   .../selftests/bpf/prog_tests/xdp_metadata.c   | 21 +++++++++++++++++--
>   .../selftests/bpf/progs/xdp_metadata.c        |  4 ++++
>   2 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 53b32a641e8e..50ac9f570bc5 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -38,6 +38,13 @@
>   #define TX_MAC "00:00:00:00:00:01"
>   #define RX_MAC "00:00:00:00:00:02"
>   
> +#define VLAN_ID 59
> +#define VLAN_ID_STR "59"
> +#define VLAN_PROTO "802.1Q"
> +#define VLAN_PID htons(ETH_P_8021Q)
> +#define TX_NAME_VLAN TX_NAME "." VLAN_ID_STR
> +#define RX_NAME_VLAN RX_NAME "." VLAN_ID_STR
> +
>   #define XDP_RSS_TYPE_L4 BIT(3)
>   
>   struct xsk {
> @@ -215,6 +222,12 @@ static int verify_xsk_metadata(struct xsk *xsk)
>   	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
>   		return -1;
>   
> +	if (!ASSERT_EQ(meta->rx_vlan_tag, VLAN_ID, "rx_vlan_tag"))
> +		return -1;

In other examples you are masking meta->rx_vlan_tag with VLAN_VID_MASK
(12 lower bits 0x0fff) to extract the VLAN_ID.  It would make the
selftest more correct, robust and pedagogical to also mask out the ID here.
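
For illustration, a minimal userspace sketch of that masking. The mask
and shift values are the ones from include/linux/if_vlan.h, redefined
here so the snippet compiles on its own; the helper names tci_to_vid()
and tci_to_prio() are hypothetical and not part of the selftest.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as in include/linux/if_vlan.h. */
#define VLAN_PRIO_MASK  0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT 13
#define VLAN_VID_MASK   0x0fff /* VLAN Identifier */

static_assert((VLAN_PRIO_MASK & VLAN_VID_MASK) == 0,
	      "priority and VLAN ID bits must not overlap");

/* Hypothetical helper: keep only the 12-bit VLAN ID of a host-order TCI. */
static inline uint16_t tci_to_vid(uint16_t tci)
{
	return tci & VLAN_VID_MASK;
}

/* Hypothetical helper: extract the 3-bit priority of a host-order TCI. */
static inline uint8_t tci_to_prio(uint16_t tci)
{
	return (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
}
```

The check would then compare tci_to_vid(meta->rx_vlan_tag) against
VLAN_ID, so a non-zero priority in the tag cannot cause a false failure.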


> +
> +	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
> +		return -1;
> +
>   	xsk_ring_cons__release(&xsk->rx, 1);
>   	refill_rx(xsk, comp_addr);
>   
> @@ -253,10 +266,14 @@ void test_xdp_metadata(void)
>   
>   	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
>   	SYS(out, "ip link set dev " TX_NAME " up");
> -	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> +
> +	SYS(out, "ip link add link " TX_NAME " " TX_NAME_VLAN
> +		 " type vlan proto " VLAN_PROTO " id " VLAN_ID_STR);
> +	SYS(out, "ip link set dev " TX_NAME_VLAN " up");
> +	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME_VLAN);
>   
>   	/* Avoid ARP calls */
> -	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> +	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME_VLAN);
>   
>   	set_netns(rx_netns);
>   	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index d151d406a123..382984a5d1c9 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> @@ -23,6 +23,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
>   					 __u64 *timestamp) __ksym;
>   extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
>   				    enum xdp_rss_hash_type *rss_type) __ksym;
> +extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
> +					__u16 *vlan_tag,
> +					__be16 *vlan_proto) __ksym;
>   
>   SEC("xdp")
>   int rx(struct xdp_md *ctx)
> @@ -57,6 +60,7 @@ int rx(struct xdp_md *ctx)
>   		meta->rx_timestamp = 1;
>   
>   	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
> +	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
>   
>   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>   }


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto in xdp_metadata
  2023-07-06 10:10   ` Jesper Dangaard Brouer
@ 2023-07-06 10:13     ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 10:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 06, 2023 at 12:10:01PM +0200, Jesper Dangaard Brouer wrote:
> 
> On 03/07/2023 20.12, Larysa Zaremba wrote:
> > Verify, whether VLAN tag and proto are set correctly.
> > 
> > To simulate "stripped" VLAN tag on veth, send test packet from VLAN
> > interface.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >   .../selftests/bpf/prog_tests/xdp_metadata.c   | 21 +++++++++++++++++--
> >   .../selftests/bpf/progs/xdp_metadata.c        |  4 ++++
> >   2 files changed, 23 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > index 53b32a641e8e..50ac9f570bc5 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > @@ -38,6 +38,13 @@
> >   #define TX_MAC "00:00:00:00:00:01"
> >   #define RX_MAC "00:00:00:00:00:02"
> > +#define VLAN_ID 59
> > +#define VLAN_ID_STR "59"
> > +#define VLAN_PROTO "802.1Q"
> > +#define VLAN_PID htons(ETH_P_8021Q)
> > +#define TX_NAME_VLAN TX_NAME "." VLAN_ID_STR
> > +#define RX_NAME_VLAN RX_NAME "." VLAN_ID_STR
> > +
> >   #define XDP_RSS_TYPE_L4 BIT(3)
> >   struct xsk {
> > @@ -215,6 +222,12 @@ static int verify_xsk_metadata(struct xsk *xsk)
> >   	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
> >   		return -1;
> > +	if (!ASSERT_EQ(meta->rx_vlan_tag, VLAN_ID, "rx_vlan_tag"))
> > +		return -1;
> 
> In other examples you are masking meta->rx_vlan_tag with VLAN_VID_MASK
> (12 lower bits 0x0fff) to extract the VLAN_ID.  It would make the
> selftest more correct, robust and pedagogical to also mask out the ID here.
>

True, will do.
 
> 
> > +
> > +	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
> > +		return -1;
> > +
> >   	xsk_ring_cons__release(&xsk->rx, 1);
> >   	refill_rx(xsk, comp_addr);
> > @@ -253,10 +266,14 @@ void test_xdp_metadata(void)
> >   	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
> >   	SYS(out, "ip link set dev " TX_NAME " up");
> > -	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> > +
> > +	SYS(out, "ip link add link " TX_NAME " " TX_NAME_VLAN
> > +		 " type vlan proto " VLAN_PROTO " id " VLAN_ID_STR);
> > +	SYS(out, "ip link set dev " TX_NAME_VLAN " up");
> > +	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME_VLAN);
> >   	/* Avoid ARP calls */
> > -	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> > +	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME_VLAN);
> >   	set_netns(rx_netns);
> >   	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> > diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > index d151d406a123..382984a5d1c9 100644
> > --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > @@ -23,6 +23,9 @@ extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
> >   					 __u64 *timestamp) __ksym;
> >   extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
> >   				    enum xdp_rss_hash_type *rss_type) __ksym;
> > +extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
> > +					__u16 *vlan_tag,
> > +					__be16 *vlan_proto) __ksym;
> >   SEC("xdp")
> >   int rx(struct xdp_md *ctx)
> > @@ -57,6 +60,7 @@ int rx(struct xdp_md *ctx)
> >   		meta->rx_timestamp = 1;
> >   	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
> > +	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
> >   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
> >   }
> 


* Re: [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint
  2023-07-06  9:57     ` Jesper Dangaard Brouer
@ 2023-07-06 10:15       ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 10:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stanislav Fomichev, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 06, 2023 at 11:57:06AM +0200, Jesper Dangaard Brouer wrote:
> 
> On 05/07/2023 19.25, Stanislav Fomichev wrote:
> > On 07/03, Larysa Zaremba wrote:
> > > In order to test VLAN tag and checksum level XDP hints in
> > > hardware-independent selftests, implement newly added XDP hints in veth
> > > driver.
> > > 
> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >   drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 40 insertions(+)
> > > 
> > > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > > index 614f3e3efab0..a7f2b679551d 100644
> > > --- a/drivers/net/veth.c
> > > +++ b/drivers/net/veth.c
> > > @@ -1732,6 +1732,44 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
> > >   	return 0;
> > >   }
> > > +static int veth_xdp_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan_tag,
> > > +				__be16 *vlan_proto)
> > > +{
> > > +	struct veth_xdp_buff *_ctx = (void *)ctx;
> > > +	struct sk_buff *skb = _ctx->skb;
> > > +	int err;
> > > +
> > > +	if (!skb)
> > > +		return -ENODATA;
> > > +
> > 
> > [..]
> > 
> > > +	err = __vlan_hwaccel_get_tag(skb, vlan_tag);
> > 
> > We probably need to open code __vlan_hwaccel_get_tag here. Because it
> > returns -EINVAL on !skb_vlan_tag_present where the expectation, for us,
> > I'm assuming is -ENODATA?
> > 
> 
> Looking at in-tree users of __vlan_hwaccel_get_tag(), they don't use the
> err value for anything.  Thus, we can just change
> __vlan_hwaccel_get_tag() to return -ENODATA instead of -EINVAL.  (And
> also remember __vlan_get_tag() adjustment).
>

Seems like a good idea, will include those changes in v3.
 
> 
> $ git diff
> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
> index 6ba71957851e..fb35d7dd77a2 100644
> --- a/include/linux/if_vlan.h
> +++ b/include/linux/if_vlan.h
> @@ -540,7 +540,7 @@ static inline int __vlan_get_tag(const struct sk_buff
> *skb, u16 *vlan_tci)
>         struct vlan_ethhdr *veth = skb_vlan_eth_hdr(skb);
> 
>         if (!eth_type_vlan(veth->h_vlan_proto))
> -               return -EINVAL;
> +               return -ENODATA;
> 
>         *vlan_tci = ntohs(veth->h_vlan_TCI);
>         return 0;
> @@ -561,7 +561,7 @@ static inline int __vlan_hwaccel_get_tag(const struct
> sk_buff *skb,
>                 return 0;
>         } else {
>                 *vlan_tci = 0;
> -               return -EINVAL;
> +               return -ENODATA;
>         }
>  }
> 
> 
> 
> > > +	if (err)
> > > +		return err;
> > > +
> > > +	*vlan_proto = skb->vlan_proto;
> > > +	return err;
> > > +}
> > > +
> > > +static int veth_xdp_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> > > +{
> > > +	struct veth_xdp_buff *_ctx = (void *)ctx;
> > > +	struct sk_buff *skb = _ctx->skb;
> > > +
> > > +	if (!skb)
> > > +		return -ENODATA;
> > > +
> > > +	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
> > > +		*csum_level = skb->csum_level;
> > > +	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
> > > +		 skb_checksum_start_offset(skb) == skb_transport_offset(skb) ||
> > > +		 skb->csum_valid)
> > > +		*csum_level = 0;
> > > +	else
> > > +		return -ENODATA;
> > > +
> > > +	return 0;
> > > +}
> > > +
> [...]
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level in xdp_metadata
  2023-07-03 18:12 ` [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level " Larysa Zaremba
  2023-07-05 17:41   ` Stanislav Fomichev
@ 2023-07-06 10:25   ` Jesper Dangaard Brouer
  2023-07-06 12:02     ` Larysa Zaremba
  1 sibling, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-06 10:25 UTC (permalink / raw)
  To: Larysa Zaremba, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev



On 03/07/2023 20.12, Larysa Zaremba wrote:
> Verify, whether kfunc in xdp_metadata test correctly returns checksum level
> of zero.
> 
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>   tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 +++
>   tools/testing/selftests/bpf/progs/xdp_metadata.c      | 7 +++++++
>   2 files changed, 10 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> index 50ac9f570bc5..6c71d712932e 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> @@ -228,6 +228,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
>   	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
>   		return -1;
>   
> +	if (!ASSERT_NEQ(meta->rx_csum_lvl, 0, "rx_csum_lvl"))
> +		return -1;

Not-equal ("NEQ") to 0 feels weird here.
Below you set meta->rx_csum_lvl=1 in case meta->rx_csum_lvl==0.

Thus, the test can pass if meta->rx_csum_lvl happens to be a random value.
We could set meta->rx_csum_lvl to 42 in case meta->rx_csum_lvl==0, and
then use ASSERT_EQ against 42 to be more certain that the case we are
testing is fulfilled.


> +
>   	xsk_ring_cons__release(&xsk->rx, 1);
>   	refill_rx(xsk, comp_addr);
>   
> diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> index 382984a5d1c9..6f7223d581b7 100644
> --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> @@ -26,6 +26,8 @@ extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
>   extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
>   					__u16 *vlan_tag,
>   					__be16 *vlan_proto) __ksym;
> +extern int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx,
> +					__u8 *csum_level) __ksym;
>   
>   SEC("xdp")
>   int rx(struct xdp_md *ctx)
> @@ -62,6 +64,11 @@ int rx(struct xdp_md *ctx)
>   	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
>   	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
>   
> +	/* Same as with timestamp, zero is expected */
> +	ret = bpf_xdp_metadata_rx_csum_lvl(ctx, &meta->rx_csum_lvl);
> +	if (!ret && meta->rx_csum_lvl == 0)
> +		meta->rx_csum_lvl = 1;
> +

IMHO the code is more human-readable if the "ret" variable is renamed to "err".

I know you are just reusing variable "ret", so it's not really your fault.



>   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>   }
>   


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level in xdp_metadata
  2023-07-06 10:25   ` Jesper Dangaard Brouer
@ 2023-07-06 12:02     ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 12:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 06, 2023 at 12:25:10PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 03/07/2023 20.12, Larysa Zaremba wrote:
> > Verify, whether kfunc in xdp_metadata test correctly returns checksum level
> > of zero.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >   tools/testing/selftests/bpf/prog_tests/xdp_metadata.c | 3 +++
> >   tools/testing/selftests/bpf/progs/xdp_metadata.c      | 7 +++++++
> >   2 files changed, 10 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > index 50ac9f570bc5..6c71d712932e 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > @@ -228,6 +228,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
> >   	if (!ASSERT_EQ(meta->rx_vlan_proto, VLAN_PID, "rx_vlan_proto"))
> >   		return -1;
> > +	if (!ASSERT_NEQ(meta->rx_csum_lvl, 0, "rx_csum_lvl"))
> > +		return -1;
> 
> Not-equal ("NEQ") to 0 feels weird here.
> Below you set meta->rx_csum_lvl=1 in case meta->rx_csum_lvl==0.
> 
> Thus, the test can pass if meta->rx_csum_lvl happens to be a random value.
> We could set meta->rx_csum_lvl to 42 in case meta->rx_csum_lvl==0, and
> then use ASSERT_EQ against 42 to be more certain that the case we are
> testing is fulfilled.
>

I just copied the approach used for timestamp. I think you are right and I
should have made the new code better.

ASSERT_NEQ(0) is also used for rx_hash. It would be a good idea to go and fix 
those too, but the patchset has already ballooned too much for me, so I would 
leave it for later.

With ASSERT_EQ for checksum level, I think comparing it to "1" should be enough. 
Am I right that the main problem with ASSERT_NEQ is uninitialized memory?
 
Another value that is less magical than 42 would be "4", because csum_level 
takes 2 bits, so it is the smallest value that does not correspond to
any valid checksum level.
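
To make the sentinel idea concrete, here is a minimal userspace sketch in
plain C. It only models the logic and is not selftest code: the names
CSUM_LVL_SENTINEL, record_csum_lvl() and verify_csum_lvl() are hypothetical,
and the kfunc result is passed in as plain arguments instead of being read
from a real XDP context.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: csum_level is a 2-bit field (0..3), so 4 can never
 * be a real checksum level.
 */
#define CSUM_LVL_SENTINEL 4

/* Models the BPF side: "ret" and "level" stand in for the kfunc return value
 * and its output. The expected veth result (success, level 0) is replaced by
 * the sentinel; on failure the metadata field is left untouched.
 */
static void record_csum_lvl(int ret, uint8_t level, uint8_t *meta_lvl)
{
	assert(meta_lvl);

	if (ret)
		return;
	*meta_lvl = level ? level : CSUM_LVL_SENTINEL;
}

/* Models the userspace check: only the sentinel passes, so leftover garbage
 * in the metadata area is rejected, unlike with a "not equal to 0" check.
 */
static int verify_csum_lvl(uint8_t meta_lvl)
{
	return meta_lvl == CSUM_LVL_SENTINEL ? 0 : -1;
}
```

With a "not equal to 0" check, a field that was never written but happens to
hold a non-zero value would pass; with the equality check above it fails.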
 
> 
> > +
> >   	xsk_ring_cons__release(&xsk->rx, 1);
> >   	refill_rx(xsk, comp_addr);
> > diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > index 382984a5d1c9..6f7223d581b7 100644
> > --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
> > @@ -26,6 +26,8 @@ extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
> >   extern int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx,
> >   					__u16 *vlan_tag,
> >   					__be16 *vlan_proto) __ksym;
> > +extern int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx,
> > +					__u8 *csum_level) __ksym;
> >   SEC("xdp")
> >   int rx(struct xdp_md *ctx)
> > @@ -62,6 +64,11 @@ int rx(struct xdp_md *ctx)
> >   	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash, &meta->rx_hash_type);
> >   	bpf_xdp_metadata_rx_vlan_tag(ctx, &meta->rx_vlan_tag, &meta->rx_vlan_proto);
> > +	/* Same as with timestamp, zero is expected */
> > +	ret = bpf_xdp_metadata_rx_csum_lvl(ctx, &meta->rx_csum_lvl);
> > +	if (!ret && meta->rx_csum_lvl == 0)
> > +		meta->rx_csum_lvl = 1;
> > +
> 
> IMHO it is more human-readable-code to rename "ret" variable "err".
> 
> I know you are just reusing variable "ret", so it's not really your fault.
> 
> 
> 
> >   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
> >   }
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-06  9:04             ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-07-06 12:38               ` Larysa Zaremba
  2023-07-06 12:49                 ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 12:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David S. Miller, Alexander Duyck

On Thu, Jul 06, 2023 at 11:04:49AM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 06/07/2023 07.50, John Fastabend wrote:
> > Larysa Zaremba wrote:
> > > On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
> > > > Cc. DaveM+Alex Duyck, as I value your insights on checksums.
> > > > 
> > > > On 04/07/2023 11.24, Larysa Zaremba wrote:
> > > > > On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
> > > > > > Larysa Zaremba wrote:
> > > > > > > Implement functionality that enables drivers to expose to XDP code,
> > > > > > > whether checksums was checked and on what level.
> > > > > > > 
> > > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > > ---
> > > > > > >    Documentation/networking/xdp-rx-metadata.rst |  3 +++
> > > > > > >    include/linux/netdevice.h                    |  1 +
> > > > > > >    include/net/xdp.h                            |  2 ++
> > > > > > >    kernel/bpf/offload.c                         |  2 ++
> > > > > > >    net/core/xdp.c                               | 21 ++++++++++++++++++++
> > > > > > >    5 files changed, 29 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > index ea6dd79a21d3..4ec6ddfd2a52 100644
> > > > > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
> > > > > > >    .. kernel-doc:: net/core/xdp.c
> > > > > > >       :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > > +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> > > > > > > +
> > > > > > >    An XDP program can use these kfuncs to read the metadata into stack
> > > > > > >    variables for its own consumption. Or, to pass the metadata on to other
> > > > > > >    consumers, an XDP program can store it into the metadata area carried
> > > > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > > > index 4fa4380e6d89..569563687172 100644
> > > > > > > --- a/include/linux/netdevice.h
> > > > > > > +++ b/include/linux/netdevice.h
> > > > > > > @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
> > > > > > >    			       enum xdp_rss_hash_type *rss_type);
> > > > > > >    	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> > > > > > >    				   __be16 *vlan_proto);
> > > > > > > +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
> > > > > > >    };
> > > > > > >    /**
> > > > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > > > > index 89c58f56ffc6..61ed38fa79d1 100644
> > > > > > > --- a/include/net/xdp.h
> > > > > > > +++ b/include/net/xdp.h
> > > > > > > @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> > > > > > >    			   bpf_xdp_metadata_rx_hash) \
> > > > > > >    	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> > > > > > >    			   bpf_xdp_metadata_rx_vlan_tag) \
> > > > > > > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> > > > > > > +			   bpf_xdp_metadata_rx_csum_lvl) \
> > > > > > >    enum {
> > > > > > >    #define XDP_METADATA_KFUNC(name, _) name,
> > > > > > > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > > > > > > index 986e7becfd42..a133fb775f49 100644
> > > > > > > --- a/kernel/bpf/offload.c
> > > > > > > +++ b/kernel/bpf/offload.c
> > > > > > > @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> > > > > > >    		p = ops->xmo_rx_hash;
> > > > > > >    	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> > > > > > >    		p = ops->xmo_rx_vlan_tag;
> > > > > > > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> > > > > > > +		p = ops->xmo_rx_csum_lvl;
> > > > > > >    out:
> > > > > > >    	up_read(&bpf_devs_lock);
> > > > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > > > index f6262c90e45f..c666d3e0a26c 100644
> > > > > > > --- a/net/core/xdp.c
> > > > > > > +++ b/net/core/xdp.c
> > > > > > > @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
> > > > > > >    	return -EOPNOTSUPP;
> > > > > > >    }
> > > > > > > +/**
> > > > > > > + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> > > > > > > + * @ctx: XDP context pointer.
> > > > > > > + * @csum_level: Return value pointer.
> > > > > > > + *
> > > > > > > + * In case of success, csum_level contains depth of the last verified checksum.
> > > > > > > + * If only the outermost checksum was verified, csum_level is 0, if both
> > > > > > > + * encapsulation and inner transport checksums were verified, csum_level is 1,
> > > > > > > + * and so on.
> > > > > > > + * For more details, refer to csum_level field in sk_buff.
> > > > > > > + *
> > > > > > > + * Return:
> > > > > > > + * * Returns 0 on success or ``-errno`` on error.
> > > > > > > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > > > > > > + * * ``-ENODATA``    : Checksum was not validated
> > > > > > > + */
> > > > > > > +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> > > > > > 
> > > > > > > Instead of ENODATA should we return what would be put in the ip_summed field
> > > > > > CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,
> > > > 
> > > > I was thinking the same, what about checksum "type".
> > > > 
> > > > > > 
> > > > > >    bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
> > > > > > 
> > > > > > or something like that? Or is the thought that its not really necessary?
> > > > > > I don't have a strong preference but figured it was worth asking.
> > > > > > 
> > > > > 
> > > > > I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
> > > > > Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
> > > > > overcomplicate the function signature.
> > > > 
> > > > > So, this kfunc bpf_xdp_metadata_rx_csum_lvl() success is it equivalent to
> > > > CHECKSUM_UNNECESSARY?
> > > 
> > > > This is 100% true for physical NICs, it's more complicated for veth, because it
> > > > often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
> > > treated by the network stack as a validated checksum, because there is no way
> > > internally generated packet could be messed up. I would be grateful if you could
> > > look at the veth patch and share your opinion about this.
> > > 
> > > > 
> > > > Looking at documentation[1] (generated from skbuff.h):
> > > >   [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
> > > > 
> > > > > Is the idea that we can add another kfunc (new signature) that can deal
> > > > with the other types of checksums (in a later kernel release)?
> > > > 
> > > 
> > > Yes, that is the idea.
> > 
> > If we think there is a chance we might need another kfunc we should add it
> > in the same kfunc. It would be unfortunate to have to do two kfuncs when
> > one would work. It shouldn't cost much/anything(?) to hardcode the type for
> > most cases? I think if we need it later I would advocate for updating this
> > kfunc to support it. Of course then userspace will have to swivel on the
> > kfunc signature.
> > 
> 
> I think it might make sense to have 3 kfuncs for checksumming.
> As this would allow BPF-prog to focus on CHECKSUM_UNNECESSARY, and then
> only call additional kfunc for extracting e.g csum_start  + csum_offset
> when type is CHECKSUM_PARTIAL.
> 
> We could extend bpf_xdp_metadata_rx_csum_lvl() to give the csum_type
> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}.
> 
>  int bpf_xdp_metadata_rx_csum_lvl(*ctx, u8 *csum_level, u8 *csum_type)
> 
> And then add two kfunc e.g.
>  (1) bpf_xdp_metadata_rx_csum_partial(ctx, start, offset)
>  (2) bpf_xdp_metadata_rx_csum_complete(ctx, csum)
> 
> Pseudo BPF-prog code:
> 
>  err = bpf_xdp_metadata_rx_csum_lvl(ctx, level, type);
>  if (!err && type != CHECKSUM_UNNECESSARY) {
>      if (type == CHECKSUM_PARTIAL)
>          err = bpf_xdp_metadata_rx_csum_partial(ctx, start, offset);
>      if (type == CHECKSUM_COMPLETE)
>          err = bpf_xdp_metadata_rx_csum_complete(ctx, csum);
>  }
> 
> Looking at code, I feel we could rename [...]_csum_lvl to csum_type.
> E.g. bpf_xdp_metadata_rx_csum_type.
>

What about:

union csum_info {
	struct {
		u16 csum_start;
		u16 csum_offset;
	};
	u32 checksum;
	u8 checksum_level;
};

bpf_xdp_metadata_rx_csum(*ctx, u8 *csum_status, union csum_info *info);

One thing that is worth considering in my opinion is whether some hardware can 
provide both CHECKSUM_UNNECESSARY and CHECKSUM_COMPLETE. Judging by [0], this 
does occur. In such cases using an enum to represent the checksum status would
artificially limit the capabilities. Now, imagine the situation:

- You want to use your XDP program with 2 different NICs

[...]

err = bpf_xdp_metadata_rx_csum(ctx, &status, &info);
if (!err && status == CHECKSUM_UNNECESSARY)
	/* Do stuff */

[...]
- One NIC can both calculate CHECKSUM_COMPLETE and parse headers, another one
  is only able to parse headers. Those can be very similar NICs from different
  generations.
- You test your program on the simpler NIC, program works fine.
- You test your program on the more advanced one and suddenly you need an
  'else if' case with some additional calculations.

Please let me know whether this makes sense :D If so, we can work out a solution.
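
To illustrate the concern, here is a rough userspace sketch in plain C of one
possible direction: reporting the status as bit flags instead of a single enum
value. None of this is part of the posted series. The flag names, struct
csum_result and csum_is_verified() are all hypothetical, and a struct is used
instead of the union above only so that both pieces of information can be
carried at once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags, loosely mirroring CHECKSUM_UNNECESSARY and
 * CHECKSUM_COMPLETE. Both can be set for the same packet.
 */
#define CSUM_F_VERIFIED	(1u << 0)	/* HW validated the checksum(s) */
#define CSUM_F_COMPLETE	(1u << 1)	/* HW supplied a full packet checksum */

/* Hypothetical result layout. */
struct csum_result {
	uint8_t  flags;
	uint8_t  level;		/* meaningful only with CSUM_F_VERIFIED */
	uint32_t checksum;	/* meaningful only with CSUM_F_COMPLETE */
};

/* Returns 1 when the program may rely on HW validation. Testing a bit rather
 * than comparing the whole status keeps the same code working on a NIC that
 * also reports a complete checksum.
 */
static int csum_is_verified(const struct csum_result *res)
{
	assert(res);
	return (res->flags & CSUM_F_VERIFIED) != 0;
}
```

A program written against the simpler NIC (only CSUM_F_VERIFIED set) would
then behave the same on the more advanced one (both flags set), without an
extra 'else if' branch.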

> Feel free to disagree,
> --Jesper
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [xdp-hints] Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-06 12:38               ` Larysa Zaremba
@ 2023-07-06 12:49                 ` Larysa Zaremba
  2023-07-10 16:58                   ` Alexander Lobakin
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 12:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, brouer, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David S. Miller, Alexander Duyck

On Thu, Jul 06, 2023 at 02:38:33PM +0200, Larysa Zaremba wrote:
> On Thu, Jul 06, 2023 at 11:04:49AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > 
> > On 06/07/2023 07.50, John Fastabend wrote:
> > > Larysa Zaremba wrote:
> > > > On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
> > > > > Cc. DaveM+Alex Duyck, as I value your insights on checksums.
> > > > > 
> > > > > On 04/07/2023 11.24, Larysa Zaremba wrote:
> > > > > > On Mon, Jul 03, 2023 at 01:38:27PM -0700, John Fastabend wrote:
> > > > > > > Larysa Zaremba wrote:
> > > > > > > > Implement functionality that enables drivers to expose to XDP code,
> > > > > > > > whether checksums was checked and on what level.
> > > > > > > > 
> > > > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > > > ---
> > > > > > > >    Documentation/networking/xdp-rx-metadata.rst |  3 +++
> > > > > > > >    include/linux/netdevice.h                    |  1 +
> > > > > > > >    include/net/xdp.h                            |  2 ++
> > > > > > > >    kernel/bpf/offload.c                         |  2 ++
> > > > > > > >    net/core/xdp.c                               | 21 ++++++++++++++++++++
> > > > > > > >    5 files changed, 29 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > index ea6dd79a21d3..4ec6ddfd2a52 100644
> > > > > > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > @@ -26,6 +26,9 @@ metadata is supported, this set will grow:
> > > > > > > >    .. kernel-doc:: net/core/xdp.c
> > > > > > > >       :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > > > +   :identifiers: bpf_xdp_metadata_rx_csum_lvl
> > > > > > > > +
> > > > > > > >    An XDP program can use these kfuncs to read the metadata into stack
> > > > > > > >    variables for its own consumption. Or, to pass the metadata on to other
> > > > > > > >    consumers, an XDP program can store it into the metadata area carried
> > > > > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > > > > > index 4fa4380e6d89..569563687172 100644
> > > > > > > > --- a/include/linux/netdevice.h
> > > > > > > > +++ b/include/linux/netdevice.h
> > > > > > > > @@ -1660,6 +1660,7 @@ struct xdp_metadata_ops {
> > > > > > > >    			       enum xdp_rss_hash_type *rss_type);
> > > > > > > >    	int	(*xmo_rx_vlan_tag)(const struct xdp_md *ctx, u16 *vlan_tag,
> > > > > > > >    				   __be16 *vlan_proto);
> > > > > > > > +	int	(*xmo_rx_csum_lvl)(const struct xdp_md *ctx, u8 *csum_level);
> > > > > > > >    };
> > > > > > > >    /**
> > > > > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > > > > > index 89c58f56ffc6..61ed38fa79d1 100644
> > > > > > > > --- a/include/net/xdp.h
> > > > > > > > +++ b/include/net/xdp.h
> > > > > > > > @@ -391,6 +391,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> > > > > > > >    			   bpf_xdp_metadata_rx_hash) \
> > > > > > > >    	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_VLAN_TAG, \
> > > > > > > >    			   bpf_xdp_metadata_rx_vlan_tag) \
> > > > > > > > +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_CSUM_LVL, \
> > > > > > > > +			   bpf_xdp_metadata_rx_csum_lvl) \
> > > > > > > >    enum {
> > > > > > > >    #define XDP_METADATA_KFUNC(name, _) name,
> > > > > > > > diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
> > > > > > > > index 986e7becfd42..a133fb775f49 100644
> > > > > > > > --- a/kernel/bpf/offload.c
> > > > > > > > +++ b/kernel/bpf/offload.c
> > > > > > > > @@ -850,6 +850,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
> > > > > > > >    		p = ops->xmo_rx_hash;
> > > > > > > >    	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_VLAN_TAG))
> > > > > > > >    		p = ops->xmo_rx_vlan_tag;
> > > > > > > > +	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_CSUM_LVL))
> > > > > > > > +		p = ops->xmo_rx_csum_lvl;
> > > > > > > >    out:
> > > > > > > >    	up_read(&bpf_devs_lock);
> > > > > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > > > > index f6262c90e45f..c666d3e0a26c 100644
> > > > > > > > --- a/net/core/xdp.c
> > > > > > > > +++ b/net/core/xdp.c
> > > > > > > > @@ -758,6 +758,27 @@ __bpf_kfunc int bpf_xdp_metadata_rx_vlan_tag(const struct xdp_md *ctx, u16 *vlan
> > > > > > > >    	return -EOPNOTSUPP;
> > > > > > > >    }
> > > > > > > > +/**
> > > > > > > > + * bpf_xdp_metadata_rx_csum_lvl - Get depth at which HW has checked the checksum.
> > > > > > > > + * @ctx: XDP context pointer.
> > > > > > > > + * @csum_level: Return value pointer.
> > > > > > > > + *
> > > > > > > > + * In case of success, csum_level contains depth of the last verified checksum.
> > > > > > > > + * If only the outermost checksum was verified, csum_level is 0, if both
> > > > > > > > + * encapsulation and inner transport checksums were verified, csum_level is 1,
> > > > > > > > + * and so on.
> > > > > > > > + * For more details, refer to csum_level field in sk_buff.
> > > > > > > > + *
> > > > > > > > + * Return:
> > > > > > > > + * * Returns 0 on success or ``-errno`` on error.
> > > > > > > > + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
> > > > > > > > + * * ``-ENODATA``    : Checksum was not validated
> > > > > > > > + */
> > > > > > > > +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
> > > > > > > 
> > > > > > > > Instead of ENODATA should we return what would be put in the ip_summed field
> > > > > > > CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then sig would be,
> > > > > 
> > > > > I was thinking the same, what about checksum "type".
> > > > > 
> > > > > > > 
> > > > > > >    bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
> > > > > > > 
> > > > > > > or something like that? Or is the thought that its not really necessary?
> > > > > > > I don't have a strong preference but figured it was worth asking.
> > > > > > > 
> > > > > > 
> > > > > > I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
> > > > > > Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
> > > > > > overcomplicate the function signature.
> > > > > 
> > > > > > So, this kfunc bpf_xdp_metadata_rx_csum_lvl() success is it equivalent to
> > > > > CHECKSUM_UNNECESSARY?
> > > > 
> > > > > This is 100% true for physical NICs, it's more complicated for veth, because it
> > > > > often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
> > > > treated by the network stack as a validated checksum, because there is no way
> > > > internally generated packet could be messed up. I would be grateful if you could
> > > > look at the veth patch and share your opinion about this.
> > > > 
> > > > > 
> > > > > Looking at documentation[1] (generated from skbuff.h):
> > > > >   [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
> > > > > 
> > > > > > Is the idea that we can add another kfunc (new signature) that can deal
> > > > > with the other types of checksums (in a later kernel release)?
> > > > > 
> > > > 
> > > > Yes, that is the idea.
> > > 
> > > If we think there is a chance we might need another kfunc we should add it
> > > in the same kfunc. It would be unfortunate to have to do two kfuncs when
> > > one would work. It shouldn't cost much/anything(?) to hardcode the type for
> > > most cases? I think if we need it later I would advocate for updating this
> > > kfunc to support it. Of course then userspace will have to swivel on the
> > > kfunc signature.
> > > 
> > 
> > I think it might make sense to have 3 kfuncs for checksumming.
> > As this would allow BPF-prog to focus on CHECKSUM_UNNECESSARY, and then
> > only call additional kfunc for extracting e.g csum_start  + csum_offset
> > when type is CHECKSUM_PARTIAL.
> > 
> > We could extend bpf_xdp_metadata_rx_csum_lvl() to give the csum_type
> > CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}.
> > 
> >  int bpf_xdp_metadata_rx_csum_lvl(*ctx, u8 *csum_level, u8 *csum_type)
> > 
> > And then add two kfunc e.g.
> >  (1) bpf_xdp_metadata_rx_csum_partial(ctx, start, offset)
> >  (2) bpf_xdp_metadata_rx_csum_complete(ctx, csum)
> > 
> > Pseudo BPF-prog code:
> > 
> >  err = bpf_xdp_metadata_rx_csum_lvl(ctx, level, type);
> >  if (!err && type != CHECKSUM_UNNECESSARY) {
> >      if (type == CHECKSUM_PARTIAL)
> >          err = bpf_xdp_metadata_rx_csum_partial(ctx, start, offset);
> >      if (type == CHECKSUM_COMPLETE)
> >          err = bpf_xdp_metadata_rx_csum_complete(ctx, csum);
> >  }
> > 
> > Looking at code, I feel we could rename [...]_csum_lvl to csum_type.
> > E.g. bpf_xdp_metadata_rx_csum_type.
> >
> 
> What about:
> 
> union csum_info {
> 	struct {
> 		u16 csum_start;
> 		u16 csum_offset;
> 	};
> 	u32 checksum;
> 	u8 checksum_level;
> };
> 
> bpf_xdp_metadata_rx_csum(*ctx, u8 *csum_status, union csum_info *info);
> 
> One thing that is worth considering in my opinion is whether some hardware can 
> provide both CHECKSUM_UNNECESSARY and CHECKSUM_COMPLETE. Judging by [0], this 
> does occur. In such cases using an enum to represent the checksum status would
> artificially limit the capabilities. Now, imagine the situation:
> 
> - You want to use your XDP program with 2 different NICs
> 
> [...]
> 
> err = bpf_xdp_metadata_rx_csum(*ctx, &status, &info);
> if (!err && status == CHECKSUM_UNNECESSARY)
> 	/* Do stuff */
> 
> [...]
> - One NIC can both calculate CHECKSUM_COMPLETE and parse headers, another one
>   is only able to parse headers. Those can be very similar NICs from different
>   generations.
> - You test your program on the simpler NIC, program works fine.
> - You test your program on the more advanced one and suddenly you need an
>   'else if' case with some additional calculations.
> 
> Please write, whether this makes sense :D and if so, we can work out a solution.
>

Forgot the link:
[0] https://elixir.bootlin.com/linux/v6.4.2/source/include/linux/skbuff.h#L143
 
> > Feel free to disagree,
> > --Jesper
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-05 17:39   ` Stanislav Fomichev
@ 2023-07-06 14:11     ` Larysa Zaremba
  2023-07-06 17:25       ` Stanislav Fomichev
  2023-07-06 17:27       ` Stanislav Fomichev
  0 siblings, 2 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 14:11 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> On 07/03, Larysa Zaremba wrote:
> > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > from VLAN interface, attached to veth. Unfortunately, this approach is
> > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > such feature.
> > 
> > Replace AF_XDP packet generation with sending the same datagram via
> > AF_INET socket.
> > 
> > This does not change the packet contents or hints values with one notable
> > exception: rx_hash_type, which previously was expected to be 0, now is
> > expected to be at least XDP_RSS_TYPE_L4.
> > 
> > Also, usage of AF_INET requires a little more complicated namespace setup,
> > therefore open_netns() helper function is divided into smaller reusable
> > pieces.
> 
> Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> with TX metadata:
> https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> 
> So probably long-term I'll switch it back to AF_XDP but will add
> support for requesting vlan TX "offload" from the veth.
>

My bad for not reading your series. Amazing work as always!

So, 'requesting vlan TX "offload"' with new hints capabilities? This would be 
pretty neat.

But you think AF_INET TX is worth keeping for now, until TX hints are mature?
  
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> >  3 files changed, 98 insertions(+), 117 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > index a105c0cd008a..19463230ece5 100644
> > --- a/tools/testing/selftests/bpf/network_helpers.c
> > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > @@ -386,28 +386,51 @@ char *ping_command(int family)
> >  	return "ping";
> >  }
> >  
> > +int get_cur_netns(void)
> > +{
> > +	int nsfd;
> > +
> > +	nsfd = open("/proc/self/ns/net", O_RDONLY);
> > +	ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > +	return nsfd;
> > +}
> > +
> > +int get_netns(const char *name)
> > +{
> > +	char nspath[PATH_MAX];
> > +	int nsfd;
> > +
> > +	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > +	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > +	ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > +	return nsfd;
> > +}
> > +
> > +int set_netns(int netns_fd)
> > +{
> > +	return setns(netns_fd, CLONE_NEWNET);
> > +}
> 
> We have open_netns/close_netns in network_helpers.h that provide similar
> functionality, let's use them instead?
> 

I have divided open_netns() into smaller pieces (see below), because the code I 
have added into xdp_metadata looked better with those smaller pieces (I had to 
switch namespace several times).
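
For illustration only, here is a minimal userspace sketch of that pattern: open
a named namespace, run one step inside it, and switch back. netns_path(),
open_named_netns() and run_in_netns() are hypothetical names that are not part
of the patch, and setns() needs CAP_SYS_ADMIN, so treat this as a sketch under
those assumptions rather than the selftest code.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: build the path under which "ip netns add <name>"
 * exposes a named network namespace.
 */
static int netns_path(const char *name, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/var/run/netns/%s", name);

	/* Treat truncation as an error instead of opening a wrong path. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open a named netns; returns a file descriptor or -1. */
static int open_named_netns(const char *name)
{
	char path[PATH_MAX];

	if (netns_path(name, path, sizeof(path)))
		return -1;
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Run fn() inside the namespace behind target_fd, then return to the
 * caller's original namespace. Requires CAP_SYS_ADMIN.
 */
static int run_in_netns(int target_fd, int (*fn)(void))
{
	int prev_fd, ret;

	prev_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (prev_fd < 0)
		return -1;

	ret = setns(target_fd, CLONE_NEWNET);
	if (!ret) {
		ret = fn();
		/* Always try to switch back, even if fn() failed. */
		if (setns(prev_fd, CLONE_NEWNET))
			ret = -1;
	}
	close(prev_fd);
	return ret;
}
```

With the namespace fds opened once up front, each TX or RX step becomes a
single call, which is the reason the smaller pieces read better than repeated
open_netns()/close_netns() pairs.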

> > +
> >  struct nstoken {
> >  	int orig_netns_fd;
> >  };
> >  
> >  struct nstoken *open_netns(const char *name)
> >  {
> > +	struct nstoken *token;
> >  	int nsfd;
> > -	char nspath[PATH_MAX];
> >  	int err;
> > -	struct nstoken *token;
> >  
> >  	token = calloc(1, sizeof(struct nstoken));
> >  	if (!ASSERT_OK_PTR(token, "malloc token"))
> >  		return NULL;
> >  
> > -	token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > -	if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
> > +	token->orig_netns_fd = get_cur_netns();
> > +	if (token->orig_netns_fd < 0)
> >  		goto fail;
> >  
> > -	snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > -	nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > -	if (!ASSERT_GE(nsfd, 0, "open netns fd"))
> > +	nsfd = get_netns(name);
> > +	if (nsfd < 0)
> >  		goto fail;
> >  
> >  	err = setns(nsfd, CLONE_NEWNET);
> > diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
> > index 694185644da6..b18b9619595c 100644
> > --- a/tools/testing/selftests/bpf/network_helpers.h
> > +++ b/tools/testing/selftests/bpf/network_helpers.h
> > @@ -58,6 +58,8 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
> >  char *ping_command(int family);
> >  int get_socket_local_port(int sock_fd);
> >  
> > +int get_cur_netns(void);
> > +int get_netns(const char *name);
> >  struct nstoken;
> >  /**
> >   * open_netns() - Switch to specified network namespace by name.
> > @@ -67,4 +69,5 @@ struct nstoken;
> >   */
> >  struct nstoken *open_netns(const char *name);
> >  void close_netns(struct nstoken *token);
> > +int set_netns(int netns_fd);
> >  #endif
> > diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > index 626c461fa34d..53b32a641e8e 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > @@ -20,7 +20,7 @@
> >  
> >  #define UDP_PAYLOAD_BYTES 4
> >  
> > -#define AF_XDP_SOURCE_PORT 1234
> > +#define UDP_SOURCE_PORT 1234
> >  #define AF_XDP_CONSUMER_PORT 8080
> >  
> >  #define UMEM_NUM 16
> > @@ -33,6 +33,12 @@
> >  #define RX_ADDR "10.0.0.2"
> >  #define PREFIX_LEN "8"
> >  #define FAMILY AF_INET
> > +#define TX_NETNS_NAME "xdp_metadata_tx"
> > +#define RX_NETNS_NAME "xdp_metadata_rx"
> > +#define TX_MAC "00:00:00:00:00:01"
> > +#define RX_MAC "00:00:00:00:00:02"
> > +
> > +#define XDP_RSS_TYPE_L4 BIT(3)
> >  
> >  struct xsk {
> >  	void *umem_area;
> > @@ -119,90 +125,28 @@ static void close_xsk(struct xsk *xsk)
> >  	munmap(xsk->umem_area, UMEM_SIZE);
> >  }
> >  
> > -static void ip_csum(struct iphdr *iph)
> > +static int generate_packet_udp(void)
> >  {
> > -	__u32 sum = 0;
> > -	__u16 *p;
> > -	int i;
> > -
> > -	iph->check = 0;
> > -	p = (void *)iph;
> > -	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
> > -		sum += p[i];
> > -
> > -	while (sum >> 16)
> > -		sum = (sum & 0xffff) + (sum >> 16);
> > -
> > -	iph->check = ~sum;
> > -}
> > -
> > -static int generate_packet(struct xsk *xsk, __u16 dst_port)
> > -{
> > -	struct xdp_desc *tx_desc;
> > -	struct udphdr *udph;
> > -	struct ethhdr *eth;
> > -	struct iphdr *iph;
> > -	void *data;
> > -	__u32 idx;
> > -	int ret;
> > -
> > -	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > -	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
> > -		return -1;
> > -
> > -	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > -	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
> > -	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
> > -	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > -
> > -	eth = data;
> > -	iph = (void *)(eth + 1);
> > -	udph = (void *)(iph + 1);
> > -
> > -	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
> > -	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
> > -	eth->h_proto = htons(ETH_P_IP);
> > -
> > -	iph->version = 0x4;
> > -	iph->ihl = 0x5;
> > -	iph->tos = 0x9;
> > -	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
> > -	iph->id = 0;
> > -	iph->frag_off = 0;
> > -	iph->ttl = 0;
> > -	iph->protocol = IPPROTO_UDP;
> > -	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
> > -	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
> > -	ip_csum(iph);
> > -
> > -	udph->source = htons(AF_XDP_SOURCE_PORT);
> > -	udph->dest = htons(dst_port);
> > -	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
> > -	udph->check = 0;
> > -
> > -	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
> > -
> > -	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
> > -	xsk_ring_prod__submit(&xsk->tx, 1);
> > -
> > -	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > -	if (!ASSERT_GE(ret, 0, "sendto"))
> > -		return ret;
> > -
> > -	return 0;
> > -}
> > -
> > -static void complete_tx(struct xsk *xsk)
> > -{
> > -	__u32 idx;
> > -	__u64 addr;
> > -
> > -	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
> > -		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > -
> > -		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > -		xsk_ring_cons__release(&xsk->comp, 1);
> > -	}
> > +	char udp_payload[UDP_PAYLOAD_BYTES];
> > +	struct sockaddr_in rx_addr;
> > +	int sock_fd, err = 0;
> > +
> > +	/* Build a packet */
> > +	memset(udp_payload, 0xAA, UDP_PAYLOAD_BYTES);
> > +	rx_addr.sin_addr.s_addr = inet_addr(RX_ADDR);
> > +	rx_addr.sin_family = AF_INET;
> > +	rx_addr.sin_port = htons(UDP_SOURCE_PORT);
> > +
> > +	sock_fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> > +	if (!ASSERT_GE(sock_fd, 0, "socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)"))
> > +		return sock_fd;
> > +
> > +	err = sendto(sock_fd, udp_payload, UDP_PAYLOAD_BYTES, MSG_DONTWAIT,
> > +		     (void *)&rx_addr, sizeof(rx_addr));
> > +	ASSERT_GE(err, 0, "sendto");
> > +
> > +	close(sock_fd);
> > +	return err;
> >  }
> >  
> >  static void refill_rx(struct xsk *xsk, __u64 addr)
> > @@ -268,7 +212,8 @@ static int verify_xsk_metadata(struct xsk *xsk)
> >  	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
> >  		return -1;
> >  
> > -	ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
> > +	if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
> > +		return -1;
> >  
> >  	xsk_ring_cons__release(&xsk->rx, 1);
> >  	refill_rx(xsk, comp_addr);
> > @@ -281,40 +226,46 @@ void test_xdp_metadata(void)
> >  	struct xdp_metadata2 *bpf_obj2 = NULL;
> >  	struct xdp_metadata *bpf_obj = NULL;
> >  	struct bpf_program *new_prog, *prog;
> > -	struct nstoken *tok = NULL;
> > +	int prev_netns, rx_netns, tx_netns;
> >  	__u32 queue_id = QUEUE_ID;
> >  	struct bpf_map *prog_arr;
> > -	struct xsk tx_xsk = {};
> >  	struct xsk rx_xsk = {};
> >  	__u32 val, key = 0;
> >  	int retries = 10;
> >  	int rx_ifindex;
> > -	int tx_ifindex;
> >  	int sock_fd;
> >  	int ret;
> >  
> > -	/* Setup new networking namespace, with a veth pair. */
> > +	/* Setup new networking namespaces, with a veth pair. */
> >  
> > -	SYS(out, "ip netns add xdp_metadata");
> > -	tok = open_netns("xdp_metadata");
> > +	SYS(out, "ip netns add " TX_NETNS_NAME);
> > +	SYS(out, "ip netns add " RX_NETNS_NAME);
> > +	prev_netns = get_cur_netns();
> > +	tx_netns = get_netns(TX_NETNS_NAME);
> > +	rx_netns = get_netns(RX_NETNS_NAME);
> > +	if (prev_netns < 0 || tx_netns < 0 || rx_netns < 0)
> > +		goto close_ns;
> > +
> > +	set_netns(tx_netns);
> >  	SYS(out, "ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
> >  	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
> > -	SYS(out, "ip link set dev " TX_NAME " address 00:00:00:00:00:01");
> > -	SYS(out, "ip link set dev " RX_NAME " address 00:00:00:00:00:02");
> > +	SYS(out, "ip link set " RX_NAME " netns " RX_NETNS_NAME);
> > +
> > +	SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
> >  	SYS(out, "ip link set dev " TX_NAME " up");
> > -	SYS(out, "ip link set dev " RX_NAME " up");
> >  	SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> > -	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
> >  
> > +	/* Avoid ARP calls */
> > +	SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> > +
> > +	set_netns(rx_netns);
> > +	SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> > +	SYS(out, "ip link set dev " RX_NAME " up");
> > +	SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
> >  	rx_ifindex = if_nametoindex(RX_NAME);
> > -	tx_ifindex = if_nametoindex(TX_NAME);
> >  
> >  	/* Setup separate AF_XDP for TX and RX interfaces. */
> >  
> > -	ret = open_xsk(tx_ifindex, &tx_xsk);
> > -	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
> > -		goto out;
> > -
> >  	ret = open_xsk(rx_ifindex, &rx_xsk);
> >  	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
> >  		goto out;
> > @@ -355,17 +306,16 @@ void test_xdp_metadata(void)
> >  		goto out;
> >  
> >  	/* Send packet destined to RX AF_XDP socket. */
> > -	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> > -		       "generate AF_XDP_CONSUMER_PORT"))
> > +	set_netns(tx_netns);
> > +	if (!ASSERT_GE(generate_packet_udp(), 0, "generate UDP packet"))
> >  		goto out;
> >  
> >  	/* Verify AF_XDP RX packet has proper metadata. */
> > +	set_netns(rx_netns);
> >  	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
> >  		       "verify_xsk_metadata"))
> >  		goto out;
> >  
> > -	complete_tx(&tx_xsk);
> > -
> >  	/* Make sure freplace correctly picks up original bound device
> >  	 * and doesn't crash.
> >  	 */
> > @@ -384,10 +334,11 @@ void test_xdp_metadata(void)
> >  		goto out;
> >  
> >  	/* Send packet to trigger . */
> > -	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> > -		       "generate freplace packet"))
> > +	set_netns(tx_netns);
> > +	if (!ASSERT_GE(generate_packet_udp(), 0, "generate freplace packet"))
> >  		goto out;
> >  
> > +	set_netns(rx_netns);
> >  	while (!retries--) {
> >  		if (bpf_obj2->bss->called)
> >  			break;
> > @@ -397,10 +348,14 @@ void test_xdp_metadata(void)
> >  
> >  out:
> >  	close_xsk(&rx_xsk);
> > -	close_xsk(&tx_xsk);
> >  	xdp_metadata2__destroy(bpf_obj2);
> >  	xdp_metadata__destroy(bpf_obj);
> > -	if (tok)
> > -		close_netns(tok);
> > -	SYS_NOFAIL("ip netns del xdp_metadata");
> > +	set_netns(prev_netns);
> > +close_ns:
> > +	close(prev_netns);
> > +	close(tx_netns);
> > +	close(rx_netns);
> > +
> > +	SYS_NOFAIL("ip netns del " RX_NETNS_NAME);
> > +	SYS_NOFAIL("ip netns del " TX_NETNS_NAME);
> >  }
> > -- 
> > 2.41.0
> > 


* Re: [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-05 17:30   ` Stanislav Fomichev
@ 2023-07-06 14:22     ` Larysa Zaremba
  2023-07-06 16:39       ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 14:22 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Wed, Jul 05, 2023 at 10:30:56AM -0700, Stanislav Fomichev wrote:
> On 07/03, Larysa Zaremba wrote:
> > Use previously refactored code and create a function
> > that allows XDP code to read HW timestamp.
> > 
> > Also, move cached_phctime into packet context, this way this data still
> > stays in the ring structure, just at the different address.
> > 
> > HW timestamp is the first supported hint in the driver,
> > so also add xdp_metadata_ops.
> > 
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice.h          |  2 ++
> >  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
> >  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
> >  drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
> >  drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
> >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
> >  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
> >  7 files changed, 31 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > index 4ba3d99439a0..7a973a2229f1 100644
> > --- a/drivers/net/ethernet/intel/ice/ice.h
> > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > @@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
> >  	set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
> >  	clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
> >  }
> > +
> > +extern const struct xdp_metadata_ops ice_xdp_md_ops;
> >  #endif /* _ICE_H_ */
> > diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > @@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> >  		/* clone ring and setup updated count */
> >  		rx_rings[i] = *vsi->rx_rings[i];
> >  		rx_rings[i].count = new_rx_cnt;
> > -		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> > +		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> >  		rx_rings[i].desc = NULL;
> >  		rx_rings[i].rx_buf = NULL;
> >  		/* this is to allow wr32 to have something to write to
> > diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> > index 00e3afd507a4..eb69b0ac7956 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> > @@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
> >  		ring->netdev = vsi->netdev;
> >  		ring->dev = dev;
> >  		ring->count = vsi->num_rx_desc;
> > -		ring->cached_phctime = pf->ptp.cached_phc_time;
> > +		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> >  		WRITE_ONCE(vsi->rx_rings[i], ring);
> >  	}
> >  
> > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > index 93979ab18bc1..f21996b812ea 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > @@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
> >  
> >  	netdev->netdev_ops = &ice_netdev_ops;
> >  	netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
> > +	netdev->xdp_metadata_ops = &ice_xdp_md_ops;
> >  	ice_set_ethtool_ops(netdev);
> >  
> >  	if (vsi->type != ICE_VSI_PF)
> > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > index a31333972c68..70697e4829dd 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > @@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
> >  		ice_for_each_rxq(vsi, j) {
> >  			if (!vsi->rx_rings[j])
> >  				continue;
> > -			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
> > +			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
> >  		}
> >  	}
> >  	clear_bit(ICE_CFG_BUSY, pf->state);
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > index d0ab2c4c0c91..4237702a58a9 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > @@ -259,6 +259,7 @@ enum ice_rx_dtype {
> >  
> >  struct ice_pkt_ctx {
> >  	const union ice_32b_rx_flex_desc *eop_desc;
> > +	u64 cached_phctime;
> >  };
> >  
> >  struct ice_xdp_buff {
> > @@ -354,7 +355,6 @@ struct ice_rx_ring {
> >  	struct ice_tx_ring *xdp_ring;
> >  	struct xsk_buff_pool *xsk_pool;
> >  	dma_addr_t dma;			/* physical address of ring */
> > -	u64 cached_phctime;
> >  	u16 rx_buf_len;
> >  	u8 dcb_tc;			/* Traffic class of ring */
> >  	u8 ptp_rx;
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > index beb1c5bb392a..463d9e5cbe05 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > @@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
> >  			spin_unlock(&xdp_ring->tx_lock);
> >  	}
> >  }
> > +
> > +/**
> > + * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
> > + * @ctx: XDP buff pointer
> > + * @ts_ns: destination address
> > + *
> > + * Copy HW timestamp (if available) to the destination address.
> > + */
> > +static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
> > +{
> > +	const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> > +	u64 cached_time;
> > +
> > +	cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);
> 
> I believe we have to have something like the following here:
> 
> if (!ts_ns)
> 	return -EINVAL;
> 
> IOW, I don't think verifier guarantees that those pointer args are
> non-NULL.

Oh, that's a shame.
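
Just to make sure I understand the request, a rough userspace mock of the check
order I have in mind. struct mock_pkt_ctx and mock_rx_hw_ts() are hypothetical
stand-ins, not the ice code; the only assumptions taken from this thread are
-EINVAL for a NULL destination and -ENODATA when the hint is not available.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the driver's per-packet context. */
struct mock_pkt_ctx {
	bool ts_valid;		/* did HW provide a timestamp for this packet? */
	uint64_t ts_ns;		/* timestamp in nanoseconds, if valid */
};

/* Mock hint handler: validate the destination pointer before touching it. */
static int mock_rx_hw_ts(const struct mock_pkt_ctx *ctx, uint64_t *ts_ns)
{
	/* The caller may pass NULL, so reject it explicitly. */
	if (!ts_ns)
		return -EINVAL;

	/* No timestamp for this packet: report "no data", leave *ts_ns alone. */
	if (!ctx->ts_valid)
		return -ENODATA;

	*ts_ns = ctx->ts_ns;
	return 0;
}
```

Doing the NULL check first means the destination is never written on any error
path.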

> Same for the other ice kfunc you're adding and veth changes.
> 
> Can you also fix it for the existing veth kfuncs? (or lmk if you prefer me
> to fix it).

I think I can send fixes for RX hash and timestamp in veth separately, before
v3 of this patchset; the code probably doesn't overlap.

But argument checks in kfuncs are a bit of a gray area for me: should they be
sent to the stable tree or not?


* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-04 14:18           ` Jesper Dangaard Brouer
@ 2023-07-06 14:46             ` Larysa Zaremba
  2023-07-07 13:57               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 14:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, John Fastabend, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Andrew Lunn

On Tue, Jul 04, 2023 at 04:18:04PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 04/07/2023 13.02, Larysa Zaremba wrote:
> > On Tue, Jul 04, 2023 at 12:23:45PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On 04/07/2023 10.23, Larysa Zaremba wrote:
> > > > On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
> > > > > Larysa Zaremba wrote:
> > > > > > Implement functionality that enables drivers to expose VLAN tag
> > > > > > to XDP code.
> > > > > > 
> > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > ---
> > > > > >    Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
> > > > > >    include/linux/netdevice.h                    |  2 ++
> > > > > >    include/net/xdp.h                            |  2 ++
> > > > > >    kernel/bpf/offload.c                         |  2 ++
> > > > > >    net/core/xdp.c                               | 20 ++++++++++++++++++++
> > > > > >    5 files changed, 33 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > index 25ce72af81c2..ea6dd79a21d3 100644
> > > > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
> > > > > >    metadata is supported, this set will grow:
> > > > > >    .. kernel-doc:: net/core/xdp.c
> > > > > > -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> > > > > > +   :identifiers: bpf_xdp_metadata_rx_timestamp
> > > > > > +
> > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > +   :identifiers: bpf_xdp_metadata_rx_hash
> > > > > > +
> > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > > >    An XDP program can use these kfuncs to read the metadata into stack
> > > > > >    variables for its own consumption. Or, to pass the metadata on to other
> > > [...]
> > > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > > index 41e5ca8643ec..f6262c90e45f 100644
> > > > > > --- a/net/core/xdp.c
> > > > > > +++ b/net/core/xdp.c
> > > > > > @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
> > > > > >    	return -EOPNOTSUPP;
> > > > > >    }
> > > > > > +/**
> > > > > > + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
> > > > > > + * @ctx: XDP context pointer.
> > > > > > + * @vlan_tag: Destination pointer for VLAN tag
> > > > > > + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
> > > > > > + *
> > > > > > > + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bits
> > > > > > + * containing VLAN ID, vlan_proto contains protocol identifier.
> > > > > 
> > > > > Above is a bit confusing to me at least.
> > > > > 
> > > > > The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
> > > > > are to be included here? The VlanID or the full 16bit TCI meaning the
> > > > > PCP+DEI+VID?
> > > > 
> > > > It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
> > > > xdp_hw_metadata") this is more clear, because the tag is parsed.
> > > > 
> > > 
> > > Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
> > > Protocol IDentifier)?
> > > I mean, it can basically only have two values[1], and we just wanted to
> > > know if it is a VLAN (that hardware offloaded/removed for us):
> > 
> > If we assume everyone follows the standard, this would be correct.
> > But apparently, some applications use some ambiguous value as a TPID [0].
> > 
> > > So it is not hard to imagine that some NICs could allow you to configure your
> > custom TPID. I am not sure if any in-tree drivers actually do this, but I think
> > it's nice to provide some flexibility on XDP level, especially considering
> > network stack stores full vlan_proto.
> > 
> 
> I'm buying your argument, and agree it makes sense to provide TPID in
> the call signature, given that weird hardware exists that allows people
> to configure a custom TPID.
> 
> Looking through kernel defines (in uapi/linux/if_ether.h) I see evidence
> that funky QinQ EtherTypes have been used in the past:
> 
>  #define ETH_P_QINQ1	0x9100		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
>  #define ETH_P_QINQ2	0x9200		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
>  #define ETH_P_QINQ3	0x9300		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
> 
> 
> > [0]
> > https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1938a_l2-lan_cg/content/495503472.htm
> > 
> > > 
> > >   static __always_inline int proto_is_vlan(__u16 h_proto)
> > >   {
> > > 	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
> > > 		  h_proto == bpf_htons(ETH_P_8021AD));
> > >   }
> > > 
> > > [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79
> > > 
> > > Cc. Andrew Lunn, as I notice DSA have a fake VLAN define ETH_P_DSA_8021Q
> > > (in file include/uapi/linux/if_ether.h)
> > > Is this actually in use?
> > > Maybe some hardware can "VLAN" offload this?
> > > 
> > > 
> > > > What about rephrasing it this way:
> > > > 
> > > > In case of success, vlan_proto contains VLAN protocol identifier (TPID),
> > > > vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
> > > > 
> > > 
> > > Hmm, I think we can improve this further. This text becomes part of the
> > > documentation for end-users (target audience).  Thus, I think it is
> > > worth being more verbose and even mention the existing defines that we
> > > are expecting end-users to take advantage of.
> > > 
> > > What about:
> > > 
> > > In case of success, the VLAN EtherType is stored in vlan_proto (usually
> > > either ETH_P_8021Q or ETH_P_8021AD), also known as TPID (Tag Protocol
> > > IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
> > > containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12 bits,
> > > commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
> > > of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
> > > (DEI) (formerly CFI) please reference other documentation. Remember
> > > these 16-bit fields are stored in network byte order. Thus, transformation
> > > with byte-order helper functions like bpf_ntohs() is needed.
> > > 
> > 
> > AFAIK, vlan_tag is stored in host byte order, this is how it is in skb.
> 
> I'm not sure we should follow SKB storage scheme for XDP.
>

I think following the SKB convention is a good idea in this particular case. As
I have mentioned below, in ice the VLAN TCI in the descriptor already comes in
LE, so there is no point in converting it into BE just so somebody can call
bpf_ntohs() on it later anyway. We are not the only manufacturer that does this.

> > In ice, we receive VLAN tag in descriptor already in LE.
> > Only protocol is BE (network byte order). So I would replace the last 2
> > sentences with the following:
> > 
> > vlan_tag is stored in host byte order, so no byte order conversion is needed.
> 
> Yikes, that was unexpected.  This needs to be heavily documented in docs.

You mean the motivation, why it is so and not the other way around?

> 
> When parsing packets, it is in network-byte-order, else my code is wrong
> here[1]:
> 
>   [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L122
> 
> I'm accessing the skb->vlan_tci here [2], and I notice I don't do any
> byte-order conversions, so fortunately I didn't make a code mistake.
> 
>   [2] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c#L215
>

In a raw packet, the VLAN TCI is in network byte order, but the skb convention
requires the NIC/driver to convert it into host byte order before putting it
into the skb.
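
To make the proposed convention concrete, a small userspace sketch of how a
consumer would treat the two values: the TCI in host byte order is split with
plain masks, while the TPID is compared in network byte order. The macro and
function names here are local to this example (the values mirror ETH_P_8021Q,
ETH_P_8021AD and VLAN_VID_MASK); this is an illustration, not the documentation
text.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the well-known values, named for this example only. */
#define EX_ETH_P_8021Q	0x8100	/* C-tag TPID */
#define EX_ETH_P_8021AD	0x88A8	/* S-tag TPID */
#define EX_VID_MASK	0x0fff	/* 12-bit VLAN ID */
#define EX_DEI_MASK	0x1000	/* Drop Eligible Indicator */
#define EX_PCP_SHIFT	13	/* 3-bit Priority Code Point */

struct vlan_fields {
	uint8_t pcp;
	bool dei;
	uint16_t vid;
};

/* TCI is assumed to be in host byte order: no ntohs() before masking. */
static struct vlan_fields parse_tci(uint16_t tci_host)
{
	struct vlan_fields f = {
		.pcp = tci_host >> EX_PCP_SHIFT,
		.dei = !!(tci_host & EX_DEI_MASK),
		.vid = tci_host & EX_VID_MASK,
	};

	return f;
}

/* TPID is assumed to be in network byte order: convert the constant instead. */
static bool tpid_is_vlan(uint16_t proto_be)
{
	return proto_be == htons(EX_ETH_P_8021Q) ||
	       proto_be == htons(EX_ETH_P_8021AD);
}
```

Converting the constant rather than the variable keeps the comparison free of
any runtime byte swap.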
 
> > vlan_proto is stored in network byte order, the suggested way to use this value:
> > 
> > vlan_proto == bpf_htons(ETH_P_8021Q)
> > 
> > > 
> > > 
> 
> --Jesper
> 


* Re: [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32
  2023-07-03 21:06   ` John Fastabend
@ 2023-07-06 14:51     ` Larysa Zaremba
  2023-07-10 14:01       ` Alexander Lobakin
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-06 14:51 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Aleksander Lobakin

On Mon, Jul 03, 2023 at 02:06:46PM -0700, John Fastabend wrote:
> Larysa Zaremba wrote:
> > From: Aleksander Lobakin <aleksander.lobakin@intel.com>
> > 
> > When using XDP hints, metadata sometimes has to be much bigger
> > than 32 bytes. Relax the restriction, allow metadata larger than 32 bytes
> > and make __skb_metadata_differs() work with bigger lengths.
> > 
> > Now size of metadata is only limited by the fact it is stored as u8
> > in skb_shared_info, so maximum possible value is 255. Other important
> > conditions, such as having enough space for xdp_frame building, are already
> > checked in bpf_xdp_adjust_meta().
> > 
> > The requirement of having its length aligned to 4 bytes is still
> > valid.
> > 
> > Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com>
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  include/linux/skbuff.h | 13 ++++++++-----
> >  include/net/xdp.h      |  7 ++++++-
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 91ed66952580..cd49cdd71019 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4209,10 +4209,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
> >  {
> >  	const void *a = skb_metadata_end(skb_a);
> >  	const void *b = skb_metadata_end(skb_b);
> > -	/* Using more efficient varaiant than plain call to memcmp(). */
> > -#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
> 
> Why are we removing the ifdef here? It's adding a runtime 'if' when it's not
> necessary. I would keep the ifdef and simply add the default case
> in the switch.

Seems like Alex has missed your message, but we discussed this with him before,
so I know the answer: the compiler will 100% convert it into a compile-time 'if',
and this looks nicer than a preprocessor condition.
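
A simplified userspace sketch of the idea, not the actual skbuff.h code:
HAVE_FAST_UNALIGNED is a hypothetical stand-in for the kernel config option and
meta_differs() is a made-up name. The first two operands of the condition are
compile-time constants, so an optimizing compiler can discard the branch that
cannot be taken.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. */
#define HAVE_FAST_UNALIGNED 1

/* Return true if two metadata areas of the same length differ. */
static bool meta_differs(const void *a, const void *b, size_t len)
{
	/* HAVE_FAST_UNALIGNED and sizeof(long) are known at build time, so
	 * on a target where they do not hold only the memcmp() path remains.
	 */
	if (HAVE_FAST_UNALIGNED && sizeof(long) == 8 && len % 8 == 0) {
		uint64_t diffs = 0;
		size_t off;

		/* XOR the areas word by word; any set bit means a difference. */
		for (off = 0; off < len; off += 8) {
			uint64_t wa, wb;

			memcpy(&wa, (const char *)a + off, sizeof(wa));
			memcpy(&wb, (const char *)b + off, sizeof(wb));
			diffs |= wa ^ wb;
		}
		return diffs != 0;
	}

	/* Generic fallback for any other length or target. */
	return memcmp(a, b, len) != 0;
}
```

The length check stays a runtime test here because metadata is only required
to be 4-byte aligned, so lengths such as 12 take the fallback.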


* Re: [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-06 14:22     ` Larysa Zaremba
@ 2023-07-06 16:39       ` Stanislav Fomichev
  2023-07-10 15:49         ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-06 16:39 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 6, 2023 at 7:27 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Wed, Jul 05, 2023 at 10:30:56AM -0700, Stanislav Fomichev wrote:
> > On 07/03, Larysa Zaremba wrote:
> > > Use previously refactored code and create a function
> > > that allows XDP code to read HW timestamp.
> > >
> > > Also, move cached_phctime into packet context, this way this data still
> > > stays in the ring structure, just at a different address.
> > >
> > > HW timestamp is the first supported hint in the driver,
> > > so also add xdp_metadata_ops.
> > >
> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/ice/ice.h          |  2 ++
> > >  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
> > >  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
> > >  drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
> > >  drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
> > >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
> > >  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
> > >  7 files changed, 31 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > index 4ba3d99439a0..7a973a2229f1 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > @@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
> > >     set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
> > >     clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
> > >  }
> > > +
> > > +extern const struct xdp_metadata_ops ice_xdp_md_ops;
> > >  #endif /* _ICE_H_ */
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > @@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> > >             /* clone ring and setup updated count */
> > >             rx_rings[i] = *vsi->rx_rings[i];
> > >             rx_rings[i].count = new_rx_cnt;
> > > -           rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> > > +           rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > >             rx_rings[i].desc = NULL;
> > >             rx_rings[i].rx_buf = NULL;
> > >             /* this is to allow wr32 to have something to write to
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > index 00e3afd507a4..eb69b0ac7956 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > @@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
> > >             ring->netdev = vsi->netdev;
> > >             ring->dev = dev;
> > >             ring->count = vsi->num_rx_desc;
> > > -           ring->cached_phctime = pf->ptp.cached_phc_time;
> > > +           ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > >             WRITE_ONCE(vsi->rx_rings[i], ring);
> > >     }
> > >
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > index 93979ab18bc1..f21996b812ea 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > @@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
> > >
> > >     netdev->netdev_ops = &ice_netdev_ops;
> > >     netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
> > > +   netdev->xdp_metadata_ops = &ice_xdp_md_ops;
> > >     ice_set_ethtool_ops(netdev);
> > >
> > >     if (vsi->type != ICE_VSI_PF)
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > index a31333972c68..70697e4829dd 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > @@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
> > >             ice_for_each_rxq(vsi, j) {
> > >                     if (!vsi->rx_rings[j])
> > >                             continue;
> > > -                   WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
> > > +                   WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
> > >             }
> > >     }
> > >     clear_bit(ICE_CFG_BUSY, pf->state);
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > index d0ab2c4c0c91..4237702a58a9 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > @@ -259,6 +259,7 @@ enum ice_rx_dtype {
> > >
> > >  struct ice_pkt_ctx {
> > >     const union ice_32b_rx_flex_desc *eop_desc;
> > > +   u64 cached_phctime;
> > >  };
> > >
> > >  struct ice_xdp_buff {
> > > @@ -354,7 +355,6 @@ struct ice_rx_ring {
> > >     struct ice_tx_ring *xdp_ring;
> > >     struct xsk_buff_pool *xsk_pool;
> > >     dma_addr_t dma;                 /* physical address of ring */
> > > -   u64 cached_phctime;
> > >     u16 rx_buf_len;
> > >     u8 dcb_tc;                      /* Traffic class of ring */
> > >     u8 ptp_rx;
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > index beb1c5bb392a..463d9e5cbe05 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > @@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
> > >                     spin_unlock(&xdp_ring->tx_lock);
> > >     }
> > >  }
> > > +
> > > +/**
> > > + * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
> > > + * @ctx: XDP buff pointer
> > > + * @ts_ns: destination address
> > > + *
> > > + * Copy HW timestamp (if available) to the destination address.
> > > + */
> > > +static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
> > > +{
> > > +   const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> > > +   u64 cached_time;
> > > +
> > > +   cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);
> >
> > I believe we have to have something like the following here:
> >
> > if (!ts_ns)
> >       return -EINVAL;
> >
> > IOW, I don't think verifier guarantees that those pointer args are
> > non-NULL.
>
> Oh, that's a shame.
>
> > Same for the other ice kfunc you're adding and veth changes.
> >
> > Can you also fix it for the existing veth kfuncs? (or lmk if you prefer me
> > to fix it).
>
> I think I can send fixes for RX hash and timestamp in veth separately, before
> v3 of this patchset, code probably doesn't intersect.
>
> But argument checks in kfuncs are a bit of a gray area for me: should they
> be sent to the stable tree or not?

Add a Fixes tag and they will get into the stable trees automatically I believe?
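
For reference, the argument check discussed above could look roughly like the
sketch below. This is a user-space mock, not the ice implementation:
struct demo_pkt_ctx, its fields and demo_rx_hw_ts() are hypothetical names.
The -ENODATA return for "no timestamp available" follows the convention
mentioned in the cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet context; stands in for the driver's xdp_buff wrapper. */
struct demo_pkt_ctx {
	int has_ts;      /* nonzero when the descriptor carried a timestamp */
	uint64_t ts_ns;  /* timestamp value in nanoseconds */
};

/* Copy the timestamp to *ts_ns; returns 0, -EINVAL or -ENODATA. */
static int demo_rx_hw_ts(const struct demo_pkt_ctx *ctx, uint64_t *ts_ns)
{
	/* Reject a NULL destination before anything is dereferenced. */
	if (!ts_ns)
		return -EINVAL;

	/* Report missing data without writing to the destination. */
	if (!ctx->has_ts)
		return -ENODATA;

	*ts_ns = ctx->ts_ns;
	return 0;
}
```

The check comes first so that a bad pointer is reported the same way whether
or not the packet carries a timestamp.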


* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-06 14:11     ` Larysa Zaremba
@ 2023-07-06 17:25       ` Stanislav Fomichev
  2023-07-06 17:27       ` Stanislav Fomichev
  1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-06 17:25 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 6, 2023 at 7:15 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> > On 07/03, Larysa Zaremba wrote:
> > > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > > from VLAN interface, attached to veth. Unfortunately, this approach is
> > > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > > such feature.
> > >
> > > Replace AF_XDP packet generation with sending the same datagram via
> > > AF_INET socket.
> > >
> > > This does not change the packet contents or hints values with one notable
> > > exception: rx_hash_type, which previously was expected to be 0, now is
> > > expected to be at least XDP_RSS_TYPE_L4.
> > >
> > > Also, usage of AF_INET requires a little more complicated namespace setup,
> > > therefore open_netns() helper function is divided into smaller reusable
> > > pieces.
> >
> > Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> > with TX metadata:
> > https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> >
> > So probably long-term I'll switch it back to AF_XDP but will add
> > support for requesting vlan TX "offload" from the veth.
> >
>
> My bad for not reading your series. Amazing work as always!
>
> So, 'requesting vlan TX "offload"' with new hints capabilities? This would be
> pretty neat.
>
> But you think AF_INET TX is worth keeping for now, until TX hints are mature?

It's fine to replace current af_xdp tx with whatever you're suggesting.
I can bring it back later for tx when it's ready.
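
As a rough illustration of the AF_INET approach being discussed, a plain UDP
sender can be as small as the sketch below. It is not the selftest code:
demo_send_udp() and DEMO_PAYLOAD_BYTES are hypothetical names, and the
destination address and port are parameters rather than the test's constants.
The destination structure is zeroed before use and the address is parsed with
inet_pton() so that a malformed string is reported.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define DEMO_PAYLOAD_BYTES 4

/* Send DEMO_PAYLOAD_BYTES bytes of 0xAA to addr:port; returns bytes sent or -1. */
static int demo_send_udp(const char *addr, unsigned short port)
{
	char payload[DEMO_PAYLOAD_BYTES];
	struct sockaddr_in dst;
	int fd, ret;

	memset(payload, 0xAA, sizeof(payload));

	/* Zero the whole structure so no field is left uninitialized. */
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, addr, &dst.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (fd < 0)
		return -1;

	/* The kernel builds the Ethernet, IP and UDP headers for us. */
	ret = (int)sendto(fd, payload, sizeof(payload), 0,
			  (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return ret;
}
```

Compared with building the frame by hand in a UMEM, there is no header or
checksum code to maintain, at the cost of less control over the exact bytes
on the wire.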


> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> > >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> > >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> > >  3 files changed, 98 insertions(+), 117 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > > index a105c0cd008a..19463230ece5 100644
> > > --- a/tools/testing/selftests/bpf/network_helpers.c
> > > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > > @@ -386,28 +386,51 @@ char *ping_command(int family)
> > >     return "ping";
> > >  }
> > >
> > > +int get_cur_netns(void)
> > > +{
> > > +   int nsfd;
> > > +
> > > +   nsfd = open("/proc/self/ns/net", O_RDONLY);
> > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > +   return nsfd;
> > > +}
> > > +
> > > +int get_netns(const char *name)
> > > +{
> > > +   char nspath[PATH_MAX];
> > > +   int nsfd;
> > > +
> > > +   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > +   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > +   return nsfd;
> > > +}
> > > +
> > > +int set_netns(int netns_fd)
> > > +{
> > > +   return setns(netns_fd, CLONE_NEWNET);
> > > +}
> >
> > We have open_netns/close_netns in network_helpers.h that provide similar
> > functionality, let's use them instead?
> >
>
> I have divided open_netns() into smaller pieces (see below), because the code I
> have added into xdp_metadata looked better with those smaller pieces (I had to
> switch namespace several times).
>
> > > +
> > >  struct nstoken {
> > >     int orig_netns_fd;
> > >  };
> > >
> > >  struct nstoken *open_netns(const char *name)
> > >  {
> > > +   struct nstoken *token;
> > >     int nsfd;
> > > -   char nspath[PATH_MAX];
> > >     int err;
> > > -   struct nstoken *token;
> > >
> > >     token = calloc(1, sizeof(struct nstoken));
> > >     if (!ASSERT_OK_PTR(token, "malloc token"))
> > >             return NULL;
> > >
> > > -   token->orig_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > > -   if (!ASSERT_GE(token->orig_netns_fd, 0, "open /proc/self/ns/net"))
> > > +   token->orig_netns_fd = get_cur_netns();
> > > +   if (token->orig_netns_fd < 0)
> > >             goto fail;
> > >
> > > -   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > -   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > -   if (!ASSERT_GE(nsfd, 0, "open netns fd"))
> > > +   nsfd = get_netns(name);
> > > +   if (nsfd < 0)
> > >             goto fail;
> > >
> > >     err = setns(nsfd, CLONE_NEWNET);
> > > diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
> > > index 694185644da6..b18b9619595c 100644
> > > --- a/tools/testing/selftests/bpf/network_helpers.h
> > > +++ b/tools/testing/selftests/bpf/network_helpers.h
> > > @@ -58,6 +58,8 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
> > >  char *ping_command(int family);
> > >  int get_socket_local_port(int sock_fd);
> > >
> > > +int get_cur_netns(void);
> > > +int get_netns(const char *name);
> > >  struct nstoken;
> > >  /**
> > >   * open_netns() - Switch to specified network namespace by name.
> > > @@ -67,4 +69,5 @@ struct nstoken;
> > >   */
> > >  struct nstoken *open_netns(const char *name);
> > >  void close_netns(struct nstoken *token);
> > > +int set_netns(int netns_fd);
> > >  #endif
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > > index 626c461fa34d..53b32a641e8e 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
> > > @@ -20,7 +20,7 @@
> > >
> > >  #define UDP_PAYLOAD_BYTES 4
> > >
> > > -#define AF_XDP_SOURCE_PORT 1234
> > > +#define UDP_SOURCE_PORT 1234
> > >  #define AF_XDP_CONSUMER_PORT 8080
> > >
> > >  #define UMEM_NUM 16
> > > @@ -33,6 +33,12 @@
> > >  #define RX_ADDR "10.0.0.2"
> > >  #define PREFIX_LEN "8"
> > >  #define FAMILY AF_INET
> > > +#define TX_NETNS_NAME "xdp_metadata_tx"
> > > +#define RX_NETNS_NAME "xdp_metadata_rx"
> > > +#define TX_MAC "00:00:00:00:00:01"
> > > +#define RX_MAC "00:00:00:00:00:02"
> > > +
> > > +#define XDP_RSS_TYPE_L4 BIT(3)
> > >
> > >  struct xsk {
> > >     void *umem_area;
> > > @@ -119,90 +125,28 @@ static void close_xsk(struct xsk *xsk)
> > >     munmap(xsk->umem_area, UMEM_SIZE);
> > >  }
> > >
> > > -static void ip_csum(struct iphdr *iph)
> > > +static int generate_packet_udp(void)
> > >  {
> > > -   __u32 sum = 0;
> > > -   __u16 *p;
> > > -   int i;
> > > -
> > > -   iph->check = 0;
> > > -   p = (void *)iph;
> > > -   for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
> > > -           sum += p[i];
> > > -
> > > -   while (sum >> 16)
> > > -           sum = (sum & 0xffff) + (sum >> 16);
> > > -
> > > -   iph->check = ~sum;
> > > -}
> > > -
> > > -static int generate_packet(struct xsk *xsk, __u16 dst_port)
> > > -{
> > > -   struct xdp_desc *tx_desc;
> > > -   struct udphdr *udph;
> > > -   struct ethhdr *eth;
> > > -   struct iphdr *iph;
> > > -   void *data;
> > > -   __u32 idx;
> > > -   int ret;
> > > -
> > > -   ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > > -   if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
> > > -           return -1;
> > > -
> > > -   tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > > -   tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
> > > -   printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
> > > -   data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > > -
> > > -   eth = data;
> > > -   iph = (void *)(eth + 1);
> > > -   udph = (void *)(iph + 1);
> > > -
> > > -   memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
> > > -   memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
> > > -   eth->h_proto = htons(ETH_P_IP);
> > > -
> > > -   iph->version = 0x4;
> > > -   iph->ihl = 0x5;
> > > -   iph->tos = 0x9;
> > > -   iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
> > > -   iph->id = 0;
> > > -   iph->frag_off = 0;
> > > -   iph->ttl = 0;
> > > -   iph->protocol = IPPROTO_UDP;
> > > -   ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
> > > -   ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
> > > -   ip_csum(iph);
> > > -
> > > -   udph->source = htons(AF_XDP_SOURCE_PORT);
> > > -   udph->dest = htons(dst_port);
> > > -   udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
> > > -   udph->check = 0;
> > > -
> > > -   memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
> > > -
> > > -   tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
> > > -   xsk_ring_prod__submit(&xsk->tx, 1);
> > > -
> > > -   ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
> > > -   if (!ASSERT_GE(ret, 0, "sendto"))
> > > -           return ret;
> > > -
> > > -   return 0;
> > > -}
> > > -
> > > -static void complete_tx(struct xsk *xsk)
> > > -{
> > > -   __u32 idx;
> > > -   __u64 addr;
> > > -
> > > -   if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
> > > -           addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > > -
> > > -           printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > > -           xsk_ring_cons__release(&xsk->comp, 1);
> > > -   }
> > > +   char udp_payload[UDP_PAYLOAD_BYTES];
> > > +   struct sockaddr_in rx_addr;
> > > +   int sock_fd, err = 0;
> > > +
> > > +   /* Build a packet */
> > > +   memset(udp_payload, 0xAA, UDP_PAYLOAD_BYTES);
> > > +   rx_addr.sin_addr.s_addr = inet_addr(RX_ADDR);
> > > +   rx_addr.sin_family = AF_INET;
> > > +   rx_addr.sin_port = htons(UDP_SOURCE_PORT);
> > > +
> > > +   sock_fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> > > +   if (!ASSERT_GE(sock_fd, 0, "socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)"))
> > > +           return sock_fd;
> > > +
> > > +   err = sendto(sock_fd, udp_payload, UDP_PAYLOAD_BYTES, MSG_DONTWAIT,
> > > +                (void *)&rx_addr, sizeof(rx_addr));
> > > +   ASSERT_GE(err, 0, "sendto");
> > > +
> > > +   close(sock_fd);
> > > +   return err;
> > >  }
> > >
> > >  static void refill_rx(struct xsk *xsk, __u64 addr)
> > > @@ -268,7 +212,8 @@ static int verify_xsk_metadata(struct xsk *xsk)
> > >     if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
> > >             return -1;
> > >
> > > -   ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
> > > +   if (!ASSERT_NEQ(meta->rx_hash_type & XDP_RSS_TYPE_L4, 0, "rx_hash_type"))
> > > +           return -1;
> > >
> > >     xsk_ring_cons__release(&xsk->rx, 1);
> > >     refill_rx(xsk, comp_addr);
> > > @@ -281,40 +226,46 @@ void test_xdp_metadata(void)
> > >     struct xdp_metadata2 *bpf_obj2 = NULL;
> > >     struct xdp_metadata *bpf_obj = NULL;
> > >     struct bpf_program *new_prog, *prog;
> > > -   struct nstoken *tok = NULL;
> > > +   int prev_netns, rx_netns, tx_netns;
> > >     __u32 queue_id = QUEUE_ID;
> > >     struct bpf_map *prog_arr;
> > > -   struct xsk tx_xsk = {};
> > >     struct xsk rx_xsk = {};
> > >     __u32 val, key = 0;
> > >     int retries = 10;
> > >     int rx_ifindex;
> > > -   int tx_ifindex;
> > >     int sock_fd;
> > >     int ret;
> > >
> > > -   /* Setup new networking namespace, with a veth pair. */
> > > +   /* Setup new networking namespaces, with a veth pair. */
> > >
> > > -   SYS(out, "ip netns add xdp_metadata");
> > > -   tok = open_netns("xdp_metadata");
> > > +   SYS(out, "ip netns add " TX_NETNS_NAME);
> > > +   SYS(out, "ip netns add " RX_NETNS_NAME);
> > > +   prev_netns = get_cur_netns();
> > > +   tx_netns = get_netns(TX_NETNS_NAME);
> > > +   rx_netns = get_netns(RX_NETNS_NAME);
> > > +   if (prev_netns < 0 || tx_netns < 0 || rx_netns < 0)
> > > +           goto close_ns;
> > > +
> > > +   set_netns(tx_netns);
> > >     SYS(out, "ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
> > >         " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
> > > -   SYS(out, "ip link set dev " TX_NAME " address 00:00:00:00:00:01");
> > > -   SYS(out, "ip link set dev " RX_NAME " address 00:00:00:00:00:02");
> > > +   SYS(out, "ip link set " RX_NAME " netns " RX_NETNS_NAME);
> > > +
> > > +   SYS(out, "ip link set dev " TX_NAME " address " TX_MAC);
> > >     SYS(out, "ip link set dev " TX_NAME " up");
> > > -   SYS(out, "ip link set dev " RX_NAME " up");
> > >     SYS(out, "ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
> > > -   SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
> > >
> > > +   /* Avoid ARP calls */
> > > +   SYS(out, "ip -4 neigh add " RX_ADDR " lladdr " RX_MAC " dev " TX_NAME);
> > > +
> > > +   set_netns(rx_netns);
> > > +   SYS(out, "ip link set dev " RX_NAME " address " RX_MAC);
> > > +   SYS(out, "ip link set dev " RX_NAME " up");
> > > +   SYS(out, "ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
> > >     rx_ifindex = if_nametoindex(RX_NAME);
> > > -   tx_ifindex = if_nametoindex(TX_NAME);
> > >
> > >     /* Setup separate AF_XDP for TX and RX interfaces. */
> > >
> > > -   ret = open_xsk(tx_ifindex, &tx_xsk);
> > > -   if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
> > > -           goto out;
> > > -
> > >     ret = open_xsk(rx_ifindex, &rx_xsk);
> > >     if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
> > >             goto out;
> > > @@ -355,17 +306,16 @@ void test_xdp_metadata(void)
> > >             goto out;
> > >
> > >     /* Send packet destined to RX AF_XDP socket. */
> > > -   if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> > > -                  "generate AF_XDP_CONSUMER_PORT"))
> > > +   set_netns(tx_netns);
> > > +   if (!ASSERT_GE(generate_packet_udp(), 0, "generate UDP packet"))
> > >             goto out;
> > >
> > >     /* Verify AF_XDP RX packet has proper metadata. */
> > > +   set_netns(rx_netns);
> > >     if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
> > >                    "verify_xsk_metadata"))
> > >             goto out;
> > >
> > > -   complete_tx(&tx_xsk);
> > > -
> > >     /* Make sure freplace correctly picks up original bound device
> > >      * and doesn't crash.
> > >      */
> > > @@ -384,10 +334,11 @@ void test_xdp_metadata(void)
> > >             goto out;
> > >
> > >     /* Send packet to trigger . */
> > > -   if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
> > > -                  "generate freplace packet"))
> > > +   set_netns(tx_netns);
> > > +   if (!ASSERT_GE(generate_packet_udp(), 0, "generate freplace packet"))
> > >             goto out;
> > >
> > > +   set_netns(rx_netns);
> > >     while (!retries--) {
> > >             if (bpf_obj2->bss->called)
> > >                     break;
> > > @@ -397,10 +348,14 @@ void test_xdp_metadata(void)
> > >
> > >  out:
> > >     close_xsk(&rx_xsk);
> > > -   close_xsk(&tx_xsk);
> > >     xdp_metadata2__destroy(bpf_obj2);
> > >     xdp_metadata__destroy(bpf_obj);
> > > -   if (tok)
> > > -           close_netns(tok);
> > > -   SYS_NOFAIL("ip netns del xdp_metadata");
> > > +   set_netns(prev_netns);
> > > +close_ns:
> > > +   close(prev_netns);
> > > +   close(tx_netns);
> > > +   close(rx_netns);
> > > +
> > > +   SYS_NOFAIL("ip netns del " RX_NETNS_NAME);
> > > +   SYS_NOFAIL("ip netns del " TX_NETNS_NAME);
> > >  }
> > > --
> > > 2.41.0
> > >


* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-06 14:11     ` Larysa Zaremba
  2023-07-06 17:25       ` Stanislav Fomichev
@ 2023-07-06 17:27       ` Stanislav Fomichev
  2023-07-07  8:33         ` Larysa Zaremba
  1 sibling, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-06 17:27 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 6, 2023 at 7:15 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> > On 07/03, Larysa Zaremba wrote:
> > > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > > from VLAN interface, attached to veth. Unfortunately, this approach is
> > > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > > such feature.
> > >
> > > Replace AF_XDP packet generation with sending the same datagram via
> > > AF_INET socket.
> > >
> > > This does not change the packet contents or hints values with one notable
> > > exception: rx_hash_type, which previously was expected to be 0, now is
> > > expected to be at least XDP_RSS_TYPE_L4.
> > >
> > > Also, usage of AF_INET requires a little more complicated namespace setup,
> > > therefore open_netns() helper function is divided into smaller reusable
> > > pieces.
> >
> > Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> > with TX metadata:
> > https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> >
> > So probably long-term I'll switch it back to AF_XDP but will add
> > support for requesting vlan TX "offload" from the veth.
> >
>
> My bad for not reading your series. Amazing work as always!
>
> So, 'requesting vlan TX "offload"' with new hints capabilities? This would be
> pretty neat.
>
> But you think AF_INET TX is worth keeping for now, until TX hints are mature?
>
> > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > ---
> > >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> > >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> > >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> > >  3 files changed, 98 insertions(+), 117 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > > index a105c0cd008a..19463230ece5 100644
> > > --- a/tools/testing/selftests/bpf/network_helpers.c
> > > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > > @@ -386,28 +386,51 @@ char *ping_command(int family)
> > >     return "ping";
> > >  }
> > >
> > > +int get_cur_netns(void)
> > > +{
> > > +   int nsfd;
> > > +
> > > +   nsfd = open("/proc/self/ns/net", O_RDONLY);
> > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > +   return nsfd;
> > > +}
> > > +
> > > +int get_netns(const char *name)
> > > +{
> > > +   char nspath[PATH_MAX];
> > > +   int nsfd;
> > > +
> > > +   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > +   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > +   return nsfd;
> > > +}
> > > +
> > > +int set_netns(int netns_fd)
> > > +{
> > > +   return setns(netns_fd, CLONE_NEWNET);
> > > +}
> >
> > We have open_netns/close_netns in network_helpers.h that provide similar
> > functionality, let's use them instead?
> >
>
> I have divided open_netns() into smaller pieces (see below), because the code I
> have added into xdp_metadata looked better with those smaller pieces (I had to
> switch namespace several times).

Forgot to reply to this part. I missed the fact that you're extending
network_helpers, sorry.
But why do we need extra namespaces at all?


* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-06 17:27       ` Stanislav Fomichev
@ 2023-07-07  8:33         ` Larysa Zaremba
  2023-07-07 16:49           ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-07  8:33 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 06, 2023 at 10:27:38AM -0700, Stanislav Fomichev wrote:
> On Thu, Jul 6, 2023 at 7:15 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> >
> > On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> > > On 07/03, Larysa Zaremba wrote:
> > > > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > > > from VLAN interface, attached to veth. Unfortunately, this approach is
> > > > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > > > such feature.
> > > >
> > > > Replace AF_XDP packet generation with sending the same datagram via
> > > > AF_INET socket.
> > > >
> > > > This does not change the packet contents or hints values with one notable
> > > > exception: rx_hash_type, which previously was expected to be 0, now is
> > > > expected to be at least XDP_RSS_TYPE_L4.
> > > >
> > > > Also, usage of AF_INET requires a little more complicated namespace setup,
> > > > therefore open_netns() helper function is divided into smaller reusable
> > > > pieces.
> > >
> > > Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> > > with TX metadata:
> > > https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> > >
> > > So probably long-term I'll switch it back to AF_XDP but will add
> > > support for requesting vlan TX "offload" from the veth.
> > >
> >
> > My bad for not reading your series. Amazing work as always!
> >
> > So, 'requesting vlan TX "offload"' with new hints capabilities? This would be
> > pretty neat.
> >
> > But you think AF_INET TX is worth keeping for now, until TX hints are mature?
> >
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> > > >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> > > >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> > > >  3 files changed, 98 insertions(+), 117 deletions(-)
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > > > index a105c0cd008a..19463230ece5 100644
> > > > --- a/tools/testing/selftests/bpf/network_helpers.c
> > > > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > > > @@ -386,28 +386,51 @@ char *ping_command(int family)
> > > >     return "ping";
> > > >  }
> > > >
> > > > +int get_cur_netns(void)
> > > > +{
> > > > +   int nsfd;
> > > > +
> > > > +   nsfd = open("/proc/self/ns/net", O_RDONLY);
> > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > +   return nsfd;
> > > > +}
> > > > +
> > > > +int get_netns(const char *name)
> > > > +{
> > > > +   char nspath[PATH_MAX];
> > > > +   int nsfd;
> > > > +
> > > > +   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > > +   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > +   return nsfd;
> > > > +}
> > > > +
> > > > +int set_netns(int netns_fd)
> > > > +{
> > > > +   return setns(netns_fd, CLONE_NEWNET);
> > > > +}
> > >
> > > We have open_netns/close_netns in network_helpers.h that provide similar
> > > functionality, let's use them instead?
> > >
> >
> > I have divided open_netns() into smaller pieces (see below), because the code I
> > have added into xdp_metadata looked better with those smaller pieces (I had to
> > switch namespace several times).
> 
> Forgot to reply to this part. I missed the fact that you're extending
> network_helpers, sorry.
> But why do we need extra namespaces at all?

If both veths are in the same namespace, AF_INET packets are not sent between
them, so XDP is skipped. So we need 2 test namespaces: one for RX and one for TX.
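
A minimal sketch of hopping into a named namespace and back, in the spirit of
the helpers from the patch, is shown below. It is not the network_helpers.c
code: demo_netns_path() and demo_run_in_netns() are hypothetical names, the
/var/run/netns prefix is the one used in the quoted patch, and switching
namespaces needs sufficient privileges (typically CAP_SYS_ADMIN).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks in case the feature macro above was defined too late. */
#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000
#endif
int setns(int fd, int nstype);

/* Build the path of a named netns; returns 0 on success, -1 if truncated. */
static int demo_netns_path(char *buf, size_t size, const char *name)
{
	int n = snprintf(buf, size, "/var/run/netns/%s", name);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Run fn() inside the named namespace, then return to the original one. */
static int demo_run_in_netns(const char *name, int (*fn)(void))
{
	char path[256];
	int orig_fd, ns_fd, ret = -1;

	if (demo_netns_path(path, sizeof(path), name))
		return -1;

	/* Keep a handle on the current namespace so we can switch back. */
	orig_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (orig_fd < 0)
		return -1;

	ns_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (ns_fd < 0)
		goto out_orig;

	if (setns(ns_fd, CLONE_NEWNET) == 0) {
		ret = fn();
		/* Failing to return would poison later steps, so report it. */
		if (setns(orig_fd, CLONE_NEWNET))
			ret = -1;
	}

	close(ns_fd);
out_orig:
	close(orig_fd);
	return ret;
}
```

With something like this, the send step would run under the TX namespace name
and the verification step under the RX one, each call restoring the caller's
namespace afterwards.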


* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-06 14:46             ` Larysa Zaremba
@ 2023-07-07 13:57               ` Jesper Dangaard Brouer
  2023-07-07 17:58                 ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-07 13:57 UTC (permalink / raw)
  To: Larysa Zaremba, Jesper Dangaard Brouer
  Cc: brouer, John Fastabend, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Andrew Lunn



On 06/07/2023 16.46, Larysa Zaremba wrote:
> On Tue, Jul 04, 2023 at 04:18:04PM +0200, Jesper Dangaard Brouer wrote:
>>
>>
>> On 04/07/2023 13.02, Larysa Zaremba wrote:
>>> On Tue, Jul 04, 2023 at 12:23:45PM +0200, Jesper Dangaard Brouer wrote:
>>>>
>>>> On 04/07/2023 10.23, Larysa Zaremba wrote:
>>>>> On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
>>>>>> Larysa Zaremba wrote:
>>>>>>> Implement functionality that enables drivers to expose VLAN tag
>>>>>>> to XDP code.
>>>>>>>
>>>>>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>>>>>> ---
>>>>>>>     Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
>>>>>>>     include/linux/netdevice.h                    |  2 ++
>>>>>>>     include/net/xdp.h                            |  2 ++
>>>>>>>     kernel/bpf/offload.c                         |  2 ++
>>>>>>>     net/core/xdp.c                               | 20 ++++++++++++++++++++
>>>>>>>     5 files changed, 33 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>>>>>> index 25ce72af81c2..ea6dd79a21d3 100644
>>>>>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>>>>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>>>>>> @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
>>>>>>>     metadata is supported, this set will grow:
>>>>>>>     .. kernel-doc:: net/core/xdp.c
>>>>>>> -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
>>>>>>> +   :identifiers: bpf_xdp_metadata_rx_timestamp
>>>>>>> +
>>>>>>> +.. kernel-doc:: net/core/xdp.c
>>>>>>> +   :identifiers: bpf_xdp_metadata_rx_hash
>>>>>>> +
>>>>>>> +.. kernel-doc:: net/core/xdp.c
>>>>>>> +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
>>>>>>>     An XDP program can use these kfuncs to read the metadata into stack
>>>>>>>     variables for its own consumption. Or, to pass the metadata on to other
>>>> [...]
>>>>>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>>>>>> index 41e5ca8643ec..f6262c90e45f 100644
>>>>>>> --- a/net/core/xdp.c
>>>>>>> +++ b/net/core/xdp.c
>>>>>>> @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
>>>>>>>     	return -EOPNOTSUPP;
>>>>>>>     }
>>>>>>> +/**
>>>>>>> + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
>>>>>>> + * @ctx: XDP context pointer.
>>>>>>> + * @vlan_tag: Destination pointer for VLAN tag
>>>>>>> + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
>>>>>>> + *
>>>>>>> + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
>>>>>>> + * containing VLAN ID, vlan_proto contains protocol identifier.
>>>>>>
>>>>>> Above is a bit confusing to me at least.
>>>>>>
>>>>>> The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
>>>>>> are to be included here? The VlanID or the full 16bit TCI meaning the
>>>>>> PCP+DEI+VID?
>>>>>
>>>>> It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
>>>>> xdp_hw_metadata") this is more clear, because the tag is parsed.
>>>>>
>>>>
>>>> Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
>>>> Protocol IDentifier)?
>>>> I mean, it can basically only have two values[1], and we just wanted to
>>>> know if it is a VLAN (that hardware offloaded/removed for us):
>>>
>>> If we assume everyone follows the standard, this would be correct.
>>> But apparently, some applications use some ambiguous value as a TPID [0].
>>>
>>> So it is not hard to imagine, some NICs could alllow you to configure your
>>> custom TPID. I am not sure if any in-tree drivers actually do this, but I think
>>> it's nice to provide some flexibility on XDP level, especially considering
>>> network stack stores full vlan_proto.
>>>
>>
>> I'm buying your argument, and agree it makes sense to provide TPID in
>> the call signature.  Given weird hardware exists that allow people to
>> configure custom TPID.
>>
>> Looking through kernel defines (in uapi/linux/if_ether.h) I see evidence
>> that funky QinQ EtherTypes have been used in the past:
>>
>>   #define ETH_P_QINQ1	0x9100		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
>> REGISTERED ID ] */
>>   #define ETH_P_QINQ2	0x9200		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
>> REGISTERED ID ] */
>>   #define ETH_P_QINQ3	0x9300		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
>> REGISTERED ID ] */
>>
>>
>>> [0]
>>> https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1938a_l2-lan_cg/content/495503472.htm
>>>
>>>>
>>>>    static __always_inline int proto_is_vlan(__u16 h_proto)
>>>>    {
>>>> 	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
>>>> 		  h_proto == bpf_htons(ETH_P_8021AD));
>>>>    }
>>>>
>>>> [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79
>>>>
>>>> Cc. Andrew Lunn, as I notice DSA have a fake VLAN define ETH_P_DSA_8021Q
>>>> (in file include/uapi/linux/if_ether.h)
>>>> Is this actually in use?
>>>> Maybe some hardware can "VLAN" offload this?
>>>>
>>>>
>>>>> What about rephrasing it this way:
>>>>>
>>>>> In case of success, vlan_proto contains VLAN protocol identifier (TPID),
>>>>> vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
>>>>>
>>>>
>>>> Hmm, I think we can improve this further. This text becomes part of the
>>>> documentation for end-users (target audience).  Thus, I think it is
>>>> worth being more verbose and even mention the existing defines that we
>>>> are expecting end-users to take advantage of.
>>>>
>>>> What about:
>>>>
>>>> In case of success. The VLAN EtherType is stored in vlan_proto (usually
>>>> either ETH_P_8021Q or ETH_P_8021AD) also known as TPID (Tag Protocol
>>>> IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
>>>> containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12-bits
>>>> commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
>>>> of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
>>>> (DEI) (formerly CFI) please reference other documentation. Remember
>>>> these 16-bit fields are stored in network-byte. Thus, transformation
>>>> with byte-order helper functions like bpf_ntohs() are needed.
>>>>
>>>
>>> AFAIK, vlan_tag is stored in host byte order, this is how it is in skb.
>>
>> I'm not sure we should follow SKB storage scheme for XDP.
>>
> 
> I think following SKB convention is a good idea in this particular case. As I
> have mentioned below, in ice VLAN TCI in descriptor already comes in LE, so no
> point in converting it into BE, so somebody would use bpf_ntohs() later anyway.
> We are not the only manufacturer that does this.
> 

As long as other NIC hardware does the same this seems okay.
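
For illustration, under the convention discussed in this thread (vlan_tag in
host byte order, vlan_proto in network byte order), a consumer could decode the
two values roughly as below. This is only a minimal userspace sketch, not code
from the series: the SK_* constants mirror values from the kernel's if_ether.h
and if_vlan.h, while struct vlan_fields, vlan_tci_split() and
vlan_proto_is_standard() are hypothetical names made up for the example.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdbool.h>
#include <stdint.h>

/* Local copies of kernel values, renamed so the sketch builds standalone. */
#define SK_ETH_P_8021Q		0x8100	/* ETH_P_8021Q */
#define SK_ETH_P_8021AD		0x88A8	/* ETH_P_8021AD */
#define SK_VLAN_VID_MASK	0x0fff	/* VLAN_VID_MASK */
#define SK_VLAN_DEI_MASK	0x1000	/* VLAN_CFI_MASK (DEI, formerly CFI) */
#define SK_VLAN_PRIO_SHIFT	13	/* VLAN_PRIO_SHIFT */

/* Hypothetical container for the three TCI sub-fields. */
struct vlan_fields {
	uint16_t vid;	/* 12-bit VLAN ID */
	uint8_t dei;	/* 1-bit Drop Eligible Indicator */
	uint8_t pcp;	/* 3-bit Priority Code Point */
};

/* tci_host: PCP+DEI+VID in host byte order, so no ntohs() is applied. */
static struct vlan_fields vlan_tci_split(uint16_t tci_host)
{
	struct vlan_fields f = {
		.vid = (uint16_t)(tci_host & SK_VLAN_VID_MASK),
		.dei = (tci_host & SK_VLAN_DEI_MASK) ? 1 : 0,
		.pcp = (uint8_t)(tci_host >> SK_VLAN_PRIO_SHIFT),
	};

	return f;
}

/* proto_net: TPID in network byte order, so constants are converted instead. */
static bool vlan_proto_is_standard(uint16_t proto_net)
{
	return proto_net == htons(SK_ETH_P_8021Q) ||
	       proto_net == htons(SK_ETH_P_8021AD);
}
```

The asymmetry is the part worth documenting loudly: the tag is masked and
shifted directly, while the protocol is only ever compared against
htons()-converted constants.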


>>> In ice, we receive VLAN tag in descriptor already in LE.
>>> Only protocol is BE (network byte order). So I would replace the last 2
>>> sentences with the following:
>>>
>>> vlan_tag is stored in host byte order, so no byte order conversion is needed.
>>
>> Yikes, that was unexpected.  This needs to be heavily documented in docs.
> 
> You mean the motivation, why it is so and not the other way around?
> 

No, I don't mean the motivation.
I simply mean write it in *bold*.

Look at the description for bpf_xdp_metadata_rx_hash, how it gets
rendered [1] and how the code comments look [2].

  [1] https://kernel.org/doc/html/latest/networking/xdp-rx-metadata.html#general-design
  [2] https://elixir.bootlin.com/linux/v6.4/source/net/core/xdp.c#L724

To save you some time compiling htmldocs target:

  make SPHINXDIRS="networking" V=1  htmldocs

>>
>> When parsing packets, it is in network-byte-order, else my code is wrong
>> here[1]:
>>
>>    [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L122
>>
>> I'm accessing the skb->vlan_tci here [2], and I notice I don't do any
>> byte-order conversions, so fortunately I didn't make a code mistake.
>>
>>    [2] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c#L215
>>
> 
> In raw packet, VLAN TCI is in network byte order, but skb requires NIC/driver
> to convert it into host byte order before putting it into skb.
>

I'm interested in whether *most* NIC hardware delivers this in LE
(little-endian), which is host byte order on x86.


>>> vlan_proto is stored in network byte order, the suggested way to use this value:
>>>
>>> vlan_proto == bpf_htons(ETH_P_8021Q)
>>>
>>>>
>>>>
>>
>> --Jesper
>>
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-07  8:33         ` Larysa Zaremba
@ 2023-07-07 16:49           ` Stanislav Fomichev
  2023-07-07 16:58             ` Larysa Zaremba
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 16:49 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/07, Larysa Zaremba wrote:
> On Thu, Jul 06, 2023 at 10:27:38AM -0700, Stanislav Fomichev wrote:
> > On Thu, Jul 6, 2023 at 7:15 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> > > > On 07/03, Larysa Zaremba wrote:
> > > > > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > > > > from VLAN interface, attached to veth. Unfortunately, this approach is
> > > > > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > > > > such feature.
> > > > >
> > > > > Replace AF_XDP packet generation with sending the same datagram via
> > > > > AF_INET socket.
> > > > >
> > > > > This does not change the packet contents or hints values with one notable
> > > > > exception: rx_hash_type, which previously was expected to be 0, now is
> > > > > expected be at least XDP_RSS_TYPE_L4.
> > > > >
> > > > > Also, usage of AF_INET requires a little more complicated namespace setup,
> > > > > therefore open_netns() helper function is divided into smaller reusable
> > > > > pieces.
> > > >
> > > > Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> > > > with TX metadata:
> > > > https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> > > >
> > > > So probably long-term I'll switch it back to AF_XDP but will add
> > > > support for requesting vlan TX "offload" from the veth.
> > > >
> > >
> > > My bad for not reading your series. Amazing work as always!
> > >
> > > So, 'requesting vlan TX "offload"' with new hints capabilities? This would be
> > > pretty neat.
> > >
> > > But you think AF_INET TX is worth keeping for now, until TX hints are mature?
> > >
> > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > ---
> > > > >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> > > > >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> > > > >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> > > > >  3 files changed, 98 insertions(+), 117 deletions(-)
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > > > > index a105c0cd008a..19463230ece5 100644
> > > > > --- a/tools/testing/selftests/bpf/network_helpers.c
> > > > > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > > > > @@ -386,28 +386,51 @@ char *ping_command(int family)
> > > > >     return "ping";
> > > > >  }
> > > > >
> > > > > +int get_cur_netns(void)
> > > > > +{
> > > > > +   int nsfd;
> > > > > +
> > > > > +   nsfd = open("/proc/self/ns/net", O_RDONLY);
> > > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > > +   return nsfd;
> > > > > +}
> > > > > +
> > > > > +int get_netns(const char *name)
> > > > > +{
> > > > > +   char nspath[PATH_MAX];
> > > > > +   int nsfd;
> > > > > +
> > > > > +   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > > > +   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > > +   return nsfd;
> > > > > +}
> > > > > +
> > > > > +int set_netns(int netns_fd)
> > > > > +{
> > > > > +   return setns(netns_fd, CLONE_NEWNET);
> > > > > +}
> > > >
> > > > We have open_netns/close_netns in network_helpers.h that provide similar
> > > > functionality, let's use them instead?
> > > >
> > >
> > > I have divided open_netns() into smaller pieces (see below), because the code I
> > > have added into xdp_metadata looked better with those smaller pieces (I had to
> > > switch namespace several times).
> > 
> > Forgot to reply to this part. I missed the fact that you're extending
> > network_helpers, sorry.
> > But why do we need extra namespaces at all?
> 
> If veths are in the same namespace, AF_INET packets are not sent between them,
> so XDP is skipped. So we need 2 test namespaces: for RX and TX.

Makes sense. But let's maybe use the existing helpers to jump to/from
namespaces?

It might be a bit more verbose, but it makes it easy to annotate namespace
begin/end (compared to random jumping around with setns).

tok = open_netns("tx");
do_something();
close_netns(tok);

tok = open_netns("rx");
do_something_else();
close_netns(tok);

Should be doable?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata
  2023-07-07 16:49           ` Stanislav Fomichev
@ 2023-07-07 16:58             ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-07 16:58 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Fri, Jul 07, 2023 at 09:49:51AM -0700, Stanislav Fomichev wrote:
> On 07/07, Larysa Zaremba wrote:
> > On Thu, Jul 06, 2023 at 10:27:38AM -0700, Stanislav Fomichev wrote:
> > > On Thu, Jul 6, 2023 at 7:15 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> > > >
> > > > On Wed, Jul 05, 2023 at 10:39:35AM -0700, Stanislav Fomichev wrote:
> > > > > On 07/03, Larysa Zaremba wrote:
> > > > > > The easiest way to simulate stripped VLAN tag in veth is to send a packet
> > > > > > from VLAN interface, attached to veth. Unfortunately, this approach is
> > > > > > incompatible with AF_XDP on TX side, because VLAN interfaces do not have
> > > > > > such feature.
> > > > > >
> > > > > > Replace AF_XDP packet generation with sending the same datagram via
> > > > > > AF_INET socket.
> > > > > >
> > > > > > This does not change the packet contents or hints values with one notable
> > > > > > exception: rx_hash_type, which previously was expected to be 0, now is
> > > > > > expected be at least XDP_RSS_TYPE_L4.
> > > > > >
> > > > > > Also, usage of AF_INET requires a little more complicated namespace setup,
> > > > > > therefore open_netns() helper function is divided into smaller reusable
> > > > > > pieces.
> > > > >
> > > > > Ack, it's probably OK for now, but, FYI, I'm trying to extend this part
> > > > > with TX metadata:
> > > > > https://lore.kernel.org/bpf/20230621170244.1283336-10-sdf@google.com/
> > > > >
> > > > > So probably long-term I'll switch it back to AF_XDP but will add
> > > > > support for requesting vlan TX "offload" from the veth.
> > > > >
> > > >
> > > > My bad for not reading your series. Amazing work as always!
> > > >
> > > > So, 'requesting vlan TX "offload"' with new hints capabilities? This would be
> > > > pretty neat.
> > > >
> > > > But you think AF_INET TX is worth keeping for now, until TX hints are mature?
> > > >
> > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > ---
> > > > > >  tools/testing/selftests/bpf/network_helpers.c |  37 +++-
> > > > > >  tools/testing/selftests/bpf/network_helpers.h |   3 +
> > > > > >  .../selftests/bpf/prog_tests/xdp_metadata.c   | 175 +++++++-----------
> > > > > >  3 files changed, 98 insertions(+), 117 deletions(-)
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> > > > > > index a105c0cd008a..19463230ece5 100644
> > > > > > --- a/tools/testing/selftests/bpf/network_helpers.c
> > > > > > +++ b/tools/testing/selftests/bpf/network_helpers.c
> > > > > > @@ -386,28 +386,51 @@ char *ping_command(int family)
> > > > > >     return "ping";
> > > > > >  }
> > > > > >
> > > > > > +int get_cur_netns(void)
> > > > > > +{
> > > > > > +   int nsfd;
> > > > > > +
> > > > > > +   nsfd = open("/proc/self/ns/net", O_RDONLY);
> > > > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > > > +   return nsfd;
> > > > > > +}
> > > > > > +
> > > > > > +int get_netns(const char *name)
> > > > > > +{
> > > > > > +   char nspath[PATH_MAX];
> > > > > > +   int nsfd;
> > > > > > +
> > > > > > +   snprintf(nspath, sizeof(nspath), "%s/%s", "/var/run/netns", name);
> > > > > > +   nsfd = open(nspath, O_RDONLY | O_CLOEXEC);
> > > > > > +   ASSERT_GE(nsfd, 0, "open /proc/self/ns/net");
> > > > > > +   return nsfd;
> > > > > > +}
> > > > > > +
> > > > > > +int set_netns(int netns_fd)
> > > > > > +{
> > > > > > +   return setns(netns_fd, CLONE_NEWNET);
> > > > > > +}
> > > > >
> > > > > We have open_netns/close_netns in network_helpers.h that provide similar
> > > > > functionality, let's use them instead?
> > > > >
> > > >
> > > > I have divided open_netns() into smaller pieces (see below), because the code I
> > > > have added into xdp_metadata looked better with those smaller pieces (I had to
> > > > switch namespace several times).
> > > 
> > > Forgot to reply to this part. I missed the fact that you're extending
> > > network_helpers, sorry.
> > > But why do we need extra namespaces at all?
> > 
> > If veths are in the same namespace, AF_INET packets are not sent between them,
> > so XDP is skipped. So we need 2 test namespaces: for RX and TX.
> 
> Makes sense. But let's maybe use the existing helpers to jump to/from
> namespaces?
> 
> It might be a bit more verbose, but it makes it easy to annotate namespace
> being/end. (compared to random jumping around with setns)
> 
> tok = open_netns("tx");
> do_something();
> close_netns(tok);
> 
> tok = open_netns("rx");
> do_something_else();
> close_netns(tok);
> 
> Should be doable?

I guess you are right. I will rewrite this part to use open_netns()/close_netns(),
especially considering that, according to CI, I have messed up namespace FD
management.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint
  2023-07-07 13:57               ` Jesper Dangaard Brouer
@ 2023-07-07 17:58                 ` Larysa Zaremba
  0 siblings, 0 replies; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-07 17:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, John Fastabend, bpf, ast, daniel, andrii, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Andrew Lunn

On Fri, Jul 07, 2023 at 03:57:13PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 06/07/2023 16.46, Larysa Zaremba wrote:
> > On Tue, Jul 04, 2023 at 04:18:04PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 04/07/2023 13.02, Larysa Zaremba wrote:
> > > > On Tue, Jul 04, 2023 at 12:23:45PM +0200, Jesper Dangaard Brouer wrote:
> > > > > 
> > > > > On 04/07/2023 10.23, Larysa Zaremba wrote:
> > > > > > On Mon, Jul 03, 2023 at 01:15:34PM -0700, John Fastabend wrote:
> > > > > > > Larysa Zaremba wrote:
> > > > > > > > Implement functionality that enables drivers to expose VLAN tag
> > > > > > > > to XDP code.
> > > > > > > > 
> > > > > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > > > > ---
> > > > > > > >     Documentation/networking/xdp-rx-metadata.rst |  8 +++++++-
> > > > > > > >     include/linux/netdevice.h                    |  2 ++
> > > > > > > >     include/net/xdp.h                            |  2 ++
> > > > > > > >     kernel/bpf/offload.c                         |  2 ++
> > > > > > > >     net/core/xdp.c                               | 20 ++++++++++++++++++++
> > > > > > > >     5 files changed, 33 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > index 25ce72af81c2..ea6dd79a21d3 100644
> > > > > > > > --- a/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > > > > > > > @@ -18,7 +18,13 @@ Currently, the following kfuncs are supported. In the future, as more
> > > > > > > >     metadata is supported, this set will grow:
> > > > > > > >     .. kernel-doc:: net/core/xdp.c
> > > > > > > > -   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> > > > > > > > +   :identifiers: bpf_xdp_metadata_rx_timestamp
> > > > > > > > +
> > > > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > > > +   :identifiers: bpf_xdp_metadata_rx_hash
> > > > > > > > +
> > > > > > > > +.. kernel-doc:: net/core/xdp.c
> > > > > > > > +   :identifiers: bpf_xdp_metadata_rx_vlan_tag
> > > > > > > >     An XDP program can use these kfuncs to read the metadata into stack
> > > > > > > >     variables for its own consumption. Or, to pass the metadata on to other
> > > > > [...]
> > > > > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > > > > index 41e5ca8643ec..f6262c90e45f 100644
> > > > > > > > --- a/net/core/xdp.c
> > > > > > > > +++ b/net/core/xdp.c
> > > > > > > > @@ -738,6 +738,26 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
> > > > > > > >     	return -EOPNOTSUPP;
> > > > > > > >     }
> > > > > > > > +/**
> > > > > > > > + * bpf_xdp_metadata_rx_vlan_tag - Get XDP packet outermost VLAN tag with protocol
> > > > > > > > + * @ctx: XDP context pointer.
> > > > > > > > + * @vlan_tag: Destination pointer for VLAN tag
> > > > > > > > + * @vlan_proto: Destination pointer for VLAN protocol identifier in network byte order.
> > > > > > > > + *
> > > > > > > > + * In case of success, vlan_tag contains VLAN tag, including 12 least significant bytes
> > > > > > > > + * containing VLAN ID, vlan_proto contains protocol identifier.
> > > > > > > 
> > > > > > > Above is a bit confusing to me at least.
> > > > > > > 
> > > > > > > The vlan tag would be both the 16bit TPID and 16bit TCI. What fields
> > > > > > > are to be included here? The VlanID or the full 16bit TCI meaning the
> > > > > > > PCP+DEI+VID?
> > > > > > 
> > > > > > It contains PCP+DEI+VID, in patch 16 ("selftests/bpf: Add flags and new hints to
> > > > > > xdp_hw_metadata") this is more clear, because the tag is parsed.
> > > > > > 
> > > > > 
> > > > > Do we really care about the "EtherType" proto (in VLAN speak TPID = Tag
> > > > > Protocol IDentifier)?
> > > > > I mean, it can basically only have two values[1], and we just wanted to
> > > > > know if it is a VLAN (that hardware offloaded/removed for us):
> > > > 
> > > > If we assume everyone follows the standard, this would be correct.
> > > > But apparently, some applications use some ambiguous value as a TPID [0].
> > > > 
> > > > So it is not hard to imagine, some NICs could alllow you to configure your
> > > > custom TPID. I am not sure if any in-tree drivers actually do this, but I think
> > > > it's nice to provide some flexibility on XDP level, especially considering
> > > > network stack stores full vlan_proto.
> > > > 
> > > 
> > > I'm buying your argument, and agree it makes sense to provide TPID in
> > > the call signature.  Given weird hardware exists that allow people to
> > > configure custom TPID.
> > > 
> > > Looking through kernel defines (in uapi/linux/if_ether.h) I see evidence
> > > that funky QinQ EtherTypes have been used in the past:
> > > 
> > >   #define ETH_P_QINQ1	0x9100		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
> > > REGISTERED ID ] */
> > >   #define ETH_P_QINQ2	0x9200		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
> > > REGISTERED ID ] */
> > >   #define ETH_P_QINQ3	0x9300		/* deprecated QinQ VLAN [ NOT AN OFFICIALLY
> > > REGISTERED ID ] */
> > > 
> > > 
> > > > [0]
> > > > https://techhub.hpe.com/eginfolib/networking/docs/switches/7500/5200-1938a_l2-lan_cg/content/495503472.htm
> > > > 
> > > > > 
> > > > >    static __always_inline int proto_is_vlan(__u16 h_proto)
> > > > >    {
> > > > > 	return !!(h_proto == bpf_htons(ETH_P_8021Q) ||
> > > > > 		  h_proto == bpf_htons(ETH_P_8021AD));
> > > > >    }
> > > > > 
> > > > > [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L75-L79
> > > > > 
> > > > > Cc. Andrew Lunn, as I notice DSA have a fake VLAN define ETH_P_DSA_8021Q
> > > > > (in file include/uapi/linux/if_ether.h)
> > > > > Is this actually in use?
> > > > > Maybe some hardware can "VLAN" offload this?
> > > > > 
> > > > > 
> > > > > > What about rephrasing it this way:
> > > > > > 
> > > > > > In case of success, vlan_proto contains VLAN protocol identifier (TPID),
> > > > > > vlan_tag contains the remaining 16 bits of a 802.1Q tag (PCP+DEI+VID).
> > > > > > 
> > > > > 
> > > > > Hmm, I think we can improve this further. This text becomes part of the
> > > > > documentation for end-users (target audience).  Thus, I think it is
> > > > > worth being more verbose and even mention the existing defines that we
> > > > > are expecting end-users to take advantage of.
> > > > > 
> > > > > What about:
> > > > > 
> > > > > In case of success. The VLAN EtherType is stored in vlan_proto (usually
> > > > > either ETH_P_8021Q or ETH_P_8021AD) also known as TPID (Tag Protocol
> > > > > IDentifier). The VLAN tag is stored in vlan_tag, which is a 16-bit field
> > > > > containing sub-fields (PCP+DEI+VID). The VLAN ID (VID) is 12-bits
> > > > > commonly extracted using mask VLAN_VID_MASK (0x0fff).  For the meaning
> > > > > of the sub-fields Priority Code Point (PCP) and Drop Eligible Indicator
> > > > > (DEI) (formerly CFI) please reference other documentation. Remember
> > > > > these 16-bit fields are stored in network-byte. Thus, transformation
> > > > > with byte-order helper functions like bpf_ntohs() are needed.
> > > > > 
> > > > 
> > > > AFAIK, vlan_tag is stored in host byte order, this is how it is in skb.
> > > 
> > > I'm not sure we should follow SKB storage scheme for XDP.
> > > 
> > 
> > I think following SKB convention is a good idea in this particular case. As I
> > have mentioned below, in ice VLAN TCI in descriptor already comes in LE, so no
> > point in converting it into BE, so somebody would use bpf_ntohs() later anyway.
> > We are not the only manufacturer that does this.
> > 
> 
> As long as other NIC hardware does the same this seems okay.
> 
> 
> > > > In ice, we receive VLAN tag in descriptor already in LE.
> > > > Only protocol is BE (network byte order). So I would replace the last 2
> > > > sentences with the following:
> > > > 
> > > > vlan_tag is stored in host byte order, so no byte order conversion is needed.
> > > 
> > > Yikes, that was unexpected.  This needs to be heavily documented in docs.
> > 
> > You mean the motivation, why it is so and not the other way around?
> > 
> 
> No, I don't mean the motivation.
> I simply mean write it in *bold*.
> 
> Look at the description for bpf_xdp_metadata_rx_hash, how it gets
> rendered [1] and how the code comments look [2].
> 
>  [1] https://kernel.org/doc/html/latest/networking/xdp-rx-metadata.html#general-design
>  [2] https://elixir.bootlin.com/linux/v6.4/source/net/core/xdp.c#L724
> 
> To save you some time compiling htmldocs target:
> 
>  make SPHINXDIRS="networking" V=1  htmldocs
> 

Ok, will do :)

> > > 
> > > When parsing packets, it is in network-byte-order, else my code is wrong
> > > here[1]:
> > > 
> > >    [1] https://github.com/xdp-project/bpf-examples/blob/master/include/xdp/parsing_helpers.h#L122
> > > 
> > > I'm accessing the skb->vlan_tci here [2], and I notice I don't do any
> > > byte-order conversions, so fortunately I didn't make a code mistake.
> > > 
> > >    [2] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c#L215
> > > 
> > 
> > In raw packet, VLAN TCI is in network byte order, but skb requires NIC/driver
> > to convert it into host byte order before putting it into skb.
> > 
> 
> I'm interested in if *most* NIC hardware will deliver this in LE
> (Little-Endian) which is host-byte order on x86 ?
>

At least Intel, Pensando and some Broadcom products deliver VLAN TCI in LE.
Mellanox delivers it in BE.
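
Either way, the driver would normalize the descriptor field to host byte order
before handing it to the XDP program. A rough userspace sketch of that step is
below. It is an illustration only: the function names are hypothetical, and an
in-kernel driver would use le16_to_cpu() or be16_to_cpu() on the descriptor
field rather than assembling bytes by hand.

```c
#include <stdint.h>

/*
 * Hypothetical helper: TCI as written by a device that stores it
 * little-endian (low byte first). The result is in host byte order on any
 * CPU, because the value is rebuilt from individual bytes.
 */
static uint16_t tci_from_le_desc(const uint8_t raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/*
 * Hypothetical helper: TCI as written by a device that stores it
 * big-endian (high byte first), i.e. in network byte order.
 */
static uint16_t tci_from_be_desc(const uint8_t raw[2])
{
	return (uint16_t)((raw[0] << 8) | raw[1]);
}
```

With this in place, the XDP program sees the same host-order value regardless
of which layout the hardware uses.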

> 
> > > > vlan_proto is stored in network byte order, the suggested way to use this value:
> > > > 
> > > > vlan_proto == bpf_htons(ETH_P_8021Q)
> > > > 
> > > > > 
> > > > > 
> > > 
> > > --Jesper
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32
  2023-07-06 14:51     ` Larysa Zaremba
@ 2023-07-10 14:01       ` Alexander Lobakin
  0 siblings, 0 replies; 66+ messages in thread
From: Alexander Lobakin @ 2023-07-10 14:01 UTC (permalink / raw)
  To: Larysa Zaremba, John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

From: Larysa Zaremba <larysa.zaremba@intel.com>
Date: Thu, 6 Jul 2023 16:51:22 +0200

> On Mon, Jul 03, 2023 at 02:06:46PM -0700, John Fastabend wrote:
>> Larysa Zaremba wrote:
>>> From: Aleksander Lobakin <aleksander.lobakin@intel.com>
>>>
>>> When using XDP hints, metadata sometimes has to be much bigger
>>> than 32 bytes. Relax the restriction, allow metadata larger than 32 bytes
>>> and make __skb_metadata_differs() work with bigger lengths.
>>>
>>> Now size of metadata is only limited by the fact it is stored as u8
>>> in skb_shared_info, so maximum possible value is 255. Other important
>>> conditions, such as having enough space for xdp_frame building, are already
>>> checked in bpf_xdp_adjust_meta().
>>>
>>> The requirement of having its length aligned to 4 bytes is still
>>> valid.
>>>
>>> Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com>
>>> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>>> ---
>>>  include/linux/skbuff.h | 13 ++++++++-----
>>>  include/net/xdp.h      |  7 ++++++-
>>>  2 files changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 91ed66952580..cd49cdd71019 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -4209,10 +4209,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
>>>  {
>>>  	const void *a = skb_metadata_end(skb_a);
>>>  	const void *b = skb_metadata_end(skb_b);
>>> -	/* Using more efficient varaiant than plain call to memcmp(). */
>>> -#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
>>
>> Why are we removing the ifdef here? Its adding a runtime 'if' when its not
>> necessary. I would keep the ifdef and simply add the default case
>> in the switch.
> 
> Seems like Alex has missed your message, but we discussed this with him before, 
> so I know the answer: Compiler will 100% convert it into a compile-time 'if' and 
> this looks nicer than preprocessor condition.

Sorry, I'm not always able to follow all the threads =\

As Larysa said, it's not a runtime `if`. Both conditions are always
known at compilation time.
And this looks a bit less ugly than with ifdefs to me :D

Thanks,
Olek

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-06 16:39       ` Stanislav Fomichev
@ 2023-07-10 15:49         ` Larysa Zaremba
  2023-07-10 18:12           ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2023-07-10 15:49 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jul 06, 2023 at 09:39:29AM -0700, Stanislav Fomichev wrote:
> On Thu, Jul 6, 2023 at 7:27 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> >
> > On Wed, Jul 05, 2023 at 10:30:56AM -0700, Stanislav Fomichev wrote:
> > > On 07/03, Larysa Zaremba wrote:
> > > > Use previously refactored code and create a function
> > > > that allows XDP code to read HW timestamp.
> > > >
> > > > Also, move cached_phctime into the packet context; this way the data still
> > > > stays in the ring structure, just at a different address.
> > > >
> > > > HW timestamp is the first supported hint in the driver,
> > > > so also add xdp_metadata_ops.
> > > >
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice.h          |  2 ++
> > > >  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
> > > >  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
> > > >  drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
> > > >  drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
> > > >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
> > > >  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
> > > >  7 files changed, 31 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > index 4ba3d99439a0..7a973a2229f1 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > @@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
> > > >     set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
> > > >     clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
> > > >  }
> > > > +
> > > > +extern const struct xdp_metadata_ops ice_xdp_md_ops;
> > > >  #endif /* _ICE_H_ */
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > @@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> > > >             /* clone ring and setup updated count */
> > > >             rx_rings[i] = *vsi->rx_rings[i];
> > > >             rx_rings[i].count = new_rx_cnt;
> > > > -           rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> > > > +           rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > > >             rx_rings[i].desc = NULL;
> > > >             rx_rings[i].rx_buf = NULL;
> > > >             /* this is to allow wr32 to have something to write to
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > index 00e3afd507a4..eb69b0ac7956 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > @@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
> > > >             ring->netdev = vsi->netdev;
> > > >             ring->dev = dev;
> > > >             ring->count = vsi->num_rx_desc;
> > > > -           ring->cached_phctime = pf->ptp.cached_phc_time;
> > > > +           ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > > >             WRITE_ONCE(vsi->rx_rings[i], ring);
> > > >     }
> > > >
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > index 93979ab18bc1..f21996b812ea 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > @@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
> > > >
> > > >     netdev->netdev_ops = &ice_netdev_ops;
> > > >     netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
> > > > +   netdev->xdp_metadata_ops = &ice_xdp_md_ops;
> > > >     ice_set_ethtool_ops(netdev);
> > > >
> > > >     if (vsi->type != ICE_VSI_PF)
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > index a31333972c68..70697e4829dd 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > @@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
> > > >             ice_for_each_rxq(vsi, j) {
> > > >                     if (!vsi->rx_rings[j])
> > > >                             continue;
> > > > -                   WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
> > > > +                   WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
> > > >             }
> > > >     }
> > > >     clear_bit(ICE_CFG_BUSY, pf->state);
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > index d0ab2c4c0c91..4237702a58a9 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > @@ -259,6 +259,7 @@ enum ice_rx_dtype {
> > > >
> > > >  struct ice_pkt_ctx {
> > > >     const union ice_32b_rx_flex_desc *eop_desc;
> > > > +   u64 cached_phctime;
> > > >  };
> > > >
> > > >  struct ice_xdp_buff {
> > > > @@ -354,7 +355,6 @@ struct ice_rx_ring {
> > > >     struct ice_tx_ring *xdp_ring;
> > > >     struct xsk_buff_pool *xsk_pool;
> > > >     dma_addr_t dma;                 /* physical address of ring */
> > > > -   u64 cached_phctime;
> > > >     u16 rx_buf_len;
> > > >     u8 dcb_tc;                      /* Traffic class of ring */
> > > >     u8 ptp_rx;
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > index beb1c5bb392a..463d9e5cbe05 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > @@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
> > > >                     spin_unlock(&xdp_ring->tx_lock);
> > > >     }
> > > >  }
> > > > +
> > > > +/**
> > > > + * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
> > > > + * @ctx: XDP buff pointer
> > > > + * @ts_ns: destination address
> > > > + *
> > > > + * Copy HW timestamp (if available) to the destination address.
> > > > + */
> > > > +static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
> > > > +{
> > > > +   const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> > > > +   u64 cached_time;
> > > > +
> > > > +   cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);
> > >
> > > I believe we have to have something like the following here:
> > >
> > > if (!ts_ns)
> > >       return -EINVAL;
> > >
> > > IOW, I don't think verifier guarantees that those pointer args are
> > > non-NULL.
> >
> > Oh, that's a shame.
> >
> > > Same for the other ice kfunc you're adding and veth changes.
> > >
> > > Can you also fix it for the existing veth kfuncs? (or lmk if you prefer me
> > > to fix it).
> >
> > I think I can send fixes for RX hash and timestamp in veth separately, before
> > v3 of this patchset, code probably doesn't intersect.
> >
> > But argument checks in kfuncs are a bit of a gray area for me: should they
> > be sent to the stable tree or not?
> 
> Add a Fixes tag and they will get into the stable trees automatically I believe?

What about declaring XDP hints kfuncs with

BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS)

instead of BTF_ID_FLAGS(func, name, 0)
?

I have tested this just now and xdp_metadata passes just fine (so both stack 
and data_meta destination pointers work), but if I replace &timestamp with NULL,
verifier rejects the program with a descriptive message "Possibly NULL pointer 
passed to trusted arg1", so it serves our purpose. I do not see many ways this 
could limit the users, but it definitely benefits driver developers.

The only concern I see is that if we ever decide to allow NULL arguments for 
kfuncs, we'd need to add support for a "_or_null" suffix [0]. But it doesn't 
sound too hard?

I have dug into this, because adding

if (unlikely(!hash || !rss_type))
	return -EINVAL;

or something similar to every .xmo_ handler in existence starts to look ugly.

[0] 
https://lore.kernel.org/lkml/20230120054441.arj5h6yrnh5jsrgr@MacBook-Pro-6.local.dhcp.thefacebook.com/
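
For reference, the per-handler check discussed above would look
roughly like the following user-space sketch. This is not driver code:
struct demo_pkt and demo_rx_hash() are hypothetical names that only
mimic the shape of an .xmo_ handler, under the assumption that both
output pointers may be NULL unless the verifier rejects such programs:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a driver's per-packet context. */
struct demo_pkt {
	uint32_t hash;
	uint32_t rss_type;
	int hash_valid;
};

/* Mimics an RX hash hint handler that validates its output pointers. */
static int demo_rx_hash(const struct demo_pkt *pkt, uint32_t *hash,
			uint32_t *rss_type)
{
	/* This is the boilerplate every handler would need to repeat. */
	if (!hash || !rss_type)
		return -EINVAL;

	/* No valid hash in the descriptor: nothing to report. */
	if (!pkt->hash_valid)
		return -ENODATA;

	*hash = pkt->hash;
	*rss_type = pkt->rss_type;
	return 0;
}
```

With KF_TRUSTED_ARGS the first check would be expected to become
unnecessary, since a program passing NULL is rejected at load time.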


* Re: [xdp-hints] Re: [PATCH bpf-next v2 12/20] xdp: Add checksum level hint
  2023-07-06 12:49                 ` Larysa Zaremba
@ 2023-07-10 16:58                   ` Alexander Lobakin
  0 siblings, 0 replies; 66+ messages in thread
From: Alexander Lobakin @ 2023-07-10 16:58 UTC (permalink / raw)
  To: Larysa Zaremba, Jesper Dangaard Brouer, John Fastabend
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	David S. Miller, Alexander Duyck

From: Larysa Zaremba <larysa.zaremba@intel.com>
Date: Thu, 6 Jul 2023 14:49:44 +0200

> On Thu, Jul 06, 2023 at 02:38:33PM +0200, Larysa Zaremba wrote:
>> On Thu, Jul 06, 2023 at 11:04:49AM +0200, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 06/07/2023 07.50, John Fastabend wrote:
>>>> Larysa Zaremba wrote:
>>>>> On Tue, Jul 04, 2023 at 12:39:06PM +0200, Jesper Dangaard Brouer wrote:
>>>>>> Cc. DaveM+Alex Duyck, as I value your insights on checksums.

[...]

>>>>>>>>> + * Return:
>>>>>>>>> + * * Returns 0 on success or ``-errno`` on error.
>>>>>>>>> + * * ``-EOPNOTSUPP`` : device driver doesn't implement kfunc
>>>>>>>>> + * * ``-ENODATA``    : Checksum was not validated
>>>>>>>>> + */
>>>>>>>>> +__bpf_kfunc int bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *csum_level)
>>>>>>>>
>>>>>>>> Instead of ENODATA, should we return what would be put in the ip_summed field,
>>>>>>>> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}? Then the sig would be,
>>>>>>
>>>>>> I was thinking the same, what about checksum "type".
>>>>>>
>>>>>>>>
>>>>>>>>    bpf_xdp_metadata_rx_csum_lvl(const struct xdp_md *ctx, u8 *type, u8 *lvl);
>>>>>>>>
>>>>>>>> or something like that? Or is the thought that it's not really necessary?
>>>>>>>> I don't have a strong preference but figured it was worth asking.
>>>>>>>>
>>>>>>>
>>>>>>> I see no value in returning CHECKSUM_COMPLETE without the actual checksum value.
>>>>>>> Same with CHECKSUM_PARTIAL and csum_start. Returning those values too would
>>>>>>> overcomplicate the function signature.
>>>>>>
>>>>>> So, is success of this kfunc bpf_xdp_metadata_rx_csum_lvl() equivalent to
>>>>>> CHECKSUM_UNNECESSARY?
>>>>>
>>>>> This is 100% true for physical NICs; it's more complicated for veth, because it
>>>>> often receives CHECKSUM_PARTIAL, which shouldn't normally appear on RX, but is
>>>>> treated by the network stack as a validated checksum, because there is no way
>>>>> an internally generated packet could be messed up. I would be grateful if you could
>>>>> look at the veth patch and share your opinion about this.
>>>>>
>>>>>>
>>>>>> Looking at documentation[1] (generated from skbuff.h):
>>>>>>   [1] https://kernel.org/doc/html/latest/networking/skbuff.html#checksumming-of-received-packets-by-device
>>>>>>
>>>>>> Is the idea that we can add another kfunc (new signature) that can deal
>>>>>> with the other types of checksums (in a later kernel release)?
>>>>>>
>>>>>
>>>>> Yes, that is the idea.
>>>>
>>>> If we think there is a chance we might need another kfunc we should add it
>>>> in the same kfunc. It would be unfortunate to have to do two kfuncs when
>>>> one would work. It shouldn't cost much/anything(?) to hardcode the type for
>>>> most cases? I think if we need it later I would advocate for updating this
>>>> kfunc to support it. Of course then userspace will have to swivel on the
>>>> kfunc signature.
>>>>
>>>
>>> I think it might make sense to have 3 kfuncs for checksumming.

Isn't that overcomplicating? 3 callbacks for just one damn thing. IOW I
agree with John.

PARTIAL and COMPLETE are mutually exclusive. Their "additional" output
can be unionized. Level is 2 bits, status is 2 bits. Level makes sense
only with UNNECESSARY (correct me if I'm wrong).
IOW the kfunc could return:

-errno - not implemented or something went wrong
0 - none
1 - complete
2 - partial
3 + lvl - unnecessary

(CHECKSUM_* defs could be shuffled accordingly)

Then `if (ret > 2)` would mean UNNECESSARY and most programs could stop
here already. Programs wanting to extract the level can do `ret - 3`.
One additional pointer to u32 (union) to fetch additional data. I would
even say "BPF prog can pass NULL if it doesn't care", but OTOH I dunno
how to validate PARTIAL then :D (COMPLETE usually assumes it's valid)
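
A minimal sketch of that return-value encoding, assuming the shuffled
values listed above. The demo_* names are hypothetical and this is
plain user-space C, not a proposed kfunc implementation:

```c
#include <errno.h>

/* Hypothetical values, following the proposed ordering above. */
enum demo_csum {
	DEMO_CSUM_NONE = 0,
	DEMO_CSUM_COMPLETE = 1,
	DEMO_CSUM_PARTIAL = 2,
	DEMO_CSUM_UNNECESSARY = 3,	/* base value, level is added on top */
};

/* Pack status and level into a single non-negative return value. */
static int demo_csum_encode(enum demo_csum status, unsigned int lvl)
{
	if (status != DEMO_CSUM_UNNECESSARY)
		return status;

	/* Level is 2 bits wide, so only 0..3 can be represented. */
	if (lvl > 3)
		return -EINVAL;

	return DEMO_CSUM_UNNECESSARY + lvl;
}

/* Most programs could stop after this single comparison. */
static int demo_csum_is_unnecessary(int ret)
{
	return ret > 2;
}

/* Only meaningful when demo_csum_is_unnecessary(ret) is true. */
static int demo_csum_level(int ret)
{
	return ret - 3;
}
```

For example, UNNECESSARY with level 2 would be returned as 5, and a
caller that does not care about the level only needs the `ret > 2`
test.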

>>> This would allow a BPF prog to focus on CHECKSUM_UNNECESSARY, and then
>>> only call an additional kfunc for extracting e.g. csum_start + csum_offset
>>> when the type is CHECKSUM_PARTIAL.
>>>
>>> We could extend bpf_xdp_metadata_rx_csum_lvl() to give the csum_type
>>> CHECKSUM_{NONE, UNNECESSARY, COMPLETE, PARTIAL}.
>>>
>>>  int bpf_xdp_metadata_rx_csum_lvl(*ctx, u8 *csum_level, u8 *csum_type)
>>>
>>> And then add two kfunc e.g.
>>>  (1) bpf_xdp_metadata_rx_csum_partial(ctx, start, offset)
>>>  (2) bpf_xdp_metadata_rx_csum_complete(ctx, csum)
>>>
>>> Pseudo BPF-prog code:
>>>
>>>  err = bpf_xdp_metadata_rx_csum_lvl(ctx, level, type);
>>>  if (!err && type != CHECKSUM_UNNECESSARY) {

And hurt cool HW which by default returns COMPLETE? }:>

>>>      if (type == CHECKSUM_PARTIAL)
>>>          err = bpf_xdp_metadata_rx_csum_partial(ctx, start, offset);
>>>      if (type == CHECKSUM_COMPLETE)
>>>          err = bpf_xdp_metadata_rx_csum_complete(ctx, csum);

I don't feel like 1 hotpath `if` is worth multiplying kfuncs.

[...]

Thanks,
Olek


* Re: [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint
  2023-07-10 15:49         ` Larysa Zaremba
@ 2023-07-10 18:12           ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2023-07-10 18:12 UTC (permalink / raw)
  To: Larysa Zaremba
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On 07/10, Larysa Zaremba wrote:
> On Thu, Jul 06, 2023 at 09:39:29AM -0700, Stanislav Fomichev wrote:
> > On Thu, Jul 6, 2023 at 7:27 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 10:30:56AM -0700, Stanislav Fomichev wrote:
> > > > On 07/03, Larysa Zaremba wrote:
> > > > > Use previously refactored code and create a function
> > > > > that allows XDP code to read HW timestamp.
> > > > >
> > > > > Also, move cached_phctime into the packet context; this way the data still
> > > > > stays in the ring structure, just at a different address.
> > > > >
> > > > > HW timestamp is the first supported hint in the driver,
> > > > > so also add xdp_metadata_ops.
> > > > >
> > > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > > ---
> > > > >  drivers/net/ethernet/intel/ice/ice.h          |  2 ++
> > > > >  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  2 +-
> > > > >  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
> > > > >  drivers/net/ethernet/intel/ice/ice_main.c     |  1 +
> > > > >  drivers/net/ethernet/intel/ice/ice_ptp.c      |  2 +-
> > > > >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  2 +-
> > > > >  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 24 +++++++++++++++++++
> > > > >  7 files changed, 31 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > > index 4ba3d99439a0..7a973a2229f1 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > > @@ -943,4 +943,6 @@ static inline void ice_clear_rdma_cap(struct ice_pf *pf)
> > > > >     set_bit(ICE_FLAG_UNPLUG_AUX_DEV, pf->flags);
> > > > >     clear_bit(ICE_FLAG_RDMA_ENA, pf->flags);
> > > > >  }
> > > > > +
> > > > > +extern const struct xdp_metadata_ops ice_xdp_md_ops;
> > > > >  #endif /* _ICE_H_ */
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > > index 8d5cbbd0b3d5..3c3b9cbfbcd3 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > > > > @@ -2837,7 +2837,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> > > > >             /* clone ring and setup updated count */
> > > > >             rx_rings[i] = *vsi->rx_rings[i];
> > > > >             rx_rings[i].count = new_rx_cnt;
> > > > > -           rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> > > > > +           rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > > > >             rx_rings[i].desc = NULL;
> > > > >             rx_rings[i].rx_buf = NULL;
> > > > >             /* this is to allow wr32 to have something to write to
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > > index 00e3afd507a4..eb69b0ac7956 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> > > > > @@ -1445,7 +1445,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
> > > > >             ring->netdev = vsi->netdev;
> > > > >             ring->dev = dev;
> > > > >             ring->count = vsi->num_rx_desc;
> > > > > -           ring->cached_phctime = pf->ptp.cached_phc_time;
> > > > > +           ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> > > > >             WRITE_ONCE(vsi->rx_rings[i], ring);
> > > > >     }
> > > > >
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > index 93979ab18bc1..f21996b812ea 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > > @@ -3384,6 +3384,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
> > > > >
> > > > >     netdev->netdev_ops = &ice_netdev_ops;
> > > > >     netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
> > > > > +   netdev->xdp_metadata_ops = &ice_xdp_md_ops;
> > > > >     ice_set_ethtool_ops(netdev);
> > > > >
> > > > >     if (vsi->type != ICE_VSI_PF)
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > > index a31333972c68..70697e4829dd 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > > > > @@ -1038,7 +1038,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
> > > > >             ice_for_each_rxq(vsi, j) {
> > > > >                     if (!vsi->rx_rings[j])
> > > > >                             continue;
> > > > > -                   WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
> > > > > +                   WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime, systime);
> > > > >             }
> > > > >     }
> > > > >     clear_bit(ICE_CFG_BUSY, pf->state);
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > > index d0ab2c4c0c91..4237702a58a9 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > > > @@ -259,6 +259,7 @@ enum ice_rx_dtype {
> > > > >
> > > > >  struct ice_pkt_ctx {
> > > > >     const union ice_32b_rx_flex_desc *eop_desc;
> > > > > +   u64 cached_phctime;
> > > > >  };
> > > > >
> > > > >  struct ice_xdp_buff {
> > > > > @@ -354,7 +355,6 @@ struct ice_rx_ring {
> > > > >     struct ice_tx_ring *xdp_ring;
> > > > >     struct xsk_buff_pool *xsk_pool;
> > > > >     dma_addr_t dma;                 /* physical address of ring */
> > > > > -   u64 cached_phctime;
> > > > >     u16 rx_buf_len;
> > > > >     u8 dcb_tc;                      /* Traffic class of ring */
> > > > >     u8 ptp_rx;
> > > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > > index beb1c5bb392a..463d9e5cbe05 100644
> > > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > > > > @@ -546,3 +546,27 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
> > > > >                     spin_unlock(&xdp_ring->tx_lock);
> > > > >     }
> > > > >  }
> > > > > +
> > > > > +/**
> > > > > + * ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
> > > > > + * @ctx: XDP buff pointer
> > > > > + * @ts_ns: destination address
> > > > > + *
> > > > > + * Copy HW timestamp (if available) to the destination address.
> > > > > + */
> > > > > +static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
> > > > > +{
> > > > > +   const struct ice_xdp_buff *xdp_ext = (void *)ctx;
> > > > > +   u64 cached_time;
> > > > > +
> > > > > +   cached_time = READ_ONCE(xdp_ext->pkt_ctx.cached_phctime);
> > > >
> > > > I believe we have to have something like the following here:
> > > >
> > > > if (!ts_ns)
> > > >       return -EINVAL;
> > > >
> > > > IOW, I don't think verifier guarantees that those pointer args are
> > > > non-NULL.
> > >
> > > Oh, that's a shame.
> > >
> > > > Same for the other ice kfunc you're adding and veth changes.
> > > >
> > > > Can you also fix it for the existing veth kfuncs? (or lmk if you prefer me
> > > > to fix it).
> > >
> > > I think I can send fixes for RX hash and timestamp in veth separately, before
> > > v3 of this patchset, code probably doesn't intersect.
> > >
> > > But argument checks in kfuncs are a bit of a gray area for me: should they
> > > be sent to the stable tree or not?
> > 
> > Add a Fixes tag and they will get into the stable trees automatically I believe?
> 
> What about declaring XDP hints kfuncs with
> 
> BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS)
> 
> instead of BTF_ID_FLAGS(func, name, 0)
> ?
> 
> I have tested this just now and xdp_metadata passes just fine (so both stack 
> and data_meta destination pointers work), but if I replace &timestamp with NULL,
> verifier rejects the program with a descriptive message "Possibly NULL pointer 
> passed to trusted arg1", so it serves our purpose. I do not see many ways this 
> could limit the users, but it definitely benefits driver developers.
> 
> The only concern I see is that if we ever decide to allow NULL arguments for 
> kfuncs, we'd need to add support for a "_or_null" suffix [0]. But it doesn't 
> sound too hard?
> 
> I have dug into this, because adding
> 
> if (unlikely(!hash || !rss_type))
> 	return -EINVAL;
> 
> or something similar to every .xmo_ handler in existence starts to look ugly.
> 
> [0] 
> https://lore.kernel.org/lkml/20230120054441.arj5h6yrnh5jsrgr@MacBook-Pro-6.local.dhcp.thefacebook.com/

SG! Let's add KF_TRUSTED_ARGS. That is much nicer indeed!


Thread overview: 66+ messages
2023-07-03 18:12 [PATCH bpf-next v2 00/20] XDP metadata via kfuncs for ice Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 01/20] ice: make RX hash reading code more reusable Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 02/20] ice: make RX HW timestamp " Larysa Zaremba
2023-07-04 10:04   ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 03/20] ice: make RX checksum checking " Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 04/20] ice: Make ptype internal to descriptor info processing Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 05/20] ice: Introduce ice_xdp_buff Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 06/20] ice: Support HW timestamp hint Larysa Zaremba
2023-07-05 17:30   ` Stanislav Fomichev
2023-07-06 14:22     ` Larysa Zaremba
2023-07-06 16:39       ` Stanislav Fomichev
2023-07-10 15:49         ` Larysa Zaremba
2023-07-10 18:12           ` Stanislav Fomichev
2023-07-03 18:12 ` [PATCH bpf-next v2 07/20] ice: Support RX hash XDP hint Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 08/20] ice: Support XDP hints in AF_XDP ZC mode Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 09/20] xdp: Add VLAN tag hint Larysa Zaremba
2023-07-03 20:15   ` John Fastabend
2023-07-04  8:23     ` Larysa Zaremba
2023-07-04 10:23       ` Jesper Dangaard Brouer
2023-07-04 11:02         ` Larysa Zaremba
2023-07-04 14:18           ` Jesper Dangaard Brouer
2023-07-06 14:46             ` Larysa Zaremba
2023-07-07 13:57               ` Jesper Dangaard Brouer
2023-07-07 17:58                 ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 10/20] ice: Implement " Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 11/20] ice: use VLAN proto from ring packet context in skb path Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 12/20] xdp: Add checksum level hint Larysa Zaremba
2023-07-03 20:38   ` John Fastabend
2023-07-04  9:24     ` Larysa Zaremba
2023-07-04 10:39       ` Jesper Dangaard Brouer
2023-07-04 11:19         ` Larysa Zaremba
2023-07-06  5:50           ` John Fastabend
2023-07-06  9:04             ` [xdp-hints] " Jesper Dangaard Brouer
2023-07-06 12:38               ` Larysa Zaremba
2023-07-06 12:49                 ` Larysa Zaremba
2023-07-10 16:58                   ` Alexander Lobakin
2023-07-03 18:12 ` [PATCH bpf-next v2 13/20] ice: Implement " Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 14/20] selftests/bpf: Allow VLAN packets in xdp_hw_metadata Larysa Zaremba
2023-07-05 17:31   ` Stanislav Fomichev
2023-07-03 18:12 ` [PATCH bpf-next v2 15/20] net, xdp: allow metadata > 32 Larysa Zaremba
2023-07-03 21:06   ` John Fastabend
2023-07-06 14:51     ` Larysa Zaremba
2023-07-10 14:01       ` Alexander Lobakin
2023-07-03 18:12 ` [PATCH bpf-next v2 16/20] selftests/bpf: Add flags and new hints to xdp_hw_metadata Larysa Zaremba
2023-07-04 11:03   ` Jesper Dangaard Brouer
2023-07-04 11:04     ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 17/20] veth: Implement VLAN tag and checksum level XDP hint Larysa Zaremba
2023-07-05 17:25   ` Stanislav Fomichev
2023-07-06  9:57     ` Jesper Dangaard Brouer
2023-07-06 10:15       ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 18/20] selftests/bpf: Use AF_INET for TX in xdp_metadata Larysa Zaremba
2023-07-05 17:39   ` Stanislav Fomichev
2023-07-06 14:11     ` Larysa Zaremba
2023-07-06 17:25       ` Stanislav Fomichev
2023-07-06 17:27       ` Stanislav Fomichev
2023-07-07  8:33         ` Larysa Zaremba
2023-07-07 16:49           ` Stanislav Fomichev
2023-07-07 16:58             ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 19/20] selftests/bpf: Check VLAN tag and proto " Larysa Zaremba
2023-07-05 17:41   ` Stanislav Fomichev
2023-07-06 10:10   ` Jesper Dangaard Brouer
2023-07-06 10:13     ` Larysa Zaremba
2023-07-03 18:12 ` [PATCH bpf-next v2 20/20] selftests/bpf: check checksum level " Larysa Zaremba
2023-07-05 17:41   ` Stanislav Fomichev
2023-07-06 10:25   ` Jesper Dangaard Brouer
2023-07-06 12:02     ` Larysa Zaremba
