* [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf
@ 2023-07-05 15:55 Alexander Lobakin
From: Alexander Lobakin @ 2023-07-05 15:55 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Here's a two-shot: introduce the Intel Ethernet common library (libie)
and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:

It's no secret that there's a ton of code duplicated between two or
more Intel Ethernet modules. Before introducing new changes, which
would need to be copied over again, start decoupling the already
existing duplicate functionality into a new module, which will be
shared between several Intel Ethernet drivers.

The first name that came to mind was "libie" -- "Intel Ethernet common
library". It also sounds like "lovelie" and can be expanded as "lib
Internet Explorer" :P I'm open to anything else (as long as it's
justified).

The series is only the beginning. From now on, adding every new
feature or doing any serious driver refactoring will remove far more
lines than it adds, for quite some time. There's a basic roadmap with
some deduplications already planned, not to mention that touching any
line now raises the question: "can I share this?".

The PP conversion for iavf lands within the same series, as the two
are closely tied. libie will support the Page Pool model only, so a
driver can't use much of the lib until it's converted. iavf is only
the example; the rest will be converted soon on a per-driver basis.
That is when it gets really interesting. Stay tuned.

Alexander Lobakin (9):
  net: intel: introduce Intel Ethernet common library
  iavf: kill "legacy-rx" for good
  iavf: drop page splitting and recycling
  net: page_pool: add DMA-sync-for-CPU inline helpers
  libie: add Rx buffer management (via Page Pool)
  iavf: switch to Page Pool
  libie: add common queue stats
  libie: add per-queue Page Pool stats
  iavf: switch queue stats to libie

 MAINTAINERS                                   |   3 +-
 drivers/net/ethernet/intel/Kconfig            |  11 +
 drivers/net/ethernet/intel/Makefile           |   1 +
 drivers/net/ethernet/intel/i40e/i40e_common.c | 253 --------
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   1 +
 .../net/ethernet/intel/i40e/i40e_prototype.h  |   7 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  74 +--
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  88 ---
 drivers/net/ethernet/intel/iavf/iavf.h        |   2 +-
 drivers/net/ethernet/intel/iavf/iavf_common.c | 253 --------
 .../net/ethernet/intel/iavf/iavf_ethtool.c    | 226 +------
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  44 +-
 .../net/ethernet/intel/iavf/iavf_prototype.h  |   7 -
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 608 ++++--------------
 drivers/net/ethernet/intel/iavf/iavf_txrx.h   | 176 +----
 drivers/net/ethernet/intel/iavf/iavf_type.h   |  90 ---
 .../net/ethernet/intel/iavf/iavf_virtchnl.c   |  20 +-
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 316 ---------
 drivers/net/ethernet/intel/ice/ice_main.c     |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  74 +--
 drivers/net/ethernet/intel/libie/Makefile     |   7 +
 drivers/net/ethernet/intel/libie/internal.h   |  23 +
 drivers/net/ethernet/intel/libie/rx.c         | 183 ++++++
 drivers/net/ethernet/intel/libie/stats.c      | 190 ++++++
 include/linux/net/intel/libie/rx.h            | 241 +++++++
 include/linux/net/intel/libie/stats.h         | 214 ++++++
 include/net/page_pool.h                       |  49 +-
 27 files changed, 1137 insertions(+), 2025 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/libie/Makefile
 create mode 100644 drivers/net/ethernet/intel/libie/internal.h
 create mode 100644 drivers/net/ethernet/intel/libie/rx.c
 create mode 100644 drivers/net/ethernet/intel/libie/stats.c
 create mode 100644 include/linux/net/intel/libie/rx.h
 create mode 100644 include/linux/net/intel/libie/stats.h

---
Sent directly to net-next, as it has a non-Intel code change (#4) :p

Based on the PP "hybrid" allocation series[0] and requires it to work.

From v3[1]:
* base on the latest net-next, update bloat-o-meter and perf stats;
* split generic PP optimizations into a separate series;
* drop "optimize hotpath a bunch" commit: a lot of [controversial]
  changes in one place, worth own series (Alex);
* 02: pick Rev-by (Alex);
* 03: move in-place recycling removal here from the dropped patch;
* 05: new, add libie Rx buffer API separately from the iavf changes;
* 05-06: use the new "hybrid" allocation API from [0] to reduce memory
  usage when a page can fit more than one truesize (also asked by David);
* 06: merge with "always use order-0 page" commit to reduce diffs and
  simplify things (Alex);
* 09: fix page_alloc_fail counter.

From v2[2]:
* 0006: fix page_pool.h include in OcteonTX2 files (Jakub, Patchwork);
* no functional changes.

From v1[3]:
* 0006: new (me, Jakub);
* 0008: give the helpers more intuitive names (Jakub, Ilias);
*  -^-: also expand their kdoc a bit for the same reason;
*  -^-: fix kdoc copy-paste issue (Patchwork, Jakub);
* 0011: drop `inline` from C file (Patchwork, Jakub).

[0] https://lore.kernel.org/netdev/20230629120226.14854-1-linyunsheng@huawei.com
[1] https://lore.kernel.org/netdev/20230530150035.1943669-1-aleksander.lobakin@intel.com
[2] https://lore.kernel.org/netdev/20230525125746.553874-1-aleksander.lobakin@intel.com
[3] https://lore.kernel.org/netdev/20230516161841.37138-1-aleksander.lobakin@intel.com

-- 
2.41.0



* [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library
From: Alexander Lobakin @ 2023-07-05 15:55 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

It's no secret that there's a ton of code duplicated between two or
more Intel Ethernet modules.
Before introducing new changes, which would need to be copied over again,
start decoupling the already existing duplicate functionality into a new
module, which will be shared between several Intel Ethernet drivers.
Add a lookup table which converts the 8/10-bit hardware packet type
into a parsed bitfield structure for easy checking of packet format
parameters such as payload level, IP version, etc. This is currently
used by i40e, ice and iavf, and it is identical in all three drivers.
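
For orientation, here's roughly what such a parsed bitfield can look
like. This is only a sketch modeled on the *_rx_ptype_decoded structs
removed below -- the actual field names and widths in libie's
struct libie_rx_ptype_parsed may differ:

/* Sketch only: modeled on the removed i40e/iavf/ice ptype structs,
 * not the literal libie definition. Needs <linux/types.h> for u16.
 */
struct libie_rx_ptype_parsed {
	u16 outer_ip:2;		/* none / IPv4 / IPv6 */
	u16 outer_frag:1;
	u16 tunnel_type:3;	/* none / IP_IP / GRENAT[_MAC[_VLAN]] */
	u16 tunnel_end_prot:2;
	u16 tunnel_end_frag:1;
	u16 inner_prot:3;	/* none / UDP / TCP / SCTP / ICMP / timesync */
	u16 payload_layer:2;	/* doubles as the skb hash type, see below */
};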

The only difference introduced in this implementation is that instead
of defining a 256-element (or 1024-element in the case of ice) array,
an unlikely() condition limits the input to 154 (the current maximum
non-reserved packet type). There's no reason to waste 600 (or even
3600) bytes just to avoid penalizing very unlikely exception packets.
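
A minimal sketch of the bounded lookup, just to illustrate the idea --
the table and limit names below are assumptions, the real helper is
libie_parse_rx_ptype() added to include/linux/net/intel/libie/rx.h:

/* Sketch only: the limit/table names are assumptions. Relies on
 * unlikely() from <linux/compiler.h>.
 */
#define LIBIE_RX_PTYPE_NUM	154	/* first reserved HW packet type */

extern const struct libie_rx_ptype_parsed
libie_rx_ptype_lut[LIBIE_RX_PTYPE_NUM];

static inline struct libie_rx_ptype_parsed libie_parse_rx_ptype(u32 ptype)
{
	/* reserved/unknown ptypes fall back to the all-zero entry 0 */
	if (unlikely(ptype >= LIBIE_RX_PTYPE_NUM))
		ptype = 0;

	return libie_rx_ptype_lut[ptype];
}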

The hash computation function now takes the payload level directly as
a pkt_hash_type. There are a couple of cases where non-IP ptypes are
marked as L3 payload, and in the previous versions their hash level
would be 2, not 3. But skb_set_hash() only distinguishes L4 from
non-L4, so this won't change anything at all.
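
Here's a sketch of what "payload level is the hash type" means in
practice -- the enum and field names are assumptions, while
skb_set_hash() and PKT_HASH_TYPE_* are the existing core definitions
from <linux/skbuff.h>:

/* Sketch only: payload levels defined directly in terms of
 * enum pkt_hash_types, so no conversion switch-case is needed.
 */
enum libie_rx_ptype_payload_layer {
	LIBIE_RX_PTYPE_PAYLOAD_NONE	= PKT_HASH_TYPE_NONE,	/* = 0 */
	LIBIE_RX_PTYPE_PAYLOAD_L2	= PKT_HASH_TYPE_L2,	/* = 1 */
	LIBIE_RX_PTYPE_PAYLOAD_L3	= PKT_HASH_TYPE_L3,	/* = 2 */
	LIBIE_RX_PTYPE_PAYLOAD_L4	= PKT_HASH_TYPE_L4,	/* = 3 */
};

static inline void libie_skb_set_hash(struct sk_buff *skb, u32 hash,
				      struct libie_rx_ptype_parsed parsed)
{
	/* the parsed payload level is already a valid pkt_hash_type */
	skb_set_hash(skb, hash, parsed.payload_layer);
}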

The module is hidden behind a Kconfig symbol, which the drivers select
when needed. The exports are placed in the 'LIBIE' namespace to limit
the scope of the functions.
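
For reference, the provider/consumer pair for a namespaced export looks
roughly like this (the exported symbol name here is just an example;
the MODULE_IMPORT_NS() side is visible in the diff below):

/* provider, drivers/net/ethernet/intel/libie/rx.c */
EXPORT_SYMBOL_NS_GPL(libie_rx_ptype_lut, LIBIE);

/* consumer, e.g. drivers/net/ethernet/intel/iavf/iavf_main.c */
MODULE_IMPORT_NS(LIBIE);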

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 MAINTAINERS                                   |   3 +-
 drivers/net/ethernet/intel/Kconfig            |  10 +
 drivers/net/ethernet/intel/Makefile           |   1 +
 drivers/net/ethernet/intel/i40e/i40e_common.c | 253 --------------
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   1 +
 .../net/ethernet/intel/i40e/i40e_prototype.h  |   7 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  74 +---
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  88 -----
 drivers/net/ethernet/intel/iavf/iavf_common.c | 253 --------------
 drivers/net/ethernet/intel/iavf/iavf_main.c   |   1 +
 .../net/ethernet/intel/iavf/iavf_prototype.h  |   7 -
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  70 +---
 drivers/net/ethernet/intel/iavf/iavf_type.h   |  88 -----
 .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 316 ------------------
 drivers/net/ethernet/intel/ice/ice_main.c     |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  74 +---
 drivers/net/ethernet/intel/libie/Makefile     |   6 +
 drivers/net/ethernet/intel/libie/rx.c         | 110 ++++++
 include/linux/net/intel/libie/rx.h            | 128 +++++++
 19 files changed, 312 insertions(+), 1179 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/libie/Makefile
 create mode 100644 drivers/net/ethernet/intel/libie/rx.c
 create mode 100644 include/linux/net/intel/libie/rx.h

diff --git a/MAINTAINERS b/MAINTAINERS
index acbe54087d1c..480fe286e3f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10388,7 +10388,8 @@ F:	Documentation/networking/device_drivers/ethernet/intel/
 F:	drivers/net/ethernet/intel/
 F:	drivers/net/ethernet/intel/*/
 F:	include/linux/avf/virtchnl.h
-F:	include/linux/net/intel/iidc.h
+F:	include/linux/net/intel/
+F:	include/linux/net/intel/*/
 
 INTEL ETHERNET PROTOCOL DRIVER FOR RDMA
 M:	Mustafa Ismail <mustafa.ismail@intel.com>
diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 9bc0a9519899..86ecedeac115 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -84,6 +84,12 @@ config E1000E_HWTS
 	 devices. The cross-timestamp is available through the PTP clock
 	 driver precise cross-timestamp ioctl (PTP_SYS_OFFSET_PRECISE).
 
+config LIBIE
+	tristate
+	help
+	  libie (Intel Ethernet library) is a common library containing
+	  routines shared by several Intel Ethernet drivers.
+
 config IGB
 	tristate "Intel(R) 82575/82576 PCI-Express Gigabit Ethernet support"
 	depends on PCI
@@ -225,6 +231,7 @@ config I40E
 	depends on PTP_1588_CLOCK_OPTIONAL
 	depends on PCI
 	select AUXILIARY_BUS
+	select LIBIE
 	help
 	  This driver supports Intel(R) Ethernet Controller XL710 Family of
 	  devices.  For more information on how to identify your adapter, go
@@ -252,6 +259,8 @@ config I40E_DCB
 # so that CONFIG_IAVF symbol will always mirror the state of CONFIG_I40EVF
 config IAVF
 	tristate
+	select LIBIE
+
 config I40EVF
 	tristate "Intel(R) Ethernet Adaptive Virtual Function support"
 	select IAVF
@@ -282,6 +291,7 @@ config ICE
 	depends on GNSS || GNSS = n
 	select AUXILIARY_BUS
 	select DIMLIB
+	select LIBIE
 	select NET_DEVLINK
 	select PLDMFW
 	help
diff --git a/drivers/net/ethernet/intel/Makefile b/drivers/net/ethernet/intel/Makefile
index d80d04132073..ce622b4d825d 100644
--- a/drivers/net/ethernet/intel/Makefile
+++ b/drivers/net/ethernet/intel/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_I40E) += i40e/
 obj-$(CONFIG_IAVF) += iavf/
 obj-$(CONFIG_FM10K) += fm10k/
 obj-$(CONFIG_ICE) += ice/
+obj-$(CONFIG_LIBIE) += libie/
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index ed88e38d488b..25bb858268fc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -383,259 +383,6 @@ int i40e_aq_set_rss_key(struct i40e_hw *hw,
 	return i40e_aq_get_set_rss_key(hw, vsi_id, key, true);
 }
 
-/* The i40e_ptype_lookup table is used to convert from the 8-bit ptype in the
- * hardware to a bit-field that can be used by SW to more easily determine the
- * packet type.
- *
- * Macros are used to shorten the table lines and make this table human
- * readable.
- *
- * We store the PTYPE in the top byte of the bit field - this is just so that
- * we can check that the table doesn't have a row missing, as the index into
- * the table should be the PTYPE.
- *
- * Typical work flow:
- *
- * IF NOT i40e_ptype_lookup[ptype].known
- * THEN
- *      Packet is unknown
- * ELSE IF i40e_ptype_lookup[ptype].outer_ip == I40E_RX_PTYPE_OUTER_IP
- *      Use the rest of the fields to look at the tunnels, inner protocols, etc
- * ELSE
- *      Use the enum i40e_rx_l2_ptype to decode the packet type
- * ENDIF
- */
-
-/* macro to make the table lines short, use explicit indexing with [PTYPE] */
-#define I40E_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
-	[PTYPE] = { \
-		1, \
-		I40E_RX_PTYPE_OUTER_##OUTER_IP, \
-		I40E_RX_PTYPE_OUTER_##OUTER_IP_VER, \
-		I40E_RX_PTYPE_##OUTER_FRAG, \
-		I40E_RX_PTYPE_TUNNEL_##T, \
-		I40E_RX_PTYPE_TUNNEL_END_##TE, \
-		I40E_RX_PTYPE_##TEF, \
-		I40E_RX_PTYPE_INNER_PROT_##I, \
-		I40E_RX_PTYPE_PAYLOAD_LAYER_##PL }
-
-#define I40E_PTT_UNUSED_ENTRY(PTYPE) [PTYPE] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-
-/* shorter macros makes the table fit but are terse */
-#define I40E_RX_PTYPE_NOF		I40E_RX_PTYPE_NOT_FRAG
-#define I40E_RX_PTYPE_FRG		I40E_RX_PTYPE_FRAG
-#define I40E_RX_PTYPE_INNER_PROT_TS	I40E_RX_PTYPE_INNER_PROT_TIMESYNC
-
-/* Lookup table mapping in the 8-bit HW PTYPE to the bit field for decoding */
-struct i40e_rx_ptype_decoded i40e_ptype_lookup[BIT(8)] = {
-	/* L2 Packet types */
-	I40E_PTT_UNUSED_ENTRY(0),
-	I40E_PTT(1,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	I40E_PTT(2,  L2, NONE, NOF, NONE, NONE, NOF, TS,   PAY2),
-	I40E_PTT(3,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	I40E_PTT_UNUSED_ENTRY(4),
-	I40E_PTT_UNUSED_ENTRY(5),
-	I40E_PTT(6,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	I40E_PTT(7,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	I40E_PTT_UNUSED_ENTRY(8),
-	I40E_PTT_UNUSED_ENTRY(9),
-	I40E_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	I40E_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	I40E_PTT(12, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(13, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(14, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(15, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(16, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(17, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(18, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(19, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(20, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(21, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-
-	/* Non Tunneled IPv4 */
-	I40E_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(25),
-	I40E_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	I40E_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	I40E_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv4 */
-	I40E_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(32),
-	I40E_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv6 */
-	I40E_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(39),
-	I40E_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT */
-	I40E_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> IPv4 */
-	I40E_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(47),
-	I40E_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> IPv6 */
-	I40E_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(54),
-	I40E_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC */
-	I40E_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */
-	I40E_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(62),
-	I40E_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */
-	I40E_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(69),
-	I40E_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC/VLAN */
-	I40E_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */
-	I40E_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(77),
-	I40E_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */
-	I40E_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(84),
-	I40E_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* Non Tunneled IPv6 */
-	I40E_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),
-	I40E_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(91),
-	I40E_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	I40E_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	I40E_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv4 */
-	I40E_PTT(95,  IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(96,  IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(97,  IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(98),
-	I40E_PTT(99,  IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv6 */
-	I40E_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(105),
-	I40E_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT */
-	I40E_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> IPv4 */
-	I40E_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(113),
-	I40E_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> IPv6 */
-	I40E_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(120),
-	I40E_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC */
-	I40E_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */
-	I40E_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(128),
-	I40E_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */
-	I40E_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(135),
-	I40E_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN */
-	I40E_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */
-	I40E_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	I40E_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	I40E_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(143),
-	I40E_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	I40E_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	I40E_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */
-	I40E_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	I40E_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	I40E_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	I40E_PTT_UNUSED_ENTRY(150),
-	I40E_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	I40E_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	I40E_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* unused entries */
-	[154 ... 255] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-};
-
 /**
  * i40e_init_shared_code - Initialize the shared code
  * @hw: pointer to hardware structure
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 29ad1797adce..fda6ccc3fc3e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -97,6 +97,7 @@ MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all), Debug mask (0x8XXXXXXX
 
 MODULE_AUTHOR("Intel Corporation, <e1000-devel@lists.sourceforge.net>");
 MODULE_DESCRIPTION("Intel(R) Ethernet Connection XL710 Network Driver");
+MODULE_IMPORT_NS(LIBIE);
 MODULE_LICENSE("GPL v2");
 
 static struct workqueue_struct *i40e_wq;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
index fe845987d99a..5287d0ef32d5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
@@ -380,13 +380,6 @@ void i40e_set_pci_config_data(struct i40e_hw *hw, u16 link_status);
 
 int i40e_set_mac_type(struct i40e_hw *hw);
 
-extern struct i40e_rx_ptype_decoded i40e_ptype_lookup[];
-
-static inline struct i40e_rx_ptype_decoded decode_rx_desc_ptype(u8 ptype)
-{
-	return i40e_ptype_lookup[ptype];
-}
-
 /**
  * i40e_virtchnl_link_speed - Convert AdminQ link_speed to virtchnl definition
  * @link_speed: the speed to convert
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8b8bf4880faa..c4b6cdbf3611 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1,8 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
-#include <linux/prefetch.h>
 #include <linux/bpf_trace.h>
+#include <linux/net/intel/libie/rx.h>
+#include <linux/prefetch.h>
 #include <net/mpls.h>
 #include <net/xdp.h>
 #include "i40e.h"
@@ -1758,40 +1759,32 @@ static inline void i40e_rx_checksum(struct i40e_vsi *vsi,
 				    struct sk_buff *skb,
 				    union i40e_rx_desc *rx_desc)
 {
-	struct i40e_rx_ptype_decoded decoded;
+	struct libie_rx_ptype_parsed parsed;
 	u32 rx_error, rx_status;
 	bool ipv4, ipv6;
 	u8 ptype;
 	u64 qword;
 
+	skb->ip_summed = CHECKSUM_NONE;
+
 	qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
 	ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >> I40E_RXD_QW1_PTYPE_SHIFT;
+
+	parsed = libie_parse_rx_ptype(ptype);
+	if (!libie_has_rx_checksum(vsi->netdev, parsed))
+		return;
+
 	rx_error = (qword & I40E_RXD_QW1_ERROR_MASK) >>
 		   I40E_RXD_QW1_ERROR_SHIFT;
 	rx_status = (qword & I40E_RXD_QW1_STATUS_MASK) >>
 		    I40E_RXD_QW1_STATUS_SHIFT;
-	decoded = decode_rx_desc_ptype(ptype);
-
-	skb->ip_summed = CHECKSUM_NONE;
-
-	skb_checksum_none_assert(skb);
-
-	/* Rx csum enabled and ip headers found? */
-	if (!(vsi->netdev->features & NETIF_F_RXCSUM))
-		return;
 
 	/* did the hardware decode the packet and checksum? */
 	if (!(rx_status & BIT(I40E_RX_DESC_STATUS_L3L4P_SHIFT)))
 		return;
 
-	/* both known and outer_ip must be set for the below code to work */
-	if (!(decoded.known && decoded.outer_ip))
-		return;
-
-	ipv4 = (decoded.outer_ip == I40E_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == I40E_RX_PTYPE_OUTER_IPV4);
-	ipv6 = (decoded.outer_ip == I40E_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == I40E_RX_PTYPE_OUTER_IPV6);
+	ipv4 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV4;
+	ipv6 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV6;
 
 	if (ipv4 &&
 	    (rx_error & (BIT(I40E_RX_DESC_ERROR_IPE_SHIFT) |
@@ -1819,49 +1812,16 @@ static inline void i40e_rx_checksum(struct i40e_vsi *vsi,
 	 * we need to bump the checksum level by 1 to reflect the fact that
 	 * we are indicating we validated the inner checksum.
 	 */
-	if (decoded.tunnel_type >= I40E_RX_PTYPE_TUNNEL_IP_GRENAT)
+	if (parsed.tunnel_type >= LIBIE_RX_PTYPE_TUNNEL_IP_GRENAT)
 		skb->csum_level = 1;
 
-	/* Only report checksum unnecessary for TCP, UDP, or SCTP */
-	switch (decoded.inner_prot) {
-	case I40E_RX_PTYPE_INNER_PROT_TCP:
-	case I40E_RX_PTYPE_INNER_PROT_UDP:
-	case I40E_RX_PTYPE_INNER_PROT_SCTP:
-		skb->ip_summed = CHECKSUM_UNNECESSARY;
-		fallthrough;
-	default:
-		break;
-	}
-
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	return;
 
 checksum_fail:
 	vsi->back->hw_csum_rx_error++;
 }
 
-/**
- * i40e_ptype_to_htype - get a hash type
- * @ptype: the ptype value from the descriptor
- *
- * Returns a hash type to be used by skb_set_hash
- **/
-static inline int i40e_ptype_to_htype(u8 ptype)
-{
-	struct i40e_rx_ptype_decoded decoded = decode_rx_desc_ptype(ptype);
-
-	if (!decoded.known)
-		return PKT_HASH_TYPE_NONE;
-
-	if (decoded.outer_ip == I40E_RX_PTYPE_OUTER_IP &&
-	    decoded.payload_layer == I40E_RX_PTYPE_PAYLOAD_LAYER_PAY4)
-		return PKT_HASH_TYPE_L4;
-	else if (decoded.outer_ip == I40E_RX_PTYPE_OUTER_IP &&
-		 decoded.payload_layer == I40E_RX_PTYPE_PAYLOAD_LAYER_PAY3)
-		return PKT_HASH_TYPE_L3;
-	else
-		return PKT_HASH_TYPE_L2;
-}
-
 /**
  * i40e_rx_hash - set the hash value in the skb
  * @ring: descriptor ring
@@ -1874,17 +1834,19 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
 				struct sk_buff *skb,
 				u8 rx_ptype)
 {
+	struct libie_rx_ptype_parsed parsed;
 	u32 hash;
 	const __le64 rss_mask =
 		cpu_to_le64((u64)I40E_RX_DESC_FLTSTAT_RSS_HASH <<
 			    I40E_RX_DESC_STATUS_FLTSTAT_SHIFT);
 
-	if (!(ring->netdev->features & NETIF_F_RXHASH))
+	parsed = libie_parse_rx_ptype(rx_ptype);
+	if (!libie_has_rx_hash(ring->netdev, parsed))
 		return;
 
 	if ((rx_desc->wb.qword1.status_error_len & rss_mask) == rss_mask) {
 		hash = le32_to_cpu(rx_desc->wb.qword0.hi_dword.rss);
-		skb_set_hash(skb, hash, i40e_ptype_to_htype(rx_ptype));
+		libie_skb_set_hash(skb, hash, parsed);
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 388c3d36d96a..05b8510f99a9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -773,94 +773,6 @@ enum i40e_rx_desc_error_l3l4e_fcoe_masks {
 #define I40E_RXD_QW1_PTYPE_SHIFT	30
 #define I40E_RXD_QW1_PTYPE_MASK		(0xFFULL << I40E_RXD_QW1_PTYPE_SHIFT)
 
-/* Packet type non-ip values */
-enum i40e_rx_l2_ptype {
-	I40E_RX_PTYPE_L2_RESERVED			= 0,
-	I40E_RX_PTYPE_L2_MAC_PAY2			= 1,
-	I40E_RX_PTYPE_L2_TIMESYNC_PAY2			= 2,
-	I40E_RX_PTYPE_L2_FIP_PAY2			= 3,
-	I40E_RX_PTYPE_L2_OUI_PAY2			= 4,
-	I40E_RX_PTYPE_L2_MACCNTRL_PAY2			= 5,
-	I40E_RX_PTYPE_L2_LLDP_PAY2			= 6,
-	I40E_RX_PTYPE_L2_ECP_PAY2			= 7,
-	I40E_RX_PTYPE_L2_EVB_PAY2			= 8,
-	I40E_RX_PTYPE_L2_QCN_PAY2			= 9,
-	I40E_RX_PTYPE_L2_EAPOL_PAY2			= 10,
-	I40E_RX_PTYPE_L2_ARP				= 11,
-	I40E_RX_PTYPE_L2_FCOE_PAY3			= 12,
-	I40E_RX_PTYPE_L2_FCOE_FCDATA_PAY3		= 13,
-	I40E_RX_PTYPE_L2_FCOE_FCRDY_PAY3		= 14,
-	I40E_RX_PTYPE_L2_FCOE_FCRSP_PAY3		= 15,
-	I40E_RX_PTYPE_L2_FCOE_FCOTHER_PA		= 16,
-	I40E_RX_PTYPE_L2_FCOE_VFT_PAY3			= 17,
-	I40E_RX_PTYPE_L2_FCOE_VFT_FCDATA		= 18,
-	I40E_RX_PTYPE_L2_FCOE_VFT_FCRDY			= 19,
-	I40E_RX_PTYPE_L2_FCOE_VFT_FCRSP			= 20,
-	I40E_RX_PTYPE_L2_FCOE_VFT_FCOTHER		= 21,
-	I40E_RX_PTYPE_GRENAT4_MAC_PAY3			= 58,
-	I40E_RX_PTYPE_GRENAT4_MACVLAN_IPV6_ICMP_PAY4	= 87,
-	I40E_RX_PTYPE_GRENAT6_MAC_PAY3			= 124,
-	I40E_RX_PTYPE_GRENAT6_MACVLAN_IPV6_ICMP_PAY4	= 153
-};
-
-struct i40e_rx_ptype_decoded {
-	u32 known:1;
-	u32 outer_ip:1;
-	u32 outer_ip_ver:1;
-	u32 outer_frag:1;
-	u32 tunnel_type:3;
-	u32 tunnel_end_prot:2;
-	u32 tunnel_end_frag:1;
-	u32 inner_prot:4;
-	u32 payload_layer:3;
-};
-
-enum i40e_rx_ptype_outer_ip {
-	I40E_RX_PTYPE_OUTER_L2	= 0,
-	I40E_RX_PTYPE_OUTER_IP	= 1
-};
-
-enum i40e_rx_ptype_outer_ip_ver {
-	I40E_RX_PTYPE_OUTER_NONE	= 0,
-	I40E_RX_PTYPE_OUTER_IPV4	= 0,
-	I40E_RX_PTYPE_OUTER_IPV6	= 1
-};
-
-enum i40e_rx_ptype_outer_fragmented {
-	I40E_RX_PTYPE_NOT_FRAG	= 0,
-	I40E_RX_PTYPE_FRAG	= 1
-};
-
-enum i40e_rx_ptype_tunnel_type {
-	I40E_RX_PTYPE_TUNNEL_NONE		= 0,
-	I40E_RX_PTYPE_TUNNEL_IP_IP		= 1,
-	I40E_RX_PTYPE_TUNNEL_IP_GRENAT		= 2,
-	I40E_RX_PTYPE_TUNNEL_IP_GRENAT_MAC	= 3,
-	I40E_RX_PTYPE_TUNNEL_IP_GRENAT_MAC_VLAN	= 4,
-};
-
-enum i40e_rx_ptype_tunnel_end_prot {
-	I40E_RX_PTYPE_TUNNEL_END_NONE	= 0,
-	I40E_RX_PTYPE_TUNNEL_END_IPV4	= 1,
-	I40E_RX_PTYPE_TUNNEL_END_IPV6	= 2,
-};
-
-enum i40e_rx_ptype_inner_prot {
-	I40E_RX_PTYPE_INNER_PROT_NONE		= 0,
-	I40E_RX_PTYPE_INNER_PROT_UDP		= 1,
-	I40E_RX_PTYPE_INNER_PROT_TCP		= 2,
-	I40E_RX_PTYPE_INNER_PROT_SCTP		= 3,
-	I40E_RX_PTYPE_INNER_PROT_ICMP		= 4,
-	I40E_RX_PTYPE_INNER_PROT_TIMESYNC	= 5
-};
-
-enum i40e_rx_ptype_payload_layer {
-	I40E_RX_PTYPE_PAYLOAD_LAYER_NONE	= 0,
-	I40E_RX_PTYPE_PAYLOAD_LAYER_PAY2	= 1,
-	I40E_RX_PTYPE_PAYLOAD_LAYER_PAY3	= 2,
-	I40E_RX_PTYPE_PAYLOAD_LAYER_PAY4	= 3,
-};
-
 #define I40E_RXD_QW1_LENGTH_PBUF_SHIFT	38
 #define I40E_RXD_QW1_LENGTH_PBUF_MASK	(0x3FFFULL << \
 					 I40E_RXD_QW1_LENGTH_PBUF_SHIFT)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_common.c b/drivers/net/ethernet/intel/iavf/iavf_common.c
index 1afd761d8052..9676e3832c4d 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_common.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_common.c
@@ -468,259 +468,6 @@ enum iavf_status iavf_aq_set_rss_key(struct iavf_hw *hw, u16 vsi_id,
 	return iavf_aq_get_set_rss_key(hw, vsi_id, key, true);
 }
 
-/* The iavf_ptype_lookup table is used to convert from the 8-bit ptype in the
- * hardware to a bit-field that can be used by SW to more easily determine the
- * packet type.
- *
- * Macros are used to shorten the table lines and make this table human
- * readable.
- *
- * We store the PTYPE in the top byte of the bit field - this is just so that
- * we can check that the table doesn't have a row missing, as the index into
- * the table should be the PTYPE.
- *
- * Typical work flow:
- *
- * IF NOT iavf_ptype_lookup[ptype].known
- * THEN
- *      Packet is unknown
- * ELSE IF iavf_ptype_lookup[ptype].outer_ip == IAVF_RX_PTYPE_OUTER_IP
- *      Use the rest of the fields to look at the tunnels, inner protocols, etc
- * ELSE
- *      Use the enum iavf_rx_l2_ptype to decode the packet type
- * ENDIF
- */
-
-/* macro to make the table lines short, use explicit indexing with [PTYPE] */
-#define IAVF_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
-	[PTYPE] = { \
-		1, \
-		IAVF_RX_PTYPE_OUTER_##OUTER_IP, \
-		IAVF_RX_PTYPE_OUTER_##OUTER_IP_VER, \
-		IAVF_RX_PTYPE_##OUTER_FRAG, \
-		IAVF_RX_PTYPE_TUNNEL_##T, \
-		IAVF_RX_PTYPE_TUNNEL_END_##TE, \
-		IAVF_RX_PTYPE_##TEF, \
-		IAVF_RX_PTYPE_INNER_PROT_##I, \
-		IAVF_RX_PTYPE_PAYLOAD_LAYER_##PL }
-
-#define IAVF_PTT_UNUSED_ENTRY(PTYPE) [PTYPE] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-
-/* shorter macros makes the table fit but are terse */
-#define IAVF_RX_PTYPE_NOF		IAVF_RX_PTYPE_NOT_FRAG
-#define IAVF_RX_PTYPE_FRG		IAVF_RX_PTYPE_FRAG
-#define IAVF_RX_PTYPE_INNER_PROT_TS	IAVF_RX_PTYPE_INNER_PROT_TIMESYNC
-
-/* Lookup table mapping the 8-bit HW PTYPE to the bit field for decoding */
-struct iavf_rx_ptype_decoded iavf_ptype_lookup[BIT(8)] = {
-	/* L2 Packet types */
-	IAVF_PTT_UNUSED_ENTRY(0),
-	IAVF_PTT(1,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	IAVF_PTT(2,  L2, NONE, NOF, NONE, NONE, NOF, TS,   PAY2),
-	IAVF_PTT(3,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	IAVF_PTT_UNUSED_ENTRY(4),
-	IAVF_PTT_UNUSED_ENTRY(5),
-	IAVF_PTT(6,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	IAVF_PTT(7,  L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	IAVF_PTT_UNUSED_ENTRY(8),
-	IAVF_PTT_UNUSED_ENTRY(9),
-	IAVF_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	IAVF_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	IAVF_PTT(12, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(13, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(14, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(15, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(16, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(17, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(18, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(19, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(20, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(21, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY3),
-
-	/* Non Tunneled IPv4 */
-	IAVF_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(25),
-	IAVF_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	IAVF_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	IAVF_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv4 */
-	IAVF_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(32),
-	IAVF_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv6 */
-	IAVF_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(39),
-	IAVF_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT */
-	IAVF_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> IPv4 */
-	IAVF_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(47),
-	IAVF_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> IPv6 */
-	IAVF_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(54),
-	IAVF_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC */
-	IAVF_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */
-	IAVF_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(62),
-	IAVF_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */
-	IAVF_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(69),
-	IAVF_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC/VLAN */
-	IAVF_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */
-	IAVF_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(77),
-	IAVF_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */
-	IAVF_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(84),
-	IAVF_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* Non Tunneled IPv6 */
-	IAVF_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),
-	IAVF_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(91),
-	IAVF_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	IAVF_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	IAVF_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv4 */
-	IAVF_PTT(95,  IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(96,  IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(97,  IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(98),
-	IAVF_PTT(99,  IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv6 */
-	IAVF_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(105),
-	IAVF_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT */
-	IAVF_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> IPv4 */
-	IAVF_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(113),
-	IAVF_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> IPv6 */
-	IAVF_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(120),
-	IAVF_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC */
-	IAVF_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */
-	IAVF_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(128),
-	IAVF_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */
-	IAVF_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(135),
-	IAVF_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN */
-	IAVF_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */
-	IAVF_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	IAVF_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	IAVF_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(143),
-	IAVF_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	IAVF_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	IAVF_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */
-	IAVF_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	IAVF_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	IAVF_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	IAVF_PTT_UNUSED_ENTRY(150),
-	IAVF_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	IAVF_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	IAVF_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* unused entries */
-	[154 ... 255] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-};
-
 /**
  * iavf_aq_send_msg_to_pf
  * @hw: pointer to the hardware structure
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index a483eb185c99..23fbd45dd986 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -46,6 +46,7 @@ MODULE_DEVICE_TABLE(pci, iavf_pci_tbl);
 MODULE_ALIAS("i40evf");
 MODULE_AUTHOR("Intel Corporation, <linux.nics@intel.com>");
 MODULE_DESCRIPTION("Intel(R) Ethernet Adaptive Virtual Function Network Driver");
+MODULE_IMPORT_NS(LIBIE);
 MODULE_LICENSE("GPL v2");
 
 static const struct net_device_ops iavf_netdev_ops;
diff --git a/drivers/net/ethernet/intel/iavf/iavf_prototype.h b/drivers/net/ethernet/intel/iavf/iavf_prototype.h
index 940cb4203fbe..c2028beccde7 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_prototype.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_prototype.h
@@ -47,13 +47,6 @@ enum iavf_status iavf_aq_set_rss_key(struct iavf_hw *hw, u16 seid,
 
 enum iavf_status iavf_set_mac_type(struct iavf_hw *hw);
 
-extern struct iavf_rx_ptype_decoded iavf_ptype_lookup[];
-
-static inline struct iavf_rx_ptype_decoded decode_rx_desc_ptype(u8 ptype)
-{
-	return iavf_ptype_lookup[ptype];
-}
-
 void iavf_vf_parse_hw_config(struct iavf_hw *hw,
 			     struct virtchnl_vf_resource *msg);
 enum iavf_status iavf_aq_send_msg_to_pf(struct iavf_hw *hw,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 8c5f6096b002..aae357af072c 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include <linux/net/intel/libie/rx.h>
 #include <linux/prefetch.h>
 
 #include "iavf.h"
@@ -981,40 +982,32 @@ static inline void iavf_rx_checksum(struct iavf_vsi *vsi,
 				    struct sk_buff *skb,
 				    union iavf_rx_desc *rx_desc)
 {
-	struct iavf_rx_ptype_decoded decoded;
+	struct libie_rx_ptype_parsed parsed;
 	u32 rx_error, rx_status;
 	bool ipv4, ipv6;
 	u8 ptype;
 	u64 qword;
 
+	skb->ip_summed = CHECKSUM_NONE;
+
 	qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
 	ptype = (qword & IAVF_RXD_QW1_PTYPE_MASK) >> IAVF_RXD_QW1_PTYPE_SHIFT;
+
+	parsed = libie_parse_rx_ptype(ptype);
+	if (!libie_has_rx_checksum(vsi->netdev, parsed))
+		return;
+
 	rx_error = (qword & IAVF_RXD_QW1_ERROR_MASK) >>
 		   IAVF_RXD_QW1_ERROR_SHIFT;
 	rx_status = (qword & IAVF_RXD_QW1_STATUS_MASK) >>
 		    IAVF_RXD_QW1_STATUS_SHIFT;
-	decoded = decode_rx_desc_ptype(ptype);
-
-	skb->ip_summed = CHECKSUM_NONE;
-
-	skb_checksum_none_assert(skb);
-
-	/* Rx csum enabled and ip headers found? */
-	if (!(vsi->netdev->features & NETIF_F_RXCSUM))
-		return;
 
 	/* did the hardware decode the packet and checksum? */
 	if (!(rx_status & BIT(IAVF_RX_DESC_STATUS_L3L4P_SHIFT)))
 		return;
 
-	/* both known and outer_ip must be set for the below code to work */
-	if (!(decoded.known && decoded.outer_ip))
-		return;
-
-	ipv4 = (decoded.outer_ip == IAVF_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == IAVF_RX_PTYPE_OUTER_IPV4);
-	ipv6 = (decoded.outer_ip == IAVF_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == IAVF_RX_PTYPE_OUTER_IPV6);
+	ipv4 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV4;
+	ipv6 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV6;
 
 	if (ipv4 &&
 	    (rx_error & (BIT(IAVF_RX_DESC_ERROR_IPE_SHIFT) |
@@ -1038,46 +1031,13 @@ static inline void iavf_rx_checksum(struct iavf_vsi *vsi,
 	if (rx_error & BIT(IAVF_RX_DESC_ERROR_PPRS_SHIFT))
 		return;
 
-	/* Only report checksum unnecessary for TCP, UDP, or SCTP */
-	switch (decoded.inner_prot) {
-	case IAVF_RX_PTYPE_INNER_PROT_TCP:
-	case IAVF_RX_PTYPE_INNER_PROT_UDP:
-	case IAVF_RX_PTYPE_INNER_PROT_SCTP:
-		skb->ip_summed = CHECKSUM_UNNECESSARY;
-		fallthrough;
-	default:
-		break;
-	}
-
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	return;
 
 checksum_fail:
 	vsi->back->hw_csum_rx_error++;
 }
 
-/**
- * iavf_ptype_to_htype - get a hash type
- * @ptype: the ptype value from the descriptor
- *
- * Returns a hash type to be used by skb_set_hash
- **/
-static inline int iavf_ptype_to_htype(u8 ptype)
-{
-	struct iavf_rx_ptype_decoded decoded = decode_rx_desc_ptype(ptype);
-
-	if (!decoded.known)
-		return PKT_HASH_TYPE_NONE;
-
-	if (decoded.outer_ip == IAVF_RX_PTYPE_OUTER_IP &&
-	    decoded.payload_layer == IAVF_RX_PTYPE_PAYLOAD_LAYER_PAY4)
-		return PKT_HASH_TYPE_L4;
-	else if (decoded.outer_ip == IAVF_RX_PTYPE_OUTER_IP &&
-		 decoded.payload_layer == IAVF_RX_PTYPE_PAYLOAD_LAYER_PAY3)
-		return PKT_HASH_TYPE_L3;
-	else
-		return PKT_HASH_TYPE_L2;
-}
-
 /**
  * iavf_rx_hash - set the hash value in the skb
  * @ring: descriptor ring
@@ -1090,17 +1050,19 @@ static inline void iavf_rx_hash(struct iavf_ring *ring,
 				struct sk_buff *skb,
 				u8 rx_ptype)
 {
+	struct libie_rx_ptype_parsed parsed;
 	u32 hash;
 	const __le64 rss_mask =
 		cpu_to_le64((u64)IAVF_RX_DESC_FLTSTAT_RSS_HASH <<
 			    IAVF_RX_DESC_STATUS_FLTSTAT_SHIFT);
 
-	if (!(ring->netdev->features & NETIF_F_RXHASH))
+	parsed = libie_parse_rx_ptype(rx_ptype);
+	if (!libie_has_rx_hash(ring->netdev, parsed))
 		return;
 
 	if ((rx_desc->wb.qword1.status_error_len & rss_mask) == rss_mask) {
 		hash = le32_to_cpu(rx_desc->wb.qword0.hi_dword.rss);
-		skb_set_hash(skb, hash, iavf_ptype_to_htype(rx_ptype));
+		libie_skb_set_hash(skb, hash, parsed);
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h
index 9f1f523807c4..3030ba330326 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_type.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_type.h
@@ -339,94 +339,6 @@ enum iavf_rx_desc_error_l3l4e_fcoe_masks {
 #define IAVF_RXD_QW1_PTYPE_SHIFT	30
 #define IAVF_RXD_QW1_PTYPE_MASK		(0xFFULL << IAVF_RXD_QW1_PTYPE_SHIFT)
 
-/* Packet type non-ip values */
-enum iavf_rx_l2_ptype {
-	IAVF_RX_PTYPE_L2_RESERVED			= 0,
-	IAVF_RX_PTYPE_L2_MAC_PAY2			= 1,
-	IAVF_RX_PTYPE_L2_TIMESYNC_PAY2			= 2,
-	IAVF_RX_PTYPE_L2_FIP_PAY2			= 3,
-	IAVF_RX_PTYPE_L2_OUI_PAY2			= 4,
-	IAVF_RX_PTYPE_L2_MACCNTRL_PAY2			= 5,
-	IAVF_RX_PTYPE_L2_LLDP_PAY2			= 6,
-	IAVF_RX_PTYPE_L2_ECP_PAY2			= 7,
-	IAVF_RX_PTYPE_L2_EVB_PAY2			= 8,
-	IAVF_RX_PTYPE_L2_QCN_PAY2			= 9,
-	IAVF_RX_PTYPE_L2_EAPOL_PAY2			= 10,
-	IAVF_RX_PTYPE_L2_ARP				= 11,
-	IAVF_RX_PTYPE_L2_FCOE_PAY3			= 12,
-	IAVF_RX_PTYPE_L2_FCOE_FCDATA_PAY3		= 13,
-	IAVF_RX_PTYPE_L2_FCOE_FCRDY_PAY3		= 14,
-	IAVF_RX_PTYPE_L2_FCOE_FCRSP_PAY3		= 15,
-	IAVF_RX_PTYPE_L2_FCOE_FCOTHER_PA		= 16,
-	IAVF_RX_PTYPE_L2_FCOE_VFT_PAY3			= 17,
-	IAVF_RX_PTYPE_L2_FCOE_VFT_FCDATA		= 18,
-	IAVF_RX_PTYPE_L2_FCOE_VFT_FCRDY			= 19,
-	IAVF_RX_PTYPE_L2_FCOE_VFT_FCRSP			= 20,
-	IAVF_RX_PTYPE_L2_FCOE_VFT_FCOTHER		= 21,
-	IAVF_RX_PTYPE_GRENAT4_MAC_PAY3			= 58,
-	IAVF_RX_PTYPE_GRENAT4_MACVLAN_IPV6_ICMP_PAY4	= 87,
-	IAVF_RX_PTYPE_GRENAT6_MAC_PAY3			= 124,
-	IAVF_RX_PTYPE_GRENAT6_MACVLAN_IPV6_ICMP_PAY4	= 153
-};
-
-struct iavf_rx_ptype_decoded {
-	u32 known:1;
-	u32 outer_ip:1;
-	u32 outer_ip_ver:1;
-	u32 outer_frag:1;
-	u32 tunnel_type:3;
-	u32 tunnel_end_prot:2;
-	u32 tunnel_end_frag:1;
-	u32 inner_prot:4;
-	u32 payload_layer:3;
-};
-
-enum iavf_rx_ptype_outer_ip {
-	IAVF_RX_PTYPE_OUTER_L2	= 0,
-	IAVF_RX_PTYPE_OUTER_IP	= 1
-};
-
-enum iavf_rx_ptype_outer_ip_ver {
-	IAVF_RX_PTYPE_OUTER_NONE	= 0,
-	IAVF_RX_PTYPE_OUTER_IPV4	= 0,
-	IAVF_RX_PTYPE_OUTER_IPV6	= 1
-};
-
-enum iavf_rx_ptype_outer_fragmented {
-	IAVF_RX_PTYPE_NOT_FRAG	= 0,
-	IAVF_RX_PTYPE_FRAG	= 1
-};
-
-enum iavf_rx_ptype_tunnel_type {
-	IAVF_RX_PTYPE_TUNNEL_NONE		= 0,
-	IAVF_RX_PTYPE_TUNNEL_IP_IP		= 1,
-	IAVF_RX_PTYPE_TUNNEL_IP_GRENAT		= 2,
-	IAVF_RX_PTYPE_TUNNEL_IP_GRENAT_MAC	= 3,
-	IAVF_RX_PTYPE_TUNNEL_IP_GRENAT_MAC_VLAN	= 4,
-};
-
-enum iavf_rx_ptype_tunnel_end_prot {
-	IAVF_RX_PTYPE_TUNNEL_END_NONE	= 0,
-	IAVF_RX_PTYPE_TUNNEL_END_IPV4	= 1,
-	IAVF_RX_PTYPE_TUNNEL_END_IPV6	= 2,
-};
-
-enum iavf_rx_ptype_inner_prot {
-	IAVF_RX_PTYPE_INNER_PROT_NONE		= 0,
-	IAVF_RX_PTYPE_INNER_PROT_UDP		= 1,
-	IAVF_RX_PTYPE_INNER_PROT_TCP		= 2,
-	IAVF_RX_PTYPE_INNER_PROT_SCTP		= 3,
-	IAVF_RX_PTYPE_INNER_PROT_ICMP		= 4,
-	IAVF_RX_PTYPE_INNER_PROT_TIMESYNC	= 5
-};
-
-enum iavf_rx_ptype_payload_layer {
-	IAVF_RX_PTYPE_PAYLOAD_LAYER_NONE	= 0,
-	IAVF_RX_PTYPE_PAYLOAD_LAYER_PAY2	= 1,
-	IAVF_RX_PTYPE_PAYLOAD_LAYER_PAY3	= 2,
-	IAVF_RX_PTYPE_PAYLOAD_LAYER_PAY4	= 3,
-};
-
 #define IAVF_RXD_QW1_LENGTH_PBUF_SHIFT	38
 #define IAVF_RXD_QW1_LENGTH_PBUF_MASK	(0x3FFFULL << \
 					 IAVF_RXD_QW1_LENGTH_PBUF_SHIFT)
diff --git a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
index 89f986a75cc8..611577ebc29d 100644
--- a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
+++ b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
@@ -160,64 +160,6 @@ struct ice_fltr_desc {
 				(0x1ULL << ICE_FXD_FLTR_WB_QW1_FAIL_PROF_S)
 #define ICE_FXD_FLTR_WB_QW1_FAIL_PROF_YES	0x1ULL
 
-struct ice_rx_ptype_decoded {
-	u32 known:1;
-	u32 outer_ip:1;
-	u32 outer_ip_ver:2;
-	u32 outer_frag:1;
-	u32 tunnel_type:3;
-	u32 tunnel_end_prot:2;
-	u32 tunnel_end_frag:1;
-	u32 inner_prot:4;
-	u32 payload_layer:3;
-};
-
-enum ice_rx_ptype_outer_ip {
-	ICE_RX_PTYPE_OUTER_L2	= 0,
-	ICE_RX_PTYPE_OUTER_IP	= 1,
-};
-
-enum ice_rx_ptype_outer_ip_ver {
-	ICE_RX_PTYPE_OUTER_NONE	= 0,
-	ICE_RX_PTYPE_OUTER_IPV4	= 1,
-	ICE_RX_PTYPE_OUTER_IPV6	= 2,
-};
-
-enum ice_rx_ptype_outer_fragmented {
-	ICE_RX_PTYPE_NOT_FRAG	= 0,
-	ICE_RX_PTYPE_FRAG	= 1,
-};
-
-enum ice_rx_ptype_tunnel_type {
-	ICE_RX_PTYPE_TUNNEL_NONE		= 0,
-	ICE_RX_PTYPE_TUNNEL_IP_IP		= 1,
-	ICE_RX_PTYPE_TUNNEL_IP_GRENAT		= 2,
-	ICE_RX_PTYPE_TUNNEL_IP_GRENAT_MAC	= 3,
-	ICE_RX_PTYPE_TUNNEL_IP_GRENAT_MAC_VLAN	= 4,
-};
-
-enum ice_rx_ptype_tunnel_end_prot {
-	ICE_RX_PTYPE_TUNNEL_END_NONE	= 0,
-	ICE_RX_PTYPE_TUNNEL_END_IPV4	= 1,
-	ICE_RX_PTYPE_TUNNEL_END_IPV6	= 2,
-};
-
-enum ice_rx_ptype_inner_prot {
-	ICE_RX_PTYPE_INNER_PROT_NONE		= 0,
-	ICE_RX_PTYPE_INNER_PROT_UDP		= 1,
-	ICE_RX_PTYPE_INNER_PROT_TCP		= 2,
-	ICE_RX_PTYPE_INNER_PROT_SCTP		= 3,
-	ICE_RX_PTYPE_INNER_PROT_ICMP		= 4,
-	ICE_RX_PTYPE_INNER_PROT_TIMESYNC	= 5,
-};
-
-enum ice_rx_ptype_payload_layer {
-	ICE_RX_PTYPE_PAYLOAD_LAYER_NONE	= 0,
-	ICE_RX_PTYPE_PAYLOAD_LAYER_PAY2	= 1,
-	ICE_RX_PTYPE_PAYLOAD_LAYER_PAY3	= 2,
-	ICE_RX_PTYPE_PAYLOAD_LAYER_PAY4	= 3,
-};
-
 /* Rx Flex Descriptor
  * This descriptor is used instead of the legacy version descriptor when
  * ice_rlan_ctx.adv_desc is set
@@ -651,262 +593,4 @@ struct ice_tlan_ctx {
 	u8 int_q_state;	/* width not needed - internal - DO NOT WRITE!!! */
 };
 
-/* The ice_ptype_lkup table is used to convert from the 10-bit ptype in the
- * hardware to a bit-field that can be used by SW to more easily determine the
- * packet type.
- *
- * Macros are used to shorten the table lines and make this table human
- * readable.
- *
- * We store the PTYPE in the top byte of the bit field - this is just so that
- * we can check that the table doesn't have a row missing, as the index into
- * the table should be the PTYPE.
- *
- * Typical work flow:
- *
- * IF NOT ice_ptype_lkup[ptype].known
- * THEN
- *      Packet is unknown
- * ELSE IF ice_ptype_lkup[ptype].outer_ip == ICE_RX_PTYPE_OUTER_IP
- *      Use the rest of the fields to look at the tunnels, inner protocols, etc
- * ELSE
- *      Use the enum ice_rx_l2_ptype to decode the packet type
- * ENDIF
- */
-
-/* macro to make the table lines short, use explicit indexing with [PTYPE] */
-#define ICE_PTT(PTYPE, OUTER_IP, OUTER_IP_VER, OUTER_FRAG, T, TE, TEF, I, PL)\
-	[PTYPE] = { \
-		1, \
-		ICE_RX_PTYPE_OUTER_##OUTER_IP, \
-		ICE_RX_PTYPE_OUTER_##OUTER_IP_VER, \
-		ICE_RX_PTYPE_##OUTER_FRAG, \
-		ICE_RX_PTYPE_TUNNEL_##T, \
-		ICE_RX_PTYPE_TUNNEL_END_##TE, \
-		ICE_RX_PTYPE_##TEF, \
-		ICE_RX_PTYPE_INNER_PROT_##I, \
-		ICE_RX_PTYPE_PAYLOAD_LAYER_##PL }
-
-#define ICE_PTT_UNUSED_ENTRY(PTYPE) [PTYPE] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-
-/* shorter macros makes the table fit but are terse */
-#define ICE_RX_PTYPE_NOF		ICE_RX_PTYPE_NOT_FRAG
-#define ICE_RX_PTYPE_FRG		ICE_RX_PTYPE_FRAG
-
-/* Lookup table mapping in the 10-bit HW PTYPE to the bit field for decoding */
-static const struct ice_rx_ptype_decoded ice_ptype_lkup[BIT(10)] = {
-	/* L2 Packet types */
-	ICE_PTT_UNUSED_ENTRY(0),
-	ICE_PTT(1, L2, NONE, NOF, NONE, NONE, NOF, NONE, PAY2),
-	ICE_PTT_UNUSED_ENTRY(2),
-	ICE_PTT_UNUSED_ENTRY(3),
-	ICE_PTT_UNUSED_ENTRY(4),
-	ICE_PTT_UNUSED_ENTRY(5),
-	ICE_PTT(6, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(7, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(8),
-	ICE_PTT_UNUSED_ENTRY(9),
-	ICE_PTT(10, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT(11, L2, NONE, NOF, NONE, NONE, NOF, NONE, NONE),
-	ICE_PTT_UNUSED_ENTRY(12),
-	ICE_PTT_UNUSED_ENTRY(13),
-	ICE_PTT_UNUSED_ENTRY(14),
-	ICE_PTT_UNUSED_ENTRY(15),
-	ICE_PTT_UNUSED_ENTRY(16),
-	ICE_PTT_UNUSED_ENTRY(17),
-	ICE_PTT_UNUSED_ENTRY(18),
-	ICE_PTT_UNUSED_ENTRY(19),
-	ICE_PTT_UNUSED_ENTRY(20),
-	ICE_PTT_UNUSED_ENTRY(21),
-
-	/* Non Tunneled IPv4 */
-	ICE_PTT(22, IP, IPV4, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(23, IP, IPV4, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(24, IP, IPV4, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(25),
-	ICE_PTT(26, IP, IPV4, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(27, IP, IPV4, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(28, IP, IPV4, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv4 */
-	ICE_PTT(29, IP, IPV4, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(30, IP, IPV4, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(31, IP, IPV4, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(32),
-	ICE_PTT(33, IP, IPV4, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(34, IP, IPV4, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(35, IP, IPV4, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> IPv6 */
-	ICE_PTT(36, IP, IPV4, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(37, IP, IPV4, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(38, IP, IPV4, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(39),
-	ICE_PTT(40, IP, IPV4, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(41, IP, IPV4, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(42, IP, IPV4, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT */
-	ICE_PTT(43, IP, IPV4, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> IPv4 */
-	ICE_PTT(44, IP, IPV4, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(45, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(46, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(47),
-	ICE_PTT(48, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(49, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(50, IP, IPV4, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> IPv6 */
-	ICE_PTT(51, IP, IPV4, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(52, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(53, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(54),
-	ICE_PTT(55, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(56, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(57, IP, IPV4, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC */
-	ICE_PTT(58, IP, IPV4, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 --> GRE/NAT --> MAC --> IPv4 */
-	ICE_PTT(59, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(60, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(61, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(62),
-	ICE_PTT(63, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(64, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(65, IP, IPV4, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT -> MAC --> IPv6 */
-	ICE_PTT(66, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(67, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(68, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(69),
-	ICE_PTT(70, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(71, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(72, IP, IPV4, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv4 --> GRE/NAT --> MAC/VLAN */
-	ICE_PTT(73, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv4 ---> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(74, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(75, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(76, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(77),
-	ICE_PTT(78, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(79, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(80, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv4 -> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(81, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(82, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(83, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(84),
-	ICE_PTT(85, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(86, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(87, IP, IPV4, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* Non Tunneled IPv6 */
-	ICE_PTT(88, IP, IPV6, FRG, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(89, IP, IPV6, NOF, NONE, NONE, NOF, NONE, PAY3),
-	ICE_PTT(90, IP, IPV6, NOF, NONE, NONE, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(91),
-	ICE_PTT(92, IP, IPV6, NOF, NONE, NONE, NOF, TCP,  PAY4),
-	ICE_PTT(93, IP, IPV6, NOF, NONE, NONE, NOF, SCTP, PAY4),
-	ICE_PTT(94, IP, IPV6, NOF, NONE, NONE, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv4 */
-	ICE_PTT(95, IP, IPV6, NOF, IP_IP, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(96, IP, IPV6, NOF, IP_IP, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(97, IP, IPV6, NOF, IP_IP, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(98),
-	ICE_PTT(99, IP, IPV6, NOF, IP_IP, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(100, IP, IPV6, NOF, IP_IP, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(101, IP, IPV6, NOF, IP_IP, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> IPv6 */
-	ICE_PTT(102, IP, IPV6, NOF, IP_IP, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(103, IP, IPV6, NOF, IP_IP, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(104, IP, IPV6, NOF, IP_IP, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(105),
-	ICE_PTT(106, IP, IPV6, NOF, IP_IP, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(107, IP, IPV6, NOF, IP_IP, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(108, IP, IPV6, NOF, IP_IP, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT */
-	ICE_PTT(109, IP, IPV6, NOF, IP_GRENAT, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> IPv4 */
-	ICE_PTT(110, IP, IPV6, NOF, IP_GRENAT, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(111, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(112, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(113),
-	ICE_PTT(114, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(115, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(116, IP, IPV6, NOF, IP_GRENAT, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> IPv6 */
-	ICE_PTT(117, IP, IPV6, NOF, IP_GRENAT, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(118, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(119, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(120),
-	ICE_PTT(121, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(122, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(123, IP, IPV6, NOF, IP_GRENAT, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC */
-	ICE_PTT(124, IP, IPV6, NOF, IP_GRENAT_MAC, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv4 */
-	ICE_PTT(125, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(126, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(127, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(128),
-	ICE_PTT(129, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(130, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(131, IP, IPV6, NOF, IP_GRENAT_MAC, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC -> IPv6 */
-	ICE_PTT(132, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(133, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(134, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(135),
-	ICE_PTT(136, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(137, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(138, IP, IPV6, NOF, IP_GRENAT_MAC, IPV6, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN */
-	ICE_PTT(139, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, NONE, NOF, NONE, PAY3),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv4 */
-	ICE_PTT(140, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, FRG, NONE, PAY3),
-	ICE_PTT(141, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, NONE, PAY3),
-	ICE_PTT(142, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(143),
-	ICE_PTT(144, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, TCP,  PAY4),
-	ICE_PTT(145, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, SCTP, PAY4),
-	ICE_PTT(146, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV4, NOF, ICMP, PAY4),
-
-	/* IPv6 --> GRE/NAT -> MAC/VLAN --> IPv6 */
-	ICE_PTT(147, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, FRG, NONE, PAY3),
-	ICE_PTT(148, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, NONE, PAY3),
-	ICE_PTT(149, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, UDP,  PAY4),
-	ICE_PTT_UNUSED_ENTRY(150),
-	ICE_PTT(151, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, TCP,  PAY4),
-	ICE_PTT(152, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, SCTP, PAY4),
-	ICE_PTT(153, IP, IPV6, NOF, IP_GRENAT_MAC_VLAN, IPV6, NOF, ICMP, PAY4),
-
-	/* unused entries */
-	[154 ... 1023] = { 0, 0, 0, 0, 0, 0, 0, 0, 0 }
-};
-
-static inline struct ice_rx_ptype_decoded ice_decode_rx_desc_ptype(u16 ptype)
-{
-	return ice_ptype_lkup[ptype];
-}
-
-
 #endif /* _ICE_LAN_TX_RX_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 93979ab18bc1..eb116c8dc6ad 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -34,6 +34,7 @@ static const char ice_copyright[] = "Copyright (c) 2018, Intel Corporation.";
 
 MODULE_AUTHOR("Intel Corporation, <linux.nics@intel.com>");
 MODULE_DESCRIPTION(DRV_SUMMARY);
+MODULE_IMPORT_NS(LIBIE);
 MODULE_LICENSE("GPL v2");
 MODULE_FIRMWARE(ICE_DDP_PKG_FILE);
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index c8322fb6f2b3..7543aba4ff9f 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2019, Intel Corporation. */
 
 #include <linux/filter.h>
+#include <linux/net/intel/libie/rx.h>
 
 #include "ice_txrx_lib.h"
 #include "ice_eswitch.h"
@@ -38,30 +39,6 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val)
 	}
 }
 
-/**
- * ice_ptype_to_htype - get a hash type
- * @ptype: the ptype value from the descriptor
- *
- * Returns appropriate hash type (such as PKT_HASH_TYPE_L2/L3/L4) to be used by
- * skb_set_hash based on PTYPE as parsed by HW Rx pipeline and is part of
- * Rx desc.
- */
-static enum pkt_hash_types ice_ptype_to_htype(u16 ptype)
-{
-	struct ice_rx_ptype_decoded decoded = ice_decode_rx_desc_ptype(ptype);
-
-	if (!decoded.known)
-		return PKT_HASH_TYPE_NONE;
-	if (decoded.payload_layer == ICE_RX_PTYPE_PAYLOAD_LAYER_PAY4)
-		return PKT_HASH_TYPE_L4;
-	if (decoded.payload_layer == ICE_RX_PTYPE_PAYLOAD_LAYER_PAY3)
-		return PKT_HASH_TYPE_L3;
-	if (decoded.outer_ip == ICE_RX_PTYPE_OUTER_L2)
-		return PKT_HASH_TYPE_L2;
-
-	return PKT_HASH_TYPE_NONE;
-}
-
 /**
  * ice_rx_hash - set the hash value in the skb
  * @rx_ring: descriptor ring
@@ -74,9 +51,11 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
 	    struct sk_buff *skb, u16 rx_ptype)
 {
 	struct ice_32b_rx_flex_desc_nic *nic_mdid;
+	struct libie_rx_ptype_parsed parsed;
 	u32 hash;
 
-	if (!(rx_ring->netdev->features & NETIF_F_RXHASH))
+	parsed = libie_parse_rx_ptype(rx_ptype);
+	if (!libie_has_rx_hash(rx_ring->netdev, parsed))
 		return;
 
 	if (rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC)
@@ -84,7 +63,7 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
 
 	nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
 	hash = le32_to_cpu(nic_mdid->rss_hash);
-	skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
+	libie_skb_set_hash(skb, hash, parsed);
 }
 
 /**
@@ -92,7 +71,7 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
  * @ring: the ring we care about
  * @skb: skb currently being received and modified
  * @rx_desc: the receive descriptor
- * @ptype: the packet type decoded by hardware
+ * @ptype: the packet type parsed by hardware
  *
  * skb->protocol must be set before this function is called
  */
@@ -100,34 +79,26 @@ static void
 ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 	    union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
 {
-	struct ice_rx_ptype_decoded decoded;
+	struct libie_rx_ptype_parsed parsed;
 	u16 rx_status0, rx_status1;
 	bool ipv4, ipv6;
 
-	rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
-	rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);
-
-	decoded = ice_decode_rx_desc_ptype(ptype);
-
 	/* Start with CHECKSUM_NONE and by default csum_level = 0 */
 	skb->ip_summed = CHECKSUM_NONE;
-	skb_checksum_none_assert(skb);
 
-	/* check if Rx checksum is enabled */
-	if (!(ring->netdev->features & NETIF_F_RXCSUM))
+	parsed = libie_parse_rx_ptype(ptype);
+	if (!libie_has_rx_checksum(ring->netdev, parsed))
 		return;
 
-	/* check if HW has decoded the packet and checksum */
-	if (!(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_L3L4P_S)))
-		return;
+	rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
+	rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);
 
-	if (!(decoded.known && decoded.outer_ip))
+	/* check if HW has parsed the packet and checksum */
+	if (!(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_L3L4P_S)))
 		return;
 
-	ipv4 = (decoded.outer_ip == ICE_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == ICE_RX_PTYPE_OUTER_IPV4);
-	ipv6 = (decoded.outer_ip == ICE_RX_PTYPE_OUTER_IP) &&
-	       (decoded.outer_ip_ver == ICE_RX_PTYPE_OUTER_IPV6);
+	ipv4 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV4;
+	ipv6 = parsed.outer_ip == LIBIE_RX_PTYPE_OUTER_IPV6;
 
 	if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_IPE_S) |
 				   BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S))))
@@ -151,19 +122,10 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
 	 * we need to bump the checksum level by 1 to reflect the fact that
 	 * we are indicating we validated the inner checksum.
 	 */
-	if (decoded.tunnel_type >= ICE_RX_PTYPE_TUNNEL_IP_GRENAT)
+	if (parsed.tunnel_type >= LIBIE_RX_PTYPE_TUNNEL_IP_GRENAT)
 		skb->csum_level = 1;
 
-	/* Only report checksum unnecessary for TCP, UDP, or SCTP */
-	switch (decoded.inner_prot) {
-	case ICE_RX_PTYPE_INNER_PROT_TCP:
-	case ICE_RX_PTYPE_INNER_PROT_UDP:
-	case ICE_RX_PTYPE_INNER_PROT_SCTP:
-		skb->ip_summed = CHECKSUM_UNNECESSARY;
-		break;
-	default:
-		break;
-	}
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	return;
 
 checksum_fail:
@@ -175,7 +137,7 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
  * @rx_ring: Rx descriptor ring packet is being transacted on
  * @rx_desc: pointer to the EOP Rx descriptor
  * @skb: pointer to current skb being populated
- * @ptype: the packet type decoded by hardware
+ * @ptype: the packet type parsed by hardware
  *
  * This function checks the ring, descriptor, and packet information in
  * order to populate the hash, checksum, VLAN, protocol, and
diff --git a/drivers/net/ethernet/intel/libie/Makefile b/drivers/net/ethernet/intel/libie/Makefile
new file mode 100644
index 000000000000..95e81d09b474
--- /dev/null
+++ b/drivers/net/ethernet/intel/libie/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# Copyright(c) 2023 Intel Corporation.
+
+obj-$(CONFIG_LIBIE)	+= libie.o
+
+libie-objs		+= rx.o
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
new file mode 100644
index 000000000000..f503476d8eef
--- /dev/null
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2023 Intel Corporation. */
+
+#include <linux/net/intel/libie/rx.h>
+
+/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
+ * bitfield struct.
+ */
+
+#define LIBIE_RX_PTYPE(oip, ofrag, tun, tp, tefr, iprot, pl) {		   \
+		.outer_ip		= LIBIE_RX_PTYPE_OUTER_##oip,	   \
+		.outer_frag		= LIBIE_RX_PTYPE_##ofrag,	   \
+		.tunnel_type		= LIBIE_RX_PTYPE_TUNNEL_IP_##tun,  \
+		.tunnel_end_prot	= LIBIE_RX_PTYPE_TUNNEL_END_##tp,  \
+		.tunnel_end_frag	= LIBIE_RX_PTYPE_##tefr,	   \
+		.inner_prot		= LIBIE_RX_PTYPE_INNER_##iprot,	   \
+		.payload_layer		= LIBIE_RX_PTYPE_PAYLOAD_##pl,	   \
+	}
+
+#define LIBIE_RX_PTYPE_UNUSED		{ }
+
+#define __LIBIE_RX_PTYPE_L2(iprot, pl)					   \
+	LIBIE_RX_PTYPE(L2, NOT_FRAG, NONE, NONE, NOT_FRAG, iprot, pl)
+#define LIBIE_RX_PTYPE_L2		__LIBIE_RX_PTYPE_L2(NONE, L2)
+#define LIBIE_RX_PTYPE_TS		__LIBIE_RX_PTYPE_L2(TIMESYNC, L2)
+#define LIBIE_RX_PTYPE_L3		__LIBIE_RX_PTYPE_L2(NONE, L3)
+
+#define LIBIE_RX_PTYPE_IP_FRAG(oip)					   \
+	LIBIE_RX_PTYPE(IPV##oip, FRAG, NONE, NONE, NOT_FRAG, NONE, L3)
+#define LIBIE_RX_PTYPE_IP_L3(oip, tun, teprot, tefr)			   \
+	LIBIE_RX_PTYPE(IPV##oip, NOT_FRAG, tun, teprot, tefr, NONE, L3)
+#define LIBIE_RX_PTYPE_IP_L4(oip, tun, teprot, iprot)			   \
+	LIBIE_RX_PTYPE(IPV##oip, NOT_FRAG, tun, teprot, NOT_FRAG, iprot, L4)
+
+#define LIBIE_RX_PTYPE_IP_NOF(oip, tun, ver)				   \
+	LIBIE_RX_PTYPE_IP_L3(oip, tun, ver, NOT_FRAG),			   \
+	LIBIE_RX_PTYPE_IP_L4(oip, tun, ver, UDP),			   \
+	LIBIE_RX_PTYPE_UNUSED,						   \
+	LIBIE_RX_PTYPE_IP_L4(oip, tun, ver, TCP),			   \
+	LIBIE_RX_PTYPE_IP_L4(oip, tun, ver, SCTP),			   \
+	LIBIE_RX_PTYPE_IP_L4(oip, tun, ver, ICMP)
+
+/* IPv oip --> tun --> IPv ver */
+#define LIBIE_RX_PTYPE_IP_TUN_VER(oip, tun, ver)			   \
+	LIBIE_RX_PTYPE_IP_L3(oip, tun, ver, FRAG),			   \
+	LIBIE_RX_PTYPE_IP_NOF(oip, tun, ver)
+
+/* Non Tunneled IPv oip */
+#define LIBIE_RX_PTYPE_IP_RAW(oip)					   \
+	LIBIE_RX_PTYPE_IP_FRAG(oip),					   \
+	LIBIE_RX_PTYPE_IP_NOF(oip, NONE, NONE)
+
+/* IPv oip --> tun --> { IPv4, IPv6 } */
+#define LIBIE_RX_PTYPE_IP_TUN(oip, tun)					   \
+	LIBIE_RX_PTYPE_IP_TUN_VER(oip, tun, IPV4),			   \
+	LIBIE_RX_PTYPE_IP_TUN_VER(oip, tun, IPV6)
+
+/* IPv oip --> GRE/NAT tun --> { x, IPv4, IPv6 } */
+#define LIBIE_RX_PTYPE_IP_GRE(oip, tun)					   \
+	LIBIE_RX_PTYPE_IP_L3(oip, tun, NONE, NOT_FRAG),			   \
+	LIBIE_RX_PTYPE_IP_TUN(oip, tun)
+
+/* Non Tunneled IPv oip
+ * IPv oip --> { IPv4, IPv6 }
+ * IPv oip --> GRE/NAT --> { x, IPv4, IPv6 }
+ * IPv oip --> GRE/NAT --> MAC --> { x, IPv4, IPv6 }
+ * IPv oip --> GRE/NAT --> MAC/VLAN --> { x, IPv4, IPv6 }
+ */
+#define LIBIE_RX_PTYPE_IP(oip)						   \
+	LIBIE_RX_PTYPE_IP_RAW(oip),					   \
+	LIBIE_RX_PTYPE_IP_TUN(oip, IP),					   \
+	LIBIE_RX_PTYPE_IP_GRE(oip, GRENAT),				   \
+	LIBIE_RX_PTYPE_IP_GRE(oip, GRENAT_MAC),				   \
+	LIBIE_RX_PTYPE_IP_GRE(oip, GRENAT_MAC_VLAN)
+
+/* Lookup table mapping for O(1) parsing */
+const struct libie_rx_ptype_parsed libie_rx_ptype_lut[LIBIE_RX_PTYPE_NUM] = {
+	/* L2 packet types */
+	LIBIE_RX_PTYPE_UNUSED,
+	LIBIE_RX_PTYPE_L2,
+	LIBIE_RX_PTYPE_TS,
+	LIBIE_RX_PTYPE_L2,
+	LIBIE_RX_PTYPE_UNUSED,
+	LIBIE_RX_PTYPE_UNUSED,
+	LIBIE_RX_PTYPE_L2,
+	LIBIE_RX_PTYPE_L2,
+	LIBIE_RX_PTYPE_UNUSED,
+	LIBIE_RX_PTYPE_UNUSED,
+	LIBIE_RX_PTYPE_L2,
+	LIBIE_RX_PTYPE_UNUSED,
+
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+	LIBIE_RX_PTYPE_L3,
+
+	LIBIE_RX_PTYPE_IP(4),
+	LIBIE_RX_PTYPE_IP(6),
+};
+EXPORT_SYMBOL_NS_GPL(libie_rx_ptype_lut, LIBIE);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("Intel(R) Ethernet common library");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
new file mode 100644
index 000000000000..58bd0f35d025
--- /dev/null
+++ b/include/linux/net/intel/libie/rx.h
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2023 Intel Corporation. */
+
+#ifndef __LIBIE_RX_H
+#define __LIBIE_RX_H
+
+#include <linux/netdevice.h>
+
+/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
+ * bitfield struct.
+ */
+
+struct libie_rx_ptype_parsed {
+	u16	outer_ip:2;
+	u16	outer_frag:1;
+	u16	tunnel_type:3;
+	u16	tunnel_end_prot:2;
+	u16	tunnel_end_frag:1;
+	u16	inner_prot:3;
+	u16	payload_layer:2;
+};
+
+enum libie_rx_ptype_outer_ip {
+	LIBIE_RX_PTYPE_OUTER_L2				= 0U,
+	LIBIE_RX_PTYPE_OUTER_IPV4,
+	LIBIE_RX_PTYPE_OUTER_IPV6,
+};
+
+enum libie_rx_ptype_outer_fragmented {
+	LIBIE_RX_PTYPE_NOT_FRAG				= 0U,
+	LIBIE_RX_PTYPE_FRAG,
+};
+
+enum libie_rx_ptype_tunnel_type {
+	LIBIE_RX_PTYPE_TUNNEL_IP_NONE			= 0U,
+	LIBIE_RX_PTYPE_TUNNEL_IP_IP,
+	LIBIE_RX_PTYPE_TUNNEL_IP_GRENAT,
+	LIBIE_RX_PTYPE_TUNNEL_IP_GRENAT_MAC,
+	LIBIE_RX_PTYPE_TUNNEL_IP_GRENAT_MAC_VLAN,
+};
+
+enum libie_rx_ptype_tunnel_end_prot {
+	LIBIE_RX_PTYPE_TUNNEL_END_NONE			= 0U,
+	LIBIE_RX_PTYPE_TUNNEL_END_IPV4,
+	LIBIE_RX_PTYPE_TUNNEL_END_IPV6,
+};
+
+enum libie_rx_ptype_inner_prot {
+	LIBIE_RX_PTYPE_INNER_NONE			= 0U,
+	LIBIE_RX_PTYPE_INNER_UDP,
+	LIBIE_RX_PTYPE_INNER_TCP,
+	LIBIE_RX_PTYPE_INNER_SCTP,
+	LIBIE_RX_PTYPE_INNER_ICMP,
+	LIBIE_RX_PTYPE_INNER_TIMESYNC,
+};
+
+enum libie_rx_ptype_payload_layer {
+	LIBIE_RX_PTYPE_PAYLOAD_NONE			= PKT_HASH_TYPE_NONE,
+	LIBIE_RX_PTYPE_PAYLOAD_L2			= PKT_HASH_TYPE_L2,
+	LIBIE_RX_PTYPE_PAYLOAD_L3			= PKT_HASH_TYPE_L3,
+	LIBIE_RX_PTYPE_PAYLOAD_L4			= PKT_HASH_TYPE_L4,
+};
+
+#define LIBIE_RX_PTYPE_NUM				154
+
+extern const struct libie_rx_ptype_parsed
+libie_rx_ptype_lut[LIBIE_RX_PTYPE_NUM];
+
+/**
+ * libie_parse_rx_ptype - convert HW packet type to software bitfield structure
+ * @ptype: 10-bit hardware packet type value from the descriptor
+ *
+ * @libie_rx_ptype_lut must be accessed only using this wrapper.
+ *
+ * Returns the parsed bitfield struct corresponding to the provided ptype.
+ */
+static inline struct libie_rx_ptype_parsed libie_parse_rx_ptype(u32 ptype)
+{
+	if (unlikely(ptype >= LIBIE_RX_PTYPE_NUM))
+		ptype = 0;
+
+	return libie_rx_ptype_lut[ptype];
+}
+
+/* libie_has_*() can be used to quickly check whether the HW metadata is
+ * available to avoid further expensive processing such as descriptor reads.
+ * They already check for the corresponding netdev feature to be enabled,
+ * thus can be used as drop-in replacements.
+ */
+
+static inline bool libie_has_rx_checksum(const struct net_device *dev,
+					 struct libie_rx_ptype_parsed parsed)
+{
+	/* _INNER_{SCTP,TCP,UDP} are possible only when _OUTER_IPV* is set,
+	 * it is enough to check only for the L4 type.
+	 */
+	switch (parsed.inner_prot) {
+	case LIBIE_RX_PTYPE_INNER_TCP:
+	case LIBIE_RX_PTYPE_INNER_UDP:
+	case LIBIE_RX_PTYPE_INNER_SCTP:
+		return dev->features & NETIF_F_RXCSUM;
+	default:
+		return false;
+	}
+}
+
+static inline bool libie_has_rx_hash(const struct net_device *dev,
+				     struct libie_rx_ptype_parsed parsed)
+{
+	if (parsed.payload_layer < LIBIE_RX_PTYPE_PAYLOAD_L2)
+		return false;
+
+	return dev->features & NETIF_F_RXHASH;
+}
+
+/**
+ * libie_skb_set_hash - fill in skb hash value basing on the parsed ptype
+ * @skb: skb to fill the hash in
+ * @hash: 32-bit hash value from the descriptor
+ * @parsed: parsed packet type
+ */
+static inline void libie_skb_set_hash(struct sk_buff *skb, u32 hash,
+				      struct libie_rx_ptype_parsed parsed)
+{
+	skb_set_hash(skb, hash, parsed.payload_layer);
+}
+
+#endif /* __LIBIE_RX_H */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-14 14:17   ` [Intel-wired-lan] " Przemek Kitszel
  2023-07-05 15:55 ` [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling Alexander Lobakin
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Ever since build_skb() became stable, the old way of allocating an skb
to store the headers separately, which are then copied manually, has
been slower, less flexible and thus obsolete.

* it puts higher pressure on MM since it actually allocates new pages,
  which then get split and refcount-biased (NAPI page cache);
* it implies a memcpy() of the packet headers (40+ bytes per frame);
* the actual header length is calculated via eth_get_headlen(), which
  invokes Flow Dissector and thus wastes a bunch of CPU cycles;
* XDP makes it even more awkward, since it has required headroom for a
  long time and, since multi-buffer landed, tailroom as well. Take a
  look at the ice driver, which is built around workarounds to make
  XDP work with it.

Even on some quite low-end hardware (not a common case for 100G NICs),
it performed worse.
The only advantage "legacy-rx" had is that it didn't require any
reserved headroom or tailroom. But iavf didn't benefit from this, as it
always splits pages into two 2k halves, and that saving would only
matter when striding. And again, XDP effectively removes that sole pro.
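
For reference, a rough sketch of the two Rx paths being compared here
(napi, va, size and truesize below are placeholders for what the
completion loop already has at hand, not the actual driver variables):

    /* legacy-rx (removed): allocate an skb, then copy the headers */
    skb = napi_alloc_skb(napi, IAVF_RX_HDR_SIZE);
    headlen = size;
    if (headlen > IAVF_RX_HDR_SIZE)
        headlen = eth_get_headlen(skb->dev, va, IAVF_RX_HDR_SIZE);
    memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
    /* the rest of the buffer is then attached as a page frag */

    /* build_skb(): wrap the existing buffer, no copy at all; only
     * needs headroom and tailroom reserved in the buffer upfront
     */
    skb = build_skb(va - IAVF_SKB_PAD, truesize);
    skb_reserve(skb, IAVF_SKB_PAD);
    __skb_put(skb, size);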

There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
multi-buffer etc. Each new one would require adding more and more Danse
Macabre for absolutely no reason, while making the hotpath less and
less efficient.
Remove the "feature" with all the related code. This includes at least
one very hot branch (typically hit on each new frame), which was either
always-true or always-false at least for a complete NAPI bulk of 64
frames, the whole private flags cruft, and so on. Some stats:

Function: add/remove: 0/4 grow/shrink: 0/7 up/down: 0/-757 (-757)
RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf.h        |   2 +-
 .../net/ethernet/intel/iavf/iavf_ethtool.c    | 140 ------------------
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  10 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  84 +----------
 drivers/net/ethernet/intel/iavf/iavf_txrx.h   |  27 +---
 .../net/ethernet/intel/iavf/iavf_virtchnl.c   |   3 +-
 6 files changed, 8 insertions(+), 258 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index f80f2735e688..71c5d9b18692 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -298,7 +298,7 @@ struct iavf_adapter {
 #define IAVF_FLAG_CLIENT_NEEDS_L2_PARAMS	BIT(12)
 #define IAVF_FLAG_PROMISC_ON			BIT(13)
 #define IAVF_FLAG_ALLMULTI_ON			BIT(14)
-#define IAVF_FLAG_LEGACY_RX			BIT(15)
+/* BIT(15) is free, was IAVF_FLAG_LEGACY_RX */
 #define IAVF_FLAG_REINIT_ITR_NEEDED		BIT(16)
 #define IAVF_FLAG_QUEUES_DISABLED		BIT(17)
 #define IAVF_FLAG_SETUP_NETDEV_FEATURES		BIT(18)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index 6f171d1d85b7..de3050c02b6f 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -239,29 +239,6 @@ static const struct iavf_stats iavf_gstrings_stats[] = {
 
 #define IAVF_QUEUE_STATS_LEN	ARRAY_SIZE(iavf_gstrings_queue_stats)
 
-/* For now we have one and only one private flag and it is only defined
- * when we have support for the SKIP_CPU_SYNC DMA attribute.  Instead
- * of leaving all this code sitting around empty we will strip it unless
- * our one private flag is actually available.
- */
-struct iavf_priv_flags {
-	char flag_string[ETH_GSTRING_LEN];
-	u32 flag;
-	bool read_only;
-};
-
-#define IAVF_PRIV_FLAG(_name, _flag, _read_only) { \
-	.flag_string = _name, \
-	.flag = _flag, \
-	.read_only = _read_only, \
-}
-
-static const struct iavf_priv_flags iavf_gstrings_priv_flags[] = {
-	IAVF_PRIV_FLAG("legacy-rx", IAVF_FLAG_LEGACY_RX, 0),
-};
-
-#define IAVF_PRIV_FLAGS_STR_LEN ARRAY_SIZE(iavf_gstrings_priv_flags)
-
 /**
  * iavf_get_link_ksettings - Get Link Speed and Duplex settings
  * @netdev: network interface device structure
@@ -341,8 +318,6 @@ static int iavf_get_sset_count(struct net_device *netdev, int sset)
 		return IAVF_STATS_LEN +
 			(IAVF_QUEUE_STATS_LEN * 2 *
 			 netdev->real_num_tx_queues);
-	else if (sset == ETH_SS_PRIV_FLAGS)
-		return IAVF_PRIV_FLAGS_STR_LEN;
 	else
 		return -EINVAL;
 }
@@ -384,24 +359,6 @@ static void iavf_get_ethtool_stats(struct net_device *netdev,
 	rcu_read_unlock();
 }
 
-/**
- * iavf_get_priv_flag_strings - Get private flag strings
- * @netdev: network interface device structure
- * @data: buffer for string data
- *
- * Builds the private flags string table
- **/
-static void iavf_get_priv_flag_strings(struct net_device *netdev, u8 *data)
-{
-	unsigned int i;
-
-	for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
-		snprintf(data, ETH_GSTRING_LEN, "%s",
-			 iavf_gstrings_priv_flags[i].flag_string);
-		data += ETH_GSTRING_LEN;
-	}
-}
-
 /**
  * iavf_get_stat_strings - Get stat strings
  * @netdev: network interface device structure
@@ -440,105 +397,11 @@ static void iavf_get_strings(struct net_device *netdev, u32 sset, u8 *data)
 	case ETH_SS_STATS:
 		iavf_get_stat_strings(netdev, data);
 		break;
-	case ETH_SS_PRIV_FLAGS:
-		iavf_get_priv_flag_strings(netdev, data);
-		break;
 	default:
 		break;
 	}
 }
 
-/**
- * iavf_get_priv_flags - report device private flags
- * @netdev: network interface device structure
- *
- * The get string set count and the string set should be matched for each
- * flag returned.  Add new strings for each flag to the iavf_gstrings_priv_flags
- * array.
- *
- * Returns a u32 bitmap of flags.
- **/
-static u32 iavf_get_priv_flags(struct net_device *netdev)
-{
-	struct iavf_adapter *adapter = netdev_priv(netdev);
-	u32 i, ret_flags = 0;
-
-	for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
-		const struct iavf_priv_flags *priv_flags;
-
-		priv_flags = &iavf_gstrings_priv_flags[i];
-
-		if (priv_flags->flag & adapter->flags)
-			ret_flags |= BIT(i);
-	}
-
-	return ret_flags;
-}
-
-/**
- * iavf_set_priv_flags - set private flags
- * @netdev: network interface device structure
- * @flags: bit flags to be set
- **/
-static int iavf_set_priv_flags(struct net_device *netdev, u32 flags)
-{
-	struct iavf_adapter *adapter = netdev_priv(netdev);
-	u32 orig_flags, new_flags, changed_flags;
-	u32 i;
-
-	orig_flags = READ_ONCE(adapter->flags);
-	new_flags = orig_flags;
-
-	for (i = 0; i < IAVF_PRIV_FLAGS_STR_LEN; i++) {
-		const struct iavf_priv_flags *priv_flags;
-
-		priv_flags = &iavf_gstrings_priv_flags[i];
-
-		if (flags & BIT(i))
-			new_flags |= priv_flags->flag;
-		else
-			new_flags &= ~(priv_flags->flag);
-
-		if (priv_flags->read_only &&
-		    ((orig_flags ^ new_flags) & ~BIT(i)))
-			return -EOPNOTSUPP;
-	}
-
-	/* Before we finalize any flag changes, any checks which we need to
-	 * perform to determine if the new flags will be supported should go
-	 * here...
-	 */
-
-	/* Compare and exchange the new flags into place. If we failed, that
-	 * is if cmpxchg returns anything but the old value, this means
-	 * something else must have modified the flags variable since we
-	 * copied it. We'll just punt with an error and log something in the
-	 * message buffer.
-	 */
-	if (cmpxchg(&adapter->flags, orig_flags, new_flags) != orig_flags) {
-		dev_warn(&adapter->pdev->dev,
-			 "Unable to update adapter->flags as it was modified by another thread...\n");
-		return -EAGAIN;
-	}
-
-	changed_flags = orig_flags ^ new_flags;
-
-	/* Process any additional changes needed as a result of flag changes.
-	 * The changed_flags value reflects the list of bits that were changed
-	 * in the code above.
-	 */
-
-	/* issue a reset to force legacy-rx change to take effect */
-	if (changed_flags & IAVF_FLAG_LEGACY_RX) {
-		if (netif_running(netdev)) {
-			adapter->flags |= IAVF_FLAG_RESET_NEEDED;
-			queue_work(adapter->wq, &adapter->reset_task);
-		}
-	}
-
-	return 0;
-}
-
 /**
  * iavf_get_msglevel - Get debug message level
  * @netdev: network interface device structure
@@ -584,7 +447,6 @@ static void iavf_get_drvinfo(struct net_device *netdev,
 	strscpy(drvinfo->driver, iavf_driver_name, 32);
 	strscpy(drvinfo->fw_version, "N/A", 4);
 	strscpy(drvinfo->bus_info, pci_name(adapter->pdev), 32);
-	drvinfo->n_priv_flags = IAVF_PRIV_FLAGS_STR_LEN;
 }
 
 /**
@@ -1969,8 +1831,6 @@ static const struct ethtool_ops iavf_ethtool_ops = {
 	.get_strings		= iavf_get_strings,
 	.get_ethtool_stats	= iavf_get_ethtool_stats,
 	.get_sset_count		= iavf_get_sset_count,
-	.get_priv_flags		= iavf_get_priv_flags,
-	.set_priv_flags		= iavf_set_priv_flags,
 	.get_msglevel		= iavf_get_msglevel,
 	.set_msglevel		= iavf_set_msglevel,
 	.get_coalesce		= iavf_get_coalesce,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 23fbd45dd986..db1ed13f11bb 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -702,9 +702,7 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
 	struct iavf_hw *hw = &adapter->hw;
 	int i;
 
-	/* Legacy Rx will always default to a 2048 buffer size. */
-#if (PAGE_SIZE < 8192)
-	if (!(adapter->flags & IAVF_FLAG_LEGACY_RX)) {
+	if (PAGE_SIZE < 8192) {
 		struct net_device *netdev = adapter->netdev;
 
 		/* For jumbo frames on systems with 4K pages we have to use
@@ -721,16 +719,10 @@ static void iavf_configure_rx(struct iavf_adapter *adapter)
 		    (netdev->mtu <= ETH_DATA_LEN))
 			rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
 	}
-#endif
 
 	for (i = 0; i < adapter->num_active_queues; i++) {
 		adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
 		adapter->rx_rings[i].rx_buf_len = rx_buf_len;
-
-		if (adapter->flags & IAVF_FLAG_LEGACY_RX)
-			clear_ring_build_skb_enabled(&adapter->rx_rings[i]);
-		else
-			set_ring_build_skb_enabled(&adapter->rx_rings[i]);
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index aae357af072c..a85b270fc769 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -823,17 +823,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
 	writel(val, rx_ring->tail);
 }
 
-/**
- * iavf_rx_offset - Return expected offset into page to access data
- * @rx_ring: Ring we are requesting offset of
- *
- * Returns the offset value for ring into the data buffer.
- */
-static inline unsigned int iavf_rx_offset(struct iavf_ring *rx_ring)
-{
-	return ring_uses_build_skb(rx_ring) ? IAVF_SKB_PAD : 0;
-}
-
 /**
  * iavf_alloc_mapped_page - recycle or make a new page
  * @rx_ring: ring to use
@@ -878,7 +867,7 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
 
 	bi->dma = dma;
 	bi->page = page;
-	bi->page_offset = iavf_rx_offset(rx_ring);
+	bi->page_offset = IAVF_SKB_PAD;
 
 	/* initialize pagecnt_bias to 1 representing we fully own page */
 	bi->pagecnt_bias = 1;
@@ -1219,7 +1208,7 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
 #if (PAGE_SIZE < 8192)
 	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
 #else
-	unsigned int truesize = SKB_DATA_ALIGN(size + iavf_rx_offset(rx_ring));
+	unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
 #endif
 
 	if (!size)
@@ -1267,71 +1256,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
 	return rx_buffer;
 }
 
-/**
- * iavf_construct_skb - Allocate skb and populate it
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- * @size: size of buffer to add to skb
- *
- * This function allocates an skb.  It then populates it with the page
- * data from the current receive descriptor, taking care to set up the
- * skb correctly.
- */
-static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
-					  struct iavf_rx_buffer *rx_buffer,
-					  unsigned int size)
-{
-	void *va;
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(size);
-#endif
-	unsigned int headlen;
-	struct sk_buff *skb;
-
-	if (!rx_buffer)
-		return NULL;
-	/* prefetch first cache line of first page */
-	va = page_address(rx_buffer->page) + rx_buffer->page_offset;
-	net_prefetch(va);
-
-	/* allocate a skb to store the frags */
-	skb = __napi_alloc_skb(&rx_ring->q_vector->napi,
-			       IAVF_RX_HDR_SIZE,
-			       GFP_ATOMIC | __GFP_NOWARN);
-	if (unlikely(!skb))
-		return NULL;
-
-	/* Determine available headroom for copy */
-	headlen = size;
-	if (headlen > IAVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(skb->dev, va, IAVF_RX_HDR_SIZE);
-
-	/* align pull length to size of long to optimize memcpy performance */
-	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
-
-	/* update all of the pointers */
-	size -= headlen;
-	if (size) {
-		skb_add_rx_frag(skb, 0, rx_buffer->page,
-				rx_buffer->page_offset + headlen,
-				size, truesize);
-
-		/* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
-		rx_buffer->page_offset ^= truesize;
-#else
-		rx_buffer->page_offset += truesize;
-#endif
-	} else {
-		/* buffer is unused, reset bias back to rx_buffer */
-		rx_buffer->pagecnt_bias++;
-	}
-
-	return skb;
-}
-
 /**
  * iavf_build_skb - Build skb around an existing buffer
  * @rx_ring: Rx descriptor ring to transact packets on
@@ -1504,10 +1428,8 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 		/* retrieve a buffer from the ring */
 		if (skb)
 			iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
-		else if (ring_uses_build_skb(rx_ring))
-			skb = iavf_build_skb(rx_ring, rx_buffer, size);
 		else
-			skb = iavf_construct_skb(rx_ring, rx_buffer, size);
+			skb = iavf_build_skb(rx_ring, rx_buffer, size);
 
 		/* exit if we failed to retrieve a buffer */
 		if (!skb) {
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 7e6ee32d19b6..4b412f7662e4 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -82,20 +82,11 @@ enum iavf_dyn_idx_t {
 	BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))
 
 /* Supported Rx Buffer Sizes (a multiple of 128) */
-#define IAVF_RXBUFFER_256   256
 #define IAVF_RXBUFFER_1536  1536  /* 128B aligned standard Ethernet frame */
 #define IAVF_RXBUFFER_2048  2048
 #define IAVF_RXBUFFER_3072  3072  /* Used for large frames w/ padding */
 #define IAVF_MAX_RXBUFFER   9728  /* largest size for single descriptor */
 
-/* NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_256 --> 960 byte skb (size-1024 slab)
- * i.e. RXBUFFER_512 --> 1216 byte skb (size-2048 slab)
- */
-#define IAVF_RX_HDR_SIZE IAVF_RXBUFFER_256
 #define IAVF_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))
 #define iavf_rx_desc iavf_32byte_rx_desc
 
@@ -362,7 +353,8 @@ struct iavf_ring {
 
 	u16 flags;
 #define IAVF_TXR_FLAGS_WB_ON_ITR		BIT(0)
-#define IAVF_RXR_FLAGS_BUILD_SKB_ENABLED	BIT(1)
+/* BIT(1) is free, was IAVF_RXR_FLAGS_BUILD_SKB_ENABLED */
+/* BIT(2) is free */
 #define IAVF_TXRX_FLAGS_VLAN_TAG_LOC_L2TAG1	BIT(3)
 #define IAVF_TXR_FLAGS_VLAN_TAG_LOC_L2TAG2	BIT(4)
 #define IAVF_RXR_FLAGS_VLAN_TAG_LOC_L2TAG2_2	BIT(5)
@@ -393,21 +385,6 @@ struct iavf_ring {
 					 */
 } ____cacheline_internodealigned_in_smp;
 
-static inline bool ring_uses_build_skb(struct iavf_ring *ring)
-{
-	return !!(ring->flags & IAVF_RXR_FLAGS_BUILD_SKB_ENABLED);
-}
-
-static inline void set_ring_build_skb_enabled(struct iavf_ring *ring)
-{
-	ring->flags |= IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
-static inline void clear_ring_build_skb_enabled(struct iavf_ring *ring)
-{
-	ring->flags &= ~IAVF_RXR_FLAGS_BUILD_SKB_ENABLED;
-}
-
 #define IAVF_ITR_ADAPTIVE_MIN_INC	0x0002
 #define IAVF_ITR_ADAPTIVE_MIN_USECS	0x0002
 #define IAVF_ITR_ADAPTIVE_MAX_USECS	0x007e
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index 7c0578b5457b..fdddc3588487 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -290,8 +290,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
 		return;
 
 	/* Limit maximum frame size when jumbo frames is not enabled */
-	if (!(adapter->flags & IAVF_FLAG_LEGACY_RX) &&
-	    (adapter->netdev->mtu <= ETH_DATA_LEN))
+	if (adapter->netdev->mtu <= ETH_DATA_LEN)
 		max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
 
 	vqci->vsi_id = adapter->vsi_res->vsi_id;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-06 14:47   ` [Intel-wired-lan] " Alexander Duyck
  2023-07-05 15:55 ` [PATCH RFC net-next v4 4/9] net: page_pool: add DMA-sync-for-CPU inline helpers Alexander Lobakin
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

As an intermediate step, remove all page splitting/recycling code. Just
always allocate a new page and don't touch its refcount, so that it gets
freed by the core stack later.
Same goes for the "in-place" recycling, i.e. when an unused buffer gets
assigned to the first descriptor that needs refilling. In some cases,
this led to moving up to 63 &iavf_rx_buf structures around the ring on
a per-field basis -- not something you want on the hotpath.
The change allows certain parts of the code to be greatly simplified:

Function: add/remove: 0/2 grow/shrink: 1/3 up/down: 3/-494 (-491)

Although the array of &iavf_rx_buf is barely used now and could be
replaced with just a page pointer array, don't touch it yet, so as not
to complicate replacing it with the libie Rx buffer struct later on.
Unsurprisingly, perf drops by up to 30% here, but that regression will
go away once PP lands.
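
For clarity, the resulting per-buffer page lifecycle is roughly the
following (dev and order stand for what the ring already stores; this
is a simplified sketch, not a verbatim excerpt):

    page = dev_alloc_pages(order);
    dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE << order,
                             DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
    /* HW writes a frame, the buffer gets attached to an skb ... */
    dma_unmap_page_attrs(dev, dma, PAGE_SIZE << order,
                         DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
    /* no refcount games anymore: the page is owned solely by the skb
     * and is freed by the network stack once the skb is consumed
     */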

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf_txrx.c | 151 ++------------------
 drivers/net/ethernet/intel/iavf/iavf_txrx.h |   8 --
 2 files changed, 13 insertions(+), 146 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index a85b270fc769..789b10815d7f 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -723,7 +723,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
 				     DMA_FROM_DEVICE,
 				     IAVF_RX_DMA_ATTR);
 
-		__page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
+		__free_pages(rx_bi->page, iavf_rx_pg_order(rx_ring));
 
 		rx_bi->page = NULL;
 		rx_bi->page_offset = 0;
@@ -735,7 +735,6 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
 	/* Zero out the descriptor ring */
 	memset(rx_ring->desc, 0, rx_ring->size);
 
-	rx_ring->next_to_alloc = 0;
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 }
@@ -791,7 +790,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
 		goto err;
 	}
 
-	rx_ring->next_to_alloc = 0;
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
@@ -811,9 +809,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
 {
 	rx_ring->next_to_use = val;
 
-	/* update next to alloc since we have filled the ring */
-	rx_ring->next_to_alloc = val;
-
 	/* Force memory writes to complete before letting h/w
 	 * know there are new descriptors to fetch.  (Only
 	 * applicable for weak-ordered memory model archs,
@@ -837,12 +832,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
 	struct page *page = bi->page;
 	dma_addr_t dma;
 
-	/* since we are recycling buffers we should seldom need to alloc */
-	if (likely(page)) {
-		rx_ring->rx_stats.page_reuse_count++;
-		return true;
-	}
-
 	/* alloc new page for storage */
 	page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
 	if (unlikely(!page)) {
@@ -869,9 +858,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
 	bi->page = page;
 	bi->page_offset = IAVF_SKB_PAD;
 
-	/* initialize pagecnt_bias to 1 representing we fully own page */
-	bi->pagecnt_bias = 1;
-
 	return true;
 }
 
@@ -1103,91 +1089,6 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
 	return false;
 }
 
-/**
- * iavf_reuse_rx_page - page flip buffer and store it back on the ring
- * @rx_ring: rx descriptor ring to store buffers on
- * @old_buff: donor buffer to have page reused
- *
- * Synchronizes page for reuse by the adapter
- **/
-static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
-			       struct iavf_rx_buffer *old_buff)
-{
-	struct iavf_rx_buffer *new_buff;
-	u16 nta = rx_ring->next_to_alloc;
-
-	new_buff = &rx_ring->rx_bi[nta];
-
-	/* update, and store next to alloc */
-	nta++;
-	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
-
-	/* transfer page from old buffer to new buffer */
-	new_buff->dma		= old_buff->dma;
-	new_buff->page		= old_buff->page;
-	new_buff->page_offset	= old_buff->page_offset;
-	new_buff->pagecnt_bias	= old_buff->pagecnt_bias;
-}
-
-/**
- * iavf_can_reuse_rx_page - Determine if this page can be reused by
- * the adapter for another receive
- *
- * @rx_buffer: buffer containing the page
- *
- * If page is reusable, rx_buffer->page_offset is adjusted to point to
- * an unused region in the page.
- *
- * For small pages, @truesize will be a constant value, half the size
- * of the memory at page.  We'll attempt to alternate between high and
- * low halves of the page, with one half ready for use by the hardware
- * and the other half being consumed by the stack.  We use the page
- * ref count to determine whether the stack has finished consuming the
- * portion of this page that was passed up with a previous packet.  If
- * the page ref count is >1, we'll assume the "other" half page is
- * still busy, and this page cannot be reused.
- *
- * For larger pages, @truesize will be the actual space used by the
- * received packet (adjusted upward to an even multiple of the cache
- * line size).  This will advance through the page by the amount
- * actually consumed by the received packets while there is still
- * space for a buffer.  Each region of larger pages will be used at
- * most once, after which the page will not be reused.
- *
- * In either case, if the page is reusable its refcount is increased.
- **/
-static bool iavf_can_reuse_rx_page(struct iavf_rx_buffer *rx_buffer)
-{
-	unsigned int pagecnt_bias = rx_buffer->pagecnt_bias;
-	struct page *page = rx_buffer->page;
-
-	/* Is any reuse possible? */
-	if (!dev_page_is_reusable(page))
-		return false;
-
-#if (PAGE_SIZE < 8192)
-	/* if we are only owner of page we can reuse it */
-	if (unlikely((page_count(page) - pagecnt_bias) > 1))
-		return false;
-#else
-#define IAVF_LAST_OFFSET \
-	(SKB_WITH_OVERHEAD(PAGE_SIZE) - IAVF_RXBUFFER_2048)
-	if (rx_buffer->page_offset > IAVF_LAST_OFFSET)
-		return false;
-#endif
-
-	/* If we have drained the page fragment pool we need to update
-	 * the pagecnt_bias and page count so that we fully restock the
-	 * number of references the driver holds.
-	 */
-	if (unlikely(!pagecnt_bias)) {
-		page_ref_add(page, USHRT_MAX);
-		rx_buffer->pagecnt_bias = USHRT_MAX;
-	}
-
-	return true;
-}
-
 /**
  * iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
  * @rx_ring: rx descriptor ring to transact packets on
@@ -1216,13 +1117,6 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
 			rx_buffer->page_offset, size, truesize);
-
-	/* page is being used so we must update the page offset */
-#if (PAGE_SIZE < 8192)
-	rx_buffer->page_offset ^= truesize;
-#else
-	rx_buffer->page_offset += truesize;
-#endif
 }
 
 /**
@@ -1250,9 +1144,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
 				      size,
 				      DMA_FROM_DEVICE);
 
-	/* We have pulled a buffer for use, so decrement pagecnt_bias */
-	rx_buffer->pagecnt_bias--;
-
 	return rx_buffer;
 }
 
@@ -1293,23 +1184,15 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
 	skb_reserve(skb, IAVF_SKB_PAD);
 	__skb_put(skb, size);
 
-	/* buffer is used by skb, update page_offset */
-#if (PAGE_SIZE < 8192)
-	rx_buffer->page_offset ^= truesize;
-#else
-	rx_buffer->page_offset += truesize;
-#endif
-
 	return skb;
 }
 
 /**
- * iavf_put_rx_buffer - Clean up used buffer and either recycle or free
+ * iavf_put_rx_buffer - Unmap used buffer
  * @rx_ring: rx descriptor ring to transact packets on
  * @rx_buffer: rx buffer to pull data from
  *
- * This function will clean up the contents of the rx_buffer.  It will
- * either recycle the buffer or unmap it and free the associated resources.
+ * This function will unmap the buffer after it's written by HW.
  */
 static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
 			       struct iavf_rx_buffer *rx_buffer)
@@ -1317,21 +1200,10 @@ static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
 	if (!rx_buffer)
 		return;
 
-	if (iavf_can_reuse_rx_page(rx_buffer)) {
-		/* hand second half of page back to the ring */
-		iavf_reuse_rx_page(rx_ring, rx_buffer);
-		rx_ring->rx_stats.page_reuse_count++;
-	} else {
-		/* we are not reusing the buffer so unmap it */
-		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
-				     iavf_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
-		__page_frag_cache_drain(rx_buffer->page,
-					rx_buffer->pagecnt_bias);
-	}
-
-	/* clear contents of buffer_info */
-	rx_buffer->page = NULL;
+	/* we are not reusing the buffer so unmap it */
+	dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
+			     iavf_rx_pg_size(rx_ring),
+			     DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
 }
 
 /**
@@ -1431,15 +1303,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 		else
 			skb = iavf_build_skb(rx_ring, rx_buffer, size);
 
+		iavf_put_rx_buffer(rx_ring, rx_buffer);
+
 		/* exit if we failed to retrieve a buffer */
 		if (!skb) {
 			rx_ring->rx_stats.alloc_buff_failed++;
-			if (rx_buffer && size)
-				rx_buffer->pagecnt_bias++;
+			__free_pages(rx_buffer->page,
+				     iavf_rx_pg_order(rx_ring));
+			rx_buffer->page = NULL;
 			break;
 		}
 
-		iavf_put_rx_buffer(rx_ring, rx_buffer);
+		rx_buffer->page = NULL;
 		cleaned_count++;
 
 		if (iavf_is_non_eop(rx_ring, rx_desc, skb))
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 4b412f7662e4..2170a77f8c8d 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -266,12 +266,7 @@ struct iavf_tx_buffer {
 struct iavf_rx_buffer {
 	dma_addr_t dma;
 	struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
-#else
-	__u16 page_offset;
-#endif
-	__u16 pagecnt_bias;
 };
 
 struct iavf_queue_stats {
@@ -293,8 +288,6 @@ struct iavf_rx_queue_stats {
 	u64 non_eop_descs;
 	u64 alloc_page_failed;
 	u64 alloc_buff_failed;
-	u64 page_reuse_count;
-	u64 realloc_count;
 };
 
 enum iavf_ring_state_t {
@@ -374,7 +367,6 @@ struct iavf_ring {
 	struct iavf_q_vector *q_vector;	/* Backreference to associated vector */
 
 	struct rcu_head rcu;		/* to avoid race on free */
-	u16 next_to_alloc;
 	struct sk_buff *skb;		/* When iavf_clean_rx_ring_irq() must
 					 * return before it sees the EOP for
 					 * the current packet, we save that skb
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 4/9] net: page_pool: add DMA-sync-for-CPU inline helpers
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (2 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool) Alexander Lobakin
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Each driver is responsible for syncing the buffers written by HW for
the CPU before accessing them. Almost every PP-enabled driver uses the
same pattern, which can be shorthanded into a static inline to make the
driver code a bit more compact.
Introduce a couple of such functions. The first one takes the actual size
of the data written by HW and is the main one to be used on Rx. The
second does the same, but only if the PP performs DMA synchronizations
at all.
Also constify pointer arguments of page_pool_get_dma_dir() and
page_pool_get_dma_addr() to give a bit more room for optimization,
as both of them are read-only.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool.h | 49 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index ff3772fab707..89acf9e33639 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -304,8 +304,8 @@ static inline void *page_pool_dev_cache_alloc(struct page_pool *pool,
 /* get the stored dma direction. A driver might decide to treat this locally and
  * avoid the extra cache line from page_pool to determine the direction
  */
-static
-inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
+static inline enum dma_data_direction
+page_pool_get_dma_dir(const struct page_pool *pool)
 {
 	return pool->p.dma_dir;
 }
@@ -460,7 +460,7 @@ static inline void page_pool_cache_free(struct page_pool *pool, void *data,
 	page_pool_put_page(pool, virt_to_head_page(data), -1, allow_direct);
 }
 
-static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
+static inline dma_addr_t page_pool_get_dma_addr(const struct page *page)
 {
 	dma_addr_t ret = page->dma_addr;
 
@@ -477,6 +477,49 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 		page->dma_addr_upper = upper_32_bits(addr);
 }
 
+/**
+ * __page_pool_dma_sync_for_cpu - sync Rx page for CPU after it's written by HW
+ * @pool: &page_pool the page belongs to
+ * @page: page to sync
+ * @offset: offset from page start to "hard" start if using frags
+ * @dma_sync_size: size of the data written to the page
+ *
+ * Can be used as a shorthand to sync Rx pages before accessing them in the
+ * driver. Caller must ensure the pool was created with %PP_FLAG_DMA_MAP.
+ * Note that this version performs DMA sync unconditionally, even if the
+ * associated PP doesn't perform sync-for-device. Consider the non-underscored
+ * version first if unsure.
+ */
+static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool,
+						const struct page *page,
+						u32 offset, u32 dma_sync_size)
+{
+	dma_sync_single_range_for_cpu(pool->p.dev,
+				      page_pool_get_dma_addr(page),
+				      offset + pool->p.offset, dma_sync_size,
+				      page_pool_get_dma_dir(pool));
+}
+
+/**
+ * page_pool_dma_sync_for_cpu - sync Rx page for CPU if needed
+ * @pool: &page_pool the page belongs to
+ * @page: page to sync
+ * @offset: offset from page start to "hard" start if using frags
+ * @dma_sync_size: size of the data written to the page
+ *
+ * Performs DMA sync for CPU, but *only* when:
+ * 1) page_pool was created with %PP_FLAG_DMA_SYNC_DEV to manage DMA syncs;
+ * 2) AND sync shortcut is not available (IOMMU, swiotlb, non-coherent DMA, ...)
+ */
+static inline void page_pool_dma_sync_for_cpu(const struct page_pool *pool,
+					      const struct page *page,
+					      u32 offset, u32 dma_sync_size)
+{
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		__page_pool_dma_sync_for_cpu(pool, page, offset,
+					     dma_sync_size);
+}
+
 static inline bool is_page_pool_compiled_in(void)
 {
 #ifdef CONFIG_PAGE_POOL
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (3 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 4/9] net: page_pool: add DMA-sync-for-CPU inline helpers Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-06 12:47   ` Yunsheng Lin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Add a couple of intuitive helpers to hide the Rx buffer implementation
details in the library instead of duplicating them across drivers. The
settings are optimized for Intel hardware, but nothing here is really
HW-specific.
Use the new page_pool_dev_alloc() to dynamically switch between
split-page and full-page modes depending on MTU, page size, required
headroom etc. For example, on x86_64 with the default driver settings,
each page is shared between 2 buffers. Turning on XDP (not in this
series) increases the headroom requirement and pushes the truesize past
the 2048-byte boundary, so that each buffer then gets a full page.
The "ceiling" is %PAGE_SIZE, as only order-0 pages are used to avoid
compound-page overhead. For the above architecture, this means a
maximum linear frame size of 3712 bytes w/o XDP.
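For reference, on a typical x86_64 build (4K pages, 64-byte
cachelines), the numbers above work out as follows: the headroom is
64 (NET_SKB_PAD + NET_IP_ALIGN), so __LIBIE_RX_BUF_LEN() gives
4096 - 64 - ~320 (&skb_shared_info overhead) = 3712; a default
1500-byte MTU gives ALIGN(1500 + 26, 128) = 1536 bytes of buffer and
a "wanted" truesize of roundup_pow_of_two(64 + 1536 + ~320) = 2048,
hence two buffers per page.
A minimal refill sketch follows; the ring layout and the names
(my_ring etc.) are made up and only show how the API is meant to be
called:

	#include <linux/dma-mapping.h>
	#include <linux/net/intel/libie/rx.h>

	struct my_ring {
		struct page_pool	*pp;
		struct libie_rx_buffer	*rx_bi;
		__le64			*hw_ring; /* simplified descriptors */
		u32			count;
	};

	/* Refill @num buffers starting at @ntu, stop on the first failure */
	static bool my_ring_refill(struct my_ring *ring, u32 ntu, u32 num)
	{
		while (num--) {
			dma_addr_t dma;

			dma = libie_rx_alloc(ring->pp, &ring->rx_bi[ntu]);
			if (dma == DMA_MAPPING_ERROR)
				return false;

			/* The headroom is already accounted for in @dma */
			ring->hw_ring[ntu] = cpu_to_le64(dma);

			if (unlikely(++ntu == ring->count))
				ntu = 0;
		}

		return true;
	}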

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/Kconfig    |   1 +
 drivers/net/ethernet/intel/libie/rx.c |  54 +++++++++++++
 include/linux/net/intel/libie/rx.h    | 111 +++++++++++++++++++++++++-
 3 files changed, 165 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 86ecedeac115..e187942b5a5c 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -86,6 +86,7 @@ config E1000E_HWTS
 
 config LIBIE
 	tristate
+	select PAGE_POOL
 	help
 	  libie (Intel Ethernet library) is a common library containing
 	  routines shared by several Intel Ethernet drivers.
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
index f503476d8eef..c60d7b20ed20 100644
--- a/drivers/net/ethernet/intel/libie/rx.c
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -3,6 +3,60 @@
 
 #include <linux/net/intel/libie/rx.h>
 
+/* Rx buffer management */
+
+/**
+ * libie_rx_sync_len - get the actual buffer size to be synced and passed to HW
+ * @dev: &net_device to calculate the size for
+ * @hr: headroom in front of each frame
+ *
+ * Returns the buffer size to pass it to HW and use for DMA synchronization
+ * accounting: MTU the @dev has, HW required alignment, minimum and maximum
+ * allowed values, and system's page size.
+ */
+static u32 libie_rx_sync_len(const struct net_device *dev, u32 hr)
+{
+	u32 len;
+
+	len = READ_ONCE(dev->mtu) + LIBIE_RX_LL_LEN;
+	len = ALIGN(len, LIBIE_RX_BUF_LEN_ALIGN);
+	len = clamp(len, LIBIE_MIN_RX_BUF_LEN, LIBIE_RX_BUF_LEN(hr));
+
+	return len;
+}
+
+/**
+ * libie_rx_page_pool_create - create a PP with the default libie settings
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ * @size: size of the PP, usually simply Rx queue len
+ *
+ * Returns &page_pool on success, casted -errno on failure.
+ */
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+					    u32 size)
+{
+	struct page_pool_params pp = {
+		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+		.order		= LIBIE_RX_PAGE_ORDER,
+		.pool_size	= size,
+		.nid		= NUMA_NO_NODE,
+		.dev		= napi->dev->dev.parent,
+		.napi		= napi,
+		.dma_dir	= DMA_FROM_DEVICE,
+		.offset		= LIBIE_SKB_HEADROOM,
+	};
+	size_t truesize;
+
+	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
+
+	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
+	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
+	pp.init_arg = (void *)truesize;
+
+	return page_pool_create(&pp);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
+
 /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
  * bitfield struct.
  */
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index 58bd0f35d025..8c0ccdff9a37 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -4,7 +4,116 @@
 #ifndef __LIBIE_RX_H
 #define __LIBIE_RX_H
 
-#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <net/page_pool.h>
+
+/* Rx MTU/buffer/truesize helpers. Mostly pure software-side; HW-defined values
+ * are valid for all Intel HW.
+ */
+
+/* Space reserved in front of each frame */
+#define LIBIE_SKB_HEADROOM	(NET_SKB_PAD + NET_IP_ALIGN)
+/* Maximum headroom to calculate max MTU below */
+#define LIBIE_MAX_HEADROOM	LIBIE_SKB_HEADROOM
+/* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
+#define LIBIE_RX_LL_LEN		(ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+
+/* Always use order-0 pages */
+#define LIBIE_RX_PAGE_ORDER	0
+/* Rx buffer size config is a multiple of 128 */
+#define LIBIE_RX_BUF_LEN_ALIGN	128
+/* HW-writeable space in one buffer: truesize - headroom/tailroom,
+ * HW-aligned
+ */
+#define __LIBIE_RX_BUF_LEN(hr)						\
+	ALIGN_DOWN(SKB_MAX_ORDER(hr, LIBIE_RX_PAGE_ORDER),		\
+		   LIBIE_RX_BUF_LEN_ALIGN)
+/* The smallest and largest size for a single descriptor as per HW */
+#define LIBIE_MIN_RX_BUF_LEN	1024U
+#define LIBIE_MAX_RX_BUF_LEN	9728U
+/* "True" HW-writeable space: minimum from SW and HW values */
+#define LIBIE_RX_BUF_LEN(hr)	min_t(u32, __LIBIE_RX_BUF_LEN(hr),	\
+				      LIBIE_MAX_RX_BUF_LEN)
+
+/* The maximum frame size as per HW (S/G) */
+#define __LIBIE_MAX_RX_FRM_LEN	16382U
+/* ATST, HW can chain up to 5 Rx descriptors */
+#define LIBIE_MAX_RX_FRM_LEN(hr)					\
+	min_t(u32, __LIBIE_MAX_RX_FRM_LEN, LIBIE_RX_BUF_LEN(hr) * 5)
+/* Maximum frame size minus LL overhead */
+#define LIBIE_MAX_MTU							\
+	(LIBIE_MAX_RX_FRM_LEN(LIBIE_MAX_HEADROOM) - LIBIE_RX_LL_LEN)
+
+/* Rx buffer management */
+
+/**
+ * struct libie_rx_buffer - structure representing an Rx buffer
+ * @page: page holding the buffer
+ * @offset: offset from the page start (to the headroom)
+ * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
+ *
+ * Depending on the MTU, API switches between one-page-per-frame and shared
+ * page model (to conserve memory on bigger-page platforms). In case of the
+ * former, @offset is always 0 and @truesize is always %PAGE_SIZE.
+ */
+struct libie_rx_buffer {
+	struct page	*page;
+	u32		offset;
+	u32		truesize;
+};
+
+struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
+					    u32 size);
+
+/**
+ * libie_rx_alloc - allocate a new Rx buffer
+ * @pool: page_pool to allocate from
+ * @buf: buffer structure to populate
+ *
+ * Returns &dma_addr_t to be passed to HW for Rx, %DMA_MAPPING_ERROR otherwise.
+ */
+static inline dma_addr_t libie_rx_alloc(struct page_pool *pool,
+					struct libie_rx_buffer *buf)
+{
+	/* "Wanted" truesize, see libie_rx_page_pool_create() */
+	buf->truesize = (size_t)pool->p.init_arg;
+	buf->page = page_pool_dev_alloc(pool, &buf->offset, &buf->truesize);
+	if (!buf->page)
+		return DMA_MAPPING_ERROR;
+
+	return page_pool_get_dma_addr(buf->page) + buf->offset +
+	       pool->p.offset;
+}
+
+/**
+ * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
+ * @buf: buffer to process
+ * @len: frame length from the descriptor
+ *
+ * Process the buffer after it's written by HW. The regular path is to
+ * synchronize DMA for CPU, but in case of no data it will be immediately
+ * recycled back to its PP.
+ *
+ * Returns true when there's data to process, false otherwise.
+ */
+static inline bool __must_check
+libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf, u32 len)
+{
+	struct page *page = buf->page;
+
+	/* Very rare, but possible case. The most common reason:
+	 * the last fragment contained FCS only, which was then
+	 * stripped by the HW.
+	 */
+	if (unlikely(!len)) {
+		page_pool_recycle_direct(page->pp, page);
+		return false;
+	}
+
+	page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
+
+	return true;
+}
 
 /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
  * bitfield struct.
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (4 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool) Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-06 12:47   ` Yunsheng Lin
  2023-07-06 15:26   ` [Intel-wired-lan] " Alexander Duyck
  2023-07-05 15:55 ` [PATCH RFC net-next v4 7/9] libie: add common queue stats Alexander Lobakin
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
no custom recycling logic, it can easily be switched to the Page Pool /
libie API instead.
This allows removing all the dancing around headroom, HW buffer size,
and page order. All DMA-for-device handling is now done in the PP core,
and for-CPU -- in the libie helper.
Use skb_mark_for_recycle() to bring back the recycling and restore the
performance. Speaking of performance: it is on par with the baseline and
faster with the PP optimization series applied. Memory usage for a
1500-byte MTU is now almost 2x lower (x86_64), since a page is now
allocated only for every second descriptor.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  33 +--
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 247 +++++-------------
 drivers/net/ethernet/intel/iavf/iavf_txrx.h   |  91 +------
 drivers/net/ethernet/intel/iavf/iavf_type.h   |   2 -
 .../net/ethernet/intel/iavf/iavf_virtchnl.c   |  19 +-
 5 files changed, 82 insertions(+), 310 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index db1ed13f11bb..39c6d83e80a1 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include <linux/net/intel/libie/rx.h>
+
 #include "iavf.h"
 #include "iavf_prototype.h"
 #include "iavf_client.h"
@@ -698,32 +700,10 @@ static void iavf_configure_tx(struct iavf_adapter *adapter)
  **/
 static void iavf_configure_rx(struct iavf_adapter *adapter)
 {
-	unsigned int rx_buf_len = IAVF_RXBUFFER_2048;
 	struct iavf_hw *hw = &adapter->hw;
-	int i;
-
-	if (PAGE_SIZE < 8192) {
-		struct net_device *netdev = adapter->netdev;
 
-		/* For jumbo frames on systems with 4K pages we have to use
-		 * an order 1 page, so we might as well increase the size
-		 * of our Rx buffer to make better use of the available space
-		 */
-		rx_buf_len = IAVF_RXBUFFER_3072;
-
-		/* We use a 1536 buffer size for configurations with
-		 * standard Ethernet mtu.  On x86 this gives us enough room
-		 * for shared info and 192 bytes of padding.
-		 */
-		if (!IAVF_2K_TOO_SMALL_WITH_PADDING &&
-		    (netdev->mtu <= ETH_DATA_LEN))
-			rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
-	}
-
-	for (i = 0; i < adapter->num_active_queues; i++) {
+	for (u32 i = 0; i < adapter->num_active_queues; i++)
 		adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
-		adapter->rx_rings[i].rx_buf_len = rx_buf_len;
-	}
 }
 
 /**
@@ -1590,7 +1570,6 @@ static int iavf_alloc_queues(struct iavf_adapter *adapter)
 		rx_ring = &adapter->rx_rings[i];
 		rx_ring->queue_index = i;
 		rx_ring->netdev = adapter->netdev;
-		rx_ring->dev = &adapter->pdev->dev;
 		rx_ring->count = adapter->rx_desc_count;
 		rx_ring->itr_setting = IAVF_ITR_RX_DEF;
 	}
@@ -2562,11 +2541,7 @@ static void iavf_init_config_adapter(struct iavf_adapter *adapter)
 
 	netdev->netdev_ops = &iavf_netdev_ops;
 	iavf_set_ethtool_ops(netdev);
-	netdev->watchdog_timeo = 5 * HZ;
-
-	/* MTU range: 68 - 9710 */
-	netdev->min_mtu = ETH_MIN_MTU;
-	netdev->max_mtu = IAVF_MAX_RXBUFFER - IAVF_PACKET_HDR_PAD;
+	netdev->max_mtu = LIBIE_MAX_MTU;
 
 	if (!is_valid_ether_addr(adapter->hw.mac.addr)) {
 		dev_info(&pdev->dev, "Invalid MAC address %pM, using random\n",
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 789b10815d7f..d1491b481eac 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -689,9 +689,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
  **/
 static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
 {
-	unsigned long bi_size;
-	u16 i;
-
 	/* ring already cleared, nothing to do */
 	if (!rx_ring->rx_bi)
 		return;
@@ -701,40 +698,16 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
 		rx_ring->skb = NULL;
 	}
 
-	/* Free all the Rx ring sk_buffs */
-	for (i = 0; i < rx_ring->count; i++) {
-		struct iavf_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+	/* Free all the Rx ring buffers */
+	for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
+		const struct libie_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
 
-		if (!rx_bi->page)
-			continue;
+		page_pool_put_full_page(rx_ring->pp, rx_bi->page, false);
 
-		/* Invalidate cache lines that may have been written to by
-		 * device so that we avoid corrupting memory.
-		 */
-		dma_sync_single_range_for_cpu(rx_ring->dev,
-					      rx_bi->dma,
-					      rx_bi->page_offset,
-					      rx_ring->rx_buf_len,
-					      DMA_FROM_DEVICE);
-
-		/* free resources associated with mapping */
-		dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
-				     iavf_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     IAVF_RX_DMA_ATTR);
-
-		__free_pages(rx_bi->page, iavf_rx_pg_order(rx_ring));
-
-		rx_bi->page = NULL;
-		rx_bi->page_offset = 0;
+		if (unlikely(++i == rx_ring->count))
+			i = 0;
 	}
 
-	bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
-	memset(rx_ring->rx_bi, 0, bi_size);
-
-	/* Zero out the descriptor ring */
-	memset(rx_ring->desc, 0, rx_ring->size);
-
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 }
@@ -752,10 +725,13 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
 	rx_ring->rx_bi = NULL;
 
 	if (rx_ring->desc) {
-		dma_free_coherent(rx_ring->dev, rx_ring->size,
+		dma_free_coherent(rx_ring->pp->p.dev, rx_ring->size,
 				  rx_ring->desc, rx_ring->dma);
 		rx_ring->desc = NULL;
 	}
+
+	page_pool_destroy(rx_ring->pp);
+	rx_ring->pp = NULL;
 }
 
 /**
@@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
  **/
 int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
 {
-	struct device *dev = rx_ring->dev;
-	int bi_size;
+	struct page_pool *pool;
+
+	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
+					 rx_ring->count);
+	if (IS_ERR(pool))
+		return PTR_ERR(pool);
+
+	rx_ring->pp = pool;
 
 	/* warn if we are about to overwrite the pointer */
 	WARN_ON(rx_ring->rx_bi);
-	bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
-	rx_ring->rx_bi = kzalloc(bi_size, GFP_KERNEL);
+	rx_ring->rx_bi = kcalloc(rx_ring->count, sizeof(*rx_ring->rx_bi),
+				 GFP_KERNEL);
 	if (!rx_ring->rx_bi)
 		goto err;
 
@@ -781,22 +763,27 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
 	/* Round up to nearest 4K */
 	rx_ring->size = rx_ring->count * sizeof(union iavf_32byte_rx_desc);
 	rx_ring->size = ALIGN(rx_ring->size, 4096);
-	rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
+	rx_ring->desc = dma_alloc_coherent(pool->p.dev, rx_ring->size,
 					   &rx_ring->dma, GFP_KERNEL);
 
 	if (!rx_ring->desc) {
-		dev_info(dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
+		dev_info(pool->p.dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
 			 rx_ring->size);
-		goto err;
+		goto err_free_buf;
 	}
 
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
 	return 0;
-err:
+
+err_free_buf:
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
+
+err:
+	page_pool_destroy(rx_ring->pp);
+
 	return -ENOMEM;
 }
 
@@ -818,49 +805,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
 	writel(val, rx_ring->tail);
 }
 
-/**
- * iavf_alloc_mapped_page - recycle or make a new page
- * @rx_ring: ring to use
- * @bi: rx_buffer struct to modify
- *
- * Returns true if the page was successfully allocated or
- * reused.
- **/
-static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
-				   struct iavf_rx_buffer *bi)
-{
-	struct page *page = bi->page;
-	dma_addr_t dma;
-
-	/* alloc new page for storage */
-	page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
-	if (unlikely(!page)) {
-		rx_ring->rx_stats.alloc_page_failed++;
-		return false;
-	}
-
-	/* map page for use */
-	dma = dma_map_page_attrs(rx_ring->dev, page, 0,
-				 iavf_rx_pg_size(rx_ring),
-				 DMA_FROM_DEVICE,
-				 IAVF_RX_DMA_ATTR);
-
-	/* if mapping failed free memory back to system since
-	 * there isn't much point in holding memory we can't use
-	 */
-	if (dma_mapping_error(rx_ring->dev, dma)) {
-		__free_pages(page, iavf_rx_pg_order(rx_ring));
-		rx_ring->rx_stats.alloc_page_failed++;
-		return false;
-	}
-
-	bi->dma = dma;
-	bi->page = page;
-	bi->page_offset = IAVF_SKB_PAD;
-
-	return true;
-}
-
 /**
  * iavf_receive_skb - Send a completed packet up the stack
  * @rx_ring:  rx ring in play
@@ -893,36 +837,29 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
 {
 	u16 ntu = rx_ring->next_to_use;
 	union iavf_rx_desc *rx_desc;
-	struct iavf_rx_buffer *bi;
 
 	/* do nothing if no valid netdev defined */
 	if (!rx_ring->netdev || !cleaned_count)
 		return false;
 
 	rx_desc = IAVF_RX_DESC(rx_ring, ntu);
-	bi = &rx_ring->rx_bi[ntu];
 
 	do {
-		if (!iavf_alloc_mapped_page(rx_ring, bi))
-			goto no_buffers;
+		dma_addr_t addr;
 
-		/* sync the buffer for use by the device */
-		dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
-						 bi->page_offset,
-						 rx_ring->rx_buf_len,
-						 DMA_FROM_DEVICE);
+		addr = libie_rx_alloc(rx_ring->pp, &rx_ring->rx_bi[ntu]);
+		if (unlikely(addr == DMA_MAPPING_ERROR))
+			goto no_buffers;
 
 		/* Refresh the desc even if buffer_addrs didn't change
 		 * because each write-back erases this info.
 		 */
-		rx_desc->read.pkt_addr = cpu_to_le64(bi->dma + bi->page_offset);
+		rx_desc->read.pkt_addr = cpu_to_le64(addr);
 
 		rx_desc++;
-		bi++;
 		ntu++;
 		if (unlikely(ntu == rx_ring->count)) {
 			rx_desc = IAVF_RX_DESC(rx_ring, 0);
-			bi = rx_ring->rx_bi;
 			ntu = 0;
 		}
 
@@ -941,6 +878,8 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
 	if (rx_ring->next_to_use != ntu)
 		iavf_release_rx_desc(rx_ring, ntu);
 
+	rx_ring->rx_stats.alloc_page_failed++;
+
 	/* make sure to come back via polling to try again after
 	 * allocation failure
 	 */
@@ -1091,9 +1030,8 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
 
 /**
  * iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: buffer containing page to add
  * @skb: sk_buff to place the data into
+ * @rx_buffer: buffer containing page to add
  * @size: packet length from rx_desc
  *
  * This function will add the data contained in rx_buffer->page to the skb.
@@ -1101,111 +1039,52 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
  *
  * The function will then update the page offset.
  **/
-static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
-			     struct iavf_rx_buffer *rx_buffer,
-			     struct sk_buff *skb,
+static void iavf_add_rx_frag(struct sk_buff *skb,
+			     const struct libie_rx_buffer *rx_buffer,
 			     unsigned int size)
 {
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
-#endif
-
-	if (!size)
-		return;
+	u32 hr = rx_buffer->page->pp->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
-			rx_buffer->page_offset, size, truesize);
-}
-
-/**
- * iavf_get_rx_buffer - Fetch Rx buffer and synchronize data for use
- * @rx_ring: rx descriptor ring to transact packets on
- * @size: size of buffer to add to skb
- *
- * This function will pull an Rx buffer from the ring and synchronize it
- * for use by the CPU.
- */
-static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
-						 const unsigned int size)
-{
-	struct iavf_rx_buffer *rx_buffer;
-
-	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
-	prefetchw(rx_buffer->page);
-	if (!size)
-		return rx_buffer;
-
-	/* we are reusing so sync this buffer for CPU use */
-	dma_sync_single_range_for_cpu(rx_ring->dev,
-				      rx_buffer->dma,
-				      rx_buffer->page_offset,
-				      size,
-				      DMA_FROM_DEVICE);
-
-	return rx_buffer;
+			rx_buffer->offset + hr, size, rx_buffer->truesize);
 }
 
 /**
  * iavf_build_skb - Build skb around an existing buffer
- * @rx_ring: Rx descriptor ring to transact packets on
  * @rx_buffer: Rx buffer to pull data from
  * @size: size of buffer to add to skb
  *
  * This function builds an skb around an existing Rx buffer, taking care
  * to set up the skb correctly and avoid any memcpy overhead.
  */
-static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
-				      struct iavf_rx_buffer *rx_buffer,
+static struct sk_buff *iavf_build_skb(const struct libie_rx_buffer *rx_buffer,
 				      unsigned int size)
 {
-	void *va;
-#if (PAGE_SIZE < 8192)
-	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
-#else
-	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
-				SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
-#endif
+	struct page *page = rx_buffer->page;
+	u32 hr = page->pp->p.offset;
 	struct sk_buff *skb;
+	void *va;
 
-	if (!rx_buffer || !size)
-		return NULL;
 	/* prefetch first cache line of first page */
-	va = page_address(rx_buffer->page) + rx_buffer->page_offset;
-	net_prefetch(va);
+	va = page_address(page) + rx_buffer->offset;
+	net_prefetch(va + hr);
 
 	/* build an skb around the page buffer */
-	skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
-	if (unlikely(!skb))
+	skb = napi_build_skb(va, rx_buffer->truesize);
+	if (unlikely(!skb)) {
+		page_pool_put_page(page->pp, page, size, true);
 		return NULL;
+	}
+
+	skb_mark_for_recycle(skb);
 
 	/* update pointers within the skb to store the data */
-	skb_reserve(skb, IAVF_SKB_PAD);
+	skb_reserve(skb, hr);
 	__skb_put(skb, size);
 
 	return skb;
 }
 
-/**
- * iavf_put_rx_buffer - Unmap used buffer
- * @rx_ring: rx descriptor ring to transact packets on
- * @rx_buffer: rx buffer to pull data from
- *
- * This function will unmap the buffer after it's written by HW.
- */
-static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
-			       struct iavf_rx_buffer *rx_buffer)
-{
-	if (!rx_buffer)
-		return;
-
-	/* we are not reusing the buffer so unmap it */
-	dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
-			     iavf_rx_pg_size(rx_ring),
-			     DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
-}
-
 /**
  * iavf_is_non_eop - process handling of non-EOP buffers
  * @rx_ring: Rx ring being processed
@@ -1259,7 +1138,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 	bool failure = false;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
-		struct iavf_rx_buffer *rx_buffer;
+		struct libie_rx_buffer *rx_buffer;
 		union iavf_rx_desc *rx_desc;
 		unsigned int size;
 		u16 vlan_tag = 0;
@@ -1295,26 +1174,24 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 		       IAVF_RXD_QW1_LENGTH_PBUF_SHIFT;
 
 		iavf_trace(clean_rx_irq, rx_ring, rx_desc, skb);
-		rx_buffer = iavf_get_rx_buffer(rx_ring, size);
+
+		rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+		if (!libie_rx_sync_for_cpu(rx_buffer, size))
+			goto skip_data;
 
 		/* retrieve a buffer from the ring */
 		if (skb)
-			iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
+			iavf_add_rx_frag(skb, rx_buffer, size);
 		else
-			skb = iavf_build_skb(rx_ring, rx_buffer, size);
-
-		iavf_put_rx_buffer(rx_ring, rx_buffer);
+			skb = iavf_build_skb(rx_buffer, size);
 
 		/* exit if we failed to retrieve a buffer */
 		if (!skb) {
 			rx_ring->rx_stats.alloc_buff_failed++;
-			__free_pages(rx_buffer->page,
-				     iavf_rx_pg_order(rx_ring));
-			rx_buffer->page = NULL;
 			break;
 		}
 
-		rx_buffer->page = NULL;
+skip_data:
 		cleaned_count++;
 
 		if (iavf_is_non_eop(rx_ring, rx_desc, skb))
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index 2170a77f8c8d..b13d878c74c6 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -81,70 +81,8 @@ enum iavf_dyn_idx_t {
 	BIT_ULL(IAVF_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \
 	BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))
 
-/* Supported Rx Buffer Sizes (a multiple of 128) */
-#define IAVF_RXBUFFER_1536  1536  /* 128B aligned standard Ethernet frame */
-#define IAVF_RXBUFFER_2048  2048
-#define IAVF_RXBUFFER_3072  3072  /* Used for large frames w/ padding */
-#define IAVF_MAX_RXBUFFER   9728  /* largest size for single descriptor */
-
-#define IAVF_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))
 #define iavf_rx_desc iavf_32byte_rx_desc
 
-#define IAVF_RX_DMA_ATTR \
-	(DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
-
-/* Attempt to maximize the headroom available for incoming frames.  We
- * use a 2K buffer for receives and need 1536/1534 to store the data for
- * the frame.  This leaves us with 512 bytes of room.  From that we need
- * to deduct the space needed for the shared info and the padding needed
- * to IP align the frame.
- *
- * Note: For cache line sizes 256 or larger this value is going to end
- *	 up negative.  In these cases we should fall back to the legacy
- *	 receive path.
- */
-#if (PAGE_SIZE < 8192)
-#define IAVF_2K_TOO_SMALL_WITH_PADDING \
-((NET_SKB_PAD + IAVF_RXBUFFER_1536) > SKB_WITH_OVERHEAD(IAVF_RXBUFFER_2048))
-
-static inline int iavf_compute_pad(int rx_buf_len)
-{
-	int page_size, pad_size;
-
-	page_size = ALIGN(rx_buf_len, PAGE_SIZE / 2);
-	pad_size = SKB_WITH_OVERHEAD(page_size) - rx_buf_len;
-
-	return pad_size;
-}
-
-static inline int iavf_skb_pad(void)
-{
-	int rx_buf_len;
-
-	/* If a 2K buffer cannot handle a standard Ethernet frame then
-	 * optimize padding for a 3K buffer instead of a 1.5K buffer.
-	 *
-	 * For a 3K buffer we need to add enough padding to allow for
-	 * tailroom due to NET_IP_ALIGN possibly shifting us out of
-	 * cache-line alignment.
-	 */
-	if (IAVF_2K_TOO_SMALL_WITH_PADDING)
-		rx_buf_len = IAVF_RXBUFFER_3072 + SKB_DATA_ALIGN(NET_IP_ALIGN);
-	else
-		rx_buf_len = IAVF_RXBUFFER_1536;
-
-	/* if needed make room for NET_IP_ALIGN */
-	rx_buf_len -= NET_IP_ALIGN;
-
-	return iavf_compute_pad(rx_buf_len);
-}
-
-#define IAVF_SKB_PAD iavf_skb_pad()
-#else
-#define IAVF_2K_TOO_SMALL_WITH_PADDING false
-#define IAVF_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#endif
-
 /**
  * iavf_test_staterr - tests bits in Rx descriptor status and error fields
  * @rx_desc: pointer to receive descriptor (in le64 format)
@@ -263,12 +201,6 @@ struct iavf_tx_buffer {
 	u32 tx_flags;
 };
 
-struct iavf_rx_buffer {
-	dma_addr_t dma;
-	struct page *page;
-	__u32 page_offset;
-};
-
 struct iavf_queue_stats {
 	u64 packets;
 	u64 bytes;
@@ -311,16 +243,19 @@ enum iavf_ring_state_t {
 struct iavf_ring {
 	struct iavf_ring *next;		/* pointer to next ring in q_vector */
 	void *desc;			/* Descriptor ring memory */
-	struct device *dev;		/* Used for DMA mapping */
+	union {
+		struct page_pool *pp;	/* Used on Rx for buffer management */
+		struct device *dev;	/* Used on Tx for DMA mapping */
+	};
 	struct net_device *netdev;	/* netdev ring maps to */
 	union {
+		struct libie_rx_buffer *rx_bi;
 		struct iavf_tx_buffer *tx_bi;
-		struct iavf_rx_buffer *rx_bi;
 	};
 	DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
+	u8 __iomem *tail;
 	u16 queue_index;		/* Queue number of ring */
 	u8 dcb_tc;			/* Traffic class of ring */
-	u8 __iomem *tail;
 
 	/* high bit set means dynamic, use accessors routines to read/write.
 	 * hardware only supports 2us resolution for the ITR registers.
@@ -329,9 +264,8 @@ struct iavf_ring {
 	 */
 	u16 itr_setting;
 
-	u16 count;			/* Number of descriptors */
 	u16 reg_idx;			/* HW register index of the ring */
-	u16 rx_buf_len;
+	u16 count;			/* Number of descriptors */
 
 	/* used in interrupt processing */
 	u16 next_to_use;
@@ -398,17 +332,6 @@ struct iavf_ring_container {
 #define iavf_for_each_ring(pos, head) \
 	for (pos = (head).ring; pos != NULL; pos = pos->next)
 
-static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
-{
-#if (PAGE_SIZE < 8192)
-	if (ring->rx_buf_len > (PAGE_SIZE / 2))
-		return 1;
-#endif
-	return 0;
-}
-
-#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
-
 bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
 netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h
index 3030ba330326..bb90d8f3ad7e 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_type.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_type.h
@@ -10,8 +10,6 @@
 #include "iavf_adminq.h"
 #include "iavf_devids.h"
 
-#define IAVF_RXQ_CTX_DBUFF_SHIFT 7
-
 /* IAVF_MASK is a macro used on 32 bit registers */
 #define IAVF_MASK(mask, shift) ((u32)(mask) << (shift))
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index fdddc3588487..5341508b429f 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2013 - 2018 Intel Corporation. */
 
+#include <linux/net/intel/libie/rx.h>
+
 #include "iavf.h"
 #include "iavf_prototype.h"
 #include "iavf_client.h"
@@ -269,14 +271,13 @@ int iavf_get_vf_vlan_v2_caps(struct iavf_adapter *adapter)
 void iavf_configure_queues(struct iavf_adapter *adapter)
 {
 	struct virtchnl_vsi_queue_config_info *vqci;
-	int i, max_frame = adapter->vf_res->max_mtu;
+	u32 i, max_frame = adapter->vf_res->max_mtu;
+	struct iavf_ring *rxq = adapter->rx_rings;
 	int pairs = adapter->num_active_queues;
 	struct virtchnl_queue_pair_info *vqpi;
+	u32 hr, max_len;
 	size_t len;
 
-	if (max_frame > IAVF_MAX_RXBUFFER || !max_frame)
-		max_frame = IAVF_MAX_RXBUFFER;
-
 	if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
 		/* bail because we already have a command pending */
 		dev_err(&adapter->pdev->dev, "Cannot configure queues, command %d pending\n",
@@ -289,9 +290,9 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
 	if (!vqci)
 		return;
 
-	/* Limit maximum frame size when jumbo frames is not enabled */
-	if (adapter->netdev->mtu <= ETH_DATA_LEN)
-		max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
+	hr = rxq->pp->p.offset;
+	max_len = rxq->pp->p.max_len;
+	max_frame = min_not_zero(max_frame, LIBIE_MAX_RX_FRM_LEN(hr));
 
 	vqci->vsi_id = adapter->vsi_res->vsi_id;
 	vqci->num_queue_pairs = pairs;
@@ -309,9 +310,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
 		vqpi->rxq.ring_len = adapter->rx_rings[i].count;
 		vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
 		vqpi->rxq.max_pkt_size = max_frame;
-		vqpi->rxq.databuffer_size =
-			ALIGN(adapter->rx_rings[i].rx_buf_len,
-			      BIT_ULL(IAVF_RXQ_CTX_DBUFF_SHIFT));
+		vqpi->rxq.databuffer_size = max_len;
 		vqpi++;
 	}
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 7/9] libie: add common queue stats
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (5 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 8/9] libie: add per-queue Page Pool stats Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 9/9] iavf: switch queue stats to libie Alexander Lobakin
  8 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Next stop, per-queue private stats. They have only subtle differences
from driver to driver, which can easily be resolved.
Define common structures, inline helpers and Ethtool helpers to collect,
update and export the statistics. Use u64_stats_t right from the start,
as well as the corresponding helpers, to ensure tear-free operation.
For the NAPI parts of both Rx and Tx, also define small onstack
containers to update in the polling loops and then sync into the actual
containers once a loop ends.
The drivers will be switched to use this API later on a per-driver
basis, along with conversion to PP.
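A rough sketch of the intended NAPI usage; the queue structure and
helper names here are illustrative and not taken from any driver
(&libie_rq_stats::syncp is assumed to be u64_stats_init()'ed at queue
creation time):

	#include <linux/net/intel/libie/stats.h>

	struct my_rx_queue {
		struct libie_rq_stats	rq_stats;
	};

	/* Stand-in for the actual descriptor processing, returns frame len */
	static u32 my_rx_process_one(struct my_rx_queue *q)
	{
		return 128;
	}

	static void my_rx_poll(struct my_rx_queue *q, u32 budget)
	{
		struct libie_rq_onstack_stats rs = { };

		for (u32 i = 0; i < budget; i++) {
			u32 len = my_rx_process_one(q);

			if (!len) {
				/* Exception path: update the queue counter
				 * directly, under the u64_stats syncp.
				 */
				libie_stats_inc_one(&q->rq_stats,
						    alloc_page_fail);
				break;
			}

			/* Plain u32 increments in the hot loop */
			rs.packets++;
			rs.bytes += len;
		}

		/* One protected writeback once the polling is done */
		libie_rq_napi_stats_add(&q->rq_stats, &rs);
	}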

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/libie/Makefile |   1 +
 drivers/net/ethernet/intel/libie/stats.c  | 119 ++++++++++++++
 include/linux/net/intel/libie/stats.h     | 179 ++++++++++++++++++++++
 3 files changed, 299 insertions(+)
 create mode 100644 drivers/net/ethernet/intel/libie/stats.c
 create mode 100644 include/linux/net/intel/libie/stats.h

diff --git a/drivers/net/ethernet/intel/libie/Makefile b/drivers/net/ethernet/intel/libie/Makefile
index 95e81d09b474..76f32253481b 100644
--- a/drivers/net/ethernet/intel/libie/Makefile
+++ b/drivers/net/ethernet/intel/libie/Makefile
@@ -4,3 +4,4 @@
 obj-$(CONFIG_LIBIE)	+= libie.o
 
 libie-objs		+= rx.o
+libie-objs		+= stats.o
diff --git a/drivers/net/ethernet/intel/libie/stats.c b/drivers/net/ethernet/intel/libie/stats.c
new file mode 100644
index 000000000000..61456842a362
--- /dev/null
+++ b/drivers/net/ethernet/intel/libie/stats.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2023 Intel Corporation. */
+
+#include <linux/ethtool.h>
+#include <linux/net/intel/libie/stats.h>
+
+/* Rx per-queue stats */
+
+static const char * const libie_rq_stats_str[] = {
+#define act(s)	__stringify(s),
+	DECLARE_LIBIE_RQ_STATS(act)
+#undef act
+};
+
+#define LIBIE_RQ_STATS_NUM	ARRAY_SIZE(libie_rq_stats_str)
+
+/**
+ * libie_rq_stats_get_sset_count - get the number of Ethtool RQ stats provided
+ *
+ * Returns the number of per-queue Rx stats supported by the library.
+ */
+u32 libie_rq_stats_get_sset_count(void)
+{
+	return LIBIE_RQ_STATS_NUM;
+}
+EXPORT_SYMBOL_NS_GPL(libie_rq_stats_get_sset_count, LIBIE);
+
+/**
+ * libie_rq_stats_get_strings - get the name strings of Ethtool RQ stats
+ * @data: reference to the cursor pointing to the output buffer
+ * @qid: RQ number to print in the prefix
+ */
+void libie_rq_stats_get_strings(u8 **data, u32 qid)
+{
+	for (u32 i = 0; i < LIBIE_RQ_STATS_NUM; i++)
+		ethtool_sprintf(data, "rq%u_%s", qid, libie_rq_stats_str[i]);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rq_stats_get_strings, LIBIE);
+
+/**
+ * libie_rq_stats_get_data - get the RQ stats in Ethtool format
+ * @data: reference to the cursor pointing to the output array
+ * @stats: RQ stats container from the queue
+ */
+void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats)
+{
+	u64 sarr[LIBIE_RQ_STATS_NUM];
+	u32 start;
+
+	do {
+		start = u64_stats_fetch_begin(&stats->syncp);
+
+		for (u32 i = 0; i < LIBIE_RQ_STATS_NUM; i++)
+			sarr[i] = u64_stats_read(&stats->raw[i]);
+	} while (u64_stats_fetch_retry(&stats->syncp, start));
+
+	for (u32 i = 0; i < LIBIE_RQ_STATS_NUM; i++)
+		(*data)[i] += sarr[i];
+
+	*data += LIBIE_RQ_STATS_NUM;
+}
+EXPORT_SYMBOL_NS_GPL(libie_rq_stats_get_data, LIBIE);
+
+/* Tx per-queue stats */
+
+static const char * const libie_sq_stats_str[] = {
+#define act(s)	__stringify(s),
+	DECLARE_LIBIE_SQ_STATS(act)
+#undef act
+};
+
+#define LIBIE_SQ_STATS_NUM	ARRAY_SIZE(libie_sq_stats_str)
+
+/**
+ * libie_sq_stats_get_sset_count - get the number of Ethtool SQ stats provided
+ *
+ * Returns the number of per-queue Tx stats supported by the library.
+ */
+u32 libie_sq_stats_get_sset_count(void)
+{
+	return LIBIE_SQ_STATS_NUM;
+}
+EXPORT_SYMBOL_NS_GPL(libie_sq_stats_get_sset_count, LIBIE);
+
+/**
+ * libie_sq_stats_get_strings - get the name strings of Ethtool SQ stats
+ * @data: reference to the cursor pointing to the output buffer
+ * @qid: SQ number to print in the prefix
+ */
+void libie_sq_stats_get_strings(u8 **data, u32 qid)
+{
+	for (u32 i = 0; i < LIBIE_SQ_STATS_NUM; i++)
+		ethtool_sprintf(data, "sq%u_%s", qid, libie_sq_stats_str[i]);
+}
+EXPORT_SYMBOL_NS_GPL(libie_sq_stats_get_strings, LIBIE);
+
+/**
+ * libie_sq_stats_get_data - get the SQ stats in Ethtool format
+ * @data: reference to the cursor pointing to the output array
+ * @stats: SQ stats container from the queue
+ */
+void libie_sq_stats_get_data(u64 **data, const struct libie_sq_stats *stats)
+{
+	u64 sarr[LIBIE_SQ_STATS_NUM];
+	u32 start;
+
+	do {
+		start = u64_stats_fetch_begin(&stats->syncp);
+
+		for (u32 i = 0; i < LIBIE_SQ_STATS_NUM; i++)
+			sarr[i] = u64_stats_read(&stats->raw[i]);
+	} while (u64_stats_fetch_retry(&stats->syncp, start));
+
+	for (u32 i = 0; i < LIBIE_SQ_STATS_NUM; i++)
+		(*data)[i] += sarr[i];
+
+	*data += LIBIE_SQ_STATS_NUM;
+}
+EXPORT_SYMBOL_NS_GPL(libie_sq_stats_get_data, LIBIE);
diff --git a/include/linux/net/intel/libie/stats.h b/include/linux/net/intel/libie/stats.h
new file mode 100644
index 000000000000..dbbc98bbd3a7
--- /dev/null
+++ b/include/linux/net/intel/libie/stats.h
@@ -0,0 +1,179 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2023 Intel Corporation. */
+
+#ifndef __LIBIE_STATS_H
+#define __LIBIE_STATS_H
+
+#include <linux/u64_stats_sync.h>
+
+/* Common */
+
+/* Use 32-byte alignment to reduce false sharing */
+#define __libie_stats_aligned	__aligned(4 * sizeof(u64_stats_t))
+
+/**
+ * libie_stats_add - update one structure counter from a local struct
+ * @qs: queue stats structure to update (&libie_rq_stats or &libie_sq_stats)
+ * @ss: local/onstack stats structure
+ * @f: name of the field to update
+ *
+ * If a local/onstack stats structure is used to collect statistics during
+ * hotpath loops, this macro can be used to shorthand updates, given that
+ * the fields have the same name.
+ * Must be guarded with u64_stats_update_{begin,end}().
+ */
+#define libie_stats_add(qs, ss, f)			\
+	u64_stats_add(&(qs)->f, (ss)->f)
+
+/**
+ * __libie_stats_inc_one - safely increment one stats structure counter
+ * @s: queue stats structure to update (&libie_rq_stats or &libie_sq_stats)
+ * @f: name of the field to increment
+ * @n: name of the temporary variable, result of __UNIQUE_ID()
+ *
+ * To be used on exception or slow paths -- allocation fails, queue stops etc.
+ */
+#define __libie_stats_inc_one(s, f, n) ({		\
+	typeof(*(s)) *n = (s);				\
+							\
+	u64_stats_update_begin(&n->syncp);		\
+	u64_stats_inc(&n->f);				\
+	u64_stats_update_end(&n->syncp);		\
+})
+#define libie_stats_inc_one(s, f)			\
+	__libie_stats_inc_one(s, f, __UNIQUE_ID(qs_))
+
+/* Rx per-queue stats:
+ * packets: packets received on this queue
+ * bytes: bytes received on this queue
+ * fragments: number of processed descriptors carrying only a fragment
+ * alloc_page_fail: number of Rx page allocation fails
+ * build_skb_fail: number of build_skb() fails
+ */
+
+#define DECLARE_LIBIE_RQ_NAPI_STATS(act)		\
+	act(packets)					\
+	act(bytes)					\
+	act(fragments)
+
+#define DECLARE_LIBIE_RQ_FAIL_STATS(act)		\
+	act(alloc_page_fail)				\
+	act(build_skb_fail)
+
+#define DECLARE_LIBIE_RQ_STATS(act)			\
+	DECLARE_LIBIE_RQ_NAPI_STATS(act)		\
+	DECLARE_LIBIE_RQ_FAIL_STATS(act)
+
+struct libie_rq_stats {
+	struct u64_stats_sync	syncp;
+
+	union {
+		struct {
+#define act(s)	u64_stats_t	s;
+			DECLARE_LIBIE_RQ_NAPI_STATS(act);
+			DECLARE_LIBIE_RQ_FAIL_STATS(act);
+#undef act
+		};
+		DECLARE_FLEX_ARRAY(u64_stats_t, raw);
+	};
+} __libie_stats_aligned;
+
+/* Rx stats being modified frequently during the NAPI polling, to sync them
+ * with the queue stats once after the loop is finished.
+ */
+struct libie_rq_onstack_stats {
+	union {
+		struct {
+#define act(s)	u32		s;
+			DECLARE_LIBIE_RQ_NAPI_STATS(act);
+#undef act
+		};
+		DECLARE_FLEX_ARRAY(u32, raw);
+	};
+};
+
+/**
+ * libie_rq_napi_stats_add - add onstack Rx stats to the queue container
+ * @qs: Rx queue stats structure to update
+ * @ss: onstack structure to get the values from, updated during the NAPI loop
+ */
+static inline void
+libie_rq_napi_stats_add(struct libie_rq_stats *qs,
+			const struct libie_rq_onstack_stats *ss)
+{
+	u64_stats_update_begin(&qs->syncp);
+	libie_stats_add(qs, ss, packets);
+	libie_stats_add(qs, ss, bytes);
+	libie_stats_add(qs, ss, fragments);
+	u64_stats_update_end(&qs->syncp);
+}
+
+u32 libie_rq_stats_get_sset_count(void);
+void libie_rq_stats_get_strings(u8 **data, u32 qid);
+void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats);
+
+/* Tx per-queue stats:
+ * packets: packets sent from this queue
+ * bytes: bytes sent from this queue
+ * busy: number of xmit failures due to the ring being full
+ * stops: number times the ring was stopped from the driver
+ * restarts: number times it was started after being stopped
+ * linearized: number of skbs linearized due to HW limits
+ */
+
+#define DECLARE_LIBIE_SQ_NAPI_STATS(act)		\
+	act(packets)					\
+	act(bytes)
+
+#define DECLARE_LIBIE_SQ_XMIT_STATS(act)		\
+	act(busy)					\
+	act(stops)					\
+	act(restarts)					\
+	act(linearized)
+
+#define DECLARE_LIBIE_SQ_STATS(act)			\
+	DECLARE_LIBIE_SQ_NAPI_STATS(act)		\
+	DECLARE_LIBIE_SQ_XMIT_STATS(act)
+
+struct libie_sq_stats {
+	struct u64_stats_sync	syncp;
+
+	union {
+		struct {
+#define act(s)	u64_stats_t	s;
+			DECLARE_LIBIE_SQ_STATS(act);
+#undef act
+		};
+		DECLARE_FLEX_ARRAY(u64_stats_t, raw);
+	};
+} __libie_stats_aligned;
+
+struct libie_sq_onstack_stats {
+#define act(s)	u32		s;
+	DECLARE_LIBIE_SQ_NAPI_STATS(act);
+#undef act
+};
+
+/**
+ * libie_sq_napi_stats_add - add onstack Tx stats to the queue container
+ * @qs: Tx queue stats structure to update
+ * @ss: onstack structure to get the values from, updated during the NAPI loop
+ */
+static inline void
+libie_sq_napi_stats_add(struct libie_sq_stats *qs,
+			const struct libie_sq_onstack_stats *ss)
+{
+	if (unlikely(!ss->packets))
+		return;
+
+	u64_stats_update_begin(&qs->syncp);
+	libie_stats_add(qs, ss, packets);
+	libie_stats_add(qs, ss, bytes);
+	u64_stats_update_end(&qs->syncp);
+}
+
+u32 libie_sq_stats_get_sset_count(void);
+void libie_sq_stats_get_strings(u8 **data, u32 qid);
+void libie_sq_stats_get_data(u64 **data, const struct libie_sq_stats *stats);
+
+#endif /* __LIBIE_STATS_H */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 8/9] libie: add per-queue Page Pool stats
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (6 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 7/9] libie: add common queue stats Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  2023-07-05 15:55 ` [PATCH RFC net-next v4 9/9] iavf: switch queue stats to libie Alexander Lobakin
  8 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

Expand the libie generic per-queue stats with the generic Page Pool
stats provided by the API itself, when CONFIG_PAGE_POOL_STATS is
enabled. When it's not, there are no such fields in the stats
structure, so no space is wasted.
They are also a bit special in terms of how they are obtained. A
&page_pool obviously accumulates statistics only until it's destroyed,
which happens on ifdown. So, in order not to lose any statistics,
read the stats and store them in the queue container before destroying
a pool. This container survives ifups/ifdowns, so it basically stores
the statistics accumulated since the very first pool was allocated
on this queue. When the stats need to be exported, first take the
numbers from this container and then add the "live" numbers -- the
ones that the currently active pool returns. The resulting values
always represent the actual device-lifetime* stats.
There's a cast from &page_pool_stats to `u64 *` in a couple of
functions, but it is guarded with static asserts to make sure it's
safe to do. FWIW it saves a lot of object code.
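In a converted driver, the ifdown path is then expected to boil down
to something like this (the queue structure and field names are
assumptions):

	#include <linux/net/intel/libie/rx.h>
	#include <linux/net/intel/libie/stats.h>

	struct my_rx_queue {
		struct page_pool	*pp;
		/* Lives as long as the netdev, not as long as the pool */
		struct libie_rq_stats	rq_stats;
	};

	static void my_rx_queue_free(struct my_rx_queue *q)
	{
		/* Saves the stats of the dying pool into @rq_stats, then
		 * destroys the pool itself.
		 */
		libie_rx_page_pool_destroy(q->pp, &q->rq_stats);
		q->pp = NULL;
	}

On the export side, libie_rq_stats_get_data() now also takes the
currently active &page_pool (or %NULL) to add its "live" numbers on
top of what is stored in the container.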

Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/libie/internal.h | 23 +++++++
 drivers/net/ethernet/intel/libie/rx.c       | 19 ++++++
 drivers/net/ethernet/intel/libie/stats.c    | 73 ++++++++++++++++++++-
 include/linux/net/intel/libie/rx.h          |  4 ++
 include/linux/net/intel/libie/stats.h       | 39 ++++++++++-
 5 files changed, 155 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/libie/internal.h

diff --git a/drivers/net/ethernet/intel/libie/internal.h b/drivers/net/ethernet/intel/libie/internal.h
new file mode 100644
index 000000000000..083398dc37c6
--- /dev/null
+++ b/drivers/net/ethernet/intel/libie/internal.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* libie internal declarations not to be used in drivers.
+ *
+ * Copyright(c) 2023 Intel Corporation.
+ */
+
+#ifndef __LIBIE_INTERNAL_H
+#define __LIBIE_INTERNAL_H
+
+struct libie_rq_stats;
+struct page_pool;
+
+#ifdef CONFIG_PAGE_POOL_STATS
+void libie_rq_stats_sync_pp(struct libie_rq_stats *stats,
+			    struct page_pool *pool);
+#else
+static inline void libie_rq_stats_sync_pp(struct libie_rq_stats *stats,
+					  struct page_pool *pool)
+{
+}
+#endif
+
+#endif /* __LIBIE_INTERNAL_H */
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
index c60d7b20ed20..0c26bd066587 100644
--- a/drivers/net/ethernet/intel/libie/rx.c
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2023 Intel Corporation. */
 
 #include <linux/net/intel/libie/rx.h>
+#include "internal.h"
 
 /* Rx buffer management */
 
@@ -57,6 +58,24 @@ struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
 }
 EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
 
+/**
+ * libie_rx_page_pool_destroy - destroy a &page_pool created by libie
+ * @pool: pool to destroy
+ * @stats: RQ stats from the ring (or %NULL to skip updating PP stats)
+ *
+ * As the stats usually has the same lifetime as the device, but PP is usually
+ * created/destroyed on ifup/ifdown, in order to not lose the stats accumulated
+ * during the last ifup, the PP stats need to be added to the driver stats
+ * container. Then the PP gets destroyed.
+ */
+void libie_rx_page_pool_destroy(struct page_pool *pool,
+				struct libie_rq_stats *stats)
+{
+	libie_rq_stats_sync_pp(stats, pool);
+	page_pool_destroy(pool);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_destroy, LIBIE);
+
 /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
  * bitfield struct.
  */
diff --git a/drivers/net/ethernet/intel/libie/stats.c b/drivers/net/ethernet/intel/libie/stats.c
index 61456842a362..71c7ce14edca 100644
--- a/drivers/net/ethernet/intel/libie/stats.c
+++ b/drivers/net/ethernet/intel/libie/stats.c
@@ -3,6 +3,9 @@
 
 #include <linux/ethtool.h>
 #include <linux/net/intel/libie/stats.h>
+#include <net/page_pool.h>
+
+#include "internal.h"
 
 /* Rx per-queue stats */
 
@@ -14,6 +17,70 @@ static const char * const libie_rq_stats_str[] = {
 
 #define LIBIE_RQ_STATS_NUM	ARRAY_SIZE(libie_rq_stats_str)
 
+#ifdef CONFIG_PAGE_POOL_STATS
+/**
+ * libie_rq_stats_get_pp - get the current stats from a &page_pool
+ * @sarr: local array to add stats to
+ * @pool: pool to get the stats from
+ *
+ * Adds the current "live" stats from an online PP to the stats read from
+ * the RQ container, so that the actual totals will be returned.
+ */
+static void libie_rq_stats_get_pp(u64 *sarr, struct page_pool *pool)
+{
+	struct page_pool_stats *pps;
+	/* Used only to calculate pos below */
+	struct libie_rq_stats tmp;
+	u32 pos;
+
+	/* Validate the libie PP stats array can be casted <-> PP struct */
+	static_assert(sizeof(tmp.pp) == sizeof(*pps));
+
+	if (!pool)
+		return;
+
+	/* Position of the first Page Pool stats field */
+	pos = (u64_stats_t *)&tmp.pp - tmp.raw;
+	pps = (typeof(pps))&sarr[pos];
+
+	page_pool_get_stats(pool, pps);
+}
+
+/**
+ * libie_rq_stats_sync_pp - add the current PP stats to the RQ stats container
+ * @stats: stats structure to update
+ * @pool: pool to read the stats
+ *
+ * Called by libie_rx_page_pool_destroy() to save the stats before destroying
+ * the pool.
+ */
+void libie_rq_stats_sync_pp(struct libie_rq_stats *stats,
+			    struct page_pool *pool)
+{
+	u64_stats_t *qarr = (u64_stats_t *)&stats->pp;
+	struct page_pool_stats pps = { };
+	u64 *sarr = (u64 *)&pps;
+
+	if (!stats)
+		return;
+
+	page_pool_get_stats(pool, &pps);
+
+	u64_stats_update_begin(&stats->syncp);
+
+	for (u32 i = 0; i < sizeof(pps) / sizeof(*sarr); i++)
+		u64_stats_add(&qarr[i], sarr[i]);
+
+	u64_stats_update_end(&stats->syncp);
+}
+#else
+static void libie_rq_stats_get_pp(u64 *sarr, struct page_pool *pool)
+{
+}
+
+/* static inline void libie_rq_stats_sync_pp() is declared in "internal.h" */
+#endif
+
 /**
  * libie_rq_stats_get_sset_count - get the number of Ethtool RQ stats provided
  *
@@ -41,8 +108,10 @@ EXPORT_SYMBOL_NS_GPL(libie_rq_stats_get_strings, LIBIE);
  * libie_rq_stats_get_data - get the RQ stats in Ethtool format
  * @data: reference to the cursor pointing to the output array
  * @stats: RQ stats container from the queue
+ * @pool: &page_pool from the queue (%NULL to ignore PP "live" stats)
  */
-void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats)
+void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats,
+			     struct page_pool *pool)
 {
 	u64 sarr[LIBIE_RQ_STATS_NUM];
 	u32 start;
@@ -54,6 +123,8 @@ void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats)
 			sarr[i] = u64_stats_read(&stats->raw[i]);
 	} while (u64_stats_fetch_retry(&stats->syncp, start));
 
+	libie_rq_stats_get_pp(sarr, pool);
+
 	for (u32 i = 0; i < LIBIE_RQ_STATS_NUM; i++)
 		(*data)[i] += sarr[i];
 
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index 8c0ccdff9a37..c6c85f956f95 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -62,8 +62,12 @@ struct libie_rx_buffer {
 	u32		truesize;
 };
 
+struct libie_rq_stats;
+
 struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
 					    u32 size);
+void libie_rx_page_pool_destroy(struct page_pool *pool,
+				struct libie_rq_stats *stats);
 
 /**
  * libie_rx_alloc - allocate a new Rx buffer
diff --git a/include/linux/net/intel/libie/stats.h b/include/linux/net/intel/libie/stats.h
index dbbc98bbd3a7..23ca0079a905 100644
--- a/include/linux/net/intel/libie/stats.h
+++ b/include/linux/net/intel/libie/stats.h
@@ -49,6 +49,17 @@
  * fragments: number of processed descriptors carrying only a fragment
  * alloc_page_fail: number of Rx page allocation fails
  * build_skb_fail: number of build_skb() fails
+ * pp_alloc_fast: pages taken from the cache or ring
+ * pp_alloc_slow: actual page allocations
+ * pp_alloc_slow_ho: non-order-0 page allocations
+ * pp_alloc_empty: number of times the pool was empty
+ * pp_alloc_refill: number of cache refills
+ * pp_alloc_waive: NUMA node mismatches during recycling
+ * pp_recycle_cached: direct recyclings into the cache
+ * pp_recycle_cache_full: number of times the cache was full
+ * pp_recycle_ring: recyclings into the ring
+ * pp_recycle_ring_full: number of times the ring was full
+ * pp_recycle_released_ref: pages released due to elevated refcnt
  */
 
 #define DECLARE_LIBIE_RQ_NAPI_STATS(act)		\
@@ -60,9 +71,29 @@
 	act(alloc_page_fail)				\
 	act(build_skb_fail)
 
+#ifdef CONFIG_PAGE_POOL_STATS
+#define DECLARE_LIBIE_RQ_PP_STATS(act)			\
+	act(pp_alloc_fast)				\
+	act(pp_alloc_slow)				\
+	act(pp_alloc_slow_ho)				\
+	act(pp_alloc_empty)				\
+	act(pp_alloc_refill)				\
+	act(pp_alloc_waive)				\
+	act(pp_recycle_cached)				\
+	act(pp_recycle_cache_full)			\
+	act(pp_recycle_ring)				\
+	act(pp_recycle_ring_full)			\
+	act(pp_recycle_released_ref)
+#else
+#define DECLARE_LIBIE_RQ_PP_STATS(act)
+#endif
+
 #define DECLARE_LIBIE_RQ_STATS(act)			\
 	DECLARE_LIBIE_RQ_NAPI_STATS(act)		\
-	DECLARE_LIBIE_RQ_FAIL_STATS(act)
+	DECLARE_LIBIE_RQ_FAIL_STATS(act)		\
+	DECLARE_LIBIE_RQ_PP_STATS(act)
+
+struct page_pool;
 
 struct libie_rq_stats {
 	struct u64_stats_sync	syncp;
@@ -72,6 +103,9 @@ struct libie_rq_stats {
 #define act(s)	u64_stats_t	s;
 			DECLARE_LIBIE_RQ_NAPI_STATS(act);
 			DECLARE_LIBIE_RQ_FAIL_STATS(act);
+			struct_group(pp,
+				DECLARE_LIBIE_RQ_PP_STATS(act);
+			);
 #undef act
 		};
 		DECLARE_FLEX_ARRAY(u64_stats_t, raw);
@@ -110,7 +144,8 @@ libie_rq_napi_stats_add(struct libie_rq_stats *qs,
 
 u32 libie_rq_stats_get_sset_count(void);
 void libie_rq_stats_get_strings(u8 **data, u32 qid);
-void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats);
+void libie_rq_stats_get_data(u64 **data, const struct libie_rq_stats *stats,
+			     struct page_pool *pool);
 
 /* Tx per-queue stats:
  * packets: packets sent from this queue
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH RFC net-next v4 9/9] iavf: switch queue stats to libie
  2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
                   ` (7 preceding siblings ...)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 8/9] libie: add per-queue Page Pool stats Alexander Lobakin
@ 2023-07-05 15:55 ` Alexander Lobakin
  8 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-05 15:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Alexander Lobakin, Maciej Fijalkowski, Michal Kubiak,
	Larysa Zaremba, Alexander Duyck, Yunsheng Lin, David Christensen,
	Jesper Dangaard Brouer, Ilias Apalodimas, Paul Menzel, netdev,
	intel-wired-lan, linux-kernel

iavf is pretty much ready to use the generic libie stats, so drop all
the custom code and just use the generic definitions. The only missing
piece is a counter of Tx queue stops: it's present in the other
drivers, so add it here as well.
The rest is straightforward. There were two fields in the Tx stats
struct which didn't belong there: the first one was never used, so wipe
it; move the other one to the queue structure. Also move around a
couple of fields in &iavf_ring to account for the stats structs'
alignment.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 .../net/ethernet/intel/iavf/iavf_ethtool.c    | 86 ++++-------------
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  2 +
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 92 ++++++++++---------
 drivers/net/ethernet/intel/iavf/iavf_txrx.h   | 54 +++--------
 4 files changed, 83 insertions(+), 151 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index de3050c02b6f..cdf09c5412a0 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -46,16 +46,6 @@ struct iavf_stats {
 	.stat_offset = offsetof(_type, _stat) \
 }
 
-/* Helper macro for defining some statistics related to queues */
-#define IAVF_QUEUE_STAT(_name, _stat) \
-	IAVF_STAT(struct iavf_ring, _name, _stat)
-
-/* Stats associated with a Tx or Rx ring */
-static const struct iavf_stats iavf_gstrings_queue_stats[] = {
-	IAVF_QUEUE_STAT("%s-%u.packets", stats.packets),
-	IAVF_QUEUE_STAT("%s-%u.bytes", stats.bytes),
-};
-
 /**
  * iavf_add_one_ethtool_stat - copy the stat into the supplied buffer
  * @data: location to store the stat value
@@ -141,43 +131,6 @@ __iavf_add_ethtool_stats(u64 **data, void *pointer,
 #define iavf_add_ethtool_stats(data, pointer, stats) \
 	__iavf_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
 
-/**
- * iavf_add_queue_stats - copy queue statistics into supplied buffer
- * @data: ethtool stats buffer
- * @ring: the ring to copy
- *
- * Queue statistics must be copied while protected by
- * u64_stats_fetch_begin, so we can't directly use iavf_add_ethtool_stats.
- * Assumes that queue stats are defined in iavf_gstrings_queue_stats. If the
- * ring pointer is null, zero out the queue stat values and update the data
- * pointer. Otherwise safely copy the stats from the ring into the supplied
- * buffer and update the data pointer when finished.
- *
- * This function expects to be called while under rcu_read_lock().
- **/
-static void
-iavf_add_queue_stats(u64 **data, struct iavf_ring *ring)
-{
-	const unsigned int size = ARRAY_SIZE(iavf_gstrings_queue_stats);
-	const struct iavf_stats *stats = iavf_gstrings_queue_stats;
-	unsigned int start;
-	unsigned int i;
-
-	/* To avoid invalid statistics values, ensure that we keep retrying
-	 * the copy until we get a consistent value according to
-	 * u64_stats_fetch_retry. But first, make sure our ring is
-	 * non-null before attempting to access its syncp.
-	 */
-	do {
-		start = !ring ? 0 : u64_stats_fetch_begin(&ring->syncp);
-		for (i = 0; i < size; i++)
-			iavf_add_one_ethtool_stat(&(*data)[i], ring, &stats[i]);
-	} while (ring && u64_stats_fetch_retry(&ring->syncp, start));
-
-	/* Once we successfully copy the stats in, update the data pointer */
-	*data += size;
-}
-
 /**
  * __iavf_add_stat_strings - copy stat strings into ethtool buffer
  * @p: ethtool supplied buffer
@@ -237,8 +190,6 @@ static const struct iavf_stats iavf_gstrings_stats[] = {
 
 #define IAVF_STATS_LEN	ARRAY_SIZE(iavf_gstrings_stats)
 
-#define IAVF_QUEUE_STATS_LEN	ARRAY_SIZE(iavf_gstrings_queue_stats)
-
 /**
  * iavf_get_link_ksettings - Get Link Speed and Duplex settings
  * @netdev: network interface device structure
@@ -308,18 +259,22 @@ static int iavf_get_link_ksettings(struct net_device *netdev,
  **/
 static int iavf_get_sset_count(struct net_device *netdev, int sset)
 {
-	/* Report the maximum number queues, even if not every queue is
-	 * currently configured. Since allocation of queues is in pairs,
-	 * use netdev->real_num_tx_queues * 2. The real_num_tx_queues is set
-	 * at device creation and never changes.
-	 */
+	u32 num;
 
-	if (sset == ETH_SS_STATS)
-		return IAVF_STATS_LEN +
-			(IAVF_QUEUE_STATS_LEN * 2 *
-			 netdev->real_num_tx_queues);
-	else
+	switch (sset) {
+	case ETH_SS_STATS:
+		/* Per-queue */
+		num = libie_rq_stats_get_sset_count();
+		num += libie_sq_stats_get_sset_count();
+		num *= netdev->real_num_tx_queues;
+
+		/* Global */
+		num += IAVF_STATS_LEN;
+
+		return num;
+	default:
 		return -EINVAL;
+	}
 }
 
 /**
@@ -346,15 +301,14 @@ static void iavf_get_ethtool_stats(struct net_device *netdev,
 	 * it to iterate over rings' stats.
 	 */
 	for (i = 0; i < adapter->num_active_queues; i++) {
-		struct iavf_ring *ring;
+		const struct iavf_ring *ring;
 
 		/* Tx rings stats */
-		ring = &adapter->tx_rings[i];
-		iavf_add_queue_stats(&data, ring);
+		libie_sq_stats_get_data(&data, &adapter->tx_rings[i].sq_stats);
 
 		/* Rx rings stats */
 		ring = &adapter->rx_rings[i];
-		iavf_add_queue_stats(&data, ring);
+		libie_rq_stats_get_data(&data, &ring->rq_stats, ring->pp);
 	}
 	rcu_read_unlock();
 }
@@ -376,10 +330,8 @@ static void iavf_get_stat_strings(struct net_device *netdev, u8 *data)
 	 * real_num_tx_queues for both Tx and Rx queues.
 	 */
 	for (i = 0; i < netdev->real_num_tx_queues; i++) {
-		iavf_add_stat_strings(&data, iavf_gstrings_queue_stats,
-				      "tx", i);
-		iavf_add_stat_strings(&data, iavf_gstrings_queue_stats,
-				      "rx", i);
+		libie_sq_stats_get_strings(&data, i);
+		libie_rq_stats_get_strings(&data, i);
 	}
 }
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 39c6d83e80a1..659bdfc33e0b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1566,12 +1566,14 @@ static int iavf_alloc_queues(struct iavf_adapter *adapter)
 		tx_ring->itr_setting = IAVF_ITR_TX_DEF;
 		if (adapter->flags & IAVF_FLAG_WB_ON_ITR_CAPABLE)
 			tx_ring->flags |= IAVF_TXR_FLAGS_WB_ON_ITR;
+		u64_stats_init(&tx_ring->sq_stats.syncp);
 
 		rx_ring = &adapter->rx_rings[i];
 		rx_ring->queue_index = i;
 		rx_ring->netdev = adapter->netdev;
 		rx_ring->count = adapter->rx_desc_count;
 		rx_ring->itr_setting = IAVF_ITR_RX_DEF;
+		u64_stats_init(&rx_ring->rq_stats.syncp);
 	}
 
 	adapter->num_active_queues = num_active_queues;
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index d1491b481eac..1ccc80c3f732 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -176,6 +176,9 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
 	for (i = 0; i < vsi->back->num_active_queues; i++) {
 		tx_ring = &vsi->back->tx_rings[i];
 		if (tx_ring && tx_ring->desc) {
+			const struct libie_sq_stats *st = &tx_ring->sq_stats;
+			u32 start;
+
 			/* If packet counter has not changed the queue is
 			 * likely stalled, so force an interrupt for this
 			 * queue.
@@ -183,8 +186,13 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
 			 * prev_pkt_ctr would be negative if there was no
 			 * pending work.
 			 */
-			packets = tx_ring->stats.packets & INT_MAX;
-			if (tx_ring->tx_stats.prev_pkt_ctr == packets) {
+			do {
+				start = u64_stats_fetch_begin(&st->syncp);
+				packets = u64_stats_read(&st->packets) &
+					  INT_MAX;
+			} while (u64_stats_fetch_retry(&st->syncp, start));
+
+			if (tx_ring->prev_pkt_ctr == packets) {
 				iavf_force_wb(vsi, tx_ring->q_vector);
 				continue;
 			}
@@ -193,7 +201,7 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
 			 * to iavf_get_tx_pending()
 			 */
 			smp_rmb();
-			tx_ring->tx_stats.prev_pkt_ctr =
+			tx_ring->prev_pkt_ctr =
 			  iavf_get_tx_pending(tx_ring, true) ? packets : -1;
 		}
 	}
@@ -212,11 +220,11 @@ void iavf_detect_recover_hung(struct iavf_vsi *vsi)
 static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
 			      struct iavf_ring *tx_ring, int napi_budget)
 {
+	unsigned int budget = IAVF_DEFAULT_IRQ_WORK;
+	struct libie_sq_onstack_stats stats = { };
 	int i = tx_ring->next_to_clean;
 	struct iavf_tx_buffer *tx_buf;
 	struct iavf_tx_desc *tx_desc;
-	unsigned int total_bytes = 0, total_packets = 0;
-	unsigned int budget = IAVF_DEFAULT_IRQ_WORK;
 
 	tx_buf = &tx_ring->tx_bi[i];
 	tx_desc = IAVF_TX_DESC(tx_ring, i);
@@ -242,8 +250,8 @@ static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
 		tx_buf->next_to_watch = NULL;
 
 		/* update the statistics for this packet */
-		total_bytes += tx_buf->bytecount;
-		total_packets += tx_buf->gso_segs;
+		stats.bytes += tx_buf->bytecount;
+		stats.packets += tx_buf->gso_segs;
 
 		/* free the skb */
 		napi_consume_skb(tx_buf->skb, napi_budget);
@@ -300,12 +308,9 @@ static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
+	libie_sq_napi_stats_add(&tx_ring->sq_stats, &stats);
+	tx_ring->q_vector->tx.total_bytes += stats.bytes;
+	tx_ring->q_vector->tx.total_packets += stats.packets;
 
 	if (tx_ring->flags & IAVF_TXR_FLAGS_WB_ON_ITR) {
 		/* check to see if there are < 4 descriptors
@@ -324,10 +329,10 @@ static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
 
 	/* notify netdev of completed buffers */
 	netdev_tx_completed_queue(txring_txq(tx_ring),
-				  total_packets, total_bytes);
+				  stats.packets, stats.bytes);
 
 #define TX_WAKE_THRESHOLD ((s16)(DESC_NEEDED * 2))
-	if (unlikely(total_packets && netif_carrier_ok(tx_ring->netdev) &&
+	if (unlikely(stats.packets && netif_carrier_ok(tx_ring->netdev) &&
 		     (IAVF_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))) {
 		/* Make sure that anybody stopping the queue after this
 		 * sees the new next_to_clean.
@@ -338,7 +343,7 @@ static bool iavf_clean_tx_irq(struct iavf_vsi *vsi,
 		   !test_bit(__IAVF_VSI_DOWN, vsi->state)) {
 			netif_wake_subqueue(tx_ring->netdev,
 					    tx_ring->queue_index);
-			++tx_ring->tx_stats.restart_queue;
+			libie_stats_inc_one(&tx_ring->sq_stats, restarts);
 		}
 	}
 
@@ -674,7 +679,7 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
 
 	tx_ring->next_to_use = 0;
 	tx_ring->next_to_clean = 0;
-	tx_ring->tx_stats.prev_pkt_ctr = -1;
+	tx_ring->prev_pkt_ctr = -1;
 	return 0;
 
 err:
@@ -730,7 +735,7 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
 		rx_ring->desc = NULL;
 	}
 
-	page_pool_destroy(rx_ring->pp);
+	libie_rx_page_pool_destroy(rx_ring->pp, &rx_ring->rq_stats);
 	rx_ring->pp = NULL;
 }
 
@@ -758,8 +763,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
 	if (!rx_ring->rx_bi)
 		goto err;
 
-	u64_stats_init(&rx_ring->syncp);
-
 	/* Round up to nearest 4K */
 	rx_ring->size = rx_ring->count * sizeof(union iavf_32byte_rx_desc);
 	rx_ring->size = ALIGN(rx_ring->size, 4096);
@@ -878,7 +881,7 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
 	if (rx_ring->next_to_use != ntu)
 		iavf_release_rx_desc(rx_ring, ntu);
 
-	rx_ring->rx_stats.alloc_page_failed++;
+	libie_stats_inc_one(&rx_ring->rq_stats, alloc_page_fail);
 
 	/* make sure to come back via polling to try again after
 	 * allocation failure
@@ -1089,7 +1092,7 @@ static struct sk_buff *iavf_build_skb(const struct libie_rx_buffer *rx_buffer,
  * iavf_is_non_eop - process handling of non-EOP buffers
  * @rx_ring: Rx ring being processed
  * @rx_desc: Rx descriptor for current buffer
- * @skb: Current socket buffer containing buffer in progress
+ * @stats: NAPI poll local stats to update
  *
  * This function updates next to clean.  If the buffer is an EOP buffer
  * this function exits returning false, otherwise it will place the
@@ -1098,7 +1101,7 @@ static struct sk_buff *iavf_build_skb(const struct libie_rx_buffer *rx_buffer,
  **/
 static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
 			    union iavf_rx_desc *rx_desc,
-			    struct sk_buff *skb)
+			    struct libie_rq_onstack_stats *stats)
 {
 	u32 ntc = rx_ring->next_to_clean + 1;
 
@@ -1113,7 +1116,7 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
 	if (likely(iavf_test_staterr(rx_desc, IAVF_RXD_EOF)))
 		return false;
 
-	rx_ring->rx_stats.non_eop_descs++;
+	stats->fragments++;
 
 	return true;
 }
@@ -1132,12 +1135,12 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
  **/
 static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 {
-	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
-	struct sk_buff *skb = rx_ring->skb;
 	u16 cleaned_count = IAVF_DESC_UNUSED(rx_ring);
+	struct libie_rq_onstack_stats stats = { };
+	struct sk_buff *skb = rx_ring->skb;
 	bool failure = false;
 
-	while (likely(total_rx_packets < (unsigned int)budget)) {
+	while (likely(stats.packets < budget)) {
 		struct libie_rx_buffer *rx_buffer;
 		union iavf_rx_desc *rx_desc;
 		unsigned int size;
@@ -1187,14 +1190,15 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 
 		/* exit if we failed to retrieve a buffer */
 		if (!skb) {
-			rx_ring->rx_stats.alloc_buff_failed++;
+			libie_stats_inc_one(&rx_ring->rq_stats,
+					    build_skb_fail);
 			break;
 		}
 
 skip_data:
 		cleaned_count++;
 
-		if (iavf_is_non_eop(rx_ring, rx_desc, skb))
+		if (iavf_is_non_eop(rx_ring, rx_desc, &stats))
 			continue;
 
 		/* ERR_MASK will only have valid bits if EOP set, and
@@ -1214,7 +1218,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 		}
 
 		/* probably a little skewed due to removing CRC */
-		total_rx_bytes += skb->len;
+		stats.bytes += skb->len;
 
 		qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
 		rx_ptype = (qword & IAVF_RXD_QW1_PTYPE_MASK) >>
@@ -1236,20 +1240,20 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
 		skb = NULL;
 
 		/* update budget accounting */
-		total_rx_packets++;
+		stats.packets++;
 	}
 
 	rx_ring->skb = skb;
 
-	u64_stats_update_begin(&rx_ring->syncp);
-	rx_ring->stats.packets += total_rx_packets;
-	rx_ring->stats.bytes += total_rx_bytes;
-	u64_stats_update_end(&rx_ring->syncp);
-	rx_ring->q_vector->rx.total_packets += total_rx_packets;
-	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+	libie_rq_napi_stats_add(&rx_ring->rq_stats, &stats);
+	rx_ring->q_vector->rx.total_packets += stats.packets;
+	rx_ring->q_vector->rx.total_bytes += stats.bytes;
 
 	/* guarantee a trip back through this routine if there was a failure */
-	return failure ? budget : (int)total_rx_packets;
+	if (unlikely(failure))
+		return budget;
+
+	return stats.packets;
 }
 
 static inline u32 iavf_buildreg_itr(const int type, u16 itr)
@@ -1426,10 +1430,8 @@ int iavf_napi_poll(struct napi_struct *napi, int budget)
 			return budget - 1;
 		}
 tx_only:
-		if (arm_wb) {
-			q_vector->tx.ring[0].tx_stats.tx_force_wb++;
+		if (arm_wb)
 			iavf_enable_wb_on_itr(vsi, q_vector);
-		}
 		return budget;
 	}
 
@@ -1888,6 +1890,7 @@ bool __iavf_chk_linearize(struct sk_buff *skb)
 int __iavf_maybe_stop_tx(struct iavf_ring *tx_ring, int size)
 {
 	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
+	libie_stats_inc_one(&tx_ring->sq_stats, stops);
 	/* Memory barrier before checking head and tail */
 	smp_mb();
 
@@ -1897,7 +1900,8 @@ int __iavf_maybe_stop_tx(struct iavf_ring *tx_ring, int size)
 
 	/* A reprieve! - use start_queue because it doesn't call schedule */
 	netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
-	++tx_ring->tx_stats.restart_queue;
+	libie_stats_inc_one(&tx_ring->sq_stats, restarts);
+
 	return 0;
 }
 
@@ -2078,7 +2082,7 @@ static netdev_tx_t iavf_xmit_frame_ring(struct sk_buff *skb,
 			return NETDEV_TX_OK;
 		}
 		count = iavf_txd_use_count(skb->len);
-		tx_ring->tx_stats.tx_linearize++;
+		libie_stats_inc_one(&tx_ring->sq_stats, linearized);
 	}
 
 	/* need: 1 descriptor per page * PAGE_SIZE/IAVF_MAX_DATA_PER_TXD,
@@ -2088,7 +2092,7 @@ static netdev_tx_t iavf_xmit_frame_ring(struct sk_buff *skb,
 	 * otherwise try next time
 	 */
 	if (iavf_maybe_stop_tx(tx_ring, count + 4 + 1)) {
-		tx_ring->tx_stats.tx_busy++;
+		libie_stats_inc_one(&tx_ring->sq_stats, busy);
 		return NETDEV_TX_BUSY;
 	}
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
index b13d878c74c6..81b430e8d8ff 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
@@ -4,6 +4,8 @@
 #ifndef _IAVF_TXRX_H_
 #define _IAVF_TXRX_H_
 
+#include <linux/net/intel/libie/stats.h>
+
 /* Interrupt Throttling and Rate Limiting Goodies */
 #define IAVF_DEFAULT_IRQ_WORK      256
 
@@ -201,33 +203,6 @@ struct iavf_tx_buffer {
 	u32 tx_flags;
 };
 
-struct iavf_queue_stats {
-	u64 packets;
-	u64 bytes;
-};
-
-struct iavf_tx_queue_stats {
-	u64 restart_queue;
-	u64 tx_busy;
-	u64 tx_done_old;
-	u64 tx_linearize;
-	u64 tx_force_wb;
-	int prev_pkt_ctr;
-	u64 tx_lost_interrupt;
-};
-
-struct iavf_rx_queue_stats {
-	u64 non_eop_descs;
-	u64 alloc_page_failed;
-	u64 alloc_buff_failed;
-};
-
-enum iavf_ring_state_t {
-	__IAVF_TX_FDIR_INIT_DONE,
-	__IAVF_TX_XPS_INIT_DONE,
-	__IAVF_RING_STATE_NBITS /* must be last */
-};
-
 /* some useful defines for virtchannel interface, which
  * is the only remaining user of header split
  */
@@ -252,7 +227,6 @@ struct iavf_ring {
 		struct libie_rx_buffer *rx_bi;
 		struct iavf_tx_buffer *tx_bi;
 	};
-	DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
 	u8 __iomem *tail;
 	u16 queue_index;		/* Queue number of ring */
 	u8 dcb_tc;			/* Traffic class of ring */
@@ -286,21 +260,9 @@ struct iavf_ring {
 #define IAVF_TXR_FLAGS_VLAN_TAG_LOC_L2TAG2	BIT(4)
 #define IAVF_RXR_FLAGS_VLAN_TAG_LOC_L2TAG2_2	BIT(5)
 
-	/* stats structs */
-	struct iavf_queue_stats	stats;
-	struct u64_stats_sync syncp;
-	union {
-		struct iavf_tx_queue_stats tx_stats;
-		struct iavf_rx_queue_stats rx_stats;
-	};
-
-	unsigned int size;		/* length of descriptor ring in bytes */
-	dma_addr_t dma;			/* physical address of ring */
-
 	struct iavf_vsi *vsi;		/* Backreference to associated VSI */
 	struct iavf_q_vector *q_vector;	/* Backreference to associated vector */
 
-	struct rcu_head rcu;		/* to avoid race on free */
 	struct sk_buff *skb;		/* When iavf_clean_rx_ring_irq() must
 					 * return before it sees the EOP for
 					 * the current packet, we save that skb
@@ -309,6 +271,18 @@ struct iavf_ring {
 					 * iavf_clean_rx_ring_irq() is called
 					 * for this ring.
 					 */
+
+	/* stats structs */
+	union {
+		struct libie_rq_stats rq_stats;
+		struct libie_sq_stats sq_stats;
+	};
+
+	int prev_pkt_ctr;		/* For stall detection */
+	unsigned int size;		/* length of descriptor ring in bytes */
+	dma_addr_t dma;			/* physical address of ring */
+
+	struct rcu_head rcu;		/* to avoid race on free */
 } ____cacheline_internodealigned_in_smp;
 
 #define IAVF_ITR_ADAPTIVE_MIN_INC	0x0002
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-05 15:55 ` [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
@ 2023-07-06 12:47   ` Yunsheng Lin
  2023-07-06 16:38     ` Alexander Lobakin
  2023-07-06 15:26   ` [Intel-wired-lan] " Alexander Duyck
  1 sibling, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-06 12:47 UTC (permalink / raw)
  To: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/5 23:55, Alexander Lobakin wrote:
> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> no custom recycling logics, it can easily be switched to using Page
> Pool / libie API instead.
> This allows to removing the whole dancing around headroom, HW buffer
> size, and page order. All DMA-for-device is now done in the PP core,
> for-CPU -- in the libie helper.
> Use skb_mark_for_recycle() to bring back the recycling and restore the
> performance. Speaking of performance: on par with the baseline and
> faster with the PP optimization series applied. But the memory usage for
> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
> every second descriptor.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---

...

> @@ -2562,11 +2541,7 @@ static void iavf_init_config_adapter(struct iavf_adapter *adapter)
>  
>  	netdev->netdev_ops = &iavf_netdev_ops;
>  	iavf_set_ethtool_ops(netdev);
> -	netdev->watchdog_timeo = 5 * HZ;

This seems like an unrelated change here?

> -
> -	/* MTU range: 68 - 9710 */
> -	netdev->min_mtu = ETH_MIN_MTU;
> -	netdev->max_mtu = IAVF_MAX_RXBUFFER - IAVF_PACKET_HDR_PAD;
> +	netdev->max_mtu = LIBIE_MAX_MTU;
>  

...

>  /**
> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>   **/
>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>  {
> -	struct device *dev = rx_ring->dev;
> -	int bi_size;
> +	struct page_pool *pool;
> +
> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
> +					 rx_ring->count);

If a page can be split between more than one desc, perhaps the
ptr_ring size does not need to be as big as rx_ring->count.
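
For example (sketch; the split factor below is only an assumption for
illustration, not something taken from this series):

	/* assumed: one order-0 page backs two 2K buffers */
	u32 bufs_per_page = 2;

	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
					 rx_ring->count / bufs_per_page);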

> +	if (IS_ERR(pool))
> +		return PTR_ERR(pool);
> +
> +	rx_ring->pp = pool;
>  
>  	/* warn if we are about to overwrite the pointer */
>  	WARN_ON(rx_ring->rx_bi);
> -	bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
> -	rx_ring->rx_bi = kzalloc(bi_size, GFP_KERNEL);
> +	rx_ring->rx_bi = kcalloc(rx_ring->count, sizeof(*rx_ring->rx_bi),
> +				 GFP_KERNEL);
>  	if (!rx_ring->rx_bi)
>  		goto err;
>  

...

>  
>  /**
>   * iavf_build_skb - Build skb around an existing buffer
> - * @rx_ring: Rx descriptor ring to transact packets on
>   * @rx_buffer: Rx buffer to pull data from
>   * @size: size of buffer to add to skb
>   *
>   * This function builds an skb around an existing Rx buffer, taking care
>   * to set up the skb correctly and avoid any memcpy overhead.
>   */
> -static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
> -				      struct iavf_rx_buffer *rx_buffer,
> +static struct sk_buff *iavf_build_skb(const struct libie_rx_buffer *rx_buffer,
>  				      unsigned int size)
>  {
> -	void *va;
> -#if (PAGE_SIZE < 8192)
> -	unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
> -#else
> -	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> -				SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
> -#endif
> +	struct page *page = rx_buffer->page;
> +	u32 hr = page->pp->p.offset;
>  	struct sk_buff *skb;
> +	void *va;
>  
> -	if (!rx_buffer || !size)
> -		return NULL;
>  	/* prefetch first cache line of first page */
> -	va = page_address(rx_buffer->page) + rx_buffer->page_offset;
> -	net_prefetch(va);
> +	va = page_address(page) + rx_buffer->offset;
> +	net_prefetch(va + hr);
>  
>  	/* build an skb around the page buffer */
> -	skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
> -	if (unlikely(!skb))
> +	skb = napi_build_skb(va, rx_buffer->truesize);
> +	if (unlikely(!skb)) {
> +		page_pool_put_page(page->pp, page, size, true);

Isn't it more correct to call page_pool_put_full_page() here?
We do not know which frag of the page was used for the rx_buffer, and
we depend on the last released frag to do the syncing. Maybe I should
mention that in Documentation/networking/page_pool.rst.
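
I.e. something like this (sketch only):

	/* build an skb around the page buffer */
	skb = napi_build_skb(va, rx_buffer->truesize);
	if (unlikely(!skb)) {
		/* we can't tell which part of the page this frag covers,
		 * so release the whole page rather than pass a partial
		 * sync size
		 */
		page_pool_put_full_page(page->pp, page, true);
		return NULL;
	}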

>  		return NULL;
> +	}

...

>  struct iavf_queue_stats {
>  	u64 packets;
>  	u64 bytes;
> @@ -311,16 +243,19 @@ enum iavf_ring_state_t {
>  struct iavf_ring {
>  	struct iavf_ring *next;		/* pointer to next ring in q_vector */
>  	void *desc;			/* Descriptor ring memory */
> -	struct device *dev;		/* Used for DMA mapping */
> +	union {
> +		struct page_pool *pp;	/* Used on Rx for buffer management */
> +		struct device *dev;	/* Used on Tx for DMA mapping */
> +	};
>  	struct net_device *netdev;	/* netdev ring maps to */
>  	union {
> +		struct libie_rx_buffer *rx_bi;
>  		struct iavf_tx_buffer *tx_bi;
> -		struct iavf_rx_buffer *rx_bi;
>  	};
>  	DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
> +	u8 __iomem *tail;

Is there a reason to move it here?

>  	u16 queue_index;		/* Queue number of ring */
>  	u8 dcb_tc;			/* Traffic class of ring */
> -	u8 __iomem *tail;
>  


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-05 15:55 ` [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool) Alexander Lobakin
@ 2023-07-06 12:47   ` Yunsheng Lin
  2023-07-06 16:28     ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-06 12:47 UTC (permalink / raw)
  To: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/5 23:55, Alexander Lobakin wrote:

> +/**
> + * libie_rx_page_pool_create - create a PP with the default libie settings
> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
> + * @size: size of the PP, usually simply Rx queue len
> + *
> + * Returns &page_pool on success, casted -errno on failure.
> + */
> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
> +					    u32 size)
> +{
> +	struct page_pool_params pp = {
> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> +		.order		= LIBIE_RX_PAGE_ORDER,
> +		.pool_size	= size,
> +		.nid		= NUMA_NO_NODE,
> +		.dev		= napi->dev->dev.parent,
> +		.napi		= napi,
> +		.dma_dir	= DMA_FROM_DEVICE,
> +		.offset		= LIBIE_SKB_HEADROOM,

I think it is worth mentioning that '.offset' is not really accurate
when the page is split, as we do not really know the offset of a frag
within the page except for the first frag.
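
To illustrate with assumed numbers (order-0 page split into two 2K
halves, offsets relative to the page start):

	u32 first_frag_data  = pp.offset;	 /* first half */
	u32 second_frag_data = 2048 + pp.offset; /* second half */

	/* '.offset' alone only describes the first case */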

> +	};
> +	size_t truesize;
> +
> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
> +
> +	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
> +	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
> +	pp.init_arg = (void *)truesize;

I am not sure it is correct to use pp.init_arg here, as it is supposed
to be used along with init_callback. And if we ever want to change the
implementation of init_callback, we may be stuck with it, as the driver
is using it very differently here.

Is it possible to pass the 'wanted truesize' by adding a parameter to
libie_rx_alloc()?
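
For example, something like this at the call site (both the extra
parameter and the ring field are made up purely for illustration):

	/* hypothetical: hand the wanted truesize to the helper directly */
	addr = libie_rx_alloc(rx_ring->pp, &rx_ring->rx_bi[ntu],
			      rx_ring->rx_truesize);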

> +
> +	return page_pool_create(&pp);
> +}
> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling
  2023-07-05 15:55 ` [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling Alexander Lobakin
@ 2023-07-06 14:47   ` Alexander Duyck
  2023-07-06 16:45     ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2023-07-06 14:47 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On Wed, Jul 5, 2023 at 8:57 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> As an intermediate step, remove all page splitting/recyclig code. Just

Spelling issue: "recycling"

> always allocate a new page and don't touch its refcount, so that it gets
> freed by the core stack later.
> Same for the "in-place" recycling, i.e. when an unused buffer gets
> assigned to a first needs-refilling descriptor. In some cases, this
> was leading to moving up to 63 &iavf_rx_buf structures around the ring
> on a per-field basis -- not something wanted on hotpath.
> The change allows to greatly simplify certain parts of the code:
>
> Function: add/remove: 0/2 grow/shrink: 1/3 up/down: 3/-494 (-491)
>
> Although the array of &iavf_rx_buf is barely used now and could be
> replaced with just page pointer array, don't touch it now to not
> complicate replacing it with libie Rx buffer struct later on.
> No surprise perf loses up to 30% here, but that regression will go
> away once PP lands.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  drivers/net/ethernet/intel/iavf/iavf_txrx.c | 151 ++------------------
>  drivers/net/ethernet/intel/iavf/iavf_txrx.h |   8 --
>  2 files changed, 13 insertions(+), 146 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> index a85b270fc769..789b10815d7f 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> @@ -723,7 +723,7 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>                                      DMA_FROM_DEVICE,
>                                      IAVF_RX_DMA_ATTR);
>
> -               __page_frag_cache_drain(rx_bi->page, rx_bi->pagecnt_bias);
> +               __free_pages(rx_bi->page, iavf_rx_pg_order(rx_ring));
>
>                 rx_bi->page = NULL;
>                 rx_bi->page_offset = 0;
> @@ -735,7 +735,6 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>         /* Zero out the descriptor ring */
>         memset(rx_ring->desc, 0, rx_ring->size);
>
> -       rx_ring->next_to_alloc = 0;
>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
>  }
> @@ -791,7 +790,6 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>                 goto err;
>         }
>
> -       rx_ring->next_to_alloc = 0;
>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
>
> @@ -811,9 +809,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
>  {
>         rx_ring->next_to_use = val;
>
> -       /* update next to alloc since we have filled the ring */
> -       rx_ring->next_to_alloc = val;
> -
>         /* Force memory writes to complete before letting h/w
>          * know there are new descriptors to fetch.  (Only
>          * applicable for weak-ordered memory model archs,
> @@ -837,12 +832,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
>         struct page *page = bi->page;
>         dma_addr_t dma;
>
> -       /* since we are recycling buffers we should seldom need to alloc */
> -       if (likely(page)) {
> -               rx_ring->rx_stats.page_reuse_count++;
> -               return true;
> -       }
> -
>         /* alloc new page for storage */
>         page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
>         if (unlikely(!page)) {
> @@ -869,9 +858,6 @@ static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
>         bi->page = page;
>         bi->page_offset = IAVF_SKB_PAD;
>
> -       /* initialize pagecnt_bias to 1 representing we fully own page */
> -       bi->pagecnt_bias = 1;
> -
>         return true;
>  }
>
> @@ -1103,91 +1089,6 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
>         return false;
>  }
>
> -/**
> - * iavf_reuse_rx_page - page flip buffer and store it back on the ring
> - * @rx_ring: rx descriptor ring to store buffers on
> - * @old_buff: donor buffer to have page reused
> - *
> - * Synchronizes page for reuse by the adapter
> - **/
> -static void iavf_reuse_rx_page(struct iavf_ring *rx_ring,
> -                              struct iavf_rx_buffer *old_buff)
> -{
> -       struct iavf_rx_buffer *new_buff;
> -       u16 nta = rx_ring->next_to_alloc;
> -
> -       new_buff = &rx_ring->rx_bi[nta];
> -
> -       /* update, and store next to alloc */
> -       nta++;
> -       rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> -
> -       /* transfer page from old buffer to new buffer */
> -       new_buff->dma           = old_buff->dma;
> -       new_buff->page          = old_buff->page;
> -       new_buff->page_offset   = old_buff->page_offset;
> -       new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
> -}
> -
> -/**
> - * iavf_can_reuse_rx_page - Determine if this page can be reused by
> - * the adapter for another receive
> - *
> - * @rx_buffer: buffer containing the page
> - *
> - * If page is reusable, rx_buffer->page_offset is adjusted to point to
> - * an unused region in the page.
> - *
> - * For small pages, @truesize will be a constant value, half the size
> - * of the memory at page.  We'll attempt to alternate between high and
> - * low halves of the page, with one half ready for use by the hardware
> - * and the other half being consumed by the stack.  We use the page
> - * ref count to determine whether the stack has finished consuming the
> - * portion of this page that was passed up with a previous packet.  If
> - * the page ref count is >1, we'll assume the "other" half page is
> - * still busy, and this page cannot be reused.
> - *
> - * For larger pages, @truesize will be the actual space used by the
> - * received packet (adjusted upward to an even multiple of the cache
> - * line size).  This will advance through the page by the amount
> - * actually consumed by the received packets while there is still
> - * space for a buffer.  Each region of larger pages will be used at
> - * most once, after which the page will not be reused.
> - *
> - * In either case, if the page is reusable its refcount is increased.
> - **/
> -static bool iavf_can_reuse_rx_page(struct iavf_rx_buffer *rx_buffer)
> -{
> -       unsigned int pagecnt_bias = rx_buffer->pagecnt_bias;
> -       struct page *page = rx_buffer->page;
> -
> -       /* Is any reuse possible? */
> -       if (!dev_page_is_reusable(page))
> -               return false;
> -
> -#if (PAGE_SIZE < 8192)
> -       /* if we are only owner of page we can reuse it */
> -       if (unlikely((page_count(page) - pagecnt_bias) > 1))
> -               return false;
> -#else
> -#define IAVF_LAST_OFFSET \
> -       (SKB_WITH_OVERHEAD(PAGE_SIZE) - IAVF_RXBUFFER_2048)
> -       if (rx_buffer->page_offset > IAVF_LAST_OFFSET)
> -               return false;
> -#endif
> -
> -       /* If we have drained the page fragment pool we need to update
> -        * the pagecnt_bias and page count so that we fully restock the
> -        * number of references the driver holds.
> -        */
> -       if (unlikely(!pagecnt_bias)) {
> -               page_ref_add(page, USHRT_MAX);
> -               rx_buffer->pagecnt_bias = USHRT_MAX;
> -       }
> -
> -       return true;
> -}
> -
>  /**
>   * iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
>   * @rx_ring: rx descriptor ring to transact packets on
> @@ -1216,13 +1117,6 @@ static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
>
>         skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
>                         rx_buffer->page_offset, size, truesize);
> -
> -       /* page is being used so we must update the page offset */
> -#if (PAGE_SIZE < 8192)
> -       rx_buffer->page_offset ^= truesize;
> -#else
> -       rx_buffer->page_offset += truesize;
> -#endif
>  }
>
>  /**
> @@ -1250,9 +1144,6 @@ static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
>                                       size,
>                                       DMA_FROM_DEVICE);
>
> -       /* We have pulled a buffer for use, so decrement pagecnt_bias */
> -       rx_buffer->pagecnt_bias--;
> -
>         return rx_buffer;
>  }
>
> @@ -1293,23 +1184,15 @@ static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
>         skb_reserve(skb, IAVF_SKB_PAD);
>         __skb_put(skb, size);
>
> -       /* buffer is used by skb, update page_offset */
> -#if (PAGE_SIZE < 8192)
> -       rx_buffer->page_offset ^= truesize;
> -#else
> -       rx_buffer->page_offset += truesize;
> -#endif
> -
>         return skb;
>  }
>
>  /**
> - * iavf_put_rx_buffer - Clean up used buffer and either recycle or free
> + * iavf_put_rx_buffer - Unmap used buffer
>   * @rx_ring: rx descriptor ring to transact packets on
>   * @rx_buffer: rx buffer to pull data from
>   *
> - * This function will clean up the contents of the rx_buffer.  It will
> - * either recycle the buffer or unmap it and free the associated resources.
> + * This function will unmap the buffer after it's written by HW.
>   */
>  static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
>                                struct iavf_rx_buffer *rx_buffer)
> @@ -1317,21 +1200,10 @@ static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
>         if (!rx_buffer)
>                 return;
>
> -       if (iavf_can_reuse_rx_page(rx_buffer)) {
> -               /* hand second half of page back to the ring */
> -               iavf_reuse_rx_page(rx_ring, rx_buffer);
> -               rx_ring->rx_stats.page_reuse_count++;
> -       } else {
> -               /* we are not reusing the buffer so unmap it */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
> -                                    iavf_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> -               __page_frag_cache_drain(rx_buffer->page,
> -                                       rx_buffer->pagecnt_bias);
> -       }
> -
> -       /* clear contents of buffer_info */
> -       rx_buffer->page = NULL;
> +       /* we are not reusing the buffer so unmap it */
> +       dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
> +                            iavf_rx_pg_size(rx_ring),
> +                            DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);

Rather than reorder all this I would just do the dma_unmap_page_attrs
and then leave the assignment of NULL to rx_buffer->page. That should
make it a bit easier to clean up the code below.
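
Roughly this (sketch), so the call further down could stay where it is:

static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
			       struct iavf_rx_buffer *rx_buffer)
{
	if (!rx_buffer)
		return;

	/* we are not reusing the buffer so unmap it */
	dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
			     iavf_rx_pg_size(rx_ring),
			     DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);

	rx_buffer->page = NULL;
}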

>  }
>
>  /**
> @@ -1431,15 +1303,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>                 else
>                         skb = iavf_build_skb(rx_ring, rx_buffer, size);
>
> +               iavf_put_rx_buffer(rx_ring, rx_buffer);
> +

This should stay below where it was.

>                 /* exit if we failed to retrieve a buffer */
>                 if (!skb) {
>                         rx_ring->rx_stats.alloc_buff_failed++;
> -                       if (rx_buffer && size)
> -                               rx_buffer->pagecnt_bias++;
> +                       __free_pages(rx_buffer->page,
> +                                    iavf_rx_pg_order(rx_ring));
> +                       rx_buffer->page = NULL;
>                         break;
>                 }

This code was undoing the iavf_get_rx_buffer decrement of pagecnt_bias
and then bailing since we have halted forward progress due to an skb
allocation failure. As such we should just be removing the if
statement and the increment of pagecnt_bias.
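
I.e. roughly all the bail-out would need is (sketch):

		/* exit if we failed to retrieve a buffer */
		if (!skb) {
			rx_ring->rx_stats.alloc_buff_failed++;
			break;
		}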

>
> -               iavf_put_rx_buffer(rx_ring, rx_buffer);
> +               rx_buffer->page = NULL;
>                 cleaned_count++;
>
>                 if (iavf_is_non_eop(rx_ring, rx_desc, skb))

If iavf_put_rx_buffer just does the unmap and assignment of NULL then
it could just be left here as is.

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> index 4b412f7662e4..2170a77f8c8d 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> @@ -266,12 +266,7 @@ struct iavf_tx_buffer {
>  struct iavf_rx_buffer {
>         dma_addr_t dma;
>         struct page *page;
> -#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
>         __u32 page_offset;
> -#else
> -       __u16 page_offset;
> -#endif
> -       __u16 pagecnt_bias;
>  };
>
>  struct iavf_queue_stats {
> @@ -293,8 +288,6 @@ struct iavf_rx_queue_stats {
>         u64 non_eop_descs;
>         u64 alloc_page_failed;
>         u64 alloc_buff_failed;
> -       u64 page_reuse_count;
> -       u64 realloc_count;
>  };
>
>  enum iavf_ring_state_t {
> @@ -374,7 +367,6 @@ struct iavf_ring {
>         struct iavf_q_vector *q_vector; /* Backreference to associated vector */
>
>         struct rcu_head rcu;            /* to avoid race on free */
> -       u16 next_to_alloc;
>         struct sk_buff *skb;            /* When iavf_clean_rx_ring_irq() must
>                                          * return before it sees the EOP for
>                                          * the current packet, we save that skb
> --
> 2.41.0
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-05 15:55 ` [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
  2023-07-06 12:47   ` Yunsheng Lin
@ 2023-07-06 15:26   ` Alexander Duyck
  2023-07-06 16:56     ` Alexander Lobakin
  1 sibling, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2023-07-06 15:26 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On Wed, Jul 5, 2023 at 8:58 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> no custom recycling logics, it can easily be switched to using Page
> Pool / libie API instead.
> This allows to removing the whole dancing around headroom, HW buffer
> size, and page order. All DMA-for-device is now done in the PP core,
> for-CPU -- in the libie helper.
> Use skb_mark_for_recycle() to bring back the recycling and restore the
> performance. Speaking of performance: on par with the baseline and
> faster with the PP optimization series applied. But the memory usage for
> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
> every second descriptor.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

One thing I am noticing is that there seems to be a bunch of cleanup
changes in here as well. Things like moving around values within
structures, which I am assuming is to fill holes. You may want to look
at breaking some of those out, as they make it a bit harder to review
this since they seem like unrelated changes.

> ---
>  drivers/net/ethernet/intel/iavf/iavf_main.c   |  33 +--
>  drivers/net/ethernet/intel/iavf/iavf_txrx.c   | 247 +++++-------------
>  drivers/net/ethernet/intel/iavf/iavf_txrx.h   |  91 +------
>  drivers/net/ethernet/intel/iavf/iavf_type.h   |   2 -
>  .../net/ethernet/intel/iavf/iavf_virtchnl.c   |  19 +-
>  5 files changed, 82 insertions(+), 310 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index db1ed13f11bb..39c6d83e80a1 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -1,6 +1,8 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright(c) 2013 - 2018 Intel Corporation. */
>
> +#include <linux/net/intel/libie/rx.h>
> +
>  #include "iavf.h"
>  #include "iavf_prototype.h"
>  #include "iavf_client.h"
> @@ -698,32 +700,10 @@ static void iavf_configure_tx(struct iavf_adapter *adapter)
>   **/
>  static void iavf_configure_rx(struct iavf_adapter *adapter)
>  {
> -       unsigned int rx_buf_len = IAVF_RXBUFFER_2048;
>         struct iavf_hw *hw = &adapter->hw;
> -       int i;
> -
> -       if (PAGE_SIZE < 8192) {
> -               struct net_device *netdev = adapter->netdev;
>
> -               /* For jumbo frames on systems with 4K pages we have to use
> -                * an order 1 page, so we might as well increase the size
> -                * of our Rx buffer to make better use of the available space
> -                */
> -               rx_buf_len = IAVF_RXBUFFER_3072;
> -
> -               /* We use a 1536 buffer size for configurations with
> -                * standard Ethernet mtu.  On x86 this gives us enough room
> -                * for shared info and 192 bytes of padding.
> -                */
> -               if (!IAVF_2K_TOO_SMALL_WITH_PADDING &&
> -                   (netdev->mtu <= ETH_DATA_LEN))
> -                       rx_buf_len = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
> -       }
> -
> -       for (i = 0; i < adapter->num_active_queues; i++) {
> +       for (u32 i = 0; i < adapter->num_active_queues; i++)
>                 adapter->rx_rings[i].tail = hw->hw_addr + IAVF_QRX_TAIL1(i);
> -               adapter->rx_rings[i].rx_buf_len = rx_buf_len;
> -       }
>  }
>
>  /**
> @@ -1590,7 +1570,6 @@ static int iavf_alloc_queues(struct iavf_adapter *adapter)
>                 rx_ring = &adapter->rx_rings[i];
>                 rx_ring->queue_index = i;
>                 rx_ring->netdev = adapter->netdev;
> -               rx_ring->dev = &adapter->pdev->dev;
>                 rx_ring->count = adapter->rx_desc_count;
>                 rx_ring->itr_setting = IAVF_ITR_RX_DEF;
>         }
> @@ -2562,11 +2541,7 @@ static void iavf_init_config_adapter(struct iavf_adapter *adapter)
>
>         netdev->netdev_ops = &iavf_netdev_ops;
>         iavf_set_ethtool_ops(netdev);
> -       netdev->watchdog_timeo = 5 * HZ;
> -

These removals should be pulled out into a separate patch where we can
call out that they are not needed due to redundancy with code in the
core network stack.

> -       /* MTU range: 68 - 9710 */
> -       netdev->min_mtu = ETH_MIN_MTU;
> -       netdev->max_mtu = IAVF_MAX_RXBUFFER - IAVF_PACKET_HDR_PAD;
> +       netdev->max_mtu = LIBIE_MAX_MTU;

Same here for min_mtu.

>
>         if (!is_valid_ether_addr(adapter->hw.mac.addr)) {
>                 dev_info(&pdev->dev, "Invalid MAC address %pM, using random\n",
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> index 789b10815d7f..d1491b481eac 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
> @@ -689,9 +689,6 @@ int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring)
>   **/
>  static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>  {
> -       unsigned long bi_size;
> -       u16 i;
> -
>         /* ring already cleared, nothing to do */
>         if (!rx_ring->rx_bi)
>                 return;
> @@ -701,40 +698,16 @@ static void iavf_clean_rx_ring(struct iavf_ring *rx_ring)
>                 rx_ring->skb = NULL;
>         }
>
> -       /* Free all the Rx ring sk_buffs */
> -       for (i = 0; i < rx_ring->count; i++) {
> -               struct iavf_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
> +       /* Free all the Rx ring buffers */
> +       for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
> +               const struct libie_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
>
> -               if (!rx_bi->page)
> -                       continue;
> +               page_pool_put_full_page(rx_ring->pp, rx_bi->page, false);
>
> -               /* Invalidate cache lines that may have been written to by
> -                * device so that we avoid corrupting memory.
> -                */
> -               dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                             rx_bi->dma,
> -                                             rx_bi->page_offset,
> -                                             rx_ring->rx_buf_len,
> -                                             DMA_FROM_DEVICE);
> -
> -               /* free resources associated with mapping */
> -               dma_unmap_page_attrs(rx_ring->dev, rx_bi->dma,
> -                                    iavf_rx_pg_size(rx_ring),
> -                                    DMA_FROM_DEVICE,
> -                                    IAVF_RX_DMA_ATTR);
> -
> -               __free_pages(rx_bi->page, iavf_rx_pg_order(rx_ring));
> -
> -               rx_bi->page = NULL;
> -               rx_bi->page_offset = 0;
> +               if (unlikely(++i == rx_ring->count))
> +                       i = 0;
>         }
>
> -       bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
> -       memset(rx_ring->rx_bi, 0, bi_size);
> -
> -       /* Zero out the descriptor ring */
> -       memset(rx_ring->desc, 0, rx_ring->size);
> -

I have some misgivings about not clearing these. We may want to double
check that the code paths are resilient enough that it won't cause any
issues with repeated up/down testing on the interface. The general idea
is to keep things consistent with the state after setup_rx_descriptors.
If we don't need this, then we don't need to be calling the zalloc or
calloc version of things in setup_rx_descriptors.
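
E.g. keeping something like this at the end of iavf_clean_rx_ring()
(just a sketch of the lines being removed) would preserve that:

	/* return the ring to its just-initialized state */
	memset(rx_ring->desc, 0, rx_ring->size);

	rx_ring->next_to_clean = 0;
	rx_ring->next_to_use = 0;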


>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
>  }
> @@ -752,10 +725,13 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>         rx_ring->rx_bi = NULL;
>
>         if (rx_ring->desc) {
> -               dma_free_coherent(rx_ring->dev, rx_ring->size,
> +               dma_free_coherent(rx_ring->pp->p.dev, rx_ring->size,
>                                   rx_ring->desc, rx_ring->dma);
>                 rx_ring->desc = NULL;
>         }
> +
> +       page_pool_destroy(rx_ring->pp);
> +       rx_ring->pp = NULL;
>  }
>
>  /**
> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>   **/
>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>  {
> -       struct device *dev = rx_ring->dev;
> -       int bi_size;
> +       struct page_pool *pool;
> +
> +       pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
> +                                        rx_ring->count);
> +       if (IS_ERR(pool))
> +               return PTR_ERR(pool);
> +
> +       rx_ring->pp = pool;
>
>         /* warn if we are about to overwrite the pointer */
>         WARN_ON(rx_ring->rx_bi);
> -       bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
> -       rx_ring->rx_bi = kzalloc(bi_size, GFP_KERNEL);
> +       rx_ring->rx_bi = kcalloc(rx_ring->count, sizeof(*rx_ring->rx_bi),
> +                                GFP_KERNEL);
>         if (!rx_ring->rx_bi)
>                 goto err;
>
> @@ -781,22 +763,27 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>         /* Round up to nearest 4K */
>         rx_ring->size = rx_ring->count * sizeof(union iavf_32byte_rx_desc);
>         rx_ring->size = ALIGN(rx_ring->size, 4096);
> -       rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
> +       rx_ring->desc = dma_alloc_coherent(pool->p.dev, rx_ring->size,
>                                            &rx_ring->dma, GFP_KERNEL);
>
>         if (!rx_ring->desc) {
> -               dev_info(dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
> +               dev_info(pool->p.dev, "Unable to allocate memory for the Rx descriptor ring, size=%d\n",
>                          rx_ring->size);
> -               goto err;
> +               goto err_free_buf;
>         }
>
>         rx_ring->next_to_clean = 0;
>         rx_ring->next_to_use = 0;
>
>         return 0;
> -err:
> +
> +err_free_buf:
>         kfree(rx_ring->rx_bi);
>         rx_ring->rx_bi = NULL;
> +
> +err:
> +       page_pool_destroy(rx_ring->pp);
> +
>         return -ENOMEM;
>  }
>
> @@ -818,49 +805,6 @@ static inline void iavf_release_rx_desc(struct iavf_ring *rx_ring, u32 val)
>         writel(val, rx_ring->tail);
>  }
>
> -/**
> - * iavf_alloc_mapped_page - recycle or make a new page
> - * @rx_ring: ring to use
> - * @bi: rx_buffer struct to modify
> - *
> - * Returns true if the page was successfully allocated or
> - * reused.
> - **/
> -static bool iavf_alloc_mapped_page(struct iavf_ring *rx_ring,
> -                                  struct iavf_rx_buffer *bi)
> -{
> -       struct page *page = bi->page;
> -       dma_addr_t dma;
> -
> -       /* alloc new page for storage */
> -       page = dev_alloc_pages(iavf_rx_pg_order(rx_ring));
> -       if (unlikely(!page)) {
> -               rx_ring->rx_stats.alloc_page_failed++;
> -               return false;
> -       }
> -
> -       /* map page for use */
> -       dma = dma_map_page_attrs(rx_ring->dev, page, 0,
> -                                iavf_rx_pg_size(rx_ring),
> -                                DMA_FROM_DEVICE,
> -                                IAVF_RX_DMA_ATTR);
> -
> -       /* if mapping failed free memory back to system since
> -        * there isn't much point in holding memory we can't use
> -        */
> -       if (dma_mapping_error(rx_ring->dev, dma)) {
> -               __free_pages(page, iavf_rx_pg_order(rx_ring));
> -               rx_ring->rx_stats.alloc_page_failed++;
> -               return false;
> -       }
> -
> -       bi->dma = dma;
> -       bi->page = page;
> -       bi->page_offset = IAVF_SKB_PAD;
> -
> -       return true;
> -}
> -
>  /**
>   * iavf_receive_skb - Send a completed packet up the stack
>   * @rx_ring:  rx ring in play
> @@ -893,36 +837,29 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
>  {
>         u16 ntu = rx_ring->next_to_use;
>         union iavf_rx_desc *rx_desc;
> -       struct iavf_rx_buffer *bi;
>
>         /* do nothing if no valid netdev defined */
>         if (!rx_ring->netdev || !cleaned_count)
>                 return false;
>
>         rx_desc = IAVF_RX_DESC(rx_ring, ntu);
> -       bi = &rx_ring->rx_bi[ntu];
>
>         do {
> -               if (!iavf_alloc_mapped_page(rx_ring, bi))
> -                       goto no_buffers;
> +               dma_addr_t addr;
>
> -               /* sync the buffer for use by the device */
> -               dma_sync_single_range_for_device(rx_ring->dev, bi->dma,
> -                                                bi->page_offset,
> -                                                rx_ring->rx_buf_len,
> -                                                DMA_FROM_DEVICE);
> +               addr = libie_rx_alloc(rx_ring->pp, &rx_ring->rx_bi[ntu]);
> +               if (unlikely(addr == DMA_MAPPING_ERROR))
> +                       goto no_buffers;
>
>                 /* Refresh the desc even if buffer_addrs didn't change
>                  * because each write-back erases this info.
>                  */
> -               rx_desc->read.pkt_addr = cpu_to_le64(bi->dma + bi->page_offset);
> +               rx_desc->read.pkt_addr = cpu_to_le64(addr);
>
>                 rx_desc++;
> -               bi++;
>                 ntu++;
>                 if (unlikely(ntu == rx_ring->count)) {
>                         rx_desc = IAVF_RX_DESC(rx_ring, 0);
> -                       bi = rx_ring->rx_bi;
>                         ntu = 0;
>                 }
>



> @@ -941,6 +878,8 @@ bool iavf_alloc_rx_buffers(struct iavf_ring *rx_ring, u16 cleaned_count)
>         if (rx_ring->next_to_use != ntu)
>                 iavf_release_rx_desc(rx_ring, ntu);
>
> +       rx_ring->rx_stats.alloc_page_failed++;
> +
>         /* make sure to come back via polling to try again after
>          * allocation failure
>          */
> @@ -1091,9 +1030,8 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
>
>  /**
>   * iavf_add_rx_frag - Add contents of Rx buffer to sk_buff
> - * @rx_ring: rx descriptor ring to transact packets on
> - * @rx_buffer: buffer containing page to add
>   * @skb: sk_buff to place the data into
> + * @rx_buffer: buffer containing page to add
>   * @size: packet length from rx_desc
>   *
>   * This function will add the data contained in rx_buffer->page to the skb.
> @@ -1101,111 +1039,52 @@ static bool iavf_cleanup_headers(struct iavf_ring *rx_ring, struct sk_buff *skb)
>   *
>   * The function will then update the page offset.
>   **/
> -static void iavf_add_rx_frag(struct iavf_ring *rx_ring,
> -                            struct iavf_rx_buffer *rx_buffer,
> -                            struct sk_buff *skb,
> +static void iavf_add_rx_frag(struct sk_buff *skb,
> +                            const struct libie_rx_buffer *rx_buffer,
>                              unsigned int size)
>  {
> -#if (PAGE_SIZE < 8192)
> -       unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
> -#else
> -       unsigned int truesize = SKB_DATA_ALIGN(size + IAVF_SKB_PAD);
> -#endif
> -
> -       if (!size)
> -               return;
> +       u32 hr = rx_buffer->page->pp->p.offset;
>
>         skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
> -                       rx_buffer->page_offset, size, truesize);
> -}
> -
> -/**
> - * iavf_get_rx_buffer - Fetch Rx buffer and synchronize data for use
> - * @rx_ring: rx descriptor ring to transact packets on
> - * @size: size of buffer to add to skb
> - *
> - * This function will pull an Rx buffer from the ring and synchronize it
> - * for use by the CPU.
> - */
> -static struct iavf_rx_buffer *iavf_get_rx_buffer(struct iavf_ring *rx_ring,
> -                                                const unsigned int size)
> -{
> -       struct iavf_rx_buffer *rx_buffer;
> -
> -       rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> -       prefetchw(rx_buffer->page);
> -       if (!size)
> -               return rx_buffer;
> -
> -       /* we are reusing so sync this buffer for CPU use */
> -       dma_sync_single_range_for_cpu(rx_ring->dev,
> -                                     rx_buffer->dma,
> -                                     rx_buffer->page_offset,
> -                                     size,
> -                                     DMA_FROM_DEVICE);
> -
> -       return rx_buffer;
> +                       rx_buffer->offset + hr, size, rx_buffer->truesize);
>  }
>
>  /**
>   * iavf_build_skb - Build skb around an existing buffer
> - * @rx_ring: Rx descriptor ring to transact packets on
>   * @rx_buffer: Rx buffer to pull data from
>   * @size: size of buffer to add to skb
>   *
>   * This function builds an skb around an existing Rx buffer, taking care
>   * to set up the skb correctly and avoid any memcpy overhead.
>   */
> -static struct sk_buff *iavf_build_skb(struct iavf_ring *rx_ring,
> -                                     struct iavf_rx_buffer *rx_buffer,
> +static struct sk_buff *iavf_build_skb(const struct libie_rx_buffer *rx_buffer,
>                                       unsigned int size)
>  {
> -       void *va;
> -#if (PAGE_SIZE < 8192)
> -       unsigned int truesize = iavf_rx_pg_size(rx_ring) / 2;
> -#else
> -       unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> -                               SKB_DATA_ALIGN(IAVF_SKB_PAD + size);
> -#endif
> +       struct page *page = rx_buffer->page;
> +       u32 hr = page->pp->p.offset;
>         struct sk_buff *skb;
> +       void *va;
>
> -       if (!rx_buffer || !size)
> -               return NULL;
>         /* prefetch first cache line of first page */
> -       va = page_address(rx_buffer->page) + rx_buffer->page_offset;
> -       net_prefetch(va);
> +       va = page_address(page) + rx_buffer->offset;
> +       net_prefetch(va + hr);
>
>         /* build an skb around the page buffer */
> -       skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
> -       if (unlikely(!skb))
> +       skb = napi_build_skb(va, rx_buffer->truesize);
> +       if (unlikely(!skb)) {
> +               page_pool_put_page(page->pp, page, size, true);
>                 return NULL;
> +       }
> +
> +       skb_mark_for_recycle(skb);
>
>         /* update pointers within the skb to store the data */
> -       skb_reserve(skb, IAVF_SKB_PAD);
> +       skb_reserve(skb, hr);
>         __skb_put(skb, size);
>
>         return skb;
>  }
>
> -/**
> - * iavf_put_rx_buffer - Unmap used buffer
> - * @rx_ring: rx descriptor ring to transact packets on
> - * @rx_buffer: rx buffer to pull data from
> - *
> - * This function will unmap the buffer after it's written by HW.
> - */
> -static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
> -                              struct iavf_rx_buffer *rx_buffer)
> -{
> -       if (!rx_buffer)
> -               return;
> -
> -       /* we are not reusing the buffer so unmap it */
> -       dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
> -                            iavf_rx_pg_size(rx_ring),
> -                            DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> -}
> -
>  /**
>   * iavf_is_non_eop - process handling of non-EOP buffers
>   * @rx_ring: Rx ring being processed
> @@ -1259,7 +1138,7 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>         bool failure = false;
>
>         while (likely(total_rx_packets < (unsigned int)budget)) {
> -               struct iavf_rx_buffer *rx_buffer;
> +               struct libie_rx_buffer *rx_buffer;
>                 union iavf_rx_desc *rx_desc;
>                 unsigned int size;
>                 u16 vlan_tag = 0;
> @@ -1295,26 +1174,24 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>                        IAVF_RXD_QW1_LENGTH_PBUF_SHIFT;
>
>                 iavf_trace(clean_rx_irq, rx_ring, rx_desc, skb);
> -               rx_buffer = iavf_get_rx_buffer(rx_ring, size);
> +
> +               rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
> +               if (!libie_rx_sync_for_cpu(rx_buffer, size))
> +                       goto skip_data;
>
>                 /* retrieve a buffer from the ring */
>                 if (skb)
> -                       iavf_add_rx_frag(rx_ring, rx_buffer, skb, size);
> +                       iavf_add_rx_frag(skb, rx_buffer, size);
>                 else
> -                       skb = iavf_build_skb(rx_ring, rx_buffer, size);
> -
> -               iavf_put_rx_buffer(rx_ring, rx_buffer);
> +                       skb = iavf_build_skb(rx_buffer, size);
>
>                 /* exit if we failed to retrieve a buffer */
>                 if (!skb) {
>                         rx_ring->rx_stats.alloc_buff_failed++;
> -                       __free_pages(rx_buffer->page,
> -                                    iavf_rx_pg_order(rx_ring));
> -                       rx_buffer->page = NULL;
>                         break;
>                 }
>
> -               rx_buffer->page = NULL;
> +skip_data:
>                 cleaned_count++;
>
>                 if (iavf_is_non_eop(rx_ring, rx_desc, skb))
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.h b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> index 2170a77f8c8d..b13d878c74c6 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> +++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.h
> @@ -81,70 +81,8 @@ enum iavf_dyn_idx_t {
>         BIT_ULL(IAVF_FILTER_PCTYPE_NONF_UNICAST_IPV6_UDP) | \
>         BIT_ULL(IAVF_FILTER_PCTYPE_NONF_MULTICAST_IPV6_UDP))
>
> -/* Supported Rx Buffer Sizes (a multiple of 128) */
> -#define IAVF_RXBUFFER_1536  1536  /* 128B aligned standard Ethernet frame */
> -#define IAVF_RXBUFFER_2048  2048
> -#define IAVF_RXBUFFER_3072  3072  /* Used for large frames w/ padding */
> -#define IAVF_MAX_RXBUFFER   9728  /* largest size for single descriptor */
> -
> -#define IAVF_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))
>  #define iavf_rx_desc iavf_32byte_rx_desc
>
> -#define IAVF_RX_DMA_ATTR \
> -       (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
> -
> -/* Attempt to maximize the headroom available for incoming frames.  We
> - * use a 2K buffer for receives and need 1536/1534 to store the data for
> - * the frame.  This leaves us with 512 bytes of room.  From that we need
> - * to deduct the space needed for the shared info and the padding needed
> - * to IP align the frame.
> - *
> - * Note: For cache line sizes 256 or larger this value is going to end
> - *      up negative.  In these cases we should fall back to the legacy
> - *      receive path.
> - */
> -#if (PAGE_SIZE < 8192)
> -#define IAVF_2K_TOO_SMALL_WITH_PADDING \
> -((NET_SKB_PAD + IAVF_RXBUFFER_1536) > SKB_WITH_OVERHEAD(IAVF_RXBUFFER_2048))
> -
> -static inline int iavf_compute_pad(int rx_buf_len)
> -{
> -       int page_size, pad_size;
> -
> -       page_size = ALIGN(rx_buf_len, PAGE_SIZE / 2);
> -       pad_size = SKB_WITH_OVERHEAD(page_size) - rx_buf_len;
> -
> -       return pad_size;
> -}
> -
> -static inline int iavf_skb_pad(void)
> -{
> -       int rx_buf_len;
> -
> -       /* If a 2K buffer cannot handle a standard Ethernet frame then
> -        * optimize padding for a 3K buffer instead of a 1.5K buffer.
> -        *
> -        * For a 3K buffer we need to add enough padding to allow for
> -        * tailroom due to NET_IP_ALIGN possibly shifting us out of
> -        * cache-line alignment.
> -        */
> -       if (IAVF_2K_TOO_SMALL_WITH_PADDING)
> -               rx_buf_len = IAVF_RXBUFFER_3072 + SKB_DATA_ALIGN(NET_IP_ALIGN);
> -       else
> -               rx_buf_len = IAVF_RXBUFFER_1536;
> -
> -       /* if needed make room for NET_IP_ALIGN */
> -       rx_buf_len -= NET_IP_ALIGN;
> -
> -       return iavf_compute_pad(rx_buf_len);
> -}
> -
> -#define IAVF_SKB_PAD iavf_skb_pad()
> -#else
> -#define IAVF_2K_TOO_SMALL_WITH_PADDING false
> -#define IAVF_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
> -#endif
> -
>  /**
>   * iavf_test_staterr - tests bits in Rx descriptor status and error fields
>   * @rx_desc: pointer to receive descriptor (in le64 format)
> @@ -263,12 +201,6 @@ struct iavf_tx_buffer {
>         u32 tx_flags;
>  };
>
> -struct iavf_rx_buffer {
> -       dma_addr_t dma;
> -       struct page *page;
> -       __u32 page_offset;
> -};
> -
>  struct iavf_queue_stats {
>         u64 packets;
>         u64 bytes;
> @@ -311,16 +243,19 @@ enum iavf_ring_state_t {
>  struct iavf_ring {
>         struct iavf_ring *next;         /* pointer to next ring in q_vector */
>         void *desc;                     /* Descriptor ring memory */
> -       struct device *dev;             /* Used for DMA mapping */
> +       union {
> +               struct page_pool *pp;   /* Used on Rx for buffer management */
> +               struct device *dev;     /* Used on Tx for DMA mapping */
> +       };
>         struct net_device *netdev;      /* netdev ring maps to */
>         union {
> +               struct libie_rx_buffer *rx_bi;
>                 struct iavf_tx_buffer *tx_bi;
> -               struct iavf_rx_buffer *rx_bi;
>         };
>         DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
> +       u8 __iomem *tail;
>         u16 queue_index;                /* Queue number of ring */
>         u8 dcb_tc;                      /* Traffic class of ring */
> -       u8 __iomem *tail;
>
>         /* high bit set means dynamic, use accessors routines to read/write.
>          * hardware only supports 2us resolution for the ITR registers.

I'm assuming "tail" was moved here since it is a pointer and fills a hole?

> @@ -329,9 +264,8 @@ struct iavf_ring {
>          */
>         u16 itr_setting;
>
> -       u16 count;                      /* Number of descriptors */
>         u16 reg_idx;                    /* HW register index of the ring */
> -       u16 rx_buf_len;
> +       u16 count;                      /* Number of descriptors */

Why move count down here? It is moving the constant value that is
read-mostly into an area that will be updated more often.

>         /* used in interrupt processing */
>         u16 next_to_use;
> @@ -398,17 +332,6 @@ struct iavf_ring_container {
>  #define iavf_for_each_ring(pos, head) \
>         for (pos = (head).ring; pos != NULL; pos = pos->next)
>
> -static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
> -{
> -#if (PAGE_SIZE < 8192)
> -       if (ring->rx_buf_len > (PAGE_SIZE / 2))
> -               return 1;
> -#endif
> -       return 0;
> -}
> -
> -#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
> -

All this code probably could have been removed in an earlier patch
since I don't think we need the higher order pages once we did away
with the recycling. Odds are we can probably move this into the
recycling code removal.

>  bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
>  netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>  int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring);
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h
> index 3030ba330326..bb90d8f3ad7e 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_type.h
> +++ b/drivers/net/ethernet/intel/iavf/iavf_type.h
> @@ -10,8 +10,6 @@
>  #include "iavf_adminq.h"
>  #include "iavf_devids.h"
>
> -#define IAVF_RXQ_CTX_DBUFF_SHIFT 7
> -
>  /* IAVF_MASK is a macro used on 32 bit registers */
>  #define IAVF_MASK(mask, shift) ((u32)(mask) << (shift))
>
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
> index fdddc3588487..5341508b429f 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
> @@ -1,6 +1,8 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright(c) 2013 - 2018 Intel Corporation. */
>
> +#include <linux/net/intel/libie/rx.h>
> +
>  #include "iavf.h"
>  #include "iavf_prototype.h"
>  #include "iavf_client.h"
> @@ -269,14 +271,13 @@ int iavf_get_vf_vlan_v2_caps(struct iavf_adapter *adapter)
>  void iavf_configure_queues(struct iavf_adapter *adapter)
>  {
>         struct virtchnl_vsi_queue_config_info *vqci;
> -       int i, max_frame = adapter->vf_res->max_mtu;
> +       u32 i, max_frame = adapter->vf_res->max_mtu;
> +       struct iavf_ring *rxq = adapter->rx_rings;
>         int pairs = adapter->num_active_queues;
>         struct virtchnl_queue_pair_info *vqpi;
> +       u32 hr, max_len;
>         size_t len;
>
> -       if (max_frame > IAVF_MAX_RXBUFFER || !max_frame)
> -               max_frame = IAVF_MAX_RXBUFFER;
> -
>         if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
>                 /* bail because we already have a command pending */
>                 dev_err(&adapter->pdev->dev, "Cannot configure queues, command %d pending\n",
> @@ -289,9 +290,9 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
>         if (!vqci)
>                 return;
>
> -       /* Limit maximum frame size when jumbo frames is not enabled */
> -       if (adapter->netdev->mtu <= ETH_DATA_LEN)
> -               max_frame = IAVF_RXBUFFER_1536 - NET_IP_ALIGN;
> +       hr = rxq->pp->p.offset;
> +       max_len = rxq->pp->p.max_len;
> +       max_frame = min_not_zero(max_frame, LIBIE_MAX_RX_FRM_LEN(hr));
>         vqci->vsi_id = adapter->vsi_res->vsi_id;
>         vqci->num_queue_pairs = pairs;
> @@ -309,9 +310,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
>                 vqpi->rxq.ring_len = adapter->rx_rings[i].count;
>                 vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
>                 vqpi->rxq.max_pkt_size = max_frame;
> -               vqpi->rxq.databuffer_size =
> -                       ALIGN(adapter->rx_rings[i].rx_buf_len,
> -                             BIT_ULL(IAVF_RXQ_CTX_DBUFF_SHIFT));

Is this rendered redundant by something? Seems like you should be
guaranteeing somewhere that you are still aligned to this.


> +               vqpi->rxq.databuffer_size = max_len;
>                 vqpi++;
>         }
>
> --
> 2.41.0
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-06 12:47   ` Yunsheng Lin
@ 2023-07-06 16:28     ` Alexander Lobakin
  2023-07-09  5:16       ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-06 16:28 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 6 Jul 2023 20:47:28 +0800

> On 2023/7/5 23:55, Alexander Lobakin wrote:
> 
>> +/**
>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>> + * @size: size of the PP, usually simply Rx queue len
>> + *
>> + * Returns &page_pool on success, casted -errno on failure.
>> + */
>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>> +					    u32 size)
>> +{
>> +	struct page_pool_params pp = {
>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>> +		.order		= LIBIE_RX_PAGE_ORDER,
>> +		.pool_size	= size,
>> +		.nid		= NUMA_NO_NODE,
>> +		.dev		= napi->dev->dev.parent,
>> +		.napi		= napi,
>> +		.dma_dir	= DMA_FROM_DEVICE,
>> +		.offset		= LIBIE_SKB_HEADROOM,
> 
> I think it worth mentioning that the '.offset' is not really accurate
> when the page is split, as we do not really know what is the offset of
> the frag of a page except for the first frag.

Yeah, this is read as "offset from the start of the page or frag to the
actual frame start, i.e. its Ethernet header" or "this is just
xdp->data - xdp->data_hard_start".

> 
>> +	};
>> +	size_t truesize;
>> +
>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>> +
>> +	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
>> +	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>> +	pp.init_arg = (void *)truesize;
> 
> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
> be used along with init_callback. And if we want to change the implemetation

I know. I abused it to save 1 function argument :p It's safe since I
don't use init_callback (not an argument).
I was also thinking of having a union in the PP params, or even a new
field like "wanted truesize", so that your function could take the
value from there in certain cases (e.g. if I pass 0 as a parameter).
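
A minimal sketch of how the alloc side would consume the stashed value
(hypothetical, not the exact libie code; libie_rx_alloc() and struct
libie_rx_buffer are from this series, the rest is plain Page Pool API):

	static inline dma_addr_t libie_rx_alloc(struct page_pool *pool,
						struct libie_rx_buffer *buf)
	{
		/* the "wanted" truesize stashed via pp.init_arg at
		 * pool creation time
		 */
		buf->truesize = (size_t)pool->p.init_arg;
		buf->page = page_pool_dev_alloc_frag(pool, &buf->offset,
						     buf->truesize);
		if (unlikely(!buf->page))
			return DMA_MAPPING_ERROR;

		return page_pool_get_dma_addr(buf->page) + buf->offset +
		       pool->p.offset;
	}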

> of init_callback, we may stuck with it as the driver is using it very
> differently here.
> 
> Is it possible to pass the 'wanted true size' by adding a parameter for
> libie_rx_alloc()?

Yes, or I could store it somewhere on the ring, but that looks uglier
=\ This one is ugly to some degree as well, but at least it's hidden
in the library and doesn't show up in the drivers :D

> 
>> +
>> +	return page_pool_create(&pp);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-06 12:47   ` Yunsheng Lin
@ 2023-07-06 16:38     ` Alexander Lobakin
  2023-07-09  5:16       ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-06 16:38 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 6 Jul 2023 20:47:22 +0800

> On 2023/7/5 23:55, Alexander Lobakin wrote:
>> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
>> no custom recycling logics, it can easily be switched to using Page
>> Pool / libie API instead.
>> This allows to removing the whole dancing around headroom, HW buffer
>> size, and page order. All DMA-for-device is now done in the PP core,
>> for-CPU -- in the libie helper.
>> Use skb_mark_for_recycle() to bring back the recycling and restore the
>> performance. Speaking of performance: on par with the baseline and
>> faster with the PP optimization series applied. But the memory usage for
>> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
>> every second descriptor.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
> 
> ...
> 
>> @@ -2562,11 +2541,7 @@ static void iavf_init_config_adapter(struct iavf_adapter *adapter)
>>  
>>  	netdev->netdev_ops = &iavf_netdev_ops;
>>  	iavf_set_ethtool_ops(netdev);
>> -	netdev->watchdog_timeo = 5 * HZ;
> 
> This seems like a unrelated change here?

Sorta. The default timeout is 5 seconds already, so I removed this
redundancy. But I expected someone to spot this, so I'm perfectly fine
with not doing it [and with stopping doing such things in general].

> 
>> -
>> -	/* MTU range: 68 - 9710 */
>> -	netdev->min_mtu = ETH_MIN_MTU;
>> -	netdev->max_mtu = IAVF_MAX_RXBUFFER - IAVF_PACKET_HDR_PAD;
>> +	netdev->max_mtu = LIBIE_MAX_MTU;
>>  
> 
> ...
> 
>>  /**
>> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>>   **/
>>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>>  {
>> -	struct device *dev = rx_ring->dev;
>> -	int bi_size;
>> +	struct page_pool *pool;
>> +
>> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
>> +					 rx_ring->count);
> 
> If a page is able to be spilt between more than one desc, perhaps the
> prt_ring size does not need to be as big as rx_ring->count.

But we don't know that in advance, right? Esp. given that it's hidden
in the lib. Anyway, you can only assume that in the regular case, i.e.
if you always allocate frags of the same size, PP will split pages
when 2+ frags fit into one and return the whole page otherwise, but
who knows what might happen.
BTW, with the recent recycling optimization, most of the recycling is
done directly through the cache, not the ptr_ring. So I'd even say
it's safe to start creating smaller ptr_rings in the drivers.
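
Just for illustration, something like this (hypothetical values, plain
Page Pool API, not what libie currently does):

	/* a ptr_ring smaller than the Rx descriptor count */
	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= rx_ring->count / 2,	/* not ->count */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.napi		= napi,
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp);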

> 
>> +	if (IS_ERR(pool))
>> +		return PTR_ERR(pool);
>> +
>> +	rx_ring->pp = pool;

[...]

>>  	/* build an skb around the page buffer */
>> -	skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
>> -	if (unlikely(!skb))
>> +	skb = napi_build_skb(va, rx_buffer->truesize);
>> +	if (unlikely(!skb)) {
>> +		page_pool_put_page(page->pp, page, size, true);
> 
> Isn't it more correct to call page_pool_put_full_page() here?
> as we do not know which frag is used for the rx_buffer, and depend
> on the last released frag to do the syncing, maybe I should mention
> that in Documentation/networking/page_pool.rst.

Ooof, maybe. My first try with PP frags. So when we use frags, we always
must use _full_page() / p.max_len instead of the actual frame size?

> 
>>  		return NULL;
>> +	}
> 
> ...
> 
>>  struct iavf_queue_stats {
>>  	u64 packets;
>>  	u64 bytes;
>> @@ -311,16 +243,19 @@ enum iavf_ring_state_t {
>>  struct iavf_ring {
>>  	struct iavf_ring *next;		/* pointer to next ring in q_vector */
>>  	void *desc;			/* Descriptor ring memory */
>> -	struct device *dev;		/* Used for DMA mapping */
>> +	union {
>> +		struct page_pool *pp;	/* Used on Rx for buffer management */
>> +		struct device *dev;	/* Used on Tx for DMA mapping */
>> +	};
>>  	struct net_device *netdev;	/* netdev ring maps to */
>>  	union {
>> +		struct libie_rx_buffer *rx_bi;
>>  		struct iavf_tx_buffer *tx_bi;
>> -		struct iavf_rx_buffer *rx_bi;
>>  	};
>>  	DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
>> +	u8 __iomem *tail;
> 
> Is there a reason to move it here?

Oops, seems like it's a leftover. There is a reason (removing a hole
in the structure), but it needs to be done where I actually change the
layout. In this commit I only alter unions, with no actual layout
changes. Will fix.

> 
>>  	u16 queue_index;		/* Queue number of ring */
>>  	u8 dcb_tc;			/* Traffic class of ring */
>> -	u8 __iomem *tail;
>>  
> 

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling
  2023-07-06 14:47   ` [Intel-wired-lan] " Alexander Duyck
@ 2023-07-06 16:45     ` Alexander Lobakin
  2023-07-06 17:06       ` Alexander Duyck
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-06 16:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 6 Jul 2023 07:47:03 -0700

> On Wed, Jul 5, 2023 at 8:57 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> As an intermediate step, remove all page splitting/recyclig code. Just
> 
> Spelling issue: "recycling"

checkpatch w/codespell didn't catch this one =\

> 
>> always allocate a new page and don't touch its refcount, so that it gets
>> freed by the core stack later.
>> Same for the "in-place" recycling, i.e. when an unused buffer gets
>> assigned to a first needs-refilling descriptor. In some cases, this
>> was leading to moving up to 63 &iavf_rx_buf structures around the ring
>> on a per-field basis -- not something wanted on hotpath.
>> The change allows to greatly simplify certain parts of the code:

[...]

>> @@ -1317,21 +1200,10 @@ static void iavf_put_rx_buffer(struct iavf_ring *rx_ring,
>>         if (!rx_buffer)
>>                 return;
>>
>> -       if (iavf_can_reuse_rx_page(rx_buffer)) {
>> -               /* hand second half of page back to the ring */
>> -               iavf_reuse_rx_page(rx_ring, rx_buffer);
>> -               rx_ring->rx_stats.page_reuse_count++;
>> -       } else {
>> -               /* we are not reusing the buffer so unmap it */
>> -               dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
>> -                                    iavf_rx_pg_size(rx_ring),
>> -                                    DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
>> -               __page_frag_cache_drain(rx_buffer->page,
>> -                                       rx_buffer->pagecnt_bias);
>> -       }
>> -
>> -       /* clear contents of buffer_info */
>> -       rx_buffer->page = NULL;
>> +       /* we are not reusing the buffer so unmap it */
>> +       dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
>> +                            iavf_rx_pg_size(rx_ring),
>> +                            DMA_FROM_DEVICE, IAVF_RX_DMA_ATTR);
> 
> Rather than reorder all this I would just do the dma_unmap_page_attrs
> and then leave the assignment of NULL to rx_buffer->page. It should
> make this a bit easier to clean up the code below.
> 
>>  }
>>
>>  /**
>> @@ -1431,15 +1303,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>>                 else
>>                         skb = iavf_build_skb(rx_ring, rx_buffer, size);
>>
>> +               iavf_put_rx_buffer(rx_ring, rx_buffer);
>> +
> 
> This should stay below where it was.

Wait-wait-wait.

The if (!skb) break exits the loop, and put_rx_buffer() unmaps the
page. So in order to do the first, you need to do the second to avoid
leaks.
Or did you mean "why unmap and free on failure, just leave it in
place"? To make it easier to switch to Page Pool.

> 
>>                 /* exit if we failed to retrieve a buffer */
>>                 if (!skb) {
>>                         rx_ring->rx_stats.alloc_buff_failed++;
>> -                       if (rx_buffer && size)
>> -                               rx_buffer->pagecnt_bias++;
>> +                       __free_pages(rx_buffer->page,
>> +                                    iavf_rx_pg_order(rx_ring));
>> +                       rx_buffer->page = NULL;
>>                         break;
>>                 }
> 
> This code was undoing the iavf_get_rx_buffer decrement of pagecnt_bias
> and then bailing since we have halted forward progress due to an skb
> allocation failure. As such we should just be removing the if
> statement and the increment of pagecnt_bias.
> 
>>
>> -               iavf_put_rx_buffer(rx_ring, rx_buffer);
>> +               rx_buffer->page = NULL;
>>                 cleaned_count++;
>>
>>                 if (iavf_is_non_eop(rx_ring, rx_desc, skb))
> 
> If iavf_put_rx_buffer just does the unmap and assignment of NULL then
> it could just be left here as is.

I guess those two are tied with the one above.

[...]

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-06 15:26   ` [Intel-wired-lan] " Alexander Duyck
@ 2023-07-06 16:56     ` Alexander Lobakin
  2023-07-06 17:28       ` Alexander Duyck
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-06 16:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 6 Jul 2023 08:26:00 -0700

> On Wed, Jul 5, 2023 at 8:58 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
>> no custom recycling logics, it can easily be switched to using Page
>> Pool / libie API instead.
>> This allows to removing the whole dancing around headroom, HW buffer
>> size, and page order. All DMA-for-device is now done in the PP core,
>> for-CPU -- in the libie helper.
>> Use skb_mark_for_recycle() to bring back the recycling and restore the
>> performance. Speaking of performance: on par with the baseline and
>> faster with the PP optimization series applied. But the memory usage for
>> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
>> every second descriptor.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> 
> One thing I am noticing is that there seems to be a bunch of cleanup
> changes in here as well. Things like moving around values within
> structures which I am assuming are to fill holes. You may want to look
> at breaking some of those out as it makes it a bit harder to review
> this since they seem like unrelated changes.

min_mtu and watchdog are unrelated, I'll drop those.
Moving the tail pointer around was supposed to land in a different
commit, not this one, as I already wrote 10 minutes ago :s

[...]

>> -       bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
>> -       memset(rx_ring->rx_bi, 0, bi_size);
>> -
>> -       /* Zero out the descriptor ring */
>> -       memset(rx_ring->desc, 0, rx_ring->size);
>> -
> 
> I have some misgivings about not clearing these. We may want to double
> check to verify the code paths are resilient enough that it won't
> cause any issues w/ repeated up/down testing on the interface. The
> general idea is to keep things consistent w/ the state after
> setup_rx_descriptors. If we don't need this when we don't need to be
> calling the zalloc or calloc version of things in
> setup_rx_descriptors.

Both arrays will be freed a couple of instructions below, so why zero
them?

> 
> 
>>         rx_ring->next_to_clean = 0;
>>         rx_ring->next_to_use = 0;
>>  }

[...]

>>         struct net_device *netdev;      /* netdev ring maps to */
>>         union {
>> +               struct libie_rx_buffer *rx_bi;
>>                 struct iavf_tx_buffer *tx_bi;
>> -               struct iavf_rx_buffer *rx_bi;
>>         };
>>         DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
>> +       u8 __iomem *tail;
>>         u16 queue_index;                /* Queue number of ring */
>>         u8 dcb_tc;                      /* Traffic class of ring */
>> -       u8 __iomem *tail;
>>
>>         /* high bit set means dynamic, use accessors routines to read/write.
>>          * hardware only supports 2us resolution for the ITR registers.
> 
> I'm assuming "tail" was moved here since it is a pointer and fills a hole?

(see above)

> 
>> @@ -329,9 +264,8 @@ struct iavf_ring {
>>          */
>>         u16 itr_setting;
>>
>> -       u16 count;                      /* Number of descriptors */
>>         u16 reg_idx;                    /* HW register index of the ring */
>> -       u16 rx_buf_len;
>> +       u16 count;                      /* Number of descriptors */
> 
> Why move count down here? It is moving the constant value that is
> read-mostly into an area that will be updated more often.

With ::tail put in a different slot, ::count was landing in a
different cacheline, and I wanted to avoid that. But now I feel like I
was just being lazy and should've tested both variants to see whether
this move affects performance. I'll play with this one in the next
rev.

> 
>>         /* used in interrupt processing */
>>         u16 next_to_use;
>> @@ -398,17 +332,6 @@ struct iavf_ring_container {
>>  #define iavf_for_each_ring(pos, head) \
>>         for (pos = (head).ring; pos != NULL; pos = pos->next)
>>
>> -static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
>> -{
>> -#if (PAGE_SIZE < 8192)
>> -       if (ring->rx_buf_len > (PAGE_SIZE / 2))
>> -               return 1;
>> -#endif
>> -       return 0;
>> -}
>> -
>> -#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
>> -
> 
> All this code probably could have been removed in an earlier patch
> since I don't think we need the higher order pages once we did away
> with the recycling. Odds are we can probably move this into the
> recycling code removal.

This ended up here because I merged the "always use order 0" commit
into "switch to Page Pool". IIRC, having all of the removals at once
in one commit (#2) was less readable than the current split, but I'll
double-check.

> 
>>  bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
>>  netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>>  int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring);

[...]

>> @@ -309,9 +310,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
>>                 vqpi->rxq.ring_len = adapter->rx_rings[i].count;
>>                 vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
>>                 vqpi->rxq.max_pkt_size = max_frame;
>> -               vqpi->rxq.databuffer_size =
>> -                       ALIGN(adapter->rx_rings[i].rx_buf_len,
>> -                             BIT_ULL(IAVF_RXQ_CTX_DBUFF_SHIFT));
> 
> Is this rendered redundant by something? Seems like you should be
> guaranteeing somewhere that you are still aligned to this.

See the previous commit, the place where I calculate max_len for the
PP params. The 128-byte alignment is an Intel-wide HW requirement, so
it lives there now.

> 
> 
>> +               vqpi->rxq.databuffer_size = max_len;
>>                 vqpi++;
Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling
  2023-07-06 16:45     ` Alexander Lobakin
@ 2023-07-06 17:06       ` Alexander Duyck
  2023-07-10 13:13         ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2023-07-06 17:06 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On Thu, Jul 6, 2023 at 9:46 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 6 Jul 2023 07:47:03 -0700
>
> > On Wed, Jul 5, 2023 at 8:57 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:

[...]

> >> @@ -1431,15 +1303,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
> >>                 else
> >>                         skb = iavf_build_skb(rx_ring, rx_buffer, size);
> >>
> >> +               iavf_put_rx_buffer(rx_ring, rx_buffer);
> >> +
> >
> > This should stay below where it was.
>
> Wait-wait-wait.
>
> if (!skb) break breaks the loop. put_rx_buffer() unmaps the page.
> So in order to do the first, you need to do the second to avoid leaks.
> Or you meant "why unmapping and freeing if we fail, just leave it in
> place"? To make it easier to switch to Page Pool.

Specifically you don't want to be unmapping and freeing this page
until after the !skb check. The problem is that if skb is NULL, the
skb allocation failed, so Rx processing is meant to stop in place
without removing the page. That is where we will resume on the next
pass, assuming memory has been freed that can then be used. The
problem is the skb allocation, not the page. We used to do the skb
allocation before acquiring the buffer, but with XDP there are cases
where we aren't supposed to allocate it, so it got moved to after,
which causes this confusion.
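
In other words, the intended ordering is roughly this (sketch only,
simplified from the existing loop):

	skb = iavf_build_skb(rx_ring, rx_buffer, size);

	/* bail *before* touching the buffer: the page still holds valid
	 * frame data and will be picked up again once an skb allocation
	 * succeeds on a later pass
	 */
	if (!skb) {
		rx_ring->rx_stats.alloc_buff_failed++;
		break;
	}

	/* only now is it safe to unmap and drop the buffer */
	iavf_put_rx_buffer(rx_ring, rx_buffer);
	rx_buffer->page = NULL;
	cleaned_count++;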

> >
> >>                 /* exit if we failed to retrieve a buffer */
> >>                 if (!skb) {
> >>                         rx_ring->rx_stats.alloc_buff_failed++;
> >> -                       if (rx_buffer && size)
> >> -                               rx_buffer->pagecnt_bias++;
> >> +                       __free_pages(rx_buffer->page,
> >> +                                    iavf_rx_pg_order(rx_ring));
> >> +                       rx_buffer->page = NULL;
> >>                         break;
> >>                 }
> >
> > This code was undoing the iavf_get_rx_buffer decrement of pagecnt_bias
> > and then bailing since we have halted forward progress due to an skb
> > allocation failure. As such we should just be removing the if
> > statement and the increment of pagecnt_bias.

The key bit here is that the allocation failure is the reason we
halted processing. So the page contains valid data and should not be
freed. We just need to leave it in place and wait for an allocation to
succeed, and then we can resume processing.

> >
> >>
> >> -               iavf_put_rx_buffer(rx_ring, rx_buffer);
> >> +               rx_buffer->page = NULL;
> >>                 cleaned_count++;
> >>
> >>                 if (iavf_is_non_eop(rx_ring, rx_desc, skb))
> >
> > If iavf_put_rx_buffer just does the unmap and assignment of NULL then
> > it could just be left here as is.
>
> I guess those two are tied with the one above.

Yeah, the iavf_put_rx_buffer should be left  down here.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-06 16:56     ` Alexander Lobakin
@ 2023-07-06 17:28       ` Alexander Duyck
  2023-07-10 13:18         ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Duyck @ 2023-07-06 17:28 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On Thu, Jul 6, 2023 at 9:57 AM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander Duyck <alexander.duyck@gmail.com>
> Date: Thu, 6 Jul 2023 08:26:00 -0700
>
> > On Wed, Jul 5, 2023 at 8:58 AM Alexander Lobakin
> > <aleksander.lobakin@intel.com> wrote:
> >>
> >> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
> >> no custom recycling logics, it can easily be switched to using Page
> >> Pool / libie API instead.
> >> This allows to removing the whole dancing around headroom, HW buffer
> >> size, and page order. All DMA-for-device is now done in the PP core,
> >> for-CPU -- in the libie helper.
> >> Use skb_mark_for_recycle() to bring back the recycling and restore the
> >> performance. Speaking of performance: on par with the baseline and
> >> faster with the PP optimization series applied. But the memory usage for
> >> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
> >> every second descriptor.
> >>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >
> > One thing I am noticing is that there seems to be a bunch of cleanup
> > changes in here as well. Things like moving around values within
> > structures which I am assuming are to fill holes. You may want to look
> > at breaking some of those out as it makes it a bit harder to review
> > this since they seem like unrelated changes.
>
> min_mtu and watchdog are unrelated, I'll drop those.
> Moving tail pointer around was supposed to land in a different commit,
> not this one, as I wrote 10 minutes ago already :s
>
> [...]
>
> >> -       bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
> >> -       memset(rx_ring->rx_bi, 0, bi_size);
> >> -
> >> -       /* Zero out the descriptor ring */
> >> -       memset(rx_ring->desc, 0, rx_ring->size);
> >> -
> >
> > I have some misgivings about not clearing these. We may want to double
> > check to verify the code paths are resilient enough that it won't
> > cause any issues w/ repeated up/down testing on the interface. The
> > general idea is to keep things consistent w/ the state after
> > setup_rx_descriptors. If we don't need this when we don't need to be
> > calling the zalloc or calloc version of things in
> > setup_rx_descriptors.
>
> Both arrays will be freed couple instructions below, why zero them?

Ugh. You are right, but not for a good reason. The other Intel
drivers would historically do the clean_rx_ring calls on _down() and
free the resources on _close(). Specifically, that reduced the
overhead of things like resets or settings changes, since it didn't
require reallocating the descriptor rings and buffer info structures.

I guess you are good to remove these since this code doesn't do that.

> >
> >
> >>         rx_ring->next_to_clean = 0;
> >>         rx_ring->next_to_use = 0;
> >>  }
>
> [...]
>
> >>         struct net_device *netdev;      /* netdev ring maps to */
> >>         union {
> >> +               struct libie_rx_buffer *rx_bi;
> >>                 struct iavf_tx_buffer *tx_bi;
> >> -               struct iavf_rx_buffer *rx_bi;
> >>         };
> >>         DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
> >> +       u8 __iomem *tail;
> >>         u16 queue_index;                /* Queue number of ring */
> >>         u8 dcb_tc;                      /* Traffic class of ring */
> >> -       u8 __iomem *tail;
> >>
> >>         /* high bit set means dynamic, use accessors routines to read/write.
> >>          * hardware only supports 2us resolution for the ITR registers.
> >
> > I'm assuming "tail" was moved here since it is a pointer and fills a hole?
>
> (see above)
>
> >
> >> @@ -329,9 +264,8 @@ struct iavf_ring {
> >>          */
> >>         u16 itr_setting;
> >>
> >> -       u16 count;                      /* Number of descriptors */
> >>         u16 reg_idx;                    /* HW register index of the ring */
> >> -       u16 rx_buf_len;
> >> +       u16 count;                      /* Number of descriptors */
> >
> > Why move count down here? It is moving the constant value that is
> > read-mostly into an area that will be updated more often.
>
> With the ::tail put in a different slot, ::count was landing in a
> different cacheline. I wanted to avoid this. But now I feel like I was
> just lazy and must've tested both variants to see if this move affects
> performance. I'll play with this one in the next rev.

The performance impact should be minimal. Odds are the placement was
the way it was because it simply copied the original code that has
been there since igb/ixgbe. The general idea is to keep the
read-mostly items grouped at the top and to order them roughly by how
frequently they are read, so that wherever the cache line boundary
ends up you won't take much of a penalty; at worst the infrequently
read items get pulled into the active cache line.
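
As a rough illustration of the grouping idea (not the exact iavf
layout):

	struct example_ring {
		/* read-mostly: set at init time, only read in the hotpath */
		void *desc;
		struct net_device *netdev;
		u8 __iomem *tail;
		u16 count;
		u16 reg_idx;

		/* read/write: updated on every clean/alloc cycle */
		u16 next_to_use;
		u16 next_to_clean;
	};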

> >
> >>         /* used in interrupt processing */
> >>         u16 next_to_use;
> >> @@ -398,17 +332,6 @@ struct iavf_ring_container {
> >>  #define iavf_for_each_ring(pos, head) \
> >>         for (pos = (head).ring; pos != NULL; pos = pos->next)
> >>
> >> -static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
> >> -{
> >> -#if (PAGE_SIZE < 8192)
> >> -       if (ring->rx_buf_len > (PAGE_SIZE / 2))
> >> -               return 1;
> >> -#endif
> >> -       return 0;
> >> -}
> >> -
> >> -#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
> >> -
> >
> > All this code probably could have been removed in an earlier patch
> > since I don't think we need the higher order pages once we did away
> > with the recycling. Odds are we can probably move this into the
> > recycling code removal.
>
> This went here as I merged "always use order 0" commit with "switch to
> Page Pool". In general, IIRC having removals of all the stuff at once in
> one commit (#2) was less readable than the current version, but I'll
> double-check.

It all depends on how much has to be added to accommodate this. In my
mind, when we did away with the page splitting/recycling we also did
away with the need for the higher-order pages. That is why I was
thinking it might make more sense there: it would just be more
removals, with very few if any additions needed to support it.


> >
> >>  bool iavf_alloc_rx_buffers(struct iavf_ring *rxr, u16 cleaned_count);
> >>  netdev_tx_t iavf_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
> >>  int iavf_setup_tx_descriptors(struct iavf_ring *tx_ring);
>
> [...]
>
> >> @@ -309,9 +310,7 @@ void iavf_configure_queues(struct iavf_adapter *adapter)
> >>                 vqpi->rxq.ring_len = adapter->rx_rings[i].count;
> >>                 vqpi->rxq.dma_ring_addr = adapter->rx_rings[i].dma;
> >>                 vqpi->rxq.max_pkt_size = max_frame;
> >> -               vqpi->rxq.databuffer_size =
> >> -                       ALIGN(adapter->rx_rings[i].rx_buf_len,
> >> -                             BIT_ULL(IAVF_RXQ_CTX_DBUFF_SHIFT));
> >
> > Is this rendered redundant by something? Seems like you should be
> > guaranteeing somewhere that you are still aligned to this.
>
> See the previous commit, the place where I calculate max_len for the PP
> params. 128 byte is Intel-wide HW req, so it lives there now.

Okay, that is the piece I missed. It was converted from a BIT_ULL(7)
to just a 128. Thanks.

> >
> >
> >> +               vqpi->rxq.databuffer_size = max_len;
> >>                 vqpi++;
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-06 16:28     ` Alexander Lobakin
@ 2023-07-09  5:16       ` Yunsheng Lin
  2023-07-10 13:25         ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-09  5:16 UTC (permalink / raw)
  To: Alexander Lobakin, Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/7 0:28, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@huawei.com>
> Date: Thu, 6 Jul 2023 20:47:28 +0800
> 
>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>
>>> +/**
>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>> + * @size: size of the PP, usually simply Rx queue len
>>> + *
>>> + * Returns &page_pool on success, casted -errno on failure.
>>> + */
>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>> +					    u32 size)
>>> +{
>>> +	struct page_pool_params pp = {
>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>> +		.pool_size	= size,
>>> +		.nid		= NUMA_NO_NODE,
>>> +		.dev		= napi->dev->dev.parent,
>>> +		.napi		= napi,
>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>
>> I think it worth mentioning that the '.offset' is not really accurate
>> when the page is split, as we do not really know what is the offset of
>> the frag of a page except for the first frag.
> 
> Yeah, this is read as "offset from the start of the page or frag to the
> actual frame start, i.e. its Ethernet header" or "this is just
> xdp->data - xdp->data_hard_start".

So the problem arises if most drivers read it the way libie does
here: .offset already has clear semantics, namely it is used to skip
the dma sync operation for the buffer range that the dma operation
doesn't touch. Even if it happens to have the same value as "offset
from the start of the page or frag to the actual frame start", I am
not sure reusing it is future-proof.

When page frag support was added, I didn't give much thought to this,
as we use it on a cache-coherent system.
It seems we might need to extend or update those semantics if we
really want to skip the dma sync operation for all the buffer ranges
not touched by dma in the page split case.
Or skipping the dma sync for all untouched ranges might not be worth
the effort, because it might require a per-frag dma sync, which is
more costly than one batched per-page dma sync. If that is the case,
page pool already supports it as far as I understand, because the dma
sync is only done when the last frag is released/freed.

> 
>>
>>> +	};
>>> +	size_t truesize;
>>> +
>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);

As mentioned above, if we depend on the last released/freed frag to
do the dma sync, pp.max_len might need to cover all the frags, i.e.
the whole page.

>>> +
>>> +	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>> +	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>> +	pp.init_arg = (void *)truesize;
>>
>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>> be used along with init_callback. And if we want to change the implemetation
> 
> I know. I abused it to save 1 function argument :p It's safe since I
> don't use init_callback (not an argument).
> I was thinking also of having a union in PP params or even a new field
> like "wanted true size", so that your function could even take values
> from there in certain cases (e.g. if I pass 0 as parameter).
> 
>> of init_callback, we may stuck with it as the driver is using it very
>> differently here.
>>
>> Is it possible to pass the 'wanted true size' by adding a parameter for
>> libie_rx_alloc()?
> 
> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
> one does as well to some degree, but at least hidden in the library and
> doesn't show up in the drivers :D

It seems most HW drivers know the size of memory they need when
creating the ring/queue, so setting the frag size and deciding how
many frags a page is split into before allocation seems like a
possible future optimization.

For now, it would be better to at least add a helper to access
pp.init_arg instead of accessing it directly, to make this more
obvious and to make the future optimization easier.

> 
>>
>>> +
>>> +	return page_pool_create(&pp);
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
> 
> Thanks,
> Olek


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-06 16:38     ` Alexander Lobakin
@ 2023-07-09  5:16       ` Yunsheng Lin
  2023-07-10 13:34         ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-09  5:16 UTC (permalink / raw)
  To: Alexander Lobakin, Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/7 0:38, Alexander Lobakin wrote:

...
 
>>
>>>  /**
>>> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>>>   **/
>>>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>>>  {
>>> -	struct device *dev = rx_ring->dev;
>>> -	int bi_size;
>>> +	struct page_pool *pool;
>>> +
>>> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
>>> +					 rx_ring->count);
>>
>> If a page is able to be spilt between more than one desc, perhaps the
>> prt_ring size does not need to be as big as rx_ring->count.
> 
> But we doesn't know in advance, right? Esp. given that it's hidden in
> the lib. But anyway, you can only assume that in regular cases if you
> always allocate frags of the same size, PP will split pages when 2+
> frags can fit there or return the whole page otherwise, but who knows
> what might happen.

It seems the Intel driver knows the size of memory it needs when
creating the ring/queue/napi/pp. Maybe the driver should only tell
libie how many descs the queue uses, and libie can adjust the pool
size accordingly?

> BTW, with recent recycling optimization, most of recycling is done
> directly through cache, not ptr_ring. So I'd even say it's safe to start
> creating smaller ptr_rings in the drivers.

The problem is that we may use more memory than before in certain
cases if we don't limit the size of the ptr_ring, unless we can ensure
that all of the recycling is done directly through the cache and not
the ptr_ring.

> 
>>
>>> +	if (IS_ERR(pool))
>>> +		return PTR_ERR(pool);
>>> +
>>> +	rx_ring->pp = pool;
> 
> [...]
> 
>>>  	/* build an skb around the page buffer */
>>> -	skb = napi_build_skb(va - IAVF_SKB_PAD, truesize);
>>> -	if (unlikely(!skb))
>>> +	skb = napi_build_skb(va, rx_buffer->truesize);
>>> +	if (unlikely(!skb)) {
>>> +		page_pool_put_page(page->pp, page, size, true);
>>
>> Isn't it more correct to call page_pool_put_full_page() here?
>> as we do not know which frag is used for the rx_buffer, and depend
>> on the last released frag to do the syncing, maybe I should mention
>> that in Documentation/networking/page_pool.rst.
> 
> Ooof, maybe. My first try with PP frags. So when we use frags, we always
> must use _full_page() / p.max_len instead of the actual frame size?

Currently, _full_page() / p.max_len must be used to ensure the
correct dma sync operation.
But as mentioned for the previous patch, it comes back to what the
semantics of the dma sync operation should be for the page split
case.
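
For example, in the error path above something like this would be the
safe option for now (sketch, assuming the page comes from a
frag-capable pool):

	skb = napi_build_skb(va, rx_buffer->truesize);
	if (unlikely(!skb)) {
		/* don't pass the frame size here: the deferred sync has
		 * to cover up to pp.max_len, not just the part of this
		 * frag the HW wrote
		 */
		page_pool_put_full_page(page->pp, page, true);
		return NULL;
	}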

> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling
  2023-07-06 17:06       ` Alexander Duyck
@ 2023-07-10 13:13         ` Alexander Lobakin
  0 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-10 13:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Eric Dumazet,
	Michal Kubiak, intel-wired-lan, Yunsheng Lin, David Christensen,
	Jakub Kicinski, Paolo Abeni, David S. Miller

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 6 Jul 2023 10:06:29 -0700

> On Thu, Jul 6, 2023 at 9:46 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Date: Thu, 6 Jul 2023 07:47:03 -0700
>>
>>> On Wed, Jul 5, 2023 at 8:57 AM Alexander Lobakin
>>> <aleksander.lobakin@intel.com> wrote:
> 
> [...]
> 
>>>> @@ -1431,15 +1303,18 @@ static int iavf_clean_rx_irq(struct iavf_ring *rx_ring, int budget)
>>>>                 else
>>>>                         skb = iavf_build_skb(rx_ring, rx_buffer, size);
>>>>
>>>> +               iavf_put_rx_buffer(rx_ring, rx_buffer);
>>>> +
>>>
>>> This should stay below where it was.
>>
>> Wait-wait-wait.
>>
>> if (!skb) break breaks the loop. put_rx_buffer() unmaps the page.
>> So in order to do the first, you need to do the second to avoid leaks.
>> Or you meant "why unmapping and freeing if we fail, just leave it in
>> place"? To make it easier to switch to Page Pool.
> 
> Specifically you don't want to be unmapping and freeing this page
> until after the !skb check. The problem is if skb is NULL the skb
> allocation failed and so processing of Rx is meant to stop in place
> without removing the page. It is where we will resume on the next pass
> assuming memory has been freed that can then be used. The problem is
> the skb allocation, not the page. We used to do the skb allocation
> before we would acquire the buffer, but with XDP there are cases where
> we aren't supposed to allocate it so it got moved to after which
> causes this confusion.
> 
>>>
>>>>                 /* exit if we failed to retrieve a buffer */
>>>>                 if (!skb) {
>>>>                         rx_ring->rx_stats.alloc_buff_failed++;
>>>> -                       if (rx_buffer && size)
>>>> -                               rx_buffer->pagecnt_bias++;
>>>> +                       __free_pages(rx_buffer->page,
>>>> +                                    iavf_rx_pg_order(rx_ring));
>>>> +                       rx_buffer->page = NULL;
>>>>                         break;
>>>>                 }
>>>
>>> This code was undoing the iavf_get_rx_buffer decrement of pagecnt_bias
>>> and then bailing since we have halted forward progress due to an skb
>>> allocation failure. As such we should just be removing the if
>>> statement and the increment of pagecnt_bias.
> 
> The key bit here is the allocation failure is the reason why we halted
> processing. So the page contains valid data and should not be freed.
> We just need to leave it in place and wait for an allocation to
> succeed and then we can resume processing.

Aaah, okay, you want to try once again with the same buffer next time.
I see. Makes sense!
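
Just to be sure I got it right, a sketch of the resulting ordering
(helper names as in the diff above, not the final patch):

	skb = iavf_build_skb(rx_ring, rx_buffer, size);

	/* exit if we failed to retrieve a buffer: leave it mapped and in
	 * place so this descriptor is retried on the next poll once
	 * memory is available again
	 */
	if (!skb) {
		rx_ring->rx_stats.alloc_buff_failed++;
		break;
	}

	/* only now unmap the page and clear rx_buffer->page */
	iavf_put_rx_buffer(rx_ring, rx_buffer);
	cleaned_count++;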

> 
>>>
>>>>
>>>> -               iavf_put_rx_buffer(rx_ring, rx_buffer);
>>>> +               rx_buffer->page = NULL;
>>>>                 cleaned_count++;
>>>>
>>>>                 if (iavf_is_non_eop(rx_ring, rx_desc, skb))
>>>
>>> If iavf_put_rx_buffer just does the unmap and assignment of NULL then
>>> it could just be left here as is.
>>
>> I guess those two are tied with the one above.
> 
> Yeah, the iavf_put_rx_buffer should be left  down here.
Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-06 17:28       ` Alexander Duyck
@ 2023-07-10 13:18         ` Alexander Lobakin
  0 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-10 13:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 6 Jul 2023 10:28:06 -0700

> On Thu, Jul 6, 2023 at 9:57 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Date: Thu, 6 Jul 2023 08:26:00 -0700
>>
>>> On Wed, Jul 5, 2023 at 8:58 AM Alexander Lobakin
>>> <aleksander.lobakin@intel.com> wrote:
>>>>
>>>> Now that the IAVF driver simply uses dev_alloc_page() + free_page() with
>>>> no custom recycling logics, it can easily be switched to using Page
>>>> Pool / libie API instead.
>>>> This allows to removing the whole dancing around headroom, HW buffer
>>>> size, and page order. All DMA-for-device is now done in the PP core,
>>>> for-CPU -- in the libie helper.
>>>> Use skb_mark_for_recycle() to bring back the recycling and restore the
>>>> performance. Speaking of performance: on par with the baseline and
>>>> faster with the PP optimization series applied. But the memory usage for
>>>> 1500b MTU is now almost 2x lower (x86_64) thanks to allocating a page
>>>> every second descriptor.
>>>>
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>
>>> One thing I am noticing is that there seems to be a bunch of cleanup
>>> changes in here as well. Things like moving around values within
>>> structures which I am assuming are to fill holes. You may want to look
>>> at breaking some of those out as it makes it a bit harder to review
>>> this since they seem like unrelated changes.
>>
>> min_mtu and watchdog are unrelated, I'll drop those.
>> Moving tail pointer around was supposed to land in a different commit,
>> not this one, as I wrote 10 minutes ago already :s
>>
>> [...]
>>
>>>> -       bi_size = sizeof(struct iavf_rx_buffer) * rx_ring->count;
>>>> -       memset(rx_ring->rx_bi, 0, bi_size);
>>>> -
>>>> -       /* Zero out the descriptor ring */
>>>> -       memset(rx_ring->desc, 0, rx_ring->size);
>>>> -
>>>
>>> I have some misgivings about not clearing these. We may want to double
>>> check to verify the code paths are resilient enough that it won't
>>> cause any issues w/ repeated up/down testing on the interface. The
>>> general idea is to keep things consistent w/ the state after
>>> setup_rx_descriptors. If we don't need this when we don't need to be
>>> calling the zalloc or calloc version of things in
>>> setup_rx_descriptors.
>>
>> Both arrays will be freed couple instructions below, why zero them?
> 
> Ugh. You are right, but not for a good reason. So the other Intel
> drivers in the past would be doing the clean_rx_ring calls on the
> _down() with the freeing of resources on _close(). Specifically it
> allowed reducing the overhead for things like resets or setting
> changes since it didn't require reallocating the descriptor rings and
> buffer info structures.
> 
> I guess you are good to remove these since this code doesn't do that.

We might go back to this to not always do a full circle when not needed,
but currently this is redundant.

> 
>>>
>>>
>>>>         rx_ring->next_to_clean = 0;
>>>>         rx_ring->next_to_use = 0;
>>>>  }
>>
>> [...]
>>
>>>>         struct net_device *netdev;      /* netdev ring maps to */
>>>>         union {
>>>> +               struct libie_rx_buffer *rx_bi;
>>>>                 struct iavf_tx_buffer *tx_bi;
>>>> -               struct iavf_rx_buffer *rx_bi;
>>>>         };
>>>>         DECLARE_BITMAP(state, __IAVF_RING_STATE_NBITS);
>>>> +       u8 __iomem *tail;
>>>>         u16 queue_index;                /* Queue number of ring */
>>>>         u8 dcb_tc;                      /* Traffic class of ring */
>>>> -       u8 __iomem *tail;
>>>>
>>>>         /* high bit set means dynamic, use accessors routines to read/write.
>>>>          * hardware only supports 2us resolution for the ITR registers.
>>>
>>> I'm assuming "tail" was moved here since it is a pointer and fills a hole?
>>
>> (see above)
>>
>>>
>>>> @@ -329,9 +264,8 @@ struct iavf_ring {
>>>>          */
>>>>         u16 itr_setting;
>>>>
>>>> -       u16 count;                      /* Number of descriptors */
>>>>         u16 reg_idx;                    /* HW register index of the ring */
>>>> -       u16 rx_buf_len;
>>>> +       u16 count;                      /* Number of descriptors */
>>>
>>> Why move count down here? It is moving the constant value that is
>>> read-mostly into an area that will be updated more often.
>>
>> With the ::tail put in a different slot, ::count was landing in a
>> different cacheline. I wanted to avoid this. But now I feel like I was
>> just lazy and must've tested both variants to see if this move affects
>> performance. I'll play with this one in the next rev.
> 
> The performance impact should be minimal. Odds are the placement was
> the way it was since it was probably just copying the original code
> that has been there since igb/ixgbe. The general idea is just keep the
> read-mostly items grouped at the top and try to order them somewhat by
> frequency of being read so that wherever the cache line ends up you
> won't take much of a penalty as hopefully you will just have the
> infrequently read items end up getting pulled into the active cache
> line.

+

> 
>>>
>>>>         /* used in interrupt processing */
>>>>         u16 next_to_use;
>>>> @@ -398,17 +332,6 @@ struct iavf_ring_container {
>>>>  #define iavf_for_each_ring(pos, head) \
>>>>         for (pos = (head).ring; pos != NULL; pos = pos->next)
>>>>
>>>> -static inline unsigned int iavf_rx_pg_order(struct iavf_ring *ring)
>>>> -{
>>>> -#if (PAGE_SIZE < 8192)
>>>> -       if (ring->rx_buf_len > (PAGE_SIZE / 2))
>>>> -               return 1;
>>>> -#endif
>>>> -       return 0;
>>>> -}
>>>> -
>>>> -#define iavf_rx_pg_size(_ring) (PAGE_SIZE << iavf_rx_pg_order(_ring))
>>>> -
>>>
>>> All this code probably could have been removed in an earlier patch
>>> since I don't think we need the higher order pages once we did away
>>> with the recycling. Odds are we can probably move this into the
>>> recycling code removal.
>>
>> This went here as I merged "always use order 0" commit with "switch to
>> Page Pool". In general, IIRC having removals of all the stuff at once in
>> one commit (#2) was less readable than the current version, but I'll
>> double-check.
> 
> It all depends on how much is having to be added to accommodate this.
> In my mind when we did away with the page splitting/recycling we also
> did away with the need for the higher order pages. That is why I was
> thinking it might make more sense there as it would just be more
> removals with very few if any additions needed to support it.

Yeah, I'll try and see whether any pieces can be grouped differently for
better readability/logic.

[...]

Thanks!
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-09  5:16       ` Yunsheng Lin
@ 2023-07-10 13:25         ` Alexander Lobakin
  2023-07-11 11:39           ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-10 13:25 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <yunshenglin0825@gmail.com>
Date: Sun, 9 Jul 2023 13:16:33 +0800

> On 2023/7/7 0:28, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@huawei.com>
>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>
>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>
>>>> +/**
>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>> + * @size: size of the PP, usually simply Rx queue len
>>>> + *
>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>> + */
>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>> +					    u32 size)
>>>> +{
>>>> +	struct page_pool_params pp = {
>>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>>> +		.pool_size	= size,
>>>> +		.nid		= NUMA_NO_NODE,
>>>> +		.dev		= napi->dev->dev.parent,
>>>> +		.napi		= napi,
>>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>>
>>> I think it worth mentioning that the '.offset' is not really accurate
>>> when the page is split, as we do not really know what is the offset of
>>> the frag of a page except for the first frag.
>>
>> Yeah, this is read as "offset from the start of the page or frag to the
>> actual frame start, i.e. its Ethernet header" or "this is just
>> xdp->data - xdp->data_hard_start".
> 
> So the problem seems to be if most of drivers have a similar reading as
> libie does here, as .offset seems to have a clear semantics which is used
> to skip dma sync operation for buffer range that is not touched by the
> dma operation. Even if it happens to have the same value of "offset from
> the start of the page or frag to the actual frame start", I am not sure
> it is future-proofing to reuse it.

Not sure I'm following :s

> 
> When page frag is added, I didn't really give much thought about that as
> we use it in a cache coherent system.
> It seems we might need to extend or update that semantics if we really want
> to skip dma sync operation for all the buffer ranges that are not touched
> by the dma operation for page split case.
> Or Skipping dma sync operation for all untouched ranges might not be worth
> the effort, because it might need a per frag dma sync operation, which is
> more costly than a batched per page dma sync operation. If it is true, page
> pool already support that currently as my understanding, because the dma
> sync operation is only done when the last frag is released/freed.
> 
>>
>>>
>>>> +	};
>>>> +	size_t truesize;
>>>> +
>>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
> 
> As mentioned above, if we depend on the last released/freed frag to do the
> dma sync, the pp.max_len might need to cover all the frag.

                                               ^^^^^^^^^^^^

You mean the whole page or...?
I think it's not the driver's duty to track all this. We always set
.offset to `data - data_hard_start` and .max_len to the maximum
HW-writeable length for one frame. We don't know whether PP will give us
a whole page or just a piece. DMA sync for device is performed in the PP
core code as well. The driver just creates a PP and doesn't care about
the internals.

> 
>>>> +
>>>> +	/* "Wanted" truesize, passed to page_pool_dev_alloc() */
>>>> +	truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset + pp.max_len));
>>>> +	pp.init_arg = (void *)truesize;
>>>
>>> I am not sure if it is correct to use pp.init_arg here, as it is supposed to
>>> be used along with init_callback. And if we want to change the implemetation
>>
>> I know. I abused it to save 1 function argument :p It's safe since I
>> don't use init_callback (not an argument).
>> I was thinking also of having a union in PP params or even a new field
>> like "wanted true size", so that your function could even take values
>> from there in certain cases (e.g. if I pass 0 as parameter).
>>
>>> of init_callback, we may stuck with it as the driver is using it very
>>> differently here.
>>>
>>> Is it possible to pass the 'wanted true size' by adding a parameter for
>>> libie_rx_alloc()?
>>
>> Yes, or I could store it somewhere on the ring, but looks uglier =\ This
>> one does as well to some degree, but at least hidden in the library and
>> doesn't show up in the drivers :D
> 
> It seems most hw driver know the size of memory it needs when creating
> the ring/queue, setting the frag size and deciding how many is a page
> split into before allocation seems like a possible future optimization.
> 
> For now, it would be better to add helper to acess pp.init_arg at least
> instead of acess pp.init_arg directly to make it more obvious and make
> the future optimization more easier.

Makes sense.
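E.g. something like the below -- a hypothetical accessor (not an existing
PP or libie API), so nothing outside the lib pokes at the PP internals
directly:

	/* Hide the "truesize stashed in pp.init_arg" trick from callers */
	static inline size_t libie_rx_pp_truesize(const struct page_pool *pool)
	{
		return (size_t)pool->p.init_arg;
	}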

> 
>>
>>>
>>>> +
>>>> +	return page_pool_create(&pp);
>>>> +}
>>>> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
>>
>> Thanks,
>> Olek
> 

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-09  5:16       ` Yunsheng Lin
@ 2023-07-10 13:34         ` Alexander Lobakin
  2023-07-11 11:47           ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-10 13:34 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <yunshenglin0825@gmail.com>
Date: Sun, 9 Jul 2023 13:16:39 +0800

> On 2023/7/7 0:38, Alexander Lobakin wrote:
> 
> ...
>  
>>>
>>>>  /**
>>>> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>>>>   **/
>>>>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>>>>  {
>>>> -	struct device *dev = rx_ring->dev;
>>>> -	int bi_size;
>>>> +	struct page_pool *pool;
>>>> +
>>>> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
>>>> +					 rx_ring->count);
>>>
>>> If a page is able to be spilt between more than one desc, perhaps the
>>> prt_ring size does not need to be as big as rx_ring->count.
>>
>> But we doesn't know in advance, right? Esp. given that it's hidden in
>> the lib. But anyway, you can only assume that in regular cases if you
>> always allocate frags of the same size, PP will split pages when 2+
>> frags can fit there or return the whole page otherwise, but who knows
>> what might happen.
> 
> It seems intel driver is able to know the size of memory it needs when
> creating the ring/queue/napi/pp, maybe the driver only tell the libie
> how many descs does it use for queue, and libie can adjust it accordingly?

But libie can't say for sure how PP will split pages for it, right?

> 
>> BTW, with recent recycling optimization, most of recycling is done
>> directly through cache, not ptr_ring. So I'd even say it's safe to start
>> creating smaller ptr_rings in the drivers.
> 
> The problem is that we may use more memory than before for certain case
> if we don't limit the size of ptr_ring, unless we can ensure all of
> recycling is done directly through cache, not ptr_ring.
Also not sure I'm following =\

[...]

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-10 13:25         ` Alexander Lobakin
@ 2023-07-11 11:39           ` Yunsheng Lin
  2023-07-11 16:37             ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-11 11:39 UTC (permalink / raw)
  To: Alexander Lobakin, Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/10 21:25, Alexander Lobakin wrote:
> From: Yunsheng Lin <yunshenglin0825@gmail.com>
> Date: Sun, 9 Jul 2023 13:16:33 +0800
> 
>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>
>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>
>>>>> +/**
>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>> + *
>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>> + */
>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>> +					    u32 size)
>>>>> +{
>>>>> +	struct page_pool_params pp = {
>>>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>>>> +		.pool_size	= size,
>>>>> +		.nid		= NUMA_NO_NODE,
>>>>> +		.dev		= napi->dev->dev.parent,
>>>>> +		.napi		= napi,
>>>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>>>
>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>> when the page is split, as we do not really know what is the offset of
>>>> the frag of a page except for the first frag.
>>>
>>> Yeah, this is read as "offset from the start of the page or frag to the
>>> actual frame start, i.e. its Ethernet header" or "this is just
>>> xdp->data - xdp->data_hard_start".
>>
>> So the problem seems to be if most of drivers have a similar reading as
>> libie does here, as .offset seems to have a clear semantics which is used
>> to skip dma sync operation for buffer range that is not touched by the
>> dma operation. Even if it happens to have the same value of "offset from
>> the start of the page or frag to the actual frame start", I am not sure
>> it is future-proofing to reuse it.
> 
> Not sure I'm following :s

It would be better to avoid accessing the internal data of the page pool
directly as much as possible, as that may be changed to a different
meaning or removed if the implementation is changed.

If it is common enough that most drivers are using it the same way, adding
a helper for that would be great.

> 
>>
>> When page frag is added, I didn't really give much thought about that as
>> we use it in a cache coherent system.
>> It seems we might need to extend or update that semantics if we really want
>> to skip dma sync operation for all the buffer ranges that are not touched
>> by the dma operation for page split case.
>> Or Skipping dma sync operation for all untouched ranges might not be worth
>> the effort, because it might need a per frag dma sync operation, which is
>> more costly than a batched per page dma sync operation. If it is true, page
>> pool already support that currently as my understanding, because the dma
>> sync operation is only done when the last frag is released/freed.
>>
>>>
>>>>
>>>>> +	};
>>>>> +	size_t truesize;
>>>>> +
>>>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>
>> As mentioned above, if we depend on the last released/freed frag to do the
>> dma sync, the pp.max_len might need to cover all the frag.
> 
>                                                ^^^^^^^^^^^^
> 
> You mean the whole page or...?

If we don't care about accurate DMA syncing, "cover all the frag" means
the whole page here, as the page pool doesn't have enough info to do an
accurate DMA sync for now.

> I think it's not the driver's duty to track all this. We always set
> .offset to `data - data_hard_start` and .max_len to the maximum
> HW-writeable length for one frame. We don't know whether PP will give us
> a whole page or just a piece. DMA sync for device is performed in the PP
> core code as well. Driver just creates a PP and don't care about the
> internals.

The problem is that when page_pool_put_page() is called with a split
page, the page pool does not know which frag is being freed either.

Setting 'maximum HW-writeable length for one frame' only syncs the first
frag of a page, as below:

https://elixir.free-electrons.com/linux/v6.4-rc6/source/net/core/page_pool.c#L325
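
Roughly, the helper behind the link looks like this (as of v6.4): the
sync always starts at pool->p.offset and is capped at p.max_len, so only
the first frag's range of a split page is covered:

	static void page_pool_dma_sync_for_device(struct page_pool *pool,
						  struct page *page,
						  unsigned int dma_sync_size)
	{
		dma_addr_t dma_addr = page_pool_get_dma_addr(page);

		dma_sync_size = min(dma_sync_size, pool->p.max_len);
		dma_sync_single_range_for_device(pool->p.dev, dma_addr,
						 pool->p.offset, dma_sync_size,
						 pool->p.dma_dir);
	}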


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-10 13:34         ` Alexander Lobakin
@ 2023-07-11 11:47           ` Yunsheng Lin
  2023-07-18 13:56             ` Alexander Lobakin
  0 siblings, 1 reply; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-11 11:47 UTC (permalink / raw)
  To: Alexander Lobakin, Yunsheng Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/10 21:34, Alexander Lobakin wrote:
> From: Yunsheng Lin <yunshenglin0825@gmail.com>
> Date: Sun, 9 Jul 2023 13:16:39 +0800
> 
>> On 2023/7/7 0:38, Alexander Lobakin wrote:
>>
>> ...
>>  
>>>>
>>>>>  /**
>>>>> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>>>>>   **/
>>>>>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>>>>>  {
>>>>> -	struct device *dev = rx_ring->dev;
>>>>> -	int bi_size;
>>>>> +	struct page_pool *pool;
>>>>> +
>>>>> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
>>>>> +					 rx_ring->count);
>>>>
>>>> If a page is able to be spilt between more than one desc, perhaps the
>>>> prt_ring size does not need to be as big as rx_ring->count.
>>>
>>> But we doesn't know in advance, right? Esp. given that it's hidden in
>>> the lib. But anyway, you can only assume that in regular cases if you
>>> always allocate frags of the same size, PP will split pages when 2+
>>> frags can fit there or return the whole page otherwise, but who knows
>>> what might happen.
>>
>> It seems intel driver is able to know the size of memory it needs when
>> creating the ring/queue/napi/pp, maybe the driver only tell the libie
>> how many descs does it use for queue, and libie can adjust it accordingly?
> 
> But libie can't say for sure how PP will split pages for it, right?
> 
>>
>>> BTW, with recent recycling optimization, most of recycling is done
>>> directly through cache, not ptr_ring. So I'd even say it's safe to start
>>> creating smaller ptr_rings in the drivers.
>>
>> The problem is that we may use more memory than before for certain case
>> if we don't limit the size of ptr_ring, unless we can ensure all of
>> recycling is done directly through cache, not ptr_ring.
> Also not sure I'm following =\

Before adding page pool support, the max memory used in the driver is as
below:
rx_ring->count * PAGE_SIZE;

After adding page pool support, the max memory used in the driver is as
below:

ptr_ring->size * PAGE_SIZE +
PP_ALLOC_CACHE_SIZE * PAGE_SIZE +
rx_ring->count * PAGE_SIZE / pp.init_arg
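
As a rough worked example (assuming 4K pages, rx_ring->count = 512, 2K
frags so a page is split in two, a ptr_ring sized to the ring and the
default 128-entry allocation cache): before, the driver pinned at most
512 * 4K = 2M; after, the worst case is about 512 * 4K (ptr_ring) +
128 * 4K (cache) + 512 * 2K (frags sitting on the Rx ring) = 3.5M.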

> 
> [...]
> 
> Thanks,
> Olek
> 
> .
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-11 11:39           ` Yunsheng Lin
@ 2023-07-11 16:37             ` Alexander Lobakin
  2023-07-12 11:13               ` Yunsheng Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-11 16:37 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Tue, 11 Jul 2023 19:39:28 +0800

> On 2023/7/10 21:25, Alexander Lobakin wrote:
>> From: Yunsheng Lin <yunshenglin0825@gmail.com>
>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>
>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>
>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>
>>>>>> +/**
>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>> + *
>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>> + */
>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>> +					    u32 size)
>>>>>> +{
>>>>>> +	struct page_pool_params pp = {
>>>>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>>>>> +		.pool_size	= size,
>>>>>> +		.nid		= NUMA_NO_NODE,
>>>>>> +		.dev		= napi->dev->dev.parent,
>>>>>> +		.napi		= napi,
>>>>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>>>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>>>>
>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>> when the page is split, as we do not really know what is the offset of
>>>>> the frag of a page except for the first frag.
>>>>
>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>> xdp->data - xdp->data_hard_start".
>>>
>>> So the problem seems to be if most of drivers have a similar reading as
>>> libie does here, as .offset seems to have a clear semantics which is used
>>> to skip dma sync operation for buffer range that is not touched by the
>>> dma operation. Even if it happens to have the same value of "offset from
>>> the start of the page or frag to the actual frame start", I am not sure
>>> it is future-proofing to reuse it.
>>
>> Not sure I'm following :s
> 
> It would be better to avoid accessing the internal data of the page pool
> directly as much as possible, as that may be changed to different meaning
> or removed if the implememtation is changed.
> 
> If it is common enough that most drivers are using it the same way, adding
> a helper for that would be great.

How come page_pool_params is internal if it's defined purely by the
driver and then exists read-only :D I even got warned in the adjacent
thread that the Page Pool core code shouldn't change it anyhow.

> 
>>
>>>
>>> When page frag is added, I didn't really give much thought about that as
>>> we use it in a cache coherent system.
>>> It seems we might need to extend or update that semantics if we really want
>>> to skip dma sync operation for all the buffer ranges that are not touched
>>> by the dma operation for page split case.
>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>> the effort, because it might need a per frag dma sync operation, which is
>>> more costly than a batched per page dma sync operation. If it is true, page
>>> pool already support that currently as my understanding, because the dma
>>> sync operation is only done when the last frag is released/freed.
>>>
>>>>
>>>>>
>>>>>> +	};
>>>>>> +	size_t truesize;
>>>>>> +
>>>>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>
>>> As mentioned above, if we depend on the last released/freed frag to do the
>>> dma sync, the pp.max_len might need to cover all the frag.
>>
>>                                                ^^^^^^^^^^^^
>>
>> You mean the whole page or...?
> 
> If we don't care about the accurate dma syncing, "cover all the frag" means
> the whole page here, as page pool doesn't have enough info to do accurate
> dma sync for now.
> 
>> I think it's not the driver's duty to track all this. We always set
>> .offset to `data - data_hard_start` and .max_len to the maximum
>> HW-writeable length for one frame. We don't know whether PP will give us
>> a whole page or just a piece. DMA sync for device is performed in the PP
>> core code as well. Driver just creates a PP and don't care about the
>> internals.
> 
> There problem is that when page_pool_put_page() is called with a split
> page, the page pool does not know which frag is freeing too.
> 
> setting 'maximum HW-writeable length for one frame' only sync the first
> frag of a page as below:

Maybe Page Pool should synchronize DMA even when !last_frag then?
Setting .max_len to anything bigger than the maximum frame size you're
planning to receive is counter-intuitive.
All three of xdp_buff, xdp_frame and skb always have all the info needed
to determine which piece of the page we're recycling, so it should be
possible to do this with no complications. Hypothetically forcing drivers
to do DMA syncs on their own when they use frags is counter-intuitive as
well; Page Pool should be able to handle this itself.

Alternatively, Page Pool may do as follows:

1. !last_frag -- do nothing, same as today.
2. last_frag -- sync, but not [offset, offset + max_len), but
   [offset, PAGE_SIZE).

This would also cover non-HW-writeable pieces like the 2nd-nth frame's
headroom and each frame's skb_shared_info, but it's the only alternative
to syncing each frag separately.
Yes, it's almost the same as setting .max_len to %PAGE_SIZE, but as I
said, it feels weird to set .max_len to 4k when you allocate 2k frags.
You don't know anyway how much of a page will be used.

For example, when I turn on driver-side XDP, increased headroom makes
truesize cross the 2k border with 1500 MTU, so that 2 frags per page
turn into 1 frag per page. In fact, instead of using the whole 4k,
I use ~2200 and don't need to sync the entire 4k.
Setting .max_len to 4k imposes way heavier overhead on non-DMA-coherent
systems than per-frag syncing would.
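
A minimal sketch of option 2, assuming it would sit next to the existing
sync-for-device path in the PP core (this is not current Page Pool
behaviour, just an illustration of the proposal):

	static void page_pool_dma_sync_full_for_device(struct page_pool *pool,
						       struct page *page)
	{
		dma_addr_t dma_addr = page_pool_get_dma_addr(page);

		/* last user of a split page: sync everything the HW may
		 * have written, i.e. [offset, PAGE_SIZE), not just one
		 * frame's max_len
		 */
		dma_sync_single_range_for_device(pool->p.dev, dma_addr,
						 pool->p.offset,
						 PAGE_SIZE - pool->p.offset,
						 pool->p.dma_dir);
	}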

> 
> https://elixir.free-electrons.com/linux/v6.4-rc6/source/net/core/page_pool.c#L325
> 

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool)
  2023-07-11 16:37             ` Alexander Lobakin
@ 2023-07-12 11:13               ` Yunsheng Lin
  0 siblings, 0 replies; 33+ messages in thread
From: Yunsheng Lin @ 2023-07-12 11:13 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Yunsheng Lin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

On 2023/7/12 0:37, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@huawei.com>
> Date: Tue, 11 Jul 2023 19:39:28 +0800
> 
>> On 2023/7/10 21:25, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <yunshenglin0825@gmail.com>
>>> Date: Sun, 9 Jul 2023 13:16:33 +0800
>>>
>>>> On 2023/7/7 0:28, Alexander Lobakin wrote:
>>>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>>>> Date: Thu, 6 Jul 2023 20:47:28 +0800
>>>>>
>>>>>> On 2023/7/5 23:55, Alexander Lobakin wrote:
>>>>>>
>>>>>>> +/**
>>>>>>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>>>>>>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>>>>>>> + * @size: size of the PP, usually simply Rx queue len
>>>>>>> + *
>>>>>>> + * Returns &page_pool on success, casted -errno on failure.
>>>>>>> + */
>>>>>>> +struct page_pool *libie_rx_page_pool_create(struct napi_struct *napi,
>>>>>>> +					    u32 size)
>>>>>>> +{
>>>>>>> +	struct page_pool_params pp = {
>>>>>>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>>>>> +		.order		= LIBIE_RX_PAGE_ORDER,
>>>>>>> +		.pool_size	= size,
>>>>>>> +		.nid		= NUMA_NO_NODE,
>>>>>>> +		.dev		= napi->dev->dev.parent,
>>>>>>> +		.napi		= napi,
>>>>>>> +		.dma_dir	= DMA_FROM_DEVICE,
>>>>>>> +		.offset		= LIBIE_SKB_HEADROOM,
>>>>>>
>>>>>> I think it worth mentioning that the '.offset' is not really accurate
>>>>>> when the page is split, as we do not really know what is the offset of
>>>>>> the frag of a page except for the first frag.
>>>>>
>>>>> Yeah, this is read as "offset from the start of the page or frag to the
>>>>> actual frame start, i.e. its Ethernet header" or "this is just
>>>>> xdp->data - xdp->data_hard_start".
>>>>
>>>> So the problem seems to be if most of drivers have a similar reading as
>>>> libie does here, as .offset seems to have a clear semantics which is used
>>>> to skip dma sync operation for buffer range that is not touched by the
>>>> dma operation. Even if it happens to have the same value of "offset from
>>>> the start of the page or frag to the actual frame start", I am not sure
>>>> it is future-proofing to reuse it.
>>>
>>> Not sure I'm following :s
>>
>> It would be better to avoid accessing the internal data of the page pool
>> directly as much as possible, as that may be changed to different meaning
>> or removed if the implememtation is changed.
>>
>> If it is common enough that most drivers are using it the same way, adding
>> a helper for that would be great.
> 
> How comes page_pool_params is internal if it's defined purely by the
> driver and then exists read-only :D I even got warned in the adjacent
> thread that the Page Pool core code shouldn't change it anyhow.

Personally I am not one hundred percent convinced that page_pool_params
will not be changed, considering the discussion about improving/replacing
the page pool to support the P2P case.

> 
>>
>>>
>>>>
>>>> When page frag is added, I didn't really give much thought about that as
>>>> we use it in a cache coherent system.
>>>> It seems we might need to extend or update that semantics if we really want
>>>> to skip dma sync operation for all the buffer ranges that are not touched
>>>> by the dma operation for page split case.
>>>> Or Skipping dma sync operation for all untouched ranges might not be worth
>>>> the effort, because it might need a per frag dma sync operation, which is
>>>> more costly than a batched per page dma sync operation. If it is true, page
>>>> pool already support that currently as my understanding, because the dma
>>>> sync operation is only done when the last frag is released/freed.
>>>>
>>>>>
>>>>>>
>>>>>>> +	};
>>>>>>> +	size_t truesize;
>>>>>>> +
>>>>>>> +	pp.max_len = libie_rx_sync_len(napi->dev, pp.offset);
>>>>
>>>> As mentioned above, if we depend on the last released/freed frag to do the
>>>> dma sync, the pp.max_len might need to cover all the frag.
>>>
>>>                                                ^^^^^^^^^^^^
>>>
>>> You mean the whole page or...?
>>
>> If we don't care about the accurate dma syncing, "cover all the frag" means
>> the whole page here, as page pool doesn't have enough info to do accurate
>> dma sync for now.
>>
>>> I think it's not the driver's duty to track all this. We always set
>>> .offset to `data - data_hard_start` and .max_len to the maximum
>>> HW-writeable length for one frame. We don't know whether PP will give us
>>> a whole page or just a piece. DMA sync for device is performed in the PP
>>> core code as well. Driver just creates a PP and don't care about the
>>> internals.
>>
>> There problem is that when page_pool_put_page() is called with a split
>> page, the page pool does not know which frag is freeing too.
>>
>> setting 'maximum HW-writeable length for one frame' only sync the first
>> frag of a page as below:
> 
> Maybe Page Pool should synchronize DMA even when !last_frag then?
> Setting .max_len to anything bigger than the maximum frame size you're
> planning to receive is counter-intuitive.

This is the simplest way to support DMA sync for the page frag case; the
question is whether batching the DMA sync for all frags of a page can even
out the additional cost of syncing the DMA-untouched ranges of a page.

> All three xdp_buff, xdp_frame and skb always have all info needed to
> determine which piece of the page we're recycling, it should be possible
> to do with no complications. Hypothetical forcing drivers to do DMA
> syncs on their own when they use frags is counter-intuitive as well,
> Page Pool should be able to handle this itself.
> 
> Alternatively, Page Pool may do as follows:
> 
> 1. !last_frag -- do nothing, same as today.
> 2. last_frag -- sync, but not [offset, offset + max_len), but
>    [offset, PAGE_SIZE).

When a frag is freed, we don't yet know whether it is the last frag of
the page or not.

> 
> This would also cover non-HW-writeable pieces like 2th-nth frame's
> headroom and each frame's skb_shared_info, but it's the only alternative
> to syncing each frag separately.
> Yes, it's almost the same as to set .max_len to %PAGE_SIZE, but as I
> said, it feels weird to set .max_len to 4k when you allocate 2k frags.
> You don't know anyway how much of a page will be used.

In that case, we may need to make it more generic to handle a page being
split into more than two frags, especially for systems with a 64K page
size.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library
  2023-07-05 15:55 ` [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library Alexander Lobakin
@ 2023-07-14 14:17   ` Przemek Kitszel
  0 siblings, 0 replies; 33+ messages in thread
From: Przemek Kitszel @ 2023-07-14 14:17 UTC (permalink / raw)
  To: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On 7/5/23 17:55, Alexander Lobakin wrote:
> Not a secret there's a ton of code duplication between two and more Intel
> ethernet modules.
> Before introducing new changes, which would need to be copied over again,
> start decoupling the already existing duplicate functionality into a new
> module, which will be shared between several Intel Ethernet drivers.
> Add the lookup table which converts 8/10-bit hardware packet type into
> a parsed bitfield structure for easy checking packet format parameters,
> such as payload level, IP version, etc. This is currently used by i40e,
> ice and iavf and it's all the same in all three drivers.
> The only difference introduced in this implementation is that instead of
> defining a 256 (or 1024 in case of ice) element array, add unlikely()
> condition to limit the input to 154 (current maximum non-reserved packet
> type). There's no reason to waste 600 (or even 3600) bytes only to not
> hurt very unlikely exception packets.
> The hash computation function now takes payload level directly as a
> pkt_hash_type. There's a couple cases when non-IP ptypes are marked as
> L3 payload and in the previous versions their hash level would be 2, not
> 3. But skb_set_hash() only sees difference between L4 and non-L4, thus
> this won't change anything at all.
> The module is behind the hidden Kconfig symbol, which the drivers will
> select when needed. The exports are behind 'LIBIE' namespace to limit
> the scope of the functions.
> 
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>   MAINTAINERS                                   |   3 +-
>   drivers/net/ethernet/intel/Kconfig            |  10 +
>   drivers/net/ethernet/intel/Makefile           |   1 +
>   drivers/net/ethernet/intel/i40e/i40e_common.c | 253 --------------
>   drivers/net/ethernet/intel/i40e/i40e_main.c   |   1 +
>   .../net/ethernet/intel/i40e/i40e_prototype.h  |   7 -
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  74 +---
>   drivers/net/ethernet/intel/i40e/i40e_type.h   |  88 -----
>   drivers/net/ethernet/intel/iavf/iavf_common.c | 253 --------------
>   drivers/net/ethernet/intel/iavf/iavf_main.c   |   1 +
>   .../net/ethernet/intel/iavf/iavf_prototype.h  |   7 -
>   drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  70 +---
>   drivers/net/ethernet/intel/iavf/iavf_type.h   |  88 -----
>   .../net/ethernet/intel/ice/ice_lan_tx_rx.h    | 316 ------------------
>   drivers/net/ethernet/intel/ice/ice_main.c     |   1 +
>   drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  74 +---
>   drivers/net/ethernet/intel/libie/Makefile     |   6 +
>   drivers/net/ethernet/intel/libie/rx.c         | 110 ++++++
>   include/linux/net/intel/libie/rx.h            | 128 +++++++
>   19 files changed, 312 insertions(+), 1179 deletions(-)
>   create mode 100644 drivers/net/ethernet/intel/libie/Makefile
>   create mode 100644 drivers/net/ethernet/intel/libie/rx.c
>   create mode 100644 include/linux/net/intel/libie/rx.h
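
(For illustration only: a minimal sketch of the bounded lookup described
in the quoted commit message. The names below are assumptions, not the
actual libie API.)

	/* Bounded ptype lookup instead of a full 256/1024-entry table;
	 * LIBIE_RX_PT_NUM and libie_rx_pt_lut are assumed names.
	 */
	static inline struct libie_rx_ptype_parsed libie_parse_rx_ptype(u32 ptype)
	{
		if (unlikely(ptype >= LIBIE_RX_PT_NUM))
			return (struct libie_rx_ptype_parsed){ };

		return libie_rx_pt_lut[ptype];
	}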


Thanks a lot for the whole effort on this series, it's refreshing to see
a move in that direction.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Intel-wired-lan] [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good
  2023-07-05 15:55 ` [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good Alexander Lobakin
@ 2023-07-14 14:17   ` Przemek Kitszel
  0 siblings, 0 replies; 33+ messages in thread
From: Przemek Kitszel @ 2023-07-14 14:17 UTC (permalink / raw)
  To: Alexander Lobakin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Paul Menzel, Jesper Dangaard Brouer, Larysa Zaremba, netdev,
	Alexander Duyck, Ilias Apalodimas, linux-kernel, Yunsheng Lin,
	Michal Kubiak, intel-wired-lan, David Christensen

On 7/5/23 17:55, Alexander Lobakin wrote:
> Ever since build_skb() became stable, the old way with allocating an skb
> for storing the headers separately, which will be then copied manually,
> was slower, less flexible and thus obsolete.
> 
> * it had higher pressure on MM since it actually allocates new pages,
>    which then get split and refcount-biased (NAPI page cache);
> * it implies memcpy() of packet headers (40+ bytes per each frame);
> * the actual header length was calculated via eth_get_headlen(), which
>    invokes Flow Dissector and thus wastes a bunch of CPU cycles;
> * XDP makes it even more weird since it requires headroom for long and
>    also tailroom for some time (since mbuf landed). Take a look at the
>    ice driver, which is built around work-arounds to make XDP work with
>    it.
> 
> Even on some quite low-end hardware (not a common case for 100G NICs) it
> was performing worse.
> The only advantage "legacy-rx" had is that it didn't require any
> reserved headroom and tailroom. But iavf didn't use this, as it always
> splits pages into two halves of 2k, while that save would only be useful
> when striding. And again, XDP effectively removes that sole pro.
> 
> There's a train of features to land in IAVF soon: Page Pool, XDP, XSk,
> multi-buffer etc. Each new would require adding more and more Danse
> Macabre for absolutely no reason, besides making hotpath less and less
> effective.
> Remove the "feature" with all the related code. This includes at least
> one very hot branch (typically hit on each new frame), which was either
> always-true or always-false at least for a complete NAPI bulk of 64
> frames, the whole private flags cruft, and so on. Some stats:
> 
> Function: add/remove: 0/4 grow/shrink: 0/7 up/down: 0/-757 (-757)
> RO Data: add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-40 (-40)
> 
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>   drivers/net/ethernet/intel/iavf/iavf.h        |   2 +-
>   .../net/ethernet/intel/iavf/iavf_ethtool.c    | 140 ------------------
>   drivers/net/ethernet/intel/iavf/iavf_main.c   |  10 +-
>   drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  84 +----------
>   drivers/net/ethernet/intel/iavf/iavf_txrx.h   |  27 +---
>   .../net/ethernet/intel/iavf/iavf_virtchnl.c   |   3 +-
>   6 files changed, 8 insertions(+), 258 deletions(-)

Good one; there were some random questions in my mind during the read,
but all are resolved by subsequent patches.
(It's a pity I have not yet found time to fully read them, though.)

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool
  2023-07-11 11:47           ` Yunsheng Lin
@ 2023-07-18 13:56             ` Alexander Lobakin
  0 siblings, 0 replies; 33+ messages in thread
From: Alexander Lobakin @ 2023-07-18 13:56 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Yunsheng Lin, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Maciej Fijalkowski, Michal Kubiak, Larysa Zaremba,
	Alexander Duyck, David Christensen, Jesper Dangaard Brouer,
	Ilias Apalodimas, Paul Menzel, netdev, intel-wired-lan,
	linux-kernel

From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Tue, 11 Jul 2023 19:47:14 +0800

> On 2023/7/10 21:34, Alexander Lobakin wrote:
>> From: Yunsheng Lin <yunshenglin0825@gmail.com>
>> Date: Sun, 9 Jul 2023 13:16:39 +0800
>>
>>> On 2023/7/7 0:38, Alexander Lobakin wrote:
>>>
>>> ...
>>>  
>>>>>
>>>>>>  /**
>>>>>> @@ -766,13 +742,19 @@ void iavf_free_rx_resources(struct iavf_ring *rx_ring)
>>>>>>   **/
>>>>>>  int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
>>>>>>  {
>>>>>> -	struct device *dev = rx_ring->dev;
>>>>>> -	int bi_size;
>>>>>> +	struct page_pool *pool;
>>>>>> +
>>>>>> +	pool = libie_rx_page_pool_create(&rx_ring->q_vector->napi,
>>>>>> +					 rx_ring->count);
>>>>>
>>>>> If a page is able to be spilt between more than one desc, perhaps the
>>>>> prt_ring size does not need to be as big as rx_ring->count.
>>>>
>>>> But we doesn't know in advance, right? Esp. given that it's hidden in
>>>> the lib. But anyway, you can only assume that in regular cases if you
>>>> always allocate frags of the same size, PP will split pages when 2+
>>>> frags can fit there or return the whole page otherwise, but who knows
>>>> what might happen.
>>>
>>> It seems intel driver is able to know the size of memory it needs when
>>> creating the ring/queue/napi/pp, maybe the driver only tell the libie
>>> how many descs does it use for queue, and libie can adjust it accordingly?
>>
>> But libie can't say for sure how PP will split pages for it, right?
>>
>>>
>>>> BTW, with recent recycling optimization, most of recycling is done
>>>> directly through cache, not ptr_ring. So I'd even say it's safe to start
>>>> creating smaller ptr_rings in the drivers.
>>>
>>> The problem is that we may use more memory than before for certain case
>>> if we don't limit the size of ptr_ring, unless we can ensure all of
>>> recycling is done directly through cache, not ptr_ring.
>> Also not sure I'm following =\
> 
> Before adding page pool support, the max memory used in the driver is as
> below:
> rx_ring->count * PAGE_SIZE;
> 
> After adding page pool support, the max memory used in the driver is as
> below:
> 
> ptr_ring->size * PAGE_SIZE +
> PP_ALLOC_CACHE_SIZE * PAGE_SIZE +
> rx_ring->count * PAGE_SIZE / pp.init_arg

I know. I was wondering how your message connects with what I said
about how using direct recycling more aggressively allows us to reduce
ptr_ring sizes, as e.g. on my setup the ptr_ring is not used at all
(it will still be in other cases, but a lot less often than before).

> 
>>
>> [...]
>>
>> Thanks,
>> Olek
>>
>> .
>>

Thanks,
Olek

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-07-18 14:01 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-05 15:55 [PATCH RFC net-next v4 0/9] net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 1/9] net: intel: introduce Intel Ethernet common library Alexander Lobakin
2023-07-14 14:17   ` [Intel-wired-lan] " Przemek Kitszel
2023-07-05 15:55 ` [PATCH RFC net-next v4 2/9] iavf: kill "legacy-rx" for good Alexander Lobakin
2023-07-14 14:17   ` [Intel-wired-lan] " Przemek Kitszel
2023-07-05 15:55 ` [PATCH RFC net-next v4 3/9] iavf: drop page splitting and recycling Alexander Lobakin
2023-07-06 14:47   ` [Intel-wired-lan] " Alexander Duyck
2023-07-06 16:45     ` Alexander Lobakin
2023-07-06 17:06       ` Alexander Duyck
2023-07-10 13:13         ` Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 4/9] net: page_pool: add DMA-sync-for-CPU inline helpers Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 5/9] libie: add Rx buffer management (via Page Pool) Alexander Lobakin
2023-07-06 12:47   ` Yunsheng Lin
2023-07-06 16:28     ` Alexander Lobakin
2023-07-09  5:16       ` Yunsheng Lin
2023-07-10 13:25         ` Alexander Lobakin
2023-07-11 11:39           ` Yunsheng Lin
2023-07-11 16:37             ` Alexander Lobakin
2023-07-12 11:13               ` Yunsheng Lin
2023-07-05 15:55 ` [PATCH RFC net-next v4 6/9] iavf: switch to Page Pool Alexander Lobakin
2023-07-06 12:47   ` Yunsheng Lin
2023-07-06 16:38     ` Alexander Lobakin
2023-07-09  5:16       ` Yunsheng Lin
2023-07-10 13:34         ` Alexander Lobakin
2023-07-11 11:47           ` Yunsheng Lin
2023-07-18 13:56             ` Alexander Lobakin
2023-07-06 15:26   ` [Intel-wired-lan] " Alexander Duyck
2023-07-06 16:56     ` Alexander Lobakin
2023-07-06 17:28       ` Alexander Duyck
2023-07-10 13:18         ` Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 7/9] libie: add common queue stats Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 8/9] libie: add per-queue Page Pool stats Alexander Lobakin
2023-07-05 15:55 ` [PATCH RFC net-next v4 9/9] iavf: switch queue stats to libie Alexander Lobakin
