* [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
@ 2022-09-04 13:15 Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 1/8] xfrm: add new full offload flag Leon Romanovsky
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v4:
 * Changed title from "PATCH" to "PATCH RFC" per request.
 * Added two new patches: one to update hard/soft limits and another
   initial take on documentation.
 * Added more info about lifetime/rekeying flow to cover letter, see
   relevant section.
 * perf traces for crypto mode will come later.
v3: https://lore.kernel.org/all/cover.1661260787.git.leonro@nvidia.com
 * I didn't hear any suggestion for which term to use instead of
   "full offload", so I left it as is. It is used only in commit
   messages and documentation and is easy to rename.
 * Added performance data and background info to cover letter
 * Reused xfrm_output_resume() function to support multiple XFRM transformations
 * Added a PMTU check in addition to the driver .xdo_dev_offload_ok validation
 * Documentation is in progress, but not part of this series yet.
v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
 * Rebased to latest 6.0-rc1
 * Added an extra check in the TX datapath patch to validate packets
   before forwarding to HW.
 * Added policy cleanup logic in case of netdev down event
v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
 * Moved comment to be before if (...) in third patch.
v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
-----------------------------------------------------------------------

The following series extends XFRM core code to handle a new type of IPsec
offload - full offload.

In this mode, the HW is going to be responsible for the whole data path,
so both policy and state should be offloaded.

IPsec full offload is an improved version of IPsec crypto mode.
In full mode, the HW is responsible for trimming/adding headers in
addition to decryption/encryption. In this mode, the packet arrives
at the stack already decrypted on RX and, vice versa, leaves for the
HW unencrypted on TX.

Devices that implement IPsec full offload mode offload policies too.
In the RX path, this means that the HW can't effectively handle mixed
SW and HW priorities unless users make sure that the HW-offloaded
policies have higher priorities.

To make sure that users have a coherent picture, we require that
HW-offloaded policies always (both RX and TX) have higher priorities
than SW ones.
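The priority invariant described above can be sketched in plain C. This is a userspace illustration only, not kernel code; the struct and helper name are made up for this example. In XFRM, a numerically lower priority value wins the lookup, so "HW policies always win" means every HW priority value must be strictly below every SW one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct policy {
	unsigned int priority; /* lower value = higher precedence */
	bool hw_offloaded;
};

/* Returns true when every HW-offloaded policy outranks every SW policy,
 * i.e. max(HW priority value) < min(SW priority value).
 */
static bool hw_priorities_coherent(const struct policy *pols, size_t n)
{
	unsigned int hw_max = 0;
	unsigned int sw_min = ~0U;
	bool has_hw = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pols[i].hw_offloaded) {
			has_hw = true;
			if (pols[i].priority > hw_max)
				hw_max = pols[i].priority;
		} else if (pols[i].priority < sw_min) {
			sw_min = pols[i].priority;
		}
	}
	return !has_hw || hw_max < sw_min;
}
```

The series enforces this at policy insertion time rather than at lookup time, which keeps the fast path untouched.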

To avoid over-engineering the code, HW policies are treated as SW ones
and don't take the netdev into account, which allows reuse of the same
priorities for different devices.

There are several deliberate limitations:
 * No software fallback
 * Fragments are dropped, both in RX and TX
 * No socket policies
 * Only IPsec transport mode is implemented

================================================================================
Rekeying:

In order to support rekeying, and since the XFRM core is skipped, the
HW/driver should do the following:
 * Count the handled packets
 * Raise an event when the limits are reached
 * Drop packets once the hard limit is hit

The XFRM core calls the newly introduced xfrm_dev_state_update_curlft()
function to synchronize device statistics with its internal structures.
On a HW limit event, the driver calls xfrm_state_check_expire() to let
the XFRM core take the relevant decisions.

This separation between the control logic (in XFRM) and the data plane
allows us to fully reuse the SW stack.
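The counting scheme above can be modeled with a small userspace sketch. The struct, enum and function names below are illustrative and do not exist in the driver API; they only mirror the soft-limit (rekey trigger) vs. hard-limit (drop) behavior the cover letter describes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Events the driver would raise toward the XFRM core. */
enum sa_event { SA_EVENT_NONE, SA_EVENT_SOFT_LIMIT, SA_EVENT_HARD_LIMIT };

struct sa_counters {
	uint64_t packets;    /* handled packets, mirrors curlft accounting */
	uint64_t soft_limit; /* raise an event, trigger rekeying */
	uint64_t hard_limit; /* drop traffic from here on */
	bool soft_raised;
};

/* Account one packet; reports which event (if any) to raise and whether
 * the packet must be dropped because the hard limit was exceeded.
 */
static enum sa_event sa_count_packet(struct sa_counters *sa, bool *drop)
{
	sa->packets++;
	*drop = sa->packets > sa->hard_limit;
	if (*drop)
		return SA_EVENT_HARD_LIMIT;
	if (!sa->soft_raised && sa->packets >= sa->soft_limit) {
		sa->soft_raised = true;
		return SA_EVENT_SOFT_LIMIT;
	}
	return SA_EVENT_NONE;
}
```

In the real flow, the soft-limit event gives userspace (e.g. an IKE daemon) time to rekey before the hard limit starts dropping traffic.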

================================================================================
Configuration:

iproute2: https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/

Full offload mode:
  ip xfrm state offload full dev <if-name> dir <in|out>
  ip xfrm policy .... offload full dev <if-name>
Crypto offload mode:
  ip xfrm state offload crypto dev <if-name> dir <in|out>
or (backward compatibility)
  ip xfrm state offload dev <if-name> dir <in|out>
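For completeness, a sketch of an end-to-end transport-mode setup combining the commands above. Addresses, SPI and key are placeholders, and this assumes the patched iproute2 linked above plus a NIC/driver supporting full offload:

```shell
IF=eth0

# Offload the state: HW does crypto and header handling.
ip xfrm state add src 192.168.1.1 dst 192.168.1.2 \
    proto esp spi 0x1000 reqid 1 mode transport \
    aead 'rfc4106(gcm(aes))' 0x0102030405060708090a0b0c0d0e0f1011121314 128 \
    offload full dev "$IF" dir out

# Offload the matching policy: required, as full mode needs both.
ip xfrm policy add src 192.168.1.1 dst 192.168.1.2 dir out \
    tmpl src 192.168.1.1 dst 192.168.1.2 proto esp reqid 1 mode transport \
    offload full dev "$IF"
```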

================================================================================
Performance results:

TCP multi-stream, using one iperf3 instance per CPU.
+----------------------+--------+--------+--------+--------+---------+---------+
|                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
|                      +--------+--------+--------+--------+---------+---------+
|                      |                       BW (Gbps)                       |
+----------------------+--------+--------+--------+--------+---------+---------+
| Baseline             | 27.9   | 59     | 93.1   | 92.8   | 93.7    | 94.4    |
+----------------------+--------+--------+--------+--------+---------+---------+
| Software IPsec       | 6      | 11.9   | 23.3   | 45.9   | 83.8    | 91.8    |
+----------------------+--------+--------+--------+--------+---------+---------+
| IPsec crypto offload | 15     | 29.7   | 58.5   | 89.6   | 90.4    | 90.8    |
+----------------------+--------+--------+--------+--------+---------+---------+
| IPsec full offload   | 28     | 57     | 90.7   | 91     | 91.3    | 91.9    |
+----------------------+--------+--------+--------+--------+---------+---------+

IPsec full offload mode behaves like the baseline and reaches line rate
with the same number of CPUs.

Setup details (similar for both sides):
* NIC: ConnectX6-DX dual port, 100 Gbps each.
  Single port used in the tests.
* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

================================================================================
Series together with mlx5 part:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=xfrm-next

Thanks

Leon Romanovsky (8):
  xfrm: add new full offload flag
  xfrm: allow state full offload mode
  xfrm: add an interface to offload policy
  xfrm: add TX datapath support for IPsec full offload mode
  xfrm: add RX datapath protection for IPsec full offload mode
  xfrm: enforce separation between priorities of HW/SW policies
  xfrm: add support to HW update soft and hard limits
  xfrm: document IPsec full offload mode

 Documentation/networking/xfrm_device.rst      |  62 +++++-
 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |   4 +
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |   5 +
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |   5 +
 .../mellanox/mlx5/core/en_accel/ipsec.c       |   4 +
 drivers/net/netdevsim/ipsec.c                 |   5 +
 include/linux/netdevice.h                     |   4 +
 include/net/netns/xfrm.h                      |   8 +-
 include/net/xfrm.h                            | 121 +++++++++---
 include/uapi/linux/xfrm.h                     |   6 +
 net/xfrm/xfrm_device.c                        | 103 +++++++++-
 net/xfrm/xfrm_output.c                        |  13 +-
 net/xfrm/xfrm_policy.c                        | 181 ++++++++++++++++++
 net/xfrm/xfrm_state.c                         |   4 +
 net/xfrm/xfrm_user.c                          |  19 ++
 15 files changed, 501 insertions(+), 43 deletions(-)

-- 
2.37.2



* [PATCH RFC xfrm-next v3 1/8] xfrm: add new full offload flag
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 2/8] xfrm: allow state full offload mode Leon Romanovsky
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

In the next patches, the xfrm core code will be extended to support a
new type of offload - full offload. In that mode, both the policy and
the state should be specially configured in order to perform the whole
offloaded data path.

Full offload takes care of encryption, decryption, encapsulation and
other operations with headers.

As this mode is new for the XFRM policy flow, we can "start fresh" with
the flag bits and release the first and second bits for future use.
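The resulting flag layout can be illustrated with a small userspace check. The constants mirror the UAPI values this patch adds; the two validation helpers are illustrative names, modeled on the checks the series performs (the state-path mask in xfrm_dev_state_add(), and the policy-path check that appears in a later patch of the series):

```c
#include <assert.h>
#include <stdbool.h>

/* UAPI flag bits from include/uapi/linux/xfrm.h after this patch. */
#define XFRM_OFFLOAD_IPV6	1
#define XFRM_OFFLOAD_INBOUND	2
#define XFRM_OFFLOAD_FULL	4

/* State path: all three bits are meaningful. */
static bool state_flags_valid(unsigned int flags)
{
	return !(flags & ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND |
			   XFRM_OFFLOAD_FULL));
}

/* Policy path: only FULL is meaningful there, and it must be set,
 * since policy offload exists solely for full offload mode.
 */
static bool policy_flags_valid(unsigned int flags)
{
	return flags && !(flags & ~XFRM_OFFLOAD_FULL);
}
```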

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/xfrm.h        | 7 +++++++
 include/uapi/linux/xfrm.h | 6 ++++++
 net/xfrm/xfrm_device.c    | 3 +++
 net/xfrm/xfrm_user.c      | 2 ++
 4 files changed, 18 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 6e8fa98f786f..b4d487053dfd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -131,12 +131,19 @@ enum {
 	XFRM_DEV_OFFLOAD_OUT,
 };
 
+enum {
+	XFRM_DEV_OFFLOAD_UNSPECIFIED,
+	XFRM_DEV_OFFLOAD_CRYPTO,
+	XFRM_DEV_OFFLOAD_FULL,
+};
+
 struct xfrm_dev_offload {
 	struct net_device	*dev;
 	netdevice_tracker	dev_tracker;
 	struct net_device	*real_dev;
 	unsigned long		offload_handle;
 	u8			dir : 2;
+	u8			type : 2;
 };
 
 struct xfrm_mode {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index 4f84ea7ee14c..463c6c1af23a 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -519,6 +519,12 @@ struct xfrm_user_offload {
  */
 #define XFRM_OFFLOAD_IPV6	1
 #define XFRM_OFFLOAD_INBOUND	2
+/* Two bits above are relevant for state path only, while
+ * offload is used for both policy and state flows.
+ *
+ * In policy offload mode, they are free and can be safely reused.
+ */
+#define XFRM_OFFLOAD_FULL	4
 
 struct xfrm_userpolicy_default {
 #define XFRM_USERPOLICY_UNSPEC	0
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 637ca8838436..6d1124eb1ec8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -270,12 +270,15 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	else
 		xso->dir = XFRM_DEV_OFFLOAD_OUT;
 
+	xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
+
 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
 		xso->dev = NULL;
 		xso->dir = 0;
 		xso->real_dev = NULL;
 		netdev_put(dev, &xso->dev_tracker);
+		xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
 
 		if (err != -EOPNOTSUPP)
 			return err;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 2ff017117730..9c0aef815730 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -854,6 +854,8 @@ static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
 	xuo->ifindex = xso->dev->ifindex;
 	if (xso->dir == XFRM_DEV_OFFLOAD_IN)
 		xuo->flags = XFRM_OFFLOAD_INBOUND;
+	if (xso->type == XFRM_DEV_OFFLOAD_FULL)
+		xuo->flags |= XFRM_OFFLOAD_FULL;
 
 	return 0;
 }
-- 
2.37.2



* [PATCH RFC xfrm-next v3 2/8] xfrm: allow state full offload mode
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 1/8] xfrm: add new full offload flag Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 3/8] xfrm: add an interface to offload policy Leon Romanovsky
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Allow users to configure xfrm states with full offload mode.
Full mode must be requested for both the policy and the state, which
requires us not to implement a fallback.

We explicitly return an error if the requested full mode can't be
configured.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |  4 ++++
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |  5 ++++
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |  5 ++++
 .../mellanox/mlx5/core/en_accel/ipsec.c       |  4 ++++
 drivers/net/netdevsim/ipsec.c                 |  5 ++++
 net/xfrm/xfrm_device.c                        | 24 +++++++++++++++----
 6 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
index 585590520076..ca21794281d6 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
@@ -283,6 +283,10 @@ static int ch_ipsec_xfrm_add_state(struct xfrm_state *x)
 		pr_debug("Cannot offload xfrm states with geniv other than seqiv\n");
 		return -EINVAL;
 	}
+	if (x->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		pr_debug("Unsupported xfrm offload\n");
+		return -EINVAL;
+	}
 
 	sa_entry = kzalloc(sizeof(*sa_entry), GFP_KERNEL);
 	if (!sa_entry) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 774de63dd93a..53a969e34883 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -585,6 +585,11 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	if (xs->xso.dir == XFRM_DEV_OFFLOAD_IN) {
 		struct rx_sa rsa;
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ipsec.c b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
index 9984ebc62d78..c1cf540d162a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
@@ -280,6 +280,11 @@ static int ixgbevf_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	if (xs->xso.dir == XFRM_DEV_OFFLOAD_IN) {
 		struct rx_sa rsa;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
index 2a8fd7020622..c182b640b80d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
@@ -256,6 +256,10 @@ static inline int mlx5e_xfrm_validate_state(struct xfrm_state *x)
 		netdev_info(netdev, "Cannot offload xfrm states with geniv other than seqiv\n");
 		return -EINVAL;
 	}
+	if (x->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_info(netdev, "Unsupported xfrm offload type\n");
+		return -EINVAL;
+	}
 	return 0;
 }
 
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
index 386336a38f34..b93baf5c8bee 100644
--- a/drivers/net/netdevsim/ipsec.c
+++ b/drivers/net/netdevsim/ipsec.c
@@ -149,6 +149,11 @@ static int nsim_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	/* find the first unused index */
 	ret = nsim_ipsec_find_empty_idx(ipsec);
 	if (ret < 0) {
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 6d1124eb1ec8..5b04e5cdca64 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -215,6 +215,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	struct xfrm_dev_offload *xso = &x->xso;
 	xfrm_address_t *saddr;
 	xfrm_address_t *daddr;
+	bool is_full_offload;
 
 	if (!x->type_offload)
 		return -EINVAL;
@@ -223,9 +224,11 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	if (x->encap || x->tfcpad)
 		return -EINVAL;
 
-	if (xuo->flags & ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND))
+	if (xuo->flags &
+	    ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND | XFRM_OFFLOAD_FULL))
 		return -EINVAL;
 
+	is_full_offload = xuo->flags & XFRM_OFFLOAD_FULL;
 	dev = dev_get_by_index(net, xuo->ifindex);
 	if (!dev) {
 		if (!(xuo->flags & XFRM_OFFLOAD_INBOUND)) {
@@ -240,7 +243,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 					x->props.family,
 					xfrm_smark_get(0, x));
 		if (IS_ERR(dst))
-			return 0;
+			return (is_full_offload) ? -EINVAL : 0;
 
 		dev = dst->dev;
 
@@ -251,7 +254,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	if (!dev->xfrmdev_ops || !dev->xfrmdev_ops->xdo_dev_state_add) {
 		xso->dev = NULL;
 		dev_put(dev);
-		return 0;
+		return (is_full_offload) ? -EINVAL : 0;
 	}
 
 	if (x->props.flags & XFRM_STATE_ESN &&
@@ -270,7 +273,10 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	else
 		xso->dir = XFRM_DEV_OFFLOAD_OUT;
 
-	xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
+	if (is_full_offload)
+		xso->type = XFRM_DEV_OFFLOAD_FULL;
+	else
+		xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
 
 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
@@ -280,7 +286,15 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 		netdev_put(dev, &xso->dev_tracker);
 		xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
 
-		if (err != -EOPNOTSUPP)
+		/* User explicitly requested full offload mode and configured
+		 * policy in addition to the XFRM state. So be civil to users,
+		 * and return an error instead of taking fallback path.
+		 *
+		 * This WARN_ON() can be seen as a documentation for driver
+		 * authors to do not return -EOPNOTSUPP in full offload mode.
+		 */
+		WARN_ON(err == -EOPNOTSUPP && is_full_offload);
+		if (err != -EOPNOTSUPP || is_full_offload)
 			return err;
 	}
 
-- 
2.37.2



* [PATCH RFC xfrm-next v3 3/8] xfrm: add an interface to offload policy
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 1/8] xfrm: add new full offload flag Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 2/8] xfrm: allow state full offload mode Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Extend the netlink interface to add and delete XFRM policies from the
device. This functionality is a first step toward implementing a full
IPsec offload solution.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/netdevice.h |  3 ++
 include/net/xfrm.h        | 42 ++++++++++++++++++++++++
 net/xfrm/xfrm_device.c    | 61 ++++++++++++++++++++++++++++++++++-
 net/xfrm/xfrm_policy.c    | 68 +++++++++++++++++++++++++++++++++++++++
 net/xfrm/xfrm_user.c      | 17 ++++++++++
 5 files changed, 190 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f0068c1ff1df..c1db9eaa3dca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1026,6 +1026,9 @@ struct xfrmdev_ops {
 	bool	(*xdo_dev_offload_ok) (struct sk_buff *skb,
 				       struct xfrm_state *x);
 	void	(*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+	int	(*xdo_dev_policy_add) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_delete) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_free) (struct xfrm_policy *x);
 };
 #endif
 
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b4d487053dfd..587697eb1d31 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -129,6 +129,7 @@ struct xfrm_state_walk {
 enum {
 	XFRM_DEV_OFFLOAD_IN = 1,
 	XFRM_DEV_OFFLOAD_OUT,
+	XFRM_DEV_OFFLOAD_FWD,
 };
 
 enum {
@@ -534,6 +535,8 @@ struct xfrm_policy {
 	struct xfrm_tmpl       	xfrm_vec[XFRM_MAX_DEPTH];
 	struct hlist_node	bydst_inexact_list;
 	struct rcu_head		rcu;
+
+	struct xfrm_dev_offload xdo;
 };
 
 static inline struct net *xp_net(const struct xfrm_policy *xp)
@@ -1577,6 +1580,7 @@ struct xfrm_state *xfrm_find_acq_byseq(struct net *net, u32 mark, u32 seq);
 int xfrm_state_delete(struct xfrm_state *x);
 int xfrm_state_flush(struct net *net, u8 proto, bool task_valid, bool sync);
 int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_valid);
+int xfrm_dev_policy_flush(struct net *net, struct net_device *dev, bool task_valid);
 void xfrm_sad_getinfo(struct net *net, struct xfrmk_sadinfo *si);
 void xfrm_spd_getinfo(struct net *net, struct xfrmk_spdinfo *si);
 u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq);
@@ -1887,6 +1891,8 @@ void xfrm_dev_backlog(struct softnet_data *sd);
 struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again);
 int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 		       struct xfrm_user_offload *xuo);
+int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+		       struct xfrm_user_offload *xuo, u8 dir);
 bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x);
 
 static inline void xfrm_dev_state_advance_esn(struct xfrm_state *x)
@@ -1935,6 +1941,28 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x)
 		netdev_put(dev, &xso->dev_tracker);
 	}
 }
+
+static inline void xfrm_dev_policy_delete(struct xfrm_policy *x)
+{
+	struct xfrm_dev_offload *xdo = &x->xdo;
+	struct net_device *dev = xdo->dev;
+
+	if (dev && dev->xfrmdev_ops && dev->xfrmdev_ops->xdo_dev_policy_delete)
+		dev->xfrmdev_ops->xdo_dev_policy_delete(x);
+}
+
+static inline void xfrm_dev_policy_free(struct xfrm_policy *x)
+{
+	struct xfrm_dev_offload *xdo = &x->xdo;
+	struct net_device *dev = xdo->dev;
+
+	if (dev && dev->xfrmdev_ops) {
+		if (dev->xfrmdev_ops->xdo_dev_policy_free)
+			dev->xfrmdev_ops->xdo_dev_policy_free(x);
+		xdo->dev = NULL;
+		netdev_put(dev, &xdo->dev_tracker);
+	}
+}
 #else
 static inline void xfrm_dev_resume(struct sk_buff *skb)
 {
@@ -1962,6 +1990,20 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x)
 {
 }
 
+static inline int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+				      struct xfrm_user_offload *xuo, u8 dir)
+{
+	return 0;
+}
+
+static inline void xfrm_dev_policy_delete(struct xfrm_policy *x)
+{
+}
+
+static inline void xfrm_dev_policy_free(struct xfrm_policy *x)
+{
+}
+
 static inline bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 {
 	return false;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 5b04e5cdca64..1cc482e9c87d 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -302,6 +302,63 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 }
 EXPORT_SYMBOL_GPL(xfrm_dev_state_add);
 
+int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+			struct xfrm_user_offload *xuo, u8 dir)
+{
+	struct xfrm_dev_offload *xdo = &xp->xdo;
+	struct net_device *dev;
+	int err;
+
+	if (!xuo->flags || xuo->flags & ~XFRM_OFFLOAD_FULL)
+		/* We support only Full offload mode and it means
+		 * that user must set XFRM_OFFLOAD_FULL bit.
+		 */
+		return -EINVAL;
+
+	dev = dev_get_by_index(net, xuo->ifindex);
+	if (!dev)
+		return -EINVAL;
+
+	if (!dev->xfrmdev_ops || !dev->xfrmdev_ops->xdo_dev_policy_add) {
+		xdo->dev = NULL;
+		dev_put(dev);
+		return -EINVAL;
+	}
+
+	xdo->dev = dev;
+	netdev_tracker_alloc(dev, &xdo->dev_tracker, GFP_ATOMIC);
+	xdo->real_dev = dev;
+	xdo->type = XFRM_DEV_OFFLOAD_FULL;
+	switch (dir) {
+	case XFRM_POLICY_IN:
+		xdo->dir = XFRM_DEV_OFFLOAD_IN;
+		break;
+	case XFRM_POLICY_OUT:
+		xdo->dir = XFRM_DEV_OFFLOAD_OUT;
+		break;
+	case XFRM_POLICY_FWD:
+		xdo->dir = XFRM_DEV_OFFLOAD_FWD;
+		break;
+	default:
+		xdo->dev = NULL;
+		dev_put(dev);
+		return -EINVAL;
+	}
+
+	err = dev->xfrmdev_ops->xdo_dev_policy_add(xp);
+	if (err) {
+		xdo->dev = NULL;
+		xdo->real_dev = NULL;
+		xdo->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
+		xdo->dir = 0;
+		netdev_put(dev, &xdo->dev_tracker);
+		return err;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xfrm_dev_policy_add);
+
 bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 {
 	int mtu;
@@ -404,8 +461,10 @@ static int xfrm_api_check(struct net_device *dev)
 
 static int xfrm_dev_down(struct net_device *dev)
 {
-	if (dev->features & NETIF_F_HW_ESP)
+	if (dev->features & NETIF_F_HW_ESP) {
 		xfrm_dev_state_flush(dev_net(dev), dev, true);
+		xfrm_dev_policy_flush(dev_net(dev), dev, true);
+	}
 
 	return NOTIFY_DONE;
 }
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index cc6ab79609e2..56ccedbb7212 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -425,6 +425,7 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
 	if (del_timer(&policy->timer) || del_timer(&policy->polq.hold_timer))
 		BUG();
 
+	xfrm_dev_policy_free(policy);
 	call_rcu(&policy->rcu, xfrm_policy_destroy_rcu);
 }
 EXPORT_SYMBOL(xfrm_policy_destroy);
@@ -1769,12 +1770,41 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 	}
 	return err;
 }
+
+static inline int xfrm_dev_policy_flush_secctx_check(struct net *net,
+						     struct net_device *dev,
+						     bool task_valid)
+{
+	struct xfrm_policy *pol;
+	int err = 0;
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead ||
+		    xfrm_policy_id2dir(pol->index) >= XFRM_POLICY_MAX ||
+		    pol->xdo.dev != dev)
+			continue;
+
+		err = security_xfrm_policy_delete(pol->security);
+		if (err) {
+			xfrm_audit_policy_delete(pol, 0, task_valid);
+			return err;
+		}
+	}
+	return err;
+}
 #else
 static inline int
 xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 {
 	return 0;
 }
+
+static inline int xfrm_dev_policy_flush_secctx_check(struct net *net,
+						     struct net_device *dev,
+						     bool task_valid)
+{
+	return 0;
+}
 #endif
 
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
@@ -1814,6 +1844,43 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 }
 EXPORT_SYMBOL(xfrm_policy_flush);
 
+int xfrm_dev_policy_flush(struct net *net, struct net_device *dev, bool task_valid)
+{
+	int dir, err = 0, cnt = 0;
+	struct xfrm_policy *pol;
+
+	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+
+	err = xfrm_dev_policy_flush_secctx_check(net, dev, task_valid);
+	if (err)
+		goto out;
+
+again:
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		dir = xfrm_policy_id2dir(pol->index);
+		if (pol->walk.dead ||
+		    dir >= XFRM_POLICY_MAX ||
+		    pol->xdo.dev != dev)
+			continue;
+
+		__xfrm_policy_unlink(pol, dir);
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		cnt++;
+		xfrm_audit_policy_delete(pol, 1, task_valid);
+		xfrm_policy_kill(pol);
+		spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+		goto again;
+	}
+	if (cnt)
+		__xfrm_policy_inexact_flush(net);
+	else
+		err = -ESRCH;
+out:
+	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+	return err;
+}
+EXPORT_SYMBOL(xfrm_dev_policy_flush);
+
 int xfrm_policy_walk(struct net *net, struct xfrm_policy_walk *walk,
 		     int (*func)(struct xfrm_policy *, int, int, void*),
 		     void *data)
@@ -2246,6 +2313,7 @@ int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 	pol = __xfrm_policy_unlink(pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
+		xfrm_dev_policy_delete(pol);
 		xfrm_policy_kill(pol);
 		return 0;
 	}
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9c0aef815730..698ff84da6ba 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1747,6 +1747,14 @@ static struct xfrm_policy *xfrm_policy_construct(struct net *net, struct xfrm_us
 	if (attrs[XFRMA_IF_ID])
 		xp->if_id = nla_get_u32(attrs[XFRMA_IF_ID]);
 
+	/* configure the hardware if offload is requested */
+	if (attrs[XFRMA_OFFLOAD_DEV]) {
+		err = xfrm_dev_policy_add(net, xp,
+				nla_data(attrs[XFRMA_OFFLOAD_DEV]), p->dir);
+		if (err)
+			goto error;
+	}
+
 	return xp;
  error:
 	*errp = err;
@@ -1785,6 +1793,7 @@ static int xfrm_add_policy(struct sk_buff *skb, struct nlmsghdr *nlh,
 	xfrm_audit_policy_add(xp, err ? 0 : 1, true);
 
 	if (err) {
+		xfrm_dev_policy_delete(xp);
 		security_xfrm_policy_free(xp->security);
 		kfree(xp);
 		return err;
@@ -1897,6 +1906,8 @@ static int dump_one_policy(struct xfrm_policy *xp, int dir, int count, void *ptr
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3213,6 +3224,8 @@ static int build_acquire(struct sk_buff *skb, struct xfrm_state *x,
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3331,6 +3344,8 @@ static int build_polexpire(struct sk_buff *skb, struct xfrm_policy *xp,
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3414,6 +3429,8 @@ static int xfrm_notify_policy(struct xfrm_policy *xp, int dir, const struct km_e
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err)
 		goto out_free_skb;
 
-- 
2.37.2



* [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (2 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 3/8] xfrm: add an interface to offload policy Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-25  9:16   ` Steffen Klassert
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 5/8] xfrm: add RX datapath protection " Leon Romanovsky
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

In IPsec full mode, the device is going to encrypt and encapsulate
packets that are associated with an offloaded policy. After a
successful policy lookup that indicates whether packets should be
offloaded or not, the stack forwards them to the device to do the
magic.
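The TX decision this patch adds to xfrm_output() can be summarized as a tiny userspace sketch. The enum and function are made-up names for illustration; the point is that full offload has no SW fallback, so a packet the device can't take is dropped rather than handed to the SW crypto path:

```c
#include <assert.h>
#include <stdbool.h>

enum tx_verdict { TX_SW_STACK, TX_TO_HW, TX_DROP };

static enum tx_verdict xmit_verdict(bool full_offload, bool offload_ok)
{
	if (!full_offload)
		return TX_SW_STACK; /* regular XFRM output path */
	/* Full offload: bypass SW transforms; no fallback allowed. */
	return offload_ok ? TX_TO_HW : TX_DROP;
}
```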

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/xfrm/xfrm_device.c | 15 +++++++++++++--
 net/xfrm/xfrm_output.c | 12 +++++++++++-
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 1cc482e9c87d..2d37bb86914a 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -120,6 +120,16 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
 	if (xo->flags & XFRM_GRO || x->xso.dir == XFRM_DEV_OFFLOAD_IN)
 		return skb;
 
+	/* The packet was sent to HW IPsec full offload engine,
+	 * but to wrong device. Drop the packet, so it won't skip
+	 * XFRM stack.
+	 */
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL && x->xso.dev != dev) {
+		kfree_skb(skb);
+		dev_core_stats_tx_dropped_inc(dev);
+		return NULL;
+	}
+
 	/* This skb was already validated on the upper/virtual dev */
 	if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
 		return skb;
@@ -369,8 +379,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 	if (!x->type_offload || x->encap)
 		return false;
 
-	if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
-	    (!xdst->child->xfrm)) {
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL ||
+	    ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
+	     !xdst->child->xfrm)) {
 		mtu = xfrm_state_mtu(x, xdst->child_mtu_cached);
 		if (skb->len <= mtu)
 			goto ok;
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 9a5e79a38c67..dde009be8463 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -494,7 +494,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 	struct xfrm_state *x = dst->xfrm;
 	struct net *net = xs_net(x);
 
-	if (err <= 0)
+	if (err <= 0 || x->xso.type == XFRM_DEV_OFFLOAD_FULL)
 		goto resume;
 
 	do {
@@ -718,6 +718,16 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
 		break;
 	}
 
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL) {
+		if (!xfrm_dev_offload_ok(skb, x)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+			kfree_skb(skb);
+			return -EHOSTUNREACH;
+		}
+
+		return xfrm_output_resume(sk, skb, 0);
+	}
+
 	secpath_reset(skb);
 
 	if (xfrm_dev_offload_ok(skb, x)) {
-- 
2.37.2



* [PATCH RFC xfrm-next v3 5/8] xfrm: add RX datapath protection for IPsec full offload mode
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (3 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Traffic received by a device with IPsec full offload enabled should be
forwarded to the stack only after decryption, with the packet headers
and trailers removed.

Such packets are expected to be seen as normal (non-XFRM) ones, while
unsupported packets should be dropped by the HW.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/xfrm.h | 55 +++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 587697eb1d31..38fff78a1421 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1094,6 +1094,29 @@ xfrm_state_addr_cmp(const struct xfrm_tmpl *tmpl, const struct xfrm_state *x, un
 	return !0;
 }
 
+#ifdef CONFIG_XFRM
+static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb)
+{
+	struct sec_path *sp = skb_sec_path(skb);
+
+	return sp->xvec[sp->len - 1];
+}
+#endif
+
+static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb)
+{
+#ifdef CONFIG_XFRM
+	struct sec_path *sp = skb_sec_path(skb);
+
+	if (!sp || !sp->olen || sp->len != sp->olen)
+		return NULL;
+
+	return &sp->ovec[sp->olen - 1];
+#else
+	return NULL;
+#endif
+}
+
 #ifdef CONFIG_XFRM
 int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb,
 			unsigned short family);
@@ -1125,10 +1148,19 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
 {
 	struct net *net = dev_net(skb->dev);
 	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
+	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct xfrm_state *x;
 
 	if (sk && sk->sk_policy[XFRM_POLICY_IN])
 		return __xfrm_policy_check(sk, ndir, skb, family);
 
+	if (xo) {
+		x = xfrm_input_state(skb);
+		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
+			return (xo->flags & CRYPTO_DONE) &&
+			       (xo->status & CRYPTO_SUCCESS);
+	}
+
 	return __xfrm_check_nopolicy(net, skb, dir) ||
 	       __xfrm_check_dev_nopolicy(skb, dir, family) ||
 	       __xfrm_policy_check(sk, ndir, skb, family);
@@ -1860,29 +1892,6 @@ static inline void xfrm_states_delete(struct xfrm_state **states, int n)
 }
 #endif
 
-#ifdef CONFIG_XFRM
-static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb)
-{
-	struct sec_path *sp = skb_sec_path(skb);
-
-	return sp->xvec[sp->len - 1];
-}
-#endif
-
-static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb)
-{
-#ifdef CONFIG_XFRM
-	struct sec_path *sp = skb_sec_path(skb);
-
-	if (!sp || !sp->olen || sp->len != sp->olen)
-		return NULL;
-
-	return &sp->ovec[sp->olen - 1];
-#else
-	return NULL;
-#endif
-}
-
 void __init xfrm_dev_init(void);
 
 #ifdef CONFIG_XFRM_OFFLOAD
-- 
2.37.2



* [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (4 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 5/8] xfrm: add RX datapath protection " Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-25  9:34   ` Steffen Klassert
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits Leon Romanovsky
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Devices that implement IPsec full offload mode offload policies too.
In the RX path, this leads to a situation where the HW can't effectively
handle mixed SW and HW priorities unless users make sure that the HW
offloaded policies have higher priorities.

In order to give users a coherent picture, require that HW offloaded
policies always have higher priorities (in both RX and TX) than SW ones.

To avoid over-engineering the code, HW policies are treated like SW ones
and the netdev is not taken into account, which allows reuse of the same
priorities for different devices.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/netns/xfrm.h |   8 ++-
 net/xfrm/xfrm_policy.c   | 113 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index bd7c3be4af5d..f0cfa0faf611 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -29,6 +29,11 @@ struct xfrm_policy_hthresh {
 	u8			rbits6;
 };
 
+struct xfrm_policy_prio {
+	u32 max_sw_prio;
+	u32 min_hw_prio;
+};
+
 struct netns_xfrm {
 	struct list_head	state_all;
 	/*
@@ -52,7 +57,7 @@ struct netns_xfrm {
 	unsigned int		policy_idx_hmask;
 	struct hlist_head	policy_inexact[XFRM_POLICY_MAX];
 	struct xfrm_policy_hash	policy_bydst[XFRM_POLICY_MAX];
-	unsigned int		policy_count[XFRM_POLICY_MAX * 2];
+	unsigned int		policy_count[XFRM_POLICY_MAX * 3];
 	struct work_struct	policy_hash_work;
 	struct xfrm_policy_hthresh policy_hthresh;
 	struct list_head	inexact_bins;
@@ -67,6 +72,7 @@ struct netns_xfrm {
 	u32			sysctl_acq_expires;
 
 	u8			policy_default[XFRM_POLICY_MAX];
+	struct xfrm_policy_prio	policy_prio[XFRM_POLICY_MAX];
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*sysctl_hdr;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 56ccedbb7212..69ef85424d4b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1570,13 +1570,70 @@ static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
 	return delpol;
 }
 
+static int __xfrm_policy_check_hw_priority(struct net *net,
+					   struct xfrm_policy *policy, int dir)
+{
+	int left, right;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir])
+		/* Adding first policy */
+		return 0;
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		/* SW priority */
+		if (!net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir])
+			/* Special case to allow reuse maximum priority
+			 * (U32_MAX) for SW policies, when no HW policy exist.
+			 */
+			return 0;
+
+		left = policy->priority;
+		right = net->xfrm.policy_prio[dir].min_hw_prio;
+	} else {
+		/* HW priority */
+		left = net->xfrm.policy_prio[dir].max_sw_prio;
+		right = policy->priority;
+	}
+	if (left >= right)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void __xfrm_policy_update_hw_priority(struct net *net,
+					     struct xfrm_policy *policy,
+					     int dir)
+{
+	u32 *hw_prio, *sw_prio;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		sw_prio = &net->xfrm.policy_prio[dir].max_sw_prio;
+		*sw_prio = max(*sw_prio, policy->priority);
+		return;
+	}
+
+	hw_prio = &net->xfrm.policy_prio[dir].min_hw_prio;
+	*hw_prio = min(*hw_prio, policy->priority);
+}
+
 int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 {
 	struct net *net = xp_net(policy);
 	struct xfrm_policy *delpol;
 	struct hlist_head *chain;
+	int ret;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+	ret = __xfrm_policy_check_hw_priority(net, policy, dir);
+	if (ret) {
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		return ret;
+	}
+
 	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
 	if (chain)
 		delpol = xfrm_policy_insert_list(chain, policy, excl);
@@ -1606,6 +1663,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 	policy->curlft.use_time = 0;
 	if (!mod_timer(&policy->timer, jiffies + HZ))
 		xfrm_pol_hold(policy);
+	__xfrm_policy_update_hw_priority(net, policy, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	if (delpol)
@@ -2271,6 +2329,8 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 
 	list_add(&pol->walk.all, &net->xfrm.policy_all);
 	net->xfrm.policy_count[dir]++;
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]++;
 	xfrm_pol_hold(pol);
 }
 
@@ -2290,6 +2350,8 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 	}
 
 	list_del_init(&pol->walk.all);
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]--;
 	net->xfrm.policy_count[dir]--;
 
 	return pol;
@@ -2305,12 +2367,58 @@ static void xfrm_sk_policy_unlink(struct xfrm_policy *pol, int dir)
 	__xfrm_policy_unlink(pol, XFRM_POLICY_MAX + dir);
 }
 
+static void __xfrm_policy_delete_prio(struct net *net,
+				      struct xfrm_policy *policy, int dir)
+{
+	struct xfrm_policy *pol;
+	u32 sw_prio = 0;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir]) {
+		net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	if (policy->xdo.type == XFRM_DEV_OFFLOAD_FULL &&
+	    !net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]) {
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead)
+			continue;
+
+		if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+			/* SW priority */
+			if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL) {
+				net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+				return;
+			}
+			sw_prio = pol->priority;
+			continue;
+		}
+		/* HW priority */
+		if (pol->xdo.type != XFRM_DEV_OFFLOAD_FULL)
+			continue;
+
+		net->xfrm.policy_prio[dir].min_hw_prio = pol->priority;
+		return;
+	}
+
+	net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+}
+
 int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 {
 	struct net *net = xp_net(pol);
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 	pol = __xfrm_policy_unlink(pol, dir);
+	if (pol)
+		__xfrm_policy_delete_prio(net, pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
 		xfrm_dev_policy_delete(pol);
@@ -4112,6 +4220,7 @@ static int __net_init xfrm_policy_init(struct net *net)
 
 		net->xfrm.policy_count[dir] = 0;
 		net->xfrm.policy_count[XFRM_POLICY_MAX + dir] = 0;
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir] = 0;
 		INIT_HLIST_HEAD(&net->xfrm.policy_inexact[dir]);
 
 		htab = &net->xfrm.policy_bydst[dir];
@@ -4197,6 +4306,10 @@ static int __net_init xfrm_net_init(struct net *net)
 	net->xfrm.policy_default[XFRM_POLICY_FWD] = XFRM_USERPOLICY_ACCEPT;
 	net->xfrm.policy_default[XFRM_POLICY_OUT] = XFRM_USERPOLICY_ACCEPT;
 
+	net->xfrm.policy_prio[XFRM_POLICY_IN].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_FWD].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_OUT].min_hw_prio = U32_MAX;
+
 	rv = xfrm_statistics_init(net);
 	if (rv < 0)
 		goto out_statistics;
-- 
2.37.2



* [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (5 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-25  9:20   ` Steffen Klassert
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 8/8] xfrm: document IPsec full offload mode Leon Romanovsky
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

In both RX and TX, traffic that undergoes the IPsec full offload
transformation is accounted by the HW. This is needed to properly
handle hard limits, which require dropping the packet.

It means that the XFRM core needs to update its internal counters with
the ones accounted by the HW, so new callbacks are introduced in this
patch.

When a soft or hard limit is reached, the driver should call
xfrm_state_check_expire(), which will perform key rekeying exactly as
done by the XFRM core.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/netdevice.h |  1 +
 include/net/xfrm.h        | 17 +++++++++++++++++
 net/xfrm/xfrm_output.c    |  1 -
 net/xfrm/xfrm_state.c     |  4 ++++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c1db9eaa3dca..e38154d7b4cd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1026,6 +1026,7 @@ struct xfrmdev_ops {
 	bool	(*xdo_dev_offload_ok) (struct sk_buff *skb,
 				       struct xfrm_state *x);
 	void	(*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+	void	(*xdo_dev_state_update_curlft) (struct xfrm_state *x);
 	int	(*xdo_dev_policy_add) (struct xfrm_policy *x);
 	void	(*xdo_dev_policy_delete) (struct xfrm_policy *x);
 	void	(*xdo_dev_policy_free) (struct xfrm_policy *x);
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 38fff78a1421..100ca45d8172 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1563,6 +1563,23 @@ struct xfrm_state *xfrm_stateonly_find(struct net *net, u32 mark, u32 if_id,
 struct xfrm_state *xfrm_state_lookup_byspi(struct net *net, __be32 spi,
 					      unsigned short family);
 int xfrm_state_check_expire(struct xfrm_state *x);
+#ifdef CONFIG_XFRM_OFFLOAD
+static inline void xfrm_dev_state_update_curlft(struct xfrm_state *x)
+{
+	struct xfrm_dev_offload *xdo = &x->xso;
+	struct net_device *dev = xdo->dev;
+
+	if (x->xso.type != XFRM_DEV_OFFLOAD_FULL)
+		return;
+
+	if (dev && dev->xfrmdev_ops &&
+	    dev->xfrmdev_ops->xdo_dev_state_update_curlft)
+		dev->xfrmdev_ops->xdo_dev_state_update_curlft(x);
+
+}
+#else
+static inline void xfrm_dev_state_update_curlft(struct xfrm_state *x) {}
+#endif
 void xfrm_state_insert(struct xfrm_state *x);
 int xfrm_state_add(struct xfrm_state *x);
 int xfrm_state_update(struct xfrm_state *x);
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index dde009be8463..a22033350ddc 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -560,7 +560,6 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTSTATEPROTOERROR);
 			goto error_nolock;
 		}
-
 		dst = skb_dst_pop(skb);
 		if (!dst) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 91c32a3b6924..83d307cb526f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -549,6 +549,8 @@ static enum hrtimer_restart xfrm_timer_handler(struct hrtimer *me)
 	int err = 0;
 
 	spin_lock(&x->lock);
+	xfrm_dev_state_update_curlft(x);
+
 	if (x->km.state == XFRM_STATE_DEAD)
 		goto out;
 	if (x->km.state == XFRM_STATE_EXPIRED)
@@ -1786,6 +1788,8 @@ EXPORT_SYMBOL(xfrm_state_update);
 
 int xfrm_state_check_expire(struct xfrm_state *x)
 {
+	xfrm_dev_state_update_curlft(x);
+
 	if (!x->curlft.use_time)
 		x->curlft.use_time = ktime_get_real_seconds();
 
-- 
2.37.2



* [PATCH RFC xfrm-next v3 8/8] xfrm: document IPsec full offload mode
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (6 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits Leon Romanovsky
@ 2022-09-04 13:15 ` Leon Romanovsky
  2022-09-04 13:19 ` [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:15 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

From: Leon Romanovsky <leonro@nvidia.com>

Extend the XFRM device offload API description with the newly
added full offload mode.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/networking/xfrm_device.rst | 62 ++++++++++++++++++++----
 1 file changed, 53 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index 01391dfd37d9..007b9de622c0 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -5,6 +5,7 @@ XFRM device - offloading the IPsec computations
 ===============================================
 
 Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
 
 
 Overview
@@ -18,10 +19,21 @@ can radically increase throughput and decrease CPU utilization.  The XFRM
 Device interface allows NIC drivers to offer to the stack access to the
 hardware offload.
 
+Right now, there are two types of hardware offload that the kernel supports.
+ * IPsec crypto offload:
+   * NIC performs encrypt/decrypt
+   * Kernel does everything else
+ * IPsec full offload:
+   * NIC performs encrypt/decrypt
+   * NIC does encapsulation
+   * Kernel and NIC have SA and policy in-sync
+   * NIC handles the SA and policy states
+   * The kernel talks to the key manager
+
 Userland access to the offload is typically through a system such as
 libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
 be handy when experimenting.  An example command might look something
-like this::
+like this for crypto offload::
 
   ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
      reqid 0x07 replay-window 32 \
@@ -29,6 +41,17 @@ like this::
      sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
      offload dev eth4 dir in
 
+and for full offload::
+
+  ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+     reqid 0x07 replay-window 32 \
+     aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+     sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+     offload full dev eth4 dir in
+
+  ip x p add src 14.0.0.70 dst 14.0.0.52 offload full dev eth4 dir in
+  tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
 Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
 
 
@@ -40,17 +63,24 @@ Callbacks to implement
 
   /* from include/linux/netdevice.h */
   struct xfrmdev_ops {
+        /* Crypto and Full offload callbacks */
 	int	(*xdo_dev_state_add) (struct xfrm_state *x);
 	void	(*xdo_dev_state_delete) (struct xfrm_state *x);
 	void	(*xdo_dev_state_free) (struct xfrm_state *x);
 	bool	(*xdo_dev_offload_ok) (struct sk_buff *skb,
 				       struct xfrm_state *x);
 	void    (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+
+        /* Solely full offload callbacks */
+	void    (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
+	int	(*xdo_dev_policy_add) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_delete) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_free) (struct xfrm_policy *x);
   };
 
-The NIC driver offering ipsec offload will need to implement these
-callbacks to make the offload available to the network stack's
-XFRM subsystem.  Additionally, the feature bits NETIF_F_HW_ESP and
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to the supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
 NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
 
 
@@ -79,7 +109,8 @@ and an indication of whether it is for Rx or Tx.  The driver should
 
 		===========   ===================================
 		0             success
-		-EOPNETSUPP   offload not supported, try SW IPsec
+		-EOPNOTSUPP   offload not supported, try SW IPsec,
+                              not applicable for full offload mode
 		other         fail the request
 		===========   ===================================
 
@@ -96,6 +127,7 @@ will serviceable.  This can check the packet information to be sure the
 offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
 return true of false to signify its support.
 
+Crypto offload mode:
 When ready to send, the driver needs to inspect the Tx packet for the
 offload information, including the opaque context, and set up the packet
 send accordingly::
@@ -139,13 +171,25 @@ the stack in xfrm_input().
 In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
 Driver will check packet seq number and update HW ESN state machine if needed.
 
+Full offload mode:
+HW adds and deletes the XFRM headers. So in the RX path, the XFRM stack is
+bypassed if the HW reported success. In the TX path, the packet leaves the
+kernel unencrypted and without the extra headers; the HW adds them itself.
+
 When the SA is removed by the user, the driver's xdo_dev_state_delete()
-is asked to disable the offload.  Later, xdo_dev_state_free() is called
-from a garbage collection routine after all reference counts to the state
+and xdo_dev_policy_delete() are asked to disable the offload.  Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
 have been removed and any remaining resources can be cleared for the
 offload state.  How these are used by the driver will depend on specific
 hardware needs.
 
 As a netdev is set to DOWN the XFRM stack's netdev listener will call
-xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
-states.
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+As the HW handles the packets, the XFRM core can't account the hard and soft
+limits itself. The HW/driver are responsible for doing so and for providing
+accurate data when xdo_dev_state_update_curlft() is called. When one of
+these limits is reached, the driver needs to call xfrm_state_check_expire()
+to make sure that XFRM performs the rekeying sequence.
-- 
2.37.2



* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (7 preceding siblings ...)
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 8/8] xfrm: document IPsec full offload mode Leon Romanovsky
@ 2022-09-04 13:19 ` Leon Romanovsky
  2022-09-08  9:56 ` Leon Romanovsky
  2022-09-19  9:31 ` Leon Romanovsky
  10 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-04 13:19 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:34PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Changelog:
> v4:
>  * Changed title from "PATCH" to "PATCH RFC" per-request.
>  * Added two new patches: one to update hard/soft limits and another
>    initial take on documentation.
>  * Added more info about lifetime/rekeying flow to cover letter, see
>    relevant section.
>  * perf traces for crypto mode will come later.

<...>

The series is v4, not v3 as written in the subject title.

Thanks

> -- 
> 2.37.2
> 


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (8 preceding siblings ...)
  2022-09-04 13:19 ` [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
@ 2022-09-08  9:56 ` Leon Romanovsky
  2022-09-21 14:59   ` Jakub Kicinski
  2022-09-19  9:31 ` Leon Romanovsky
  10 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-08  9:56 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:34PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Changelog:
> v4:

<...>

>  * perf traces for crypto mode will come later.

Single core RX:
# Children      Self       Samples  Command        Shared Object       Symbol
# ........  ........  ............  .............  ..................  .......................................
#
    99.97%     0.00%             0  ksoftirqd/143  [kernel.vmlinux]    [k] ret_from_fork
            |
            ---ret_from_fork
               kthread
               smpboot_thread_fn
               |
                --99.96%--run_ksoftirqd
                          __do_softirq
                          |
                           --99.86%--net_rx_action
                                     |
                                     |--61.49%--mlx5e_napi_poll
                                     |          |
                                     |          |--58.43%--mlx5e_poll_rx_cq
                                     |          |          |
                                     |          |           --57.27%--mlx5e_handle_rx_cqe
                                     |          |                     |
                                     |          |                     |--33.05%--napi_gro_receive
                                     |          |                     |          |
                                     |          |                     |           --32.42%--dev_gro_receive
                                     |          |                     |                     |
                                     |          |                     |                      --29.64%--inet_gro_receive
                                     |          |                     |                                |
                                     |          |                     |                                 --27.70%--esp4_gro_receive
                                     |          |                     |                                           |
                                     |          |                     |                                            --25.80%--xfrm_input
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--6.86%--xfrm4_transport_finish
                                     |          |                     |                                                      |          |
                                     |          |                     |                                                      |          |--4.19%--__memmove
                                     |          |                     |                                                      |          |
                                     |          |                     |                                                      |           --1.27%--ip_send_check
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--6.02%--esp_input_done2
                                     |          |                     |                                                      |          |
                                     |          |                     |                                                      |           --3.26%--skb_copy_bits
                                     |          |                     |                                                      |                     |
                                     |          |                     |                                                      |                      --2.50%--memcpy_erms
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--2.19%--_raw_spin_lock
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--1.22%--xfrm_rcv_cb
                                     |          |                     |                                                      |          |
                                     |          |                     |                                                      |           --0.68%--xfrm4_rcv_cb
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--0.97%--xfrm_inner_mode_input.isra.35
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--0.97%--gro_cells_receive
                                     |          |                     |                                                      |
                                     |          |                     |                                                      |--0.69%--esp_input_tail
                                     |          |                     |                                                      |
                                     |          |                     |                                                       --0.66%--xfrm_parse_spi
                                     |          |                     |
                                     |          |                     |--11.91%--mlx5e_skb_from_cqe_linear
                                     |          |                     |          |
                                     |          |                     |           --5.63%--build_skb
                                     |          |                     |                     |
                                     |          |                     |                      --3.82%--__build_skb
                                     |          |                     |                                |
                                     |          |                     |                                 --1.97%--kmem_cache_alloc
                                     |          |                     |
                                     |          |                      --9.97%--mlx5e_build_rx_skb
                                     |          |                                |
                                     |          |                                 --7.23%--mlx5e_ipsec_offload_handle_rx_skb
                                     |          |                                           |
                                     |          |                                           |--3.60%--secpath_set
                                     |          |                                           |          |
                                     |          |                                           |           --3.41%--skb_ext_add
                                     |          |                                           |                     |
                                     |          |                                           |                      --2.69%--__skb_ext_alloc
                                     |          |                                           |                                |
                                     |          |                                           |                                 --2.58%--kmem_cache_alloc
                                     |          |                                           |                                           |
                                     |          |                                           |                                            --0.60%--__slab_alloc
                                     |          |                                           |                                                      |
                                     |          |                                           |                                                       --0.56%--___slab_alloc
                                     |          |                                           |
                                     |          |                                            --2.52%--mlx5e_ipsec_sadb_rx_lookup
                                     |          |
                                     |           --2.78%--mlx5e_post_rx_wqes
				   
I have TX traces too and can add if RX are not sufficient. 

Thanks

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (9 preceding siblings ...)
  2022-09-08  9:56 ` Leon Romanovsky
@ 2022-09-19  9:31 ` Leon Romanovsky
  2022-09-22  7:17   ` Leon Romanovsky
  10 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-19  9:31 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:34PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>

<...>

> Leon Romanovsky (8):
>   xfrm: add new full offload flag
>   xfrm: allow state full offload mode
>   xfrm: add an interface to offload policy
>   xfrm: add TX datapath support for IPsec full offload mode
>   xfrm: add RX datapath protection for IPsec full offload mode
>   xfrm: enforce separation between priorities of HW/SW policies
>   xfrm: add support to HW update soft and hard limits
>   xfrm: document IPsec full offload mode

Kindly reminder.

Thanks


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-08  9:56 ` Leon Romanovsky
@ 2022-09-21 14:59   ` Jakub Kicinski
  2022-09-21 17:37     ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Kicinski @ 2022-09-21 14:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> I have TX traces too and can add if RX are not sufficient. 

The perf trace is good, but for those of us not intimately familiar
with xfrm, could you provide some analysis here?


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-21 14:59   ` Jakub Kicinski
@ 2022-09-21 17:37     ` Leon Romanovsky
  2022-09-25  9:40       ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-21 17:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Steffen Klassert, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Wed, Sep 21, 2022 at 07:59:27AM -0700, Jakub Kicinski wrote:
> On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> > I have TX traces too and can add if RX are not sufficient. 
> 
> The perf trace is good, but for those of us not intimately familiar
> with xfrm, could you provide some analysis here?

The perf trace presented is for RX path of IPsec crypto offload mode. In that
mode, decrypted packet enters the netdev stack to perform various XFRM specific
checks.

The trace shows "the cost" of these checks, which is about 26% according to the
line "--25.80%--xfrm_input".

xfrm_input has a number of "slow" spots (the other places are not fast either),
which in IPsec full offload mode are handled by the HW in parallel, without any
locks.

The avoided checks include:
 * XFRM state lookup, which is a linked-list iteration.
 * Locking of the whole xfrm_state, which means that parallel traffic
   contends on this lock.
 * Double calculation of replay window protection.
 * Update of the replay window.

https://elixir.bootlin.com/linux/v6.0-rc6/source/net/xfrm/xfrm_input.c#L459
int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
{
...
		x = xfrm_state_lookup(net, mark, daddr, spi, nexthdr, family);
...
		spin_lock(&x->lock);
...
		if (xfrm_replay_check(x, skb, seq)) {
...
		spin_unlock(&x->lock);
...
		spin_lock(&x->lock);
...
		if (xfrm_replay_recheck(x, skb, seq)) {
...
		xfrm_replay_advance(x, seq);
.....




* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-19  9:31 ` Leon Romanovsky
@ 2022-09-22  7:17   ` Leon Romanovsky
  2022-09-22  7:35     ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-22  7:17 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Mon, Sep 19, 2022 at 12:31:11PM +0300, Leon Romanovsky wrote:
> On Sun, Sep 04, 2022 at 04:15:34PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> 
> <...>
> 
> > Leon Romanovsky (8):
> >   xfrm: add new full offload flag
> >   xfrm: allow state full offload mode
> >   xfrm: add an interface to offload policy
> >   xfrm: add TX datapath support for IPsec full offload mode
> >   xfrm: add RX datapath protection for IPsec full offload mode
> >   xfrm: enforce separation between priorities of HW/SW policies
> >   xfrm: add support to HW update soft and hard limits
> >   xfrm: document IPsec full offload mode
> 
> Kindly reminder.

Hi Steffen,

Can we please make progress with the series? I would like to see it merged
in this merge window, so that I can proceed with the iproute2 and mlx5
code too.

It has been on the ML for approximately 3 weeks already without any objections.
The implementation proposed here has nothing specific to our HW and is
applicable to other vendors as well.

Thanks

> 
> Thanks


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-22  7:17   ` Leon Romanovsky
@ 2022-09-22  7:35     ` Steffen Klassert
  0 siblings, 0 replies; 30+ messages in thread
From: Steffen Klassert @ 2022-09-22  7:35 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Thu, Sep 22, 2022 at 10:17:31AM +0300, Leon Romanovsky wrote:
> On Mon, Sep 19, 2022 at 12:31:11PM +0300, Leon Romanovsky wrote:
> > On Sun, Sep 04, 2022 at 04:15:34PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > <...>
> > 
> > > Leon Romanovsky (8):
> > >   xfrm: add new full offload flag
> > >   xfrm: allow state full offload mode
> > >   xfrm: add an interface to offload policy
> > >   xfrm: add TX datapath support for IPsec full offload mode
> > >   xfrm: add RX datapath protection for IPsec full offload mode
> > >   xfrm: enforce separation between priorities of HW/SW policies
> > >   xfrm: add support to HW update soft and hard limits
> > >   xfrm: document IPsec full offload mode
> > 
> > Kindly reminder.
> 
> Hi Steffen,
> 
> Can we please make progress with the series? I would like to see it merged
> in this merge window, so that I can proceed with the iproute2 and mlx5
> code too.
> 
> It has been on the ML for approximately 3 weeks already without any objections.
> The implementation proposed here has nothing specific to our HW and is
> applicable to other vendors as well.

I've just replied to our private thread about this. I'll have a look
at the patches tomorrow. I've just returned from vacation and am still
a bit backlogged...


* Re: [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
@ 2022-09-25  9:16   ` Steffen Klassert
  2022-09-26  6:06     ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-25  9:16 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:38PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> In IPsec full mode, the device is going to encrypt and encapsulate
> packets that are associated with an offloaded policy. After a successful
> policy lookup indicating that packets should be offloaded,
> the stack forwards them to the device to do the magic.
> 
> Signed-off-by: Raed Salem <raeds@nvidia.com>
> Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  net/xfrm/xfrm_device.c | 15 +++++++++++++--
>  net/xfrm/xfrm_output.c | 12 +++++++++++-
>  2 files changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> index 1cc482e9c87d..2d37bb86914a 100644
> --- a/net/xfrm/xfrm_device.c
> +++ b/net/xfrm/xfrm_device.c
> @@ -120,6 +120,16 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
>  	if (xo->flags & XFRM_GRO || x->xso.dir == XFRM_DEV_OFFLOAD_IN)
>  		return skb;
>  
> +	/* The packet was sent to HW IPsec full offload engine,
> +	 * but to wrong device. Drop the packet, so it won't skip
> +	 * XFRM stack.
> +	 */
> +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL && x->xso.dev != dev) {
> +		kfree_skb(skb);
> +		dev_core_stats_tx_dropped_inc(dev);
> +		return NULL;
> +	}
> +
>  	/* This skb was already validated on the upper/virtual dev */
>  	if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
>  		return skb;
> @@ -369,8 +379,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
>  	if (!x->type_offload || x->encap)
>  		return false;
>  
> -	if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
> -	    (!xdst->child->xfrm)) {
> +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL ||
> +	    ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
> +	     !xdst->child->xfrm)) {
>  		mtu = xfrm_state_mtu(x, xdst->child_mtu_cached);
>  		if (skb->len <= mtu)
>  			goto ok;
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index 9a5e79a38c67..dde009be8463 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -494,7 +494,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
>  	struct xfrm_state *x = dst->xfrm;
>  	struct net *net = xs_net(x);
>  
> -	if (err <= 0)
> +	if (err <= 0 || x->xso.type == XFRM_DEV_OFFLOAD_FULL)
>  		goto resume;

You check here that the state is marked as 'full offload' before
you skip the SW xfrm handling, but I don't see where you check
that the policy that led to this state is offloaded too. Also,
we have to make sure that both the policy and the state are offloaded
to the same device. Looks like this part is missing.



* Re: [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits Leon Romanovsky
@ 2022-09-25  9:20   ` Steffen Klassert
  2022-09-26  6:07     ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-25  9:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:41PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> In both RX and TX, traffic that undergoes the IPsec full offload
> transformation is accounted by the HW. This is needed to properly handle
> hard limits, which require dropping the packet.
> 
> It means that the XFRM core needs to update its internal counters with the
> ones accounted by the HW, so new callbacks are introduced in this patch.
> 
> In case a soft or hard limit is hit, the driver should call
> xfrm_state_check_expire(), which will perform key rekeying exactly as
> done by the XFRM core.
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

This looks good, thanks!

We need this for the other relevant counters too.


* Re: [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies
  2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
@ 2022-09-25  9:34   ` Steffen Klassert
  2022-09-26  6:38     ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-25  9:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S. Miller, Eric Dumazet, Herbert Xu,
	Jakub Kicinski, netdev, Paolo Abeni, Raed Salem, Saeed Mahameed,
	Bharat Bhushan

On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Devices that implement IPsec full offload mode offload policies too.
> In the RX path, this leads to a situation where the HW can't effectively
> handle mixed SW and HW priorities unless users make sure that HW offloaded
> policies have higher priorities.
> 
> In order to make sure that users have a coherent picture, let's require
> that HW offloaded policies always have higher priorities (both RX and TX)
> than SW ones.
> 
> To avoid over-engineering the code, HW policies are treated as SW ones and
> don't take the netdev into account, to allow reuse of the same priorities
> for different devices.

I think we should split the HW and SW SPD (and maybe even the SAD) and
prioritize over the SPDs instead of doing that in one single SPD. Each NIC
should maintain its own databases and we should do the lookups from SW with
a callback. With the current approach, we still do the costly full
policy and state lookup on the TX side in software. On a 'full offload'
that should happen in HW too. Also, that will make things easier with
tunnel mode where we can have overlapping traffic selectors.

We can keep a HW SPD in software as a fallback for devices that don't
support the offloaded lookup, but in the long run lookups for the RX
and TX paths should happen in HW.



* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-21 17:37     ` Leon Romanovsky
@ 2022-09-25  9:40       ` Steffen Klassert
  2022-09-26  6:55         ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-25  9:40 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Wed, Sep 21, 2022 at 08:37:06PM +0300, Leon Romanovsky wrote:
> On Wed, Sep 21, 2022 at 07:59:27AM -0700, Jakub Kicinski wrote:
> > On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> > > I have TX traces too and can add if RX are not sufficient. 
> > 
> > The perf trace is good, but for those of us not intimately familiar
> > with xfrm, could you provide some analysis here?
> 
> > The perf trace presented is for the RX path of IPsec crypto offload mode. In that
> > mode, the decrypted packet enters the netdev stack to perform various XFRM-specific
> > checks.

Can you provide the perf traces and analysis for the TX side too? That
would be particularly interesting, because the policy and state lookups
there still happen in software.

Thanks a lot for your effort on this!


* Re: [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode
  2022-09-25  9:16   ` Steffen Klassert
@ 2022-09-26  6:06     ` Leon Romanovsky
  2022-09-27  5:04       ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-26  6:06 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 25, 2022 at 11:16:03AM +0200, Steffen Klassert wrote:
> On Sun, Sep 04, 2022 at 04:15:38PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > In IPsec full mode, the device is going to encrypt and encapsulate
> > packets that are associated with an offloaded policy. After a successful
> > policy lookup indicating that packets should be offloaded,
> > the stack forwards them to the device to do the magic.
> > 
> > Signed-off-by: Raed Salem <raeds@nvidia.com>
> > Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  net/xfrm/xfrm_device.c | 15 +++++++++++++--
> >  net/xfrm/xfrm_output.c | 12 +++++++++++-
> >  2 files changed, 24 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> > index 1cc482e9c87d..2d37bb86914a 100644
> > --- a/net/xfrm/xfrm_device.c
> > +++ b/net/xfrm/xfrm_device.c
> > @@ -120,6 +120,16 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
> >  	if (xo->flags & XFRM_GRO || x->xso.dir == XFRM_DEV_OFFLOAD_IN)
> >  		return skb;
> >  
> > +	/* The packet was sent to HW IPsec full offload engine,
> > +	 * but to wrong device. Drop the packet, so it won't skip
> > +	 * XFRM stack.
> > +	 */
> > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL && x->xso.dev != dev) {
> > +		kfree_skb(skb);
> > +		dev_core_stats_tx_dropped_inc(dev);
> > +		return NULL;
> > +	}
> > +
> >  	/* This skb was already validated on the upper/virtual dev */
> >  	if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
> >  		return skb;
> > @@ -369,8 +379,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
> >  	if (!x->type_offload || x->encap)
> >  		return false;
> >  
> > -	if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
> > -	    (!xdst->child->xfrm)) {
> > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL ||
> > +	    ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
> > +	     !xdst->child->xfrm)) {
> >  		mtu = xfrm_state_mtu(x, xdst->child_mtu_cached);
> >  		if (skb->len <= mtu)
> >  			goto ok;
> > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > index 9a5e79a38c67..dde009be8463 100644
> > --- a/net/xfrm/xfrm_output.c
> > +++ b/net/xfrm/xfrm_output.c
> > @@ -494,7 +494,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
> >  	struct xfrm_state *x = dst->xfrm;
> >  	struct net *net = xs_net(x);
> >  
> > -	if (err <= 0)
> > +	if (err <= 0 || x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> >  		goto resume;
> 
> You check here that the state is marked as 'full offload' before
> you skip the SW xfrm handling, but I don't see where you check
> that the policy that led to this state is offloaded too. Also,
> we have to make sure that both the policy and the state are offloaded
> to the same device. Looks like this part is missing.

In the SW flow, users are not required to configure a policy. If they don't
have a policy, the packet will be encrypted and sent anyway.

Full offload follows the same semantics: a missing offloaded policy is
equivalent to no policy at all.

I don't think that extra checks are needed.

Thanks

> 


* Re: [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits
  2022-09-25  9:20   ` Steffen Klassert
@ 2022-09-26  6:07     ` Leon Romanovsky
  2022-09-27  5:49       ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-26  6:07 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 25, 2022 at 11:20:06AM +0200, Steffen Klassert wrote:
> On Sun, Sep 04, 2022 at 04:15:41PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > In both RX and TX, traffic that undergoes the IPsec full offload
> > transformation is accounted by the HW. This is needed to properly handle
> > hard limits, which require dropping the packet.
> > 
> > It means that the XFRM core needs to update its internal counters with the
> > ones accounted by the HW, so new callbacks are introduced in this patch.
> > 
> > In case a soft or hard limit is hit, the driver should call
> > xfrm_state_check_expire(), which will perform key rekeying exactly as
> > done by the XFRM core.
> > 
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> This looks good, thanks!
> 
> We need this for the other relevant counters too.

It is in my backlog.

Thanks


* Re: [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies
  2022-09-25  9:34   ` Steffen Klassert
@ 2022-09-26  6:38     ` Leon Romanovsky
  2022-09-27  5:48       ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-26  6:38 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Devices that implement IPsec full offload mode offload policies too.
> > In the RX path, this leads to a situation where the HW can't effectively
> > handle mixed SW and HW priorities unless users make sure that HW offloaded
> > policies have higher priorities.
> > 
> > In order to make sure that users have a coherent picture, let's require
> > that HW offloaded policies always have higher priorities (both RX and TX)
> > than SW ones.
> > 
> > To avoid over-engineering the code, HW policies are treated as SW ones and
> > don't take the netdev into account, to allow reuse of the same priorities
> > for different devices.
> 
> I think we should split the HW and SW SPD (and maybe even the SAD) and prioritize
> over the SPDs instead of doing that in one single SPD. Each NIC should
> maintain its own databases and we should do the lookups from SW with
> a callback. 

I don't understand how it will work and scale.

Every packet needs to be classified as to whether it is on the offloaded
path or not. To ensure that, we will need two identifiers: the targeted
device (part of the skb, so ok) and the relevant SP/SA.

The latter is needed to make sure that we perform the lookup only on
offloaded SP/SAs. It means that we will now do the lookup twice: first
to see if the SP/SA is offloaded, and second to see if it is in HW.

"HW lookup" is also a misleading name, as the lookup will be in driver code,
in a very similar way to how the SADB is managed for crypto mode. It makes no
sense to convert data from the XFRM representation to the HW format, execute
in HW and convert the returned result back. It will also be slow, because
lookup of SP/SAs based on XFRM properties is not a datapath operation.

In any case, you will have a double lookup. You will need to query the XFRM
core DBs and later call into the driver DB, or vice versa.

That is, unless you want to always perform the HW lookup regardless of SP/SA
state and hurt performance on the non-offloaded path.

> With the current approach, we still do the costly full
> policy and state lookup on the TX side in software. On a 'full offload'
> that should happen in HW too.

In the proposed approach, you have only one lookup, which is better than two.
I'm not even talking about the "quality" of driver lookup implementations.

> Also, that will make things easier with tunnel mode where we can have
> overlapping traffic selectors.

Can we please put tunnel mode aside? It is a long journey.

> 
> We can keep a HW SPD in software as a fallback for devices that don't
> support the offloaded lookup, but in the long run lookups for the RX
> and TX paths should happen in HW.

I doubt about it.

> 


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-25  9:40       ` Steffen Klassert
@ 2022-09-26  6:55         ` Leon Romanovsky
  2022-09-27  5:59           ` Steffen Klassert
  0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-26  6:55 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Sun, Sep 25, 2022 at 11:40:39AM +0200, Steffen Klassert wrote:
> On Wed, Sep 21, 2022 at 08:37:06PM +0300, Leon Romanovsky wrote:
> > On Wed, Sep 21, 2022 at 07:59:27AM -0700, Jakub Kicinski wrote:
> > > On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> > > > I have TX traces too and can add if RX are not sufficient. 
> > > 
> > > The perf trace is good, but for those of us not intimately familiar
> > > with xfrm, could you provide some analysis here?
> > 
> > > The perf trace presented is for the RX path of IPsec crypto offload mode. In that
> > > mode, the decrypted packet enters the netdev stack to perform various XFRM-specific
> > > checks.
> 
> Can you provide the perf traces and analysis for the TX side too? That
> would be interesting in particular, because the policy and state lookups
> there happen still in software.

Single-core TX (crypto mode) from the same run.
Please note that it is not really the bottleneck; most likely the RX load led
to a situation where TX was not exercised enough. It is also a lighter path
than RX.

# Children      Self       Samples  Command          Shared Object       Symbol                                        
# ........  ........  ............  ...............  ..................  ..............................................
#
    86.58%     0.00%             0  swapper          [kernel.vmlinux]    [k] secondary_startup_64_no_verify
            |
            ---secondary_startup_64_no_verify
               start_secondary
               cpu_startup_entry
               do_idle
               |          
                --86.37%--cpu_idle_poll
                          |          
                           --24.53%--asm_common_interrupt
                                     |          
                                      --24.48%--common_interrupt
                                                |          
                                                |--23.47%--irq_exit_rcu
                                                |          |          
                                                |           --23.23%--do_softirq_own_stack
                                                |                     |          
                                                |                      --23.17%--asm_call_irq_on_stack
                                                |                                __do_softirq
                                                |                                |          
                                                |                                |--22.23%--net_rx_action
                                                |                                |          |          
                                                |                                |          |--20.17%--gro_cell_poll
                                                |                                |          |          |          
                                                |                                |          |           --20.02%--napi_complete_done
                                                |                                |          |                     |          
                                                |                                |          |                      --19.98%--gro_normal_list.part.154
                                                |                                |          |                                |          
                                                |                                |          |                                 --19.96%--netif_receive_skb_list_internal
                                                |                                |          |                                           |          
                                                |                                |          |                                            --19.89%--__netif_receive_skb_list_core
                                                |                                |          |                                                      |          
                                                |                                |          |                                                       --19.77%--ip_list_rcv
                                                |                                |          |                                                                 |          
                                                |                                |          |                                                                  --19.67%--ip_sublist_rcv
                                                |                                |          |                                                                            |          
                                                |                                |          |                                                                             --19.56%--ip_sublist_rcv_finish
                                                |                                |          |                                                                                       |          
                                                |                                |          |                                                                                        --19.54%--ip_local_deliver
                                                |                                |          |                                                                                                  |          
                                                |                                |          |                                                                                                   --19.49%--ip_local_deliver_finish
                                                |                                |          |                                                                                                             |          
                                                |                                |          |                                                                                                              --19.47%--ip_protocol_deliver_rcu
                                                |                                |          |                                                                                                                        |          
                                                |                                |          |                                                                                                                         --19.43%--tcp_v4_rcv
                                                |                                |          |                                                                                                                                   |          
                                                |                                |          |                                                                                                                                    --18.87%--tcp_v4_do_rcv
                                                |                                |          |                                                                                                                                              |          
                                                |                                |          |                                                                                                                                               --18.83%--tcp_rcv_established
                                                |                                |          |                                                                                                                                                         |          
                                                |                                |          |                                                                                                                                                         |--16.41%--__tcp_push_pending_frames
                                                |                                |          |                                                                                                                                                         |          |          
                                                |                                |          |                                                                                                                                                         |           --16.38%--tcp_write_xmit
                                                |                                |          |                                                                                                                                                         |                     |          
                                                |                                |          |                                                                                                                                                         |                     |--6.35%--tcp_event_new_data_sent
                                                |                                |          |                                                                                                                                                         |                     |          |          
                                                |                                |          |                                                                                                                                                         |                     |           --6.22%--sk_reset_timer
                                                |                                |          |                                                                                                                                                         |                     |                     |          
                                                |                                |          |                                                                                                                                                         |                     |                      --6.21%--mod_timer
                                                |                                |          |                                                                                                                                                         |                     |                                |          
                                                |                                |          |                                                                                                                                                         |                     |                                 --6.10%--get_nohz_timer_target
                                                |                                |          |                                                                                                                                                         |                     |                                           |          
                                                |                                |          |                                                                                                                                                         |                     |                                            --1.87%--cpumask_next_and
                                                |                                |          |                                                                                                                                                         |                     |                                                      |          
                                                |                                |          |                                                                                                                                                         |                     |                                                       --1.07%--_find_next_bit.constprop.1
                                                |                                |          |                                                                                                                                                         |                     |          
                                                |                                |          |                                                                                                                                                         |                     |--5.50%--tcp_schedule_loss_probe
                                                |                                |          |                                                                                                                                                         |                     |          |          
                                                |                                |          |                                                                                                                                                         |                     |           --5.49%--sk_reset_timer
                                                |                                |          |                                                                                                                                                         |                     |                     mod_timer
                                                |                                |          |                                                                                                                                                         |                     |                     |          
                                                |                                |          |                                                                                                                                                         |                     |                      --5.43%--get_nohz_timer_target
                                                |                                |          |                                                                                                                                                         |                     |                                |          
                                                |                                |          |                                                                                                                                                         |                     |                                 --1.37%--cpumask_next_and
                                                |                                |          |                                                                                                                                                         |                     |                                           |          
                                                |                                |          |                                                                                                                                                         |                     |                                            --0.71%--_find_next_bit.constprop.1
                                                |                                |          |                                                                                                                                                         |                     |          
                                                |                                |          |                                                                                                                                                         |                      --4.31%--__tcp_transmit_skb
                                                |                                |          |                                                                                                                                                         |                                |          
                                                |                                |          |                                                                                                                                                         |                                 --3.87%--__ip_queue_xmit
                                                |                                |          |                                                                                                                                                         |                                           |          
                                                |                                |          |                                                                                                                                                         |                                            --3.54%--xfrm4_output
                                                |                                |          |                                                                                                                                                         |                                                      |          
                                                |                                |          |                                                                                                                                                         |                                                       --3.26%--xfrm_output_resume
                                                |                                |          |                                                                                                                                                         |                                                                 |          
                                                |                                |          |                                                                                                                                                         |                                                                  --2.88%--ip_output
                                                |                                |          |                                                                                                                                                         |                                                                            |          
                                                |                                |          |                                                                                                                                                         |                                                                             --2.78%--ip_finish_output2
                                                |                                |          |                                                                                                                                                         |                                                                                       |          
                                                |                                |          |                                                                                                                                                         |                                                                                        --2.73%--__dev_queue_xmit
                                                |                                |          |                                                                                                                                                         |                                                                                                  |          
                                                |                                |          |                                                                                                                                                         |                                                                                                   --2.49%--sch_direct_xmit
                                                |                                |          |                                                                                                                                                         |                                                                                                             |          
                                                |                                |          |                                                                                                                                                         |                                                                                                             |--1.50%--validate_xmit_skb_list
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |          |          
                                                |                                |          |                                                                                                                                                         |                                                                                                             |           --1.32%--validate_xmit_skb
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                     |          
                                                |                                |          |                                                                                                                                                         |                                                                                                             |                      --1.06%--__skb_gso_segment
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                |          
                                                |                                |          |                                                                                                                                                         |                                                                                                             |                                 --1.04%--skb_mac_gso_segment
                                                |                                |          |                                                                                                                                                         |                                                                                                             |                                           |          
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                            --1.02%--inet_gso_segment
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                      |          
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                       --0.93%--esp4_gso_segment
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                                 |          
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                                  --0.86%--tcp_gso_segment
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                                            |          
                                                 |                                |          |                                                                                                                                                         |                                                                                                             |                                                                             --0.78%--skb_segment
                                                |                                |          |                                                                                                                                                         |                                                                                                             |          
                                                |                                |          |                                                                                                                                                         |                                                                                                              --0.77%--dev_hard_start_xmit
                                                 |                                |          |                                                                                                                                                         |                                                                                                                        |          
                                                |                                |          |                                                                                                                                                         |                                                                                                                         --0.75%--mlx5e_xmit
                                                |                                |          |                                                                                                                                                         |          
                                                |                                |          |                                                                                                                                                          --1.87%--tcp_ack
                                                |                                |          |                                                                                                                                                                    |          
                                                |                                |          |                                                                                                                                                                     --1.66%--tcp_clean_rtx_queue
                                                |                                |          |                                                                                                                                                                               |          
                                                |                                |          |                                                                                                                                                                                --1.35%--__kfree_skb
                                                |                                |          |                                                                                                                                                                                          |          
                                                |                                |          |                                                                                                                                                                                           --1.21%--skb_release_data
                                                |                                |          |          
                                                |                                |           --1.92%--mlx5e_napi_poll
                                                |                                |                     |          
                                                |                                |                      --1.38%--mlx5e_poll_rx_cq
                                                |                                |                                |          
                                                |                                |                                 --1.33%--mlx5e_handle_rx_cqe
                                                |                                |                                           |          
                                                |                                |                                            --0.53%--napi_gro_receive
                                                |                                |                                                      |          
                                                |                                |                                                       --0.52%--dev_gro_receive
                                                |                                |          
                                                |                                 --0.77%--tasklet_action_common.isra.17
                                                |          
                                                 --0.80%--asm_call_irq_on_stack
                                                           |          
                                                            --0.78%--handle_edge_irq
                                                                      |          
                                                                       --0.74%--handle_irq_event
                                                                                 |          
                                                                                  --0.71%--handle_irq_event_percpu
                                                                                            |          
                                                                                             --0.64%--__handle_irq_event_percpu
                                                                                                       |          
                                                                                                        --0.60%--mlx5_irq_int_handler
                                                                                                                  |          
                                                                                                                   --0.58%--atomic_notifier_call_chain
                                                                                                                             |          
                                                                                                                              --0.57%--mlx5_eq_comp_int


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode
  2022-09-26  6:06     ` Leon Romanovsky
@ 2022-09-27  5:04       ` Steffen Klassert
  0 siblings, 0 replies; 30+ messages in thread
From: Steffen Klassert @ 2022-09-27  5:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Mon, Sep 26, 2022 at 09:06:48AM +0300, Leon Romanovsky wrote:
> On Sun, Sep 25, 2022 at 11:16:03AM +0200, Steffen Klassert wrote:
> > On Sun, Sep 04, 2022 at 04:15:38PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > > index 9a5e79a38c67..dde009be8463 100644
> > > --- a/net/xfrm/xfrm_output.c
> > > +++ b/net/xfrm/xfrm_output.c
> > > @@ -494,7 +494,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
> > >  	struct xfrm_state *x = dst->xfrm;
> > >  	struct net *net = xs_net(x);
> > >  
> > > -	if (err <= 0)
> > > +	if (err <= 0 || x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > >  		goto resume;
> > 
> > You check here that the state is marked as 'full offload' before
> > you skip the SW xfrm handling, but I don't see where you check
> > that the policy that led to this state is offloaded too. Also,
> > we have to make sure that both policy and state are offloaded to
> > the same device. Looks like this part is missing.
> 
> In the SW flow, users are not required to configure a policy. If they
> don't have a policy, the packet will be encrypted and sent anyway.

No, it is not! You can't look up a TX SA without a policy. The lookup
happens in two stages. The packet header is matched against the TS of
the policy. Then the template found at the policy is used to look up
the SA.
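The two-stage TX lookup described above can be sketched in a much-simplified, purely illustrative model; the types and helpers below are hypothetical stand-ins, not the kernel's real xfrm structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct selector { uint32_t saddr, daddr; };	/* toy traffic selector */
struct tmpl     { uint32_t spi; };		/* toy policy template */
struct policy   { struct selector sel; struct tmpl tmpl; };
struct sa       { uint32_t spi; };

/* Stage 1: match the packet header against the policy traffic selectors. */
static const struct policy *spd_lookup(const struct policy *spd, size_t n,
				       uint32_t saddr, uint32_t daddr)
{
	for (size_t i = 0; i < n; i++)
		if (spd[i].sel.saddr == saddr && spd[i].sel.daddr == daddr)
			return &spd[i];
	return NULL;	/* no policy: no SA can be selected for TX */
}

/* Stage 2: use the template found on the policy to look up the SA. */
static const struct sa *sad_lookup(const struct sa *sad, size_t n,
				   const struct tmpl *t)
{
	for (size_t i = 0; i < n; i++)
		if (sad[i].spi == t->spi)
			return &sad[i];
	return NULL;
}
```

If stage 1 finds no policy, stage 2 never runs, which is why a missing (or not-offloaded) policy cannot simply be skipped on the TX path.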

> The full offload follows the same semantics. A missing offloaded policy
> is equal to no policy at all.

No policy at all means that the packets are sent out unencrypted in
plaintext, and this is certainly not what you want.

> I don't think that extra checks are needed.

We need these checks. This is one of the reasons why I want to separate
the SW and HW databases.


* Re: [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies
  2022-09-26  6:38     ` Leon Romanovsky
@ 2022-09-27  5:48       ` Steffen Klassert
  2022-09-27 10:21         ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-27  5:48 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Mon, Sep 26, 2022 at 09:38:10AM +0300, Leon Romanovsky wrote:
> On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> > On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Devices that implement IPsec full offload mode offload policies too.
> > > In the RX path, this leads to a situation where HW can't effectively
> > > handle mixed SW and HW priorities unless users make sure that HW
> > > offloaded policies have higher priorities.
> > > 
> > > In order to make sure that users have a coherent picture, let's require
> > > that HW offloaded policies always have higher priorities (both RX and
> > > TX) than SW ones.
> > > 
> > > To avoid over-engineering the code, HW policies are treated as SW ones
> > > and don't take the netdev into account, to allow reuse of the same
> > > priorities for different devices.
> > 
> > I think we should split the HW and SW SPD (and maybe even SAD) and
> > prioritize over the SPDs instead of doing that in one single SPD. Each
> > NIC should maintain its own databases and we should do the lookups
> > from SW with a callback.
> 
> I don't understand how it will work and scale.

That is rather easy. HW offload devices register their databases
at the xfrm layer with a certain priority higher than that
of the SW databases. The lookup will happen consecutively based
on the database priorities. If no HW databases are
registered, everything is like it was before. It gives us a clear
separation between HW and SW.

This has the advantage that you don't need to mess with policy
priorities. Users can keep the priorities as they were before. This
is particularly important because usually the IKE daemon chooses
the priorities based on some heuristics.

The HW offload also has the advantage that we don't need to
search through all the SW policies and states in that case.
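The registration scheme proposed here can be sketched in a much-simplified illustrative form; the names (spd_ops, spd_register) and the fixed-size array are assumptions for the sketch, not existing kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

struct spd_ops {
	int prio;				/* lower value is tried first */
	const void *(*lookup)(const void *key);	/* per-database lookup */
};

#define MAX_SPDS 8
static struct spd_ops *spds[MAX_SPDS];
static size_t nr_spds;

/* Registration keeps the array sorted by priority (insertion sort step). */
static void spd_register(struct spd_ops *ops)
{
	size_t i = nr_spds++;

	while (i > 0 && spds[i - 1]->prio > ops->prio) {
		spds[i] = spds[i - 1];
		i--;
	}
	spds[i] = ops;
}

/* Consecutive lookup: first match wins, so HW DBs shadow the SW DB. */
static const void *spd_lookup(const void *key)
{
	for (size_t i = 0; i < nr_spds; i++) {
		const void *hit = spds[i]->lookup(key);

		if (hit)
			return hit;
	}
	return NULL;
}
```

With no HW databases registered, the loop degenerates to the single SW lookup, matching the "everything is like it was before" property claimed above.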

> Every packet needs to be classified as to whether it is on the offloaded
> path or not. To ensure that, we will need to have two identifiers: the
> targeted device (part of the skb, so ok) and the relevant SP/SA.
> 
> The latter is needed to make sure that we perform the lookup only on
> offloaded SP/SAs. It means that we will do the lookup twice now: first
> to see if the SP/SA is offloaded, and second to see if it is in HW.

I think you did not get my point, see above.

> HW lookup is also a misleading name, as the lookup will be in driver code
> in a very similar way to how the SADB is managed for crypto mode.

No, HW lookup means we do the lookup in the hardware and return
the result to software. The whole point of a 'full offload' is
to get rid of the costly SW database lookups.

> It makes no
> sense to convert data from the XFRM representation to the HW format,
> execute in HW and convert the returned result back. It will also be slow
> because SP/SA lookup based on XFRM properties is not a datapath operation.

In case the HW can't do the lookup itself or is considered to be slower
than in software, a separate database for HW offload devices can be
maintained.

> In any case, you will have a double lookup. You will need to query the
> XFRM core DBs and later call into the driver DB, or vice versa.
> 
> Unless you want to perform the HW lookup always, without relation to the
> SP/SA state, and hurt performance for the non-offloaded path.
> 
> > With the current approach, we still do the costly full
> > policy and state lookup on the TX side in software. On a 'full offload'
> > that should happen in HW too.
> 
> In the proposed approach, you have only one lookup, which is better than two.

In your approach, you still do the lookups in SW. This is not a problem
if you have just a handful of policies and SAs, but it is problematic when
you have 100K policies and SAs. Even if the current HW cannot support
this, we need to make sure our design allows for it.

> I'm not even talking about "quality" of driver lookups implementations.

You don't need to reimplement the lookups, you just have to create an
additional database.

> 
> > Also, that will make things easier with tunnel mode, where we can have
> > overlapping traffic selectors.
> 
> Can we please put tunnel mode aside? It is a long journey.

No, we can't. A good design should include transport and tunnel mode.
I don't want to change everything later just because we notice then that
it does not work at all for tunnel mode.



* Re: [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits
  2022-09-26  6:07     ` Leon Romanovsky
@ 2022-09-27  5:49       ` Steffen Klassert
  0 siblings, 0 replies; 30+ messages in thread
From: Steffen Klassert @ 2022-09-27  5:49 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Mon, Sep 26, 2022 at 09:07:31AM +0300, Leon Romanovsky wrote:
> On Sun, Sep 25, 2022 at 11:20:06AM +0200, Steffen Klassert wrote:
> > On Sun, Sep 04, 2022 at 04:15:41PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > In both RX and TX, the traffic that undergoes IPsec full offload
> > > transformation is accounted by the HW. This is needed to properly
> > > handle hard limits that require dropping the packet.
> > > 
> > > It means that the XFRM core needs to update its internal counters with
> > > the ones accounted by the HW, so new callbacks are introduced in this
> > > patch.
> > > 
> > > In case a soft or hard limit is hit, the driver should call
> > > xfrm_state_check_expire(), which will perform rekeying exactly as
> > > done by the XFRM core.
> > > 
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > 
> > This looks good, thanks!
> > 
> > We need this for the other relevant counters too.
> 
> It is in my backlog.

Great, thanks!
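The lifetime flow discussed in this patch can be modeled with a toy sketch: the driver folds HW-accounted counters into the state, and a check mirroring the role of xfrm_state_check_expire() fires the soft event (rekey request) once and the hard event (state expired, drop traffic) when the limit is crossed. The names, fields, and thresholds below are illustrative assumptions, not the real xfrm structures:

```c
#include <assert.h>
#include <stdint.h>

enum lifetime_event { LT_OK, LT_SOFT, LT_HARD };

struct sa_lifetime {
	uint64_t bytes;			/* SW counter, fed from HW */
	uint64_t soft_byte_limit;
	uint64_t hard_byte_limit;
	int soft_signalled;		/* rekey requested once, at soft limit */
};

/* Driver callback: fold the HW-accounted bytes into the SW counter. */
static void sa_update_stats(struct sa_lifetime *lt, uint64_t hw_bytes)
{
	lt->bytes += hw_bytes;
}

/* Toy equivalent of the xfrm_state_check_expire() role in this flow. */
static enum lifetime_event sa_check_expire(struct sa_lifetime *lt)
{
	if (lt->bytes >= lt->hard_byte_limit)
		return LT_HARD;		/* state dead: packets must be dropped */
	if (lt->bytes >= lt->soft_byte_limit && !lt->soft_signalled) {
		lt->soft_signalled = 1;
		return LT_SOFT;		/* ask the IKE daemon to rekey */
	}
	return LT_OK;
}
```

The one-shot soft flag reflects that rekeying should be triggered once per soft-limit crossing rather than on every counter update.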


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-26  6:55         ` Leon Romanovsky
@ 2022-09-27  5:59           ` Steffen Klassert
  2022-09-27 10:02             ` Leon Romanovsky
  0 siblings, 1 reply; 30+ messages in thread
From: Steffen Klassert @ 2022-09-27  5:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Mon, Sep 26, 2022 at 09:55:45AM +0300, Leon Romanovsky wrote:
> On Sun, Sep 25, 2022 at 11:40:39AM +0200, Steffen Klassert wrote:
> > On Wed, Sep 21, 2022 at 08:37:06PM +0300, Leon Romanovsky wrote:
> > > On Wed, Sep 21, 2022 at 07:59:27AM -0700, Jakub Kicinski wrote:
> > > > On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> > > > > I have TX traces too and can add if RX are not sufficient. 
> > > > 
> > > > The perf trace is good, but for those of us not intimately familiar
> > > > with xfrm, could you provide some analysis here?
> > > 
> > > The perf trace presented is for the RX path of IPsec crypto offload
> > > mode. In that mode, the decrypted packet enters the netdev stack to
> > > perform various XFRM-specific checks.
> > 
> > Can you provide the perf traces and analysis for the TX side too? That
> > would be interesting in particular, because the policy and state lookups
> > there still happen in software.
> 
> Single-core TX (crypto mode) from the same run:
> Please note that it is not really a bottleneck; RX load probably caused
> a situation where TX was not exercised enough. It is also a lighter path
> than RX.

Thanks for this! How many policies and SAs were installed when you ran
this? A run with 'many' policies and SAs would be interesting, in
particular a comparison between crypto and full offload. That would
show us where the performance of the full offload comes from.


* Re: [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration
  2022-09-27  5:59           ` Steffen Klassert
@ 2022-09-27 10:02             ` Leon Romanovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-27 10:02 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Herbert Xu,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Tue, Sep 27, 2022 at 07:59:03AM +0200, Steffen Klassert wrote:
> On Mon, Sep 26, 2022 at 09:55:45AM +0300, Leon Romanovsky wrote:
> > On Sun, Sep 25, 2022 at 11:40:39AM +0200, Steffen Klassert wrote:
> > > On Wed, Sep 21, 2022 at 08:37:06PM +0300, Leon Romanovsky wrote:
> > > > On Wed, Sep 21, 2022 at 07:59:27AM -0700, Jakub Kicinski wrote:
> > > > > On Thu, 8 Sep 2022 12:56:16 +0300 Leon Romanovsky wrote:
> > > > > > I have TX traces too and can add if RX are not sufficient. 
> > > > > 
> > > > > The perf trace is good, but for those of us not intimately familiar
> > > > > with xfrm, could you provide some analysis here?
> > > > 
> > > > The perf trace presented is for the RX path of IPsec crypto offload
> > > > mode. In that mode, the decrypted packet enters the netdev stack to
> > > > perform various XFRM-specific checks.
> > > 
> > > Can you provide the perf traces and analysis for the TX side too? That
> > > would be interesting in particular, because the policy and state lookups
> > > there still happen in software.
> > 
> > Single-core TX (crypto mode) from the same run:
> > Please note that it is not really a bottleneck; RX load probably caused
> > a situation where TX was not exercised enough. It is also a lighter path
> > than RX.
> 
> Thanks for this! How many policies and SAs were installed when you ran
> this? A run with 'many' policies and SAs would be interesting, in
> particular a comparison between crypto and full offload. That would
> show us where the performance of the full offload comes from.

It was a 160-CPU machine with one policy/SA per CPU and direction. In
total, 320 policies and 320 SAs.

Thanks


* Re: [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies
  2022-09-27  5:48       ` Steffen Klassert
@ 2022-09-27 10:21         ` Leon Romanovsky
  0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2022-09-27 10:21 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S. Miller, Eric Dumazet, Herbert Xu, Jakub Kicinski,
	netdev, Paolo Abeni, Raed Salem, Saeed Mahameed, Bharat Bhushan

On Tue, Sep 27, 2022 at 07:48:38AM +0200, Steffen Klassert wrote:
> On Mon, Sep 26, 2022 at 09:38:10AM +0300, Leon Romanovsky wrote:
> > On Sun, Sep 25, 2022 at 11:34:54AM +0200, Steffen Klassert wrote:
> > > On Sun, Sep 04, 2022 at 04:15:40PM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Devices that implement IPsec full offload mode offload policies too.
> > > > In the RX path, this leads to a situation where HW can't effectively
> > > > handle mixed SW and HW priorities unless users make sure that HW
> > > > offloaded policies have higher priorities.
> > > > 
> > > > In order to make sure that users have a coherent picture, let's require
> > > > that HW offloaded policies always have higher priorities (both RX and
> > > > TX) than SW ones.
> > > > 
> > > > To avoid over-engineering the code, HW policies are treated as SW ones
> > > > and don't take the netdev into account, to allow reuse of the same
> > > > priorities for different devices.
> > > 
> > > I think we should split the HW and SW SPD (and maybe even SAD) and
> > > prioritize over the SPDs instead of doing that in one single SPD. Each
> > > NIC should maintain its own databases and we should do the lookups
> > > from SW with a callback.
> > 
> > I don't understand how it will work and scale.
> 
> That is rather easy. HW offload devices register their databases
> at the xfrm layer with a certain priority higher than that
> of the SW databases. The lookup will happen consecutively based
> on the database priorities. If no HW databases are
> registered, everything is like it was before. It gives us a clear
> separation between HW and SW.
> 
> This has the advantage that you don't need to mess with policy
> priorities. Users can keep the priorities as they were before. This
> is particularly important because usually the IKE daemon chooses
> the priorities based on some heuristics.
> 
> The HW offload also has the advantage that we don't need to
> search through all the SW policies and states in that case.

And a disadvantage for SW policies, because once you register an HW DB,
you will first look up there, find nothing, and fall back to perform
another lookup in SW.

<...>

> > It makes no
> > sense to convert data from the XFRM representation to the HW format,
> > execute in HW and convert the returned result back. It will also be slow
> > because SP/SA lookup based on XFRM properties is not a datapath operation.
> 
> In case the HW can't do the lookup itself or is considered to be slower
> than in software, a separate database for HW offload devices can be
> maintained.

ok, this is my case.
I'll try to see what I can do here.

Thanks


end of thread, other threads:[~2022-09-27 10:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-04 13:15 [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 1/8] xfrm: add new full offload flag Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 2/8] xfrm: allow state full offload mode Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 3/8] xfrm: add an interface to offload policy Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 4/8] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
2022-09-25  9:16   ` Steffen Klassert
2022-09-26  6:06     ` Leon Romanovsky
2022-09-27  5:04       ` Steffen Klassert
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 5/8] xfrm: add RX datapath protection " Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 6/8] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
2022-09-25  9:34   ` Steffen Klassert
2022-09-26  6:38     ` Leon Romanovsky
2022-09-27  5:48       ` Steffen Klassert
2022-09-27 10:21         ` Leon Romanovsky
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 7/8] xfrm: add support to HW update soft and hard limits Leon Romanovsky
2022-09-25  9:20   ` Steffen Klassert
2022-09-26  6:07     ` Leon Romanovsky
2022-09-27  5:49       ` Steffen Klassert
2022-09-04 13:15 ` [PATCH RFC xfrm-next v3 8/8] xfrm: document IPsec full offload mode Leon Romanovsky
2022-09-04 13:19 ` [PATCH RFC xfrm-next v3 0/8] Extend XFRM core to allow full offload configuration Leon Romanovsky
2022-09-08  9:56 ` Leon Romanovsky
2022-09-21 14:59   ` Jakub Kicinski
2022-09-21 17:37     ` Leon Romanovsky
2022-09-25  9:40       ` Steffen Klassert
2022-09-26  6:55         ` Leon Romanovsky
2022-09-27  5:59           ` Steffen Klassert
2022-09-27 10:02             ` Leon Romanovsky
2022-09-19  9:31 ` Leon Romanovsky
2022-09-22  7:17   ` Leon Romanovsky
2022-09-22  7:35     ` Steffen Klassert
