* [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
@ 2022-08-16  8:59 Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag Leon Romanovsky
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Changelog:
v2:
 * Rebased to latest 6.0-rc1
 * Added an extra check in the TX datapath patch to validate packets
   before forwarding them to the HW.
 * Added policy cleanup logic for the netdev down event.
v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
 * Moved comment to be before if (...) in third patch.
v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
-----------------------------------------------------------------------

The following series extends the XFRM core code to handle a new type of
IPsec offload: full offload.

In this mode, the HW is responsible for the whole data path, so both the
policy and the state must be offloaded.

Thanks

Leon Romanovsky (6):
  xfrm: add new full offload flag
  xfrm: allow state full offload mode
  xfrm: add an interface to offload policy
  xfrm: add TX datapath support for IPsec full offload mode
  xfrm: add RX datapath protection for IPsec full offload mode
  xfrm: enforce separation between priorities of HW/SW policies

 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |   4 +
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |   5 +
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |   5 +
 .../mellanox/mlx5/core/en_accel/ipsec.c       |   4 +
 drivers/net/netdevsim/ipsec.c                 |   5 +
 include/linux/netdevice.h                     |   3 +
 include/net/netns/xfrm.h                      |   8 +-
 include/net/xfrm.h                            | 104 +++++++---
 include/uapi/linux/xfrm.h                     |   6 +
 net/xfrm/xfrm_device.c                        | 101 +++++++++-
 net/xfrm/xfrm_output.c                        |  20 ++
 net/xfrm/xfrm_policy.c                        | 180 ++++++++++++++++++
 net/xfrm/xfrm_user.c                          |  19 ++
 13 files changed, 434 insertions(+), 30 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode Leon Romanovsky
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

In the next patches, the xfrm core code will be extended to support a
new type of offload: full offload. In that mode, both the policy and the
state must be specially configured so that the whole data path is
offloaded.

Full offload takes care of encryption, decryption, encapsulation and
other operations on the headers.

As this mode is new for the XFRM policy flow, we can "start fresh" with
the flag bits and release the first and second bits for future use.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/xfrm.h        | 7 +++++++
 include/uapi/linux/xfrm.h | 6 ++++++
 net/xfrm/xfrm_device.c    | 3 +++
 net/xfrm/xfrm_user.c      | 2 ++
 4 files changed, 18 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 6e8fa98f786f..b4d487053dfd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -131,12 +131,19 @@ enum {
 	XFRM_DEV_OFFLOAD_OUT,
 };
 
+enum {
+	XFRM_DEV_OFFLOAD_UNSPECIFIED,
+	XFRM_DEV_OFFLOAD_CRYPTO,
+	XFRM_DEV_OFFLOAD_FULL,
+};
+
 struct xfrm_dev_offload {
 	struct net_device	*dev;
 	netdevice_tracker	dev_tracker;
 	struct net_device	*real_dev;
 	unsigned long		offload_handle;
 	u8			dir : 2;
+	u8			type : 2;
 };
 
 struct xfrm_mode {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index b1f3e6a8f11a..3b084246d610 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -519,6 +519,12 @@ struct xfrm_user_offload {
  */
 #define XFRM_OFFLOAD_IPV6	1
 #define XFRM_OFFLOAD_INBOUND	2
+/* The two bits above are relevant only for the state path, while
+ * the offload interface is used for both policy and state flows.
+ *
+ * In policy offload mode, those bits are free and can be safely reused.
+ */
+#define XFRM_OFFLOAD_FULL	4
 
 struct xfrm_userpolicy_default {
 #define XFRM_USERPOLICY_UNSPEC	0
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 637ca8838436..6d1124eb1ec8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -270,12 +270,15 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	else
 		xso->dir = XFRM_DEV_OFFLOAD_OUT;
 
+	xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
+
 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
 		xso->dev = NULL;
 		xso->dir = 0;
 		xso->real_dev = NULL;
 		netdev_put(dev, &xso->dev_tracker);
+		xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
 
 		if (err != -EOPNOTSUPP)
 			return err;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 2ff017117730..9c0aef815730 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -854,6 +854,8 @@ static int copy_user_offload(struct xfrm_dev_offload *xso, struct sk_buff *skb)
 	xuo->ifindex = xso->dev->ifindex;
 	if (xso->dir == XFRM_DEV_OFFLOAD_IN)
 		xuo->flags = XFRM_OFFLOAD_INBOUND;
+	if (xso->type == XFRM_DEV_OFFLOAD_FULL)
+		xuo->flags |= XFRM_OFFLOAD_FULL;
 
 	return 0;
 }
-- 
2.37.2



* [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-18 10:12   ` Steffen Klassert
  2022-08-16  8:59 ` [PATCH xfrm-next v2 3/6] xfrm: add an interface to offload policy Leon Romanovsky
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Allow users to configure xfrm states in full offload mode.
Full offload mode must be requested for both the policy and the state,
which requires us not to implement a fallback to SW.

We explicitly return an error if the requested full offload mode can't
be configured.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 .../inline_crypto/ch_ipsec/chcr_ipsec.c       |  4 ++++
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c    |  5 ++++
 drivers/net/ethernet/intel/ixgbevf/ipsec.c    |  5 ++++
 .../mellanox/mlx5/core/en_accel/ipsec.c       |  4 ++++
 drivers/net/netdevsim/ipsec.c                 |  5 ++++
 net/xfrm/xfrm_device.c                        | 24 +++++++++++++++----
 6 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c b/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
index 585590520076..ca21794281d6 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/ch_ipsec/chcr_ipsec.c
@@ -283,6 +283,10 @@ static int ch_ipsec_xfrm_add_state(struct xfrm_state *x)
 		pr_debug("Cannot offload xfrm states with geniv other than seqiv\n");
 		return -EINVAL;
 	}
+	if (x->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		pr_debug("Unsupported xfrm offload\n");
+		return -EINVAL;
+	}
 
 	sa_entry = kzalloc(sizeof(*sa_entry), GFP_KERNEL);
 	if (!sa_entry) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 774de63dd93a..53a969e34883 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -585,6 +585,11 @@ static int ixgbe_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	if (xs->xso.dir == XFRM_DEV_OFFLOAD_IN) {
 		struct rx_sa rsa;
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ipsec.c b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
index 9984ebc62d78..c1cf540d162a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
@@ -280,6 +280,11 @@ static int ixgbevf_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	if (xs->xso.dir == XFRM_DEV_OFFLOAD_IN) {
 		struct rx_sa rsa;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
index 2a8fd7020622..c182b640b80d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
@@ -256,6 +256,10 @@ static inline int mlx5e_xfrm_validate_state(struct xfrm_state *x)
 		netdev_info(netdev, "Cannot offload xfrm states with geniv other than seqiv\n");
 		return -EINVAL;
 	}
+	if (x->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_info(netdev, "Unsupported xfrm offload type\n");
+		return -EINVAL;
+	}
 	return 0;
 }
 
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
index 386336a38f34..b93baf5c8bee 100644
--- a/drivers/net/netdevsim/ipsec.c
+++ b/drivers/net/netdevsim/ipsec.c
@@ -149,6 +149,11 @@ static int nsim_ipsec_add_sa(struct xfrm_state *xs)
 		return -EINVAL;
 	}
 
+	if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {
+		netdev_err(dev, "Unsupported ipsec offload type\n");
+		return -EINVAL;
+	}
+
 	/* find the first unused index */
 	ret = nsim_ipsec_find_empty_idx(ipsec);
 	if (ret < 0) {
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 6d1124eb1ec8..5b04e5cdca64 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -215,6 +215,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	struct xfrm_dev_offload *xso = &x->xso;
 	xfrm_address_t *saddr;
 	xfrm_address_t *daddr;
+	bool is_full_offload;
 
 	if (!x->type_offload)
 		return -EINVAL;
@@ -223,9 +224,11 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	if (x->encap || x->tfcpad)
 		return -EINVAL;
 
-	if (xuo->flags & ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND))
+	if (xuo->flags &
+	    ~(XFRM_OFFLOAD_IPV6 | XFRM_OFFLOAD_INBOUND | XFRM_OFFLOAD_FULL))
 		return -EINVAL;
 
+	is_full_offload = xuo->flags & XFRM_OFFLOAD_FULL;
 	dev = dev_get_by_index(net, xuo->ifindex);
 	if (!dev) {
 		if (!(xuo->flags & XFRM_OFFLOAD_INBOUND)) {
@@ -240,7 +243,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 					x->props.family,
 					xfrm_smark_get(0, x));
 		if (IS_ERR(dst))
-			return 0;
+			return (is_full_offload) ? -EINVAL : 0;
 
 		dev = dst->dev;
 
@@ -251,7 +254,7 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	if (!dev->xfrmdev_ops || !dev->xfrmdev_ops->xdo_dev_state_add) {
 		xso->dev = NULL;
 		dev_put(dev);
-		return 0;
+		return (is_full_offload) ? -EINVAL : 0;
 	}
 
 	if (x->props.flags & XFRM_STATE_ESN &&
@@ -270,7 +273,10 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 	else
 		xso->dir = XFRM_DEV_OFFLOAD_OUT;
 
-	xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
+	if (is_full_offload)
+		xso->type = XFRM_DEV_OFFLOAD_FULL;
+	else
+		xso->type = XFRM_DEV_OFFLOAD_CRYPTO;
 
 	err = dev->xfrmdev_ops->xdo_dev_state_add(x);
 	if (err) {
@@ -280,7 +286,15 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 		netdev_put(dev, &xso->dev_tracker);
 		xso->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
 
-		if (err != -EOPNOTSUPP)
+		/* The user explicitly requested full offload mode and
+		 * configured a policy in addition to the XFRM state. Be civil
+		 * and return an error instead of taking the fallback path.
+		 *
+		 * This WARN_ON() serves as documentation for driver authors
+		 * not to return -EOPNOTSUPP in full offload mode.
+		 */
+		WARN_ON(err == -EOPNOTSUPP && is_full_offload);
+		if (err != -EOPNOTSUPP || is_full_offload)
 			return err;
 	}
 
-- 
2.37.2



* [PATCH xfrm-next v2 3/6] xfrm: add an interface to offload policy
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-16  8:59 ` [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Extend the netlink interface to add and delete XFRM policies from the
device. This functionality is a first step towards a full IPsec offload
solution.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/netdevice.h |  3 ++
 include/net/xfrm.h        | 42 ++++++++++++++++++++++++
 net/xfrm/xfrm_device.c    | 61 ++++++++++++++++++++++++++++++++++-
 net/xfrm/xfrm_policy.c    | 67 +++++++++++++++++++++++++++++++++++++++
 net/xfrm/xfrm_user.c      | 17 ++++++++++
 5 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a3cb93c3dcc..401c52aeab0e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1012,6 +1012,9 @@ struct xfrmdev_ops {
 	bool	(*xdo_dev_offload_ok) (struct sk_buff *skb,
 				       struct xfrm_state *x);
 	void	(*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+	int	(*xdo_dev_policy_add) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_delete) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_free) (struct xfrm_policy *x);
 };
 #endif
 
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index b4d487053dfd..587697eb1d31 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -129,6 +129,7 @@ struct xfrm_state_walk {
 enum {
 	XFRM_DEV_OFFLOAD_IN = 1,
 	XFRM_DEV_OFFLOAD_OUT,
+	XFRM_DEV_OFFLOAD_FWD,
 };
 
 enum {
@@ -534,6 +535,8 @@ struct xfrm_policy {
 	struct xfrm_tmpl       	xfrm_vec[XFRM_MAX_DEPTH];
 	struct hlist_node	bydst_inexact_list;
 	struct rcu_head		rcu;
+
+	struct xfrm_dev_offload xdo;
 };
 
 static inline struct net *xp_net(const struct xfrm_policy *xp)
@@ -1577,6 +1580,7 @@ struct xfrm_state *xfrm_find_acq_byseq(struct net *net, u32 mark, u32 seq);
 int xfrm_state_delete(struct xfrm_state *x);
 int xfrm_state_flush(struct net *net, u8 proto, bool task_valid, bool sync);
 int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool task_valid);
+int xfrm_dev_policy_flush(struct net *net, struct net_device *dev, bool task_valid);
 void xfrm_sad_getinfo(struct net *net, struct xfrmk_sadinfo *si);
 void xfrm_spd_getinfo(struct net *net, struct xfrmk_spdinfo *si);
 u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq);
@@ -1887,6 +1891,8 @@ void xfrm_dev_backlog(struct softnet_data *sd);
 struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again);
 int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 		       struct xfrm_user_offload *xuo);
+int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+		       struct xfrm_user_offload *xuo, u8 dir);
 bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x);
 
 static inline void xfrm_dev_state_advance_esn(struct xfrm_state *x)
@@ -1935,6 +1941,28 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x)
 		netdev_put(dev, &xso->dev_tracker);
 	}
 }
+
+static inline void xfrm_dev_policy_delete(struct xfrm_policy *x)
+{
+	struct xfrm_dev_offload *xdo = &x->xdo;
+	struct net_device *dev = xdo->dev;
+
+	if (dev && dev->xfrmdev_ops && dev->xfrmdev_ops->xdo_dev_policy_delete)
+		dev->xfrmdev_ops->xdo_dev_policy_delete(x);
+}
+
+static inline void xfrm_dev_policy_free(struct xfrm_policy *x)
+{
+	struct xfrm_dev_offload *xdo = &x->xdo;
+	struct net_device *dev = xdo->dev;
+
+	if (dev && dev->xfrmdev_ops) {
+		if (dev->xfrmdev_ops->xdo_dev_policy_free)
+			dev->xfrmdev_ops->xdo_dev_policy_free(x);
+		xdo->dev = NULL;
+		netdev_put(dev, &xdo->dev_tracker);
+	}
+}
 #else
 static inline void xfrm_dev_resume(struct sk_buff *skb)
 {
@@ -1962,6 +1990,20 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x)
 {
 }
 
+static inline int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+				      struct xfrm_user_offload *xuo, u8 dir)
+{
+	return 0;
+}
+
+static inline void xfrm_dev_policy_delete(struct xfrm_policy *x)
+{
+}
+
+static inline void xfrm_dev_policy_free(struct xfrm_policy *x)
+{
+}
+
 static inline bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 {
 	return false;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 5b04e5cdca64..1cc482e9c87d 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -302,6 +302,63 @@ int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
 }
 EXPORT_SYMBOL_GPL(xfrm_dev_state_add);
 
+int xfrm_dev_policy_add(struct net *net, struct xfrm_policy *xp,
+			struct xfrm_user_offload *xuo, u8 dir)
+{
+	struct xfrm_dev_offload *xdo = &xp->xdo;
+	struct net_device *dev;
+	int err;
+
+	if (!xuo->flags || xuo->flags & ~XFRM_OFFLOAD_FULL)
+		/* We support only full offload mode, which means
+		 * that the user must set the XFRM_OFFLOAD_FULL bit.
+		 */
+		return -EINVAL;
+
+	dev = dev_get_by_index(net, xuo->ifindex);
+	if (!dev)
+		return -EINVAL;
+
+	if (!dev->xfrmdev_ops || !dev->xfrmdev_ops->xdo_dev_policy_add) {
+		xdo->dev = NULL;
+		dev_put(dev);
+		return -EINVAL;
+	}
+
+	xdo->dev = dev;
+	netdev_tracker_alloc(dev, &xdo->dev_tracker, GFP_ATOMIC);
+	xdo->real_dev = dev;
+	xdo->type = XFRM_DEV_OFFLOAD_FULL;
+	switch (dir) {
+	case XFRM_POLICY_IN:
+		xdo->dir = XFRM_DEV_OFFLOAD_IN;
+		break;
+	case XFRM_POLICY_OUT:
+		xdo->dir = XFRM_DEV_OFFLOAD_OUT;
+		break;
+	case XFRM_POLICY_FWD:
+		xdo->dir = XFRM_DEV_OFFLOAD_FWD;
+		break;
+	default:
+		xdo->dev = NULL;
+		dev_put(dev);
+		return -EINVAL;
+	}
+
+	err = dev->xfrmdev_ops->xdo_dev_policy_add(xp);
+	if (err) {
+		xdo->dev = NULL;
+		xdo->real_dev = NULL;
+		xdo->type = XFRM_DEV_OFFLOAD_UNSPECIFIED;
+		xdo->dir = 0;
+		netdev_put(dev, &xdo->dev_tracker);
+		return err;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xfrm_dev_policy_add);
+
 bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 {
 	int mtu;
@@ -404,8 +461,10 @@ static int xfrm_api_check(struct net_device *dev)
 
 static int xfrm_dev_down(struct net_device *dev)
 {
-	if (dev->features & NETIF_F_HW_ESP)
+	if (dev->features & NETIF_F_HW_ESP) {
 		xfrm_dev_state_flush(dev_net(dev), dev, true);
+		xfrm_dev_policy_flush(dev_net(dev), dev, true);
+	}
 
 	return NOTIFY_DONE;
 }
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index f1a0bab920a5..3049fdcf8411 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -425,6 +425,7 @@ void xfrm_policy_destroy(struct xfrm_policy *policy)
 	if (del_timer(&policy->timer) || del_timer(&policy->polq.hold_timer))
 		BUG();
 
+	xfrm_dev_policy_free(policy);
 	call_rcu(&policy->rcu, xfrm_policy_destroy_rcu);
 }
 EXPORT_SYMBOL(xfrm_policy_destroy);
@@ -1769,12 +1770,40 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 	}
 	return err;
 }
+
+static inline int xfrm_dev_policy_flush_secctx_check(struct net *net,
+						     struct net_device *dev,
+						     bool task_valid)
+{
+	struct xfrm_policy *pol;
+	int err = 0;
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead ||
+		    xfrm_policy_id2dir(pol->index) >= XFRM_POLICY_MAX ||
+		    pol->xdo.dev != dev)
+			continue;
+
+		err = security_xfrm_policy_delete(pol->security);
+		if (err) {
+			xfrm_audit_policy_delete(pol, 0, task_valid);
+			return err;
+		}
+	}
+	return err;
+}
 #else
 static inline int
 xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 {
 	return 0;
 }
+static inline int xfrm_dev_policy_flush_secctx_check(struct net *net,
+						     struct net_device *dev,
+						     bool task_valid)
+{
+	return 0;
+}
 #endif
 
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
@@ -1814,6 +1843,43 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 }
 EXPORT_SYMBOL(xfrm_policy_flush);
 
+int xfrm_dev_policy_flush(struct net *net, struct net_device *dev, bool task_valid)
+{
+	int dir, err = 0, cnt = 0;
+	struct xfrm_policy *pol;
+
+	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+
+	err = xfrm_dev_policy_flush_secctx_check(net, dev, task_valid);
+	if (err)
+		goto out;
+
+again:
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		dir = xfrm_policy_id2dir(pol->index);
+		if (pol->walk.dead ||
+		    dir >= XFRM_POLICY_MAX ||
+		    pol->xdo.dev != dev)
+			continue;
+
+		__xfrm_policy_unlink(pol, dir);
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		cnt++;
+		xfrm_audit_policy_delete(pol, 1, task_valid);
+		xfrm_policy_kill(pol);
+		spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+		goto again;
+	}
+	if (cnt)
+		__xfrm_policy_inexact_flush(net);
+	else
+		err = -ESRCH;
+out:
+	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+	return err;
+}
+EXPORT_SYMBOL(xfrm_dev_policy_flush);
+
 int xfrm_policy_walk(struct net *net, struct xfrm_policy_walk *walk,
 		     int (*func)(struct xfrm_policy *, int, int, void*),
 		     void *data)
@@ -2246,6 +2312,7 @@ int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 	pol = __xfrm_policy_unlink(pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
+		xfrm_dev_policy_delete(pol);
 		xfrm_policy_kill(pol);
 		return 0;
 	}
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9c0aef815730..698ff84da6ba 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1747,6 +1747,14 @@ static struct xfrm_policy *xfrm_policy_construct(struct net *net, struct xfrm_us
 	if (attrs[XFRMA_IF_ID])
 		xp->if_id = nla_get_u32(attrs[XFRMA_IF_ID]);
 
+	/* configure the hardware if offload is requested */
+	if (attrs[XFRMA_OFFLOAD_DEV]) {
+		err = xfrm_dev_policy_add(net, xp,
+				nla_data(attrs[XFRMA_OFFLOAD_DEV]), p->dir);
+		if (err)
+			goto error;
+	}
+
 	return xp;
  error:
 	*errp = err;
@@ -1785,6 +1793,7 @@ static int xfrm_add_policy(struct sk_buff *skb, struct nlmsghdr *nlh,
 	xfrm_audit_policy_add(xp, err ? 0 : 1, true);
 
 	if (err) {
+		xfrm_dev_policy_delete(xp);
 		security_xfrm_policy_free(xp->security);
 		kfree(xp);
 		return err;
@@ -1897,6 +1906,8 @@ static int dump_one_policy(struct xfrm_policy *xp, int dir, int count, void *ptr
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3213,6 +3224,8 @@ static int build_acquire(struct sk_buff *skb, struct xfrm_state *x,
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3331,6 +3344,8 @@ static int build_polexpire(struct sk_buff *skb, struct xfrm_policy *xp,
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err) {
 		nlmsg_cancel(skb, nlh);
 		return err;
@@ -3414,6 +3429,8 @@ static int xfrm_notify_policy(struct xfrm_policy *xp, int dir, const struct km_e
 		err = xfrm_mark_put(skb, &xp->mark);
 	if (!err)
 		err = xfrm_if_id_put(skb, xp->if_id);
+	if (!err && xp->xdo.dev)
+		err = copy_user_offload(&xp->xdo, skb);
 	if (err)
 		goto out_free_skb;
 
-- 
2.37.2



* [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (2 preceding siblings ...)
  2022-08-16  8:59 ` [PATCH xfrm-next v2 3/6] xfrm: add an interface to offload policy Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-18 10:24   ` Steffen Klassert
  2022-08-16  8:59 ` [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection " Leon Romanovsky
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

In IPsec full offload mode, the device encrypts and encapsulates packets
that are associated with an offloaded policy. After a successful policy
lookup that indicates the packet should be offloaded, the stack forwards
the packet to the device, which performs the actual transform.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 net/xfrm/xfrm_device.c | 13 +++++++++++++
 net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 1cc482e9c87d..db5ebd36f68c 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -120,6 +120,16 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
 	if (xo->flags & XFRM_GRO || x->xso.dir == XFRM_DEV_OFFLOAD_IN)
 		return skb;
 
+	/* The packet was sent to the HW IPsec full offload engine,
+	 * but to the wrong device. Drop the packet so that it won't
+	 * skip the XFRM stack.
+	 */
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL && x->xso.dev != dev) {
+		kfree_skb(skb);
+		dev_core_stats_tx_dropped_inc(dev);
+		return NULL;
+	}
+
 	/* This skb was already validated on the upper/virtual dev */
 	if ((x->xso.dev != dev) && (x->xso.real_dev == dev))
 		return skb;
@@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
 	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
 	struct net_device *dev = x->xso.dev;
 
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
+		goto ok;
+
 	if (!x->type_offload || x->encap)
 		return false;
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 555ab35cd119..27a8dac9ca7d 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -719,6 +719,26 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
 		break;
 	}
 
+	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL) {
+		struct dst_entry *dst = skb_dst_pop(skb);
+
+		if (!dst || !xfrm_dev_offload_ok(skb, x)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+			kfree_skb(skb);
+			return -EHOSTUNREACH;
+		}
+
+		skb_dst_set(skb, dst);
+		err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
+		if (unlikely(err != 1))
+			return err;
+
+		if (!skb_dst(skb)->xfrm)
+			return dst_output(net, skb->sk, skb);
+
+		return 0;
+	}
+
 	secpath_reset(skb);
 
 	if (xfrm_dev_offload_ok(skb, x)) {
-- 
2.37.2



* [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection for IPsec full offload mode
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (3 preceding siblings ...)
  2022-08-16  8:59 ` [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-18 10:27   ` Steffen Klassert
  2022-08-16  8:59 ` [PATCH xfrm-next v2 6/6] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Traffic received by a device with IPsec full offload enabled should be
forwarded to the stack only after decryption, with the packet headers
and trailers removed.

Such packets are expected to be seen as normal (non-XFRM) ones, while
unsupported packets should be dropped by the HW.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/xfrm.h | 55 +++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 587697eb1d31..b64853df7262 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1094,6 +1094,29 @@ xfrm_state_addr_cmp(const struct xfrm_tmpl *tmpl, const struct xfrm_state *x, un
 	return !0;
 }
 
+#ifdef CONFIG_XFRM
+static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb)
+{
+	struct sec_path *sp = skb_sec_path(skb);
+
+	return sp->xvec[sp->len - 1];
+}
+#endif
+
+static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb)
+{
+#ifdef CONFIG_XFRM
+	struct sec_path *sp = skb_sec_path(skb);
+
+	if (!sp || !sp->olen || sp->len != sp->olen)
+		return NULL;
+
+	return &sp->ovec[sp->olen - 1];
+#else
+	return NULL;
+#endif
+}
+
 #ifdef CONFIG_XFRM
 int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb,
 			unsigned short family);
@@ -1125,6 +1148,15 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
 {
 	struct net *net = dev_net(skb->dev);
 	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
+	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct xfrm_state *x;
+
+	if (xo) {
+		x = xfrm_input_state(skb);
+		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
+			return (xo->flags & CRYPTO_DONE) &&
+			       (xo->status & CRYPTO_SUCCESS);
+	}
 
 	if (sk && sk->sk_policy[XFRM_POLICY_IN])
 		return __xfrm_policy_check(sk, ndir, skb, family);
@@ -1860,29 +1892,6 @@ static inline void xfrm_states_delete(struct xfrm_state **states, int n)
 }
 #endif
 
-#ifdef CONFIG_XFRM
-static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb)
-{
-	struct sec_path *sp = skb_sec_path(skb);
-
-	return sp->xvec[sp->len - 1];
-}
-#endif
-
-static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb)
-{
-#ifdef CONFIG_XFRM
-	struct sec_path *sp = skb_sec_path(skb);
-
-	if (!sp || !sp->olen || sp->len != sp->olen)
-		return NULL;
-
-	return &sp->ovec[sp->olen - 1];
-#else
-	return NULL;
-#endif
-}
-
 void __init xfrm_dev_init(void);
 
 #ifdef CONFIG_XFRM_OFFLOAD
-- 
2.37.2



* [PATCH xfrm-next v2 6/6] xfrm: enforce separation between priorities of HW/SW policies
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (4 preceding siblings ...)
  2022-08-16  8:59 ` [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection " Leon Romanovsky
@ 2022-08-16  8:59 ` Leon Romanovsky
  2022-08-17  2:54 ` [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Jakub Kicinski
  2022-08-18 10:09 ` Steffen Klassert
  7 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-16  8:59 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

From: Leon Romanovsky <leonro@nvidia.com>

Devices that implement IPsec full offload mode offload policies too.
On the RX path, this leads to a situation where the HW can't effectively
handle mixed SW and HW priorities unless users make sure that HW-offloaded
policies have higher priorities.

To give users a coherent picture, require that HW-offloaded policies
always have higher priorities (both RX and TX) than SW ones.

To avoid over-engineering the code, HW policies are counted like SW ones
and the netdev is not taken into account, which allows reuse of the same
priorities for different devices.
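As a sketch of what this check permits (device name, addresses, and the
policy-side `offload` keyword are illustrative; the final iproute2 spelling
depends on the companion userspace series), a SW policy must use a
numerically smaller priority than any HW-offloaded one:

```shell
# SW policy with a numerically small priority value.
ip xfrm policy add src 192.168.1.0/24 dst 192.168.2.0/24 dir out \
    priority 10 \
    tmpl src 192.168.1.1 dst 192.168.2.1 proto esp reqid 1 mode transport

# HW-offloaded policy: its numeric priority must be strictly greater
# than the priority of every SW policy, or the insert fails with EINVAL.
ip xfrm policy add src 10.0.1.0/24 dst 10.0.2.0/24 dir out \
    priority 100 offload dev eth4 \
    tmpl src 10.0.1.1 dst 10.0.2.1 proto esp reqid 7 mode transport
```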

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/net/netns/xfrm.h |   8 ++-
 net/xfrm/xfrm_policy.c   | 113 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index bd7c3be4af5d..f0cfa0faf611 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -29,6 +29,11 @@ struct xfrm_policy_hthresh {
 	u8			rbits6;
 };
 
+struct xfrm_policy_prio {
+	u32 max_sw_prio;
+	u32 min_hw_prio;
+};
+
 struct netns_xfrm {
 	struct list_head	state_all;
 	/*
@@ -52,7 +57,7 @@ struct netns_xfrm {
 	unsigned int		policy_idx_hmask;
 	struct hlist_head	policy_inexact[XFRM_POLICY_MAX];
 	struct xfrm_policy_hash	policy_bydst[XFRM_POLICY_MAX];
-	unsigned int		policy_count[XFRM_POLICY_MAX * 2];
+	unsigned int		policy_count[XFRM_POLICY_MAX * 3];
 	struct work_struct	policy_hash_work;
 	struct xfrm_policy_hthresh policy_hthresh;
 	struct list_head	inexact_bins;
@@ -67,6 +72,7 @@ struct netns_xfrm {
 	u32			sysctl_acq_expires;
 
 	u8			policy_default[XFRM_POLICY_MAX];
+	struct xfrm_policy_prio	policy_prio[XFRM_POLICY_MAX];
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*sysctl_hdr;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3049fdcf8411..a8a7d5fe4cd8 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1570,13 +1570,70 @@ static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
 	return delpol;
 }
 
+static int __xfrm_policy_check_hw_priority(struct net *net,
+					   struct xfrm_policy *policy, int dir)
+{
+	int left, right;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir])
+		/* Adding first policy */
+		return 0;
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		/* SW priority */
+		if (!net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir])
+			/* Special case to allow reuse of the maximum priority
+			 * (U32_MAX) for SW policies when no HW policy exists.
+			 */
+			return 0;
+
+		left = policy->priority;
+		right = net->xfrm.policy_prio[dir].min_hw_prio;
+	} else {
+		/* HW priority */
+		left = net->xfrm.policy_prio[dir].max_sw_prio;
+		right = policy->priority;
+	}
+	if (left >= right)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void __xfrm_policy_update_hw_priority(struct net *net,
+					     struct xfrm_policy *policy,
+					     int dir)
+{
+	u32 *hw_prio, *sw_prio;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+		sw_prio = &net->xfrm.policy_prio[dir].max_sw_prio;
+		*sw_prio = max(*sw_prio, policy->priority);
+		return;
+	}
+
+	hw_prio = &net->xfrm.policy_prio[dir].min_hw_prio;
+	*hw_prio = min(*hw_prio, policy->priority);
+}
+
 int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 {
 	struct net *net = xp_net(policy);
 	struct xfrm_policy *delpol;
 	struct hlist_head *chain;
+	int ret;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+	ret = __xfrm_policy_check_hw_priority(net, policy, dir);
+	if (ret) {
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		return ret;
+	}
+
 	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
 	if (chain)
 		delpol = xfrm_policy_insert_list(chain, policy, excl);
@@ -1606,6 +1663,7 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
 	policy->curlft.use_time = 0;
 	if (!mod_timer(&policy->timer, jiffies + HZ))
 		xfrm_pol_hold(policy);
+	__xfrm_policy_update_hw_priority(net, policy, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 
 	if (delpol)
@@ -2270,6 +2328,8 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int dir)
 
 	list_add(&pol->walk.all, &net->xfrm.policy_all);
 	net->xfrm.policy_count[dir]++;
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]++;
 	xfrm_pol_hold(pol);
 }
 
@@ -2289,6 +2349,8 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
 	}
 
 	list_del_init(&pol->walk.all);
+	if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL)
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]--;
 	net->xfrm.policy_count[dir]--;
 
 	return pol;
@@ -2304,12 +2366,58 @@ static void xfrm_sk_policy_unlink(struct xfrm_policy *pol, int dir)
 	__xfrm_policy_unlink(pol, XFRM_POLICY_MAX + dir);
 }
 
+static void __xfrm_policy_delete_prio(struct net *net,
+				      struct xfrm_policy *policy, int dir)
+{
+	struct xfrm_policy *pol;
+	u32 sw_prio = 0;
+
+	lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+
+	if (!net->xfrm.policy_count[dir]) {
+		net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	if (policy->xdo.type == XFRM_DEV_OFFLOAD_FULL &&
+	    !net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir]) {
+		net->xfrm.policy_prio[dir].min_hw_prio = U32_MAX;
+		return;
+	}
+
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead)
+			continue;
+
+		if (policy->xdo.type != XFRM_DEV_OFFLOAD_FULL) {
+			/* SW priority */
+			if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL) {
+				net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+				return;
+			}
+			sw_prio = pol->priority;
+			continue;
+		}
+		/* HW priority */
+		if (pol->xdo.type != XFRM_DEV_OFFLOAD_FULL)
+			continue;
+
+		net->xfrm.policy_prio[dir].min_hw_prio = pol->priority;
+		return;
+	}
+
+	net->xfrm.policy_prio[dir].max_sw_prio = sw_prio;
+}
+
 int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
 {
 	struct net *net = xp_net(pol);
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 	pol = __xfrm_policy_unlink(pol, dir);
+	if (pol)
+		__xfrm_policy_delete_prio(net, pol, dir);
 	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
 	if (pol) {
 		xfrm_dev_policy_delete(pol);
@@ -4110,6 +4218,7 @@ static int __net_init xfrm_policy_init(struct net *net)
 
 		net->xfrm.policy_count[dir] = 0;
 		net->xfrm.policy_count[XFRM_POLICY_MAX + dir] = 0;
+		net->xfrm.policy_count[2 * XFRM_POLICY_MAX + dir] = 0;
 		INIT_HLIST_HEAD(&net->xfrm.policy_inexact[dir]);
 
 		htab = &net->xfrm.policy_bydst[dir];
@@ -4195,6 +4304,10 @@ static int __net_init xfrm_net_init(struct net *net)
 	net->xfrm.policy_default[XFRM_POLICY_FWD] = XFRM_USERPOLICY_ACCEPT;
 	net->xfrm.policy_default[XFRM_POLICY_OUT] = XFRM_USERPOLICY_ACCEPT;
 
+	net->xfrm.policy_prio[XFRM_POLICY_IN].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_FWD].min_hw_prio = U32_MAX;
+	net->xfrm.policy_prio[XFRM_POLICY_OUT].min_hw_prio = U32_MAX;
+
 	rv = xfrm_statistics_init(net);
 	if (rv < 0)
 		goto out_statistics;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (5 preceding siblings ...)
  2022-08-16  8:59 ` [PATCH xfrm-next v2 6/6] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
@ 2022-08-17  2:54 ` Jakub Kicinski
  2022-08-17  5:22   ` Leon Romanovsky
  2022-08-18 10:09 ` Steffen Klassert
  7 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-17  2:54 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, Leon Romanovsky, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Tue, 16 Aug 2022 11:59:21 +0300 Leon Romanovsky wrote:
> The following series extends XFRM core code to handle new type of IPsec
> offload - full offload.
> 
> In this mode, the HW is going to be responsible for whole data path, so
> both policy and state should be offloaded.

This is making a precedent for full tunnel offload in netdev, right?
Could you indulge us with a more detailed description, motivation,
performance results, where the behavior of offload may differ (if at
all), what visibility users have, how SW and HW work together on the
datapath? Documentation would be great.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-17  2:54 ` [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Jakub Kicinski
@ 2022-08-17  5:22   ` Leon Romanovsky
  2022-08-17 18:10     ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-17  5:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

On Tue, Aug 16, 2022 at 07:54:08PM -0700, Jakub Kicinski wrote:
> On Tue, 16 Aug 2022 11:59:21 +0300 Leon Romanovsky wrote:
> > The following series extends XFRM core code to handle new type of IPsec
> > offload - full offload.
> > 
> > In this mode, the HW is going to be responsible for whole data path, so
> > both policy and state should be offloaded.
> 
> This is making a precedent for full tunnel offload in netdev, right?

Not really. SW IPsec supports two modes: tunnel and transport.

However, the HW and SW stack support offload of transport mode only.
This is the case for the already merged IPsec crypto offload mode and
for this full offload as well.

> Could you indulge us with a more detailed description, motivation,
> performance results, where the behavior of offload may differ (if at
> all), what visibility users have, how SW and HW work together on the
> datapath? Documentation would be great.

IPsec full offload is actually an improved version of the IPsec crypto
mode: in full mode, the HW is responsible for trimming/adding headers in
addition to decryption/encryption. In this mode, the packet arrives at
the stack already decrypted, and vice versa for TX (it exits to the HW
unencrypted).

My main motivation is to perform IPsec on RoCE traffic, and in our
preliminary results we are able to do IPsec full offload at line
rate. The same goes for ETH traffic.
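For concreteness, a minimal full-offload configuration might look like the
sketch below (addresses, SPI, key, and device name are made up; the exact
iproute2 keyword for selecting full-offload mode follows the companion
userspace series and may differ):

```shell
# State: in full offload mode the device does crypto *and* header
# add/remove, so the packet reaches the stack already decrypted.
ip xfrm state add src 192.168.1.1 dst 192.168.2.1 \
    proto esp spi 0x1000 reqid 7 mode transport \
    aead 'rfc4106(gcm(aes))' 0x1234567890abcdef1234567890abcdef12345678 128 \
    offload dev eth4 dir out

# Policy: full offload requires the policy to be offloaded as well,
# tied to the state above via the matching reqid.
ip xfrm policy add src 192.168.1.0/24 dst 192.168.2.0/24 dir out \
    tmpl src 192.168.1.1 dst 192.168.2.1 proto esp reqid 7 mode transport
```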

Regarding behavior differences - they are not expected.

We (Raed and I) tried very hard to make sure that IPsec full offload
behaves exactly like the SW path.

There are some limitations to reduce complexity, but they can be removed
later if the need arises. Right now, none of them are "real" limitations
for the various *swan forks, which we extend as well.

Some of them:
1. A reqid is required for both policy and state; I use the reqid for HW
matching between policy and state.
2. Automatic separation between HW and SW priorities, because the HW sees
the packet first.
3. Only the main template is supported.
4. No fallback to SW if the IPsec HW fails to handle a packet; the HW
should drop such packets.

Visibility:
Users can see the mode through iproute2
https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/
and see statistics through ethtool.

Documentation will come as well. I assume that IPsec folks are familiar
with this topic as it was discussed in IPsec coffee hour. 

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-17  5:22   ` Leon Romanovsky
@ 2022-08-17 18:10     ` Jakub Kicinski
  2022-08-18  5:24       ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-17 18:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

On Wed, 17 Aug 2022 08:22:02 +0300 Leon Romanovsky wrote:
> On Tue, Aug 16, 2022 at 07:54:08PM -0700, Jakub Kicinski wrote:
> > This is making a precedent for full tunnel offload in netdev, right?  
> 
> Not really. SW IPsec supports two modes: tunnel and transport.
> 
> However HW and SW stack supports only offload of transport mode.
> This is the case for already merged IPsec crypto offload mode and
> the case for this full offload.

My point is on what you called "full offload" vs "crypto offload".
The policy so far has always been that Linux networking stack should
populate all the headers and instruct the device to do crypto, no
header insertion. Obviously we do header insertion in switch/router
offloads but that's different and stateless.

I believe the reasoning was to provide as much flexibility and control
to the software as possible while retaining most of the performance
gains.

You must provide a clear analysis (as in examination in data) and
discussion (as in examination in writing) if you're intending to 
change the "let's keep packet formation in the SW" policy. What you 
got below is a good start but not sufficient.

> > Could you indulge us with a more detailed description, motivation,
> > performance results, where the behavior of offload may differ (if at
> > all), what visibility users have, how SW and HW work together on the
> > datapath? Documentation would be great.  
> 
> IPsec full offload is actually improved version of IPsec crypto mode,
> In full mode, HW is responsible to trim/add headers in addition to
> decrypt/encrypt. In this mode, the packet arrives to the stack as already
> decrypted and vice versa for TX (exits to HW as not-encrypted).
> 
> My main motivation is to perform IPsec on RoCE traffic and in our
> preliminary results, we are able to do IPsec full offload in line
> rate. The same goes for ETH traffic.

If the motivation is RoCE I personally see no reason to provide the
configuration of this functionality via netdev interfaces, but I'll
obviously leave the final decision to Steffen.

> Regarding behavior differences - they are not expected.
> 
> We (Raed and me) tried very hard to make sure that IPsec full offload
> will behave exactly as SW path.
> 
> There are some limitations to reduce complexity, but they can be removed
> later if needs will arise. Right now, none of them are "real" limitations
> for various *swarn forks, which we extend as well.
> 
> Some of them:
> 1. Request to have reqid for policy and state. I use reqid for HW
> matching between policy and state.

reqid?

> 2. Automatic separation between HW and SW priorities, because HW sees
> packet first.

More detail needed on that.

> 3. Only main template is supported.
> 4. No fallback to SW if IPsec HW failed to handle packet. HW should drop
> such packet.

Not great for debug.

> Visibility:
> Users can see the mode through iproute2
> https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/
> and see statistics through ethtool.

Custom vendor stats?

> Documentation will come as well. I assume that IPsec folks are familiar
> with this topic as it was discussed in IPsec coffee hour. 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-17 18:10     ` Jakub Kicinski
@ 2022-08-18  5:24       ` Leon Romanovsky
  2022-08-18 10:10         ` Steffen Klassert
  2022-08-19  2:34         ` Jakub Kicinski
  0 siblings, 2 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18  5:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel, Jason Gunthorpe

On Wed, Aug 17, 2022 at 11:10:52AM -0700, Jakub Kicinski wrote:
> On Wed, 17 Aug 2022 08:22:02 +0300 Leon Romanovsky wrote:
> > On Tue, Aug 16, 2022 at 07:54:08PM -0700, Jakub Kicinski wrote:
> > > This is making a precedent for full tunnel offload in netdev, right?  
> > 
> > Not really. SW IPsec supports two modes: tunnel and transport.
> > 
> > However HW and SW stack supports only offload of transport mode.
> > This is the case for already merged IPsec crypto offload mode and
> > the case for this full offload.
> 
> My point is on what you called "full offload" vs "crypto offload".
> The policy so far has always been that Linux networking stack should
> populate all the headers and instruct the device to do crypto, no
> header insertion. Obviously we do header insertion in switch/router
> offloads but that's different and stateless.
> 
> I believe the reasoning was to provide as much flexibility and control
> to the software as possible while retaining most of the performance
> gains.

I honestly don't know the reasoning, but the "performance gains" are very
limited as long as the IPsec stack is involved in the various policy/state
lookups. These lookups are expensive in terms of CPU and cannot sustain
a 400 Gb/s line rate.

https://docs.nvidia.com/networking/display/connectx7en/Introduction#Introduction-ConnectX-7400GbEAdapterCards

> 
> You must provide a clear analysis (as in examination in data) and
> discussion (as in examination in writing) if you're intending to 
> change the "let's keep packet formation in the SW" policy. What you 
> got below is a good start but not sufficient.

Can you please point me to an example of such analysis, so I will know
what is needed/expected?

> 
> > > Could you indulge us with a more detailed description, motivation,
> > > performance results, where the behavior of offload may differ (if at
> > > all), what visibility users have, how SW and HW work together on the
> > > datapath? Documentation would be great.  
> > 
> > IPsec full offload is actually improved version of IPsec crypto mode,
> > In full mode, HW is responsible to trim/add headers in addition to
> > decrypt/encrypt. In this mode, the packet arrives to the stack as already
> > decrypted and vice versa for TX (exits to HW as not-encrypted).
> > 
> > My main motivation is to perform IPsec on RoCE traffic and in our
> > preliminary results, we are able to do IPsec full offload in line
> > rate. The same goes for ETH traffic.
> 
> If the motivation is RoCE I personally see no reason to provide the
> configuration of this functionality via netdev interfaces, but I'll
> obviously leave the final decision to Steffen.

This is not limited to RoCE; our customers use this offload for Ethernet
traffic as well.

RoCE is a good example of traffic that performs all header magic in HW,
without SW involvement.

IPsec clearly belongs to netdev and we don't want to duplicate netdev
functionality in RDMA. Like I said above, this feature is needed for
regular ETH traffic as well.

Right now, RoCE and iWARP devices are based on netdev and on a
long-standing agreement (>20 years) that all netdev configurations are
done where they belong - in netdev.

If you think that the RDMA<->netdev binding should be untied and netdev
functionality should be duplicated, feel free to submit a topic to LPC
and/or catch me and/or Jason to discuss it at the venue.

> 
> > Regarding behavior differences - they are not expected.
> > 
> > We (Raed and me) tried very hard to make sure that IPsec full offload
> > will behave exactly as SW path.
> > 
> > There are some limitations to reduce complexity, but they can be removed
> > later if needs will arise. Right now, none of them are "real" limitations
> > for various *swarn forks, which we extend as well.
> > 
> > Some of them:
> > 1. Request to have reqid for policy and state. I use reqid for HW
> > matching between policy and state.
> 
> reqid?

Policy and state are matched based on their selectors (src/dest IP,
direction, ...), but they are independent. The reqid is the XFRM
identification that connects this specific policy to this specific state.
https://www.man7.org/linux/man-pages/man8/ip-xfrm.8.html
https://docs.kernel.org/networking/xfrm_device.html
ip x s add ....
   reqid 0x07 ...
   offload dev eth4 dir in

SW IPsec allows this field to be left unset. In offload mode, we want to
be explicit and require the user to say which policy belongs to which
state.

I personally never saw anyone configure IPsec without a reqid.
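Spelled out as a pair (placeholder addresses, SPI, and key), the reqid is
what ties the offloaded policy to the offloaded state:

```shell
# State carrying reqid 0x07, offloaded to eth4 (as in the snippet above).
ip xfrm state add src 10.0.0.1 dst 10.0.0.2 \
    proto esp spi 0x100 reqid 0x07 mode transport \
    aead 'rfc4106(gcm(aes))' 0x1234567890abcdef1234567890abcdef12345678 128 \
    offload dev eth4 dir in

# Policy whose template names the same reqid 0x07; this is the explicit
# policy<->state pairing the HW relies on.
ip xfrm policy add src 10.0.0.0/24 dst 10.0.0.0/24 dir in \
    tmpl src 10.0.0.1 dst 10.0.0.2 proto esp reqid 0x07 mode transport
```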

> 
> > 2. Automatic separation between HW and SW priorities, because HW sees
> > packet first.
> 
> More detail needed on that.

I have description in the commit message of relevant commit.
https://lore.kernel.org/all/191be9b9c0367d4554b208533f5d75f498784889.1660639789.git.leonro@nvidia.com/
-----
Devices that implement IPsec full offload mode offload policies too.
On the RX path, this leads to a situation where the HW can't effectively
handle mixed SW and HW priorities unless users make sure that HW-offloaded
policies have higher priorities.

To give users a coherent picture, require that HW-offloaded policies
always have higher priorities (both RX and TX) than SW ones.

To avoid over-engineering the code, HW policies are counted like SW ones
and the netdev is not taken into account, which allows reuse of the same
priorities for different devices.
-----

> 
> > 3. Only main template is supported.
> > 4. No fallback to SW if IPsec HW failed to handle packet. HW should drop
> > such packet.
> 
> Not great for debug.

For now, there are various vendor tools to inspect FW/HW state. We have
some rough ideas on how to improve this and forward bad packets to SW for
analysis, but that is a more advanced topic and needs this series to be
merged first.

> 
> > Visibility:
> > Users can see the mode through iproute2
> > https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/
> > and see statistics through ethtool.
> 
> Custom vendor stats?

XFRM doesn't have much to offer. There are a bunch of LINUX_MIB_XFRM*
counters, but nothing more.

> 
> > Documentation will come as well. I assume that IPsec folks are familiar
> > with this topic as it was discussed in IPsec coffee hour. 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
                   ` (6 preceding siblings ...)
  2022-08-17  2:54 ` [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Jakub Kicinski
@ 2022-08-18 10:09 ` Steffen Klassert
  2022-08-18 13:26   ` Leon Romanovsky
  7 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-18 10:09 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

Hi Leon,

On Tue, Aug 16, 2022 at 11:59:21AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Changelog:
> v2:
>  * Rebased to latest 6.0-rc1
>  * Add an extra check in TX datapath patch to validate packets before
>    forwarding to HW.
>  * Added policy cleanup logic in case of netdev down event 
> v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
>  * Moved comment to be before if (...) in third patch.
> v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> -----------------------------------------------------------------------
> 
> The following series extends XFRM core code to handle new type of IPsec
> offload - full offload.
> 
> In this mode, the HW is going to be responsible for whole data path, so
> both policy and state should be offloaded.

Some general comments about the patchset:

As implemented, the software does not hold any state,
i.e. there is no sync between hardware and software
regarding stats, lifetime, lifebyte, packet counts
and the replay window. IKE rekeying and auditing are based
on these; how should this be done?

I have not seen anything that catches configurations
that stack multiple tunnels with the outer tunnel offloaded.

Where do we make sure that the policy offloading device
is the same as the state offloading device?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18  5:24       ` Leon Romanovsky
@ 2022-08-18 10:10         ` Steffen Klassert
  2022-08-18 12:51           ` Leon Romanovsky
  2022-08-19  1:54           ` Jakub Kicinski
  2022-08-19  2:34         ` Jakub Kicinski
  1 sibling, 2 replies; 45+ messages in thread
From: Steffen Klassert @ 2022-08-18 10:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jakub Kicinski, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel, Jason Gunthorpe

On Thu, Aug 18, 2022 at 08:24:13AM +0300, Leon Romanovsky wrote:
> On Wed, Aug 17, 2022 at 11:10:52AM -0700, Jakub Kicinski wrote:
> > On Wed, 17 Aug 2022 08:22:02 +0300 Leon Romanovsky wrote:
> > > On Tue, Aug 16, 2022 at 07:54:08PM -0700, Jakub Kicinski wrote:
> > > > This is making a precedent for full tunnel offload in netdev, right?  
> > > 
> > > Not really. SW IPsec supports two modes: tunnel and transport.
> > > 
> > > However HW and SW stack supports only offload of transport mode.
> > > This is the case for already merged IPsec crypto offload mode and
> > > the case for this full offload.
> > 
> > My point is on what you called "full offload" vs "crypto offload".
> > The policy so far has always been that Linux networking stack should
> > populate all the headers and instruct the device to do crypto, no
> > header insertion. Obviously we do header insertion in switch/router
> > offloads but that's different and stateless.
> > 
> > I believe the reasoning was to provide as much flexibility and control
> > to the software as possible while retaining most of the performance
> > gains.
> 
> I honestly don't know the reasoning, but "performance gains" are very
> limited as long as IPsec stack involved with various policy/state
> lookups. These lookups are expensive in terms of CPU and they can't
> hold 400 Gb/s line rate.

Can you provide some performance results that show the difference
between crypto and full offload? In particular because, on the TX
path, the full policy and state lookup is done twice (in software,
to find the offloading device, and then in hardware, to match policy
to state).

> 
> https://docs.nvidia.com/networking/display/connectx7en/Introduction#Introduction-ConnectX-7400GbEAdapterCards
> 
> > 
> > You must provide a clear analysis (as in examination in data) and
> > discussion (as in examination in writing) if you're intending to 
> > change the "let's keep packet formation in the SW" policy. What you 
> > got below is a good start but not sufficient.

I'm still a bit uneasy about this approach. I fear that doing parts
of stateful IPsec processing in software and parts in hardware will
lead to all sorts of problems. E.g. with this implementation,
the software has no stats, lifetime, lifebyte or packet count
information but is responsible for the IKE communication.

We might be able to sort out all the problems during the upstreaming
process, but I still have no clear picture of how this should work
in the end, with all the corner cases this creates.

Also, the name "full offload" is a bit misleading, because the
software still has to hold all offloaded states and policies.
In a full offload, the stack would IMO just act as a stub
layer between IKE and hardware.

> > > Some of them:
> > > 1. Request to have reqid for policy and state. I use reqid for HW
> > > matching between policy and state.
> > 
> > reqid?
> 
> Policy and state are matched based on their selectors (src/dest IP,
> direction, ...), but they are independent. The reqid is the XFRM
> identification that connects this specific policy to this specific state.
> https://www.man7.org/linux/man-pages/man8/ip-xfrm.8.html
> https://docs.kernel.org/networking/xfrm_device.html
> ip x s add ....
>    reqid 0x07 ...
>    offload dev eth4 dir in

Can you elaborate on this a bit more? Does that matching happen in
hardware? The reqid is not a unique identifier for matching between
policy and state. You MUST match the selectors as defined in
https://www.rfc-editor.org/rfc/rfc4301


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode
  2022-08-16  8:59 ` [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode Leon Romanovsky
@ 2022-08-18 10:12   ` Steffen Klassert
  2022-08-18 13:28     ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-18 10:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

On Tue, Aug 16, 2022 at 11:59:23AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Allow users to configure xfrm states with full offload mode.
> The full mode must be requested for both policy and state, and
> as such it requires us not to implement a fallback.
> 
> We explicitly return an error if the requested full mode can't
> be configured.
> 
> Reviewed-by: Raed Salem <raeds@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

This one needs to be ACKed by the driver maintainers.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
  2022-08-16  8:59 ` [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
@ 2022-08-18 10:24   ` Steffen Klassert
  2022-08-18 13:34     ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-18 10:24 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

On Tue, Aug 16, 2022 at 11:59:25AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> In IPsec full mode, the device is going to encrypt and encapsulate
> packets that are associated with an offloaded policy. After a
> successful policy lookup indicating whether packets should be
> offloaded, the stack forwards the packets to the device to do the magic.
> 
> Signed-off-by: Raed Salem <raeds@nvidia.com>
> Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  net/xfrm/xfrm_device.c | 13 +++++++++++++
>  net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> index 1cc482e9c87d..db5ebd36f68c 100644
> --- a/net/xfrm/xfrm_device.c
> +++ b/net/xfrm/xfrm_device.c

> @@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
>  	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
>  	struct net_device *dev = x->xso.dev;
>  
> +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> +		goto ok;

You skip the PMTU checks here. I've seen that you check
the packet length against the device MTU in one of your
mlx5 patches, but that does not help if the PMTU is below it.

>  
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index 555ab35cd119..27a8dac9ca7d 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -719,6 +719,26 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
>  		break;
>  	}
>  
> +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL) {
> +		struct dst_entry *dst = skb_dst_pop(skb);
> +
> +		if (!dst || !xfrm_dev_offload_ok(skb, x)) {
> +			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> +			kfree_skb(skb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		skb_dst_set(skb, dst);
> +		err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
> +		if (unlikely(err != 1))
> +			return err;
> +
> +		if (!skb_dst(skb)->xfrm)
> +			return dst_output(net, skb->sk, skb);
> +
> +		return 0;

You leak skb here. Also, this skb needs another tfm because
skb_dst(skb)->xfrm is set.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection for IPsec full offload mode
  2022-08-16  8:59 ` [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection " Leon Romanovsky
@ 2022-08-18 10:27   ` Steffen Klassert
  2022-08-18 13:36     ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-18 10:27 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel

On Tue, Aug 16, 2022 at 11:59:26AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Traffic received by a device with IPsec full offload enabled should be
> forwarded to the stack only after decryption, with packet headers and
> trailers removed.
> 
> Such packets are expected to be seen as normal (non-XFRM) ones, while
> unsupported packets should be dropped by the HW.
> 
> Reviewed-by: Raed Salem <raeds@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

> @@ -1125,6 +1148,15 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
>  {
>  	struct net *net = dev_net(skb->dev);
>  	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
> +	struct xfrm_offload *xo = xfrm_offload(skb);
> +	struct xfrm_state *x;
> +
> +	if (xo) {
> +		x = xfrm_input_state(skb);
> +		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> +			return (xo->flags & CRYPTO_DONE) &&
> +			       (xo->status & CRYPTO_SUCCESS);
> +	}
>  
>  	if (sk && sk->sk_policy[XFRM_POLICY_IN])
>  		return __xfrm_policy_check(sk, ndir, skb, family);

What happens here if there is a socket policy configured?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18 10:10         ` Steffen Klassert
@ 2022-08-18 12:51           ` Leon Romanovsky
  2022-08-19  1:54           ` Jakub Kicinski
  1 sibling, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18 12:51 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Jakub Kicinski, David S . Miller, Herbert Xu, netdev, Raed Salem,
	ipsec-devel, Jason Gunthorpe

On Thu, Aug 18, 2022 at 12:10:31PM +0200, Steffen Klassert wrote:
> On Thu, Aug 18, 2022 at 08:24:13AM +0300, Leon Romanovsky wrote:
> > On Wed, Aug 17, 2022 at 11:10:52AM -0700, Jakub Kicinski wrote:
> > > On Wed, 17 Aug 2022 08:22:02 +0300 Leon Romanovsky wrote:
> > > > On Tue, Aug 16, 2022 at 07:54:08PM -0700, Jakub Kicinski wrote:
> > > > > This is making a precedent for full tunnel offload in netdev, right?  
> > > > 
> > > > Not really. SW IPsec supports two modes: tunnel and transport.
> > > > 
> > > > However, the HW and SW stack support only offload of transport mode.
> > > > This is the case for the already merged IPsec crypto offload mode and
> > > > the case for this full offload.
> > > 
> > > My point is on what you called "full offload" vs "crypto offload".
> > > The policy so far has always been that Linux networking stack should
> > > populate all the headers and instruct the device to do crypto, no
> > > header insertion. Obviously we do header insertion in switch/router
> > > offloads but that's different and stateless.
> > > 
> > > I believe the reasoning was to provide as much flexibility and control
> > > to the software as possible while retaining most of the performance
> > > gains.
> > 
> > I honestly don't know the reasoning, but "performance gains" are very
> > limited as long as the IPsec stack is involved with various policy/state
> > lookups. These lookups are expensive in terms of CPU, and they can't
> > sustain a 400 Gb/s line rate.
> 
> Can you provide some performance results that show the difference
> between crypto and full offload? In particular because on the TX
> path, the full policy and state offload is done twice (in software
> to find the offloading device and then in hardware to match policy
> to state).

I will prepare the numbers.

> 
> > 
> > https://docs.nvidia.com/networking/display/connectx7en/Introduction#Introduction-ConnectX-7400GbEAdapterCards
> > 
> > > 
> > > You must provide a clear analysis (as in examination in data) and
> > > discussion (as in examination in writing) if you're intending to 
> > > change the "let's keep packet formation in the SW" policy. What you 
> > > got below is a good start but not sufficient.
> 
> I'm still a bit uneasy about this approach. I fear that doing parts
> of stateful IPsec processing in software and parts in hardware will
> lead to all sorts of problems. E.g. with this implementation
> the software has no stats, lifetime, lifebyte and packet count
> information but is responsible to do the IKE communication.
> 
> We might be able to sort out all problems during the upstreaming
> process, but I still have no clear picture of how this should work
> in the end with all the corner cases this creates.

As we discussed in the IPsec coffee hour, there is no reliable way
to synchronize SW and HW. This is why we offload both policy and state
and skip the stack completely.

> 
> Also, the name full offload is a bit misleading, because the
> software still has to hold all offloaded states and policies.
> In a full offload, the stack would IMO just act as a stub
> layer between IKE and hardware.

It is just a name; I'm open to changing it to any other name.

> 
> > > > Some of them:
> > > > 1. Request to have reqid for policy and state. I use reqid for HW
> > > > matching between policy and state.
> > > 
> > > reqid?
> > 
> > Policy and state are matched based on their selectors (src/dest IP, direction ...),
> > but they are independent. The reqid is the XFRM identification that ties
> > this specific policy to this specific state.
> > https://www.man7.org/linux/man-pages/man8/ip-xfrm.8.html
> > https://docs.kernel.org/networking/xfrm_device.html
> > ip x s add ....
> >    reqid 0x07 ...
> >    offload dev eth4 dir in
> 
> Can you elaborate on this a bit more? Does that matching happen in
> hardware? The reqid is not a unique identifier to match between
> policy and state. You MUST match the selectors as defined in
> https://www.rfc-editor.org/rfc/rfc4301

The reqid is needed for the TX path and is part of the mlx5 flow steering logic.
https://lore.kernel.org/netdev/51ee028577396c051604703c46bd31d706b4b387.1660641154.git.leonro@nvidia.com/
I'm relying on it to make sure that both policy and state exist.

For everything else, I rely on selectors.
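
For illustration, a hypothetical configuration pair where state and policy
share the same reqid might look like the following. Addresses, SPI, key and
device name are placeholders, and the exact offload keyword for full mode is
still being settled in this series, so this only sketches the idea:

```shell
# Hypothetical example: state and policy share reqid 0x07 so the
# HW can match them on TX. All values below are placeholders.
ip xfrm state add \
    src 192.168.1.1 dst 192.168.1.2 \
    proto esp spi 0x1000 reqid 0x07 mode transport \
    aead 'rfc4106(gcm(aes))' 0x1111111111111111111111111111111111111111 128 \
    offload dev eth4 dir out

ip xfrm policy add \
    src 192.168.1.1 dst 192.168.1.2 dir out \
    tmpl src 192.168.1.1 dst 192.168.1.2 \
    proto esp reqid 0x07 mode transport
```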

Thanks

> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18 10:09 ` Steffen Klassert
@ 2022-08-18 13:26   ` Leon Romanovsky
  2022-08-22  8:34     ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18 13:26 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 12:09:30PM +0200, Steffen Klassert wrote:
> Hi Leon,
> 
> On Tue, Aug 16, 2022 at 11:59:21AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Changelog:
> > v2:
> >  * Rebased to latest 6.0-rc1
> >  * Add an extra check in TX datapath patch to validate packets before
> >    forwarding to HW.
> >  * Added policy cleanup logic in case of netdev down event 
> > v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
> >  * Moved comment to be before if (...) in third patch.
> > v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> > -----------------------------------------------------------------------
> > 
> > The following series extends XFRM core code to handle new type of IPsec
> > offload - full offload.
> > 
> > In this mode, the HW is going to be responsible for whole data path, so
> > both policy and state should be offloaded.
> 
> some general comments about the patchset:
> 
> As implemented, the software does not hold any state.
> I.e. there is no sync between hardware and software
> regarding stats, lifetime, lifebyte, packet counts
> and replay window. IKE rekeying and auditing are based
> on these; how should this be done?

This is only a rough idea, as we have only started to implement the needed
support in libreswan, but our plan is to configure IKE with the
highest possible priority.

> 
> I have not seen anything that catches configurations
> that stack multiple tunnels with the outer offloaded.
> 
> Where do we make sure that policy offloading device
> is the same as the state offloading device?

It is a configuration error and we don't check it. Should we?
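
If such a check were considered worthwhile, a rough sketch (hypothetical;
the xdo/xso field names follow this series' additions and are not verified
against the final patches) could reject mismatched devices in the TX path:

```c
/* Hypothetical sketch only: ensure the policy and the state were
 * offloaded to the same netdev before handing the skb to HW. */
if (pol->xdo.type == XFRM_DEV_OFFLOAD_FULL &&
    x->xso.type == XFRM_DEV_OFFLOAD_FULL &&
    pol->xdo.dev != x->xso.dev) {
	XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTPOLERROR);
	return -EINVAL;
}
```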

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode
  2022-08-18 10:12   ` Steffen Klassert
@ 2022-08-18 13:28     ` Leon Romanovsky
  2022-08-22  8:01       ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18 13:28 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 12:12:20PM +0200, Steffen Klassert wrote:
> On Tue, Aug 16, 2022 at 11:59:23AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Allow users to configure xfrm states with full offload mode.
> > The full mode must be requested both for policy and state, and
> > this requires us not to implement a fallback.
> > 
> > We explicitly return an error if the requested full mode can't
> > be configured.
> > 
> > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> This one needs to be ACKed by the driver maintainers.

Why? Only crypto is supported in upstream kernel.

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
  2022-08-18 10:24   ` Steffen Klassert
@ 2022-08-18 13:34     ` Leon Romanovsky
  2022-08-22  8:04       ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18 13:34 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 12:24:51PM +0200, Steffen Klassert wrote:
> On Tue, Aug 16, 2022 at 11:59:25AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > In IPsec full mode, the device is going to encrypt and encapsulate
> > packets that are associated with offloaded policy. After successful
> > policy lookup to indicate if packets should be offloaded or not,
> > the stack forwards packets to the device to do the magic.
> > 
> > Signed-off-by: Raed Salem <raeds@nvidia.com>
> > Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  net/xfrm/xfrm_device.c | 13 +++++++++++++
> >  net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
> >  2 files changed, 33 insertions(+)
> > 
> > diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> > index 1cc482e9c87d..db5ebd36f68c 100644
> > --- a/net/xfrm/xfrm_device.c
> > +++ b/net/xfrm/xfrm_device.c
> 
> > @@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
> >  	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
> >  	struct net_device *dev = x->xso.dev;
> >  
> > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > +		goto ok;
> 
> You skip the PMTU checks here. I've seen that you check
> the packet length against the device MTU in one of your
> mlx5 patches, but that does not help if the PMTU is below.

If the device supports transformation of the packet, this packet
won't be counted as XFRM anymore. I'm not sure that we need
to perform XFRM-specific checks.

> 
> >  
> > diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> > index 555ab35cd119..27a8dac9ca7d 100644
> > --- a/net/xfrm/xfrm_output.c
> > +++ b/net/xfrm/xfrm_output.c
> > @@ -719,6 +719,26 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
> >  		break;
> >  	}
> >  
> > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL) {
> > +		struct dst_entry *dst = skb_dst_pop(skb);
> > +
> > +		if (!dst || !xfrm_dev_offload_ok(skb, x)) {
> > +			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
> > +			kfree_skb(skb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		skb_dst_set(skb, dst);
> > +		err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
> > +		if (unlikely(err != 1))
> > +			return err;
> > +
> > +		if (!skb_dst(skb)->xfrm)
> > +			return dst_output(net, skb->sk, skb);
> > +
> > +		return 0;
> 
> You leak skb here. Also, this skb needs another tfm because
> skb_dst(skb)->xfrm is set.

I will fix, thanks.
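
For what it's worth, one possible direction for the fix (a rough, untested
sketch based on the hunk quoted above, not the actual patch) would be to
drop the skb instead of returning 0 when another transform is still pending:

```c
/* Sketch of the missing error handling: if skb_dst(skb)->xfrm is
 * still set, another transform is pending that full offload cannot
 * handle here, so drop the packet instead of leaking it. */
if (!skb_dst(skb)->xfrm)
	return dst_output(net, skb->sk, skb);

XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
kfree_skb(skb);
return -EHOSTUNREACH;
```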

> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection for IPsec full offload mode
  2022-08-18 10:27   ` Steffen Klassert
@ 2022-08-18 13:36     ` Leon Romanovsky
  2022-08-22  8:06       ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-18 13:36 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 12:27:08PM +0200, Steffen Klassert wrote:
> On Tue, Aug 16, 2022 at 11:59:26AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Traffic received by a device with IPsec full offload enabled should be
> > forwarded to the stack only after decryption, with packet headers and
> > trailers removed.
> > 
> > Such packets are expected to be seen as normal (non-XFRM) ones, while
> > unsupported packets should be dropped by the HW.
> > 
> > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> > @@ -1125,6 +1148,15 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
> >  {
> >  	struct net *net = dev_net(skb->dev);
> >  	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
> > +	struct xfrm_offload *xo = xfrm_offload(skb);
> > +	struct xfrm_state *x;
> > +
> > +	if (xo) {
> > +		x = xfrm_input_state(skb);
> > +		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > +			return (xo->flags & CRYPTO_DONE) &&
> > +			       (xo->status & CRYPTO_SUCCESS);
> > +	}
> >  
> >  	if (sk && sk->sk_policy[XFRM_POLICY_IN])
> >  		return __xfrm_policy_check(sk, ndir, skb, family);
> 
> What happens here if there is a socket policy configured?

No change, we don't support offload of socket policies.
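
One possible way to keep socket policies authoritative would be a
hypothetical reordering of the hunk quoted above, so the early
full-offload return only happens after the socket-policy check
(a sketch only, not a tested patch):

```c
/* Sketch: check socket policies first so an offloaded state cannot
 * bypass a configured per-socket IN policy. */
if (sk && sk->sk_policy[XFRM_POLICY_IN])
	return __xfrm_policy_check(sk, ndir, skb, family);

if (xo) {
	x = xfrm_input_state(skb);
	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
		return (xo->flags & CRYPTO_DONE) &&
		       (xo->status & CRYPTO_SUCCESS);
}
```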

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18 10:10         ` Steffen Klassert
  2022-08-18 12:51           ` Leon Romanovsky
@ 2022-08-19  1:54           ` Jakub Kicinski
  1 sibling, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-19  1:54 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Leon Romanovsky, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel, Jason Gunthorpe

On Thu, 18 Aug 2022 12:10:31 +0200 Steffen Klassert wrote:
> > > You must provide a clear analysis (as in examination in data) and
> > > discussion (as in examination in writing) if you're intending to 
> > > change the "let's keep packet formation in the SW" policy. What you 
> > > got below is a good start but not sufficient.  
> 
> I'm still a bit uneasy about this approach. I fear that doing parts
> of stateful IPsec processing in software and parts in hardware will
> lead to all sorts of problems. E.g. with this implementation
> the software has no stats, lifetime, lifebyte and packet count
> information but is responsible to do the IKE communication.
> 
> We might be able to sort out all problems during the upstreaming
> process, but I still have no clear picture of how this should work
> in the end with all the corner cases this creates.

Makes sense. I'm not sure any of the "deep and stateful offloads"
we have can be considered a success so IMHO we can be selective
in the approaches we accept.

> Also, the name full offload is a bit misleading, because the
> software still has to hold all offloaded states and policies.
> In a full offload, the stack would IMO just act as a stub
> layer between IKE and hardware.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18  5:24       ` Leon Romanovsky
  2022-08-18 10:10         ` Steffen Klassert
@ 2022-08-19  2:34         ` Jakub Kicinski
  2022-08-19  5:52           ` Leon Romanovsky
  1 sibling, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-19  2:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel, Jason Gunthorpe

On Thu, 18 Aug 2022 08:24:13 +0300 Leon Romanovsky wrote:
> On Wed, Aug 17, 2022 at 11:10:52AM -0700, Jakub Kicinski wrote:
> > My point is on what you called "full offload" vs "crypto offload".
> > The policy so far has always been that Linux networking stack should
> > populate all the headers and instruct the device to do crypto, no
> > header insertion. Obviously we do header insertion in switch/router
> > offloads but that's different and stateless.
> > 
> > I believe the reasoning was to provide as much flexibility and control
> > to the software as possible while retaining most of the performance
> > gains.  
> 
> I honestly don't know the reasoning, but "performance gains" are very
> limited as long as the IPsec stack is involved with various policy/state

Herm. So you didn't bother figuring out what the current problems are
but unsurprisingly the solution is "buy our product and let us do it"?

> lookups. These lookups are expensive in terms of CPU, and they can't
> sustain a 400 Gb/s line rate.
> 
> https://docs.nvidia.com/networking/display/connectx7en/Introduction#Introduction-ConnectX-7400GbEAdapterCards
>
> > You must provide a clear analysis (as in examination in data) and
> > discussion (as in examination in writing) if you're intending to 
> > change the "let's keep packet formation in the SW" policy. What you 
> > got below is a good start but not sufficient.  
> 
> Can you please point me to an example of such analysis, so I will know
> what is needed/expected?

I can't, as I said twice now, we don't have any "full crypto" offloads
AFAIK.

> > > > IPsec full offload is actually an improved version of IPsec crypto mode.
> > > > In full mode, HW is responsible for trimming/adding headers in addition to
> > > > decrypting/encrypting. In this mode, the packet arrives at the stack already
> > > > decrypted, and vice versa for TX (it exits to HW unencrypted).
> > > > 
> > > > My main motivation is to perform IPsec on RoCE traffic, and in our
> > > > preliminary results we are able to do IPsec full offload at line
> > > > rate. The same goes for ETH traffic.  
> > 
> > If the motivation is RoCE I personally see no reason to provide the
> > configuration of this functionality via netdev interfaces, but I'll
> > obviously leave the final decision to Steffen.  
> 
> This is not limited to RoCE; our customers use this offload for Ethernet
> traffic as well.
> 
> RoCE is a good example of traffic that performs all the header magic in HW,
> without SW involvement.
> 
> IPsec clearly belongs to netdev and we don't want to duplicate netdev
> functionality in RDMA. Like I said above, this feature is needed for
> regular ETH traffic as well.
> 
> Right now, RoCE and iWARP devices are based on netdev and a long-standing
> agreement (>20 years) that all netdev configurations are done
> where they belong - in netdev.

Let me be very clear - as far as I'm concerned no part of the RDMA
stack belongs in netdev. What's there is there, but do not try to use
that argument to justify more stuff.

If someone from the community thinks that I should have interest in
working on / helping proprietary protocol stacks please let me know,
because right now I have none.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-19  2:34         ` Jakub Kicinski
@ 2022-08-19  5:52           ` Leon Romanovsky
  2022-08-19 15:47             ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-19  5:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel, Jason Gunthorpe

On Thu, Aug 18, 2022 at 07:34:49PM -0700, Jakub Kicinski wrote:
> On Thu, 18 Aug 2022 08:24:13 +0300 Leon Romanovsky wrote:
> > On Wed, Aug 17, 2022 at 11:10:52AM -0700, Jakub Kicinski wrote:
> > > My point is on what you called "full offload" vs "crypto offload".
> > > The policy so far has always been that Linux networking stack should
> > > populate all the headers and instruct the device to do crypto, no
> > > header insertion. Obviously we do header insertion in switch/router
> > > offloads but that's different and stateless.
> > > 
> > > I believe the reasoning was to provide as much flexibility and control
> > > to the software as possible while retaining most of the performance
> > > gains.  
> > 
> > I honestly don't know the reasoning, but "performance gains" are very
> > limited as long as the IPsec stack is involved with various policy/state
> 
> Herm. So you didn't bother figuring out what the current problems are
> but unsurprisingly the solution is "buy our product and let us do it"?

Our hardware didn't support full offload back then and crypto mode
was the one that was supported in our mlx5 FPGA offering. There are
no "other" reasons from our side.

> 
> > lookups. These lookups are expensive in terms of CPU, and they can't
> > sustain a 400 Gb/s line rate.
> > 
> > https://docs.nvidia.com/networking/display/connectx7en/Introduction#Introduction-ConnectX-7400GbEAdapterCards
> >
> > > You must provide a clear analysis (as in examination in data) and
> > > discussion (as in examination in writing) if you're intending to 
> > > change the "let's keep packet formation in the SW" policy. What you 
> > > got below is a good start but not sufficient.  
> > 
> > Can you please point me to an example of such analysis, so I will know
> > what is needed/expected?
> 
> I can't, as I said twice now, we don't have any "full crypto" offloads
> AFAIK.

No, I'm asking for an example of "clear analysis (as in examination in
data)", as I don't understand this sentence. I'm not asking for "full
crypto" examples.

This "discussion (as in examination in writing)" part is clear.

> 
> > > > IPsec full offload is actually an improved version of IPsec crypto mode.
> > > > In full mode, HW is responsible for trimming/adding headers in addition to
> > > > decrypting/encrypting. In this mode, the packet arrives at the stack already
> > > > decrypted, and vice versa for TX (it exits to HW unencrypted).
> > > > 
> > > > My main motivation is to perform IPsec on RoCE traffic, and in our
> > > > preliminary results we are able to do IPsec full offload at line
> > > > rate. The same goes for ETH traffic.  
> > > 
> > > If the motivation is RoCE I personally see no reason to provide the
> > > configuration of this functionality via netdev interfaces, but I'll
> > > obviously leave the final decision to Steffen.  
> > 
> > This is not limited to RoCE; our customers use this offload for Ethernet
> > traffic as well.
> > 
> > RoCE is a good example of traffic that performs all the header magic in HW,
> > without SW involvement.
> > 
> > IPsec clearly belongs to netdev and we don't want to duplicate netdev
> > functionality in RDMA. Like I said above, this feature is needed for
> > regular ETH traffic as well.
> > 
> > Right now, RoCE and iWARP devices are based on netdev and a long-standing
> > agreement (>20 years) that all netdev configurations are done
> > where they belong - in netdev.
> 
> Let me be very clear - as far as I'm concerned no part of the RDMA
> stack belongs in netdev. What's there is there, but do not try to use
> that argument to justify more stuff.
> 
> If someone from the community thinks that I should have interest in
> working on / helping proprietary protocol stacks please let me know,
> because right now I have none.

No one is asking from you to work on proprietary protocols.

RoCE is an IBTA standard protocol and iWARP is an IETF one. They are both
fully documented and backed by multiple vendors (Intel, IBM, Mellanox,
Cavium ...).

There is also an interoperability lab, https://www.iol.unh.edu/, that runs
various tests, in addition to testing by distro interoperability labs.

I invite you to take a look at Jason's presentation "Challenges of the
RDMA subsystem", which he gave 3 years ago, about RDMA and challenges
with netdev.
https://lpc.events/event/4/contributions/364/

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-19  5:52           ` Leon Romanovsky
@ 2022-08-19 15:47             ` Jakub Kicinski
  2022-08-19 16:01               ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-19 15:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, David S . Miller, Herbert Xu, netdev,
	Raed Salem, ipsec-devel, Jason Gunthorpe

On Fri, 19 Aug 2022 08:52:26 +0300 Leon Romanovsky wrote:
> > Let me be very clear - as far as I'm concerned no part of the RDMA
> > stack belongs in netdev. What's there is there, but do not try to use
> > that argument to justify more stuff.
> > 
> > If someone from the community thinks that I should have interest in
> > working on / helping proprietary protocol stacks please let me know,
> > because right now I have none.  
> 
> No one is asking from you to work on proprietary protocols.

That's not what I said. I don't know English grammar enough but you
took the modifying (descriptive? genitive?) noun and treated it as 
the object.

I don't want to be in any way disrespectful to the technology you
invest your time in. Or argue with any beliefs you have about it.

> RoCE is an IBTA standard protocol and iWARP is an IETF one. They are both
> fully documented and backed by multiple vendors (Intel, IBM, Mellanox,
> Cavium ...).
> 
> There is also an interoperability lab, https://www.iol.unh.edu/, that runs
> various tests, in addition to testing by distro interoperability labs.
> 
> I invite you to take a look at Jason's presentation "Challenges of the
> RDMA subsystem", which he gave 3 years ago, about RDMA and challenges
> with netdev.
> https://lpc.events/event/4/contributions/364/

I appreciate the invite, but it's not high enough on my list of interests
to spend time on.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-19 15:47             ` Jakub Kicinski
@ 2022-08-19 16:01               ` Jason Gunthorpe
  2022-08-19 17:53                 ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2022-08-19 16:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Leon Romanovsky, Steffen Klassert, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Fri, Aug 19, 2022 at 08:47:07AM -0700, Jakub Kicinski wrote:

> > I invite you to take a look on Jason's presentation "Challenges of the
> > RDMA subsystem", which he gave 3 years ago, about RDMA and challenges
> > with netdev.
> > https://lpc.events/event/4/contributions/364/
> 
> I appreciate the invite, but it's not high enough on my list of interest
> to spend time on.

Regardless, RDMA doesn't really intersect with this netdev work for
XFRM beyond the usual ways that RDMA IP traffic can be captured by or
run parallel to netdev.

A significant use case here is for switchdev modes where the switch
will subject traffic from a switch port to ESP, not unlike it already
does with vlan, vxlan, etc and other already fully offloaded switching
transforms.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-19 16:01               ` Jason Gunthorpe
@ 2022-08-19 17:53                 ` Jakub Kicinski
  2022-08-22  8:41                   ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-19 17:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Steffen Klassert, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Fri, 19 Aug 2022 13:01:22 -0300 Jason Gunthorpe wrote:
> Regardless, RDMA doesn't really intersect with this netdev work for
> XFRM beyond the usual ways that RDMA IP traffic can be captured by or
> run parallel to netdev.
> 
> A significant use case here is for switchdev modes where the switch
> will subject traffic from a switch port to ESP, not unlike it already
> does with vlan, vxlan, etc and other already fully offloaded switching
> transforms.

Yup, that's what I thought you'd say. Can't argue with that use case 
if Steffen is satisfied with the technical aspects.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode
  2022-08-18 13:28     ` Leon Romanovsky
@ 2022-08-22  8:01       ` Steffen Klassert
  2022-08-22  8:46         ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-22  8:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 04:28:12PM +0300, Leon Romanovsky wrote:
> On Thu, Aug 18, 2022 at 12:12:20PM +0200, Steffen Klassert wrote:
> > On Tue, Aug 16, 2022 at 11:59:23AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Allow users to configure xfrm states with full offload mode.
> > > The full mode must be requested both for policy and state, and
> > > this requires us not to implement a fallback.
> > > 
> > > We explicitly return an error if the requested full mode can't
> > > be configured.
> > > 
> > > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > 
> > This one needs to be ACKed by the driver maintainers.
> 
> Why? Only crypto is supported in upstream kernel.

Because it touches code that is under their maintenance.
They should at least be CCed on this patch.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
  2022-08-18 13:34     ` Leon Romanovsky
@ 2022-08-22  8:04       ` Steffen Klassert
  2022-08-22  8:50         ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-22  8:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 04:34:54PM +0300, Leon Romanovsky wrote:
> On Thu, Aug 18, 2022 at 12:24:51PM +0200, Steffen Klassert wrote:
> > On Tue, Aug 16, 2022 at 11:59:25AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > In IPsec full mode, the device is going to encrypt and encapsulate
> > > packets that are associated with an offloaded policy. After a
> > > successful policy lookup indicating that packets should be offloaded,
> > > the stack forwards them to the device to do the magic.
> > > 
> > > Signed-off-by: Raed Salem <raeds@nvidia.com>
> > > Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  net/xfrm/xfrm_device.c | 13 +++++++++++++
> > >  net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
> > >  2 files changed, 33 insertions(+)
> > > 
> > > diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> > > index 1cc482e9c87d..db5ebd36f68c 100644
> > > --- a/net/xfrm/xfrm_device.c
> > > +++ b/net/xfrm/xfrm_device.c
> > 
> > > @@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
> > >  	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
> > >  	struct net_device *dev = x->xso.dev;
> > >  
> > > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > > +		goto ok;
> > 
> > You skip the PMTU checks here. I've seen that you check
> > the packet length against the device MTU in one of your
> > mlx5 patches, but that does not help if the PMTU is below.
> 
> If the device supports transformation of the packet, this packet
> won't be counted as XFRM anymore. I'm not sure that we need
> to perform XFRM-specific checks.

This is not xfrm specific; it makes sure that the packet you
want to send fits into the PMTU.


* Re: [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection for IPsec full offload mode
  2022-08-18 13:36     ` Leon Romanovsky
@ 2022-08-22  8:06       ` Steffen Klassert
  2022-08-22  9:35         ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-22  8:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 04:36:46PM +0300, Leon Romanovsky wrote:
> On Thu, Aug 18, 2022 at 12:27:08PM +0200, Steffen Klassert wrote:
> > On Tue, Aug 16, 2022 at 11:59:26AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Traffic received by a device with IPsec full offload enabled should
> > > be forwarded to the stack only after decryption, with packet headers
> > > and trailers removed.
> > > 
> > > Such packets are expected to be seen as normal (non-XFRM) ones, while
> > > unsupported packets should be dropped by the HW.
> > > 
> > > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > 
> > > @@ -1125,6 +1148,15 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
> > >  {
> > >  	struct net *net = dev_net(skb->dev);
> > >  	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
> > > +	struct xfrm_offload *xo = xfrm_offload(skb);
> > > +	struct xfrm_state *x;
> > > +
> > > +	if (xo) {
> > > +		x = xfrm_input_state(skb);
> > > +		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > > +			return (xo->flags & CRYPTO_DONE) &&
> > > +			       (xo->status & CRYPTO_SUCCESS);
> > > +	}
> > >  
> > >  	if (sk && sk->sk_policy[XFRM_POLICY_IN])
> > >  		return __xfrm_policy_check(sk, ndir, skb, family);
> > 
> > What happens here if there is a socket policy configured?
> 
> No change, we don't support offload of socket policies.

But the user can configure it, so it should be enforced
regardless of whether we had an offload before.


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-18 13:26   ` Leon Romanovsky
@ 2022-08-22  8:34     ` Steffen Klassert
  2022-08-22  9:34       ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-22  8:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Thu, Aug 18, 2022 at 04:26:39PM +0300, Leon Romanovsky wrote:
> On Thu, Aug 18, 2022 at 12:09:30PM +0200, Steffen Klassert wrote:
> > Hi Leon,
> > 
> > On Tue, Aug 16, 2022 at 11:59:21AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Changelog:
> > > v2:
> > >  * Rebased to latest 6.0-rc1
> > >  * Add an extra check in TX datapath patch to validate packets before
> > >    forwarding to HW.
> > >  * Added policy cleanup logic in case of netdev down event 
> > > v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
> > >  * Moved comment to be before if (...) in third patch.
> > > v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> > > -----------------------------------------------------------------------
> > > 
> > > The following series extends XFRM core code to handle new type of IPsec
> > > offload - full offload.
> > > 
> > > In this mode, the HW is going to be responsible for whole data path, so
> > > both policy and state should be offloaded.
> > 
> > Some general comments about the patchset:
> > 
> > As implemented, the software does not hold any state.
> > I.e. there is no sync between hardware and software
> > regarding stats, lifetime, lifebytes, packet counts
> > and replay window. IKE rekeying and auditing are based
> > on these; how should this be done?
> 
> This is only a rough idea, as we have only started to implement the
> needed support in libreswan, but our plan is to configure IKE with
> the highest possible priority 

If it is only a rough idea, then mark it as RFC. I want to see
the whole picture before we merge it. And btw, tunnel mode
belongs to the whole picture too.

> 
> > 
> > I have not seen anything that catches configurations
> > that stack multiple tunnels with the outer offloaded.
> > 
> > Where do we make sure that policy offloading device
> > is the same as the state offloading device?
> 
> It is a configuration error and we don't check it. Should we?

We should at least make sure to not send out packets untransformed
in this case.


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-19 17:53                 ` Jakub Kicinski
@ 2022-08-22  8:41                   ` Steffen Klassert
  2022-08-22  8:54                     ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-22  8:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jason Gunthorpe, Leon Romanovsky, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:
> On Fri, 19 Aug 2022 13:01:22 -0300 Jason Gunthorpe wrote:
> > Regardless, RDMA doesn't really intersect with this netdev work for
> > XFRM beyond the usual ways that RDMA IP traffic can be captured by or
> > run parallel to netdev.
> > 
> > A significant use case here is for switchdev modes where the switch
> > will subject traffic from a switch port to ESP, not unlike it already
> > does with vlan, vxlan, etc and other already fully offloaded switching
> > transforms.
> 
> Yup, that's what I thought you'd say. Can't argue with that use case 
> if Steffen is satisfied with the technical aspects.

Yes, everything that helps to overcome the performance problems
is welcome, and I'm interested in this type of offload. But we need
to make sure the API is usable by the whole community, so I don't
want an API for some special case that one of the NIC vendors is
interested in.


* Re: [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode
  2022-08-22  8:01       ` Steffen Klassert
@ 2022-08-22  8:46         ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-22  8:46 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 10:01:08AM +0200, Steffen Klassert wrote:
> On Thu, Aug 18, 2022 at 04:28:12PM +0300, Leon Romanovsky wrote:
> > On Thu, Aug 18, 2022 at 12:12:20PM +0200, Steffen Klassert wrote:
> > > On Tue, Aug 16, 2022 at 11:59:23AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Allow users to configure xfrm states with full offload mode.
> > > > The full mode must be requested for both policy and state, and
> > > > this requires us to not implement a fallback.
> > > > 
> > > > We explicitly return an error if the requested full mode can't
> > > > be configured.
> > > > 
> > > > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > This one needs to be ACKed by the driver maintainers.
> > 
> > Why? Only crypto is supported in upstream kernel.
> 
> Because it touches code that is under their maintenance.
> They should at least be CCed on this patch.

No problem, will do.

Thanks


* Re: [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode
  2022-08-22  8:04       ` Steffen Klassert
@ 2022-08-22  8:50         ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-22  8:50 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 10:04:54AM +0200, Steffen Klassert wrote:
> On Thu, Aug 18, 2022 at 04:34:54PM +0300, Leon Romanovsky wrote:
> > On Thu, Aug 18, 2022 at 12:24:51PM +0200, Steffen Klassert wrote:
> > > On Tue, Aug 16, 2022 at 11:59:25AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > In IPsec full mode, the device is going to encrypt and encapsulate
> > > > packets that are associated with offloaded policy. After successful
> > > > policy lookup to indicate if packets should be offloaded or not,
> > > > the stack forwards packets to the device to do the magic.
> > > > 
> > > > Signed-off-by: Raed Salem <raeds@nvidia.com>
> > > > Signed-off-by: Huy Nguyen <huyn@nvidia.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > ---
> > > >  net/xfrm/xfrm_device.c | 13 +++++++++++++
> > > >  net/xfrm/xfrm_output.c | 20 ++++++++++++++++++++
> > > >  2 files changed, 33 insertions(+)
> > > > 
> > > > diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
> > > > index 1cc482e9c87d..db5ebd36f68c 100644
> > > > --- a/net/xfrm/xfrm_device.c
> > > > +++ b/net/xfrm/xfrm_device.c
> > > 
> > > > @@ -366,6 +376,9 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
> > > >  	struct xfrm_dst *xdst = (struct xfrm_dst *)dst;
> > > >  	struct net_device *dev = x->xso.dev;
> > > >  
> > > > +	if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > > > +		goto ok;
> > > 
> > > You skip the PMTU checks here. I've seen that you check
> > > the packet length against the device MTU in one of your
> > > mlx5 patches, but that does not help if the PMTU is below.
> > 
> > If device supports transformation of the packet, this packet
> > won't be counted as XFRM anymore. I'm not sure that we need
> > to perform XFRM specific checks.
> 
> This is not xfrm specific, it makes sure that the packet you
> want to send fits into the PMTU.

I will add.
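For illustration, the PMTU check being discussed could look roughly like the following self-contained sketch. The struct definitions are simplified stand-ins for the real kernel types (struct sk_buff, struct xfrm_state), and the path MTU is modeled as a plain field; only the control flow mirrors xfrm_dev_offload_ok():

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel types; real ones are far larger. */
enum { XFRM_DEV_OFFLOAD_CRYPTO, XFRM_DEV_OFFLOAD_FULL };
struct sk_buff { unsigned int len; unsigned int path_mtu; };
struct xfrm_state { struct { int type; } xso; };

/* Sketch of the discussed change: even for full offload, fail the
 * offload check when the packet exceeds the path MTU, so the stack
 * can still fragment or emit a PMTU error instead of handing an
 * oversized packet to the HW. */
static bool xfrm_dev_offload_ok_sketch(const struct sk_buff *skb,
                                       const struct xfrm_state *x)
{
    if (skb->len > skb->path_mtu)
        return false;                    /* software path handles it */
    if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
        return true;                     /* the "goto ok" shortcut */
    /* ... the existing crypto-offload checks would follow here ... */
    return true;
}
```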

Thanks


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22  8:41                   ` Steffen Klassert
@ 2022-08-22  8:54                     ` Leon Romanovsky
  2022-08-22 16:33                       ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-22  8:54 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Jakub Kicinski, Jason Gunthorpe, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
> On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:
> > On Fri, 19 Aug 2022 13:01:22 -0300 Jason Gunthorpe wrote:
> > > Regardless, RDMA doesn't really intersect with this netdev work for
> > > XFRM beyond the usual ways that RDMA IP traffic can be captured by or
> > > run parallel to netdev.
> > > 
> > > A significant use case here is for switchdev modes where the switch
> > > will subject traffic from a switch port to ESP, not unlike it already
> > > does with vlan, vxlan, etc and other already fully offloaded switching
> > > transforms.
> > 
> > Yup, that's what I thought you'd say. Can't argue with that use case 
> > if Steffen is satisfied with the technical aspects.
> 
> Yes, everything that can help to overcome the performance problems
> can help and I'm interested in this type of offload. But we need to
> make sure the API is usable by the whole community, so I don't
> want an API for some special case one of the NIC vendors is
> interested in.

BTW, we have performance data. I planned to send it as part of the
cover letter for v3, but it is worth sharing now.

 ================================================================================
 Performance results:

 TCP multi-stream, using iperf3 instance per-CPU.
 +----------------------+--------+--------+--------+--------+---------+---------+
 |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
 |                      +--------+--------+--------+--------+---------+---------+
 |                      |                   BW (Gbps)                           |
 +----------------------+--------+--------+-------+---------+---------+---------+
 | Baseline             | 27.9   | 59     | 93.1  | 92.8    | 93.7    | 94.4    |
 +----------------------+--------+--------+-------+---------+---------+---------+
 | Software IPsec       | 6      | 11.9   | 23.3  | 45.9    | 83.8    | 91.8    |
 +----------------------+--------+--------+-------+---------+---------+---------+
 | IPsec crypto offload | 15     | 29.7   | 58.5  | 89.6    | 90.4    | 90.8    |
 +----------------------+--------+--------+-------+---------+---------+---------+
 | IPsec full offload   | 28     | 57     | 90.7  | 91      | 91.3    | 91.9    |
 +----------------------+--------+--------+-------+---------+---------+---------+

 IPsec full offload mode behaves like the baseline and reaches line rate
 with the same number of CPUs.

 Setups details (similar for both sides):
 * NIC: ConnectX6-DX dual port, 100 Gbps each.
   Single port used in the tests.
 * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

Thanks


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22  8:34     ` Steffen Klassert
@ 2022-08-22  9:34       ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-22  9:34 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 10:34:43AM +0200, Steffen Klassert wrote:
> On Thu, Aug 18, 2022 at 04:26:39PM +0300, Leon Romanovsky wrote:
> > On Thu, Aug 18, 2022 at 12:09:30PM +0200, Steffen Klassert wrote:
> > > Hi Leon,
> > > 
> > > On Tue, Aug 16, 2022 at 11:59:21AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Changelog:
> > > > v2:
> > > >  * Rebased to latest 6.0-rc1
> > > >  * Add an extra check in TX datapath patch to validate packets before
> > > >    forwarding to HW.
> > > >  * Added policy cleanup logic in case of netdev down event 
> > > > v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com 
> > > >  * Moved comment to be before if (...) in third patch.
> > > > v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
> > > > -----------------------------------------------------------------------
> > > > 
> > > > The following series extends XFRM core code to handle new type of IPsec
> > > > offload - full offload.
> > > > 
> > > > In this mode, the HW is going to be responsible for whole data path, so
> > > > both policy and state should be offloaded.
> > > 
> > > Some general comments about the patchset:
> > > 
> > > As implemented, the software does not hold any state.
> > > I.e. there is no sync between hardware and software
> > > regarding stats, lifetime, lifebytes, packet counts
> > > and replay window. IKE rekeying and auditing are based
> > > on these; how should this be done?
> > 
> > This is only rough idea as we only started to implement needed
> > support in libreswan, but our plan is to configure IKE with
> > highest possible priority 
> 
> If it is only a rough idea, then mark it as RFC. I want to see
> the whole picture before we merge it. And btw, tunnel mode
> belongs to the whole picture too.

It is rough in the sense that we don't have code to present yet.
We did an arch review of this IKE approach, and that is how we
are implementing it.

> 
> > 
> > > 
> > > I have not seen anything that catches configurations
> > > that stack multiple tunnels with the outer offloaded.
> > > 
> > > Where do we make sure that policy offloading device
> > > is the same as the state offloading device?
> > 
> > It is configuration error and we don't check it. Should we?
> 
> We should at least make sure to not send out packets untransformed
> in this case.

In the TX SW path, if the state doesn't exist, the packets will be
sent unencrypted. This "wrong" device configuration in the offloaded
path has the same behavior as not having a state at all.

Thanks


* Re: [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection for IPsec full offload mode
  2022-08-22  8:06       ` Steffen Klassert
@ 2022-08-22  9:35         ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-22  9:35 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 10:06:42AM +0200, Steffen Klassert wrote:
> On Thu, Aug 18, 2022 at 04:36:46PM +0300, Leon Romanovsky wrote:
> > On Thu, Aug 18, 2022 at 12:27:08PM +0200, Steffen Klassert wrote:
> > > On Tue, Aug 16, 2022 at 11:59:26AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Traffic received by device with enabled IPsec full offload should be
> > > > forwarded to the stack only after decryption, packet headers and
> > > > trailers removed.
> > > > 
> > > > Such packets are expected to be seen as normal (non-XFRM) ones, while
> > > > not-supported packets should be dropped by the HW.
> > > > 
> > > > Reviewed-by: Raed Salem <raeds@nvidia.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > > @@ -1125,6 +1148,15 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
> > > >  {
> > > >  	struct net *net = dev_net(skb->dev);
> > > >  	int ndir = dir | (reverse ? XFRM_POLICY_MASK + 1 : 0);
> > > > +	struct xfrm_offload *xo = xfrm_offload(skb);
> > > > +	struct xfrm_state *x;
> > > > +
> > > > +	if (xo) {
> > > > +		x = xfrm_input_state(skb);
> > > > +		if (x->xso.type == XFRM_DEV_OFFLOAD_FULL)
> > > > +			return (xo->flags & CRYPTO_DONE) &&
> > > > +			       (xo->status & CRYPTO_SUCCESS);
> > > > +	}
> > > >  
> > > >  	if (sk && sk->sk_policy[XFRM_POLICY_IN])
> > > >  		return __xfrm_policy_check(sk, ndir, skb, family);
> > > 
> > > What happens here if there is a socket policy configured?
> > 
> > No change, we don't support offload of socket policies.
> 
> But the user can configure it, so it should be enforced
> regardless of whether we had an offload before.

Thanks, I'll see how it can be resolved.
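One way the fix could be shaped, sketched here with simplified stand-in types (the real __xfrm_policy_check2() operates on struct sock, sk_buff, and friends): consult the per-socket policy before the full-offload short-circuit, so a configured socket policy is always enforced rather than silently skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CRYPTO_DONE    0x1
#define CRYPTO_SUCCESS 0x1
enum { XFRM_DEV_OFFLOAD_CRYPTO, XFRM_DEV_OFFLOAD_FULL };

/* Simplified stand-ins for the kernel structures involved. */
struct xfrm_offload { int flags; int status; };
struct xfrm_state  { struct { int type; } xso; };
struct sock_stub   { bool has_in_policy; bool policy_verdict; };

/* Sketch of __xfrm_policy_check2() with the ordering changed: a socket
 * policy, if configured, is evaluated before the full-offload
 * short-circuit instead of being bypassed by it. */
static bool policy_check_sketch(const struct sock_stub *sk,
                                const struct xfrm_offload *xo,
                                const struct xfrm_state *x)
{
    if (sk && sk->has_in_policy)
        return sk->policy_verdict;       /* socket policy is enforced */
    if (xo && x && x->xso.type == XFRM_DEV_OFFLOAD_FULL)
        return (xo->flags & CRYPTO_DONE) && (xo->status & CRYPTO_SUCCESS);
    /* ... regular __xfrm_policy_check() path for everything else ... */
    return true;
}
```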


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22  8:54                     ` Leon Romanovsky
@ 2022-08-22 16:33                       ` Jakub Kicinski
  2022-08-22 21:27                         ` Saeed Mahameed
  2022-08-23  5:34                         ` Leon Romanovsky
  0 siblings, 2 replies; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-22 16:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steffen Klassert, Jason Gunthorpe, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Mon, 22 Aug 2022 11:54:42 +0300 Leon Romanovsky wrote:
> On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
> > On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:  
> > > Yup, that's what I thought you'd say. Can't argue with that use case 
> > > if Steffen is satisfied with the technical aspects.  
> > 
> > Yes, everything that can help to overcome the performance problems
> > can help and I'm interested in this type of offload. But we need to
> > make sure the API is usable by the whole community, so I don't
> > want an API for some special case one of the NIC vendors is
> > interested in.  
> 
> BTW, we have performance data. I planned to send it as part of the
> cover letter for v3, but it is worth sharing now.
> 
>  ================================================================================
>  Performance results:
> 
>  TCP multi-stream, using iperf3 instance per-CPU.
>  +----------------------+--------+--------+--------+--------+---------+---------+
>  |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
>  |                      +--------+--------+--------+--------+---------+---------+
>  |                      |                   BW (Gbps)                           |
>  +----------------------+--------+--------+-------+---------+---------+---------+
>  | Baseline             | 27.9   | 59     | 93.1  | 92.8    | 93.7    | 94.4    |
>  +----------------------+--------+--------+-------+---------+---------+---------+
>  | Software IPsec       | 6      | 11.9   | 23.3  | 45.9    | 83.8    | 91.8    |
>  +----------------------+--------+--------+-------+---------+---------+---------+
>  | IPsec crypto offload | 15     | 29.7   | 58.5  | 89.6    | 90.4    | 90.8    |
>  +----------------------+--------+--------+-------+---------+---------+---------+
>  | IPsec full offload   | 28     | 57     | 90.7  | 91      | 91.3    | 91.9    |
>  +----------------------+--------+--------+-------+---------+---------+---------+
> 
>  IPsec full offload mode behaves like the baseline and reaches line
>  rate with the same number of CPUs.
> 
>  Setups details (similar for both sides):
>  * NIC: ConnectX6-DX dual port, 100 Gbps each.
>    Single port used in the tests.
>  * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz

My questions about performance were more about where does 
the performance loss originate. Is it because of loss of GRO?
Maybe sharing perf traces could answer some of those questions?


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22 16:33                       ` Jakub Kicinski
@ 2022-08-22 21:27                         ` Saeed Mahameed
  2022-08-23  0:17                           ` Jakub Kicinski
  2022-08-23  4:48                           ` Leon Romanovsky
  2022-08-23  5:34                         ` Leon Romanovsky
  1 sibling, 2 replies; 45+ messages in thread
From: Saeed Mahameed @ 2022-08-22 21:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Leon Romanovsky, Steffen Klassert, Jason Gunthorpe,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On 22 Aug 09:33, Jakub Kicinski wrote:
>On Mon, 22 Aug 2022 11:54:42 +0300 Leon Romanovsky wrote:
>> On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
>> > On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:
>> > > Yup, that's what I thought you'd say. Can't argue with that use case
>> > > if Steffen is satisfied with the technical aspects.
>> >
>> > Yes, everything that can help to overcome the performance problems
>> > can help and I'm interested in this type of offload. But we need to
>> > make sure the API is usable by the whole community, so I don't
>> > want an API for some special case one of the NIC vendors is
>> > interested in.
>>
>> BTW, we have performance data. I planned to send it as part of the
>> cover letter for v3, but it is worth sharing now.
>>
>>  ================================================================================
>>  Performance results:
>>
>>  TCP multi-stream, using iperf3 instance per-CPU.
>>  +----------------------+--------+--------+--------+--------+---------+---------+
>>  |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
>>  |                      +--------+--------+--------+--------+---------+---------+
>>  |                      |                   BW (Gbps)                           |
>>  +----------------------+--------+--------+-------+---------+---------+---------+
>>  | Baseline             | 27.9   | 59     | 93.1  | 92.8    | 93.7    | 94.4    |
>>  +----------------------+--------+--------+-------+---------+---------+---------+
>>  | Software IPsec       | 6      | 11.9   | 23.3  | 45.9    | 83.8    | 91.8    |
>>  +----------------------+--------+--------+-------+---------+---------+---------+
>>  | IPsec crypto offload | 15     | 29.7   | 58.5  | 89.6    | 90.4    | 90.8    |
>>  +----------------------+--------+--------+-------+---------+---------+---------+
>>  | IPsec full offload   | 28     | 57     | 90.7  | 91      | 91.3    | 91.9    |
>>  +----------------------+--------+--------+-------+---------+---------+---------+
>>
>>  IPsec full offload mode behaves like the baseline and reaches line
>>  rate with the same number of CPUs.
>>

Just making sure: Baseline == "Clear text TCP" ?

>>  Setups details (similar for both sides):
>>  * NIC: ConnectX6-DX dual port, 100 Gbps each.
>>    Single port used in the tests.
>>  * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
>
>My questions about performance were more about where does
>the performance loss originate. Is it because of loss of GRO?

Performance loss between full offload and baseline? It's hardly
measurable, less than 3% in the worst case.

>Maybe sharing perf traces could answer some of those questions?


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22 21:27                         ` Saeed Mahameed
@ 2022-08-23  0:17                           ` Jakub Kicinski
  2022-08-23  5:22                             ` Steffen Klassert
  2022-08-23  4:48                           ` Leon Romanovsky
  1 sibling, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2022-08-23  0:17 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Leon Romanovsky, Steffen Klassert, Jason Gunthorpe,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, 22 Aug 2022 14:27:16 -0700 Saeed Mahameed wrote:
> >My questions about performance were more about where does
> >the performance loss originate. Is it because of loss of GRO?  
> 
> Performance loss between full offload and baseline? It's hardly
> measurable, less than 3% in the worst case.

The loss for crypto-only vs baseline is what I meant. Obviously if we
offload everything the CPU perf may look great, but we're giving up
flexibility and the ability to fix problems in SW.


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22 21:27                         ` Saeed Mahameed
  2022-08-23  0:17                           ` Jakub Kicinski
@ 2022-08-23  4:48                           ` Leon Romanovsky
  2022-08-26 12:20                             ` Jason Gunthorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-23  4:48 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jakub Kicinski, Steffen Klassert, Jason Gunthorpe,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 02:27:16PM -0700, Saeed Mahameed wrote:
> On 22 Aug 09:33, Jakub Kicinski wrote:
> > On Mon, 22 Aug 2022 11:54:42 +0300 Leon Romanovsky wrote:
> > > On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
> > > > On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:
> > > > > Yup, that's what I thought you'd say. Can't argue with that use case
> > > > > if Steffen is satisfied with the technical aspects.
> > > >
> > > > Yes, everything that can help to overcome the performance problems
> > > > can help and I'm interested in this type of offload. But we need to
> > > > make sure the API is usable by the whole community, so I don't
> > > > want an API for some special case one of the NIC vendors is
> > > > interested in.
> > > 
> > > BTW, we have performance data. I planned to send it as part of the
> > > cover letter for v3, but it is worth sharing now.
> > > 
> > >  ================================================================================
> > >  Performance results:
> > > 
> > >  TCP multi-stream, using iperf3 instance per-CPU.
> > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > >  |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
> > >  |                      +--------+--------+--------+--------+---------+---------+
> > >  |                      |                   BW (Gbps)                           |
> > >  +----------------------+--------+--------+-------+---------+---------+---------+
> > >  | Baseline             | 27.9   | 59     | 93.1  | 92.8    | 93.7    | 94.4    |
> > >  +----------------------+--------+--------+-------+---------+---------+---------+
> > >  | Software IPsec       | 6      | 11.9   | 23.3  | 45.9    | 83.8    | 91.8    |
> > >  +----------------------+--------+--------+-------+---------+---------+---------+
> > >  | IPsec crypto offload | 15     | 29.7   | 58.5  | 89.6    | 90.4    | 90.8    |
> > >  +----------------------+--------+--------+-------+---------+---------+---------+
> > >  | IPsec full offload   | 28     | 57     | 90.7  | 91      | 91.3    | 91.9    |
> > >  +----------------------+--------+--------+-------+---------+---------+---------+
> > > 
> > >  IPsec full offload mode behaves like the baseline and reaches
> > >  line rate with the same number of CPUs.
> > > 
> 
> Just making sure: Baseline == "Clear text TCP" ?

Yes, baseline is plain TCP without any encryption.

We can get higher numbers with Tariq's improvements, but it was not
important to achieve the maximum, as we are interested in seeing the
differences between the various modes.

Thanks


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-23  0:17                           ` Jakub Kicinski
@ 2022-08-23  5:22                             ` Steffen Klassert
  2022-08-23 14:06                               ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2022-08-23  5:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Leon Romanovsky, Jason Gunthorpe,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 05:17:06PM -0700, Jakub Kicinski wrote:
> On Mon, 22 Aug 2022 14:27:16 -0700 Saeed Mahameed wrote:
> > >My questions about performance were more about where does
> > >the performance loss originate. Is it because of loss of GRO?  
> > 
> > Performance loss between full and baseline ? it's hardly measurable .. 
> > less than 3% in the worst case.
> 
> The loss for crypto only vs baseline is what I meant. Obviously if we
> offload everything the CPU perf may look great but we're giving up
> flexibility and ability to fix problems in SW. 

The main difference between baseline TCP and crypto offload
is the policy/state lookup and ESP encapsulation/decapsulation.

We don't lose GRO on crypto offload. The ESP header/trailer
is stripped away in the GRO layer so that the inner TCP
packet gets GRO. But we lose hardware LRO, as the ESP
headers are not stripped in HW.

It would be really good to see where the bottlenecks are
with crypto offload (RX or TX).

Also, it would be good to see why the full offload performs
better. Some perf traces would be helpful.

When I thought about possible offloads, I came to three
different types:

1. encapsulation offload:
   - Kernel does policy enforcement
   - NIC does encapsulation
   - NIC manages anti replay window and lifetime of SAs
   - NIC sends lifetime and anti replay notifications to the kernel
   - The Kernel talks to the keymanager

2. stateful offload with fallback:
   - NIC handles the full datapath, but kernel can take over (fragments
     etc.)
   - Kernel and NIC hold the full SA and policy database
   - Kernel and NIC must sync the state (lifetime, replay window etc.)
     of SAs and policies
   - The Kernel talks to the keymanager

3. stateful offload:
   - NIC handles the full datapath
   - NIC talks directly with the keymanager
   - Kernel xfrm acts as a stub layer to pass messages from
     userspace to the NIC.

The offload that is implemented here comes closest to option 2.
The stateful handling is shared between host and NIC.
Unfortunately, this is the most complicated one.

If just encapsulation/decapsulation brings the performance,
we could also go with option 1. That would still be a
stateless offload.

Option 3 is what I would consider a full offload.
The kernel acts as a stateless stub layer, the NIC does
stateful IPsec.


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-22 16:33                       ` Jakub Kicinski
  2022-08-22 21:27                         ` Saeed Mahameed
@ 2022-08-23  5:34                         ` Leon Romanovsky
  1 sibling, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-23  5:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Steffen Klassert, Jason Gunthorpe, David S . Miller, Herbert Xu,
	netdev, Raed Salem, ipsec-devel

On Mon, Aug 22, 2022 at 09:33:04AM -0700, Jakub Kicinski wrote:
> On Mon, 22 Aug 2022 11:54:42 +0300 Leon Romanovsky wrote:
> > On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
> > > On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:  
> > > > Yup, that's what I thought you'd say. Can't argue with that use case 
> > > > if Steffen is satisfied with the technical aspects.  
> > > 
> > > Yes, everything that can help to overcome the performance problems
> > > can help and I'm interested in this type of offload. But we need to
> > > make sure the API is usable by the whole community, so I don't
> > > want an API for some special case one of the NIC vendors is
> > > interested in.  
> > 
> > BTW, we have performance data. I planned to send it as part of the cover
> > letter for v3, but it is worth sharing now.
> > 
> >  ================================================================================
> >  Performance results:
> > 
> >  TCP multi-stream, using iperf3 instance per-CPU.
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> >  |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
> >  |                      +--------+--------+--------+--------+---------+---------+
> >  |                      |                       BW (Gbps)                       |
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> >  | Baseline             | 27.9   | 59     | 93.1   | 92.8   | 93.7    | 94.4    |
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> >  | Software IPsec       | 6      | 11.9   | 23.3   | 45.9   | 83.8    | 91.8    |
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> >  | IPsec crypto offload | 15     | 29.7   | 58.5   | 89.6   | 90.4    | 90.8    |
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> >  | IPsec full offload   | 28     | 57     | 90.7   | 91     | 91.3    | 91.9    |
> >  +----------------------+--------+--------+--------+--------+---------+---------+
> > 
> >  IPsec full offload mode behaves like the baseline and reaches linerate with
> >  the same number of CPUs.
> > 
> >  Setups details (similar for both sides):
> >  * NIC: ConnectX6-DX dual port, 100 Gbps each.
> >    Single port used in the tests.
> >  * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
> 
> My questions about performance were more about where does 
> the performance loss originate. Is it because of loss of GRO?
> Maybe sharing perf traces could answer some of those questions?

Crypto mode doesn't scale well in terms of CPU usage.

CPU load data:
 * Note that this is a 160-CPU machine with 2 threads per core

Baseline:
PROCESSES  TOTAL_BW  HOST_LOCAL_CPU  HOST_REMOTE_CPU
1	   27.95     0.6	     1.1
2	   58.99     1	             2
4	   93.05     1.3	     3.2
8	   92.75     2	             3.4
16	   93.74     2.2	     4
32	   94.37     2.6	     4.5

IPsec crypto:
PROCESSES  TOTAL_BW  HOST_LOCAL_CPU  HOST_REMOTE_CPU
1	   15.04	  0.7		  1.2
2	   29.68	  1.2		  2.1
4	   58.52	  2		  3.9
8	   89.58	  2.8		  5.1
16	   90.42	  3.1		  7.1
32	   90.81	  3.16		  6.9


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-23  5:22                             ` Steffen Klassert
@ 2022-08-23 14:06                               ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2022-08-23 14:06 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Jakub Kicinski, Saeed Mahameed, Jason Gunthorpe,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Tue, Aug 23, 2022 at 07:22:03AM +0200, Steffen Klassert wrote:
> On Mon, Aug 22, 2022 at 05:17:06PM -0700, Jakub Kicinski wrote:
> > On Mon, 22 Aug 2022 14:27:16 -0700 Saeed Mahameed wrote:
> > > >My questions about performance were more about where does
> > > >the performance loss originate. Is it because of loss of GRO?  
> > > 
> > > Performance loss between full and baseline? It's hardly measurable:
> > > less than 3% in the worst case.
> > 
> > The loss for crypto only vs baseline is what I meant. Obviously if we
> > offload everything the CPU perf may look great but we're giving up
> > flexibility and ability to fix problems in SW. 
> 
> The main difference between baseline TCP and crypto offload
> is the policy/state lookup and ESP encapsulation/decapsulation.
> 
> We don't lose GRO on crypto offload. The ESP header/trailer
> is stripped away in the GRO layer so that the inner TCP
> packet gets GRO. But we lose hardware LRO, as the ESP
> headers are not stripped in HW.
> 
> It would be really good to see where the bottlenecks are
> with crypto offload (RX or TX).
> 
> Also, it would be good to see why the full offload performs
> better. Some perf traces would be helpful.
> 
> When I thought about possible offloads, I came to three
> different types:
> 
> 1. encapsulation offload:
>    - Kernel does policy enforcement
>    - NIC does encapsulation
>    - NIC manages the anti-replay window and lifetime of SAs
>    - NIC sends lifetime and anti-replay notifications to the kernel
>    - The Kernel talks to the keymanager
> 
> 2. stateful offload with fallback:
>    - NIC handles the full datapath, but kernel can take over (fragments
>      etc.)
>    - Kernel and NIC hold the full SA and policy database
>    - Kernel and NIC must sync the state (lifetime, replay window etc.)
>      of SAs and policies
>    - The Kernel talks to the keymanager
> 
> 3. stateful offload:
>    - NIC handles the full datapath
>    - NIC talks directly with the keymanager
>    - Kernel xfrm acts as a stub layer to pass messages from
>      userspace to the NIC.
> 
> The offload that is implemented here comes closest to option 2.
> The stateful handling is shared between host and NIC.
> Unfortunately, this is the most complicated one.

Our implementation is something like option 2, but without fallback.

1. I didn't implement fallback for a number of reasons:
 * It is almost impossible to keep HW and SW states in sync at linerate.
 * IPv6 sets (require???) the do-not-fragment bit.
2. NIC and kernel keep their SA and policy databases in sync, to make
sure that they both see a coherent picture and so that users can keep
using the existing iproute2 tools.
3. Like any other offload, I want to make sure that the kernel sits in
the middle between our HW and the user. This includes the keymanager.

> 
> If just encapsulation/decapsulation brings the performance,
> we could also go with option 1. That would still be a
> stateless offload.

Option 1 isn't sufficient for us, as it doesn't support the combination
of TC and the eswitch manager. In that mode, packets that arrive at the
uplink are forwarded directly to the representor without kernel involvement.

> 
> Option 3 is what I would consider a full offload.
> The kernel acts as a stateless stub layer, the NIC does
> stateful IPsec.


* Re: [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration
  2022-08-23  4:48                           ` Leon Romanovsky
@ 2022-08-26 12:20                             ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2022-08-26 12:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Saeed Mahameed, Jakub Kicinski, Steffen Klassert,
	David S . Miller, Herbert Xu, netdev, Raed Salem, ipsec-devel

On Tue, Aug 23, 2022 at 07:48:37AM +0300, Leon Romanovsky wrote:
> On Mon, Aug 22, 2022 at 02:27:16PM -0700, Saeed Mahameed wrote:
> > On 22 Aug 09:33, Jakub Kicinski wrote:
> > > On Mon, 22 Aug 2022 11:54:42 +0300 Leon Romanovsky wrote:
> > > > On Mon, Aug 22, 2022 at 10:41:05AM +0200, Steffen Klassert wrote:
> > > > > On Fri, Aug 19, 2022 at 10:53:56AM -0700, Jakub Kicinski wrote:
> > > > > > Yup, that's what I thought you'd say. Can't argue with that use case
> > > > > > if Steffen is satisfied with the technical aspects.
> > > > >
> > > > > Yes, everything that can help to overcome the performance problems
> > > > > can help and I'm interested in this type of offload. But we need to
> > > > > make sure the API is usable by the whole community, so I don't
> > > > > want an API for some special case one of the NIC vendors is
> > > > > interested in.
> > > > 
> > > > BTW, we have performance data. I planned to send it as part of the cover
> > > > letter for v3, but it is worth sharing now.
> > > > 
> > > >  ================================================================================
> > > >  Performance results:
> > > > 
> > > >  TCP multi-stream, using iperf3 instance per-CPU.
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > >  |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
> > > >  |                      +--------+--------+--------+--------+---------+---------+
> > > >  |                      |                       BW (Gbps)                       |
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > >  | Baseline             | 27.9   | 59     | 93.1   | 92.8   | 93.7    | 94.4    |
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > >  | Software IPsec       | 6      | 11.9   | 23.3   | 45.9   | 83.8    | 91.8    |
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > >  | IPsec crypto offload | 15     | 29.7   | 58.5   | 89.6   | 90.4    | 90.8    |
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > >  | IPsec full offload   | 28     | 57     | 90.7   | 91     | 91.3    | 91.9    |
> > > >  +----------------------+--------+--------+--------+--------+---------+---------+
> > > > 
> > > >  IPsec full offload mode behaves like the baseline and reaches linerate with
> > > >  the same number of CPUs.
> > > > 
> > 
> > Just making sure: Baseline == "Clear text TCP" ?
> 
> Yes, baseline is plain TCP without any encryption.
> 
> We can get higher numbers with Tariq's improvements, but achieving the
> maximum was not important, as we are interested in the differences
> between the various modes.

BW is only part of the goal here; a significant metric is how much
hypervisor CPU power is consumed by the ESP operation.

It is no different from VLAN or other encapsulating offloads,
where the ideal is to not interrupt the hypervisor CPU at all.

Jason



Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16  8:59 [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 1/6] xfrm: add new full offload flag Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 2/6] xfrm: allow state full offload mode Leon Romanovsky
2022-08-18 10:12   ` Steffen Klassert
2022-08-18 13:28     ` Leon Romanovsky
2022-08-22  8:01       ` Steffen Klassert
2022-08-22  8:46         ` Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 3/6] xfrm: add an interface to offload policy Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 4/6] xfrm: add TX datapath support for IPsec full offload mode Leon Romanovsky
2022-08-18 10:24   ` Steffen Klassert
2022-08-18 13:34     ` Leon Romanovsky
2022-08-22  8:04       ` Steffen Klassert
2022-08-22  8:50         ` Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 5/6] xfrm: add RX datapath protection " Leon Romanovsky
2022-08-18 10:27   ` Steffen Klassert
2022-08-18 13:36     ` Leon Romanovsky
2022-08-22  8:06       ` Steffen Klassert
2022-08-22  9:35         ` Leon Romanovsky
2022-08-16  8:59 ` [PATCH xfrm-next v2 6/6] xfrm: enforce separation between priorities of HW/SW policies Leon Romanovsky
2022-08-17  2:54 ` [PATCH xfrm-next v2 0/6] Extend XFRM core to allow full offload configuration Jakub Kicinski
2022-08-17  5:22   ` Leon Romanovsky
2022-08-17 18:10     ` Jakub Kicinski
2022-08-18  5:24       ` Leon Romanovsky
2022-08-18 10:10         ` Steffen Klassert
2022-08-18 12:51           ` Leon Romanovsky
2022-08-19  1:54           ` Jakub Kicinski
2022-08-19  2:34         ` Jakub Kicinski
2022-08-19  5:52           ` Leon Romanovsky
2022-08-19 15:47             ` Jakub Kicinski
2022-08-19 16:01               ` Jason Gunthorpe
2022-08-19 17:53                 ` Jakub Kicinski
2022-08-22  8:41                   ` Steffen Klassert
2022-08-22  8:54                     ` Leon Romanovsky
2022-08-22 16:33                       ` Jakub Kicinski
2022-08-22 21:27                         ` Saeed Mahameed
2022-08-23  0:17                           ` Jakub Kicinski
2022-08-23  5:22                             ` Steffen Klassert
2022-08-23 14:06                               ` Leon Romanovsky
2022-08-23  4:48                           ` Leon Romanovsky
2022-08-26 12:20                             ` Jason Gunthorpe
2022-08-23  5:34                         ` Leon Romanovsky
2022-08-18 10:09 ` Steffen Klassert
2022-08-18 13:26   ` Leon Romanovsky
2022-08-22  8:34     ` Steffen Klassert
2022-08-22  9:34       ` Leon Romanovsky
