All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 00/14] TLS offload, netdev & MLX5 support
@ 2018-03-20  2:44 Saeed Mahameed
  2018-03-20  2:44 ` [PATCH net-next 01/14] tcp: Add clean acked data hook Saeed Mahameed
                   ` (13 more replies)
  0 siblings, 14 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed

Hi Dave,

The following series from Ilya and Boris provides TLS TX inline crypto
offload.

Boris says:
===================
This series adds a generic infrastructure to offload TLS crypto to a
network devices. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path. Leaving
those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
  TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmit the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queue concurrently.

We assume that packets are not rerouted to a different network device.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

===================

The series is based on latest net-next:
c314c7ba4038 ("Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue")

Thanks,
Saeed.

--- 

Boris Pismenny (2):
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (12):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
    offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/tls: Support TLS device offload with IPv6
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS                                        |  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  21 +
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 197 +++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  87 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 278 +++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++
 .../mellanox/mlx5/core/en_accel/tls_stats.c        |  89 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  32 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 ++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 ++
 include/linux/netdev_features.h                    |   2 +
 include/linux/netdevice.h                          |  24 +
 include/linux/skbuff.h                             |   1 +
 include/net/inet_connection_sock.h                 |   2 +
 include/net/sock.h                                 |  21 +
 include/net/tls.h                                  |  70 +-
 net/Kconfig                                        |   4 +
 net/core/dev.c                                     |   4 +
 net/core/ethtool.c                                 |   1 +
 net/core/skbuff.c                                  |   9 +-
 net/ipv4/tcp_input.c                               |   2 +
 net/tls/Kconfig                                    |  10 +
 net/tls/Makefile                                   |   2 +
 net/tls/tls_device.c                               | 851 +++++++++++++++++++++
 net/tls/tls_device_fallback.c                      | 419 ++++++++++
 net/tls/tls_main.c                                 |  33 +-
 41 files changed, 3210 insertions(+), 65 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

-- 
2.14.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH net-next 01/14] tcp: Add clean acked data hook
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
@ 2018-03-20  2:44 ` Saeed Mahameed
  2018-03-20 20:36   ` Rao Shoaib
  2018-03-20  2:44 ` [PATCH net-next 02/14] net: Rename and export copy_skb_header Saeed Mahameed
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Called when a TCP segment is acknowledged.
Could be used by application protocols who hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/inet_connection_sock.h | 2 ++
 net/ipv4/tcp_input.c               | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b68fea022a82..2ab6667275df 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
+ * @icsk_clean_acked	   Clean acked data hook
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void			  *icsk_ulp_data;
+	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:6,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 451ef3012636..9854ecae7245 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3542,6 +3542,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, prior_snd_una)) {
 		flag |= FLAG_SND_UNA_ADVANCED;
 		icsk->icsk_retransmits = 0;
+		if (icsk->icsk_clean_acked)
+			icsk->icsk_clean_acked(sk, ack);
 	}
 
 	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 02/14] net: Rename and export copy_skb_header
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
  2018-03-20  2:44 ` [PATCH net-next 01/14] tcp: Add clean acked data hook Saeed Mahameed
@ 2018-03-20  2:44 ` Saeed Mahameed
  2018-03-20  2:44 ` [PATCH net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads Saeed Mahameed
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 9 +++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8340e6e8814..dc0f81277723 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1031,6 +1031,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 				   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..9ae1812fb705 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1304,7 +1304,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1312,6 +1312,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1354,7 +1355,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1418,7 +1419,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1598,7 +1599,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
  2018-03-20  2:44 ` [PATCH net-next 01/14] tcp: Add clean acked data hook Saeed Mahameed
  2018-03-20  2:44 ` [PATCH net-next 02/14] net: Rename and export copy_skb_header Saeed Mahameed
@ 2018-03-20  2:44 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 04/14] net: Add TLS offload netdev ops Saeed Mahameed
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/sock.h | 21 +++++++++++++++++++++
 net/Kconfig        |  4 ++++
 net/core/dev.c     |  4 ++++
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..92a0e0c54ac1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
 	void			(*sk_error_report)(struct sock *sk);
 	int			(*sk_backlog_rcv)(struct sock *sk,
 						  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sk_buff*		(*sk_validate_xmit_skb)(struct sock *sk,
+							struct net_device *dev,
+							struct sk_buff *skb);
+#endif
 	void                    (*sk_destruct)(struct sock *sk);
 	struct sock_reuseport __rcu	*sk_reuseport_cb;
 	struct rcu_head		sk_rcu;
@@ -2323,6 +2328,22 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+						   struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sock *sk = skb->sk;
+
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+	return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..fe84cfe3260e 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
 	bool
 	default n
 
+config SOCK_VALIDATE_XMIT
+	bool
+	default n
+
 config NET_DEVLINK
 	tristate "Network physical/parent device Netlink interface"
 	help
diff --git a/net/core/dev.c b/net/core/dev.c
index d8887cc38e7b..244a4c7ab266 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3086,6 +3086,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
 	if (unlikely(!skb))
 		goto out_null;
 
+	skb = sk_validate_xmit_skb(skb, dev);
+	if (unlikely(!skb))
+		goto out_null;
+
 	if (netif_needs_gso(skb, features)) {
 		struct sk_buff *segs;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 04/14] net: Add TLS offload netdev ops
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (2 preceding siblings ...)
  2018-03-20  2:44 ` [PATCH net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 05/14] net: Add TLS TX offload features Saeed Mahameed
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Add new netdev ops to add and delete tls context

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 913b1cc882cf..e1fef7bb6ed4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -864,6 +864,26 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+	TLS_OFFLOAD_CTX_DIR_RX,
+	TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+struct tls_context;
+
+struct tlsdev_ops {
+	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+			   enum tls_offload_ctx_dir direction,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn);
+	void (*tls_dev_del)(struct net_device *netdev,
+			    struct tls_context *ctx,
+			    enum tls_offload_ctx_dir direction);
+};
+#endif
+
 struct dev_ifalias {
 	struct rcu_head rcuhead;
 	char ifalias[];
@@ -1748,6 +1768,10 @@ struct net_device {
 	const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	const struct tlsdev_ops *tlsdev_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 05/14] net: Add TLS TX offload features
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (3 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 04/14] net: Add TLS offload netdev ops Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure Saeed Mahameed
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a netdev feature to configure TLS TX offloads.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index db84c516bcfb..18dc34202080 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,7 @@ enum {
 	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
 	NETIF_F_HW_ESP_TX_CSUM_BIT,	/* ESP with TX checksum offload */
 	NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+	NETIF_F_HW_TLS_TX_BIT,		/* Hardware TLS TX offload */
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 
@@ -145,6 +146,7 @@ enum {
 #define NETIF_F_HW_ESP		__NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
 #define	NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
+#define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 157cd9efa4be..9f07f9fe39ca 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -107,6 +107,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_HW_ESP_BIT] =		 "esp-hw-offload",
 	[NETIF_F_HW_ESP_TX_CSUM_BIT] =	 "esp-tx-csum-hw-offload",
 	[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =	 "rx-udp_tunnel-port-offload",
+	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 };
 
 static const char
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (4 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 05/14] net: Add TLS TX offload features Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-21 11:15   ` Kirill Tkhai
  2018-03-21 15:08   ` Dave Watson
  2018-03-20  2:45 ` [PATCH net-next 07/14] net/tls: Support TLS device offload with IPv6 Saeed Mahameed
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin,
	Aviad Yehezkel, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a generic infrastructure to offload TLS crypto to a
network devices. It enables the kernel TLS socket to skip encryption
and authentication operations on the transmit side of the data path.
Leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to
the TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmit the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/tls.h             |  70 +++-
 net/tls/Kconfig               |  10 +
 net/tls/Makefile              |   2 +
 net/tls/tls_device.c          | 804 ++++++++++++++++++++++++++++++++++++++++++
 net/tls/tls_device_fallback.c | 419 ++++++++++++++++++++++
 net/tls/tls_main.c            |  33 +-
 6 files changed, 1331 insertions(+), 7 deletions(-)
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

diff --git a/include/net/tls.h b/include/net/tls.h
index 4913430ab807..ab98a6dc4929 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -77,6 +77,37 @@ struct tls_sw_context {
 	struct scatterlist sg_aead_out[2];
 };
 
+struct tls_record_info {
+	struct list_head list;
+	u32 end_seq;
+	int len;
+	int num_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct tls_offload_context {
+	struct crypto_aead *aead_send;
+	spinlock_t lock;	/* protects records list */
+	struct list_head records_list;
+	struct tls_record_info *open_record;
+	struct tls_record_info *retransmit_hint;
+	u64 hint_record_sn;
+	u64 unacked_record_sn;
+
+	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
+	void (*sk_destruct)(struct sock *sk);
+	u8 driver_state[];
+	/* The TLS layer reserves room for driver specific state
+	 * Currently the belief is that there is not enough
+	 * driver specific state to justify another layer of indirection
+	 */
+#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
+};
+
+#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
+	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
+	 TLS_DRIVER_STATE_SIZE)
+
 enum {
 	TLS_PENDING_CLOSED_RECORD
 };
@@ -87,6 +118,10 @@ struct tls_context {
 		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
 	};
 
+	struct list_head list;
+	struct net_device *netdev;
+	refcount_t refcount;
+
 	void *priv_ctx;
 
 	u8 tx_conf:2;
@@ -131,9 +166,29 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_free_tx_resources(struct sock *sk);
 
-void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
-void tls_icsk_clean_acked(struct sock *sk);
+void tls_clear_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags);
+void tls_device_sk_destruct(struct sock *sk);
+void tls_device_init(void);
+void tls_device_cleanup(void);
 
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn);
+
+static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
+{
+	return rec->len == 0;
+}
+
+static inline u32 tls_record_start_seq(struct tls_record_info *rec)
+{
+	return rec->end_seq - rec->len;
+}
+
+void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
 int tls_push_sg(struct sock *sk, struct tls_context *ctx,
 		struct scatterlist *sg, u16 first_offset,
 		int flags);
@@ -170,6 +225,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
 	return tls_ctx->pending_open_record_frags;
 }
 
+static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
+{
+	return sk_fullsock(sk) &&
+	       /* matches smp_store_release in tls_set_device_offload */
+	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
+}
+
 static inline void tls_err_abort(struct sock *sk)
 {
 	sk->sk_err = EBADMSG;
@@ -257,4 +319,8 @@ static inline struct tls_offload_context *tls_offload_ctx(
 int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
 		      unsigned char *record_type);
 
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info);
+
 #endif /* _TLS_OFFLOAD_H */
diff --git a/net/tls/Kconfig b/net/tls/Kconfig
index eb583038c67e..9d3ef820bb16 100644
--- a/net/tls/Kconfig
+++ b/net/tls/Kconfig
@@ -13,3 +13,13 @@ config TLS
 	encryption handling of the TLS protocol to be done in-kernel.
 
 	If unsure, say N.
+
+config TLS_DEVICE
+	bool "Transport Layer Security HW offload"
+	depends on TLS
+	select SOCK_VALIDATE_XMIT
+	default n
+	---help---
+	Enable kernel support for HW offload of the TLS protocol.
+
+	If unsure, say N.
diff --git a/net/tls/Makefile b/net/tls/Makefile
index a930fd1c4f7b..4d6b728a67d0 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -5,3 +5,5 @@
 obj-$(CONFIG_TLS) += tls.o
 
 tls-y := tls_main.o tls_sw.o
+
+tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
new file mode 100644
index 000000000000..c0d4e11a4286
--- /dev/null
+++ b/net/tls/tls_device.c
@@ -0,0 +1,804 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ *      - Neither the name of the Mellanox Technologies nor the
+ *        names of its contributors may be used to endorse or promote
+ *        products derived from this software without specific prior written
+ *        permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE
+ */
+
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <net/inet_common.h>
+#include <linux/highmem.h>
+#include <linux/netdevice.h>
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+
+/* device_offload_lock is used to synchronize tls_dev_add
+ * against NETDEV_DOWN notifications.
+ */
+DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
+
+static void tls_device_gc_task(struct work_struct *work);
+
+static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
+static LIST_HEAD(tls_device_gc_list);
+static LIST_HEAD(tls_device_list);
+static DEFINE_SPINLOCK(tls_device_lock);
+
+static void tls_device_free_ctx(struct tls_context *ctx)
+{
+	struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
+
+	kfree(offlad_ctx);
+	kfree(ctx);
+}
+
+static void tls_device_gc_task(struct work_struct *work)
+{
+	struct tls_context *ctx, *tmp;
+	struct list_head gc_list;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	INIT_LIST_HEAD(&gc_list);
+	list_splice_init(&tls_device_gc_list, &gc_list);
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
+		struct net_device *netdev = ctx->netdev;
+
+		if (netdev) {
+			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+							TLS_OFFLOAD_CTX_DIR_TX);
+			dev_put(netdev);
+		}
+
+		list_del(&ctx->list);
+		tls_device_free_ctx(ctx);
+	}
+}
+
+static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	list_move_tail(&ctx->list, &tls_device_gc_list);
+
+	/* schedule_work inside the spinlock
+	 * to make sure tls_device_down waits for that work.
+	 */
+	schedule_work(&tls_device_gc_work);
+
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+}
+
+/* We assume that the socket is already connected */
+static struct net_device *get_netdev_for_sock(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct net_device *netdev = NULL;
+
+	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+
+	return netdev;
+}
+
+static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
+				 struct tls_context *ctx)
+{
+	int rc;
+
+	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
+					     &ctx->crypto_send,
+					     tcp_sk(sk)->write_seq);
+	if (rc) {
+		pr_err_ratelimited("The netdev has refused to offload this socket\n");
+		goto out;
+	}
+
+	rc = 0;
+out:
+	return rc;
+}
+
+static void destroy_record(struct tls_record_info *record)
+{
+	skb_frag_t *frag;
+	int nr_frags = record->num_frags;
+
+	while (nr_frags > 0) {
+		frag = &record->frags[nr_frags - 1];
+		__skb_frag_unref(frag);
+		--nr_frags;
+	}
+	kfree(record);
+}
+
+static void delete_all_records(struct tls_offload_context *offload_ctx)
+{
+	struct tls_record_info *info, *temp;
+
+	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
+		list_del(&info->list);
+		destroy_record(info);
+	}
+
+	offload_ctx->retransmit_hint = NULL;
+}
+
+static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx;
+	struct tls_record_info *info, *temp;
+	unsigned long flags;
+	u64 deleted_records = 0;
+
+	if (!tls_ctx)
+		return;
+
+	ctx = tls_offload_ctx(tls_ctx);
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	info = ctx->retransmit_hint;
+	if (info && !before(acked_seq, info->end_seq)) {
+		ctx->retransmit_hint = NULL;
+		list_del(&info->list);
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
+		if (before(acked_seq, info->end_seq))
+			break;
+		list_del(&info->list);
+
+		destroy_record(info);
+		deleted_records++;
+	}
+
+	ctx->unacked_record_sn += deleted_records;
+	spin_unlock_irqrestore(&ctx->lock, flags);
+}
+
+/* At this point, there should be no references on this
+ * socket and no in-flight SKBs associated with this
+ * socket, so it is safe to free all the resources.
+ */
+void tls_device_sk_destruct(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+
+	if (ctx->open_record)
+		destroy_record(ctx->open_record);
+
+	delete_all_records(ctx);
+	crypto_free_aead(ctx->aead_send);
+	ctx->sk_destruct(sk);
+
+	if (refcount_dec_and_test(&tls_ctx->refcount))
+		tls_device_queue_ctx_destruction(tls_ctx);
+}
+EXPORT_SYMBOL(tls_device_sk_destruct);
+
+static inline void tls_append_frag(struct tls_record_info *record,
+				   struct page_frag *pfrag,
+				   int size)
+{
+	skb_frag_t *frag;
+
+	frag = &record->frags[record->num_frags - 1];
+	if (frag->page.p == pfrag->page &&
+	    frag->page_offset + frag->size == pfrag->offset) {
+		frag->size += size;
+	} else {
+		++frag;
+		frag->page.p = pfrag->page;
+		frag->page_offset = pfrag->offset;
+		frag->size = size;
+		++record->num_frags;
+		get_page(pfrag->page);
+	}
+
+	pfrag->offset += size;
+	record->len += size;
+}
+
+static inline int tls_push_record(struct sock *sk,
+				  struct tls_context *ctx,
+				  struct tls_offload_context *offload_ctx,
+				  struct tls_record_info *record,
+				  struct page_frag *pfrag,
+				  int flags,
+				  unsigned char record_type)
+{
+	skb_frag_t *frag;
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct page_frag fallback_frag;
+	struct page_frag  *tag_pfrag = pfrag;
+	int i;
+
+	/* fill prepand */
+	frag = &record->frags[0];
+	tls_fill_prepend(ctx,
+			 skb_frag_address(frag),
+			 record->len - ctx->prepend_size,
+			 record_type);
+
+	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
+		/* HW doesn't care about the data in the tag
+		 * so in case pfrag has no room
+		 * for a tag and we can't allocate a new pfrag
+		 * just use the page in the first frag
+		 * rather then write a complicated fall back code.
+		 */
+		tag_pfrag = &fallback_frag;
+		tag_pfrag->page = skb_frag_page(frag);
+		tag_pfrag->offset = 0;
+	}
+
+	tls_append_frag(record, tag_pfrag, ctx->tag_size);
+	record->end_seq = tp->write_seq + record->len;
+	spin_lock_irq(&offload_ctx->lock);
+	list_add_tail(&record->list, &offload_ctx->records_list);
+	spin_unlock_irq(&offload_ctx->lock);
+	offload_ctx->open_record = NULL;
+	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	tls_advance_record_sn(sk, ctx);
+
+	for (i = 0; i < record->num_frags; i++) {
+		frag = &record->frags[i];
+		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
+		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
+			    frag->size, frag->page_offset);
+		sk_mem_charge(sk, frag->size);
+		get_page(skb_frag_page(frag));
+	}
+	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
+
+	/* all ready, send */
+	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
+}
+
+static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
+					struct page_frag *pfrag,
+					size_t prepend_size)
+{
+	skb_frag_t *frag;
+	struct tls_record_info *record;
+
+	record = kmalloc(sizeof(*record), GFP_KERNEL);
+	if (!record)
+		return -ENOMEM;
+
+	frag = &record->frags[0];
+	__skb_frag_set_page(frag, pfrag->page);
+	frag->page_offset = pfrag->offset;
+	skb_frag_size_set(frag, prepend_size);
+
+	get_page(pfrag->page);
+	pfrag->offset += prepend_size;
+
+	record->num_frags = 1;
+	record->len = prepend_size;
+	offload_ctx->open_record = record;
+	return 0;
+}
+
+static inline int tls_do_allocation(struct sock *sk,
+				    struct tls_offload_context *offload_ctx,
+				    struct page_frag *pfrag,
+				    size_t prepend_size)
+{
+	int ret;
+
+	if (!offload_ctx->open_record) {
+		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
+						   sk->sk_allocation))) {
+			sk->sk_prot->enter_memory_pressure(sk);
+			sk_stream_moderate_sndbuf(sk);
+			return -ENOMEM;
+		}
+
+		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+		if (ret)
+			return ret;
+
+		if (pfrag->size > pfrag->offset)
+			return 0;
+	}
+
+	if (!sk_page_frag_refill(sk, pfrag))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int tls_push_data(struct sock *sk,
+			 struct iov_iter *msg_iter,
+			 size_t size, int flags,
+			 unsigned char record_type)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	struct tls_record_info *record = ctx->open_record;
+	struct page_frag *pfrag;
+	int copy, rc = 0;
+	size_t orig_size = size;
+	u32 max_open_record_len;
+	long timeo;
+	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
+	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+	bool done = false;
+
+	if (flags &
+	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
+		return -ENOTSUPP;
+
+	if (sk->sk_err)
+		return -sk->sk_err;
+
+	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
+	if (rc < 0)
+		return rc;
+
+	pfrag = sk_page_frag(sk);
+
+	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
+	 * we need to leave room for an authentication tag.
+	 */
+	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
+			      tls_ctx->prepend_size;
+	do {
+		if (tls_do_allocation(sk, ctx, pfrag,
+				      tls_ctx->prepend_size)) {
+			rc = sk_stream_wait_memory(sk, &timeo);
+			if (!rc)
+				continue;
+
+			record = ctx->open_record;
+			if (!record)
+				break;
+handle_error:
+			if (record_type != TLS_RECORD_TYPE_DATA) {
+				/* avoid sending partial
+				 * record with type !=
+				 * application_data
+				 */
+				size = orig_size;
+				destroy_record(record);
+				ctx->open_record = NULL;
+			} else if (record->len > tls_ctx->prepend_size) {
+				goto last_record;
+			}
+
+			break;
+		}
+
+		record = ctx->open_record;
+		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
+		copy = min_t(size_t, copy, (max_open_record_len - record->len));
+
+		if (copy_from_iter_nocache(page_address(pfrag->page) +
+					       pfrag->offset,
+					   copy, msg_iter) != copy) {
+			rc = -EFAULT;
+			goto handle_error;
+		}
+		tls_append_frag(record, pfrag, copy);
+
+		size -= copy;
+		if (!size) {
+last_record:
+			tls_push_record_flags = flags;
+			if (more) {
+				tls_ctx->pending_open_record_frags =
+						record->num_frags;
+				break;
+			}
+
+			done = true;
+		}
+
+		if ((done) || record->len >= max_open_record_len ||
+		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
+			rc = tls_push_record(sk,
+					     tls_ctx,
+					     ctx,
+					     record,
+					     pfrag,
+					     tls_push_record_flags,
+					     record_type);
+			if (rc < 0)
+				break;
+		}
+	} while (!done);
+
+	if (orig_size - size > 0)
+		rc = orig_size - size;
+
+	return rc;
+}
+
+int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	unsigned char record_type = TLS_RECORD_TYPE_DATA;
+	int rc = 0;
+
+	lock_sock(sk);
+
+	if (unlikely(msg->msg_controllen)) {
+		rc = tls_proccess_cmsg(sk, msg, &record_type);
+		if (rc)
+			goto out;
+	}
+
+	rc = tls_push_data(sk, &msg->msg_iter, size,
+			   msg->msg_flags, record_type);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+int tls_device_sendpage(struct sock *sk, struct page *page,
+			int offset, size_t size, int flags)
+{
+	struct iov_iter	msg_iter;
+	struct kvec iov;
+	char *kaddr = kmap(page);
+	int rc = 0;
+
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		flags |= MSG_MORE;
+
+	lock_sock(sk);
+
+	if (flags & MSG_OOB) {
+		rc = -ENOTSUPP;
+		goto out;
+	}
+
+	iov.iov_base = kaddr + offset;
+	iov.iov_len = size;
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
+	rc = tls_push_data(sk, &msg_iter, size,
+			   flags, TLS_RECORD_TYPE_DATA);
+	kunmap(page);
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+struct tls_record_info *tls_get_record(struct tls_offload_context *context,
+				       u32 seq, u64 *p_record_sn)
+{
+	struct tls_record_info *info;
+	u64 record_sn = context->hint_record_sn;
+
+	info = context->retransmit_hint;
+	if (!info ||
+	    before(seq, info->end_seq - info->len)) {
+		/* if retransmit_hint is irrelevant start
+		 * from the begging of the list
+		 */
+		info = list_first_entry(&context->records_list,
+					struct tls_record_info, list);
+		record_sn = context->unacked_record_sn;
+	}
+
+	list_for_each_entry_from(info, &context->records_list, list) {
+		if (before(seq, info->end_seq)) {
+			if (!context->retransmit_hint ||
+			    after(info->end_seq,
+				  context->retransmit_hint->end_seq)) {
+				context->hint_record_sn = record_sn;
+				context->retransmit_hint = info;
+			}
+			*p_record_sn = record_sn;
+			return info;
+		}
+		record_sn++;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL(tls_get_record);
+
+static int tls_device_push_pending_record(struct sock *sk, int flags)
+{
+	struct iov_iter	msg_iter;
+
+	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
+	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
+}
+
+int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
+{
+	u16 nonece_size, tag_size, iv_size, rec_seq_size;
+	struct tls_record_info *start_marker_record;
+	struct tls_offload_context *offload_ctx;
+	struct tls_crypto_info *crypto_info;
+	struct net_device *netdev;
+	char *iv, *rec_seq;
+	struct sk_buff *skb;
+	int rc = -EINVAL;
+	__be64 rcd_sn;
+
+	if (!ctx)
+		goto out;
+
+	if (ctx->priv_ctx) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	/* We support starting offload on multiple sockets
+	 * concurrently, So we only need a read lock here.
+	 */
+	percpu_down_read(&device_offload_lock);
+	netdev = get_netdev_for_sock(sk);
+	if (!netdev) {
+		pr_err_ratelimited("%s: netdev not found\n", __func__);
+		rc = -EINVAL;
+		goto release_lock;
+	}
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
+		rc = -ENOTSUPP;
+		goto release_netdev;
+	}
+
+	/* Avoid offloading if the device is down
+	 * We don't want to offload new flows after
+	 * the NETDEV_DOWN event
+	 */
+	if (!(netdev->flags & IFF_UP)) {
+		rc = -EINVAL;
+		goto release_lock;
+	}
+
+	crypto_info = &ctx->crypto_send;
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128: {
+		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
+		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
+		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
+		rec_seq =
+		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
+		break;
+	}
+	default:
+		rc = -EINVAL;
+		goto release_netdev;
+	}
+
+	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
+	if (!start_marker_record) {
+		rc = -ENOMEM;
+		goto release_netdev;
+	}
+
+	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
+	if (!offload_ctx)
+		goto free_marker_record;
+
+	ctx->priv_ctx = offload_ctx;
+	rc = attach_sock_to_netdev(sk, netdev, ctx);
+	if (rc)
+		goto free_offload_context;
+
+	ctx->netdev = netdev;
+	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
+	ctx->tag_size = tag_size;
+	ctx->iv_size = iv_size;
+	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
+			  GFP_KERNEL);
+	if (!ctx->iv) {
+		rc = -ENOMEM;
+		goto detach_sock;
+	}
+
+	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
+
+	ctx->rec_seq_size = rec_seq_size;
+	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
+	if (!ctx->rec_seq) {
+		rc = -ENOMEM;
+		goto free_iv;
+	}
+	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
+
+	/* start at rec_seq - 1 to account for the start marker record */
+	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
+	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
+
+	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
+	if (rc)
+		goto free_rec_seq;
+
+	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
+	start_marker_record->len = 0;
+	start_marker_record->num_frags = 0;
+
+	INIT_LIST_HEAD(&offload_ctx->records_list);
+	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
+	spin_lock_init(&offload_ctx->lock);
+
+	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
+	ctx->push_pending_record = tls_device_push_pending_record;
+	offload_ctx->sk_destruct = sk->sk_destruct;
+
+	/* TLS offload is greatly simplified if we don't send
+	 * SKBs where only part of the payload needs to be encrypted.
+	 * So mark the last skb in the write queue as end of record.
+	 */
+	skb = tcp_write_queue_tail(sk);
+	if (skb)
+		TCP_SKB_CB(skb)->eor = 1;
+
+	refcount_set(&ctx->refcount, 1);
+	spin_lock_irq(&tls_device_lock);
+	list_add_tail(&ctx->list, &tls_device_list);
+	spin_unlock_irq(&tls_device_lock);
+
+	/* following this assignment tls_is_sk_tx_device_offloaded
+	 * will return true and the context might be accessed
+	 * by the netdev's xmit function.
+	 */
+	smp_store_release(&sk->sk_destruct,
+			  &tls_device_sk_destruct);
+	goto release_lock;
+
+free_rec_seq:
+	kfree(ctx->rec_seq);
+free_iv:
+	kfree(ctx->iv);
+detach_sock:
+	netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX);
+free_offload_context:
+	kfree(offload_ctx);
+	ctx->priv_ctx = NULL;
+free_marker_record:
+	kfree(start_marker_record);
+release_netdev:
+	dev_put(netdev);
+release_lock:
+	percpu_up_read(&device_offload_lock);
+out:
+	return rc;
+}
+
+static int tls_device_register(struct net_device *dev)
+{
+	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
+		return NOTIFY_BAD;
+
+	return NOTIFY_DONE;
+}
+
+static int tls_device_unregister(struct net_device *dev)
+{
+	return NOTIFY_DONE;
+}
+
+static int tls_device_feat_change(struct net_device *dev)
+{
+	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
+		return NOTIFY_BAD;
+
+	return NOTIFY_DONE;
+}
+
+static int tls_device_down(struct net_device *netdev)
+{
+	struct tls_context *ctx, *tmp;
+	struct list_head list;
+	unsigned long flags;
+
+	if (!(netdev->features & NETIF_F_HW_TLS_TX))
+		return NOTIFY_DONE;
+
+	/* Request a write lock to block new offload attempts
+	 */
+	percpu_down_write(&device_offload_lock);
+
+	spin_lock_irqsave(&tls_device_lock, flags);
+	INIT_LIST_HEAD(&list);
+
+	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
+		if (ctx->netdev != netdev ||
+		    !refcount_inc_not_zero(&ctx->refcount))
+			continue;
+
+		list_move(&ctx->list, &list);
+	}
+	spin_unlock_irqrestore(&tls_device_lock, flags);
+
+	list_for_each_entry_safe(ctx, tmp, &list, list)	{
+		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
+						TLS_OFFLOAD_CTX_DIR_TX);
+		ctx->netdev = NULL;
+		dev_put(netdev);
+		list_del_init(&ctx->list);
+
+		if (refcount_dec_and_test(&ctx->refcount))
+			tls_device_free_ctx(ctx);
+	}
+
+	percpu_up_write(&device_offload_lock);
+
+	flush_work(&tls_device_gc_work);
+
+	return NOTIFY_DONE;
+}
+
+static int tls_dev_event(struct notifier_block *this, unsigned long event,
+			 void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return tls_device_register(dev);
+
+	case NETDEV_UNREGISTER:
+		return tls_device_unregister(dev);
+
+	case NETDEV_FEAT_CHANGE:
+		return tls_device_feat_change(dev);
+
+	case NETDEV_DOWN:
+		return tls_device_down(dev);
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block tls_dev_notifier = {
+	.notifier_call	= tls_dev_event,
+};
+
+void __init tls_device_init(void)
+{
+	register_netdevice_notifier(&tls_dev_notifier);
+}
+
+void __exit tls_device_cleanup(void)
+{
+	unregister_netdevice_notifier(&tls_dev_notifier);
+	flush_work(&tls_device_gc_work);
+}
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
new file mode 100644
index 000000000000..14d31a36885c
--- /dev/null
+++ b/net/tls/tls_device_fallback.c
@@ -0,0 +1,419 @@
+/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ *      - Neither the name of the Mellanox Technologies nor the
+ *        names of its contributors may be used to endorse or promote
+ *        products derived from this software without specific prior written
+ *        permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE
+ */
+
+#include <net/tls.h>
+#include <crypto/aead.h>
+#include <crypto/scatterwalk.h>
+#include <net/ip6_checksum.h>
+
+static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
+{
+	struct scatterlist *src = walk->sg;
+	int diff = walk->offset - src->offset;
+
+	sg_set_page(sg, sg_page(src),
+		    src->length - diff, walk->offset);
+
+	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
+}
+
+static int tls_enc_record(struct aead_request *aead_req,
+			  struct crypto_aead *aead, char *aad, char *iv,
+			  __be64 rcd_sn, struct scatter_walk *in,
+			  struct scatter_walk *out, int *in_len)
+{
+	struct scatterlist sg_in[3];
+	struct scatterlist sg_out[3];
+	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
+	u16 len;
+	int rc;
+
+	len = min_t(int, *in_len, ARRAY_SIZE(buf));
+
+	scatterwalk_copychunks(buf, in, len, 0);
+	scatterwalk_copychunks(buf, out, len, 1);
+
+	*in_len -= len;
+	if (!*in_len)
+		return 0;
+
+	scatterwalk_pagedone(in, 0, 1);
+	scatterwalk_pagedone(out, 1, 1);
+
+	len = buf[4] | (buf[3] << 8);
+	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
+
+	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
+		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
+
+	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
+	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
+
+	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
+	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
+	chain_to_walk(sg_in + 1, in);
+	chain_to_walk(sg_out + 1, out);
+
+	*in_len -= len;
+	if (*in_len < 0) {
+		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+		if (*in_len < 0)
+		/* the input buffer doesn't contain the entire record.
+		 * trim len accordingly. The resulting authentication tag
+		 * will contain garbage. but we don't care as we won't
+		 * include any of it in the output skb
+		 * Note that we assume the output buffer length
+		 * is larger then input buffer length + tag size
+		 */
+			len += *in_len;
+
+		*in_len = 0;
+	}
+
+	if (*in_len) {
+		scatterwalk_copychunks(NULL, in, len, 2);
+		scatterwalk_pagedone(in, 0, 1);
+		scatterwalk_copychunks(NULL, out, len, 2);
+		scatterwalk_pagedone(out, 1, 1);
+	}
+
+	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
+	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
+
+	rc = crypto_aead_encrypt(aead_req);
+
+	return rc;
+}
+
+static void tls_init_aead_request(struct aead_request *aead_req,
+				  struct crypto_aead *aead)
+{
+	aead_request_set_tfm(aead_req, aead);
+	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
+}
+
+static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
+						   gfp_t flags)
+{
+	unsigned int req_size = sizeof(struct aead_request) +
+		crypto_aead_reqsize(aead);
+	struct aead_request *aead_req;
+
+	aead_req = kzalloc(req_size, flags);
+	if (!aead_req)
+		return NULL;
+
+	tls_init_aead_request(aead_req, aead);
+	return aead_req;
+}
+
+static int tls_enc_records(struct aead_request *aead_req,
+			   struct crypto_aead *aead, struct scatterlist *sg_in,
+			   struct scatterlist *sg_out, char *aad, char *iv,
+			   u64 rcd_sn, int len)
+{
+	struct scatter_walk in;
+	struct scatter_walk out;
+	int rc;
+
+	scatterwalk_start(&in, sg_in);
+	scatterwalk_start(&out, sg_out);
+
+	do {
+		rc = tls_enc_record(aead_req, aead, aad, iv,
+				    cpu_to_be64(rcd_sn), &in, &out, &len);
+		rcd_sn++;
+
+	} while (rc == 0 && len);
+
+	scatterwalk_done(&in, 0, 0);
+	scatterwalk_done(&out, 1, 0);
+
+	return rc;
+}
+
+static inline void update_chksum(struct sk_buff *skb, int headln)
+{
+	/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
+	 * might have been changed by NAT.
+	 */
+
+	const struct ipv6hdr *ipv6h;
+	const struct iphdr *iph;
+	struct tcphdr *th = tcp_hdr(skb);
+	int datalen = skb->len - headln;
+
+	/* We only changed the payload so if we are using partial we don't
+	 * need to update anything.
+	 */
+	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
+		return;
+
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct tcphdr, check);
+
+	if (skb->sk->sk_family == AF_INET6) {
+		ipv6h = ipv6_hdr(skb);
+		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
+					     datalen, IPPROTO_TCP, 0);
+	} else {
+		iph = ip_hdr(skb);
+		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
+					       IPPROTO_TCP, 0);
+	}
+}
+
+static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
+{
+	skb_copy_header(nskb, skb);
+
+	skb_put(nskb, skb->len);
+	memcpy(nskb->data, skb->data, headln);
+	update_chksum(nskb, headln);
+
+	nskb->destructor = skb->destructor;
+	nskb->sk = skb->sk;
+	skb->destructor = NULL;
+	skb->sk = NULL;
+	refcount_add(nskb->truesize - skb->truesize,
+		     &nskb->sk->sk_wmem_alloc);
+}
+
+/* This function may be called after the user socket is already
+ * closed so make sure we don't use anything freed during
+ * tls_sk_proto_close here
+ */
+static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
+{
+	int tcp_header_size = tcp_hdrlen(skb);
+	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
+	int payload_len = skb->len - tcp_payload_offset;
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
+	int remaining, buf_len, resync_sgs, rc, i = 0;
+	void *buf, *dummy_buf, *iv, *aad;
+	struct scatterlist *sg_in;
+	struct scatterlist sg_out[3];
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct aead_request *aead_req;
+	struct sk_buff *nskb = NULL;
+	struct tls_record_info *record;
+	unsigned long flags;
+	s32 sync_size;
+	u64 rcd_sn;
+
+	/* worst case is:
+	 * MAX_SKB_FRAGS in tls_record_info
+	 * MAX_SKB_FRAGS + 1 in SKB head an frags.
+	 */
+	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
+
+	if (!payload_len)
+		return skb;
+
+	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
+	if (!sg_in)
+		goto free_orig;
+
+	sg_init_table(sg_in, sg_in_max_elements);
+	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
+
+	spin_lock_irqsave(&ctx->lock, flags);
+	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
+	if (!record) {
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		WARN(1, "Record not found for seq %u\n", tcp_seq);
+		goto free_sg;
+	}
+
+	sync_size = tcp_seq - tls_record_start_seq(record);
+	if (sync_size < 0) {
+		int is_start_marker = tls_record_is_start_marker(record);
+
+		spin_unlock_irqrestore(&ctx->lock, flags);
+		if (!is_start_marker)
+		/* This should only occur if the relevant record was
+		 * already acked. In that case it should be ok
+		 * to drop the packet and avoid retransmission.
+		 *
+		 * There is a corner case where the packet contains
+		 * both an acked and a non-acked record.
+		 * We currently don't handle that case and rely
+		 * on TCP to retranmit a packet that doesn't contain
+		 * already acked payload.
+		 */
+			goto free_orig;
+
+		if (payload_len > -sync_size) {
+			WARN(1, "Fallback of partially offloaded packets is not supported\n");
+			goto free_sg;
+		} else {
+			return skb;
+		}
+	}
+
+	remaining = sync_size;
+	while (remaining > 0) {
+		skb_frag_t *frag = &record->frags[i];
+
+		__skb_frag_ref(frag);
+		sg_set_page(sg_in + i, skb_frag_page(frag),
+			    skb_frag_size(frag), frag->page_offset);
+
+		remaining -= skb_frag_size(frag);
+
+		if (remaining < 0)
+			sg_in[i].length += remaining;
+
+		i++;
+	}
+	spin_unlock_irqrestore(&ctx->lock, flags);
+	resync_sgs = i;
+
+	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
+	if (!aead_req)
+		goto put_sg;
+
+	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
+		  TLS_AAD_SPACE_SIZE +
+		  sync_size +
+		  tls_ctx->tag_size;
+	buf = kmalloc(buf_len, GFP_ATOMIC);
+	if (!buf)
+		goto free_req;
+
+	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
+	if (!nskb)
+		goto free_buf;
+
+	skb_reserve(nskb, skb_headroom(skb));
+
+	iv = buf;
+
+	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
+	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
+	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
+	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
+
+	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
+	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
+		   payload_len);
+	/* Add room for authentication tag produced by crypto */
+	dummy_buf += sync_size;
+	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
+	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
+			  payload_len);
+	if (rc < 0)
+		goto free_nskb;
+
+	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
+			     rcd_sn, sync_size + payload_len);
+	if (rc < 0)
+		goto free_nskb;
+
+	complete_skb(nskb, skb, tcp_payload_offset);
+
+	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
+	 * nskb->prev will point to the skb itself
+	 */
+	nskb->prev = nskb;
+free_buf:
+	kfree(buf);
+free_req:
+	kfree(aead_req);
+put_sg:
+	for (i = 0; i < resync_sgs; i++)
+		put_page(sg_page(&sg_in[i]));
+free_sg:
+	kfree(sg_in);
+free_orig:
+	kfree_skb(skb);
+	return nskb;
+
+free_nskb:
+	kfree_skb(nskb);
+	nskb = NULL;
+	goto free_buf;
+}
+
+static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
+					     struct net_device *dev,
+					     struct sk_buff *skb)
+{
+	if (dev == tls_get_ctx(sk)->netdev)
+		return skb;
+
+	return tls_sw_fallback(sk, skb);
+}
+
+int tls_sw_fallback_init(struct sock *sk,
+			 struct tls_offload_context *offload_ctx,
+			 struct tls_crypto_info *crypto_info)
+{
+	int rc;
+	const u8 *key;
+
+	offload_ctx->aead_send =
+	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(offload_ctx->aead_send)) {
+		rc = PTR_ERR(offload_ctx->aead_send);
+		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
+		offload_ctx->aead_send = NULL;
+		goto err_out;
+	}
+
+	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
+
+	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
+				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+	if (rc)
+		goto free_aead;
+
+	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
+				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
+	if (rc)
+		goto free_aead;
+
+	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
+	return 0;
+free_aead:
+	crypto_free_aead(offload_ctx->aead_send);
+err_out:
+	return rc;
+}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index d824d548447e..e0dface33017 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -54,6 +54,9 @@ enum {
 enum {
 	TLS_BASE_TX,
 	TLS_SW_TX,
+#ifdef CONFIG_TLS_DEVICE
+	TLS_HW_TX,
+#endif
 	TLS_NUM_CONFIG,
 };
 
@@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
 		goto err_crypto_info;
 	}
 
-	/* currently SW is default, we will have ethtool in future */
-	rc = tls_set_sw_offload(sk, ctx);
-	tx_conf = TLS_SW_TX;
-	if (rc)
-		goto err_crypto_info;
+#ifdef CONFIG_TLS_DEVICE
+	rc = tls_set_device_offload(sk, ctx);
+	tx_conf = TLS_HW_TX;
+	if (rc) {
+#else
+	{
+#endif
+		/* if HW offload fails fallback to SW */
+		rc = tls_set_sw_offload(sk, ctx);
+		tx_conf = TLS_SW_TX;
+		if (rc)
+			goto err_crypto_info;
+	}
 
 	ctx->tx_conf = tx_conf;
 	update_sk_prot(sk, ctx);
@@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
 	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
 	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
+
+#ifdef CONFIG_TLS_DEVICE
+	prot[TLS_HW_TX] = prot[TLS_SW_TX];
+	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
+	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
+#endif
 }
 
 static int tls_init(struct sock *sk)
@@ -531,6 +548,9 @@ static int __init tls_register(void)
 {
 	build_protos(tls_prots[TLSV4], &tcp_prot);
 
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_init();
+#endif
 	tcp_register_ulp(&tcp_tls_ulp_ops);
 
 	return 0;
@@ -539,6 +559,9 @@ static int __init tls_register(void)
 static void __exit tls_unregister(void)
 {
 	tcp_unregister_ulp(&tcp_tls_ulp_ops);
+#ifdef CONFIG_TLS_DEVICE
+	tls_device_cleanup();
+#endif
 }
 
 module_init(tls_register);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 07/14] net/tls: Support TLS device offload with IPv6
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (5 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 08/14] net/mlx5e: Move defines out of ipsec code Saeed Mahameed
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Previously get_netdev_for_sock worked only with IPv4.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/tls/tls_device.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index c0d4e11a4286..6d4d4d513b84 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -37,6 +37,11 @@
 #include <net/inet_common.h>
 #include <linux/highmem.h>
 #include <linux/netdevice.h>
+#include <net/addrconf.h>
+#include <net/flow.h>
+#include <linux/ipv6.h>
+#include <net/dst.h>
+#include <linux/security.h>
 
 #include <net/tls.h>
 #include <crypto/aead.h>
@@ -101,13 +106,55 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
 	spin_unlock_irqrestore(&tls_device_lock, flags);
 }
 
+static inline struct net_device *ipv6_get_netdev(struct sock *sk)
+{
+	struct net_device *dev = NULL;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct flowi6 _fl6, *fl6 = &_fl6;
+	struct dst_entry *dst;
+
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_proto = sk->sk_protocol;
+	fl6->daddr = sk->sk_v6_daddr;
+	fl6->saddr = np->saddr;
+	fl6->flowlabel = np->flow_label;
+	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+	fl6->flowi6_oif = sk->sk_bound_dev_if;
+	fl6->flowi6_mark = sk->sk_mark;
+	fl6->fl6_sport = inet->inet_sport;
+	fl6->fl6_dport = inet->inet_dport;
+	fl6->flowi6_uid = sk->sk_uid;
+	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+
+	if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
+		return NULL;
+
+	dev = dst->dev;
+	dev_hold(dev);
+	dst_release(dst);
+
+#endif
+	return dev;
+}
+
 /* We assume that the socket is already connected */
 static struct net_device *get_netdev_for_sock(struct sock *sk)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct net_device *netdev = NULL;
 
-	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
+	if (sk->sk_family == AF_INET)
+		netdev = dev_get_by_index(sock_net(sk),
+					  inet->cork.fl.flowi_oif);
+	else if (sk->sk_family == AF_INET6) {
+		netdev = ipv6_get_netdev(sk);
+		if (!netdev && !sk->sk_ipv6only &&
+		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
+			netdev = dev_get_by_index(sock_net(sk),
+						  inet->cork.fl.flowi_oif);
+	}
 
 	return netdev;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 08/14] net/mlx5e: Move defines out of ipsec code
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (6 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 07/14] net/tls: Support TLS device offload with IPv6 Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface Saeed Mahameed
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

The defines are not IPSEC specific.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h             | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h | 3 ---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c     | 5 +----
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h       | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..6660986285bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,9 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
+#define MLX5E_METADATA_ETHER_LEN 8
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
index 1198fc1eba4c..93bf10e6508c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.h
@@ -45,9 +45,6 @@
 #define MLX5E_IPSEC_SADB_RX_BITS 10
 #define MLX5E_IPSEC_ESN_SCOPE_MID 0x80000000L
 
-#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
-#define MLX5E_METADATA_ETHER_LEN 8
-
 struct mlx5e_priv;
 
 struct mlx5e_ipsec_sw_stats {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
index 4f1568528738..a6b672840e34 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
@@ -43,9 +43,6 @@
 #include "fpga/sdk.h"
 #include "fpga/core.h"
 
-#define SBU_QP_QUEUE_SIZE 8
-#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC	(60 * 1000)
-
 enum mlx5_fpga_ipsec_cmd_status {
 	MLX5_FPGA_IPSEC_CMD_PENDING,
 	MLX5_FPGA_IPSEC_CMD_SEND_FAIL,
@@ -258,7 +255,7 @@ static int mlx5_fpga_ipsec_cmd_wait(void *ctx)
 {
 	struct mlx5_fpga_ipsec_cmd_context *context = ctx;
 	unsigned long timeout =
-		msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
+		msecs_to_jiffies(MLX5_FPGA_CMD_TIMEOUT_MSEC);
 	int res;
 
 	res = wait_for_completion_timeout(&context->complete, timeout);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
index baa537e54a49..a0573cc2fc9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h
@@ -41,6 +41,8 @@
  * DOC: Innova SDK
  * This header defines the in-kernel API for Innova FPGA client drivers.
  */
+#define SBU_QP_QUEUE_SIZE 8
+#define MLX5_FPGA_CMD_TIMEOUT_MSEC (60 * 1000)
 
 enum mlx5_fpga_access_type {
 	MLX5_FPGA_ACCESS_TYPE_I2C = 0x0,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (7 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 08/14] net/mlx5e: Move defines out of ipsec code Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support Saeed Mahameed
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Add routines for manipulating TLS TX offload contexts.

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Add implementation for Innova TLS (FPGA-based) hardware.

These routines will be used by the TLS offload support in a later patch

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.

In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 ++++
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 +++
 9 files changed, 879 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c805769d92a9..9989e5265a45 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,10 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
 		diag/fs_tracepoint.o
 
-mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o
+mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
 mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
-		fpga/ipsec.o
+		fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
new file mode 100644
index 000000000000..77ac19f38cbe
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+
+#include "accel/tls.h"
+#include "mlx5_core.h"
+#include "fpga/tls.h"
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
+					 start_offload_tcp_sn, p_swid);
+}
+
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+{
+	mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL);
+}
+
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_is_tls_device(mdev);
+}
+
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_device_caps(mdev);
+}
+
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_init(mdev);
+}
+
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	mlx5_fpga_tls_cleanup(mdev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
new file mode 100644
index 000000000000..6f9c9f446ecc
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_ACCEL_TLS_H__
+#define __MLX5_ACCEL_TLS_H__
+
+#include <linux/mlx5/driver.h>
+#include <linux/tls.h>
+
+#ifdef CONFIG_MLX5_ACCEL
+
+enum {
+	MLX5_ACCEL_TLS_TX = BIT(0),
+	MLX5_ACCEL_TLS_RX = BIT(1),
+	MLX5_ACCEL_TLS_V12 = BIT(2),
+	MLX5_ACCEL_TLS_V13 = BIT(3),
+	MLX5_ACCEL_TLS_LRO = BIT(4),
+	MLX5_ACCEL_TLS_IPV6 = BIT(5),
+	MLX5_ACCEL_TLS_AES_GCM128 = BIT(30),
+	MLX5_ACCEL_TLS_AES_GCM256 = BIT(31),
+};
+
+struct mlx5_ifc_tls_flow_bits {
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         reserved_at_2[0x1e];
+};
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid);
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid);
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev);
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev);
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev);
+
+#else
+
+static inline int
+mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn, u32 *p_swid) { return 0; }
+static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { }
+static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; }
+static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; }
+static inline int mlx5_accel_tls_init(struct mlx5_core_dev *mdev) { return 0; }
+static inline void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev) { }
+
+#endif
+
+#endif	/* __MLX5_ACCEL_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
index 82405ed84725..3e2355c8df3f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
@@ -53,6 +53,7 @@ struct mlx5_fpga_device {
 	} conn_res;
 
 	struct mlx5_fpga_ipsec *ipsec;
+	struct mlx5_fpga_tls *tls;
 };
 
 #define mlx5_fpga_dbg(__adev, format, ...) \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
new file mode 100644
index 000000000000..47f8b0d579e2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
@@ -0,0 +1,563 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+#include "fpga/tls.h"
+#include "fpga/cmd.h"
+#include "fpga/sdk.h"
+#include "fpga/core.h"
+#include "accel/tls.h"
+
+struct mlx5_fpga_tls_command_context;
+
+typedef void (*mlx5_fpga_tls_command_complete)
+	(struct mlx5_fpga_conn *conn, struct mlx5_fpga_device *fdev,
+	 struct mlx5_fpga_tls_command_context *ctx,
+	 struct mlx5_fpga_dma_buf *resp);
+
+struct mlx5_fpga_tls_command_context {
+	struct list_head list;
+	/* There is no guarantee on the order between the TX completion
+	 * and the command response.
+	 * The TX completion is going to touch cmd->buf even in
+	 * the case of successful transmission.
+	 * So instead of requiring separate allocations for cmd
+	 * and cmd->buf we've decided to use a reference counter
+	 */
+	refcount_t ref;
+	struct mlx5_fpga_dma_buf buf;
+	mlx5_fpga_tls_command_complete complete;
+};
+
+static inline void
+mlx5_fpga_tls_put_command_ctx(struct mlx5_fpga_tls_command_context *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		kfree(ctx);
+}
+
+static void mlx5_fpga_tls_cmd_complete(struct mlx5_fpga_device *fdev,
+				       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_fpga_conn *conn = fdev->tls->conn;
+	struct mlx5_fpga_tls_command_context *ctx;
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	ctx = list_first_entry(&tls->pending_cmds,
+			       struct mlx5_fpga_tls_command_context, list);
+	list_del(&ctx->list);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+	ctx->complete(conn, fdev, ctx, resp);
+}
+
+static void mlx5_fpga_cmd_send_complete(struct mlx5_fpga_conn *conn,
+					struct mlx5_fpga_device *fdev,
+					struct mlx5_fpga_dma_buf *buf,
+					u8 status)
+{
+	struct mlx5_fpga_tls_command_context *ctx =
+	    container_of(buf, struct mlx5_fpga_tls_command_context, buf);
+
+	mlx5_fpga_tls_put_command_ctx(ctx);
+
+	if (unlikely(status))
+		mlx5_fpga_tls_cmd_complete(fdev, NULL);
+}
+
+static void mlx5_fpga_tls_cmd_send(struct mlx5_fpga_device *fdev,
+				   struct mlx5_fpga_tls_command_context *cmd,
+				   mlx5_fpga_tls_command_complete complete)
+{
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+	int ret;
+
+	refcount_set(&cmd->ref, 2);
+	cmd->complete = complete;
+	cmd->buf.complete = mlx5_fpga_cmd_send_complete;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	/* mlx5_fpga_sbu_conn_sendmsg is called under pending_cmds_lock
+	 * to make sure commands are inserted to the tls->pending_cmds list
+	 * and the command QP in the same order.
+	 */
+	ret = mlx5_fpga_sbu_conn_sendmsg(tls->conn, &cmd->buf);
+	if (likely(!ret))
+		list_add_tail(&cmd->list, &tls->pending_cmds);
+	else
+		complete(tls->conn, fdev, cmd, NULL);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+}
+
+/* Start of context identifiers range (inclusive) */
+#define SWID_START	0
+/* End of context identifiers range (exclusive) */
+#define SWID_END	BIT(24)
+
+static int mlx5_fpga_tls_alloc_swid(struct idr *idr, spinlock_t *idr_spinlock,
+				    void *ptr)
+{
+	int ret;
+
+	/* TLS metadata format is 1 byte for syndrome followed
+	 * by 3 bytes of swid (software ID)
+	 * swid must not exceed 3 bytes.
+	 * See tls_rxtx.c:insert_pet() for details
+	 */
+	BUILD_BUG_ON((SWID_END - 1) & 0xFF000000);
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(idr_spinlock);
+	ret = idr_alloc(idr, ptr, SWID_START, SWID_END, GFP_ATOMIC);
+	spin_unlock_irq(idr_spinlock);
+	idr_preload_end();
+
+	return ret;
+}
+
+static void mlx5_fpga_tls_release_swid(struct idr *idr,
+				       spinlock_t *idr_spinlock, u32 swid)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(idr_spinlock, flags);
+	idr_remove(idr, swid);
+	spin_unlock_irqrestore(idr_spinlock, flags);
+}
+
+struct mlx5_teardown_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	u32 swid;
+};
+
+static void
+mlx5_fpga_tls_teardown_completion(struct mlx5_fpga_conn *conn,
+				  struct mlx5_fpga_device *fdev,
+				  struct mlx5_fpga_tls_command_context *cmd,
+				  struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_teardown_stream_context *ctx =
+		    container_of(cmd, struct mlx5_teardown_stream_context, cmd);
+
+	if (resp) {
+		u32 syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+
+		if (syndrome)
+			mlx5_fpga_err(fdev,
+				      "Teardown stream failed with syndrome = %d",
+				      syndrome);
+		else
+			mlx5_fpga_tls_release_swid(&fdev->tls->tx_idr,
+						   &fdev->tls->idr_spinlock,
+						   ctx->swid);
+	}
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static void mlx5_fpga_tls_flow_to_cmd(void *flow, void *cmd)
+{
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, src_port), flow,
+	       MLX5_BYTE_OFF(tls_flow, ipv6));
+
+	MLX5_SET(tls_cmd, cmd, ipv6, MLX5_GET(tls_flow, flow, ipv6));
+	MLX5_SET(tls_cmd, cmd, direction_sx,
+		 MLX5_GET(tls_flow, flow, direction_sx));
+}
+
+void mlx5_fpga_tls_send_teardown_cmd(struct mlx5_core_dev *mdev, void *flow,
+				     u32 swid, gfp_t flags)
+{
+	struct mlx5_teardown_stream_context *ctx;
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd;
+
+	ctx = kzalloc(sizeof(*ctx) + MLX5_TLS_COMMAND_SIZE, flags);
+	if (!ctx)
+		return;
+
+	buf = &ctx->cmd.buf;
+	cmd = (ctx + 1);
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_TEARDOWN_STREAM);
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+	kfree(flow);
+
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+
+	ctx->swid = swid;
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_teardown_completion);
+}
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	void *flow;
+
+	rcu_read_lock();
+	flow = idr_find(&tls->tx_idr, swid);
+	rcu_read_unlock();
+
+	if (!flow) {
+		mlx5_fpga_err(mdev->fpga, "No flow information for swid %u\n",
+			      swid);
+		return;
+	}
+
+	mlx5_fpga_tls_send_teardown_cmd(mdev, flow, swid, flags);
+}
+
+enum mlx5_fpga_setup_stream_status {
+	MLX5_FPGA_CMD_PENDING,
+	MLX5_FPGA_CMD_SEND_FAILED,
+	MLX5_FPGA_CMD_RESPONSE_RECEIVED,
+	MLX5_FPGA_CMD_ABANDONED,
+};
+
+struct mlx5_setup_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	atomic_t status;
+	u32 syndrome;
+	struct completion comp;
+};
+
+static void
+mlx5_fpga_tls_setup_completion(struct mlx5_fpga_conn *conn,
+			       struct mlx5_fpga_device *fdev,
+			       struct mlx5_fpga_tls_command_context *cmd,
+			       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_setup_stream_context *ctx =
+	    container_of(cmd, struct mlx5_setup_stream_context, cmd);
+	int status = MLX5_FPGA_CMD_SEND_FAILED;
+	void *tls_cmd = ctx + 1;
+
+	/* If we failed to send to command resp == NULL */
+	if (resp) {
+		ctx->syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+		status = MLX5_FPGA_CMD_RESPONSE_RECEIVED;
+	}
+
+	status = atomic_xchg_release(&ctx->status, status);
+	if (likely(status != MLX5_FPGA_CMD_ABANDONED)) {
+		complete(&ctx->comp);
+		return;
+	}
+
+	mlx5_fpga_err(fdev, "Command was abandoned, syndrome = %u\n",
+		      ctx->syndrome);
+
+	if (!ctx->syndrome) {
+		/* The process was killed while waiting for the context to be
+		 * added, and the add completed successfully.
+		 * We need to destroy the HW context, and we can't can't reuse
+		 * the command context because we might not have received
+		 * the tx completion yet.
+		 */
+		mlx5_fpga_tls_del_tx_flow(fdev->mdev,
+					  MLX5_GET(tls_cmd, tls_cmd, swid),
+					  GFP_ATOMIC);
+	}
+
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static int mlx5_fpga_tls_setup_stream_cmd(struct mlx5_core_dev *mdev,
+					  struct mlx5_setup_stream_context *ctx)
+{
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd = ctx + 1;
+	int status, ret = 0;
+
+	buf = &ctx->cmd.buf;
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_SETUP_STREAM);
+
+	init_completion(&ctx->comp);
+	atomic_set(&ctx->status, MLX5_FPGA_CMD_PENDING);
+	ctx->syndrome = -1;
+
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_setup_completion);
+	wait_for_completion_killable(&ctx->comp);
+
+	status = atomic_xchg_acquire(&ctx->status, MLX5_FPGA_CMD_ABANDONED);
+	if (unlikely(status == MLX5_FPGA_CMD_PENDING))
+	/* ctx is going to be released in mlx5_fpga_tls_setup_completion */
+		return -EINTR;
+
+	if (unlikely(ctx->syndrome))
+		ret = -ENOMEM;
+
+	mlx5_fpga_tls_put_command_ctx(&ctx->cmd);
+	return ret;
+}
+
+static void mlx5_fpga_tls_hw_qp_recv_cb(void *cb_arg,
+					struct mlx5_fpga_dma_buf *buf)
+{
+	struct mlx5_fpga_device *fdev = (struct mlx5_fpga_device *)cb_arg;
+
+	mlx5_fpga_tls_cmd_complete(fdev, buf);
+}
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	if (!mdev->fpga || !MLX5_CAP_GEN(mdev, fpga))
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, ieee_vendor_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_version) != 0)
+		return false;
+
+	return true;
+}
+
+static inline int mlx5_fpga_tls_get_caps(struct mlx5_fpga_device *fdev,
+					 u32 *p_caps)
+{
+	int err, cap_size = MLX5_ST_SZ_BYTES(tls_extended_cap);
+	u32 caps = 0;
+	void *buf;
+
+	buf = kzalloc(cap_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	err = mlx5_fpga_get_sbu_caps(fdev, cap_size, buf);
+	if (err)
+		goto out;
+
+	if (MLX5_GET(tls_extended_cap, buf, tx))
+		caps |= MLX5_ACCEL_TLS_TX;
+	if (MLX5_GET(tls_extended_cap, buf, rx))
+		caps |= MLX5_ACCEL_TLS_RX;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v12))
+		caps |= MLX5_ACCEL_TLS_V12;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v13))
+		caps |= MLX5_ACCEL_TLS_V13;
+	if (MLX5_GET(tls_extended_cap, buf, lro))
+		caps |= MLX5_ACCEL_TLS_LRO;
+	if (MLX5_GET(tls_extended_cap, buf, ipv6))
+		caps |= MLX5_ACCEL_TLS_IPV6;
+
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_128))
+		caps |= MLX5_ACCEL_TLS_AES_GCM128;
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_256))
+		caps |= MLX5_ACCEL_TLS_AES_GCM256;
+
+	*p_caps = caps;
+	err = 0;
+out:
+	kfree(buf);
+	return err;
+}
+
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+	struct mlx5_fpga_conn_attr init_attr = {0};
+	struct mlx5_fpga_conn *conn;
+	struct mlx5_fpga_tls *tls;
+	int err = 0;
+
+	if (!mlx5_fpga_is_tls_device(mdev))
+		return 0;
+
+	tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+	if (!tls)
+		return -ENOMEM;
+
+	err = mlx5_fpga_tls_get_caps(fdev, &tls->caps);
+	if (err)
+		goto error;
+
+	if (!(tls->caps & (MLX5_ACCEL_TLS_TX | MLX5_ACCEL_TLS_V12 |
+				 MLX5_ACCEL_TLS_AES_GCM128))) {
+		err = -ENOTSUPP;
+		goto error;
+	}
+
+	init_attr.rx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.tx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.recv_cb = mlx5_fpga_tls_hw_qp_recv_cb;
+	init_attr.cb_arg = fdev;
+	conn = mlx5_fpga_sbu_conn_create(fdev, &init_attr);
+	if (IS_ERR(conn)) {
+		err = PTR_ERR(conn);
+		mlx5_fpga_err(fdev, "Error creating TLS command connection %d\n",
+			      err);
+		goto error;
+	}
+
+	tls->conn = conn;
+	spin_lock_init(&tls->pending_cmds_lock);
+	INIT_LIST_HEAD(&tls->pending_cmds);
+
+	idr_init(&tls->tx_idr);
+	spin_lock_init(&tls->idr_spinlock);
+	fdev->tls = tls;
+	return 0;
+
+error:
+	kfree(tls);
+	return err;
+}
+
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+
+	if (!fdev->tls)
+		return;
+
+	mlx5_fpga_sbu_conn_destroy(fdev->tls->conn);
+	kfree(fdev->tls);
+	fdev->tls = NULL;
+}
+
+static void mlx5_fpga_tls_set_aes_gcm128_ctx(void *cmd,
+					     struct tls_crypto_info *info,
+					     __be64 *rcd_sn)
+{
+	struct tls12_crypto_info_aes_gcm_128 *crypto_info =
+	    (struct tls12_crypto_info_aes_gcm_128 *)info;
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_rcd_sn), crypto_info->rec_seq,
+	       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_implicit_iv),
+	       crypto_info->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key),
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	/* in AES-GCM 128 we need to write the key twice */
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key) +
+		   TLS_CIPHER_AES_GCM_128_KEY_SIZE,
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	MLX5_SET(tls_cmd, cmd, alg, MLX5_TLS_ALG_AES_GCM_128);
+}
+
+static int mlx5_fpga_tls_set_key_material(void *cmd, u32 caps,
+					  struct tls_crypto_info *crypto_info)
+{
+	__be64 rcd_sn;
+
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128:
+		if (!(caps & MLX5_ACCEL_TLS_AES_GCM128))
+			return -EINVAL;
+		mlx5_fpga_tls_set_aes_gcm128_ctx(cmd, crypto_info, &rcd_sn);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int mlx5_fpga_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+				  struct tls_crypto_info *crypto_info, u32 swid,
+				  u32 tcp_sn)
+{
+	u32 caps = mlx5_fpga_tls_device_caps(mdev);
+	struct mlx5_setup_stream_context *ctx;
+	int ret = -ENOMEM;
+	size_t cmd_size;
+	void *cmd;
+
+	cmd_size = MLX5_TLS_COMMAND_SIZE + sizeof(*ctx);
+	ctx = kzalloc(cmd_size, GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	cmd = ctx + 1;
+	ret = mlx5_fpga_tls_set_key_material(cmd, caps, crypto_info);
+	if (ret)
+		goto free_ctx;
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+	MLX5_SET(tls_cmd, cmd, tcp_sn, tcp_sn);
+
+	return mlx5_fpga_tls_setup_stream_cmd(mdev, ctx);
+
+free_ctx:
+	kfree(ctx);
+out:
+	return ret;
+}
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	int ret = -ENOMEM;
+	u32 swid;
+
+	ret = mlx5_fpga_tls_alloc_swid(&tls->tx_idr, &tls->idr_spinlock, flow);
+	if (ret < 0)
+		return ret;
+
+	swid = ret;
+	MLX5_SET(tls_flow, flow, direction_sx, 1);
+
+	ret = mlx5_fpga_tls_add_flow(mdev, flow, crypto_info, swid,
+				     start_offload_tcp_sn);
+	if (ret && ret != -EINTR)
+		goto free_swid;
+
+	*p_swid = swid;
+	return 0;
+free_swid:
+	mlx5_fpga_tls_release_swid(&tls->tx_idr, &tls->idr_spinlock, swid);
+
+	return ret;
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
new file mode 100644
index 000000000000..800a214e4e49
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_FPGA_TLS_H__
+#define __MLX5_FPGA_TLS_H__
+
+#include <linux/mlx5/driver.h>
+
+#include <net/tls.h>
+#include "fpga/core.h"
+
+struct mlx5_fpga_tls {
+	struct list_head pending_cmds;
+	spinlock_t pending_cmds_lock; /* Protects pending_cmds */
+	u32 caps;
+	struct mlx5_fpga_conn *conn;
+
+	struct idr tx_idr;
+	spinlock_t idr_spinlock; /* protects the IDR */
+};
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid);
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags);
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev);
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev);
+
+static inline u32 mlx5_fpga_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mdev->fpga->tls->caps;
+}
+
+#endif /* __MLX5_FPGA_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 13b6f66310c9..808091df84ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -60,6 +60,7 @@
 #include "fpga/core.h"
 #include "fpga/ipsec.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "lib/clock.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
@@ -1186,6 +1187,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_ipsec_start;
 	}
 
+	err = mlx5_accel_tls_init(dev);
+	if (err) {
+		dev_err(&pdev->dev, "TLS device start failed %d\n", err);
+		goto err_tls_start;
+	}
+
 	err = mlx5_init_fs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1227,6 +1234,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_cleanup_fs(dev);
 
 err_fs:
+	mlx5_accel_tls_cleanup(dev);
+
+err_tls_start:
 	mlx5_accel_ipsec_cleanup(dev);
 
 err_ipsec_start:
@@ -1302,6 +1312,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_sriov_detach(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
+	mlx5_accel_tls_cleanup(dev);
 	mlx5_fpga_device_stop(dev);
 	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 14ad84afe8ba..24092a871c3d 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -350,22 +350,6 @@ struct mlx5_ifc_odp_per_transport_service_cap_bits {
 	u8         reserved_at_6[0x1a];
 };
 
-struct mlx5_ifc_ipv4_layout_bits {
-	u8         reserved_at_0[0x60];
-
-	u8         ipv4[0x20];
-};
-
-struct mlx5_ifc_ipv6_layout_bits {
-	u8         ipv6[16][0x8];
-};
-
-union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
-	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
-	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
-	u8         reserved_at_0[0x80];
-};
-
 struct mlx5_ifc_fte_match_set_lyr_2_4_bits {
 	u8         smac_47_16[0x20];
 
diff --git a/include/linux/mlx5/mlx5_ifc_fpga.h b/include/linux/mlx5/mlx5_ifc_fpga.h
index ec052491ba3d..193091537cb6 100644
--- a/include/linux/mlx5/mlx5_ifc_fpga.h
+++ b/include/linux/mlx5/mlx5_ifc_fpga.h
@@ -32,12 +32,29 @@
 #ifndef MLX5_IFC_FPGA_H
 #define MLX5_IFC_FPGA_H
 
+struct mlx5_ifc_ipv4_layout_bits {
+	u8         reserved_at_0[0x60];
+
+	u8         ipv4[0x20];
+};
+
+struct mlx5_ifc_ipv6_layout_bits {
+	u8         ipv6[16][0x8];
+};
+
+union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
+	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
+	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
+	u8         reserved_at_0[0x80];
+};
+
 enum {
 	MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX = 0x2c9,
 };
 
 enum {
 	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_IPSEC    = 0x2,
+	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS      = 0x3,
 };
 
 struct mlx5_ifc_fpga_shell_caps_bits {
@@ -370,6 +387,27 @@ struct mlx5_ifc_fpga_destroy_qp_out_bits {
 	u8         reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_tls_extended_cap_bits {
+	u8         aes_gcm_128[0x1];
+	u8         aes_gcm_256[0x1];
+	u8         reserved_at_2[0x1e];
+	u8         reserved_at_20[0x20];
+	u8         context_capacity_total[0x20];
+	u8         context_capacity_rx[0x20];
+	u8         context_capacity_tx[0x20];
+	u8         reserved_at_a0[0x10];
+	u8         tls_counter_size[0x10];
+	u8         tls_counters_addr_low[0x20];
+	u8         tls_counters_addr_high[0x20];
+	u8         rx[0x1];
+	u8         tx[0x1];
+	u8         tls_v12[0x1];
+	u8         tls_v13[0x1];
+	u8         lro[0x1];
+	u8         ipv6[0x1];
+	u8         reserved_at_106[0x1a];
+};
+
 struct mlx5_ifc_ipsec_extended_cap_bits {
 	u8         encapsulation[0x20];
 
@@ -519,4 +557,43 @@ struct mlx5_ifc_fpga_ipsec_sa {
 	__be16 reserved2;
 } __packed;
 
+enum fpga_tls_cmds {
+	CMD_SETUP_STREAM		= 0x1001,
+	CMD_TEARDOWN_STREAM		= 0x1002,
+};
+
+#define MLX5_TLS_1_2 (0)
+
+#define MLX5_TLS_ALG_AES_GCM_128 (0)
+#define MLX5_TLS_ALG_AES_GCM_256 (1)
+
+struct mlx5_ifc_tls_cmd_bits {
+	u8         command_type[0x20];
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         tls_version[0x2];
+	u8         reserved[0x1c];
+	u8         swid[0x20];
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         tls_rcd_sn[0x40];
+	u8         tcp_sn[0x20];
+	u8         tls_implicit_iv[0x20];
+	u8         tls_xor_iv[0x40];
+	u8         encryption_key[0x100];
+	u8         alg[4];
+	u8         reserved2[0x1c];
+	u8         reserved3[0x4a0];
+};
+
+struct mlx5_ifc_tls_resp_bits {
+	u8         syndrome[0x20];
+	u8         stream_id[0x20];
+	u8         reserverd[0x40];
+};
+
+#define MLX5_TLS_COMMAND_SIZE (0x100)
+
 #endif /* MLX5_IFC_FPGA_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (8 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path Saeed Mahameed
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
TLS generic NIC offload infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 173 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  65 ++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 5 files changed, 254 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 25deaa5a534c..6befd2c381b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -85,3 +85,14 @@ config MLX5_EN_IPSEC
 	  Build support for IPsec cryptography-offload accelaration in the NIC.
 	  Note: Support for hardware with this capability needs to be selected
 	  for this option to become available.
+
+config MLX5_EN_TLS
+	bool "TLS cryptography-offload accelaration"
+	depends on MLX5_CORE_EN
+	depends on TLS_DEVICE
+	depends on MLX5_ACCEL
+	default n
+	---help---
+	  Build support for TLS cryptography-offload accelaration in the NIC.
+	  Note: Support for hardware with this capability needs to be selected
+	  for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9989e5265a45..50872ed30c0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,4 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
new file mode 100644
index 000000000000..38d88108a55a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -0,0 +1,173 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <net/ipv6.h>
+#include "en_accel/tls.h"
+#include "accel/tls.h"
+
+static void mlx5e_tls_set_ipv4_flow(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 0);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_daddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_rcv_saddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void mlx5e_tls_set_ipv6_flow(void *flow, struct sock *sk)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 1);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+	       &sk->sk_v6_daddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv6_layout.ipv6),
+	       &np->saddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+}
+#endif
+
+static void mlx5e_tls_set_flow_tcp_ports(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_port), &inet->inet_sport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, src_port));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_port), &inet->inet_dport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, dst_port));
+}
+
+static int mlx5e_tls_set_flow(void *flow, struct sock *sk, u32 caps)
+{
+	switch (sk->sk_family) {
+	case AF_INET:
+		mlx5e_tls_set_ipv4_flow(flow, sk);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (!sk->sk_ipv6only &&
+		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED) {
+			mlx5e_tls_set_ipv4_flow(flow, sk);
+			break;
+		}
+		if (!(caps & MLX5_ACCEL_TLS_IPV6))
+			goto error_out;
+
+		mlx5e_tls_set_ipv6_flow(flow, sk);
+		break;
+#endif
+	default:
+		goto error_out;
+	}
+
+	mlx5e_tls_set_flow_tcp_ports(flow, sk);
+	return 0;
+error_out:
+	return -EINVAL;
+}
+
+static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk,
+			 enum tls_offload_ctx_dir direction,
+			 struct tls_crypto_info *crypto_info,
+			 u32 start_offload_tcp_sn)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	u32 caps = mlx5_accel_tls_device_caps(mdev);
+	int ret = -ENOMEM;
+	void *flow;
+
+	if (direction != TLS_OFFLOAD_CTX_DIR_TX)
+		return -EINVAL;
+
+	flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL);
+	if (!flow)
+		return ret;
+
+	ret = mlx5e_tls_set_flow(flow, sk, caps);
+	if (ret)
+		goto free_flow;
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		struct mlx5e_tls_offload_context *tx_ctx =
+		    mlx5e_get_tls_tx_context(tls_ctx);
+		u32 swid;
+
+		ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info,
+						 start_offload_tcp_sn, &swid);
+		if (ret < 0)
+			goto free_flow;
+
+		tx_ctx->swid = htonl(swid);
+		tx_ctx->expected_seq = start_offload_tcp_sn;
+	}
+
+	return 0;
+free_flow:
+	kfree(flow);
+	return ret;
+}
+
+static void mlx5e_tls_del(struct net_device *netdev,
+			  struct tls_context *tls_ctx,
+			  enum tls_offload_ctx_dir direction)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		u32 swid = ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid);
+
+		mlx5_accel_tls_del_tx_flow(priv->mdev, swid);
+	} else {
+		netdev_err(netdev, "unsupported direction %d\n", direction);
+	}
+}
+
+static const struct tlsdev_ops mlx5e_tls_ops = {
+	.tls_dev_add = mlx5e_tls_add,
+	.tls_dev_del = mlx5e_tls_del,
+};
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	if (!mlx5_accel_is_tls_device(priv->mdev))
+		return;
+
+	netdev->tlsdev_ops = &mlx5e_tls_ops;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
new file mode 100644
index 000000000000..f7216b9b98e2
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#ifndef __MLX5E_TLS_H__
+#define __MLX5E_TLS_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <net/tls.h>
+#include "en.h"
+
+struct mlx5e_tls_offload_context {
+	struct tls_offload_context base;
+	u32 expected_seq;
+	__be32 swid;
+};
+
+static inline struct mlx5e_tls_offload_context *
+mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
+{
+	BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) >
+		     TLS_OFFLOAD_CONTEXT_SIZE);
+	return container_of(tls_offload_ctx(tls_ctx),
+			    struct mlx5e_tls_offload_context,
+			    base);
+}
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+
+#else
+
+static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+
+#endif
+
+#endif /* __MLX5E_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..8dbe058da178 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -41,7 +41,9 @@
 #include "en_rep.h"
 #include "en_accel/ipsec.h"
 #include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "vxlan.h"
 
 struct mlx5e_rq_param {
@@ -4181,6 +4183,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 #endif
 
 	mlx5e_ipsec_build_netdev(priv);
+	mlx5e_tls_build_netdev(priv);
 }
 
 static void mlx5e_create_q_counter(struct mlx5e_priv *priv)
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (9 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 12/14] net/mlx5e: TLS, Add error statistics Saeed Mahameed
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Implement the TLS tx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  15 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |   2 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 272 +++++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +--
 10 files changed, 455 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 50872ed30c0b..ec785f589666 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6660986285bf..7d8696fca826 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -340,6 +340,7 @@ struct mlx5e_sq_dma {
 enum {
 	MLX5E_SQ_STATE_ENABLED,
 	MLX5E_SQ_STATE_IPSEC,
+	MLX5E_SQ_STATE_TLS,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -824,6 +825,8 @@ void mlx5e_build_ptys2ethtool_map(void);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi);
 
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
@@ -929,6 +932,18 @@ static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
 		MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
 }
 
+static inline void mlx5e_sq_fetch_wqe(struct mlx5e_txqsq *sq,
+				      struct mlx5e_tx_wqe **wqe,
+				      u16 *pi)
+{
+	struct mlx5_wq_cyc *wq;
+
+	wq = &sq->wq;
+	*pi = sq->pc & wq->sz_m1;
+	*wqe = mlx5_wq_cyc_get_wqe(wq, *pi);
+	memset(*wqe, 0, sizeof(**wqe));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
new file mode 100644
index 000000000000..68fcb40a2847
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_EN_ACCEL_H__
+#define __MLX5E_EN_ACCEL_H__
+
+#ifdef CONFIG_MLX5_ACCEL
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls_rxtx.h"
+#include "en.h"
+
+static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
+						    struct mlx5e_txqsq *sq,
+						    struct net_device *dev,
+						    struct mlx5e_tx_wqe **wqe,
+						    u16 *pi)
+{
+#ifdef CONFIG_MLX5_EN_TLS
+	if (sq->state & BIT(MLX5E_SQ_STATE_TLS)) {
+		skb = mlx5e_tls_handle_tx_skb(dev, sq, skb, wqe, pi);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+#ifdef CONFIG_MLX5_EN_IPSEC
+	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
+		skb = mlx5e_ipsec_handle_tx_skb(dev, *wqe, skb);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+	return skb;
+}
+
+#endif /* CONFIG_MLX5_ACCEL */
+
+#endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 38d88108a55a..aa6981c98bdc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -169,5 +169,7 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	if (!mlx5_accel_is_tls_device(priv->mdev))
 		return;
 
+	netdev->features |= NETIF_F_HW_TLS_TX;
+	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
new file mode 100644
index 000000000000..49e8d455ebc3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include "en_accel/tls.h"
+#include "en_accel/tls_rxtx.h"
+
+#define SYNDROME_OFFLOAD_REQUIRED 32
+#define SYNDROME_SYNC 33
+
+struct sync_info {
+	u64 rcd_sn;
+	s32 sync_len;
+	int nr_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct mlx5e_tls_metadata {
+	/* One byte of syndrome followed by 3 bytes of swid */
+	__be32 syndrome_swid;
+	__be16 first_seq;
+	/* packet type ID field	*/
+	__be16 ethertype;
+} __packed;
+
+static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid)
+{
+	struct mlx5e_tls_metadata *pet;
+	struct ethhdr *eth;
+
+	if (skb_cow_head(skb, sizeof(struct mlx5e_tls_metadata)))
+		return -ENOMEM;
+
+	eth = (struct ethhdr *)skb_push(skb, sizeof(struct mlx5e_tls_metadata));
+	skb->mac_header -= sizeof(struct mlx5e_tls_metadata);
+	pet = (struct mlx5e_tls_metadata *)(eth + 1);
+
+	memmove(skb->data, skb->data + sizeof(struct mlx5e_tls_metadata),
+		2 * ETH_ALEN);
+
+	eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE);
+	pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid;
+
+	return 0;
+}
+
+static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context,
+				   u32 tcp_seq, struct sync_info *info)
+{
+	int remaining, i = 0, ret = -EINVAL;
+	struct tls_record_info *record;
+	unsigned long flags;
+	s32 sync_size;
+
+	spin_lock_irqsave(&context->base.lock, flags);
+	record = tls_get_record(&context->base, tcp_seq, &info->rcd_sn);
+
+	if (unlikely(!record))
+		goto out;
+
+	sync_size = tcp_seq - tls_record_start_seq(record);
+	info->sync_len = sync_size;
+	if (unlikely(sync_size < 0)) {
+		if (tls_record_is_start_marker(record))
+			goto done;
+
+		goto out;
+	}
+
+	remaining = sync_size;
+	while (remaining > 0) {
+		info->frags[i] = record->frags[i];
+		__skb_frag_ref(&info->frags[i]);
+		remaining -= skb_frag_size(&info->frags[i]);
+
+		if (remaining < 0)
+			skb_frag_size_add(&info->frags[i], remaining);
+
+		i++;
+	}
+	info->nr_frags = i;
+done:
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&context->base.lock, flags);
+	return ret;
+}
+
+static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
+					struct sk_buff *nskb, u32 tcp_seq,
+					int headln, __be64 rcd_sn)
+{
+	struct mlx5e_tls_metadata *pet;
+	u8 syndrome = SYNDROME_SYNC;
+	struct iphdr *iph;
+	struct tcphdr *th;
+	int data_len, mss;
+
+	nskb->dev = skb->dev;
+	skb_reset_mac_header(nskb);
+	skb_set_network_header(nskb, skb_network_offset(skb));
+	skb_set_transport_header(nskb, skb_transport_offset(skb));
+	memcpy(nskb->data, skb->data, headln);
+	memcpy(nskb->data + headln, &rcd_sn, sizeof(rcd_sn));
+
+	iph = ip_hdr(nskb);
+	iph->tot_len = htons(nskb->len - skb_network_offset(nskb));
+	th = tcp_hdr(nskb);
+	data_len = nskb->len - headln;
+	tcp_seq -= data_len;
+	th->seq = htonl(tcp_seq);
+
+	mss = nskb->dev->mtu - (headln - skb_network_offset(nskb));
+	skb_shinfo(nskb)->gso_size = 0;
+	if (data_len > mss) {
+		skb_shinfo(nskb)->gso_size = mss;
+		skb_shinfo(nskb)->gso_segs = DIV_ROUND_UP(data_len, mss);
+	}
+	skb_shinfo(nskb)->gso_type = skb_shinfo(skb)->gso_type;
+
+	pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr));
+	memcpy(pet, &syndrome, sizeof(syndrome));
+	pet->first_seq = htons(tcp_seq);
+
+	/* MLX5 devices don't care about the checksum partial start, offset
+	 * and pseudo header
+	 */
+	nskb->ip_summed = CHECKSUM_PARTIAL;
+
+	nskb->xmit_more = 1;
+	nskb->queue_mapping = skb->queue_mapping;
+}
+
+static struct sk_buff *
+mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
+		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
+		     struct mlx5e_tx_wqe **wqe,
+		     u16 *pi)
+{
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct sync_info info;
+	struct sk_buff *nskb;
+	int linear_len = 0;
+	int headln;
+	int i;
+
+	sq->stats.tls_ooo++;
+
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+		/* We might get here if a retransmission reaches the driver
+		 * after the relevant record is acked.
+		 * It should be safe to drop the packet in this case
+		 */
+		goto err_out;
+
+	if (unlikely(info.sync_len < 0)) {
+		u32 payload;
+
+		headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		payload = skb->len - headln;
+		if (likely(payload <= -info.sync_len))
+			/* SKB payload doesn't require offload
+			 */
+			return skb;
+
+		netdev_err(skb->dev,
+			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
+			   tcp_seq, tcp_seq + payload + info.sync_len,
+			   tcp_seq + payload);
+		goto err_out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+		goto err_out;
+
+	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	linear_len += headln + sizeof(info.rcd_sn);
+	nskb = alloc_skb(linear_len, GFP_ATOMIC);
+	if (unlikely(!nskb))
+		goto err_out;
+
+	context->expected_seq = tcp_seq + skb->len - headln;
+	skb_put(nskb, linear_len);
+	for (i = 0; i < info.nr_frags; i++)
+		skb_shinfo(nskb)->frags[i] = info.frags[i];
+
+	skb_shinfo(nskb)->nr_frags = info.nr_frags;
+	nskb->data_len = info.sync_len;
+	nskb->len += info.sync_len;
+	sq->stats.tls_resync_bytes += nskb->len;
+	mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
+				    cpu_to_be64(info.rcd_sn));
+	mlx5e_sq_xmit(sq, nskb, *wqe, *pi);
+	mlx5e_sq_fetch_wqe(sq, wqe, pi);
+	return skb;
+
+err_out:
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi)
+{
+	struct mlx5e_tls_offload_context *context;
+	struct tls_context *tls_ctx;
+	u32 expected_seq;
+	int datalen;
+	u32 skb_seq;
+
+	if (!skb->sk || !tls_is_sk_tx_device_offloaded(skb->sk))
+		goto out;
+
+	datalen = skb->len - (skb_transport_offset(skb) + tcp_hdrlen(skb));
+	if (!datalen)
+		goto out;
+
+	tls_ctx = tls_get_ctx(skb->sk);
+	if (unlikely(tls_ctx->netdev != netdev))
+		goto out;
+
+	skb_seq = ntohl(tcp_hdr(skb)->seq);
+	context = mlx5e_get_tls_tx_context(tls_ctx);
+	expected_seq = context->expected_seq;
+
+	if (unlikely(expected_seq != skb_seq)) {
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		goto out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		dev_kfree_skb_any(skb);
+		skb = NULL;
+		goto out;
+	}
+
+	context->expected_seq = skb_seq + datalen;
+out:
+	return skb;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
new file mode 100644
index 000000000000..405dfd302225
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_TLS_RXTX_H__
+#define __MLX5E_TLS_RXTX_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <linux/skbuff.h>
+#include "en.h"
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi);
+
+#endif /* CONFIG_MLX5_EN_TLS */
+
+#endif /* __MLX5E_TLS_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8dbe058da178..d4c397aec2ee 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -976,6 +976,8 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 	sq->min_inline_mode = params->tx_min_inline_mode;
 	if (MLX5_IPSEC_DEV(c->priv->mdev))
 		set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
+	if (mlx5_accel_is_tls_device(c->priv->mdev))
+		set_bit(MLX5E_SQ_STATE_TLS, &sq->state);
 
 	param->wq.db_numa_node = cpu_to_node(c->cpu);
 	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 5f0f3493d747..81c1f383d682 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -43,6 +43,12 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_added_vlan_packets) },
+
+#ifdef CONFIG_MLX5_EN_TLS
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_ooo) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_resync_bytes) },
+#endif
+
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_removed_vlan_packets) },
@@ -157,6 +163,10 @@ static void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
 			s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
 			s->tx_csum_none		+= sq_stats->csum_none;
 			s->tx_csum_partial	+= sq_stats->csum_partial;
+#ifdef CONFIG_MLX5_EN_TLS
+			s->tx_tls_ooo		+= sq_stats->tls_ooo;
+			s->tx_tls_resync_bytes	+= sq_stats->tls_resync_bytes;
+#endif
 		}
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 0b3320a2b072..f956ee1704ef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -91,6 +91,11 @@ struct mlx5e_sw_stats {
 	u64 rx_cache_waive;
 	u64 ch_eq_rearm;
 
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tx_tls_ooo;
+	u64 tx_tls_resync_bytes;
+#endif
+
 	/* Special handling counters */
 	u64 link_down_events_phy;
 };
@@ -187,6 +192,10 @@ struct mlx5e_sq_stats {
 	u64 csum_partial_inner;
 	u64 added_vlan_packets;
 	u64 nop;
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tls_ooo;
+	u64 tls_resync_bytes;
+#endif
 	/* less likely accessed in data path */
 	u64 csum_none;
 	u64 stopped;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 11b4f1089d1c..af3c318610e6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -35,12 +35,21 @@
 #include <net/dsfield.h>
 #include "en.h"
 #include "ipoib/ipoib.h"
-#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/en_accel.h"
 #include "lib/clock.h"
 
 #define MLX5E_SQ_NOPS_ROOM  MLX5_SEND_WQE_MAX_WQEBBS
+
+#ifndef CONFIG_MLX5_EN_TLS
 #define MLX5E_SQ_STOP_ROOM (MLX5_SEND_WQE_MAX_WQEBBS +\
 			    MLX5E_SQ_NOPS_ROOM)
+#else
+/* TLS offload requires MLX5E_SQ_STOP_ROOM to have
+ * enough room for a resync SKB, a normal SKB and a NOP
+ */
+#define MLX5E_SQ_STOP_ROOM (2 * MLX5_SEND_WQE_MAX_WQEBBS +\
+			    MLX5E_SQ_NOPS_ROOM)
+#endif
 
 static inline void mlx5e_tx_dma_unmap(struct device *pdev,
 				      struct mlx5e_sq_dma *dma)
@@ -325,8 +334,8 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 }
 
-static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-				 struct mlx5e_tx_wqe *wqe, u16 pi)
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi)
 {
 	struct mlx5e_tx_wqe_info *wi   = &sq->db.wqe_info[pi];
 
@@ -399,21 +408,19 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
-	struct mlx5e_txqsq *sq = priv->txq2sq[skb_get_queue_mapping(skb)];
-	struct mlx5_wq_cyc *wq = &sq->wq;
-	u16 pi = sq->pc & wq->sz_m1;
-	struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+	struct mlx5e_tx_wqe *wqe;
+	struct mlx5e_txqsq *sq;
+	u16 pi;
 
-	memset(wqe, 0, sizeof(*wqe));
+	sq = priv->txq2sq[skb_get_queue_mapping(skb)];
+	mlx5e_sq_fetch_wqe(sq, &wqe, &pi);
 
-#ifdef CONFIG_MLX5_EN_IPSEC
-	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
-		skb = mlx5e_ipsec_handle_tx_skb(dev, wqe, skb);
-		if (unlikely(!skb))
-			return NETDEV_TX_OK;
-	}
+#ifdef CONFIG_MLX5_ACCEL
+	/* might send skbs and update wqe and pi */
+	skb = mlx5e_accel_handle_tx(skb, sq, dev, &wqe, &pi);
+	if (unlikely(!skb))
+		return NETDEV_TX_OK;
 #endif
-
 	return mlx5e_sq_xmit(sq, skb, wqe, pi);
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 12/14] net/mlx5e: TLS, Add error statistics
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (10 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 14/14] MAINTAINERS: Update TLS maintainers Saeed Mahameed
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Saeed Mahameed

From: Ilya Lesokhin <ilyal@mellanox.com>

Add statistics for rare TLS related errors.
Since the errors are rare we have a counter per netdev
rather then per SQ.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 22 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 22 ++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 24 +++---
 .../mellanox/mlx5/core/en_accel/tls_stats.c        | 89 ++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 22 ++++++
 8 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ec785f589666..a7135f5d5cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o en_accel/tls_stats.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7d8696fca826..d397be0b5885 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -795,6 +795,9 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_IPSEC
 	struct mlx5e_ipsec        *ipsec;
 #endif
+#ifdef CONFIG_MLX5_EN_TLS
+	struct mlx5e_tls          *tls;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index aa6981c98bdc..d167845271c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -173,3 +173,25 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
+
+int mlx5e_tls_init(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+
+	if (!tls)
+		return -ENOMEM;
+
+	priv->tls = tls;
+	return 0;
+}
+
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = priv->tls;
+
+	if (!tls)
+		return;
+
+	kfree(tls);
+	priv->tls = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index f7216b9b98e2..b6162178f621 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -38,6 +38,17 @@
 #include <net/tls.h>
 #include "en.h"
 
+struct mlx5e_tls_sw_stats {
+	atomic64_t tx_tls_drop_metadata;
+	atomic64_t tx_tls_drop_resync_alloc;
+	atomic64_t tx_tls_drop_no_sync_data;
+	atomic64_t tx_tls_drop_bypass_required;
+};
+
+struct mlx5e_tls {
+	struct mlx5e_tls_sw_stats sw_stats;
+};
+
 struct mlx5e_tls_offload_context {
 	struct tls_offload_context base;
 	u32 expected_seq;
@@ -55,10 +66,21 @@ mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
 }
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_tls_init(struct mlx5e_priv *priv);
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv);
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data);
 
 #else
 
 static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_tls_cleanup(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data) { return 0; }
+static inline int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data) { return 0; }
 
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 49e8d455ebc3..ad2790fb5966 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -164,7 +164,8 @@ static struct sk_buff *
 mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		     struct mlx5e_tx_wqe **wqe,
-		     u16 *pi)
+		     u16 *pi,
+		     struct mlx5e_tls *tls)
 {
 	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
 	struct sync_info info;
@@ -175,12 +176,14 @@ mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 
 	sq->stats.tls_ooo++;
 
-	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info)) {
 		/* We might get here if a retransmission reaches the driver
 		 * after the relevant record is acked.
 		 * It should be safe to drop the packet in this case
 		 */
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_no_sync_data);
 		goto err_out;
+	}
 
 	if (unlikely(info.sync_len < 0)) {
 		u32 payload;
@@ -192,21 +195,22 @@ mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 			 */
 			return skb;
 
-		netdev_err(skb->dev,
-			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
-			   tcp_seq, tcp_seq + payload + info.sync_len,
-			   tcp_seq + payload);
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_bypass_required);
 		goto err_out;
 	}
 
-	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_metadata);
 		goto err_out;
+	}
 
 	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
 	linear_len += headln + sizeof(info.rcd_sn);
 	nskb = alloc_skb(linear_len, GFP_ATOMIC);
-	if (unlikely(!nskb))
+	if (unlikely(!nskb)) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_resync_alloc);
 		goto err_out;
+	}
 
 	context->expected_seq = tcp_seq + skb->len - headln;
 	skb_put(nskb, linear_len);
@@ -234,6 +238,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 					struct mlx5e_tx_wqe **wqe,
 					u16 *pi)
 {
+	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5e_tls_offload_context *context;
 	struct tls_context *tls_ctx;
 	u32 expected_seq;
@@ -256,11 +261,12 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 	expected_seq = context->expected_seq;
 
 	if (unlikely(expected_seq != skb_seq)) {
-		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi, priv->tls);
 		goto out;
 	}
 
 	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&priv->tls->sw_stats.tx_tls_drop_metadata);
 		dev_kfree_skb_any(skb);
 		skb = NULL;
 		goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
new file mode 100644
index 000000000000..01468ec27446
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/ethtool.h>
+#include <net/sock.h>
+
+#include "en.h"
+#include "accel/tls.h"
+#include "fpga/sdk.h"
+#include "en_accel/tls.h"
+
+static const struct counter_desc mlx5e_tls_sw_stats_desc[] = {
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_metadata) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_resync_alloc) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_no_sync_data) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_bypass_required) },
+};
+
+#define MLX5E_READ_CTR_ATOMIC64(ptr, dsc, i) \
+	atomic64_read((atomic64_t *)((char *)(ptr) + (dsc)[i].offset))
+
+#define NUM_TLS_SW_COUNTERS ARRAY_SIZE(mlx5e_tls_sw_stats_desc)
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv)
+{
+	if (!priv->tls)
+		return 0;
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data)
+{
+	unsigned int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		strcpy(data + (idx++) * ETH_GSTRING_LEN,
+		       mlx5e_tls_sw_stats_desc[i].format);
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data)
+{
+	int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		data[idx++] =
+		    MLX5E_READ_CTR_ATOMIC64(&priv->tls->sw_stats,
+					    mlx5e_tls_sw_stats_desc, i);
+
+	return NUM_TLS_SW_COUNTERS;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d4c397aec2ee..44cf09574926 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4220,12 +4220,16 @@ static void mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	err = mlx5e_ipsec_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "IPSec initialization failed, %d\n", err);
+	err = mlx5e_tls_init(priv);
+	if (err)
+		mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
 	mlx5e_build_nic_netdev(netdev);
 	mlx5e_vxlan_init(priv);
 }
 
 static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
+	mlx5e_tls_cleanup(priv);
 	mlx5e_ipsec_cleanup(priv);
 	mlx5e_vxlan_cleanup(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 81c1f383d682..a9800586b8d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -32,6 +32,7 @@
 
 #include "en.h"
 #include "en_accel/ipsec.h"
+#include "en_accel/tls.h"
 
 static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_packets) },
@@ -971,6 +972,22 @@ static void mlx5e_grp_ipsec_update_stats(struct mlx5e_priv *priv)
 	mlx5e_ipsec_update_stats(priv);
 }
 
+static int mlx5e_grp_tls_get_num_stats(struct mlx5e_priv *priv)
+{
+	return mlx5e_tls_get_count(priv);
+}
+
+static int mlx5e_grp_tls_fill_strings(struct mlx5e_priv *priv, u8 *data,
+				      int idx)
+{
+	return idx + mlx5e_tls_get_strings(priv, data + idx * ETH_GSTRING_LEN);
+}
+
+static int mlx5e_grp_tls_fill_stats(struct mlx5e_priv *priv, u64 *data, int idx)
+{
+	return idx + mlx5e_tls_get_stats(priv, data + idx);
+}
+
 static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, bytes) },
@@ -1165,6 +1182,11 @@ const struct mlx5e_stats_grp mlx5e_stats_grps[] = {
 		.fill_stats = mlx5e_grp_ipsec_fill_stats,
 		.update_stats = mlx5e_grp_ipsec_update_stats,
 	},
+	{
+		.get_num_stats = mlx5e_grp_tls_get_num_stats,
+		.fill_strings = mlx5e_grp_tls_fill_strings,
+		.fill_stats = mlx5e_grp_tls_fill_stats,
+	},
 	{
 		.get_num_stats = mlx5e_grp_channels_get_num_stats,
 		.fill_strings = mlx5e_grp_channels_fill_strings,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (11 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 12/14] net/mlx5e: TLS, Add error statistics Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  2018-03-20  2:45 ` [PATCH net-next 14/14] MAINTAINERS: Update TLS maintainers Saeed Mahameed
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed

From: Boris Pismenny <borisp@mellanox.com>

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 MAINTAINERS | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 214c9bca232a..cd4067ccf959 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8913,26 +8913,17 @@ W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
 F:	drivers/net/ethernet/mellanox/mlx5/core/en_*
 
-MELLANOX ETHERNET INNOVA DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
-R:	Boris Pismenny <borisp@mellanox.com>
+MELLANOX ETHERNET INNOVA DRIVERS
+M:	Boris Pismenny <borisp@mellanox.com>
 L:	netdev@vger.kernel.org
 S:	Supported
 W:	http://www.mellanox.com
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
+F:	drivers/net/ethernet/mellanox/mlx5/core/en_accel/*
+F:	drivers/net/ethernet/mellanox/mlx5/core/accel/*
 F:	drivers/net/ethernet/mellanox/mlx5/core/fpga/*
 F:	include/linux/mlx5/mlx5_ifc_fpga.h
 
-MELLANOX ETHERNET INNOVA IPSEC DRIVER
-M:	Ilan Tayari <ilant@mellanox.com>
-R:	Boris Pismenny <borisp@mellanox.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-W:	http://www.mellanox.com
-Q:	http://patchwork.ozlabs.org/project/netdev/list/
-F:	drivers/net/ethernet/mellanox/mlx5/core/en_ipsec/*
-F:	drivers/net/ethernet/mellanox/mlx5/core/ipsec*
-
 MELLANOX ETHERNET SWITCH DRIVERS
 M:	Jiri Pirko <jiri@mellanox.com>
 M:	Ido Schimmel <idosch@mellanox.com>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH net-next 14/14] MAINTAINERS: Update TLS maintainers
  2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
                   ` (12 preceding siblings ...)
  2018-03-20  2:45 ` [PATCH net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers Saeed Mahameed
@ 2018-03-20  2:45 ` Saeed Mahameed
  13 siblings, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-20  2:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Dave Watson, Boris Pismenny, Saeed Mahameed

From: Boris Pismenny <borisp@mellanox.com>

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cd4067ccf959..285ea4e6c580 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9711,7 +9711,7 @@ F:	net/netfilter/xt_CONNSECMARK.c
 F:	net/netfilter/xt_SECMARK.c
 
 NETWORKING [TLS]
-M:	Ilya Lesokhin <ilyal@mellanox.com>
+M:	Boris Pismenny <borisp@mellanox.com>
 M:	Aviad Yehezkel <aviadye@mellanox.com>
 M:	Dave Watson <davejwatson@fb.com>
 L:	netdev@vger.kernel.org
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 01/14] tcp: Add clean acked data hook
  2018-03-20  2:44 ` [PATCH net-next 01/14] tcp: Add clean acked data hook Saeed Mahameed
@ 2018-03-20 20:36   ` Rao Shoaib
  2018-03-21 11:21     ` Boris Pismenny
  0 siblings, 1 reply; 27+ messages in thread
From: Rao Shoaib @ 2018-03-20 20:36 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Aviad Yehezkel



On 03/19/2018 07:44 PM, Saeed Mahameed wrote:
> From: Ilya Lesokhin <ilyal@mellanox.com>
>
> Called when a TCP segment is acknowledged.
> Could be used by application protocols who hold additional
> metadata associated with the stream data.
>
> This is required by TLS device offload to release
> metadata associated with acknowledged TLS records.
>
> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>   include/net/inet_connection_sock.h | 2 ++
>   net/ipv4/tcp_input.c               | 2 ++
>   2 files changed, 4 insertions(+)
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index b68fea022a82..2ab6667275df 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
>    * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
>    * @icsk_ulp_ops	   Pluggable ULP control hook
>    * @icsk_ulp_data	   ULP private data
> + * @icsk_clean_acked	   Clean acked data hook
>    * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
>    * @icsk_ca_state:	   Congestion control state
>    * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
> @@ -102,6 +103,7 @@ struct inet_connection_sock {
>   	const struct inet_connection_sock_af_ops *icsk_af_ops;
>   	const struct tcp_ulp_ops  *icsk_ulp_ops;
>   	void			  *icsk_ulp_data;
> +	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>   	struct hlist_node         icsk_listen_portaddr_node;
>   	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>   	__u8			  icsk_ca_state:6,
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 451ef3012636..9854ecae7245 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3542,6 +3542,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>   	if (after(ack, prior_snd_una)) {
>   		flag |= FLAG_SND_UNA_ADVANCED;
>   		icsk->icsk_retransmits = 0;
> +		if (icsk->icsk_clean_acked)
> +			icsk->icsk_clean_acked(sk, ack);
>   	}
>   
>   	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
Per Dave we are not allowed to use function pointers any more, so why 
extend their use. I implemented a similar callback for my changes but in 
my use case I need to call the meta data update function even when the 
packet does not ack any new data or has no payload. Is it possible to 
move this to say tcp_data_queue() ?

Thanks,

Shoaib

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-20  2:45 ` [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure Saeed Mahameed
@ 2018-03-21 11:15   ` Kirill Tkhai
  2018-03-21 15:53     ` Boris Pismenny
  2018-03-21 15:08   ` Dave Watson
  1 sibling, 1 reply; 27+ messages in thread
From: Kirill Tkhai @ 2018-03-21 11:15 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin, Aviad Yehezkel

On 20.03.2018 05:45, Saeed Mahameed wrote:
> From: Ilya Lesokhin <ilyal@mellanox.com>
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network devices. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmit the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  include/net/tls.h             |  70 +++-
>  net/tls/Kconfig               |  10 +
>  net/tls/Makefile              |   2 +
>  net/tls/tls_device.c          | 804 ++++++++++++++++++++++++++++++++++++++++++
>  net/tls/tls_device_fallback.c | 419 ++++++++++++++++++++++
>  net/tls/tls_main.c            |  33 +-
>  6 files changed, 1331 insertions(+), 7 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..ab98a6dc4929 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>  	struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_record_info {
> +	struct list_head list;
> +	u32 end_seq;
> +	int len;
> +	int num_frags;
> +	skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> +	struct crypto_aead *aead_send;
> +	spinlock_t lock;	/* protects records list */
> +	struct list_head records_list;
> +	struct tls_record_info *open_record;
> +	struct tls_record_info *retransmit_hint;
> +	u64 hint_record_sn;
> +	u64 unacked_record_sn;
> +
> +	struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> +	void (*sk_destruct)(struct sock *sk);
> +	u8 driver_state[];
> +	/* The TLS layer reserves room for driver specific state
> +	 * Currently the belief is that there is not enough
> +	 * driver specific state to justify another layer of indirection
> +	 */
> +#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
> +};
> +
> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
> +	 TLS_DRIVER_STATE_SIZE)
> +
>  enum {
>  	TLS_PENDING_CLOSED_RECORD
>  };
> @@ -87,6 +118,10 @@ struct tls_context {
>  		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
>  	};
>  
> +	struct list_head list;
> +	struct net_device *netdev;
> +	refcount_t refcount;
> +
>  	void *priv_ctx;
>  
>  	u8 tx_conf:2;
> @@ -131,9 +166,29 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
>  void tls_sw_close(struct sock *sk, long timeout);
>  void tls_sw_free_tx_resources(struct sock *sk);
>  
> -void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
> -void tls_icsk_clean_acked(struct sock *sk);
> +void tls_clear_device_offload(struct sock *sk, struct tls_context *ctx);
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx);
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags);
> +void tls_device_sk_destruct(struct sock *sk);
> +void tls_device_init(void);
> +void tls_device_cleanup(void);
>  
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn);
> +
> +static inline bool tls_record_is_start_marker(struct tls_record_info *rec)
> +{
> +	return rec->len == 0;
> +}
> +
> +static inline u32 tls_record_start_seq(struct tls_record_info *rec)
> +{
> +	return rec->end_seq - rec->len;
> +}
> +
> +void tls_sk_destruct(struct sock *sk, struct tls_context *ctx);
>  int tls_push_sg(struct sock *sk, struct tls_context *ctx,
>  		struct scatterlist *sg, u16 first_offset,
>  		int flags);
> @@ -170,6 +225,13 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx)
>  	return tls_ctx->pending_open_record_frags;
>  }
>  
> +static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk)
> +{
> +	return sk_fullsock(sk) &&
> +	       /* matches smp_store_release in tls_set_device_offload */
> +	       smp_load_acquire(&sk->sk_destruct) == &tls_device_sk_destruct;
> +}
> +
>  static inline void tls_err_abort(struct sock *sk)
>  {
>  	sk->sk_err = EBADMSG;
> @@ -257,4 +319,8 @@ static inline struct tls_offload_context *tls_offload_ctx(
>  int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
>  		      unsigned char *record_type);
>  
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info);
> +
>  #endif /* _TLS_OFFLOAD_H */
> diff --git a/net/tls/Kconfig b/net/tls/Kconfig
> index eb583038c67e..9d3ef820bb16 100644
> --- a/net/tls/Kconfig
> +++ b/net/tls/Kconfig
> @@ -13,3 +13,13 @@ config TLS
>  	encryption handling of the TLS protocol to be done in-kernel.
>  
>  	If unsure, say N.
> +
> +config TLS_DEVICE
> +	bool "Transport Layer Security HW offload"
> +	depends on TLS
> +	select SOCK_VALIDATE_XMIT
> +	default n
> +	---help---
> +	Enable kernel support for HW offload of the TLS protocol.
> +
> +	If unsure, say N.
> diff --git a/net/tls/Makefile b/net/tls/Makefile
> index a930fd1c4f7b..4d6b728a67d0 100644
> --- a/net/tls/Makefile
> +++ b/net/tls/Makefile
> @@ -5,3 +5,5 @@
>  obj-$(CONFIG_TLS) += tls.o
>  
>  tls-y := tls_main.o tls_sw.o
> +
> +tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> new file mode 100644
> index 000000000000..c0d4e11a4286
> --- /dev/null
> +++ b/net/tls/tls_device.c
> @@ -0,0 +1,804 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + *      - Neither the name of the Mellanox Technologies nor the
> + *        names of its contributors may be used to endorse or promote
> + *        products derived from this software without specific prior written
> + *        permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE
> + */

Other patches have two licenses in header. Can I distribute this file under GPL license terms?

> +#include <linux/module.h>
> +#include <net/tcp.h>
> +#include <net/inet_common.h>
> +#include <linux/highmem.h>
> +#include <linux/netdevice.h>
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +
> +/* device_offload_lock is used to synchronize tls_dev_add
> + * against NETDEV_DOWN notifications.
> + */
> +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
> +
> +static void tls_device_gc_task(struct work_struct *work);
> +
> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> +static LIST_HEAD(tls_device_gc_list);
> +static LIST_HEAD(tls_device_list);
> +static DEFINE_SPINLOCK(tls_device_lock);
> +
> +static void tls_device_free_ctx(struct tls_context *ctx)
> +{
> +	struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
> +
> +	kfree(offlad_ctx);
> +	kfree(ctx);
> +}
> +
> +static void tls_device_gc_task(struct work_struct *work)
> +{
> +	struct tls_context *ctx, *tmp;
> +	struct list_head gc_list;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	INIT_LIST_HEAD(&gc_list);

This is stack variable, and it should be initialized outside of global spinlock.
There is LIST_HEAD() primitive for that in kernel.
There is one more similar place below.

> +	list_splice_init(&tls_device_gc_list, &gc_list);
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> +		struct net_device *netdev = ctx->netdev;
> +
> +		if (netdev) {
> +			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +							TLS_OFFLOAD_CTX_DIR_TX);
> +			dev_put(netdev);
> +		}

How is possible the situation we meet NULL netdev here?

> +
> +		list_del(&ctx->list);
> +		tls_device_free_ctx(ctx);
> +	}
> +}
> +
> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	list_move_tail(&ctx->list, &tls_device_gc_list);
> +
> +	/* schedule_work inside the spinlock
> +	 * to make sure tls_device_down waits for that work.
> +	 */
> +	schedule_work(&tls_device_gc_work);
> +
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +}
> +
> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct net_device *netdev = NULL;
> +
> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> +	return netdev;
> +}
> +
> +static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
> +				 struct tls_context *ctx)
> +{
> +	int rc;
> +
> +	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
> +					     &ctx->crypto_send,
> +					     tcp_sk(sk)->write_seq);
> +	if (rc) {
> +		pr_err_ratelimited("The netdev has refused to offload this socket\n");
> +		goto out;
> +	}
> +
> +	rc = 0;
> +out:
> +	return rc;
> +}
> +
> +static void destroy_record(struct tls_record_info *record)
> +{
> +	skb_frag_t *frag;
> +	int nr_frags = record->num_frags;
> +
> +	while (nr_frags > 0) {
> +		frag = &record->frags[nr_frags - 1];
> +		__skb_frag_unref(frag);
> +		--nr_frags;
> +	}
> +	kfree(record);
> +}
> +
> +static void delete_all_records(struct tls_offload_context *offload_ctx)
> +{
> +	struct tls_record_info *info, *temp;
> +
> +	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
> +		list_del(&info->list);
> +		destroy_record(info);
> +	}
> +
> +	offload_ctx->retransmit_hint = NULL;
> +}
> +
> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx;
> +	struct tls_record_info *info, *temp;
> +	unsigned long flags;
> +	u64 deleted_records = 0;
> +
> +	if (!tls_ctx)
> +		return;
> +
> +	ctx = tls_offload_ctx(tls_ctx);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	info = ctx->retransmit_hint;
> +	if (info && !before(acked_seq, info->end_seq)) {
> +		ctx->retransmit_hint = NULL;
> +		list_del(&info->list);
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
> +		if (before(acked_seq, info->end_seq))
> +			break;
> +		list_del(&info->list);
> +
> +		destroy_record(info);
> +		deleted_records++;
> +	}
> +
> +	ctx->unacked_record_sn += deleted_records;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +}
> +
> +/* At this point, there should be no references on this
> + * socket and no in-flight SKBs associated with this
> + * socket, so it is safe to free all the resources.
> + */
> +void tls_device_sk_destruct(struct sock *sk)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +
> +	if (ctx->open_record)
> +		destroy_record(ctx->open_record);
> +
> +	delete_all_records(ctx);
> +	crypto_free_aead(ctx->aead_send);
> +	ctx->sk_destruct(sk);
> +
> +	if (refcount_dec_and_test(&tls_ctx->refcount))
> +		tls_device_queue_ctx_destruction(tls_ctx);
> +}
> +EXPORT_SYMBOL(tls_device_sk_destruct);
> +
> +static inline void tls_append_frag(struct tls_record_info *record,
> +				   struct page_frag *pfrag,
> +				   int size)
> +{
> +	skb_frag_t *frag;
> +
> +	frag = &record->frags[record->num_frags - 1];
> +	if (frag->page.p == pfrag->page &&
> +	    frag->page_offset + frag->size == pfrag->offset) {
> +		frag->size += size;
> +	} else {
> +		++frag;
> +		frag->page.p = pfrag->page;
> +		frag->page_offset = pfrag->offset;
> +		frag->size = size;
> +		++record->num_frags;
> +		get_page(pfrag->page);
> +	}
> +
> +	pfrag->offset += size;
> +	record->len += size;
> +}
> +
> +static inline int tls_push_record(struct sock *sk,
> +				  struct tls_context *ctx,
> +				  struct tls_offload_context *offload_ctx,
> +				  struct tls_record_info *record,
> +				  struct page_frag *pfrag,
> +				  int flags,
> +				  unsigned char record_type)
> +{
> +	skb_frag_t *frag;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct page_frag fallback_frag;
> +	struct page_frag  *tag_pfrag = pfrag;
> +	int i;
> +
> +	/* fill prepand */
> +	frag = &record->frags[0];
> +	tls_fill_prepend(ctx,
> +			 skb_frag_address(frag),
> +			 record->len - ctx->prepend_size,
> +			 record_type);
> +
> +	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
> +		/* HW doesn't care about the data in the tag
> +		 * so in case pfrag has no room
> +		 * for a tag and we can't allocate a new pfrag
> +		 * just use the page in the first frag
> +		 * rather then write a complicated fall back code.
> +		 */
> +		tag_pfrag = &fallback_frag;
> +		tag_pfrag->page = skb_frag_page(frag);
> +		tag_pfrag->offset = 0;
> +	}
> +
> +	tls_append_frag(record, tag_pfrag, ctx->tag_size);
> +	record->end_seq = tp->write_seq + record->len;
> +	spin_lock_irq(&offload_ctx->lock);
> +	list_add_tail(&record->list, &offload_ctx->records_list);
> +	spin_unlock_irq(&offload_ctx->lock);
> +	offload_ctx->open_record = NULL;
> +	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
> +	tls_advance_record_sn(sk, ctx);
> +
> +	for (i = 0; i < record->num_frags; i++) {
> +		frag = &record->frags[i];
> +		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
> +		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
> +			    frag->size, frag->page_offset);
> +		sk_mem_charge(sk, frag->size);
> +		get_page(skb_frag_page(frag));
> +	}
> +	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
> +
> +	/* all ready, send */
> +	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
> +}
> +
> +static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
> +					struct page_frag *pfrag,
> +					size_t prepend_size)
> +{
> +	skb_frag_t *frag;
> +	struct tls_record_info *record;
> +
> +	record = kmalloc(sizeof(*record), GFP_KERNEL);
> +	if (!record)
> +		return -ENOMEM;
> +
> +	frag = &record->frags[0];
> +	__skb_frag_set_page(frag, pfrag->page);
> +	frag->page_offset = pfrag->offset;
> +	skb_frag_size_set(frag, prepend_size);
> +
> +	get_page(pfrag->page);
> +	pfrag->offset += prepend_size;
> +
> +	record->num_frags = 1;
> +	record->len = prepend_size;
> +	offload_ctx->open_record = record;
> +	return 0;
> +}
> +
> +static inline int tls_do_allocation(struct sock *sk,
> +				    struct tls_offload_context *offload_ctx,
> +				    struct page_frag *pfrag,
> +				    size_t prepend_size)
> +{
> +	int ret;
> +
> +	if (!offload_ctx->open_record) {
> +		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
> +						   sk->sk_allocation))) {
> +			sk->sk_prot->enter_memory_pressure(sk);
> +			sk_stream_moderate_sndbuf(sk);
> +			return -ENOMEM;
> +		}
> +
> +		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
> +		if (ret)
> +			return ret;
> +
> +		if (pfrag->size > pfrag->offset)
> +			return 0;
> +	}
> +
> +	if (!sk_page_frag_refill(sk, pfrag))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static int tls_push_data(struct sock *sk,
> +			 struct iov_iter *msg_iter,
> +			 size_t size, int flags,
> +			 unsigned char record_type)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	struct tls_record_info *record = ctx->open_record;
> +	struct page_frag *pfrag;
> +	int copy, rc = 0;
> +	size_t orig_size = size;
> +	u32 max_open_record_len;
> +	long timeo;
> +	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> +	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> +	bool done = false;
> +
> +	if (flags &
> +	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
> +		return -ENOTSUPP;
> +
> +	if (sk->sk_err)
> +		return -sk->sk_err;
> +
> +	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> +	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
> +	if (rc < 0)
> +		return rc;
> +
> +	pfrag = sk_page_frag(sk);
> +
> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
> +	 * we need to leave room for an authentication tag.
> +	 */
> +	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
> +			      tls_ctx->prepend_size;
> +	do {
> +		if (tls_do_allocation(sk, ctx, pfrag,
> +				      tls_ctx->prepend_size)) {
> +			rc = sk_stream_wait_memory(sk, &timeo);
> +			if (!rc)
> +				continue;
> +
> +			record = ctx->open_record;
> +			if (!record)
> +				break;
> +handle_error:
> +			if (record_type != TLS_RECORD_TYPE_DATA) {
> +				/* avoid sending partial
> +				 * record with type !=
> +				 * application_data
> +				 */
> +				size = orig_size;
> +				destroy_record(record);
> +				ctx->open_record = NULL;
> +			} else if (record->len > tls_ctx->prepend_size) {
> +				goto last_record;
> +			}
> +
> +			break;
> +		}
> +
> +		record = ctx->open_record;
> +		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
> +		copy = min_t(size_t, copy, (max_open_record_len - record->len));
> +
> +		if (copy_from_iter_nocache(page_address(pfrag->page) +
> +					       pfrag->offset,
> +					   copy, msg_iter) != copy) {
> +			rc = -EFAULT;
> +			goto handle_error;
> +		}
> +		tls_append_frag(record, pfrag, copy);
> +
> +		size -= copy;
> +		if (!size) {
> +last_record:
> +			tls_push_record_flags = flags;
> +			if (more) {
> +				tls_ctx->pending_open_record_frags =
> +						record->num_frags;
> +				break;
> +			}
> +
> +			done = true;
> +		}
> +
> +		if ((done) || record->len >= max_open_record_len ||
> +		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
> +			rc = tls_push_record(sk,
> +					     tls_ctx,
> +					     ctx,
> +					     record,
> +					     pfrag,
> +					     tls_push_record_flags,
> +					     record_type);
> +			if (rc < 0)
> +				break;
> +		}
> +	} while (!done);
> +
> +	if (orig_size - size > 0)
> +		rc = orig_size - size;
> +
> +	return rc;
> +}
> +
> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> +	unsigned char record_type = TLS_RECORD_TYPE_DATA;
> +	int rc = 0;
> +
> +	lock_sock(sk);
> +
> +	if (unlikely(msg->msg_controllen)) {
> +		rc = tls_proccess_cmsg(sk, msg, &record_type);
> +		if (rc)
> +			goto out;
> +	}
> +
> +	rc = tls_push_data(sk, &msg->msg_iter, size,
> +			   msg->msg_flags, record_type);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +int tls_device_sendpage(struct sock *sk, struct page *page,
> +			int offset, size_t size, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +	struct kvec iov;
> +	char *kaddr = kmap(page);
> +	int rc = 0;
> +
> +	if (flags & MSG_SENDPAGE_NOTLAST)
> +		flags |= MSG_MORE;
> +
> +	lock_sock(sk);
> +
> +	if (flags & MSG_OOB) {
> +		rc = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kaddr + offset;
> +	iov.iov_len = size;
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
> +	rc = tls_push_data(sk, &msg_iter, size,
> +			   flags, TLS_RECORD_TYPE_DATA);
> +	kunmap(page);
> +
> +out:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
> +				       u32 seq, u64 *p_record_sn)
> +{
> +	struct tls_record_info *info;
> +	u64 record_sn = context->hint_record_sn;
> +
> +	info = context->retransmit_hint;
> +	if (!info ||
> +	    before(seq, info->end_seq - info->len)) {
> +		/* if retransmit_hint is irrelevant start
> +		 * from the begging of the list
> +		 */
> +		info = list_first_entry(&context->records_list,
> +					struct tls_record_info, list);
> +		record_sn = context->unacked_record_sn;
> +	}
> +
> +	list_for_each_entry_from(info, &context->records_list, list) {
> +		if (before(seq, info->end_seq)) {
> +			if (!context->retransmit_hint ||
> +			    after(info->end_seq,
> +				  context->retransmit_hint->end_seq)) {
> +				context->hint_record_sn = record_sn;
> +				context->retransmit_hint = info;
> +			}
> +			*p_record_sn = record_sn;
> +			return info;
> +		}
> +		record_sn++;
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(tls_get_record);
> +
> +static int tls_device_push_pending_record(struct sock *sk, int flags)
> +{
> +	struct iov_iter	msg_iter;
> +
> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
> +	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
> +}
> +
> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
> +{
> +	u16 nonece_size, tag_size, iv_size, rec_seq_size;
> +	struct tls_record_info *start_marker_record;
> +	struct tls_offload_context *offload_ctx;
> +	struct tls_crypto_info *crypto_info;
> +	struct net_device *netdev;
> +	char *iv, *rec_seq;
> +	struct sk_buff *skb;
> +	int rc = -EINVAL;
> +	__be64 rcd_sn;
> +
> +	if (!ctx)
> +		goto out;
> +
> +	if (ctx->priv_ctx) {
> +		rc = -EEXIST;
> +		goto out;
> +	}
> +
> +	/* We support starting offload on multiple sockets
> +	 * concurrently, So we only need a read lock here.
> +	 */
> +	percpu_down_read(&device_offload_lock);
> +	netdev = get_netdev_for_sock(sk);
> +	if (!netdev) {
> +		pr_err_ratelimited("%s: netdev not found\n", __func__);
> +		rc = -EINVAL;
> +		goto release_lock;
> +	}
> +
> +	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
> +		rc = -ENOTSUPP;
> +		goto release_netdev;
> +	}
> +
> +	/* Avoid offloading if the device is down
> +	 * We don't want to offload new flows after
> +	 * the NETDEV_DOWN event
> +	 */
> +	if (!(netdev->flags & IFF_UP)) {
> +		rc = -EINVAL;
> +		goto release_lock;
> +	}
> +
> +	crypto_info = &ctx->crypto_send;
> +	switch (crypto_info->cipher_type) {
> +	case TLS_CIPHER_AES_GCM_128: {
> +		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
> +		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
> +		rec_seq =
> +		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
> +		break;
> +	}
> +	default:
> +		rc = -EINVAL;
> +		goto release_netdev;
> +	}
> +
> +	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);

Can we memory allocations and simple memory initializations ouside the global rwsem?

> +	if (!start_marker_record) {
> +		rc = -ENOMEM;
> +		goto release_netdev;
> +	}
> +
> +	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
> +	if (!offload_ctx)
> +		goto free_marker_record;
> +
> +	ctx->priv_ctx = offload_ctx;
> +	rc = attach_sock_to_netdev(sk, netdev, ctx);
> +	if (rc)
> +		goto free_offload_context;
> +
> +	ctx->netdev = netdev;
> +	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
> +	ctx->tag_size = tag_size;
> +	ctx->iv_size = iv_size;
> +	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
> +			  GFP_KERNEL);
> +	if (!ctx->iv) {
> +		rc = -ENOMEM;
> +		goto detach_sock;
> +	}
> +
> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
> +
> +	ctx->rec_seq_size = rec_seq_size;
> +	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
> +	if (!ctx->rec_seq) {
> +		rc = -ENOMEM;
> +		goto free_iv;
> +	}
> +	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
> +
> +	/* start at rec_seq - 1 to account for the start marker record */
> +	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
> +	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> +
> +	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
> +	if (rc)
> +		goto free_rec_seq;
> +
> +	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
> +	start_marker_record->len = 0;
> +	start_marker_record->num_frags = 0;
> +
> +	INIT_LIST_HEAD(&offload_ctx->records_list);
> +	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
> +	spin_lock_init(&offload_ctx->lock);
> +
> +	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
> +	ctx->push_pending_record = tls_device_push_pending_record;
> +	offload_ctx->sk_destruct = sk->sk_destruct;
> +
> +	/* TLS offload is greatly simplified if we don't send
> +	 * SKBs where only part of the payload needs to be encrypted.
> +	 * So mark the last skb in the write queue as end of record.
> +	 */
> +	skb = tcp_write_queue_tail(sk);
> +	if (skb)
> +		TCP_SKB_CB(skb)->eor = 1;
> +
> +	refcount_set(&ctx->refcount, 1);
> +	spin_lock_irq(&tls_device_lock);
> +	list_add_tail(&ctx->list, &tls_device_list);
> +	spin_unlock_irq(&tls_device_lock);
> +
> +	/* following this assignment tls_is_sk_tx_device_offloaded
> +	 * will return true and the context might be accessed
> +	 * by the netdev's xmit function.
> +	 */
> +	smp_store_release(&sk->sk_destruct,
> +			  &tls_device_sk_destruct);
> +	goto release_lock;
> +
> +free_rec_seq:
> +	kfree(ctx->rec_seq);
> +free_iv:
> +	kfree(ctx->iv);
> +detach_sock:
> +	netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX);
> +free_offload_context:
> +	kfree(offload_ctx);
> +	ctx->priv_ctx = NULL;
> +free_marker_record:
> +	kfree(start_marker_record);
> +release_netdev:
> +	dev_put(netdev);
> +release_lock:
> +	percpu_up_read(&device_offload_lock);
> +out:
> +	return rc;
> +}
> +
> +static int tls_device_register(struct net_device *dev)
> +{
> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
> +		return NOTIFY_BAD;
> +
> +	return NOTIFY_DONE;
> +}

This function is the same as tls_device_feat_change(). Can't we merge
them together and avoid duplicating of code?

> +static int tls_device_unregister(struct net_device *dev)
> +{
> +	return NOTIFY_DONE;
> +}

This function does nothing, and next patches do not change it.
Can't we remove it since so?

> +static int tls_device_feat_change(struct net_device *dev)
> +{
> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
> +		return NOTIFY_BAD;
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int tls_device_down(struct net_device *netdev)
> +{
> +	struct tls_context *ctx, *tmp;
> +	struct list_head list;
> +	unsigned long flags;
> +
> +	if (!(netdev->features & NETIF_F_HW_TLS_TX))
> +		return NOTIFY_DONE;

Can't we move this check in tls_dev_event() and use it for all types of events?
Then we avoid duplicate code.

> +
> +	/* Request a write lock to block new offload attempts
> +	 */
> +	percpu_down_write(&device_offload_lock);

What is the reason percpu_rwsem is chosen here? It looks like this primitive
gives more advantages readers, then plain rwsem does. But it also gives
disadvantages to writers. It would be good, unless tls_device_down() is called
with rtnl_lock() held from netdevice notifier. But since netdevice notifier
are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
is locked.

Can't we use plain rwsem here instead?

> +
> +	spin_lock_irqsave(&tls_device_lock, flags);
> +	INIT_LIST_HEAD(&list);

This may go outside the global spinlock.

> +	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
> +		if (ctx->netdev != netdev ||
> +		    !refcount_inc_not_zero(&ctx->refcount))
> +			continue;
> +
> +		list_move(&ctx->list, &list);
> +	}
> +	spin_unlock_irqrestore(&tls_device_lock, flags);
> +
> +	list_for_each_entry_safe(ctx, tmp, &list, list)	{
> +		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +						TLS_OFFLOAD_CTX_DIR_TX);
> +		ctx->netdev = NULL;
> +		dev_put(netdev);
> +		list_del_init(&ctx->list);
> +
> +		if (refcount_dec_and_test(&ctx->refcount))
> +			tls_device_free_ctx(ctx);
> +	}
> +
> +	percpu_up_write(&device_offload_lock);
> +
> +	flush_work(&tls_device_gc_work);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int tls_dev_event(struct notifier_block *this, unsigned long event,
> +			 void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return tls_device_register(dev);
> +
> +	case NETDEV_UNREGISTER:
> +		return tls_device_unregister(dev);
> +
> +	case NETDEV_FEAT_CHANGE:
> +		return tls_device_feat_change(dev);
> +
> +	case NETDEV_DOWN:
> +		return tls_device_down(dev);
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block tls_dev_notifier = {
> +	.notifier_call	= tls_dev_event,
> +};
> +
> +void __init tls_device_init(void)
> +{
> +	register_netdevice_notifier(&tls_dev_notifier);
> +}
> +
> +void __exit tls_device_cleanup(void)
> +{
> +	unregister_netdevice_notifier(&tls_dev_notifier);
> +	flush_work(&tls_device_gc_work);
> +}
> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> new file mode 100644
> index 000000000000..14d31a36885c
> --- /dev/null
> +++ b/net/tls/tls_device_fallback.c
> @@ -0,0 +1,419 @@
> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + *      - Neither the name of the Mellanox Technologies nor the
> + *        names of its contributors may be used to endorse or promote
> + *        products derived from this software without specific prior written
> + *        permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE
> + */
> +
> +#include <net/tls.h>
> +#include <crypto/aead.h>
> +#include <crypto/scatterwalk.h>
> +#include <net/ip6_checksum.h>
> +
> +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
> +{
> +	struct scatterlist *src = walk->sg;
> +	int diff = walk->offset - src->offset;
> +
> +	sg_set_page(sg, sg_page(src),
> +		    src->length - diff, walk->offset);
> +
> +	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
> +}
> +
> +static int tls_enc_record(struct aead_request *aead_req,
> +			  struct crypto_aead *aead, char *aad, char *iv,
> +			  __be64 rcd_sn, struct scatter_walk *in,
> +			  struct scatter_walk *out, int *in_len)
> +{
> +	struct scatterlist sg_in[3];
> +	struct scatterlist sg_out[3];
> +	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
> +	u16 len;
> +	int rc;
> +
> +	len = min_t(int, *in_len, ARRAY_SIZE(buf));
> +
> +	scatterwalk_copychunks(buf, in, len, 0);
> +	scatterwalk_copychunks(buf, out, len, 1);
> +
> +	*in_len -= len;
> +	if (!*in_len)
> +		return 0;
> +
> +	scatterwalk_pagedone(in, 0, 1);
> +	scatterwalk_pagedone(out, 1, 1);
> +
> +	len = buf[4] | (buf[3] << 8);
> +	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +
> +	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
> +		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
> +
> +	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
> +	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
> +
> +	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
> +	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
> +	chain_to_walk(sg_in + 1, in);
> +	chain_to_walk(sg_out + 1, out);
> +
> +	*in_len -= len;
> +	if (*in_len < 0) {
> +		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +		if (*in_len < 0)
> +		/* the input buffer doesn't contain the entire record.
> +		 * trim len accordingly. The resulting authentication tag
> +		 * will contain garbage. but we don't care as we won't
> +		 * include any of it in the output skb
> +		 * Note that we assume the output buffer length
> +		 * is larger then input buffer length + tag size
> +		 */
> +			len += *in_len;
> +
> +		*in_len = 0;
> +	}
> +
> +	if (*in_len) {
> +		scatterwalk_copychunks(NULL, in, len, 2);
> +		scatterwalk_pagedone(in, 0, 1);
> +		scatterwalk_copychunks(NULL, out, len, 2);
> +		scatterwalk_pagedone(out, 1, 1);
> +	}
> +
> +	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> +	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
> +
> +	rc = crypto_aead_encrypt(aead_req);
> +
> +	return rc;
> +}
> +
> +static void tls_init_aead_request(struct aead_request *aead_req,
> +				  struct crypto_aead *aead)
> +{
> +	aead_request_set_tfm(aead_req, aead);
> +	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
> +}
> +
> +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
> +						   gfp_t flags)
> +{
> +	unsigned int req_size = sizeof(struct aead_request) +
> +		crypto_aead_reqsize(aead);
> +	struct aead_request *aead_req;
> +
> +	aead_req = kzalloc(req_size, flags);
> +	if (!aead_req)
> +		return NULL;
> +
> +	tls_init_aead_request(aead_req, aead);
> +	return aead_req;
> +}
> +
> +static int tls_enc_records(struct aead_request *aead_req,
> +			   struct crypto_aead *aead, struct scatterlist *sg_in,
> +			   struct scatterlist *sg_out, char *aad, char *iv,
> +			   u64 rcd_sn, int len)
> +{
> +	struct scatter_walk in;
> +	struct scatter_walk out;
> +	int rc;
> +
> +	scatterwalk_start(&in, sg_in);
> +	scatterwalk_start(&out, sg_out);
> +
> +	do {
> +		rc = tls_enc_record(aead_req, aead, aad, iv,
> +				    cpu_to_be64(rcd_sn), &in, &out, &len);
> +		rcd_sn++;
> +
> +	} while (rc == 0 && len);
> +
> +	scatterwalk_done(&in, 0, 0);
> +	scatterwalk_done(&out, 1, 0);
> +
> +	return rc;
> +}
> +
> +static inline void update_chksum(struct sk_buff *skb, int headln)
> +{
> +	/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
> +	 * might have been changed by NAT.
> +	 */
> +
> +	const struct ipv6hdr *ipv6h;
> +	const struct iphdr *iph;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	int datalen = skb->len - headln;
> +
> +	/* We only changed the payload so if we are using partial we don't
> +	 * need to update anything.
> +	 */
> +	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
> +		return;
> +
> +	skb->ip_summed = CHECKSUM_PARTIAL;
> +	skb->csum_start = skb_transport_header(skb) - skb->head;
> +	skb->csum_offset = offsetof(struct tcphdr, check);
> +
> +	if (skb->sk->sk_family == AF_INET6) {
> +		ipv6h = ipv6_hdr(skb);
> +		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
> +					     datalen, IPPROTO_TCP, 0);
> +	} else {
> +		iph = ip_hdr(skb);
> +		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
> +					       IPPROTO_TCP, 0);
> +	}
> +}
> +
> +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
> +{
> +	skb_copy_header(nskb, skb);
> +
> +	skb_put(nskb, skb->len);
> +	memcpy(nskb->data, skb->data, headln);
> +	update_chksum(nskb, headln);
> +
> +	nskb->destructor = skb->destructor;
> +	nskb->sk = skb->sk;
> +	skb->destructor = NULL;
> +	skb->sk = NULL;
> +	refcount_add(nskb->truesize - skb->truesize,
> +		     &nskb->sk->sk_wmem_alloc);
> +}
> +
> +/* This function may be called after the user socket is already
> + * closed so make sure we don't use anything freed during
> + * tls_sk_proto_close here
> + */
> +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
> +{
> +	int tcp_header_size = tcp_hdrlen(skb);
> +	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
> +	int payload_len = skb->len - tcp_payload_offset;
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
> +	int remaining, buf_len, resync_sgs, rc, i = 0;
> +	void *buf, *dummy_buf, *iv, *aad;
> +	struct scatterlist *sg_in;
> +	struct scatterlist sg_out[3];
> +	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> +	struct aead_request *aead_req;
> +	struct sk_buff *nskb = NULL;
> +	struct tls_record_info *record;
> +	unsigned long flags;
> +	s32 sync_size;
> +	u64 rcd_sn;
> +
> +	/* worst case is:
> +	 * MAX_SKB_FRAGS in tls_record_info
> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.
> +	 */
> +	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
> +
> +	if (!payload_len)
> +		return skb;
> +
> +	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
> +	if (!sg_in)
> +		goto free_orig;
> +
> +	sg_init_table(sg_in, sg_in_max_elements);
> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
> +	if (!record) {
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		WARN(1, "Record not found for seq %u\n", tcp_seq);
> +		goto free_sg;
> +	}
> +
> +	sync_size = tcp_seq - tls_record_start_seq(record);
> +	if (sync_size < 0) {
> +		int is_start_marker = tls_record_is_start_marker(record);
> +
> +		spin_unlock_irqrestore(&ctx->lock, flags);
> +		if (!is_start_marker)
> +		/* This should only occur if the relevant record was
> +		 * already acked. In that case it should be ok
> +		 * to drop the packet and avoid retransmission.
> +		 *
> +		 * There is a corner case where the packet contains
> +		 * both an acked and a non-acked record.
> +		 * We currently don't handle that case and rely
> +		 * on TCP to retranmit a packet that doesn't contain
> +		 * already acked payload.
> +		 */
> +			goto free_orig;
> +
> +		if (payload_len > -sync_size) {
> +			WARN(1, "Fallback of partially offloaded packets is not supported\n");
> +			goto free_sg;
> +		} else {
> +			return skb;
> +		}
> +	}
> +
> +	remaining = sync_size;
> +	while (remaining > 0) {
> +		skb_frag_t *frag = &record->frags[i];
> +
> +		__skb_frag_ref(frag);
> +		sg_set_page(sg_in + i, skb_frag_page(frag),
> +			    skb_frag_size(frag), frag->page_offset);
> +
> +		remaining -= skb_frag_size(frag);
> +
> +		if (remaining < 0)
> +			sg_in[i].length += remaining;
> +
> +		i++;
> +	}
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +	resync_sgs = i;
> +
> +	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
> +	if (!aead_req)
> +		goto put_sg;
> +
> +	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
> +		  TLS_AAD_SPACE_SIZE +
> +		  sync_size +
> +		  tls_ctx->tag_size;
> +	buf = kmalloc(buf_len, GFP_ATOMIC);
> +	if (!buf)
> +		goto free_req;
> +
> +	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
> +	if (!nskb)
> +		goto free_buf;
> +
> +	skb_reserve(nskb, skb_headroom(skb));
> +
> +	iv = buf;
> +
> +	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
> +	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> +	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> +	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
> +	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
> +
> +	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
> +	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
> +		   payload_len);
> +	/* Add room for authentication tag produced by crypto */
> +	dummy_buf += sync_size;
> +	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
> +	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
> +			  payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
> +			     rcd_sn, sync_size + payload_len);
> +	if (rc < 0)
> +		goto free_nskb;
> +
> +	complete_skb(nskb, skb, tcp_payload_offset);
> +
> +	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
> +	 * nskb->prev will point to the skb itself
> +	 */
> +	nskb->prev = nskb;
> +free_buf:
> +	kfree(buf);
> +free_req:
> +	kfree(aead_req);
> +put_sg:
> +	for (i = 0; i < resync_sgs; i++)
> +		put_page(sg_page(&sg_in[i]));
> +free_sg:
> +	kfree(sg_in);
> +free_orig:
> +	kfree_skb(skb);
> +	return nskb;
> +
> +free_nskb:
> +	kfree_skb(nskb);
> +	nskb = NULL;
> +	goto free_buf;
> +}
> +
> +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> +					     struct net_device *dev,
> +					     struct sk_buff *skb)
> +{
> +	if (dev == tls_get_ctx(sk)->netdev)
> +		return skb;
> +
> +	return tls_sw_fallback(sk, skb);
> +}
> +
> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info)
> +{
> +	int rc;
> +	const u8 *key;
> +
> +	offload_ctx->aead_send =
> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> +	if (IS_ERR(offload_ctx->aead_send)) {
> +		rc = PTR_ERR(offload_ctx->aead_send);
> +		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
> +		offload_ctx->aead_send = NULL;
> +		goto err_out;
> +	}
> +
> +	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
> +
> +	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
> +				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
> +				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
> +	if (rc)
> +		goto free_aead;
> +
> +	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
> +	return 0;
> +free_aead:
> +	crypto_free_aead(offload_ctx->aead_send);
> +err_out:
> +	return rc;
> +}
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index d824d548447e..e0dface33017 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -54,6 +54,9 @@ enum {
>  enum {
>  	TLS_BASE_TX,
>  	TLS_SW_TX,
> +#ifdef CONFIG_TLS_DEVICE
> +	TLS_HW_TX,
> +#endif
>  	TLS_NUM_CONFIG,
>  };
>  
> @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
>  		goto err_crypto_info;
>  	}
>  
> -	/* currently SW is default, we will have ethtool in future */
> -	rc = tls_set_sw_offload(sk, ctx);
> -	tx_conf = TLS_SW_TX;
> -	if (rc)
> -		goto err_crypto_info;
> +#ifdef CONFIG_TLS_DEVICE
> +	rc = tls_set_device_offload(sk, ctx);
> +	tx_conf = TLS_HW_TX;
> +	if (rc) {
> +#else
> +	{
> +#endif
> +		/* if HW offload fails fallback to SW */
> +		rc = tls_set_sw_offload(sk, ctx);
> +		tx_conf = TLS_SW_TX;
> +		if (rc)
> +			goto err_crypto_info;
> +	}
>  
>  	ctx->tx_conf = tx_conf;
>  	update_sk_prot(sk, ctx);
> @@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
>  	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
>  	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
>  	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
> +
> +#ifdef CONFIG_TLS_DEVICE
> +	prot[TLS_HW_TX] = prot[TLS_SW_TX];
> +	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
> +	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
> +#endif
>  }
>  
>  static int tls_init(struct sock *sk)
> @@ -531,6 +548,9 @@ static int __init tls_register(void)
>  {
>  	build_protos(tls_prots[TLSV4], &tcp_prot);
>  
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_init();
> +#endif
>  	tcp_register_ulp(&tcp_tls_ulp_ops);
>  
>  	return 0;
> @@ -539,6 +559,9 @@ static int __init tls_register(void)
>  static void __exit tls_unregister(void)
>  {
>  	tcp_unregister_ulp(&tcp_tls_ulp_ops);
> +#ifdef CONFIG_TLS_DEVICE
> +	tls_device_cleanup();
> +#endif
>  }
>  
>  module_init(tls_register);

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 01/14] tcp: Add clean acked data hook
  2018-03-20 20:36   ` Rao Shoaib
@ 2018-03-21 11:21     ` Boris Pismenny
  2018-03-21 16:16       ` Rao Shoaib
  0 siblings, 1 reply; 27+ messages in thread
From: Boris Pismenny @ 2018-03-21 11:21 UTC (permalink / raw)
  To: Rao Shoaib, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel



On 3/20/2018 10:36 PM, Rao Shoaib wrote:
> 
> 
> On 03/19/2018 07:44 PM, Saeed Mahameed wrote:
>> From: Ilya Lesokhin <ilyal@mellanox.com>
>>
>> Called when a TCP segment is acknowledged.
>> Could be used by application protocols who hold additional
>> metadata associated with the stream data.
>>
>> This is required by TLS device offload to release
>> metadata associated with acknowledged TLS records.
>>
>> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   include/net/inet_connection_sock.h | 2 ++
>>   net/ipv4/tcp_input.c               | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/include/net/inet_connection_sock.h 
>> b/include/net/inet_connection_sock.h
>> index b68fea022a82..2ab6667275df 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
>>    * @icsk_af_ops           Operations which are AF_INET{4,6} specific
>>    * @icsk_ulp_ops       Pluggable ULP control hook
>>    * @icsk_ulp_data       ULP private data
>> + * @icsk_clean_acked       Clean acked data hook
>>    * @icsk_listen_portaddr_node    hash to the portaddr listener 
>> hashtable
>>    * @icsk_ca_state:       Congestion control state
>>    * @icsk_retransmits:       Number of unrecovered [RTO] timeouts
>> @@ -102,6 +103,7 @@ struct inet_connection_sock {
>>       const struct inet_connection_sock_af_ops *icsk_af_ops;
>>       const struct tcp_ulp_ops  *icsk_ulp_ops;
>>       void              *icsk_ulp_data;
>> +    void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>>       struct hlist_node         icsk_listen_portaddr_node;
>>       unsigned int          (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>>       __u8              icsk_ca_state:6,
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 451ef3012636..9854ecae7245 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3542,6 +3542,8 @@ static int tcp_ack(struct sock *sk, const struct 
>> sk_buff *skb, int flag)
>>       if (after(ack, prior_snd_una)) {
>>           flag |= FLAG_SND_UNA_ADVANCED;
>>           icsk->icsk_retransmits = 0;
>> +        if (icsk->icsk_clean_acked)
>> +            icsk->icsk_clean_acked(sk, ack);
>>       }
>>       prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : 
>> tp->snd_una;
> Per Dave we are not allowed to use function pointers any more, so why 
> extend their use. I implemented a similar callback for my changes but in 
> my use case I need to call the meta data update function even when the 
> packet does not ack any new data or has no payload. Is it possible to 
> move this to say tcp_data_queue() ?

Sometimes function pointers are unavoidable. For example, when a module 
must change the functionality of a function. I think it is preferable to 
advance the kernel

This function is used to free memory based on new acknowledged data. It 
is unrelated to whether data was received or not. So it is not possible 
to move this call to tcp_data_queue.

Just in case, I'll add a static key here to reduce the impact on the 
fast-path as once suggested by EricD on netdev2.2.

> 
> Thanks,
> 
> Shoaib
> 
> 

Best,
Boris.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-20  2:45 ` [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure Saeed Mahameed
  2018-03-21 11:15   ` Kirill Tkhai
@ 2018-03-21 15:08   ` Dave Watson
  2018-03-21 15:38     ` Boris Pismenny
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Watson @ 2018-03-21 15:08 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Boris Pismenny, Ilya Lesokhin, Aviad Yehezkel

On 03/19/18 07:45 PM, Saeed Mahameed wrote:
> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
> +	 TLS_DRIVER_STATE_SIZE)
> +
> +	pfrag = sk_page_frag(sk);
> +
> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and

I think the define is actually TLS_HEADER_SIZE, no KTLS_ prefix

> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
> +
> +	ctx->rec_seq_size = rec_seq_size;
> +	/* worst case is:
> +	 * MAX_SKB_FRAGS in tls_record_info
> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.

spelling

> +int tls_sw_fallback_init(struct sock *sk,
> +			 struct tls_offload_context *offload_ctx,
> +			 struct tls_crypto_info *crypto_info)
> +{
> +	int rc;
> +	const u8 *key;
> +
> +	offload_ctx->aead_send =
> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);

in tls_sw we went with async + crypto_wait_req, any reason to not do
that here?  Otherwise I think you still get the software gcm on x86
instead of aesni without additional changes.

> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index d824d548447e..e0dface33017 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -54,6 +54,9 @@ enum {
>  enum {
>  	TLS_BASE_TX,
>  	TLS_SW_TX,
> +#ifdef CONFIG_TLS_DEVICE
> +	TLS_HW_TX,
> +#endif
>  	TLS_NUM_CONFIG,
>  };

I have posted SW_RX patches, do you forsee any issues with SW_RX + HW_TX?

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-21 15:08   ` Dave Watson
@ 2018-03-21 15:38     ` Boris Pismenny
  0 siblings, 0 replies; 27+ messages in thread
From: Boris Pismenny @ 2018-03-21 15:38 UTC (permalink / raw)
  To: Dave Watson, Saeed Mahameed
  Cc: David S. Miller, netdev, Ilya Lesokhin, Aviad Yehezkel



On 3/21/2018 5:08 PM, Dave Watson wrote:
> On 03/19/18 07:45 PM, Saeed Mahameed wrote:
>> +#define TLS_OFFLOAD_CONTEXT_SIZE                                               \
>> +	(ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +           \
>> +	 TLS_DRIVER_STATE_SIZE)
>> +
>> +	pfrag = sk_page_frag(sk);
>> +
>> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
> 
> I think the define is actually TLS_HEADER_SIZE, no KTLS_ prefix
> 

Fixed. Thanks.

>> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
>> +
>> +	ctx->rec_seq_size = rec_seq_size;
>> +	/* worst case is:
>> +	 * MAX_SKB_FRAGS in tls_record_info
>> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.
> 
> spelling
> 

Fixed. Thanks.

>> +int tls_sw_fallback_init(struct sock *sk,
>> +			 struct tls_offload_context *offload_ctx,
>> +			 struct tls_crypto_info *crypto_info)
>> +{
>> +	int rc;
>> +	const u8 *key;
>> +
>> +	offload_ctx->aead_send =
>> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> 
> in tls_sw we went with async + crypto_wait_req, any reason to not do
> that here?  Otherwise I think you still get the software gcm on x86
> instead of aesni without additional changes.
> 

Yes, synchronous crypto code runs to handle a software fallback in 
validate_xmit_skb, where waiting is not possible. I know Steffen 
recently added support for calling async crypto from validate_xmit_skb, 
but it wasn't available when we were writing these patches.

I think we could implemented async support in the future based on the 
infrastructure introduced by Steffen.

>> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
>> index d824d548447e..e0dface33017 100644
>> --- a/net/tls/tls_main.c
>> +++ b/net/tls/tls_main.c
>> @@ -54,6 +54,9 @@ enum {
>>   enum {
>>   	TLS_BASE_TX,
>>   	TLS_SW_TX,
>> +#ifdef CONFIG_TLS_DEVICE
>> +	TLS_HW_TX,
>> +#endif
>>   	TLS_NUM_CONFIG,
>>   };
> 
> I have posted SW_RX patches, do you forsee any issues with SW_RX + HW_TX?
> 

No, but I haven't tested these patches with the SW_RX patches.
I'll try to rebase your V2 SW_RX patches over this series tomorrow and 
run some tests.

> Thanks
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-21 11:15   ` Kirill Tkhai
@ 2018-03-21 15:53     ` Boris Pismenny
  2018-03-21 16:31       ` Kirill Tkhai
  0 siblings, 1 reply; 27+ messages in thread
From: Boris Pismenny @ 2018-03-21 15:53 UTC (permalink / raw)
  To: Kirill Tkhai, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel

...
> 
> Other patches have two licenses in header. Can I distribute this file under GPL license terms?
> 

Sure, I'll update the license to match other files under net/tls.

>> +#include <linux/module.h>
>> +#include <net/tcp.h>
>> +#include <net/inet_common.h>
>> +#include <linux/highmem.h>
>> +#include <linux/netdevice.h>
>> +
>> +#include <net/tls.h>
>> +#include <crypto/aead.h>
>> +
>> +/* device_offload_lock is used to synchronize tls_dev_add
>> + * against NETDEV_DOWN notifications.
>> + */
>> +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
>> +
>> +static void tls_device_gc_task(struct work_struct *work);
>> +
>> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
>> +static LIST_HEAD(tls_device_gc_list);
>> +static LIST_HEAD(tls_device_list);
>> +static DEFINE_SPINLOCK(tls_device_lock);
>> +
>> +static void tls_device_free_ctx(struct tls_context *ctx)
>> +{
>> +	struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
>> +
>> +	kfree(offlad_ctx);
>> +	kfree(ctx);
>> +}
>> +
>> +static void tls_device_gc_task(struct work_struct *work)
>> +{
>> +	struct tls_context *ctx, *tmp;
>> +	struct list_head gc_list;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tls_device_lock, flags);
>> +	INIT_LIST_HEAD(&gc_list);
> 
> This is stack variable, and it should be initialized outside of global spinlock.
> There is LIST_HEAD() primitive for that in kernel.
> There is one more similar place below.
> 

Sure.

>> +	list_splice_init(&tls_device_gc_list, &gc_list);
>> +	spin_unlock_irqrestore(&tls_device_lock, flags);
>> +
>> +	list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
>> +		struct net_device *netdev = ctx->netdev;
>> +
>> +		if (netdev) {
>> +			netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
>> +							TLS_OFFLOAD_CTX_DIR_TX);
>> +			dev_put(netdev);
>> +		}
> 
> How is possible the situation we meet NULL netdev here >

This can happen in tls_device_down. tls_deviec_down is called whenever a 
netdev that is used for TLS inline crypto offload goes down. It gets 
called via the NETDEV_DOWN event of the netdevice notifier.

This flow is somewhat similar to the xfrm_device netdev notifier. 
However, we do not destroy the socket (as in destroying the xfrm_state 
in xfrm_device). Instead, we cleanup the netdev state and allow software 
fallback to handle the rest of the traffic.

>> +
>> +		list_del(&ctx->list);
>> +		tls_device_free_ctx(ctx);
>> +	}
>> +}
>> +
>> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tls_device_lock, flags);
>> +	list_move_tail(&ctx->list, &tls_device_gc_list);
>> +
>> +	/* schedule_work inside the spinlock
>> +	 * to make sure tls_device_down waits for that work.
>> +	 */
>> +	schedule_work(&tls_device_gc_work);
>> +
>> +	spin_unlock_irqrestore(&tls_device_lock, flags);
>> +}
>> +
>> +/* We assume that the socket is already connected */
>> +static struct net_device *get_netdev_for_sock(struct sock *sk)
>> +{
>> +	struct inet_sock *inet = inet_sk(sk);
>> +	struct net_device *netdev = NULL;
>> +
>> +	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
>> +
>> +	return netdev;
>> +}
>> +
>> +static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
>> +				 struct tls_context *ctx)
>> +{
>> +	int rc;
>> +
>> +	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
>> +					     &ctx->crypto_send,
>> +					     tcp_sk(sk)->write_seq);
>> +	if (rc) {
>> +		pr_err_ratelimited("The netdev has refused to offload this socket\n");
>> +		goto out;
>> +	}
>> +
>> +	rc = 0;
>> +out:
>> +	return rc;
>> +}
>> +
>> +static void destroy_record(struct tls_record_info *record)
>> +{
>> +	skb_frag_t *frag;
>> +	int nr_frags = record->num_frags;
>> +
>> +	while (nr_frags > 0) {
>> +		frag = &record->frags[nr_frags - 1];
>> +		__skb_frag_unref(frag);
>> +		--nr_frags;
>> +	}
>> +	kfree(record);
>> +}
>> +
>> +static void delete_all_records(struct tls_offload_context *offload_ctx)
>> +{
>> +	struct tls_record_info *info, *temp;
>> +
>> +	list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
>> +		list_del(&info->list);
>> +		destroy_record(info);
>> +	}
>> +
>> +	offload_ctx->retransmit_hint = NULL;
>> +}
>> +
>> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
>> +{
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context *ctx;
>> +	struct tls_record_info *info, *temp;
>> +	unsigned long flags;
>> +	u64 deleted_records = 0;
>> +
>> +	if (!tls_ctx)
>> +		return;
>> +
>> +	ctx = tls_offload_ctx(tls_ctx);
>> +
>> +	spin_lock_irqsave(&ctx->lock, flags);
>> +	info = ctx->retransmit_hint;
>> +	if (info && !before(acked_seq, info->end_seq)) {
>> +		ctx->retransmit_hint = NULL;
>> +		list_del(&info->list);
>> +		destroy_record(info);
>> +		deleted_records++;
>> +	}
>> +
>> +	list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
>> +		if (before(acked_seq, info->end_seq))
>> +			break;
>> +		list_del(&info->list);
>> +
>> +		destroy_record(info);
>> +		deleted_records++;
>> +	}
>> +
>> +	ctx->unacked_record_sn += deleted_records;
>> +	spin_unlock_irqrestore(&ctx->lock, flags);
>> +}
>> +
>> +/* At this point, there should be no references on this
>> + * socket and no in-flight SKBs associated with this
>> + * socket, so it is safe to free all the resources.
>> + */
>> +void tls_device_sk_destruct(struct sock *sk)
>> +{
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>> +
>> +	if (ctx->open_record)
>> +		destroy_record(ctx->open_record);
>> +
>> +	delete_all_records(ctx);
>> +	crypto_free_aead(ctx->aead_send);
>> +	ctx->sk_destruct(sk);
>> +
>> +	if (refcount_dec_and_test(&tls_ctx->refcount))
>> +		tls_device_queue_ctx_destruction(tls_ctx);
>> +}
>> +EXPORT_SYMBOL(tls_device_sk_destruct);
>> +
>> +static inline void tls_append_frag(struct tls_record_info *record,
>> +				   struct page_frag *pfrag,
>> +				   int size)
>> +{
>> +	skb_frag_t *frag;
>> +
>> +	frag = &record->frags[record->num_frags - 1];
>> +	if (frag->page.p == pfrag->page &&
>> +	    frag->page_offset + frag->size == pfrag->offset) {
>> +		frag->size += size;
>> +	} else {
>> +		++frag;
>> +		frag->page.p = pfrag->page;
>> +		frag->page_offset = pfrag->offset;
>> +		frag->size = size;
>> +		++record->num_frags;
>> +		get_page(pfrag->page);
>> +	}
>> +
>> +	pfrag->offset += size;
>> +	record->len += size;
>> +}
>> +
>> +static inline int tls_push_record(struct sock *sk,
>> +				  struct tls_context *ctx,
>> +				  struct tls_offload_context *offload_ctx,
>> +				  struct tls_record_info *record,
>> +				  struct page_frag *pfrag,
>> +				  int flags,
>> +				  unsigned char record_type)
>> +{
>> +	skb_frag_t *frag;
>> +	struct tcp_sock *tp = tcp_sk(sk);
>> +	struct page_frag fallback_frag;
>> +	struct page_frag  *tag_pfrag = pfrag;
>> +	int i;
>> +
>> +	/* fill prepand */
>> +	frag = &record->frags[0];
>> +	tls_fill_prepend(ctx,
>> +			 skb_frag_address(frag),
>> +			 record->len - ctx->prepend_size,
>> +			 record_type);
>> +
>> +	if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
>> +		/* HW doesn't care about the data in the tag
>> +		 * so in case pfrag has no room
>> +		 * for a tag and we can't allocate a new pfrag
>> +		 * just use the page in the first frag
>> +		 * rather then write a complicated fall back code.
>> +		 */
>> +		tag_pfrag = &fallback_frag;
>> +		tag_pfrag->page = skb_frag_page(frag);
>> +		tag_pfrag->offset = 0;
>> +	}
>> +
>> +	tls_append_frag(record, tag_pfrag, ctx->tag_size);
>> +	record->end_seq = tp->write_seq + record->len;
>> +	spin_lock_irq(&offload_ctx->lock);
>> +	list_add_tail(&record->list, &offload_ctx->records_list);
>> +	spin_unlock_irq(&offload_ctx->lock);
>> +	offload_ctx->open_record = NULL;
>> +	set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
>> +	tls_advance_record_sn(sk, ctx);
>> +
>> +	for (i = 0; i < record->num_frags; i++) {
>> +		frag = &record->frags[i];
>> +		sg_unmark_end(&offload_ctx->sg_tx_data[i]);
>> +		sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
>> +			    frag->size, frag->page_offset);
>> +		sk_mem_charge(sk, frag->size);
>> +		get_page(skb_frag_page(frag));
>> +	}
>> +	sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
>> +
>> +	/* all ready, send */
>> +	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
>> +}
>> +
>> +static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
>> +					struct page_frag *pfrag,
>> +					size_t prepend_size)
>> +{
>> +	skb_frag_t *frag;
>> +	struct tls_record_info *record;
>> +
>> +	record = kmalloc(sizeof(*record), GFP_KERNEL);
>> +	if (!record)
>> +		return -ENOMEM;
>> +
>> +	frag = &record->frags[0];
>> +	__skb_frag_set_page(frag, pfrag->page);
>> +	frag->page_offset = pfrag->offset;
>> +	skb_frag_size_set(frag, prepend_size);
>> +
>> +	get_page(pfrag->page);
>> +	pfrag->offset += prepend_size;
>> +
>> +	record->num_frags = 1;
>> +	record->len = prepend_size;
>> +	offload_ctx->open_record = record;
>> +	return 0;
>> +}
>> +
>> +static inline int tls_do_allocation(struct sock *sk,
>> +				    struct tls_offload_context *offload_ctx,
>> +				    struct page_frag *pfrag,
>> +				    size_t prepend_size)
>> +{
>> +	int ret;
>> +
>> +	if (!offload_ctx->open_record) {
>> +		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
>> +						   sk->sk_allocation))) {
>> +			sk->sk_prot->enter_memory_pressure(sk);
>> +			sk_stream_moderate_sndbuf(sk);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (pfrag->size > pfrag->offset)
>> +			return 0;
>> +	}
>> +
>> +	if (!sk_page_frag_refill(sk, pfrag))
>> +		return -ENOMEM;
>> +
>> +	return 0;
>> +}
>> +
>> +static int tls_push_data(struct sock *sk,
>> +			 struct iov_iter *msg_iter,
>> +			 size_t size, int flags,
>> +			 unsigned char record_type)
>> +{
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>> +	struct tls_record_info *record = ctx->open_record;
>> +	struct page_frag *pfrag;
>> +	int copy, rc = 0;
>> +	size_t orig_size = size;
>> +	u32 max_open_record_len;
>> +	long timeo;
>> +	int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
>> +	int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
>> +	bool done = false;
>> +
>> +	if (flags &
>> +	    ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
>> +		return -ENOTSUPP;
>> +
>> +	if (sk->sk_err)
>> +		return -sk->sk_err;
>> +
>> +	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
>> +	rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
>> +	if (rc < 0)
>> +		return rc;
>> +
>> +	pfrag = sk_page_frag(sk);
>> +
>> +	/* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
>> +	 * we need to leave room for an authentication tag.
>> +	 */
>> +	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
>> +			      tls_ctx->prepend_size;
>> +	do {
>> +		if (tls_do_allocation(sk, ctx, pfrag,
>> +				      tls_ctx->prepend_size)) {
>> +			rc = sk_stream_wait_memory(sk, &timeo);
>> +			if (!rc)
>> +				continue;
>> +
>> +			record = ctx->open_record;
>> +			if (!record)
>> +				break;
>> +handle_error:
>> +			if (record_type != TLS_RECORD_TYPE_DATA) {
>> +				/* avoid sending partial
>> +				 * record with type !=
>> +				 * application_data
>> +				 */
>> +				size = orig_size;
>> +				destroy_record(record);
>> +				ctx->open_record = NULL;
>> +			} else if (record->len > tls_ctx->prepend_size) {
>> +				goto last_record;
>> +			}
>> +
>> +			break;
>> +		}
>> +
>> +		record = ctx->open_record;
>> +		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
>> +		copy = min_t(size_t, copy, (max_open_record_len - record->len));
>> +
>> +		if (copy_from_iter_nocache(page_address(pfrag->page) +
>> +					       pfrag->offset,
>> +					   copy, msg_iter) != copy) {
>> +			rc = -EFAULT;
>> +			goto handle_error;
>> +		}
>> +		tls_append_frag(record, pfrag, copy);
>> +
>> +		size -= copy;
>> +		if (!size) {
>> +last_record:
>> +			tls_push_record_flags = flags;
>> +			if (more) {
>> +				tls_ctx->pending_open_record_frags =
>> +						record->num_frags;
>> +				break;
>> +			}
>> +
>> +			done = true;
>> +		}
>> +
>> +		if ((done) || record->len >= max_open_record_len ||
>> +		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
>> +			rc = tls_push_record(sk,
>> +					     tls_ctx,
>> +					     ctx,
>> +					     record,
>> +					     pfrag,
>> +					     tls_push_record_flags,
>> +					     record_type);
>> +			if (rc < 0)
>> +				break;
>> +		}
>> +	} while (!done);
>> +
>> +	if (orig_size - size > 0)
>> +		rc = orig_size - size;
>> +
>> +	return rc;
>> +}
>> +
>> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>> +{
>> +	unsigned char record_type = TLS_RECORD_TYPE_DATA;
>> +	int rc = 0;
>> +
>> +	lock_sock(sk);
>> +
>> +	if (unlikely(msg->msg_controllen)) {
>> +		rc = tls_proccess_cmsg(sk, msg, &record_type);
>> +		if (rc)
>> +			goto out;
>> +	}
>> +
>> +	rc = tls_push_data(sk, &msg->msg_iter, size,
>> +			   msg->msg_flags, record_type);
>> +
>> +out:
>> +	release_sock(sk);
>> +	return rc;
>> +}
>> +
>> +int tls_device_sendpage(struct sock *sk, struct page *page,
>> +			int offset, size_t size, int flags)
>> +{
>> +	struct iov_iter	msg_iter;
>> +	struct kvec iov;
>> +	char *kaddr = kmap(page);
>> +	int rc = 0;
>> +
>> +	if (flags & MSG_SENDPAGE_NOTLAST)
>> +		flags |= MSG_MORE;
>> +
>> +	lock_sock(sk);
>> +
>> +	if (flags & MSG_OOB) {
>> +		rc = -ENOTSUPP;
>> +		goto out;
>> +	}
>> +
>> +	iov.iov_base = kaddr + offset;
>> +	iov.iov_len = size;
>> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
>> +	rc = tls_push_data(sk, &msg_iter, size,
>> +			   flags, TLS_RECORD_TYPE_DATA);
>> +	kunmap(page);
>> +
>> +out:
>> +	release_sock(sk);
>> +	return rc;
>> +}
>> +
>> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
>> +				       u32 seq, u64 *p_record_sn)
>> +{
>> +	struct tls_record_info *info;
>> +	u64 record_sn = context->hint_record_sn;
>> +
>> +	info = context->retransmit_hint;
>> +	if (!info ||
>> +	    before(seq, info->end_seq - info->len)) {
>> +		/* if retransmit_hint is irrelevant start
>> +		 * from the begging of the list
>> +		 */
>> +		info = list_first_entry(&context->records_list,
>> +					struct tls_record_info, list);
>> +		record_sn = context->unacked_record_sn;
>> +	}
>> +
>> +	list_for_each_entry_from(info, &context->records_list, list) {
>> +		if (before(seq, info->end_seq)) {
>> +			if (!context->retransmit_hint ||
>> +			    after(info->end_seq,
>> +				  context->retransmit_hint->end_seq)) {
>> +				context->hint_record_sn = record_sn;
>> +				context->retransmit_hint = info;
>> +			}
>> +			*p_record_sn = record_sn;
>> +			return info;
>> +		}
>> +		record_sn++;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +EXPORT_SYMBOL(tls_get_record);
>> +
>> +static int tls_device_push_pending_record(struct sock *sk, int flags)
>> +{
>> +	struct iov_iter	msg_iter;
>> +
>> +	iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
>> +	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
>> +}
>> +
>> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
>> +{
>> +	u16 nonece_size, tag_size, iv_size, rec_seq_size;
>> +	struct tls_record_info *start_marker_record;
>> +	struct tls_offload_context *offload_ctx;
>> +	struct tls_crypto_info *crypto_info;
>> +	struct net_device *netdev;
>> +	char *iv, *rec_seq;
>> +	struct sk_buff *skb;
>> +	int rc = -EINVAL;
>> +	__be64 rcd_sn;
>> +
>> +	if (!ctx)
>> +		goto out;
>> +
>> +	if (ctx->priv_ctx) {
>> +		rc = -EEXIST;
>> +		goto out;
>> +	}
>> +
>> +	/* We support starting offload on multiple sockets
>> +	 * concurrently, So we only need a read lock here.
>> +	 */
>> +	percpu_down_read(&device_offload_lock);
>> +	netdev = get_netdev_for_sock(sk);
>> +	if (!netdev) {
>> +		pr_err_ratelimited("%s: netdev not found\n", __func__);
>> +		rc = -EINVAL;
>> +		goto release_lock;
>> +	}
>> +
>> +	if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
>> +		rc = -ENOTSUPP;
>> +		goto release_netdev;
>> +	}
>> +
>> +	/* Avoid offloading if the device is down
>> +	 * We don't want to offload new flows after
>> +	 * the NETDEV_DOWN event
>> +	 */
>> +	if (!(netdev->flags & IFF_UP)) {
>> +		rc = -EINVAL;
>> +		goto release_lock;
>> +	}
>> +
>> +	crypto_info = &ctx->crypto_send;
>> +	switch (crypto_info->cipher_type) {
>> +	case TLS_CIPHER_AES_GCM_128: {
>> +		nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
>> +		tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>> +		iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
>> +		iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
>> +		rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
>> +		rec_seq =
>> +		 ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
>> +		break;
>> +	}
>> +	default:
>> +		rc = -EINVAL;
>> +		goto release_netdev;
>> +	}
>> +
>> +	start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
> 
> Can we memory allocations and simple memory initializations ouside the global rwsem?
> 

Sure, we can move all memory allocations outside the lock.

>> +	if (!start_marker_record) {
>> +		rc = -ENOMEM;
>> +		goto release_netdev;
>> +	}
>> +
>> +	offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
>> +	if (!offload_ctx)
>> +		goto free_marker_record;
>> +
>> +	ctx->priv_ctx = offload_ctx;
>> +	rc = attach_sock_to_netdev(sk, netdev, ctx);
>> +	if (rc)
>> +		goto free_offload_context;
>> +
>> +	ctx->netdev = netdev;
>> +	ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
>> +	ctx->tag_size = tag_size;
>> +	ctx->iv_size = iv_size;
>> +	ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
>> +			  GFP_KERNEL);
>> +	if (!ctx->iv) {
>> +		rc = -ENOMEM;
>> +		goto detach_sock;
>> +	}
>> +
>> +	memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
>> +
>> +	ctx->rec_seq_size = rec_seq_size;
>> +	ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
>> +	if (!ctx->rec_seq) {
>> +		rc = -ENOMEM;
>> +		goto free_iv;
>> +	}
>> +	memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
>> +
>> +	/* start at rec_seq - 1 to account for the start marker record */
>> +	memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
>> +	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
>> +
>> +	rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
>> +	if (rc)
>> +		goto free_rec_seq;
>> +
>> +	start_marker_record->end_seq = tcp_sk(sk)->write_seq;
>> +	start_marker_record->len = 0;
>> +	start_marker_record->num_frags = 0;
>> +
>> +	INIT_LIST_HEAD(&offload_ctx->records_list);
>> +	list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
>> +	spin_lock_init(&offload_ctx->lock);
>> +
>> +	inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
>> +	ctx->push_pending_record = tls_device_push_pending_record;
>> +	offload_ctx->sk_destruct = sk->sk_destruct;
>> +
>> +	/* TLS offload is greatly simplified if we don't send
>> +	 * SKBs where only part of the payload needs to be encrypted.
>> +	 * So mark the last skb in the write queue as end of record.
>> +	 */
>> +	skb = tcp_write_queue_tail(sk);
>> +	if (skb)
>> +		TCP_SKB_CB(skb)->eor = 1;
>> +
>> +	refcount_set(&ctx->refcount, 1);
>> +	spin_lock_irq(&tls_device_lock);
>> +	list_add_tail(&ctx->list, &tls_device_list);
>> +	spin_unlock_irq(&tls_device_lock);
>> +
>> +	/* following this assignment tls_is_sk_tx_device_offloaded
>> +	 * will return true and the context might be accessed
>> +	 * by the netdev's xmit function.
>> +	 */
>> +	smp_store_release(&sk->sk_destruct,
>> +			  &tls_device_sk_destruct);
>> +	goto release_lock;
>> +
>> +free_rec_seq:
>> +	kfree(ctx->rec_seq);
>> +free_iv:
>> +	kfree(ctx->iv);
>> +detach_sock:
>> +	netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX);
>> +free_offload_context:
>> +	kfree(offload_ctx);
>> +	ctx->priv_ctx = NULL;
>> +free_marker_record:
>> +	kfree(start_marker_record);
>> +release_netdev:
>> +	dev_put(netdev);
>> +release_lock:
>> +	percpu_up_read(&device_offload_lock);
>> +out:
>> +	return rc;
>> +}
>> +
>> +static int tls_device_register(struct net_device *dev)
>> +{
>> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
>> +		return NOTIFY_BAD;
>> +
>> +	return NOTIFY_DONE;
>> +}
> 
> This function is the same as tls_device_feat_change(). Can't we merge
> them together and avoid duplicating of code?
> 

Sure.

>> +static int tls_device_unregister(struct net_device *dev)
>> +{
>> +	return NOTIFY_DONE;
>> +}
> 
> This function does nothing, and next patches do not change it.
> Can't we remove it since so?
> 

Sure.

>> +static int tls_device_feat_change(struct net_device *dev)
>> +{
>> +	if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
>> +		return NOTIFY_BAD;
>> +
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int tls_device_down(struct net_device *netdev)
>> +{
>> +	struct tls_context *ctx, *tmp;
>> +	struct list_head list;
>> +	unsigned long flags;
>> +
>> +	if (!(netdev->features & NETIF_F_HW_TLS_TX))
>> +		return NOTIFY_DONE;
> 
> Can't we move this check in tls_dev_event() and use it for all types of events?
> Then we avoid duplicate code.
> 

No. Not all events require this check. Also, the result is different for 
different events.

>> +
>> +	/* Request a write lock to block new offload attempts
>> +	 */
>> +	percpu_down_write(&device_offload_lock);
> 
> What is the reason percpu_rwsem is chosen here? It looks like this primitive
> gives more advantages readers, then plain rwsem does. But it also gives
> disadvantages to writers. It would be good, unless tls_device_down() is called
> with rtnl_lock() held from netdevice notifier. But since netdevice notifier
> are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
> is locked.
We use the a rwsem to allow multiple (readers) invocations of 
tls_set_device_offload, which is triggered by the user (persumably) 
during the TLS handshake. This might be considered a fast-path.

However, we must block all calls to tls_set_device_offload while we are 
processing NETDEV_DOWN events (writer).

As you've mentioned, the percpu rwsem is more efficient for readers, 
especially on NUMA systems, where cache-line bouncing occurs during 
reader acquire and reduces performance.

> 
> Can't we use plain rwsem here instead?
> 

Its a performance tradeoff. I'm not certain that the percpu rwsem write 
side acquire is significantly worse than using the global rwsem.

For now, while all of this is experimental, can we agree to focus on the 
performance of readers? We can change it later if it becomes a problem.

>> +
>> +	spin_lock_irqsave(&tls_device_lock, flags);
>> +	INIT_LIST_HEAD(&list);
> 
> This may go outside the global spinlock.
> 

Sure.

>> +	list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
>> +		if (ctx->netdev != netdev ||
>> +		    !refcount_inc_not_zero(&ctx->refcount))
>> +			continue;
>> +
>> +		list_move(&ctx->list, &list);
>> +	}
>> +	spin_unlock_irqrestore(&tls_device_lock, flags);
>> +
>> +	list_for_each_entry_safe(ctx, tmp, &list, list)	{
>> +		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
>> +						TLS_OFFLOAD_CTX_DIR_TX);
>> +		ctx->netdev = NULL;
>> +		dev_put(netdev);
>> +		list_del_init(&ctx->list);
>> +
>> +		if (refcount_dec_and_test(&ctx->refcount))
>> +			tls_device_free_ctx(ctx);
>> +	}
>> +
>> +	percpu_up_write(&device_offload_lock);
>> +
>> +	flush_work(&tls_device_gc_work);
>> +
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int tls_dev_event(struct notifier_block *this, unsigned long event,
>> +			 void *ptr)
>> +{
>> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return tls_device_register(dev);
>> +
>> +	case NETDEV_UNREGISTER:
>> +		return tls_device_unregister(dev);
>> +
>> +	case NETDEV_FEAT_CHANGE:
>> +		return tls_device_feat_change(dev);
>> +
>> +	case NETDEV_DOWN:
>> +		return tls_device_down(dev);
>> +	}
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block tls_dev_notifier = {
>> +	.notifier_call	= tls_dev_event,
>> +};
>> +
>> +void __init tls_device_init(void)
>> +{
>> +	register_netdevice_notifier(&tls_dev_notifier);
>> +}
>> +
>> +void __exit tls_device_cleanup(void)
>> +{
>> +	unregister_netdevice_notifier(&tls_dev_notifier);
>> +	flush_work(&tls_device_gc_work);
>> +}
>> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
>> new file mode 100644
>> index 000000000000..14d31a36885c
>> --- /dev/null
>> +++ b/net/tls/tls_device_fallback.c
>> @@ -0,0 +1,419 @@
>> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
>> + *
>> + *     Redistribution and use in source and binary forms, with or
>> + *     without modification, are permitted provided that the following
>> + *     conditions are met:
>> + *
>> + *      - Redistributions of source code must retain the above
>> + *        copyright notice, this list of conditions and the following
>> + *        disclaimer.
>> + *
>> + *      - Redistributions in binary form must reproduce the above
>> + *        copyright notice, this list of conditions and the following
>> + *        disclaimer in the documentation and/or other materials
>> + *        provided with the distribution.
>> + *
>> + *      - Neither the name of the Mellanox Technologies nor the
>> + *        names of its contributors may be used to endorse or promote
>> + *        products derived from this software without specific prior written
>> + *        permission.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
>> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
>> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
>> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
>> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
>> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> + * POSSIBILITY OF SUCH DAMAGE
>> + */
>> +
>> +#include <net/tls.h>
>> +#include <crypto/aead.h>
>> +#include <crypto/scatterwalk.h>
>> +#include <net/ip6_checksum.h>
>> +
>> +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
>> +{
>> +	struct scatterlist *src = walk->sg;
>> +	int diff = walk->offset - src->offset;
>> +
>> +	sg_set_page(sg, sg_page(src),
>> +		    src->length - diff, walk->offset);
>> +
>> +	scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
>> +}
>> +
>> +static int tls_enc_record(struct aead_request *aead_req,
>> +			  struct crypto_aead *aead, char *aad, char *iv,
>> +			  __be64 rcd_sn, struct scatter_walk *in,
>> +			  struct scatter_walk *out, int *in_len)
>> +{
>> +	struct scatterlist sg_in[3];
>> +	struct scatterlist sg_out[3];
>> +	unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
>> +	u16 len;
>> +	int rc;
>> +
>> +	len = min_t(int, *in_len, ARRAY_SIZE(buf));
>> +
>> +	scatterwalk_copychunks(buf, in, len, 0);
>> +	scatterwalk_copychunks(buf, out, len, 1);
>> +
>> +	*in_len -= len;
>> +	if (!*in_len)
>> +		return 0;
>> +
>> +	scatterwalk_pagedone(in, 0, 1);
>> +	scatterwalk_pagedone(out, 1, 1);
>> +
>> +	len = buf[4] | (buf[3] << 8);
>> +	len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
>> +
>> +	tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
>> +		     (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
>> +
>> +	memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
>> +	       TLS_CIPHER_AES_GCM_128_IV_SIZE);
>> +
>> +	sg_init_table(sg_in, ARRAY_SIZE(sg_in));
>> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
>> +	sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
>> +	sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
>> +	chain_to_walk(sg_in + 1, in);
>> +	chain_to_walk(sg_out + 1, out);
>> +
>> +	*in_len -= len;
>> +	if (*in_len < 0) {
>> +		*in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>> +		if (*in_len < 0)
>> +		/* the input buffer doesn't contain the entire record.
>> +		 * trim len accordingly. The resulting authentication tag
>> +		 * will contain garbage. but we don't care as we won't
>> +		 * include any of it in the output skb
>> +		 * Note that we assume the output buffer length
>> +		 * is larger then input buffer length + tag size
>> +		 */
>> +			len += *in_len;
>> +
>> +		*in_len = 0;
>> +	}
>> +
>> +	if (*in_len) {
>> +		scatterwalk_copychunks(NULL, in, len, 2);
>> +		scatterwalk_pagedone(in, 0, 1);
>> +		scatterwalk_copychunks(NULL, out, len, 2);
>> +		scatterwalk_pagedone(out, 1, 1);
>> +	}
>> +
>> +	len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>> +	aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
>> +
>> +	rc = crypto_aead_encrypt(aead_req);
>> +
>> +	return rc;
>> +}
>> +
>> +static void tls_init_aead_request(struct aead_request *aead_req,
>> +				  struct crypto_aead *aead)
>> +{
>> +	aead_request_set_tfm(aead_req, aead);
>> +	aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
>> +}
>> +
>> +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
>> +						   gfp_t flags)
>> +{
>> +	unsigned int req_size = sizeof(struct aead_request) +
>> +		crypto_aead_reqsize(aead);
>> +	struct aead_request *aead_req;
>> +
>> +	aead_req = kzalloc(req_size, flags);
>> +	if (!aead_req)
>> +		return NULL;
>> +
>> +	tls_init_aead_request(aead_req, aead);
>> +	return aead_req;
>> +}
>> +
>> +static int tls_enc_records(struct aead_request *aead_req,
>> +			   struct crypto_aead *aead, struct scatterlist *sg_in,
>> +			   struct scatterlist *sg_out, char *aad, char *iv,
>> +			   u64 rcd_sn, int len)
>> +{
>> +	struct scatter_walk in;
>> +	struct scatter_walk out;
>> +	int rc;
>> +
>> +	scatterwalk_start(&in, sg_in);
>> +	scatterwalk_start(&out, sg_out);
>> +
>> +	do {
>> +		rc = tls_enc_record(aead_req, aead, aad, iv,
>> +				    cpu_to_be64(rcd_sn), &in, &out, &len);
>> +		rcd_sn++;
>> +
>> +	} while (rc == 0 && len);
>> +
>> +	scatterwalk_done(&in, 0, 0);
>> +	scatterwalk_done(&out, 1, 0);
>> +
>> +	return rc;
>> +}
>> +
>> +static inline void update_chksum(struct sk_buff *skb, int headln)
>> +{
>> +	/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
>> +	 * might have been changed by NAT.
>> +	 */
>> +
>> +	const struct ipv6hdr *ipv6h;
>> +	const struct iphdr *iph;
>> +	struct tcphdr *th = tcp_hdr(skb);
>> +	int datalen = skb->len - headln;
>> +
>> +	/* We only changed the payload so if we are using partial we don't
>> +	 * need to update anything.
>> +	 */
>> +	if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
>> +		return;
>> +
>> +	skb->ip_summed = CHECKSUM_PARTIAL;
>> +	skb->csum_start = skb_transport_header(skb) - skb->head;
>> +	skb->csum_offset = offsetof(struct tcphdr, check);
>> +
>> +	if (skb->sk->sk_family == AF_INET6) {
>> +		ipv6h = ipv6_hdr(skb);
>> +		th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
>> +					     datalen, IPPROTO_TCP, 0);
>> +	} else {
>> +		iph = ip_hdr(skb);
>> +		th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
>> +					       IPPROTO_TCP, 0);
>> +	}
>> +}
>> +
>> +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
>> +{
>> +	skb_copy_header(nskb, skb);
>> +
>> +	skb_put(nskb, skb->len);
>> +	memcpy(nskb->data, skb->data, headln);
>> +	update_chksum(nskb, headln);
>> +
>> +	nskb->destructor = skb->destructor;
>> +	nskb->sk = skb->sk;
>> +	skb->destructor = NULL;
>> +	skb->sk = NULL;
>> +	refcount_add(nskb->truesize - skb->truesize,
>> +		     &nskb->sk->sk_wmem_alloc);
>> +}
>> +
>> +/* This function may be called after the user socket is already
>> + * closed so make sure we don't use anything freed during
>> + * tls_sk_proto_close here
>> + */
>> +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
>> +{
>> +	int tcp_header_size = tcp_hdrlen(skb);
>> +	int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
>> +	int payload_len = skb->len - tcp_payload_offset;
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>> +	int remaining, buf_len, resync_sgs, rc, i = 0;
>> +	void *buf, *dummy_buf, *iv, *aad;
>> +	struct scatterlist *sg_in;
>> +	struct scatterlist sg_out[3];
>> +	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
>> +	struct aead_request *aead_req;
>> +	struct sk_buff *nskb = NULL;
>> +	struct tls_record_info *record;
>> +	unsigned long flags;
>> +	s32 sync_size;
>> +	u64 rcd_sn;
>> +
>> +	/* worst case is:
>> +	 * MAX_SKB_FRAGS in tls_record_info
>> +	 * MAX_SKB_FRAGS + 1 in SKB head an frags.
>> +	 */
>> +	int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
>> +
>> +	if (!payload_len)
>> +		return skb;
>> +
>> +	sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
>> +	if (!sg_in)
>> +		goto free_orig;
>> +
>> +	sg_init_table(sg_in, sg_in_max_elements);
>> +	sg_init_table(sg_out, ARRAY_SIZE(sg_out));
>> +
>> +	spin_lock_irqsave(&ctx->lock, flags);
>> +	record = tls_get_record(ctx, tcp_seq, &rcd_sn);
>> +	if (!record) {
>> +		spin_unlock_irqrestore(&ctx->lock, flags);
>> +		WARN(1, "Record not found for seq %u\n", tcp_seq);
>> +		goto free_sg;
>> +	}
>> +
>> +	sync_size = tcp_seq - tls_record_start_seq(record);
>> +	if (sync_size < 0) {
>> +		int is_start_marker = tls_record_is_start_marker(record);
>> +
>> +		spin_unlock_irqrestore(&ctx->lock, flags);
>> +		if (!is_start_marker)
>> +		/* This should only occur if the relevant record was
>> +		 * already acked. In that case it should be ok
>> +		 * to drop the packet and avoid retransmission.
>> +		 *
>> +		 * There is a corner case where the packet contains
>> +		 * both an acked and a non-acked record.
>> +		 * We currently don't handle that case and rely
>> +		 * on TCP to retranmit a packet that doesn't contain
>> +		 * already acked payload.
>> +		 */
>> +			goto free_orig;
>> +
>> +		if (payload_len > -sync_size) {
>> +			WARN(1, "Fallback of partially offloaded packets is not supported\n");
>> +			goto free_sg;
>> +		} else {
>> +			return skb;
>> +		}
>> +	}
>> +
>> +	remaining = sync_size;
>> +	while (remaining > 0) {
>> +		skb_frag_t *frag = &record->frags[i];
>> +
>> +		__skb_frag_ref(frag);
>> +		sg_set_page(sg_in + i, skb_frag_page(frag),
>> +			    skb_frag_size(frag), frag->page_offset);
>> +
>> +		remaining -= skb_frag_size(frag);
>> +
>> +		if (remaining < 0)
>> +			sg_in[i].length += remaining;
>> +
>> +		i++;
>> +	}
>> +	spin_unlock_irqrestore(&ctx->lock, flags);
>> +	resync_sgs = i;
>> +
>> +	aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
>> +	if (!aead_req)
>> +		goto put_sg;
>> +
>> +	buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
>> +		  TLS_CIPHER_AES_GCM_128_IV_SIZE +
>> +		  TLS_AAD_SPACE_SIZE +
>> +		  sync_size +
>> +		  tls_ctx->tag_size;
>> +	buf = kmalloc(buf_len, GFP_ATOMIC);
>> +	if (!buf)
>> +		goto free_req;
>> +
>> +	nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
>> +	if (!nskb)
>> +		goto free_buf;
>> +
>> +	skb_reserve(nskb, skb_headroom(skb));
>> +
>> +	iv = buf;
>> +
>> +	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
>> +	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
>> +	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
>> +	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
>> +	dummy_buf = aad + TLS_AAD_SPACE_SIZE;
>> +
>> +	sg_set_buf(&sg_out[0], dummy_buf, sync_size);
>> +	sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
>> +		   payload_len);
>> +	/* Add room for authentication tag produced by crypto */
>> +	dummy_buf += sync_size;
>> +	sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
>> +	rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
>> +			  payload_len);
>> +	if (rc < 0)
>> +		goto free_nskb;
>> +
>> +	rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
>> +			     rcd_sn, sync_size + payload_len);
>> +	if (rc < 0)
>> +		goto free_nskb;
>> +
>> +	complete_skb(nskb, skb, tcp_payload_offset);
>> +
>> +	/* validate_xmit_skb_list assumes that if the skb wasn't segmented
>> +	 * nskb->prev will point to the skb itself
>> +	 */
>> +	nskb->prev = nskb;
>> +free_buf:
>> +	kfree(buf);
>> +free_req:
>> +	kfree(aead_req);
>> +put_sg:
>> +	for (i = 0; i < resync_sgs; i++)
>> +		put_page(sg_page(&sg_in[i]));
>> +free_sg:
>> +	kfree(sg_in);
>> +free_orig:
>> +	kfree_skb(skb);
>> +	return nskb;
>> +
>> +free_nskb:
>> +	kfree_skb(nskb);
>> +	nskb = NULL;
>> +	goto free_buf;
>> +}
>> +
>> +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
>> +					     struct net_device *dev,
>> +					     struct sk_buff *skb)
>> +{
>> +	if (dev == tls_get_ctx(sk)->netdev)
>> +		return skb;
>> +
>> +	return tls_sw_fallback(sk, skb);
>> +}
>> +
>> +int tls_sw_fallback_init(struct sock *sk,
>> +			 struct tls_offload_context *offload_ctx,
>> +			 struct tls_crypto_info *crypto_info)
>> +{
>> +	int rc;
>> +	const u8 *key;
>> +
>> +	offload_ctx->aead_send =
>> +	    crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
>> +	if (IS_ERR(offload_ctx->aead_send)) {
>> +		rc = PTR_ERR(offload_ctx->aead_send);
>> +		pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
>> +		offload_ctx->aead_send = NULL;
>> +		goto err_out;
>> +	}
>> +
>> +	key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
>> +
>> +	rc = crypto_aead_setkey(offload_ctx->aead_send, key,
>> +				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
>> +	if (rc)
>> +		goto free_aead;
>> +
>> +	rc = crypto_aead_setauthsize(offload_ctx->aead_send,
>> +				     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
>> +	if (rc)
>> +		goto free_aead;
>> +
>> +	sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
>> +	return 0;
>> +free_aead:
>> +	crypto_free_aead(offload_ctx->aead_send);
>> +err_out:
>> +	return rc;
>> +}
>> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
>> index d824d548447e..e0dface33017 100644
>> --- a/net/tls/tls_main.c
>> +++ b/net/tls/tls_main.c
>> @@ -54,6 +54,9 @@ enum {
>>   enum {
>>   	TLS_BASE_TX,
>>   	TLS_SW_TX,
>> +#ifdef CONFIG_TLS_DEVICE
>> +	TLS_HW_TX,
>> +#endif
>>   	TLS_NUM_CONFIG,
>>   };
>>   
>> @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
>>   		goto err_crypto_info;
>>   	}
>>   
>> -	/* currently SW is default, we will have ethtool in future */
>> -	rc = tls_set_sw_offload(sk, ctx);
>> -	tx_conf = TLS_SW_TX;
>> -	if (rc)
>> -		goto err_crypto_info;
>> +#ifdef CONFIG_TLS_DEVICE
>> +	rc = tls_set_device_offload(sk, ctx);
>> +	tx_conf = TLS_HW_TX;
>> +	if (rc) {
>> +#else
>> +	{
>> +#endif
>> +		/* if HW offload fails fallback to SW */
>> +		rc = tls_set_sw_offload(sk, ctx);
>> +		tx_conf = TLS_SW_TX;
>> +		if (rc)
>> +			goto err_crypto_info;
>> +	}
>>   
>>   	ctx->tx_conf = tx_conf;
>>   	update_sk_prot(sk, ctx);
>> @@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
>>   	prot[TLS_SW_TX] = prot[TLS_BASE_TX];
>>   	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
>>   	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
>> +
>> +#ifdef CONFIG_TLS_DEVICE
>> +	prot[TLS_HW_TX] = prot[TLS_SW_TX];
>> +	prot[TLS_HW_TX].sendmsg		= tls_device_sendmsg;
>> +	prot[TLS_HW_TX].sendpage	= tls_device_sendpage;
>> +#endif
>>   }
>>   
>>   static int tls_init(struct sock *sk)
>> @@ -531,6 +548,9 @@ static int __init tls_register(void)
>>   {
>>   	build_protos(tls_prots[TLSV4], &tcp_prot);
>>   
>> +#ifdef CONFIG_TLS_DEVICE
>> +	tls_device_init();
>> +#endif
>>   	tcp_register_ulp(&tcp_tls_ulp_ops);
>>   
>>   	return 0;
>> @@ -539,6 +559,9 @@ static int __init tls_register(void)
>>   static void __exit tls_unregister(void)
>>   {
>>   	tcp_unregister_ulp(&tcp_tls_ulp_ops);
>> +#ifdef CONFIG_TLS_DEVICE
>> +	tls_device_cleanup();
>> +#endif
>>   }
>>   
>>   module_init(tls_register);
> 
> Thanks,
> Kirill
> 

Best,
Boris.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 01/14] tcp: Add clean acked data hook
  2018-03-21 11:21     ` Boris Pismenny
@ 2018-03-21 16:16       ` Rao Shoaib
  2018-03-21 16:32         ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Rao Shoaib @ 2018-03-21 16:16 UTC (permalink / raw)
  To: Boris Pismenny, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel



On 03/21/2018 04:21 AM, Boris Pismenny wrote:
>
>
> On 3/20/2018 10:36 PM, Rao Shoaib wrote:
>>
>>
>> On 03/19/2018 07:44 PM, Saeed Mahameed wrote:
>>> From: Ilya Lesokhin <ilyal@mellanox.com>
>>>
>>> Called when a TCP segment is acknowledged.
>>> Could be used by application protocols who hold additional
>>> metadata associated with the stream data.
>>>
>>> This is required by TLS device offload to release
>>> metadata associated with acknowledged TLS records.
>>>
>>> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
>>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>>> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
>>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>>> ---
>>>   include/net/inet_connection_sock.h | 2 ++
>>>   net/ipv4/tcp_input.c               | 2 ++
>>>   2 files changed, 4 insertions(+)
>>>
>>> diff --git a/include/net/inet_connection_sock.h 
>>> b/include/net/inet_connection_sock.h
>>> index b68fea022a82..2ab6667275df 100644
>>> --- a/include/net/inet_connection_sock.h
>>> +++ b/include/net/inet_connection_sock.h
>>> @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
>>>    * @icsk_af_ops           Operations which are AF_INET{4,6} specific
>>>    * @icsk_ulp_ops       Pluggable ULP control hook
>>>    * @icsk_ulp_data       ULP private data
>>> + * @icsk_clean_acked       Clean acked data hook
>>>    * @icsk_listen_portaddr_node    hash to the portaddr listener 
>>> hashtable
>>>    * @icsk_ca_state:       Congestion control state
>>>    * @icsk_retransmits:       Number of unrecovered [RTO] timeouts
>>> @@ -102,6 +103,7 @@ struct inet_connection_sock {
>>>       const struct inet_connection_sock_af_ops *icsk_af_ops;
>>>       const struct tcp_ulp_ops  *icsk_ulp_ops;
>>>       void              *icsk_ulp_data;
>>> +    void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
>>>       struct hlist_node         icsk_listen_portaddr_node;
>>>       unsigned int          (*icsk_sync_mss)(struct sock *sk, u32 
>>> pmtu);
>>>       __u8              icsk_ca_state:6,
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 451ef3012636..9854ecae7245 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -3542,6 +3542,8 @@ static int tcp_ack(struct sock *sk, const 
>>> struct sk_buff *skb, int flag)
>>>       if (after(ack, prior_snd_una)) {
>>>           flag |= FLAG_SND_UNA_ADVANCED;
>>>           icsk->icsk_retransmits = 0;
>>> +        if (icsk->icsk_clean_acked)
>>> +            icsk->icsk_clean_acked(sk, ack);
>>>       }
>>>       prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : 
>>> tp->snd_una;
>> Per Dave we are not allowed to use function pointers any more, so why 
>> extend their use. I implemented a similar callback for my changes but 
>> in my use case I need to call the meta data update function even when 
>> the packet does not ack any new data or has no payload. Is it 
>> possible to move this to say tcp_data_queue() ?
>
> Sometimes function pointers are unavoidable. For example, when a 
> module must change the functionality of a function. I think it is 
> preferable to advance the kernel
I agree, in fact I was using function pointers for the exact reason, to 
change the functionality of a function. I asked Dave about the use and 
he said No (Also note that the relevant CPU optimizations have been 
turned off on selected NIC's due to the latest security issues -- On AMD 
CPU's the optimizations are not turned off). So it is Dave's decision -- 
I am hoping that he would reconsider and allow me to use pointers as 
well as pointers solve the problem nicely and are used extensively.
>
> This function is used to free memory based on new acknowledged data. 
> It is unrelated to whether data was received or not. So it is not 
> possible to move this call to tcp_data_queue.
After reviewing my changes I believe I can work with the change. So go 
ahead.
>
> Just in case, I'll add a static key here to reduce the impact on the 
> fast-path as once suggested by EricD on netdev2.2.
Regards,

Shoaib
>
>>
>> Thanks,
>>
>> Shoaib
>>
>>
>
> Best,
> Boris.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-21 15:53     ` Boris Pismenny
@ 2018-03-21 16:31       ` Kirill Tkhai
  2018-03-21 20:50         ` Saeed Mahameed
  2018-03-22 12:38         ` Boris Pismenny
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill Tkhai @ 2018-03-21 16:31 UTC (permalink / raw)
  To: Boris Pismenny, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel

On 21.03.2018 18:53, Boris Pismenny wrote:
> ...
>>
>> Other patches have two licenses in header. Can I distribute this file under GPL license terms?
>>
> 
> Sure, I'll update the license to match other files under net/tls.
> 
>>> +#include <linux/module.h>
>>> +#include <net/tcp.h>
>>> +#include <net/inet_common.h>
>>> +#include <linux/highmem.h>
>>> +#include <linux/netdevice.h>
>>> +
>>> +#include <net/tls.h>
>>> +#include <crypto/aead.h>
>>> +
>>> +/* device_offload_lock is used to synchronize tls_dev_add
>>> + * against NETDEV_DOWN notifications.
>>> + */
>>> +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
>>> +
>>> +static void tls_device_gc_task(struct work_struct *work);
>>> +
>>> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
>>> +static LIST_HEAD(tls_device_gc_list);
>>> +static LIST_HEAD(tls_device_list);
>>> +static DEFINE_SPINLOCK(tls_device_lock);
>>> +
>>> +static void tls_device_free_ctx(struct tls_context *ctx)
>>> +{
>>> +    struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
>>> +
>>> +    kfree(offlad_ctx);
>>> +    kfree(ctx);
>>> +}
>>> +
>>> +static void tls_device_gc_task(struct work_struct *work)
>>> +{
>>> +    struct tls_context *ctx, *tmp;
>>> +    struct list_head gc_list;
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(&tls_device_lock, flags);
>>> +    INIT_LIST_HEAD(&gc_list);
>>
>> This is stack variable, and it should be initialized outside of global spinlock.
>> There is LIST_HEAD() primitive for that in kernel.
>> There is one more similar place below.
>>
> 
> Sure.
> 
>>> +    list_splice_init(&tls_device_gc_list, &gc_list);
>>> +    spin_unlock_irqrestore(&tls_device_lock, flags);
>>> +
>>> +    list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
>>> +        struct net_device *netdev = ctx->netdev;
>>> +
>>> +        if (netdev) {
>>> +            netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
>>> +                            TLS_OFFLOAD_CTX_DIR_TX);
>>> +            dev_put(netdev);
>>> +        }
>>
>> How is possible the situation we meet NULL netdev here >
> 
> This can happen in tls_device_down. tls_deviec_down is called whenever a netdev that is used for TLS inline crypto offload goes down. It gets called via the NETDEV_DOWN event of the netdevice notifier.
> 
> This flow is somewhat similar to the xfrm_device netdev notifier. However, we do not destroy the socket (as in destroying the xfrm_state in xfrm_device). Instead, we cleanup the netdev state and allow software fallback to handle the rest of the traffic.
> 
>>> +
>>> +        list_del(&ctx->list);
>>> +        tls_device_free_ctx(ctx);
>>> +    }
>>> +}
>>> +
>>> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>>> +{
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(&tls_device_lock, flags);
>>> +    list_move_tail(&ctx->list, &tls_device_gc_list);
>>> +
>>> +    /* schedule_work inside the spinlock
>>> +     * to make sure tls_device_down waits for that work.
>>> +     */
>>> +    schedule_work(&tls_device_gc_work);
>>> +
>>> +    spin_unlock_irqrestore(&tls_device_lock, flags);
>>> +}
>>> +
>>> +/* We assume that the socket is already connected */
>>> +static struct net_device *get_netdev_for_sock(struct sock *sk)
>>> +{
>>> +    struct inet_sock *inet = inet_sk(sk);
>>> +    struct net_device *netdev = NULL;
>>> +
>>> +    netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
>>> +
>>> +    return netdev;
>>> +}
>>> +
>>> +static int attach_sock_to_netdev(struct sock *sk, struct net_device *netdev,
>>> +                 struct tls_context *ctx)
>>> +{
>>> +    int rc;
>>> +
>>> +    rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
>>> +                         &ctx->crypto_send,
>>> +                         tcp_sk(sk)->write_seq);
>>> +    if (rc) {
>>> +        pr_err_ratelimited("The netdev has refused to offload this socket\n");
>>> +        goto out;
>>> +    }
>>> +
>>> +    rc = 0;
>>> +out:
>>> +    return rc;
>>> +}
>>> +
>>> +static void destroy_record(struct tls_record_info *record)
>>> +{
>>> +    skb_frag_t *frag;
>>> +    int nr_frags = record->num_frags;
>>> +
>>> +    while (nr_frags > 0) {
>>> +        frag = &record->frags[nr_frags - 1];
>>> +        __skb_frag_unref(frag);
>>> +        --nr_frags;
>>> +    }
>>> +    kfree(record);
>>> +}
>>> +
>>> +static void delete_all_records(struct tls_offload_context *offload_ctx)
>>> +{
>>> +    struct tls_record_info *info, *temp;
>>> +
>>> +    list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) {
>>> +        list_del(&info->list);
>>> +        destroy_record(info);
>>> +    }
>>> +
>>> +    offload_ctx->retransmit_hint = NULL;
>>> +}
>>> +
>>> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
>>> +{
>>> +    struct tls_context *tls_ctx = tls_get_ctx(sk);
>>> +    struct tls_offload_context *ctx;
>>> +    struct tls_record_info *info, *temp;
>>> +    unsigned long flags;
>>> +    u64 deleted_records = 0;
>>> +
>>> +    if (!tls_ctx)
>>> +        return;
>>> +
>>> +    ctx = tls_offload_ctx(tls_ctx);
>>> +
>>> +    spin_lock_irqsave(&ctx->lock, flags);
>>> +    info = ctx->retransmit_hint;
>>> +    if (info && !before(acked_seq, info->end_seq)) {
>>> +        ctx->retransmit_hint = NULL;
>>> +        list_del(&info->list);
>>> +        destroy_record(info);
>>> +        deleted_records++;
>>> +    }
>>> +
>>> +    list_for_each_entry_safe(info, temp, &ctx->records_list, list) {
>>> +        if (before(acked_seq, info->end_seq))
>>> +            break;
>>> +        list_del(&info->list);
>>> +
>>> +        destroy_record(info);
>>> +        deleted_records++;
>>> +    }
>>> +
>>> +    ctx->unacked_record_sn += deleted_records;
>>> +    spin_unlock_irqrestore(&ctx->lock, flags);
>>> +}
>>> +
>>> +/* At this point, there should be no references on this
>>> + * socket and no in-flight SKBs associated with this
>>> + * socket, so it is safe to free all the resources.
>>> + */
>>> +void tls_device_sk_destruct(struct sock *sk)
>>> +{
>>> +    struct tls_context *tls_ctx = tls_get_ctx(sk);
>>> +    struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>>> +
>>> +    if (ctx->open_record)
>>> +        destroy_record(ctx->open_record);
>>> +
>>> +    delete_all_records(ctx);
>>> +    crypto_free_aead(ctx->aead_send);
>>> +    ctx->sk_destruct(sk);
>>> +
>>> +    if (refcount_dec_and_test(&tls_ctx->refcount))
>>> +        tls_device_queue_ctx_destruction(tls_ctx);
>>> +}
>>> +EXPORT_SYMBOL(tls_device_sk_destruct);
>>> +
>>> +static inline void tls_append_frag(struct tls_record_info *record,
>>> +                   struct page_frag *pfrag,
>>> +                   int size)
>>> +{
>>> +    skb_frag_t *frag;
>>> +
>>> +    frag = &record->frags[record->num_frags - 1];
>>> +    if (frag->page.p == pfrag->page &&
>>> +        frag->page_offset + frag->size == pfrag->offset) {
>>> +        frag->size += size;
>>> +    } else {
>>> +        ++frag;
>>> +        frag->page.p = pfrag->page;
>>> +        frag->page_offset = pfrag->offset;
>>> +        frag->size = size;
>>> +        ++record->num_frags;
>>> +        get_page(pfrag->page);
>>> +    }
>>> +
>>> +    pfrag->offset += size;
>>> +    record->len += size;
>>> +}
>>> +
>>> +static inline int tls_push_record(struct sock *sk,
>>> +                  struct tls_context *ctx,
>>> +                  struct tls_offload_context *offload_ctx,
>>> +                  struct tls_record_info *record,
>>> +                  struct page_frag *pfrag,
>>> +                  int flags,
>>> +                  unsigned char record_type)
>>> +{
>>> +    skb_frag_t *frag;
>>> +    struct tcp_sock *tp = tcp_sk(sk);
>>> +    struct page_frag fallback_frag;
>>> +    struct page_frag  *tag_pfrag = pfrag;
>>> +    int i;
>>> +
>>> +    /* fill prepand */
>>> +    frag = &record->frags[0];
>>> +    tls_fill_prepend(ctx,
>>> +             skb_frag_address(frag),
>>> +             record->len - ctx->prepend_size,
>>> +             record_type);
>>> +
>>> +    if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag, GFP_KERNEL))) {
>>> +        /* HW doesn't care about the data in the tag
>>> +         * so in case pfrag has no room
>>> +         * for a tag and we can't allocate a new pfrag
>>> +         * just use the page in the first frag
>>> +         * rather then write a complicated fall back code.
>>> +         */
>>> +        tag_pfrag = &fallback_frag;
>>> +        tag_pfrag->page = skb_frag_page(frag);
>>> +        tag_pfrag->offset = 0;
>>> +    }
>>> +
>>> +    tls_append_frag(record, tag_pfrag, ctx->tag_size);
>>> +    record->end_seq = tp->write_seq + record->len;
>>> +    spin_lock_irq(&offload_ctx->lock);
>>> +    list_add_tail(&record->list, &offload_ctx->records_list);
>>> +    spin_unlock_irq(&offload_ctx->lock);
>>> +    offload_ctx->open_record = NULL;
>>> +    set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
>>> +    tls_advance_record_sn(sk, ctx);
>>> +
>>> +    for (i = 0; i < record->num_frags; i++) {
>>> +        frag = &record->frags[i];
>>> +        sg_unmark_end(&offload_ctx->sg_tx_data[i]);
>>> +        sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag),
>>> +                frag->size, frag->page_offset);
>>> +        sk_mem_charge(sk, frag->size);
>>> +        get_page(skb_frag_page(frag));
>>> +    }
>>> +    sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]);
>>> +
>>> +    /* all ready, send */
>>> +    return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
>>> +}
>>> +
>>> +static inline int tls_create_new_record(struct tls_offload_context *offload_ctx,
>>> +                    struct page_frag *pfrag,
>>> +                    size_t prepend_size)
>>> +{
>>> +    skb_frag_t *frag;
>>> +    struct tls_record_info *record;
>>> +
>>> +    record = kmalloc(sizeof(*record), GFP_KERNEL);
>>> +    if (!record)
>>> +        return -ENOMEM;
>>> +
>>> +    frag = &record->frags[0];
>>> +    __skb_frag_set_page(frag, pfrag->page);
>>> +    frag->page_offset = pfrag->offset;
>>> +    skb_frag_size_set(frag, prepend_size);
>>> +
>>> +    get_page(pfrag->page);
>>> +    pfrag->offset += prepend_size;
>>> +
>>> +    record->num_frags = 1;
>>> +    record->len = prepend_size;
>>> +    offload_ctx->open_record = record;
>>> +    return 0;
>>> +}
>>> +
>>> +static inline int tls_do_allocation(struct sock *sk,
>>> +                    struct tls_offload_context *offload_ctx,
>>> +                    struct page_frag *pfrag,
>>> +                    size_t prepend_size)
>>> +{
>>> +    int ret;
>>> +
>>> +    if (!offload_ctx->open_record) {
>>> +        if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
>>> +                           sk->sk_allocation))) {
>>> +            sk->sk_prot->enter_memory_pressure(sk);
>>> +            sk_stream_moderate_sndbuf(sk);
>>> +            return -ENOMEM;
>>> +        }
>>> +
>>> +        ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
>>> +        if (ret)
>>> +            return ret;
>>> +
>>> +        if (pfrag->size > pfrag->offset)
>>> +            return 0;
>>> +    }
>>> +
>>> +    if (!sk_page_frag_refill(sk, pfrag))
>>> +        return -ENOMEM;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int tls_push_data(struct sock *sk,
>>> +             struct iov_iter *msg_iter,
>>> +             size_t size, int flags,
>>> +             unsigned char record_type)
>>> +{
>>> +    struct tls_context *tls_ctx = tls_get_ctx(sk);
>>> +    struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>>> +    struct tls_record_info *record = ctx->open_record;
>>> +    struct page_frag *pfrag;
>>> +    int copy, rc = 0;
>>> +    size_t orig_size = size;
>>> +    u32 max_open_record_len;
>>> +    long timeo;
>>> +    int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
>>> +    int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
>>> +    bool done = false;
>>> +
>>> +    if (flags &
>>> +        ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
>>> +        return -ENOTSUPP;
>>> +
>>> +    if (sk->sk_err)
>>> +        return -sk->sk_err;
>>> +
>>> +    timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
>>> +    rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo);
>>> +    if (rc < 0)
>>> +        return rc;
>>> +
>>> +    pfrag = sk_page_frag(sk);
>>> +
>>> +    /* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS record, and
>>> +     * we need to leave room for an authentication tag.
>>> +     */
>>> +    max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
>>> +                  tls_ctx->prepend_size;
>>> +    do {
>>> +        if (tls_do_allocation(sk, ctx, pfrag,
>>> +                      tls_ctx->prepend_size)) {
>>> +            rc = sk_stream_wait_memory(sk, &timeo);
>>> +            if (!rc)
>>> +                continue;
>>> +
>>> +            record = ctx->open_record;
>>> +            if (!record)
>>> +                break;
>>> +handle_error:
>>> +            if (record_type != TLS_RECORD_TYPE_DATA) {
>>> +                /* avoid sending partial
>>> +                 * record with type !=
>>> +                 * application_data
>>> +                 */
>>> +                size = orig_size;
>>> +                destroy_record(record);
>>> +                ctx->open_record = NULL;
>>> +            } else if (record->len > tls_ctx->prepend_size) {
>>> +                goto last_record;
>>> +            }
>>> +
>>> +            break;
>>> +        }
>>> +
>>> +        record = ctx->open_record;
>>> +        copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
>>> +        copy = min_t(size_t, copy, (max_open_record_len - record->len));
>>> +
>>> +        if (copy_from_iter_nocache(page_address(pfrag->page) +
>>> +                           pfrag->offset,
>>> +                       copy, msg_iter) != copy) {
>>> +            rc = -EFAULT;
>>> +            goto handle_error;
>>> +        }
>>> +        tls_append_frag(record, pfrag, copy);
>>> +
>>> +        size -= copy;
>>> +        if (!size) {
>>> +last_record:
>>> +            tls_push_record_flags = flags;
>>> +            if (more) {
>>> +                tls_ctx->pending_open_record_frags =
>>> +                        record->num_frags;
>>> +                break;
>>> +            }
>>> +
>>> +            done = true;
>>> +        }
>>> +
>>> +        if ((done) || record->len >= max_open_record_len ||
>>> +            (record->num_frags >= MAX_SKB_FRAGS - 1)) {
>>> +            rc = tls_push_record(sk,
>>> +                         tls_ctx,
>>> +                         ctx,
>>> +                         record,
>>> +                         pfrag,
>>> +                         tls_push_record_flags,
>>> +                         record_type);
>>> +            if (rc < 0)
>>> +                break;
>>> +        }
>>> +    } while (!done);
>>> +
>>> +    if (orig_size - size > 0)
>>> +        rc = orig_size - size;
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>>> +{
>>> +    unsigned char record_type = TLS_RECORD_TYPE_DATA;
>>> +    int rc = 0;
>>> +
>>> +    lock_sock(sk);
>>> +
>>> +    if (unlikely(msg->msg_controllen)) {
>>> +        rc = tls_proccess_cmsg(sk, msg, &record_type);
>>> +        if (rc)
>>> +            goto out;
>>> +    }
>>> +
>>> +    rc = tls_push_data(sk, &msg->msg_iter, size,
>>> +               msg->msg_flags, record_type);
>>> +
>>> +out:
>>> +    release_sock(sk);
>>> +    return rc;
>>> +}
>>> +
>>> +int tls_device_sendpage(struct sock *sk, struct page *page,
>>> +            int offset, size_t size, int flags)
>>> +{
>>> +    struct iov_iter    msg_iter;
>>> +    struct kvec iov;
>>> +    char *kaddr = kmap(page);
>>> +    int rc = 0;
>>> +
>>> +    if (flags & MSG_SENDPAGE_NOTLAST)
>>> +        flags |= MSG_MORE;
>>> +
>>> +    lock_sock(sk);
>>> +
>>> +    if (flags & MSG_OOB) {
>>> +        rc = -ENOTSUPP;
>>> +        goto out;
>>> +    }
>>> +
>>> +    iov.iov_base = kaddr + offset;
>>> +    iov.iov_len = size;
>>> +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size);
>>> +    rc = tls_push_data(sk, &msg_iter, size,
>>> +               flags, TLS_RECORD_TYPE_DATA);
>>> +    kunmap(page);
>>> +
>>> +out:
>>> +    release_sock(sk);
>>> +    return rc;
>>> +}
>>> +
>>> +struct tls_record_info *tls_get_record(struct tls_offload_context *context,
>>> +                       u32 seq, u64 *p_record_sn)
>>> +{
>>> +    struct tls_record_info *info;
>>> +    u64 record_sn = context->hint_record_sn;
>>> +
>>> +    info = context->retransmit_hint;
>>> +    if (!info ||
>>> +        before(seq, info->end_seq - info->len)) {
>>> +        /* if retransmit_hint is irrelevant start
>>> +         * from the begging of the list
>>> +         */
>>> +        info = list_first_entry(&context->records_list,
>>> +                    struct tls_record_info, list);
>>> +        record_sn = context->unacked_record_sn;
>>> +    }
>>> +
>>> +    list_for_each_entry_from(info, &context->records_list, list) {
>>> +        if (before(seq, info->end_seq)) {
>>> +            if (!context->retransmit_hint ||
>>> +                after(info->end_seq,
>>> +                  context->retransmit_hint->end_seq)) {
>>> +                context->hint_record_sn = record_sn;
>>> +                context->retransmit_hint = info;
>>> +            }
>>> +            *p_record_sn = record_sn;
>>> +            return info;
>>> +        }
>>> +        record_sn++;
>>> +    }
>>> +
>>> +    return NULL;
>>> +}
>>> +EXPORT_SYMBOL(tls_get_record);
>>> +
>>> +static int tls_device_push_pending_record(struct sock *sk, int flags)
>>> +{
>>> +    struct iov_iter    msg_iter;
>>> +
>>> +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
>>> +    return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
>>> +}
>>> +
>>> +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
>>> +{
>>> +    u16 nonece_size, tag_size, iv_size, rec_seq_size;
>>> +    struct tls_record_info *start_marker_record;
>>> +    struct tls_offload_context *offload_ctx;
>>> +    struct tls_crypto_info *crypto_info;
>>> +    struct net_device *netdev;
>>> +    char *iv, *rec_seq;
>>> +    struct sk_buff *skb;
>>> +    int rc = -EINVAL;
>>> +    __be64 rcd_sn;
>>> +
>>> +    if (!ctx)
>>> +        goto out;
>>> +
>>> +    if (ctx->priv_ctx) {
>>> +        rc = -EEXIST;
>>> +        goto out;
>>> +    }
>>> +
>>> +    /* We support starting offload on multiple sockets
>>> +     * concurrently, So we only need a read lock here.
>>> +     */
>>> +    percpu_down_read(&device_offload_lock);
>>> +    netdev = get_netdev_for_sock(sk);
>>> +    if (!netdev) {
>>> +        pr_err_ratelimited("%s: netdev not found\n", __func__);
>>> +        rc = -EINVAL;
>>> +        goto release_lock;
>>> +    }
>>> +
>>> +    if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
>>> +        rc = -ENOTSUPP;
>>> +        goto release_netdev;
>>> +    }
>>> +
>>> +    /* Avoid offloading if the device is down
>>> +     * We don't want to offload new flows after
>>> +     * the NETDEV_DOWN event
>>> +     */
>>> +    if (!(netdev->flags & IFF_UP)) {
>>> +        rc = -EINVAL;
>>> +        goto release_lock;
>>> +    }
>>> +
>>> +    crypto_info = &ctx->crypto_send;
>>> +    switch (crypto_info->cipher_type) {
>>> +    case TLS_CIPHER_AES_GCM_128: {
>>> +        nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
>>> +        tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>>> +        iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
>>> +        iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv;
>>> +        rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
>>> +        rec_seq =
>>> +         ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq;
>>> +        break;
>>> +    }
>>> +    default:
>>> +        rc = -EINVAL;
>>> +        goto release_netdev;
>>> +    }
>>> +
>>> +    start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
>>
>> Can we memory allocations and simple memory initializations ouside the global rwsem?
>>
> 
> Sure, we can move all memory allocations outside the lock.
> 
>>> +    if (!start_marker_record) {
>>> +        rc = -ENOMEM;
>>> +        goto release_netdev;
>>> +    }
>>> +
>>> +    offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE, GFP_KERNEL);
>>> +    if (!offload_ctx)
>>> +        goto free_marker_record;
>>> +
>>> +    ctx->priv_ctx = offload_ctx;
>>> +    rc = attach_sock_to_netdev(sk, netdev, ctx);
>>> +    if (rc)
>>> +        goto free_offload_context;
>>> +
>>> +    ctx->netdev = netdev;
>>> +    ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
>>> +    ctx->tag_size = tag_size;
>>> +    ctx->iv_size = iv_size;
>>> +    ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE,
>>> +              GFP_KERNEL);
>>> +    if (!ctx->iv) {
>>> +        rc = -ENOMEM;
>>> +        goto detach_sock;
>>> +    }
>>> +
>>> +    memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size);
>>> +
>>> +    ctx->rec_seq_size = rec_seq_size;
>>> +    ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
>>> +    if (!ctx->rec_seq) {
>>> +        rc = -ENOMEM;
>>> +        goto free_iv;
>>> +    }
>>> +    memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
>>> +
>>> +    /* start at rec_seq - 1 to account for the start marker record */
>>> +    memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
>>> +    offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
>>> +
>>> +    rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
>>> +    if (rc)
>>> +        goto free_rec_seq;
>>> +
>>> +    start_marker_record->end_seq = tcp_sk(sk)->write_seq;
>>> +    start_marker_record->len = 0;
>>> +    start_marker_record->num_frags = 0;
>>> +
>>> +    INIT_LIST_HEAD(&offload_ctx->records_list);
>>> +    list_add_tail(&start_marker_record->list, &offload_ctx->records_list);
>>> +    spin_lock_init(&offload_ctx->lock);
>>> +
>>> +    inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
>>> +    ctx->push_pending_record = tls_device_push_pending_record;
>>> +    offload_ctx->sk_destruct = sk->sk_destruct;
>>> +
>>> +    /* TLS offload is greatly simplified if we don't send
>>> +     * SKBs where only part of the payload needs to be encrypted.
>>> +     * So mark the last skb in the write queue as end of record.
>>> +     */
>>> +    skb = tcp_write_queue_tail(sk);
>>> +    if (skb)
>>> +        TCP_SKB_CB(skb)->eor = 1;
>>> +
>>> +    refcount_set(&ctx->refcount, 1);
>>> +    spin_lock_irq(&tls_device_lock);
>>> +    list_add_tail(&ctx->list, &tls_device_list);
>>> +    spin_unlock_irq(&tls_device_lock);
>>> +
>>> +    /* following this assignment tls_is_sk_tx_device_offloaded
>>> +     * will return true and the context might be accessed
>>> +     * by the netdev's xmit function.
>>> +     */
>>> +    smp_store_release(&sk->sk_destruct,
>>> +              &tls_device_sk_destruct);
>>> +    goto release_lock;
>>> +
>>> +free_rec_seq:
>>> +    kfree(ctx->rec_seq);
>>> +free_iv:
>>> +    kfree(ctx->iv);
>>> +detach_sock:
>>> +    netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX);
>>> +free_offload_context:
>>> +    kfree(offload_ctx);
>>> +    ctx->priv_ctx = NULL;
>>> +free_marker_record:
>>> +    kfree(start_marker_record);
>>> +release_netdev:
>>> +    dev_put(netdev);
>>> +release_lock:
>>> +    percpu_up_read(&device_offload_lock);
>>> +out:
>>> +    return rc;
>>> +}
>>> +
>>> +static int tls_device_register(struct net_device *dev)
>>> +{
>>> +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
>>> +        return NOTIFY_BAD;
>>> +
>>> +    return NOTIFY_DONE;
>>> +}
>>
>> This function is the same as tls_device_feat_change(). Can't we merge
>> them together and avoid duplicating of code?
>>
> 
> Sure.
> 
>>> +static int tls_device_unregister(struct net_device *dev)
>>> +{
>>> +    return NOTIFY_DONE;
>>> +}
>>
>> This function does nothing, and next patches do not change it.
>> Can't we remove it since so?
>>
> 
> Sure.
> 
>>> +static int tls_device_feat_change(struct net_device *dev)
>>> +{
>>> +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev->tlsdev_ops)
>>> +        return NOTIFY_BAD;
>>> +
>>> +    return NOTIFY_DONE;
>>> +}
>>> +
>>> +static int tls_device_down(struct net_device *netdev)
>>> +{
>>> +    struct tls_context *ctx, *tmp;
>>> +    struct list_head list;
>>> +    unsigned long flags;
>>> +
>>> +    if (!(netdev->features & NETIF_F_HW_TLS_TX))
>>> +        return NOTIFY_DONE;
>>
>> Can't we move this check in tls_dev_event() and use it for all types of events?
>> Then we avoid duplicate code.
>>
> 
> No. Not all events require this check. Also, the result is different for different events.

No. You always return NOTIFY_DONE, in case of !(netdev->features & NETIF_F_HW_TLS_TX).
See below:

static int tls_check_dev_ops(struct net_device *dev) 
{
	if (!dev->tlsdev_ops)
		return NOTIFY_BAD; 

	return NOTIFY_DONE; 
}

static int tls_device_down(struct net_device *netdev) 
{
	struct tls_context *ctx, *tmp; 
	struct list_head list; 
	unsigned long flags; 

	...
	return NOTIFY_DONE;
}

static int tls_dev_event(struct notifier_block *this, unsigned long event, 
        		 void *ptr) 
{ 
        struct net_device *dev = netdev_notifier_info_to_dev(ptr); 

	if (!(netdev->features & NETIF_F_HW_TLS_TX)) 
		return NOTIFY_DONE; 
 
        switch (event) { 
        case NETDEV_REGISTER:
        case NETDEV_FEAT_CHANGE: 
        	return tls_check_dev_ops(dev); 
 
        case NETDEV_DOWN: 
        	return tls_device_down(dev); 
        } 
        return NOTIFY_DONE; 
} 
 
>>> +
>>> +    /* Request a write lock to block new offload attempts
>>> +     */
>>> +    percpu_down_write(&device_offload_lock);
>>
>> What is the reason percpu_rwsem is chosen here? It looks like this primitive
>> gives more advantages readers, then plain rwsem does. But it also gives
>> disadvantages to writers. It would be good, unless tls_device_down() is called
>> with rtnl_lock() held from netdevice notifier. But since netdevice notifier
>> are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
>> is locked.
> We use the a rwsem to allow multiple (readers) invocations of tls_set_device_offload, which is triggered by the user (persumably) during the TLS handshake. This might be considered a fast-path.
> 
> However, we must block all calls to tls_set_device_offload while we are processing NETDEV_DOWN events (writer).
> 
> As you've mentioned, the percpu rwsem is more efficient for readers, especially on NUMA systems, where cache-line bouncing occurs during reader acquire and reduces performance.

Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it doesn't
seem to be performance critical. Who else?

>>
>> Can't we use plain rwsem here instead?
>>
> 
> Its a performance tradeoff. I'm not certain that the percpu rwsem write side acquire is significantly worse than using the global rwsem.
> 
> For now, while all of this is experimental, can we agree to focus on the performance of readers? We can change it later if it becomes a problem.

Same as above.
 
>>> +
>>> +    spin_lock_irqsave(&tls_device_lock, flags);
>>> +    INIT_LIST_HEAD(&list);
>>
>> This may go outside the global spinlock.
>>
> 
> Sure.
> 
>>> +    list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) {
>>> +        if (ctx->netdev != netdev ||
>>> +            !refcount_inc_not_zero(&ctx->refcount))
>>> +            continue;
>>> +
>>> +        list_move(&ctx->list, &list);
>>> +    }
>>> +    spin_unlock_irqrestore(&tls_device_lock, flags);
>>> +
>>> +    list_for_each_entry_safe(ctx, tmp, &list, list)    {
>>> +        netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
>>> +                        TLS_OFFLOAD_CTX_DIR_TX);
>>> +        ctx->netdev = NULL;
>>> +        dev_put(netdev);
>>> +        list_del_init(&ctx->list);
>>> +
>>> +        if (refcount_dec_and_test(&ctx->refcount))
>>> +            tls_device_free_ctx(ctx);
>>> +    }
>>> +
>>> +    percpu_up_write(&device_offload_lock);
>>> +
>>> +    flush_work(&tls_device_gc_work);
>>> +
>>> +    return NOTIFY_DONE;
>>> +}
>>> +
>>> +static int tls_dev_event(struct notifier_block *this, unsigned long event,
>>> +             void *ptr)
>>> +{
>>> +    struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>> +
>>> +    switch (event) {
>>> +    case NETDEV_REGISTER:
>>> +        return tls_device_register(dev);
>>> +
>>> +    case NETDEV_UNREGISTER:
>>> +        return tls_device_unregister(dev);
>>> +
>>> +    case NETDEV_FEAT_CHANGE:
>>> +        return tls_device_feat_change(dev);
>>> +
>>> +    case NETDEV_DOWN:
>>> +        return tls_device_down(dev);
>>> +    }
>>> +    return NOTIFY_DONE;
>>> +}
>>> +
>>> +static struct notifier_block tls_dev_notifier = {
>>> +    .notifier_call    = tls_dev_event,
>>> +};
>>> +
>>> +void __init tls_device_init(void)
>>> +{
>>> +    register_netdevice_notifier(&tls_dev_notifier);
>>> +}
>>> +
>>> +void __exit tls_device_cleanup(void)
>>> +{
>>> +    unregister_netdevice_notifier(&tls_dev_notifier);
>>> +    flush_work(&tls_device_gc_work);
>>> +}
>>> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
>>> new file mode 100644
>>> index 000000000000..14d31a36885c
>>> --- /dev/null
>>> +++ b/net/tls/tls_device_fallback.c
>>> @@ -0,0 +1,419 @@
>>> +/* Copyright (c) 2018, Mellanox Technologies All rights reserved.
>>> + *
>>> + *     Redistribution and use in source and binary forms, with or
>>> + *     without modification, are permitted provided that the following
>>> + *     conditions are met:
>>> + *
>>> + *      - Redistributions of source code must retain the above
>>> + *        copyright notice, this list of conditions and the following
>>> + *        disclaimer.
>>> + *
>>> + *      - Redistributions in binary form must reproduce the above
>>> + *        copyright notice, this list of conditions and the following
>>> + *        disclaimer in the documentation and/or other materials
>>> + *        provided with the distribution.
>>> + *
>>> + *      - Neither the name of the Mellanox Technologies nor the
>>> + *        names of its contributors may be used to endorse or promote
>>> + *        products derived from this software without specific prior written
>>> + *        permission.
>>> + *
>>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
>>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>>> + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>>> + * A PARTICULAR PURPOSE ARE DISCLAIMED.
>>> + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
>>> + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
>>> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
>>> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
>>> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> + * POSSIBILITY OF SUCH DAMAGE
>>> + */
>>> +
>>> +#include <net/tls.h>
>>> +#include <crypto/aead.h>
>>> +#include <crypto/scatterwalk.h>
>>> +#include <net/ip6_checksum.h>
>>> +
>>> +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk)
>>> +{
>>> +    struct scatterlist *src = walk->sg;
>>> +    int diff = walk->offset - src->offset;
>>> +
>>> +    sg_set_page(sg, sg_page(src),
>>> +            src->length - diff, walk->offset);
>>> +
>>> +    scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
>>> +}
>>> +
>>> +static int tls_enc_record(struct aead_request *aead_req,
>>> +              struct crypto_aead *aead, char *aad, char *iv,
>>> +              __be64 rcd_sn, struct scatter_walk *in,
>>> +              struct scatter_walk *out, int *in_len)
>>> +{
>>> +    struct scatterlist sg_in[3];
>>> +    struct scatterlist sg_out[3];
>>> +    unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE];
>>> +    u16 len;
>>> +    int rc;
>>> +
>>> +    len = min_t(int, *in_len, ARRAY_SIZE(buf));
>>> +
>>> +    scatterwalk_copychunks(buf, in, len, 0);
>>> +    scatterwalk_copychunks(buf, out, len, 1);
>>> +
>>> +    *in_len -= len;
>>> +    if (!*in_len)
>>> +        return 0;
>>> +
>>> +    scatterwalk_pagedone(in, 0, 1);
>>> +    scatterwalk_pagedone(out, 1, 1);
>>> +
>>> +    len = buf[4] | (buf[3] << 8);
>>> +    len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
>>> +
>>> +    tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
>>> +             (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
>>> +
>>> +    memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE,
>>> +           TLS_CIPHER_AES_GCM_128_IV_SIZE);
>>> +
>>> +    sg_init_table(sg_in, ARRAY_SIZE(sg_in));
>>> +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
>>> +    sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
>>> +    sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
>>> +    chain_to_walk(sg_in + 1, in);
>>> +    chain_to_walk(sg_out + 1, out);
>>> +
>>> +    *in_len -= len;
>>> +    if (*in_len < 0) {
>>> +        *in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>>> +        if (*in_len < 0)
>>> +        /* the input buffer doesn't contain the entire record.
>>> +         * trim len accordingly. The resulting authentication tag
>>> +         * will contain garbage. but we don't care as we won't
>>> +         * include any of it in the output skb
>>> +         * Note that we assume the output buffer length
>>> +         * is larger then input buffer length + tag size
>>> +         */
>>> +            len += *in_len;
>>> +
>>> +        *in_len = 0;
>>> +    }
>>> +
>>> +    if (*in_len) {
>>> +        scatterwalk_copychunks(NULL, in, len, 2);
>>> +        scatterwalk_pagedone(in, 0, 1);
>>> +        scatterwalk_copychunks(NULL, out, len, 2);
>>> +        scatterwalk_pagedone(out, 1, 1);
>>> +    }
>>> +
>>> +    len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
>>> +    aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
>>> +
>>> +    rc = crypto_aead_encrypt(aead_req);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +static void tls_init_aead_request(struct aead_request *aead_req,
>>> +                  struct crypto_aead *aead)
>>> +{
>>> +    aead_request_set_tfm(aead_req, aead);
>>> +    aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
>>> +}
>>> +
>>> +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead,
>>> +                           gfp_t flags)
>>> +{
>>> +    unsigned int req_size = sizeof(struct aead_request) +
>>> +        crypto_aead_reqsize(aead);
>>> +    struct aead_request *aead_req;
>>> +
>>> +    aead_req = kzalloc(req_size, flags);
>>> +    if (!aead_req)
>>> +        return NULL;
>>> +
>>> +    tls_init_aead_request(aead_req, aead);
>>> +    return aead_req;
>>> +}
>>> +
>>> +static int tls_enc_records(struct aead_request *aead_req,
>>> +               struct crypto_aead *aead, struct scatterlist *sg_in,
>>> +               struct scatterlist *sg_out, char *aad, char *iv,
>>> +               u64 rcd_sn, int len)
>>> +{
>>> +    struct scatter_walk in;
>>> +    struct scatter_walk out;
>>> +    int rc;
>>> +
>>> +    scatterwalk_start(&in, sg_in);
>>> +    scatterwalk_start(&out, sg_out);
>>> +
>>> +    do {
>>> +        rc = tls_enc_record(aead_req, aead, aad, iv,
>>> +                    cpu_to_be64(rcd_sn), &in, &out, &len);
>>> +        rcd_sn++;
>>> +
>>> +    } while (rc == 0 && len);
>>> +
>>> +    scatterwalk_done(&in, 0, 0);
>>> +    scatterwalk_done(&out, 1, 0);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +static inline void update_chksum(struct sk_buff *skb, int headln)
>>> +{
>>> +    /* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
>>> +     * might have been changed by NAT.
>>> +     */
>>> +
>>> +    const struct ipv6hdr *ipv6h;
>>> +    const struct iphdr *iph;
>>> +    struct tcphdr *th = tcp_hdr(skb);
>>> +    int datalen = skb->len - headln;
>>> +
>>> +    /* We only changed the payload so if we are using partial we don't
>>> +     * need to update anything.
>>> +     */
>>> +    if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
>>> +        return;
>>> +
>>> +    skb->ip_summed = CHECKSUM_PARTIAL;
>>> +    skb->csum_start = skb_transport_header(skb) - skb->head;
>>> +    skb->csum_offset = offsetof(struct tcphdr, check);
>>> +
>>> +    if (skb->sk->sk_family == AF_INET6) {
>>> +        ipv6h = ipv6_hdr(skb);
>>> +        th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
>>> +                         datalen, IPPROTO_TCP, 0);
>>> +    } else {
>>> +        iph = ip_hdr(skb);
>>> +        th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
>>> +                           IPPROTO_TCP, 0);
>>> +    }
>>> +}
>>> +
>>> +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln)
>>> +{
>>> +    skb_copy_header(nskb, skb);
>>> +
>>> +    skb_put(nskb, skb->len);
>>> +    memcpy(nskb->data, skb->data, headln);
>>> +    update_chksum(nskb, headln);
>>> +
>>> +    nskb->destructor = skb->destructor;
>>> +    nskb->sk = skb->sk;
>>> +    skb->destructor = NULL;
>>> +    skb->sk = NULL;
>>> +    refcount_add(nskb->truesize - skb->truesize,
>>> +             &nskb->sk->sk_wmem_alloc);
>>> +}
>>> +
>>> +/* This function may be called after the user socket is already
>>> + * closed so make sure we don't use anything freed during
>>> + * tls_sk_proto_close here
>>> + */
>>> +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
>>> +{
>>> +    int tcp_header_size = tcp_hdrlen(skb);
>>> +    int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
>>> +    int payload_len = skb->len - tcp_payload_offset;
>>> +    struct tls_context *tls_ctx = tls_get_ctx(sk);
>>> +    struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
>>> +    int remaining, buf_len, resync_sgs, rc, i = 0;
>>> +    void *buf, *dummy_buf, *iv, *aad;
>>> +    struct scatterlist *sg_in;
>>> +    struct scatterlist sg_out[3];
>>> +    u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
>>> +    struct aead_request *aead_req;
>>> +    struct sk_buff *nskb = NULL;
>>> +    struct tls_record_info *record;
>>> +    unsigned long flags;
>>> +    s32 sync_size;
>>> +    u64 rcd_sn;
>>> +
>>> +    /* worst case is:
>>> +     * MAX_SKB_FRAGS in tls_record_info
>>> +     * MAX_SKB_FRAGS + 1 in SKB head an frags.
>>> +     */
>>> +    int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
>>> +
>>> +    if (!payload_len)
>>> +        return skb;
>>> +
>>> +    sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC);
>>> +    if (!sg_in)
>>> +        goto free_orig;
>>> +
>>> +    sg_init_table(sg_in, sg_in_max_elements);
>>> +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
>>> +
>>> +    spin_lock_irqsave(&ctx->lock, flags);
>>> +    record = tls_get_record(ctx, tcp_seq, &rcd_sn);
>>> +    if (!record) {
>>> +        spin_unlock_irqrestore(&ctx->lock, flags);
>>> +        WARN(1, "Record not found for seq %u\n", tcp_seq);
>>> +        goto free_sg;
>>> +    }
>>> +
>>> +    sync_size = tcp_seq - tls_record_start_seq(record);
>>> +    if (sync_size < 0) {
>>> +        int is_start_marker = tls_record_is_start_marker(record);
>>> +
>>> +        spin_unlock_irqrestore(&ctx->lock, flags);
>>> +        if (!is_start_marker)
>>> +        /* This should only occur if the relevant record was
>>> +         * already acked. In that case it should be ok
>>> +         * to drop the packet and avoid retransmission.
>>> +         *
>>> +         * There is a corner case where the packet contains
>>> +         * both an acked and a non-acked record.
>>> +         * We currently don't handle that case and rely
>>> +         * on TCP to retranmit a packet that doesn't contain
>>> +         * already acked payload.
>>> +         */
>>> +            goto free_orig;
>>> +
>>> +        if (payload_len > -sync_size) {
>>> +            WARN(1, "Fallback of partially offloaded packets is not supported\n");
>>> +            goto free_sg;
>>> +        } else {
>>> +            return skb;
>>> +        }
>>> +    }
>>> +
>>> +    remaining = sync_size;
>>> +    while (remaining > 0) {
>>> +        skb_frag_t *frag = &record->frags[i];
>>> +
>>> +        __skb_frag_ref(frag);
>>> +        sg_set_page(sg_in + i, skb_frag_page(frag),
>>> +                skb_frag_size(frag), frag->page_offset);
>>> +
>>> +        remaining -= skb_frag_size(frag);
>>> +
>>> +        if (remaining < 0)
>>> +            sg_in[i].length += remaining;
>>> +
>>> +        i++;
>>> +    }
>>> +    spin_unlock_irqrestore(&ctx->lock, flags);
>>> +    resync_sgs = i;
>>> +
>>> +    aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
>>> +    if (!aead_req)
>>> +        goto put_sg;
>>> +
>>> +    buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
>>> +          TLS_CIPHER_AES_GCM_128_IV_SIZE +
>>> +          TLS_AAD_SPACE_SIZE +
>>> +          sync_size +
>>> +          tls_ctx->tag_size;
>>> +    buf = kmalloc(buf_len, GFP_ATOMIC);
>>> +    if (!buf)
>>> +        goto free_req;
>>> +
>>> +    nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC);
>>> +    if (!nskb)
>>> +        goto free_buf;
>>> +
>>> +    skb_reserve(nskb, skb_headroom(skb));
>>> +
>>> +    iv = buf;
>>> +
>>> +    memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
>>> +           TLS_CIPHER_AES_GCM_128_SALT_SIZE);
>>> +    aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
>>> +          TLS_CIPHER_AES_GCM_128_IV_SIZE;
>>> +    dummy_buf = aad + TLS_AAD_SPACE_SIZE;
>>> +
>>> +    sg_set_buf(&sg_out[0], dummy_buf, sync_size);
>>> +    sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
>>> +           payload_len);
>>> +    /* Add room for authentication tag produced by crypto */
>>> +    dummy_buf += sync_size;
>>> +    sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
>>> +    rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
>>> +              payload_len);
>>> +    if (rc < 0)
>>> +        goto free_nskb;
>>> +
>>> +    rc = tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv,
>>> +                 rcd_sn, sync_size + payload_len);
>>> +    if (rc < 0)
>>> +        goto free_nskb;
>>> +
>>> +    complete_skb(nskb, skb, tcp_payload_offset);
>>> +
>>> +    /* validate_xmit_skb_list assumes that if the skb wasn't segmented
>>> +     * nskb->prev will point to the skb itself
>>> +     */
>>> +    nskb->prev = nskb;
>>> +free_buf:
>>> +    kfree(buf);
>>> +free_req:
>>> +    kfree(aead_req);
>>> +put_sg:
>>> +    for (i = 0; i < resync_sgs; i++)
>>> +        put_page(sg_page(&sg_in[i]));
>>> +free_sg:
>>> +    kfree(sg_in);
>>> +free_orig:
>>> +    kfree_skb(skb);
>>> +    return nskb;
>>> +
>>> +free_nskb:
>>> +    kfree_skb(nskb);
>>> +    nskb = NULL;
>>> +    goto free_buf;
>>> +}
>>> +
>>> +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
>>> +                         struct net_device *dev,
>>> +                         struct sk_buff *skb)
>>> +{
>>> +    if (dev == tls_get_ctx(sk)->netdev)
>>> +        return skb;
>>> +
>>> +    return tls_sw_fallback(sk, skb);
>>> +}
>>> +
>>> +int tls_sw_fallback_init(struct sock *sk,
>>> +             struct tls_offload_context *offload_ctx,
>>> +             struct tls_crypto_info *crypto_info)
>>> +{
>>> +    int rc;
>>> +    const u8 *key;
>>> +
>>> +    offload_ctx->aead_send =
>>> +        crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
>>> +    if (IS_ERR(offload_ctx->aead_send)) {
>>> +        rc = PTR_ERR(offload_ctx->aead_send);
>>> +        pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc);
>>> +        offload_ctx->aead_send = NULL;
>>> +        goto err_out;
>>> +    }
>>> +
>>> +    key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key;
>>> +
>>> +    rc = crypto_aead_setkey(offload_ctx->aead_send, key,
>>> +                TLS_CIPHER_AES_GCM_128_KEY_SIZE);
>>> +    if (rc)
>>> +        goto free_aead;
>>> +
>>> +    rc = crypto_aead_setauthsize(offload_ctx->aead_send,
>>> +                     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
>>> +    if (rc)
>>> +        goto free_aead;
>>> +
>>> +    sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
>>> +    return 0;
>>> +free_aead:
>>> +    crypto_free_aead(offload_ctx->aead_send);
>>> +err_out:
>>> +    return rc;
>>> +}
>>> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
>>> index d824d548447e..e0dface33017 100644
>>> --- a/net/tls/tls_main.c
>>> +++ b/net/tls/tls_main.c
>>> @@ -54,6 +54,9 @@ enum {
>>>   enum {
>>>       TLS_BASE_TX,
>>>       TLS_SW_TX,
>>> +#ifdef CONFIG_TLS_DEVICE
>>> +    TLS_HW_TX,
>>> +#endif
>>>       TLS_NUM_CONFIG,
>>>   };
>>>   @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval,
>>>           goto err_crypto_info;
>>>       }
>>>   -    /* currently SW is default, we will have ethtool in future */
>>> -    rc = tls_set_sw_offload(sk, ctx);
>>> -    tx_conf = TLS_SW_TX;
>>> -    if (rc)
>>> -        goto err_crypto_info;
>>> +#ifdef CONFIG_TLS_DEVICE
>>> +    rc = tls_set_device_offload(sk, ctx);
>>> +    tx_conf = TLS_HW_TX;
>>> +    if (rc) {
>>> +#else
>>> +    {
>>> +#endif
>>> +        /* if HW offload fails fallback to SW */
>>> +        rc = tls_set_sw_offload(sk, ctx);
>>> +        tx_conf = TLS_SW_TX;
>>> +        if (rc)
>>> +            goto err_crypto_info;
>>> +    }
>>>         ctx->tx_conf = tx_conf;
>>>       update_sk_prot(sk, ctx);
>>> @@ -473,6 +484,12 @@ static void build_protos(struct proto *prot, struct proto *base)
>>>       prot[TLS_SW_TX] = prot[TLS_BASE_TX];
>>>       prot[TLS_SW_TX].sendmsg        = tls_sw_sendmsg;
>>>       prot[TLS_SW_TX].sendpage    = tls_sw_sendpage;
>>> +
>>> +#ifdef CONFIG_TLS_DEVICE
>>> +    prot[TLS_HW_TX] = prot[TLS_SW_TX];
>>> +    prot[TLS_HW_TX].sendmsg        = tls_device_sendmsg;
>>> +    prot[TLS_HW_TX].sendpage    = tls_device_sendpage;
>>> +#endif
>>>   }
>>>     static int tls_init(struct sock *sk)
>>> @@ -531,6 +548,9 @@ static int __init tls_register(void)
>>>   {
>>>       build_protos(tls_prots[TLSV4], &tcp_prot);
>>>   +#ifdef CONFIG_TLS_DEVICE
>>> +    tls_device_init();
>>> +#endif
>>>       tcp_register_ulp(&tcp_tls_ulp_ops);
>>>         return 0;
>>> @@ -539,6 +559,9 @@ static int __init tls_register(void)
>>>   static void __exit tls_unregister(void)
>>>   {
>>>       tcp_unregister_ulp(&tcp_tls_ulp_ops);
>>> +#ifdef CONFIG_TLS_DEVICE
>>> +    tls_device_cleanup();
>>> +#endif
>>>   }
>>>     module_init(tls_register);

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 01/14] tcp: Add clean acked data hook
  2018-03-21 16:16       ` Rao Shoaib
@ 2018-03-21 16:32         ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2018-03-21 16:32 UTC (permalink / raw)
  To: rao.shoaib; +Cc: borisp, saeedm, netdev, davejwatson, ilyal, aviadye

From: Rao Shoaib <rao.shoaib@oracle.com>
Date: Wed, 21 Mar 2018 09:16:48 -0700

> I agree, in fact I was using function pointers for the exact reason,
> to change the functionality of a function. I asked Dave about the
> use and he said No (Also note that the relevant CPU optimizations
> have been turned off on selected NIC's due to the latest security
> issues -- On AMD CPU's the optimizations are not turned off). So it
> is Dave's decision -- I am hoping that he would reconsider and allow
> me to use pointers as well as pointers solve the problem nicely and
> are used extensively.

This situation is different from your's Rao.

That proposal was to add indirect calls for things the TCP stack
internally can make internal state checks for.

Whereas this current patch discussed here is a driver offload hook,
which TCP cannot internally possibly know anything about.

I am fine with what Boris et al. are doing here.  It is a different
situation than your's.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-21 16:31       ` Kirill Tkhai
@ 2018-03-21 20:50         ` Saeed Mahameed
  2018-03-22 12:38         ` Boris Pismenny
  1 sibling, 0 replies; 27+ messages in thread
From: Saeed Mahameed @ 2018-03-21 20:50 UTC (permalink / raw)
  To: ktkhai, davem, Boris Pismenny
  Cc: netdev, davejwatson, Ilya Lesokhin, Aviad Yehezkel

On Wed, 2018-03-21 at 19:31 +0300, Kirill Tkhai wrote:
> On 21.03.2018 18:53, Boris Pismenny wrote:
> > ...
> > > 
> > > Other patches have two licenses in header. Can I distribute this
> > > file under GPL license terms?
> > > 
> > 
> > Sure, I'll update the license to match other files under net/tls.
> > 
> > > > +#include <linux/module.h>
> > > > +#include <net/tcp.h>
> > > > +#include <net/inet_common.h>
> > > > +#include <linux/highmem.h>
> > > > +#include <linux/netdevice.h>
> > > > +
> > > > +#include <net/tls.h>
> > > > +#include <crypto/aead.h>
> > > > +
> > > > +/* device_offload_lock is used to synchronize tls_dev_add
> > > > + * against NETDEV_DOWN notifications.
> > > > + */
> > > > +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work);
> > > > +
> > > > +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
> > > > +static LIST_HEAD(tls_device_gc_list);
> > > > +static LIST_HEAD(tls_device_list);
> > > > +static DEFINE_SPINLOCK(tls_device_lock);
> > > > +
> > > > +static void tls_device_free_ctx(struct tls_context *ctx)
> > > > +{
> > > > +    struct tls_offload_context *offlad_ctx =
> > > > tls_offload_ctx(ctx);
> > > > +
> > > > +    kfree(offlad_ctx);
> > > > +    kfree(ctx);
> > > > +}
> > > > +
> > > > +static void tls_device_gc_task(struct work_struct *work)
> > > > +{
> > > > +    struct tls_context *ctx, *tmp;
> > > > +    struct list_head gc_list;
> > > > +    unsigned long flags;
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    INIT_LIST_HEAD(&gc_list);
> > > 
> > > This is stack variable, and it should be initialized outside of
> > > global spinlock.
> > > There is LIST_HEAD() primitive for that in kernel.
> > > There is one more similar place below.
> > > 
> > 
> > Sure.
> > 
> > > > +    list_splice_init(&tls_device_gc_list, &gc_list);
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +
> > > > +    list_for_each_entry_safe(ctx, tmp, &gc_list, list) {
> > > > +        struct net_device *netdev = ctx->netdev;
> > > > +
> > > > +        if (netdev) {
> > > > +            netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > +                            TLS_OFFLOAD_CTX_DIR_TX);
> > > > +            dev_put(netdev);
> > > > +        }
> > > 
> > > How is possible the situation we meet NULL netdev here >
> > 
> > This can happen in tls_device_down. tls_deviec_down is called
> > whenever a netdev that is used for TLS inline crypto offload goes
> > down. It gets called via the NETDEV_DOWN event of the netdevice
> > notifier.
> > 
> > This flow is somewhat similar to the xfrm_device netdev notifier.
> > However, we do not destroy the socket (as in destroying the
> > xfrm_state in xfrm_device). Instead, we cleanup the netdev state
> > and allow software fallback to handle the rest of the traffic.
> > 
> > > > +
> > > > +        list_del(&ctx->list);
> > > > +        tls_device_free_ctx(ctx);
> > > > +    }
> > > > +}
> > > > +
> > > > +static void tls_device_queue_ctx_destruction(struct
> > > > tls_context *ctx)
> > > > +{
> > > > +    unsigned long flags;
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    list_move_tail(&ctx->list, &tls_device_gc_list);
> > > > +
> > > > +    /* schedule_work inside the spinlock
> > > > +     * to make sure tls_device_down waits for that work.
> > > > +     */
> > > > +    schedule_work(&tls_device_gc_work);
> > > > +
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +}
> > > > +
> > > > +/* We assume that the socket is already connected */
> > > > +static struct net_device *get_netdev_for_sock(struct sock *sk)
> > > > +{
> > > > +    struct inet_sock *inet = inet_sk(sk);
> > > > +    struct net_device *netdev = NULL;
> > > > +
> > > > +    netdev = dev_get_by_index(sock_net(sk), inet-
> > > > >cork.fl.flowi_oif);
> > > > +
> > > > +    return netdev;
> > > > +}
> > > > +
> > > > +static int attach_sock_to_netdev(struct sock *sk, struct
> > > > net_device *netdev,
> > > > +                 struct tls_context *ctx)
> > > > +{
> > > > +    int rc;
> > > > +
> > > > +    rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk,
> > > > TLS_OFFLOAD_CTX_DIR_TX,
> > > > +                         &ctx->crypto_send,
> > > > +                         tcp_sk(sk)->write_seq);
> > > > +    if (rc) {
> > > > +        pr_err_ratelimited("The netdev has refused to offload
> > > > this socket\n");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    rc = 0;
> > > > +out:
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static void destroy_record(struct tls_record_info *record)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    int nr_frags = record->num_frags;
> > > > +
> > > > +    while (nr_frags > 0) {
> > > > +        frag = &record->frags[nr_frags - 1];
> > > > +        __skb_frag_unref(frag);
> > > > +        --nr_frags;
> > > > +    }
> > > > +    kfree(record);
> > > > +}
> > > > +
> > > > +static void delete_all_records(struct tls_offload_context
> > > > *offload_ctx)
> > > > +{
> > > > +    struct tls_record_info *info, *temp;
> > > > +
> > > > +    list_for_each_entry_safe(info, temp, &offload_ctx-
> > > > >records_list, list) {
> > > > +        list_del(&info->list);
> > > > +        destroy_record(info);
> > > > +    }
> > > > +
> > > > +    offload_ctx->retransmit_hint = NULL;
> > > > +}
> > > > +
> > > > +static void tls_icsk_clean_acked(struct sock *sk, u32
> > > > acked_seq)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx;
> > > > +    struct tls_record_info *info, *temp;
> > > > +    unsigned long flags;
> > > > +    u64 deleted_records = 0;
> > > > +
> > > > +    if (!tls_ctx)
> > > > +        return;
> > > > +
> > > > +    ctx = tls_offload_ctx(tls_ctx);
> > > > +
> > > > +    spin_lock_irqsave(&ctx->lock, flags);
> > > > +    info = ctx->retransmit_hint;
> > > > +    if (info && !before(acked_seq, info->end_seq)) {
> > > > +        ctx->retransmit_hint = NULL;
> > > > +        list_del(&info->list);
> > > > +        destroy_record(info);
> > > > +        deleted_records++;
> > > > +    }
> > > > +
> > > > +    list_for_each_entry_safe(info, temp, &ctx->records_list,
> > > > list) {
> > > > +        if (before(acked_seq, info->end_seq))
> > > > +            break;
> > > > +        list_del(&info->list);
> > > > +
> > > > +        destroy_record(info);
> > > > +        deleted_records++;
> > > > +    }
> > > > +
> > > > +    ctx->unacked_record_sn += deleted_records;
> > > > +    spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +}
> > > > +
> > > > +/* At this point, there should be no references on this
> > > > + * socket and no in-flight SKBs associated with this
> > > > + * socket, so it is safe to free all the resources.
> > > > + */
> > > > +void tls_device_sk_destruct(struct sock *sk)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +
> > > > +    if (ctx->open_record)
> > > > +        destroy_record(ctx->open_record);
> > > > +
> > > > +    delete_all_records(ctx);
> > > > +    crypto_free_aead(ctx->aead_send);
> > > > +    ctx->sk_destruct(sk);
> > > > +
> > > > +    if (refcount_dec_and_test(&tls_ctx->refcount))
> > > > +        tls_device_queue_ctx_destruction(tls_ctx);
> > > > +}
> > > > +EXPORT_SYMBOL(tls_device_sk_destruct);
> > > > +
> > > > +static inline void tls_append_frag(struct tls_record_info
> > > > *record,
> > > > +                   struct page_frag *pfrag,
> > > > +                   int size)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +
> > > > +    frag = &record->frags[record->num_frags - 1];
> > > > +    if (frag->page.p == pfrag->page &&
> > > > +        frag->page_offset + frag->size == pfrag->offset) {
> > > > +        frag->size += size;
> > > > +    } else {
> > > > +        ++frag;
> > > > +        frag->page.p = pfrag->page;
> > > > +        frag->page_offset = pfrag->offset;
> > > > +        frag->size = size;
> > > > +        ++record->num_frags;
> > > > +        get_page(pfrag->page);
> > > > +    }
> > > > +
> > > > +    pfrag->offset += size;
> > > > +    record->len += size;
> > > > +}
> > > > +
> > > > +static inline int tls_push_record(struct sock *sk,
> > > > +                  struct tls_context *ctx,
> > > > +                  struct tls_offload_context *offload_ctx,
> > > > +                  struct tls_record_info *record,
> > > > +                  struct page_frag *pfrag,
> > > > +                  int flags,
> > > > +                  unsigned char record_type)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    struct tcp_sock *tp = tcp_sk(sk);
> > > > +    struct page_frag fallback_frag;
> > > > +    struct page_frag  *tag_pfrag = pfrag;
> > > > +    int i;
> > > > +
> > > > +    /* fill prepand */
> > > > +    frag = &record->frags[0];
> > > > +    tls_fill_prepend(ctx,
> > > > +             skb_frag_address(frag),
> > > > +             record->len - ctx->prepend_size,
> > > > +             record_type);
> > > > +
> > > > +    if (unlikely(!skb_page_frag_refill(ctx->tag_size, pfrag,
> > > > GFP_KERNEL))) {
> > > > +        /* HW doesn't care about the data in the tag
> > > > +         * so in case pfrag has no room
> > > > +         * for a tag and we can't allocate a new pfrag
> > > > +         * just use the page in the first frag
> > > > +         * rather then write a complicated fall back code.
> > > > +         */
> > > > +        tag_pfrag = &fallback_frag;
> > > > +        tag_pfrag->page = skb_frag_page(frag);
> > > > +        tag_pfrag->offset = 0;
> > > > +    }
> > > > +
> > > > +    tls_append_frag(record, tag_pfrag, ctx->tag_size);
> > > > +    record->end_seq = tp->write_seq + record->len;
> > > > +    spin_lock_irq(&offload_ctx->lock);
> > > > +    list_add_tail(&record->list, &offload_ctx->records_list);
> > > > +    spin_unlock_irq(&offload_ctx->lock);
> > > > +    offload_ctx->open_record = NULL;
> > > > +    set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
> > > > +    tls_advance_record_sn(sk, ctx);
> > > > +
> > > > +    for (i = 0; i < record->num_frags; i++) {
> > > > +        frag = &record->frags[i];
> > > > +        sg_unmark_end(&offload_ctx->sg_tx_data[i]);
> > > > +        sg_set_page(&offload_ctx->sg_tx_data[i],
> > > > skb_frag_page(frag),
> > > > +                frag->size, frag->page_offset);
> > > > +        sk_mem_charge(sk, frag->size);
> > > > +        get_page(skb_frag_page(frag));
> > > > +    }
> > > > +    sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags -
> > > > 1]);
> > > > +
> > > > +    /* all ready, send */
> > > > +    return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0,
> > > > flags);
> > > > +}
> > > > +
> > > > +static inline int tls_create_new_record(struct
> > > > tls_offload_context *offload_ctx,
> > > > +                    struct page_frag *pfrag,
> > > > +                    size_t prepend_size)
> > > > +{
> > > > +    skb_frag_t *frag;
> > > > +    struct tls_record_info *record;
> > > > +
> > > > +    record = kmalloc(sizeof(*record), GFP_KERNEL);
> > > > +    if (!record)
> > > > +        return -ENOMEM;
> > > > +
> > > > +    frag = &record->frags[0];
> > > > +    __skb_frag_set_page(frag, pfrag->page);
> > > > +    frag->page_offset = pfrag->offset;
> > > > +    skb_frag_size_set(frag, prepend_size);
> > > > +
> > > > +    get_page(pfrag->page);
> > > > +    pfrag->offset += prepend_size;
> > > > +
> > > > +    record->num_frags = 1;
> > > > +    record->len = prepend_size;
> > > > +    offload_ctx->open_record = record;
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static inline int tls_do_allocation(struct sock *sk,
> > > > +                    struct tls_offload_context *offload_ctx,
> > > > +                    struct page_frag *pfrag,
> > > > +                    size_t prepend_size)
> > > > +{
> > > > +    int ret;
> > > > +
> > > > +    if (!offload_ctx->open_record) {
> > > > +        if (unlikely(!skb_page_frag_refill(prepend_size,
> > > > pfrag,
> > > > +                           sk->sk_allocation))) {
> > > > +            sk->sk_prot->enter_memory_pressure(sk);
> > > > +            sk_stream_moderate_sndbuf(sk);
> > > > +            return -ENOMEM;
> > > > +        }
> > > > +
> > > > +        ret = tls_create_new_record(offload_ctx, pfrag,
> > > > prepend_size);
> > > > +        if (ret)
> > > > +            return ret;
> > > > +
> > > > +        if (pfrag->size > pfrag->offset)
> > > > +            return 0;
> > > > +    }
> > > > +
> > > > +    if (!sk_page_frag_refill(sk, pfrag))
> > > > +        return -ENOMEM;
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int tls_push_data(struct sock *sk,
> > > > +             struct iov_iter *msg_iter,
> > > > +             size_t size, int flags,
> > > > +             unsigned char record_type)
> > > > +{
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +    struct tls_record_info *record = ctx->open_record;
> > > > +    struct page_frag *pfrag;
> > > > +    int copy, rc = 0;
> > > > +    size_t orig_size = size;
> > > > +    u32 max_open_record_len;
> > > > +    long timeo;
> > > > +    int more = flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE);
> > > > +    int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
> > > > +    bool done = false;
> > > > +
> > > > +    if (flags &
> > > > +        ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
> > > > MSG_SENDPAGE_NOTLAST))
> > > > +        return -ENOTSUPP;
> > > > +
> > > > +    if (sk->sk_err)
> > > > +        return -sk->sk_err;
> > > > +
> > > > +    timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> > > > +    rc = tls_complete_pending_work(sk, tls_ctx, flags,
> > > > &timeo);
> > > > +    if (rc < 0)
> > > > +        return rc;
> > > > +
> > > > +    pfrag = sk_page_frag(sk);
> > > > +
> > > > +    /* KTLS_TLS_HEADER_SIZE is not counted as part of the TLS
> > > > record, and
> > > > +     * we need to leave room for an authentication tag.
> > > > +     */
> > > > +    max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
> > > > +                  tls_ctx->prepend_size;
> > > > +    do {
> > > > +        if (tls_do_allocation(sk, ctx, pfrag,
> > > > +                      tls_ctx->prepend_size)) {
> > > > +            rc = sk_stream_wait_memory(sk, &timeo);
> > > > +            if (!rc)
> > > > +                continue;
> > > > +
> > > > +            record = ctx->open_record;
> > > > +            if (!record)
> > > > +                break;
> > > > +handle_error:
> > > > +            if (record_type != TLS_RECORD_TYPE_DATA) {
> > > > +                /* avoid sending partial
> > > > +                 * record with type !=
> > > > +                 * application_data
> > > > +                 */
> > > > +                size = orig_size;
> > > > +                destroy_record(record);
> > > > +                ctx->open_record = NULL;
> > > > +            } else if (record->len > tls_ctx->prepend_size) {
> > > > +                goto last_record;
> > > > +            }
> > > > +
> > > > +            break;
> > > > +        }
> > > > +
> > > > +        record = ctx->open_record;
> > > > +        copy = min_t(size_t, size, (pfrag->size - pfrag-
> > > > >offset));
> > > > +        copy = min_t(size_t, copy, (max_open_record_len -
> > > > record->len));
> > > > +
> > > > +        if (copy_from_iter_nocache(page_address(pfrag->page) +
> > > > +                           pfrag->offset,
> > > > +                       copy, msg_iter) != copy) {
> > > > +            rc = -EFAULT;
> > > > +            goto handle_error;
> > > > +        }
> > > > +        tls_append_frag(record, pfrag, copy);
> > > > +
> > > > +        size -= copy;
> > > > +        if (!size) {
> > > > +last_record:
> > > > +            tls_push_record_flags = flags;
> > > > +            if (more) {
> > > > +                tls_ctx->pending_open_record_frags =
> > > > +                        record->num_frags;
> > > > +                break;
> > > > +            }
> > > > +
> > > > +            done = true;
> > > > +        }
> > > > +
> > > > +        if ((done) || record->len >= max_open_record_len ||
> > > > +            (record->num_frags >= MAX_SKB_FRAGS - 1)) {
> > > > +            rc = tls_push_record(sk,
> > > > +                         tls_ctx,
> > > > +                         ctx,
> > > > +                         record,
> > > > +                         pfrag,
> > > > +                         tls_push_record_flags,
> > > > +                         record_type);
> > > > +            if (rc < 0)
> > > > +                break;
> > > > +        }
> > > > +    } while (!done);
> > > > +
> > > > +    if (orig_size - size > 0)
> > > > +        rc = orig_size - size;
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg,
> > > > size_t size)
> > > > +{
> > > > +    unsigned char record_type = TLS_RECORD_TYPE_DATA;
> > > > +    int rc = 0;
> > > > +
> > > > +    lock_sock(sk);
> > > > +
> > > > +    if (unlikely(msg->msg_controllen)) {
> > > > +        rc = tls_proccess_cmsg(sk, msg, &record_type);
> > > > +        if (rc)
> > > > +            goto out;
> > > > +    }
> > > > +
> > > > +    rc = tls_push_data(sk, &msg->msg_iter, size,
> > > > +               msg->msg_flags, record_type);
> > > > +
> > > > +out:
> > > > +    release_sock(sk);
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +int tls_device_sendpage(struct sock *sk, struct page *page,
> > > > +            int offset, size_t size, int flags)
> > > > +{
> > > > +    struct iov_iter    msg_iter;
> > > > +    struct kvec iov;
> > > > +    char *kaddr = kmap(page);
> > > > +    int rc = 0;
> > > > +
> > > > +    if (flags & MSG_SENDPAGE_NOTLAST)
> > > > +        flags |= MSG_MORE;
> > > > +
> > > > +    lock_sock(sk);
> > > > +
> > > > +    if (flags & MSG_OOB) {
> > > > +        rc = -ENOTSUPP;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    iov.iov_base = kaddr + offset;
> > > > +    iov.iov_len = size;
> > > > +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1,
> > > > size);
> > > > +    rc = tls_push_data(sk, &msg_iter, size,
> > > > +               flags, TLS_RECORD_TYPE_DATA);
> > > > +    kunmap(page);
> > > > +
> > > > +out:
> > > > +    release_sock(sk);
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +struct tls_record_info *tls_get_record(struct
> > > > tls_offload_context *context,
> > > > +                       u32 seq, u64 *p_record_sn)
> > > > +{
> > > > +    struct tls_record_info *info;
> > > > +    u64 record_sn = context->hint_record_sn;
> > > > +
> > > > +    info = context->retransmit_hint;
> > > > +    if (!info ||
> > > > +        before(seq, info->end_seq - info->len)) {
> > > > +        /* if retransmit_hint is irrelevant start
> > > > +         * from the begging of the list
> > > > +         */
> > > > +        info = list_first_entry(&context->records_list,
> > > > +                    struct tls_record_info, list);
> > > > +        record_sn = context->unacked_record_sn;
> > > > +    }
> > > > +
> > > > +    list_for_each_entry_from(info, &context->records_list,
> > > > list) {
> > > > +        if (before(seq, info->end_seq)) {
> > > > +            if (!context->retransmit_hint ||
> > > > +                after(info->end_seq,
> > > > +                  context->retransmit_hint->end_seq)) {
> > > > +                context->hint_record_sn = record_sn;
> > > > +                context->retransmit_hint = info;
> > > > +            }
> > > > +            *p_record_sn = record_sn;
> > > > +            return info;
> > > > +        }
> > > > +        record_sn++;
> > > > +    }
> > > > +
> > > > +    return NULL;
> > > > +}
> > > > +EXPORT_SYMBOL(tls_get_record);
> > > > +
> > > > +static int tls_device_push_pending_record(struct sock *sk, int
> > > > flags)
> > > > +{
> > > > +    struct iov_iter    msg_iter;
> > > > +
> > > > +    iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0);
> > > > +    return tls_push_data(sk, &msg_iter, 0, flags,
> > > > TLS_RECORD_TYPE_DATA);
> > > > +}
> > > > +
> > > > +int tls_set_device_offload(struct sock *sk, struct tls_context
> > > > *ctx)
> > > > +{
> > > > +    u16 nonece_size, tag_size, iv_size, rec_seq_size;
> > > > +    struct tls_record_info *start_marker_record;
> > > > +    struct tls_offload_context *offload_ctx;
> > > > +    struct tls_crypto_info *crypto_info;
> > > > +    struct net_device *netdev;
> > > > +    char *iv, *rec_seq;
> > > > +    struct sk_buff *skb;
> > > > +    int rc = -EINVAL;
> > > > +    __be64 rcd_sn;
> > > > +
> > > > +    if (!ctx)
> > > > +        goto out;
> > > > +
> > > > +    if (ctx->priv_ctx) {
> > > > +        rc = -EEXIST;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    /* We support starting offload on multiple sockets
> > > > +     * concurrently, So we only need a read lock here.
> > > > +     */
> > > > +    percpu_down_read(&device_offload_lock);
> > > > +    netdev = get_netdev_for_sock(sk);
> > > > +    if (!netdev) {
> > > > +        pr_err_ratelimited("%s: netdev not found\n",
> > > > __func__);
> > > > +        rc = -EINVAL;
> > > > +        goto release_lock;
> > > > +    }
> > > > +
> > > > +    if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
> > > > +        rc = -ENOTSUPP;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    /* Avoid offloading if the device is down
> > > > +     * We don't want to offload new flows after
> > > > +     * the NETDEV_DOWN event
> > > > +     */
> > > > +    if (!(netdev->flags & IFF_UP)) {
> > > > +        rc = -EINVAL;
> > > > +        goto release_lock;
> > > > +    }
> > > > +
> > > > +    crypto_info = &ctx->crypto_send;
> > > > +    switch (crypto_info->cipher_type) {
> > > > +    case TLS_CIPHER_AES_GCM_128: {
> > > > +        nonece_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +        tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +        iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +        iv = ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->iv;
> > > > +        rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE;
> > > > +        rec_seq =
> > > > +         ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->rec_seq;
> > > > +        break;
> > > > +    }
> > > > +    default:
> > > > +        rc = -EINVAL;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    start_marker_record =
> > > > kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
> > > 
> > > Can we memory allocations and simple memory initializations
> > > ouside the global rwsem?
> > > 
> > 
> > Sure, we can move all memory allocations outside the lock.
> > 
> > > > +    if (!start_marker_record) {
> > > > +        rc = -ENOMEM;
> > > > +        goto release_netdev;
> > > > +    }
> > > > +
> > > > +    offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE,
> > > > GFP_KERNEL);
> > > > +    if (!offload_ctx)
> > > > +        goto free_marker_record;
> > > > +
> > > > +    ctx->priv_ctx = offload_ctx;
> > > > +    rc = attach_sock_to_netdev(sk, netdev, ctx);
> > > > +    if (rc)
> > > > +        goto free_offload_context;
> > > > +
> > > > +    ctx->netdev = netdev;
> > > > +    ctx->prepend_size = TLS_HEADER_SIZE + nonece_size;
> > > > +    ctx->tag_size = tag_size;
> > > > +    ctx->iv_size = iv_size;
> > > > +    ctx->iv = kmalloc(iv_size +
> > > > TLS_CIPHER_AES_GCM_128_SALT_SIZE,
> > > > +              GFP_KERNEL);
> > > > +    if (!ctx->iv) {
> > > > +        rc = -ENOMEM;
> > > > +        goto detach_sock;
> > > > +    }
> > > > +
> > > > +    memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv,
> > > > iv_size);
> > > > +
> > > > +    ctx->rec_seq_size = rec_seq_size;
> > > > +    ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL);
> > > > +    if (!ctx->rec_seq) {
> > > > +        rc = -ENOMEM;
> > > > +        goto free_iv;
> > > > +    }
> > > > +    memcpy(ctx->rec_seq, rec_seq, rec_seq_size);
> > > > +
> > > > +    /* start at rec_seq - 1 to account for the start marker
> > > > record */
> > > > +    memcpy(&rcd_sn, ctx->rec_seq, sizeof(rcd_sn));
> > > > +    offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> > > > +
> > > > +    rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info);
> > > > +    if (rc)
> > > > +        goto free_rec_seq;
> > > > +
> > > > +    start_marker_record->end_seq = tcp_sk(sk)->write_seq;
> > > > +    start_marker_record->len = 0;
> > > > +    start_marker_record->num_frags = 0;
> > > > +
> > > > +    INIT_LIST_HEAD(&offload_ctx->records_list);
> > > > +    list_add_tail(&start_marker_record->list, &offload_ctx-
> > > > >records_list);
> > > > +    spin_lock_init(&offload_ctx->lock);
> > > > +
> > > > +    inet_csk(sk)->icsk_clean_acked = &tls_icsk_clean_acked;
> > > > +    ctx->push_pending_record = tls_device_push_pending_record;
> > > > +    offload_ctx->sk_destruct = sk->sk_destruct;
> > > > +
> > > > +    /* TLS offload is greatly simplified if we don't send
> > > > +     * SKBs where only part of the payload needs to be
> > > > encrypted.
> > > > +     * So mark the last skb in the write queue as end of
> > > > record.
> > > > +     */
> > > > +    skb = tcp_write_queue_tail(sk);
> > > > +    if (skb)
> > > > +        TCP_SKB_CB(skb)->eor = 1;
> > > > +
> > > > +    refcount_set(&ctx->refcount, 1);
> > > > +    spin_lock_irq(&tls_device_lock);
> > > > +    list_add_tail(&ctx->list, &tls_device_list);
> > > > +    spin_unlock_irq(&tls_device_lock);
> > > > +
> > > > +    /* following this assignment tls_is_sk_tx_device_offloaded
> > > > +     * will return true and the context might be accessed
> > > > +     * by the netdev's xmit function.
> > > > +     */
> > > > +    smp_store_release(&sk->sk_destruct,
> > > > +              &tls_device_sk_destruct);
> > > > +    goto release_lock;
> > > > +
> > > > +free_rec_seq:
> > > > +    kfree(ctx->rec_seq);
> > > > +free_iv:
> > > > +    kfree(ctx->iv);
> > > > +detach_sock:
> > > > +    netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > TLS_OFFLOAD_CTX_DIR_TX);
> > > > +free_offload_context:
> > > > +    kfree(offload_ctx);
> > > > +    ctx->priv_ctx = NULL;
> > > > +free_marker_record:
> > > > +    kfree(start_marker_record);
> > > > +release_netdev:
> > > > +    dev_put(netdev);
> > > > +release_lock:
> > > > +    percpu_up_read(&device_offload_lock);
> > > > +out:
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static int tls_device_register(struct net_device *dev)
> > > > +{
> > > > +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev-
> > > > >tlsdev_ops)
> > > > +        return NOTIFY_BAD;
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > 
> > > This function is the same as tls_device_feat_change(). Can't we
> > > merge
> > > them together and avoid duplicating of code?
> > > 
> > 
> > Sure.
> > 
> > > > +static int tls_device_unregister(struct net_device *dev)
> > > > +{
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > 
> > > This function does nothing, and next patches do not change it.
> > > Can't we remove it since so?
> > > 
> > 
> > Sure.
> > 
> > > > +static int tls_device_feat_change(struct net_device *dev)
> > > > +{
> > > > +    if ((dev->features & NETIF_F_HW_TLS_TX) && !dev-
> > > > >tlsdev_ops)
> > > > +        return NOTIFY_BAD;
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static int tls_device_down(struct net_device *netdev)
> > > > +{
> > > > +    struct tls_context *ctx, *tmp;
> > > > +    struct list_head list;
> > > > +    unsigned long flags;
> > > > +
> > > > +    if (!(netdev->features & NETIF_F_HW_TLS_TX))
> > > > +        return NOTIFY_DONE;
> > > 
> > > Can't we move this check in tls_dev_event() and use it for all
> > > types of events?
> > > Then we avoid duplicate code.
> > > 
> > 
> > No. Not all events require this check. Also, the result is
> > different for different events.
> 
> No. You always return NOTIFY_DONE, in case of !(netdev->features &
> NETIF_F_HW_TLS_TX).
> See below:
> 
> static int tls_check_dev_ops(struct net_device *dev) 
> {
> 	if (!dev->tlsdev_ops)
> 		return NOTIFY_BAD; 
> 
> 	return NOTIFY_DONE; 
> }
> 
> static int tls_device_down(struct net_device *netdev) 
> {
> 	struct tls_context *ctx, *tmp; 
> 	struct list_head list; 
> 	unsigned long flags; 
> 
> 	...
> 	return NOTIFY_DONE;
> }
> 
> static int tls_dev_event(struct notifier_block *this, unsigned long
> event, 
>         		 void *ptr) 
> { 
>         struct net_device *dev = netdev_notifier_info_to_dev(ptr); 
> 
> 	if (!(netdev->features & NETIF_F_HW_TLS_TX)) 
> 		return NOTIFY_DONE; 
>  
>         switch (event) { 
>         case NETDEV_REGISTER:
>         case NETDEV_FEAT_CHANGE: 
>         	return tls_check_dev_ops(dev); 
>  
>         case NETDEV_DOWN: 
>         	return tls_device_down(dev); 
>         } 
>         return NOTIFY_DONE; 
> } 
>  

Will fix in V2.

> > > > +
> > > > +    /* Request a write lock to block new offload attempts
> > > > +     */
> > > > +    percpu_down_write(&device_offload_lock);
> > > 
> > > What is the reason percpu_rwsem is chosen here? It looks like
> > > this primitive
> > > gives more advantages readers, then plain rwsem does. But it also
> > > gives
> > > disadvantages to writers. It would be good, unless
> > > tls_device_down() is called
> > > with rtnl_lock() held from netdevice notifier. But since
> > > netdevice notifier
> > > are called with rtnl_lock() held, percpu_rwsem will increase the
> > > time rtnl_lock()
> > > is locked.
> > 
> > We use the a rwsem to allow multiple (readers) invocations of
> > tls_set_device_offload, which is triggered by the user (persumably)
> > during the TLS handshake. This might be considered a fast-path.
> > 
> > However, we must block all calls to tls_set_device_offload while we
> > are processing NETDEV_DOWN events (writer).
> > 
> > As you've mentioned, the percpu rwsem is more efficient for
> > readers, especially on NUMA systems, where cache-line bouncing
> > occurs during reader acquire and reduces performance.
> 
> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(),
> but it doesn't
> seem to be performance critical. Who else?
> 

it is performance critical since it is done in the socket handshake
phase, anyway I tend to agree with you that per cpu rwsem is an
overkill, will change it to regular rwsem in V2.

> > > 
> > > Can't we use plain rwsem here instead?
> > > 
> > 
> > Its a performance tradeoff. I'm not certain that the percpu rwsem
> > write side acquire is significantly worse than using the global
> > rwsem.
> > 
> > For now, while all of this is experimental, can we agree to focus
> > on the performance of readers? We can change it later if it becomes
> > a problem.
> 
> Same as above.
>  
> > > > +
> > > > +    spin_lock_irqsave(&tls_device_lock, flags);
> > > > +    INIT_LIST_HEAD(&list);
> > > 
> > > This may go outside the global spinlock.
> > > 
> > 
> > Sure.
> > 
> > > > +    list_for_each_entry_safe(ctx, tmp, &tls_device_list, list)
> > > > {
> > > > +        if (ctx->netdev != netdev ||
> > > > +            !refcount_inc_not_zero(&ctx->refcount))
> > > > +            continue;
> > > > +
> > > > +        list_move(&ctx->list, &list);
> > > > +    }
> > > > +    spin_unlock_irqrestore(&tls_device_lock, flags);
> > > > +
> > > > +    list_for_each_entry_safe(ctx, tmp, &list, list)    {
> > > > +        netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> > > > +                        TLS_OFFLOAD_CTX_DIR_TX);
> > > > +        ctx->netdev = NULL;
> > > > +        dev_put(netdev);
> > > > +        list_del_init(&ctx->list);
> > > > +
> > > > +        if (refcount_dec_and_test(&ctx->refcount))
> > > > +            tls_device_free_ctx(ctx);
> > > > +    }
> > > > +
> > > > +    percpu_up_write(&device_offload_lock);
> > > > +
> > > > +    flush_work(&tls_device_gc_work);
> > > > +
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static int tls_dev_event(struct notifier_block *this, unsigned
> > > > long event,
> > > > +             void *ptr)
> > > > +{
> > > > +    struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> > > > +
> > > > +    switch (event) {
> > > > +    case NETDEV_REGISTER:
> > > > +        return tls_device_register(dev);
> > > > +
> > > > +    case NETDEV_UNREGISTER:
> > > > +        return tls_device_unregister(dev);
> > > > +
> > > > +    case NETDEV_FEAT_CHANGE:
> > > > +        return tls_device_feat_change(dev);
> > > > +
> > > > +    case NETDEV_DOWN:
> > > > +        return tls_device_down(dev);
> > > > +    }
> > > > +    return NOTIFY_DONE;
> > > > +}
> > > > +
> > > > +static struct notifier_block tls_dev_notifier = {
> > > > +    .notifier_call    = tls_dev_event,
> > > > +};
> > > > +
> > > > +void __init tls_device_init(void)
> > > > +{
> > > > +    register_netdevice_notifier(&tls_dev_notifier);
> > > > +}
> > > > +
> > > > +void __exit tls_device_cleanup(void)
> > > > +{
> > > > +    unregister_netdevice_notifier(&tls_dev_notifier);
> > > > +    flush_work(&tls_device_gc_work);
> > > > +}
> > > > diff --git a/net/tls/tls_device_fallback.c
> > > > b/net/tls/tls_device_fallback.c
> > > > new file mode 100644
> > > > index 000000000000..14d31a36885c
> > > > --- /dev/null
> > > > +++ b/net/tls/tls_device_fallback.c
> > > > @@ -0,0 +1,419 @@
> > > > +/* Copyright (c) 2018, Mellanox Technologies All rights
> > > > reserved.
> > > > + *
> > > > + *     Redistribution and use in source and binary forms, with
> > > > or
> > > > + *     without modification, are permitted provided that the
> > > > following
> > > > + *     conditions are met:
> > > > + *
> > > > + *      - Redistributions of source code must retain the above
> > > > + *        copyright notice, this list of conditions and the
> > > > following
> > > > + *        disclaimer.
> > > > + *
> > > > + *      - Redistributions in binary form must reproduce the
> > > > above
> > > > + *        copyright notice, this list of conditions and the
> > > > following
> > > > + *        disclaimer in the documentation and/or other
> > > > materials
> > > > + *        provided with the distribution.
> > > > + *
> > > > + *      - Neither the name of the Mellanox Technologies nor
> > > > the
> > > > + *        names of its contributors may be used to endorse or
> > > > promote
> > > > + *        products derived from this software without specific
> > > > prior written
> > > > + *        permission.
> > > > + *
> > > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> > > > CONTRIBUTORS "AS IS"
> > > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > LIMITED TO,
> > > > + * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + * A PARTICULAR PURPOSE ARE DISCLAIMED.
> > > > + * IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
> > > > LIABLE FOR
> > > > + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > > CONSEQUENTIAL
> > > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> > > > SUBSTITUTE GOODS OR
> > > > + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> > > > INTERRUPTION)
> > > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
> > > > CONTRACT,
> > > > + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
> > > > OTHERWISE) ARISING
> > > > + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> > > > OF THE
> > > > + * POSSIBILITY OF SUCH DAMAGE
> > > > + */
> > > > +
> > > > +#include <net/tls.h>
> > > > +#include <crypto/aead.h>
> > > > +#include <crypto/scatterwalk.h>
> > > > +#include <net/ip6_checksum.h>
> > > > +
> > > > +static void chain_to_walk(struct scatterlist *sg, struct
> > > > scatter_walk *walk)
> > > > +{
> > > > +    struct scatterlist *src = walk->sg;
> > > > +    int diff = walk->offset - src->offset;
> > > > +
> > > > +    sg_set_page(sg, sg_page(src),
> > > > +            src->length - diff, walk->offset);
> > > > +
> > > > +    scatterwalk_crypto_chain(sg, sg_next(src), 0, 2);
> > > > +}
> > > > +
> > > > +static int tls_enc_record(struct aead_request *aead_req,
> > > > +              struct crypto_aead *aead, char *aad, char *iv,
> > > > +              __be64 rcd_sn, struct scatter_walk *in,
> > > > +              struct scatter_walk *out, int *in_len)
> > > > +{
> > > > +    struct scatterlist sg_in[3];
> > > > +    struct scatterlist sg_out[3];
> > > > +    unsigned char buf[TLS_HEADER_SIZE +
> > > > TLS_CIPHER_AES_GCM_128_IV_SIZE];
> > > > +    u16 len;
> > > > +    int rc;
> > > > +
> > > > +    len = min_t(int, *in_len, ARRAY_SIZE(buf));
> > > > +
> > > > +    scatterwalk_copychunks(buf, in, len, 0);
> > > > +    scatterwalk_copychunks(buf, out, len, 1);
> > > > +
> > > > +    *in_len -= len;
> > > > +    if (!*in_len)
> > > > +        return 0;
> > > > +
> > > > +    scatterwalk_pagedone(in, 0, 1);
> > > > +    scatterwalk_pagedone(out, 1, 1);
> > > > +
> > > > +    len = buf[4] | (buf[3] << 8);
> > > > +    len -= TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +
> > > > +    tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE,
> > > > +             (char *)&rcd_sn, sizeof(rcd_sn), buf[0]);
> > > > +
> > > > +    memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf +
> > > > TLS_HEADER_SIZE,
> > > > +           TLS_CIPHER_AES_GCM_128_IV_SIZE);
> > > > +
> > > > +    sg_init_table(sg_in, ARRAY_SIZE(sg_in));
> > > > +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> > > > +    sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE);
> > > > +    sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE);
> > > > +    chain_to_walk(sg_in + 1, in);
> > > > +    chain_to_walk(sg_out + 1, out);
> > > > +
> > > > +    *in_len -= len;
> > > > +    if (*in_len < 0) {
> > > > +        *in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +        if (*in_len < 0)
> > > > +        /* the input buffer doesn't contain the entire record.
> > > > +         * trim len accordingly. The resulting authentication
> > > > tag
> > > > +         * will contain garbage. but we don't care as we won't
> > > > +         * include any of it in the output skb
> > > > +         * Note that we assume the output buffer length
> > > > +         * is larger then input buffer length + tag size
> > > > +         */
> > > > +            len += *in_len;
> > > > +
> > > > +        *in_len = 0;
> > > > +    }
> > > > +
> > > > +    if (*in_len) {
> > > > +        scatterwalk_copychunks(NULL, in, len, 2);
> > > > +        scatterwalk_pagedone(in, 0, 1);
> > > > +        scatterwalk_copychunks(NULL, out, len, 2);
> > > > +        scatterwalk_pagedone(out, 1, 1);
> > > > +    }
> > > > +
> > > > +    len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE;
> > > > +    aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv);
> > > > +
> > > > +    rc = crypto_aead_encrypt(aead_req);
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static void tls_init_aead_request(struct aead_request
> > > > *aead_req,
> > > > +                  struct crypto_aead *aead)
> > > > +{
> > > > +    aead_request_set_tfm(aead_req, aead);
> > > > +    aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE);
> > > > +}
> > > > +
> > > > +static struct aead_request *tls_alloc_aead_request(struct
> > > > crypto_aead *aead,
> > > > +                           gfp_t flags)
> > > > +{
> > > > +    unsigned int req_size = sizeof(struct aead_request) +
> > > > +        crypto_aead_reqsize(aead);
> > > > +    struct aead_request *aead_req;
> > > > +
> > > > +    aead_req = kzalloc(req_size, flags);
> > > > +    if (!aead_req)
> > > > +        return NULL;
> > > > +
> > > > +    tls_init_aead_request(aead_req, aead);
> > > > +    return aead_req;
> > > > +}
> > > > +
> > > > +static int tls_enc_records(struct aead_request *aead_req,
> > > > +               struct crypto_aead *aead, struct scatterlist
> > > > *sg_in,
> > > > +               struct scatterlist *sg_out, char *aad, char
> > > > *iv,
> > > > +               u64 rcd_sn, int len)
> > > > +{
> > > > +    struct scatter_walk in;
> > > > +    struct scatter_walk out;
> > > > +    int rc;
> > > > +
> > > > +    scatterwalk_start(&in, sg_in);
> > > > +    scatterwalk_start(&out, sg_out);
> > > > +
> > > > +    do {
> > > > +        rc = tls_enc_record(aead_req, aead, aad, iv,
> > > > +                    cpu_to_be64(rcd_sn), &in, &out, &len);
> > > > +        rcd_sn++;
> > > > +
> > > > +    } while (rc == 0 && len);
> > > > +
> > > > +    scatterwalk_done(&in, 0, 0);
> > > > +    scatterwalk_done(&out, 1, 0);
> > > > +
> > > > +    return rc;
> > > > +}
> > > > +
> > > > +static inline void update_chksum(struct sk_buff *skb, int
> > > > headln)
> > > > +{
> > > > +    /* Can't use icsk->icsk_af_ops->send_check here because
> > > > the ip addresses
> > > > +     * might have been changed by NAT.
> > > > +     */
> > > > +
> > > > +    const struct ipv6hdr *ipv6h;
> > > > +    const struct iphdr *iph;
> > > > +    struct tcphdr *th = tcp_hdr(skb);
> > > > +    int datalen = skb->len - headln;
> > > > +
> > > > +    /* We only changed the payload so if we are using partial
> > > > we don't
> > > > +     * need to update anything.
> > > > +     */
> > > > +    if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
> > > > +        return;
> > > > +
> > > > +    skb->ip_summed = CHECKSUM_PARTIAL;
> > > > +    skb->csum_start = skb_transport_header(skb) - skb->head;
> > > > +    skb->csum_offset = offsetof(struct tcphdr, check);
> > > > +
> > > > +    if (skb->sk->sk_family == AF_INET6) {
> > > > +        ipv6h = ipv6_hdr(skb);
> > > > +        th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h-
> > > > >daddr,
> > > > +                         datalen, IPPROTO_TCP, 0);
> > > > +    } else {
> > > > +        iph = ip_hdr(skb);
> > > > +        th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, 
> > > > datalen,
> > > > +                           IPPROTO_TCP, 0);
> > > > +    }
> > > > +}
> > > > +
> > > > +static void complete_skb(struct sk_buff *nskb, struct sk_buff
> > > > *skb, int headln)
> > > > +{
> > > > +    skb_copy_header(nskb, skb);
> > > > +
> > > > +    skb_put(nskb, skb->len);
> > > > +    memcpy(nskb->data, skb->data, headln);
> > > > +    update_chksum(nskb, headln);
> > > > +
> > > > +    nskb->destructor = skb->destructor;
> > > > +    nskb->sk = skb->sk;
> > > > +    skb->destructor = NULL;
> > > > +    skb->sk = NULL;
> > > > +    refcount_add(nskb->truesize - skb->truesize,
> > > > +             &nskb->sk->sk_wmem_alloc);
> > > > +}
> > > > +
> > > > +/* This function may be called after the user socket is
> > > > already
> > > > + * closed so make sure we don't use anything freed during
> > > > + * tls_sk_proto_close here
> > > > + */
> > > > +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct
> > > > sk_buff *skb)
> > > > +{
> > > > +    int tcp_header_size = tcp_hdrlen(skb);
> > > > +    int tcp_payload_offset = skb_transport_offset(skb) +
> > > > tcp_header_size;
> > > > +    int payload_len = skb->len - tcp_payload_offset;
> > > > +    struct tls_context *tls_ctx = tls_get_ctx(sk);
> > > > +    struct tls_offload_context *ctx =
> > > > tls_offload_ctx(tls_ctx);
> > > > +    int remaining, buf_len, resync_sgs, rc, i = 0;
> > > > +    void *buf, *dummy_buf, *iv, *aad;
> > > > +    struct scatterlist *sg_in;
> > > > +    struct scatterlist sg_out[3];
> > > > +    u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> > > > +    struct aead_request *aead_req;
> > > > +    struct sk_buff *nskb = NULL;
> > > > +    struct tls_record_info *record;
> > > > +    unsigned long flags;
> > > > +    s32 sync_size;
> > > > +    u64 rcd_sn;
> > > > +
> > > > +    /* worst case is:
> > > > +     * MAX_SKB_FRAGS in tls_record_info
> > > > +     * MAX_SKB_FRAGS + 1 in SKB head an frags.
> > > > +     */
> > > > +    int sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1;
> > > > +
> > > > +    if (!payload_len)
> > > > +        return skb;
> > > > +
> > > > +    sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in),
> > > > GFP_ATOMIC);
> > > > +    if (!sg_in)
> > > > +        goto free_orig;
> > > > +
> > > > +    sg_init_table(sg_in, sg_in_max_elements);
> > > > +    sg_init_table(sg_out, ARRAY_SIZE(sg_out));
> > > > +
> > > > +    spin_lock_irqsave(&ctx->lock, flags);
> > > > +    record = tls_get_record(ctx, tcp_seq, &rcd_sn);
> > > > +    if (!record) {
> > > > +        spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +        WARN(1, "Record not found for seq %u\n", tcp_seq);
> > > > +        goto free_sg;
> > > > +    }
> > > > +
> > > > +    sync_size = tcp_seq - tls_record_start_seq(record);
> > > > +    if (sync_size < 0) {
> > > > +        int is_start_marker =
> > > > tls_record_is_start_marker(record);
> > > > +
> > > > +        spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +        if (!is_start_marker)
> > > > +        /* This should only occur if the relevant record was
> > > > +         * already acked. In that case it should be ok
> > > > +         * to drop the packet and avoid retransmission.
> > > > +         *
> > > > +         * There is a corner case where the packet contains
> > > > +         * both an acked and a non-acked record.
> > > > +         * We currently don't handle that case and rely
> > > > +         * on TCP to retranmit a packet that doesn't contain
> > > > +         * already acked payload.
> > > > +         */
> > > > +            goto free_orig;
> > > > +
> > > > +        if (payload_len > -sync_size) {
> > > > +            WARN(1, "Fallback of partially offloaded packets
> > > > is not supported\n");
> > > > +            goto free_sg;
> > > > +        } else {
> > > > +            return skb;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    remaining = sync_size;
> > > > +    while (remaining > 0) {
> > > > +        skb_frag_t *frag = &record->frags[i];
> > > > +
> > > > +        __skb_frag_ref(frag);
> > > > +        sg_set_page(sg_in + i, skb_frag_page(frag),
> > > > +                skb_frag_size(frag), frag->page_offset);
> > > > +
> > > > +        remaining -= skb_frag_size(frag);
> > > > +
> > > > +        if (remaining < 0)
> > > > +            sg_in[i].length += remaining;
> > > > +
> > > > +        i++;
> > > > +    }
> > > > +    spin_unlock_irqrestore(&ctx->lock, flags);
> > > > +    resync_sgs = i;
> > > > +
> > > > +    aead_req = tls_alloc_aead_request(ctx->aead_send,
> > > > GFP_ATOMIC);
> > > > +    if (!aead_req)
> > > > +        goto put_sg;
> > > > +
> > > > +    buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> > > > +          TLS_CIPHER_AES_GCM_128_IV_SIZE +
> > > > +          TLS_AAD_SPACE_SIZE +
> > > > +          sync_size +
> > > > +          tls_ctx->tag_size;
> > > > +    buf = kmalloc(buf_len, GFP_ATOMIC);
> > > > +    if (!buf)
> > > > +        goto free_req;
> > > > +
> > > > +    nskb = alloc_skb(skb_headroom(skb) + skb->len,
> > > > GFP_ATOMIC);
> > > > +    if (!nskb)
> > > > +        goto free_buf;
> > > > +
> > > > +    skb_reserve(nskb, skb_headroom(skb));
> > > > +
> > > > +    iv = buf;
> > > > +
> > > > +    memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
> > > > +           TLS_CIPHER_AES_GCM_128_SALT_SIZE);
> > > > +    aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
> > > > +          TLS_CIPHER_AES_GCM_128_IV_SIZE;
> > > > +    dummy_buf = aad + TLS_AAD_SPACE_SIZE;
> > > > +
> > > > +    sg_set_buf(&sg_out[0], dummy_buf, sync_size);
> > > > +    sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset,
> > > > +           payload_len);
> > > > +    /* Add room for authentication tag produced by crypto */
> > > > +    dummy_buf += sync_size;
> > > > +    sg_set_buf(&sg_out[2], dummy_buf, tls_ctx->tag_size);
> > > > +    rc = skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset,
> > > > +              payload_len);
> > > > +    if (rc < 0)
> > > > +        goto free_nskb;
> > > > +
> > > > +    rc = tls_enc_records(aead_req, ctx->aead_send, sg_in,
> > > > sg_out, aad, iv,
> > > > +                 rcd_sn, sync_size + payload_len);
> > > > +    if (rc < 0)
> > > > +        goto free_nskb;
> > > > +
> > > > +    complete_skb(nskb, skb, tcp_payload_offset);
> > > > +
> > > > +    /* validate_xmit_skb_list assumes that if the skb wasn't
> > > > segmented
> > > > +     * nskb->prev will point to the skb itself
> > > > +     */
> > > > +    nskb->prev = nskb;
> > > > +free_buf:
> > > > +    kfree(buf);
> > > > +free_req:
> > > > +    kfree(aead_req);
> > > > +put_sg:
> > > > +    for (i = 0; i < resync_sgs; i++)
> > > > +        put_page(sg_page(&sg_in[i]));
> > > > +free_sg:
> > > > +    kfree(sg_in);
> > > > +free_orig:
> > > > +    kfree_skb(skb);
> > > > +    return nskb;
> > > > +
> > > > +free_nskb:
> > > > +    kfree_skb(nskb);
> > > > +    nskb = NULL;
> > > > +    goto free_buf;
> > > > +}
> > > > +
> > > > +static struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
> > > > +                         struct net_device *dev,
> > > > +                         struct sk_buff *skb)
> > > > +{
> > > > +    if (dev == tls_get_ctx(sk)->netdev)
> > > > +        return skb;
> > > > +
> > > > +    return tls_sw_fallback(sk, skb);
> > > > +}
> > > > +
> > > > +int tls_sw_fallback_init(struct sock *sk,
> > > > +             struct tls_offload_context *offload_ctx,
> > > > +             struct tls_crypto_info *crypto_info)
> > > > +{
> > > > +    int rc;
> > > > +    const u8 *key;
> > > > +
> > > > +    offload_ctx->aead_send =
> > > > +        crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC);
> > > > +    if (IS_ERR(offload_ctx->aead_send)) {
> > > > +        rc = PTR_ERR(offload_ctx->aead_send);
> > > > +        pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n",
> > > > rc);
> > > > +        offload_ctx->aead_send = NULL;
> > > > +        goto err_out;
> > > > +    }
> > > > +
> > > > +    key = ((struct tls12_crypto_info_aes_gcm_128
> > > > *)crypto_info)->key;
> > > > +
> > > > +    rc = crypto_aead_setkey(offload_ctx->aead_send, key,
> > > > +                TLS_CIPHER_AES_GCM_128_KEY_SIZE);
> > > > +    if (rc)
> > > > +        goto free_aead;
> > > > +
> > > > +    rc = crypto_aead_setauthsize(offload_ctx->aead_send,
> > > > +                     TLS_CIPHER_AES_GCM_128_TAG_SIZE);
> > > > +    if (rc)
> > > > +        goto free_aead;
> > > > +
> > > > +    sk->sk_validate_xmit_skb = tls_validate_xmit_skb;
> > > > +    return 0;
> > > > +free_aead:
> > > > +    crypto_free_aead(offload_ctx->aead_send);
> > > > +err_out:
> > > > +    return rc;
> > > > +}
> > > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > > > index d824d548447e..e0dface33017 100644
> > > > --- a/net/tls/tls_main.c
> > > > +++ b/net/tls/tls_main.c
> > > > @@ -54,6 +54,9 @@ enum {
> > > >   enum {
> > > >       TLS_BASE_TX,
> > > >       TLS_SW_TX,
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    TLS_HW_TX,
> > > > +#endif
> > > >       TLS_NUM_CONFIG,
> > > >   };
> > > >   @@ -416,11 +419,19 @@ static int do_tls_setsockopt_tx(struct
> > > > sock *sk, char __user *optval,
> > > >           goto err_crypto_info;
> > > >       }
> > > >   -    /* currently SW is default, we will have ethtool in
> > > > future */
> > > > -    rc = tls_set_sw_offload(sk, ctx);
> > > > -    tx_conf = TLS_SW_TX;
> > > > -    if (rc)
> > > > -        goto err_crypto_info;
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    rc = tls_set_device_offload(sk, ctx);
> > > > +    tx_conf = TLS_HW_TX;
> > > > +    if (rc) {
> > > > +#else
> > > > +    {
> > > > +#endif
> > > > +        /* if HW offload fails fallback to SW */
> > > > +        rc = tls_set_sw_offload(sk, ctx);
> > > > +        tx_conf = TLS_SW_TX;
> > > > +        if (rc)
> > > > +            goto err_crypto_info;
> > > > +    }
> > > >         ctx->tx_conf = tx_conf;
> > > >       update_sk_prot(sk, ctx);
> > > > @@ -473,6 +484,12 @@ static void build_protos(struct proto
> > > > *prot, struct proto *base)
> > > >       prot[TLS_SW_TX] = prot[TLS_BASE_TX];
> > > >       prot[TLS_SW_TX].sendmsg        = tls_sw_sendmsg;
> > > >       prot[TLS_SW_TX].sendpage    = tls_sw_sendpage;
> > > > +
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    prot[TLS_HW_TX] = prot[TLS_SW_TX];
> > > > +    prot[TLS_HW_TX].sendmsg        = tls_device_sendmsg;
> > > > +    prot[TLS_HW_TX].sendpage    = tls_device_sendpage;
> > > > +#endif
> > > >   }
> > > >     static int tls_init(struct sock *sk)
> > > > @@ -531,6 +548,9 @@ static int __init tls_register(void)
> > > >   {
> > > >       build_protos(tls_prots[TLSV4], &tcp_prot);
> > > >   +#ifdef CONFIG_TLS_DEVICE
> > > > +    tls_device_init();
> > > > +#endif
> > > >       tcp_register_ulp(&tcp_tls_ulp_ops);
> > > >         return 0;
> > > > @@ -539,6 +559,9 @@ static int __init tls_register(void)
> > > >   static void __exit tls_unregister(void)
> > > >   {
> > > >       tcp_unregister_ulp(&tcp_tls_ulp_ops);
> > > > +#ifdef CONFIG_TLS_DEVICE
> > > > +    tls_device_cleanup();
> > > > +#endif
> > > >   }
> > > >     module_init(tls_register);
> 
> Thanks,
> Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-21 16:31       ` Kirill Tkhai
  2018-03-21 20:50         ` Saeed Mahameed
@ 2018-03-22 12:38         ` Boris Pismenny
  2018-03-22 13:03           ` Kirill Tkhai
  1 sibling, 1 reply; 27+ messages in thread
From: Boris Pismenny @ 2018-03-22 12:38 UTC (permalink / raw)
  To: Kirill Tkhai, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel

...
>>>
>>> Can't we move this check in tls_dev_event() and use it for all types of events?
>>> Then we avoid duplicate code.
>>>
>>
>> No. Not all events require this check. Also, the result is different for different events.
> 
> No. You always return NOTIFY_DONE, in case of !(netdev->features & NETIF_F_HW_TLS_TX).
> See below:
> 
> static int tls_check_dev_ops(struct net_device *dev)
> {
> 	if (!dev->tlsdev_ops)
> 		return NOTIFY_BAD;
> 
> 	return NOTIFY_DONE;
> }
> 
> static int tls_device_down(struct net_device *netdev)
> {
> 	struct tls_context *ctx, *tmp;
> 	struct list_head list;
> 	unsigned long flags;
> 
> 	...
> 	return NOTIFY_DONE;
> }
> 
> static int tls_dev_event(struct notifier_block *this, unsigned long event,
>          		 void *ptr)
> {
>          struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> 
> 	if (!(netdev->features & NETIF_F_HW_TLS_TX))
> 		return NOTIFY_DONE;
>   
>          switch (event) {
>          case NETDEV_REGISTER:
>          case NETDEV_FEAT_CHANGE:
>          	return tls_check_dev_ops(dev);
>   
>          case NETDEV_DOWN:
>          	return tls_device_down(dev);
>          }
>          return NOTIFY_DONE;
> }
>  

Sure, will fix in V3.

>>>> +
>>>> +    /* Request a write lock to block new offload attempts
>>>> +     */
>>>> +    percpu_down_write(&device_offload_lock);
>>>
>>> What is the reason percpu_rwsem is chosen here? It looks like this primitive
>>> gives more advantages readers, then plain rwsem does. But it also gives
>>> disadvantages to writers. It would be good, unless tls_device_down() is called
>>> with rtnl_lock() held from netdevice notifier. But since netdevice notifier
>>> are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
>>> is locked.
>> We use the a rwsem to allow multiple (readers) invocations of tls_set_device_offload, which is triggered by the user (persumably) during the TLS handshake. This might be considered a fast-path.
>>
>> However, we must block all calls to tls_set_device_offload while we are processing NETDEV_DOWN events (writer).
>>
>> As you've mentioned, the percpu rwsem is more efficient for readers, especially on NUMA systems, where cache-line bouncing occurs during reader acquire and reduces performance.
> 
> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it doesn't
> seem to be performance critical. Who else?
> 

It depends on whether you consider the TLS handshake code as critical.
The readers are TCP connections processing the CCS message of the TLS 
handshake. They are providing key material to the kernel to start using 
Kernel TLS.


>>>
>>> Can't we use plain rwsem here instead?
>>>
>>
>> Its a performance tradeoff. I'm not certain that the percpu rwsem write side acquire is significantly worse than using the global rwsem.
>>
>> For now, while all of this is experimental, can we agree to focus on the performance of readers? We can change it later if it becomes a problem.
> 
> Same as above.
>   

Replaced with rwsem from V2.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure
  2018-03-22 12:38         ` Boris Pismenny
@ 2018-03-22 13:03           ` Kirill Tkhai
  0 siblings, 0 replies; 27+ messages in thread
From: Kirill Tkhai @ 2018-03-22 13:03 UTC (permalink / raw)
  To: Boris Pismenny, Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Ilya Lesokhin, Aviad Yehezkel

On 22.03.2018 15:38, Boris Pismenny wrote:
> ...
>>>>
>>>> Can't we move this check in tls_dev_event() and use it for all types of events?
>>>> Then we avoid duplicate code.
>>>>
>>>
>>> No. Not all events require this check. Also, the result is different for different events.
>>
>> No. You always return NOTIFY_DONE, in case of !(netdev->features & NETIF_F_HW_TLS_TX).
>> See below:
>>
>> static int tls_check_dev_ops(struct net_device *dev)
>> {
>>     if (!dev->tlsdev_ops)
>>         return NOTIFY_BAD;
>>
>>     return NOTIFY_DONE;
>> }
>>
>> static int tls_device_down(struct net_device *netdev)
>> {
>>     struct tls_context *ctx, *tmp;
>>     struct list_head list;
>>     unsigned long flags;
>>
>>     ...
>>     return NOTIFY_DONE;
>> }
>>
>> static int tls_dev_event(struct notifier_block *this, unsigned long event,
>>                   void *ptr)
>> {
>>          struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>
>>     if (!(netdev->features & NETIF_F_HW_TLS_TX))
>>         return NOTIFY_DONE;
>>            switch (event) {
>>          case NETDEV_REGISTER:
>>          case NETDEV_FEAT_CHANGE:
>>              return tls_check_dev_ops(dev);
>>            case NETDEV_DOWN:
>>              return tls_device_down(dev);
>>          }
>>          return NOTIFY_DONE;
>> }
>>  
> 
> Sure, will fix in V3.
> 
>>>>> +
>>>>> +    /* Request a write lock to block new offload attempts
>>>>> +     */
>>>>> +    percpu_down_write(&device_offload_lock);
>>>>
>>>> What is the reason percpu_rwsem is chosen here? It looks like this primitive
>>>> gives more advantages readers, then plain rwsem does. But it also gives
>>>> disadvantages to writers. It would be good, unless tls_device_down() is called
>>>> with rtnl_lock() held from netdevice notifier. But since netdevice notifier
>>>> are called with rtnl_lock() held, percpu_rwsem will increase the time rtnl_lock()
>>>> is locked.
>>> We use the a rwsem to allow multiple (readers) invocations of tls_set_device_offload, which is triggered by the user (persumably) during the TLS handshake. This might be considered a fast-path.
>>>
>>> However, we must block all calls to tls_set_device_offload while we are processing NETDEV_DOWN events (writer).
>>>
>>> As you've mentioned, the percpu rwsem is more efficient for readers, especially on NUMA systems, where cache-line bouncing occurs during reader acquire and reduces performance.
>>
>> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it doesn't
>> seem to be performance critical. Who else?
>>
> 
> It depends on whether you consider the TLS handshake code as critical.
> The readers are TCP connections processing the CCS message of the TLS handshake. They are providing key material to the kernel to start using Kernel TLS.

The thing is rtnl_lock() is critical for the rest of the system,
while TLS handshake is small subset of actions the system makes.

rtnl_lock() is used just almost everywhere, from netlink messages
to netdev ioctls.

Currently, you even just can't close raw socket without rtnl lock.
So, all of this is big reason to avoid doing rcu waitings under it.

Kirill

>>>>
>>>> Can't we use plain rwsem here instead?
>>>>
>>>
>>> Its a performance tradeoff. I'm not certain that the percpu rwsem write side acquire is significantly worse than using the global rwsem.
>>>
>>> For now, while all of this is experimental, can we agree to focus on the performance of readers? We can change it later if it becomes a problem.
>>
>> Same as above.
>>   
> 
> Replaced with rwsem from V2.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-03-22 13:04 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20  2:44 [PATCH net-next 00/14] TLS offload, netdev & MLX5 support Saeed Mahameed
2018-03-20  2:44 ` [PATCH net-next 01/14] tcp: Add clean acked data hook Saeed Mahameed
2018-03-20 20:36   ` Rao Shoaib
2018-03-21 11:21     ` Boris Pismenny
2018-03-21 16:16       ` Rao Shoaib
2018-03-21 16:32         ` David Miller
2018-03-20  2:44 ` [PATCH net-next 02/14] net: Rename and export copy_skb_header Saeed Mahameed
2018-03-20  2:44 ` [PATCH net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 04/14] net: Add TLS offload netdev ops Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 05/14] net: Add TLS TX offload features Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure Saeed Mahameed
2018-03-21 11:15   ` Kirill Tkhai
2018-03-21 15:53     ` Boris Pismenny
2018-03-21 16:31       ` Kirill Tkhai
2018-03-21 20:50         ` Saeed Mahameed
2018-03-22 12:38         ` Boris Pismenny
2018-03-22 13:03           ` Kirill Tkhai
2018-03-21 15:08   ` Dave Watson
2018-03-21 15:38     ` Boris Pismenny
2018-03-20  2:45 ` [PATCH net-next 07/14] net/tls: Support TLS device offload with IPv6 Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 08/14] net/mlx5e: Move defines out of ipsec code Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 12/14] net/mlx5e: TLS, Add error statistics Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 13/14] MAINTAINERS: Update mlx5 innova driver maintainers Saeed Mahameed
2018-03-20  2:45 ` [PATCH net-next 14/14] MAINTAINERS: Update TLS maintainers Saeed Mahameed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.