kernel-tls-handshake.lists.linux.dev archive mirror
* [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
@ 2023-03-21 12:43 Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 01/18] nvme-keyring: register '.nvme' keyring Hannes Reinecke
                   ` (18 more replies)
  0 siblings, 19 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Hi all,

finally I've managed to put all the pieces together and enable
in-kernel TLS support for NVMe-over-TCP.

The patchset is based on the TLS upcall mechanism from Chuck Lever
(cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
posted to the linux netdev list), and requires the 'tlshd' userspace
daemon (https://github.com/oracle/ktls-utils) for the actual TLS handshake.

Theory of operation:
A dedicated '.nvme' keyring is created to hold the pre-shared keys (PSKs)
for the TLS handshake. Keys have to be provisioned before a TLS
handshake is attempted; I'll be sending a patch for nvme-cli separately.
After connecting to the remote TCP port the client side checks whether
matching PSKs are present and calls the TLS userspace daemon to
initiate a TLS handshake.
The server side does a MSG_PEEK on the first PDU arriving after
accept() and checks whether it is an NVMe/TCP ICReq PDU. If not, it is
assumed to be a TLS ClientHello, and the TLS userspace daemon is
invoked for the TLS handshake.
If the TLS handshake succeeds the userspace daemon activates kTLS on
the socket and passes control back to the kernel.
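
To illustrate the provisioning step, here is a minimal userspace
sketch using add_key(2) from libkeyutils, run as root (the '.nvme'
keyring is owned by root). The keyring serial, the NQNs, and the key
bytes are placeholders; the identity string follows the
'NVMe0<R|G><hmac> <hostnqn> <subnqn>' format introduced in patch 02,
and the real tooling will live in nvme-cli:

  #include <stdio.h>
  #include <keyutils.h>

  int main(void)
  {
          /* serial of the '.nvme' keyring, e.g. looked up in /proc/keys */
          key_serial_t nvme_keyring = 0x12345678;  /* placeholder */
          /* placeholder key material; 32 bytes for a SHA-256 based PSK */
          unsigned char psk[32] = { 0 };
          /* 'R' = retained PSK, '01' = HMAC-SHA-256; placeholder NQNs */
          const char *identity = "NVMe0R01 nqn.host nqn.subsys";
          key_serial_t key;

          key = add_key("psk", identity, psk, sizeof(psk), nvme_keyring);
          if (key < 0) {
                  perror("add_key");
                  return 1;
          }
          printf("provisioned key %x\n", key);
          return 0;
  }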

The one issue of note is the handling of multiple identities.
The NVMe-TCP spec defines up to 4 PSK identities, and
TLS 1.3 allows several identities to be sent with the
ClientHello, so in theory we could send them all in one go.
Sadly none of the userspace TLS libraries implement this feature,
so we have to try each possible identity in turn and terminate
the connection on failure.
With this patchset all possible identities need to be present in
the keyring, and the client side tries to establish a TLS
connection with each matching PSK from the keyring (the search
order is listed below).
The beauty of this method is that it works without modification
to the existing nvme-cli; one only needs to provision PSKs in
the '.nvme' keyring and the TLS handshake will be attempted.
As I'm not sure whether that approach meets with general approval
I'm sending out this patchset as an RFC for now.
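
For reference, the client probes up to four PSK identities per
connection, in this order (cf. nvme_tcp_lookup_psks() in patch 08;
<hostnqn> and <subnqn> stand for the actual NQNs):

  NVMe0R02 <hostnqn> <subnqn>   retained PSK, HMAC-SHA-384
  NVMe0R01 <hostnqn> <subnqn>   retained PSK, HMAC-SHA-256
  NVMe0G02 <hostnqn> <subnqn>   generated PSK, HMAC-SHA-384
  NVMe0G01 <hostnqn> <subnqn>   generated PSK, HMAC-SHA-256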

As usual, comments and reviews are welcome.

Hannes Reinecke (18):
  nvme-keyring: register '.nvme' keyring
  nvme-keyring: define a 'psk' keytype
  nvme: add TCP TSAS definitions
  nvme-tcp: add definitions for TLS cipher suites
  nvme-tcp: implement recvmsg rx flow for TLS
  nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  nvme/tcp: allocate socket file
  nvme-tcp: enable TLS handshake upcall
  nvme-tcp: add connect option 'tls'
  nvme-tcp: fixup send workflow for kTLS
  nvme-tcp: control message handling for recvmsg()
  nvmet: make TCP sectype settable via configfs
  nvmet-tcp: allocate socket file
  security/keys: export key_lookup()
  nvmet-tcp: enable TLS handshake upcall
  nvmet-tcp: rework sendpage for kTLS
  nvmet-tcp: control messages for recvmsg()
  nvmet-tcp: peek icreq before starting TLS

 drivers/nvme/common/Makefile   |   2 +-
 drivers/nvme/common/keyring.c  | 132 ++++++++++
 drivers/nvme/host/core.c       |  10 +-
 drivers/nvme/host/fabrics.c    |   5 +
 drivers/nvme/host/fabrics.h    |   2 +
 drivers/nvme/host/tcp.c        | 450 +++++++++++++++++++++++++--------
 drivers/nvme/target/configfs.c |  65 +++++
 drivers/nvme/target/tcp.c      | 407 ++++++++++++++++++++++++++---
 include/linux/nvme-keyring.h   |  20 ++
 include/linux/nvme-tcp.h       |   6 +
 include/linux/nvme.h           |  10 +
 security/keys/key.c            |   1 +
 12 files changed, 965 insertions(+), 145 deletions(-)
 create mode 100644 drivers/nvme/common/keyring.c
 create mode 100644 include/linux/nvme-keyring.h

-- 
2.35.3



* [PATCH 01/18] nvme-keyring: register '.nvme' keyring
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 13:50   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 02/18] nvme-keyring: define a 'psk' keytype Hannes Reinecke
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP.
We need a separate keyring as for NVMe there might not be a userspace
process attached (eg during reconnect), and so using a session
keyring or any other process-related keyring might not be possible.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/common/Makefile  |  2 +-
 drivers/nvme/common/keyring.c | 36 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c      | 10 +++++++++-
 include/linux/nvme-keyring.h  | 12 ++++++++++++
 4 files changed, 58 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/common/keyring.c
 create mode 100644 include/linux/nvme-keyring.h

diff --git a/drivers/nvme/common/Makefile b/drivers/nvme/common/Makefile
index 720c625b8a52..c4e3b312d2cc 100644
--- a/drivers/nvme/common/Makefile
+++ b/drivers/nvme/common/Makefile
@@ -4,4 +4,4 @@ ccflags-y			+= -I$(src)
 
 obj-$(CONFIG_NVME_COMMON)	+= nvme-common.o
 
-nvme-common-y			+= auth.o
+nvme-common-y			+= auth.o keyring.o
diff --git a/drivers/nvme/common/keyring.c b/drivers/nvme/common/keyring.c
new file mode 100644
index 000000000000..3a6e8a0b38e2
--- /dev/null
+++ b/drivers/nvme/common/keyring.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Hannes Reinecke, SUSE Linux
+ */
+
+#include <linux/module.h>
+#include <linux/nvme.h>
+#include <linux/seq_file.h>
+#include <linux/key-type.h>
+#include <keys/user-type.h>
+
+static struct key *nvme_keyring;
+
+int nvme_keyring_init(void)
+{
+	int err;
+
+	nvme_keyring = keyring_alloc(".nvme",
+				     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+				     current_cred(),
+				     (KEY_POS_ALL & ~KEY_POS_SETATTR) |
+				     (KEY_USR_ALL & ~KEY_USR_SETATTR),
+				     KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
+	if (IS_ERR(nvme_keyring))
+		return PTR_ERR(nvme_keyring);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_keyring_init);
+
+void nvme_keyring_exit(void)
+{
+	key_revoke(nvme_keyring);
+	key_put(nvme_keyring);
+}
+EXPORT_SYMBOL_GPL(nvme_keyring_exit);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d4be525f8100..839bc7587f54 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -25,6 +25,7 @@
 #include "nvme.h"
 #include "fabrics.h"
 #include <linux/nvme-auth.h>
+#include <linux/nvme-keyring.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -5415,11 +5416,17 @@ static int __init nvme_core_init(void)
 		goto unregister_generic_ns;
 	}
 
-	result = nvme_init_auth();
+	result = nvme_keyring_init();
 	if (result)
 		goto destroy_ns_chr;
+
+	result = nvme_init_auth();
+	if (result)
+		goto keyring_exit;
 	return 0;
 
+keyring_exit:
+	nvme_keyring_exit();
 destroy_ns_chr:
 	class_destroy(nvme_ns_chr_class);
 unregister_generic_ns:
@@ -5443,6 +5450,7 @@ static int __init nvme_core_init(void)
 static void __exit nvme_core_exit(void)
 {
 	nvme_exit_auth();
+	nvme_keyring_exit();
 	class_destroy(nvme_ns_chr_class);
 	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
diff --git a/include/linux/nvme-keyring.h b/include/linux/nvme-keyring.h
new file mode 100644
index 000000000000..a875c06cc922
--- /dev/null
+++ b/include/linux/nvme-keyring.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2021 Hannes Reinecke, SUSE Software Solutions
+ */
+
+#ifndef _NVME_KEYRING_H
+#define _NVME_KEYRING_H
+
+int nvme_keyring_init(void);
+void nvme_keyring_exit(void);
+
+#endif
-- 
2.35.3



* [PATCH 02/18] nvme-keyring: define a 'psk' keytype
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 01/18] nvme-keyring: register '.nvme' keyring Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22  8:29   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 03/18] nvme: add TCP TSAS definitions Hannes Reinecke
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Define a 'psk' keytype to hold the NVMe TLS PSKs.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/common/keyring.c | 96 +++++++++++++++++++++++++++++++++++
 include/linux/nvme-keyring.h  |  8 +++
 2 files changed, 104 insertions(+)

diff --git a/drivers/nvme/common/keyring.c b/drivers/nvme/common/keyring.c
index 3a6e8a0b38e2..6cbb9d66e0f6 100644
--- a/drivers/nvme/common/keyring.c
+++ b/drivers/nvme/common/keyring.c
@@ -11,6 +11,96 @@
 
 static struct key *nvme_keyring;
 
+key_serial_t nvme_keyring_id(void)
+{
+	return nvme_keyring->serial;
+}
+EXPORT_SYMBOL_GPL(nvme_keyring_id);
+
+static void nvme_tls_psk_describe(const struct key *key, struct seq_file *m)
+{
+	seq_puts(m, key->description);
+	seq_printf(m, ": %u", key->datalen);
+}
+
+static bool nvme_tls_psk_match(const struct key *key,
+			       const struct key_match_data *match_data)
+{
+	const char *match_id;
+	size_t match_len;
+
+	if (!key->description) {
+		pr_debug("%s: no key description\n", __func__);
+		return false;
+	}
+	match_len = strlen(key->description);
+	pr_debug("%s: id %s len %zd\n", __func__, key->description, match_len);
+
+	if (!match_data->raw_data) {
+		pr_debug("%s: no match data\n", __func__);
+		return false;
+	}
+	match_id = match_data->raw_data;
+	pr_debug("%s: match '%s' '%s' len %lu\n",
+		 __func__, match_id, key->description, match_len);
+	return !memcmp(key->description, match_id, match_len);
+}
+
+static int nvme_tls_psk_match_preparse(struct key_match_data *match_data)
+{
+	match_data->lookup_type = KEYRING_SEARCH_LOOKUP_ITERATE;
+	match_data->cmp = nvme_tls_psk_match;
+	return 0;
+}
+
+static struct key_type nvme_tls_psk_key_type = {
+	.name           = "psk",
+	.flags          = KEY_TYPE_NET_DOMAIN,
+	.preparse       = user_preparse,
+	.free_preparse  = user_free_preparse,
+	.match_preparse = nvme_tls_psk_match_preparse,
+	.instantiate    = generic_key_instantiate,
+	.revoke         = user_revoke,
+	.destroy        = user_destroy,
+	.describe       = nvme_tls_psk_describe,
+	.read           = user_read,
+};
+
+struct key *nvme_tls_psk_lookup(key_ref_t keyring,
+		const char *hostnqn, const char *subnqn,
+		int hmac, bool generated)
+{
+	char *identity;
+	size_t identity_len = (NVMF_NQN_SIZE) * 2 + 11;
+	key_ref_t keyref;
+	key_serial_t keyring_id;
+
+	identity = kzalloc(identity_len, GFP_KERNEL);
+	if (!identity)
+		return ERR_PTR(-ENOMEM);
+
+	snprintf(identity, identity_len, "NVMe0%c%02d %s %s",
+		 generated ? 'G' : 'R', hmac, hostnqn, subnqn);
+
+	if (!keyring)
+		keyring = make_key_ref(nvme_keyring, true);
+	keyring_id = key_serial(key_ref_to_ptr(keyring));
+	pr_debug("keyring %x lookup tls psk '%s'\n",
+		 keyring_id, identity);
+	keyref = keyring_search(keyring, &nvme_tls_psk_key_type,
+				identity, false);
+	if (IS_ERR(keyref)) {
+		pr_debug("lookup tls psk '%s' failed, error %ld\n",
+			 identity, PTR_ERR(keyref));
+		kfree(identity);
+		return ERR_PTR(-ENOKEY);
+	}
+	kfree(identity);
+
+	return key_ref_to_ptr(keyref);
+}
+EXPORT_SYMBOL_GPL(nvme_tls_psk_lookup);
+
 int nvme_keyring_init(void)
 {
 	int err;
@@ -24,12 +114,18 @@ int nvme_keyring_init(void)
 	if (IS_ERR(nvme_keyring))
 		return PTR_ERR(nvme_keyring);
 
+	err = register_key_type(&nvme_tls_psk_key_type);
+	if (err) {
+		key_put(nvme_keyring);
+		return err;
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(nvme_keyring_init);
 
 void nvme_keyring_exit(void)
 {
+	unregister_key_type(&nvme_tls_psk_key_type);
 	key_revoke(nvme_keyring);
 	key_put(nvme_keyring);
 }
diff --git a/include/linux/nvme-keyring.h b/include/linux/nvme-keyring.h
index a875c06cc922..c0c3d934f474 100644
--- a/include/linux/nvme-keyring.h
+++ b/include/linux/nvme-keyring.h
@@ -6,6 +6,14 @@
 #ifndef _NVME_KEYRING_H
 #define _NVME_KEYRING_H
 
+#include <linux/key.h>
+
+struct key *nvme_tls_psk_lookup(key_ref_t keyring,
+				const char *hostnqn, const char *subnqn,
+				int hmac, bool generated);
+
+key_serial_t nvme_keyring_id(void);
+
 int nvme_keyring_init(void);
 void nvme_keyring_exit(void);
 
-- 
2.35.3



* [PATCH 03/18] nvme: add TCP TSAS definitions
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 01/18] nvme-keyring: register '.nvme' keyring Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 02/18] nvme-keyring: define a 'psk' keytype Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 13:46   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites Hannes Reinecke
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/nvme.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 779507ac750b..ea961ca2022d 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -108,6 +108,13 @@ enum {
 	NVMF_RDMA_CMS_RDMA_CM	= 1, /* Sockets based endpoint addressing */
 };
 
+/* TSAS SECTYPE for TCP transport */
+enum {
+	NVMF_TCP_SECTYPE_NONE = 0, /* No Security */
+	NVMF_TCP_SECTYPE_TLS12 = 1, /* TLSv1.2, NVMe-oF 1.1 and NVMe-TCP 3.6.1.1 */
+	NVMF_TCP_SECTYPE_TLS13 = 2, /* TLSv1.3, NVMe-oF 1.1 and NVMe-TCP 3.6.1.1 */
+};
+
 #define NVME_AQ_DEPTH		32
 #define NVME_NR_AEN_COMMANDS	1
 #define NVME_AQ_BLK_MQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AEN_COMMANDS)
@@ -1458,6 +1465,9 @@ struct nvmf_disc_rsp_page_entry {
 			__u16	pkey;
 			__u8	resv10[246];
 		} rdma;
+		struct tcp {
+			__u8	sectype;
+		} tcp;
 	} tsas;
 };
 
-- 
2.35.3



* [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (2 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 03/18] nvme: add TCP TSAS definitions Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22  8:18   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS Hannes Reinecke
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/nvme-tcp.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
index 75470159a194..19ca863813f1 100644
--- a/include/linux/nvme-tcp.h
+++ b/include/linux/nvme-tcp.h
@@ -18,6 +18,12 @@ enum nvme_tcp_pfv {
 	NVME_TCP_PFV_1_0 = 0x0,
 };
 
+enum nvme_tcp_tls_cipher {
+	NVME_TCP_TLS_CIPHER_INVALID     = 0,
+	NVME_TCP_TLS_CIPHER_SHA256      = 1,
+	NVME_TCP_TLS_CIPHER_SHA384      = 2,
+};
+
 enum nvme_tcp_fatal_error_status {
 	NVME_TCP_FES_INVALID_PDU_HDR		= 0x01,
 	NVME_TCP_FES_PDU_SEQ_ERR		= 0x02,
-- 
2.35.3



* [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (3 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 13:39   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready() Hannes Reinecke
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

TLS offload only implements recvmsg(), so implement the receive
side using recvmsg().

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 156 ++++++++++++++++++++--------------------
 1 file changed, 77 insertions(+), 79 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 42c0598c31f2..0e14b1b90855 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -529,7 +529,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
 	queue->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
 				nvme_tcp_hdgst_len(queue);
 	queue->pdu_offset = 0;
-	queue->data_remaining = -1;
+	queue->data_remaining = 0;
 	queue->ddgst_remaining = 0;
 }
 
@@ -707,25 +707,32 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	return 0;
 }
 
-static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
-		unsigned int *offset, size_t *len)
+static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
 {
 	struct nvme_tcp_hdr *hdr;
-	char *pdu = queue->pdu;
-	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
+	size_t rcv_len = queue->pdu_remaining;
+	struct msghdr msg = {
+		.msg_flags = pending ? 0 : MSG_DONTWAIT,
+	};
+	struct kvec iov = {
+		.iov_base = (u8 *)queue->pdu + queue->pdu_offset,
+		.iov_len = rcv_len,
+	};
 	int ret;
 
-	ret = skb_copy_bits(skb, *offset,
-		&pdu[queue->pdu_offset], rcv_len);
-	if (unlikely(ret))
+	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_PDU)
+		return 0;
+
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			     iov.iov_len, msg.msg_flags);
+	if (ret <= 0)
 		return ret;
 
+	rcv_len = ret;
 	queue->pdu_remaining -= rcv_len;
 	queue->pdu_offset += rcv_len;
-	*offset += rcv_len;
-	*len -= rcv_len;
 	if (queue->pdu_remaining)
-		return 0;
+		return queue->pdu_remaining;
 
 	hdr = queue->pdu;
 	if (queue->hdr_digest) {
@@ -734,7 +741,6 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 			return ret;
 	}
 
-
 	if (queue->data_digest) {
 		ret = nvme_tcp_check_ddgst(queue, queue->pdu);
 		if (unlikely(ret))
@@ -765,19 +771,21 @@ static inline void nvme_tcp_end_request(struct request *rq, u16 status)
 		nvme_complete_rq(rq);
 }
 
-static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
-			      unsigned int *offset, size_t *len)
+static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
 {
 	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 	struct request *rq =
 		nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
 
+	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
+		return 0;
+
 	while (true) {
-		int recv_len, ret;
+		struct msghdr msg;
+		int ret;
 
-		recv_len = min_t(size_t, *len, queue->data_remaining);
-		if (!recv_len)
+		if (!queue->data_remaining)
 			break;
 
 		if (!iov_iter_count(&req->iter)) {
@@ -798,25 +806,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		}
 
 		/* we can read only from what is left in this bio */
-		recv_len = min_t(size_t, recv_len,
-				iov_iter_count(&req->iter));
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_iter = req->iter;
 
-		if (queue->data_digest)
-			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
-				&req->iter, recv_len, queue->rcv_hash);
-		else
-			ret = skb_copy_datagram_iter(skb, *offset,
-					&req->iter, recv_len);
-		if (ret) {
+		ret = sock_recvmsg(queue->sock, &msg, 0);
+		if (ret <= 0) {
 			dev_err(queue->ctrl->ctrl.device,
-				"queue %d failed to copy request %#x data",
+				"queue %d failed to receive request %#x data",
 				nvme_tcp_queue_id(queue), rq->tag);
 			return ret;
 		}
 
-		*len -= recv_len;
-		*offset += recv_len;
-		queue->data_remaining -= recv_len;
+		queue->data_remaining -= ret;
+		if (queue->data_remaining)
+			nvme_tcp_advance_req(req, ret);
 	}
 
 	if (!queue->data_remaining) {
@@ -833,27 +836,36 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		}
 	}
 
-	return 0;
+	return queue->data_remaining;
 }
 
-static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
-		struct sk_buff *skb, unsigned int *offset, size_t *len)
+static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
 {
 	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
 	char *ddgst = (char *)&queue->recv_ddgst;
-	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
+	size_t recv_len = queue->ddgst_remaining;
 	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
+	struct msghdr msg = {
+		.msg_flags = 0,
+	};
+	struct kvec iov = {
+		.iov_base = (u8 *)ddgst + off,
+		.iov_len = recv_len,
+	};
 	int ret;
 
-	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
-	if (unlikely(ret))
+	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DDGST)
+		return 0;
+
+	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
+			     msg.msg_flags);
+	if (ret <= 0)
 		return ret;
 
+	recv_len = ret;
 	queue->ddgst_remaining -= recv_len;
-	*offset += recv_len;
-	*len -= recv_len;
 	if (queue->ddgst_remaining)
-		return 0;
+		return queue->ddgst_remaining;
 
 	if (queue->recv_ddgst != queue->exp_ddgst) {
 		struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
@@ -881,37 +893,41 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	return 0;
 }
 
-static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
-			     unsigned int offset, size_t len)
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, bool pending)
 {
-	struct nvme_tcp_queue *queue = desc->arg.data;
-	size_t consumed = len;
 	int result;
+	int nr_cqe = queue->nr_cqe;
 
-	while (len) {
+	do {
 		switch (nvme_tcp_recv_state(queue)) {
 		case NVME_TCP_RECV_PDU:
-			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
-			break;
+			result = nvme_tcp_recv_pdu(queue, pending);
+			if (result)
+				break;
+			fallthrough;
 		case NVME_TCP_RECV_DATA:
-			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
-			break;
+			result = nvme_tcp_recv_data(queue);
+			if (result)
+				break;
+			fallthrough;
 		case NVME_TCP_RECV_DDGST:
-			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
+			result = nvme_tcp_recv_ddgst(queue);
 			break;
 		default:
 			result = -EFAULT;
 		}
-		if (result) {
-			dev_err(queue->ctrl->ctrl.device,
-				"receive failed:  %d\n", result);
-			queue->rd_enabled = false;
-			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
-			return result;
-		}
+		if (nr_cqe != queue->nr_cqe)
+			break;
+	} while (result >= 0);
+	if (result < 0 && result != -EAGAIN) {
+		dev_err(queue->ctrl->ctrl.device,
+			"receive failed: %d state %d %s\n",
+			result, nvme_tcp_recv_state(queue),
+			pending ? "pending" : "");
+		queue->rd_enabled = false;
+		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
 	}
-
-	return consumed;
+	return result < 0 ? result : (queue->nr_cqe - nr_cqe);
 }
 
 static void nvme_tcp_data_ready(struct sock *sk)
@@ -1203,22 +1219,6 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	return ret;
 }
 
-static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
-{
-	struct socket *sock = queue->sock;
-	struct sock *sk = sock->sk;
-	read_descriptor_t rd_desc;
-	int consumed;
-
-	rd_desc.arg.data = queue;
-	rd_desc.count = 1;
-	lock_sock(sk);
-	queue->nr_cqe = 0;
-	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
-	release_sock(sk);
-	return consumed;
-}
-
 static void nvme_tcp_io_work(struct work_struct *w)
 {
 	struct nvme_tcp_queue *queue =
@@ -1232,13 +1232,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		if (mutex_trylock(&queue->send_mutex)) {
 			result = nvme_tcp_try_send(queue);
 			mutex_unlock(&queue->send_mutex);
-			if (result > 0)
-				pending = true;
-			else if (unlikely(result < 0))
+			if (unlikely(result < 0))
 				break;
 		}
 
-		result = nvme_tcp_try_recv(queue);
+		result = nvme_tcp_try_recv(queue, pending);
 		if (result > 0)
 			pending = true;
 		else if (unlikely(result < 0))
@@ -2491,7 +2489,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
 		sk_busy_loop(sk, true);
-	nvme_tcp_try_recv(queue);
+	nvme_tcp_try_recv(queue, false);
 	clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	return queue->nr_cqe;
 }
-- 
2.35.3



* [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (4 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 13:44   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 07/18] nvme/tcp: allocate socket file Hannes Reinecke
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Call the original data_ready() callback in nvme_tcp_data_ready()
to avoid a receive stall.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0e14b1b90855..0512eb289dcf 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -936,12 +936,14 @@ static void nvme_tcp_data_ready(struct sock *sk)
 
 	trace_sk_data_ready(sk);
 
-	read_lock_bh(&sk->sk_callback_lock);
-	queue = sk->sk_user_data;
+	rcu_read_lock_bh();
+	queue = rcu_dereference_sk_user_data(sk);
+	if (queue && queue->data_ready)
+		queue->data_ready(sk);
 	if (likely(queue && queue->rd_enabled) &&
 	    !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
-	read_unlock_bh(&sk->sk_callback_lock);
+	rcu_read_unlock_bh();
 }
 
 static void nvme_tcp_write_space(struct sock *sk)
-- 
2.35.3



* [PATCH 07/18] nvme/tcp: allocate socket file
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (5 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready() Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 13:52   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 08/18] nvme-tcp: enable TLS handshake upcall Hannes Reinecke
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

When using the TLS upcall we need to allocate a socket file such
that the userspace daemon is able to use the socket.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0512eb289dcf..0438d42f4179 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -115,6 +115,7 @@ enum nvme_tcp_recv_state {
 struct nvme_tcp_ctrl;
 struct nvme_tcp_queue {
 	struct socket		*sock;
+	struct file		*sock_file;
 	struct work_struct	io_work;
 	int			io_cpu;
 
@@ -1330,7 +1331,12 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	}
 
 	noreclaim_flag = memalloc_noreclaim_save();
-	sock_release(queue->sock);
+	if (queue->sock_file) {
+		fput(queue->sock_file);
+		queue->sock_file = NULL;
+		/* ->sock will be released by fput() */
+	} else
+		sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
 
 	kfree(queue->pdu);
@@ -1526,6 +1532,12 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 		goto err_destroy_mutex;
 	}
 
+	queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, NULL);
+	if (IS_ERR(queue->sock_file)) {
+		ret = PTR_ERR(queue->sock_file);
+		queue->sock_file = NULL;
+		goto err_sock;
+	}
 	nvme_tcp_reclassify_socket(queue->sock);
 
 	/* Single syn retry */
@@ -1647,7 +1659,12 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 err_sock:
-	sock_release(queue->sock);
+	if (queue->sock_file) {
+		fput(queue->sock_file);
+		queue->sock_file = NULL;
+		/* ->sock will be released by fput() */
+	} else
+		sock_release(queue->sock);
 	queue->sock = NULL;
 err_destroy_mutex:
 	mutex_destroy(&queue->send_mutex);
-- 
2.35.3



* [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (6 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 07/18] nvme/tcp: allocate socket file Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22  8:45   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 09/18] nvme-tcp: add connect option 'tls' Hannes Reinecke
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Select possible PSK identities and call the TLS handshake upcall
for each identity.
The TLS 1.3 RFC allows sending multiple identities with each
ClientHello message, but none of the SSL libraries implement this.
As the connection is already established by the time the TLS
association is created, we send only a single identity per upcall
and close the connection to restart with the next identity if the
handshake fails.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 157 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 148 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0438d42f4179..bcf24e9a08e1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -8,9 +8,12 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/key.h>
 #include <linux/nvme-tcp.h>
+#include <linux/nvme-keyring.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/handshake.h>
 #include <linux/blk-mq.h>
 #include <crypto/hash.h>
 #include <net/busy_poll.h>
@@ -31,6 +34,14 @@ static int so_priority;
 module_param(so_priority, int, 0644);
 MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
 
+/*
+ * TLS handshake timeout
+ */
+static int tls_handshake_timeout = 10;
+module_param(tls_handshake_timeout, int, 0644);
+MODULE_PARM_DESC(tls_handshake_timeout,
+		 "nvme TLS handshake timeout in seconds (default 10)");
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /* lockdep can detect a circular dependency of the form
  *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
@@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
 	NVME_TCP_Q_ALLOCATED	= 0,
 	NVME_TCP_Q_LIVE		= 1,
 	NVME_TCP_Q_POLLING	= 2,
+	NVME_TCP_Q_TLS		= 3,
 };
 
 enum nvme_tcp_recv_state {
@@ -148,6 +160,9 @@ struct nvme_tcp_queue {
 	__le32			exp_ddgst;
 	__le32			recv_ddgst;
 
+	struct completion       *tls_complete;
+	int                     tls_err;
+
 	struct page_frag_cache	pf_cache;
 
 	void (*state_change)(struct sock *);
@@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
 	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
 }
 
-static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
+/*
+ * nvme_tcp_lookup_psks - Look up PSKs to use for TLS
+ *
+ */
+static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
+			       key_serial_t *keylist, int num_keys)
+{
+	enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
+	struct key *tls_key;
+	int num = 0;
+	bool generated = false;
+
+	/* Check for pre-provisioned keys; retained keys first */
+	do {
+		tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
+					      nctrl->opts->subsysnqn,
+					      cipher, generated);
+		if (!IS_ERR(tls_key)) {
+			keylist[num] = tls_key->serial;
+			num++;
+			key_put(tls_key);
+		}
+		if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
+			cipher = NVME_TCP_TLS_CIPHER_SHA256;
+		else {
+			if (generated)
+				cipher = NVME_TCP_TLS_CIPHER_INVALID;
+			else {
+				cipher = NVME_TCP_TLS_CIPHER_SHA384;
+				generated = true;
+			}
+		}
+	} while (cipher != NVME_TCP_TLS_CIPHER_INVALID);
+	return num;
+}
+
+static void nvme_tcp_tls_done(void *data, int status, key_serial_t peerid)
+{
+	struct nvme_tcp_queue *queue = data;
+	struct nvme_tcp_ctrl *ctrl = queue->ctrl;
+	int qid = nvme_tcp_queue_id(queue);
+
+	dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key %x, status %d\n",
+		qid, peerid, status);
+
+	queue->tls_err = -status;
+	if (queue->tls_complete)
+		complete(queue->tls_complete);
+}
+
+static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
+			      struct nvme_tcp_queue *queue,
+			      key_serial_t peerid)
+{
+	int qid = nvme_tcp_queue_id(queue);
+	int ret;
+	struct tls_handshake_args args;
+	unsigned long tmo = tls_handshake_timeout * HZ;
+	DECLARE_COMPLETION_ONSTACK(tls_complete);
+
+	dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
+		qid, peerid);
+	args.ta_sock = queue->sock;
+	args.ta_done = nvme_tcp_tls_done;
+	args.ta_data = queue;
+	args.ta_my_peerids[0] = peerid;
+	args.ta_num_peerids = 1;
+	args.ta_keyring = nvme_keyring_id();
+	args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
+	queue->tls_err = -EOPNOTSUPP;
+	queue->tls_complete = &tls_complete;
+	ret = tls_client_hello_psk(&args, GFP_KERNEL);
+	if (ret) {
+		dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
+			qid, ret);
+		return ret;
+	}
+	if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
+		dev_dbg(nctrl->device,
+			"queue %d: TLS handshake timeout\n", qid);
+		queue->tls_complete = NULL;
+		ret = -ETIMEDOUT;
+	} else {
+		dev_dbg(nctrl->device,
+			"queue %d: TLS handshake complete, error %d\n",
+			qid, queue->tls_err);
+		ret = queue->tls_err;
+	}
+	queue->tls_complete = NULL;
+	if (!ret)
+		set_bit(NVME_TCP_Q_TLS, &queue->flags);
+	return ret;
+}
+
+static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
+				key_serial_t peerid)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
@@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 		goto err_rcv_pdu;
 	}
 
+	/* If PSKs are configured try to start TLS */
+	if (peerid) {
+		ret = nvme_tcp_start_tls(nctrl, queue, peerid);
+		if (ret)
+			goto err_init_connect;
+	}
+
 	ret = nvme_tcp_init_connection(queue);
 	if (ret)
 		goto err_init_connect;
@@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
 
 static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 {
-	int ret;
+	int ret = -EINVAL, num_keys, k;
+	key_serial_t keylist[4];
 
-	ret = nvme_tcp_alloc_queue(ctrl, 0);
-	if (ret)
-		return ret;
+	memset(keylist, 0, sizeof(keylist));
+	num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
+	for (k = 0; k < num_keys; k++) {
+		ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
+		if (!ret)
+			break;
+	}
+	if (ret) {
+		/* Try without TLS */
+		ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
+		if (ret)
+			goto out_free_queue;
+	}
 
 	ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
 	if (ret)
@@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 
 static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 {
-	int i, ret;
+	int i, ret, num_keys = 0, k;
+	key_serial_t keylist[4];
 
+	memset(keylist, 0, sizeof(keylist));
+	num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
 	for (i = 1; i < ctrl->queue_count; i++) {
-		ret = nvme_tcp_alloc_queue(ctrl, i);
-		if (ret)
-			goto out_free_queues;
+		ret = -EINVAL;
+		for (k = 0; k < num_keys; k++) {
+			ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
+			if (!ret)
+				break;
+		}
+		if (ret) {
+			ret = nvme_tcp_alloc_queue(ctrl, i, 0);
+			if (ret)
+				goto out_free_queues;
+		}
 	}
 
 	return 0;
-- 
2.35.3



* [PATCH 09/18] nvme-tcp: add connect option 'tls'
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (7 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 08/18] nvme-tcp: enable TLS handshake upcall Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22  9:24   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS Hannes Reinecke
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Add a connect option 'tls' to request TLS 1.3 in-band encryption, and
abort the connection attempt if TLS cannot be established.
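
For example (the address and NQNs are placeholders, and matching PSKs
are assumed to be provisioned in the '.nvme' keyring), TLS can be
requested by adding 'tls' to the option string written to the fabrics
device:

  # echo "transport=tcp,traddr=192.168.0.10,trsvcid=4420,nqn=nqn.subsys,hostnqn=nqn.host,tls" > /dev/nvme-fabrics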

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/fabrics.c | 5 +++++
 drivers/nvme/host/fabrics.h | 2 ++
 drivers/nvme/host/tcp.c     | 7 ++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index bbaa04a0c502..fdff7cdff029 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -609,6 +609,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_DISCOVERY,		"discovery"		},
 	{ NVMF_OPT_DHCHAP_SECRET,	"dhchap_secret=%s"	},
 	{ NVMF_OPT_DHCHAP_CTRL_SECRET,	"dhchap_ctrl_secret=%s"	},
+	{ NVMF_OPT_TLS,			"tls"			},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -632,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
+	opts->tls = false;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
@@ -918,6 +920,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 			kfree(opts->dhchap_ctrl_secret);
 			opts->dhchap_ctrl_secret = p;
 			break;
+		case NVMF_OPT_TLS:
+			opts->tls = true;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index dcac3df8a5f7..c4538a9d437c 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -70,6 +70,7 @@ enum {
 	NVMF_OPT_DISCOVERY	= 1 << 22,
 	NVMF_OPT_DHCHAP_SECRET	= 1 << 23,
 	NVMF_OPT_DHCHAP_CTRL_SECRET = 1 << 24,
+	NVMF_OPT_TLS		= 1 << 25,
 };
 
 /**
@@ -128,6 +129,7 @@ struct nvmf_ctrl_options {
 	int			max_reconnects;
 	char			*dhchap_secret;
 	char			*dhchap_ctrl_secret;
+	bool			tls;
 	bool			disable_sqflow;
 	bool			hdr_digest;
 	bool			data_digest;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bcf24e9a08e1..bbff1f52a167 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1902,6 +1902,9 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 			break;
 	}
 	if (ret) {
+		/* Abort if TLS is requested */
+		if (num_keys && ctrl->opts->tls)
+			goto out_free_queue;
 		/* Try without TLS */
 		ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
 		if (ret)
@@ -1934,6 +1937,8 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 				break;
 		}
 		if (ret) {
+			if (num_keys && ctrl->opts->tls)
+				goto out_free_queues;
 			ret = nvme_tcp_alloc_queue(ctrl, i, 0);
 			if (ret)
 				goto out_free_queues;
@@ -2844,7 +2849,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
 			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
 			  NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
-			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+			  NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE | NVMF_OPT_TLS,
 	.create_ctrl	= nvme_tcp_create_ctrl,
 };
 
-- 
2.35.3



* [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (8 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 09/18] nvme-tcp: add connect option 'tls' Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22  9:31   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 11/18] nvme-tcp: control message handling for recvmsg() Hannes Reinecke
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

kTLS does not support the MSG_EOR flag for sendmsg(), and the
->sendpage() call really doesn't bring any benefit as the data has
to be copied anyway.
So use sock_no_sendpage() or sendmsg() instead, and ensure that the
MSG_EOR flag is cleared for kTLS.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bbff1f52a167..007d457cacf9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1034,13 +1034,19 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		bool last = nvme_tcp_pdu_last_send(req, len);
 		int req_data_sent = req->data_sent;
 		int ret, flags = MSG_DONTWAIT;
+		bool do_sendpage = sendpage_ok(page);
 
-		if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
+		if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
+			flags |= MSG_MORE;
+		else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
 			flags |= MSG_EOR;
-		else
-			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
 
-		if (sendpage_ok(page)) {
+		if (test_bit(NVME_TCP_Q_TLS, &queue->flags))
+			do_sendpage = false;
+
+		if (do_sendpage) {
+			if (flags & MSG_MORE)
+				flags |= MSG_SENDPAGE_NOTLAST;
 			ret = kernel_sendpage(queue->sock, page, offset, len,
 					flags);
 		} else {
@@ -1088,19 +1094,22 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	bool inline_data = nvme_tcp_has_inline_data(req);
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) + hdgst - req->offset;
-	int flags = MSG_DONTWAIT;
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	struct kvec iov = {
+		.iov_base = (u8 *)req->pdu + req->offset,
+		.iov_len = len,
+	};
 	int ret;
 
 	if (inline_data || nvme_tcp_queue_more(queue))
-		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-	else
-		flags |= MSG_EOR;
+		msg.msg_flags |= MSG_MORE;
+	else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
+		msg.msg_flags |= MSG_EOR;
 
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
-	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
-			offset_in_page(pdu) + req->offset, len,  flags);
+	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
 	if (unlikely(ret <= 0))
 		return ret;
 
@@ -1131,7 +1140,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
-	if (!req->h2cdata_left)
+	if (!test_bit(NVME_TCP_Q_TLS, &queue->flags) && !req->h2cdata_left)
 		ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
 				offset_in_page(pdu) + req->offset, len,
 				MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
@@ -1168,7 +1177,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
 
 	if (nvme_tcp_queue_more(queue))
 		msg.msg_flags |= MSG_MORE;
-	else
+	else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
 		msg.msg_flags |= MSG_EOR;
 
 	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
-- 
2.35.3



* [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (9 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 11:33   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 12/18] nvmet: make TCP sectype settable via configfs Hannes Reinecke
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

kTLS delivers TLS ALERT messages as control messages for recvmsg().
As we can't do anything sensible with them, just abort the connection
and let the userspace agent do a re-negotiation.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 68 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 007d457cacf9..e0fc98ac9e05 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -13,6 +13,7 @@
 #include <linux/nvme-keyring.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/tls.h>
 #include <net/handshake.h>
 #include <linux/blk-mq.h>
 #include <crypto/hash.h>
@@ -727,7 +728,12 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
 {
 	struct nvme_tcp_hdr *hdr;
 	size_t rcv_len = queue->pdu_remaining;
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	struct cmsghdr *cmsg;
+	unsigned char ctype;
 	struct msghdr msg = {
+		.msg_control = cbuf,
+		.msg_controllen = sizeof(cbuf),
 		.msg_flags = pending ? 0 : MSG_DONTWAIT,
 	};
 	struct kvec iov = {
@@ -743,6 +749,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
 			     iov.iov_len, msg.msg_flags);
 	if (ret <= 0)
 		return ret;
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			dev_err(queue->ctrl->ctrl.device,
+				"queue %d unhandled TLS record %d\n",
+				nvme_tcp_queue_id(queue), ctype);
+			return -ENOTCONN;
+		}
+	}
 
 	rcv_len = ret;
 	queue->pdu_remaining -= rcv_len;
@@ -793,6 +811,9 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
 	struct request *rq =
 		nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	struct cmsghdr *cmsg;
+	unsigned char ctype;
 
 	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
 		return 0;
@@ -824,6 +845,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
 		/* we can read only from what is left in this bio */
 		memset(&msg, 0, sizeof(msg));
 		msg.msg_iter = req->iter;
+		msg.msg_control = cbuf;
+		msg.msg_controllen = sizeof(cbuf);
 
 		ret = sock_recvmsg(queue->sock, &msg, 0);
 		if (ret <= 0) {
@@ -832,6 +855,18 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
 				nvme_tcp_queue_id(queue), rq->tag);
 			return ret;
 		}
+		cmsg = (struct cmsghdr *)cbuf;
+		if (CMSG_OK(&msg, cmsg) &&
+		    cmsg->cmsg_level == SOL_TLS &&
+		    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+			ctype = *((unsigned char *)CMSG_DATA(cmsg));
+			if (ctype != TLS_RECORD_TYPE_DATA) {
+				dev_err(queue->ctrl->ctrl.device,
+					"queue %d unhandled TLS record %d\n",
+					nvme_tcp_queue_id(queue), ctype);
+				return -ENOTCONN;
+			}
+		}
 
 		queue->data_remaining -= ret;
 		if (queue->data_remaining)
@@ -861,7 +896,12 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
 	char *ddgst = (char *)&queue->recv_ddgst;
 	size_t recv_len = queue->ddgst_remaining;
 	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	struct cmsghdr *cmsg;
+	unsigned char ctype;
 	struct msghdr msg = {
+		.msg_control = cbuf,
+		.msg_controllen = sizeof(cbuf),
 		.msg_flags = 0,
 	};
 	struct kvec iov = {
@@ -877,6 +917,18 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
 			     msg.msg_flags);
 	if (ret <= 0)
 		return ret;
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			dev_err(queue->ctrl->ctrl.device,
+				"queue %d unhandled TLS record %d\n",
+				nvme_tcp_queue_id(queue), ctype);
+			return -ENOTCONN;
+		}
+	}
 
 	recv_len = ret;
 	queue->ddgst_remaining -= recv_len;
@@ -1372,6 +1424,9 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 {
 	struct nvme_tcp_icreq_pdu *icreq;
 	struct nvme_tcp_icresp_pdu *icresp;
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	struct cmsghdr *cmsg;
+	unsigned char ctype;
 	struct msghdr msg = {};
 	struct kvec iov;
 	bool ctrl_hdgst, ctrl_ddgst;
@@ -1409,10 +1464,23 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
 	memset(&msg, 0, sizeof(msg));
 	iov.iov_base = icresp;
 	iov.iov_len = sizeof(*icresp);
+	msg.msg_control = cbuf;
+	msg.msg_controllen = sizeof(cbuf);
 	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
 			iov.iov_len, msg.msg_flags);
 	if (ret < 0)
 		goto free_icresp;
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			pr_err("queue %d: unhandled TLS record %d\n",
+			       nvme_tcp_queue_id(queue), ctype);
+			return -ENOTCONN;
+		}
+	}
 
 	ret = -EINVAL;
 	if (icresp->hdr.type != nvme_tcp_icresp) {
-- 
2.35.3



* [PATCH 12/18] nvmet: make TCP sectype settable via configfs
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (10 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 11/18] nvme-tcp: control message handling for recvmsg() Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 11:38   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 13/18] nvmet-tcp: allocate socket file Hannes Reinecke
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Add a new configfs attribute 'addr_tsas' to make the TCP sectype
settable via configfs.
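
With this attribute the sectype can be set before enabling the port,
e.g. (assuming the standard configfs mount point; port '1' is a
placeholder):

  # echo tls1.3 > /sys/kernel/config/nvmet/ports/1/addr_tsas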

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/configfs.c | 65 ++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 907143870da5..d3d105a1665c 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -303,6 +303,11 @@ static void nvmet_port_init_tsas_rdma(struct nvmet_port *port)
 	port->disc_addr.tsas.rdma.cms = NVMF_RDMA_CMS_RDMA_CM;
 }
 
+static void nvmet_port_init_tsas_tcp(struct nvmet_port *port, int tsas)
+{
+	port->disc_addr.tsas.tcp.sectype = tsas;
+}
+
 static ssize_t nvmet_addr_trtype_store(struct config_item *item,
 		const char *page, size_t count)
 {
@@ -325,11 +330,70 @@ static ssize_t nvmet_addr_trtype_store(struct config_item *item,
 	port->disc_addr.trtype = nvmet_transport[i].type;
 	if (port->disc_addr.trtype == NVMF_TRTYPE_RDMA)
 		nvmet_port_init_tsas_rdma(port);
+	else if (port->disc_addr.trtype == NVMF_TRTYPE_TCP)
+		nvmet_port_init_tsas_tcp(port, NVMF_TCP_SECTYPE_NONE);
 	return count;
 }
 
 CONFIGFS_ATTR(nvmet_, addr_trtype);
 
+static const struct nvmet_type_name_map nvmet_addr_tsas_tcp[] = {
+	{ NVMF_TCP_SECTYPE_NONE,	"none" },
+	{ NVMF_TCP_SECTYPE_TLS12,	"tls1.2" },
+	{ NVMF_TCP_SECTYPE_TLS13,	"tls1.3" },
+};
+
+static ssize_t nvmet_addr_tsas_show(struct config_item *item,
+		char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	int i;
+
+	if (port->disc_addr.trtype == NVMF_TRTYPE_TCP) {
+		for (i = 0; i < ARRAY_SIZE(nvmet_addr_tsas_tcp); i++) {
+			if (port->disc_addr.tsas.tcp.sectype == nvmet_addr_tsas_tcp[i].type)
+				return sprintf(page, "%s\n", nvmet_addr_tsas_tcp[i].name);
+		}
+	} else if (port->disc_addr.trtype == NVMF_TRTYPE_RDMA) {
+		switch (port->disc_addr.tsas.rdma.qptype) {
+		case NVMF_RDMA_QPTYPE_CONNECTED:
+			return sprintf(page, "connected\n");
+		case NVMF_RDMA_QPTYPE_DATAGRAM:
+			return sprintf(page, "datagram\n");
+		default:
+			return sprintf(page, "reserved\n");
+		}
+	}
+	return sprintf(page, "not required\n");
+}
+
+static ssize_t nvmet_addr_tsas_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	int i;
+
+	if (nvmet_is_port_enabled(port, __func__))
+		return -EACCES;
+
+	if (port->disc_addr.trtype != NVMF_TRTYPE_TCP)
+		return -EINVAL;
+
+	for (i = 0; i < ARRAY_SIZE(nvmet_addr_tsas_tcp); i++) {
+		if (sysfs_streq(page, nvmet_addr_tsas_tcp[i].name))
+			goto found;
+	}
+
+	pr_err("Invalid value '%s' for tsas\n", page);
+	return -EINVAL;
+
+found:
+	nvmet_port_init_tsas_tcp(port, nvmet_addr_tsas_tcp[i].type);
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_, addr_tsas);
+
 /*
  * Namespace structures & file operation functions below
  */
@@ -1741,6 +1805,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+	&nvmet_attr_addr_tsas,
 	&nvmet_attr_param_inline_data_size,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	&nvmet_attr_param_pi_enable,
-- 
2.35.3



* [PATCH 13/18] nvmet-tcp: allocate socket file
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (11 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 12/18] nvmet: make TCP sectype settable via configfs Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 11:46   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 14/18] security/keys: export key_lookup() Hannes Reinecke
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

When using the TLS upcall we need to allocate a socket file such
that the userspace daemon is able to use the socket.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/tcp.c | 49 ++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 66e8f9fd0ca7..5c43767c5ecd 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -96,12 +96,14 @@ struct nvmet_tcp_cmd {
 
 enum nvmet_tcp_queue_state {
 	NVMET_TCP_Q_CONNECTING,
+	NVMET_TCP_Q_TLS_HANDSHAKE,
 	NVMET_TCP_Q_LIVE,
 	NVMET_TCP_Q_DISCONNECTING,
 };
 
 struct nvmet_tcp_queue {
 	struct socket		*sock;
+	struct file		*sock_file;
 	struct nvmet_tcp_port	*port;
 	struct work_struct	io_work;
 	struct nvmet_cq		nvme_cq;
@@ -1455,12 +1457,19 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	nvmet_sq_destroy(&queue->nvme_sq);
 	cancel_work_sync(&queue->io_work);
 	nvmet_tcp_free_cmd_data_in_buffers(queue);
-	sock_release(queue->sock);
+	if (queue->sock_file) {
+		fput(queue->sock_file);
+		queue->sock_file = NULL;
+		queue->sock = NULL;
+	} else {
+		WARN_ON(!queue->sock->ops);
+		sock_release(queue->sock);
+		queue->sock = NULL;
+	}
 	nvmet_tcp_free_cmds(queue);
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-
 	page = virt_to_head_page(queue->pf_cache.va);
 	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
 	kfree(queue);
@@ -1583,7 +1592,7 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 	return ret;
 }
 
-static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
+static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 		struct socket *newsock)
 {
 	struct nvmet_tcp_queue *queue;
@@ -1591,7 +1600,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 
 	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
-	if (!queue)
-		return -ENOMEM;
+	if (!queue) {
+		/* don't leak the just-accepted socket */
+		sock_release(newsock);
+		return;
+	}
 
 	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
 	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
@@ -1599,15 +1608,28 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	queue->port = port;
 	queue->nr_cmds = 0;
 	spin_lock_init(&queue->state_lock);
-	queue->state = NVMET_TCP_Q_CONNECTING;
+	if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
+	    NVMF_TCP_SECTYPE_TLS13)
+		queue->state = NVMET_TCP_Q_TLS_HANDSHAKE;
+	else
+		queue->state = NVMET_TCP_Q_CONNECTING;
 	INIT_LIST_HEAD(&queue->free_list);
 	init_llist_head(&queue->resp_list);
 	INIT_LIST_HEAD(&queue->resp_send_list);
 
+	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
+		queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, NULL);
+		if (IS_ERR(queue->sock_file)) {
+			ret = PTR_ERR(queue->sock_file);
+			queue->sock_file = NULL;
+			goto out_free_queue;
+		}
+	}
+
 	queue->idx = ida_alloc(&nvmet_tcp_queue_ida, GFP_KERNEL);
 	if (queue->idx < 0) {
 		ret = queue->idx;
-		goto out_free_queue;
+		goto out_sock;
 	}
 
 	ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
@@ -1628,7 +1650,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	if (ret)
 		goto out_destroy_sq;
 
-	return 0;
+	return;
 out_destroy_sq:
 	mutex_lock(&nvmet_tcp_queue_mutex);
 	list_del_init(&queue->queue_list);
@@ -1638,9 +1660,14 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	nvmet_tcp_free_cmd(&queue->connect);
 out_ida_remove:
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
+out_sock:
+	if (queue->sock_file)
+		fput(queue->sock_file);
+	else
+		sock_release(queue->sock);
 out_free_queue:
 	kfree(queue);
-	return ret;
+	pr_err("failed to allocate queue\n");
 }
 
 static void nvmet_tcp_accept_work(struct work_struct *w)
@@ -1657,11 +1684,7 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
 				pr_warn("failed to accept err=%d\n", ret);
 			return;
 		}
-		ret = nvmet_tcp_alloc_queue(port, newsock);
-		if (ret) {
-			pr_err("failed to allocate queue\n");
-			sock_release(newsock);
-		}
+		nvmet_tcp_alloc_queue(port, newsock);
 	}
 }
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 14/18] security/keys: export key_lookup()
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (12 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 13/18] nvmet-tcp: allocate socket file Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

For in-kernel consumers one cannot readily assign a user context (eg
when running from a workqueue), so the normal key search permission
checks cannot be applied.
This patch exports the 'key_lookup()' function for a simple lookup
of keys without checking for permissions.
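
For illustration, an in-kernel consumer would use it like this (a
hedged sketch; the helper name is made up, the real user being the
nvmet-tcp handshake callback later in this series):

#include <linux/key.h>

static struct key *nvme_tls_psk_get(key_serial_t serial)
{
	struct key *key;

	/*
	 * key_lookup() bypasses the permission checks and returns
	 * a referenced key on success, so the caller must key_put()
	 * it when done.
	 */
	key = key_lookup(serial);
	if (IS_ERR(key))
		return NULL;	/* key has been revoked or unlinked */
	return key;
}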

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 security/keys/key.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/keys/key.c b/security/keys/key.c
index 5c0c7df833f8..bd1b7d45df90 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -693,6 +693,7 @@ struct key *key_lookup(key_serial_t id)
 	spin_unlock(&key_serial_lock);
 	return key;
 }
+EXPORT_SYMBOL_GPL(key_lookup);
 
 /*
  * Find and lock the specified key type against removal.
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (13 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 14/18] security/keys: export key_lookup() Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 12:13   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS Hannes Reinecke
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Add functions to start the TLS handshake upcall.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/tcp.c | 188 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 181 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 5c43767c5ecd..6e88e98a2c59 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -9,8 +9,10 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/nvme-tcp.h>
+#include <linux/nvme-keyring.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/handshake.h>
 #include <linux/inet.h>
 #include <linux/llist.h>
 #include <crypto/hash.h>
@@ -40,6 +42,14 @@ module_param(idle_poll_period_usecs, int, 0644);
 MODULE_PARM_DESC(idle_poll_period_usecs,
 		"nvmet tcp io_work poll till idle time period in usecs");
 
+/*
+ * TLS handshake timeout
+ */
+static int tls_handshake_timeout = 30;
+module_param(tls_handshake_timeout, int, 0644);
+MODULE_PARM_DESC(tls_handshake_timeout,
+		 "nvme TLS handshake timeout in seconds (default 30)");
+
 #define NVMET_TCP_RECV_BUDGET		8
 #define NVMET_TCP_SEND_BUDGET		8
 #define NVMET_TCP_IO_WORK_BUDGET	64
@@ -131,6 +141,9 @@ struct nvmet_tcp_queue {
 	struct ahash_request	*snd_hash;
 	struct ahash_request	*rcv_hash;
 
+	struct key		*tls_psk;
+	struct delayed_work	tls_handshake_work;
+
 	unsigned long           poll_end;
 
 	spinlock_t		state_lock;
@@ -168,6 +181,7 @@ static struct workqueue_struct *nvmet_tcp_wq;
 static const struct nvmet_fabrics_ops nvmet_tcp_ops;
 static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
 static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
+static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *work);
 
 static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
 		struct nvmet_tcp_cmd *cmd)
@@ -1400,6 +1414,8 @@ static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
 {
 	struct socket *sock = queue->sock;
 
+	if (!sock->sk)
+		return;
 	write_lock_bh(&sock->sk->sk_callback_lock);
 	sock->sk->sk_data_ready =  queue->data_ready;
 	sock->sk->sk_state_change = queue->state_change;
@@ -1448,7 +1464,8 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	list_del_init(&queue->queue_list);
 	mutex_unlock(&nvmet_tcp_queue_mutex);
 
-	nvmet_tcp_restore_socket_callbacks(queue);
+	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
+		nvmet_tcp_restore_socket_callbacks(queue);
 	cancel_work_sync(&queue->io_work);
 	/* stop accepting incoming data */
 	queue->rcv_state = NVMET_TCP_RECV_ERR;
@@ -1469,6 +1486,8 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	nvmet_tcp_free_cmds(queue);
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
+	if (queue->tls_psk)
+		key_put(queue->tls_psk);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
 	page = virt_to_head_page(queue->pf_cache.va);
 	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
@@ -1481,11 +1500,15 @@ static void nvmet_tcp_data_ready(struct sock *sk)
 
 	trace_sk_data_ready(sk);
 
-	read_lock_bh(&sk->sk_callback_lock);
-	queue = sk->sk_user_data;
-	if (likely(queue))
-		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
-	read_unlock_bh(&sk->sk_callback_lock);
+	rcu_read_lock_bh();
+	queue = rcu_dereference_sk_user_data(sk);
+	if (queue && queue->data_ready)
+		queue->data_ready(sk);
+	if (likely(queue) &&
+	    queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
+		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
+			      &queue->io_work);
+	rcu_read_unlock_bh();
 }
 
 static void nvmet_tcp_write_space(struct sock *sk)
@@ -1585,13 +1608,139 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 		sock->sk->sk_write_space = nvmet_tcp_write_space;
 		if (idle_poll_period_usecs)
 			nvmet_tcp_arm_queue_deadline(queue);
-		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
+		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
+			      &queue->io_work);
 	}
 	write_unlock_bh(&sock->sk->sk_callback_lock);
 
 	return ret;
 }
 
+static void nvmet_tcp_tls_data_ready(struct sock *sk)
+{
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	/* kTLS will change the callback */
+	if (sk->sk_data_ready == nvmet_tcp_tls_data_ready) {
+		wq = rcu_dereference(sk->sk_wq);
+		if (skwq_has_sleeper(wq))
+			wake_up_interruptible_all(&wq->wait);
+	}
+	rcu_read_unlock();
+}
+
+static void nvmet_tcp_tls_handshake_restart(struct nvmet_tcp_queue *queue)
+{
+	spin_lock(&queue->state_lock);
+	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
+		pr_warn("queue %d: TLS handshake already completed\n",
+			queue->idx);
+		spin_unlock(&queue->state_lock);
+		return;
+	}
+	queue->state = NVMET_TCP_Q_CONNECTING;
+	spin_unlock(&queue->state_lock);
+
+	pr_debug("queue %d: restarting queue after TLS handshake\n",
+		 queue->idx);
+	/*
+	 * Set callbacks after handshake; TLS implementation
+	 * might have changed the socket callbacks.
+	 */
+	nvmet_tcp_set_queue_sock(queue);
+}
+
+static void nvmet_tcp_save_tls_callbacks(struct nvmet_tcp_queue *queue)
+{
+	struct sock *sk = queue->sock->sk;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rcu_assign_sk_user_data(sk, queue);
+	queue->data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = nvmet_tcp_tls_data_ready;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_restore_tls_callbacks(struct nvmet_tcp_queue *queue)
+{
+	struct sock *sk = queue->sock->sk;
+
+	if (WARN_ON(!sk))
+		return;
+	write_lock_bh(&sk->sk_callback_lock);
+	/* Only reset the callback if it really is ours */
+	if (sk->sk_data_ready == nvmet_tcp_tls_data_ready)
+		sk->sk_data_ready = queue->data_ready;
+	rcu_assign_sk_user_data(sk, NULL);
+	queue->data_ready = NULL;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void nvmet_tcp_tls_handshake_done(void *data, int status,
+					 key_serial_t peerid)
+{
+	struct nvmet_tcp_queue *queue = data;
+
+	pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
+		 queue->idx, peerid, status);
+	if (!status) {
+		spin_lock(&queue->state_lock);
+		queue->tls_psk = key_lookup(peerid);
+		if (IS_ERR(queue->tls_psk)) {
+			pr_warn("queue %d: TLS key %x not found\n",
+				queue->idx, peerid);
+			queue->tls_psk = NULL;
+		}
+		spin_unlock(&queue->state_lock);
+	}
+	cancel_delayed_work_sync(&queue->tls_handshake_work);
+	nvmet_tcp_restore_tls_callbacks(queue);
+	if (status)
+		nvmet_tcp_schedule_release_queue(queue);
+	else
+		nvmet_tcp_tls_handshake_restart(queue);
+}
+
+static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
+			struct nvmet_tcp_queue, tls_handshake_work);
+
+	pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
+	nvmet_tcp_restore_tls_callbacks(queue);
+	nvmet_tcp_schedule_release_queue(queue);
+}
+
+static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
+{
+	int ret;
+	struct tls_handshake_args args;
+
+	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
+		pr_warn("cannot start TLS in state %d\n", queue->state);
+		return -EINVAL;
+	}
+
+	pr_debug("queue %d: TLS ServerHello\n", queue->idx);
+	args.ta_sock = queue->sock;
+	args.ta_done = nvmet_tcp_tls_handshake_done;
+	args.ta_data = queue;
+	args.ta_keyring = nvme_keyring_id();
+	args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;
+
+	ret = tls_server_hello_psk(&args, GFP_KERNEL);
+	if (ret) {
+		pr_err("failed to start TLS, err=%d\n", ret);
+	} else {
+		pr_debug("queue %d wakeup userspace\n", queue->idx);
+		nvmet_tcp_tls_data_ready(queue->sock->sk);
+		queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
+				   tls_handshake_timeout * HZ);
+	}
+	return ret;
+}
+
 static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 		struct socket *newsock)
 {
@@ -1604,6 +1753,8 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 
 	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
 	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
+	INIT_DELAYED_WORK(&queue->tls_handshake_work,
+			  nvmet_tcp_tls_handshake_timeout_work);
 	queue->sock = newsock;
 	queue->port = port;
 	queue->nr_cmds = 0;
@@ -1646,6 +1797,29 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
 	mutex_unlock(&nvmet_tcp_queue_mutex);
 
+	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
+		nvmet_tcp_save_tls_callbacks(queue);
+		if (!nvmet_tcp_tls_handshake(queue))
+			return;
+		nvmet_tcp_restore_tls_callbacks(queue);
+
+		/*
+		 * If sectype is set to 'tls1.3' TLS is required
+		 * so terminate the connection if the TLS handshake
+		 * failed.
+		 */
+		if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
+		    NVMF_TCP_SECTYPE_TLS13) {
+			pr_debug("queue %d sectype tls1.3, terminate connection\n",
+				 queue->idx);
+			goto out_destroy_sq;
+		}
+		pr_debug("queue %d fallback to icreq\n", queue->idx);
+		spin_lock(&queue->state_lock);
+		queue->state = NVMET_TCP_Q_CONNECTING;
+		spin_unlock(&queue->state_lock);
+	}
+
 	ret = nvmet_tcp_set_queue_sock(queue);
 	if (ret)
 		goto out_destroy_sq;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (14 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 12:16   ` Sagi Grimberg
  2023-03-21 12:43 ` [PATCH 17/18] nvmet-tcp: control messages for recvmsg() Hannes Reinecke
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

kTLS ->sendpage() doesn't support the MSG_EOR flag, and it's
questionable whether sendpage makes sense for kTLS at all, as the
data has to be copied into the TLS record anyway.
So use sock_no_sendpage() for kTLS.
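
For reference, sock_no_sendpage() is essentially the following
fallback (paraphrased from net/core/sock.c), which is why the data
gets copied in either case:

ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
			 int offset, size_t size, int flags)
{
	struct msghdr msg = { .msg_flags = flags };
	struct kvec iov;
	char *kaddr = kmap(page);
	ssize_t res;

	/* fall back to an ordinary sendmsg(), copying the page data */
	iov.iov_base = kaddr + offset;
	iov.iov_len = size;
	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
	kunmap(page);
	return res;
}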

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/tcp.c | 56 ++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 6e88e98a2c59..9b69cac84508 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -570,9 +570,14 @@ static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
 	int left = sizeof(*cmd->data_pdu) - cmd->offset + hdgst;
 	int ret;
 
-	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
-			offset_in_page(cmd->data_pdu) + cmd->offset,
-			left, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
+	if (cmd->queue->tls_psk)
+		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
+				      offset_in_page(cmd->data_pdu) + cmd->offset,
+				      left, MSG_DONTWAIT | MSG_MORE);
+	else
+		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
+				      offset_in_page(cmd->data_pdu) + cmd->offset,
+				      left, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
 	if (ret <= 0)
 		return ret;
 
@@ -600,10 +605,17 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
 		if ((!last_in_batch && cmd->queue->send_list_len) ||
 		    cmd->wbytes_done + left < cmd->req.transfer_len ||
 		    queue->data_digest || !queue->nvme_sq.sqhd_disabled)
-			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-
-		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
-					left, flags);
+			flags |= MSG_MORE;
+
+		if (queue->tls_psk)
+			ret = sock_no_sendpage(cmd->queue->sock, page, cmd->offset,
+					       left, flags);
+		else {
+			if (flags & MSG_MORE)
+				flags |= MSG_SENDPAGE_NOTLAST;
+			ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
+					      left, flags);
+		}
 		if (ret <= 0)
 			return ret;
 
@@ -645,12 +657,19 @@ static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd,
 	int ret;
 
 	if (!last_in_batch && cmd->queue->send_list_len)
-		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-	else
+		flags |= MSG_MORE;
+	else if (!cmd->queue->tls_psk)
 		flags |= MSG_EOR;
 
-	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
-		offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
+	if (cmd->queue->tls_psk)
+		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
+			offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
+	else {
+		if (flags & MSG_MORE)
+			flags |= MSG_SENDPAGE_NOTLAST;
+		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
+			offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
+	}
 	if (ret <= 0)
 		return ret;
 	cmd->offset += ret;
@@ -673,12 +692,19 @@ static int nvmet_try_send_r2t(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
 	int ret;
 
 	if (!last_in_batch && cmd->queue->send_list_len)
-		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-	else
+		flags |= MSG_MORE;
+	else if (!cmd->queue->tls_psk)
 		flags |= MSG_EOR;
 
-	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
-		offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
+	if (cmd->queue->tls_psk)
+		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
+			offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
+	else {
+		if (flags & MSG_MORE)
+			flags |= MSG_SENDPAGE_NOTLAST;
+		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
+			offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
+	}
 	if (ret <= 0)
 		return ret;
 	cmd->offset += ret;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 17/18] nvmet-tcp: control messages for recvmsg()
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (15 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-21 12:43 ` [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS Hannes Reinecke
  2023-03-21 13:12 ` [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Sagi Grimberg
  18 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

kTLS requires control messages for recvmsg() to relay any out-of-band
TLS messages (eg TLS alerts) to the caller.
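
The record-type check below is repeated in three receive paths; it
could arguably be factored into a helper along these lines (a sketch
only; the helper name is made up):

static int nvmet_tcp_check_tls_record(struct nvmet_tcp_queue *queue,
		struct msghdr *msg, char *cbuf)
{
	struct cmsghdr *cmsg = (struct cmsghdr *)cbuf;
	unsigned char ctype;

	if (!CMSG_OK(msg, cmsg) ||
	    cmsg->cmsg_level != SOL_TLS ||
	    cmsg->cmsg_type != TLS_GET_RECORD_TYPE)
		return 0;
	ctype = *((unsigned char *)CMSG_DATA(cmsg));
	if (ctype != TLS_RECORD_TYPE_DATA) {
		/* TLS alert or handshake record; drop the connection */
		pr_err("queue %d unhandled TLS record %d\n",
		       queue->idx, ctype);
		return -ENOTCONN;
	}
	return 0;
}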

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/tcp.c | 58 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 9b69cac84508..a69647fb2c81 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -12,6 +12,7 @@
 #include <linux/nvme-keyring.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/tls.h>
 #include <net/handshake.h>
 #include <linux/inet.h>
 #include <linux/llist.h>
@@ -88,6 +89,7 @@ struct nvmet_tcp_cmd {
 	u32				pdu_len;
 	u32				pdu_recv;
 	int				sg_idx;
+	char				recv_cbuf[CMSG_LEN(sizeof(char))];
 	struct msghdr			recv_msg;
 	struct bio_vec			*iov;
 	u32				flags;
@@ -1108,7 +1110,14 @@ static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
 	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
 	int len;
 	struct kvec iov;
-	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	unsigned char ctype;
+	struct cmsghdr *cmsg;
+	struct msghdr msg = {
+		.msg_control = cbuf,
+		.msg_controllen = sizeof(cbuf),
+		.msg_flags = MSG_DONTWAIT
+	};
 
 recv:
 	iov.iov_base = (void *)&queue->pdu + queue->offset;
@@ -1117,6 +1126,17 @@ static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
 			iov.iov_len, msg.msg_flags);
 	if (unlikely(len < 0))
 		return len;
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			pr_err("queue %d unhandled TLS record %d\n",
+				queue->idx, ctype);
+			return -ENOTCONN;
+		}
+	}
 
 	queue->offset += len;
 	queue->left -= len;
@@ -1172,10 +1192,24 @@ static int nvmet_tcp_try_recv_data(struct nvmet_tcp_queue *queue)
 	int ret;
 
 	while (msg_data_left(&cmd->recv_msg)) {
+		struct cmsghdr *cmsg;
+		unsigned char ctype;
+
 		ret = sock_recvmsg(cmd->queue->sock, &cmd->recv_msg,
 			cmd->recv_msg.msg_flags);
 		if (ret <= 0)
 			return ret;
+		cmsg = (struct cmsghdr *)cmd->recv_cbuf;
+		if (CMSG_OK(&cmd->recv_msg, cmsg) &&
+		    cmsg->cmsg_level == SOL_TLS &&
+		    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+			ctype = *((unsigned char *)CMSG_DATA(cmsg));
+			if (ctype != TLS_RECORD_TYPE_DATA) {
+				pr_err("queue %d unhandled TLS record %d\n",
+				       queue->idx, ctype);
+				return -ENOTCONN;
+			}
+		}
 
 		cmd->pdu_recv += ret;
 		cmd->rbytes_done += ret;
@@ -1197,7 +1231,14 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
 {
 	struct nvmet_tcp_cmd *cmd = queue->cmd;
 	int ret;
-	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	unsigned char ctype;
+	struct cmsghdr *cmsg;
+	struct msghdr msg = {
+		.msg_control = cbuf,
+		.msg_controllen = sizeof(cbuf),
+		.msg_flags = MSG_DONTWAIT
+	};
 	struct kvec iov = {
 		.iov_base = (void *)&cmd->recv_ddgst + queue->offset,
 		.iov_len = queue->left
@@ -1207,6 +1248,17 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
 			iov.iov_len, msg.msg_flags);
 	if (unlikely(ret < 0))
 		return ret;
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			pr_err("queue %d unhandled TLS record %d\n",
+				queue->idx, ctype);
+			return -ENOTCONN;
+		}
+	}
 
 	queue->offset += ret;
 	queue->left -= ret;
@@ -1376,6 +1428,8 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
 	if (!c->r2t_pdu)
 		goto out_free_data;
 
+	c->recv_msg.msg_control = c->recv_cbuf;
+	c->recv_msg.msg_controllen = sizeof(c->recv_cbuf);
 	c->recv_msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 
 	list_add_tail(&c->entry, &queue->free_list);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (16 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 17/18] nvmet-tcp: control messages for recvmsg() Hannes Reinecke
@ 2023-03-21 12:43 ` Hannes Reinecke
  2023-03-22 12:24   ` Sagi Grimberg
  2023-03-21 13:12 ` [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Sagi Grimberg
  18 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Incoming connections might either be 'normal' NVMe-TCP connections
starting with an icreq PDU or TLS handshakes. To ensure that 'normal'
connections can still be handled we need to peek at the first packet
and only start the TLS handshake if it's not an icreq.
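
The property exploited here is that MSG_PEEK returns data without
consuming it; a minimal sketch, assuming a connected 'struct socket
*sock':

struct nvme_tcp_hdr hdr;
struct kvec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };
struct msghdr msg = { .msg_flags = MSG_PEEK };
int len;

/* look at the first bytes without dequeueing them ... */
len = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len, MSG_PEEK);
/*
 * ... so if this turns out not to be an icreq, the bytes are still
 * queued on the socket and the TLS handshake daemon will see the
 * complete ClientHello.
 */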

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/tcp.c | 60 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index a69647fb2c81..a328a303c2be 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1105,6 +1105,61 @@ static inline bool nvmet_tcp_pdu_valid(u8 type)
 	return false;
 }
 
+static int nvmet_tcp_try_peek_pdu(struct nvmet_tcp_queue *queue)
+{
+	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
+	int len;
+	struct kvec iov = {
+		.iov_base = (u8 *)&queue->pdu + queue->offset,
+		.iov_len = sizeof(struct nvme_tcp_hdr),
+	};
+	char cbuf[CMSG_LEN(sizeof(char))] = {};
+	unsigned char ctype;
+	struct cmsghdr *cmsg;
+	struct msghdr msg = {
+		.msg_control = cbuf,
+		.msg_controllen = sizeof(cbuf),
+		.msg_flags = MSG_PEEK,
+	};
+
+	len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (unlikely(len < 0)) {
+		pr_debug("queue %d peek error %d\n",
+			 queue->idx, len);
+		return len;
+	}
+
+	cmsg = (struct cmsghdr *)cbuf;
+	if (CMSG_OK(&msg, cmsg) &&
+	    cmsg->cmsg_level == SOL_TLS &&
+	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
+		ctype = *((unsigned char *)CMSG_DATA(cmsg));
+		if (ctype != TLS_RECORD_TYPE_DATA) {
+			pr_err("queue %d unhandled TLS record %d\n",
+				queue->idx, ctype);
+			return -ENOTCONN;
+		}
+	}
+
+	if (len < sizeof(struct nvme_tcp_hdr)) {
+		pr_debug("queue %d short read, %d bytes missing\n",
+			 queue->idx, (int)iov.iov_len - len);
+		return -EAGAIN;
+	}
+	pr_debug("queue %d hdr type %d hlen %d plen %d size %d\n",
+		 queue->idx, hdr->type, hdr->hlen, hdr->plen,
+		 (int)sizeof(struct nvme_tcp_icreq_pdu));
+	if (hdr->type == nvme_tcp_icreq &&
+	    hdr->hlen == sizeof(struct nvme_tcp_icreq_pdu) &&
+	    hdr->plen == sizeof(struct nvme_tcp_icreq_pdu)) {
+		pr_debug("queue %d icreq detected\n",
+			 queue->idx);
+		return len;
+	}
+	return 0;
+}
+
 static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
 {
 	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
@@ -1879,8 +1934,9 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 
 	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
 		nvmet_tcp_save_tls_callbacks(queue);
-		if (!nvmet_tcp_tls_handshake(queue))
-			return;
+		if (!nvmet_tcp_try_peek_pdu(queue))
+			if (!nvmet_tcp_tls_handshake(queue))
+				return;
 		nvmet_tcp_restore_tls_callbacks(queue);
 
 		/*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
                   ` (17 preceding siblings ...)
  2023-03-21 12:43 ` [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS Hannes Reinecke
@ 2023-03-21 13:12 ` Sagi Grimberg
  2023-03-21 13:30   ` Hannes Reinecke
  18 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:12 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> Hi all,
> 
> finally I've managed to put all things together and enable in-kernel
> TLS support for NVMe-over-TCP.

Hannes (and Chuck) this is great, I'm very happy to see this!

I'll start a detailed review soon enough.

Thank you for doing this.

> The patchset is based on the TLS upcall mechanism from Chuck Lever
> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
> posted to the linux netdev list), and requires the 'tlshd' userspace
> daemon (https://github.com/oracle/ktls-utils) for the actual TLS handshake.

Do you have an actual link to follow for this patch set?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-21 13:12 ` [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Sagi Grimberg
@ 2023-03-21 13:30   ` Hannes Reinecke
  2023-03-22  8:16     ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 13:30 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/21/23 14:12, Sagi Grimberg wrote:
> 
>> Hi all,
>>
>> finally I've managed to put all things together and enable in-kernel
>> TLS support for NVMe-over-TCP.
> 
> Hannes (and Chuck) this is great, I'm very happy to see this!
> 
> I'll start a detailed review soon enough.
> 
> Thank you for doing this.
> 
>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>> posted to the linux netdev list), and requires the 'tlshd' userspace
>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>> handshake.
> 
> Do you have an actual link to follow for this patch set?

Sure.

git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
branch tls-netlink.v7

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS
  2023-03-21 12:43 ` [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS Hannes Reinecke
@ 2023-03-21 13:39   ` Sagi Grimberg
  2023-03-21 13:59     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:39 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> TLS offload only implements recvmsg(), so implement the receive
> side using recvmsg().

I don't really mind changing this; however, this change makes us
lock the socket for every consumption, instead of taking the lock
once and consuming as much as possible, which in theory is suboptimal.

Is there any material reason why tls cannot implement read_sock()?
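
For context, this is the interface in question (from struct
proto_ops); tcp_read_sock() takes the socket lock once and then
invokes the actor for every queued skb, which is what the current
nvme_tcp_recv_skb() flow relies on, and the tls proto_ops do not
implement it:

int (*read_sock)(struct sock *sk, read_descriptor_t *desc,
		 sk_read_actor_t recv_actor);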

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 156 ++++++++++++++++++++--------------------
>   1 file changed, 77 insertions(+), 79 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 42c0598c31f2..0e14b1b90855 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -529,7 +529,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>   	queue->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
>   				nvme_tcp_hdgst_len(queue);
>   	queue->pdu_offset = 0;
> -	queue->data_remaining = -1;
> +	queue->data_remaining = 0;
>   	queue->ddgst_remaining = 0;
>   }
>   
> @@ -707,25 +707,32 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>   	return 0;
>   }
>   
> -static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> -		unsigned int *offset, size_t *len)
> +static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
>   {
>   	struct nvme_tcp_hdr *hdr;
> -	char *pdu = queue->pdu;
> -	size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
> +	size_t rcv_len = queue->pdu_remaining;
> +	struct msghdr msg = {
> +		.msg_flags = pending ? 0 : MSG_DONTWAIT,

Umm, why?
What is the reason to block in this recv?

> +	};
> +	struct kvec iov = {
> +		.iov_base = (u8 *)queue->pdu + queue->pdu_offset,
> +		.iov_len = rcv_len,
> +	};
>   	int ret;
>   
> -	ret = skb_copy_bits(skb, *offset,
> -		&pdu[queue->pdu_offset], rcv_len);
> -	if (unlikely(ret))
> +	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_PDU)
> +		return 0;

Why is this check needed? Looks like a left-over.

> +
> +	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> +			     iov.iov_len, msg.msg_flags);
> +	if (ret <= 0)
>   		return ret;
>   
> +	rcv_len = ret;
>   	queue->pdu_remaining -= rcv_len;
>   	queue->pdu_offset += rcv_len;
> -	*offset += rcv_len;
> -	*len -= rcv_len;
>   	if (queue->pdu_remaining)
> -		return 0;
> +		return queue->pdu_remaining;
>   
>   	hdr = queue->pdu;
>   	if (queue->hdr_digest) {
> @@ -734,7 +741,6 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   			return ret;
>   	}
>   
> -
>   	if (queue->data_digest) {
>   		ret = nvme_tcp_check_ddgst(queue, queue->pdu);
>   		if (unlikely(ret))
> @@ -765,19 +771,21 @@ static inline void nvme_tcp_end_request(struct request *rq, u16 status)
>   		nvme_complete_rq(rq);
>   }
>   
> -static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> -			      unsigned int *offset, size_t *len)
> +static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>   {
>   	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>   	struct request *rq =
>   		nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>   	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>   
> +	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
> +		return 0;
> +
>   	while (true) {
> -		int recv_len, ret;
> +		struct msghdr msg;
> +		int ret;
>   
> -		recv_len = min_t(size_t, *len, queue->data_remaining);
> -		if (!recv_len)
> +		if (!queue->data_remaining)
>   			break;
>   
>   		if (!iov_iter_count(&req->iter)) {
> @@ -798,25 +806,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   		}
>   
>   		/* we can read only from what is left in this bio */
> -		recv_len = min_t(size_t, recv_len,
> -				iov_iter_count(&req->iter));
> +		memset(&msg, 0, sizeof(msg));
> +		msg.msg_iter = req->iter;
>   
> -		if (queue->data_digest)
> -			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
> -				&req->iter, recv_len, queue->rcv_hash);
> -		else
> -			ret = skb_copy_datagram_iter(skb, *offset,
> -					&req->iter, recv_len);
> -		if (ret) {
> +		ret = sock_recvmsg(queue->sock, &msg, 0);

Who updates the rcv_hash for data digest validation?

> +		if (ret <= 0) {
>   			dev_err(queue->ctrl->ctrl.device,
> -				"queue %d failed to copy request %#x data",
> +				"queue %d failed to receive request %#x data",
>   				nvme_tcp_queue_id(queue), rq->tag);
>   			return ret;
>   		}
>   
> -		*len -= recv_len;
> -		*offset += recv_len;
> -		queue->data_remaining -= recv_len;
> +		queue->data_remaining -= ret;
> +		if (queue->data_remaining)
> +			nvme_tcp_advance_req(req, ret);
>   	}
>   
>   	if (!queue->data_remaining) {
> @@ -833,27 +836,36 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   		}
>   	}
>   
> -	return 0;
> +	return queue->data_remaining;
>   }
>   
> -static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
> -		struct sk_buff *skb, unsigned int *offset, size_t *len)
> +static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
>   {
>   	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>   	char *ddgst = (char *)&queue->recv_ddgst;
> -	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
> +	size_t recv_len = queue->ddgst_remaining;
>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
> +	struct msghdr msg = {
> +		.msg_flags = 0,
> +	};
> +	struct kvec iov = {
> +		.iov_base = (u8 *)ddgst + off,
> +		.iov_len = recv_len,
> +	};
>   	int ret;
>   
> -	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
> -	if (unlikely(ret))
> +	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DDGST)
> +		return 0;
> +
> +	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
> +			     msg.msg_flags);
> +	if (ret <= 0)
>   		return ret;
>   
> +	recv_len = ret;
>   	queue->ddgst_remaining -= recv_len;
> -	*offset += recv_len;
> -	*len -= recv_len;
>   	if (queue->ddgst_remaining)
> -		return 0;
> +		return queue->ddgst_remaining;
>   
>   	if (queue->recv_ddgst != queue->exp_ddgst) {
>   		struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
> @@ -881,37 +893,41 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>   	return 0;
>   }
>   
> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
> -			     unsigned int offset, size_t len)
> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, bool pending)
>   {
> -	struct nvme_tcp_queue *queue = desc->arg.data;
> -	size_t consumed = len;
>   	int result;
> +	int nr_cqe = queue->nr_cqe;
>   
> -	while (len) {
> +	do {
>   		switch (nvme_tcp_recv_state(queue)) {
>   		case NVME_TCP_RECV_PDU:
> -			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
> -			break;
> +			result = nvme_tcp_recv_pdu(queue, pending);
> +			if (result)
> +				break;
> +			fallthrough;
>   		case NVME_TCP_RECV_DATA:
> -			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
> -			break;
> +			result = nvme_tcp_recv_data(queue);
> +			if (result)
> +				break;
> +			fallthrough;
>   		case NVME_TCP_RECV_DDGST:
> -			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
> +			result = nvme_tcp_recv_ddgst(queue);
>   			break;
>   		default:
>   			result = -EFAULT;
>   		}
> -		if (result) {
> -			dev_err(queue->ctrl->ctrl.device,
> -				"receive failed:  %d\n", result);
> -			queue->rd_enabled = false;
> -			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
> -			return result;
> -		}
> +		if (nr_cqe != queue->nr_cqe)
> +			break;
> +	} while (result >= 0);
> +	if (result < 0 && result != -EAGAIN) {
> +		dev_err(queue->ctrl->ctrl.device,
> +			"receive failed: %d state %d %s\n",
> +			result, nvme_tcp_recv_state(queue),
> +			pending ? "pending" : "");

I'm unclear why pending would be an input to try_recv. Semantically
it is an output, signalling the io_work that data is pending to be
reaped from the socket.

> +		queue->rd_enabled = false;
> +		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>   	}
> -
> -	return consumed;
> +	return result < 0 ? result : (queue->nr_cqe - nr_cqe);

Isn't it possible that we consumed data but got no completion in this
round? I'm assuming that

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-21 12:43 ` [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready() Hannes Reinecke
@ 2023-03-21 13:44   ` Sagi Grimberg
  2023-03-21 14:09     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:44 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> Call the original data_ready() callback in nvme_tcp_data_ready()
> to avoid a receive stall.

Can you please improve the description to explain what the stall is?
For example, does the stall exist today? If it does, I would like to
separate such patches from this set and include them asap.

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0e14b1b90855..0512eb289dcf 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -936,12 +936,14 @@ static void nvme_tcp_data_ready(struct sock *sk)
>   
>   	trace_sk_data_ready(sk);
>   
> -	read_lock_bh(&sk->sk_callback_lock);
> -	queue = sk->sk_user_data;
> +	rcu_read_lock_bh();

Now I understand your comment from a previous patch.
Can you explain why this convention is needed?

I would prefer to have it as a separate patch with an
explanation to why it is needed.

> +	queue = rcu_dereference_sk_user_data(sk);
> +	if (queue && queue->data_ready)
> +		queue->data_ready(sk);

Is the tls data_ready call sync or async? Just for general knowledge.


>   	if (likely(queue && queue->rd_enabled) &&
>   	    !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
>   		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> -	read_unlock_bh(&sk->sk_callback_lock);
> +	rcu_read_unlock_bh();
>   }
>   
>   static void nvme_tcp_write_space(struct sock *sk)

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/18] nvme: add TCP TSAS definitions
  2023-03-21 12:43 ` [PATCH 03/18] nvme: add TCP TSAS definitions Hannes Reinecke
@ 2023-03-21 13:46   ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:46 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   include/linux/nvme.h | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 779507ac750b..ea961ca2022d 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -108,6 +108,13 @@ enum {
>   	NVMF_RDMA_CMS_RDMA_CM	= 1, /* Sockets based endpoint addressing */
>   };
>   
> +/* TSAS SECTYPE for TCP transport */
> +enum {
> +	NVMF_TCP_SECTYPE_NONE = 0, /* No Security */
> +	NVMF_TCP_SECTYPE_TLS12 = 1, /* TLSv1.2, NVMe-oF 1.1 and NVMe-TCP 3.6.1.1 */
> +	NVMF_TCP_SECTYPE_TLS13 = 2, /* TLSv1.3, NVMe-oF 1.1 and NVMe-TCP 3.6.1.1 */
> +};

I think these should be located in nvme-tcp.h
Otherwise looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


> +
>   #define NVME_AQ_DEPTH		32
>   #define NVME_NR_AEN_COMMANDS	1
>   #define NVME_AQ_BLK_MQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AEN_COMMANDS)
> @@ -1458,6 +1465,9 @@ struct nvmf_disc_rsp_page_entry {
>   			__u16	pkey;
>   			__u8	resv10[246];
>   		} rdma;
> +		struct tcp {
> +			__u8	sectype;
> +		} tcp;
>   	} tsas;
>   };
>   

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/18] nvme-keyring: register '.nvme' keyring
  2023-03-21 12:43 ` [PATCH 01/18] nvme-keyring: register '.nvme' keyring Hannes Reinecke
@ 2023-03-21 13:50   ` Sagi Grimberg
  2023-03-21 14:11     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:50 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP.
> We need a separate keyring as for NVMe there might not be a userspace
> process attached (eg during reconnect), and so the use of a session
> keyring or any other process-related keyrings might not be possible.

So the keys will be stored in the ring such that on any reconnect
userspace will have access to these keys? How does this affect 
dh-hmac-chap keys?

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/common/Makefile  |  2 +-
>   drivers/nvme/common/keyring.c | 36 +++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/core.c      | 10 +++++++++-
>   include/linux/nvme-keyring.h  | 12 ++++++++++++
>   4 files changed, 58 insertions(+), 2 deletions(-)
>   create mode 100644 drivers/nvme/common/keyring.c
>   create mode 100644 include/linux/nvme-keyring.h
> 
> diff --git a/drivers/nvme/common/Makefile b/drivers/nvme/common/Makefile
> index 720c625b8a52..c4e3b312d2cc 100644
> --- a/drivers/nvme/common/Makefile
> +++ b/drivers/nvme/common/Makefile
> @@ -4,4 +4,4 @@ ccflags-y			+= -I$(src)
>   
>   obj-$(CONFIG_NVME_COMMON)	+= nvme-common.o
>   
> -nvme-common-y			+= auth.o
> +nvme-common-y			+= auth.o keyring.o
> diff --git a/drivers/nvme/common/keyring.c b/drivers/nvme/common/keyring.c
> new file mode 100644
> index 000000000000..3a6e8a0b38e2
> --- /dev/null
> +++ b/drivers/nvme/common/keyring.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020 Hannes Reinecke, SUSE Linux
> + */
> +
> +#include <linux/module.h>
> +#include <linux/nvme.h>
> +#include <linux/seq_file.h>
> +#include <linux/key-type.h>
> +#include <keys/user-type.h>
> +
> +static struct key *nvme_keyring;
> +
> +int nvme_keyring_init(void)
> +{
> +	int err;
> +
> +	nvme_keyring = keyring_alloc(".nvme",
> +				     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> +				     current_cred(),
> +				     (KEY_POS_ALL & ~KEY_POS_SETATTR) |
> +				     (KEY_USR_ALL & ~KEY_USR_SETATTR),
> +				     KEY_ALLOC_NOT_IN_QUOTA, NULL, NULL);
> +	if (IS_ERR(nvme_keyring))
> +		return PTR_ERR(nvme_keyring);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(nvme_keyring_init);
> +
> +void nvme_keyring_exit(void)
> +{
> +	key_revoke(nvme_keyring);
> +	key_put(nvme_keyring);
> +}
> +EXPORT_SYMBOL_GPL(nvme_keyring_exit);
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index d4be525f8100..839bc7587f54 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -25,6 +25,7 @@
>   #include "nvme.h"
>   #include "fabrics.h"
>   #include <linux/nvme-auth.h>
> +#include <linux/nvme-keyring.h>
>   
>   #define CREATE_TRACE_POINTS
>   #include "trace.h"
> @@ -5415,11 +5416,17 @@ static int __init nvme_core_init(void)
>   		goto unregister_generic_ns;
>   	}
>   
> -	result = nvme_init_auth();
> +	result = nvme_keyring_init();
>   	if (result)
>   		goto destroy_ns_chr;
> +
> +	result = nvme_init_auth();
> +	if (result)
> +		goto keyring_exit;
>   	return 0;
>   
> +keyring_exit:
> +	nvme_keyring_exit();
>   destroy_ns_chr:
>   	class_destroy(nvme_ns_chr_class);
>   unregister_generic_ns:
> @@ -5443,6 +5450,7 @@ static int __init nvme_core_init(void)
>   static void __exit nvme_core_exit(void)
>   {
>   	nvme_exit_auth();
> +	nvme_keyring_exit();
>   	class_destroy(nvme_ns_chr_class);
>   	class_destroy(nvme_subsys_class);
>   	class_destroy(nvme_class);
> diff --git a/include/linux/nvme-keyring.h b/include/linux/nvme-keyring.h
> new file mode 100644
> index 000000000000..a875c06cc922
> --- /dev/null
> +++ b/include/linux/nvme-keyring.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2021 Hannes Reinecke, SUSE Software Solutions
> + */
> +
> +#ifndef _NVME_KEYRING_H
> +#define _NVME_KEYRING_H
> +
> +int nvme_keyring_init(void);
> +void nvme_keyring_exit(void);
> +
> +#endif

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 07/18] nvme/tcp: allocate socket file
  2023-03-21 12:43 ` [PATCH 07/18] nvme/tcp: allocate socket file Hannes Reinecke
@ 2023-03-21 13:52   ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-21 13:52 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> When using the TLS upcall we need to allocate a socket file such
> that the userspace daemon is able to use the socket.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0512eb289dcf..0438d42f4179 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -115,6 +115,7 @@ enum nvme_tcp_recv_state {
>   struct nvme_tcp_ctrl;
>   struct nvme_tcp_queue {
>   	struct socket		*sock;
> +	struct file		*sock_file;
>   	struct work_struct	io_work;
>   	int			io_cpu;
>   
> @@ -1330,7 +1331,12 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
>   	}
>   
>   	noreclaim_flag = memalloc_noreclaim_save();
> -	sock_release(queue->sock);
> +	if (queue->sock_file) {
> +		fput(queue->sock_file);
> +		queue->sock_file = NULL;
> +		/* ->sock will be released by fput() */
> +	} else
> +		sock_release(queue->sock);
>   	memalloc_noreclaim_restore(noreclaim_flag);
>   
>   	kfree(queue->pdu);
> @@ -1526,6 +1532,12 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>   		goto err_destroy_mutex;
>   	}
>   
> +	queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, NULL);
> +	if (IS_ERR(queue->sock_file)) {
> +		ret = PTR_ERR(queue->sock_file);
> +		queue->sock_file = NULL;
> +		goto err_sock;
> +	}

If a sock_file is always allocated, and the connection fails if
allocation is unsuccessful, why do you check for its existence when
freeing the queue?

>   	nvme_tcp_reclassify_socket(queue->sock);
>   
>   	/* Single syn retry */
> @@ -1647,7 +1659,12 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>   	if (queue->hdr_digest || queue->data_digest)
>   		nvme_tcp_free_crypto(queue);
>   err_sock:
> -	sock_release(queue->sock);
> +	if (queue->sock_file) {
> +		fput(queue->sock_file);
> +		queue->sock_file = NULL;
> +		/* ->sock will be released by fput() */
> +	} else
> +		sock_release(queue->sock);
>   	queue->sock = NULL;
>   err_destroy_mutex:
>   	mutex_destroy(&queue->send_mutex);

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS
  2023-03-21 13:39   ` Sagi Grimberg
@ 2023-03-21 13:59     ` Hannes Reinecke
  2023-03-22  8:01       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 13:59 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/21/23 14:39, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> TLS offload only implements recvmsg(), so implement the receive
>> side using recvmsg().
> 
> I don't really mind changing this; however, this change makes us
> lock the socket for every consumption, instead of taking the lock
> once and consuming as much as possible, which in theory is suboptimal.
> 
> Is there any material reason why tls cannot implement read_sock()?
> 
Because the 'read_sock()' interface operates on skbs, but for TLS we
just have a 'stream' (there is a 'stream parser' handling the data),
and skbs are meaningless as the decrypted payload can extend across
several skbs.
At least, that's how I understood it.

But really, the prime reason is that I'm _far_ more familiar with the 
NVMe code than the tls networking code, so implementing the recvmsg() 
flow was relatively simple.

Maybe we can ask Boris Pismenny to implement read_sock() for tls ...

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 156 ++++++++++++++++++++--------------------
>>   1 file changed, 77 insertions(+), 79 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 42c0598c31f2..0e14b1b90855 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -529,7 +529,7 @@ static void nvme_tcp_init_recv_ctx(struct 
>> nvme_tcp_queue *queue)
>>       queue->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
>>                   nvme_tcp_hdgst_len(queue);
>>       queue->pdu_offset = 0;
>> -    queue->data_remaining = -1;
>> +    queue->data_remaining = 0;
>>       queue->ddgst_remaining = 0;
>>   }
>> @@ -707,25 +707,32 @@ static int nvme_tcp_handle_r2t(struct 
>> nvme_tcp_queue *queue,
>>       return 0;
>>   }
>> -static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct 
>> sk_buff *skb,
>> -        unsigned int *offset, size_t *len)
>> +static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
>>   {
>>       struct nvme_tcp_hdr *hdr;
>> -    char *pdu = queue->pdu;
>> -    size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
>> +    size_t rcv_len = queue->pdu_remaining;
>> +    struct msghdr msg = {
>> +        .msg_flags = pending ? 0 : MSG_DONTWAIT,
> 
> Umm, why?
> What is the reason to block in this recv?
> 
To avoid frequent -EAGAIN returns; those looked really ugly in the
debug logs :-)
I can try to do away with that; if we do, the 'pending' argument can
also be removed, so that might be an idea.
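
Something like this, I guess (a sketch; the exact return convention
would need to match the receive state machine):

ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
		     iov.iov_len, MSG_DONTWAIT);
if (ret == -EAGAIN)
	return 0;	/* no data queued yet; io_work will retry */
if (ret < 0)
	return ret;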

>> +    };
>> +    struct kvec iov = {
>> +        .iov_base = (u8 *)queue->pdu + queue->pdu_offset,
>> +        .iov_len = rcv_len,
>> +    };
>>       int ret;
>> -    ret = skb_copy_bits(skb, *offset,
>> -        &pdu[queue->pdu_offset], rcv_len);
>> -    if (unlikely(ret))
>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_PDU)
>> +        return 0;
> 
> Why is this check needed? Looks like a left-over.
> 
Yeah.

>> +
>> +    ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>> +                 iov.iov_len, msg.msg_flags);
>> +    if (ret <= 0)
>>           return ret;
>> +    rcv_len = ret;
>>       queue->pdu_remaining -= rcv_len;
>>       queue->pdu_offset += rcv_len;
>> -    *offset += rcv_len;
>> -    *len -= rcv_len;
>>       if (queue->pdu_remaining)
>> -        return 0;
>> +        return queue->pdu_remaining;
>>       hdr = queue->pdu;
>>       if (queue->hdr_digest) {
>> @@ -734,7 +741,6 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue 
>> *queue, struct sk_buff *skb,
>>               return ret;
>>       }
>> -
>>       if (queue->data_digest) {
>>           ret = nvme_tcp_check_ddgst(queue, queue->pdu);
>>           if (unlikely(ret))
>> @@ -765,19 +771,21 @@ static inline void nvme_tcp_end_request(struct 
>> request *rq, u16 status)
>>           nvme_complete_rq(rq);
>>   }
>> -static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct 
>> sk_buff *skb,
>> -                  unsigned int *offset, size_t *len)
>> +static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>>   {
>>       struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>>       struct request *rq =
>>           nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>>       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
>> +        return 0;
>> +
>>       while (true) {
>> -        int recv_len, ret;
>> +        struct msghdr msg;
>> +        int ret;
>> -        recv_len = min_t(size_t, *len, queue->data_remaining);
>> -        if (!recv_len)
>> +        if (!queue->data_remaining)
>>               break;
>>           if (!iov_iter_count(&req->iter)) {
>> @@ -798,25 +806,20 @@ static int nvme_tcp_recv_data(struct 
>> nvme_tcp_queue *queue, struct sk_buff *skb,
>>           }
>>           /* we can read only from what is left in this bio */
>> -        recv_len = min_t(size_t, recv_len,
>> -                iov_iter_count(&req->iter));
>> +        memset(&msg, 0, sizeof(msg));
>> +        msg.msg_iter = req->iter;
>> -        if (queue->data_digest)
>> -            ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>> -                &req->iter, recv_len, queue->rcv_hash);
>> -        else
>> -            ret = skb_copy_datagram_iter(skb, *offset,
>> -                    &req->iter, recv_len);
>> -        if (ret) {
>> +        ret = sock_recvmsg(queue->sock, &msg, 0);
> 
> Who updates the rcv_hash for data digest validation?
> 
Weelll ... currently, no-one.
That's one of the things which I haven't tested yet.

>> +        if (ret <= 0) {
>>               dev_err(queue->ctrl->ctrl.device,
>> -                "queue %d failed to copy request %#x data",
>> +                "queue %d failed to receive request %#x data",
>>                   nvme_tcp_queue_id(queue), rq->tag);
>>               return ret;
>>           }
>> -        *len -= recv_len;
>> -        *offset += recv_len;
>> -        queue->data_remaining -= recv_len;
>> +        queue->data_remaining -= ret;
>> +        if (queue->data_remaining)
>> +            nvme_tcp_advance_req(req, ret);
>>       }
>>       if (!queue->data_remaining) {
>> @@ -833,27 +836,36 @@ static int nvme_tcp_recv_data(struct 
>> nvme_tcp_queue *queue, struct sk_buff *skb,
>>           }
>>       }
>> -    return 0;
>> +    return queue->data_remaining;
>>   }
>> -static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>> -        struct sk_buff *skb, unsigned int *offset, size_t *len)
>> +static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
>>   {
>>       struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>>       char *ddgst = (char *)&queue->recv_ddgst;
>> -    size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>> +    size_t recv_len = queue->ddgst_remaining;
>>       off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>> +    struct msghdr msg = {
>> +        .msg_flags = 0,
>> +    };
>> +    struct kvec iov = {
>> +        .iov_base = (u8 *)ddgst + off,
>> +        .iov_len = recv_len,
>> +    };
>>       int ret;
>> -    ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>> -    if (unlikely(ret))
>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DDGST)
>> +        return 0;
>> +
>> +    ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
>> +                 msg.msg_flags);
>> +    if (ret <= 0)
>>           return ret;
>> +    recv_len = ret;
>>       queue->ddgst_remaining -= recv_len;
>> -    *offset += recv_len;
>> -    *len -= recv_len;
>>       if (queue->ddgst_remaining)
>> -        return 0;
>> +        return queue->ddgst_remaining;
>>       if (queue->recv_ddgst != queue->exp_ddgst) {
>>           struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
>> @@ -881,37 +893,41 @@ static int nvme_tcp_recv_ddgst(struct 
>> nvme_tcp_queue *queue,
>>       return 0;
>>   }
>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff 
>> *skb,
>> -                 unsigned int offset, size_t len)
>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, bool pending)
>>   {
>> -    struct nvme_tcp_queue *queue = desc->arg.data;
>> -    size_t consumed = len;
>>       int result;
>> +    int nr_cqe = queue->nr_cqe;
>> -    while (len) {
>> +    do {
>>           switch (nvme_tcp_recv_state(queue)) {
>>           case NVME_TCP_RECV_PDU:
>> -            result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
>> -            break;
>> +            result = nvme_tcp_recv_pdu(queue, pending);
>> +            if (result)
>> +                break;
>> +            fallthrough;
>>           case NVME_TCP_RECV_DATA:
>> -            result = nvme_tcp_recv_data(queue, skb, &offset, &len);
>> -            break;
>> +            result = nvme_tcp_recv_data(queue);
>> +            if (result)
>> +                break;
>> +            fallthrough;
>>           case NVME_TCP_RECV_DDGST:
>> -            result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
>> +            result = nvme_tcp_recv_ddgst(queue);
>>               break;
>>           default:
>>               result = -EFAULT;
>>           }
>> -        if (result) {
>> -            dev_err(queue->ctrl->ctrl.device,
>> -                "receive failed:  %d\n", result);
>> -            queue->rd_enabled = false;
>> -            nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>> -            return result;
>> -        }
>> +        if (nr_cqe != queue->nr_cqe)
>> +            break;
>> +    } while (result >= 0);
>> +    if (result < 0 && result != -EAGAIN) {
>> +        dev_err(queue->ctrl->ctrl.device,
>> +            "receive failed: %d state %d %s\n",
>> +            result, nvme_tcp_recv_state(queue),
>> +            pending ? "pending" : "");
> 
> I'm unclear why pending would be an input to try_recv. Semantically
> it is an output, signalling the io_work that data is pending to be
> reaped from the socket.
> 
See above. 'pending' is really there to clear the 'NOWAIT' flag for 
recvmsg(), and to avoid frequent -EAGAIN returns.
If we're fine handling them it can be removed.
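
Dropping it would mean always passing MSG_DONTWAIT and having the
caller treat -EAGAIN as 'socket drained', roughly (untested, with the
'pending' argument gone):

	result = nvme_tcp_try_recv(queue);
	if (result == -EAGAIN)
		result = 0;	/* nothing to reap; wait for data_ready */
	else if (result < 0)
		return;		/* error recovery has been scheduled */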

>> +        queue->rd_enabled = false;
>> +        nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>       }
>> -
>> -    return consumed;
>> +    return result < 0 ? result : (queue->nr_cqe - nr_cqe);
> 
> Isn't it possible that we consumed data but no completion in this
> round? I'm assuming that

See above, that's the -EAGAIN issue.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-21 13:44   ` Sagi Grimberg
@ 2023-03-21 14:09     ` Hannes Reinecke
  2023-03-22  0:18       ` Chris Leech
  2023-03-22  8:08       ` Sagi Grimberg
  0 siblings, 2 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 14:09 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/21/23 14:44, Sagi Grimberg wrote:
> 
>> Call the original data_ready() callback in nvme_tcp_data_ready()
>> to avoid a receive stall.
> 
> Can you please improve the description to include what is the stall?
> For example, does the stall exist today? If it is, I would like to
> separate such patches from this set and include them asap.
> 
That is actually particular to the TLS implementation, as it uses the 
'data_ready' callback to produce the data which can be read by eg recvmsg().

Without this call there's no data to peruse for recvmsg().

But I'm not _that_ deep into networking details to know whether this is 
TLS specific or an issue with any data_ready callback.
I assume the latter, but then again, who knows.

Hence the slightly vague description.

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0e14b1b90855..0512eb289dcf 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -936,12 +936,14 @@ static void nvme_tcp_data_ready(struct sock *sk)
>>       trace_sk_data_ready(sk);
>> -    read_lock_bh(&sk->sk_callback_lock);
>> -    queue = sk->sk_user_data;
>> +    rcu_read_lock_bh();
> 
> Now I understand your comment from a previous patch.
> Can you explain why is this convention needed?
> 
> I would prefer to have it as a separate patch with an
> explanation to why it is needed.
> 
This is the slightly odd socket callback handling.
Any driver is free to set the socket callbacks, but it has to be aware 
that it might not be the only one in the stack doing so.
So one has to be prepared that the callbacks are set already, so we 
should be calling them prior to our callback.
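
I.e. the install side stashes whatever callback was there before us so
we can chain to it; presumably something like (sketch of the setup
path, not the actual hunk):

	write_lock_bh(&sk->sk_callback_lock);
	/* remember the previous callback (tls_data_ready when
	 * running on top of kTLS) so nvme_tcp_data_ready() can
	 * chain to it */
	queue->data_ready = sk->sk_data_ready;
	sk->sk_data_ready = nvme_tcp_data_ready;
	write_unlock_bh(&sk->sk_callback_lock);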

>> +    queue = rcu_dereference_sk_user_data(sk);
>> +    if (queue->data_ready)
>> +        queue->data_ready(sk);
> 
> Is the tls data_ready call sync or async? just for general knowledge?
> 
> 
Sync, I guess. Otherwise we wouldn't be needing the lock ...

>>       if (likely(queue && queue->rd_enabled) &&
>>           !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
>>           queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> -    read_unlock_bh(&sk->sk_callback_lock);
>> +    rcu_read_unlock_bh();
>>   }
>>   static void nvme_tcp_write_space(struct sock *sk)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/18] nvme-keyring: register '.nvme' keyring
  2023-03-21 13:50   ` Sagi Grimberg
@ 2023-03-21 14:11     ` Hannes Reinecke
  0 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-21 14:11 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/21/23 14:50, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP.
>> We need a separate keyring as for NVMe there might not be a userspace
>> process attached (eg during reconnect), and so the use of a session
>> keyring or any other process-related keyrings might not be possible.
> 
> So the keys will be stored in the ring such that on any reconnect
> userspace will have access to these keys? How does this affect 
> dh-hmac-chap keys?
> 
Correct.

And it does not affect dh-hmac-chap handling as that implementation 
doesn't use keyrings (yet). That's another patchset which is in the works.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-21 14:09     ` Hannes Reinecke
@ 2023-03-22  0:18       ` Chris Leech
  2023-03-22  6:59         ` Hannes Reinecke
  2023-03-22  8:08       ` Sagi Grimberg
  1 sibling, 1 reply; 90+ messages in thread
From: Chris Leech @ 2023-03-22  0:18 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake

On Tue, Mar 21, 2023 at 03:09:06PM +0100, Hannes Reinecke wrote:
> On 3/21/23 14:44, Sagi Grimberg wrote:
> > 
> > > Call the original data_ready() callback in nvme_tcp_data_ready()
> > > to avoid a receive stall.
> > 
> > Can you please improve the description to include what is the stall?
> > For example, does the stall exist today? If it is, I would like to
> > separate such patches from this set and include them asap.
> > 
> That is actually particular to the TLS implementation, as it uses the
> 'data_ready' callback to produce the data which can be read by eg recvmsg().
> 
> Without this call there's no data to peruse for recvmsg().
> 
> But I'm not _that_ deep into networking details to know whether this is TLS
> specific or an issue with any data_ready callback.
> I assume the latter, but then again, who knows.
> 
> Hence the slightly vague description.

This looks like the socket callbacks end up hooked in the wrong order.
Ideally it would be tcp -> tls -> nvme_tcp, while this currently looks
like tcp -> nvme_tcp and then this call back to tls for decryption.

I'm not quite sure how to untangle this; nvme_tcp can't just set its
own callbacks before initializing kTLS, because that's being done by
tlshd, which is going to need the userspace socket API callbacks working.

- Chris


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-22  0:18       ` Chris Leech
@ 2023-03-22  6:59         ` Hannes Reinecke
  2023-03-22  8:12           ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  6:59 UTC (permalink / raw)
  To: Chris Leech
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake

On 3/22/23 01:18, Chris Leech wrote:
> On Tue, Mar 21, 2023 at 03:09:06PM +0100, Hannes Reinecke wrote:
>> On 3/21/23 14:44, Sagi Grimberg wrote:
>>>
>>>> Call the original data_ready() callback in nvme_tcp_data_ready()
>>>> to avoid a receive stall.
>>>
>>> Can you please improve the description to include what is the stall?
>>> For example, does the stall exist today? If it is, I would like to
>>> separate such patches from this set and include them asap.
>>>
>> That is actually particular to the TLS implementation, as it uses the
>> 'data_ready' callback to produce the data which can be read by eg recvmsg().
>>
>> Without this call there's no data to peruse for recvmsg().
>>
>> But I'm not _that_ deep into networking details to know whether this is TLS
>> specific or an issue with any data_ready callback.
>> I assume the latter, but then again, who knows.
>>
>> Hence the slightly vague description.
> 
> This looks like the socket callbacks end up hooked in the wrong order.
> Ideally it would be tcp -> tls -> nvme_tcp, while this currently looks
> like tcp -> nvme_tcp and then this call back to tls for decryption.
> 
Well, the problem is that I need not one but two sets of callbacks.
One callback is for waking up userspace (took me weeks to figure that
out), and needs to be added before calling the userspace helper.
The other is the 'normal' nvme-tcp callback:

tcp->nvme-upcall->tls->nvme-tcp

So really the problem is not so much an inversion, but rather the fact
that the nvme-upcall callback is really only needed for the duration
of the handshake. And hence I thought that we should remove the callback
once we're done with the upcall.
Turns out that we can't, and the best we can do is to disable the 
functionality, leaving the callback itself in place.

> I'm not quite sure how to untangle this; nvme_tcp can't just set its
> own callbacks before initializing kTLS, because that's being done by
> tlshd, which is going to need the userspace socket API callbacks working.
> 
Correct.
So for now I'll leave the callbacks in place, even though they are 
pointless after the upcall.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS
  2023-03-21 13:59     ` Hannes Reinecke
@ 2023-03-22  8:01       ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:01 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>> On 3/21/23 14:43, Hannes Reinecke wrote:
>>> TLS offload only implements recvmsg(), so implement the receive
>>> side with using recvmsg().
>>
>> I don't really mind changing this, however this change makes us
>> lock the socket for every consumption, instead of taking the lock
>> once and consume as much as possible. Which in theory is suboptimal.
>>
>> Is there any material reason why tls cannot implement read_sock() ?
>>
> Because the 'read_sock()' interface operates on skbs, but for TLS we 
> just have a 'stream' (there is this 'stream parser' thingie handling the 
> data), and skbs are meaningless as the decrypted payload can extend 
> across several skbs.
> At least, that's how I understood that.

I don't see why it can't produce an skb though...
Seems like there is a need here, because I don't know how we'd pass
a digest inline calculation to recvmsg. I'd hate to move the digest
calculation all the way up to the completion and rescan all the
data...
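
(For reference, the hook tls would have to grow is the one tcp already
implements in 'struct proto_ops':

	int (*read_sock)(struct sock *sk, read_descriptor_t *desc,
			 sk_read_actor_t recv_actor);

a tls_sw implementation would walk the already-decrypted records and
hand them to recv_actor as skbs, which would let nvme-tcp keep its
skb-based digest helpers.)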

> But really, the prime reason is that I'm _far_ more familiar with the 
> NVMe code than the tls networking code, so implementing the recvmsg() 
> flow was relatively simple.
> 
> Maybe we can ask Boris Pismenny to implement read_sock() for tls ...

Maybe... CCing Boris.

> 
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 156 ++++++++++++++++++++--------------------
>>>   1 file changed, 77 insertions(+), 79 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 42c0598c31f2..0e14b1b90855 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -529,7 +529,7 @@ static void nvme_tcp_init_recv_ctx(struct 
>>> nvme_tcp_queue *queue)
>>>       queue->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
>>>                   nvme_tcp_hdgst_len(queue);
>>>       queue->pdu_offset = 0;
>>> -    queue->data_remaining = -1;
>>> +    queue->data_remaining = 0;
>>>       queue->ddgst_remaining = 0;
>>>   }
>>> @@ -707,25 +707,32 @@ static int nvme_tcp_handle_r2t(struct 
>>> nvme_tcp_queue *queue,
>>>       return 0;
>>>   }
>>> -static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct 
>>> sk_buff *skb,
>>> -        unsigned int *offset, size_t *len)
>>> +static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool 
>>> pending)
>>>   {
>>>       struct nvme_tcp_hdr *hdr;
>>> -    char *pdu = queue->pdu;
>>> -    size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining);
>>> +    size_t rcv_len = queue->pdu_remaining;
>>> +    struct msghdr msg = {
>>> +        .msg_flags = pending ? 0 : MSG_DONTWAIT,
>>
>> Umm, why?
>> What is the reason to block in this recv?
>>
> To avoid frequent -EAGAIN returns; that looked really ugly in the
> debug logs :-)
> Can try to do away with that; if we do also the 'pending' argument
> can be removed, so might be an idea.

I'd prefer not to block on any socket operation, unless someone
shows that there is a gain to be made somehow.

> 
>>> +    };
>>> +    struct kvec iov = {
>>> +        .iov_base = (u8 *)queue->pdu + queue->pdu_offset,
>>> +        .iov_len = rcv_len,
>>> +    };
>>>       int ret;
>>> -    ret = skb_copy_bits(skb, *offset,
>>> -        &pdu[queue->pdu_offset], rcv_len);
>>> -    if (unlikely(ret))
>>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_PDU)
>>> +        return 0;
>>
>> Why is this check needed? looks like a left-over.
>>
> Yeah.
> 
>>> +
>>> +    ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>>> +                 iov.iov_len, msg.msg_flags);
>>> +    if (ret <= 0)
>>>           return ret;
>>> +    rcv_len = ret;
>>>       queue->pdu_remaining -= rcv_len;
>>>       queue->pdu_offset += rcv_len;
>>> -    *offset += rcv_len;
>>> -    *len -= rcv_len;
>>>       if (queue->pdu_remaining)
>>> -        return 0;
>>> +        return queue->pdu_remaining;
>>>       hdr = queue->pdu;
>>>       if (queue->hdr_digest) {
>>> @@ -734,7 +741,6 @@ static int nvme_tcp_recv_pdu(struct 
>>> nvme_tcp_queue *queue, struct sk_buff *skb,
>>>               return ret;
>>>       }
>>> -
>>>       if (queue->data_digest) {
>>>           ret = nvme_tcp_check_ddgst(queue, queue->pdu);
>>>           if (unlikely(ret))
>>> @@ -765,19 +771,21 @@ static inline void nvme_tcp_end_request(struct 
>>> request *rq, u16 status)
>>>           nvme_complete_rq(rq);
>>>   }
>>> -static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct 
>>> sk_buff *skb,
>>> -                  unsigned int *offset, size_t *len)
>>> +static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>>>   {
>>>       struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>>>       struct request *rq =
>>>           nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>>>       struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
>>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
>>> +        return 0;
>>> +
>>>       while (true) {
>>> -        int recv_len, ret;
>>> +        struct msghdr msg;
>>> +        int ret;
>>> -        recv_len = min_t(size_t, *len, queue->data_remaining);
>>> -        if (!recv_len)
>>> +        if (!queue->data_remaining)
>>>               break;
>>>           if (!iov_iter_count(&req->iter)) {
>>> @@ -798,25 +806,20 @@ static int nvme_tcp_recv_data(struct 
>>> nvme_tcp_queue *queue, struct sk_buff *skb,
>>>           }
>>>           /* we can read only from what is left in this bio */
>>> -        recv_len = min_t(size_t, recv_len,
>>> -                iov_iter_count(&req->iter));
>>> +        memset(&msg, 0, sizeof(msg));
>>> +        msg.msg_iter = req->iter;
>>> -        if (queue->data_digest)
>>> -            ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>>> -                &req->iter, recv_len, queue->rcv_hash);
>>> -        else
>>> -            ret = skb_copy_datagram_iter(skb, *offset,
>>> -                    &req->iter, recv_len);
>>> -        if (ret) {
>>> +        ret = sock_recvmsg(queue->sock, &msg, 0);
>>
>> Who updates the rcv_hash for data digest validation?
>>
> Weelll ... currently, no-one.
> One of the things which I haven't tested.

I see, that obviously needs to work. This is primarily why
I want to have .read_sock for tls.

> 
>>> +        if (ret <= 0) {
>>>               dev_err(queue->ctrl->ctrl.device,
>>> -                "queue %d failed to copy request %#x data",
>>> +                "queue %d failed to receive request %#x data",
>>>                   nvme_tcp_queue_id(queue), rq->tag);
>>>               return ret;
>>>           }
>>> -        *len -= recv_len;
>>> -        *offset += recv_len;
>>> -        queue->data_remaining -= recv_len;
>>> +        queue->data_remaining -= ret;
>>> +        if (queue->data_remaining)
>>> +            nvme_tcp_advance_req(req, ret);
>>>       }
>>>       if (!queue->data_remaining) {
>>> @@ -833,27 +836,36 @@ static int nvme_tcp_recv_data(struct 
>>> nvme_tcp_queue *queue, struct sk_buff *skb,
>>>           }
>>>       }
>>> -    return 0;
>>> +    return queue->data_remaining;
>>>   }
>>> -static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>> -        struct sk_buff *skb, unsigned int *offset, size_t *len)
>>> +static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
>>>   {
>>>       struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>>>       char *ddgst = (char *)&queue->recv_ddgst;
>>> -    size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>>> +    size_t recv_len = queue->ddgst_remaining;
>>>       off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>>> +    struct msghdr msg = {
>>> +        .msg_flags = 0,
>>> +    };
>>> +    struct kvec iov = {
>>> +        .iov_base = (u8 *)ddgst + off,
>>> +        .iov_len = recv_len,
>>> +    };
>>>       int ret;
>>> -    ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>>> -    if (unlikely(ret))
>>> +    if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DDGST)
>>> +        return 0;
>>> +
>>> +    ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
>>> +                 msg.msg_flags);
>>> +    if (ret <= 0)
>>>           return ret;
>>> +    recv_len = ret;
>>>       queue->ddgst_remaining -= recv_len;
>>> -    *offset += recv_len;
>>> -    *len -= recv_len;
>>>       if (queue->ddgst_remaining)
>>> -        return 0;
>>> +        return queue->ddgst_remaining;
>>>       if (queue->recv_ddgst != queue->exp_ddgst) {
>>>           struct request *rq = nvme_cid_to_rq(nvme_tcp_tagset(queue),
>>> @@ -881,37 +893,41 @@ static int nvme_tcp_recv_ddgst(struct 
>>> nvme_tcp_queue *queue,
>>>       return 0;
>>>   }
>>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff 
>>> *skb,
>>> -                 unsigned int offset, size_t len)
>>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, bool 
>>> pending)
>>>   {
>>> -    struct nvme_tcp_queue *queue = desc->arg.data;
>>> -    size_t consumed = len;
>>>       int result;
>>> +    int nr_cqe = queue->nr_cqe;
>>> -    while (len) {
>>> +    do {
>>>           switch (nvme_tcp_recv_state(queue)) {
>>>           case NVME_TCP_RECV_PDU:
>>> -            result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
>>> -            break;
>>> +            result = nvme_tcp_recv_pdu(queue, pending);
>>> +            if (result)
>>> +                break;
>>> +            fallthrough;
>>>           case NVME_TCP_RECV_DATA:
>>> -            result = nvme_tcp_recv_data(queue, skb, &offset, &len);
>>> -            break;
>>> +            result = nvme_tcp_recv_data(queue);
>>> +            if (result)
>>> +                break;
>>> +            fallthrough;
>>>           case NVME_TCP_RECV_DDGST:
>>> -            result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
>>> +            result = nvme_tcp_recv_ddgst(queue);
>>>               break;
>>>           default:
>>>               result = -EFAULT;
>>>           }
>>> -        if (result) {
>>> -            dev_err(queue->ctrl->ctrl.device,
>>> -                "receive failed:  %d\n", result);
>>> -            queue->rd_enabled = false;
>>> -            nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>> -            return result;
>>> -        }
>>> +        if (nr_cqe != queue->nr_cqe)
>>> +            break;
>>> +    } while (result >= 0);
>>> +    if (result < 0 && result != -EAGAIN) {
>>> +        dev_err(queue->ctrl->ctrl.device,
>>> +            "receive failed: %d state %d %s\n",
>>> +            result, nvme_tcp_recv_state(queue),
>>> +            pending ? "pending" : "");
>>
>> I'm unclear why pending would be an input to try_recv. Semantically
>> it is an output, signalling the io_work that data is pending to be
>> reaped from the socket.
>>
> See above. 'pending' is really there to clear the 'NOWAIT' flag for 
> recvmsg(), and to avoid frequent -EAGAIN returns.
> If we're fine handling them it can be removed.

I think we can remove it for now; it will also simplify the code.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-21 14:09     ` Hannes Reinecke
  2023-03-22  0:18       ` Chris Leech
@ 2023-03-22  8:08       ` Sagi Grimberg
  2023-03-22  8:26         ` Hannes Reinecke
  1 sibling, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:08 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 16:09, Hannes Reinecke wrote:
> On 3/21/23 14:44, Sagi Grimberg wrote:
>>
>>> Call the original data_ready() callback in nvme_tcp_data_ready()
>>> to avoid a receive stall.
>>
>> Can you please improve the description to include what is the stall?
>> For example, does the stall exist today? If it is, I would like to
>> separate such patches from this set and include them asap.
>>
> That is actually particular to the TLS implementation, as it uses the 
> 'data_ready' callback to produce the data which can be read by eg 
> recvmsg().
> 
> Without this call there's no data to peruse for recvmsg().
> 
> But I'm not _that_ deep into networking details to know whether this is 
> TLS specific or an issue with any data_ready callback.
> I assume the latter, but then again, who knows.

Seems that this is only relevant when nvme_tcp runs on top of a ulp,
so a code comment would probably make sense here.

> 
> Hence the slightly vague description.
> 
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 8 +++++---
>>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0e14b1b90855..0512eb289dcf 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -936,12 +936,14 @@ static void nvme_tcp_data_ready(struct sock *sk)
>>>       trace_sk_data_ready(sk);
>>> -    read_lock_bh(&sk->sk_callback_lock);
>>> -    queue = sk->sk_user_data;
>>> +    rcu_read_lock_bh();
>>
>> Now I understand your comment from a previous patch.
>> Can you explain why is this convention needed?
>>
>> I would prefer to have it as a separate patch with an
>> explanation to why it is needed.
>>
> This is the slightly odd socket callback handling.
> Any driver is free to set the socket callbacks, but it has to be aware 
> that it might not be the only one in the stack doing so.
> So one has to be prepared that the callbacks are set already, so we 
> should be calling them prior to our callback.

I meant the change from read_lock_bh to rcu_read_lock_bh and the
same for rcu_dereference_sk_user_data. It needs to be in a
separate patch with explanation to why it is needed.

> 
>>> +    queue = rcu_dereference_sk_user_data(sk);
>>> +    if (queue->data_ready)
>>> +        queue->data_ready(sk);
>>
>> Is the tls data_ready call sync or async? just for general knowledge?
>>
>>
> Sync, I guess. Otherwise we wouldn't be needing the lock ...

OK.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-22  6:59         ` Hannes Reinecke
@ 2023-03-22  8:12           ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:12 UTC (permalink / raw)
  To: Hannes Reinecke, Chris Leech
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake



On 3/22/23 08:59, Hannes Reinecke wrote:
> On 3/22/23 01:18, Chris Leech wrote:
>> On Tue, Mar 21, 2023 at 03:09:06PM +0100, Hannes Reinecke wrote:
>>> On 3/21/23 14:44, Sagi Grimberg wrote:
>>>>
>>>>> Call the original data_ready() callback in nvme_tcp_data_ready()
>>>>> to avoid a receive stall.
>>>>
>>>> Can you please improve the description to include what is the stall?
>>>> For example, does the stall exist today? If it is, I would like to
>>>> separate such patches from this set and include them asap.
>>>>
>>> That is actually particular to the TLS implementation, as it uses the
>>> 'data_ready' callback to produce the data which can be read by eg 
>>> recvmsg().
>>>
>>> Without this call there's no data to peruse for recvmsg().
>>>
>>> But I'm not _that_ deep into networking details to know whether this 
>>> is TLS
>>> specific or an issue with any data_ready callback.
>>> I assume the latter, but then again, who knows.
>>>
>>> Hence the slightly vague description.
>>
>> This looks like the socket callbacks end up hooked in the wrong order.
>> Ideally it would be tcp -> tls -> nvme_tcp, while this currently looks
>> like tcp -> nvme_tcp and then this call back to tls for decryption.
>>
> Well, the problem is that I need not one but two sets of callbacks.
> One callback is for waking up userspace (took me weeks to figure that
> out), and needs to be added before calling the userspace helper.
> The other is the 'normal' nvme-tcp callback:
> 
> tcp->nvme-upcall->tls->nvme-tcp
> 
> So really the problem is not so much an inversion, but rather the fact
> that the nvme-upcall callback is really only needed for the duration
> of the handshake. And hence I thought that we should remove the callback
> once we're done with the upcall.

What do you mean remove the callback? data_ready? I'm not sure I'm
following.

> Turns out that we can't, and the best we can do is to disable the 
> functionality, leaving the callback itself in place.
> 
>> I'm not quite sure how to untangle this; nvme_tcp can't just set its
>> own callbacks before initializing kTLS, because that's being done by
>> tlshd, which is going to need the userspace socket API callbacks working.
>>
> Correct.
> So for now I'll leave the callbacks in place, even though they are 
> pointless after the upcall.

Does it make any difference now that the callback setup moved to
nvme_tcp_start_queue?

Again, I'm not sure I understand what callback is pointless?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-21 13:30   ` Hannes Reinecke
@ 2023-03-22  8:16     ` Sagi Grimberg
  2023-03-22  8:28       ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:16 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>> Hi all,
>>>
>>> finally I've managed to put all things together and enable in-kernel
>>> TLS support for NVMe-over-TCP.
>>
>> Hannes (and Chuck) this is great, I'm very happy to see this!
>>
>> I'll start a detailed review soon enough.
>>
>> Thank you for doing this.
>>
>>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>>> posted to the linux netdev list), and requires the 'tlshd' userspace
>>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>>> handshake.
>>
>> Do you have an actual link to follow for this patch set?
> 
> Sure.
> 
> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
> branch tls-netlink.v7

I meant Chuck's posting on linux-netdev.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites
  2023-03-21 12:43 ` [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites Hannes Reinecke
@ 2023-03-22  8:18   ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:18 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-22  8:08       ` Sagi Grimberg
@ 2023-03-22  8:26         ` Hannes Reinecke
  2023-03-22 10:13           ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  8:26 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 09:08, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 16:09, Hannes Reinecke wrote:
>> On 3/21/23 14:44, Sagi Grimberg wrote:
>>>
>>>> Call the original data_ready() callback in nvme_tcp_data_ready()
>>>> to avoid a receive stall.
>>>
>>> Can you please improve the description to include what is the stall?
>>> For example, does the stall exist today? If it is, I would like to
>>> separate such patches from this set and include them asap.
>>>
>> That is actually particular to the TLS implementation, as it uses the 
>> 'data_ready' callback to produce the data which can be read by eg 
>> recvmsg().
>>
>> Without this call there's no data to peruse for recvmsg().
>>
>> But I'm not _that_ deep into networking details to know whether this 
>> is TLS specific or an issue with any data_ready callback.
>> I assume the latter, but then again, who knows.
> 
> Seems that this is only relevant when nvme_tcp runs on top of a ulp,
> so a code comment would probably make sense here.
> 
>>
>> Hence the slightly vague description.
>>
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>> ---
>>>>   drivers/nvme/host/tcp.c | 8 +++++---
>>>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 0e14b1b90855..0512eb289dcf 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -936,12 +936,14 @@ static void nvme_tcp_data_ready(struct sock *sk)
>>>>       trace_sk_data_ready(sk);
>>>> -    read_lock_bh(&sk->sk_callback_lock);
>>>> -    queue = sk->sk_user_data;
>>>> +    rcu_read_lock_bh();
>>>
>>> Now I understand your comment from a previous patch.
>>> Can you explain why is this convention needed?
>>>
>>> I would prefer to have it as a separate patch with an
>>> explanation to why it is needed.
>>>
>> This is the slightly odd socket callback handling.
>> Any driver is free to set the socket callbacks, but it has to be aware 
>> that it might not be the only one in the stack doing so.
>> So one has to be prepared that the callbacks are set already, so we 
>> should be calling them prior to our callback.
> 
> I meant the change from read_lock_bh to rcu_read_lock_bh and the
> same for rcu_dereference_sk_user_data. It needs to be in a
> separate patch with explanation to why it is needed.
> 
That is primarily for sanity checking.

Both we _and_ the tls code are setting the sk_user_data pointer,
so I wanted to make sure that we get things right.
But it seems to work now, so I guess I can drop the rcu_dereference
modifications.

Let's see.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-22  8:16     ` Sagi Grimberg
@ 2023-03-22  8:28       ` Hannes Reinecke
  2023-03-22 12:53         ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  8:28 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 09:16, Sagi Grimberg wrote:
> 
>>>> Hi all,
>>>>
>>>> finally I've managed to put all things together and enable in-kernel
>>>> TLS support for NVMe-over-TCP.
>>>
>>> Hannes (and Chuck) this is great, I'm very happy to see this!
>>>
>>> I'll start a detailed review soon enough.
>>>
>>> Thank you for doing this.
>>>
>>>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>>>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>>>> posted to the linux netdev list), and requires the 'tlshd' userspace
>>>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>>>> handshake.
>>>
>>> Do you have an actual link to follow for this patch set?
>>
>> Sure.
>>
>> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
>> branch tls-netlink.v7
> 
> I meant Chuck's posting on linux-netdev.

To be found here:

<https://www.spinics.net/lists/netdev/msg890047.html>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/18] nvme-keyring: define a 'psk' keytype
  2023-03-21 12:43 ` [PATCH 02/18] nvme-keyring: define a 'psk' keytype Hannes Reinecke
@ 2023-03-22  8:29   ` Sagi Grimberg
  2023-03-22  8:38     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:29 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> Define a 'psk' keytype to hold the NVMe TLS PSKs.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/common/keyring.c | 96 +++++++++++++++++++++++++++++++++++
>   include/linux/nvme-keyring.h  |  8 +++
>   2 files changed, 104 insertions(+)
> 
> diff --git a/drivers/nvme/common/keyring.c b/drivers/nvme/common/keyring.c
> index 3a6e8a0b38e2..6cbb9d66e0f6 100644
> --- a/drivers/nvme/common/keyring.c
> +++ b/drivers/nvme/common/keyring.c
> @@ -11,6 +11,96 @@
>   
>   static struct key *nvme_keyring;
>   
> +key_serial_t nvme_keyring_id(void)
> +{
> +	return nvme_keyring->serial;
> +}
> +EXPORT_SYMBOL_GPL(nvme_keyring_id);
> +
> +static void nvme_tls_psk_describe(const struct key *key, struct seq_file *m)
> +{
> +	seq_puts(m, key->description);
> +	seq_printf(m, ": %u", key->datalen);
> +}
> +
> +static bool nvme_tls_psk_match(const struct key *key,
> +			       const struct key_match_data *match_data)
> +{
> +	const char *match_id;
> +	size_t match_len;
> +
> +	if (!key->description) {
> +		pr_debug("%s: no key description\n", __func__);
> +		return false;
> +	}
> +	match_len = strlen(key->description);
> +	pr_debug("%s: id %s len %zd\n", __func__, key->description, match_len);
> +
> +	if (!match_data->raw_data) {
> +		pr_debug("%s: no match data\n", __func__);
> +		return false;
> +	}
> +	match_id = match_data->raw_data;
> +	pr_debug("%s: match '%s' '%s' len %lu\n",
> +		 __func__, match_id, key->description, match_len);
> +	return !memcmp(key->description, match_id, match_len);
> +}
> +
> +static int nvme_tls_psk_match_preparse(struct key_match_data *match_data)
> +{
> +	match_data->lookup_type = KEYRING_SEARCH_LOOKUP_ITERATE;
> +	match_data->cmp = nvme_tls_psk_match;
> +	return 0;
> +}
> +
> +static struct key_type nvme_tls_psk_key_type = {
> +	.name           = "psk",
> +	.flags          = KEY_TYPE_NET_DOMAIN,
> +	.preparse       = user_preparse,
> +	.free_preparse  = user_free_preparse,
> +	.match_preparse = nvme_tls_psk_match_preparse,
> +	.instantiate    = generic_key_instantiate,
> +	.revoke         = user_revoke,
> +	.destroy        = user_destroy,
> +	.describe       = nvme_tls_psk_describe,
> +	.read           = user_read,
> +};
> +

Hannes, can you please provide a documentation section
for this function? Most importantly the 'generated' argument.

> +struct key *nvme_tls_psk_lookup(key_ref_t keyring,
> +		const char *hostnqn, const char *subnqn,
> +		int hmac, bool generated)
> +{
> +	char *identity;
> +	size_t identity_len = (NVMF_NQN_SIZE) * 2 + 11;
> +	key_ref_t keyref;
> +	key_serial_t keyring_id;
> +
> +	identity = kzalloc(identity_len, GFP_KERNEL);
> +	if (!identity)
> +		return ERR_PTR(-ENOMEM);
> +
> +	snprintf(identity, identity_len, "NVMe0%c%02d %s %s",
> +		 generated ? 'G' : 'R', hmac, hostnqn, subnqn);

Is that a format that is expected from userspace to produce?


> +
> +	if (!keyring)
> +		keyring = make_key_ref(nvme_keyring, true);
> +	keyring_id = key_serial(key_ref_to_ptr(keyring));
> +	pr_debug("keyring %x lookup tls psk '%s'\n",
> +		 keyring_id, identity);
> +	keyref = keyring_search(keyring, &nvme_tls_psk_key_type,
> +				identity, false);
> +	if (IS_ERR(keyref)) {
> +		pr_debug("lookup tls psk '%s' failed, error %ld\n",
> +			 identity, PTR_ERR(keyref));
> +		kfree(identity);
> +		return ERR_PTR(-ENOKEY);
> +	}
> +	kfree(identity);
> +
> +	return key_ref_to_ptr(keyref);
> +}
> +EXPORT_SYMBOL_GPL(nvme_tls_psk_lookup);
> +
>   int nvme_keyring_init(void)
>   {
>   	int err;
> @@ -24,12 +114,18 @@ int nvme_keyring_init(void)
>   	if (IS_ERR(nvme_keyring))
>   		return PTR_ERR(nvme_keyring);
>   
> +	err = register_key_type(&nvme_tls_psk_key_type);
> +	if (err) {
> +		key_put(nvme_keyring);
> +		return err;
> +	}
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(nvme_keyring_init);
>   
>   void nvme_keyring_exit(void)
>   {
> +	unregister_key_type(&nvme_tls_psk_key_type);
>   	key_revoke(nvme_keyring);
>   	key_put(nvme_keyring);
>   }
> diff --git a/include/linux/nvme-keyring.h b/include/linux/nvme-keyring.h
> index a875c06cc922..c0c3d934f474 100644
> --- a/include/linux/nvme-keyring.h
> +++ b/include/linux/nvme-keyring.h
> @@ -6,6 +6,14 @@
>   #ifndef _NVME_KEYRING_H
>   #define _NVME_KEYRING_H
>   
> +#include <linux/key.h>
> +
> +struct key *nvme_tls_psk_lookup(key_ref_t keyring,
> +				const char *hostnqn, const char *subnqn,
> +				int hmac, bool generated);
> +
> +key_serial_t nvme_keyring_id(void);
> +
>   int nvme_keyring_init(void);
>   void nvme_keyring_exit(void);
>   

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/18] nvme-keyring: define a 'psk' keytype
  2023-03-22  8:29   ` Sagi Grimberg
@ 2023-03-22  8:38     ` Hannes Reinecke
  2023-03-22  8:49       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  8:38 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 09:29, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> Define a 'psk' keytype to hold the NVMe TLS PSKs.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/common/keyring.c | 96 +++++++++++++++++++++++++++++++++++
>>   include/linux/nvme-keyring.h  |  8 +++
>>   2 files changed, 104 insertions(+)
>>
>> diff --git a/drivers/nvme/common/keyring.c 
>> b/drivers/nvme/common/keyring.c
>> index 3a6e8a0b38e2..6cbb9d66e0f6 100644
>> --- a/drivers/nvme/common/keyring.c
>> +++ b/drivers/nvme/common/keyring.c
>> @@ -11,6 +11,96 @@
>>   static struct key *nvme_keyring;
>> +key_serial_t nvme_keyring_id(void)
>> +{
>> +    return nvme_keyring->serial;
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_keyring_id);
>> +
>> +static void nvme_tls_psk_describe(const struct key *key, struct 
>> seq_file *m)
>> +{
>> +    seq_puts(m, key->description);
>> +    seq_printf(m, ": %u", key->datalen);
>> +}
>> +
>> +static bool nvme_tls_psk_match(const struct key *key,
>> +                   const struct key_match_data *match_data)
>> +{
>> +    const char *match_id;
>> +    size_t match_len;
>> +
>> +    if (!key->description) {
>> +        pr_debug("%s: no key description\n", __func__);
>> +        return false;
>> +    }
>> +    match_len = strlen(key->description);
>> +    pr_debug("%s: id %s len %zd\n", __func__, key->description, 
>> match_len);
>> +
>> +    if (!match_data->raw_data) {
>> +        pr_debug("%s: no match data\n", __func__);
>> +        return false;
>> +    }
>> +    match_id = match_data->raw_data;
>> +    pr_debug("%s: match '%s' '%s' len %lu\n",
>> +         __func__, match_id, key->description, match_len);
>> +    return !memcmp(key->description, match_id, match_len);
>> +}
>> +
>> +static int nvme_tls_psk_match_preparse(struct key_match_data 
>> *match_data)
>> +{
>> +    match_data->lookup_type = KEYRING_SEARCH_LOOKUP_ITERATE;
>> +    match_data->cmp = nvme_tls_psk_match;
>> +    return 0;
>> +}
>> +
>> +static struct key_type nvme_tls_psk_key_type = {
>> +    .name           = "psk",
>> +    .flags          = KEY_TYPE_NET_DOMAIN,
>> +    .preparse       = user_preparse,
>> +    .free_preparse  = user_free_preparse,
>> +    .match_preparse = nvme_tls_psk_match_preparse,
>> +    .instantiate    = generic_key_instantiate,
>> +    .revoke         = user_revoke,
>> +    .destroy        = user_destroy,
>> +    .describe       = nvme_tls_psk_describe,
>> +    .read           = user_read,
>> +};
>> +
> 
> Hannes, can you please provide a documentation section
> to this function? most importantly 'generated' argument.
> 
Sure. There are two types of PSKs specified in the NVMe TCP spec:
a 'retained' one, which has to be provided by the admin, and a
'generated' one, which is derived from the shared key material from
DH-HMAC-CHAP when doing secure channel concatenation.
And each type has two possible hash algorithms (SHA-256 and SHA-384),
resulting in 4 possible PSKs.

And indeed, the secure channel concatenation bits are missing as we're
still hashing out details at the fmds group.
I do have a patchset for that, but decided not to include it in
this submission as it'll increase the patchset even more.
Can do if you like ...

But will be updating the documentation.

>> +struct key *nvme_tls_psk_lookup(key_ref_t keyring,
>> +        const char *hostnqn, const char *subnqn,
>> +        int hmac, bool generated)
>> +{
>> +    char *identity;
>> +    size_t identity_len = (NVMF_NQN_SIZE) * 2 + 11;
>> +    key_ref_t keyref;
>> +    key_serial_t keyring_id;
>> +
>> +    identity = kzalloc(identity_len, GFP_KERNEL);
>> +    if (!identity)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    snprintf(identity, identity_len, "NVMe0%c%02d %s %s",
>> +         generated ? 'G' : 'R', hmac, hostnqn, subnqn);
> 
> Is that a format that is expected from userspace to produce?
> 
Yes. See
NVMe-TCP 1.0a section 3.6.1.3 "TLS PSK and PSK Identity Derivation"
for details.
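
So, schematically, the four identities this lookup can produce are
(assuming the cipher enum values map to the 01/02 HMAC identifiers
from the spec):

	NVMe0R01 <hostnqn> <subnqn>	retained PSK, HMAC SHA-256
	NVMe0R02 <hostnqn> <subnqn>	retained PSK, HMAC SHA-384
	NVMe0G01 <hostnqn> <subnqn>	generated PSK, HMAC SHA-256
	NVMe0G02 <hostnqn> <subnqn>	generated PSK, HMAC SHA-384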

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-21 12:43 ` [PATCH 08/18] nvme-tcp: enable TLS handshake upcall Hannes Reinecke
@ 2023-03-22  8:45   ` Sagi Grimberg
  2023-03-22  9:12     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:45 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> Select possible PSK identities and call the TLS handshake upcall
> for each identity.
> The TLS 1.3 RFC allows to send multiple identities with each ClientHello
> request, but none of the SSL libraries implement it. As the connection
> is established when the association is created we send only a single
> identity for each upcall, and close the connection to restart with
> the next identity if the handshake fails.

Can't this loop be done in userspace? In other words, how do
we get rid of this once SSL libs decide to support it?

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 157 +++++++++++++++++++++++++++++++++++++---
>   1 file changed, 148 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0438d42f4179..bcf24e9a08e1 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -8,9 +8,12 @@
>   #include <linux/init.h>
>   #include <linux/slab.h>
>   #include <linux/err.h>
> +#include <linux/key.h>
>   #include <linux/nvme-tcp.h>
> +#include <linux/nvme-keyring.h>
>   #include <net/sock.h>
>   #include <net/tcp.h>
> +#include <net/handshake.h>
>   #include <linux/blk-mq.h>
>   #include <crypto/hash.h>
>   #include <net/busy_poll.h>
> @@ -31,6 +34,14 @@ static int so_priority;
>   module_param(so_priority, int, 0644);
>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>   
> +/*
> + * TLS handshake timeout
> + */
> +static int tls_handshake_timeout = 10;
> +module_param(tls_handshake_timeout, int, 0644);
> +MODULE_PARM_DESC(tls_handshake_timeout,
> +		 "nvme TLS handshake timeout in seconds (default 10)");

Can you share what is the normal time of an upcall?

> +
>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   /* lockdep can detect a circular dependency of the form
>    *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
> @@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_TLS		= 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -148,6 +160,9 @@ struct nvme_tcp_queue {
>   	__le32			exp_ddgst;
>   	__le32			recv_ddgst;
>   
> +	struct completion       *tls_complete;
> +	int                     tls_err;
> +
>   	struct page_frag_cache	pf_cache;
>   
>   	void (*state_change)(struct sock *);
> @@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>   	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
>   }
>   
> -static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
> +/*
> + * nvme_tcp_lookup_psk - Look up PSKs to use for TLS
> + *
> + */
> +static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
> +			       key_serial_t *keylist, int num_keys)

Where is num_keys used?

> +{
> +	enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
> +	struct key *tls_key;
> +	int num = 0;
> +	bool generated = false;
> +
> +	/* Check for pre-provisioned keys; retained keys first */
> +	do {
> +		tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
> +					      nctrl->opts->subsysnqn,
> +					      cipher, generated);
> +		if (!IS_ERR(tls_key)) {
> +			keylist[num] = tls_key->serial;
> +			num++;
> +			key_put(tls_key);
> +		}
> +		if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
> +			cipher = NVME_TCP_TLS_CIPHER_SHA256;
> +		else {
> +			if (generated)
> +				cipher = NVME_TCP_TLS_CIPHER_INVALID;
> +			else {
> +				cipher = NVME_TCP_TLS_CIPHER_SHA384;
> +				generated = true;
> +			}
> +		}
> +	} while(cipher != NVME_TCP_TLS_CIPHER_INVALID);

I'm unclear about a few things here:
1. What is the meaning of pre-provisioned vs. retained vs. generated?
2. Can this loop be reorganized into a nested for loop with a break?
   I'm wondering if it will make it simpler to read.
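
Something like this, perhaps (untested sketch):

	static const enum nvme_tcp_tls_cipher ciphers[] = {
		NVME_TCP_TLS_CIPHER_SHA384,
		NVME_TCP_TLS_CIPHER_SHA256,
	};
	int g, c, num = 0;

	/* retained keys first, then generated ones */
	for (g = 0; g < 2; g++) {
		for (c = 0; c < ARRAY_SIZE(ciphers); c++) {
			tls_key = nvme_tls_psk_lookup(NULL,
					nctrl->opts->host->nqn,
					nctrl->opts->subsysnqn,
					ciphers[c], g == 1);
			if (IS_ERR(tls_key))
				continue;
			keylist[num++] = tls_key->serial;
			key_put(tls_key);
			if (num == num_keys)	/* honour the array bound */
				return num;
		}
	}
	return num;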

> +	return num;
> +}
> +
> +static void nvme_tcp_tls_done(void *data, int status, key_serial_t peerid)
> +{
> +	struct nvme_tcp_queue *queue = data;
> +	struct nvme_tcp_ctrl *ctrl = queue->ctrl;
> +	int qid = nvme_tcp_queue_id(queue);
> +
> +	dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key %x, status %d\n",
> +		qid, peerid, status);
> +
> +	queue->tls_err = -status;
> +	if (queue->tls_complete)
> +		complete(queue->tls_complete);
> +}
> +
> +static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
> +			      struct nvme_tcp_queue *queue,
> +			      key_serial_t peerid)
> +{
> +	int qid = nvme_tcp_queue_id(queue);
> +	int ret;
> +	struct tls_handshake_args args;
> +	unsigned long tmo = tls_handshake_timeout * HZ;
> +	DECLARE_COMPLETION_ONSTACK(tls_complete);
> +
> +	dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
> +		qid, peerid);
> +	args.ta_sock = queue->sock;
> +	args.ta_done = nvme_tcp_tls_done;
> +	args.ta_data = queue;
> +	args.ta_my_peerids[0] = peerid;
> +	args.ta_num_peerids = 1;
> +	args.ta_keyring = nvme_keyring_id();
> +	args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
> +	queue->tls_err = -EOPNOTSUPP;
> +	queue->tls_complete = &tls_complete;
> +	ret = tls_client_hello_psk(&args, GFP_KERNEL);
> +	if (ret) {
> +		dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
> +			qid, ret);
> +		return ret;
> +	}
> +	if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
> +		dev_dbg(nctrl->device,
> +			"queue %d: TLS handshake timeout\n", qid);
> +		queue->tls_complete = NULL;
> +		ret = -ETIMEDOUT;
> +	} else {
> +		dev_dbg(nctrl->device,
> +			"queue %d: TLS handshake complete, error %d\n",
> +			qid, queue->tls_err);
> +		ret = queue->tls_err;
> +	}
> +	queue->tls_complete = NULL;
> +	if (!ret)
> +		set_bit(NVME_TCP_Q_TLS, &queue->flags);
> +	return ret;
> +}
> +
> +static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
> +				key_serial_t peerid)
>   {
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>   	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
> @@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>   		goto err_rcv_pdu;
>   	}
>   
> +	/* If PSKs are configured try to start TLS */
> +	if (peerid) {

Where is peerid being initialized? Not to mention that peerid is
a rather cryptic name (at least to me). Is this the ClientHello
identity?

> +		ret = nvme_tcp_start_tls(nctrl, queue, peerid);
> +		if (ret)
> +			goto err_init_connect;
> +	}
> +
>   	ret = nvme_tcp_init_connection(queue);
>   	if (ret)
>   		goto err_init_connect;
> @@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
>   
>   static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>   {
> -	int ret;
> +	int ret = -EINVAL, num_keys, k;
> +	key_serial_t keylist[4];
>   
> -	ret = nvme_tcp_alloc_queue(ctrl, 0);
> -	if (ret)
> -		return ret;
> +	memset(keylist, 0, sizeof(key_serial_t));
> +	num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
> +	for (k = 0; k < num_keys; k++) {
> +		ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
> +		if (!ret)
> +			break;
> +	}
> +	if (ret) {
> +		/* Try without TLS */

Why? This is trying to always connect with tls and fall back to no-tls?
Why not simply do what userspace is asking us to do?

Seems backwards to me. Unless there is a statement in the spec
that I'm not aware of which tells us to do so.

> +		ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
> +		if (ret)
> +			goto out_free_queue;
> +	}
>   
>   	ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
>   	if (ret)
> @@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>   
>   static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>   {
> -	int i, ret;
> +	int i, ret, num_keys = 0, k;
> +	key_serial_t keylist[4];
>   
> +	memset(keylist, 0, sizeof(key_serial_t));
> +	num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>   	for (i = 1; i < ctrl->queue_count; i++) {
> -		ret = nvme_tcp_alloc_queue(ctrl, i);
> -		if (ret)
> -			goto out_free_queues;
> +		ret = -EINVAL;
> +		for (k = 0; k < num_keys; k++) {
> +			ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
> +			if (!ret)
> +				break;

What is going on here? Are you establishing queue_count x num_keys nvme
queues?


> +		}
> +		if (ret) {
> +			ret = nvme_tcp_alloc_queue(ctrl, i, 0);
> +			if (ret)
> +				goto out_free_queues;
> +		}
>   	}
>   
>   	return 0;

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/18] nvme-keyring: define a 'psk' keytype
  2023-03-22  8:38     ` Hannes Reinecke
@ 2023-03-22  8:49       ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  8:49 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>> On 3/21/23 14:43, Hannes Reinecke wrote:
>>> Define a 'psk' keytype to hold the NVMe TLS PSKs.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/common/keyring.c | 96 +++++++++++++++++++++++++++++++++++
>>>   include/linux/nvme-keyring.h  |  8 +++
>>>   2 files changed, 104 insertions(+)
>>>
>>> diff --git a/drivers/nvme/common/keyring.c 
>>> b/drivers/nvme/common/keyring.c
>>> index 3a6e8a0b38e2..6cbb9d66e0f6 100644
>>> --- a/drivers/nvme/common/keyring.c
>>> +++ b/drivers/nvme/common/keyring.c
>>> @@ -11,6 +11,96 @@
>>>   static struct key *nvme_keyring;
>>> +key_serial_t nvme_keyring_id(void)
>>> +{
>>> +    return nvme_keyring->serial;
>>> +}
>>> +EXPORT_SYMBOL_GPL(nvme_keyring_id);
>>> +
>>> +static void nvme_tls_psk_describe(const struct key *key, struct 
>>> seq_file *m)
>>> +{
>>> +    seq_puts(m, key->description);
>>> +    seq_printf(m, ": %u", key->datalen);
>>> +}
>>> +
>>> +static bool nvme_tls_psk_match(const struct key *key,
>>> +                   const struct key_match_data *match_data)
>>> +{
>>> +    const char *match_id;
>>> +    size_t match_len;
>>> +
>>> +    if (!key->description) {
>>> +        pr_debug("%s: no key description\n", __func__);
>>> +        return false;
>>> +    }
>>> +    match_len = strlen(key->description);
>>> +    pr_debug("%s: id %s len %zd\n", __func__, key->description, 
>>> match_len);
>>> +
>>> +    if (!match_data->raw_data) {
>>> +        pr_debug("%s: no match data\n", __func__);
>>> +        return false;
>>> +    }
>>> +    match_id = match_data->raw_data;
>>> +    pr_debug("%s: match '%s' '%s' len %lu\n",
>>> +         __func__, match_id, key->description, match_len);
>>> +    return !memcmp(key->description, match_id, match_len);
>>> +}
>>> +
>>> +static int nvme_tls_psk_match_preparse(struct key_match_data 
>>> *match_data)
>>> +{
>>> +    match_data->lookup_type = KEYRING_SEARCH_LOOKUP_ITERATE;
>>> +    match_data->cmp = nvme_tls_psk_match;
>>> +    return 0;
>>> +}
>>> +
>>> +static struct key_type nvme_tls_psk_key_type = {
>>> +    .name           = "psk",
>>> +    .flags          = KEY_TYPE_NET_DOMAIN,
>>> +    .preparse       = user_preparse,
>>> +    .free_preparse  = user_free_preparse,
>>> +    .match_preparse = nvme_tls_psk_match_preparse,
>>> +    .instantiate    = generic_key_instantiate,
>>> +    .revoke         = user_revoke,
>>> +    .destroy        = user_destroy,
>>> +    .describe       = nvme_tls_psk_describe,
>>> +    .read           = user_read,
>>> +};
>>> +
>>
>> Hannes, can you please provide a documentation section
>> to this function? most importantly 'generated' argument.
>>
> Sure. There are two types of PSK types specified in the NVMe TCP spec;
> a 'retained' one which has to be provided by the admin, and a 
> 'generated' one which is derived from the shared key material from 
> DH-HMAC-CHAP when doing a secure channel concatenation.
> And each type has two possible hash algorithms (SHA-256 and SHA-384),
> resulting in 4 possible PSKs.
> 
> And indeed, the secure channel concatenation bits are missing as we're 
> still hashing out details at the fmds group.
> I do have a patchset for that, though, but decided not to include it in 
> this submission as it'll increase the patchset even more.
> Can do if you like ...

I'd prefer to split the secure channel concatenation out of this.
But it would be good to give some context in the code.

Also the bool is confusing, can you make this an enumeration that
translates to 'G' or 'R' and pass that in?

> 
> But will be updating the documentation.
> 
>>> +struct key *nvme_tls_psk_lookup(key_ref_t keyring,
>>> +        const char *hostnqn, const char *subnqn,
>>> +        int hmac, bool generated)
>>> +{
>>> +    char *identity;
>>> +    size_t identity_len = (NVMF_NQN_SIZE) * 2 + 11;
>>> +    key_ref_t keyref;
>>> +    key_serial_t keyring_id;
>>> +
>>> +    identity = kzalloc(identity_len, GFP_KERNEL);
>>> +    if (!identity)
>>> +        return ERR_PTR(-ENOMEM);
>>> +
>>> +    snprintf(identity, identity_len, "NVMe0%c%02d %s %s",
>>> +         generated ? 'G' : 'R', hmac, hostnqn, subnqn);
>>
>> Is that a format that is expected from userspace to produce?
>>
> Yes. See
> NVMe-TCP 1.0a section 3.6.1.3 "TLS PSK and PSK Identity Derivation"
> for details.
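> 
> For example, a retained SHA-256 PSK for made-up NQNs would yield an
> identity like (assuming the spec's hash indicator for SHA-256 is 01):
> 
>    NVMe0R01 nqn.2014-08.org.example:host nqn.2014-08.org.example:subsys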
> 

Yes, I see it now. Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-22  8:45   ` Sagi Grimberg
@ 2023-03-22  9:12     ` Hannes Reinecke
  2023-03-22 10:56       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  9:12 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 09:45, Sagi Grimberg wrote:
> 
>> Select possible PSK identities and call the TLS handshake upcall
>> for each identity.
>> The TLS 1.3 RFC allows sending multiple identities with each ClientHello
>> request, but none of the SSL libraries implement it. As the connection
>> is established when the association is created, we send only a single
>> identity for each upcall, and close the connection to restart with
>> the next identity if the handshake fails.
> 
> Can't this loop be done in userspace? In other words, how can
> we get rid of this when SSL libs would decide to support it?
> 
Well, that is something I've been thinking about, but I really
haven't come to a good solution.

Crux of the matter is that we have to close the connection after a 
failed TLS handshake:

 > A TLS 1.3 client implementation that only supports sending a single
 > PSK identity during connection setup may be required to connect
 > multiple times in order to negotiate cipher suites with different hash
 > functions.

and as it's quite unclear in which state the connection is after the 
userspace library failed the handshake.
So the only good way to recover is to close the connection and restart 
with a different identity.

While we can move the identity selection to userspace (eg by providing
a 'tls_psk' fabrics option holding the key serial of the PSK to use),
that would allow us to pass only a _single_ PSK for each attempt.

And the other problem is that in its current form the spec allows for 
_different_ identities for each connection; by passing a key from 
userspace we would not be able to support that.
(Not saying that it's useful, mind.)

We could allow for several 'tls_psk' options, though; maybe that would 
be a way out.

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 157 +++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 148 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0438d42f4179..bcf24e9a08e1 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -8,9 +8,12 @@
>>   #include <linux/init.h>
>>   #include <linux/slab.h>
>>   #include <linux/err.h>
>> +#include <linux/key.h>
>>   #include <linux/nvme-tcp.h>
>> +#include <linux/nvme-keyring.h>
>>   #include <net/sock.h>
>>   #include <net/tcp.h>
>> +#include <net/handshake.h>
>>   #include <linux/blk-mq.h>
>>   #include <crypto/hash.h>
>>   #include <net/busy_poll.h>
>> @@ -31,6 +34,14 @@ static int so_priority;
>>   module_param(so_priority, int, 0644);
>>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>> +/*
>> + * TLS handshake timeout
>> + */
>> +static int tls_handshake_timeout = 10;
>> +module_param(tls_handshake_timeout, int, 0644);
>> +MODULE_PARM_DESC(tls_handshake_timeout,
>> +         "nvme TLS handshake timeout in seconds (default 10)");
> 
> Can you share what is the normal time of an upcall?
> 
That really depends on the network latency and/or reachability of the 
server. It might just have been started up, switch MAC tables not yet
updated, STP still ongoing, what do I know.
So 10 seconds seemed to be a good compromise.
But that's also why I made this configurable.
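(Being a 0644 module parameter it can also be changed at runtime, via
/sys/module/nvme_tcp/parameters/tls_handshake_timeout, assuming the
nvme_tcp module name.)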

>> +
>>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>   /* lockdep can detect a circular dependency of the form
>>    *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>> @@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
>>       NVME_TCP_Q_ALLOCATED    = 0,
>>       NVME_TCP_Q_LIVE        = 1,
>>       NVME_TCP_Q_POLLING    = 2,
>> +    NVME_TCP_Q_TLS        = 3,
>>   };
>>   enum nvme_tcp_recv_state {
>> @@ -148,6 +160,9 @@ struct nvme_tcp_queue {
>>       __le32            exp_ddgst;
>>       __le32            recv_ddgst;
>> +    struct completion       *tls_complete;
>> +    int                     tls_err;
>> +
>>       struct page_frag_cache    pf_cache;
>>       void (*state_change)(struct sock *);
>> @@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct 
>> nvme_tcp_queue *queue)
>>       queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, 
>> false);
>>   }
>> -static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>> +/*
>> + * nvme_tcp_lookup_psk - Look up PSKs to use for TLS
>> + *
>> + */
>> +static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
>> +                   key_serial_t *keylist, int num_keys)
> 
> Where is num_keys used?
> 
Ah, indeed, need to check this in the loop.

>> +{
>> +    enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
>> +    struct key *tls_key;
>> +    int num = 0;
>> +    bool generated = false;
>> +
>> +    /* Check for pre-provisioned keys; retained keys first */
>> +    do {
>> +        tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
>> +                          nctrl->opts->subsysnqn,
>> +                          cipher, generated);
>> +        if (!IS_ERR(tls_key)) {
>> +            keylist[num] = tls_key->serial;
>> +            num++;
>> +            key_put(tls_key);
>> +        }
>> +        if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
>> +            cipher = NVME_TCP_TLS_CIPHER_SHA256;
>> +        else {
>> +            if (generated)
>> +                cipher = NVME_TCP_TLS_CIPHER_INVALID;
>> +            else {
>> +                cipher = NVME_TCP_TLS_CIPHER_SHA384;
>> +                generated = true;
>> +            }
>> +        }
>> +    } while(cipher != NVME_TCP_TLS_CIPHER_INVALID);
> 
> I'm unclear about a few things here:
> 1. what is the meaning of pre-provisioned vs. retained vs. generated?
> 2. Can this loop be reorganized in a nested for loop with a break?
>     I'm wandering if it will make it simpler to read.
> 
'pre-provisioned' means that the admin has stored the keys in the 
keyring prior to calling 'nvme connect'.
'generated' means a key which is derived from the key material generated 
from a previous DH-HMAC-CHAP transaction.
As for the loop: I am going back and forth between having a loop
(which is executed exactly four times) and unrolling the loop into four 
distinct calls to nvme_tls_psk_lookup().
It probably doesn't matter for the actual assembler code (as the 
compiler will be doing a loop unroll anyway), but the unrolled code 
would allow for better documentation. The code might be slightly longer, 
though, with lots of repetitions.
So really, I don't know which is best.
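
For illustration, the unrolled variant would look something like this
(an untested sketch using the helpers from this series; the
'num < num_keys' checks would also add the bound check you noted above):

static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
                                key_serial_t *keylist, int num_keys)
{
        const char *hostnqn = nctrl->opts->host->nqn;
        const char *subnqn = nctrl->opts->subsysnqn;
        struct key *key;
        int num = 0;

        /* Retained PSK, SHA-384 */
        key = nvme_tls_psk_lookup(NULL, hostnqn, subnqn,
                                  NVME_TCP_TLS_CIPHER_SHA384, false);
        if (!IS_ERR(key)) {
                if (num < num_keys)
                        keylist[num++] = key->serial;
                key_put(key);
        }

        /* Retained PSK, SHA-256 */
        key = nvme_tls_psk_lookup(NULL, hostnqn, subnqn,
                                  NVME_TCP_TLS_CIPHER_SHA256, false);
        if (!IS_ERR(key)) {
                if (num < num_keys)
                        keylist[num++] = key->serial;
                key_put(key);
        }

        /* Generated PSK, SHA-384 */
        key = nvme_tls_psk_lookup(NULL, hostnqn, subnqn,
                                  NVME_TCP_TLS_CIPHER_SHA384, true);
        if (!IS_ERR(key)) {
                if (num < num_keys)
                        keylist[num++] = key->serial;
                key_put(key);
        }

        /* Generated PSK, SHA-256 */
        key = nvme_tls_psk_lookup(NULL, hostnqn, subnqn,
                                  NVME_TCP_TLS_CIPHER_SHA256, true);
        if (!IS_ERR(key)) {
                if (num < num_keys)
                        keylist[num++] = key->serial;
                key_put(key);
        }

        return num;
}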

>> +    return num;
>> +}
>> +
>> +static void nvme_tcp_tls_done(void *data, int status, key_serial_t 
>> peerid)
>> +{
>> +    struct nvme_tcp_queue *queue = data;
>> +    struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>> +    int qid = nvme_tcp_queue_id(queue);
>> +
>> +    dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key %x, 
>> status %d\n",
>> +        qid, peerid, status);
>> +
>> +    queue->tls_err = -status;
>> +    if (queue->tls_complete)
>> +        complete(queue->tls_complete);
>> +}
>> +
>> +static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
>> +                  struct nvme_tcp_queue *queue,
>> +                  key_serial_t peerid)
>> +{
>> +    int qid = nvme_tcp_queue_id(queue);
>> +    int ret;
>> +    struct tls_handshake_args args;
>> +    unsigned long tmo = tls_handshake_timeout * HZ;
>> +    DECLARE_COMPLETION_ONSTACK(tls_complete);
>> +
>> +    dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
>> +        qid, peerid);
>> +    args.ta_sock = queue->sock;
>> +    args.ta_done = nvme_tcp_tls_done;
>> +    args.ta_data = queue;
>> +    args.ta_my_peerids[0] = peerid;
>> +    args.ta_num_peerids = 1;
>> +    args.ta_keyring = nvme_keyring_id();
>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
>> +    queue->tls_err = -EOPNOTSUPP;
>> +    queue->tls_complete = &tls_complete;
>> +    ret = tls_client_hello_psk(&args, GFP_KERNEL);
>> +    if (ret) {
>> +        dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
>> +            qid, ret);
>> +        return ret;
>> +    }
>> +    if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
>> +        dev_dbg(nctrl->device,
>> +            "queue %d: TLS handshake timeout\n", qid);
>> +        queue->tls_complete = NULL;
>> +        ret = -ETIMEDOUT;
>> +    } else {
>> +        dev_dbg(nctrl->device,
>> +            "queue %d: TLS handshake complete, error %d\n",
>> +            qid, queue->tls_err);
>> +        ret = queue->tls_err;
>> +    }
>> +    queue->tls_complete = NULL;
>> +    if (!ret)
>> +        set_bit(NVME_TCP_Q_TLS, &queue->flags);
>> +    return ret;
>> +}
>> +
>> +static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>> +                key_serial_t peerid)
>>   {
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>> @@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct 
>> nvme_ctrl *nctrl, int qid)
>>           goto err_rcv_pdu;
>>       }
>> +    /* If PSKs are configured try to start TLS */
>> +    if (peerid) {
> 
> Where is peerid being initialized? Not to mention that peerid is
> a rather cryptic name (at least to me). Is this the ClientHello
> identity?
> 
'peerid' is the term used in the netlink handshake protocol.
It actually is the key serial number of the PSK to use.
Maybe 'psk_id' would be more appropriate here.

>> +        ret = nvme_tcp_start_tls(nctrl, queue, peerid);
>> +        if (ret)
>> +            goto err_init_connect;
>> +    }
>> +
>>       ret = nvme_tcp_init_connection(queue);
>>       if (ret)
>>           goto err_init_connect;
>> @@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct 
>> nvme_ctrl *ctrl,
>>   static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>   {
>> -    int ret;
>> +    int ret = -EINVAL, num_keys, k;
>> +    key_serial_t keylist[4];
>> -    ret = nvme_tcp_alloc_queue(ctrl, 0);
>> -    if (ret)
>> -        return ret;
>> +    memset(keylist, 0, sizeof(key_serial_t));
>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>> +    for (k = 0; k < num_keys; k++) {
>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
>> +        if (!ret)
>> +            break;
>> +    }
>> +    if (ret) {
>> +        /* Try without TLS */
> 
> Why? This is trying to always connect with tls and fall back to no-tls?
> Why not simply do what userspace is asking us to do?
> 
> Seems backwards to me. Unless there is a statement in the spec
> that I'm not aware of which tells us to do so.
> 
This is an implication of selecting the PSK in the kernel code.
If we move PSK selection to userspace we clearly wouldn't need this.
But then we'd need an updated nvme-cli for a) selecting the PSK from
the keyring and b) passing in the new option.
So for development it was easier to run with the in-kernel selection as 
I don't need to modify nvme-cli.

>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>> +        if (ret)
>> +            goto out_free_queue;
>> +    }
>>       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
>>       if (ret)
>> @@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct 
>> nvme_ctrl *ctrl)
>>   static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>   {
>> -    int i, ret;
>> +    int i, ret, num_keys = 0, k;
>> +    key_serial_t keylist[4];
>> +    memset(keylist, 0, sizeof(key_serial_t));
>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>       for (i = 1; i < ctrl->queue_count; i++) {
>> -        ret = nvme_tcp_alloc_queue(ctrl, i);
>> -        if (ret)
>> -            goto out_free_queues;
>> +        ret = -EINVAL;
>> +        for (k = 0; k < num_keys; k++) {
>> +            ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
>> +            if (!ret)
>> +                break;
> 
> What is going on here? Are you establishing queue_count x num_keys nvme
> queues?
> 
No, I am _trying_ to establish a connection, breaking out if the attempt
_succeeded_.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 09/18] nvme-tcp: add connect option 'tls'
  2023-03-21 12:43 ` [PATCH 09/18] nvme-tcp: add connect option 'tls' Hannes Reinecke
@ 2023-03-22  9:24   ` Sagi Grimberg
  2023-03-22  9:59     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  9:24 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

> Add a connect option 'tls' to request TLS1.3 in-band encryption, and
> abort the connection attempt if TLS could not be established.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/fabrics.c | 5 +++++
>   drivers/nvme/host/fabrics.h | 2 ++
>   drivers/nvme/host/tcp.c     | 7 ++++++-
>   3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index bbaa04a0c502..fdff7cdff029 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -609,6 +609,7 @@ static const match_table_t opt_tokens = {
>   	{ NVMF_OPT_DISCOVERY,		"discovery"		},
>   	{ NVMF_OPT_DHCHAP_SECRET,	"dhchap_secret=%s"	},
>   	{ NVMF_OPT_DHCHAP_CTRL_SECRET,	"dhchap_ctrl_secret=%s"	},
> +	{ NVMF_OPT_TLS,			"tls"			},
>   	{ NVMF_OPT_ERR,			NULL			}
>   };
>   
> @@ -632,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   	opts->hdr_digest = false;
>   	opts->data_digest = false;
>   	opts->tos = -1; /* < 0 == use transport default */
> +	opts->tls = false;
>   
>   	options = o = kstrdup(buf, GFP_KERNEL);
>   	if (!options)
> @@ -918,6 +920,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   			kfree(opts->dhchap_ctrl_secret);
>   			opts->dhchap_ctrl_secret = p;
>   			break;
> +		case NVMF_OPT_TLS:
> +			opts->tls = true;
> +			break;
>   		default:
>   			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
>   				p);
> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> index dcac3df8a5f7..c4538a9d437c 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -70,6 +70,7 @@ enum {
>   	NVMF_OPT_DISCOVERY	= 1 << 22,
>   	NVMF_OPT_DHCHAP_SECRET	= 1 << 23,
>   	NVMF_OPT_DHCHAP_CTRL_SECRET = 1 << 24,
> +	NVMF_OPT_TLS		= 1 << 25,
>   };
>   
>   /**
> @@ -128,6 +129,7 @@ struct nvmf_ctrl_options {
>   	int			max_reconnects;
>   	char			*dhchap_secret;
>   	char			*dhchap_ctrl_secret;
> +	bool			tls;
>   	bool			disable_sqflow;
>   	bool			hdr_digest;
>   	bool			data_digest;
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bcf24e9a08e1..bbff1f52a167 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1902,6 +1902,9 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>   			break;
>   	}
>   	if (ret) {
> +		/* Abort if TLS is requested */
> +		if (num_keys && ctrl->opts->tls)
> +			goto out_free_queue;
>   		/* Try without TLS */
>   		ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>   		if (ret)
> @@ -1934,6 +1937,8 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>   				break;
>   		}
>   		if (ret) {
> +			if (num_keys && ctrl->opts->tls)
> +				goto out_free_queues;

I don't see why we even attempt tls if we're not explicitly told to.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS
  2023-03-21 12:43 ` [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS Hannes Reinecke
@ 2023-03-22  9:31   ` Sagi Grimberg
  2023-03-22 10:08     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22  9:31 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> kTLS does not support MSG_EOR flag for sendmsg(), and the ->sendpage()
> call really doesn't bring any benefit as data has to be copied
> anyway.
> So use sock_no_sendpage() or sendmsg() instead, and ensure that the
> MSG_EOR flag is blanked out for kTLS.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 33 +++++++++++++++++++++------------
>   1 file changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bbff1f52a167..007d457cacf9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1034,13 +1034,19 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>   		bool last = nvme_tcp_pdu_last_send(req, len);
>   		int req_data_sent = req->data_sent;
>   		int ret, flags = MSG_DONTWAIT;
> +		bool do_sendpage = sendpage_ok(page);
>   
> -		if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> +		if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
> +			flags |= MSG_MORE;
> +		else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
>   			flags |= MSG_EOR;
> -		else
> -			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

I think it's time to move the flags setting to a helper.
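Something along these lines would cover all three send paths (an
untested sketch, not part of the series; it drops MSG_EOR on kTLS
queues, matching the logic in the hunk above):

static inline int nvme_tcp_msg_flags(struct nvme_tcp_queue *queue,
                                     bool more)
{
        int flags = MSG_DONTWAIT;

        if (more)
                flags |= MSG_MORE;
        else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
                flags |= MSG_EOR;

        return flags;
}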

>   
> -		if (sendpage_ok(page)) {
> +		if (test_bit(NVME_TCP_Q_TLS, &queue->flags))
> +			do_sendpage = false;
> +
> +		if (do_sendpage) {

The do_sendpage looks redundant to me.

> +			if (flags & MSG_MORE)
> +				flags |= MSG_SENDPAGE_NOTLAST;
>   			ret = kernel_sendpage(queue->sock, page, offset, len,
>   					flags);

I think that MSG_SENDPAGE_NOTLAST should be set together with MSG_MORE
regardless.

>   		} else {
> @@ -1088,19 +1094,22 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	bool inline_data = nvme_tcp_has_inline_data(req);
>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>   	int len = sizeof(*pdu) + hdgst - req->offset;
> -	int flags = MSG_DONTWAIT;
> +	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
> +	struct kvec iov = {
> +		.iov_base = (u8 *)req->pdu + req->offset,
> +		.iov_len = len,
> +	};
>   	int ret;
>   
>   	if (inline_data || nvme_tcp_queue_more(queue))
> -		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> -	else
> -		flags |= MSG_EOR;
> +		msg.msg_flags |= MSG_MORE;
> +	else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
> +		msg.msg_flags |= MSG_EOR;
>   
>   	if (queue->hdr_digest && !req->offset)
>   		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>   
> -	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
> -			offset_in_page(pdu) + req->offset, len,  flags);
> +	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);

I'd prefer to do kernel_sendpage/sock_no_sendpage similar to how we do
it for data and data pdu.

>   	if (unlikely(ret <= 0))
>   		return ret;
>   
> @@ -1131,7 +1140,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
>   	if (queue->hdr_digest && !req->offset)
>   		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>   
> -	if (!req->h2cdata_left)
> +	if (!test_bit(NVME_TCP_Q_TLS, &queue->flags) && !req->h2cdata_left)
>   		ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
>   				offset_in_page(pdu) + req->offset, len,
>   				MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);

Something is unclear to me. Is kernel_sendpage unsupported with tls? (I
think it is). I don't understand the motivation to add more checks in the code
for kernel_sendpage vs. sock_no_sendpage given that it should be
perfectly fine to use either.

Did you see any regressions with using kernel_sendpage? If so, isn't
that a bug in the tls code?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 09/18] nvme-tcp: add connect option 'tls'
  2023-03-22  9:24   ` Sagi Grimberg
@ 2023-03-22  9:59     ` Hannes Reinecke
  2023-03-22 10:09       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22  9:59 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 10:24, Sagi Grimberg wrote:
>> Add a connect option 'tls' to request TLS1.3 in-band encryption, and
>> abort the connection attempt if TLS could not be established.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/fabrics.c | 5 +++++
>>   drivers/nvme/host/fabrics.h | 2 ++
>>   drivers/nvme/host/tcp.c     | 7 ++++++-
>>   3 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index bbaa04a0c502..fdff7cdff029 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -609,6 +609,7 @@ static const match_table_t opt_tokens = {
>>       { NVMF_OPT_DISCOVERY,        "discovery"        },
>>       { NVMF_OPT_DHCHAP_SECRET,    "dhchap_secret=%s"    },
>>       { NVMF_OPT_DHCHAP_CTRL_SECRET,    "dhchap_ctrl_secret=%s"    },
>> +    { NVMF_OPT_TLS,            "tls"            },
>>       { NVMF_OPT_ERR,            NULL            }
>>   };
>> @@ -632,6 +633,7 @@ static int nvmf_parse_options(struct 
>> nvmf_ctrl_options *opts,
>>       opts->hdr_digest = false;
>>       opts->data_digest = false;
>>       opts->tos = -1; /* < 0 == use transport default */
>> +    opts->tls = false;
>>       options = o = kstrdup(buf, GFP_KERNEL);
>>       if (!options)
>> @@ -918,6 +920,9 @@ static int nvmf_parse_options(struct 
>> nvmf_ctrl_options *opts,
>>               kfree(opts->dhchap_ctrl_secret);
>>               opts->dhchap_ctrl_secret = p;
>>               break;
>> +        case NVMF_OPT_TLS:
>> +            opts->tls = true;
>> +            break;
>>           default:
>>               pr_warn("unknown parameter or missing value '%s' in ctrl 
>> creation request\n",
>>                   p);
>> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
>> index dcac3df8a5f7..c4538a9d437c 100644
>> --- a/drivers/nvme/host/fabrics.h
>> +++ b/drivers/nvme/host/fabrics.h
>> @@ -70,6 +70,7 @@ enum {
>>       NVMF_OPT_DISCOVERY    = 1 << 22,
>>       NVMF_OPT_DHCHAP_SECRET    = 1 << 23,
>>       NVMF_OPT_DHCHAP_CTRL_SECRET = 1 << 24,
>> +    NVMF_OPT_TLS        = 1 << 25,
>>   };
>>   /**
>> @@ -128,6 +129,7 @@ struct nvmf_ctrl_options {
>>       int            max_reconnects;
>>       char            *dhchap_secret;
>>       char            *dhchap_ctrl_secret;
>> +    bool            tls;
>>       bool            disable_sqflow;
>>       bool            hdr_digest;
>>       bool            data_digest;
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index bcf24e9a08e1..bbff1f52a167 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1902,6 +1902,9 @@ static int nvme_tcp_alloc_admin_queue(struct 
>> nvme_ctrl *ctrl)
>>               break;
>>       }
>>       if (ret) {
>> +        /* Abort if TLS is requested */
>> +        if (num_keys && ctrl->opts->tls)
>> +            goto out_free_queue;
>>           /* Try without TLS */
>>           ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>>           if (ret)
>> @@ -1934,6 +1937,8 @@ static int __nvme_tcp_alloc_io_queues(struct 
>> nvme_ctrl *ctrl)
>>                   break;
>>           }
>>           if (ret) {
>> +            if (num_keys && ctrl->opts->tls)
>> +                goto out_free_queues;
> 
> I don't see why we even attempt tls if we're not explicitly told to.

Because we don't know.

It's all easy if we do a discovery before connect, as the discovery log 
page tells us what to do.
But if we do _not_ do a discovery, how would we know what to do?

We could try to start off with no TLS, but if the server requires TLS 
then all we get is a 'connection refused' error, leaving us none the wiser.
So really we have to start off with TLS (if we have a matching 
identity). The server then can always refuse the connection (with the 
same error), in which case we should re-try without TLS.
That's where the 'tls' option comes in: to disable the fallback to
non-TLS.

So in the end we have three modes for the client:

1) TLS not supported
2) TLS allowed (with fallback to non-TLS)
3) TLS required (no fallback to non-TLS)

For modes 2) and 3) a PSK has to be present, and
the 'tls' option is used to switch from mode 2) to mode 3)
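
In code, the three modes map onto the connect path roughly like this
(illustrative pseudo-code distilled from the patches above, not a
verbatim excerpt):

        num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
        if (!num_keys)                  /* mode 1: no PSK provisioned */
                return nvme_tcp_alloc_queue(ctrl, 0, 0);

        for (k = 0; k < num_keys; k++) {        /* modes 2 and 3: try TLS */
                ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
                if (!ret)
                        return 0;
        }
        if (ctrl->opts->tls)            /* mode 3: TLS required, abort */
                return ret;
        /* mode 2: fall back to non-TLS */
        return nvme_tcp_alloc_queue(ctrl, 0, 0);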

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS
  2023-03-22  9:31   ` Sagi Grimberg
@ 2023-03-22 10:08     ` Hannes Reinecke
  2023-03-22 11:18       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 10:08 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 10:31, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> kTLS does not support MSG_EOR flag for sendmsg(), and the ->sendpage()
>> call really doesn't bring any benefit as data has to be copied
>> anyway.
>> So use sock_no_sendpage() or sendmsg() instead, and ensure that the
>> MSG_EOR flag is blanked out for kTLS.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 33 +++++++++++++++++++++------------
>>   1 file changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index bbff1f52a167..007d457cacf9 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1034,13 +1034,19 @@ static int nvme_tcp_try_send_data(struct 
>> nvme_tcp_request *req)
>>           bool last = nvme_tcp_pdu_last_send(req, len);
>>           int req_data_sent = req->data_sent;
>>           int ret, flags = MSG_DONTWAIT;
>> +        bool do_sendpage = sendpage_ok(page);
>> -        if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
>> +        if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
>> +            flags |= MSG_MORE;
>> +        else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
>>               flags |= MSG_EOR;
>> -        else
>> -            flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> 
> I think it's time to move the flags setting to a helper.
> 
>> -        if (sendpage_ok(page)) {
>> +        if (test_bit(NVME_TCP_Q_TLS, &queue->flags))
>> +            do_sendpage = false;
>> +
>> +        if (do_sendpage) {
> 
> The do_sendpage looks redundant to me.
> 
>> +            if (flags & MSG_MORE)
>> +                flags |= MSG_SENDPAGE_NOTLAST;
>>               ret = kernel_sendpage(queue->sock, page, offset, len,
>>                       flags);
> 
> I think that MSG_SENDPAGE_NOTLAST should be set together with MSG_MORE
> regardless.
> 
>>           } else {
>> @@ -1088,19 +1094,22 @@ static int nvme_tcp_try_send_cmd_pdu(struct 
>> nvme_tcp_request *req)
>>       bool inline_data = nvme_tcp_has_inline_data(req);
>>       u8 hdgst = nvme_tcp_hdgst_len(queue);
>>       int len = sizeof(*pdu) + hdgst - req->offset;
>> -    int flags = MSG_DONTWAIT;
>> +    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
>> +    struct kvec iov = {
>> +        .iov_base = (u8 *)req->pdu + req->offset,
>> +        .iov_len = len,
>> +    };
>>       int ret;
>>       if (inline_data || nvme_tcp_queue_more(queue))
>> -        flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>> -    else
>> -        flags |= MSG_EOR;
>> +        msg.msg_flags |= MSG_MORE;
>> +    else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
>> +        msg.msg_flags |= MSG_EOR;
>>       if (queue->hdr_digest && !req->offset)
>>           nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>> -    ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
>> -            offset_in_page(pdu) + req->offset, len,  flags);
>> +    ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
> 
> I'd prefer to do kernel_sendpage/sock_no_sendpage similar to how we do
> it for data and data pdu.
> 
>>       if (unlikely(ret <= 0))
>>           return ret;
>> @@ -1131,7 +1140,7 @@ static int nvme_tcp_try_send_data_pdu(struct 
>> nvme_tcp_request *req)
>>       if (queue->hdr_digest && !req->offset)
>>           nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>> -    if (!req->h2cdata_left)
>> +    if (!test_bit(NVME_TCP_Q_TLS, &queue->flags) && !req->h2cdata_left)
>>           ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
>>                   offset_in_page(pdu) + req->offset, len,
>>                   MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
> 
> Something is unclear to me. Is kernel_sendpage unsupported with tls? (I
>> think it is). I don't understand the motivation to add more checks in the code
> for kernel_sendpage vs. sock_no_sendpage given that it should be
> perfectly fine to use either.
> 
> Did you see any regressions with using kernel_sendpage? If so, isn't
> that a bug in the tls code?

The actual issue with the tls code is the 'MSG_EOR' handling.
The problem is that tls uses MSG_EOR internally and bails out on
unknown MSG_ flags:

int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
[ .. ]
         if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
                                MSG_CMSG_COMPAT))
                 return -EOPNOTSUPP;

I would _vastly_ prefer to blank out unsupported flags (like MSG_EOR) 
from the TLS code, because to all intents and purposes MSG_EOR is just 
the opposite of MSG_MORE.
Or drop MSG_EOR usage from the nvme tcp code.
But then I'm not _that_ into the networking code to make a judgement here.
And as we're using sendmsg() already I switched to using it for
ktls, too (as I know that the sendmsg() flow worked).
But in the end I guess we could use sendpage going forward.

I'll check.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 09/18] nvme-tcp: add connect option 'tls'
  2023-03-22  9:59     ` Hannes Reinecke
@ 2023-03-22 10:09       ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 10:09 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>> Add a connect option 'tls' to request TLS1.3 in-band encryption, and
>>> abort the connection attempt if TLS could not be established.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/fabrics.c | 5 +++++
>>>   drivers/nvme/host/fabrics.h | 2 ++
>>>   drivers/nvme/host/tcp.c     | 7 ++++++-
>>>   3 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>> index bbaa04a0c502..fdff7cdff029 100644
>>> --- a/drivers/nvme/host/fabrics.c
>>> +++ b/drivers/nvme/host/fabrics.c
>>> @@ -609,6 +609,7 @@ static const match_table_t opt_tokens = {
>>>       { NVMF_OPT_DISCOVERY,        "discovery"        },
>>>       { NVMF_OPT_DHCHAP_SECRET,    "dhchap_secret=%s"    },
>>>       { NVMF_OPT_DHCHAP_CTRL_SECRET,    "dhchap_ctrl_secret=%s"    },
>>> +    { NVMF_OPT_TLS,            "tls"            },
>>>       { NVMF_OPT_ERR,            NULL            }
>>>   };
>>> @@ -632,6 +633,7 @@ static int nvmf_parse_options(struct 
>>> nvmf_ctrl_options *opts,
>>>       opts->hdr_digest = false;
>>>       opts->data_digest = false;
>>>       opts->tos = -1; /* < 0 == use transport default */
>>> +    opts->tls = false;
>>>       options = o = kstrdup(buf, GFP_KERNEL);
>>>       if (!options)
>>> @@ -918,6 +920,9 @@ static int nvmf_parse_options(struct 
>>> nvmf_ctrl_options *opts,
>>>               kfree(opts->dhchap_ctrl_secret);
>>>               opts->dhchap_ctrl_secret = p;
>>>               break;
>>> +        case NVMF_OPT_TLS:
>>> +            opts->tls = true;
>>> +            break;
>>>           default:
>>>               pr_warn("unknown parameter or missing value '%s' in 
>>> ctrl creation request\n",
>>>                   p);
>>> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
>>> index dcac3df8a5f7..c4538a9d437c 100644
>>> --- a/drivers/nvme/host/fabrics.h
>>> +++ b/drivers/nvme/host/fabrics.h
>>> @@ -70,6 +70,7 @@ enum {
>>>       NVMF_OPT_DISCOVERY    = 1 << 22,
>>>       NVMF_OPT_DHCHAP_SECRET    = 1 << 23,
>>>       NVMF_OPT_DHCHAP_CTRL_SECRET = 1 << 24,
>>> +    NVMF_OPT_TLS        = 1 << 25,
>>>   };
>>>   /**
>>> @@ -128,6 +129,7 @@ struct nvmf_ctrl_options {
>>>       int            max_reconnects;
>>>       char            *dhchap_secret;
>>>       char            *dhchap_ctrl_secret;
>>> +    bool            tls;
>>>       bool            disable_sqflow;
>>>       bool            hdr_digest;
>>>       bool            data_digest;
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index bcf24e9a08e1..bbff1f52a167 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1902,6 +1902,9 @@ static int nvme_tcp_alloc_admin_queue(struct 
>>> nvme_ctrl *ctrl)
>>>               break;
>>>       }
>>>       if (ret) {
>>> +        /* Abort if TLS is requested */
>>> +        if (num_keys && ctrl->opts->tls)
>>> +            goto out_free_queue;
>>>           /* Try without TLS */
>>>           ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>>>           if (ret)
>>> @@ -1934,6 +1937,8 @@ static int __nvme_tcp_alloc_io_queues(struct 
>>> nvme_ctrl *ctrl)
>>>                   break;
>>>           }
>>>           if (ret) {
>>> +            if (num_keys && ctrl->opts->tls)
>>> +                goto out_free_queues;
>>
>> I don't see why we even attempt tls if we're not explicitly told to.
> 
> Because we don't know.

Exactly why we should do what we've been told.

> It's all easy if we do a discovery before connect, as the discovery log 
> page tells us what to do.
> But if we do _not_ do a discovery, how would we know what to do?

The initiator of the connect needs to know that.

> We could try to start off with no TLS, but if the server requires TLS 
> then all we get is a 'connection refused' error, leaving us none the wiser.
> So really we have to start off with TLS (if we have a matching 
> identity). The server then can always refuse the connection (with the 
> same error), in which case we should re-try without TLS.
> That's where the 'tls' option comes in: to avoid having a fallback 
> without TLS.

I don't think we should do any type of fallback in the driver.
I think that if we want a fallback we need to put it in userspace.

> So in the end we have three modes for the client:
> 
> 1) TLS not supported
> 2) TLS allowed (with fallback to non-TLS)
> 3) TLS required (no fallback to non-TLS)
4) TLS allowed, but not desired.

> For modes 2) and 3) a PSK has to be present, and
> the 'tls' option is used to switch from mode 2) to mode 3)

I think that the tls option should tell the driver exactly what
it needs to do. It seems wrong to me that now every tcp connection
would unconditionally start with tls and fall back to normal.

But let me think about it some more...

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready()
  2023-03-22  8:26         ` Hannes Reinecke
@ 2023-03-22 10:13           ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 10:13 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>> I meant the change from read_lock_bh to rcu_read_lock_bh and the
>> same for rcu_dereference_sk_user_data. It needs to be in a
separate patch with an explanation of why it is needed.
>>
> That is primarily for sanity checking.
> 
> Both we _and_ the tls code are setting the sk_user_data pointer,
> so I wanted to make sure that we do get things correct.
> But seems to work now, so I guess I can drop the rcu_dereference
> modifications.
> 
> Let's see.

As I said, I don't mind having it; it seems harmless.
But it needs to be a separate patch, with a description of why
it is added.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-22  9:12     ` Hannes Reinecke
@ 2023-03-22 10:56       ` Sagi Grimberg
  2023-03-22 12:54         ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 10:56 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>> Select possible PSK identities and call the TLS handshake upcall
>>> for each identity.
>>> The TLS 1.3 RFC allows sending multiple identities with each ClientHello
>>> request, but none of the SSL libraries implement it. As the connection
>>> is established when the association is created, we send only a single
>>> identity for each upcall, and close the connection to restart with
>>> the next identity if the handshake fails.
>>
>> Can't this loop be done in userspace? In other words, how can
>> we get rid of this when SSL libs would decide to support it?
>>
> Well, that is something I've been thinking about, but I really
> haven't come to a good solution.

I have a more general question.
In what scenario would we have more than one valid identity for a given
hostnqn and subsysnqn? Do we need to support it?

> Crux of the matter is that we have to close the connection after a 
> failed TLS handshake:

Yes.

> 
>  > A TLS 1.3 client implementation that only supports sending a single
>  > PSK identity during connection setup may be required to connect
>  > multiple times in order to negotiate cipher suites with different hash
>  > functions.
> 
> and as it's quite unclear in which state the connection is after the 
> userspace library failed the handshake.
> So the only good way to recover is to close the connection and restart 
> with a different identity.

I see.

> While we can move the identity selection to userspace (eg by providing
> a 'tls_psk' fabrics option holding the key serial of the PSK to use),
> that would allow us to pass only a _single_ PSK for each attempt.

That makes sense to me. But I'm unclear on how it would choose if we
have multiple (and again, I'm not clear in which scenario that would be
the case).

> 
> And the other problem is that in its current form the spec allows for 
> _different_ identities for each connection; by passing a key from 
> userspace we would not be able to support that.
> (Not saying that it's useful, mind.)

I'll happily forfeit this support. Doing this per connection is
completely crazy.

> 
> We could allow for several 'tls_psk' options, though; maybe that would 
> be a way out.

What do you mean by that?

> 
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 157 +++++++++++++++++++++++++++++++++++++---
>>>   1 file changed, 148 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0438d42f4179..bcf24e9a08e1 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -8,9 +8,12 @@
>>>   #include <linux/init.h>
>>>   #include <linux/slab.h>
>>>   #include <linux/err.h>
>>> +#include <linux/key.h>
>>>   #include <linux/nvme-tcp.h>
>>> +#include <linux/nvme-keyring.h>
>>>   #include <net/sock.h>
>>>   #include <net/tcp.h>
>>> +#include <net/handshake.h>
>>>   #include <linux/blk-mq.h>
>>>   #include <crypto/hash.h>
>>>   #include <net/busy_poll.h>
>>> @@ -31,6 +34,14 @@ static int so_priority;
>>>   module_param(so_priority, int, 0644);
>>>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>>> +/*
>>> + * TLS handshake timeout
>>> + */
>>> +static int tls_handshake_timeout = 10;
>>> +module_param(tls_handshake_timeout, int, 0644);
>>> +MODULE_PARM_DESC(tls_handshake_timeout,
>>> +         "nvme TLS handshake timeout in seconds (default 10)");
>>
>> Can you share what is the normal time of an upcall?
>>
> That really depends on the network latency and/or reachability of the 
> server. It might just have been started up, switch MAC tables not yet
> updated, STP still ongoing, what do I know.
> So 10 seconds seemed to be a good compromise.
> But that's also why I made this configurable.

Does it really take 10 seconds per connection? :(
I'm planning to give this a go soon so will find out.

>>> +
>>>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>>   /* lockdep can detect a circular dependency of the form
>>>    *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>>> @@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
>>>       NVME_TCP_Q_ALLOCATED    = 0,
>>>       NVME_TCP_Q_LIVE        = 1,
>>>       NVME_TCP_Q_POLLING    = 2,
>>> +    NVME_TCP_Q_TLS        = 3,
>>>   };
>>>   enum nvme_tcp_recv_state {
>>> @@ -148,6 +160,9 @@ struct nvme_tcp_queue {
>>>       __le32            exp_ddgst;
>>>       __le32            recv_ddgst;
>>> +    struct completion       *tls_complete;
>>> +    int                     tls_err;
>>> +
>>>       struct page_frag_cache    pf_cache;
>>>       void (*state_change)(struct sock *);
>>> @@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct 
>>> nvme_tcp_queue *queue)
>>>       queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, 
>>> false);
>>>   }
>>> -static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>>> +/*
>>> + * nvme_tcp_lookup_psk - Look up PSKs to use for TLS
>>> + *
>>> + */
>>> +static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
>>> +                   key_serial_t *keylist, int num_keys)
>>
>> Where is num_keys used?
>>
> Ah, indeed, need to check this in the loop.
> 
>>> +{
>>> +    enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
>>> +    struct key *tls_key;
>>> +    int num = 0;
>>> +    bool generated = false;
>>> +
>>> +    /* Check for pre-provisioned keys; retained keys first */
>>> +    do {
>>> +        tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
>>> +                          nctrl->opts->subsysnqn,
>>> +                          cipher, generated);
>>> +        if (!IS_ERR(tls_key)) {
>>> +            keylist[num] = tls_key->serial;
>>> +            num++;
>>> +            key_put(tls_key);
>>> +        }
>>> +        if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
>>> +            cipher = NVME_TCP_TLS_CIPHER_SHA256;
>>> +        else {
>>> +            if (generated)
>>> +                cipher = NVME_TCP_TLS_CIPHER_INVALID;
>>> +            else {
>>> +                cipher = NVME_TCP_TLS_CIPHER_SHA384;
>>> +                generated = true;
>>> +            }
>>> +        }
>>> +    } while(cipher != NVME_TCP_TLS_CIPHER_INVALID);
>>
>> I'm unclear about a few things here:
>> 1. what is the meaning of pre-provisioned vs. retained vs. generated?
>> 2. Can this loop be reorganized in a nested for loop with a break?
>>     I'm wandering if it will make it simpler to read.
>>
> 'pre-provisioned' means that the admin has stored the keys in the 
> keyring prior to calling 'nvme connect'.
> 'generated' means a key which is derived from the key material generated 
> from a previous DH-HMAC-CHAP transaction.

Can we ignore the generated keys until the code that actually generates
them is introduced? That would make this simpler to digest.

> As for the loop: I am going back and forth between having a loop
> (which is executed exactly four times) and unrolling the loop into four 
> distinct calls to nvme_tls_psk_lookup().
> It probably doesn't matter for the actual assembler code (as the 
> compiler will be doing a loop unroll anyway), but the unrolled code 
> would allow for better documentation. The code might be slightly longer, 
> though, with lots of repetitions.
> So really, I don't know which is best.

I think that the best option would be to either:
1. Have one valid identity at all times (not sure if that is
too restrictive, see my questions above)
2. Have userspace do the iterations.

If both are not possible, or too difficult, then we should optimize
for simplicity/readability, not code size.

> 
>>> +    return num;
>>> +}
>>> +
>>> +static void nvme_tcp_tls_done(void *data, int status, key_serial_t 
>>> peerid)
>>> +{
>>> +    struct nvme_tcp_queue *queue = data;
>>> +    struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>>> +    int qid = nvme_tcp_queue_id(queue);
>>> +
>>> +    dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key 
>>> %x, status %d\n",
>>> +        qid, peerid, status);
>>> +
>>> +    queue->tls_err = -status;
>>> +    if (queue->tls_complete)
>>> +        complete(queue->tls_complete);
>>> +}
>>> +
>>> +static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
>>> +                  struct nvme_tcp_queue *queue,
>>> +                  key_serial_t peerid)
>>> +{
>>> +    int qid = nvme_tcp_queue_id(queue);
>>> +    int ret;
>>> +    struct tls_handshake_args args;
>>> +    unsigned long tmo = tls_handshake_timeout * HZ;
>>> +    DECLARE_COMPLETION_ONSTACK(tls_complete);
>>> +
>>> +    dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
>>> +        qid, peerid);
>>> +    args.ta_sock = queue->sock;
>>> +    args.ta_done = nvme_tcp_tls_done;
>>> +    args.ta_data = queue;
>>> +    args.ta_my_peerids[0] = peerid;
>>> +    args.ta_num_peerids = 1;
>>> +    args.ta_keyring = nvme_keyring_id();
>>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
>>> +    queue->tls_err = -EOPNOTSUPP;
>>> +    queue->tls_complete = &tls_complete;
>>> +    ret = tls_client_hello_psk(&args, GFP_KERNEL);
>>> +    if (ret) {
>>> +        dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
>>> +            qid, ret);
>>> +        return ret;
>>> +    }
>>> +    if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
>>> +        dev_dbg(nctrl->device,
>>> +            "queue %d: TLS handshake timeout\n", qid);
>>> +        queue->tls_complete = NULL;
>>> +        ret = -ETIMEDOUT;
>>> +    } else {
>>> +        dev_dbg(nctrl->device,
>>> +            "queue %d: TLS handshake complete, error %d\n",
>>> +            qid, queue->tls_err);
>>> +        ret = queue->tls_err;
>>> +    }
>>> +    queue->tls_complete = NULL;
>>> +    if (!ret)
>>> +        set_bit(NVME_TCP_Q_TLS, &queue->flags);
>>> +    return ret;
>>> +}
>>> +
>>> +static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>>> +                key_serial_t peerid)
>>>   {
>>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>>       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>>> @@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct 
>>> nvme_ctrl *nctrl, int qid)
>>>           goto err_rcv_pdu;
>>>       }
>>> +    /* If PSKs are configured try to start TLS */
>>> +    if (peerid) {
>>
>> Where is peerid being initialized? Not to mention that peerid is
>> a rather cryptic name (at least to me). Is this the ClientHello
>> identity?
>>
> 'peerid' is the term used in the netlink handshake protocol.
> It actually is the key serial number of the PSK to use.
> Maybe 'psk_id' would be more appropriate here.

Probably psk_id is better.

>>> +        ret = nvme_tcp_start_tls(nctrl, queue, peerid);
>>> +        if (ret)
>>> +            goto err_init_connect;
>>> +    }
>>> +
>>>       ret = nvme_tcp_init_connection(queue);
>>>       if (ret)
>>>           goto err_init_connect;
>>> @@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct 
>>> nvme_ctrl *ctrl,
>>>   static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>>   {
>>> -    int ret;
>>> +    int ret = -EINVAL, num_keys, k;
>>> +    key_serial_t keylist[4];
>>> -    ret = nvme_tcp_alloc_queue(ctrl, 0);
>>> -    if (ret)
>>> -        return ret;
>>> +    memset(keylist, 0, sizeof(key_serial_t));
>>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>> +    for (k = 0; k < num_keys; k++) {
>>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
>>> +        if (!ret)
>>> +            break;
>>> +    }
>>> +    if (ret) {
>>> +        /* Try without TLS */
>>
>> Why? This is trying to always connect with tls and fall back to no-tls?
>> Why not simply do what userspace is asking us to do?
>>
>> Seems backwards to me. Unless there is a statement in the spec
>> that I'm not aware of which tells us to do so.
>>
> This is an implication of selecting the PSK in the kernel code.
> If we move PSK selection to userspace we clearly wouldn't need this.
> But then we'd need an updated nvme-cli for a) selecting the PSK from
> the keyring and b) passing in the new option.

This at least for me, sounds better no?

> So for development it was easier to run with the in-kernel selection as 
> I don't need to modify nvme-cli.

I understand. Thanks for explaining.

> 
>>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>>> +        if (ret)
>>> +            goto out_free_queue;
>>> +    }
>>>       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
>>>       if (ret)
>>> @@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct 
>>> nvme_ctrl *ctrl)
>>>   static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>>   {
>>> -    int i, ret;
>>> +    int i, ret, num_keys = 0, k;
>>> +    key_serial_t keylist[4];
>>> +    memset(keylist, 0, sizeof(key_serial_t));
>>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>>       for (i = 1; i < ctrl->queue_count; i++) {
>>> -        ret = nvme_tcp_alloc_queue(ctrl, i);
>>> -        if (ret)
>>> -            goto out_free_queues;
>>> +        ret = -EINVAL;
>>> +        for (k = 0; k < num_keys; k++) {
>>> +            ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
>>> +            if (!ret)
>>> +                break;
>>
>> What is going on here? Are you establishing queue_count x num_keys
>> nvme queues?
>>
> No, I am _trying_ to establish a connection, breaking out if the attempt
> _succeeded_.

Yes, it's just that the code is now confusing to read this way. The
loop makes it difficult.

Another approach would be to do this dance just for the admin queue,
and for the IO queues reuse the psk_id that it resolved. If this loop must live
in the driver, at least minimize it to the admin queue alone.
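
I.e. something like this (a hypothetical sketch: the 'tls_psk_id' field
recorded by the admin-queue handshake does not exist in this series):

static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
{
        int i, ret;

        for (i = 1; i < ctrl->queue_count; i++) {
                /* reuse the PSK the admin queue handshake resolved */
                ret = nvme_tcp_alloc_queue(ctrl, i,
                                           to_tcp_ctrl(ctrl)->tls_psk_id);
                if (ret)
                        goto out_free_queues;
        }

        return 0;

out_free_queues:
        for (i--; i >= 1; i--)
                nvme_tcp_free_queue(ctrl, i);
        return ret;
}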

As I said, I am absolutely not interested in supporting per connection
psk_id.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS
  2023-03-22 10:08     ` Hannes Reinecke
@ 2023-03-22 11:18       ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 11:18 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/22/23 12:08, Hannes Reinecke wrote:
> On 3/22/23 10:31, Sagi Grimberg wrote:
>>
>>
>> On 3/21/23 14:43, Hannes Reinecke wrote:
>>> kTLS does not support MSG_EOR flag for sendmsg(), and the ->sendpage()
>>> call really doesn't bring any benefit as data has to be copied
>>> anyway.
>>> So use sock_no_sendpage() or sendmsg() instead, and ensure that the
>>> MSG_EOR flag is blanked out for kTLS.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 33 +++++++++++++++++++++------------
>>>   1 file changed, 21 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index bbff1f52a167..007d457cacf9 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1034,13 +1034,19 @@ static int nvme_tcp_try_send_data(struct 
>>> nvme_tcp_request *req)
>>>           bool last = nvme_tcp_pdu_last_send(req, len);
>>>           int req_data_sent = req->data_sent;
>>>           int ret, flags = MSG_DONTWAIT;
>>> +        bool do_sendpage = sendpage_ok(page);
>>> -        if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
>>> +        if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
>>> +            flags |= MSG_MORE;
>>> +        else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
>>>               flags |= MSG_EOR;
>>> -        else
>>> -            flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>>
>> I think it's time to move the flags setting to a helper.
>>
>>> -        if (sendpage_ok(page)) {
>>> +        if (test_bit(NVME_TCP_Q_TLS, &queue->flags))
>>> +            do_sendpage = false;
>>> +
>>> +        if (do_sendpage) {
>>
>> The do_sendpage looks redundant to me.
>>
>>> +            if (flags & MSG_MORE)
>>> +                flags |= MSG_SENDPAGE_NOTLAST;
>>>               ret = kernel_sendpage(queue->sock, page, offset, len,
>>>                       flags);
>>
>> I think that the SENDPAGE_NOLAST should be set together with MSG_MORE
>> regardless.
>>
>>>           } else {
>>> @@ -1088,19 +1094,22 @@ static int nvme_tcp_try_send_cmd_pdu(struct 
>>> nvme_tcp_request *req)
>>>       bool inline_data = nvme_tcp_has_inline_data(req);
>>>       u8 hdgst = nvme_tcp_hdgst_len(queue);
>>>       int len = sizeof(*pdu) + hdgst - req->offset;
>>> -    int flags = MSG_DONTWAIT;
>>> +    struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
>>> +    struct kvec iov = {
>>> +        .iov_base = (u8 *)req->pdu + req->offset,
>>> +        .iov_len = len,
>>> +    };
>>>       int ret;
>>>       if (inline_data || nvme_tcp_queue_more(queue))
>>> -        flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>>> -    else
>>> -        flags |= MSG_EOR;
>>> +        msg.msg_flags |= MSG_MORE;
>>> +    else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags))
>>> +        msg.msg_flags |= MSG_EOR;
>>>       if (queue->hdr_digest && !req->offset)
>>>           nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>>> -    ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
>>> -            offset_in_page(pdu) + req->offset, len,  flags);
>>> +    ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
>>
>> I'd prefer to do kernel_sednpage/sock_no_sendpage similar to how we do
>> it for data and data pdu.
>>
>>>       if (unlikely(ret <= 0))
>>>           return ret;
>>> @@ -1131,7 +1140,7 @@ static int nvme_tcp_try_send_data_pdu(struct 
>>> nvme_tcp_request *req)
>>>       if (queue->hdr_digest && !req->offset)
>>>           nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>>> -    if (!req->h2cdata_left)
>>> +    if (!test_bit(NVME_TCP_Q_TLS, &queue->flags) && !req->h2cdata_left)
>>>           ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
>>>                   offset_in_page(pdu) + req->offset, len,
>>>                   MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
>>
>> Something is unclear to me. Is kernel_sendpage unsupported with tls? (I
>> think it is). I don't understand the motivation to add more checks in the
>> code for kernel_sendpage vs. sock_no_sendpage given that it should be
>> perfectly fine to use either.
>>
>> Did you see any regressions with using kernel_sendpage? If so, isn't
>> that a bug in the tls code?
> 
> The actual issue with the tls code is the 'MSG_EOR' handling.
> The problem is that tls uses MSG_EOR internally, and bails out on
> unknown MSG_ settings:

That is fine, lets separate MSG_EOR for TLS, and any change from
kernel_sendpage to kernel_sendmsg to a different patch (or eliminate
it altogether if it is unneeded).

> 
> int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> {
> [ .. ]
>          if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
>                                 MSG_CMSG_COMPAT))
>                  return -EOPNOTSUPP;
> 
> I would _vastly_ prefer to blank out unsupported flags (like MSG_EOR)
> from the TLS code, because for all intents and purposes MSG_EOR is just
> the opposite of MSG_MORE.

Not exactly. But ok.

> Or drop MSG_EOR usage from the nvme tcp code.

Possible to do. But MSG_EOR hints to the network stack that no further
payload is expected, so it can send the data down the wire asap and
avoid any batching heuristics.

> But then I'm not _that_ deep into the networking code to make a
> judgement here. And as we're using sendmsg() already I switched to
> using it for ktls, too (as I knew that the sendmsg() flow worked).
> But in the end I guess we could use sendpage going forward.

I'd prefer not to change that at this point if it is supported properly.
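
For reference, the helper I have in mind would look something like this
('sendpage' tells whether the caller is going to use kernel_sendpage();
sketch only, the helper itself is not in the posted series):

static int nvme_tcp_send_flags(struct nvme_tcp_queue *queue,
		bool more, bool sendpage)
{
	int flags = MSG_DONTWAIT;

	if (more) {
		flags |= MSG_MORE;
		/* NOTLAST is a sendpage-only hint */
		if (sendpage)
			flags |= MSG_SENDPAGE_NOTLAST;
	} else if (!test_bit(NVME_TCP_Q_TLS, &queue->flags)) {
		/* tls_sw_sendmsg() rejects MSG_EOR, see above */
		flags |= MSG_EOR;
	}
	return flags;
}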


* Re: [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-21 12:43 ` [PATCH 11/18] nvme-tcp: control message handling for recvmsg() Hannes Reinecke
@ 2023-03-22 11:33   ` Sagi Grimberg
  2023-03-22 11:48     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 11:33 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> kTLS is sending TLS ALERT messages as control messages for recvmsg().
> As we can't do anything sensible with it, just abort the connection
> and let the userspace agent do a re-negotiation.

Is this a problem if we do end up adding read_sock to tls?
Although I do see that the tls code does manage this in the
sk_buff control buffer, so I assume there is access to this info.

CC'ing Boris here as well.

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 68 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 68 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 007d457cacf9..e0fc98ac9e05 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -13,6 +13,7 @@
>   #include <linux/nvme-keyring.h>
>   #include <net/sock.h>
>   #include <net/tcp.h>
> +#include <net/tls.h>
>   #include <net/handshake.h>
>   #include <linux/blk-mq.h>
>   #include <crypto/hash.h>
> @@ -727,7 +728,12 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
>   {
>   	struct nvme_tcp_hdr *hdr;
>   	size_t rcv_len = queue->pdu_remaining;
> +	char cbuf[CMSG_LEN(sizeof(char))] = {};
> +	struct cmsghdr *cmsg;
> +	unsigned char ctype;
>   	struct msghdr msg = {
> +		.msg_control = cbuf,
> +		.msg_controllen = sizeof(cbuf),
>   		.msg_flags = pending ? 0 : MSG_DONTWAIT,
>   	};
>   	struct kvec iov = {
> @@ -743,6 +749,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, bool pending)
>   			     iov.iov_len, msg.msg_flags);
>   	if (ret <= 0)
>   		return ret;
> +	cmsg = (struct cmsghdr *)cbuf;
> +	if (CMSG_OK(&msg, cmsg) &&
> +	    cmsg->cmsg_level == SOL_TLS &&
> +	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
> +		ctype = *((unsigned char *)CMSG_DATA(cmsg));
> +		if (ctype != TLS_RECORD_TYPE_DATA) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"queue %d unhandled TLS record %d\n",
> +				nvme_tcp_queue_id(queue), ctype);
> +			return -ENOTCONN;
> +		}
> +	}
>   
>   	rcv_len = ret;
>   	queue->pdu_remaining -= rcv_len;
> @@ -793,6 +811,9 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>   	struct request *rq =
>   		nvme_cid_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>   	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	char cbuf[CMSG_LEN(sizeof(char))];
> +	struct cmsghdr *cmsg;
> +	unsigned char ctype;
>   
>   	if (nvme_tcp_recv_state(queue) != NVME_TCP_RECV_DATA)
>   		return 0;
> @@ -824,6 +845,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>   		/* we can read only from what is left in this bio */
>   		memset(&msg, 0, sizeof(msg));
>   		msg.msg_iter = req->iter;
> +		msg.msg_control = cbuf;
> +		msg.msg_controllen = sizeof(cbuf);
>   
>   		ret = sock_recvmsg(queue->sock, &msg, 0);
>   		if (ret <= 0) {
> @@ -832,6 +855,18 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue)
>   				nvme_tcp_queue_id(queue), rq->tag);
>   			return ret;
>   		}
> +		cmsg = (struct cmsghdr *)cbuf;
> +		if (CMSG_OK(&msg, cmsg) &&
> +		    cmsg->cmsg_level == SOL_TLS &&
> +		    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
> +			ctype = *((unsigned char *)CMSG_DATA(cmsg));
> +			if (ctype != TLS_RECORD_TYPE_DATA) {
> +				dev_err(queue->ctrl->ctrl.device,
> +					"queue %d unhandled TLS record %d\n",
> +					nvme_tcp_queue_id(queue), ctype);
> +				return -ENOTCONN;
> +			}
> +		}
>   
>   		queue->data_remaining -= ret;
>   		if (queue->data_remaining)
> @@ -861,7 +896,12 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
>   	char *ddgst = (char *)&queue->recv_ddgst;
>   	size_t recv_len = queue->ddgst_remaining;
>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
> +	char cbuf[CMSG_LEN(sizeof(char))] = {};
> +	struct cmsghdr *cmsg;
> +	unsigned char ctype;
>   	struct msghdr msg = {
> +		.msg_control = cbuf,
> +		.msg_controllen = sizeof(cbuf),
>   		.msg_flags = 0,
>   	};
>   	struct kvec iov = {
> @@ -877,6 +917,18 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue)
>   			     msg.msg_flags);
>   	if (ret <= 0)
>   		return ret;
> +	cmsg = (struct cmsghdr *)cbuf;
> +	if (CMSG_OK(&msg, cmsg) &&
> +	    cmsg->cmsg_level == SOL_TLS &&
> +	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
> +		ctype = *((unsigned char *)CMSG_DATA(cmsg));
> +		if (ctype != TLS_RECORD_TYPE_DATA) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"queue %d unhandled TLS record %d\n",
> +				nvme_tcp_queue_id(queue), ctype);
> +			return -ENOTCONN;
> +		}
> +	}
>   
>   	recv_len = ret;
>   	queue->ddgst_remaining -= recv_len;
> @@ -1372,6 +1424,9 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
>   {
>   	struct nvme_tcp_icreq_pdu *icreq;
>   	struct nvme_tcp_icresp_pdu *icresp;
> +	char cbuf[CMSG_LEN(sizeof(char))] = {};
> +	struct cmsghdr *cmsg;
> +	unsigned char ctype;
>   	struct msghdr msg = {};
>   	struct kvec iov;
>   	bool ctrl_hdgst, ctrl_ddgst;
> @@ -1409,10 +1464,23 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
>   	memset(&msg, 0, sizeof(msg));
>   	iov.iov_base = icresp;
>   	iov.iov_len = sizeof(*icresp);
> +	msg.msg_control = cbuf;
> +	msg.msg_controllen = sizeof(cbuf);
>   	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>   			iov.iov_len, msg.msg_flags);
>   	if (ret < 0)
>   		goto free_icresp;
> +	cmsg = (struct cmsghdr *)cbuf;
> +	if (CMSG_OK(&msg, cmsg) &&
> +	    cmsg->cmsg_level == SOL_TLS &&
> +	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
> +		ctype = *((unsigned char *)CMSG_DATA(cmsg));
> +		if (ctype != TLS_RECORD_TYPE_DATA) {
> +			pr_err("queue %d: unhandled TLS record %d\n",
> +			       nvme_tcp_queue_id(queue), ctype);
> +			return -ENOTCONN;
> +		}
> +	}
>   
>   	ret = -EINVAL;
>   	if (icresp->hdr.type != nvme_tcp_icresp) {
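
Btw, this record-type check is now copy-pasted into four receive paths.
If it stays, please fold it into a common helper, something like this
(sketch, the helper name is made up):

static int nvme_tcp_check_tls_record(struct nvme_tcp_queue *queue,
		struct msghdr *msg, char *cbuf)
{
	struct cmsghdr *cmsg = (struct cmsghdr *)cbuf;
	unsigned char ctype;

	if (!CMSG_OK(msg, cmsg) ||
	    cmsg->cmsg_level != SOL_TLS ||
	    cmsg->cmsg_type != TLS_GET_RECORD_TYPE)
		return 0;

	ctype = *((unsigned char *)CMSG_DATA(cmsg));
	if (ctype == TLS_RECORD_TYPE_DATA)
		return 0;

	/* an alert or handshake record; we cannot handle it here */
	dev_err(queue->ctrl->ctrl.device,
		"queue %d unhandled TLS record %d\n",
		nvme_tcp_queue_id(queue), ctype);
	return -ENOTCONN;
}

Each call site would then just do 'ret = nvme_tcp_check_tls_record(queue,
&msg, cbuf);' right after the recvmsg().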


* Re: [PATCH 12/18] nvmet: make TCP sectype settable via configfs
  2023-03-21 12:43 ` [PATCH 12/18] nvmet: make TCP sectype settable via configfs Hannes Reinecke
@ 2023-03-22 11:38   ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 11:38 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> Add a new configfs attribute 'addr_tsas' to make the TCP sectype
> settable via configfs.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/configfs.c | 65 ++++++++++++++++++++++++++++++++++
>   1 file changed, 65 insertions(+)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index 907143870da5..d3d105a1665c 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -303,6 +303,11 @@ static void nvmet_port_init_tsas_rdma(struct nvmet_port *port)
>   	port->disc_addr.tsas.rdma.cms = NVMF_RDMA_CMS_RDMA_CM;
>   }
>   
> +static void nvmet_port_init_tsas_tcp(struct nvmet_port *port, int tsas)
> +{
> +	port->disc_addr.tsas.tcp.sectype = tsas;

Better to call the argument sectype.

> +}
> +
>   static ssize_t nvmet_addr_trtype_store(struct config_item *item,
>   		const char *page, size_t count)
>   {
> @@ -325,11 +330,70 @@ static ssize_t nvmet_addr_trtype_store(struct config_item *item,
>   	port->disc_addr.trtype = nvmet_transport[i].type;
>   	if (port->disc_addr.trtype == NVMF_TRTYPE_RDMA)
>   		nvmet_port_init_tsas_rdma(port);
> +	else if (port->disc_addr.trtype == NVMF_TRTYPE_TCP)
> +		nvmet_port_init_tsas_tcp(port, NVMF_TCP_SECTYPE_NONE);
>   	return count;
>   }
>   
>   CONFIGFS_ATTR(nvmet_, addr_trtype);
>   
> +static const struct nvmet_type_name_map nvmet_addr_tsas_tcp[] = {
> +	{ NVMF_TCP_SECTYPE_NONE,	"none" },
> +	{ NVMF_TCP_SECTYPE_TLS12,	"tls1.2" },
> +	{ NVMF_TCP_SECTYPE_TLS13,	"tls1.3" },
> +};
> +
> +static ssize_t nvmet_addr_tsas_show(struct config_item *item,
> +		char *page)
> +{
> +	struct nvmet_port *port = to_nvmet_port(item);
> +	int i;
> +
> +	if (port->disc_addr.trtype == NVMF_TRTYPE_TCP) {
> +		for (i = 0; i < ARRAY_SIZE(nvmet_addr_tsas_tcp); i++) {
> +			if (port->disc_addr.tsas.tcp.sectype == nvmet_addr_tsas_tcp[i].type)
> +				return sprintf(page, "%s\n", nvmet_addr_tsas_tcp[i].name);
> +		}
> +	} else if (port->disc_addr.trtype == NVMF_TRTYPE_RDMA) {
> +		switch (port->disc_addr.tsas.rdma.qptype) {
> +		case NVMF_RDMA_QPTYPE_CONNECTED:
> +			return sprintf(page, "connected\n");
> +		case NVMF_RDMA_QPTYPE_DATAGRAM:
> +			return sprintf(page, "datagram\n");

There is no way that datagram is supported. Not even sure why it exists.


> +		default:
> +			return sprintf(page, "reserved\n");
> +		}
> +	}
> +	return sprintf(page, "not required\n");
> +}
> +
> +static ssize_t nvmet_addr_tsas_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_port *port = to_nvmet_port(item);
> +	int i;
> +
> +	if (nvmet_is_port_enabled(port, __func__))
> +		return -EACCES;
> +
> +	if (port->disc_addr.trtype != NVMF_TRTYPE_TCP)
> +		return -EINVAL;
> +
> +	for (i = 0; i < ARRAY_SIZE(nvmet_addr_tsas_tcp); i++) {
> +		if (sysfs_streq(page, nvmet_addr_tsas_tcp[i].name))
> +			goto found;
> +	}
> +
> +	pr_err("Invalid value '%s' for tsas\n", page);
> +	return -EINVAL;
> +
> +found:
> +	nvmet_port_init_tsas_tcp(port, nvmet_addr_tsas_tcp[i].type);
> +	return count;
> +}
> +
> +CONFIGFS_ATTR(nvmet_, addr_tsas);
> +
>   /*
>    * Namespace structures & file operation functions below
>    */
> @@ -1741,6 +1805,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
>   	&nvmet_attr_addr_traddr,
>   	&nvmet_attr_addr_trsvcid,
>   	&nvmet_attr_addr_trtype,
> +	&nvmet_attr_addr_tsas,
>   	&nvmet_attr_param_inline_data_size,
>   #ifdef CONFIG_BLK_DEV_INTEGRITY
>   	&nvmet_attr_param_pi_enable,

Overall though this looks good.


* Re: [PATCH 13/18] nvmet-tcp: allocate socket file
  2023-03-21 12:43 ` [PATCH 13/18] nvmet-tcp: allocate socket file Hannes Reinecke
@ 2023-03-22 11:46   ` Sagi Grimberg
  2023-03-22 12:07     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 11:46 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> When using the TLS upcall we need to allocate a socket file such
> that the userspace daemon is able to use the socket.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/tcp.c | 49 ++++++++++++++++++++++++++++-----------
>   1 file changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 66e8f9fd0ca7..5c43767c5ecd 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -96,12 +96,14 @@ struct nvmet_tcp_cmd {
>   
>   enum nvmet_tcp_queue_state {
>   	NVMET_TCP_Q_CONNECTING,
> +	NVMET_TCP_Q_TLS_HANDSHAKE,
>   	NVMET_TCP_Q_LIVE,
>   	NVMET_TCP_Q_DISCONNECTING,
>   };
>   
>   struct nvmet_tcp_queue {
>   	struct socket		*sock;
> +	struct file		*sock_file;
>   	struct nvmet_tcp_port	*port;
>   	struct work_struct	io_work;
>   	struct nvmet_cq		nvme_cq;
> @@ -1455,12 +1457,19 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
>   	nvmet_sq_destroy(&queue->nvme_sq);
>   	cancel_work_sync(&queue->io_work);
>   	nvmet_tcp_free_cmd_data_in_buffers(queue);
> -	sock_release(queue->sock);
> +	if (queue->sock_file) {
> +		fput(queue->sock_file);

I don't remember, but does the fput call sock_release
on the final put? I'd move this into a helper nvmet_tcp_close_sock()
or something.

> +		queue->sock_file = NULL;
> +		queue->sock = NULL;

I always get a bit wary when I see that deallocations are setting
pointers to NULL.

> +	} else {
> +		WARN_ON(!queue->sock->ops);
> +		sock_release(queue->sock);
> +		queue->sock = NULL;
> +	}
>   	nvmet_tcp_free_cmds(queue);
>   	if (queue->hdr_digest || queue->data_digest)
>   		nvmet_tcp_free_crypto(queue);
>   	ida_free(&nvmet_tcp_queue_ida, queue->idx);
> -
>   	page = virt_to_head_page(queue->pf_cache.va);
>   	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>   	kfree(queue);
> @@ -1583,7 +1592,7 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
>   	return ret;
>   }
>   
> -static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
> +static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   		struct socket *newsock)

Why is this becoming a void function? This absolutely can fail.

>   {
>   	struct nvmet_tcp_queue *queue;
> @@ -1591,7 +1600,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   
>   	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>   	if (!queue)
> -		return -ENOMEM;
> +		return;
>   
>   	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>   	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
> @@ -1599,15 +1608,28 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   	queue->port = port;
>   	queue->nr_cmds = 0;
>   	spin_lock_init(&queue->state_lock);
> -	queue->state = NVMET_TCP_Q_CONNECTING;
> +	if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
> +	    NVMF_TCP_SECTYPE_TLS13)
> +		queue->state = NVMET_TCP_Q_TLS_HANDSHAKE;
> +	else
> +		queue->state = NVMET_TCP_Q_CONNECTING;
>   	INIT_LIST_HEAD(&queue->free_list);
>   	init_llist_head(&queue->resp_list);
>   	INIT_LIST_HEAD(&queue->resp_send_list);
>   
> +	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
> +		queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, NULL);
> +		if (IS_ERR(queue->sock_file)) {
> +			ret = PTR_ERR(queue->sock_file);
> +			queue->sock_file = NULL;
> +			goto out_free_queue;
> +		}
> +	}
> +
>   	queue->idx = ida_alloc(&nvmet_tcp_queue_ida, GFP_KERNEL);
>   	if (queue->idx < 0) {
>   		ret = queue->idx;
> -		goto out_free_queue;
> +		goto out_sock;
>   	}
>   
>   	ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
> @@ -1628,7 +1650,7 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   	if (ret)
>   		goto out_destroy_sq;
>   
> -	return 0;
> +	return;
>   out_destroy_sq:
>   	mutex_lock(&nvmet_tcp_queue_mutex);
>   	list_del_init(&queue->queue_list);
> @@ -1638,9 +1660,14 @@ static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   	nvmet_tcp_free_cmd(&queue->connect);
>   out_ida_remove:
>   	ida_free(&nvmet_tcp_queue_ida, queue->idx);
> +out_sock:
> +	if (queue->sock_file)
> +		fput(queue->sock_file);
> +	else
> +		sock_release(queue->sock);
>   out_free_queue:
>   	kfree(queue);
> -	return ret;
> +	pr_err("failed to allocate queue\n");

Can we design this better?
It looks backwards that this routine deallocates an argument
coming from the call-site.

I know that this is similar to what happens with kernel_accept
to some extent. But would prefer to avoid this pattern if possible.

>   }
>   
>   static void nvmet_tcp_accept_work(struct work_struct *w)
> @@ -1657,11 +1684,7 @@ static void nvmet_tcp_accept_work(struct work_struct *w)
>   				pr_warn("failed to accept err=%d\n", ret);
>   			return;
>   		}
> -		ret = nvmet_tcp_alloc_queue(port, newsock);
> -		if (ret) {
> -			pr_err("failed to allocate queue\n");
> -			sock_release(newsock);
> -		}
> +		nvmet_tcp_alloc_queue(port, newsock);
>   	}
>   }
>   


* Re: [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-22 11:33   ` Sagi Grimberg
@ 2023-03-22 11:48     ` Hannes Reinecke
  2023-03-22 11:50       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 11:48 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 12:33, Sagi Grimberg wrote:
> 
>> kTLS is sending TLS ALERT messages as control messages for recvmsg().
>> As we can't do anything sensible with it, just abort the connection
>> and let the userspace agent do a re-negotiation.
> 
> Is this a problem if we do end up adding read_sock to tls?
> Although I do see that the tls code does manage this in the
> sk_buff control buffer, so I assume there is access to this info.
> 
> CC'ing Boris here as well.
> 
Yeah, that was the other reason; cmsg aka TLS alerts are only available 
for recvmsg().

However, for TLS 1.3 the only TLS alert which does not trigger a
connection reset would be the ominous 'new session ticket' alert.
But during TLS handshake development we already decided to _disable_
session tickets as they are pretty meaningless for our use.
Consequently all TLS alerts will trigger a connection reset, and 
realistically we don't _need_ to know which alert type has triggered
the reset.

So 'read_sock()' could be implemented by closing the connection on
any TLS alert, and not giving us access to any of the alerts via
control messages. If that makes life easier ...

Cheers,

Hannes



* Re: [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-22 11:48     ` Hannes Reinecke
@ 2023-03-22 11:50       ` Sagi Grimberg
  2023-03-22 12:17         ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 11:50 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>> kTLS is sending TLS ALERT messages as control messages for recvmsg().
>>> As we can't do anything sensible with it, just abort the connection
>>> and let the userspace agent do a re-negotiation.
>>
>> Is this a problem if we do end up adding read_sock to tls?
>> Although I do see that the tls code does manage this in the
>> sk_buff control buffer, so I assume there is access to this info.
>>
>> CC'ing Boris here as well.
>>
> Yeah, that was the other reason; cmsg aka TLS alerts are only available 
> for recvmsg().
> 
> However, for TLS 1.3 the only TLS alert which does not trigger a 
> connection reset would the the ominous 'new session ticket' alert.
> But during TLS handshake development we already decided to _disable_
> session tickets as they are pretty meaningless for our us.
> Consequently all TLS alerts will trigger a connection reset, and 
> realistically we don't _need_ to know which alert type has triggered
> the reset.
> 
> So 'read_sock()' could be implemented by closing the connection on
> any TLS alert, and not giving us access to any of the alerts via
> control messages. If that makes life easier ...

Most likely read_sock() is the wrong place to ignore alerts. Probably
the better way is to do this in nvme, although it's a bit harder.


* Re: [PATCH 13/18] nvmet-tcp: allocate socket file
  2023-03-22 11:46   ` Sagi Grimberg
@ 2023-03-22 12:07     ` Hannes Reinecke
  0 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 12:07 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 12:46, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> When using the TLS upcall we need to allocate a socket file such
>> that the userspace daemon is able to use the socket.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/target/tcp.c | 49 ++++++++++++++++++++++++++++-----------
>>   1 file changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 66e8f9fd0ca7..5c43767c5ecd 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -96,12 +96,14 @@ struct nvmet_tcp_cmd {
>>   enum nvmet_tcp_queue_state {
>>       NVMET_TCP_Q_CONNECTING,
>> +    NVMET_TCP_Q_TLS_HANDSHAKE,
>>       NVMET_TCP_Q_LIVE,
>>       NVMET_TCP_Q_DISCONNECTING,
>>   };
>>   struct nvmet_tcp_queue {
>>       struct socket        *sock;
>> +    struct file        *sock_file;
>>       struct nvmet_tcp_port    *port;
>>       struct work_struct    io_work;
>>       struct nvmet_cq        nvme_cq;
>> @@ -1455,12 +1457,19 @@ static void 
>> nvmet_tcp_release_queue_work(struct work_struct *w)
>>       nvmet_sq_destroy(&queue->nvme_sq);
>>       cancel_work_sync(&queue->io_work);
>>       nvmet_tcp_free_cmd_data_in_buffers(queue);
>> -    sock_release(queue->sock);
>> +    if (queue->sock_file) {
>> +        fput(queue->sock_file);
> 
> I don't remember, but does the fput call sock_release
> on the final put? I'd move this into a helper nvmet_tcp_close_sock()
> or something.
> 
Yes, it does. (Took me some weeks to figure that out...)
But yeah, we can do a helper.
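
Something like this, then (untested sketch):

static void nvmet_tcp_close_sock(struct nvmet_tcp_queue *queue)
{
	if (queue->sock_file) {
		/* the final fput() also releases queue->sock */
		fput(queue->sock_file);
		queue->sock_file = NULL;
	} else {
		sock_release(queue->sock);
	}
	queue->sock = NULL;
}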

>> +        queue->sock_file = NULL;
>> +        queue->sock = NULL;
> 
> I always get a bit wary when I see that deallocations are setting
> pointers to NULL.
> 
And curiously that's a pattern I commonly use to track invalid accesses.
But that's just personal preference.

>> +    } else {
>> +        WARN_ON(!queue->sock->ops);
>> +        sock_release(queue->sock);
>> +        queue->sock = NULL;
>> +    }
>>       nvmet_tcp_free_cmds(queue);
>>       if (queue->hdr_digest || queue->data_digest)
>>           nvmet_tcp_free_crypto(queue);
>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>> -
>>       page = virt_to_head_page(queue->pf_cache.va);
>>       __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>>       kfree(queue);
>> @@ -1583,7 +1592,7 @@ static int nvmet_tcp_set_queue_sock(struct 
>> nvmet_tcp_queue *queue)
>>       return ret;
>>   }
>> -static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>> +static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>           struct socket *newsock)
> 
> Why is this becoming a void function? This absolutely can fail.
> 
Oh, it can fail.
But it's called as the last statement in a 'void' function, so there is
nothing the caller could do with the return value.
And the caller actually just uses the return value to print out a
logging message, so I moved that call into nvmet_tcp_alloc_queue()
and made it a 'void' function.

>>   {
>>       struct nvmet_tcp_queue *queue;
>> @@ -1591,7 +1600,7 @@ static int nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>>       if (!queue)
>> -        return -ENOMEM;
>> +        return;
>>       INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>>       INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
>> @@ -1599,15 +1608,28 @@ static int nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       queue->port = port;
>>       queue->nr_cmds = 0;
>>       spin_lock_init(&queue->state_lock);
>> -    queue->state = NVMET_TCP_Q_CONNECTING;
>> +    if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
>> +        NVMF_TCP_SECTYPE_TLS13)
>> +        queue->state = NVMET_TCP_Q_TLS_HANDSHAKE;
>> +    else
>> +        queue->state = NVMET_TCP_Q_CONNECTING;
>>       INIT_LIST_HEAD(&queue->free_list);
>>       init_llist_head(&queue->resp_list);
>>       INIT_LIST_HEAD(&queue->resp_send_list);
>> +    if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>> +        queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, 
>> NULL);
>> +        if (IS_ERR(queue->sock_file)) {
>> +            ret = PTR_ERR(queue->sock_file);
>> +            queue->sock_file = NULL;
>> +            goto out_free_queue;
>> +        }
>> +    }
>> +
>>       queue->idx = ida_alloc(&nvmet_tcp_queue_ida, GFP_KERNEL);
>>       if (queue->idx < 0) {
>>           ret = queue->idx;
>> -        goto out_free_queue;
>> +        goto out_sock;
>>       }
>>       ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
>> @@ -1628,7 +1650,7 @@ static int nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       if (ret)
>>           goto out_destroy_sq;
>> -    return 0;
>> +    return;
>>   out_destroy_sq:
>>       mutex_lock(&nvmet_tcp_queue_mutex);
>>       list_del_init(&queue->queue_list);
>> @@ -1638,9 +1660,14 @@ static int nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       nvmet_tcp_free_cmd(&queue->connect);
>>   out_ida_remove:
>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>> +out_sock:
>> +    if (queue->sock_file)
>> +        fput(queue->sock_file);
>> +    else
>> +        sock_release(queue->sock);
>>   out_free_queue:
>>       kfree(queue);
>> -    return ret;
>> +    pr_err("failed to allocate queue\n");
> 
> Can we design this better?
> It looks backwards that this routine deallocates an argument
> coming from the call-site.
> 
> I know that this is similar to what happens with kernel_accept
> to some extent. But would prefer to avoid this pattern if possible.
> 
Sure; I just followed precedent here.
But no prob to change it.
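
E.g. by moving sock_alloc_file() into the accept loop, so ownership of
the socket stays at the call-site (untested sketch; the additional
'sock_file' argument to nvmet_tcp_alloc_queue() is an assumption):

static void nvmet_tcp_accept_work(struct work_struct *w)
{
	struct nvmet_tcp_port *port =
		container_of(w, struct nvmet_tcp_port, accept_work);
	struct socket *newsock;
	struct file *sock_file;
	int ret;

	while (true) {
		ret = kernel_accept(port->sock, &newsock, O_NONBLOCK);
		if (ret < 0) {
			if (ret != -EAGAIN)
				pr_warn("failed to accept err=%d\n", ret);
			return;
		}
		sock_file = NULL;
		if (port->nport->disc_addr.tsas.tcp.sectype ==
		    NVMF_TCP_SECTYPE_TLS13) {
			sock_file = sock_alloc_file(newsock, O_CLOEXEC, NULL);
			/* sock_alloc_file() releases newsock on failure */
			if (IS_ERR(sock_file))
				continue;
		}
		ret = nvmet_tcp_alloc_queue(port, newsock, sock_file);
		if (ret) {
			pr_err("failed to allocate queue, err=%d\n", ret);
			if (sock_file)
				fput(sock_file); /* drops newsock as well */
			else
				sock_release(newsock);
		}
	}
}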

Cheers,

Hannes



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-21 12:43 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
@ 2023-03-22 12:13   ` Sagi Grimberg
  2023-03-22 12:34     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:13 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> Add functions to start the TLS handshake upcall.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/tcp.c | 188 ++++++++++++++++++++++++++++++++++++--
>   1 file changed, 181 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 5c43767c5ecd..6e88e98a2c59 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -9,8 +9,10 @@
>   #include <linux/slab.h>
>   #include <linux/err.h>
>   #include <linux/nvme-tcp.h>
> +#include <linux/nvme-keyring.h>
>   #include <net/sock.h>
>   #include <net/tcp.h>
> +#include <net/handshake.h>
>   #include <linux/inet.h>
>   #include <linux/llist.h>
>   #include <crypto/hash.h>
> @@ -40,6 +42,14 @@ module_param(idle_poll_period_usecs, int, 0644);
>   MODULE_PARM_DESC(idle_poll_period_usecs,
>   		"nvmet tcp io_work poll till idle time period in usecs");
>   
> +/*
> + * TLS handshake timeout
> + */
> +static int tls_handshake_timeout = 30;

30 ?

> +module_param(tls_handshake_timeout, int, 0644);
> +MODULE_PARM_DESC(tls_handshake_timeout,
> +		 "nvme TLS handshake timeout in seconds (default 30)");
> +
>   #define NVMET_TCP_RECV_BUDGET		8
>   #define NVMET_TCP_SEND_BUDGET		8
>   #define NVMET_TCP_IO_WORK_BUDGET	64
> @@ -131,6 +141,9 @@ struct nvmet_tcp_queue {
>   	struct ahash_request	*snd_hash;
>   	struct ahash_request	*rcv_hash;
>   
> +	struct key		*tls_psk;
> +	struct delayed_work	tls_handshake_work;
> +
>   	unsigned long           poll_end;
>   
>   	spinlock_t		state_lock;
> @@ -168,6 +181,7 @@ static struct workqueue_struct *nvmet_tcp_wq;
>   static const struct nvmet_fabrics_ops nvmet_tcp_ops;
>   static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
>   static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *work);
>   
>   static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
>   		struct nvmet_tcp_cmd *cmd)
> @@ -1400,6 +1414,8 @@ static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
>   {
>   	struct socket *sock = queue->sock;
>   
> +	if (!sock->sk)
> +		return;

Umm, when will the sock not have an sk?

>   	write_lock_bh(&sock->sk->sk_callback_lock);
>   	sock->sk->sk_data_ready =  queue->data_ready;
>   	sock->sk->sk_state_change = queue->state_change;
> @@ -1448,7 +1464,8 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
>   	list_del_init(&queue->queue_list);
>   	mutex_unlock(&nvmet_tcp_queue_mutex);
>   
> -	nvmet_tcp_restore_socket_callbacks(queue);
> +	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
> +		nvmet_tcp_restore_socket_callbacks(queue);

This is because you only save the callbacks after the handshake
phase is done? Maybe it would be simpler to clear the ops because
the socket is going away anyways...

>   	cancel_work_sync(&queue->io_work);
>   	/* stop accepting incoming data */
>   	queue->rcv_state = NVMET_TCP_RECV_ERR;
> @@ -1469,6 +1486,8 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
>   	nvmet_tcp_free_cmds(queue);
>   	if (queue->hdr_digest || queue->data_digest)
>   		nvmet_tcp_free_crypto(queue);
> +	if (queue->tls_psk)
> +		key_put(queue->tls_psk);
>   	ida_free(&nvmet_tcp_queue_ida, queue->idx);
>   	page = virt_to_head_page(queue->pf_cache.va);
>   	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
> @@ -1481,11 +1500,15 @@ static void nvmet_tcp_data_ready(struct sock *sk)
>   
>   	trace_sk_data_ready(sk);
>   
> -	read_lock_bh(&sk->sk_callback_lock);
> -	queue = sk->sk_user_data;
> -	if (likely(queue))
> -		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
> -	read_unlock_bh(&sk->sk_callback_lock);
> +	rcu_read_lock_bh();
> +	queue = rcu_dereference_sk_user_data(sk);
> +	if (queue && queue->data_ready)
> +		queue->data_ready(sk);
> +	if (likely(queue) &&
> +	    queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
> +		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
> +			      &queue->io_work);
> +	rcu_read_unlock_bh();

Same comment as the host side. separate rcu stuff from data_ready call.

>   }
>   
>   static void nvmet_tcp_write_space(struct sock *sk)
> @@ -1585,13 +1608,139 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
>   		sock->sk->sk_write_space = nvmet_tcp_write_space;
>   		if (idle_poll_period_usecs)
>   			nvmet_tcp_arm_queue_deadline(queue);
> -		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
> +		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
> +			      &queue->io_work);

Why the change?

>   	}
>   	write_unlock_bh(&sock->sk->sk_callback_lock);
>   
>   	return ret;
>   }
>   
> +static void nvmet_tcp_tls_data_ready(struct sock *sk)
> +{
> +	struct socket_wq *wq;
> +
> +	rcu_read_lock();
> +	/* kTLS will change the callback */
> +	if (sk->sk_data_ready == nvmet_tcp_tls_data_ready) {
> +		wq = rcu_dereference(sk->sk_wq);
> +		if (skwq_has_sleeper(wq))
> +			wake_up_interruptible_all(&wq->wait);
> +	}
> +	rcu_read_unlock();
> +}

Can you explain why this is needed? It looks out-of-place.
Who is this waking up? Isn't tls already calling the socket's
default data_ready that does something similar for userspace?

> +
> +static void nvmet_tcp_tls_handshake_restart(struct nvmet_tcp_queue *queue)
> +{
> +	spin_lock(&queue->state_lock);
> +	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
> +		pr_warn("queue %d: TLS handshake already completed\n",
> +			queue->idx);
> +		spin_unlock(&queue->state_lock);
> +		return;
> +	}
> +	queue->state = NVMET_TCP_Q_CONNECTING;
> +	spin_unlock(&queue->state_lock);
> +
> +	pr_debug("queue %d: restarting queue after TLS handshake\n",
> +		 queue->idx);
> +	/*
> +	 * Set callbacks after handshake; TLS implementation
> +	 * might have changed the socket callbacks.
> +	 */
> +	nvmet_tcp_set_queue_sock(queue);

My understanding is that this is the desired end-state, i.e.
tls connection is ready and now we are expecting nvme traffic?

I think that the function name should be changed, it sounds like
it is restarting the handshake, and it does not appear to do that.

> +}
> +
> +static void nvmet_tcp_save_tls_callbacks(struct nvmet_tcp_queue *queue)
> +{
> +	struct sock *sk = queue->sock->sk;
> +
> +	write_lock_bh(&sk->sk_callback_lock);
> +	rcu_assign_sk_user_data(sk, queue);
> +	queue->data_ready = sk->sk_data_ready;
> +	sk->sk_data_ready = nvmet_tcp_tls_data_ready;
> +	write_unlock_bh(&sk->sk_callback_lock);
> +}
> +
> +static void nvmet_tcp_restore_tls_callbacks(struct nvmet_tcp_queue *queue)
> +{
> +	struct sock *sk = queue->sock->sk;
> +
> +	if (WARN_ON(!sk))
> +		return;
> +	write_lock_bh(&sk->sk_callback_lock);
> +	/* Only reset the callback if it really is ours */
> +	if (sk->sk_data_ready == nvmet_tcp_tls_data_ready)

I still don't understand why our data_ready for tls is needed.
Who are

> +		sk->sk_data_ready = queue->data_ready;
> +	rcu_assign_sk_user_data(sk, NULL);
> +	queue->data_ready = NULL;
> +	write_unlock_bh(&sk->sk_callback_lock);
> +}
> +
> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
> +					 key_serial_t peerid)
> +{
> +	struct nvmet_tcp_queue *queue = data;
> +
> +	pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
> +		 queue->idx, peerid, status);
> +	if (!status) {
> +		spin_lock(&queue->state_lock);
> +		queue->tls_psk = key_lookup(peerid);
> +		if (IS_ERR(queue->tls_psk)) {
> +			pr_warn("queue %d: TLS key %x not found\n",
> +				queue->idx, peerid);
> +			queue->tls_psk = NULL;
> +		}
> +		spin_unlock(&queue->state_lock);
> +	}
> +	cancel_delayed_work_sync(&queue->tls_handshake_work);
> +	nvmet_tcp_restore_tls_callbacks(queue);
> +	if (status)
> +		nvmet_tcp_schedule_release_queue(queue);
> +	else
> +		nvmet_tcp_tls_handshake_restart(queue);
> +}
> +
> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
> +{
> +	struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
> +			struct nvmet_tcp_queue, tls_handshake_work);
> +
> +	pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
> +	nvmet_tcp_restore_tls_callbacks(queue);
> +	nvmet_tcp_schedule_release_queue(queue);
> +}
> +
> +static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
> +{
> +	int ret = -EOPNOTSUPP;
> +	struct tls_handshake_args args;
> +
> +	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
> +		pr_warn("cannot start TLS in state %d\n", queue->state);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("queue %d: TLS ServerHello\n", queue->idx);
> +	args.ta_sock = queue->sock;
> +	args.ta_done = nvmet_tcp_tls_handshake_done;
> +	args.ta_data = queue;
> +	args.ta_keyring = nvme_keyring_id();
> +	args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;

  why the 2x timeout?

> +
> +	ret = tls_server_hello_psk(&args, GFP_KERNEL);
> +	if (ret) {
> +		pr_err("failed to start TLS, err=%d\n", ret);
> +	} else {
> +		pr_debug("queue %d wakeup userspace\n", queue->idx);
> +		nvmet_tcp_tls_data_ready(queue->sock->sk);
> +		queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
> +				   tls_handshake_timeout * HZ);
> +	}
> +	return ret;
> +}
> +
>   static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   		struct socket *newsock)
>   {
> @@ -1604,6 +1753,8 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   
>   	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>   	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
> +	INIT_DELAYED_WORK(&queue->tls_handshake_work,
> +			  nvmet_tcp_tls_handshake_timeout_work);
>   	queue->sock = newsock;
>   	queue->port = port;
>   	queue->nr_cmds = 0;
> @@ -1646,6 +1797,29 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   	list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
>   	mutex_unlock(&nvmet_tcp_queue_mutex);
>   
> +	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
> +		nvmet_tcp_save_tls_callbacks(queue);
> +		if (!nvmet_tcp_tls_handshake(queue))
> +			return;
> +		nvmet_tcp_restore_tls_callbacks(queue);
> +
> +		/*
> +		 * If sectype is set to 'tls1.3' TLS is required
> +		 * so terminate the connection if the TLS handshake
> +		 * failed.
> +		 */
> +		if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
> +		    NVMF_TCP_SECTYPE_TLS13) {
> +			pr_debug("queue %d sectype tls1.3, terminate connection\n",
> +				 queue->idx);
> +			goto out_destroy_sq;
> +		}
> +		pr_debug("queue %d fallback to icreq\n", queue->idx);
> +		spin_lock(&queue->state_lock);
> +		queue->state = NVMET_TCP_Q_CONNECTING;
> +		spin_unlock(&queue->state_lock);
> +	}
> +
>   	ret = nvmet_tcp_set_queue_sock(queue);
>   	if (ret)
>   		goto out_destroy_sq;

I'm still trying to learn the state machine here, can you share a few
words on it? Please also include it in the change log next round.
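
From the posted code I'm reading the states roughly as follows (a
sketch reconstructed from the diff, please correct me if I got a
transition wrong):

  accept()
    sectype == tls1.3:  -> NVMET_TCP_Q_TLS_HANDSHAKE
        handshake done, status == 0    -> NVMET_TCP_Q_CONNECTING
        handshake error or timeout     -> queue release
    any other sectype:  -> NVMET_TCP_Q_CONNECTING
  icreq + fabrics connect              -> NVMET_TCP_Q_LIVE
  socket error / controller teardown   -> NVMET_TCP_Q_DISCONNECTING
                                       -> queue release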


* Re: [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS
  2023-03-21 12:43 ` [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS Hannes Reinecke
@ 2023-03-22 12:16   ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:16 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> kTLS ->sendpage() doesn't support the MSG_EOR flag, and it's
> questionable whether it makes sense for kTLS as one has to copy
> data anyway.
> So use sock_no_sendpage() for kTLS.

Same comments as the host side.
1. separate MSG_EOR from kernel_sendpage
2. keep kernel_sendpage unless it is genuinely unsupported.

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/tcp.c | 56 ++++++++++++++++++++++++++++-----------
>   1 file changed, 41 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 6e88e98a2c59..9b69cac84508 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -570,9 +570,14 @@ static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
>   	int left = sizeof(*cmd->data_pdu) - cmd->offset + hdgst;
>   	int ret;
>   
> -	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
> -			offset_in_page(cmd->data_pdu) + cmd->offset,
> -			left, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
> +	if (cmd->queue->tls_psk)
> +		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
> +				      offset_in_page(cmd->data_pdu) + cmd->offset,
> +				      left, MSG_DONTWAIT | MSG_MORE);
> +	else
> +		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
> +				      offset_in_page(cmd->data_pdu) + cmd->offset,
> +				      left, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
>   	if (ret <= 0)
>   		return ret;
>   
> @@ -600,10 +605,17 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
>   		if ((!last_in_batch && cmd->queue->send_list_len) ||
>   		    cmd->wbytes_done + left < cmd->req.transfer_len ||
>   		    queue->data_digest || !queue->nvme_sq.sqhd_disabled)
> -			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> -
> -		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
> -					left, flags);
> +			flags |= MSG_MORE;
> +
> +		if (queue->tls_psk)
> +			ret = sock_no_sendpage(cmd->queue->sock, page, cmd->offset,
> +					       left, flags);
> +		else {
> +			if (flags & MSG_MORE)
> +				flags |= MSG_SENDPAGE_NOTLAST;
> +			ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
> +					      left, flags);
> +		}
>   		if (ret <= 0)
>   			return ret;
>   
> @@ -645,12 +657,19 @@ static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd,
>   	int ret;
>   
>   	if (!last_in_batch && cmd->queue->send_list_len)
> -		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> -	else
> +		flags |= MSG_MORE;
> +	else if (!cmd->queue->tls_psk)
>   		flags |= MSG_EOR;
>   
> -	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
> -		offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
> +	if (cmd->queue->tls_psk)
> +		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
> +			offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
> +	else {
> +		if (flags & MSG_MORE)
> +			flags |= MSG_SENDPAGE_NOTLAST;
> +		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
> +			offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
> +	}
>   	if (ret <= 0)
>   		return ret;
>   	cmd->offset += ret;
> @@ -673,12 +692,19 @@ static int nvmet_try_send_r2t(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
>   	int ret;
>   
>   	if (!last_in_batch && cmd->queue->send_list_len)
> -		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> -	else
> +		flags |= MSG_MORE;
> +	else if (!cmd->queue->tls_psk)
>   		flags |= MSG_EOR;
>   
> -	ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
> -		offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
> +	if (cmd->queue->tls_psk)
> +		ret = sock_no_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
> +			offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
> +	else {
> +		if (flags & MSG_MORE)
> +			flags |= MSG_SENDPAGE_NOTLAST;
> +		ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
> +			offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
> +	}
>   	if (ret <= 0)
>   		return ret;
>   	cmd->offset += ret;


* Re: [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-22 11:50       ` Sagi Grimberg
@ 2023-03-22 12:17         ` Hannes Reinecke
  2023-03-22 12:29           ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 12:17 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 12:50, Sagi Grimberg wrote:
> 
>>>> kTLS is sending TLS ALERT messages as control messages for recvmsg().
>>>> As we can't do anything sensible with it, just abort the connection
>>>> and let the userspace agent do a re-negotiation.
>>>
>>> Is this a problem if we do end up adding read_sock to tls?
>>> Although I do see that the tls code does manage this in the
>>> sk_buff control buffer, so I assume there is access to this info.
>>>
>>> CC'ing Boris here as well.
>>>
>> Yeah, that was the other reason; cmsg aka TLS alerts are only 
>> available for recvmsg().
>>
>> However, for TLS 1.3 the only TLS alert which does not trigger a
>> connection reset would be the ominous 'new session ticket' alert.
>> But during TLS handshake development we already decided to _disable_
>> session tickets as they are pretty meaningless for our use.
>> Consequently all TLS alerts will trigger a connection reset, and 
>> realistically we don't _need_ to know which alert type has triggered
>> the reset.
>>
>> So 'read_sock()' could be implemented by closing the connection on
>> any TLS alert, and not giving us access to any of the alerts via
>> control messages. If that makes life easier ...
> 
> Most likely read_sock() is the wrong place to ignore alerts. Probably
> the better way is to do this in nvme, although its a bit harder.

But that's just the point; you don't have OOB information for 
read_sock(). So I wouldn't know how you could pass the TLS alert 
messages. Hence my idea of converting them into a connection reset for 
read_sock().
As the whole idea of using ktls is that you do _not_ have to handle TLS
details, it's questionable what exactly one could do here other than
resetting the connection...

Cheers,

Hannes



* Re: [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS
  2023-03-21 12:43 ` [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS Hannes Reinecke
@ 2023-03-22 12:24   ` Sagi Grimberg
  2023-03-22 12:38     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:24 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/21/23 14:43, Hannes Reinecke wrote:
> Incoming connections might either be 'normal' NVMe-TCP connections
> starting with an icreq, or TLS handshakes. To ensure that 'normal'
> connections can still be handled we need to peek the first packet
> and only start the TLS handshake if it's not an icreq.

I think that for nvmet, we will want to strictly enforce tsas.sectype.
What are we gaining from allowing this?

And if you insist that we must, then this needs to be an explicit
setting to a permissive mode.

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/tcp.c | 60 +++++++++++++++++++++++++++++++++++++--
>   1 file changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index a69647fb2c81..a328a303c2be 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -1105,6 +1105,61 @@ static inline bool nvmet_tcp_pdu_valid(u8 type)
>   	return false;
>   }
>   
> +static int nvmet_tcp_try_peek_pdu(struct nvmet_tcp_queue *queue)
> +{
> +	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
> +	int len;
> +	struct kvec iov = {
> +		.iov_base = (u8 *)&queue->pdu + queue->offset,
> +		.iov_len = sizeof(struct nvme_tcp_hdr),
> +	};
> +	char cbuf[CMSG_LEN(sizeof(char))] = {};
> +	unsigned char ctype;
> +	struct cmsghdr *cmsg;
> +	struct msghdr msg = {
> +		.msg_control = cbuf,
> +		.msg_controllen = sizeof(cbuf),
> +		.msg_flags = MSG_PEEK,
> +	};
> +
> +	len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> +			iov.iov_len, msg.msg_flags);
> +	if (unlikely(len < 0)) {
> +		pr_debug("queue %d peek error %d\n",
> +			 queue->idx, len);
> +		return len;
> +	}
> +
> +	cmsg = (struct cmsghdr *)cbuf;
> +	if (CMSG_OK(&msg, cmsg) &&
> +	    cmsg->cmsg_level == SOL_TLS &&
> +	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
> +		ctype = *((unsigned char *)CMSG_DATA(cmsg));
> +		if (ctype != TLS_RECORD_TYPE_DATA) {
> +			pr_err("queue %d unhandled TLS record %d\n",
> +				queue->idx, ctype);
> +			return -ENOTCONN;
> +		}
> +	}
> +
> +	if (len < sizeof(struct nvme_tcp_hdr)) {
> +		pr_debug("queue %d short read, %d bytes missing\n",
> +			 queue->idx, (int)iov.iov_len - len);
> +		return -EAGAIN;
> +	}
> +	pr_debug("queue %d hdr type %d hlen %d plen %d size %d\n",
> +		 queue->idx, hdr->type, hdr->hlen, hdr->plen,
> +		 (int)sizeof(struct nvme_tcp_icreq_pdu));
> +	if (hdr->type == nvme_tcp_icreq &&
> +	    hdr->hlen == sizeof(struct nvme_tcp_icreq_pdu) &&
> +	    hdr->plen == sizeof(struct nvme_tcp_icreq_pdu)) {
> +		pr_debug("queue %d icreq detected\n",
> +			 queue->idx);
> +		return len;
> +	}
> +	return 0;
> +}
> +
>   static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
>   {
>   	struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
> @@ -1879,8 +1934,9 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>   
>   	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>   		nvmet_tcp_save_tls_callbacks(queue);
> -		if (!nvmet_tcp_tls_handshake(queue))
> -			return;
> +		if (!nvmet_tcp_try_peek_pdu(queue))

Who guarantees that a payload already exists? Where is the peek resumed
when a payload doesn't already exist?

> +			if (!nvmet_tcp_tls_handshake(queue))
> +				return;
>   		nvmet_tcp_restore_tls_callbacks(queue);
>   
>   		/*
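
If the fallback stays it should hinge on an explicit knob, e.g.
(condensed sketch; 'tls_permissive' is a made-up nvmet_port attribute,
not something in this series):

	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
		nvmet_tcp_save_tls_callbacks(queue);
		/* fall back to a plain icreq only when explicitly allowed */
		if (queue->port->nport->tls_permissive &&
		    nvmet_tcp_try_peek_pdu(queue) > 0) {
			nvmet_tcp_restore_tls_callbacks(queue);
			queue->state = NVMET_TCP_Q_CONNECTING;
		} else if (!nvmet_tcp_tls_handshake(queue)) {
			return;
		}
		/* handshake-failure path as in the patch */
	}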


* Re: [PATCH 11/18] nvme-tcp: control message handling for recvmsg()
  2023-03-22 12:17         ` Hannes Reinecke
@ 2023-03-22 12:29           ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:29 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, boris.pismenny
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>>>> kTLS is sending TLS ALERT messages as control messages for recvmsg().
>>>>> As we can't do anything sensible with it, just abort the connection
>>>>> and let the userspace agent do a re-negotiation.
>>>>
>>>> Is this a problem if we do end up adding read_sock to tls?
>>>> Although I do see that the tls code does manage this in the
>>>> sk_buff control buffer, so I assume there is access to this info.
>>>>
>>>> CC'ing Boris here as well.
>>>>
>>> Yeah, that was the other reason; cmsg aka TLS alerts are only 
>>> available for recvmsg().
>>>
>>> However, for TLS 1.3 the only TLS alert which does not trigger a
>>> connection reset would be the ominous 'new session ticket' alert.
>>> But during TLS handshake development we already decided to _disable_
>>> session tickets as they are pretty meaningless for our use.
>>> Consequently all TLS alerts will trigger a connection reset, and 
>>> realistically we don't _need_ to know which alert type has triggered
>>> the reset.
>>>
>>> So 'read_sock()' could be implemented by closing the connection on
>>> any TLS alert, and not giving us access to any of the alerts via
>>> control messages. If that makes life easier ...
>>
>> Most likely read_sock() is the wrong place to ignore alerts. Probably
>> the better way is to do this in nvme, although it's a bit harder.
> 
> But that's just the point; you don't have OOB information for 
> read_sock(). So I wouldn't know how you could pass the TLS alert 
> messages.

tls_sw populates cmsg from skb control block, and accesses it from the
skb with tls_msg(skb)...

> Hence my idea of converting them into a connection reset for 
> read_sock().

Probably its the wrong layer for this decision.

> As the whole idea of using ktls is that you do _not_ have to handle TLS 
> details, it's questionable what exactly one could do here except from 
> resetting the connection...

Well, in theory, you could have passed it to userspace along with the
sockfd to handle it for you. But I'm definitely not suggesting that we
do that. But at least the connection reset should be decided above
read_sock.


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 12:13   ` Sagi Grimberg
@ 2023-03-22 12:34     ` Hannes Reinecke
  2023-03-22 12:51       ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 12:34 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 13:13, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> Add functions to start the TLS handshake upcall.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/target/tcp.c | 188 ++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 181 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 5c43767c5ecd..6e88e98a2c59 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -9,8 +9,10 @@
>>   #include <linux/slab.h>
>>   #include <linux/err.h>
>>   #include <linux/nvme-tcp.h>
>> +#include <linux/nvme-keyring.h>
>>   #include <net/sock.h>
>>   #include <net/tcp.h>
>> +#include <net/handshake.h>
>>   #include <linux/inet.h>
>>   #include <linux/llist.h>
>>   #include <crypto/hash.h>
>> @@ -40,6 +42,14 @@ module_param(idle_poll_period_usecs, int, 0644);
>>   MODULE_PARM_DESC(idle_poll_period_usecs,
>>           "nvmet tcp io_work poll till idle time period in usecs");
>> +/*
>> + * TLS handshake timeout
>> + */
>> +static int tls_handshake_timeout = 30;
> 
> 30 ?
> 
Yeah; will be changing it to 10.

>> +module_param(tls_handshake_timeout, int, 0644);
>> +MODULE_PARM_DESC(tls_handshake_timeout,
>> +         "nvme TLS handshake timeout in seconds (default 30)");
>> +
>>   #define NVMET_TCP_RECV_BUDGET        8
>>   #define NVMET_TCP_SEND_BUDGET        8
>>   #define NVMET_TCP_IO_WORK_BUDGET    64
>> @@ -131,6 +141,9 @@ struct nvmet_tcp_queue {
>>       struct ahash_request    *snd_hash;
>>       struct ahash_request    *rcv_hash;
>> +    struct key        *tls_psk;
>> +    struct delayed_work    tls_handshake_work;
>> +
>>       unsigned long           poll_end;
>>       spinlock_t        state_lock;
>> @@ -168,6 +181,7 @@ static struct workqueue_struct *nvmet_tcp_wq;
>>   static const struct nvmet_fabrics_ops nvmet_tcp_ops;
>>   static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
>>   static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct 
>> *work);
>>   static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
>>           struct nvmet_tcp_cmd *cmd)
>> @@ -1400,6 +1414,8 @@ static void 
>> nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
>>   {
>>       struct socket *sock = queue->sock;
>> +    if (!sock->sk)
>> +        return;
> 
> Umm, when will the sock not have an sk?
> 
When someone called 'sock_release()'.
But that's basically a leftover from development.

>>       write_lock_bh(&sock->sk->sk_callback_lock);
>>       sock->sk->sk_data_ready =  queue->data_ready;
>>       sock->sk->sk_state_change = queue->state_change;
>> @@ -1448,7 +1464,8 @@ static void nvmet_tcp_release_queue_work(struct 
>> work_struct *w)
>>       list_del_init(&queue->queue_list);
>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>> -    nvmet_tcp_restore_socket_callbacks(queue);
>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>> +        nvmet_tcp_restore_socket_callbacks(queue);
> 
> This is because you only save the callbacks after the handshake
> phase is done? Maybe it would be simpler to clear the ops because
> the socket is going away anyways...
> 
Or just leave it in place, as they'll be cleared up on sock_release().

>>       cancel_work_sync(&queue->io_work);
>>       /* stop accepting incoming data */
>>       queue->rcv_state = NVMET_TCP_RECV_ERR;
>> @@ -1469,6 +1486,8 @@ static void nvmet_tcp_release_queue_work(struct 
>> work_struct *w)
>>       nvmet_tcp_free_cmds(queue);
>>       if (queue->hdr_digest || queue->data_digest)
>>           nvmet_tcp_free_crypto(queue);
>> +    if (queue->tls_psk)
>> +        key_put(queue->tls_psk);
>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>>       page = virt_to_head_page(queue->pf_cache.va);
>>       __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>> @@ -1481,11 +1500,15 @@ static void nvmet_tcp_data_ready(struct sock *sk)
>>       trace_sk_data_ready(sk);
>> -    read_lock_bh(&sk->sk_callback_lock);
>> -    queue = sk->sk_user_data;
>> -    if (likely(queue))
>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>> -    read_unlock_bh(&sk->sk_callback_lock);
>> +    rcu_read_lock_bh();
>> +    queue = rcu_dereference_sk_user_data(sk);
>> +    if (queue->data_ready)
>> +        queue->data_ready(sk);
>> +    if (likely(queue) &&
>> +        queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>> +                  &queue->io_work);
>> +    rcu_read_unlock_bh();
> 
> Same comment as the host side. separate rcu stuff from data_ready call.
> 
Ok.

>>   }
>>   static void nvmet_tcp_write_space(struct sock *sk)
>> @@ -1585,13 +1608,139 @@ static int nvmet_tcp_set_queue_sock(struct 
>> nvmet_tcp_queue *queue)
>>           sock->sk->sk_write_space = nvmet_tcp_write_space;
>>           if (idle_poll_period_usecs)
>>               nvmet_tcp_arm_queue_deadline(queue);
>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>> +                  &queue->io_work);
> 
> Why the change?
> 
Left-over from development.

>>       }
>>       write_unlock_bh(&sock->sk->sk_callback_lock);
>>       return ret;
>>   }
>> +static void nvmet_tcp_tls_data_ready(struct sock *sk)
>> +{
>> +    struct socket_wq *wq;
>> +
>> +    rcu_read_lock();
>> +    /* kTLS will change the callback */
>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready) {
>> +        wq = rcu_dereference(sk->sk_wq);
>> +        if (skwq_has_sleeper(wq))
>> +            wake_up_interruptible_all(&wq->wait);
>> +    }
>> +    rcu_read_unlock();
>> +}
> 
> Can you explain why this is needed? It looks out-of-place.
> Who is this waking up? isn't tls already calling the socket
> default data_ready that does something similar for userspace?
> 
Black magic.
The 'data_ready' call might happen at any time between the 'accept' call 
and us calling into userspace.
In particular we have this flow of control:

1. Kernel: accept()
2. Kernel: handshake request
3. Userspace: read data from socket
4. Userspace: tls handshake
5. Kernel: handshake complete

If the 'data_ready' event occurs between 1. and 3., userspace wouldn't 
know that something has happened, and would sit there waiting for 
data which is already present.
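
In code terms: the upcall path replays that wakeup once by hand after
the handshake has been requested, so a userspace reader racing with
steps 1-3 still gets woken (condensed from the hunk above):

    ret = tls_server_hello_psk(&args, GFP_KERNEL);
    if (!ret)
        /* replay the wakeup; data might have arrived before
         * tlshd had a chance to wait on the socket */
        nvmet_tcp_tls_data_ready(queue->sock->sk);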

>> +
>> +static void nvmet_tcp_tls_handshake_restart(struct nvmet_tcp_queue 
>> *queue)
>> +{
>> +    spin_lock(&queue->state_lock);
>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>> +        pr_warn("queue %d: TLS handshake already completed\n",
>> +            queue->idx);
>> +        spin_unlock(&queue->state_lock);
>> +        return;
>> +    }
>> +    queue->state = NVMET_TCP_Q_CONNECTING;
>> +    spin_unlock(&queue->state_lock);
>> +
>> +    pr_debug("queue %d: restarting queue after TLS handshake\n",
>> +         queue->idx);
>> +    /*
>> +     * Set callbacks after handshake; TLS implementation
>> +     * might have changed the socket callbacks.
>> +     */
>> +    nvmet_tcp_set_queue_sock(queue);
> 
> My understanding is that this is the desired end-state, i.e.
> tls connection is ready and now we are expecting nvme traffic?
> 
Yes.

> I think that the function name should be changed, it sounds like
> it is restarting the handshake, and it does not appear to do that.
> 
Sure, np.

nvmet_tcp_set_queue_callbacks()?

>> +}
>> +
>> +static void nvmet_tcp_save_tls_callbacks(struct nvmet_tcp_queue *queue)
>> +{
>> +    struct sock *sk = queue->sock->sk;
>> +
>> +    write_lock_bh(&sk->sk_callback_lock);
>> +    rcu_assign_sk_user_data(sk, queue);
>> +    queue->data_ready = sk->sk_data_ready;
>> +    sk->sk_data_ready = nvmet_tcp_tls_data_ready;
>> +    write_unlock_bh(&sk->sk_callback_lock);
>> +}
>> +
>> +static void nvmet_tcp_restore_tls_callbacks(struct nvmet_tcp_queue 
>> *queue)
>> +{
>> +    struct sock *sk = queue->sock->sk;
>> +
>> +    if (WARN_ON(!sk))
>> +        return;
>> +    write_lock_bh(&sk->sk_callback_lock);
>> +    /* Only reset the callback if it really is ours */
>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready)
> 
> I still don't understand why our data_ready for tls is needed.
> Who are
> 
See above for an explanation.

>> +        sk->sk_data_ready = queue->data_ready;
>> +    rcu_assign_sk_user_data(sk, NULL);
>> +    queue->data_ready = NULL;
>> +    write_unlock_bh(&sk->sk_callback_lock);
>> +}
>> +
>> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
>> +                     key_serial_t peerid)
>> +{
>> +    struct nvmet_tcp_queue *queue = data;
>> +
>> +    pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
>> +         queue->idx, peerid, status);
>> +    if (!status) {
>> +        spin_lock(&queue->state_lock);
>> +        queue->tls_psk = key_lookup(peerid);
>> +        if (IS_ERR(queue->tls_psk)) {
>> +            pr_warn("queue %d: TLS key %x not found\n",
>> +                queue->idx, peerid);
>> +            queue->tls_psk = NULL;
>> +        }
>> +        spin_unlock(&queue->state_lock);
>> +    }
>> +    cancel_delayed_work_sync(&queue->tls_handshake_work);
>> +    nvmet_tcp_restore_tls_callbacks(queue);
>> +    if (status)
>> +        nvmet_tcp_schedule_release_queue(queue);
>> +    else
>> +        nvmet_tcp_tls_handshake_restart(queue);
>> +}
>> +
>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
>> +{
>> +    struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
>> +            struct nvmet_tcp_queue, tls_handshake_work);
>> +
>> +    pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
>> +    nvmet_tcp_restore_tls_callbacks(queue);
>> +    nvmet_tcp_schedule_release_queue(queue);
>> +}
>> +
>> +static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
>> +{
>> +    int ret = -EOPNOTSUPP;
>> +    struct tls_handshake_args args;
>> +
>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>> +        pr_warn("cannot start TLS in state %d\n", queue->state);
>> +        return -EINVAL;
>> +    }
>> +
>> +    pr_debug("queue %d: TLS ServerHello\n", queue->idx);
>> +    args.ta_sock = queue->sock;
>> +    args.ta_done = nvmet_tcp_tls_handshake_done;
>> +    args.ta_data = queue;
>> +    args.ta_keyring = nvme_keyring_id();
>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;
> 
>   why the 2x timeout?
> 
Because I'm chicken. Will be changing it.

>> +
>> +    ret = tls_server_hello_psk(&args, GFP_KERNEL);
>> +    if (ret) {
>> +        pr_err("failed to start TLS, err=%d\n", ret);
>> +    } else {
>> +        pr_debug("queue %d wakeup userspace\n", queue->idx);
>> +        nvmet_tcp_tls_data_ready(queue->sock->sk);
>> +        queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
>> +                   tls_handshake_timeout * HZ);
>> +    }
>> +    return ret;
>> +}
>> +
>>   static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>           struct socket *newsock)
>>   {
>> @@ -1604,6 +1753,8 @@ static void nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>>       INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
>> +    INIT_DELAYED_WORK(&queue->tls_handshake_work,
>> +              nvmet_tcp_tls_handshake_timeout_work);
>>       queue->sock = newsock;
>>       queue->port = port;
>>       queue->nr_cmds = 0;
>> @@ -1646,6 +1797,29 @@ static void nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>> +    if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>> +        nvmet_tcp_save_tls_callbacks(queue);
>> +        if (!nvmet_tcp_tls_handshake(queue))
>> +            return;
>> +        nvmet_tcp_restore_tls_callbacks(queue);
>> +
>> +        /*
>> +         * If sectype is set to 'tls1.3' TLS is required
>> +         * so terminate the connection if the TLS handshake
>> +         * failed.
>> +         */
>> +        if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
>> +            NVMF_TCP_SECTYPE_TLS13) {
>> +            pr_debug("queue %d sectype tls1.3, terminate connection\n",
>> +                 queue->idx);
>> +            goto out_destroy_sq;
>> +        }
>> +        pr_debug("queue %d fallback to icreq\n", queue->idx);
>> +        spin_lock(&queue->state_lock);
>> +        queue->state = NVMET_TCP_Q_CONNECTING;
>> +        spin_unlock(&queue->state_lock);
>> +    }
>> +
>>       ret = nvmet_tcp_set_queue_sock(queue);
>>       if (ret)
>>           goto out_destroy_sq;
> 
> I'm still trying to learn the state machine here, can you share a few 
> words on it? Also please include it in the next round in the change log.

As outlined in the response to the nvme-tcp upcall, on the server side 
we _have_ to allow for non-TLS connections (e.g. for discovery).
And we have to start the daemon _before_ the first packet arrives, but 
only the first packet will tell us what we should have done.
So really we have to start the upcall and see what happens.
The 'real' handling / differentiation between these two modes is done 
with the 'peek pdu' patch later on.
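
Roughly, the combined accept-time flow (patches 15 + 18) then becomes
(sketch, not the literal code):

    if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
        ret = nvmet_tcp_try_peek_pdu(queue);
        if (ret > 0) {
            /* icreq seen: plain NVMe/TCP, e.g. a discovery connection */
            queue->state = NVMET_TCP_Q_CONNECTING;
        } else if (ret == 0) {
            /* not an icreq: assume ClientHello, hand off to tlshd */
            if (!nvmet_tcp_tls_handshake(queue))
                return;
            /* upcall failed; terminate if sectype mandates TLS */
        }
    }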

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS
  2023-03-22 12:24   ` Sagi Grimberg
@ 2023-03-22 12:38     ` Hannes Reinecke
  0 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 12:38 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 13:24, Sagi Grimberg wrote:
> 
> 
> On 3/21/23 14:43, Hannes Reinecke wrote:
>> Incoming connections might be either 'normal' NVMe-TCP connections
>> starting with icreq or TLS handshakes. To ensure that 'normal'
>> connections can still be handled we need to peek the first packet
>> and only start TLS handshake if it's not an icreq.
> 
> I think that for nvmet, we will want to strictly enforce tsas.sectype.
> What are we gaining from allowing this?
> 
> And if you insist that we must, then this needs to be an explicit
> setting to a permissive mode.
> 
We can't. Strict sectype mode does not work for servers as it would 
disallow discovery connections.
(Especially here as we don't support unique discovery subsystems.)
(Maybe it's time to revisit that ...)

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/target/tcp.c | 60 +++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 58 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index a69647fb2c81..a328a303c2be 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -1105,6 +1105,61 @@ static inline bool nvmet_tcp_pdu_valid(u8 type)
>>       return false;
>>   }
>> +static int nvmet_tcp_try_peek_pdu(struct nvmet_tcp_queue *queue)
>> +{
>> +    struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
>> +    int len;
>> +    struct kvec iov = {
>> +        .iov_base = (u8 *)&queue->pdu + queue->offset,
>> +        .iov_len = sizeof(struct nvme_tcp_hdr),
>> +    };
>> +    char cbuf[CMSG_LEN(sizeof(char))] = {};
>> +    unsigned char ctype;
>> +    struct cmsghdr *cmsg;
>> +    struct msghdr msg = {
>> +        .msg_control = cbuf,
>> +        .msg_controllen = sizeof(cbuf),
>> +        .msg_flags = MSG_PEEK,
>> +    };
>> +
>> +    len = kernel_recvmsg(queue->sock, &msg, &iov, 1,
>> +            iov.iov_len, msg.msg_flags);
>> +    if (unlikely(len < 0)) {
>> +        pr_debug("queue %d peek error %d\n",
>> +             queue->idx, len);
>> +        return len;
>> +    }
>> +
>> +    cmsg = (struct cmsghdr *)cbuf;
>> +    if (CMSG_OK(&msg, cmsg) &&
>> +        cmsg->cmsg_level == SOL_TLS &&
>> +        cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
>> +        ctype = *((unsigned char *)CMSG_DATA(cmsg));
>> +        if (ctype != TLS_RECORD_TYPE_DATA) {
>> +            pr_err("queue %d unhandled TLS record %d\n",
>> +                queue->idx, ctype);
>> +            return -ENOTCONN;
>> +        }
>> +    }
>> +
>> +    if (len < sizeof(struct nvme_tcp_hdr)) {
>> +        pr_debug("queue %d short read, %d bytes missing\n",
>> +             queue->idx, (int)iov.iov_len - len);
>> +        return -EAGAIN;
>> +    }
>> +    pr_debug("queue %d hdr type %d hlen %d plen %d size %d\n",
>> +         queue->idx, hdr->type, hdr->hlen, hdr->plen,
>> +         (int)sizeof(struct nvme_tcp_icreq_pdu));
>> +    if (hdr->type == nvme_tcp_icreq &&
>> +        hdr->hlen == sizeof(struct nvme_tcp_icreq_pdu) &&
>> +        hdr->plen == sizeof(struct nvme_tcp_icreq_pdu)) {
>> +        pr_debug("queue %d icreq detected\n",
>> +             queue->idx);
>> +        return len;
>> +    }
>> +    return 0;
>> +}
>> +
>>   static int nvmet_tcp_try_recv_pdu(struct nvmet_tcp_queue *queue)
>>   {
>>       struct nvme_tcp_hdr *hdr = &queue->pdu.cmd.hdr;
>> @@ -1879,8 +1934,9 @@ static void nvmet_tcp_alloc_queue(struct 
>> nvmet_tcp_port *port,
>>       if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>>           nvmet_tcp_save_tls_callbacks(queue);
>> -        if (!nvmet_tcp_tls_handshake(queue))
>> -            return;
>> +        if (!nvmet_tcp_try_peek_pdu(queue))
> 
> Who guarantees that a payload already exist? Where is the peek resumes
> when a payload already exist?
> 
We do wait for the first payload to arrive (MSG_DONTWAIT isn't set), so 
we will be receiving something.
And if we haven't received a payload we're bailing out anyway, no?
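
For reference, that's what the blocking peek boils down to (just the
semantics, no new code):

    struct msghdr msg = {
        .msg_flags = MSG_PEEK, /* no MSG_DONTWAIT: sleep until data arrives */
    };
    /* returns once at least one byte is queued; the data is left in
     * the receive queue for the subsequent icreq / TLS processing */
    len = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
            msg.msg_flags);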

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 12:34     ` Hannes Reinecke
@ 2023-03-22 12:51       ` Sagi Grimberg
  2023-03-22 13:47         ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:51 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake



On 3/22/23 14:34, Hannes Reinecke wrote:
> On 3/22/23 13:13, Sagi Grimberg wrote:
>>
>>
>> On 3/21/23 14:43, Hannes Reinecke wrote:
>>> Add functions to start the TLS handshake upcall.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/target/tcp.c | 188 ++++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 181 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>> index 5c43767c5ecd..6e88e98a2c59 100644
>>> --- a/drivers/nvme/target/tcp.c
>>> +++ b/drivers/nvme/target/tcp.c
>>> @@ -9,8 +9,10 @@
>>>   #include <linux/slab.h>
>>>   #include <linux/err.h>
>>>   #include <linux/nvme-tcp.h>
>>> +#include <linux/nvme-keyring.h>
>>>   #include <net/sock.h>
>>>   #include <net/tcp.h>
>>> +#include <net/handshake.h>
>>>   #include <linux/inet.h>
>>>   #include <linux/llist.h>
>>>   #include <crypto/hash.h>
>>> @@ -40,6 +42,14 @@ module_param(idle_poll_period_usecs, int, 0644);
>>>   MODULE_PARM_DESC(idle_poll_period_usecs,
>>>           "nvmet tcp io_work poll till idle time period in usecs");
>>> +/*
>>> + * TLS handshake timeout
>>> + */
>>> +static int tls_handshake_timeout = 30;
>>
>> 30 ?
>>
> Yeah; will be changing it to 10.
> 
>>> +module_param(tls_handshake_timeout, int, 0644);
>>> +MODULE_PARM_DESC(tls_handshake_timeout,
>>> +         "nvme TLS handshake timeout in seconds (default 30)");
>>> +
>>>   #define NVMET_TCP_RECV_BUDGET        8
>>>   #define NVMET_TCP_SEND_BUDGET        8
>>>   #define NVMET_TCP_IO_WORK_BUDGET    64
>>> @@ -131,6 +141,9 @@ struct nvmet_tcp_queue {
>>>       struct ahash_request    *snd_hash;
>>>       struct ahash_request    *rcv_hash;
>>> +    struct key        *tls_psk;
>>> +    struct delayed_work    tls_handshake_work;
>>> +
>>>       unsigned long           poll_end;
>>>       spinlock_t        state_lock;
>>> @@ -168,6 +181,7 @@ static struct workqueue_struct *nvmet_tcp_wq;
>>>   static const struct nvmet_fabrics_ops nvmet_tcp_ops;
>>>   static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
>>>   static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
>>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct 
>>> *work);
>>>   static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
>>>           struct nvmet_tcp_cmd *cmd)
>>> @@ -1400,6 +1414,8 @@ static void 
>>> nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
>>>   {
>>>       struct socket *sock = queue->sock;
>>> +    if (!sock->sk)
>>> +        return;
>>
>> Umm, when will the sock not have an sk?
>>
> When someone called 'sock_release()'.
> But that's basically a leftover from development.
> 
>>>       write_lock_bh(&sock->sk->sk_callback_lock);
>>>       sock->sk->sk_data_ready =  queue->data_ready;
>>>       sock->sk->sk_state_change = queue->state_change;
>>> @@ -1448,7 +1464,8 @@ static void nvmet_tcp_release_queue_work(struct 
>>> work_struct *w)
>>>       list_del_init(&queue->queue_list);
>>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>>> -    nvmet_tcp_restore_socket_callbacks(queue);
>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>>> +        nvmet_tcp_restore_socket_callbacks(queue);
>>
>> This is because you only save the callbacks after the handshake
>> phase is done? Maybe it would be simpler to clear the ops because
>> the socket is going away anyways...
>>
> Or just leave it in place, as they'll be cleared up on sock_release().

This plays a role today, because after we clear sock callbacks, and
flush io_work, we know we are not going to be triggered from the
network, which is needed to continue teardown safely. So if you leave
them in place, you need to do a different fence here.
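
I.e. today the teardown relies on this ordering (from
nvmet_tcp_release_queue_work):

    /* clear/restore the sk callbacks first ... */
    nvmet_tcp_restore_socket_callbacks(queue);
    /* ... then flush, so nothing new is queued from the network and
     * any in-flight io_work has finished before we free resources */
    cancel_work_sync(&queue->io_work);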

> 
>>>       cancel_work_sync(&queue->io_work);
>>>       /* stop accepting incoming data */
>>>       queue->rcv_state = NVMET_TCP_RECV_ERR;
>>> @@ -1469,6 +1486,8 @@ static void nvmet_tcp_release_queue_work(struct 
>>> work_struct *w)
>>>       nvmet_tcp_free_cmds(queue);
>>>       if (queue->hdr_digest || queue->data_digest)
>>>           nvmet_tcp_free_crypto(queue);
>>> +    if (queue->tls_psk)
>>> +        key_put(queue->tls_psk);
>>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>>>       page = virt_to_head_page(queue->pf_cache.va);
>>>       __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>>> @@ -1481,11 +1500,15 @@ static void nvmet_tcp_data_ready(struct sock 
>>> *sk)
>>>       trace_sk_data_ready(sk);
>>> -    read_lock_bh(&sk->sk_callback_lock);
>>> -    queue = sk->sk_user_data;
>>> -    if (likely(queue))
>>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>>> -    read_unlock_bh(&sk->sk_callback_lock);
>>> +    rcu_read_lock_bh();
>>> +    queue = rcu_dereference_sk_user_data(sk);
>>> +    if (queue->data_ready)
>>> +        queue->data_ready(sk);
>>> +    if (likely(queue) &&
>>> +        queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>>> +                  &queue->io_work);
>>> +    rcu_read_unlock_bh();
>>
>> Same comment as the host side. separate rcu stuff from data_ready call.
>>
> Ok.
> 
>>>   }
>>>   static void nvmet_tcp_write_space(struct sock *sk)
>>> @@ -1585,13 +1608,139 @@ static int nvmet_tcp_set_queue_sock(struct 
>>> nvmet_tcp_queue *queue)
>>>           sock->sk->sk_write_space = nvmet_tcp_write_space;
>>>           if (idle_poll_period_usecs)
>>>               nvmet_tcp_arm_queue_deadline(queue);
>>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>>> +                  &queue->io_work);
>>
>> Why the change?
>>
> Left-over from development.
> 
>>>       }
>>>       write_unlock_bh(&sock->sk->sk_callback_lock);
>>>       return ret;
>>>   }
>>> +static void nvmet_tcp_tls_data_ready(struct sock *sk)
>>> +{
>>> +    struct socket_wq *wq;
>>> +
>>> +    rcu_read_lock();
>>> +    /* kTLS will change the callback */
>>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready) {
>>> +        wq = rcu_dereference(sk->sk_wq);
>>> +        if (skwq_has_sleeper(wq))
>>> +            wake_up_interruptible_all(&wq->wait);
>>> +    }
>>> +    rcu_read_unlock();
>>> +}
>>
>> Can you explain why this is needed? It looks out-of-place.
>> Who is this waking up? isn't tls already calling the socket
>> default data_ready that does something similar for userspace?
>>
> Black magic.

:)

> The 'data_ready' call might happen at any time between the 'accept' call 
> and us calling into userspace.
> In particular we have this flow of control:
> 
> 1. Kernel: accept()
> 2. Kernel: handshake request
> 3. Userspace: read data from socket
> 4. Userspace: tls handshake
> 5. Kernel: handshake complete
> 
> If the 'data_ready' event occurs between 1. and 3., userspace wouldn't 
> know that something has happened, and would sit there waiting for 
> data which is already present.

Umm, doesn't userspace read from the socket once we trigger the upcall?
It should. But I still don't understand what is the difference between
us waking up userspace and the default sock doing the same?

>>> +
>>> +static void nvmet_tcp_tls_handshake_restart(struct nvmet_tcp_queue 
>>> *queue)
>>> +{
>>> +    spin_lock(&queue->state_lock);
>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>>> +        pr_warn("queue %d: TLS handshake already completed\n",
>>> +            queue->idx);
>>> +        spin_unlock(&queue->state_lock);
>>> +        return;
>>> +    }
>>> +    queue->state = NVMET_TCP_Q_CONNECTING;
>>> +    spin_unlock(&queue->state_lock);
>>> +
>>> +    pr_debug("queue %d: restarting queue after TLS handshake\n",
>>> +         queue->idx);
>>> +    /*
>>> +     * Set callbacks after handshake; TLS implementation
>>> +     * might have changed the socket callbacks.
>>> +     */
>>> +    nvmet_tcp_set_queue_sock(queue);
>>
>> My understanding is that this is the desired end-state, i.e.
>> tls connection is ready and now we are expecting nvme traffic?
>>
> Yes.
> 
>> I think that the function name should be changed, it sounds like
>> it is restarting the handshake, and it does not appear to do that.
>>
> Sure, np.
> 
> nvmet_tcp_set_queue_callbacks()?

I meant about nvmet_tcp_tls_handshake_restart()

> 
>>> +}
>>> +
>>> +static void nvmet_tcp_save_tls_callbacks(struct nvmet_tcp_queue *queue)
>>> +{
>>> +    struct sock *sk = queue->sock->sk;
>>> +
>>> +    write_lock_bh(&sk->sk_callback_lock);
>>> +    rcu_assign_sk_user_data(sk, queue);
>>> +    queue->data_ready = sk->sk_data_ready;
>>> +    sk->sk_data_ready = nvmet_tcp_tls_data_ready;
>>> +    write_unlock_bh(&sk->sk_callback_lock);
>>> +}
>>> +
>>> +static void nvmet_tcp_restore_tls_callbacks(struct nvmet_tcp_queue 
>>> *queue)
>>> +{
>>> +    struct sock *sk = queue->sock->sk;
>>> +
>>> +    if (WARN_ON(!sk))
>>> +        return;
>>> +    write_lock_bh(&sk->sk_callback_lock);
>>> +    /* Only reset the callback if it really is ours */
>>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready)
>>
>> I still don't understand why our data_ready for tls is needed.
>> Who are
>>
> See above for an explanation.
> 
>>> +        sk->sk_data_ready = queue->data_ready;
>>> +    rcu_assign_sk_user_data(sk, NULL);
>>> +    queue->data_ready = NULL;
>>> +    write_unlock_bh(&sk->sk_callback_lock);
>>> +}
>>> +
>>> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
>>> +                     key_serial_t peerid)
>>> +{
>>> +    struct nvmet_tcp_queue *queue = data;
>>> +
>>> +    pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
>>> +         queue->idx, peerid, status);
>>> +    if (!status) {
>>> +        spin_lock(&queue->state_lock);
>>> +        queue->tls_psk = key_lookup(peerid);
>>> +        if (IS_ERR(queue->tls_psk)) {
>>> +            pr_warn("queue %d: TLS key %x not found\n",
>>> +                queue->idx, peerid);
>>> +            queue->tls_psk = NULL;
>>> +        }
>>> +        spin_unlock(&queue->state_lock);
>>> +    }
>>> +    cancel_delayed_work_sync(&queue->tls_handshake_work);
>>> +    nvmet_tcp_restore_tls_callbacks(queue);
>>> +    if (status)
>>> +        nvmet_tcp_schedule_release_queue(queue);
>>> +    else
>>> +        nvmet_tcp_tls_handshake_restart(queue);
>>> +}
>>> +
>>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
>>> +{
>>> +    struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
>>> +            struct nvmet_tcp_queue, tls_handshake_work);
>>> +
>>> +    pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
>>> +    nvmet_tcp_restore_tls_callbacks(queue);
>>> +    nvmet_tcp_schedule_release_queue(queue);
>>> +}
>>> +
>>> +static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
>>> +{
>>> +    int ret = -EOPNOTSUPP;
>>> +    struct tls_handshake_args args;
>>> +
>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>>> +        pr_warn("cannot start TLS in state %d\n", queue->state);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    pr_debug("queue %d: TLS ServerHello\n", queue->idx);
>>> +    args.ta_sock = queue->sock;
>>> +    args.ta_done = nvmet_tcp_tls_handshake_done;
>>> +    args.ta_data = queue;
>>> +    args.ta_keyring = nvme_keyring_id();
>>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;
>>
>>   why the 2x timeout?
>>
> Because I'm chicken. Will be changing it.

:)

> 
>>> +
>>> +    ret = tls_server_hello_psk(&args, GFP_KERNEL);
>>> +    if (ret) {
>>> +        pr_err("failed to start TLS, err=%d\n", ret);
>>> +    } else {
>>> +        pr_debug("queue %d wakeup userspace\n", queue->idx);
>>> +        nvmet_tcp_tls_data_ready(queue->sock->sk);
>>> +        queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
>>> +                   tls_handshake_timeout * HZ);
>>> +    }
>>> +    return ret;
>>> +}
>>> +
>>>   static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>>           struct socket *newsock)
>>>   {
>>> @@ -1604,6 +1753,8 @@ static void nvmet_tcp_alloc_queue(struct 
>>> nvmet_tcp_port *port,
>>>       INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>>>       INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
>>> +    INIT_DELAYED_WORK(&queue->tls_handshake_work,
>>> +              nvmet_tcp_tls_handshake_timeout_work);
>>>       queue->sock = newsock;
>>>       queue->port = port;
>>>       queue->nr_cmds = 0;
>>> @@ -1646,6 +1797,29 @@ static void nvmet_tcp_alloc_queue(struct 
>>> nvmet_tcp_port *port,
>>>       list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
>>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>>> +    if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>>> +        nvmet_tcp_save_tls_callbacks(queue);
>>> +        if (!nvmet_tcp_tls_handshake(queue))
>>> +            return;
>>> +        nvmet_tcp_restore_tls_callbacks(queue);
>>> +
>>> +        /*
>>> +         * If sectype is set to 'tls1.3' TLS is required
>>> +         * so terminate the connection if the TLS handshake
>>> +         * failed.
>>> +         */
>>> +        if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
>>> +            NVMF_TCP_SECTYPE_TLS13) {
>>> +            pr_debug("queue %d sectype tls1.3, terminate connection\n",
>>> +                 queue->idx);
>>> +            goto out_destroy_sq;
>>> +        }
>>> +        pr_debug("queue %d fallback to icreq\n", queue->idx);
>>> +        spin_lock(&queue->state_lock);
>>> +        queue->state = NVMET_TCP_Q_CONNECTING;
>>> +        spin_unlock(&queue->state_lock);
>>> +    }
>>> +
>>>       ret = nvmet_tcp_set_queue_sock(queue);
>>>       if (ret)
>>>           goto out_destroy_sq;
>>
>> I'm still trying to learn the state machine here, can you share a few 
>> words on it? Also please include it in the next round in the change log.
> 
> As outlined in the response to the nvme-tcp upcall, on the server side 
> we _have_ to allow for non-TLS connections (e.g. for discovery).

But in essence what you are doing is allowing normal connections
on a secured port...

btw, why not enforce a psk for the discovery controller (on this port)
as well, for secured ports? No one said that we must accept a
non-secured host connecting for discovery on a secured port.

> And we have to start the daemon _before_ the first packet arrives,

Not sure why that is.

> but only the first packet will tell us what we should have done.
> So really we have to start the upcall and see what happens.
> The 'real' handling / differentiation between these two modes is done 
> with the 'peek pdu' patch later on.

I am hoping we can kill it.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-22  8:28       ` Hannes Reinecke
@ 2023-03-22 12:53         ` Sagi Grimberg
  2023-03-22 15:10           ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 12:53 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>>>> Hi all,
>>>>>
>>>>> finally I've managed to put all things together and enable in-kernel
>>>>> TLS support for NVMe-over-TCP.
>>>>
>>>> Hannes (and Chuck) this is great, I'm very happy to see this!
>>>>
>>>> I'll start a detailed review soon enough.
>>>>
>>>> Thank you for doing this.
>>>>
>>>>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>>>>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>>>>> posted to the linux netdev list), and requires the 'tlshd' userspace
>>>>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>>>>> handshake.
>>>>
>>>> Do you have an actual link to follow for this patch set?
>>>
>>> Sure.
>>>
>>> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
>>> branch tls-netlink.v7
>>
>> I meant Chuck's posting on linux-netdev.
> 
> To be found here:
> 
> <https://www.spinics.net/lists/netdev/msg890047.html>

Nice, it would be great to see code, if you have it, for nvme-cli and/or
nvmetcli as well.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-22 10:56       ` Sagi Grimberg
@ 2023-03-22 12:54         ` Hannes Reinecke
  2023-03-22 13:16           ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 12:54 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 11:56, Sagi Grimberg wrote:
> 
>>>> Select possible PSK identities and call the TLS handshake upcall
>>>> for each identity.
>>>> The TLS 1.3 RFC allows sending multiple identities with each 
>>>> ClientHello
>>>> request, but none of the SSL libraries implement it. As the connection
>>>> is established when the association is created we send only a single
>>>> identity for each upcall, and close the connection to restart with
>>>> the next identity if the handshake fails.
>>>
>>> Can't this loop be done in userspace? In other words, how can
>>> we get rid of this when SSL libs would decide to support it?
>>>
>> Well. That is something which I've been thinking about, but really 
>> haven't come to a good solution.
> 
> I have a more general question.
> What is the scenario that we will have for a given hostnqn and
> subsysnqn more than one valid identity? Do we need to support it?
> 
Well; there are SHA-256 and SHA-384 identities. We need to _support_ 
both, but seeing that we're dealing with retained PSKs for now I would 
assume that the admin ensures that both sides are able to support the
chosen hash.
So the real choice is just between a 'retained' and a 'generated' PSK.
And it is assumed that any 'retained' PSK should take priority over any 
'generated' PSK.
So for 'retained' PSKs we can use userland to pass in the PSK
(or, indeed, have the kernel select one, as we really only have one 
choice...)
And for 'generated' PSKs they really come into play only if we don't 
have 'retained' PSKs, and if secure concatenation is enabled.
But even there you can (out of necessity) only generate a single PSK,
so again there is no choice.

So in the light of all this I guess we can revert to only using a single 
PSK.

Will be updating the code.
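
Something like this untested sketch, encoding the priority directly
(hypothetical helper on top of nvme_tls_psk_lookup() from this series;
retained before generated, SHA-384 before SHA-256, first match wins):

    static key_serial_t nvme_tcp_lookup_psk(struct nvme_ctrl *nctrl)
    {
        static const struct {
            enum nvme_tcp_tls_cipher cipher;
            bool generated;
        } prio[] = {
            { NVME_TCP_TLS_CIPHER_SHA384, false },
            { NVME_TCP_TLS_CIPHER_SHA256, false },
            { NVME_TCP_TLS_CIPHER_SHA384, true },
            { NVME_TCP_TLS_CIPHER_SHA256, true },
        };
        struct key *tls_key;
        key_serial_t serial;
        int i;

        for (i = 0; i < ARRAY_SIZE(prio); i++) {
            tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
                              nctrl->opts->subsysnqn,
                              prio[i].cipher, prio[i].generated);
            if (IS_ERR(tls_key))
                continue;
            serial = tls_key->serial;
            key_put(tls_key);
            return serial;
        }
        return 0; /* no PSK provisioned; connect without TLS */
    }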

>> Crux of the matter is that we have to close the connection after a 
>> failed TLS handshake:
> 
> Yes.
> 
>>
>>  > A TLS 1.3 client implementation that only supports sending a single
>>  > PSK identity during connection setup may be required to connect
>>  > multiple times in order to negotiate cipher suites with different hash
>>  > functions.
>>
>> and as it's quite unclear in which state the connection is after the 
>> userspace library failed the handshake.
>> So the only good way to recover is to close the connection and restart 
>> with a different identity.
> 
> I see.
> 
>> While we can move the identity selection to userspace (e.g. by providing 
>> a 'tls_psk' fabrics option holding the key serial of the PSK to 
>> use), that will only allow us to pass a _single_ PSK for each attempt.
> 
> That makes sense to me. But I'm unclear how it will choose, if we have
> multiple (which again, I'm not clear in which scenario this would be the
> case).
> 
The actual priority can already be inferred from the NVMe TCP spec, but 
we'll have an actual priority list with the TLS spec update 
currently being discussed at fmds.

>>
>> And the other problem is that in its current form the spec allows for 
>> _different_ identites for each connection; by passing a key from 
>> userspace we would not be able to support that.
>> (Not saying that it's useful, mind.)
> 
> I'll happily forfeit this support. This is completely crazy to do this
> per connection.
> 
Same here.

>>
>> We could allow for several 'tls_psk' options, though; maybe that would 
>> be a way out.
> 
> What do you mean by that?
> 
Well, one _could_ allow a syntax like
'nvme connect --tls-psk=<id1> --tls-psk=<id2>'
(if we really want to support multiple identities).

>>
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>> ---
>>>>   drivers/nvme/host/tcp.c | 157 
>>>> +++++++++++++++++++++++++++++++++++++---
>>>>   1 file changed, 148 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 0438d42f4179..bcf24e9a08e1 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -8,9 +8,12 @@
>>>>   #include <linux/init.h>
>>>>   #include <linux/slab.h>
>>>>   #include <linux/err.h>
>>>> +#include <linux/key.h>
>>>>   #include <linux/nvme-tcp.h>
>>>> +#include <linux/nvme-keyring.h>
>>>>   #include <net/sock.h>
>>>>   #include <net/tcp.h>
>>>> +#include <net/handshake.h>
>>>>   #include <linux/blk-mq.h>
>>>>   #include <crypto/hash.h>
>>>>   #include <net/busy_poll.h>
>>>> @@ -31,6 +34,14 @@ static int so_priority;
>>>>   module_param(so_priority, int, 0644);
>>>>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>>>> +/*
>>>> + * TLS handshake timeout
>>>> + */
>>>> +static int tls_handshake_timeout = 10;
>>>> +module_param(tls_handshake_timeout, int, 0644);
>>>> +MODULE_PARM_DESC(tls_handshake_timeout,
>>>> +         "nvme TLS handshake timeout in seconds (default 10)");
>>>
>>> Can you share what is the normal time of an upcall?
>>>
>> That really depends on the network latency and/or reachability of the 
>> server. It might just have been started up, switches' MAC tables not 
>> updated, STP still ongoing, what do I know.
>> So 10 seconds seemed to be a good compromise.
>> But that's also why I made this configurable.
> 
> Does it really take 10 seconds per connection :() ?
> I'm planning to give this a go soon so will find out.
> 
Normally not, but who knows what'll happen in real life...

>>>> +
>>>>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>>>   /* lockdep can detect a circular dependency of the form
>>>>    *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>>>> @@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
>>>>       NVME_TCP_Q_ALLOCATED    = 0,
>>>>       NVME_TCP_Q_LIVE        = 1,
>>>>       NVME_TCP_Q_POLLING    = 2,
>>>> +    NVME_TCP_Q_TLS        = 3,
>>>>   };
>>>>   enum nvme_tcp_recv_state {
>>>> @@ -148,6 +160,9 @@ struct nvme_tcp_queue {
>>>>       __le32            exp_ddgst;
>>>>       __le32            recv_ddgst;
>>>> +    struct completion       *tls_complete;
>>>> +    int                     tls_err;
>>>> +
>>>>       struct page_frag_cache    pf_cache;
>>>>       void (*state_change)(struct sock *);
>>>> @@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct 
>>>> nvme_tcp_queue *queue)
>>>>       queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, 
>>>> false);
>>>>   }
>>>> -static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>>>> +/*
>>>> + * nvme_tcp_lookup_psk - Look up PSKs to use for TLS
>>>> + *
>>>> + */
>>>> +static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
>>>> +                   key_serial_t *keylist, int num_keys)
>>>
>>> Where is num_keys used?
>>>
>> Ah, indeed, need to check this in the loop.
>>
>>>> +{
>>>> +    enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
>>>> +    struct key *tls_key;
>>>> +    int num = 0;
>>>> +    bool generated = false;
>>>> +
>>>> +    /* Check for pre-provisioned keys; retained keys first */
>>>> +    do {
>>>> +        tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
>>>> +                          nctrl->opts->subsysnqn,
>>>> +                          cipher, generated);
>>>> +        if (!IS_ERR(tls_key)) {
>>>> +            keylist[num] = tls_key->serial;
>>>> +            num++;
>>>> +            key_put(tls_key);
>>>> +        }
>>>> +        if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
>>>> +            cipher = NVME_TCP_TLS_CIPHER_SHA256;
>>>> +        else {
>>>> +            if (generated)
>>>> +                cipher = NVME_TCP_TLS_CIPHER_INVALID;
>>>> +            else {
>>>> +                cipher = NVME_TCP_TLS_CIPHER_SHA384;
>>>> +                generated = true;
>>>> +            }
>>>> +        }
>>>> +    } while(cipher != NVME_TCP_TLS_CIPHER_INVALID);
>>>
>>> I'm unclear about a few things here:
>>> 1. what is the meaning of pre-provisioned vs. retained vs. generated?
>>> 2. Can this loop be reorganized in a nested for loop with a break?
>>>     I'm wondering if it will make it simpler to read.
>>>
>> 'pre-provisioned' means that the admin has stored the keys in the 
>> keyring prior to calling 'nvme connect'.
>> 'generated' means a key which is derived from the key material 
>> generated from a previous DH-HMAC-CHAP transaction.
> 
> Can we ignore the generated until a generation sequence code is actually
> introduced. This would help to digest this in a way that is simpler.
> 
Sure, that we can do.

>> As for the loop: I am going back and forth between having a loop
>> (which is executed exactly four times) and unrolling the loop into 
>> four distinct calls to nvme_tls_psk_lookup().
>> It probably doesn't matter for the actual assembler code (as the 
>> compiler will be doing a loop unroll anyway), but the unrolled code 
>> would allow for better documentation, Code might be slightly longer, 
>> though, with lots of repetitions.
>> So really, I don't know which is best.
> 
> I think that the best one would be to either:
> 1. Have one valid identity at all times (not sure if that is
> too restrictive, see my questions above)
> 2. Have userspace do the iterations.
> 
> If both are not possible, or too difficult, then we should optimized
> for simplicity/readability not code size.
> 
Yeah, that should be possible. Will be giving it a go.

>>
>>>> +    return num;
>>>> +}
>>>> +
>>>> +static void nvme_tcp_tls_done(void *data, int status, key_serial_t 
>>>> peerid)
>>>> +{
>>>> +    struct nvme_tcp_queue *queue = data;
>>>> +    struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>>>> +    int qid = nvme_tcp_queue_id(queue);
>>>> +
>>>> +    dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key 
>>>> %x, status %d\n",
>>>> +        qid, peerid, status);
>>>> +
>>>> +    queue->tls_err = -status;
>>>> +    if (queue->tls_complete)
>>>> +        complete(queue->tls_complete);
>>>> +}
>>>> +
>>>> +static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
>>>> +                  struct nvme_tcp_queue *queue,
>>>> +                  key_serial_t peerid)
>>>> +{
>>>> +    int qid = nvme_tcp_queue_id(queue);
>>>> +    int ret;
>>>> +    struct tls_handshake_args args;
>>>> +    unsigned long tmo = tls_handshake_timeout * HZ;
>>>> +    DECLARE_COMPLETION_ONSTACK(tls_complete);
>>>> +
>>>> +    dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
>>>> +        qid, peerid);
>>>> +    args.ta_sock = queue->sock;
>>>> +    args.ta_done = nvme_tcp_tls_done;
>>>> +    args.ta_data = queue;
>>>> +    args.ta_my_peerids[0] = peerid;
>>>> +    args.ta_num_peerids = 1;
>>>> +    args.ta_keyring = nvme_keyring_id();
>>>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
>>>> +    queue->tls_err = -EOPNOTSUPP;
>>>> +    queue->tls_complete = &tls_complete;
>>>> +    ret = tls_client_hello_psk(&args, GFP_KERNEL);
>>>> +    if (ret) {
>>>> +        dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
>>>> +            qid, ret);
>>>> +        return ret;
>>>> +    }
>>>> +    if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
>>>> +        dev_dbg(nctrl->device,
>>>> +            "queue %d: TLS handshake timeout\n", qid);
>>>> +        queue->tls_complete = NULL;
>>>> +        ret = -ETIMEDOUT;
>>>> +    } else {
>>>> +        dev_dbg(nctrl->device,
>>>> +            "queue %d: TLS handshake complete, error %d\n",
>>>> +            qid, queue->tls_err);
>>>> +        ret = queue->tls_err;
>>>> +    }
>>>> +    queue->tls_complete = NULL;
>>>> +    if (!ret)
>>>> +        set_bit(NVME_TCP_Q_TLS, &queue->flags);
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>>>> +                key_serial_t peerid)
>>>>   {
>>>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>>>       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>>>> @@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct 
>>>> nvme_ctrl *nctrl, int qid)
>>>>           goto err_rcv_pdu;
>>>>       }
>>>> +    /* If PSKs are configured try to start TLS */
>>>> +    if (peerid) {
>>>
>>> Where is peerid being initialized? Not to mention that peerid is
>>> a rather cryptic name (at least to me). Is this the ClientHello
>>> identity?
>>>
>> 'peerid' is the term used in the netlink handshake protocol.
>> It actually is the key serial number of the PSK to use.
>> Maybe 'psk_id' would be more appropriate here.
> 
> Probably psk_id is better.
> 
>>>> +        ret = nvme_tcp_start_tls(nctrl, queue, peerid);
>>>> +        if (ret)
>>>> +            goto err_init_connect;
>>>> +    }
>>>> +
>>>>       ret = nvme_tcp_init_connection(queue);
>>>>       if (ret)
>>>>           goto err_init_connect;
>>>> @@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct 
>>>> nvme_ctrl *ctrl,
>>>>   static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>>>   {
>>>> -    int ret;
>>>> +    int ret = -EINVAL, num_keys, k;
>>>> +    key_serial_t keylist[4];
>>>> -    ret = nvme_tcp_alloc_queue(ctrl, 0);
>>>> -    if (ret)
>>>> -        return ret;
>>>> +    memset(keylist, 0, sizeof(keylist));
>>>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>>> +    for (k = 0; k < num_keys; k++) {
>>>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
>>>> +        if (!ret)
>>>> +            break;
>>>> +    }
>>>> +    if (ret) {
>>>> +        /* Try without TLS */
>>>
>>> Why? this is trying to always connect with tls and fallback to no-tls?
>>> Why not simply do what userspace is asking us to do?
>>>
>>> Seems backwards to me. Unless there is a statement in the spec
>>> that I'm not aware of which tells us to do so.
>>>
>> This is an implication of the chosen method to select the PSK from the 
>> kernel code.
>> If we move PSK selection to userspace we clearly wouldn't need this.
>> But if we move PSK selection to userspace we need an updated nvme-cli 
>> for a) selecting the PSK from the keystore and b) passing in the new 
>> option.
> 
> This at least for me, sounds better no?
> 
>> So for development it was easier to run with the in-kernel selection 
>> as I don't need to modify nvme-cli.
> 
> I understand. Thanks for explaining.
> 
>>
>>>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>>>> +        if (ret)
>>>> +            goto out_free_queue;
>>>> +    }
>>>>       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
>>>>       if (ret)
>>>> @@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct 
>>>> nvme_ctrl *ctrl)
>>>>   static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>>>   {
>>>> -    int i, ret;
>>>> +    int i, ret, num_keys = 0, k;
>>>> +    key_serial_t keylist[4];
>>>> +    memset(keylist, 0, sizeof(keylist));
>>>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>>>       for (i = 1; i < ctrl->queue_count; i++) {
>>>> -        ret = nvme_tcp_alloc_queue(ctrl, i);
>>>> -        if (ret)
>>>> -            goto out_free_queues;
>>>> +        ret = -EINVAL;
>>>> +        for (k = 0; k < num_keys; k++) {
>>>> +            ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
>>>> +            if (!ret)
>>>> +                break;
>>>
>>> What is going on here. are you establishing queue_count x num_keys 
>>> nvme queues?
>>>
>> No, I am _trying_ to establish a connection, breaking out if the attempt
>> _succeeded_.
> 
> Yes, it's just now confusing to read the code this way. The loop makes
> it difficult.
> 
> Another approach would be to just do this dance for the admin queue, but
> for IO queues, use the same psk_id that resolved. If this loop must live
> in the driver, at least minimize it to the admin queue alone.
> 
> As I said, I am absolutely not interested in supporting per connection
> psk_id.
> 
Okay, that would be possible, too.

But I'll be looking into the 'single PSK' use-case first; guess that'll 
make the code simpler already.
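
For the I/O queues your suggestion would then translate to something
like this (untested sketch; 'tls_psk_id' would be a new field in
struct nvme_tcp_ctrl caching the serial the admin queue succeeded with):

    static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
    {
        struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);
        int i, ret;

        for (i = 1; i < ctrl->queue_count; i++) {
            /* reuse the PSK that worked for the admin queue */
            ret = nvme_tcp_alloc_queue(ctrl, i, tcp_ctrl->tls_psk_id);
            if (ret)
                goto out_free_queues;
        }
        return 0;

    out_free_queues:
        for (i--; i >= 1; i--)
            nvme_tcp_free_queue(ctrl, i);
        return ret;
    }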

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/18] nvme-tcp: enable TLS handshake upcall
  2023-03-22 12:54         ` Hannes Reinecke
@ 2023-03-22 13:16           ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 13:16 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>> I have a more general question.
>> What is the scenario that we will have for a given hostnqn and
>> subsysnqn more than one valid identity? Do we need to support it?
>>
> Well; there are SHA-256 and SHA-384 identities. We need to _support_ 
> both, but seeing that we're dealing with retained PSKs for now I would 
> assume that the admin ensures that both sides are able to support the
> chosen hash.

Yes, lets not over-complicate it.

> So the real choice is just between a 'retained' and a 'generated' PSK.
> And it is assumed that any 'retained' PSK should take priority over any 
> 'generated' PSK.
> So for 'retained' PSKs we can use userland to pass in the PSK
> (or, indeed, have the kernel select one, as we really only have one 
> choice...)
> And for 'generated' PSKs they really come into play only if we don't 
> have 'retained' PSKs, and if secure concatenation is enabled.
> But even there you can (out of necessity) only generate a single PSK,
> so again there is no choice.
> 
> So in the light of all this I guess we can revert to only using a single 
> PSK.

This makes perfect sense to me.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 12:51       ` Sagi Grimberg
@ 2023-03-22 13:47         ` Hannes Reinecke
  2023-03-22 15:42           ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 13:47 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 13:51, Sagi Grimberg wrote:
> 
> 
> On 3/22/23 14:34, Hannes Reinecke wrote:
>> On 3/22/23 13:13, Sagi Grimberg wrote:
>>>
>>>
>>> On 3/21/23 14:43, Hannes Reinecke wrote:
>>>> Add functions to start the TLS handshake upcall.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>> ---
>>>>   drivers/nvme/target/tcp.c | 188 
>>>> ++++++++++++++++++++++++++++++++++++--
>>>>   1 file changed, 181 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>>> index 5c43767c5ecd..6e88e98a2c59 100644
>>>> --- a/drivers/nvme/target/tcp.c
>>>> +++ b/drivers/nvme/target/tcp.c
>>>> @@ -9,8 +9,10 @@
>>>>   #include <linux/slab.h>
>>>>   #include <linux/err.h>
>>>>   #include <linux/nvme-tcp.h>
>>>> +#include <linux/nvme-keyring.h>
>>>>   #include <net/sock.h>
>>>>   #include <net/tcp.h>
>>>> +#include <net/handshake.h>
>>>>   #include <linux/inet.h>
>>>>   #include <linux/llist.h>
>>>>   #include <crypto/hash.h>
>>>> @@ -40,6 +42,14 @@ module_param(idle_poll_period_usecs, int, 0644);
>>>>   MODULE_PARM_DESC(idle_poll_period_usecs,
>>>>           "nvmet tcp io_work poll till idle time period in usecs");
>>>> +/*
>>>> + * TLS handshake timeout
>>>> + */
>>>> +static int tls_handshake_timeout = 30;
>>>
>>> 30 ?
>>>
>> Yeah; will be changing it to 10.
>>
>>>> +module_param(tls_handshake_timeout, int, 0644);
>>>> +MODULE_PARM_DESC(tls_handshake_timeout,
>>>> +         "nvme TLS handshake timeout in seconds (default 30)");
>>>> +
>>>>   #define NVMET_TCP_RECV_BUDGET        8
>>>>   #define NVMET_TCP_SEND_BUDGET        8
>>>>   #define NVMET_TCP_IO_WORK_BUDGET    64
>>>> @@ -131,6 +141,9 @@ struct nvmet_tcp_queue {
>>>>       struct ahash_request    *snd_hash;
>>>>       struct ahash_request    *rcv_hash;
>>>> +    struct key        *tls_psk;
>>>> +    struct delayed_work    tls_handshake_work;
>>>> +
>>>>       unsigned long           poll_end;
>>>>       spinlock_t        state_lock;
>>>> @@ -168,6 +181,7 @@ static struct workqueue_struct *nvmet_tcp_wq;
>>>>   static const struct nvmet_fabrics_ops nvmet_tcp_ops;
>>>>   static void nvmet_tcp_free_cmd(struct nvmet_tcp_cmd *c);
>>>>   static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd);
>>>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct 
>>>> *work);
>>>>   static inline u16 nvmet_tcp_cmd_tag(struct nvmet_tcp_queue *queue,
>>>>           struct nvmet_tcp_cmd *cmd)
>>>> @@ -1400,6 +1414,8 @@ static void 
>>>> nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
>>>>   {
>>>>       struct socket *sock = queue->sock;
>>>> +    if (!sock->sk)
>>>> +        return;
>>>
>>> Umm, when will the sock not have an sk?
>>>
>> When someone called 'sock_release()'.
>> But that's basically a leftover from development.
>>
>>>>       write_lock_bh(&sock->sk->sk_callback_lock);
>>>>       sock->sk->sk_data_ready =  queue->data_ready;
>>>>       sock->sk->sk_state_change = queue->state_change;
>>>> @@ -1448,7 +1464,8 @@ static void 
>>>> nvmet_tcp_release_queue_work(struct work_struct *w)
>>>>       list_del_init(&queue->queue_list);
>>>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>>>> -    nvmet_tcp_restore_socket_callbacks(queue);
>>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>>>> +        nvmet_tcp_restore_socket_callbacks(queue);
>>>
>>> This is because you only save the callbacks after the handshake
>>> phase is done? Maybe it would be simpler to clear the ops because
>>> the socket is going away anyways...
>>>
>> Or just leave it in place, as they'll be cleared up on sock_release().
> 
> This plays a role today, because after we clear sock callbacks, and
> flush io_work, we know we are not going to be triggered from the
> network, which is needed to continue teardown safely. So if you leave
> them in place, you need to do a different fence here.
> 
>>
>>>>       cancel_work_sync(&queue->io_work);
>>>>       /* stop accepting incoming data */
>>>>       queue->rcv_state = NVMET_TCP_RECV_ERR;
>>>> @@ -1469,6 +1486,8 @@ static void 
>>>> nvmet_tcp_release_queue_work(struct work_struct *w)
>>>>       nvmet_tcp_free_cmds(queue);
>>>>       if (queue->hdr_digest || queue->data_digest)
>>>>           nvmet_tcp_free_crypto(queue);
>>>> +    if (queue->tls_psk)
>>>> +        key_put(queue->tls_psk);
>>>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>>>>       page = virt_to_head_page(queue->pf_cache.va);
>>>>       __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>>>> @@ -1481,11 +1500,15 @@ static void nvmet_tcp_data_ready(struct sock 
>>>> *sk)
>>>>       trace_sk_data_ready(sk);
>>>> -    read_lock_bh(&sk->sk_callback_lock);
>>>> -    queue = sk->sk_user_data;
>>>> -    if (likely(queue))
>>>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, 
>>>> &queue->io_work);
>>>> -    read_unlock_bh(&sk->sk_callback_lock);
>>>> +    rcu_read_lock_bh();
>>>> +    queue = rcu_dereference_sk_user_data(sk);
>>>> +    if (queue->data_ready)
>>>> +        queue->data_ready(sk);
>>>> +    if (likely(queue) &&
>>>> +        queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>>>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>>>> +                  &queue->io_work);
>>>> +    rcu_read_unlock_bh();
>>>
>>> Same comment as the host side. separate rcu stuff from data_ready call.
>>>
>> Ok.
>>
>>>>   }
>>>>   static void nvmet_tcp_write_space(struct sock *sk)
>>>> @@ -1585,13 +1608,139 @@ static int nvmet_tcp_set_queue_sock(struct 
>>>> nvmet_tcp_queue *queue)
>>>>           sock->sk->sk_write_space = nvmet_tcp_write_space;
>>>>           if (idle_poll_period_usecs)
>>>>               nvmet_tcp_arm_queue_deadline(queue);
>>>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, 
>>>> &queue->io_work);
>>>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>>>> +                  &queue->io_work);
>>>
>>> Why the change?
>>>
>> Left-over from development.
>>
>>>>       }
>>>>       write_unlock_bh(&sock->sk->sk_callback_lock);
>>>>       return ret;
>>>>   }
>>>> +static void nvmet_tcp_tls_data_ready(struct sock *sk)
>>>> +{
>>>> +    struct socket_wq *wq;
>>>> +
>>>> +    rcu_read_lock();
>>>> +    /* kTLS will change the callback */
>>>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready) {
>>>> +        wq = rcu_dereference(sk->sk_wq);
>>>> +        if (skwq_has_sleeper(wq))
>>>> +            wake_up_interruptible_all(&wq->wait);
>>>> +    }
>>>> +    rcu_read_unlock();
>>>> +}
>>>
>>> Can you explain why this is needed? It looks out-of-place.
>>> Who is this waking up? isn't tls already calling the socket
>>> default data_ready that does something similar for userspace?
>>>
>> Black magic.
> 
> :)
> 
>> The 'data_ready' call might happen at any time between the 'accept' call 
>> and us calling into userspace.
>> In particular we have this flow of control:
>>
>> 1. Kernel: accept()
>> 2. Kernel: handshake request
>> 3. Userspace: read data from socket
>> 4. Userspace: tls handshake
>> 5. Kernel: handshake complete
>>
>> If the 'data_ready' event occurs between 1. and 3., userspace wouldn't 
>> know that something has happened, and would sit there waiting 
>> for data which is already present.
> 
> Umm, doesn't userspace read from the socket once we trigger the upcall?
> It should. But I still don't understand what is the difference between
> us waking up userspace and the default sock doing the same?
> 
No, it doesn't (or, rather, can't).
After processing 'accept()' (from the kernel code) data might already be 
present (after all, why would we get an 'accept' call otherwise?).
But the daemon has not been started up (yet); that's only done in
step 3). But 'data_ready' has already been called, so by the time 
userland is able to do a 'read()' on the socket it won't be seeing anything.

To be precise: that's the reasoning I've come up with.
Might be completely wrong, or beside the point.
But what I _do_ know is that without this call userspace would
_not_ see any data, and would happily sit there until we close the 
socket due to a timeout (which is also why I put in the timeout
module options ...), and only _then_ see the data.
But with this call everything works. So there.
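
For reference, the timeout mentioned here is wired up in the patch as a
module parameter which both bounds the upcall and arms a delayed work;
schematically (the default value below is illustrative, not taken from
the patch):

--
/* Schematic sketch of the timeout wiring; default is illustrative. */
static int tls_handshake_timeout = 10;	/* seconds */
module_param(tls_handshake_timeout, int, 0644);
MODULE_PARM_DESC(tls_handshake_timeout,
		 "seconds to wait for the TLS handshake to complete");

/* armed right after the upcall is started, cancelled in ->ta_done: */
queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
		   tls_handshake_timeout * HZ);
--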

>>>> +
>>>> +static void nvmet_tcp_tls_handshake_restart(struct nvmet_tcp_queue *queue)
>>>> +{
>>>> +    spin_lock(&queue->state_lock);
>>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>>>> +        pr_warn("queue %d: TLS handshake already completed\n",
>>>> +            queue->idx);
>>>> +        spin_unlock(&queue->state_lock);
>>>> +        return;
>>>> +    }
>>>> +    queue->state = NVMET_TCP_Q_CONNECTING;
>>>> +    spin_unlock(&queue->state_lock);
>>>> +
>>>> +    pr_debug("queue %d: restarting queue after TLS handshake\n",
>>>> +         queue->idx);
>>>> +    /*
>>>> +     * Set callbacks after handshake; TLS implementation
>>>> +     * might have changed the socket callbacks.
>>>> +     */
>>>> +    nvmet_tcp_set_queue_sock(queue);
>>>
>>> My understanding is that this is the desired end-state, i.e.
>>> tls connection is ready and now we are expecting nvme traffic?
>>>
>> Yes.
>>
>>> I think that the function name should be changed; it sounds like
>>> it is restarting the handshake, and it does not appear to do that.
>>>
>> Sure, np.
>>
>> nvmet_tcp_set_queue_callbacks()?
> 
> I meant about nvmet_tcp_tls_handshake_restart()
> 
Ah, that. Yes, of course.

>>
>>>> +}
>>>> +
>>>> +static void nvmet_tcp_save_tls_callbacks(struct nvmet_tcp_queue *queue)
>>>> +{
>>>> +    struct sock *sk = queue->sock->sk;
>>>> +
>>>> +    write_lock_bh(&sk->sk_callback_lock);
>>>> +    rcu_assign_sk_user_data(sk, queue);
>>>> +    queue->data_ready = sk->sk_data_ready;
>>>> +    sk->sk_data_ready = nvmet_tcp_tls_data_ready;
>>>> +    write_unlock_bh(&sk->sk_callback_lock);
>>>> +}
>>>> +
>>>> +static void nvmet_tcp_restore_tls_callbacks(struct nvmet_tcp_queue *queue)
>>>> +{
>>>> +    struct sock *sk = queue->sock->sk;
>>>> +
>>>> +    if (WARN_ON(!sk))
>>>> +        return;
>>>> +    write_lock_bh(&sk->sk_callback_lock);
>>>> +    /* Only reset the callback if it really is ours */
>>>> +    if (sk->sk_data_ready == nvmet_tcp_tls_data_ready)
>>>
>>> I still don't understand why our data_ready for tls is needed.
>>> Who are
>>>
>> See above for an explanation.
>>
>>>> +        sk->sk_data_ready = queue->data_ready;
>>>> +    rcu_assign_sk_user_data(sk, NULL);
>>>> +    queue->data_ready = NULL;
>>>> +    write_unlock_bh(&sk->sk_callback_lock);
>>>> +}
>>>> +
>>>> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
>>>> +                     key_serial_t peerid)
>>>> +{
>>>> +    struct nvmet_tcp_queue *queue = data;
>>>> +
>>>> +    pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
>>>> +         queue->idx, peerid, status);
>>>> +    if (!status) {
>>>> +        spin_lock(&queue->state_lock);
>>>> +        queue->tls_psk = key_lookup(peerid);
>>>> +        if (IS_ERR(queue->tls_psk)) {
>>>> +            pr_warn("queue %d: TLS key %x not found\n",
>>>> +                queue->idx, peerid);
>>>> +            queue->tls_psk = NULL;
>>>> +        }
>>>> +        spin_unlock(&queue->state_lock);
>>>> +    }
>>>> +    cancel_delayed_work_sync(&queue->tls_handshake_work);
>>>> +    nvmet_tcp_restore_tls_callbacks(queue);
>>>> +    if (status)
>>>> +        nvmet_tcp_schedule_release_queue(queue);
>>>> +    else
>>>> +        nvmet_tcp_tls_handshake_restart(queue);
>>>> +}
>>>> +
>>>> +static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
>>>> +{
>>>> +    struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
>>>> +            struct nvmet_tcp_queue, tls_handshake_work);
>>>> +
>>>> +    pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
>>>> +    nvmet_tcp_restore_tls_callbacks(queue);
>>>> +    nvmet_tcp_schedule_release_queue(queue);
>>>> +}
>>>> +
>>>> +static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
>>>> +{
>>>> +    int ret = -EOPNOTSUPP;
>>>> +    struct tls_handshake_args args;
>>>> +
>>>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>>>> +        pr_warn("cannot start TLS in state %d\n", queue->state);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    pr_debug("queue %d: TLS ServerHello\n", queue->idx);
>>>> +    args.ta_sock = queue->sock;
>>>> +    args.ta_done = nvmet_tcp_tls_handshake_done;
>>>> +    args.ta_data = queue;
>>>> +    args.ta_keyring = nvme_keyring_id();
>>>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;
>>>
>>>   why the 2x timeout?
>>>
>> Because I'm chicken. Will be changing it.
> 
> :)
> 
>>
>>>> +
>>>> +    ret = tls_server_hello_psk(&args, GFP_KERNEL);
>>>> +    if (ret) {
>>>> +        pr_err("failed to start TLS, err=%d\n", ret);
>>>> +    } else {
>>>> +        pr_debug("queue %d wakeup userspace\n", queue->idx);
>>>> +        nvmet_tcp_tls_data_ready(queue->sock->sk);
>>>> +        queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
>>>> +                   tls_handshake_timeout * HZ);
>>>> +    }
>>>> +    return ret;
>>>> +}
>>>> +
>>>>   static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>>>           struct socket *newsock)
>>>>   {
>>>> @@ -1604,6 +1753,8 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>>>       INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
>>>>       INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
>>>> +    INIT_DELAYED_WORK(&queue->tls_handshake_work,
>>>> +              nvmet_tcp_tls_handshake_timeout_work);
>>>>       queue->sock = newsock;
>>>>       queue->port = port;
>>>>       queue->nr_cmds = 0;
>>>> @@ -1646,6 +1797,29 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
>>>>       list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
>>>>       mutex_unlock(&nvmet_tcp_queue_mutex);
>>>> +    if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
>>>> +        nvmet_tcp_save_tls_callbacks(queue);
>>>> +        if (!nvmet_tcp_tls_handshake(queue))
>>>> +            return;
>>>> +        nvmet_tcp_restore_tls_callbacks(queue);
>>>> +
>>>> +        /*
>>>> +         * If sectype is set to 'tls1.3' TLS is required
>>>> +         * so terminate the connection if the TLS handshake
>>>> +         * failed.
>>>> +         */
>>>> +        if (queue->port->nport->disc_addr.tsas.tcp.sectype ==
>>>> +            NVMF_TCP_SECTYPE_TLS13) {
>>>> +            pr_debug("queue %d sectype tls1.3, terminate connection\n",
>>>> +                 queue->idx);
>>>> +            goto out_destroy_sq;
>>>> +        }
>>>> +        pr_debug("queue %d fallback to icreq\n", queue->idx);
>>>> +        spin_lock(&queue->state_lock);
>>>> +        queue->state = NVMET_TCP_Q_CONNECTING;
>>>> +        spin_unlock(&queue->state_lock);
>>>> +    }
>>>> +
>>>>       ret = nvmet_tcp_set_queue_sock(queue);
>>>>       if (ret)
>>>>           goto out_destroy_sq;
>>>
>>> I'm still trying to learn the state machine here, can you share a few 
>>> words on it? Also please include it in the next round in the change log.
>>
>> As outlined in the response to the nvme-tcp upcall, on the server side 
>> we _have_ to allow for non-TLS connections (eg. for discovery).
> 
> But in essence what you are doing is that you allow normal connections
> for a secured port...
> 
Correct.

> btw, why not enforce a psk for the discovery controller (on this port)
> as well? for secured ports? No one said that we must accept a
> non-secured discovery connecting host on a secured port.
> 
Because we can't have a PSK for standard discovery.
Every discovery controller has the canonical discovery NQN, so every
PSK will have the same subsystem NQN, and you will have to provide the 
_same_ PSK to every attached storage system.
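
To make that concrete: the retained PSK identity is derived from the
host NQN and the subsystem NQN. A sketch (the identity layout below is
modelled on the TP8011 retained-PSK format and is illustrative only):

--
/* Illustrative only: identity layout modelled on the TP8011
 * retained-PSK format ("NVMe0R<hmac> <hostNQN> <subsysNQN>").
 * Whatever the exact layout, the subsystem NQN is always the
 * canonical discovery NQN, so the identity (and hence the PSK
 * to provision) comes out the same for every target.
 */
#define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"

static void build_disc_psk_identity(char *buf, size_t len,
				    const char *hostnqn)
{
	snprintf(buf, len, "NVMe0R01 %s %s", hostnqn,
		 NVME_DISC_SUBSYS_NAME);
}
--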

If we had unique discovery controller NQNs everything would be good,
but as we don't I guess we have to support normal connections in 
addition to secured ones.

As I said, maybe it's time to revisit the unique discovery NQN patches;
with those we should be able to separate the various use-cases, like
having a dedicated secured discovery port.

>> And we have to start the daemon _before_ the first packet arrives,
> 
> Not sure why that is.
> 
>> but only the first packet will tell us what we should have done.
>> So really we have to start the upcall and see what happens.
>> The 'real' handling / differentiation between these two modes is done 
>> with the 'peek pdu' patch later on.
> 
> I am hoping we can kill it.

Sorry, no. Can't.
Not without killing standard discovery.
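
For reference, the 'peek pdu' patch boils down to something like this
sketch (assuming the first PDU of a plain NVMe/TCP connection is an
ICReq; anything else is treated as a TLS ClientHello):

--
/* Sketch of the peek-based detection: look at the first header
 * without consuming it, so that either the normal icreq path or
 * the TLS upcall can still read it afterwards.
 */
static int nvmet_tcp_peek_first_pdu(struct socket *sock)
{
	struct nvme_tcp_hdr hdr;
	struct kvec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };
	struct msghdr msg = {};
	int len;

	len = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len,
			     MSG_PEEK | MSG_DONTWAIT);
	if (len < (int)sizeof(hdr))
		return -EAGAIN;
	/* not an ICReq: assume a TLS ClientHello, start the upcall */
	return hdr.type == nvme_tcp_icreq ? 0 : -ENOTCONN;
}
--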

Cheers,

Hannes



* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-22 12:53         ` Sagi Grimberg
@ 2023-03-22 15:10           ` Hannes Reinecke
  2023-03-22 15:43             ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 15:10 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 13:53, Sagi Grimberg wrote:
> 
>>>>>> Hi all,
>>>>>>
>>>>>> finally I've managed to put all things together and enable in-kernel
>>>>>> TLS support for NVMe-over-TCP.
>>>>>
>>>>> Hannes (and Chuck) this is great, I'm very happy to see this!
>>>>>
>>>>> I'll start a detailed review soon enough.
>>>>>
>>>>> Thank you for doing this.
>>>>>
>>>>>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>>>>>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>>>>>> posted to the linux netdev list), and requires the 'tlshd' userspace
>>>>>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>>>>>> handshake.
>>>>>
>>>>> Do you have an actual link to follow for this patch set?
>>>>
>>>> Sure.
>>>>
>>>> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
>>>> branch tls-netlink.v7
>>>
>>> I meant Chuck's posting on linux-netdev.
>>
>> To be found here:
>>
>> <https://www.spinics.net/lists/netdev/msg890047.html>
> 
> Nice, it would be great to see code, if you have it, for nvme-cli and/or
> nvmetcli as well.

PR for libnvme: PR#599
PR for nvme-cli: PR#1868

which is just for updating 'nvme gen-tls-key' to allow the admin to 
provision 'retained' PSKs in the kernel keyring.

For nvmetcli we actually don't need an update; everything works with the 
existing code :-)

Cheers,

Hannes



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 13:47         ` Hannes Reinecke
@ 2023-03-22 15:42           ` Sagi Grimberg
  2023-03-22 16:43             ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 15:42 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>> The 'data_ready' call might happen at any time between the 'accept'
>>> call and us calling into userspace.
>>> In particular we have this flow of control:
>>>
>>> 1. Kernel: accept()
>>> 2. Kernel: handshake request
>>> 3. Userspace: read data from socket
>>> 4. Userspace: tls handshake
>>> 5. Kernel: handshake complete
>>>
>>> If the 'data_ready' event occurs between 1. and 3. userspace wouldn't 
>>> know that something has happened, and will be sitting there waiting 
>>> for data which is already present.
>>
>> Umm, doesn't userspace read from the socket once we trigger the upcall?
>> it should. But I still don't understand what is the difference between
>> us waking up userspace and the default sock doing the same?
>>
> No, it doesn't (or, rather, can't).
> After processing 'accept()' (from the kernel code) data might already be 
> present (after all, why would we get an 'accept' call otherwise?).
> But the daemon has not been started up (yet); that's only done in
> step 3). But 'data_ready' has already been called, so by the time 
> userland is able to do a 'read()' on the socket it won't be seeing 
> anything.

Not sure I understand. If data exists, userspace will read from the
socket and get data, whenever that is.

> To be precise: that's the reasoning I've come up with.
> Might be completely wrong, or beside the point.
> But what I _do_ know is that without this call userspace would
> _not_ see any data, and would happily sit there until we close the 
>> socket due to a timeout (which is also why I put in the timeout
> module options ...), and only _then_ see the data.
> But with this call everything works. So there.

I may be missing something, and maybe it's because I haven't looked at
the userspace side yet. But if it calls recvmsg, it should either
consume data, or it should block until it arrives.

I'm finding it difficult to understand why emulating the default
socket .data_ready almost 100% is what is needed. But I'll try to
understand better.



>>> As outlined in the response to the nvme-tcp upcall, on the server 
>>> side we _have_ to allow for non-TLS connections (eg. for discovery).
>>
>> But in essence what you are doing is that you allow normal connections
>> for a secured port...
>>
> Correct.

That is not what the feature is supposed to do.

>> btw, why not enforce a psk for the discovery controller (on this port)
>> as well? for secured ports? No one said that we must accept a
>> non-secured discovery connecting host on a secured port.
>>
> Because we can't have a PSK for standard discovery.
> Every discovery controller has the canonical discovery NQN, so every
> PSK will have the same subsystem NQN, and you will have to provide the 
> _same_ PSK to every attached storage system.

Right, because they have the same NQN. Sounds good, what is the problem?
It is much better than to have sectype be nothing more than a
recommendation.

If sectype tls1.x is defined on a port then every connection to that
port is a tls connection.

> If we had unique discovery controller NQNs everything would be good,
> but as we don't I guess we have to support normal connections in 
> addition to secured ones.

It's orthogonal. The well-known disc-nqn may be "weaker", but tls
is still enforced. When we add unique disc-nqn, we can use different
psks.

> As I said, maybe it's time to revisit the unique discovery NQN patches;
> with those we should be able to separate the various use-cases, like
> having a dedicated secured discovery port.

I don't see why this would be needed.

> 
>>> And we have to start the daemon _before_ the first packet arrives,
>>
>> Not sure why that is.
>>
>>> but only the first packet will tell us what we should have done.
>>> So really we have to start the upcall and see what happens.
>>> The 'real' handling / differentiation between these two modes is done 
>>> with the 'peek pdu' patch later on.
>>
>> I am hoping we can kill it.
> 
> Sorry, no. Can't.
> Not without killing standard discovery.

I don't understand the dependency.


* Re: [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP
  2023-03-22 15:10           ` Hannes Reinecke
@ 2023-03-22 15:43             ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-22 15:43 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


>>>>>>> Hi all,
>>>>>>>
>>>>>>> finally I've managed to put all things together and enable in-kernel
>>>>>>> TLS support for NVMe-over-TCP.
>>>>>>
>>>>>> Hannes (and Chuck) this is great, I'm very happy to see this!
>>>>>>
>>>>>> I'll start a detailed review soon enough.
>>>>>>
>>>>>> Thank you for doing this.
>>>>>>
>>>>>>> The patchset is based on the TLS upcall mechanism from Chuck Lever
>>>>>>> (cf '[PATCH v7 0/2] Another crack at a handshake upcall mechanism'
>>>>>>> posted to the linux netdev list), and requires the 'tlshd' userspace
>>>>>>> daemon (https://github.com/oracle/ktls-utils) for the actual TLS 
>>>>>>> handshake.
>>>>>>
>>>>>> Do you have an actual link to follow for this patch set?
>>>>>
>>>>> Sure.
>>>>>
>>>>> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
>>>>> branch tls-netlink.v7
>>>>
>>>> I meant Chuck's posting on linux-netdev.
>>>
>>> To be found here:
>>>
>>> <https://www.spinics.net/lists/netdev/msg890047.html>
>>
>> Nice, it would be great to see code, if you have it, for nvme-cli and/or
>> nvmetcli as well.
> 
> PR for libnvme: PR#599
> PR for nvme-cli: PR#1868
> 
> which is just for updating 'nvme gen-tls-key' to allow the admin to 
> provision 'retained' PSKs in the kernel keyring.
> 
> For nvmetcli we actually don't need an update; everything works with the 
> existing code :-)

Can you send these patches together with the next round of submission?


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 15:42           ` Sagi Grimberg
@ 2023-03-22 16:43             ` Hannes Reinecke
  2023-03-22 16:49               ` Chuck Lever III
  2023-03-23  7:44               ` Sagi Grimberg
  0 siblings, 2 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-22 16:43 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 3/22/23 16:42, Sagi Grimberg wrote:
> 
>>>> The 'data_ready' call might happen at any time between the 'accept'
>>>> call and us calling into userspace.
>>>> In particular we have this flow of control:
>>>>
>>>> 1. Kernel: accept()
>>>> 2. Kernel: handshake request
>>>> 3. Userspace: read data from socket
>>>> 4. Userspace: tls handshake
>>>> 5. Kernel: handshake complete
>>>>
>>>> If the 'data_ready' event occurs between 1. and 3. userspace 
>>>> wouldn't know that something has happened, and will be sitting there 
>>>> waiting for data which is already present.
>>>
>>> Umm, doesn't userspace read from the socket once we trigger the upcall?
>>> it should. But I still don't understand what is the difference between
>>> us waking up userspace and the default sock doing the same?
>>>
>> No, it doesn't (or, rather, can't).
>> After processing 'accept()' (from the kernel code) data might already 
>> be present (after all, why would we get an 'accept' call otherwise?).
>> But the daemon has not been started up (yet); that's only done in
>> step 3). But 'data_ready' has already been called, so by the time 
>> userland is able to do a 'read()' on the socket it won't be seeing 
>> anything.
> 
> Not sure I understand. If data exists, userspace will read from the
> socket and get data, whenever that is.
That's what I thought, too.
But then the userspace daemon just sat there doing nothing.

>> To be precise: that's the reasoning I've come up with.
>> Might be completely wrong, or beside the point.
>> But what I _do_ know is that without this call userspace would
>> _not_ see any data, and would happily sit there until we close the 
>> socket due to a timeout (which is also why I put in the timeout
>> module options ...), and only _then_ see the data.
>> But with this call everything works. So there.
> 
> I may be missing something, and maybe it's because I haven't looked at
> the userspace side yet. But if it calls recvmsg, it should either
> consume data, or it should block until it arrives.
> 
> I'm finding it difficult to understand why emulating the default
> socket .data_ready almost 100% is what is needed. But I'll try to
> understand better.
> 
Oh, I don't claim to have a full understanding, either.
I had initially assumed that it would work 'out of the box', without us 
having to specify a callback.
Turns out that it doesn't.
Maybe it's enough to call sk->sk_data_ready() after the upcall has been 
started. But we sure have to do _something_.

If you have other solutions I'm all ears...
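
The variant I have in mind would look roughly like this (a sketch only;
queue->data_ready is the callback saved before the upcall, and 'args'
is set up as earlier in nvmet_tcp_tls_handshake()):

--
/* Sketch: once the upcall is in flight, re-issue the saved
 * data_ready so that data queued between accept() and the daemon
 * attaching generates a fresh wakeup for userspace.
 */
ret = tls_server_hello_psk(&args, GFP_KERNEL);
if (!ret) {
	struct sock *sk = queue->sock->sk;

	read_lock_bh(&sk->sk_callback_lock);
	if (queue->data_ready)
		queue->data_ready(sk);	/* saved pre-handshake callback */
	read_unlock_bh(&sk->sk_callback_lock);
}
--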

> 
> 
>>>> As outlined in the response to the nvme-tcp upcall, on the server 
>>>> side we _have_ to allow for non-TLS connections (eg. for discovery).
>>>
>>> But in essence what you are doing is that you allow normal connections
>>> for a secured port...
>>>
>> Correct.
> 
> That is not what the feature is supposed to do.
> 
Well ... it's a difference in interpretation.
The NVMe TCP spec just says '... describes whether TLS is supported.'.
It does not say 'required'.
But there currently is no way for the server to describe 'TLS
supported' vs 'TLS required'.

>>> btw, why not enforce a psk for the discovery controller (on this port)
>>> as well? for secured ports? No one said that we must accept a
>>> non-secured discovery connecting host on a secured port.
>>>
>> Because we can't have a PSK for standard discovery.
>> Every discovery controller has the canonical discovery NQN, so every
>> PSK will have the same subsystem NQN, and you will have to provide the 
>> _same_ PSK to every attached storage system.
> 
> Right, because they have the same NQN. Sounds good, what is the problem?
> It is much better than to have sectype be nothing more than a
> recommendation.
> 
> If sectype tls1.x is defined on a port then every connection to that
> port is a tls connection.
> 
See above. Only if we decide to give TSAS a 'TLS required' meaning.
With this patchset it's 'TLS supported' semantics.

>> If we had unique discovery controller NQNs everything would be good,
>> but as we don't I guess we have to support normal connections in 
>> addition to secured ones.
> 
> It's orthogonal. The well-known disc-nqn may be "weaker", but tls
> is still enforced. When we add unique disc-nqn, we can use different
> psks.
> 
Not only weaker, but _identical_ for each server.

>> As I said, maybe it's time to revisit the unique discovery NQN patches;
>> with those we should be able to separate the various use-cases, like
>> having a dedicated secured discovery port.
> 
> I don't see why this would be needed.
> 
We don't if we treat TSAS as 'TLS required', true.
But that's not what the spec says.
And there is also the dodgy statement in section 3.6.1.6:
 > If a host that supports TLS for NVMe/TCP receives a discovery
 > log entry indicating that the NVM subsystem uses NVMe/TCP and
 > does not support TLS, then the host should nonetheless
 > attempt to establish an NVMe/TCP connection that uses TLS.
 >
which really indicates that the TSAS field is a recommendation only.

But really, we shouldn't read too much into what the TSAS field says.
With the upcoming TPAR for TLS clarification, all these different 
interpretations should be clarified.

The key question, however, will remain: Do we _want_ to support a 'TLS 
supported' mode for nvmet?
I'd say we should, to have compatibility with older (client) installations.
And I guess we have to anyway for secure concatenation
(as this _requires_ that we start off unencrypted).
So I'd suggest leaving it for now, have the code support both, and 
wait for the TPAR to be ratified and then revisit this issue.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 16:43             ` Hannes Reinecke
@ 2023-03-22 16:49               ` Chuck Lever III
  2023-03-23  7:21                 ` Sagi Grimberg
  2023-03-23  7:44               ` Sagi Grimberg
  1 sibling, 1 reply; 90+ messages in thread
From: Chuck Lever III @ 2023-03-22 16:49 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake


> On Mar 22, 2023, at 12:43 PM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 3/22/23 16:42, Sagi Grimberg wrote:
>>>>> The 'data_ready' call might happen at any time between the 'accept' call and us calling into userspace.
>>>>> In particular we have this flow of control:
>>>>> 
>>>>> 1. Kernel: accept()
>>>>> 2. Kernel: handshake request
>>>>> 3. Userspace: read data from socket
>>>>> 4. Userspace: tls handshake
>>>>> 5. Kernel: handshake complete
>>>>> 
>>>>> If the 'data_ready' event occurs between 1. and 3. userspace wouldn't know that something has happened, and will be sitting there waiting for data which is already present.
>>>> 
>>>> Umm, doesn't userspace read from the socket once we trigger the upcall?
>>>> it should. But I still don't understand what is the difference between
>>>> us waking up userspace and the default sock doing the same?
>>>> 
>>> No, it doesn't (or, rather, can't).
>>> After processing 'accept()' (from the kernel code) data might already be present (after all, why would we get an 'accept' call otherwise?).
>>> But the daemon has not been started up (yet); that's only done in
>>> step 3). But 'data_ready' has already been called, so by the time userland is able to do a 'read()' on the socket it won't be seeing anything.
>> Not sure I understand. If data exists, userspace will read from the
>> socket and get data, whenever that is.
> That's what I thought, too.
> But then the userspace daemon just sat there doing nothing.

I haven't been following this discussion in detail, but
if the kernel disables the normal TCP data_ready callback,
then user space won't get any data. That's why SunRPC's
data_ready calls the previous sk_data_ready and then shunts
its own data_ready callback during handshakes. Without that
call to the old sk_data_ready, the user space endpoint won't
see any received data.

As long as the previous sk_data_ready callback is invoked
properly during the handshake, the received data is queued
and the user endpoint will behave exactly as you want.
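
In code the pattern is simply this (a generic sketch, not SunRPC's
actual functions):

--
/* Generic sketch: the replacement callback chains to the one it
 * displaced, so userspace readers keep getting their wakeups
 * while the handshake is in flight.
 */
struct xyz_conn {
	void (*saved_data_ready)(struct sock *sk);
};

static void xyz_handshake_data_ready(struct sock *sk)
{
	struct xyz_conn *conn;

	read_lock_bh(&sk->sk_callback_lock);
	conn = sk->sk_user_data;
	if (conn && conn->saved_data_ready)
		conn->saved_data_ready(sk);	/* wakes up user space */
	read_unlock_bh(&sk->sk_callback_lock);
	/* ... then do this layer's own bookkeeping ... */
}
--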

--
Chuck Lever




* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 16:49               ` Chuck Lever III
@ 2023-03-23  7:21                 ` Sagi Grimberg
  2023-03-24 11:29                   ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-23  7:21 UTC (permalink / raw)
  To: Chuck Lever III, Hannes Reinecke
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake


>>>>>> The 'data_ready' call might happen at any time between the 'accept' call and us calling into userspace.
>>>>>> In particular we have this flow of control:
>>>>>>
>>>>>> 1. Kernel: accept()
>>>>>> 2. Kernel: handshake request
>>>>>> 3. Userspace: read data from socket
>>>>>> 4. Userspace: tls handshake
>>>>>> 5. Kernel: handshake complete
>>>>>>
>>>>>> If the 'data_ready' event occurs between 1. and 3. userspace wouldn't know that something has happened, and will be sitting there waiting for data which is already present.
>>>>>
>>>>> Umm, doesn't userspace read from the socket once we trigger the upcall?
>>>>> it should. But I still don't understand what is the difference between
>>>>> us waking up userspace and the default sock doing the same?
>>>>>
>>>> No, it doesn't (or, rather, can't).
>>>> After processing 'accept()' (from the kernel code) data might already be present (after all, why would we get an 'accept' call otherwise?).
>>>> But the daemon has not been started up (yet); that's only done in
>>>> step 3). But 'data_ready' has already been called, so by the time userland is able to do a 'read()' on the socket it won't be seeing anything.
>>> Not sure I understand. If data exists, userspace will read from the
>>> socket and get data, whenever that is.
>> That's what I thought, too.
>> But then the userspace daemon just sat there doing nothing.
> 
> I haven't been following this discussion in detail, but
> if the kernel disables the normal TCP data_ready callback,
> then user space won't get any data. That's why SunRPC's
> data_ready calls the previous sk_data_ready and then shunts
> its own data_ready callback during handshakes. Without that
> call to the old sk_data_ready, the user space endpoint won't
> see any received data.

Yes, that is understood. But the solution that Hannes proposed
was to introduce nvmet_tcp_tls_data_ready, which overrides
the default sock data_ready and does pretty much the same thing.

The reason is that today nvmet_tcp_listen_data_ready schedules accept
and then pretty much immediately replaces the socket data_ready with
nvmet_tcp_data_ready.

I think that a simpler solution would be to make nvmet_tcp_listen_data_ready
call port->data_ready (the saved default socket data_ready), schedule
the accept_work, and only after the handshake bounce to userspace is
completed, override the socket callbacks.

Something like:
--
static void nvmet_tcp_listen_data_ready(struct sock *sk)
{
         struct nvmet_tcp_port *port;

         trace_sk_data_ready(sk);

         read_lock_bh(&sk->sk_callback_lock);
         port = sk->sk_user_data;
         if (!port)
                 goto out;

         port->data_ready(sk); // trigger socket old data_ready

         if (sk->sk_state == TCP_LISTEN)
                 queue_work(nvmet_wq, &port->accept_work);
out:
         read_unlock_bh(&sk->sk_callback_lock);
}

...

static void nvmet_tcp_tls_handshake_done(void *data, int status,
					 key_serial_t peerid)
{
	...
	queue->state = NVMET_TCP_Q_CONNECTING;
	nvmet_tcp_set_queue_sock(queue); // now own socket data_ready
}
--


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-22 16:43             ` Hannes Reinecke
  2023-03-22 16:49               ` Chuck Lever III
@ 2023-03-23  7:44               ` Sagi Grimberg
  1 sibling, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-23  7:44 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

Hey Hannes,

>>> To be precise: that's the reasoning I've come up with.
>>> Might be completely wrong, or beside the point.
>>> But what I _do_ know is that without this call userspace would
>>> _not_ see any data, and would happily sit there until we close the 
>>> socket due to a timeout (which is also why I put in the timeout
>>> module options ...), and only _then_ see the data.
>>> But with this call everything works. So there.
>>
>> I may be missing something, and maybe it's because I haven't looked at
>> the userspace side yet. But if it calls recvmsg, it should either
>> consume data, or it should block until it arrives.
>>
>> I'm finding it difficult to understand why emulating the default
>> socket .data_ready almost 100% is what is needed. But I'll try to
>> understand better.
>>
> Oh, I don't claim to have a full understanding, either.
> I had initially assumed that it would work 'out of the box', without us 
> having to specify a callback.
> Turns out that it doesn't.
> Maybe it's enough to call sk->sk_data_ready() after the upcall has been 
> started. But we sure have to do _something_.
> 
> If you have other solutions I'm all ears...

See my response to Chuck.

>>
>>>>> As outlined in the response to the nvme-tcp upcall, on the server 
>>>>> side we _have_ to allow for non-TLS connections (eg. for discovery).
>>>>
>>>> But in essence what you are doing is that you allow normal connections
>>>> for a secured port...
>>>>
>>> Correct.
>>
>> That is not what the feature is supposed to do.
>>
> Well ... it's a difference in interpretation.
> The NVMe TCP spec just says '... describes whether TLS is supported.'.
> It does not say 'required'.
> But there currently is no way for the server to describe 'TLS
> supported' vs 'TLS required'.

But it doesn't make sense. If a subsystem wants to expose itself via
tls and non-tls, it needs to use 2 different ports, which is a perfectly
reasonable requirement. Making TLS advisory misses the point, regardless
of the spec language.

Having support for advisory TLS is essentially implementing a permissive
mode, and that needs to be explicitly enabled (assuming we want to
support that), not the other way around.

>>>> btw, why not enforce a psk for the discovery controller (on this port)
>>>> as well? for secured ports? No one said that we must accept a
>>>> non-secured discovery connecting host on a secured port.
>>>>
>>> Because we can't have a PSK for standard discovery.
>>> Every discovery controller has the canonical discovery NQN, so every
>>> PSK will have the same subsystem NQN, and you will have to provide 
>>> the _same_ PSK to every attached storage system.
>>
>> Right, because they have the same NQN. Sounds good, what is the problem?
>> It is much better than to have sectype be nothing more than a
>> recommendation.
>>
>> If sectype tls1.x is defined on a port then every connection to that
>> port is a tls connection.
>>
> See above. Only if we decide to give TSAS a 'TLS required' meaning.
>> With this patchset it's 'TLS supported' semantics.

Yes, tls should be a hard enforcement, regardless of the spec language.
The only way that tls would be a soft enforcement is via an explicit
permissive-mode setting.

>>> If we had unique discovery controller NQNs everything would be good,
>>> but as we don't I guess we have to support normal connections in 
>>> addition to secured ones.
>>
>> It's orthogonal. The well-known disc-nqn may be "weaker", but tls
>> is still enforced. When we add unique disc-nqn, we can use different
>> psks.
>>
> Not only weaker, but _identical_ for each server.

I don't see any issue with that. I do see an issue with making tls
advisory universally in nvmet.

>>> As I said, maybe it's time to revisit the unique discovery NQN patches;
>>> with those we should be able to separate the various use-cases, like
>>> having a dedicated secured discovery port.
>>
>> I don't see why this would be needed.
>>
> We don't if we treat TSAS as 'TLS required', true.
> But that's not what the spec says.
> And there is also the dodgy statement in section 3.6.1.6:
>  > If a host that supports TLS for NVMe/TCP receives a discovery
>  > log entry indicating that the NVM subsystem uses NVMe/TCP and
>  > does not support TLS, then the host should nonetheless
>  > attempt to establish an NVMe/TCP connection that uses TLS.

First of all, "should" != "shall", so we are not obligated to do
this if it doesn't make sense for Linux, but regardless that is fine,
the host can fall back to non-tls in such a case if we choose to
implement this.
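
If we do choose to implement it, the host side would look roughly like
this (the helper names are stand-ins, not actual driver functions):

--
/* Illustrative host-side fallback; the helpers are stand-ins.
 * Try TLS first and, if the handshake fails and TLS was only
 * recommended rather than required, retry in the clear.
 */
ret = nvme_tcp_try_connect_tls(ctrl);
if (ret && !nvme_tcp_tls_required(ctrl)) {
	dev_warn(ctrl->device,
		 "TLS handshake failed (%d), retrying without TLS\n", ret);
	ret = nvme_tcp_try_connect_plain(ctrl);
}
--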

> which really indicates that the TSAS field is a recommendation only.
> 
> But really, we shouldn't read too much into what the TSAS field says.
> With the upcoming TPAR for TLS clarification, all these different
> interpretations should be clarified.
> 
> The key question, however, will remain: Do we _want_ to support a 'TLS 
> supported' mode for nvmet?
> I'd say we should, to have compatibility with older (client) installations.

That is great, but only with an explicit permissive port setting that
needs to be enabled, and later on disabled (i.e. when all the hosts get
upgraded over time).

> And I guess we have to anyway for secure concatenation
> (as this _requires_ that we start off unencrypted).

That I can understand. But it definitely should not mean that we
do a universally permissive tls.

> So I'd suggest leaving it for now, have the code support both, and
> wait for the TPAR to be ratified and then revisit this issue.

This is not a matter of a TPAR. It is about Linux. It doesn't make sense
to make tls universally permissive for _Linux_. I think we should
introduce code needed to support for secure concatenation only when
it is actually introduced.


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-23  7:21                 ` Sagi Grimberg
@ 2023-03-24 11:29                   ` Hannes Reinecke
  2023-03-26  7:18                     ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-24 11:29 UTC (permalink / raw)
  To: Sagi Grimberg, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake

On 3/23/23 08:21, Sagi Grimberg wrote:
> 
>>>>>>> The 'data_ready' call might happen at any time between the 'accept'
>>>>>>> call and us calling into userspace.
>>>>>>> In particular we have this flow of control:
>>>>>>>
>>>>>>> 1. Kernel: accept()
>>>>>>> 2. Kernel: handshake request
>>>>>>> 3. Userspace: read data from socket
>>>>>>> 4. Userspace: tls handshake
>>>>>>> 5. Kernel: handshake complete
>>>>>>>
>>>>>>> If the 'data_ready' event occurs between 1. and 3. userspace 
>>>>>>> wouldn't know that something has happened, and will be sitting 
>>>>>>> there waiting for data which is already present.
>>>>>>
>>>>>> Umm, doesn't userspace read from the socket once we trigger the 
>>>>>> upcall?
>>>>>> it should. But I still don't understand what is the difference 
>>>>>> between
>>>>>> us waking up userspace and the default sock doing the same?
>>>>>>
>>>>> No, it doesn't (or, rather, can't).
>>>>> After processing 'accept()' (from the kernel code) data might 
>>>>> already be present (after all, why would we get an 'accept' call 
>>>>> otherwise?).
>>>>> But the daemon has not been started up (yet); that's only done in
>>>>> step 3). But 'data_ready' has already been called, so by the time 
>>>>> userland is able to do a 'read()' on the socket it won't be seeing 
>>>>> anything.
>>>> Not sure I understand. If data exists, userspace will read from the
>>>> socket and get data, whenever that is.
>>> That's what I thought, too.
>>> But then the userspace daemon just sat there doing nothing.
>>
>> I haven't been following this discussion in detail, but
>> if the kernel disables the normal TCP data_ready callback,
>> then user space won't get any data. That's why SunRPC's
>> data_ready calls the previous sk_data_ready and then shunts
>> its own data_ready callback during handshakes. Without that
>> call to the old sk_data_ready, the user space endpoint won't
>> see any received data.
> 
> Yes, that is understood. But the solution that Hannes proposed
> was to introduce nvmet_tcp_tls_data_ready, which overrides
> the default sock data_ready and does pretty much the same thing.
> 
> The reason is that today nvmet_tcp_listen_data_ready schedules accept
> and then pretty much immediately replaces the socket data_ready with
> nvmet_tcp_data_ready.
> 
> I think that a simpler solution would be to make nvmet_tcp_listen_data_ready
> call port->data_ready (the saved default socket data_ready), schedule
> the accept_work, and only after the handshake bounce to userspace is
> completed, override the socket callbacks.
> 
> Something like:
> -- 
> static void nvmet_tcp_listen_data_ready(struct sock *sk)
> {
>          struct nvmet_tcp_port *port;
> 
>          trace_sk_data_ready(sk);
> 
>          read_lock_bh(&sk->sk_callback_lock);
>          port = sk->sk_user_data;
>          if (!port)
>                  goto out;
> 
>          port->data_ready(sk); // trigger socket old data_ready
> 
>          if (sk->sk_state == TCP_LISTEN)
>                  queue_work(nvmet_wq, &port->accept_work);
> out:
>          read_unlock_bh(&sk->sk_callback_lock);
> }
> 

Nearly there.

The actual patch would be:

@@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
         trace_sk_data_ready(sk);

         read_lock_bh(&sk->sk_callback_lock);
+       /* Ignore if the callback has been changed */
+       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
+               goto out;
         port = sk->sk_user_data;
         if (!port)
                 goto out;

+       if (port->data_ready)
+               port->data_ready(sk);
+
         if (sk->sk_state == TCP_LISTEN)
                 queue_work(nvmet_wq, &port->accept_work);
  out:

The callbacks will be changed once TLS is activated, and we really
should not attempt to run if sk_data_ready() points to another function,
as then the sk_user_data pointer will most likely have been changed, too,
causing all sorts of issues.

Cheers,

Hannes



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-24 11:29                   ` Hannes Reinecke
@ 2023-03-26  7:18                     ` Sagi Grimberg
  2023-03-27  6:20                       ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-26  7:18 UTC (permalink / raw)
  To: Hannes Reinecke, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake


>>>>>>>> The 'data_ready' call might happen at any time between the
>>>>>>>> 'accept' call and us calling into userspace.
>>>>>>>> In particular we have this flow of control:
>>>>>>>>
>>>>>>>> 1. Kernel: accept()
>>>>>>>> 2. Kernel: handshake request
>>>>>>>> 3. Userspace: read data from socket
>>>>>>>> 4. Userspace: tls handshake
>>>>>>>> 5. Kernel: handshake complete
>>>>>>>>
>>>>>>>> If the 'data_ready' event occurs between 1. and 3. userspace 
>>>>>>>> wouldn't know that something has happened, and will be sitting 
>>>>>>>> there waiting for data which is already present.
>>>>>>>
>>>>>>> Umm, doesn't userspace read from the socket once we trigger the 
>>>>>>> upcall?
>>>>>>> it should. But I still don't understand what is the difference 
>>>>>>> between
>>>>>>> us waking up userspace and the default sock doing the same?
>>>>>>>
>>>>>> No, it doesn't (or, rather, can't).
>>>>>> After processing 'accept()' (from the kernel code) data might 
>>>>>> already be present (after all, why would we get an 'accept' call 
>>>>>> otherwise?).
>>>>>> But the daemon has not been started up (yet); that's only done in
>>>>>> step 3). But 'data_ready' has already been called, so by the time 
>>>>>> userland is able to do a 'read()' on the socket it won't be seeing 
>>>>>> anything.
>>>>> Not sure I understand. If data exists, userspace will read from the
>>>>> socket and get data, whenever that is.
>>>> That's what I thought, too.
>>>> But then the userspace daemon just sat there doing nothing.
>>>
>>> I haven't been following this discussion in detail, but
>>> if the kernel disables the normal TCP data_ready callback,
>>> then user space won't get any data. That's why SunRPC's
>>> data_ready calls the previous sk_data_ready and then shunts
>>> its own data_ready callback during handshakes. Without that
>>> call to the old sk_data_ready, the user space endpoint won't
>>> see any received data.
>>
>> Yes, that is understood. But the solution that Hannes proposed
>> was to introduce nvmet_tcp_tls_data_ready, which overrides
>> the default sock data_ready and does pretty much the same thing.
>>
>> The reason is that today nvmet_tcp_listen_data_ready schedules accept
>> and then pretty much immediately replaces the socket data_ready with
>> nvmet_tcp_data_ready.
>>
>> I think that a simpler solution would be to make nvmet_tcp_listen_data_ready
>> call port->data_ready (the saved default socket data_ready), schedule
>> the accept_work, and only after the handshake bounce to userspace is
>> completed, override the socket callbacks.
>>
>> Something like:
>> -- 
>> static void nvmet_tcp_listen_data_ready(struct sock *sk)
>> {
>>          struct nvmet_tcp_port *port;
>>
>>          trace_sk_data_ready(sk);
>>
>>          read_lock_bh(&sk->sk_callback_lock);
>>          port = sk->sk_user_data;
>>          if (!port)
>>                  goto out;
>>
>>          port->data_ready(sk); // trigger socket old data_ready
>>
>>          if (sk->sk_state == TCP_LISTEN)
>>                  queue_work(nvmet_wq, &port->accept_work);
>> out:
>>          read_unlock_bh(&sk->sk_callback_lock);
>> }
>>
> 
> Nearly there.
> 
> The actual patch would be:
> 
> @@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
>          trace_sk_data_ready(sk);
> 
>          read_lock_bh(&sk->sk_callback_lock);
> +       /* Ignore if the callback has been changed */
> +       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
> +               goto out;
>          port = sk->sk_user_data;
>          if (!port)
>                  goto out;
> 
> +       if (port->data_ready)
> +               port->data_ready(sk);
> +
>          if (sk->sk_state == TCP_LISTEN)
>                  queue_work(nvmet_wq, &port->accept_work);
>   out:
> 
> The callbacks will be changed once TLS is activated, and we really
> should not attempt to run if sk_data_ready() points to another function,
> as then the sk_user_data pointer will most likely have been changed, too,
> causing all sorts of issues.

Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
called, doesn't it by definition mean that sk->sk_data_ready ==
nvmet_tcp_listen_data_ready?

Are you talking about a situation where, between
nvmet_tcp_listen_data_ready starting and taking the
sk->sk_callback_lock, the data_ready cb (and the user data
pointer) is changed?


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-26  7:18                     ` Sagi Grimberg
@ 2023-03-27  6:20                       ` Hannes Reinecke
  2023-03-28  8:44                         ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-27  6:20 UTC (permalink / raw)
  To: Sagi Grimberg, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake

On 3/26/23 09:18, Sagi Grimberg wrote:
> 
[ .. ]
>>
>> Nearly there.
>>
>> The actual patch would be:
>>
>> @@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
>>          trace_sk_data_ready(sk);
>>
>>          read_lock_bh(&sk->sk_callback_lock);
>> +       /* Ignore if the callback has been changed */
>> +       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
>> +               goto out;
>>          port = sk->sk_user_data;
>>          if (!port)
>>                  goto out;
>>
>> +       if (port->data_ready)
>> +               port->data_ready(sk);
>> +
>>          if (sk->sk_state == TCP_LISTEN)
>>                  queue_work(nvmet_wq, &port->accept_work);
>>   out:
>>
>> The callbacks will be changed once TLS is activated, and we really
>> should not attempt to run if sk_data_ready() points to another function,
>> as then the sk_user_data pointer will most likely have been changed, too,
>> causing all sorts of issues.
> 
> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
> called, doesn't it by definition mean that sk->sk_data_ready ==
> nvmet_tcp_listen_data_ready?
> 
> Are you talking about a situation where, between
> nvmet_tcp_listen_data_ready starting and taking the
> sk->sk_callback_lock, the data_ready cb (and the user data
> pointer) is changed?

No. Far simpler:
Starting kTLS will change the callbacks.
So sk->sk_data_ready will point to our callback before
the upcall, but to the kTLS version _after_ the upcall.
It typically doesn't matter, as we're setting it to
nvmet_tcp_data_ready() anyway.
But there might be the odd case where the data_ready
callback is invoked after kTLS has been started but before
we're setting it to nvmet_tcp_data_ready().
Then we cannot guarantee that sk_user_data really is set
to the 'queue' pointer, so we should skip the function.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-27  6:20                       ` Hannes Reinecke
@ 2023-03-28  8:44                         ` Sagi Grimberg
  2023-03-28  9:20                           ` Hannes Reinecke
  2023-03-28 13:22                           ` Chuck Lever III
  0 siblings, 2 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-28  8:44 UTC (permalink / raw)
  To: Hannes Reinecke, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake


>>>
>>> Nearly there.
>>>
>>> The actual patch would be:
>>>
>>> @@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
>>>          trace_sk_data_ready(sk);
>>>
>>>          read_lock_bh(&sk->sk_callback_lock);
>>> +       /* Ignore if the callback has been changed */
>>> +       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
>>> +               goto out;
>>>          port = sk->sk_user_data;
>>>          if (!port)
>>>                  goto out;
>>>
>>> +       if (port->data_ready)
>>> +               port->data_ready(sk);
>>> +
>>>          if (sk->sk_state == TCP_LISTEN)
>>>                  queue_work(nvmet_wq, &port->accept_work);
>>>   out:
>>>
>>> The callbacks will be changed once TLS is activated, and we really
>>> should not attempt to run if sk_data_ready() points to another function,
>>> as then the sk_user_data pointer will most likely have been changed, too,
>>> causing all sorts of issues.
>>
>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>> called, doesn't it by definition mean that sk->sk_data_ready ==
>> nvmet_tcp_listen_data_ready?
>>
>> Are you talking about a situation where, between
>> nvmet_tcp_listen_data_ready starting and taking the
>> sk->sk_callback_lock, the data_ready cb (and the user data
>> pointer) is changed?
> 
> No. Far simpler:
> Starting kTLS will change the callbacks.
> So sk->sk_data_ready will point to our callback before
> the upcall, but to the kTLS version _after_ the upcall.
> It typically doesn't matter, as we're setting it to
> nvmet_tcp_data_ready() anyway.

For ktls won't we set it to nvmet_tcp_data_ready only when
the handshake is done?

> But there might be the odd case where the data_ready
> callback is invoked after kTLS has been started but before
> we're setting it to nvmet_tcp_data_ready().

What does nvmet_tcp_data_ready have to do with it?
You are changing nvmet_tcp_listen_data_ready.

> Then we cannot guarantee that sk_user_data really is set
> to the 'queue' pointer, so we should skip the function.

If nvmet_tcp_listen_data_ready was invoked, then sk->sk_data_ready
is nvmet_tcp_listen_data_ready. Are you referring to the case
where the callback has changed before the read lock?

I would like to understand why svc_tcp_listen_data_ready doesn't
have this race as well. Chuck?


* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28  8:44                         ` Sagi Grimberg
@ 2023-03-28  9:20                           ` Hannes Reinecke
  2023-03-28  9:43                             ` Sagi Grimberg
  2023-03-28 13:22                           ` Chuck Lever III
  1 sibling, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-28  9:20 UTC (permalink / raw)
  To: Sagi Grimberg, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake

On 3/28/23 10:44, Sagi Grimberg wrote:
> 
>>>>
>>>> Nearly there.
>>>>
>>>> The actual patch would be:
>>>>
>>>> @@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
>>>>          trace_sk_data_ready(sk);
>>>>
>>>>          read_lock_bh(&sk->sk_callback_lock);
>>>> +       /* Ignore if the callback has been changed */
>>>> +       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
>>>> +               goto out;
>>>>          port = sk->sk_user_data;
>>>>          if (!port)
>>>>                  goto out;
>>>>
>>>> +       if (port->data_ready)
>>>> +               port->data_ready(sk);
>>>> +
>>>>          if (sk->sk_state == TCP_LISTEN)
>>>>                  queue_work(nvmet_wq, &port->accept_work);
>>>>   out:
>>>>
>>>> The callbacks will be changed once TLS is activated, and we
>>>> really should not attempt to run if sk_data_ready() points to
>>>> another function,
>>>> as then the sk_user_data pointer will most likely have been changed, too,
>>>> causing all sorts of issues.
>>>
>>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>>> called, doesn't it by definition mean that sk->sk_data_ready ==
>>> nvmet_tcp_listen_data_ready?
>>>
>>> Are you talking about a situation where, between
>>> nvmet_tcp_listen_data_ready starting and taking the
>>> sk->sk_callback_lock, the data_ready cb (and the user data
>>> pointer) is changed?
>>
>> No. Far simpler:
>> Starting kTLS will change the callbacks.
>> So sk->sk_data_ready will point to our callback before
>> the upcall, but to the kTLS version _after_ the upcall.
>> It typically doesn't matter, as we're setting it to
>> nvmet_tcp_data_ready() anyway.
> 
> For ktls won't we set it to nvmet_tcp_data_ready only when
> the handshake is done?
> 
Yes.

>> But there might be the odd case where the data_ready
>> callback is invoked after kTLS has been started but before
>> we're setting it to nvmet_tcp_data_ready().
> 
> What does nvmet_tcp_data_ready have to do with it?
> You are changing nvmet_tcp_listen_data_ready.
> 
This is not a 'race'.
>> Then we cannot guarantee that sk_user_data really is set
>> to the 'queue' pointer, so we should skip the function.
> 
> If nvmet_tcp_listen_data_ready was invoked, then sk->sk_data_ready
> is nvmet_tcp_listen_data_ready. Are you referring to the case
> where the callback has changed before the read lock?
> 
No. The callback is changed when the userspace daemon starts ktls, as
that will set sk->sk_data_ready to tls_data_ready().

So before the userspace upcall we have the chain:

sk_data_ready -> nvmet_tcp_listen_data_ready()

and after a (successful) upcall we have the chain

sk_data_ready -> tls_data_ready
   -> nvmet_tcp_listen_data_ready()

once we set our callback we have

sk_data_ready -> nvmet_tcp_data_ready
   -> tls_data_ready
     -> nvmet_tcp_listen_data_ready

Calling into nvmet_tcp_listen_data_ready() is pointless here,
but we cannot remove the callback either. So we should deactivate
it to avoid any side effects.

Cheers,

Hannes



* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28  9:20                           ` Hannes Reinecke
@ 2023-03-28  9:43                             ` Sagi Grimberg
  2023-03-28 10:04                               ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-28  9:43 UTC (permalink / raw)
  To: Hannes Reinecke, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake


>>> But there might be the odd case where the data_ready
>>> callback is invoked after kTLS has been started but before
>>> we're setting it to nvmet_tcp_data_ready().
>>
>> What does nvmet_tcp_data_ready have to do with it?
>> You are changing nvmet_tcp_listen_data_ready.
>>
> This is not a 'race'.
>>> Then we cannot guarantee that sk_user_data really is set
>>> to the 'queue' pointer, so we should skip the function.
>>
>> If nvmet_tcp_listen_data_ready was invoked, then sk->sk_data_ready
>> is nvmet_tcp_listen_data_ready. Are you referring to the case
>> where the callback has changed before the read lock?
>>
> No. The callback is changed when the userspace daemon starts ktls, as
> that will set sk->sk_data_ready to tls_data_ready().
> 
> So before the userspace upcall we have the chain:
> 
> sk_data_ready -> nvmet_tcp_listen_data_ready()
> 
> and after a (successful) upcall we have the chain
> 
> sk_data_ready -> tls_data_ready
>    -> nvmet_tcp_listen_data_ready()
> 
> once we set our callback we have
> 
> sk_data_ready -> nvmet_tcp_data_ready
>    -> tls_data_ready
>      -> nvmet_tcp_listen_data_ready
> 
> Calling into nvmet_tcp_listen_data_ready() is pointless here,
> but we cannot remove the callback either. So we should deactivate
> it to avoid any side effects.

Thanks for the explanation.

First, I would like to understand how this is different from
svc.

Second, we don't want nvmet_tcp_listen_data_ready to keep being
called from the datapath, especially as it takes a read
lock, for absolutely no good reason.

Why don't we restore the original socket callback before
we trigger the handshake? Then we wait for the handshake
to complete, and then store nvmet_tcp_data_ready once it
is done?
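
In code, the proposal would be roughly the following sketch (names
taken from the patch under review; a sketch, not a final
implementation):

	/* before the upcall: put the socket back into its
	 * inherited state */
	write_lock_bh(&sk->sk_callback_lock);
	sk->sk_user_data = NULL;
	sk->sk_data_ready = port->data_ready;	/* original callback */
	write_unlock_bh(&sk->sk_callback_lock);

	ret = nvmet_tcp_tls_handshake(queue);
	/* ... and only install nvmet_tcp_data_ready() from the
	 * handshake-done path, once the handshake has succeeded */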

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28  9:43                             ` Sagi Grimberg
@ 2023-03-28 10:04                               ` Hannes Reinecke
  0 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-28 10:04 UTC (permalink / raw)
  To: Sagi Grimberg, Chuck Lever III
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake

On 3/28/23 11:43, Sagi Grimberg wrote:
> 
>>>> But there might be the odd case where the data_ready
>>>> callback is invoked after kTLS has been started but before
>>>> we're setting it to nvmet_tcp_data_ready().
>>>
>>> What does nvmet_tcp_data_ready have to do with it?
>>> You are changing nvmet_tcp_listen_data_ready.
>>>
>> This is not a 'race'.
>>>> Then we cannot guarantee that sk_user_data really is set
>>>> to the 'queue' pointer, so we should skip the function.
>>>
>>> If nvme_tcp_listen_data_ready was invoked, then sk->sk_data_ready
>>> is nvme_tcp_listen_data_ready. Are you referring to the case
>>> where the callback has changed before the read lock?
>>>
>> No. The callback is changed when the userspace daemon starts ktls, as
>> that will set sk->sk_data_ready to tls_data_ready().
>>
>> So before the userspace upcall we have the chain:
>>
>> sk_data_ready -> nvmet_tcp_listen_data_ready()
>>
>> and after a (successful) upcall we have the chain
>>
>> sk_data_ready -> tls_data_ready
>>    -> nvmet_tcp_listen_data_ready()
>>
>> once we set our callback we have
>>
>> sk_data_ready -> nvmet_tcp_data_ready
>>    -> tls_data_ready
>>      -> nvmet_tcp_listen_data_ready
>>
>> Calling into nvmet_tcp_listen_data_ready() is pointless here,
>> but we cannot remove the callback either. So we should deactivate
>> it to avoid any side effects.
> 
> Thanks for the explanation.
> 
> First, I would like to understand how this is different from
> svc.
> 
> Second, we don't want nvmet_tcp_listen_data_ready to keep being
> called from the datapath, especially as it takes a read
> lock for absolutely no good reason.
> 
> Why don't we restore the original socket callback before
> we trigger the handshake? Then we wait for the handshake
> to complete, and store nvmet_tcp_data_ready once it
> is done?

Yep, that works.

Will be updating the patch.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28  8:44                         ` Sagi Grimberg
  2023-03-28  9:20                           ` Hannes Reinecke
@ 2023-03-28 13:22                           ` Chuck Lever III
  2023-03-28 15:29                             ` Sagi Grimberg
  1 sibling, 1 reply; 90+ messages in thread
From: Chuck Lever III @ 2023-03-28 13:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake



> On Mar 28, 2023, at 4:44 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>>> 
>>>> Nearly there.
>>>> 
>>>> The actual patch would be:
>>>> 
>>>> @@ -2031,10 +1988,16 @@ static void nvmet_tcp_listen_data_ready(struct sock *sk)
>>>>          trace_sk_data_ready(sk);
>>>> 
>>>>          read_lock_bh(&sk->sk_callback_lock);
>>>> +       /* Ignore if the callback has been changed */
>>>> +       if (sk->sk_data_ready != nvmet_tcp_listen_data_ready)
>>>> +               goto out;
>>>>          port = sk->sk_user_data;
>>>>          if (!port)
>>>>                  goto out;
>>>> 
>>>> +       if (port->data_ready)
>>>> +               port->data_ready(sk);
>>>> +
>>>>          if (sk->sk_state == TCP_LISTEN)
>>>>                  queue_work(nvmet_wq, &port->accept_work);
>>>>   out:
>>>> 
>>>> As the callbacks will be changed once TLS is activated, and we really should not attempt to run if sk_data_ready() points to another function,
>>>> as then the sk_user_data pointer will most likely be changed, too,
>>>> causing all sorts of issues.
>>> 
>>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>>> called, doesn't it by definition mean that sk->sk_data_ready ==
>>> nvmet_tcp_listen_data_ready?
>>> 
>>> Are you talking about a situation where, between
>>> nvmet_tcp_listen_data_ready starting and the
>>> sk->sk_callback_lock being taken, the data_ready cb (and the user
>>> data pointer) is changed?
>> No. Far simpler:
>> Starting kTLS will change the callbacks.
>> So sk->sk_data_ready will point to our callback before
>> the upcall, but to the kTLS version _after_ the upcall.
>> It typically doesn't matter, as we're setting it to
>> nvmet_tcp_data_ready() anyway.
> 
> For ktls won't we set it to nvmet_tcp_data_ready only when
> the handshake is done?
> 
>> But there might be the odd case where the data_ready
>> callback is invoked after kTLS has been started but before
>> we're setting it to nvmet_tcp_data_ready().
> 
> What does nvmet_tcp_data_ready have to do with it?
> You are changing nvmet_tcp_listen_data_ready.
> 
>> Then we cannot guarantee that sk_user_data really is set
>> to the 'queue' pointer, so we should skip the function.
> 
> If nvme_tcp_listen_data_ready was invoked, then sk->sk_data_ready
> is nvme_tcp_listen_data_ready. Are you referring to the case
> where the callback has changed before the read lock?
> 
> I would like to understand why svc_tcp_listen_data_ready doesn't
> have this race as well. Chuck?

svc doesn't alter the pointer address in the listener's
sk_data_ready field, and neither does kTLS.

For a connected socket, when the handshake has completed,
svc examines the socket for more work one last time before
deciding it's idle (a kind of simulated data_ready). That
should pick up work that arrived undetected.

Maybe I'm missing something?

--
Chuck Lever



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28 13:22                           ` Chuck Lever III
@ 2023-03-28 15:29                             ` Sagi Grimberg
  2023-03-28 15:56                               ` Chuck Lever III
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-28 15:29 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake


>>>>> As the callbacks will be changed once TLS is activated, and we really should not attempt to run if sk_data_ready() points to another function,
>>>>> as then the sk_user_data pointer will most likely be changed, too,
>>>>> causing all sorts of issues.
>>>>
>>>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>>>> called, doesn't it by definition mean that sk->sk_data_ready ==
>>>> nvmet_tcp_listen_data_ready?
>>>>
>>>> Are you talking about a situation where, between
>>>> nvmet_tcp_listen_data_ready starting and the
>>>> sk->sk_callback_lock being taken, the data_ready cb (and the user
>>>> data pointer) is changed?
>>> No. Far simpler:
>>> Starting kTLS will change the callbacks.
>>> So sk->sk_data_ready will point to our callback before
>>> the upcall, but to the kTLS version _after_ the upcall.
>>> It typically doesn't matter, as we're setting it to
>>> nvmet_tcp_data_ready() anyway.
>>
>> For ktls won't we set it to nvmet_tcp_data_ready only when
>> the handshake is done?
>>
>>> But there might be the odd case where the data_ready
>>> callback is invoked after kTLS has been started but before
>>> we're setting it to nvmet_tcp_data_ready().
>>
>> What does nvmet_tcp_data_ready have to do with it?
>> You are changing nvmet_tcp_listen_data_ready.
>>
>>> Then we cannot guarantee that sk_user_data really is set
>>> to the 'queue' pointer, so we should skip the function.
>>
>> If nvme_tcp_listen_data_ready was invoked, then sk->sk_data_ready
>> is nvme_tcp_listen_data_ready. Are you referring to the case
>> where the callback has changed before the read lock?
>>
>> I would like to understand why svc_tcp_listen_data_ready doesn't
>> have this race as well. Chuck?
> 
> svc doesn't alter the pointer address in the listener's
> sk_data_ready field, and neither does kTLS.

It is inherited from the parent socket sk. IIRC it is cloned in
inet_reqsk_clone.

> For a connected socket, when the handshake has completed,
> svc examines the socket for more work one last time before
> deciding it's idle (a kind of simulated data_ready). That
> should pick up work that arrived undetected.

The question is how a ktls socket never calls its inherited
sk_data_ready (svc_tcp_listen_data_ready), which according to Hannes'
explanation should be the case if this socket was created as its
offspring.

> Maybe I'm missing something?

I don't know yet.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28 15:29                             ` Sagi Grimberg
@ 2023-03-28 15:56                               ` Chuck Lever III
  2023-03-29  6:33                                 ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Chuck Lever III @ 2023-03-28 15:56 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake



> On Mar 28, 2023, at 11:29 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>>>>> As the callbacks will be changed once TLS is activated, and we really should not attempt to run if sk_data_ready() points to another function,
>>>>>> as then the sk_user_data pointer will most likely be changed, too,
>>>>>> causing all sorts of issues.
>>>>> 
>>>>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>>>>> called, doesn't it by definition mean that sk->sk_data_ready ==
>>>>> nvmet_tcp_listen_data_ready?
>>>>> 
>>>>> Are you talking about a situation where, between
>>>>> nvmet_tcp_listen_data_ready starting and the
>>>>> sk->sk_callback_lock being taken, the data_ready cb (and the user
>>>>> data pointer) is changed?
>>>> No. Far simpler:
>>>> Starting kTLS will change the callbacks.
>>>> So sk->sk_data_ready will point to our callback before
>>>> the upcall, but to the kTLS version _after_ the upcall.
>>>> It typically doesn't matter, as we're setting it to
>>>> nvmet_tcp_data_ready() anyway.
>>> 
>>> For ktls won't we set it to nvmet_tcp_data_ready only when
>>> the handshake is done?
>>> 
>>>> But there might be the odd case where the data_ready
>>>> callback is invoked after kTLS has been started but before
>>>> we're setting it to nvmet_tcp_data_ready().
>>> 
>>> What does nvmet_tcp_data_ready have to do with it?
>>> You are changing nvmet_tcp_listen_data_ready.
>>> 
>>>> Then we cannot guarantee that sk_user_data really is set
>>>> to the 'queue' pointer, so we should skip the function.
>>> 
>>> If nvme_tcp_listen_data_ready was invoked, then sk->sk_data_ready
>>> is nvme_tcp_listen_data_ready. Are you referring to the case
>>> where the callback has changed before the read lock?
>>> 
>>> I would like to understand why svc_tcp_listen_data_ready doesn't
>>> have this race as well. Chuck?
>> svc doesn't alter the pointer address in the listener's
>> sk_data_ready field, and neither does kTLS.
> 
> It is inherited from the parent socket sk. IIRC it is cloned in
> inet_reqsk_clone.

Ah. I meant the changes to support the use of kTLS sockets do
not alter this svc code.


>> For a connected socket, when the handshake has completed,
>> svc examines the socket for more work one last time before
>> deciding it's idle (a kind of simulated data_ready). That
>> should pick up work that arrived undetected.
> 
> The question is how a ktls socket never calls its inherited
> sk_data_ready (svc_tcp_listen_data_ready), which according to Hannes'
> explanation should be the case if this socket was created as its
> offspring.

Have a look at svc_tcp_accept(). It resets the inherited callbacks
before setting up the socket.

svc_tcp_init() then sets sk_data_ready to our data_ready function.

--
Chuck Lever



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-28 15:56                               ` Chuck Lever III
@ 2023-03-29  6:33                                 ` Sagi Grimberg
  0 siblings, 0 replies; 90+ messages in thread
From: Sagi Grimberg @ 2023-03-29  6:33 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Chuck Lever, kernel-tls-handshake


>>>>>>> As the callbacks will be changed once TLS is activated, and we really should not attempt to run if sk_data_ready() points to another function,
>>>>>>> as then the sk_user_data pointer will most likely be changed, too,
>>>>>>> causing all sorts of issues.
>>>>>>
>>>>>> Umm, something is unclear to me. If nvmet_tcp_listen_data_ready is
>>>>>> called, doesn't it by definition mean that sk->sk_data_ready ==
>>>>>> nvmet_tcp_listen_data_ready?
>>>>>>
>>>>>> Are you talking about a situation where, between
>>>>>> nvmet_tcp_listen_data_ready starting and the
>>>>>> sk->sk_callback_lock being taken, the data_ready cb (and the user
>>>>>> data pointer) is changed?
>>>>> No. Far simpler:
>>>>> Starting kTLS will change the callbacks.
>>>>> So sk->sk_data_ready will point to our callback before
>>>>> the upcall, but to the kTLS version _after_ the upcall.
>>>>> It typically doesn't matter, as we're setting it to
>>>>> nvmet_tcp_data_ready() anyway.
>>>>
>>>> For ktls won't we set it to nvmet_tcp_data_ready only when
>>>> the handshake is done?
>>>>
>>>>> But there might be the odd case where the data_ready
>>>>> callback is invoked after kTLS has been started but before
>>>>> we're setting it to nvmet_tcp_data_ready().
>>>>
>>>> What does nvmet_tcp_data_ready have to do with it?
>>>> You are changing nvmet_tcp_listen_data_ready.
>>>>
>>>>> Then we cannot guarantee that sk_user_data really is set
>>>>> to the 'queue' pointer, so we should skip the function.
>>>>
>>>> If nvme_tcp_listen_data_ready was invoked, then sk->sk_data_ready
>>>> is nvme_tcp_listen_data_ready. Are you referring to the case
>>>> where the callback has changed before the read lock?
>>>>
>>>> I would like to understand why svc_tcp_listen_data_ready doesn't
>>>> have this race as well. Chuck?
>>> svc doesn't alter the pointer address in the listener's
>>> sk_data_ready field, and neither does kTLS.
>>
>> It is inherited from the parent socket sk. IIRC it is cloned in
>> inet_reqsk_clone.
> 
> Ah. I meant the changes to support the use of kTLS sockets do
> not alter this svc code.
> 
> 
>>> For a connected socket, when the handshake has completed,
>>> svc examines the socket for more work one last time before
>>> deciding it's idle (a kind of simulated data_ready). That
>>> should pick up work that arrived undetected.
>>
>> The question is how a ktls socket never calls its inherited
>> sk_data_ready (svc_tcp_listen_data_ready), which according to Hannes'
>> explanation should be the case if this socket was created as its
>> offspring.
> 
> Have a look at svc_tcp_accept(). It resets the inherited callbacks
> before setting up the socket.
> 
> svc_tcp_init() then sets sk_data_ready to our data_ready function.

Yes, I see.
When starting a listener, it saves inet->sk_data_ready in
svsk->sk_odata, and sets sk->sk_data_ready to svc_tcp_listen_data_ready.
Then, when accepting, it restores it to newsock->sk->sk_data_ready;
and only when adding the new socket does it set sk_data_ready to
svc_data_ready.

This is what I agreed on with Hannes: nvmet restores the newsock's
inherited sk callbacks, and only installs its own after the handshake.
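
Schematically (a paraphrase of the svcsock flow described above, not a
verbatim copy of net/sunrpc/svcsock.c):

	/* listener setup: save the inherited callback, install ours */
	svsk->sk_odata = inet->sk_data_ready;
	inet->sk_data_ready = svc_tcp_listen_data_ready;

	/* accept: hand the inherited callback back to the child */
	newsock->sk->sk_data_ready = svsk->sk_odata;

	/* only once the new socket is fully set up (post-handshake
	 * for TLS) is the datapath callback installed */
	newsock->sk->sk_data_ready = svc_data_ready;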

Thanks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-04-03 12:51   ` Sagi Grimberg
@ 2023-04-03 14:05     ` Hannes Reinecke
  0 siblings, 0 replies; 90+ messages in thread
From: Hannes Reinecke @ 2023-04-03 14:05 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake

On 4/3/23 14:51, Sagi Grimberg wrote:
> 
>> Add functions to start the TLS handshake upcall when
>> the TCP RSAS sectype is set to 'tls1.3'.
> 
> TSAS
> 
Ok.

>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/target/configfs.c |  32 +++++++-
>>   drivers/nvme/target/tcp.c      | 135 ++++++++++++++++++++++++++++++++-
>>   2 files changed, 163 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/target/configfs.c 
>> b/drivers/nvme/target/configfs.c
>> index ca66ee6dc153..36fbf6a22d09 100644
>> --- a/drivers/nvme/target/configfs.c
>> +++ b/drivers/nvme/target/configfs.c
>> @@ -159,10 +159,12 @@ static const struct nvmet_type_name_map 
>> nvmet_addr_treq[] = {
>>       { NVMF_TREQ_NOT_REQUIRED,    "not required" },
>>   };
>> +#define NVMET_PORT_TREQ(port) ((port)->disc_addr.treq & 
>> NVME_TREQ_SECURE_CHANNEL_MASK)
> 
> Can you make it a static inline?
> 
Sure.
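
Something like this untested sketch (the name is illustrative only):

	static inline u8 nvmet_port_treq(struct nvmet_port *port)
	{
		return port->disc_addr.treq & NVME_TREQ_SECURE_CHANNEL_MASK;
	}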

>> +
>>   static ssize_t nvmet_addr_treq_show(struct config_item *item, char 
>> *page)
>>   {
>> -    u8 treq = to_nvmet_port(item)->disc_addr.treq &
>> -        NVME_TREQ_SECURE_CHANNEL_MASK;
>> +    struct nvmet_port *port = to_nvmet_port(item);
>> +    u8 treq = NVMET_PORT_TREQ(port);
>>       int i;
>>       for (i = 0; i < ARRAY_SIZE(nvmet_addr_treq); i++) {
>> @@ -193,6 +195,17 @@ static ssize_t nvmet_addr_treq_store(struct 
>> config_item *item,
>>       return -EINVAL;
>>   found:
>> +#ifdef CONFIG_NVME_TLS
>> +    if (port->disc_addr.trtype == NVMF_TRTYPE_TCP) {
>> +        if (port->disc_addr.tsas.tcp.sectype != 
>> NVMF_TCP_SECTYPE_TLS13) {
>> +            pr_warn("cannot change TREQ when TLS is not enabled\n");
>> +            return -EINVAL;
>> +        } else if (nvmet_addr_treq[i].type == NVMF_TREQ_NOT_SPECIFIED) {
>> +            pr_warn("cannot set TREQ to 'not specified' when TLS is 
>> enabled\n");
>> +            return -EINVAL;
>> +        }
>> +    }
> 
> Is this code wrong if CONFIG_NVME_TLS is not enabled?
> 
Strictly speaking, no; it just won't do anything except for having a
different value in the discovery log page.

>> +#endif
>>       treq |= nvmet_addr_treq[i].type;
>>       port->disc_addr.treq = treq;
>>       return count;
>> @@ -373,6 +386,7 @@ static ssize_t nvmet_addr_tsas_store(struct 
>> config_item *item,
>>           const char *page, size_t count)
>>   {
>>       struct nvmet_port *port = to_nvmet_port(item);
>> +    u8 treq = port->disc_addr.treq & ~NVME_TREQ_SECURE_CHANNEL_MASK;
>>       int i;
>>       if (nvmet_is_port_enabled(port, __func__))
>> @@ -391,6 +405,20 @@ static ssize_t nvmet_addr_tsas_store(struct 
>> config_item *item,
>>   found:
>>       nvmet_port_init_tsas_tcp(port, nvmet_addr_tsas_tcp[i].type);
>> +    if (nvmet_addr_tsas_tcp[i].type == NVMF_TCP_SECTYPE_TLS13) {
>> +#ifdef CONFIG_NVME_TLS
> 
> Maybe in the start of the function just do:
> 
>      if (!IS_ENABLED(CONFIG_NVME_TLS)) {
>          pr_err("TLS not supported\n");
>          return -EINVAL;
>      }
> 
> Instead of incorporating it here.
> 
Ok.

>> +        if (NVMET_PORT_TREQ(port) == NVMF_TREQ_NOT_SPECIFIED)
>> +            treq |= NVMF_TREQ_REQUIRED;
>> +        else
>> +            treq |= NVMET_PORT_TREQ(port);
>> +#else
>> +        pr_err("TLS not supported\n");
>> +        return -EINVAL;
>> +#endif
>> +    } else {
>> +        /* Set to 'not specified' if TLS is not enabled */
>> +        treq |= NVMF_TREQ_NOT_SPECIFIED;
>> +    }
>>       return count;
>>   }
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 5931971d715f..ebec882120fd 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -11,6 +11,10 @@
>>   #include <linux/nvme-tcp.h>
>>   #include <net/sock.h>
>>   #include <net/tcp.h>
>> +#ifdef CONFIG_NVME_TLS
>> +#include <net/handshake.h>
> 
> Is net/handshake.h under an ifdef? If so, CONFIG_NVME_TLS should
> select it.
> 
>> +#include <linux/nvme-keyring.h>
> 
> Will this include not work if CONFIG_NVME_TLS is not enabled?
> <linux/nvme-auth.h> is not under CONFIG_NVME_AUTH for example.
> 
Hmm. It should. I can remove the ifdefs.
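
For illustration, the Kconfig entry might then look roughly like this;
the select list (NET_HANDSHAKE, KEYS) is an assumption, not taken from
this series:

	config NVME_TLS
		bool "TLS support for NVMe over TCP"
		depends on NVME_TCP
		select NET_HANDSHAKE
		select KEYS
		help
		  Enable TLS encryption via the kernel TLS handshake
		  upcall for NVMe over TCP.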

>> +#endif
>>   #include <linux/inet.h>
>>   #include <linux/llist.h>
>>   #include <crypto/hash.h>
>> @@ -40,6 +44,16 @@ module_param(idle_poll_period_usecs, int, 0644);
>>   MODULE_PARM_DESC(idle_poll_period_usecs,
>>           "nvmet tcp io_work poll till idle time period in usecs");
>> +#ifdef CONFIG_NVME_TLS
>> +/*
>> + * TLS handshake timeout
>> + */
>> +static int tls_handshake_timeout = 30;
>> +module_param(tls_handshake_timeout, int, 0644);
>> +MODULE_PARM_DESC(tls_handshake_timeout,
>> +         "nvme TLS handshake timeout in seconds (default 30)");
>> +#endif
>> +
>>   #define NVMET_TCP_RECV_BUDGET        8
>>   #define NVMET_TCP_SEND_BUDGET        8
>>   #define NVMET_TCP_IO_WORK_BUDGET    64
>> @@ -130,6 +144,10 @@ struct nvmet_tcp_queue {
>>       bool            data_digest;
>>       struct ahash_request    *snd_hash;
>>       struct ahash_request    *rcv_hash;
>> +#ifdef CONFIG_NVME_TLS
>> +    struct key        *tls_psk;
>> +    struct delayed_work    tls_handshake_work;
>> +#endif
> 
> If these won't be under CONFIG_NVME_TLS will it save a lot of the other
> ifdefs in the code?
> 
Wasn't sure if we always want to have it.
But if we do, sure, things will be easier.

>>       unsigned long           poll_end;
>> @@ -1474,6 +1492,10 @@ static void nvmet_tcp_release_queue_work(struct 
>> work_struct *w)
>>       nvmet_tcp_free_cmds(queue);
>>       if (queue->hdr_digest || queue->data_digest)
>>           nvmet_tcp_free_crypto(queue);
>> +#ifdef CONFIG_NVME_TLS
>> +    if (queue->tls_psk)
>> +        key_put(queue->tls_psk);
> 
> key_put is NULL safe.
> 
>> +#endif
>>       ida_free(&nvmet_tcp_queue_ida, queue->idx);
>>       page = virt_to_head_page(queue->pf_cache.va);
>>       __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
>> @@ -1488,8 +1510,12 @@ static void nvmet_tcp_data_ready(struct sock *sk)
>>       read_lock_bh(&sk->sk_callback_lock);
>>       queue = sk->sk_user_data;
>> -    if (likely(queue))
>> -        queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>> +    if (queue->data_ready)
>> +        queue->data_ready(sk);
>> +    if (likely(queue) &&
>> +        queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
>> +        queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>> +                  &queue->io_work);
>>       read_unlock_bh(&sk->sk_callback_lock);
>>   }
>> @@ -1597,6 +1623,89 @@ static int nvmet_tcp_set_queue_sock(struct 
>> nvmet_tcp_queue *queue)
>>       return ret;
>>   }
>> +#ifdef CONFIG_NVME_TLS
>> +static void nvmet_tcp_tls_queue_restart(struct nvmet_tcp_queue *queue)
>> +{
>> +    spin_lock(&queue->state_lock);
>> +    if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
>> +        pr_warn("queue %d: TLS handshake already completed\n",
>> +            queue->idx);
>> +        spin_unlock(&queue->state_lock);
>> +        return;
>> +    }
>> +    queue->state = NVMET_TCP_Q_CONNECTING;
>> +    spin_unlock(&queue->state_lock);
>> +
>> +    pr_debug("queue %d: restarting queue after TLS handshake\n",
>> +         queue->idx);
>> +    /*
>> +     * Set callbacks after handshake; TLS implementation
>> +     * might have changed the socket callbacks.
>> +     */
>> +    nvmet_tcp_set_queue_sock(queue);
> 
> Maybe fold it into the caller? The name is confusing anyways.
> The queue is not restarted, it is post-configured for lack of
> a better term.
> 
I'll see what I can do.

>> +}
>> +
>> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
>> +                     key_serial_t peerid)
> 
>                      pskid.
> 
>> +{
>> +    struct nvmet_tcp_queue *queue = data;
>> +
>> +    pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
>> +         queue->idx, peerid, status);
>> +    if (!status) {
>> +        spin_lock(&queue->state_lock);
>> +        queue->tls_psk = key_lookup(peerid);
>> +        if (IS_ERR(queue->tls_psk)) {
>> +            pr_warn("queue %d: TLS key %x not found\n",
>> +                queue->idx, peerid);
>> +            queue->tls_psk = NULL;
> 
> Here you let the timeout take care of it later?

Well, this is a slightly odd case; we get a '0' status but failed
to look up the key.
_Technically_ we should be able to continue, but I wasn't sure if I should.

But in the light of the key rotation discussion I probably should; new 
keys (and keyrings) might be provided at any time, so I might hit that 
case here.

Will be updating the code.
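
Roughly this direction (an assumption about the next revision, not the
posted code): treat a failed lookup as non-fatal when the handshake
itself succeeded:

	/* status == 0: the handshake succeeded; a failed lookup just
	 * means the PSK was rotated out of the keyring meanwhile */
	queue->tls_psk = key_lookup(peerid);
	if (IS_ERR(queue->tls_psk)) {
		pr_warn("queue %d: TLS key %x not found\n",
			queue->idx, peerid);
		queue->tls_psk = NULL;	/* continue without caching it */
	}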

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Frankenstr. 146, 90461 Nürnberg
Managing Directors: I. Totev, A. Myers, A. McDonald, M. B. Moerman
(HRB 36809, AG Nürnberg)


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-29 13:59 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
@ 2023-04-03 12:51   ` Sagi Grimberg
  2023-04-03 14:05     ` Hannes Reinecke
  0 siblings, 1 reply; 90+ messages in thread
From: Sagi Grimberg @ 2023-04-03 12:51 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chuck Lever, kernel-tls-handshake


> Add functions to start the TLS handshake upcall when
> the TCP RSAS sectype is set to 'tls1.3'.

TSAS

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/target/configfs.c |  32 +++++++-
>   drivers/nvme/target/tcp.c      | 135 ++++++++++++++++++++++++++++++++-
>   2 files changed, 163 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index ca66ee6dc153..36fbf6a22d09 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -159,10 +159,12 @@ static const struct nvmet_type_name_map nvmet_addr_treq[] = {
>   	{ NVMF_TREQ_NOT_REQUIRED,	"not required" },
>   };
>   
> +#define NVMET_PORT_TREQ(port) ((port)->disc_addr.treq & NVME_TREQ_SECURE_CHANNEL_MASK)

Can you make it a static inline?

> +
>   static ssize_t nvmet_addr_treq_show(struct config_item *item, char *page)
>   {
> -	u8 treq = to_nvmet_port(item)->disc_addr.treq &
> -		NVME_TREQ_SECURE_CHANNEL_MASK;
> +	struct nvmet_port *port = to_nvmet_port(item);
> +	u8 treq = NVMET_PORT_TREQ(port);
>   	int i;
>   
>   	for (i = 0; i < ARRAY_SIZE(nvmet_addr_treq); i++) {
> @@ -193,6 +195,17 @@ static ssize_t nvmet_addr_treq_store(struct config_item *item,
>   	return -EINVAL;
>   
>   found:
> +#ifdef CONFIG_NVME_TLS
> +	if (port->disc_addr.trtype == NVMF_TRTYPE_TCP) {
> +		if (port->disc_addr.tsas.tcp.sectype != NVMF_TCP_SECTYPE_TLS13) {
> +			pr_warn("cannot change TREQ when TLS is not enabled\n");
> +			return -EINVAL;
> +		} else if (nvmet_addr_treq[i].type == NVMF_TREQ_NOT_SPECIFIED) {
> +			pr_warn("cannot set TREQ to 'not specified' when TLS is enabled\n");
> +			return -EINVAL;
> +		}
> +	}

Is this code wrong if CONFIG_NVME_TLS is not enabled?

> +#endif
>   	treq |= nvmet_addr_treq[i].type;
>   	port->disc_addr.treq = treq;
>   	return count;
> @@ -373,6 +386,7 @@ static ssize_t nvmet_addr_tsas_store(struct config_item *item,
>   		const char *page, size_t count)
>   {
>   	struct nvmet_port *port = to_nvmet_port(item);
> +	u8 treq = port->disc_addr.treq & ~NVME_TREQ_SECURE_CHANNEL_MASK;
>   	int i;
>   
>   	if (nvmet_is_port_enabled(port, __func__))
> @@ -391,6 +405,20 @@ static ssize_t nvmet_addr_tsas_store(struct config_item *item,
>   
>   found:
>   	nvmet_port_init_tsas_tcp(port, nvmet_addr_tsas_tcp[i].type);
> +	if (nvmet_addr_tsas_tcp[i].type == NVMF_TCP_SECTYPE_TLS13) {
> +#ifdef CONFIG_NVME_TLS

Maybe in the start of the function just do:

	if (!IS_ENABLED(CONFIG_NVME_TLS)) {
		pr_err("TLS not supported\n");
		return -EINVAL;
	}

Instead of incorporating it here.

> +		if (NVMET_PORT_TREQ(port) == NVMF_TREQ_NOT_SPECIFIED)
> +			treq |= NVMF_TREQ_REQUIRED;
> +		else
> +			treq |= NVMET_PORT_TREQ(port);
> +#else
> +		pr_err("TLS not supported\n");
> +		return -EINVAL;
> +#endif
> +	} else {
> +		/* Set to 'not specified' if TLS is not enabled */
> +		treq |= NVMF_TREQ_NOT_SPECIFIED;
> +	}
>   	return count;
>   }
>   
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 5931971d715f..ebec882120fd 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -11,6 +11,10 @@
>   #include <linux/nvme-tcp.h>
>   #include <net/sock.h>
>   #include <net/tcp.h>
> +#ifdef CONFIG_NVME_TLS
> +#include <net/handshake.h>

Is net/handshake.h under an ifdef? If so, CONFIG_NVME_TLS should
select it.

> +#include <linux/nvme-keyring.h>

Will this include not work if CONFIG_NVME_TLS is not enabled?
<linux/nvme-auth.h> is not under CONFIG_NVME_AUTH for example.

> +#endif
>   #include <linux/inet.h>
>   #include <linux/llist.h>
>   #include <crypto/hash.h>
> @@ -40,6 +44,16 @@ module_param(idle_poll_period_usecs, int, 0644);
>   MODULE_PARM_DESC(idle_poll_period_usecs,
>   		"nvmet tcp io_work poll till idle time period in usecs");
>   
> +#ifdef CONFIG_NVME_TLS
> +/*
> + * TLS handshake timeout
> + */
> +static int tls_handshake_timeout = 30;
> +module_param(tls_handshake_timeout, int, 0644);
> +MODULE_PARM_DESC(tls_handshake_timeout,
> +		 "nvme TLS handshake timeout in seconds (default 30)");
> +#endif
> +
>   #define NVMET_TCP_RECV_BUDGET		8
>   #define NVMET_TCP_SEND_BUDGET		8
>   #define NVMET_TCP_IO_WORK_BUDGET	64
> @@ -130,6 +144,10 @@ struct nvmet_tcp_queue {
>   	bool			data_digest;
>   	struct ahash_request	*snd_hash;
>   	struct ahash_request	*rcv_hash;
> +#ifdef CONFIG_NVME_TLS
> +	struct key		*tls_psk;
> +	struct delayed_work	tls_handshake_work;
> +#endif

If these won't be under CONFIG_NVME_TLS will it save a lot of the other
ifdefs in the code?

>   
>   	unsigned long           poll_end;
>   
> @@ -1474,6 +1492,10 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
>   	nvmet_tcp_free_cmds(queue);
>   	if (queue->hdr_digest || queue->data_digest)
>   		nvmet_tcp_free_crypto(queue);
> +#ifdef CONFIG_NVME_TLS
> +	if (queue->tls_psk)
> +		key_put(queue->tls_psk);

key_put is NULL safe.

> +#endif
>   	ida_free(&nvmet_tcp_queue_ida, queue->idx);
>   	page = virt_to_head_page(queue->pf_cache.va);
>   	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
> @@ -1488,8 +1510,12 @@ static void nvmet_tcp_data_ready(struct sock *sk)
>   
>   	read_lock_bh(&sk->sk_callback_lock);
>   	queue = sk->sk_user_data;
> -	if (likely(queue))
> -		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
> +	if (queue->data_ready)
> +		queue->data_ready(sk);
> +	if (likely(queue) &&
> +	    queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
> +		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
> +			      &queue->io_work);
>   	read_unlock_bh(&sk->sk_callback_lock);
>   }
>   
> @@ -1597,6 +1623,89 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
>   	return ret;
>   }
>   
> +#ifdef CONFIG_NVME_TLS
> +static void nvmet_tcp_tls_queue_restart(struct nvmet_tcp_queue *queue)
> +{
> +	spin_lock(&queue->state_lock);
> +	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
> +		pr_warn("queue %d: TLS handshake already completed\n",
> +			queue->idx);
> +		spin_unlock(&queue->state_lock);
> +		return;
> +	}
> +	queue->state = NVMET_TCP_Q_CONNECTING;
> +	spin_unlock(&queue->state_lock);
> +
> +	pr_debug("queue %d: restarting queue after TLS handshake\n",
> +		 queue->idx);
> +	/*
> +	 * Set callbacks after handshake; TLS implementation
> +	 * might have changed the socket callbacks.
> +	 */
> +	nvmet_tcp_set_queue_sock(queue);

Maybe fold it into the caller? The name is confusing anyways.
The queue is not restarted, it is post-configured for lack of
a better term.

> +}
> +
> +static void nvmet_tcp_tls_handshake_done(void *data, int status,
> +					 key_serial_t peerid)

					pskid.

> +{
> +	struct nvmet_tcp_queue *queue = data;
> +
> +	pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
> +		 queue->idx, peerid, status);
> +	if (!status) {
> +		spin_lock(&queue->state_lock);
> +		queue->tls_psk = key_lookup(peerid);
> +		if (IS_ERR(queue->tls_psk)) {
> +			pr_warn("queue %d: TLS key %x not found\n",
> +				queue->idx, peerid);
> +			queue->tls_psk = NULL;

Here you let the timeout take care of it later?

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall
  2023-03-29 13:59 [PATCHv2 " Hannes Reinecke
@ 2023-03-29 13:59 ` Hannes Reinecke
  2023-04-03 12:51   ` Sagi Grimberg
  0 siblings, 1 reply; 90+ messages in thread
From: Hannes Reinecke @ 2023-03-29 13:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chuck Lever,
	kernel-tls-handshake, Hannes Reinecke

Add functions to start the TLS handshake upcall when
the TCP RSAS sectype is set to 'tls1.3'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/target/configfs.c |  32 +++++++-
 drivers/nvme/target/tcp.c      | 135 ++++++++++++++++++++++++++++++++-
 2 files changed, 163 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index ca66ee6dc153..36fbf6a22d09 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -159,10 +159,12 @@ static const struct nvmet_type_name_map nvmet_addr_treq[] = {
 	{ NVMF_TREQ_NOT_REQUIRED,	"not required" },
 };
 
+#define NVMET_PORT_TREQ(port) ((port)->disc_addr.treq & NVME_TREQ_SECURE_CHANNEL_MASK)
+
 static ssize_t nvmet_addr_treq_show(struct config_item *item, char *page)
 {
-	u8 treq = to_nvmet_port(item)->disc_addr.treq &
-		NVME_TREQ_SECURE_CHANNEL_MASK;
+	struct nvmet_port *port = to_nvmet_port(item);
+	u8 treq = NVMET_PORT_TREQ(port);
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(nvmet_addr_treq); i++) {
@@ -193,6 +195,17 @@ static ssize_t nvmet_addr_treq_store(struct config_item *item,
 	return -EINVAL;
 
 found:
+#ifdef CONFIG_NVME_TLS
+	if (port->disc_addr.trtype == NVMF_TRTYPE_TCP) {
+		if (port->disc_addr.tsas.tcp.sectype != NVMF_TCP_SECTYPE_TLS13) {
+			pr_warn("cannot change TREQ when TLS is not enabled\n");
+			return -EINVAL;
+		} else if (nvmet_addr_treq[i].type == NVMF_TREQ_NOT_SPECIFIED) {
+			pr_warn("cannot set TREQ to 'not specified' when TLS is enabled\n");
+			return -EINVAL;
+		}
+	}
+#endif
 	treq |= nvmet_addr_treq[i].type;
 	port->disc_addr.treq = treq;
 	return count;
@@ -373,6 +386,7 @@ static ssize_t nvmet_addr_tsas_store(struct config_item *item,
 		const char *page, size_t count)
 {
 	struct nvmet_port *port = to_nvmet_port(item);
+	u8 treq = port->disc_addr.treq & ~NVME_TREQ_SECURE_CHANNEL_MASK;
 	int i;
 
 	if (nvmet_is_port_enabled(port, __func__))
@@ -391,6 +405,20 @@ static ssize_t nvmet_addr_tsas_store(struct config_item *item,
 
 found:
 	nvmet_port_init_tsas_tcp(port, nvmet_addr_tsas_tcp[i].type);
+	if (nvmet_addr_tsas_tcp[i].type == NVMF_TCP_SECTYPE_TLS13) {
+#ifdef CONFIG_NVME_TLS
+		if (NVMET_PORT_TREQ(port) == NVMF_TREQ_NOT_SPECIFIED)
+			treq |= NVMF_TREQ_REQUIRED;
+		else
+			treq |= NVMET_PORT_TREQ(port);
+#else
+		pr_err("TLS not supported\n");
+		return -EINVAL;
+#endif
+	} else {
+		/* Set to 'not specified' if TLS is not enabled */
+		treq |= NVMF_TREQ_NOT_SPECIFIED;
+	}
 	return count;
 }
 
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 5931971d715f..ebec882120fd 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -11,6 +11,10 @@
 #include <linux/nvme-tcp.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#ifdef CONFIG_NVME_TLS
+#include <net/handshake.h>
+#include <linux/nvme-keyring.h>
+#endif
 #include <linux/inet.h>
 #include <linux/llist.h>
 #include <crypto/hash.h>
@@ -40,6 +44,16 @@ module_param(idle_poll_period_usecs, int, 0644);
 MODULE_PARM_DESC(idle_poll_period_usecs,
 		"nvmet tcp io_work poll till idle time period in usecs");
 
+#ifdef CONFIG_NVME_TLS
+/*
+ * TLS handshake timeout
+ */
+static int tls_handshake_timeout = 30;
+module_param(tls_handshake_timeout, int, 0644);
+MODULE_PARM_DESC(tls_handshake_timeout,
+		 "nvme TLS handshake timeout in seconds (default 30)");
+#endif
+
 #define NVMET_TCP_RECV_BUDGET		8
 #define NVMET_TCP_SEND_BUDGET		8
 #define NVMET_TCP_IO_WORK_BUDGET	64
@@ -130,6 +144,10 @@ struct nvmet_tcp_queue {
 	bool			data_digest;
 	struct ahash_request	*snd_hash;
 	struct ahash_request	*rcv_hash;
+#ifdef CONFIG_NVME_TLS
+	struct key		*tls_psk;
+	struct delayed_work	tls_handshake_work;
+#endif
 
 	unsigned long           poll_end;
 
@@ -1474,6 +1492,10 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	nvmet_tcp_free_cmds(queue);
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
+#ifdef CONFIG_NVME_TLS
+	if (queue->tls_psk)
+		key_put(queue->tls_psk);
+#endif
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
 	page = virt_to_head_page(queue->pf_cache.va);
 	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
@@ -1488,8 +1510,12 @@ static void nvmet_tcp_data_ready(struct sock *sk)
 
 	read_lock_bh(&sk->sk_callback_lock);
 	queue = sk->sk_user_data;
-	if (likely(queue))
-		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
+	if (queue->data_ready)
+		queue->data_ready(sk);
+	if (likely(queue) &&
+	    queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
+		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
+			      &queue->io_work);
 	read_unlock_bh(&sk->sk_callback_lock);
 }
 
@@ -1597,6 +1623,89 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 	return ret;
 }
 
+#ifdef CONFIG_NVME_TLS
+static void nvmet_tcp_tls_queue_restart(struct nvmet_tcp_queue *queue)
+{
+	spin_lock(&queue->state_lock);
+	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
+		pr_warn("queue %d: TLS handshake already completed\n",
+			queue->idx);
+		spin_unlock(&queue->state_lock);
+		return;
+	}
+	queue->state = NVMET_TCP_Q_CONNECTING;
+	spin_unlock(&queue->state_lock);
+
+	pr_debug("queue %d: restarting queue after TLS handshake\n",
+		 queue->idx);
+	/*
+	 * Set callbacks after handshake; TLS implementation
+	 * might have changed the socket callbacks.
+	 */
+	nvmet_tcp_set_queue_sock(queue);
+}
+
+static void nvmet_tcp_tls_handshake_done(void *data, int status,
+					 key_serial_t peerid)
+{
+	struct nvmet_tcp_queue *queue = data;
+
+	pr_debug("queue %d: TLS handshake done, key %x, status %d\n",
+		 queue->idx, peerid, status);
+	if (!status) {
+		spin_lock(&queue->state_lock);
+		queue->tls_psk = key_lookup(peerid);
+		if (IS_ERR(queue->tls_psk)) {
+			pr_warn("queue %d: TLS key %x not found\n",
+				queue->idx, peerid);
+			queue->tls_psk = NULL;
+		}
+		spin_unlock(&queue->state_lock);
+	}
+	cancel_delayed_work_sync(&queue->tls_handshake_work);
+	if (status)
+		nvmet_tcp_schedule_release_queue(queue);
+	else
+		nvmet_tcp_tls_queue_restart(queue);
+}
+
+static void nvmet_tcp_tls_handshake_timeout_work(struct work_struct *w)
+{
+	struct nvmet_tcp_queue *queue = container_of(to_delayed_work(w),
+			struct nvmet_tcp_queue, tls_handshake_work);
+
+	pr_debug("queue %d: TLS handshake timeout\n", queue->idx);
+	nvmet_tcp_schedule_release_queue(queue);
+}
+
+static int nvmet_tcp_tls_handshake(struct nvmet_tcp_queue *queue)
+{
+	int ret = -EOPNOTSUPP;
+	struct tls_handshake_args args;
+
+	if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE) {
+		pr_warn("cannot start TLS in state %d\n", queue->state);
+		return -EINVAL;
+	}
+
+	pr_debug("queue %d: TLS ServerHello\n", queue->idx);
+	args.ta_sock = queue->sock;
+	args.ta_done = nvmet_tcp_tls_handshake_done;
+	args.ta_data = queue;
+	args.ta_keyring = nvme_keyring_id();
+	args.ta_timeout_ms = tls_handshake_timeout * 2 * 1024;
+
+	ret = tls_server_hello_psk(&args, GFP_KERNEL);
+	if (ret) {
+		pr_err("failed to start TLS, err=%d\n", ret);
+	} else {
+		queue_delayed_work(nvmet_wq, &queue->tls_handshake_work,
+				   tls_handshake_timeout * HZ);
+	}
+	return ret;
+}
+#endif
+
 static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 		struct socket *newsock)
 {
@@ -1609,6 +1718,10 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 
 	INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
 	INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
+#ifdef CONFIG_NVME_TLS
+	INIT_DELAYED_WORK(&queue->tls_handshake_work,
+			  nvmet_tcp_tls_handshake_timeout_work);
+#endif
 	queue->sock = newsock;
 	queue->port = port;
 	queue->nr_cmds = 0;
@@ -1622,6 +1735,7 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	init_llist_head(&queue->resp_list);
 	INIT_LIST_HEAD(&queue->resp_send_list);
 
+#ifdef CONFIG_NVME_TLS
 	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
 		queue->sock_file = sock_alloc_file(queue->sock, O_CLOEXEC, NULL);
 		if (IS_ERR(queue->sock_file)) {
@@ -1630,6 +1744,7 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 			goto out_free_queue;
 		}
 	}
+#endif
 
 	queue->idx = ida_alloc(&nvmet_tcp_queue_ida, GFP_KERNEL);
 	if (queue->idx < 0) {
@@ -1651,6 +1766,22 @@ static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 	list_add_tail(&queue->queue_list, &nvmet_tcp_queue_list);
 	mutex_unlock(&nvmet_tcp_queue_mutex);
 
+#ifdef CONFIG_NVME_TLS
+	if (queue->state == NVMET_TCP_Q_TLS_HANDSHAKE) {
+		struct sock *sk = queue->sock->sk;
+
+		/* Restore the default callbacks before starting upcall */
+		read_lock_bh(&sk->sk_callback_lock);
+		sk->sk_user_data = NULL;
+		sk->sk_data_ready = port->data_ready;
+		read_unlock_bh(&sk->sk_callback_lock);
+		if (!nvmet_tcp_tls_handshake(queue))
+			return;
+
+		/* TLS handshake failed, terminate the connection */
+		goto out_destroy_sq;
+	}
+#endif
 	ret = nvmet_tcp_set_queue_sock(queue);
 	if (ret)
 		goto out_destroy_sq;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2023-04-03 14:05 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-21 12:43 [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Hannes Reinecke
2023-03-21 12:43 ` [PATCH 01/18] nvme-keyring: register '.nvme' keyring Hannes Reinecke
2023-03-21 13:50   ` Sagi Grimberg
2023-03-21 14:11     ` Hannes Reinecke
2023-03-21 12:43 ` [PATCH 02/18] nvme-keyring: define a 'psk' keytype Hannes Reinecke
2023-03-22  8:29   ` Sagi Grimberg
2023-03-22  8:38     ` Hannes Reinecke
2023-03-22  8:49       ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 03/18] nvme: add TCP TSAS definitions Hannes Reinecke
2023-03-21 13:46   ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 04/18] nvme-tcp: add definitions for TLS cipher suites Hannes Reinecke
2023-03-22  8:18   ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 05/18] nvme-tcp: implement recvmsg rx flow for TLS Hannes Reinecke
2023-03-21 13:39   ` Sagi Grimberg
2023-03-21 13:59     ` Hannes Reinecke
2023-03-22  8:01       ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 06/18] nvme-tcp: call 'queue->data_ready()' in nvme_tcp_data_ready() Hannes Reinecke
2023-03-21 13:44   ` Sagi Grimberg
2023-03-21 14:09     ` Hannes Reinecke
2023-03-22  0:18       ` Chris Leech
2023-03-22  6:59         ` Hannes Reinecke
2023-03-22  8:12           ` Sagi Grimberg
2023-03-22  8:08       ` Sagi Grimberg
2023-03-22  8:26         ` Hannes Reinecke
2023-03-22 10:13           ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 07/18] nvme/tcp: allocate socket file Hannes Reinecke
2023-03-21 13:52   ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 08/18] nvme-tcp: enable TLS handshake upcall Hannes Reinecke
2023-03-22  8:45   ` Sagi Grimberg
2023-03-22  9:12     ` Hannes Reinecke
2023-03-22 10:56       ` Sagi Grimberg
2023-03-22 12:54         ` Hannes Reinecke
2023-03-22 13:16           ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 09/18] nvme-tcp: add connect option 'tls' Hannes Reinecke
2023-03-22  9:24   ` Sagi Grimberg
2023-03-22  9:59     ` Hannes Reinecke
2023-03-22 10:09       ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 10/18] nvme-tcp: fixup send workflow for kTLS Hannes Reinecke
2023-03-22  9:31   ` Sagi Grimberg
2023-03-22 10:08     ` Hannes Reinecke
2023-03-22 11:18       ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 11/18] nvme-tcp: control message handling for recvmsg() Hannes Reinecke
2023-03-22 11:33   ` Sagi Grimberg
2023-03-22 11:48     ` Hannes Reinecke
2023-03-22 11:50       ` Sagi Grimberg
2023-03-22 12:17         ` Hannes Reinecke
2023-03-22 12:29           ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 12/18] nvmet: make TCP sectype settable via configfs Hannes Reinecke
2023-03-22 11:38   ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 13/18] nvmet-tcp: allocate socket file Hannes Reinecke
2023-03-22 11:46   ` Sagi Grimberg
2023-03-22 12:07     ` Hannes Reinecke
2023-03-21 12:43 ` [PATCH 14/18] security/keys: export key_lookup() Hannes Reinecke
2023-03-21 12:43 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
2023-03-22 12:13   ` Sagi Grimberg
2023-03-22 12:34     ` Hannes Reinecke
2023-03-22 12:51       ` Sagi Grimberg
2023-03-22 13:47         ` Hannes Reinecke
2023-03-22 15:42           ` Sagi Grimberg
2023-03-22 16:43             ` Hannes Reinecke
2023-03-22 16:49               ` Chuck Lever III
2023-03-23  7:21                 ` Sagi Grimberg
2023-03-24 11:29                   ` Hannes Reinecke
2023-03-26  7:18                     ` Sagi Grimberg
2023-03-27  6:20                       ` Hannes Reinecke
2023-03-28  8:44                         ` Sagi Grimberg
2023-03-28  9:20                           ` Hannes Reinecke
2023-03-28  9:43                             ` Sagi Grimberg
2023-03-28 10:04                               ` Hannes Reinecke
2023-03-28 13:22                           ` Chuck Lever III
2023-03-28 15:29                             ` Sagi Grimberg
2023-03-28 15:56                               ` Chuck Lever III
2023-03-29  6:33                                 ` Sagi Grimberg
2023-03-23  7:44               ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 16/18] nvmet-tcp: rework sendpage for kTLS Hannes Reinecke
2023-03-22 12:16   ` Sagi Grimberg
2023-03-21 12:43 ` [PATCH 17/18] nvmet-tcp: control messages for recvmsg() Hannes Reinecke
2023-03-21 12:43 ` [PATCH 18/18] nvmet-tcp: peek icreq before starting TLS Hannes Reinecke
2023-03-22 12:24   ` Sagi Grimberg
2023-03-22 12:38     ` Hannes Reinecke
2023-03-21 13:12 ` [RFC PATCH 00/18] nvme: In-kernel TLS support for TCP Sagi Grimberg
2023-03-21 13:30   ` Hannes Reinecke
2023-03-22  8:16     ` Sagi Grimberg
2023-03-22  8:28       ` Hannes Reinecke
2023-03-22 12:53         ` Sagi Grimberg
2023-03-22 15:10           ` Hannes Reinecke
2023-03-22 15:43             ` Sagi Grimberg
2023-03-29 13:59 [PATCHv2 " Hannes Reinecke
2023-03-29 13:59 ` [PATCH 15/18] nvmet-tcp: enable TLS handshake upcall Hannes Reinecke
2023-04-03 12:51   ` Sagi Grimberg
2023-04-03 14:05     ` Hannes Reinecke
